Module 3: Clean Your Data

Q: Which of the following terms are used to describe missing data? Select all that apply.

  • Zero
  • Blank 
  • NaN 
  • N/A 
Explanation: Blank usually marks fields in a dataset that are missing or empty. NaN (not a number) is common in computing and data analysis, where it represents undefined or unrepresentable values, including absent data. N/A indicates the absence of data, especially when a value cannot be determined or is irrelevant. Zero does not qualify because it is an actual recorded value.
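
As a rough illustration of the difference between zero and the missing-data terms, the sketch below reads a small, made-up CSV with pandas. The column names and values are assumptions for the example; with its default settings, pd.read_csv treats blank fields, NaN, and N/A as missing, while 0 stays a real value.

```python
import io

import pandas as pd

# Hypothetical CSV text: one real zero, one blank, one NaN, one N/A.
raw = "city,temp\nParis,0\nLima,\nOslo,NaN\nCairo,N/A\n"

df = pd.read_csv(io.StringIO(raw))

# isna() is True only where pandas treated the field as missing.
missing = df["temp"].isna()
print(missing.tolist())  # [False, True, True, True]
```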

Q: Stakeholders at a film studio hire a data analytics firm to provide insights about the best locations for film shoots. However, the film studio’s datasets contain missing data. Which of the following strategies can help the data analytics firm solve this problem? Select all that apply.

  • Use their best judgment to add in values themselves.
  • Create a NaN category. 
  • Add in the missing values by taking the average values from the existing data. 
  • Ask the film studio to fill in the missing values. 
Explanation: Creating a NaN category marks missing values as NaN (not a number) or another placeholder, which flags the gaps without altering the original data. Averaging the existing data points imputes a reasonable approximation for each missing value and keeps the dataset usable. Asking the film studio to supply the missing values helps ensure accuracy and completeness, particularly if the studio holds additional records that can fill the gaps.
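
Two of these strategies can be sketched in pandas. The dataframe, its column names, and its values below are hypothetical; the sketch assumes that labeling a gap and filling with the mean are both acceptable for the analysis at hand.

```python
import pandas as pd

# Hypothetical film-shoot data with a gap in each column.
shoots = pd.DataFrame({
    "location": ["Lisbon", None, "Seoul", "Austin"],
    "daily_cost": [1200.0, 900.0, None, 1500.0],
})

# Strategy: label missing categories explicitly instead of guessing.
shoots["location"] = shoots["location"].fillna("NaN")

# Strategy: fill a numeric gap with the mean of the observed values.
mean_cost = shoots["daily_cost"].mean()  # missing values are skipped
shoots["daily_cost"] = shoots["daily_cost"].fillna(mean_cost)
```

The mean is only one choice of fill value. If the column is skewed, the median may be a safer default.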

Q: A data professional writes the following code:

df.merge(df_zip, how='left',

    on=['date','center_point_geom'])

Which section of the code refers to the dataframe to be merged with df?

  • df_zip
  • how='left'
  • merge
  • center_point_geom
Explanation: df_zip is the dataframe being merged with df. It is passed as the first argument to df.merge().
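
To see the roles of df and df_zip, here is a minimal runnable version of the same call. The contents of both dataframes, including the strikes and zip_code columns, are invented for illustration.

```python
import pandas as pd

# Hypothetical left dataframe.
df = pd.DataFrame({
    "date": ["2024-01-01", "2024-01-02"],
    "center_point_geom": ["POINT(1 1)", "POINT(2 2)"],
    "strikes": [5, 3],
})

# Hypothetical dataframe to merge into df.
df_zip = pd.DataFrame({
    "date": ["2024-01-01"],
    "center_point_geom": ["POINT(1 1)"],
    "zip_code": ["10001"],
})

# how="left" keeps every row of df; rows without a match get NaN.
merged = df.merge(df_zip, how="left", on=["date", "center_point_geom"])
print(merged)
```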

Q: What pandas function is used to pull all of the missing values from a data frame?

  • pd.getnull()
  • pd.ofnull()
  • pd.findnull()
  • pd.isnull() 
Explanation: pd.isnull() returns a Boolean object with the same shape as its input: True where a value is missing and False where a value is present. It is commonly used in data cleaning and analysis to locate missing data so that it can be handled.
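
A short sketch of how that Boolean result is typically used, with a made-up dataframe:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing score.
df = pd.DataFrame({"name": ["a", "b", "c"], "score": [1.0, np.nan, 3.0]})

# Same shape as df, True where a value is missing.
mask = pd.isnull(df)

# Rows that contain at least one missing value.
rows_with_gaps = df[mask.any(axis=1)]

# Number of missing values in each column.
missing_per_column = mask.sum()
```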

Q: What type of outliers are values that are completely different from the overall data group and have no association with any other outliers?

  • Collective outliers
  • Global outliers 
  • Contextual outliers
  • Dissimilar outliers
Explanation: Global outliers are data points that differ considerably from all other data points in the dataset, regardless of local context or any relationship to other outliers. They are extreme values that do not follow the overall pattern of the data.
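
One common way to flag global outliers is the 1.5 × IQR rule. That rule is not part of the quiz material; it is used here only as an assumed example technique, and the values are made up.

```python
import pandas as pd

# Hypothetical measurements with one extreme value.
values = pd.Series([10, 12, 11, 13, 12, 95])

q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1

# Fences: anything outside them is treated as a global outlier.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = values[(values < lower) | (values > upper)]
print(outliers.tolist())  # [95]
```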

Q: A data professional works for a car insurance company. To gain insights about the popularity of electric vehicles, they study categorical data about cars. They add a 0 to their dataset to indicate if a car is gas-powered and a 1 if a car is electric. What does this scenario describe?

  • Applying a variable character
  • Changing a floating point
  • Using dummy variables 
  • Removing a data operator
Explanation: Dummy variables give categorical data a numerical representation. Assigning 0 to gas-powered cars and 1 to electric cars makes it possible to run numerical analysis on the categories. Dummy variables are especially helpful in statistical modeling, where models often require numerical inputs.
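
The scenario can be sketched in pandas as follows. The cars dataframe and the is_electric column name are assumptions for the example.

```python
import pandas as pd

# Hypothetical categorical data about cars.
cars = pd.DataFrame({"fuel": ["gas", "electric", "gas", "electric"]})

# 0 = gas-powered, 1 = electric, as in the scenario.
cars["is_electric"] = (cars["fuel"] == "electric").astype(int)

# get_dummies builds one 0/1 column per category.
dummies = pd.get_dummies(cars["fuel"], dtype=int)
```

The single is_electric column is enough for a two-category variable. pd.get_dummies() is more convenient when there are several categories.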

Q: What type of data visualization shows the concentration of values between two data points by illustrating their magnitude with two colors?

  • Heat map 
  • Treemap
  • Scatter plot
  • Density map
Explanation: A heat map uses color to show where values are concentrated. It commonly uses two colors, or a gradient between them, to indicate regions of high and low magnitude between two data points. Heat maps are often applied to spatial data in geographic information systems (GIS) and data analysis, for example to show population density, temperature variation, or concentrations of events.

Q: What does the pandas function pd.duplicated() return to indicate that a data value does not have a duplicate value within the same dataset?

  • True
  • Duplicate
  • Unique
  • False 
Explanation: False indicates that a data value is not a duplicate of another value within the same dataset. With the default settings, the first occurrence of a repeated value is also marked False, and only the later repeats are marked True. In pandas code, duplicated() is called as a method on a DataFrame or Series, for example df.duplicated().
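
The sketch below shows the True and False flags on a small, made-up dataframe. It calls duplicated() as a DataFrame method and also shows the keep parameter, which changes which occurrences are flagged.

```python
import pandas as pd

# Hypothetical data in which the second and third rows are identical.
df = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "city": ["Rome", "Oslo", "Oslo", "Lima"],
})

# Default keep="first": first occurrence is False, later repeats are True.
default_flags = df.duplicated()
print(default_flags.tolist())  # [False, False, True, False]

# keep=False marks every member of a duplicated group as True.
all_flags = df.duplicated(keep=False)
print(all_flags.tolist())  # [False, True, True, False]

# Quick check for whether any duplicates exist at all.
has_duplicates = default_flags.any()
```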

Q: Fill in the blank: The pandas function _____ enables data professionals to create a new dataframe with all duplicate rows removed.

  • drop_duplicates() 
  • deduplicate()
  • de_duplication()
  • deduplication()
Explanation: drop_duplicates() removes duplicate rows based on all columns by default, or on a chosen subset of columns. It returns a new dataframe that contains only unique rows.
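
A minimal sketch of both modes, using an invented orders dataframe:

```python
import pandas as pd

# Hypothetical orders; the first two rows are exact repeats.
orders = pd.DataFrame({
    "customer": ["ana", "ana", "ben", "ben"],
    "item": ["pen", "pen", "ink", "pad"],
})

# Compare all columns: only the exact repeat of ("ana", "pen") is dropped.
unique_rows = orders.drop_duplicates()

# Compare only "customer": keep the first row seen for each customer.
one_per_customer = orders.drop_duplicates(subset=["customer"])
```

Both calls return a new dataframe and leave orders unchanged.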

Q: Which of the following terms can be used to describe a value that is not stored for a variable in a set of data? Select all that apply.

  • Zero
  • N/A 
  • NaN 
  • Blank 
Explanation: N/A applies when a value is not applicable or not relevant to the situation. NaN indicates numerical values that are absent or undefined. Blank marks a field that was left empty.

Q: A data professional writes the following code:

df.merge(df_zip, how='left',

    on=['date','center_point_geom'])

Which of the following is a parameter for the merge?

  • df_joined
  • how='left'
  • df.merge()
  • df.head()

Q: What tasks could the pandas function pd.isnull() be used for? Select all that apply.

  • To delete all of the values from a data frame
  • To change all values to nulls in a data frame
  • To identify when a value is missing from a data frame 
  • To pull all of the missing values from a data frame 
Explanation: pd.isnull() returns a Boolean DataFrame that is True for missing values and False otherwise. That result can also be used to retrieve the rows or columns that contain missing values.

Q: Fill in the blank: Contextual outliers are normal data points under certain conditions but become _____ under most other conditions.

  • Insignificant
  • Samples
  • Anomalies 
  • Standard
Explanation: Contextual outliers are data points that are regarded as typical or expected within a particular context or segment of the data, but that look like anomalies when viewed in a larger or different context.
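
As an illustration, a reading of 30 degrees is ordinary in summer but anomalous in winter. The sketch below compares each reading with the median for its own season. The data and the 10-degree threshold are hypothetical choices for the example, not a standard rule.

```python
import pandas as pd

# Hypothetical temperature readings in degrees Celsius.
readings = pd.DataFrame({
    "season": ["winter"] * 4 + ["summer"] * 4,
    "temp_c": [2, 4, 3, 30, 28, 31, 30, 29],
})

# Typical value for each season, aligned back to every row.
seasonal_median = readings.groupby("season")["temp_c"].transform("median")

# Hypothetical rule: more than 10 degrees from the seasonal median.
deviation = (readings["temp_c"] - seasonal_median).abs()
readings["contextual_outlier"] = deviation > 10

flagged = readings[readings["contextual_outlier"]]
```

Only the winter reading of 30 is flagged, even though the same value appears in summer without being flagged.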

Q: A data professional works for a veterinary office. To gain insights about the most common household pets, they study categorical data about pet adoptions over the past five years. They assign the number 1 to dogs, 2 to cats, 3 to hamsters, and so on. What does this scenario describe?

  • Data blending
  • Label encoding 
  • Data partitioning
  • Aliasing
Explanation: Label encoding is a data preparation technique that converts categorical variables into numerical form by assigning a distinct integer to each category. Assigning 1 to dogs, 2 to cats, 3 to hamsters, and so on encodes the pet categories with numerical labels.
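
The numbering in the scenario can be reproduced with an explicit mapping. The adoptions dataframe and the pet_code column name are assumptions for the example.

```python
import pandas as pd

# Hypothetical adoption records.
adoptions = pd.DataFrame({"pet": ["dog", "cat", "hamster", "dog"]})

# Explicit mapping that mirrors the scenario's numbering.
codes = {"dog": 1, "cat": 2, "hamster": 3}
adoptions["pet_code"] = adoptions["pet"].map(codes)
print(adoptions["pet_code"].tolist())  # [1, 2, 3, 1]
```

An explicit dictionary keeps the codes predictable. Automatic encoders choose their own numbering, which may not match a scheme like the one in the question.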

Q: Fill in the blank: A _____ is a data visualization that displays the magnitude of a set of values using two colors to show the concentration of the values.

  • heat map 
  • bubble chart
  • bar graph
  • line chart
Explanation: Heat maps are especially useful when the goal is to emphasize the frequency or concentration of occurrences. They use a color gradient to depict the magnitude of values across categories or dimensions, which makes patterns and concentrations easy to identify.
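
A two-color heat map can be sketched with matplotlib, assuming that library is available alongside pandas. The counts, the color choices, and the colormap name two_color are all hypothetical.

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen so no display is needed

import matplotlib.pyplot as plt
import numpy as np
from matplotlib.colors import LinearSegmentedColormap

# Hypothetical counts: rows are days, columns are time slots.
counts = np.array([[1, 4, 9, 3], [2, 8, 12, 5], [0, 3, 6, 2]])

# Two-color gradient from white (low) to dark red (high).
two_color = LinearSegmentedColormap.from_list("two_color", ["white", "darkred"])

fig, ax = plt.subplots()
image = ax.imshow(counts, cmap=two_color)
fig.colorbar(image, ax=ax, label="count")
```

Call fig.savefig() with a file name, or switch to an interactive backend and call plt.show(), to view the result.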

Q: Fill in the blank: A data professional should _____ a duplicate when its value is clearly a mistake or will misrepresent the remaining unique values within the dataset.

  • eliminate
  • keep
  • filter
  • replicate
Explanation: Removing erroneous duplicates is one of the most important steps in data cleaning and preprocessing because it supports accurate, reliable analysis. Duplicates can skew statistical measures, influence modeling results, and distort the representation of the underlying data.

Q: Fill in the blank: N/A and NaN are terms used to describe _____ data.

  • missing
  • nominal
  • qualitative
  • string
Explanation: N/A indicates that a value is not applicable or relevant in its context. NaN indicates numerical values that are absent or undefined.

Q: What does the pandas function pd.duplicated() return to indicate that a data value is a duplicate of another value within the same dataset?

  • Duplicate
  • Unique
  • False
  • True 
Explanation: duplicated() returns a Boolean Series in which each item indicates whether the corresponding value is a duplicate (True) or not (False). By default, only the duplicates that follow the first occurrence of a value are marked True. The keep parameter controls whether the first occurrence is also treated as a duplicate.

Q: A data professional at a garden center researches data related to ideal growing climates. As they familiarize themselves with the datasets, they discover some data is missing. Which of the following strategies can help them solve this problem? Select all that apply.

  • Change the missing values to Boolean data that is either true or false.
  • Create a NaN category. 
  • Derive new representative values based on available data. 
  • Add in the missing values by taking the average values from the existing data. 
Explanation: Creating a NaN category marks missing values as NaN (not a number) or another explicitly defined category. This distinguishes missing data from real values, which can make handling during analysis much simpler. Deriving new representative values means estimating or inferring missing numbers from the trends or patterns already present in the data, which gives a reasonable approximation and keeps the dataset useful for analysis. Imputing missing values with a statistical measure such as the mean or median of the available data fills the gaps while largely preserving the statistical features of the dataset.
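
Deriving values from existing trends can be sketched with linear interpolation, shown next to a median fill for comparison. The temperature series is made up, and interpolation is only one assumed way to derive representative values.

```python
import numpy as np
import pandas as pd

# Hypothetical daily high temperatures with gaps.
temps = pd.Series([10.0, np.nan, 14.0, np.nan, 20.0])

# Derive values from neighbors: linear interpolation between known points.
interpolated = temps.interpolate()
print(interpolated.tolist())  # [10.0, 12.0, 14.0, 17.0, 20.0]

# Alternative: fill every gap with the median of the observed values.
median_filled = temps.fillna(temps.median())
print(median_filled.tolist())  # [10.0, 14.0, 14.0, 14.0, 20.0]
```

Interpolation suits ordered data such as a time series. A mean or median fill ignores order and gives every gap the same value.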

Q: What pandas function enables a data professional to determine if duplicate values are present in a dataset?

  • pd.deduplication() 
  • pd.duplicated()
  • pd.dupe()
  • pd.deduplicates()
Explanation: Use duplicated() to determine whether a pandas DataFrame has duplicate rows (or duplicate values in a particular column). It is called on the DataFrame or Series itself, for example df.duplicated(), and returns a Boolean Series in which each element indicates whether the corresponding row is a duplicate (True) or not (False). By default, only the duplicates that follow the first occurrence of a value are marked True. The keep parameter changes this behavior if necessary.

Q: A data team for an investment banker works on a project related to interest rates. As they familiarize themselves with the datasets, they discover some data is missing. Which of the following strategies can help them solve this problem? Select all that apply.

  • Change the missing values to zeros.
  • Ask the owner of the data to fill in the missing values. 
  • Derive new representative values based on available data. 
  • Add in the missing values by taking the average values from the existing data. 
Explanation: Asking the source or owner of the data to supply the missing values is a straightforward way to make the data complete. Deriving new representative values means using statistical techniques to estimate missing values from the data that is available. Common approaches use the mean, the median, or another measure of central tendency.

 
