Q: A data team working for an online magazine uses a regression technique
to learn about advertising sales in different sections of the publication. They
estimate the linear relationship between one continuous dependent variable and
four independent variables. What technique are they using?
- Multiple linear regression
- Simple linear regression
- Interaction regression
- Coefficient regression
Explanation: Multiple linear regression is used when there is one continuous dependent variable and two or more independent variables. It enables the team to model the linear relationship between the dependent variable and several predictors simultaneously.
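A minimal sketch of this technique in Python using statsmodels, with hypothetical column names and figures standing in for the magazine's data:

```python
import pandas as pd
import statsmodels.api as sm

# Hypothetical data: ad sales (continuous) plus four predictors.
df = pd.DataFrame({
    "sales":         [120, 95, 180, 140, 210, 160, 130, 175],
    "section_views": [1.2, 0.8, 2.1, 1.5, 2.6, 1.9, 1.4, 2.0],
    "ad_slots":      [3, 2, 5, 4, 6, 5, 3, 5],
    "page_depth":    [4, 3, 6, 5, 7, 6, 4, 6],
    "subscribers":   [10, 7, 15, 12, 18, 14, 11, 15],
})

X = sm.add_constant(df[["section_views", "ad_slots", "page_depth", "subscribers"]])
y = df["sales"]

model = sm.OLS(y, X).fit()  # ordinary least squares with four predictors
print(model.summary())      # coefficients, R-squared, p-values
```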
Q: What technique turns one categorical variable into several binary
variables?
- Multiple linear regression
- Overfitting
- One hot encoding
- Adjusted R squared
Explanation: One hot encoding is a data-preparation technique that converts a categorical variable into a form that machine learning algorithms and statistical models can use. It creates a binary variable, often called a dummy variable, for each category of the original categorical variable. For a given observation, each binary variable indicates whether that category is present (encoded as 1) or absent (encoded as 0).
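A minimal sketch with pandas, assuming a hypothetical `section` column; `pd.get_dummies` creates one binary column per category:

```python
import pandas as pd

df = pd.DataFrame({"section": ["sports", "culture", "tech", "sports"]})

# Each category becomes its own 0/1 dummy column.
encoded = pd.get_dummies(df, columns=["section"], dtype=int)
print(encoded)
#    section_culture  section_sports  section_tech
# 0                0               1             0
# 1                1               0             0
# 2                0               0             1
# 3                0               1             0
```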
Q: Which of the following is true regarding variance inflation factors?
Select all that apply.
- The larger the variance inflation factor, the less multicollinearity in the model.
- The minimum value is 0.
- The larger the variance inflation factor, the more multicollinearity in the model.
- The minimum value is 1.
Explanation: Two of the statements are accurate. The variance inflation factor (VIF) measures how much the variance of a regression coefficient is inflated by multicollinearity with the other predictors, so larger values indicate more multicollinearity. Its minimum value is 1, not 0: a VIF of 1 occurs when there is no multicollinearity, meaning the predictor is completely uncorrelated with the other predictors in the model.
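A minimal sketch of computing VIFs with statsmodels, using a hypothetical predictor matrix in which `x3` is nearly a copy of `x1`, so those two VIFs come out large:

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical predictors; x3 tracks x1 almost exactly.
X = pd.DataFrame({
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "x2": [2.0, 1.0, 4.0, 3.0, 6.0, 5.0],
    "x3": [1.1, 2.1, 2.9, 4.2, 5.1, 5.9],
})
X = sm.add_constant(X)  # VIF is computed against a model with an intercept

for i, col in enumerate(X.columns):
    if col != "const":
        print(col, variance_inflation_factor(X.values, i))
# x1 and x3 show very large VIFs; a VIF of 1 would mean the predictor
# is uncorrelated with the other predictors.
```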
Q: What term represents how the relationship between two independent
variables is associated with changes in the mean of the dependent variable?
- Normality term
- Selection term
- Interaction term
- Coefficient term
Explanation: In regression analysis, an interaction term is created by multiplying two independent variables together. It makes it possible to test whether the effect of one independent variable on the dependent variable depends on the value of the other. Interaction terms capture the combined impact of the predictors, indicating whether the relationship between the predictors and the outcome variable changes as a result of their joint influence.
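A minimal sketch with the statsmodels formula interface, using hypothetical variable names and values; in formula syntax, `x1:x2` adds just the product term, while `x1 * x2` would add `x1`, `x2`, and their product:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data roughly following y = x1 + x2 + x1*x2 plus noise.
df = pd.DataFrame({
    "y":  [3.1, 4.9, 5.2, 7.8, 7.1, 11.2],
    "x1": [1, 1, 2, 2, 3, 3],
    "x2": [1, 2, 1, 2, 1, 2],
})

model = smf.ols("y ~ x1 + x2 + x1:x2", data=df).fit()
print(model.params)  # the x1:x2 coefficient estimates the interaction effect
```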
Q: Which of the following statements accurately describe adjusted R
squared? Select all that apply.
- It is greater than 1.
- It is a regression evaluation metric.
- It can vary from 0 to 1.
- It penalizes unnecessary explanatory variables.
Explanation: Adjusted R-squared spans the same range as ordinary R-squared, 0 to 1, with values closer to 1 indicating that the model fits the data more closely. It is a useful regression evaluation metric because it penalizes the inclusion of predictors that are not essential: when new predictors are added that do not substantially improve the model's fit, the adjusted value moves downward.
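The adjustment itself is a one-line formula. A minimal sketch, where `n` is the number of observations and `p` the number of predictors:

```python
def adjusted_r_squared(r_squared: float, n: int, p: int) -> float:
    """Adjust R-squared for the number of predictors p, given n observations."""
    return 1 - (1 - r_squared) * (n - 1) / (n - p - 1)

# Adding predictors raises p; if R-squared barely moves, the adjusted value drops.
print(adjusted_r_squared(0.80, n=50, p=4))   # ~0.782
print(adjusted_r_squared(0.80, n=50, p=10))  # ~0.749
```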
Q: Which of the following statements accurately describe forward
selection and backward elimination? Select all that apply.
- Forward selection begins with the full model with all possible
independent variables.
- Forward selection begins with the full model with all possible dependent
variables.
- Forward selection begins with the null model and zero independent
variables.
- Backward elimination begins with the full model with all possible
independent variables.
Explanation: Forward selection begins with the null model, which includes no predictors, and adds predictors one at a time based on how much each contributes to the model, until a stopping condition is satisfied. Backward elimination begins with the full model, which includes all of the predictors, and removes predictors one at a time based on how little each contributes, until a stopping condition is satisfied.
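A minimal sketch of forward selection in Python, using adjusted R-squared as the criterion; the helper name is hypothetical, and other criteria (such as p-values from an extra-sum-of-squares F-test) are also common:

```python
import statsmodels.api as sm

def forward_select(X, y):
    """Greedy forward selection: start from the null model and repeatedly add
    the predictor that most improves adjusted R-squared, until none helps."""
    remaining, chosen = list(X.columns), []
    best_adj = -float("inf")
    while remaining:
        scores = []
        for col in remaining:
            candidate = sm.add_constant(X[chosen + [col]])
            scores.append((sm.OLS(y, candidate).fit().rsquared_adj, col))
        score, col = max(scores)
        if score <= best_adj:  # stopping condition: no further improvement
            break
        best_adj, chosen = score, chosen + [col]
        remaining.remove(col)
    return chosen
```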
Q: A data professional reviews model predictions for a human resources
project. They discover that the model performs poorly on both the training data
and the test holdout data, consistently predicting figures that are too low.
This leads to inaccurate estimates about employee retention. What quality does
this model have too much of?
- Bias
- Entropy
- Variance
- Leakage
Explanation: Bias is the error that arises from approximating a real-world problem with an overly simplified model. A model with high bias systematically misses the true relationship, so its predictions are consistently off in the same direction, here, figures that are too low, on both the training and test data.
Q: What regularization technique completely removes variables that are
less important to predicting the y variable of interest?
- Elastic net regression
- Independent regression
- Lasso regression
- Ridge regression
Explanation: During model training, lasso regression applies a penalty to the regression coefficients that shrinks some of them all the way to zero. This effectively performs variable selection: variables judged less important, namely those whose coefficients are reduced to zero, are removed from the model entirely.
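A minimal sketch with scikit-learn's `Lasso` on synthetic data in which only two of four features matter; the penalty drives the other coefficients exactly to zero:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
# Only the first two columns actually drive y.
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=100)

lasso = Lasso(alpha=0.5).fit(X, y)
print(lasso.coef_)  # coefficients for the two irrelevant columns are exactly 0.0
```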
Q: A data team with a restaurant group uses a regression technique to
learn about customer loyalty and ratings. They estimate the linear relationship
between one continuous dependent variable and two independent variables. What
technique are they using?
- Coefficient regression
- Simple linear regression
- Interaction regression
- Multiple linear regression
Explanation: Multiple linear regression is the appropriate technique when there is a single continuous dependent variable and two or more independent variables. It estimates the linear relationship between the set of independent variables and the dependent variable.
Q: A data professional confirms that no two independent variables are
highly correlated with each other. Which assumption are they testing for?
- No multicollinearity assumption
- No linearity assumption
- No normality assumption
- No homoscedasticity assumption
Explanation: The no multicollinearity assumption states that the independent variables in a regression model should not be highly correlated with one another. High multicollinearity causes problems in regression analysis, such as unstable estimates of the regression coefficients and difficulty interpreting the model.
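One quick way to test this assumption is to inspect the pairwise correlation matrix of the predictors; a minimal sketch with hypothetical data:

```python
import pandas as pd

# Hypothetical predictors; x3 tracks x1 almost exactly.
X = pd.DataFrame({
    "x1": [1, 2, 3, 4, 5, 6],
    "x2": [4, 1, 5, 2, 6, 3],
    "x3": [1.1, 2.0, 3.1, 3.9, 5.2, 6.1],
})

print(X.corr().round(2))  # |r| near 1 between x1 and x3 flags multicollinearity
```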
Q: What term represents the relationship for how two variables’ values
affect each other?
- Underfitting term
- Linearity term
- Interaction term
- Feature selection term
Explanation: In regression analysis, an interaction term models how the effect of one independent variable on the dependent variable shifts depending on the value of another independent variable. It captures the combined effect of the variables, indicating that the relationship between the predictors and the outcome variable is not simply additive but varies with their joint impact.
Q: Which regression evaluation metric penalizes unnecessary explanatory
variables?
- Holdout sampling
- Adjusted R squared
- Overfitting
- Regression sampling
Explanation: Adjusted R-squared modifies the R-squared metric by penalizing the inclusion of new predictors that do not substantially improve the model's fit. Because it accounts for the number of predictors in the model and penalizes needless complexity, it gives a more accurate evaluation of the model's goodness of fit.
Q: A data professional tells you that their model fails to adequately
capture the relationship between the target variable and independent variables
because it has too much bias. What is the most likely cause of the bias?
- Underfitting
- Overfitting
- Leakage
- Entropy
Explanation: Underfitting occurs when a model is too simplistic to capture the underlying patterns in the data. It is usually the result of a model that is not complex enough to represent the true relationship between the variables. An underfit model consequently has high bias and low variance, leading to poor performance on both the training data and the test data.
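A minimal sketch of diagnosing underfitting on synthetic data: a straight-line model is fit to a quadratic relationship, so it scores poorly on both the training and the test split:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, size=(200, 1))
y = x[:, 0] ** 2 + rng.normal(scale=0.3, size=200)  # quadratic truth

X_train, X_test, y_train, y_test = train_test_split(x, y, random_state=0)
model = LinearRegression().fit(X_train, y_train)  # a straight line: too simple

# Both scores are poor -> high bias (underfitting), not high variance.
print("train R^2:", model.score(X_train, y_train))
print("test  R^2:", model.score(X_test, y_test))
```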
Q: What regularization technique minimizes the impact of less relevant
variables, but drops none of the variables from the equation?
- Lasso regression
- Forward regression
- Elastic net regression
- Ridge regression
Explanation: During model training, ridge regression applies a penalty to the regression coefficients that shrinks them toward zero without ever setting them exactly to zero. All of the variables remain in the model, but their coefficients are reduced, diminishing the influence of the less relevant ones on the model's predictions. Ridge regression is an effective way to mitigate multicollinearity and stabilize the coefficient estimates.
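A minimal sketch with scikit-learn's `Ridge` on synthetic data that includes a near-duplicate column; the penalty shrinks every coefficient but keeps all four variables in the model:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 4))
X[:, 3] = X[:, 0] + rng.normal(scale=0.01, size=100)  # near-duplicate column
y = 2 * X[:, 0] + X[:, 1] + rng.normal(scale=0.1, size=100)

ridge = Ridge(alpha=1.0).fit(X, y)
print(ridge.coef_)  # all four coefficients are shrunk, but none is exactly 0
```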
Q: Fill in the blank: The no multicollinearity assumption states that
no two _____ variables can be highly correlated with each other.
- dependent
- categorical
- independent
- continuous
Explanation: The no multicollinearity assumption asserts that, in a regression model, no two independent variables can be highly correlated with each other. A high degree of multicollinearity between the independent variables can produce unstable estimates of the regression coefficients and hinder interpretation of the model, so it is essential to verify that the independent variables are not strongly correlated when fitting a regression model.
Q: Fill in the blank: An interaction term represents how the
relationship between two independent variables is associated with changes in
the _____ of the dependent variable.
- category
- multicollinearity
- assumption
- mean
Explanation: In regression analysis, an interaction term describes how the effect of one independent variable on the mean of the dependent variable changes depending on the value of another independent variable. It captures the interplay between the predictors, showing how their combined impact affects the outcome variable.
Q: A data professional uses an evaluation metric that penalizes
unnecessary explanatory variables. Which metric are they using?
- Link function
- Adjusted R squared
- Ordinary least squares
- Holdout sampling
Explanation: Adjusted R-squared modifies the R-squared metric by penalizing the inclusion of new predictors that do not substantially improve the model's fit. It accounts for the number of predictors in the model and penalizes unnecessary complexity, yielding a more accurate evaluation of the model's goodness of fit.
Q: What stepwise variable selection process begins with the full model
with all possible independent variables?
- Forward selection
- Backward elimination
- Extra-sum-of-squares F-test
- Overfit selection
Explanation: Backward elimination begins with the full model containing all possible independent variables. At each step it evaluates how much each remaining predictor contributes to the model fit and removes the least significant one, repeating the process until a stopping condition is satisfied. (Forward selection works in the opposite direction, starting from the null model with no predictors.)
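A minimal sketch of backward elimination in Python, using coefficient p-values as the dropping criterion; the helper name and threshold are hypothetical choices:

```python
import statsmodels.api as sm

def backward_eliminate(X, y, threshold=0.05):
    """Greedy backward elimination: start from the full model and drop the
    least significant predictor until every p-value is below the threshold."""
    chosen = list(X.columns)
    while chosen:
        fit = sm.OLS(y, sm.add_constant(X[chosen])).fit()
        pvalues = fit.pvalues.drop("const")
        worst = pvalues.idxmax()
        if pvalues[worst] <= threshold:  # stopping condition reached
            break
        chosen.remove(worst)
    return chosen
```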
Q: A data analytics team creates a model for a project supporting their
company’s sales department. The model performs very well on the training data,
but it scores much worse when used to predict on new, unseen data. What does
this model have too much of?
- Entropy
- Bias
- Leakage
- Variance
Explanation: Variance refers to how sensitive a model is to small fluctuations in the training data. A model with high variance is too complex: it captures noise or random fluctuations in the training data and therefore generalizes poorly to new, unseen data. This problem is known as overfitting.
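A minimal sketch of diagnosing overfitting on synthetic data: a degree-15 polynomial is flexible enough to memorize the noise in a small training set, so it scores near 1.0 on the training data but much worse on the test split:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(3)
x = rng.uniform(-3, 3, size=(30, 1))
y = x[:, 0] + rng.normal(scale=0.5, size=30)  # linear truth plus noise

X_train, X_test, y_train, y_test = train_test_split(x, y, random_state=0)
model = make_pipeline(PolynomialFeatures(degree=15), LinearRegression())
model.fit(X_train, y_train)

print("train R^2:", model.score(X_train, y_train))  # near 1.0
print("test  R^2:", model.score(X_test, y_test))    # much worse
```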
Q: A data professional at a car rental agency uses a regression
technique to learn about how customers engage with various sections of the
company website. They estimate the linear relationship between one continuous
dependent variable and three independent variables. What technique are they
using?
- One hot encoding
- Multiple linear regression
- Simple linear regression
- Interaction terms
Explanation: Multiple linear regression applies whenever there is a single continuous dependent variable and two or more independent variables, three in this case. It estimates the linear relationship between the set of independent variables and the dependent variable.