A Tour of Linear Regression

Pratima Rathore
40 min read · Jun 15, 2020


Machine learning can be summarized as learning a function (f) that maps input variables (X) to output variables (Y).

Y = f(X). An algorithm learns this target mapping function from training data.

The form of the function is unknown, so our job as machine learning practitioners is to evaluate different machine learning algorithms and see which is better at approximating the underlying function. Different algorithms make different assumptions or biases about the form of the function and how it can be learned.

Let's see what parametric and non-parametric learning algorithms are.

Parametric Machine Learning Algorithms

Assumptions can greatly simplify the learning process, but can also limit what can be learned. Algorithms that simplify the function to a known form are called parametric machine learning algorithms. A learning model that summarizes data with a set of parameters of fixed size (independent of the number of training examples) is called a parametric model. No matter how much data you throw at a parametric model, it won’t change its mind about how many parameters it needs.

Benefits of Parametric Machine Learning Algorithms:

  • Simpler: These methods are easier to understand and interpret results.
  • Speed: Parametric models are very fast to learn from data.
  • Less Data: They do not require as much training data and can work well even if the fit to the data is not perfect.

Limitations of Parametric Machine Learning Algorithms:

  • Constrained: By choosing a functional form these methods are highly constrained to the specified form.
  • Limited Complexity: The methods are more suited to simpler problems.
  • Poor Fit: In practice the methods are unlikely to match the underlying mapping function.

Nonparametric Machine Learning Algorithms

Algorithms that do not make strong assumptions about the form of the mapping function are called nonparametric machine learning algorithms. By not making assumptions, they are free to learn any functional form from the training data. Nonparametric methods are good when you have a lot of data and no prior knowledge, and when you don’t want to worry too much about choosing just the right features. Nonparametric methods seek to best fit the training data in constructing the mapping function, whilst maintaining some ability to generalize to unseen data. As such, they are able to fit a large number of functional forms. An easy to understand nonparametric model is the k-nearest neighbors algorithm that makes predictions based on the k most similar training patterns for a new data instance. The method does not assume anything about the form of the mapping function other than patterns that are close are likely to have a similar output variable.

Some more examples of popular nonparametric machine learning algorithms are:

  • k-Nearest Neighbors
  • Decision Trees like CART and C4.5
  • Support Vector Machines

Benefits of Nonparametric Machine Learning Algorithms:

  • Flexibility: Capable of fitting a large number of functional forms.
  • Power: No assumptions (or weak assumptions) about the underlying function.
  • Performance: Can result in higher performance models for prediction.

Limitations of Nonparametric Machine Learning Algorithms:

  • More data: Require a lot more training data to estimate the mapping function.
  • Slower: A lot slower to train, as they often have far more parameters to train.
  • Overfitting: More of a risk of overfitting the training data, and it is harder to explain why specific predictions are made.

Regression

  • Regression analysis is a predictive modelling technique which investigates the relationship between a dependent (target) variable and independent (predictor) variable(s). It is parametric in nature.
  • This technique is used for forecasting, time series modelling and finding the causal effect relationship between the variables.
  • Here, we fit a curve/line to the data points in such a manner that the distances of the data points from the curve or line are minimized. The relationship between the variables is used to find the best-fit line, or regression equation, that can be used to make predictions.
  • There are various kinds of regression techniques available to make predictions. These techniques are mostly driven by three metrics (number of independent variables, type of dependent variables and shape of regression line).
  • Different types of Regression -
  1. Linear regression: In this technique, the dependent variable is continuous, the independent variable(s) can be continuous or discrete, and the nature of the regression line is linear.
  2. Logistic regression: Logistic regression is a classification algorithm used to assign observations to a discrete set of classes.
  3. Polynomial regression: A regression equation is a polynomial regression equation if the power of the independent variable is more than 1. (The difference between simple linear regression and multiple linear regression is that multiple linear regression has more than one independent variable, whereas simple linear regression has only one.) In polynomial regression we use higher powers of X to describe Y, as in Y = m1X + m2X² + C, where m1 and m2 are the coefficients of the first and second powers of the factor. When you do a polynomial regression, you're just doing a multiple regression with multiple transformations of a single variable.

4. Stepwise regression: This form of regression is used when we deal with multiple independent variables. In this technique, the selection of independent variables is done with the help of an automatic process, which involves no human intervention.

5. Ridge regression: Ridge regression is a technique for analyzing multiple regression data. When multicollinearity occurs, least squares estimates are unbiased, but their variances are large. By adding a degree of bias to the regression estimates, ridge regression reduces the standard errors. Ridge regression shrinks the value of coefficients but never makes them exactly zero, which means it performs no feature selection.

6. Lasso regression: Lasso is a regression analysis method that performs both variable selection and regularization. Lasso regression uses soft thresholding and selects only a subset of the provided covariates for use in the final model. If a group of predictors is highly correlated, lasso picks only one of them and shrinks the others to zero.

7. ElasticNet regression: ElasticNet is a regularized regression method that linearly combines the penalties of the lasso and ridge methods.

***Shrinkage means that the coefficients are reduced towards zero compared to the OLS parameter estimates. This is called regularization.

8. Ordinal Regression: Ordinal Regression is used to predict ranked values. In simple words, this type of regression is suitable when dependent variable is ordinal in nature. Example of ordinal variables — Survey responses (1 to 6 scale), patient reaction to drug dose (none, mild, severe).

9. Poisson regression: Poisson regression is used when the dependent variable is count data. Applications of Poisson regression include predicting the number of customer-care calls related to a particular product, or estimating the number of emergency service calls during an event. The dependent variable must meet the following conditions: it has a Poisson distribution, counts cannot be negative, and the method is not suitable for non-whole numbers.

Now let's explore linear regression:

Mathematically, regression uses a linear function to approximate (predict) the dependent variable given as:

Y = β0 + β1X + ε

Error is an inevitable part of the prediction-making process. No matter how powerful the algorithm we choose, there will always remain an irreducible error (ε) which reminds us that the "future is uncertain."

Hypothesis Testing in Linear Regression

You start by saying that β1 is not significant, i.e. there is no relationship between X and y. So in order to perform the hypothesis test, we first propose

  • the null hypothesis that β1 is 0.
  • And the alternative hypothesis thus becomes β1 is not zero.

If you fail to reject the null hypothesis that would mean that β1 is zero which would simply mean that β1 is insignificant and of no use in the model. Similarly, if you reject the null hypothesis, it would mean that β1 is not zero and the line fitted is a significant one. Now, in order to perform the hypothesis test, you need to derive the p-value for the given beta.

Now, if the p-value turns out to be less than 0.05, you can reject the null hypothesis and state that β1 is indeed significant.

NOTE: The test statistic for β1 follows a t-distribution instead of a normal distribution.
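
To make this concrete, here is a minimal sketch on synthetic data (assuming the statsmodels package is available; the variable names and true coefficients are illustrative) that fits a simple linear regression and reads off the t-statistic and p-value for β1:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=100)
y = 3.0 + 0.5 * X + rng.normal(0, 1, size=100)   # true beta0 = 3, beta1 = 0.5

X_design = sm.add_constant(X)        # adds the intercept column
model = sm.OLS(y, X_design).fit()

print(model.params)                  # [beta0_hat, beta1_hat]
print(model.tvalues)                 # t-statistics (t-distribution, as noted above)
print(model.pvalues)                 # reject H0: beta1 = 0 if its p-value < 0.05
```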

Sum of Squared Errors (SSE): In order to fit the best intercept line between the points in a scatter plot, we use a metric called the "Sum of Squared Errors" (SSE) and compare candidate lines to find the best fit by reducing errors. The error is the difference between the actual value and the predicted value, and SSE is the sum of the squared errors: SSE = Σ(y - ŷ)².

Residuals The difference between the observed value of the dependent variable (y) and the predicted value (ŷ) is called the residual (e). Each data point has one residual.

Residual = Observed value - Predicted value, i.e. e = y - ŷ

Property: Both the sum and the mean of the residuals are equal to zero. That is, Σe = 0 and ē = 0.

We can't completely eliminate the error term (ε), but we can still try to reduce it to the lowest possible level. To do this, regression uses a technique known as Ordinary Least Squares (OLS). The OLS method is used to find the best intercept β0 and slope β1.

When we say we are using linear/multiple regression, we are actually referring to the OLS technique. Conceptually, OLS tries to reduce the sum of squared errors ∑[Actual(y) - Predicted(y')]² by finding the best possible values of the regression coefficients (β0, β1, etc.).

The most intuitive and closest approximation of Y is the mean of Y, i.e. even in the worst-case scenario our predictive model should at least give higher accuracy than the mean prediction. For simple linear regression, the OLS coefficients are calculated as β1 = Σ(xᵢ - x̄)(yᵢ - ȳ) / Σ(xᵢ - x̄)² and β0 = ȳ - β1x̄.

  • There are other techniques also such as Generalized Least Square, Percentage Least Square, Total Least Squares, Least absolute deviation, Gradient Descent and many more.
  • It is important to point out though that the OLS method shown above works for a univariate dataset (i.e., a single independent variable and a single dependent variable). A multivariate dataset contains multiple independent variables, and for such datasets (or very large ones) a machine learning algorithm called "Gradient Descent" is typically used.

Now you know that the mean of y (ymean) plays a crucial role in determining the regression coefficients and, furthermore, the accuracy. In OLS, the error estimates can be divided into three parts:

Residual Sum of Squares (RSS) = ∑[Actual(y) - Predicted(y)]²

Explained Sum of Squares (ESS) = ∑[Predicted(y) - Mean(y)]²

Total Sum of Squares (TSS) = ∑[Actual(y) - Mean(y)]²
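
As a small illustration (a sketch on synthetic data, using only NumPy), the three quantities can be computed directly from an OLS fit, and for a fit with an intercept they satisfy TSS = RSS + ESS:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = 2.0 + 1.5 * x + rng.normal(0, 1, 50)

b1, b0 = np.polyfit(x, y, deg=1)         # OLS slope and intercept
y_pred = b0 + b1 * x
y_mean = y.mean()

rss = np.sum((y - y_pred) ** 2)          # Residual Sum of Squares
ess = np.sum((y_pred - y_mean) ** 2)     # Explained Sum of Squares
tss = np.sum((y - y_mean) ** 2)          # Total Sum of Squares

print(rss + ess, tss)                    # equal for an OLS fit with an intercept
print("R^2 =", ess / tss, "=", 1 - rss / tss)
```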

Assumptions made in regression

As we discussed above, regression is a parametric technique, so it makes assumptions. Let’s look at the assumptions it makes:

  1. Linearity and additivity of the relationship between dependent and independent variables: There should be a linear relationship between the dependent (target) variable and the independent (predictor) variables. A linear relationship means that the change in the target Y due to a one-unit change in X1 is constant, regardless of the value of X1. Additive means that the effect of X on Y is independent of the other variables. This can be checked by plotting scatter plots of the target versus the individual independent variables.

How to fix: consider applying a nonlinear transformation to the dependent and/or independent variables if you can think of a transformation that seems appropriate.

2. No multicollinearity: This phenomenon exists when the independent variables are moderately or highly correlated. In a model with correlated variables, it becomes a tough task to figure out the true relationship of a predictor with the response variable. In other words, it becomes difficult to find out which variable is actually contributing to predicting the response variable. There must be no correlation among the independent variables.

We can check multicollinearity by:

  • Correlation Matrix
  • VIF

3. No autocorrelation: The error terms must be uncorrelated, i.e. the error at εt must not predict the error at εt+1. The presence of correlation in the error terms is known as autocorrelation, and it usually arises in time series data. It drastically affects the regression coefficients and standard error values, since these are based on the assumption of uncorrelated error terms. This can be evaluated with the Durbin-Watson (DW) statistic, which must lie between 0 and 4: DW = 2 implies no autocorrelation, 0 < DW < 2 implies positive autocorrelation, while 2 < DW < 4 indicates negative autocorrelation.

4. Homoscedasticity (constant variance): The variance of the errors is constant with respect to the predicting variables, the response, or time. Generally, non-constant variance arises in the presence of outliers or extreme leverage values. How to diagnose: look at a plot of residuals versus predicted values and, in the case of time series data, a plot of residuals versus time. This is a scatter plot of residuals on the y-axis and fitted values (estimated responses) on the x-axis, and it is used to detect non-linearity, unequal error variances, and outliers.

Here are the characteristics of a well-behaved residuals vs. fits plot and what they suggest about the appropriateness of the simple linear regression model:

  • The residuals “bounce randomly” around the 0 line. This suggests that the assumption that the relationship is linear is reasonable.
  • The residuals roughly form a “horizontal band” around the 0 line. This suggests that the variances of the error terms are equal.
  • No one residual “stands out” from the basic random pattern of residuals. This suggests that there are no outliers.

5. Normality of the error distribution: We can check this by plotting a histogram of the residuals, or a normal Q-Q plot of the standardized residuals. The Q-Q (quantile-quantile) plot is a scatter plot which helps us validate the assumption of normality in a data set. Using this plot we can infer whether the data comes from a normal distribution: if so, the plot shows a fairly straight line. Absence of normality in the errors can be seen as deviation from the straight line.

When the quantiles of two variables are plotted against each other, the plot obtained is known as a quantile-quantile plot, or qqplot. This plot provides a summary of whether the distributions of the two variables are similar or not with respect to their locations. If all the quantile points lie on or close to a straight line at an angle of 45 degrees from the x-axis, it indicates that the two samples have similar distributions.
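
Here is a minimal sketch of these residual checks on synthetic data (assuming matplotlib and statsmodels are available): a residuals-vs-fitted plot for linearity and constant variance, the Durbin-Watson statistic for autocorrelation, and a Q-Q plot for normality.

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, 200)
y = 2.0 + 0.7 * X + rng.normal(0, 1, 200)

fit = sm.OLS(y, sm.add_constant(X)).fit()
resid, fitted = fit.resid, fit.fittedvalues

# 1. Residuals vs fitted values: look for a random "horizontal band" around 0.
plt.scatter(fitted, resid)
plt.axhline(0, color="grey")
plt.xlabel("fitted values"); plt.ylabel("residuals")

# 2. Durbin-Watson: a value near 2 suggests no autocorrelation.
print("DW statistic:", durbin_watson(resid))

# 3. Normal Q-Q plot: points should hug the 45-degree line.
sm.qqplot(resid, line="45", fit=True)
plt.show()
```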

So to summarize we can say :

Assumption about the form of the model:

It is assumed that there is a linear relationship between the dependent and independent variables. It is known as the ‘linearity assumption’.

Assumptions about the residuals:

  • Normality assumption: It is assumed that the error terms, ε(i), are normally distributed.
  • Zero mean assumption: It is assumed that the residuals have a mean value of zero.
  • Constant variance assumption: It is assumed that the residual terms have the same (but unknown) variance, σ². This assumption is also known as the assumption of homogeneity or homoscedasticity.
  • Independent error assumption: It is assumed that the residual terms are independent of each other, i.e. their pair-wise covariance is zero.

Assumptions about the estimators:

  • The independent variables are measured without error.
  • The independent variables are linearly independent of each other, i.e. there is no multicollinearity in the data.

Evaluation metrics used in linear regression

Mean Squared Error:

MSE or Mean Squared Error is one of the most preferred metrics for regression tasks. It is simply the average of the squared difference between the target value and the value predicted by the regression model. As it squares the differences, it penalizes even a small error which leads to over-estimation of how bad the model is. It is preferred more than other metrics because it is differentiable and hence can be optimized better.

Root Mean Squared Error:

RMSE is the most widely used metric for regression tasks and is the square root of the averaged squared difference between the target value and the value predicted by the model. It is preferred more in some cases because the errors are first squared before averaging which poses a high penalty on large errors. This implies that RMSE is useful when large errors are undesired.

Mean Absolute Error:

MAE is the average of the absolute differences between the target values and the values predicted by the model. MAE is more robust to outliers and does not penalize errors as extremely as MSE. MAE is a linear score, which means all the individual differences are weighted equally. It is not suitable for applications where you want to pay more attention to the outliers.

Because we are squaring the difference, the MSE will almost always be bigger than the MAE. For this reason, we cannot directly compare the MAE to the MSE. We can only compare our model’s error metrics to those of a competing model. The effect of the square term in the MSE equation is most apparent with the presence of outliers in our data. While each residual in MAE contributes proportionally to the total error, the error grows quadratically in MSE. This ultimately means that outliers in our data will contribute to much higher total error in the MSE than they would the MAE. Similarly, our model will be penalized more for making predictions that differ greatly from the corresponding actual value. This is to say that large differences between actual and predicted are punished more in MSE than in MAE.

However, even after being more complex and biased towards higher deviation, RMSE is still the default metric of many models because loss function defined in terms of RMSE is smoothly differentiable and makes it easier to perform mathematical operations.
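A quick sketch of computing all three metrics with scikit-learn and NumPy (the toy arrays are made up purely for illustration):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([3.0, 5.0, 7.5, 10.0, 12.0])
y_pred = np.array([2.8, 5.4, 7.0, 10.5, 14.0])   # note the larger miss on the last point

mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_true, y_pred)
print(f"MSE={mse:.3f}  RMSE={rmse:.3f}  MAE={mae:.3f}")  # squared metrics punish the large miss more
```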

Coefficient of Determination or R²

R² is another metric used for evaluating the performance of a regression model. It helps us to compare our current model with a constant baseline and tells us how much better our model is. The constant baseline is chosen by taking the mean of the data and drawing a line at the mean. R² is a scale-free score, which means that it doesn't matter whether the values are large or small; R² will always be less than or equal to 1.

R² metric tells us the amount of variance explained by the independent variables in the model.

Adjusted R²:

Adjusted R² conveys the same meaning as R² but is an improvement on it. R² suffers from the problem that its score improves as terms are added even though the model is not improving, which may misguide the researcher. Adjusted R² is always lower than R², as it adjusts for the increasing number of predictors and only shows improvement if there is a real improvement.
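
A minimal sketch of both scores, using scikit-learn for R² and the standard adjustment formula Adjusted R² = 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of observations and p the number of predictors (both assumed known here):

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, 5.0, 7.5, 10.0, 12.0, 15.0])
y_pred = np.array([2.8, 5.4, 7.0, 10.5, 11.0, 15.5])

n, p = len(y_true), 2                      # p = number of predictors (assumed)
r2 = r2_score(y_true, y_pred)
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(r2, adj_r2)                          # adjusted R² is never higher than R²
```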

Why is R² Negative

There is a misconception that the R² score ranges from 0 to 1, but actually it ranges from -∞ to 1. The following reasons can cause R² to be negative:

  • There may be a large number of outliers in the data that cause the MSE of the model to be greater than the MSE of the baseline, causing R² to be negative (i.e. the numerator is greater than the denominator).
  • Sometimes while coding the regression algorithm, the researcher might forget to add the intercept to the regressor which will also lead to R² being negative. This is because, without the benefit of an intercept, the regression could do worse than the sample mean(baseline) in terms of tracking the dependent variable (i.e., the numerator could be greater than the denominator)
  • SSres will exceed SStot when the line or curve fits the data even worse than a horizontal line does. R² will be negative when the line or curve does an awful job of fitting the data, which can happen when you fit a poorly chosen model.

F-statistic

The F-statistic is similar, in the sense that instead of testing the significance of each of the betas, it tells you whether the overall model fit is significant or not. This parameter is examined because it can happen that even though each of your betas looks significant, your overall model fit might have occurred just by chance.

If the ‘Prob (F-statistic)’ is less than 0.05, you can conclude that the overall model fit is significant. If it is greater than 0.05, you might need to review your model, as the fit might be by chance, i.e. the line may have just luckily fit the data. This is especially useful in multiple linear regression, where you have many betas for the different predictor variables; there it helps determine whether all the predictor variables together are significant or not, or simply put, whether the model fit as a whole is significant.

Coefficients and p-values:

The p-values of the individual coefficients tell you whether each coefficient is significant or not.

Cost Function

A cost function measures the performance of a machine learning model for given data. It quantifies the error between the predicted values and the expected values.

The purpose of a cost function is to be either:

  • Minimized: the returned value is usually called cost, loss or error, and the goal is to find the values of the model parameters for which the cost function returns as small a number as possible.
  • Maximized: the value it yields is called a reward, and the goal is to find the values of the model parameters for which the returned number is as large as possible.

-> For linear regression, the cost function used is the MSE.

Gradient Descent

Gradient descent is an optimization algorithm used to minimize some function by iteratively moving in the direction of steepest descent as defined by the negative of the gradient. In machine learning, we use gradient descent to update the parameters of our model. Parameters refer to coefficients in Linear Regression and weights in neural networks.

It is important to point out though that the OLS method works well for a univariate dataset (i.e., a single independent variable and a single dependent variable); for multivariate datasets with many independent variables, or very large ones, "Gradient Descent" is typically used.

Starting at the top of the mountain, we take our first step downhill in the direction specified by the negative gradient. Next we recalculate the negative gradient (passing in the coordinates of our new point) and take another step in the direction it specifies. We continue this process iteratively until we get to the bottom of our graph, or to a point where we can no longer move downhill–a local minimum.

Learning rate: The size of these steps is called the learning rate. With a high learning rate we can cover more ground each step, but we risk overshooting the lowest point since the slope of the hill is constantly changing. With a very low learning rate, we can confidently move in the direction of the negative gradient since we are recalculating it so frequently. A low learning rate is more precise, but calculating the gradient is time-consuming, so it will take us a very long time to get to the bottom.
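
A minimal NumPy sketch of batch gradient descent for simple linear regression (synthetic data; the learning rate and iteration count are illustrative choices, not prescriptions):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, 100)
y = 4.0 + 2.5 * X + rng.normal(0, 1, 100)    # true intercept 4.0, slope 2.5

b0, b1 = 0.0, 0.0                            # initial parameter guesses
lr, n_iters, n = 0.01, 2000, len(X)          # learning rate and number of steps

for _ in range(n_iters):
    y_pred = b0 + b1 * X
    error = y_pred - y
    # Gradients of MSE = (1/n) * sum((y_pred - y)^2) with respect to b0 and b1
    grad_b0 = (2.0 / n) * error.sum()
    grad_b1 = (2.0 / n) * (error * X).sum()
    b0 -= lr * grad_b0                       # step in the direction of the negative gradient
    b1 -= lr * grad_b1

print(b0, b1)                                # should approach roughly 4.0 and 2.5
```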

Types of Gradient Descent Algorithms

The variants of gradient descent are defined on the basis of how much of the data we use to calculate the derivative of the cost function. Depending upon the amount of data used, the time complexity and accuracy of the algorithms differ.

  • Batch Gradient Descent
  • Stochastic Gradient Descent
  • Mini-Batch Gradient Descent

Gradient descent can be slow to run on very large datasets.

Because one iteration of the gradient descent algorithm requires a prediction for each instance in the training dataset, it can take a long time when you have many millions of instances. In situations when you have large amounts of data, you can use a variation of gradient descent called stochastic gradient descent. In this variation, the gradient descent procedure described above is run but the update to the coefficients is performed for each training instance, rather than at the end of the batch of instances. The learning can be much faster with stochastic gradient descent for very large training datasets and often you only need a small number of passes through the dataset to reach a good or good enough set of coefficients, e.g. 1-to-10 passes through the dataset.

The mini-batch algorithm is the most favoured and widely used variant; it produces precise and faster results using a batch of 'm' training examples. Rather than using the complete data set, in every iteration we use a set of 'm' training examples, called a batch, to compute the gradient of the cost function. Common mini-batch sizes range between 50 and 256, but can vary for different applications. In this way, the algorithm:

  • reduces the variance of the parameter updates, which can lead to more stable convergence.
  • can make use of highly optimized matrix operations, which makes computing the gradient very efficient.

Performing Linear Regression Using the Normal Equation

Usually finding the best model parameters is performed by running some kind of optimization algorithm (e.g. gradient descent) to minimize a cost function. However, it is also possible to obtain the values (weights) of these parameters by solving an algebraic equation called the normal equation: θ = (XᵀX)⁻¹Xᵀy, where X is the design matrix (with a column of ones for the intercept) and y is the vector of targets.

The problem is its numerical complexity. Solving this equation requires inverting a matrix, which is a computationally expensive operation: depending on the implementation, in big O notation it is O(n³) or slightly less in the number of features. This means it scales up badly; practically, when you double the number of features, the computation time increases by roughly 2³ = 8 times. There is also some chance that XᵀX is not invertible at all, causing big trouble. These are the reasons why, in practice, this approach is uncommon for large feature sets. On the bright side, this approach is calculated in just one step and you don't have to choose a learning rate parameter. Additionally, in terms of memory usage this approach is linear, O(m), meaning it handles huge datasets effectively as long as they fit into your computer's memory.
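
A sketch of the normal equation in NumPy on synthetic data; in practice a least-squares solver is preferred over an explicit matrix inverse, as shown in the last line.

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.uniform(0, 10, size=(100, 2))              # two features
y = 1.5 + 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 1, 100)

X_b = np.c_[np.ones(len(X)), X]                    # add the bias (intercept) column

theta = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y     # theta = (X^T X)^(-1) X^T y
print(theta)                                       # approximately [1.5, 3.0, -2.0]

theta_lstsq, *_ = np.linalg.lstsq(X_b, y, rcond=None)   # numerically safer alternative
```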

Understanding the Bias-Variance Tradeoff

The prediction error for any machine learning algorithm can be broken down into three parts:

  • Bias Error
  • Variance Error
  • Irreducible Error: The irreducible error cannot be reduced regardless of what algorithm is used. It is the error introduced from the chosen framing of the problem and may be caused by factors like unknown variables that influence the mapping of the input variables to the output variable.

Whenever we discuss model prediction, it’s important to understand prediction errors (bias and variance). There is a tradeoff between a model’s ability to minimize bias and variance.

Gaining a proper understanding of these errors would help us not only to build accurate models but also to avoid the mistake of overfitting and underfitting.

Bias Error

Bias refers to the simplifying assumptions made by the model to make the target function easier to approximate; equivalently, it refers to the difference between the values predicted by the model and the real values.

Bias = error in predicting a point.

Low bias: Suggests fewer assumptions about the form of the target function. High bias: Suggests more assumptions about the form of the target function.

  • Examples of low-bias machine learning algorithms include: Decision Trees, k-Nearest Neighbors and Support Vector Machines.
  • Examples of high-bias machine learning algorithms include: Linear Regression, Linear Discriminant Analysis and Logistic Regression.

Variance Error

Variance is the amount by which the estimate of the target function would change if different training data were used; it shows up as the difference between the fit on the training data and the fit on the test data.

Low Variance: Suggests small changes to the estimate of the target function with changes to the training dataset. High Variance: Suggests large changes to the estimate of the target function with changes to the training dataset.

A model with high variance pays a lot of attention to the training data and does not generalize to data it hasn't seen before. As a result, such models perform very well on training data but have high error rates on test data.

  • Examples of low-variance machine learning algorithms include: Linear Regression, Linear Discriminant Analysis and Logistic Regression.
  • Examples of high-variance machine learning algorithms include: Decision Trees, k-Nearest Neighbors and Support Vector Machines.

The goal of any supervised machine learning algorithm is to achieve low bias and low variance. In turn the algorithm should achieve good prediction performance.

You can see a general trend in the examples above:

  • Linear machine learning algorithms often have a high bias but a low variance.
  • Nonlinear machine learning algorithms often have a low bias but a high variance.

Learning curves

Let’s say we have some data and split it into a training set and validation set. We take one single instance (that’s right, one!) from the training set and use it to estimate a model. Then we measure the model’s error on the validation set and on that single training instance. The error on the training instance will be 0, since it’s quite easy to perfectly fit a single data point. The error on the validation set, however, will be very large. Now let’s say that instead of one training instance, we take ten and repeat the error measurements. Then we take fifty, one hundred, five hundred, until we use our entire training set. The error scores will vary more or less as we change the training set. We thus have two error scores to monitor: one for the validation set, and one for the training sets.

If we plot the evolution of the two error scores as training sets change, we end up with two curves. These are called learning curves. In a nutshell, a learning curve shows how error changes as the training set size increases.

If the training error and true error (cross-validating error) converge to the same value and the corresponding value of the error is high, it indicates that the model is underfitting and is suffering from high bias. If there is a significant gap between the converging values of the training and cross-validating errors, i.e. the cross-validating error is significantly higher than the training error, it suggests that the model is overfitting the training data and is suffering from a high variance.
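A sketch of how such curves could be produced with scikit-learn's learning_curve helper on synthetic data (the fold count and training sizes are arbitrary illustrative choices):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve

rng = np.random.default_rng(3)
X = rng.uniform(0, 10, size=(300, 1))
y = 2.0 + 1.2 * X[:, 0] + rng.normal(0, 1, 300)

sizes, train_scores, val_scores = learning_curve(
    LinearRegression(), X, y,
    train_sizes=np.linspace(0.1, 1.0, 8),
    cv=5, scoring="neg_mean_squared_error",
)

train_err = -train_scores.mean(axis=1)   # convert back to (positive) MSE
val_err = -val_scores.mean(axis=1)
for n, tr, va in zip(sizes, train_err, val_err):
    print(f"n={n:4d}  train MSE={tr:.3f}  validation MSE={va:.3f}")
```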

How does multicollinearity affect the linear regression

Multicollinearity occurs when some of the independent variables are highly correlated (positively or negatively) with each other. This multicollinearity causes a problem as it is against the basic assumption of linear regression. The presence of multicollinearity does not affect the predictive capability of the model. So, if you just want predictions, the presence of multicollinearity does not affect your output. However, if you want to draw some insights from the model and apply them in, let’s say, some business model, it may cause problems.

One of the major problems caused by multicollinearity is that it leads to incorrect interpretations and provides wrong insights. The coefficients of linear regression suggest the mean change in the target value if a feature is changed by one unit. So, if multicollinearity exists, this does not hold true as changing one feature will lead to changes in the correlated variable and consequent changes in the target variable. This leads to wrong insights and can produce hazardous results for a business.

Multicollinearity affects the following:

  • Interpretation: Does "change in Y when all others are held constant" apply?
  • Inference: Coefficients swing wildly and signs can invert, so the p-values are not reliable.

The predictive power, as measured by the R-squared value, is not affected: even though you might have redundant variables in your model, multicollinearity plays no role in reducing the R-squared.

Some methods that can be used to deal with multicollinearity are as follows:

  • Dropping variables: Drop the variable that is highly correlated with others. Pick the business interpretable variable.
  • Creating a new variable using the interactions of the older variables: Add interaction features, i.e., features derived using some of the original features.
  • Variable transformations: Principal component analysis or Partial Least Squares Regression

variance inflation factor (VIF)

The variance inflation factor (VIF) quantifies the extent of correlation between one predictor and the other predictors in a model. It is used for diagnosing collinearity/multicollinearity. Higher values signify that it is difficult to impossible to assess accurately the contribution of predictors to a model.

  • A variance inflation factor (VIF) quantifies how much the variance is inflated.
  • The standard errors, and hence the variances, of the estimated coefficients are inflated when multicollinearity exists.
  • A variance inflation factor exists for each of the predictors in a multiple regression model. For example, the variance inflation factor for the estimated regression coefficient bj — denoted VIFj — is just the factor by which the variance of bj is “inflated” by the existence of correlation among the predictor variables in the model.

In particular, the variance inflation factor for the jth predictor is VIFj = 1 / (1 - R²j), where R²j is the R²-value obtained by regressing the jth predictor on the remaining predictors.

A value of 1 means that the predictor is not correlated with other variables. The higher the value, the greater the correlation of the variable with other variables. The common heuristic we follow for the VIF values is:

  • greater than 10: The VIF value is definitely high, and the variable should be eliminated.
  • greater than 5: Can be okay, but it is worth inspecting.
  • less than 5: Good VIF value. No need to eliminate this variable.
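
A sketch of computing VIFs with statsmodels on synthetic data where x1 and x2 are deliberately made collinear (the column names are made up for illustration):

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

rng = np.random.default_rng(5)
x1 = rng.normal(size=200)
x2 = 0.9 * x1 + rng.normal(scale=0.3, size=200)   # strongly correlated with x1
x3 = rng.normal(size=200)                         # independent predictor
X = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})

X_const = add_constant(X)                         # VIF is computed with an intercept present
vifs = {col: variance_inflation_factor(X_const.values, i)
        for i, col in enumerate(X_const.columns) if col != "const"}
print(vifs)                                       # expect inflated values for x1 and x2, ~1 for x3
```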

Methods to Boost the Accuracy of a Model

1. Treat missing and Outlier values

The unwanted presence of missing and outlier values in the training data often reduces the accuracy of a model or leads to a biased model. It leads to inaccurate predictions. This is because we don’t analyse the behavior and relationship with other variables correctly. So, it is important to treat missing and outlier values well.

Treating Missing Values:

  • Deletion
  • Mean/Mode/Median Imputation: This consists of replacing the missing data for a given attribute with the mean or median (quantitative attribute) or mode (qualitative attribute) of all known values of that variable.
  • Prediction Model: A prediction model is one of the more sophisticated methods for handling missing data. Here, we create a predictive model to estimate the values that will substitute the missing data. In this case, we divide our data set into two sets: one set with no missing values for the variable and another with missing values. The first data set becomes the training data set of the model, while the second (with missing values) becomes the test data set, and the variable with missing values is treated as the target variable. Next, we create a model to predict the target variable based on the other attributes of the training data set and populate the missing values of the test data set. We can use regression, ANOVA, logistic regression and various other modelling techniques to do this.
  • KNN Imputation: In this method of imputation, the missing values of an attribute are imputed using the given number of attributes that are most similar to the attribute whose values are missing. The similarity of two attributes is determined using a distance function. (A sketch of imputation with scikit-learn follows this list.)
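
A minimal sketch of mean and KNN imputation using scikit-learn (the tiny array is made up for illustration):

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, 6.0],
              [4.0, np.nan]])

# Mean imputation (strategy can also be "median" or "most_frequent")
X_mean = SimpleImputer(strategy="mean").fit_transform(X)

# KNN imputation: fill each missing value from the k most similar rows
X_knn = KNNImputer(n_neighbors=2).fit_transform(X)
print(X_mean)
print(X_knn)
```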

Finding and treating Outliers

Outliers can be of two types: univariate and multivariate. Univariate outliers can be found when we look at the distribution of a single variable, while multivariate outliers are outliers in an n-dimensional space; in order to find them, you have to look at distributions in multiple dimensions.

The most commonly used methods to detect outliers are:

  • Visualization: We use various visualization methods, like box plots, histograms and scatter plots.
  • Any value which falls outside the range of Q1 - 1.5 x IQR to Q3 + 1.5 x IQR, where Q1 and Q3 are the first and third quartiles and IQR is the interquartile range (see the sketch after this list).
  • Use capping methods: any value which falls outside the range of the 5th and 95th percentiles can be considered an outlier.
  • Data points three or more standard deviations away from the mean are considered outliers.
  • Outlier detection is merely a special case of the examination of data for influential data points and it also depends on the business understanding
  • Bivariate and multivariate outliers are typically measured using either an index of influence or leverage, or distance. Popular indices such as Mahalanobis’ distance and Cook’s D are frequently used to detect outliers.
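
A small NumPy sketch of the IQR and standard-deviation rules mentioned above (the data is made up, with one obvious outlier):

```python
import numpy as np

x = np.array([10, 12, 12, 13, 12, 11, 14, 13, 15, 102, 12, 14, 13])

q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = x[(x < lower) | (x > upper)]       # IQR rule

z = (x - x.mean()) / x.std()
z_outliers = x[np.abs(z) > 3]                     # three-standard-deviation rule

print(iqr_outliers, z_outliers)                   # both flag the value 102
```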

Ways to remove Outliers:

  • Deleting observations: We delete outlier values if they are due to data entry errors or data processing errors, or if the outlier observations are very small in number. We can also use trimming at both ends to remove outliers.
  • Transforming and binning values: Transforming variables can also eliminate outliers. Taking the natural log of a value reduces the variation caused by extreme values. Binning is also a form of variable transformation; the decision tree algorithm deals with outliers well due to its binning of variables. We can also use the process of assigning weights to different observations.
  • Imputing: Like imputation of missing values, we can also impute outliers. We can use mean, median or mode imputation methods. Before imputing values, we should analyse whether an outlier is natural or artificial. If it is artificial, we can impute it. We can also use a statistical model to predict the values of outlier observations and then impute them with the predicted values.
  • Treat separately: If there is a significant number of outliers, we should treat them separately in the statistical model. One approach is to treat the two groups as different populations, build an individual model for each group, and then combine the outputs.

2.Feature Engineering

This step helps to extract more information from existing data. The feature engineering process can be divided into two steps:

Feature transformation

  • Transformation refers to the replacement of a variable by a function of it. For instance, replacing a variable x by its square/cube root or logarithm is a transformation. In other words, transformation is a process that changes the distribution of a variable or its relationship with others. It can be done when:
  • we want to change the scale of a variable or standardize the values of a variable
  • transform complex non-linear relationships into linear relationships
  • A symmetric distribution is preferred over a skewed distribution; whenever we have a skewed distribution, we can use transformations which reduce skewness. For a right-skewed distribution, we take the square/cube root or logarithm of the variable, and for a left-skewed one, we take the square/cube or exponential of the variable.
  • Sometimes, creating bins of numeric data works well, since it also handles outlier values. Numeric data can be made discrete by grouping values into bins; this is known as data discretization.

Feature Creation

Deriving new variable(s ) from existing variables is known as feature creation.

3.Feature Selection

Feature Selection is a process of finding out the best subset of attributes which better explains the relationship of independent variables with target variable.

  • Domain Knowledge: Based on domain experience, we select feature(s) which may have higher impact on target variable.
  • Visualization
  • Statistical Parameters: We also consider p-values, information values and other statistical metrics to select the right features. PCA helps to represent the training data in a lower-dimensional space while still characterizing the inherent relationships in the data; it is a type of dimensionality reduction technique (feature importance).

4. Multiple algorithms

Hitting at the right machine learning algorithm is the ideal approach to achieve higher accuracy. But, it is easier said than done.

5. Algorithm Tuning

Parameters are properties the algorithm is learning during training. For linear regression, these are the weights and biases; whilst for random forests, these are the variables and thresholds at each node. Hyper-parameters, on the other hand, are properties that must be set before training. For k-means clustering, you must define the value of k; whilst for neural networks, an example is the learning rate.

Vanilla linear regression doesn’t have any hyperparameters. But variants of linear regression do. Ridge regression and lasso both add a regularization term to linear regression; the weight for the regularization term is called the regularization parameter.
Another type of hyperparameter comes from the training process itself. Training a machine learning model often involves optimizing a loss function (the training metric). A number of mathematical optimization techniques may be employed, some of them having parameters of their own. For instance, stochastic gradient descent optimization requires a learning rate or a learning schedule. Some optimization methods require a convergence threshold.

Hyperparameter Tuning Algorithms:

  • Grid Search: Grid search, true to its name, picks out a grid of hyperparameter values, evaluates every one of them, and returns the winner. For example, if the hyperparameter is the number of leaves in a decision tree, then the grid could be 10, 20, 30, …, 100. Grid search is dead simple to set up and trivial to parallelize. It is the most expensive method in terms of total computation time; however, if run in parallel, it is fast in terms of wall clock time. (A sketch of a grid search over the ridge regularization parameter follows this list.)
  • Random Search: Random search is a slight variation on grid search. Instead of searching over the entire grid, random search only evaluates a random sample of points on the grid. This makes random search a lot cheaper than grid search.
  • Bayesian Optimization: Bayesian optimization uses prior knowledge of success with hyper-parameter combinations to choose the next best combination to try.
  • Ensemble Learning: Ensembles combine several machine learning models, each finding different patterns within the data, to provide a more accurate solution. These techniques can both improve performance, as they capture more trends, and reduce overfitting, as the final prediction is a consensus from many models. Ensembling can be done by bagging, boosting or stacking.
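
As a concrete illustration, here is a sketch of a grid search over the ridge regularization strength (alpha) on synthetic data; the grid values and fold count are arbitrary choices.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(11)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.5, -2.0, 0.0, 0.5, 3.0]) + rng.normal(scale=1.0, size=200)

param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0, 100.0]}   # regularization strengths to try
search = GridSearchCV(Ridge(), param_grid, cv=5,
                      scoring="neg_mean_squared_error")
search.fit(X, y)
print(search.best_params_, search.best_score_)
```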

6.Cross Validation

Cross Validation is one of the most important concepts in data modeling. Cross Validation is a technique which involves reserving a particular sample of a dataset on which you do not train the model. Later, you test your model on this sample before finalizing it.

What is Robust Regression ?

Robust regression can be used in any situation in which you would use least squares regression. When fitting a least squares regression, we might find some outliers or high-leverage data points. Suppose we have decided that these data points are not data entry errors, nor are they from a different population than most of our data, so we have no compelling reason to exclude them from the analysis. Robust regression might be a good strategy, since it is a compromise between excluding these points entirely and including all the data points and treating them all equally, as OLS regression does. The idea of robust regression is to weight the observations differently based on how well behaved they are. Roughly speaking, it is a form of weighted and reweighted least squares regression.

-> A regression model fitted with OLS (Ordinary Least Squares) is quite sensitive to outliers. To overcome this problem, we can use the WLS (Weighted Least Squares) method to determine the estimators of the regression coefficients. Here, less weight is given to the outliers or high-leverage points in the fitting, making these points less impactful.

It is mostly useful in :

  • One instance in which robust estimation should be considered is when there is a strong suspicion of heteroscedasticity. In the homoscedastic model, it is assumed that the variance of the error term is constant for all values of x. Heteroscedasticity allows the variance to be dependent on x, which is more accurate for many real scenarios.
  • Another common situation in which robust estimation is used occurs when the data contain outliers. In the presence of outliers that do not come from the same data-generating process as the rest of the data, least squares estimation is inefficient and can be biased. Because the least squares predictions are dragged towards the outliers, and because the variance of the estimates is artificially inflated, the result is that outliers can be masked.

Although it is sometimes claimed that least squares (or classical statistical methods in general) are robust, they are only robust in the sense that the type I error rate does not increase under violations of the model. In fact, the type I error rate tends to be lower than the nominal level when outliers are present, and there is often a dramatic increase in the type II error rate. The reduction of the type I error rate has been labelled as the conservatism of classical methods.
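A sketch of robust regression with statsmodels' RLM (iteratively reweighted least squares with a Huber norm) on synthetic data with a few injected outliers; this is one possible robust estimator, not the only one.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(8)
X = rng.uniform(0, 10, 100)
y = 1.0 + 2.0 * X + rng.normal(0, 1, 100)
y[:5] += 30                                   # inject a few gross outliers

X_design = sm.add_constant(X)
ols_fit = sm.OLS(y, X_design).fit()
rlm_fit = sm.RLM(y, X_design, M=sm.robust.norms.HuberT()).fit()

print("OLS:", ols_fit.params)                 # dragged towards the outliers
print("RLM:", rlm_fit.params)                 # much closer to the true [1.0, 2.0]
```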

Weighted Least Squares / Generalised Least Squares

In both ordinary least squares and maximum likelihood approaches to parameter estimation, we made the assumption of constant variance, that is the variance of an observation is the same regardless of the values of the explanatory variables associated with it, and since the explanatory variables determine the mean value of the observation, what we assume is that the variance of the observation is unrelated to the mean.


Other forms of regression being :

  • Generalized least squares
  • Maximum likelihood estimates
  • Bayesian regression
  • Kernel regression
  • Gaussian regression

Deep dive into Cross Validation

Cross Validation is a technique which involves reserving a particular sample of a dataset on which you do not train the model. Later, you test your model on this sample before finalizing it.

  • Leave one out cross validation (LOOCV)

In this approach, we reserve only one data point from the available dataset, and train the model on the rest of the data. This process iterates for each data point. This also has its own advantages and disadvantages. Let’s look at them:

  • We make use of all data points, hence the bias will be low.
  • We repeat the cross-validation process n times (where n is the number of data points), which results in a higher execution time.
  • This approach leads to higher variation in the estimate of model effectiveness because we test against a single data point. Our estimate is therefore highly influenced by that data point; if it turns out to be an outlier, it can lead to higher variation.

LOOCV leaves one data point out. Similarly, you could leave p training examples out to have a validation set of size p for each iteration. This is called LPOCV (Leave-P-Out Cross-Validation).

  • K-fold cross validation

Below are the steps for it:

  • Randomly split your entire dataset into k "folds".
  • For each k-fold in your dataset, build your model on k - 1 folds of the dataset. Then, test the model to check its effectiveness on the kth fold.
  • Record the error you see on each of the predictions
  • Repeat this until each of the k-folds has served as the test set
  • The average of your k recorded errors is called the cross-validation error and will serve as your performance metric for the model. (A code sketch of k-fold cross-validation follows below.)
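
A minimal sketch of 10-fold cross-validation for linear regression with scikit-learn on synthetic data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(4)
X = rng.uniform(0, 10, size=(100, 1))
y = 3.0 + 1.5 * X[:, 0] + rng.normal(0, 1, 100)

cv = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, cv=cv,
                         scoring="neg_mean_squared_error")
print(-scores.mean())        # the average of the k recorded errors (cross-validation error)
```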

Stratified k-fold cross validation

Stratification is the process of rearranging the data so as to ensure that each fold is a good representative of the whole. For example, in a binary classification problem where each class comprises 50% of the data, it is best to arrange the data such that in every fold each class comprises about half the instances. It is generally a better approach when dealing with both bias and variance, since a randomly selected fold might not adequately represent the minority class, particularly in cases where there is a huge class imbalance.

Cross Validation for time series

Splitting a time-series dataset randomly does not work because the temporal order of your data will be messed up. For a time series forecasting problem, we perform cross-validation in the following manner.

Folds for time series cross-validation are created in a forward-chaining fashion. Suppose we have a time series for the yearly consumer demand of a product during a period of n years; each fold trains on the first few years and validates on the year(s) that follow.
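
A sketch of such forward-chaining folds using scikit-learn's TimeSeriesSplit (the ten-point series below is just a stand-in for yearly demand):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

demand = np.arange(10).reshape(-1, 1)    # stand-in for 10 years of yearly demand
tscv = TimeSeriesSplit(n_splits=4)
for train_idx, test_idx in tscv.split(demand):
    print("train:", train_idx, "-> test:", test_idx)
# Each fold trains only on the past and validates on the period that follows.
```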

Adversarial Validation

When dealing with real datasets, there are often cases where the test and train sets are very different. As a result, the internal cross-validation techniques might give scores that are not even in the ballpark of the test score. In such cases, adversarial validation offers an interesting solution.

The general idea is to check the degree of similarity between the training and test sets in terms of feature distribution. If they do not seem similar, we can suspect they are quite different. This intuition can be quantified by combining the train and test sets, assigning 0/1 labels (0 for train, 1 for test) and evaluating a binary classification task: if a classifier can easily tell the two apart, the distributions differ.
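
A sketch of that idea (assuming scikit-learn; the shifted synthetic test set is deliberately made different from the training set): if the classifier's AUC is close to 0.5, the two sets look alike; an AUC near 1.0 signals a distribution shift.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(6)
X_train = rng.normal(loc=0.0, size=(500, 3))
X_test = rng.normal(loc=1.0, size=(500, 3))        # deliberately shifted distribution

X_all = np.vstack([X_train, X_test])
is_test = np.concatenate([np.zeros(len(X_train), dtype=int),
                          np.ones(len(X_test), dtype=int)])   # 0 = train, 1 = test

auc = cross_val_score(RandomForestClassifier(n_estimators=100, random_state=0),
                      X_all, is_test, cv=5, scoring="roc_auc").mean()
print(auc)
```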

Deep dive into Regularization

This is a form of regression that constrains/regularizes, or shrinks, the coefficient estimates towards zero. In other words, this technique discourages learning a more complex or flexible model, so as to avoid the risk of overfitting.

Two of the commonly used techniques are L1 or Lasso regularization and L2 or Ridge regularization. Both these techniques impose a penalty on the model to achieve dampening of the magnitude as mentioned earlier. In the case of L1, the sum of the absolute values of the weights is imposed as a penalty while in the case of L2, the sum of the squared values of weights is imposed as a penalty. There is a hybrid type of regularization called Elastic Net that is a combination of L1 and L2.

A simple relation for linear regression looks like Y ≈ β0 + β1X1 + β2X2 + … + βpXp, where Y represents the learned relation and the βs represent the coefficient estimates for the different variables or predictors (X).

The fitting procedure involves a loss function, known as residual sum of squares or RSS. The coefficients are chosen, such that they minimize this loss function.

Ridge

  • The RSS is modified by adding the shrinkage quantity λΣβj², so the coefficients are estimated by minimizing RSS + λΣβj².
  • Here, λ is the tuning parameter that decides how much we want to penalize the flexibility of our model.
  • When λ = 0, the penalty term has no effect, and the estimates produced by ridge regression will be equal to least squares.
  • However, as λ→∞, the impact of the shrinkage penalty grows, and the ridge regression coefficient estimates will approach zero.
  • As can be seen, selecting a good value of λ is critical. Cross validation comes in handy for this purpose.
  • The penalty used by this method, the sum of the squared coefficients, is known as the L2 norm.

NOTE: The coefficients that are produced by the standard least squares method are scale equivariant, i.e. if we multiply each input by c then the corresponding coefficients are scaled by a factor of 1/c. Therefore, regardless of how the predictor is scaled, the multiplication of predictor and coefficient (Xjβj) remains the same. However, this is not the case with ridge regression, and therefore we need to standardize the predictors, or bring the predictors to the same scale, before performing ridge regression.

Lasso

Lasso is another variation, in which RSS + λΣ|βj| is minimized. It's clear that this variation differs from ridge regression only in how it penalizes the high coefficients: it uses |βj| (the modulus) instead of the squares of β as its penalty. In statistics, this is known as the L1 norm.

Consider there are 2 parameters in a given problem. Then, according to the above formulation, the ridge regression constraint is expressed by β1² + β2² ≤ s. This implies that the ridge regression coefficients have the smallest RSS (loss function) of all points that lie within the circle given by β1² + β2² ≤ s.

Similarly, for lasso, the equation becomes |β1| + |β2| ≤ s. This implies that the lasso coefficients have the smallest RSS (loss function) of all points that lie within the diamond given by |β1| + |β2| ≤ s.

  • Since ridge regression has a circular constraint with no sharp points, this intersection will not generally occur on an axis, and so the ridge regression coefficient estimates will be exclusively non-zero.
  • However, the lasso constraint has corners at each of the axes, and so the ellipse will often intersect the constraint region at an axis. When this occurs, one of the coefficients will equal zero. In higher dimensions(where parameters are much more than 2), many of the coefficient estimates may equal zero simultaneously.
  • This sheds light on the obvious disadvantage of ridge regression, which is model interpretability. It will shrink the coefficients for least important predictors, very close to zero. But it will never make them exactly zero. In other words, the final model will include all predictors. However, in the case of the lasso, the L1 penalty has the effect of forcing some of the coefficient estimates to be exactly equal to zero when the tuning parameter λ is sufficiently large. Therefore, the lasso method also performs variable selection and is said to yield sparse models.

How to choose value of λ:

Regularization significantly reduces the variance of the model without a substantial increase in its bias. So the tuning parameter λ, used in the regularization techniques described above, controls the impact on bias and variance. As the value of λ rises, it reduces the value of the coefficients and thus reduces the variance. Up to a point, this increase in λ is beneficial, as it is only reducing the variance (hence avoiding overfitting) without losing any important properties in the data. But after a certain value, the model starts losing important properties, giving rise to bias in the model and thus underfitting. Therefore, the value of λ should be carefully selected.
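
A sketch of choosing λ (called alpha in scikit-learn) by cross-validation, with the lasso zeroing out the truly irrelevant coefficients on synthetic sparse data; the alpha grids are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV

rng = np.random.default_rng(9)
X = rng.normal(size=(200, 8))
true_coef = np.array([3.0, 0.0, -2.0, 0.0, 0.0, 1.5, 0.0, 0.0])   # sparse ground truth
y = X @ true_coef + rng.normal(scale=1.0, size=200)

ridge = RidgeCV(alphas=np.logspace(-3, 3, 13), cv=5).fit(X, y)
lasso = LassoCV(alphas=np.logspace(-3, 1, 20), cv=5).fit(X, y)

print("ridge alpha:", ridge.alpha_, "coefs:", np.round(ridge.coef_, 2))
print("lasso alpha:", lasso.alpha_, "coefs:", np.round(lasso.coef_, 2))
# Lasso drives several coefficients exactly to zero; ridge only shrinks them.
```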

Let's finish this article with some interesting questions.

Can we use linear regression for time series analysis?

One can use linear regression for time series analysis, but the results are not promising, so it is generally not advisable to do so. The reasons behind this are:

  • Time series data is mostly used for the prediction of the future, but linear regression seldom gives good results for future prediction as it is not meant for extrapolation.
  • Mostly, time series data have a pattern, such as during peak hours, festive seasons, etc., which would most likely be treated as outliers in the linear regression analysis.

You run your regression on different subsets of your data, and in each subset, the beta value for a certain variable varies wildly. What could be the issue here?

This case implies that the dataset is heterogeneous. So, to overcome this problem, the dataset should be clustered into different subsets, and then separate models should be built for each cluster. Another way to deal with this problem is to use non-parametric models, such as decision trees, which can deal with heterogeneous data quite efficiently.

Usually, the coefficients (betas) represent the level of importance of each variable. If you see a coefficient changing from one model (subset) to another model (another subset), it means the importance of that specific variable is different in each subset.


Your linear regression doesn’t run and communicates that there is an infinite number of best estimates for the regression coefficients. What could be wrong?

This condition arises when there is a perfect correlation (positive or negative) between some variables. In this case, there is no unique value for the coefficients, and hence, the given condition arises.

Is it true that if two variables are correlated, they always have a linear relationship?

No. Correlation does NOT imply linearity. The simplest example: find the correlation coefficient between numbers and their squares. You will get a high correlation coefficient, but the relationship is obviously quadratic.

What is difference between collinearity and correlation ?

Correlation measures the relationship between two variables. When two variables are so highly correlated that they explain each other (to the point that you can predict one variable from the other), we have collinearity (or multicollinearity).

Collinearity is a linear association between two predictors. Multicollinearity is a situation where two or more predictors are highly linearly related. In general, an absolute correlation coefficient of >0.7 among two or more predictors indicates the presence of multicollinearity. ‘Predictors’ is the point of focus here. Correlation between a ‘predictor and response’ is a good indication of better predictability. But, correlation ‘among the predictors’ is a problem to be rectified to be able to come up with a reliable model.

How can I get an R-squared value of 1 (fit 100%)?

An R² = 1 indicates a perfect fit. That is, you've explained all of the variance that there is to explain. You can always get R² = 1 if:

  • you have a number of predicting variables equal to the number of observations, or
  • you've estimated an intercept and a number of predicting variables equal to the number of observations minus 1, or
  • you construct an nth-degree polynomial, where n is the sample size; each degree adds a new kink through one observation.

Either way, 20 parameters perfectly describe 20 data points. Such a model is called just-identified. Although this gives you the highly desirable perfect fit… it is essentially meaningless.

OMITTED VARIABLE BIAS

The omitted variable bias is a common and serious problem in regression analysis. Generally, the problem arises if one does not consider all relevant variables in a regression: leaving out a variable that both affects the outcome and is correlated with an included predictor biases the estimated coefficients.

The Gauss-Markov theorem

states that if your linear regression model satisfies the first six classical assumptions, then ordinary least squares (OLS) regression produces unbiased estimates that have the smallest variance of all possible linear unbiased estimators.

The critical point is that when you satisfy the classical assumptions, you can be confident that you are obtaining the best possible coefficient estimates. The Gauss-Markov theorem does not state that these are just the best possible estimates for the OLS procedure, but the best possible estimates for any linear model estimator.

The Gauss-Markov Theorem: OLS is BLUE! (Best Linear Unbiased Estimator)

That's all for now.

If you liked this article, be sure to show your support by clapping for it. This article is basically a compilation of many articles from Medium, Analytics Vidhya, upGrad material, etc.

You can also follow me on Instagram
