Linear Regression and Friends

Monday July 16, 2018 at 10:43 am CDT

Linear Regression…

Linear regression models are used when the relationship between some target value y and some number of features xi is a linear one.


$$\hat{y} = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_n x_n$$

where $\hat{y}$ is the predicted value, the $w_i$ are the model's parameters or weights, and the $x_i$ are the model's $n$ features. If, for example, $\hat{y}$ were the monetary value of a house, the $x_i$ could be the number of bedrooms the house has, the distance of the house from the nearest subway stop, whether there are solar panels installed on the roof, etc.
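
To make the house example concrete, here's a minimal sketch in Python. The weights and feature values are made up purely for illustration.

```python
# Made-up weights, purely for illustration
w0 = 50_000.0   # base price
w1 = 25_000.0   # price added per bedroom
w2 = -3_000.0   # price lost per km from the nearest subway stop
w3 = 10_000.0   # price added if solar panels are installed

# Features for one (hypothetical) house
bedrooms = 3
km_to_subway = 1.5
has_solar_panels = 1  # 1 if installed, 0 otherwise

# y_hat = w0 + w1*x1 + w2*x2 + w3*x3
y_hat = w0 + w1 * bedrooms + w2 * km_to_subway + w3 * has_solar_panels
print(y_hat)  # 130500.0
```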

We can express the above equation using matrix notation like so


$$\hat{y} = \mathbf{w}^T \cdot \mathbf{x}$$

where $\mathbf{w}$ is a column vector of the weights (note the transpose gives us a row vector) and $\mathbf{x}$ is a vector of the model's features, with $x_0 = 1$ so that the bias term $w_0$ is picked up by the product. Taking the dot product of the two gives us the scalar value $\hat{y}$.
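
The same prediction in vector form, sketched with NumPy (same made-up weights as above, with $x_0 = 1$ standing in for the bias term):

```python
import numpy as np

# Weight vector [w0, w1, w2, w3] -- same made-up values as above
w = np.array([50_000.0, 25_000.0, -3_000.0, 10_000.0])

# Feature vector [x0 = 1 (bias), bedrooms, km to subway, solar panels]
x = np.array([1.0, 3.0, 1.5, 1.0])

# y_hat = w^T . x -- for 1-D arrays the dot product gives a scalar
y_hat = w @ x
print(y_hat)  # 130500.0
```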

In order to train the model, we'll need to minimize the difference between its predictions and the target values. There's more than one way to evaluate how far off your model is from the target. One could use the residual sum of squares as a loss function

$$\mathrm{RSS}(\mathbf{w}) = \sum_{i=1}^{m} \left(y_i - \hat{y}_i\right)^2 = \sum_{i=1}^{m} \left(y_i - \mathbf{w}^T \cdot \mathbf{x}_i\right)^2$$

where $m$ is the number of samples in your training set (I assume you are on your best behavior and holding one part of your dataset out to use later for testing :)), $y_i$ are the actual values, $\hat{y}_i$ are the predicted values, and $\mathbf{x}_i$ is the feature vector for the $i$th sample in a training set $X$.

Another option is the root mean square error

$$\mathrm{RMSE}(\mathbf{w}) = \sqrt{\frac{1}{m} \sum_{i=1}^{m} \left(y_i - \hat{y}_i\right)^2}$$

Or the mean squared error

$$\mathrm{MSE}(\mathbf{w}) = \frac{1}{m} \sum_{i=1}^{m} \left(y_i - \hat{y}_i\right)^2$$
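
These are close relatives: the MSE is just the RSS averaged over the $m$ samples, and the RMSE is its square root. A quick sketch with NumPy, using made-up arrays of actual and predicted values:

```python
import numpy as np

def rss(y_true, y_pred):
    """Residual sum of squares."""
    return np.sum((y_true - y_pred) ** 2)

def mse(y_true, y_pred):
    """Mean squared error: the RSS averaged over the samples."""
    return np.mean((y_true - y_pred) ** 2)

def rmse(y_true, y_pred):
    """Root mean squared error: the square root of the MSE."""
    return np.sqrt(mse(y_true, y_pred))

# Made-up house prices (actual vs. predicted)
y_true = np.array([200_000.0, 150_000.0, 320_000.0])
y_pred = np.array([210_000.0, 140_000.0, 300_000.0])
print(rss(y_true, y_pred), mse(y_true, y_pred), rmse(y_true, y_pred))
```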

With the feature vectors $\mathbf{x}_i$ as input, we need to find the parameters $\mathbf{w}$ that minimize the chosen loss function (usually denoted as $J(\mathbf{w})$).

There are two ways to do this - gradient descent, or the normal equation, which gives you the vector of loss-minimizing parameters in closed form:


$$\mathbf{w} = \left(\mathbf{X}^T \cdot \mathbf{X}\right)^{-1} \cdot \mathbf{X}^T \cdot \mathbf{y}$$
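
Here's a minimal sketch of the normal equation with NumPy on randomly generated data. (In practice, np.linalg.lstsq or np.linalg.pinv is more numerically stable than explicitly inverting $\mathbf{X}^T \cdot \mathbf{X}$.)

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up training data: m samples, 2 features, generated from known weights plus noise
m = 100
X = rng.uniform(0, 10, size=(m, 2))
true_w = np.array([4.0, 3.0, -2.0])                    # [w0, w1, w2]
y = true_w[0] + X @ true_w[1:] + rng.normal(0, 0.5, m)

# Prepend a column of ones so the bias w0 is learned along with the other weights
X_b = np.c_[np.ones((m, 1)), X]

# Normal equation: w = (X^T . X)^-1 . X^T . y
w = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y
print(w)  # should land close to [4.0, 3.0, -2.0]
```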

Ok, now that we’ve got that out of the way, let’s talk about regularization.

…and Friends

One of the things we try to avoid when training a model is overfitting to the training set. Overfitting is when the model is able to accurately describe the relationship between features and target values in the training set, but doesn’t generalize well to new data.

That's where regularization comes in. To regularize a model, you constrain the values its parameters can take, typically by adding a term to the loss function that penalizes large weights.

Below are three regularized versions of linear regression: ridge regression, lasso regression, and elastic net regression. The cost functions associated with each are:

ridge regression

$$J(\mathbf{w}) = \mathrm{MSE}(\mathbf{w}) + \lambda \sum_{i=1}^{n} w_i^2$$

lasso regression

$$J(\mathbf{w}) = \mathrm{MSE}(\mathbf{w}) + \lambda \sum_{i=1}^{n} |w_i|$$

elastic net regression

$$J(\mathbf{w}) = \mathrm{MSE}(\mathbf{w}) + r \lambda \sum_{i=1}^{n} |w_i| + \frac{1 - r}{2} \lambda \sum_{i=1}^{n} w_i^2$$

where $n$ is the number of features (the bias $w_0$ is left out of the penalty) and $r$ is the mix ratio between the lasso and ridge penalties.

The regularization hyperparameter $\lambda$ determines the extent to which the model's parameter values are constrained. The larger $\lambda$ is, the smaller the values of the weights. Lasso regression differs from ridge regression in that it tends to drive the weights of relatively unimportant features to 0. Elastic net regression is a compromise between ridge and lasso regression.
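
A sketch of what this looks like in scikit-learn, assuming it's installed. Note that scikit-learn calls the regularization strength alpha rather than $\lambda$, and ElasticNet's l1_ratio plays the role of the mix ratio $r$:

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

rng = np.random.default_rng(0)

# Made-up data: only the first 2 of 5 features actually matter
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 0.1, size=200)

models = {
    "ridge": Ridge(alpha=1.0),
    "lasso": Lasso(alpha=0.1),
    "elastic net": ElasticNet(alpha=0.1, l1_ratio=0.5),  # l1_ratio is the mix ratio
}

for name, model in models.items():
    model.fit(X, y)
    # lasso and elastic net tend to zero out the weights of the 3 irrelevant features
    print(f"{name:12s}", np.round(model.coef_, 3))
```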

Which regularization you use depends on how strongly particular features correlate with the target values. If you think there may be features that are only weakly correlated with the model's output, you may want to use lasso or elastic net regression to, effectively, remove them from the model.


Photo by Chris Pagan on Unsplash