So to control the variance we might regularize, or shrink, the coefficients.

Remember that what we want to minimize is some measure of distance between the outcome we observe and our linear model fit. Here, that distance is the squared difference between the outcome and the linear model prediction, summed over the observations:

sum_i (y_i - beta_0 - sum_j x_ij * beta_j)^2

That's the residual sum of squares.

Then you might also add a penalty term. The penalty basically says: if the beta coefficients get too big, shrink them back down.

So the penalty is usually used to reduce complexity.

It can also be used to reduce variance, and it can respect some of the structure in the problem if you set the penalty up in the right way.

The first approach used in this sort of penalized regression is ridge regression. Again we're penalizing the distance between our outcome y and our regression model, but now we also add a term that is lambda times the sum of the squared beta j's:

sum_i (y_i - beta_0 - sum_j x_ij * beta_j)^2 + lambda * sum_j beta_j^2
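That penalized objective, and its closed-form minimizer, can be sketched in Python with numpy (this is an illustrative sketch, not the course's code; for simplicity it penalizes every coefficient, whereas standard ridge implementations leave the intercept unpenalized):

```python
import numpy as np

def ridge_objective(beta, X, y, lam):
    """Penalized residual sum of squares: ||y - X beta||^2 + lam * ||beta||^2."""
    resid = y - X @ beta
    return resid @ resid + lam * beta @ beta

def ridge_fit(X, y, lam):
    """Minimizer of the objective above: (X'X + lam * I)^{-1} X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
```

With lam = 0 this reduces to ordinary least squares; larger lam pulls the fitted coefficients toward zero.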

So, what does this mean? If the beta j's are really big, then this penalty term gets big, and the whole quantity we're minimizing ends up being very large. Minimizing it therefore requires that some of the beta j's be small.

It's actually equivalent to solving a constrained problem: minimize the sum of squared differences between the outcome and the model fit, subject to the constraint that the sum of the squared beta j's is less than some value s.
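To see that shrinkage concretely, here is a small Python sketch (numpy assumed; the simulated data and the lambda grid are made up for illustration) showing that as lambda grows, the L2 norm of the fitted coefficients shrinks, which is exactly what the constraint sum_j beta_j^2 < s enforces:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge estimate: (X'X + lam * I)^{-1} X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Simulated data, purely illustrative.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
beta_true = np.array([3.0, -2.0, 1.5, 0.0, 0.5])
y = X @ beta_true + rng.normal(size=100)

# The norm of the coefficient vector decreases as lambda increases.
norms = [np.linalg.norm(ridge_fit(X, y, lam)) for lam in (0.0, 1.0, 10.0, 100.0)]
```

Each lambda corresponds to some constraint level s: a bigger lambda is a tighter constraint, hence a smaller coefficient norm.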

So the idea here is that the inclusion of this lambda term can also make the problem non-singular, even when X transpose X is not invertible: X transpose X plus lambda times the identity is invertible for any positive lambda. In other words, in the setting where we have more predictors than observations, the ridge regression model can still be fit.
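A quick Python check of that last point (numpy assumed, with made-up dimensions): when there are more predictors than observations, X'X is singular, but X'X + lambda*I is invertible for any lambda > 0, so the ridge estimate still exists:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 10, 30                       # more predictors than observations
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

XtX = X.T @ X                       # p x p, but rank at most n, so singular

lam = 1.0
# Ordinary least squares would fail here; ridge still has a unique solution.
beta = np.linalg.solve(XtX + lam * np.eye(p), X.T @ y)
```

The lam * np.eye(p) term bumps every eigenvalue of X'X up by lambda, so the matrix being solved is positive definite and the solve always succeeds.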