0:00
In this video, we'll talk about the normal equation, which, for some linear regression problems, will give us a much better way to solve for the optimal value of the parameters theta.
Concretely, so far the algorithm that we've been using for linear regression is gradient descent, where, in order to minimize the cost function J of Theta, we would take this iterative algorithm that takes many steps, multiple iterations of gradient descent, to converge to the global minimum.
In contrast, the normal equation gives us a method to solve for theta analytically, so that rather than needing to run this iterative algorithm, we can instead just solve for the optimal value of theta all in one go, so that in basically one step you get to the optimal value right there.
0:49
It turns out the normal equation has some advantages and some disadvantages, but before we get to that and talk about when you should use it, let's get some intuition about what this method does.
To see how it works, let's imagine, let's take a very simplified cost function J of Theta that's just a function of a real number Theta. So, for now, imagine that Theta is just a scalar value, just a real value. It's just a number, rather than a vector. Imagine that we have a cost function J that's a quadratic function of this real-valued parameter Theta, so J of Theta looks like that.
Well, how do you minimize a quadratic function?
For those of you that know a little bit of calculus,
you may know that the way to
minimize a function is to
take derivatives and to
set derivatives equal to zero.
So, you take the derivative of J with respect to the parameter Theta, you get some formula which I am not going to derive, you set that derivative equal to zero, and this allows you to solve for the value of Theta that minimizes J of Theta.
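In symbols, a minimal sketch of that step (the coefficients a, b, c here are generic placeholders with a > 0, not numbers from the slide):

J(\theta) = a\theta^2 + b\theta + c, \qquad \frac{d}{d\theta}J(\theta) = 2a\theta + b = 0 \;\Longrightarrow\; \theta = -\frac{b}{2a}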
That was the simpler case, where Theta was just a real number.
In the problem that we are interested in, Theta is
2:04
no longer just a real number but, instead, is this (n+1)-dimensional parameter vector, and the cost function J is a function of this vector value, Theta 0 through Theta n. And the cost function looks like this, some squared cost function on the right.
How do we minimize this cost function J?
Calculus actually tells us that one way to do so is to take the partial derivative of J with respect to every parameter Theta j in turn, and then to set all of these to 0.
If you do that, and you solve for the values of Theta 0, Theta 1, up to Theta n, then this would give you the values of Theta that minimize the cost function J. If you actually work through the calculus and work through the solution for the parameters Theta 0 through Theta n, the derivation ends up being somewhat involved. And what I am going to do in this video is actually not go through the derivation, which is kind of long and kind of involved; what I want to do is just tell you what you need to know in order to implement this process, so you can solve for the values of the Thetas that correspond to where the partial derivatives are equal to zero. Or, alternatively and equivalently, the values of Theta that minimize the cost function J of Theta.
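Written out as a sketch, assuming the usual squared-error cost from the earlier videos, the conditions being described are:

\frac{\partial}{\partial \theta_j} J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)} = 0, \qquad j = 0, 1, \ldots, n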
I realize that some of the comments I made will make more sense to those of you who are familiar with calculus. But if you're less familiar with calculus, don't worry about it.
I'm just going to tell you what
you need to know in order to
implement this algorithm and get it to work.
For the running example that I want to use, let's say that I have m = 4 training examples.
3:50
In order to implement this normal equation method, what I'm going to do is the following. I'm going to take my data set, so here are my four training examples. In this case, let's assume that, you know, these four examples are all the data I have. What I am going to do is take my data set and add an extra column that corresponds to my extra feature, x0, which always takes on the value of 1.
What I'm then going to do is construct a matrix called X that basically contains all of the features from my training data. So here are all my features, and we're going to take all those numbers and put them into this matrix X, okay? So just, you know, copy the data over one column at a time, and then I am going to do something similar for the y's. I am going to take the values that I'm trying to predict and construct a vector, like so, and call that the vector y.
So X is going to be an
4:59
m by (n+1)-dimensional matrix, and y is going to be an m-dimensional vector, where m is the number of training examples and n is the number of features; it's n+1 because of this extra feature x0 that I added.
Finally, if you take your matrix X and your vector y, and you just compute this, setting theta to be equal to X transpose X inverse times X transpose y, this would give you the value of theta that minimizes your cost function.
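As a small sketch in Octave of what that computation looks like end to end (the numbers below are hypothetical placeholders, not necessarily the values on the slide):

% hypothetical m = 4 training examples with one feature (house size),
% plus the extra x0 = 1 column
X = [1 2104;
     1 1416;
     1 1534;
     1  852];              % m by (n+1) design matrix
y = [460; 232; 315; 178];  % m-dimensional vector of labels (prices)

theta = pinv(X' * X) * X' * y   % normal equation: (X'X)^-1 * X' * y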
There was a lot that happened on that slide, and I worked through it using one specific example of one dataset. Let me just write this out in a slightly more general form, and then later on in this video let me explain this equation a little bit more.
6:16
The way I'm going to construct the matrix X, which is also called the design matrix, is as follows. Each training example gives me a feature vector like this, say, an (n+1)-dimensional vector. The way I am going to construct my design matrix X is to take the first training example, so that's a vector, take its transpose so it ends up being this, you know, long flat thing, and make x1 transpose the first row of my design matrix. Then I am going to take my second training example, x2, take the transpose of that, and put that as the second row of X, and so on, down to my last training example. Take the transpose of that, and that's the last row of my matrix X. And so that makes my matrix X an m by (n+1)-dimensional matrix.
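In symbols, the construction just described is:

X = \begin{bmatrix} (x^{(1)})^T \\ (x^{(2)})^T \\ \vdots \\ (x^{(m)})^T \end{bmatrix} \in \mathbb{R}^{m \times (n+1)}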
As a concrete example, let's say I have only one feature, really only one feature other than x0, which is always equal to 1. So if my feature vectors x^(i) are equal to this 1, which is x0, and then some real feature, like maybe the size of the house, then my design matrix X would be equal to this. For the first row, I'm going to basically take this and take its transpose, so I end up with 1 and then x_1^(1). For the second row, we end up with 1 and then x_1^(2), and so on, down to 1 and then x_1^(m). And thus, this will be an m by 2-dimensional matrix.
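So, for this single-feature case, the design matrix looks like:

X = \begin{bmatrix} 1 & x_1^{(1)} \\ 1 & x_1^{(2)} \\ \vdots & \vdots \\ 1 & x_1^{(m)} \end{bmatrix} \in \mathbb{R}^{m \times 2}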
So, that's how to construct the matrix X. And the vector y; sometimes I might write an arrow on top to denote that it is a vector, but very often I'll just write this as y, either way. The vector y is obtained by taking all the labels, all the correct prices of houses in my training set, and just stacking them up into an m-dimensional vector, and that's y. Finally, having constructed the matrix X and the vector y, we then just compute theta as X transpose X inverse times X transpose y.
I just want to make sure that this equation makes sense to you and that you know how to implement it. So, you know, concretely, what is this X transpose X inverse? Well, it is the inverse of the matrix X transpose X. Concretely, if you were to set A equal to X transpose times X, so X transpose is a matrix and X transpose times X gives you another matrix, and we call that matrix A, then X transpose X inverse is just: you take this matrix A and you invert it, right?
9:26
And so that's how you compute this thing: you compute X transpose X and then you compute its inverse. We haven't yet talked about Octave; we'll do so in a later set of videos. But in the Octave programming language, and also in the MATLAB programming language, which is very similar, the command to compute this quantity, X transpose X inverse times X transpose y, is as follows. In Octave, X prime is the notation that you use to denote X transpose. And so, this expression that's boxed in red is computing X transpose times X.
pinv is a function for computing the inverse of a matrix, so this computes X transpose X inverse, and then you multiply that by X transpose, and you multiply that by y. So you end up computing that formula which I didn't prove, but it is possible to show mathematically, even though I'm not going to do so here, that this formula gives you the optimal value of theta, in the sense that if you set theta equal to this, that's the value of theta that minimizes the cost function J of theta for linear regression.
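Putting that together, the Octave expression being described is just one line (a sketch; X and y here stand for whatever you named your design matrix and label vector):

theta = pinv(X' * X) * X' * y;   % X' is X transpose; pinv(...) computes the (pseudo)inverse of X'X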
One last detail: in an earlier video, I talked about feature scaling and the idea of getting features to be on similar ranges of values of each other. If you are using this normal equation method, then feature scaling isn't actually necessary. It is actually okay if, say, some feature x1 is between zero and one, some feature x2 ranges from zero to one thousand, and some feature x3 ranges from zero to ten to the minus five. If you are using the normal equation method, this is okay and there is no need to do feature scaling, although of course if you are using gradient descent, then feature scaling is still important.
Finally, when should you use gradient descent and when should you use the normal equation method? Here are some of their advantages and disadvantages.
Let's say you have m training
examples and n features.
One disadvantage of gradient descent is that you need to choose the learning rate alpha. And, often, this means running it a few times with different learning rates alpha and then seeing what works best. And so that is sort of extra work and extra hassle.
Another disadvantage with gradient descent
is it needs many more iterations.
So, depending on the details,
that could make it slower, although
there's more to the story as we'll see in a second.
As for the normal equation, you don't need to choose any learning rate alpha.
So that, you know, makes it really convenient, makes it simple to implement.
You just run it and it usually just works.
And you don't need to iterate, so you don't need to plot J of Theta or check for convergence or take all those extra steps. So far, the balance seems to favor the normal equation.
12:27
Here are some disadvantages of the normal equation, and some advantages of gradient descent.
Gradient descent works pretty well,
even when you have a very large number of features.
So, even if you
have millions of features you
can run gradient descent and it will be reasonably efficient.
It will do something reasonable.
In contrast, for the normal equation, in order to solve for the parameters theta, we need to compute this term, X transpose X inverse. This matrix X transpose X is an n by n matrix, if you have n features.
13:00
Because, if you look at the dimension of X transpose and the dimension of X, and figure out what the dimension of the product is, the matrix X transpose X is an n by n matrix, where n is the number of features. And for most computed implementations, the cost of inverting a matrix grows roughly as the cube of the dimension of the matrix. So, computing this inverse costs roughly order n cubed time. Sometimes it's slightly faster than n cubed, but it's, you know, close enough for our purposes.
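As a quick dimension check of that claim:

X^T \in \mathbb{R}^{(n+1)\times m},\; X \in \mathbb{R}^{m\times(n+1)} \;\Longrightarrow\; X^T X \in \mathbb{R}^{(n+1)\times(n+1)}, \quad \text{inverting it costs roughly } O(n^3)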
So if n, the number of features, is very large,
13:37
then computing this quantity can be slow and the normal equation method can actually be much slower. So if n is large, then I would usually use gradient descent, because we don't want to pay this order n cubed cost. But if n is relatively small, then the normal equation might give you a better way to solve for the parameters.
What does small and large mean?
Well, if n is on
the order of a hundred, then
inverting a hundred-by-hundred matrix is
no problem by modern computing standards.
If n is a thousand, I would still use the normal equation method.
Inverting a thousand-by-thousand matrix is
actually really fast on a modern computer.
If n is ten thousand, then I might start to wonder.
Inverting a ten-thousand-by-ten-thousand matrix starts to get kind of slow, and I might then start to lean in the direction of gradient descent, but maybe not quite. With n equal to ten thousand, you can sort of still invert a ten-thousand-by-ten-thousand matrix, but if it gets much bigger than that, then I would probably use gradient descent.
So, if n equals ten
to the sixth with a million
features, then inverting a
million-by-million matrix is going
to be very expensive, and
I would definitely favor gradient descent if you have that many features.
Exactly how large the set of features has to be before you switch to gradient descent is hard to pin down with a strict number. But, for me, it is usually around ten thousand that I might start to consider switching over to gradient descent, or maybe some other algorithms that we'll talk about later in this class.
To summarize, so long
as the number of features is
not too large, the normal equation
gives us a great alternative method to solve for the parameter theta.
Concretely, so long as the number of features is less than a thousand, you know, I would usually use the normal equation method rather than gradient descent.
To preview some ideas that we'll talk about later in this course, as we get to more complex learning algorithms, for example, when we talk about classification algorithms, like the logistic regression algorithm,
15:32
we'll see that, for those more sophisticated learning algorithms, the normal equation method actually does not work, and we will have to resort to gradient descent for those algorithms.
So, gradient descent is a very useful algorithm to know, both for linear regression with a large number of features and for some of the other algorithms that we'll see in this course, because, for them, the normal equation method just doesn't apply or doesn't work. But for this specific model of linear regression, the normal equation can give you an alternative that can be much faster than gradient descent.