outputs y-hat like so.

If you have n_0 input features,

n_1 hidden units,

and n_2 output units, in our case just one output unit,

then the matrix W_2 is n_2 by n_1 dimensional,

and z_2, and therefore dz_2,

are going to be n_2 by one-dimensional.

That's really going to be one-by-one

when we're doing binary classification.

Z_1, and therefore also,

dz_1 are going to be n_1 by one-dimensional.

Note that for any variable foo,

foo and dfoo always have the same dimensions.

So that's why w and dw always have the same dimension.

Similarly for b, and db,

and z, and dz, and so on.

So to make sure that the dimensions of this all match up,

we have that dz_1 is equal to W_2 transpose times dz_2,

and then an element-wise product

with g_1 prime of z_1.

So matching the dimensions from above,

this is going to be n_1 by one.

W_2 transpose is the transpose of an n_2 by n_1 matrix,

so it's going to be n_1 by n_2 dimensional,

and dz_2 is going to be n_2 by one-dimensional.

Then g_1 prime of z_1 has the same dimensions as z_1,

so this is also n_1 by one-dimensional,

hence the element-wise product.

So the dimensions do make sense.

An n_1 by one-dimensional vector can be obtained

as an n_1 by n_2 dimensional matrix times an n_2 by one vector,

because the product of those two things gives

you an n_1 by one-dimensional vector.

This then becomes the element-wise product of

two n_1 by one-dimensional vectors,

and so the dimensions do match up.

One tip when implementing

backprop: think through what the dimensions of

your various matrices are, including W_1, W_2, z_1,

z_2, a_1, a_2, and so on, and just make sure

that the dimensions of

these matrix operations match up.

Doing that alone will often

eliminate quite a lot of bugs in backprop.
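To make this concrete, here is a minimal numpy sketch of that dimension check (the specific layer sizes and the tanh hidden activation g_1 are illustrative assumptions, not from the slide):

```python
import numpy as np

# Illustrative layer sizes (assumptions for this sketch).
n0, n1, n2 = 3, 4, 1

W2 = np.random.randn(n2, n1)    # (n_2, n_1)
dz2 = np.random.randn(n2, 1)    # (n_2, 1)
z1 = np.random.randn(n1, 1)     # (n_1, 1)

# Assuming g_1 = tanh, so g_1'(z_1) = 1 - tanh(z_1)^2.
g1_prime_of_z1 = 1 - np.tanh(z1) ** 2        # (n_1, 1)

# dz_1 = W_2^T dz_2, element-wise product with g_1'(z_1).
dz1 = np.dot(W2.T, dz2) * g1_prime_of_z1     # (n_1, n_2) @ (n_2, 1) -> (n_1, 1)

assert dz1.shape == (n1, 1)     # same dimensions as z_1, as expected
```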

All right. So this gives us dz_1.

Then finally, just to wrap up dw_1 and db_1,

we should write them here I guess.

But since I'm running out of space,

I'll write them on the right of the slide.

dw_1 and db_1 are given by the following formulas.

dw_1 is going to be equal to dz_1 times x transpose,

and db_1 is going to be equal to dz_1,

and you might notice a similarity between

these equations and the earlier equations for dw_2 and db_2,

which is really no coincidence because

x plays the role of a_0.

So x transpose is a_0 transpose.

So those equations are actually very similar. All right.

So that gives a sense for

how back-propagation is derived.

We have six key equations here for dz_2,

dw_2, db_2, dz_1, dw_1, and db_1.

So let me just take these six equations and copy them

over to the next slide. Here they are.
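If it helps to see those six equations in code, here is a minimal single-example sketch, assuming a sigmoid output unit with the logistic loss (so that dz_2 = a_2 - y) and a tanh hidden activation; the function and variable names are just illustrative:

```python
import numpy as np

def backprop_one_example(x, y, W2, z1, a1, a2):
    """The six backprop equations for a single training example.
    Assumes sigmoid output + logistic loss and g_1 = tanh."""
    dz2 = a2 - y                                       # (n_2, 1)
    dW2 = np.dot(dz2, a1.T)                            # (n_2, n_1)
    db2 = dz2                                          # (n_2, 1)
    dz1 = np.dot(W2.T, dz2) * (1 - np.tanh(z1) ** 2)   # (n_1, 1), element-wise product
    dW1 = np.dot(dz1, x.T)                             # (n_1, n_0); x plays the role of a_0
    db1 = dz1                                          # (n_1, 1)
    return dW1, db1, dW2, db2
```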

So far, we have derived back-propagation

for training on a single training example at a time.

But it should come as no surprise that

rather than working on a single example at a time,

we would like to vectorize

across different training examples.

So you remember that for forward propagation,

when we're operating on one example at a time,

we had equations like z_1 equals W_1 x plus b_1,

as well as, let's say, a_1 equals g_1 of z_1.

In order to vectorize,

we took the z_1's for the individual training examples,

z_1^(1) through z_1^(m),

and stacked them up in columns, calling this capital Z_1.

Then we found that by stacking things up in columns

and defining the capital, uppercase versions of these,

we then just had Z_1 equals

W_1 X plus b_1, and A_1 equals

g_1 of Z_1.

We define the notation very carefully

in this course to make sure that

stacking examples into different columns

of a matrix makes all this work out.
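As a reminder, here's what that vectorized forward pass looks like in numpy (a sketch; the tanh hidden activation and sigmoid output are assumptions for concreteness), with the m training examples stacked as the columns of X:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def forward_prop(X, W1, b1, W2, b2):
    """Vectorized forward propagation: each column of X is one training example."""
    Z1 = np.dot(W1, X) + b1   # (n_1, m); b1 is (n_1, 1) and broadcasts across columns
    A1 = np.tanh(Z1)          # (n_1, m), assuming g_1 = tanh
    Z2 = np.dot(W2, A1) + b2  # (n_2, m)
    A2 = sigmoid(Z2)          # (n_2, m), sigmoid output for binary classification
    return Z1, A1, Z2, A2
```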

So it turns out that

if you go through the math carefully,

the same trick also works for back-propagation.

So the vectorized equations are as follows: First,

if you take the

dz_2's for the different training examples and stack

them as the different columns

of a matrix, and do the same for the a_2's and the y's,

then the vectorized implementation is dZ_2 equals A_2 minus Y.

Then here's how you can compute dW_2.

There is this extra one over m

because the cost function J is

one over m times the sum

from i equals one through m of the losses.

So when computing derivatives,

we have that extra one over m term just as we did when we

were computing the weight

updates for logistic regression.

Then that's the update you get for db_2:

again, a sum over the dz's, and then we have a one over

m. Then dZ_1 is computed as follows.

Once again, this is an element-wise product,

only whereas previously,

we saw on the previous slide that this was

an n_1 by one-dimensional vector,

now this is an n_1 by m dimensional matrix.

Both of these are also n_1 by m dimensional.

So that's why that asterisk is

an element-wise product. All right.

Then finally, here are the remaining two updates;

perhaps they shouldn't look too surprising.
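Putting the vectorized equations together in code, here is a sketch under the same assumptions as before (sigmoid output with the logistic loss, tanh hidden activation):

```python
import numpy as np

def backprop_vectorized(X, Y, W2, Z1, A1, A2):
    """Vectorized backprop over m training examples (the columns of X)."""
    m = X.shape[1]
    dZ2 = A2 - Y                                       # (n_2, m)
    dW2 = np.dot(dZ2, A1.T) / m                        # (n_2, n_1); note the 1/m
    db2 = np.sum(dZ2, axis=1, keepdims=True) / m       # (n_2, 1), sum across examples
    dZ1 = np.dot(W2.T, dZ2) * (1 - np.tanh(Z1) ** 2)   # (n_1, m), element-wise product
    dW1 = np.dot(dZ1, X.T) / m                         # (n_1, n_0)
    db1 = np.sum(dZ1, axis=1, keepdims=True) / m       # (n_1, 1)
    return dW1, db1, dW2, db2
```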

So I hope that gives you some intuition for

how the back-propagation algorithm is derived.

In all of machine learning,

I think the derivation of the back-propagation algorithm

is actually one of the most complicated pieces

of math I've seen.

It requires knowing both linear algebra as well as

derivatives of matrices to really

derive it from scratch, from first principles.

If you are an expert in matrix calculus,

using this process, you

might want to derive the algorithm yourself.

But I think that there are actually

plenty of deep learning practitioners

that have seen the derivation

at about the level you've seen in this video,

and already have all the right intuitions

and are able to implement this algorithm very effectively.

So if you are an expert in calculus,

do see if you can derive the whole thing from scratch.

It is one of the very hardest pieces of math,

one of the very hardest derivations,

that I've seen in all of machine learning.

But either way, if you implement this,

this will work and I think you have

enough intuition to tune it and get it to work.

So there's just one last detail I will

share with you before you implement your neural network,

which is how to

initialize the weights of your neural network.

It turns out that initializing your parameters

not to zero but randomly

is very important for training your neural network.
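For example, a minimal numpy sketch (the layer sizes and the 0.01 scale factor are illustrative assumptions):

```python
import numpy as np

n0, n1, n2 = 3, 4, 1   # illustrative layer sizes

# Small random weights break the symmetry between hidden units;
# the biases can safely be initialized to zero.
W1 = np.random.randn(n1, n0) * 0.01   # (n_1, n_0)
b1 = np.zeros((n1, 1))
W2 = np.random.randn(n2, n1) * 0.01   # (n_2, n_1)
b2 = np.zeros((n2, 1))
```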

In the next video, you'll see why.