In a previous video, you saw how to compute derivatives and implement gradient descent with respect to just one training example for logistic regression. Now we want to do it for m training examples.

To get started, let's remind ourselves of the definition of the cost function J(w, b), which is what you care about. It is this average: 1 over m times the sum, from i equals 1 to m, of the loss when your algorithm outputs a^(i) on the example y^(i), where a^(i) is the prediction on the i-th training example, which is sigma of z^(i), which is equal to sigma of w transpose x^(i) plus b.

OK, so what we showed on the previous slide is, for any single training example, how to compute the derivatives when you have just one training example. So dw1^(i), dw2^(i) and db^(i), now with the superscript i, denote the corresponding values you get if you do what we did on the previous slide but using just the one training example (x^(i), y^(i)). Sorry, I was missing an i there as well. So now you notice that the overall cost function is a sum, really an average because of the 1 over m term, of the individual losses.

It turns out that the derivative with respect to, say, w1 of the overall cost function is also going to be the average of the derivatives with respect to w1 of the individual loss terms. But previously we have already shown how to compute this term as, say, dw1^(i), which, on the previous slide, we showed how to compute on a single training example.
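
Written out, the averaging relationship just described can be sketched as follows. The symbols J, a^(i), y^(i) and dw1^(i) follow the lecture; writing the per-example loss as a calligraphic L is a typographic assumption made for this sketch.

```latex
J(w, b) = \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}\big(a^{(i)}, y^{(i)}\big),
\qquad
\frac{\partial J(w, b)}{\partial w_1}
  = \frac{1}{m} \sum_{i=1}^{m} \frac{\partial}{\partial w_1} \mathcal{L}\big(a^{(i)}, y^{(i)}\big)
  = \frac{1}{m} \sum_{i=1}^{m} dw_1^{(i)}
```

The same pattern holds for w2 and b: each overall derivative is the average of the corresponding per-example derivatives.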

So what you need to do is really compute these derivatives, as we showed for the previous single training example, and average them, and this will give you the overall gradient that you can use to implement gradient descent.

I know that was a lot of details, but let's take all of this and wrap it up into a concrete algorithm, which is what you should implement to get logistic regression with gradient descent working. So here's what you can do. Let's initialize J equals 0, dw1 equals 0, dw2 equals 0, db equals 0. What we're going to do is use a for loop over the training set, compute the derivatives with respect to each training example, and then add them up.

All right, so here's how we do it. For i equals 1 through m, where m is the number of training examples, we compute z^(i) equals w transpose x^(i) plus b, and the prediction a^(i) is equal to sigma of z^(i). Then let's add to J: J plus-equals y^(i) log a^(i) plus (1 minus y^(i)) log (1 minus a^(i)), with a negative sign in front of the whole thing. Then, as we saw earlier, we have dz^(i) is equal to a^(i) minus y^(i), and dw1 plus-equals x1^(i) times dz^(i), dw2 plus-equals x2^(i) times dz^(i). I'm doing this calculation assuming that you have just two features, so that n is equal to 2; otherwise you do this for dw1, dw2, dw3 and so on. And db plus-equals dz^(i). I guess that's the end of the for loop.

Finally, having done this for all m training examples, you will still need to divide by m because we're computing averages. So dw1 divide-equals m, dw2 divide-equals m, db divide-equals m, in order to compute averages. With all of these calculations, you've just computed the derivative of the cost function J with respect to each of the three parameters w1, w2 and b.

Just a comment on the details of what we're doing: we're using dw1, dw2 and db as accumulators, so that after this computation dw1 is equal to the derivative of your overall cost function with respect to w1, and similarly for dw2 and db. Notice that dw1 and dw2 do not have a superscript i, because we're using them in this code as accumulators to sum over the entire training set. In contrast, dz^(i) here was dz with respect to just one single training example, which is why it has a superscript i, to refer to the one training example that it's computed on.

Having finished all these calculations, to implement one step of gradient descent you implement: w1 gets updated as w1 minus the learning rate times dw1, w2 gets updated as w2 minus the learning rate times dw2, and b gets updated as b minus the learning rate times db, where dw1, dw2 and db are as computed. And finally, J here would also be a correct value for your cost function. So everything on the slide implements just one single step of gradient descent, and you have to repeat everything on this slide multiple times in order to take multiple steps of gradient descent. In case these details seem too complicated, again, don't worry too much about it for now. Hopefully all this will be clearer when you go and implement this in the programming assignment.

But it turns out there are two weaknesses with the calculation as we've implemented it here, which is that to implement logistic regression this way you need to write two for loops. The first for loop is the loop over the m training examples, and the second for loop is a for loop over all the features over here.

So in this example we just had two features, so n is equal to 2 and n_x equals 2. But if you have more features, you end up writing dw1, dw2, and you have similar computations for dw3 and so on down to dwn. So it seems like you need to have a for loop over the features, over all n features.
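
The procedure described above, with both explicit for loops, can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions rather than the course's reference code: the function names `sigmoid` and `gradient_step`, the list-of-lists data layout, and the argument name `learning_rate` are all made up for this example. It also divides J by m, since the cost was defined as an average.

```python
import math


def sigmoid(z):
    """Logistic function sigma(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def gradient_step(X, Y, w, b, learning_rate):
    """One step of gradient descent for logistic regression.

    X: list of m examples, each a list of n feature values (assumed layout).
    Y: list of m labels, each 0 or 1.
    w: list of n weights; b: bias.
    Returns the updated w, the updated b, and the cost J before the update.
    """
    m = len(X)
    n = len(w)

    # Accumulators: no superscript i, they sum over the whole training set.
    J = 0.0
    dw = [0.0] * n
    db = 0.0

    # First for loop: over the m training examples.
    for i in range(m):
        z_i = sum(w[j] * X[i][j] for j in range(n)) + b
        a_i = sigmoid(z_i)
        J += -(Y[i] * math.log(a_i) + (1 - Y[i]) * math.log(1 - a_i))
        dz_i = a_i - Y[i]  # per-example quantity, hence the "_i"

        # Second for loop: over the n features (dw1, dw2, ..., dwn).
        for j in range(n):
            dw[j] += X[i][j] * dz_i
        db += dz_i

    # Divide by m because the cost and its gradient are averages.
    J /= m
    dw = [d / m for d in dw]
    db /= m

    # Gradient descent update for every parameter.
    w_new = [w[j] - learning_rate * dw[j] for j in range(n)]
    b_new = b - learning_rate * db
    return w_new, b_new, J
```

Calling `gradient_step` repeatedly, feeding the returned `w` and `b` back in each time, corresponds to taking multiple steps of gradient descent.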

When you're implementing deep learning algorithms, you find that having explicit for loops in your code makes your algorithm run less efficiently. In the deep learning era, we are moving to bigger and bigger datasets, so being able to implement your algorithms without using explicit for loops is really important and will help you to scale to much bigger datasets.

It turns out that there is a set of techniques called vectorization techniques that allow you to get rid of these explicit for loops in your code. I think in the pre-deep-learning era, that is, before the rise of deep learning, vectorization was a nice-to-have: you could sometimes do it to speed up your code and sometimes not. But in the deep learning era, vectorization, that is, getting rid of for loops like this and like this, has become really important, because we're more and more training on very large datasets, and so you really need your code to be very efficient.

In the next few videos, we'll talk about vectorization and how to implement all this without using even a single for loop. So with this, I hope you have a sense of how to implement logistic regression, or gradient descent for logistic regression. Things will be clearer when you implement the programming exercise. But before actually doing the programming exercise, let's first talk about vectorization, so that you can implement this whole thing, a single iteration of gradient descent, without using any for loops.
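
As a preview only, and not the code developed in the upcoming videos, here is one way a single iteration might look with no explicit for loops, using NumPy. The function name `gradient_step_vectorized`, the choice to store one training example per column of an (n, m) array `X`, and the argument names are assumptions made for this sketch.

```python
import numpy as np


def gradient_step_vectorized(X, Y, w, b, learning_rate):
    """One gradient descent step for logistic regression, no explicit loops.

    X: array of shape (n, m), one training example per column (assumed layout).
    Y: array of shape (m,), labels 0 or 1.
    w: array of shape (n,); b: float.
    Returns the updated w, the updated b, and the cost J before the update.
    """
    m = X.shape[1]
    Z = w @ X + b                  # shape (m,): every z^(i) at once
    A = 1.0 / (1.0 + np.exp(-Z))   # every prediction a^(i)
    J = -np.mean(Y * np.log(A) + (1 - Y) * np.log(1 - A))
    dZ = A - Y                     # every dz^(i)
    dw = X @ dZ / m                # shape (n,): averaged over the m examples
    db = np.sum(dZ) / m
    return w - learning_rate * dw, b - learning_rate * db, J
```

The array operations here replace both the loop over the m examples and the loop over the n features; the quantities computed are the same averages as in the loop-based description.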