
In this video, you'll learn about some of the classic neural network architectures, starting with LeNet-5, and then AlexNet, and then VGGNet. Let's take a look.

Here is the LeNet-5 architecture. You start off with an image which is, say, 32 by 32 by 1, and the goal of LeNet-5 was to recognize handwritten digits, so maybe an image of a digit like that. LeNet-5 was trained on grayscale images, which is why it's 32 by 32 by 1.

This neural network architecture is actually quite similar to the last example you saw last week. In the first step, you use a set of six 5 by 5 filters with a stride of one, and because you use six filters, you end up with 28 by 28 by 6 over there. With a stride of one and no padding, the image dimensions reduce from 32 by 32 down to 28 by 28.

Then the LeNet neural network applies pooling. Back then, when this paper was written, people used average pooling much more; if you're building a modern variant, you'd probably use max pooling instead. But in this example, you average pool, and with a filter width of two and a stride of two, you wind up reducing the dimensions, the height and width, by a factor of two. So we now end up with a 14 by 14 by 6 volume.
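The dimension arithmetic above follows the usual convolution output-size formula; here is a quick sketch to check it (the helper name is my own, not from the lecture):

```python
def conv_out(n, f, stride=1, pad=0):
    """Output height/width of a conv or pool layer: floor((n + 2*pad - f) / stride) + 1."""
    return (n + 2 * pad - f) // stride + 1

# First conv layer: 32x32 input, 5x5 filters, stride 1, no padding ("valid")
assert conv_out(32, 5) == 28

# Average pooling with a 2x2 filter and stride 2 halves height and width
assert conv_out(28, 2, stride=2) == 14
```

The same formula applies to every conv and pooling layer in the network.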

I guess the height and width of these volumes aren't entirely drawn to scale; technically, if I were drawing them to scale, the height and width would shrink by a factor of two at each pooling step.

Next, you apply another convolutional layer. This time you use a set of 16 filters that are 5 by 5, so you end up with 16 channels in the next volume. Back when this paper was written in 1998, people didn't really use padding; you were always using valid convolutions, which is why every time you apply a convolutional layer, the height and width shrink. That's why here you go from 14 by 14 down to 10 by 10. Then another pooling layer reduces the height and width by a factor of two, so you end up with 5 by 5 over here. And if you multiply all these numbers, 5 by 5 by 16, this multiplies out to 400; that's 25 times 16, which is 400.

And the next layer is then a fully connected layer that fully connects each of these 400 nodes with every one of 120 neurons. So there's a fully connected layer, and sometimes I would draw out here explicitly a layer with 400 nodes, but I'm skipping that. There's a fully connected layer, and then another fully connected layer. And then the final step takes these, essentially, 84 features and uses them for one final output. I guess you could draw one more node here to make a prediction for y-hat, and y-hat took on 10 possible values, corresponding to recognizing each of the digits from zero to nine. A modern version of this neural network would use a softmax layer with a ten-way classification output, although back then, LeNet-5 actually used a different classifier at the output layer, one that's used less today.
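For reference, the ten-way softmax that a modern version would use can be sketched in a few lines (this is the standard formula, not code from the original paper):

```python
import math

def softmax(logits):
    """Map raw scores to probabilities (shifted by the max for numerical stability)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Ten scores, one per digit class 0-9; the largest score gets the highest probability
probs = softmax([2.0, 1.0, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
assert abs(sum(probs) - 1.0) < 1e-9
assert probs.index(max(probs)) == 0
```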

So, this neural network was small by modern standards: it had about 60,000 parameters.
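That 60,000 figure can be roughly reproduced by adding up the layers, assuming the modern variant where every filter looks at every input channel (the original paper's sparse channel wiring, discussed below, gives a slightly different count):

```python
def conv_params(f, nc_prev, nc):
    """Weights plus one bias per filter: (f*f*nc_prev + 1) * nc."""
    return (f * f * nc_prev + 1) * nc

def fc_params(n_in, n_out):
    """Fully connected layer: one weight per input-output pair, plus biases."""
    return n_in * n_out + n_out

total = (conv_params(5, 1, 6)     # 32x32x1 -> 28x28x6
         + conv_params(5, 6, 16)  # 14x14x6 -> 10x10x16
         + fc_params(400, 120)    # 5x5x16 = 400 -> 120
         + fc_params(120, 84)
         + fc_params(84, 10))
assert 60_000 < total < 65_000    # roughly 60k parameters
```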

Today, you often see neural networks with anywhere from 10 million to 100 million parameters, and it's not unusual to see networks that are literally about a thousand times bigger than this one. But one thing you do see is that as you go deeper in the network, as you go from left to right, the height and width tend to go down: you went from 32 by 32, to 28, to 14, to 10, to 5. Whereas the number of channels tends to increase: it goes from 1 to 6 to 16 as you go deeper into the layers of the network.

One other pattern you see in this neural network, one that's still often repeated today, is that you might have one or more conv layers followed by a pooling layer, then one or more conv layers followed by another pooling layer, then some fully connected layers, and then the output. This type of arrangement of layers is quite common.

Now finally, this part is maybe only for those of you who want to try reading the paper. There are a couple of other things that were different. For the rest of this slide, I'm going to make a few more advanced comments, only for those of you who want to try to read this classic paper. Everything I'm about to write here you can safely skip; it's maybe an interesting historical footnote, and it's okay if you don't follow it fully.

It turns out that if you read the original paper, back then people used sigmoid and tanh non-linearities; people weren't using ReLU non-linearities back then. So if you look at the paper, you'll see sigmoid and tanh referred to. There are also some funny aspects of how this network was wired that look odd by modern standards.

For example, you've seen how, if you have an nH by nW by nC volume with nC channels, then you use an f by f by nC dimensional filter, where every filter looks at every one of those channels. But back then, computers were much slower, and so, to save on computation as well as on parameters, the original LeNet-5 had a rather complicated scheme where different filters looked at different channels of the input block. The paper talks about those details, but a more modern implementation wouldn't have that type of complexity these days.

And one last thing that was done back then but isn't really done now is that the original LeNet-5 had a non-linearity after pooling; I think it actually used a sigmoid non-linearity after the pooling layer. So, if you do read this paper (and this is one of the harder ones to read of the ones we'll go over in the next few videos; the next one might be an easier one to start with), most of the ideas on this slide are described in sections two and three of the paper, and later sections talk about some other ideas, such as something called the graph transformer network, which isn't widely used today. So if you do try to read this paper, I recommend focusing really on section two, which talks about this architecture, and maybe taking a quick look at section three, which has a bunch of experimental results that are pretty interesting.

The second example of a neural network I want to show you is AlexNet, named after Alex Krizhevsky, who was the first author of the paper describing this work. The other authors were Ilya Sutskever and Geoffrey Hinton. AlexNet starts with inputs of 227 by 227 by 3 images. If you read the paper, it refers to 224 by 224 by 3 images, but if you look at the numbers, I think they make sense only if the input is actually 227 by 227.

The first layer applies a set of 96 11 by 11 filters with a stride of four, and because it uses a large stride of four, the dimensions shrink to 55 by 55, going down roughly by a factor of four because of that large stride. Then it applies max pooling with a three by three filter, so f equals three and a stride of two, which reduces the volume to 27 by 27 by 96. Then it performs a 5 by 5 same convolution, with same padding, so you end up with 27 by 27 by 256. Max pooling again then reduces the height and width to 13. Then another same convolution, so same padding, and it's 13 by 13, by now, 384 filters. Then a 3 by 3 same convolution again gives you that, then another 3 by 3 same convolution gives you that, then max pooling brings it down to 6 by 6 by 256. If you multiply all these numbers, 6 times 6 times 256, that's 9,216, so we're going to unroll this into 9,216 nodes.

And then finally, it has a few fully connected layers, and then uses a softmax to output which one of 1,000 classes the object could be.
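The same output-size formula from the LeNet-5 discussion traces AlexNet's shrinking dimensions; same convolutions leave the size unchanged, so only the strided layers appear in this sketch:

```python
def conv_out(n, f, stride=1, pad=0):
    """Output height/width of a conv or pool layer: floor((n + 2*pad - f) / stride) + 1."""
    return (n + 2 * pad - f) // stride + 1

n = conv_out(227, 11, stride=4)   # 11x11 conv, stride 4: 227 -> 55
assert n == 55
n = conv_out(n, 3, stride=2)      # 3x3 max pool, stride 2: 55 -> 27
n = conv_out(n, 3, stride=2)      # same 5x5 conv keeps 27; pool: 27 -> 13
n = conv_out(n, 3, stride=2)      # same 3x3 convs keep 13; pool: 13 -> 6
assert n == 6 and n * n * 256 == 9216   # the 9,216 nodes that get unrolled
```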

So, this neural network actually had a lot of similarities to LeNet, but it was much bigger. Whereas LeNet-5 from the previous slide had about 60,000 parameters, AlexNet had about 60 million parameters. The fact that they could take pretty similar basic building blocks, with a lot more hidden units, and train on a lot more data (they trained on the ImageNet dataset) allowed it to have just remarkable performance. Another aspect of this architecture that made it much better than LeNet was using the ReLU activation function.

And then again, just if you read the paper, here are some more advanced details that you don't really need to worry about otherwise. One is that when this paper was written, GPUs were still a little bit slower, so it had a complicated way of training on two GPUs. The basic idea was that a lot of these layers were actually split across two different GPUs, and there was a thoughtful way for the two GPUs to communicate with each other. The paper, that is the original AlexNet architecture, also had another type of layer called local response normalization. This type of layer isn't really used much anymore, which is why I didn't talk about it.

But the basic idea of local response normalization is this: if you look at one of these blocks, one of these volumes we have on top, say for the sake of argument this one, 13 by 13 by 256, what local response normalization (LRN) does is take one position in the height and width, look down across all the channels, that is, all 256 numbers at that position, and normalize them. The motivation for this local response normalization was that for each position in this 13 by 13 image, maybe you don't want too many neurons with a very high activation.
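A simplified sketch of that normalization is below. Note the paper actually normalizes over a window of nearby channels rather than all of them, and the constants here are illustrative, not the paper's exact values:

```python
def lrn_position(acts, k=2.0, alpha=1e-4, beta=0.75):
    """Scale the activations at one (height, width) position by the total
    squared activity across channels: a / (k + alpha * sum(a^2)) ** beta."""
    denom = (k + alpha * sum(a * a for a in acts)) ** beta
    return [a / denom for a in acts]

acts = [1.0] * 256                    # the 256 channel values at one position
out = lrn_position(acts)
assert all(o < a for o, a in zip(out, acts))   # joint high activity gets damped
```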

But subsequently, many researchers have found that this doesn't help that much. This is just one of those ideas I'm drawing in red because it's less important for you to understand. And in practice, I don't really use local response normalization in the networks I would train today.

If you're interested in the history of deep learning: even before AlexNet, deep learning was starting to gain traction in speech recognition and a few other areas, but it was really this paper that convinced a lot of the computer vision community to take a serious look at deep learning, and to convince them that deep learning really works in computer vision. It then grew to have a huge impact, not just in computer vision but beyond it as well. And if you want to try reading some of these papers yourself (you really don't have to for this course), this one is one of the easier ones to read, so it might be a good one to take a look at.

So, whereas AlexNet has a relatively complicated architecture, with a lot of hyperparameters, all these numbers that Alex Krizhevsky and his co-authors had to come up with, let me show you a third and final example in this video, called the VGG or VGG-16 network. A remarkable thing about VGG-16 is that the authors said: instead of having so many hyperparameters, let's use a much simpler network where you focus on just having conv layers that are just 3 by 3 filters with a stride of one, always using same padding, and make all your max pooling layers 2 by 2 with a stride of two. One very nice thing about the VGG network is that it really simplified these neural network architectures.

So let's go through the architecture. You start off with an image, and then the first two layers are convolutions with these 3 by 3 filters. In the first two layers, you use 64 filters, and you end up with 224 by 224, because you're using same convolutions, with 64 channels. Because VGG-16 is a relatively deep network, I'm not going to draw all the volumes here. What this little picture denotes is what we would previously have drawn as 224 by 224 by 3, then a convolution that results in, I guess, 224 by 224 by 64, which could be drawn as a deeper volume, and then another layer that results in 224 by 224 by 64. So this "conv 64, times 2" represents doing two conv layers with 64 filters. And as I mentioned earlier, the filters are always 3 by 3 with a stride of one, and they're always same convolutions. So rather than drawing all these volumes, I'm just going to use text to represent this network.

Next it uses a pooling layer, which reduces the height and width of whatever is above it. It goes from 224 by 224 down to, what? Right, 112 by 112 by 64. Then it has a couple more conv layers: this means it has 128 filters, and because these are same convolutions, let's see what the new dimension is. It will be 112 by 112 by 128. Then a pooling layer, so you can figure out the new dimension; it'll be that. And now, three conv layers with 256 filters, then a pooling layer, then a few more conv layers, a pooling layer, more conv layers, a pooling layer. And then it takes this final 7 by 7 by 512 volume and fully connects it to a layer with 4,096 units, and then a softmax output over one of the 1,000 classes.

By the way, the 16 in the name VGG-16 refers to the fact that this has 16 layers that have weights. And this is a pretty large network: it has a total of about 138 million parameters.
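That 138 million figure can be roughly reproduced by summing the 13 conv layers and 3 fully connected layers; the channel list here is written out by hand, so treat it as a sketch rather than a definitive spec:

```python
def conv_params(f, c_in, c_out):
    """Weights plus one bias per filter: (f*f*c_in + 1) * c_out."""
    return (f * f * c_in + 1) * c_out

def fc_params(n_in, n_out):
    """Fully connected layer: one weight per input-output pair, plus biases."""
    return n_in * n_out + n_out

# Channel sizes through the 13 conv layers (all 3x3, stride 1, same padding)
channels = [3, 64, 64, 128, 128, 256, 256, 256, 512, 512, 512, 512, 512, 512]
total = sum(conv_params(3, c_in, c_out)
            for c_in, c_out in zip(channels, channels[1:]))
total += fc_params(7 * 7 * 512, 4096)    # flattened 7x7x512 volume -> 4096
total += fc_params(4096, 4096)
total += fc_params(4096, 1000)           # softmax over the 1,000 classes
assert 135_000_000 < total < 140_000_000   # about 138 million parameters
```

Notice how the conv layers contribute far fewer parameters than the first fully connected layer, even though they do most of the computation.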

And that's pretty large even by modern standards, but the simplicity of the VGG-16 architecture made it quite appealing. You can tell its architecture is really quite uniform: there are a few conv layers followed by a pooling layer that reduces the height and width, and you have a few of those here. But also, if you look at the number of filters in the conv layers, here you have 64 filters, and then you double to 128, double to 256, and double to 512. And then, I guess, the authors thought 512 was big enough and didn't double it again here. But roughly doubling on every step, or doubling through every stack of conv layers, was another simple principle used to design the architecture of this network. And so, I think the relative uniformity of this architecture made it quite attractive to researchers.

The main downside was that it was a pretty large network in terms of the number of parameters you had to train. If you read the literature, you'll sometimes see people talk about VGG-19, which is an even bigger version of this network; you can see the details in the paper cited at the bottom, by Karen Simonyan and Andrew Zisserman. But because VGG-16 does almost as well as VGG-19, a lot of people will use VGG-16.

But the thing I liked most about this network was that it made this pattern very systematic: as you go deeper, the height and width go down, dropping by a factor of two each time through the pooling layers, whereas the number of channels increases, going up by roughly a factor of two every time you have a new stack of conv layers. By making the rate at which this goes down and that goes up so systematic, I thought this paper was very attractive from that perspective.

So that's it for the three classic architectures. If you want, you could go now and read some of these papers. I recommend starting with the AlexNet paper, followed by the VGGNet paper, and then the LeNet paper. It's a bit harder to read, but it is a good one if you want to take a look at it. But next, let's go beyond these classic networks and look at some even more advanced, even more powerful neural network architectures. Let's go into-
