
When designing a layer for a ConvNet, you might have to pick: do you want a 1 x 1 filter, or 3 x 3, or 5 x 5? Or do you want a pooling layer? What the Inception network says is: why not do them all? This makes the network architecture more complicated, but it also works remarkably well.

Let's see how this works. Let's say, for the sake of example, that you have as input a 28 x 28 x 192 dimensional volume. What the Inception network, or an Inception layer, says is: instead of choosing what filter size you want in a CONV layer, or even whether you want a convolutional layer or a pooling layer, let's do them all. So you can use a 1 x 1 convolution, and that will output 28 x 28 x something, let's say 28 x 28 x 64, and you just have a volume there.

But maybe you also want to try a 3 x 3, and that might output 28 x 28 x 128. Then what you do is just stack up the second volume next to the first. To make the dimensions match up, let's make this a same convolution, so the output is still 28 x 28, matching the input in terms of height and width, giving 28 x 28 x 128 in this example.

And you might say, "Well, maybe a 5 x 5 filter works better." So let's do that too, and have it output 28 x 28 x 32; again, you use a same convolution to keep the dimensions the same. And maybe you don't want a convolutional layer at all. Then let's apply pooling, which gives some other output, and let's stack that up as well; here, pooling outputs 28 x 28 x 32.

Now, in order to make all the dimensions match, you actually need to use padding for max pooling. This is an unusual form of pooling: if the input is of height and width 28 x 28 and you want the output to match at 28 x 28 as well, then you need to use same padding along with a stride of one for pooling. This detail might seem a bit funny now, but let's keep going and we'll make it all work later.

But with an Inception module like this, you can input some volume and output, in this case, the sum of all these channel counts: 32 + 32 + 128 + 64, which equals 256. So you'd have one Inception module that inputs 28 x 28 x 192 and outputs 28 x 28 x 256.
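As a concrete sketch of this stacking, here is how the four branch outputs concatenate along the channel dimension. This uses NumPy zero arrays purely as stand-ins for the real branch activations; the shapes are the ones from the example above.

```python
import numpy as np

# Stand-in activations for each branch, shaped (height, width, channels).
conv1x1 = np.zeros((28, 28, 64))   # 1 x 1 convolution branch
conv3x3 = np.zeros((28, 28, 128))  # 3 x 3 same convolution branch
conv5x5 = np.zeros((28, 28, 32))   # 5 x 5 same convolution branch
maxpool = np.zeros((28, 28, 32))   # max pooling branch (stride 1, same padding)

# Stack the branches along the channel axis: 64 + 128 + 32 + 32 = 256.
output = np.concatenate([conv1x1, conv3x3, conv5x5, maxpool], axis=-1)
print(output.shape)  # (28, 28, 256)
```

The same-convolution and stride-one-pooling choices are exactly what make the heights and widths agree so this concatenation is possible.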

And this is the heart of the Inception network, which is due to Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich.

The basic idea is that instead of needing to pick one of these filter sizes, or pooling, and committing to that, you can do them all, concatenate all the outputs, and let the network learn whatever parameters it wants to use and whatever combinations of these filter sizes it wants.

Now, it turns out that there's a problem with the Inception layer as I've described it here, which is computational cost. On the next slide, let's figure out the computational cost of this 5 x 5 filter resulting in this block over here.

So, focusing on just the 5 x 5 part, on the previous slide we had this 28 x 28 x 192 input block, and you implement a 5 x 5 same convolution with 32 filters to output 28 x 28 x 32. On the previous slide I had drawn that as a thin purple slice, but I draw it as a more normal-looking blue block here. So, let's look at the computational cost of outputting just this 28 x 28 x 32.

You have 32 filters, because the output has 32 channels, and each filter is going to be 5 x 5 x 192. The output size is 28 x 28 x 32, so you need to compute 28 x 28 x 32 numbers, and for each of them you need to do this many multiplications: 5 x 5 x 192.

So the total number of multiplications you need is the number of multiplications needed to compute each output value, times the number of output values you need to compute. If you multiply all of these numbers out, this is equal to about 120 million. And while you can do 120 million multiplies on a modern computer, this is still a pretty expensive operation.
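The arithmetic behind that 120 million figure can be checked directly:

```python
# Each of the 28 x 28 x 32 output values needs one multiplication
# per tap of its 5 x 5 x 192 filter.
output_values = 28 * 28 * 32
mults_per_value = 5 * 5 * 192
total_mults = output_values * mults_per_value
print(total_mults)  # 120422400, roughly 120 million
```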

On the next slide, you'll see how, using the idea of 1 x 1 convolutions, which you learned about in the previous video, you can reduce the computational cost by about a factor of 10, going from about 120 million multiplies to about one-tenth of that. So please remember the number 120 million, so you can compare it with what you see on the next slide.

Here's an alternative architecture for inputting 28 x 28 x 192 and outputting 28 x 28 x 32, which is the following: you take the input volume and use a 1 x 1 convolution to reduce it to 16 channels instead of 192 channels. Then, on this much smaller volume, you run your 5 x 5 convolution to give you your final output. Notice the input and output dimensions are still the same: you input 28 x 28 x 192 and output 28 x 28 x 32, same as on the previous slide.

But what we've done is taken the huge volume we had on the left and shrunk it to this much smaller intermediate volume, which has only 16 channels instead of 192. Sometimes this is called a bottleneck layer, I guess because a bottleneck is usually the smallest part of something. If you have a glass bottle, the neck where the cork goes is the smallest part of the bottle. In the same way, the bottleneck layer is the smallest part of this network: we shrink the representation before increasing the size again.
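One way to see why a 1 x 1 convolution can shrink the channel count so cheaply: at every spatial position, it is just a linear map across channels. A minimal NumPy sketch, with random data standing in for the real activations and learned weights:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((28, 28, 192))  # input volume
w = rng.standard_normal((192, 16))      # 16 filters, each of size 1 x 1 x 192

# The 1 x 1 convolution is a matrix multiply applied at each of the
# 28 x 28 positions, shrinking 192 channels to the 16-channel bottleneck.
bottleneck = x @ w
print(bottleneck.shape)  # (28, 28, 16)
```

(A real layer would also add a bias and a nonlinearity, omitted here for brevity.)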

Now, let's look at the computational cost involved. To apply this 1 x 1 convolution, we have 16 filters, and each filter is going to be of dimension 1 x 1 x 192, where this 192 matches that 192. So the cost of computing this 28 x 28 x 16 volume is the number of outputs times the 192 multiplications each one needs; I could have written 1 x 1 x 192 to represent this. If you multiply this out, it comes to about 2.4 million.

How about the second convolutional layer? That was the cost of the first one; the cost of this second convolutional layer is the number of outputs, 28 x 28 x 32, times the 5 x 5 x 16 dimensional filter applied for each output. Multiply that out and it's about 10 million. So the total number of multiplications you need is the sum of those, which is about 12.4 million.

If you compare this with what we had on the previous slide, you've reduced the computational cost from about 120 million multiplies down to about one-tenth of that, 12.4 million multiplications. The number of additions you need is very similar to the number of multiplications, which is why I'm just counting multiplications.
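Again, the cost figures for the bottleneck design can be verified in a couple of lines:

```python
# Stage 1: 1 x 1 x 192 convolution producing the 28 x 28 x 16 bottleneck.
stage1 = 28 * 28 * 16 * (1 * 1 * 192)   # 2,408,448 -> about 2.4 million
# Stage 2: 5 x 5 x 16 convolution producing the 28 x 28 x 32 output.
stage2 = 28 * 28 * 32 * (5 * 5 * 16)    # 10,035,200 -> about 10 million
total = stage1 + stage2
print(total)  # 12443648, about 12.4 million
```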

So, to summarize: if you are building a layer of a neural network and you don't want to have to decide between a 1 x 1, 3 x 3, or 5 x 5 filter or a pooling layer, the Inception module lets you say, let's do them all, and let's concatenate the results. We then ran into the problem of computational cost, and we just saw how, using a 1 x 1 convolution, you can create a bottleneck layer, thereby reducing the computational cost significantly.

Now, you might be wondering: does shrinking down the representation size so dramatically hurt the performance of your neural network? It turns out that as long as you implement this bottleneck layer within reason, you can shrink down the representation size significantly without it seeming to hurt the performance, and it saves you a lot of computation. So those are the key ideas of the Inception module. Let's put them together, and in the next video I'll show you what the full Inception network looks like.
