If the basic technical ideas behind deep learning, behind neural networks, have been around for decades, why are they only just now taking off? In this video, let's go over some of the main drivers behind the rise of deep learning, because I think this will help you spot the best opportunities within your own organization to apply these ideas. Over the last few years, a lot of people have asked me, "Andrew, why is deep learning suddenly working so well?" When I'm asked that question, this is usually the picture I draw for them. Let's say we plot a figure where on the horizontal axis we plot the amount of data we have for a task. Let's say on the vertical axis we plot the performance of our learning algorithm, such as the accuracy of our spam classifier or our ad click predictor, or the accuracy of our neural net for figuring out the position of other cars for our self-driving car. It turns out that if you plot the performance of a traditional learning algorithm, like a support vector machine or logistic regression, as a function of the amount of data you have, you might get a curve that looks like this, where the performance improves for a while as you add more data, but after a while the performance pretty much plateaus. This is supposed to be a horizontal line; I did not draw that very well. It was as if these algorithms didn't know what to do with huge amounts of data.
What happened in our society over the last 20 years or so is that for a lot of problems, we went from having a relatively small amount of data to having often a fairly large amount of data, and a lot of this was thanks to the digitization of society, where so much human activity is now in the digital realm. We spend so much time on our computers, on websites, and on mobile apps, and activity on digital devices creates data. Thanks to the rise of inexpensive cameras built into our cell phones, accelerometers, and all sorts of sensors in the Internet of Things, we have also just been collecting more and more data.
So, over the last 20 years, for a lot of applications, we just accumulated a lot more data, more than traditional learning algorithms were able to effectively take advantage of. With neural networks, it turns out that if you train a small neural net, then its performance maybe looks like that. If you train a somewhat larger neural net, let's call it a medium-sized neural net, performance often will be a little better, and if you train a very large neural net, then its performance often just keeps getting better and better. So, a couple of observations. One is, if you want to hit this very high level of performance, then you need two things. First, often you need to be able to train a big enough neural network in order to take advantage of the huge amount of data, and second, you need to be out here on the x-axis; you do need a lot of data. So, we often say that scale has been driving deep learning progress, and by scale I mean both the size of the neural network, meaning a neural network with a lot of hidden units, a lot of parameters, a lot of connections, as well as the scale of the data. In fact, today one of the most reliable ways to get better performance in a neural network is often to either train a bigger network or throw more data at it. That only works up to a point, because eventually you run out of data, or eventually the neural network is so big that it takes too long to train, but just improving scale has actually taken us a long way in the world of deep learning.
In order to make this diagram a bit more technically precise, let me just add a few more things. I wrote the amount of data on the x-axis. Technically, this is the amount of labelled data, where by labelled data I mean training examples for which we have both the input x and the label y. To introduce a little bit of notation that we'll use later in this course, we're going to use lowercase m to denote the size of the training set. So, the number of training examples is lowercase m.
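Since the picture itself is not visible in a transcript, here is a rough sketch of its shape in Python. Everything in it is an illustrative assumption rather than a measurement: the saturating functional form, the `toy_performance` helper, and every constant in `MODELS` are made up purely so that each curve rises with the number of labelled training examples m and then levels off at a ceiling that is higher for bigger networks.

```python
import math


def toy_performance(m, ceiling, data_scale):
    """Toy saturating curve: rises with m, then flattens out near `ceiling`.

    This functional form is an assumption chosen for illustration only.
    """
    return ceiling * (1.0 - math.exp(-m / data_scale))


# Hypothetical (ceiling, data_scale) pairs; none of these numbers are real results.
MODELS = {
    "traditional algorithm": (0.70, 1_000),
    "small neural net": (0.78, 3_000),
    "medium neural net": (0.86, 10_000),
    "large neural net": (0.95, 50_000),
}


def performance_at(m):
    """Return the toy performance of every model for m labelled training examples."""
    return {
        name: toy_performance(m, ceiling, data_scale)
        for name, (ceiling, data_scale) in MODELS.items()
    }


# Print one row per training set size so the plateaus are easy to see.
for m in (1_000, 10_000, 100_000, 1_000_000):
    row = ", ".join(f"{name}: {perf:.3f}" for name, perf in performance_at(m).items())
    print(f"m = {m:>9,} -> {row}")
```

In this toy model the traditional algorithm stops improving early, while the large network keeps improving as m grows. The ordering on the far left of the toy curves should not be read into; it is an artifact of the made-up constants.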
So that's the horizontal axis. A couple of other details about this figure: in this regime of small training sets, the relative ordering of the algorithms is actually not very well defined. So, if you don't have a lot of training data, it's often your skill at hand-engineering features that determines performance. It's quite possible that if someone training an SVM is more motivated to hand-engineer features than someone training an even larger neural net, then in this small training set regime the SVM could do better. So, in this region to the left of the figure, the relative ordering between the algorithms is not that well defined, and performance depends much more on your skill at hand-engineering features and on other lower-level details of the algorithms. It is only in this big data regime, the very large training set, very large m regime on the right, that we more consistently see large neural nets dominating the other approaches. So, if any of your friends ask you why neural nets are taking off, I would encourage you to draw this picture for them as well.
So, I would say that in the early days of the modern rise of deep learning, it was scale of data and scale of computation, just our ability to train very large neural networks, either on a CPU or a GPU, that enabled us to make a lot of progress. But increasingly, especially in the last several years, we've been seeing tremendous algorithmic innovation as well, so I also don't want to understate that. Interestingly, many of the algorithmic innovations have been about trying to make neural networks run much faster. So, as a concrete example, one of the huge breakthroughs in neural networks has been switching from a sigmoid function, which looks like this, to a ReLU function, which we talked about briefly in an earlier video, that looks like this. If you don't understand the details of what I'm about to say, don't worry about it.
But it turns out that one of the problems of using sigmoid functions in machine learning is that there are these regions here where the slope of the function, the gradient, is nearly zero, and so learning becomes really slow, because when you implement gradient descent and the gradient is nearly zero, the parameters just change very slowly. Whereas by changing the activation function of the neural network to use this function called the ReLU function, or rectified linear unit, the gradient is equal to one for all positive values of the input. So, the gradient is much less likely to gradually shrink to zero. The gradient here, the slope of this line, is zero on the left, but it turns out that just switching from the sigmoid function to the ReLU function has made an algorithm called gradient descent work much faster. So, this is an example of maybe a relatively simple algorithmic innovation, but ultimately the impact of this algorithmic innovation was that it really helped computation. There have actually been quite a lot of examples like this, where we changed the algorithm because it allows our code to run much faster, and this allows us to train bigger neural networks, or to do so in a reasonable amount of time, even when we have a large network or a lot of data.
The other reason that fast computation is important is that it turns out the process of training a neural network is very iterative. Often, you have an idea for a neural network architecture, and so you implement your idea in code. Implementing your idea then lets you run an experiment, which tells you how well your neural network does, and then by looking at the result, you go back to change the details of your neural network, and then you go around this circle over and over.
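To make the point about sigmoid and ReLU slopes concrete, here is a minimal sketch in plain Python. The function names and the sample inputs are my own choices for illustration, and the slope of ReLU at exactly zero (where the derivative is not defined) is set to zero here as a convention. The sketch only evaluates the two activation functions and their slopes at a few points; it is not a full gradient descent implementation.

```python
import math


def sigmoid(z):
    """Sigmoid activation: squashes z into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z):
    """Slope of the sigmoid: s * (1 - s), which is at most 0.25 (at z = 0)."""
    s = sigmoid(z)
    return s * (1.0 - s)


def relu(z):
    """ReLU activation: zero for negative inputs, identity for positive inputs."""
    return max(0.0, z)


def relu_grad(z):
    """Slope of ReLU: 1 for positive inputs, 0 otherwise (0 at z = 0 by convention)."""
    return 1.0 if z > 0 else 0.0


# Compare the slopes at a few sample inputs (chosen arbitrarily for illustration).
for z in (-10.0, -2.0, 0.5, 2.0, 10.0):
    print(f"z = {z:6.1f}  sigmoid slope = {sigmoid_grad(z):.6f}  ReLU slope = {relu_grad(z):.1f}")
```

For large positive or negative z the sigmoid slope is very close to zero, so a gradient descent update that is proportional to that slope barely moves the parameters, whereas the ReLU slope stays at one for every positive input.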
When your neural network takes a long time to train, it just takes a long time to go around this cycle, and there's a huge difference in your productivity building effective neural networks when you can have an idea, try it, and see if it works in 10 minutes or maybe at most a day, versus if you have to train your neural network for a month, which sometimes does happen. When you get a result back in 10 minutes or maybe in a day, you can just try a lot more ideas and be much more likely to discover a neural network that works well for your application. So, faster computation has really helped in terms of speeding up the rate at which you can get an experimental result back, and this has really helped both practitioners of neural networks and researchers working in deep learning iterate much faster and improve their ideas much faster. All this has also been a huge boon to the entire deep learning research community, which has been incredible, I guess, at inventing new algorithms and making nonstop progress on that front.
So, these are some of the forces powering the rise of deep learning, but the good news is that these forces are still working powerfully to make deep learning even better. Take data: society is still producing more and more digital data. Or take computation: with the rise of specialized hardware like GPUs, faster networking, and many other types of hardware, I'm actually quite confident that our ability to build very large neural networks, from a sheer computation point of view, will keep on getting better. And take algorithms: the whole deep learning research community continues to be phenomenal at innovating on the algorithms front. So, because of this, I think that we can be optimistic. I'm certainly optimistic that deep learning will keep on getting better for many years to come. So, with that, let's go on to the last video of this section, where we'll talk a little bit more about what you'll learn from this course.