In the previous video, we introduced the concept of neural networks and then we worked through the algebra required to describe a fully connected feed-forward network with hidden layers. In this video, we're going to see how the multivariate chain rule will enable us to iteratively update the values of all the weights and biases such that the network learns to classify input data based on a set of training examples. When we say that we are training a network, we typically mean using some kind of labelled data, which are pairs of matching inputs and outputs. For example, if we were to build a network to recognize pictures of faces and predict if they were happy, then for our training data, each of the inputs might be the intensity of a single pixel from the image. And this would be paired with an output which just says whether this image contains a face and whether it was a happy face. The classic training method is called backpropagation because it looks first at the output neurons and then works back through the network. If we start by choosing a simple structure such as the one shown here, with four input units, three units in the hidden layer and two units in the output layer, what we're trying to do is find the 18 weights and five biases that cause our network to best match the training inputs to their labels. Initially, we will set all of our weights and biases to random values. And so initially, when we pass some data into our network, what we get out will be meaningless. However, we can then define a cost function, which is simply the sum of the squares of the differences between the desired output y, and the output that our untrained network currently gives us. If we were to focus on the relationship between one specific weight and the resulting cost function, it might look something like this, where if the weight is either too large or too small, the cost is high. But at one specific value, the cost is at a minimum.
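The structure described above can be sketched in code. This is a minimal illustration, not the course's own implementation: the sigmoid activation and the random-initialisation scheme are assumptions, but the shapes follow the 4-3-2 network from the video, giving exactly 18 weights and 5 biases, and the cost is the stated sum of squared differences.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigma(z):
    # Sigmoid activation (an assumed choice of activation function).
    return 1.0 / (1.0 + np.exp(-z))

# Four input units, three hidden units, two output units:
# 4*3 + 3*2 = 18 weights and 3 + 2 = 5 biases, initialised randomly.
W1 = rng.standard_normal((3, 4))
b1 = rng.standard_normal(3)
W2 = rng.standard_normal((2, 3))
b2 = rng.standard_normal(2)

def forward(a0):
    # Pass an input vector a0 through the hidden and output layers.
    a1 = sigma(W1 @ a0 + b1)   # hidden-layer activations
    a2 = sigma(W2 @ a1 + b2)   # output-layer activations
    return a2

def cost(a0, y):
    # Sum of squares of the differences between the desired output y
    # and what the (initially untrained) network currently gives us.
    return np.sum((forward(a0) - y) ** 2)
```

With random weights, `cost` returns an essentially meaningless value for any training pair, which is exactly the starting point the training procedure then improves.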
Now, based on our understanding of calculus, if we were somehow able to work out the gradient of C with respect to the variable W, at some initial point W0, then we can simply head in the opposite direction. For example, at the point shown on the graph, the gradient is positive and therefore increasing W would also increase the cost. So, we should make W smaller to improve our network. However, at this point it's worth noting that our cost function may look something more like this wiggly curve here, which has several local minima and is more complicated to navigate. Furthermore, this plot only considers one of our weights in isolation. But what we'd really like to do is find the minimum of the multi-dimensional hypersurface, much like the 2D examples we saw in the previous module. So, also like the previous module, if we want to head downhill, we will need to build the Jacobian by gathering together the partial derivatives of the cost function with respect to all of the relevant variables. So, now that we know what we're after, we just need to look again at our simple two-node network. And at this point, we could immediately write down a chain rule expression for the partial derivative of the cost with respect to either our weight or our bias. And I've highlighted the a1 term which links these two derivatives. However, it's often convenient to make use of an additional term, which we will call z1, that will hold our weighted activation plus bias term. This will allow us to think separately about differentiating the particular sigmoid function that we happened to choose. We must therefore include an additional link in our derivative chain. We now have the two chain rule expressions we'd require to navigate the two-dimensional WB space in order to minimize the cost of this simple network for a set of training examples. Clearly, things are going to get a little more complicated when we add more neurons.
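For the simple two-node network, those chain rule expressions can be written out and used directly. In this sketch (the training pair, initial weight and bias, and learning rate are all illustrative values, not from the video), z1 = w·a0 + b, a1 = σ(z1), and the cost for one example is C = (a1 − y)². The chain rule then gives ∂C/∂w = (∂C/∂a1)(∂a1/∂z1)(∂z1/∂w) and likewise for b, and each step heads in the opposite direction to the gradient.

```python
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

def d_sigma(z):
    # Derivative of the sigmoid: sigma'(z) = sigma(z) * (1 - sigma(z)).
    s = sigma(z)
    return s * (1.0 - s)

# One labelled training example (illustrative values).
a0, y = 1.5, 0.8

w, b = 2.3, -1.2   # initial weight and bias (random starting guesses)
eta = 0.5          # learning rate (an assumed value)

for step in range(200):
    z1 = w * a0 + b            # weighted activation plus bias term
    a1 = sigma(z1)             # network output
    # Chain rule: dC/dw = dC/da1 * da1/dz1 * dz1/dw, and similarly for b,
    # where dz1/dw = a0 and dz1/db = 1.
    dC_da1 = 2.0 * (a1 - y)
    dC_dw = dC_da1 * d_sigma(z1) * a0
    dC_db = dC_da1 * d_sigma(z1) * 1.0
    # Head in the opposite direction to the gradient.
    w -= eta * dC_dw
    b -= eta * dC_db

final_cost = (sigma(w * a0 + b) - y) ** 2
```

Because the gradient is positive at the starting point, the updates shrink w and b until the output a1 settles near the label y, so the cost falls step by step, which is the behaviour the graph in the video illustrates.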
But fundamentally, we're still just applying the chain rule to link each of our weights and biases back to its effect on the cost, ultimately allowing us to train our network. In the following exercises, we're going to work through how to extend what we saw for the simple case to multi-layer networks. But I hope you've enjoyed seeing calculus in action already. See you next time.