Hello everyone, my name is Pavel, and now we'll continue our conversation about how our algorithms are trained on big data. In this video, I will tell you about second-order optimization methods, which work faster than first-order methods. I put here a portrait of Isaac Newton, the man who came up with one of the most popular ways of training algorithms; it's amazing, isn't it? First, I will explain the intuition behind Newton's method, and then tell you about BFGS and L-BFGS, which are approximate versions of Newton's method used on big data. In the end, I will tell you a secret: how linear regression is actually trained, through Newton's method.

Imagine the same problem that we solved earlier: we have our curve, we want to build some approximation and find the minimum value of this function. This can be our error function when we solve a machine learning problem. Where gradient descent builds tangent lines to this curve and walks down along the tangent, second-order methods say: let's build second-order curves such that at the selected point their first and second derivatives are equal to the derivatives of the original function. What kind of curve could this be? A second-order tangent curve is a parabola. Now we look for the minimum of this parabola, and this is the second step of our approximation. After this, we again build a tangent parabola, but anchored at this new point. The minimum of that parabola is step number three, and so on, until the method converges.

The formula looks very simple in the one-dimensional case: we take the current value of our parameter and subtract from it the ratio of the first derivative to the second derivative, and in this way we move toward the minimum. How does it look in the two-dimensional case? Let's see: we construct a paraboloid tangent to the function at the point of the initial approximation. Its minimum is approximation number one. Then we build paraboloid number two and again move to its minimum, and so on in the following steps. The method, as I have said, works much faster: where stochastic gradient descent needs 10,000 steps, this method can easily get by with six.

How do we move from the one-dimensional case to the multi-dimensional case? The first derivative, as you already know, has to be replaced by the gradient, which is a vector containing the derivatives with respect to all variables. And the second derivative must be replaced by the Hessian, a matrix that contains the second derivatives for each pair of coordinates. Instead of subtracting the ratio of the derivatives from the current value to get the new one, we subtract the gradient multiplied by the inverse Hessian matrix. This moves us toward the optimum. However, there is a problem with this algorithm: computing the inverse Hessian, just like finding the analytical solution, requires on the order of n cubed operations. We are faced with the same problem as before: for 10 features we need about 1,000 operations, for 100 features, 1 million, for 1,000 features, 1 billion, and so on; this is extremely expensive. And we often have problems with far more features: hundreds of thousands, and sometimes millions. To solve this problem, a special method was invented: BFGS, the Broyden-Fletcher-Goldfarb-Shanno algorithm. It seems to me that these names make a great tongue twister.
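Before we get to BFGS, here is a minimal sketch of the plain Newton step just described. It is only an illustration, not code from the lecture: the function newton_minimize and the toy error function are my own, and the linear solve inside the loop is exactly the step that costs on the order of n cubed operations.

```python
import numpy as np

def newton_minimize(grad, hess, x0, n_steps=6, tol=1e-10):
    """Newton's method: x_new = x - H(x)^(-1) * grad(x)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_steps):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        # Instead of explicitly inverting the Hessian, solve H * step = g;
        # this linear solve is the expensive O(n^3) part.
        step = np.linalg.solve(hess(x), g)
        x = x - step
    return x

# Toy error function: f(x, y) = (x - 3)^2 + 2 * (y + 1)^2
grad = lambda x: np.array([2.0 * (x[0] - 3.0), 4.0 * (x[1] + 1.0)])
hess = lambda x: np.array([[2.0, 0.0], [0.0, 4.0]])
print(newton_minimize(grad, hess, x0=[0.0, 0.0]))  # converges to [3, -1] in one step
```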
The idea of BFGS is this: let's compute the value of the inverse Hessian approximately. We take the value we obtained at the previous step and find the new one by adding a small correction to it. The correction is a product of two matrices, one of which contains two columns and the other two rows; we will need this fact later. How do we find this correction? It can be found from the so-called secant equation. This equation says that for two nearby points, the change in the gradient is equal to the difference of the coordinates of the points multiplied by the Hessian. If we substitute our representation of the inverse Hessian, the old value plus the correction, into this formula, then we can find the correction from the secant equation.

Excellent, we have learned how to build an approximate value of the Hessian and how to solve our problem. However, if there are many features, many operations are still required: not n cubed anymore, but n squared. That is, for 10 features you need to store and update 100 values, for 100 features 10,000 values, and for 1,000 features a million values. It also requires a lot of RAM: imagine storing such a matrix when you have a million features. Even with 10,000 features you already store a hundred million values; it's just awful.

To solve this problem, the L-BFGS algorithm was created, which stands for Limited-memory Broyden-Fletcher-Goldfarb-Shanno. In principle, there is no need to remember this decoding; everyone just calls it L-BFGS. How does it work? As you remember, in the BFGS algorithm we presented the inverse Hessian as the previous inverse Hessian plus a correction, and this correction can be presented as a product of two matrices, where one has two columns and the other two rows. The point is that the value at the previous step can also be decomposed in the same way, and the one before that, and so on, all the way back to the initial value of the Hessian. By the way, the identity matrix multiplied by some parameter alpha is usually taken as the initial approximation. And here one trick is done: all the intermediate corrections are discarded, and only the last ten remain. In the end, we keep the last ten corrections plus the initial approximation. Of course, this is only an approximation of the Hessian, but experience shows that it works really well and converges quickly.

If we substitute this inverse Hessian into our formula and open the brackets, it turns out that each operation can be done in linear time. For example, to multiply the gradient by a matrix that contains two rows of length n, we need 2n operations and get a vector of length 2; to multiply this vector of length 2 by a matrix with two columns of length n again takes on the order of n operations. That is, everything became linear in the number of features.

And now, at the end, let me tell you the secret of how linear regression is trained. Inside it, there is simply a condition that checks whether we have many features or not. If the number of features does not exceed a specified threshold, and that threshold is 4,000 features, then the analytical solution is used: it finds the solution of a system of linear equations. If the number of features is more than 4,000, then the L-BFGS algorithm is used.
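Here is a minimal sketch of that solver choice. It is my own illustration under stated assumptions, not the actual library code: the function fit_linear_regression, the exact loss, and the use of SciPy's L-BFGS-B routine are mine; only the two branches and the 4,000-feature threshold come from the lecture.

```python
import numpy as np
from scipy.optimize import minimize

def fit_linear_regression(X, y, feature_threshold=4000):
    """Hypothetical sketch: analytical solution for few features, L-BFGS otherwise."""
    n_features = X.shape[1]

    if n_features <= feature_threshold:
        # Few features: the analytical solution, i.e. solving a system
        # of linear equations (here via least squares).
        w, *_ = np.linalg.lstsq(X, y, rcond=None)
        return w

    # Many features: minimize the squared error with L-BFGS, which keeps
    # only the last few corrections to the inverse Hessian (10 by default).
    def loss_and_grad(w):
        residual = X @ w - y
        return 0.5 * residual @ residual, X.T @ residual

    result = minimize(loss_and_grad, np.zeros(n_features), jac=True,
                      method="L-BFGS-B", options={"maxcor": 10})
    return result.x
```

On well-conditioned data, both branches give essentially the same weights; the L-BFGS branch simply avoids the cubic cost and the n-squared memory.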
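Finally, to write down in formulas what was said about BFGS: below is the standard textbook form of the secant equation and of the inverse Hessian update; the notation (s_k, y_k, H_k, rho_k) is mine, since the lecture describes this only in words.

```latex
% Step and gradient change between two nearby points:
s_k = x_{k+1} - x_k, \qquad y_k = \nabla f(x_{k+1}) - \nabla f(x_k)

% Secant equation: the change in the gradient equals the Hessian
% applied to the change in the coordinates.
B_{k+1} s_k = y_k

% BFGS update of the inverse Hessian H_k \approx B_k^{-1}; the correction
% has rank two, i.e. it can be written as a product of a matrix with two
% columns and a matrix with two rows.
H_{k+1} = \left(I - \rho_k s_k y_k^{\top}\right) H_k \left(I - \rho_k y_k s_k^{\top}\right)
          + \rho_k s_k s_k^{\top},
\qquad \rho_k = \frac{1}{y_k^{\top} s_k}
```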
So in this lesson, you have learned what optimization methods exist: I explained to you first-order methods and second-order optimization methods. You have learned about gradient descent, stochastic gradient descent, Newton's method, BFGS, and L-BFGS, and in the end we learned how linear regression actually works. Well, with that we close the topic of optimization; we have figured out how it works, and it's time to move on.