Hello, my name is Pavel, and this week you will learn how to use machine learning on big data. Sounds interesting? Let's go. This week you will learn why machine learning should be used on big data. You will also get to know Spark MLlib, learn how to use linear models at large scale to predict events, and learn some techniques for improving the quality of predictions. In this video, I will tell you how to solve the problem of sampling big data in the right way and in the wrong way. You will get answers to the questions: why do you need large-scale machine learning, and why should you study Spark MLlib?

So, imagine that you have a huge data set with billions of events. The first idea that comes to mind: what if you take a small piece of the events from the data set, load it onto a local machine, work with it, build models on it, and make some predictions, everything you did last week with the help of scikit-learn. This idea is not so bad. To do it, you need to pull out a random subset of the examples from your data set and use them. However, there is one pitfall I want to tell you about.

So, you have a giant log of users' activity on the Internet. You download 10% of it to your computer and use it to predict a person's gender. Suppose that a visitor we have chosen once read an article about fishing. This suggests that the visitor is a man. Stereotypes! What is the problem here? If a person made 100 clicks on your site, then the sample will contain about 10 of his clicks. Among those 10 randomly selected clicks, the fishing site will appear with a probability of only 10%, and with 90% probability we will miss this important event. To sample correctly in this case, you need to select not the events but the users who created those events, and keep all of their events.
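A minimal PySpark sketch of the two approaches, assuming a hypothetical click log with a user_id column stored in Parquet (the path, the column name, and the 10% fraction are assumed for illustration):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("user-sampling").getOrCreate()

# Hypothetical click log with columns such as user_id, url, timestamp.
events = spark.read.parquet("/path/to/click_log.parquet")  # assumed path

# Wrong way: sample 10% of the *events*; most of each user's clicks are lost.
events_sample = events.sample(withReplacement=False, fraction=0.1, seed=42)

# Right way: sample 10% of the *users*, then keep all of their events.
users = events.select("user_id").distinct()
users_sample = users.sample(withReplacement=False, fraction=0.1, seed=42)
user_level_sample = events.join(users_sample, on="user_id", how="inner")
```

The join keeps every click of every selected user, so rare per-user signals, such as that single visit to the fishing site, are preserved in the sample.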
Why do we need the complete data set? Let's figure it out. If you train your model on a very small data set, the model can easily become overfitted. Consider this example, where we distinguish boys from girls. You train a complex model, and it correctly determines the classes of all the boys and girls in the training data, but if you try to classify another data set with this model, you will see that the quality is far from 100%. This effect is called overfitting, and training on big data is one of the ways to fight it.

The second reason why you need large data is a very complex model for solving your problem, one with a lot of parameters that must be configured automatically. A large data set will allow you to train it to correctly predict an event, to distinguish kittens from dogs, and to avoid mistakes. The quality of the prediction usually grows with the number of examples. Therefore, the more data you have, the better your predictions can become. If you have a giant data set, it makes sense to try training the model on the entire data set and to see what quality you get in that case.

Another case where large data can really help you is when you need to predict a rare event. For example, if someone throws a coin and it lands standing upright, this, of course, cannot be predicted, because a coin is a random number generator. But it is a good example of a rare event. The fact is that such rare events are very important, and sometimes critical for business, for example, a service outage. If you can write a program that predicts an outage half an hour before it happens, you will save the company a lot of money. 99% of all operations in a bank are performed by honest, law-abiding citizens. However, among them there is a small percentage of fraudsters who try to steal money from someone. If you write a program that catches such fraud, banks will make a profit. You need a lot of data here, because the number of fraudulent operations, compared to the number of legitimate ones, is vanishingly small.

We have figured out why we need large data. Now, why do we need a separate machine learning library for large data? What will it give us? First, it can build a predictive model using all the data it can get: terabytes, tens of terabytes, petabytes of data; you only need to provide a sufficient number of machines. Second, it can apply this model in a distributed way to the same terabytes and petabytes of data. If your algorithm can classify your data in 1,000 hours, then if you parallelize it across 1,000 processors, the whole classification will take one hour. Compare a classification that lasts about 40 days with one that lasts one hour; the choice is obvious.

Also, complex machine learning models have a bunch of settings that you set yourself, the so-called hyper-parameters. Machine learning on large data allows you to run the selection of hyper-parameters distributed across the cluster: some machines will train and check the quality of classification with one hyper-parameter value, other machines will train the model with another value. Thus, you can take advantage of the fact that you have access to a large pool of machines.

So, what library am I talking about? Which library do I mean? Well, of course, I am talking about Spark MLlib. In Spark 2.0, Spark MLlib works on top of the Spark SQL library. You already know it; you studied it during the last course. What are the advantages of Spark SQL? The first advantage is simplicity: is there anyone here who does not know how to use SQL? The second advantage is that it integrates with a bunch of sources: it can read data from files, write data to files, read from Hive, write to Hive, and use external databases. The third advantage is its performance. The Spark SQL library runs 10 times faster than MapReduce, because it does not store intermediate data on disk, and if you can cache the data, keeping it in memory, the speedup can reach 100 times. Just think about it.

What are the advantages of Spark MLlib? Actually, Spark MLlib was inspired by the scikit-learn library, which you have already become acquainted with, I hope. What are its advantages? First, a unified way of processing data and applying models: any model and any transformation are applied in the same way. The second advantage is easy deployment on large data: you just need to run your program, and it will pull in your data set, distribute it, and process it. The third, very important quality is the large number of third-party libraries written by other people, often professional data scientists who are interested in developing something for Spark. The fact is that Spark is now a hot topic, and everyone is interested in writing something for it. Those are the advantages.

And so, what have you learned today? How to circumvent the problem of large data incorrectly and correctly, why you need large-scale machine learning, and why you should study Spark MLlib. Stay with us. In the next lecture, I will tell you how to train the simplest models in Spark MLlib. It will be interesting.
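As a small preview of the unified Spark MLlib API and the distributed hyper-parameter selection mentioned above, here is a minimal, hypothetical PySpark sketch (the input path, the column names f1, f2, f3 and label, and the parameter grid are assumptions for illustration):

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.ml.evaluation import BinaryClassificationEvaluator

spark = SparkSession.builder.appName("mllib-preview").getOrCreate()

# Hypothetical data set with numeric features f1, f2, f3 and a binary label.
df = spark.read.parquet("/path/to/events.parquet")  # assumed path

# A pipeline chains transformations and the model; everything is applied
# in the same unified way (fit / transform).
assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")
pipeline = Pipeline(stages=[assembler, lr])

# A grid of hyper-parameter values to try.
grid = (ParamGridBuilder()
        .addGrid(lr.regParam, [0.01, 0.1, 1.0])
        .addGrid(lr.elasticNetParam, [0.0, 0.5])
        .build())

# Cross-validation over the grid; each model fit is distributed over the
# cluster. In Spark 2.3+ you can also pass parallelism=N to evaluate
# several parameter combinations at the same time.
cv = CrossValidator(estimator=pipeline,
                    estimatorParamMaps=grid,
                    evaluator=BinaryClassificationEvaluator(labelCol="label"),
                    numFolds=3)

model = cv.fit(df)                 # distributed training and model selection
predictions = model.transform(df)  # applying the best model, same API
```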