So, we've learned a lot, but there are a few things we did not talk about. Should I have more data? Maybe the line is bad because I only have 20 points, and I should collect more. Should I add more explanatory variables? I'm getting only 29 percent explained, but as a house buyer I want at least 80 percent. Should I add things like schools and various other factors? Or maybe the median value is not the best thing to try to predict using these variables; you might prefer to predict a house value on a particular street. Other problems with data are outliers: somebody's mansion out there that is messing up my whole problem, or maybe a park with all the houses around it being so valuable. Sometimes some data is missing, and we can treat missing data in many ways. These and many more are more sophisticated ways of trying to improve the model. In this course we will not talk much about them, but I want you to be aware of them. We will talk about the second one, more or fewer features, as we go forward into Modules 3 and 4.
What I would like you to do, if possible, is attempt this exercise. In the R library there is the Motor Trend US magazine car data, on very old cars. It covers 10 aspects of automobile design for 32 automobiles, and the exercise is to relate them to miles per gallon: what explains MPG? Here are some of the variables. If you load the dataset, you can look at them, and if you love cars, it is a good chance to think about whether these are the things that matter: the weight of a car, things like that. At this point, you must lay a bet with yourself: are these good enough to explain MPG? That's your intuition speaking. Now, let the data speak through some univariate plots. Ask yourself whether these univariate plots make sense to you.
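
As a rough stand-in for the univariate plots, here is a minimal Python sketch that summarizes three of the variables one at a time. The lecture works in R; Python is used here only for illustration. The lists are the `mpg`, `wt` and `hp` columns of R's `mtcars` dataset typed in by hand, and the names `summarize` and `bin_counts` are hypothetical helpers, not part of any library.

```python
import statistics

# mpg, wt (in 1000 lb) and hp columns of R's `mtcars`, typed in by hand.
MPG = [21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8,
       16.4, 17.3, 15.2, 10.4, 10.4, 14.7, 32.4, 30.4, 33.9, 21.5, 15.5,
       15.2, 13.3, 19.2, 27.3, 26.0, 30.4, 15.8, 19.7, 15.0, 21.4]
WT = [2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150, 3.440,
      3.440, 4.070, 3.730, 3.780, 5.250, 5.424, 5.345, 2.200, 1.615, 1.835,
      2.465, 3.520, 3.435, 3.840, 3.845, 1.935, 2.140, 1.513, 3.170, 2.770,
      3.570, 2.780]
HP = [110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, 180,
      205, 215, 230, 66, 52, 65, 97, 150, 150, 245, 175, 66, 91, 113, 264,
      175, 335, 109]


def summarize(values):
    """Return min, quartiles, median, mean and max for one variable."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values), "q1": q1, "median": median,
        "mean": statistics.fmean(values), "q3": q3, "max": max(values),
    }


def bin_counts(values, bins=5):
    """Count values in equal-width bins spanning min..max (a crude histogram)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        index = min(int((v - lo) / width), bins - 1)  # the maximum goes in the last bin
        counts[index] += 1
    return counts


for name, column in [("mpg", MPG), ("wt", WT), ("hp", HP)]:
    stats = summarize(column)
    print(name, {key: round(value, 2) for key, value in stats.items()})
    for i, count in enumerate(bin_counts(column)):
        print(f"  bin {i + 1}: {'#' * count}")
```

The question to ask of the output is the same one asked of the plots: do the ranges and the shape of each distribution match what you would expect of cars from that era?
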
Based on your knowledge of cars (today's cars are very different, of course, but think maybe 20 years back), do the distributions show the kind of numbers you would expect?
After we do that, we run these regressions. First, look at weight as an explanation for MPG. You will find that weight explains almost 75 percent of the variation in miles per gallon, which makes sense: lighter cars get better mileage. You can show that for every one-unit increase in weight, there is a decrease of 5.39 miles per gallon. It is the same with horsepower: the more horses you have, the lower the MPG. If you run this regression, you can show that MPG decreases for every unit increase in horsepower, and about 60 percent of the variation is explained by it.
You say, "Hey, they don't add up: 75 percent and 60 percent is 135 percent." No. When you put them together, you will see that, obviously, you need more horses when the car is heavier. So horsepower is also explaining part of the effect of weight. That is why I want you to run a multiple regression. Before I forget, remember that I got these numbers using an 80/20/0 partition and a random seed of 42.
When you run a single variable, we call it a simple regression. When you run multiple variables, it is a multiple regression. It looks like this: this is MPG, this is the weight of the car, and this is the horsepower of the car. When you fit it, you are able to explain 82 percent, not 75 plus 60, or 135 percent, which makes sense because we know that horsepower is also a substitute for weight. So the equation you get is this, and the coefficients are different. What I would like you to explain is how good this fit is. Go look at the output, look at the visuals, and write a few words saying how good the fit is and how good the model is.
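
The two simple regressions and the multiple regression described above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the lecture's own procedure: the lecture uses R with an 80 percent training partition, while this sketch fits all 32 cars, so the numbers will be close to the lecture's but not identical. The data are the `mpg`, `wt` and `hp` columns of R's `mtcars` typed in by hand, and `solve` and `ols` are hypothetical helper names.

```python
# mpg, wt (in 1000 lb) and hp columns of R's `mtcars`, typed in by hand.
MPG = [21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8,
       16.4, 17.3, 15.2, 10.4, 10.4, 14.7, 32.4, 30.4, 33.9, 21.5, 15.5,
       15.2, 13.3, 19.2, 27.3, 26.0, 30.4, 15.8, 19.7, 15.0, 21.4]
WT = [2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150, 3.440,
      3.440, 4.070, 3.730, 3.780, 5.250, 5.424, 5.345, 2.200, 1.615, 1.835,
      2.465, 3.520, 3.435, 3.840, 3.845, 1.935, 2.140, 1.513, 3.170, 2.770,
      3.570, 2.780]
HP = [110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, 180,
      205, 215, 230, 66, 52, 65, 97, 150, 150, 245, 175, 66, 91, 113, 264,
      175, 335, 109]


def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        # Partial pivoting: bring the largest remaining entry into place.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * p for x, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ols(y, *predictors):
    """Least-squares fit of y on an intercept plus the given predictors.

    Returns (coefficients, r_squared); coefficients[0] is the intercept.
    """
    rows = [[1.0, *xs] for xs in zip(*predictors)]
    k = len(rows[0])
    # Normal equations: (X'X) beta = X'y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(b * x for b, x in zip(beta, r)) for r in rows]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return beta, 1.0 - ss_res / ss_tot


wt_fit, wt_r2 = ols(MPG, WT)            # simple regression: mpg on wt
hp_fit, hp_r2 = ols(MPG, HP)            # simple regression: mpg on hp
joint_fit, joint_r2 = ols(MPG, WT, HP)  # multiple regression: mpg on wt and hp
wt_on_hp, _ = ols(WT, HP)               # how weight moves with horsepower

# In least squares, the simple hp slope equals the joint hp slope plus the
# joint wt slope times the slope of wt on hp: hp alone borrows part of wt's effect.
implied_hp_slope = joint_fit[2] + joint_fit[1] * wt_on_hp[1]

print(f"mpg on wt:      slope {wt_fit[1]:.3f}, R^2 {wt_r2:.3f}")
print(f"mpg on hp:      slope {hp_fit[1]:.4f}, R^2 {hp_r2:.3f}")
print(f"mpg on wt + hp: wt {joint_fit[1]:.3f}, hp {joint_fit[2]:.4f}, R^2 {joint_r2:.3f}")
print(f"hp slope implied by the joint fit: {implied_hp_slope:.4f}")
```

On all 32 cars this should give slopes of roughly -5.34 for weight alone and -0.068 for horsepower alone (R^2 about 0.75 and 0.60), and roughly -3.88 and -0.032 in the joint fit (R^2 about 0.83). The last line shows why the R^2 values do not add up and why the coefficients shrink: because heavier cars tend to have more horsepower, each variable fitted alone picks up part of the other's effect.
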
Do you trust the parameters, judging by the t value and the p value? Second, try to explain why the joint estimates are different from the individual estimates. Remember, for example, that the coefficient for horsepower in the simple regression is 0.072, whereas when you run the multiple regression you will find only 0.033. Why does that happen? Do a visual test of the predicted values against what is observed and see what happens to the line. Maybe you want to get fancier and do other tests.
So, in summary, what have you done in this module? Let me be very frank: I really didn't want to talk just about regression in the whole course. Regression is a handy tool, and today it is one of those tools that is almost fully developed, so you can press a button and get everything you want. Some of the other tools we're going to see are not as developed, but you now have a sense of how to develop a model that tests your intuition and uses data to explain something.
Second, I hope I gave you a sense of why we do data visualization. We talked about scatter plots, which are very good for seeing the relationship between two variables: how does one variable behave with respect to another? That led us to a concept called correlation. We saw that, without looking at the algebra, just by pressing a few buttons, we are able to generate models, and that is why many of these tools have been canned into software. By the way, I see a future in which a machine can even interpret the model for you and say, "Hey, there's a problem here. You may have to add some variables. Maybe this is a curvilinear fit." Exactly the things I told you are probably things we can code into a machine. We also looked at the interpretation of the output, and this is another thing that is machine interpretable.
Is the model behaving the way we want? Is the fit linear? Are there outliers? What do I do with missing data? All of that a machine can probably be trained to do. I think the human being comes in right at the beginning, saying, "This is my intuition, this is the data we need," and again at the last step: how do I improve it? Do I believe this model? Even there, a machine may be able to say how generalizable the model is. Do I need more data? Do I need more variables? Or you as a human being might be able to say, "Hey, I forgot this important variable." A machine can't tell you that. So once again, the model is one piece. It fits into a lot of other steps: stating the problem, understanding the data, developing a model, understanding it, interpreting it, and going back to ask, "Do I believe it? Can I improve it?" So I hope that at this moment you are comfortable understanding what explanatory modeling is.