Hi, this is Professor De Liu from the Carlson School. This is Week 2 of this module. Our subject today is multiple linear regression. We're going to talk about what multiple linear regression is, how you can use it for predictive modeling, and how you can implement it in Excel. Let's begin.
In this module, we have multiple objectives. First, we want you to understand what multiple linear regression is. Then we are going to learn how to fit multiple linear regression models in Excel using various Excel tools. Next, we want to use these models for prediction, so we are going to learn about using the TREND function as well as the Regression tool for predictions. Last but not least, we want you to understand the purpose of model selection, and we are going to learn some simple techniques for selecting the number of variables to include in your multiple linear regression.
This module has three lessons. In Lesson 1, we are going to introduce multiple linear regression. As we learned in the previous module, simple linear regression involves a single independent variable x, and the model f(x) is linear. Multiple linear regression involves two or more independent variables. Instead of a single x, you might have three x's or maybe 30 x's. The model is still going to be linear.
Why multiple linear regression? Oftentimes, our outcome is influenced or predicted by multiple factors. Take the example I mentioned earlier about pricing a house. At first I thought total square footage was going to be the main factor that influences house price. Then I realized it's not that simple. House price is also affected by the number of bedrooms, since most people like more bedrooms, and by the size of the garage: does it hold one car or two? And I haven't even mentioned location, which could also play an important role in the price of a house.
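To make the housing example concrete, here is a minimal sketch in Python rather than Excel. It fits price as a linear function of square footage, bedrooms, and garage size by ordinary least squares, then predicts the price of a new house. The data, the coefficients used to build it, and the helper names `fit_multiple_linear_regression` and `predict` are all made up for illustration; they are not from the course materials.

```python
import numpy as np

# Made-up illustration data (an assumption, not from the lecture).
# Columns: total square footage, number of bedrooms, garage size (cars).
X = np.array([
    [1200, 2, 0],
    [1500, 3, 1],
    [1800, 3, 2],
    [2100, 4, 2],
    [2400, 4, 1],
    [1000, 2, 1],
], dtype=float)

# Made-up prices, built exactly as
# 50000 + 100*sqft + 10000*bedrooms + 8000*garage (no noise),
# so the fit below should recover these coefficients.
price = np.array([190000, 238000, 276000, 316000, 338000, 178000], dtype=float)


def fit_multiple_linear_regression(X, y):
    """Return [intercept, b1, ..., bk] that minimize the squared error."""
    # Prepend a column of ones so the first coefficient is the intercept.
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


def predict(coef, X_new):
    """Predict the outcome for one or more rows of independent variables."""
    X_new = np.atleast_2d(X_new)
    return coef[0] + X_new @ coef[1:]


coef = fit_multiple_linear_regression(X, price)
print("intercept and slopes:", coef)

# Predict the price of a hypothetical 2000 sq ft, 3-bedroom, 2-car house.
predicted_price = predict(coef, [2000, 3, 2])[0]
print("predicted price:", predicted_price)
```

Each slope is read as the change in predicted price for a one-unit change in that variable while the others are held fixed. Real data would include noise, so the fitted coefficients would only approximate the underlying relationship.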
In another example, the monthly heating cost of a house is determined by multiple factors. First of all, it's a function of outside temperature: the higher the temperature, the lower the heating bill. It also depends on the size of the house, since a bigger house costs more to heat. For houses of the same age, the better the attic insulation, the smaller the heating bill. Then there's the age of the furnace: if you have a very old furnace, it's less efficient, and the heating cost will be higher. These are examples of outcome variables that are affected by multiple factors.
In addition to having multiple factors, we also need to make sure that each independent variable varies in a linear manner with the outcome variable. This is the "linear" part of multiple linear regression. For example, in the housing model earlier, we expect the price to vary linearly with total square footage, and also to vary linearly with the number of bedrooms, and so on.
You might wonder: this seems like a restrictive model. How do I know that the relationship is linear? What if these variables have a nonlinear relationship with the outcome? Indeed, there are many nonlinear relationships in the real world, but linear models are still very popular for a number of reasons. First, the linear model is simpler, and it's the first thing you think of when you try to model something. Try the linear relationship first; if it doesn't work, then try something more sophisticated. Second, linear models are much more efficient to estimate. Third, in the real world, linear models are still good approximations in many different situations. But you do need to remember that multiple linear regression assumes a linear relationship.
In practice, we have a way of checking whether a relationship is approximately linear. To do that, we often need to look at plots. For example, one of the variables in the housing model is square footage, so we can plot square footage against price. Suppose your plot looked like this.
You can look at the scatter plot and see if the relationship looks linear. To me, this looks linear, because if I draw a straight line through the dots, the dots are distributed around the regression line, so a linear model seems appropriate.
It is also useful to know what kinds of plots indicate a nonlinear relationship. I'll draw a few examples and let's look at them. The first example doesn't look linear; it looks like a nonlinear increase. If I draw a straight line across it, the dots deviate from the line in a systematic way, so this is a nonlinear curve. The second example is also nonlinear. This shape is sometimes called U-shaped or inverted U-shaped; in this case, it's an inverted U.
What about the third one? Does this look like a linear relationship? If you were to draw a regression line, it would be almost flat, and the dots are quite dispersed around it. This represents the case where there's no relationship. x doesn't have much predictive value for y, because y seems to be distributed randomly around its mean regardless of the x value.
What about the last one? This case is like Case 3, but we can see a slight increasing trend. If you draw a regression line, it should look like this. Yes, there is a little bit of slope, but many of the dots are quite far away from the regression line. I would say this case represents a weak linear relationship, because you see some slope, but the dots are quite far away from the line. I hope these examples help you decide whether a variable has a linear relationship with your outcome variable.
Let's look at two other variables in the housing model. One is the size of the garage. Garage size is discrete: zero, one, and two are the only possible values.
The scatter plot is not as nice as the previous ones, but you can tell it still represents a linear relationship. The ideal case for a discrete x variable is that, at each of the three discrete values, every y value falls exactly on the regression line. In this case, the dots are a little far away from it, but overall they're still spread on both sides of the regression line, so this does look like a linear relationship.
Similarly, consider the number of bedrooms. Here you might be able to make the case that the relationship is slowly curving upward, so you might consider drawing a line that curves upward like this. But it's not too far from linear, so we can still use a linear model as an approximation of the relationship between the number of bedrooms and the price.
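The lecture checks linearity by eye, but a rough numeric companion to the scatter plots can be sketched in Python. The idea, which is an assumption of this sketch and not a method from the lecture, is to compare R-squared for a straight-line fit against R-squared for a quadratic fit. A low value for the line together with a much higher value for the curve hints at curvature; low values for both hint at no relationship. All data is synthetic, and the helper name `r_squared` is hypothetical.

```python
import numpy as np


def r_squared(x, y, degree):
    """R-squared of a polynomial fit of the given degree (1 = straight line)."""
    coeffs = np.polyfit(x, y, degree)
    fitted = np.polyval(coeffs, x)
    ss_res = np.sum((y - fitted) ** 2)      # unexplained variation
    ss_tot = np.sum((y - np.mean(y)) ** 2)  # total variation around the mean
    return 1.0 - ss_res / ss_tot


# Synthetic illustration data (an assumption, not from the lecture).
x = np.arange(1, 11, dtype=float)
# Deterministic stand-in for noise: +1, -1, +1, ... so results are repeatable.
wiggle = np.where(np.arange(10) % 2 == 0, 1.0, -1.0)

patterns = {
    "linear": 2.0 * x + 1.0 + 0.5 * wiggle,       # dots close to a line
    "inverted U": -(x - 5.5) ** 2,                # rises, then falls
    "no relationship": 5.0 + wiggle,              # y scatters around its mean
    "weak linear": 0.5 * x + 2.0 * wiggle,        # some slope, wide scatter
}

results = {}
for name, y in patterns.items():
    r2_line = r_squared(x, y, degree=1)
    r2_curve = r_squared(x, y, degree=2)
    results[name] = (r2_line, r2_curve)
    print(f"{name:16s} line R^2 = {r2_line:6.3f}   quadratic R^2 = {r2_curve:6.3f}")
```

This is only a screening aid. A quadratic fit never scores lower than a straight line on the same data, so a small gain means little, and the scatter plot remains the primary check.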