[SOUND] Regression analysis is a statistical technique that uses observed data to relate the dependent variable to one or more independent variables. The objective of regression analysis is to build the regression model that can describe, predict, and control the dependent variable based on the independent variable. Going back to the scatter plot for level of education and per capita income for the 50 states of US, the black line that goes through the individual data points, the blue diamonds, is drawn based on regression equation. This line and its equation is calculated based on a method known as the least square method. Least square method will develop the linear equation, which relates y variable to the x variable. This equation is defined by where it intercepts the y axis, this is known as b0, and its slope, which is calculated by taking the rate of change in y by the rate of change in x and is the slope of the line and is represented by b1. Once we have the regression equation, then for a given x value, we can solve and predict the value for variable y. However, we're not going to be exactly right in our prediction. We will have some error. For example, for this point here, which is the fourth value for x, if you use the equation, the predicted value of y will be here. The difference between our predicted value and the actual observed value is what we call prediction error, or just error or the residual value. So, including this error, the prediction value will look like this. So now, think of fitting the line through all these points. For the data we have, we can actually calculate the error for each point and its estimated value. For points that are below the line, our prediction will be higher value than we observe. And for points that are above the line, our prediction will be a lower value than what we observe. As you saw, some residuals will be positive and some negative. So, adding them all up is not a good assessment of how well the line fits the data. If you consider the sum of the squares of the residuals, then the smaller the sum, the better the fit. The line of best fit is the line for which the sum of the square residuals is smallest. This is what is often called the least squares line. Using the least square method, then the estimated value of y, as predicted by the regression equation, is written as y hat = b0 + b1x. I will probably drop the hat in most of my slides, but technically, the hat over the y is a reminder that we are dealing with prediction. But that is all we will be doing here anyway, so just let's remember and make life a little less cluttered. b1, the slope is based on two sums of squares, and these sums of squares are calculated used the following formulas. Now I can guess what you're thinking, these formulas look less than user friendly. As before, we will be doing the analysis using Excel, so most everything is automated. But to understand the output, it is important to know how they are being derived. So be patient and stay with me, soon it will all come together. Once the slope value of b1 has been calculated, we can calculate b0, the y intercept, by taking the average value of y- the slope times the average value of xs. Imagine having 8 different stores at different locations. You're wondering if the annual sales at a store is dependent on how many people live within a 15 mile radius of the store. If we could establish that in fact this is the case, then in future, when we want to open up a new store, we can look at alternative sites and predict the annual sales for each possible location and then pick the one with the highest predicted sales. Let me take you through how we would solve this. First I will use the equations and later show you how Excel represents the same results to you. Again, my objective here is that by going through the math, Excel would make more sense. These are the equations I will be using. First, I take the data we have and calculate the numbers I need. So for example, 238.7 is the sum of all values in the column with the heading of x. Once I do this for all columns, I would have everything I need to find the slope, and these values are in the row highlighted in yellow. And each number is one of the notations used in the slope calculations. Using the summary values, first find ss of xy. ss stands for sum of squares, which we calculate here. Then, use the numbers to calculate ss of xx. Now we take the ratio of the two to find the slope. This says for every increase in value of x, the annual sale will go up by about $66.94. Slope is the rate of change. Now we can calculate the intercept b0. Again, substituting numbers for the notation, we can find the intercept to be -283.098. Now we can write the regression equation. This is the complete regression equation. Plotting our data, the scatter plot and the regression line will look like this. The intercept is a negative value. Just imagine extending the line until it will intercept with y. That would yield a negative value. And the slope of the line is 66.9459. So again, this means when you increase x by 1, and here, 1 will represent thousands since the numbers were scaled, the annual sale will increase by about 67, which is really $67,000. Now for location with 12,000 nearby, then what is the estimated average annual sales? We will put 12 in x, again, numbers were scaled by thousands, and the answer will be $520,247. Now let's look at what we get if we used Excel to solve this problem for us. Excel returns three parts in this analysis. Right now, I want you to focus on the last table, this part. Focusing on the last table, the highlighted values are showing the coefficients for the intercept, b0, and the coefficient for the random variable population. That's our b1. Now look at what we got when we did the calculations using the equation. Not surprisingly, it is the same. There is a lot more to do than just finding the equation and regression analysis. This is why the output has so much more information than the intercept and the slope of the regression equation. We will be learning about all these in the remaining lessons in this module.