So in the last video, we described the sampling distribution of the least squares estimator, and we also described why the sampling distribution is important in regression analysis. In this video, we'll describe how to derive the sampling distribution of the least squares estimator. Now recall that we fix the predictor values, the Xs, at a given set of values. If we then resample randomly from the population, so we draw random samples from the population, all of size n, we would obtain new response values, and consequently a new and different least squares estimate. So each individual sample from this population would have a different least squares estimate. Now, considered across all of these different random samples of size n, the least squares estimator is a random variable. And the sampling distribution is just defined to be the probability distribution of that least squares estimator, treated as a random variable across all random samples of size n. The distribution of the least squares estimator is useful for the purposes of statistical inference, and we can do statistical inference in the regression context. For example, we may be interested in knowing whether a slope parameter is statistically different from zero, in the way that we described in the last lesson. Really, what that means is we want to know whether there is evidence in the data to suggest that there's a correlation signal in the noise, right? So is there actually a correlation between a particular predictor and the response, or is the fact that the least squares estimate is nonzero just picking up on noise in the data and not a real signal? Hypothesis tests can help us understand that relationship and whether a slope parameter is really different from zero.
And so having the sampling distribution of the least squares estimator, and in particular having the standard error, will allow us to perform those hypothesis tests. Now, we might also be interested in quantifying uncertainties related to the regression parameters, and constructing a confidence interval for a regression parameter is one option for quantifying your uncertainty. And again, confidence intervals, like hypothesis tests, require knowledge of the sampling distribution and the standard error of the least squares estimator. So these are two reasons why we might want to have the sampling distribution. Let's make a note of our assumptions up through this point. Now, if you recall, when we found the least squares point estimate, so when we found beta hat, we didn't need to assume a distribution for the error terms. Instead, what we needed to assume was that the error terms had mean zero, and that the expectation of the response was equal to the linear model. So basically, assumption number two says that the structural form of the regression model is correct. We also assumed that we had a constant variance and that the error terms were independent from one another. The third assumption that we made was that the covariance between distinct error terms was zero, so we really had uncorrelated errors. We could label these as two different assumptions, but I have them grouped here together as one: that each error term has the same variance. So when you take the covariance of an error term with itself, when i is equal to j, you just get the variance, and what this says is that the variance is the same for every error term. Just as an aside, when we worked with the marketing data in the last module, we saw some evidence that the variance for each of the error terms was not the same. If you remember, for high values of the Facebook predictor, it looked like you had a larger variance for sales than you did for low values of the Facebook predictor.
And the last assumption that we made when we solved for the least squares estimator was that the matrix X transpose X inverse exists. We've talked about cases when it will exist and when it won't. Now, in order to perform inference in the regression context, we have to make an additional assumption, and that additional assumption is that the shape of the distribution of the Ys is normal. So this assumption five is really only adding normality as an assumption on the response variable; the mean and the variance are already specified, and they are given in assumptions two and three. Now, five is often a reasonable assumption, and it's one that we'll learn how to check later on in the course. So we'll learn how to analyze a model and decide whether there's any evidence that it violates this normality assumption. But we should also note that the results that depend on this assumption, namely the coverage properties of confidence intervals for the beta vector and the error rates of hypothesis tests related to beta, are relatively robust to deviations from normality. What that really means is that if we have rough or approximate normality, then we'll typically be okay, right? So if our distribution of the response has a little bit of skew to it, or maybe the tails are a little bit heavy, we should be in pretty good shape in terms of still having coverage for our confidence intervals and error rates close to the stated ones. So now that we've stated our assumptions, let's talk about the sampling distribution of the least squares estimator. First, we should note that the least squares estimator is equal to X transpose X inverse times X transpose times Y, and we should analyze this quantity in terms of what is constant and what is random. Remember, in our resampling discussion, the Xs were fixed; we fixed all the predictors at particular values.
And we said that if we resample at those fixed values, we would get different measurements for Y. So what that really means is that this entire first term, which involves the design matrix, the predictors, is a constant with respect to the random variable Y, right? There are no Ys here, just Xs, which we assume to be fixed. So if you analyze beta hat, you'll notice that it's really just a linear combination of, under our fifth assumption, normally distributed random variables. And a linear combination of normals is itself a normal random variable. So what that means is that, under resampling, beta hat, and remember, I use the underline to denote boldface when it's typed, so beta hat has a normal distribution. Now, if we have a normal distribution, we should specify the mean and the variance. In this case, we'll have a variance-covariance matrix, and not just a single number for the variance, because what we're actually writing down here is a multivariate normal distribution; we have a vector. So we're saying the distribution of this vector of size P plus 1 is multivariate normal, and the mean turns out to be just the true vector of parameters. Another way of saying this is that the least squares estimator is unbiased as an estimator of beta, which is a nice property. And then the variance-covariance matrix will be sigma squared times X transpose X inverse. So note again that the least squares estimator is a vector of length P plus 1, the mean will also be a vector of length P plus 1, and this matrix here is of size P plus 1 by P plus 1; a square matrix of size P plus 1 by P plus 1 times a constant still gives you a matrix of size P plus 1 by P plus 1. And what this matrix represents is: on the diagonal of this matrix, you'll have the variance of each component of the least squares estimator.
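To make this concrete, here's a small simulation sketch of the resampling argument. The design, the true beta vector, and sigma are all made up for illustration: we hold X fixed, resample Y many times, refit by least squares each time, and compare the empirical mean and covariance of the estimates to beta and to sigma squared times X transpose X inverse.

```python
import numpy as np

rng = np.random.default_rng(0)

# Fixed design: intercept plus one predictor (values chosen for illustration).
n = 50
x = np.linspace(0.0, 10.0, n)
X = np.column_stack([np.ones(n), x])
beta = np.array([2.0, 0.5])   # hypothetical true parameter vector
sigma = 1.5                   # hypothetical error standard deviation

# Resample Y many times at the same fixed X and refit by least squares.
reps = 20000
Y = X @ beta[:, None] + rng.normal(0.0, sigma, size=(n, reps))
estimates = np.linalg.solve(X.T @ X, X.T @ Y)   # one column per resample

# Unbiasedness: the average estimate should be close to the true beta.
print(estimates.mean(axis=1))                   # ≈ [2.0, 0.5]

# The empirical covariance should be close to sigma^2 (X'X)^{-1}.
theory_cov = sigma**2 * np.linalg.inv(X.T @ X)
print(np.cov(estimates))                        # ≈ theory_cov
```

Each column of `estimates` is the least squares estimate from one resample, so a histogram of either row would look approximately normal, centered at the corresponding true parameter.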
So, for example, the (1,1) entry along the diagonal will be the variance for the intercept term, right, which is the first term, the beta naught, in this least squares estimator. Then if you go to the (2,2) term along the diagonal, you'll have the variance for beta one hat, which is the slope term associated with the first predictor, and so on all the way down the diagonal. Now, the off-diagonal terms in this matrix represent covariance terms. So if we go to an off-diagonal term, say we went down one, to the (2,1) term, that (2,1) term on the off-diagonal would be the covariance between the intercept term and the first slope term. And of course, the variance-covariance matrix will be symmetric, because the (2,1) term and the (1,2) term should be the same: they're both giving the covariance of the intercept with the first slope. With covariance, you can switch the arguments and the number stays the same, right? So we'll prove both of these, actually: that the least squares estimator is unbiased, so if we took its expectation we would get the beta vector, and also that if we take the variance of the least squares estimator, we'll get this quantity here. But first it might be interesting to look at simple linear regression. So if we just had our beta naught hat and beta one hat, what would the distributions of these look like? Well, unsurprisingly, the distributions would be normal, and they would be centered at the true values. So we still have normality, and we still have unbiasedness. And the question is, what does this look like if we just had one predictor and we took the relevant entries of that matrix? Well, it turns out that in the intercept case, the variance would just be a single number.
The variance would be sigma squared times the quantity one over n plus x bar squared over the sum of the (x_i minus x bar) squared. So that would be the variance of the least squares estimator for the intercept term in simple linear regression. And then the variance for the slope term would be sigma squared times one over the sum of the (x_i minus x bar) squared. If you think about an analysis of the variance here, you'll notice that there will be more variability from sample to sample in each of these quantities if there's more variability in the original response data, right? So this sigma squared component says: if you have a lot of variability in the data itself, there will be a lot of variability from sample to sample in your least squares estimator. Same thing here. And it also depends on what n is, right? So the number of data points will impact this term, but so will this denominator and this x bar. So variability in the least squares estimator depends on the variability in the error terms, another way of saying that is the variability in the response data, and also on what the original design points are and on the sample size. All right, so we've used an argument that says linear combinations of normal random variables are normal. That gave us the shape of the distribution for beta hat, but we didn't prove that, in expectation, beta hat is equal to beta. So we didn't prove that we have an unbiased estimator when we use the least squares estimator. And we also didn't show that the variance of beta hat is equal to sigma squared times X transpose X inverse. So let's do that now. First, the expected value of the least squares estimator. Well, that's the same as the expected value of X transpose X inverse times X transpose times Y, and that's just by definition. And then we said, well, this entire first term is constant in our resampling argument, and when you have a constant inside of the expectation, you can take that constant out.
So inside of the expectation, we're just left with the vector Y. Well, the expectation of Y is X times beta, right? That's the expected value of Y, the structural part of the model. And so we should have X transpose X inverse times X transpose, and then times X times beta. This works out really nicely, because X transpose X inverse times X transpose X, this entire quantity here, is the identity matrix. Now, what size identity? Well, this matrix is P plus 1 by P plus 1, and so is this one. So this will be the identity matrix with dimensions P plus 1 by P plus 1, and it's multiplying the beta vector, which is of size P plus 1. So you'll just get beta out of this. So this is equal to beta, and that tells you that we have an unbiased estimator of beta when we use beta hat. All right, so deriving the variance-covariance matrix is a little bit more tricky, but not too much more. So, the variance of beta hat, as we said, is a matrix; it will be a P plus 1 by P plus 1 matrix. And again, let's just start with what we have derived to be the least squares estimator. So we should have the variance of X transpose X inverse times X transpose times Y. Now again, this whole first term is a constant, and we're taking the variance of a constant matrix times the vector Y. If you've never worked with the variance of a vector before, this may be new to you. But basically, when we have the variance of a constant matrix times a vector, we take the constant out in the front, but also take its transpose out in the back. This is analogous to the scalar case: if you just have the variance of a constant times a random variable that's not a vector, you would take the constant out front squared. With linear algebra, we can't simply square things, so we have to take the constant out in two different ways: to the left, and also, transposed, to the right.
So what that means is we'll have out front X transpose X inverse times X transpose, times the variance of the Y vector, times the transpose of X transpose X inverse times X transpose. Okay, so let's simplify a bit. We'll keep the first piece the same: X transpose X inverse times X transpose. Now, the variance of the Y vector will be sigma squared times the n by n identity. So that's the matrix of size n by n that has ones along the diagonal and zeros everywhere else. And the reason for that is that the variance of the Y vector is a variance-covariance matrix, so it should have sigma squared as the variance all along the diagonal, and zeros on the off-diagonals, since we assume that the Ys are independent measurements. Okay, that takes care of this front term and this middle term. What about this last term? Well, let's actually take the transpose. We have the transpose of the product of two matrices, so we take the transpose of each but swap the order. So we'll have the transpose of X transpose, which is just X, times the transpose of X transpose X inverse, which is that matrix itself, since it's symmetric. Now we can simplify this quite a bit if we notice something about the middle term. First, the constant sigma squared we can shift around anywhere we like, because it's a constant, so we can move it out to the front. And then there will be a question of whether we can remove the n by n identity matrix, and as long as we don't mess up any operations, we should be able to do that. If we move the constant out front, we'll have a matrix here that is P plus 1 by n, and the matrix it would multiply, if we got rid of the n by n identity, would be n by P plus 1. So these matrices can multiply each other; it's a well-formed multiplication. That means that keeping the identity in here makes no difference.
So we can drop the n by n identity, and we'll be left with X transpose X inverse times X transpose, times X, times X transpose X inverse. Again, that looks great, because we have X transpose X times the inverse of X transpose X. This whole term here will be the identity matrix of size P plus 1, which means it drops out, and we're just left with sigma squared times X transpose X inverse. So we've done it: we've shown that the least squares estimator is unbiased, and we've also derived its variance-covariance matrix. And that finishes off this entire proposition, which is that the sampling distribution of the least squares estimator is normal, with mean the true parameter vector, and with the variance-covariance matrix stated here.
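Both steps of the proof can be sanity-checked numerically. The sketch below, with a made-up design, beta, and sigma squared, verifies the key identity X transpose X inverse times X transpose X equals the identity, which made the mean work out to beta, and verifies that the sandwich A times sigma squared I times A transpose, with A equal to X transpose X inverse times X transpose, collapses to sigma squared times X transpose X inverse.

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up design: intercept plus two predictors.
n, p = 40, 2
X = np.column_stack([np.ones(n), rng.uniform(0.0, 5.0, size=(n, p))])
beta = np.array([1.0, -2.0, 0.3])   # hypothetical true parameters
sigma2 = 2.0                        # hypothetical error variance

A = np.linalg.inv(X.T @ X) @ X.T    # the constant matrix in beta_hat = A Y

# Mean step: A X = (X'X)^{-1} X'X is the identity, so A (X beta) = beta.
print(np.allclose(A @ X, np.eye(p + 1)))      # True
print(np.allclose(A @ (X @ beta), beta))      # True

# Variance step: A (sigma^2 I_n) A' simplifies to sigma^2 (X'X)^{-1}.
lhs = A @ (sigma2 * np.eye(n)) @ A.T
rhs = sigma2 * np.linalg.inv(X.T @ X)
print(np.allclose(lhs, rhs))                  # True
```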
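And the two simple linear regression variance formulas stated earlier, sigma squared times (1/n plus x bar squared over the sum of (x_i minus x bar) squared) for the intercept and sigma squared over the sum of (x_i minus x bar) squared for the slope, are just the diagonal entries of sigma squared times X transpose X inverse in the one-predictor case. A quick check, with made-up x values and error variance:

```python
import numpy as np

# Made-up predictor values and error variance for the check.
x = np.array([1.0, 2.0, 4.0, 7.0, 9.0, 10.0])
n = len(x)
sigma2 = 4.0

# Matrix form: the diagonal of sigma^2 (X'X)^{-1}.
X = np.column_stack([np.ones(n), x])
cov = sigma2 * np.linalg.inv(X.T @ X)

# Scalar formulas from simple linear regression.
Sxx = np.sum((x - x.mean()) ** 2)
var_intercept = sigma2 * (1.0 / n + x.mean() ** 2 / Sxx)
var_slope = sigma2 / Sxx

print(np.isclose(cov[0, 0], var_intercept))   # True
print(np.isclose(cov[1, 1], var_slope))       # True
```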