We're going to continue a discussion that we began at the end of the previous class about regression model diagnostics and measures of influence and leverage. When we talk about regression model diagnostics, here are some of the standard quantities that are very often used in those kinds of analyses. Scaled residuals, such as the standardized residual: this is an ordinary least squares residual divided by the square root of the mean square for error, that is, by sigma hat. And then the studentized residual, which is the ordinary residual divided by the actual standard deviation of that residual, the square root of sigma hat squared times one minus h_ii. And then finally, there's the PRESS statistic that we've seen before, and the PRESS statistic is basically the sum of the squares of the prediction errors. Those are the errors in predicting the ith observation from a model that does not include that observation. And those prediction errors e_(i) can be found from your ordinary least squares residuals e_i by simply dividing e_i by one minus h_ii, the corresponding hat diagonal. And then R-square for prediction based on PRESS is simply one minus the PRESS statistic over SS total, and that looks a lot like the ordinary R-square; if you had the error sum of squares up here instead of PRESS, that would be the ordinary R-square. I kind of like this R-square for prediction statistic; I think it gives you an indication of the predictive capability of your regression model. How is it going to do in predicting new data?

For that viscosity example, we can compute the PRESS residuals using the ordinary residuals and the h_ii values found in Table 10.3, and the corresponding value of the PRESS statistic is 5207.7. So, we can substitute that into Equation 10.51, and the R-square for prediction turns out to be about 0.89. So, we would expect this model to explain about 89% of the variability in predicting new data, as compared to the approximately 93% of the variability in the original data explained by the least squares fit. You're never going to predict new data as well as you fit the sample, but the overall predictive capability here seems pretty reasonable. As long as PRESS and the ordinary R-square are pretty similar, one feels fairly confident about the ability of the model to predict new data. And of course, here's Table 10.3, which we've seen before, with the ordinary residuals and the hat diagonals that you see here; they would be used in computing PRESS.

The R-student statistic is also sometimes useful. Studentized residuals are often considered outlier diagnostics, and it's customary to use the mean square error in computing r_i; that's how you estimate sigma squared. Sometimes that's called internal scaling of the residual, because MSE is something that is internally generated from fitting your model to all of the data. An alternative to that would be to use an estimate of sigma squared based on a data set with the ith observation removed. We can denote that estimate of sigma squared by S squared of i, and it's easy to show that S squared of i can be written in the form of Equation 10.52. So now that estimate of sigma squared is used instead of mean square error to get an externally studentized residual, and that's Equation 10.53. That statistic is usually called R-student. In many cases, R-student doesn't differ a whole lot from the studentized residual. But if the ith observation is really quite influential, S squared of i can be quite different from mean square error, and so R-student will be a lot more sensitive to that point.
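To make those formulas concrete, here is a minimal sketch in Python of how the scaled residuals, the PRESS statistic, R-square for prediction, and R-student could be computed from a model matrix and a response vector. The function and variable names are illustrative assumptions, not anything taken from the textbook or Table 10.3.

```python
import numpy as np

def regression_diagnostics(X, y):
    """Sketch of the diagnostics discussed above; X is the n x p model
    matrix (including the intercept column) and y is the response vector."""
    n, p = X.shape

    # Ordinary least squares fit and residuals e_i = y_i - yhat_i
    beta = np.linalg.solve(X.T @ X, X.T @ y)
    e = y - X @ beta

    # Hat diagonals h_ii from H = X (X'X)^{-1} X'
    h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)

    # Mean square error (sigma-hat squared) and the scaled residuals
    ms_e = np.sum(e**2) / (n - p)
    d = e / np.sqrt(ms_e)                    # standardized residuals
    r = e / np.sqrt(ms_e * (1.0 - h))        # studentized residuals

    # PRESS residuals e_(i) = e_i / (1 - h_ii) and the PRESS statistic
    press = np.sum((e / (1.0 - h))**2)

    # R-square for prediction = 1 - PRESS / SS_total (Equation 10.51)
    r2_pred = 1.0 - press / np.sum((y - y.mean())**2)

    # S^2_(i), the estimate of sigma^2 with observation i removed (Eq. 10.52),
    # and the externally studentized residual, R-student (Eq. 10.53)
    s2_i = ((n - p) * ms_e - e**2 / (1.0 - h)) / (n - p - 1)
    t = e / np.sqrt(s2_i * (1.0 - h))

    return d, r, press, r2_pred, t
```

Run on the viscosity data, a sketch like this should reproduce the PRESS value of 5207.7 and the R-square for prediction of about 0.89 quoted above.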
The other interesting thing is that if we make our standard assumptions about the error terms, R-student has a t distribution with n minus p minus one degrees of freedom. So, you could actually do a formal statistical test for outliers using R-student, and Table 10.3 shows the value of R-student for our viscosity data. None of those values of R-student appears to be large enough to cause us concern; the largest one is considerably less than two in absolute value. So there really isn't much indication here of a problem with outliers.

We talked a little bit about influence and leverage in the last lecture. Here's a little more information about that, specifically something called leverage. The influence of points in your sample can be evaluated using Cook's distance measure, but leverage is something a little bit different. Leverage looks at the actual disposition of the points in the x space, where they are located. Remote points have disproportionate leverage on your model parameter estimates, and of course on the predicted values and all of your model summary statistics. And it turns out that the hat matrix that we've talked about before, H equal to X times X prime X inverse times X prime, is very useful in potentially identifying influential observations. H controls the variances and covariances of y hat, the vector of predicted values, and e, the vector of residuals, because the variance of y hat is sigma squared times H, and the variance of the residuals is sigma squared times I minus H. So the element h_ij can be interpreted as the amount of leverage exerted by the observation y_j on the predicted value y hat sub i. So, inspection of those elements could reveal points that are potentially influential just because of their location in the x space. Usually, we pay attention to the diagonal elements h_ii. The sum of the diagonal elements turns out to be the same as the rank of the H matrix, which is the rank of X, which is p, the number of model parameters. So, the average size of a hat diagonal element would be p over n, and a rule of thumb that's proven to be very useful is that if any diagonal element is greater than twice the average, two times p over n, then that is a high leverage point.

So let's go back and apply that to the viscosity data in our example. Two times p over n would be 0.375, because p is 3, we have 3 parameters in the model, and we have 16 observations. So, 0.375 would be the cutoff for a hat diagonal; anything larger than that would indicate that that point is probably a leverage point. And so here are our hat diagonals in Table 10.3, and as I scan those hat diagonals, I don't see any of them that are large enough to cause concern so far as leverage is concerned. The largest hat diagonal, it looks like, is 0.319, and that's observation number 7, and it would have to be bigger than 0.375 in order for us to conclude that it is a leverage point.
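Here's a minimal sketch of that rule of thumb, again in Python with illustrative names: compute the hat diagonals, compare them to the two-times-p-over-n cutoff, and flag anything that exceeds it. For the viscosity model, p is 3 and n is 16, so the cutoff works out to 0.375 as described above.

```python
import numpy as np

def flag_leverage_points(X):
    """Flag potential leverage points using the 2p/n rule of thumb.
    X is the n x p model matrix, including the intercept column."""
    n, p = X.shape
    h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)   # hat diagonals h_ii
    cutoff = 2.0 * p / n                             # e.g. 2 * 3 / 16 = 0.375
    flagged = np.where(h > cutoff)[0]                # indices of high-leverage points
    return h, cutoff, flagged
```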
Something else that we sometimes may want to do in linear regression is to test for lack of fit. We do that in designed experiments, and this is a more general way to describe it. Obviously, what lack of fit is trying to determine is: does the model that you have used adequately fit the data, or should we consider using higher order terms? And the statistical test that we use to do that assumes that the residual sum of squares can be decomposed into two pieces: a sum of squares for pure error and a sum of squares for lack of fit.

Now, how do we get the sum of squares for pure error? Well, to do that, we have to assume that we have some observations that are replicate runs; that is, at the ith level of at least one of your predictors, we have n sub i observations. So let y_ij be the jth observation on the response at the point x sub i, for i equal one, two, up to m. Let's say there are m points where we have these replicated values. And then the ijth residual would simply be y_ij minus the predicted value at that point, y hat sub i, and we can write it as y_ij minus y bar sub i plus y bar sub i minus y hat sub i, where y bar sub i is the average of the n_i observations at x_i. Square both sides of that and sum over i and j, and we get Equation 10.57. The left-hand side of that equation is the usual residual sum of squares, and on the right-hand side, the two terms represent pure error, that's the first one, and then lack of fit.

The pure error sum of squares is simply obtained by computing the corrected sum of squares over the repeat observations at each level of x and then pooling them across the different levels of x. If the assumption of constant variance is appropriate, then this is a model-independent estimate of pure error, and that's what we would call SS_PE in our previous notation. There are n_i minus one degrees of freedom for pure error at each level of x, and so the total number of degrees of freedom associated with pure error turns out to be n minus m, n being the total number of observations and m being the number of levels where we have these repeat observations.

So now your sum of squares for lack of fit just turns out to be a weighted sum of squares between the mean response y bar sub i at each x level and the corresponding fitted value. If the fitted values are close to the corresponding average responses, then there's a good indication that the model fit is appropriate. If the fitted values deviate greatly from the averages, then we probably have a problem with lack of fit. There are m minus p degrees of freedom associated with lack of fit; there are m levels of x, and p degrees of freedom are lost because p parameters are estimated in fitting the model. Computationally, we usually get SS lack of fit by subtracting SS pure error from SS_E. So the F statistic would simply be the mean square for lack of fit, that is, SS lack of fit over m minus p, divided by SS pure error over n minus m, or we could write that as mean square for lack of fit over mean square for pure error. That would be the appropriate statistic for testing lack of fit.
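Here's a minimal sketch of that lack-of-fit test, assuming a single predictor with repeat observations at some of its levels; the function name, arguments, and the single-predictor setup are illustrative assumptions rather than anything from the lecture.

```python
import numpy as np
from scipy import stats

def lack_of_fit_test(x, y, y_hat, p):
    """Sketch of the pure-error / lack-of-fit decomposition of SS_E.
    x: 1-D array of predictor levels containing repeat runs,
    y: responses, y_hat: fitted values from the p-parameter model."""
    n = len(y)
    levels = np.unique(x)
    m = len(levels)

    ss_pe = 0.0   # pure error: pooled corrected SS of the repeats at each level
    ss_lof = 0.0  # lack of fit: n_i-weighted gap between level mean and fitted value
    for xi in levels:
        idx = (x == xi)
        ni = idx.sum()
        ybar_i = y[idx].mean()
        ss_pe += np.sum((y[idx] - ybar_i) ** 2)
        # all repeats at the same x level share the same fitted value y-hat_i
        ss_lof += ni * (ybar_i - y_hat[idx][0]) ** 2

    # F statistic: MS_LOF / MS_PE with (m - p, n - m) degrees of freedom
    f0 = (ss_lof / (m - p)) / (ss_pe / (n - m))
    p_value = stats.f.sf(f0, m - p, n - m)
    return f0, p_value
```

Equivalently, SS lack of fit could be obtained by subtracting the pooled pure-error sum of squares from the residual sum of squares, which is the computational shortcut mentioned above.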