The next condition is nearly normal residuals with mean zero.
Remember that some residuals will be positive
and some are going to be negative.
On a residuals plot we look for a random scatter of residuals around zero.
This translates to a nearly normal
distribution of residuals centered at zero.
And we can check this using a histogram or a normal probability plot.
So, once again, using R, we can make a histogram of
our residuals that are stored in the object for the regression model.
And we can also make a normal probability plot
using the functions qqnorm for the plot, and qqline for
the guidance line that we're going to use to
see if the points actually align on a straight line.
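As a minimal sketch of those two plots: the lecture's data and model object are not shown here, so the built-in mtcars data set and the model name fit below are stand-ins, assumed for illustration only.

```r
# Stand-in data: built-in mtcars, NOT the data set used in the lecture.
# 'fit' is a hypothetical name for the regression model object.
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Histogram of the residuals stored in the model object
hist(fit$residuals, main = "Histogram of residuals", xlab = "Residuals")

# Normal probability plot, plus the guidance line to judge
# whether the points fall along a straight line
qqnorm(fit$residuals)
qqline(fit$residuals)
```

With a least squares fit that includes an intercept, the residuals average to zero by construction, so what these plots really tell us about is the shape of the distribution.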
This is what our plots look like.
We are seeing a little bit of a skew in the residuals.
However, the skew doesn't look too bad.
And looking at the normal probability plot as well, except in
the tail areas, we're not seeing huge deviations from the line.
So I think we can say that this condition seems to be fairly satisfied.
The next condition is constant variability of residuals.
We want our residuals to be equally variable for
low and high values of the predicted response variable.
So we check the plot of the residuals versus
the predicted values, that's e versus y hat.
And note that we're using residuals versus predicted, instead of residuals versus x,
because it allows for considering the entire
model with all explanatory variables at once.
We want our residuals to be randomly scattered
in a band with a constant width around zero.
So in other words, we're looking to see nothing that resembles a fan shape.
It is also worthwhile to view the absolute value of residuals versus
the predicted values to identify any unusual observations easily.
As usual, we can easily create both of these plots in R.
Here for example, we have our residuals on our y axis, and
on the x axis we have what R calls the fitted values.
What this basically means is our predicted values, or in other words our y hats.
And we can also calculate the absolute values of these
residuals and plot that against the fitted values as well.
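Here is a rough sketch of how both plots could be made. As before, the built-in mtcars data set and the model name fit are stand-ins assumed for illustration, not the lecture's own data or object.

```r
# Stand-in data: built-in mtcars, NOT the data set used in the lecture.
# 'fit' is a hypothetical name for the regression model object.
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Residuals (y-axis) versus fitted values, i.e. the y hats (x-axis)
plot(fit$fitted.values, fit$residuals,
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs. fitted")
abline(h = 0, lty = 2)  # reference line at zero

# Absolute value of the residuals versus the same fitted values
plot(fit$fitted.values, abs(fit$residuals),
     xlab = "Fitted values", ylab = "|Residuals|",
     main = "Absolute residuals vs. fitted")
```

Using the fitted values on the x-axis means that one plot covers the whole model, no matter how many explanatory variables it has.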
So here's what our plots look like.
The first plot is a residuals versus fitted plot.
We don't see a fan shape here.
It appears that the variability of the
residuals stays constant as the
fitted, or predicted, values change, so
the constant variability condition appears to be met.
The absolute value of residuals plot can be
thought of simply as the first plot folded in half along the zero line.
So if we were to see a fan shape in the first plot,
we would see a triangle in the absolute value of residuals versus fitted plot.
That doesn't seem to be the case here, so it seems like this condition is met as well.
Lastly, independent residuals, and note that
independent residuals basically means independent observations.
If we have any time series structure, or if
we're suspecting that there may be any time series structure
in our data set, we can check for independent residuals
using the residuals versus the order of data collection plot.
If, on the other hand, that is not a consideration, we don't
really have another diagnostic graph that we can use
to check whether the residuals are independent.
Instead, we want to go back to first principles
and think about how the data are sampled.
We've talked numerous times in this course
about what independence of observations means and what
we need in terms of the sampling of the data to obtain independent observations.
So let's quickly take a look to see if this
order of data collection plot looks wonky in any way.
For that, we simply plot our residuals, and
we don't even have to specify anything for our
x-axis, because R will basically plot them in
the order that they appear in our data set.
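A minimal sketch of that plot follows. The built-in mtcars data set and the model name fit are stand-ins assumed for illustration, and the rows of mtcars are not in any real order of data collection, so only the mechanics carry over.

```r
# Stand-in data: built-in mtcars, NOT the data set used in the lecture.
# Its row order is not a true collection order; this only shows the mechanics.
fit <- lm(mpg ~ wt + hp, data = mtcars)  # 'fit' is a hypothetical name

# With no x argument, R plots each residual against its index,
# that is, the order in which the observations appear in the data set
plot(fit$residuals,
     xlab = "Order of data collection", ylab = "Residuals")
abline(h = 0, lty = 2)  # reference line at zero
```

This only makes sense as an independence check if the rows of the data set really are stored in the order they were collected.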
And the order of data collection plot where we have the residuals on the y-axis,
and the order of data collection on the x-axis, does not show any patterns.
If there were some non-independent structure, we would see
these residuals increasing or decreasing, but we don't see any
such pattern, so it appears that any sort of
time series structure is not a consideration for this dataset.