Remember, we don't want to just blindly get rid of outlying
points, because those actually might be the most interesting cases.
Perhaps these stars that are much colder than the
other ones are indeed more interesting to look at.
But what we don't want to do is lump them in with the
stars that have a higher temperature and try to model all of them together.
One last remark on influential points.
Let's take a look at this statement and evaluate whether it's true or false.
Influential points always reduce R squared.
It is true that influential points tend to make life more difficult.
But is it true that they always reduce R squared?
Let's take a look at these two graphs, one in
which we have an influential point and one in which we don't.
The first plot does not have an influential point.
And we can see that the regression line looks fairly horizontal,
indicating that there's little to no relationship between x and y.
In the second plot, we have an influential point that is far away from the trajectory
of the original regression line, and hence pulls the regression line toward itself.
In the first plot, the correlation coefficient is very low, just
0.08, and hence R squared is pretty low as well, at 0.0064.
In the second plot, however, all of a sudden, we're seeing an increase in
our correlation coefficient and, along with
it, an increase in our R squared.
So, even though we would never want to fit a linear model in the second plot,
we are actually seeing a much higher correlation and a much higher R squared.
This is a good reminder to always view a scatter plot before fitting a model.
If we were deciding whether or not the model
is a good fit simply by looking at the correlation coefficient and R
squared, we would never catch the anomaly in the data: the fact
that there is only one influential point that's driving the entire relationship.
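
To make the point concrete, here is a minimal sketch in Python, assuming NumPy is available. The numbers are made up for illustration and are not the data behind the two plots; the helper name r_squared is also just an assumption for this example. It computes R squared as the square of the correlation coefficient, which holds for a simple linear regression with one explanatory variable, first without and then with a single far-away point.

```python
import numpy as np


def r_squared(x, y):
    """R squared for simple linear regression: the squared correlation coefficient."""
    r = np.corrcoef(x, y)[0, 1]
    return r ** 2


# Made-up data (not the data from the plots): y alternates between +1 and -1,
# so there is almost no linear relationship between x and y.
x = np.arange(10, dtype=float)
y = np.where(x % 2 == 0, 1.0, -1.0)

# The same data plus one hypothetical point that lies far away from the rest.
x_inf = np.append(x, 30.0)
y_inf = np.append(y, 20.0)

r2_without = r_squared(x, y)
r2_with = r_squared(x_inf, y_inf)

print(f"R squared without the influential point: {r2_without:.3f}")
print(f"R squared with the influential point:    {r2_with:.3f}")
```

With these made-up values, R squared should come out close to zero for the first data set and much higher once the single extra point is added, even though a scatter plot would show right away that a straight line is not a sensible model for the second data set.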