For model two, we can do the same thing.

So we can take the observed data and

we can subtract off the model fit where we only included the batch variable.

So now we have two measurements of how close are the model fits to the real data.

You can actually look at that in the data example that I've shown.

So here I'm actually showing you the residuals for

the model fit where you just fit an overall mean, and

the residuals where you fit a mean for each different level.

So here you can see that the residual sum of squares, that is, the squared distance from the actual values to the model fit values,

summed up over all of the different values, is equal to 22.05.

And here when you actually fit the model that includes a term,

an average in each of the different cancer statuses,

you actually get a lower value of this residual sum of squares.

Now, you should always expect that.

Whenever you include more variables in the model,

the residual sum of squares will go down.
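As a concrete sketch (with made-up numbers, not the actual data from the slides), here is how the two residual sums of squares could be computed, with a null model that fits only an overall mean and an alternative model that fits a separate mean for each group:

```python
import numpy as np

# Hypothetical measurements and a two-level group label (illustrative only).
y = np.array([1.0, 1.2, 0.9, 2.1, 2.3, 1.9])
group = np.array([0, 0, 0, 1, 1, 1])

# Null model: the fitted value for every sample is the overall mean.
fit0 = np.full_like(y, y.mean())
rss0 = np.sum((y - fit0) ** 2)

# Alternative model: the fitted value for each sample is its group's mean.
fit1 = np.array([y[group == g].mean() for g in group])
rss1 = np.sum((y - fit1) ** 2)
```

No matter what the data are, rss1 can never exceed rss0, because the alternative model can always reproduce the null model's fit by setting both group means equal to the overall mean.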

The question is whether it goes down by enough to suggest the bigger model is genuinely a better fit.

So this is the statistic that people use to quantify that.

So this is the commonly used F statistic.

So these RSS terms are the residual sums of squares, computed under the null model when the subscript is 0 and under the alternative model when the subscript is 1.

So here we're taking the difference between the sum of square fits for

the null model and the alternative model.

And what does that mean?

So we know that the residual sum of squares under the null model, the model where we don't include the phenotype variable,

will always be at least as big, because we've included fewer variables in that model.

So this term on the top is never negative, and it quantifies how much the residual sum of squares shrinks by including that variable in the model.

Then we've standardized that to the units of the alternative model fit.

So basically we're saying the alternative model includes all the variables we might fit,

and asking: how many units of residual sum of squares did we change by going from the null model fit to the alternative model fit?

And then the term out in front quantifies the difference in the number of parameters,

which standardizes this difference no matter how many variables we include in the null and alternative models.

So again, n is the total number of samples that we have.

p1 is the total number of parameters in the alternative model.

And p0 is the total number of parameters in the null model.

And so you can see here, we're dividing this difference, RSS0 minus RSS1, by the difference in the number of parameters in those two models.

And similarly, we're doing something down here.

We're standardizing this RSS1 by the total number of observations

minus the total number of parameters in that model.
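Putting those pieces together, the F statistic described above can be sketched as a small function; the alternative-model RSS and the sample and parameter counts below are made-up illustrative values (only the null RSS of 22.05 comes from the example in the slides):

```python
def f_statistic(rss0, rss1, n, p0, p1):
    """F = ((RSS0 - RSS1) / (p1 - p0)) / (RSS1 / (n - p1)).

    rss0, rss1: residual sums of squares under the null and alternative models.
    n: number of samples; p0, p1: number of parameters in each model.
    """
    return ((rss0 - rss1) / (p1 - p0)) / (rss1 / (n - p1))

# Hypothetical example: n = 20 samples, a null model with p0 = 1 parameter
# (an overall mean) and an alternative model with p1 = 2 parameters
# (one mean per cancer status); rss1 = 15.0 is made up for illustration.
f = f_statistic(rss0=22.05, rss1=15.0, n=20, p0=1, p1=2)
```

A large value of F means the extra variable shrank the residual sum of squares by a lot relative to the noise left over in the alternative model.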

So this standardization allows us to come up with a standard form for the distribution of this statistic,

just like standardizing to standard deviation units for the t statistic gave us a standard form there.
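Under the null model, this standardized statistic follows an F distribution with p1 minus p0 numerator and n minus p1 denominator degrees of freedom, so a p-value can be read off its upper tail. A minimal sketch, assuming SciPy is available and using hypothetical numbers:

```python
from scipy import stats

# Hypothetical values matching the notation above:
# dfn = p1 - p0 (numerator df), dfd = n - p1 (denominator df).
f_value, dfn, dfd = 8.46, 1, 18

# The survival function gives P(F >= f_value) under the null model.
p_value = stats.f.sf(f_value, dfn, dfd)
```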

So we can use this statistic to quantify whether the model fit for the null model is substantially worse than the model fit for the alternative model.