Learn fundamental concepts in data analysis and statistical inference, focusing on one and two independent samples.

From the course by Johns Hopkins University

Mathematical Biostatistics Boot Camp 2

From the lesson

Hypothesis Testing

In this module, you'll get an introduction to hypothesis testing, a core concept in statistics. We'll cover hypothesis testing for basic one and two group settings as well as power. After you've watched the videos and tried the homework, take a stab at the quiz.

- Brian Caffo, PhD, Professor, Biostatistics

Bloomberg School of Public Health

Okay, so as a bit of an aside, since we're talking about

paired data, I thought I'd talk for a minute about regression to the mean,

because it's a historically famous topic, and it

involves one of the

eminent characters in the discipline of statistics.

That person's name is Francis Galton.

Francis Galton was a cousin of Charles Darwin, and he invented quite

a few topics in statistics. He was the first to

recognize this phenomenon: when you have matched data, high

initial observations tend to be associated with lower second observations,

and low initial observations tend to

be associated with higher second observations.

So the example that he gave was that sons of very tall

fathers tended to be a little bit shorter than their fathers.

Still tall, but they tended to be

shorter. And, not paradoxically but seemingly

paradoxically, fathers of very tall sons tended to be a little bit shorter than their sons.

And as an example from what we were talking about

today, second exams for those who scored very high on the

first exam tended to be a little bit lower.

Whereas first exams for those who scored very high on

the second exam tended to be a little bit lower.

So let's talk a little bit about why this phenomenon occurs.

Okay, to see why this occurs, imagine that the test scores were completely random

and that the students' scores were iid draws from

that distribution. Then the

highest observations on test one were just random observations,

so the probability of a second observation being that high is quite low.

It's more likely to be in the center of the distribution.

Conversely, for a very low test score, something that had a very low probability

of occurring, the probability

of a second test being that low is small.

And so if the

pairs of observations are exactly noise, then you'll

get a lot of regression to the mean. Let's consider the other extreme.

Let's imagine that the test was a perfect adjudicator of students' abilities,

that it was a perfectly calibrated instrument,

and that there was no noise.

Then each student should ideally get exactly the same score on both exams,

at which point there'd be no variation around an identity line on

this plot here that shows test 1 by test 2. Okay?

Now, those are the 2 extremes.

One is complete variation and no trend.

And the other is 100 percent correlation,

basically all trend, where the test was a perfect instrument.

And of course every practical case

lies somewhere in between.
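
The two extremes and the in-between case can be illustrated with a small simulation. This is only a sketch under assumed settings, not the lecture's data: each score is modeled as a shared "ability" plus independent noise, and the names `simulate_pairs` and `top_decile_change`, the sample size, and the standard deviations are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_students = 10_000  # assumed sample size, chosen only for a stable illustration

def simulate_pairs(ability_sd, noise_sd):
    """Return two scores per student: shared ability plus independent noise."""
    ability = rng.normal(0.0, ability_sd, n_students)
    test1 = ability + rng.normal(0.0, noise_sd, n_students)
    test2 = ability + rng.normal(0.0, noise_sd, n_students)
    return test1, test2

def top_decile_change(test1, test2):
    """Average change (test2 - test1) among the top 10% of test 1 scorers."""
    top = test1 >= np.quantile(test1, 0.9)
    return float(np.mean(test2[top] - test1[top]))

scenarios = {
    "pure noise": simulate_pairs(ability_sd=0.0, noise_sd=1.0),
    "perfect test": simulate_pairs(ability_sd=1.0, noise_sd=0.0),
    "in between": simulate_pairs(ability_sd=1.0, noise_sd=0.5),
}
changes = {name: top_decile_change(t1, t2) for name, (t1, t2) in scenarios.items()}

for name, change in changes.items():
    print(f"{name:>12}: mean change for top scorers = {change:+.2f}")
```

Under these assumptions the top scorers drop the most when the scores are pure noise, do not move at all when the test is a perfect instrument, and drop by an intermediate amount in between.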

So here, as an example: of the eight people who

got below 80 on test one, all but one

did better on test two, and of the five people who got

above a 95 on test one, three did worse on the second test.

So it's not a tremendous amount of regression to the

mean, but some certainly is there, and

here I draw the identity line.

Okay, so let's discuss this phenomenon a little bit more.

And we're going to assume that the data have been normalized.

So what does that mean?

That means we want the data to have mean zero and variance one.

So in this case the mean of the first test

was 87, the mean of the second test was about 90.

And the standard deviations of the first test and the second test

were both about six, so for every first exam we will have taken the

exam score, subtracted off 87, and divided by 6, and then for every

second test we would've subtracted 90 and divided by 6.

And of

course we would've done it with the exact numbers,

not just the rounded numbers like I'm suggesting.

And then in that case the empirical mean for test one would be zero.

And the empirical standard deviation for test one would be one.

And the empirical mean for test 2 would be 0,

and the empirical standard deviation for test 2 would be 1.

So that gets rid of any sort of shift

effects, or scale effects.
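
The normalization described here can be sketched as follows. The scores in `test1` are made-up numbers, not the class data, and `normalize` is just an illustrative helper name.

```python
import numpy as np

def normalize(scores):
    """Subtract the empirical mean and divide by the empirical standard deviation."""
    scores = np.asarray(scores, dtype=float)
    return (scores - scores.mean()) / scores.std(ddof=1)

# Hypothetical test 1 scores, for illustration only.
test1 = np.array([78.0, 85.0, 91.0, 96.0, 82.0, 90.0])
z1 = normalize(test1)

print(z1.mean())        # essentially zero, up to floating-point error
print(z1.std(ddof=1))   # essentially one
```

Using the exact empirical mean and standard deviation, rather than rounded values such as 87 and 6, is what makes the normalized mean and standard deviation come out as zero and one.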

Now, when we were talking about testing with paired

data, we were looking exactly at shifts in the mean.

So here we've gotten rid of the

mean information by recentering the data at zero.

So if there were little regression to the mean,

the data would scatter closely about an identity line,

and if there were exactly no

regression to the mean,

the data would fall perfectly on an identity line.

But the more scatter there is about the

line, the more regression to the mean there is.

So the best-fitting line goes through the

average, and since we normalized our data, it goes through the point (0, 0).

And it has slope, I wrote it out here,

the correlation between test one and test two

times the ratio of the standard deviation of

test two to the standard deviation of test one.

But we normalized test two and test one,

so the standard deviations are exactly one.

So the best-fitting line

has slope equal to the correlation between test one and test two.

Okay.
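
The slope formula just stated can be checked numerically. This is a sketch on simulated scores, not the class data: the means, spreads, and the 0.5 coefficient used to generate `test2` are arbitrary assumptions, and `np.polyfit` stands in for whatever least-squares routine one prefers.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated scores; all generating parameters here are assumptions.
test1 = rng.normal(87, 6, 200)
test2 = 90 + 0.5 * (test1 - 87) + rng.normal(0, 5, 200)

r = np.corrcoef(test1, test2)[0, 1]

# Least-squares line on the raw scale (np.polyfit returns slope first, then intercept).
slope_raw, intercept_raw = np.polyfit(test1, test2, 1)
slope_formula = r * test2.std(ddof=1) / test1.std(ddof=1)

# After normalizing both tests, the slope is the correlation and the line passes through (0, 0).
z1 = (test1 - test1.mean()) / test1.std(ddof=1)
z2 = (test2 - test2.mean()) / test2.std(ddof=1)
slope_z, intercept_z = np.polyfit(z1, z2, 1)

print(slope_raw, slope_formula)
print(slope_z, r, intercept_z)
```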

Okay, so just rehashing

something from the previous slide.

The best-fitting regression line has slope equal to the correlation of

Test1 and Test2, which in general is going to be less than one.

In a case where it's one, there isn't much statistics left to do.

So this line will be shrunk towards a horizontal

line, telling us that our expected normalized score for test two will

be this correlation times the normalized test one score.

So if the correlation is 0.95,

then your estimated normalized test 2 score will be

0.95 times your normalized test 1 score.

And this line appropriately adjusts

for regression to the mean for test 2

conditioning on test 1. Equivalently, for test 1 conditioning on

test 2: if we knew your normalized test 2 score and

wanted to guess what your normalized test 1 score

is, we would multiply by the same correlation.

On our plot, the line will have slope equal to the

correlation of Test1 and Test2, where I'm assuming test two

is on the vertical axis and test one is on the horizontal axis.

If we wanted the line going in the other direction, predicting test one from test two,

the slope drawn on these same axes would be the inverse, one over the correlation.

I don't know how obvious that is, but the slope would be the inverse.

So, just to rehash.

In either case, if you want to adjust for regression

to the mean, you take the normalized

test score that you have,

and to get the one you want to predict, you multiply it by that correlation.
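
As a worked sketch of that rule, the helper below normalizes the score you have, multiplies by the correlation, and converts back to the raw scale of the test you want to predict. It uses the rounded figures quoted in this lecture (means of about 87 and 90, standard deviations of about 6, and the correlation of about 0.21 read off the plot), so the outputs are illustrative only; `predict_other_test` is a hypothetical helper name.

```python
def predict_other_test(score, mean_have, sd_have, mean_want, sd_want, r):
    """Predict one test score from the other, adjusting for regression to the mean."""
    z_have = (score - mean_have) / sd_have   # normalize the score you have
    z_want = r * z_have                      # shrink toward zero by the correlation
    return mean_want + sd_want * z_want      # convert back to the raw scale

# Rounded values from the lecture; the exact class statistics are not given.
r = 0.21
pred_test2 = predict_other_test(99, mean_have=87, sd_have=6, mean_want=90, sd_want=6, r=r)
pred_test1 = predict_other_test(96, mean_have=90, sd_have=6, mean_want=87, sd_want=6, r=r)

print(round(pred_test2, 2))  # test 2 prediction for a 99 on test 1
print(round(pred_test1, 2))  # test 1 prediction for a 96 on test 2
```

A 99 on test 1 is two standard deviations above the mean, but the predicted test 2 score is only 0.42 standard deviations above its mean, which is the shrinkage being described.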

But let me just show you a plot here.

This'll probably make it a little bit easier.

So here I have normalized Test 1 score on the

horizontal axis and normalized Test 2 score

on the vertical axis, and the slope of this line

here is the correlation between the two, which is about 0.21.

And that's the regression line of test two on test one.

I show an identity line here in the middle.

And then if you wanted to do the same thing

as if you had had test one on the vertical axis,

then you would use this almost vertical-looking line here.

And that's just the inverse.

That has slope equal to the inverse of this correlation here.

Notice both these lines pass through the point (0, 0).

Okay?
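
The relationship between the two lines on the plot can be sketched numerically. The data below are simulated with an assumed underlying correlation of 0.21, not the class scores; the point is only that both regressions have coefficient equal to the correlation, while the second one, when drawn on axes with test 1 horizontal and test 2 vertical, has slope one over the correlation.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
r_true = 0.21  # assumed underlying correlation, matching the value quoted for the plot

# Simulate correlated pairs, then normalize each test to mean 0 and sd 1.
x = rng.normal(size=n)
y = r_true * x + np.sqrt(1 - r_true**2) * rng.normal(size=n)
z1 = (x - x.mean()) / x.std(ddof=1)
z2 = (y - y.mean()) / y.std(ddof=1)

r = np.corrcoef(z1, z2)[0, 1]

slope_2_on_1 = np.polyfit(z1, z2, 1)[0]  # predicts test 2 from test 1
slope_1_on_2 = np.polyfit(z2, z1, 1)[0]  # predicts test 1 from test 2

# The line z1 = slope_1_on_2 * z2, drawn with z1 horizontal and z2 vertical,
# has rise over run equal to 1 / slope_1_on_2.
slope_1_on_2_on_plot = 1.0 / slope_1_on_2

print(r, slope_2_on_1, slope_1_on_2, slope_1_on_2_on_plot)
```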

And so this

line,

notice how,

when we're looking at test 2

and we want to predict our test 2 score from our test 1 score,

it's very flat, right?

It's very flat, suggesting that there wasn't

a lot of correlation between the two tests.

And that amount of noise means that there's a fair

amount of regression to the mean, so that if

we want to predict our test two score from our

test one score, we're shrinking it quite a bit. And

then you can see the same thing in the other direction:

the adjustment for regression

to the mean becomes very vertical, and that's because the correlation

was quite low, because there was quite a bit of noise.

Notice that if the points

were to fall more and more on an identity line,

so they would kind of

collapse around this identity line, these two crossed lines

would sort of go like this and converge to

the identity line themselves, and that would mean that there was

very little regression to the mean. Anyway,

it's a neat historical story that Francis Galton figured this out, and

his discovery of this was something that led to the discovery

of the far more advanced topic of regression.

But let's make some final comments about this.

So, for an ideal examiner (I guess I'm not an ideal examiner, judging from this test I gave),

there'd be little difference

between the identity line and the fitted regression line, and the

more unrelated the two exam scores are,

the more pronounced the regression to the mean.

So this is something I heard a talk about once:

there was a question of how much of our

discussion of sports is really discussion of regression to the mean.

So the idea is that there's a lot of noise in, say, teams' or

players' performances, and the ones that do the best in any one

given year do so through a combination of just being better plus random variation.

If that random variation is very high, then there's

a good chance that the following year, the following season or whatever,

they'll have a far lower performance

in whatever statistic or measure you're talking about.

So for example,

in baseball, a popular sport in the US, if someone has a particularly

high batting average early on in the season, there's

a good chance they'll regress to the mean in

the latter half of the season. And the reason for that is

that there is a component of it

that is performance, that's the player's inherent skill,

and then there's a component that's random variation.

And the larger the amount of random variation,

the more you'll see regression to the mean.

In that case, a lot of the discussion

about sports, I think, actually amounts to discussing regression to the mean.

Sometimes the players that have huge rebounds were just unlucky

initially, and then got luckier

when they were having their rebound.

And for the players that did extremely well and then did a lot worse,

their intrinsic ability may not have changed too much;

they were simply lucky early on

and then less lucky later on.
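
The skill-plus-luck idea from the baseball example can be sketched with a simulation. Everything here is assumed for illustration: each hypothetical player has a fixed true batting average, each half-season is 250 at-bats, and hits are drawn independently, so nobody's ability changes between halves.

```python
import numpy as np

rng = np.random.default_rng(3)
n_players, at_bats = 2000, 250  # assumed league size and at-bats per half-season

# Fixed true skill per player (assumed distribution), identical in both halves.
skill = rng.normal(0.260, 0.020, n_players).clip(0.150, 0.400)

# Observed averages: skill plus binomial luck, drawn independently in each half.
first_half = rng.binomial(at_bats, skill) / at_bats
second_half = rng.binomial(at_bats, skill) / at_bats

# Players in the top 5% of first-half batting averages.
hot = first_half >= np.quantile(first_half, 0.95)

hot_first = first_half[hot].mean()
hot_second = second_half[hot].mean()
league_second = second_half.mean()

print(f"hot starters, first half:  {hot_first:.3f}")
print(f"hot starters, second half: {hot_second:.3f}")
print(f"league, second half:       {league_second:.3f}")
```

Under these assumptions the hot starters fall back toward the league average in the second half while still staying above it, even though their underlying skill never changed.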
