
Previously we talked about statistical significance.

But in general, in genomic studies, you're often considering more than one data set at a time. In other words, you might be analyzing the expression of every one of the genes in your body, or you might be looking at hundreds of thousands or millions of variants in the DNA, or many other multiple testing scenarios.

So in these scenarios, what you're doing is calculating a measure of association between some phenotype that you care about, say cancer versus control, and every single data set that you collected. Say, a data set for each possible gene.

So in this case, what's happened is people are still applying the hypothesis testing framework. They're using P-values and things like that. But the issue is that that framework wasn't built for doing many, many hypothesis tests at once.

So if you remember when we talked about what a P-value was, it's the probability of observing a statistic as or more extreme than the one you calculated in the original sample.

And so one property of P-values that's very important, and that we should pay attention to, is that if there's nothing happening, suppose that there's absolutely no difference between the two groups that you're comparing, then the P-values are what's called uniformly distributed.

So this is a histogram of some uniformly distributed data. On the x-axis you see the P-value, and on the y-axis is the frequency, the number of P-values that fall into that bin. And so, this is what the uniform distribution looks like.

And so, what a uniform distribution means is that 5% of the P-values will be less than 0.05, 20% of the P-values will be less than 0.20, and so forth. In other words, when there is no signal, the P-value distribution is flat.
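This flatness is easy to check with a small simulation. A minimal sketch in plain Python, where uniform random draws stand in for null P-values (which is exactly their distribution when nothing is happening):

```python
import random

# Under the null hypothesis, P-values are uniform on [0, 1], so a fixed
# fraction of them falls below any cutoff purely by chance.
random.seed(42)
p_values = [random.random() for _ in range(100_000)]  # stand-ins for null P-values

frac_below_005 = sum(p < 0.05 for p in p_values) / len(p_values)
frac_below_020 = sum(p < 0.20 for p in p_values) / len(p_values)
print(frac_below_005)  # ≈ 0.05
print(frac_below_020)  # ≈ 0.20
```

About 5% of the null P-values land below 0.05 and about 20% below 0.20, matching the flat histogram described above.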

So what does that mean?

How does that sort of play a role in a multiple testing problem?

And so here's an example with a cartoon.

Imagine that you're trying to investigate whether jelly beans are associated with acne.

So what you could do is, you could perform a study where you compare

people who eat a lot of jelly beans and

people who don't eat a lot of jelly beans and look to see if they have acne or not.

And so if you do that, you probably won't find anything.

And so for the first test, people go ahead and collect the data on the whole sample, they calculate the statistic, the P-value is greater than 0.05, and they conclude there's no statistically significant association between jelly beans and acne.

But then you might consider: oh, well, maybe it's just one kind of jelly bean.

So you could go back and test brown jelly beans and yellow jelly beans and so

forth, and in each case, most of the time, the P-value would be greater than 0.05.

And so it would not be statistically significant, and you wouldn't report it.

But since P-values are uniformly distributed, even if there's absolutely no association between jelly beans and acne, about one out of every 20 tests that you do will still show up with a P-value less than 0.05.

And so the danger is that you do these many, many tests, and then you find the one with a P-value less than 0.05 and you just report that one.
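The jelly-bean scenario can be sketched the same way. The numbers here are simulated, not from any real study; each "color" is an independent test with no true effect:

```python
import random

# 20 independent tests of a null effect (one per jelly bean color).
# Each P-value is uniform, so each has a 5% chance of dipping below 0.05;
# across 20 colors we expect about one spurious "significant" result.
random.seed(0)
n_experiments = 10_000
hits = 0
for _ in range(n_experiments):
    p_values = [random.random() for _ in range(20)]  # one null P-value per color
    hits += sum(p < 0.05 for p in p_values)

print(hits / n_experiments)  # average "significant" colors per 20 tests, ≈ 1.0
```

On average one color per batch of 20 comes out "significant" even though no color has any real association with acne, which is exactly the green-jelly-bean headline.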

So here's an example where there's a news article saying that green jelly beans have

been linked to acne.

So what's happened there is that they're reporting this with a statistical significance measure that was designed for performing one hypothesis test, but in reality they performed many.

So how do we deal with this?

How do we adapt the hypothesis testing framework to the situation where you're doing many hypothesis tests?

So the way that we do that is with different error rates.

So the two most commonly used error rates that you'll probably hear about when doing a genomic data analysis are the family-wise error rate and the false discovery rate.

So the family-wise error rate says that if we're going to do many, many hypothesis tests, we want to control the probability that there will be even one false positive.

This is a very strict criterion.

If you find many things that are significant at a family-wise error rate that's very low, you're saying that the probability of even one false positive is very small.
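One standard way to control the family-wise error rate is the Bonferroni correction. The lecture doesn't name a specific procedure, so this is just an illustrative sketch of the simplest one:

```python
# Sketch of family-wise error rate control via the Bonferroni correction:
# with m tests, call a result significant only if p < alpha / m. This keeps
# the probability of even one false positive at or below alpha.
def bonferroni_significant(p_values, alpha=0.05):
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]

# Five hypothetical P-values; with m = 5 the per-test cutoff is 0.05 / 5 = 0.01.
p_values = [0.0001, 0.004, 0.03, 0.2, 0.6]
print(bonferroni_significant(p_values))  # [True, True, False, False, False]
```

Note how strict this is: 0.03 would pass an uncorrected 0.05 cutoff but fails the family-wise one.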


Another very commonly used error measure is the false discovery rate.

This is the expected number of false positives divided by the number of

total discoveries.

So what does this do?

It sort of quantifies, among the things that you're calling statistically

significant, what fraction of them appear to be false positives?

And so the false discovery rate often is a little bit more liberal

than the family wise error rate.

You're not controlling the probability of even one false positive.

You're allowing for some false positives, to make more discoveries.

But it quantifies the error rate at which you're making those discoveries.
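A standard procedure for controlling the false discovery rate is the Benjamini-Hochberg step-up method. Again, the lecture doesn't name a procedure, so this sketch is illustrative:

```python
# Sketch of false discovery rate control via the Benjamini-Hochberg procedure:
# sort the m P-values, find the largest rank k with p_(k) <= q * k / m, and
# call the k smallest P-values discoveries.
def bh_significant(p_values, q=0.05):
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= q * rank / m:
            k_max = rank
    significant = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= k_max:
            significant[i] = True
    return significant

# Five hypothetical P-values at FDR level q = 0.05.
p_values = [0.001, 0.008, 0.039, 0.041, 0.6]
print(bh_significant(p_values))  # [True, True, False, False, False]
```

The rank-scaled thresholds (q·k/m rather than a single fixed cutoff) are what make FDR control more liberal than family-wise control: later-ranked P-values get progressively easier bars to clear.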

And so you have to be very careful interpreting these error rates, because they actually mean different things. You do different things to the data, and you also have to interpret the results differently.

So just because you find more statistically significant results when you use the false discovery rate than when you use the family-wise error rate, it doesn't mean that magically, all of a sudden, there were more results that were truly different.

It just means that there's a different interpretation

to the analysis that you do.

So I'm going to give you a very simple example.

Suppose you're doing an analysis with 10,000 genes, a differential gene expression analysis, and you discover that 550 of those genes are significant at the 0.05 level. If those are just raw P-values, then even if no gene were truly differentially expressed, we would expect about 5% of the 10,000 tests, or 500 false positives, to fall below 0.05 by chance alone.

Alternatively, suppose that when we declared those 550 to be significant, we were using the false discovery rate. In this case, we're quantifying, among the discoveries that we've made, the rate of errors that we expect to make.

So 5% times the 550 things we discovered equals about 27.5 expected false positives.

So in this case, we discovered the same number of things, but by using a different error rate, we control the error level much lower than if we had just called P-values less than 0.05 significant.

Finally, suppose we use the family-wise error rate. In this case, if we had found 550 genes differentially expressed out of 10,000 at a family-wise error rate control of 0.05, that means the probability of even one of those 550 being a false positive is less than 0.05.

So that means that almost all of them would probably be true positives.
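The arithmetic behind the three interpretations of the 10,000-gene example can be written out explicitly:

```python
# The 10,000-gene example, under each error-rate interpretation.
n_tests = 10_000
n_significant = 550
alpha = 0.05

# Raw P-values < 0.05: expected false positives even if no gene were truly different.
expected_fp_raw = alpha * n_tests          # 500, nearly as many as we "discovered"
# FDR controlled at 0.05: expected false positives among the 550 discoveries.
expected_fp_fdr = alpha * n_significant    # about 27.5
# FWER controlled at 0.05: probability of even ONE false positive among the 550.
prob_any_fp_fwer = alpha                   # 0.05

print(expected_fp_raw, expected_fp_fdr, prob_any_fp_fwer)
```

Same count of 550 significant genes in every scenario, but the guarantee behind that count ranges from "mostly noise, possibly" to "almost certainly all real."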

So in this case, we've illustrated three ways that you could calculate statistical significance. When you use the words "statistically significant," they mean something totally different depending on what error rate you're controlling.

One last thing to consider when looking at multiple hypothesis tests is

the inevitable scenario.

So everybody who's done some real science has run into this scenario

where the P-value that they calculated is just greater than 0.05.

And the natural reaction is to be very sad and to think game over, oh,

I've got to try all over again because my P-value's greater than 0.05.

It's a really good idea not to do that.

First of all, it's important to report negative results even if you can't get

them into the best journals, to avoid what's called publication bias.

But more importantly, it's important to be careful to avoid P-value hacking.


So a very typical email a statistician might get after reporting a P-value

greater than 0.05 is this one that my friend Ingo got.

So it said, curse you, Ingo!

Yet another disappearing act!

Because the P-value is greater than 0.05 after doing some correction.

And so, while this is a joke and it was totally said in jest, in general there can be pressure to try to discover more things at a more statistically significant level.

It's very important to avoid that temptation,

because you'll run into something called P-value hacking.

So in general, statistics hacking means doing things to the data, or changing the way that you do the calculations, in order to manufacture a statistically significant result, even when your original analysis didn't produce one.

So this is an example of a paper where people took a very simple simulated data set, made very sensible transformations to that data set with the statistical methods they used, and turned almost any result into a statistically significant result.

A way to avoid this is to specify a data analysis plan in advance of looking at the data, and stick to it.