0:00
Once you fit a statistical model and you've identified those genes or
those features that are statistically significantly associated
with the phenotype you care about after correcting for multiple testing,
you might want to identify if there's some biological pattern to those genes or
to those features that you've identified that are differentially expressed.
So again I'm going to go back to this example where we're trying to predict
the response to lenalidomide in myelodysplastic syndrome.
So again we find 47 genes that
are differentially expressed at a false discovery rate of 10%.
And so you can see, for example, that there appear to be some
genes that have something in common here near the top of this list of
differentially expressed genes but is there a way that we can quantify that?
So one way that you can do that is you can take the statistic for
every gene that you calculated, and you can order them from largest to smallest.
Alternatively, you can order them from the smallest p-value to the largest p-value.
And so over here are the most statistically significant associations and
over here are the least statistically significant associations.
Then you can take some gene set that you care about and
label all the genes that are in that gene set.
In this case, I've made them red.
So what you can do is then you can calculate a running statistic
that goes up every time you have a gene in the gene set and
goes down every time you have a gene out of the gene set.
And so what you can see is, if all of the genes that are in the gene set cluster
near the most statistically significant values, then you'll see many more
steps that go up than steps that go down, and you'll get a high peak here.
And so the statistic here is actually the maximum deviation from zero.
That's the gene set enrichment statistic.
This is related to something called the Kolmogorov-Smirnov statistic if
you know a little bit more about advanced statistics.
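The running statistic I just described can be sketched in a few lines of Python. This is only a minimal sketch under some assumptions that are not in the lecture: the function name `running_enrichment` is made up for illustration, the up and down steps are scaled so that the running sum ends back at zero, and a larger statistic is taken to mean a more significant gene.

```python
import numpy as np

def running_enrichment(stats, in_set):
    """Running-sum enrichment statistic for one gene set (illustrative sketch).

    stats  : one statistic per gene, larger meaning more significant
    in_set : boolean per gene, True if the gene belongs to the gene set
    """
    stats = np.asarray(stats, dtype=float)
    in_set = np.asarray(in_set, dtype=bool)
    n, n_in = len(stats), int(in_set.sum())
    if n_in == 0 or n_in == n:
        raise ValueError("the gene set must contain some, but not all, genes")

    # Order genes from the largest statistic to the smallest.
    order = np.argsort(-stats, kind="stable")
    hits = in_set[order]

    # Step up for a gene in the set, step down for a gene out of the set.
    # Assumed scaling: the steps are sized so the running sum ends at zero.
    steps = np.where(hits, 1.0 / n_in, -1.0 / (n - n_in))
    running = np.cumsum(steps)

    # The enrichment statistic is the maximum deviation from zero (signed).
    peak = running[np.argmax(np.abs(running))]
    return peak, running
```

If the set genes all sit at the top of the ordering, the peak reaches 1. If they are scattered through the list, the peak is smaller.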
1:40
And so the idea here is that we want to identify whether this enrichment is
statistically significant, that is, whether it's more than we would expect to see by chance.
So one way that people do that is they again permute the sample labels.
We've permuted the responders and the non-responders.
And now we get the new set of labels.
And so, once we get the new set of labels, we can recalculate the statistics and
reorder them.
And so now we see that the genes that belong to the gene set are a little bit
more scattered throughout this profile and so
you see that the profile goes down and then up and then down and then up.
It wiggles a little bit more but it doesn't deviate from zero as far and so
there appears to be less of an enrichment of those values.
So you can recalculate the value of this gene set statistic for many permutations,
and then you can calculate a p-value for each gene set category
as the fraction of permuted values that are at least as extreme as the observed value.
And so you can calculate a p-value for each of the gene sets and
then again do a false discovery rate correction and identify gene sets that are associated
with those statistically significant results.
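Here is a rough sketch of that permutation procedure for a single gene set. Several things here are my assumptions for illustration and not part of the lecture: the per-gene statistic is a Welch t-statistic, genes are ordered by its absolute value, the enrichment statistic is the maximum absolute deviation of the running sum, and the names `t_stats`, `max_deviation` and `permutation_pvalue` are hypothetical.

```python
import numpy as np

def t_stats(expr, labels):
    """Welch t-statistic per gene (rows) between the two label groups."""
    a, b = expr[:, labels], expr[:, ~labels]
    diff = a.mean(axis=1) - b.mean(axis=1)
    se = np.sqrt(a.var(axis=1, ddof=1) / a.shape[1]
                 + b.var(axis=1, ddof=1) / b.shape[1])
    return diff / se

def max_deviation(stats, in_set):
    """Maximum absolute deviation from zero of the running sum."""
    order = np.argsort(-np.abs(stats))  # most significant genes first
    hits = in_set[order]
    steps = np.where(hits, 1.0 / hits.sum(), -1.0 / (~hits).sum())
    return np.max(np.abs(np.cumsum(steps)))

def permutation_pvalue(expr, labels, in_set, n_perm=1000, seed=0):
    """Permute the sample labels and compare permuted to observed enrichment.

    expr   : genes x samples array of expression values
    labels : boolean per sample (for example, True for responders)
    in_set : boolean per gene, True if the gene is in the gene set
    """
    rng = np.random.default_rng(seed)
    observed = max_deviation(t_stats(expr, labels), in_set)
    null = np.empty(n_perm)
    for i in range(n_perm):
        shuffled = rng.permutation(labels)  # new responder/non-responder labels
        null[i] = max_deviation(t_stats(expr, shuffled), in_set)
    # Fraction of permuted values at least as extreme as the observed one;
    # the +1 terms keep the p-value from being exactly zero.
    p = (1 + np.sum(null >= observed)) / (1 + n_perm)
    return observed, float(p)
```

You would run this once per gene set and then pass the resulting p-values to a false discovery rate correction.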
So what are the gene sets you can look at?
The Gene Ontology Consortium has a large ontology of gene sets that are based on
their function and based on their spatial location within the cell and so forth.
You can also look at molecular signatures that have been curated.
For example, there's the set of molecular signatures that you can get from the
MSigDB database.
Or you can look at things like interactions between proteins and
then see is there an enrichment for a particular set of interactions among
the genes that you found to be differentially expressed.
Really, it's any previously defined set of genes that has some
function that you care about; you can use any of those for a gene set enrichment analysis.
So one thing to keep in mind is this can be very hard to interpret especially if
the categories are broad or vague.
So for example, if you get a category that comes out as transcriptional regulation,
that's a very broad category, there's lots of different subcategories of that.
And so if that's enriched, it's not clear how much added value it's giving you.
It's better if you can find specific, concrete categories that are enriched.
Here, if you're not very careful you can tell stories, so
again you have to correct for the multiple testing problem and
you have to be very aware of your own implicit biases.
This incurs a second multiple testing problem, like I said, on top of
the multiple testing problem involved in identifying differentially
expressed genes.
Now you're testing multiple gene sets, and so you have to account for
that as well.
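One common way to account for testing many gene sets is the Benjamini-Hochberg adjustment. The lecture only says "false discovery correction", so choosing Benjamini-Hochberg here is my assumption, and the function name `benjamini_hochberg` is just for illustration. A minimal sketch:

```python
import numpy as np

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, one per gene set."""
    p = np.asarray(pvalues, dtype=float)
    m = len(p)
    order = np.argsort(p)                      # smallest p-value first
    scaled = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, working from the largest p-value down.
    adjusted_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.minimum(adjusted_sorted, 1.0)
    return adjusted

# Hypothetical p-values for four gene sets, purely for illustration.
gene_set_pvalues = [0.01, 0.04, 0.03, 0.5]
adjusted = benjamini_hochberg(gene_set_pvalues)
significant = adjusted <= 0.10                 # gene sets kept at a 10% FDR
```

The p-values in the usage lines are made up. In practice they would come from the permutation step for each gene set.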
This idea can actually be simplified.
The statistic I showed you here, this gene set enrichment
statistic, can be simplified into basically a very simple t-statistic
comparing the genes that are in the set to the genes that are out of the set, and so
you can read about that here in this paper.
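To give a feel for that simplification, here is a minimal sketch that compares the per-gene statistics inside the set to those outside it with a two-sample t-statistic. The unequal-variance (Welch) form and the name `set_vs_rest_t` are my assumptions for illustration; the paper may define the statistic differently.

```python
import numpy as np

def set_vs_rest_t(stats, in_set):
    """Two-sample t-statistic: genes in the set versus genes out of the set.

    stats  : one statistic per gene (for example, a t-statistic per gene)
    in_set : boolean per gene, True if the gene belongs to the gene set
    """
    stats = np.asarray(stats, dtype=float)
    in_set = np.asarray(in_set, dtype=bool)
    inside, outside = stats[in_set], stats[~in_set]
    # Assumed Welch form: each group contributes its own variance.
    se = np.sqrt(inside.var(ddof=1) / len(inside)
                 + outside.var(ddof=1) / len(outside))
    return (inside.mean() - outside.mean()) / se
```

A large positive value means the genes in the set tend to have larger statistics than the rest of the genes.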