An introduction to the statistics behind the most popular genomic data science projects. This is the sixth course in the Genomic Big Data Science Specialization from Johns Hopkins University.


From the course by Johns Hopkins University

Statistics for Genomic Data Science


From the lesson

Module 4

In this week we will cover a lot of the general pipelines people use to analyze specific data types like RNA-seq, GWAS, ChIP-Seq, and DNA Methylation studies.

- Jeff Leek, PhD, Associate Professor, Biostatistics

Bloomberg School of Public Health

Once you fit a statistical model and you've identified those genes or those features that are statistically significantly associated with the phenotype you care about after correcting for multiple testing, you might want to identify if there's some biological pattern to those genes or to those features that you've identified as differentially expressed. So again I'm going to go back to this example where we're trying to predict the response to lenalidomide in myelodysplastic syndrome. Here we find 47 genes that are differentially expressed at a false discovery rate of 10%. And you can see, for example, that there appear to be some genes that have something in common near the top of this list of differentially expressed genes, but is there a way that we can quantify that?

So one way that you can do that is to take the statistic that you calculated for every gene and order them from largest to smallest. Alternatively, you can order them from the smallest p-value to the largest p-value. And so over here are the most statistically significant associations and over here are the least statistically significant associations. Then you can take some gene set that you care about and label all the genes that are in that gene set. In this case, I've made them red.

So what you can do then is calculate a running statistic that goes up every time you have a gene in the gene set and goes down every time you have a gene out of the gene set. And so what you can see is, if all of the genes that are in the gene set cluster near the most statistically significant values, then you'll see many more values that go up than values that go down, and you'll get a high peak here. And so the statistic here is actually the maximum deviation from zero. That's the gene set enrichment statistic. This is related to something called the Kolmogorov-Smirnov statistic, if you know a little bit more about advanced statistics.
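
The running statistic described above can be sketched in a few lines of Python. This is only an illustration, not the exact statistic from the slides: the function name `running_enrichment`, the toy numbers, and the step sizes (scaled so that the walk ends at zero, as in an unweighted Kolmogorov-Smirnov style statistic) are all assumptions.

```python
def running_enrichment(significance, in_set):
    """Return the running sum and its maximum deviation from zero.

    significance: one value per gene, larger meaning more significant
                  (for example |t| or -log10 of the p-value).
    in_set:       one boolean per gene, True if the gene is in the gene set.
    """
    n_in = sum(in_set)
    n_out = len(in_set) - n_in
    if n_in == 0 or n_out == 0:
        raise ValueError("need genes both inside and outside the gene set")
    # Rank genes from most to least significant.
    order = sorted(range(len(significance)), key=lambda i: -significance[i])
    running, total = [], 0.0
    for i in order:
        # Assumed step sizes: up by 1/n_in for a gene in the set,
        # down by 1/n_out otherwise, so the walk returns to zero.
        total += 1.0 / n_in if in_set[i] else -1.0 / n_out
        running.append(total)
    # The enrichment statistic is the point farthest from zero.
    score = max(running, key=abs)
    return running, score


# Hypothetical toy example: the two genes in the set are the most significant.
significance = [5.0, 4.0, 3.0, 2.0, 1.0, 0.5]
in_set = [True, True, False, False, False, False]
running, score = running_enrichment(significance, in_set)
print(running)
print(score)
```

When the genes in the set sit at the top of the ranking, the walk climbs before it falls, which is the high peak described in the lecture.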

And so the idea here is that we want to identify whether this enrichment is statistically significant, that is, whether it's more than we would expect to see by chance. So one way that people do that is they again permute the sample labels. We've permuted the responders and the non-responders, and now we get a new set of labels. Once we get the new set of labels, we can recalculate the statistics and reorder them. And so now we see that the genes that belong to the gene set are a little bit more scattered throughout this profile, and so you see that the profile goes down and then up and then down and then up. It wiggles a little bit more but it doesn't deviate from zero as far, and so there appears to be less of an enrichment of those values.
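
As a rough sketch of that permutation step, the Python below shuffles the responder and non-responder labels, recomputes a per-gene statistic, re-ranks the genes, and recomputes the enrichment score each time. Everything here is an assumption made for illustration: the simulated data, the gene names, the use of a Welch t-statistic as the per-gene statistic, and the function names.

```python
import random
import statistics


def t_statistic(values, labels):
    """Welch t-statistic for one gene: samples labelled 1 versus samples labelled 0."""
    a = [v for v, g in zip(values, labels) if g == 1]
    b = [v for v, g in zip(values, labels) if g == 0]
    se = (statistics.variance(a) / len(a) + statistics.variance(b) / len(b)) ** 0.5
    return (statistics.mean(a) - statistics.mean(b)) / se if se > 0 else 0.0


def enrichment_score(expression, labels, gene_set):
    """Rank genes by |t| and return the running sum's maximum deviation from zero."""
    strength = {g: abs(t_statistic(v, labels)) for g, v in expression.items()}
    ranked = sorted(strength, key=strength.get, reverse=True)
    n_in = sum(g in gene_set for g in ranked)
    n_out = len(ranked) - n_in
    total, best = 0.0, 0.0
    for g in ranked:
        total += 1.0 / n_in if g in gene_set else -1.0 / n_out
        best = max(best, abs(total))
    return best


def permuted_scores(expression, labels, gene_set, n_perm=100, seed=0):
    """Enrichment scores recomputed after shuffling the sample labels."""
    rng = random.Random(seed)
    shuffled = list(labels)
    scores = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # new assignment of responders and non-responders
        scores.append(enrichment_score(expression, shuffled, gene_set))
    return scores


# Simulated toy data (hypothetical): 20 genes, 5 responders and 5 non-responders.
data_rng = random.Random(1)
labels = [1] * 5 + [0] * 5
gene_set = {"gene0", "gene1", "gene2", "gene3", "gene4"}
expression = {}
for j in range(20):
    name = f"gene{j}"
    shift = 3.0 if name in gene_set else 0.0  # only gene-set genes respond
    expression[name] = [data_rng.gauss(0.0, 1.0) + shift * g for g in labels]

observed = enrichment_score(expression, labels, gene_set)
null_scores = permuted_scores(expression, labels, gene_set)
print(observed, max(null_scores))
```

Shuffling the labels keeps the group sizes and the gene-to-gene structure intact while breaking any real association with the phenotype, so the permuted scores show what the statistic looks like by chance.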

So you can recalculate the value of this gene set statistic for several permutations, and then you can calculate a p-value for each gene set category from how often the permuted values are more extreme than the observed value. And so you can calculate a p-value for each of the gene sets, and then again do a false discovery rate correction and identify gene sets that are associated with those statistically significant results.
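
To make those two steps concrete, here is a minimal sketch that turns permuted scores into a p-value and then adjusts the p-values across gene sets. The lecture does not say which false discovery rate procedure is used; the Benjamini-Hochberg adjustment below is an assumption, as are the add-one correction in the p-value, the function names, and the example numbers.

```python
def permutation_p_value(observed, null_scores):
    """Fraction of permuted scores at least as extreme as the observed one.

    The +1 in numerator and denominator (an assumed convention) keeps the
    p-value from being exactly zero with a finite number of permutations.
    """
    exceed = sum(s >= observed for s in null_scores)
    return (exceed + 1) / (len(null_scores) + 1)


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical numbers: one observed score against four permuted scores.
p = permutation_p_value(0.9, [0.2, 0.5, 0.95, 0.3])
print(p)

# Hypothetical p-values for four gene sets.
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.5])
print(adjusted)
```

Gene sets whose adjusted value falls below the chosen false discovery rate, such as the 10% used earlier for the genes, would be reported as enriched.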

So what are the gene sets you can look at? The Gene Ontology Consortium has a large ontology of gene sets that are based on their function, their spatial location within the cell, and so forth. You can also look at molecular signatures that have been curated, for example the set of molecular signatures that you can get from the MSigDB database. Or you can look at things like interactions between proteins and then see whether there is an enrichment for a particular set of interactions among the genes that you found to be differentially expressed. Really, any previously defined set of genes that has some function that you care about can be used for a gene set enrichment analysis.

So one thing to keep in mind is that this can be very hard to interpret, especially if the categories are broad or vague. So for example, if you get a category that comes out as transcriptional regulation, that's a very broad category; there are lots of different subcategories of that. And so if that's enriched, it's not clear how much added value it's giving you. It's better if you can find specific, concrete categories that are enriched. Here, if you're not very careful, you can tell stories, so again you have to correct for the multiple testing problem and you have to be very aware of your own implicit biases. This incurs a second multiple testing problem, on top of the multiple testing problem involved in identifying differentially expressed genes. Now you're testing multiple gene sets, and so you have to account for that as well.

This idea can actually be simplified. The gene set enrichment statistic I showed you here can be simplified into basically a very simple t-statistic comparing the genes that are in the set to the genes that are out of the set, and you can read about that here in this paper.
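
One way to picture that simplification is a two-sample t-statistic computed on the per-gene statistics themselves, comparing genes in the set with genes outside it. The sketch below is an assumption about the general shape of such a statistic, not the exact formula from the paper mentioned in the lecture; the unequal-variance (Welch) form, the function name, and the toy numbers are illustrative choices.

```python
import statistics


def gene_set_t_statistic(gene_stats, in_set):
    """Welch t-statistic comparing per-gene statistics inside and outside a gene set."""
    inside = [s for s, flag in zip(gene_stats, in_set) if flag]
    outside = [s for s, flag in zip(gene_stats, in_set) if not flag]
    se = (statistics.variance(inside) / len(inside)
          + statistics.variance(outside) / len(outside)) ** 0.5
    return (statistics.mean(inside) - statistics.mean(outside)) / se


# Hypothetical per-gene statistics: the first three genes belong to the set.
gene_stats = [3.0, 2.5, 3.5, 0.5, -0.5, 0.0, 1.0, -1.0]
in_set = [True, True, True, False, False, False, False, False]
t = gene_set_t_statistic(gene_stats, in_set)
print(round(t, 3))
```

A large positive value indicates that the genes in the set tend to have larger statistics than the rest, which is the same question the running-sum statistic asks.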

Â Coursera provides universal access to the worldâ€™s best education,
partnering with top universities and organizations to offer courses online.