An introduction to the statistics behind the most popular genomic data science projects. This is the sixth course in the Genomic Big Data Science Specialization from Johns Hopkins University.

From the course by Johns Hopkins University

Statistics for Genomic Data Science

Course 7 of 8 in the Specialization Genomic Data Science

From the lesson

Module 3

This week we will cover modeling non-continuous outcomes (like binary or count data), hypothesis testing, and multiple hypothesis testing.

- Jeff Leek, PhD, Associate Professor, Biostatistics

Bloomberg School of Public Health

Another type of outcome that's commonly observed in genomic data, particularly data from next-generation sequencing, is count data. A very common scenario is that you have a number of reads that overlap a particular region or variant, and you want to build a regression model for those counts.

So here's an example, from gene expression data. Suppose, for example, that you wanted to count how many reads covered each gene. There are a number of different ways that you could count that. You could ask, if a read falls entirely within that gene, how many counts does it get? There are a number of choices that you could make for that, but for now let's just assume that you've made one of these choices and you've got a count for each gene, say by calculating the total number of reads that cover that gene. Now you have a count, for each sample and for each gene, of the total number of reads that cover that gene.
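
To make the counting step concrete, here is a minimal sketch of one of the possible counting rules mentioned above: a read is counted for a gene only if it falls entirely within that gene. The function name, the interval format (start inclusive, end exclusive), the single-chromosome and unstranded setup, and the toy coordinates are all assumptions made for illustration; they are not from the lecture.

```python
def count_reads_per_gene(reads, genes):
    """Count reads that fall entirely within each gene interval.

    reads: list of (start, end) tuples, end exclusive (hypothetical format).
    genes: dict mapping gene name -> (start, end), end exclusive.
    """
    counts = {name: 0 for name in genes}
    for r_start, r_end in reads:
        for name, (g_start, g_end) in genes.items():
            # One possible rule: the read must lie fully inside the gene.
            if g_start <= r_start and r_end <= g_end:
                counts[name] += 1
    return counts


# Toy data for a single sample (made-up coordinates).
genes = {"geneA": (100, 500), "geneB": (800, 1200)}
reads = [(120, 170), (450, 500), (480, 530), (900, 950), (10, 60)]
print(count_reads_per_gene(reads, genes))
```

Repeating this for every sample would give the gene-by-sample table of counts that the rest of the lecture models. A different rule (for example, counting any overlap) would give different counts, which is why the choice has to be fixed up front.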

So what you want to do is model that distribution, and so you might want to build a regression model, again based on the relationship between the phenotype that you care about and the counts for a particular gene.

The most commonly used distribution to model counts in statistics is the Poisson distribution. One thing to keep in mind about the Poisson distribution is that the mean and the variance are the same. So, for example, if you look at a Poisson distribution with a low mean, it also has a low variance. When you increase the mean you also increase the variability, and if you increase the mean even further you increase the variability even more. So this is a distribution that's very good for count data, because it only takes non-negative integer values, and it has other properties that mean it models idealized count data very well. But it's very restrictive in how it models the variance.
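
The mean-equals-variance property can be checked numerically. The sketch below computes the mean and variance of a Poisson distribution directly from its probability mass function for a few rates; the helper names and the truncation point of the sum are illustrative choices, not part of the lecture.

```python
import math


def poisson_pmf(k, lam):
    # P(X = k) for a Poisson with rate lam, computed on the log scale.
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))


def mean_and_variance(pmf, upper):
    # Numerically sum the first two moments over k = 0 .. upper - 1.
    mean = sum(k * pmf(k) for k in range(upper))
    var = sum((k - mean) ** 2 * pmf(k) for k in range(upper))
    return mean, var


for lam in (1.0, 5.0, 20.0):
    m, v = mean_and_variance(lambda k: poisson_pmf(k, lam), upper=200)
    print(f"rate={lam}: mean={m:.3f}, variance={v:.3f}")
```

For each rate the two numbers agree, which is exactly the restriction described above: once the mean is fixed, the Poisson model leaves no freedom in the variance.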

So, again, you could fit a regression model, but here the regression model is going to be a little bit more complicated. This is an example of a generalized linear model. Logistic regression is another example, where we take a function of the expected value of the thing that we care about. Here, we have the counts that we care about, conditional on, say, the group indicator. So what we're going to do is model a function of the expected value of the counts, given our indicator, as a linear function of the data. Usually, when you're modeling counts, you use a link function, and the link function is often the log function. So you say the log of the expected count is going to be modeled as an adjustment variable (usually a normalization constant that accounts for the total sequencing depth) plus a term with the parameter that we care about. That parameter models, on the log scale of the counts, the relationship with the variable that we care about. So this is another way of fitting a regression model, now to a set of count data.
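
As a rough illustration of the model just described, the sketch below fits log E[count] = log(depth) + b0 + b1 * group for a single gene using Newton's method on the Poisson log-likelihood. It is a minimal stand-in written with the standard library only, not the lecture's code; the function name, the toy counts, and the depth values are assumptions. In practice you would use an established GLM routine instead.

```python
import math


def fit_poisson_glm(counts, group, depth, n_iter=25):
    """Fit log E[y] = log(depth) + b0 + b1 * group by Newton's method.

    counts: observed read counts for one gene, one entry per sample.
    group:  0/1 group indicator per sample (both groups must be present).
    depth:  per-sample normalization constant, used as a fixed offset.
    """
    b0 = math.log(sum(counts) / sum(depth))  # start from the pooled rate
    b1 = 0.0
    for _ in range(n_iter):
        mu = [d * math.exp(b0 + b1 * x) for d, x in zip(depth, group)]
        # Gradient of the Poisson log-likelihood.
        g0 = sum(y - m for y, m in zip(counts, mu))
        g1 = sum((y - m) * x for y, m, x in zip(counts, mu, group))
        # Entries of the 2x2 information matrix.
        h00 = sum(mu)
        h01 = sum(m * x for m, x in zip(mu, group))
        h11 = sum(m * x * x for m, x in zip(mu, group))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system explicitly.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


# Made-up counts for one gene in six samples, three per group.
counts = [10, 12, 9, 30, 28, 35]
group = [0, 0, 0, 1, 1, 1]
depth = [1.0, 1.2, 0.9, 1.1, 1.0, 1.3]
b0, b1 = fit_poisson_glm(counts, group, depth)
print(f"b0={b0:.4f}, b1={b1:.4f}, fold change={math.exp(b1):.3f}")
```

Here `b1` is the parameter of interest: on the log scale it measures how the depth-adjusted expected count differs between the two groups, so `exp(b1)` reads as a fold change.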

And if you fit a model like this, you get a slightly better fit than you do by applying standard linear regression models to count data. So here is a set of data where you have the average on the x axis and the variance on the y axis. Here you see the fit from the Poisson model in purple, and it turns out that you can do even a little bit better than that if you model the relationship between the mean and the variance directly. Remember that the Poisson distribution requires that the mean and the variance be the same, so you have a straight line in the relationship between mean and variance. But sometimes that's not exactly true for counts, and so you instead model the variance as a function of the mean.
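
A simple way to see whether the Poisson assumption holds is to compute the mean and variance of each gene across samples and compare them. The sketch below does this for made-up counts; the gene names, the numbers, and the function name are hypothetical and chosen only to show the pattern.

```python
from statistics import mean, variance


def mean_variance_table(counts_by_gene):
    """Return (gene, mean, variance, variance / mean) for each gene."""
    rows = []
    for gene, counts in counts_by_gene.items():
        m = mean(counts)
        v = variance(counts)  # sample variance (n - 1 denominator)
        ratio = v / m if m > 0 else float("nan")
        rows.append((gene, m, v, ratio))
    return rows


# Made-up counts for three genes across five samples.
counts_by_gene = {
    "gene1": [3, 5, 4, 6, 2],
    "gene2": [50, 80, 30, 120, 70],
    "gene3": [500, 900, 300, 1500, 800],
}
for gene, m, v, ratio in mean_variance_table(counts_by_gene):
    print(f"{gene}: mean={m:.1f}, variance={v:.1f}, variance/mean={ratio:.2f}")
```

Under a Poisson model the variance-to-mean ratio would be close to 1 for every gene. In this toy table the ratio grows with the mean, which is the kind of departure from the straight line that motivates modeling the variance as a function of the mean.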

The two most popular packages for modeling count data in Bioconductor are edgeR and DESeq, and both of those use a type of local or smoothed regression to estimate the relationship between mean and variance. You can then plug that into a more flexible model, the negative binomial model. The negative binomial model allows you to model the mean and the variance using separate parameters, that is, using a pair of parameters rather than just one parameter. So while the Poisson distribution that I've shown here in black fixes the variability for a specific mean value, with the negative binomial distribution you could have that same mean value but a large number of different variances. So that's a little bit more flexible.
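
The extra flexibility can be illustrated numerically. The sketch below assumes the common mean and dispersion parameterization of the negative binomial, in which the variance equals mu + alpha * mu^2; the lecture does not spell out a parameterization, so treat this choice, the function names, and the example values as assumptions.

```python
import math


def nb_pmf(k, mu, alpha):
    """Negative binomial pmf with mean mu and dispersion alpha (assumed form)."""
    r = 1.0 / alpha  # "size" parameter
    log_p = (
        math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
        + r * math.log(r / (r + mu))
        + k * math.log(mu / (r + mu))
    )
    return math.exp(log_p)


def nb_moments(mu, alpha, upper=2000):
    # Numerically sum the first two moments over k = 0 .. upper - 1.
    mean = sum(k * nb_pmf(k, mu, alpha) for k in range(upper))
    var = sum((k - mean) ** 2 * nb_pmf(k, mu, alpha) for k in range(upper))
    return mean, var


# Same mean, three different dispersions.
for alpha in (0.01, 0.1, 0.5):
    m, v = nb_moments(10.0, alpha)
    print(f"alpha={alpha}: mean={m:.3f}, variance={v:.3f}")
```

All three distributions have the same mean, but the variance changes with the dispersion. As the dispersion shrinks toward zero the variance approaches the mean, which is the Poisson case.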

So now you can model the counts with a negative binomial distribution, where the mean of that negative binomial distribution is equal to a sample-specific size factor, which relates to the number of reads that you've got, multiplied by a parameter proportional to the number of fragments. You then model that parameter using the same sort of log link: the log of the thing that you care about, how many fragments you have for that particular gene in that particular sample, is a linear function of the covariates that you care about. So again you're writing it down as a linear model, but one where the scale is slightly different.
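
Written out, the model described above might look like the following. The symbols are assumptions introduced here for illustration (the lecture's slide is not reproduced in the transcript): K is the count for gene i in sample j, s is the sample-specific size factor, q is the parameter proportional to the number of fragments, alpha is a per-gene dispersion, and x is the covariate of interest.

```latex
\begin{align}
K_{ij} &\sim \mathrm{NB}(\mu_{ij}, \alpha_i) \\
\mu_{ij} &= s_j \, q_{ij} \\
\log q_{ij} &= \beta_{i0} + \beta_{i1} x_j
\end{align}
```

The last line is the linear model on the log scale, and the coefficient on x is the parameter that relates the counts to the covariate.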

So this is an example of a generalized linear model, which you can learn a lot more about. For example, this set of lecture notes goes into a lot of detail about generalized linear models, in particular Poisson regression. This is again a huge topic and we've only scratched the surface, but I wanted to show you an example of how you can relate count data to covariates that you care about using a regression model, even if it's not the standard linear regression model, by using this link function.

Â Coursera provides universal access to the worldâ€™s best education,
partnering with top universities and organizations to offer courses online.