An introduction to the statistics behind the most popular genomic data science projects. This is the sixth course in the Genomic Big Data Science Specialization from Johns Hopkins University.


From the course by Johns Hopkins University

Statistics for Genomic Data Science


From the lesson

Module 3

This week we will cover modeling non-continuous outcomes (like binary or count data), hypothesis testing, and multiple hypothesis testing.

- Jeff Leek, PhD, Associate Professor, Biostatistics

Bloomberg School of Public Health

Once you fit a statistical model, say a linear regression model, the next thing you might want to do is perform some sort of inference. So, remember that the model is a fit to the data that you collected. But again, according to the central dogma of statistics, we actually want to take the sample that we collected, fit some regression model, and then make some statement about the relationship between the variables in the population.

So again, if we have some variable that we care about in the population, say the fraction of pink symbols in the population, we actually get an estimate of that fraction based on the sample that we've taken. Similarly, we get an estimate of the regression coefficient between the variables in the population.

So what we want to do next is quantify what that relationship is and how much uncertainty we have in our estimate from the sample. For example, you could get a totally different estimate if you took a different sample, and you want to know how much variability we should actually expect. And that variability differs a lot across observations.

So again, look at this example where you have three different genes measured in two groups. For gene 1 you can see that there is a difference in expression between the two groups, but there's also a relatively large amount of variability. For gene 2 there's a difference in expression and very little variability, so there's a very clear difference that we would likely expect to see again if we repeated the experiment. And for gene 3 there's a small difference, but there's also tiny variability, so this might be a replicable but small difference between the groups. So the variability is a quantity that matters a lot.
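One common way to put a difference and its variability on the same scale is to divide the difference in group means by a pooled estimate of the spread (essentially a two-sample t statistic). A minimal sketch, using made-up expression values rather than the genes in the lecture's figure, written in Python (the course itself uses R):

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def var(xs):
    # Sample variance (n - 1 in the denominator).
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def diff_vs_spread(a, b):
    # Mean difference scaled by a pooled measure of variability:
    # large values mean the difference stands out from the noise.
    pooled = math.sqrt(var(a) / len(a) + var(b) / len(b))
    return (mean(b) - mean(a)) / pooled

# Made-up expression values (not the lecture's figure):
# gene 1: a clear difference, but a lot of spread within each group
gene1_a, gene1_b = [5.1, 7.9, 4.2, 8.8], [9.0, 11.8, 8.1, 12.7]
# gene 2: a similar difference, but almost no spread
gene2_a, gene2_b = [5.0, 5.1, 4.9, 5.0], [9.0, 9.1, 8.9, 9.0]
```

With these numbers, diff_vs_spread(gene2_a, gene2_b) comes out far larger than diff_vs_spread(gene1_a, gene1_b), even though the raw mean differences are similar, which is exactly the gene 1 versus gene 2 distinction above.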

So if we go back to the galton$child height data, imagine that we're trying to estimate the mean height in the population. What we can do is estimate the mean height in the sample by taking the average.

Then we can estimate something about the variability. One way to measure variability is to take the standard deviation, or the variance. In this case we can estimate the variance by taking the sample mean, calculating the distance from every data point to the sample mean, and squaring it. So again, this is that distance calculation just like we did with clustering, but now applied to how far each data point is from our estimate. Then we average those squared distances, dividing by n - 1 here because we're actually trying to get an unbiased estimate of the variance, although once you have a large enough sample that -1 isn't a very important quantity.
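A minimal sketch of this calculation, with a small made-up sample of heights standing in for the Galton data (shown in Python; the course itself uses R):

```python
# Made-up sample of child heights in inches, standing in for the
# Galton data used in the lecture.
heights = [64.2, 66.5, 68.0, 70.1, 63.8, 69.4, 67.2, 65.9]

n = len(heights)
mean_height = sum(heights) / n

# Sample variance: average squared distance to the sample mean,
# dividing by n - 1 rather than n for an unbiased estimate.
variance = sum((x - mean_height) ** 2 for x in heights) / (n - 1)
sd = variance ** 0.5
```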

So then we have an estimate of our parameter, and we have an estimate of the variability, and we can define something called a confidence interval. The confidence interval is basically our estimate minus some fraction of the variability on the lower side, and our estimate plus the same fraction of the variability on the upper side. So this tells you a little bit about how much variability we have in the sample.

That's the s_x, the square root of the variance estimate up here. And then the square root of n in the denominator tells us that as the number of samples grows bigger and bigger, we have less and less variability in our estimate of the sample mean. And then we also have some constant here that says how wide we want this confidence interval to be, that is, how much we want to trust it.

And so a confidence interval is defined by the probability that the real parameter is covered by the interval, and we can set that probability to be some function of this constant. If the data are normal and we set the constant to be 1.96, then the probability that the true parameter is covered by the confidence interval, if we recalculated the confidence interval over and over again for new data sets, would be 95%.
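Putting the pieces together, the interval is estimate ± constant × s_x / √n. A minimal sketch with the same made-up heights as above (Python for illustration; the course uses R):

```python
import math

# Made-up sample of heights (standing in for the Galton data).
heights = [64.2, 66.5, 68.0, 70.1, 63.8, 69.4, 67.2, 65.9]

n = len(heights)
mean_height = sum(heights) / n
s_x = math.sqrt(sum((x - mean_height) ** 2 for x in heights) / (n - 1))

# 95% confidence interval for the mean, using the normal
# constant 1.96 from the lecture: estimate +/- 1.96 * s_x / sqrt(n).
half_width = 1.96 * s_x / math.sqrt(n)
ci = (mean_height - half_width, mean_height + half_width)
```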

So in general, confidence intervals give you some idea of the range of values that you're somewhat confident will cover the true value of the parameter if you repeated the study over and over and over again.

So inference, again, is a whole different class, and I've linked to a really good one here. But keep in mind that the basic thing you need to do is estimate the quantity that you care about, in this case the regression coefficient, and then quantify the uncertainty in that regression coefficient, and then use that uncertainty quantification to say something about the likely values for the real parameter in the population.
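For a regression coefficient, those same two steps apply: estimate the slope, then quantify its uncertainty. A minimal sketch using the standard simple-linear-regression formulas, with made-up parent/child height pairs (Python for illustration; in R, the course's language, confint(lm(child ~ parent)) reports such an interval directly):

```python
import math

# Made-up parent/child height pairs (standing in for Galton's data).
parent = [68.0, 70.5, 65.0, 72.0, 66.5, 69.0, 71.0, 67.5]
child = [67.0, 70.0, 66.0, 71.5, 66.0, 69.5, 70.0, 68.0]

n = len(parent)
xbar = sum(parent) / n
ybar = sum(child) / n

# Ordinary least squares estimates of slope and intercept.
sxx = sum((x - xbar) ** 2 for x in parent)
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(parent, child))
slope = sxy / sxx
intercept = ybar - slope * xbar

# Residual standard deviation (n - 2 degrees of freedom).
residuals = [y - (intercept + slope * x) for x, y in zip(parent, child)]
s = math.sqrt(sum(r ** 2 for r in residuals) / (n - 2))

# Approximate 95% interval for the slope using the normal constant
# 1.96; with n this small, a t quantile would really be used instead.
se_slope = s / math.sqrt(sxx)
ci_slope = (slope - 1.96 * se_slope, slope + 1.96 * se_slope)
```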

Â Coursera provides universal access to the worldâ€™s best education,
partnering with top universities and organizations to offer courses online.