One of the most fundamental concepts in statistics for genomics,
or really statistics for anything, is experimental design.
So here, we're going to talk about three of the key ideas, variability,
replication, and power.
So, the first thing to keep in mind is the central dogma of statistics.
So, the central dogma of statistics states that we have some population that
we'd like to measure things about.
So, this is the population that we have over here.
And we have some, in this case,
shapes, and we want to measure the fraction that are a particular color.
The way that we do that is that we use probability and
sampling to get a smaller subset of these objects, and measure something on them.
So, in genomics this might be the population of all people that have
a particular cancer.
We might sample a subset of them, ideally randomly, but
almost never is that the case.
And then measure something about them, say, the gene expression profiles or
measurements about their genetic variants, or something else.
We then want to use that information to make inference about the population.
So we want to take statistics and basically,
summarize these data that we've collected on this small sample, and
see if we can say what the proportion is in the big population up here.
So, what happens is there's a couple of different kinds of variability that
get introduced.
But the main one that we talk about a lot in statistics is sampling variability.
We use probability here to select which samples we want to look at.
Now, it could be that set of samples, or in a different case,
you might get a whole different set of samples.
And so you might get variable estimates of what we think is going to happen in
the population.
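To make sampling variability concrete, here's a minimal sketch in Python. The population of 10,000 shapes, the 30% true proportion, and the sample size of 50 are all made-up numbers for illustration, and `sample_proportions` is a hypothetical helper, not something from the lecture.

```python
import random
import statistics

def sample_proportions(population, sample_size, n_repeats, seed=0):
    """Draw repeated random samples and return the estimated proportion from each."""
    rng = random.Random(seed)
    proportions = []
    for _ in range(n_repeats):
        sample = rng.sample(population, sample_size)  # sampling without replacement
        proportions.append(sum(sample) / sample_size)
    return proportions

# Hypothetical population: 10,000 shapes, 30% of which have the color of interest.
population = [1] * 3000 + [0] * 7000
estimates = sample_proportions(population, sample_size=50, n_repeats=1000)

print("true proportion:", sum(population) / len(population))
print("mean of the estimates:", round(statistics.mean(estimates), 3))
print("spread (sd) of the estimates:", round(statistics.stdev(estimates), 3))
```

Each random sample gives a slightly different estimate. On average the estimates land near the true proportion, and the spread across repeats is the sampling variability.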
So, it turns out there are three major sources of variability in
most genomic measurements.
So, the first is phenotype variability.
So, almost always in a genomic experiment,
whether you're talking about a genomic experiment in humans
or in other organisms, you want to measure variability, say,
between cancers and normals, or between two different levels of output of a crop.
So, there's variability in the measurements due to that phenotype.
There's also variability due to measurement error, and this is a big one.
So, the measurement error can either be sort of random measurement error,
that sort of happens to every measurement as we go along.
It can also be measurement error that's correlated or biased.
There are all sorts of reasons why you might measure things
differently between different samples that aren't down to biology, or
the phenotype that you care about.
And so that often comes in the form of batch effects or
other things like that which we'll talk about later.
Finally, there's natural biological variation.
This is one of the reasons why genomics is so exciting, but also so hard.
Any two individuals will have variability in their genomic measurements
just because they're different people.
Not necessarily due to any specific biological difference between them, or
due to any measurement error.
In fact, you can take the exact same person and
measure the same kind of tissue repeatedly over time
in the exact same way, and you'll still get variation from time to time.
That's because there's natural variability in, say, the amount that each
gene is expressed, or which variants are present in which cells in the body.
So that natural variation is something that you have to account for
when modeling things with statistics.
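As a rough sketch of how these sources stack up, here's a small simulation in Python. It assumes a simple additive model with Gaussian noise, which is an assumption made for illustration and not something stated in the lecture. The baseline of 10, the phenotype effect of 2, and the two standard deviations are made-up numbers, and `simulate_expression` is a hypothetical helper. Correlated measurement error, like batch effects, is not modeled here.

```python
import random
import statistics

def simulate_expression(n_per_group, phenotype_effect, bio_sd, meas_sd, seed=1):
    """Simulate one gene's measured expression for normal and cancer samples."""
    rng = random.Random(seed)
    data = {"normal": [], "cancer": []}
    for group, shift in (("normal", 0.0), ("cancer", phenotype_effect)):
        for _ in range(n_per_group):
            # Phenotype variability: the group shifts the expected level.
            # Biological variation: each individual differs around that level.
            true_level = 10.0 + shift + rng.gauss(0.0, bio_sd)
            # Random measurement error: the assay adds its own noise on top.
            data[group].append(true_level + rng.gauss(0.0, meas_sd))
    return data

data = simulate_expression(n_per_group=2000, phenotype_effect=2.0,
                           bio_sd=1.0, meas_sd=0.5)
group_difference = statistics.mean(data["cancer"]) - statistics.mean(data["normal"])
within_group_variance = statistics.variance(data["normal"])

print("estimated phenotype effect:", round(group_difference, 2))
print("variance within the normal group:", round(within_group_variance, 2))
```

Under this model, the variance within a group is roughly the biological variance plus the measurement variance (1.0 squared plus 0.5 squared here), and the phenotype effect only shows up as the difference between the group means.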
So, the other thing that you need to keep in mind when you're doing
experiments is how are you going to measure those types of variability.
So, there are two big types of replicates that people often do in experiments.
So, replicates are when you do an experiment, and you want to do it more
than one time because there's so much variability in the sample.
You want to be able to measure that variability, and make an estimate of how
uncertain you are about certain statements that you're making.
So, the first kind of replication is technical replication.
So, this is where you have some sample, say, it's a sample that you've already
collected from somebody, and you're going to run genomic measurements on it.
So, what you do is you do some sort of processing of the samples, and
you do that processing two separate times.
You do the exact same processing, but you just repeat it twice.
Then you get what's called technical replicates.
So, these are replicates that just replicate the technical part of
the process.
Now, there are different kinds of technical replicates,
because there's obviously multi-step processes that generate these replicates.
So, you can have different kinds of technical replicates, but
in this case it's always the same biological sample that you're considering.
The other kind of replicate, and one that's very important to
collect in genomic measurements, is biological replicates.
So, this is where you take different people, or different organisms,
or different samples from different tissues and you prep them.
You, again, prep them the exact same way.
But here you're prepping two different samples, and so because you're
prepping two different samples, you get two biological replicates.
So, biological replicates measure something about biological variability,
not just how variable the machine was in making those measurements.
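Here's a minimal sketch in Python of the difference between the two kinds of replicates. It assumes additive Gaussian noise, and the standard deviations (1.0 for biology, 0.5 for the measurement) are made-up numbers. The function names are hypothetical, just for illustration.

```python
import random
import statistics

def technical_replicates(true_level, meas_sd, n, rng):
    """Re-measure the SAME biological sample n times: only measurement noise varies."""
    return [true_level + rng.gauss(0.0, meas_sd) for _ in range(n)]

def biological_replicates(population_mean, bio_sd, meas_sd, n, rng):
    """Measure n DIFFERENT individuals once each: biology and measurement both vary."""
    return [rng.gauss(population_mean, bio_sd) + rng.gauss(0.0, meas_sd)
            for _ in range(n)]

rng = random.Random(2)
tech = technical_replicates(true_level=10.0, meas_sd=0.5, n=2000, rng=rng)
bio = biological_replicates(population_mean=10.0, bio_sd=1.0, meas_sd=0.5,
                            n=2000, rng=rng)

tech_spread = statistics.stdev(tech)
bio_spread = statistics.stdev(bio)
print("sd across technical replicates:", round(tech_spread, 2))
print("sd across biological replicates:", round(bio_spread, 2))
```

In this sketch the technical replicates only recover the measurement noise, while the biological replicates are more spread out because they also include the variation between individuals.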