0:06
Hello everybody and welcome to the first lecture of Week 7.
This is the last week of our course here in Experimental Methods.
And we're gonna be talking about how we interpret all the experimental
data from the methods we were discussing in the first few weeks of the course.
And in this first lecture, we're gonna be talking about the more large scale,
omics type experiments, and how one can interpret those.
So with this lecture, we're gonna be talking about mainly what can you do
with mRNA sequencing, or mass spectrometry data when you have, essentially,
lists of genes or transcripts or proteins that are differentially expressed.
Meaning that their levels have changed between two or
more conditions that you care about.
And we're gonna go over a whole host of techniques
which are based on using a whole wide range of prior
knowledge to learn things about your system of interest
based on these lists of differentially expressed genes.
And I'd just like to note that this part of the course
is focusing on the omics type experiments.
The next lectures of this week will be talking about interpreting
the experimental data you may get from flow cytometry or live cell imaging,
which are much more single cell and/or much more dynamics focused.
And as such, this lecture here will be kind of a bridge
to another course we're offering, called Network Analysis and Systems Biology.
This is taught by a professor here, also in our department, Doctor Avi Ma'ayan.
And this will just be a brief kind of overview of
the techniques that he will be going a lot more in depth into.
2:05
So what I'm gonna do with this lecture is first just kind of
go over some of the analyses that can be done to learn
things with lists of differentially expressed genes.
And then briefly go over a case study where
this sort of methodology was applied.
So one of the main cornerstones of analyzing these omics,
kind of unbiased large scale data, is the idea of differential expression, or
what things have changed between two conditions?
So for example, if you have an mRNA sequencing experiment or
a mass spec proteomic experiment, usually you're doing this under two or
more conditions to see what changes as a result of how you've treated your samples.
So this gives rise to quantified transcript levels or
protein levels, almost genome-wide.
And you're very often interested in asking the question, now,
what is changed between these two conditions?
So, differential expression analysis is a variety of methods
that attempts to answer this question.
Implementing differential expression analysis and
really understanding how to do it and also how not to do
it, this relies very heavily on computation and statistics.
So we won't get into much detail here in this course on that.
Although, Professor Ma'ayan's course, as I mentioned in the outline,
will be getting more into that sort of analysis.
But here I just wanna say that if you have this type of omics data,
either quantification of transcript levels or
quantification of protein levels,
there are lots of robust ways of taking those sorts of data and
turning them into lists of differentially expressed genes.
Okay, and one thing that people don't often
appreciate is that mRNA sequencing data, and
proteomics data can often give complementary information.
It's sometimes thought that if you have a list of differentially
expressed transcripts and a list of differentially expressed proteins,
there'll be a lot of overlap there,
and you won't learn a whole lot new between the two datasets. And of course,
there is some overlap.
But what's becoming very clear in recent years is that translational control
of protein expression has a much greater role than was previously appreciated.
4:49
So when things get up or down regulated on the level of mRNAs, that doesn't always
correspond to up or down regulation on the level of protein expression.
Various studies have found
different levels of correlation between transcript levels and protein levels.
Some estimate it's as low as maybe 30 to 40% correlation.
Others estimate as high as 70 or 80% correlation.
And of course, the correlation's gonna really depend on
the biological system which you're studying.
But the point is here that the two methods can
really give you different information.
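To make the idea of transcript-protein correlation concrete, here is a minimal sketch that computes a Pearson correlation between mRNA and protein fold changes for the genes seen in both datasets. The gene names and values are made up for illustration, and Pearson correlation is only one of several measures that studies like these might use.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical log2 fold changes (treated vs. control) for illustration only.
mrna_log2fc = {"geneA": 2.0, "geneB": -1.0, "geneC": 0.5, "geneD": 1.5, "geneE": -0.5}
protein_log2fc = {"geneA": 1.0, "geneB": -0.2, "geneC": -0.8, "geneD": 1.2, "geneF": 0.3}

# Only genes quantified in both experiments can be compared.
shared = sorted(set(mrna_log2fc) & set(protein_log2fc))
r = pearson([mrna_log2fc[g] for g in shared], [protein_log2fc[g] for g in shared])
print(f"Pearson r over {len(shared)} shared genes: {r:.2f}")
```

Note that genes detected in only one of the two datasets (geneE and geneF here) drop out of the comparison, which is the coverage point made just after this.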
Another point is technical. Experiments like mRNA
sequencing or proteomics are really kind of a sampling technology,
where you may not always see all the transcripts or proteins
in your sample just because you're not looking deeply enough into it.
So by using the two different approaches to look at proteins and
transcripts, you can actually
get effectively better coverage of your sample.
So the first and most straightforward way that people analyze these
types of omics data is with plots called heat maps and/or volcano plots.
So the basic idea of a heat map is simple.
For example, you pile up your genes or
transcripts on one axis of a kind of matrix,
and then you pile up your conditions on the other axis.
Each square in the matrix then corresponds to a particular transcript or
gene under a certain condition,
and you can color each square by the level of expression.
And you can do many things with that.
One of them is just to see how close or far apart different samples
are from one another by using so-called hierarchical clustering analysis.
And this allows you to compute a so-called dendrogram,
which is pictured on the axes, like here, or here.
That allows you to see what genes are closer to one another in this expression
space, or what samples are closer to one another in this sample space.
And this can give a lot of insight into interpreting
the experiment that you've done.
So that's a very common way of visualizing these types of omics data.
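As a rough illustration of the clustering behind a dendrogram, the sketch below does agglomerative clustering of samples using Euclidean distance and average linkage. Both of those choices, along with the sample names and expression values, are assumptions made for the example. Real heat map tools offer several distance and linkage options and also draw the colored matrix, which is left out here.

```python
import math
from itertools import combinations

def euclidean(a, b):
    """Straight-line distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cluster_distance(c1, c2, profiles):
    """Average pairwise distance between the members of two clusters."""
    total = sum(euclidean(profiles[a], profiles[b]) for a in c1 for b in c2)
    return total / (len(c1) * len(c2))

def average_linkage(profiles):
    """Repeatedly merge the two closest clusters.

    profiles: dict mapping a sample name to its list of expression values.
    Returns the merges in order as (cluster_a, cluster_b, distance); read
    from first to last, they describe the dendrogram from leaves to root.
    """
    clusters = [(name,) for name in profiles]
    merges = []
    while len(clusters) > 1:
        c1, c2 = min(
            combinations(clusters, 2),
            key=lambda pair: cluster_distance(pair[0], pair[1], profiles),
        )
        merges.append((c1, c2, cluster_distance(c1, c2, profiles)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges

# Hypothetical expression values for three genes in four samples.
samples = {
    "control_1": [1.0, 5.0, 3.0],
    "control_2": [1.2, 4.8, 3.1],
    "treated_1": [4.0, 1.0, 3.0],
    "treated_2": [4.2, 1.1, 2.9],
}

for left, right, dist in average_linkage(samples):
    print(left, "+", right, f"at distance {dist:.2f}")
```

With these made-up values the replicates of each condition merge first and the two conditions join only at the last step, which is the pattern one hopes to see in a well-behaved experiment. The same function clusters genes instead of samples if the dictionary maps gene names to their values across conditions.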
And I guess the most straightforward way to try
to figure out what can you learn from these omics data,
is kinda by picking out the biggest changers.
Or, what are the one or two things that changed the most,
and then let's study those further.
And one way to pick those out is a so-called volcano plot, and
I've shown a representative volcano plot here,
aptly named because usually it does look like an erupting volcano.
So what you do is you plot the fold change
8:19
between two conditions on the x-axis.
So how much did the expression of each transcript,
in this case if we're thinking about mRNA sequencing or microarray,
how much did it change relative to control?
And then on the y-axis, what is the statistical significance of that change?
So things that have a large fold change and
also have a large statistical significance,
these are the ones that you would want to select for further analysis,
say by qPCR or knockdown experiments, overexpression experiments, etc.
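The two axes of a volcano plot can be sketched in a few lines. This example assumes the per-gene p-values have already been computed upstream by some statistical test, and that mean expression levels are strictly positive (a pseudocount would be needed otherwise). The gene names, the numbers, and the cutoffs of a two-fold change and p below 0.05 are all illustrative assumptions, not recommendations.

```python
import math

def volcano_points(results):
    """Convert per-gene results into volcano-plot coordinates.

    results: dict gene -> (mean_control, mean_treated, p_value)
    returns: dict gene -> (log2 fold change, -log10 p-value)
    """
    return {
        gene: (math.log2(treated / control), -math.log10(p))
        for gene, (control, treated, p) in results.items()
    }

def select_hits(points, min_abs_log2fc=1.0, max_p=0.05):
    """Genes with both a large fold change and high significance."""
    min_neg_log10_p = -math.log10(max_p)
    return sorted(
        gene
        for gene, (log2fc, neg_log10_p) in points.items()
        if abs(log2fc) >= min_abs_log2fc and neg_log10_p >= min_neg_log10_p
    )

# Hypothetical results: (mean in control, mean in treated, p-value).
results = {
    "geneA": (10.0, 80.0, 0.0001),  # strongly up, significant
    "geneB": (50.0, 55.0, 0.60),    # barely changed
    "geneC": (40.0, 5.0, 0.001),    # strongly down, significant
    "geneD": (8.0, 32.0, 0.20),     # large change, not significant
}

points = volcano_points(results)
print(select_hits(points))  # ['geneA', 'geneC']
```

Taking the log of the fold change puts up- and down-regulation symmetrically on either side of zero, and taking the negative log of the p-value puts the most significant genes at the top, which is what gives the plot its shape.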
That's kind of the simplest, but
also a kind of naive, view of what you can do with these omics data.
Of course, it's kind of a screening tool that allows you to
pick out what are the couple of things that are changing the most,
kind of the red flags, from those experiments.
But there's so much more you can do with it.
And it's all, to some extent, based on the idea that, instead of looking at
what's going on in single genes, we look at things collectively,
at what's going on with groups of genes, or so-called gene sets.
And one of the pioneering methods in this area is gene set enrichment analysis,
which was proposed in a series of two publications as noted here.
The idea is that you can use prior knowledge
10:19
And you can use this prior knowledge to see what lists of differentially expressed
genes from your study overlap with this prior definition of gene sets.
So the way that this works is that you have your omics-scale data and
then you rank it by something.
It could be just fold change between two conditions.
And then once you rank it by this criterion, you can see,
as you walk down your list, how many genes from your list
appear in these other previously collected gene sets.
As you find genes that are members of a gene set, a so-called
enrichment score, a running sum, gets incremented.
And as you find genes that are not members, the enrichment score goes down.
So, when you find gene sets that have a very high
enrichment score as one walks down your gene list, then these are very likely
to have some sort of significance in interpreting your data set.
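The walk down the ranked list can be sketched as a running sum. This is a deliberately simplified, unweighted version written for illustration: every member of the gene set raises the score by the same amount and every non-member lowers it by the same amount. The published method adds weighting by the ranking metric and permutation-based significance testing, neither of which is shown here. The gene names are placeholders.

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum enrichment score.

    Walk down the ranked list, stepping up for genes in the set and down
    for genes outside it. The step sizes are chosen so the sum ends at
    zero. Returns the running-sum value farthest from zero.
    """
    members = set(gene_set) & set(ranked_genes)
    n_hits = len(members)
    n_miss = len(ranked_genes) - n_hits
    if n_hits == 0 or n_miss == 0:
        raise ValueError("gene set must cover some, but not all, ranked genes")
    running, best = 0.0, 0.0
    for gene in ranked_genes:
        running += 1.0 / n_hits if gene in members else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best

# Hypothetical list of ten genes, ranked from most up- to most down-regulated.
ranked = [f"g{i}" for i in range(1, 11)]

top_set = {"g1", "g2", "g3"}      # concentrated at the top of the list
spread_set = {"g2", "g6", "g9"}   # scattered through the list

print(enrichment_score(ranked, top_set))
print(enrichment_score(ranked, spread_set))
```

A set whose members cluster at the top of the ranking drives the running sum high before any penalties accumulate, while a set scattered through the list keeps the sum hovering near zero. A strongly negative score would correspondingly indicate a set concentrated at the bottom.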
11:33
So, this is one way to, instead of trying to find the one or
two genes that are changing the most,
look at kinda what's coordinately changing in my experiment.
And let's let that point our fingers towards entire biological processes or
pathways, etc.,
that are defining the entire response of the cells
to how they were treated.
Another key aspect in this whole definition of gene sets was
the idea of gene ontology.
This was a very seminal effort back in 2000
to try to come up with a type of classification system
where one could define gene sets in three different ways.
The first is by biological process.
An example here is DNA metabolism, which would be the biological process,
and then within that biological process you can have different children,
like DNA degradation, packaging, replication, repair, recombination, etc.
And then within each of these processes you can
go down to a list of genes that are involved in that process.
Or you can have, again, different sub-biological processes.
Similar ontologies can be defined for molecular functions, such as being
an enzyme, or being a nucleic acid binding protein, etc.
You can just kinda go down these lists, and
get to a list of genes which are involved with these molecular functions.
Similarly, it has been defined in terms of cellular components,
so sort of the structure of the cell.
So, you start out with the cell itself.
You can divide it into cytoplasm and nucleus.
Certain genes are associated with the cytoplasm,
certain with the nucleus, so on and so forth.
And one really nice aspect of this is that it works very well across organisms.
Because a lot of the features of cells are shared,
and a lot of genes have homologs, you can really get a lot of power
out of this kind of approach and this way of defining gene sets.
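Going from a biological process down through its children to a list of genes can be sketched as a recursive walk. The hierarchy below is a tiny hand-written slice based on the DNA metabolism example, and the gene annotations are included purely for illustration. They are not pulled from the actual Gene Ontology files, and in the real ontology a term can have more than one parent.

```python
# Hypothetical slice of a biological-process hierarchy (parent -> children).
children = {
    "DNA metabolism": ["DNA replication", "DNA repair", "DNA recombination"],
    "DNA replication": [],
    "DNA repair": [],
    "DNA recombination": [],
}

# Illustrative direct annotations (term -> genes annotated to that term).
annotations = {
    "DNA replication": {"POLA1", "MCM2"},
    "DNA repair": {"BRCA1", "XRCC1"},
    "DNA recombination": {"RAD51", "BRCA1"},
}

def genes_for_term(term):
    """Collect genes annotated to a term or to any of its descendants."""
    genes = set(annotations.get(term, set()))
    for child in children.get(term, []):
        genes |= genes_for_term(child)
    return genes

print(sorted(genes_for_term("DNA metabolism")))
print(sorted(genes_for_term("DNA repair")))
```

The gene set for a broad term is the union of everything beneath it, so a gene annotated to two child processes (BRCA1 in this toy example) is counted once. Using a set union also keeps the result correct when a term is reachable through more than one parent.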
Another form of omics-level analysis is so-called transcription factor analysis.
So if you have a list of genes which have been differentially expressed in response
to your treatment, a common hypothesis is that many of these genes that have
changed expression together probably share some common transcription factors.
And if we can figure out what those transcription factors are,
that gives us a lot more insight into what's going on in the system.
14:25
So, one common way that this is done is by using a lot of prior knowledge.
And one very nice tool that I've highlighted here is a tool called ChEA,
which infers transcription factor regulation from integration of genome-wide
chromatin immunoprecipitation experiments.
So this is a tool that the Ma'ayan lab has built.
And what they did was scour the literature and
look for essentially all the interactions they could find that describe how
a variety of transcription factors bind
to the promoters of nearly all the genes in the human genome.
15:11
And so based on this, one can then overlap that with the sets of
differentially expressed genes that you have and
look to see if any transcription factors are enriched in those sets of genes.
And of course, this isn't the only tool that's out there, there's a lot more
based on different sorts of methods that allow one to do that.
But the general idea remains quite powerful in that if one has lists of
differentially expressed genes,
you can use prior knowledge of transcriptional regulation to figure out
what transcription factors may be involved in mediating these changes.
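One common way to ask whether a transcription factor's known targets are over-represented in a gene list is a hypergeometric overlap test, sketched below. This is a generic illustration and not necessarily the statistic that ChEA or any other specific tool uses. The transcription factor names, target sets, and background size of 20,000 genes are all assumptions for the example.

```python
from math import comb

def hypergeom_pvalue(overlap, list_size, target_size, background_size):
    """Chance of seeing at least `overlap` targets in a random gene list.

    Models drawing `list_size` genes without replacement from a background
    of `background_size` genes, of which `target_size` are known targets.
    """
    upper = min(list_size, target_size)
    total = comb(background_size, list_size)
    tail = sum(
        comb(target_size, k) * comb(background_size - target_size, list_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total

# Hypothetical prior knowledge: transcription factor -> known target genes.
tf_targets = {
    "TF_A": {"g1", "g2", "g3", "g4", "g50"},
    "TF_B": {"g4", "g60", "g61", "g62", "g63"},
}
de_genes = {"g1", "g2", "g3", "g4", "g5", "g6"}  # differentially expressed
BACKGROUND = 20000  # assumed number of genes that could have been detected

ranked_tfs = sorted(
    (
        (tf, hypergeom_pvalue(len(targets & de_genes), len(de_genes),
                              len(targets), BACKGROUND))
        for tf, targets in tf_targets.items()
    ),
    key=lambda item: item[1],
)
for tf, p in ranked_tfs:
    print(tf, f"p = {p:.3g}")
```

In practice the p-values would also need correcting for the number of transcription factors tested. The same overlap test applies unchanged to any other kind of gene set, which is why one statistic can serve many different libraries of prior knowledge.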
In that same kind of vein, one can also,
instead of looking at transcription factor-DNA regulation,
take another step forward and ask: now, if I know what transcription
factors might be operating to cause these changes in gene expression,
what other proteins might be involved in these processes?
And one common way of looking at this is to look at protein-protein interaction
databases, that contain information on how not only transcription factors,
but many other proteins interact with other proteins in the cell.
So there's many experiments that have been
done with different levels of fidelity and different techniques.
One very common one is a screen or assay called a yeast two-hybrid assay.
And many of these protein-protein interactions have been put into
a variety of databases here.
And I'm showing, again, another tool from the Ma'ayan lab, that's been built to
take into account many of these known protein-protein interactions.
So, again, what you can do is, given a list of differentially expressed genes
17:17
in terms of other proteins that might be expressed?
And so, by looking at these connections, you can derive what's called a network
model for how your differentially expressed genes are interacting
potentially with each other through known protein-protein interactions.
So this is a very powerful way to look beyond transcriptional
regulation or regulation of protein levels, to look into how
the function might be propagating in a signaling network, for example.
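Building a network model from known interactions can be sketched as follows. This version connects seed proteins that interact directly or that share a single intermediate partner. That rule is an assumption chosen to keep the example short, not a description of how any particular tool works, and the protein names and interactions are invented.

```python
from itertools import combinations

def build_subnetwork(seeds, interactions):
    """Edges linking seed proteins directly or through one shared partner.

    seeds: proteins of interest (for example, enriched transcription factors).
    interactions: iterable of (protein_a, protein_b) pairs from prior knowledge.
    Returns a sorted list of undirected edges as (name, name) tuples.
    """
    neighbors = {}
    for a, b in interactions:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    seeds = set(seeds)
    edges = set()
    for s1, s2 in combinations(sorted(seeds), 2):
        if s2 in neighbors.get(s1, set()):
            edges.add(frozenset((s1, s2)))  # direct interaction
        shared = neighbors.get(s1, set()) & neighbors.get(s2, set())
        for mid in shared - seeds:
            # A non-seed intermediate that bridges the two seeds.
            edges.add(frozenset((s1, mid)))
            edges.add(frozenset((mid, s2)))
    return sorted(tuple(sorted(edge)) for edge in edges)

# Hypothetical protein-protein interactions.
interactions = [
    ("TF1", "KinaseX"),
    ("TF2", "KinaseX"),
    ("TF2", "TF3"),
    ("TF3", "AdaptorY"),
    ("AdaptorZ", "KinaseX"),
]

print(build_subnetwork(["TF1", "TF2", "TF3"], interactions))
```

In this toy example KinaseX is pulled into the network because it bridges TF1 and TF2, while AdaptorY and AdaptorZ are left out because each touches only one seed. Intermediates recruited this way are the candidates for the upstream regulators that the later steps try to identify.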
So although that's very powerful, there can be caveats to this.
18:25
Meaning that the protein-protein complex holds for a very long time.
And this can actually miss a lot of important interactions.
For example, many of the protein-protein interactions involved in signaling
pathways are very transient by their nature,
because if they weren't then the signaling pathway wouldn't be able
to turn off the signal and they wouldn't be able to function properly.
So, although there are some caveats to it, you can get a lot of useful information
out of applying this knowledge of protein-protein interactions to generate
network models from lists of differentially expressed genes.
So when we start to put those sorts of analyses together,
we can actually put them together in a pipeline and really come up with a lot
of new knowledge from simply looking at lists of differentially expressed genes.
So as an example here, you can use several tools that
the Ma'ayan lab has developed to go from a list of differentially expressed genes
and do transcription factor analysis to try to look at
what transcription factors might have been driving those changes in gene expression.
Based on those transcription factors, you can then use protein-protein interaction
databases to take another step back and find what are the pathways
that might be involved in mediating the activity of these transcription factors.
And lastly, once you have these hypotheses in the form of a network,
you can look at what are the proteins in that network, and try to look at
what sort of enzymatic activities might be enriched in that network.
And in particular,
many times these changes are mediated by changes in kinase activities.
So you can look at whether certain types of
kinase-substrate relationships are highly enriched in
the resulting network models that you can derive.
20:34
And one way to do that is so-called Kinase Enrichment Analysis,
again another tool from the Ma'ayan lab that has been developed
in order to do that.
So by combining many of these tools with prior knowledge,
in this case prior knowledge of where transcription
factors bind in terms of promoters of genes,
how transcription factors interact with other
proteins that might be present in the cell,
and then how those other proteins might be substrates for
known kinases,
one can really start to tease out lots of information and
really propose very novel hypotheses,
simply based on lists of differentially expressed genes.
And a piece of software that kind of takes all these tools and
puts them together in one place is a pretty recent one called Enrichr,
again, developed by the Ma'ayan lab here at Mount Sinai.
So the idea is that again,
you just start with lists of differentially expressed genes.
And you can use this software, Enrichr, and it already has built into it
many of these gene sets that are used by a variety of tools that people use.
So Gene Set Enrichment Analysis contains some of the gene sets,
but there's many others out there which Enrichr has in it.
And then you can use Enrichr to do all sorts of different analyses.
So, you can do enrichment of different ontologies,
or just look at different biological processes and
look at those results in the form of a bar graph, for example.
You can generate networks, as I was mentioning before,
by looking at protein-protein interaction prior knowledge.
You can just visualize multivariate data by generating grids
where each spot on the grid corresponds to a different enrichment score, for example.
So the tool is just very flexible and powerful, able to incorporate
all these different types of prior knowledge in kind of one spot.
So that was just a very brief overview of the types
of tools that are available to really gain a lot of knowledge and
generate hypotheses based simply on lists of differentially expressed genes.
Which can be derived from either mRNA sequencing
23:06
or microarray data, proteomics data, or a combination of them all.
So now I'd just like to show one example of how all of these sorts
of techniques were applied to really come up with something new,
and really learn something about a biological system.
In this case, this is a study from the Ma'ayan lab in
collaboration with another lab, both here at Mount Sinai,
where they were using all these tools that I described previously
to study kidney fibrosis that is caused by HIV.
So they have a mouse model of this system where,
if you look at the kidneys of the wild type mice, they look like this.
But in the case of a model of HIV and how that causes kidney damage,
this is what the kidneys look like in this so-called Tg26 mouse strain.
So the first thing that was done in this study was that they simply
looked at mRNA expression.
So they did transcriptome experiments.
In this case microarrays because it was a bit of an older study.
But in any case, then they could look at
25:03
And after doing these analyses,
they came up with this list of transcription factors.
Which wasn't special by itself.
But, so as I mentioned before, they took this list of transcription factors and
ran it through their tool called Genes to Networks.
And so, based on these transcription factors, they came up with a putative
network model for how these transcription factors are connected
to not only one another but also to other proteins present in the cell.
So based on this network model,
they then used kinase enrichment analysis to look at what are the possible
kinases that are involved in regulating the activities of these transcription
factors which gave rise to this pattern of differential expression.
And upon doing this kinase enrichment analysis, they found two of the most
common kinases that you would find for almost any system that you look at,
MAP kinase 1 and MAP kinase 3, also known as ERK2 and ERK1.
26:05
But the third one on the list here was a kinase called HIPK2,
which was much less studied but highly significant according to their analysis.
So the next step in their study was to do extensive
validation experiments to look at what HIPK2 is doing in
this kidney fibrosis process that is induced by HIV.
Is it really a viable hypothesis? Does HIPK2
actually play a role in this kidney fibrosis process?
To make a long story short, they found that it has quite a major role.
So what they were able to do was to take those mice and
cross them with a HIPK2 knockout.
And then they could look at all sorts of aspects of how that HIPK2 knockout
interacts with the Tg26 background which causes the HIV kidney fibrosis.
And what they found was that when HIPK2 was knocked out,
kidney function essentially returned very close to normal.
It was improved greatly, both in terms of various physiological parameters,
measurements of kidney function, and in terms of the histology,
the tissue morphology and architecture. The wild type
kidneys looked like this.
27:42
The knockout kidneys didn't look too different from the wild type.
But when you took the disease background and then knocked out HIPK2,
it looked much more like the wild type.
So this was a really nice example of how,
just by looking at differential expression between two conditions,
in this case a wild type mouse and a transgenic mouse
which was a model for HIV-induced kidney fibrosis,
you could take this and run it through a series of programs that utilize
a large body of prior knowledge to identify a single kinase,
in this case HIPK2, which seems to be a very viable drug target to help
treat HIV-induced kidney fibrosis.
So that's all of the lecture slides that I had for this part of week seven.