Hello everybody, and welcome to the first lecture of Week 7. This is the last week of our course here in Experimental Methods, and we're gonna be talking about how we interpret all the experimental data from the methods we were discussing in the first few weeks of the course. In this first lecture, we're gonna be talking about the more large-scale, omics-type experiments, and how one can interpret those. So mainly, what can you do with mRNA sequencing or mass spectrometry data when you have, essentially, lists of genes or transcripts or proteins that are differentially expressed, meaning that their levels have changed between two or more conditions that you care about. We're gonna go over a whole host of techniques that use a wide range of prior knowledge to learn things about your system of interest based on these lists of differentially expressed genes. And I'd just like to note that this part of the course is focusing on the omics-type experiments. The next lectures of this week will be talking about interpreting the experimental data you may get from flow cytometry or live cell imaging, which are much more single-cell and/or much more dynamics focused. As such, this lecture will be kind of a bridge to another course we're offering, called Network Analysis in Systems Biology. This is taught by a professor here, also in our department, Doctor Avi Ma'ayan. And this will just be a brief overview of the techniques that he will be going into in a lot more depth. So what I'm gonna do with this lecture is first go over some of the analyses that can be done to learn things from lists of differentially expressed genes, and then briefly go over a case study where this sort of methodology was applied.
So one of the main cornerstones of analyzing these omics, kind of unbiased, large-scale data is the idea of differential expression, or what things have changed between two conditions. For example, if you have an mRNA sequencing experiment or a mass spec proteomics experiment, usually you're doing this under two or more conditions to see what changes as a result of how you've treated your samples. This gives rise to quantified transcript levels or protein levels, almost genome-wide. And you're very often interested in asking the question, now, what has changed between these two conditions? Differential expression analysis is a variety of methods that attempt to answer this question. Implementing differential expression analysis, and really understanding how to do it and also how not to do it, relies very heavily on computation and statistics, so we won't get into much detail on that here in this course. Professor Ma'ayan's course, as I mentioned in the outline, will be getting more into that sort of analysis. But here I just wanna say that if you have this type of omics data, either quantification of transcript levels or quantification of protein levels, there are lots of robust ways of taking those sorts of data and turning them into lists of differentially expressed genes.
Okay, and one thing that people don't often appreciate is that mRNA sequencing data and proteomics data can often give complementary information. It's sometimes thought that if you have a list of differentially expressed transcripts and a list of differentially expressed proteins, there'll be a lot of overlap there, and you won't learn a whole lot new between the two datasets. And of course, there is some overlap. But what's become very clear in recent years is that translational control of protein expression has a much greater role than was previously appreciated.
So when things get up- or down-regulated on the level of mRNAs, that doesn't always correspond to up- or down-regulation on the level of protein expression. Various studies have found different levels of correlation between transcript levels and protein levels. Some estimate it's as low as maybe 30 or 40% correlation. Others estimate as high as 70 or 80% correlation. And of course, the correlation's gonna really depend on the biological system which you're studying. But the point here is that the two methods can really give you different information. Another point is technical. Experiments like mRNA sequencing or proteomics are really kind of a sampling technology, where you may not always see all the transcripts or proteins in your sample just because you're not looking deeply enough into it. So by using the two different approaches to look at proteins and transcripts, you can effectively get better coverage of your sample.
So the first and most straightforward way that people analyze these types of omics data is with plots called heat maps and/or volcano plots. The basic idea of a heat map is that you pile up your genes or transcripts here on one axis of kind of a matrix, and then you pile up your conditions on the other axis. Each square in here then corresponds to a particular transcript or gene under a certain condition, and you can color each square by the level of expression. You can do many things with that. One of them is just to see how close or far away different samples are from one another by using so-called hierarchical clustering analysis. This allows you to compute a so-called dendrogram, which is pictured on the axes, like here, or here. That allows you to see what genes are closer to one another in this expression space, or what samples are closer to one another in this sample space.
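The clustering step just described can be sketched in a few lines. This is only a minimal illustration, assuming Euclidean distance and average linkage (the lecture does not say which distance or linkage was used), and the gene names, sample names and expression values are all made up. It groups samples by their expression profiles and records the order in which they merge, which is the information a dendrogram draws.

```python
from itertools import combinations
from math import sqrt

# Hypothetical expression matrix: rows are genes, columns are samples.
samples = ["ctrl_1", "ctrl_2", "treated_1", "treated_2"]
expression = {
    "geneA": [1.0, 1.2, 5.0, 5.3],
    "geneB": [2.0, 2.1, 6.1, 5.9],
    "geneC": [7.0, 6.8, 1.0, 1.1],
    "geneD": [3.0, 3.1, 3.0, 2.9],
}


def euclidean(a, b):
    """Straight-line distance between two expression profiles."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def sample_profiles(expression, samples):
    """Turn the gene-by-sample matrix into one profile (column) per sample."""
    return {s: [row[i] for row in expression.values()] for i, s in enumerate(samples)}


def average_linkage(profiles):
    """Agglomerative clustering; returns the merges in the order they happen."""
    clusters = {name: [name] for name in profiles}  # cluster label -> members
    merges = []

    def dist(c1, c2):
        # Average distance over all member pairs of the two clusters.
        pairs = [(a, b) for a in clusters[c1] for b in clusters[c2]]
        return sum(euclidean(profiles[a], profiles[b]) for a, b in pairs) / len(pairs)

    while len(clusters) > 1:
        c1, c2 = min(combinations(sorted(clusters), 2), key=lambda p: dist(*p))
        merges.append((c1, c2, round(dist(c1, c2), 3)))
        clusters[f"({c1},{c2})"] = clusters.pop(c1) + clusters.pop(c2)
    return merges


merges = average_linkage(sample_profiles(expression, samples))
for left, right, height in merges:
    print(f"merge {left} + {right} at distance {height}")
```

With these toy values the two control samples merge first and the two treated samples second, which is the pattern you would hope to see in a dendrogram over samples. Clustering the genes instead is the same procedure applied to the rows.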
This kind of clustering can give a lot of insight into interpreting the experiment that you've done, so it's a very common way of visualizing these types of omics data. And I guess the most straightforward way to try to figure out what you can learn from these omics data is by picking out the biggest changers. What are the one or two things that changed the most? Then let's study those further. One way to pick those out is a so-called volcano plot, and I've shown a representative volcano plot here, aptly named because it usually does look like an erupting volcano. What you do is you plot the fold change between two conditions on the x-axis. So how much did the expression of each transcript, in this case if we're thinking about mRNA sequencing or microarrays, change relative to control? And then on the y-axis, what is the statistical significance of that change? Things that have a large fold change and also a large statistical significance are the ones that you would want to select for further analysis, say by qPCR or knockdown experiments, overexpression experiments, etc. That's kind of the simplest, but a kind of naive, view of what you can do with these omics data. Of course, it's a screening tool that allows you to pick out the couple of things that are changing the most, kind of the red flags, from those experiments.
But there's so much more you can do with it. And it's all, to some extent, based on the idea that instead of looking at what's going on in single genes, let's look at things collectively, at what's going on with groups of genes, or so-called gene sets. One of the pioneering methods in this area is gene set enrichment analysis, which was proposed in a series of two publications as noted here. The idea is that you can use prior knowledge based on previous omics experiments, or definitions of what genes are involved in certain biological processes.
I'll talk a little bit more about so-called ontology analysis on the next slide. This prior knowledge can also come from known pathways, etc. And you can use this prior knowledge to see which lists of differentially expressed genes from your study overlap with these prior definitions of gene sets. The way that this works is that you have your omics-scale data and then you rank it by something. It could be just fold change between two conditions. Once you've ranked it by this criterion, you can see, as you walk down your list, how many genes from your list appear in these other, previously collected gene sets. As you find members that are the same, a so-called enrichment score, a running sum, gets incremented. And as you find those that are different, the enrichment score goes down. So when you find gene sets that have a very high enrichment score as one walks down your gene list, these are very likely to have some sort of significance in interpreting your dataset. So this is one way, instead of trying to find the one or two genes that are changing the most, to look at what's coordinately changing in the experiment, and let that point our fingers towards entire biological processes or pathways, etc., that are defining the entire response of the cells to how they were treated.
Another key aspect in this whole definition of gene sets was the idea of gene ontology. This was a very seminal effort back in 2000 to try to come up with a type of classification system where one could define gene sets in three different ways. The first is by biological process. An example here is DNA metabolism, and within that biological process you can have different children, like DNA degradation, packaging, replication, repair, recombination, etc. And then within each of these processes you can go down to a list of genes that are involved in that process, or you can have, again, different sub-processes.
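The two ideas above, a gene set defined by walking down an ontology and an enrichment score that rises and falls as you walk down a ranked list, can be sketched together. This is a simplified illustration and not the published gene set enrichment analysis procedure, which additionally weights each step and estimates significance by permutation. The ontology fragment, the gene identifiers (`G1` to `G8`) and the ranking are hypothetical, and hits and misses are scaled so that each kind sums to one over the whole list.

```python
# Hypothetical ontology fragment: parent term -> child terms.
children = {
    "DNA metabolism": ["DNA replication", "DNA repair"],
    "DNA replication": [],
    "DNA repair": [],
}

# Hypothetical direct annotations: term -> genes (placeholder identifiers).
annotations = {
    "DNA metabolism": set(),
    "DNA replication": {"G1", "G2"},
    "DNA repair": {"G3", "G4"},
}


def genes_under(term):
    """Collect genes annotated to a term or to any of its descendants."""
    genes = set(annotations.get(term, set()))
    for child in children.get(term, []):
        genes |= genes_under(child)
    return genes


def enrichment_score(ranked_genes, gene_set):
    """Unweighted running sum: step up on a hit, down on a miss, and
    report the largest deviation from zero seen along the walk."""
    hits = [gene in gene_set for gene in ranked_genes]
    n_hits = sum(hits)
    n_miss = len(ranked_genes) - n_hits
    if n_hits == 0 or n_miss == 0:
        return 0.0
    running, best = 0.0, 0.0
    for is_hit in hits:
        running += 1.0 / n_hits if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


# Hypothetical list, already ranked from largest to smallest fold change.
ranked = ["G1", "G2", "G3", "G5", "G4", "G6", "G7", "G8"]

gene_set = genes_under("DNA metabolism")
score = enrichment_score(ranked, gene_set)
print(sorted(gene_set), score)
```

Because the members of the set sit near the top of the ranked list, the running sum climbs early and the score is strongly positive. A set whose members were scattered evenly through the list would stay close to zero.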
Similar ontologies can be defined for molecular functions, such as being an enzyme, or being a nucleic acid binding protein, etc. You can just go down these lists and get to a list of genes which are involved with these molecular functions. Similarly, it has been defined in terms of cellular components, so sort of the structure of the cell. You start out with the cell itself. You can divide it into cytoplasm and nucleus. Certain genes are associated with the cytoplasm, certain genes with the nucleus, and so on and so forth. One really nice aspect of this is that it works very well across organisms. Because a lot of the features of cells are shared, and a lot of genes have homologs, you can really get a lot of power out of this kind of approach and this way of defining gene sets.
Another form of omics-level analysis is so-called transcription factor analysis. If you have a list of genes which have been differentially expressed in response to your treatment, a common hypothesis is that many of these genes that have changed expression together probably share some common transcription factors. And if we can figure out what those transcription factors are, that gives us a lot more insight into what's going on in the system. One common way that this is done is by using a lot of prior knowledge. And one very nice tool that I've highlighted here is a tool called ChEA: transcription factor regulation inferred from integration of genome-wide chromatin immunoprecipitation experiments. This is a tool that the Ma'ayan Lab has built. What they did was scour the literature and look for essentially all the interactions they could find that describe how a variety of transcription factors bind to the promoters of nearly all the genes in the human genome. Based on this, one can then overlap that with the sets of differentially expressed genes that you have, and look to see if any transcription factors are enriched in those sets of genes.
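One simple way to ask whether a transcription factor is "enriched" in a gene list is an overlap test. The sketch below uses a hypergeometric tail probability, which is a generic choice and an assumption here, since the lecture does not say which statistic ChEA itself uses. The transcription factor names, target sets, gene list and background size are all hypothetical.

```python
from math import comb


def hypergeom_pvalue(overlap, set_size, list_size, background):
    """P(X >= overlap) when list_size genes are drawn at random from
    `background` genes, set_size of which are targets of the factor."""
    upper = min(set_size, list_size)
    total = comb(background, list_size)
    tail = sum(
        comb(set_size, k) * comb(background - set_size, list_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Hypothetical prior knowledge: transcription factor -> known target genes.
tf_targets = {
    "TF_A": {"g1", "g2", "g3", "g4", "g5"},
    "TF_B": {"g6", "g7", "g20", "g21", "g22"},
}

# Hypothetical differentially expressed genes and background size.
de_genes = {"g1", "g2", "g3", "g4", "g6"}
background_size = 100

results = []
for tf, targets in tf_targets.items():
    overlap = len(targets & de_genes)
    p = hypergeom_pvalue(overlap, len(targets), len(de_genes), background_size)
    results.append((tf, overlap, p))

# Most significant overlap first.
results.sort(key=lambda row: row[2])
for tf, overlap, p in results:
    print(f"{tf}: overlap={overlap}, p={p:.3g}")
```

In a real analysis you would test many transcription factors at once, so the p-values would also need a multiple-testing correction, which this sketch leaves out.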
And of course, this isn't the only tool that's out there. There are a lot more, based on different sorts of methods, that allow one to do that. But the general idea remains quite powerful: if one has lists of differentially expressed genes, you can use prior knowledge of transcriptional regulation to figure out what transcription factors may be involved in mediating these changes.
In that same kind of vein, instead of looking at transcription factor-DNA regulation, one can take another step forward and ask: now if I know what transcription factors might be operating to cause these changes in gene expression, what other proteins might be involved in these processes? One common way of looking at this is to look at protein-protein interaction databases, which contain information on how not only transcription factors, but many other proteins, interact with other proteins in the cell. There are many experiments that have been done with different levels of fidelity and different techniques. One very common one is a screen or assay called a yeast two-hybrid assay. Many of these protein-protein interactions have been put into a variety of databases. And I'm showing, again, another tool from the Ma'ayan Lab that's been built to take into account many of these known protein-protein interactions. So, again, given a list of differentially expressed genes, you can use this tool that was developed by the Ma'ayan Lab called Genes2Networks. It takes this list of genes and looks at what the protein products of these genes are known to interact with, in terms of other proteins that might be expressed. And so, by looking at these connections, you can derive what's called a network model for how your differentially expressed genes are potentially interacting with each other through known protein-protein interactions.
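The idea of connecting a gene list through known interactions can be sketched as follows. This is not the Genes2Networks algorithm, whose details are not given in the lecture. It is a minimal illustration of one simple rule, assumed here for the example: keep direct interactions between seed proteins, plus any intermediate protein that interacts with at least two seeds. All protein names and interactions are hypothetical.

```python
from collections import defaultdict

# Hypothetical protein-protein interactions (undirected pairs).
interactions = [
    ("TF1", "KIN1"),
    ("KIN1", "TF2"),
    ("TF2", "ADP1"),
    ("ADP1", "TF3"),
    ("KIN2", "X1"),
    ("TF1", "X2"),
]

# Hypothetical seed list, e.g. protein products of differentially expressed genes.
seeds = {"TF1", "TF2", "TF3"}


def build_adjacency(pairs):
    """Map each protein to the set of its known interaction partners."""
    adjacency = defaultdict(set)
    for a, b in pairs:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return adjacency


def connect_seeds(adjacency, seeds):
    """Keep seed-seed edges, plus edges to intermediates that bridge
    two or more seeds."""
    edges = set()
    for seed in seeds:
        for partner in adjacency[seed]:
            bridges_seeds = len(adjacency[partner] & seeds) >= 2
            if partner in seeds or bridges_seeds:
                edges.add(frozenset((seed, partner)))
    return edges


adjacency = build_adjacency(interactions)
edges = connect_seeds(adjacency, seeds)
nodes = set().union(*edges) if edges else set()
intermediates = nodes - seeds

print("edges:", sorted(tuple(sorted(edge)) for edge in edges))
print("intermediates:", sorted(intermediates))
```

Here `KIN1` and `ADP1` come out as intermediates because each links two seeds, while `X2` is dropped because it touches only one. The intermediates are the new candidates that the gene list alone would not have given you.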
So this is a very powerful way to look beyond transcriptional regulation, or regulation of protein levels, and to look into how the function might be propagating in a signaling network, for example. Although that's very powerful, there can be caveats to this. The biggest one is that many of these protein-protein interaction datasets can be quite biased, meaning that the proteins that are well studied tend to have the most interactions, just because people have looked at them more. Also, many of the assays used to study protein-protein interactions kind of dictate that those interactions that are scored positive are very stable, meaning that the protein-protein complex holds for a very long time. And this can actually miss a lot of important interactions. For example, many of the protein-protein interactions involved in signaling pathways are very transient by their nature, because if they weren't, the signaling pathway wouldn't be able to turn off the signal and wouldn't be able to function properly. So although there are some caveats to it, you can get a lot of useful information out of applying this knowledge of protein-protein interactions to generate network models from lists of differentially expressed genes.
When we start to put those sorts of analyses together, we can actually put them together in a pipeline and really come up with a lot of new knowledge from simply looking at lists of differentially expressed genes. As an example here, you can use several tools that the Ma'ayan Lab has developed to start from a list of differentially expressed genes and do transcription factor analysis, to try to look at what transcription factors might have been driving those changes in gene expression. Based on those transcription factors, you can then use protein-protein interaction databases to take another step back and find the pathways that might be involved in mediating the activity of these transcription factors.
And lastly, once you have these hypotheses in the form of a network, you can look at what the proteins in that network are, and try to look at what sort of enzymatic activities might be enriched in that network. In particular, many times these changes are mediated by changes in kinase activities. So you can look at whether certain types of kinase-substrate relationships are highly enriched in the resulting network models that you derive. One way to do that is so-called Kinase Enrichment Analysis, again another tool from the Ma'ayan Lab that has been developed in order to do that. So by combining many of these tools with prior knowledge, in this case prior knowledge of where transcription factors bind in terms of promoters and genes, how transcription factors interact with other proteins that might be present in the cell, and how those other proteins might be substrates for known kinases, one can really start to tease out lots of information and propose very novel hypotheses, simply based on lists of differentially expressed genes.
And a piece of software that takes all these tools and puts them together in one place is a pretty recent one called Enrichr, again developed by the Ma'ayan Lab here at Mount Sinai. The idea is that, again, you just start with lists of differentially expressed genes. You can use this software, Enrichr, and it already has built into it many of the gene sets that are used by a variety of tools that people use. Gene Set Enrichment Analysis contains some of the gene sets, but there are many others out there which Enrichr has in it. And then you can use Enrichr to do all sorts of different analyses. You can do enrichment of different ontologies, or just look at different biological processes, and look at those results in the form of a bar graph, for example.
You can generate networks, as I was mentioning before, by looking at protein-protein interaction prior knowledge. You can visualize multivariate data by generating grids where each spot on the grid corresponds to a different enrichment score, for example. So this tool is just very flexible and powerful in being able to incorporate all these different types of prior knowledge in one spot. So that was just a very brief overview of the types of tools that are available to really gain a lot of knowledge and generate hypotheses based simply on lists of differentially expressed genes, which can be derived from either mRNA sequencing or microarray data, proteomics data, or the combination of them all.
So now I'd just like to show one example of how all of these sorts of techniques were applied to really come up with something new, and really learn something about a biological system. In this case, this is a study from the Ma'ayan Lab, done in collaboration here at Mount Sinai, where they were using all these tools that I described previously to study kidney fibrosis that is caused by HIV. They have a mouse model of this system. If you look at the kidneys of the wild-type mice, they look like this. But in the case of a model of HIV and how that causes kidney damage, this is what the kidneys looked like in this so-called Tg26 mouse strain. The first thing that was done in this study was that they simply looked at mRNA expression. So they did transcriptome experiments, in this case microarrays, because it was a bit of an older study. But in any case, they could then look at differences in expression in the wild-type mice versus the Tg26 mice, visualized here as a heat map and dendrogram. And they coupled this with promoter analysis. So they looked at the patterns of differential expression and asked, what transcription factors are likely to be involved with this pattern of differential expression?
And this was combined with data from protein-DNA arrays, which assess the ability of proteins like transcription factors to bind to certain sequences of DNA. After doing these analyses, they came up with this list of transcription factors, which wasn't special by itself. But, as I mentioned before, they took this list of transcription factors and ran it through their tool called Genes2Networks. Based on these transcription factors, they came up with a putative network model for how these transcription factors are connected, not only to one another, but also to other proteins present in the cell. Based on this network model, they then used Kinase Enrichment Analysis to look at the possible kinases that are involved in regulating the activities of these transcription factors, which gave rise to this pattern of differential expression. Upon doing this kinase enrichment analysis, they found two of the most common kinases that you would find for almost any system that you look at, MAP kinase 1 and MAP kinase 3, also known as ERK2 and ERK1. But the third one on the list here was a kinase called HIPK2, which was much less studied but highly significant according to their analysis. So the next step in their study was to do extensive validation experiments to look at what HIPK2 is doing in this kidney fibrosis process that is induced by HIV. Is it really a viable hypothesis? Does HIPK2 actually play a role in this kidney fibrosis process? To make a long story short, they found that it has quite a major role. What they were able to do was to take those mice and cross them with a HIPK2 knockout. And then they could look at all sorts of aspects of how that HIPK2 knockout interacts with the Tg26 background, which causes the HIV kidney fibrosis.
And what they found was that when HIPK2 was knocked out, the kidney function was improved greatly, both in terms of various physiological parameters and measurements of kidney function, and in terms of the histology, the tissue morphology and architecture. The wild-type kidneys looked like this. The diseased kidneys looked like this. The knockout kidneys didn't look too different from the wild type. But when you took the disease background and then knocked out HIPK2, it looked much more like the wild type. So this was a really nice example of how, just by looking at differential expression between two conditions, in this case a wild-type mouse versus a transgenic mouse which was a model for HIV-induced kidney fibrosis, you could take this and run it through a series of programs that utilize a large body of prior knowledge to identify a single kinase, in this case HIPK2, which seems to be a very viable drug target to help treat HIV-induced kidney fibrosis.
So that's all of the lecture slides that I had for this part of Week 7. Next, I'll be talking about how we can use the other types of experimental methods that I've described, so flow cytometry and live cell imaging, and then interpret those with a different type of modeling analysis.