Welcome back to the Peking University MOOC, Bioinformatics: Introduction and Methods. Let's continue with our lectures. In the previous unit, we learned about the concept of ontology, the hierarchical structure of the Gene Ontology and the KEGG pathway database, how they are associated with genes and gene products, and how the KOBAS software annotates genes and identifies statistically significant pathways. In this unit, I will use an example from my own lab to illustrate how we can apply these ideas and methods to study a real problem and discover interesting patterns.
As many of you know, addiction is a serious medical problem. There are many substances that can cause addiction, such as cocaine, heroin, tobacco, alcohol and so on. As the figure on the top right shows, different addictive substances vary in the level of physical harm and dependence they cause. However, despite the differences, addictions to different substances seem to share a common course. So a question that we and other researchers had wondered about was: is there a common molecular pathway underlying addiction? We felt that to answer this question, a bioinformatics approach might have a unique advantage. Most of this work was done by a talented former graduate student of mine, Chuan-Yun Li, who is now a faculty member at the Institute of Molecular Medicine at Peking University.
To answer this question, we first gathered genes that are related to addiction. Chuan-Yun read about 1,000 papers and collected all the genes that had been linked to addiction in the literature. There are two main types of experimental evidence. The first is evidence from genetic studies, such as linkage analysis and association studies. The second is evidence from molecular biology experiments, such as gene expression and proteomics studies. These two types of evidence each identified hundreds of genes related to addiction. However, only four genes were identified by both types of experimental approaches.
The overlap is surprisingly small. This kind of discrepancy between genetic and molecular biology findings had previously been reported by other groups for other complex diseases as well. After the initial confusion, we decided to look into this problem from different angles. Eventually we found that, even though the genetic and molecular biology approaches identified different genes, the genes tend to fall into the same pathways and to be connected in the protein interaction networks. They just tend to fall into different parts of the pathways and interaction networks. Based on this observation, we decided that in order to get a comprehensive picture of addiction, we needed to integrate gene sets from different types of experimental approaches. This resulted in a set of about 1,500 genes. Because there may be noise in this large gene set, we selected a subset of 396 genes that were supported by two or more pieces of evidence.
Now, given the set of 396 addiction genes, which pathways are involved? Which pathways are statistically significantly represented? You may remember that KOBAS is one of the tools for this type of question. So we ran KOBAS on the 396 addiction genes against a background of all human genes, and found that 18 pathways were statistically significantly enriched, including long-term potentiation, long-term depression, gap junction, neuroactive ligand-receptor interaction, and so on.
But we haven't yet answered the question: are there common pathways underlying addiction to different substances? Please pause a moment to think about how you might do this. Here is how we did it. When we collected the genes related to addiction, we collected not only the data but also the metadata. That means that for each gene linked to addiction, we also collected the experimental details, such as the type of experiment, the parameters, the species, the brain regions where available, and so on.
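As an aside on the enrichment step mentioned above, here is a minimal Python sketch of one common way to test whether a pathway is over-represented in a gene set: a one-sided hypergeometric test followed by Benjamini-Hochberg correction. This is an illustration under stated assumptions, not KOBAS's actual implementation. The function names, the toy gene identifiers and the two toy pathways are all hypothetical.

```python
from math import comb


def hypergeom_pvalue(N, K, n, k):
    """P(X >= k): chance of seeing at least k pathway genes when n genes
    are drawn from a background of N genes, K of which are in the pathway."""
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return tail / total


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def enriched_pathways(gene_set, background, pathways, alpha=0.05):
    """Return {pathway name: adjusted p-value} for pathways passing alpha."""
    background = set(background)
    gene_set = set(gene_set) & background
    names, pvals = [], []
    for name, members in pathways.items():
        members = set(members) & background
        k = len(gene_set & members)
        names.append(name)
        pvals.append(hypergeom_pvalue(len(background), len(members), len(gene_set), k))
    adjusted = benjamini_hochberg(pvals)
    return {name: q for name, q in zip(names, adjusted) if q <= alpha}


# Toy data (hypothetical): 100 background genes and two 10-gene pathways.
background = [f"G{i}" for i in range(100)]
pathways = {
    "pathway_A": [f"G{i}" for i in range(0, 10)],
    "pathway_B": [f"G{i}" for i in range(50, 60)],
}
# 8 of the 10 query genes fall in pathway_A, only 1 in pathway_B.
query = [f"G{i}" for i in range(8)] + ["G50", "G90"]

result = enriched_pathways(query, background, pathways)
print(sorted(result))
```

In this toy example only the pathway that contains most of the query genes passes the threshold. A real analysis would use all human genes as the background and the KEGG pathway annotations, as described in the lecture.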
In particular, for each gene we collected metadata on the addictive substance in the related experiments. Once we had put all these data and metadata into our relational database, we could easily write programs to analyze the data. We focused our analysis on the four substances that had the most data: cocaine, alcohol, opiates, and nicotine. After we ran KOBAS four times, on the genes for each of the four substances one by one, we identified five common pathways that were statistically significantly enriched for all four substances. Interestingly, two of the pathways, the GnRH signalling pathway and the gap junction pathway, had never been linked to addiction before.
But how could we have discovered two new pathways related to addiction without doing any experiments? The reason is that if you look closely at the genes in these two pathways, you will find that each of the related genes had been found in some lab, somewhere in the world, using some experimental approach. However, because each lab saw only one gene, they could not draw any conclusions at the pathway level. Only after you apply the kind of bioinformatics analysis that we talked about this week, on the kind of data and metadata represented in the computer with the hierarchical structure that we talked about this week, can you suddenly see these interesting patterns. So this is a good example of the power of bioinformatics in making biological discoveries.
In summary, to facilitate communication and computation, we need to store data in databases whenever possible, define an ontology for the data, and collect metadata together with the data. To discover higher-level patterns in a set of genes or gene products, we can identify the most significant pathways and functional categories by performing statistical analysis with tools such as KOBAS. I hope these lessons can be useful in your own future research. Thank you for your attention.
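As a closing illustration of the cross-substance comparison described in this unit, here is a minimal Python sketch. It first groups genes by the substance recorded in their metadata, then intersects the per-substance lists of enriched pathways. The record layout, the function names, and the gene and pathway labels are all hypothetical placeholders, and the per-substance enrichment results are assumed to come from a separate tool such as KOBAS. The values below are not results from the study.

```python
from collections import defaultdict


def genes_by_substance(records):
    """Group gene identifiers by the substance stored in their metadata.

    `records` is an iterable of (gene, substance) pairs, for example rows
    selected from a relational database of curated evidence.
    """
    groups = defaultdict(set)
    for gene, substance in records:
        groups[substance].add(gene)
    return dict(groups)


def common_pathways(enriched_by_substance):
    """Return the pathways that are enriched for every substance."""
    sets = [set(p) for p in enriched_by_substance.values()]
    return set.intersection(*sets) if sets else set()


# Hypothetical evidence records: (gene, substance).
records = [
    ("gene_1", "cocaine"), ("gene_2", "cocaine"),
    ("gene_2", "alcohol"), ("gene_3", "alcohol"),
    ("gene_1", "opiates"), ("gene_4", "nicotine"),
]
groups = genes_by_substance(records)

# Placeholder enrichment output, one list per substance (not real results).
enriched = {
    "cocaine": ["pathway_1", "pathway_2"],
    "alcohol": ["pathway_1", "pathway_3"],
    "opiates": ["pathway_1", "pathway_2", "pathway_3"],
    "nicotine": ["pathway_1"],
}
shared = common_pathways(enriched)
print(sorted(shared))
```

Keeping the substance as metadata next to each gene is what makes this kind of grouping a simple query rather than a second round of literature reading.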