In this lesson, we'll discuss the minfi package, which is the package for handling data from DNA methylation microarrays. So let me start by introducing DNA methylation. DNA methylation is a chemical modification of the C base that in humans only occurs in a CpG context. That means a C followed by a G in the human genome. There are 28 million such CpGs, and these Cs can either be methylated or unmethylated. In a single cell, the methylation state is binary, or it actually has three values because we have two copies of each chromosome. But think of it as binary. When we profile a collection of cells, though, not all the cells necessarily have the same methylation state at a given CpG. That means that the outcome of a methylation measurement is something we can think of as the methylation percentage, also called the beta value, which is a number between zero and one.
The purpose of studying DNA methylation is to understand how DNA methylation changes and is associated with, for example, a phenotype or disease status. A popular platform for studying DNA methylation is the 450K microarray, and it's popular because it's relatively comprehensive and it's cheap. By relatively comprehensive, I mean that we have 28 million CpGs in the human genome and the 450K measures just 480,000 of those.
Let's start off this session by loading the package. We also load GEOquery and start by downloading a data set that contains 450K data. When we use the minfi package, we are particularly interested in IDAT files, which are the raw scanner files from the Illumina platform. If these files are available, they're available as supplementary material. They're not available for all 450K submissions, only for some of them. But I found one here that's interesting, which is a study where they are trying to understand whether or not DNA methylation changes are associated with acute mania.
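
To make the beta value described earlier in this section concrete, here is a minimal base R sketch. It assumes ten cells with invented binary methylation states at a single CpG; the variable names and the numbers are illustrative and do not come from minfi or from the lecture's data. The coverage calculation only reuses the two figures quoted above.

```r
# Invented methylation states for one CpG across ten cells:
# 1 = methylated, 0 = unmethylated (treating each cell as binary).
cell_states <- c(1, 1, 0, 1, 0, 1, 1, 1, 0, 1)

# The beta value is the fraction of methylated cells in the pool,
# so it always lies between zero and one.
beta_value <- mean(cell_states)

# Rough coverage of the array, using the figures quoted in the lecture.
array_cpgs <- 480000
genome_cpgs <- 28e6
coverage <- array_cpgs / genome_cpgs

print(beta_value)
print(round(100 * coverage, 1))  # coverage as a percentage
```

With seven of the ten invented cells methylated, the beta value here is 0.7. A real array does not observe individual cells; it estimates this fraction from signal intensities.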
So they took a number of individuals who were hospitalized for acute mania, obtained serum from them, and profiled the serum on the 450K array. And then they also have a number of unaffected controls.
Well, let's start off by downloading the supplementary files. This is a big download, so it's going to take a little while. We've now downloaded the data set, and we see that there are actually two supplementary files. But what we are really interested in is this file called _RAW.tar, which is an archive containing the IDAT files. So first we are going to unpack the archive, and inside the directory that has been created, we have a list of IDAT files. We see that there are some files called _Grn, for green, and some files called _Red. And aside from the green and the red, they have the same name. For the 450K array, we get one IDAT file for each color channel, so there are going to be two files for each sample.
Now, minfi currently does not support reading in compressed IDAT files, so we're going to decompress the files. Oh, we see here that I have not cleaned up from when I ran the code a little while ago. There's a mixture of compressed and uncompressed files. So we don't have to decompress them, and we can just read them in using a convenient function from the minfi package called read.450k.exp, which reads in all the IDAT files in the directory.
We have now read the files into something called an rgSet, which we will look at in detail a little later. But it seems to be something that looks like an ExpressionSet containing the red and the green color channels. Now, the problem we have right now is that we have loaded the data, but we don't have any phenotype data associated with these samples. There's nothing in the phenotype data slot, which is not surprising, since we just read in the files.
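
The file layout described above, one green and one red IDAT file per sample with an otherwise identical name, can be checked with a few lines of base R. This is only a sketch under stated assumptions: the file names are hypothetical, and in practice the listing would come from `list.files()` on the unpacked directory. It does not read any IDAT data.

```r
# Hypothetical file listing; the names are invented for illustration.
files <- c(
  "sampleA_R01C01_Grn.idat", "sampleA_R01C01_Red.idat",
  "sampleB_R02C01_Grn.idat.gz", "sampleB_R02C01_Red.idat.gz"
)

# Suffix that distinguishes the two color channels, optionally gzipped.
pattern <- "_(Grn|Red)\\.idat(\\.gz)?$"

# The shared part of the name identifies the sample.
stems <- sub(pattern, "", files)

# The captured group identifies the color channel.
channels <- sub(paste0("^.*", pattern), "\\1", files)

# Every sample should have exactly the two channels.
has_both <- tapply(channels, stems, function(ch) setequal(ch, c("Grn", "Red")))

# Flag files that are still compressed.
is_compressed <- grepl("\\.gz$", files)

print(has_both)
print(table(is_compressed))
```

A check like this makes it easy to spot a sample with a missing channel, or a directory that mixes compressed and uncompressed files, before trying to read the arrays.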
And the sample names don't help us very much, because these codes here contain no information about which samples were hybridized under which conditions. To get the phenotype data, we will now download the original GEO processed data, because that data set has phenotype information associated with it. This is also going to take a while.
We have now downloaded and parsed the data file. That was a lot of work for getting a few columns of phenotype data. So we get the phenotype data from the original GEO matrix, and we're only really interested in four columns. I've done a little bit of homework and looked at the entire thing, and these are the four columns. They contain a little bit of information about the samples, and then we know the diagnosis and the sex of the different samples. So we're going to clean up this data frame a little bit. We're going to change the column names, and we're going to clean up the group and the sex columns to something that looks like this. That is what we are now going to merge into the empty phenotype data slot of the rgSet we created.
So first I continue the cleanup. I clean up the sample names from my rgSet, and I make sure that my phenotype data object has the right row names. Then I reorder things. So now I have the sample names from my rgSet, let me just take the head of this. And I have the head of my pD object, and these are in the same order. I've guaranteed that now. And now I'm ready to assign it, and I have a ready-to-roll rgSet. So that was a little bit of advanced usage of GEOquery.
We will now return to our methylation array. The first thing you want to do with most microarray data is normalize it. In minfi, we have a set of functions starting with preprocess that implement various methods for preprocessing the 450K microarray. In this case, I'm going to pick the method called preprocessQuantile, which I call directly on the rgSet.
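
The cleanup and reordering of the phenotype table described above comes down to matching row names against sample names. Here is a minimal base R sketch of that step, assuming invented sample names and an invented three-row phenotype table; `array_names`, `sample_ids` and the column values are hypothetical. Only the name `pD` follows the lecture.

```r
# Hypothetical sample names as they might come off the array object:
# an accession followed by array position codes.
array_names <- c("GSM01_111_R01C01", "GSM02_111_R02C01", "GSM03_111_R03C01")

# Keep only the accession part so it can be matched to the phenotype table.
sample_ids <- sub("_.*$", "", array_names)

# Invented phenotype table, deliberately in a different order.
pD <- data.frame(
  group = c("control", "mania", "mania"),
  sex = c("F", "M", "F"),
  row.names = c("GSM03", "GSM01", "GSM02"),
  stringsAsFactors = FALSE
)

# Both objects must describe the same samples before reordering.
stopifnot(setequal(sample_ids, rownames(pD)))

# Reorder the rows so they line up with the array's sample order.
pD <- pD[match(sample_ids, rownames(pD)), , drop = FALSE]

# Guarantee the order before assigning the table to the data object.
stopifnot(identical(rownames(pD), sample_ids))
print(pD)
```

The explicit `stopifnot()` checks are the important part. Assigning a phenotype table whose rows are in the wrong order would silently mislabel samples.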
And I get something back called a GenomicRatioSet, so I'm going to call it grSet. This function does a couple of things. It normalizes the data, but it also maps the data to the genome. This is the process of assigning each probe to the location in the genome where that particular CpG is located. So now we'll preprocess the array. And we get something back that is essentially a SummarizedExperiment. It's called a GenomicRatioSet. The genomic part means that we have mapped things to the genome, and the ratio part has to do with what kind of methylation measurements we are storing in the object. In this case, it's beta values. And we can see we have 485,000 measurements. We can get the locations of the CpGs by calling granges on it. And we get a GRanges detailing the C on the forward strand, which is the C in the CpG that may or may not be methylated.
So CpGs are depleted in the human genome, but they tend to cluster together into entities that we call CpG islands. For many reasons, researchers are interested in knowing, when they're looking at a CpG, whether or not it's a CpG inside an island or inside an area that's close to an island. These areas are called CpG shores. A little bit further away, there are CpG shelves. And if you're really far away from a CpG island, you are an open sea CpG. You get this information by using getIslandStatus, which returns a vector saying whether the CpG is in the open sea, in an island, and so on and so forth.
Finally, now that we have normalized the data, we can get beta values out of the grSet, and we're ready for analysis. There are two ways we can analyze this data. One is basically using the limma package to find differentially methylated positions, that is, single CpGs that are differentially methylated. But for both biological and statistical reasons, we might be particularly interested in clusters of CpGs that change in the same direction.
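
To illustrate the difference between single CpGs and clusters of CpGs that change in the same direction, here is a toy base R sketch. It is not the method used by limma or by any minfi function. The beta values, the 0.1 threshold and the minimum cluster size of two are all invented assumptions, and the CpGs are assumed to be listed in genomic order.

```r
# Invented beta values: six CpGs (rows, in genomic order) by four samples.
beta <- matrix(
  c(0.20, 0.22, 0.21, 0.23,
    0.30, 0.32, 0.50, 0.52,
    0.40, 0.38, 0.60, 0.62,
    0.35, 0.37, 0.55, 0.53,
    0.80, 0.82, 0.81, 0.79,
    0.70, 0.72, 0.50, 0.48),
  nrow = 6, byrow = TRUE,
  dimnames = list(paste0("cg", 1:6), paste0("s", 1:4))
)
group <- c("control", "control", "case", "case")

# Per-CpG difference in mean beta value between the two groups.
delta <- rowMeans(beta[, group == "case"]) - rowMeans(beta[, group == "control"])

# Direction of change, ignoring differences below an arbitrary threshold.
direction <- ifelse(abs(delta) > 0.1, sign(delta), 0)

# Runs of neighboring CpGs that change in the same direction.
runs <- rle(direction)
ends <- cumsum(runs$lengths)
starts <- ends - runs$lengths + 1
keep <- runs$values != 0 & runs$lengths >= 2

clusters <- data.frame(
  first = rownames(beta)[starts[keep]],
  last = rownames(beta)[ends[keep]],
  n_cpgs = runs$lengths[keep],
  stringsAsFactors = FALSE
)
print(clusters)
```

In this invented example, cg2 through cg4 form one cluster, while cg6 changes on its own and is only visible as a single position. A real analysis would also account for the distance between CpGs and assess statistical significance, which this sketch does not attempt.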
In the minfi package, we have a function called bumphunter that interfaces to the bumphunter package, and that allows us to discover such clusters. There are other methods out there for doing that particular step. So that concludes what I have to say about minfi. There's a lot more to read about in the vignette for the package.