In this session we'll discuss the oligo package. This is a package for pre-processing and handling Affymetrix and NimbleGen microarrays, specifically gene expression microarrays and SNP microarrays. Affymetrix chips are so-called single-color microarrays. They are widely used, and an early success of the Bioconductor project was to provide a very good method for pre-processing and analyzing gene expression arrays from Affymetrix. This package is a continuation, or a second version, of an earlier Bioconductor package called affy. The affy package was specifically focused on gene expression microarrays from Affymetrix; later on, the authors realized that they could handle both gene expression and SNP chips, and do so for both Affymetrix and NimbleGen. NimbleGen is another company that makes microarrays, and it was later purchased by Roche. Shall we say that in some applications, in some areas of genomics, these arrays have been used a lot, but overall they're not very commonly used.
Let's start off by loading the oligo package, and we're also going to load the GEOquery package, because we're going to get some data from GEO in order to normalize it. We're going to look at a specific GEO accession, and we want to get the so-called CEL files. For Affymetrix microarrays, the raw data is stored in a binary form, typically a binary format called CEL. These are not the standard files we get from GEO; they are always submitted as supplementary files. So the way we get these CEL files is that we know the accession number of the experiment, and then we get the supplementary files. Let's run this right here. It's going to take a little while because there's a fair amount of data.
In the meantime I can talk a little bit about Affymetrix expression arrays, or Affymetrix arrays in general. Affymetrix as a company has a technology where they can make cheap, very high quality, very short oligos.
So, on an Affymetrix array the probes are typically on the order of 25 bases long. That's not very long, and it means that the probes are not very specific. To compensate for that, on most Affymetrix arrays, if you are measuring one specific RNA species, or you're measuring a SNP, you do this using multiple probes that all measure the same target. These multiple probes are grouped into something called probe sets; a probe set is a group of probes that all measure the same target. For technical reasons it's quite expensive to design an Affymetrix chip, but once it's designed, it's cheap to mass produce. So Affymetrix tends to have a few designs that they keep around for a long time, and they don't do very many custom designs.
Now let's return to the supplementary files that we have downloaded. We can see that it has downloaded a file list and then this TAR archive, which is a little bit like a ZIP archive; this is where the CEL files are going to be. So I'm going to expand the archive. I'm going to create a directory called CEL inside this GSE directory that the file is already in. If we look inside that, we have a list of files, and we can see that the file names are actually very informative: they have a GEO sample ID, then an underscore, and then what we're actually interested in, which is the variable of interest. This experiment was comparing samples from some control patients to patients with sleep apnea. So we have the control samples, one through eight, and then we have the treatment samples, labeled OSA, from one to ten.
Okay, so we obtain a list of our CEL files with the full name. All we have here is just a list of these file names with the full path, and then, as we often have in Bioconductor, we have convenience functions that read this data into a data container. Let's execute this. What I mean by this is that there are actually other functions that read each file separately.
These low-level parsing functions typically return very raw types of data, and then inside specific application areas data containers have been developed that hold these things. So, let's see what our raw data is. Okay, it prints like something that looks very much like an ExpressionSet, but it's really something called a GeneFeatureSet. We can see that it has a ton of different features, or different probes, more than 1 million probes, and we can see that the name of the array is hugene.1.0.st. This stands for Human Gene version 1. This is a relatively new type of Affymetrix microarray that is different from the so-called classic Affymetrix microarrays. For the biologists, I can say that these are gene expression microarrays that are based on random priming instead of oligo-dT priming, and they have a lot of features for each gene. The features, the probes that you are using to measure your RNA transcript, are spaced along the entire length of the transcript.
So let's look a little bit at this class, this GeneFeatureSet. We can see that there is some stuff here that looks very much like an ExpressionSet or an eSet, and then we have some additional things, like manufacturer and intensity file, that are new. This is really a way of representing the link between probes and genes, since we have many probes that measure a single gene, and all of this is taken care of in this feature set.
So let's look a little bit at the raw data. We access that using the exprs accessor; that's not a given, but it's true in this case. We can see that we get integers that are pretty large, here between roughly 200 and 10,000, and that indicates that this expression data is raw intensity measurements from the scanner. A microarray scanner is typically a 16-bit scanner, which means that when you scan a probe you get a number between zero and roughly two to the sixteenth, which is 65,536.
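The scanner arithmetic above can be checked with a few lines of base R. This is only a sketch: the intensity values are made up for illustration, chosen to sit in the range mentioned in the session.

```r
# A 16-bit scanner stores each intensity in 16 bits.
n_levels  <- 2^16           # number of distinct values a 16-bit number can take
max_value <- n_levels - 1   # largest value that can actually be recorded

# Hypothetical raw intensities (not taken from the real data set).
raw <- c(200, 1000, 10000, max_value)

# On the log2 scale, raw values of 1 or more land between 0 and 16.
log_raw <- log2(raw)
print(round(log_raw, 2))
```

A probe that "maxes out" is one whose raw value sits at, or right next to, `max_value`.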
We can confirm that by looking at the highest value inside the expression data. That's exactly what I want. There's actually a probe that basically maxes out. Research has shown that this is not a very good scale to work on with microarray data. Usually we want to log transform the data. When we log transform it using a log with base 2, we get a number basically between zero and 16.
Now the first thing we're going to do is clean up the pheno data a little bit, just for good order's sake. So we get the file names from the raw data; that's kind of the information we have. We discussed these earlier, and I'm going to store them inside the pData. For my sample names, I don't want this .CEL.gz, and I don't want this GSM identifier. I just want Control1, Control2, and so on and so forth. That's unique enough for me. So I'm going to do some cleanup using regular expressions on the sampleNames. We can see that after all of this I get some useful sample names, and I'm going to use them as the sampleNames of the dataset. Finally, I'm going to create a group variable, which is going to tell me the different experimental groups. Let's see what came out of that. I have a pData here with an index, which is not really that useful; I have a filename, I have the group, and then I have the sampleNames out to the left. So now we have cleaned up the pheno data a little bit and we are ready to do some stuff.
Let's start off by looking at the intensities of this raw data. This is an attempt at describing the need for normalization of gene expression microarray data. Let's do a box plot; the class has a well-defined boxplot method, and we get this little box plot here.
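Going back to the sample-name cleanup described a moment ago, here is a minimal base R sketch of that kind of regular-expression work. The file names are hypothetical (the GSM numbers are invented); only the overall pattern, a GSM identifier, an underscore, the sample label, and a .CEL.gz extension, follows what was described in the session.

```r
# Hypothetical file names; the GSM numbers are invented for illustration.
filenames <- c("GSM0000001_Control1.CEL.gz",
               "GSM0000002_Control2.CEL.gz",
               "GSM0000003_OSA1.CEL.gz",
               "GSM0000004_OSA2.CEL.gz")

# Drop everything up to and including the first underscore (the GSM identifier).
sample_names <- sub("^[^_]*_", "", filenames)
# Drop the .CEL.gz extension.
sample_names <- sub("\\.CEL\\.gz$", "", sample_names)

# Derive the experimental group from the cleaned-up name.
group <- factor(ifelse(grepl("^OSA", sample_names), "OSA", "Control"))

# A small table shaped like the pheno data: sample names to the left,
# then the file name and the group.
pheno <- data.frame(filename = filenames, group = group,
                    row.names = sample_names)
print(pheno)
```

With a real data object, the cleaned names and the group would presumably be assigned back through the `sampleNames()` and `pData()` accessors rather than kept in a separate data frame.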
Each box is a different sample, and what we are showing is a summary of the distribution of intensities in each of the arrays. We can see they are very, very different. The y-axis in this box plot is on the log scale, so a difference of 1 or 2 is quite a massive difference. On this plot we can see that the different arrays have different means and different spreads. We also see that there are three samples that look a little bit different, Control5 to Control7. These three samples seem to have very low intensity measurements compared to the rest of the arrays, and they have very low variability as well. One hypothesis is that nothing was really hybridized for these samples. Really understanding, assessing, or deciding that is going to require a lot more exploratory data analysis for this particular experiment.
Now when you have a dataset like this, most often you want to start off by normalizing it, and a very popular method for gene expression microarrays from Affymetrix is the RMA method, which I highly recommend. It basically always does pretty well. Sometimes there is a method that can outperform it a little bit on a specific data set, but RMA always does well, so it's my method of choice. How do I run it? I just run rma on the data. Really simple; it takes a little while. That's basically because these arrays are quite massive: they have a million probes on them, and these are some of the bigger microarrays that have been built. RMA consists of basically three steps, as we can see here: background correction, quantile normalization, and then it takes all the different probes that make up the same gene and outputs a single number for them. So, let's look at normData. Now we're back in something we know and love.
It's the ExpressionSet, and we see we have gone from 1 million features to 33,000 features. That's quite a massive reduction, and it happens because all the probes that measure the same thing have been summarized into a single number at the gene level. Let's have a little look at the featureNames of the ExpressionSet. These are Affymetrix identifiers. Affymetrix identifiers used to have underscores in them, but for the new arrays it's basically a single number. Don't think of these as integers; they are identifiers that tell you something about what was measured by this particular probe set. So, in order to annotate the data, we have to go from these numbers to gene names, as we have learned to do in other sessions.
Let's look at the normalized data. We can do a box plot again, and now everything looks a lot nicer, right? We have roughly the same spread, roughly the same mean, and it's ready for analysis. There are a couple of notes here. The first note is that these box plots, or these distributions, are not exactly the same, and that's because we ran quantile normalization, which is a method for normalizing data, at the probe level, and then later on we summarized from the probe level to the gene level. If we had run quantile normalization on the gene-level expression measurements, these distributions would have been identical. The second thing to note is that we now have log-transformed values. We had that before on the box plot, but we can confirm that we do have log2-transformed microarray measurements; it's always log2. And it does look like the samples Control5 through Control7 are not that different from the rest of the arrays. So now we've run RMA, you see how easy it is, and this is a core thing we can do with oligo. There are similar easy ways of dealing with SNP arrays.
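To make the first note concrete, here is a small base R sketch of quantile normalization at the probe level followed by summarization to the gene level. It is not the code that oligo runs: the data is simulated, the function `quantile_normalize` and the probe-set layout are made up for illustration, and the summary step uses a plain median where RMA itself uses median polish.

```r
# Quantile normalization of the columns of a matrix (one column per array).
quantile_normalize <- function(x) {
  # Rank of each value within its own array.
  ranks <- apply(x, 2, rank, ties.method = "first")
  # Reference distribution: mean of the sorted columns, rank by rank.
  reference <- rowMeans(apply(x, 2, sort))
  # Replace each value by the reference value with the same rank.
  normalized <- apply(ranks, 2, function(r) reference[r])
  dimnames(normalized) <- dimnames(x)
  normalized
}

set.seed(1)
# Simulated probe-level log2 intensities: 12 probes on 3 arrays,
# with each array shifted and scaled differently.
probes <- cbind(A = rnorm(12, mean = 8, sd = 1),
                B = rnorm(12, mean = 9, sd = 2),
                C = rnorm(12, mean = 7, sd = 0.5))
# Hypothetical probe sets: 4 genes with 3 probes each.
probe_set <- rep(paste0("gene", 1:4), each = 3)

norm_probes <- quantile_normalize(probes)

# At the probe level, every array now has the same set of sorted values.
sorted <- apply(norm_probes, 2, sort)
print(all.equal(sorted[, "A"], sorted[, "B"]))

# Summarize each probe set to one number per array (median as a stand-in).
gene_level <- apply(norm_probes, 2,
                    function(col) tapply(col, probe_set, median))
print(gene_level)
```

The probe-level columns share one distribution by construction, while the gene-level columns are summaries of different subsets of those values, so their box plots will generally be close but not identical.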