In this session we'll discuss the Rsamtools package. Rsamtools is an interface to a library called samtools, which has since been replaced by something called htslib, the high-throughput sequencing library. It is a set of tools for dealing with files in SAM and BAM format. SAM is a text file format for representing aligned reads. BAM is a binary version of SAM, and because BAM files are faster and more convenient, for all practical purposes everybody should work exclusively with BAM files. So that's what I'm going to talk about in the rest of this session; there's not really a big difference compared with SAM files. Let's start off by getting an example file before we talk a little bit about the content of BAM files. So let's load the library and get the path to an example BAM file. You'll see here that the BAM path is just a character vector giving us the path to a BAM file. The first thing you do when you read in a BAM file is instantiate a BamFile instance. That is essentially just a wrapper around a pointer. It is a very lightweight class: there is no data in it, it just contains a pointer out to the BAM file. But you can see there's interesting information such as an index, which is really the magic that makes BAM files fast, and a couple of other settings. At this point, we have not retrieved any data from the BAM file. That's nice in many ways, because BAM files can be really big and often you want to think a little bit before reading them entirely into memory. Already at this stage, even though we haven't read anything from the file, there are a couple of high-level things we can query. For example, we can ask which sequences or chromosomes the reads were aligned to. The BamFile class supports seqinfo and other functions such as seqlevels and seqlengths. So we can see that the reads in this example BAM file have been aligned to two sequences that are roughly 1,500 bases long.
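The setup described above can be sketched in R roughly as follows. This is a minimal sketch, assuming the example file is ex1.bam, which ships with Rsamtools; the variable names bamPath and bamFile are chosen here for illustration.

```r
library(Rsamtools)

# Path to an example BAM file bundled with Rsamtools
# (assumption: this is the example file used in the session)
bamPath <- system.file("extdata", "ex1.bam", package = "Rsamtools")
bamPath

# A lightweight reference to the file; no alignments are read at this point
bamFile <- BamFile(bamPath)
bamFile

# High-level queries that work before any reads are loaded
seqinfo(bamFile)     # sequences the reads were aligned to, with lengths
seqlevels(bamFile)   # just the sequence names
seqlengths(bamFile)  # just the sequence lengths
```

Printing bamFile shows the path, the index and settings such as the yield size, without touching the alignment records.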
BAM files can be really big, and often you want to read a BAM file in chunks. But before we talk about how to read BAM files in chunks, we will read in the entire BAM file, since this is just an example file, and look a little bit at the alignments. You read a BAM file using a function called scanBam. This returns a list of length one. That seems a little weird: I just asked to read the entire file. The reason is that scanBam supports reading in different genomic regions. For example, if you want to read some small regions or some bins of the different chromosomes, you can read them in like that, and the outer container of the list then has one element per region. The reason we have just one element in this list, and you will also see that it doesn't have a name, is that it holds the entire file. So let's get rid of the outer list by subsetting to the first element, and now we have all the reads. We can see that this is again a list, with 13 different components. Some of them make immediate sense, like strand and perhaps pos, for position. Let's have a look at the first element of each of these components. If we scroll up here, we can see that qname is the name of the read; flag we're going to return to; rname is the sequence the read was aligned to; strand speaks for itself; pos is the position of the leftmost part of the alignment; and qwidth is the length of the read. Then there is a mapping [INAUDIBLE], something known as a CIGAR, which we're going to return to, and some other things. And then towards the end we get the actual read sequence and the quality values from the read. The BAM format is quite complicated to talk about, and what makes it complicated is that it supports a wide range of alignments.
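Reading the whole example file and inspecting the result, as walked through above, might look like the sketch below. It assumes the bundled ex1.bam file; bamPath, bamFile and aln are illustrative names.

```r
library(Rsamtools)

# Assumption: the bundled ex1.bam stands in for the example file
bamPath <- system.file("extdata", "ex1.bam", package = "Rsamtools")
bamFile <- BamFile(bamPath)

# Read the entire file; the result is a list with one unnamed element
# because no genomic ranges were requested
aln <- scanBam(bamFile)
length(aln)
names(aln)

# Drop the outer container to get at the per-read components
aln <- aln[[1]]
names(aln)

# First entry of every component (qname, flag, rname, strand, pos, ...)
lapply(aln, function(component) component[1])
```

Reading everything at once is fine for a small example file; for real data the chunked and range-based approaches are usually the better choice.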
First of all, each read can be aligned to multiple places in the genome, that is, it can be associated with multiple alignments. It's also possible for the BAM file to contain unaligned reads. And finally, the reads can be spliced, as in RNA sequencing, whereas in classic DNA sequencing we would think of them as indels, insertions and deletions, where chunks of the read are mapped to different parts of the genome. This flexibility is really what makes the BAM format a little bit unwieldy to deal with, if you really want to understand the full horror of the format. In the example here we are only going to see reads that are aligned directly to the genome, and we're going to leave out these more complicated representations. If you are going to work with spliced alignments, you really want to look at the GenomicAlignments package, which contains classes for representing spliced reads.
Okay, back to BAM files. Let's talk about reading in BAM files in small chunks. You can do this in two ways. You can either read in only parts of the BAM file, in the sense that you only read in certain fields, or only reads that have certain properties, or you can read by count, saying I'm going to read the first ten reads or the first 50 reads. We're going to start with the last one. We do that by setting a yield size on the BamFile. When we print it, we can now see that the yield size is set, and it's equal to one. This means that every time we call scanBam on this BamFile we're going to get one read back. For this to work we have to open the BamFile. So we open the BamFile, and now every time I call scanBam on it, I'm going to get a read back. Let's illustrate this. We call scanBam on the BamFile and take the first element of the resulting list.
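A rough sketch of reading by yield size, under the assumption that the bundled ex1.bam is the example file (bamFile, firstRead and secondRead are illustrative names):

```r
library(Rsamtools)

# Assumption: the bundled ex1.bam stands in for the example file
bamPath <- system.file("extdata", "ex1.bam", package = "Rsamtools")
bamFile <- BamFile(bamPath)

# Ask for one record per call to scanBam
yieldSize(bamFile) <- 1
bamFile

# The file has to be open so that successive calls continue
# from where the previous call stopped
open(bamFile)
firstRead <- scanBam(bamFile)[[1]]$seq
secondRead <- scanBam(bamFile)[[1]]$seq
firstRead
secondRead

# Clean up: close the file and remove the yield size again
close(bamFile)
yieldSize(bamFile) <- NA
```

With a larger yield size the same pattern processes a big file chunk by chunk, stopping once scanBam returns no more records.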
And we're going to look at the sequence. We call it once and get a read back; I call it again and we get a different read. This way I can continue until I get nothing back from scanBam. Let's clean up a little bit: we close the file and set the yield size back to NA.
Another way of reading in parts of the BAM file is to read in pre-specified regions of the genome, or pre-specified components of the file. Let's say we are interested in reading the reads in a specific genomic region, or a set of genomic regions, that we have represented as a GRanges object. What we do is set up a ScanBamParam. ScanBamParam is a class that encodes our query to the BAM file. This is very similar to what we have seen with, for example, bsapply for applying things [INAUDIBLE] genomes, and other uses. So here I'm going to start off by setting up a ScanBamParam. I give it a which argument, which says what genomic regions we're going to read in, and the what argument I'm just going to set equal to scanBamWhat(). This is where we set which fields we're going to read in. You can see that scanBamWhat() just returns 13 names, or rather 15 names, and they tell which pieces of the BAM file we want to read. This is actually useful, because sometimes you are only interested in the position and don't want to retrieve the actual sequence or the actual base qualities of the read, because they're rather big objects and take up a lot of space. Okay, so having specified the ScanBamParam, we call scanBam with the BamFile and set param equal to the object we defined before. And now we see that we get two components back in the returned list, corresponding to the two different ranges we were querying the BAM file with. In one of them, let's say the first one, we can see here that all the positions are going to be between 115 [INAUDIBLE]. Let's look at it.
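The range-based query can be sketched as below. The two ranges on seq2 are illustrative values picked for this sketch, not necessarily the ones used in the session, and gr, params and aln are illustrative names; the bundled ex1.bam is assumed as the example file.

```r
library(Rsamtools)
library(GenomicRanges)

# Assumption: the bundled ex1.bam (with its index) stands in for the example file
bamPath <- system.file("extdata", "ex1.bam", package = "Rsamtools")
bamFile <- BamFile(bamPath)

# Two illustrative regions on seq2 (hypothetical coordinates)
gr <- GRanges(seqnames = "seq2",
              ranges = IRanges(start = c(100, 1000), end = c(1500, 2000)))

# 'which' selects the regions, 'what' selects the fields to return
scanBamWhat()
params <- ScanBamParam(which = gr, what = scanBamWhat())

# One list element per requested range
aln <- scanBam(bamFile, param = params)
names(aln)

# 'pos' is the leftmost position of each alignment, so reads that
# overlap the start of a range can begin before that start
head(aln[[1]]$pos)
```

To save memory, what can be restricted to a few fields, for example what = c("pos", "strand"), when the sequences and qualities are not needed.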
I have to check the head. That [INAUDIBLE] to start out: the positions start from less than 100. That's because the reads are long and these positions are the leftmost part of the read, so the reads still overlap the GRanges we gave.
Now, there's also functionality in Rsamtools for reading multiple BAM files, because you almost never have only a single file you want to deal with. This is done through a class called BamViews, which sets up a views-like functionality. It's really a collection of files together with an optional collection of ranges giving us which parts of the genome we're interested in. Let's set up an example of a BamViews. This is going to be a BamViews with only a single file in it, so it's going to appear a little bit weird, because it's very similar to what we had before. We have no ranges in it and we have one sample. We can read in the data by calling scanBam on it, and again we get a list, but the first level of the list is the file name. Underneath it we have the situation from before, where the element didn't have a name, and then we come to the actual content of the BAM file. Now we can set ranges on a BamViews. That's done with bamRanges: we set bamRanges of the BamViews equal to the GRanges we defined before. And now, when we call scanBam on this BamViews, we get something very similar to what we had before. The outer level is the file name, and when I select the file name, I get what I had before, which is the two different ranges, and then the reads below that.
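The BamViews steps above might be sketched like this, again assuming the bundled ex1.bam and the same illustrative (hypothetical) ranges on seq2; bamView, gr and aln are names chosen here.

```r
library(Rsamtools)
library(GenomicRanges)

# Assumption: the bundled ex1.bam stands in for the example file
bamPath <- system.file("extdata", "ex1.bam", package = "Rsamtools")

# A BamViews holding a single file and, for now, no ranges
bamView <- BamViews(bamPath)
bamView

# The outer level of the result is named after the file
aln <- scanBam(bamView)
names(aln)

# Attach illustrative ranges (hypothetical coordinates) and read again
gr <- GRanges(seqnames = "seq2",
              ranges = IRanges(start = c(100, 1000), end = c(1500, 2000)))
bamRanges(bamView) <- gr
aln <- scanBam(bamView)

names(aln)        # still one element per file
names(aln[[1]])   # within the file, one element per range
```

With several files, BamViews takes a character vector of paths, and the same ranges are then applied to every file.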
Sometimes we just want a quick summary of what is in the file, and for that there's a command called quickBamFlagSummary that very quickly gives you some details of what the file contains, in terms of how often you have a read that maps to multiple locations or is split. This output also takes a little time to digest, but if you are working, or envision yourself working, a lot with BAM files, this is stuff that you are going to have to sit down and come to terms with.
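As a small sketch, the summary can be produced like this, assuming the bundled ex1.bam as the example file (bamFile is an illustrative name):

```r
library(Rsamtools)

# Assumption: the bundled ex1.bam stands in for the example file
bamPath <- system.file("extdata", "ex1.bam", package = "Rsamtools")
bamFile <- BamFile(bamPath)

# Prints a table of counts for groups of records, based on their flag bits
quickBamFlagSummary(bamFile)
```

The printed table is organised around the bits of the flag field, so it is easier to read alongside the description of flags in the SAM specification.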