In this session we will discuss the genomic features packets. The genomic features packets provides support for something called transcript D B or transcript database objects. The idea is similar to the relationship between the B S genome packets and the individual genome packages. So let's load genomic features and let's load a transcript database that is available for download for Bioconductor. The transcript D B object you can see is from homosapiens. It comes from UCSC, it's based on the hg19 genome, and it corresponds to the known gene table from UCSC. Inside this packet there's a single object with the same name of the packets, and that's a long thing to write. So we're basically going to to rename the long transcript database object to lowercase txbd. That's really just a convenience. So what is a txdb object? Let's try to print it. Basically it contains information about genes, transcripts, exons, and coding sequences. This is a complicated structure. In human, we know that exons can be part of multiple different transcripts. Different transcripts can have different coding sequence. And one gene can often have many different transcripts. So this data packets contains all of this information and links them together. At first it's a little bit hard to use, I must say. One of the problems is that the terminology in the packets is not always crystal clear. [COUGH] Sometimes the word transcript is being used as pre-MRA. That means a full length transcribe part of the genome, but before introns have been cut out by splicing. And sometimes transcript really means the transcript after splicing has acted on, and the introns have been cut out. But let's try to examine it a little bit. And we'll do that by examining a very specific low sigh that is basically the first thing that prints out. So I'm going to pick a G ranges here. That is selects a very small region of the chromosome one. And then I've got to look at all the exons and the transcripts and the genes that are on this particular loci. So we can call genes on the txdb object. And now we get a genomic range, a g range. Giving us the start and the end of the gene. So here it's actually a little unclear what exactly we mean by start and the end of a gene. That contains multiple transcripts because each of the transcript can start in multiple places. But I think of this as outer coordinates for the gene. You can see that there's a gene ID column that is an internal ID from the database that genomic features provides. And that's also going to be a little bit confusing because it's an integer. And there's also going to be a transcript ID and an exon ID. That's also integers. So you gotta remember what the integer is. Is it a gene ID, is it an exon ID? Is it a transcript ID? Okay, let's look at which genes actually overlap this little genomic range I have. We have a single gene on the positive strand. And it actually turns out that if we do the same thing and we ignore strand There's a different gene that overlaps the same loci, but on the reverse strand. We're going to ignore that for now. Perhaps I was a little bit fast here. The gene ID we have here is an integer, but it's not just a continuous naming of the different things. That's what it looks like a little bit from the print out. I misremembered here. The gene ID here is really something known as an entrez ID. You can see up here in the output from the TxDb object that it says type of gene ID and it says entrez gene ID. So when you create a transcript database packets or object, you select some IDs to put in it. And in this case here they're entrez gene ID. I've actually double checked this by putting it into Google and searching for this entrez gene ID, and it has specifically list location here. Okay, so we have a single gene here on the forward strand. Let's call transcripts on the object. And let's look at which transcripts overlap this particular genomic range. So now we see we have three transcripts. We see that in this return object here from transcript we have no splicing information. We just have for each transcript has a separate name. This looks like a UCSC name. I recognize that it's not clear from the output here. And they teach three different transcripts, but the transcripts have the same start and the end. Because all we see here is the start and the end of the pre-MRA. These three different transcripts are different because they're different exons, but you can't see that in the output. Despite the fact that we called transcripts on TxDb. This becomes a little bit easier to see, perhaps, if we look at what exons do we have in here. So we have six exons. And these six exons are combined in different ways to form these three different transcripts. So do I how figure out how the exons are combined together to form transcripts? Well in the same way as this. Exons and transcripts and genes. There's a set of commands on a transcript db called exonsBy and transcriptsBy. And the command we're after took me awhile to fully realize this. Is subset we want to take subsetByOverlaps and we want to say exonsBy. We take it from the txdb object. And the by should be by transcript. Not transcript because that would be too easy. We are just going to call it tx. So now once we get the output. Oh, we didn't have a second [INAUDIBLE]. Once we get the output now we can see we're a little bit of play. We can see we have three elements corresponding to the three different transcripts. And the three different transcripts each have three different exons. They all share the first exon, but they're spliced together differently. The second exon in each transcript is different between the three transcripts. We can see that the last exon, there's both an exon five and an exon six. Exon five and exon six overlaps. They have the same end coordinate, but they differ in their three prime splice site. The first exon is shared between all three different transcripts. You can see that the names of this GRangesList are $1, $2, and $3. And that turns out to be the [COUGH], the transcript ID. So if you scroll up a little bit, and we look at the transcript. Here we have the three transcript. And you can see there's something called tx_id. There's an integer 1, 2, and 3. And this is what I was referring to before when I said you have these integer names, and it's hard to remember whether or not it's a transcript name, an exon name, or G name. Okay, so this is really in some sense, this is a full gene structure for this particular gene. We have the three different transcript. We have the exons. We have how they're being put together. In the same way as we have exons, transcripts, and genes, we also have CVSs, or coding sequences. Now coding sequences are also hard to deal with. It turns out that if you do something like RNA sequence. Or if you do old style of EST sequencing, you can get the RNA structure of the different genes, or you can figure out where the introns are. Knowing what the coding sequences of a gene is we know. So first of all not all genes have a coding sequence. Or not all transcripts have a coding sequence. Having a coding sequence implies by the name that it gets translated into to protein and that's not the case for all transcripts. Furthermore, a given transcript may have multiple open reading frames. And in some databases and some organisms what happens is you take a transcript, you look for all open reading frames inside the transcript, and you just pick the longest open reading frame as a potential transcript. As the potential open reading frame or coding sequence for that particular gene. That turns out to not always give you the correct result. You can hear I'm passionate about this. I worked on this back in 2008 and 2009 in Josafa. [COUGH] So lets look at what happens if you say subsetByOverlaps, cds, txdp. So now we get something really weird out as well. Look we have three different small structures. These are not necessarily open reading frames because if it was an open reading frame we should have some splicing information here. These are actually turns out, I looked a little bit at it. These are the coding sequence in the different splice transcript intersected with the exons. This is a weird thing. I find it weird that it's called cds in the packets. But we are really interested in is cdsBy. And then by equal to tx. As before very similar to. Now we get the actual coding sequence for the three different transcripts. It turns out that in this version of the database, even though there are three transcripts, only one of the three transcripts has a coding sequence. And that is transcript number two. The other two transcripts have no coding sequence annotated. I went to UCSC recently and figured out that in a newer version of UCSC of the database all three transcripts have a cds. So if we recreated this database right now, I would get a GRange's list out with length three instead of length one. So this is really the exons of the second transcript intersected with the coding sequence of that particular transcript. For comparison, lets do the exonsBy. And let's select a transcript number 2. Note that I used for for this. Yeah. So you can see that. The middle exon is fully part of the coding sequence. In the first exon, the coding sequence starts a little bit into the exon. And in the last exon, the coding sequence ends a little bit before the end of the exon. That's very common, that shows us what the three prime and five prime untranslated region of the gene is. So I hope that this here shows that there's a lot of really useful information in here. We can manipulate it, but that the naming commences at least to me, seems to be unsatisfactory with respect to what is cds, what is a transcript. And so on and so forth. We can see all of this in a different way. We can call a function called transcript length. That just gives us a length of a different transcript. And now these are transcript lengths not in terms of their pre-MRA, but in terms of the spliced RNA. And what helps us a little bit is that there's an argument to this function called with.cds_len=TRUE. And then actually we don't want this subsetByOverlaps. We just want this subset here where the, how did I do it? We say that the gene_id is equal to this entrez ID we talked about up in the beginning when I talked about gene IDs. Okay I seem to be missing a transcriptLengths. Yeah, I need a closing parenthesis. So here we see the three different transcripts. We see that each of them have three axons. We see they're different lengths because they're spliced differently. And we see that two of the transcripts has a cds length of zero, which basically says that there's no coding sequence inside those two transcripts. And the last one that have a cds length of 402. I think that those 402 bases are, Let's see here. Let's say. Two and let's say. Sum width. When I'm checking I get 402 basis nucleotides out when I look at the coding sequence. So this gives you some idea of the structure. Underneath it all, all of this is to order something called a SQLight database. And there's a set of functions for querying this database directly using SQL commands. I don't really see the big point of this. I think these transcript, exon, cds, gene, and the same thing with by. Like exonBy and transcriptBy should more than satisfy most uses here. You just gotta remember which one of the commands gives you what. Finally I will say that unlike genome packages, there actually is rather few transcript databases available from the Bioconductor website. I asked the core Bioconductor developers about this. And the answer was that there's so many different transcript annotation out there that they really prefer that users make their own txdb objects based on getting information from either BioMart or UCSC. There's a couple of convenience functions for doing this in genomic features, and it's described in the vignette. I think that it's something like makeTxDbFromBiomart or makeTxDbFromUCSC. And it basically just tells what table, what genome, and it goes online and downloads it and constructs it. Then you get a snapshot that's from a given point in time, that's useful for your own analysis. So don't necessarily think that all useful TxDb objects are hosted on the Bioconductor website, because there's only one for human.