Now that we've implements native exact matching and use it to match artificial reads against the genome, let's do the same thing with real sequencing reads. So we've already downloaded the 5X genome. The other link we've given you is to a set of actual sequencing reads from that same genome. So you can download that, and I'm going to do that with Wget. And to read the FASTQ, I'm going to use the same readFastq function that we developed in Practical 4. So I'm just going to paste that in here. And I'm going to I'm going to use that to actually read the FASTQ file. And this function actually returns qualities too, which in this case we're not interested in. So, I'm just going to use a blank to indicate that we don't want to save that to a variable. I'm going to paste in the name of this file. As this has downloaded the sequencing reads from that file and that's a set of a 1000 reads. So, now let's do the same thing we did above with artificial reads in practical six and actually match these, align these against the genome and see how many of them match. So, I'm going to keep track of how many reads have matched. So for each read r, in my list of reads. I'm going to get the set of matches from naive. And n is just a count of the total number of reads that we've processed. So now I can say, if the length of the set of matches returned is greater than zero, that means that they read aligned in at least one place, so we found a match. So I can increment numMatched. And now I'm just going to print the proportion of reads that matched correctly. So in order to run that, starting the k status running, so it's taking a little while. And only seven out of 1,000 reads match the genome. >> Okay, so that might be surprising. These are sequencing reads actually from this genome, and yet only 7 out of 1,000 of them actually match the genome. So why would that be? Well, we actually learned a couple of reasons why. A sequence obtained with a DNA sequencer might not match back to the genome exactly. One reason is because of sequencing errors, so it could be that some of the bases were read incorrectly by the sequencer. And then another reason is because there might be differences between the individual that was sequenced, in this case it's a virus, and the reference genome sequence. So in this case it's probably mostly the sequencing errors that are the reason why many of these reads are not aligning. So one of the things that we're going to try next is to just take part of the read, right? It could be that the whole read, all 100 bases, don't necessarily match. But if we just take a piece, like maybe the first 30 bases or so. That has a greater chance of matching exactly. All right, so we'll try that next. >> Okay. So, I just can make a little change to get the first 30 bases. We'll set r equal to, just the 30 base prefix of r. I'm going to run it again. And this time, we're 459 out of 1000 reads match. So, a little under one half. >> So this still isn't very close to a thousand. This is less than half of the reads aligning to our reference genome. So you might be able to think of another reason why maybe not so many of the reads are aligning. In this case its because the genome is double stranded and the reads can come from one strand or they can come from the other. But the exact matching problem we've set up is only looking on one of those two strands. It's not looking on both. So that could explain in large part why we're seeing less than half of the reads actually matching back to the genome. So the next thing to try would be not just to try to match the read to the genome, but also to try to match the reverse complement of the read to the genome. That will cover the case where the read comes from the other strand of the genome. >> Yeah, so I can do that using the reverse compliment function, which we developed in Practical 2. I'm just going to paste this in here. Now I'm going to use basically the same code from up here. But I'm going to do two steps here. So I'm setting matches to be the naive exact match between r and the genome. But I'm going to do something similar, I'm going to say matches.extend with naive of the reverse compliment of r in the genome. So what dot extend means is it just takes that list and adds on to it the results of another list. So the first line here, matches equals naive of r in the genome is going to test whether that read matches in the forward direction on the genome. In the next line, we'll add onto it the results of any matches of that read to the reverse complement of the genome. So I'm going to run this now. >> Oh [LAUGH]. So the problem here is that our. Our dictionary in the reverse complement function doesn't allow for there to potentially be an n in a sequencing read. >> Right. >> And yet, like we said before, sometimes a sequencing read has an n meaning no competence. So we just have to add one more key value pair to our dictionary there. >> So I'll just add the base n and this will just match to an n. Since in both cases we don't know what it is. >> Yeah. >> So I mean update that and run it again. >> It's taking a little longer because it has to do more matching now. >> There we go 932 out 1,000 reads have matched. >> Yeah. So still not 100% But this is to be expected, and because of sequencing errors and things like that. But once we figured out that we had missed one whole strand of the genome, and once we took a prefix of the read instead of the entire read, we were able to match nearly all the reads in our data set. But the fact that we had to take the first 30 base pairs or so is a pretty good indication that exact matching is not necessarily going to be what we want to do, right? We want to allow for sequencing errors, we want to allow for differences between individuals. So that's why later in the course we'll examine some approximate matching algorithms. >> Yeah.