In this practical, we'll be describing how to download and work with a set of sequencing reads. So, we've given you a URL to a data set. If you download that file and open it, you should see something like this. This is a fast q file, which is a common format used for storing reads from a sequencing experiment. Now if you'll look closely at these lines, you'll see that they come in sets of four. The first line in each set of four starts with an @ symbol, and then just has a name. This is just a tag to indicate which read this is. The following line is a string of DNA bases. The next line is just a plus sign, and then the third line is just a bunch of symbols. This is the quality sequence from that read. So if you remember from lecture, each quality score is Phred 33 encoded. This converts into a ASCII character. And that's what's printed in the fourth line here. >> So this file is sequencing reads from a real human being. So these are actual human sequencing reads we're looking at. >> Okay, now let's download this dataset in Python. So, once again I'm going to use wget. I'm going to paste in the URL, now our data set has been downloaded. Let's write a function to read this *.fastq file and parse it. So I'm going to just copy over a set of lines here just to refer to when writing this function. So I'm going to create a new, new function, reed fast Q, pass in a file name, and we want to store a list of sequences from this file, and we want to store a list of quality string associated with each sequence. So I'm going to say with, I'm going to open the file. I give it a file handle. And then, I'm just going to put a while loop, I'm going to say while True, so we'll loop forever, but we'll break when we reach the end of the file. So the first thing I need to do is, so each time I go through this loop, I'm going to need four lines. So the first line, if you look back at this block here. the first line is just the tag for the read. We don't care what's in this line, so we're just going to read it, but we're not going to store that to anything. The second line is the sequence, so I'm going to say, Save this to the variable seek. The third line is just a plus sign. We don't care about that. So once again, we're not going to store that. And then, the last line is the string of quality scores, so we're going to save that to qua. And you want to strip white space again here, right? So the seek and qual lines are going to have new lines at the end of them, so we want to get rid of those. >> Right, yeah. So I need to add an rstrip to the end of each line. And now that I've read these lines, I have to check if I reached the end of the file. So if I reach the end of the file at a previous state, all of these strings will just be initialized to an empty string. So I want to check. I'm going to make sure that the length of the sequence is greater than zero. So if that sequence has a length zero, then we've reached the end of the file and there's nothing left. So in this case, I'm going to break out of my while loop. If that's not the case, if we're still in the file, still reading lines, then I want to append these values to my list. So I'm going to do sequences.append(seq) the sequence that I read, and I'm going to do a similar thing for the string of qualities course that I read. And then when I'm done, I'm just going to return the sequences and quality strings. And if you're new to Python, you can return multiple values from one function just like this, by separating them with commas. Okay, now we've written this function, let's run it on the data set we downloaded. Worded. So I'm going to save these to variables for sequences and qualities. going to run readFastq. And I need to paste in there, the name of the file that we downloaded. So now you have two parallel lists, one that has all the sequences in it, and one that has all the quality strings in it. >> Right. So let's make sure that these are filled out correctly. Let me print the first five sequences. There you go. And then we'll do something similar. Let's just print out the first five quality strings. And there are the strings of quality scores associated with each read. Now that we've downloaded and parsed this data set, let's do some analysis of it. Let's create a histogram of the different quality scores, to see what quality scores are most common and which are least common. So before I do this, I want to write a quick helper function to convert the ASCIIi symbols to the quality score. So this will convert phred 33 encoded value to just a quality score. And we'll pass in a quality value. The function word just takes a character and converts it to it's asking value in integer number. So I'm going to convert that ASCII value to a character, or to a number, and then subtract 33, and then return that. So I see a bunch of hash symbols in those quality strings. What happens when you run that function on a hash symbol? >> Let's try that. So I'm going to do this, put in a hash symbol. And that's a quality score of two. >> That's low. >> Right. A low quality score, remember, means that we have a very low confidence in our value. It's not that likely to be correct. In this case, I think a quality score of two, corresponds to about a 30, maybe a 32% chance that we're incorrect. So, you notice, most of these strings of hashtags occur near the end of the reads. And as was discussed in lecture, as you step down the reading and get towards the end, your confidence gets lower. >> So try now, I see also some capital Js in there. Just try a capital J. >> Okay. Oh. >> Oopsy. >> So J is a score of 41, so that's very high. That's a very high confidence in our sent value. >> Yeah, that's, less than one is. In what, 10,000 chance that the base call was incorrect. >> Yeah. Okay. Now let's write our function to create a histogram of the quality scores. So I'm going to write a function, createHist. We'll pass in a list of quality strings. I'm going to create a histogram. This will just keep track of the frequency of each quality score that we've seen. I think the highest quality score you can see is actually 41, which we saw. Just to be safe, I'm going to go up to 50, so that should give us some extra padding on the end. And I'm going to say for each string of quality values, qual, in our list qualities. And then, in each string I want to step through and look at each quality score. So for each phred quality score in that string. So, firstly, I need to convert that to a number. So I'll say q equals. And I'll call the function I wrote above, phred33ToQ. So now I have the number associated with, or the actuality quality score for that position, so I'm going to increment that value in the histogram. And then when I'm done, I'll just return the histogram. So let's run this on our quality scores that we downloaded. And print out the histogram. So you can see the lowest quality score we have is a value of two. That's actually the lowest that can be returned. And the highest is, this should correspond to 41, I think for this data set. But it's really hard to visualize this, when it's just printed out in text format. So, let's graph it. We're going to do this using Nat Plot Lib. >> So that's just a special command that you can use in I Python, that tells Nat Plot Lib to actually put the plots right there in the same document with the text that you're typing. >> Right ,and this line we'll just import >> The module that we need. We'll call it matpotlib.pyplot. And I'm going to plot this as a bar chart. And I need to pass in the X values and the Y values. So the Y values are going to be the values of our histogram. The X values are just going to be the numbers from zero up to the length of the histogram. >> So I'm going to use range up to the length of h, and then the y values will just be h. And then to actually display it, I need to call plt.show. And here's the result. So you can see we have kind of a big spike of low quality value, then it drops down really low. And then it increases gradually and we have a lot of high quality values. So Ben, why is there such a big spike at two, but doesn't seem like there's any quality scores at three? >> Yeah, well I'm guessing that what's happening at a quality value of two, is that these are clusters where the base caller was. >> Not confident at all at what the base ought to be there. For example, if it was an equal mix of colors coming from that cluster, it will have very low confidence. And so that probability that the call is incorrect, it's going to be high. It's going to be like a half ,or something like that, and that corresponds to a very low Q value. So that corresponds to maybe two or so. Because we see that fairly often toward the ends of the reads, for reasons that we discussed in the previous optional lecture, we have a lot of occurrences with a quality value of two. >> Cool.