In this practical we'll be going over how to download and parse our reference genome. So we've given you a link to a fast day file which contains the lambda phage virus genome. If you download this into your Python directory and open it up, you should see a file that looks something like this. You'll notice, the first line starts with a forward arrow. And then just has a lot of identifying information, including the name of the organism, which is the enterobacteria phage lambda. And then, following that is just a lot of bases. There are strings of like 70 bases and just a whole file full of them. And this is the entire genome for this organism. So now I am going to download this data set in iPython and process it. So I'm going to use wget to do this. If you have the file downloaded, you can also just open it locally. But this will get it from the URL. So I'm going to do !wget --no-check, and then I'm going to paste in the URL to the data set. And when I run this, I should see an output. Something like this. And then it says the file's saved to lambda_virus.fa. >> So, that exclamation point you used for your command there, that's a special iPython trick, right? That helps, that basically allows you to run something on the command line, but inside an iPython document. >> Right. If you download the lambda virus FASTA file and just put it in the directory you're working with in Python, you can just skip this step. So, now that we've downloaded it, let's write a function to read a genome from a FASTA file. So, I'm going to define a new function readGenome and pass in a filename. And I'm going to sort genome as a string, so I'll start by just an empty string. Then I'm just going to add each line of bases to the string. So to open the file I run with open(filename 'r'). This indicates that we're opening the file for reading. If we were writing a file we would put w there. And then I have to give it a name so I'm going to call the file handle f. And now, an easy way to loop through the lines of this file is just to say for line in f. And now, it's going to read the file line by line and store each line of the file in the variable line. So, the first thing I need to do is, if it's the first line of this file, which starts with an arrow, we want to ignore that. That just has information that we don't really care about. So I'm going to say, if the first character in the line is not the arrow, than it must be a line containing bases. So in this case I'm going to add it onto our genome string. So I'm going to say, genome += line.rstrip. And what rstrip does is it removes any trailing white space from the ends of the string. So in this case it will trim off the new line at the end. If there are spaces, tabs, any other white space, it will also trim those off as well. And then when all this is done I'm just going to return genome. So let's run this. Let's read the genome. So, I'm going to say genome = readGenome. And I'm going to pass in the name of this file, which, if you use wget, is written right there. So lambda virus.fa. And to make sure that we read it correctly, let's print out the first 100 bases of the genome. So you should see an output like this. Let's also make sure that we've read all of it. Let's look at the length of this genome. So, the length of the lamba phage genome is about 48,500 bases long. Okay. Cool. Now that we have this stored, let's do some analysis of it. Let's count the frequency of each base. So I'm going to do this using a dictionary. I'm going to have a key for each base and a value for the frequency, the number of times that base occurs. So initialize each frequency to zero. Just like that. Now I'm going to loop through each base in the genome. So I'm going to say for base in genome and I'm going to increment the value at that key in our counts dictionary. So I'm going to say counts(base) += 1, just like that. And then when that's done, I'll just print out our counts dictionary. And so here we go. You can see for each base, we have a number indicating how many times that base appeared. You can see that G and A appear slightly more frequently than T and C, but there's really not a large difference. There's a pretty even split between them. And if you add up these numbers they should add up to 48,502, which is the length of the genome. We can also do this using a Python module, which would make our life easier. So let's do that. Import collections. So, we're going to use the collections module. And I'm going to run, collections.counter and I'm going to pass into counter, a string. In this case, I'm just going to pass in the genome string. And counter will do exactly what we just did. It will loop through every character in the string and count the occurrence of each character. So if we do this, the output is a counter object, but inside that object, you should see something that looks exactly like our dictionary, with each base along with its frequency.