In this practical we'll be implementing a substring index for a text t which we can use to match a pattern against it. So in order to do this I'm going to import a Python module called bisect which allows us to do binary search on the list. And what I'm going to do is I'm going to create an index object. And this object is going to have two methods. The first is going to be just the initialization method, which preprocesses the string. And second is going to be a query function. So first I'm going to create this function init. I'm going to pass in a text t and a kmer length k. I'm going to start by defining a few class variables. So I'm going to set k to equal to the k that's passed in. And I'm going to create a list for the index which is going to start off as an empty list. And now what I'm going to do is I'm just going to loop through my text t and I'm going to take every kmer of length k in that text, and I'm going to add them to the index along with their offset. So I'm going to say, for i in range(len(t))- k +1. So this will give us every index such that the kmer doesn't run past the end of the text. And then I'm going to append to index the kmer starting at i of length k. And I'm also going to add on the offset i. And I've put these in parenthesis, this is a tuple in Python, which is just two associated values. And then when all this is done, I want to sort the index so that it's easy to look up. >> So in lecture the way we described this was we were keeping the list in order as we were adding items. This time we're just going to add them and then we're going to put them in order at the very end. So this sorting step might take a long time, if we have a very long text t, we might have a lot of these pairs that we have to sort. But this is time well spent because we're going to be able to reuse this same processed text over and over again for each query. >> Yeah. So once we've created this indexed object And all the preprocessing has been done, we're going to create a function called a query, which we can call. And this function, takes an argument p, which is the pattern that we're going to match against the text. So the first step is to just find the first k basis of p to look them up in the tables. So, I'm going to say kmer equals p up to length k, and we need to do self.k. And now I want to find the first position in the list where this kmer occurs, so I'm going to use bisect to do that quickly. And say i = bisect.bisect_left. And I pass in the list that we're searching for, and then the object that we're searching for. And remember this is a list of tuples so I'm going to pass in the kmer, and then I'm going to put -1. And this sends all the indices in the list are greater than negative one. This one assure you that we always get the first occurrence of that. So now I'm going to just create an empty list for hits which is where this kmer occurs. And I'm going to say, I'm going to start from i in our index list and I'm going to just keep looking in that list for all of the entries that have the kmer we're looking for. So while i < len(self.index): I'm going to say if that location index is not equal to our kmer Then we're done. We've passed that segment of our list. Remember that our list is sorted so all of the kmers of the same type will be together. So once we find the one that's not equal to that, we can just break. If it is equal to that, then we want to append that index in t to our list. So in this case I am going to get the second value of that tuple and append that to our list of hits. And then I'll increment i. And when that's all done I just return hits. And so hits will be a list of all the indices in t where p matches or where the first k faces a p match. So now lets use this object to actually match a full pattern against the string t. So I'm going to create another function, queryIndex which is going to take a pattern p, a text t to match against, and an index which has been created from t. The first thing I'm going to do is just get the length of k from that index. I'm going to keep a list of offsets where it matches. I'm going to say for i in index.query(p). Remember, the query function returns a list of possible places where p could start. It returns a list where the first k bases of p matches the first k bases of t, but we still have to check that the rest of p matches t in that location. So I'm going to say if p(l:), this will take the rest of the bases after the first kmer, matches t(i+k:i+len(p)). Then in this case, our p matches t exactly at that point. So we can just append i to our list of offsets. >> So this is what we call verification in the lectures. We're just verifying that this index is actually is at a place where p matches in its entirety. >> And so once we've done this all we have to do is return our list of offsets and we're done. So let's create a simple text and pattern to match against. I have this one so p actually matches against two places in t. So the first step is I have to create my index. And I have to pass in a kmer length. I'm going to make this less, it has to be less than p, so I'm going to do a kmer length of two. And then I'll just run my function queryIndex, and pass in p, t, and my index. And the operate is two indeces, 7 and 14. We can just verify by printing the substream t that there is a correct match at both of these places. >> Yep.