We discussed how to build and query a k-mer index, an index that's built by taking all the k-mers of the text T and adding them to a data structure that maps each k-mer to a list of all the offsets where it occurred in the text. So this kind of data structure is called a multimap. It's a map because it associates keys, k-mers, in this case with values, offsets in the genome. And it's a multimap because a k-mer may be associated with many different offsets in the genome, right? A k-mer could occur many places within the genome, within the text. So what kind of data structures can we use to implement a multimap? So we'll discuss two of them here. The first one is based on ordering, like the index of a book, and the second one is based on grouping, like the aisles of a grocery store. So let's talk about the first data structure, which is based on ordering. So here at the top of this slide, I have a key-value pair. The key is a 3-mer from the text T. It's the very first 3-mer. And the value is the offset where that 3-mer occurs. So we make a key-value pair for every 3-mer at every offset within T. Here we go. There they all are. And then finally, we order them. We put them in order alphabetically by 3-mer like this. So now, this is our index data structure. It's simply a list of 3-mer offset pairs ordered alphabetically by 3-mer. So now how do we query this index? Say we have a pattern P from which we extract a 3-mer, and we'd like to query the index with this 3-mer. And the way we're going to do this is with something called binary search. So, what is binary search? So, going back to our index analogy for a moment, let's say we're looking up a term in this index. We're looking up the term memory. And, so we flip to the exact middle of the index and we look in the middle. And we find the key term that's there, and let's say the term directly in the middle of the index is light. So light comes alphabetically before memory, so we know that we can completely ignore the first half of the index, up to and including the term light. So memory must be in the second half, so then what do we do next? Well we do the same thing but just for the second half of the index. So in this way we can keep going iteratively throwing away half the index each time for each iteration. And eventually we can hone in on exactly the term that we are looking for. So this is called binary search. So let's see an example using our index. Let's say this is our pattern p from which we extract the 3-mer TGG. And first we take TGG and compare it to the 3-mer that's in the middle of the index. The one that's alphabetically in the middle which in this case is GTG. Our query is alphabetically greater than that. TGG comes after GTG. So we can ignore in the index, every entry up to and including GTG, and we've effectively divided the problem in half. And each time we divide the problem in half in this way, we call this a bisection. Session. So what do we do next? We do the same thing again, but on the remaining entries of the index. So we can compare TGG with the middle k-mer here which is TGC. And TGG comes after so we can bisect again. We can throw out the first half. And finally, we can compare TGG to the middle element that we have here, but there's only two elements left. So let's just say we compare TGG to the first one and we find that that's a match. So that match corresponds to an index hit. We wanted to find offsets where TGG occurs. We've now found the first one in the index, and it occurs at offset 7. So, the total number of bisections that we need to perform in order to find our key in the index is approximately equal to the logarithm base 2 of the number of keys in the index. Why the logarithm base two? Well, because for each of our bisections, we're repeatedly dividing the problem in two. So we'll implement this idea and some related ideas in Python later. But before we move on, let me make a final point, which is that Python actually provides a set of functions related to binary search that are useful for us here. So these functions are all in a Python module that's called bisect. And a Python module if you haven't encountered it before is just a collection of related Python functions and classes. And this module is about binary search. So one function in this module is called bisect_left and this function takes two parameters. The first parameter a is a list that's already sorted. It's already ordered in ascending order. So if this is a list of strings then the strings should already be in alphabetical order. Or if this is a list of integers then they should be in ascending order. And the second parameter x is an item and the function returns the offset where the item x can be inserted into the list a while maintaining the order of the list. So it's the leftmost position where x can be inserted in a such that a is still in order after that insertion occurs. So, bisect left is a useful function to us. In this example we call bisect left with the parameter a, which is this list up here. And then the argument 2 and so bisect left is going to return the left most index where we can insert 2 into this list such that the list is still in order. And that's offset one, right here, so if we wanted to insert 2 into this list and have this list still be in order we would put it right here between this 1 and the first 3. We would stick it right here and then we would shift all these entries over by one. Here we're calling bisect_left with the parameters a and 4. Bisect_left in this case returns the offset 3. That means if we wanted to stick 4 in this list, we would do it between here and here. So we'd do it between the 3 and the 6. And then in this final example we're calling bisect_left with a and the parameter 8. And it's telling us that if we want to insert a, 8 into this list, then the leftmost place we can insert it such that the list is still in sorted order is at offset four. Now we could have also inserted it here at offset five or here at offset six. And the list would still be in sorted order, but bisect_left is always going to return the leftmost offset where we can insert it so that the list is still in sorted order. So this bisect left function is exactly what we need in order to do queries of this index data structure that we built. So for example if we're querying with this pattern here and we take some 3-mer from the pattern, let's say we take the threemer GTG. We can use the bisect left function on our index and the parameter GTG. And it will return the offset of the first position where we could insert GTG while maintaining sorted order of the list. And that corresponds to this entry right here, which is the first entry that has GTG as its key. So if we wanted to find all the places where GTG occurs, we could use bisect_left. It would point as here, and then we could look up this offset and then, keep going. Find another GTG and find its offset. And then, keep going and find another GTG and find its offset. So we would report that GTG occurs at offsets zero, four and six.