Up until now we've discussed a few different ways of solving the exact matching problem: naive exact matching, Boyer-Moore, and then index-assisted offline methods. But exact matching isn't quite what we want. The read will not necessarily match the genome exactly at its point of origin. There are a couple of reasons why we expect there to be differences between the read and the reference genome. One reason is sequencing errors, like we discussed before. Sometimes the sequencer will make mistakes: it will miscall a base in the sequencing read, and when that happens, that base might no longer match the reference genome. Another reason we might see differences is that the reference genome is not the same as the genome that we're studying. Before, we said that if you have two unrelated humans, their genome sequences are something like 99.8 or 99.9% similar, but not identical, which means that every once in a while there will be a difference between a base in the sequencing read and a base in the reference genome. So that's another reason why we expect there to be differences. Algorithms for exact matching are not going to be sufficient; we need algorithms that can do approximate matching, allowing for differences between the pattern and the text. So, what kinds of differences might we encounter? Well, here's an example: an alignment of P to T. In this alignment we see one difference. This particular kind of difference is called a mismatch, or a substitution. Here is another kind of difference. This time there's an extra character in P relative to T. The dash that you can see at the top in red is called a gap, and so this G, highlighted in red, is an insertion in the pattern P with respect to the text T. Equivalently, we could think of this as a deletion in the text T with respect to the pattern P.
Here's another kind of gap we can find, like this. This is just the mirror image of what we saw before: an insertion in the text T with respect to the pattern P, or, the other way around, a deletion in the pattern P with respect to the text T. We want to be able to talk about the distance between two strings. In other words, we want to be able to describe how different they are, how many differences there are. But we have to define exactly what we mean by distance. The first kind of distance we'll define is called Hamming distance. If you have two strings, X and Y, that are of the same length, we can define the Hamming distance between X and Y as the minimal number of substitutions we need to make to turn one of the strings into the other. So in this example we would need, let's see, one, two, three substitutions to turn X into Y or vice versa. So we would say that there's a Hamming distance of three between these two strings. Now here's a slightly different definition of distance. This is called edit distance, or sometimes Levenshtein distance. The edit distance between two strings equals the minimal number of edits required to turn one string into the other, where a single edit could be a substitution, an insertion, or a deletion. Let me point out two important differences between edit distance and Hamming distance. First of all, we're no longer limited to substitutions: with edit distance we have the option of using insertions and deletions as well. The second difference is that X and Y are no longer required to be the same length for edit distance. This makes sense because the only way we can make one string longer or shorter is by using insertions or deletions. Here's an example. What's the edit distance between this X and Y? Well, if we look, we can see a substitution here.
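The Hamming distance definition above translates directly into a few lines of Python. This is a sketch of my own, not code from the course; the example strings are made up for illustration, chosen so that they differ at exactly three positions, like the slide example.

```python
def hamming_distance(x, y):
    """Number of positions at which two equal-length strings differ."""
    assert len(x) == len(y), "Hamming distance requires equal-length strings"
    # Walk the two strings in lockstep and count mismatching positions
    return sum(1 for a, b in zip(x, y) if a != b)

# Illustrative strings (not the lecture's): they differ at 3 positions
print(hamming_distance('GAGGTAGCGGCGTT', 'GTGGTAACGGGGTT'))  # → 3
```

Note that the equal-length assertion is what separates Hamming distance from edit distance: as soon as the strings can have different lengths, we need insertions and deletions, and a more involved algorithm.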
And then if we look a little further along, we might notice that this string of As here has one more A than this string of As here. So that means there are two edits: one substitution, and then one insertion in X with respect to Y. So we would say that the edit distance is 2. Here's another example. What's the edit distance between this X and Y? This example is a little harder. It turns out that the edit distance, again, is 2. But when you looked at the original X and Y, it probably looked like the edit distance was going to be greater. We'll come back to this point in a later lecture when we talk about algorithms for finding edit distance. But for the rest of this lecture, and for the rest of the lectures that relate to approximate matching, we'll stick to Hamming distance and to substitutions. We'll worry about insertions and deletions later, when we talk about edit distance. So, here's our naive exact matching algorithm that we saw before. Let's say we wanted to adapt it to allow mismatches. In other words, we want to look for occurrences of the pattern P within the text T that are within some Hamming distance. Exact matches are within a Hamming distance of zero, but we also want to allow for matches that are within some Hamming distance greater than zero. So can we modify this algorithm to do that? Yes, it's actually pretty simple. The main thing we have to do is prevent the inner loop from immediately giving up as soon as we encounter a single mismatch. Instead, we'd like the inner loop to keep track of how many mismatches have been seen so far, and only when we exceed the maximum number of mismatches allowed do we break out of the inner loop and give up on that alignment. So here's an implementation of this idea, and we'll see this implementation again in an upcoming practical session. So we can easily adapt the naive exact matching algorithm to the approximate matching setting.
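The modification described above can be sketched as follows. This is my own rendering of the idea, not necessarily the exact code shown on the slide or used in the practical session; the function and variable names are my own, and the example strings are illustrative.

```python
def naive_mm(p, t, max_mismatches):
    """Find offsets where p matches t with at most max_mismatches substitutions."""
    occurrences = []
    for i in range(len(t) - len(p) + 1):  # try every alignment of p against t
        mismatches = 0
        match = True
        for j in range(len(p)):
            if t[i + j] != p[j]:
                mismatches += 1
                if mismatches > max_mismatches:
                    match = False  # exceeded the allowance; give up on this alignment
                    break
        if match:
            occurrences.append(i)
    return occurrences

# With an allowance of 0 mismatches this reduces to naive exact matching;
# with 2 mismatches, additional approximate occurrences are reported.
print(naive_mm('ACTTTA', 'ACTTACTTGATAAAGT', 2))  # → [0, 4]
```

The only change from naive exact matching is the mismatch counter: instead of breaking out of the inner loop at the first mismatch, we break only once the count exceeds the allowance.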
But if you think about it, I think you'll agree that it's quite a bit harder to see how we can adapt, for example, Boyer-Moore to the approximate matching setting. What we'll study next, though, is a method that does allow us to do this in general. It's a method that, surprisingly, will allow us to use any of the exact matching methods that we've seen so far as a tool to solve the approximate matching problem.