We're looking for a method that allows us to take the exact matching algorithms we've studied already and apply them to approximate matching problems. So let's say we start with our pattern P, and the first thing we do is divide it into two pieces, two halves, labelled u and v here. So u and v are two non-overlapping substrings that cover P, basically just halves of P. I'll sometimes refer to these as partitions, so u and v are two partitions of the string P. The idea is this: if P occurs in a text T with one difference, with one edit for example, then either u or v must appear exactly, without any edits. In other words, the edit can change the sequence of u with respect to T so that it doesn't match anymore, or the edit can change the sequence of v so that it doesn't match anymore. But it can't change both, so one of them will still have to match exactly. That means we can start to solve this approximate matching problem by using an exact matching algorithm, any exact matching algorithm, to search for occurrences of u and v within the text T. So that's the one-edit case. What about the general case, where we might want to allow more than one edit? Let's say we want to allow up to k edits. We can use the same principle, but instead of having two partitions of the pattern P, we have k + 1. So the new principle is: if P occurs in T with up to k edits, then at least one of these k + 1 partitions of P must appear with zero edits, with no edits. In other words, it'll occur exactly within the string T. So this is a general principle now. If we want to find occurrences of P with up to, say, three edits, we can set k equal to 3, divide P up into four partitions, and then use some exact matching algorithm, maybe Boyer-Moore, maybe an index-assisted algorithm, to look for occurrences of those four partitions. If we want k to be 10, then we can divide P up into 11 partitions, and so on.
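The partitioning step described above can be sketched in a few lines of Python. This is just an illustration, not code from the lecture; the function name `partition` and the balanced-split scheme are my own choices. It divides P into k + 1 non-overlapping substrings that together cover P, returning each piece along with its offset within P:

```python
def partition(p, k):
    """Divide pattern p into k+1 non-overlapping pieces that cover p.

    Returns a list of (offset, piece) pairs, where offset is the
    position of the piece within p.  Pieces are as evenly sized as
    possible; their lengths differ by at most one character.
    """
    n = len(p)
    num_pieces = k + 1
    pieces = []
    for i in range(num_pieces):
        start = i * n // num_pieces       # integer arithmetic keeps sizes balanced
        end = (i + 1) * n // num_pieces
        pieces.append((start, p[start:end]))
    return pieces
```

For example, `partition("GATTACA", 1)` splits the pattern into the two halves `"GAT"` at offset 0 and `"TACA"` at offset 3, matching the u and v of the one-edit case.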
So why does this principle hold? Basically, it's because if you have k edits and you distribute them across these k + 1 partitions, every time you put down an edit, you change the sequence of a partition so that it doesn't match anymore. But if you only have k edits to use and there are k + 1 partitions, you can't change all k + 1 of them; you can only change up to k of them. Here's a specific set of examples, where I'm using blue boxes to represent the partitions and red Xs to represent the edits. In the very first case, at the top, I have five partitions and I'm putting down four edits. The way I've put them down, three of the edits occur in one partition and one occurs in another, but three partitions are unaffected; there aren't any edits in those partitions. In the second example, I've put down four edits again, but now they're spread across three different partitions, so two partitions don't have any edits in them. The third case, on the bottom, is in some sense the worst case: it's the case where the edits are distributed as evenly as possible across the partitions, so the four edits are each in a different partition. But even so, because we have five partitions and only four edits, there's still one partition, the second from the right, that's not affected by any of the edits. So the principle we're using here is a little like the pigeonhole principle, which you may have heard of before. The principle goes something like this: if you have ten pigeons and you're going to put them into nine pigeonholes, then at least one pigeonhole is going to have more than one pigeon in it. So in this example, the upper left-hand pigeonhole has two pigeons in it. Actually, our situation is a little different. It's as though we have eight pigeons and nine pigeonholes.
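The counting argument above can even be checked exhaustively for a small case. In this sketch (my own illustration, not from the lecture), each of k = 4 edits is assigned to one of the k + 1 = 5 partitions, and every possible assignment is enumerated to confirm that at least one partition is always left untouched:

```python
from itertools import product

k = 4            # number of edits (the "pigeons")
parts = k + 1    # number of partitions (the "pigeonholes")

# enumerate every way of placing each of the k edits into a partition
for assignment in product(range(parts), repeat=k):
    touched = set(assignment)   # partitions that received at least one edit
    # k edits can touch at most k distinct partitions,
    # so at least one of the k+1 partitions is always edit-free
    assert len(touched) <= k

print("checked", parts ** k, "assignments: one partition is always edit-free")
```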
In that case, we know that when we put the eight pigeons into the nine holes, at least one hole is going to be empty. In other words, at least one partition is not going to have any edits in it. So here's an illustration that shows the whole process. Let's say we've divided P into five partitions, perhaps because we're looking for occurrences of P that have up to four edits. The next thing we'll do is use an exact matching algorithm, any exact matching algorithm, to find exact occurrences of each of these five partitions within T. Let's say the only match we find is a match of partition P4, right here: we find partition P4 occurring within T. That's our hint that there might be an approximate match of P to T in the neighborhood of this partition match. So the next thing we need is a verification step. Just like we talked about before, when we were using the index and querying it with a k-mer extracted from the query, we had to verify that, wherever the index reported a hit, the entire pattern P occurred in the vicinity of that hit. It's just like that here. When a partition matches exactly within T, we use a verification step to determine whether the entire string P occurs within the neighborhood of that partition hit. So that's how we would apply this principle. And just to summarize: this principle, which we'll call the pigeonhole principle since it resembles the example with the pigeons and the holes, is the bridge that allows us to use exact matching algorithms, all the ones we've learned so far: naive exact matching, Boyer-Moore, index-assisted algorithms. We can use those algorithms to find approximate matches by dividing the pattern P up into partitions and looking for each of those partitions using an exact matching algorithm.
And then, wherever we find an occurrence of one of those partitions, we do a verification step, where we look in the neighborhood to check whether there's a full occurrence of P with up to the maximum number of differences allowed.
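Putting the pieces together, here's a minimal sketch of the whole pipeline. A couple of simplifying assumptions, which are mine and not the lecture's: it uses naive exact matching for the search step, and it restricts "edits" to substitutions (mismatches), so verification is just a character-by-character comparison at the implied offset. Allowing insertions and deletions as well would instead require an edit-distance computation over a small window around each partition hit. The function names are hypothetical:

```python
def naive_exact(p, t):
    """Offsets where pattern p occurs in text t exactly."""
    return [i for i in range(len(t) - len(p) + 1) if t[i:i + len(p)] == p]


def approximate_match(p, t, k):
    """Offsets where p matches t with up to k mismatches.

    Pigeonhole: split p into k+1 partitions; if p matches with at most
    k mismatches, at least one partition must match exactly.  Find each
    partition with exact matching, then verify all of p around each hit.
    """
    n = len(p)
    num_pieces = k + 1
    occurrences = set()
    for i in range(num_pieces):
        start = i * n // num_pieces
        end = (i + 1) * n // num_pieces
        piece = p[start:end]
        for m in naive_exact(piece, t):
            off = m - start                   # implied offset of all of p
            if off < 0 or off + n > len(t):
                continue                      # p wouldn't fit in t here
            # verification: count mismatches over the full pattern
            mismatches = sum(1 for j in range(n) if p[j] != t[off + j])
            if mismatches <= k:
                occurrences.add(off)
    return sorted(occurrences)
```

For instance, `approximate_match("GATTACA", "xxGATTAGAxx", 1)` reports offset 2: the partition `"GAT"` matches exactly there, and verification finds that `"GATTAGA"` differs from `"GATTACA"` by a single substitution, within the budget of k = 1.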