In this practical, we're going to implement Boyer-Moore. So, to begin with, I've pasted a lot of code in this notebook, most of it you don't have to worry about. It's doing the pre-processing for computing the good suffix and bad character rule, which was talked about in lecture. What I'm going to point out is we have at the very bottom a class called Boyer-Moore. And this has functions we can call for the bad_character_rule, the good_suffix_rule, and the match_skip. Match_skip is from the case where a pattern matches the text exactly and it essentially just computes the good_suffix_rule in that case. So, let's just test that these bad_character, good_suffix, and match_skip rules work correctly. So, what I'm going to do is I'm going to pretend we have a text that might look something like this, and then a pattern like this. And what I'm going to do, I'm going to define that pattern. And before I call the bad character rule, I need to create a Boyer-Moore object. So I just do that by putting the name of the class, and then I have to pass in the pattern and this will do the pre-processing for that pattern. Now that I've done that, I can call the function bad character rule. And I have to pass in the offset rhythm mismatch occurred and the mismatch in T. So, in this case where PSTCAA and the text is this larger string, you see that look counting backwards, our first mismatch is going to occur here at index 2 in the pattern and the mismatching character is a T. So I run that and I get a skip of 2. So this makes sense, because in this case, we're going to have to move our p along until we find a T that lines up with the mismatch of T in our text. So once we move it two indices along, that first index of p will match up with the text, so we have a skip of 2. >> Mm-hm. That's the amount we're going to shift the pattern along as a result of the bad_character_rule. >> Yep. So now let's do something similar, but for the good_suffix rule. So I'm going to keep the same t, but I'm going to change our p here to be this. So the last three characters of P, in this case, will match against our text. So I have to create my Boyer-Moore object and then I can call good_suffix_rule. And in this case, I only had to pass in the index where the mismatch occurs, which in this case is 0, because comparing backwards from the end, the last base matches. So do the two after that, and then the base index 0 in p is the first one we see that mismatch is with T. So I do that, I get a skip of 3. So in this case, we have to move our pattern along until a prefix of p matches a suffix of this part of t. And in this case you can see that it's going to happen when we moved p along three bases, and the first A and P is going to match against that A and T right there. And finally, let's do the same thing with the match skip. So in this case, I'm going to let p be, oh, let me change my text a little bit. >> So this will be an instance where the pattern matches entirely, so all the character comparisons will be matches. >> Yep. So once again I just call p_bm.match_skip and in this case, I don't have to pass anything in. And the upward of that is two. So you can see, we've basically just done the good_suffix_rule again, in this case, because once we move this pattern along two bases, now that first AC is going to match against the second AC in our text. And so, we have to check again at that point. >> Mm-hm. This is just a special case of the good_suffix_rule. >> Yeah. Okay, so now that we've gone over those preprocessing functions that are going to help us out, let's actually implement Boyer-Moore. So I'm going to create a new function, and as arguments this function, we want our pattern p, we want our Boyer-Moore object, which is the preprocessing for p, and then our text t that we're going to search in. And I'm going to start by creating an index i. I'm just going to keep track of where we are in the text. I'm going to start create a list for occurrences, this is just a indices where p matches to t. And I'm going to loop through while i is less than length of t minus length of p plus one. So this is going to loop through all the positions in t where p could start without running past the end of t. Now, the first thing I'm going to do is I'm going to create a variable called shift. This is going to indicate the amount that we're going to move along after this comparison. So, if skip basis, shift will be more than one. If we were doing naive exact matching, that would be a case where shift was always one, but because we can use the good_suffix_rule and the bad_character_rule, a lot of times we can shift by more than one, which will save us comparisons. And then I'm going to create a variable of mismatched which we'll update as we go if we find a mismatch. So, now I need to loop through the pattern p from the end to the beginning. So to do that I'm going to say four j in the range and I'll let the p minus one to negative one. And then the third argument, negative one, means we're going backwards. And we're going to stop just before negative one, so we'll stop at zero. Now I'll say if we have a mismatch. So if the character in pattern p at j is not equal to character and pattern t at i plus j, then we have a mismatch. So, the first thing to do is to calculate our bad character rule and our good suffix rule, to see how many bases we can skip. >> So, I'm going to use my Boyer-Moore object for this, just like we did above. Bad_character_rule, and I have to pass in the index, which is j, and the incorrect character in t, which is going to be t index i plus j. So that's our shift using the bad_character_rule and we're going to do a similar thing with the good_suffix_rule. And in this case we just pass in the index of the mismatch which is j. So now we have three possible amounts we could shift. We could shift by just one, we could shift by the amount that the bad_character_rule allows us to, or we could shift by the amount that the good_suffix_rule allows us to. And which ever is the largest will save us the most time and the most comparisons, so I'm going to set shift equal to the max of shift, skip_bc, and skip_gs. And then I'll just set mismatched equal to true, and if we found a mismatch there's no sense in comparing the rest of the string, so I can just go ahead and break. Now, once we finish that comparison, we can see if we matched or not. So, if there wasn't a mismatch, which means that we match the text exactly, then, we want to add our occurrence spot onto our list of occurrences. And we can still skip some using the match skip function. So I'm going to do skip_gs, p_bm.match_skip, and then similar to what I did above I'm just going to find the max of our current shift which is one and the amount we can skip with the good_suffix_rule. And then after that's done we're going to update our position by shift. And so once we finish going through this, we just return to our list of occurrences. And this is our Boyer-Moore function. Okay, so now let's test this on a simple text and pattern. So. Do something like that. I'll choose a pattern and then I'll run op three, my Boyer-Moore object, this is going to be the pre-processing. Now that I've done that, I can actually run my function Boyer-Moore and I pass in my pattern, my pre-processed object, and my text. And the results are, we have two offsets. We'll just double check this, I'm going to look at in text t from seven, and that is the correct pattern, and we can do something similar at 14. There you go. So, both cases I've found my correct match.