0:15

But I just want to start by saying this is one of the coolest algorithms that we'll

cover in this course.

And it's not an algorithm that anyone would come up with without a lot of hard

work, but understanding this algorithm really gives somebody an appreciation for

what's possible with careful algorithmic thinking,

even for such a simple problem as this.

It's a quite ingenious method.

So here's the intuition behind the algorithm.

Say we're looking for the pattern B followed by a bunch of As in text.

So we're going along and so, well, we found a mismatch right away.

So we'll move over first position and [COUGH] we'll go BAAAA and

we try to mismatch at the end in this case.

So now what the brute force method is going to do is back up

to move over one position at B doesn't match the A.

Back up again at B doesn't match the A.

And keep going, matching the first B against all these As.

[COUGH] Before getting to the next character in the text.

1:29

So we've matched five characters and we mismatched on the sixth.

But the key point is that when we got that mismatch,

if all we have is As and Bs, we already matched BAAA four As and

we got a mismatch so it's a B.

So at that point, we know what the previous characters in the text are, and

we know that if we move over one position we're going to be trying to match

B against A.

So we should be able to take advantage of that knowledge that's the intuition

of the method of the Knuth-Morris-Pratt.

We going to need to back up the text pointer in this case.

When we get the mismatch, we can just keep going.

[COUGH] We don't even have to match the first B, we know it's there.

So we can just keep going in the text pointer and

start on the second pattern pointer.

3:12

we [COUGH] move to the state that is indicated by that character.

So if we're in state 2 and we're reading an A, we move to state 3.

That's what this line in the table says.

If we're in state 2 and we read A,

that happens to be the pattern character is A, if we see an A we move to state 3.

If we see a B or a C, we go back to state 0.

That's what this machine says.

It's a tabular representation and a graphical representation.

If we get to stage 6, then we just say accept, we found the pattern.

And you can see how that is, the pattern, if we match a pattern character we

move right through this machine until we get to the halt state.

So, let's first take a look at exactly how the machine works

to make sure everybody follows that.

And then we'll look at how do we rebuild this machine.

So, that's our machine and that's our texts off at the top.

4:35

At every step we consume a text character, just read a text character.

Now we're in state 1 and reading an A.

And in the graphical representation it says stay in state 1 or in this

representation it says if you're in state 1 and you see an A stay in state 1.

So that's what we'll do, we'll read the A and stay in state 1.

Now we're in state 1 and so you can say that in state 1,

it's reading As and we're looking for something that's not A.

No matter how many As you have, you'll read them right here in state 1.

So now we're in state 1 and reading a B and it says go to state 2 in that case so,

that's where we'll go and read the B.

Now we're in state 2 and reading an A, so we go to state 3 and read the A.

Now we're in state 2 and we see a C.

In state 3 when we see a C, we're supposed to go back to state 0.

That's a mismatch, our pattern C is not until the end.

So now we have to start the search all over again.

Although in the finite state machine doesn't know that.

Its just plotting along according to the state transition suppose to do.

So we read the C now we're back in state 0.

So now we're in state 0 looking at an A.

And we move to one, and read the A.

And then we see another A, so we stay and want to read the A.

Now we see a B, and we go to state 2 and read the B, then we go to state 3 and

read the A, state 3 and read the A, state 4 and

read the A, state 5 and read the C, and now we're in state 6.

That means we found the pattern.

6:41

So let's just take a quick look at what the interpretation of this DFA is.

So what does it mean after we just read the ith character of the text?

Well the number of the state is the number of characters

in a pattern that we've matched up to the current point.

So that is at state three, we've matched three characters in the pattern.

And what's happened is that we've matched a prefix of a pattern,

the first three characters of a pattern.

And actually, it's the longest prefix of the pattern.

That's also a suffix of the first

i plus 1 characters of the text from 0 to i.

That's the interpretation of what it is.

So the DFA is in state 3 after the first six characters of the text.

8:20

It's simply a two dimensional array that's indexed by the pattern character.

So, if you're in a certain, if you're reading a certain pattern character.

Sorry, if you're reading a certain text character, then for each [COUGH].

We reset the pattern pointer to be the entry given

in the table that corresponds to the current one.

So let's go back to the table to be sure about that.

So when we're in a particular state like 2,

and we see an A, so this would be j.

We refer to that column and

whatever text character we happen to see, that's how we update j.

So that's what this code does, and then if we ever get to

the transition state in the final state, we just return i- M.

Now this is just like that alternative implementation of

the brute force method that we gave, but

the key idea is that, notice that i never gets decremented.

All we're doing is incrementing i and changing j,

either always just referring to the table.

When we have a match, the table automatically increments j.

We have a mismatch, it figures out the right state to go to.

So, that's a very simple and

economical and fast implementation of subscreen search.

10:27

And the importance, again, of no backup is that,

since the textpointer never decrements, you could use an input stream.

And just replace text divided by,

just read the next character in the input stream.

And this algorithm is going to work fine, no matter how big the input stream is.

It'll just go right through, it's got no memory of what the text is,

but it's got some memory of what the pattern is that's built into the DFA.

11:24

Okay, so let's take a look at what it means to construct this DFA.

So it depends on the pattern, you start out with one character,

one state for each character in the pattern, plus an extra accept state.

Let's look at what it means to build the thing.

So the first thing that we do, one state for each character in the pattern,

the first thing that we do is deal with the match transitions.

So that's when, you're in state j,

that means you already matched j characters in the pattern.

Then if the next character matches,

so the character that you have is

the character that's supposed to match.

So that's path of j that's the j plus first character,

then you've know that you've matched j + 1 characters.

So all that means is, we can put in the match transitions.

So we have A B A B A C and these guys go A B A B A C.

If you're in state 0 and you see an A, you go to state 1.

You see 1 you see B, that's state 2, and so forth.

That's how we get the match transitions to get us all the way through

the pattern to the accept state, so that's the easy part.

And then the hard part is the mismatch transitions,

what are you supposed to do if you come against a text character

that does not match the current text character?

13:14

So for example, if you're in state 0, the pattern starts with an A.

If you don't see an A as the first character in the text,

if you see a B or a C, then obviously you want to stay in state 0.

So you can think of state 0 as scanning through the text looking for an A.

It's going to stay in state 0 as long as it sees a B or C.

As soon as it sees an A, it'll go to state 1, but that completes state 0.

You know what to do if you have an A, you know what to do if you have a B or C.

What about state 1?

So we know that if you're in state 1, you saw an A in the text.

And if you see a B in the text, what are you supposed to do?

14:00

Well, there's two different cases,

if you see a B, you go into state 2.

If you see a C, then that's not going to match the first

character A in the pattern, so you go back to state 0.

But if you see an A, it's just as if you saw an A at state 0,

so you might as well stay in state 1.

So, that's how we fill in the BFA for state 1.

if you see a B, you go on to state 2, you matched.

If you see an A, well, that matches the first character, so stay in state 1.

And if you see a C, clearly you have to go back to state 0.

Okay, what about the next one?

So if we're in state 2 and we see an A, we know that we go on to state 3.

And what about the mismatch case?

Well, in that case, it's a B or a C and again, if you're sitting on a B or

a C, you'd better go back to state 0 to keep looking for an A.

State 3, well, now it gets to be a little more complicated.

State 3, see a B, you succeeded, if you see a C, you go back to state 0.

It's not so bad, if you see an A, it's just like before, that's going to

be like the first one, so it seems like we're going along pretty fine.

And again if you're in state 4 and see an A, you go to 5,

see a B or C then you better go back and look for an A again.

This one is the one that's a little more complicated.

15:39

If you're in state 5 and you see a B, you go back to state 4.

And that's a little more work to figure out why that's the case.

And that last case is kind of the essence of the algorithm.

So we'll look at a systematic way

To be able to figure out what you do on a mismatch in each case.

In this case, you only needed that for that one stat there,

otherwise it was elementary reasoning.

So that's the full DFA for Knuth-Morris-Pratt demo of,

at least thinking about how it's going to be constructed.

Okay, let's look a little more carefully and

systematically at the construction process for the Knuth-Morris-Pratt DFA.

16:21

So the start is clear, we're going to go through the pattern,

and for systematically filling the match transitions.

If we're in state 0, and we see an a we want to go to state 1.

If we're on stat 2 and we see a B, we're going to want to go state 2.

So we look up the pattern character and

then whatever that one is, we want to go to the next state.

So we can fill in at least that much automatically.

17:20

So as pointed out at the beginning, as motivation for

Knuth-Morris-Pratt, you know a lot about the text at that point.

You know that the j 3 minus 1 characters of the text are,

you lop off the first character of the pattern, it's from 1 to j minus 1.

And then it's followed by the text character that you're looking at.

So what you want to do to know that, and what you could do,

is simulate the DFA that you have built on that part

of the pattern and then take the transition for

the character that you just find.

So let's look at this.

So let's run this machine, if we had seen it on the text,

the BA BA, that's what we want to do, we want to put the machine

in the state as if we had backed up, but we don't want to backup.

So if we see a B, we stay in 0, if we see an A,

we go to 1, then if we see a B we go to 2,

and if we see an A, we go to 3.

So [COUGH] now we're in state 3, and

[COUGH] if we had a mismatch, then for

the fifth character, what we'd do on a mismatch here,

is we have to look at what happens if we get an A or if we get a B.

If we run it on BA BA and we got an A, then we should go back to one.

So what that says is if we had a mismatch, and

we saw an A in 5 we would need to be in state 1.

Because if we had had run the thing on the characters that we know,

BA BA A, we would wind up in state 1.

And similarly, [COUGH] if we were [COUGH] if we got the mismatch on a B,

again, if we did BA BA would be in state 3, and

we're in state 3 and we saw B, we'd go to 4.

So and again to summarize, if we we're in state 5 and we see a C,

we know that's a match, we go to 6.

If we see an A, we know that the previous five characters in the text were BA BA A,

so we can just simulate the machine BA BA A, we're in state 1.

And if we got a mismatch that's a B, then that's dfa['B'][5],

then we know that the previous five characters in the text were BA BA B,

and that would put us in state 4.

20:29

And, as we noted in the simulation, this is the only non-trivial one for

this example.

Now there's a little problem with that,

is that it seems to require j steps to do the simulation.

In order to figure out these mismatched transitions,

I had to go all the way through the pattern shifted over one,

to figure out this state 3.

So that seems to be a bit of a problem.

But actually it's no problem at all, because we can

run this simulation one character at a time, as we're building the machine.

21:08

All we need to do is keep track of this state that we would be at,

if we had run the DFA on the pattern starting at position one.

Once we get going, it's pretty easy.

But let's illustrate it by saying okay,

we maintained the state x which is where we would be if we

had run the machine and the pattern shifted over one.

Now, when we come to do our mismatches to figure out where the mismatch

transitions from five are, all we do is look at, if we were to get an A,

would be as if we're in state x and we got an A, so that's 1.

And if we were to get a B, it would be if as we were in state x and got a B and

that's state 4.

So what we need to do is to compute the mismatched transitions,

is keep track of state x.

And that is where would the thing be, if we had run it

starting at the pattern one position shifted over.

And we want to update that, so when we're moving to the next state, if

we had a match on C, state x gets updated to where it would have gone if it got a C.

Because for the next character, when we move j over for

the match on C, we'll have x updated.

So that's the key, is keeping track of the state where the machine

would be if we had backed up, or if we had run it on the pattern shifted over one.

So let's take a look at a demo that does the full construction for the k and p DFA.

24:04

And we're in state x.

If we found an A, we'd go to state 1, if we found a C, we'd go to state 0.

And maybe you noticed, that's just taking the entries corresponding

to the entries we need from xs column and putting it in js column.

That's all it is.

And then we need to update x which is where it would be if we matched the B.

So we'll stay in, x will stay in state zero.

25:16

So now the mismatch transitions are A and C,

and that's what we have to fill in for column 3.

But those mismatch transitions were already computed,

that's where x would go if we happened to be state 1.

So we move the 1 and 0 from column 1 over to column 3.

And, again, we update x and

when we see a B, x goes to state 2.

26:11

All right, so now it's straightforward.

Straightforward we have to fill in B and C.

We go to xs column and copy over B and C from xs column.

Then we update x, to c and a, now we're ready for state five.

We've got x all computed.

And we need to do A and B and we get those from x.

If it's an A, it's a 1, and if it's a B, it's a 4.

Let's just moved them over.

And then when we do the C and get the accept, there's no mismatch.

That's where you do update x to get ready.

But when we get the state 6 we're done.

And again it's just as a double check

axis where your B, if you see a BABAC.

That's the construction of the Knuth-Morris-Pratt DFA.

27:28

So we've got x and for every entry of the DFA in xs column,

that corresponds to everyone except for the match.

We just copy xs column to js column backed to we just copy them all.

And then overwrite the one corresponding to the match case, not to j + 1.

That's all, and entering the column it corresponds to the pattern character

that's where do we go from match that's j+ 1, all the rest of them are copied from x.

One for loop to copy from x, then set the match case.

And then the only other thing to do is to update x.

And how do you update x?

You take the machine to where it would go, if it weren't state x,

and cut the charAt, X charAt, the pat.charAt.

28:22

So that's the DFA construction, amazingly little code.

So the only sort of flaw in this code is that it

does take time proportional to r times m,

where R is the radix and M is the length of the pattern.

And as we'll see, there's ways to get around this.

But for relatively small alphabets, this is no problem,

because the search is so efficient this way.

29:36

Quite remarkable.

Each pattern character you access once when you're making the DFA.

And each text character is accessed once when you're simulating a DFA,

and that's in the worst case.

Now the space, again, is proportional to RM because we have all those mismatches.

It is possible to develop a version of Knuth–Morris–Pratt

that constructs the automaton in time and space proportional to M.

It's actually a non determinant of automaton because it's either a match

were all mismatches, and a mismatch might involve multiple hops.

So if you're interested, you can read about that version of KMP.

But it's sufficiently more complicated that you should be prepared to

study it carefully to really understand it.

This algorithm is really interesting because of its history.

It was actually independently discovered by two theoreticians and a hacker.

30:48

So there's a Knuth whose one of the fathers of computer science, who

read a paper that was a very asset by Steve Cook.

A very esoteric theoretical result that he realized implied,

that it should be possible to solve the substring search problem in linear times.

The theorem really didn't give any way to solve it.

But it indicated that is should be possible to solve it in linear time.

So Knuth worked on it and figured out a way.

And then Pratt, who was a student of Knuth at Stanford at the time,

figured out a way to take care of this mismatch independent of the alphabet size.

And in the meantime, across the bay at Berkeley,

Jim Morris was busy writing a text editor.

In those days, people were using typewriters and

other people were realizing that computers would be really good at editing text.

And so many people would work on text editing and formatting systems.

It was kind of a badge of honor in those times.

Morris actually worked for the computer center at Berkeley, and

I wanted to have a really bulletproof text editor that everyone could use.

And one of the things he wanted to do was avoid backup,

because backup was just really inconvenient.

Involved a lot of code, and just something that he didn't want to have.

And he basically came up with the same algorithm.

32:27

And the community was kind of small at that time, and theory needs practice,

and that's where this paper came from, 1977.

Morris then went on to get his PhD, and to work at Xerox Park, and

unfortunately later on, another systems programmer came and

took a look at Morris's code and couldn't understand it and

put the brute force out algorithm back in, but that part of the sad story

is maybe a less successful example of theory meeting practice.

That's the Knuth-Morris-Pratt algorithm,

one of the most ingenious algorithms that we'll see.