We just learned that Boyer-Moore preprocesses the pattern P in order to build lookup tables that help it use the bad character and good suffix rules. Let's focus on this idea of preprocessing in some more detail, because preprocessing is an idea that's going to send us down a somewhat different path in the coming lectures. So, the naive exact matching algorithm did not use preprocessing. Right? If you give me a pattern P and a text T, I can run the exact matching algorithm, but if you give me a pattern P today and no text, saying you'll give me the text tomorrow, there's nothing for me to do until you eventually do give me the text. Boyer-Moore, on the other hand, is an algorithm that preprocesses the pattern P, so if I'm Boyer-Moore and you give me the pattern P but not the text T, there's something I can do. I can build the lookup tables for the bad character and the good suffix rules, and then later on, when you give me the text T, I can use those lookup tables to execute Boyer-Moore. In fact, if we're in a situation where we're going to be matching the same pattern against many different texts, so the P is going to stay the same but the T is going to change, then I can reuse my preprocessing for the pattern. I can reuse the lookup tables for the bad character and the good suffix rules over and over again for each new text that you give me. I don't have to re-preprocess the pattern each time. So the cost of preprocessing is something that can be amortized over many problems. It might be costly to do once, but because we use it again and again, we essentially make up for that cost over time. So, what about the idea of preprocessing the text T? At first glance, this might not seem like such a good idea, because the pattern is usually short whereas the text can be very, very long, maybe billions of letters long.
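To make this concrete, here is a small sketch, in Python, of the "preprocess P once, reuse it for many texts" idea. This is not the full Boyer-Moore algorithm (it uses only a simplified bad character table, with no good suffix rule), and the function names are just illustrative:

```python
# Sketch only: a simplified bad character rule, not full Boyer-Moore.

def bad_char_table(p):
    """Preprocess the pattern: map each character to its rightmost position in p."""
    table = {}
    for i, c in enumerate(p):
        table[c] = i
    return table

def search(p, table, t):
    """Return offsets where p occurs in t, using the table to skip alignments."""
    occurrences = []
    i = 0
    while i <= len(t) - len(p):
        shift = 1
        mismatched = False
        for j in range(len(p) - 1, -1, -1):   # compare right to left
            if p[j] != t[i + j]:
                # bad character rule: slide p so the mismatched text
                # character lines up with its rightmost occurrence in p
                shift = max(1, j - table.get(t[i + j], -1))
                mismatched = True
                break
        if not mismatched:
            occurrences.append(i)
        i += shift
    return occurrences

p = "word"
table = bad_char_table(p)          # preprocessing: done once
for t in ["there would have been a time for such a word",
          "word up, word to your mother"]:
    print(search(p, table, t))     # the same table is reused for each new text
```

The point of the sketch is the shape of the computation: `bad_char_table` runs once per pattern, while `search` runs once per text, so the preprocessing cost is amortized over however many texts we search.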
So the task of preprocessing the text sounds like an awful lot of work, but we also just discussed how the cost of preprocessing can be amortized over many problems. So if we're in a situation where we have to solve many matching problems where the text T is the same but the pattern P can vary from problem to problem, then it might be worthwhile to preprocess the text T. In other words, we might be able to amortize away much or all of the cost of preprocessing. An algorithm that uses a preprocessed version of the text T goes by a special name: it's called an offline algorithm, whereas an algorithm that does not preprocess the text is called online. Notice that this definition is the same regardless of whether we preprocess P or not. Right? An online algorithm may or may not preprocess P, and an offline algorithm may or may not preprocess P. The distinction is whether or not we preprocess the text T. So, let me outline a few scenarios, and for each one I'd like you to think about whether it's more of an online situation or an offline situation. I'd suggest that you pause the video after I ask each of these questions so you can think about what your answer would be. First of all, the naive exact matching algorithm. Was that an online or an offline algorithm? It was online. The naive algorithm did no preprocessing at all, so it definitely didn't preprocess the text. So it's an online algorithm. Okay, second question. Is the Boyer-Moore algorithm online or offline? It's online. Boyer-Moore does require some preprocessing, but it preprocesses only the pattern P. So it's online. Okay, third. This one's a little trickier. How about a search engine, like the kind you would use to search the World Wide Web? When you type a query into a search engine and press search, do you think that the algorithm it uses to return those results to you is online or offline?
It pretty much has to be offline. In this scenario the text T is enormous: it's the entire World Wide Web. If the search engine didn't preprocess the World Wide Web in some way, then every time you entered a new search term and hit search, it would have to go scan basically the entire World Wide Web, and there's no way you would get a quick answer if that's what it was doing. But when we search the web, we know that we get a quick answer. We have an answer in less than a second. So it must be that the search engine has preprocessed the World Wide Web into some form that allows it to jump very quickly to the search results that you asked for. So it is offline. Next question. How about our problem, the read alignment problem? Is it online or offline? If we have many sequencing reads, many snippets of DNA, and we'd like to figure out where they all originate with respect to the same reference genome, this seems like a place where an offline algorithm would be more appropriate. It would be even clearer if I told you that I'm actually going to give you one human DNA sequencing data set this week, then another one next week, and another one the week after that. Every time we analyze a new data set, and within a data set every time we analyze a new read, the pattern P is changing. But the text T is not changing. The genome does not change. It's actually the case that genomes, like the human reference genome, do change a little bit over time: they get updated, errors are corrected, gaps are filled in, and things like that. But in general, they stay static. They change maybe once a year or so. So this is a situation where an offline algorithm seems more appropriate. So far, though, we've learned about the naive exact matching algorithm and about Boyer-Moore, and these are both online algorithms. Boyer-Moore is very fast, and it's a very widely used online algorithm.
But for read alignment, now would be a good time for us to think of some offline algorithms: algorithms that take the reference genome and preprocess it somehow, in a way that allows us to match many reads against it very quickly.
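As a preview of what such preprocessing might look like, here is one simple sketch in Python: a k-mer index of the text T, which maps every length-k substring of T to the offsets where it occurs. This is just an illustration of the offline idea, not the index structure real read aligners use, and the names `kmer_index` and `query` are made up for this example:

```python
# Sketch: preprocess the text T into a k-mer index, then answer many
# pattern queries against it quickly.

def kmer_index(t, k):
    """Preprocess T: map each length-k substring to the offsets where it occurs."""
    index = {}
    for i in range(len(t) - k + 1):
        index.setdefault(t[i:i+k], []).append(i)
    return index

def query(p, t, index, k):
    """Return offsets where p occurs in t, using the index to find candidates."""
    hits = []
    for i in index.get(p[:k], []):      # candidate alignments from the index
        if t[i:i+len(p)] == p:          # verify the rest of p character by character
            hits.append(i)
    return hits

t = "GCTACGATCTAGAATCTA"
index = kmer_index(t, 4)   # preprocessing T: done once, reused for every read
print(query("TCTA", t, index, 4))
```

Building the index takes time and memory proportional to the text, but that cost is paid once; afterwards, each new read only triggers a dictionary lookup plus a handful of verifications, rather than a scan of the whole genome.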