Indexing is a very powerful idea. As we discussed, indexing is the reason why, when you type a query into a search engine, a worldwide web search engine, you can get an answer back in less than a second. Indexing is also at the core for a lot of different methods for analyzing sequencing data and methods that solve the read alignment problem. So the question of how to build faster and smaller and more flexible indexes, indexes for genomes, is an active area of research right now. And some of the advances in the last several years that have come in read alignment, have related to how we can manage to fit the genome into an index, into a relatively compact index so that it can fit in memory, but we can still query it very quickly. So, let's start by looking at the familiar K-mer Index here, which we discussed in a previous lecture. The idea behind the K-mer index was very simple. We extracted every substring of length K, every K-mer, from a reference genome, and instead of saying, text T, I'll just say, genome here, since the genome is what we're indexing. And we also discussed variations on that idea. So one variation here was to extract not every K-mer, but only some of the K-mer from the genome. So, for example, we could extract just the K-mer's that are at even offsets in the genome. And to compensate for the fact that our index didn't contain the odd K-mers anymore, we would actually query the index twice. Once, with a K-mer from the pattern from an even offset, with respect to the pattern and once with a K-mer from a non-offset, with respect to the pattern. These ideas are used in many, many practical tools for solving the read-alignment problem. And, more generally, for bio sequence analysis. We also discussed a variation on this idea, which we called a subsequence index. Where we extract subsequences from the genome instead of substrings. So subsequences are not necessarily contiguous. So what we're extracting is this non-contiguous pattern out from the middle of the text T. This idea is also pretty widely used in practice. Here's an example of a tool that uses this idea. And as you'll explore in a programming assignment in this module, it has an important advantage over the substring indexes, which is basically that index filter tends to be more specific. In other words, a larger fraction of the index hits actually lead to matches of the pattern P within the text T. So, here's another idea, a new idea that seems very similar to the sub-string index. But, it actually leads to some very different and interesting new index data structures. So what I have here in black is a genome. And let's say that instead of extracting every sub-string of a certain length from this genome and putting that into an index, we take every suffix. So what do I mean by that? Well, here in black is a genome, and then in red is a list of all the suffixes of the genome. So we'd like to extract and organize all the suffixes into some data structure that allows us to perform rapid queries in the same way that the sub-string index did. So one simple idea is to just take all the sub-strings, I'm sorry, all the suffixes from the genome and then put them in alphabetical order, just like this. So again, here in black is our genome, here in red is all the suffixes. And we're just going to put those suffixes in alphabetical order. And now what we have is simply a list of all the suffixes of the genome, but they're in alphabetical order. And if we want to query the index, we can use our old friend binary search. So for example if our query is ab, if our pattern p is ab, then we can use binary search to find all of the suffixes that have ab as a prefix. Because the suffixes are in order, all of the suffixes that share some prefix, that have the same prefix are going to be consecutive in this list. So you may be wondering if maybe this data structure isn't perhaps too big. Right? Because it's organizing all the suffixes of a genome and let's say that your genome has length N. Then if we add up the lengths of all the different suffixes of the genome, we get this expression, which grows with the square of N. It grows quadratically with N. So by the time N gets to 3 billion, as is the case for the human reference genome, it's about 3 billion characters long. This expression, the number of characters in our index, is just going to be far too large. It's going to be really, really big. Way too big to be practical. But in fact, we can sidestep this issue with a simple trick. So specifically, we can represent a suffix of the genome by its offset into the genome. Instead of representing it explicitly as a string, we can represent it as just one integer that tells us the offset with respect to the beginning of the genome. So for example, what do I mean? So here are all the suffixes of a text and they're in alphabetical order and then listed to the left of those suffixes, Is a list of integers. And each integer corresponds to that suffix's offset with respect to the beginning of the text. So if the text is abaaba, then the very first suffix, alphabetically, is the one that consists just of the character a, and that's the suffix that's at offset 5. And so, if we do this for all of the suffixes, we can represent all those suffixes, in alphabetical order, with just a list of integers. And this list is called a suffix array. And before I say much more about the suffix array, I should point out that to make this a useful data structure, besides the array of integers itself, we also have to store T itself. We have to store the genome or the text. So we have to keep the genome around because that way we have the list which tells us the alphabetical order of the suffixes but if we want to know what the sequence of any particular suffix is, we also have to go look inside T. Just knowing the offset isn't enough, we have to know the entire string that that's an offset into. The most important fact is that besides the genome itself, the only thing we need to store is this list of integers. And even though the total number of characters in all the suffixes of the genome is a quantity that grows quadratically or with the square of the length of the genome. The number of integers in this list, that is the suffix array, grows only linearly. In fact it's just equal to the length of the genome. So this leads to a much more manageably sized and a more practical data structure. Now the suffix array is just one example from a family of indexes, called the suffix indexes. And, another kind of suffix index, which is shown on the left here, is a suffix tree. A suffix tree is like the suffix array, in that it organizes all of the suffixes of a text in some way. But, unlike the suffix array, it organizes them using the principle of grouping rather than the principle of ordering. The suffix array put all the suffixes in order. So, like we said before, this is like the idea behind the index of a book. It uses ordering. Where as the suffix tree uses grouping. More like the aisles of a grocery store. Another popular kind of suffix index is called the FM index, which is shown on the right here. The FM index is based on another idea called the Burrows-Wheeler transform, or the BWT. And the Burrows-Wheeler transform was originally developed for compression. But it so happens that it's also used to build a specific kind of suffix index, the FM index, and this index is very, very compact. So to give an idea about how large these three data structures are, let's consider the question of, if we used each of them to build an index of the entire human reference genome, which again is about three billion letters long, then how large would that data structure be, how much room would it take up on your computer? So we can see that the size varies quite a bit from structure to structure, the suffix tree is pretty big. And, by the way, my estimate here is probably a pretty optimistic estimate on how big it would be. So depending on how you implement the suffix tree, it could be substantially bigger and then what's written here. But it's still not terribly big, it could fit, for example, in the memory of a large server. The suffix array is much smaller than that. But the FM index is Is very compact, so the reason for this is that the FM index really just consists primarily of the Burrows-Wheeler transform genome which is the same exact size as the genome itself. It's really just the genome with the characters in a different order permitted. And since because it's so small it can easily fit in to memory on a typical computer. So the FM index is widely used and is the basis for some very popular read alignment tools. In the Bowtie and BWA families, for example, there are tools called Bowtie and Bowtie 2, BWA, BWA-SW and BWA-MEM that all use this idea of the FM index.