So now that we know what a sequencer does and what kind of data it produces, the next question is: how do we analyze this sequencing data? We want to answer questions like, how is my genome different from your genome? Or, what does my genome tell me about my predisposition to disease? Unfortunately, we can't answer these questions just by looking at the reads. The reads themselves are far too short. A single read isn't even long enough to cover a single gene in its entirety. Human genes are on the order of thousands of bases long. But like we said, one of these second-generation sequencing reads is only on the order of hundreds of bases long. So, to use an analogy, if I tore a few words out of the middle of a newspaper article and showed them to you, you probably wouldn't be able to tell me very much about that article, what it says, or what newspaper it's from. We just can't infer that much meaningful information from snippets that are this short.
So to answer our scientific questions, the first problem we have to tackle is to take these reads, these short snippets of DNA, and glue them back together somehow. We have to stitch them back together so that we can infer the sequence of the input DNA. This is analogous to putting together a puzzle. So, say you want to put together a puzzle. How do you do this? Well, you can start by spreading out all the pieces on the table. And besides the puzzle pieces, you actually have something else that's very useful: the picture of the completed puzzle that's printed on top of the puzzle box. This picture gives you a useful shortcut. It's sort of a guide. So, for example, if you look at a particular puzzle piece, let's say this puzzle piece up here, which has a lot of bright red on it, you might say, oh, well, there's only one flower in the picture that looks so bright red.
I guess that puzzle piece probably comes from somewhere around here. All right, so it turns out we can use pretty much the same kind of strategy when we're putting together reads from a DNA sequencer. An important fact about human genomes is that if you take two unrelated human beings and compare their genome sequences, they're actually very, very similar. They're about 99.8 to 99.9% similar, depending on exactly how you count. So that's only about one or two differences every 1,000 bases or so. So even though my genome is not exactly the same as yours, we can still use my genome, or someone else's genome, as a kind of template or guide, in the same way that you can use the picture of the completed puzzle to help you put together the puzzle.
So, let's say that these purple sequences on the left are some sequencing reads, and they're from your genome. And this long black string on the right is my genome, or someone else's genome, or let's say that it's the human reference genome. In other words, that's the genome that was put together by the Human Genome Project, back when the Human Genome Project finished around 2001. So we can take one of the sequencing reads from your genome and essentially hold it up to the reference sequence, and look for the spot where that read matches most closely. The spot where it matches most closely is our best guess as to where it originated, as to where it belongs with respect to that genome.
A key point is that we have reference genomes like this for many different species, not just for the human, but also for the fruit fly, the mouse, the honey bee, the cow, chicken, rat, corn, and the list goes on. We have many hundreds, even thousands, of these reference genomes that are available for download in public databases.
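To make the idea of holding a read up to the reference a bit more concrete, here is a minimal sketch in Python. It is not the method developed later in the course, just a naive illustration: it slides the read along the reference, counts mismatches at every offset, and reports the offset with the fewest. The function names `hamming_distance` and `best_match`, and the toy sequences, are made up for this example, and the sketch assumes the read differs from the reference only by substitutions, not by insertions or deletions.

```python
def hamming_distance(a, b):
    """Count the positions at which two equal-length strings differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def best_match(read, reference):
    """Return (offset, mismatches) for the closest placement of read.

    Every offset in the reference is tried; ties go to the leftmost offset.
    Returns (None, None) if the read is longer than the reference.
    """
    best_offset, best_mismatches = None, None
    for offset in range(len(reference) - len(read) + 1):
        window = reference[offset:offset + len(read)]
        mismatches = hamming_distance(read, window)
        if best_mismatches is None or mismatches < best_mismatches:
            best_offset, best_mismatches = offset, mismatches
    return best_offset, best_mismatches


# Hypothetical toy data: the read is reference[9:17] with one base changed.
reference = "ACGTTGCATGCCGATAGCTAGGCTA"
read = "GCCGTTAG"
print(best_match(read, reference))  # (9, 1)
```

Trying every offset like this takes time proportional to the read length times the reference length, which is hopeless for billions of bases and millions of reads. That cost is a good reason to want better algorithms and data structures for this job.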
So for the first half or so of this course, we'll be discussing some very useful algorithms and data structures for solving this problem. We can call this the read alignment problem: given a sequencing read and a reference genome, how do we find where the read best matches, where it aligns most closely to the reference genome?
The topic that we'll discuss in the second half is the case where we don't have a reference genome. So maybe we're studying some exotic plant whose genome no one has ever sequenced before, or we're otherwise working without the benefit of a reference genome. In this case, since we don't have a reference, there's no guide to help us put the puzzle together, and we have no choice but to do something analogous to what you would do if you wanted to put together a puzzle without the picture of the completed puzzle. You would just have to hold the puzzle pieces up to each other and look for pieces that look like they have patterns that overlap. So, for example, in this jumble of puzzle pieces here, we might see this puzzle piece which has bright red on it, and this other puzzle piece that has bright red on it, and we might say, okay, maybe they go close to each other, and then try every possible way of putting them together. So there's still a way that we can put together the puzzle, even if we don't have the picture of the completed puzzle to help us. And that is the kind of problem that we'll examine in the second half of the course.
But speaking more broadly now, when there is no guide, when we have to solve this assembly problem rather than the read alignment problem, this is going to lead us to a very different set of algorithms and data structures. It's actually a very difficult problem, an inherently difficult one. But it's also a very interesting one. So that's what we'll discuss in the second half.
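As a rough picture of what holding the pieces up to each other could look like for reads, here is a small Python sketch. It checks every ordered pair of reads for a suffix of one that exactly matches a prefix of the other. This is only an illustration of the overlap idea under simplifying assumptions (exact matches only, no sequencing errors); the function names `overlap` and `overlapping_pairs`, the `min_length` parameter, and the toy reads are all hypothetical.

```python
from itertools import permutations


def overlap(a, b, min_length=3):
    """Length of the longest suffix of a that equals a prefix of b.

    Only overlaps of at least min_length bases count; returns 0 otherwise.
    """
    for length in range(min(len(a), len(b)), min_length - 1, -1):
        if a[-length:] == b[:length]:
            return length
    return 0


def overlapping_pairs(reads, min_length=3):
    """Map each ordered pair (a, b) with a suffix/prefix overlap to its length."""
    pairs = {}
    for a, b in permutations(reads, 2):
        n = overlap(a, b, min_length)
        if n > 0:
            pairs[(a, b)] = n
    return pairs


# Hypothetical toy reads, not real sequencing data.
reads = ["ACGGATC", "GATCAAGT", "TTCACGGA"]
for (a, b), n in overlapping_pairs(reads).items():
    print(a, "->", b, "overlap", n)
```

With these toy reads, "TTCACGGA" overlaps "ACGGATC" by 5 bases and "ACGGATC" overlaps "GATCAAGT" by 4, which suggests the order the three pieces go in. Comparing all pairs this way grows quadratically with the number of reads, which hints at why the problem without a reference is so much harder.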