The assembly problem is this. We're given many sequencing reads, shown here in blue, that are derived from the genome we'd like to assemble. In this example, the genome is shown in red at the bottom. Now, this slide is a little misleading, because it's drawn as though we know where each read originated with respect to the genome sequence, which of course we don't. What we actually get is a jumble of reads, with no idea where they came from with respect to the genome. And of course, we also don't know the sequence of the genome. So our goal is to reconstruct the genome sequence, these red question marks down here, given what we have, which are the blue sequencing reads.

Before we tackle that problem, let me start with some helpful vocabulary. Let's define the term coverage. Coverage refers to the amount of redundant information that we have about the genome. In this example, let's focus on just this one highlighted position of the genome and stack up all of our reads on top of the genome, assuming for the moment that we know where the reads come from. Then this highlighted position is covered by five reads, so we would say that this position has a coverage of five. In another sense, what this means is that if we want to know which base appears at that position in the genome, the reads are giving us five distinct pieces of evidence, or five different votes, for what that base should be.

Now, if we look just two positions over to the right, at this highlighted position, we see that it also has a coverage of five. But if we look at the reads covering that position, we can see that they do not all agree on which base appears there. Three of them are voting for the base G, and two of them are voting for the base A.
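To make the idea of per-position coverage and votes concrete, here is a minimal Python sketch. It makes the same simplification the slide does, namely that we know each read's offset in the genome, which a real assembler would not. The function name `pileup` and the toy reads are made up for illustration; they are not the reads from the slide.

```python
from collections import Counter


def pileup(reads, position):
    """Return the bases that the reads place at a 0-based genome position.

    `reads` is a list of (offset, sequence) pairs, where offset is the
    0-based genome coordinate of the read's first base. Knowing the
    offsets is an assumption made only for illustration.
    """
    bases = []
    for offset, seq in reads:
        if offset <= position < offset + len(seq):
            bases.append(seq[position - offset])
    return bases


# Hypothetical toy reads; two of them disagree with the others at position 4.
reads = [
    (0, "ACGTGA"),
    (1, "CGTGAT"),
    (2, "GTAATC"),
    (3, "TAATCC"),
    (4, "GATCCA"),
]

votes = pileup(reads, 4)
print(len(votes))                    # coverage at position 4 -> 5
print(Counter(votes).most_common())  # -> [('G', 3), ('A', 2)]
```

The coverage at a position is just the number of votes, and the `Counter` shows how the votes split, here three for G and two for A.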
We'll talk a little bit later about why this sort of thing might happen. So that's what coverage means when we're talking about a particular position in the genome.

We can also define overall coverage. Overall coverage is the coverage averaged over all the positions of the genome. We can calculate it by taking the total length of all the reads and dividing by the length of the genome. By the way, this means that we have to have some idea what the length of the genome is in order to calculate average coverage. In the example on this slide, there are a total of 177 bases in all the reads, and the genome is about 35 bases long, so the average coverage is about five (177 / 35 is roughly 5). So we might say we have 5-fold coverage.

Now let's tackle the problem of how we can piece together the genome sequence. Let's say that here are the sequences of two different reads, and these two reads came from the same genome. One thing you'll notice about these two reads is that a suffix of one read is very similar to a prefix of the other. In fact, I'm now highlighting exactly how they're similar. This suffix-prefix match is long, and it's almost an exact match, except for one difference here in the middle. We'll discuss in a moment why we might see differences like this. But the fact that the suffix of one read is very similar to a prefix of another is a hint that these two reads might have originated from overlapping portions of the genome. In other words, they could have come from positions so close to each other on the genome that they actually overlap.

So here's a principle that we'll call the first law of assembly, because so much of what we discuss later on is going to depend on it: if a suffix of some read A is similar to a prefix of some other read B, then A and B are likely to overlap with respect to the genome that they came from.
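The first law suggests a simple check we can run on any pair of reads. The sketch below is one minimal way to do it, assuming that "similar" means at most a fixed number of mismatches and no insertions or deletions. The function name `approx_overlap`, its parameters, and the two example reads are hypothetical; this is not necessarily the method used later in the course.

```python
def approx_overlap(a, b, min_length=3, max_mismatches=1):
    """Length of the longest suffix of `a` that matches a prefix of `b`
    with at most `max_mismatches` mismatches, or 0 if no such overlap
    of at least `min_length` bases exists.
    """
    # Try the longest possible overlap first and shrink from there.
    for length in range(min(len(a), len(b)), min_length - 1, -1):
        suffix = a[-length:]
        prefix = b[:length]
        mismatches = sum(1 for x, y in zip(suffix, prefix) if x != y)
        if mismatches <= max_mismatches:
            return length
    return 0


# Hypothetical reads: the last 9 bases of read_a match the first 9 bases
# of read_b except for one position in the middle.
read_a = "TTACGTACCGATT"
read_b = "GTACGGATTCCA"

print(approx_overlap(read_a, read_b))                    # -> 9
print(approx_overlap(read_a, read_b, max_mismatches=0))  # -> 0
```

Notice that with exact matching only (`max_mismatches=0`) the overlap is missed entirely, which is why we allow for a few differences.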
Looking at the picture at the bottom of the slide, this red string is the genome, and the two blue strings represent two different reads, both of which overlap the same region. Because of that, the suffix of one of these reads, the one on the top, is going to match a prefix of the other, the one on the bottom. So you can see how this principle, the first law of assembly, allows us to start to glue things together. When we find pairs of reads where one read has a suffix that's very similar to a prefix of another read, that's a hint that we can glue them together to get a larger piece of the genome. This is a bit like putting two puzzle pieces together.

In this example, the two reads overlapped, but they didn't match perfectly: there was one mismatch in the middle. So why might there be differences like this? Why do we have to allow for differences between reads when we look for overlaps? Well, there are a couple of reasons. This question is very close to one that we asked before when we were studying read alignment, and we wondered why exact matching was not sufficient for the read alignment problem. We came up with two reasons then, and one of them applies here as well.

The first answer is sequencing errors. Once in a while, the base that's reported in a read is just wrong; the sequencing software miscalled that base. That can lead to differences between the reads when we go looking for these suffix-prefix matches.

The second reason is ploidy, meaning that the genome can be present in more than one copy. Humans, for example, have two copies of every chromosome. One copy is inherited from the mother, and one copy is inherited from the father.
And these two copies are not exactly the same. So this can also lead to a situation where there's an overlap between two reads, a suffix of one read that matches a prefix of the other, but with a difference. In this case the difference is real, not a sequencing error; it's there because the two copies of the genome have different bases at that position.

Another important point, which I'm going to call the second law of assembly, is that the more coverage we have, the more overlaps we have between reads, and the longer those overlaps are. This is important because, again, the overlaps are the glue that we're going to use to assemble the genome.

So here's an example where we have a genome shown in red and two different datasets in blue, one shown above the genome and another shown below it. The dataset on the bottom has more coverage, deeper coverage, than the dataset on the top: there are more reads, and they're more tightly packed together. So it's not hard to see that there are going to be more overlaps and longer overlaps in the bottom dataset than in the top dataset. For example, let's focus on just one of these reads. Here we're looking at the overlap between two reads that are next to each other in the dataset on the top. If we take the same read in the dataset down here, this read right here, and look at how it overlaps with its neighbor to the right, the overlap is longer. So this is just one example of how the greater the sequencing depth, the more overlaps there are and the longer they are going to be between the reads.
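We can check the second law with a small simulation. This is only a sketch under simplifying assumptions that are not from the lecture: all reads have the same length, read start positions are uniformly random, and we cheat by using the true start positions to measure how much each read overlaps its right-hand neighbor. The names `overall_coverage` and `adjacent_overlaps` and the genome and read sizes are made up for illustration.

```python
import random


def overall_coverage(read_lengths, genome_length):
    """Total length of all the reads divided by the length of the genome."""
    return sum(read_lengths) / genome_length


def adjacent_overlaps(genome_length, read_length, num_reads, seed=0):
    """Place reads at uniformly random start positions and return, for each
    read, the number of bases it shares with its right-hand neighbor.
    A value of 0 means there is a gap between the two reads.
    """
    rng = random.Random(seed)
    starts = sorted(
        rng.randint(0, genome_length - read_length) for _ in range(num_reads)
    )
    return [
        max(0, read_length - (right - left))
        for left, right in zip(starts, starts[1:])
    ]


genome_length, read_length = 10_000, 100  # hypothetical sizes
for num_reads in (200, 1000):
    coverage = overall_coverage([read_length] * num_reads, genome_length)
    overlaps = adjacent_overlaps(genome_length, read_length, num_reads)
    overlapping = sum(1 for o in overlaps if o > 0)
    mean_overlap = sum(overlaps) / len(overlaps)
    print(f"{coverage:.0f}-fold coverage: {overlapping} overlapping "
          f"neighbor pairs, mean overlap {mean_overlap:.1f} bases")
```

With five times as many reads over the same genome, neighboring reads sit closer together, so we should see both more overlapping pairs and a longer average overlap in the deeper dataset.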