We talked about how sequencing by synthesis works. But before we look at some real sequencing reads, we need to understand a little bit more about these sequencers: how they can make mistakes, and how they convey uncertainty. So, here's our cartoon again showing the template strands attached to the slide. This cartoon helped us to understand second-generation sequencing, but it leaves out an important detail, which is that before we add any bases or any polymerases, we first take these template strands and amplify them. That is, we make many copies of them, where all the copies are clustered around the original strand. So as you can see here, a cluster of clones now surrounds each of the template strands. We need these clusters rather than the individual templates because otherwise, when we go to take our photograph of the glowing terminator bases, there wouldn't be enough light coming from a single template. So we need the light coming from all of the clones that are close to each other on this slide in order to get enough light for our photograph.

However, there's a problem that can occur. So let's concentrate on just a single cluster. And let's say that we're in the middle of the sequencing process, and in one of the sequencing cycles, just by accident, we end up adding a base that isn't terminated. So, in this diagram, one of the bases that we added was spuriously not terminated. For example, it could be this blue one right here. Because it's not terminated, it will not block the polymerase, and because it doesn't block the polymerase, the polymerase will keep going. And we're going to get another base added on top of that blue base: this red base that you see here. Now that red base is terminated, so then the polymerase stops going, and everything's okay after that.
But there's already some damage that has been done, because one of the templates is now out of sync with the others in the same cluster. So when we go and snap our photograph and we see light coming from all of the various templates that are in this same cluster, we see not just one color of light, but two different colors of light. We see the blue light coming from all the templates that are where they should be, and then we see a little speck of red light coming from the template that's ahead of schedule, that's had an extra base incorporated into it. So, let's say we keep going. The sequencing reaction keeps going, and we go through a few more sequencing cycles. And we end up in a situation that looks like this, where we have some templates that are on schedule, which are showing up as orange in our photograph here, and then we also have some that are ahead of schedule. So we had a few more spuriously unterminated bases that were incorporated, and now three of the template strands are out of sync with the rest. So as you can see, as we move from one sequencing cycle to the next, the number of strands that have fallen out of sync will tend to grow.

So, there's a piece of software that analyzes these images and attempts to figure out what all the bases are, and this software is called a base caller. And the base caller has to deal with ambiguity, like we can see here. Sometimes the base caller will be very confident, like if all the light coming from the cluster is exactly the same color, and sometimes it won't be so confident, like if the light coming from a cluster is an even mix of two or more colors. If the base caller is not so confident, we would like to know that, so that when we go to analyze these reads, we know which are the bases that we shouldn't be quite so sure about, the ones that could have been some other base instead.
So for each base call, the base caller reports an important value, which is called the base quality. The base quality is a rescaled version of the base caller's estimate of the probability that the base was called incorrectly. The equation is Q = -10 log10(p). In this equation, Q is the base quality, and p is the probability that the base call is incorrect. Now of course we don't know what p is for sure, but it's something that the sequencing software can estimate. So why do we use this particular expression? Why don't we just report p? Well, the scale that Q is on actually makes for some easier interpretation. For example, if the base quality is 10, that corresponds to a 1 in 10 chance that the base call is incorrect. If the quality is 20, that corresponds to a 1 in 100 chance, and if the quality is 30, that corresponds to a 1 in 1000 chance. So each time we add ten to Q, the chance of an error drops by a factor of ten, and all these base qualities are nice, small numbers.

So how does the base caller get a value for Q, given a picture that looks like this? In this case we can see that the majority of the light coming from the cluster is orange, so I'm going to say I think the nucleotide at this position is probably a C; orange means C. But there's some uncertainty, so we want to report a base quality that conveys this uncertainty, so that when we go to analyze the read we can keep in mind that the base isn't totally clearly a C, it's a C with a little uncertainty. So let's estimate the probability that we're wrong. A simple way that we can do this is to take all of the light that is not orange, quantify it somehow, add it up somehow, and then divide by the total amount of light that we see coming from the cluster. In other words, what fraction of the light that we see contradicts our hypothesis that this is a C? In this case, we count three green bits of light, and we count nine points of light total.
So that comes out to a fraction of 1/3. In reality, base callers do something much more complicated than this, but this, you can see, is a not unreasonable way to estimate that probability p. And so if we let p be 1/3, like we estimated here, and then we plug it into our equation, we get a value of Q that's about 4.77.
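
To make that arithmetic concrete, here is a minimal Python sketch. It assumes the equation on the slide is the standard Phred scale, Q = -10 log10(p), which agrees with the 10, 20 and 30 examples and with the 4.77 result. The function names are made up for illustration, and the light-counting estimate is only the simplified one from this example, not what a real base caller does.

```python
import math


def error_prob_to_quality(p):
    """Convert an error probability p into a base quality Q.

    Assumes the Phred scale: Q = -10 * log10(p).
    """
    if not 0 < p <= 1:
        raise ValueError("p must be in the interval (0, 1]")
    return -10.0 * math.log10(p)


def quality_to_error_prob(q):
    """Invert the scale: p = 10 ** (-Q / 10)."""
    return 10.0 ** (-q / 10.0)


def estimate_error_prob(contradicting_light, total_light):
    """Toy estimate from the example: the fraction of the cluster's
    light that contradicts the called base."""
    if total_light <= 0:
        raise ValueError("total_light must be positive")
    return contradicting_light / total_light


# Three green points of light out of nine in total.
p = estimate_error_prob(3, 9)
q = error_prob_to_quality(p)
print(f"p = {p:.3f}, Q = {q:.2f}")
```

Going the other way, quality_to_error_prob(30) gives back the 1 in 1000 chance mentioned earlier.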