So in the previous section, we introduced character tables, and we saw that we can think about a multiple alignment as a character table of its own, for which each of the m columns of this multiple alignment can be viewed as a character. Our goal is to use this multiple alignment to reconstruct the most likely ancestral sequences. To do so, we're going to need some way of determining how good an assignment of ancestral sequences is for a given tree. For example, say that we're given the following tree structure for the four species that we've been considering, and that we then assign the following ancestral sequences to internal nodes of the tree. We're going to define the parsimony score, then, of this tree, as the sum of the Hamming distances along each edge of the tree. This tree has parsimony score equal to eight. And this discussion brings us to the small parsimony problem, in which we're given a rooted binary tree whose leaves are labeled by strings of length m, and our goal is to assign ancestral sequences to this tree in order to minimize the parsimony score. Now, this seems like a difficult problem. Is there going to be any way of simplifying it? If we treat the columns of the multiple alignment that we're given as independent of each other, then it means that we'll be able to solve the small parsimony problem on each column of the multiple alignment separately. In other words, we can work with one symbol of each string at a time. So we'll now design a dynamic programming algorithm that will solve the small parsimony problem, assuming that each string is just a single symbol. So first, define T_v as the subtree of the given tree T whose root is node v. Then define s_k(v) as the minimum parsimony score of the subtree T_v over all possible labelings of T_v, such that the node v is labeled by symbol k. Note that according to this definition, the minimum parsimony score of the entire tree T is equal to the minimum value of s_k at the root, taken over all symbols k. The idea of our dynamic programming algorithm, then, is going to start at the leaves of the tree, then compute the values s_k upward from the leaves to the root, and then use this fact to find the minimum parsimony score. In order to implement this algorithm, we're going to need a recurrence relation for the score of a node v in terms of its two children, which I'll call its "son" and its "daughter". So we can define sigma_i,j to be 0 if the symbols i and j are equal to each other. And if i and j are not equal to each other, we define sigma_i,j to be equal to 1. Because you're a dynamic programming expert by now, I'm going to leave proving this recurrence relation to you. So once you've demonstrated this recurrence relation, let's see how it works for an example. So at each node v, an array is going to hold the set of scores s_k(v) for each possible symbol k. Here we're going to be working with DNA, so there will be four symbols k. Because the leaves are preassigned in our tree, we're going to set their scores equal to 0 for the symbols of those leaves, and infinity for all other symbols. For example, here the leaf is assigned C, so we set the score of C equal to 0 and the score of T equal to infinity. Why would we do this? Well, it prevents us from then going back and changing the leaf node to t later. It is something that we want to prevent, right? So when we move up a level, we're going to apply the recurrence relation. In the left most node, the two children are fixed as equal to C. So the score of this node is going to be 0 if we assign it a C, since there will be no conflicts between these two edges. If we assign this node any other symbol, then there will be two conflicts between the edges, and its score is going to be 2. One node over from this, we will obtain a score of 1 if we assign this node an A or a C, and a score of 2 otherwise because we'll have two conflicts. The remaining two nodes proceed similarly. So, note that I'm using red to indicate the minimum score at each step and there may be a tie as we've seen in two nodes here. When we move up one more level in the tree, let's walk through the computation of, say, the array of the left node. The score of A corresponds to the minimum parsimony score over all possible assignments to this subtree, assuming that A is assigned to that subtree. In this case, the best that we can do is not to take an A here, but to take a C here, since the parsimony score of that subtree will be equal to 0. And then that will provide us with a conflict of 1 on that edge. In this case, we do want to take A because the parsimony score of this subtree will be 1, and then A will agree here. In a sense, we're breaking the tie between A and C if we assign that equal to A. So we would have gotten a score of 1 from this subtree, and a score of 1 from this subtree, so we assign A a score of 2. C should be pretty straightforward because C is a minimum at each of these two nodes. So if we assign a C here and a C here, we get parsimoniy scores of 0 and 1, and we have no conflicts here. So this is really what the recurrence relation is saying. So, two more examples. For G, if we assume that we assign a G there, then we don't want to assign Gs here because that would give us parsimony score of 4 here and 4 here, and although we wouldn't have any conflicts along these edges, we would have a score of 4. We can do better than that if we simply assign Cs to both of these nodes, in which case we get two conflicting edges, so a contribution of 2, but then a contribution of 0 and 1, which gives us 3. The case of T follows analogously because it's essentially the same as G. So hopefully, you can see from that how we can get the right node on this level, as well as when we move to the root. So when we move to the root, the minimum parsimony score is going to be corresponding to symbol C, which means that we should assign C to the root because we said that the minimum parsimony score of the tree was the minimum at the root over all symbols. So, to reconstruct the other ancestral states, you're going to need to flex your dynamic programming muscles once again and implement a backtracking approach that goes from the root down to the leaves and assign symbols. I want you to note, when you do this, that the red elements correspond to minima, but they don't necessarily correspond to an optimal assignment of ancestral symbols. So, for example, we're going to get into trouble if we assign a T to this right-most node in the tree. So, I want you to make sure that your backtracking approach is able to reconstruct the tree shown on this slide before you try to implement a solution for the small parsimony problem. Once you have done so, you might find it interesting to apply the SmallParsimony dynamic programming algorithm to the tree we previously constructed for coronaviruses. What are the ancestral coronavirus sequences that you obtain? Also, it turns out that you can find the most parsimonious labeling of the internal nodes in an unrooted tree by placing a root on an arbitrary edge, solving the small parsimony problem for the resulting rooted tree, and then simply removing the root.