The first way that we will construct evolutionary trees is by using a distance matrix. And one of the most common ways of constructing such a matrix is to begin with a multiple alignment. Here is a toy multiple alignment that we'll use throughout this talk. We can form a distance matrix from this multiple alignment in which D sub i,j is simply the number of differing symbols between the i-th and j-th rows of the multiple alignment. i.e., there are three mismatches between the first and second rows. So we simply set D sub i,2 and D sub 2,1 of this distance matrix equal to 3. Of course, we have encountered several other different notions of distance already that we could also use to construct a distance matrix. But constructing a distance matrix from a multiple alignment is very common. In order to construct a multiple alignment of coronaviruses, researchers chose the spike protein, which identifies and binds to receptor sites on the host's cell membrane. Here is the spike protein for the SARS coronavirus. You may have noticed that evolutionary trees take the shape of a graph. So we'll define a tree as a connected, acyclic graph. Because the tree is connected, it means that it holds the tree in one piece. And because it's acyclic, it means that the tree can branch out without growing back in on itself and forming a cycle. When we use a tree to model a phylogeny, the present day species should be at the ending nodes of the tree. More formally, these ending nodes are called "leaves", which I've shown in dark in this tree of life and they're simply nodes that have degree one. The remaining nodes, shown in a lighter color, are called "internal nodes", and they're simply nodes of a larger degree than one. Now, trees come in all different shapes and sizes, but the fact that they are connected in a cyclic by definition leads to some common properties. For example, we can always find two leaves in a tree as long as the tree has at least two nodes, of course. Also, a tree with n nodes has exactly n - 1 edges. For example, the tree on the left has 14 nodes. So regardless of what shape the tree has or how it's nodes are connected, we know that it must have 13 edges. In order to model the most common ancestor of all the species in trees, we will sometimes place a "root" in a tree. If a tree is rooted, shown here by the green node, then the root can be drawn at the top of tree with the leaves at the bottom of the tree. By reorienting the tree in this way, we know that time flows downward from the root to the leaves in the sense that each edge of the tree connects an older species to a more recent species. For now though, let's assume that we are working with unrooted trees. This leads us to a computational problem of finding the tree that "fits" a distance matrix. Of course, I should have known that Son would somehow show up today. The problem that he has is that we have been careless and we haven't defined what we mean for a tree to "fit" a distance matrix. If we come back to the toy distance matrix that we had for four species, then I will go ahead and show you a tree that fits it. Here it is. We will need to assign weights to the edges of this tree so that the sum of weights along a path that connects two leaves corresponds to the distance matrix value for those two leaves. Here we have weights of 1 and 2 and they combine to give us a distance of 3 between chimp and human. The distance between chimp and seal is 6, and edge weights 1, 3, and 2 give us 6, and so on. So you can verify that each possible path through this tree, if you add up the weights of the edges along the path, that gives you the distance matrix value. Now we will use lower case d to indicate the distance between leaves and a tree. If you're given a tree whose edges are weighted, then it's a pretty straightforward problem to just compute the distances between each pair of leaves in that tree. Of course, what we really want to do is solve the reverse problem, which is finding a tree whose distance between leaves fits a given matrix. You think this is now well defined? If you're not sure what I'm driving at with this question, feel free to pause and take moment to try this exercise. Regardless of how long you spend on this exercise that I gave you, you will never find a tree fitting it, because such a tree simply does not exist. For this reason, we need to distinguish between distance matrices that have trees fitting them and distance matrices that don't have trees fitting them. A matrix that does have a tree fitting it is called "additive". There is another problem though, which is that an additive matrix could have more than one tree fitting it, and it will. Here's the tree we had before fitting the toy distance matrix, but we can simply stretch out the edges into longer paths and still have a tree that fits the distance matrix. But I would say that when you look at these two trees, the top one just seems nicer, right? Of course, we need to rigorously define what we mean by "nicer". The bottom tree has two nodes of degree 2. So we can get the graph on the top from the graph on the bottom by simply compressing all paths that contain degree 2 nodes. This leads us to define a simple tree, as a tree that has no nodes of degree 2, and the good news is that if a matrix is additive, i.e., there is a tree fitting this distance matrix, then there exists exactly one simple tree that fits it. Now that we're armed with some better terminology, we can return to the Distance-Based Phylogeny Problem that we had before and restate it as finding the simple tree that fits a given distance matrix, as long as that matrix is additive.