We defined the notion of Markov chain Monte Carlo sampling as a general paradigm for generating samples from distributions from which it is otherwise difficult, perhaps intractable, to sample directly. Markov chains give us a general way of approaching this problem, but the framework leaves open the question of where the Markov chain comes from: that is, how do we design a Markov chain that has the desired stationary distribution? We talked about the Gibbs chain as a general solution to this problem in the context of graphical models, but we also mentioned that the Gibbs chain has limitations in terms of its convergence rate for certain types of graphical models. So what happens if we have a graphical model for which the Gibbs chain doesn't have good convergence properties? How do we design a Markov chain for that? It turns out that there is a general class of methods called Metropolis-Hastings, and the Metropolis-Hastings algorithm gives us a general approach for designing a Markov chain with a desired stationary distribution. In fact, for a given stationary distribution we can construct a whole family of different Markov chains that explore the space differently, and then try to select or construct one among those that has good convergence properties for our distribution.

The key insight behind the Metropolis-Hastings method is the notion of a reversible chain. So what does it mean for a chain to be reversible? Imagine that we have a chain with a particular stationary distribution pi, and consider two different experiments. In the first experiment, we pick a point in the state space according to the probability distribution pi — say we land on this one — and then we pick a random edge emanating from the state that we picked, according to the transition model T defined by the chain; say we pick this edge. The second experiment is exactly the same thing: we again pick a state according to pi and then pick an outgoing transition from that state. A chain is reversible if the probability of traversing this red edge in the first experiment is exactly the same as the probability of traversing this green edge in the opposite direction. That is, picking x and then transitioning to x' occurs with the same probability as picking x' and then transitioning to x. Written in mathematical notation, this says that pi(x) — the probability of the red state — times the transition probability along the red edge is equal to pi(x') — from the green experiment — times the probability of transitioning in the opposite direction. Now, it's very easy to construct Markov chains that are not reversible, so this is by no means a universal property of Markov chains. But it turns out that reversible Markov chains have an elegant property that allows them to be constructed so as to give rise to a particular stationary distribution. Specifically, we can show — and it's not a very difficult proof; we'll show it in a moment — that if this equation, which is called the detailed balance equation, holds and T is a regular Markov chain, then T has a unique stationary distribution (which we already know from the fact that it's regular), and furthermore, that stationary distribution is pi. Let's go ahead and prove this.
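For reference, written out, the detailed balance condition is

$$\pi(x)\,T(x \to x') \;=\; \pi(x')\,T(x' \to x) \qquad \text{for all pairs } x, x',$$

and the claim is that if $T$ is regular and satisfies this condition with respect to $\pi$, then $\pi$ is the unique stationary distribution of $T$, i.e.

$$\sum_x \pi(x)\,T(x \to x') \;=\; \pi(x') \qquad \text{for all } x'.$$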
It turns out to be a one-line proof. We take the two sides of the detailed balance equation, put one on each side, and because the two sides are equal, when we sum over x the two summations are also equal: the sum over x of pi(x) T(x → x') equals the sum over x of pi(x') T(x' → x). Now notice, looking at the right-hand side, that pi(x') does not depend on the summation variable x, so we can move it outside the summation, and the right-hand side can be rewritten as pi(x') times the sum over x of T(x' → x). And that summation, because T is a probability distribution over the edges outgoing from x', is necessarily equal to one. So what we just proved in this one-line derivation is that the sum over x of pi(x) T(x → x') equals pi(x'), which is exactly the definition of the stationary distribution. That is, a chain that is reversible relative to a particular distribution pi necessarily has pi as its stationary distribution. Now, why is this a useful alternative to defining the stationary distribution pi via the equation at the bottom? Because the definition at the bottom involves a summation over what is, in general, an exponentially large space, whereas detailed balance is, as we'll see in a minute, much more constructive, in that it allows us to construct T so as to match pi. And that is exactly the idea behind the definition of the Metropolis-Hastings chain.

So how does the Metropolis-Hastings chain work? It starts out by saying: we want to move around broadly in the state space, so we're going to have a proposal distribution Q, which is something like a transition model. Q is going to roam freely over the state space; unlike the Gibbs chain, it doesn't necessarily take little local steps from one state to a nearby state — it can propose moves into very distant parts of the state space. Now, that by itself is just going to give me a stationary distribution that has absolutely no relationship to the stationary distribution pi that I care about, and so I'm also going to have a critic. The critic is the one who says: wait a minute, you can't go there right now, because that's not going to give you the right stationary distribution. So what does the critic do? The critic listens to the proposal made by the proposal distribution Q — say, you want to go from x to x' — and asks: with what probability am I going to let you do that? That's the acceptance probability, a binary random variable for each x and x': the probability that we accept that proposal. So that gives rise to the following process. I haven't told you how to pick Q or A yet, but let's just understand the process. At each state x, we sample a potential successor state x' from the proposal distribution Q. The acceptor then decides whether to accept that proposal. If the proposal is accepted, we actually make that transition, and if the proposal is rejected, we just stay where we are.
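As a minimal sketch of this loop — with `propose` and `acceptance_prob` as generic, user-supplied functions (illustrative names; we haven't yet said how to construct the acceptance probability) — one step looks like:

```python
import random

def mh_step(x, propose, acceptance_prob, rng=random):
    """One Metropolis-Hastings transition from state x.

    propose(x)             -- samples a candidate x' from the proposal Q(x -> .)
    acceptance_prob(x, xp) -- the acceptance probability A(x -> x'), constructed below
    """
    xp = propose(x)                          # the proposer suggests a move
    if rng.random() < acceptance_prob(x, xp):
        return xp                            # the critic accepts: take the step
    return x                                 # rejected: stay where we are
```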
So what transition model does that give rise to? Ultimately, this process just defines a transition model from x to x', and there are two cases. Case one: x' is a state other than x. In this case, the only way I can move to x' is if I proposed the move to x' and that proposal was accepted, so I multiply those two probabilities, and that gives this expression for the transition. What about the transition from x to x? There are two ways that can happen: either I propose a move to x itself — which we assume is always accepted; that's one of the options in this model — or x gets the benefit of all the proposals to other states that were not accepted. So we have Q(x → x), plus the sum over all other possible transitions of the probability that they were proposed times the probability that they were rejected, which is one minus the probability that they were accepted. That gives us a transition model.

So this is a general definition of a Markov chain, but as stated, there is no reason to expect it to have any particular stationary distribution. And this is where we bring back the intuition from the detailed balance equation: we're going to use detailed balance to construct an acceptance probability that has the property we want. Here is the detailed balance equation rewritten, and we're going to assume in this derivation that x is not equal to x', because for the case x = x' the equality holds trivially: pi(x) T(x → x) equals pi(x) T(x → x). So let's look at the case x ≠ x', where our goal is to construct A. The setup is: I'm given a proposal distribution Q, and I want to pick A so as to make detailed balance hold for Q and pi — because if detailed balance holds, that guarantees the overall process has the correct stationary distribution. So this is the equality that we want to hold, and now let's treat it as a constraint on the acceptance probabilities. The acceptance probabilities appear on both sides of the equation, so I move both A's to one side and everything else to the other, and I get a constraint on the ratio between the acceptance probability from x to x' and the acceptance probability from x' to x: it has to equal this ratio over here. And I'm going to assume, without loss of generality — notice that we can set this up either way, with either direction in the numerator — that this ratio is some value rho that is less than one.
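Written out, the transition model and the constraint we just derived are:

$$T(x \to x') = Q(x \to x')\,A(x \to x') \quad \text{for } x' \neq x, \qquad
T(x \to x) = Q(x \to x) + \sum_{x' \neq x} Q(x \to x')\,\bigl(1 - A(x \to x')\bigr),$$

and plugging the first expression into detailed balance for $x \neq x'$ gives

$$\pi(x)\,Q(x \to x')\,A(x \to x') = \pi(x')\,Q(x' \to x)\,A(x' \to x)
\;\;\Longrightarrow\;\;
\frac{A(x \to x')}{A(x' \to x)} = \frac{\pi(x')\,Q(x' \to x)}{\pi(x)\,Q(x \to x')} = \rho.$$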
So now we need to pick the values of the two acceptance probabilities, A(x → x') and A(x' → x), so as to make this equality hold, and one simple way to do that is to take the smaller of the two — which we assumed corresponds to the numerator — and set A(x → x') equal to rho and A(x' → x) equal to one; that immediately makes the equality hold. Now, of course, we could have picked these probabilities otherwise; we could have picked, for example, A(x → x') to be half of rho and A(x' → x) to be one half. But notice what that would do: it would just reduce the probability that proposals are accepted, which would make the chain move half as quickly, because you'd end up staying at the same state twice as often as you would if the acceptance probabilities were higher. So these are in fact the highest values we could possibly pick while still satisfying the ratio constraint. Putting it together as a general formula for the acceptance probability gives us the expression A(x → x') = min[1, pi(x') Q(x' → x) / (pi(x) Q(x → x'))], which, if you look at it, gives the right values both for x to x' and for x' to x.

So that solves the problem of picking A given Q — but what about Q? How do I pick Q? Well, that turns out to be the $64,000 question: picking Q is an art, and it's not an easy choice in many applications. But let's think about some conditions that Q must satisfy before we think about how to pick one. One condition on Q comes simply from looking at the form of the expression for A. It involves a ratio, and if you have a ratio, the denominator had better not be zero. So one thing we must guarantee about Q, in order for A to be a valid set of acceptance probabilities, is that Q itself must be reversible in the following sense: the numerator is greater than zero if and only if the denominator is greater than zero. That is, if you can propose a step from x to x', you'd better be able to propose a step from x' to x with nonzero probability, which guarantees that this ratio is always well-defined.
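In code, the resulting acceptance rule looks like this — a minimal sketch, with pi and q as user-supplied functions (the names are illustrative). Note that only the ratio pi(x') / pi(x) appears, so an unnormalized pi is enough:

```python
def mh_acceptance_prob(x, xp, pi, q):
    """A(x -> x') = min(1, pi(x') q(x' -> x) / (pi(x) q(x -> x'))).

    pi(s)    -- target density at s; may be unnormalized, since only ratios are used
    q(a, b)  -- proposal density Q(a -> b); must satisfy q(a, b) > 0 iff q(b, a) > 0
    """
    numerator = pi(xp) * q(xp, x)
    denominator = pi(x) * q(x, xp)
    return min(1.0, numerator / denominator)
```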
Beyond this validity requirement, these considerations induce a real tension in the design of Q. On the one hand, you'd like Q to be very broad: if you're at this state over here, you'd like Q to be able to go really far away from the current state, because the farther you can go, the faster you move around in the state space. You don't want to stay local; you don't want to make the same mistake the Gibbs chain makes by staying very near the current assignment. You want to move around quickly, and that improves mixing. On the other hand, look at what this formula implies when the proposal distribution is very broad. You may propose moves between a state with a low value of pi and a state with a very high value of pi — which is what happens when you move from a hilly part of the space to a low part — so there can be big differences in the heights of the two states. What I'm drawing here is the height of pi, with the state space x along the horizontal axis, and if I move very far, I'm liable to end up in situations where pi of one state x is considerably larger than pi of x'. And if you think about what that implies for the acceptance probability, the acceptance probability can get very, very low, which in turn implies slow mixing again: you might be taking very global steps, but you're taking very few of them. So there is a real tension in the design of proposal distributions Q.

Let's look at an example of a Markov chain where the proposal distribution makes a very big difference. This is a problem that you wouldn't necessarily think of immediately as a graphical model, but it can be formulated as one: it's a matching problem, and we'll see it in other settings as well. Here we have a bunch of red dots, as we see over here, and a bunch of blue dots, and we want to match the red dots to the blue dots, and there's some sort of preference function. So these edges over here — I'm going to mark them in green — are annotated with weights that tell us how happy this red dot is to be matched with this blue dot. This comes up a lot, for example, in various correspondence problems where we're trying to match sensor readings to objects, say in a tracking application. It also comes up in image problems where you want to match features in one image to features in another, so it's a very common problem. One formulation of the matching problem as a graphical model is that we have a variable Xi for each red dot: the red dots are the variables, the blue dots represent the values, and Xi = j if red dot i is matched to blue dot j. That can now be written as a probability distribution over this space. We consider the probability of an assignment of values to variables, and we say, first of all, that it has to be a matching: if we don't have every red dot matched to a different blue dot, that's a probability-zero assignment — this would be the case if two red dots were matched to the same blue dot. Otherwise, we take the sum of the qualities of the matches between the red dots and the blue dots — the sum of the weights of the edges that participate in the matching — and exponentiate it to define the probability. So, for example, if this is my matching over here — and here I've matched all the red dots, so each red dot has been assigned a blue dot as its value — then the overall quality is the total sum of the qualities of the matched edges, and that, exponentiated, defines the probability distribution. We'll represent these qualities visually in this example as distances, so we'll assume, for instance, that this red dot prefers to be matched to this nearby blue dot more than to that farther-away blue dot.
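Written out — using $w_{ij}$ for the quality of matching red dot $i$ to blue dot $j$, notation introduced here just for convenience — the distribution is:

$$P(x_1, \dots, x_n) \;\propto\;
\begin{cases}
\exp\!\Bigl(\sum_i w_{i,\,x_i}\Bigr) & \text{if all the } x_i \text{ are distinct (a legal matching)}\\[4pt]
0 & \text{otherwise.}
\end{cases}$$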
One could imagine running Gibbs sampling on this model: take a variable — one of the red dots — and pick a new value for it, that is, a new blue dot that it isn't currently matched to. This is going to change things very, very slowly, because most of the blue dots are already taken by other red dots, and those assignments — where a red dot is matched to an already-taken blue dot — have zero probability, so they are impossible. The only thing I can do is make the red dot match one of the unmatched blue dots, and so it takes a very long time for anything to change.

Here is a different way of doing this, using Metropolis-Hastings with a particular way of constructing the proposal distribution, called an augmenting path proposal distribution. It works as follows. Let's assume that this is my current matching, and I define the following proposal. I pick one of the variables Xi, say this one, and I pick another value for it, pretending that everything is available — ignoring the fact that these other blue dots are already matched. I pick probabilistically, so I could have picked a different one; I could have picked the same blue dot it was matched to before, or a farther-away one. Say I pick this one. Well, now this poor red guy over here is unmatched, so it has to pick a new partner; say it picks this one. That leaves this red guy unmatched, so it picks one, say this one. And that leaves this last red guy, and say it picks this one — and notice that now I have a new legal matching. I've closed the loop and defined what's called an alternating cycle, where I can switch all of the black edges to green edges, and that gives me a new legal matching; that's my proposal. With a little bit of numerical calculation, one can work out the acceptance probability for this proposal distribution, and it turns out that it's actually a really good proposal distribution.
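Here is a rough sketch of that proposal move, under some illustrative assumptions: equal numbers of red and blue dots with a perfect matching, dictionaries for the matching and the weights, and each displaced red dot picking a new blue dot in proportion to the exponentiated weight (the lecture only says the pick is probabilistic). Computing the corresponding acceptance probability requires the forward and reverse proposal probabilities and is omitted here.

```python
import math
import random

def augmenting_path_proposal(matching, weight, rng=random):
    """One alternating-cycle ("augmenting path") proposal for the matching chain.

    matching -- dict mapping each red dot to its current blue dot (a perfect matching)
    weight   -- weight[r][b] = quality of matching red dot r to blue dot b
    Returns the proposed matching as a new dict.
    """
    proposed = dict(matching)
    old_owner = {b: r for r, b in matching.items()}   # blue -> red in the current matching

    start = rng.choice(list(matching.keys()))         # red dot whose match we re-sample
    freed = matching[start]                           # its old blue dot becomes "open"
    claimed = set()                                    # blue dots claimed so far in this proposal
    red = start
    while True:
        # The displaced red dot picks a new blue dot among those not yet claimed,
        # here with probability proportional to exp(weight) -- one natural choice.
        options = [b for b in weight[red] if b not in claimed]
        ws = [math.exp(weight[red][b]) for b in options]
        blue = rng.choices(options, weights=ws, k=1)[0]
        proposed[red] = blue
        claimed.add(blue)
        if blue == freed:          # the cycle closes: we again have a legal matching
            return proposed
        red = old_owner[blue]      # the red dot that held this blue dot is displaced next
```

The point of the construction is that a single proposal re-routes a whole cycle of assignments at once, rather than changing one variable at a time.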
Let me show you some results on the next slide. This is what happens if I just apply Gibbs sampling to this model. You can see four chains here — the four colors represent the four chains — with the number of steps on the x-axis and, on the y-axis, some statistic I'm tracking in order to gauge whether the chain is mixing. As it happens, it's the probability that the first red dot is matched to the first blue dot, but it doesn't matter; any statistic would have illustrated the point. You can see that there are long-range fluctuations in the chain — the probabilities change over time. (The probabilities, by the way, are computed over windows of size 500, from the samples in each window.) So the chain takes a long time to move from one region of the space to another. The augmenting path proposal distribution that I just showed you is, by contrast, completely different: the probabilities are almost totally flat, all four chains look essentially the same, and there is very little change across windows over time. And there is an even better proposal distribution, which I didn't show you, that modifies the augmenting path just a little and gives you effectively perfect mixing across the entire run.

If we look at the other metric we defined for evaluating mixing, we can compare the estimates of different statistics across two chains. So, for example, this might be the red chain and this might be the green chain, and what we see here are the estimates you get for different probabilities — the probability that this red dot is matched to that blue dot, for example. You can see that for the Gibbs chains, the estimates are very noisy and the different chains give you divergent estimates. The first of the Metropolis-Hastings proposals gives you points that are almost on the diagonal, and for the second, the points are effectively exactly on the diagonal — perfect mixing.

To summarize, Metropolis-Hastings is a very general framework for building Markov chains that are designed to have a particular stationary distribution. It requires that you come up with a proposal distribution, and that proposal distribution needs to have certain properties in order to be valid. But once you give me such a proposal distribution, the detailed balance equation — this nice, simple equation — tells me how to construct the acceptance distribution so as to enforce that we get the right stationary distribution. This is great in some ways, because it gives us huge flexibility in designing proposal distributions that explore the space quickly and move to faraway parts of the space. But as we saw, picking the proposal distribution is an important design point, and it makes a big difference to performance. And there isn't really a science of how to do this effectively; it is something of an art, and it requires a nontrivial amount of thought and engineering.