One of the most generally useful classes of sampling methods, and one that's very commonly used in practice, is the class of Markov chain Monte Carlo methods. These are methods that allow us to design an intuitive sampling process that, through a sequence of steps, generates a sample from a desired target distribution that might be intractable to sample from directly. So what are these Markov chains, and how do you use them? A Markov chain is a method for sampling from a distribution P that is intractable to sample from directly. And we've seen examples of such distributions P. If you'd like to sample from, say, a Bayesian network where we have some evidence, we don't really know how to do that. If you'd like to sample from a Markov network, we don't really know how to do that either, in general. So here we have examples of distributions P, and we'd like to come up with a way of generating even one sample from such a distribution. A Markov chain gives us a general mechanism for doing that. What the Markov chain does is define an iterative process by which the first sample you generate is not going to be from the distribution P, but ultimately, as you move along, you get closer and closer to generating a sample from P. So let's understand what that means. A Markov chain is defined over a state space, which we're going to use x's to denote. Here is an example of such a state space. It's a very simple one: it has the zero point in the middle, four negative numbers to the left, and four positive numbers to the right. A Markov chain defines a probabilistic transition model which, given that I'm at a state x, tells me how likely I am to transition to a different state x prime. And that is a probability distribution, as indicated in the formula here.
That is, for any x, the sum of the probabilities over the states you can transition to is exactly one. So, for example, in this case we have our little grasshopper that starts out at state zero. It has a probability of 0.25 of transitioning to the right, a probability of 0.25 of transitioning to the left, and a probability of 0.5 of not making any progress and staying exactly where it is. In fact, that same probability distribution is replicated across the different states in the chain, with the exception of the states at the ends: if the poor grasshopper tries to go to the left when it's at negative four, it hits the wall and ends up staying where it is. So the probability of staying in that case is actually 0.75, corresponding to the two different cases that we just talked about. But anyway, this is a Markov chain, and you can imagine simulating a random process by which the grasshopper traverses this chain. Initially, it starts out at, say, state zero. Then, as it moves, it chooses to move left with probability a quarter, right with probability a quarter, and to stay in the same place with probability 0.5. Once it has moved, say, one state to the left, it does exactly the same thing again. So let's think about the temporal dynamics of this type of process. We can ask: what is the probability that at time step t plus one, the state is equal to some value x prime? We can get that via a recurrence relation that looks at the state at time t, assuming we had previously computed the probability distribution over where the grasshopper might be at time t.
We can sum over all possible states x where the grasshopper might have been and ask: if it was at state x at time t, what is the probability that it ends up going to x prime? Together this gives me a distribution over pairs (x, x prime), which measures the probability that at time t the grasshopper is at state x and at time t plus one it's at x prime. And since I'm only interested in asking about time t plus one, I sum out, or marginalize, the time-t state x. If we go back to our grasshopper example, we can simulate this process, and here are the first few steps of it. At time zero, the grasshopper is at state zero with probability one. At time one, it's at negative one with probability a quarter, at positive one with probability a quarter, and with probability a half it's stuck at the same state. Now we can simulate the next step. At the next time step, the probability that it manages to move all the way to negative two is 0.25 squared, because that requires two successful moves to the left; similarly for two successful moves to the right. At state zero you have a sum of several events: the probability that it stayed in the same state twice, which is 0.5 squared, plus the two events that correspond to moving left once and then back right, or moving right once and then back left, each of which happens with probability 0.25 squared. So this is an example of how you would run this process. Now, it turns out that for many of these chains (and we'll describe the conditions in a moment), as the process evolves, the probability distribution equalizes, which means that the probability that at time t you're at state x prime is almost the same as the probability that at time t plus one you're at x prime. So the distribution over where you might be tends to equalize.
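The grasshopper's transition model and the temporal recurrence described above can be sketched in a few lines of plain Python. This is only an illustration; the helper names and the dictionary representation are my own, not from the lecture.

```python
# Sketch of the recurrence P^(t+1)(x') = sum_x P^(t)(x) T(x -> x'),
# on the 9-state grasshopper chain over states -4 .. 4.

def make_T():
    """Build the transition model T[x][x'] for the grasshopper chain."""
    T = {}
    for x in range(-4, 5):
        row = {x: 0.5}                 # stay put with probability 0.5
        for nb in (x - 1, x + 1):      # attempted hops left and right
            if -4 <= nb <= 4:
                row[nb] = 0.25
            else:
                row[x] += 0.25         # hit the wall: the failed hop keeps us in place
        T[x] = row
    return T

def step(p, T):
    """One application of the recurrence: marginalize out the time-t state."""
    out = {x: 0.0 for x in T}
    for x, px in p.items():
        for xp, pr in T[x].items():
            out[xp] += px * pr
    return out

T = make_T()
p = {x: 1.0 if x == 0 else 0.0 for x in T}   # at time 0: state 0 with probability 1
for _ in range(2):
    p = step(p, T)

# After two steps: P(-2) = 0.25^2 = 0.0625, and
# P(0) = 0.5^2 + 2 * 0.25^2 = 0.375, matching the hand calculation.
```

Running the same loop for many more steps shows the distribution equalizing toward uniform, as claimed later in the lecture.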
And so we can then consider the limiting distribution that you would get as you simulate the process for more and more steps. That is typically denoted by pi, which is called the stationary distribution; it also has a bunch of other names, but stationary is the most common. If you plug in, basically replacing this approximate equality with an exact one, you can see that what we have is a condition on pi: the probability of x prime in the stationary distribution needs to be equal to this summation over here, where pi now appears both on the left-hand side and within the summation on the right-hand side. Now, it turns out that this concept of the stationary distribution is actually at the heart of Google's PageRank algorithm. What they're computing, at least in the original PageRank, is the probability that if you take a random walk on the web, you end up at a particular website. So this notion of a stationary distribution is actually quite powerful, as we've seen. So, let's take a simple example, which is the three-state Markov chain shown here. Note, for example, that from state one there's a probability of 0.75 of going to state two and a probability of 0.25 of staying in the same state. We can now write down a set of equations that represent the properties the stationary distribution needs to satisfy, and if you do that, you get the following equations. For example, pi of x1 has to be equal to 0.25 times pi of x1, because of the self-transition here, plus 0.5 times pi of x3. That's because you can end up at x1 in one of two ways: either by starting out at x1 and staying there, or by starting out at x3 and moving to x1, which happens with probability 0.5.
Similarly, for pi of x2, you can end up at x2 in several ways: by being at x1 and moving to x2, which happens with probability 0.75, by being at x2 and staying there, which is this self-transition, or by being at x3 and moving to x2, which happens with probability 0.5. So this is a set of three simultaneous equations in three variables. By itself this is an underdetermined system, because you could multiply all of the pi's by a factor of 100 and it would still be a solution. But we can add to that the one constraint that says that the sum of all of the pi's has to be one, because pi is a probability distribution. Once you do that, you end up with a unique solution to the system of linear equations, and it's not difficult to plug those pi's into the system and confirm that this is indeed the stationary distribution that satisfies these equations. By the way, for the grasshopper example that we showed on the previous slide, the stationary distribution is the uniform distribution. This has to be one of the dumbest ways of generating a sample from the uniform distribution, but it does in fact eventually generate a sample from something very close to the uniform distribution. So, when does a Markov chain converge to a stationary distribution? I said many of them do, but it turns out that not all of them do, and one condition that guarantees convergence to a stationary distribution is called regularity. A Markov chain is regular if the following condition holds, and notice the order of the quantifiers: there exists some integer k such that, for every pair of states (so this is the universal quantifier), the probability of getting from x to x prime in exactly k steps is greater than zero. Notice what that means. It means that you pick the k first, so you can't have a different k for different pairs of states. You can pick k to be a thousand, or a million; it doesn't matter for this purpose.
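One simple way to find the stationary distribution numerically is to just iterate the recurrence until the distribution stops changing (power iteration), rather than solving the linear system by hand. The transitions out of x1 and x3 below are the ones stated in the example; the outgoing probabilities for x2 (a 0.7 self-transition and 0.3 to x3) are an assumption I've filled in so that each row sums to one, since the transcript doesn't state those numbers.

```python
# Sketch: power iteration to find the stationary distribution pi of a
# small Markov chain. The x2 row (0.7 self, 0.3 -> x3) is ASSUMED to
# complete the example; only the x1 and x3 rows are given in the text.

T = {
    "x1": {"x1": 0.25, "x2": 0.75},
    "x2": {"x2": 0.70, "x3": 0.30},   # assumed split
    "x3": {"x1": 0.50, "x2": 0.50},
}

pi = {s: 1.0 / 3 for s in T}          # any starting distribution works
for _ in range(1000):
    nxt = {s: 0.0 for s in T}
    for x, px in pi.items():
        for xp, pr in T[x].items():
            nxt[xp] += px * pr
    pi = nxt                          # one application of the recurrence
```

After convergence, pi (approximately) satisfies both the fixed-point equations, for instance pi(x1) = 0.25 pi(x1) + 0.5 pi(x3), and the normalization constraint that the pi's sum to one.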
But you need to pick k first, and then there needs to be a way of getting from x to x prime in exactly that number of steps, with probability greater than zero. It turns out that this is a sufficient condition (not a necessary one, but a sufficient one) to guarantee that the Markov chain converges to a unique stationary distribution regardless of the start state. So it converges, and it converges to a single stationary distribution, which is characterized by the equations that we saw on the previous slide. Now, what are some sufficient conditions for regularity? Because this one is a little bit hard to check. One sufficient condition that people often use in practice is, first, that every pair of states x and x prime is connected by a path of probability greater than zero, and second, that every state has a self-transition, so you can always transition from a state to itself. If you think about it, that means that if you take k to be the diameter of this graph, that is, you set k to be the distance between the furthest pair of states, then you can get between every pair of states using exactly k steps. Because you can take fewer than k steps and then just stick around for a while until you hit k, thanks to the self-transitions. So this is a way of guaranteeing that, for this value of k, you can get from every state to every state. And that's what people typically do: they add these self-transitions to guarantee regularity. To summarize, we've defined the notion of a Markov chain, which defines a general dynamical system, or an iterative sampling process, from which we can sample trajectories that traverse the state space of the chain. Under certain conditions, such as the regularity condition that we defined, which is sufficient although not necessary, this iterative sampling process is guaranteed to converge to a stationary distribution in the limit, that is, after we generate enough samples.
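The regularity condition (a single k such that every pair of states is connected in exactly k steps with positive probability) can be checked by brute force on a small chain. This is just an illustrative sketch; the function name and the cutoff on k are my own choices.

```python
# Rough check of regularity: grow k and test whether, for that single k,
# every state can reach every state in exactly k steps with positive
# probability. max_k is an arbitrary cutoff for the search.

def is_regular(T, max_k=50):
    states = list(T)
    reach = {x: {x} for x in states}          # exactly-0-step reachability
    for _ in range(max_k):
        # exactly-(k+1)-step reachability, following positive-prob. edges
        reach = {x: {xp for y in reach[x] for xp in T[y]}
                 for x in states}
        if all(len(reach[x]) == len(states) for x in states):
            return True
    return False

# A chain with self-transitions and a connecting path is regular:
T_reg = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.5, 1: 0.5}}

# A deterministic 2-cycle is not: odd k reaches only the other state,
# even k reaches only your own state, so no single k works.
T_cyc = {0: {1: 1.0}, 1: {0: 1.0}}
```

The 2-cycle also shows why the self-transition trick helps: adding even a small self-loop at each state breaks the parity and makes the chain regular.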
And so this provides us with a general approach for sampling from a distribution indirectly, which means that if we have a distribution from which it's intractable to sample, this gives us an alternative mechanism.