0:00

We previously defined the notion of a Markov chain that allows us to generate

samples from an intractable distribution. But we left unanswered the question of,

assuming we have a Markov chain that has the desired stationary distribution,

how do we go about using it in the context of a concrete algorithmic

procedure? Well, let's imagine that somebody has

provided us with a Markov chain.

Imagine that our goal is to compute some probability relative to a

distribution P. But, for whatever reason, P is too hard

to sample from directly. So how do I use the Markov chain to

get around that? First, we construct a Markov chain

whose unique stationary distribution is P. So it needs to be a regular Markov

chain, usually. And then we're going to generate samples

from this chain. So we're going to start out by sampling

my initial state from some arbitrary distribution, P0.

P0 is not the same as P, so obviously I can't use that zeroth sample

as a way of estimating an event relative to P.

And so what I do now is I start walking along the chain,

sampling x(t+1) from the transition model.
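As a minimal sketch of this walk, here is a hypothetical two-state chain; the states, transition probabilities, and function names are all invented for illustration and are not from the lecture:

```python
import random

# Hypothetical two-state chain, made up for illustration. Its transition
# probabilities are chosen so the stationary distribution is
# P = {0: 0.25, 1: 0.75}.
def sample_next(x):
    """Sample x(t+1) from the transition model, given the current state x."""
    if x == 0:
        return 1 if random.random() < 0.75 else 0
    return 0 if random.random() < 0.25 else 1

def run_chain(x0, num_steps):
    """Start at x0 (drawn from some arbitrary P0) and walk the chain."""
    x = x0
    trajectory = [x]
    for _ in range(num_steps):
        x = sample_next(x)
        trajectory.append(x)
    return trajectory
```

For this particular toy transition model, the long-run fraction of time spent in state 1 approaches 0.75, the stationary probability.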

And because the chain converges to a stationary distribution, eventually I'll

have a sample that is very close to being a sample from the

distribution P that I care about. So, now, here's the sticky issue.

And this really is a sticky issue: when have we walked long enough that we

can actually use our samples? Because we don't want to use samples at

the beginning, because they're far from P.

And so we need to walk around long enough for the chain to get close to its

stationary distribution. And that state is called mixing.

So the chain has mixed when P(t) is close enough to

2:14

π, so that your time-t distribution is close

to the stationary distribution. And that's the point at which I can say, good enough, I

can start collecting samples. So how do we know if a chain has mixed or

not? And the short answer is you don't.

And so this is, as I said, the sticky part of Markov chain methods.

So in general, you can never really prove that a chain has mixed.

But in some cases you can show that it hasn't mixed.

And if you run lots and lots of tests, and none of them has convinced

you that the chain hasn't mixed, then you sort of resign yourself to assuming that

it has, in fact, mixed. So how do you convince yourself that a

chain hasn't mixed? You compute chain statistics, and we'll

show some examples of that in a moment, in different windows within a single run

of a chain. So for example, here I have a single run

of the chain, and I run for a while, and then I look at

little windows (usually they won't be this small), and I compare this window

to that window, computing various statistics. For example, what is the probability of

hitting a particular state? And I say, is the probability of that in

this window close to the probability of that in the other window?
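A sketch of this window comparison, where `samples` stands in for one long run of the chain and `event` for the indicator of the state we're tracking; the names and the placeholder run are mine, not from the lecture:

```python
# Placeholder for one long run of the chain (a real run would come from
# an actual sampler; this one just alternates states for illustration).
samples = [t % 2 for t in range(100_000)]

def window_prob(samples, start, size, event):
    """Estimate the probability of an event within one window of the run."""
    window = samples[start:start + size]
    return sum(1 for x in window if event(x)) / size

# Probability of hitting a particular state, in two different windows:
p_a = window_prob(samples, start=10_000, size=5_000, event=lambda x: x == 1)
p_b = window_prob(samples, start=50_000, size=5_000, event=lambda x: x == 1)
similar = abs(p_a - p_b) < 0.01   # similar => no evidence against mixing
```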

And if it is, then maybe I've reached at least some kind of convergence point,

in terms of where the chain has reached. Now this, of course, is not a sufficient

answer, because the same picture could also arise in a chain that has these sort of

two very high-probability regions that are very hard to get between.

4:15

And if the chain is kind of ambling along in this part of the space and never

hitting the other part of the space, then you're still going to get very

similar probabilities across a window of a single run of the chain.

Because all of the samples are taken from this part, and none of them are ever taken

from that part. And so we don't know that the chain has

mixed. So a more reliable evaluation criterion is to take these statistics across

different runs that are initialized in different parts of the space.

And then you might hope that one run of the chain is traversing this region,

and another run of the chain is traversing that region. And so now the statistics would show a

difference and indicate that mixing hasn't taken place.
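A sketch of that idea, using a made-up toy chain with two high-probability regions that the walk (here) never crosses between; the chain, statistic, and threshold are all invented for illustration:

```python
import random

def half_walk(x0, steps):
    """Toy chain on {0,...,9} with two high-probability regions,
    {0..4} and {5..9}, that this walk never crosses between."""
    lo, hi = (0, 4) if x0 <= 4 else (5, 9)
    x = x0
    xs = [x]
    for _ in range(steps):
        x = min(hi, max(lo, x + random.choice([-1, 1])))
        xs.append(x)
    return xs

random.seed(1)
run1 = half_walk(0, 20_000)   # initialized in one region
run2 = half_walk(9, 20_000)   # initialized in the other region

# The same statistic computed on each run: P(state in the low region).
p1 = sum(1 for x in run1 if x <= 4) / len(run1)
p2 = sum(1 for x in run2 if x <= 4) / len(run2)
not_mixed = abs(p1 - p2) > 0.5   # large gap => mixing has not taken place
```

A single run of this chain would look perfectly stable within its own windows; only the cross-run comparison exposes the problem.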

So what might we do, more concretely? Let's look at two examples.

Here are two example runs of a chain that we'll describe a little bit later.

And what we measure here, and this is the first statistic to

measure in Markov chains, is the log probability of a sample.

So you compute the log probability of the sample.

You can't always compute the log probability directly.

You might compute an unnormalized log probability, as

we'll talk about later. But basically, you compute the log

probability, or some constant factor thereof.
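As a sketch, with a made-up unnormalized log-density standing in for whatever model we're actually sampling from (the function and its Gaussian shape are assumptions for illustration):

```python
def unnormalized_log_prob(x):
    """Hypothetical score: log of P(x) up to an additive constant
    (the log of the unknown normalizing constant). Here just a
    Gaussian-shaped example."""
    return -0.5 * x * x

def log_prob_trace(samples):
    """The log-probability statistic tracked over a run: one value per
    sample. Comparable across runs, since the unknown constant is the
    same for both."""
    return [unnormalized_log_prob(x) for x in samples]

trace = log_prob_trace([0.0, 1.0, 2.0])
```

Plotting such a trace for each run, one initialized arbitrarily and one from a high-probability state, gives exactly the kind of comparison described next.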

And now, you can compare two runs. And this is a run that's initialized from

an arbitrary state. And this is one that's initialized from a

high probability state. And you look at those, and you say, has

it mixed, relative to this criterion? The answer is maybe.

It looks okay, but you're not entirely sure.

But we can look at other statistics. Let's look at another example.

What about this one? Here is again an example of two

runs, one of which is initialized from an

arbitrary state, and the other of which is initialized from a high-probability state.

And you can see that the log probability values are nowhere close to each other,

and so in this case the answer is definitely not: these are two runs of the chain, and really,

neither has mixed. And so you need to run for a lot longer.

This goes up to 600,000, which kind of indicates

how much time this might take. A different way of

looking at this uses a different kind of statistic.

So now we have, for example, a probability relative to a window that we

compute in the chain. Remember, all of this is relative to a

window in the chain, after we hope that mixing has taken place.

And now we compute the probability that the states within this window

fall in some set, for example the set of states where X3 is equal to 2.

And now we compute that statistic using the two initializations of the chain, so

this is chain one, or run one, and this is run two.

And now we do a scatter plot for different statistics.

So each of these points is a statistic: this one is, say, the probability that

X3 equals 2; this might be the probability that X1

equals 0; this might be some other probability,

that, you know, X5 equals 7. And what you see here is a scatter plot:

one axis is the estimate we get from the one run, and the other axis is

the estimate from the other run. And looking at this, it should be obvious that

on the left-hand side we have not got mixing,

because you can see that there are all these points here that have high

probability in one of the two runs and probability zero in the other, and vice

versa. Whereas here, most of the estimates are

clustered around the diagonal, so you're getting similar estimates

from the two runs. And so again,

the first one is a clear no, and the second is a maybe.

And if I do a lot of these statistics and they all come up with maybe, then I'm

willing to trust the answers and believe that mixing is taking place.
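The scatter-plot comparison boils down to computing, for each event, one estimate per run and checking how far the pairs sit from the diagonal. A sketch, with invented events and placeholder windows standing in for real post-mixing runs:

```python
def event_probs(run, events):
    """Estimate P(event) from one run's post-mixing window, for each
    event in a dict of indicator functions."""
    return {name: sum(1 for x in run if ev(x)) / len(run)
            for name, ev in events.items()}

# Each event is one point in the scatter plot:
# (estimate from run 1, estimate from run 2).
events = {
    "x == 0": lambda x: x == 0,
    "x == 1": lambda x: x == 1,
    "x >= 1": lambda x: x >= 1,
}
run1 = [0, 1, 1, 1, 0, 1, 1, 0]   # placeholder window from run 1
run2 = [1, 0, 1, 1, 1, 0, 1, 1]   # placeholder window from run 2

p1 = event_probs(run1, events)
p2 = event_probs(run2, events)
off_diagonal = max(abs(p1[k] - p2[k]) for k in events)
# A small maximum gap means the points cluster near the diagonal (a
# "maybe"); a large gap is clear evidence mixing has not occurred.
```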

So now that I've started collecting samples, how do I use these samples?

Well, one important observation to keep in mind is that once the chain is mixed,

all of my samples are from the stationary distribution.

That is, if x(t) is from π, so are x(t+1), x(t+2), x(t+3), and so on and so forth.

And so we can use every single one of those samples.

Because they are all from the correct distribution.

So once I've determined that I've sampled long enough for mixing,

or believe that I've sampled long enough for mixing,

we should collect and use all of the samples.

And in fact, you might read some papers that say:

oh, I'm only going to collect every hundredth sample.

There are actually papers that prove that using every single sample is better than

collecting every hundredth sample. But then you might ask: well, why would

those papers tell me that I should only collect every hundredth sample, as opposed to

collecting all the samples, if they're all from the correct distribution?

Because, and this is undoubtedly true, adjacent samples, ones that are nearby

each other in time, are correlated with each other.

Even if x(t) is from the right distribution π, and so is x(t+1),

x(t+1) is still going to be close to x(t).

And so you're not really getting two different samples;

you're getting two that are very close relatives of each other.

Now, as I said, it's important to recognize that phenomenon,

because just because you've reached mixing and

collected a thousand samples doesn't mean you have a thousand samples'

worth of information. So you shouldn't go back and apply

one of the bounds that we saw, assuming that you have i.i.d.

samples. The samples are not i.i.d.

That doesn't mean you shouldn't use them, though;

using them is still better than not.

Now, this is where you get bitten twice by the same phenomenon.

The slower a chain is to mix, that is, the longer you need to wait for the

initial samples to be good enough, the more correlated

the samples are, because you are moving around the space more slowly in

general. And so if a chain is bad, it's bad in two

different ways: it's bad because it takes longer to

mix, and it's bad because the samples you are collecting are not as useful,

because of the correlation structure between the samples.

13:12

And then finally, when I've collected enough samples, I can go ahead and

compute the expectation of anything that I care about, whether it's an indicator

function or a more complicated function. Summarizing the implications: Markov

chains are a very general-purpose class of methods for doing approximate

inference in general probabilistic models, not even necessarily

probabilistic graphical models. They're often very easy to implement,

because the local sampling that we're doing is often quite straightforward to

execute. And they have good theoretical properties

once we generate enough samples that are sufficiently far away

from our starting distribution. Unfortunately, these theoretical

guarantees are very theoretical in many cases.

And so this method also has some significant cons.

First, it has a very large number of tunable parameters and design choices.

What's my mixing time? Which statistics do I measure?

How close are the statistics to each other in order for me to declare that

mixing has taken place? How many samples do I collect?

What window size do I use to evaluate mixing?

All of these are design choices, they all make a difference, and so there's a lot

of finicky tuning that needs to be done when running an MCMC method in

practice. The second is that, depending on the

design of the chain, this can be quite slow to converge because it's very

difficult to design chains that have good mixing properties.

And finally, it's very hard to tell whether a chain is working.

That is, it's not straightforward to determine whether the chain has mixed,

and whether my samples are sufficiently uncorrelated with each other

that I'm getting a reliable estimate of whatever it is that I'm trying to figure

out. So, a lot of advantages but also some

significant disadvantages.