We've spent the entire course talking about probabilistic graphical models: we've talked about variants of probabilistic graphical models on the representation side, about different inference algorithms that can be used to answer various types of questions about these models, and about algorithms that can be used to learn probabilistic graphical models from data. Now let's take a step back, think about probabilistic graphical models as a whole, and look at what they might do for us.

So, first, why probabilistic graphical models? Probabilistic graphical models are in some sense the synthesis, or the marriage, of foundational ideas from statistics and important techniques from computer science. From the statistics side, we get sound probabilistic foundations: probability distributions as a way of thinking about uncertainty, manipulations on those distributions such as conditioning, and techniques such as value of information and decision making that can be built on top of a probability distribution in order to make decisions. From the computer science side, we get the idea of taking a probability distribution over a very high-dimensional space and using graphs as a data structure, together with algorithms that exploit that structure, to manipulate these high-dimensional objects very efficiently. It is these ideas from computer science that allowed us to take the sound probabilistic foundations and make them applicable to the kinds of complex real-world problems that we now face in many scenarios.

A key component of the ability to take a high-dimensional probability distribution and consider it as an object is that we obtain a declarative representation of our knowledge about the world. That declarative representation, the model, a probabilistic graphical model, is a stand-alone unit that we can consider separately from any algorithm we then use to manipulate it. That's important because it allows us to refine the model separately from refining the algorithm that we're using: we can develop the model, consider whether it's appropriate for the application, and then apply one or multiple different algorithms to try and answer questions about that model. It also allows us to separate the model construction phase from the inference phase; model construction can involve one or both of elicitation from a domain expert and learning from data, and we talked extensively about learning methods and also, to some extent, about elicitation methods.

Now, when would we apply probabilistic graphical models? We would apply them when we have noisy data and uncertainty, which is pretty much true in any application, so that's pretty much a given. They're also particularly useful when we have lots of prior knowledge that we want to embed into our model; it's often easier to do this in the context of probabilistic graphical models than in a variety of other approaches. For example, in many traditional machine learning settings it's actually quite difficult to incorporate prior knowledge in as natural a way as one can do here. Probabilistic graphical models are also particularly useful when we wish to reason about multiple interrelated variables. If you're trying to predict just a single variable, probabilistic graphical models might not be the ideal solution; there might be a range of other methods, say from traditional machine learning, that are equally or more competitive.
But when you're trying to reason about multiple variables simultaneously, the ability to incorporate the interrelationships between them is often very powerful, and here probabilistic graphical models can really shine. Another place where probabilistic graphical models bring significant value is when we want to construct richly structured models from modular building blocks. When we're trying, for example, to do fault diagnosis or medical diagnosis, we often have small building blocks that recur across different models for different diagnostic problems, and this architecture allows us to take those building blocks and construct very rich models from them very quickly by reusing components.

So, we've talked about probabilistic graphical models in terms of a set of design choices: choices on the representation side, the inference algorithm that we choose to use, and the learning algorithm that we might adopt should learning be required. It turns out that although the declarative representation allows us to separate these choices, consider each of them in isolation, and develop each of them separately, they're not actually totally independent, and it's important to consider the interactions between them.

On the representation side, it's clear that which representation we choose affects the cost of both inference and learning. We've talked about differences in the cost of inference for different types of graphical models, and we also saw in the learning setting that there are very different learnability and complexity issues when learning, say, directed versus undirected models. The inference algorithm is also interrelated with everything else: it is used as a subroutine in many of the learning algorithms that we showed, for example in MRF or CRF learning even in the complete-data case, and in any setting where we're learning with incomplete data. So thinking about the complexity and properties of our inference algorithm is critical to designing good learning algorithms. We also know that some inference algorithms are only usable in certain types of models, so aiming for a particular inference algorithm affects our choice of representation. And finally, the learning algorithm also imposes significant constraints on the modeling, because our ability to identify the correct patterns from the data imposes significant constraints on, for example, the expressive power of our model: only certain types of patterns may be identifiable given the amount of data we have and the type of data we have, complete or incomplete, for example.

So let's look at one example of some of these tradeoffs in the context of image segmentation. Here, one of the first design decisions we need to make is whether we're going to model this as a Bayesian network, that is, a directed graphical model, as an MRF, or as a CRF, where we're trying to predict the segment labels, the superpixel labels S_i, given features X. That is, we're modeling P(S | X) rather than a joint distribution over S and X; a minimal sketch of that conditional form is given below.
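To make the distinction concrete, here is a minimal sketch of the log-linear, conditional form a pairwise CRF over superpixel labels might take. The superpixel graph (V, E), the node feature functions f, and the pairwise potentials phi_ij are illustrative assumptions, not anything fixed by the lecture:

```latex
% Pairwise CRF over superpixel labels S = (S_1, ..., S_n), conditioned on image features X.
% The graph (V, E), the node feature functions f, and the pairwise potentials \phi_{ij}
% are modeling choices assumed here for illustration.
P(S \mid X) \;=\; \frac{1}{Z(X)}
  \exp\Big( \sum_{i \in V} \theta^{\top} f(s_i, x_i)
          + \sum_{(i,j) \in E} \phi_{ij}(s_i, s_j) \Big),
\qquad
Z(X) \;=\; \sum_{S'} \exp\Big( \sum_{i \in V} \theta^{\top} f(s'_i, x_i)
          + \sum_{(i,j) \in E} \phi_{ij}(s'_i, s'_j) \Big)
```

The partition function Z(X) is computed per image, conditioned on its features, so we never have to model a distribution over the rich, correlated features X themselves; that is exactly the property the next part of the discussion relies on.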
What are some of the tradeoffs that we might want to consider for this kind of decision? One is the naturalness of the model. Just thinking about it initially, it seems fairly clear that a directed model, where we have directed edges from one superpixel to another, is not a particularly natural choice in this context, because which direction would we pick? What implications would it have if we went, for example, with directed edges from the top right corner of the image to the bottom left corner? It just doesn't seem to make sense. So a more natural modeling paradigm here seems to be the MRF or the CRF.

That still leaves the question of which of those we should use. Here, for example, one possible design criterion to consider is that we might want a very rich feature representation: features X that are richly structured and capture, say, different gradients of texture and color across the image, or different histograms. These rich features are often highly correlated with each other, and we know that correlated features are hard to model, in the sense of correctly capturing their correlation structure. So in this case a CRF seems like it might be a better choice, because it avoids the question of how to model the correlation structure among all these rich features: it allows us to use rich features without falling into the trap of incorrectly modeling their correlations, which might diminish our performance.

Another related question we need to consider is the cost of inference in this kind of model, and that comes back and influences the modeling. For example, as we've discussed, when the structure of the potentials in the MRF or the CRF is associative, or regular, we know that we can apply very efficient inference algorithms that utilize min-cuts and other combinatorial graph algorithms, which are much, much faster than applying, say, message passing schemes. That's all very well, but it imposes on us the constraint of making sure that our potentials are in fact associative, and that has implications for the ability of our model to correctly capture the structure of the distribution. For example, it is very easy to capture, using associative potentials, the fact that we expect adjacent pixels to have the same label. Not quite so easy to capture is, for example, the tendency (not a constraint, but a tendency) that cows are on top of grass, as opposed to on top of water, road, or sky. That is a potential between one class and a different class, and it's very difficult to come up with an associative function that actually captures these kinds of patterns; the short sketch after this passage illustrates the distinction.

Another thing to consider is training cost. Here, in fact, MRFs and CRFs are more expensive to train than Bayesian networks. We might still want to take the hit because of the benefits on the modeling side, but we've already seen that in the case of complete data, training a Bayesian network is considerably more computationally effective than training an MRF or a CRF, so this is a place where we're paying for a design decision we made on the representation side.
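To illustrate that distinction, here is a minimal sketch in Python; the label set, the weight lam, and the affinity table are purely illustrative assumptions, not values from the lecture.

```python
# A minimal sketch contrasting an associative (Potts-style) pairwise potential
# with a class-affinity potential that associativity cannot express.
# Labels and numeric values below are illustrative assumptions.

LABELS = ["cow", "grass", "water", "road", "sky"]

def potts_energy(s_i, s_j, lam=1.0):
    """Associative potential: penalizes neighbors that disagree, regardless of
    which two labels they take. For binary labels, energies of this form are
    submodular, which is what makes min-cut / graph-cut inference applicable."""
    return 0.0 if s_i == s_j else lam

# A "cows sit on top of grass" preference depends on WHICH two labels meet and on
# the direction of the edge (above/below), so it needs a table of affinities:
ABOVE_BELOW_AFFINITY = {
    ("cow", "grass"): -1.0,   # low energy: cow above grass is encouraged
    ("cow", "water"): +1.0,   # high energy: cow above water is discouraged
    ("cow", "road"):  +0.5,
    ("cow", "sky"):   +2.0,
}

def affinity_energy(s_above, s_below):
    """Non-associative potential: rewards particular pairs of *different* labels
    in a particular spatial direction, which potts_energy cannot represent."""
    return ABOVE_BELOW_AFFINITY.get((s_above, s_below), 0.0)
```

The first function cares only about agreement between neighbors, which keeps the energy within the family that the fast combinatorial algorithms handle; the second rewards a specific pair of distinct labels in a specific spatial direction, which is precisely the kind of pattern an associative potential cannot express.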
Finally, if we have to learn with missing data, that sets up yet another constraint that we want to think about, and one that might not be obvious going in. Say we have a scenario where the segment labels are not observed; that is, we're trying to learn to segment in an unsupervised way, from unlabeled data. In this case the S's are unobserved, they are latent, which means that optimizing P(S | X) doesn't even make sense, because S is never observed. So in this case we simply cannot learn a CRF: the objective of optimizing the labels S given the features is completely undefined in this setting, and at that point we have to resign ourselves to learning with a different representational scheme. This is another case where we see the intertwining between the different choices that we make and the different constraints that we have; here, learning with missing data interacts with the choice of representation in potentially unexpected ways.

Now, one important benefit of probabilistic graphical models and their modularity is that they allow us to mix and match different ideas within the modeling framework. That's important because it gives us a much greater richness of options than treating each individual topic we learned as a solution in and of itself; we can often combine those ideas in interesting ways. To see one example, let's go back to the image segmentation problem we just talked about and think about models that actually mix directed and undirected edges. Here, if we're trying to learn image segmentation from unlabeled data, we might have undirected edges over the labels S, as an MRF, just as we see in this diagram, which is the natural directionality (or rather lack of directionality) that we discussed before, but directed edges from the S's to the features X_i. This is a model that is half undirected, at the level of the labels, and directed going from the S's to the X's.

Why is that good? First, it gives us a meaningful optimization objective: we're trying to optimize, using our model parameters and our model structure, the ability to explain the images that we've seen in terms of the segment, or class label, characteristics. In that respect it's a generative model, and by forcing us, in some sense, to generate the observed data, it can allow us to learn something about the statistics and patterns in the distribution. But by making the connection to the features directed, as opposed to undirected (which we could have done with an MRF over the entire set of variables, and that would also have been a generative model), it greatly facilitates the learning, because each of the CPDs P(X_i | S_i) can be learned separately, without optimizing over the model as a whole, and so learning is considerably more efficient; a small sketch of that decoupling is given below. So this is a case where mix and match is a better strategy than a unified model that is entirely directed or entirely undirected.
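Here is a minimal sketch of that decoupling, under the illustrative assumption that each X_i is a real-valued feature vector and each CPD P(X_i | S_i = k) is a diagonal Gaussian. Given assignments to the S's (observed labels, or hard assignments imputed during an EM-style procedure when the labels are latent), fitting every class-conditional reduces to per-class sufficient statistics:

```python
import numpy as np

def fit_class_conditionals(X, S, n_classes):
    """Fit each CPD P(X_i | S_i = k) as a diagonal Gaussian.

    X: (n_superpixels, n_features) feature vectors.
    S: (n_superpixels,) integer class assignments in {0, ..., n_classes - 1}
       (observed, or imputed in the E-step when the labels are latent).

    Because the S -> X edges are directed, each class-conditional is estimated
    from simple per-class statistics; no partition function over X is involved.
    """
    params = []
    for k in range(n_classes):
        X_k = X[S == k]
        if len(X_k) == 0:             # no superpixels currently assigned to class k
            params.append(None)
            continue
        mu = X_k.mean(axis=0)
        var = X_k.var(axis=0) + 1e-6  # small floor to avoid zero variance
        params.append((mu, var))
    return params
```

Learning the undirected potentials over the S's, by contrast, still requires inference over the label field; pushing the feature side of the model into directed CPDs is what keeps that part of the learning cheap.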
Another type of mix and match can occur on the inference side. We learned about a variety of different inference algorithms, and we might be tempted to just take a model and throw it into the black box of one inference algorithm or another. But it turns out to be much more beneficial, in certain cases, to consider applying different inference algorithms to different pieces of the model. For example, we might use something like belief propagation or Markov chain Monte Carlo on some subset of the variables, while on certain other subsets we do exact inference, which gives much more accurate results over those subsets and can also make the convergence properties of our algorithm, as well as the accuracy we get, considerably better.

Let's take an example. This is a bipartite MRF that we have seen before, where we have a set of variables A on one side and B on the other, and let's assume that there's fairly dense connectivity, maybe even full connectivity, between the A's and the B's. Now let's think about which algorithm we might want to use for this kind of model. If there is a very small set of A's or a very small set of B's, we can use exact inference, but that's not usually the case, so we might have to resort to approximate inference. One possibility is to apply belief propagation, but if you look at this graph you can see that it has a very large number of fairly tight loops, and we know that tight loops are a problem for belief propagation, so we might be worried about the quality of the convergence and the quality of the answers that we get. Another approach is to use Markov chain Monte Carlo, such as Gibbs sampling, but in this case, if we have a lot of A's and a lot of B's, we're sampling over a very high-dimensional space, and we know that sampling methods in high-dimensional spaces again have limitations in terms of the rate of convergence and the quality of the answers.

So one possibility we might consider here is a variant of Markov chain Monte Carlo where we sample only over the A's, say if there are fewer A's than B's. We sample over the A's, and then for each sampled assignment a_1, ..., a_K, which is one of our particles, we maintain a distribution over the B's in closed form; an expectation over the B's is then estimated as the average, over the particles, of the exact conditional expectation given each sampled assignment to the A's. So we're effectively doing exact inference over the B's and sampling only over the A's, which is going to considerably improve the rate of convergence as well as the statistical properties of our estimator. This is just one example; we could also talk about others, such as belief propagation variants in which we construct larger clusters and run exact inference within each cluster. There are many, many such examples where mixing different inference algorithms can be tremendously beneficial.

Mix and match on the learning side is equally useful: again, we can apply different learning algorithms to different parts of the model. Going back to our image segmentation example, and now assuming we're back in the case where we're learning from fully observed data, so we're going to consider a CRF, it's very common to train a high-quality, easily trainable model such as a support vector machine or some kind of logistic regression for P(S_i | X). We can train that very efficiently, and we can use sophisticated machine learning tricks like kernel methods and various other ideas to get really high-accuracy node potentials. Then we fix those and use CRF learning for the higher-order potentials, say the pairwise or even higher-order ones. It turns out that when you do that, it's actually less susceptible to overfitting in many cases, and it's much, much faster to train; a small sketch of this two-stage approach follows.
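Here is a minimal sketch of such a two-stage scheme, assuming superpixel features X, integer labels S, and an adjacency list of superpixel edges. The specific choices of logistic regression for the node potentials and of a single Potts weight fit by pseudo-likelihood for the pairwise stage are illustrative simplifications, not the procedure from the lecture:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_two_stage_crf(X, S, edges, w_grid=np.linspace(0.0, 3.0, 31)):
    """Two-stage training sketch for a pairwise CRF over superpixels.

    X:     (n, d) feature vectors, one per superpixel.
    S:     (n,) observed integer labels 0..K-1 (fully observed training data).
    edges: list of (i, j) pairs of adjacent superpixels.

    Stage 1: fit a discriminative classifier for P(S_i | X_i) and freeze its
             log-probabilities as node potentials.
    Stage 2: fit a single Potts smoothing weight w by maximizing the
             pseudo-likelihood, which only needs local conditionals and so
             avoids the global partition function.
    """
    # Stage 1: node potentials from logistic regression
    # (column k of predict_log_proba corresponds to label k when labels are 0..K-1).
    clf = LogisticRegression(max_iter=1000).fit(X, S)
    node_pot = clf.predict_log_proba(X)              # shape (n, n_classes)
    n, n_classes = node_pot.shape

    # Precompute neighbor lists.
    nbrs = [[] for _ in range(n)]
    for i, j in edges:
        nbrs[i].append(j)
        nbrs[j].append(i)

    # Stage 2: grid-search the pairwise weight under the pseudo-likelihood.
    def pseudo_loglik(w):
        total = 0.0
        for i in range(n):
            # For each candidate label k, count neighbors that agree with it.
            agree = np.array([sum(S[j] == k for j in nbrs[i]) for k in range(n_classes)])
            scores = node_pot[i] + w * agree
            total += scores[S[i]] - np.logaddexp.reduce(scores)
        return total

    best_w = max(w_grid, key=pseudo_loglik)
    return clf, best_w
```

In practice the second stage may learn much richer pairwise or higher-order potentials with a proper CRF objective, but the structure is the same: the expensive discriminative fitting is done once per node, and the coupled part of the model only has to estimate a small number of interaction parameters.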
So by putting together different pieces of functionality, different ideas, it turns out that we can build some very nice integrated applications, and I want to end this discussion with one example of a cool application that puts together a whole range of different ideas in a single framework. This is work from Microsoft Research, which has a longstanding tradition in the use of probabilistic graphical models, and it puts together ideas from speech recognition, from diagnosis, and from decision making under uncertainty. So let's watch the video and then talk about it briefly.

"Hi, thanks for coming to the triage kiosk. I am in touch with the best doctors in the world. Are you here for your child or yourself?" "No, for my child." "So, how can I help you?" "She's been having diarrhea." "Oh, no. Sorry to hear that. Hi there. What's your name?" "Manu." "And how old are you?" "Five." "So, five years old, right?" "Yeah. Almost six." "Okay. Six years old. Have you felt nauseous?" "I don't know."

What we can see here is the use of value of information to figure out which questions ought to be asked, the ones that are going to be most useful for improving the quality of the diagnosis and the ability to make decisions in the Bayesian network that we see over here on the right.

"I'm sorry. Has your child been vomiting at all?" "No." "Okay. So let's see, has there been a fever?" "Slight." "Have you been feeling tired lately?" "Yes." "Let's see, how about abdominal pains?" "He's not been complaining about it." "Has your tummy been hurting?" "Yes." "Oh, I'm really sorry. I hope it will heal better soon."

Another component here is the use of the current model, the current understanding of the scenario, to guide the speech recognition system so as to get better-quality results from it. Instead of doing totally unguided speech recognition, one uses the current scenario, as the system understands it, to guide the priors on what the person is probably saying.

"Okay. Do you happen to know if your child has been losing weight?" "No." "That's okay. Have you noticed whether your child's stool looks funny? Is it black or darker than usual?" "It is darker than usual." "Okay, let's see. Considering everything you've told me, I'd like to schedule an appointment with the doctor just to be sure. However, not to worry. I am not that concerned at this point."

So, to summarize. Probabilistic graphical models provide us with an integrated, coherent framework for reasoning and learning in complex, uncertain domains. People over the years have developed an enormous suite of tools that can be used for accomplishing different tasks within this framework, and we can put them together within a single unified framework to solve much harder problems than any single tool can solve alone. We've seen, over the different examples in this course, that this framework is used in a huge range of applications, and there are many more in which it can be used. So overall, we can see that there are a lot of places where one can continue to work in this paradigm, both in utilizing it for applications that might be of interest and for new ideas that come up, and in improving the foundational methods for representation, inference, and learning, which could allow us to address yet new problems.