
Over the course of the last unit that we've covered in the class, we've discussed a range of different inference methods: some exact, some approximate, sometimes solving the problem of computing marginals, sometimes computing MAP assignments. And the question is, if we're faced with a particular task and we've designed a graphical model for it, which inference algorithm should we use? So what are some of the choices, and what guidance can we offer about which of these methods to use?

The first question we should answer for ourselves is which task is more suitable for our problem: are we trying to solve the problem of computing a MAP assignment, or are we trying to solve the problem of computing marginals?

And these offer sometimes non-obvious trade-offs in terms of what's right for our application. Computing marginals, by and large, is a less fragile way to go, because we don't pick just a single hypothesis. We compute a probability distribution over a range of different options, and we can see, for example, whether our top option is a little bit more likely than the second best, or a lot more likely. That gives us some guidance as to how robust our inferences are. It also gives us some amount of confidence that we can attach to our answers, and perhaps give to the user as guidance about how much to trust the answers produced by the system. And as we've also seen, computing probabilities is important for supporting decision making, for example once we need to integrate with a utility model.
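To make this concrete, here is a minimal sketch of using marginals as a confidence signal; the toy model and all the factor values are made up purely for illustration, and the marginal is computed by brute-force enumeration:

```python
import itertools

# Toy pairwise model over three binary variables; values are illustrative.
factors = {
    ("A", "B"): {(0, 0): 10.0, (0, 1): 1.0, (1, 0): 2.0, (1, 1): 4.0},
    ("B", "C"): {(0, 0): 5.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 5.0},
}
variables = ["A", "B", "C"]

def joint_score(assignment):
    """Unnormalized measure: product of the factor values."""
    score = 1.0
    for (u, v), table in factors.items():
        score *= table[(assignment[u], assignment[v])]
    return score

def marginal(var):
    """Brute-force marginal P(var) by summing over all full assignments."""
    totals = {0: 0.0, 1: 0.0}
    for values in itertools.product([0, 1], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        totals[assignment[var]] += joint_score(assignment)
    z = sum(totals.values())
    return {value: weight / z for value, weight in totals.items()}

p = marginal("A")
best, second = sorted(p.values(), reverse=True)[:2]
# A large ratio suggests a robust answer; a ratio near 1 says the
# runner-up is almost as plausible, so we should trust the answer less.
print(f"P(A) = {p}, top-to-second ratio = {best / second:.2f}")
```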

On the other hand, MAP is suitable in its own set of contexts: for example, if we're trying to compute a coherent joint assignment, say in a speech recognition problem or an image segmentation problem. Sometimes it's more important that we get something that makes sense as a whole than that we get the best possible answers to the individual pieces of the inference problem, like individual pixels or individual phonemes, but in a way that doesn't make sense in the context of the larger problem. And so the notion of coherence is sometimes important, and is worth the trade-off of reduced robustness.

We've also seen that, from a computational perspective, MAP has a range of model classes that are tractable beyond those for computing marginals. So for computational reasons we might sometimes adopt a MAP solution just for improved efficiency. And we've also seen that, especially when using approximate inference, the MAP assignment often comes with theoretical guarantees on how close our answers are to the correct answers; we saw that, for example, in the context of dual decomposition. By contrast, it's difficult to get a similar level of confidence in the accuracy of approximate inference for computing marginals.

On the other hand, when running approximate inference, there are again different trade-offs between these two classes of problems. When computing marginals, one often finds with approximate inference algorithms that, just by having soft assignments, errors are attenuated: a source of inaccuracy in one part of the model often doesn't make its way significantly into other parts. And so you often get more robust answers from approximate inference when computing marginals.

On the MAP side, one can at least gauge in many cases, at least with some algorithms such as dual decomposition, whether the algorithm is working or not by looking at the value of the assignment that we get from it. So, again, somewhat different trade-offs. And in some applications, people actually try both and see which one works better in terms of performance on the downstream tasks that one cares about.

So what are some algorithms that we've discussed for doing marginal inference? First is just good old exact inference, say variable elimination and clique trees. By and large, if the problem is small enough that exact inference fits in memory, in terms of the sizes of the cliques and the sepsets, it's probably good to use exact inference. You run into fewer problems that way, and if it fits in memory, it's probably going to be very fast.
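As an illustrative sketch of the exact route (not code from the course), here is one step of variable elimination on factors stored as plain dictionaries, eliminating B from the toy A-B-C chain above:

```python
import itertools

# A factor is (scope, table): a tuple of variable names plus a dict
# mapping value tuples (in scope order) to nonnegative values.
def multiply(f1, f2):
    """Multiply two factors into one over the union of their scopes."""
    scope1, t1 = f1
    scope2, t2 = f2
    scope = tuple(dict.fromkeys(scope1 + scope2))  # union, order-preserving
    table = {}
    for values in itertools.product([0, 1], repeat=len(scope)):
        a = dict(zip(scope, values))
        table[values] = (t1[tuple(a[x] for x in scope1)]
                         * t2[tuple(a[x] for x in scope2)])
    return scope, table

def sum_out(factor, var):
    """Eliminate var from a factor by summing it out."""
    scope, table = factor
    idx = scope.index(var)
    new_scope = tuple(x for x in scope if x != var)
    new_table = {}
    for values, v in table.items():
        key = values[:idx] + values[idx + 1:]
        new_table[key] = new_table.get(key, 0.0) + v
    return new_scope, new_table

# Eliminate B from the chain A - B - C (same illustrative tables as before):
fab = (("A", "B"), {(0, 0): 10.0, (0, 1): 1.0, (1, 0): 2.0, (1, 1): 4.0})
fbc = (("B", "C"), {(0, 0): 5.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 5.0})
tau = sum_out(multiply(fab, fbc), "B")  # resulting factor over (A, C)
print(tau)
```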

If one is not in this fortunate situation, we've discussed different types of approximate algorithms. There's the range of algorithms that do message passing over the graph, of which loopy message passing is one of the more common ones, but not the only one. And then there's the class of sampling methods that sample from the distribution. These are different categories of algorithms.
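Here is a minimal sketch of the sampling route, a plain Gibbs sampler that estimates marginals by reusing joint_score and the toy model from the earlier sketch (again, an illustration, not a production implementation):

```python
import random

def gibbs_marginals(variables, joint_score, n_samples=10000, seed=0):
    """Estimate single-variable marginals for binary variables by Gibbs sampling."""
    rng = random.Random(seed)
    state = {v: rng.randint(0, 1) for v in variables}
    counts = {v: [0, 0] for v in variables}
    for _ in range(n_samples):
        for v in variables:
            # Resample v from its conditional given all other variables.
            weights = []
            for value in (0, 1):
                state[v] = value
                weights.append(joint_score(state))
            state[v] = 0 if rng.random() < weights[0] / sum(weights) else 1
        for v in variables:
            counts[v][state[v]] += 1
    return {v: [c / n_samples for c in counts[v]] for v in variables}
```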

And frankly, it's often difficult to tell in advance which of them is going to work better for a given class of models. We'll talk a little bit about factors that might favor one of these algorithms over another on a subsequent slide, in a couple of minutes. What about algorithms for MAP?

Once again, we have algorithms that perform exact inference, and in this case it's actually a broader spectrum than in the case of computing marginals. In addition to cases where we have low tree width, which is the category where exact marginal algorithms work, we've also seen other examples, such as models with associative or regular potentials, and multiple other cases. And once again, if you can perform exact inference, that's often the best way to go: it's going to give you the best performance if it's tractable.

Then there's the class of methods based on optimization. Those can be exact methods, which puts us in the previous category, but more often we're going to find ourselves having to use some kind of approximate method, such as the dual decomposition algorithm that we've talked about. And those methods, as we've said, often lend themselves to at least being able to estimate the performance of our algorithm relative to the optimal answer, even if we can't get to the optimal answer.
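The bookkeeping behind that estimate is simple; here is a sketch (the function and variable names are my own, not from the lecture). The sum of the slaves' local optima upper-bounds the true MAP value, and the score of any decoded joint assignment lower-bounds it, so the gap between them bounds how far we are from the optimum:

```python
def duality_gap(slave_optima, decoded_assignment_score):
    """Gap between the dual upper bound (sum of slave optima) and the
    primal lower bound (score of a decoded joint assignment). A small
    gap certifies that the decoded assignment is near-optimal."""
    return sum(slave_optima) - decoded_assignment_score

# e.g. duality_gap([3.2, 1.7, 2.4], 7.0) -> 0.3
```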

Finally, there's the range of algorithms that perform search over the space. This can be simple hill-climbing search, which is a fairly straightforward application of standard search methodologies, but it's also quite common to use some kind of sampling, like Markov chain Monte Carlo, to explore a range of different assignments in the space, and then select among those the one that has the highest score, or the highest log-probability. That's a quite commonly used technique; it doesn't provide the same guarantees that you might get from the optimization methods, but it's often very easy to implement, and so it's quite frequently used.
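As a sketch of how easy it is: run the Gibbs sampler from the earlier sketch and simply keep the highest-scoring assignment it visits (again reusing joint_score from the toy model above):

```python
import random

def map_by_sampling(variables, joint_score, n_samples=5000, seed=0):
    """Approximate MAP by tracking the best-scoring state visited by Gibbs."""
    rng = random.Random(seed)
    state = {v: rng.randint(0, 1) for v in variables}
    best, best_score = dict(state), joint_score(state)
    for _ in range(n_samples):
        for v in variables:
            weights = []
            for value in (0, 1):
                state[v] = value
                weights.append(joint_score(state))
            state[v] = 0 if rng.random() < weights[0] / sum(weights) else 1
        score = joint_score(state)
        if score > best_score:
            best, best_score = dict(state), score
    return best, best_score
```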

So if we're resorting to the use of approximate inference, what are some of the issues that might make our lives more complicated, or might favor one algorithm over another? One complication has to do with the connectivity structure: just how densely connected the model is. By and large, the more densely connected the model is, the worse it is for message-passing algorithms. Message-passing algorithms don't like densely connected models, because the messages are propagated over very short cycles, and that can cause both lack of convergence and lack of accuracy. Sampling methods are less affected by the connectivity structure.

The second complicating factor is the strength of influence: that is, the extent to which nodes that are connected to each other have tight coupling in terms of their preference for certain combinations of values. In general, the stronger the influence, the harder it is for both categories of algorithms, because it creates strong coupling between variables. That can complicate both message-passing algorithms and sampling algorithms; for example, if you think about plain Gibbs sampling, strong coupling makes it very difficult to move away from the current configuration to a different one.
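A tiny worked example of that trap, with numbers of my own choosing: two binary variables coupled by a strong agreement potential.

```python
# phi(x1, x2) = 100 if the variables agree, 1 if they disagree.
phi = lambda x1, x2: 100.0 if x1 == x2 else 1.0

# Starting from (0, 0), a Gibbs step resamples x1 given x2 = 0:
#   P(x1 = 1 | x2 = 0) = 1 / (100 + 1) ~ 0.0099
# so escaping to the equally good mode (1, 1) requires an unlikely move
# through a low-probability disagreeing configuration.
p_flip = phi(1, 0) / (phi(0, 0) + phi(1, 0))
print(p_flip)  # ~0.0099
```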

Strength of influence becomes an even worse problem when the influences go in different directions: for example, if you have loops where one path tends to make the variables take on one configuration, and a different path tries to make them take on a different combination. So, for example, as we saw in the misconception network, one path wants a pair of variables to agree on their values and the other wants them to disagree. That really does create significant problems for both classes of methods.

And the reason for that, and this I think is at the heart of what makes approximate inference hard, is the shape of the likelihood function: if the likelihood function has multiple peaks, that makes life difficult for most approximate algorithms, and the sharper those peaks, the more complicated things get. Multiple peaks are generated by strong influences that go in opposite directions, and that's really where a lot of the complexity in approximate inference comes from.

Okay, so now, assuming you have a model with these problematic issues, how do we address them? First, look at the network and identify the problem regions: subsets of variables that are tightly coupled and maybe subject to opposing influences. Then we try to think about how we might make the inference in these problem regions more exact. So if we have some tightly coupled region, maybe with opposing influences, how do we prevent our approximate inference algorithm from falling into the trap of giving imprecise answers in this particular area, or of failing to converge?

For example, in cluster-graph methods, we can put this entire problem region into a single cluster. That costs us something on the computational side, because we have to deal with larger clusters, but it might be well worth it in terms of improved performance.

In sampling methods, you might consider proposal moves over multiple variables: instead of sampling individual variables one at a time, we can sample this entire block, using a somewhat more expensive sampling procedure, as in the sketch below. But again, that might end up being well worthwhile in terms of the overall performance profile of the algorithm.
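Here is a sketch of such a block move (my own illustration, again reusing a joint_score function over the full state). It jointly resamples all the variables in the block from their exact conditional given the rest, at a cost of 2^|block| evaluations:

```python
import itertools
import random

def block_gibbs_step(state, block, joint_score, rng):
    """Jointly resample the variables in `block` from their conditional
    given all other variables. More expensive than single-variable moves,
    but can jump directly between modes that tight coupling would
    otherwise separate."""
    options, weights = [], []
    for values in itertools.product([0, 1], repeat=len(block)):
        for var, value in zip(block, values):
            state[var] = value
        options.append(values)
        weights.append(joint_score(state))
    chosen = rng.choices(options, weights=weights, k=1)[0]
    for var, value in zip(block, chosen):
        state[var] = value
    return state
```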

And finally, if we're thinking of this in the context of a MAP problem, we can put this entire set of variables into a single slave, which again costs us something on the computational side, but can significantly improve the convergence profile of the algorithm.

So really, when we're faced with a problematic model, one that doesn't immediately succumb to traditional inference techniques, it turns out that the design of inference is often a bit of an art. That is, we need to study our model in depth, think about how to deal with different pieces of it, and decide which inference method is best equipped to handle each piece. And we often find that for complicated models, you can get the best performance from a combination of different inference algorithms, where some pieces are handled using exact inference, such as these larger blocks, in the context perhaps of an approximate inference method such as sampling or belief propagation. And so one needs to think creatively about the inference problem in the context of these more problematic models.