0:36

So first let's reconsider the learning problem from very high up.

The learning problem can be viewed in terms of three different components.

The first is the hypothesis phase, the classic of models that we're interested

in considering. The second is the objective function that

we're trying to optimize when we're applying the learning algorithm.

And the third is the optimization itself that aims to optimise the objective

function to give us some good model. And pretty much all of the learning

algorithms, that we've discussed, can be viewed in terms of these three different

components with just different combination of objective function or

optimization algorithms hypothesis base. So let's look at each of these components

in turn. The hypothesis space is what we're

searching for, and there's really two different pieces that we can be searching

for, the first is the parameters. Of the algorithm, and the second is the

structure. And in some cases we're searching for

one. In some we can search for the other.

And in many cases we're actually searching for both.

And this sets up a different depending on which choice we make, it sets up a

different learning problem. Some is a, in some cases it's a

continuous learning problem when we're searching for parameters.

Searching for structure is usually discrete and when we're searching for

both, it's actually hybrid search problem.

Then the second part of the hypothesis space is when we impose a certain set of

constraints on what we're searching for. And this can be done for multiple

reasons. We can constrain the hypothesis space for

computational efficiency. A smaller space is often easier to search

for. And admits different algorithms that

might be, computationally more efficient. the second is that we might restrict the

search space to reduce the model capacity.

That is the, the complexity of the models that we're considering.

And that can reduce the risk of overfitting because we're searching over

a smaller set of possibilities. And so, less likely to overfit, random

statistics in the training set. And then, a third form of con-,

constraint is when we have some kind of prior knowledge that we want to encode

and that allows us to guide the learning algorithm towards models that are more

reasonable than in a totally unconstrained way.

The second piece is the objective function.

So a most, perhaps the most commonly used objective function is that of a penalized

likelihood. so penalized likelihood has a log

likelihood component, which measures the likelihood of the log likelihood of both

our graph structure and the parameters relative to the training set.

But that has to be accompanied in most cases by a second term which is this

regularizer, which serves to guide the model of the learning toward simpler

models that are less likely to over fit the data.

And that can, that regularizer can include one or both of a parameter prior,

which guides the parameters usually towards more smooth models.

So for example, in the case of MRF's, we have

L2 or L1 as the regularizer. And in the case of Basemat,

we had some kind of Dirichlet prior in many cases.

Those are some examples that we've looked at and both of these tend to smooth the

parameters and avoid over fitting to the statistics of the training set.

And then the second form of regularization is the structure

complexity penalty, which of course, is only relevant if

we're also searching over structures. Which aims to push us towards models that

have fewer parameters or fewer edges. Again, reducing over fitting.

An alternative paradigm, which we also discussed, is what we called the baysian

score. The Bayesian score is a score that's use

when we're really searching over graph structures alone and integrating out over

parameters. So this is a case where a hypothesis

space is really just a space of structures and the parameters are kind of

implicit or integrated out. And the Bayesian score or, which is the

log of the probability of the graph given the data, is equal to the log of this

guide, the probability of the data given the graph which as a reminder is called

the marginal rythm and the log of graph prior.

Now, importantly, as a, as just as a reminder, the graph prior is a

regularizer. So, log of P of G is a regularizer over

graphs. But log of P of D given G, that is the

log of the marginal likelihood incorporates within it a regularization

term as well as we've seen before. And also serves to reduce the

over-fitting of the model. To the data.

The final component is the optimization algorithm which of course depends both on

the space that we're trying to optimize over as well as the objective that we're

trying to optimize. So if we're trying to optimize purely

over the continuous space, then in some cases we're very lucky.

We showed, for example, that for Bayesian networks with multinomial CPD's, we can

actually optimize the likelihood in closed form, and finding unique optimum

5:56

for the parameters. And this also happens when we have a

penalized likelihood, or even a full Dirichlet distribution over the

parameters. In other cases we're not so lucky and we

need to somehow optimize the function when we can't have an algorithm that

finds it in close form. So in this case we use gradient ascent or

some other kind of perhaps second order method that allows us to hill climb over

the likelihood function. And we saw that this happens for example

for both MRF learning as well as learning with missing data.

For both Bayesian and Cyrus. And of course gradient ascendant's naive

form is a fairly slow method. But there are second order methods that

build on top of gradient ascent, for example some kind of conjugal gradient,

or LBSGS that uses similar computation but that converges much faster.

Finally, there is the expectation maximization algorithm, which is an

algorithm that's specifically geared for learning likelihoods, learning with

missing data. So optimizing the log likelihood function

in cases where the log likelihood is multimodal because of the case of missing

data and we talked about some of the tradeoffs of.

Local optima, for example, in this method.

If we're trying to do discrete optimization, trying to search over the

parameter space, again in some cases, we can be lucky.

So when we're trying to do optimization over tree structures, we saw that a very

nice simple algorithm that's very computationally efficient, that of

finding a maximal weight expanding tree, can allow us to find the optimal scoring

tree in a very efficient, very efficiently, in polynomial time.

In other cases this space is more complicated, and we can do an

optimization that's guaranteed to give us, the optimal answer.

And, in this case, we typically end up doing some kind of local hill climbing,

where we And in this case we typically end up

doing some local hill climbing where we do some things like add delete or move

edges. And

And that gradually gets us to a better to a better optimum, there are also

algorithm's that takes larger steps in the space.

And finally an interesting combination is when we have to search over both the

discreet and continuous Have space together when we're trying to

optimize over both parameters and structures and that ends up being in many

cases quite complicated because when your taking a space on, on the discreet side

you also have to optimize the parameters for that new structure before you can

score that structure to figure out whether that was a good move to take.

So this tends to be computationally expensive in many cases and there's some

tricks that one can do to reduce the cost on that.

Now, every learning algorithm that we've discussed has some set of

hyper-parameters that we need to set before we can run the learning algorithm.

So, these are include for example, if we're doing say a Dirichlet prior over

the parameters it includes the equivalent sample size.

And perhaps the the prior itself, more broadly.

That usually a hyper parameter that we need to set.

if we're doing say, CRF or MRF learning, we need to set the regularization

strength for either the L1 or the L2 prior, .

If we're doing expectation maximization, we need to figure out when to stop.

And we already talked about the fact that early stopping is a useful strategy,

because it reduces over fitting. For doing structure learning, we need to

figure out the strength of the structure penalty.

How much do we want to penalize models that are more complex?

If we're doing, say, MRF or CRF learning. We have to figure out how many features

we want to add. And which of those are, we should add

initially. And when we're doing learning with latent

variables then there's the there's the decision of how many values a latent

variable ought to take for example when we're doing clustering how many clusters

should we consider. Each of these is an important decision

that we need to think about and. The question is, where does this come

from? Did we just invent this from thin air, or

do we where do we get the answers to this.

one bad answer that we shouldn't do is try and estimate these on the training

set. That is, find the hyper parameters that

maximize the objective function on the training set.

Because if we do that, that is effectively calling for a choice of these

hyper parameters that are going to over fit to the statistics of the training

set. And so a common strategy, perhaps the one

that's most commonly used, is to have a separate validation set on which these

hyper parameters are set. Which means that we we train.

On the training set. And then we evaluate on the value, on the

validation set. And we pick the high pro perimeters that

give us the best evaluation on the validation set and not on a training set.

Now. If computational cost allows us its good

to do this not on a single training set and a single validation set and at that

point we end up usually doing something like cross validation where we split up

the training set into a training set and a validation set in several different

ways and pick the hyper parameters that give us the best cross validation scores

and then we use those hyper parameters on the each try to train on the entire

training step. This of course gives rise to the question

of what it is that we're actually evaluating.

So what are some of the evaluation criteria that we might use to figure out

whether a mall is good or not. So one obvious one is log likelihood on a

test set. So if we're actually trying to fig, to

find a model that gives us good predictive performance on unseen data,

then log likelihood of the model on the test set is a good way of measuring.

But in many cases that's not the objective that we actually care about.

And so in that case we might often evaluate the learned model using a

task-specific objective. For example segmentation accuracy for

trying to do image segmentation or Or speech recognition accuracy.

Word error rate. it's called W, E, R.

Would be another task specific objective on which we can evaluate a model, and, if

you really want to get a hyper forming model, it's useful to, think about

optimizing the objective that you actually care about, in the context of

task. Finally when we're trying to do knowledge

discovery, it's often useful to think about the extent to which the model that

you learned matchs, with prior knowledge, so if you're doing clustering for

example, if you have some notion of, clusters, that you, that you think exist

in your data tryin to see, what that the clustering algorithm.

Is, more or less compatible with those, is a good idea to do a sanity check, on

the model as well. As well as for example, if we have some

notion on edges, that ought to exist in the network, trying to see a match with

those is also a useful thing. So now we've learned and now let's think

about some. Possible error modes that might occur and

how we might diagnose them and how we might fix them.

So one error mode that occurs often is under-fitting.

Under-fitting means that the model is not sufficiently expressive to capture the

patterns and data. And this we can recognize when the

training and the test performance are both low.

And in that case, it suggests that we just haven't put enough expressive power

into the model, and it can't even get good performance on the training set.

at that point, we can have several different solutions.

We can decrease the amount of regularization, because maybe the

regularization is driving us towards a regime where we can't fit the patterns in

the training set. We can reduce.

Structure penalties, to allow more edges and more interactions to be learned.

And then, a very important one is to add features into the model.

Often done by a very targeted error analysis.

In which we look at the errors the model is making, and say, wait a minute.

In this kind of, in this kind of instance.

I would have expected to see this answer, and I didn't.

So how do we improve the model by adding features, so as to give us, better

results for, for this set of instances. I, the complimentaryt error mode is over

fitting, so over fitting can be diagnosed in cases where the training performance

is high, often rediculously high, where as the test performance is low, and that

indicates the model has fit, patterns that are in the training data that

release the statistical noise and don't generalize to other, unseen instances.

And in this case the solutions are complementary.

You increase the regularization. You impose capacity constraints by

reducing the by forcing the model to pick within the subclass.

And you reduce the set of features so as to eliminate or reduce the capacity of

the model to overfit. A different error mode happens when our

optimization algorithm is not doing a good job of optimizing the objective that

we've laid out for it. So that happens when the optimization

might not be converging to a good or the global optimal and.

People tend to associate this error mode in cases where the problem is

multi-modal, so we're converging to a poor local optimum but in fact, it, this

can happen depending on whether we carefully, design our optimizational

algorithm and can happen even if the problem is a convex problem, it can for

example when we don't set our learning rates appropriately and we never actually

converge to, to an optimal. So a way to try and diagnose that is to

compare models learned with different learning rates with different random

initializations and basically see whether these models give rise to very different

values of the objective function. And if they do it suggests that we should

be very careful about how we optimize and maybe try multiple random starting points

or set the learning rate more carefully. Now.

This, previous. analysis called for comparison of the

objective function. That is the objective that we're trying

to optimize. A very different error mode happens when

our objective function is not actually a good match for the performance method

that we are trying to optimize. So for example, one important thing to

check for, is cases where we have two models say that we learned in some week

say as we discussed for the previous slide and in mod, the objective function

for model one is significantly better than the objective function for model two

that the performance objective that we care about say segmentation accuracy for

model one is considerably lower from the performance for model two and this

suggest the case where the, we are optimizing the objective is not.

a good surrogate for optimizing the performance that we care about and that

happens quite often when we are using a regularize likelihood as an objective

where regularize likelihood is not necessarily a good match for the

performance for the objective that we actually care about.

And an important mistake that people often make is to say, well, let's try and

change the optimization algorithm to give us better results in this case.

That is a bad idea because it's. You can't fix a bug in the objective

function by breaking the optimization algorithm.

The right thing to do is to redesign the objective function to match better the

desired performance criterion as, so as to get a better alignment between those

two. So to summarize let's think about the

typical learning loop that we use when trying to apply machine learning in

practice. First we design the model template which

tells us for example the set of random variables that are involved and whether

the model is directed or undirected. And perhaps some notion of the kind of

features that we might include. We then select the hyperparameters via

cross validation on the training set. We train on the training set.

With the chosen hyper parameters. And then we evaluate performance on a

held out set. To see which of the error modes, if any,

that we've have discussed on the previous slides, is actually occurring .

This gives rise to a process of error analysis.

What kinds of errors are we seeing? And how do we redesign the model or the

objective function or the optimization algorithm, in order to address those

problems? .

That gives rise to a new model template and perhaps some other changes.

And this cycle then repeats. When we're done and only when we're done

we can now report results on a separate test set which is not the same.

As the held out set because we've effectively used that held out set to set

a lot of model parameters and reporting results on that held out set would be

misleadingly optimistic in regards to the performance of the learned model on real,

unseen data because this held out set is not actually unseen data at this point.