0:17

So, let's look at the more general case where now we have a general Bayesian

network, not necessarily binary valued, and our parameters now consist of a set

of parameters theta X for all possible values of the variable X.

And the set of parameters theta Y given X for all values of the parent X and the

child Y and that's the parameterization of this Bayesian network in the general

case of table CPDs. Now let's think about the likelihood of,

of the parameters, this set of parameters, theta what is the likelihood

of the function look like? Well, the likelihood function, remember,

is the probability of the data given the parameters,

which decomposes because the parameters are IID, as the product of all data

instances, xm and ym relative to the parameterization [INAUDIBLE].

Well, so we're going to break that probability up using the chain rule for

Bayesian networks. And that's going to give me a

decomposition of this joint probability as a product of the probability of the

parent X times the conditional probability of y given x, which is just

again, the application of the chain role in this very simple example.

And now, I'm going to basically, break this up first into two separate products.

So I'm going to move this product here. So I have one product over, just the

first set of terms and then a second product over the y given x terms, and

then I'm going to observe that the first set of terms, the probability of xm only

depends on the parameters theta x, and the probability of ym given xm only

depends on theta y given x. And so now I've broken up the likelihood

function into a product of two or are called local likelihoods,

6:52

And if we now [SOUND] consider maximum likelihood of estimation,

it turns out that the maximum likelihood estimate of x, of theta x given u is

simply, as one would expect, the fraction of X = x, among faces,

where u equals little u. So what happens with maximum likelihood

estimation when we have a model with shared parameters?

Let's consider that question in the context of a hidden Markov model, where

initially, we're actually just going to look at a Markov chain, that is a case

where we have just a single state variable S and we transition from one

state to the other. So let's look at the likelihood function

in this context before we consider what maximum likelihood estimation would look

like. So the likelihood function of the

parameter vector theta, which dictates the transition model given a set of

observations, alpha of the states between time 0 and

time t decomposes using the Markov assumption.

As a product of t from 1 to t of the probability of the state, at time t,

given the state at time t+1.1. Now we can further reformulate this

product by, looking at how it decomposes across specific pairs of states, Si and

Sj. So we can first multiply over all

possible pairs of states, i and j, and then, consider all of the time points

in which we transition from Si to Sj. And then, we have the probability of that

transition from Si to sj. Now, the critical observation is that

this probability is the same regardless of the time points,

this is because of the time invariance assumption that we have in these models.

And so the parameter, specifically, that we would be multiplying in this context

is simply the transition parameter theta Si to Sj.

Now, if you look at this expression which has all these products in it, the product

of i and j, and then the product of the time points consistent with the

transition or in which the transition from i to j took place. We end up with

the product over here which multiplies over all ij of the parameter associated

with that transition exponentiated by the sufficient statistics MSI to Sj, which is

simply the number of times in which the, in the data stream that we saw, we had a

transition between those two states. Now, if we now consider what maximum

likelihood estimation would look like for this parameter theta hat sorry,

for the parameter theta Si to Sj, the maximum likelihood estimate they've had,

Si of Sj, would be exactly what we would expect,

the number of transitions, in which Si transitioned to Sj divided by the number

of total occurrences of the state Si within the time sequence.

Let's now elaborate this to the context of a hidden Markov model just to see what

it would look like. So here we have, in addition to the first

component of the likelihood function, which is the same one that we had before,

just the transition probabilities of the state sequence, we also have an

additional component which looks at every state from one to big T of the

probability of the particular observation, O of T, given the state S of

T. Now we can just do the exact same process

of reformulating the likelihood function. we've already seen what the first term

looks like. It's exactly the same as the one that we

had on the previous slide theta Si to Sj exponentiated with

sufficient statistics. And we have an, a very analogous term

that looks for over pairs of observe, of states i and observations k of the

parameter that corresponds to the kth observation in the state i.

exponentiated with the sufficient statistics,

which is the number of times that we saw in the same state the in the same time

point, the observation k and the state i.

And so, adding now to our set of sufficient statistics, in addition to the

count of going from Si to Sj, we also have sufficient statistics that count the

number of times, time points, that we had the state Si and the observation okay.

So to summarize maximum likelihood estimation in the case of discrete graph

Bayesian networks. For a Bayesian network that has disjoined

sets of parameters in the CPDs, that is where each CPD has its own set of

parameters, the likelihood function decomposes as a product of local

likelihood functions and this is important, because we're going to use

that later on, one per variable.

So the likelihood function is a product of small likelihoods or local

likelihoods, one per variable.

Now, when we have table CPU we get a further decomposition.

Specifically even the local likelihood now decomposes as a product of.

Likelihood's forced specific multinomials.

One for each combination of the parents. And so now we get a further decomposition

that is basically going to allow us to estimate each of those multinomials

separately via the same likelihood estimation that we had in the original

case of estimating parameters for a single multinomial,

because if you just have a product of these likelihood functions each of them

can be optimized separately. Finally in the very common case of

networks that have shared CPDs, so this is a case unlike the first bullet where

we had disjoint sets of parameters, here we have shared CPDs.

We simply accumulate the sufficient statistics over all occurrences or uses

of that CPD, and then, once we do that maximum likelihood estimation can be done

in effectively the same way. Now, one important observation that

arises from the form of maximum likelihood estimation is worth

remembering and thinking about when one uses this.

so let's look at the form of the, of the maximum likely estimate for, for

parameter theta x given u and as we've seen that is a ratio between M, between

the number of accounts of the substantiation little x and the parent

assignment little u divided by the sum of all M x prime u for different values of X

prime or written otherwise, it's M x u divided by M u.

Now, what are the consequences of this expression when the number of parents

grows large? When the number of parents grows larger,

the number of buckets, that is, the number of different possible assignments,

little u, increases exponentially with the number of parents.

If you have two parents, it it grows exponentially with,

to the number of different buckets u, but if you have fifteen or 20 parents,

then the number of buckets into which we need to partition the data, increases

very quickly. And that means that most buckets, that is

most assignments little u, will have very few instances assigned to them which

means that, effectively most of these estimates which divides M of xu divided

by M of u, will be done with few or even zero instances Mu), of u, a which point

any estimate is just totally bogus. And specifically, the that means that the

parameter estimates that we're going to get in most of the multinomial

parameters, once u gets large are going to be very poor.

So this concept is called fragmentation of the data and it has the following

important consequence. If the amount of data is limited relative

to, to model dimensionality, we can often get better estimates of the parameters

with simpler structures, even when these structures are actually wrong.

That is, even if we make incorrect in dependence assumptions that are making,

that are removing edges that really ought to be there,

we can sometimes get better generalization in terms of say, log

likelihood of test data than when using the much more complicated correct

structure.