0:01

Continuing our discussion of parameter estimation. Previously we talked about maximum likelihood estimation, which tries to optimize the likelihood of the data given the parameters. An alternative approach that offers some better properties is Bayesian estimation, which is what we're going to talk about today.

So first let's understand why maximum likelihood estimation isn't perfect.

So consider two scenarios. In the first one, two teams play ten times and the first team wins seven of the ten matches. If we use maximum likelihood estimation, the probability of the first team winning is 0.7, which seems like a not unreasonable guess going forward. On the other hand, suppose we take a dime out of our pocket and toss it ten times, and it comes out heads seven of the ten tosses. Maximum likelihood estimation is going to come out with exactly the same estimate: the probability of the next toss coming out heads is also 0.7.

In this case, that doesn't seem like quite as reasonable an inference based on the results of these ten tosses. To elaborate the scenario still further, let's imagine that we take that same dime and now patiently sit and toss it 10,000 times, and sure enough it comes out heads 7,000 out of the 10,000 tosses. Now the probability of heads is still 0.7, but now it might be a more plausible inference for us to make than in the previous case, where we only had ten tosses to draw on.

And so, maximum likelihood estimation has absolutely no ability to distinguish between these three scenarios: between the case of a familiar setting such as a coin versus an unfamiliar event such as the two teams playing, on the one hand, and between the case where we toss a coin ten times versus tossing a coin ten thousand times. Neither of these distinctions is apparent in the maximum likelihood estimate.
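To make this point concrete, here is a minimal sketch (in Python, with hypothetical function names) of the maximum likelihood estimate in all three scenarios; it is just the empirical frequency, so all three give exactly 0.7:

```python
from fractions import Fraction

def mle_heads(successes, trials):
    """Maximum likelihood estimate for a Bernoulli parameter:
    simply the empirical frequency of successes."""
    return Fraction(successes, trials)

# All three scenarios yield the identical point estimate 0.7:
print(mle_heads(7, 10))        # team wins 7 of 10 matches
print(mle_heads(7, 10))        # dime comes up heads 7 of 10 tosses
print(mle_heads(7000, 10000))  # dime comes up heads 7,000 of 10,000 tosses
```

Nothing in the estimate records how much data it came from, or how familiar the setting is.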

To provide an alternative formalism, we're going to go back to our view of parameter estimation as a probabilistic graphical model, where we have the parameter theta over here and we have the data being dependent on the parameter theta. But unlike in the previous case, where we were just trying to figure out the most likely value of theta, now we're going to take a radically different approach: we're going to view theta as, in fact, a random variable by itself. It's a continuous-valued random variable, which in this case, the case of a coin toss, takes on values in the interval [0,1]. But in either case it is a random variable, and therefore something over which we will maintain a probability distribution.

Now this is in fact at the heart of the Bayesian formalism: anything about which we are uncertain we should view as a random variable, over which we have a distribution that is updated over time as data is acquired.

Now let's understand the difference between this view and the maximum likelihood estimation view. Certainly we have, as before, that given theta the tosses are independent. But now that we're explicitly viewing theta as a random variable, we have that if theta is unknown, then the tosses are not marginally independent. So for example, if we observe that X1 comes out heads, that's going to tell us something about the parameter; it's going to increase our probability that the parameter favors heads over tails, and therefore it's going to change our probability of other coin tosses. So the coin tosses are dependent marginally: not given theta, but without being given theta they're marginally dependent.

So that really gives us a joint probabilistic model over all of the coin tosses and the parameter together. If we break down that probability distribution using this PGM that we have over here, it breaks down using the chain rule for the Bayesian network that we have drawn there. So we have P of theta, which is the prior for the root of this network, and then the probability of the X's given theta, which, because of the structure of the network, are conditionally independent given theta. And so we have this product, where this term over here is just our good friend from before, the likelihood function.
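As a sketch, this coin-toss likelihood can be written directly from its sufficient statistics, the numbers of heads and tails (hypothetical Python helper):

```python
def likelihood(theta, num_heads, num_tails):
    """Likelihood of the data given the parameter:
    theta^(#heads) * (1 - theta)^(#tails)."""
    return theta ** num_heads * (1 - theta) ** num_tails

# The likelihood depends on the data only through the counts:
print(likelihood(0.7, 7, 3))
```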

Â 4:53

This is just the probability of the data given the parameters, and we've already computed what that is in the context of this coin-tossing example: it's theta to the power of the number of heads, times one minus theta to the power of the number of tails. But now we have an additional term, which is the probability of theta, which we obtain from the prior that we have over theta. And now, by virtue of having a prior and in fact a joint distribution, we can go ahead and compute a posterior over the parameter theta given the data set D.
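As a sketch of what this posterior computation looks like, we can approximate it numerically on a grid over theta (hypothetical Python; the prior is taken to be uniform here purely for illustration):

```python
def posterior_on_grid(num_heads, num_tails, prior, n=10001):
    """Discretize theta on [0, 1]; multiply prior by likelihood
    pointwise and normalize, mimicking Bayes' rule."""
    thetas = [i / (n - 1) for i in range(n)]
    unnorm = [prior(t) * t ** num_heads * (1 - t) ** num_tails for t in thetas]
    z = sum(unnorm)  # discrete stand-in for integrating out theta
    return thetas, [u / z for u in unnorm]

# Uniform prior, 7 heads and 3 tails: posterior mean is (7+1)/(10+2) = 2/3
thetas, post = posterior_on_grid(7, 3, prior=lambda t: 1.0)
posterior_mean = sum(t * p for t, p in zip(thetas, post))
print(posterior_mean)
```

Note how the normalizing constant z is recovered from the numerator alone, just as described above.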

So this is after having observed the values of N coin tosses: I have a new probability distribution over the parameter, and by simple application of Bayes' rule that is going to be equal to the probability of the data given theta, which is again my likelihood function,

Â 5:53

times the prior, divided by the probability of the data, which, importantly, just as in our other applications of Bayes' rule, is a normalizing constant. And constant here means relative to theta, which means that if I know how to compute the numerator, I can derive the denominator by simply, in this case, integrating out over the value of theta, to derive the normalizing constant required to make this a legal density function.

Now, the most common parameter

distribution to use, when we have a parameter that describes a multinomial distribution over K different values, such as this parameter theta, is what's called a Dirichlet distribution. Now, the Dirichlet distribution is characterized by a set alpha-1 up to alpha-K of what are called hyperparameters; that is to distinguish them from the actual parameters theta. So the probability distribution that is defined using these hyperparameters is a density over theta, and it has the following form.

Let's first look at this part over here, which is the part that carries the dependence on theta. What we see is that for each of the entries theta-i in the multinomial, we have an expression of the form theta-i to the power of alpha-i minus one, where alpha-i is the associated hyperparameter. In order to make this a legal density, we have, in addition, a normalizing constant, the partition function, which in this case, and this is something that we'll come back to, has the following form that we're not going to dwell on right now.

It's a ratio of these things called gamma functions, where a gamma function is defined via the following integral. And for the moment we don't really need to worry about this, because the only thing we really care about for now is the form of this internal expression over here, knowing that it needs to be normalized in order to produce a density. Now intuitively, and we'll see this in a couple of different ways, these hyperparameters, these alphas, correspond to the number of samples that we've seen so far.
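As a sketch, the Dirichlet density just described, including its gamma-function normalizer, can be evaluated with nothing but the Python standard library:

```python
import math

def dirichlet_pdf(theta, alpha):
    """Dirichlet density at theta (entries of theta sum to 1):
    normalizer * prod_i theta_i^(alpha_i - 1)."""
    # Partition function: Gamma(sum of alphas) / product of Gamma(alpha_i)
    z = math.gamma(sum(alpha)) / math.prod(math.gamma(a) for a in alpha)
    return z * math.prod(t ** (a - 1) for t, a in zip(theta, alpha))

# Dirichlet(1, 1) is the uniform density on [0, 1]:
print(dirichlet_pdf([0.3, 0.7], [1, 1]))  # -> 1.0 everywhere
```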

So let's understand why that intuition holds. But before we do that, let's look at a couple of examples of Dirichlet distributions. This is a special case of the Dirichlet distribution where we have just two values for the random variable, so it's really a distribution for a Bernoulli random variable, and in this case the Dirichlet is often known as a beta distribution. But a beta is just a Dirichlet with two hyperparameters.

Â 9:00

So here we have several examples of a Dirichlet, or beta, distribution. This one is the Dirichlet (1,1), and notice that it corresponds to a uniform distribution. As we increase the hyperparameters, for example if we go to this green line, which is the Dirichlet (2,2), we notice we get a peak in the middle. So there's an increase around 0.5, and that corresponds to a stronger belief that the parameter is centered around the middle. That probability increases yet further when we go to the Dirichlet (5,5), or beta (5,5), where now we have an even bigger peak around the value in the middle. And as we shift the amount of data that we get, and its mix, this distribution is going to get moved to the left or to the right, depending on the mix between heads and tails in this case. And as we get more and more data, the distribution becomes more and more peaked.
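We can sketch this sharpening numerically by evaluating the beta density at 0.5 for the three settings shown (hypothetical Python helper):

```python
import math

def beta_pdf(theta, a, b):
    """Beta(a, b) density: the two-outcome special case of the Dirichlet."""
    z = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return z * theta ** (a - 1) * (1 - theta) ** (b - 1)

# The peak at 0.5 grows as the (equal) hyperparameters grow:
for a in (1, 2, 5):
    print(a, beta_pdf(0.5, a, a))
```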

So, roughly speaking, the mix between alpha-heads and alpha-tails, the balance, determines the position of the peak, and the total alpha determines how sharp it is.

So now that we know a little bit about what the Dirichlet distribution looks like, let's see how it's updated as we obtain data. Let's consider a case where we have a prior, which we're going to assume is Dirichlet, and we have a likelihood, which is for a data set D derived from a multinomial with parameter theta.

And now we'd like to figure out the posterior P of theta given D, after having seen the data D. The likelihood we've already seen before: this is the probability of a data set that has, in this case, m-i being the number of instances with value little x-i. So this is just the likelihood function.

And the prior has the form of a Dirichlet with the associated hyperparameters. What's important to see, looking at this, is that the theta-i term in the likelihood and the theta-i term in the prior have exactly the same form. So when you multiply the likelihood with the prior, you can bring together like terms, those with theta-i at the base of the exponent, and you're going to end up with a posterior that looks exactly like a Dirichlet distribution as well, because it's going to have the form theta-i to the power of m-i plus alpha-i minus one.

So if our prior was Dirichlet alpha-1 up to alpha-k, and the data counts are m-1 up to m-k, then the posterior is simply a Dirichlet with hyperparameters alpha-1 plus m-1, up to alpha-k plus m-k. And that again suggests that the hyperparameters of the distribution represent counts that we've seen.
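This conjugate update is a one-liner; as a sketch (hypothetical Python):

```python
def dirichlet_update(alpha, counts):
    """Conjugate update: Dirichlet(alpha) prior plus multinomial counts m
    gives a Dirichlet(alpha_i + m_i) posterior."""
    return [a + m for a, m in zip(alpha, counts)]

# Uniform Beta(1, 1) prior, then 7 heads and 3 tails:
print(dirichlet_update([1, 1], [7, 3]))  # -> [8, 4]
```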

If a priori our counts for x-i were alpha-i, and now we saw an additional m-i counts for x-i, then in the posterior we have alpha-i plus m-i counts that we've seen for that particular event. Now, from a formal perspective, this is a useful term to know: this situation, where the prior and the posterior have the same form, is called a conjugate prior.

Â 14:13

So to summarize, we've presented the framework of Bayesian learning. Bayesian learning treats parameters as random variables, continuous random variables, but still random variables, which then allows us to reformulate the learning problem simply as an inference problem: what we're doing is taking a distribution over the random variables and updating it using evidence, which in this case is the observed training data. Now, specifically in the context of discrete random variables, over which we have a multinomial distribution in the likelihood and a Dirichlet distribution as the prior, we have this very elegant situation where the Dirichlet prior is conjugate to the multinomial distribution, which, as we just discussed, means that the posterior has the same form as the prior. And that in turn allows us to keep a closed-form distribution over the parameters, which has the same form all along as we keep updating it. And that update uses the sufficient statistics from the data, usually in a very efficient form.
