0:00

[SOUND] In the previous module, we covered the expectation

maximization algorithm for the Gaussian mixture model.

In this module, we'll talk about the general form of expectation maximization,

which will allow you to train almost any latent variable model you can think of.

So it's going to be intense, but it's going to be worth it in the end.

But to start, we will need a few mathematical tools to work

with the inequalities we're going to derive.

So the first tool is connected to concavity.


What is a concave function?

Well, a concave function f is a function such that for any two points, a and b,

and for any point in between them, the value

of the function at that point, the blue point, is greater than or

equal to the value of the segment that connects f(a) and f(b).

So the orange point.

Or to put it more formally, for any two points a and

b, and for any weight alpha from 0 to 1,

if you mix these two points with weights alpha and 1 minus alpha,

you will get a point in between them.

And f of this mixture has to be greater than or equal to the mixture of f's, so

the blue point has to be greater than or equal to the orange point.

Or yet in other words, for any two points a and

b, the whole segment that connects f(a) and

f(b) should lie below the function f.
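This two-point property can be spot-checked numerically. Here's a quick Python sketch, not from the lecture, using the logarithm as the concave function; the names blue and orange just mirror the points on the slide:

```python
import math
import random

# Two-point concavity check for f = log:
# f(alpha*a + (1-alpha)*b) >= alpha*f(a) + (1-alpha)*f(b)
f = math.log
random.seed(0)
for _ in range(1000):
    a = random.uniform(0.1, 10.0)
    b = random.uniform(0.1, 10.0)
    alpha = random.random()
    blue = f(alpha * a + (1 - alpha) * b)        # f at the mixed point
    orange = alpha * f(a) + (1 - alpha) * f(b)   # point on the segment
    assert blue >= orange - 1e-12                # blue never drops below orange
```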

So if you have a concave function, like the logarithm for example,

then you can prove kind of the same property but for more points.

So for example, for any three points, and for any three weights alpha,

which are non-negative and sum up to 1,

we can prove for this concave function that f of the mixture of these three

points with weights alpha is greater than or equal to the mixture of the f's.

And you can generalize this thing.

So first of all, you can call these alphas probabilities, because why not:

they are non-negative and sum up to 1.

Let's say that there is a random variable t which takes three values,

a1, a2 and a3, with probabilities alpha.

And now we can rewrite this inequality.

It says that f of the expected value of t is greater than or

equal to the expected value of f of t.

And it's just the same thing by the definition of the expected value,

so the average with weights being the probabilities,

and this is something called Jensen's inequality.

Which, if you summarize everything, states that for any concave function f and

for any probability distribution p of t, f of the expected

value of t is greater than or equal to the expected value of f of t.

And this holds true even if you have three points,

or if your t takes arbitrarily many values, like infinitely many.

And in this case, the average can turn into an integral,

but the inequality still holds.

So, f of the expected value is greater than or equal to the expected value of f, for

any concave function f.
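Not part of the lecture, but here is a small Python sketch of Jensen's inequality for a discrete distribution p(t), again with f = log; the values and weights below are made up for illustration:

```python
import math
import random

random.seed(1)
# A discrete distribution p(t) over a few values of t.
values = [0.5, 2.0, 5.0, 9.0]
weights = [random.random() for _ in values]
probs = [w / sum(weights) for w in weights]   # non-negative, sum to 1

E_t = sum(p * t for p, t in zip(probs, values))                 # E[t]
E_log_t = sum(p * math.log(t) for p, t in zip(probs, values))   # E[log t]

# Jensen's inequality for concave f = log: f(E[t]) >= E[f(t)].
assert math.log(E_t) >= E_log_t
```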

And the final mathematical concept we will need is something called

Kullback-Leibler divergence,

which is a way to measure the difference between two probability distributions.

So say you have two Gaussians.

One of them is located at 0, and the other one is located at 1.

And both have variance 1, okay?

So how do we measure how different these two Gaussians are?

Well, a natural thing to do is to measure the distance between their parameters,

and it's 1 here because the variances are the same, and

the locations differ by 1.

Well it may be a reasonable measure of distance here, why not?

But let's consider another two Gaussians which have the same locations,

0 and 1, but variance 100.

Both of them have variance 100.

In this case, the difference between parameters is the same.

It's 1.

But intuitively, it seems like these two Gaussians, the green and

the red one, are much closer to each other than the first two.

So can we build a better way to measure difference between Gaussians or

between any probability distributions?

Well, Kullback-Leibler divergence is something which tries to solve

this problem.

And for example, for these particular distributions,

the KL divergence between the first two,

the blue and the orange Gaussian, will be 0.5.

And the KL divergence between the green and the red one will be 0.005.

So it kind of reflects our intuition that the second pair of Gaussians

are much closer to each other.
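These numbers can be reproduced with the well-known closed-form KL divergence between two Gaussians; a sketch, not from the lecture. Note that variance 100 means a standard deviation of 10:

```python
import math

def kl_gauss(m1, s1, m2, s2):
    """Closed-form KL( N(m1, s1^2) || N(m2, s2^2) )."""
    return math.log(s2 / s1) + (s1**2 + (m1 - m2)**2) / (2 * s2**2) - 0.5

print(kl_gauss(0, 1, 1, 1))    # the blue/orange pair: 0.5
print(kl_gauss(0, 10, 1, 10))  # the green/red pair: about 0.005
```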

So let's look at the definition of the Kullback-Leibler divergence.

It may look scary a little bit but the idea is really simple.

So what do I have here?

It's an integral of q(x) times the logarithm of something.

Well, the integral of q(x) times something is just the expected value of that something with respect to q.

So here we have the expected value of the logarithm of the ratio, right?

And logarithm of the ratio is trying to measure how different these

two distributions are at the current point x in the log scale.

So what I have here is basically how different these two distributions are at

any data point, at any point of the x-axis in the log scale.

And then we're taking an expected value of this quantity.

So we're kind of averaging across the whole space,

across the whole line, the line of real numbers.

And we're getting something like the mean difference between

these two distributions.
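As a sketch of this idea (not from the lecture), the expected log ratio can be estimated by sampling from q and averaging; for the first pair of Gaussians above, the estimate lands near the true value of 0.5:

```python
import math
import random

random.seed(2)

def log_pdf_gauss(x, m, s):
    """Log density of N(m, s^2) at x."""
    return -0.5 * math.log(2 * math.pi * s**2) - (x - m)**2 / (2 * s**2)

# KL(q || p) = E_{x ~ q}[ log q(x) - log p(x) ]:
# sample from q, average the log ratio across the samples.
m_q, s_q = 0.0, 1.0   # q = N(0, 1)
m_p, s_p = 1.0, 1.0   # p = N(1, 1)
n = 100000
samples = [random.gauss(m_q, s_q) for _ in range(n)]
kl_estimate = sum(log_pdf_gauss(x, m_q, s_q) - log_pdf_gauss(x, m_p, s_p)
                  for x in samples) / n
print(kl_estimate)  # close to the true value 0.5
```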

So that's the definition of the Kullback-Leibler divergence.

It has a few properties which we'll use in the following videos.

So, first of all, it's non-symmetric, right?

If you swap p and q, then you'll get a different expression.

And this is one of the reasons why it's not a proper distance

between distributions in the strict mathematical sense.

But anyway it's useful to measure some kind of a distance or

difference between distributions.

Another property is that for any two distributions that coincide,

that are the same, the KL divergence between them is 0.

And this is really easy to see because,

if you substitute p with q, in the expression above,

you will have expected value of logarithm of q divided by q.

And q divided by q is 1 at each point.

The logarithm of that is 0.

So we have an expected value of 0 which is also 0.

So Kullback-Leibler divergence between distribution and itself is 0.

And finally, the KL divergence is non-negative for any of the distributions.

And that's kind of easy to prove.

We can look at minus the KL divergence, which equals minus the expected

value of the logarithm of the ratio.

And we can put the minus inside the expected value,

because the expected value is linear.

And finally, we can use a property of the logarithm:

minus the logarithm of something is the logarithm of its inverse.

And now we have the expected value of a logarithm.

And as you may know, the logarithm is a concave function.

It's kind of easy to prove.

You can look it up in Wikipedia or so, or just believe me.

But anyway, this means that we can apply Jensen's inequality here.

And Jensen's inequality will say that expected value for

logarithm is less than or equal to the logarithm of the expected value.

And finally, the expected value of p divided by q is just equal to

the integral of q times p divided by q, where the q cancels.

And the integral of p is always 1 for any distribution p,

because that's a property of a distribution.

And the logarithm of 1 is 0, so

we have just proved that minus the KL divergence is always non-positive,

which means that the KL divergence itself is non-negative.
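Both properties, non-negativity and zero against itself, can be spot-checked numerically for discrete distributions; a quick Python sketch of my own, not from the lecture:

```python
import math
import random

random.seed(3)

def kl_discrete(q, p):
    """KL(q || p) = sum_i q_i * log(q_i / p_i) for discrete distributions."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p))

def random_dist(k):
    """A random distribution over k outcomes: positive weights, normalized."""
    w = [random.random() + 1e-6 for _ in range(k)]
    s = sum(w)
    return [x / s for x in w]

for _ in range(1000):
    q = random_dist(5)
    p = random_dist(5)
    assert kl_discrete(q, p) >= 0.0          # non-negative for any pair
    assert abs(kl_discrete(q, q)) < 1e-12    # zero against itself
```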

To summarize, KL divergence is a way to compare two distributions with each other.

It's non-symmetric.

It equals 0 when we compare a distribution to itself,

and it's always non-negative.

[SOUND]

[MUSIC]