In the previous module, we covered the expectation

maximization algorithm for Gaussian mixture model.

In this module, we'll talk about the general form of expectation maximization,

which will allow you to train almost any latent variable model you can think of.

So it's going to be intense, but it's going to be worth it in the end.

But to start, we will need a few mathematical tools to work

with the inequalities we're going to derive.

So the first tool is connected to concavity.

And it is called Jensen's inequality.

What is a concave function?

Well, a concave function f is a function such that, for any two points a and b

and for any point in between them, the value

of the function at this point, the blue point, is greater than or

equal to the value on the segment that connects f(a) and f(b).

So the orange point.

Or if you put it more formally, for any two points a and b

and for any weight alpha from 0 to 1,

if you mix these two points with weight alpha,

you will get a point in between them.

And f of this mixture has to be greater than or equal to the mixture of f's, so

the blue point has to be greater than or equal to the orange point.

Or yet in other words, for any two points a and b,

the whole segment that connects f(a) and f(b)

should lie below the function f.
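The definition just stated can be checked numerically. Here is a minimal sketch (the helper name `concave_holds` is mine, not from the lecture) that verifies the inequality f(αa + (1−α)b) ≥ αf(a) + (1−α)f(b) for the logarithm:

```python
import math

def concave_holds(f, a, b, alpha):
    """Check f(alpha*a + (1-alpha)*b) >= alpha*f(a) + (1-alpha)*f(b)."""
    mix = alpha * a + (1 - alpha) * b          # the point in between a and b
    return f(mix) >= alpha * f(a) + (1 - alpha) * f(b)

# The logarithm is concave, so this holds for any a, b > 0 and alpha in [0, 1].
print(concave_holds(math.log, 1.0, 9.0, 0.3))  # True
```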

So if you have a concave function like logarithm for example,

then you can prove the same property but for more points.

So for example for any three points, and for any three weights alpha,

which are non-negative and sum up to 1.

We can prove that, for this concave function, f of the mixture of these three

points with weights alpha is greater than or equal to the mixture of f's.

And you can generalize this thing.

So first of all, you can call these alphas probabilities because why not.

They are non-negative and sum up to 1.

Let's say that there is a random variable t which takes three values,

a1, a2 and a3 with probabilities alpha.

And now we can rewrite this inequality.

Saying that f of the expected value of t is greater than or

equal to the expected value of f of t.

This is just the same thing: the definition of the expected value.

So the average with weights being the probabilities.

And this is something called Jensen's inequality.

Which, if you summarize everything, states that for any concave function f and

for any probability distribution p of t, f of the expected

value of t is greater than or equal to the expected value of f of t.

And this holds true not only for three points,

but also if your t takes arbitrarily many values, even infinitely many.

And in this case, your summation can turn out to be an integral,

but the inequality still holds.

So, f of the expected value is greater than or equal to the expected value of f, for

any concave function f.
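Jensen's inequality for a discrete random variable can also be checked directly. A small sketch, assuming some example values a1, a2, a3 and weights alpha of my own choosing:

```python
import math

values = [1.0, 2.0, 8.0]   # a1, a2, a3 (hypothetical example values)
probs  = [0.2, 0.5, 0.3]   # alphas: non-negative and summing to 1

# E[t]: mixture of the points with weights alpha.
e_t = sum(p * v for p, v in zip(probs, values))

# E[log t]: mixture of the f's, for the concave function f = log.
e_log_t = sum(p * math.log(v) for p, v in zip(probs, values))

# Jensen's inequality: f(E[t]) >= E[f(t)].
print(math.log(e_t) >= e_log_t)  # True
```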

And the final mathematical concept we will need is something called

Kullback-Leibler divergence,

which is a way to measure the difference between two probability distributions.

So say you have two Gaussians.

One of them is located at 0, and the other one is located at 1.

And both have variance 1, okay?

So how to measure how different these two Gaussians are?

Well, a natural thing is to measure the distance between their parameters,

and it's 1 here, because the variances are the same, and

the locations differ by 1.

Well it may be a reasonable measure of distance here, why not?

But let's consider another pair of Gaussians which have the same locations,

0 and 1, but they have variance 100.

Both of them have variance 100.

In this case, the difference between parameters is the same.

It's 1.

But intuitively, it seems like these two Gaussians, the green and

the red one, are much closer to each other than the first two.

So can we build a better way to measure difference between Gaussians or

between any probability distributions?

Well, Kullback-Leibler divergence is something which tries to solve

this problem.

And for example, for these particular distributions.

The KL divergence between the first two,

the blue and the orange Gaussians, will be 0.5.

And the KL divergence between the green and the red one will be 0.005.

So it reflects our intuition that the second pair of Gaussians is much

closer to each other.
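The numbers 0.5 and 0.005 can be reproduced with the known closed-form KL divergence between two univariate Gaussians; the sketch below (the function name `kl_gauss` is mine) uses that formula:

```python
import math

def kl_gauss(mu1, var1, mu2, var2):
    """Closed-form KL divergence KL(N(mu1, var1) || N(mu2, var2))
    between two univariate Gaussians."""
    return 0.5 * (math.log(var2 / var1) + (var1 + (mu1 - mu2) ** 2) / var2 - 1)

print(kl_gauss(0, 1, 1, 1))      # ≈ 0.5   (the blue and orange Gaussians)
print(kl_gauss(0, 100, 1, 100))  # ≈ 0.005 (the green and red Gaussians)
```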

So let's look at the definition of the Kullback-Leibler divergence.

It may look a little bit scary, but the idea is really simple.

So what do I have here?

It's an integral of q(x) times logarithm of something.

Well, an integral of q(x) times some function is just the expected value of that function under q.

So here we have an expected value of logarithm of the ratio, right?

And logarithm of the ratio is trying to measure how different these

two distributions are at the current point x in the log scale.

So what we have here is basically how different these two distributions are at

any data point, at any point of the x-axis in the log scale.

And then we're taking an expected value of this quantity.

So we're kind of averaging across the whole space,

across the whole line, the line of real numbers.

And we're getting something like the mean difference between

these two distributions.
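This averaging can be made concrete by approximating the integral KL(q‖p) = ∫ q(x) log(q(x)/p(x)) dx with a simple Riemann sum. A sketch for the first pair of Gaussians, q = N(0, 1) and p = N(1, 1), where we should recover roughly 0.5:

```python
import math

def gauss_pdf(x, mu, var):
    """Density of a univariate Gaussian N(mu, var) at x."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Riemann-sum approximation of KL(q || p) over [-10, 10),
# averaging log(q(x)/p(x)) with weights q(x).
dx = 0.001
kl = sum(
    gauss_pdf(x, 0, 1) * math.log(gauss_pdf(x, 0, 1) / gauss_pdf(x, 1, 1)) * dx
    for x in (i * dx - 10 for i in range(20000))
)
print(round(kl, 3))  # ≈ 0.5
```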

So, again, this is the definition of the Kullback-Leibler divergence.

And it has a few properties which we'll use in the following videos.

So, first of all it's non-symmetric, right?

If you swap p and q, then you'll have a different expression.

And this is one of the reasons why it's not a proper distance

between distributions in the strict mathematical sense.

But anyway it's useful to measure some kind of a distance or

difference between distributions.

Another property we may need is that for any two distributions that coincide,

that are the same, the KL distance between them is 0.

And this is really easy to see because,

if you substitute p with q, in the expression above,

you will have expected value of logarithm of q divided by q.

And q divided by q is 1 at each point.

The logarithm of that is 0.

So we have an expected value of 0 which is also 0.

So Kullback-Leibler divergence between a distribution and itself is 0.

And finally, the KL divergence is non-negative for any two distributions.

And that's kind of easy to prove.

We can look at minus the KL divergence, which equals the expected

value of the logarithm of the ratio.

And we can put minus inside the expected value,

because the expected value is linear.

And finally we can use the property that

minus logarithm of something is the logarithm of the inverse.

And now we have expected value of the logarithm.

And as you may know, the logarithm is a concave function.

It's kind of easy to prove.

You can look it up on Wikipedia, or just believe me.

But anyway, this means that we can apply Jensen's inequality here.

And Jensen's inequality will say that expected value of

logarithm is less than or equal to the logarithm of the expected value.

And finally, the expected value of p divided by q is just equal to

the integral of q times p divided by q, where the q's cancel.

And integral of p is always 1 for any distribution p,

because it's a property of the distribution.

And finally, the logarithm of 1 is 0, so

we have just proved that minus KL divergence is always non-positive,

which means that KL divergence itself is non-negative.
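The chain of steps just described can be written out as a single derivation (in LaTeX notation, with the Jensen step marked):

```latex
-\mathrm{KL}(q \,\|\, p)
  = \mathbb{E}_{q}\!\left[-\log \frac{q(x)}{p(x)}\right]
  = \mathbb{E}_{q}\!\left[\log \frac{p(x)}{q(x)}\right]
  \le \log \mathbb{E}_{q}\!\left[\frac{p(x)}{q(x)}\right] \quad \text{(Jensen, log concave)}
  = \log \int q(x)\,\frac{p(x)}{q(x)}\,dx
  = \log \int p(x)\,dx
  = \log 1 = 0.
```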

To summarize, KL divergence is some way to compare two distributions to each other.

It's non-symmetric.

It equals 0 when we compare a distribution to itself,

and it's always non-negative.

