In this way, we arrive at

the negative log-likelihood function shown here.

This expression has two terms.

Note that the second term doesn't depend on Theta,

and therefore, it's irrelevant for

the search for the most likely value of Theta.

Now, let's notice the following.

If all the noise variances Sigma n are constant,

that is, if they are independent of n,

this factor can be taken outside

of the sum in this expression.

Furthermore, if we take our function to

be a linear function, like shown here,

we see that we exactly recover the mean

squared loss function that we

used earlier for linear regression.
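
In symbols, the reduction described above can be sketched as follows (the names $y_n$, $x_n$, and $\sigma_n$ are generic stand-ins, since the slide's exact notation isn't in the transcript):

```latex
-\log L(\theta)
  = \sum_{n=1}^{N} \frac{\bigl(y_n - f(x_n;\theta)\bigr)^2}{2\sigma_n^2}
  + \text{const}
  \;\xrightarrow{\;\sigma_n = \sigma\;}\;
  \frac{1}{2\sigma^2} \sum_{n=1}^{N} \bigl(y_n - f(x_n;\theta)\bigr)^2
  + \text{const},
```

and taking the linear model $f(x;\theta) = \theta^{\top} x$ turns the remaining sum into exactly the squared-error objective of linear regression, up to the constant prefactor $1/2\sigma^2$, which does not change the minimizer.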

So this example teaches us a few things.

First, linear regression is equivalent to

a linear probabilistic model

with constant Gaussian noise.

Second, the same holds for non-linear regression.

It's equivalent to a non-linear probabilistic model

with constant Gaussian noise.

Third, the above derivation suggests many ways we

can modify the MSE loss

starting from a probabilistic framework.
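
For example, simply keeping the per-observation noise variances $\sigma_n$ in the derivation (again with generic symbols $y_n$, $x_n$, $\sigma_n$) already yields a weighted version of the squared loss:

```latex
-\log L(\theta)
  = \sum_{n=1}^{N} \frac{1}{2\sigma_n^2}\,\bigl(y_n - f(x_n;\theta)\bigr)^2
  + \text{const},
```

so observations with larger noise variance are automatically down-weighted.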

We will talk about it shortly.

But first, I want to talk about another, very

useful interpretation of

the maximum likelihood method itself.

This is where I want to introduce one of

the most important quantities in machine learning, namely

the Kullback-Leibler distance, also known

as the KL divergence or relative entropy.

Let me introduce the concept of

KL divergence using our example of model estimation.

Let's call the model distribution P model of x and Theta,

and call the true data-generating

distribution P data of x.

The KL divergence between P data and

P model is defined as an expectation of

the log of the ratio of these two probabilities taken

with respect to the probability

that stands in the numerator.

This is the first argument in the KL function.

Now, note that this is non-symmetric.

If you swap P data and P model here,

you will get an entirely different expression.
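
The asymmetry is easy to check numerically. Here is a minimal sketch with two made-up discrete distributions (the function name `kl` and the specific numbers are illustrative, not from the video):

```python
import numpy as np

def kl(p, q):
    """KL divergence D(p || q) between two discrete distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

# Two made-up distributions on the same three-point support.
p = np.array([0.6, 0.3, 0.1])  # plays the role of P data
q = np.array([0.2, 0.5, 0.3])  # plays the role of P model

print(kl(p, q))  # D(p || q)
print(kl(q, p))  # D(q || p) -- a different number: KL is not symmetric
```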

Now, let's continue with the formula

for the KL divergence.

Let's write the logarithm of the ratio

as a difference of two logarithms.

Now, note that the first term here depends

only on P data and not on P model.

The negative of this term is called the entropy of

the data distribution and it

measures the amount of randomness in the data.

The more randomness in the data,

the higher the entropy of the distribution.
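
As a quick illustration of that statement (the two distributions below are made up for this sketch):

```python
import numpy as np

def entropy(p):
    """Shannon entropy H(p) = -sum_i p_i log p_i of a discrete distribution."""
    p = np.asarray(p, dtype=float)
    return float(-np.sum(p * np.log(p)))

uniform = np.array([0.25, 0.25, 0.25, 0.25])  # maximal randomness on 4 outcomes
peaked = np.array([0.97, 0.01, 0.01, 0.01])   # almost deterministic

print(entropy(uniform))  # log(4), the maximum possible for 4 outcomes
print(entropy(peaked))   # much smaller: little randomness, low entropy
```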

However, this term is irrelevant for optimization of

Theta as it doesn't depend

on Theta and so we can drop it.

Now, replacing the expectation in the final expression

with an empirical mean,

we get this formula.

It's exactly the negative log-likelihood of the data.

So what we found is that minimization

of the negative log-likelihood function,

which is performed in

the MLE method is exactly equivalent to

minimization of the KL divergence

between the data and model distributions.
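
Here is a small numerical check of this equivalence, using a Bernoulli (coin-flip) model; the data and the grid of candidate parameters are made up for the sketch:

```python
import numpy as np

# Hypothetical coin-flip observations (1 = heads).
data = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])

# Empirical distribution of the data over the outcomes {0, 1}.
emp = np.array([np.mean(data == 0), np.mean(data == 1)])

thetas = np.linspace(0.01, 0.99, 99)  # candidate Bernoulli parameters

def nll(theta):
    """Negative log-likelihood of the data under Bernoulli(theta)."""
    return float(-np.sum(data * np.log(theta) + (1 - data) * np.log(1 - theta)))

def kl_to_model(theta):
    """KL divergence from the empirical distribution to Bernoulli(theta)."""
    model = np.array([1.0 - theta, theta])
    return float(np.sum(emp * np.log(emp / model)))

best_by_nll = thetas[np.argmin([nll(t) for t in thetas])]
best_by_kl = thetas[np.argmin([kl_to_model(t) for t in thetas])]
print(best_by_nll, best_by_kl)  # both pick the sample frequency of heads, 0.7
```

Both objectives are minimized by the same parameter value, as the derivation above predicts.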

I would like to add here that KL divergence has

a number of properties that make it a

good measure of similarity between

any two distributions defined on the same support.

In particular, the KL divergence between

two arbitrary distributions P_1 and P_2

is always non-negative and it becomes

zero only when the two distributions coincide.
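
These two properties can also be verified numerically; the randomized check below, with Dirichlet-sampled distributions, is an illustrative sketch:

```python
import numpy as np

def kl(p, q):
    """KL divergence D(p || q) for discrete distributions on a shared support."""
    return float(np.sum(p * np.log(p / q)))

rng = np.random.default_rng(0)
for _ in range(1000):
    # Random pair of distributions on the same 4-point support.
    p = rng.dirichlet(np.ones(4))
    q = rng.dirichlet(np.ones(4))
    assert kl(p, q) >= 0.0  # non-negative for arbitrary P_1, P_2

assert kl(p, p) == 0.0  # and exactly zero when the distributions coincide
print("KL non-negativity checks passed")
```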

We'll get back to the notion of KL divergence and

entropy of distributions in our follow-up courses.

For now, I want to conclude this video by showing you

how regularized models can be

naturally introduced in the Bayesian framework.

To this end, let's return to the MAP method and look at

the form of the negative log probability

of the posterior distribution.

What we see here is that a non-flat prior produces

an additional term in the loss function

given by the negative log prior.

Because it's made of the prior,

it's a function of the model parameter Theta,

but not of the data X.

Now, if we recall our discussion

of regularization from last week,

we immediately realize that this additional term is

actually equivalent to a regularizer in regression.

Depending on the choice of the prior,

this procedure leads to different regularizers.

For example, if we take a Gaussian prior,

we obtain the L_2 regularization shown here.

If we take the so-called Laplace prior shown here,

we will obtain the L_1 regularization.
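
Written out for a single scalar parameter Theta (the prior scales $\sigma_p$ and $b$ are generic names, not the video's notation), the two priors give:

```latex
p(\theta) \propto e^{-\theta^2 / (2\sigma_p^2)}
  \;\Rightarrow\;
  -\log p(\theta) = \frac{1}{2\sigma_p^2}\,\theta^2 + \text{const}
  \quad \text{(an } L_2 \text{ penalty)},

\qquad
p(\theta) \propto e^{-|\theta| / b}
  \;\Rightarrow\;
  -\log p(\theta) = \frac{1}{b}\,|\theta| + \text{const}
  \quad \text{(an } L_1 \text{ penalty)}.
```

In both cases the prior scale plays the role of the regularization strength: a narrower prior penalizes large parameter values more heavily.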