Now, while Bayes' formula for the posterior probability of parameters given the data is very general, there are some interesting special cases here that can be analyzed separately. Let's look at them in sequence. The first special case arises when the model is fixed once and for all. In this case, we can drop the conditioning on M in this formula. The Bayesian evidence in this case is simply some function of the data only, and usually turns out to be irrelevant for finding the most probable values of the parameters. If, in addition to fixing the model, we also take a flat prior, which is a prior that is a constant function of the parameters, then maximization of the posterior probability is equivalent to maximization of the probability to see the observed data given the model and the current value of parameters Theta. This method is known as maximum likelihood estimation, or MLE for short. MLE is a very popular method for estimating probabilistic models in both machine learning and statistics. As a method of model estimation, it offers a number of attractive properties. First, if the true distribution lies within the family of distributions that we explore, and if there is only one unique value of Theta that corresponds to the observed data, then this value will be recovered by the MLE estimator as the number of observations goes to infinity. This is called consistency of the MLE estimator. Moreover, it turns out that when N is very large, the MLE estimator has the lowest variance among all possible estimators: its variance attains the Cramer-Rao lower bound. When you deal with finite values of N, which is usually the case in practice, these asymptotic guarantees no longer hold. This means in particular that some modifications of MLE estimators, for example regularized MLE estimators, might work better in practice than the plain MLE method. The next and very rich class of models is obtained when we do not assume a flat prior.
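As a minimal sketch of the consistency property just described (all numbers here are illustrative), consider MLE for a Gaussian. For i.i.d. Gaussian data, the maximum likelihood estimates of the mean and standard deviation are the sample mean and the (population) sample standard deviation, and they converge to the true values as the sample size grows:

```python
import numpy as np

# Illustrative example: MLE for the parameters of a Gaussian.
# For i.i.d. Gaussian samples, the MLE of the mean is the sample average,
# and the MLE of sigma is the population standard deviation (ddof=0).
rng = np.random.default_rng(0)
true_mu, true_sigma = 2.0, 1.5

for n in [10, 1000, 100000]:
    x = rng.normal(true_mu, true_sigma, size=n)
    mu_mle = x.mean()      # maximizes the Gaussian likelihood over the mean
    sigma_mle = x.std()    # MLE of sigma (biased for finite n, but consistent)
    print(n, mu_mle, sigma_mle)
```

Running this, the estimates for small n scatter around the true values, while the estimates for the largest n land very close to them, which is consistency in action.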
In this case, maximization of the posterior distribution amounts to maximization of the product of the prior and the likelihood, since now both of them depend on the parameters Theta. This is called Maximum-A-Posteriori estimation, or MAP for short. Maximum-A-Posteriori estimation is just one of the many things that can be done in the Bayesian approach. We will be returning repeatedly to Bayesian methods during the specialization. But for now, you have to remember that the MLE method is just a special case of the more general MAP method, which is obtained when you take a flat prior. Let's now come back to the MLE method and consider a specific example. Let's assume we model a real-valued quantity y as some function f of predictors x plus a noise term. The function is parameterized by parameters Theta, and we want to fit these parameters to some training data. Now, let's assume that all errors Epsilon are independent Gaussian random variables with zero mean, and variance that can, in principle, depend on the input. This means that each observation is independent and the probability to see all the data is just a product of probabilities to see each point. This gives you this relation. The exponential here is due to the Gaussian density for Epsilon, where for each observation, the epsilon is expressed in terms of x and y using our original equation. Now, maximization of this expression is equivalent to minimization of the negative of its logarithm, because the logarithm is a monotonic function. In this way, we arrive at the negative log-likelihood function shown here. This expression has two terms. Note that the second term doesn't depend on Theta, and therefore it's irrelevant for the search for the most likely value of Theta. Now, let's notice the following. If all noise variances Sigma_n are constant, that is, if they are independent of n, this factor can be taken outside of the sum in this expression.
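The derivation above can be checked numerically. A minimal sketch (variable names are illustrative): for y = x·Theta + Epsilon with Gaussian noise of constant standard deviation sigma, the negative log-likelihood is a scaled sum of squared residuals plus a Theta-independent term, so its minimizer is the ordinary least-squares solution:

```python
import numpy as np

# Sketch: negative log-likelihood for a linear model with constant
# Gaussian noise. NLL(theta) = sum_n (y_n - x_n @ theta)^2 / (2 sigma^2)
#                              + N * log(sigma * sqrt(2 pi))
# The second term does not depend on theta, so minimizing the NLL over
# theta is the same as minimizing the mean squared error.
rng = np.random.default_rng(1)
N, sigma = 200, 0.5
x = rng.normal(size=(N, 2))
theta_true = np.array([1.0, -2.0])
y = x @ theta_true + rng.normal(0.0, sigma, size=N)

def neg_log_likelihood(theta):
    resid = y - x @ theta
    return 0.5 * np.sum(resid ** 2) / sigma ** 2 + N * np.log(sigma * np.sqrt(2 * np.pi))

# The NLL minimizer coincides with the least-squares solution.
theta_mle, *_ = np.linalg.lstsq(x, y, rcond=None)
```

Because the constant-variance term is additive and Theta-independent, any Theta that minimizes the squared-error sum also minimizes the full negative log-likelihood.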
Furthermore, if we take our function to be a linear function, as shown here, we see that we exactly recover the mean squared error loss function that we used earlier for linear regression. So this example teaches us a few things. First, linear regression is equivalent to a linear probabilistic model with constant Gaussian noise. Second, the same holds for non-linear regression: it's equivalent to a non-linear probabilistic model with constant Gaussian noise. Third, the above derivation suggests many ways we can modify the MSE loss starting from a probabilistic framework. We will talk about this shortly. But first, I want to talk about another and very useful interpretation of the maximum likelihood method itself. This is where I want to introduce one of the most important quantities in machine learning, namely the Kullback-Leibler distance, also known as the KL divergence or relative entropy. Let me introduce the concept of KL divergence using our example of model estimation. Let's call the model distribution P_model of x and Theta, and call the true data-generating distribution P_data of x. The KL divergence between P_data and P_model is defined as an expectation of the log of the ratio of these two probabilities, taken with respect to the probability that stands in the numerator. This is the first argument of the KL function. Now, note that this is non-symmetric: if you swap P_data and P_model here, you will get an entirely different expression. Now, let's continue with the formula for the KL divergence. Let's write the logarithm of the ratio as a difference of two logarithms. Now, note that the first term here depends only on P_data and not on P_model. The negative of this term is called the entropy of the data distribution, and it measures the amount of randomness in the data. The more randomness in the data, the higher the entropy of the distribution. However, this term is irrelevant for optimization of Theta as it doesn't depend on Theta, and so we can drop it.
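The two properties just mentioned, non-negativity and asymmetry, are easy to see numerically. Here is a small sketch for discrete distributions on the same support, where KL(p || q) = sum_i p_i * log(p_i / q_i) (the distributions here are made up for illustration):

```python
import numpy as np

# KL divergence for discrete distributions on a shared support:
#   KL(p || q) = sum_i p_i * log(p_i / q_i)
def kl_divergence(p, q):
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log(p / q)))

p = np.array([0.6, 0.3, 0.1])
q = np.array([0.2, 0.5, 0.3])

kl_pq = kl_divergence(p, q)   # non-negative
kl_qp = kl_divergence(q, p)   # a different value: KL is not symmetric
kl_pp = kl_divergence(p, p)   # zero, since the two distributions coincide
```

This matches the properties stated below: the divergence is always non-negative, vanishes only when the two distributions coincide, and swapping the arguments changes its value.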
Now, replacing the expectation in the final expression with an empirical mean, we get this formula. It's exactly the negative log-likelihood of the data. So what we found is that minimization of the negative log-likelihood function, which is performed in the MLE method, is exactly equivalent to minimization of the KL divergence between the data and model distributions. I would like to add here that KL divergence has a number of properties that make it a good measure of similarity between any two distributions defined on the same support. In particular, the KL divergence between two arbitrary distributions P_1 and P_2 is always non-negative, and it becomes zero only when the two distributions coincide. We'll get back to the notions of KL divergence and entropy of distributions in our follow-up courses. For now, I want to conclude this video by showing you how regularized models can be naturally introduced in the Bayesian framework. To this end, let's return to the MAP method and look at the form of the negative log probability of the posterior distribution. What we see here is that a non-flat prior produces an additional term in the loss function, given by the negative log prior. Because it's made of the prior, it's a function of the model parameters Theta, but not of the data X. Now, if we recall our discussion of regularization from last week, we immediately realize that this additional term is actually equivalent to a regularizer in regression. Depending on the choice of the prior, this procedure leads to different regularizers. For example, if we take a Gaussian prior, we obtain the L_2 regularization shown here. If we take the so-called Laplace prior shown here, we will obtain the L_1 regularization. So to summarize, maximum likelihood estimation and maximum a posteriori estimation are two extremely popular methods for model estimation in both statistics and machine learning.
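To make the prior-regularizer correspondence concrete, here is a small sketch (function names and the value of lam are illustrative). Adding the negative log prior to the squared-error loss gives an L2 penalty for a Gaussian prior and an L1 penalty for a Laplace prior, with the prior scale absorbed into the coefficient lam; for the L2 case the MAP estimate is available in closed form as the ridge-regression solution:

```python
import numpy as np

def l2_penalized_loss(theta, x, y, lam):
    # Gaussian prior on theta: -log p(theta) contributes lam * ||theta||^2
    return np.sum((y - x @ theta) ** 2) + lam * np.sum(theta ** 2)

def l1_penalized_loss(theta, x, y, lam):
    # Laplace prior on theta: -log p(theta) contributes lam * ||theta||_1
    return np.sum((y - x @ theta) ** 2) + lam * np.sum(np.abs(theta))

def ridge_map(x, y, lam):
    # Closed-form MAP estimate under the Gaussian prior (ridge regression)
    d = x.shape[1]
    return np.linalg.solve(x.T @ x + lam * np.eye(d), x.T @ y)

rng = np.random.default_rng(2)
x = rng.normal(size=(100, 3))
theta_true = np.array([1.0, 0.0, -1.0])
y = x @ theta_true + rng.normal(0.0, 0.3, size=100)
theta_map = ridge_map(x, y, lam=1.0)
```

The L1 case has no closed form, but the same MAP logic applies: the Laplace prior's penalty encourages exactly-zero coefficients, which is why it is associated with sparse (lasso-style) solutions.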
In this video, we rephrased the linear regression problem as a problem of estimation of a Gaussian probabilistic model. We already saw in our previous demo how the MLE method for linear regression can be implemented in TensorFlow. Now, we're ready to move on and talk more about another important class of probabilistic models, namely probabilistic classification models. Let's do it in the next video.