[MUSIC] Hi, welcome to our course. In this first video, we will see the basic principles that we'll use throughout this course. Let's learn them by example. Imagine you are running through a park and you see another man running. You ask yourself, why is he running? And you come up with four different explanations. First, he is in a hurry. Second, he is doing some sports. Third, he always runs. And fourth, he saw a dragon. Principle 1, use prior knowledge. From our previous experience we know that dragons do not exist, and so we can exclude the fourth option from further consideration. Principle 2, choose the answer that explains the observations best. Imagine you saw that he is not wearing a sports suit. In this case, it's very unlikely that he's doing sports, and so we can exclude number two. Principle 3, avoid making extra assumptions. Of the two remaining options, the third one, that he always runs, makes a lot of extra assumptions, and so we should exclude it. This principle is also known as Occam's Razor. And finally, we are left with only one case, that he is in a hurry. To conclude, we've seen three principles: to use prior knowledge, to choose the answer that explains the observations best, and finally to avoid making extra assumptions.
Before we continue, let's review some basic principles from probability theory. We define probability in the following way. Imagine you have some source of randomness, for example, a die, and you repeat an experiment multiple times. As the number of experiments goes to infinity, we get the probability as the fraction of times that some event occurred. For example, for a fair die you would expect the event that you threw a five to have a frequency of about one-sixth. And for the event that you threw an odd number, it would be somewhere around one-half. We will consider two different types of random variables, depending on which values they can take: discrete and continuous.
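Before moving on, the frequency definition of probability given above can be checked with a short simulation. This is a minimal sketch and not part of the lecture: the function name `estimate_frequency`, the fixed seed, and the number of rolls are all assumptions chosen for illustration.

```python
import random


def estimate_frequency(event, n_rolls, seed=0):
    """Return the fraction of n_rolls fair-die rolls for which event(roll) is true."""
    rng = random.Random(seed)  # fixed seed (an assumption) so the run is repeatable
    hits = sum(1 for _ in range(n_rolls) if event(rng.randint(1, 6)))
    return hits / n_rolls


n = 100_000  # hypothetical number of repetitions; larger n gives a closer estimate
freq_five = estimate_frequency(lambda roll: roll == 5, n)
freq_odd = estimate_frequency(lambda roll: roll % 2 == 1, n)

print(f"threw a five:       {freq_five:.3f} (expected about {1 / 6:.3f})")
print(f"threw an odd number: {freq_odd:.3f} (expected about 0.500)")
```

With a finite number of rolls the frequencies only approximate one-sixth and one-half. The definition refers to the limit as the number of experiments goes to infinity.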
A discrete random variable can take either a finite number of values, as for example with a die, or infinitely many values, for example if you count the number of times that a certain event happened. An example of a continuous random variable would be tomorrow's temperature. The most convenient way to define a discrete distribution is called the probability mass function. It assigns to each point a number, the probability of that point. For example, in this case, the point 1 has probability 0.2, a second point has probability 0.5, a third has probability 0.3, and all other points have probability 0. Also note that these probabilities sum up to 1. The most convenient way to define a continuous distribution is called the probability density function. It assigns a non-negative value to each point. Then, to compute the probability that a point will fall into some range, for example, from a to b, you integrate this function over that range, as is shown on the slide.
We will also need the notion of independence. Two random variables are considered independent if their joint probability, that is, the probability of X and Y, equals the product of their marginals: P(X, Y) = P(X) P(Y). Let's see an example. Imagine that you have a deck of 52 cards and you randomly take 2 cards from it. The first random variable would be the picture on the first card, and the second would be the picture on the second card. These variables are dependent, since it is impossible to take the same card twice. Another example is throwing two coins independently. Here the probability that the first coin lands heads up and the second lands tails up equals the product of the two probabilities. And so these random variables are independent. The last thing we'll need is conditional probability.
We want to answer the question, what is the probability of X given that some event Y happened? It is given by the formula that you can see on the slide: the probability of X given Y equals the joint probability P(X, Y) over the marginal probability P(Y). Let's consider an example. Imagine you are a student and you want to pass some course. It has two exams in it, a midterm and a final. The probability that you will pass the midterm is 0.4, and the probability that you will pass both the midterm and the final is 0.25. If you want to find the probability that you will pass the final, given that you already passed the midterm, you can apply the formula from the previous slide. This gives 0.25 / 0.4 = 0.625, that is, a value around 60%.
We'll need two tricks to deal with formulas. The first is called the chain rule. We can derive it from the definition of conditional probability. That is, the joint probability of X and Y equals the product of the probability of X given Y and the probability of Y. By induction, we can prove the same formula for three variables: the probability of X, Y, and Z equals the probability of X given Y and Z, times the probability of Y given Z, times the probability of Z. In a similar way, we can obtain the formula for an arbitrary number of variables: it is the product, over all variables, of the probability of the current variable given all the previous ones. The second trick is called the sum rule. That is, if you want to find the marginal distribution p(X), and you know only the joint probability p(X,Y), you can integrate out the random variable Y, as is shown in the formula.
And finally, the most important formula for this course, Bayes' theorem. We want to find the probability of theta given X, where theta are the parameters of our model. For example, we have a neural network and theta are its parameters. And then we have X. Those are the observations, for example, the images that you are dealing with.
From the definition of conditional probability, the probability of theta given X is the ratio between the joint probability P(X, theta) and the marginal probability P(X). If we also apply the chain rule to the joint probability, we get the following formula: the probability of X given theta, times the probability of theta, over the probability of X. This formula is so important that each of its components has its own name. The probability of theta is called the prior; it shows what we know about the parameters before seeing the data. For example, you may know that some parameters are distributed around 0. The term probability of X given theta is called the likelihood, and it shows how well the parameters explain our data. The thing that we get, the probability of theta given X, is called the posterior, and it is the probability of the parameters after we observe the data. And finally, the term in the denominator, the probability of X, is called the evidence. [MUSIC]
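To make the four terms concrete, here is a small worked example of Bayes' theorem for a parameter that takes only a few values. It is a sketch under stated assumptions and does not come from the lecture: the coin scenario, the candidate values of theta, the uniform prior, and the helper name `posterior` are all hypothetical. The evidence is obtained with the sum rule, as a sum over theta of the likelihood times the prior.

```python
def posterior(prior, likelihood):
    """Apply Bayes' theorem over a finite set of parameter values.

    prior:      dict mapping theta -> P(theta)
    likelihood: dict mapping theta -> P(X | theta) for the observed data X
    Returns (dict mapping theta -> P(theta | X), evidence P(X)).
    """
    # Sum rule: P(X) = sum over theta of P(X | theta) * P(theta)
    evidence = sum(likelihood[theta] * prior[theta] for theta in prior)
    # Bayes' theorem: P(theta | X) = P(X | theta) * P(theta) / P(X)
    post = {theta: likelihood[theta] * prior[theta] / evidence for theta in prior}
    return post, evidence


# Hypothetical setup: theta is the probability that a coin lands heads up,
# and we assume the coin is either fair (0.5) or biased (0.8), equally likely.
prior = {0.5: 0.5, 0.8: 0.5}

# Hypothetical observation X: three independent throws, all heads,
# so P(X | theta) = theta ** 3.
likelihood = {theta: theta ** 3 for theta in prior}

post, evidence = posterior(prior, likelihood)
for theta in sorted(post):
    print(f"theta = {theta}: prior {prior[theta]:.2f}, posterior {post[theta]:.3f}")
print(f"evidence P(X) = {evidence:.4f}")
```

After three heads the posterior shifts toward the biased coin, because its likelihood is higher. The evidence only rescales the result so that the posterior sums to 1.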