We will proceed with a discussion of descriptive statistics. First we have to discuss what kinds of variables we can meet in the data. The easiest type of variable to deal with is a numeric variable. A numeric variable is something that is denoted by a number, for example income or number of children. Numeric variables can be subdivided into two categories. We can consider real numbers, for example income or weight. Another option is to consider a count, something that is denoted by integer numbers, for example number of children. Such a variable is usually called a count or an integer variable. The difference between real numbers and integer numbers can be important in some cases, when you have to use different models for these different kinds of variables. But sometimes this difference is negligible. Sometimes it is even possible that we have some data that we can represent either by real numbers or by integer numbers. For example, a variable like age can be represented and modeled as a real number, because age is basically a real variable if we consider not only the number of full years but also the number of months, days, and so on. But usually it is considered as an integer number. Anyway, numeric variables are in a sense simple.
Another option is categorical variables. Categorical variables represent values that are not numbers but are elements of some fixed set. For example, we can consider a variable like country of origin, city of origin, or level of education. Categorical variables can also be subdivided into two groups. We can consider ordered categorical variables, for example level of education. We can consider several levels of education, like general education, higher education, college, scientific degree. We understand that there is an order on the set of all possible values.
We understand, for example, that college education is in a sense larger than general education. It is also possible that a categorical variable doesn't have any order on the set of possible values. It is not ordered. Variables of this kind are also called nominal variables, for example country of origin. There is a subset of categorical variables that can take only two possible values. These are called binary variables. Sometimes it is useful to consider them too.
When we study our data, we want to use mathematical tools that generally deal with numbers. So if we have some kind of categorical data, we have to transform it into some numeric form first. This can be done in different ways. Let us assume that I have a variable like country of origin or hometown, and I want to encode this variable in some numeric way. For example, in my study there are three possible hometowns: Moscow, New York, and London. I can use the so-called label encoder to encode this variable. In this case I say that, for example, Moscow is encoded as zero, New York as one, and London as two. Every time I see Moscow I just replace it with zero. Every time I see New York I replace it with one. In this way I can transform the hometown column into a column of values 0, 1, 2. This is the simplest possible way to do it.
But sometimes it is not an optimal way, because when we denote these categorical values by numbers, we introduce some kind of relation between these values. For example, we can perform arithmetic operations on numbers or compare them. In this encoding we see that London is larger than Moscow in some sense. But we understand that these relations between the numbers are just an artifact of our encoding scheme, because we could, for example, denote Moscow by the number two and London by the number zero, and these relations would be reversed.
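To make the label encoder concrete, here is a minimal sketch in plain Python. The sample column and the helper name `label_encode` are assumptions made for illustration; only the three cities and the codes 0, 1, 2 come from the example above. The second call swaps the codes of Moscow and London to show that the ordering between them depends only on the scheme we picked.

```python
# Hypothetical sample column; the three cities are the ones from the example.
hometown = ["Moscow", "London", "New York", "Moscow", "London"]


def label_encode(values, order):
    """Replace each category with its position in `order` (illustrative helper)."""
    codes = {category: index for index, category in enumerate(order)}
    return [codes[value] for value in values]


# Scheme from the example: Moscow -> 0, New York -> 1, London -> 2.
encoded_a = label_encode(hometown, ["Moscow", "New York", "London"])

# An equally valid scheme with Moscow and London swapped.
encoded_b = label_encode(hometown, ["London", "New York", "Moscow"])

print(encoded_a)  # [0, 2, 1, 0, 2]
print(encoded_b)  # [2, 0, 1, 2, 0]

# Row 0 is Moscow and row 1 is London: the comparison flips between schemes.
print(encoded_a[1] > encoded_a[0])  # True
print(encoded_b[1] > encoded_b[0])  # False
```

Nothing about the cities changed between the two runs, only the arbitrary order passed to the helper.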
We see that the relation changes just because we changed our mind about the encoding scheme. For some machine learning models this is not a problem, but for others, like linear regression models, it is a problem because they assume that when they see a number, this number has the appropriate numerical meaning. In this case we have to introduce other encoding schemes, for example the very popular one-hot encoding. In one-hot encoding, we associate several numerical variables with our categorical variable, like hometown. Assume that we have the variable hometown and we introduce new variables: the first variable is hometown Moscow, the second is hometown New York, and the third is hometown London. Then, when we see Moscow in the hometown column, we put one in the first new column and zero in the second and third. When we see New York, we put zero, one, and zero. For London it is the same way: zero, zero, and one.
In this case there is no such problem as comparing Moscow and London and drawing conclusions about which is larger in our encoding, because every value of this variable is encoded not by one number but by a vector of numbers. These vectors cannot be compared, just like the initial levels of our categorical variable. One-hot encoding is extremely popular in machine learning.
In statistical settings, it is also useful to consider so-called dummy encoding. Dummy encoding is the same thing as one-hot encoding, but we remove one of the columns of the transformed data, for example the London column. We can safely remove it because if we see zero in the Moscow column and zero in the New York column, we understand that our hometown is not Moscow and not New York, and therefore it has to be London. So in the one-hot encoding scheme there is some kind of redundancy, and if we remove one column, we remove this redundancy. For some methods, like linear regression, this can be important.
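The two schemes can be sketched in a few lines of plain Python. The helper names `one_hot` and `dummy` and the choice to drop the last level (London) are assumptions for illustration; libraries that offer these encodings may pick a different column to drop.

```python
# Levels of the categorical variable, in the column order used in the example.
levels = ["Moscow", "New York", "London"]

# Hypothetical sample column with one row per city.
hometown = ["Moscow", "New York", "London"]


def one_hot(values, levels):
    """One 0/1 column per level; exactly one 1 in every row (illustrative helper)."""
    return [[1 if value == level else 0 for level in levels] for value in values]


def dummy(values, levels):
    """Same as one_hot, but with the column of the last level removed."""
    return [row[:-1] for row in one_hot(values, levels)]


one_hot_rows = one_hot(hometown, levels)
dummy_rows = dummy(hometown, levels)

print(one_hot_rows)  # [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
print(dummy_rows)    # [[1, 0], [0, 1], [0, 0]]
```

In the dummy version London is the row of all zeros, so no information is lost by removing its column.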
Anyway, we have to transform our data into a numeric form before we can proceed with the analysis of this data by machine learning tools. Sometimes these transformations are as simple as the ones we discussed here. Sometimes they use much more sophisticated techniques, like word2vec or some other [inaudible] of complicated objects into vector spaces. In the end, we have just numeric values in our data, and we can use a bunch of mathematical tools to process and analyze them. Now, let us discuss how to summarize data of different kinds.