How do we handle different data types? First, as we mentioned, because models only accept numerical variables, we need to convert non-numerical variables to numerical ones where possible, so that they can be used in the model. But we also need to pay attention to the actual content of these variables, so that it's meaningful to include them in the model. For example, as we mentioned earlier, a social security number, even though it's a number, shouldn't be used as a variable, because it's neither a measurement nor a count. A number that represents hair color cannot be entered directly either, because the categories have no order, so an increase in that number doesn't mean anything. But we'll talk about a different way of entering that information.

Now, let's talk about how we handle specific types of variables. First, let's start with ordinal variables. An ordinal variable is a categorical variable whose categories have an order. Sometimes we just want to treat it as a numerical variable. In that case, it's relatively simple. For example, income level has three levels: low, medium, and high. In this case it's natural to replace low with 1, medium with 2, and high with 3. Now this new column, income level encoded, is a numerical column, and it can be used as an independent variable in your model.

Nominal variables, on the other hand, require different treatment. Because there's no natural order, we cannot enter them as 1, 2, 3, 4, 5 and treat them as a numerical variable. Instead, for nominal variables, we often need to create dummy variables, one for each category. But for n categories we don't want to create n dummies. We want to create n minus 1 dummies and include those n minus 1 dummies in your model. For example, blood type. The common blood types are A, B, AB, and O, so there are four types. We can create three dummy variables, one each for type A, type B, and type O.
With that, for blood type A we enter 1, 0, 0; for type B, 0, 1, 0; for type AB, because it's none of these three types, we enter 0, 0, 0; and for type O, 0, 0, 1. We use these three dummy variables to capture the four categories. Again, the reason we drop one of them, here the dummy for type AB, is that it's redundant given the other three. When we get all zeros on the three dummies, we know that's type AB.

At this point, I want to mention the notion of multicollinearity. Multicollinearity refers to the case when one variable is a perfect or nearly perfect linear combination of one or more other variables. Put simply, that means one of the variables is redundant given all the other variables that you have included. One example is the dummy variable coding we've just discussed: we drop one dummy because it's redundant given the other n minus 1 dummies. But in multiple linear regression in general, when you include many variables, some of them can become redundant. That's also called multicollinearity.

Multicollinearity should generally be avoided in a regression model. The reason is that when there's severe multicollinearity, the regression model cannot be uniquely determined. What does that mean? Say you have a regression with coefficients b_0, b_1, and b_2. Usually, when you fit the model, you get a unique vector of b's. Maybe they're 1, 2, 3, and that's your model's estimate of these coefficients. But with multicollinearity, there's another equally good solution, such as 2, 1, 3, that gives you exactly the same predictions. From the mathematical point of view, we cannot decide which one is the right answer. As a result, we have an ambiguity about what b_1 should be. That's the reason we want to avoid multicollinearity: to avoid having a non-unique solution to your linear regression problem.
It's beyond our scope to formally detect multicollinearity and grade how severe it is. But there is a high-level guideline for recognizing that you have this issue: in general, when you have variables that are very highly correlated, you know that some of them are redundant. That is a case of multicollinearity. The usual treatment is to drop one of the redundant variables to get rid of the issue.

One thing I want to mention about coding ordinal and nominal variables is that sometimes you have to decide which way to treat a variable based on the nature of the relationship between that variable and your outcome. For example, sometimes we need to code a variable called quarter. Each year there are four quarters: spring, summer, fall, and winter. Sometimes the relationship you expect between quarter and the outcome, let's say sales, is more or less a linear increase. Maybe your business is growing in such a way that each quarter brings a better result. In this case, you might treat quarter as an ordinal variable and just enter it as 1, 2, 3, 4. However, in the more general case, this variable may display a nonlinear relationship. Maybe your summer sales are the best, winter sales the worst, and spring and fall sales are in between. In such a case, each quarter should be treated as a distinct season, and quarter should be treated as a nominal variable.

What about date and time variables? Date and time variables often appear in real-world data sets. They could be in date form, timestamp form, or even time form. Typically, date and time variables cannot be used directly in your model. However, we can extract very useful information from them. We can extract the quarter, as we mentioned earlier. We can extract the day of the week, whether it's a weekend, or how many days have passed since some start date.

Lastly, let's talk about string variables. String variables, as we mentioned earlier, could be categories, and we already discussed how we can treat categorical variables.
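Stepping back to date and time variables for a moment, here is a minimal sketch of that kind of feature extraction using Python's standard `datetime` module. The function name `date_features` and the choice of start date are hypothetical, and the quarter here is the calendar quarter (January to March is 1, and so on), which is an assumption rather than the seasonal quarters used in the sales example.

```python
from datetime import date


def date_features(day, start):
    """Turn a date into numeric features a model can use."""
    weekday = day.weekday()  # Monday is 0, Sunday is 6
    return {
        "quarter": (day.month - 1) // 3 + 1,     # calendar quarter, 1 to 4
        "day_of_week": weekday,
        "is_weekend": int(weekday >= 5),         # 1 for Saturday or Sunday
        "days_since_start": (day - start).days,  # whole days since the start date
    }


if __name__ == "__main__":
    # Illustrative dates only.
    print(date_features(date(2024, 7, 6), start=date(2024, 1, 1)))
```

Each value in the returned dictionary is a number, so it can be entered as its own column. Note that the quarter and the day of the week are themselves categorical, so the earlier choice between ordinal and nominal coding still applies to them.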
Beyond categories, there may also be useful information that you can extract from string variables. For example, in a Titanic data set, we may be able to extract a class from the ticket numbers: is this A, B, C, or D?

Finally, there are natural language processing tools that allow you to extract features from textual content, such as an email message or a product review. All of these can be converted into numerical values. That is beyond the scope of this course, but I just want to give you a flavor of what it entails. For example, some natural language processing tools will produce a vector of numbers, where each number represents the count of a word's appearances in the text. Other tools will produce a vector that contains counts of the topics mentioned in the text, one number for each topic. Sometimes they produce a sentiment score for the text. These are some of the possible ways you can convert text into numbers.
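To give a flavor of the word-count idea, here is a minimal sketch in plain Python. It is not one of the natural language processing tools mentioned above. The helper names (`tokenize`, `build_vocabulary`, `count_vector`) and the sample reviews are made up for illustration, and the tokenizer is deliberately crude.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def build_vocabulary(texts):
    """Collect every distinct word across the texts, in sorted order."""
    return sorted({word for text in texts for word in tokenize(text)})


def count_vector(text, vocabulary):
    """Return one count per vocabulary word, in vocabulary order."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]


if __name__ == "__main__":
    reviews = ["Great product, great price", "Bad product"]  # made-up examples
    vocabulary = build_vocabulary(reviews)
    print(vocabulary)
    for review in reviews:
        print(count_vector(review, vocabulary))
```

Every text ends up as a vector of the same length, one position per word in the vocabulary, which is what lets the counts be used as numerical columns.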