Welcome, everyone, to our lecture: solving a real-world problem using linear regression, part one. In this video I'm going to use a car dataset to build a regression model that will allow me to predict the selling price of a car.

First, I load my dataset into a pandas DataFrame and look at the first five rows. There are several columns, such as the year, the selling price, and so on. The year is the year the car was built, so I'm going to create a new column called age. The age of the car is the current year minus the year the car was built. To get the current year, I import datetime and ask for the current date and time: x = datetime.datetime.now(). This returns today's year, month, day, hour, minute, second, and microsecond. I'm only interested in the year part, so I just say year = x.year. This returns 2022. Now I define my age column as the current year minus the year the car was built. If you look at it, there is a new column here called age.

The year column is no longer useful, so I'm going to drop it. I'm also going to drop the car name and the column called seller type. I drop those, and when I look again, they are gone.

The next thing I want to look at is the info. When I look at the info on the car data, I notice that I have a couple of categorical columns: the fuel type and the transmission. So I'm going to change them to dummy variables using this code. This changes the categorical data into dummy variables. If you look at it again, it split the fuel type into two types of fuel, and the transmission is either manual or automatic.

Now I want to check whether there is any missing data in my clean data. There is nothing missing, so this data is perfectly clean.

The next thing I want to do is some visualization. Here I'm doing a scatter plot of the selling price versus the kilometers driven.
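
As a rough sketch of the preparation steps described so far (not the exact notebook code from the video), something like the following would work. The column names, the tiny made-up sample, and the use of drop_first=True are all assumptions for illustration; the real dataset would be loaded from a file instead.

```python
import datetime

import pandas as pd

# Tiny made-up sample standing in for the car dataset.
# The column names are assumptions, not taken from the video.
df = pd.DataFrame({
    "Car_Name": ["ritz", "sx4", "ciaz", "swift"],
    "Year": [2014, 2013, 2017, 2011],
    "Selling_Price": [3.35, 4.75, 7.25, 2.85],
    "Kms_Driven": [27000, 43000, 6900, 5200],
    "Fuel_Type": ["Petrol", "Diesel", "Petrol", "Petrol"],
    "Seller_Type": ["Dealer", "Dealer", "Dealer", "Individual"],
    "Transmission": ["Manual", "Manual", "Automatic", "Manual"],
})

# Current year, taken from the system clock.
current_year = datetime.datetime.now().year

# Age of the car = current year minus the year it was built.
df["Age"] = current_year - df["Year"]

# Drop the columns that are no longer needed.
df = df.drop(columns=["Year", "Car_Name", "Seller_Type"])

# Convert the categorical columns into dummy (0/1) variables.
# drop_first=True (an assumption) drops one level per category
# so the dummy columns are not perfectly collinear.
clean = pd.get_dummies(df, columns=["Fuel_Type", "Transmission"], drop_first=True)

# Count missing values per column; all zeros means nothing is missing.
missing = clean.isnull().sum()

print(clean.head())
print(missing)
```

Computing the year from the clock means the age column changes depending on when the code is run, which is why the year printed in the video may differ from what you see.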
The kilometers driven is like the mileage; think of it as the mileage. So this is selling price versus mileage. When I run it, it gives us this plot. There is a correlation between the selling price and the mileage: the more mileage a car has, the cheaper it is.

Now I'm doing another plot, the age of the car versus the selling price, with the points colored by transmission: one color is the manual transmission and the other is the automatic. As you can see, the manual-transmission cars are cheaper, so manual transmission is negatively correlated with the selling price. The age of the car is also negatively correlated with the selling price: the older the car is, the cheaper it becomes.

Now we can choose our dependent and independent variables. Here the dependent variable is the selling price. The independent variables are everything except the selling price. Now it's time to split our independent and dependent variables into a training set and a test set. Here we set the test set to be 20% of the data.

Once we have split the data, we can import our linear regression model and train it on x_train and y_train. Then we use that model, lm, to predict on x_test and check the performance of the model by printing the r2_score. The r2_score gives us 76%. Usually you want this to be close to 100%. I multiplied by 100 here. If I change this, the 76 becomes 0.76, and you want this to be close to one, like 0.90, something like that.

Then, if you look at the coefficients here, the fuel type is positively correlated, but the mileage is negatively correlated with the selling price of the car.

We can print all the error metrics here. The R-squared is 76%, the mean absolute error is this, the mean squared error is six point something, and the root mean squared error is two point something.
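
The split, fit, and evaluation steps just described might look like the sketch below. It runs on synthetic data generated in the code itself, so the numbers it prints will not match the 76% from the video; the column names, the random seeds, and the coefficients used to generate the data are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned car data (hypothetical columns).
rng = np.random.default_rng(0)
n = 200
clean = pd.DataFrame({
    "Kms_Driven": rng.uniform(1_000, 100_000, n),
    "Age": rng.integers(1, 15, n),
    "Transmission_Manual": rng.integers(0, 2, n),
})
# Made-up relationship: price falls with age, mileage, and manual gearbox.
clean["Selling_Price"] = (
    12.0
    - 0.5 * clean["Age"]
    - 0.00003 * clean["Kms_Driven"]
    - 1.5 * clean["Transmission_Manual"]
    + rng.normal(0, 1.0, n)
)

# Dependent variable: selling price. Independent: everything else.
X = clean.drop(columns=["Selling_Price"])
y = clean["Selling_Price"]

# Hold out 20% of the rows as the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the linear regression on the training set only.
lm = LinearRegression()
lm.fit(X_train, y_train)

# Predict on the held-out test set and score the predictions.
y_pred = lm.predict(X_test)
r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)  # root mean squared error

# One coefficient per independent variable; the sign gives the direction.
coefficients = pd.Series(lm.coef_, index=X.columns)

print(f"R-squared: {r2 * 100:.1f}%")
print(f"MAE: {mae:.3f}  MSE: {mse:.3f}  RMSE: {rmse:.3f}")
print(coefficients)
```

The root mean squared error is just the square root of the mean squared error, which is why a mean squared error of six point something goes with a root mean squared error of two point something.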
If you look at our predictions and the true values and plot them, you see that we miss a few of them. So this score is not perfect, and we can make it better. I will see you in the next video, where we're going to improve this model to get a better R-squared score.

In this video we learned how to load a dataset into a DataFrame, look at some information about the data, change categorical data into dummy variables, do some visualization, split our data into a training set and a test set, run the model, and investigate the R-squared score. Thanks, everybody. I'll see you in the next video.