So, this is the moment you've been waiting for. You've learned how to preprocess and split your dataset in a repeatable fashion. Now it's time to do some machine learning inside of BigQuery. In this demo, we're going to try to make an ML model, train it, and get a result in 15 minutes or less. It's pretty audacious. The goal is based on a blog post about predicting taxi fares for New York City taxi cabs; I'll provide the link for you to peruse later. The first step, as any good ML project begins, is an exploration of the available dataset. We've got the public dataset here of New York City taxi cab trips, and there are a bunch of interesting fields in the schema that you can see here. So begin to think about the relationships between some of these columns, which could become feature columns if they're related to the ultimate fare amount of the cab ride. Clicking on Details, you can see that not only do we have data going back to 2009, we have a ton of data: 130 gigabytes, over a billion rows. So it might be a good idea to sample just, say, a million of these for our initial model experimentation. And as part of any good data exploration exercise, you can click on Preview to see some of those potentially useful columns, the feature columns whose relationship we want to model against what our label could be: the ultimate fare amount of a cab ride in New York. So you have things like pickup datetime, latitude, and longitude. And what about passenger count, do you think that could be related? Absolutely. Things like trip distance, how far the cab went, are definitely something we can build a model on. Now, let's go ahead and pull that data into our query.
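Since repeatable splitting was covered earlier in the course, a hash-based sample is one way to pull roughly a million rows out of a billion. The table path, column name, and sampling rate below are illustrative assumptions, not necessarily the exact ones used in the demo:

```sql
-- Sketch: repeatable ~0.1% sample of the public NYC taxi trips table.
-- FARM_FINGERPRINT gives a deterministic hash, so the same rows are
-- selected every time the query runs.
SELECT *
FROM `nyc-tlc.yellow.trips`  -- assumed public table path
WHERE MOD(ABS(FARM_FINGERPRINT(CAST(pickup_datetime AS STRING))), 1000) = 1
```

Because the hash is computed from the data itself rather than from RAND(), the sample stays stable across runs, which is what makes the train/evaluation split repeatable.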
So when we bring in taxi cab trips, let's paste in the taxi cab query. You'll notice that our dataset didn't have all the fields in exactly the way we want them. For total fare, instead of choosing the total amount column, which could include tips, which are discretionary for taxi cab riders, we're going to add the tolls charged to the passenger, like bridge tolls, to the metered fare amount. That sum is what we're going to predict for; that's going to be the label, the total fare. Also, we have the timestamp of when the person was picked up, the pickup datetime, but we don't have the day of the week, like Sunday, Monday, Tuesday, and maybe that could factor into the amount the person was charged. So we're going to create a new feature called day of the week. Same goes for hour of the day, and as you can see for the rest of these features here, we've just got the location of where they were picked up and the location of where they were dropped off, point A to point B, plus the number of passengers riding in the cab. You'll also notice that, as part of data exploration and preprocessing, there can be bad data even in a public dataset that's given to you. So remember the good lesson that you should be extremely critical and curious of any dataset that's given to you; it might not be the complete picture of what you're seeing. We want to filter out any trips with a distance of zero or less and any fare amounts that are zero or negative. Again, that's indicative of bad data quality. After all is said and done, we'll have a query that looks something like this. So let's go through it. We're preprocessing the data by filtering it, and we're splitting it. This should look familiar by now.
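The preprocessing just described could be sketched like this. Column names follow the public NYC taxi schema, but treat the exact names and table path as assumptions rather than the demo's verbatim query:

```sql
-- Sketch of the feature/label query described above.
SELECT
  tolls_amount + fare_amount AS total_fare,            -- label: metered fare plus tolls, no tips
  EXTRACT(DAYOFWEEK FROM pickup_datetime) AS dayofweek, -- new feature: 1 (Sunday) to 7 (Saturday)
  EXTRACT(HOUR FROM pickup_datetime) AS hourofday,      -- new feature: 0-23
  pickup_longitude,  pickup_latitude,                   -- point A
  dropoff_longitude, dropoff_latitude,                  -- point B
  passenger_count
FROM `nyc-tlc.yellow.trips`                             -- assumed public table path
WHERE trip_distance > 0   -- drop bad data: zero or negative distances
  AND fare_amount > 0     -- drop bad data: zero or negative fares
```

Note that the label is built from `fare_amount` plus `tolls_amount` rather than `total_amount`, precisely to exclude discretionary tips.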
We're actually going to be splitting it based on the pickup datetime, and we're reducing it by an order of magnitude, from a billion rows down to a million. You'll notice that instead of having a hard-coded value here, we're going to specify a training parameter, params.train, which you can see is just going to be a one. So we're splitting the data into just two buckets, training and evaluation, again, just for the purposes of experimenting and trying to build a model quickly. We've got the array of day names that we've created here for the days of the week, and we've got our training and evaluation split set up. Now what we need to do is actually add in some model code. The code that's going to run the model behind the scenes looks like this, and it's just three lines. Here's the magic: just as you have CREATE OR REPLACE TABLE, the latest BigQuery feature release adds CREATE OR REPLACE MODEL as a reserved keyword. We're going to store the model in a demo dataset that we have, and we're going to specify some options for it. What do you think we should tell BigQuery about the type of model we want? Well, since we're predicting on a numeric field, the fare amount, what type of model should we use? If you said linear regression, you're absolutely right. So linear regression, and the keyword here is linear_reg. And what about the label, what are we actually predicting for? We're predicting the total fare. I'm also going to throw in some additional, optional settings called hyperparameters, which are the knobs you can tune when training the model. Here we're telling the model to stop training when the next iteration doesn't improve quality by at least 0.5 percent, and how fast the model descends that gradient is the learning rate.
We're going to set that hyperparameter, the learning rate, to 0.1. If you're interested in learning more about these hyperparameters and which ones you can optionally tune, I'll provide links to resources; also check out the Machine Learning on GCP specialization, which walks through this in much greater detail and gives you a lot of good practice with TensorFlow. So, just three lines of code. Let's see if we can clear up some of these SQL errors we have here, and then run the code and see what we get. Now the last thing we need to do is tell the model what data it's going to pull in: we're going to bring in all of our taxi cab trips data as the result of this query. If we just run this with no model code, and I'm going to comment out the model code, we can confirm the number of records we're actually pulling in for training. So I'm going to run that query. How many records are we going to have? We're going to train on over a million; 1.1 million records are about to be fed into the model. All right, let's not wait any longer. Let's give it a go. It's going to run, and it's going to process over 74 gigabytes of data. And now you're actually training a machine learning model. Well, not quite; right now you're just running the query. Let's uncomment the model code here and make sure the model is actually running. Just like you would run a normal query, you can run a model right alongside your SQL, as you have here. It's going to take the results of this query and run them against your linear regression model. This will take about five minutes to process, and the output, since I've already run it before, actually returns what looks like a table. So I'm going to cancel this query, but on your own it'll take about five or six minutes to run, which is incredible.
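Putting the pieces together, the "three lines of magic" wrapped around the training query could look like the sketch below. The dataset and model names are illustrative, and the option names follow the BigQuery ML documentation (for example, `input_label_cols` for the label), which may differ slightly from what appears on screen in the demo:

```sql
-- Sketch: wrap the training query in a CREATE OR REPLACE MODEL statement.
CREATE OR REPLACE MODEL demo.taxifare_model   -- assumed dataset.model name
OPTIONS (
  model_type = 'linear_reg',                  -- numeric label, so linear regression
  input_label_cols = ['total_fare'],          -- the column we're predicting
  min_rel_progress = 0.005,                   -- stop when an iteration improves loss by < 0.5%
  learn_rate_strategy = 'constant',           -- required for a fixed learn_rate, per the docs
  learn_rate = 0.1                            -- gradient descent step size
) AS
SELECT ...  -- the preprocessed training query from above, with params.train selecting the split
```

Everything after `AS` is just the same SELECT you would run on its own, which is why commenting the model code in and out is such an easy way to sanity-check the input data first.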
So I'll just pause there for a moment: running a machine learning model and training it in five minutes is incredible, and doing it inside the BigQuery web UI is one of the great new features available to you. The actual results are what you're going to get here. Since it's linear regression, if you're familiar with it, you're going to get things like a bias term, or intercept, that shows up in your results, and then you've got your features and the weights of those particular features, which tell you which features have the most influence over the final result for the fare amount we're predicting. So let's go ahead and continue with the next step. We've split the data into training and evaluation, and we've just trained the model; now we need to evaluate its performance against the other dataset, our evaluation set. Instead of the training parameter, we're going to switch over to the evaluation parameter, and since the model already exists, I'm going to comment the training code out; training is done. Let's see how well it did. We're going to select the mean_squared_error, or if you want to get a little creative, you can take the square root of the mean_squared_error, which gives you the root mean squared error, the loss metric for a linear regression model. What that's actually going to output is something like: our model was plus or minus $12 for a cab fare, and from there you can go on to add new features and tune performance. We're going to specify which model it should evaluate, it found the model, and we're going to specify that this is the evaluation set, as you see here.
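The evaluation step just described could be sketched with BigQuery ML's `ML.EVALUATE` function. The model name is the same illustrative assumption as before:

```sql
-- Sketch: score the trained model against the held-out evaluation split.
SELECT
  SQRT(mean_squared_error) AS rmse   -- root mean squared error, in dollars
FROM
  ML.EVALUATE(MODEL demo.taxifare_model, (
    SELECT ...  -- the same preprocessed query, with params.eval selecting the evaluation split
  ))
```

`ML.EVALUATE` returns a row of metrics for the model; taking `SQRT(mean_squared_error)` converts the squared-dollar loss back into a plus-or-minus dollar figure you can interpret directly.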
Once we get the green light, we can see that we're going to evaluate 74.3 gigabytes of data, and the result we get back is just a single number: the root mean squared error, that plus or minus dollar amount for the fare. In this particular case, we've got plus or minus $12, and that's the end of your training and evaluation step. So you just created an ML model, trained it, and evaluated it, all in the span of, say, ten or 15 minutes. Now the hard part really begins. Remember, we said feature engineering is one of the hardest parts an ML project faces. Is plus or minus $12 a good enough metric for your cab fares? Maybe, maybe not. So I'll leave it to you to read through the rest of the blog post and work through the code that's available on GitHub to see how you can actually reduce that $12 error by 40 or 50 percent with some creative feature engineering. Thank you for tuning in.