So, we are now in Lab Two, where we're going to show how to use Datalab to run a BigQuery query. Here, we are going to analyze a dataset that's relatively large: 70 million rows, eight gigabytes of data. We'll analyze it using BigQuery and Cloud Datalab. The first thing to do is to launch Cloud Datalab, and we can do this from within Cloud Shell. First, we figure out what our compute zones are, so that we can run Datalab in one of those zones. I happen to know that us-central1 is a compute region, so I'm going to skip that part and go ahead and create Datalab. The command is "datalab create", followed by the name of the VM and a zone. So, let's go back to Cloud Shell. Here in Cloud Shell, I will say "datalab create mydatalabvm", and the zone is going to be "us-central1-b". Now, this command takes about five minutes to complete, so we're going to fast-forward the video to the point where you see the message saying that we can now connect on localhost 8081, or do a Web Preview on port 8081. Let's wait for that to show up, and then we will continue. While Datalab is starting, let's go ahead and try BigQuery. This is the query that I want to run. The comment here is very important, because we want to run Standard SQL, and by default the BigQuery user interface, at least at the time that I'm recording, defaults to what is called Legacy SQL. So, we will go to the BigQuery Console. We can get to the BigQuery Console from the GCP menu by selecting it, going down, and selecting BigQuery. We get asked to sign in; the password, again, is from Qwiklabs. Take the Qwiklabs password, put that in, and we are in BigQuery. Make sure that the selected project is your own project; in this case, not Qwiklabs Resources or anything else.
So we are here; we can click on "Compose Query". Again, your user interface might be slightly different, since user interfaces change all the time, but in general many of these buttons should be recognizable. So we'll go ahead and run this particular query. Another way, if you didn't want to say #standardSQL, is that I could have gone ahead, clicked "Show Options", and turned off Legacy SQL. That's another way to do it, but either way, we want to run Standard SQL. So we're going to run Standard SQL, and then run the query. This is being done on a project called bigquery-samples, a dataset called airline_ontime_data, and a table named flights. We don't actually see bigquery-samples here on the left-hand side. So how would you get to see a project that is not in the left-hand side menu? You would click on this drop-down menu, select "Switch to project" and then "Display Project", and type that project in. At this point, bigquery-samples shows up; inside it is the airline_ontime_data dataset, and in there is the table called flights. I can go ahead and look at flights, and we see in the preview these are the columns, with some example data values from the table. Looking at the details, it turns out that this table is nearly eight gigabytes and has over 70 million rows. This is the table we're querying against. So, let's go ahead and run the query. What have we done here? We select the departure_delay and count the number of flights. This is the number of flights at a specific departure_delay, because we are grouping by departure_delay. So for example, if the departure_delay is negative 37, in other words, the flight departed 37 minutes early, how many flights were there?
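For reference, the query being discussed looks roughly like the sketch below. This is a reconstruction from the narration, not the lab's verbatim text, so the exact quantile count formatting and the HAVING threshold are assumptions; the helper shows one way to run it from Python with the google-cloud-bigquery client.

```python
# Hedged reconstruction of the lab's quantile query; the num_flights > 100
# filter is an assumption, added only to mirror typical versions of this lab.
QUANTILES_SQL = """
#standardSQL
SELECT
  departure_delay,
  COUNT(1) AS num_flights,
  APPROX_QUANTILES(arrival_delay, 5) AS arrival_delay_quantiles
FROM
  `bigquery-samples.airline_ontime_data.flights`
GROUP BY
  departure_delay
HAVING
  num_flights > 100
ORDER BY
  departure_delay
"""

def run_query(sql):
    """Run a Standard SQL query on BigQuery and return a Pandas DataFrame."""
    # Import here so the module loads even where the client library is absent.
    from google.cloud import bigquery  # requires google-cloud-bigquery
    client = bigquery.Client()  # uses the environment's default credentials
    return client.query(sql).to_dataframe()

# df = run_query(QUANTILES_SQL)  # needs GCP credentials, so commented out here
```

APPROX_QUANTILES(arrival_delay, 5) is what produces the five-way split discussed next: six boundary values marking each 20th percentile of arrival_delay.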
There are 107 such flights in the dataset, and these are the quantiles. Each of these is a 20th-percentile boundary, because the distribution is divided by five. The earliest 20 percent of those flights arrive 66 minutes or more early, the next 20 percent arrive between 41 and 66 minutes early, and so on. So we had a question that I asked you: if the departure_delay is 35 minutes early, what is the median value? The median value would be the value in the middle of the quantiles: 28 minutes. If we go back to our console, we now see that Datalab asks us whether we want to continue. Say yes, and accept all of the prompts. Now let's go ahead and run this other query, to find the airport pair, meaning a specific departure airport and a specific arrival airport, that has the maximum number of flights between them. This is again from the same table, but now I'm selecting the departure_airport and the arrival_airport, and counting the number of flights, grouping by both the arrival_airport and the departure_airport, and ordering by the number of flights descending. That means the airport pair with the maximum number of flights will be first, and I'm limiting to 10, so I'm going to get the first 10, the 10 most common of those. Notice that we've processed 70 million records, and when I did it, it took 2.3 seconds. How is that possible? Well, it's because the query over 70 million records wasn't run on the one machine that I'm working from; it's run on thousands of machines, at scale. This is what it means when we say we launch services on the Cloud: we do these things in a serverless way. Anyway, going back here, it turns out that if the departure_airport is LAX and the arrival_airport is SAN, there are 133,000 flights. That's the airport pair with the maximum number of flights between them. So at this point, let's go back to Cloud Shell.
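The airport-pair query just described can be sketched as below. Again, this is a reconstruction from the narration rather than the lab's verbatim SQL, so treat the exact column names and layout as assumptions; it can be run with any BigQuery client the same way as the previous query.

```python
# Sketch of the airport-pair query: count flights for every
# (departure_airport, arrival_airport) pair, busiest pairs first.
AIRPORT_PAIR_SQL = """
#standardSQL
SELECT
  departure_airport,
  arrival_airport,
  COUNT(1) AS num_flights
FROM
  `bigquery-samples.airline_ontime_data.flights`
GROUP BY
  departure_airport,
  arrival_airport
ORDER BY
  num_flights DESC
LIMIT 10
"""
```

Because of the ORDER BY ... DESC and LIMIT 10, the first row returned is the pair with the most flights, which in the narration is LAX to SAN.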
Back in Cloud Shell, we see the message that we can click on Web Preview and change the port to 8081 to start using Datalab. Web Preview is this item here; select it and change the port to 8081. At this point, we are inside Datalab. Everything that we've done in BigQuery so far has been great: we have been able to run SQL queries on millions of rows of data and get our answers back in seconds. That's great, but what we really want, in addition to getting those answers, is to do things like drawing graphs. We want to be able to visualize the data, and visualization is one of those things that you can't do in the BigQuery Console; we want to use a custom visualization tool. In this case, we're going to use Datalab, which has full access to all of the Python goodness for doing our graphics. So what we're going to do here is run one of our queries, but not from the BigQuery Console; instead, from within Datalab. Here we are in Datalab; I'll go ahead and start a new notebook. In this notebook, what we have is a code cell, so I can paste the code into that cell and hit "Run" to run it. All of this is still being executed by BigQuery, so in the same few seconds we're analyzing these millions of flights, but what we're now doing is getting the result back as a Pandas DataFrame; the to_dataframe() call here returns a Pandas DataFrame. It shows you the first few rows of that DataFrame, and as before, we have the departure_delay, the number of flights, and the deciles, because in this case I'm computing the quantiles with 10 buckets. So there are 10 of them, and I get them back as a Python list. If we now take the same DataFrame and do a quick rename, what we have is that we've taken the deciles data, broken it up, and gotten 0 percent, 10 percent, 20 percent, 30 percent, and so on, as separate columns.
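The rename step can be sketched with a toy DataFrame standing in for the BigQuery result. The column names (departure_delay, num_flights, deciles) follow the narration, but the numbers below are made up purely for illustration; a list of 10 quantile buckets comes back as 11 boundary values, 0% through 100%.

```python
import pandas as pd

# Toy stand-in for the BigQuery result: each row has a departure_delay,
# a flight count, and an 11-element list of decile boundaries (0%..100%).
df = pd.DataFrame({
    "departure_delay": [-10, 0, 10],
    "num_flights": [500, 2000, 800],
    "deciles": [
        [-30, -20, -15, -12, -10, -8, -6, -4, -2, 0, 5],
        [-25, -12, -9, -7, -5, -3, -1, 1, 3, 6, 20],
        [-15, -3, 0, 2, 4, 6, 9, 12, 15, 21, 60],
    ],
})

# Break each deciles list into separate 0%..100% columns, as in the notebook.
percentiles = df["deciles"].apply(pd.Series)
percentiles.columns = [str(10 * i) + "%" for i in range(11)]
df = pd.concat([df["departure_delay"], percentiles], axis=1)
print(df.head())
```

After this, each percentile is its own numeric column, which is exactly what makes the plotting step that follows possible.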
Why am I doing that? Having separate columns allows me to do the next thing I want to do. At this point, I'm going to drop the 0 percent and the 100 percent columns, take the 10 percent through 90 percent data, and plot them as graphs. So how do you read this graph? What it tells you is that, for example, at a departure_delay of 10, that is, a 10-minute delay, 10 percent of flights nevertheless arrive early, while at the other end, 90 percent of flights arrive within about 21 minutes. Those are the deciles. The median, on the other hand, at that departure_delay is an arrival_delay of perhaps three or four minutes. That's essentially what these lines are: they give us a distribution at each specific departure_delay. Looking at this, you notice that the relationship is essentially linear for all departure_delays until it gets below maybe minus 20. For flights that depart more than 20 minutes early, that is, flights that depart really early, the relationship is rather scattershot; it's not very linear. If we're going to build a linear model, we would be okay with a linear model somewhere in the middle of the distribution, but not at the edges. This is the kind of insight that you cannot get easily any other way: you need to plot distributions, and plotting distributions is a lot easier when you have the full power of Python at your disposal.
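The drop-and-plot step can be sketched as follows. This is a self-contained toy version: the percentile values and the table itself are invented for illustration, standing in for the real BigQuery result, and the 0% and 100% columns have already been dropped.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, so this runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Toy percentile table indexed by departure_delay (only three deciles shown
# here for brevity; the notebook plots 10% through 90%).
df = pd.DataFrame(
    {
        "10%": [-20, -12, -3],
        "50%": [-8, -3, 6],
        "90%": [0, 6, 21],
    },
    index=[-10, 0, 10],
)
df.index.name = "departure_delay"

# One line per percentile column: roughly linear in the middle of the
# distribution, scattershot for flights that depart very early.
ax = df.plot(figsize=(8, 5))
ax.set_xlabel("departure_delay (minutes)")
ax.set_ylabel("arrival_delay (minutes)")
plt.savefig("deciles.png")
```

Pandas plots each numeric column as its own line against the index, which is what turns the percentile columns into the decile curves described above.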