Welcome to the module on advanced BigQuery functionality and performance. I'm Evan Jones, and I'm super excited to share a cool data set and write and understand some pretty powerful SQL for advanced insights. Let's get started.

Here's the content that we're going to cover. First, BigQuery natively supports geographic information system functions, or GIS functions, and their data types. We'll analyze three key questions that we want to answer, then use those GIS functions to help us solve them and map the results with BigQuery Geo Viz. Next, we'll explore WITH clauses versus permanent tables for performance and readability. After that is one of the most critical tools in an analyst's toolkit, the use of analytical window functions, which allow you to break up a given data set and perform SQL only over certain windows of that data. Then we'll discuss ranking functions and go deep into repeated data with array data types. Lastly, we'll conclude with performance best practices to help govern your data engineering and analysis efforts.

I find it's best to start with a robust data set and get inspired to ask difficult questions that we can then build out with advanced SQL functions. Our data set for this module will be over 24 million bike-share trips in downtown London. If you haven't seen these bike docking stations in cities, they're getting pretty popular. Essentially, a person can go to one of the hundreds of bike docks on street corners around the city and rent a bike by the hour. They must bring the bike back to one of the stations to end their trip. The public data set we have consists of two tables: the actual millions of trips the bikes have taken, and the locations of the docking stations around London.

Let's first start by asking some fun, thought-provoking questions. Here are three questions that I came up with that I want to answer. First, I want to know which pair of stations, rented from and then dropped off at, has the fastest riders in kilometers per hour on average. I have some assumptions about people who are late for work and live far away, and I'm curious to see whether the data supports them. Second, if I were running this business, I would want to know which stations have bikes that are always in use, docked out and docked in quickly, versus those bike docks that are gathering dust with unused bikes. That can help the business plan for capacity, and maybe even move bikes midday from those slow stations to the ones in high demand. Lastly, I want to manage the overall health of our bike fleet. These are assets. We don't want to open ourselves up to liability by having our bikes break down on a midday ride in busy London, so we want to make sure that our bikes get the maintenance they need, and that we prioritize the bikes that need maintenance first. Sound like fun?

Let's tackle that first problem first: finding the pair of docking stations with the fastest average pace between them. We'll make the assumption here, since we don't have GPS data for the ride itself, that the clock begins when the bike is rented from one station and ends when it's docked at the final station. And for ease, we'll use the straight-line distance in kilometers between the stations themselves as the distance. It's time to explore how we can make this insight a reality with the data that we have. As I mentioned, we have those two tables. The individual bike rides are in the cycle_hire table, which is shown here.
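If you'd like to poke at that table yourself before we go on, a quick preview query like the one below works. I'm assuming here that you're using the public london_bicycles dataset in the bigquery-public-data project, which is where this bike-share data is published; the column names are the ones from that public table.

  SELECT
    rental_id,
    duration,             -- trip length in seconds
    start_station_name,
    end_station_name
  FROM `bigquery-public-data.london_bicycles.cycle_hire`
  LIMIT 10;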
Take a look at the columns, and knowing that what we want is the average speed, which is distance traveled over time duration, what column or columns would be useful? We could do a timestamp difference between the start and end dates to get the duration. But lucky for us, we already have a duration field, so let's use that duration field as the total time the bike was out during that trip. What does 3,180 mean for duration? Well, if you were to click over to the schema tab, you would see in the description that this is the duration of the trip in seconds. This is why it's critical, when you're creating your own data sets, to add in column descriptions.

So duration is done. What about distance? How can we calculate the distance in kilometers between the starting and ending stations? We still need to know how far the bike physically traveled to get the pace of the bike. The cycle_hire table only has the names of the starting and ending stations. Where else might we look? The cycle_stations table has not only the name of the station but its actual physical location as latitude and longitude values. Since BigQuery natively supports GIS functions, we can use the function ST_DISTANCE, which will give us the straight-line distance between two geographic points. By the way, you'll see that a lot of these GIS functions begin with the letters ST. It simply means spatial type, and BigQuery has a special geography data type backing all of this geographic data, optimized for GIS work like this. More on GIS later.

Before we begin writing any SQL, let's first remind ourselves of the overall insight, see what data we have, and then mentally prepare for the query. Our goal is to find the pair of bike stations with the highest average rider pace in kilometers per hour. We analyzed the schema and found a duration in seconds, and a latitude and longitude for the starting and ending stations, which will give us the distance. Now, duration and distance live in two separate tables, so the first part of our query will need to be a join. Then, since the latitudes and longitudes are not a geographic data type by nature, we'll need to convert them into formal geographic points before we can use the power of GIS functions on them. Then we use ST_DISTANCE, which will give us the straight-line distance between the stations in meters. Next, we want the average for all riders on that route, so we'll use a SQL aggregation function. Lastly, we'll apply any filters we want to remove anomalies from the data.

This is the first part of our query: bringing the data from two different tables, bikes and bike stations, together. In complex queries where I know I'll likely share my results with others in the form of a table, I often use the STRUCT function to clearly outline which fields came from which tables. It may be overkill for small examples like this, but for super wide schemas, say 30-plus columns, you're almost guaranteed to see STRUCTs in use. So why not get familiar now? The schema behind Google Analytics is a great example of this kind of super wide schema. By the way, looking at the query, why do we have three tables in the join conditions? Well, each bike starts and stops at a station, so we need to join the stations table twice, because the trips table has two station ID columns to join on. This is the result from running that previous query. You can see how putting the columns for each table in a STRUCT gives us a nice prefix in the column name.
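As a rough sketch, and not necessarily the exact course query, that join with STRUCTs could look something like this, again assuming the public london_bicycles tables:

  SELECT
    -- fields from the starting station, prefixed "starting."
    STRUCT(start_stn.name, start_stn.longitude, start_stn.latitude) AS starting,
    -- fields from the ending station, prefixed "ending."
    STRUCT(end_stn.name, end_stn.longitude, end_stn.latitude) AS ending,
    b.rental_id,
    b.duration            -- trip length in seconds
  FROM `bigquery-public-data.london_bicycles.cycle_stations` AS start_stn
  JOIN `bigquery-public-data.london_bicycles.cycle_hire` AS b
    ON start_stn.id = b.start_station_id
  JOIN `bigquery-public-data.london_bicycles.cycle_stations` AS end_stn
    ON end_stn.id = b.end_station_id
  LIMIT 10;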
So it's super clear that starting.longitude is different from ending.longitude. This process of joining many tables together into a single table is called denormalization. It's often a great way to store data for fast retrieval from BigQuery without having to do those joins each time. It's also highly performant, because BigQuery stores data in a columnar format instead of a record format, which means you're not punished for having super wide schemas.

Now that you're familiar with STRUCTs, it's time to use the actual GIS functions to convert our latitudes and longitudes into points, then take the distance between the starting and ending stations. The inner function there, ST_GEOGPOINT, accepts longitude and latitude float values and converts them into an actual GIS point. We do that twice, and then pass those points into the ST_DISTANCE function for the final distance calculation, which tells us that the two stations you see there are 661 meters apart. Lastly, we'll probably want to draw these distance lines on a map somewhere, so we can use ST_MAKELINE to do that in a demo.

Now that we're done pre-processing all of our data, it's time to write the actual SQL to calculate the insight we want. We pull the station names, round the trip distance and convert it from meters to kilometers, count all the trips between those stations, and filter to only station pairs with 100 rentals or more using a HAVING clause. Then we actually do the speed calculation by dividing the distance by the duration and taking the average. Lastly, for the insight to make any sense, we sort from the highest average speed first in our ORDER BY clause. You'll see that the data is pulled from a staging place noted there in the FROM clause. That is actually a named subquery, or a WITH clause, which we'll cover later. It's essentially the query that you saw before to pre-process the data, which we named temporarily as staging so we can call it later.

And here's the actual output of that query. You can see that if you're going from Finlay Street to King Edward Street, you're likely to encounter the fastest commuters on average, with about a 16.6 kilometer per hour average pace, which is about 10 miles an hour. On a commuter bike, that's pretty good. What does it say about these stations? Well, maybe the goal of these riders is just to get from point A to point B, with no time spent enjoying the sights of London, which would eat up your average pace. Without having GPS data mid-ride on these bikes, we can only speculate about the actual route each rider took.

Well, wouldn't it be easier to see these trips as actual lines on a map of London? With BigQuery Geo Viz, you can do exactly that. Geo Viz is a web-based front end where you can run BigQuery queries and visualize the geographic results. You can even style those points, lines, or polygons with different weights and colors based on your actual data values. Let's do a quick demo of the tool using the results of the fastest commuters query that we just made.
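If you want to follow along with the demo, here's a rough sketch of what the full fastest-commuters query could look like, assuming the public london_bicycles tables. The structure mirrors what we just walked through, a staging WITH clause feeding an aggregation, though the exact column names, the LIMIT, and the anomaly filter here are my own choices rather than the course's verbatim query.

  WITH staging AS (
    SELECT
      start_stn.name AS starting_name,
      end_stn.name AS ending_name,
      -- straight-line distance in meters between the two docking stations
      ST_DISTANCE(ST_GEOGPOINT(start_stn.longitude, start_stn.latitude),
                  ST_GEOGPOINT(end_stn.longitude, end_stn.latitude)) AS distance_m,
      -- a line between the stations that Geo Viz can draw on a map
      ST_MAKELINE(ST_GEOGPOINT(start_stn.longitude, start_stn.latitude),
                  ST_GEOGPOINT(end_stn.longitude, end_stn.latitude)) AS route_line,
      b.duration AS duration_s
    FROM `bigquery-public-data.london_bicycles.cycle_hire` AS b
    JOIN `bigquery-public-data.london_bicycles.cycle_stations` AS start_stn
      ON b.start_station_id = start_stn.id
    JOIN `bigquery-public-data.london_bicycles.cycle_stations` AS end_stn
      ON b.end_station_id = end_stn.id
    WHERE b.duration > 0   -- drop zero-length trips so the speed math stays sane
  )
  SELECT
    starting_name,
    ending_name,
    ROUND(distance_m / 1000, 2) AS distance_km,   -- meters converted to kilometers
    COUNT(*) AS total_trips,
    -- average pace: kilometers traveled divided by hours the bike was out
    ROUND(AVG((distance_m / 1000) / (duration_s / 3600)), 2) AS avg_kmph,
    ANY_VALUE(route_line) AS route_line            -- line geometry for Geo Viz styling
  FROM staging
  GROUP BY starting_name, ending_name, distance_km
  HAVING total_trips >= 100                        -- only station pairs with 100+ rentals
  ORDER BY avg_kmph DESC
  LIMIT 10;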