Next is a critical topic for any data engineer or analyst to understand: how to break a single dataset apart into groups, or windows, and run SQL over just pieces of it. Let's go back to the questions we want to answer. For this next insight, we want to find the stations with the fastest and the slowest bike turnover times. As we did before, let's map out our analysis. It's the same dataset as before, but this time we want the timestamp of when each bike came in and out of the docking station. We ultimately want to find out how long that bike sat there after its previous trip ended, before it was checked out again. On the query side, same as before, we'll join the datasets together. Next, though, we need a way to look backwards and see the previous time a bike_id was returned to that station. Then we simply take the timestamp difference to get the idle time for that bike_id at that station. Lastly, we'll put together some aggregate metrics for each station as our final report, and filter the data as needed. We can use the LAG function, which is one of the navigation functions in SQL: go back one record in the dataset and pull the previous value of bike.end_date. In order for the lag to make any sense, we need to use an analytic window to logically partition our data before doing the lag. Here is a visual of the results, so you can see what the lag is actually doing. The last end date is taken from the end date of the record before. By default, LAG looks back just one record, but you can look deeper if you want a larger offset. The critical part here is the window that we pair with the LAG. The OVER (PARTITION BY bike_id ORDER BY bike.start_date) ensures that we're only taking the lag within rows for the same bike ID (that's what the partition does, like the bike ID you see here, 9,842), and that the sort order of the window is correct, oldest to newest. That allows us to traverse the past rentals accurately.
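If you want to try this LAG-plus-window pattern outside BigQuery, here's a minimal, runnable sketch. SQLite, which ships with Python, supports the same `LAG(...) OVER (PARTITION BY ... ORDER BY ...)` syntax; the table, bike IDs, and timestamps below are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE rentals (bike_id INTEGER, station TEXT, start_ts TEXT, end_ts TEXT);
INSERT INTO rentals VALUES
  (9842, 'Hyde Park Corner', '2017-01-01 08:00:00', '2017-01-01 08:30:00'),
  (9842, 'Hyde Park Corner', '2017-01-01 10:30:00', '2017-01-01 11:00:00'),
  (1234, 'Hyde Park Corner', '2017-01-01 09:00:00', '2017-01-01 09:45:00');
""")

# LAG looks back one row *within each bike's partition*, ordered oldest to
# newest, so each rental sees the end time of that same bike's previous rental.
rows = conn.execute("""
    SELECT bike_id,
           start_ts,
           LAG(end_ts) OVER (PARTITION BY bike_id ORDER BY start_ts) AS last_end_ts
    FROM rentals
    ORDER BY bike_id, start_ts
""").fetchall()

for row in rows:
    print(row)
```

Note that the first rental in each partition has no previous row, so its `last_end_ts` comes back NULL, exactly as in the BigQuery results shown in the lecture.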
Lastly, this is a simple timestamp difference between the current rental's start time and the last rental's end time, which gives us the time in hours that the bike was idle in that dock, with nobody using it. We can then aggregate with an average and show that result for each station name. Here you can see the hottest station is Wormwood, with only a 1.15-hour downtime on average. The station with bikes just sitting there is a new station that opened in 2017: Here East South in Queen Elizabeth Olympic Park, with over 116 hours idle on average for those bikes. One thing you could try is filtering the dataset for only daytime hours or nighttime hours, to see how time of day affects each station. Last up in our discussion of advanced functions are ranking functions, as well as the array data type. Our last insight will be to rank which bikes need maintenance, based on usage metrics that we're going to create. So let's collect some bike stats. For the first part of the query, let's get granular information on each trip taken by each bike by giving it a rank: a one for the first trip the bike ever took, a two for the second trip, and so on. Then let's sum common metrics for duration and distance. Finally, we'll store all these metrics in a convenient struct that we call stats. These are the results from that last query. Here's a challenge for you. What if we wanted to rank each bike against all the other bikes, and we also wanted granular details for every trip that each bike took? How could you write a query that returns this result here? It's one record for bike 12757, which is ranked number one in distance traveled, at over 5,000 kilometers. But we also want a repeated value for every trip that the bike took, in case you wanted to drill down into that data. The secret to that is the array data type. We'll show you how to get there in our next demo. So here's another demo working with that same London bike share dataset.
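The full idle-time pipeline described above (lag the previous end time, take the time difference, then average per station) can be sketched end to end. In BigQuery you'd use TIMESTAMP_DIFF; in this SQLite sketch the hour difference is computed with `julianday`. All station names and rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE rentals (bike_id INTEGER, station TEXT, start_ts TEXT, end_ts TEXT);
INSERT INTO rentals VALUES
  (1, 'Station A', '2017-01-01 08:00:00', '2017-01-01 08:30:00'),
  (1, 'Station A', '2017-01-01 10:30:00', '2017-01-01 11:00:00'),
  (2, 'Station B', '2017-01-01 08:00:00', '2017-01-01 08:30:00'),
  (2, 'Station B', '2017-01-02 08:30:00', '2017-01-02 09:00:00');
""")

# Step 1: lag the previous end time within each (station, bike) partition and
# turn the gap into hours. Step 2: average that idle time per station.
rows = conn.execute("""
    WITH lagged AS (
      SELECT station,
             (julianday(start_ts) - julianday(
                LAG(end_ts) OVER (PARTITION BY station, bike_id ORDER BY start_ts)
             )) * 24.0 AS idle_hours   -- stand-in for BigQuery's TIMESTAMP_DIFF
      FROM rentals
    )
    SELECT station, ROUND(AVG(idle_hours), 2) AS avg_idle_hours
    FROM lagged
    WHERE idle_hours IS NOT NULL
    GROUP BY station
    ORDER BY avg_idle_hours
""").fetchall()

print(rows)
```

Bike 1 sat for two hours at Station A between rentals, and bike 2 sat a full day at Station B, so the report ranks Station A as the faster-turnover station.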
This time we're actually looking at individual bike IDs and ranking them, to solve the problem of which bikes need to be maintained the most based on, say, how many hours the bike has been ridden and how much distance it has actually traveled. How we're going to do that is with window functions, meaning logical partitions, not the column partitions that we talked about a little bit earlier, and then with arrays, which give you different levels of granularity in a given dataset. So let's see the query that's actually going to give us what we need. It's a very long query, but we're going to break everything apart, and you're going to walk away knowing exactly what every line of this code does. For me, of course, I like to start with a staging query. If you watched the previous demo, this is the exact same query you saw: it's just putting the starting and ending stations inside their own structs, so it's very easy to see visually in the resulting query, along with all the transactional information about the actual rental, including the distance the bike traveled in meters, the duration in seconds, and the actual start and end dates, which are timestamps as well. Now, that isn't the interesting part; the interesting part is what we do afterwards. We want to calculate some key maintenance statistics, so I'm going to break the query off here. This is generally how I write pretty complex queries: I start with the raw data, call it staging, and do some kind of analysis on it. Of course, you can promote these to permanent tables a little bit later, but I want to basically take the results of that query. Let me just throw a LIMIT on here so you can see some of the results. Let me explain what this is going to do. We're going to take the individual bike ID, and then work within that bike ID.
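BigQuery's STRUCT type has no direct SQLite equivalent, but the idea of the staging query (packing each side of the trip into one nested station value next to the flat transactional columns) can be approximated with SQLite's `json_object`. All table, column, and station names here are invented for illustration.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE rentals (bike_id INTEGER, start_station TEXT, end_station TEXT,
                      duration_sec INTEGER, distance_m REAL);
INSERT INTO rentals VALUES (17, 'Hyde Park Corner', 'Soho Square', 1800, 3200.0);
""")

# json_object stands in for BigQuery's STRUCT: one nested "start" and one
# nested "end" value per row, alongside the flat rental columns.
row = conn.execute("""
    SELECT bike_id,
           json_object('station_name', start_station) AS start_stn,
           json_object('station_name', end_station)   AS end_stn,
           duration_sec,
           distance_m
    FROM rentals
""").fetchone()

print(row)
print(json.loads(row[1])["station_name"])
```

The payoff is the same as in the demo: the result set visually groups the start-station fields and end-station fields, instead of scattering them across a wide flat row.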
Well, for that bike ID, give me every single trip the bike has taken, and order it. Remember, the dataset has maybe thousands of different bike IDs. To express "for this given bike ID, do this" in SQL, you use a window function. The window function basically says: rank within this bike ID. Window it, partition it, logically break the dataset apart into separate groups by the bike ID, order it by the start date of when that trip actually began, oldest to newest, and then just assign a ranking. The very first row that shows up for bike ID 12345, or whatever it is, will be the first trip it ever took. So we have the trip number for each trip, which also tells us how many trips the bike has taken in total. If it has taken 715 trips, what does the 715th trip look like? Now, that ranking function is one example of an analytic function, but we can also do some aggregations over the window. We can take the duration, which is in seconds, and turn it into minutes and then hours with a division by 60 and then by 60 again, to get the total hours, again by bike ID, cumulative across every single one of those trips. So by trip 715, how many hours has the bike been out of the station? You could use that as a proxy for maintenance: if it's over a certain threshold, like 1,000 hours, make sure we flag it for maintenance. Lastly, maybe a more accurate measure than hours is distance, the total distance the bike has traveled in kilometers. The source data is in meters, so we divide by 1,000 for kilometers, and we're doing the same kind of window function to partition by that particular bike ID, order by the start date, and get a cumulative distance.
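The ranking plus cumulative aggregation just described can be sketched in one query. The named WINDOW clause below is shared by RANK and the two running SUMs; SQLite supports this the same way BigQuery does. Bike IDs, durations, and distances are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE trips (bike_id INTEGER, start_ts TEXT, duration_sec INTEGER, distance_m REAL);
INSERT INTO trips VALUES
  (17, '2017-01-01 09:00:00', 1800, 3000.0),
  (17, '2017-01-02 09:00:00', 3600, 5000.0),
  (99, '2017-01-01 10:00:00', 7200, 9000.0);
""")

# RANK numbers each bike's trips oldest to newest; SUM ... OVER the same
# window gives running (cumulative) hours and kilometers per bike.
rows = conn.execute("""
    SELECT bike_id,
           RANK() OVER w AS trip_number,
           SUM(duration_sec) OVER w / 3600.0 AS cumulative_hours,
           SUM(distance_m)   OVER w / 1000.0 AS cumulative_km
    FROM trips
    WINDOW w AS (PARTITION BY bike_id ORDER BY start_ts)
    ORDER BY bike_id, trip_number
""").fetchall()

for row in rows:
    print(row)
```

By bike 17's second trip the running totals have reached 1.5 hours and 8 kilometers, which is exactly the kind of cumulative figure you'd compare against a maintenance threshold.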
So let's actually see what just this looks like, before we add on the last bit that involves the arrays. This is the windowing functions, and then we'll get into the arrays. What we should see is the bike ID, then the trip ordinal (1, 2, 3, 4, 5, meaning the fifth trip), and then for each of those trips, some key stats for maintenance: how long it's been out of the shop and the total distance it has traveled. So here we go. Bike ID number 17: this is the first trip it ever took, it was only 0.03 hours, so a very, very short trip and a very short distance, and this is when it started and when it ended. So for each of these different bikes, we can now do some really interesting stats. It would be a ton of raw data; I've only limited it to 100 rows here, but BigQuery can do these aggregations for every single bike. You did the rankings within each bike; now you can do rankings across bikes. The really cool thing we're going to do in this last part of the query, if I just un-comment this, is turn this into a named subquery, and that's the WITH clause. By the way, if you haven't seen this before, you can chain together common table expressions, which are named subqueries: WITH staging AS (...), comma, maintenance_stats AS (SELECT ... FROM staging ...), and then have your final query be SELECT ... FROM maintenance_stats. I chain these queries together like that instead of trying to do it all in one massive query, because one massive query makes it a little bit harder to read. Of course, you can materialize the results of your earlier WITH clauses into tables as you need. So the very, very last thing we want to do is look at the bikes that have traveled the most in cumulative distance. I want to rank those bikes against each other.
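The chained-CTE shape described above (staging, then analysis, then a final SELECT) looks like this as a runnable sketch; each WITH clause is a named subquery that the next one builds on. Table and column names are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE trips (bike_id INTEGER, start_ts TEXT, distance_m REAL);
INSERT INTO trips VALUES
  (17, '2017-01-01', 3000.0),
  (17, '2017-01-02', 5000.0),
  (99, '2017-01-01', 9000.0);
""")

# staging cleans the raw data, maintenance_stats adds the windowed running
# total, and the final SELECT reads only from the last named subquery.
rows = conn.execute("""
    WITH staging AS (
      SELECT bike_id, start_ts, distance_m / 1000.0 AS distance_km
      FROM trips
    ),
    maintenance_stats AS (
      SELECT bike_id,
             SUM(distance_km) OVER (PARTITION BY bike_id ORDER BY start_ts)
               AS cumulative_km
      FROM staging
    )
    SELECT bike_id, MAX(cumulative_km) AS total_km
    FROM maintenance_stats
    GROUP BY bike_id
    ORDER BY total_km DESC
""").fetchall()

print(rows)
```

Each stage can be developed and eyeballed on its own (just point the final SELECT at staging while you debug), which is exactly why the demo builds the query this way rather than as one massive statement.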
So, bike 17, have you traveled the most kilometers or not? But here's the really interesting part. We're going to take the rank of each bike, show the bike ID, show the total distance it has actually traveled, and the maximum cumulative distance, which is essentially just the last trip's running total. Then the last thing we're going to do, which is where the array comes into play, is say: by the way, if I wanted to drill into the detail for that given record for bike 17, on the same row, as if I were double-clicking on bike 17, show me all of the stats you saw previously for the most recent 10 rides. What does that actually look like? So let's go ahead and run this, and it's going to do it for every single one of the bikes. You'll see the arrays; this basically just aggregates the trip stats into an array. Arrays give you multiple levels of granularity. At the top level you get high-level information about the bike ID and why it's number one, and how far it's gone, maybe thousands of kilometers for some of these bikes. But if you want to get a sense of some of the individual trips as well, you'll have that too. So this is what it looks like. It looks like 10 or 15 rows of data, right? Well, that's because these are arrays. At the top it's just one row: bike ID 12757, which has traveled a whopping almost 6,000 kilometers. Hopefully they're doing some active maintenance on this bike; if not, this report is a great place to start. This is just the ranking of that given bike, so you can add a WHERE clause filter that says: give me the top most-traveled bikes, where the rank is less than or equal to five. Then, if I imagine conceptually double-clicking on that bike 12757, these are the most recent 10 trips for that bike, and you can see, as I start scrolling down, it already wants to show me the number two bike.
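The "one row per bike, with trip detail rolled up alongside it" shape can be sketched as follows. BigQuery would use `ARRAY_AGG(STRUCT(...) ORDER BY ... LIMIT 10)` for the most recent 10 rides; SQLite's `json_group_array` is a simplified stand-in with no ORDER BY or LIMIT, so this just rolls up all trips. Bike IDs and distances are invented for illustration.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE trips (bike_id INTEGER, start_ts TEXT, distance_m REAL);
INSERT INTO trips VALUES
  (12757, '2017-01-01', 4000.0),
  (12757, '2017-01-02', 2500.0),
  (17,    '2017-01-01', 1200.0);
""")

# One row per bike: its rank by total distance, plus the per-trip detail
# rolled up into a JSON array (the stand-in for BigQuery's ARRAY_AGG).
rows = conn.execute("""
    WITH totals AS (
      SELECT bike_id,
             SUM(distance_m) / 1000.0 AS total_km,
             json_group_array(distance_m) AS trip_distances_m
      FROM trips
      GROUP BY bike_id
    )
    SELECT bike_id,
           total_km,
           RANK() OVER (ORDER BY total_km DESC) AS distance_rank,
           trip_distances_m
    FROM totals
    ORDER BY distance_rank
""").fetchall()

for row in rows:
    print(row)
```

The top row is still just one record for the number-one bike, but the last column carries every trip distance inside it, which is the two-levels-of-granularity trick the demo shows.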
Again, this is row one, row two. The reason you see this exploded detail here, and some seemingly blank data there, isn't that the data is blank; it's that at this level of granularity the field is not repeated, whereas these other fields are repeated, meaning they're arrays. All of this is technically one value. Back in the system it will look like an array; if you've seen arrays with brackets and commas separating the elements, the UI just visually breaks them out for you like that, so you can easily compare those levels. Arrays inside BigQuery are super useful: instead of a very, very wide schema, you can have deep levels of detail alongside high-level dimensions. You can have a wide table for people who just care about "what's the number one bike?" The number one bike is 12757. For people who want to drill in and say "I want to know what the second-to-last trip of that bike was," you can get that same granular information from the same table just by looking at the second most recent element inside that array. Again, it's up to you. We don't have a table that says how much maintenance was actually performed on these bikes, but you could clearly just add a WHERE clause filter that says: any bike that's traveled more than 1,000 kilometers, or 5,000 kilometers, be sure that you do extended maintenance on it. Or give the rider a special award if they hit the 10,000th or 100,000th kilometer on a given bike, which would be that cumulative distance as well. So again, that's working with arrays, which BigQuery natively supports, and, as you saw, the windowing functions, which are super useful for doing calculations within a specific bike ID.
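Drilling into a single element of a repeated field, and filtering on a maintenance threshold, can be sketched too. In BigQuery you'd UNNEST the array or index it with OFFSET; in this SQLite sketch the array is stored as JSON text and indexed with `json_extract`. The threshold and all values are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Suppose each bike row carries its recent trip distances as an array
# (a repeated field in BigQuery, stored here as JSON text in SQLite).
conn.execute("CREATE TABLE bike_stats (bike_id INTEGER, total_km REAL, recent_trips_m TEXT)")
conn.execute("INSERT INTO bike_stats VALUES (12757, 5999.0, '[4000.0, 2500.0, 3100.0]')")

# Pull one element out of the array (index 1, i.e. the second element) and
# filter to bikes past a maintenance threshold, as the lecture suggests.
row = conn.execute("""
    SELECT bike_id,
           json_extract(recent_trips_m, '$[1]') AS second_trip_m
    FROM bike_stats
    WHERE total_km > 1000.0
""").fetchone()

print(row)
```

The same table serves both audiences: the WHERE clause answers the high-level maintenance question, while the array index answers the per-trip drill-down, with no second table needed.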
If we just did a rank of all bike IDs by their trip start date, that might be useful to show which bikes are used most often, but we want to say: for a given bike ID, rank all of its trips. Create a window of data; that's exactly what the partition does here. Again, the huge thing to keep in mind is that when you see this PARTITION BY inside a window function, it is completely different from a table partition, which is a reserved, special column on which your data is physically partitioned on disk. This is a logical partition used by a window function inside BigQuery. All right, that's it for advanced functions. Follow along inside the queries here, where you can see a little bit of discussion of these repeated fields and arrays, and try the code out yourself.