0:00
Okay, back in QwikLabs, let's go ahead and take a look at Lab number 1.
So, if you want to access Lab 1, click on the hyperlink here to the left,
and same thing for the other labs as we get them.
So this first lab, as you heard, we're going to be exploring our course dataset,
which is going to be those IRS tax filings.
So let's jump right into it.
The first thing that we want to do is find the row counts of that table
within the BigQuery UI.
So the first thing we actually need to do is find our dataset, first of all.
So go ahead and click on the hyperlink here,
which will open up in a new BigQuery window.
0:36
And a lot of the questions that we get asked here are,
the BigQuery public data that you saw here didn't include all those amazing 50
datasets that you mentioned, why not?
So we only show a sample of those there, but if you access them directly via that
URL you'll be taken to all of the ones that we have.
Because, again, they're publicly available.
So you're going to get very familiar with this IRS 990 dataset and
all of the child tables they're in.
So what I want you to do is just go ahead and click around on a few.
And the ones that you care about the most, again, are just these four
2012 through 2015 that have the annual filings, and then the EIN.
If you're wondering those, ez is just a different filing for
certain non-profits that qualify for the easy filing form.
And then, pf is private foundation.
So just for the purposes of this lab, just those annual filings in the EIN.
So I'm doing, as I'm clicking around here on the actual tables, and
this is true of any table in BigQuery.
You'll be taken to the actual schema where you can see the column names,
what the data type is.
Whether or not it's a required field, or
if it's a field that's allowed to be empty or nullable.
And hopefully, if it exists, a description of what that field is,
so you may be wondering, what on earth is EIN's?
EIN is actually the Employer Identification Number,
it uniquely identifies one in those organizations.
And it's kind of just like a social security number for
those folks that live in the US.
So, if I wanted to get a little bit more metadata about that table,
I can click on Details.
And then here I can see the size of the table, rather small, 383 megabytes, but
it's still over a million rows.
And because the nature of this table and
you are just believing the person who gave you this data.
Which, in this case, is me, it's 1.5 million non-profits.
And again, you're just going to have to take my word for
it that each row in here is an individual non-profit.
But if you didn't want to take my word for it, let's continue on with the lab and
see what we're going to do to be able to count that theory.
You're going to build a slow distrust of anyone who gives you data
by the end of this course.
So we've expanded the dataset, we've looked at the different data table types.
And this kind of exercise, even before writing your SQL queries,
is critical to understand what your data fields are.
And what are the different tables that you have available for your analysis.
So we wanted to find the number of rows in the metadata,
metadata is just a fancy word for data about data.
So we add 1,587,766, the same as here.
3:21
And the question is, assume each of those records, what does that number mean?
And as I mentioned, that's there are that many non-profits
registered with the IRS, okay, but is that 100% true?
So what we want to do right now is write some SQL to see whether or
not that is actually true.
So, let's go ahead and do that.
So the first thing we want to do is compose a query.
Before we do that, I actually want to show you something pretty neat.
You can click on a table name, and to get a little bit of prewritten query,
you can actually click on this button, Query Table, up above.
One of the things I want to immediately draw your attention to is,
if you see these brackets, that means you're still in legacy SQL mode.
And we want to be in the latest and greatest, which is standard SQL.
So, un-checking legacy SQL, clicking Hide Options,
Query Table again, you actually get these back ticks.
Which lets you know you are in standard SQL mode, which is great.
And that's where we actually get some SQL for free, which is great.
So instead of writing select from,
and most importantly making sure that we spell the table name correctly.
You can get that for free by clicking on the query table, so
what do we actually want here?
We want to see whether or not that one million or
so EINs is actually unique.
So what we want to do inside of SQL,
the SELECT clause basically says, which columns do I want returned?
So, for example, if I wanted to just do EIN end columns or comma separated.
4:56
Maybe I wanted the employer identification number and then the name from this table.
And I just wanted to limit, not even limit 1,000, I'm just going to limit 100.
Click on Run Query, so I get a flavor for what some of the data is.
So EIN number, and this actually corresponds to this particular name.
So if you know a little bit of SQL, or if you don't,
you can just follow along as you learned here.
You can actually get a total count of these rows by using an aggregate function.
We'll go over that a little bit more later.
So I'd say I want the total count of organization of numbers.
I'm going to remove the name, and I'm going to say, give me the count.
5:43
And that's the exact same number that we got, 1587766,
as you saw on the table metadata.
But there's a cool thing that you can do, how many of those are distinct?
So 1587766 and it should be the exact same,
if you trusted your data provider, that each and every row here is unique.
And when we execute it,
we see how the numbers are slightly different, interesting.
So let's go back to our lab and see what we're going to do about that, so
we've ran this query.
6:19
One of the things also we'll introduce here,
as I'm going to hop back into BigQuery.
Is, as we're writing this, a good best practice is,
continuously format your query, this button is super fun.
And as you'll see here,
the column name that's returned in the query results was this F0_.
And that's because we fundamentally changed the data output value,
it's no longer outputting EIN.
It's performing a calculated field on it, it's saying,
count the distinct number of EINs.
So it actually becomes a completely new field, which is unnamed at the time.
So you can add in what's called an alias by hitting AS,
then we could say, count_ein_distinct.
And normally, best practice is all lowercase and
use underscores when you're aliasing things.
We'll run that query again and that is going to be the exact same thing.
And if you also, again, wanted to play around with cache, boom.
Those cache results are there,
because we're running the exact same queries that we did before.
7:23
Okay, so interesting, we've got a different number.
So now what we want to do is modify the query to count the total number of
records.
And see the difference between that total and the unique count.
So it's going to combine two things that we've already written.
So the first thing that we want to do is we want to say,
give me the count as the raw_ein_count, and what we want to do, execute that.
And again, SQL is a very iterative process, so you can run these queries.
And then continually improve and write more onto the SQL and
just keep running them.
So we have raw_ein_count and then we have the distinct counts.
I'm just going to change the wording here, so it's a little bit more uniform,
raw_ein_count, distinct_ein_count.
And then one of the great things that you can do is you can basically say, well,
I'm going to take this.
8:23
Subtract this,
because you can perform these kinds of calculations within the SQL.
And we no longer need this limit,
because a limit just limits the amount of rows that are returned.
But in aggregate function, in this case,
collapses everything down to a single row, because you're aggregating it.
If you're quick with your eyes, you'll notice, what did I miss?
The alias for this final field AS
total_duplicates, so, interesting.
So, of about the 1.5 million, there are over 3,000 that
appear more than once in the organizational details table.
That again, that tells you have potential dirty data or
other errors that you should investigate inside of that data table as well.
And this kind of pre-exploration exercise for looking at duplicate records is very,
very common when you're given a new dataset as well.
That's why we start you off with that,
3,179 is exactly what we've found.
And as we build on query by query, one of the things that we're
going to have you do here is keep commenting out your query.
9:46
So all I did here was selected everything, and if you're using a Mac,
that's command and the forward slash.
Or if you're using Windows, that's the Control and the forward slash, or Linux.
Very helpful to comment and un-comment, and again that toggles the comments.
One of the best practices that I also advise is add in that standard SQL.
Even if you've disabled the legacy SQL in the show options as well.
Okay, so we have 3,179 duplicates,
comment your code heavily,
and now what do we want to do?
Copy and modify the partially written query below,
find the actual EINs that occurred more than once in the dataset.
Okay, well, let's think about this.
So what we want to do is start here and copy it.
Paste it below.
And one of the great things about BigQuery is if you just wanted to
select a partially written part of your code and just execute that.
You can just select all of those, click the down arrow and
Run Selected on that particular portion.
Which is fun if you have a lot of different individual SQL scripts inside of
one statement.
11:42
And what this is going to do is it's going to give us five records.
It's going to give us the EIN, the name, and
then the count of that particular EIN that's occurred.
And if you're wondering what the GROUP BY does, GROUP BY is just,
if you're aggregating on one field, if you're counting the EINs.
You also need to aggregate on anything that's not aggregating.
So things like your other string fields, you put them into your GROUP BY statement,
we'll cover a little bit more on that later.
12:39
So, as you see here, having what more than one?
Well, we actually need to specify the field that we want to filter on,
so the query validator is definitely your best friend.
So we want to have the duplicates reveal themselves,
the duplicates mean their EIN numbers showed up more than once.
Interesting, okay, cool, so that showed up twice, twice, twice, twice.
But I don't want to go through all 3,179, that's the same number as before.
And then to find out which one was the most, so
what we can do is sort the results.
And basically say, well, I want to put ordering it by the highest to lowest.
So the entire count and highest to lowest is descending DESC,
it defaults to ascending.
So you actually don't need to put anything if you want lowest to highest.
So as you run that portion and see who occurs the most,
do we have anyone that's going to beat two?
14:03
Yep, so we have our HAVING filtering out the aggregation,
this is ordering it by name.
Okay, so actually what we wanted to do was order it by the organization's name,
which is fine.
So our way showed the ones that actually appeared more than
once with the highest on top.
And this one is just saying, hey, give me the list of duplicates and
then ordered them alphabetically.
So that is actually organizing it by the organization's name.
And let's see who comes up top alphabetically,
maybe something with a number.
And that is descending, meaning Z to A,
let's re-run that with A to Z or ascending, which was the default.
14:50
So let's see our As, or numbers, will come before our letters alphabetically and
we'll get this One World Foundation Inc.
And let's just verify that's what we had in the lab,
which is the first organization that has a duplicate EIN.
And again, since we're sorting on name, ascending,
that's One World Foundation, Inc., which matches this.
All right, save the query, let's go ahead and save it, give it a fun name.
15:24
You have the option while you are saving it, as I went through that pretty quickly,
to set it to private, to project level, which is interesting.
So if there were more than one person on your project, like say your organization,
you can actually share your queries with them.
Or if you're an instructor creating these courses for Coursera.
Or if you just want to share your code with the world,
you can actually set your entire query to be publicly visible.
So keep in mind which one of the right settings is for you.
Also, again, all of this is written in your project instance in QwikLabs.
So make sure before you end the lab, or end this specialization,
any of the code that you want to keep.
Just copy that out of your BigQuery queries and into either your free trial
account for Google Cloud platform or the Notepad document.
Just for your reference later,
because this will completely go away as part of your project cleanup.
16:20
All right, that's the end of Lab 1, is there anything else we wanted to cover?
Let's do a quick recap of the top three things here.
So, number one, if you are brand new to SQL.
We're going to review a lot of the SQL concepts as part of the next module,
module number 3.
But number one is, make sure you use standard SQL.
If you're interested in the quality of your dataset, the first thing you might
want to consider is, what is supposed to be that unique record?
In this particular case in the organization's table there is 1.5 million
non-profits, but there are some of them that appear more than once.
So you can filter those out by using the counting them first with
the COUNT DISTINCT, very, very useful.
And then you can filter that out by filtering on the count
where the occurrence happens more than once.
So again, having filters on aggregations.
All right, let's continue, Lab 1 in the books.