But it turns out, there are a lot of common choices for picking K. Normally, we don't just let the user tell us how many bins they want; instead, we try to let the data lead us to the correct choice.

And one of the most common choices for the number of bins K (remember, K is the number of bins) is to take the square root of the number of data items we have.

And so, what do I mean by N?

Well, again, if I have a data set where it's title and number of pages,

N is going to be the number of rows in my data set.

So, if I have 100 books in the library,

then K is going to be the square root of 100.

And so, in this case, K would equal 10.

Now, of course, you have to think about the fact that I can't have a fractional number of bins.

So, typically, you'll take either the ceiling or the floor for this bin choice.
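As a quick sketch in Python (the function name `sqrt_choice` is my own), the square-root choice with a ceiling looks like this:

```python
import math

def sqrt_choice(n):
    """Square-root choice: k = ceil(sqrt(n)) bins for n rows."""
    return math.ceil(math.sqrt(n))

print(sqrt_choice(100))   # 100 books -> 10 bins
```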

Now, this particular choice works really well if the data, so this column of data, is already normally distributed. What we're always trying to do is think about how we can make the data fit some sort of normal distribution pattern.

And this is where things like Sturges' formula come in. Sturges' formula thinks about the data on a log scale: it sets K to one plus the base-2 logarithm of N, rounded up.

So again, remember N is the number of rows in our data set.
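A sketch of Sturges' formula, which sets K to 1 + log2(N), rounded up (the function name here is my own):

```python
import math

def sturges(n):
    """Sturges' formula: k = 1 + log2(n) bins, rounded up."""
    return 1 + math.ceil(math.log2(n))

print(sturges(1000))   # 1,000 rows -> 11 bins
```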

Here, Scott's choice tries to learn more about the data: the bin width comes from the standard deviation of the data divided by the cube root of the number of samples.
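A sketch of Scott's choice, which gives a bin width rather than a bin count; the usual multiplying constant is 3.49, and the sample page counts below are made up:

```python
import statistics

def scott_width(data):
    """Scott's choice: bin width h = 3.49 * stdev / n^(1/3)."""
    return 3.49 * statistics.stdev(data) / len(data) ** (1 / 3)

# width in "pages" for a small, made-up sample of page counts
print(scott_width([120, 90, 300, 240, 180, 60, 210, 150]))
```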

And the Freedman-Diaconis rule uses the interquartile range along with the cube root of the number of data samples.

And we're going to talk more about what IQR

is in a different module, a different lecture.
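And a sketch of the Freedman-Diaconis width, two times the IQR divided by the cube root of N; here I compute the quartiles with the standard library's `statistics.quantiles`, ahead of the fuller IQR discussion later:

```python
import statistics

def fd_width(data):
    """Freedman-Diaconis: bin width h = 2 * IQR / n^(1/3)."""
    q1, _q2, q3 = statistics.quantiles(data, n=4)  # quartiles
    return 2 * (q3 - q1) / len(data) ** (1 / 3)

# for the values 1..8, the IQR is 4.5 under the default quartile method
print(fd_width(list(range(1, 9))))
```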

And so for this histogram, imagine again that we have 1,000 books.

So, we've got our title,

we've got our number of pages,

and we've got 1,000 rows here.

And so we can use all of those common choices that I just showed (Scott's choice, the square-root choice, Freedman-Diaconis)

to create a histogram of the data.

And all I'm doing to create a histogram is this: let's say that my page range runs from the smallest book, at one page, up to the biggest book, at 10,000 pages.

Okay. So, if I do this, and I've got 1,000 samples, my square-root choice is going to be the square root of 1,000. All right? So, what is the square root of 1,000?

So, 10 squared is 100,

20 squared is 400,

30 squared is 900.

So, we're somewhere above 30 bins, right?

And we can put this in our calculator, figure this out.

But, once we know the number of bins, we can then find the bin width. So, let's just say in this case we want a user-defined K equal to four.

So, we're going to have four bins, and my book pages range from one to 10,000.

I have to figure out my bin width.

So, I've got 10,000 minus one: my max X minus my min X. That, divided by H, is going to equal K, and if the user-defined K is four, I can now solve for H. And maybe, to make this nicer, I don't want to use my min X as one;

I can say, okay, well, a nicer number would have been zero.

We talked about nice numbers before,

so I can take 10,000 divided by four and that would give me 2,500.

So, what this means is now,

every range is going to go from zero to 2,500

to 5,000 to 7,500 to 10,000.
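Those edges fall straight out of the bin-width calculation; a minimal sketch:

```python
# user-defined k = 4 bins over a page range of 0 to 10,000
min_x, max_x, k = 0, 10_000, 4
h = (max_x - min_x) / k                       # bin width: 2,500
edges = [min_x + i * h for i in range(k + 1)]
print(edges)   # [0.0, 2500.0, 5000.0, 7500.0, 10000.0]
```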

And what I do is for each book, each row here,

if this particular book has 10,000 pages,

that means I add one to this bin.

If this book has 6,789 pages,

that's between 5,000-7,500, so I add one to this bin.

Let's say this next one has 6,215 pages,

I add another one to this bin because it's between 5,000-7,500.

So, a histogram is just counting up how many things fall in each bin, and we figure out this K and this H based on our different rules.
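That counting loop is really all a histogram is; here's a sketch using the three page counts from the example:

```python
def histogram(values, edges):
    """Count how many values fall in each bin [edges[i], edges[i+1]);
    the last bin is closed on the right so the maximum lands inside."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            last_bin = i == len(counts) - 1
            if edges[i] <= v < edges[i + 1] or (last_bin and v == edges[-1]):
                counts[i] += 1
                break
    return counts

pages = [10_000, 6_789, 6_215]           # the books from the example
print(histogram(pages, [0, 2_500, 5_000, 7_500, 10_000]))   # [0, 0, 2, 1]
```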

And so for our imaginary data set here,

the square root choice looks something like this,

where we take the square root of 1,000 books and we wind up with over 30 bins,

and we get a distribution like this,

and we can sort of see, well, maybe there are two peaks; if I connect these with lines, I see this little dip here at 38.

And so, I may have the impression that perhaps there is

some multi-modality going on in the data set.

However, if I use Sturges' formula, I wind up with a much smaller number of bins,

and I wind up with what looks like a nice, almost normal distribution, maybe skewed a little bit towards the lower edge here.

And now, normally in a histogram, we wouldn't allow any space in between the bins, because the spacing suggests that there may be categories in the data.

So this is part of the problem with trying to draw things like this in Excel, and why we're going to learn to do things in Python, where we have more control over the design space.

Scott's choice, again, looks very similar.

We see this sort of nice normal distribution pattern for the most part.

Again, different number of bins.

And the same goes for the Freedman-Diaconis rule.

And so, we can see that, depending on the underlying data distribution, the histogram can give us slightly different views of the data, and it's impacted heavily by the number of bins and the bin width.
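If you don't want to hand-code these rules, NumPy implements all four by name, so you can compare how many bins each rule produces on the same column (the page data below is synthetic):

```python
import numpy as np

rng = np.random.default_rng(0)
pages = rng.normal(300, 80, size=1_000)   # made-up page counts

# NumPy knows these bin rules by name:
for rule in ["sqrt", "sturges", "scott", "fd"]:
    counts, edges = np.histogram(pages, bins=rule)
    print(f"{rule}: {len(counts)} bins")
```

Note how the rules based only on N ("sqrt", "sturges") give the same bin count for any 1,000-row column, while "scott" and "fd" adapt to the spread of the data.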

So those are the choices that we get to make for the design aspects of this.

And again, depending on what we choose, we can have drastically different looking histograms that can even cover things up. If we do wind up with something like multi-modality, then depending on how I've binned the data, I could transform data that looks like this into this.

Now what's interesting is, histograms are primarily our first-look data analytics tool.

We're trying to learn about what the underlying probability distribution of that data is,

how likely is it that we might find a book with only 10 pages

versus a book with 10,000 pages and things like this.

So this is often your go-to tool as a data detective, for trying to look at, explore, and understand data.

So oftentimes, you may have a book title,

you may have things like the number of pages,

you may even have a category and how much the book sold for or the cost,

you may know even the quantity of books sold.

And for each column,

you may want to make a histogram so you can understand what's going on with the data.
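As a sketch of that per-column pass (the table and its column values here are invented), using the square-root choice for each column:

```python
import math
import random

random.seed(0)
# an invented book table: one list per numeric column
books = {
    "pages": [random.randint(1, 1_200) for _ in range(500)],
    "price": [round(random.uniform(5, 60), 2) for _ in range(500)],
    "quantity_sold": [random.randint(0, 10_000) for _ in range(500)],
}

for name, col in books.items():
    k = math.ceil(math.sqrt(len(col)))       # square-root choice
    h = (max(col) - min(col)) / k            # bin width for this column
    counts = [0] * k
    for v in col:
        # clamp the index so the maximum value lands in the last bin
        counts[min(int((v - min(col)) / h), k - 1)] += 1
    print(name, k, "bins, width", round(h, 2))
```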

And the main thing there is, you can also quickly tell if there are errors in the data.

Imagine if my histogram looked like this for the number of pages.

Why would there be so many books falling in the bin from zero up to whatever this number of pages is? It likely means that something in the page column has an error.

So again, histograms are often our first-look tool for thinking

about what's going on in each of our data columns.

And these are typically not for nominal or ordinal data, but rather for ratio or interval data.

And so again, we get the choice of bin width, the choice of the number of bins, whether to pack the bins together like we show in this picture here, and we have to think again about designing labels and designing the aspect ratio; all of these elements we've been learning about go into the creation of this.