Hello and welcome to Lesson 3 in Module 15.
This lesson will introduce the density-based clustering technique,
which is typically implemented with an algorithm called DBSCAN.
DBSCAN doesn't require the number of clusters to be specified ahead of time,
and that's a real benefit.
On the other hand, it's a little more complex to understand.
It's a density-based algorithm in that it finds
overdensities of points and assigns the points in each overdensity to a cluster.
So by the end of this lesson,
you should have a better understanding of what
density-based clustering algorithms are doing.
You should understand how the DBSCAN algorithm actually
works as well as be able to apply it by using the scikit-learn library.
There are two readings for this particular lesson.
The first is a visual example showing how this clustering will work.
So I'm going to once again go back and pick a different number of clusters.
Here, you can see there are three clusters of points.
And here is the DBSCAN algorithm being run, in
this case with a particular value of epsilon and a particular value of minPoints.
These are two of the hyperparameters.
So let me just go ahead and do it.
And you can see what happens.
It basically finds points and says,
"Are you near the other points that I'm looking at?
And if so, you're part of this cluster."
You notice that it stops; it's run out of reachable points.
When it's done with that particular group of points, it goes to the next points,
in this case, the blue ones,
and it keeps looking at those.
And then when it's done there,
it jumps up here to the green one.
And this time, you can see it's taking longer than
it did with the K-means algorithm.
And when we're done, you could see that we have three clusters.
We didn't say there were three ahead of time.
We have three clusters that are marked.
We also have these open circles.
These are points that are farther away from
the main cluster, and that's specified by these values here, epsilon and minPoints,
such that they are left as outliers.
So the DBSCAN algorithm is nice in that it analyzes data,
finds clusters as well as points that are not part of any cluster, and marks them.
The label it gives those leftover points is typically called noise.
Let's go ahead and look at the notebook for this particular lesson.
First, we talk a little bit about the formalism of the DBSCAN algorithm.
Then we'll look at the Iris data, and after that the digit data.
For the digit data, we're going to explore some of
the hyperparameters, how we can cluster the digits,
and what dimension reduction does to a density-based algorithm.
So again, for the formalism,
there are two main hyperparameters that we're going to worry about.
The first is epsilon and the second is minSamples.
The epsilon hyperparameter defines a maximum distance between
two points for them still to be considered in the same density neighborhood.
You could think of it this way.
If you started to build a cluster of points,
that's a density neighborhood,
and as soon as you start moving beyond epsilon,
you're no longer in that same density neighborhood.
The second hyperparameter is minSamples.
This parameter specifies how many points must lie within
this neighborhood of a current point in order for it to be considered a core point.
The set of core points is what's going to define our cluster because we have to
connect points via paths through these core points in order to build that cluster.
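To make that concrete, here is a minimal sketch of the core-point test itself, assuming Euclidean distance; the helper name and the brute-force distance computation are just for illustration and are not from the notebook:

```python
import numpy as np

def core_point_mask(X, eps, min_samples):
    """Return a boolean mask: True where a point has at least
    min_samples points (itself included) within distance eps."""
    # Brute-force pairwise Euclidean distances; fine for small data
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    neighbor_counts = (dists <= eps).sum(axis=1)
    return neighbor_counts >= min_samples
```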
And you saw that with the visualization: we started on a point,
drew a little circle with the epsilon radius,
found the other points inside it, and kept doing this;
as soon as we reached the minimum number of samples,
that formed a connected region.
If a point is outside of that epsilon,
then we call it an outlier, because it wasn't
reachable by a path between core points.
So we look at the DBSCAN algorithm, and we use it via the scikit-learn library.
We talked a little bit about the different hyperparameters you might want to change.
We run it, and when we first run it,
we get three clusters with these default values that I put in.
It turns out I actually tuned those parameters to make sure we would have three clusters.
You see the number of core points is 53.
Well, the dataset has 150 points, so that means most of the data is not treated as core points.
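A minimal sketch of what the notebook is doing with scikit-learn looks like this; the eps and min_samples values here are illustrative stand-ins, not the tuned values from the notebook:

```python
from sklearn.datasets import load_iris
from sklearn.cluster import DBSCAN

X = load_iris().data

db = DBSCAN(eps=0.5, min_samples=5).fit(X)  # illustrative hyperparameters
labels = db.labels_                          # cluster labels; -1 marks noise

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("clusters:", n_clusters)
print("core points:", len(db.core_sample_indices_), "of", len(X))
```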
So we really want to look at this in a plot.
We can do that. Here is our data,
and we've marked our core points with X.
These are the points that, with this set of hyperparameters,
are marked as critical for defining the extent of each cluster.
We can also then ask, "Well, where are these clusters?"
Here are our clusters and how many members each one has,
and then we can of course plot that.
And here, what we're doing is saying,
"Here's our data, with the big circles labeled by the Iris type."
Then the small circles mark the individual clusters, with the last thing,
noise, being marked by a purple dot.
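A rough sketch of that kind of plot; the choice to plot the first two Iris features and the styling are my assumptions, not necessarily the notebook's:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.cluster import DBSCAN

X = load_iris().data
db = DBSCAN(eps=0.5, min_samples=5).fit(X)  # same illustrative values as above
labels = db.labels_

core_mask = np.zeros(len(X), dtype=bool)
core_mask[db.core_sample_indices_] = True

# One scatter per label; -1 is the noise label
for label in sorted(set(labels)):
    pts = X[labels == label]
    name = "noise" if label == -1 else f"cluster {label}"
    plt.scatter(pts[:, 0], pts[:, 1], s=60, alpha=0.5, label=name)

# Overplot the core points with an 'x' marker
plt.scatter(X[core_mask, 0], X[core_mask, 1], marker="x", c="k", s=20)
plt.xlabel("sepal length (cm)")
plt.ylabel("sepal width (cm)")
plt.legend()
plt.show()
```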
So now, we have four different types of clusters if you will.
We have class 0,
class 1, class 2,
and then finally, noise.
And if you look at this, you see that
the concentrated high density regions are clustered together,
but these points out here, even though they're part of the Iris class,
are a little farther away from the main body, and so they're marked as noise.
Same with this point down here.
When we come over to these two Iris species,
they are intertwined in this two-dimensional space,
and thus it's hard to separate them.
So you could see here's a single cluster,
and here's the second cluster,
and there's lots of noise points in between.
This just sort of demonstrates how the algorithm is actually working.
Notice this point was marked as actually part of
the cluster while these two on either side of it weren't; again,
that's simply due to the nature of the algorithm trying to find paths of connected points,
and we see that those points were not included.
The rest of this notebook then switches over to the digit data.
We're going to view it and then we're going to
start changing hyperparameters and seeing what happens.
Here, we search; the whole idea is,
"Is there a particular value of
epsilon that will give us the right number of clusters?",
which in this case should be 10 plus the noise.
So we say, "Which of these give us 11?"
As soon as it does, we print it out,
and it turns out there's just this one value when minSamples is 20.
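The search might look roughly like this; the eps range and step size are guesses for illustration, and only the idea of scanning for 10 clusters plus noise comes from the lesson:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.cluster import DBSCAN

X = load_digits().data

for eps in np.arange(15.0, 30.0, 0.5):  # illustrative search range
    labels = DBSCAN(eps=eps, min_samples=20).fit_predict(X)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    if -1 in labels and n_clusters == 10:  # 10 clusters + noise = 11 labels
        print(f"eps={eps:.1f}: 10 clusters, {np.sum(labels == -1)} noise points")
```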
You can see that a lot of the data is noise in this case.
The other clusters have different ratios of membership.
So we can actually look at these.
We could say, "How many clusters do we have?" and "Can we print them?"
You can see that a lot of the clusters have
a reasonable number of members, but this one's a little higher than normal,
and this one and this one are a little lower than normal.
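Counting the membership of each cluster can be done with a simple tally; the eps value below is an illustrative stand-in for whatever the search above reports:

```python
from collections import Counter
from sklearn.datasets import load_digits
from sklearn.cluster import DBSCAN

X = load_digits().data
labels = DBSCAN(eps=20.0, min_samples=20).fit_predict(X)  # illustrative eps

print(Counter(labels))  # membership per cluster label; -1 is noise
```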
We can then take a look at the mean cluster images.
And again, you could see that this cluster is a little confusing;
maybe a little ambiguous;
maybe this one as well.
But then again, this one kind of looks like a 7.
This one kind of looks like a 5, a 3,
a 6, a 0.
So some of them do very well in terms of being viewed and others don't.
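A sketch of computing those mean cluster images; the hyperparameters are again illustrative, and the digits images are 8-by-8:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.cluster import DBSCAN

X = load_digits().data
labels = DBSCAN(eps=20.0, min_samples=20).fit_predict(X)  # illustrative eps

cluster_ids = [c for c in sorted(set(labels)) if c != -1]  # skip noise
n = max(len(cluster_ids), 1)
fig, axes = plt.subplots(1, n, figsize=(2 * n, 2))
for ax, cid in zip(np.atleast_1d(axes), cluster_ids):
    # Mean of all member images, reshaped back to an 8x8 picture
    ax.imshow(X[labels == cid].mean(axis=0).reshape(8, 8), cmap="gray_r")
    ax.set_title(f"cluster {cid}")
    ax.axis("off")
plt.show()
```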
When we make our confusion matrix, it's a little different because now
we have to include our noise, and that's done here at the top.
You can see that the noise cluster contains images of pretty much every digit,
with the exception of maybe 6 and 0.
And if we come up here, you can see the 6 and the 0 are both really crisp and clear,
so these are typically not marked as noise or outliers.
The other sets typically do have some that are marked that way.
But otherwise, you see a lot of the clusters are done quite well, right?
This particular one is split evenly between noise and cluster 1.
This one does pretty well.
The only one that doesn't do well is cluster 1,
which is predicted as a 1, a 4, or a 9.
And if we come up here, you can see that it is a very ambiguous cluster.
The other one here was 4; if we come down to 4,
that's actually a little confusing, because it looks like
it should be pretty good, but it's a little messed up.
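That matrix can be built with a simple cross-tabulation; the noise label, -1, sorts to the top as its own row. The eps value is again an illustrative stand-in:

```python
import pandas as pd
from sklearn.datasets import load_digits
from sklearn.cluster import DBSCAN

digits = load_digits()
labels = DBSCAN(eps=20.0, min_samples=20).fit_predict(digits.data)  # illustrative eps

# Rows are DBSCAN labels (noise = -1 first), columns are the true digits
print(pd.crosstab(labels, digits.target, rownames=["cluster"], colnames=["digit"]))
```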
The last thing is dimension reduction.
We can apply dimension reduction and then go and see what our clusters look like.
Notice that when I do that and use the same hyperparameters,
we only get three clusters,
and they look kind of strange.
This one is maybe a 7, but it's uncertain; that one is definitely ambiguous.
So we can change our hyperparameters and recover our clusters again,
then look at them and do similar analyses.
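A sketch of that workflow, using PCA as one possible dimension-reduction method; the lesson doesn't pin down the method or the re-tuned values, so these are assumptions:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.cluster import DBSCAN

X = load_digits().data
X_2d = PCA(n_components=2).fit_transform(X)

# The eps that worked in 64 dimensions won't work here; distances shrink
# after projection, so eps must be re-tuned (the value below is a guess)
labels = DBSCAN(eps=1.5, min_samples=20).fit_predict(X_2d)
print("clusters:", len(set(labels)) - (1 if -1 in labels else 0))
```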
So hopefully, this has given you a bit of a feel for density-based clustering
and the DBSCAN algorithm.
The nice thing about this algorithm is you
don't have to know the number of clusters ahead of time.
So you might think about running DBSCAN first to get a feel for how many clusters there are,
and then running K-means.
But again, DBSCAN has some useful properties.
You don't have to specify the number of clusters.
It also has the idea of a noise cluster,
whose members are technically outliers.
And we'll see the importance of
anomaly detection and being able to find outliers in a future module.
If you have any questions,
let us know in the course forum. And of course, good luck.