However, k-means cluster analysis, and
cluster analysis in general, have some disadvantages.
First, we need to specify the number of clusters.
But we don't know the true number of clusters.
And figuring out how many clusters best represent the true subgroups
in the population is pretty subjective.
On top of that, your results can change depending on the location of
the observations that are randomly chosen as initial centroids.
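To see why the starting centroids matter, here is a minimal sketch in plain Python. The kmeans helper, the four toy observations, and the two starting pairs are all assumptions made for illustration, not the course's own code. The starting observations are picked by hand rather than at random so that the outcome is reproducible.

```python
def sqdist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, init_centroids, max_iter=100):
    """Basic k-means (Lloyd's algorithm) from the given starting centroids.

    Returns (labels, centroids, sse), where sse is the total squared
    distance from each point to its assigned centroid.
    """
    centroids = [tuple(c) for c in init_centroids]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda j: sqdist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:  # no point changed cluster: converged
            break
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    sse = sum(sqdist(p, centroids[lab]) for p, lab in zip(points, labels))
    return labels, centroids, sse


# Hypothetical toy data: two tight pairs of observations, far apart.
points = [(0, 0), (0, 1), (4, 0), (4, 1)]

# Start A: one starting observation from each pair.
labels_a, _, sse_a = kmeans(points, [points[0], points[2]])
# Start B: both starting observations from the same pair.
labels_b, _, sse_b = kmeans(points, [points[0], points[1]])

print(labels_a, sse_a)  # [0, 0, 1, 1] 1.0
print(labels_b, sse_b)  # [0, 1, 0, 1] 16.0
```

Start B settles on a solution that cuts across both pairs and has a much larger within-cluster sum of squares. That is why it's common to rerun k-means from several random starts and keep the solution with the smallest sum of squares.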
K-means cluster analysis is not recommended if
you have a lot of categorical variables.
In that case, you need to use
a different clustering algorithm that can better handle them.
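One way to see the problem: k-means works with means and squared Euclidean distances, and those depend on whatever numbers you happen to assign to the categories. The sketch below is only an illustration. The color variable, its integer codes, and the simple matching function are assumptions, and simple matching is shown as one possible alternative, not as the method any particular algorithm uses.

```python
# Hypothetical integer coding of one categorical variable.
codes = {"red": 0, "green": 1, "blue": 2}


def squared_distance(a, b):
    """Squared Euclidean distance between two integer-coded categories."""
    return (codes[a] - codes[b]) ** 2


def simple_matching(a, b):
    """Dissimilarity for categories: 0 if they match, 1 if they differ."""
    return 0 if a == b else 1


# The integer codes make blue look four times as far from red as green is,
# even though the three colors are simply different categories.
print(squared_distance("red", "green"), squared_distance("red", "blue"))  # 1 4
print(simple_matching("red", "green"), simple_matching("red", "blue"))    # 1 1
```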
K-means clustering assumes that the underlying clusters in the population
are spherical, distinct, and of approximately equal size.
As a result, it tends to identify clusters with these characteristics.
It won't work as well if clusters are elongated or not equal in size.
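Here is a small sketch of what can go wrong with elongated clusters. Everything in it is an assumption for illustration: the two thin strips of observations, the kmeans helper, and the hand-picked starting observations (one from each strip, fixed so the result is reproducible).

```python
def sqdist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, init_centroids, max_iter=100):
    """Basic k-means (Lloyd's algorithm); returns (labels, centroids, sse)."""
    centroids = [tuple(c) for c in init_centroids]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda j: sqdist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:  # converged
            break
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    sse = sum(sqdist(p, centroids[lab]) for p, lab in zip(points, labels))
    return labels, centroids, sse


def partition_sse(points, labels):
    """Within-cluster sum of squares for any given labelling."""
    total = 0.0
    for j in set(labels):
        members = [p for p, lab in zip(points, labels) if lab == j]
        center = tuple(sum(c) / len(members) for c in zip(*members))
        total += sum(sqdist(p, center) for p in members)
    return total


# Hypothetical data: two long, thin strips that sit one unit apart.
points = [(x, 0) for x in range(11)] + [(x, 1) for x in range(11)]
true_labels = [0] * 11 + [1] * 11

labels, _, sse = kmeans(points, [(0, 0), (10, 1)])
true_sse = partition_sse(points, true_labels)

# Count how many points from each true strip fall in each k-means cluster.
crosstab = {(t, k): 0 for t in (0, 1) for k in (0, 1)}
for t, k in zip(true_labels, labels):
    crosstab[(t, k)] += 1

print(crosstab)  # each k-means cluster mixes points from both strips
print(round(sse, 2), true_sse)
```

In this sketch, k-means cuts both strips in half instead of separating them, because that split has a smaller within-cluster sum of squares than the true grouping does.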
There are a few steps you can take to help you feel more confident about
the reliability and validity of your clusters.
First, conduct the k-means cluster analysis using a range of values of k.
This helps, but doesn't completely solve the cluster instability problem
related to the selection of initial centroids.
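Running a range of values of k can be sketched as follows. The data, the helper names, the seed, and the choice of 10 random starts per value of k are assumptions for illustration only. For each k, the sketch keeps the run with the smallest within-cluster sum of squares.

```python
import random


def sqdist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, init_centroids, max_iter=100):
    """Basic k-means (Lloyd's algorithm); returns (labels, centroids, sse)."""
    centroids = [tuple(c) for c in init_centroids]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda j: sqdist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:  # converged
            break
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    sse = sum(sqdist(p, centroids[lab]) for p, lab in zip(points, labels))
    return labels, centroids, sse


def best_of_restarts(points, k, n_starts, rng):
    """Run k-means from several random starts and keep the lowest-SSE run."""
    best = None
    for _ in range(n_starts):
        init = rng.sample(points, k)  # k distinct observations as starting centroids
        result = kmeans(points, init)
        if best is None or result[2] < best[2]:
            best = result
    return best


# Hypothetical data: three small groups of four observations each.
offsets = [(-0.5, 0.0), (0.5, 0.0), (0.0, -0.5), (0.0, 0.5)]
centers = [(0.0, 0.0), (10.0, 0.0), (5.0, 8.0)]
points = [(cx + dx, cy + dy) for cx, cy in centers for dx, dy in offsets]

rng = random.Random(0)  # fixed seed so reruns use the same random starts
sse_by_k = {
    k: best_of_restarts(points, k, n_starts=10, rng=rng)[2] for k in range(1, 6)
}

for k, sse in sse_by_k.items():
    print(k, round(sse, 2))
```

A common way to read the output is to look for the value of k after which the sum of squares stops dropping sharply. That judgment is still subjective, which is part of the point being made here.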
Splitting your data into training and test data sets
will allow you to run more than one sample through your algorithm, and
can be helpful in determining whether the clusters you find are reliable.
If you get the same results in different samples, you can be more confident that
the clusters are reasonably capturing the underlying subgroups in your population.
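One way to compare samples is sketched below: cluster the training and test sets separately, then check how close each training centroid is to its nearest test centroid. The simulated data, the 70/30 split, the helper names, and the seed are all assumptions for illustration, and this is only one of several reasonable ways to compare two cluster solutions.

```python
import random


def sqdist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, init_centroids, max_iter=100):
    """Basic k-means (Lloyd's algorithm); returns (labels, centroids, sse)."""
    centroids = [tuple(c) for c in init_centroids]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda j: sqdist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:  # converged
            break
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    sse = sum(sqdist(p, centroids[lab]) for p, lab in zip(points, labels))
    return labels, centroids, sse


def best_of_restarts(points, k, n_starts, rng):
    """Run k-means from several random starts and keep the lowest-SSE run."""
    best = None
    for _ in range(n_starts):
        result = kmeans(points, rng.sample(points, k))
        if best is None or result[2] < best[2]:
            best = result
    return best


rng = random.Random(1)  # fixed seed so the sketch is reproducible

# Hypothetical data: 60 observations scattered around three centers.
centers = [(0.0, 0.0), (10.0, 0.0), (5.0, 8.0)]
data = [
    (cx + rng.gauss(0, 1), cy + rng.gauss(0, 1))
    for cx, cy in centers
    for _ in range(20)
]

# Random 70/30 split into training and test sets.
rng.shuffle(data)
cut = len(data) * 7 // 10
train, test = data[:cut], data[cut:]

k = 3
_, train_centroids, _ = best_of_restarts(train, k, n_starts=10, rng=rng)
_, test_centroids, _ = best_of_restarts(test, k, n_starts=10, rng=rng)

# For each training centroid, the distance to the closest test centroid.
gaps = [min(sqdist(c, t) for t in test_centroids) ** 0.5 for c in train_centroids]
print([round(g, 2) for g in gaps])
```

If the gaps are small relative to the spread of the data, the two samples are producing essentially the same clusters.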
In addition,
validating the clusters by determining whether they are interpretable, and
whether they differ from each other on other variables not used in the cluster
analysis, can increase your confidence in the cluster solution that you choose.
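Checking whether clusters differ on a variable that was not used to form them can be sketched like this. The cluster labels, the gpa values, and the choice of a one-way ANOVA F statistic as the comparison are assumptions for illustration; the data are made up.

```python
def group_values(values, labels):
    """Collect the external variable's values by cluster label."""
    groups = {}
    for v, lab in zip(values, labels):
        groups.setdefault(lab, []).append(v)
    return groups


def anova_f(groups):
    """One-way ANOVA F statistic: between-group over within-group mean square."""
    all_values = [v for vs in groups.values() for v in vs]
    grand_mean = sum(all_values) / len(all_values)
    means = {lab: sum(vs) / len(vs) for lab, vs in groups.items()}
    ss_between = sum(
        len(vs) * (means[lab] - grand_mean) ** 2 for lab, vs in groups.items()
    )
    ss_within = sum(
        (v - means[lab]) ** 2 for lab, vs in groups.items() for v in vs
    )
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Hypothetical inputs: cluster labels from an earlier k-means run, and a
# variable (called gpa here) that was NOT used to form the clusters.
labels = [0, 0, 0, 1, 1, 1]
gpa = [2.0, 2.5, 3.0, 3.0, 3.5, 4.0]

groups = group_values(gpa, labels)
cluster_means = {lab: sum(vs) / len(vs) for lab, vs in groups.items()}
f_stat = anova_f(groups)

print(cluster_means)  # {0: 2.5, 1: 3.5}
print(f_stat)         # 6.0
```

A larger F statistic means the cluster means on the external variable are further apart relative to the variation within clusters. Deciding whether the difference is statistically significant would still require comparing F against the appropriate F distribution.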
In this course, we've just scratched the surface of cluster analysis.
There are many different clustering methods, distance measures, and
approaches to choosing initial centroids and the number of clusters to retain.
Some of these methods may be better suited to the data you have, or
to the shapes and sizes of the clusters you think might exist in your population.
K-means cluster analysis is a good starting point
because its simplicity makes it easier to convey the concepts.
Hopefully, you have learned enough to feel confident about exploring other
methods.