Clustering is an integral part of both the description and prediction tasks of data mining. Based on similarity, it divides objects into different groups and subsets through static classification. In Python, there are many third-party libraries and dedicated toolkits for cluster analysis, so let's first enjoy some of the charm of cluster analysis in Python.

There are a lot of clustering algorithms. Among them, the K-means clustering algorithm is widely used for its simplicity and speed. Generally speaking, K-means is simple, but it suffices. Let's look at its basic procedure. The first step is to randomly select k objects as the initial cluster centers, and then determine the cluster for each point, which is actually a distance calculation; the mean squared deviation is usually adopted as the standard measure function. Next, recalculate the center of each new cluster, and repeat until convergence; that is, when the centers no longer change, the clustering is complete. We should guarantee that each cluster is as compact as possible and that the clusters are as separated from each other as possible. A short sketch of these steps appears below.

Next, let's look at two examples of cluster analysis with the K-means algorithm. Common clustering toolkits all include K-means as a basic algorithm, such as "vq" in SciPy's cluster module; it is a vector quantization package that includes the K-means algorithm. Our example here goes like this: it is known that Dameng is a good learner, and we'll look for other potential good learners based on scores. There are six students: Xiaoming, Daming, Xiaopeng, Dapeng, Xiaomeng, and Dameng. They all take four courses: Advanced Math, English, Python, and Music, and their scores are as follows.

Now, let's use K-means to cluster these data with the following method. First, put the scores into a list and use NumPy's array() function to turn them into an array. Next, use the whiten() function to rescale each column by its standard deviation, forming a new array. There are two core functions here: one is kmeans() and the other is vq(). kmeans() clusters the data; this part, as we see, is the data. Next, look at the argument "2" after it. What does it mean? Let's think about it: since we're looking for good learners, doesn't that mean there are two groups, good learners and not-yet-good learners? So we pass 2, i.e., we cluster into two groups. The return value of the kmeans() function is a tuple, of which we only use the first element, an array of cluster centers; we may write it as a name, a comma, and an underscore, the underscore meaning we don't need the second element. Then we pass the result to vq(), a vector quantization function that classifies each data point, i.e., each student here, and gives us the result. Have a look.
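Before the library version, here is a minimal from-scratch sketch in NumPy of the steps just described. This is not the lecture's code; the function name simple_kmeans and its defaults are just illustrative.

import numpy as np

def simple_kmeans(data, k, n_iter=100, seed=0):
    # A bare-bones K-means, just to make the steps above concrete.
    rng = np.random.default_rng(seed)
    # Step 1: randomly select k objects as the initial cluster centers.
    centers = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each point to its nearest center (a distance calculation).
        dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each cluster's center as the mean of its points;
        # keep the old center if a cluster happens to receive no points.
        new_centers = np.array([data[labels == i].mean(axis=0) if (labels == i).any()
                                else centers[i] for i in range(k)])
        # Convergence: the centers no longer change, so clustering is complete.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels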
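And here is a sketch of the SciPy version the lecture walks through. The student names and courses come from the lecture, but the score values are made up for illustration, since the slide with the real numbers isn't reproduced here.

import numpy as np
from scipy.cluster.vq import whiten, kmeans, vq

# Scores in Advanced Math, English, Python, and Music.
# These numbers are illustrative; only the setup follows the lecture.
students = ['Xiaoming', 'Daming', 'Xiaopeng', 'Dapeng', 'Xiaomeng', 'Dameng']
scores = np.array([
    [88, 64, 96, 85],    # Xiaoming
    [92, 99, 95, 94],    # Daming
    [91, 87, 99, 95],    # Xiaopeng
    [78, 99, 97, 81],    # Dapeng
    [88, 78, 98, 84],    # Xiaomeng
    [100, 95, 100, 92],  # Dameng
])

# whiten() rescales each column by its standard deviation so that
# no single course dominates the distance calculation.
data = whiten(scores)

# Cluster into 2 groups: good learners and the rest.
# kmeans() returns (centroids, distortion); we only need the centroids.
centroids, _ = kmeans(data, 2)

# vq() assigns each student to the nearest centroid.
labels, _ = vq(data, centroids)
print(dict(zip(students, labels)))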
The result is like this. As we see, the group that Dameng belongs to is represented by 0. Well, let's consider: who else are good learners? This value and this value are also 0, which means Daming, Xiaopeng, and Dameng are in the same group, i.e., good learners. Then let's look at the detailed scores: it seems this group of scores has the potential of good learners. Now look at the other three students. As we see, Xiaoming's English is poor, so he hasn't become a good learner yet, and Xiaomeng's scores are similar across three courses, so he has to work harder. How about Dapeng? He seems to have the potential to be a good learner.

It's worth noticing that the K-means algorithm does not find a globally optimal solution, only a locally optimal one, so the clustering result is likely to vary. For example, when you run this program, you might find that Dapeng and Dameng are classified into the same group. Whatever the result, we can see that K-means is quite simple, but it really works. With a larger quantity of data, we might see an even more interesting result. This is a small case of finding good learners.

Next, let's see how to perform the same task of finding good learners with the renowned machine learning toolkit scikit-learn. scikit-learn is an open-source machine learning module in Python, built on the previously mentioned NumPy, SciPy, and matplotlib modules. It provides interfaces to various machine learning algorithms, so the user can simply and efficiently call these interfaces for all kinds of mining and analysis tasks. Now, let's see how to solve our problem with scikit-learn. As we see, the first part is the same: generate an array. The next part mainly uses two functions: the first is fit() and the other is predict(). A sketch follows.
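This is a sketch of the scikit-learn version, assuming the same illustrative scores as before.

import numpy as np
from sklearn.cluster import KMeans

# The same illustrative score array as in the SciPy sketch.
scores = np.array([
    [88, 64, 96, 85],
    [92, 99, 95, 94],
    [91, 87, 99, 95],
    [78, 99, 97, 81],
    [88, 78, 98, 84],
    [100, 95, 100, 92],
])

km = KMeans(n_clusters=2, n_init=10)  # two groups, as before
km.fit(scores)                # fit() is the learning step: find the clusters
code = km.predict(scores)     # predict() assigns each student to a category
print(code)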
"fit" is actually a learning process By contrast, the effect of predict() is to, based on the clustering result determine the belonging category Finally, the output result of "code" is like this As we see like we mentioned before Dapeng and Daming are classified into the same group, aren't they both good learners Clustering is an important method in machine learning or data mining Apart from clustering classification is also an essential method However, classification is different from clustering To put it simply, the way of classification is like this At first, it divides a dataset into two parts The first part is the training set and the second part is the test set Acquire a model from the training set and then give a definite mark to the group of unknown object in the training set For example, let's take one randomly for example If there are data on my work attendance for one year the mark of that group is at work or absence Suppose the test attributes include weather, feelings, day of week I'm full or not And suppose we get such a rule in training I go to work as long as I'm full Then, we apply this rule to the test set Can it mark off the data of my absence That's roughly the idea of classification There's another simple instance In this instance, the renowned support vector machine algorithm is utilized to classify data Similarly, two methods are used namely, the fit() method and the predict() method The fit() method is to learn n-1 training sets and then predict one test set So, is it still quite simple We only need to understand these methods and see how its arguments are set up and then we are able to perform some basic classification tasks Let's look at a more practical example Based on the rise and drop regularity of closing prices of two consecutive days of 10 Dow Jones Industrial Average stocks over the recent year conduct clustering to them Suppose they are the 10 companies First, suppose we've defined such a function to acquire the data of these companies Acquire their closing prices and then how to find the regularity of rise or drop in data You guys might still remember that As we mentioned before there's a diff() function in numpy to perform this task Therefore, here we still use the same function to process our data and then use our previously introduced fit() and predict() methods in scikit-learn to cluster these data Here, we cluster them into 3 groups Well, that's the clustering result we want According to these results can we think about it Why is there a similar regularity in these companies Is that related to political and economic factors We may explore a lot here Finally, let's briefly talk about selection and evaluation of models If some complex data mining tasks are to be completed with models of machine learning first, we need to select an appropriate model based on data and tasks such as a classification model or a clustering model After determination of model, there's an important task, i.e., determine its argument which often can't depend on experience only For instance, if some data are to be with a K-means model how can we determine the optimal quantity of K values, i.e., clusters We need to adjust the argument which is an essential and often energy-consuming task Continuously adjust the argument and evaluate the model so as to expect a good model and minimize the error between the true value and the predicted value This plot, for example, shows a K-means map of inflection points plotted with the most common "elbow" method to determine the optimal K value 
Finally, let's briefly talk about the selection and evaluation of models. If a complex data mining task is to be completed with a machine learning model, we first need to select an appropriate model based on the data and the task, such as a classification model or a clustering model. After choosing the model, there's another important task: determining its parameters, which often can't rely on experience alone. For instance, if some data are to be clustered with a K-means model, how do we determine the optimal value of K, i.e., the number of clusters? We need to tune the parameter, which is an essential and often energy-consuming task: continuously adjust the parameter and evaluate the model, so as to obtain a good model and minimize the error between the true values and the predicted values.

This plot, for example, shows the inflection points plotted with the most common "elbow" method for determining the optimal K value in K-means. The data are the Iris dataset, and the core indicator of the elbow method is the sum of squared errors (SSE). It is the clustering error over all samples: the sum of the squared distances from the samples in each cluster to that cluster's centroid. As K increases, the cohesion of each cluster is enhanced, but as K keeps increasing, the decrease in SSE becomes smaller and levels off. The plot of the relation between SSE and K therefore looks like an elbow, hence the name "elbow method", and the K value at the elbow is the optimal number of clusters. In this plot, as we see, the inflection point is at K=2, followed by K=3. As we know, the Iris dataset is actually divided into three categories. Our main aim here is to give you a vivid picture of how to select model parameters; a sketch of this computation appears at the end of this section. Of course, you shouldn't be too afraid, as the precision requirements of some problems are not that high; for example, in our earlier example of finding good learners, we determined the K value from experience. For the detailed mathematical principles of K-means and the method of selecting the optimal K value, you may refer to the explanatory document we provide.

In this section, we mainly used the simple K-means algorithm to discuss how to apply clustering to data. I hope I've unveiled some of the mysteries of machine learning and data mining for you. Are you interested now?
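As promised above, here is a minimal sketch of the elbow computation on the Iris dataset; the range of candidate K values is an arbitrary choice.

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

iris = load_iris()

# For each candidate K, inertia_ is the sum of squared distances from every
# sample to its cluster center -- the SSE that the elbow method plots.
ks = range(1, 9)  # arbitrary range of candidate K values
sse = [KMeans(n_clusters=k, n_init=10).fit(iris.data).inertia_ for k in ks]

plt.plot(list(ks), sse, 'o-')
plt.xlabel('K (number of clusters)')
plt.ylabel('SSE')
plt.show()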