Hello, everyone. In this part, we'll use the UCI wine quality dataset to walk through a complete workflow: data acquisition, preprocessing, exploration, machine learning modeling, hyperparameter tuning, and model evaluation. I hope it gives you a reasonably complete picture of how to carry out a classification task with machine learning models.

First, let's look at the UCI dataset by visiting its official website. You can start with the dataset summary on the website to learn the basic information about the dataset. As we can see, the UCI wine dataset consists of two parts, covering red and white variants of the Portuguese "Vinho Verde" wine from the north of Portugal. It's a classic benchmark dataset for machine learning, often used for tasks such as data analysis and classification. We choose the red wine dataset for analysis; just click here to download the data.

Next, look at the basic information of the data. Here is the description of the attributes in the dataset. As we can see, there are 11 input variables, i.e., feature attributes, which are based on physicochemical tests, including residual sugar, pH, and alcohol. The output variable is the target attribute "quality", a quality rating of the wine based on sensory evaluation by experts, with values ranging from 0 to 10.

Let's look at the data directly. The file has a header row, and the values are separated by semicolons (";"). That makes things easy: just use the read_csv() function in the pandas module and set its "sep" argument to ';' to read the data. Let's read the data and save it into the variable "wine"; "wine" is a DataFrame object.

Next, let's explore the data and do any necessary preprocessing. Run the program. First, look at the basic information of the data: there are 1599 non-null records in total and 12 attributes. Data acquired like this often contain duplicate records, so let's check. The duplicated() method can be used to check whether a Series or DataFrame has duplicate rows. If a row duplicates an earlier one, it
returns True; otherwise it returns False. Summing the returned results with the sum() method gives the number of duplicate rows. As we can see, there are 240 duplicate records. Let's delete them with the drop_duplicates() method and assign the result back to "wine". After this step, 1359 records remain.

Then, look at the basic statistics of the data, starting with the target attribute "quality". As we can see, the mean is 5.6, the minimum is 3, the maximum is 8, and both the median and the upper quartile are 6. Clearly, the records whose quality is 6 make up at least a quarter of all records. To check how many records fall in each category of the "quality" attribute, use the value_counts() method. As we can see, the records for 5 and 6, the middle values, are numerous, while those for 3 and 8, the extreme values, are scarce, roughly following a normal distribution. Next, plot a pie chart. The records whose "quality" is 5 or 6 together account for more than 80% of all records.

Then, look at the Pearson correlation coefficient between "quality" and the other attributes. Data like these are often approximately normally distributed.
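As a minimal, self-contained sketch of these exploration steps, the following uses a tiny made-up sample in place of the real winequality-red.csv file, so the counts and statistics printed here are illustrative, not the dataset's actual values:

```python
import io
import pandas as pd

# Tiny semicolon-separated sample standing in for winequality-red.csv.
# Column names follow the real UCI file; the values are made up.
csv_text = """volatile acidity;alcohol;quality
0.70;9.4;5
0.88;9.8;5
0.70;9.4;5
0.28;11.8;6
0.30;12.0;8
"""
# The real file is read the same way:
# wine = pd.read_csv("winequality-red.csv", sep=";")
wine = pd.read_csv(io.StringIO(csv_text), sep=";")

# Count duplicate rows, then drop them
n_dupes = wine.duplicated().sum()   # 1 duplicate in this toy sample
wine = wine.drop_duplicates()

# Basic statistics of the target attribute
print(wine["quality"].describe())

# How many records fall in each quality category
print(wine["quality"].value_counts())

# Pearson correlation of every attribute with "quality"
print(wine.corr()["quality"])
```

On the full red wine data, the same calls yield the figures quoted in the lecture (240 duplicates, 1359 remaining records, mean quality 5.6).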
No separate normality test will be conducted here. As we can see, the correlations of the "volatile acidity" and "alcohol" attributes with "quality" are relatively strong: one is a negative linear correlation and the other a positive one. Let's move on to the plots showing the mean of "volatile acidity" and of "alcohol" for each "quality" value, which illustrate their correlations more vividly. We produce them with the barplot() function in the seaborn module, plotting each attribute to be observed separately.

Look at these two plots. What should we observe? For "alcohol", for example, look at this bar: it is the mean of the "alcohol" values over all records whose "quality" is 8, close to 12. The other bars are similar. You can clearly see that the higher the "alcohol" value, the higher the "quality"; "volatile acidity" shows exactly the opposite trend. This matches our earlier judgment from the correlation coefficients, but the bar chart is more detailed and more vivid. Since both the number of attributes and the number of records are small, we skip data reduction. That's all for data exploration and preprocessing.

Next, let's turn to our core task. Suppose we now have data for a new batch of red wine. Through physicochemical tests, we know its 11 attributes, such as pH and alcohol. Can we, without expert appraisal, determine the "quality" values of this batch of wine? From the exploration above, we know there are 6 distinct "quality" values, from 3 to 8: a lot of classes, which makes classification rather difficult. What if we turn the multi-class task into a two-class task, say by splitting the "quality" range [3, 8] into two parts, where [3, 6] means average quality and [7, 8] means superior quality? After this binarization, the question becomes a little easier: we only need to judge whether a new wine is of average or superior quality. Of course, we may also divide the raw data into 3 categories. For example, in the program
coming up, "quality" is processed into 3 categories. Look at the program first, and at this part in particular. We can bin the data with the cut() function in pandas, which we discussed before. First, set the bins into which the data will be divided, and think about how this value of "bins" splits the data. Note that binning with "bins" creates left-open, right-closed intervals, here (2, 4], (4, 6], and (6, 8]. Since "quality" ranges from 3 to 8, values 3 and 4 form one group, 5 and 6 another, and 7 and 8 a third. The group names are given by group_names: "low", "medium", and "high". After execution, "wine" gains one more attribute, "quality_lb", whose value is "low", "medium", or "high". Let's execute it. Indeed, one attribute has been added, and each value is one of the three names above.

Strings are inconvenient to operate on, so next we use the LabelEncoder class in sklearn's preprocessing module to assign the labels 0, 1, and 2 to the quality_lb attribute. After execution, the "label" attribute holds the numeric label. Continue executing. See? A new column has been added. Quite convenient, isn't it? Let's use the value_counts() method to check the distribution of the new categories. The "medium" category has the most records; this is the count for "high", and this is the count for "low".

Next, do some necessary processing on the data. After this, "wine" has 11 feature attributes; together with the newly generated "label", there are 12 attributes in total. Now, use data selection to separate the feature attributes from the target attribute, saving them into "X" and "y", respectively. Look at the values of "X" and "y". Since this is a classification task, the next step is usually to split the data into a training set and a test set. A standard and simple way to do this is to use the
train_test_split() function, which randomly selects training and test data from the sample at a given proportion. The test_size argument sets the proportion of the test set. Here, we set it to 0.2, that is, 20%, which works out to 272 records. The remaining 80% of the 1359 records, i.e., 1087 records, form the training set used to train the model. Before actually applying machine learning models, we need to normalize the data. Here, the training and test sets of feature attributes are standardized with the scale() function.

Next, let's use the Random Forest model to solve this classification problem. Random Forest is a relatively recent machine learning model and an ensemble learning algorithm. Ensemble learning means constructing and combining multiple learners to complete a learning task. Random Forest belongs to the Bagging family, the representative of parallel ensemble learning. Concretely, the raw dataset is randomly sampled multiple times to obtain several different sample sets; a decision tree base learner is trained on each sample set; and these base learners are then combined, by voting or by averaging, so that the model achieves high accuracy and good generalization. You can study the details of any specific model on your own, according to your background and needs. The main purpose of our case here is to pose a practical question based on data and construct a relatively complete solution, with the focus on the process. Later, depending on your needs, you can learn more about algorithms and models and then carry out real data mining.

With the sklearn module, modeling and testing are easy to complete. Look at these lines of code. First, use the RandomForestClassifier() function to construct a classifier. The n_estimators argument is frequently used; it is the number of subtrees to build before predicting by majority vote or averaging. Since the base learner is a decision tree, more
subtrees usually bring better model performance, but the code runs more slowly. Here, the argument is set to 200, and then learning is done with the model's fit() method on the training set. Then, use the predict() method to make predictions from the feature part of the test set (X_test). The prediction result is saved in this variable. For other classification models, the code is basically the same; normally, you just swap in a different modeling function.

How good are the predictions? We need to compare the predicted results with the actual y values. This is the actual y, and this is the prediction. Let's inspect the prediction results with the most common tool, the confusion matrix. The confusion matrix is a special matrix that gives a visual summary of an algorithm's performance: each row represents an actual category and each column represents a predicted category. Let's execute it and see the result. This is the confusion matrix of the classification result.

How do we read a confusion matrix? Simply put, the numbers on the diagonal are the counts of records whose categories were judged correctly, and those elsewhere are counts of misjudged records. The larger the share of the total that lies on the diagonal, the better the classification. Look at some details. This value, 15, is the number of correctly judged records of Category 0, and Category 0 means the quality is "high". What is the total number of Category 0 records? It's 15 plus 17, which comes to 32. As mentioned just now, a row in the confusion matrix represents the actual category, while a column represents the predicted category. Then, look at this value.
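Before reading the matrix in detail, here is a runnable sketch of the whole pipeline described so far: binning with cut(), encoding with LabelEncoder, splitting, scaling, fitting a Random Forest, and producing a confusion matrix. It uses randomly generated stand-in data (with made-up column names like "feat_0"), so the printed matrix will differ from the lecture's, but the steps and variable names mirror the lecture:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, scale

# Synthetic stand-in: 11 random "feature" columns plus quality scores 3..8
rng = np.random.default_rng(0)
wine = pd.DataFrame(rng.normal(size=(300, 11)),
                    columns=[f"feat_{i}" for i in range(11)])
wine["quality"] = rng.integers(3, 9, size=300)

# Bin quality into 3 groups: (2,4] -> low, (4,6] -> medium, (6,8] -> high
bins = [2, 4, 6, 8]
group_names = ["low", "medium", "high"]
wine["quality_lb"] = pd.cut(wine["quality"], bins=bins, labels=group_names)

# Encode the string groups as integers; LabelEncoder sorts classes
# alphabetically, so high -> 0, low -> 1, medium -> 2.
wine["label"] = LabelEncoder().fit_transform(wine["quality_lb"])

# Separate the feature attributes (X) from the target (y)
X = wine.drop(columns=["quality", "quality_lb", "label"])
y = wine["label"]

# 80/20 train/test split, then standardize the features
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
X_train, X_test = scale(X_train), scale(X_test)

# Random Forest with 200 subtrees, as in the lecture
rfc = RandomForestClassifier(n_estimators=200, random_state=0)
rfc.fit(X_train, y_train)
y_pred = rfc.predict(X_test)

# Rows = actual categories, columns = predicted categories
print(confusion_matrix(y_test, y_pred))
```

Because the stand-in features are pure noise, the model's accuracy here is meaningless; on the real wine data, the same code yields the matrix discussed below.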
Consider how to read it. As mentioned just now, it is the number of records that should have been Category 0 but were judged as Category 2. The other figures work the same way; you know how to read them now, right? Which category has the highest classification accuracy? Is it Category 2, the "medium" category? Only 6 of its records were misjudged as Category 0, that is, records of the original "medium" category misjudged as "high". Almost clear, right?

Moreover, as mentioned before, apart from setting arguments based on experience, hyperparameters can also be tuned systematically, and the GridSearchCV() function is one way to do so. In machine learning models, the parameters that must be chosen manually are known as hyperparameters, such as the number of decision trees in a Random Forest, which corresponds to the n_estimators argument. GridSearchCV() essentially performs a brute-force search: give it a grid of candidate arguments, and it returns the optimal result and the corresponding arguments. This method, however, is only suitable for small datasets. Let's briefly demonstrate it. The brute-force search is running. Quite slow. Done. Let's have a look. Here's an attribute, best_params_, which holds the argument combination with the best result. Here, we selected two arguments to tune. Let's view best_params_. There's only the value of 30. Then, re-train the model with the best argument combination and predict again. You may explore the details on your own as needed. Due to time limits, we have shown only one standard method of hyperparameter tuning.

That's all for this example. I hope it has given you a basic grasp of using machine learning algorithms to carry out data mining and complete classification tasks. As your knowledge grows, you will be able to apply these algorithms to solve your own practical problems.
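As an appendix, here is a minimal sketch of the grid-search step. It again uses synthetic stand-in data; the grid values and the choice of two tuned arguments (n_estimators and max_depth) are illustrative assumptions, not the lecture's exact settings:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the 11 wine features and 3-class labels
X, y = make_classification(n_samples=300, n_features=11, n_informative=5,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Brute-force search over a small grid of hyperparameter candidates
param_grid = {"n_estimators": [10, 30, 50],
              "max_depth": [3, 5]}
gs = GridSearchCV(RandomForestClassifier(random_state=0),
                  param_grid, cv=3)
gs.fit(X_train, y_train)

# The best argument combination found by the search
print(gs.best_params_)

# With refit=True (the default), GridSearchCV has already refit the
# best model on the full training set, so we can predict with it directly.
y_pred = gs.predict(X_test)
print((y_pred == y_test).mean())   # test-set accuracy
```

Note that with the default refit=True, a separate re-training step is optional: the fitted searcher itself exposes the best model's predict() method.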