In this video, we discuss data reduction and unsupervised learning, which are two essential concepts in cluster analysis. A dataset is essentially a table where the variables, also called features or attributes, are in the columns and the observations are in the rows, so all the data values are in the body of the table. The process of reducing the number of variables is known as dimensionality reduction, while grouping observations is a form of data reduction. Isolating the key variables in a dataset is important in order to build robust predictive models. It turns out that there is often some degree of redundancy among the variables in a dataset, and this is why it is possible to reduce the number of dimensions without losing critical information. Redundancy occurs when different attributes respond in similar ways to some common underlying factor.

For instance, let's assume the human resources department of a company creates an instrument to measure job satisfaction. Employees are asked to rate seven statements on a scale from one to seven, where one means that they strongly disagree with the statement and seven means that they strongly agree. Let's also assume that the statements in this survey are:

1. My supervisor treats me with consideration.
2. My supervisor consults me concerning important decisions that affect my work.
3. My supervisor gives me recognition when I do a good job.
4. My supervisor gives me the support I need to do my job well.
5. My pay is fair.
6. My pay is appropriate, given the amount of responsibility that comes with my job.
7. My pay is comparable to the pay earned by other employees whose jobs are similar to mine.

Let's suppose that the HR department wants to use the responses as seven separate variables to predict intention to quit. The problem with conducting the study this way is the redundancy in the predictor variables. The seven items in the questionnaire are not really measuring seven different constructs. More likely, items one to four are measuring a single construct that could reasonably be labeled satisfaction with supervision, while items five to seven are measuring a different construct that could be labeled satisfaction with pay.

These constructs can be identified with a technique called principal component analysis, or PCA for short. This technique creates new variables as linear combinations of the original variables; these new variables are called principal components. In our job satisfaction example, a principal component analysis would identify two components, and PCA would transform the original seven values into two scores, one for each component. We don't show how these scores are calculated, but, for example, in this case the employee with ID 102274 seems to be more satisfied with supervision than with pay.

Cluster analysis, on the other hand, is a data reduction technique in the sense that it can take a large number of observations and reduce them to a small number of identifiable groups. Each of these groups can be interpreted more easily and is represented by a centroid. The scatter plot shows four clusters for the scores in the job satisfaction survey. The stars represent the centroid of each cluster and can be used to characterize all the observations in the group. For instance, the gray cluster consists of employees with low job satisfaction and is represented by average scores close to two.
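As a rough illustration of this workflow, here is a minimal sketch in Python, assuming scikit-learn and NumPy are available. The survey responses are simulated, not the data shown in the video, and all variable names are hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Hypothetical survey data: 200 employees x 7 items on a 1-7 scale.
# Items 1-4 share a "supervision" factor, items 5-7 a "pay" factor.
supervision = rng.normal(4, 1.5, size=(200, 1))
pay = rng.normal(4, 1.5, size=(200, 1))
noise = rng.normal(0, 0.7, size=(200, 7))
ratings = np.clip(np.hstack([np.repeat(supervision, 4, axis=1),
                             np.repeat(pay, 3, axis=1)]) + noise, 1, 7)

# PCA: each component is a linear combination of the 7 original items.
pca = PCA(n_components=2)
scores = pca.fit_transform(ratings)   # 200 x 2 matrix of component scores
print(pca.explained_variance_ratio_)  # share of variance captured by each component
print(pca.components_)                # loadings of each item on each component

# K-means then groups the employees into four clusters; each cluster is
# summarized by its centroid (the "star" in the scatter plot).
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(scores)
print(km.cluster_centers_)            # centroid of each cluster in score space
```

With data simulated this way, the loadings in pca.components_ should show items one to four weighing heavily on one component and items five to seven on the other, which is the redundancy structure the example describes.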
Cluster analysis can achieve very significant data reductions by transforming thousands or even hundreds of thousands of observations into interpretable groups.

Now, let's talk about the concept of unsupervised learning. You may recall that classification techniques were discussed in the second course of this specialization. In classification, the objective is to find a set of rules that can be applied to a new observation in order to assign it to a group. Classification methods develop these rules by discovering patterns in historical data. The critical feature of this historical data is that the classification of the observations is known, and it is used to learn how to classify future observations. Because this piece of information is available, the process is known as supervised learning. For instance, this table shows ten answers to the job satisfaction survey, and it also indicates whether or not the employee quit. The two employees who quit had low ratings for the salary questions, five, six, and seven, and mixed ratings for the supervisor questions, one through four. A prediction model built on this data falls in the category of supervised learning, because the outcome that the model is trying to predict is known in the historical data.

In unsupervised learning, the observations in the historical data are not labeled. That is, we don't know whether an observation belongs to one group or another, which means that we don't know how many different groups there are in the population from which the dataset originated. Discovering the number of groups is therefore one of the main outcomes of the analysis. For example, in a previous video, we described how the market intelligence firm Information Resources Incorporated conducted a cluster analysis of survey data to establish that the market for natural and organic products consisted of seven distinct segments, a number that was not known prior to the completion of the analysis.

Cluster analysis can also be applied to historical data that is labeled, with the purpose of finding new labels. For example, in one study, cluster analysis was used to categorize mutual funds based on their financial characteristics instead of their investment objectives. The historical data for the study consisted of 904 different funds that fund managers had classified into seven categories according to their investment objectives. That is, the fund managers assigned a label to each fund and decided that there were seven possible labels. However, a cluster analysis on financial variables related to the funds concluded that there were only three distinct fund categories. The reduction in the number of categories has significant benefits for investors seeking to diversify their portfolios. The study determined that the consolidated categories were more informative about performance and risk than the original seven categories created by the fund managers. In terms of the data to use, the analysts initially considered 28 financial variables related to risk and return. However, after applying principal component analysis, they found that 16 of the 28 variables were able to explain 98% of the variation in the dataset. Therefore, they used only 16 variables for clustering, which, as we already mentioned, resulted in three fund categories. This example shows that dimensionality reduction and data reduction complement each other. As a matter of fact, it is common practice to apply dimensionality reduction techniques such as PCA before clustering.
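The PCA-before-clustering pipeline can be sketched in a few lines of Python. Note the hedges: the fund data is not available, so a random placeholder matrix stands in for the 904 funds and 28 variables; and where the study retained 16 of the original variables, this common variant instead retains PCA components until 98% of the variance is explained. All names here are ours, and scikit-learn is assumed.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = rng.normal(size=(904, 28))   # placeholder for 904 funds x 28 financial variables

# Standardize, then keep just enough components to explain 98% of the
# variance, mirroring the study's reduction of the 28 original variables.
Z = StandardScaler().fit_transform(X)
pca = PCA(n_components=0.98)     # a float makes scikit-learn pick the smallest
components = pca.fit_transform(Z)  # number of components reaching that share
print(components.shape[1], "components retained")

# Cluster in the reduced space; the study settled on three fund categories.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(components)
print(np.bincount(km.labels_))   # number of funds assigned to each category
```

Running PCA first removes redundant directions in the data, so the distances that k-means relies on are computed over fewer, less correlated dimensions, which is exactly why the two reduction techniques complement each other.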