Welcome to Classification. After watching this video, you will be able to: Define key concepts of classification. List the types of classification. List some common classification algorithms. Evaluate the accuracy and results of a classification model. Differentiate between classification and regression. And determine whether classification or regression is suitable for your problem type.

Classification and regression are two types of prediction problems in machine learning and data science. Classification answers the question “What category does this fall into?” For example, when asking yourself, “Will I pass or fail my biology exam?” there are two categories: you will pass your biology exam, or you will fail your biology exam. Note that this is an either/or situation because you will either pass or fail; you can’t do both. On the other hand, regression answers “What will my biology exam score be?” For example, I can determine how my hours of sleep and studying impact my biology exam score.

Let’s start with classification. Classification is the process of predicting an outcome, known as a “class,” based on some given inputs. What are inputs? Let’s use our pass or fail example, “Will I pass or fail my biology exam?” Input variables are independent variables, or features, that are used to make predictions. The input can be a single variable, for example, “Average score on previous biology tests,” or multiple variables, such as “Percent of classes attended” and “Number of hours studied.” So really, the direction of the arrow shows how the input variables help answer the question, which then predicts the outcome of whether I will pass or fail. This means that the outcome of passing or failing my biology exam is dependent on these input variables, and knowing their values can give me information about the outcome of the test.

Now let’s put a name to some of these classification types. The previous example focused on predicting two classes: pass or fail. Another example is classifying whether an email is spam or not spam. When you have only two classes, it is called binary classification. When you have three or more classes, it is called multiclass classification. For example, predicting handwritten digits from 1 to 9, or predicting whether a piece of fruit is an apple, an orange, or a mango.

Before we go any further, let’s go through some very important terminology. We will add to it later, but let’s start with the basics: A classifier is a machine learning algorithm that is used to solve the classification problem. A feature is an independent variable that is used as an input to the model. And when you “evaluate” a model, you are assessing how well it has performed.

You can divide classification algorithms into two kinds of learners. The first one is the “lazy learner.” A lazy learner doesn’t have a training phase per se; it waits until it has a test data set before making predictions. This means that it doesn’t build a generalized model up front, so it takes longer to predict. A very popular example is the k-nearest neighbors algorithm, also known as KNN. KNN classifies an unknown data point by finding the k nearest examples in the training data and assigning it the most common class among them.

Now, let’s continue using our exam grade example. Consider two sets of points given on a plane. One set is a class of grey circles representing students who failed, and the other is a class of blue circles representing students who passed.
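Before walking through the prediction step, here is a minimal sketch of this setup in code, assuming scikit-learn’s KNeighborsClassifier and invented feature values (percent of classes attended and hours studied) with made-up pass/fail labels; it is an illustration of the idea, not the exact data from the example.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Illustrative training data: each row is one student,
# with features [percent of classes attended, hours studied].
X_train = np.array([
    [95, 10], [88, 8], [92, 12], [40, 2], [55, 3],
    [30, 1], [85, 9], [60, 4], [97, 11], [45, 2],
])
# Made-up labels: 1 = pass, 0 = fail.
y_train = np.array([1, 1, 1, 0, 0, 0, 1, 0, 1, 0])

# A lazy learner: "fitting" mostly just stores the training points.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# Predict for a new student who attended 75% of classes and studied 6 hours.
me = np.array([[75, 6]])
print(knn.predict(me))        # e.g., [1] -> "pass" for this made-up data
print(knn.predict_proba(me))  # share of the 5 nearest neighbors in each class
```

Because KNN is a lazy learner, most of the work happens at prediction time, when the distances from the new point to the stored training points are measured.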
Now, back to the plane: if I appear on it and want to predict whether I will pass or fail, KNN calculates the distance from my data point to the other students’ points based on some inputs and finds the k most similar students to me, with k being the number of neighbors it checks. Let’s assume k is 5. KNN will classify my grade as a “pass” using a majority vote approach: here, four out of five of my “neighbors” are classified as “pass.”

The second kind of learner is the “eager learner.” An eager learner spends a lot of time training and generalizing the model, so it spends less time predicting on the test data set. One eager learner you can use is logistic regression. This model is used to predict the probability of a class. For example, given the number of classes I attended, what is the probability that I will pass or fail my biology exam? Finally, you have decision trees, which are tree-like algorithms that use “if-then” rules. In this example, a decision tree classifies whether you will pass or fail based on some rules. Other, more advanced algorithms are support vector machines, naïve Bayes, discriminant analysis, and neural networks, just to name a few. A short code sketch of logistic regression and a decision tree appears after the recap below.

In this video, you learned that: Classification and regression are two types of prediction problems in machine learning and data science. Classification is for predefined classes, while regression is for continuous variables. Important classification terms are classifier, feature, and evaluation. And some common algorithms for classification are KNN, logistic regression, and decision trees.
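To make the eager learner examples concrete, here is a minimal sketch, again assuming scikit-learn and invented pass/fail data, that fits a logistic regression and a decision tree and then evaluates each model’s accuracy on a held-out test set.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Illustrative data: [percent of classes attended, hours studied] -> pass (1) / fail (0).
X = np.array([
    [95, 10], [88, 8], [92, 12], [40, 2], [55, 3], [30, 1],
    [85, 9], [60, 4], [97, 11], [45, 2], [70, 6], [35, 1],
])
y = np.array([1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0])

# Hold out part of the data so we can evaluate how well each model performs.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Eager learners: both spend their effort up front building a generalized model.
log_reg = LogisticRegression().fit(X_train, y_train)               # predicts class probabilities
tree = DecisionTreeClassifier(max_depth=2).fit(X_train, y_train)   # learns if-then split rules

for name, model in [("logistic regression", log_reg), ("decision tree", tree)]:
    accuracy = accuracy_score(y_test, model.predict(X_test))
    print(f"{name} accuracy: {accuracy:.2f}")

# Logistic regression also reports the probability of each class (fail, pass).
print(log_reg.predict_proba([[75, 6]]))
```

Accuracy is one simple way to evaluate how well a classification model has performed; because both of these eager learners do their work up front during fitting, predictions on the test set are fast.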