In this video, we discuss how to perform cross-validation for classification problems.

In particular, we use a two-by-two table called a confusion matrix

to assess classification performance.

As with a regression problem, the first step of cross-validation is data partitioning,

where we randomly split the entire dataset by row into two sets.

The training set is used to fit the models; the other set is called the validation set,

and is used to choose among different models.

We use different measures to assess prediction accuracy for regression and classification.

For regression, we want the error, or residual, to be small in the validation set.

For that purpose, we use the sum of squared errors or, equivalently,

the root mean squared error on the validation data.
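
To see why the two measures are equivalent for comparing models, here is a short sketch of how they relate. The notation is an assumption and does not come from the video: y_i is the observed value, \hat{y}_i the predicted value, and n the number of rows in the validation set.

```latex
\mathrm{SSE} = \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2,
\qquad
\mathrm{RMSE} = \sqrt{\frac{\mathrm{SSE}}{n}}
```

Because n is the same for every model evaluated on the same validation set, the model with the smallest sum of squared errors also has the smallest root mean squared error.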

For classification, those measures cannot be used.

Instead, we look at a confusion matrix, which we will explain using an example.

Let's go back to the appointment data.

We will randomly divide the data by row into training

and validation sets using a 60/40 split.

4,478 rows are in the training set, and 2,985 rows are in the validation set.
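
As a minimal sketch of this partitioning step, the snippet below shuffles row indices and cuts them 60/40. It assumes the data has 7,463 rows (the sum of the two set sizes just quoted) and uses NumPy, which the video does not specify; the function name `split_rows` and the seed are hypothetical.

```python
import numpy as np

def split_rows(n_rows, train_fraction=0.6, seed=0):
    """Randomly partition row indices into training and validation sets."""
    rng = np.random.default_rng(seed)           # seed is arbitrary, for repeatability
    shuffled = rng.permutation(n_rows)          # random ordering of all row indices
    n_train = int(round(train_fraction * n_rows))
    return shuffled[:n_train], shuffled[n_train:]

# 7463 rows is an assumption: 4,478 + 2,985 from the sizes quoted above.
train_idx, valid_idx = split_rows(7463)
print(len(train_idx), len(valid_idx))  # 4478 2985
```

Every row lands in exactly one of the two sets, so the validation rows are never used to fit the models.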

Now, to assess the prediction accuracy of our model,

we make a prediction of whether each appointment will be canceled in the validation data.

Note that the outcome of the logistic regression model is a predicted probability.

How do we make a binary prediction using predicted probabilities?

One way is to set a threshold value t between zero and one.

When the predicted probability of a cancellation is above this threshold,

then we predict cancellation.

Otherwise, we predict no cancellation.

In this manner, we can arrive at a binary prediction from the predicted probabilities.
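
The thresholding rule can be sketched in a few lines. The probabilities below are made up for illustration and are not from the appointment data; the function name `classify` is hypothetical.

```python
import numpy as np

def classify(probabilities, t):
    """Return 1 (cancellation) where the probability is above t, else 0."""
    return (np.asarray(probabilities) > t).astype(int)

probs = [0.10, 0.35, 0.50, 0.80]   # made-up predicted probabilities
print(classify(probs, t=0.5))      # [0 0 0 1]
print(classify(probs, t=0.3))      # [0 1 1 1]
```

Note that a probability exactly equal to t is predicted as no cancellation here, since the rule requires the probability to be above the threshold.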

This immediately leads to the question: What value should we pick for t?

The threshold value is an important parameter we have to choose.

Choosing different values of t changes

the kinds of prediction errors we encounter.

If we choose a large t, then we rarely predict cancellation.

Therefore, we will detect a cancellation only when its predicted chance is very high.

This will lead to more errors when we predict no cancellation,

but the appointment is actually canceled.

If we choose a small t, we will predict lots of cancellations.

This allows us to detect more appointments that may be canceled.

However, we will make more mistakes where we predict cancellation,

but the appointment is actually not canceled.
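
To make the two kinds of mistakes concrete, here is a small sketch that counts each of them at several thresholds. The six probabilities and outcomes are made up for illustration, and the names `error_counts`, `missed`, and `false_alarms` are hypothetical.

```python
import numpy as np

# Made-up predicted probabilities and actual outcomes (1 = canceled).
probs = np.array([0.05, 0.20, 0.35, 0.45, 0.60, 0.90])
actual = np.array([0, 0, 1, 0, 1, 1])

def error_counts(t):
    """Count both kinds of mistakes made at threshold t."""
    predicted = (probs > t).astype(int)
    # Predicted no cancellation, but the appointment was canceled.
    missed = int(np.sum((predicted == 0) & (actual == 1)))
    # Predicted cancellation, but the appointment was not canceled.
    false_alarms = int(np.sum((predicted == 1) & (actual == 0)))
    return missed, false_alarms

for t in (0.3, 0.5, 0.7):
    print(t, error_counts(t))
# 0.3 (0, 1)
# 0.5 (1, 0)
# 0.7 (2, 0)
```

In this toy data, raising t increases the missed cancellations and lowering it increases the false alarms, which is the pattern described above.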

By default, we usually choose a threshold value of 0.5,

where we predict the more likely outcome.

Here are the confusion matrices for two different threshold values.

Note that cancellation is denoted by one.

The matrix shows the observed or actual class and predicted class.

Actual=0 means the observed status in the data is arrival.

And Actual=1 means the observed status is cancellation.

Similarly, Predicted=0 means we predict the appointment will not be canceled,

and Predicted=1 means that the appointment is predicted to cancel.

When the threshold is 0.5,

2,303 appointments that were not canceled match our prediction.

However, 17 appointments that were not canceled are predicted to cancel.

649 appointments that were canceled are predicted not to cancel,

while 16 canceled appointments are correctly predicted.

Note that very few appointments are predicted to cancel even though

there are more than 600 canceled appointments in our validation data.
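
As a quick check of these numbers, the four counts for the 0.5 threshold can be laid out as a two-by-two array. The row and column ordering below is one common convention and is an assumption, as is the use of NumPy.

```python
import numpy as np

# Rows: actual class (0 = arrival, 1 = cancellation).
# Columns: predicted class (0 = no cancellation, 1 = cancellation).
# Counts are the ones quoted for the 0.5 threshold.
cm_05 = np.array([[2303, 17],
                  [649, 16]])

total = int(cm_05.sum())                     # all validation rows
actual_canceled = int(cm_05[1].sum())        # row total for Actual=1
predicted_canceled = int(cm_05[:, 1].sum())  # column total for Predicted=1
print(total, actual_canceled, predicted_canceled)  # 2985 665 33
```

The total matches the 2,985 validation rows, and only 33 appointments are predicted to cancel against 665 that actually were.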

The confusion matrix changes considerably when we set the threshold to 0.3.

In particular, a lot more appointments are predicted to cancel.

Of the appointments that were not canceled, 216 were predicted to cancel.

122 canceled appointments were also correctly predicted.

Therefore, with a smaller threshold, we correctly predict more canceled appointments,

but also make more mistakes of predicting non-canceled appointments to cancel.

This shows the tradeoff in picking the threshold.
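
The tradeoff can be put side by side with a little arithmetic. Only two of the four counts for the 0.3 threshold are quoted (216 and 122); the sketch below derives the other two by assuming both matrices describe the same validation set, so the number of actually canceled and not canceled appointments does not change with the threshold.

```python
# Counts quoted for the 0.5 threshold on the validation data.
tn_05, fp_05, fn_05, tp_05 = 2303, 17, 649, 16

# Totals by actual outcome depend on the data, not on the threshold.
actual_not_canceled = tn_05 + fp_05   # 2320
actual_canceled = fn_05 + tp_05       # 665

# Counts quoted for the 0.3 threshold.
fp_03, tp_03 = 216, 122
# Derived cells (an assumption-based calculation, not quoted in the video).
tn_03 = actual_not_canceled - fp_03
fn_03 = actual_canceled - tp_03

print(tn_03, fp_03, fn_03, tp_03)  # 2104 216 543 122
```

Moving from 0.5 to 0.3 catches 106 more canceled appointments (122 versus 16) at the cost of 199 more non-canceled appointments wrongly predicted to cancel (216 versus 17).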

The confusion matrix essentially shows the possible outcomes

when we make binary predictions on the validation data.

We would like to detect the class with value one.

In the example we discussed, we would like to detect the appointment cancellations.

In this case, the events that we would like to detect are usually called positives.

The complementary events, in this case non-cancellations or arrivals, are called negatives.

A correct prediction can be either a true positive or true negative.

A canceled appointment that is correctly predicted to cancel is called a true positive.

Similarly, a non-canceled appointment

that is correctly predicted not to cancel is called a true negative.

An incorrect prediction can be a false positive or a false negative.

A canceled appointment that is incorrectly predicted

not to cancel is called a false negative.

Similarly, a non-canceled appointment that is incorrectly predicted

to cancel is called a false positive.

This terminology has its roots in the medical literature

but is commonly used in data analytics.
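
The four definitions above can be sketched as a small counting function. The labels below are made up for illustration, with 1 as the positive class (cancellation); the function name `confusion_counts` is hypothetical.

```python
def confusion_counts(actual, predicted):
    """Count true/false positives and negatives, treating 1 as positive."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            counts["TP"] += 1   # canceled, correctly predicted to cancel
        elif a == 0 and p == 0:
            counts["TN"] += 1   # not canceled, correctly predicted not to cancel
        elif a == 0 and p == 1:
            counts["FP"] += 1   # not canceled, incorrectly predicted to cancel
        else:
            counts["FN"] += 1   # canceled, incorrectly predicted not to cancel
    return counts

actual = [1, 1, 0, 0, 1, 0]      # made-up observed outcomes
predicted = [1, 0, 0, 1, 1, 0]   # made-up binary predictions
print(confusion_counts(actual, predicted))
# {'TP': 2, 'TN': 2, 'FP': 1, 'FN': 1}
```

The four counts always add up to the number of rows being evaluated, since every prediction falls into exactly one of the four cells.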