In this lecture, we will discuss what a validation set is and how it relates to overfitting and model performance evaluation. After this video, you will be able to describe how validation sets can be used to avoid overfitting, articulate how training, validation, and test sets are used, and list several ways that validation can be performed.

In our lesson on classification, we discussed that there is a training phase, in which the model is built, and a testing phase, in which the model is applied to new data. The model is built using training data and evaluated on test data; the training and test data are two different datasets. The goal in building a machine learning model is to have the model perform well on the training set, as well as generalize well to new data in the test set. Recall that a model that overfits does not generalize well to new data, and that overfitting generally occurs when a model is too complex. So to have a model with good generalization performance, model training has to stop before the model gets too complex. How do you determine when this should occur? A validation set can be used to guide the training process to avoid overfitting and deliver good generalization performance.

We have discussed having a training set and a separate test set. The training set is used to build a model, and the test set is used to see how the model performs on new data. Now we want to further divide the training data into a training set and a validation set. The training set is used to train the model as before, and the validation set is used to determine when to stop training the model to avoid overfitting, in order to get the best generalization performance.

The idea is to look at the errors on both the training set and the validation set during model training, as shown here. The orange solid line on the plot is the training error and the green line is the validation error. As model building progresses along the x-axis, the number of nodes increases; that is, the complexity of the model increases. We can see that as the model complexity increases, the training error decreases. The validation error, on the other hand, initially decreases but then starts to increase. When the validation error increases, this indicates that the model is overfitting, resulting in decreased generalization performance. This can be used to determine when to stop training: the point where the validation error starts to increase is where you get the best generalization performance, so training should stop there. This method of using a validation set to determine when to stop training is referred to as model selection, since you are selecting one model from many of varying complexities. Note that this was illustrated for a decision tree classifier, but the same method can be applied to any type of machine learning model.

There are several ways to create and use the validation set to avoid overfitting. The different methods are the holdout method, random subsampling, k-fold cross-validation, and leave-one-out cross-validation. The first way to use a validation set is the holdout method. This describes the scenario that we have been discussing, where part of the training data is reserved as a validation set; the validation set is then the holdout set. Errors on the training set and the holdout set are calculated at each step during model training and plotted together, as we have seen before, and the lowest error on the holdout set is where training should stop. This is just the process that we have described before.
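To make the holdout method concrete, here is a minimal sketch in Python, assuming scikit-learn and a synthetic dataset; the dataset, the range of complexities tried, and the variable names are illustrative, not from the lecture. It grows a decision tree's complexity, tracks training and validation error, and keeps the complexity with the lowest validation error.

```python
# Minimal sketch of the holdout method (assumes scikit-learn is available).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the original training data.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Hold out part of the training data as a validation set.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0)

best_nodes, best_val_error = None, float("inf")
for n_nodes in range(2, 200, 5):  # increasing model complexity
    tree = DecisionTreeClassifier(max_leaf_nodes=n_nodes, random_state=0)
    tree.fit(X_train, y_train)
    train_error = 1 - tree.score(X_train, y_train)
    val_error = 1 - tree.score(X_val, y_val)
    # Training error keeps falling as complexity grows, but validation error
    # eventually rises; keep the complexity with the lowest validation error.
    if val_error < best_val_error:
        best_nodes, best_val_error = n_nodes, val_error

print(f"Selected max_leaf_nodes={best_nodes}, validation error={best_val_error:.3f}")
```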
There are some limitations to the holdout method, however. First, since some samples are reserved for the holdout validation set, the training set now has less data than it originally started with. Second, if the training and holdout sets do not have the same data distribution, the results will be misleading; for example, if the training data has many more samples of one class and the holdout dataset has many more samples of another class.

The next method for using a validation set is repeated holdout, also known as random subsampling. As the name implies, this is essentially repeating the holdout method several times. In each iteration, samples are randomly selected from the original training data to create the holdout validation set. This is repeated several times with different training and validation sets, and then the error rates on the holdout set for the different iterations are averaged together to get the overall error rate for model selection. A potential problem with repeated holdout is that you could end up with some samples being used more than others for training. Since a sample can be used for either validation or training any number of times, some samples may be put in the training set more often than other samples, so you might end up with some samples being overrepresented while other samples are underrepresented in training or validation.

A way to improve on the repeated holdout method is to use cross-validation. Cross-validation works as follows: segment the data into k disjoint partitions; during each iteration, use one partition as the validation set; repeat the process k times, each time using a different partition for validation, so that each partition is used for validation exactly once. This is illustrated in the figure. In the first iteration, the first partition, shown in green, is used for validation; in the second iteration, the second partition is used for validation, and so on. The overall validation error is calculated by averaging the validation errors for all k iterations, and the model with the smallest average validation error is then selected. The process we just described is referred to as k-fold cross-validation. This is a very commonly used approach to model selection in practice. It gives you a more structured way to divide the available data between training and validation sets, and it provides a way to overcome the variability in performance that you can get when using a single partitioning of the data.

Leave-one-out cross-validation is a special case of k-fold cross-validation where k equals N, where N is the size of your dataset. Here, for each iteration the validation set has exactly one sample, so the model is trained using N minus one samples and is validated on the remaining sample. The rest of the process works the same way as regular k-fold cross-validation. Note that cross-validation is often abbreviated CV, and leave-one-out cross-validation is abbreviated LOOCV.

We have described several ways to use a validation set to address overfitting. Error on the validation set is used to determine when to stop training so that the model does not overfit. Note that the validation error that comes out of this process can also be used to estimate the generalization performance of the model; in other words, the error on the validation set provides an estimate of the error on the test set.
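Here is a small sketch of model selection with k-fold cross-validation, again assuming scikit-learn and synthetic data; the specific complexities tried are illustrative. Swapping KFold for LeaveOneOut gives the k equals N special case.

```python
# Sketch of model selection with k-fold cross-validation (assumes scikit-learn).
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

kfold = KFold(n_splits=5, shuffle=True, random_state=0)  # k disjoint partitions
# loocv = LeaveOneOut()  # the special case k = N: one sample per validation set

results = {}
for n_nodes in (5, 20, 50, 100):  # candidate model complexities (illustrative)
    tree = DecisionTreeClassifier(max_leaf_nodes=n_nodes, random_state=0)
    # Each partition serves as the validation set exactly once; average the
    # k validation errors to score this model complexity.
    scores = cross_val_score(tree, X, y, cv=kfold)
    results[n_nodes] = 1 - scores.mean()

best = min(results, key=results.get)
print("average validation error per complexity:", results)
print("selected max_leaf_nodes:", best)
```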
With the addition of the validation set, you really need three distinct datasets when you build a model. Let's review these datasets. The training dataset is used to train the model, that is, to adjust the parameters of the model to learn the input-to-output mapping. The validation dataset is used to determine when training should stop in order to avoid overfitting. The test dataset is used to evaluate the performance of the model on new data. Note that the test dataset should never, ever be used in any way to create or tune the model. It should not be used, for example, in a cross-validation process to determine when to stop training. The test dataset must always remain independent from model training and remain untouched until the very end, when all training has been completed.

Note also that in sampling the original dataset to create the training, validation, and test sets, all three sets must contain approximately the same distribution of the target classes. For example, if in the original dataset 70% of the samples belong to one class and 30% to the other class, then approximately this same distribution should be present in each of the training, validation, and test sets; otherwise, the analysis results will be misleading. (A sketch of this kind of stratified splitting follows the summary below.)

To summarize, we have discussed the need for three different datasets in building a model: a training set to train the model, a validation set to determine when to stop training, and a test set to evaluate performance on new data. We learned how a validation set can be used to avoid overfitting and, in the process, provide an estimate of generalization performance. And we covered different ways to create and use a validation set, such as k-fold cross-validation.
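As a final sketch, here is one way to do the stratified splitting described above, assuming scikit-learn; the 70/30 class weights and the split sizes are illustrative choices, not prescribed by the lecture.

```python
# Sketch of stratified splitting so that training, validation, and test sets
# all keep roughly the original class distribution (assumes scikit-learn).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, StratifiedKFold

# Synthetic data with roughly a 70/30 class split, as in the lecture's example.
X, y = make_classification(n_samples=1000, weights=[0.7, 0.3], random_state=0)

# First carve out the test set, stratified on the class labels, and leave it untouched.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Then split what remains into training and validation sets, again stratified.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=0)

for name, labels in [("train", y_train), ("validation", y_val), ("test", y_test)]:
    # Each printed distribution should be close to the original [0.7, 0.3].
    print(name, np.bincount(labels) / len(labels))

# StratifiedKFold preserves the class distribution in cross-validation folds too.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, val_idx) in enumerate(skf.split(X_rest, y_rest)):
    print("fold", fold, np.bincount(y_rest[val_idx]) / len(val_idx))
```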