Let's first address generalization, which will help us answer the question of why the most accurate ML model is not always your best choice. Once again we find ourselves with a familiar natality dataset, but this time we're going to use the mother's weight gain on the x-axis to predict the duration of the pregnancy on the y-axis. What do you observe about the pattern in the data? It looks very strongly correlated: the more weight gained, the longer the duration of the pregnancy, which intuitively makes sense as the baby is growing. To model this behavior and prove a correlation, what model would you typically want to call on first? If you said a linear regression model, you're exactly correct.

As we covered, for regression problems the loss metric you typically want to optimize is Mean Squared Error (MSE) or Root Mean Squared Error (RMSE). The mean squared error tells us how close a regression line is to the set of points around it. It takes the distances from the points to the regression line, which are called the errors, and squares them. The squaring removes any negative signs, and it also gives more weight to larger differences from the line. Taking the square root of the MSE gives us the RMSE, which is simply the average distance of a data point from the fitted line, measured along a vertical line. The RMSE is directly interpretable in terms of the measurement units on the y-axis, so it's a better measure of goodness of fit than a correlation coefficient. For both error measures, a lower value indicates a better-performing model, and the closer the error is to zero, the better. Here we're using a linear regression model, which simply draws the line of best fit that minimizes the error. Our final RMSE is 2.224, and let's say for our problem that's pretty good.

All right, but look at this: what if we used a more complex model? A more complex model has more free parameters, and in this case those free parameters let us capture every single squiggle in the dataset, as you see there. We've reduced our RMSE all the way down to zero; the model is now perfectly accurate. Are we done? Is this the best model? Can we productionize this? You might feel there's something fishy going on with model number two, but how can we tell? In ML we often have lots of data and no such intuition. Is a neural network with eight nodes better than a neural network with 12 nodes? If we have a lower RMSE for one with 16 nodes, should we pick that one? The example you see here might be a polynomial of the 100th order, or a neural network with hundreds of nodes. As you saw in the spiral example at the end of the last lecture on optimization, a more complex model has more of these parameters that can be optimized. While this can help it fit more complex data like the spiral, it also might help it memorize simpler, smaller datasets. So at what point do we say to a model, "Stop training, you're memorizing the dataset and possibly overfitting"? One of the best ways to assess the quality of a model is to see how it performs against a new dataset that it hasn't seen before. Then we can determine whether or not the model generalizes well across new data points, which is a good proxy for production, real-world data. So let's check back on the linear regression model and the neural network models and see how they're doing now.
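To make that RMSE comparison concrete, here's a minimal sketch in Python. The numbers are synthetic stand-ins, not the actual natality dataset, and the degree-19 polynomial is just one way to play the role of "model number two": a straight-line fit and an overly flexible fit are trained on the same points, with the RMSE helper written out.

```python
import numpy as np

rng = np.random.default_rng(42)

def rmse(y_true, y_pred):
    # RMSE: the average vertical distance of the points from the fitted curve,
    # in the same units as the y-axis.
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# Hypothetical training data with a roughly linear trend plus noise
# (weight gain on x, pregnancy duration on y).
x_train = rng.uniform(10, 50, size=20)
y_train = 30 + 0.2 * x_train + rng.normal(0, 2, size=20)

# Model 1: a straight line of best fit (degree-1 polynomial).
linear = np.polynomial.Polynomial.fit(x_train, y_train, deg=1)

# Model 2: a high-degree polynomial with enough free parameters to chase
# every squiggle in the 20 training points.
wiggly = np.polynomial.Polynomial.fit(x_train, y_train, deg=19)

print("training RMSE, linear model:", rmse(y_train, linear(x_train)))  # honest, nonzero error
print("training RMSE, wiggly model:", rmse(y_train, wiggly(x_train)))  # effectively zero: memorized
```

The wiggly model's near-zero training error is exactly the suspicious behavior described above; the snippet after the next paragraph shows how a held-out dataset exposes it.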
Our linear regression model on these new data points is generalizing pretty well. The RMSE is comparable to what we saw before, and in this case no surprises is a good thing: we want consistent performance out of our models across training and validation. Looking back at model two, we can see that it doesn't generalize well at all on the new data points, and this is really alarming. The RMSE jumped from zero to 3.2, which is a huge problem and indicates that the model was completely overfitting itself on the training dataset it was provided, and that it proved too brittle, or not generalizable, on new data.

Now you may be asking, how can I make sure that my model is not overfitting? How do I know when to stop training? The answer is surprisingly simple: we're going to split your data. By dividing your original dataset into completely separated and isolated groups, you can iteratively train your model on the training dataset, and then, once you're done with training, compare its performance against an independent, siloed validation dataset. Models that generalize well will have similar loss metrics or error values across training and validation. As soon as you start seeing that your model doesn't perform well against your validation dataset, for example if your loss metrics start to increase or creep up, it's time to stop.

Training and evaluating an ML model is an experiment in finding the right generalizable model and model parameters that fit your training dataset without memorizing it. As you see here, we have an overly simplistic linear model that doesn't capture the true relationships in the data. You can see how bad it is almost visually, right? There are quite a few points outside the shape of the trend line, and this is called underfitting. On the opposite end of the spectrum, and arguably even more dangerous, is overfitting, as we talked about, and this is shown on the right extreme. Here we've greatly increased the complexity of the linear model and turned it into an nth-order polynomial, which seems to help the model fit the data and all of the squiggles we were talking about earlier. This is where your evaluation dataset comes into play: it helps you determine whether the model parameters are leading to overfitting. Is the model too complex? Overfitting, or memorizing your training dataset, can often be far worse than having a model that only adequately fits your data; sometimes you might not know until production, and that's why we validate. Somewhere in between an underfit and an overfit model is the right level of model complexity. So let's look at how we can use our validation dataset to help us know when to stop training and to prevent overfitting.
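As a minimal sketch of that data split, again with synthetic numbers rather than the course's actual dataset, the snippet below holds out a completely separate validation set and compares RMSE across both splits for a simple model and an overly complex one. The split sizes and polynomial degrees are illustrative assumptions, not prescribed values.

```python
import numpy as np

rng = np.random.default_rng(0)

def rmse(y_true, y_pred):
    # Same RMSE helper as in the previous sketch.
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# Synthetic data with a linear trend plus noise, standing in for the real dataset.
x = rng.uniform(10, 50, size=30)
y = 30 + 0.2 * x + rng.normal(0, 2, size=30)

# Split the data into completely separated, isolated groups:
# 20 points for training, 10 held out for validation.
shuffled = rng.permutation(len(x))
train_idx, val_idx = shuffled[:20], shuffled[20:]

for degree in (1, 19):
    model = np.polynomial.Polynomial.fit(x[train_idx], y[train_idx], deg=degree)
    train_err = rmse(y[train_idx], model(x[train_idx]))
    val_err = rmse(y[val_idx], model(x[val_idx]))
    # A model that generalizes shows comparable RMSE on both splits; a tiny
    # training error paired with a much larger validation error is the
    # overfitting signal that tells us to stop and simplify.
    print(f"degree {degree:>2}: train RMSE = {train_err:.2f}, validation RMSE = {val_err:.2f}")
```

The same comparison is what drives early stopping during iterative training: keep checking the validation error, and stop as soon as it starts creeping up while the training error keeps falling.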