In this lecture, we'll look at some strategies to deal with datasets that contain missing values. We've already seen some datasets with missing observations. So far, we've just discarded them as a means of dealing with them, but there are alternative strategies, and in this lecture we'll investigate some different options to deal with these missing values. So we've already seen cases where we have missing data in our datasets, for example in the lecture on air quality prediction, or PM2.5 prediction, using Python. Even in that simple dataset, we had some missing values, which were indicated by an NA attribute in a particular feature. So far, to deal with that data, we discarded those instances: we built a new dataset that just threw out every observation where there was an NA measurement. Maybe that's an okay strategy where we had only a single feature with missing data, but how would it generalize? How would this approach work if there were many, many features, such that some features would be missing for a large fraction of our observations? That could happen in any realistic dataset, say census measurements, which have several features associated with each data point. If we just tried to discard any data point that had even a single missing feature, we'd find ourselves throwing out a very large amount of data, and may no longer be able to model it. So in this lecture, we'll look at a few different strategies for dealing with missing data. One option, which we've seen so far, is just filtering, but like I said, that can be dangerous. Another is missing data imputation; what this means is to try and fill in missing values with reasonable estimates. The final option is modelling: how can we actually change our regression or classification algorithms to handle missing data explicitly, and account for the effect that might have on the prediction?
So like I said, even when only a small amount of data is missing, simply discarding instances may not be an option, or may not be a good option. What else can we do? The first option we'll look at is missing data imputation, which seeks to replace missing values by reasonable estimates. One very simple scheme would be to say: whenever a feature is missing, let's just replace the missing value with the average value for that feature that we've observed from other instances. So, what would be the consequences of using a scheme like that? It's not the worst idea, but it does have some risks. For instance, the average is going to be very sensitive to outlying values: if you just replaced missing income values in a census by the mean income, that might not be a very realistic option, since the mean could be thrown off by very large outliers. Maybe you could address this with a different imputation scheme, for example using the median rather than the mean. Also, in some cases this type of imputation may not be very reasonable or realistic. For instance, in our gender example, or in any example containing categorical features, it's not clear how you impute them using an average. If you just had observations of gender, male or female, even if you encoded them as zero and one, imputing those values by the mean in the case where one was missing may not be a good idea. What would it mean to have a gender of, say, 0.25, and would that result in realistic predictions? You could also say, well, let's just replace these values by the most common gender, or the mode of the distribution; that also may not be a good idea, because the model would then be making strong assumptions about what value missing features should take, which again may not be realistic. So maybe more sophisticated data imputation schemes could help us with this: rather than using an average or median or mode, maybe we could do imputation at the level of individual subgroups.
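To make the mean-versus-median point concrete, here's a minimal sketch in NumPy. The `impute` helper and the income numbers are an invented illustration, not from the lecture, chosen so that a single large outlier shows why the mean can produce an unrealistic fill value while the median does not.

```python
import numpy as np

def impute(values, strategy="mean"):
    """Replace NaNs with the mean or median of the observed (non-NaN) values."""
    observed = values[~np.isnan(values)]
    fill = np.mean(observed) if strategy == "mean" else np.median(observed)
    out = values.copy()
    out[np.isnan(out)] = fill
    return out

# Hypothetical incomes with one large outlier and one missing value.
income = np.array([30_000.0, 35_000.0, 32_000.0, 1_000_000.0, np.nan])

mean_filled = impute(income, "mean")      # fill value dragged up by the outlier
median_filled = impute(income, "median")  # fill value stays near typical incomes
```

Here the mean fills the gap with 274,250 while the median fills it with 33,500, which is the intuition behind preferring the median for heavy-tailed features.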
So, if the height feature was missing, rather than imputing the height using the overall average height, maybe we can look at the other features and say: well, if this person is male, let's impute the height by the average height for males, or if female, we'll impute the height by the average height for females. That way, we'll get a subgroup-based, more realistic estimate or imputation of those missing values. Another option is that we could actually train a separate predictor: if height is sometimes missing from our data, maybe we could train a predictor which estimates height from the other features that are observed. This again is a sensible enough idea, though you could say there could be some complexities to actually implementing it in practice. For instance, if there are several features that can potentially be missing, then when you try to impute height from the other features, and some of those features are themselves missing, and you try to predict those features from height in turn, you get a circular dependency issue when you try to make all of your predictions. You might be able to resolve that with some more complex scheme, but it's not going to be straightforward. So the third option I suggested: we can actually try and directly model the missing values within a regression or classification algorithm. One simple scheme for doing this would just be to add an additional feature which indicates that a value is missing. In this case, my feature representation for my data, including males and females, would say there's actually a third dimension in our feature representation which says: for this person, that feature is missing. I'm not saying they're male or they're female, but I'm adding a third dimension to my model which says what prediction should be made in the event that that feature is missing.
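The subgroup idea can be sketched in a few lines. The `impute_by_group` helper and the small height/gender arrays below are a hypothetical example of mine, assuming NumPy: each missing height is filled with the mean height of that person's gender group rather than the overall mean.

```python
import numpy as np

def impute_by_group(values, groups):
    """Fill each missing value with the mean of the observed values
    belonging to the same group (e.g. the same gender)."""
    out = values.copy()
    for g in np.unique(groups):
        mask = (groups == g)
        group_mean = np.nanmean(values[mask])  # mean ignoring NaNs
        out[mask & np.isnan(values)] = group_mean
    return out

heights = np.array([178.0, np.nan, 165.0, 163.0, np.nan])
gender  = np.array(["m",   "m",    "f",   "f",   "f"])

filled = impute_by_group(heights, gender)
# The missing male height is filled with the male mean (178.0),
# the missing female height with the female mean (164.0).
```

The same pattern extends to finer subgroups (e.g. gender and age bracket together), at the cost of having fewer observed values per group to average over.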
So, looking at that model where we have an individual feature indicating that a particular dimension is missing, what kind of prediction would our model make under this scheme? If we just write out the model equation in full for this feature vector, we would have a prediction of theta naught for females, a prediction of theta one for males, and a prediction of theta two when the feature is missing. Note that theta two can take any value. If most of the missing features actually corresponded to males, theta two would probably have a value very similar to theta one. If most of the missing features corresponded to females, theta two would probably have a value similar to theta naught. Or, if people who fail to specify that feature are somehow unique or different, such that different predictions should be made for them, theta two is going to learn what predictions should be made in that instance. So in other words, we're actually modelling what should happen when a particular feature is missing: what kind of prediction is associated with a missing feature? Just to summarize this lecture, we discussed a few fairly simple schemes for dealing with missing data. We looked in detail at imputation versus modelling our missing data by incorporating missing-value features. So, on your own, you might look back at our PM2.5 regression example, which had various missing features, and see if you can adapt your code to handle them. Now, you're not going to be able to handle the prediction target, the PM2.5 value itself, being missing, but you should be able to handle missing or bad values in the other features, either using imputation schemes or the modelling approach that I've described. So, experiment with that, and see how the performance varies using different data imputation approaches.
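The missing-indicator model described above can be sketched as follows. The tiny dataset and the `encode` helper are a hypothetical illustration of mine: gender is one-hot encoded with a third "missing" dimension, and an ordinary least-squares fit then learns theta naught, theta one, and theta two, with theta two capturing the prediction to make when the feature is absent.

```python
import numpy as np

def encode(gender):
    """One-hot encode gender with an extra dimension for 'missing'."""
    cats = {"f": 0, "m": 1, None: 2}
    X = np.zeros((len(gender), 3))
    for i, g in enumerate(gender):
        X[i, cats[g]] = 1.0
    return X

# Invented heights; None marks a missing gender value.
gender = ["f", "f", "m", "m", None, None]
y = np.array([160.0, 164.0, 178.0, 182.0, 170.0, 172.0])

X = encode(gender)
theta, *_ = np.linalg.lstsq(X, y, rcond=None)
# theta[0]: prediction for females, theta[1]: for males,
# theta[2]: prediction learned for the feature-missing case
```

Because the three indicator columns are disjoint, least squares recovers the group means here, so theta two simply becomes the average outcome among people whose gender was unrecorded, exactly the behaviour the lecture describes.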