In Module 4 we'll talk about imputing for missing items, which is almost always an issue in surveys. We'll look at reasons for imputation first, and throughout the module we'll talk about what sort of thinking you use to do the imputing and what software is available to actually do it. To frame the problem in general, there are at least two kinds of missingness. One is completely missing: you have no data at all on a case. Now, when does that occur? It can occur for at least two reasons. One is we didn't sample a unit in the first place, so it's completely missing. Or we sampled it and it didn't respond at all, which is also completely missing. And how do we handle that? What we do is assign weights to the sample cases, so we take the sample and project it up to the full universe. So we are indirectly doing imputation through this weighting. Now, software packages have different standards for how they code missing values, and I've listed some of them here. The default code in R is NA for a missing value. In SAS, .a through .z are used, just plain dot is a missing value, and ._ is a missing value as well. Stata is similar: .a through .z and just plain dot are missing values. Also, particular surveys may use certain codes to distinguish types of missingness. In some surveys it's important to record, essentially, a reason for why an item is missing, so you may see different codes being used on a single item. 99 is a popular code to denote missingness; sometimes you'll see -9 or -8. So one thing that you want to be sure of, when you get a data set from somebody else, is: what special coding are they using for missingness? You don't want to analyze a 99 or a -9 as if it's a real data point when really it's just the survey's code for the value not being there. All right, now how do we go about handling these cases? One approach is called complete case analysis. What that means is, if a case is missing on any variable, you just completely delete it; you treat it as not in the data set.
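To make the point about survey-specific missing codes concrete, here is a minimal sketch in Python with pandas. The column names and the sentinel codes 99 and -9 are made-up examples, not from any particular survey; the idea is just that you must declare the codes at read time so they become true missing values rather than real numbers.

```python
import io
import pandas as pd

# Hypothetical survey extract: 99 and -9 are this survey's codes for "missing".
raw = io.StringIO("""age,income
34,52000
99,48000
41,-9
28,61000
""")

# Naive read: the sentinel codes are treated as real numbers.
naive = pd.read_csv(raw)
print(naive["age"].mean())  # inflated by the spurious 99

# Correct read: declare the sentinel codes so pandas stores them as NaN.
raw.seek(0)
clean = pd.read_csv(raw, na_values=[99, -9])
print(clean["age"].mean())  # mean of the three genuinely observed ages
print(clean.isna().sum())   # one missing value in each column
```

The `na_values` argument to `pandas.read_csv` does exactly this job; analyzing the naive version would treat a respondent as 99 years old.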
Well, that seems extreme. Less extreme would be available case analysis. So, for example, if you're running a regression of y on a couple of x's, you just use the cases that are complete on those variables, regardless of whether they're complete on the other variables. That would allow you to use more cases than complete case analysis. But still, it seems bad; you're throwing away data on cases that are partially complete. Another way, which we'll talk about here, is to just fill in those blanks, those holes, by imputation. That way you get to use all your cases in every analysis, if you impute for every missing value, and it certainly builds up the sample size available for analysis. Now there are implications to that, of course: the imputed values are not real data. So you ought to do something that accounts for the fact that they're not real data. There are a number of problems with complete case analysis. If the units with missing data differ systematically from the completely observed cases, you can have biased estimates. Say men and women differ systematically on your y values and there are a lot more missing cases for men. If you just throw those out, then the distribution between men and women in your sample is not going to look like the one in the population, and that means that when you analyze the data set, even with the weights, you can have biased estimates. So we'd like to avoid that. Another problem with complete case analysis is that if you've got many variables included in some model that you're trying to fit, there may be very few complete cases. You'd be discarding a lot of data just for the sake of a simple analysis, so that's bad. Another thing to be aware of in complete case analysis is that you're not really ignoring those dropped cases: when you drop them out, there's an implied imputation there.
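The three strategies just described can be sketched side by side. This is a toy illustration with invented data, assuming a regression of y on x1 only; the point is just to count how many cases each strategy lets you keep.

```python
import numpy as np
import pandas as pd

# Hypothetical data set with holes scattered across three variables.
df = pd.DataFrame({
    "y":  [10.0, 12.0, np.nan, 15.0, 11.0, 14.0],
    "x1": [1.0, np.nan, 3.0, 4.0, 5.0, 6.0],
    "x2": [2.0, 3.0, 4.0, np.nan, 6.0, 7.0],
})

# Complete case analysis: drop any row with a hole anywhere.
complete = df.dropna()
print(len(complete))   # 3 rows survive

# Available case analysis for a regression of y on x1:
# keep rows complete on just the variables in this model.
available = df.dropna(subset=["y", "x1"])
print(len(available))  # 4 rows survive

# Naive imputation: fill each hole with the column mean.
# All 6 rows are usable again, but the filled-in values are not real data,
# so plain standard errors from this data set would be too small.
imputed = df.fillna(df.mean())
print(len(imputed))    # all 6 rows
```

The comment on the imputed version echoes the warning in the text: you must do something that accounts for the imputed values not being real data.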
So for many analyses, like estimating means and totals, what you're doing is implicitly imputing each missing value by the average of the complete cases. That may be a poor imputation, so we'd like to think of better ways of doing it. Now, it's good to go back to the missing data mechanisms that Rubin and Little have defined. The ones that we've seen earlier, in a previous video, are missing completely at random (MCAR), where every unit's got the same probability of appearing in the sample; you can apply that down at the item level, too, where every item has the same probability of being filled in. More realistic, probably, is missing at random, MAR. That means that after you account for some covariates, you may be able to make a sensible imputation for the missing cases. The worst is non-ignorable non-response. That means the probability of appearing or not appearing depends on covariates and, critically, on the variables you're trying to analyze, the y's. This is bad: we may have covariates for both the complete and the missing cases, but we don't observe the y's for the cases with missing data. So generally, MAR is the best that we hope for. Accounting for as many covariates as seems reasonable will, we hope, give us a way of imputing intelligently for those missing cases, just as that sort of MAR thinking gave us a way of adjusting weights for non-response that produces approximately unbiased estimates. So we'll fill in the details on this imputation in coming videos.
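The earlier men-and-women example can be simulated to show what MAR buys you. This is a sketch with made-up parameters (group means of 50 and 60, missingness rates of 10% and 60%): the complete-case mean is biased, but because the missingness depends only on an observed covariate, averaging the within-group complete-case means with the full-sample group proportions recovers the truth.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Covariate observed for everyone: group 0 ("women") and group 1 ("men").
group = rng.integers(0, 2, n)
# y depends on the group; group 1 has a higher mean.
y = 50 + 10 * group + rng.normal(0, 5, n)

# MAR missingness: group 1 is missing far more often,
# but only through the observed covariate, never through y itself.
p_missing = np.where(group == 1, 0.6, 0.1)
observed = rng.random(n) > p_missing

true_mean = y.mean()
# Complete-case mean: biased low, because the high-y group drops out more.
cc_mean = y[observed].mean()
# Covariate-adjusted mean: within-group complete-case means weighted by
# the full-sample group proportions (valid under MAR).
adj_mean = sum(
    y[observed & (group == g)].mean() * (group == g).mean()
    for g in (0, 1)
)

print(round(true_mean, 2), round(cc_mean, 2), round(adj_mean, 2))
```

This weighting by group proportions is the same logic as the weight adjustments for non-response mentioned above, just written out for a single covariate.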