Hey, welcome back. Today we're going to be talking about maximizing data source quality for designed data. Remember the big picture: we're talking about data source quality, and in the case of designed data, data source quality has to do with sampling. The error associated with that is sampling error. With designed data we often attempt to use probability samples, where each population element to which we had access has a known and nonzero probability of being selected for the survey. With these probability samples, we can compute unbiased estimates of population quantities; that is, across samples we produce estimates that, on average, are equal to the population quantities. That's the definition of an unbiased estimate. Once the data are collected, it's important that we reflect the actual sample design in estimates of data source quality. We talked about this when we discussed measuring data source quality; for example, we looked at the idea of standard errors.

Maximizing data source quality, then, happens when the sample is designed: optimal sample designs minimize sampling error for a fixed budget. One source for a full description of optimal sample designs is the textbook on sampling by Leslie Kish cited here, which describes optimal sample design techniques. We'll give a brief introduction to some optimal sample designs in a moment.

One example of a sample design that maximizes data source quality is stratification. Stratification generally reduces sampling variance by using information on the sampling frame. The procedure is to create strata, which should be groups that are homogeneous with respect to the survey variables we're interested in, and then to allocate the sample to the strata. So we allocate a proportion of the sample to each stratum, and those allocations do matter: they can differ across designs and across the different purposes that we have. The gains from stratification depend on the allocation. It's not just creating strata, it's also how we allocate sample to those strata.

Let's look at some examples. The first allocation we'll consider is called proportionate allocation; in that case we allocate sample to each stratum in proportion to its size in the population. Another kind of allocation is known as Neyman allocation, named after the person who first publicized this sort of allocation. The Neyman allocation is proportionate to the size of the stratum and the variability within the stratum. This type of allocation produces the minimum-variance estimate for a fixed sample size, but it does assume that costs are equal across the strata. If that's not true, then we might want to consider a slightly different allocation known as optimal allocation, which accounts for costs and produces a minimum-variance estimate for a fixed budget.

The choice of an allocation depends upon our goals. Another type of allocation, equal allocation, is useful when the goal is to compare strata. Neyman allocation is useful for minimizing the variance of an estimate, and optimal allocation is useful for minimizing the variance of an estimate when costs differ across the strata. The proportionate allocation we described earlier is a robust method when the survey has multiple goals. So depending upon the goal of our survey, we may choose a different type of allocation.
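To make these rules concrete, here is a minimal Python sketch, not from the lecture itself: the stratum proportions, within-stratum standard deviations, per-unit costs, and total sample size are invented purely for illustration. It computes the proportionate, Neyman, and cost-weighted optimal allocations for the same hypothetical three-stratum population.

```python
import math

# Hypothetical inputs (illustrative only).
W = [0.50, 0.30, 0.20]   # W_h: proportion of the population in stratum h
S = [10.0, 20.0, 40.0]   # S_h: standard deviation within stratum h
c = [1.0, 4.0, 9.0]      # c_h: cost per interview in stratum h
n = 1000                 # total sample size

# Proportionate allocation: n_h = n * W_h.
proportionate = [n * w for w in W]

# Neyman allocation: n_h proportional to W_h * S_h.
ws = [w * s for w, s in zip(W, S)]
neyman = [n * x / sum(ws) for x in ws]

# Optimal allocation: n_h proportional to W_h * S_h / sqrt(c_h).
# (Here scaled to the same fixed n for easy comparison; with a fixed
# budget the proportions are the same but n itself is set by the budget.)
wsc = [w * s / math.sqrt(ch) for w, s, ch in zip(W, S, c)]
optimal = [n * x / sum(wsc) for x in wsc]

for name, alloc in [("proportionate", proportionate),
                    ("Neyman", neyman),
                    ("optimal", optimal)]:
    print(name, [round(a) for a in alloc])
```

Notice how the Neyman allocation moves sample toward the highly variable third stratum, while the optimal allocation pulls sample back toward the cheaper first stratum; with equal costs the two rules coincide.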
Let's look a little more closely at the Neyman allocation. We first define the proportion of the population in the stratum using the notation you see on the slide: capital W sub h is the proportion of the population that's in stratum h. The second term we're going to need is capital S sub h, the population variability of the elements within a stratum. You can see the formula here; note that it's computed over the whole population, so it measures how much the cases within that stratum differ from each other.

We can often know W sub h, but in most cases it's difficult to know the variability of the variable of interest within each stratum, so we usually work with estimates of S sub h. Where do we get those? Possibly from published data, from surveys that have been done before, from a census, or from other published sources where we can identify them; failing that, we may have to use our experience to estimate them. Once we have estimates of W sub h and S sub h, the Neyman allocation uses the formula you see on the bottom bullet of this slide: for each stratum, the product of the population proportion and the within-stratum variability, divided by the total of those products across strata, is multiplied by the full sample size. In symbols, n_h = n * (W_h * S_h) / sum over h of (W_h * S_h).

For non-probability samples, on the other hand, maximizing data source quality can be difficult. For a non-probability sample, the sampling mechanism is unknown. How did elements come into the sample? What was the probability of coming into the sample? Those quantities are unknown, and they may be difficult to determine or even to predict. An inclusion probability, for example, the probability that any given unit entered the sample, may be hard to predict; and since we didn't set those probabilities, we have to predict them. If our predictions are incorrect, the result may be sampling bias. Clustering is also difficult to determine in non-probability samples.

Another strategy is to combine probability with non-probability samples, using the strength of the probability design, with its known features, to help us predict those features for the non-probability sample. The bias reduction from the probability sample, combined with the variance reduction from the non-probability sample, might help improve estimates. Here we give a citation that describes strategies for combining probability and non-probability samples.

Next, we'll take a look at an example of actually calculating a Neyman allocation, one of the allocations that we looked at today.
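As a preview of that calculation, here is a short, hedged Python sketch. The stratum proportions and standard deviations are hypothetical, and the variance function uses the standard formula for a stratified sample mean (ignoring finite population corrections); it simply checks the minimum-variance claim by comparing the Neyman allocation against the proportionate one at the same total sample size.

```python
def neyman_allocation(n, W, S):
    """Neyman rule: n_h = n * (W_h * S_h) / sum_h(W_h * S_h),
    where W[h] is the population proportion in stratum h and S[h] is
    the (usually estimated) standard deviation within stratum h."""
    ws = [w * s for w, s in zip(W, S)]
    return [n * x / sum(ws) for x in ws]

def stratified_mean_variance(n_h, W, S):
    # Var(stratified mean) = sum_h W_h^2 * S_h^2 / n_h,
    # ignoring the finite population correction.
    return sum(w**2 * s**2 / nh for w, s, nh in zip(W, S, n_h))

# Hypothetical inputs: W_h would come from the sampling frame; S_h from
# a prior survey, a census, or experience.
W = [0.6, 0.3, 0.1]
S = [5.0, 15.0, 30.0]
n = 500

neyman = neyman_allocation(n, W, S)
proportionate = [n * w for w in W]

print([round(x, 1) for x in neyman])                  # [142.9, 214.3, 142.9]
print(stratified_mean_variance(neyman, W, S))         # ~0.2205 (smaller)
print(stratified_mean_variance(proportionate, W, S))  # ~0.345
```

With these numbers, the Neyman rule assigns extra sample to the small but highly variable third stratum, and the printed variances confirm that, for the same total n, it produces a smaller variance than proportionate allocation.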