Hello and welcome, everybody, to Week 4 of this third course on maximizing data quality. This week we're going to talk about how to maximize the quality of our data analysis, which is the final step in the overall total data quality framework. Specifically, we're going to start by talking about how to maximize the quality of an analysis of designed data.

Let's revisit our total data quality framework here. We've talked about maximizing quality in terms of all the measurement dimensions: validity, data origin, and data processing. We've talked about how to maximize quality in terms of all the representation dimensions, including data access, data source, and missing data. Now, putting everything together, we can't forget about that last step. We've done all this prior processing, we've done all these prior adjustments, and we've made sure from a design perspective that everything is really up to par in terms of overall quality. But now we actually want to analyze the edited and selected data. We're at that final phase here, focusing on data analysis. Given that this course is about how to maximize quality, we want to make sure that in the data analysis phase we're also performing a high-quality analysis of the data, so that we don't negate any of the earlier work that was done to maximize overall quality. That's going to be our focus this week, and we're going to start with a discussion of designed data: how do we think about maximizing data analysis quality for designed data?

In studies that are using designed data, features of the study design often need to be accounted for in the data analysis. If we're talking about experiments, for example, we might need to account for experimental strata or the experimental factors, where we want to compare different groups in terms of experimental treatments or something like that. We need to account for those variables when we're conducting our analysis. If we're talking about survey data collection, we might be talking about the need to account for sampling weights, sampling strata based on the original sample design, sampling clusters based on the original design, or multiple imputation of item-missing data. These are all features of the study design that we would need to account for when we're ultimately performing the analysis. In general, these design features should be accounted for in an analysis when they're informative about the primary measures of interest. If these design features have relationships with the variables that we're ultimately interested in analyzing, we want to make sure that we're accounting for them when we do the overall analysis.

In surveys that are collected from probability samples in particular, where everybody in a population has a known probability of selection, there are generally three main design features to account for. The textbook by Heeringa and colleagues (2017) talks at length about analyses that account for these design features. First of all, we need to make sure that we're thinking about weights that reflect the probabilities of selection for the different sampled units. We need to account for those weights in our analyses in order to compute unbiased estimates with respect to the original sample design; we want to make sure that the estimates we're computing are unbiased with respect to that sample design. A failure to account for weights could lead to biased estimates of population features.
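To make that last point concrete, here is a minimal sketch in R (the software whose survey package comes up later in this lecture) using made-up data. The variable names, selection probabilities, and weight values are purely hypothetical; the point is only to show how ignoring unequal selection probabilities can bias a simple mean.

```r
# Minimal sketch with made-up data: a respondent-level data frame where 'wt'
# holds the sampling weight (the inverse of the hypothetical selection probability).
set.seed(42)
dat <- data.frame(
  income = c(rnorm(80, mean = 40000, sd = 5000),   # over-sampled group
             rnorm(20, mean = 70000, sd = 8000)),  # under-sampled group
  wt     = c(rep(1.25, 80),                        # selected with probability 0.8
             rep(5.00, 20))                        # selected with probability 0.2
)

# The unweighted mean ignores the unequal selection probabilities...
mean(dat$income)

# ...while the weighted estimate corrects for them by giving each respondent
# influence proportional to the number of population units they represent.
weighted.mean(dat$income, dat$wt)
```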
A second key design feature that we typically need to account for in probability samples is a set of codes in the dataset that represent sampling strata. These codes need to be accounted for in the analysis in order to reduce our standard errors. When we're computing estimates of population features, we want to attach measures of sampling variability to those estimates. Stratification at the sample design stage will typically increase the precision of our estimates, and we want to make sure that's reflected when we compute our standard errors. We need to account for those sampling strata in our analyses if they're applicable for a given sample design.

The third main feature, again if applicable for a given sample design, is a set of codes that represent sampling clusters: possibly geographic areas or other clusters of units that were randomly sampled as part of the sample design, where units within the same cluster will tend to have similar values on a variable of interest. When we have cluster sampling, that tends to increase our standard errors, or reduce the precision of our overall estimates. If cluster sampling was part of the design, we want to make sure that's reflected in the standard errors that we're computing.

These are the three key features of complex probability samples that we need to account for in our analyses when we're working with data like this. A failure to account for them reduces the quality of the analysis that we're performing. We want to make sure we're computing unbiased estimates with standard errors that reflect the design that was actually used.

When working with designed survey data that were collected from these so-called complex probability samples, do the following to make sure that your data analysis is of maximum quality. First, carefully review the dataset documentation, especially any analytic guidelines that are provided along with that dataset. Oftentimes you might download a dataset from a national survey program, and with that electronic dataset you'll also have documentation describing how to use the data. Make sure to review that documentation carefully before you get into the analysis, especially any subsections that talk about analytic guidelines, because following those guidelines will help to make sure that you're performing a high-quality analysis. It's always very important to review that dataset documentation.

Second key point: consider those sample design features when you're making inferences about your population of interest. There are many different software procedures that you can use to account for the design features we talked about on the previous slide: the weights, the sampling strata, and the sampling clusters. You can use, for example, the contributed survey package in the R software, the svy commands in the Stata software, or the survey analysis procedures in the SAS software. For those of you who might be using SPSS, there's the Complex Samples module. There is a variety of tools that allow you to account for complex sample design features; the key is that you're using these tools when you're analyzing data from these complex samples. You can find more information at the web link here on the slide, along with many examples of syntax to perform different analyses. This is the website for the Heeringa et al. textbook that was referenced on the previous slide.
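As one illustration of what declaring those design features looks like, here is a minimal sketch using the R survey package. The data frame, its column names (stratum, cluster, finalwt, age, bmi), and the simulated values are all hypothetical stand-ins for a real survey dataset and its documented design variables.

```r
# install.packages("survey")  # contributed R package for complex sample analysis
library(survey)

# Simulated stand-in for a complex sample: 4 strata, 5 clusters per stratum,
# 10 respondents per cluster, with a final sampling weight for each respondent.
set.seed(123)
svy_df <- data.frame(
  stratum = rep(1:4, each = 50),
  cluster = rep(1:20, each = 10),
  finalwt = runif(200, 50, 150),
  age     = rnorm(200, 45, 12),
  bmi     = rnorm(200, 27, 4)
)

# Declare the design: clusters nested within strata, with sampling weights.
des <- svydesign(ids = ~cluster, strata = ~stratum, weights = ~finalwt,
                 data = svy_df, nest = TRUE)

# Design-based estimate of a population mean, with a standard error that
# reflects the weights, stratification, and clustering.
svymean(~bmi, design = des)

# Design-based regression that respects the same design features.
summary(svyglm(bmi ~ age, design = des))
```

The same design information would be declared analogously with the svyset command in Stata or in the SAS survey procedures.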
Again, you can refer to that Heeringa et al. textbook for general guidelines. But remember from Course 2 that we have ways of measuring whether it's necessary to use weights when analyzing survey data. We talked at length about measuring the need for weights when performing different kinds of analyses; don't forget about those discussions from our course on measuring data quality.

Now, all of that pertains to probability samples, where units in the population can be assigned a known probability of being included in the sample. Many samples nowadays are actually non-probability samples. These could be volunteer or convenience samples where we don't really know, or have any control over, the probability that an individual is going to be included in a given sample. For these types of non-probability samples, there are approaches that have been developed in the recent literature that can be used to make population inferences about a particular parameter of interest. Valliant and Dever wrote a very nice book in 2018 that outlines these different approaches for making inferences from non-probability samples, and we're going to talk about those in a second here. The key distinction is that with probability samples we have a good statistical basis, following the procedures that I've been describing, for making inferences about population features. With non-probability samples we don't have that same statistical basis, again because we have no control over the probability of selection, so we want to follow these approaches that have been recently developed in the literature.

With non-probability samples, then, there are three general approaches that one can use to make population inferences. Again, we're dealing with a sample that could be a convenience sample, a snowball sample, a volunteer sample, something like that, where we had no control over who selects into the sample. The first general approach is known as quasi-randomization. In this approach, we combine the data from the non-probability sample with a reference probability sample: think of the dataset that you collected from the non-probability sample, and stack it with data from a reference probability sample that measured several common variables. The same variables, with the same kinds of values, were measured in both the reference probability sample, which you might get from a national survey or something like that, and your non-probability sample. You line up those common variables and put the two datasets together, so that you have a stacked dataset containing the same variables from both sources. Once we have that stacked dataset, we predict the probability of being included in the non-probability sample using a weighted logistic regression model. You use the weights from the probability sample, that key design feature we talked about earlier, and everybody in the non-probability portion of the dataset gets a weight of one. So you have the weights for the probability sample in a variable, and in that same variable everybody in the non-probability sample has a value of one, and you fit a logistic regression model predicting the probability of being in the non-probability sample.
Everybody in the non-probability sample gets a one on the dependent variable, everybody in the probability sample gets a zero, and that indicator is the dependent variable in the logistic regression model. We include the common variables measured in both the probability sample and the non-probability sample as predictors in that model. What the model yields is a predicted probability of being included in the non-probability sample. We would then invert those predicted probabilities for all the people in the non-probability sample, one divided by the predicted probability, and those inverted values are what we would use as weights in a design-based analysis. Treating them as if they were the sampling weights from our earlier discussion of probability samples, we would use these pseudo-weights when analyzing the data under this quasi-randomization approach and proceed with the analysis just like we would for a probability sample (a small sketch of this appears below).

A second possible approach is more model-based, where we would try to predict the variable of interest for the non-sampled cases in the population. This kind of approach requires having data on the non-sampled cases, all the cases that did not self-select into your non-probability sample. That's a bit of a trickier endeavor, because we need to find those data, or at least aggregate data, for the non-sampled cases. Based on a regression model fitted to data from the non-probability sample, we would predict the variable or variables of interest for the non-sampled cases in the population. It's like a missing data problem: we would fill in the values on the variables of interest for all the cases that weren't included in our non-probability sample, and then analyze that entire dataset as if it were the population. That's called super-population modeling.

A third general approach is referred to as doubly robust estimation. In that case, we combine the ideas of quasi-randomization and super-population modeling. If either the regression model making predictions for the non-sampled cases or the model predicting the probability of being included in the non-probability sample is correct, population inferences will still be accurate. That's the nice thing about the doubly robust approaches: you can get one of those two models wrong, either the model for the probability of being in the non-probability sample or the model predicting values for the cases not included in your sample, and your overall inferences will still be accurate.

Now, the issue in practice is that there's not a lot of software out there that implements these three approaches for non-probability samples, which means they usually have to be programmed manually. Again, that Valliant and Dever (2018) book is an excellent practical reference that walks people through how to conduct these kinds of analyses, particularly using the Stata software, but you can extend those ideas to other software packages that you might want to use.
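To give a sense of what programming the quasi-randomization approach manually might look like, here is a minimal sketch in R under simulated data. The data frames (prob_df, nps_df), their columns (age, female, y, wt), and all numeric values are hypothetical; a real application would involve more careful propensity modeling along the lines Valliant and Dever describe.

```r
library(survey)
set.seed(456)

# Hypothetical reference probability sample: common covariates plus a sampling
# weight; the outcome y is not measured here.
prob_df <- data.frame(age = rnorm(500, 45, 15), female = rbinom(500, 1, 0.5),
                      y = NA, wt = runif(500, 100, 300))

# Hypothetical non-probability (volunteer) sample: same covariates, outcome
# measured, and a weight of one for every case, per the lecture.
nps_df  <- data.frame(age = rnorm(300, 35, 10), female = rbinom(300, 1, 0.6),
                      y = rnorm(300, 50, 10), wt = 1)

# Stack the two samples and flag membership in the non-probability sample.
stacked <- rbind(transform(prob_df, in_nps = 0),
                 transform(nps_df,  in_nps = 1))

# Weighted logistic regression predicting inclusion in the non-probability
# sample from the common covariates (quasibinomial avoids warnings about
# non-integer weighted responses).
fit <- glm(in_nps ~ age + female, family = quasibinomial(),
           data = stacked, weights = wt)

# Invert the predicted inclusion probabilities for the non-probability cases
# to form pseudo-weights, then proceed with a design-based analysis as usual.
p_hat <- predict(fit, newdata = nps_df, type = "response")
nps_df$pseudo_wt <- 1 / p_hat

des_nps <- svydesign(ids = ~1, weights = ~pseudo_wt, data = nps_df)
svymean(~y, des_nps)
```

The super-population and doubly robust approaches would replace or combine this propensity model with an outcome regression model fitted to the non-probability sample; those also typically have to be programmed manually.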
When we calculate standard errors based on these non-probability approaches, variance estimation generally relies on a technique known as bootstrapping, or on other replication techniques, and again, Valliant and Dever talk about how to carry that out in practice (a brief sketch appears below). In general, though, there is much more software for design-based analysis of probability samples, like we talked about, where you account for weights, strata, and clusters, and you can visit the website mentioned earlier to see examples of syntax in different software packages.

So that's an overview. The key theme here is that whether we're analyzing data from a probability sample or a non-probability sample, we want to make sure that we're using correct techniques, ones that have been described, discussed, and developed in the literature, so that we maximize the quality of the analysis that we're performing. We've talked about these state-of-the-art techniques for conducting these analyses to make sure that we're accounting for design features in the overall analysis. Next up, we're going to see multiple case studies of analytic error; the point there is to show you what can go wrong when you don't perform a high-quality analysis of survey data and you fail to account for some of these design features. We're going to talk about what can go wrong, and then we'll turn to a discussion of how to maximize the quality of data analysis specifically for gathered data. Okay. Thank you.
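As a final illustration of the replication idea mentioned above, here is one way to attach bootstrap replicate weights to the hypothetical pseudo-weighted design from the earlier sketch using the R survey package; any svydesign object could be converted the same way. Note that a complete treatment would also reflect the estimation of the pseudo-weights themselves, as Valliant and Dever discuss.

```r
library(survey)

# Convert the hypothetical pseudo-weighted design 'des_nps' from the earlier
# sketch into a bootstrap replicate-weight design (500 replicates here).
des_boot <- as.svrepdesign(des_nps, type = "bootstrap", replicates = 500)

# Point estimates are unchanged; standard errors now come from the spread of
# the estimate across the bootstrap replicate weights.
svymean(~y, des_boot)
```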