Hi, in this module, we're going to introduce the idea of control variables. Imagine that we have two variables, X and Y. Perhaps it's education and health, and we're interested in whether or not the relationship between them is causal. The traditional approach to dealing with the omitted variables that might create a spurious relationship between them is to measure them and then control for them in the analysis. So if the omitted variables that we're worried about are called W, we think that these are variables that might affect both X and Y, and we want to control for them in the analysis. Now, it's important to note that if an omitted variable only affects the outcome, or only affects the right-hand side variables, then it won't lead to a spurious relationship if we don't control for it. There may be other reasons to control for such omitted variables, but eliminating the possibility of a spurious relationship is not one of them. Imagine that we have some variable X, education, and some outcome variable Y, health, and we're interested in the relationship between them, but we're worried that there's some W out there, perhaps parental characteristics, that affects both of them and might lead to a spurious relationship. To control for W, we compare values of the outcome Y among subjects with different values of X, to see if there's a systematic relationship between X and Y, but we do so for subgroups that have identical values of W. So basically we can imagine dividing our sample into subgroups within which the values of W are identical, the parental characteristics are identical, and then within each of these we look at the relationship between X and Y. We call this holding a variable, in this case W, constant, because we're looking at the relationship between X and Y while the value of W is not changing. And so we examine whether the relationship between X and Y holds across different values of W.
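This subgrouping idea can be sketched in a few lines of Python. All of the numbers below are invented purely for illustration: W is a binary stand-in for parental characteristics, and the data are constructed so that X and Y look related overall, but the relationship vanishes once W is held constant.

```python
# Hypothetical illustration: a lurking variable W (say, parental background)
# drives both X (education) and Y (health), creating a spurious X-Y link.
# All numbers are made up for illustration.

data = [
    # (W, X, Y)
    (0, 1, 50), (0, 2, 50), (0, 1, 50), (0, 2, 50),
    (1, 3, 70), (1, 4, 70), (1, 3, 70), (1, 4, 70),
]

def mean(xs):
    return sum(xs) / len(xs)

def y_gap(rows):
    """Mean Y among above-average-X subjects minus mean Y among the rest."""
    xs = [x for _, x, _ in rows]
    cut = mean(xs)
    hi = [y for _, x, y in rows if x > cut]
    lo = [y for _, x, y in rows if x <= cut]
    return mean(hi) - mean(lo)

# Pooled over everyone: X looks strongly related to Y ...
overall_gap = y_gap(data)

# ... but holding W constant (comparing within subgroups), the link vanishes.
within_gaps = {w: y_gap([r for r in data if r[0] == w]) for w in (0, 1)}

print(overall_gap)   # positive: a spurious association
print(within_gaps)   # zero in each stratum
```

Here the overall gap is large, while within each value of W the gap is exactly zero, which is the signature of a purely spurious relationship.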
There are different approaches to controlling for these omitted variables W. If we only have a limited number of right-hand side variables, that is, we have an X variable and then perhaps at most one or two omitted variables W that we've measured and want to control for, then we can do our control by tabulation. Let's look at a simple example, where our Y variable is the crude death rate and our X variable is race, taking on only two values in this example, black and white, and we're looking at the United States in 2014. The overall crude death rate of blacks was actually lower than that of whites, by a substantial margin. This might give us the misleading impression that somehow, with respect to mortality, blacks are better off than whites in the United States. This is contrary to common sense given what we know about the socioeconomic differences between blacks and whites in the United States. It turns out that what we really need to do is control for age. There are big differences between the average ages of blacks and whites: whites in the United States tend to be older than blacks. And so, when we control for age and make comparisons between the death rates of blacks and whites within age groups, that is, among people of roughly similar age, at every age up to age 85 blacks actually have higher death rates. The only exception is age 85 and above, but for reasons we can't get into here, that actually reflects problems with the recording of death rates above age 85. So if we average the difference in the death rates between blacks and whites across the age groups, it turns out that there's an overall white advantage: controlling for age, white death rates are lower than black death rates. Now, this is a straightforward example because, as we said, we had a very limited number of right-hand side variables.
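A small hypothetical tabulation (with made-up rates and age shares, not the real 2014 figures) shows how this reversal can happen: a group can have the higher death rate within every age group and still the lower crude rate, simply because its population is younger.

```python
# Hypothetical age-specific death rates (per 100,000) and population shares,
# invented to mimic the pattern in the lecture: within every age group the
# black rate is higher, yet the crude (all-ages) rate comes out lower
# because the black population is younger on average.

rates = {              # deaths per 100,000 by age group
    "white": {"young": 200, "old": 3000},
    "black": {"young": 300, "old": 3500},
}
shares = {             # population share in each age group (sums to 1)
    "white": {"young": 0.4, "old": 0.6},
    "black": {"young": 0.7, "old": 0.3},
}

def crude_rate(group):
    """Crude death rate: age-specific rates weighted by the group's own age mix."""
    return sum(rates[group][a] * shares[group][a] for a in ("young", "old"))

crude = {g: crude_rate(g) for g in ("white", "black")}
print(crude)  # black crude rate is lower despite higher rates at every age

# Controlling for age: compare within each age group instead.
gaps = {a: rates["black"][a] - rates["white"][a] for a in ("young", "old")}
print(gaps)   # positive in every age group
```

The crude comparison and the age-controlled comparison point in opposite directions; the age-controlled one is the meaningful one here, because age confounds the race-mortality relationship.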
Just one X variable that we were especially interested in, race, which took on only two values, and age, which was easy to subdivide meaningfully into a small number of categories. Now let's recap what we just did, using a finer gradation of ages. In the United States in 2014, the overall death rate was 893 per 100,000, while black death rates were 697 per 100,000. Controlling for age, however, black death rates were higher. We can visualize this if we look at much more narrowly defined age groups and compare black and white death rates within each age group. The red bars are persistently higher than the blue bars within each age group, so blacks had higher death rates than whites once we control for age. So we just talked about tabulation as an approach for when there are only a few variables, and the right-hand side variables, the X and the W variables that we want to control for, take on only a limited number of values. In a more complex situation, where there are more variables and they tend to take on a wider range of values, we normally have to engage in some form of regression analysis. Regression analysis essentially looks at the average change in some outcome variable as a function of changes in a right-hand side variable. Here's a simple example, where we just have life expectancy as a function of per capita GDP in different countries. We can fit a curve which indicates that there's a systematic relationship, where every $10,000 increase in per capita GDP increases life expectancy by about three years. Now, here we haven't controlled for anything other than per capita GDP. Regression analysis essentially controls for these additional variables that we might be worried about by adding them to the right-hand side of the analysis. So let's think about some examples of control variables that might appear in a regression analysis.
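Before turning to those examples, the fitting step itself can be sketched by hand. The least-squares slope is the covariance of X and Y divided by the variance of X; the data points below are invented to match the lecture's rough rule of thumb of about three years of life expectancy per $10,000 of per capita GDP.

```python
# A minimal least-squares fit of life expectancy on per-capita GDP.
# The four data points are made up to reproduce the lecture's rule of
# thumb (about +3 years of life expectancy per $10,000 of per-capita GDP).

gdp  = [10_000, 20_000, 30_000, 40_000]   # per-capita GDP, dollars
life = [65.0, 68.0, 71.0, 74.0]           # life expectancy, years

n = len(gdp)
mx = sum(gdp) / n
my = sum(life) / n

# slope = cov(X, Y) / var(X); intercept makes the line pass through the means
slope = sum((x - mx) * (y - my) for x, y in zip(gdp, life)) / \
        sum((x - mx) ** 2 for x in gdp)
intercept = my - slope * mx

print(slope * 10_000)  # years of life expectancy gained per $10,000 of GDP
```

With these invented points the fitted slope works out to about three years per $10,000, but nothing else has been controlled for yet.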
So if we have some outcome, say income, measured in dollars, and we're really interested in the effects of education, perhaps measured in years, we might want to introduce into the regression control variables like age and work experience. Both of those will affect education, and they will affect income. The effect of age and work experience on income is probably fairly straightforward to understand. We have to think about the effect of age on education in societies where education has been changing rapidly in recent decades. You can have societies where, because education has been expanding, younger people are actually better educated than older people. In such situations, failing to control for age may lead to unusual results, for example education being inversely associated with income. This is because the best-educated people are actually younger people who, because they are relatively young, are not earning as much as older people who have more seniority. When we introduce a control for age, those effects go away. Now, work experience is normally something we want to control for. There is a straightforward effect of work experience on income, and it may be related to education: people who stayed in school longer may have less work experience than other people of the same age who left school earlier. If we're looking at an international comparison like the one in the previous slide, we could be thinking about life expectancy as a function of per capita GDP, and then we might want to control for things like health expenditures and education, to assess whether the effect of per capita GDP on life expectancy is direct, or is actually working through health expenditures and education. Or we might be thinking, at the individual level, about the relationship between divorce and education.
You'd probably want to control for income and wealth, which we know are associated with education, but on the other hand tend to be associated as well with divorce. Finally, if we're looking at, say, marriage, whether or not people get married as a function of whether or not their parents were ever divorced, then we'd also want to control for education and income, because education and income on the one hand may be associated with whether or not the parents divorced, and on the other hand may influence marriage chances, the likelihood of ever marrying. Now, there are some limitations that we have to think about if we're conducting a regression analysis and introducing control variables. Basically, if we want to introduce control variables, we have to be able to measure the trait or the characteristic that we're concerned about. If we're using secondary data from other sources, a dataset that we've downloaded from the Internet, then the relevant variables, the ones that we're worried about, may not be available to us. For example, if we've downloaded information on, say, health and education, but we're worried that parental characteristics may affect both of these, it may be that that's just not in the dataset that we downloaded, and we can't go back and measure it. There are other situations where, even if we're collecting data ourselves, there may be some important traits or characteristics that are simply difficult or downright impossible to measure. There may be intangible characteristics of an individual, a neighborhood, or their surroundings that are extremely difficult to measure in a meaningful way in an analysis. For example, you may be worried that somehow the neighborhood that somebody grows up in may influence both their education and their health, and yet we're not really sure about which features of the neighborhood really matter, in which case we can't go out and measure them.
We just think that neighborhood, in general, matters. So we run into those sorts of problems where we have intangible or difficult-to-measure characteristics. Overall, because of the possibility that there may be variables out there that we can't measure even if we can imagine them, then through a straightforward regression analysis, just by adding control variables, it is still difficult to completely rule out the possibility of a role for an omitted variable.
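To recap the mechanics of the regression approach from earlier in the module, here is a pure-Python sketch, with made-up numbers, of what adding a control variable to the right-hand side actually does. Income is generated as an exact function of education and age, and education and age are negatively related, as in the earlier example of expanding education. Regressing income on education alone then understates the education effect; adding age as a control recovers it. This is exactly the fix that is unavailable when the omitted variable is intangible or unrecorded.

```python
# Omitted-variable bias and its repair, with invented data.
# True model (by construction): income = 2000*education + 800*age.
# Education and age are negatively related, so leaving age out of the
# regression biases the education coefficient downward.

educ = [12, 12, 16, 16, 20, 20]            # years of schooling
age  = [60, 40, 55, 35, 50, 30]            # older people less educated here
income = [2000 * e + 800 * a for e, a in zip(educ, age)]

def ols(columns, y):
    """Least-squares coefficients for y on [intercept] + columns,
    via the normal equations (X'X)b = X'y, solved by Gauss-Jordan."""
    X = [[1.0] + [c[i] for c in columns] for i in range(len(y))]
    k = len(X[0])
    # Augmented system (X'X | X'y).
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(X, y))] for i in range(k)]
    for i in range(k):
        p = max(range(i, k), key=lambda r: abs(A[r][i]))  # partial pivot
        A[i], A[p] = A[p], A[i]
        for r in range(k):
            if r != i:
                f = A[r][i] / A[i][i]
                A[r] = [a - f * b for a, b in zip(A[r], A[i])]
    return [A[i][k] / A[i][i] for i in range(k)]

naive = ols([educ], income)            # income on education only
controlled = ols([educ, age], income)  # ... adding age as a control

print(naive[1])       # biased education coefficient (understated)
print(controlled[1])  # recovers the true 2000 once age is held constant
```

The naive coefficient is badly biased, while the controlled regression recovers the true effect, but only because age was measured and could be added to the right-hand side.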