This video is about how to carry out an instrumental variable analysis in R. We'll walk through an example using a real dataset. The example we'll consider is a dataset, published a number of years ago, that has to do with proximity to a four-year college and the impact that might have on years of schooling, and then ultimately on income. This dataset is publicly available, so you could try it out on your own. In this study, the proposed instrument had to do with how close the individuals in the study were to a four-year college. So in terms of where they resided, how close were they to a four-year college? That's the proposed instrument, but the goal is to look at the effect of years of education on income. So an extra year of education, how much would you expect that to affect your income? The goal is to look at this relationship between A, which is the number of years of education, and Y, which is income. And the proposed instrument is whether you grew up near a four-year college. There are also covariates in this study that we can consider, such as your parents' number of years of education, region of the country, age, race, and so on. The motivation is that more schooling is typically associated with higher income, but there have been a lot of questions about what the actual causal effect of more schooling is. It might be reasonable to think that more years of education increases income, but by how much? And it's a little bit difficult to tease apart the actual causal effect from confounding. We might expect that people who get more schooling than others differ in many ways beyond just the education itself. So there's always this concern that there might be unmeasured confounding in these kinds of studies. So this particular study proposed using proximity to colleges as an instrumental variable.
In particular, we'll look at living near a four-year college as the instrument. The argument is that that's a type of encouragement: if you grew up close to a four-year college, that's basically encouraging you to get more schooling, and it also arguably might not be related directly to your income, especially after you control for some other kinds of variables, such as demographic variables, number of years of education for your parents, and so on. So now we'll look at some R code. For the instrumental variable analysis, we'll use the R package ivpack, which stands for instrumental variable package. Within that package, there's the dataset that I just described, which is card.data. So we'll use that dataset, and in it the instrumental variable is called nearc4, which just means near a four-year college. The outcome has to do with income or wages. In fact, in the original paper they used log of wages, and so we'll do the same thing: lwage, log of wages. A lot of times with income data, people log transform it because income data tends to be skewed. And treatment here is the number of years of education, so that's this variable. That's what we were calling A. So our treatment is education, the outcome Y is lwage, and the instrumental variable is nearc4. Anytime you do a data analysis, you'll probably want to look at the data first. So one thing I'll do right away is take a look at what proportion of people got encouragement. That's the mean of the variable nearc4, and we can see over here that roughly 68% were encouraged, meaning they lived near a four-year college. I'm also going to look at histograms of the outcome variable and the education variable. These are two continuous variables, so I'll just look at histograms of those. The main reason to do this is just so you have a better feel for what the data look like.
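As a rough sketch of this first look at the data, the R code below computes the proportion encouraged and draws the two histograms. It is not the code from the video: so that it runs without any packages, it builds a small simulated stand-in data frame, and the name `dat`, the simulated values, and the column name `educ` for years of education are all assumptions. Only `nearc4` and `lwage` are names given in the video. With ivpack installed, one would presumably load the real data instead, as noted in the comments.

```r
# Simulated stand-in for card.data (hypothetical values, NOT the real data).
# With ivpack installed, one would instead load the real dataset, e.g.:
#   library(ivpack); data(card.data); dat <- card.data
set.seed(1)
n      <- 3000
nearc4 <- rbinom(n, 1, 0.68)                         # instrument: near a four-year college
educ   <- round(12 + nearc4 + rnorm(n, 0, 2.5))      # treatment: years of education (assumed name)
lwage  <- 5.5 + 0.07 * educ + rnorm(n, 0, 0.4)       # outcome: log of wages
dat    <- data.frame(nearc4, educ, lwage)

# Proportion of people who were "encouraged" (lived near a four-year college)
prop_encouraged <- mean(dat$nearc4)
print(prop_encouraged)

# Histograms of the outcome and the treatment, side by side
op <- par(mfrow = c(1, 2))
hist(dat$lwage, main = "Log of wages", xlab = "lwage")
hist(dat$educ, main = "Years of education", xlab = "educ")
par(op)  # restore the previous plotting layout
```

Because the data here are simulated, the printed proportion and the shapes of the histograms only illustrate the steps; they are not the results from the study.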
So this is a histogram of wages. Well, log of wages, and you'd see that it's relatively symmetric, with no obvious outliers. And then here's the histogram for education. This is number of years of education, and you'll see there's a big spike here, which is at 12. So a lot of people had exactly 12 years of education, or in other words finished high school. And then you'll see that beyond 12 there is either some college, or these are people here who have probably finished a four-year degree. So now we have some sense of the variability in our treatment variable, if you will, this number of years of education. We do see that there's a fair amount of variability in how much education people had. Next, we'll look at estimating the proportion of compliers, and we're going to do this to assess the strength of our instrumental variable. We could actually leave education as a continuous variable, but for the purpose of estimating the proportion of compliers, and making this analogous to a randomized trial, what I'm going to do here is create this variable, which is education 12 here, an indicator that you have more than 12 years of education. So I'm imagining right now that we're just comparing greater than 12 years of education versus less than or equal to 12 years of education, and that that's the treatment we're interested in. You could leave treatment as a continuous variable and analyze it that way. I'm mostly dichotomizing it to illustrate the idea of compliers. So what would complier mean here? Well, a complier here would be someone who, if they lived close to a four-year college, would end up having more than 12 years of schooling, whereas if they didn't, then they wouldn't.
And so we can estimate the proportion of compliers by taking the mean of the indicator of more than 12 years of education among the sub-population who lived near a four-year college, and then subtracting the mean of this education variable among those who did not live near a four-year college, and we see that that proportion is 0.12. So we are estimating that the proportion of compliers is 0.12, or in other words 12%. In that case, you might argue that the instrument is not extremely strong, but also not so weak that we're alarmed. It does seem like living near a four-year college increases the chances that you'll have more than 12 years of education. So now we're going to look at estimating both the intention-to-treat effect and the complier average causal effect. We've already estimated the proportion of compliers, which is this 0.12. But note that that's also the causal effect of encouragement on treatment received. So that is a type of causal effect, if you believe this is an instrument. Next, we can estimate the intention-to-treat effect, and I'm just calling that variable ITT. Remember, the intention-to-treat effect has to do with the causal effect of encouragement, so that would be the causal effect of, in this case, living near a four-year college on wages. So we're going to take the mean of log of wages among people who were encouraged, that is, people who lived near a four-year college, minus the mean of log of wages for people who did not live near a four-year college, and we get a value of about 0.16. So log of wages tended to be higher among people who lived near a four-year college. And now we can use this to estimate the complier average causal effect. That, if you recall, is just the intention-to-treat effect divided by the proportion of compliers. So we just take about 0.16 divided by 0.12, and, working from the unrounded values, we get about 1.28 as the estimated causal effect among compliers.
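The three quantities just described (proportion of compliers, intention-to-treat effect, and their ratio) can be sketched in a few lines of R. This is a minimal illustration on simulated stand-in data, not the video's code: `dat`, `educ`, `educ12`, `propcomp`, `itt`, and `cace` are assumed names, and the simulated numbers will not match the 0.12, 0.16, and 1.28 reported for the real data.

```r
# Simulated stand-in for card.data (hypothetical values, NOT the real data).
set.seed(1)
n      <- 3000
nearc4 <- rbinom(n, 1, 0.68)                     # instrument (encouragement)
educ   <- round(12 + nearc4 + rnorm(n, 0, 2.5))  # years of education (assumed name)
lwage  <- 5.5 + 0.07 * educ + rnorm(n, 0, 0.4)   # log of wages
dat    <- data.frame(nearc4, educ, lwage)

# Dichotomized treatment: more than 12 years of education
educ12 <- as.numeric(dat$educ > 12)

# Proportion of compliers: difference in treatment rates by encouragement
propcomp <- mean(educ12[dat$nearc4 == 1]) - mean(educ12[dat$nearc4 == 0])

# Intention-to-treat effect: difference in mean log wages by encouragement
itt <- mean(dat$lwage[dat$nearc4 == 1]) - mean(dat$lwage[dat$nearc4 == 0])

# Complier average causal effect: ITT divided by proportion of compliers
cace <- itt / propcomp

print(c(propcomp = propcomp, itt = itt, cace = cace))
```

Since the proportion of compliers is between 0 and 1, dividing by it makes the complier average causal effect larger in magnitude than the intention-to-treat effect.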
So that's the impact of education on wages among compliers, that is, among people who would get more than 12 years of education if they lived near a four-year college, but wouldn't get more than 12 years of education if they didn't live near a four-year college. And you'll notice that this number, this 1.28, is larger than the intention-to-treat effect, which is what we would expect as long as we make this no defiers assumption. So in this example, what does no defiers mean? Well, a defier here would be somebody who does the opposite of what they're encouraged to do. These are people who, if they lived near a four-year college, would not get more than 12 years of education, but if they didn't live near a four-year college, would get more than 12 years of education. So you have to think about whether you believe that assumption, that there wouldn't be people like that. You can also estimate the same kind of effect using two-stage least squares. So I'll show you some R code to actually do it in two stages. So recall, with two-stage least squares, in stage one we regress treatment on the instrument. Well, what is treatment here? Treatment is education, here the indicator of more than 12 years, and the instrument is living near a four-year college. So we use this linear model command, and we can regress treatment on the instrument. If you do that, then you can get predicted values out of it. So I'm going to ask it to give me predicted values. Basically, for each person, we want the predicted probability of receiving treatment given the value of their instrument. But because the instrument can only take on two values here, we'll only get two unique predicted values.
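A sketch of doing two-stage least squares by hand with base R's `lm` might look like the following. Stage two, which regresses the outcome on the predicted treatment from stage one, is included so the block is complete. As before, this uses simulated stand-in data rather than card.data, and the names `dat`, `educ`, `educ12`, `s1`, `s2`, `predtx`, `tsls`, and `wald` are assumptions for illustration.

```r
# Simulated stand-in for card.data (hypothetical values, NOT the real data).
set.seed(1)
n      <- 3000
nearc4 <- rbinom(n, 1, 0.68)                     # instrument
educ   <- round(12 + nearc4 + rnorm(n, 0, 2.5))  # years of education (assumed name)
lwage  <- 5.5 + 0.07 * educ + rnorm(n, 0, 0.4)   # log of wages
dat    <- data.frame(nearc4, educ, lwage)
dat$educ12 <- as.numeric(dat$educ > 12)          # dichotomized treatment

# Stage 1: regress treatment on the instrument, keep the predicted values
s1 <- lm(educ12 ~ nearc4, data = dat)
dat$predtx <- predict(s1)

# A binary instrument gives only two distinct predicted values
print(table(round(dat$predtx, 3)))

# Stage 2: regress the outcome on the PREDICTED treatment, not treatment itself
s2   <- lm(lwage ~ predtx, data = dat)
tsls <- unname(coef(s2)["predtx"])

# For comparison: ITT divided by proportion of compliers
itt      <- mean(dat$lwage[dat$nearc4 == 1]) - mean(dat$lwage[dat$nearc4 == 0])
propcomp <- mean(dat$educ12[dat$nearc4 == 1]) - mean(dat$educ12[dat$nearc4 == 0])
wald     <- itt / propcomp

print(c(tsls = tsls, wald = wald))
```

With a binary instrument and no covariates, the two numbers printed at the end agree up to floating-point error. The standard errors that `summary(s2)` would report are not the right ones for the causal effect, because they ignore that the predicted treatment was itself estimated.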
And what we'll see here is that people with encouragement have a predicted probability of more than 12 years of schooling of 0.54, and people who were not encouraged, people who lived farther away from a four-year college, have a predicted probability of treatment, of more than 12 years of schooling, of 0.42. And these here are the counts of the number of people in each group. So those are the predicted values, and what we'll do then in stage two is regress Y on the predicted value of treatment, not on treatment itself. So we regress Y on the predicted value of A, which is this predicted value of treatment that we got from stage one, pred treatment, and we put that there. So from stage one we get a predicted value, and we use that in the stage two model. And in the stage two model, we just use our outcome. So you do these two stages, and then the coefficient of the predicted value of treatment is the causal effect here. That's 1.28, and you'll notice that that is exactly the same as what we had on the earlier slide, so these are two ways of getting the same answer in this case. So for this simple example where we have no additional covariates, you can use the simple intention-to-treat effect divided by the proportion of compliers, or you can do it using two-stage least squares, and you'd get the same answer. Now, we can also do two-stage least squares directly using ivpack, the instrumental variable package. On the previous slide I did two-stage least squares by brute force in some sense: I actually fit two separate regression models and put things together. But you can just use this ivreg command here if you use ivpack. So instead, I'll tell it to use an instrumental variable regression, which means two-stage least squares here. And I'm letting it know in this first part that my outcome is wages and my treatment is education.
And then this part here is just telling it what the instrument is. So in this case it's living near a four-year college. A nice thing about ivpack is that it will also give you standard errors out of the model. Previously, I didn't tell you how to get the standard errors, and you would've had to do some extra steps. If you use this package, you can use this robust.se command and it will give you standard errors. What you will notice is that the causal effect is the coefficient here of this education variable. And you will see that the coefficient is exactly the same as I had on the previous two slides. It's about 1.28, but here you also get a standard error. You also get a p-value, and you'll see that this is highly significant. It looks like there's strong evidence of a causal effect, assuming that the instrumental variable assumptions are met. In this particular example, you might be thinking that the IV assumptions don't seem plausible [INAUDIBLE] because people who live near a four-year college might differ from people who don't in a lot of different ways. For example, housing near a four-year college might be more expensive, so these might be higher-income families. So we might want to control for some of those kinds of variables: demographic kinds of variables, number of years of parents' education. Maybe once you control for a number of those variables, living near a four-year college does meet the instrumental variable assumptions. So sometimes you might assume these IV assumptions are only true conditional on covariates, that is, among people with the same values of the covariates. And you can do that: you can still use ivreg, and now you just control for a bunch of other covariates. I'm not going to walk through what each of these means. They have unfortunate names, very generic kinds of names.
But in the dataset itself, if you look at it, you can find out what each of these variables means. I just wanted to show you how you can actually do it. So we have an outcome. We have treatment. And then we just have a bunch of variables that we want to control for. And then for the first stage, we have to tell it what the instrument is, and in the first stage we'll also control for all of the covariates. If you do this, you'll get the same kind of result. You want to look at the coefficient of the treatment variable, but now you're controlling for a bunch of other variables. And you'll see that now our coefficient is slightly different. It's 1.2, still highly significant. So we're still seeing pretty strong evidence of education leading to an increase in wages, even after you control for all of these different variables. And so this is how you could carry out a two-stage least squares analysis if you had other covariates that you wanted to control for.
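To illustrate the idea of controlling for covariates in both stages, here is a hedged sketch on simulated stand-in data. The single covariate `x1` is hypothetical and stands in for the dataset's actual covariates, which are not named here; `dat`, `educ`, `educ12`, `predtx`, and `tsls_cov` are likewise assumed names. The point estimate is computed by hand with `lm`, and, only if the AER package happens to be installed, it is compared against `AER::ivreg` using the `outcome ~ treatment + covariates | instrument + covariates` formula form. This is an assumption about the reader's setup, not necessarily the exact call from the video.

```r
# Simulated stand-in for card.data (hypothetical values, NOT the real data).
set.seed(1)
n      <- 3000
x1     <- rnorm(n)                                   # hypothetical covariate
nearc4 <- rbinom(n, 1, plogis(0.75 + 0.5 * x1))      # instrument, related to x1
educ   <- round(12 + nearc4 + 0.5 * x1 + rnorm(n, 0, 2.5))
lwage  <- 5.5 + 0.07 * educ + 0.1 * x1 + rnorm(n, 0, 0.4)
dat    <- data.frame(nearc4, educ, lwage, x1)
dat$educ12 <- as.numeric(dat$educ > 12)              # dichotomized treatment

# Stage 1: treatment on instrument AND covariates
s1 <- lm(educ12 ~ nearc4 + x1, data = dat)
dat$predtx <- fitted(s1)

# Stage 2: outcome on predicted treatment AND the same covariates
s2 <- lm(lwage ~ predtx + x1, data = dat)
tsls_cov <- unname(coef(s2)["predtx"])
print(tsls_cov)

# Optional cross-check, run only if the AER package is available.
# (With ivpack loaded, the video applies robust.se() to the fitted ivreg
# model to get standard errors.)
if (requireNamespace("AER", quietly = TRUE)) {
  ivmodel <- AER::ivreg(lwage ~ educ12 + x1 | nearc4 + x1, data = dat)
  print(summary(ivmodel))
  print(all.equal(unname(coef(ivmodel)["educ12"]), tsls_cov, tolerance = 1e-6))
}
```

The covariates appear on both sides of the `|` because they are controlled for in both stages; only the instrument is excluded from the outcome part. The hand-computed version gives the point estimate only, so for standard errors a dedicated routine is the safer route.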