0:10

This video is about how to carry out an instrumental variable analysis in R.

So we'll walk through an example of an instrumental variable analysis in R,

using a real data set.

So the example we'll consider was this dataset that was published

a number of years ago that has to do with the proximity to a four year college and

the impact that might have on years of schooling, and then ultimately income.

And so this dataset's publicly available.

And so you could try it out on your own.

So in this study, the proposed instrument had to do with

how close the individuals in this study were to a four year college.

So in terms of where they resided, how close were they to a four year college?

So that's the proposed instrument, but the goal is to look at the relationship

between years of education and income.

So an extra year of education,

how much would you expect that to affect your income by?

The goal is to look at this relationship then between A,

which is the number of years of education, and Y, which is the income.

And the proposed instrument is whether you grew up near a four year college.

There are also covariates in this study that we can consider, such as number of years of

1:31

your parents' education, region of the country, age, race, and so on.

So the motivation is that more schooling is

typically associated with higher income, but there have been a lot of questions about

what the actual causal effect of more schooling is.

So it might be reasonable to think that more years of education increases income,

but by how much?

And it's a little bit difficult to

tease apart the actual causal effect from the sort of confounding kind of effect.

So we might expect that people who get more schooling than others might differ

in many ways beyond just the education itself.

So there's always this concern that there

might be unmeasured confounding in these kinds of studies.

So this particular study proposed using proximity to colleges as an instrumental

variable.

In particular, we'll look at living near a four year college

as the instrument, and the argument is that

that's a type of encouragement, so if you grew up close to a four year college

3:14

So we use that dataset, and in that dataset the instrumental variable

is called nearc4, so that just means near a four year college.

The outcome has to do with income or wages. In fact,

in the original paper they used log of wages, and so we'll do the same thing.

So that's lwage, log of wages.

And a lot of times with income data,

people log transform it because income data tends to be skewed.

And treatment here is the number of years of education so that's this variable.

3:52

So that's what we were calling A, so our treatment is education.

Outcome Y is lwage, and the instrumental variable is nearc4.

And so

anytime you do a data analysis you'll probably want to look at the data first.

And so one thing I'll do right away is just take a look at

what proportion of people got encouragement.

So that's this mean of the variable nearc4, and we can see that over here,

roughly 68% were encouraged, that is, lived near a four year college.

And I'm also going to look at histograms of the outcome variable, and

the education variable.

So these are two continuous variables.

So I'll just look at histograms of those.
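
The exploratory steps described here can be sketched in R roughly as follows. This is a minimal sketch on simulated data, not the real dataset: the data frame card_sim, the column name educ, and every simulated value are assumptions made for illustration. Only nearc4 and lwage are names mentioned in the talk.

```r
# Simulated stand-in for the real data (all values are made up for illustration).
set.seed(1)
n <- 3000
nearc4 <- rbinom(n, 1, 0.68)  # instrument: 1 = lived near a four year college
# Years of education (assumed column name), kept within a plausible 1 to 18 range.
educ <- pmin(pmax(round(12 + 0.8 * nearc4 + rnorm(n, sd = 2.5)), 1), 18)
lwage <- 4.5 + 0.1 * educ + rnorm(n, sd = 0.4)  # log of wages
card_sim <- data.frame(nearc4 = nearc4, educ = educ, lwage = lwage)

# Proportion of people who were encouraged (lived near a four year college).
prop_encouraged <- mean(card_sim$nearc4)
print(prop_encouraged)

# Histograms of the outcome and of the treatment.
hist(card_sim$lwage, main = "Log wages", xlab = "lwage")
hist(card_sim$educ, main = "Years of education", xlab = "educ")
```

Because the instrument is coded 0/1, its mean is the proportion encouraged.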

The main reason to do this is just so

you have a better feel for how the data look.

So this is a histogram of wage.

Well, log of wages, and you'd see that

that's relatively symmetric there with no obvious outliers.

And then here's the histogram for education.

And you'll see there's quite a bit of variability there.

This is number of years of education, and you'll see there's a big spike here,

which is at 12.

So a lot of people had exactly 12 years of education, or

in other words, finished high school.

And then you'll see beyond 12 there is either some college, or

these are people here who have probably finished a four year degree.

And so

now we have some sense of the variability in our treatment variable, if you will,

this number of years of education.

So we do see that there's a fair amount of variability in how much

education people had.

Next, we'll look at estimating the proportion of compliers here and

we're going to do this to estimate the strength of our instrumental variable.

And we could actually leave education as a continuous variable, but

for the purpose of estimating the proportion of compliers, and

sort of making this analogous to a randomized trial,

what I'm going to do here is just create this variable, which is education 12 here,

which is an indicator variable

that you have more than 12 years of education.

And so I'm imagining right now that treatment is, we're just comparing greater

than 12 years of education versus less than or equal to 12 years of education.

And I'm imagining that's what we're interested in.

You could leave treatment as a continuous variable and analyze it that way.

I'm mostly dichotomizing it just to illustrate the idea of compliers.

So what would complier mean here?

Well, a complier here would mean that, if you lived close to a four year college,

you would end up having more than 12 years of schooling,

whereas if you didn't, then you wouldn't, and that would make you a complier.

And so we can estimate the proportion of compliers by taking the mean

6:40

of the indicator of more than 12 years of education

among the sub-population who lived near a four year college, and

then subtracting the mean of this education variable among those who did not

live near a four year college, and we see that that proportion is .12.

So we are estimating that the proportion of compliers is .12, or

in other words 12%, and so in that case, you might argue that

the instrument is not extremely strong, but also not so weak that we're alarmed.

So it does seem like living near a four year college

does increase the chances that you'll have more than 12 years of education.
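
The complier calculation described above can be sketched like this. It is a minimal sketch on simulated data, so the numbers will not match the .12 from the real dataset. The names educ and educ12 and the simulation itself are assumptions for illustration. The simulation is built so that living near a college can only push education up, never down, which is the no defiers setting.

```r
# Simulated stand-in data (values are made up for illustration).
set.seed(2)
n <- 5000
nearc4 <- rbinom(n, 1, 0.68)                           # instrument (encouragement)
educ <- round(12 + 0.8 * nearc4 + rnorm(n, sd = 2.5))  # years of education (assumed name)

# Dichotomized treatment: indicator of more than 12 years of education.
educ12 <- as.numeric(educ > 12)

# Proportion treated among the encouraged and among the not encouraged.
p_near <- mean(educ12[nearc4 == 1])
p_far <- mean(educ12[nearc4 == 0])

# The difference estimates the proportion of compliers (assuming no defiers).
prop_compliers <- p_near - p_far
print(c(p_near = p_near, p_far = p_far, prop_compliers = prop_compliers))
```

The same number is the slope you would get from regressing educ12 on nearc4, which is why this quantity is a natural measure of instrument strength.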

9:19

as long as we make this no defiers assumption.

Right, so in this example, what does no defiers mean?

Well, a defier here would be somebody who does the opposite of what they're

encouraged to do.

So these are people who, if they lived near a four year college,

would not get more than 12 years of education.

But if they don't live near a four year college,

then they would get more than 12 years of education.

So you have to think about whether you believe

that assumption that there wouldn't be people like that.
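
For reference, the standard way to turn the complier proportion into an effect estimate under no defiers is the ratio of the intention-to-treat effect to the proportion of compliers. The sketch below shows that ratio on simulated data. It assumes this is the estimator being discussed, and the variable u, the true effect of 0.3, and all simulated values are made up for illustration.

```r
# Simulated data with an unmeasured confounder u (all values are made up).
set.seed(3)
n <- 50000
u <- rnorm(n)                 # unmeasured confounder
nearc4 <- rbinom(n, 1, 0.68)  # instrument, independent of u
# Encouragement can only push people into treatment, so there are no defiers.
educ12 <- as.numeric(u + rnorm(n) + 0.5 * nearc4 > 0.3)
lwage <- 6 + 0.3 * educ12 + 0.4 * u + rnorm(n, sd = 0.4)  # true effect set to 0.3

# Intention-to-treat effect: effect of encouragement on the outcome.
itt <- mean(lwage[nearc4 == 1]) - mean(lwage[nearc4 == 0])

# Proportion of compliers: effect of encouragement on treatment received.
prop_compliers <- mean(educ12[nearc4 == 1]) - mean(educ12[nearc4 == 0])

# Ratio estimator of the causal effect among compliers.
iv_ratio <- itt / prop_compliers

# Naive comparison of treated vs. untreated, which is confounded by u.
naive <- mean(lwage[educ12 == 1]) - mean(lwage[educ12 == 0])
print(c(itt = itt, prop_compliers = prop_compliers, iv_ratio = iv_ratio, naive = naive))
```

In this simulation the naive comparison overstates the effect because u raises both education and wages, while the ratio estimator does not use the confounded comparison at all.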

9:48

You can also estimate the same kind of effect using two stage least squares.

So I'll show you some R code to actually do it in two stages.

So recall how two stage least squares works.

Stage one, we regress treatment on the instrument.

So stage one, regress treatment on the instrument.

Well what is treatment here?

Treatment is education, and the instrument is living near a four year college.

So we use this linear model command, and

we can regress treatment on the instrument.

10:19

And if you do that, then you can get predicted values out of that.

So I'm going to ask it to give me predicted values.

So basically, for each person, we want the predicted probability of

receiving treatment given the value of their instrument.

But because the instrument can only take on two values here,

we'll only get two unique predicted values.
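
Doing the two stages by hand might look like the sketch below. It uses simulated data, and the data frame d, the column educ12_hat, and all simulated values are assumptions for illustration. The second stage shown here is the usual one, regressing the outcome on the stage one predicted values.

```r
# Simulated stand-in data (all values are made up for illustration).
set.seed(4)
n <- 20000
u <- rnorm(n)                 # unmeasured confounder
nearc4 <- rbinom(n, 1, 0.68)  # instrument
educ12 <- as.numeric(u + rnorm(n) + 0.5 * nearc4 > 0.3)   # treatment indicator
lwage <- 6 + 0.3 * educ12 + 0.4 * u + rnorm(n, sd = 0.4)  # outcome
d <- data.frame(nearc4, educ12, lwage)

# Stage one: regress treatment on the instrument and keep the predicted values.
s1 <- lm(educ12 ~ nearc4, data = d)
d$educ12_hat <- predict(s1)

# The instrument is binary, so there are only two distinct predicted values.
unique_preds <- unique(round(d$educ12_hat, 8))
print(unique_preds)

# Stage two: regress the outcome on the predicted treatment.
s2 <- lm(lwage ~ educ12_hat, data = d)
tsls_est <- unname(coef(s2)["educ12_hat"])
print(tsls_est)
# Note: the standard errors reported by summary(s2) are not valid for this
# estimate, because they ignore that educ12_hat was itself estimated.
```

With a single binary instrument and no covariates, this coefficient equals the intention-to-treat effect divided by the proportion of compliers.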

And what we'll see here is that people with encouragement have a predicted

probability of more than 12 years of education of .54, and

people who were not encouraged, people who lived farther away from a four year

college, their predicted probability of treatment, their predicted probability of

12:17

so this instrumental variable package.

So if you want to do that instead, I did two stage least squares

in the previous slide by sort of brute force in some sense.

I actually fit two separate regression models and put things together, but

you can actually just use this ivreg command here if you use ivpack.

So instead, I'll tell it to use an instrumental variable

regression, which means two stage least squares here.

And so I'm letting it know in this first part that my outcome is wages,

my treatment is education.

And then this part here is just telling it what the instrument is.

12:59

And a nice thing about the ivpack package is it will also give you

standard errors out of the model automatically.

Whereas previously, I didn't tell you how to get the standard errors,

you would've had to do some extra steps.

If you use this package, you'll just get standard errors out of it directly.

You can use this robust.se command and it will give you standard errors.
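
A one-call version might look like the sketch below. It assumes the AER package, which provides an ivreg function taking the outcome ~ treatment | instrument formula, rather than ivpack itself, so the robust.se step from the talk is not shown. The data are simulated and the names d and educ12_hat are made up for illustration. If AER is not installed, the sketch falls back to the by-hand estimate.

```r
# Simulated stand-in data (all values are made up for illustration).
set.seed(5)
n <- 20000
u <- rnorm(n)                 # unmeasured confounder
nearc4 <- rbinom(n, 1, 0.68)  # instrument
educ12 <- as.numeric(u + rnorm(n) + 0.5 * nearc4 > 0.3)   # treatment indicator
lwage <- 6 + 0.3 * educ12 + 0.4 * u + rnorm(n, sd = 0.4)  # outcome
d <- data.frame(nearc4, educ12, lwage)

# By-hand two stage least squares, kept as a reference point estimate.
d$educ12_hat <- fitted(lm(educ12 ~ nearc4, data = d))
manual_est <- unname(coef(lm(lwage ~ educ12_hat, data = d))["educ12_hat"])

if (requireNamespace("AER", quietly = TRUE)) {
  # Left of the bar: outcome ~ treatment. Right of the bar: the instrument.
  fit <- AER::ivreg(lwage ~ educ12 | nearc4, data = d)
  print(summary(fit))  # coefficient table with standard errors and p-values
  iv_est <- unname(coef(fit)["educ12"])
} else {
  # Assumption: without AER installed, fall back to the by-hand estimate.
  iv_est <- manual_est
}
print(c(iv_est = iv_est, manual_est = manual_est))
```

The point estimate from ivreg should agree with the by-hand version. What the function adds is standard errors that account for the first stage.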

13:20

And what you will notice is then the causal effect will be

the coefficient here of this education variable.

And you will see that the coefficient is exactly the same as I

had on the previous two slides.

It's about 1.28 but here you also get a standard error.

And you also will get a p-value, so you'll see that this is highly significant.

It looks like there's strong evidence of a causal effect,

assuming that the instrumental variable assumptions are met.

13:50

So in this particular example you might be thinking that the IV

assumptions don't seem plausible [INAUDIBLE]

because people who live near a four year

college might differ from people who don't in a lot of different ways.

For example, near a four year college, the housing in that area

might be more expensive, so these might be higher income families.

14:19

So we might want to control for some of those kinds of variables.

Demographic kinds of variables, number of years of parents' education.

But maybe once you control for a number of those variables, maybe given that,

then living near a four year college

does meet the instrumental variable assumptions.

So sometimes you might assume these IV assumptions are only true

conditional on covariates, among people with the same values of the covariates.

And so you can do that, you can still use ivreg,

and now you can just control for a bunch of other covariates.

So I'm not going to walk through what each of these means; they have unfortunate

names, very generic kinds of names.

But in the dataset itself, if you look at it, you can find out what each of

these variables means, but I just wanted to show you how you can actually do it.

So we have an outcome.

We have treatment.

And then we just have a bunch of variables that we want to control for.

15:14

And then in the first stage, we have to tell it what the instrument is, and

also in the first stage we'll control for all of the covariates.
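
Controlling for covariates in both stages might be sketched as below. The covariate names parent_educ and age are hypothetical stand-ins, not the names in the real dataset, and the data are simulated. The one-call version assumes the AER package's ivreg, where the covariates appear on both sides of the bar. Without AER, the sketch falls back to the by-hand estimate.

```r
# Simulated data where the instrument is only valid given covariates
# (all names other than nearc4 and lwage, and all values, are made up).
set.seed(6)
n <- 20000
u <- rnorm(n)                               # unmeasured confounder
parent_educ <- rnorm(n, mean = 12, sd = 3)  # hypothetical covariate
age <- sample(24:34, n, replace = TRUE)     # hypothetical covariate
# Families with more educated parents are more likely to live near a college.
nearc4 <- rbinom(n, 1, plogis(0.8 + 0.2 * (parent_educ - 12)))
educ12 <- as.numeric(u + rnorm(n) + 0.5 * nearc4 + 0.2 * (parent_educ - 12) > 0.3)
lwage <- 5 + 0.3 * educ12 + 0.03 * (parent_educ - 12) + 0.02 * age +
  0.4 * u + rnorm(n, sd = 0.4)
d <- data.frame(lwage, educ12, nearc4, parent_educ, age)

# Stage one: treatment on the instrument plus all covariates.
s1 <- lm(educ12 ~ nearc4 + parent_educ + age, data = d)
d$educ12_hat <- fitted(s1)

# Stage two: outcome on predicted treatment plus the same covariates.
s2 <- lm(lwage ~ educ12_hat + parent_educ + age, data = d)
manual_est <- unname(coef(s2)["educ12_hat"])

if (requireNamespace("AER", quietly = TRUE)) {
  # Covariates are listed on both sides of the bar; nearc4 is the instrument.
  fit <- AER::ivreg(lwage ~ educ12 + parent_educ + age |
                      nearc4 + parent_educ + age, data = d)
  iv_est <- unname(coef(fit)["educ12"])
} else {
  # Assumption: without AER installed, fall back to the by-hand estimate.
  iv_est <- manual_est
}
print(c(iv_est = iv_est, manual_est = manual_est))
```

Listing each covariate on both sides of the bar tells ivreg to treat it as its own instrument, so only the treatment variable is replaced by its first stage prediction.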

So if you do this, you'll get the same kind of result.

You want to look at the coefficient of the treatment variable but

now, you're just controlling for a bunch of other variables.

And so you'll see now, our coefficient's slightly different.

It's a 1.2, still highly significant.

And so we're still seeing pretty strong evidence of

the number of years of education leading to an increase in wages, so

here even after you control for all of these different variables.

And so this is how you could carry out a two stage least squares kind of analysis if you

had other covariates that you wanted to control for.