Hi, in this module, we'll talk about matching approaches. In experiments, matching is sometimes used as part of randomized assignment to control and treatment. Pairs of subjects who are identical on the key variables used for the matching are randomly divided into control and treatment. So for example, if we have a group of subjects, we might pick out pairs of subjects that have similar ages, or are the same sex, or are similar on other characteristics, and then randomly assign one of them to control and the other to treatment. This reduces the probability that chance variation in the distribution of the observed variables could lead to differences between control and treatment. That is, chance variation in the observable variables that were the basis for the matching. So for example, if we matched our subjects based on age group, sex, and marital status, we would ensure that the control and treatment groups had identical distributions of age group, sex, and marital status. Then differences between the control and treatment groups in their composition on those variables could not drive any observed differences between the control and treatment outcomes. Control variables may also be introduced in the analysis of the results to achieve a similar effect, but that may not be as effective.
So, let's think about a study design with matching. Here we have people who come in different shapes and different colors: red and blue, square and circle. The idea of matching is that when we're dividing into control and treatment, we pick pairs of subjects who are identical on, in this case, their shape and their color, and then randomly divide the members of each pair into control and treatment. So here, we have two blue squares divided. And we continue doing this until we run out of subjects.
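The pairing-and-randomizing procedure just described can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name `matched_pair_assignment`, the dictionary layout of each subject, and the made-up list of shapes and colors are all hypothetical, and subjects are paired only when they are exactly identical on the matching variables.

```python
import random
from collections import Counter, defaultdict

def matched_pair_assignment(subjects, keys, seed=0):
    """Pair subjects identical on `keys`, then randomly split each pair."""
    rng = random.Random(seed)
    # Group subjects by their values on the matching variables.
    strata = defaultdict(list)
    for s in subjects:
        strata[tuple(s[k] for k in keys)].append(s)

    control, treatment, unmatched = [], [], []
    for group in strata.values():
        rng.shuffle(group)
        # Take subjects two at a time; an odd one out has no match.
        while len(group) >= 2:
            a, b = group.pop(), group.pop()
            # A coin flip decides which member of the pair is treated.
            if rng.random() < 0.5:
                a, b = b, a
            control.append(a)
            treatment.append(b)
        unmatched.extend(group)
    return control, treatment, unmatched

# Hypothetical subjects: shapes and colors as in the lecture's picture.
traits = ([("square", "blue")] * 4 + [("circle", "red")] * 2
          + [("circle", "blue")] * 3 + [("square", "red")] * 2)
subjects = [{"id": i, "shape": sh, "color": co}
            for i, (sh, co) in enumerate(traits)]

control, treatment, unmatched = matched_pair_assignment(
    subjects, ["shape", "color"])

def composition(group):
    return Counter((s["shape"], s["color"]) for s in group)

print(composition(control) == composition(treatment))
print(len(unmatched))
```

Because every pair contributes one member to each group, the two groups end up with the same composition on shape and color by construction. The third blue circle has no partner, so in this sketch it is simply left unassigned.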
The result is that we have control and treatment groups that are identical in terms of the numbers of red and blue, and the numbers of squares and circles. So any differences between them can't be the result of differences in the proportions of the subjects that are squares or circles, or red or blue. If we don't have matching and we just pick people at random, it's always possible that the control and treatment groups may differ in terms of their distribution of some of these observables. So, for example, they may have different numbers of reds and blues, and squares and circles. Now if these characteristics have nothing to do with the outcome of interest, then we're okay. But if they influence the outcome of interest, then we could end up with a problem for the interpretation of our results.
Now for observational studies, we can also conduct matching. There's one approach, propensity score matching, which seeks to identify treatment effects in observational data. Subjects are matched according to their propensity to experience the treatment. So for example, important studies of the effects of college education attempt to match people in terms of their likelihood of going to college based on their other characteristics, but who differ in terms of whether or not they actually went to college. So if we think of going to college as the treatment, and we take pairs of people with identical chances of going to college, at least based on the other observables, who differ only in terms of whether or not they actually went, that's like a difference between control and treatment.
So, the propensity that is used as the basis of the matching is predicted from the results of a regression of whether the subject experienced the treatment on the other variables. In the example I just gave, whether a person went to college is regressed on a wide variety of family background characteristics and other things that we normally expect to affect the chances of going on to college. Then outcomes are compared for subjects who had similar propensities, in this case similar chances of going to college, but who actually differ in terms of whether or not they experienced the treatment, that is, whether or not they actually did go to college. So among people who were equally likely to go to college based on their background characteristics, but who differed in terms of whether or not they went, we look at differences in their outcomes. For example, differences in their incomes or differences in other outcomes of interest. And then we claim that because these people were matched in terms of their chances of going to college, the differences in their outcomes are the result of the differences between them in terms of whether they actually went to college.
So for estimating the propensity, people typically take a variable Y, the outcome in the propensity regression, and think of it as a 0 or 1 variable according to whether or not the subject experiences the treatment. So Y here might be 0 or 1 according to whether somebody goes to college. And then X are the various variables that influence the chances of going to college. And then we assume that, once we account for X, Y is independent of unobserved variables. There are no W's lurking around that might also be influencing Y. This is called the ignorability assumption. And then observations are matched on the estimated propensities of Y based on X, and their outcomes are compared. We get the propensities, once we have run the regression, by predicting the probability of Y being 1 for every single observation. And then again, we match observations that have similar chances of experiencing Y. So here's an example where we've got some subjects who have different propensities of experiencing a treatment, perhaps different chances of going to college.
So as much as possible, we put them into pairs where they are similar in terms of their chances of going to college. Now we can't make the match exact, but roughly speaking, we now have four pairs of people who differ in terms of whether or not they actually went to college, but who have similar chances of having gone to college as predicted from their background characteristics.
Now suppose we're looking at the effects of college attendance on earnings. The basic idea is that we first look at the effects of background characteristics on the chances of going to college, to estimate each person's propensity to go to college as predicted by those background characteristics. And then we divide people according to their propensity to attend college, that is, the probability that they would go to college. And then we compare their earnings according to whether they attended college or didn't attend. In this example, based on an actual study that was done by Jennie Brand and Yu Xie, we compare the earnings of the people who attended college and the people who didn't. And in this particular example, which is similar to the published one, the effects of going to college were strongest for the people who were actually least likely to attend college. That is, the gap between the earnings of the people who attended college and the people who didn't was largest for the people who had the lowest chances of attending college.
So this is an example of an approach to matching, in this case called propensity score matching. It's sometimes used for observational data. Again, we try to match people in terms of their probability of experiencing the treatment of interest, in this case attending college, and then compare them on some outcome according to whether or not they actually did experience the treatment, in this case again going to college. It's becoming extremely popular, but there are numerous critiques.
One of the most important is whether or not the ignorability assumption mentioned earlier is really valid. That is, whether the observed characteristics that are used to predict obtaining the treatment, and therefore to calculate the propensity, really capture all of the differences of interest, so that unobserved variables are not affecting the outcome. That's one critique, and there are also critiques, very technical ones, about some of the issues related to the actual matching on the propensities.
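To make the steps of propensity score matching concrete, here is a minimal sketch in plain Python, under several assumptions that are not from the lecture. The data are simulated, with a single background variable and a true treatment effect fixed at 2.0. The propensity regression is taken to be a logistic regression fitted by simple gradient ascent. The matching is greedy one-to-one nearest-neighbor matching with a caliper, which is only one of several ways to match. All function and variable names are hypothetical, and this is not the procedure or the data used by Brand and Xie.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def propensity(w, x):
    """Predicted probability of treatment for covariates x."""
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))

def fit_logistic(X, t, steps=500, lr=0.5):
    """Fit P(t = 1 | x) by gradient ascent on the mean log-likelihood."""
    w = [0.0] * (len(X[0]) + 1)  # intercept plus one weight per covariate
    for _ in range(steps):
        grad = [0.0] * len(w)
        for x, ti in zip(X, t):
            err = ti - propensity(w, x)
            grad[0] += err
            for j, xj in enumerate(x):
                grad[j + 1] += err * xj
        w = [wj + lr * g / len(X) for wj, g in zip(w, grad)]
    return w

def match_pairs(scores, t, caliper=0.05):
    """Greedy 1-to-1 nearest-neighbor matching, without replacement."""
    controls = [i for i, ti in enumerate(t) if ti == 0]
    pairs = []
    for i in [i for i, ti in enumerate(t) if ti == 1]:
        if not controls:
            break
        j = min(controls, key=lambda c: abs(scores[c] - scores[i]))
        # Keep the pair only if the two propensities are close enough.
        if abs(scores[j] - scores[i]) <= caliper:
            pairs.append((i, j))
            controls.remove(j)
    return pairs

def mean(values):
    return sum(values) / len(values)

# Simulated data (an assumption): x raises both the chance of treatment
# and the outcome, so a raw comparison of the two groups is confounded.
rng = random.Random(0)
n = 400
X = [[rng.gauss(0, 1)] for _ in range(n)]
t = [1 if rng.random() < sigmoid(x[0]) else 0 for x in X]
y = [2.0 * ti + x[0] + rng.gauss(0, 0.5) for x, ti in zip(X, t)]

w = fit_logistic(X, t)                      # step 1: propensity regression
scores = [propensity(w, x) for x in X]      # step 2: predicted propensities
pairs = match_pairs(scores, t)              # step 3: match on propensity

naive = (mean([y[i] for i in range(n) if t[i] == 1])
         - mean([y[i] for i in range(n) if t[i] == 0]))
matched = mean([y[i] - y[j] for i, j in pairs])  # step 4: compare outcomes
print("naive difference:", round(naive, 2))
print("matched difference:", round(matched, 2))
```

In this simulation the only confounder is observed, so ignorability holds by construction and the matched difference should land closer to the true effect than the naive one. If an unobserved variable also drove both treatment and outcome, the same code would run without complaint and still give a biased answer, which is exactly the critique above.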