5 Matching and Subclassification
5.1 Subclassification
One of the main things I wanted to cover in the chapter on directed acyclic graphical models was the idea of the backdoor criterion. Specifically, insofar as there exists a conditioning strategy that will satisfy the backdoor criterion, you can use that strategy to identify some causal effect. We now discuss three different kinds of conditioning strategies. They are subclassification, exact matching, and approximate matching.1
Subclassification is a method of satisfying the backdoor criterion by weighting differences in means by strata-specific weights. These strata-specific weights will, in turn, adjust the differences in means so that their distribution by strata is the same as that of the counterfactual’s strata. This method implicitly achieves distributional balance between the treatment and control in terms of that known, observable confounder. This method was created by statisticians like Cochran (1968), who tried to analyze the causal effect of smoking on lung cancer, and while the methods today have moved beyond it, we include it because some of the techniques implicit in subclassification are present throughout the rest of the book.
One of the concepts threaded through this chapter is the conditional independence assumption, or CIA. Sometimes we know that randomization occurred only conditional on some observable characteristics. For instance, in Krueger (1999), Tennessee randomly assigned kindergarten students and their teachers to small classrooms, large classrooms, and large classrooms with an aide. But the state did this conditionally—specifically, schools were chosen, and then students were randomized. Krueger therefore estimated regression models that included a school fixed effect because he knew that the treatment assignment was only conditionally random.
This assumption is written as

$$\big(Y^1, Y^0\big) \perp\!\!\!\perp D \mid X$$

Let me link together some concepts. First, insofar as CIA is credible, CIA means you have found a conditioning strategy that satisfies the backdoor criterion. Second, when treatment assignment is conditional on observable variables, we have a situation of selection on observables. The variable $X$ is the set of observable covariates that, once conditioned on, makes treatment assignment as good as random.
5.1.1 Some background
A major public health problem of the mid- to late twentieth century was rising lung cancer. For instance, the mortality rate from cancer of the lungs in males reached 80–100 per 100,000 by 1980 in Canada, England, and Wales. From 1860 to 1950, the incidence of lung cancer found in cadavers during autopsy grew from 0% to as high as 7%. The rate of lung cancer incidence appeared to be increasing.
Studies began emerging that suggested smoking was the cause since it was so highly correlated with incidence of lung cancer. For instance, studies found that the relationship between daily smoking and lung cancer in males was monotonically increasing in the number of cigarettes a male smoked per day. But some statisticians believed that scientists couldn’t draw a causal conclusion because it was possible that smoking was not independent of potential health outcomes. Specifically, perhaps the people who smoked cigarettes differed from non-smokers in ways that were directly related to the incidence of lung cancer. After all, no one is flipping coins when deciding to smoke.
Thinking about the simple difference in means decomposition from earlier, we know that contrasting the incidence of lung cancer between smokers and non-smokers will be biased in observational data if the independence assumption does not hold. And because smoking is endogenous—that is, people choose to smoke—it’s entirely possible that smokers differed from the non-smokers in ways that were directly related to the incidence of lung cancer.
Criticisms at the time came from such prominent statisticians as Joseph Berkson, Jerzy Neyman, and Ronald Fisher. They made several compelling arguments. First, they suggested that the correlation was spurious due to a non-random selection of subjects. Functional form complaints were also common. This had to do with people’s use of risk ratios and odds ratios. The association, they argued, was sensitive to those kinds of functional form choices, which is a fair criticism. The arguments were really not so different from the kinds of arguments you might see today when people are skeptical of a statistical association found in some observational data set.
Probably most damning, though, was the hypothesis that there existed an unobservable genetic element that both caused people to smoke and independently caused people to develop lung cancer. This confounder meant that smokers and non-smokers differed from one another in ways that were directly related to their potential outcomes, and thus independence did not hold. And there was plenty of evidence that the two groups were different. For instance, smokers were more extroverted than non-smokers, and they also differed in age, income, education, and so on.
The arguments against the smoking cause mounted. Other criticisms included that the magnitudes relating smoking and lung cancer were implausibly large. And again, the ever-present criticism of observational studies: there did not exist any experimental evidence that could incriminate smoking as a cause of lung cancer.2
The theory that smoking causes lung cancer is now accepted science. I wouldn't be surprised if more people believe in a flat Earth than that smoking doesn't cause lung cancer. I can't think of a better-known and more widely accepted causal theory, in fact. So how did Fisher and others fail to see it? Well, in Fisher's defense, his arguments were based on sound causal logic. Smoking was endogenous. There was no experimental evidence. The two groups differed considerably on observables. And the decomposition of the simple difference in means shows that contrasts will be biased if there is selection bias. Nonetheless, Fisher was wrong, and his opponents were right. They just were right for the wrong reasons.
To motivate what we’re doing in subclassification, let’s work with Cochran (1968), which was a study trying to address strange patterns in smoking data by adjusting for a confounder. Cochran lays out mortality rates by country and smoking type (Table 5.1).
Smoking group | Canada | UK | US |
---|---|---|---|
Non-smokers | 20.2 | 11.3 | 13.5 |
Cigarettes | 20.5 | 14.1 | 13.5 |
Cigars/pipes | 35.5 | 20.7 | 17.4 |
As you can see, the highest death rate for Canadians is among the cigar and pipe smokers, which is considerably higher than for non-smokers or for those who smoke cigarettes. Similar patterns show up in the UK and the US, though they are smaller in magnitude than what we see in Canada.
This table suggests that pipes and cigars are more dangerous than cigarette smoking, which, to a modern reader, sounds ridiculous. The reason it sounds ridiculous is because cigar and pipe smokers often do not inhale, and therefore there is less tar that accumulates in the lungs than with cigarettes. And insofar as it’s the tar that causes lung cancer, it stands to reason that we should see higher mortality rates among cigarette smokers.
But, recall the independence assumption. Do we really believe that:

$$E\big[Y^1 \mid \text{Cigarettes}\big] = E\big[Y^1 \mid \text{Pipes/cigars}\big] = E\big[Y^1 \mid \text{Non-smokers}\big]$$

$$E\big[Y^0 \mid \text{Cigarettes}\big] = E\big[Y^0 \mid \text{Pipes/cigars}\big] = E\big[Y^0 \mid \text{Non-smokers}\big]$$

Is it the case that factors related to these three states of the world are truly independent of the factors that determine death rates? Well, let's assume for the sake of argument that these independence assumptions held. What else would be true across these three groups? Well, if the mean potential outcomes are the same for each type of smoking category, then wouldn't we expect the observable characteristics of the smokers themselves to be as well? This connection between the independence assumption and the characteristics of the groups is called balance. If the means of the covariates are the same for each group, then we say those covariates are balanced and the two groups are exchangeable with respect to those covariates.
One variable that appears to matter is the age of the person. Older people were more likely at this time to smoke cigars and pipes, and without stating the obvious, older people were more likely to die. In Table 5.2 we can see the mean ages of the different groups.
Smoking group | Canada | UK | US |
---|---|---|---|
Non-smokers | 54.9 | 49.1 | 57.0 |
Cigarettes | 50.5 | 49.8 | 53.2 |
Cigars/pipes | 65.9 | 55.7 | 59.7 |
The high means for cigar and pipe smokers are probably not terribly surprising. Cigar and pipe smokers are typically older than cigarette smokers, or at least they were in 1968 when Cochran was writing. And since older people die at a higher rate (for reasons other than just smoking cigars), maybe the higher death rate for cigar smokers is because they’re older on average. Furthermore, maybe by the same logic, cigarette smoking has such a low mortality rate because cigarette smokers are younger on average. Note, using DAG notation, this simply means that we have the following DAG:
where $D$ is smoking, $Y$ is mortality, and $A$ is the age of the smoker. Age opens a backdoor path because it affects both which product a person smokes and that person's mortality risk.
So how does subclassification achieve covariate balance? Our first step is to divide age into strata: say, 20–40, 41–70, and 71 and older. Then we can calculate the mortality rate for some treatment group (cigarette smokers) by strata (here, that is age). Next, weight the mortality rate for the treatment group by a strata-specific (or age-specific) weight that corresponds to the control group. This gives us the age-adjusted mortality rate for the treatment group. Let’s explain with an example by looking at Table 5.3. Assume that age is the only relevant confounder between cigarette smoking and mortality.4
Age group | Death rates | # of Cigarette smokers | # of Pipe or cigar smokers |
---|---|---|---|
Age 20–40 | 20 | 65 | 10 |
Age 41–70 | 40 | 25 | 25 |
Age ≥ 71 | 60 | 10 | 65 |
Total | | 100 | 100 |
What is the average death rate for pipe smokers without subclassification? It is the weighted average of the mortality rate column, where each weight is equal to $\frac{N_t}{N}$—that group's share of people in each age stratum.
But notice that the age distribution of cigarette smokers is the exact opposite (by construction) of that of pipe and cigar smokers. Thus the age distribution is imbalanced. Subclassification simply adjusts the mortality rate for cigarette smokers so that it has the same age distribution as the comparison group. In other words, we would multiply each age-specific mortality rate by the proportion of individuals in that age stratum in the comparison group and sum. That would be $20 \times \frac{10}{100} + 40 \times \frac{25}{100} + 60 \times \frac{65}{100} = 51$, compared with an unadjusted rate of $20 \times \frac{65}{100} + 40 \times \frac{25}{100} + 60 \times \frac{10}{100} = 29$ when we weight by the cigarette smokers' own age distribution.
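To make the arithmetic concrete, here is a minimal sketch of the age adjustment in Python using the hypothetical numbers from Table 5.3 (it treats the death-rate column as the treatment group's age-specific mortality; the variable names are mine):

Code

import numpy as np

# Hypothetical age-specific values from Table 5.3
death_rate = np.array([20, 40, 60])     # mortality by age stratum: 20-40, 41-70, 71+
n_cigarette = np.array([65, 25, 10])    # age distribution of cigarette smokers
n_pipe_cigar = np.array([10, 25, 65])   # age distribution of pipe/cigar smokers

# Unadjusted rate: weight each stratum by the cigarette smokers' own age shares
unadjusted = np.sum(death_rate * n_cigarette / n_cigarette.sum())    # 29.0

# Age-adjusted rate: re-weight by the comparison group's (pipe/cigar) age shares
adjusted = np.sum(death_rate * n_pipe_cigar / n_pipe_cigar.sum())    # 51.0

print(unadjusted, adjusted)

The unadjusted and adjusted rates (29 versus 51) differ only because the weights differ, which is the whole point of subclassification.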
Cochran uses a version of this subclassification method in his paper and recalculates the mortality rates for the three countries and the three smoking groups (see Table 5.4). As can be seen, once we adjust for the age distribution, cigarette smokers have the highest death rates among any group.
Smoking group | Canada | UK | US |
---|---|---|---|
Non-smokers | 20.2 | 11.3 | 13.5 |
Cigarettes | 29.5 | 14.8 | 21.2 |
Cigars/pipes | 19.8 | 11.0 | 13.7 |
This kind of adjustment raises a question—which variable(s) should we use for adjustment? First, recall what we've emphasized repeatedly. Both the backdoor criterion and CIA tell us precisely what we need to do. We need to choose a set of variables that satisfy the backdoor criterion. If the backdoor criterion is met, then all backdoor paths are closed, and if all backdoor paths are closed, then CIA is achieved. We call such a variable a covariate. A covariate is usually a random variable assigned to the individual units prior to treatment. This is sometimes also called exogenous. Harkening back to our DAG chapter, this variable must also not be a collider. A variable is exogenous with respect to $D$ if the value of $X$ does not depend on the value of $D$.
5.1.2 Identifying assumptions
Let me now formalize what we've learned. In order to estimate a causal effect when there is a confounder, we need (1) CIA and (2) the probability of treatment to be between 0 and 1 for each stratum. More formally,

$$\big(Y^1, Y^0\big) \perp\!\!\!\perp D \mid X \quad \text{(conditional independence)}$$

$$0 < \Pr\big(D = 1 \mid X\big) < 1 \ \text{with probability one} \quad \text{(common support)}$$
These two assumptions yield the following identity:

$$E\big[Y^1 - Y^0 \mid X\big] = E\big[Y \mid X, D=1\big] - E\big[Y \mid X, D=0\big]$$

where each value of $Y$ is determined by the switching equation. Given common support, averaging this conditional contrast over the distribution of $X$ recovers the average treatment effect.
Whereas we need treatment to be conditionally independent of both potential outcomes to identify the ATE, we need treatment to be conditionally independent only of $Y^0$ to identify the ATT, and the common support requirement correspondingly weakens to $\Pr(D=1 \mid X) < 1$.
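For reference, the sample analog of this logic is the strata-weighted estimator used throughout this chapter. This is a standard formulation rather than a quotation of the text; the notation for strata means and counts is mine:

$$\widehat{\delta}_{ATE} = \sum_{k=1}^{K}\Big(\overline{Y}^{1,k} - \overline{Y}^{0,k}\Big)\times\frac{N^k}{N}$$

where $k$ indexes the $K$ strata of $X$, $\overline{Y}^{1,k}$ and $\overline{Y}^{0,k}$ are the mean outcomes for treated and control units in stratum $k$, $N^k$ is the number of units in stratum $k$, and $N$ is the total sample size. The ATT version, which appears later in the chapter, simply swaps in treatment-group strata shares for the weights.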
5.1.3 Subclassification exercise: Titanic
For what we are going to do next, I find it useful to move into actual data. We will use an interesting data set to help us better understand subclassification. As everyone knows, the Titanic ocean cruiser hit an iceberg and sank on its maiden voyage. Slightly more than 700 passengers and crew survived out of the 2,200 people on board. It was a horrible disaster. One of the things about it that was notable, though, was the role that wealth and norms played in passengers’ survival.
Imagine that we wanted to know whether or not being seated in first class made someone more likely to survive. Given that the ship had several classes of seating and that wealth was highly concentrated in the upper decks, it's easy to see why the wealthy might have had a leg up on survival. But the problem was that women and children were explicitly given priority for boarding the scarce lifeboats. If women and children were more likely to be seated in first class, then maybe differences in survival by first class are simply picking up the effect of that social norm. Perhaps a DAG might help us here, as a DAG can help us outline the sufficient conditions for identifying the causal effect of first class on survival.
Now before we commence, let’s review what this DAG is telling us. This says that being a female made you more likely to be in first class but also made you more likely to survive because lifeboats were more likely to be allocated to women. Furthermore, being a child made you more likely to be in first class and made you more likely to survive. Finally, there are no other confounders, observed or unobserved.5
Here we have one direct path (the causal effect) between first class seating ($D$) and survival ($Y$), but we also have backdoor paths running through gender and through age (child status). Fortunately, we have data on both, so we can close each backdoor path by conditioning on them and thereby satisfy the backdoor criterion.
Code
use https://github.com/scunning1975/mixtape/raw/master/titanic.dta, clear
gen female=(sex==0)
label variable female "Female"
gen male=(sex==1)
label variable male "Male"
gen s=1 if (female==1 & age==1)
replace s=2 if (female==1 & age==0)
replace s=3 if (female==0 & age==1)
replace s=4 if (female==0 & age==0)
gen d=1 if class==1
replace d=0 if class!=1
summarize survived if d==1
gen ey1=r(mean)
summarize survived if d==0
gen ey0=r(mean)
gen sdo=ey1-ey0
su sdo
Code
library(tidyverse)
library(haven)
read_data <- function(df)
{
  full_path <- paste("https://github.com/scunning1975/mixtape/raw/master/",
                     df, sep = "")
  df <- read_dta(full_path)
  return(df)
}

titanic <- read_data("titanic.dta") %>%
  mutate(d = case_when(class == 1 ~ 1, TRUE ~ 0))

ey1 <- titanic %>%
  filter(d == 1) %>%
  pull(survived) %>%
  mean()

ey0 <- titanic %>%
  filter(d == 0) %>%
  pull(survived) %>%
  mean()

sdo <- ey1 - ey0
Code
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from itertools import combinations
import plotnine as p
# read data
import ssl
ssl._create_default_https_context = ssl._create_unverified_context

def read_data(file):
    return pd.read_stata("https://github.com/scunning1975/mixtape/raw/master/" + file)

# Simple Difference in Outcomes
titanic = read_data("titanic.dta")

titanic['d'] = 0
titanic.loc[titanic['class']=='1st class', 'd'] = 1

titanic['sex_d'] = 0
titanic.loc[titanic['sex']=='man', 'sex_d'] = 1

titanic['age_d'] = 0
titanic.loc[titanic['age']=='adults', 'age_d'] = 1

titanic['survived_d'] = 0
titanic.loc[titanic['survived']=='yes', 'survived_d'] = 1

ey0 = titanic.loc[titanic['d']==0, 'survived_d'].mean()
ey1 = titanic.loc[titanic['d']==1, 'survived_d'].mean()

sdo = ey1 - ey0
print("The simple difference in outcomes is {:.2%}".format(sdo))
Using the data set on the Titanic, we calculate a simple difference in mean outcomes (SDO), which finds that being seated in first class raised the probability of survival by 35.4%. But note, since this does not adjust for the observable confounders age and gender, it is a biased estimate of the ATE. So next we use subclassification weighting to control for these confounders. Here are the steps that this will entail:
Stratify the data into four groups: young males, young females, old males, old females.
Calculate the difference in survival probabilities for each group.
Calculate the number of people in each non-first-class (control) group and divide by the total non-first-class population. These are our strata-specific weights.
Calculate the weighted average survival rate using the strata weights.
Let’s review this with some code so that you can better understand what these four steps actually entail.
Code
* Subclassification
cap n drop ey1 ey0

su survived if s==1 & d==1
gen ey11=r(mean)
label variable ey11 "Average survival for male child in treatment"

su survived if s==1 & d==0
gen ey10=r(mean)
label variable ey10 "Average survival for male child in control"
gen diff1=ey11-ey10
label variable diff1 "Difference in survival for male children"

su survived if s==2 & d==1
gen ey21=r(mean)
su survived if s==2 & d==0
gen ey20=r(mean)
gen diff2=ey21-ey20

su survived if s==3 & d==1
gen ey31=r(mean)
su survived if s==3 & d==0
gen ey30=r(mean)
gen diff3=ey31-ey30

su survived if s==4 & d==1
gen ey41=r(mean)
su survived if s==4 & d==0
gen ey40=r(mean)
gen diff4=ey41-ey40
count if s==1 & d==0
count if s==2 & d==0
count if s==3 & d==0
count if s==4 & d==0
count if d == 0
gen wt1=281/1876
gen wt2=44/1876
gen wt3=1492/1876
gen wt4=59/1876
gen wate=diff1*wt1 + diff2*wt2 + diff3*wt3 + diff4*wt4
sum wate sdo
Code
library(stargazer)
library(magrittr) # for %$% pipes
library(tidyverse)
library(haven)
<- read_data("titanic.dta") %>%
titanic mutate(d = case_when(class == 1 ~ 1, TRUE ~ 0))
%<>%
titanic mutate(s = case_when(sex == 0 & age == 1 ~ 1,
== 0 & age == 0 ~ 2,
sex == 1 & age == 1 ~ 3,
sex == 1 & age == 0 ~ 4,
sex TRUE ~ 0))
<- titanic %>%
ey11 filter(s == 1 & d == 1) %$%
mean(survived)
<- titanic %>%
ey10 filter(s == 1 & d == 0) %$%
mean(survived)
<- titanic %>%
ey21 filter(s == 2 & d == 1) %$%
mean(survived)
<- titanic %>%
ey20 filter(s == 2 & d == 0) %$%
mean(survived)
<- titanic %>%
ey31 filter(s == 3 & d == 1) %$%
mean(survived)
<- titanic %>%
ey30 filter(s == 3 & d == 0) %$%
mean(survived)
<- titanic %>%
ey41 filter(s == 4 & d == 1) %$%
mean(survived)
<- titanic %>%
ey40 filter(s == 4 & d == 0) %$%
mean(survived)
= ey11 - ey10
diff1 = ey21 - ey20
diff2 = ey31 - ey30
diff3 = ey41 - ey40
diff4
= nrow(titanic %>% filter(d == 0))
obs
<- titanic %>%
wt1 filter(s == 1 & d == 0) %$%
nrow(.)/obs
<- titanic %>%
wt2 filter(s == 2 & d == 0) %$%
nrow(.)/obs
<- titanic %>%
wt3 filter(s == 3 & d == 0) %$%
nrow(.)/obs
<- titanic %>%
wt4 filter(s == 4 & d == 0) %$%
nrow(.)/obs
= diff1*wt1 + diff2*wt2 + diff3*wt3 + diff4*wt4
wate
stargazer(wate, sdo, type = "text")
Code
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from itertools import combinations
import plotnine as p
# read data
import ssl
ssl._create_default_https_context = ssl._create_unverified_context

def read_data(file):
    return pd.read_stata("https://github.com/scunning1975/mixtape/raw/master/" + file)

# Weighted Average Treatment Effect
titanic = read_data("titanic.dta")

titanic['d'] = 0
titanic.loc[titanic['class']=='1st class', 'd'] = 1

titanic['sex_d'] = 0
titanic.loc[titanic['sex']=='man', 'sex_d'] = 1

titanic['age_d'] = 0
titanic.loc[titanic['age']=='adults', 'age_d'] = 1

titanic['survived_d'] = 0
titanic.loc[titanic['survived']=='yes', 'survived_d'] = 1

titanic['s'] = 0
titanic.loc[(titanic.sex_d == 0) & (titanic.age_d==1), 's'] = 1
titanic.loc[(titanic.sex_d == 0) & (titanic.age_d==0), 's'] = 2
titanic.loc[(titanic.sex_d == 1) & (titanic.age_d==1), 's'] = 3
titanic.loc[(titanic.sex_d == 1) & (titanic.age_d==0), 's'] = 4

obs = titanic.loc[titanic.d == 0].shape[0]

def weighted_avg_effect(df):
    diff = df[df.d==1].survived_d.mean() - df[df.d==0].survived_d.mean()
    weight = df[df.d==0].shape[0]/obs
    return diff*weight

wate = titanic.groupby('s').apply(weighted_avg_effect).sum()

print("The weighted average treatment effect estimate is {:.2%}".format(wate))
Here we find that once we condition on the confounders gender and age, first-class seating has a much lower probability of survival associated with it (though frankly, still large). The weighted ATE is 18.9%, versus the SDO, which is 35.4%.
5.1.4 Curse of dimensionality
Here we've been assuming two covariates, each with two possible values. But this was for convenience. Our data set, for instance, only came to us with two possible values for age—child and adult. But what if it had come to us with exact age in years? Then once we condition on individual age and gender, it's entirely likely that we will not have the information necessary to calculate differences within strata, and therefore we will be unable to calculate the strata-specific weights that we need for subclassification.
For this next part, let’s assume that we have precise data on Titanic survivor ages. But because this will get incredibly laborious, let’s just focus on a few of them.
Age and Gender | Survival Prob. 1st Class | Controls | Diff | # of 1st Class | # of Controls |
---|---|---|---|---|---|
Male 11-yo | 1.0 | 0 | 1 | 1 | 2 |
Male 12-yo | – | 1 | – | 0 | 1 |
Male 13-yo | 1.0 | 0 | 1 | 1 | 2 |
Male 14-yo | – | 0.25 | – | 0 | 4 |
… |
Here we see an example of the common support assumption being violated. The common support assumption requires that for each strata, there exist observations in both the treatment and control group, but as you can see, there are not any 12-year-old male passengers in first class. Nor are there any 14-year-old male passengers in first class. And if we were to do this for every combination of age and gender, we would find that this problem was quite common. Thus, we cannot estimate the ATE using subclassification. The problem is that our stratifying variable has too many dimensions, and as a result, we have sparseness in some cells because the sample is too small.
But let's say that the problem was always on the treatment group, not the control group. That is, let's assume that there is always someone in the control group for a given combination of gender and age, but there isn't always someone in the treatment group. Then we can calculate the ATT. Because as you see in this table, for those two strata, 11-year-olds and 13-year-olds, there are both treatment and control group values for the calculation. So long as there exist controls for a given treatment stratum, we can calculate the ATT. The equation to do so can be compactly written as:

$$\widehat{\delta}_{ATT} = \sum_{k=1}^{K}\Big(\overline{Y}^{1,k} - \overline{Y}^{0,k}\Big)\times\frac{N_T^k}{N_T}$$

where $k$ indexes the strata, $\overline{Y}^{1,k}$ and $\overline{Y}^{0,k}$ are the mean outcomes for treated and control units in stratum $k$, and $N_T^k/N_T$ is the share of treated units in stratum $k$.
We've seen a problem that arises with subclassification—in a finite sample, subclassification becomes less feasible as the number of covariates grows, because as the number of strata grows, the cells become sparser and many end up containing only treatment units or only control units, leaving no within-stratum comparison to compute.
5.2 Exact Matching
Subclassification uses the difference between treatment and control group units and achieves covariate balance by using the strata-specific probability weights to weight those differences.
But there are alternative approaches. For instance, what if we estimated the missing potential outcome for each treated unit directly, by filling it in with the outcome of a control unit that is "closest" to it on the observed covariates? That is the idea behind matching.
There are two broad types of matching that we will consider: exact matching and approximate matching. We will first start by describing exact matching. Much of what I am going to be discussing is based on Abadie and Imbens (2006).
A simple matching estimator is the following:

$$\widehat{\delta}_{ATT} = \frac{1}{N_T}\sum_{D_i=1}\Big(Y_i - Y_{j(i)}\Big)$$

where $Y_{j(i)}$ is the outcome of the control-group unit $j$ whose covariate value is closest to that of treated unit $i$; it serves as unit $i$'s imputed counterfactual.
But what if there's more than one unit "closest to" the $i$th treated unit? For instance, say that two control units share the same covariate value as the treated unit. Rather than throw away information, we can simply average over their outcomes and use that average as the imputed counterfactual.
Notationally, we can describe this estimator as

$$\widehat{\delta}_{ATT} = \frac{1}{N_T}\sum_{D_i=1}\bigg(Y_i - \Big[\frac{1}{M}\sum_{m=1}^{M}Y_{j_m(i)}\Big]\bigg)$$

This estimator really isn't too different from the one just before it; the main difference is that this one averages over several close matches as opposed to just picking one. This approach works well when we can find a number of good matches for each treatment group unit. We usually define $M$ to be small, like $M=2$.
Those were both ATT estimators. You can tell that these are ATT estimators because the sums run over the treatment group only: each treated unit gets an imputed control counterfactual, but untreated units never get an imputed treated counterfactual.

The ATE version matches in both directions—imputing a control counterfactual for each treated unit and a treated counterfactual for each control unit—and averages over the whole sample:

$$\widehat{\delta}_{ATE} = \frac{1}{N}\sum_{i=1}^{N}\big(2D_i - 1\big)\bigg(Y_i - \Big[\frac{1}{M}\sum_{m=1}^{M}Y_{j_m(i)}\Big]\bigg)$$
Let's see this work in action with an example. Table 5.6 shows two samples: a list of participants in a job-training program and a list of non-participants, or non-trainees. The left-hand group is the treatment group and the right-hand group is the control group. The matching algorithm that we defined earlier will create a third group, called the matched sample, consisting of each treatment group unit's matched counterfactual. Here we will match on the age of the participant.
Trainees | Non-Trainees | ||||
Unit | Age | Earnings | Unit | Age | Earnings |
1 | 18 | 9500 | 1 | 20 | 8500 |
2 | 29 | 12250 | 2 | 27 | 10075 |
3 | 24 | 11000 | 3 | 21 | 8725 |
4 | 27 | 11750 | 4 | 39 | 12775 |
5 | 33 | 13250 | 5 | 38 | 12550 |
6 | 22 | 10500 | 6 | 29 | 10525 |
7 | 19 | 9750 | 7 | 39 | 12775 |
8 | 20 | 10000 | 8 | 33 | 11425 |
9 | 21 | 10250 | 9 | 24 | 9400 |
10 | 30 | 12500 | 10 | 30 | 10750 |
11 | 33 | 11425 | |||
12 | 36 | 12100 | |||
13 | 22 | 8950 | |||
14 | 18 | 8050 | |||
15 | 43 | 13675 | |||
16 | 39 | 12775 | |||
17 | 19 | 8275 | |||
18 | 30 | 9000 | |||
19 | 51 | 15475 | |||
20 | 48 | 14800 | |||
Mean | 24.3 | $11,075 | 31.95 | $11,101.25 |
Before we do this, though, I want to show you how the ages of the trainees differ on average from the ages of the non-trainees. We can see that in Table 5.6—the average age of the participants is 24.3 years, and the average age of the non-participants is 31.95 years. Thus, the people in the control group are older, and since wages typically rise with age, we may suspect that part of the reason their average earnings are higher ($11,101 vs. $11,075) is that the control group is older. We say that the two groups are not exchangeable because the covariate is not balanced. Let's look at the age distribution. To illustrate this, we need to download the data first. We will create two histograms—the distribution of age for the trainees and for the non-trainees—as well as summarize earnings for each group. That information is also displayed in Figure 5.1.
Code
use https://github.com/scunning1975/mixtape/raw/master/training_example.dta, clear
histogram age_treat, bin(10) frequency
histogram age_control, bin(10) frequency
su age_treat age_control
su earnings_treat earnings_control
histogram age_treat, bin(10) frequency
histogram age_matched, bin(10) frequency
su age_treat age_matched
su earnings_treat earnings_matched
Code
library(tidyverse)
library(haven)
read_data <- function(df)
{
  full_path <- paste("https://github.com/scunning1975/mixtape/raw/master/",
                     df, sep = "")
  df <- read_dta(full_path)
  return(df)
}

training_example <- read_data("training_example.dta") %>%
  slice(1:20)

ggplot(training_example, aes(x=age_treat)) +
  stat_bin(bins = 10, na.rm = TRUE)

ggplot(training_example, aes(x=age_control)) +
  geom_histogram(bins = 10, na.rm = TRUE)
Code
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from itertools import combinations
import plotnine as p
# read data
import ssl
ssl._create_default_https_context = ssl._create_unverified_context

def read_data(file):
    return pd.read_stata("https://github.com/scunning1975/mixtape/raw/master/" + file)

training_example = read_data("training_example.dta")

p.ggplot(training_example, p.aes(x='age_treat')) + p.stat_bin(bins = 10)

p.ggplot(training_example, p.aes(x='age_control')) + p.geom_histogram(bins = 10)
As you can see from Figure 5.1, these two populations not only have different means (Table 5.6); the entire distribution of age across the samples is different. So let's use our matching algorithm and create the missing counterfactuals for each treatment group unit. This method, since it only imputes the missing counterfactual for each treatment unit, will yield an estimate of the ATT.
Trainees | Non-Trainees | Matched Sample | ||||||
Unit | Age | Earnings | Unit | Age | Earnings | Unit | Age | Earnings |
1 | 18 | 9500 | 1 | 20 | 8500 | 14 | 18 | 8050 |
2 | 29 | 12250 | 2 | 27 | 10075 | 6 | 29 | 10525 |
3 | 24 | 11000 | 3 | 21 | 8725 | 9 | 24 | 9400 |
4 | 27 | 11750 | 4 | 39 | 12775 | 8 | 27 | 10075 |
5 | 33 | 13250 | 5 | 38 | 12550 | 11 | 33 | 11425 |
6 | 22 | 10500 | 6 | 29 | 10525 | 13 | 22 | 8950 |
7 | 19 | 9750 | 7 | 39 | 12775 | 17 | 19 | 8275 |
8 | 20 | 10000 | 8 | 33 | 11425 | 1 | 20 | 8500 |
9 | 21 | 10250 | 9 | 24 | 9400 | 3 | 21 | 8725 |
10 | 30 | 12500 | 10 | 30 | 10750 | 10,18 | 30 | 9875 |
11 | 33 | 11425 | ||||||
12 | 36 | 12100 | ||||||
13 | 22 | 8950 | ||||||
14 | 18 | 8050 | ||||||
15 | 43 | 13675 | ||||||
16 | 39 | 12775 | ||||||
17 | 19 | 8275 | ||||||
18 | 30 | 9000 | ||||||
19 | 51 | 15475 | ||||||
20 | 48 | 14800 | ||||||
Mean | 24.3 | $11,075 | 31.95 | $11,101.25 | 24.3 | $9,380 |
Now let's move to creating the matched sample. As this is exact matching, the distance to the nearest neighbor will be zero. This won't always be the case, but note that as the control group sample size grows, the likelihood that we find a unit with the same covariate value as one in the treatment group grows as well. I've created a data set like this. The first treatment unit has an age of 18. Searching down through the non-trainees, we find exactly one person with an age of 18, and that's unit 14. So we move that unit's age and earnings information to the new matched sample columns.
We continue doing that for all units, always moving the control group unit with the closest age into the matched sample, and averaging when there are ties. For example, treatment unit 10 is 30 years old and matches two control units of that age (units 10 and 18), so its matched earnings is their average, ($10,750 + $9,000)/2 = $9,875.
Now we see that the mean age is the same for both groups. We can also check the overall age distribution (Figure 5.2). As you can see, the two groups are exactly balanced on age. We might say the two groups are exchangeable. And the difference in earnings between those in the treatment group and those in the control group is $1,695. That is, we estimate that the causal effect of the program was $1,695 in higher earnings.
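Here is a minimal sketch of that final calculation in Python, with the earnings columns typed in by hand from Table 5.7 (a real implementation would construct the matched sample from the full data set):

Code

import numpy as np

# Earnings of the ten trainees (Table 5.7, left panel)
earnings_treat = np.array([9500, 12250, 11000, 11750, 13250,
                           10500, 9750, 10000, 10250, 12500])

# Earnings of each trainee's exact-age match from the control group
# (trainee 10 is matched to the two tied 30-year-olds: (10750 + 9000)/2 = 9875)
earnings_matched = np.array([8050, 10525, 9400, 10075, 11425,
                             8950, 8275, 8500, 8725, 9875])

att = earnings_treat.mean() - earnings_matched.mean()
print(att)   # 11075 - 9380 = 1695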
Let's summarize what we've learned. We've been using a lot of different terms, drawn from different authors and different statistical traditions, so I'd like to map them onto one another. The two groups were different in ways that were likely a direct function of potential outcomes. This means that the independence assumption was violated. Assuming that treatment assignment was conditionally random, then matching on age created an exchangeable set of comparison units, and the difference in mean earnings between the treatment group and the matched sample is an estimate of the ATT.
5.3 Approximate Matching
The previous example of matching was relatively simple—find a unit or collection of units that have the same value of some covariate $X$ as the treated unit and substitute their outcomes as that unit's counterfactual.
But what if you couldn’t find another unit with that exact same value? Then you’re in the world of approximate matching.
5.3.1 Nearest neighbor covariate matching
One of the instances where exact matching can break down is when the number of covariates, $K$, is large—the curse of dimensionality again. With many covariates, exact matches become rare, and we have to settle for control units that are merely close, not identical, on the covariates.
Matching on a single covariate is straightforward because distance is measured in terms of the covariate's own values. For instance, distance in age is simply how close in years or months or days one person is to another person. But what if we have several covariates needed for matching? Say, age and log income. A 1-point change in age is very different from a 1-point change in log income, not to mention that we are now measuring distance in two, not one, dimensions. When the number of matching covariates is more than one, we need a new definition of distance to measure closeness. We begin with the simplest measure of distance, the Euclidean distance:

$$\lVert X_i - X_j\rVert = \sqrt{(X_i - X_j)'(X_i - X_j)} = \sqrt{\sum_{n=1}^{k}\big(X_{ni} - X_{nj}\big)^2}$$
The problem with this measure of distance is that it depends on the scale of the variables themselves. For this reason, researchers typically will use some modification of the Euclidean distance, such as the normalized Euclidean distance, or they'll use a wholly different alternative distance. The normalized Euclidean distance is a commonly used distance, and what makes it different is that each variable's distance is scaled by that variable's variance. The distance is measured as:

$$\lVert X_i - X_j\rVert = \sqrt{\sum_{n=1}^{k}\frac{\big(X_{ni} - X_{nj}\big)^2}{\widehat{\sigma}_n^2}}$$

where $\widehat{\sigma}_n^2$ is the sample variance of the $n$th covariate.
Finally, there is the Mahalanobis distance, which, like the normalized Euclidean distance, is a scale-invariant distance metric. It is:

$$\lVert X_i - X_j\rVert = \sqrt{(X_i - X_j)'\,\widehat{\Sigma}_X^{-1}(X_i - X_j)}$$

where $\widehat{\Sigma}_X$ is the sample variance-covariance matrix of $X$.
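The following short Python sketch computes all three distances for a pair of hypothetical covariate vectors; the sample variance and covariance values are made up for illustration and would ordinarily be estimated from the comparison sample:

Code

import numpy as np

# Two covariate vectors (age, log income) -- made-up values
x_i = np.array([25.0, 10.2])
x_j = np.array([31.0, 10.9])

# Assumed sample moments (in practice, estimate these from the data)
var_hat = np.array([90.0, 0.5])                  # covariate variances
cov_hat = np.array([[90.0, 2.0], [2.0, 0.5]])    # variance-covariance matrix

diff = x_i - x_j
euclidean = np.sqrt(diff @ diff)
normalized_euclidean = np.sqrt(np.sum(diff**2 / var_hat))
mahalanobis = np.sqrt(diff @ np.linalg.inv(cov_hat) @ diff)

print(euclidean, normalized_euclidean, mahalanobis)

Notice how the unscaled Euclidean distance is dominated by age simply because age is measured on a larger scale; the normalized and Mahalanobis versions undo that.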
Basically, more than one covariate creates a lot of headaches. Not only does it create the curse-of-dimensionality problem; it also makes measuring distance harder. All of this creates some challenges for finding a good match in the data. As you can see in each of these distance formulas, there are sometimes going to be matching discrepancies: the matched unit's covariate value $X_{j(i)}$ will sometimes differ from $X_i$, and those discrepancies are what bias the matching estimator.
How severe is this bias? First, the good news. What we know is that the matching discrepancies tend to converge to zero as the sample size increases—which is one of the main reasons that approximate matching is so data greedy. It demands a large sample size for the matching discrepancies to be trivially small. But what if there are many covariates? The more covariates, the longer it takes for that convergence to zero to occur. Basically, if it's hard to find good matches with a high-dimensional $X$, then you will need a large sample for the matching discrepancies to become trivially small.
5.3.2 Bias correction
Speaking of matching discrepancies, what sorts of options are available to us, putting aside seeking a large data set with lots of controls? Well, enter stage left, Abadie and Imbens (2011), who introduced bias-correction techniques with matching estimators when there are matching discrepancies in finite samples. So let’s look at that more closely, as you’ll likely need this in your work.
Everything we’re getting at is suggesting that matching is biased because of these poor matching discrepancies. So let’s derive this bias. First, we write out the sample ATT estimate, and then we subtract out the true ATT. So:
$$\mu^1(X_i) = E\big[Y \mid X = X_i, D = 1\big], \qquad \mu^0(X_i) = E\big[Y \mid X = X_i, D = 0\big]$$

Notice, these are just the expected conditional outcome functions based on the switching equation for both the control and treatment groups.
As always, we write out the observed value as a function of expected conditional outcomes and some stochastic element:

$$Y_i = \mu^{D_i}(X_i) + \varepsilon_i$$
Now rewrite the ATT estimator using the above
Notice, the first line is just the ATT with the stochastic element included from the previous line. And the second line rearranges it so that we get two terms: the estimated ATT plus the average difference in the stochastic terms for the matched sample.
Now we compare this estimator with the true value of ATT.
Applying the central limit theorem and the difference,
As you can see, the bias of the matching estimator can be severe depending on the magnitude of these matching discrepancies. However, one piece of good news is that these discrepancies are observed. We can see the degree to which each unit's matched sample has severe mismatch on the covariates themselves. Second, we can always make the matching discrepancy small by using a large donor pool of untreated units to select our matches, because, recall, the likelihood of finding a good match grows with the sample size, and so if we are content with estimating the ATT, then increasing the size of the donor pool can get us out of this mess. But let's say we can't do that and the matching discrepancies are large. Then we can apply bias-correction methods to minimize the size of the bias. So let's see what the bias-correction method looks like. This is based on Abadie and Imbens (2011).
Note that the total bias is made up of the bias associated with each individual unit $i$: each treated unit contributes $\mu^0(X_i) - \mu^0(X_{j(i)})$, which is nonzero whenever its match is inexact. The bias-corrected matching estimator subtracts an estimate of that term from each matched difference:

$$\widehat{\delta}_{ATT}^{BC} = \frac{1}{N_T}\sum_{D_i=1}\Big[\big(Y_i - Y_{j(i)}\big) - \big(\widehat{\mu}^0(X_i) - \widehat{\mu}^0(X_{j(i)})\big)\Big]$$

where $\widehat{\mu}^0(X)$ is an estimate of $E[Y \mid X, D=0]$ using, for example, OLS. Let's see how this works with the example in Table 5.8.
Unit | Y | D | X |
---|---|---|---|
1 | 5 | 1 | 11 |
2 | 2 | 1 | 7 |
3 | 10 | 1 | 5 |
4 | 6 | 1 | 3 |
5 | 4 | 0 | 10 |
6 | 0 | 0 | 8 |
7 | 5 | 0 | 4 |
8 | 1 | 0 | 1 |
Notice in this example that we cannot implement exact matching because none of the treatment group units has an exact match in the control group. It’s worth emphasizing that this is a consequence of finite samples; the likelihood of finding an exact match grows when the sample size of the control group grows faster than that of the treatment group. Instead, we use nearest-neighbor matching, which is simply going to match each treatment unit to the control group unit whose covariate value is nearest to that of the treatment group unit itself. But, when we do this kind of matching, we necessarily create matching discrepancies, which is simply another way of saying that the covariates are not perfectly matched for every unit. Nonetheless, the nearest-neighbor “algorithm” creates Table 5.9.
Recall that $\widehat{\mu}^0(X)$ is our estimate of the conditional mean outcome at covariate value $X$; we will estimate it in the code below with a simple regression of $Y$ on $X$. First, though, Table 5.9 records each treated unit's nearest-neighbor match as its $Y^0$.
Unit | Y1 | Y0 | D | X |
---|---|---|---|---|
1 | 5 | 4 | 1 | 11 |
2 | 2 | 0 | 1 | 7 |
3 | 10 | 5 | 1 | 5 |
4 | 6 | 1 | 1 | 3 |
5 | | 4 | 0 | 10 |
6 | | 0 | 0 | 8 |
7 | | 5 | 0 | 4 |
8 | | 1 | 0 | 1 |
Code
use https://github.com/scunning1975/mixtape/raw/master/training_bias_reduction.dta, clear
reg Y X
gen muhat = _b[_cons] + _b[X]*X
list
Code
library(tidyverse)
library(haven)
read_data <- function(df)
{
  full_path <- paste("https://github.com/scunning1975/mixtape/raw/master/",
                     df, sep = "")
  df <- read_dta(full_path)
  return(df)
}

training_bias_reduction <- read_data("training_bias_reduction.dta") %>%
  mutate(
    Y1 = case_when(Unit %in% c(1,2,3,4) ~ Y),
    Y0 = c(4,0,5,1,4,0,5,1))

train_reg <- lm(Y ~ X, training_bias_reduction)

training_bias_reduction <- training_bias_reduction %>%
  mutate(u_hat0 = predict(train_reg))
Code
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from itertools import combinations
import plotnine as p
# read data
import ssl
ssl._create_default_https_context = ssl._create_unverified_context

def read_data(file):
    return pd.read_stata("https://github.com/scunning1975/mixtape/raw/master/" + file)

training_bias_reduction = read_data("training_bias_reduction.dta")

training_bias_reduction['Y1'] = 0
training_bias_reduction.loc[training_bias_reduction['Unit'].isin(range(1,5)), 'Y1'] = 1
training_bias_reduction['Y0'] = (4,0,5,1,4,0,5,1)

train_reg = sm.OLS.from_formula('Y ~ X', training_bias_reduction).fit()
training_bias_reduction['u_hat0'] = train_reg.predict(training_bias_reduction)
training_bias_reduction = training_bias_reduction[['Unit', 'Y1', 'Y0', 'Y', 'D', 'X', 'u_hat0']]

training_bias_reduction
When we regress $Y$ on $X$ using all eight units, we get fitted values of approximately $\widehat{Y} = 4.42 - 0.049\,X$; evaluating that line at each unit's covariate value gives the predicted outcomes reported in the last column of Table 5.10.
This gives us the outcomes, treatment status, and predicted values in Table 5.10.
Unit | Y1 | Y0 | Y | D | X | $\widehat{\mu}^0(X)$ |
---|---|---|---|---|---|---|
1 | 5 | 4 | 5 | 1 | 11 | 3.89 |
2 | 2 | 0 | 2 | 1 | 7 | 4.08 |
3 | 10 | 5 | 10 | 1 | 5 | 4.18 |
4 | 6 | 1 | 6 | 1 | 3 | 4.28 |
5 | | 4 | 4 | 0 | 10 | 3.94 |
6 | | 0 | 0 | 0 | 8 | 4.03 |
7 | | 5 | 5 | 0 | 4 | 4.23 |
8 | | 1 | 1 | 0 | 1 | 4.37 |
And then this would be done for the other three simple differences, each of which is adjusted by a bias-correction term based on the fitted values at the relevant covariate values.
Now, care must be given when using the fitted values for bias correction, so let me walk you through it. You are still going to be taking the simple differences (e.g., 5 − 4 for row 1), but now you will also subtract out the difference in fitted values between each observation's covariate and its match's covariate. So for instance, in row 1, the outcome 5 has a covariate of 11, which gives it a fitted value of 3.89, but the counterfactual has a covariate value of 10, which gives it a predicted value of 3.94. So therefore we would use the following bias correction:

$$\widehat{\delta}_{ATT}^{BC} = \frac{\big[(5-4)-(3.89-3.94)\big]+\big[(2-0)-(4.08-4.03)\big]+\big[(10-5)-(4.18-4.23)\big]+\big[(6-1)-(4.28-4.37)\big]}{4} = 3.285$$
which is slightly higher than the unadjusted ATE of 3.25. Note that this bias-correction adjustment becomes more significant as the matching discrepancies themselves become more common. But, if the matching discrepancies are not very common in the first place, then by definition, bias adjustment doesn’t change the estimated parameter very much.
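As a sanity check, here is a small Python sketch that reproduces the fitted values and the bias-corrected estimate from Tables 5.8–5.10. It fits the regression on all eight units, as the code above does, and hard-codes the nearest-neighbor pairs from Table 5.9:

Code

import numpy as np
import pandas as pd
import statsmodels.api as sm

# Data from Table 5.8
df = pd.DataFrame({'Y': [5, 2, 10, 6, 4, 0, 5, 1],
                   'D': [1, 1, 1, 1, 0, 0, 0, 0],
                   'X': [11, 7, 5, 3, 10, 8, 4, 1]})

# Fitted values from a regression of Y on X
ols = sm.OLS.from_formula('Y ~ X', df).fit()
df['u_hat0'] = ols.predict(df)

# Nearest-neighbor matches (without replacement) from Table 5.9: treated row -> control row
pairs = [(0, 4), (1, 5), (2, 6), (3, 7)]

unadjusted = np.mean([df.Y[i] - df.Y[j] for i, j in pairs])
bias_corrected = np.mean([(df.Y[i] - df.Y[j]) - (df.u_hat0[i] - df.u_hat0[j])
                          for i, j in pairs])
print(unadjusted, bias_corrected)   # roughly 3.25 and 3.29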
Bias arises because of the effect of large matching discrepancies. To minimize these discrepancies, we need a small number of matches per unit, $M$ (such as $M=1$), since reaching for additional, more distant matches mechanically increases the discrepancies; matching with replacement helps too, because each treated unit can then be paired with its nearest control even if that control has already served as a match.
The matching estimators have a normal distribution in large samples provided that the bias is small. For matching without replacement, the usual variance estimator is valid. That is:
For matching with replacement:
where
5.3.3 Propensity score methods
There are several ways of achieving the conditioning strategy implied by the backdoor criterion, and we’ve discussed several. But one popular one was developed by Donald Rubin in the mid-1970s to early 1980s called the propensity score method (Rubin 1977; Rosenbaum and Rubin 1983). The propensity score is similar in many respects to both nearest-neighbor covariate matching by Abadie and Imbens (2006) and subclassification. It’s a very popular method, particularly in the medical sciences, of addressing selection on observables, and it has gained some use among economists as well (Dehejia and Wahba 2002).
Before we dig into it, though, a couple of words to help manage expectations. Despite some early excitement caused by Dehejia and Wahba (2002), subsequent enthusiasm was more tempered (Smith and Todd 2001, 2005; King and Nielsen 2019). As such, propensity score matching has not seen as wide adoption among economists as other nonexperimental methods like regression discontinuity or difference-in-differences. The most common reason given for this is that economists are oftentimes skeptical that CIA can be achieved in any data set—almost as an article of faith. This is because for many applications, economists as a group are usually more concerned about selection on unobservables than about selection on observables, and as such, they reach for matching methods less often. But I am agnostic as to whether CIA holds or doesn't hold in your particular application. There's no theoretical reason to dismiss a procedure designed to estimate causal effects on some ad hoc principle one holds because of a hunch. Only prior knowledge and deep familiarity with the institutional details of your application can tell you what the appropriate identification strategy is, and insofar as the backdoor criterion can be met, then matching methods may be perfectly appropriate. And if it cannot, then matching is inappropriate. But then, so is a naïve multivariate regression in such cases.
We’ve mentioned that propensity score matching is an application used when a conditioning strategy can satisfy the backdoor criterion. But how exactly is it implemented? Propensity score matching takes those necessary covariates, estimates a maximum likelihood model of the conditional probability of treatment (usually a logit or probit so as to ensure that the fitted values are bounded between 0 and 1), and uses the predicted values from that estimation to collapse those covariates into a single scalar called the propensity score. All comparisons between the treatment and control group are then based on that value.
There is some subtlety to the propensity score in practice, though. Consider this scenario: two units, A and B, are assigned to treatment and control, respectively. But their propensity score is 0.6. Thus, they had the same 60% conditional probability of being assigned to treatment, but by random chance, A was assigned to treatment and B was assigned to control. The idea with propensity score methods is to compare units who, based on observables, had very similar probabilities of being placed into the treatment group even though those units differed with regard to actual treatment assignment. If, conditional on $X$, two units have the same probability of being treated, then we say they have similar propensity scores, and any remaining variation in treatment assignment between them is due to chance alone.
Implicit in that example, though, we see another assumption needed for this procedure, and that’s the common support assumption. Common support simply requires that there be units in the treatment and control group across the estimated propensity score. We had common support for 0.6 because there was a unit in the treatment group (A) and one in the control group (B) for 0.6. In ways that are connected to this, the propensity score can be used to check for covariate balance between the treatment group and control group such that the two groups become observationally equivalent. But before walking through an example using real data, let’s review some papers that use it.8
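To make those two ideas concrete—estimating the score from observables and then checking common support—here is a minimal, self-contained Python sketch on simulated data (not the NSW data used below); the variable names and the data-generating process are mine:

Code

import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Toy data: one observable confounder x drives treatment assignment
rng = np.random.default_rng(1)
n = 5000
x = rng.normal(size=n)
d = rng.binomial(1, 1/(1 + np.exp(-(0.5 + 1.5*x))))
df = pd.DataFrame({'d': d, 'x': x})

# Estimate the propensity score with a logit and attach the fitted values
logit = smf.glm('d ~ x', data=df, family=sm.families.Binomial()).fit()
df['pscore'] = logit.predict(df)

# Common support check: compare the score's range in each treatment arm
print(df.groupby('d')['pscore'].describe()[['min', 'max']])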
5.3.4 Example: The NSW job training program
The National Supported Work Demonstration (NSW) job-training program was operated by the Manpower Demonstration Research Corporation (MDRC) in the mid-1970s. The NSW was a temporary employment program designed to help disadvantaged workers lacking basic job skills move into the labor market by giving them work experience and counseling in a sheltered environment. It was also unique in that it randomly assigned qualified applicants to training positions. The treatment group received all the benefits of the NSW program. The controls were basically left to fend for themselves. The program admitted women receiving Aid to Families with Dependent Children, recovering addicts, released offenders, and men and women who had not completed high school.
Treatment group members were guaranteed a job for nine to eighteen months depending on the target group and site. They were then divided into crews of three to five participants who worked together and met frequently with an NSW counselor to discuss grievances with the program and performance. Finally, they were paid for their work. NSW offered the trainees lower wages than they would’ve received on a regular job, but allowed for earnings to increase for satisfactory performance and attendance. After participants’ terms expired, they were forced to find regular employment. The kinds of jobs varied within sites—some were gas-station attendants, some worked at a printer shop—and men and women frequently performed different kinds of work.
The MDRC collected earnings and demographic information from both the treatment and the control group at baseline as well as every nine months thereafter. MDRC also conducted up to four post-baseline interviews. There were different sample sizes from study to study, which can be confusing.
NSW was a randomized job-training program; therefore, the independence assumption was satisfied. So calculating average treatment effects was straightforward—it’s the simple difference in means estimator that we discussed in the potential outcomes chapter.9
Lalonde (1986) is an interesting study both because he is evaluating the NSW program and because he is evaluating commonly used econometric methods from that time. He evaluated the econometric estimators' performance by trading out the experimental control group data with data on a non-experimental control group drawn from the population of US citizens. He used three samples of the Current Population Survey (CPS) and three samples of the Panel Survey of Income Dynamics (PSID) for this non-experimental control group data, but I will use just one of each. Non-experimental data is, after all, the typical situation an economist finds herself in. But the difference with the NSW is that it was a randomized experiment, and therefore we know the average treatment effect. Since we know the average treatment effect, we can see how well a variety of econometric models perform. If the estimates built on the non-experimental comparison groups come close to the experimental benchmark, we can have some confidence in those methods; if they don't, that is cause for concern.
Lalonde (1986) reviewed a number of popular econometric methods used by his contemporaries with both the PSID and the CPS samples as nonexperimental comparison groups, and his results were consistently horrible. Not only were his estimates usually very different in magnitude, but his results were almost always the wrong sign! This paper, and its pessimistic conclusion, was influential in policy circles and led to a greater push for more experimental evaluations.11 We can see these results in the following tables from Lalonde (1986). Table 5.11 shows the effect of the treatment when comparing the treatment group to the experimental control group. The baseline difference in real earnings between the two groups was negligible. The treatment group made $39 more than the control group in the pre-treatment period without controls and $21 less in the multivariate regression model, but neither is statistically significant. But the post-treatment difference in average earnings was between $798 and $886.12
NSW Treatment - Control Earnings | |||||
Name of comparison group | Pre-Treatment Unadjusted | Pre-Treatment Adjusted | Post-Treatment Unadjusted | Post-Treatment Adjusted | Difference-in-Differences |
Experimental controls | $ 39 | $ -21 | $ 886 | $ 798 | $ 856 |
(383) | (378) | (476) | (472) | (558) | |
PSID-1 | |||||
(795) | (851) | (913) | (990) | (692) | |
CPS-SSA-1 | $195 | ||||
(539) | (509) | (562) | (557) | (441) |
Table 5.11 also shows the results he got when he used the non-experimental data as the comparison group. Here I report his results when using one sample from the PSID and one from the CPS, although in his original paper he used three of each. In nearly every point estimate, the effect is negative. The one exception is the difference-in-differences model which is positive, small, and insignificant.
So why is there such a stark difference when we move from the NSW control group to either the PSID or CPS? The reason is selection bias: the non-experimental comparison groups are drawn from a population whose untreated earnings differ systematically from those of the NSW participants. That is,

$$E\big[Y^0 \mid D = 1\big] \neq E\big[Y^0 \mid D = 0\big]$$
But as I will show shortly, a violation of independence also implies that covariates will be unbalanced across the propensity score—something we call the balancing property. Table 5.12 illustrates this by showing the mean values of each covariate for the treatment and control groups, where the control group is the 15,992 observations from the CPS. As you can see, the treatment group appears to be very different on average from the CPS control sample along nearly every covariate listed. The NSW participants are more likely to be Black, more likely to be Hispanic, younger, less likely to be married, more likely to have no degree and less schooling, more likely to be unemployed in 1975, and likely to have considerably lower earnings in 1975. In short, the two groups are not exchangeable on observables (and likely not exchangeable on unobservables either).
All | CPS Controls | NSW Trainees | ||||
Covariate | Mean | SD | Mean | Mean | T-statistic | Diff. |
Black | 0.09 | 0.28 | 0.07 | 0.80 | 47.04 | |
Hispanic | 0.07 | 0.26 | 0.07 | 0.94 | 1.47 | |
Age | 33.07 | 11.04 | 33.2 | 24.63 | 13.37 | 8.6 |
Married | 0.70 | 0.46 | 0.71 | 0.17 | 20.54 | 0.54 |
No degree | 0.30 | 0.46 | 0.30 | 0.73 | 16.27 | |
Education | 12.0 | 2.86 | 12.03 | 10.38 | 9.85 | 1.65 |
1975 Earnings | 13.51 | 9.31 | 13.65 | 3.1 | 19.63 | 10.6 |
1975 Unemp | 0.11 | 0.32 | 0.11 | 0.37 | 14.29 |
The first paper to reevaluate Lalonde (1986) using propensity score methods was Dehejia and Wahba (1999). Their interest was twofold. First, they wanted to examine whether propensity score matching could be an improvement in estimating treatment effects using non-experimental data. And second, they wanted to show the diagnostic value of propensity score matching. The authors used the same non-experimental control group data sets from the CPS and PSID as Lalonde (1986) did.
Let’s walk through this, and what they learned from each of these steps. First, the authors estimated the propensity score using maximum likelihood modeling. Once they had the estimated propensity score, they compared treatment units to control units within intervals of the propensity score itself. This process of checking whether there are units in both treatment and control for intervals of the propensity score is called checking for common support.
One easy way to check for common support is to plot the number of treatment and control group observations separately across the propensity score with a histogram. Dehejia and Wahba (1999) did this using both the PSID and CPS samples and found that the overlap was nearly nonexistent, but here I’ll focus on their CPS sample. The overlap was so bad that they opted to drop 12,611 observations in the control group because their propensity scores were outside the treatment group range. Also, a large number of observations have low propensity scores, evidenced by the fact that the first bin contains 2,969 comparison units. Once this “trimming” was done, the overlap improved, though still wasn’t great.
We learn some things from this kind of diagnostic, though. We learn, for one, that the selection bias on observables is probably extreme if for no other reason than the fact that there are so few units in both treatment and control for given values of the propensity score. When there is considerable bunching at either end of the propensity score distribution, it suggests you have units who differ remarkably on observables with respect to the treatment variable itself. Trimming around those extreme values has been a way of addressing this when employing traditional propensity score adjustment techniques.
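One common way to implement that kind of trimming is to keep only comparison units whose estimated scores fall inside the treatment group's range. Here is a hedged sketch (it assumes a data frame with `treat` and `pscore` columns, like the one constructed in the code later in this section):

Code

import pandas as pd

def trim_to_treated_support(df: pd.DataFrame) -> pd.DataFrame:
    # Drop comparison units whose propensity score lies outside the treated group's range
    lo = df.loc[df.treat == 1, 'pscore'].min()
    hi = df.loc[df.treat == 1, 'pscore'].max()
    keep = (df.treat == 1) | ((df.pscore >= lo) & (df.pscore <= hi))
    return df[keep]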
NSW T-C Earnings | Propensity Score Adjusted | ||||||
Stratification | Matching | ||||||
Comparison group | Unadj. | Adj. | Quadratic Score | Unadj. | Adj. | Unadj. | Adj. |
Experimental controls | 1,794 | 1,672 | |||||
(633) | (638) | ||||||
PSID-1 | 731 | 294 | 1,608 | 1,494 | 1,691 | 1,473 | |
(1154) | (886) | (1389) | (1571) | (1581) | (2209) | (809) | |
CPS-1 | 972 | 1,117 | 1,713 | 1,774 | 1,582 | 1,616 | |
(712) | (550) | (747) | (1115) | (1152) | (1069) | (751) |
With estimated propensity score in hand, Dehejia and Wahba (1999) estimated the treatment effect on real earnings 1978 using the experimental treatment group compared with the non-experimental control group. The treatment effect here differs from what we found in Lalonde because Dehejia and Wahba (1999) used a slightly different sample. Still, using their sample, they find that the NSW program caused earnings to increase between $1,672 and $1,794 depending on whether exogenous covariates were included in a regression. Both of these estimates are highly significant.
The first two columns labeled “unadjusted” and “adjusted” represent OLS regressions with and without controls. Without controls, both PSID and CPS estimates are extremely negative and precise. This, again, is because the selection bias is so severe with respect to the NSW program. When controls are included, effects become positive and imprecise for the PSID sample though almost significant at 5% for CPS. But each effect size is only about half the size of the true effect.
Table 5.13 shows the results using propensity score weighting or matching.13 As can be seen, the results are a considerable improvement over Lalonde (1986). I won’t review every treatment effect the authors calculated, but I will note that they are all positive and similar in magnitude to what they found in columns 1 and 2 using only the experimental data.
Finally, the authors examined the balance between the covariates in the treatment group (NSW) and the various non-experimental (matched) samples in Table 5.14. In the next section, I explain why we expect covariate values to balance along the propensity score for the treatment and control group after trimming the outlier propensity score units from the data. Table 5.14 shows the sample means of characteristics in the matched control sample versus the experimental NSW sample (first row). Trimming on the propensity score, in effect, helped balance the sample. Covariates are much closer in mean value to the NSW sample after trimming on the propensity score.
Matched Sample | Age | Education | Black | Hispanic | No Degree | Married | RE74 | RE75 | |
NSW | 185 | 25.81 | 10.335 | 0.84 | 0.06 | 0.71 | 0.19 | 2,096 | 1,532 |
PSID | 56 | 26.39 | 10.62 | 0.86 | 0.02 | 0.55 | 0.15 | 1,794 | 1,126 |
(2.56) | (0.63) | (0.13) | (0.06) | (0.13) | (0.13) | (0.12) | (1,406) | ||
CPS | 119 | 26.91 | 10.52 | 0.86 | 0.04 | 0.64 | 0.19 | 2,110 | 1,396 |
(1.25) | (0.32) | (0.06) | (0.04) | (0.07) | (0.06) | (841) | (563) |
Standard error on the difference in means with NSW sample is given in parentheses.
Propensity score is best explained using actual data. We will use data from Dehejia and Wahba (2002) for the following exercises. But before using the propensity score methods for estimating treatment effects, let’s calculate the average treatment effect from the actual experiment. Using the following code, we calculate that the NSW job-training program caused real earnings in 1978 to increase by $1,794.343.
Code
use https://github.com/scunning1975/mixtape/raw/master/nsw_mixtape.dta, clear
su re78 if treat
gen y1 = r(mean)

su re78 if treat==0
gen y0 = r(mean)

gen ate = y1-y0
su ate
di 6349.144 - 4554.801

* ATE is 1794.34
drop if treat==0
drop y1 y0 ate
compress
Code
library(tidyverse)
library(haven)
read_data <- function(df)
{
  full_path <- paste("https://github.com/scunning1975/mixtape/raw/master/",
                     df, sep = "")
  df <- read_dta(full_path)
  return(df)
}

nsw_dw <- read_data("nsw_mixtape.dta")

nsw_dw %>%
  filter(treat == 1) %>%
  summary(re78)

mean1 <- nsw_dw %>%
  filter(treat == 1) %>%
  pull(re78) %>%
  mean()

nsw_dw$y1 <- mean1

nsw_dw %>%
  filter(treat == 0) %>%
  summary(re78)

mean0 <- nsw_dw %>%
  filter(treat == 0) %>%
  pull(re78) %>%
  mean()

nsw_dw$y0 <- mean0

ate <- unique(nsw_dw$y1 - nsw_dw$y0)

nsw_dw <- nsw_dw %>%
  filter(treat == 1) %>%
  select(-y1, -y0)
Code
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from itertools import combinations
import plotnine as p
# read data
import ssl
ssl._create_default_https_context = ssl._create_unverified_context

def read_data(file):
    return pd.read_stata("https://github.com/scunning1975/mixtape/raw/master/" + file)

nsw_dw = read_data('nsw_mixtape.dta')

mean1 = nsw_dw[nsw_dw.treat==1].re78.mean()
mean0 = nsw_dw[nsw_dw.treat==0].re78.mean()
ate = np.unique(mean1 - mean0)[0]
print("The experimental ATE estimate is {:.2f}".format(ate))
Next we want to go through several examples in which we estimate the average treatment effect or some of its variants, such as the average treatment effect on the treated group or the average treatment effect on the untreated group. But here, rather than using the experimental control group from the original randomized experiment, we will use the non-experimental control group from the Current Population Survey. It is very important to stress that while the treatment group is an experimental group, the control group now consists of a random sample of Americans from that time period. Thus, the control group suffers from extreme selection bias, since most Americans would not function as counterfactuals for the distressed group of workers who selected into the NSW program. In the following, we will append the CPS data to the experimental data and estimate the propensity score using a logit so as to be consistent with Dehejia and Wahba (2002).
Code
* Reload experimental group data
use https://github.com/scunning1975/mixtape/raw/master/nsw_mixtape.dta, clear
drop if treat==0

* Now merge in the CPS controls from footnote 2 of Table 2 (Dehejia and Wahba 2002)
append using https://github.com/scunning1975/mixtape/raw/master/cps_mixtape.dta

gen agesq=age*age
gen agecube=age*age*age
gen edusq=educ*edu
gen u74 = 0 if re74!=.
replace u74 = 1 if re74==0
gen u75 = 0 if re75!=.
replace u75 = 1 if re75==0
gen interaction1 = educ*re74
gen re74sq=re74^2
gen re75sq=re75^2
gen interaction2 = u74*hisp

* Now estimate the propensity score
logit treat age agesq agecube educ edusq marr nodegree black hisp re74 re75 u74 u75 interaction1
predict pscore

* Checking mean propensity scores for treatment and control groups
su pscore if treat==1, detail
su pscore if treat==0, detail

* Now look at the propensity score distribution for treatment and control groups
histogram pscore, by(treat) binrescale
Code
library(tidyverse)
library(haven)

read_data <- function(df)
{
  full_path <- paste("https://github.com/scunning1975/mixtape/raw/master/",
                     df, sep = "")
  df <- read_dta(full_path)
  return(df)
}

nsw_dw_cpscontrol <- read_data("cps_mixtape.dta") %>%
  bind_rows(nsw_dw) %>%
  mutate(agesq = age^2,
         agecube = age^3,
         educsq = educ*educ,
         u74 = case_when(re74 == 0 ~ 1, TRUE ~ 0),
         u75 = case_when(re75 == 0 ~ 1, TRUE ~ 0),
         interaction1 = educ*re74,
         re74sq = re74^2,
         re75sq = re75^2,
         interaction2 = u74*hisp)

# estimating
logit_nsw <- glm(treat ~ age + agesq + agecube + educ + educsq +
                   marr + nodegree + black + hisp + re74 + re75 + u74 +
                   u75 + interaction1, family = binomial(link = "logit"),
                 data = nsw_dw_cpscontrol)

nsw_dw_cpscontrol <- nsw_dw_cpscontrol %>%
  mutate(pscore = logit_nsw$fitted.values)

# mean pscore
pscore_control <- nsw_dw_cpscontrol %>%
  filter(treat == 0) %>%
  pull(pscore) %>%
  mean()

pscore_treated <- nsw_dw_cpscontrol %>%
  filter(treat == 1) %>%
  pull(pscore) %>%
  mean()

# histogram
nsw_dw_cpscontrol %>%
  filter(treat == 0) %>%
  ggplot() +
  geom_histogram(aes(x = pscore))

nsw_dw_cpscontrol %>%
  filter(treat == 1) %>%
  ggplot() +
  geom_histogram(aes(x = pscore))
Code
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from itertools import combinations
import plotnine as p

# read data
import ssl
ssl._create_default_https_context = ssl._create_unverified_context

def read_data(file):
    return pd.read_stata("https://github.com/scunning1975/mixtape/raw/master/" + file)

# Prepare data for logit
nsw_dw_cpscontrol = read_data('cps_mixtape.dta')

nsw_dw_cpscontrol = pd.concat((nsw_dw_cpscontrol, nsw_dw))
nsw_dw_cpscontrol['u74'], nsw_dw_cpscontrol['u75'] = 0, 0
nsw_dw_cpscontrol.loc[nsw_dw_cpscontrol.re74==0, 'u74'] = 1
nsw_dw_cpscontrol.loc[nsw_dw_cpscontrol.re75==0, 'u75'] = 1

# estimating propensity score
logit_nsw = smf.glm(formula="""treat ~ age + I(age**2) + I(age**3) + educ + I(educ**2) +
                    marr + nodegree + black + hisp + re74 + re75 + u74 + u75 + educ*re74""",
                    family=sm.families.Binomial(),
                    data=nsw_dw_cpscontrol).fit()

nsw_dw_cpscontrol['pscore'] = logit_nsw.predict(nsw_dw_cpscontrol)

nsw_dw_cpscontrol.groupby('treat')['pscore'].mean()

p.ggplot(nsw_dw_cpscontrol, p.aes(x='pscore')) + p.geom_histogram(bins=50) + p.facet_wrap("treat", scales='free')
The propensity score is the fitted value of the logit model. Put differently, we used the estimated coefficients from that logit regression to compute each unit's conditional probability of treatment, assuming that probabilities are based on the cumulative logistic distribution:

$$\widehat{p}(X_i) = \Pr\big(D_i = 1 \mid X_i\big) = \frac{e^{X_i \widehat{\beta}}}{1 + e^{X_i \widehat{\beta}}}$$
As I said earlier, the propensity score uses the fitted values from the maximum likelihood regression to calculate each unit's conditional probability of treatment regardless of actual treatment status. The propensity score is just the predicted conditional probability of treatment, or the fitted value, for each unit. It is advisable to use maximum likelihood when estimating the propensity score so that the fitted values are in the range $[0,1]$; a linear probability model, by contrast, can produce fitted probabilities below 0 or above 1.
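To make that last point concrete, here is a tiny sketch of my own (not the book's code) showing that the logistic CDF maps any value of the linear index into the open interval (0, 1), which is why the fitted values from the logit above are proper probabilities. It assumes the nsw_dw_cpscontrol data frame with the pscore column created above.

Code

import numpy as np

# logistic CDF: maps any real-valued index into the open interval (0, 1)
def logistic_cdf(z):
    return 1.0 / (1.0 + np.exp(-z))

print(logistic_cdf(np.array([-10.0, 0.0, 10.0])))  # roughly [0.00005, 0.5, 0.99995]

# the fitted propensity scores from the logit above therefore never leave (0, 1)
print(nsw_dw_cpscontrol["pscore"].min(), nsw_dw_cpscontrol["pscore"].max())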
The definition of the propensity score is the selection probability conditional on the confounding variables: $p(X) = \Pr(D = 1 \mid X)$. The summaries below show its distribution separately for the treatment group and the CPS control group.
Treatment group

| Percentiles | Values | Smallest |
|---|---|---|
| 1% | 0.0011757 | 0.0010614 |
| 5% | 0.0072641 | 0.0011757 |
| 10% | 0.0260147 | 0.0018463 |
| 25% | 0.1322174 | 0.0020981 |
| 50% | 0.4001992 | |

| Percentiles | Values | Largest |
|---|---|---|
| 75% | 0.6706164 | 0.935645 |
| 90% | 0.8866026 | 0.93718 |
| 95% | 0.9021386 | 0.9374608 |
| 99% | 0.9374608 | 0.9384554 |

CPS control group

| Percentiles | Values | Smallest |
|---|---|---|
| 1% | 5.90e-07 | 1.18e-09 |
| 5% | 1.72e-06 | 4.07e-09 |
| 10% | 3.58e-06 | 4.24e-09 |
| 25% | 0.0000193 | 1.55e-08 |
| 50% | 0.0001187 | |

| Percentiles | Values | Largest |
|---|---|---|
| 75% | 0.0009635 | 0.8786677 |
| 90% | 0.0066319 | 0.8893389 |
| 95% | 0.0163109 | 0.9099022 |
| 99% | 0.1551548 | 0.9239787 |
Common support is required to calculate any well-defined average treatment effect; without it, you will just get some kind of weird weighted average treatment effect for only those regions that do have common support. The reason it is "weird" is that such an average doesn't correspond to any of the treatment effects the policymaker cares about. Common support requires that for each value of $X$ there is a positive probability of being both treated and untreated, or $0 < \Pr(D_i = 1 \mid X_i) < 1$.
The mean value of the propensity score for the treatment group is 0.43, and the mean for the CPS control group is 0.007. The 50th percentile for the treatment group is 0.4, but the control group doesn’t reach that high a number until the 99th percentile. Let’s look at the distribution of the propensity score for the two groups using a histogram now.
These two simple diagnostic tests show what is going to be a problem later when we use inverse probability weighting. The probability of treatment is spread out across the units in the treatment group, but there is a very large mass of nearly zero propensity scores in the CPS. How do we interpret this? What this means is that the characteristics of individuals in the treatment group are rare in the CPS sample. This is not surprising given the strong negative selection into treatment. These individuals are younger, less likely to be married, and more likely to be uneducated and a minority. The lesson is, if the two groups are significantly different on background characteristics, then the propensity scores will have grossly different distributions by treatment status. We will discuss this in greater detail later.
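Before moving on, here is a small diagnostic sketch of my own (not from the book) that puts numbers on the overlap problem just described. It assumes the combined nsw_dw_cpscontrol data frame with the estimated pscore column from the Python code above.

Code

# share of CPS controls with essentially zero estimated probability of treatment
treated = nsw_dw_cpscontrol.loc[nsw_dw_cpscontrol.treat == 1, "pscore"]
control = nsw_dw_cpscontrol.loc[nsw_dw_cpscontrol.treat == 0, "pscore"]
print("Controls with pscore < 0.01: {:.1%}".format((control < 0.01).mean()))

# share of controls that fall below even the smallest treated propensity score
print("Controls below the treated minimum: {:.1%}".format((control < treated.min()).mean()))

# how many units of each group survive the 0.1-0.9 trimming rule used later
print(nsw_dw_cpscontrol[nsw_dw_cpscontrol.pscore.between(0.1, 0.9)].treat.value_counts())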
For now, let’s look at the treatment parameter under both assumptions.
The conditional independence assumption allows us to make the following substitution:

$$E\big[Y^1 \mid D = 1, X\big] = E\big[Y^1 \mid D = 0, X\big] = E\big[Y^1 \mid X\big]$$

and likewise for $Y^0$, so that, given common support, the simple difference in observed mean outcomes between treatment and control units at each value of $X$ identifies the average treatment effect at that value of $X$. The propensity score theorem (Rosenbaum and Rubin 1983) then states that if CIA holds, so that $(Y^1, Y^0) \perp D \mid X$, then $(Y^1, Y^0) \perp D \mid p(X)$, where $p(X) = \Pr(D = 1 \mid X)$ is the propensity score.
This is an extremely valuable theorem because stratifying on $X$ runs directly into the curse of dimensionality: with even a moderate number of covariates, many strata will contain only treated or only control units in finite samples. The propensity score is a scalar, so stratifying on it avoids that sparseness problem while still satisfying the backdoor criterion whenever conditioning on $X$ does.
The proof of the propensity score theorem is fairly straightforward, as it's just an application of the law of iterated expectations with nested conditioning.15 If we can show that the probability an individual receives treatment conditional on potential outcomes and the propensity score is not a function of potential outcomes, then we will have proved that there is independence between the potential outcomes and the treatment conditional on the propensity score, $p(X)$. Because the treatment indicator is binary, we can write

$$E\big[D \mid Y^1, Y^0, p(X)\big] = 1 \cdot \Pr\big(D=1 \mid Y^1, Y^0, p(X)\big) + 0 \cdot \Pr\big(D=0 \mid Y^1, Y^0, p(X)\big)$$
and the second term cancels out because it's multiplied by zero, leaving $\Pr\big(D=1 \mid Y^1, Y^0, p(X)\big) = E\big[D \mid Y^1, Y^0, p(X)\big]$. The formal proof is as follows:

$$
\begin{aligned}
\Pr\big(D=1 \mid Y^1, Y^0, p(X)\big) &= E\big[D \mid Y^1, Y^0, p(X)\big] \\
&= E\Big[E\big[D \mid Y^1, Y^0, p(X), X\big] \mid Y^1, Y^0, p(X)\Big] \\
&= E\Big[E\big[D \mid Y^1, Y^0, X\big] \mid Y^1, Y^0, p(X)\Big] \\
&= E\Big[E\big[D \mid X\big] \mid Y^1, Y^0, p(X)\Big] \\
&= E\big[p(X) \mid Y^1, Y^0, p(X)\big] \\
&= p(X)
\end{aligned}
$$

where the second line uses the law of iterated expectations, the third uses the fact that $p(X)$ is a function of $X$, and the fourth uses the conditional independence assumption. Using a similar argument, we obtain:

$$\Pr\big(D=1 \mid p(X)\big) = E\big[D \mid p(X)\big] = E\Big[E\big[D \mid X\big] \mid p(X)\Big] = E\big[p(X) \mid p(X)\big] = p(X)$$

and therefore $\Pr\big(D=1 \mid Y^1, Y^0, p(X)\big) = \Pr\big(D=1 \mid p(X)\big) = p(X)$, which does not depend on the potential outcomes. That is exactly the statement that $(Y^1, Y^0) \perp D \mid p(X)$.
Like the omitted variable bias formula for regression, the propensity score theorem says that you need only control for covariates that determine the likelihood a unit receives the treatment. But it also says something more than that. It technically says that the only covariate you need to condition on is the propensity score. All of the information from the $X$ matrix has been collapsed into a single number: the propensity score.
A corollary of the propensity score theorem, therefore, states that given CIA, we can estimate average treatment effects by weighting appropriately the simple difference in means.16
Because the propensity score is a function of $X$, we know that

$$\Pr\big(D=1 \mid X, p(X)\big) = \Pr\big(D=1 \mid X\big) = p(X)$$
Therefore, conditional on the propensity score, the probability that $D=1$ does not depend on $X$ any longer. That is, $D$ and $X$ are independent of one another conditional on the propensity score, or $D \perp X \mid p(X)$. This is the balancing property of the propensity score: conditional on $p(X)$, the distribution of the covariates is the same for treatment and control units, $\Pr\big(X \mid D=1, p(X)\big) = \Pr\big(X \mid D=0, p(X)\big)$.
Notice that there exist two paths between $X$ and $D$: the directed path $X \rightarrow p(X) \rightarrow D$ and the backdoor path $X \rightarrow Y \leftarrow D$. The backdoor path is blocked by the collider $Y$, so no systematic correlation between $X$ and $D$ flows through it; all of the systematic correlation flows through the directed path. Once we condition on the propensity score, that path is blocked as well, and $D$ and $X$ become statistically independent, which is precisely the balancing property just described.
5.3.5 Weighting on the propensity score
There are several ways researchers can estimate average treatment effects using an estimated propensity score. Busso, DiNardo, and McCrary (2014) examined the properties of various approaches and found that inverse probability weighting was competitive in several simulations. As there are different ways in which the weights are incorporated into a weighting design, I discuss a few canonical versions of the method of inverse probability weighting and associated methods for inference. This is an expansive area in causal inference econometrics, so consider this merely an overview of and introduction to the main concepts.
Assuming that CIA holds in our data, then one way we can estimate treatment effects is to use a weighting procedure in which each individual's propensity score is a weight of that individual's outcome (Imbens 2000). When aggregated, this has the potential to identify some average treatment effect. This estimator is based on earlier work in survey methodology first proposed by Horvitz and Thompson (1952). The weight enters the expression differently depending on each unit's treatment status and takes on two different forms depending on whether the target parameter is the ATE or the ATT (or the ATU, which is not shown here):

$$E\big[Y^1 - Y^0\big] = E\left[\frac{Y\,\big(D - p(X)\big)}{p(X)\,\big(1-p(X)\big)}\right]$$

$$E\big[Y^1 - Y^0 \mid D=1\big] = \frac{1}{\Pr(D=1)}\,E\left[\frac{Y\,\big(D - p(X)\big)}{1-p(X)}\right]$$
A proof for the ATE expression follows from the law of iterated expectations: conditional on $X$, and using CIA,

$$E\left[\frac{Y\,\big(D - p(X)\big)}{p(X)\,\big(1-p(X)\big)}\,\middle|\,X\right] = E\big[Y^1 \mid X\big] - E\big[Y^0 \mid X\big]$$

and the result follows from integrating over the distribution of $X$.
The sample versions of both ATE and ATT are obtained by a two-step estimation procedure. In the first step, the researcher estimates the propensity score using logit or probit. In the second step, the researcher uses the estimated score to produce sample versions of one of the average treatment effect estimators shown above. Those sample versions can be written as follows:

$$\widehat{\delta}_{ATE} = \frac{1}{N}\sum_{i=1}^{N} \frac{\big(D_i - \widehat{p}(X_i)\big)\,Y_i}{\widehat{p}(X_i)\,\big(1-\widehat{p}(X_i)\big)}$$

$$\widehat{\delta}_{ATT} = \frac{1}{N_T}\sum_{i=1}^{N} \frac{\big(D_i - \widehat{p}(X_i)\big)\,Y_i}{1-\widehat{p}(X_i)}$$
We have a few options for estimating the variance of this estimator, but one is simply to use bootstrapping. First created by Efron (1979), bootstrapping is a procedure used to estimate the variance of an estimator. In the context of inverse probability weighting, we would repeatedly draw (“with replacement”) a random sample of our original data and then use that smaller sample to calculate the sample analogs of the ATE or ATT. More specifically, using the smaller “bootstrapped” data, we would first estimate the propensity score and then use the estimated propensity score to calculate sample analogs of the ATE or ATT over and over to obtain a distribution of treatment effects corresponding to different cuts of the data itself.18 If we do this 1,000 or 10,000 times, we get a distribution of parameter estimates from which we can calculate the standard deviation. This standard deviation becomes like a standard error and gives us a measure of the dispersion of the parameter estimate under uncertainty regarding the sample itself.19 Adudumilli (2018) and Bodory et al. (2020) discuss the performance of various bootstrapping procedures, such as the standard bootstrap or the wild bootstrap. I encourage you to read these papers more closely when choosing which bootstrap is suitable for your question.
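Here is a minimal sketch of that bootstrap, written under my own assumptions rather than taken from the book: resample rows with replacement, re-estimate the propensity score on each bootstrap sample, recompute a normalized inverse probability weighted estimate of the ATT, and take the standard deviation of the estimates across replications. It reuses the combined nsw_dw_cpscontrol data frame and the logit specification from above; the 500 replications are only to keep the run time short.

Code

import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

formula = """treat ~ age + I(age**2) + I(age**3) + educ + I(educ**2) +
             marr + nodegree + black + hisp + re74 + re75 + u74 + u75 + educ*re74"""

def ipw_att(df):
    # step 1: estimate the propensity score on this bootstrap sample
    ps = smf.glm(formula=formula, family=sm.families.Binomial(), data=df).fit().predict(df).to_numpy()
    d = df["treat"].to_numpy()
    y = df["re78"].to_numpy()
    # step 2: normalized ATT weights (treated weight 1, controls weighted by ps/(1-ps))
    w0 = (1 - d) * ps / (1 - ps)
    return y[d == 1].mean() - np.sum(w0 * y) / np.sum(w0)

rng = np.random.default_rng(1)
estimates = []
for _ in range(500):
    idx = rng.integers(0, len(nsw_dw_cpscontrol), len(nsw_dw_cpscontrol))
    estimates.append(ipw_att(nsw_dw_cpscontrol.iloc[idx]))

print("Bootstrap SE of the normalized IPW ATT: {:.2f}".format(np.std(estimates)))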
The sensitivity of inverse probability weighting to extreme values of the propensity score has led some researchers to propose an alternative that can handle extremes a bit better. Hirano and Imbens (2001) propose an inverse probability weighting estimator of the average treatment effect that normalizes the weights by the sum of the inverse propensity scores within the treated and control groups, as opposed to giving each observation an equal weight of $1/N$:

$$\widehat{\delta}_{ATE}^{\,norm} = \frac{\sum_{i=1}^{N} D_i Y_i / \widehat{p}(X_i)}{\sum_{i=1}^{N} D_i / \widehat{p}(X_i)} \;-\; \frac{\sum_{i=1}^{N} (1-D_i) Y_i / \big(1-\widehat{p}(X_i)\big)}{\sum_{i=1}^{N} (1-D_i) / \big(1-\widehat{p}(X_i)\big)}$$
Code
* Manual with non-normalized weights using all the data
gen d1=treat/pscore
gen d0=(1-treat)/(1-pscore)

egen s1=sum(d1)
egen s0=sum(d0)

gen y1=treat*re78/pscore
gen y0=(1-treat)*re78/(1-pscore)
gen ht=y1-y0

* Manual with normalized weights
replace y1=(treat*re78/pscore)/(s1/_N)
replace y0=((1-treat)*re78/(1-pscore))/(s0/_N)
gen norm=y1-y0

su ht norm

* ATT under non-normalized weights is -$11,876
* ATT under normalized weights is -$7,238

drop d1 d0 s1 s0 y1 y0 ht norm

* Trimming the propensity score
drop if pscore <= 0.1
drop if pscore >= 0.9

* Manual with non-normalized weights using trimmed data
gen d1=treat/pscore
gen d0=(1-treat)/(1-pscore)

egen s1=sum(d1)
egen s0=sum(d0)

gen y1=treat*re78/pscore
gen y0=(1-treat)*re78/(1-pscore)
gen ht=y1-y0

* Manual with normalized weights using trimmed data
replace y1=(treat*re78/pscore)/(s1/_N)
replace y0=((1-treat)*re78/(1-pscore))/(s0/_N)
gen norm=y1-y0

su ht norm

* ATT under non-normalized weights is $2,006
* ATT under normalized weights is $1,806
Code
library(tidyverse)
library(haven)

#continuation
N <- nrow(nsw_dw_cpscontrol)

#- Manual with non-normalized weights using all data
nsw_dw_cpscontrol <- nsw_dw_cpscontrol %>%
  mutate(d1 = treat/pscore,
         d0 = (1-treat)/(1-pscore))

s1 <- sum(nsw_dw_cpscontrol$d1)
s0 <- sum(nsw_dw_cpscontrol$d0)

nsw_dw_cpscontrol <- nsw_dw_cpscontrol %>%
  mutate(y1 = treat * re78/pscore,
         y0 = (1-treat) * re78/(1-pscore),
         ht = y1 - y0)

#- Manual with normalized weights
nsw_dw_cpscontrol <- nsw_dw_cpscontrol %>%
  mutate(y1 = (treat*re78/pscore)/(s1/N),
         y0 = ((1-treat)*re78/(1-pscore))/(s0/N),
         norm = y1 - y0)

nsw_dw_cpscontrol %>%
  pull(ht) %>%
  mean()

nsw_dw_cpscontrol %>%
  pull(norm) %>%
  mean()

#-- trimming propensity score
nsw_dw_cpscontrol <- nsw_dw_cpscontrol %>%
  select(-d1, -d0, -y1, -y0, -ht, -norm) %>%
  filter(!(pscore >= 0.9)) %>%
  filter(!(pscore <= 0.1))

N <- nrow(nsw_dw_cpscontrol)

#- Manual with non-normalized weights using trimmed data
nsw_dw_cpscontrol <- nsw_dw_cpscontrol %>%
  mutate(d1 = treat/pscore,
         d0 = (1-treat)/(1-pscore))

s1 <- sum(nsw_dw_cpscontrol$d1)
s0 <- sum(nsw_dw_cpscontrol$d0)

nsw_dw_cpscontrol <- nsw_dw_cpscontrol %>%
  mutate(y1 = treat * re78/pscore,
         y0 = (1-treat) * re78/(1-pscore),
         ht = y1 - y0)

#- Manual with normalized weights with trimmed data
nsw_dw_cpscontrol <- nsw_dw_cpscontrol %>%
  mutate(y1 = (treat*re78/pscore)/(s1/N),
         y0 = ((1-treat)*re78/(1-pscore))/(s0/N),
         norm = y1 - y0)

nsw_dw_cpscontrol %>%
  pull(ht) %>%
  mean()

nsw_dw_cpscontrol %>%
  pull(norm) %>%
  mean()
Code
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from itertools import combinations
import plotnine as p

# read data
import ssl
ssl._create_default_https_context = ssl._create_unverified_context

def read_data(file):
    return pd.read_stata("https://github.com/scunning1975/mixtape/raw/master/" + file)

# Prepare data for logit
nsw_dw_cpscontrol = read_data('cps_mixtape.dta')

nsw_dw_cpscontrol = pd.concat((nsw_dw_cpscontrol, nsw_dw))
nsw_dw_cpscontrol[['u74', 'u75']] = 0
nsw_dw_cpscontrol.loc[nsw_dw_cpscontrol.re74==0, 'u74'] = 1
nsw_dw_cpscontrol.loc[nsw_dw_cpscontrol.re75==0, 'u75'] = 1

# estimating propensity score (I() wrappers so patsy treats these as squared
# and cubed terms, matching the earlier logit specification)
logit_nsw = smf.glm(formula="""treat ~ age + I(age**2) + I(age**3) + educ + I(educ**2) +
                    marr + nodegree + black + hisp + re74 + re75 + u74 + u75 + educ*re74""",
                    family=sm.families.Binomial(),
                    data=nsw_dw_cpscontrol).fit()

nsw_dw_cpscontrol['pscore'] = logit_nsw.predict(nsw_dw_cpscontrol)

# continuation
N = nsw_dw_cpscontrol.shape[0]

# Manual with non-normalized weights using all data
nsw_dw_cpscontrol['d1'] = nsw_dw_cpscontrol.treat/nsw_dw_cpscontrol.pscore
nsw_dw_cpscontrol['d0'] = (1-nsw_dw_cpscontrol.treat)/(1-nsw_dw_cpscontrol.pscore)

s1 = nsw_dw_cpscontrol.d1.sum()
s0 = nsw_dw_cpscontrol.d0.sum()

nsw_dw_cpscontrol['y1'] = nsw_dw_cpscontrol.treat * nsw_dw_cpscontrol.re78 / nsw_dw_cpscontrol.pscore
nsw_dw_cpscontrol['y0'] = (1 - nsw_dw_cpscontrol.treat) * nsw_dw_cpscontrol.re78 / (1 - nsw_dw_cpscontrol.pscore)
nsw_dw_cpscontrol['ht'] = nsw_dw_cpscontrol['y1'] - nsw_dw_cpscontrol['y0']

te_1 = nsw_dw_cpscontrol.ht.mean()
print("Treatment Effect (non-normalized, all data): {:.2f}".format(te_1))

# Manual with normalized weights using all data
nsw_dw_cpscontrol['y1'] = nsw_dw_cpscontrol.treat * nsw_dw_cpscontrol.re78 / nsw_dw_cpscontrol.pscore
nsw_dw_cpscontrol['y1'] /= s1/N
nsw_dw_cpscontrol['y0'] = (1 - nsw_dw_cpscontrol.treat) * nsw_dw_cpscontrol.re78 / (1 - nsw_dw_cpscontrol.pscore)
nsw_dw_cpscontrol['y0'] /= s0/N
nsw_dw_cpscontrol['ht'] = nsw_dw_cpscontrol['y1'] - nsw_dw_cpscontrol['y0']

te_2 = nsw_dw_cpscontrol.ht.mean()
print("Treatment Effect (normalized, all data): {:.2f}".format(te_2))

# Trimming the propensity score
nsw_dw_trimmed = nsw_dw_cpscontrol.drop(['d1', 'd0', 'y1', 'y0'], axis=1)
nsw_dw_trimmed = nsw_dw_trimmed[nsw_dw_trimmed.pscore.between(.1, .9)]
N = nsw_dw_trimmed.shape[0]

# Recompute the normalizing sums on the trimmed sample (as in the Stata and R code)
nsw_dw_trimmed['d1'] = nsw_dw_trimmed.treat/nsw_dw_trimmed.pscore
nsw_dw_trimmed['d0'] = (1-nsw_dw_trimmed.treat)/(1-nsw_dw_trimmed.pscore)
s1 = nsw_dw_trimmed.d1.sum()
s0 = nsw_dw_trimmed.d0.sum()

# Manual with non-normalized weights using trimmed data
nsw_dw_trimmed['y1'] = nsw_dw_trimmed.treat * nsw_dw_trimmed.re78 / nsw_dw_trimmed.pscore
nsw_dw_trimmed['y0'] = (1 - nsw_dw_trimmed.treat) * nsw_dw_trimmed.re78 / (1 - nsw_dw_trimmed.pscore)
nsw_dw_trimmed['ht'] = nsw_dw_trimmed['y1'] - nsw_dw_trimmed['y0']

te_3 = nsw_dw_trimmed.ht.mean()
print("Treatment Effect (non-normalized, trimmed data): {:.2f}".format(te_3))

# Manual with normalized weights using trimmed data
nsw_dw_trimmed['y1'] = nsw_dw_trimmed.treat * nsw_dw_trimmed.re78 / nsw_dw_trimmed.pscore
nsw_dw_trimmed['y1'] /= s1/N
nsw_dw_trimmed['y0'] = (1 - nsw_dw_trimmed.treat) * nsw_dw_trimmed.re78 / (1 - nsw_dw_trimmed.pscore)
nsw_dw_trimmed['y0'] /= s0/N
nsw_dw_trimmed['ht'] = nsw_dw_trimmed['y1'] - nsw_dw_trimmed['y0']

te_4 = nsw_dw_trimmed.ht.mean()
print("Treatment Effect (normalized, trimmed data): {:.2f}".format(te_4))
When we estimate the treatment effect using inverse probability weighting with the non-normalized weighting procedure described earlier, we find an estimated ATT of -$11,876. Using the normalized weights, we find an estimated ATT of -$7,238. Both are wildly different from the experimental estimate of $1,794, and even have the wrong sign, which tells us something has gone wrong.
Recall what inverse probability weighting is doing. It is weighting treatment and control units according to the inverse of the estimated propensity score, so units with propensity scores very close to zero or one receive enormous weights and a handful of observations can dominate the estimate. Given the extreme scores we saw in the distributions above, we will need to trim the data.
Now let's repeat the analysis having trimmed the propensity score, keeping only values between 0.1 and 0.9. Now we find $2,006 using the non-normalized weights and $1,806 using the normalized weights. This is very similar to what we know is the true causal effect using the experimental data, which was $1,794, and the normalized weights get us even closer. We still need to calculate standard errors, for instance with a bootstrapping method, but I leave it to you to investigate that more carefully by reading Adudumilli (2018) and Bodory et al. (2020), who, as I mentioned, discuss the performance of various bootstrapping procedures such as the standard bootstrap and the wild bootstrap.
5.3.6 Nearest-neighbor matching
An alternative, very popular approach to inverse probability weighting is matching on the propensity score. This is often done by finding a couple of units with comparable propensity scores from the control unit donor pool within some ad hoc chosen radius distance of the treated unit’s own propensity score. The researcher then averages the outcomes and then assigns that average as an imputation to the original treated unit as a proxy for the potential outcome under counterfactual control. Then effort is made to enforce common support through trimming.
But this method has been criticized by King and Nielsen (2019). The King and Nielsen (2019) critique is not of the propensity score itself. For instance, the critique does not apply to stratification based on the propensity score (Rosenbaum and Rubin 1983), regression adjustment, or inverse probability weighting. The critique is aimed only at nearest-neighbor matching and is related to the forced balance through trimming, as well as myriad other common research choices made in the course of a project that together ultimately amplify bias. King and Nielsen (2019) write: "The more balanced the data, or the more balance it becomes by trimming some of the observations through matching, the more likely propensity score matching will degrade inferences" (p.1).
Nevertheless, nearest-neighbor matching, along with inverse probability weighting, is perhaps the most common way an estimated propensity score is used to estimate treatment effects. Nearest-neighbor matching on the propensity score pairs each treatment unit $i$ with the control unit (or units) whose propensity score is closest to unit $i$'s. The matched control outcome, or the average over the matched set, serves as the imputed counterfactual, and the ATT is estimated as

$$\widehat{\delta}_{ATT} = \frac{1}{N_T}\sum_{D_i=1}\big(Y_i - Y_{j(i)}\big)$$

where $Y_{j(i)}$ is the (average) outcome of the control unit(s) matched to treated unit $i$ and $N_T$ is the number of treated units.
Code
teffects psmatch (re78) (treat age agesq agecube educ edusq marr nodegree black hisp re74 re75 u74 u75 interaction1, logit), atet gen(pstub_cps) nn(5)
Code
library(MatchIt)
library(Zelig)

m_out <- matchit(treat ~ age + agesq + agecube + educ +
                   educsq + marr + nodegree +
                   black + hisp + re74 + re75 + u74 + u75 + interaction1,
                 data = nsw_dw_cpscontrol, method = "nearest",
                 distance = "logit", ratio = 5)

m_data <- match.data(m_out)

z_out <- zelig(re78 ~ treat + age + agesq + agecube + educ +
                 educsq + marr + nodegree +
                 black + hisp + re74 + re75 + u74 + u75 + interaction1,
               model = "ls", data = m_data)

x_out <- setx(z_out, treat = 0)
x1_out <- setx(z_out, treat = 1)

s_out <- sim(z_out, x = x_out, x1 = x1_out)

summary(s_out)
Code
# Missing python code
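The book's Python block is missing here, so the following is only my rough hand-rolled sketch of five-nearest-neighbor matching on the propensity score (matching with replacement, no caliper, no bias adjustment); its point estimate will not exactly reproduce the teffects psmatch or MatchIt output above, and it reports no standard error. It assumes the nsw_dw_cpscontrol data frame with the estimated pscore column.

Code

import numpy as np

treated = nsw_dw_cpscontrol[nsw_dw_cpscontrol.treat == 1]
control = nsw_dw_cpscontrol[nsw_dw_cpscontrol.treat == 0]

c_ps = control["pscore"].to_numpy()
c_y = control["re78"].to_numpy()

att_terms = []
for _, row in treated.iterrows():
    # the five control units closest to this treated unit on the propensity score
    nearest = np.argsort(np.abs(c_ps - row["pscore"]))[:5]
    y0_hat = c_y[nearest].mean()               # imputed counterfactual outcome
    att_terms.append(row["re78"] - y0_hat)     # treated-minus-matched difference

print("5-NN propensity score matching ATT: {:.2f}".format(np.mean(att_terms)))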
I chose to match using five nearest neighbors. In other words, for each treated unit the procedure finds the five nearest units in the control group, where "nearest" is measured as closest on the propensity score itself. Unlike covariate matching, distance here is straightforward because of the dimension reduction afforded by the propensity score. We then average the outcomes of those five matched controls and assign that average to the treatment unit as its counterfactual. Once we have that, we subtract each treated unit's matched control average from its own outcome, sum those differences, and divide by $N_T$, the number of treatment units, to obtain the estimated ATT.
5.3.7 Coarsened exact matching
There are two kinds of matching we’ve reviewed so far. Exact matching matches a treated unit to all of the control units with the same covariate value. But sometimes this is impossible, and therefore there are matching discrepancies. For instance, say that we are matching continuous age and continuous income. The probability we find another person with the exact same value of both is very small, if not zero. This leads therefore to mismatching on the covariates, which introduces bias.
The second kind of matching we've discussed is approximate matching, which specifies a metric for finding control units that are "close" to the treated unit. This requires a distance metric, such as Euclidean, Mahalanobis, or the propensity score. All of these can be implemented in Stata or R.
Iacus, King, and Porro (2012) introduced a kind of exact matching called coarsened exact matching (CEM). The idea is very simple. It's based on the notion that sometimes it's possible to do exact matching once we coarsen the data enough. If we coarsen the data, meaning we create categorical variables (e.g., 0- to 10-year-olds, 11- to 20-year-olds), then oftentimes we can find exact matches. Once we find those matches, we calculate weights on the basis of where a person fits in some strata, and those weights are used in a simple weighted regression.
First, we begin with covariates $X$ and make a copy, $X^*$. Next we coarsen $X^*$ according to user-defined cutpoints (or CEM's automatic binning algorithm), create one stratum per unique value of the coarsened $X^*$, and place each observation into its stratum. Strata that do not contain at least one treated and one control unit are pruned. Treated units then receive a weight of 1, and each control unit in stratum $s$ receives the weight $\frac{m_C}{m_T}\cdot\frac{m_T^s}{m_C^s}$, where $m_T$ and $m_C$ are the total numbers of matched treated and control units and $m_T^s$ and $m_C^s$ are the counts within stratum $s$. These weights are then used in a simple weighted regression. The finer the cutpoints, the smaller the matching discrepancies within strata.
But there are trade-offs. Larger bins mean more coarsening of the data, which results in fewer strata. Fewer strata result in more diverse observations within the same strata and thus higher covariate imbalance. CEM prunes both treatment and control group units, which changes the parameter of interest, but so long as you’re transparent about this and up front, readers may be willing to give you the benefit of the doubt.20 Just know, though, that you are not estimating the ATE or the ATT when you start trimming (just as you aren’t doing so when you trim propensity scores).
The key benefit of CEM is that it is part of a class of matching methods called monotonic imbalance bounding (MIB). MIB methods bound the maximum imbalance in some feature of the empirical distributions by an ex ante decision by the user. In CEM, this ex ante choice is the coarsening decision. By choosing the coarsening beforehand, users can control the amount of imbalance in the matching solution. It’s also very fast.
There are several ways of measuring imbalance, but here we focus on the $\mathcal{L}_1(f,g)$ measure proposed by Iacus, King, and Porro. Coarsen the covariates, cross-tabulate the relative frequencies of treated units ($f$) and control units ($g$) over the resulting multivariate strata, and compute $\mathcal{L}_1 = \tfrac{1}{2}\sum_s \lvert f_s - g_s \rvert$. Perfect global balance corresponds to $\mathcal{L}_1 = 0$, and values approaching 1 indicate complete separation of the two distributions.
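As a rough illustration of that definition (my own sketch, not the book's code), the snippet below coarsens a few covariates into ad hoc bins, cross-tabulates the relative frequencies of treated and control units over the resulting multivariate strata, and takes half the sum of absolute differences. The choice of bins is arbitrary here, which is exactly the point of the measure: imbalance is defined relative to a chosen coarsening.

Code

import numpy as np
import pandas as pd

def l1_imbalance(df, covariates, treat="treat", bins=5):
    # coarsen each covariate into equal-width bins and label each row by its stratum
    coarsened = df[covariates].apply(lambda col: pd.cut(col, bins=bins, labels=False, duplicates="drop"))
    strata = coarsened.astype(str).apply("-".join, axis=1)
    # relative frequencies of treated (f) and control (g) units across strata
    f = strata[df[treat] == 1].value_counts(normalize=True)
    g = strata[df[treat] == 0].value_counts(normalize=True)
    return 0.5 * f.subtract(g, fill_value=0).abs().sum()

print(l1_imbalance(nsw_dw_cpscontrol, ["age", "educ", "re74", "re75"]))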
Now let’s get to the fun part: estimation. We will use the same job-training data we’ve been working with for this estimation.
Code
ssc install cem
* Reload experimental group data
use https://github.com/scunning1975/mixtape/raw/master/nsw_mixtape.dta, clear
drop if treat==0

* Now merge in the CPS controls from footnote 2 of Table 2 (Dehejia and Wahba 2002)
append using https://github.com/scunning1975/mixtape/raw/master/cps_mixtape.dta

gen agesq=age*age
gen agecube=age*age*age
gen edusq=educ*edu
gen u74 = 0 if re74!=.
replace u74 = 1 if re74==0
gen u75 = 0 if re75!=.
replace u75 = 1 if re75==0
gen interaction1 = educ*re74
gen re74sq=re74^2
gen re75sq=re75^2
gen interaction2 = u74*hisp

cem age (10 20 30 40 60) agesq agecube educ edusq marr nodegree black hisp re74 re75 u74 u75 interaction1, treatment(treat)
reg re78 treat [iweight=cem_weights], robust
Code
library(cem)
library(MatchIt)
library(Zelig)
library(tidyverse)
library(estimatr)

m_out <- matchit(treat ~ age + agesq + agecube + educ +
                   educsq + marr + nodegree +
                   black + hisp + re74 + re75 +
                   u74 + u75 + interaction1,
                 data = nsw_dw_cpscontrol,
                 method = "cem",
                 distance = "logit")

m_data <- match.data(m_out)

m_ate <- lm_robust(re78 ~ treat,
                   data = m_data,
                   weights = m_data$weights)
Code
# Missing python code
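The Python block is missing from the book here as well, so the following is only a rough manual sketch of the CEM logic in pandas rather than a port of the Stata or R cem packages: coarsen the covariates (the age cutpoints mirror the Stata call above; the other bins are ad hoc choices of mine), form strata, prune strata that lack either treated or control units, construct the CEM weights described earlier, and run a weighted regression of re78 on treat. Its estimate will differ from the package output because the coarsening differs.

Code

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = nsw_dw_cpscontrol.copy()

# coarsen: age with the Stata cutpoints, schooling and earnings into ad hoc bins
df["age_bin"] = pd.cut(df["age"], bins=[0, 10, 20, 30, 40, 60, np.inf], labels=False)
df["educ_bin"] = pd.cut(df["educ"], bins=4, labels=False, duplicates="drop")
df["re74_bin"] = pd.cut(df["re74"], bins=5, labels=False, duplicates="drop")
df["re75_bin"] = pd.cut(df["re75"], bins=5, labels=False, duplicates="drop")

strata_vars = ["age_bin", "educ_bin", "re74_bin", "re75_bin",
               "marr", "nodegree", "black", "hisp", "u74", "u75"]
df["stratum"] = df[strata_vars].astype(str).apply("-".join, axis=1)

# keep only strata containing at least one treated and one control unit
counts = df.groupby("stratum")["treat"].agg(["sum", "count"])
keep = counts[(counts["sum"] > 0) & (counts["sum"] < counts["count"])].index
matched = df[df["stratum"].isin(keep)].copy()

# CEM weights: 1 for treated units; (m_C/m_T)*(m_T^s/m_C^s) for controls in stratum s
m_t = (matched.treat == 1).sum()
m_c = (matched.treat == 0).sum()
s_t = matched[matched.treat == 1].groupby("stratum").size()
s_c = matched[matched.treat == 0].groupby("stratum").size()
w_control = (m_c / m_t) * (s_t / s_c)
matched["cem_weight"] = np.where(matched.treat == 1, 1.0, matched["stratum"].map(w_control))

wls = smf.wls("re78 ~ treat", data=matched, weights=matched["cem_weight"]).fit()
print(wls.params["treat"])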
The estimated ATE is $2,152, which is larger than our estimated experimental effect. But this ensured a high degree of balance on the covariates, as can be seen from the output from the cem command itself.
| Covariate | L1 | Mean | Min. | 25% | 50% | 75% | Max. |
|---|---|---|---|---|---|---|---|
| age | .08918 | .55337 | 1 | 1 | 0 | 1 | 0 |
| agesq | .1155 | 21.351 | 33 | 35 | 0 | 49 | 0 |
| agecube | .05263 | 626.9 | 817 | 919 | 0 | 1801 | 0 |
| school | 6.0e-16 | 0 | 0 | 0 | 0 | 0 | 0 |
| schoolsq | 5.4e-16 | 0 | 0 | 0 | 0 | 0 | 0 |
| married | 1.1e-16 | 0 | 0 | 0 | 0 | 0 | 0 |
| nodegree | 4.7e-16 | 0 | 0 | 0 | 0 | 0 | 0 |
| black | 4.7e-16 | 0 | 0 | 0 | 0 | 0 | 0 |
| hispanic | 7.1e-17 | 0 | 0 | 0 | 0 | 0 | 0 |
| re74 | .06096 | 42.399 | 0 | 0 | 0 | 0 | |
| re75 | .03756 | 0 | 0 | 0 | | | |
| u74 | 1.9e-16 | 0 | 0 | 0 | 0 | 0 | 0 |
| u75 | 2.5e-16 | 0 | 0 | 0 | 0 | 0 | 0 |
| interaction1 | .06535 | 425.68 | 0 | 0 | 0 | 0 | |
As can be seen from Table 5.17, the values of $\mathcal{L}_1$ are small for every covariate, and for most of them the differences in means and in the marginal quantiles are at or near zero, indicating a high degree of balance between the treatment and control groups after coarsened exact matching.
5.4 Conclusion
Matching methods are an important member of the causal inference arsenal. Propensity scores are an excellent tool to check the balance and overlap of covariates. It's an under-appreciated diagnostic, and one that you might miss if you only ran regressions. There are extensions for more than two treatments, such as multinomial models, but I don't cover those here. The propensity score can make groups comparable, but only on the variables used to estimate the propensity score in the first place. It is an area that continues to advance, including covariate balancing (Imai and Ratkovic 2013; Zubizarreta 2015; Zhao 2019) and doubly robust estimators (Bang and Robins 2005). Consider this chapter more about the mechanics of matching when you have exact and approximate matching situations.
Learning about the propensity score is particularly valuable given that it appears to have a very long half-life. For instance, propensity scores make their way into other contemporary designs too, such as difference-in-differences (Sant’Anna and Zhao 2018). So investing in a basic understanding of these ideas and methods is likely worthwhile. You never know when the right project comes along for which these methods are the perfect solution, so there’s no intelligent reason to write them off.
But remember, every matching solution to a causality problem requires a credible belief that the backdoor criterion can be achieved by conditioning on some matrix of covariates $X$. If selection into treatment is driven by unobservables that also affect the outcome, then no amount of conditioning on observables, however sophisticated the estimator, will identify the causal effect. That belief has to be defended with knowledge of the treatment assignment mechanism, not with the matching method itself.
Everything I know about matching I learned from the Northwestern causal inference workshops in lectures taught by the econometrician Alberto Abadie. I would like to acknowledge him as this chapter is heavily indebted to him and those lectures.↩︎
But think about the hurdle that the last criticism actually creates. Just imagine the hypothetical experiment: a large sample of people, with diverse potential outcomes, are assigned to a treatment group (smoker) and control (non-smoker). These people must be dosed with their corresponding treatments long enough for us to observe lung cancer develop—so presumably years of heavy smoking. How could anyone ever run an experiment like that? Who in their right mind would participate!? Just to describe the idealized experiment is to admit it’s impossible. But how do we answer the causal question without independence (i.e., randomization)?↩︎
Interestingly, this issue of covariate balance weaves throughout nearly every identification strategy that we will discuss.↩︎
A truly hilarious assumption, but this is just illustrative.↩︎
I’m sure you can think of others, though, in which case this DAG is misleading.↩︎
Notice the subscript of the summation operator.↩︎
I cannot emphasize this enough—this method, like regression more generally, only has value for your project if you can satisfy the backdoor criterion by conditioning on $X$. If you cannot satisfy the backdoor criterion in your data, then the propensity score does not assist you in identifying a causal effect. At best, it helps you better understand issues related to balance on observables (but not unobservables). It is absolutely critical that your DAG be, in other words, credible, defensible, and accurate, as you depend on those theoretical relationships to design the appropriate identification strategy.↩︎
Remember, randomization means that the treatment was independent of the potential outcomes, so simple difference in means identifies the average treatment effect.↩︎
Lalonde (1986) lists several studies that discuss the findings from the program.↩︎
It’s since been cited a little more than 1,700 times.↩︎
Lalonde reports a couple different diff-in-diff models, but for simplicity, I will only report one.↩︎
Let’s hold off digging into exactly how they used the propensity score to generate these estimates.↩︎
CIA is expressed in different ways according to the econometric or statistical tradition. Rosenbaum and Rubin (1983) called it the ignorable treatment assignment, or unconfoundedness. Barnow, Cain, and Goldberger (1981) and Dale and Krueger (2002) called it selection on observables. In the traditional econometric pedagogy, as we discussed earlier, it’s called the zero conditional mean assumption.↩︎
This all works if we match on the propensity score and then calculate differences in means. Direct propensity score matching works in the same way as the covariate matching we discussed earlier (e.g., nearest-neighbor matching), except that we match on the score instead of the covariates directly.↩︎
Just because something is exchangeable on observables does not make it exchangeable on unobservables. The propensity score theorem does not imply balanced unobserved covariates. See Brooks and Ohsfeldt (2013).↩︎
Bootstrapping and randomization inference are mechanically similar. Each randomizes something over and over, and under each randomization, reestimates treatment effects to obtain a distribution of treatment effects. But that is where the similarity ends. Bootstrapping is a method for computing the variance in an estimator where we take the treatment assignment as given. The uncertainty in bootstrapping stems from the sample, not the treatment assignment. And thus with each bootstrapped sample, we use fewer observations than exist in our real sample. That is not the source of uncertainty in randomization inference, though. In randomization inference, as you recall from the earlier chapter, the uncertainty in question regards the treatment assignment, not the sample. And thus in randomization inference, we randomly assign the treatment in order to reject or fail to reject Fisher’s sharp null of no individual treatment effects.↩︎
Abadie and Imbens (2008) show that the bootstrap fails for matching, but inverse probability weighting is not matching. This may seem like a subtle point, but in my experience many people conflate propensity score based matching with other methods that use the propensity score, calling all of them “matching.” But inverse probability weighting is not a matching procedure. Rather, it is a weighting procedure whose properties differ from that of using imputation and generally the bootstrap is fine.↩︎
They also may not. The methods are easy. It’s convincing readers that’s hard.↩︎