# 9 Difference-in-Differences

## Causal Inference: The Mixtape.

The difference-in-differences design is an early quasi-experimental identification strategy for estimating causal effects that predates the randomized experiment by roughly eighty-five years. It has become the single most popular research design in the quantitative social sciences, and as such, it merits careful study by researchers everywhere.138 In this chapter, I will explain this popular and important research design both in its simplest form, where a group of units is treated at the same time, and the more common form, where groups of units are treated at different points in time. My focus will be on the identifying assumptions needed for estimating treatment effects, including several practical tests and robustness exercises commonly performed, and I will point you to some of the work on difference-in-differences design (DD) being done at the frontier of research. I have included several replication exercises as well.

## 9.1 John Snow’s Cholera Hypothesis

When thinking about situations in which a difference-in-differences design can be used, one usually tries to find an instance where a consequential treatment was given to some people or units but denied to others “haphazardly.” This is sometimes called a “natural experiment” because it is based on naturally occurring variation in some treatment variable that affects only some units over time. All good difference-in-differences designs are based on some kind of natural experiment. And one of the most interesting natural experiments was also one of the first difference-in-differences designs. This is the story of how John Snow convinced the world that cholera was transmitted by water, not air, using an ingenious natural experiment .

Cholera is a vicious disease that attacks victims suddenly, with acute symptoms such as vomiting and diarrhea. In the nineteenth century, it was usually fatal. There were three main epidemics that hit London, and like a tornado, they cut a path of devastation through the city. Snow, a physician, watched as tens of thousands suffered and died from a mysterious plague. Doctors could not help the victims because they were mistaken about the mechanism that caused cholera to spread between people.

The majority medical opinion about cholera transmission at that time was miasma, which said diseases were spread by microscopic poisonous particles that infected people by floating through the air. These particles were thought to be inanimate, and because microscopes at that time had incredibly poor resolution, it would be years before microorganisms would be seen. Treatments, therefore, tended to be designed to stop poisonous dirt from spreading through the air. But tried and true methods like quarantining the sick were strangely ineffective at slowing down this plague.

John Snow worked in London during these epidemics. Originally, Snow—like everyone—accepted the miasma theory and tried many ingenious approaches based on the theory to block these airborne poisons from reaching other people. He went so far as to cover the sick with burlap bags, for instance, but the disease still spread. People kept getting sick and dying. Faced with the theory’s failure to explain cholera, he did what good scientists do—he changed his mind and began look for a new explanation.

Snow developed a novel theory about cholera in which the active agent was not an inanimate particle but was rather a living organism. This microorganism entered the body through food and drink, flowed through the alimentary canal where it multiplied and generated a poison that caused the body to expel water. With each evacuation, the organism passed out of the body and, importantly, flowed into England’s water supply. People unknowingly drank contaminated water from the Thames River, which caused them to contract cholera. As they did, they would evacuate with vomit and diarrhea, which would flow into the water supply again and again, leading to new infections across the city. This process repeated through a multiplier effect which was why cholera would hit the city in epidemic waves.

Snow’s years of observing the clinical course of the disease led him to question the usefulness of miasma to explain cholera. While these were what we would call “anecdote,” the numerous observations and imperfect studies nonetheless shaped his thinking. Here’s just a few of the observations which puzzled him. He noticed that cholera transmission tended to follow human commerce. A sailor on a ship from a cholera-free country who arrived at a cholera-stricken port would only get sick after landing or taking on supplies; he would not get sick if he remained docked. Cholera hit the poorest communities worst, and those people were the very same people who lived in the most crowded housing with the worst hygiene. He might find two apartment buildings next to one another, one would be heavily hit with cholera, but strangely the other one wouldn’t. He then noticed that the first building would be contaminated by runoff from privies but the water supply in the second building was cleaner. While these observations weren’t impossible to reconcile with miasma, they were definitely unusual and didn’t seem obviously consistent with miasmis.

Snow tucked away more and more anecdotal evidence like these. But, while this evidence raised some doubts in his mind, he was not convinced. He needed a smoking gun if he were to eliminate all doubt that cholera was spread by water, not air. But where would he find that evidence? More importantly, what would evidence like that evenlook like?

Let’s imagine the following thought experiment. If Snow was a dictator with unlimited wealth and power, how could he test his theory that cholera is waterborne? One thing he could do is flip a coin over each household member—heads you drink from the contaminated Thames, tails you drink from some uncontaminated source. Once the assignments had been made, Snow could simply compare cholera mortality between the two groups. If those who drank the clean water were less likely to contract cholera, then this would suggest that cholera was waterborne.

Knowledge that physical randomization could be used to identify causal effects was still eighty-five years away. But there were other issues besides ignorance that kept Snow from physical randomization. Experiments like the one I just described are also impractical, infeasible, and maybe even unethical—which is why social scientists so often rely on natural experiments that mimic important elements of randomized experiments. But what natural experiment was there? Snow needed to find a situation where uncontaminated water had been distributed to a large number of people as if by random chance, and then calculate the difference between those those who did and did not drink contaminated water. Furthermore, the contaminated water would need to be allocated to people in ways that were unrelated to the ordinary determinants of cholera mortality, such as hygiene and poverty, implying a degree of balance on covariates between the groups. And then he remembered—a potential natural experiment in London a year earlier had reallocated clean water to citizens of London. Could this work?

In the 1800s, several water companies served different areas of the city. Some neighborhoods were even served by more than one company. They took their water from the Thames, which had been polluted by victims’ evacuations via runoff. But in 1849, the Lambeth water company had moved its intake pipes upstream higher up the Thames, above the main sewage discharge point, thus giving its customers uncontaminated water. They did this to obtain cleaner water, but it had the added benefit of being too high up the Thames to be infected with cholera from the runoff. Snow seized on this opportunity. He realized that it had given him a natural experiment that would allow him to test his hypothesis that cholera was waterborne by comparing the households. If his theory was right, then the Lambeth houses should have lower cholera death rates than some other set of households whose water was infected with runoff—what we might call today the explicit counterfactual. He found his explicit counterfactual in the Southwark and Vauxhall Waterworks Company.

Unlike Lambeth, the Southwark and Vauxhall Waterworks Company had not moved their intake point upstream, and Snow spent an entire book documenting similarities between the two companies’ households. For instance, sometimes their service cut an irregular path through neighborhoods and houses such that the households on either side were very similar; the only difference being they drank different water with different levels of contamination from runoff. Insofar as the kinds of people that each company serviced were observationally equivalent, then perhaps they were similar on the relevant unobservables as well.

Snow meticulously collected data on household enrollment in water supply companies, going door to door asking household heads the name of their utility company. Sometimes these individuals didn’t know, though, so he used a saline test to determine the source himself . He matched those data with the city’s data on the cholera death rates at the household level. It was in many ways as advanced as any study we might see today for how he carefully collected, prepared, and linked a variety of data sources to show the relationship between water purity and mortality. But he also displayed scientific ingenuity for how he carefully framed the research question and how long he remained skeptical until the research design’s results convinced him otherwise. After combining everthing, he was able to generate extremely persuasive evidence that influenced policymakers in the city.139

Snow wrote up all of his analysis in a manuscript entitled On the Mode of Communication of Cholera . Snow’s main evidence was striking, and I will discuss results based on Table XII and Table IX (not shown) in Table 9.1. The main difference between my version and his version of Table XII is that I will use his data to estimate a treatment effect using difference-in-differences.

Table 9.1: Modified Table XII (Snow 1854).
Company name 1849 1854
Southwark and Vauxhall 135 147
Lambeth 85 19

### 9.1.1 Table XII

In 1849, there were 135 cases of cholera per 10,000 households at Southwark and Vauxhall and 85 for Lambeth. But in 1854, there were 147 per 100,000 in Southwark and Vauxhall, whereas Lambeth’s cholera cases per 10,000 households fell to 19.

While Snow did not explicitly calculate the difference-in-differences, the ability to do so was there . If we difference Lambeth’s 1854 value from its 1849 value, followed by the same after and before differencing for Southwark and Vauxhall, we can calculate an estimate of the ATT equaling 78 fewer deaths per 10,000. While Snow would go on to produce evidence showing cholera deaths were concentrated around a pump on Broad Street contaminated with cholera, he allegedly considered the simple difference-in-differences the more convincing test of his hypothesis.

The importance of the work Snow undertook to understand the causes of cholera in London cannot be overstated. It not only lifted our ability to estimate causal effects with observational data, it advanced science and ultimately saved lives. Of Snow’s work on the cause of cholera transmission, Freedman (1991) states:

The force of Snow’s argument results from the clarity of the prior reasoning, the bringing together of many different lines of evidence, and the amount of shoe leather Snow was willing to use to get the data. Snow did some brilliant detective work on nonexperimental data. What is impressive is not the statistical technique but the handling of the scientific issues. He made steady progress from shrewd observation through case studies to analyze ecological data. In the end, he found and analyzed a natural experiment. (p.298)

## 9.2 Estimation

### 9.2.1 A simple table

Let’s look at this example using some tables, which hopefully will help give you an idea of the intuition behind DD, as well as some of its identifying assumptions.140 Assume that the intervention is clean water, which I’ll write as $$D$$, and our objective is to estimate $$D$$’s causal effect on cholera deaths. Let cholera deaths be represented by the variable $$Y$$. Can we identify the causal effect of D if we just compare the post-treatment 1854 Lambeth cholera death values to that of the 1854 Southwark and Vauxhall values? This is in many ways an obvious choice, and in fact, it is one of the more common naive approaches to causal inference. After all, we have a control group, don’t we? Why can’t we just compare a treatment group to a control group? Let’s look and see.

One of the things we immediately must remember is that the simple difference in outcomes, which is all we are doing here, only collapsed to the ATE if the treatment had been randomized. But it is never randomized in the real world where most choices if not all choices made by real people is endogenous to potential outcomes. Let’s represent now the differences between Lambeth and Southwark and Vauxhall with fixed level differences, or fixed effects, represented by $$L$$ and $$SV$$. Both are unobserved, unique to each company, and fixed over time. What these fixed effects mean is that even if Lambeth hadn’t changed its water source there, would still be something determining cholera deaths, which is just the time-invariant unique differences between the two companies as it relates to cholera deathsin 1854.

Compared to what? Different companies.
Company Outcome
Lambeth $$Y=L + D$$
Southwark and Vauxhall $$Y=SV$$

When we make a simple comparison between Lambeth and Southwark and Vauxhall, we get an estimated causal effect equalling $$D+(L-SV)$$. Notice the second term, $$L-SV$$. We’ve seen this before. It’s the selection bias we found from the decomposition of the simple difference in outcomes from earlier in the book.

Okay, so say we realize that we cannot simply make cross-sectional comparisons between two units because of selection bias. Surely, though, we can compare a unit to itself? This is sometimes called an interrupted time series. Let’s consider that simple before-and-after difference for Lambeth now.

Compared to what? Before and after.
Company Time Outcome
Lambeth Before $$Y=L$$
After $$Y=L + (T + D)$$

While this procedure successfully eliminates the Lambeth fixed effect (unlike the cross-sectional difference), it doesn’t give me an unbiased estimate of $$D$$ because differences can’t eliminate the natural changes in the cholera deaths over time. Recall, these events were oscillating in waves. I can’t compare Lambeth before and after ($$T+D$$) because of $$T$$, which is an omitted variable.

The intuition of the DD strategy is remarkably simple: combine these two simpler approaches so the selection bias and the effect of time are, in turns, eliminated. Let’s look at it in the followingtable.

Compared to what? Difference in each company’s differences.
Companies Time Outcome $$D_1$$ $$D_2$$
Lambeth Before $$Y=L$$
After $$Y=L + T + D$$ $$T+D$$
$$D$$
Southwark and Vauxhall Before $$Y=SV$$
After $$Y=SV + T$$ $$T$$

The first difference, $$D_1$$, does the simple before-and-after difference. This ultimately eliminates the unit-specific fixed effects. Then, once those differences are made, we difference the differences (hence the name) to get the unbiased estimate of $$D$$.

But there’s a a key assumption with a DD design, and that assumption is discernible even in this table. We are assuming that there is no time-variant company specific unobservables. Nothing unobserved in Lambeth households that is changing between these two periods that also determines cholera deaths. This is equivalent to assuming that $$T$$ is the same for all units. And we call this the parallel trends assumption. We will discuss this assumption repeatedly as the chapter proceeds, as it is the most important assumption in the design’s engine. If you can buy off on the parallel trends assumption, then DD will identify the causal effect.

DD is a powerful, yet amazingly simple design. Using repeated observations on a treatment and control unit (usually several units), we can eliminate the unobserved heterogeneity to provide a credible estimate of the average treatment effect on the treated (ATT) by transforming the data in very specific ways. But when and why does this process yield the correct answer? Turns out, there is more to it than meets the eye. And it is imperative on the front end that you understand what’s under the hood so that you can avoid conceptual errors about this design.

### 9.2.2 The simple $$2\times 2$$ DD

The cholera case is a particular kind of DD design that Goodman-Bacon (2019) calls the $$2\times 2$$ DD design. The $$2\times 2$$ DD design has a treatment group $$k$$ and untreated group $$U$$. There is a pre-period for the treatment group, $$\mathop{\mathrm{pre}}(k)$$; a post-period for the treatment group, $$\mathop{\mathrm{post}}(k)$$; a pre-treatment period for the untreated group, $$\mathop{\mathrm{pre}}(U)$$; and a post-period for the untreated group, $$\mathop{\mathrm{post}}(U)$$ So: $\widehat{\delta}^{2\times 2}_{kU} = \bigg ( \overline{y}_k^{\mathop{\mathrm{post}}(k)} - \overline{y}_k^{\mathop{\mathrm{pre}}(k)} \bigg ) - \bigg ( \overline{y}_U^{\mathop{\mathrm{post}}(k)} - \overline{y}_U^{\mathop{\mathrm{pre}}(k)} \bigg )$ where $$\widehat{\delta}_{kU}$$ is the estimated ATT for group $$k$$, and $$\overline{y}$$ is the sample mean for that particular group in a particular time period. The first paragraph differences the treatment group, $$k$$, after minus before, the second paragraph differences the untreated group, $$U$$, after minus before. And once those quantities are obtained, we difference the second term from the first.

But this is simply the mechanics of calculations. What exactly is this estimated parameter mapping onto? To understand that, we must convert these sample averages into conditional expectations of potential outcomes. But that is easy to do when working with sample averages, as we will see here. First let’s rewrite this as a conditional expectation. $\widehat{\delta}^{2\times 2}_{kU} = \bigg(E\big[Y_k \mathop{\mathrm{\,\vert\,}}\mathop{\mathrm{Post}}\big] - E\big[Y_k \mathop{\mathrm{\,\vert\,}}\mathop{\mathrm{Pre}}\big]\bigg)- \bigg(E\big[Y_U \mathop{\mathrm{\,\vert\,}}\mathop{\mathrm{Post}}\big] - E\big[Y_U \mathop{\mathrm{\,\vert\,}}\mathop{\mathrm{Pre}}\big]\bigg)$

Now let’s use the switching equation, which transforms historical quantities of $$Y$$ into potential outcomes. As we’ve done before, we’ll do a little trick where we add zero to the right-hand side so that we can use those terms to help illustrate something important. \begin{align} &\widehat{\delta}^{2\times 2}_{kU} = \bigg ( \underbrace{E\big[Y^1_k \mathop{\mathrm{\,\vert\,}}\mathop{\mathrm{Post}}\big] - E\big[Y^0_k \mathop{\mathrm{\,\vert\,}}\mathop{\mathrm{Pre}}\big] \bigg ) - \bigg(E\big[Y^0_U \mathop{\mathrm{\,\vert\,}}\mathop{\mathrm{Post}}\big] - E\big[ Y^0_U \mathop{\mathrm{\,\vert\,}}\mathop{\mathrm{Pre}}\big]}_{\text{Switching equation}} \bigg) \\ &+ \underbrace{E\big[Y_k^0 \mathop{\mathrm{\,\vert\,}}\mathop{\mathrm{Post}}\big] - E\big[Y^0_k \mathop{\mathrm{\,\vert\,}}\mathop{\mathrm{Post}}\big]}_{\text{Adding zero}} \end{align}

Now we simply rearrange these terms to get the decomposition of the $$2\times 2$$ DD in terms of conditional expected potential outcomes. \begin{align} &\widehat{\delta}^{2\times 2}_{kU} = \underbrace{E\big[Y^1_k \mathop{\mathrm{\,\vert\,}}\mathop{\mathrm{Post}}\big] - E\big[Y^0_k \mathop{\mathrm{\,\vert\,}}\mathop{\mathrm{Post}}\big]}_{\text{ATT}} \\ &+\Big[\underbrace{E\big[Y^0_k \mathop{\mathrm{\,\vert\,}}\mathop{\mathrm{Post}}\big] - E\big[Y^0_k \mathop{\mathrm{\,\vert\,}}\mathop{\mathrm{Pre}}\big] \Big] - \Big[E\big[Y^0_U \mathop{\mathrm{\,\vert\,}}\mathop{\mathrm{Post}}\big] - E\big[Y_U^0 \mathop{\mathrm{\,\vert\,}}\mathop{\mathrm{Pre}}\big] }_{\text{Non-parallel trends bias in 2\times 2 case}} \Big] \end{align}

Now, let’s study this last term closely. This simple $$2\times 2$$ difference-in-differences will isolate the ATT (the first term) if and only if the second term zeroes out. But why would this second term be zero? It would equal zero if the first difference involving the treatment group, $$k$$, equaled the second difference involving the untreated group, $$U$$.

But notice the term in the second line. Notice anything strange about it? The object of interest is $$Y^0$$, which is some outcome in a world without the treatment. But it’s the post period, and in the post period, $$Y=Y^1$$ not $$Y^0$$ by the switching equation. Thus, the first term is counterfactual. And as we’ve said over and over, counterfactuals are not observable. This bottom line is often called the parallel trends assumption and it is by definition untestable since we cannot observe this counterfactual conditional expectation. We will return to this again, but for now I simply present it for your consideration.

### 9.2.3 DD and the Minimum Wage

Now I’d like to talk about more explicit economic content, and the minimum wage is as good a topic as any. The modern use of DD was brought into the social sciences through esteemed labor economist Orley Ashenfelter (1978). His study was no doubt influential to his advisee, David Card, arguably the greatest labor economist of his generation. Card would go on to use the method in several pioneering studies, such as Card (1990). But I will focus on one in particular—his now-classic minimum wage study .

Card and Krueger (1994) is an infamous study both because of its use of an explicit counterfactual for estimation, and because the study challenges many people’s common beliefs about the negative effects of the minimum wage. It lionized a massive back-and-forth minimum-wage literature that continues to this day.141 So controversial was this study that James Buchanan, the Nobel Prize winner, called those influenced by Card and Krueger (1994) “camp following whores” in a letter to the editor of the Wall Street Journal .142

Suppose you are interested in the effect of minimum wages on employment. Theoretically, you might expect that in competitive labor markets, an increase in the minimum wage would move us up a downward-sloping demand curve, causing employment to fall. But in labor markets characterized by monopsony, minimum wages can increase employment. Therefore, there are strong theoretical reasons to believe that the effect of the minimum wage on employment is ultimately an empirical question depending on many local contextual factors. This is where Card and Krueger (1994) entered. Could they uncover whether minimum wages were ultimately harmful or helpful in some local economy?

It’s always useful to start these questions with a simple thought experiment: if you had a billion dollars, complete discretion and could run a randomized experiment, how would you test whether minimum wages increased or decreased employment? You might go across the hundreds of local labor markets in the United States and flip a coin—heads, you raise the minimum wage; tails, you keep it at the status quo. As we’ve done before, these kinds of thought experiments are useful for clarifying both the research design and the causal question.

Lacking a randomized experiment, Card and Krueger (1994) decided on a next-best solution by comparing two neighboring states before and after a minimum-wage increase. It was essentially the same strategy that Snow used in his cholera study and a strategy that economists continue to use, in one form or another, to this day .

mean = regddd$coefficients[110:124], year = c(1986:2000)) abortion_plot %>% ggplot(aes(x = year, y = mean)) + geom_rect(aes(xmin=1986, xmax=1992, ymin=-Inf, ymax=Inf), fill = "cyan", alpha = 0.01)+ geom_point()+ geom_text(aes(label = year), hjust=-0.002, vjust = -0.03)+ geom_hline(yintercept = 0) + geom_errorbar(aes(ymin = mean-sd*1.96, ymax = mean+sd*1.96), width = 0.2, position = position_dodge(0.05)) Python Code abortion_ddd.py import numpy as np import pandas as pd import statsmodels.api as sm import statsmodels.formula.api as smf from itertools import combinations import plotnine as p # read data import ssl ssl._create_default_https_context = ssl._create_unverified_context def read_data(file): return pd.read_stata("https://raw.github.com/scunning1975/mixtape/master/" + file) abortion = read_data('abortion.dta') abortion = abortion[~pd.isnull(abortion.lnr)] abortion['yr'] = 0 abortion.loc[(abortion.younger==1) & (abortion.repeal==1), 'yr'] = 1 abortion['wm'] = 0 abortion.loc[(abortion.wht==1) & (abortion.male==1), 'wm'] = 1 abortion['wf'] = 0 abortion.loc[(abortion.wht==1) & (abortion.male==0), 'wf'] = 1 abortion['bm'] = 0 abortion.loc[(abortion.wht==0) & (abortion.male==1), 'bm'] = 1 abortion['bf'] = 0 abortion.loc[(abortion.wht==0) & (abortion.male==0), 'bf'] = 1 abortion_filt = abortion[(abortion.bf==1) & (abortion.age.isin([15,25]))] reg = ( smf .wls("""lnr ~ C(repeal)*C(year) + C(younger)*C(repeal) + C(younger)*C(year) + C(yr)*C(year) + C(fip)*t + acc + ir + pi + alcohol + crack + poverty + income + ur""", data=abortion_filt, weights=abortion_filt.totpop.values) .fit( cov_type='cluster', cov_kwds={'groups': abortion_filt.fip.values}, method='pinv') ) abortion_plot = pd.DataFrame({'sd': reg.bse['C(yr)[T.1]:C(year)[T.1986.0]':'C(yr)[T.1]:C(year)[T.2000.0]'], 'mean': reg.params['C(yr)[T.1]:C(year)[T.1986.0]':'C(yr)[T.1]:C(year)[T.2000.0]'], 'year':np.arange(1986, 2001)}) abortion_plot['lb'] = abortion_plot['mean'] - abortion_plot['sd']*1.96 abortion_plot['ub'] = abortion_plot['mean'] + abortion_plot['sd']*1.96 p.ggplot(abortion_plot, p.aes(x = 'year', y = 'mean')) + p.geom_rect(p.aes(xmin=1986, xmax=1991, ymin=-np.inf, ymax=np.inf), fill = "cyan", alpha = 0.01)+ p.geom_point()+ p.geom_text(p.aes(label = 'year'), ha='right')+ p.geom_hline(yintercept = 0) + p.geom_errorbar(p.aes(ymin = 'lb', ymax = 'ub'), width = 0.2, position = p.position_dodge(0.05)) +\ p.labs(title= "Estimated effect of abortion legalization on gonorrhea") Here we see the prediction start to break down. Though there are negative effects for years 1986 to 1990, the 1991 and 1992 coefficients are positive, which is not consistent with our hypothesis. Furthermore, only the first four coefficients are statistically significant. Nevertheless, given the demanding nature of DDD, perhaps this is a small victory in favor of Gruber, Levine, and Staiger (1999) and Donohue and Levitt (2001). Perhaps the theory that abortion legalization had strong selection effects on cohorts has some validity. Putting aside whether you believe the results, it is still valuable to replicate the results based on this staggered design. Recall that I said the DDD design requires stacking the data, which may seem like a bit of a black box, so I’d like to examine these data now.144 The second line estimates the regression equation. The dynamic DD coefficients are captured by the repeal-year interactions. These are the coefficients we used to create box plots in Figure 9.11. You can check these yourself. Note, for simplicity, I only estimated this for the black females (bf15==1) but you could estimate for the black males (bm15==1), white females (wf15==1), or white males (wm15==1). We do all four in the paper, but here we only focus on the black females aged 15–19 because the purpose of this section is to help you understand the estimation. I encourage you to play around with this model to see how robust the effects are in your mind using only this linear estimation. But now I want to show you the code for estimating a triple difference model. Some reshaping had to be done behind the scenes for this data structure, but it would take too long to post that here. For now, I will simply produce the commands that produce the black female result, and I encourage you to explore the panel data structure so as to familiarize yourself with the way in which the data are organized. Notice that some of these were already interactions (e.g., yr), which was my way to compactly include all of the interactions. I did this primarily to give myself more control over what variables I was using. But I encourage you to study the data structure itself so that when you need to estimate your own DDD, you’ll have a good handle on what form the data must be in in order to execute so many interactions. ### 9.5.4 Going beyond Cunningham and Cornwell (2013) The US experience with abortion legalization predicted a parabola from 1986 to 1992 for 15- to 19-year-olds, and using a DD design, that’s what I found. I also estimated the effect using a DDD design, and while the effects weren’t as pretty as what I found with DD, there appeared to be something going on in the general vicinity of where the model predicted. So boom goes the dynamite, right? Can’t we be done finally? Not quite. Whereas my original study stopped there, I would like to go a little farther. The reason can be seen in the following Figure 9.13. This is a modified version of Figure 9.9, with the main difference being I have created a new parabola for the 20- to 24-year-olds. Look carefully at Figure 9.13. Insofar as the early 1970s cohorts were treated in utero with abortion legalization, then we should see not just a parabola for the 15- to 19-year-olds for 1986 to 1992 but also for the 20- to 24-year-olds for years 1991 to 1997 as the cohorts continuedto age.145 Stata Code abortion_dd2.do use https://github.com/scunning1975/mixtape/raw/master/abortion.dta, clear * Second DD model for 20-24 year old black females char year[omit] 1985 xi: reg lnr i.repeal*i.year i.fip acc ir pi alcohol crack poverty income ur if (race==2 & sex==2 & age==20) [aweight=totpop], cluster(fip)  R Code abortion_dd2.R library(tidyverse) library(haven) library(estimatr) read_data <- function(df) { full_path <- paste("https://raw.github.com/scunning1975/mixtape/master/", df, sep = "") df <- read_dta(full_path) return(df) } abortion <- read_data("abortion.dta") %>% mutate( repeal = as_factor(repeal), year = as_factor(year), fip = as_factor(fip), fa = as_factor(fa), ) reg <- abortion %>% filter(race == 2 & sex == 2 & age == 20) %>% lm_robust(lnr ~ repeal*year + fip + acc + ir + pi + alcohol+ crack + poverty+ income+ ur, data = ., weights = totpop, clusters = fip) Python Code abortion_dd2.py import numpy as np import pandas as pd import statsmodels.api as sm import statsmodels.formula.api as smf from itertools import combinations import plotnine as p # read data import ssl ssl._create_default_https_context = ssl._create_unverified_context def read_data(file): return pd.read_stata("https://raw.github.com/scunning1975/mixtape/master/" + file) abortion = read_data('abortion.dta') abortion = abortion[~pd.isnull(abortion.lnr)] abortion_filt = abortion[(abortion.race == 2) & (abortion.sex == 2) & (abortion.age == 20)] regdd = ( smf .wls("""lnr ~ C(repeal)*C(year) + C(fip) + acc + ir + pi + alcohol+ crack + poverty+ income+ ur""", data=abortion_filt, weights=abortion_filt.totpop.values) .fit( cov_type='cluster', cov_kwds={'groups': abortion_filt.fip.values}, method='pinv') ) regdd.summary() I did not examine the 20- to 24-year-old cohort when I first wrote this paper because at that time I doubted that the selection effects for risky sex would persist into adulthood given that youth display considerable risk-taking behavior. But with time come new perspectives, and these days I don’t have strong priors that the selection effects would necessarily vanish after teenage years. So I’d like to conduct that additional analysis here and now for the first time. Let’s estimate the same DD model as before, only for Black females aged 20–24. As before, we will focus just on the coefficient plots. We show that in Figure 9.14. There are a couple of things about this regression output that are troubling. First, there is a negative parabola showing up where there wasn’t necessarily one predicted—the 1986–1992 period. Note that is the period where only the 15- to 19-year-olds were the treated cohorts, suggesting that our 15- to 19-year-old analysis was picking up something other than abortion legalization. But that was also the justification for using DDD, as clearly something else is going on in the repeal versus Roe states during those years that we cannot adequately control for with our controls and fixed effects. The second thing to notice is that there is no parabola in the treatment window for the treatment cohort. The effect sizes are negative in the beginning, but shrink in absolute value when they should be growing. In fact, the 1991 to 1997 period is one of convergence to zero, not divergence between these two sets of states. But as before, maybe there are strong trending unobservables for all groups masking the abortion legalization effect. To check, let’s use my DDD strategy with the 25- to 29-year-olds as the within-state control group. We can implement this by using the Stata code, abortion_ddd2.do and abortion_ddd2.R. Stata Code abortion_ddd2.do use https://github.com/scunning1975/mixtape/raw/master/abortion.dta, clear * Second DDD model for 20-24 year olds vs 25-29 year olds black females in repeal vs Roe states gen younger2 = 0 replace younger2 = 1 if age == 20 gen yr2=(repeal==1) & (younger2==1) gen wm=(wht==1) & (male==1) gen wf=(wht==1) & (male==0) gen bm=(wht==0) & (male==1) gen bf=(wht==0) & (male==0) char year[omit] 1985 char repeal[omit] 0 char younger2[omit] 0 char fip[omit] 1 char fa[omit] 0 char yr2[omit] 0 xi: reg lnr i.repeal*i.year i.younger2*i.repeal i.younger2*i.year i.yr2*i.year i.fip*t acc pi ir alcohol crack poverty income ur if bf==1 & (age==20 | age==25) [aweight=totpop], cluster(fip)  R Code abortion_ddd2.R library(tidyverse) library(haven) library(estimatr) read_data <- function(df) { full_path <- paste("https://raw.github.com/scunning1975/mixtape/master/", df, sep = "") df <- read_dta(full_path) return(df) } abortion <- read_data("abortion.dta") %>% mutate( repeal = as_factor(repeal), year = as_factor(year), fip = as_factor(fip), fa = as_factor(fa), younger2 = case_when(age == 20 ~ 1, TRUE ~ 0), yr2 = as_factor(case_when(repeal == 1 & younger2 == 1 ~ 1, TRUE ~ 0)), wm = as_factor(case_when(wht == 1 & male == 1 ~ 1, TRUE ~ 0)), wf = as_factor(case_when(wht == 1 & male == 0 ~ 1, TRUE ~ 0)), bm = as_factor(case_when(wht == 0 & male == 1 ~ 1, TRUE ~ 0)), bf = as_factor(case_when(wht == 0 & male == 0 ~ 1, TRUE ~ 0)) ) regddd <- abortion %>% filter(bf == 1 & (age == 20 | age ==25)) %>% lm_robust(lnr ~ repeal*year + acc + ir + pi + alcohol + crack + poverty + income + ur, data = ., weights = totpop, clusters = fip) Python Code abortion_ddd2.py import numpy as np import pandas as pd import statsmodels.api as sm import statsmodels.formula.api as smf from itertools import combinations import plotnine as p # read data import ssl ssl._create_default_https_context = ssl._create_unverified_context def read_data(file): return pd.read_stata("https://raw.github.com/scunning1975/mixtape/master/" + file) abortion = read_data('abortion.dta') abortion = abortion[~pd.isnull(abortion.lnr)] abortion_filt = abortion[(abortion.race == 2) & (abortion.sex == 2) & (abortion.age == 20)] regdd = ( smf .wls("""lnr ~ C(repeal)*C(year) + C(fip) + acc + ir + pi + alcohol+ crack + poverty+ income+ ur""", data=abortion_filt, weights=abortion_filt.totpop.values) .fit( cov_type='cluster', cov_kwds={'groups': abortion_filt.fip.values}, method='pinv') ) regdd.summary() Figure 9.15 shows the DDD estimated coefficients for the treated cohort relative to a slightly older 25- to 29-year-old cohort. It’s possible that the 25- to 29-year-old cohort is too close in age to function as a satisfactory within-state control; if those age 20–24 have sex with those who are age 25–29, for instance, then SUTVA is violated. There are other age groups, though, that you can try in place of the 25- to 29-year-olds, and I encourage you to do it for both the experience and the insights you might gleam. But let’s back up and remember the big picture. The abortion legalization hypothesis made a series of predictions about where negative parabolic treatment effects should appear in the data. And while we found some initial support, when we exploited more of those predictions, the results fell apart. A fair interpretation of this exercise is that our analysis does not support the abortion legalization hypothesis. Figure 9.15 shows several point estimates at nearly zero, and standard errors so large as to include both positive and negative values for these interactions. I included this analysis because I wanted to show you the power of a theory with numerous unusual yet testable predictions. Imagine for a moment if a parabola had showed up for all age groups in precisely the years predicted by the theory. Wouldn’t we have to update our priors about the abortion legalization selection hypothesis? With predictions so narrow, what else could be causing it? It’s precisely because the predictions are so specific, though, that we are able to reject the abortion legalization hypothesis, at least for gonorrhea. ### 9.5.5 Placebos as critique Since the fundamental problem of causal inference blocks our direct observation of causal effects, we rely on many direct and indirect pieces of evidence to establish credible causality. And as I said in the previous section on DDD, one of those indirect pieces of evidence is placebo analysis. The reasoning goes that if we find, using our preferred research design, effects where there shouldn’t be, then maybe our original findings weren’t credible in the first place. Using placebo analysis within your own work has become an essential part of empirical work for this reason. But another use of placebo analysis is to evaluate the credibility of popular estimation strategies themselves. This kind of use helps improve a literature by uncovering flaws in a research design which can then help stimulate the creation of stronger methods and models. Let’s take two exemplary studies that accomplished this well: Auld and Grootendorst (2004) and Cohen-Cole and Fletcher (2008). To say that the G. S. Becker and Murphy (1988) “rational addiction” model has been influential would be an understatement. It has over 4,000 cites and has become one of the most common frameworks in health economics. It created a cottage industry of empirical studies that persists to this day. Alcohol, tobacco, gambling, even sports, have all been found to be “rationally addictive” commodities and activities using various empirical approaches. But some researchers cautioned the research community about these empirical studies. Rogeberg (2004) critiqued the theory on its own grounds, but I’d like to focus on the empirical studies based on the theory. Rather than talk about any specific paper, I’d like to provide a quote from Melberg (2008), who surveyed researchers who had written on rational addiction: A majority of our respondents believe the literature is a success story that demonstrates the power of economic reasoning. At the same time, they also believe the empirical evidence is weak, and they disagree both on the type of evidence that would validate the theory and the policy implications. Taken together, this points to an interesting gap. On the one hand, most of the respondents claim that the theory has valuable real world implications. On the other hand, they do not believe the theory has received empiricalsupport. (p.1) Rational addiction should be held to the same empirical standards as in theory. The strength of the model has always been based on the economic reasoning, which economists obviously find compelling. But were the empirical designs flawed? How could we know? Auld and Grootendorst (2004) is not a test of the rational addiction model. On the contrary, it is an “anti-test” of the empirical rational addiction models common at the time. Their goal was not to evaluate the theoretical rational addiction model, in other words, but rather the empirical rational addiction models themselves. How do they do this? Auld and Grootendorst (2004) used the empirical rational addiction model to evaluate commodities that could not plausibly be considered addictive, such as eggs, milk, orange, and apples. They found that the empirical rational addiction model implied milk was extremely addictive, perhaps one of the most addictive commodities studied.146 Is it credible to believe that eggs and milk are “rationally addictive” or is it more likely the research designs used to evaluate the rational addiction model were flawed? Auld and Grootendorst (2004) study cast doubt on the empirical rational addiction model, not the theory. Another problematic literature was the peer-effects literature. Estimating peer effects is notoriously hard. Manski (1993) said that the deep endogeneity of social interactions made the identification of peer effects difficult and possibly even impossible. He called this problem the “mirroring” problem. If “birds of a feather flock together,” then identifying peer effects in observational settings may just be impossible due to the profound endogeneities at play. Several studies found significant network effects on outcomes like obesity, smoking, alcohol use, and happiness. This led many researchers to conclude that these kinds of risk behaviors were “contagious” through peer effects . But these studies did not exploit randomized social groups. The peer groups were purely endogenous. Cohen-Cole and Fletcher (2008) showed using similar models and data that even attributes that couldn’t be transmitted between peers—acne, height, and headaches—appeared “contagious” in observational data using the Christakis and Fowler (2007) model for estimation. Note, Cohen-Cole and Fletcher (2008) does not reject the idea of theoretical contagions. Rather, they point out that the Manski critique should guide peer effect analysis if social interactions are endogenous. They provide evidence for this indirectly using placebo analysis.147 ### 9.5.6 Compositional change within repeated cross-sections DD can be applied to repeated cross-sections, as well as panel data. But one of the risks of working with the repeated cross-sections is that unlike panel data (e.g., individual-level panel data), repeated cross-sections run the risk of compositional changes. Hong (2013) used repeated cross-sectional data from the Consumer Expenditure Survey (CEX) containing music expenditure and internet use for a random sample of households. The author’s study exploited the emergence and immense popularity of Napster, the first file-sharing software widely used by Internet users, in June 1999 as a natural experiment. The study compared Internet users and Internet non-users before and after the emergence of Napster. At first glance, they found that as Internet diffusion increased from 1996 to 2001, spending on music for Internet users fell faster than that for non-Internet users. This was initially evidence that Napster was responsible for the decline, until this was investigated more carefully. But when we look at Table 9.4, we see evidence of compositional changes. While music expenditure fell over the treatment period, the demographics of the two groups also changed over this period. For instance, the age of Internet users grew while income fell. If older people are less likely to buy music in the first place, then this could independently explain some of the decline. This kind of compositional change is a like an omitted variable bias built into the sample itself caused by time-variant unobservables. Diffusion of the Internet appears to be related to changing samples as younger music fans are early adopters. Identification of causal effects would need for the treatment itself to be exogenous to such changes in the composition.  Year 1997 1998 1999 Internet user Non-user Internet user Non-user Internet user Non-user Average expenditure Recorded music$25.73 $10.90$24.18 $9.97$20.92 $9.37 Entertainment$195.03 $96.71$193.38 $84.92$182.42 $80.19 Zero expenditure Recorded music 0.56 0.79 0.60 0.80 0.64 0.81 Entertainment 0.08 0.32 0.09 0.35 0.14 0.39 Demographics Age 40.2 49.0 42.3 49.0 44.1 49.4 Income$52,887 $30,459$51,995 $26,189$49,970 26,649 High school graduate 0.18 0.31 0.17 0.32 0.21 0.32 Some college 0.37 0.28 0.35 0.27 0.34 0.27 College grad 0.43 0.21 0.45 0.21 0.42 0.20 Manager 0.16 0.08 0.16 0.08 0.14 0.08 Sample means from the Consumer Expenditure Survey. ### 9.5.7 Final thoughts There are a few other caveats I’d like to make before moving on. First, it is important to remember the concepts we learned in the early DAG chapter. In choosing covariates in a DD design, you must resist the temptation to simply load the regression up with a kitchen sink of regressors. You should resist if only because in so doing, you may inadvertently include a collider, and if a collider is conditioned on, it introduces strange patterns that may mislead you and your audience. There is unfortunately no way forward except, again, deep institutional familiarity with both the factors that determined treatment assignment on the ground, as well as economic theory itself. Second, another issue I skipped over entirely is the question of how the outcome is modeled. Very little thought if any is given to how exactly we should model some outcome. Just to take one example, should we use the log or the levels themselves? Should we use the quartic root? Should we use rates? These, it turns, out are critically important because for many of them, the parallel trends assumption needed for identification will not be achieved—even though it will be achieved under some other unknown transformation. It is for this reason that you can think of many DD designs as having a parametric element because you must make strong commitments about the functional form itself. I cannot provide guidance to you on this, except that maybe using the pre-treatment leads as a way of finding parallelism could be a useful guide. ## Causal Inference: The Mixtape. Buy the print version today: ## 9.6 Twoway Fixed Effects with Differential Timing I have a bumper sticker on my car that says “I love Federalism (for the natural experiments)” (Figure 9.16). I made these bumper stickers for my students to be funny, and to illustrate that the United States is a never-ending laboratory. Because of state federalism, each US state has been given considerable discretion to govern itself with policies and reforms. Yet, because it is a union of states, US researchers have access to many data sets that have been harmonized across states, making it even more useful for causal inference. Goodman-Bacon (2019) calls the staggered assignment of treatments across geographic units over time the “differential timing” of treatment. What he means is unlike the simple $$2\times 2$$ that we discussed earlier (e.g., New Jersey and Pennsylvania), where treatment units were all treated at the same time, the more common situation is one where geographic units receive treatments at different points in time. And this happens in the United States because each area (state, municipality) will adopt a policy when it wants to, for its own reasons. As a result, the adoption of some treatment will tend to be differentially timed across units. This introduction of differential timing means there are basically two types of DD designs. There is the $$2\times 2$$ DD we’ve been discussing wherein a single unit or a group of units all receive some treatment at the same point in time, like Snow’s cholera study or Card and Krueger (1994). And then there is the DD with differential timing in which groups receive treatment at different points in time, like Cheng and Hoekstra (2013). We have a very good understanding of the $$2\times 2$$ design, how it works, why it works, when it works, and when it does not work. But we did not until Goodman-Bacon (2019) have as good an understanding of the DD design with differential timing. So let’s get down to business and discuss that now by reminding ourselves of the $$2\times 2$$ DD that we introduced earlier. \begin{align} \widehat{\delta}^{2\times 2}_{kU} = \bigg(\overline{y}_k^{\text{post}(k)} - \overline{y}_k^{\text{pre}(k)} \bigg) - \bigg(\overline{y}_U^{\text{post}(k)} - \overline{y}_U^{\text{pre}(k)} \bigg ) \end{align} where $$k$$ is the treatment group, $$U$$ is the never-treated group, and everything else is self-explanatory. Since this involves sample means, we can calculate the differences manually. Or we can estimate it with the following regression: \begin{align} y_{it} =\beta D_{i}+\tau \mathop{\mathrm{Post}}_{t}+\delta (D_i \times \mathop{\mathrm{Post}}_t)+X_{it}+\varepsilon_{it} \end{align} But a more common situation you’ll encounter will be a DD design with differential timing. And while the decomposition is a bit complicated, the regression equation itself is straightforward: \begin{align} y_{it} =\alpha_0 + \delta D_{it} + X_{it} + \alpha_i + \alpha_t + \epsilon_{it} \end{align} When researchers estimate this regression these days, they usually use the linear fixed-effects model that I discussed in the previous panel chapter. These linear panel models have gotten the nickname “twoway fixed effects” because they include both time fixed effects and unit fixed effects. Since this is such a popular estimator, it’s important we understand exactly what it is doing and what is it not. ### 9.6.1 Bacon Decomposition theorem Goodman-Bacon (2019) provides a helpful decomposition of the twoway fixed effects estimate of $$\widehat{\delta}$$. Given this is the go-to model for implementing differential timing designs, I have found his decomposition useful. But as there are some other decompositions of twoway fixed effects estimators, such as another important paper by Chaisemartin and D’Haultfœuille (2019), I’ll call it the Bacon decomposition for the sake of branding. The punchline of the Bacon decomposition theorem is that the twoway fixed effects estimator is a weighted average of all potential $$2\times 2$$ DD estimates where weights are both based on group sizes and variance in treatment. Under the assumption of variance weighted common trends (VWCT) and time invariant treatment effects, the variance weighted ATT is a weighted average of all possible ATTs. And under more restrictive assumptions, that estimate perfectly matches the ATT. But that is not true when there are time-varying treatment effects, as time-varying treatment effects in a differential timing design estimated with twoway fixed effects can generate a bias. As such, twoway fixed-effects models may be severely biased, which is echoed in Chaisemartin and D’Haultfœuille (2019). To make this concrete, let’s start with a simple example. Assume in this design that there are three groups: an early treatment group $$(k)$$, a group treated later $$(l)$$, and a group that is never treated $$(U)$$. Groups $$k$$ and $$l$$ are similar in that they are both treated but they differ in that $$k$$ is treated earlier than $$l$$. Let’s say there are 5 periods, and $$k$$ is treated in period 2. Then it spends 40% of its time under treatment, or 0.4. But let’s say $$l$$ is treated in period 4. Then it spends 80% of its time treated, or 0.8. I represent this time spent in treatment for a group as $$\overline{D}_k = 0.4$$ and $$\overline{D}_l = 0.8$$. This is important, because the length of time a group spends in treatment determines its treatment variance, which in turn affects the weight that $$2\times 2$$ plays in the final adding up of the DD parameter itself. And rather than write out $$2\times 2$$ DD estimator every time, we will just represent each $$2\times 2$$ as $$\widehat{\delta}_{ab}^{2\times 2,j}$$ where $$a$$ and $$b$$ are the treatment groups, and $$j$$ is the index notation for any treatment group. Thus if we wanted to know the $$2\times 2$$ for group $$k$$ compared to group $$U$$, we would write $$\widehat{\delta}_{kU}^{2\times 2,k}$$ or, maybe to save space, just $$\widehat{\delta}_{kU}^{k}$$. So, let’s get started. First, in a single differential timing design, how many $$2\times 2$$s are there anyway? Turns out there are a lot. To see, let’s make a toy example. Let’s say there are three timing groups ($$a$$, $$b$$, and $$c$$) and one untreated group $$(U)$$. Then there are 9 $$2\times 2$$ DDs. They are:  a to b b to a c to a a to c b to c c to b a to U b to U c to U See how it works? Okay, then let’s return to our simpler example where there are two timing groups $$k$$ and $$l$$ and one never-treated group. Groups $$k$$ and $$l$$ will get treated at time periods $$t^*_k$$ and $$t^*_l$$. The earlier period before anyone is treated will be called the “pre” period, the period between $$k$$ and $$l$$ treated is called the “mid” period, and the period after $$l$$ is treated is called the “post” period. This will be much easier to understand with some simple graphs. Let’s look at Figure 9.17. Recall the definition of a $$2\times 2$$ DD is $\widehat{\delta}^{2\times 2}_{kU} = \bigg (\overline{y}_k^{\text{post}(k)} - \overline{y}_k^{\text{pre}(k)} \bigg ) - \bigg (\overline{y}_U^{\text{post}(k)} - \overline{y}_U^{\text{pre}(k)} \bigg )$ where $$k$$ and $$U$$ are just place-holders for any of the groups used in a $$2\times 2$$. Substituting the information in each of the four panels of Figure 9.17 into the equation will enable you to calculate what each specific $$2\times 2$$ is. But we can really just summarize these into three really important $$2\times 2$$s, which are: \begin{align} \widehat{\delta}^{2\times 2}_{kU} &=\bigg ( \overline{y}_k^{\text{post}(k)} - \overline{y}_k^{\text{pre}(k)} \bigg ) - \bigg ( \overline{y}_U^{\text{post}(k)} - \overline{y}_U^{\text{pre}(k)} \bigg ) \\ \widehat{\delta}^{2\times 2}_{kl} &=\bigg ( \overline{y}_k^{mid(k,l)} - \overline{y}_k^{\text{pre}(k)} \bigg ) - \bigg ( \overline{y}_l^{mid(k,l)} - \overline{y}_l^{\text{pre}(k)} \bigg ) \\ \widehat{\delta}^{2\times 2}_{lk} &=\bigg ( \overline{y}_l^{\text{post}(l)} - \overline{y}_l^{mid(k,l)} \bigg ) - \bigg ( \overline{y}_k^{\text{post}(l)} - \overline{y}_k^{mid(k,l)} \bigg ) \end{align} where the first $$2\times 2$$ is any timing group compared to the untreated group ($$k$$ or $$l$$), the second is a group compared to the yet-to-be-treated timing group, and the last is the eventually-treated group compared to the already-treated controls. With this notation in mind, the DD parameter estimate can be decomposed as follows: $$\widehat{\delta}^{DD}$$ \begin{align} \widehat{\delta}^{DD} = \sum_{k \neq U} s_{kU}\widehat{\delta}_{kU}^{2\times 2} + \sum_{k \neq U} \sum_{l>k} s_{kl} \bigg [ \mu_{kl}\widehat{\delta}_{kl}^{2\times 2,k} + (1-\mu_{kl}) \widehat{\delta}_{kl}^{2\times 2,l} \bigg] \end{align} where the first $$2\times 2$$ is the $$k$$ compared to $$U$$ and the $$l$$ compared to $$U$$ (combined to make the equation shorter).148 So what are these weights exactly? \begin{align} s_{ku} &=\dfrac{ n_k n_u \overline{D}_k (1- \overline{D}_k ) }{ \widehat{Var} ( \tilde{D}_{it} )} \\ s_{kl} &=\dfrac{ n_k n_l (\overline{D}_k - \overline{D}_{l} ) ( 1- ( \overline{D}_k - \overline{D}_{l} )) }{\widehat{Var}(\tilde{D}_{it})} \\ \mu_{kl} &=\dfrac{1 - \overline{D}_k }{1 - ( \overline{D}_k - \overline{D}_{l} )} \end{align} where $$n$$ refers to sample sizes, $$\overline{D}_k (1- \overline{D}_k )$$ $$(\overline{D}_k - \overline{D}_{l} ) ( 1- ( \overline{D}_k - \overline{D}_{l} ))$$ expressions refer to variance of treatment, and the final equation is the same for two timing groups.149 Two things immediately pop out of these weights that I’d like to bring to your attention. First, notice how “group” variation matters, as opposed to unit-level variation. The Bacon decomposition shows that it’s group variation that twoway fixed effects is using to calculate that parameter you’re seeking. The more states that adopted a law at the same time, the bigger they influence that final aggregate estimate itself. The other thing that matters in these weights is within-group treatment variance. To appreciate the subtlety of what’s implied, ask yourself—how long does a group have to be treated in order to maximize its treatment variance? Define $$X=D(1-D)=D-D^2$$, take the derivative of $$V$$ with respect to $$\overline{D}$$, set $$\dfrac{d V}{d \overline{D}}$$ equal to zero, and solve for $$\overline{D}*$$. Treatment variance is maximized when $$\overline{D}=0.5$$. Let’s look at three values of $$\overline{D}$$ to illustrate this. $\begin{gather} \overline{D}=0.1; 0.1 \times 0.9 = 0.09 \\ \overline{D}=0.4; 0.4 \times 0.6 =0.24 \\ \overline{D}=0.5; 0.5 \times 0.5 = 0.25\end{gather}$ So what are we learning from this, exactly? Well, what we are learning is that being treated in the middle of the panel actually directly influences the numerical value you get when twoway fixed effects are used to estimate the ATT. That therefore means lengthening or shortening the panel can actually change the point estimate purely by changing group treatment variance and nothing more. Isn’t that kind of strange though? What criteria would we even use to determine the best length? But what about the “treated on treated weights,” or the $$s_{kl}$$ weight. That doesn’t have a $$\overline{D}(1-\overline{D})$$ expression. Rather, it has a $$(\overline{D}_k - \overline{D}_l)(1-(\overline{D}_k - \overline{D}_l)$$ expression. So the “middle” isn’t super clear. That’s because it isn’t the middle of treatment for a single group, but rather it’s the middle of the panel for the difference in treatment variance. For instance, let’s say $$k$$ spends 67% of time treated and $$l$$ spends 15% of time treated. Then $$\overline{D}_k - \overline{D}_l = 0.52$$ and therefore $$0.52 \times 0.48 = 0.2496$$, which as we showed is very nearly the max value of the variance as is possible (e.g., 0.25). Think about this for a moment—twoway fixed effects with differential timing weights the $$2 \times 2$$s comparing the two ultimate treatment groups more if the gap in treatment time is close to 0.5. ### 9.6.2 Expressing the decomposition in potential outcomes Up to now, we just showed what was inside the DD parameter estimate when using twoway fixed effects: it was nothing more than an “adding up” of all possible $$2\times 2$$s weighted by group shares and treatment variance. But that only tells us what DD is numerically; it does not tell us whether the parameter estimate maps onto a meaningful average treatment effect. To do that, we need to take those sample averages and then use the switching equations replace them with potential outcomes. This is key to moving from numbers to estimates of causal effects. Bacon’s decomposition theorem expresses the DD coefficient in terms of sample average, making it straightforward to substitute with potential outcomes using a modified switching equation. With a little creative manipulation, this will be revelatory. First, let’s define any year-specific ATT as \begin{align} ATT_k(\tau)=E\big[Y^1_{it}-Y^0_{it} \mathop{\mathrm{\,\vert\,}}k, t=\tau\big] \end{align} Next, let’s define it over a time window $$W$$ (e.g., a post-treatment window) \begin{align} ATT_k(\tau)=E\big[Y^1_{it}-Y^0_{it} \mathop{\mathrm{\,\vert\,}}k,\tau\in W\big] \end{align} Finally, let’s define differences in average potential outcomes overtime as: \begin{align} \Delta Y^h_k(W_1,W_0) = E\big[Y^h_{it} \mathop{\mathrm{\,\vert\,}}k, W_1\big]- E\big[Y^h_{it} \mathop{\mathrm{\,\vert\,}}k, W_0\big] \end{align} for $$h=0$$ (i.e., $$Y^0$$) or $$h=1$$ (i.e., $$Y^1$$) With trends, differences in mean potential outcomes is non-zero. You can see that in Figure 9.18. We’ll return to this, but I just wanted to point it out to you so that it would be concrete in your mind when we return to it later. We can move now from the $$2\times 2$$s that we decomposed earlier directly into the ATT, which is ultimately the main thing we want to know. We covered this earlier in the chapter, but review it again here to maintain progress on my argument. I will first write down the $$2\times 2$$ expression, use the switching equation to introduce potential outcome notation, and through a little manipulation, find some ATT expression. \begin{align} \widehat{\delta}^{2\times 2}_{kU} &=\bigg (E\big[Y_j \mathop{\mathrm{\,\vert\,}}\mathop{\mathrm{Post}}\big]- E\big[Y_j \mathop{\mathrm{\,\vert\,}}\mathop{\mathrm{Pre}}\big] \bigg ) - \bigg( E\big[Y_u \mathop{\mathrm{\,\vert\,}}\mathop{\mathrm{Post}}\big]- E\big[Y_u \mathop{\mathrm{\,\vert\,}}\mathop{\mathrm{Pre}}\big]\bigg)\\ &=\bigg ( \underbrace{E\big[Y^1_j \mathop{\mathrm{\,\vert\,}}\mathop{\mathrm{Post}}\big] - E\big[Y^0_j] \mathop{\mathrm{\,\vert\,}}\mathop{\mathrm{Pre}}\big]\bigg)- \bigg(E\big[Y^0_u \mathop{\mathrm{\,\vert\,}}\mathop{\mathrm{Post}}\big]- E\big[Y^0_u \mathop{\mathrm{\,\vert\,}}\mathop{\mathrm{Pre}}\big]}_{\text{Switching equation}} \bigg)\\ &+ \underbrace{E\big[Y_j^0 \mathop{\mathrm{\,\vert\,}}\mathop{\mathrm{Post}}\big] - E\big[Y^0_j \mathop{\mathrm{\,\vert\,}}\mathop{\mathrm{Post}}\big]}_{\text{Adding zero}}\\ &=\underbrace{E\big[Y^1_j \mathop{\mathrm{\,\vert\,}}\mathop{\mathrm{Post}}\big] - E\big[Y^0_j \mathop{\mathrm{\,\vert\,}}\mathop{\mathrm{Post}}\big]}_{\text{ATT}} \\ &+\bigg [ \underbrace{E\big[Y^0_j \mathop{\mathrm{\,\vert\,}}\mathop{\mathrm{Post}}\big] - E\big[Y^0_j \mathop{\mathrm{\,\vert\,}}\mathop{\mathrm{Pre}}\big]\bigg]- \bigg [E\big[Y^0_U \mathop{\mathrm{\,\vert\,}}\mathop{\mathrm{Post}}\big] - E\big[Y_U^0 \mathop{\mathrm{\,\vert\,}}\mathop{\mathrm{Pre}}\big] }_{\text{Non-parallel trends bias in 2\times 2 case}} \bigg ] \end{align} This can be rewritten even more compactly as: \begin{align} \widehat{\delta}^{2\times 2}_{kU} = ATT_{\mathop{\mathrm{Post}},j} + \underbrace{\Delta Y^0_{\mathop{\mathrm{Post}},\mathop{\mathrm{Pre}},j} - \Delta Y^0_{\mathop{\mathrm{Post}},\mathop{\mathrm{Pre}}, U}}_{\text{Selection bias!}} \end{align} The $$2\times 2$$ DD can be expressed as the sum of the ATT itself plus a parallel trends assumption, and without parallel trends, the estimator is biased. Ask yourself—which of these two differences in the parallel trends assumption is counterfactual, $$\Delta Y^0_{\mathop{\mathrm{Post}},\mathop{\mathrm{Pre}},j}$$ or $$\Delta Y^0_{\mathop{\mathrm{Post}},\mathop{\mathrm{Pre}}, U}$$? Which one is observed, in other words, and which one is not observed? Look and see if you can figure it out from this drawing in Figure 9.19. Only if these are parallel—the counterfactual trend and the observable trend—does the selection bias term zero out and ATT is identified. But let’s keep looking within the decomposition, as we aren’t done. The other two $$2\times 2$$s need to be defined since they appear in Bacon’s decomposition also. And they are: \begin{align} \widehat{\delta}^{2\times 2}_{kU} &= ATT_k{\mathop{\mathrm{Post}}} + \Delta Y^0_k(\mathop{\mathrm{Post}}(k),\mathop{\mathrm{Pre}}(k)) - \Delta Y^0_U(\mathop{\mathrm{Post}}(k),\mathop{\mathrm{Pre}}) \\ \widehat{\delta}^{2\times 2}_{kl} &= ATT_k(MID) + \Delta Y^0_k(MID,\mathop{\mathrm{Pre}}) - \Delta Y^0_l(MID, \mathop{\mathrm{Pre}}) \end{align} These look the same because you’re always comparing the treated unit with an untreated unit (though in the second case it’s just that they haven’t been treated yet). But what about the $$2\times 2$$ that compared the late groups to the already-treated earlier groups? With a lot of substitutions like we did we get: \begin{align} \widehat{\delta}^{2\times 2}_{lk} &= ATT_{l,\mathop{\mathrm{Post}}(l)} \nonumber \\ &+ \underbrace{\Delta Y^0_l(\mathop{\mathrm{Post}}(l),MID) - \Delta Y^0_k (\mathop{\mathrm{Post}}(l), MID)}_{\text{Parallel-trends bias}} \nonumber \\ & - \underbrace{(ATT_k(\mathop{\mathrm{Post}}) - ATT_k(Mid))}_{\text{Heterogeneity in time bias!}} \end{align} I find it interesting our earlier decomposition of the simple difference in means into $$ATE$$ $$+$$ selection bias $$+$$ heterogeneity treatment effects bias resembles the decomposition of the late to early $$2\times 2$$ DD. The first line is the $$ATT$$ that we desperately hope to identify. The selection bias zeroes out insofar as $$Y^0$$ for $$k$$ and $$l$$ has the same parallel trends from $$mid$$ to $$post$$ period. And the treatment effects bias in the third line zeroes out so long as there are constant treatment effects for a group over time. But if there is heterogeneity in time for a group, then the two $$ATT$$ terms will not be the same, and therefore will not zero out. But we can sign the bias if we are willing to assume monotonicity, which means the $$mid$$ term is smaller in absolute value than the $$post$$ term. Under monotonicity, the interior of the parentheses in the third line is positive, and therefore the bias is negative. For positive ATT, this will bias the effects towards zero, and for negative ATT, it will cause the estimated ATT to become even more negative. Let’s pause and collect these terms. The decomposition formula for DD is: \begin{align} \widehat{\delta}^{DD} = \sum_{k \neq U} s_{kU}\widehat{\delta}_{kU}^{2\times 2} + \sum_{k \neq U} \sum_{l>k} s_{kl} \bigg[ \mu_{kl}\widehat{\delta}_{kl}^{2\times 2,k} + (1-\mu_{kl}) \widehat{\delta}_{kl}^{2\times 2,l} \bigg] \end{align} We will substitute the following three expressions into that formula. \begin{align} \widehat{\delta}_{kU}^{2\times 2} &= ATT_k(\mathop{\mathrm{Post}})+\Delta Y_l^0(\mathop{\mathrm{Post}},\mathop{\mathrm{Pre}})- \Delta Y_U^0(\mathop{\mathrm{Post}},\mathop{\mathrm{Pre}}) \\ \widehat{\delta}_{kl}^{2\times 2,k} &=ATT_k( \mathop{\mathrm{\,\vert\,}})+\Delta Y_l^0( \mathop{\mathrm{\,\vert\,}},\mathop{\mathrm{Pre}})-\Delta Y_l^0( \mathop{\mathrm{\,\vert\,}}, \mathop{\mathrm{Pre}}) \\ \widehat{\delta}^{2\times 2,l}_{lk} &=ATT_{l}\mathop{\mathrm{Post}}(l)+\Delta Y^0_l(\mathop{\mathrm{Post}}(l), \mathop{\mathrm{\,\vert\,}})-\Delta Y^0_k (\mathop{\mathrm{Post}}(l), \mathop{\mathrm{\,\vert\,}}) \\ &- (ATT_k(\mathop{\mathrm{Post}})-ATT_k( \mathop{\mathrm{\,\vert\,}})) \end{align} Substituting all three terms into the decomposition formula is a bit overwhelming, so let’s simplify the notation. The estimated DD parameter is equal to: $p\lim\widehat{\delta}^{DD}_{n\to\infty} = VWATT + VWCT - \Delta ATT$ In the next few sections, I discuss each individual element of this expression. ### 9.6.3 Variance weighted ATT We begin by discussing the variance weighted average treatment effect on the treatment group, or $$VWATT$$. Its unpacked expression is: \begin{align} VWATT &=\sum_{k\neq U}\sigma_{kU}ATT_k(\mathop{\mathrm{Post}}(k)) \\ &+\sum_{k \neq U} \sum_{l>k} \sigma_{kl} \bigg [ \mu_{kl} ATT_k ( \mathop{\mathrm{\,\vert\,}})+ (1-\mu_{kl}) ATT_{l} (POST(l)) \bigg ] \end{align} where $$\sigma$$ is like $$s$$, only population terms not samples. Notice that the VWATT simply contains the three ATTs identified above, each of which was weighted by the weights contained in the decomposition formula. While these weights sum to one, that weighting is irrelevant if the ATT are identical.150 When I learned that the DD coefficient was a weighted average of all individual $$2\times 2$$s, I was not terribly surprised. I may not have intuitively known that the weights were based on group shares and treatment variance, but I figured it was probably a weighted average nonetheless. I did not have that same experience, though, when I worked through the other two terms. I now turn to the other two terms: the VWCT and the $$\Delta ATT$$. ### 9.6.5 ATT heterogeneity within time bias When we decomposed the simple difference in mean outcomes into the sum of the ATE, selection bias, and heterogeneous treatment effects bias, it really wasn’t a huge headache. That was because if the ATT differed from the ATU, then the simple difference in mean outcomes became the sum of ATT and selection bias, which was still an interesting parameter. But in the Bacon decomposition, ATT heterogeneity over time introduces bias that is not so benign. Let’s look at what happens when there is time-variant within-group treatment effects. \begin{align} \Delta ATT = \sum_{k \neq U} \sum_{l>k} (1 - \mu_{kl}) \Big[ ATT_k(\mathop{\mathrm{Post}}(l) - ATT_k( \mathop{\mathrm{\,\vert\,}})) \Big] \end{align} Heterogeneity in the ATT has two interpretations: you can have heterogeneous treatment effects across groups, and you can have heterogeneous treatment effects within groups over time. The $$\Delta ATT$$ is concerned with the latter only. The first case would be heterogeneity across units but not within groups. When there is heterogeneity across groups, then the VWATT is simply the average over group-specific ATTs weighted by a function of sample shares and treatment variance. There is no bias from this kind of heterogeneity.151 But it’s the second case—when $$ATT$$ is constant across units but heterogeneous within groups over time—that things get a little worrisome. Time-varying treatment effects, even if they are identical across units, generate cross-group heterogeneity because of the differing post-treatment windows, and the fact that earlier-treated groups are serving as controls for later-treated groups. Let’s consider a case where the counterfactual outcomes are identical, but the treatment effect is a linear break in the trend (Figure 9.20). For instance, $$Y^1_{it} = Y^0_{it} + \theta (t-t^*_1+1)$$ similar to Meer and West (2016). Notice how the first $$2\times 2$$ uses the later group as its control in the middle period, but in the late period, the later-treated group is using the earlier treated as its control. When is this a problem? It’s a problem if there are a lot of those $$2\times 2$$s or if their weights are large. If they are negligible portions of the estimate, then even if it exists, then given their weights are small (as group shares are also an important piece of the weighting not just the variance in treatment) the bias may be small. But let’s say that doesn’t hold. Then what is going on? The effect is biased because the control group is experiencing a trend in outcomes (e.g., heterogeneous treatment effects), and this bias feeds through to the later $$2\times 2$$ according to the size of the weights, $$(1-\mu_{kl})$$. We will need to correct for this if our plan is to stick with the twoway fixed effects estimator. Now it’s time to use what we’ve learned. Let’s look at an interesting and important paper by Cheng and Hoekstra (2013) to both learn more about a DD paper and replicate it using event studies and the Bacon decomposition. ### 9.6.6 Castle-doctrine statutes and homicides Cheng and Hoekstra (2013) evaluated the impact that a gun reform had on violence and to illustrate various principles and practices regarding differential timing. I’d like to discuss those principles in the context of this paper. This next section will discuss, extend, and replicate various parts of this study. Trayvon Benjamin Martin was a 17-year-old African-American young man when George Zimmerman shot and killed him in Sanford, Florida, on February 26, 2012. Martin was walking home alone from a convenience store when Zimmerman spotted him, followed him from a distance, and reported him to the police. He said he found Martin’s behavior “suspicious,” and though police officers urged Zimmerman to stay back, Zimmerman stalked and eventually provoked Martin. An altercation occurred and Zimmerman fatally shot Martin. Zimmerman claimed self-defense and was nonetheless charged with Martin’s death. A jury acquitted him of second-degree murder and of manslaughter. Zimmerman’s actions were interpreted by the jury to be legal because in 2005, Florida reformed when and where lethal self-defense could be used. Whereas once lethal self-defense was only legal inside the home, a new law, “Stand Your Ground,” had extended that right to other public places. Between 2000 and 2010, twenty-one states explicitly expanded the castle-doctrine statute by extending the places outside the home where lethal force could be legally used.152 These states had removed a long-standing tradition in the common law that placed the duty to retreat from danger on the victim. After these reforms, though, victims no longer had a duty to retreat in public places if they felt threatened; they could retaliate in lethal self-defense. Other changes were also made. In some states, individuals who used lethal force outside the home were assumed to be reasonably afraid. Thus, a prosecutor would have to prove fear was not reasonable, allegedly an almost impossible task. Civil liability for those acting under these expansions was also removed. As civil liability is a lower threshold of guilt than criminal guilt, this effectively removed the remaining constraint that might keep someone from using lethal force outside the home. From an economic perspective, these reforms lowered the cost of killing someone. One could use lethal self-defense in situations from which they had previously been barred. And as there was no civil liability, the expected cost of killing someone was now lower. Thus, insofar as people are sensitive to incentives, then depending on the elasticities of lethal self-defense with respect to cost, we expect an increase in lethal violence for the marginal victim. The reforms may have, in other words, caused homicides to rise. One can divide lethal force into true and false positives. The true positive use of lethal force would be those situations in which, had the person not used lethal force, he or she would have been murdered. Thus, the true positive case of lethal force is simply a transfer of one life (the offender) for another (the defender). This is tragic, but official statistics would not record an net increase in homicides relative to the counterfactual—only which person had been killed. But a false positive causes a net increase in homicides relative to the counterfactual. Some arguments can escalate unnecessarily, and yet under common law, the duty to retreat would have defused the situation before it spilled over into lethal force. Now, though, under these castle-doctrine reforms, that safety valve is removed, and thus a killing occurs that would not have in counterfactual, leading to a net increase in homicides. But that is not the only possible impact of the reforms—deterrence of violence is also a possibility under these reforms. In Lott and Mustard (1997), the authors found that concealed-carry laws reduced violence. They suggested this was caused by deterrence—thinking someone may be carrying a concealed weapon, the rational criminal is deterred from committing a crime. Deterrence dates back to G. Becker (1968) and Jeremy Bentham before him. Expanding the arenas where lethal force could be used could also deter crime. Since this theoretical possibility depends crucially on key elasticities, which may in fact be zero, deterrence from expanding where guns can be used to kill someone is ultimately an empirical question. Cheng and Hoekstra (2013) chose a difference-in-differences design for their project where the castle doctrine law was the treatment and timing was differential across states. Their estimatingequation was $Y_{it}=\alpha + \delta D_{it} + \gamma X_{it} + \sigma_i + \tau_t + \varepsilon_{it}$ where $$D_{it}$$ is the treatment parameter. They estimated this equation using a standard twoway fixed effects model as well as count models. Ordinarily, the treatment parameter will be a 0 or 1, but in Cheng and Hoekstra (2013), it’s a variable ranging from 0 to 1, because some states get the law change mid-year. So if they got the law in July, then $$D_{it}$$ equals 0 before the year of adoption, 0.5 in the year of adoption and 1 thereafter. The $$X_{it}$$ variable included a particular kind of control that they called “region-by-year fixed effects,” which was a vector of dummies for the census region to which the state belonged interacted with each year fixed effect. This was done so that explicit counterfactuals were forced to come from within the same census region.153 As the results are not dramatically different between their twoway fixed effects and count models, I will tend to emphasize results from the twoway fixed effects. The data they used is somewhat standard in crime studies. They used the FBI Uniform Crime Reports Summary Part I files from 2000 to 2010. The FBI Uniform Crime Reports is a harmonized data set on eight “index” crimes collected from voluntarily participating police agencies across the country. Participation is high and the data goes back many decades, making it attractive for many contemporary questions regarding the crime policy. Crimes were converted into rates, or “offenses per 100,000 population.” Cheng and Hoekstra (2013) rhetorically open their study with a series of simple placebos to check whether the reforms were spuriously correlated with crime trends more generally. Since oftentimes many crimes are correlated because of unobserved factors, this has some appeal, as it rules out the possibility that the laws were simply being adopted in areas where crime rates were already rising. For their falsifications they chose motor vehicle thefts and larcenies, neither of which, they reasoned, should be credibly connected to lowering the cost of using lethal force in public. There are so many regression coefficients in Table 9.5 because applied microeconomists like to report results under increasingly restrictive models. In this case, each column is a new regression with additional controls such as additional fixed-effects specifications, time-varying controls, a one-year lead to check on the pre-treatment differences in outcomes, and state-specific trends. As you can see, many of these coefficients are very small, and because they are small, even large standard errors yield a range of estimates that are still not very large. Next they look at what they consider to be crimes that might be deterred if policy created a credible threat of lethal retaliation in public: burglary, robbery, and aggravated assault. Insofar as castle doctrine has a deterrence effect, then we would expect a negative effect of the law on offenses. But all of the regressions shown in Table 9.6 are actually positive, and very few are significant even still. So the authors conclude they cannot detect any deterrence—which does not mean it didn’t happen; just that they cannot reject the null for large effects. Now they move to their main results, which is interesting because it’s much more common for authors to lead with their main results. But the rhetoric of this paper is somewhat original in that respect. By this point, the reader has seen a lot of null effects from the laws and may be wondering, “What’s going on? This law isn’t spurious and isn’t causing deterrence. Why am I reading this paper?” The first thing the authors did was show a series of figures showing the raw data on homicides for treatment and control states. This is always a challenge when working with differential timing, though. For instance, approximately twenty states adopted a castle-doctrine law from 2005 to 2010, but not at the same time. So how are you going to show this visually? What is the pre-treatment period, for instance, for the control group when there is differential timing? If one state adopts Table 9.5: Falsification Tests: The effect of castle doctrine laws on larceny and motor vehicle theft. OLS – Weighted by State Population (1) (2) (3) (4) (5) (6) Panel A. Larceny Log(Larceny Rate) Castle Doctrine 0.00300 $$-0.00600$$ $$-0.00910$$ $$-0.0858$$ $$-0.00401$$ $$-0.00284$$ Law (0.0161) (0.0147) (0.0139) (0.0139) (0.0128) (0.0180) 0 to 2 years before adoption of castle doctrine law 0.00112 (0.0105) Observation 550 550 550 550 550 550 Panel B. Motor Log(Motor Vehicle Theft Rate) Castle Doctrine 0.0517 $$-0.0389$$ $$-0.0252$$ $$-0.0294$$ $$-0.0165$$ $$-0.00708$$ Law (0.0563) (0.448) (0.0396) (0.0469) (0.0354) (0.0372) 0 to 2 years before adoption of castle doctrine law $$-0.00896$$ (0.0216) Observation 550 550 550 550 550 550 State and year fixed effects Yes Yes Yes Yes Yes Yes Region-by-year fixed effects Yes Yes Yes Yes Yes Time-varying controls Yes Yes Yes Yes Controls for larceny or motor theft Yes State-specific linear time trends Yes Each column in each panel represents a separate regression. The unit of observation is state-year. Robust standard errors are clustered at the state level. Time-varying controls include policing and incarceration rates, welfare and public assistance spending, median income, poverty rate, unemployment rate, and demographics. $$^{*}$$ Significant at the 10 percent level. $$^{**}$$ Significant at the 5 percent level. $$^{***}$$ Significant at the 1 percent level. Table 9.6: The deterrence effects of castle-doctrine laws: Burglary, robbery, and aggravated assault. OLS – Weighted by State Population (1) (2) (3) (4) (5) (6) Panel A. Burglary Log(Burglary Rate) Castle-doctrine law 0.0780$$^{***}$$ 0.0290 0.0223 0.0181 0.0327$$^*$$ 0.0237 0 to 2 years before adoption of castle-doctrine law (0.0255) (0.0236) (0.0223) (0.0265) (0.0165) (0.0207) $$-0.009606$$ (0.0133) Panel B. Robbery Log(Robery Rate) Castle-doctrine law 0.0408 0.0344 0.0262 0.0197 0.0376$$^{**}$$ 0.0515$$^*$$ (0.0254) (0.0224) (0.0229) (0.0257) (0.0181) (0.0274) 0 to 2 years before adoption of castle-doctrine law $$-0.0138$$ (0.0153) Panel C. Aggravated Log(Aggrevated Assault Rate) Castle-doctrine law 0.0434 0.0397 0.0372 0.0330 0.0424 0.0414 (0.0387) (0.0407) (0.0319) (0.0367) (0.0291) (0.0285) 0 to 2 years before adoption of castle-doctrine law $$-0.00897$$ (0.0216) Observation 550 550 550 550 550 550 State and year fixed effects Yes Yes Yes Yes Yes Yes Region-by-year fixed effects Yes Yes Yes Yes Yes Time-varying controls Yes Yes Yes Yes Controls for larceny or motor theft Yes State-specific linear time trends Yes Each column in each panel represents a separate regression. The unit of observation is state-year. Robust standard errors are clustered at the state level. Time-varying controls include policing and incarceration rates, welfare and public assistance spending, median income, poverty rate, unemployment rate, and demographics. $$^{*}$$ Significant at the 10 percent level. $$^{**}$$ Significant at the 5 percent level. $$^{***}$$ Significant at the 1 percent level. in 2005, but another in 2006, then what precisely is the pre- and post-treatment for the control group? So that’s a bit of a challenge, and yet if you stick with our guiding principle that causal inference studies desperately need data visualization of the main effects, your job is to solve it with creativity and honesty to make beautiful figures. Cheng and Hoekstra (2013) could’ve presented regression coefficients on leads and lags, as that is very commonly done, but knowing these authors firsthand, their preference is to give the reader pictures of the raw data to be as transparent as possible. Therefore, they showed multiple figures where each figure was a “treatment group” compared to all the “never-treated” units. Figure 9.21 shows the Florida case. Notice that before the passage of the law, the offenses are fairly flat for treatment and control. Obviously, as I’ve emphasized, this is not a direct test of the parallel-trends assumption. Parallel trends in the pre-treatment period are neither necessary nor sufficient. The identifying assumption, recall, is that of variance-weighted common trends, which are entirely based on parallel counterfactual trends, not pre-treatment trends. But researchers use parallel pre-treatment trends like a hunch that the counterfactual trends would have been parallel. In one sense, parallel pre-treatment rules out some obvious spurious factors that we should be worried about, such as the law adoption happening around the timing of a change, even if that’s simply nothing more than seemingly spurious factors like rising homicides. But that’s clearly not happening here—homicides weren’t diverging from controls pre-treatment. They were following a similar trajectory before Florida passed its law and only then did the trends converge. Notice that after 2005, which is when the law occurs, there’s a sizable jump in homicides. There are additional figures like this, but they all have this set up—they show a treatment group over time compared to the same “never-treated” group.  Panel A. Homicide Log(Homicide Rates) OLS–Weights 1 2 3 4 5 6 Castle Doctrine Law 0.0801$$^{**}$$ 0.0946$$^{***}$$ 0.0937$$^{***}$$ 0.0955$$^{**}$$ 0.0985$$^{***}$$ 0.100$$^{**}$$ (0.0342) (0.0279) (0.0290) (0.0367) (0.0299) (0.0388) 0 to 2 years before adoption of castle doctrine law 0.00398 (0.0222) Observation 550 550 550 550 550 550 State and Year Fixed Effects Yes Yes Yes Yes Yes Yes Region-by-year Fixed Yes Yes Yes Yes Yes Effects Time-Varying Controls Yes Yes Yes Yes Controls for Larceny or Motor Theft Yes State-specific Linear Time Trends Yes Each column in each panel represents a separate regression. The unit of observation is state-year. Robust standard errors are clustered at the state level. Time-varying controls include policing and incarceration rates, welfare and public assistance spending, median income, poverty rate, unemployment rate, and demographics. $$^{**}$$ Significant at the 5 percent level. $$^{***}$$ Significant at the 1 percent level. Insofar as the cost of committing lethal force has fallen, then we expect to see more of it, which implies a positive coefficient on the estimated $$\delta$$ term assuming the heterogeneity bias we discussed earlier doesn’t cause the twoway fixed effects estimated coefficient to flip signs. It should be different from zero both statistically and in a meaningful magnitude. They present four separate types of specifications—three using OLS, one using negative binomial. But I will only report the weighted OLS regressions for the sake of space. There’s a lot of information in Table 9.7, so let’s be sure not to get lost. First, all coefficients are positive and similar in magnitude—between 8% and 10% increases in homicides. Second, three of the four panels are almost entirely significant. It appears that the bulk of their evidence suggests the castle-doctrine statute caused an increase in homicides around 8%. Not satisfied, the authors implemented a kind of randomization inference-based test. Specifically, they moved the eleven-year panel back in time covering 1960–2009 and estimated forty placebo “effects” of passing castle doctrine one to forty years earlier. When they did this, they found that the average effect from this exercise was essentially zero. Those results are summarized here. It appears there is something statistically unusual about the actual treatment profile compared to the placebo profiles, because the actual profile yields effect sizes larger than all but one case in any of the placebo regressions run. Randomization inference averages . Method Average estimate Estimates larger than actual estimate Weighted OLS $$-0.003$$ 0/40 Unweighted OLS 0.001 1/40 Negative binomial 0.001 0/40 Cheng and Hoekstra (2013) found no evidence that castle-doctrine laws deter violent offenses, but they did find that it increased homicides. An 8% net increase in homicide rates translates to around six hundred additional homicides per year across the twenty-one adopting states. Thinking back to to the killing of Trayvon Martin by George Zimmerman, one is left to wonder whether Trayvon might still be alive had Florida not passed Stand Your Ground. This kind of counterfactual reasoning can drive you crazy, because it is unanswerable—we simply don’t know, cannot know, and never will know the answer to counterfactual questions. The fundamental problem of causal inference states that we need to know what would have happened that fateful night without Stand Your Ground and compare that with what happened with Stand Your Ground to know what can and cannot be placed at the feet of that law. What we do know is that under certain assumptions related to the DD design, homicides were on net around 8%–10% higher than they would’ve been when compared against explicit counterfactuals. And while that doesn’t answer every question, it suggests that a nontrivial number of deaths can be blamed on laws similar to Stand Your Ground. ### 9.6.7 Replicating Cheng and Hoekstra (2013), sort of Now that we’ve discussed Cheng and Hoekstra (2013), I want to replicate it, or at least do some work on their data set to illustrate certain things that we’ve discussed, like event studies and the Bacon decomposition. This analysis will be slightly different from what they did, though, because their policy variable was on the interval $$[0,1]$$ rather than being a pure dummy. That’s because they carefully defined their policy variable according to the month in which the law was passed (e.g., June) divided by a total of 12 months. So if a state passed the last in June, then they would assign a 0.5 in the first year, and a 1 thereafter. While there’s nothing wrong with that approach, I am going to use a dummy because it makes the event studies a bit easier to visualize, and the Bacon decomposition only works with dummy policy variables. First, I will replicate his main homicide results from Panel A, column 6, of Figure 9.21. Stata Code castle_1.do use https://github.com/scunning1975/mixtape/raw/master/castle.dta, clear set scheme cleanplots * ssc install bacondecomp * define global macros global crime1 jhcitizen_c jhpolice_c murder homicide robbery assault burglary larceny motor robbery_gun_r global demo blackm_15_24 whitem_15_24 blackm_25_44 whitem_25_44 //demographics global lintrend trend_1-trend_51 //state linear trend global region r20001-r20104 //region-quarter fixed effects global exocrime l_larceny l_motor // exogenous crime rates global spending l_exp_subsidy l_exp_pubwelfare global xvar l_police unemployrt poverty l_income l_prisoner l_lagprisonerdemo $spending label variable post "Year of treatment" xi: xtreg l_homicide i.year$region $xvar$lintrend post [aweight=popwt], fe vce(cluster sid)

R Code

castle_1.R

library(bacondecomp)
library(tidyverse)
library(haven)
library(lfe)

{
full_path <- paste("https://raw.github.com/scunning1975/mixtape/master/",
df, sep = "")
return(df)
}

#--- global variables
crime1 <- c("jhcitizen_c", "jhpolice_c",
"murder", "homicide",
"robbery", "assault", "burglary",
"larceny", "motor", "robbery_gun_r")

demo <- c("emo", "blackm_15_24", "whitem_15_24",
"blackm_25_44", "whitem_25_44")

# variables dropped to prevent colinearity
dropped_vars <- c("r20004", "r20014",
"r20024", "r20034",
"r20044", "r20054",
"r20064", "r20074",
"r20084", "r20094",
"r20101", "r20102", "r20103",
"r20104", "trend_9", "trend_46",
"trend_49", "trend_50", "trend_51"
)

lintrend <- castle %>%
select(starts_with("trend")) %>%
colnames %>%
# remove due to colinearity
subset(.,! . %in% dropped_vars)

region <- castle %>%
select(starts_with("r20")) %>%
colnames %>%
# remove due to colinearity
subset(.,! . %in% dropped_vars)

exocrime <- c("l_lacerny", "l_motor")
spending <- c("l_exp_subsidy", "l_exp_pubwelfare")

xvar <- c(
"blackm_15_24", "whitem_15_24", "blackm_25_44", "whitem_25_44",
"l_exp_subsidy", "l_exp_pubwelfare",
"l_police", "unemployrt", "poverty",
"l_income", "l_prisoner", "l_lagprisoner"
)

law <- c("cdl")

dd_formula <- as.formula(
paste("l_homicide ~ ",
paste(
paste(xvar, collapse = " + "),
paste(region, collapse = " + "),
paste(lintrend, collapse = " + "),
paste("post", collapse = " + "), sep = " + "),
"| year + sid | 0 | sid"
)
)

#Fixed effect regression using post as treatment variable
dd_reg <- felm(dd_formula, weights = castle$popwt, data = castle) summary(dd_reg) Python Code castle_1.py import numpy as np import pandas as pd import statsmodels.api as sm import statsmodels.formula.api as smf from itertools import combinations import plotnine as p # read data import ssl ssl._create_default_https_context = ssl._create_unverified_context def read_data(file): return pd.read_stata("https://raw.github.com/scunning1975/mixtape/master/" + file) castle = read_data('castle.dta') crime1 = ("jhcitizen_c", "jhpolice_c", "murder", "homicide", "robbery", "assault", "burglary", "larceny", "motor", "robbery_gun_r") demo = ("emo", "blackm_15_24", "whitem_15_24", "blackm_25_44", "whitem_25_44") # variables dropped to prevent colinearity dropped_vars = ("r20004", "r20014", "r20024", "r20034", "r20044", "r20054", "r20064", "r20074", "r20084", "r20094", "r20101", "r20102", "r20103", "r20104", "trend_9", "trend_46", "trend_49", "trend_50", "trend_51") cols = pd.Series(castle.columns) trend_cols = set(cols[cols.str.contains('^trend')]) lintrend = castle[trend_cols - set(dropped_vars)] region = set(cols[cols.str.contains('^r20')]) lintrend = set(cols[cols.str.contains('^trend')]) exocrime = ("l_lacerny", "l_motor") spending = ("l_exp_subsidy", "l_exp_pubwelfare") xvar = ( "blackm_15_24", "whitem_15_24", "blackm_25_44", "whitem_25_44", "l_exp_subsidy", "l_exp_pubwelfare", "l_police", "unemployrt", "poverty", "l_income", "l_prisoner", "l_lagprisoner" ) law = ("cdl") dd_formula = "l_homicide ~ {} + {} + {} + post + C(year) + C(sid)".format( "+".join(xvar), "+".join(region), "+".join(lintrend)) #Fixed effect regression using post as treatment variable dd_reg = smf.wls(dd_formula, data = castle, weights = castle['popwt']).fit(cov_type='cluster', cov_kwds={'groups':castle['sid']}) dd_reg.summary() Here we see the main result that castle doctrine expansions led to an approximately 10% increase in homicides. And if we use the post-dummy, which is essentially equal to 0 unless the state had fully covered castle doctrine expansions, then the effect is more like 7.6%. But now, I’d like to go beyond their study to implement an event study. First, we need to define pre-treatment leads and lags. To do this, we use a “time_til” variable, which is the number of years until or after the state received the treatment. Using this variable, we then create the leads (which will be the years prior to treatment) and lags (the years post-treatment). Stata Code castle_2.do * Event study regression with the year of treatment (lag0) as the omitted category. xi: xtreg l_homicide i.year$region lead9 lead8 lead7 lead6 lead5 lead4 lead3 lead2 lead1 lag1-lag5 [aweight=popwt], fe vce(cluster sid)

R Code

castle_2.R

castle <- castle %>%
mutate(
time_til = year - treatment_date,
lead1 = case_when(time_til == -1 ~ 1, TRUE ~ 0),
lead2 = case_when(time_til == -2 ~ 1, TRUE ~ 0),
lead3 = case_when(time_til == -3 ~ 1, TRUE ~ 0),
lead4 = case_when(time_til == -4 ~ 1, TRUE ~ 0),
lead5 = case_when(time_til == -5 ~ 1, TRUE ~ 0),
lead6 = case_when(time_til == -6 ~ 1, TRUE ~ 0),
lead7 = case_when(time_til == -7 ~ 1, TRUE ~ 0),
lead8 = case_when(time_til == -8 ~ 1, TRUE ~ 0),
lead9 = case_when(time_til == -9 ~ 1, TRUE ~ 0),

lag0 = case_when(time_til == 0 ~ 1, TRUE ~ 0),
lag1 = case_when(time_til == 1 ~ 1, TRUE ~ 0),
lag2 = case_when(time_til == 2 ~ 1, TRUE ~ 0),
lag3 = case_when(time_til == 3 ~ 1, TRUE ~ 0),
lag4 = case_when(time_til == 4 ~ 1, TRUE ~ 0),
lag5 = case_when(time_til == 5 ~ 1, TRUE ~ 0)
)

event_study_formula <- as.formula(
paste("l_homicide ~ + ",
paste(
paste(region, collapse = " + "),
paste(paste("lead", 1:9, sep = ""), collapse = " + "),
paste(paste("lag", 1:5, sep = ""), collapse = " + "), sep = " + "),
"| year + state | 0 | sid"
),
)

event_study_reg <- felm(event_study_formula, weights = castle$popwt, data = castle) summary(event_study_reg) Python Code castle_2.py import numpy as np import pandas as pd import statsmodels.api as sm import statsmodels.formula.api as smf from itertools import combinations import plotnine as p # read data import ssl ssl._create_default_https_context = ssl._create_unverified_context def read_data(file): return pd.read_stata("https://raw.github.com/scunning1975/mixtape/master/" + file) castle['time_til'] = castle['year'] - castle['treatment_date'] castle['lead1'] = castle['time_til'] == -1 castle['lead2'] = castle['time_til'] == -2 castle['lead3'] = castle['time_til'] == -3 castle['lead4'] = castle['time_til'] == -4 castle['lead5'] = castle['time_til'] == -5 castle['lead6'] = castle['time_til'] == -6 castle['lead7'] = castle['time_til'] == -7 castle['lead8'] = castle['time_til'] == -8 castle['lead9'] = castle['time_til'] == -9 castle['lag0'] = castle['time_til'] == 0 castle['lag1'] = castle['time_til'] == 1 castle['lag2'] = castle['time_til'] == 2 castle['lag3'] = castle['time_til'] == 3 castle['lag4'] = castle['time_til'] == 4 castle['lag5'] = castle['time_til'] == 5 formula = "l_homicide ~ r20001 + r20002 + r20003 + r20011 + r20012 + r20013 + r20021 + r20022 + r20023 + r20031 + r20032 + r20033 + r20041 + r20042 + r20043 + r20051 + r20052 + r20053 + r20061 + r20062 + r20063 + r20071 + r20072 + r20073 + r20081 + r20082 + r20083 + r20091 + r20092 + r20093 + lead1 + lead2 + lead3 + lead4 + lead5 + lead6 + lead7 + lead8 + lead9 + lag1 + lag2 + lag3 + lag4 + lag5 + C(year) + C(state)" event_study_formula = smf.wls(formula, data = castle, weights = castle['popwt']).fit(cov_type='cluster', cov_kwds={'groups':castle['sid']}) event_study_formula.summary() Our omitted category is the year of treatment, so all coefficients are with respect to that year. You can see from the coefficients on the leads that they are not statistically different from zero prior to treatment, except for leads 8 and 9, which may be because there are only three states with eight years prior to treatment, and one state with nine years prior to treatment. But in the years prior to treatment, leads 1 to 6 are equal to zero and statistically insignificant, although they do technically have large confidence intervals. The lags, on the other hand, are all positive and not too dissimilar from one another except for lag 5, which is around 17%. Now it is customary to plot these event studies, so let’s do that now. I am going to show you an easy way and a longer way to do this. The longer way gives you ultimately more control over what exactly you want the event study to look like, but for a fast and dirty method, the easier way will suffice. For the easier way, you will need to install a program in Stata called coefplot, written by Ben Jann, author of estout.154 Stata Code castle_3.do * Plot the coefficients using coefplot * ssc install coefplot coefplot, keep(lead9 lead8 lead7 lead6 lead5 lead4 lead3 lead2 lead1 lag1 lag2 lag3 lag4 lag5) xlabel(, angle(vertical)) yline(0) xline(9.5) vertical msymbol(D) mfcolor(white) ciopts(lwidth(*3) lcolor(*.6)) mlabel format(%9.3f) mlabposition(12) mlabgap(*2) title(Log Murder Rate)  R Code castle_3.R # order of the coefficients for the plot plot_order <- c("lead9", "lead8", "lead7", "lead6", "lead5", "lead4", "lead3", "lead2", "lead1", "lag1", "lag2", "lag3", "lag4", "lag5") # grab the clustered standard errors # and average coefficient estimates # from the regression, label them accordingly # add a zero'th lag for plotting purposes leadslags_plot <- tibble( sd = c(event_study_reg$cse[plot_order], 0),
mean = c(coef(event_study_reg)[plot_order], 0),
label = c(-9,-8,-7,-6, -5, -4, -3, -2, -1, 1,2,3,4,5, 0)
)

# This version has a point-range at each
# comes down to stylistic preference at the
# end of the day!
ggplot(aes(x = label, y = mean,
ymin = mean-1.96*sd,
ymax = mean+1.96*sd)) +
geom_hline(yintercept = 0.035169444, color = "red") +
geom_pointrange() +
theme_minimal() +
xlab("Years before and after castle doctrine expansion") +
ylab("log(Homicide Rate)") +
geom_hline(yintercept = 0,
linetype = "dashed") +
geom_vline(xintercept = 0,
linetype = "dashed")

Python Code

castle_3.py

import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from itertools import combinations
import plotnine as p

import ssl
ssl._create_default_https_context = ssl._create_unverified_context

castle['time_til'] = castle['year'] - castle['treatment_date']

castle['lag0'] = castle['time_til'] == 0
castle['lag1'] = castle['time_til'] == 1
castle['lag2'] = castle['time_til'] == 2
castle['lag3'] = castle['time_til'] == 3
castle['lag4'] = castle['time_til'] == 4
castle['lag5'] = castle['time_til'] == 5

formula = "l_homicide ~ r20001 + r20002 + r20003 + r20011 + r20012 + r20013 + r20021 + r20022 + r20023 + r20031 + r20032 + r20033 + r20041 + r20042 + r20043 + r20051 + r20052 + r20053 + r20061 + r20062 + r20063 + r20071 + r20072 + r20073 + r20081 + r20082 + r20083 + r20091 + r20092 + r20093 + lead1 + lead2 + lead3 + lead4 + lead5 + lead6 + lead7 + lead8 + lead9 + lag1 + lag2 + lag3 + lag4 + lag5 + C(year) + C(state)"

event_study_formula = smf.wls(formula,
data = castle, weights = castle['popwt']).fit(cov_type='cluster', cov_kwds={'groups':castle['sid']})

lags = ['lag1[T.True]', 'lag2[T.True]', 'lag3[T.True]', 'lag4[T.True]', 'lag5[T.True]']

'label': np.arange(-9, 6)})

# This version has a point-range at each
# comes down to stylistic preference at the
# end of the day!
p.ggplot(leadslags_plot, p.aes(x = 'label', y = 'mean',
ymin = 'lb',
ymax = 'ub')) +\
p.geom_hline(yintercept = 0.035169444, color = "red") +\
p.geom_pointrange() +\
p.theme_minimal() +\
p.xlab("Years before and after castle doctrine expansion") +\
p.ylab("log(Homicide Rate)") +\
p.geom_hline(yintercept = 0,
linetype = "dashed") +\
p.geom_vline(xintercept = 0,
linetype = "dashed")

Let’s look now at what this command created. As you can see in Figure 9.22, eight to nine years prior to treatment, treatment states have significantly lower levels of homicides, but as there are so few states that even have these values (one with $$-9$$ and three with $$-8$$), we may want to disregard the relevance of these negative effects if for no other reason than that there are so few units in the dummy and we know from earlier that that can lead to very high overrejection rates . Instead, notice that for the six years prior to treatment, there is virtually no difference between the treatment states and the control states.

But, after the year of treatment, that changes. Log murders begin rising, which is consistent with our post dummy that imposed zeros on all pre-treatment leads and required that the average effect post-treatment be a constant.

I promised to show you how to make this graph in a way that gave more flexibility, but you should be warned, this is a bit more cumbersome.

Stata Code

castle_4.do

xi: xtreg l_homicide  i.year $region$xvar $lintrend post [aweight=popwt], fe vce(cluster sid) local DDL = _b[post] local DD : display %03.2f _b[post] local DDSE : display %03.2f _se[post] local DD1 = -0.10 xi: xtreg l_homicide i.year$region lead9 lead8 lead7 lead6 lead5 lead4 lead3 lead2 lead1 lag1-lag5 [aweight=popwt], fe vce(cluster sid)

outreg2 using "./eventstudy_levels.xls", replace keep(lead9 lead8 lead7 lead6 lead5 lead4 lead3 lead2 lead1 lag1-lag5) noparen noaster addstat(DD, DD', DDSE, DDSE')

*Pull in the ES Coefs
xmluse "./eventstudy_levels.xls", clear cells(A3:B32) first
replace VARIABLES = subinstr(VARIABLES,"lag","",.)
quietly destring _all, replace ignore(",")
replace VARIABLES = -9 in 2
replace VARIABLES = -8 in 4
replace VARIABLES = -7 in 6
replace VARIABLES = -6 in 8
replace VARIABLES = -5 in 10
replace VARIABLES = -4 in 12
replace VARIABLES = -3 in 14
replace VARIABLES = -2 in 16
replace VARIABLES = -1 in 18
replace VARIABLES = 1 in 20
replace VARIABLES = 2 in 22
replace VARIABLES = 3 in 24
replace VARIABLES = 4 in 26
replace VARIABLES = 5 in 28
drop in 1
compress
quietly destring _all, replace ignore(",")
compress

ren VARIABLES exp
gen b = exp<.
replace exp = -9 in 2
replace exp = -8 in 4
replace exp = -7 in 6
replace exp = -6 in 8
replace exp = -5 in 10
replace exp = -4 in 12
replace exp = -3 in 14
replace exp = -2 in 16
replace exp = -1 in 18
replace exp = 1 in 20
replace exp = 2 in 22
replace exp = 3 in 24
replace exp = 4 in 26
replace exp = 5 in 28

* Expand the dataset by one more observation so as to include the comparison year
local obs =_N+1
set obs obs'
for var _all: replace X = 0 in obs'
replace b = 1 in obs'
replace exp = 0 in obs'
keep exp l_homicide b
set obs 30
foreach x of varlist exp l_homicide b {
replace x'=0 in 30
}
reshape wide l_homicide, i(exp) j(b)

* Create the confidence intervals
cap drop *lb* *ub*
gen lb = l_homicide1 - 1.96*l_homicide0
gen ub = l_homicide1 + 1.96*l_homicide0

* Create the picture
set scheme s2color
#delimit ;
twoway (scatter l_homicide1 ub lb exp ,
lpattern(solid dash dash dot dot solid solid)
lcolor(gray gray gray red blue)
lwidth(thick medium medium medium medium thick thick)
msymbol(i i i i i i i i i i i i i i i) msize(medlarge medlarge)
mcolor(gray black gray gray red blue)
c(l l l l l l l l l l l l l l l)
cmissing(n n n n n n n n n n n n n n n n)
xline(0, lcolor(black) lpattern(solid))
yline(0, lcolor(black))
xlabel(-9 -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3 4 5 , labsize(medium))
ylabel(, nogrid labsize(medium))
xsize(7.5) ysize(5.5)
legend(off)
xtitle("Years before and after castle doctrine expansion", size(medium))
ytitle("Log Murders ", size(medium))
graphregion(fcolor(white) color(white) icolor(white) margin(zero))
yline(DDL', lcolor(red) lwidth(thick)) text(DD1' -0.10 "DD Coefficient = DD' (s.e. = DDSE')")
)
;

#delimit cr;

R Code

castle_4.R

# This version includes
# an interval that traces the confidence intervals
ggplot(aes(x = label, y = mean,
ymin = mean-1.96*sd,
ymax = mean+1.96*sd)) +
# this creates a red horizontal line
geom_hline(yintercept = 0.035169444, color = "red") +
geom_line() +
geom_point() +
geom_ribbon(alpha = 0.2) +
theme_minimal() +
# Important to have informative axes labels!
xlab("Years before and after castle doctrine expansion") +
ylab("log(Homicide Rate)") +
geom_hline(yintercept = 0) +
geom_vline(xintercept = 0)

Python Code

castle_4.py

# Missing python script`

You can see the figure that this creates in Figure 9.23. The difference between coefplot and this twoway command connects the event-study coefficients with lines, whereas coefplot displayed them as coefficients hanging in the air. Neither is right or wrong; I merely wanted you to see the differences for your own sake and to have code that you might experiment with and adapt to your own needs.

But the thing about this graph is that the leads are imbalanced. There’s only one state, for instance, in the ninth lead, and there’s only three in the eighth lead. So I’d like you to do two modifications to this. First, I’d like you to replace the sixth lead so that it is now equal to leads 6–9. In other words, we will force these late adopters to have the same coefficient as those with six years until treatment. When you do that, you should get Figure 9.24.

Next, let’s balance the event study by dropping the states who only show up in the seventh, eighth, and ninth leads.155 When you do this, you should get Figure 9.25.

If nothing else, exploring these different specifications and cuts of the data can help you understand just how confident you should be that prior to treatment, treatment and control states genuinely were pretty similar. And if they weren’t similar, it behooves the researcher to at minimum provide some insight to others as to why the treatment and control groups were dissimilar in levels. Because after all—if they were different in levels, then it’s entirely plausible they would be different in their counterfactual trends too because why else are they different in the first place .

### 9.6.8 Bacon decomposition

Recall that we run into trouble using the twoway fixed-effects model in a DD framework insofar as there are heterogeneous treatment effects over time. But the problem here only occurs with those $$2\times 2$$s that use late-treated units compared to early-treated units. If there are few such cases, then the issue is much less problematic depending on the magnitudes of the weights and the size of the DD coefficients themselves. What we are now going to do is simply evaluate the frequency with which this issue occurs using the Bacon decomposition. Recall that the Bacon decomposition decomposes the twoway fixed effects estimator of the DD parameter into weighted averages of individual $$2\times 2$$s across the four types of $$2\times 2$$s possible. The Bacon decomposition uses a binary treatment variable, so we will reestimate the effect of castle-doctrine statutes on logged homicide rates by coding a state as “treated” if any portion of the year it had a castle-doctrine amendment. We will work with the special case of no covariates for simplicity, though note that the decomposition works with the inclusion of covariates as well . Stata users will need to download -ddtiming- from Thomas Goldring’s website, which I’ve included in the first line.

First, let’s estimate the actual model itself using a post dummy equaling one if the state was covered by a castle-doctrine statute that year. Here we find a smaller effect than many of Cheng and Hoekstra’s estimates because we do not include their state-year interaction fixed effects strategy among other things. But this is just for illustrative purposes, so let’s move to the Bacon decomposition itself. We can decompose the parameter estimate into the three different types of $$2\times 2$$s, which I’ve reproduced in Table 9.8

Dependent Variable Log(homicide rate)
Castle-doctrine law 0.069
(0.034)
Table 9.8: Bacon Decomposition Example.
DD Comparison
Earlier T vs. Later C 0.077 $$-0.029$$
Later T vs. Earlier C 0.024 0.046
T vs. Never treated 0.899 0.078

Taking these weights, let’s just double check that they do indeed add up to the regression estimate we just obtained using our twoway fixed-effects estimator.156 \begin{align} di (0.077 * -0.029) + (0.024 * 0.046) + (0.899 * 0.078) = 0.069 \end{align}

That is our main estimate, and thus confirms what we’ve been building on, which is that the DD parameter estimate from a twoway fixed-effects estimator is simply a weighted average over the different types of $$2\times 2$$s in any differential design. Furthermore, we can see in the Bacon decomposition that most of the 0.069 parameter estimate is coming from comparing the treatment states to a group of never-treated states. The average DD estimate for that group is 0.078 with a weight of 0.899. So even though there is a later to early $$2\times 2$$ in the mix, as there always will be with any differential timing, it is small in terms of influence and ultimately pulls down the estimate.

But let’s now visualize this as distributing the weights against the DD estimates, which is a useful exercise. The horizontal line in Figure 9.26 shows the average DD estimate we obtained from our fixed-effects regression of 0.069. But then what are these other graphics? Let’s review.

Each icon in the graphic represents a single $$2\times 2$$ DD. The horizontal axis shows the weight itself, whereas the vertical axis shows the magnitude of that particular $$2\times 2$$. Icons further to the right therefore will be more influential in the final average DD than those closer to zero.

There are three kinds of icons here: an early to late group comparison (represented with a light $$\times$$), a late to early (dark $$\times$$), and a treatment compared to a never-treated (dark triangle). You can see that the dark triangles are all above zero, meaning that each of these $$2\times 2$$s (which correspond to a particular set of states getting the treatment in the same year) is positive. Now they are spread out somewhat—two groups are on the horizontal line, but the rest are higher. What appears to be the case, though, is that the group with the largest weight is really pulling down the parameter estimate and bringing it closer to the 0.069 that we find in the regression.

### 9.6.9 The future of DD

The Bacon decomposition is an important phase in our understanding of the DD design when implemented using the twoway fixed effects linear model. Prior to this decomposition, we had only a metaphorical understanding of the necessary conditions for identifying causal effects using differential timing with a twoway fixed-effects estimator. We thought that since the $$2\times 2$$ required parallel trends, that that “sort of” must be what’s going on with differential timing too. And we weren’t too far off—there is a version of parallel trends in the identifying assumptions of DD using twoway fixed effects with differential timing. But what Goodman-Bacon (2019) also showed is that the weights themselves drove the numerical estimates too, and that while some of it was intuitive (e.g., group shares being influential) others were not (e.g., variance in treatment being influential).

The Bacon decomposition also highlighted some of the unique challenges we face with differential timing. Perhaps no other problem is better highlighted in the diagnostics of the Bacon decomposition than the problematic “late to early” $$2\times 2$$ for instance. Given any heterogeneity bias, the late to early $$2\times 2$$ introduces biases even with variance weighted common trends holding! So, where to now?

From 2018 to 2020, there has been an explosion of work on the DD design. Much of it is unpublished, and there has yet to appear any real consensus among applied people as to how to handle it. Here I would like to outline what I believe could be a map as you attempt to navigate the future of DD. I have attempted to divide this new work into three categories: weighting, selective choice of “good” $$2 \times 2$$s, and matrix completion.

What we know now is that there are two fundamental problems with the DD design. First, there is the issue of weighting itself. The twoway fixed-effects estimator weights the individual $$2\times 2$$s in ways that do not make a ton of theoretical sense. For instance, why do we think that groups at the middle of the panel should be weighted more than those at the end? There’s no theoretical reason we should believe that. But as Goodman-Bacon (2019) revealed, that’s precisely what twoway fixed effects does. And this is weird because you can change your results simply by adding or subtracting years to the panel—not just because this changes the $$2\times 2$$, but also because it changes the variance in treatment itself! So that’s weird.157

But this is not really the fatal problem, you might say, with twoway fixed-effects estimates of a DD design. The bigger issue was what we saw in the Bacon decomposition—you will inevitably use past treated units as controls for future treated units, or what I called the “late to early $$2\times 2$$.” This happens both in the event study and in the designs modeling the average treatment effect with a dummy variable. Insofar as it takes more than one period for the treatment to be fully incorporated, then insofar as there’s substantial weight given to the late to early $$2\times 2$$s, the existence of heterogeneous treatment effects skews the parameter away from the ATT—maybe even flipping signs!158

Whereas the weird weighting associated with twoway fixed effects is an issue, it’s something you can at least check into because the Bacon decomposition allows you to separate out the $$2\times 2$$ average DD values from their weights. Thus, if your results are changing by adding years because your underlying $$2\times 2$$s are changing, you simply need to investigate it in the Bacon decomposition. The weights and the $$2\times 2$$s, in other words, are things that can be directly calculated, which can be a source of insight into why twoway fixed effects estimator is finding what it finds.

But the second issue is a different beast altogether. And one way to think of the emerging literature is that many authors are attempting to solve the problem that some of these $$2\times 2$$s (e.g., the late to early $$2\times 2$$) are problematic. Insofar as they are problematic, can we improve over our static twoway fixed-effects model? Let’s take a few of these issues up with examples from the growing literature.

Another solution to the weird weighting twoway fixed-effects problem has been provided by Callaway and Sant’Anna (2019).159 Callaway and Sant’Anna (2019) approach the DD framework very differently than Goodman-Bacon (2019). Callaway and Sant’Anna (2019) use an approach that allows them to estimate what they call the group-time average treatment effect, which is just the ATT for a given group at any point in time. Assuming parallel trends conditional on time-invariant covariates and overlap in a propensity score, which I’ll discuss below, you can calculate group ATT by time (relative time like in an event study or absolute time). One unique part of these authors’ approach is that it is non-parametric as opposed to regression-based. For instance, under their identifying assumptions, their nonparametric estimator for a group ATT by time is: \begin{align} ATT(g,t)=E\left[\left(\dfrac{G_g}{E[G_g]}- \dfrac{\dfrac{p_g(X)C}{1-p_g(X)}}{E \bigg [ \dfrac{p_g(X)C}{1-p_g(X)}\bigg]}\right) (Y_t-Y_{g-1})\right] \end{align} where the weights, $$p$$, are propensity scores, $$G$$ is a binary variable that is equal to 1 if an individual is first treated in period $$g$$, and $$C$$ is a binary variable equal to one for individuals in the control group. Notice there is no time index, so these $$C$$ units are the never-treated group. If you’re still with me, you should find the weights straightforward. Take observations from the control group as well as group $$g$$, and omit the other groups. Then weight up those observations from the control group that have characteristics similar to those frequently found in group $$g$$ and weight down observations from the control group that have characteristics rarely found in group $$g$$. This kind of reweighting procedure guarantees that the covariates of group $$g$$ and the control group are balanced. You can see principles from earlier chapters making their way into this DD estimation—namely, balance on covariates to create exchangeable units on observables.

But because we are calculating group-specific ATT by time, you end up with a lot of treatment effect parameters. The authors address this by showing how one can take all of these treatment effects and collapse them into more interpretable parameters, such as a larger ATT. All of this is done without running a regression, and therefore avoids some of the unique issues created in doing so.

One simple solution might be to estimate your event-study model and simply take the mean over all lags using a linear combination of all point estimates . Using this method, we in fact find considerably larger effects or nearly twice the size as we get from the simpler static twoway fixed-effects model. This is perhaps an improvement because weights can be large on the long-run effects due to large effects from group shares. So if you want a summary measure, it’s better to estimate the event study and then average them afterthe fact.

Another great example of a paper wrestling with the biases brought up by heterogeneous treatment effects is Sun and Abraham (2020). This paper is primarily motivated by problems created in event studies, but you can see some of the issues brought up in Goodman-Bacon (2019). In an event study with differential timing, as we discussed earlier, leads and lags are often used to measure dynamics in the treatment itself. But these can produce causally uninterpretable results because they will assign non-convex weights to cohort-specific treatment effects. Similar to Callaway and Sant’Anna (2019), they propose estimating a group-specific dynamic effect and from those calculate a group specific estimate.

The way I organize these papers in my mind is around the idea of heterogeneity in time, the use of twoway fixed effects, and differential timing. The theoretical insight from all these papers is the coefficients on the static twoway fixed-effects leads and lags will be unintelligible if there is heterogeneity in treatment effects over time. In this sense, we are back in the world that Goodman-Bacon (2019) revealed, in which heterogeneity treatment effect biases create real challenges for the DD design using twoway fixed effects.160

Their alternative is estimate a “saturated” model to ensure that the heterogeneous problem never occurs in the first place. The proposed alternative estimation technique is to use an interacted specification that is saturated in relative time indicators as well as cohort indicators. The treatment effect associated with this design is called the interaction-weighted estimator, and using it, the DD parameter is equivalent to the difference between the average change in outcomes for a given cohort in those periods prior to treatment and the average changes for those units that had not been treated at the time interval. Additionally, this method uses the never-treated units as controls, and thereby avoids the hairy problems noted in Goodman-Bacon (2019) when computing later to early $$2\times 2$$s.161

Another paper that attempts to circumvent the weirdness of the regression-based method when there are numerous late to early $$2\times 2$$s is Cengiz et al. (2019). This is bound to be a classic study in labor for its exhaustive search for detectable repercussions of the minimum wage on low-paying jobs. The authors ultimately find little evidence to support any concern, but how do they come to this conclusion?

Cengiz et al. (2019) take a careful approach by creating separate samples. The authors want to know the impact of minimum-wage changes on low-wage jobs across 138 state-level minimum-wage changes from 1979 to 2016. The authors in an appendix note the problems with aggregating individual DD estimates into a single parameter, and so tackle the problem incrementally by creating 138 separate data sets associated with a minimum-wage event. Each sample has both treatment groups and control groups, but not all units are used as controls. Rather, only units that were not treated within the sample window are allowed to be controls. Insofar as a control is not treated during the sample window associated with a treatment unit, it can be by this criteria used as a control. These 138 estimates are then stacked to calculate average treatment effects. This is an alternative method to the twoway fixed-effects DD estimator because it uses a more stringent criteria for whether a unit can be considered a control. This in turn circumvents the heterogeneity problems that Goodman-Bacon (2019) notes because Cengiz et al. (2019) essentially create 138 DD situations in which controls are always “never-treated” for the duration of time under consideration.

But the last methodology I will discuss that has emerged in the last couple of years is a radical departure from the regression-based methodology altogether. Rather than use a twoway fixed-effects estimator to estimate treatment effects with differential timing, Athey et al. (2018) propose a machine-learning-based methodology called “matrix completion” for panel data. The estimator is exotic and bears some resemblance to matching imputation and synthetic control. Given the growing popularity of placing machine learning at the service of causal inference, I suspect that once Stata code for matrix completion is introduced, we will see this procedure used more broadly.

Matrix completion for panel data is a machine-learning-based approach to causal inference when one is working explicitly with panel data and differential timing. The application of matrix completion to causal inference has some intuitive appeal given one of the ways that Rubin has framed causality is as a missing data problem. Thus, if we are missing the matrix of counterfactuals, we might explore whether this method from computer science could assist us in recovering it. Imagine we could create two matrices of potential outcomes: a matrix of $$Y^0$$ potential outcomes for all panel units over time and $$Y^1$$. Once treatment occurs, a unit switches from $$Y^0$$ to $$Y^1$$ under the switching equation, and therefore the missing data problem occurs. Missingness is simply another way of describing the fundamental problem of causal inference for there will never be a complete set of matrices enabling calculation of interesting treatment parameters given the switching equation only assigns one of them to reality.

Say we are interested in this treatment effect parameter: \begin{align} \widehat{\delta_{ATT}} = \dfrac{1}{N_T} \sum \big(Y_{it}^1 - Z_{it}^0\big) \end{align} where $$Y^1$$ are the observed outcomes in a panel unit at some post-treatment period, $$Z^0$$ is the estimated missing elements of the $$Y^0$$ matrix for the post-treatment period, and $$N_T$$ is the number of treatment units. Matrix completion uses the observed elements of the matrix’s realized values to predict the missing elements of the $$Y^0$$ matrix (missing due to being in the post-treatment period and therefore having switched from $$Y^0$$ to $$Y^1$$).

Analytically, this imputation is done via something called regularization-based prediction. The objective in this approach is to optimally predict the missing elements by minimizing a convex function of the difference between the observed matrix of $$Y^0$$ and the unknown complete matrix $$Z^0$$ using nuclear norm regularization. Let $$\Omega$$ denote the row and column indices $$(i,j)$$ of the observed entries of the outcomes, then the objective function can be written as \begin{align} \widehat{Z^0}=\arg\min_{Z^0} \sum_{(i,j) \in \Omega} \dfrac{(Y^0_{it} - Z^0_{it})^2}{|\Omega|}+\Lambda ||Z^0|| \end{align} where $$||Z^0||$$ is the nuclear norm (sum of singular values of $$Z0$$). The regularization parameter $$\Lambda$$ is chosen using tenfold cross validation. Athey et al. (2018) show that this procedure outperforms other methods in terms of root mean squared prediction error.

Unfortunately, at present estimation using matrix completion is not available in Stata. R packages for it do exist, such as the gsynth package, but it has to be adapted for Stata users. And until it is created, I suspect adoption will lag.

## 9.7 Conclusion

America’s institutionalized state federalism provides a constantly evolving laboratory for applied researchers seeking to evaluate the causal effects of laws and other interventions. It has for this reason probably become one of the most popular forms of identification among American researchers, if not the most common. A Google search of the phrase “difference-in-differences” brought up 45,000 hits. It is arguably the most common methodology you will use—more than IV or matching or even RDD, despite RDD’s greater perceived credibility. There is simply a never-ending flow of quasi-experiments being created by our decentralized data-generating process in the United States made even more advantageous by so many federal agencies being responsible for data collection, thus ensuring improved data quality and consistency.

But, what we have learned in this chapter is that while there is a current set of identifying assumptions and practices associated with the DD design, differential timing does introduce some thorny challenges that have long been misunderstood. Much of the future of DD appears to be mounting solutions to problems we are coming to understand better, such as the odd weighting of regression itself and problematic $$2\times 2$$ DDs that bias the aggregate ATT when heterogeneity in the treatment effects over time exists. Nevertheless, DD—and specifically, regression-based DD—is not going away. It is the single most popular design in the applied researcher’s toolkit and likely will be for many years to come. Thus it behooves the researcher to study this literature carefully so that they can better protect against various forms of bias.