2 Probability and Regression Review
2.1 Basic probability theory
In practice, causal inference is based on statistical models that range from the very simple to the extremely advanced. Building such models requires some rudimentary knowledge of probability theory, so let's begin with some definitions. A random process is a process that can be repeated many times, with different outcomes each time. The sample space is the set of all possible outcomes of a random process. We distinguish between discrete and continuous random processes in Table 1 below. Discrete processes produce integers, whereas continuous processes produce fractions as well.
| Description | Type | Potential outcomes |
|---|---|---|
| 12-sided die | Discrete | 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12 |
| Coin | Discrete | Heads, Tails |
| Deck of cards | Discrete | 2♣, 3♣, …, King♥, Ace♥ (the 52 distinct cards) |
| Gas prices | Continuous | Any non-negative number |
We define independent events two ways. The first refers to logical independence. For instance, two events occur but there is no reason to believe that the two events affect each other. When it is assumed that they do affect each other, this is a logical fallacy called post hoc ergo propter hoc, which is Latin for “after this, therefore because of this.” This fallacy recognizes that the temporal ordering of events is not sufficient to be able to say that the first thing caused the second.
The second definition of an independent event is statistical independence. We'll illustrate the latter with an example from the idea of sampling with and without replacement. Let's use a randomly shuffled deck of 52 cards. What is the probability that the first card drawn will be an ace? There are 4 aces among 52 cards, so the probability is 4/52 ≈ 0.077.
Assume that the first card was an ace and that we set it aside. Now we ask the question again: what is the probability that the next card drawn is also an ace? It is no longer 4/52. With only 3 aces left among the remaining 51 cards, it is 3/51 ≈ 0.059. The first draw changed the probability of the second draw, so the two draws are not independent; this is sampling without replacement. Had we returned the ace to the deck and reshuffled (sampling with replacement), the probability would have remained 4/52 and the draws would have been independent.
But what if we want to know the probability of some event occurring that requires multiple events to occur first? For instance, let's say we're talking about the Cleveland Cavaliers winning the NBA championship. In 2016, the Golden State Warriors led the Cavaliers 3–1 in the best-of-seven championship series. What had to happen for the Warriors to lose? The Cavaliers had to win three games in a row. In this instance, to find the probability, we take the product of all the marginal probabilities, or $\Pr(\cdot)^3$; if each game were won with probability 0.5, that would be $0.5 \times 0.5 \times 0.5 = 0.125$.
Let's formalize what we've been saying for a more generalized case. For independent events, to calculate joint probabilities, we multiply the marginal probabilities:
$$\Pr(A, B) = \Pr(A)\Pr(B),$$
where $\Pr(A, B)$ is the joint probability of both $A$ and $B$ occurring, and $\Pr(A)$ and $\Pr(B)$ are the marginal probabilities of each event.
Now, for a slightly more difficult application. What is the probability of rolling a 7 using two six-sided dice, and is it the same as the probability of rolling a 3? To answer this, let's compare the two probabilities. There are 36 total possible outcomes ($6 \times 6$), each equally likely. Six of them sum to 7: (1,6), (2,5), (3,4), (4,3), (5,2), and (6,1), so the probability of rolling a 7 is $6/36 = 1/6$. Only two outcomes sum to 3: (1,2) and (2,1), so the probability of rolling a 3 is $2/36 = 1/18$. Rolling a 7 is therefore three times as likely as rolling a 3.
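To make the counting concrete, here is a short Python sketch (my own illustration, not code from the text) that enumerates all 36 outcomes and tallies how many sum to 7 versus 3:

```python
from itertools import product

# enumerate all 36 equally likely outcomes of two six-sided dice
outcomes = list(product(range(1, 7), repeat=2))

p_seven = sum(1 for a, b in outcomes if a + b == 7) / len(outcomes)
p_three = sum(1 for a, b in outcomes if a + b == 3) / len(outcomes)

print(p_seven)  # 6/36 ~= 0.167
print(p_three)  # 2/36 ~= 0.056
```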
2.2 Events and conditional probability
First, before we talk about the three ways of representing a probability, I'd like to introduce some new terminology and concepts: events and conditional probabilities. Let $A$ be some event, and let $B$ be some other event. For any two events, there are four possibilities:

1. $A$ and $B$: Both $A$ and $B$ occur.
2. $\sim A$ and $B$: $A$ does not occur, but $B$ occurs.
3. $A$ and $\sim B$: $A$ occurs, but $B$ does not occur.
4. $\sim A$ and $\sim B$: Neither $A$ nor $B$ occurs.
I’ll use a couple of different examples to illustrate how to represent a probability.
2.3 Probability tree
Let’s think about a situation in which you are trying to get your driver’s license. Suppose that in order to get a driver’s license, you have to pass the written exam and the driving exam. However, if you fail the written exam, you’re not allowed to take the driving exam. We can represent these two events in a probability tree.
Probability trees are intuitive and easy to interpret.2 First, we see that the probability of passing the written exam is 0.75 and the probability of failing it is 0.25. Second, at every node, the probabilities attached to the branches leaving that node sum to 1.0. The joint probabilities at the ends of the branches also sum to 1.0. This is called the law of total probability: the marginal probability of an event is the sum of its joint probabilities over a partition of the sample space, for example $\Pr(A) = \Pr(A \cap B) + \Pr(A \cap \sim B)$.
We also see the concept of a conditional probability in the driver's-license tree. For instance, the probability of failing the driving exam, conditional on having passed the written exam, is represented as $\Pr(\text{Fail driving} \mid \text{Pass written})$.
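Here is a minimal Python sketch of the tree's arithmetic. Only the 0.75/0.25 split on the written exam comes from the text; the conditional probabilities for the driving exam (0.6 and 0.4) are values I assumed purely for illustration:

```python
# Written-exam branch (from the text)
p_pass_written = 0.75
p_fail_written = 0.25

# Driving-exam branch, conditional on passing the written exam (assumed values)
p_pass_driving_given_written = 0.6
p_fail_driving_given_written = 0.4

# Joint probabilities along each complete path through the tree
p_pass_both = p_pass_written * p_pass_driving_given_written              # 0.45
p_pass_written_fail_driving = p_pass_written * p_fail_driving_given_written  # 0.30

# Law of total probability: the joint probabilities across all paths sum to 1
print(p_pass_both + p_pass_written_fail_driving + p_fail_written)  # 1.0
```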
2.4 Venn diagrams and sets
A second way to represent multiple events occurring is with a Venn diagram. Venn diagrams were first conceived by John Venn in 1880. They are used to teach elementary set theory, as well as to express set relationships in probability and statistics. This example will involve two sets, $A$ and $B$.
The University of Texas’s football coach has been on the razor’s edge with the athletic director and regents all season. After several mediocre seasons, his future with the school is in jeopardy. If the Longhorns don’t make it to a great bowl game, he likely won’t be rehired. But if they do, then he likely will be rehired. Let’s discuss elementary set theory using this coach’s situation as our guiding example. But before we do, let’s remind ourselves of our terms.
Note that $A$ and $B$ are both contained in the sample space $U$, and that the complement of a set (written $A^c$) is everything in $U$ that is not in that set. We can write out the following definitions: the union $A \cup B$ is the set of outcomes in $A$ or $B$ (or both), and the intersection $A \cap B$ is the set of outcomes in both $A$ and $B$. Whenever we want to describe a set of events in which either $A$ or $B$ occurs, we use the union; whenever we want to describe events in which both occur, we use the intersection.
Now let's look closely at a relationship involving the set $A$:
$$A = (A \cap B) \cup (A \cap B^c).$$
Notice what this is saying: there are two ways to identify the $A$ set. Either an outcome is in $A$ and in $B$, or it is in $A$ and not in $B$; together those two disjoint pieces make up all of $A$.
A similar style of reasoning can help you understand the following expression:
$$A \cup B = (A \cap B^c) \cup (A^c \cap B) \cup (A \cap B).$$
Now it is just simple addition to find all missing values. Recall that probabilities of disjoint pieces add up, so, for instance, $\Pr(A) = \Pr(A \cap B) + \Pr(A \cap B^c)$; given any two of these quantities, the third can be recovered by subtraction.
When working with sets, it is important to understand that probability is calculated by considering the share of one set (for example, $B$) that is made up by a subset of it (for example, $A \cap B$). That is, we form the ratio $\Pr(A \cap B)/\Pr(B)$.
I left this intentionally undefined on the left side so as to focus on the calculation itself. But now let's define what we are wanting to calculate: in a world where $B$ has occurred, what is the probability that $A$ also occurs? That is exactly the ratio above, and we call it the conditional probability of $A$ given $B$, written $\Pr(A \mid B)$.
2.5 Contingency tables
Another way that we can represent events is with a contingency table. Contingency tables are also sometimes called twoway tables. Table 2.4 is an example of a contingency table. We continue with our example about the worried Texas coach.
| Event labels | Coach is not rehired | Coach is rehired | Total |
|---|---|---|---|
| Texas makes a great bowl game ($A$) | | | |
| Texas does not make a great bowl game ($A^c$) | | | |
| Total | | | 1.0 |
Recall that each marginal probability in a contingency table is the sum of the joint probabilities in its row or column, and that all of the joint probabilities together sum to 1.0. That is how any missing cell can be filled in from the others.
So, we can use what we have done so far to write out a definition of joint probability. Let's start with a definition of conditional probability first. Given two events, $A$ and $B$:
$$\Pr(A \mid B) = \frac{\Pr(A, B)}{\Pr(B)}, \qquad \Pr(B \mid A) = \frac{\Pr(B, A)}{\Pr(A)}.$$
Using these two definitions, and the fact that $\Pr(A, B) = \Pr(B, A)$, I can simply write down a definition of joint probability:
$$\Pr(A, B) = \Pr(A \mid B)\Pr(B) = \Pr(B \mid A)\Pr(A).$$
And this is the formula for joint probability. Dividing through by $\Pr(B)$ and using the definitions of conditional and joint probability, we get:
$$\Pr(A \mid B) = \frac{\Pr(B \mid A)\Pr(A)}{\Pr(B)}.$$
This expression is sometimes called the naive version of Bayes's rule. We will now decompose it more fully by substituting the law of total probability into the denominator.
Substituting $\Pr(B) = \Pr(B \mid A)\Pr(A) + \Pr(B \mid A^c)\Pr(A^c)$ into the denominator yields:
$$\Pr(A \mid B) = \frac{\Pr(B \mid A)\Pr(A)}{\Pr(B \mid A)\Pr(A) + \Pr(B \mid A^c)\Pr(A^c)}.$$
That's a mouthful of substitutions, so what does this expression mean? This is the Bayesian decomposition version of Bayes's rule. Let's use our example again of Texas making a great bowl game.
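To see the decomposition in action, here is the calculation in Python with stand-in probabilities chosen purely for illustration (the specific values are my assumptions, not numbers from the text):

```python
# A = Texas makes a great bowl game, B = the coach is rehired (assumed values)
p_A = 0.6             # prior probability of a great bowl game
p_B_given_A = 0.8     # P(rehired | great bowl game)
p_B_given_notA = 0.1  # P(rehired | no great bowl game)

# law of total probability gives the denominator
p_B = p_B_given_A * p_A + p_B_given_notA * (1 - p_A)

# Bayesian decomposition: P(great bowl game | coach rehired)
p_A_given_B = p_B_given_A * p_A / p_B
print(round(p_A_given_B, 3))  # 0.923 with these assumed inputs
```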
2.6 Monty Hall example
Let's use a different example, the Monty Hall example. This is a fun one, because most people find it counterintuitive. It is even used to stump mathematicians and statisticians.4 But Bayes's rule makes the answer very clear—so clear, in fact, that it's somewhat surprising that Bayes's rule was actually once controversial (McGrayne 2012).
Let's assume three closed doors: door 1 ($D_1$), door 2 ($D_2$), and door 3 ($D_3$). Behind one of the doors is a million dollars, and behind each of the other two doors is a goat. Monty Hall, the game-show host, asks you to pick a door. After you pick, he opens one of the two doors you did not pick, reveals a goat, and asks whether you would like to switch to the remaining closed door.
A common response to Monty Hall’s offer is to say it makes no sense to change doors, because there’s an equal chance that the million dollars is behind either door. Therefore, why switch? There’s a 50–50 chance it’s behind the door picked and there’s a 50–50 chance it’s behind the remaining door, so it makes no rational sense to switch. Right? Yet, a little intuition should tell you that’s not the right answer, because it would seem that when Monty Hall opened that third door, he made a statement. But what exactly did he say?
Let's formalize the problem using our probability notation. Assume that you chose door 1, $D_1$, and that Monty Hall then opened door 2, revealing a goat. Should you stay with door 1 or switch to door 3?
We need to know the probability that door 3 has the million dollars and compare that to door 1's probability. Let $A_i$ be the event that the million dollars is behind door $i$, and call Monty's opening of door 2 event $B$. Using the Bayesian decomposition, the probability we want is
$$\Pr(A_3 \mid B) = \frac{\Pr(B \mid A_3)\Pr(A_3)}{\Pr(B \mid A_1)\Pr(A_1) + \Pr(B \mid A_2)\Pr(A_2) + \Pr(B \mid A_3)\Pr(A_3)}.$$
There are basically two kinds of probabilities on the right side of the equation. There's the marginal probability that the million dollars is behind a given door, $\Pr(A_i)$. And there's the conditional probability that Monty Hall would open door 2 given that the million dollars is behind door $i$, $\Pr(B \mid A_i)$.
The marginal probability that door $i$ has the million dollars, without any additional information, is $1/3$. We call this the prior probability, or prior belief. It may also be called the unconditional probability.
The conditional probability, $\Pr(B \mid A_i)$, requires a little more careful thinking. Take the first conditional probability, $\Pr(B \mid A_1)$: in a world where the money is behind door 1 (your door), what is the probability that Monty Hall would open door 2?
Let's think about the second conditional probability, $\Pr(B \mid A_2)$: in a world where the money is behind door 2, how likely is it that Monty would open that very door?
And then the last conditional probability, $\Pr(B \mid A_3)$: in a world where the money is behind door 3, how likely is it that Monty would open door 2?
Each of these conditional probabilities requires thinking carefully about the feasibility of the events in question. Let's examine the easiest question first: $\Pr(B \mid A_2)$. Monty never opens the door hiding the money, so in a world where the money is behind door 2 he would never open it. This probability is 0.
What then is $\Pr(B \mid A_1)$? In a world where the money is behind door 1, Monty can open either door 2 or door 3, since both hide goats, so the probability that he opens door 2 is $1/2$.
What about the remaining conditional probability, $\Pr(B \mid A_3)$? In a world where the money is behind door 3, Monty cannot open your door (door 1) and cannot open the money door (door 3), so he must open door 2. This probability is 1. Plugging these into the decomposition gives
$$\Pr(A_3 \mid B) = \frac{1 \cdot \tfrac{1}{3}}{\tfrac{1}{2}\cdot\tfrac{1}{3} + 0\cdot\tfrac{1}{3} + 1\cdot\tfrac{1}{3}} = \frac{2}{3}.$$
But what about the probability that door 1, the door you're holding, has the million dollars? Has your belief about that likelihood changed now that door 2 has been removed from the equation? Let's crank through our Bayesian decomposition and see whether we learned anything: the same calculation with $\Pr(B \mid A_1) = 1/2$ in the numerator gives $\Pr(A_1 \mid B) = 1/3$.
Interestingly, while your beliefs about the door you originally chose haven't changed, your beliefs about the other door have changed. The prior probability that door 3 held the money, $\Pr(A_3) = 1/3$, increased to a posterior probability, $\Pr(A_3 \mid B) = 2/3$, once Monty opened door 2. This updated probability is called the posterior probability, or posterior belief. And it suggests you should switch doors: the remaining closed door is twice as likely to hold the million dollars as the one you chose.
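If the algebra still feels slippery, a brute-force simulation makes the same point. The sketch below (my own, not from the text) plays the game many times under both strategies and recovers the 1/3 versus 2/3 split:

```python
import random

random.seed(1)

def play(switch, n=100_000):
    wins = 0
    for _ in range(n):
        money = random.randint(1, 3)   # door hiding the million dollars
        choice = 1                     # you always pick door 1
        # Monty opens a door that is neither your pick nor the money door
        opened = next(d for d in (2, 3) if d != choice and d != money)
        if switch:
            choice = next(d for d in (1, 2, 3) if d != choice and d != opened)
        wins += (choice == money)
    return wins / n

print(play(switch=False))  # ~0.33: staying wins one-third of the time
print(play(switch=True))   # ~0.67: switching wins two-thirds of the time
```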
As was mentioned in footnote 14 regarding the controversy around vos Savant's correct reasoning about the need to switch doors, deductions based on Bayes's rule are often surprising even to smart people—probably because we lack coherent ways to correctly incorporate information into probabilities. Bayes's rule shows us how to do that in a way that is logical and accurate. But besides being insightful, Bayes's rule also opens the door for a different kind of reasoning about cause and effect. Whereas most of this book has to do with estimating effects from known causes, Bayes's rule reminds us that we can form reasonable beliefs about causes from known effects.
2.7 Summation operator
The tools we use to reason about causality rest atop a bedrock of probabilities. We are often working with mathematical tools and concepts from statistics such as expectations and probabilities. One of the most common tools we will use in this book is the linear regression model, but before we can dive into that, we have to build out some simple notation.5 We'll begin with the summation operator. The Greek letter $\Sigma$ (capital sigma) denotes the summation operator. Let $x_1, x_2, \dots, x_n$ be a sequence of numbers. We can compactly write their sum as
$$\sum_{i=1}^{n} x_i \equiv x_1 + x_2 + \dots + x_n,$$
where $i$ is the index of summation and 1 and $n$ are its lower and upper limits.
For any constant $c$:
$$\sum_{i=1}^{n} c = nc \qquad\text{and}\qquad \sum_{i=1}^{n} c\,x_i = c\sum_{i=1}^{n} x_i.$$
We can use the summation operator to make a number of calculations, some of which we will do repeatedly over the course of this book. For instance, we can use the summation operator to calculate the average:
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i,$$
where $\bar{x}$ is the average (mean) of the sequence.
Consider a sequence of two numbers, $\{y_1, y_2\}$, and another, $\{x_1, x_2\}$. Note that, in general, the summation operator does not pass through a ratio or a product; for example, $\sum_i x_i/y_i \neq \big(\sum_i x_i\big)\big/\big(\sum_i y_i\big)$. One identity that it does satisfy, and that we will use repeatedly, is:
$$\sum_{i=1}^{n}(x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - n\bar{x}^2.$$
An overly long, step-by-step proof is below. Note that the summation index is suppressed after the first line for easier reading.
$$\begin{aligned}
\sum_{i=1}^{n}(x_i - \bar{x})^2 &= \sum \big(x_i^2 - 2x_i\bar{x} + \bar{x}^2\big)\\
&= \sum x_i^2 - 2\bar{x}\sum x_i + n\bar{x}^2\\
&= \sum x_i^2 - 2\bar{x}\cdot n\bar{x} + n\bar{x}^2\\
&= \sum x_i^2 - n\bar{x}^2.
\end{aligned}$$
A more general version of this result is:
$$\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} x_i(y_i - \bar{y}).$$
Or:
$$\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n}(x_i - \bar{x})\,y_i = \sum_{i=1}^{n} x_i y_i - n\,\bar{x}\,\bar{y}.$$
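These identities are easy to verify numerically. Here is a small Python check of mine, using arbitrary simulated data:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = rng.normal(size=1000)
n, xbar, ybar = len(x), x.mean(), y.mean()

# sum of squared deviations identity
print(np.allclose(((x - xbar)**2).sum(), (x**2).sum() - n*xbar**2))           # True

# the more general cross-product versions
print(np.allclose(((x - xbar)*(y - ybar)).sum(), (x*(y - ybar)).sum()))       # True
print(np.allclose(((x - xbar)*(y - ybar)).sum(), (x*y).sum() - n*xbar*ybar))  # True
```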
2.8 Expected value
The expected value of a random variable, also called the expectation and sometimes the population mean, is simply the weighted average of the possible values that the variable can take, with the weights given by the probability of each value occurring in the population. Suppose that the variable $X$ can take on the values $x_1, x_2, \dots, x_k$, with probabilities $f(x_1), f(x_2), \dots, f(x_k)$, respectively. Then we define the expected value of $X$ as:
$$E(X) = x_1 f(x_1) + x_2 f(x_2) + \dots + x_k f(x_k) = \sum_{j=1}^{k} x_j f(x_j).$$
The first property of expected value is that for any constant $c$, $E(c) = c$. The second property is that for any two constants $a$ and $b$, $E(aX + b) = aE(X) + b$. And the third property is that if we have many constants, $a_1, \dots, a_n$, and many random variables, $X_1, \dots, X_n$, then:
$$E(a_1X_1 + \dots + a_nX_n) = a_1E(X_1) + \dots + a_nE(X_n).$$
We can also express this using the expectation operator together with the summation operator:
$$E\!\left(\sum_{i=1}^{n} a_iX_i\right) = \sum_{i=1}^{n} a_iE(X_i).$$
And in the special case where $a_i = 1$ for every $i$:
$$E\!\left(\sum_{i=1}^{n} X_i\right) = \sum_{i=1}^{n} E(X_i).$$
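As a quick illustration, here is the weighted-average calculation in Python for a made-up discrete distribution (the values and probabilities are assumptions of mine, not from the text):

```python
import numpy as np

values = np.array([-1.0, 0.0, 2.0])   # possible values of X (assumed)
probs = np.array([0.3, 0.3, 0.4])     # their probabilities (assumed)

EX = (values * probs).sum()
print(EX)  # -1*0.3 + 0*0.3 + 2*0.4 = 0.5

# linearity: E(aX + b) = a*E(X) + b
a, b = 3.0, 2.0
print(np.isclose(((a*values + b) * probs).sum(), a*EX + b))  # True
```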
2.9 Variance
The expectation operator, $E(\cdot)$, is a population concept. It refers to the whole group of interest, not just the sample we have available to us. Some additional properties of the expectation operator can be stated for two random variables, $W$ and $H$:
$$E(aW + b) = aE(W) + b \ \text{ for any constants } a, b;$$
$$E(W + H) = E(W) + E(H);$$
$$E\big(W - E(W)\big) = 0.$$
Consider the variance of a random variable, $W$, in the population:
$$V(W) = \sigma^2 = E\big[(W - E(W))^2\big].$$
Expanding the square, we can also write this as $V(W) = E(W^2) - E(W)^2$.
A few more properties of variance. First, the variance of a line is:
$$V(aX + b) = a^2V(X).$$
And the variance of a constant is 0 (i.e., $V(c) = 0$ for any constant $c$). The variance of the sum of two random variables is:
$$V(X + Y) = V(X) + V(Y) + 2\big(E(XY) - E(X)E(Y)\big).$$
If the two variables are independent, then $E(XY) = E(X)E(Y)$, the last term is zero, and $V(X + Y) = V(X) + V(Y)$.
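A simulation drives home the variance-of-a-line property (again, a sketch of mine with arbitrary constants):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(loc=0, scale=2, size=1_000_000)   # V(X) is about 4
a, b = 3.0, 5.0

# V(aX + b) = a^2 V(X): the additive constant shifts X but adds no spread
print(np.var(a*X + b))    # about 36
print(a**2 * np.var(X))   # about 36
```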
2.10 Covariance
The last part of the expression above, $E(XY) - E(X)E(Y)$, is called the covariance. The covariance measures the amount of linear dependence between two random variables. We represent it with the $C(X, Y)$ operator:
$$C(X, Y) = E(XY) - E(X)E(Y).$$
The expression $C(X, Y) > 0$ indicates that the two variables move in the same direction, whereas $C(X, Y) < 0$ indicates that they move in opposite directions. Thus we can write $V(X + Y) = V(X) + V(Y) + 2C(X, Y)$. While it's tempting to say that a zero covariance means two random variables are unrelated, that is incorrect: they could still have a nonlinear relationship. If $X$ and $Y$ are independent, then $C(X, Y) = 0$ in the population, and the covariance between two linear functions is $C(a_1 + b_1X,\; a_2 + b_2Y) = b_1b_2\,C(X, Y)$.
Interpreting the magnitude of the covariance can be tricky, because it depends on the units of the variables. For that, we are better served by looking at correlation. We define correlation as follows. Let $W = \dfrac{X - E(X)}{\sqrt{V(X)}}$ and $Z = \dfrac{Y - E(Y)}{\sqrt{V(Y)}}$. Then:
$$\text{Corr}(X, Y) = C(W, Z) = \frac{C(X, Y)}{\sqrt{V(X)\,V(Y)}}.$$
The correlation coefficient is bounded between $-1$ and $1$: values near 1 indicate a strong positive linear relationship, and values near $-1$ a strong negative one.
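Here is a short Python sketch of mine that computes the covariance from its definition and rescales it into a correlation:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=100_000)
y = 0.5*x + rng.normal(size=100_000)   # built to co-move with x

# covariance as E(XY) - E(X)E(Y)
cov_xy = (x*y).mean() - x.mean()*y.mean()

# correlation rescales the covariance to lie between -1 and 1
corr_xy = cov_xy / np.sqrt(x.var() * y.var())
print(cov_xy)   # about 0.5, in the units of x times y
print(corr_xy)  # about 0.45, unit-free and bounded by -1 and 1
```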
2.11 Population model
We begin with cross-sectional analysis. We will assume that we can collect a random sample from the population of interest. Assume that there are two variables, $x$ and $y$, and we want to see how $y$ varies with changes in $x$.
There are three questions that immediately come up. One, what if $y$ is affected by factors other than $x$? How will we handle that? Two, what is the functional form connecting these two variables? Three, if we are interested in the causal effect of $x$ on $y$, then how can we distinguish that from mere correlation? Let's start with a specific model that addresses these questions:
$$y = \beta_0 + \beta_1 x + u. \tag{2.14}$$
Equation 2.14 explicitly allows for other factors to affect $y$ by including a random variable called the error term, $u$, which stands in for all determinants of $y$ other than $x$. The equation also addresses the functional-form question by assuming that $y$ is linearly related to $x$. We call $\beta_0$ the intercept parameter and $\beta_1$ the slope parameter. These describe a population; they are not data, and our goal is to estimate them using data and credible assumptions.
First, we make a simplifying assumption without loss of generality. Let the expected value of $u$ in the population be zero:
$$E(u) = 0,$$
where $E(\cdot)$ is the expected value operator discussed earlier. Normalizing the mean of the unobserved determinants of $y$ to zero is harmless so long as the intercept, $\beta_0$, is included in the equation, because the intercept can absorb any nonzero mean.
2.12 Mean independence
An assumption that meshes well with our elementary treatment of statistics involves the mean of the error term for each "slice" of the population determined by values of $x$:
$$E(u \mid x) = E(u) \ \text{ for all values of } x,$$
where $E(u \mid x)$ means the "expected value of $u$ given $x$." If this holds, we say that $u$ is mean independent of $x$.
An example might help here. Let's say we are estimating the effect of schooling on wages, and $u$ includes unobserved ability. Mean independence requires that average ability be the same regardless of years of schooling, for example $E(\text{ability} \mid x = 8) = E(\text{ability} \mid x = 12) = E(\text{ability} \mid x = 16)$. But if we believe that people with more ability tend to acquire more schooling, then average ability differs across schooling levels and the assumption is violated.
But let's say we are willing to make this assumption. Then combining this new assumption, $E(u \mid x) = E(u)$, with our earlier normalization $E(u) = 0$, we get the zero conditional mean assumption:
$$E(u \mid x) = 0 \ \text{ for all values of } x.$$
Because the conditional expectation is a linear operator, this implies
$$E(y \mid x) = \beta_0 + \beta_1 x,$$
which shows that the population regression function is a linear function of $x$.
2.13 Ordinary least squares
Given data on $x$ and $y$, how can we estimate the population parameters, $\beta_0$ and $\beta_1$? Let the pairs $\{(x_i, y_i) : i = 1, \dots, n\}$ be a random sample of size $n$ from the population. Plugging any observation into the population equation gives
$$y_i = \beta_0 + \beta_1 x_i + u_i,$$
where $i$ indicates a particular observation. We observe $y_i$ and $x_i$, but not $u_i$; we just know that $u_i$ is there. We then use the two population restrictions discussed earlier,
$$E(u) = 0, \qquad C(x, u) = 0,$$
to obtain estimating equations for $\beta_0$ and $\beta_1$. The first restriction can be written as $E(y - \beta_0 - \beta_1 x) = 0$ and the second as $E\big(x\,(y - \beta_0 - \beta_1 x)\big) = 0$.
These are the two conditions in the population that effectively determine $\beta_0$ and $\beta_1$. Since they are population concepts and we only have a sample, we work with their sample counterparts:
$$\frac{1}{n}\sum_{i=1}^{n}\big(y_i - \widehat{\beta}_0 - \widehat{\beta}_1 x_i\big) = 0,\qquad
\frac{1}{n}\sum_{i=1}^{n} x_i\big(y_i - \widehat{\beta}_0 - \widehat{\beta}_1 x_i\big) = 0,$$
where $\widehat{\beta}_0$ and $\widehat{\beta}_1$ are the estimates computed from the data. The first equation gives $\widehat{\beta}_0 = \bar{y} - \widehat{\beta}_1\bar{x}$, and substituting that into the second and solving yields
$$\widehat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2},$$
provided that $\sum_{i=1}^{n}(x_i - \bar{x})^2 > 0$: the slope estimate is the sample covariance of $x$ and $y$ divided by the sample variance of $x$.
The previous formula for $\widehat{\beta}_1$ is important because it shows us how to take data that we have and compute the slope estimate. The estimate, $\widehat{\beta}_1$, is commonly referred to as the ordinary least squares (OLS) estimate of the slope. It can be computed whenever the sample variance of $x$ is not zero, that is, whenever $x_i$ is not constant across observations; intuitively, we need variation in $x$ to learn how $y$ changes with $x$.
Once we have calculated $\widehat{\beta}_1$, we can compute the intercept estimate as $\widehat{\beta}_0 = \bar{y} - \widehat{\beta}_1\bar{x}$, using the sample averages of $x$ and $y$.
For any candidate estimates, $\widehat{\beta}_0$ and $\widehat{\beta}_1$, we define a fitted value for each observation $i$ as
$$\widehat{y}_i = \widehat{\beta}_0 + \widehat{\beta}_1 x_i,$$
and the residual as the difference between the actual and fitted values, $\widehat{u}_i = y_i - \widehat{y}_i$. Recall that $i = 1, \dots, n$, so we have $n$ of each of these.
Suppose we measure the size of the mistake, for each $i$, by squaring it and then summing across observations:
$$\sum_{i=1}^{n}\widehat{u}_i^{\,2} = \sum_{i=1}^{n}\big(y_i - \widehat{\beta}_0 - \widehat{\beta}_1 x_i\big)^2.$$
This quantity is called the sum of squared residuals because the residual is $\widehat{u}_i = y_i - \widehat{y}_i$. Choosing $\widehat{\beta}_0$ and $\widehat{\beta}_1$ to minimize it produces exactly the two sample conditions above, which is why these estimates are called the ordinary least squares estimates.
Once we have the numbers $\widehat{\beta}_0$ and $\widehat{\beta}_1$ for a given data set, we can write the OLS regression line:
$$\widehat{y} = \widehat{\beta}_0 + \widehat{\beta}_1 x.$$
Let's look at a simulated example.
Code
set seed 1
clear
set obs 10000
gen x = rnormal()
gen u = rnormal()
gen y = 5.5*x + 12*u
reg y x
predict yhat1
gen yhat2 = -0.0750109 + 5.598296*x // Compare yhat1 and yhat2
sum yhat*
predict uhat1, residual
gen uhat2 = y - yhat2
sum uhat*
twoway (lfit y x, lcolor(black) lwidth(medium)) ///
    (scatter y x, mcolor(black) msymbol(point) msize(tiny)), ///
    title(OLS Regression Line)
rvfplot, yline(0)
Code
library(tidyverse)

set.seed(1)
tb <- tibble(
  x = rnorm(10000),
  u = rnorm(10000),
  y = 5.5*x + 12*u
)

reg_tb <- tb %>%
  lm(y ~ x, .) %>%
  print()

reg_tb$coefficients

tb <- tb %>%
  mutate(
    yhat1 = predict(lm(y ~ x, .)),
    yhat2 = -0.0732608 + 5.685033*x,
    uhat1 = residuals(lm(y ~ x, .)),
    uhat2 = y - yhat2
  )

summary(tb[-1:-3])

tb %>%
  lm(y ~ x, .) %>%
  ggplot(aes(x = x, y = y)) +
  ggtitle("OLS Regression Line") +
  geom_point(size = 0.05, color = "black", alpha = 0.5) +
  geom_smooth(method = lm, color = "black") +
  annotate("text", x = -1.5, y = 30, color = "red",
           label = paste("Intercept = ", -0.0732608)) +
  annotate("text", x = 1.5, y = -30, color = "blue",
           label = paste("Slope =", 5.685033))
Code
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
import plotnine as p

np.random.seed(1)

tb = pd.DataFrame({
    'x': np.random.normal(size=10000),
    'u': np.random.normal(size=10000)})
tb['y'] = 5.5*tb['x'].values + 12*tb['u'].values

reg_tb = sm.OLS.from_formula('y ~ x', data=tb).fit()
reg_tb.summary()

tb['yhat1'] = reg_tb.predict(tb)
tb['yhat2'] = 0.1114 + 5.6887*tb['x']
tb['uhat1'] = reg_tb.resid
tb['uhat2'] = tb['y'] - tb['yhat2']

tb.describe()

p.ggplot(tb, p.aes(x='x', y='y')) +\
    p.ggtitle("OLS Regression Line") +\
    p.geom_point(size=0.05, color="black", alpha=0.5) +\
    p.geom_smooth(p.aes(x='x', y='y'), method="lm", color="black") +\
    p.annotate("text", x=-1.5, y=30, color="red",
               label="Intercept = {}".format(-0.0732608)) +\
    p.annotate("text", x=1.5, y=-30, color="blue",
               label="Slope = {}".format(5.685033))
Let's look at the output from this. First, if you summarize the data, you'll see that the fitted values are produced both using Stata's Predict command and manually using the Generate command. I wanted the reader to have a chance to better understand this, so I did it both ways. Second, let's look at the data and paste on top of it the estimated coefficients, the y-intercept and the slope on $x$.
Once we have the estimated coefficients and we have the OLS regression line, we can predict $y$ (the outcome) for any (sensible) value of $x$.
Notice that the intercept is the predicted value of $y$ when $x = 0$.
Now that we have calculated $\widehat{\beta}_0$ and $\widehat{\beta}_1$, we can obtain the fitted values and residuals for every observation.
Recall that we defined the fitted value as $\widehat{y}_i = \widehat{\beta}_0 + \widehat{\beta}_1 x_i$ and the residual as $\widehat{u}_i = y_i - \widehat{y}_i$. A plot of the residuals against the fitted values shows no systematic pattern, consistent with the fact that OLS residuals are uncorrelated with the fitted values by construction.
2.14 Algebraic Properties of OLS
Remember how we obtained $\widehat{\beta}_0$ and $\widehat{\beta}_1$: the sample moment conditions force the residuals to satisfy
$$\sum_{i=1}^{n}\widehat{u}_i = 0 \qquad\text{and}\qquad \sum_{i=1}^{n} x_i\widehat{u}_i = 0.$$
The first says that the OLS residuals always add up to zero when an intercept is included; the second says that the sample covariance between the explanatory variable and the residuals is always zero. These are algebraic facts: they hold in every sample, by construction, and they are easy to verify in a short simulation.
Code
clear
set seed 1234
set obs 10
gen x = 9*rnormal()
gen u = 36*rnormal()
gen y = 3 + 2*x + u
reg y x
predict yhat
predict residuals, residual
su residuals
list
collapse (sum) x u y yhat residuals
list
Code
library(tidyverse)

set.seed(1)
tb <- tibble(
  x = 9*rnorm(10),
  u = 36*rnorm(10),
  y = 3 + 2*x + u,
  yhat = predict(lm(y ~ x)),
  uhat = residuals(lm(y ~ x))
)

summary(tb)
colSums(tb)
Code
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

tb = pd.DataFrame({
    'x': 9*np.random.normal(size=10),
    'u': 36*np.random.normal(size=10)})
tb['y'] = 3 + 2*tb['x'].values + tb['u'].values

reg_tb = sm.OLS.from_formula('y ~ x', data=tb).fit()

tb['yhat1'] = reg_tb.predict(tb)
tb['uhat1'] = reg_tb.resid

tb.describe()
Output from this can be summarized as in the following table (Table 2.6).
no. | |||||||
---|---|---|---|---|---|---|---|
1. | 155.3967 | 115.4762 | |||||
2. | 70.22192 | 139.0793 | |||||
3. | 17.80379 | 20.60738 | 7.836532 | 12.77085 | 100.0792 | ||
4. | 7.770137 | 1.790884 | |||||
5. | 4.640209 | 13.18046 | 25.46088 | 20.10728 | 5.353592 | 24.84179 | 107.6462 |
6. | 4.848374 | 48.83337 | |||||
7. | 11.58586 | 9.118524 | 35.29023 | 38.09396 | |||
8. | 82.23296 | 74.65305 | 80.26126 | ||||
9. | 11.60571 | 14.0549 | 7.377647 | 6.677258 | 49.26245 | ||
10. | 159.0706 | 346.8405 | |||||
Sum | 34.25085 | 7.749418 | 7.749418 | 1.91e-06 | .0000305 |
Notice the difference between the error term, $u$, and the residual, $\widehat{u}$: the residuals sum to zero in every sample, by construction, whereas the errors do not. In the table, the residual column sums to zero up to rounding error, exactly as the algebraic properties require.
Because $\widehat{y}_i$ is a linear function of $x_i$, and the residuals are uncorrelated with $x_i$, the fitted values and the residuals are uncorrelated too: $\sum_{i=1}^{n}\widehat{y}_i\widehat{u}_i = 0$.
A third property is that if we plug in the average for $x$, we predict the average for $y$: the point $(\bar{x}, \bar{y})$ is always on the OLS regression line, since $\bar{y} = \widehat{\beta}_0 + \widehat{\beta}_1\bar{x}$.
2.15 Goodness-of-fit
For each observation, we write
$$y_i = \widehat{y}_i + \widehat{u}_i.$$
Define the total sum of squares (SST), the explained sum of squares (SSE), and the residual sum of squares (SSR) as
$$SST = \sum_{i=1}^{n}(y_i - \bar{y})^2,\qquad SSE = \sum_{i=1}^{n}(\widehat{y}_i - \bar{y})^2,\qquad SSR = \sum_{i=1}^{n}\widehat{u}_i^{\,2}.$$
These are sample variances when divided by $n-1$. Using the fact that the fitted values and residuals are uncorrelated, one can show that $SST = SSE + SSR$. Assuming $SST > 0$, we can define the fraction of the total variation in $y_i$ that is explained by $x_i$ (or by the OLS regression line) as
$$R^2 = \frac{SSE}{SST} = 1 - \frac{SSR}{SST},$$
which is called the R-squared of the regression. It lies between 0 and 1 and equals the square of the sample correlation between $y_i$ and $\widehat{y}_i$.
I would encourage you not to fixate on R-squared in research projects where the aim is to estimate some causal effect, though. It is a useful summary of how much of the variation in $y_i$ is explained by the explanatory variables, but it tells us nothing about whether the coefficient we care about has a causal interpretation.
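To see that the two expressions for R-squared agree, and that they match what the software reports, here is a small check of mine using simulated data in the spirit of the earlier examples:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
x = rng.normal(size=1000)
y = 5.5*x + 12*rng.normal(size=1000)

fit = sm.OLS(y, sm.add_constant(x)).fit()
yhat = fit.fittedvalues

SST = ((y - y.mean())**2).sum()
SSE = ((yhat - y.mean())**2).sum()
SSR = ((y - yhat)**2).sum()

print(SSE/SST, 1 - SSR/SST)  # the two formulas for R-squared agree
print(fit.rsquared)          # and match statsmodels' reported value
```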
2.16 Expected value of OLS
Up until now, we motivated simple regression using a population model. But our analysis has been purely algebraic, based on a sample of data. So residuals always average to zero when we apply OLS to a sample, regardless of any underlying model. But our job gets tougher. Now we have to study the statistical properties of the OLS estimator, referring to a population model and assuming random sampling.15
The field of mathematical statistics is concerned with questions such as these: How do estimators behave across different samples of data? On average, for instance, will we get the right answer if we repeatedly sample? We need to find the expected value of the OLS estimators—in effect, the average outcome across all possible random samples—and determine whether we are right, on average. This leads naturally to a characteristic called unbiasedness, which is a desirable property of any estimator.
There are several assumptions required for OLS to be unbiased. The first assumption is called linear in the parameters. Assume a population model
$$y = \beta_0 + \beta_1 x + u,$$
where $\beta_0$ and $\beta_1$ are the unknown population parameters. We view $x$ and $u$ as outcomes of random variables generated by some data-generating process, and thus $y$ is random as well. This assumption simply says that the population relationship is linear in $\beta_0$ and $\beta_1$.
Our second assumption is random sampling. We have a random sample of size $n$, $\{(x_i, y_i): i = 1, \dots, n\}$, drawn from the population model. We know how to use this data to estimate $\beta_0$ and $\beta_1$ by OLS.
The third assumption is called sample variation in the explanatory variable. That is, the sample outcomes on $x_i$ are not all the same value. Without variation in $x_i$, the denominator of the slope formula is zero and the slope cannot be computed.
With the fourth assumption our assumptions start to have real teeth. It is called the zero conditional mean assumption and is probably the most critical assumption in causal inference. In the population, the error term has zero mean given any value of the explanatory variable:
$$E(u \mid x) = 0 \ \text{ for all values of } x.$$
So, how do we show that $\widehat{\beta}_1$ is an unbiased estimate of $\beta_1$? We need to show that, under these four assumptions, the expected value of $\widehat{\beta}_1$, taken across repeated random samples, equals $\beta_1$. The argument proceeds in a few steps.
Step 1: Write down a formula for $\widehat{\beta}_1$. It is convenient to use the form
$$\widehat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})\,y_i}{\sum_{i=1}^{n}(x_i - \bar{x})^2},$$
which follows from the earlier formula because $\sum_i (x_i - \bar{x})\,\bar{y} = 0$.
Step 2: Replace each $y_i$ with $\beta_0 + \beta_1 x_i + u_i$, which uses the linear-in-parameters assumption. Using $\sum_i(x_i - \bar{x}) = 0$ and $\sum_i(x_i - \bar{x})x_i = \sum_i(x_i - \bar{x})^2$, the numerator becomes
$$\sum_{i=1}^{n}(x_i - \bar{x})\,y_i = \beta_1\sum_{i=1}^{n}(x_i - \bar{x})^2 + \sum_{i=1}^{n}(x_i - \bar{x})\,u_i.$$
We have shown that:
$$\widehat{\beta}_1 = \beta_1 + \frac{\sum_{i=1}^{n}(x_i - \bar{x})\,u_i}{\sum_{i=1}^{n}(x_i - \bar{x})^2}.$$
Step 3: Find $E(\widehat{\beta}_1)$. Conditioning on the sample values of $x$, the only random pieces left are the $u_i$. Under random sampling and the zero conditional mean assumption, $E(u_i \mid x_1, \dots, x_n) = 0$ for every $i$.
Now we can complete the proof: conditional on the $x$ values,
$$E(\widehat{\beta}_1) = \beta_1 + \frac{\sum_{i=1}^{n}(x_i - \bar{x})\,E(u_i \mid x_1, \dots, x_n)}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = \beta_1.$$
Remember, $\beta_1$ is the fixed parameter in the population; the estimator $\widehat{\beta}_1$ is what varies across samples. Unbiasedness says nothing about the estimate from any one sample, only that the estimates are centered on the truth across repeated samples.
I find it helpful to be concrete when we work through exercises like this, so let's visualize it with a Monte Carlo simulation. We have the following population model:
$$y = 3 + 2x + u,$$
where $x \sim N(0, 9^2)$ and $u \sim N(0, 36^2)$. In each of 1,000 repetitions we draw a sample of 10,000 observations, estimate the regression of $y$ on $x$, and store the slope estimate.
Code
clear all
program define ols, rclass
version 14.2
syntax [, obs(integer 1) mu(real 0) sigma(real 1) ]
clear
drop _all
set obs 10000
gen x = 9*rnormal()
gen u = 36*rnormal()
gen y = 3 + 2*x + u
reg y x
end
simulate beta=_b[x], reps(1000): ols
su beta
hist beta
Code
library(tidyverse)

lm <- lapply(
  1:1000,
  function(x) tibble(
    x = 9*rnorm(10000),
    u = 36*rnorm(10000),
    y = 3 + 2*x + u
  ) %>%
    lm(y ~ x, .)
)

as_tibble(t(sapply(lm, coef))) %>%
  summary()

as_tibble(t(sapply(lm, coef))) %>%
  ggplot() +
  geom_histogram(aes(x), binwidth = 0.01)
Code
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
import plotnine as p

coefs = np.zeros(1000)
for i in range(1000):
    tb = pd.DataFrame({
        'x': 9*np.random.normal(size=10000),
        'u': 36*np.random.normal(size=10000)})
    tb['y'] = 3 + 2*tb['x'].values + tb['u'].values

    reg_tb = sm.OLS.from_formula('y ~ x', data=tb).fit()
    coefs[i] = reg_tb.params['x']

p.ggplot() +\
    p.geom_histogram(p.aes(x=coefs), binwidth=0.01)
Table 2.7 gives us the mean value of $\widehat{\beta}_1$ over the 1,000 repetitions. While each individual sample produced a somewhat different estimate, the average of the estimates is extremely close to the true value of 2, which is exactly what unbiasedness promises. Figure 2.3 plots the full distribution of the estimates across samples.
The problem is, we don't know which kind of sample we have. Do we have one of the "almost exactly 2" samples, or do we have one of the "pretty different from 2" samples? We can never know whether we are close to the population value. We hope that our sample is "typical" and produces a slope estimate close to the population parameter, but we can't know for certain.
2.17 Law of iterated expectations
The conditional expectation function (CEF) is the mean of some outcome, $y$, with some covariate, $x$, held fixed; we write it as $E[y_i \mid x_i]$. When $x_i$ is discrete, the CEF is just a collection of conditional means, one for each value that $x_i$ takes; when $x_i$ is continuous, it is a function of $x_i$.
An important complement to the CEF is the law of iterated expectations (LIE). This law says that an unconditional expectation can be written as the unconditional average of the CEF. In other words,
$$E(y_i) = E\big\{E(y_i \mid x_i)\big\}:$$
first average $y$ within each value of $x$, then average those conditional means over the distribution of $x$, and you recover the unconditional mean of $y$.
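A quick numerical illustration of LIE with a discrete covariate (my own sketch):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
x = rng.integers(1, 5, size=100_000)   # discrete covariate taking values 1-4
y = 2*x + rng.normal(size=100_000)     # outcome whose mean depends on x
df = pd.DataFrame({'x': x, 'y': y})

# the CEF here is just one conditional mean per value of x
cef = df.groupby('x')['y'].mean()

# LIE: average the CEF over the distribution of x to recover E[y]
weights = df['x'].value_counts(normalize=True).sort_index()
print((cef * weights).sum())  # unconditional mean recovered from the CEF
print(df['y'].mean())         # direct unconditional mean (same number)
```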
2.18 CEF decomposition property
The first property of the CEF we will discuss is the CEF decomposition property. The power of LIE comes from the way it breaks a random variable into two pieces—the CEF and a residual with special properties. The CEF decomposition property states that
$$y_i = E(y_i \mid x_i) + \varepsilon_i,$$
where (i) $\varepsilon_i$ is mean independent of $x_i$, that is, $E(\varepsilon_i \mid x_i) = 0$,
and (ii) $\varepsilon_i$ is uncorrelated with any function of $x_i$.
The theorem says that any random variable, $y_i$, can be decomposed into a piece that is explained by $x_i$ (the CEF) and a piece that is left over and orthogonal to it. The first part follows directly from the definition of $\varepsilon_i$, since $E(\varepsilon_i \mid x_i) = E(y_i \mid x_i) - E(y_i \mid x_i) = 0$.
The second part of the theorem states that $\varepsilon_i$ is uncorrelated with any function of $x_i$. It follows from the first part together with the law of iterated expectations, because for any function $h(x_i)$ we have $E\big(h(x_i)\,\varepsilon_i\big) = E\big\{h(x_i)\,E(\varepsilon_i \mid x_i)\big\} = 0$.
2.19 CEF prediction property
The second property is the CEF prediction property. This states that the CEF is the best predictor of $y_i$ given $x_i$ in the sense of minimizing mean squared error:
$$E(y_i \mid x_i) = \arg\min_{m(x_i)} E\Big[\big(y_i - m(x_i)\big)^2\Big],$$
where $m(x_i)$ can be any function of $x_i$.
Now minimize the criterion with respect to $m(x_i)$. Writing the squared term as $\big[(y_i - E(y_i \mid x_i)) + (E(y_i \mid x_i) - m(x_i))\big]^2$ and expanding, the cross term has zero expectation by the decomposition property, so the criterion equals the variance of the CEF residual plus $E\big[(E(y_i \mid x_i) - m(x_i))^2\big]$. The latter term is minimized, at zero, by choosing $m(x_i) = E(y_i \mid x_i)$.
2.20 ANOVA theory
The final property of the CEF that we will discuss is the analysis of variance theorem, or ANOVA. According to this theorem, the unconditional variance in some random variable is equal to the variance in the conditional expectation plus the expectation of the conditional variance, or
$$V(y_i) = V\big(E(y_i \mid x_i)\big) + E\big(V(y_i \mid x_i)\big),$$
where $V(\cdot)$ is the variance and $V(y_i \mid x_i)$ is the conditional variance.
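The decomposition holds exactly in any data set if both pieces are computed consistently. A short check of mine, using the same kind of discrete-covariate setup as above:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
x = rng.integers(1, 5, size=200_000)
y = 2*x + rng.normal(size=200_000)
df = pd.DataFrame({'x': x, 'y': y})

g = df.groupby('x')['y']
cond_mean = df['x'].map(g.mean())       # E[y | x_i] attached to each row
cond_var = df['x'].map(g.var(ddof=0))   # V(y | x_i) attached to each row

# V(y) = V(E[y|x]) + E(V(y|x))
print(df['y'].var(ddof=0))
print(cond_mean.var(ddof=0) + cond_var.mean())  # identical up to floating point
```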
2.21 Linear CEF theorem
As you probably know by now, the use of least squares in applied work is extremely common. That’s because regression has several justifications. We discussed one—unbiasedness under certain assumptions about the error term. But I’d like to present some slightly different arguments. Angrist and Pischke (2009) argue that linear regression may be useful even if the underlying CEF itself is not linear, because regression is a good approximation of the CEF. So keep an open mind as I break this down a little bit more.
Angrist and Pischke (2009) give several arguments for using regression, and the linear CEF theorem is probably the easiest. Let's assume that we are sure that the CEF itself is linear. So what? Well, if the CEF is linear, then the linear CEF theorem states that the population regression is equal to that linear CEF. And if the CEF is linear, and if the population regression equals it, then of course you should use the population regression to estimate the CEF. If you need a proof for what could just as easily be considered common sense, I provide one: if $E(y_i \mid x_i)$ is linear, say $E(y_i \mid x_i) = x_i'\beta^{*}$ for some vector $\beta^{*}$, then the decomposition property implies that $\beta^{*}$ satisfies the same orthogonality conditions that define the population regression coefficients, so the two must coincide.
2.22 Best linear predictor theorem
There are a few other linear theorems that are worth bringing up in this context. For instance, recall that the CEF is the minimum mean squared error predictor of $y_i$ given $x_i$ among all functions of $x_i$. The best linear predictor theorem says something analogous within a smaller class: the population regression function, $x_i'\beta$, is the best linear predictor of $y_i$ given $x_i$, meaning that it minimizes mean squared error among all linear functions of $x_i$, whether or not the CEF itself is linear.
2.23 Regression CEF theorem
I would now like to cover one more attribute of regression. The regression CEF theorem states that the function $x_i'\beta$ provides the best linear approximation to the CEF itself; that is, the population regression coefficients solve
$$\beta = \arg\min_{b}\, E\Big\{\big[E(y_i \mid x_i) - x_i'b\big]^2\Big\}.$$
So even when the CEF is nonlinear, regression gives us the linear function that comes closest to it in a mean squared error sense.
2.24 Regression anatomy theorem
In addition to our discussion of the CEF and regression theorems, we now dissect the regression itself. Here we discuss the regression anatomy theorem. The regression anatomy theorem is based on earlier work by Frisch and Waugh (1933) and Lovell (1963).22 I find the theorem more intuitive when I think through a specific example and offer up some data visualization. I have found Filoso (2013) very helpful in explaining this with proofs and algebraic derivations, so I will use this notation and set of steps here as well. In my opinion, the theorem helps us interpret the individual coefficients of a multiple linear regression model. Say that we are interested in the causal effect of family size on labor supply. We want to regress labor supply on family size:
$$y_i = \beta_0 + \beta_1 x_i + u_i,$$
where $y_i$ is labor supply and $x_i$ is family size.
If family size is truly random, then the number of kids in a family is uncorrelated with the unobserved error term.23 This implies that when we regress labor supply on family size, our estimate, $\widehat{\beta}_1$, can be interpreted as the average causal effect of family size on labor supply.
But most likely, family size isn't random, because so many people choose the number of children to have in their family—instead of, say, flipping a coin. So how do we interpret $\widehat{\beta}_1$ in that case? It is a mix of the causal effect and whatever correlation exists between family size and the unobserved determinants of labor supply; without more assumptions, it is only a correlation.
But let's say that we have reason to think that the number of kids in a family is conditionally random. To make this tractable for the sake of pedagogy, let's say that a particular person's family size is as good as randomly chosen once we condition on race and age.24 While unrealistic, I include it to illustrate an important point regarding multivariate regressions. If this assumption were true, then we could write the following equation:
$$y_i = \beta_0 + \beta_1 x_i + \gamma_1 R_i + \gamma_2 A_i + u_i,$$
where $R_i$ is race and $A_i$ is age.
If we want to estimate the average causal effect of family size on labor supply, then we need two things. First, we need a sample of data containing all four of those variables. Without all four of the variables, we cannot estimate this regression model. Second, we need for the number of kids, $x_i$, to be uncorrelated with $u_i$ once race and age are accounted for.
Now, how do we interpret $\widehat{\beta}_1$ in this longer regression? This is where the regression anatomy theorem comes in.
To explain the intuition of the regression anatomy theorem, let's write down a population model with multiple variables. Assume that your main multiple regression model of interest has $K$ covariates. We can then write it as:
$$y_i = \beta_0 + \beta_1 x_{1i} + \dots + \beta_k x_{ki} + \dots + \beta_K x_{Ki} + e_i.$$
The regression anatomy theorem says that the coefficient on the $k$-th covariate can be written as
$$\beta_k = \frac{C(y_i, \tilde{x}_{ki})}{V(\tilde{x}_{ki})},$$
where $\tilde{x}_{ki}$ is the residual from an auxiliary regression of $x_{ki}$ on all of the other covariates.
To prove the theorem, note that $E[\tilde{x}_{ki}] = E[x_{ki}] - E[\widehat{x}_{ki}] = 0$, where $\widehat{x}_{ki}$ is the fitted value from the auxiliary regression, and substitute the population model for $y_i$ into the numerator of the claim:
$$\beta_k = \frac{C\big(\beta_0 + \beta_1 x_{1i} + \dots + \beta_k x_{ki} + \dots + \beta_K x_{Ki} + e_i,\; \tilde{x}_{ki}\big)}{V(\tilde{x}_{ki})}.$$
Since by construction $E[\tilde{x}_{ki}] = 0$, the constant $\beta_0$ contributes nothing to the covariance.
Consider now the term $C(e_i, \tilde{x}_{ki})$. Since $e_i$ is uncorrelated with every covariate and $\tilde{x}_{ki}$ is a linear combination of the covariates, this covariance is zero as well.
Since $\tilde{x}_{ki}$ is the residual from a regression on all of the other covariates, it is orthogonal to each $x_{ji}$ with $j \neq k$.
Once again, since those covariances are zero, every term $\beta_j C(x_{ji}, \tilde{x}_{ki})$ with $j \neq k$ drops out.
The only remaining term, then, is $\beta_k C(x_{ki}, \tilde{x}_{ki})$. Writing $x_{ki} = \widehat{x}_{ki} + \tilde{x}_{ki}$, we have $C(x_{ki}, \tilde{x}_{ki}) = C(\widehat{x}_{ki}, \tilde{x}_{ki}) + V(\tilde{x}_{ki}) = V(\tilde{x}_{ki})$.
This gives
$$\beta_k = \frac{\beta_k\, V(\tilde{x}_{ki})}{V(\tilde{x}_{ki})} = \beta_k,$$
which follows directly from the orthogonality between $\widehat{x}_{ki}$ and $\tilde{x}_{ki}$, and completes the proof.
I find it helpful to visualize things. Let’s look at an example in Stata using its popular automobile data set. I’ll show you:
Code
* -reganat- is a user-created package by @Filoso2013.
ssc install reganat, replace
sysuse auto.dta, replace
regress price length
regress price length weight headroom mpg
reganat price length weight headroom mpg, dis(length) biline
Code
library(tidyverse)
library(haven)

read_data <- function(df) {
  full_path <- paste0("https://github.com/scunning1975/mixtape/raw/master/", df)
  haven::read_dta(full_path)
}

auto <-
  read_data("auto.dta") %>%
  mutate(length = length - mean(length))

lm1 <- lm(price ~ length, auto)
lm2 <- lm(price ~ length + weight + headroom + mpg, auto)
lm_aux <- lm(length ~ weight + headroom + mpg, auto)

auto <-
  auto %>%
  mutate(length_resid = residuals(lm_aux))

lm2_alt <- lm(price ~ length_resid, auto)

coef_lm1 <- lm1$coefficients
coef_lm2_alt <- lm2_alt$coefficients
resid_lm2 <- lm2$residuals

y_single <- tibble(price = coef_lm2_alt[1] + coef_lm1[2]*auto$length_resid,
                   length_resid = auto$length_resid)

y_multi <- tibble(price = coef_lm2_alt[1] + coef_lm2_alt[2]*auto$length_resid,
                  length_resid = auto$length_resid)

auto %>%
  ggplot(aes(x = length_resid, y = price)) +
  geom_point() +
  geom_smooth(data = y_multi, color = "blue") +
  geom_smooth(data = y_single, color = "red")
Code
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
import plotnine as p

auto = pd.read_stata('https://github.com/scunning1975/mixtape/raw/master/auto.dta')
auto['length'] = auto['length'] - auto['length'].mean()

lm1 = sm.OLS.from_formula('price ~ length', data=auto).fit()
lm2 = sm.OLS.from_formula('price ~ length + weight + headroom + mpg', data=auto).fit()

coef_lm1 = lm1.params
coef_lm2 = lm2.params
resid_lm2 = lm2.resid

auto['y_single'] = coef_lm1[0] + coef_lm1[1]*auto['length']
auto['y_multi'] = coef_lm1[0] + coef_lm2[1]*auto['length']

p.ggplot(auto) +\
    p.geom_point(p.aes(x='length', y='price')) +\
    p.geom_smooth(p.aes(x='length', y='y_multi'), color="blue") +\
    p.geom_smooth(p.aes(x='length', y='y_single'), color="red")
Let’s walk through both the regression output that I’ve reproduced in Table 2.8 as well as a nice visualization of the slope parameters in what I’ll call the short bivariate regression and the longer multivariate regression. The short regression of price on car length yields a coefficient of 57.20 on length. For every additional inch, a car is $57 more expensive, which is shown by the upward-sloping, dashed line in Figure 2.4. The slope of that line is 57.20.
| Covariates | Bivariate | Multivariate |
|---|---|---|
| Length | 57.20 | |
| | (14.08) | (40.40) |
| Weight | | 4.34 |
| | | (1.16) |
| Headroom | | |
| | | (388.49) |
| MPG | | |
| | | (83.59) |
It will eventually become second nature for you to talk about including more variables on the right side of a regression as "controlling for" those variables. But in this regression anatomy exercise, I hope to give a different interpretation of what you're doing when you in fact "control for" variables in a regression. First, notice how the coefficient on length changed signs and increased in magnitude once we controlled for the other variables: conditional on weight, headroom, and MPG, longer cars are predicted to be cheaper rather than more expensive, and the conditional slope is larger in absolute value than the bivariate one.
So what exactly is going on in this visualization? Well, for one, it has condensed the number of dimensions (variables) from four to only two. It did this through the regression anatomy process that we described earlier. Basically, we ran the auxiliary regression of length on weight, headroom, and MPG, took the residuals from it, and then calculated the slope coefficient as
$$\widehat{\beta}_1 = \frac{C(y_i, \tilde{x}_i)}{V(\tilde{x}_i)},$$
where $\tilde{x}_i$ is the residualized length. The slope of the line through the scatter of price against residualized length is exactly the multivariate coefficient on length.
2.25 Variance of the OLS estimators
That more or less summarizes what we want to discuss regarding the linear regression. Under a zero conditional mean assumption, we could epistemologically infer that the rule used to produce the coefficient from a regression in our sample was unbiased. That’s nice because it tells us that we have good reason to believe that result. But now we need to build out this epistemological justification so as to capture the inherent uncertainty in the sampling process itself. This added layer of uncertainty is often called inference. Let’s turn to it now.
Remember the simulation we ran earlier in which we resampled a population and estimated regression coefficients a thousand times? We produced a histogram of those 1,000 estimates in Figure 2.3. The mean of the coefficients was around 1.998, which was very close to the true effect of 2 (hard-coded into the data-generating process). But the standard deviation was around 0.04. This means that, basically, in repeated sampling of some population, we got different estimates. But the average of those estimates was close to the true effect, and their spread had a standard deviation of 0.04. This concept of spread in repeated sampling is probably the most useful thing to keep in mind as we move through this section.
Under the four assumptions we discussed earlier, the OLS estimators are unbiased. But these assumptions are not sufficient to tell us anything about the variance in the estimator itself. The assumptions help inform our beliefs that the estimated coefficients, on average, equal the parameter values themselves. But to speak intelligently about the variance of the estimator, we need a measure of dispersion in the sampling distribution of the estimators. As we’ve been saying, this leads us to the variance and ultimately to the standard deviation. We could characterize the variance of the OLS estimators under the four assumptions. But for now, it’s easiest to introduce an assumption that simplifies the calculations. We’ll keep the assumption ordering we’ve been using and call this the fifth assumption.
The fifth assumption is the homoskedasticity or constant variance assumption. This assumption stipulates that the population error term, $u$, has the same variance given any value of the explanatory variable, $x$:
$$V(u \mid x) = \sigma^2 > 0 \ \text{ for all values of } x,$$
where $\sigma^2$ is some fixed, unknown constant.
Theorem: Sampling variance of OLS. Under the four unbiasedness assumptions plus homoskedasticity, and conditioning on the sample values $\{x_1, \dots, x_n\}$, we get:
$$V(\widehat{\beta}_1 \mid x) = \frac{\sigma^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = \frac{\sigma^2}{SST_x},$$
where $SST_x = \sum_{i=1}^{n}(x_i - \bar{x})^2$ is the total sample variation in $x$.
Usually, we are interested in $V(\widehat{\beta}_1)$, the sampling variance of the slope estimator, and the formula makes its two ingredients easy to see: the larger the error variance, $\sigma^2$, the larger the variance of the slope estimator; and the more variation there is in $x$, the smaller its variance.
Notice that $SST_x$ rises as the sample size grows, because we keep adding nonnegative terms to the sum. This means the sampling variance of $\widehat{\beta}_1$ shrinks as $n$ grows, which is one reason more data is generally a good thing: it shrinks the sampling variance of our estimators.
The standard deviation of $\widehat{\beta}_1$ is the square root of its variance:
$$sd(\widehat{\beta}_1) = \frac{\sigma}{\sqrt{SST_x}}.$$
Next we look at estimating the error variance. In the formula $V(\widehat{\beta}_1) = \sigma^2/SST_x$, the term $SST_x$ can be computed from the data, but $\sigma^2$ is unknown and must be estimated. Recall that $\sigma^2 = E(u^2)$; we never observe the errors $u_i$, but we do observe their sample counterparts, the residuals $\widehat{u}_i$, so a natural estimator replaces errors with residuals.
We now propose the following theorem. An unbiased estimator of $\sigma^2$, under the first five assumptions, is:
$$\widehat{\sigma}^2 = \frac{1}{n-2}\sum_{i=1}^{n}\widehat{u}_i^{\,2} = \frac{SSR}{n-2},$$
where the division by $n-2$ rather than $n$ is a degrees-of-freedom correction reflecting the two restrictions the OLS residuals must satisfy, $\sum_i\widehat{u}_i = 0$ and $\sum_i x_i\widehat{u}_i = 0$.
Given $\widehat{\sigma}^2$, we can now estimate the standard deviation of $\widehat{\beta}_1$ by plugging $\widehat{\sigma} = \sqrt{\widehat{\sigma}^2}$ in for $\sigma$:
$$se(\widehat{\beta}_1) = \frac{\widehat{\sigma}}{\sqrt{SST_x}}.$$
This quantity is called the standard error of $\widehat{\beta}_1$, and it is the number regression software reports next to each estimated coefficient.
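The standard-error formula is easy to verify by hand. The sketch below (mine, using simulated data) computes it from the residuals and compares it to what statsmodels reports:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
x = rng.normal(size=1000)
y = 5.5*x + 12*rng.normal(size=1000)

fit = sm.OLS(y, sm.add_constant(x)).fit()
uhat = fit.resid
n = len(y)

sigma2_hat = (uhat**2).sum() / (n - 2)      # SSR / (n - 2)
SST_x = ((x - x.mean())**2).sum()
se_b1 = np.sqrt(sigma2_hat / SST_x)

print(se_b1)        # standard error of the slope computed by hand
print(fit.bse[1])   # statsmodels' reported standard error (matches)
```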
2.26 Robust standard errors
How realistic is it that the variance in the errors is the same for all slices of the explanatory variable, $x$? Often, not very. The variance of the unobserved determinants of $y$ frequently changes with $x$, a situation called heteroskedasticity, and when it does, the usual formula for the standard errors is no longer valid.
This isn't completely bad news, because the unbiasedness of our regressions based on repeated sampling never depended on assuming anything about the variance of the errors. Those four assumptions, and particularly the zero conditional mean assumption, guaranteed that the central tendency of the coefficients under repeated sampling would equal the true parameter, which for this book is a causal parameter. The problem is with the spread of the coefficients. Without homoskedasticity, OLS is no longer the minimum-variance linear unbiased estimator, and, more importantly for inference, the conventional estimated standard errors are biased. Using our sampling metaphor, then, the true distribution of the coefficients is probably wider than the reported standard errors suggest. Fortunately, there is a solution. Let's write down the variance equation under heterogeneous variance terms:
$$V(\widehat{\beta}_1 \mid x) = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2\,\sigma_i^2}{SST_x^2},$$
where $\sigma_i^2 = V(u_i \mid x_i)$. When $\sigma_i^2 = \sigma^2$ for every $i$, this collapses to the earlier formula. A valid estimator under heteroskedasticity of any form (including homoskedasticity) replaces each $\sigma_i^2$ with the squared residual $\widehat{u}_i^{\,2}$:
$$\widehat{V(\widehat{\beta}_1)} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2\,\widehat{u}_i^{\,2}}{SST_x^2}.$$
Its square root is the heteroskedasticity-robust standard error, often just called the robust standard error.
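As a sketch (my own, using a deliberately heteroskedastic data-generating process), here is that robust formula computed by hand and compared against statsmodels' HC0 option:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
x = rng.normal(size=1000)
u = np.abs(x) * rng.normal(size=1000)   # error variance grows with |x|: heteroskedasticity
y = 2*x + u

fit = sm.OLS(y, sm.add_constant(x)).fit()

# plug squared residuals into the robust variance formula for the slope
SST_x = ((x - x.mean())**2).sum()
V_robust = (((x - x.mean())**2) * fit.resid**2).sum() / SST_x**2

print(np.sqrt(V_robust))                                 # robust SE by hand (HC0)
print(fit.get_robustcov_results(cov_type='HC0').bse[1])  # statsmodels' HC0 robust SE
print(fit.bse[1])                                        # conventional SE, too small here
```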
2.27 Cluster robust standard errors
People will try to scare you by challenging how you constructed your standard errors. Heteroskedastic errors, though, aren’t the only thing you should be worried about when it comes to inference. Some phenomena do not affect observations individually, but they do affect groups of observations that involve individuals. And then they affect those individuals within the group in a common way. Say you want to estimate the effect of class size on student achievement, but you know that there exist unobservable things (like the teacher) that affect all the students equally. If we can commit to independence of these unobservables across classes, but individual student unobservables are correlated within a class, then we have a situation in which we need to cluster the standard errors. Before we dive into an example, I’d like to start with a simulation to illustrate the problem.
As a baseline for this simulation, let’s begin by simulating nonclustered data and analyze least squares estimates of that nonclustered data. This will help firm up our understanding of the problems that occur with least squares when data is clustered.27
Code
clear all
set seed 20140
* Set the number of simulations
local n_sims = 1000
set obs `n_sims'
* Create the variables that will contain the results of each simulation
generate beta_0 = .
generate beta_0_l = .
generate beta_0_u = .
generate beta_1 = .
generate beta_1_l = .
generate beta_1_u = .
* Provide the true population parameters
local beta_0_true = 0.4
local beta_1_true = 0
local rho = 0.5
* Run the linear regression 1000 times and save the parameters beta_0 and beta_1
quietly {
    forvalues i = 1(1) `n_sims' {
        preserve
        clear
        set obs 100
        generate x = rnormal(0,1)
        generate e = rnormal(0, sqrt(1 - `rho'))
        generate y = `beta_0_true' + `beta_1_true'*x + e
        regress y x
        local b0 = _b[_cons]
        local b1 = _b[x]
        local df = e(df_r)
        local critical_value = invt(`df', 0.975)
        restore
        replace beta_0 = `b0' in `i'
        replace beta_0_l = beta_0 - `critical_value'*_se[_cons]
        replace beta_0_u = beta_0 + `critical_value'*_se[_cons]
        replace beta_1 = `b1' in `i'
        replace beta_1_l = beta_1 - `critical_value'*_se[x]
        replace beta_1_u = beta_1 + `critical_value'*_se[x]
    }
}
gen false = (beta_1_l > 0)
replace false = 2 if beta_1_u < 0
replace false = 3 if false == 0
tab false
* Plot the parameter estimate
hist beta_1, frequency addplot(pci 0 0 100 0) title("Least squares estimates of non-clustered data") subtitle("Monte Carlo simulation of the slope") legend(label(1 "Distribution of least squares estimates") label(2 "True population parameter")) xtitle("Parameter estimate")
sort beta_1
gen int sim_ID = _n
gen beta_1_True = 0
* Plot of the confidence intervals
twoway rcap beta_1_l beta_1_u sim_ID if beta_1_l > 0 | beta_1_u < 0, horizontal lcolor(pink) || ///
    rcap beta_1_l beta_1_u sim_ID if beta_1_l < 0 & beta_1_u > 0, horizontal ysc(r(0)) || ///
    connected sim_ID beta_1 || ///
    line sim_ID beta_1_True, lpattern(dash) lcolor(black) lwidth(1) ///
    title("Least squares estimates of non-clustered data") subtitle("95% confidence interval of the slope") ///
    legend(label(1 "Missed") label(2 "Hit") label(3 "OLS estimates") label(4 "True population parameter")) ///
    xtitle("Parameter estimates") ytitle("Simulation")
Code
#- Analysis of Clustered Data
#- Courtesy of Dr. Yuki Yanai,
#- http://yukiyanai.github.io/teaching/rm1/contents/R/clustered-data-analysis.html

library('arm')
library('mvtnorm')
library('lme4')
library('multiwayvcov')
library('clusterSEs')
library('ggplot2')
library('dplyr')
library('haven')

gen_cluster <- function(param = c(.1, .5), n = 1000, n_cluster = 50, rho = .5) {
  # Function to generate clustered data
  # Required package: mvtnorm

  # individual level
  Sigma_i <- matrix(c(1, 0, 0, 1 - rho), ncol = 2)
  values_i <- rmvnorm(n = n, sigma = Sigma_i)

  # cluster level
  cluster_name <- rep(1:n_cluster, each = n / n_cluster)
  Sigma_cl <- matrix(c(1, 0, 0, rho), ncol = 2)
  values_cl <- rmvnorm(n = n_cluster, sigma = Sigma_cl)

  # predictor var consists of individual- and cluster-level components
  x <- values_i[ , 1] + rep(values_cl[ , 1], each = n / n_cluster)

  # error consists of individual- and cluster-level components
  error <- values_i[ , 2] + rep(values_cl[ , 2], each = n / n_cluster)

  # data generating process
  y <- param[1] + param[2]*x + error

  df <- data.frame(x, y, cluster = cluster_name)
  return(df)
}

# Simulate a dataset with clusters and fit OLS
# Calculate cluster-robust SE when cluster_robust = TRUE
cluster_sim <- function(param = c(.1, .5), n = 1000, n_cluster = 50,
                        rho = .5, cluster_robust = FALSE) {
  # Required packages: mvtnorm, multiwayvcov
  df <- gen_cluster(param = param, n = n, n_cluster = n_cluster, rho = rho)
  fit <- lm(y ~ x, data = df)
  b1 <- coef(fit)[2]
  if (!cluster_robust) {
    Sigma <- vcov(fit)
    se <- sqrt(diag(Sigma)[2])
    b1_ci95 <- confint(fit)[2, ]
  } else { # cluster-robust SE
    Sigma <- cluster.vcov(fit, ~ cluster)
    se <- sqrt(diag(Sigma)[2])
    t_critical <- qt(.025, df = n - 2, lower.tail = FALSE)
    lower <- b1 - t_critical*se
    upper <- b1 + t_critical*se
    b1_ci95 <- c(lower, upper)
  }
  return(c(b1, se, b1_ci95))
}

# Function to iterate the simulation. A data frame is returned.
run_cluster_sim <- function(n_sims = 1000, param = c(.1, .5), n = 1000,
                            n_cluster = 50, rho = .5, cluster_robust = FALSE) {
  # Required packages: mvtnorm, multiwayvcov, dplyr
  df <- replicate(n_sims, cluster_sim(param = param, n = n, rho = rho,
                                      n_cluster = n_cluster,
                                      cluster_robust = cluster_robust))
  df <- as.data.frame(t(df))
  names(df) <- c('b1', 'se_b1', 'ci95_lower', 'ci95_upper')
  df <- df %>%
    mutate(id = 1:n(),
           param_caught = ci95_lower <= param[2] & ci95_upper >= param[2])
  return(df)
}

# Distribution of the estimator and confidence intervals
sim_params <- c(.4, 0)   # beta1 = 0: no effect of x on y
sim_nocluster <- run_cluster_sim(n_sims = 10000, param = sim_params, rho = 0)

hist_nocluster <- ggplot(sim_nocluster, aes(b1)) +
  geom_histogram(color = 'black') +
  geom_vline(xintercept = sim_params[2], color = 'red')
print(hist_nocluster)

ci95_nocluster <- ggplot(sample_n(sim_nocluster, 100),
                         aes(x = reorder(id, b1), y = b1,
                             ymin = ci95_lower, ymax = ci95_upper,
                             color = param_caught)) +
  geom_hline(yintercept = sim_params[2], linetype = 'dashed') +
  geom_pointrange() +
  labs(x = 'sim ID', y = 'b1', title = 'Randomly Chosen 100 95% CIs') +
  scale_color_discrete(name = 'True param value', labels = c('missed', 'hit')) +
  coord_flip()
print(ci95_nocluster)

sim_nocluster %>% summarize(type1_error = 1 - sum(param_caught)/n())
Code
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
import plotnine as p

def gen_cluster(param = (.1, .5), n = 1000, n_cluster = 50, rho = .5):
    # Function to generate clustered data
    # individual level
    Sigma_i = np.array((1, 0, 0, 1 - rho)).reshape(2, 2)
    values_i = np.random.multivariate_normal(np.zeros(2), Sigma_i, size = n)

    # cluster level
    cluster_name = np.repeat(np.arange(1, n_cluster + 1), repeats = int(n / n_cluster))
    Sigma_cl = np.array((1, 0, 0, rho)).reshape(2, 2)
    values_cl = np.random.multivariate_normal(np.zeros(2), Sigma_cl, size = n_cluster)

    # predictor var consists of individual- and cluster-level components
    x = values_i[:, 0] + np.repeat(values_cl[:, 0], repeats = int(n / n_cluster))

    # error consists of individual- and cluster-level components
    error = values_i[:, 1] + np.repeat(values_cl[:, 1], repeats = int(n / n_cluster))

    # data generating process
    y = param[0] + param[1]*x + error

    df = pd.DataFrame({'x': x, 'y': y, 'cluster': cluster_name})
    return df

def cluster_sim(param = (.1, .5), n = 1000, n_cluster = 50,
                rho = .5, cluster_robust = False):

    df = gen_cluster(param = param, n = n, n_cluster = n_cluster, rho = rho)

    if not cluster_robust:
        fit = sm.OLS.from_formula('y ~ x', data = df).fit()
    else:  # cluster-robust SE
        fit = sm.OLS.from_formula('y ~ x', data = df).fit(
            cov_type='cluster', cov_kwds={'groups': df['cluster']})

    b1 = fit.params[1]
    Sigma = fit.cov_params()

    se = np.sqrt(np.diag(Sigma)[1])
    ci95 = se*1.96
    b1_ci95 = (b1 - ci95, b1 + ci95)

    return (b1, se, *b1_ci95)

def run_cluster_sim(n_sims = 1000, param = (.1, .5), n = 1000,
                    n_cluster = 50, rho = .5, cluster_robust = False):

    res = [cluster_sim(param = param, n = n, rho = rho,
                       n_cluster = n_cluster,
                       cluster_robust = cluster_robust) for x in range(n_sims)]
    df = pd.DataFrame(res)
    df.columns = ('b1', 'se_b1', 'ci95_lower', 'ci95_upper')
    df['param_caught'] = (df['ci95_lower'] <= param[1]) & (param[1] <= df['ci95_upper'])
    df['id'] = df.index
    return df

# Simulation: no clustered errors
sim_params = [.4, 0]  # beta1 = 0: no effect of x on y
sim_nocluster = run_cluster_sim(n_sims=1000, param=sim_params, rho=0, cluster_robust=False)

p.ggplot(sim_nocluster, p.aes('b1')) +\
    p.geom_histogram(color='black') +\
    p.geom_vline(xintercept=sim_params[1], color='red')

p.ggplot(sim_nocluster.sample(100).sort_values('b1'),
         p.aes(x='factor(id)', y='b1',
               ymin='ci95_lower', ymax='ci95_upper',
               color='param_caught')) +\
    p.geom_hline(yintercept=sim_params[1], linetype='dashed') +\
    p.geom_pointrange() +\
    p.labs(x='sim ID', y='b1', title='Randomly Chosen 100 95% CIs') +\
    p.scale_color_discrete(name='True param value', labels=('missed', 'hit')) +\
    p.coord_flip()

1 - sum(sim_nocluster.param_caught)/sim_nocluster.shape[0]
As we can see in Figure 2.5, the least squares estimate is centered on its true population parameter.
Setting the significance level at 5%, we should incorrectly reject the null that $\beta_1 = 0$ about 5% of the time in these simulations, and that is roughly what we find: the 95% confidence intervals miss the true parameter at close to the nominal rate.
But what happens when we use least squares with clustered data? To see that, let’s resimulate our data with observations that are no longer independent draws in a given cluster of observations.
Code
clear all
set seed 20140
local n_sims = 1000
set obs `n_sims'
* Create the variables that will contain the results of each simulation
generate beta_0 = .
generate beta_0_l = .
generate beta_0_u = .
generate beta_1 = .
generate beta_1_l = .
generate beta_1_u = .
* Provide the true population parameters
local beta_0_true = 0.4
local beta_1_true = 0
local rho = 0.5
* Simulate a linear regression. Clustered data (x and e are clustered)
quietly {
    forvalues i = 1(1) `n_sims' {
        preserve
        clear
        set obs 50
        * Generate cluster-level data: clustered x and e
        generate int cluster_ID = _n
        generate x_cluster = rnormal(0,1)
        generate e_cluster = rnormal(0, sqrt(`rho'))
        expand 20
        bysort cluster_ID: gen int ind_in_clusterID = _n
        * Generate individual-level data
        generate x_individual = rnormal(0,1)
        generate e_individual = rnormal(0, sqrt(1 - `rho'))
        * Generate x and e
        generate x = x_individual + x_cluster
        generate e = e_individual + e_cluster
        generate y = `beta_0_true' + `beta_1_true'*x + e
        * Least squares estimates
        regress y x
        local b0 = _b[_cons]
        local b1 = _b[x]
        local df = e(df_r)
        local critical_value = invt(`df', 0.975)
        * Save the results
        restore
        replace beta_0 = `b0' in `i'
        replace beta_0_l = beta_0 - `critical_value'*_se[_cons]
        replace beta_0_u = beta_0 + `critical_value'*_se[_cons]
        replace beta_1 = `b1' in `i'
        replace beta_1_l = beta_1 - `critical_value'*_se[x]
        replace beta_1_u = beta_1 + `critical_value'*_se[x]
    }
}
gen false = (beta_1_l > 0)
replace false = 2 if beta_1_u < 0
replace false = 3 if false == 0
tab false
* Plot the parameter estimate
hist beta_1, frequency addplot(pci 0 0 100 0) title("Least squares estimates of clustered data") subtitle("Monte Carlo simulation of the slope") legend(label(1 "Distribution of least squares estimates") label(2 "True population parameter")) xtitle("Parameter estimate")
Code
#- Analysis of Clustered Data - part 2
#- Courtesy of Dr. Yuki Yanai,
#- http://yukiyanai.github.io/teaching/rm1/contents/R/clustered-data-analysis.html

library('arm')
library('mvtnorm')
library('lme4')
library('multiwayvcov')
library('clusterSEs')
library('ggplot2')
library('dplyr')
library('haven')

# Data with clusters
sim_params <- c(.4, 0)   # beta1 = 0: no effect of x on y
sim_cluster_ols <- run_cluster_sim(n_sims = 10000, param = sim_params)
hist_cluster_ols <- hist_nocluster %+% sim_cluster_ols
print(hist_cluster_ols)
Code
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
import plotnine as p

# gen_cluster(), cluster_sim(), and run_cluster_sim() are defined in the previous
# Python block and are reused here.

# Simulation with clustered data, conventional (non-robust) SEs
sim_params = [.4, 0]  # beta1 = 0: no effect of x on y
sim_cluster_ols = run_cluster_sim(n_sims=1000, param=sim_params, cluster_robust=False)

p.ggplot(sim_cluster_ols, p.aes('b1')) +\
    p.geom_histogram(color='black') +\
    p.geom_vline(xintercept=sim_params[1], color='red')

1 - sum(sim_cluster_ols.param_caught)/sim_cluster_ols.shape[0]
As can be seen in Figure 2.7, the least squares estimates now have a wider spread than when the data wasn't clustered: the common component shared by observations within a cluster makes each sample less informative, so the slope estimates bounce around the true value more. The conventional standard errors, however, do not reflect that extra spread. To see the consequences a bit more clearly, let's look at the confidence intervals again.
Code
sort beta_1
gen int sim_ID = _n
gen beta_1_True = 0
* Plot of the confidence intervals
twoway rcap beta_1_l beta_1_u sim_ID if beta_1_l > 0 | beta_1_u < 0, horizontal lcolor(pink) || ///
    rcap beta_1_l beta_1_u sim_ID if beta_1_l < 0 & beta_1_u > 0, horizontal ysc(r(0)) || ///
    connected sim_ID beta_1 || ///
    line sim_ID beta_1_True, lpattern(dash) lcolor(black) lwidth(1) ///
    title("Least squares estimates of clustered data") subtitle("95% confidence interval of the slope") ///
    legend(label(1 "Missed") label(2 "Hit") label(3 "OLS estimates") label(4 "True population parameter")) ///
    xtitle("Parameter estimates") ytitle("Simulation")
Code
#- Analysis of Clustered Data - part 3
#- Courtesy of Dr. Yuki Yanai,
#- http://yukiyanai.github.io/teaching/rm1/contents/R/clustered-data-analysis.html

library('arm')
library('mvtnorm')
library('lme4')
library('multiwayvcov')
library('clusterSEs')
library('ggplot2')
library('dplyr')
library('haven')

# Confidence intervals
ci95_cluster_ols <- ci95_nocluster %+% sample_n(sim_cluster_ols, 100)
print(ci95_cluster_ols)

sim_cluster_ols %>% summarize(type1_error = 1 - sum(param_caught)/n())
Code
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
import plotnine as p

# gen_cluster(), cluster_sim(), and run_cluster_sim() are defined in the earlier
# Python block and are reused here.

# Simulation with clustered data, conventional (non-robust) SEs
sim_params = [.4, 0]  # beta1 = 0: no effect of x on y
sim_cluster_ols = run_cluster_sim(n_sims=1000, param=sim_params, cluster_robust=False)

p.ggplot(sim_cluster_ols.sample(100).sort_values('b1'),
         p.aes(x='factor(id)', y='b1',
               ymin='ci95_lower', ymax='ci95_upper',
               color='param_caught')) +\
    p.geom_hline(yintercept=sim_params[1], linetype='dashed') +\
    p.geom_pointrange() +\
    p.labs(x='sim ID', y='b1', title='Randomly Chosen 100 95% CIs') +\
    p.scale_color_discrete(name='True param value', labels=('missed', 'hit')) +\
    p.coord_flip()

1 - sum(sim_cluster_ols.param_caught)/sim_cluster_ols.shape[0]
Figure 2.8 shows the distribution of 95% confidence intervals from the least squares estimates. As can be seen, a much larger share of estimates incorrectly rejected the null hypothesis when the data was clustered. The reason is that the true sampling spread of the estimator grows under clustered data, but the conventional standard errors do not grow with it, so the confidence intervals are too narrow and we reject the null far too often. So what can we do?
Code
* Robust estimates
clear all
set seed 20140
local n_sims = 1000
set obs `n_sims'
* Create the variables that will contain the results of each simulation
generate beta_0_robust = .
generate beta_0_l_robust = .
generate beta_0_u_robust = .
generate beta_1_robust = .
generate beta_1_l_robust = .
generate beta_1_u_robust = .
* Provide the true population parameters
local beta_0_true = 0.4
local beta_1_true = 0
local rho = 0.5
quietly {
    forvalues i = 1(1) `n_sims' {
        preserve
        clear
        set obs 50
        * Generate cluster-level data: clustered x and e
        generate int cluster_ID = _n
        generate x_cluster = rnormal(0,1)
        generate e_cluster = rnormal(0, sqrt(`rho'))
        expand 20
        bysort cluster_ID: gen int ind_in_clusterID = _n
        * Generate individual-level data
        generate x_individual = rnormal(0,1)
        generate e_individual = rnormal(0, sqrt(1 - `rho'))
        * Generate x and e
        generate x = x_individual + x_cluster
        generate e = e_individual + e_cluster
        generate y = `beta_0_true' + `beta_1_true'*x + e
        * Least squares with cluster-robust standard errors
        regress y x, cl(cluster_ID)
        local b0_robust = _b[_cons]
        local b1_robust = _b[x]
        local df = e(df_r)
        local critical_value = invt(`df', 0.975)
        * Save the results
        restore
        replace beta_0_robust = `b0_robust' in `i'
        replace beta_0_l_robust = beta_0_robust - `critical_value'*_se[_cons]
        replace beta_0_u_robust = beta_0_robust + `critical_value'*_se[_cons]
        replace beta_1_robust = `b1_robust' in `i'
        replace beta_1_l_robust = beta_1_robust - `critical_value'*_se[x]
        replace beta_1_u_robust = beta_1_robust + `critical_value'*_se[x]
    }
}
* Plot the histogram of the parameter estimates of the robust least squares
gen false = (beta_1_l_robust > 0)
replace false = 2 if beta_1_u_robust < 0
replace false = 3 if false == 0
tab false
* Plot the parameter estimate
hist beta_1_robust, frequency addplot(pci 0 0 110 0) title("Robust least squares estimates of clustered data") subtitle("Monte Carlo simulation of the slope") legend(label(1 "Distribution of robust least squares estimates") label(2 "True population parameter")) xtitle("Parameter estimate")
sort beta_1_robust
gen int sim_ID = _n
gen beta_1_True = 0
* Plot of the confidence intervals
twoway rcap beta_1_l_robust beta_1_u_robust sim_ID if beta_1_l_robust > 0 | beta_1_u_robust < 0, horizontal lcolor(pink) || ///
    rcap beta_1_l_robust beta_1_u_robust sim_ID if beta_1_l_robust < 0 & beta_1_u_robust > 0, horizontal ysc(r(0)) || ///
    connected sim_ID beta_1_robust || ///
    line sim_ID beta_1_True, lpattern(dash) lcolor(black) lwidth(1) ///
    title("Robust least squares estimates of clustered data") subtitle("95% confidence interval of the slope") ///
    legend(label(1 "Missed") label(2 "Hit") label(3 "Robust estimates") label(4 "True population parameter")) ///
    xtitle("Parameter estimates") ytitle("Simulation")
Code
#- Analysis of Clustered Data - part 4
#- Courtesy of Dr. Yuki Yanai,
#- http://yukiyanai.github.io/teaching/rm1/contents/R/clustered-data-analysis.html

library('arm')
library('mvtnorm')
library('lme4')
library('multiwayvcov')
library('clusterSEs')
library('ggplot2')
library('dplyr')
library('haven')

# Cluster-robust standard errors
sim_params <- c(.4, 0)   # beta1 = 0: no effect of x on y
sim_cluster_robust <- run_cluster_sim(n_sims = 10000, param = sim_params,
                                      cluster_robust = TRUE)
hist_cluster_robust <- hist_nocluster %+% sim_cluster_robust
print(hist_cluster_robust)

# Confidence intervals
ci95_cluster_robust <- ci95_nocluster %+% sample_n(sim_cluster_robust, 100)
print(ci95_cluster_robust)

sim_cluster_robust %>% summarize(type1_error = 1 - sum(param_caught)/n())
Code
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
import plotnine as p

# gen_cluster(), cluster_sim(), and run_cluster_sim() are defined in the earlier
# Python block and are reused here.

# Simulation with clustered data and cluster-robust SEs
sim_params = [.4, 0]  # beta1 = 0: no effect of x on y
sim_cluster_robust = run_cluster_sim(n_sims=1000, param=sim_params, cluster_robust=True)

p.ggplot(sim_cluster_robust, p.aes('b1')) +\
    p.geom_histogram(color='black') +\
    p.geom_vline(xintercept=sim_params[1], color='red')

p.ggplot(sim_cluster_robust.sample(100).sort_values('b1'),
         p.aes(x='factor(id)', y='b1',
               ymin='ci95_lower', ymax='ci95_upper',
               color='param_caught')) +\
    p.geom_hline(yintercept=sim_params[1], linetype='dashed') +\
    p.geom_pointrange() +\
    p.labs(x='sim ID', y='b1', title='Randomly Chosen 100 95% CIs') +\
    p.scale_color_discrete(name='True param value', labels=('missed', 'hit')) +\
    p.coord_flip()

1 - sum(sim_cluster_robust.param_caught)/sim_cluster_robust.shape[0]
Now in this case, notice that we included the “, cluster(cluster_ID)” syntax in our regression command. Before we dive into what this syntax did, let’s look at how the confidence intervals changed. Figure 2.9 shows the distribution of the 95% confidence intervals where, again, the darkest region represents those estimates that incorrectly rejected the null. Now, when there are observations whose errors are correlated within a cluster, we find that estimating the model by least squares with standard errors clustered at the group level brings the type I error back down considerably.

This leads us to a natural question: what did the adjustment of the estimator’s variance do that caused the type I error to decrease by so much? Whatever it’s doing, it sure seems to be working! Let’s dive into this adjustment with an example. Consider the following model, and let’s stack the data by cluster first.
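Though we won’t reproduce the full derivation here, the heart of the adjustment can be stated in a short sketch (generic notation only; packages such as Stata also apply finite-sample corrections not shown). Write the outcome for individual $i$ in cluster $g$ as

$$
y_{ig} = x_{ig}\beta + \varepsilon_{ig}, \qquad g = 1, \dots, G,
$$

and stack the observations belonging to each cluster into $y_g$, $X_g$, and $\varepsilon_g$. The cluster-robust (“sandwich”) estimator of the variance of the least squares coefficients is then

$$
\widehat{V}_{\text{cluster}}\big(\widehat{\beta}\big)
  = (X'X)^{-1}\left(\sum_{g=1}^{G} X_g'\,\widehat{u}_g\,\widehat{u}_g'\,X_g\right)(X'X)^{-1},
$$

where $\widehat{u}_g$ is the vector of least squares residuals for cluster $g$. Because the middle term keeps every cross-product of residuals within a cluster, it allows the errors to be arbitrarily correlated inside each cluster while still assuming independence across clusters. That within-cluster correlation is exactly what the default formula ignores, which is why the clustered confidence intervals in the simulation widened and the type I error fell back toward its nominal level.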
Adjusting for clustered data will be quite common in your applied work given the ubiquity of clustered data in the first place. It’s absolutely essential for working in panel contexts, or in repeated cross-sections such as the difference-in-differences design. But it also turns out to be important for experimental design, because often the treatment will be assigned at a higher level of aggregation than the microdata itself. In the real world, though, you can never assume that errors are independent draws from the same distribution. You need to know how your variables were constructed in the first place in order to choose the correct error structure for calculating your standard errors. If you have aggregate variables, like class size, then you’ll need to cluster at that level. If some treatment occurred at the state level, then you’ll need to cluster at that level; the short example below illustrates this. There’s a large literature that looks at even more complex error structures, such as multi-way clustering (Cameron, Gelbach, and Miller 2011).
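For a concrete sense of what “cluster at that level” means in code, here is a hedged sketch using the statsmodels formula interface loaded in the Python listing above. The data frame and its treat, class_size, and state columns are hypothetical placeholders built on the spot, not variables from the book’s data; the only point is that clustering is a one-line change to the fit call, made at the level at which the treatment (or aggregate regressor) varies.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical illustration (not the book's data): students nested in states,
# a treatment assigned at the state level, and class size measured above the
# student level, so errors are plausibly correlated within states.
rng = np.random.default_rng(0)
n_states, per_state = 20, 50
state = np.repeat(np.arange(n_states), per_state)
treat = np.repeat(rng.binomial(1, 0.5, n_states), per_state)   # state-level treatment
class_size = 20 + rng.normal(0, 3, n_states * per_state)
y = (1 + 0.0 * treat - 0.05 * class_size
     + np.repeat(rng.normal(0, 1, n_states), per_state)        # state-level error component
     + rng.normal(0, 1, n_states * per_state))                 # individual-level error
df = pd.DataFrame({'y': y, 'treat': treat, 'class_size': class_size, 'state': state})

# Cluster the standard errors at the state level, the level of treatment assignment.
fit = smf.ols('y ~ treat + class_size', data=df).fit(
    cov_type='cluster', cov_kwds={'groups': df['state']})
print(fit.summary())
```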
But even the concept of the sample as the basis of standard errors may be shifting. It’s becoming increasingly less the case that researchers work with random samples; they are more likely working with administrative data containing the population itself, and thus the concept of sampling uncertainty becomes strained. For instance, Manski and Pepper (2018) wrote that “random sampling assumptions …are not natural when considering states or counties as units of observation.” So although a metaphor of a superpopulation may be useful for extending these classical uncertainty concepts, the ubiquity of digitized administrative data sets has led econometricians and statisticians to think about uncertainty in other ways.
New work by Abadie et al. (2020) explores how sampling-based concepts of the standard error may not be the right way to think about uncertainty in the context of causal inference, or what they call design-based uncertainty. This work in many ways anticipates the next two chapters because of its direct reference to the concept of the counterfactual. Design-based uncertainty reflects our not knowing which values would have occurred had some intervention been different in the counterfactual. And Abadie et al. (2020) derive standard errors for design-based uncertainty, as opposed to sampling-based uncertainty. As luck would have it, those standard errors are usually smaller.
Let’s now move into these fundamental concepts of causality used in applied work and try to develop the tools to understand how counterfactuals and causality work together.
The probability of rolling a 5 using one six-sided die is $\frac{1}{6} = 0.167$.↩︎
The set notation $\cup$ means “union” and refers to the event that at least one of the two events occurs.↩︎
Why are they different? Because 0.83 is an approximation of $\frac{5}{6}$, which is technically 0.833 with the 3s trailing on forever.↩︎
There’s an ironic story in which someone posed the Monty Hall question to the columnist Marilyn vos Savant. Vos Savant had an extremely high IQ, and so people would send in puzzles to stump her. Without the Bayesian decomposition, using only logic, she got the answer right. Her column enraged people, though. Critics wrote in to mansplain how wrong she was, but in fact it was they who were wrong.↩︎
For a more complete review of regression, see Wooldridge (2010) and Wooldridge (2015). I stand on the shoulders of giants.↩︎
The law of total probability requires that all marginal probabilities sum to unity.↩︎
Whenever possible, I try to use the “hat” to represent an estimated statistic. Hence $\hat{S}^2$ instead of just $S^2$. But it is probably more common to see the sample variance represented as $S^2$.↩︎
This is not necessarily causal language. We are speaking first and generally in terms of two random variables systematically moving together in some measurable way.↩︎
Notice that the conditional expectation passed through the linear function, leaving a constant (because of the first property of the expectation operator) and a constant times $X$. This is because the conditional expectation of $E\big[X \mid X\big] = X$. This leaves us with $E\big[u \mid X\big]$, which under zero conditional mean is equal to 0.↩︎
Notice that we are dividing by $n$, not $n-1$. There is no degrees-of-freedom correction, in other words, when using samples to calculate means. There is a degrees-of-freedom correction when we start calculating higher moments.↩︎
Recall from much earlier that:↩︎
It isn’t exactly 0 even though $u$ and $x$ are independent. Think of it as $u$ and $x$ being independent in the population, but not in the sample. This is because sample characteristics tend to be slightly different from population properties due to sampling error.↩︎
Using the Stata code from Table 2.6, you can show these algebraic properties yourself. I encourage you to do so by creating new variables equaling the product of these terms and collapsing as we did with the other variables. That sort of exercise may help convince you that the aforementioned algebraic properties always hold.↩︎
Recall the earlier discussion about degrees-of-freedom correction.↩︎
This section is a review of traditional econometrics pedagogy. We cover it for the sake of completeness. Traditionally, econometricians motivated their discussion of causality through ideas like unbiasedness and consistency.↩︎
Told you we would use this result a lot.↩︎
I find it interesting that we see so many of these terms when working with regression. They show up constantly. Keep your eyes peeled.↩︎
The standard error I found from running this on one sample of data was 0.0403616.↩︎
I highly encourage the interested reader to study Angrist and Pischke (2009), who have an excellent discussion of LIE there.↩︎
Let’s take a concrete example of this proof. Let … . Then take the joint expectation … . Then take conditional expectations after we pass the conditional expectation through.↩︎
A helpful proof of the Frisch-Waugh-Lovell theorem can be found in Lovell (2008).↩︎
While randomly having kids may sound fun, I encourage you to have kids when you want to have them. Contact your local high school health teacher to learn more about a number of methods that can reasonably minimize the number of random children you create.↩︎
Almost certainly a ridiculous assumption, but stick with me.↩︎
There does exist an unbiased estimator of $\sigma$, but it’s tedious and hardly anyone in economics seems to use it. See Holtzman (1950).↩︎
No one even bothers to cite White (1980) anymore, just like how no one cites Leibniz or Newton when using calculus. Eicker, Huber, and White created a solution so valuable that it got separated from the original papers when it was absorbed into the statistical toolkit.↩︎
Hat tip to Ben Chidmi, who helped create this simulation in Stata.↩︎
Usually we appeal to superpopulations in such situations where the observed population is simply itself a draw from some “super” population.↩︎