Causal Inference
Overview
Causal inference is the process of determining whether one variable truly causes a change in another, as opposed to merely being correlated. While machine learning excels at prediction, causal inference answers the fundamental question: "What would happen if we changed something?" This field draws on potential outcomes (Rubin), structural causal models (Pearl), and experimental design to establish causal claims from both experimental and observational data.
Correlation vs Causation
The Fundamental Problem
Fundamental Principle
Three conditions for causation:
- Temporal precedence: Cause precedes effect
- Covariation: Cause and effect are related
- No confounders: The relationship isn't explained by a third variable
Confounding Variables
Confounder (Z)
/ \
v v
Treatment (X) --> Outcome (Y)
Example: Ice cream sales (X) and drowning deaths (Y)
Both caused by temperature (Z)
Types of Relationships
Causal: X --> Y (X causes Y)
Fork: X <-- Z --> Y (Z confounds X and Y)
Collider: X --> Z <-- Y (Z is collider, conditioning biases)
Chain: X --> Z --> Y (Z mediates X -> Y)
âšī¸ Collider Bias
Conditioning on a collider (a variable caused by both X and Y) can create a spurious association between X and Y. This is a common pitfall in observational studies where researchers control for too many variables.
Potential Outcomes Framework
Rubin Causal Model
For individual , define:
- = outcome if treated (potential outcome under treatment)
- = outcome if not treated (potential outcome under control)
DfIndividual Treatment Effect (ITE)
The individual treatment effect is the difference between potential outcomes under treatment and control. It is the fundamental quantity of interest in causal inference.
Individual Treatment Effect
Here,
- =Treatment effect for individual i
- =Potential outcome under treatment
- =Potential outcome under control
Fundamental Problem of Causal Inference
We can only observe ONE potential outcome for each individual:
Observed Outcome
Here,
- =Treatment assignment (0 or 1)
âšī¸ The Fundamental Problem
Because we never observe both and for the same individual, causal effects are inherently counterfactual. All causal inference methods are strategies for estimating what we cannot directly observe.
Average Treatment Effect (ATE)
DfAverage Treatment Effect (ATE)
The ATE is the expected difference in outcomes between treatment and control across the entire population. It answers: "On average, what is the effect of treatment?"
Average Treatment Effect on the Treated (ATT)
ATT
Here,
- =Average treatment effect on the treated
Conditional Average Treatment Effect (CATE)
CATE
Here,
- =Conditional average treatment effect at x
SUTVA (Stable Unit Treatment Value Assumption)
- No interference: One unit's treatment doesn't affect another's outcome
- Consistency: Only one version of each treatment level
đĄ SUTVA in Practice
SUTVA is often violated in social networks (where your treatment affects your friends) and in marketplace settings (where one user's treatment affects another's outcome). When SUTVA fails, standard methods produce biased estimates.
Propensity Score Matching
Propensity Score
DfPropensity Score
The propensity score is the probability of receiving treatment given covariates. It summarizes high-dimensional covariate information into a single scalar.
Propensity Score
Here,
- =Propensity score
- =Covariates
- =Treatment indicator
Key Property
ThPropensity Score Theorem (Rosenbaum & Rubin, 1983)
If treatment assignment is unconfounded given :
This reduces the dimensionality problem from balancing all covariates to balancing just the propensity score.
Matching Methods
1. Nearest Neighbor Matching
For each treated unit, find closest control by propensity score
|T1|---closest--->|C3|
|T2|---closest--->|C1|
|T3|---closest--->|C3| (can be reused)
2. Caliper Matching
Same as nearest neighbor, but reject matches beyond caliper
|T1|---distance=0.02--->|C3| (within caliper 0.05)
|T2|---distance=0.08--->|C1| (reject, too far)
3. Kernel Matching
Each treated unit matched to weighted average of all controls
|T1| <--weighted sum--> |C1|, |C2|, |C3|, |C4|, |C5|
4. Stratification
Divide into strata by propensity score, estimate within each
Stratum 1: [0.0-0.2] -> ATE_1
Stratum 2: [0.2-0.4] -> ATE_2
...
ATE = weighted average of ATE_k
Implementation
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler
from scipy import stats
class PropensityScoreMatching:
"""Propensity Score Matching for causal inference."""
def __init__(self, caliper=0.05, n_neighbors=1):
self.caliper = caliper
self.n_neighbors = n_neighbors
self.propensity_model = LogisticRegression(max_iter=1000)
def fit_propensity(self, X, treatment):
"""Estimate propensity scores."""
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
self.propensity_model.fit(X_scaled, treatment)
self.propensity_scores = self.propensity_model.predict_proba(X_scaled)[:, 1]
return self.propensity_scores
def match(self, treatment, n_neighbors=None):
"""Perform matching."""
if n_neighbors is None:
n_neighbors = self.n_neighbors
treated_idx = np.where(treatment == 1)[0]
control_idx = np.where(treatment == 0)[0]
# Fit nearest neighbors on propensity scores
treated_scores = self.propensity_scores[treated_idx].reshape(-1, 1)
control_scores = self.propensity_scores[control_idx].reshape(-1, 1)
nn = NearestNeighbors(n_neighbors=n_neighbors)
nn.fit(control_scores)
distances, indices = nn.kneighbors(treated_scores)
# Apply caliper
matches = []
for i, (t_idx, dist) in enumerate(zip(treated_idx, distances[:, 0])):
if dist <= self.caliper:
c_idx = control_idx[indices[i, 0]]
matches.append((t_idx, c_idx, dist))
self.matches_ = matches
return matches
def estimate_ate(self, outcome):
"""Estimate ATE from matched pairs."""
if not hasattr(self, 'matches_'):
raise ValueError("Run match() first")
treatment_effects = []
for t_idx, c_idx, _ in self.matches_:
treatment_effects.append(outcome[t_idx] - outcome[c_idx])
treatment_effects = np.array(treatment_effects)
return {
'ate': np.mean(treatment_effects),
'se': np.std(treatment_effects, ddof=1) / np.sqrt(len(treatment_effects)),
'ci_lower': np.mean(treatment_effects) - 1.96 * np.std(treatment_effects, ddof=1) / np.sqrt(len(treatment_effects)),
'ci_upper': np.mean(treatment_effects) + 1.96 * np.std(treatment_effects, ddof=1) / np.sqrt(len(treatment_effects)),
'n_matches': len(treatment_effects)
}
def check_balance(self, X, treatment):
"""Check covariate balance before and after matching."""
treated = X[treatment == 1]
control = X[treatment == 0]
# Before matching
balance_before = []
for col in range(X.shape[1]):
std_diff = (treated[:, col].mean() - control[:, col].mean()) / np.sqrt(
(treated[:, col].var() + control[:, col].var()) / 2
)
balance_before.append(abs(std_diff))
# After matching
balance_after = []
if hasattr(self, 'matches_'):
matched_treated = np.array([m[0] for m in self.matches_])
matched_control = np.array([m[1] for m in self.matches_])
for col in range(X.shape[1]):
std_diff = (X[matched_treated, col].mean() - X[matched_control, col].mean()) / np.sqrt(
(X[matched_treated, col].var() + X[matched_control, col].var()) / 2
)
balance_after.append(abs(std_diff))
return {
'before': np.array(balance_before),
'after': np.array(balance_after) if balance_after else None,
'max_before': max(balance_before),
'max_after': max(balance_after) if balance_after else None
}
# Example: Job training program effect
np.random.seed(42)
n = 2000
# Covariates
age = np.random.normal(35, 10, n)
education = np.random.normal(12, 3, n)
income = np.random.normal(40000, 15000, n)
experience = np.random.normal(10, 5, n)
X = np.column_stack([age, education, income, experience])
# Treatment assignment (confounded by education and income)
propensity_true = 1 / (1 + np.exp(-(- 3 + 0.05*age + 0.2*education + 0.0001*income)))
treatment = np.random.binomial(1, propensity_true)
# Outcome (true ATE = 5000)
true_ate = 5000
outcome = 30000 + 500 * age + 1000 * education + 0.3 * income + true_ate * treatment + np.random.normal(0, 5000, n)
# Naive comparison
naive_ate = outcome[treatment == 1].mean() - outcome[treatment == 0].mean()
print(f"Naive ATE (biased): ${naive_ate:,.2f}")
print(f"True ATE: ${true_ate:,.2f}")
# Propensity score matching
psm = PropensityScoreMatching(caliper=0.05)
psm.fit_propensity(X, treatment)
psm.match(treatment)
result = psm.estimate_ate(outcome)
print(f"\nPropensity Score Matching Results:")
print(f"Estimated ATE: ${result['ate']:,.2f}")
print(f"Standard Error: ${result['se']:,.2f}")
print(f"95% CI: (${result['ci_lower']:,.2f}, ${result['ci_upper']:,.2f})")
print(f"Number of matches: {result['n_matches']}")
# Balance check
balance = psm.check_balance(X, treatment)
print(f"\nCovariate Balance (standardized mean differences):")
print(f"Before matching (max): {balance['max_before']:.4f}")
print(f"After matching (max): {balance['max_after']:.4f}")
Difference-in-Differences (DiD)
Concept
DiD compares the change in outcomes over time between a treatment group and a control group, controlling for time-invariant confounders.
Two-Period Model
Difference-in-Differences Model
Here,
- =DiD estimator (treatment effect)
- =Treatment indicator for unit i
- =Post-treatment indicator
The DiD estimator is:
Y
^
| * Treatment (Post)
| /
| / <-- δ (DiD estimate)
| /
| *------* Treatment (Pre)
| /
| /
| *----------* Control (Pre and Post)
| /
+-----------------------------------------> Time
Pre Post
Parallel Trends Assumption
âšī¸ Parallel Trends Assumption
The key assumption: absent treatment, treatment and control groups would have followed parallel trends. This is untestable (we never observe the treated group's counterfactual), but can be partially validated using pre-treatment data.
Staggered Adoption DiD
When units adopt treatment at different times:
Staggered DiD
Here,
- =Unit fixed effects
- =Time fixed effects
- =Treatment indicator
Implementation
import numpy as np
import pandas as pd
import statsmodels.api as sm
from linearmodels.panel import PanelOLS
class DifferenceInDifferences:
"""Difference-in-Differences estimator."""
def __init__(self):
self.results = None
def fit(self, df, outcome, treatment, time, unit_id, covariates=None):
"""
Estimate DiD effect.
Args:
df: DataFrame with panel data
outcome: Name of outcome variable
treatment: Name of treatment indicator (unit-level)
time: Name of time variable
unit_id: Name of unit identifier
covariates: List of covariate names
"""
# Create interaction term
df = df.copy()
df['post'] = (df[time] == df[time].max()).astype(int)
df['did'] = df[treatment] * df['post']
# Build formula
formula = f'{outcome} ~ 1 + {treatment} + post + did'
if covariates:
formula += ' + ' + ' + '.join(covariates)
# Panel regression with fixed effects
df = df.set_index([unit_id, time])
model = PanelOLS.from_formula(
formula + ' + EntityEffects',
data=df
)
self.results = model.fit(cov_type='clustered', cluster_entity=True)
return self
def estimate_effect(self):
"""Extract DiD estimate."""
if self.results is None:
raise ValueError("Model not fitted")
did_coef = self.results.params['did']
did_se = self.results.std_errors['did']
did_pval = self.results.pvalues['did']
return {
'estimate': did_coef,
'std_error': did_se,
'p_value': did_pval,
'ci_lower': did_coef - 1.96 * did_se,
'ci_upper': did_coef + 1.96 * did_se,
'summary': self.results.summary
}
# Simulate DiD example: Minimum wage increase
np.random.seed(42)
# Create panel data
n_states = 50
n_periods = 20
states = []
for state in range(n_states):
treated = 1 if state < 25 else 0 # First 25 states adopt treatment
for time in range(n_periods):
post = 1 if time >= 10 else 0
# True effect
true_effect = 500 if treated and post else 0
# State and time effects
state_effect = state * 100
time_effect = time * 50
# Outcome
outcome = 10000 + state_effect + time_effect + true_effect + np.random.normal(0, 500)
states.append({
'state': state,
'time': time,
'treated': treated,
'post': post,
'employment': outcome
})
df = pd.DataFrame(states)
# Estimate DiD
did = DifferenceInDifferences()
did.fit(df, outcome='employment', treatment='treated', time='time', unit_id='state')
result = did.estimate_effect()
print("Difference-in-Differences Results")
print(f"DiD Estimate: {result['estimate']:.2f}")
print(f"Standard Error: {result['std_error']:.2f}")
print(f"P-value: {result['p_value']:.4f}")
print(f"95% CI: ({result['ci_lower']:.2f}, {result['ci_upper']:.2f})")
Instrumental Variables (IV)
When to Use IV
When treatment is correlated with unobserved confounders (endogeneity):
Endogeneity
Here,
- =Treatment variable
- =Error term
Instrument Requirements
- Relevance: is correlated with
- Exclusion: affects only through
- Exogeneity: is uncorrelated with confounders
Two-Stage Least Squares (2SLS)
DfTwo-Stage Least Squares
2SLS addresses endogeneity by using an instrument to isolate exogenous variation in the treatment. The first stage predicts treatment using the instrument; the second stage uses predicted treatment to estimate the causal effect.
Stage 1: Regress treatment on instrument:
First Stage
Here,
- =Instrument
- =First-stage coefficient
Stage 2: Regress outcome on predicted treatment:
Second Stage
Here,
- =Predicted treatment from first stage
The IV estimator is:
Visual
Confounding:
U (unobserved)
/ \
v v
X Y Endogeneity: Cov(X, Îĩ) â 0
Instrumental Variable:
U (unobserved)
/ \
v v
X Y IV Z satisfies: Z âĨ U, Z â X
^
|
Z (instrument)
Implementation
import numpy as np
import pandas as pd
import statsmodels.api as sm
from linearmodels.iv import IV2SLS
class InstrumentalVariables:
"""Instrumental Variables estimation."""
def __init__(self):
self.results = None
def fit(self, df, outcome, treatment, instrument, covariates=None):
"""
Estimate causal effect using IV/2SLS.
Args:
df: DataFrame
outcome: Dependent variable name
treatment: Endogenous variable name
instrument: Instrument variable name
covariates: List of exogenous covariate names
"""
# Build formula
formula = f'{outcome} ~ 1 + [{treatment} ~ {instrument}]'
if covariates:
formula += ' + ' + ' + '.join(covariates)
model = IV2SLS.from_formula(formula, data=df).fit()
self.results = model
return self
def test_instrument(self, df, treatment, instrument):
"""Test instrument relevance (first stage F-statistic)."""
X = sm.add_constant(df[instrument].values.reshape(-1, 1))
model = sm.OLS(df[treatment], X).fit()
f_stat = model.fvalue
relevant = f_stat > 10 # Rule of thumb
return {
'f_statistic': f_stat,
'relevant': relevant,
'first_stage': model.summary()
}
def estimate_effect(self):
"""Extract IV estimate."""
if self.results is None:
raise ValueError("Model not fitted")
return {
'estimate': self.results.params.iloc[-1], # Treatment effect
'std_error': self.results.std_errors.iloc[-1],
'p_value': self.results.pvalues.iloc[-1],
'ci_lower': self.results.conf_int().iloc[-1, 0],
'ci_upper': self.results.conf_int().iloc[-1, 1],
'summary': self.results.summary
}
# Simulate: Returns to education with ability bias
np.random.seed(42)
n = 5000
# Unobserved ability (confounder)
ability = np.random.normal(0, 1, n)
# Instrument: Quarter of birth (historically used)
quarter = np.random.choice([1, 2, 3, 4], n)
Z = (quarter <= 2).astype(int) # Born in first half of year
# Education (affected by instrument, correlated with ability)
education = 10 + 0.5 * Z + 0.8 * ability + np.random.normal(0, 1, n)
# Outcome: wages
# True return to education: 2000 per year
true_returns = 2000
wage = 20000 + true_returns * education + 5000 * ability + np.random.normal(0, 5000, n)
# Create DataFrame
df = pd.DataFrame({
'wage': wage,
'education': education,
'instrument': Z,
'ability': ability
})
# OLS (biased)
ols_model = sm.OLS(wage, sm.add_constant(df[['education', 'ability']])).fit()
print("OLS (with ability - true model):")
print(f"Education coefficient: {ols_model.params['education']:.2f}")
print(f"True value: {true_returns}")
# OLS (without ability - biased)
ols_biased = sm.OLS(wage, sm.add_constant(df[['education']])).fit()
print("\nOLS (omitted variable bias):")
print(f"Education coefficient: {ols_biased.params['education']:.2f}")
# IV estimation
iv = InstrumentalVariables()
iv.fit(df, outcome='wage', treatment='education', instrument='instrument')
result = iv.estimate_effect()
print("\nIV (2SLS) Results:")
print(f"Education coefficient: {result['estimate']:.2f}")
print(f"Standard Error: {result['std_error']:.2f}")
print(f"P-value: {result['p_value']:.4f}")
print(f"95% CI: ({result['ci_lower']:.2f}, {result['ci_upper']:.2f})")
# Test instrument relevance
relevance = iv.test_instrument(df, 'education', 'instrument')
print(f"\nInstrument Relevance (F-stat): {relevance['f_statistic']:.2f}")
print(f"Strong instrument: {relevance['relevant']}")
Regression Discontinuity Design (RDD)
Sharp RDD
Treatment is assigned based on a cutoff rule:
DfSharp RDD
Treatment is deterministically assigned based on whether a running variable exceeds a cutoff. The causal effect is estimated at the cutoff where treatment assignment changes discontinuously.
Sharp RDD Treatment Assignment
Here,
- =Treatment indicator
- =Running variable
- =Cutoff point
Fuzzy RDD
Treatment probability jumps at cutoff but isn't deterministic:
Fuzzy RDD
Here,
- =Just above cutoff
- =Just below cutoff
Y
^
| * * *
| * * *
| * Treatment * *
| * * *
| * * *
| * * Control
|* *
+---------------------------+-----------> X
Cutoff (c)
đĄ RDD Validity
RDD is considered one of the most credible quasi-experimental methods because it mimics a randomized experiment at the cutoff. The key validity check is whether pre-treatment covariates also show discontinuities at the cutoff.
Implementation
import numpy as np
import pandas as pd
import statsmodels.api as sm
class RegressionDiscontinuity:
"""Regression Discontinuity Design estimator."""
def __init__(self, bandwidth=None):
self.bandwidth = bandwidth
def estimate(self, X, Y, cutoff, bandwidth=None):
"""
Estimate causal effect at discontinuity.
Args:
X: Running variable
Y: Outcome
cutoff: Cutoff point
bandwidth: Bandwidth for local estimation
"""
if bandwidth is None:
bandwidth = self.bandwidth or self._optimal_bandwidth(X, Y, cutoff)
# Select observations near cutoff
near = np.abs(X - cutoff) <= bandwidth
# Create treatment indicator
D = (X >= cutoff).astype(int)
# Local linear regression
X_centered = X - cutoff
X_near = X_centered[near]
Y_near = Y[near]
D_near = D[near]
# Add polynomial terms
Z = np.column_stack([
np.ones(len(X_near)),
D_near,
X_near,
D_near * X_near
])
model = sm.OLS(Y_near, Z).fit()
return {
'effect': model.params[1],
'std_error': model.bse[1],
'p_value': model.pvalues[1],
'bandwidth': bandwidth,
'n_obs': len(X_near),
'summary': model.summary
}
def _optimal_bandwidth(self, X, Y, cutoff):
"""Simple bandwidth selection (Imbens-Kalyanaraman)."""
n = len(X)
h = 1.06 * np.std(X) * n**(-1/5)
return h
# Simulate: Class size and test scores
np.random.seed(42)
n = 3000
# Running variable: class size
class_size = np.random.uniform(25, 40, n)
# Cutoff: policy changes at class_size = 30
cutoff = 30
treatment = (class_size >= cutoff).astype(int)
# True effect: smaller classes improve scores
# At cutoff, there's a jump
true_effect = 5
Y = 70 - 0.5 * class_size + true_effect * treatment + np.random.normal(0, 3, n)
# Estimate RDD
rdd = RegressionDiscontinuity()
result = rdd.estimate(class_size, Y, cutoff)
print("Regression Discontinuity Results")
print(f"Effect at cutoff: {result['effect']:.2f}")
print(f"Standard Error: {result['std_error']:.2f}")
print(f"P-value: {result['p_value']:.4f}")
print(f"Bandwidth: {result['bandwidth']:.2f}")
print(f"Observations used: {result['n_obs']}")
Directed Acyclic Graphs (DAGs)
Pearl's Causal Hierarchy
- Association: â Seeing
- Intervention: â Doing
- Counterfactual: â Imagining
âšī¸ Causal Hierarchy
Each level of Pearl's causal hierarchy requires strictly more assumptions than the level below. Association requires no assumptions beyond observational data. Intervention requires causal assumptions encoded in a DAG. Counterfactuals require a fully specified structural causal model.
do-Calculus
Backdoor Criterion
A set satisfies the backdoor criterion relative to if:
- No node in is a descendant of
- blocks every path between and that contains an arrow into
Front-Door Criterion
When no set satisfies backdoor criterion, use front-door:
- Find that mediates all causal paths from to
- blocks all backdoor paths from to
- No unblocked backdoor paths from to
Key Takeaways
đSummary: Causal Inference
- Correlation â Causation: Always consider confounders â observational data alone cannot establish causation without additional assumptions
- Potential outcomes framework provides rigorous definition of causal effects through counterfactuals
- Propensity score matching reduces confounding by balancing covariates between treated and control groups
- DiD uses variation over time to control for unobserved time-invariant confounders
- IV addresses endogeneity using exogenous variation from instruments
- RDD exploits arbitrary cutoffs for quasi-experimental identification â one of the most credible quasi-experimental designs
- DAGs help visualize and verify causal assumptions formally
- No method is perfect: All require untestable assumptions; sensitivity analysis is essential
Practice Exercises
Exercise 1: Propensity Score Matching
Using the job training data example, perform PSM and check:
- Covariate balance before and after matching
- Sensitivity to caliper width
- Overlap in propensity score distributions
Exercise 2: DiD Analysis
Replicate the minimum wage study and test the parallel trends assumption using pre-treatment data.
Exercise 3: IV Estimation
Implement IV using quarter of birth as an instrument for education. Check instrument strength.
Exercise 4: RDD Design
Apply RDD to a dataset with a known cutoff. Estimate the effect at different bandwidths.
Discussion Questions
- When would you prefer DiD over propensity score matching?
- How do you validate the exclusion restriction for an IV?
- What are the limitations of each causal inference method?