Correlation vs Causation
Causal inference is the process of determining whether one variable truly causes a change in another, as opposed to merely being correlated.
Correlation: X ~ Y (X and Y move together)
Causal: X --> Y (X causes Y)
Fork: X <-- Z --> Y (Z confounds X and Y)
Collider: X --> Z <-- Y (Z is collider, conditioning biases)
Chain: X --> Z --> Y (Z mediates X -> Y)
Fundamental Principle
Three conditions for causation:
- Temporal precedence: Cause precedes effect
- Covariation: Cause and effect are related
- No confounders: The relationship is not explained by a third variable
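The third condition can be made concrete with a small simulation. The sketch below is illustrative only: the variable names, coefficients, and sample size are assumptions, not part of the original text. It builds a fork in which Z drives both X and Y while X has no effect on Y, then compares the raw association with the association after adjusting for Z.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Fork: Z causes both X and Y; X has no causal effect on Y (assumed setup).
z = rng.normal(size=n)
x = 2.0 * z + rng.normal(size=n)
y = 3.0 * z + rng.normal(size=n)

# Covariation holds even though X does not cause Y.
raw_corr = np.corrcoef(x, y)[0, 1]

# Adjust for Z: regress Y on [1, X, Z] and read off the coefficient on X.
design = np.column_stack([np.ones(n), x, z])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
adjusted_slope = coef[1]

print(f"corr(X, Y) ignoring Z: {raw_corr:.3f}")
print(f"slope on X after adjusting for Z: {adjusted_slope:.3f}")
```

The raw correlation is strong while the adjusted slope is close to zero, which is what "explained by a third variable" looks like in data.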
Collider Bias
Conditioning on a collider (a variable caused by both X and Y) can create a spurious association between X and Y. This is a common pitfall in observational studies where researchers control for too many variables.
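A minimal simulation of this pitfall, under assumed parameters: X and Y are generated independently, Z is built from both, and "conditioning" is approximated by keeping only units with a high value of Z (one common form of selection on a collider).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# X and Y are independent by construction.
x = rng.normal(size=n)
y = rng.normal(size=n)

# Collider: Z is caused by both X and Y (coefficients are illustrative).
z = x + y + rng.normal(scale=0.5, size=n)

corr_all = np.corrcoef(x, y)[0, 1]

# Condition on the collider by selecting units in the top 20% of Z.
selected = z > np.quantile(z, 0.8)
corr_selected = np.corrcoef(x[selected], y[selected])[0, 1]

print(f"corr(X, Y), full sample:      {corr_all:.3f}")
print(f"corr(X, Y), selected on high Z: {corr_selected:.3f}")
```

In the full sample the correlation is near zero; within the selected subsample it turns clearly negative, even though neither variable affects the other.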
Potential Outcomes Framework
Rubin Causal Model
For individual i, define:
- Y_i(1) = outcome if treated (potential outcome under treatment)
- Y_i(0) = outcome if not treated (potential outcome under control)
Individual Treatment Effect
$$\tau_i = Y_i(1) - Y_i(0)$$
Here,
- $\tau_i$ = Treatment effect for individual i
- $Y_i(1)$ = Potential outcome under treatment
- $Y_i(0)$ = Potential outcome under control
Fundamental Problem of Causal Inference
Observed Outcome
$$Y_i = D_i \, Y_i(1) + (1 - D_i) \, Y_i(0)$$
Here,
- $D_i$ = Treatment assignment (0 or 1)
The Fundamental Problem
Because we never observe both Y_i(1) and Y_i(0) for the same individual, causal effects are inherently counterfactual. All causal inference methods are strategies for estimating what we cannot directly observe.
Average Treatment Effect (ATE)
$$\text{ATE} = E[Y_i(1) - Y_i(0)]$$
Here,
- $\text{ATE}$ = Average treatment effect
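In a simulation, unlike in real data, both potential outcomes can be generated for every unit, which makes the fundamental problem easy to see. The sketch below uses assumed distributions and writes D_i for the treatment assignment. It hides one potential outcome per unit and shows that, under randomized assignment, a simple difference in means still recovers the average treatment effect.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50_000

# Both potential outcomes for every unit (only possible in a simulation).
y0 = rng.normal(loc=10.0, scale=2.0, size=n)
tau = rng.normal(loc=1.5, scale=1.0, size=n)   # heterogeneous individual effects
y1 = y0 + tau

# Randomized assignment D_i, independent of the potential outcomes.
d = rng.integers(0, 2, size=n)

# Observed outcome: Y_i = D_i * Y_i(1) + (1 - D_i) * Y_i(0)
y_obs = d * y1 + (1 - d) * y0

true_ate = np.mean(y1 - y0)                     # needs both potential outcomes
diff_in_means = y_obs[d == 1].mean() - y_obs[d == 0].mean()  # uses observed data only

print(f"True ATE:            {true_ate:.3f}")
print(f"Difference in means: {diff_in_means:.3f}")
```

The individual effects `tau` are never recoverable from `y_obs` and `d` alone; only their average is, and only because assignment was randomized.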
Conditional Average Treatment Effect (CATE)
$$\tau(x) = E[Y_i(1) - Y_i(0) \mid X_i = x]$$
Here,
- $\tau(x)$ = Conditional average treatment effect at x
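As an illustration, the sketch below estimates the CATE for a single binary covariate by computing a difference in means within each covariate value. The covariate, effect sizes, and randomized assignment are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 40_000

x = rng.integers(0, 2, size=n)   # hypothetical binary covariate
d = rng.integers(0, 2, size=n)   # randomized treatment

# Assumed effects: 1.0 when x == 0 and 3.0 when x == 1, so the ATE is about 2.0.
effect = np.where(x == 1, 3.0, 1.0)
y = 5.0 + 2.0 * x + effect * d + rng.normal(size=n)

def diff_in_means(outcome, treatment):
    """Mean outcome of treated units minus mean outcome of control units."""
    return outcome[treatment == 1].mean() - outcome[treatment == 0].mean()

# CATE at each value of x: the same contrast, restricted to that subgroup.
cate = {value: diff_in_means(y[x == value], d[x == value]) for value in (0, 1)}
ate = diff_in_means(y, d)

print(f"CATE(x=0): {cate[0]:.2f}, CATE(x=1): {cate[1]:.2f}, ATE: {ate:.2f}")
```

With continuous or high-dimensional covariates, subgroup means are no longer practical and a model of the outcome is usually needed instead.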
SUTVA (Stable Unit Treatment Value Assumption)
- No interference: One unit's treatment does not affect another's outcome
- Consistency: Only one version of each treatment level
Propensity Score Matching
Propensity Score
$$e(X) = P(D = 1 \mid X)$$
Here,
- $e(X)$ = Propensity score
- $X$ = Covariates
- $D$ = Treatment indicator
Key Property
Propensity Score Theorem (Rosenbaum and Rubin, 1983)
If treatment assignment is unconfounded given X, then Y(1), Y(0) ⊥ D | e(X). This reduces the dimensionality problem from balancing all covariates to balancing just the propensity score.
Matching Methods
| Method | Description | Example |
|---|---|---|
| Nearest Neighbor | Find closest control by propensity score | T1→C3, T2→C1 |
| Caliper | Reject matches beyond threshold | T1 (d=0.02) ✓, T2 (d=0.08) ✗ |
| Kernel | Weighted average of all controls | T1 ← weighted sum of C1-C5 |
| Stratification | Divide by propensity score strata | Stratum 1: [0.0-0.2] → ATE_1 |
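The stratification row of the table can be sketched as follows. To keep the example short, the propensity score is taken as known rather than estimated, and the single confounder, the coefficients, and the choice of five strata are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 50_000

x = rng.normal(size=n)                        # single confounder
e = 1.0 / (1.0 + np.exp(-x))                  # propensity score, assumed known here
d = rng.binomial(1, e)
y = 2.0 * x + 1.0 * d + rng.normal(size=n)    # true effect = 1.0

naive = y[d == 1].mean() - y[d == 0].mean()

# Assign each unit to a propensity-score quintile.
n_strata = 5
edges = np.quantile(e, np.linspace(0, 1, n_strata + 1))
stratum = np.clip(np.searchsorted(edges, e, side="right") - 1, 0, n_strata - 1)

# Within-stratum difference in means, weighted by stratum share of the sample.
effects, weights = [], []
for s in range(n_strata):
    in_s = stratum == s
    treated, control = in_s & (d == 1), in_s & (d == 0)
    effects.append(y[treated].mean() - y[control].mean())
    weights.append(in_s.mean())

stratified = float(np.average(effects, weights=weights))

print(f"Naive estimate:      {naive:.2f}")
print(f"Stratified estimate: {stratified:.2f} (true effect 1.0)")
```

Stratification removes most of the confounding bias but not all of it, because units within a coarse stratum still differ somewhat in their propensity scores; finer strata reduce this residual bias at the cost of noisier within-stratum estimates.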
Implementation
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler


class PropensityScoreMatching:
    """Nearest-neighbor propensity score matching with a caliper."""

    def __init__(self, caliper=0.05, n_neighbors=1):
        self.caliper = caliper          # maximum allowed propensity-score distance
        self.n_neighbors = n_neighbors
        self.propensity_model = LogisticRegression(max_iter=1000)

    def fit_propensity(self, X, treatment):
        # Standardize covariates so the logistic regression converges reliably
        scaler = StandardScaler()
        X_scaled = scaler.fit_transform(X)
        self.propensity_model.fit(X_scaled, treatment)
        self.propensity_scores = self.propensity_model.predict_proba(X_scaled)[:, 1]
        return self.propensity_scores

    def match(self, treatment, n_neighbors=None):
        if n_neighbors is None:
            n_neighbors = self.n_neighbors
        treatment = np.asarray(treatment)
        treated_idx = np.where(treatment == 1)[0]
        control_idx = np.where(treatment == 0)[0]
        treated_scores = self.propensity_scores[treated_idx].reshape(-1, 1)
        control_scores = self.propensity_scores[control_idx].reshape(-1, 1)

        # Find the nearest control(s) for every treated unit by propensity score
        nn = NearestNeighbors(n_neighbors=n_neighbors)
        nn.fit(control_scores)
        distances, indices = nn.kneighbors(treated_scores)

        # Keep only the closest control, and only if it lies within the caliper.
        # Controls can be reused, so this is matching with replacement.
        matches = []
        for i, (t_idx, dist) in enumerate(zip(treated_idx, distances[:, 0])):
            if dist <= self.caliper:
                c_idx = control_idx[indices[i, 0]]
                matches.append((t_idx, c_idx, dist))
        self.matches_ = matches
        return matches

    def estimate_ate(self, outcome):
        # Each matched treated unit is compared with its control, so strictly this
        # averages over treated units (the ATT). It coincides with the ATE when the
        # effect is the same for everyone, as in the example below.
        outcome = np.asarray(outcome)
        treatment_effects = []
        for t_idx, c_idx, _ in self.matches_:
            treatment_effects.append(outcome[t_idx] - outcome[c_idx])
        treatment_effects = np.array(treatment_effects)
        return {
            'ate': np.mean(treatment_effects),
            # Simple standard error; ignores reused controls and the estimated score
            'se': np.std(treatment_effects, ddof=1) / np.sqrt(len(treatment_effects)),
            'n_matches': len(treatment_effects)
        }
# Example: Job training program effect
np.random.seed(42)
n = 2000
age = np.random.normal(35, 10, n)
education = np.random.normal(12, 3, n)
income = np.random.normal(40000, 15000, n)
experience = np.random.normal(10, 5, n)
X = np.column_stack([age, education, income, experience])
propensity_true = 1 / (1 + np.exp(-(-3 + 0.05*age + 0.2*education + 0.0001*income)))
treatment = np.random.binomial(1, propensity_true)
true_ate = 5000
outcome = 30000 + 500*age + 1000*education + 0.3*income + true_ate*treatment + np.random.normal(0, 5000, n)
# Naive comparison
naive_ate = outcome[treatment == 1].mean() - outcome[treatment == 0].mean()
print(f"Naive ATE (biased): ${naive_ate:,.2f}")
print(f"True ATE: ${true_ate:,.2f}")
# Propensity score matching
psm = PropensityScoreMatching(caliper=0.05)
psm.fit_propensity(X, treatment)
psm.match(treatment)
result = psm.estimate_ate(outcome)
print(f"\nPropensity Score Matching Results:")
print(f"Estimated ATE: ${result['ate']:,.2f}")
print(f"Standard Error: ${result['se']:,.2f}")
print(f"Number of matches: {result['n_matches']}")
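Matching is only useful if it actually balances the covariates, so a balance check usually follows. The sketch below is a self-contained illustration rather than an extension of the class above: it uses one covariate, an assumed-known propensity score, and a hypothetical helper `standardized_mean_difference`, and compares balance before and after 1-nearest-neighbor matching with replacement.

```python
import numpy as np

def standardized_mean_difference(x_treated, x_control):
    """SMD = (mean_t - mean_c) / sqrt((var_t + var_c) / 2)."""
    pooled_sd = np.sqrt((x_treated.var(ddof=1) + x_control.var(ddof=1)) / 2.0)
    return (x_treated.mean() - x_control.mean()) / pooled_sd

rng = np.random.default_rng(5)
n = 20_000
age = rng.normal(35, 10, n)
score = 1.0 / (1.0 + np.exp(-(-2.0 + 0.05 * age)))   # propensity, assumed known
d = rng.binomial(1, score)

treated_idx = np.flatnonzero(d == 1)
control_idx = np.flatnonzero(d == 0)

# Nearest control for each treated unit, found by lookup in the sorted control scores.
order = np.argsort(score[control_idx])
sorted_scores = score[control_idx][order]
t_scores = score[treated_idx]
pos = np.clip(np.searchsorted(sorted_scores, t_scores), 1, len(sorted_scores) - 1)
left_closer = (t_scores - sorted_scores[pos - 1]) <= (sorted_scores[pos] - t_scores)
nearest = np.where(left_closer, pos - 1, pos)
matched_control_idx = control_idx[order[nearest]]

smd_before = standardized_mean_difference(age[treated_idx], age[control_idx])
smd_after = standardized_mean_difference(age[treated_idx], age[matched_control_idx])

print(f"SMD for age before matching: {smd_before:.3f}")
print(f"SMD for age after matching:  {smd_after:.3f}")
```

A common rule of thumb treats an absolute standardized mean difference below about 0.1 as acceptable balance, though the threshold is a convention rather than a formal test.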
Difference-in-Differences (DiD)
Two-Period Model
Difference-in-Differences Model
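$$Y_{it} = \beta_0 + \beta_1 \, \text{Treat}_i + \beta_2 \, \text{Post}_t + \delta \, (\text{Treat}_i \times \text{Post}_t) + \varepsilon_{it}$$
Here,
- $\delta$ = DiD estimator (treatment effect)
- $\text{Treat}_i$ = Treatment indicator for unit i
- $\text{Post}_t$ = Post-treatment indicator
$$\hat{\delta} = (\bar{Y}_{\text{treat, post}} - \bar{Y}_{\text{treat, pre}}) - (\bar{Y}_{\text{control, post}} - \bar{Y}_{\text{control, pre}})$$
Here,
- $\hat{\delta}$ = Estimated treatment effect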
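In the two-group, two-period case the DiD estimate is simply a difference of two differences in group means. The sketch below works through that arithmetic on simulated data; the group levels, common time trend, and effect size are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 5_000  # observations per group and period

true_effect = 2.0
# Control starts at 10, treated at 13; both share a common +1 time trend.
y_control_pre = 10.0 + rng.normal(size=n)
y_control_post = 11.0 + rng.normal(size=n)
y_treated_pre = 13.0 + rng.normal(size=n)
y_treated_post = 14.0 + true_effect + rng.normal(size=n)

change_treated = y_treated_post.mean() - y_treated_pre.mean()
change_control = y_control_post.mean() - y_control_pre.mean()
did = change_treated - change_control

# Two naive contrasts, each of which mixes the effect with something else.
post_only = y_treated_post.mean() - y_control_post.mean()   # adds the level gap of 3
before_after = change_treated                               # adds the time trend of 1

print(f"DiD estimate:          {did:.2f}")
print(f"Post-period contrast:  {post_only:.2f}")
print(f"Before/after contrast: {before_after:.2f}")
```

Subtracting the control group's change removes the shared time trend, and differencing within each group removes the fixed level gap between groups, leaving only the treatment effect.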
Y
^                                   * Treatment (Post)
|                                  /
|                                 /   <-- δ (DiD estimate)
|                                /
|              *----------------*  Treatment (Pre)
|
|              *--------------------* Control (Pre and Post)
+-----------------------------------------> Time
              Pre                Post
Parallel Trends Assumption
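The key assumption: absent treatment, the treatment and control groups would have followed parallel trends. Because it concerns an unobserved counterfactual, the assumption cannot be tested directly, but it can be partially validated by checking that the two groups moved in parallel before treatment.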
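One simple way to use pre-treatment data is to test whether the two groups had different slopes before treatment began. The sketch below is a minimal version of that check, with a hypothetical helper `pre_trend_gap` and simulated pre-period panels in which the trends are either common or diverging by an assumed 0.5 per period.

```python
import numpy as np

def pre_trend_gap(y, treated, time):
    """Coefficient on treated*time in an OLS fit: the difference in pre-period slopes."""
    design = np.column_stack([np.ones_like(time), treated, time, treated * time])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[3]

rng = np.random.default_rng(7)
n_units, n_pre = 200, 10   # pre-treatment periods only

# Long-format panel: first half of the units belong to the treated group.
treated = np.repeat(np.arange(n_units) < n_units // 2, n_pre).astype(float)
time = np.tile(np.arange(n_pre, dtype=float), n_units)
noise = rng.normal(size=n_units * n_pre)

# Scenario 1: common trend. Scenario 2: treated group rises 0.5 faster per period.
y_parallel = 5.0 + 2.0 * treated + 1.0 * time + noise
y_diverging = y_parallel + 0.5 * treated * time

gap_parallel = pre_trend_gap(y_parallel, treated, time)
gap_diverging = pre_trend_gap(y_diverging, treated, time)

print(f"Slope gap, parallel pre-trends:  {gap_parallel:.3f}")
print(f"Slope gap, diverging pre-trends: {gap_diverging:.3f}")
```

A slope gap near zero is consistent with parallel trends but does not prove the assumption, since the groups could still diverge after treatment for unrelated reasons.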
Implementation
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

np.random.seed(42)
n_states = 50
n_periods = 20

# Simulate a state-by-period panel: states 0-24 are treated from period 10 onward
rows = []
for state in range(n_states):
    treated = 1 if state < 25 else 0
    for time in range(n_periods):
        post = 1 if time >= 10 else 0
        true_effect = 500 if treated and post else 0
        state_effect = state * 100
        time_effect = time * 50
        employment = 10000 + state_effect + time_effect + true_effect + np.random.normal(0, 500)
        rows.append({
            'state': state, 'time': time, 'treated': treated,
            'post': post, 'employment': employment
        })
df = pd.DataFrame(rows)
df['did'] = df['treated'] * df['post']

# The coefficient on the interaction term `did` is the DiD estimate.
# With repeated observations per state, standard errors clustered by state
# (cov_type='cluster' in fit) are usually more appropriate than the default.
model = smf.ols('employment ~ treated + post + did', data=df).fit()
print(model.summary())
print(f"\nDiD Estimate: {model.params['did']:.2f}")
print(f"P-value: {model.pvalues['did']:.4f}")
Instrumental Variables (IV)
When to Use IV
When treatment is correlated with unobserved confounders (endogeneity):
Endogeneity
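$$Y = \beta_0 + \beta_1 X + \varepsilon, \qquad \text{Cov}(X, \varepsilon) \neq 0$$
Here,
- $X$ = Treatment variable
- $\varepsilon$ = Error term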
Instrument Requirements
- Relevance: Z is correlated with X
- Exclusion: Z affects Y only through X
- Exogeneity: Z is uncorrelated with confounders
Two-Stage Least Squares (2SLS)
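First Stage
$$X = \pi_0 + \pi_1 Z + \nu$$
Here,
- $Z$ = Instrument
- $\pi_1$ = First-stage coefficient
Second Stage
$$Y = \beta_0 + \beta_1 \hat{X} + \varepsilon$$
Here,
- $\hat{X}$ = Predicted treatment from first stage
$$\hat{\beta}_{IV} = \frac{\text{Cov}(Z, Y)}{\text{Cov}(Z, X)}$$
Here,
- $\hat{\beta}_{IV}$ = IV estimate of treatment effect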
Confounding:
      U (unobserved)
     / \
    v   v
    X --> Y      Endogeneity: Cov(X, ε) ≠ 0

Instrumental Variable:
      U (unobserved)
     / \
    v   v
    X --> Y      IV Z satisfies: Z ⊥ U, Z -> X
    ^
    |
    Z (instrument)
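The two stages can be sketched with plain least squares. The example below is a minimal illustration on simulated data: the unobserved confounder `u`, the instrument strength of 0.8, and the true effect of 2.0 are all assumptions, and `ols` is a small helper written for this example.

```python
import numpy as np

def ols(design, target):
    """Least-squares coefficients for target ~ design."""
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef

rng = np.random.default_rng(8)
n = 100_000

u = rng.normal(size=n)                       # unobserved confounder
z = rng.normal(size=n)                       # instrument, independent of u
x = 0.8 * z + u + rng.normal(size=n)         # relevance: z shifts x
y = 2.0 * x + 3.0 * u + rng.normal(size=n)   # true effect of x on y is 2.0

ones = np.ones(n)

# Naive OLS of y on x is biased because x is correlated with u.
beta_ols = ols(np.column_stack([ones, x]), y)[1]

# Stage 1: regress x on z and form the fitted values.
pi = ols(np.column_stack([ones, z]), x)
x_hat = pi[0] + pi[1] * z

# Stage 2: regress y on the fitted values from stage 1.
beta_2sls = ols(np.column_stack([ones, x_hat]), y)[1]

# With one instrument and one regressor this equals Cov(Z, Y) / Cov(Z, X).
beta_ratio = np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]

print(f"OLS:  {beta_ols:.3f}")
print(f"2SLS: {beta_2sls:.3f}")
print(f"Cov(Z, Y) / Cov(Z, X): {beta_ratio:.3f}")
```

Running the second stage by hand gives the right point estimate but not the right standard errors, because the residuals should be computed with the original `x` rather than `x_hat`; dedicated IV routines handle this correction.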
Regression Discontinuity Design (RDD)
Sharp RDD
Sharp RDD Treatment Assignment
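$$D_i = \mathbb{1}\{X_i \geq c\}$$
Here,
- $D_i$ = Treatment indicator
- $X_i$ = Running variable
- $c$ = Cutoff point
$$\tau_{RDD} = \lim_{x \downarrow c} E[Y_i \mid X_i = x] - \lim_{x \uparrow c} E[Y_i \mid X_i = x]$$
Here,
- $\tau_{RDD}$ = Estimated treatment effect at cutoff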
Y
^                            |         * * *
|                            |    * * *
|                            | * *        Treatment
|                            |
|                     * * *  |
|              * * *         |
|        * * *     Control   |
|  * * *                     |
+----------------------------+-----------> X
                         Cutoff (c)
RDD Validity
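RDD is considered one of the most credible quasi-experimental methods because, close to the cutoff, assignment to treatment is as good as random, which mimics a randomized experiment. The key validity check is that pre-treatment covariates show no discontinuity at the cutoff: a jump in a covariate that the treatment cannot have affected suggests that units on either side are not comparable.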
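A sharp RDD estimate is often obtained by fitting a separate line on each side of the cutoff within a bandwidth and measuring the gap between the two lines at the cutoff. The sketch below is one simple version of that idea, assuming treatment is assigned when the running variable is at or above the cutoff; the helper `rdd_estimate`, the data-generating process, and the bandwidths are illustrative assumptions.

```python
import numpy as np

def rdd_estimate(x, y, cutoff, bandwidth):
    """Jump at the cutoff from separate linear fits on each side (uniform kernel)."""
    centered = x - cutoff
    window = np.abs(centered) <= bandwidth
    xc, yc = centered[window], y[window]
    above = (xc >= 0).astype(float)
    # y = a + tau*above + b*xc + c*above*xc ; tau is the discontinuity at the cutoff
    design = np.column_stack([np.ones_like(xc), above, xc, above * xc])
    coef, *_ = np.linalg.lstsq(design, yc, rcond=None)
    return coef[1]

rng = np.random.default_rng(9)
n = 20_000
x = rng.uniform(0, 100, size=n)          # running variable
cutoff = 50.0
d = (x >= cutoff).astype(float)          # sharp assignment rule (assumed direction)
y = 20.0 + 0.3 * x + 4.0 * d + rng.normal(scale=2.0, size=n)   # true jump = 4.0

# Re-estimate at several bandwidths to see how sensitive the result is.
estimates = {h: rdd_estimate(x, y, cutoff, h) for h in (5.0, 10.0, 20.0)}
for h, est in estimates.items():
    print(f"bandwidth {h:>4}: estimated jump = {est:.2f}")
```

Narrow bandwidths use fewer observations and give noisier estimates, while wide bandwidths rely more heavily on the linear fit being correct away from the cutoff; stable estimates across bandwidths are reassuring.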
Directed Acyclic Graphs (DAGs)
Pearl's Causal Hierarchy
- Association: P(Y | X) — Seeing
- Intervention: P(Y | do(X)) — Doing
- Counterfactual: P(Y_x | X', Y') — Imagining
Causal Hierarchy
Each level of Pearl's causal hierarchy requires strictly more assumptions than the level below. Association requires no assumptions beyond observational data. Intervention requires causal assumptions encoded in a DAG. Counterfactuals require a fully specified structural causal model.
do-Calculus
$$P(Y \mid do(X = x)) = \sum_{z} P(Y \mid X = x, Z = z) \, P(Z = z)$$
Here,
- $do(X)$ = Intervention on X
- $Z$ = Confounders satisfying backdoor criterion
Backdoor Criterion
A set Z satisfies the backdoor criterion relative to (X, Y) if:
- No node in Z is a descendant of X
- Z blocks every path between X and Y that contains an arrow into X
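When Z satisfies the criterion, the interventional distribution can be computed from observational data by averaging P(Y | X, Z) over the marginal distribution of Z. The sketch below checks this on a simulated binary DAG with Z -> X, Z -> Y, and X -> Y; all probabilities are assumptions, and the true causal effect of X on P(Y = 1) is 0.2 by construction.

```python
import numpy as np

rng = np.random.default_rng(10)
n = 200_000

# Binary DAG: Z -> X, Z -> Y, X -> Y. Z satisfies the backdoor criterion.
z = rng.binomial(1, 0.5, size=n)
x = rng.binomial(1, np.where(z == 1, 0.8, 0.2))
y = rng.binomial(1, 0.1 + 0.2 * x + 0.5 * z)

def p_y_given(x_val, z_val=None):
    """Empirical P(Y=1 | X=x_val) or P(Y=1 | X=x_val, Z=z_val)."""
    mask = x == x_val
    if z_val is not None:
        mask = mask & (z == z_val)
    return y[mask].mean()

def p_y_do(x_val):
    """Backdoor adjustment: sum over z of P(Y=1 | X=x, Z=z) * P(Z=z)."""
    return sum(p_y_given(x_val, z_val) * np.mean(z == z_val) for z_val in (0, 1))

observational_gap = p_y_given(1) - p_y_given(0)   # "seeing": confounded by Z
interventional_gap = p_y_do(1) - p_y_do(0)        # "doing": adjusted for Z

print(f"P(Y|X=1) - P(Y|X=0):         {observational_gap:.3f}")
print(f"P(Y|do(X=1)) - P(Y|do(X=0)): {interventional_gap:.3f}")
```

The observational contrast overstates the effect because units with X = 1 are also more likely to have Z = 1, while the adjusted contrast recovers the value built into the simulation.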
Key Takeaways
Summary: Causal Inference
- Correlation ≠ Causation: Always consider confounders; observational data alone cannot establish causation without additional assumptions
- Potential outcomes framework provides rigorous definition of causal effects through counterfactuals
- Propensity score matching reduces confounding by balancing covariates between treated and control groups
- DiD uses variation over time to control for unobserved time-invariant confounders
- IV addresses endogeneity using exogenous variation from instruments
- RDD exploits arbitrary cutoffs for quasi-experimental identification — one of the most credible quasi-experimental designs
- DAGs help visualize and verify causal assumptions formally
- No method is perfect: All require untestable assumptions; sensitivity analysis is essential
Practice Exercises
- Propensity Score Matching: Perform PSM on job training data and check covariate balance before/after matching
- DiD Analysis: Replicate the minimum wage study and test the parallel trends assumption
- IV Estimation: Implement IV using quarter of birth as an instrument for education. Check instrument strength
- RDD Design: Apply RDD to a dataset with a known cutoff. Estimate the effect at different bandwidths
- Discussion: When would you prefer DiD over propensity score matching? How do you validate the exclusion restriction?