Causal Inference: Beyond Correlation

Correlation vs Causation

Causal inference is the process of determining whether a change in one variable actually causes a change in another, rather than the two merely being correlated.

[Figure: Causal structures as DAGs. Panels: Correlation (X ~ Y, move together); Causal (X → Y, X causes Y); Fork (X ← Z → Y, Z confounds); Collider (X → Z ← Y, bias if conditioned on); Chain (X → Z → Y, Z mediates).]

Three conditions for causation:

Temporal precedence: Cause precedes effect
Covariation: Cause and effect are related
No confounders: The relationship is not explained by a third variable
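
To make the third condition concrete, here is a small simulation of the fork pattern (X ← Z → Y). The coefficients and the helper name `residualize` are illustrative assumptions, not taken from the text. X has no effect on Y, yet the two are strongly correlated until Z is adjusted for.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Z is a common cause (fork): X <- Z -> Y. X has no causal effect on Y.
z = rng.normal(size=n)
x = 2.0 * z + rng.normal(size=n)
y = 3.0 * z + rng.normal(size=n)

# The marginal correlation is strong even though X does not cause Y.
raw_corr = np.corrcoef(x, y)[0, 1]

def residualize(v, z):
    """Remove the linear dependence of v on z (hypothetical helper)."""
    slope = np.cov(v, z)[0, 1] / np.var(z, ddof=1)
    return v - v.mean() - slope * (z - z.mean())

# Correlating the residuals gives the partial correlation given Z.
partial_corr = np.corrcoef(residualize(x, z), residualize(y, z))[0, 1]

print(f"corr(X, Y)     = {raw_corr:.3f}")
print(f"corr(X, Y | Z) = {partial_corr:.3f}")
```

Adjusting for Z works here only because Z is observed and the relationships are linear. An unobserved confounder would leave the spurious correlation in place.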

Potential Outcomes Framework

Rubin Causal Model

For individual i, define:

Y_i(1) = outcome if treated (potential outcome under treatment)
Y_i(0) = outcome if not treated (potential outcome under control)

Fundamental Problem of Causal Inference
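
In the notation above, the problem can be stated as follows. Each unit reveals only one of its two potential outcomes, so the individual effect is never observed. The treatment indicator T_i and the symbol τ_i are assumed notation for this sketch.

```latex
Y_i^{\text{obs}} = T_i \, Y_i(1) + (1 - T_i) \, Y_i(0),
\qquad
\tau_i = Y_i(1) - Y_i(0) \quad \text{(never observed for any single unit)}
```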

Average Treatment Effect (ATE)
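
Using the potential outcomes defined above, the ATE is usually written as the population average of the individual effects:

```latex
\text{ATE} = \mathbb{E}\left[ Y_i(1) - Y_i(0) \right]
           = \mathbb{E}\left[ Y_i(1) \right] - \mathbb{E}\left[ Y_i(0) \right]
```

Individual effects cannot be observed, but the two expectations can be estimated from group averages when treatment is randomized or when confounding has been adjusted for.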

Conditional Average Treatment Effect (CATE)
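
The CATE is the same average taken within a subgroup, CATE(x) = E[Y(1) − Y(0) | X = x]. The sketch below simulates both potential outcomes, which is possible only in a simulation, under randomized treatment. The binary covariate `senior` and the effect sizes are hypothetical values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Hypothetical binary covariate: 1 = "senior", 0 = "junior".
senior = rng.integers(0, 2, size=n)

# Both potential outcomes are generated for every unit.
y0 = 50 + 10 * senior + rng.normal(0, 5, size=n)
y1 = y0 + np.where(senior == 1, 2.0, 8.0)  # the effect differs by subgroup

# Randomized treatment reveals exactly one potential outcome per unit.
t = rng.integers(0, 2, size=n)
y_obs = np.where(t == 1, y1, y0)

true_ate = np.mean(y1 - y0)
est_ate = y_obs[t == 1].mean() - y_obs[t == 0].mean()

def cate(group):
    """Difference in observed means within one covariate group."""
    mask = senior == group
    return y_obs[mask & (t == 1)].mean() - y_obs[mask & (t == 0)].mean()

print(f"true ATE = {true_ate:.2f}, estimated ATE = {est_ate:.2f}")
print(f"CATE(junior) = {cate(0):.2f}")
print(f"CATE(senior) = {cate(1):.2f}")
```

The ATE is a weighted average of the two subgroup effects, so a single ATE can hide substantial heterogeneity.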

SUTVA (Stable Unit Treatment Value Assumption)

No interference: One unit's treatment does not affect another's outcome
Consistency: Only one version of each treatment level

Propensity Score Matching

Key Property
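
The property usually meant here is the balancing property of the propensity score (Rosenbaum and Rubin). The symbols e(X) and T_i are assumed notation for this sketch.

```latex
e(X_i) = P(T_i = 1 \mid X_i),
\qquad
T_i \perp X_i \mid e(X_i)

\text{If } \bigl(Y_i(0), Y_i(1)\bigr) \perp T_i \mid X_i
\text{ and } 0 < e(X_i) < 1,
\text{ then } \bigl(Y_i(0), Y_i(1)\bigr) \perp T_i \mid e(X_i)
```

In words: units with the same propensity score have the same covariate distribution in both groups, so matching on one number can stand in for matching on all covariates, provided every confounder is included in X.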

Matching Methods

Method	Description	Example
Nearest Neighbor	Find closest control by propensity score	T1→C3, T2→C1
Caliper	Reject matches beyond threshold	T1 (d=0.02) ✓, T2 (d=0.08) ✗
Kernel	Weighted average of all controls	T1 ← weighted sum of C1-C5
Stratification	Divide by propensity score strata	Stratum 1: [0.0-0.2] → ATE_1

Implementation

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

class PropensityScoreMatching:
    def __init__(self, caliper=0.05, n_neighbors=1):
        self.caliper = caliper
        self.n_neighbors = n_neighbors
        self.propensity_model = LogisticRegression(max_iter=1000)
    
    def fit_propensity(self, X, treatment):
        scaler = StandardScaler()
        X_scaled = scaler.fit_transform(X)
        self.propensity_model.fit(X_scaled, treatment)
        self.propensity_scores = self.propensity_model.predict_proba(X_scaled)[:, 1]
        return self.propensity_scores
    
    def match(self, treatment, n_neighbors=None):
        if n_neighbors is None:
            n_neighbors = self.n_neighbors
        
        treated_idx = np.where(treatment == 1)[0]
        control_idx = np.where(treatment == 0)[0]
        
        treated_scores = self.propensity_scores[treated_idx].reshape(-1, 1)
        control_scores = self.propensity_scores[control_idx].reshape(-1, 1)
        
        nn = NearestNeighbors(n_neighbors=n_neighbors)
        nn.fit(control_scores)
        distances, indices = nn.kneighbors(treated_scores)
        
        matches = []
        for i, (t_idx, dist) in enumerate(zip(treated_idx, distances[:, 0])):
            if dist <= self.caliper:
                c_idx = control_idx[indices[i, 0]]
                matches.append((t_idx, c_idx, dist))
        
        self.matches_ = matches
        return matches
    
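    def estimate_ate(self, outcome):
        # Each treated unit is compared with its matched control, so this is
        # strictly the average effect on the treated (ATT). It coincides with
        # the ATE only when the effect is the same for every unit.
        treatment_effects = []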
        for t_idx, c_idx, _ in self.matches_:
            treatment_effects.append(outcome[t_idx] - outcome[c_idx])
        
        treatment_effects = np.array(treatment_effects)
        return {
            'ate': np.mean(treatment_effects),
            'se': np.std(treatment_effects, ddof=1) / np.sqrt(len(treatment_effects)),
            'n_matches': len(treatment_effects)
        }

# Example: Job training program effect
np.random.seed(42)
n = 2000

age = np.random.normal(35, 10, n)
education = np.random.normal(12, 3, n)
income = np.random.normal(40000, 15000, n)
experience = np.random.normal(10, 5, n)

X = np.column_stack([age, education, income, experience])

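# An intercept of -8 keeps roughly half the units treated given the covariate
# means above, so treated and control propensity scores overlap.
propensity_true = 1 / (1 + np.exp(-(-8 + 0.05*age + 0.2*education + 0.0001*income)))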
treatment = np.random.binomial(1, propensity_true)

true_ate = 5000
outcome = 30000 + 500*age + 1000*education + 0.3*income + true_ate*treatment + np.random.normal(0, 5000, n)

# Naive comparison
naive_ate = outcome[treatment == 1].mean() - outcome[treatment == 0].mean()
print(f"Naive ATE (biased): ${naive_ate:,.2f}")
print(f"True ATE: ${true_ate:,.2f}")

# Propensity score matching
psm = PropensityScoreMatching(caliper=0.05)
psm.fit_propensity(X, treatment)
psm.match(treatment)
result = psm.estimate_ate(outcome)

print(f"\nPropensity Score Matching Results:")
print(f"Estimated ATE: ${result['ate']:,.2f}")
print(f"Standard Error: ${result['se']:,.2f}")
print(f"Number of matches: {result['n_matches']}")
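
Matching is only useful if it actually balances the covariates, and a common diagnostic is the standardized mean difference (SMD) before and after matching. The sketch below is self-contained and uses a single simulated covariate. The function name `standardized_mean_difference` and the crude nearest-value matching are assumptions for illustration, not part of the class above.

```python
import numpy as np

def standardized_mean_difference(x_treated, x_control):
    """Difference in means divided by the pooled standard deviation."""
    x_treated = np.asarray(x_treated, dtype=float)
    x_control = np.asarray(x_control, dtype=float)
    pooled_sd = np.sqrt((x_treated.var(ddof=1) + x_control.var(ddof=1)) / 2)
    return (x_treated.mean() - x_control.mean()) / pooled_sd

rng = np.random.default_rng(2)
age_treated = rng.normal(40, 10, 500)
age_control_all = rng.normal(35, 10, 2000)

# Crude stand-in for matching: for each treated unit, take the control with
# the nearest age (with replacement), found by binary search on sorted ages.
order = np.sort(age_control_all)
pos = np.clip(np.searchsorted(order, age_treated), 1, len(order) - 1)
left, right = order[pos - 1], order[pos]
age_control_matched = np.where(age_treated - left <= right - age_treated, left, right)

smd_before = standardized_mean_difference(age_treated, age_control_all)
smd_after = standardized_mean_difference(age_treated, age_control_matched)

print(f"SMD before matching: {smd_before:.3f}")
print(f"SMD after matching:  {smd_after:.3f}")
```

A frequently quoted rule of thumb treats an absolute SMD below 0.1 as acceptable balance. With the class above, the same function could be applied to each column of X using the treated and control indices stored in `matches_`.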

Difference-in-Differences (DiD)

Two-Period Model
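
With two groups and two periods, the model is commonly written as a regression with an interaction term, and the coefficient on the interaction equals a difference of four group means. The symbols α, β, γ, δ and the bar notation are assumed here.

```latex
Y_{it} = \alpha + \beta \, \text{Treated}_i + \gamma \, \text{Post}_t
       + \delta \, (\text{Treated}_i \times \text{Post}_t) + \varepsilon_{it}

\hat{\delta} = \bigl( \bar{Y}_{\text{treated, post}} - \bar{Y}_{\text{treated, pre}} \bigr)
             - \bigl( \bar{Y}_{\text{control, post}} - \bar{Y}_{\text{control, pre}} \bigr)
```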

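[Figure: Difference-in-Differences. Outcome (Y) over time (pre, post) for the treatment group, the control group, and the treatment group's counterfactual. The two groups start at similar levels, and the post-period gap between the treatment line and its counterfactual is the DiD estimate (δ = 0.5 in the figure).]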

Parallel Trends Assumption
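
In potential-outcomes notation, the assumption says that without treatment, the treated group's average outcome would have changed by the same amount as the control group's. The subscripts below are assumed notation.

```latex
\mathbb{E}\left[ Y_{i,\text{post}}(0) - Y_{i,\text{pre}}(0) \mid \text{Treated}_i = 1 \right]
=
\mathbb{E}\left[ Y_{i,\text{post}}(0) - Y_{i,\text{pre}}(0) \mid \text{Treated}_i = 0 \right]
```

The left-hand side involves an unobserved counterfactual, so the assumption cannot be tested directly. Comparing the two groups' trends in the periods before treatment is the usual indirect check.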

Implementation

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

np.random.seed(42)

n_states = 50
n_periods = 20
rows = []
for state in range(n_states):
    treated = 1 if state < 25 else 0           # first 25 states receive the policy
    for time in range(n_periods):
        post = 1 if time >= 10 else 0           # policy starts in period 10
        true_effect = 500 if treated and post else 0
        state_effect = state * 100              # fixed differences between states
        time_effect = time * 50                 # common trend shared by all states
        employment = 10000 + state_effect + time_effect + true_effect + np.random.normal(0, 500)
        rows.append({
            'state': state, 'time': time, 'treated': treated,
            'post': post, 'employment': employment
        })

df = pd.DataFrame(rows)
df['did'] = df['treated'] * df['post']

# The coefficient on the interaction term 'did' is the DiD estimate.
# Standard errors are clustered by state because each state appears in every period.
model = smf.ols('employment ~ treated + post + did', data=df).fit(
    cov_type='cluster', cov_kwds={'groups': df['state']}
)
print(model.summary())
print(f"\nDiD Estimate: {model.params['did']:.2f}")
print(f"P-value: {model.pvalues['did']:.4f}")
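
The regression coefficient on the interaction term is the same quantity as the difference of four group means. The self-contained sketch below computes it directly on a small two-period panel. The data-generating values (a true effect of 3) and the column names are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
rows = []
for unit in range(200):
    treated = int(unit < 100)
    for post in (0, 1):
        # Group gap of 2, common time change of 1, treatment effect of 3.
        y = 10 + 2 * treated + 1 * post + 3 * treated * post + rng.normal(0, 1)
        rows.append({"unit": unit, "treated": treated, "post": post, "y": y})
panel = pd.DataFrame(rows)

# Mean outcome in each of the four (group, period) cells.
means = panel.groupby(["treated", "post"])["y"].mean()

change_treated = means.loc[(1, 1)] - means.loc[(1, 0)]
change_control = means.loc[(0, 1)] - means.loc[(0, 0)]
did = change_treated - change_control

print(f"Change in treated group: {change_treated:.2f}")
print(f"Change in control group: {change_control:.2f}")
print(f"DiD from group means:    {did:.2f}")
```

The group gap and the common time change both cancel in the double difference, which is why DiD tolerates level differences between groups but not diverging trends.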

Instrumental Variables (IV)

When to Use IV

Use IV when the treatment is correlated with unobserved confounders (endogeneity), so that a direct regression of the outcome on the treatment gives a biased estimate.

Instrument Requirements

Relevance: Z is correlated with X
Exclusion: Z affects Y only through X
Exogeneity: Z is uncorrelated with confounders

Two-Stage Least Squares (2SLS)
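
2SLS first regresses the treatment X on the instrument Z, then regresses the outcome Y on the fitted values from that first stage. Below is a minimal NumPy sketch on simulated data where the true effect is 2.0. The coefficients and the helper name `ols` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 50_000

u = rng.normal(size=n)                      # unobserved confounder
z = rng.normal(size=n)                      # instrument, independent of u
x = 0.8 * z + u + rng.normal(size=n)        # relevance: Z affects X
y = 2.0 * x + 3.0 * u + rng.normal(size=n)  # exclusion: Z enters only via X

def ols(design, target):
    """Least-squares coefficients (hypothetical helper)."""
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef

ones = np.ones(n)

# Naive OLS of Y on X is biased because X is correlated with U.
beta_ols = ols(np.column_stack([ones, x]), y)[1]

# Stage 1: predict X from the instrument.
Z = np.column_stack([ones, z])
x_hat = Z @ ols(Z, x)

# Stage 2: regress Y on the predicted X.
beta_2sls = ols(np.column_stack([ones, x_hat]), y)[1]

print(f"OLS estimate:  {beta_ols:.3f}")
print(f"2SLS estimate: {beta_2sls:.3f}")
```

Running the second stage by hand gives the right point estimate but the wrong standard errors, because it ignores the estimation error in the first stage. A dedicated IV routine is the safer choice for inference.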

Architecture Diagram

Confounding:

    U (unobserved)
   / \
  v   v
  X   Y     Endogeneity: Cov(X, ε) ≠ 0

Instrumental Variable:

    U (unobserved)
   / \
  v   v
  X   Y     IV Z satisfies: Z ⊥ U, Z -> X
  ^
  |
  Z (instrument)

Regression Discontinuity Design (RDD)

Sharp RDD

Architecture Diagram

Y
^         *  *  *
|       *        *  *
|     *    Treatment  *  *
|   *        *          *
|  *   *                 *
| *  *    Control
|*  *
+---------------------------+-----------> X
                     Cutoff (c)
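
In a sharp design, treatment switches from 0 to 1 exactly at the cutoff, and the effect is estimated as the jump in the outcome there. One common approach is a local linear regression with separate slopes on each side, sketched below on simulated data with a true jump of 2.0. The function name `sharp_rdd` and the bandwidths are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 5_000
cutoff = 0.0

running = rng.uniform(-1, 1, n)                  # running variable X
treated = (running >= cutoff).astype(float)      # sharp assignment rule
y = 1.0 + 0.5 * running + 2.0 * treated + rng.normal(0, 0.5, n)

def sharp_rdd(running, y, cutoff, bandwidth):
    """Local linear estimate of the jump at the cutoff (uniform kernel)."""
    centered = running - cutoff
    keep = np.abs(centered) <= bandwidth
    c, y_local = centered[keep], y[keep]
    d = (c >= 0).astype(float)
    # Columns: intercept, jump at cutoff, slope below, change in slope above.
    design = np.column_stack([np.ones_like(c), d, c, d * c])
    coef, *_ = np.linalg.lstsq(design, y_local, rcond=None)
    return coef[1]

for h in (0.1, 0.25, 0.5):
    print(f"bandwidth {h:.2f}: estimated jump = {sharp_rdd(running, y, cutoff, h):.3f}")
```

Narrow bandwidths reduce bias from curvature but use fewer observations, so it is common to report the estimate at several bandwidths.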

Directed Acyclic Graphs (DAGs)

Pearl's Causal Hierarchy

Association: P(Y | X) — Seeing
Intervention: P(Y | do(X)) — Doing
Counterfactual: P(Y_x | X', Y') — Imagining

do-Calculus

Backdoor Criterion

A set Z satisfies the backdoor criterion relative to (X, Y) if:

No node in Z is a descendant of X
Z blocks every path between X and Y that contains an arrow into X
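
When Z satisfies the criterion, the interventional distribution can be computed from observational data with the adjustment formula P(Y | do(X = x)) = Σ_z P(Y | X = x, Z = z) P(Z = z). Here is a small sketch with binary variables where the true effect of X on Y is 0.2. The probabilities and the helper name `p_do` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 200_000

z = rng.binomial(1, 0.5, n)                      # confounder
x = rng.binomial(1, np.where(z == 1, 0.8, 0.2))  # Z -> X
y = rng.binomial(1, 0.1 + 0.2 * x + 0.5 * z)     # X -> Y and Z -> Y

# Association ("seeing"): P(Y=1 | X=1) - P(Y=1 | X=0).
assoc = y[x == 1].mean() - y[x == 0].mean()

def p_do(x_val):
    """Backdoor adjustment: sum over z of P(Y=1 | X=x, Z=z) * P(Z=z)."""
    total = 0.0
    for z_val in (0, 1):
        cell = (x == x_val) & (z == z_val)
        total += y[cell].mean() * (z == z_val).mean()
    return total

# Intervention ("doing"): P(Y=1 | do(X=1)) - P(Y=1 | do(X=0)).
effect = p_do(1) - p_do(0)

print(f"Associational difference: {assoc:.3f}")
print(f"Backdoor-adjusted effect: {effect:.3f}")
```

The associational difference overstates the effect because units with X = 1 are far more likely to have Z = 1. Weighting by the marginal distribution of Z removes that imbalance.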

Key Takeaways

Practice Exercises

Propensity Score Matching: Perform PSM on job training data and check covariate balance before/after matching
DiD Analysis: Replicate the minimum wage study and test the parallel trends assumption
IV Estimation: Implement IV using quarter of birth as an instrument for education. Check instrument strength
RDD Design: Apply RDD to a dataset with a known cutoff. Estimate the effect at different bandwidths
Discussion: When would you prefer DiD over propensity score matching? How do you validate the exclusion restriction?
