CW

Causal Inference: Beyond Correlation

Module 4: Statistics & ProbabilityFree Lesson

Advertisement

Correlation vs Causation

Causal inference is the process of determining whether one variable truly causes a change in another, as opposed to merely being correlated.

Architecture Diagram
Correlation:        X ~ Y     (X and Y move together)
Causal:             X --> Y   (X causes Y)
Fork:               X <-- Z --> Y  (Z confounds X and Y)
Collider:           X --> Z <-- Y  (Z is collider, conditioning biases)
Chain:              X --> Z --> Y  (Z mediates X -> Y)

Fundamental Principle

CorrelationCausation\text{Correlation} \neq \text{Causation}

Three conditions for causation:

  1. Temporal precedence: Cause precedes effect
  2. Covariation: Cause and effect are related
  3. No confounders: The relationship is not explained by a third variable

Collider Bias

Conditioning on a collider (a variable caused by both X and Y) can create a spurious association between X and Y. This is a common pitfall in observational studies where researchers control for too many variables.

Potential Outcomes Framework

Rubin Causal Model

For individual i, define:

  • Y_i(1) = outcome if treated (potential outcome under treatment)
  • Y_i(0) = outcome if not treated (potential outcome under control)

Individual Treatment Effect

τi=Yi(1)Yi(0)\tau_i = Y_i(1) - Y_i(0)

Here,

  • I¨iÏ„_i=Treatment effect for individual i
  • Yi(1)Y_i(1)=Potential outcome under treatment
  • Yi(0)Y_i(0)=Potential outcome under control

Fundamental Problem of Causal Inference

Observed Outcome

Yiobs=DiYi(1)+(1Di)Yi(0)Y_i^{obs} = D_i \cdot Y_i(1) + (1 - D_i) \cdot Y_i(0)

Here,

  • DiD_i=Treatment assignment (0 or 1)

The Fundamental Problem

Because we never observe both Y_i(1) and Y_i(0) for the same individual, causal effects are inherently counterfactual. All causal inference methods are strategies for estimating what we cannot directly observe.

Average Treatment Effect (ATE)

Average Treatment Effect
ATE=E[Y(1)Y(0)]=E[Y(1)]E[Y(0)]ATE = E[Y(1) - Y(0)] = E[Y(1)] - E[Y(0)]

Here,

  • ATEATE=Average treatment effect

Conditional Average Treatment Effect (CATE)

CATE

CATE(x)=E[Y(1)Y(0)X=x]CATE(x) = E[Y(1) - Y(0) \mid X = x]

Here,

  • CATE(x)CATE(x)=Conditional average treatment effect at x

SUTVA (Stable Unit Treatment Value Assumption)

  1. No interference: One unit's treatment does not affect another's outcome
  2. Consistency: Only one version of each treatment level

Propensity Score Matching

Propensity Score

e(X)=P(D=1X)e(X) = P(D = 1 \mid X)

Here,

  • e(X)e(X)=Propensity score
  • XX=Covariates
  • DD=Treatment indicator

Key Property

ThPropensity Score Theorem (Rosenbaum and Rubin, 1983)

If treatment assignment is unconfounded given X, then Y(1), Y(0) ⊥ D | e(X). This reduces the dimensionality problem from balancing all covariates to balancing just the propensity score.

Matching Methods

MethodDescriptionExample
Nearest NeighborFind closest control by propensity scoreT1→C3, T2→C1
CaliperReject matches beyond thresholdT1 (d=0.02) ✓, T2 (d=0.08) ✗
KernelWeighted average of all controlsT1 ← weighted sum of C1-C5
StratificationDivide by propensity score strataStratum 1: [0.0-0.2] → ATE_1

Implementation

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

class PropensityScoreMatching:
    def __init__(self, caliper=0.05, n_neighbors=1):
        self.caliper = caliper
        self.n_neighbors = n_neighbors
        self.propensity_model = LogisticRegression(max_iter=1000)
    
    def fit_propensity(self, X, treatment):
        scaler = StandardScaler()
        X_scaled = scaler.fit_transform(X)
        self.propensity_model.fit(X_scaled, treatment)
        self.propensity_scores = self.propensity_model.predict_proba(X_scaled)[:, 1]
        return self.propensity_scores
    
    def match(self, treatment, n_neighbors=None):
        if n_neighbors is None:
            n_neighbors = self.n_neighbors
        
        treated_idx = np.where(treatment == 1)[0]
        control_idx = np.where(treatment == 0)[0]
        
        treated_scores = self.propensity_scores[treated_idx].reshape(-1, 1)
        control_scores = self.propensity_scores[control_idx].reshape(-1, 1)
        
        nn = NearestNeighbors(n_neighbors=n_neighbors)
        nn.fit(control_scores)
        distances, indices = nn.kneighbors(treated_scores)
        
        matches = []
        for i, (t_idx, dist) in enumerate(zip(treated_idx, distances[:, 0])):
            if dist <= self.caliper:
                c_idx = control_idx[indices[i, 0]]
                matches.append((t_idx, c_idx, dist))
        
        self.matches_ = matches
        return matches
    
    def estimate_ate(self, outcome):
        treatment_effects = []
        for t_idx, c_idx, _ in self.matches_:
            treatment_effects.append(outcome[t_idx] - outcome[c_idx])
        
        treatment_effects = np.array(treatment_effects)
        return {
            'ate': np.mean(treatment_effects),
            'se': np.std(treatment_effects, ddof=1) / np.sqrt(len(treatment_effects)),
            'n_matches': len(treatment_effects)
        }

# Example: Job training program effect
np.random.seed(42)
n = 2000

age = np.random.normal(35, 10, n)
education = np.random.normal(12, 3, n)
income = np.random.normal(40000, 15000, n)
experience = np.random.normal(10, 5, n)

X = np.column_stack([age, education, income, experience])

propensity_true = 1 / (1 + np.exp(-(-3 + 0.05*age + 0.2*education + 0.0001*income)))
treatment = np.random.binomial(1, propensity_true)

true_ate = 5000
outcome = 30000 + 500*age + 1000*education + 0.3*income + true_ate*treatment + np.random.normal(0, 5000, n)

# Naive comparison
naive_ate = outcome[treatment == 1].mean() - outcome[treatment == 0].mean()
print(f"Naive ATE (biased): ${naive_ate:,.2f}")
print(f"True ATE: ${true_ate:,.2f}")

# Propensity score matching
psm = PropensityScoreMatching(caliper=0.05)
psm.fit_propensity(X, treatment)
psm.match(treatment)
result = psm.estimate_ate(outcome)

print(f"\nPropensity Score Matching Results:")
print(f"Estimated ATE: ${result['ate']:,.2f}")
print(f"Standard Error: ${result['se']:,.2f}")
print(f"Number of matches: {result['n_matches']}")

Difference-in-Differences (DiD)

Two-Period Model

Difference-in-Differences Model

Yit=α+βTreati+γPostt+δ(Treati×Postt)+ϵitY_{it} = \alpha + \beta \cdot \text{Treat}_i + \gamma \cdot \text{Post}_t + \delta \cdot (\text{Treat}_i \times \text{Post}_t) + \epsilon_{it}

Here,

  • I^´Î´=DiD estimator (treatment effect)
  • TreatiTreat_i=Treatment indicator for unit i
  • PosttPost_t=Post-treatment indicator
DiD Estimator
δ^=(YˉT,postYˉT,pre)(YˉC,postYˉC,pre)\hat{\delta} = (\bar{Y}_{T,post} - \bar{Y}_{T,pre}) - (\bar{Y}_{C,post} - \bar{Y}_{C,pre})

Here,

  • I^´Iˋδ̂=Estimated treatment effect
Architecture Diagram
Y
^                    * Treatment (Post)
|                  /
|                / <-- δ (DiD estimate)
|              /
|      *------* Treatment (Pre)
|    /
|  *----------* Control (Pre and Post)
+-----------------------------------------> Time
  Pre              Post

Parallel Trends Assumption

Parallel Trends Assumption

The key assumption: absent treatment, treatment and control groups would have followed parallel trends. This is untestable but can be partially validated using pre-treatment data.

Implementation

import statsmodels.api as sm

np.random.seed(42)

n_states = 50
n_periods = 20
states = []
for state in range(n_states):
    treated = 1 if state < 25 else 0
    for time in range(n_periods):
        post = 1 if time >= 10 else 0
        true_effect = 500 if treated and post else 0
        state_effect = state * 100
        time_effect = time * 50
        outcome = 10000 + state_effect + time_effect + true_effect + np.random.normal(0, 500)
        states.append({
            'state': state, 'time': time, 'treated': treated,
            'post': post, 'employment': outcome
        })

df = pd.DataFrame(states)
df['did'] = df['treated'] * df['post']

import statsmodels.formula.api as smf
model = smf.ols('employment ~ treated + post + did', data=df).fit()
print(model.summary())
print(f"\nDiD Estimate: {model.params['did']:.2f}")
print(f"P-value: {model.pvalues['did']:.4f}")

Instrumental Variables (IV)

When to Use IV

When treatment is correlated with unobserved confounders (endogeneity):

Endogeneity

Cov(X,ϵ)0\text{Cov}(X, \epsilon) \neq 0

Here,

  • XX=Treatment variable
  • I^µÎµ=Error term

Instrument Requirements

  1. Relevance: Z is correlated with X
  2. Exclusion: Z affects Y only through X
  3. Exogeneity: Z is uncorrelated with confounders

Two-Stage Least Squares (2SLS)

First Stage

X=π0+π1Z+νX = \pi_0 + \pi_1 Z + \nu

Here,

  • ZZ=Instrument
  • I¨a^‚Ï€â‚=First-stage coefficient

Second Stage

Y=β0+β1X^+ϵY = \beta_0 + \beta_1 \hat{X} + \epsilon

Here,

  • XIˋXÌ‚=Predicted treatment from first stage
IV Estimator
β^IV=Cov(Z,Y)Cov(Z,X)\hat{\beta}_{IV} = \frac{\text{Cov}(Z, Y)}{\text{Cov}(Z, X)}

Here,

  • I^2IˋIVβ̂_IV=IV estimate of treatment effect
Architecture Diagram
Confounding:

    U (unobserved)
   / \
  v   v
  X   Y     Endogeneity: Cov(X, ε) ≠ 0

Instrumental Variable:

    U (unobserved)
   / \
  v   v
  X   Y     IV Z satisfies: Z ⊥ U, Z -> X
  ^
  |
  Z (instrument)

Regression Discontinuity Design (RDD)

Sharp RDD

Sharp RDD Treatment Assignment

Di=1(Xic)D_i = \mathbb{1}(X_i \geq c)

Here,

  • DiD_i=Treatment indicator
  • XiX_i=Running variable
  • cc=Cutoff point
RDD Estimator
τ^RDD=limxcE[YX=x]limxcE[YX=x]\hat{\tau}_{RDD} = \lim_{x \downarrow c} E[Y \mid X=x] - \lim_{x \uparrow c} E[Y \mid X=x]

Here,

  • I¨IˋRDDτ̂_RDD=Estimated treatment effect at cutoff
Architecture Diagram
Y
^         *  *  *
|       *        *  *
|     *    Treatment  *  *
|   *        *          *
|  *   *                 *
| *  *    Control
|*  *
+---------------------------+-----------> X
                     Cutoff (c)

RDD Validity

RDD is considered one of the most credible quasi-experimental methods because it mimics a randomized experiment at the cutoff. The key validity check is whether pre-treatment covariates also show discontinuities at the cutoff.

Directed Acyclic Graphs (DAGs)

Pearl's Causal Hierarchy

  1. Association: P(Y | X) — Seeing
  2. Intervention: P(Y | do(X)) — Doing
  3. Counterfactual: P(Y_x | X', Y') — Imagining

Causal Hierarchy

Each level of Pearl's causal hierarchy requires strictly more assumptions than the level below. Association requires no assumptions beyond observational data. Intervention requires causal assumptions encoded in a DAG. Counterfactuals require a fully specified structural causal model.

do-Calculus

Backdoor Adjustment Formula
P(Ydo(X))=zP(YX,z)P(z)(when backdoor criterion satisfied)P(Y \mid do(X)) = \sum_z P(Y \mid X, z)P(z) \quad \text{(when backdoor criterion satisfied)}

Here,

  • do(X)do(X)=Intervention on X
  • zz=Confounders satisfying backdoor criterion

Backdoor Criterion

A set Z satisfies the backdoor criterion relative to (X, Y) if:

  1. No node in Z is a descendant of X
  2. Z blocks every path between X and Y that contains an arrow into X

Key Takeaways

Summary: Causal Inference

  1. Correlation ≠ Causation: Always consider confounders — observational data alone cannot establish causation without additional assumptions
  2. Potential outcomes framework provides rigorous definition of causal effects through counterfactuals
  3. Propensity score matching reduces confounding by balancing covariates between treated and control groups
  4. DiD uses variation over time to control for unobserved time-invariant confounders
  5. IV addresses endogeneity using exogenous variation from instruments
  6. RDD exploits arbitrary cutoffs for quasi-experimental identification — one of the most credible quasi-experimental designs
  7. DAGs help visualize and verify causal assumptions formally
  8. No method is perfect: All require untestable assumptions; sensitivity analysis is essential

Practice Exercises

  1. Propensity Score Matching: Perform PSM on job training data and check covariate balance before/after matching
  2. DiD Analysis: Replicate the minimum wage study and test the parallel trends assumption
  3. IV Estimation: Implement IV using quarter of birth as an instrument for education. Check instrument strength
  4. RDD Design: Apply RDD to a dataset with a known cutoff. Estimate the effect at different bandwidths
  5. Discussion: When would you prefer DiD over propensity score matching? How do you validate the exclusion restriction?

Advertisement

Need Expert Data Science Help?

Get personalized tutoring, project support, or professional consulting.

Advertisement