Causal Inference

Overview

Causal inference is the process of determining whether one variable truly causes a change in another, as opposed to merely being correlated. While machine learning excels at prediction, causal inference answers the fundamental question: "What would happen if we changed something?" This field draws on potential outcomes (Rubin), structural causal models (Pearl), and experimental design to establish causal claims from both experimental and observational data.

Correlation vs Causation

The Fundamental Problem

Fundamental Principle

\text{Correlation} \neq \text{Causation}

Three conditions for causation:

Temporal precedence: Cause precedes effect
Covariation: Cause and effect are related
No confounders: The relationship isn't explained by a third variable

Confounding Variables

Architecture Diagram

        Confounder (Z)
       /             \
      v               v
  Treatment (X) --> Outcome (Y)

Example: Ice cream sales (X) and drowning deaths (Y)
Both caused by temperature (Z)

Types of Relationships

Architecture Diagram

Causal:        X --> Y     (X causes Y)
Fork:          X <-- Z --> Y  (Z confounds X and Y)
Collider:      X --> Z <-- Y  (Z is collider, conditioning biases)
Chain:         X --> Z --> Y  (Z mediates X -> Y)

ℹ️ Collider Bias

Conditioning on a collider (a variable caused by both X and Y) can create a spurious association between X and Y. This is a common pitfall in observational studies where researchers control for too many variables.

Potential Outcomes Framework

Rubin Causal Model

For individual $i$ , define:

$Y_i(1)$ = outcome if treated (potential outcome under treatment)
$Y_i(0)$ = outcome if not treated (potential outcome under control)

DfIndividual Treatment Effect (ITE)

The individual treatment effect is the difference between potential outcomes under treatment and control. It is the fundamental quantity of interest in causal inference.

Individual Treatment Effect

\tau_i = Y_i(1) - Y_i(0)

Here,

$\tau_i$ =Treatment effect for individual i
$Y_i(1)$ =Potential outcome under treatment
$Y_i(0)$ =Potential outcome under control

Fundamental Problem of Causal Inference

We can only observe ONE potential outcome for each individual:

Observed Outcome

Y_i^{obs} = D_i \cdot Y_i(1) + (1 - D_i) \cdot Y_i(0)

Here,

$D_i$ =Treatment assignment (0 or 1)

ℹ️ The Fundamental Problem

Because we never observe both $Y_i(1)$ and $Y_i(0)$ for the same individual, causal effects are inherently counterfactual. All causal inference methods are strategies for estimating what we cannot directly observe.

Average Treatment Effect (ATE)

DfAverage Treatment Effect (ATE)

The ATE is the expected difference in outcomes between treatment and control across the entire population. It answers: "On average, what is the effect of treatment?"

Average Treatment Effect on the Treated (ATT)

ATT

ATT = E[Y(1) - Y(0) \mid D = 1]

Here,

$ATT$ =Average treatment effect on the treated

Conditional Average Treatment Effect (CATE)

CATE

CATE(x) = E[Y(1) - Y(0) \mid X = x]

Here,

$CATE(x)$ =Conditional average treatment effect at x

SUTVA (Stable Unit Treatment Value Assumption)

No interference: One unit's treatment doesn't affect another's outcome
Consistency: Only one version of each treatment level

💡 SUTVA in Practice

SUTVA is often violated in social networks (where your treatment affects your friends) and in marketplace settings (where one user's treatment affects another's outcome). When SUTVA fails, standard methods produce biased estimates.

Propensity Score Matching

Propensity Score

DfPropensity Score

The propensity score is the probability of receiving treatment given covariates. It summarizes high-dimensional covariate information into a single scalar.

Propensity Score

e(X) = P(D = 1 \mid X)

Here,

$e(X)$ =Propensity score
$X$ =Covariates
$D$ =Treatment indicator

Key Property

ThPropensity Score Theorem (Rosenbaum & Rubin, 1983)

If treatment assignment is unconfounded given $X$ :

Y(1), Y(0) \perp D \mid X \implies Y(1), Y(0) \perp D \mid e(X)

This reduces the dimensionality problem from balancing all covariates to balancing just the propensity score.

Matching Methods

Architecture Diagram

1. Nearest Neighbor Matching
   For each treated unit, find closest control by propensity score
   |T1|---closest--->|C3|
   |T2|---closest--->|C1|
   |T3|---closest--->|C3| (can be reused)

2. Caliper Matching
   Same as nearest neighbor, but reject matches beyond caliper
   |T1|---distance=0.02--->|C3| (within caliper 0.05)
   |T2|---distance=0.08--->|C1| (reject, too far)

3. Kernel Matching
   Each treated unit matched to weighted average of all controls
   |T1| <--weighted sum--> |C1|, |C2|, |C3|, |C4|, |C5|

4. Stratification
   Divide into strata by propensity score, estimate within each
   Stratum 1: [0.0-0.2] -> ATE_1
   Stratum 2: [0.2-0.4] -> ATE_2
   ...
   ATE = weighted average of ATE_k

Implementation

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler
from scipy import stats

class PropensityScoreMatching:
    """Propensity Score Matching for causal inference."""
    
    def __init__(self, caliper=0.05, n_neighbors=1):
        self.caliper = caliper
        self.n_neighbors = n_neighbors
        self.propensity_model = LogisticRegression(max_iter=1000)
    
    def fit_propensity(self, X, treatment):
        """Estimate propensity scores."""
        scaler = StandardScaler()
        X_scaled = scaler.fit_transform(X)
        self.propensity_model.fit(X_scaled, treatment)
        self.propensity_scores = self.propensity_model.predict_proba(X_scaled)[:, 1]
        return self.propensity_scores
    
    def match(self, treatment, n_neighbors=None):
        """Perform matching."""
        if n_neighbors is None:
            n_neighbors = self.n_neighbors
        
        treated_idx = np.where(treatment == 1)[0]
        control_idx = np.where(treatment == 0)[0]
        
        # Fit nearest neighbors on propensity scores
        treated_scores = self.propensity_scores[treated_idx].reshape(-1, 1)
        control_scores = self.propensity_scores[control_idx].reshape(-1, 1)
        
        nn = NearestNeighbors(n_neighbors=n_neighbors)
        nn.fit(control_scores)
        
        distances, indices = nn.kneighbors(treated_scores)
        
        # Apply caliper
        matches = []
        for i, (t_idx, dist) in enumerate(zip(treated_idx, distances[:, 0])):
            if dist <= self.caliper:
                c_idx = control_idx[indices[i, 0]]
                matches.append((t_idx, c_idx, dist))
        
        self.matches_ = matches
        return matches
    
    def estimate_ate(self, outcome):
        """Estimate ATE from matched pairs."""
        if not hasattr(self, 'matches_'):
            raise ValueError("Run match() first")
        
        treatment_effects = []
        for t_idx, c_idx, _ in self.matches_:
            treatment_effects.append(outcome[t_idx] - outcome[c_idx])
        
        treatment_effects = np.array(treatment_effects)
        
        return {
            'ate': np.mean(treatment_effects),
            'se': np.std(treatment_effects, ddof=1) / np.sqrt(len(treatment_effects)),
            'ci_lower': np.mean(treatment_effects) - 1.96 * np.std(treatment_effects, ddof=1) / np.sqrt(len(treatment_effects)),
            'ci_upper': np.mean(treatment_effects) + 1.96 * np.std(treatment_effects, ddof=1) / np.sqrt(len(treatment_effects)),
            'n_matches': len(treatment_effects)
        }
    
    def check_balance(self, X, treatment):
        """Check covariate balance before and after matching."""
        treated = X[treatment == 1]
        control = X[treatment == 0]
        
        # Before matching
        balance_before = []
        for col in range(X.shape[1]):
            std_diff = (treated[:, col].mean() - control[:, col].mean()) / np.sqrt(
                (treated[:, col].var() + control[:, col].var()) / 2
            )
            balance_before.append(abs(std_diff))
        
        # After matching
        balance_after = []
        if hasattr(self, 'matches_'):
            matched_treated = np.array([m[0] for m in self.matches_])
            matched_control = np.array([m[1] for m in self.matches_])
            
            for col in range(X.shape[1]):
                std_diff = (X[matched_treated, col].mean() - X[matched_control, col].mean()) / np.sqrt(
                    (X[matched_treated, col].var() + X[matched_control, col].var()) / 2
                )
                balance_after.append(abs(std_diff))
        
        return {
            'before': np.array(balance_before),
            'after': np.array(balance_after) if balance_after else None,
            'max_before': max(balance_before),
            'max_after': max(balance_after) if balance_after else None
        }

# Example: Job training program effect
np.random.seed(42)
n = 2000

# Covariates
age = np.random.normal(35, 10, n)
education = np.random.normal(12, 3, n)
income = np.random.normal(40000, 15000, n)
experience = np.random.normal(10, 5, n)

X = np.column_stack([age, education, income, experience])

# Treatment assignment (confounded by education and income)
propensity_true = 1 / (1 + np.exp(-(- 3 + 0.05*age + 0.2*education + 0.0001*income)))
treatment = np.random.binomial(1, propensity_true)

# Outcome (true ATE = 5000)
true_ate = 5000
outcome = 30000 + 500 * age + 1000 * education + 0.3 * income + true_ate * treatment + np.random.normal(0, 5000, n)

# Naive comparison
naive_ate = outcome[treatment == 1].mean() - outcome[treatment == 0].mean()
print(f"Naive ATE (biased): ${naive_ate:,.2f}")
print(f"True ATE: ${true_ate:,.2f}")

# Propensity score matching
psm = PropensityScoreMatching(caliper=0.05)
psm.fit_propensity(X, treatment)
psm.match(treatment)
result = psm.estimate_ate(outcome)

print(f"\nPropensity Score Matching Results:")
print(f"Estimated ATE: ${result['ate']:,.2f}")
print(f"Standard Error: ${result['se']:,.2f}")
print(f"95% CI: (${result['ci_lower']:,.2f}, ${result['ci_upper']:,.2f})")
print(f"Number of matches: {result['n_matches']}")

# Balance check
balance = psm.check_balance(X, treatment)
print(f"\nCovariate Balance (standardized mean differences):")
print(f"Before matching (max): {balance['max_before']:.4f}")
print(f"After matching (max): {balance['max_after']:.4f}")

Difference-in-Differences (DiD)

Concept

DiD compares the change in outcomes over time between a treatment group and a control group, controlling for time-invariant confounders.

Two-Period Model

Difference-in-Differences Model

Y_{it} = \alpha + \beta \cdot \text{Treat}_i + \gamma \cdot \text{Post}_t + \delta \cdot (\text{Treat}_i \times \text{Post}_t) + \epsilon_{it}

Here,

$\delta$ =DiD estimator (treatment effect)
$\text{Treat}_i$ =Treatment indicator for unit i
$\text{Post}_t$ =Post-treatment indicator

The DiD estimator is:

Architecture Diagram

Y
^
|                    * Treatment (Post)
|                  /
|                / <-- δ (DiD estimate)
|              /
|      *------* Treatment (Pre)
|    /
|   /
|  *----------* Control (Pre and Post)
| /
+-----------------------------------------> Time
  Pre              Post

Parallel Trends Assumption

ℹ️ Parallel Trends Assumption

The key assumption: absent treatment, treatment and control groups would have followed parallel trends. This is untestable (we never observe the treated group's counterfactual), but can be partially validated using pre-treatment data.

E[Y_0(t) \mid D=1] - E[Y_0(t) \mid D=0] = \text{constant over time}

Staggered Adoption DiD

When units adopt treatment at different times:

Staggered DiD

Y_{it} = \alpha_i + \lambda_t + \delta \cdot D_{it} + \epsilon_{it}

Here,

$\alpha_i$ =Unit fixed effects
$\lambda_t$ =Time fixed effects
$D_{it}$ =Treatment indicator

Implementation

import numpy as np
import pandas as pd
import statsmodels.api as sm
from linearmodels.panel import PanelOLS

class DifferenceInDifferences:
    """Difference-in-Differences estimator."""
    
    def __init__(self):
        self.results = None
    
    def fit(self, df, outcome, treatment, time, unit_id, covariates=None):
        """
        Estimate DiD effect.
        
        Args:
            df: DataFrame with panel data
            outcome: Name of outcome variable
            treatment: Name of treatment indicator (unit-level)
            time: Name of time variable
            unit_id: Name of unit identifier
            covariates: List of covariate names
        """
        # Create interaction term
        df = df.copy()
        df['post'] = (df[time] == df[time].max()).astype(int)
        df['did'] = df[treatment] * df['post']
        
        # Build formula
        formula = f'{outcome} ~ 1 + {treatment} + post + did'
        if covariates:
            formula += ' + ' + ' + '.join(covariates)
        
        # Panel regression with fixed effects
        df = df.set_index([unit_id, time])
        
        model = PanelOLS.from_formula(
            formula + ' + EntityEffects',
            data=df
        )
        self.results = model.fit(cov_type='clustered', cluster_entity=True)
        
        return self
    
    def estimate_effect(self):
        """Extract DiD estimate."""
        if self.results is None:
            raise ValueError("Model not fitted")
        
        did_coef = self.results.params['did']
        did_se = self.results.std_errors['did']
        did_pval = self.results.pvalues['did']
        
        return {
            'estimate': did_coef,
            'std_error': did_se,
            'p_value': did_pval,
            'ci_lower': did_coef - 1.96 * did_se,
            'ci_upper': did_coef + 1.96 * did_se,
            'summary': self.results.summary
        }

# Simulate DiD example: Minimum wage increase
np.random.seed(42)

# Create panel data
n_states = 50
n_periods = 20

states = []
for state in range(n_states):
    treated = 1 if state < 25 else 0  # First 25 states adopt treatment
    for time in range(n_periods):
        post = 1 if time >= 10 else 0
        
        # True effect
        true_effect = 500 if treated and post else 0
        
        # State and time effects
        state_effect = state * 100
        time_effect = time * 50
        
        # Outcome
        outcome = 10000 + state_effect + time_effect + true_effect + np.random.normal(0, 500)
        
        states.append({
            'state': state,
            'time': time,
            'treated': treated,
            'post': post,
            'employment': outcome
        })

df = pd.DataFrame(states)

# Estimate DiD
did = DifferenceInDifferences()
did.fit(df, outcome='employment', treatment='treated', time='time', unit_id='state')
result = did.estimate_effect()

print("Difference-in-Differences Results")
print(f"DiD Estimate: {result['estimate']:.2f}")
print(f"Standard Error: {result['std_error']:.2f}")
print(f"P-value: {result['p_value']:.4f}")
print(f"95% CI: ({result['ci_lower']:.2f}, {result['ci_upper']:.2f})")

Instrumental Variables (IV)

When to Use IV

When treatment is correlated with unobserved confounders (endogeneity):

Endogeneity

\text{Cov}(X, \epsilon) \neq 0

Here,

$X$ =Treatment variable
$\epsilon$ =Error term

Instrument Requirements

Relevance: $Z$ is correlated with $X$

\text{Cov}(Z, X) \neq 0

Exclusion: $Z$ affects $Y$ only through $X$

Z \perp \epsilon

Exogeneity: $Z$ is uncorrelated with confounders

Two-Stage Least Squares (2SLS)

DfTwo-Stage Least Squares

2SLS addresses endogeneity by using an instrument to isolate exogenous variation in the treatment. The first stage predicts treatment using the instrument; the second stage uses predicted treatment to estimate the causal effect.

Stage 1: Regress treatment on instrument:

First Stage

X = \pi_0 + \pi_1 Z + \nu

Here,

$Z$ =Instrument
$\pi_1$ =First-stage coefficient

Stage 2: Regress outcome on predicted treatment:

Second Stage

Y = \beta_0 + \beta_1 \hat{X} + \epsilon

Here,

$\hat{X}$ =Predicted treatment from first stage

The IV estimator is:

Visual

Architecture Diagram

Confounding:
    
    U (unobserved)
   / \
  v   v
  X   Y     Endogeneity: Cov(X, ε) ≠ 0

Instrumental Variable:

    U (unobserved)
   / \
  v   v
  X   Y     IV Z satisfies: Z ⊥ U, Z → X
  ^
  |
  Z (instrument)

Implementation

import numpy as np
import pandas as pd
import statsmodels.api as sm
from linearmodels.iv import IV2SLS

class InstrumentalVariables:
    """Instrumental Variables estimation."""
    
    def __init__(self):
        self.results = None
    
    def fit(self, df, outcome, treatment, instrument, covariates=None):
        """
        Estimate causal effect using IV/2SLS.
        
        Args:
            df: DataFrame
            outcome: Dependent variable name
            treatment: Endogenous variable name
            instrument: Instrument variable name
            covariates: List of exogenous covariate names
        """
        # Build formula
        formula = f'{outcome} ~ 1 + [{treatment} ~ {instrument}]'
        if covariates:
            formula += ' + ' + ' + '.join(covariates)
        
        model = IV2SLS.from_formula(formula, data=df).fit()
        self.results = model
        
        return self
    
    def test_instrument(self, df, treatment, instrument):
        """Test instrument relevance (first stage F-statistic)."""
        X = sm.add_constant(df[instrument].values.reshape(-1, 1))
        model = sm.OLS(df[treatment], X).fit()
        
        f_stat = model.fvalue
        relevant = f_stat > 10  # Rule of thumb
        
        return {
            'f_statistic': f_stat,
            'relevant': relevant,
            'first_stage': model.summary()
        }
    
    def estimate_effect(self):
        """Extract IV estimate."""
        if self.results is None:
            raise ValueError("Model not fitted")
        
        return {
            'estimate': self.results.params.iloc[-1],  # Treatment effect
            'std_error': self.results.std_errors.iloc[-1],
            'p_value': self.results.pvalues.iloc[-1],
            'ci_lower': self.results.conf_int().iloc[-1, 0],
            'ci_upper': self.results.conf_int().iloc[-1, 1],
            'summary': self.results.summary
        }

# Simulate: Returns to education with ability bias
np.random.seed(42)
n = 5000

# Unobserved ability (confounder)
ability = np.random.normal(0, 1, n)

# Instrument: Quarter of birth (historically used)
quarter = np.random.choice([1, 2, 3, 4], n)
Z = (quarter <= 2).astype(int)  # Born in first half of year

# Education (affected by instrument, correlated with ability)
education = 10 + 0.5 * Z + 0.8 * ability + np.random.normal(0, 1, n)

# Outcome: wages
# True return to education: 2000 per year
true_returns = 2000
wage = 20000 + true_returns * education + 5000 * ability + np.random.normal(0, 5000, n)

# Create DataFrame
df = pd.DataFrame({
    'wage': wage,
    'education': education,
    'instrument': Z,
    'ability': ability
})

# OLS (biased)
ols_model = sm.OLS(wage, sm.add_constant(df[['education', 'ability']])).fit()
print("OLS (with ability - true model):")
print(f"Education coefficient: {ols_model.params['education']:.2f}")
print(f"True value: {true_returns}")

# OLS (without ability - biased)
ols_biased = sm.OLS(wage, sm.add_constant(df[['education']])).fit()
print("\nOLS (omitted variable bias):")
print(f"Education coefficient: {ols_biased.params['education']:.2f}")

# IV estimation
iv = InstrumentalVariables()
iv.fit(df, outcome='wage', treatment='education', instrument='instrument')
result = iv.estimate_effect()

print("\nIV (2SLS) Results:")
print(f"Education coefficient: {result['estimate']:.2f}")
print(f"Standard Error: {result['std_error']:.2f}")
print(f"P-value: {result['p_value']:.4f}")
print(f"95% CI: ({result['ci_lower']:.2f}, {result['ci_upper']:.2f})")

# Test instrument relevance
relevance = iv.test_instrument(df, 'education', 'instrument')
print(f"\nInstrument Relevance (F-stat): {relevance['f_statistic']:.2f}")
print(f"Strong instrument: {relevance['relevant']}")

Regression Discontinuity Design (RDD)

Sharp RDD

Treatment is assigned based on a cutoff rule:

DfSharp RDD

Treatment is deterministically assigned based on whether a running variable exceeds a cutoff. The causal effect is estimated at the cutoff where treatment assignment changes discontinuously.

Sharp RDD Treatment Assignment

D_i = \mathbb{1}(X_i \geq c)

Here,

$D_i$ =Treatment indicator
$X_i$ =Running variable
$c$ =Cutoff point

Fuzzy RDD

Treatment probability jumps at cutoff but isn't deterministic:

Fuzzy RDD

P(D=1 \mid X=c^+) \neq P(D=1 \mid X=c^-)

Here,

$c^+$ =Just above cutoff
$c^-$ =Just below cutoff

Architecture Diagram

Y
^
|         *  *  *
|       *        *  *
|     *    Treatment  *  *
|   *        *          *
|  *   *                 *
| *  *    Control
|*  *
+---------------------------+-----------> X
                     Cutoff (c)

💡 RDD Validity

RDD is considered one of the most credible quasi-experimental methods because it mimics a randomized experiment at the cutoff. The key validity check is whether pre-treatment covariates also show discontinuities at the cutoff.

Implementation

import numpy as np
import pandas as pd
import statsmodels.api as sm

class RegressionDiscontinuity:
    """Regression Discontinuity Design estimator."""
    
    def __init__(self, bandwidth=None):
        self.bandwidth = bandwidth
    
    def estimate(self, X, Y, cutoff, bandwidth=None):
        """
        Estimate causal effect at discontinuity.
        
        Args:
            X: Running variable
            Y: Outcome
            cutoff: Cutoff point
            bandwidth: Bandwidth for local estimation
        """
        if bandwidth is None:
            bandwidth = self.bandwidth or self._optimal_bandwidth(X, Y, cutoff)
        
        # Select observations near cutoff
        near = np.abs(X - cutoff) <= bandwidth
        
        # Create treatment indicator
        D = (X >= cutoff).astype(int)
        
        # Local linear regression
        X_centered = X - cutoff
        X_near = X_centered[near]
        Y_near = Y[near]
        D_near = D[near]
        
        # Add polynomial terms
        Z = np.column_stack([
            np.ones(len(X_near)),
            D_near,
            X_near,
            D_near * X_near
        ])
        
        model = sm.OLS(Y_near, Z).fit()
        
        return {
            'effect': model.params[1],
            'std_error': model.bse[1],
            'p_value': model.pvalues[1],
            'bandwidth': bandwidth,
            'n_obs': len(X_near),
            'summary': model.summary
        }
    
    def _optimal_bandwidth(self, X, Y, cutoff):
        """Simple bandwidth selection (Imbens-Kalyanaraman)."""
        n = len(X)
        h = 1.06 * np.std(X) * n**(-1/5)
        return h

# Simulate: Class size and test scores
np.random.seed(42)
n = 3000

# Running variable: class size
class_size = np.random.uniform(25, 40, n)

# Cutoff: policy changes at class_size = 30
cutoff = 30
treatment = (class_size >= cutoff).astype(int)

# True effect: smaller classes improve scores
# At cutoff, there's a jump
true_effect = 5
Y = 70 - 0.5 * class_size + true_effect * treatment + np.random.normal(0, 3, n)

# Estimate RDD
rdd = RegressionDiscontinuity()
result = rdd.estimate(class_size, Y, cutoff)

print("Regression Discontinuity Results")
print(f"Effect at cutoff: {result['effect']:.2f}")
print(f"Standard Error: {result['std_error']:.2f}")
print(f"P-value: {result['p_value']:.4f}")
print(f"Bandwidth: {result['bandwidth']:.2f}")
print(f"Observations used: {result['n_obs']}")

Directed Acyclic Graphs (DAGs)

Pearl's Causal Hierarchy

Association: $P(Y \mid X)$ — Seeing
Intervention: $P(Y \mid do(X))$ — Doing
Counterfactual: $P(Y_x \mid X', Y')$ — Imagining

ℹ️ Causal Hierarchy

Each level of Pearl's causal hierarchy requires strictly more assumptions than the level below. Association requires no assumptions beyond observational data. Intervention requires causal assumptions encoded in a DAG. Counterfactuals require a fully specified structural causal model.

do-Calculus

Backdoor Criterion

A set $Z$ satisfies the backdoor criterion relative to $(X, Y)$ if:

No node in $Z$ is a descendant of $X$
$Z$ blocks every path between $X$ and $Y$ that contains an arrow into $X$

Front-Door Criterion

When no set satisfies backdoor criterion, use front-door:

Find $M$ that mediates all causal paths from $X$ to $Y$
$X$ blocks all backdoor paths from $X$ to $M$
No unblocked backdoor paths from $M$ to $Y$

Key Takeaways

📋Summary: Causal Inference

Correlation ≠ Causation: Always consider confounders — observational data alone cannot establish causation without additional assumptions
Potential outcomes framework provides rigorous definition of causal effects through counterfactuals
Propensity score matching reduces confounding by balancing covariates between treated and control groups
DiD uses variation over time to control for unobserved time-invariant confounders
IV addresses endogeneity using exogenous variation from instruments
RDD exploits arbitrary cutoffs for quasi-experimental identification — one of the most credible quasi-experimental designs
DAGs help visualize and verify causal assumptions formally
No method is perfect: All require untestable assumptions; sensitivity analysis is essential

Practice Exercises

Exercise 1: Propensity Score Matching

Using the job training data example, perform PSM and check:

Covariate balance before and after matching
Sensitivity to caliper width
Overlap in propensity score distributions

Exercise 2: DiD Analysis

Replicate the minimum wage study and test the parallel trends assumption using pre-treatment data.

Exercise 3: IV Estimation

Implement IV using quarter of birth as an instrument for education. Check instrument strength.

Exercise 4: RDD Design

Apply RDD to a dataset with a known cutoff. Estimate the effect at different bandwidths.

Discussion Questions

When would you prefer DiD over propensity score matching?
How do you validate the exclusion restriction for an IV?
What are the limitations of each causal inference method?

Causal Inference

Causal Inference

Overview

Correlation vs Causation

The Fundamental Problem

Fundamental Principle

Confounding Variables

Types of Relationships

Potential Outcomes Framework

Rubin Causal Model

DfIndividual Treatment Effect (ITE)

Individual Treatment Effect

Fundamental Problem of Causal Inference

Observed Outcome

Average Treatment Effect (ATE)

DfAverage Treatment Effect (ATE)

Average Treatment Effect on the Treated (ATT)

ATT

Conditional Average Treatment Effect (CATE)

CATE

SUTVA (Stable Unit Treatment Value Assumption)

Propensity Score Matching

Propensity Score

DfPropensity Score

Propensity Score

Key Property

ThPropensity Score Theorem (Rosenbaum & Rubin, 1983)

Matching Methods

Implementation

Difference-in-Differences (DiD)

Concept

Two-Period Model

Difference-in-Differences Model

Parallel Trends Assumption

Staggered Adoption DiD

Staggered DiD

Implementation

Instrumental Variables (IV)

When to Use IV

Endogeneity

Instrument Requirements

Two-Stage Least Squares (2SLS)

DfTwo-Stage Least Squares

First Stage

Second Stage

Visual

Implementation

Regression Discontinuity Design (RDD)

Sharp RDD

DfSharp RDD

Sharp RDD Treatment Assignment

Fuzzy RDD

Fuzzy RDD

Implementation

Directed Acyclic Graphs (DAGs)

Pearl's Causal Hierarchy

do-Calculus

Backdoor Criterion

Front-Door Criterion

Key Takeaways

📋Summary: Causal Inference

Practice Exercises

Exercise 1: Propensity Score Matching

Exercise 2: DiD Analysis

Exercise 3: IV Estimation

Exercise 4: RDD Design

Discussion Questions

Need Expert Data Science Help?