Regression Assumptions: The LINE Framework


Four Assumptions Every Regression Must Meet

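The LINE framework (Linearity, Independence, Normality, Equal Variance) names the four assumptions under which OLS estimates are valid and inference is trustworthy. The consequences of a violation differ by assumption: a misspecified functional form biases the coefficients; dependent or heteroscedastic errors typically leave the coefficients unbiased but inefficient and make the usual standard errors unreliable; non-normal errors mainly affect tests and confidence intervals in small samples.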

Policy Evaluation — Ensure causal estimates from regression models are credible
Financial Modeling — Validate assumptions before using regression for risk assessment
Scientific Research — Meet peer-review standards by demonstrating assumption compliance

Check the assumptions before trusting the coefficients the model produces.

For OLS estimates to be valid and inference to be correct, four key assumptions must hold.
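
Stated formally for a regression with a single predictor, the four assumptions can be sketched as below. The notation (β₀, β₁, εᵢ, σ²) is introduced here for illustration and does not come from the text above. Independence is written as zero covariance between errors, which is the weaker condition that the usual OLS standard errors rely on.

```latex
\begin{align*}
y_i &= \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad i = 1, \dots, n \\
\text{L:}\quad & \mathbb{E}[\varepsilon_i \mid x_i] = 0 \\
\text{I:}\quad & \operatorname{Cov}(\varepsilon_i, \varepsilon_j \mid x) = 0 \quad \text{for } i \neq j \\
\text{N:}\quad & \varepsilon_i \mid x_i \sim \mathcal{N}(0, \sigma^2) \\
\text{E:}\quad & \operatorname{Var}(\varepsilon_i \mid x_i) = \sigma^2 \quad \text{for all } i
\end{align*}
```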

L — Linearity

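The expected value of Y is a linear function of X: on average, Y changes by a constant amount for each unit change in X. A curved pattern in a residuals-versus-fitted plot signals a violation.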

import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from scipy import stats

np.random.seed(42)
n = 100
X = np.random.uniform(1, 10, n)
X_dm = sm.add_constant(X)

# Good: linear relationship
y_lin = 3 + 2*X + np.random.normal(0, 2, n)
model_lin = sm.OLS(y_lin, X_dm).fit()

# Violated: curved relationship
y_quad = 3 + 2*X + 0.5*X**2 + np.random.normal(0, 3, n)
model_quad = sm.OLS(y_quad, X_dm).fit()

fig, axes = plt.subplots(2, 2, figsize=(12, 8))
for i, (model, label) in enumerate([(model_lin, 'Linear (Assumption Met)'),
                                     (model_quad, 'Quadratic (Linearity Violated)')]):
    axes[i,0].scatter(model.fittedvalues, model.resid, alpha=0.6)
    axes[i,0].axhline(0, color='red', linestyle='--')
    axes[i,0].set_title(f'{label}: Residuals vs Fitted')
    axes[i,0].set_xlabel('Fitted Values')
    axes[i,0].set_ylabel('Residuals')
    
    stats.probplot(model.resid, dist='norm', plot=axes[i,1])
    axes[i,1].set_title(f'{label}: Q-Q Plot')

plt.tight_layout()
plt.savefig('regression_assumptions.png', dpi=150)
plt.show()

I — Independence

The errors are independent across observations. This assumption is typically violated in:

Time series data (autocorrelation)
Clustered data (students within schools)
Spatial data

from statsmodels.stats.stattools import durbin_watson

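# Durbin-Watson statistic ranges from 0 to 4: near 2 = no first-order autocorrelation,
# toward 0 = positive, toward 4 = negative. It depends on row order, so it is only
# meaningful when observations have a natural sequence (e.g. time).
dw = durbin_watson(model_lin.resid)
print(f"Durbin-Watson = {dw:.4f}")
# The 1.5-2.5 band is a common rule of thumb, not a formal test
print(f"Interpretation: {'No strong sign of autocorrelation' if 1.5 < dw < 2.5 else 'Possible autocorrelation'}")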

N — Normality of Residuals

Residuals should be approximately normally distributed.

# Shapiro-Wilk test
stat_sw, p_sw = stats.shapiro(model_lin.resid)
print(f"Shapiro-Wilk: W={stat_sw:.4f}, p={p_sw:.4f}")

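# Also check the Q-Q plot: with small n the test has little power, and with large n
# it flags deviations too small to matter, so the visual check is often more informative
# Normality mainly matters for small-sample inference (t-tests, p-values, intervals);
# OLS point estimates remain unbiased without it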

E — Equal Variance (Homoscedasticity)

The variance of residuals should be constant across all levels of X.

# Breusch-Pagan test for heteroscedasticity
from statsmodels.stats.diagnostic import het_breuschpagan

bp_stat, bp_p, _, _ = het_breuschpagan(model_lin.resid, model_lin.model.exog)
print(f"Breusch-Pagan: χ²={bp_stat:.4f}, p={bp_p:.4f}")
print(f"Heteroscedasticity: {'Detected' if bp_p < 0.05 else 'Not detected'}")

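# White's test (more general: also picks up nonlinear forms of heteroscedasticity)
from statsmodels.stats.diagnostic import het_white
# het_white returns (LM statistic, LM p-value, F statistic, F p-value)
wh_stat, wh_p, _, _ = het_white(model_lin.resid, model_lin.model.exog)
print(f"White's test: LM={wh_stat:.4f}, p={wh_p:.4f}")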
