Regression Assumptions: The LINE Framework
For OLS estimates to be valid and inference to be correct, four key assumptions must hold.
L — Linearity
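The conditional mean of Y is a linear function of X. More precisely, the model must be linear in its coefficients, so a curved relationship can still be handled by adding transformed predictors.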
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from scipy import stats
np.random.seed(42)
n = 100
X = np.random.uniform(1, 10, n)
X_dm = sm.add_constant(X)
# Good: linear relationship
y_lin = 3 + 2*X + np.random.normal(0, 2, n)
model_lin = sm.OLS(y_lin, X_dm).fit()
# Violated: curved relationship
y_quad = 3 + 2*X + 0.5*X**2 + np.random.normal(0, 3, n)
model_quad = sm.OLS(y_quad, X_dm).fit()
fig, axes = plt.subplots(2, 2, figsize=(12, 8))
# Row i holds the diagnostics for one model: residuals vs fitted (left), Q-Q plot (right)
for i, (model, label) in enumerate([(model_lin, 'Linear (Assumption Met)'),
                                    (model_quad, 'Quadratic (Linearity Violated)')]):
    axes[i,0].scatter(model.fittedvalues, model.resid, alpha=0.6)
    axes[i,0].axhline(0, color='red', linestyle='--')
    axes[i,0].set_title(f'{label}: Residuals vs Fitted')
    axes[i,0].set_xlabel('Fitted Values')
    axes[i,0].set_ylabel('Residuals')
    stats.probplot(model.resid, dist='norm', plot=axes[i,1])
    axes[i,1].set_title(f'{label}: Q-Q Plot')
plt.tight_layout()
plt.savefig('regression_assumptions.png', dpi=150)
plt.show()
I — Independence
Residuals are independent across observations. Violated in:
- Time series data (autocorrelation)
- Clustered data (students within schools)
- Spatial data
from statsmodels.stats.stattools import durbin_watson
# Durbin-Watson statistic: 2 = no autocorrelation, <2 positive, >2 negative
dw = durbin_watson(model_lin.resid)
print(f"Durbin-Watson = {dw:.4f}")
print(f"Interpretation: {'No autocorrelation' if 1.5<dw<2.5 else 'Possible autocorrelation'}")
N — Normality of Residuals
Residuals should be approximately normally distributed.
# Shapiro-Wilk test
stat_sw, p_sw = stats.shapiro(model_lin.resid)
print(f"Shapiro-Wilk: W={stat_sw:.4f}, p={p_sw:.4f}")
# Also check Q-Q plot (visual is often more informative for moderate n)
# Normality mainly matters for inference (t-tests, p-values) — less for point estimates
E — Equal Variance (Homoscedasticity)
The variance of residuals should be constant across all levels of X.
# Breusch-Pagan test for heteroscedasticity
from statsmodels.stats.diagnostic import het_breuschpagan
bp_stat, bp_p, _, _ = het_breuschpagan(model_lin.resid, model_lin.model.exog)
print(f"Breusch-Pagan: χ²={bp_stat:.4f}, p={bp_p:.4f}")
print(f"Heteroscedasticity: {'Detected' if bp_p < 0.05 else 'Not detected'}")
# White's test (more general)
from statsmodels.stats.diagnostic import het_white
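# het_white returns (LM statistic, LM p-value, F statistic, F p-value);
# the first two values are the chi-squared (LM) version of the test
wh_stat, wh_p, _, _ = het_white(model_lin.resid, model_lin.model.exog)
print(f"White's test: χ²={wh_stat:.4f}, p={wh_p:.4f}")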
Key Takeaways
- Linearity: residuals vs fitted should show no pattern
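- Independence: use Durbin-Watson for time-ordered data; for grouped data, use a model that accounts for clustering (mixed-effects models or cluster-robust SEs)
- Normality: matters mainly for inference; with large samples, tests and intervals remain approximately valid via the CLT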
- Homoscedasticity: violations leave coefficient estimates unbiased but distort standard errors (too large or too small), so tests and confidence intervals become unreliable
- Violations: transform Y (log), use robust SEs, or switch to GLMs