Regression Assumptions — LINE Framework and Diagnostics



Regression Assumptions: The LINE Framework

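For OLS coefficient estimates to be unbiased, and for the usual standard errors, t-tests, and confidence intervals to be trustworthy, four key assumptions about the model errors must hold. In practice they are checked through the residuals of the fitted model.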
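For reference, the four assumptions can be written compactly for a one-predictor model. The symbols here ($\beta_0$, $\beta_1$, $\varepsilon_i$, $\sigma^2$) are notation assumed for this sketch; the lesson itself does not define them.

```latex
y_i = \beta_0 + \beta_1 x_i + \varepsilon_i,
\qquad
\varepsilon_i \mid x_i \;\overset{\text{iid}}{\sim}\; \mathcal{N}(0,\ \sigma^2)
```

Read piece by piece: the mean of $y_i$ is $\beta_0 + \beta_1 x_i$ (Linearity), the errors are drawn independently (Independence), they follow a normal distribution (Normality), and they share a single variance $\sigma^2$ (Equal variance).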

L — Linearity

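The expected value of Y is a linear function of X. More precisely, the model must be linear in its coefficients, so a curved relationship can still be handled by adding transformed terms such as X².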

import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from scipy import stats

np.random.seed(42)
n = 100
X = np.random.uniform(1, 10, n)
X_dm = sm.add_constant(X)

# Good: linear relationship
y_lin = 3 + 2*X + np.random.normal(0, 2, n)
model_lin = sm.OLS(y_lin, X_dm).fit()

# Violated: curved relationship
y_quad = 3 + 2*X + 0.5*X**2 + np.random.normal(0, 3, n)
model_quad = sm.OLS(y_quad, X_dm).fit()

fig, axes = plt.subplots(2, 2, figsize=(12, 8))
for i, (model, label) in enumerate([(model_lin, 'Linear (Assumption Met)'),
                                     (model_quad, 'Quadratic (Linearity Violated)')]):
    axes[i,0].scatter(model.fittedvalues, model.resid, alpha=0.6)
    axes[i,0].axhline(0, color='red', linestyle='--')
    axes[i,0].set_title(f'{label}: Residuals vs Fitted')
    axes[i,0].set_xlabel('Fitted Values')
    axes[i,0].set_ylabel('Residuals')
    
    stats.probplot(model.resid, dist='norm', plot=axes[i,1])
    axes[i,1].set_title(f'{label}: Q-Q Plot')

plt.tight_layout()
plt.savefig('regression_assumptions.png', dpi=150)
plt.show()

I — Independence

The errors are independent across observations. This assumption is typically violated in:

  • Time series data (autocorrelation)
  • Clustered data (students within schools)
  • Spatial data

from statsmodels.stats.stattools import durbin_watson

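# Durbin-Watson statistic ranges from 0 to 4: about 2 = no first-order autocorrelation, <2 positive, >2 negative
# It is only meaningful when rows are in time (or another natural) order; the 1.5-2.5 band below is a rule of thumb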
dw = durbin_watson(model_lin.resid)
print(f"Durbin-Watson = {dw:.4f}")
print(f"Interpretation: {'No autocorrelation' if 1.5<dw<2.5 else 'Possible autocorrelation'}")

N — Normality of Residuals

Residuals should be approximately normally distributed.

# Shapiro-Wilk test
stat_sw, p_sw = stats.shapiro(model_lin.resid)
print(f"Shapiro-Wilk: W={stat_sw:.4f}, p={p_sw:.4f}")

# Also check the Q-Q plot: for moderate n a visual check is often more informative than a test
# Normality mainly matters for small-sample inference (t-tests, p-values); OLS point estimates do not depend on it

E — Equal Variance (Homoscedasticity)

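The variance of the errors should be constant across all levels of X. In a residuals-vs-fitted plot this appears as a band of roughly even width; a funnel shape signals heteroscedasticity.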

# Breusch-Pagan test for heteroscedasticity
from statsmodels.stats.diagnostic import het_breuschpagan

bp_stat, bp_p, _, _ = het_breuschpagan(model_lin.resid, model_lin.model.exog)
print(f"Breusch-Pagan: χ²={bp_stat:.4f}, p={bp_p:.4f}")
print(f"Heteroscedasticity: {'Detected' if bp_p < 0.05 else 'Not detected'}")

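# White's test (more general: also uses squares and cross-products of the regressors)
from statsmodels.stats.diagnostic import het_white
# het_white returns (LM statistic, LM p-value, F statistic, F p-value); the first two are used here
wh_stat, wh_p, _, _ = het_white(model_lin.resid, model_lin.model.exog)
print(f"White's test: LM={wh_stat:.4f}, p={wh_p:.4f}")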

Key Takeaways

  1. Linearity: residuals vs fitted should show no pattern
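  2. Independence: use Durbin-Watson for time-ordered data; use mixed-effects models or cluster-robust SEs for grouped data
  3. Normality: matters mainly for inference in small samples; in large samples the CLT keeps t-tests and confidence intervals approximately valid
  4. Homoscedasticity: violations leave coefficient estimates unbiased but make the usual standard errors too large or too small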
  5. Violations: transform Y (log), use robust SEs, or switch to GLMs
