Multicollinearity
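Multicollinearity occurs when two or more predictors are highly correlated with each other, or more generally when one predictor is nearly a linear combination of the others. It does not bias OLS estimates, but it inflates their standard errors, so individual coefficients become imprecise and sensitive to small changes in the data.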
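For reference, the size of this inflation is usually quantified by the variance inflation factor. The formula below assumes the standard OLS setup; the notation is introduced here for illustration: $R_j^2$ is the R² from regressing predictor $x_j$ on all the other predictors, and $\sigma^2$ is the error variance.

$$
\mathrm{VIF}_j = \frac{1}{1 - R_j^2},
\qquad
\operatorname{Var}(\hat\beta_j) = \frac{\sigma^2}{\sum_i (x_{ij} - \bar{x}_j)^2} \cdot \mathrm{VIF}_j
$$

For example, $R_j^2 = 0.9$ gives $\mathrm{VIF}_j = 10$, so the standard error of $\hat\beta_j$ is about $\sqrt{10} \approx 3.2$ times larger than it would be if $x_j$ were uncorrelated with the other predictors.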
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
import matplotlib.pyplot as plt
import seaborn as sns
np.random.seed(42)
n = 100
# Create correlated predictors
z = np.random.normal(0, 1, n)
x1 = z + np.random.normal(0, 0.3, n) # strongly correlated with z
x2 = z + np.random.normal(0, 0.3, n) # also correlated with z
x3 = np.random.normal(0, 1, n) # independent
y = 2*x1 + 1.5*x3 + np.random.normal(0, 1, n)
X = sm.add_constant(pd.DataFrame({'x1':x1,'x2':x2,'x3':x3}))
# Detect multicollinearity: Variance Inflation Factor
vif_data = pd.DataFrame()
vif_data['Feature'] = ['x1','x2','x3']
# Column 0 of X is the constant, so predictor i sits at column i+1
vif_data['VIF'] = [variance_inflation_factor(X.values, i+1) for i in range(3)]
print("VIF (Variance Inflation Factor):")
print(vif_data)
print("Rule of thumb: VIF > 10 (or >5) indicates problematic multicollinearity")
# Correlation matrix
corr = pd.DataFrame({'x1':x1,'x2':x2,'x3':x3}).corr()
print("\nCorrelation matrix:")
print(corr.round(3))
plt.figure(figsize=(6, 4))
sns.heatmap(corr, annot=True, fmt='.3f', cmap='RdBu_r', center=0)
plt.title('Predictor Correlation Matrix')
plt.tight_layout()
plt.savefig('multicollinearity.png', dpi=150)
plt.show()
# Show effect: unstable coefficients with multicollinearity
print("\nWith multicollinearity — coefficient instability:")
for seed in [1, 2, 3, 4, 5]:
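    np.random.seed(seed)
    # Redraw the noise around the same z, so x1s and x2s stay highly correlated
    x1s = z + np.random.normal(0, 0.3, n)
    x2s = z + np.random.normal(0, 0.3, n)
    # True coefficients: 2 on x1s, 0 on x2s; the unobserved 1.5*N(0,1) term acts as extra noise
    ys = 2*x1s + 1.5*np.random.normal(0,1,n) + np.random.normal(0, 1, n)
    Xs = sm.add_constant(pd.DataFrame({'x1':x1s,'x2':x2s}))
    m = sm.OLS(ys, Xs).fit()
    print(f" Seed {seed}: β₁={m.params['x1']:.3f}, β₂={m.params['x2']:.3f}")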
Solutions
| Solution | When to Use |
|---|---|
| Remove one collinear predictor | If redundant (e.g., two versions of the same variable) |
| Create composite (PCA) | When both carry signal |
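| Ridge regression | When all predictors must stay in the model; the penalty shrinks and stabilizes correlated coefficients at the cost of some bias |
| Center/standardize variables | When the collinearity comes from polynomial or interaction terms built from uncentered variables |
| Collect more data | When feasible; a larger sample narrows standard errors but does not remove the correlation |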
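As a rough illustration of the ridge row, the sketch below implements the closed-form ridge solution with plain NumPy on standardized predictors. Everything in it is an assumption for demonstration purposes: the simulated data, the helper name `ridge_coefs`, and the penalty value `lam=10.0`, which is arbitrary. In practice the penalty is usually chosen by cross-validation.

```python
import numpy as np

# Illustrative data (assumed): two predictors sharing a latent factor z
rng = np.random.default_rng(0)
n = 200
z = rng.normal(0, 1, n)
x1 = z + rng.normal(0, 0.3, n)
x2 = z + rng.normal(0, 0.3, n)
y = 2 * x1 + rng.normal(0, 1, n)

# Standardize predictors and center y so no intercept term is needed
X = np.column_stack([x1, x2])
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
yc = y - y.mean()


def ridge_coefs(X, y, lam):
    """Closed-form ridge solution: (X'X + lam*I)^(-1) X'y (hypothetical helper)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)


beta_ols = ridge_coefs(Xs, yc, lam=0.0)     # lam = 0 reduces to ordinary least squares
beta_ridge = ridge_coefs(Xs, yc, lam=10.0)  # arbitrary penalty, for illustration only

print("OLS coefficients:  ", beta_ols.round(3))
print("Ridge coefficients:", beta_ridge.round(3))
```

The ridge coefficient vector always has a smaller norm than the OLS one for a positive penalty, which is the shrinkage the table refers to. The coefficients are on the standardized scale, so they are not directly comparable to the raw-scale estimates printed earlier.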
Key Takeaways
- VIF > 10 suggests serious multicollinearity; VIF > 5 warrants attention
- Multicollinearity inflates standard errors → wide CIs, large p-values, unstable coefficients
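- Point estimates remain unbiased; what suffers is their precision, and with it inference on individual coefficients
- Perfect collinearity makes XᵀX singular (non-invertible) → OLS has no unique solution
- Ridge regression is a common choice when you need to keep all predictors, trading a small bias for lower variance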
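The perfect-collinearity point can be checked numerically. The short sketch below uses made-up data in which `x2` is an exact multiple of `x1`, and shows that the design matrix loses a rank, which is why XᵀX cannot be inverted.

```python
import numpy as np

# Illustrative data (assumed): x2 is an exact linear function of x1
rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(0, 1, n)
x2 = 3.0 * x1
X = np.column_stack([np.ones(n), x1, x2])  # constant, x1, x2

# A full-rank design would have rank equal to its number of columns
rank = np.linalg.matrix_rank(X)
print(f"columns: {X.shape[1]}, rank: {rank}")  # columns: 3, rank: 2
```

Because the rank (2) is below the number of columns (3), infinitely many coefficient vectors fit the data equally well, so the individual coefficients on `x1` and `x2` are not identified.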