Multicollinearity — Detection and Solutions

Multicollinearity

Multicollinearity occurs when two or more predictors are highly correlated with each other. It doesn't bias OLS estimates, but it inflates their standard errors, so individual coefficient estimates become imprecise and sensitive to small changes in the data.
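The link between correlated predictors and inflated standard errors can be written out with the standard OLS variance result. The notation here is an assumption, not taken from the lesson: R_j² is the R² from regressing predictor j on all the other predictors (with an intercept), and σ² is the error variance.

```latex
\mathrm{Var}(\hat\beta_j)
  = \frac{\sigma^2}{\sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2} \cdot \frac{1}{1 - R_j^2},
\qquad
\mathrm{VIF}_j = \frac{1}{1 - R_j^2}
```

As R_j² approaches 1, the second factor (the variance inflation factor) grows without bound. For example, R_j² = 0.9 gives VIF = 10, which means the standard error is about √10 ≈ 3.2 times what it would be if predictor j were uncorrelated with the others.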

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
import matplotlib.pyplot as plt
import seaborn as sns

np.random.seed(42)
n = 100

# Create correlated predictors
z = np.random.normal(0, 1, n)
x1 = z + np.random.normal(0, 0.3, n)      # shares the latent factor z with x2
x2 = z + np.random.normal(0, 0.3, n)      # so x1 and x2 are highly correlated with each other
x3 = np.random.normal(0, 1, n)            # independent
y = 2*x1 + 1.5*x3 + np.random.normal(0, 1, n)

X = sm.add_constant(pd.DataFrame({'x1':x1,'x2':x2,'x3':x3}))

# Detect multicollinearity: Variance Inflation Factor
vif_data = pd.DataFrame()
vif_data['Feature'] = ['x1','x2','x3']
vif_data['VIF'] = [variance_inflation_factor(X.values, i+1) for i in range(3)]
print("VIF (Variance Inflation Factor):")
print(vif_data)
print("Rule of thumb: VIF > 10 (or >5) indicates problematic multicollinearity")

# Correlation matrix
corr = pd.DataFrame({'x1':x1,'x2':x2,'x3':x3}).corr()
print("\nCorrelation matrix:")
print(corr.round(3))

plt.figure(figsize=(6, 4))
sns.heatmap(corr, annot=True, fmt='.3f', cmap='RdBu_r', center=0)
plt.title('Predictor Correlation Matrix')
plt.tight_layout()
plt.savefig('multicollinearity.png', dpi=150)
plt.show()

# Show effect: unstable coefficients with multicollinearity
print("\nWith multicollinearity — coefficient instability:")
for seed in [1, 2, 3, 4, 5]:
    np.random.seed(seed)
    x1s = z + np.random.normal(0, 0.3, n)
    x2s = z + np.random.normal(0, 0.3, n)
    ys = 2*x1s + 1.5*np.random.normal(0,1,n) + np.random.normal(0, 1, n)
    Xs = sm.add_constant(pd.DataFrame({'x1':x1s,'x2':x2s}))
    m = sm.OLS(ys, Xs).fit()
    print(f"  Seed {seed}: β₁={m.params['x1']:.3f}, β₂={m.params['x2']:.3f}")
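To make the VIF numbers less of a black box, here is a minimal NumPy-only sketch that computes them by hand. It regresses each predictor on the others and applies 1 / (1 - R²), then cross-checks against the diagonal of the inverse correlation matrix, which gives the same quantity. The helper name `vif_by_auxiliary_regression`, the sample size, and the seed are illustrative assumptions, not part of the lesson's code.

```python
import numpy as np

def vif_by_auxiliary_regression(X):
    """Return the VIF of each column of X (shape n x p, no constant column)."""
    n, p = X.shape
    vifs = []
    for j in range(p):
        target = X[:, j]
        # Regress column j on an intercept plus all remaining columns
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

# Same data-generating idea as above; n and the seed are arbitrary choices
rng = np.random.default_rng(0)
n = 500
z = rng.normal(0, 1, n)
X_demo = np.column_stack([
    z + rng.normal(0, 0.3, n),   # x1
    z + rng.normal(0, 0.3, n),   # x2, correlated with x1 through z
    rng.normal(0, 1, n),         # x3, independent
])

vif_manual = vif_by_auxiliary_regression(X_demo)
# Equivalent shortcut: diagonal of the inverse predictor correlation matrix
vif_from_corr = np.diag(np.linalg.inv(np.corrcoef(X_demo, rowvar=False)))

print("VIF via auxiliary regressions:", vif_manual.round(2))
print("VIF via inverse correlation:  ", vif_from_corr.round(2))
```

The two printed rows should agree up to floating-point error, with x1 and x2 showing clearly elevated values and x3 staying close to 1.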

Solutions

| Solution | When to Use |
|---|---|
| Remove one collinear predictor | If redundant (e.g., two versions of same variable) |
| Create composite (PCA) | When both carry signal |
| Ridge regression | Regularization shrinks correlated coefficients |
| Center/standardize variables | For polynomial terms and interactions |
| Collect more data | Increases precision |
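As a rough illustration of the ridge row, the sketch below fits closed-form ridge regression on many simulated datasets and compares the spread of the x1 coefficient with OLS. It is a simplified sketch under stated assumptions: the helper `ridge_fit`, the penalty `lam=10.0`, and the number of replications are all hypothetical choices, and predictors are only centered, not standardized as would be usual in practice.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge on centered data; the intercept is left unpenalized."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    p = X.shape[1]
    return np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)

rng = np.random.default_rng(0)
n, reps = 100, 200
ols_b1, ridge_b1 = [], []

for _ in range(reps):
    z = rng.normal(0, 1, n)
    x1 = z + rng.normal(0, 0.3, n)
    x2 = z + rng.normal(0, 0.3, n)          # collinear with x1 through z
    y = 2 * x1 + rng.normal(0, 1, n)        # true coefficients: (2, 0)
    X = np.column_stack([x1, x2])
    ols_b1.append(ridge_fit(X, y, lam=0.0)[0])    # lam=0 reduces to OLS
    ridge_b1.append(ridge_fit(X, y, lam=10.0)[0])

print(f"OLS   beta1: mean={np.mean(ols_b1):.3f}, sd={np.std(ols_b1):.3f}")
print(f"Ridge beta1: mean={np.mean(ridge_b1):.3f}, sd={np.std(ridge_b1):.3f}")
```

The ridge estimates vary less from sample to sample, but their average moves away from the true value of 2. That is the bias-variance trade-off ridge makes, and the penalty would normally be chosen by cross-validation rather than fixed by hand.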

Key Takeaways

  1. VIF > 10 suggests serious multicollinearity; VIF > 5 warrants attention
  2. Multicollinearity inflates standard errors → wide CIs, large p-values, unstable coefficients
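  3. Point estimates remain unbiased, but their variance increases, so inference on individual coefficients suffers while overall fit and predictions are largely unaffected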
  4. Perfect collinearity makes XᵀX non-invertible → OLS impossible
  5. Ridge regression is a common remedy when you need to keep all predictors; it accepts some bias in exchange for lower variance
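The perfect-collinearity case in takeaway 4 can be checked numerically. In this minimal sketch (variable names and the seed are illustrative assumptions), one predictor is an exact multiple of another, so the design matrix loses a rank and XᵀX is singular.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(0, 1, n)
x2 = 2.0 * x1                               # exact linear function of x1
X = np.column_stack([np.ones(n), x1, x2])   # constant, x1, x2
y = 1.0 + 3.0 * x1 + rng.normal(0, 1, n)

rank = np.linalg.matrix_rank(X)
print(f"columns={X.shape[1]}, rank={rank}")
print(f"condition number of X'X: {np.linalg.cond(X.T @ X):.2e}")

# lstsq still returns an answer: the minimum-norm least-squares solution,
# which is only one of infinitely many coefficient vectors with the same fit
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print("minimum-norm solution:", beta.round(3))
```

The rank comes out one short of the column count, so no unique OLS solution exists. Software that still reports coefficients in this situation is typically returning one arbitrary member of that solution set, or silently dropping a column.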
