Multiple Linear Regression
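Multiple linear regression extends simple regression to several predictors, modeling the response as y = β₀ + β₁x₁ + … + βₚxₚ + ε. The example below fits such a model to simulated house prices: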
import numpy as np
import pandas as pd
import statsmodels.api as sm
from scipy import stats
import matplotlib.pyplot as plt
np.random.seed(42)
n = 200
# House price data: size, bedrooms, age
house_size = np.random.uniform(1000, 3500, n)
bedrooms = np.random.choice([1,2,3,4,5], n, p=[0.05,0.2,0.4,0.25,0.1])
age = np.random.uniform(0, 50, n)
price = (50000 + 120*house_size + 8000*bedrooms - 500*age
         + np.random.normal(0, 25000, n))
df = pd.DataFrame({'price':price,'size':house_size,'bedrooms':bedrooms,'age':age})
X = sm.add_constant(df[['size','bedrooms','age']])
model = sm.OLS(df['price'], X).fit()
print(model.summary())
# Interpretation
print("\nCoefficient Interpretation:")
for name, coef, pval in zip(model.params.index, model.params, model.pvalues):
    # Conventional significance stars; "ns" = not significant at the 5% level
    sig = "***" if pval<0.001 else "**" if pval<0.01 else "*" if pval<0.05 else "ns"
    print(f"  {name:12s}: {coef:>10.2f} (p={pval:.4f} {sig})")
# F-test: overall model significance
print(f"\nF({model.df_model:.0f},{model.df_resid:.0f}) = {model.fvalue:.2f}, p = {model.f_pvalue:.6f}")
print(f"R² = {model.rsquared:.4f}, Adj R² = {model.rsquared_adj:.4f}")
# Prediction with a 95% confidence interval (for the mean response)
# and a 95% prediction interval (for a single new observation)
new_house = pd.DataFrame({'const':1,'size':[2000],'bedrooms':[3],'age':[10]})
pred = model.get_prediction(new_house)
summary = pred.summary_frame(alpha=0.05)
print("\nPrediction for a 2000 sqft, 3 bed, 10yr old house:")
print(f" Predicted: ${summary['mean'].iloc[0]:,.0f}")
print(f" 95% CI: (${summary['mean_ci_lower'].iloc[0]:,.0f}, ${summary['mean_ci_upper'].iloc[0]:,.0f})")
print(f" 95% PI: (${summary['obs_ci_lower'].iloc[0]:,.0f}, ${summary['obs_ci_upper'].iloc[0]:,.0f})")
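The adjusted R² and overall F statistic printed by the example can both be recovered from R², the sample size n, and the number of predictors p. The following is a minimal sketch of that arithmetic using only NumPy, on data simulated with the same recipe as the example. The helper name `fit_ols` and the seed are assumptions made for illustration; they are not part of statsmodels or of the example above.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares fit. X must already contain an intercept column.
    Hypothetical helper for this sketch, not a statsmodels function."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return beta, resid

rng = np.random.default_rng(0)  # arbitrary seed, chosen for this sketch only
n, p = 200, 3                   # observations, predictors (intercept excluded)

size = rng.uniform(1000, 3500, n)
bedrooms = rng.choice([1, 2, 3, 4, 5], n, p=[0.05, 0.2, 0.4, 0.25, 0.1])
age = rng.uniform(0, 50, n)
price = 50000 + 120*size + 8000*bedrooms - 500*age + rng.normal(0, 25000, n)

X = np.column_stack([np.ones(n), size, bedrooms, age])
beta, resid = fit_ols(X, price)

ss_res = np.sum(resid**2)                    # unexplained variation
ss_tot = np.sum((price - price.mean())**2)   # total variation around the mean

r2 = 1 - ss_res/ss_tot
# Adjusted R² rescales the unexplained share by the degrees of freedom
adj_r2 = 1 - (1 - r2)*(n - 1)/(n - p - 1)
# Overall F statistic written in terms of R² ...
f_stat = (r2/p) / ((1 - r2)/(n - p - 1))
# ... and, equivalently, in terms of sums of squares
f_alt = ((ss_tot - ss_res)/p) / (ss_res/(n - p - 1))

print(f"R² = {r2:.4f}, Adj R² = {adj_r2:.4f}")
print(f"F({p},{n - p - 1}) = {f_stat:.2f}")
```

Because the factor (n − 1)/(n − p − 1) exceeds 1 whenever p ≥ 1, adjusted R² is always below R² for a model with at least one predictor and R² < 1.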
Key Takeaways
- Each coefficient βᵢ is the expected change in Y for a one-unit increase in Xᵢ, holding all other predictors constant
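- Use adjusted R² rather than R² when comparing models with different numbers of predictors: R² never decreases when a predictor is added, while adjusted R² penalizes extra predictors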
- The overall F-test tests H₀: β₁ = β₂ = … = βₚ = 0 (all slope coefficients are zero, intercept excluded), i.e. the predictors jointly have no explanatory power
- Interpretation is ceteris paribus ("all else equal"), so a coefficient can differ in size or even sign from the slope of a simple regression on the same predictor
- Check multicollinearity (VIF) before interpreting individual coefficients
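
The VIF check in the last takeaway can be sketched as follows. The VIF of predictor j is 1/(1 − R²ⱼ), where R²ⱼ is the R² from regressing predictor j on all the other predictors. The helper `vif` below is a hypothetical name written with NumPy only to expose that definition; statsmodels provides its own `variance_inflation_factor` in `statsmodels.stats.outliers_influence`. The `rooms` column is also an assumption: it is not in the house data above and is added here, deliberately correlated with `bedrooms`, only to show what a high VIF looks like.

```python
import numpy as np

def vif(X):
    """VIF for each column of X (predictors only, no intercept column).
    Hypothetical helper illustrating the definition VIF_j = 1 / (1 - R²_j)."""
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        y = X[:, j]
        # Regress predictor j on an intercept plus all remaining predictors
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        r2_j = 1 - (resid @ resid) / np.sum((y - y.mean())**2)
        out[j] = 1 / (1 - r2_j)
    return out

rng = np.random.default_rng(0)  # arbitrary seed, chosen for this sketch only
n = 200
size = rng.uniform(1000, 3500, n)
bedrooms = rng.choice([1, 2, 3, 4, 5], n, p=[0.05, 0.2, 0.4, 0.25, 0.1])
age = rng.uniform(0, 50, n)
# Hypothetical extra predictor, built to be nearly collinear with bedrooms
rooms = bedrooms + 2 + rng.normal(0, 0.3, n)

names = ["size", "bedrooms", "age", "rooms"]
vifs = vif(np.column_stack([size, bedrooms, age, rooms]))

for name, value in zip(names, vifs):
    print(f"  {name:10s} VIF = {value:6.2f}")
```

In this setup the independent predictors (`size`, `age`) should have VIFs close to 1, while `bedrooms` and `rooms` inflate each other. Thresholds such as 5 or 10 are common rules of thumb rather than strict cutoffs.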