OLS Estimation: From First Principles
Ordinary Least Squares (OLS) finds β̂ that minimizes the sum of squared residuals:
Taking the derivative and setting to zero gives the normal equations:
import numpy as np
import pandas as pd
from scipy import stats
np.random.seed(42)
# Data
n = 50
X_raw = np.random.uniform(0, 10, n)
y = 3 + 2.5 * X_raw + np.random.normal(0, 2, n)
# Design matrix: [1 | X]
X = np.column_stack([np.ones(n), X_raw])
# Normal equations: β̂ = (X'X)⁻¹ X'y
XtX = X.T @ X
Xty = X.T @ y
beta_hat = np.linalg.solve(XtX, Xty) # better than inv() for numerical stability
print(f"β̂₀ (intercept) = {beta_hat[0]:.4f} (true: 3)")
print(f"β̂₁ (slope) = {beta_hat[1]:.4f} (true: 2.5)")
# Fitted values and residuals
y_hat = X @ beta_hat
residuals = y - y_hat
# Standard error of the regression (RMSE)
n_params = 2 # β₀ and β₁
SSR = residuals @ residuals
sigma2_hat = SSR / (n - n_params)
print(f"σ̂² = {sigma2_hat:.4f}")
print(f"σ̂ (RMSE) = {np.sqrt(sigma2_hat):.4f}")
# Variance-covariance matrix of β̂
Var_beta = sigma2_hat * np.linalg.inv(XtX)
SE_beta = np.sqrt(np.diag(Var_beta))
print(f"SE(β̂₀) = {SE_beta[0]:.4f}")
print(f"SE(β̂₁) = {SE_beta[1]:.4f}")
# t-statistics and p-values
t_stats = beta_hat / SE_beta
p_values = 2 * stats.t.sf(np.abs(t_stats), df=n-n_params)
print(f"\nt-test for β₁: t={t_stats[1]:.4f}, p={p_values[1]:.6f}")
# R-squared
SS_total = ((y - y.mean())**2).sum()
SS_resid = (residuals**2).sum()
R2 = 1 - SS_resid/SS_total
print(f"R² = {R2:.4f}")
# Verify with statsmodels
import statsmodels.api as sm
model = sm.OLS(y, X).fit()
np.testing.assert_allclose(beta_hat, model.params, rtol=1e-10)
print("\n✓ Manual OLS matches statsmodels exactly")
Properties of OLS Estimators
Gauss-Markov Theorem: Under the classical assumptions, OLS is BLUE:
- Best (minimum variance)
- Linear (linear function of y)
- Unbiased (E[β̂] = β)
- Estimator
# Demonstrate unbiasedness: repeat many times and show E[β̂] ≈ β
n_sims = 10000
beta1_estimates = []
for _ in range(n_sims):
X_raw_s = np.random.uniform(0, 10, 50)
y_s = 3 + 2.5*X_raw_s + np.random.normal(0, 2, 50)
X_s = np.column_stack([np.ones(50), X_raw_s])
b = np.linalg.solve(X_s.T@X_s, X_s.T@y_s)
beta1_estimates.append(b[1])
print(f"True β₁ = 2.5")
print(f"E[β̂₁] over {n_sims:,} samples = {np.mean(beta1_estimates):.4f}")
print(f"Std(β̂₁) = {np.std(beta1_estimates):.4f}")
Key Takeaways
- OLS normal equations: β̂ = (XᵀX)⁻¹Xᵀy — the closed-form solution
- Gauss-Markov: OLS is BLUE when errors are i.i.d. with constant variance
- σ̂² = SSR/(n−p) — unbiased estimator of error variance (p = number of parameters)
- np.linalg.solve() is more stable than np.linalg.inv() for the normal equations
- The design matrix X includes a column of 1s for the intercept