Simple Linear Regression
Simple linear regression models the relationship between one predictor X and one response Y as a straight line plus random error:

$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$$
- β₀ = intercept (predicted Y when X = 0)
- β₁ = slope (expected change in Y for a 1-unit increase in X)
- εᵢ = error term (the part of Yᵢ that the line does not explain)
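As a quick numeric illustration of the intercept and slope, the sketch below plugs in hypothetical coefficients (β₀ = 50,000 and β₁ = 150, chosen only for illustration) and leaves out the error term, so it returns the mean response rather than an individual Y.

```python
# Illustrative coefficients (assumed values, not estimates from real data)
beta0 = 50_000   # intercept: mean response when x = 0
beta1 = 150      # slope: change in mean response per 1-unit increase in x

def mean_response(x):
    """Mean of Y at a given x under the straight-line model (error term omitted)."""
    return beta0 + beta1 * x

intercept_value = mean_response(0)
one_unit_change = mean_response(2001) - mean_response(2000)

print(intercept_value)   # 50000
print(one_unit_change)   # 150
```

The difference between any two mean responses one unit apart is the slope, regardless of where on the X axis the comparison is made.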
Ordinary Least Squares Estimation
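OLS chooses the intercept and slope that minimize the sum of squared residuals:

$$\min_{\beta_0,\,\beta_1} \sum_{i=1}^{n} \left(Y_i - \beta_0 - \beta_1 X_i\right)^2$$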
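The closed-form solution:

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2}, \qquad \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}$$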
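The closed-form estimates can be computed directly with NumPy. The sketch below uses the standard simple-regression formulas (slope = Sxy / Sxx, intercept = ȳ − slope · x̄) on a tiny made-up data set; the arrays `x` and `y` are hypothetical, and `np.polyfit` is used only as an independent cross-check.

```python
import numpy as np

# Tiny made-up data set (hypothetical values, for illustration only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

x_bar, y_bar = x.mean(), y.mean()

# Slope: sum of cross-deviations divided by sum of squared x-deviations
s_xy = np.sum((x - x_bar) * (y - y_bar))
s_xx = np.sum((x - x_bar) ** 2)
beta1_hat = s_xy / s_xx

# Intercept: forces the fitted line through (x_bar, y_bar)
beta0_hat = y_bar - beta1_hat * x_bar

print(f"slope = {beta1_hat:.2f}, intercept = {beta0_hat:.2f}")

# Cross-check against NumPy's least-squares polynomial fit (degree 1)
slope_check, intercept_check = np.polyfit(x, y, 1)
assert np.isclose(beta1_hat, slope_check)
assert np.isclose(beta0_hat, intercept_check)
```

For these five points the slope comes out at 1.99 and the intercept at 0.05, matching the least-squares fit from `np.polyfit`.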
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
from scipy import stats
np.random.seed(42)
# Generate data: house size vs price
n = 100
house_size = np.random.uniform(1000, 3000, n) # sq ft
true_beta0, true_beta1 = 50000, 150 # $150 per sq ft
house_price = true_beta0 + true_beta1 * house_size + np.random.normal(0, 30000, n)
df = pd.DataFrame({'size': house_size, 'price': house_price})
# ─── statsmodels OLS ───
X = sm.add_constant(df['size'])
model = sm.OLS(df['price'], X).fit()
print(model.summary())
# Key outputs
beta0_hat = model.params['const']
beta1_hat = model.params['size']
r_squared = model.rsquared
se_beta1 = model.bse['size']
print(f"\nEstimated equation: price = {beta0_hat:.1f} + {beta1_hat:.2f} × size")
print(f"R² = {r_squared:.4f} ({r_squared*100:.1f}% of variance explained)")
print(f"For each additional sq ft, the predicted price increases by ${beta1_hat:.2f}")
# Prediction
new_sizes = np.array([1500, 2000, 2500])
pred = pd.DataFrame({'const': 1, 'size': new_sizes})
predictions = model.get_prediction(pred)
pred_summary = predictions.summary_frame(alpha=0.05)
for size, row in zip(new_sizes, pred_summary.itertuples()):
    print(f"Size {size}: predicted price = ${row.mean:.0f} "
          f"(95% PI: ${row.obs_ci_lower:.0f} to ${row.obs_ci_upper:.0f})")
# Visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
axes[0].scatter(df['size'], df['price']/1000, alpha=0.5, color='steelblue')
x_line = np.linspace(1000, 3000, 200)
y_pred = beta0_hat + beta1_hat * x_line
axes[0].plot(x_line, y_pred/1000, 'r-', linewidth=2)
axes[0].set_xlabel('House Size (sq ft)')
axes[0].set_ylabel('Price ($000)')
axes[0].set_title(f'Simple Linear Regression\nR² = {r_squared:.3f}')
# Residual plot
residuals = model.resid
fitted = model.fittedvalues
axes[1].scatter(fitted/1000, residuals/1000, alpha=0.5, color='coral')
axes[1].axhline(0, color='black', linewidth=2, linestyle='--')
axes[1].set_xlabel('Fitted Values ($000)')
axes[1].set_ylabel('Residuals ($000)')
axes[1].set_title('Residual Plot\n(should show no pattern)')
plt.tight_layout()
plt.savefig('simple_regression.png', dpi=150)
plt.show()
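The R² and slope standard error reported in the summary can be reproduced from the residuals. The sketch below is self-contained: it simulates its own data (the seed, sample size, and coefficients are assumptions that mirror the example above), applies the textbook formulas R² = 1 − SSR/SST and SE(β̂₁) = sqrt(σ̂² / Sxx) with σ̂² = SSR / (n − 2), and cross-checks the results against `scipy.stats.linregress`.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)          # assumed seed, for reproducibility only
n = 100
x = rng.uniform(1000, 3000, n)          # simulated house sizes (sq ft)
y = 50_000 + 150 * x + rng.normal(0, 30_000, n)   # simulated prices

# Closed-form OLS fit
s_xx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / s_xx
b0 = y.mean() - b1 * x.mean()

resid = y - (b0 + b1 * x)
ssr = np.sum(resid ** 2)                # residual sum of squares
sst = np.sum((y - y.mean()) ** 2)       # total sum of squares

r_squared = 1 - ssr / sst
sigma2_hat = ssr / (n - 2)              # two estimated parameters -> n - 2 df
se_b1 = np.sqrt(sigma2_hat / s_xx)

# Independent reference values from SciPy
ref = stats.linregress(x, y)
print(f"R^2    = {r_squared:.4f} (linregress: {ref.rvalue**2:.4f})")
print(f"SE(b1) = {se_b1:.3f} (linregress: {ref.stderr:.3f})")
```

In simple regression R² equals the squared Pearson correlation between X and Y, which is why `ref.rvalue**2` serves as the reference value here.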
The Four Regression Assumptions (LINE)
| Assumption | Description | Check With |
|---|---|---|
| Linearity | The mean of Y is a linear function of X | Scatter plot, residuals vs fitted plot |
| Independence | Errors are independent of one another | Study design, Durbin-Watson statistic |
| Normality | Errors are normally distributed | Q-Q plot of residuals |
| Equal variance | Homoscedasticity (errors have constant spread) | Residuals vs fitted plot, scale-location plot |
# Diagnostic plots for the LINE assumptions (reuses `fitted` and `residuals` from the model above)
fig, axes = plt.subplots(2, 2, figsize=(12, 8))
# 1. Residuals vs fitted
axes[0,0].scatter(fitted/1000, residuals/1000, alpha=0.5)
axes[0,0].axhline(0, color='red', linestyle='--')
axes[0,0].set_title('Residuals vs Fitted')
# 2. Q-Q plot
stats.probplot(residuals, dist='norm', plot=axes[0,1])
axes[0,1].set_title('Q-Q Plot of Residuals')
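# 3. Scale-location (homoscedasticity): sqrt of |standardized residuals| vs fitted
std_resid = model.get_influence().resid_studentized_internal
axes[1,0].scatter(fitted/1000, np.sqrt(np.abs(std_resid)), alpha=0.5)
axes[1,0].set_title('Scale-Location Plot')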
# 4. Residuals histogram
axes[1,1].hist(residuals/1000, bins=20, edgecolor='black', color='steelblue', alpha=0.7)
axes[1,1].set_title('Residuals Distribution')
plt.tight_layout()
plt.savefig('regression_diagnostics.png', dpi=150)
plt.show()
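The plots can be complemented with numeric checks. The sketch below computes three common ones on simulated data (the data-generating values and seed are assumptions): the Durbin-Watson statistic from its definition, the Shapiro-Wilk test via `scipy.stats.shapiro`, and Koenker's studentized form of the Breusch-Pagan test, computed by hand as n · R² from regressing the squared residuals on X. The helper `ols_fit` is a hypothetical name introduced for this example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)          # assumed seed
n = 200
x = rng.uniform(1000, 3000, n)
y = 50_000 + 150 * x + rng.normal(0, 30_000, n)

def ols_fit(x, y):
    """Return (intercept, slope, R^2) for a simple linear regression."""
    s_xx = np.sum((x - x.mean()) ** 2)
    slope = np.sum((x - x.mean()) * (y - y.mean())) / s_xx
    intercept = y.mean() - slope * x.mean()
    resid = y - (intercept + slope * x)
    r2 = 1 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)
    return intercept, slope, r2

b0, b1, _ = ols_fit(x, y)
resid = y - (b0 + b1 * x)

# Independence: Durbin-Watson, near 2 when successive residuals are uncorrelated
dw = np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

# Normality: Shapiro-Wilk (a small p-value suggests non-normal residuals)
sw_stat, sw_p = stats.shapiro(resid)

# Equal variance: studentized Breusch-Pagan, LM = n * R^2 of resid^2 on x,
# compared with a chi-square distribution with 1 degree of freedom
_, _, r2_aux = ols_fit(x, resid ** 2)
bp_lm = n * r2_aux
bp_p = stats.chi2.sf(bp_lm, df=1)

print(f"Durbin-Watson   = {dw:.2f}")
print(f"Shapiro-Wilk p  = {sw_p:.3f}")
print(f"Breusch-Pagan p = {bp_p:.3f}")
```

Durbin-Watson is only informative when the rows have a natural order, such as time. All three statistics are screening aids and work best alongside the plots rather than in place of them.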
Key Takeaways
- β₁ is the slope — the change in Y per unit change in X
- R² = proportion of variance in Y explained by X; for OLS with an intercept it ranges from 0 to 1
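- OLS estimates are BLUE (Best Linear Unbiased Estimators) under the Gauss-Markov conditions (linearity, uncorrelated errors, equal variance); normality is not needed for BLUE, but it is what justifies exact t-tests and intervals in small samples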
- Always plot residuals — patterns indicate assumption violations
- Correlation ≠ causation — regression shows linear association, not causal direction
- At the same X and confidence level, the prediction interval for a new observation is always wider than the confidence interval for the mean response
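The last point follows from the interval formulas: the prediction interval adds the variance of a single new observation (the extra "1 +" under the square root) to the uncertainty in the fitted mean. A minimal sketch on simulated data, using the textbook formulas; the seed, sample size, coefficients, and query point `x0` are all assumptions:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)          # assumed seed
n = 100
x = rng.uniform(1000, 3000, n)
y = 50_000 + 150 * x + rng.normal(0, 30_000, n)

# Closed-form OLS fit and residual standard error
s_xx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / s_xx
b0 = y.mean() - b1 * x.mean()
s = np.sqrt(np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2))

x0 = 2000.0                                   # hypothetical query point
y0_hat = b0 + b1 * x0
t_crit = stats.t.ppf(0.975, df=n - 2)         # two-sided 95%

leverage = 1 / n + (x0 - x.mean()) ** 2 / s_xx
se_mean = s * np.sqrt(leverage)               # uncertainty in the mean response
se_pred = s * np.sqrt(1 + leverage)           # adds noise of one new observation

ci = (y0_hat - t_crit * se_mean, y0_hat + t_crit * se_mean)
pi = (y0_hat - t_crit * se_pred, y0_hat + t_crit * se_pred)

print(f"95% CI for mean price:  {ci[0]:,.0f} to {ci[1]:,.0f}")
print(f"95% PI for a new house: {pi[0]:,.0f} to {pi[1]:,.0f}")
```

Because `se_pred` is always larger than `se_mean`, the prediction interval cannot shrink below roughly ±t · s no matter how much data is collected, while the confidence interval for the mean narrows as n grows.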