What Is Linear Regression?
Linear regression models the relationship between a continuous target and one or more features by fitting the best straight line (or hyperplane in multiple dimensions).
In matrix form:
Where:
is thedesign matrix (first column of 1s for intercept)is thevector of coefficientsis the error term
Ordinary Least Squares (OLS) — Deriving the Solution
We choose to minimise the Residual Sum of Squares (RSS):
Taking the derivative and setting it to zero:
Normal Equations → closed-form solution:
This works when is invertible (no perfect multicollinearity).
Key Assumptions (LINE)
| Assumption | What It Means | How to Check |
|---|---|---|
| Linearity | is linear in | Scatter plot, residual vs fitted |
| Independence | Residuals are uncorrelated | Durbin-Watson test |
| Normality | Residuals ~ Normal(0, σ²) | QQ-plot, Shapiro-Wilk test |
| Equal Variance | Homoscedasticity | Residual vs fitted plot |
Full Python Implementation
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.pipeline import Pipeline
import warnings
warnings.filterwarnings('ignore')
# ── 1. Load Data ─────────────────────────────────────────────────────
housing = fetch_california_housing(as_frame=True)
df = housing.frame
print(df.describe().round(2))
print(f"\nTarget: Median house value ($100k)")
print(f"Shape: {df.shape}")
# ── 2. EDA ───────────────────────────────────────────────────────────
fig, axes = plt.subplots(2, 4, figsize=(16, 8))
for ax, col in zip(axes.flat, df.columns):
ax.hist(df[col], bins=40, color='steelblue', edgecolor='white', alpha=0.8)
ax.set_title(col, fontsize=10)
ax.set_xlabel('')
plt.suptitle('Feature Distributions', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()
# Correlation heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(df.corr(), annot=True, fmt='.2f', cmap='coolwarm',
center=0, square=True)
plt.title('Feature Correlation Matrix')
plt.tight_layout()
plt.show()
# ── 3. Prepare Data ──────────────────────────────────────────────────
X = df.drop('MedHouseVal', axis=1)
y = df['MedHouseVal']
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)
# ── 4. OLS Implementation from Scratch ───────────────────────────────
def ols_solution(X, y):
"""Closed-form OLS: β = (XᵀX)⁻¹ Xᵀy"""
X_b = np.column_stack([np.ones(len(X)), X]) # add intercept column
beta = np.linalg.pinv(X_b.T @ X_b) @ X_b.T @ y
return beta[0], beta[1:] # intercept, coefficients
intercept_ols, coefs_ols = ols_solution(X_train_s, y_train.values)
print(f"\nOLS Intercept : {intercept_ols:.4f}")
coef_df = pd.DataFrame({'Feature': X.columns, 'OLS Coef': coefs_ols})
print(coef_df.sort_values('OLS Coef', key=abs, ascending=False))
# ── 5. Sklearn Linear Regression ─────────────────────────────────────
lr = LinearRegression()
lr.fit(X_train_s, y_train)
y_pred_lr = lr.predict(X_test_s)
def evaluate(name, y_true, y_pred):
mse = mean_squared_error(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)
rmse = np.sqrt(mse)
print(f"\n{'─'*40}")
print(f" {name}")
print(f"{'─'*40}")
print(f" RMSE : {rmse:.4f}")
print(f" MAE : {mae:.4f}")
print(f" R² : {r2:.4f}")
return {'RMSE': rmse, 'MAE': mae, 'R²': r2}
res_lr = evaluate("Linear Regression (OLS)", y_test, y_pred_lr)
# ── 6. Regularised Models ─────────────────────────────────────────────
alphas_to_try = {
'Ridge (L2)': Ridge(alpha=1.0),
'Lasso (L1)': Lasso(alpha=0.01),
'ElasticNet': ElasticNet(alpha=0.01, l1_ratio=0.5),
}
all_results = {'OLS': res_lr}
for name, model in alphas_to_try.items():
model.fit(X_train_s, y_train)
y_pred = model.predict(X_test_s)
all_results[name] = evaluate(name, y_test, y_pred)
# ── 7. Visualise Coefficient Shrinkage ───────────────────────────────
coef_comparison = pd.DataFrame({
'Feature': X.columns,
'OLS': lr.coef_,
'Ridge': Ridge(alpha=1.0).fit(X_train_s, y_train).coef_,
'Lasso': Lasso(alpha=0.01).fit(X_train_s, y_train).coef_,
})
coef_comparison = coef_comparison.set_index('Feature')
coef_comparison.plot(kind='bar', figsize=(12, 5), width=0.7)
plt.title('Coefficient Comparison: OLS vs Ridge vs Lasso', fontsize=13)
plt.ylabel('Coefficient Value')
plt.axhline(0, color='black', linewidth=0.8, linestyle='--')
plt.xticks(rotation=30, ha='right')
plt.legend()
plt.tight_layout()
plt.show()
# ── 8. Residual Diagnostics ───────────────────────────────────────────
residuals = y_test.values - y_pred_lr
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
# Residuals vs Fitted
axes[0].scatter(y_pred_lr, residuals, alpha=0.4, color='steelblue', s=10)
axes[0].axhline(0, color='red', linewidth=1.5, linestyle='--')
axes[0].set_xlabel('Fitted Values')
axes[0].set_ylabel('Residuals')
axes[0].set_title('Residuals vs Fitted')
axes[0].grid(True, alpha=0.3)
# Distribution of residuals
axes[1].hist(residuals, bins=50, color='steelblue', edgecolor='white', alpha=0.8)
axes[1].set_xlabel('Residual')
axes[1].set_title('Residual Distribution')
axes[1].grid(True, alpha=0.3)
# Q-Q plot
from scipy import stats
stats.probplot(residuals, dist='norm', plot=axes[2])
axes[2].set_title('Normal Q-Q Plot')
axes[2].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# ── 9. Polynomial Regression ──────────────────────────────────────────
# Use just 1 feature (MedInc) to visualise
X_single = df[['MedInc']].values
y_single = df['MedHouseVal'].values
plt.figure(figsize=(12, 4))
for deg, color in zip([1, 2, 3], ['#3b82f6', '#10b981', '#f59e0b']):
pipe = Pipeline([
('poly', PolynomialFeatures(degree=deg, include_bias=False)),
('scaler', StandardScaler()),
('reg', LinearRegression()),
])
pipe.fit(X_single, y_single)
X_plot = np.linspace(X_single.min(), X_single.max(), 300).reshape(-1, 1)
y_plot = pipe.predict(X_plot)
plt.plot(X_plot, y_plot, color=color, linewidth=2.5, label=f'Degree {deg}')
plt.scatter(X_single[::20], y_single[::20], alpha=0.2, color='gray', s=10)
plt.xlabel('Median Income')
plt.ylabel('Median House Value')
plt.title('Polynomial Regression — MedInc vs MedHouseVal')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
Output:
────────────────────────────────────────
Linear Regression (OLS)
────────────────────────────────────────
RMSE : 0.7378
MAE : 0.5324
R² : 0.5757
────────────────────────────────────────
Ridge (L2)
────────────────────────────────────────
RMSE : 0.7378
MAE : 0.5323
R² : 0.5757
Regularisation: Ridge vs Lasso vs ElasticNet
| Method | Effect | Best When |
|---|---|---|
| Ridge | Shrinks all coefficients towards 0 | All features contribute, multicollinearity |
| Lasso | Drives some coefficients to exactly 0 (feature selection) | Sparse data, many irrelevant features |
| ElasticNet | Combines both | High-dimensional, correlated features |
Evaluation Metrics Summary
| Metric | Formula | Interpretation |
|---|---|---|
| RMSE | sqrt(mean squared errors) | Same units as y; penalises large errors |
| MAE | mean of absolute errors | Robust to outliers |
| R² | 1 - SS_res / SS_tot | 1 = perfect, 0 = predicts mean, less than 0 = worse than mean |
| Adj R² | 1 - (1 - R²)(n-1)/(n-p-1) | Penalises extra features; use for model comparison |
Key Takeaways
- OLS gives the closed-form best-fit line by minimising RSS
- Assumptions: Linearity, independence, normality, homoscedasticity
- Ridge reduces variance without removing features; Lasso does feature selection
- Always scale features before applying regularisation
- Residual plots are the most important diagnostic — check them first