Linear Regression — Theory, Math & Python

What Is Linear Regression?

Linear regression models the relationship between a continuous target $y$ and one or more features $X$ by fitting the best straight line (or hyperplane in multiple dimensions).

$\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \varepsilon$

In matrix form:

$\hat{\mathbf{y}} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$

Where:

$\mathbf{X}$ is the $(n \times p+1)$ design matrix (first column of 1s for intercept)
$\boldsymbol{\beta}$ is the $(p+1)$ vector of coefficients
$\boldsymbol{\varepsilon} \sim \mathcal{N}(0, \sigma^2\mathbf{I})$ is the error term

Ordinary Least Squares (OLS) — Deriving the Solution

We choose $\boldsymbol{\beta}$ to minimise the Residual Sum of Squares (RSS):

$\text{RSS}(\boldsymbol{\beta}) = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 = \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2$

Taking the derivative and setting it to zero:

$\frac{\partial \text{RSS}}{\partial \boldsymbol{\beta}} = -2\mathbf{X}^\top(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) = 0$

Normal Equations → closed-form solution:

$\hat{\boldsymbol{\beta}} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}$

This works when $\mathbf{X}^\top\mathbf{X}$ is invertible (no perfect multicollinearity).

Key Assumptions (LINE)

Assumption	What It Means	How to Check
Linearity	$E[y\\|X]$ is linear in $X$	Scatter plot, residual vs fitted
Independence	Residuals are uncorrelated	Durbin-Watson test
Normality	Residuals ~ Normal(0, σ²)	QQ-plot, Shapiro-Wilk test
Equal Variance	Homoscedasticity	Residual vs fitted plot

Full Python Implementation

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.pipeline import Pipeline
import warnings
warnings.filterwarnings('ignore')

# ── 1. Load Data ─────────────────────────────────────────────────────
housing = fetch_california_housing(as_frame=True)
df = housing.frame
print(df.describe().round(2))
print(f"\nTarget: Median house value ($100k)")
print(f"Shape: {df.shape}")

# ── 2. EDA ───────────────────────────────────────────────────────────
fig, axes = plt.subplots(2, 4, figsize=(16, 8))
for ax, col in zip(axes.flat, df.columns):
    ax.hist(df[col], bins=40, color='steelblue', edgecolor='white', alpha=0.8)
    ax.set_title(col, fontsize=10)
    ax.set_xlabel('')
plt.suptitle('Feature Distributions', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

# Correlation heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(df.corr(), annot=True, fmt='.2f', cmap='coolwarm',
            center=0, square=True)
plt.title('Feature Correlation Matrix')
plt.tight_layout()
plt.show()

# ── 3. Prepare Data ──────────────────────────────────────────────────
X = df.drop('MedHouseVal', axis=1)
y = df['MedHouseVal']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s  = scaler.transform(X_test)

# ── 4. OLS Implementation from Scratch ───────────────────────────────
def ols_solution(X, y):
    """Closed-form OLS: β = (XᵀX)⁻¹ Xᵀy"""
    X_b = np.column_stack([np.ones(len(X)), X])   # add intercept column
    beta = np.linalg.pinv(X_b.T @ X_b) @ X_b.T @ y
    return beta[0], beta[1:]   # intercept, coefficients

intercept_ols, coefs_ols = ols_solution(X_train_s, y_train.values)
print(f"\nOLS Intercept : {intercept_ols:.4f}")
coef_df = pd.DataFrame({'Feature': X.columns, 'OLS Coef': coefs_ols})
print(coef_df.sort_values('OLS Coef', key=abs, ascending=False))

# ── 5. Sklearn Linear Regression ─────────────────────────────────────
lr = LinearRegression()
lr.fit(X_train_s, y_train)
y_pred_lr = lr.predict(X_test_s)

def evaluate(name, y_true, y_pred):
    mse  = mean_squared_error(y_true, y_pred)
    mae  = mean_absolute_error(y_true, y_pred)
    r2   = r2_score(y_true, y_pred)
    rmse = np.sqrt(mse)
    print(f"\n{'─'*40}")
    print(f"  {name}")
    print(f"{'─'*40}")
    print(f"  RMSE  : {rmse:.4f}")
    print(f"  MAE   : {mae:.4f}")
    print(f"  R²    : {r2:.4f}")
    return {'RMSE': rmse, 'MAE': mae, 'R²': r2}

res_lr = evaluate("Linear Regression (OLS)", y_test, y_pred_lr)

# ── 6. Regularised Models ─────────────────────────────────────────────
alphas_to_try = {
    'Ridge (L2)':   Ridge(alpha=1.0),
    'Lasso (L1)':   Lasso(alpha=0.01),
    'ElasticNet':   ElasticNet(alpha=0.01, l1_ratio=0.5),
}

all_results = {'OLS': res_lr}
for name, model in alphas_to_try.items():
    model.fit(X_train_s, y_train)
    y_pred = model.predict(X_test_s)
    all_results[name] = evaluate(name, y_test, y_pred)

# ── 7. Visualise Coefficient Shrinkage ───────────────────────────────
coef_comparison = pd.DataFrame({
    'Feature':   X.columns,
    'OLS':       lr.coef_,
    'Ridge':     Ridge(alpha=1.0).fit(X_train_s, y_train).coef_,
    'Lasso':     Lasso(alpha=0.01).fit(X_train_s, y_train).coef_,
})
coef_comparison = coef_comparison.set_index('Feature')

coef_comparison.plot(kind='bar', figsize=(12, 5), width=0.7)
plt.title('Coefficient Comparison: OLS vs Ridge vs Lasso', fontsize=13)
plt.ylabel('Coefficient Value')
plt.axhline(0, color='black', linewidth=0.8, linestyle='--')
plt.xticks(rotation=30, ha='right')
plt.legend()
plt.tight_layout()
plt.show()

# ── 8. Residual Diagnostics ───────────────────────────────────────────
residuals = y_test.values - y_pred_lr

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Residuals vs Fitted
axes[0].scatter(y_pred_lr, residuals, alpha=0.4, color='steelblue', s=10)
axes[0].axhline(0, color='red', linewidth=1.5, linestyle='--')
axes[0].set_xlabel('Fitted Values')
axes[0].set_ylabel('Residuals')
axes[0].set_title('Residuals vs Fitted')
axes[0].grid(True, alpha=0.3)

# Distribution of residuals
axes[1].hist(residuals, bins=50, color='steelblue', edgecolor='white', alpha=0.8)
axes[1].set_xlabel('Residual')
axes[1].set_title('Residual Distribution')
axes[1].grid(True, alpha=0.3)

# Q-Q plot
from scipy import stats
stats.probplot(residuals, dist='norm', plot=axes[2])
axes[2].set_title('Normal Q-Q Plot')
axes[2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# ── 9. Polynomial Regression ──────────────────────────────────────────
# Use just 1 feature (MedInc) to visualise
X_single = df[['MedInc']].values
y_single  = df['MedHouseVal'].values

plt.figure(figsize=(12, 4))
for deg, color in zip([1, 2, 3], ['#3b82f6', '#10b981', '#f59e0b']):
    pipe = Pipeline([
        ('poly', PolynomialFeatures(degree=deg, include_bias=False)),
        ('scaler', StandardScaler()),
        ('reg', LinearRegression()),
    ])
    pipe.fit(X_single, y_single)
    X_plot = np.linspace(X_single.min(), X_single.max(), 300).reshape(-1, 1)
    y_plot = pipe.predict(X_plot)
    plt.plot(X_plot, y_plot, color=color, linewidth=2.5, label=f'Degree {deg}')

plt.scatter(X_single[::20], y_single[::20], alpha=0.2, color='gray', s=10)
plt.xlabel('Median Income')
plt.ylabel('Median House Value')
plt.title('Polynomial Regression — MedInc vs MedHouseVal')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

Output:

────────────────────────────────────────
  Linear Regression (OLS)
────────────────────────────────────────
  RMSE  : 0.7378
  MAE   : 0.5324
  R²    : 0.5757

────────────────────────────────────────
  Ridge (L2)
────────────────────────────────────────
  RMSE  : 0.7378
  MAE   : 0.5323
  R²    : 0.5757

Regularisation: Ridge vs Lasso vs ElasticNet

$\text{Ridge (L2):} \quad \mathcal{L} = \text{RSS} + \alpha\sum_{j=1}^{p}\beta_j^2$

$\text{Lasso (L1):} \quad \mathcal{L} = \text{RSS} + \alpha\sum_{j=1}^{p}|\beta_j|$

$\text{ElasticNet:} \quad \mathcal{L} = \text{RSS} + \alpha\lambda\sum|\beta_j| + \frac{\alpha(1-\lambda)}{2}\sum\beta_j^2$

Method	Effect	Best When
Ridge	Shrinks all coefficients towards 0	All features contribute, multicollinearity
Lasso	Drives some coefficients to exactly 0 (feature selection)	Sparse data, many irrelevant features
ElasticNet	Combines both	High-dimensional, correlated features

Evaluation Metrics Summary

Metric	Formula	Interpretation
RMSE	sqrt(mean squared errors)	Same units as y; penalises large errors
MAE	mean of absolute errors	Robust to outliers
R²	1 - SS_res / SS_tot	1 = perfect, 0 = predicts mean, less than 0 = worse than mean
Adj R²	1 - (1 - R²)(n-1)/(n-p-1)	Penalises extra features; use for model comparison

Key Takeaways

OLS gives the closed-form best-fit line by minimising RSS
Assumptions: Linearity, independence, normality, homoscedasticity
Ridge reduces variance without removing features; Lasso does feature selection
Always scale features before applying regularisation
Residual plots are the most important diagnostic — check them first