Linear Regression — Theory, Math & Python

Supervised LearningRegressionFree Lesson

Advertisement

What Is Linear Regression?

Linear regression models the relationship between a continuous target yy and one or more features XX by fitting the best straight line (or hyperplane in multiple dimensions).

y^=β0+β1x1+β2x2++βpxp+ε\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \varepsilon

In matrix form:

y^=Xβ+ε\hat{\mathbf{y}} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}

Where:

  • X\mathbf{X} is the (n×p+1)(n \times p+1) design matrix (first column of 1s for intercept)
  • β\boldsymbol{\beta} is the (p+1)(p+1) vector of coefficients
  • εN(0,σ2I)\boldsymbol{\varepsilon} \sim \mathcal{N}(0, \sigma^2\mathbf{I}) is the error term

Ordinary Least Squares (OLS) — Deriving the Solution

We choose β\boldsymbol{\beta} to minimise the Residual Sum of Squares (RSS):

RSS(β)=i=1n(yiy^i)2=yXβ2\text{RSS}(\boldsymbol{\beta}) = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 = \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2

Taking the derivative and setting it to zero:

RSSβ=2X(yXβ)=0\frac{\partial \text{RSS}}{\partial \boldsymbol{\beta}} = -2\mathbf{X}^\top(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) = 0

Normal Equations → closed-form solution:

β^=(XX)1Xy\hat{\boldsymbol{\beta}} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}

This works when XX\mathbf{X}^\top\mathbf{X} is invertible (no perfect multicollinearity).


Key Assumptions (LINE)

AssumptionWhat It MeansHow to Check
LinearityE[yX]E[y\|X] is linear in XXScatter plot, residual vs fitted
IndependenceResiduals are uncorrelatedDurbin-Watson test
NormalityResiduals ~ Normal(0, σ²)QQ-plot, Shapiro-Wilk test
Equal VarianceHomoscedasticityResidual vs fitted plot

Full Python Implementation

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.pipeline import Pipeline
import warnings
warnings.filterwarnings('ignore')

# ── 1. Load Data ─────────────────────────────────────────────────────
housing = fetch_california_housing(as_frame=True)
df = housing.frame
print(df.describe().round(2))
print(f"\nTarget: Median house value ($100k)")
print(f"Shape: {df.shape}")

# ── 2. EDA ───────────────────────────────────────────────────────────
fig, axes = plt.subplots(2, 4, figsize=(16, 8))
for ax, col in zip(axes.flat, df.columns):
    ax.hist(df[col], bins=40, color='steelblue', edgecolor='white', alpha=0.8)
    ax.set_title(col, fontsize=10)
    ax.set_xlabel('')
plt.suptitle('Feature Distributions', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

# Correlation heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(df.corr(), annot=True, fmt='.2f', cmap='coolwarm',
            center=0, square=True)
plt.title('Feature Correlation Matrix')
plt.tight_layout()
plt.show()

# ── 3. Prepare Data ──────────────────────────────────────────────────
X = df.drop('MedHouseVal', axis=1)
y = df['MedHouseVal']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s  = scaler.transform(X_test)

# ── 4. OLS Implementation from Scratch ───────────────────────────────
def ols_solution(X, y):
    """Closed-form OLS: β = (XᵀX)⁻¹ Xᵀy"""
    X_b = np.column_stack([np.ones(len(X)), X])   # add intercept column
    beta = np.linalg.pinv(X_b.T @ X_b) @ X_b.T @ y
    return beta[0], beta[1:]   # intercept, coefficients

intercept_ols, coefs_ols = ols_solution(X_train_s, y_train.values)
print(f"\nOLS Intercept : {intercept_ols:.4f}")
coef_df = pd.DataFrame({'Feature': X.columns, 'OLS Coef': coefs_ols})
print(coef_df.sort_values('OLS Coef', key=abs, ascending=False))

# ── 5. Sklearn Linear Regression ─────────────────────────────────────
lr = LinearRegression()
lr.fit(X_train_s, y_train)
y_pred_lr = lr.predict(X_test_s)

def evaluate(name, y_true, y_pred):
    mse  = mean_squared_error(y_true, y_pred)
    mae  = mean_absolute_error(y_true, y_pred)
    r2   = r2_score(y_true, y_pred)
    rmse = np.sqrt(mse)
    print(f"\n{'─'*40}")
    print(f"  {name}")
    print(f"{'─'*40}")
    print(f"  RMSE  : {rmse:.4f}")
    print(f"  MAE   : {mae:.4f}")
    print(f"  R²    : {r2:.4f}")
    return {'RMSE': rmse, 'MAE': mae, 'R²': r2}

res_lr = evaluate("Linear Regression (OLS)", y_test, y_pred_lr)

# ── 6. Regularised Models ─────────────────────────────────────────────
alphas_to_try = {
    'Ridge (L2)':   Ridge(alpha=1.0),
    'Lasso (L1)':   Lasso(alpha=0.01),
    'ElasticNet':   ElasticNet(alpha=0.01, l1_ratio=0.5),
}

all_results = {'OLS': res_lr}
for name, model in alphas_to_try.items():
    model.fit(X_train_s, y_train)
    y_pred = model.predict(X_test_s)
    all_results[name] = evaluate(name, y_test, y_pred)

# ── 7. Visualise Coefficient Shrinkage ───────────────────────────────
coef_comparison = pd.DataFrame({
    'Feature':   X.columns,
    'OLS':       lr.coef_,
    'Ridge':     Ridge(alpha=1.0).fit(X_train_s, y_train).coef_,
    'Lasso':     Lasso(alpha=0.01).fit(X_train_s, y_train).coef_,
})
coef_comparison = coef_comparison.set_index('Feature')

coef_comparison.plot(kind='bar', figsize=(12, 5), width=0.7)
plt.title('Coefficient Comparison: OLS vs Ridge vs Lasso', fontsize=13)
plt.ylabel('Coefficient Value')
plt.axhline(0, color='black', linewidth=0.8, linestyle='--')
plt.xticks(rotation=30, ha='right')
plt.legend()
plt.tight_layout()
plt.show()

# ── 8. Residual Diagnostics ───────────────────────────────────────────
residuals = y_test.values - y_pred_lr

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Residuals vs Fitted
axes[0].scatter(y_pred_lr, residuals, alpha=0.4, color='steelblue', s=10)
axes[0].axhline(0, color='red', linewidth=1.5, linestyle='--')
axes[0].set_xlabel('Fitted Values')
axes[0].set_ylabel('Residuals')
axes[0].set_title('Residuals vs Fitted')
axes[0].grid(True, alpha=0.3)

# Distribution of residuals
axes[1].hist(residuals, bins=50, color='steelblue', edgecolor='white', alpha=0.8)
axes[1].set_xlabel('Residual')
axes[1].set_title('Residual Distribution')
axes[1].grid(True, alpha=0.3)

# Q-Q plot
from scipy import stats
stats.probplot(residuals, dist='norm', plot=axes[2])
axes[2].set_title('Normal Q-Q Plot')
axes[2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# ── 9. Polynomial Regression ──────────────────────────────────────────
# Use just 1 feature (MedInc) to visualise
X_single = df[['MedInc']].values
y_single  = df['MedHouseVal'].values

plt.figure(figsize=(12, 4))
for deg, color in zip([1, 2, 3], ['#3b82f6', '#10b981', '#f59e0b']):
    pipe = Pipeline([
        ('poly', PolynomialFeatures(degree=deg, include_bias=False)),
        ('scaler', StandardScaler()),
        ('reg', LinearRegression()),
    ])
    pipe.fit(X_single, y_single)
    X_plot = np.linspace(X_single.min(), X_single.max(), 300).reshape(-1, 1)
    y_plot = pipe.predict(X_plot)
    plt.plot(X_plot, y_plot, color=color, linewidth=2.5, label=f'Degree {deg}')

plt.scatter(X_single[::20], y_single[::20], alpha=0.2, color='gray', s=10)
plt.xlabel('Median Income')
plt.ylabel('Median House Value')
plt.title('Polynomial Regression — MedInc vs MedHouseVal')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

Output:

────────────────────────────────────────
  Linear Regression (OLS)
────────────────────────────────────────
  RMSE  : 0.7378
  MAE   : 0.5324
  R²    : 0.5757

────────────────────────────────────────
  Ridge (L2)
────────────────────────────────────────
  RMSE  : 0.7378
  MAE   : 0.5323
  R²    : 0.5757

Regularisation: Ridge vs Lasso vs ElasticNet

Ridge (L2):L=RSS+αj=1pβj2\text{Ridge (L2):} \quad \mathcal{L} = \text{RSS} + \alpha\sum_{j=1}^{p}\beta_j^2

Lasso (L1):L=RSS+αj=1pβj\text{Lasso (L1):} \quad \mathcal{L} = \text{RSS} + \alpha\sum_{j=1}^{p}|\beta_j|

ElasticNet:L=RSS+αλβj+α(1λ)2βj2\text{ElasticNet:} \quad \mathcal{L} = \text{RSS} + \alpha\lambda\sum|\beta_j| + \frac{\alpha(1-\lambda)}{2}\sum\beta_j^2

MethodEffectBest When
RidgeShrinks all coefficients towards 0All features contribute, multicollinearity
LassoDrives some coefficients to exactly 0 (feature selection)Sparse data, many irrelevant features
ElasticNetCombines bothHigh-dimensional, correlated features

Evaluation Metrics Summary

MetricFormulaInterpretation
RMSEsqrt(mean squared errors)Same units as y; penalises large errors
MAEmean of absolute errorsRobust to outliers
1 - SS_res / SS_tot1 = perfect, 0 = predicts mean, less than 0 = worse than mean
Adj R²1 - (1 - R²)(n-1)/(n-p-1)Penalises extra features; use for model comparison

Key Takeaways

  1. OLS gives the closed-form best-fit line by minimising RSS
  2. Assumptions: Linearity, independence, normality, homoscedasticity
  3. Ridge reduces variance without removing features; Lasso does feature selection
  4. Always scale features before applying regularisation
  5. Residual plots are the most important diagnostic — check them first

Advertisement

Need Expert Data Science Help?

Get personalized tutoring, project support, or professional consulting.

Advertisement