Linear Regression: Math + Code + Assumptions

Module 2: Machine LearningFree Lesson

Advertisement

Linear Regression: Math + Code + Assumptions

Overview

Linear regression is the foundational algorithm in machine learning and statistics for modeling relationships between variables. Despite its simplicity, understanding linear regression deeply provides insight into more complex algorithms.


Simple Linear Regression

Mathematical Formulation

The simplest form models a linear relationship between one predictor xx and response yy:

Simple Linear Regression Model

y=β0+β1x+ϵy = \beta_0 + \beta_1 x + \epsilon

Here,

  • β0\beta_0=Intercept (y-intercept)
  • β1\beta_1=Slope (coefficient)
  • ϵ\epsilon=Error term, \epsilon \sim \mathcal{N}(0, \sigma^2)

Predicted value:

y^=β^0+β^1x\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x

Visual Representation

Architecture Diagram
y
^
|                           *
|                       *
|                   *    /  Regression Line: y = β₀ + β₁x
|               *    /
|           *    /
|       *    /
|   *    /
|*    /
|   /  <-- residuals
| /
+---------------------------> x

Residual: e_i = y_i - y_hat_i

Ordinary Least Squares (OLS) Derivation

Objective Function

DfOLS Objective

Minimize the sum of squared residuals:

minβ0,β1S(β0,β1)=i=1n(yiβ0β1xi)2\min_{\beta_0, \beta_1} S(\beta_0, \beta_1) = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2

Closed-Form Solution

ℹ️ Derivation Steps

Step 1: Partial derivative with respect to β0\beta_0:

Sβ0=2i=1n(yiβ0β1xi)=0\frac{\partial S}{\partial \beta_0} = -2 \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i) = 0

Solving for β0\beta_0:

β^0=yˉβ^1xˉ\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}

Step 2: Partial derivative with respect to β1\beta_1:

Sβ1=2i=1nxi(yiβ0β1xi)=0\frac{\partial S}{\partial \beta_1} = -2 \sum_{i=1}^{n} x_i(y_i - \beta_0 - \beta_1 x_i) = 0

Step 3: Solve for β1\beta_1:

β^1=i=1n(xixˉ)(yiyˉ)i=1n(xixˉ)2=Cov(x,y)Var(x)\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{\text{Cov}(x, y)}{\text{Var}(x)}

Matrix Form (Multiple Regression)

For multiple features XRn×d\mathbf{X} \in \mathbb{R}^{n \times d}:

β^=(XTX)1XTy\hat{\boldsymbol{\beta}} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}

ThGauss-Markov Theorem

Under the assumptions of the classical linear regression model (linearity, independence, homoscedasticity, and zero conditional mean of errors), the OLS estimator boldsymbolbeta^\hat{\\boldsymbol{\\beta}} is the Best Linear Unbiased Estimator (BLUE). That is, among all linear unbiased estimators, OLS has the minimum variance.

Var(β~)Var(β^OLS)\text{Var}(\tilde{\boldsymbol{\beta}}) \geq \text{Var}(\hat{\boldsymbol{\beta}}_{\text{OLS}})

for any other linear unbiased estimator boldsymbolbeta~\tilde{\\boldsymbol{\\beta}}.

import numpy as np

def ols_fit(X, y):
    """Ordinary Least Squares using normal equation."""
    X_b = np.c_[np.ones((X.shape[0], 1)), X]
    beta = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y
    return beta

X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2, 4, 5, 4, 5])
beta = ols_fit(X, y)
print(f"Intercept: {beta[0]:.2f}, Slope: {beta[1]:.2f}")

Gradient Descent

When OLS is computationally expensive (large datasets), use iterative optimization:

Algorithm

Gradient Descent Update Rule

βj(t+1)=βj(t)αJβj\beta_j^{(t+1)} = \beta_j^{(t)} - \alpha \frac{\partial J}{\partial \beta_j}

Here,

  • βj(t)\beta_j^{(t)}=Parameter value at iteration t
  • α\alpha=Learning rate
  • Jβj\frac{\partial J}{\partial \beta_j}=Partial derivative of cost

Gradient Update Rule

Jβj=2ni=1nxij(yiy^i)\frac{\partial J}{\partial \beta_j} = -\frac{2}{n} \sum_{i=1}^{n} x_{ij}(y_i - \hat{y}_i)

Convergence Visualization

Architecture Diagram
J(β)
^
|\
| \        Learning Rate Too High
|  \      (diverges)
|   \
|    \    ___
|     \  /   \  ___
|      \/     \/    \  Learning Rate Just Right
|                   \_________________________
|
|           _______________  Learning Rate Too Small
|          /               (slow convergence)
+----------------------------------------------> Iterations
def gradient_descent(X, y, learning_rate=0.01, n_iterations=1000):
    """Batch Gradient Descent for Linear Regression."""
    n_samples, n_features = X.shape
    beta = np.zeros(n_features + 1)
    X_b = np.c_[np.ones((n_samples, 1)), X]
    cost_history = []

    for iteration in range(n_iterations):
        y_pred = X_b @ beta
        cost = np.mean((y_pred - y) ** 2)
        cost_history.append(cost)
        gradients = (2 / n_samples) * X_b.T @ (y_pred - y)
        beta -= learning_rate * gradients

    return beta, cost_history

np.random.seed(42)
X = np.random.randn(100, 3)
y = 3 * X[:, 0] + 2 * X[:, 1] - 1 * X[:, 2] + 5 + np.random.randn(100) * 0.5

beta_gd, costs = gradient_descent(X, y, learning_rate=0.1, n_iterations=100)
print(f"Learned coefficients: {beta_gd}")

Multiple Linear Regression

Extends simple linear regression to multiple features:

Multiple Linear Regression

y=β0+β1x1+β2x2++βdxd+ϵy = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_d x_d + \epsilon

Here,

  • βj\beta_j=Coefficient for feature j
  • dd=Number of features

Matrix notation: y=Xboldsymbolbeta+boldsymbolepsilon\mathbf{y} = \mathbf{X}\\boldsymbol{\\beta} + \\boldsymbol{\\epsilon}

💡 Coefficient Interpretation

βj\beta_j represents the expected change in yy for a one-unit increase in xjx_j, holding all other features constant. This is the ceteris paribus interpretation essential for causal analysis.

Feature Scaling for Gradient Descent

Without scaling, features with different magnitudes cause slow convergence:

xscaled=xμσx_{\text{scaled}} = \frac{x - \mu}{\sigma}
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

Assumptions of Linear Regression

1. Linearity

The relationship between xx and yy is linear:

Linearity Assumption

E[yx]=β0+β1x\mathbb{E}[y|x] = \beta_0 + \beta_1 x

Here,

  • E[yx]\mathbb{E}[y|x]=Expected value of y given x

Test: Residual plots should show no pattern.

2. Independence of Errors

Errors are independent (no autocorrelation).

Test: Durbin-Watson statistic 2\approx 2 indicates no autocorrelation.

3. Homoscedasticity

Constant variance of errors across all levels of predictors:

Homoscedasticity Assumption

Var(ϵi)=σ2i\text{Var}(\epsilon_i) = \sigma^2 \quad \forall \, i

Here,

  • ϵi\epsilon_i=Error term for observation i
  • σ2\sigma^2=Constant error variance

4. Normality of Errors

Errors are normally distributed:

ϵiN(0,σ2)\epsilon_i \sim \mathcal{N}(0, \sigma^2)

Test: Q-Q plot or Shapiro-Wilk test.

5. No Multicollinearity

Predictors are not highly correlated.

⚠️ Variance Inflation Factor

Test: VIF: VIFj=11Rj2\text{VIF}_j = \frac{1}{1 - R_j^2}

VIF ValueInterpretation
1No correlation
1-5Moderate correlation
5-10High correlation
> 10Severe multicollinearity

Model Evaluation Metrics

R-squared (R2R^2)

R-squared

R2=1SSresSStot=1i(yiy^i)2i(yiyˉ)2R^2 = 1 - \frac{SS_{\text{res}}}{SS_{\text{tot}}} = 1 - \frac{\sum_{i}(y_i - \hat{y}_i)^2}{\sum_{i}(y_i - \bar{y})^2}

Here,

  • SSresSS_{\text{res}}=Sum of squared residuals
  • SStotSS_{\text{tot}}=Total sum of squares
  • R2=1R^2 = 1: Perfect prediction
  • R2=0R^2 = 0: Model predicts mean only
  • R2<0R^2 < 0: Model worse than mean

Adjusted R-squared

Penalizes for adding irrelevant features:

Radj2=1(1R2)(n1)np1R^2_{\text{adj}} = 1 - \frac{(1 - R^2)(n-1)}{n-p-1}

where nn = samples, pp = features.

Root Mean Squared Error (RMSE)

RMSE=1ni=1n(yiy^i)2\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}

Mean Absolute Error (MAE)

MAE=1ni=1nyiy^i\text{MAE} = \frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|

Comparison

MetricSensitivity to OutliersInterpretability
RMSEHighSame units as y
MAELowSame units as y
R2R^2MediumScale-free (0-1)

Complete Python Implementation

📝Linear Regression with Regularization

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

np.random.seed(42)
n = 500

data = pd.DataFrame({
    'square_feet': np.random.uniform(800, 4000, n),
    'bedrooms': np.random.randint(1, 6, n),
    'bathrooms': np.random.randint(1, 4, n),
    'age': np.random.uniform(0, 50, n),
    'garage': np.random.randint(0, 3, n)
})

data['price'] = (
    50000 +
    150 * data['square_feet'] +
    10000 * data['bedrooms'] +
    15000 * data['bathrooms'] -
    1000 * data['age'] +
    20000 * data['garage'] +
    np.random.normal(0, 20000, n)
)

X = data.drop('price', axis=1)
y = data['price']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

lr = LinearRegression()
lr.fit(X_train_scaled, y_train)
y_pred_lr = lr.predict(X_test_scaled)

print("=== Linear Regression ===")
print(f"R² Score: {r2_score(y_test, y_pred_lr):.4f}")
print(f"RMSE: ${np.sqrt(mean_squared_error(y_test, y_pred_lr)):,.2f}")
print(f"MAE: ${mean_absolute_error(y_test, y_pred_lr):,.2f}")

ridge = Ridge(alpha=1.0)
ridge.fit(X_train_scaled, y_train)
y_pred_ridge = ridge.predict(X_test_scaled)

print("\n=== Ridge Regression (L2) ===")
print(f"R² Score: {r2_score(y_test, y_pred_ridge):.4f}")

lasso = Lasso(alpha=100)
lasso.fit(X_train_scaled, y_train)
y_pred_lasso = lasso.predict(X_test_scaled)

print("\n=== Lasso Regression (L1) ===")
print(f"R² Score: {r2_score(y_test, y_pred_lasso):.4f}")
for name, coef in zip(X.columns, lasso.coef_):
    print(f"  {name}: ${coef:,.2f} {'(selected)' if abs(coef) > 0 else '(dropped)'}")

print("\n=== Cross-Validation (5-fold) ===")
models = {'Linear': LinearRegression(), 'Ridge': Ridge(alpha=1.0), 'Lasso': Lasso(alpha=100)}
for name, model in models.items():
    scores = cross_val_score(model, scaler.transform(X), y, cv=5, scoring='r2')
    print(f"{name}: R² = {scores.mean():.4f} ± {scores.std():.4f}")

Key Takeaways

📋Summary: Linear Regression

  1. OLS gives exact solution: beta^=(XTX)1XTy\hat{\\beta} = (X^TX)^{-1}X^Ty when features aren't too correlated
  2. Gradient Descent scales better but requires feature scaling and learning rate tuning
  3. Gauss-Markov Theorem: OLS is BLUE under classical assumptions
  4. Five assumptions must be checked: linearity, independence, homoscedasticity, normality, no multicollinearity
  5. Regularization (Ridge/Lasso) prevents overfitting and handles multicollinearity
  6. R2R^2 alone is insufficient: Always check residuals and use multiple metrics

Practice Exercises

Exercise 1: Manual Calculation

Given data points: (1,2),(2,4),(3,5),(4,4),(5,5)(1, 2), (2, 4), (3, 5), (4, 4), (5, 5)

  • a) Calculate beta^0\hat{\\beta}_0 and beta^1\hat{\\beta}_1 by hand
  • b) What is the predicted value for x=6x = 6?
  • c) Calculate R2R^2

Exercise 2: Implement from Scratch

class SimpleLinearRegression:
    def fit(self, X, y):
        # Calculate β₀ and β₁ using formulas
        pass

    def predict(self, X):
        # Return predictions
        pass

Exercise 3: Assumption Checking

from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing()
# a) Fit a linear regression
# b) Check all 5 assumptions
# c) Identify which assumptions are violated
# d) Propose solutions

Exercise 4: Regularization Effect

alphas = [0.001, 0.01, 0.1, 1, 10, 100, 1000]
# Plot train/test R² vs alpha
# Find the optimal alpha

Discussion Questions

  1. When might linear regression still be preferred over complex models like neural networks?
  2. How does multicollinearity affect coefficient interpretation?
  3. Why might a lower R2R^2 model be preferable in some cases?

Advertisement

Need Expert Data Science Help?

Get personalized tutoring, project support, or professional consulting.

Advertisement