Linear Regression: Math + Code + Assumptions

Overview

Linear regression is the foundational algorithm in machine learning and statistics for modeling relationships between variables. Despite its simplicity, understanding linear regression deeply provides insight into more complex algorithms.

Simple Linear Regression

Mathematical Formulation

The simplest form models a linear relationship between one predictor $x$ and response $y$ :

Simple Linear Regression Model

y = \beta_0 + \beta_1 x + \epsilon

Here,

$\beta_0$ =Intercept (y-intercept)
$\beta_1$ =Slope (coefficient)
$\epsilon$ =Error term, \epsilon \sim \mathcal{N}(0, \sigma^2)

Predicted value:

\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x

Visual Representation

Architecture Diagram

y
^
|                           *
|                       *
|                   *    /  Regression Line: y = β₀ + β₁x
|               *    /
|           *    /
|       *    /
|   *    /
|*    /
|   /  <-- residuals
| /
+---------------------------> x

Residual: e_i = y_i - y_hat_i

Ordinary Least Squares (OLS) Derivation

Objective Function

DfOLS Objective

Minimize the sum of squared residuals:

\min_{\beta_0, \beta_1} S(\beta_0, \beta_1) = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2

Closed-Form Solution

ℹ️ Derivation Steps

Step 1: Partial derivative with respect to $\beta_0$ :

\frac{\partial S}{\partial \beta_0} = -2 \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i) = 0

Solving for $\beta_0$ :

\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}

Step 2: Partial derivative with respect to $\beta_1$ :

\frac{\partial S}{\partial \beta_1} = -2 \sum_{i=1}^{n} x_i(y_i - \beta_0 - \beta_1 x_i) = 0

Step 3: Solve for $\beta_1$ :

\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{\text{Cov}(x, y)}{\text{Var}(x)}

Matrix Form (Multiple Regression)

For multiple features $\mathbf{X} \in \mathbb{R}^{n \times d}$ :

\hat{\boldsymbol{\beta}} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}

ThGauss-Markov Theorem

Under the assumptions of the classical linear regression model (linearity, independence, homoscedasticity, and zero conditional mean of errors), the OLS estimator $\hat{\\boldsymbol{\\beta}}$ is the Best Linear Unbiased Estimator (BLUE). That is, among all linear unbiased estimators, OLS has the minimum variance.

\text{Var}(\tilde{\boldsymbol{\beta}}) \geq \text{Var}(\hat{\boldsymbol{\beta}}_{\text{OLS}})

for any other linear unbiased estimator $\tilde{\\boldsymbol{\\beta}}$ .

import numpy as np

def ols_fit(X, y):
    """Ordinary Least Squares using normal equation."""
    X_b = np.c_[np.ones((X.shape[0], 1)), X]
    beta = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y
    return beta

X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2, 4, 5, 4, 5])
beta = ols_fit(X, y)
print(f"Intercept: {beta[0]:.2f}, Slope: {beta[1]:.2f}")

Gradient Descent

When OLS is computationally expensive (large datasets), use iterative optimization:

Algorithm

Gradient Descent Update Rule

\beta_j^{(t+1)} = \beta_j^{(t)} - \alpha \frac{\partial J}{\partial \beta_j}

Here,

$\beta_j^{(t)}$ =Parameter value at iteration t
$\alpha$ =Learning rate
$\frac{\partial J}{\partial \beta_j}$ =Partial derivative of cost

Gradient Update Rule

\frac{\partial J}{\partial \beta_j} = -\frac{2}{n} \sum_{i=1}^{n} x_{ij}(y_i - \hat{y}_i)

Convergence Visualization

Architecture Diagram

J(β)
^
|\
| \        Learning Rate Too High
|  \      (diverges)
|   \
|    \    ___
|     \  /   \  ___
|      \/     \/    \  Learning Rate Just Right
|                   \_________________________
|
|           _______________  Learning Rate Too Small
|          /               (slow convergence)
+----------------------------------------------> Iterations

def gradient_descent(X, y, learning_rate=0.01, n_iterations=1000):
    """Batch Gradient Descent for Linear Regression."""
    n_samples, n_features = X.shape
    beta = np.zeros(n_features + 1)
    X_b = np.c_[np.ones((n_samples, 1)), X]
    cost_history = []

    for iteration in range(n_iterations):
        y_pred = X_b @ beta
        cost = np.mean((y_pred - y) ** 2)
        cost_history.append(cost)
        gradients = (2 / n_samples) * X_b.T @ (y_pred - y)
        beta -= learning_rate * gradients

    return beta, cost_history

np.random.seed(42)
X = np.random.randn(100, 3)
y = 3 * X[:, 0] + 2 * X[:, 1] - 1 * X[:, 2] + 5 + np.random.randn(100) * 0.5

beta_gd, costs = gradient_descent(X, y, learning_rate=0.1, n_iterations=100)
print(f"Learned coefficients: {beta_gd}")

Multiple Linear Regression

Extends simple linear regression to multiple features:

Multiple Linear Regression

y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_d x_d + \epsilon

Here,

$\beta_j$ =Coefficient for feature j
$d$ =Number of features

Matrix notation: $\mathbf{y} = \mathbf{X}\\boldsymbol{\\beta} + \\boldsymbol{\\epsilon}$

💡 Coefficient Interpretation

$\beta_j$ represents the expected change in $y$ for a one-unit increase in $x_j$ , holding all other features constant. This is the ceteris paribus interpretation essential for causal analysis.

Feature Scaling for Gradient Descent

Without scaling, features with different magnitudes cause slow convergence:

x_{\text{scaled}} = \frac{x - \mu}{\sigma}

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

Assumptions of Linear Regression

1. Linearity

The relationship between $x$ and $y$ is linear:

Linearity Assumption

\mathbb{E}[y|x] = \beta_0 + \beta_1 x

Here,

$\mathbb{E}[y|x]$ =Expected value of y given x

Test: Residual plots should show no pattern.

2. Independence of Errors

Errors are independent (no autocorrelation).

Test: Durbin-Watson statistic $\approx 2$ indicates no autocorrelation.

3. Homoscedasticity

Constant variance of errors across all levels of predictors:

Homoscedasticity Assumption

\text{Var}(\epsilon_i) = \sigma^2 \quad \forall \, i

Here,

$\epsilon_i$ =Error term for observation i
$\sigma^2$ =Constant error variance

4. Normality of Errors

Errors are normally distributed:

\epsilon_i \sim \mathcal{N}(0, \sigma^2)

Test: Q-Q plot or Shapiro-Wilk test.

5. No Multicollinearity

Predictors are not highly correlated.

⚠️ Variance Inflation Factor

Test: VIF: $\text{VIF}_j = \frac{1}{1 - R_j^2}$

VIF Value	Interpretation
1	No correlation
1-5	Moderate correlation
5-10	High correlation
> 10	Severe multicollinearity

Model Evaluation Metrics

R-squared ( $R^2$ )

R-squared

R^2 = 1 - \frac{SS_{\text{res}}}{SS_{\text{tot}}} = 1 - \frac{\sum_{i}(y_i - \hat{y}_i)^2}{\sum_{i}(y_i - \bar{y})^2}

Here,

$SS_{\text{res}}$ =Sum of squared residuals
$SS_{\text{tot}}$ =Total sum of squares

$R^2 = 1$ : Perfect prediction
$R^2 = 0$ : Model predicts mean only
$R^2 < 0$ : Model worse than mean

Adjusted R-squared

Penalizes for adding irrelevant features:

R^2_{\text{adj}} = 1 - \frac{(1 - R^2)(n-1)}{n-p-1}

where $n$ = samples, $p$ = features.

Root Mean Squared Error (RMSE)

\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}

Mean Absolute Error (MAE)

\text{MAE} = \frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|

Comparison

Metric	Sensitivity to Outliers	Interpretability
RMSE	High	Same units as y
MAE	Low	Same units as y
$R^2$	Medium	Scale-free (0-1)

Complete Python Implementation

📝Linear Regression with Regularization

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

np.random.seed(42)
n = 500

data = pd.DataFrame({
    'square_feet': np.random.uniform(800, 4000, n),
    'bedrooms': np.random.randint(1, 6, n),
    'bathrooms': np.random.randint(1, 4, n),
    'age': np.random.uniform(0, 50, n),
    'garage': np.random.randint(0, 3, n)
})

data['price'] = (
    50000 +
    150 * data['square_feet'] +
    10000 * data['bedrooms'] +
    15000 * data['bathrooms'] -
    1000 * data['age'] +
    20000 * data['garage'] +
    np.random.normal(0, 20000, n)
)

X = data.drop('price', axis=1)
y = data['price']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

lr = LinearRegression()
lr.fit(X_train_scaled, y_train)
y_pred_lr = lr.predict(X_test_scaled)

print("=== Linear Regression ===")
print(f"R² Score: {r2_score(y_test, y_pred_lr):.4f}")
print(f"RMSE: ${np.sqrt(mean_squared_error(y_test, y_pred_lr)):,.2f}")
print(f"MAE: ${mean_absolute_error(y_test, y_pred_lr):,.2f}")

ridge = Ridge(alpha=1.0)
ridge.fit(X_train_scaled, y_train)
y_pred_ridge = ridge.predict(X_test_scaled)

print("\n=== Ridge Regression (L2) ===")
print(f"R² Score: {r2_score(y_test, y_pred_ridge):.4f}")

lasso = Lasso(alpha=100)
lasso.fit(X_train_scaled, y_train)
y_pred_lasso = lasso.predict(X_test_scaled)

print("\n=== Lasso Regression (L1) ===")
print(f"R² Score: {r2_score(y_test, y_pred_lasso):.4f}")
for name, coef in zip(X.columns, lasso.coef_):
    print(f"  {name}: ${coef:,.2f} {'(selected)' if abs(coef) > 0 else '(dropped)'}")

print("\n=== Cross-Validation (5-fold) ===")
models = {'Linear': LinearRegression(), 'Ridge': Ridge(alpha=1.0), 'Lasso': Lasso(alpha=100)}
for name, model in models.items():
    scores = cross_val_score(model, scaler.transform(X), y, cv=5, scoring='r2')
    print(f"{name}: R² = {scores.mean():.4f} ± {scores.std():.4f}")

Key Takeaways

📋Summary: Linear Regression

OLS gives exact solution: $\hat{\\beta} = (X^TX)^{-1}X^Ty$ when features aren't too correlated
Gradient Descent scales better but requires feature scaling and learning rate tuning
Gauss-Markov Theorem: OLS is BLUE under classical assumptions
Five assumptions must be checked: linearity, independence, homoscedasticity, normality, no multicollinearity
Regularization (Ridge/Lasso) prevents overfitting and handles multicollinearity
$R^2$ alone is insufficient: Always check residuals and use multiple metrics

Practice Exercises

Exercise 1: Manual Calculation

Given data points: $(1, 2), (2, 4), (3, 5), (4, 4), (5, 5)$

a) Calculate $\hat{\\beta}_0$ and $\hat{\\beta}_1$ by hand
b) What is the predicted value for $x = 6$ ?
c) Calculate $R^2$

Exercise 2: Implement from Scratch

class SimpleLinearRegression:
    def fit(self, X, y):
        # Calculate β₀ and β₁ using formulas
        pass

    def predict(self, X):
        # Return predictions
        pass

Exercise 3: Assumption Checking

from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing()
# a) Fit a linear regression
# b) Check all 5 assumptions
# c) Identify which assumptions are violated
# d) Propose solutions

Exercise 4: Regularization Effect

alphas = [0.001, 0.01, 0.1, 1, 10, 100, 1000]
# Plot train/test R² vs alpha
# Find the optimal alpha

Discussion Questions

When might linear regression still be preferred over complex models like neural networks?
How does multicollinearity affect coefficient interpretation?
Why might a lower $R^2$ model be preferable in some cases?

Linear Regression: Math + Code + Assumptions

Linear Regression: Math + Code + Assumptions

Overview

Simple Linear Regression

Mathematical Formulation

Simple Linear Regression Model

Visual Representation

Ordinary Least Squares (OLS) Derivation

Objective Function

DfOLS Objective

Closed-Form Solution

Matrix Form (Multiple Regression)

ThGauss-Markov Theorem

Gradient Descent

Algorithm

Gradient Descent Update Rule

Gradient Update Rule

Convergence Visualization

Multiple Linear Regression

Multiple Linear Regression

Feature Scaling for Gradient Descent

Assumptions of Linear Regression

1. Linearity

Linearity Assumption

2. Independence of Errors

3. Homoscedasticity

Homoscedasticity Assumption

4. Normality of Errors

5. No Multicollinearity

Model Evaluation Metrics

R-squared (R2R^2R2)

R-squared

Adjusted R-squared

Root Mean Squared Error (RMSE)

Mean Absolute Error (MAE)

Comparison

Complete Python Implementation

📝Linear Regression with Regularization

Key Takeaways

📋Summary: Linear Regression

Practice Exercises

Exercise 1: Manual Calculation

Exercise 2: Implement from Scratch

Exercise 3: Assumption Checking

Exercise 4: Regularization Effect

Discussion Questions

Need Expert Data Science Help?

R-squared ( $R^2$ )