Linear Regression: Math + Code + Assumptions
Overview
Linear regression is the foundational algorithm in machine learning and statistics for modeling relationships between variables. Despite its simplicity, understanding linear regression deeply provides insight into more complex algorithms.
Simple Linear Regression
Mathematical Formulation
The simplest form models a linear relationship between one predictor and response :
Simple Linear Regression Model
Here,
- =Intercept (y-intercept)
- =Slope (coefficient)
- =Error term, \epsilon \sim \mathcal{N}(0, \sigma^2)
Predicted value:
Visual Representation
y
^
| *
| *
| * / Regression Line: y = β₀ + β₁x
| * /
| * /
| * /
| * /
|* /
| / <-- residuals
| /
+---------------------------> x
Residual: e_i = y_i - y_hat_i
Ordinary Least Squares (OLS) Derivation
Objective Function
DfOLS Objective
Minimize the sum of squared residuals:
Closed-Form Solution
ℹ️ Derivation Steps
Step 1: Partial derivative with respect to :
Solving for :
Step 2: Partial derivative with respect to :
Step 3: Solve for :
Matrix Form (Multiple Regression)
For multiple features :
ThGauss-Markov Theorem
Under the assumptions of the classical linear regression model (linearity, independence, homoscedasticity, and zero conditional mean of errors), the OLS estimator is the Best Linear Unbiased Estimator (BLUE). That is, among all linear unbiased estimators, OLS has the minimum variance.
for any other linear unbiased estimator .
import numpy as np
def ols_fit(X, y):
"""Ordinary Least Squares using normal equation."""
X_b = np.c_[np.ones((X.shape[0], 1)), X]
beta = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y
return beta
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2, 4, 5, 4, 5])
beta = ols_fit(X, y)
print(f"Intercept: {beta[0]:.2f}, Slope: {beta[1]:.2f}")
Gradient Descent
When OLS is computationally expensive (large datasets), use iterative optimization:
Algorithm
Gradient Descent Update Rule
Here,
- =Parameter value at iteration t
- =Learning rate
- =Partial derivative of cost
Gradient Update Rule
Convergence Visualization
J(β)
^
|\
| \ Learning Rate Too High
| \ (diverges)
| \
| \ ___
| \ / \ ___
| \/ \/ \ Learning Rate Just Right
| \_________________________
|
| _______________ Learning Rate Too Small
| / (slow convergence)
+----------------------------------------------> Iterations
def gradient_descent(X, y, learning_rate=0.01, n_iterations=1000):
"""Batch Gradient Descent for Linear Regression."""
n_samples, n_features = X.shape
beta = np.zeros(n_features + 1)
X_b = np.c_[np.ones((n_samples, 1)), X]
cost_history = []
for iteration in range(n_iterations):
y_pred = X_b @ beta
cost = np.mean((y_pred - y) ** 2)
cost_history.append(cost)
gradients = (2 / n_samples) * X_b.T @ (y_pred - y)
beta -= learning_rate * gradients
return beta, cost_history
np.random.seed(42)
X = np.random.randn(100, 3)
y = 3 * X[:, 0] + 2 * X[:, 1] - 1 * X[:, 2] + 5 + np.random.randn(100) * 0.5
beta_gd, costs = gradient_descent(X, y, learning_rate=0.1, n_iterations=100)
print(f"Learned coefficients: {beta_gd}")
Multiple Linear Regression
Extends simple linear regression to multiple features:
Multiple Linear Regression
Here,
- =Coefficient for feature j
- =Number of features
Matrix notation:
💡 Coefficient Interpretation
represents the expected change in for a one-unit increase in , holding all other features constant. This is the ceteris paribus interpretation essential for causal analysis.
Feature Scaling for Gradient Descent
Without scaling, features with different magnitudes cause slow convergence:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
Assumptions of Linear Regression
1. Linearity
The relationship between and is linear:
Linearity Assumption
Here,
- =Expected value of y given x
Test: Residual plots should show no pattern.
2. Independence of Errors
Errors are independent (no autocorrelation).
Test: Durbin-Watson statistic indicates no autocorrelation.
3. Homoscedasticity
Constant variance of errors across all levels of predictors:
Homoscedasticity Assumption
Here,
- =Error term for observation i
- =Constant error variance
4. Normality of Errors
Errors are normally distributed:
Test: Q-Q plot or Shapiro-Wilk test.
5. No Multicollinearity
Predictors are not highly correlated.
⚠️ Variance Inflation Factor
Test: VIF:
| VIF Value | Interpretation |
|---|---|
| 1 | No correlation |
| 1-5 | Moderate correlation |
| 5-10 | High correlation |
| > 10 | Severe multicollinearity |
Model Evaluation Metrics
R-squared ()
R-squared
Here,
- =Sum of squared residuals
- =Total sum of squares
- : Perfect prediction
- : Model predicts mean only
- : Model worse than mean
Adjusted R-squared
Penalizes for adding irrelevant features:
where = samples, = features.
Root Mean Squared Error (RMSE)
Mean Absolute Error (MAE)
Comparison
| Metric | Sensitivity to Outliers | Interpretability |
|---|---|---|
| RMSE | High | Same units as y |
| MAE | Low | Same units as y |
| Medium | Scale-free (0-1) |
Complete Python Implementation
📝Linear Regression with Regularization
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
np.random.seed(42)
n = 500
data = pd.DataFrame({
'square_feet': np.random.uniform(800, 4000, n),
'bedrooms': np.random.randint(1, 6, n),
'bathrooms': np.random.randint(1, 4, n),
'age': np.random.uniform(0, 50, n),
'garage': np.random.randint(0, 3, n)
})
data['price'] = (
50000 +
150 * data['square_feet'] +
10000 * data['bedrooms'] +
15000 * data['bathrooms'] -
1000 * data['age'] +
20000 * data['garage'] +
np.random.normal(0, 20000, n)
)
X = data.drop('price', axis=1)
y = data['price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
lr = LinearRegression()
lr.fit(X_train_scaled, y_train)
y_pred_lr = lr.predict(X_test_scaled)
print("=== Linear Regression ===")
print(f"R² Score: {r2_score(y_test, y_pred_lr):.4f}")
print(f"RMSE: ${np.sqrt(mean_squared_error(y_test, y_pred_lr)):,.2f}")
print(f"MAE: ${mean_absolute_error(y_test, y_pred_lr):,.2f}")
ridge = Ridge(alpha=1.0)
ridge.fit(X_train_scaled, y_train)
y_pred_ridge = ridge.predict(X_test_scaled)
print("\n=== Ridge Regression (L2) ===")
print(f"R² Score: {r2_score(y_test, y_pred_ridge):.4f}")
lasso = Lasso(alpha=100)
lasso.fit(X_train_scaled, y_train)
y_pred_lasso = lasso.predict(X_test_scaled)
print("\n=== Lasso Regression (L1) ===")
print(f"R² Score: {r2_score(y_test, y_pred_lasso):.4f}")
for name, coef in zip(X.columns, lasso.coef_):
print(f" {name}: ${coef:,.2f} {'(selected)' if abs(coef) > 0 else '(dropped)'}")
print("\n=== Cross-Validation (5-fold) ===")
models = {'Linear': LinearRegression(), 'Ridge': Ridge(alpha=1.0), 'Lasso': Lasso(alpha=100)}
for name, model in models.items():
scores = cross_val_score(model, scaler.transform(X), y, cv=5, scoring='r2')
print(f"{name}: R² = {scores.mean():.4f} ± {scores.std():.4f}")
Key Takeaways
📋Summary: Linear Regression
- OLS gives exact solution: when features aren't too correlated
- Gradient Descent scales better but requires feature scaling and learning rate tuning
- Gauss-Markov Theorem: OLS is BLUE under classical assumptions
- Five assumptions must be checked: linearity, independence, homoscedasticity, normality, no multicollinearity
- Regularization (Ridge/Lasso) prevents overfitting and handles multicollinearity
- alone is insufficient: Always check residuals and use multiple metrics
Practice Exercises
Exercise 1: Manual Calculation
Given data points:
- a) Calculate and by hand
- b) What is the predicted value for ?
- c) Calculate
Exercise 2: Implement from Scratch
class SimpleLinearRegression:
def fit(self, X, y):
# Calculate β₀ and β₁ using formulas
pass
def predict(self, X):
# Return predictions
pass
Exercise 3: Assumption Checking
from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing()
# a) Fit a linear regression
# b) Check all 5 assumptions
# c) Identify which assumptions are violated
# d) Propose solutions
Exercise 4: Regularization Effect
alphas = [0.001, 0.01, 0.1, 1, 10, 100, 1000]
# Plot train/test R² vs alpha
# Find the optimal alpha
Discussion Questions
- When might linear regression still be preferred over complex models like neural networks?
- How does multicollinearity affect coefficient interpretation?
- Why might a lower model be preferable in some cases?