Linear Regression: Math, Code and Assumptions
The Foundation of Machine Learning
Linear regression is one of the most fundamental algorithms in ML. Despite its simplicity, understanding it deeply provides insight into many other supervised learning methods.
[Figure: ML Algorithm Landscape. Supervised learning algorithms grouped into four families: Linear (Linear Regression, Logistic Regression, Ridge/Lasso), Tree-Based (Decision Tree, Random Forest, XGBoost), Neural (Perceptron, MLP, Deep Learning), and Support Vector (Linear SVM, Kernel SVM, SVR). Linear regression is the foundation, so understand it first.]
1. Simple Linear Regression
Mathematical Formulation
Model:
$$y = \beta_0 + \beta_1 x + \epsilon$$
The fitted model predicts $\hat{y} = \beta_0 + \beta_1 x$; the error term $\epsilon$ belongs to the observed $y$, not to the prediction.
Where:
$\beta_0$ = intercept (bias): value of $y$ when $x = 0$
$\beta_1$ = slope (weight): change in $y$ for a unit change in $x$
$\epsilon$ = error term: $\epsilon \sim N(0, \sigma^2)$
[Figure: Simple Linear Regression: Finding the Best Fit Line. Axes: Feature (x) vs. Target (y). Shows the actual data points, the regression line with intercept β₀ and slope β₁, and the residuals (errors) eᵢ.]
2. Cost Function (Ordinary Least Squares)
Mean Squared Error (MSE):
$$J(\beta_0, \beta_1) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \frac{1}{n} \sum_{i=1}^{n} (y_i - (\beta_0 + \beta_1 x_i))^2$$
Goal: Find $\beta_0, \beta_1$ that minimize $J$
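The cost function can be evaluated directly for any candidate pair of parameters. The following is a minimal sketch with NumPy; the function name `mse_cost` and the toy data are illustrative assumptions, not part of the original article.

```python
import numpy as np

def mse_cost(beta0, beta1, x, y):
    """Mean squared error J(beta0, beta1) for a simple linear model."""
    y_hat = beta0 + beta1 * x          # predictions for every sample
    return np.mean((y - y_hat) ** 2)   # average squared residual

# Toy data lying exactly on y = 1 + 2x (illustrative values only)
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])

cost_true = mse_cost(1.0, 2.0, x, y)   # parameters that generated the data
cost_off = mse_cost(0.0, 2.0, x, y)    # intercept off by one

print(cost_true, cost_off)
```

With the generating parameters every residual is zero, so the cost is 0; shifting the intercept by one makes every residual equal to 1, so the cost rises to 1.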
Closed-Form Solution (Normal Equation):
$$\beta_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = \frac{\text{Cov}(X,Y)}{\text{Var}(X)}$$
$$\beta_0 = \bar{y} - \beta_1 \bar{x}$$
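The closed-form expressions translate almost line for line into NumPy. This is a sketch under assumed synthetic data (true intercept 4, true slope 3, unit-variance noise); `fit_simple_ols` is a hypothetical helper name, and `np.polyfit` is used only as an independent cross-check.

```python
import numpy as np

def fit_simple_ols(x, y):
    """Closed-form estimates for simple linear regression."""
    x_bar, y_bar = x.mean(), y.mean()
    # Slope: sum of centered cross-products over sum of centered squares
    beta1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    # Intercept: forces the line through the point (x_bar, y_bar)
    beta0 = y_bar - beta1 * x_bar
    return beta0, beta1

# Synthetic data (assumed for illustration): y = 4 + 3x + noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 2, size=200)
y = 4 + 3 * x + rng.normal(0, 1, size=200)

beta0_hat, beta1_hat = fit_simple_ols(x, y)

# Cross-check: np.polyfit returns coefficients from highest degree down
slope_ref, intercept_ref = np.polyfit(x, y, deg=1)

print(f"closed form: intercept={beta0_hat:.4f}, slope={beta1_hat:.4f}")
print(f"np.polyfit:  intercept={intercept_ref:.4f}, slope={slope_ref:.4f}")
```

Both routes minimize the same squared-error objective, so the two sets of estimates should agree up to floating-point error.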
[Figure: Cost Function: The Bowl-Shaped Surface. Plots J(β₀, β₁) against the parameters, with gradient descent paths leading to the global minimum.] The cost function is convex, so gradient descent with a suitable learning rate converges to the global minimum.
3. Gradient Descent
Update Rule:
$$\beta_j := \beta_j - \alpha \frac{\partial J}{\partial \beta_j}$$
Partial Derivatives:
$$\frac{\partial J}{\partial \beta_0} = -\frac{2}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)$$
$$\frac{\partial J}{\partial \beta_1} = -\frac{2}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i) \cdot x_i$$
Where $\alpha$ = learning rate (step size)
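The update rule and the two partial derivatives can be combined into a short batch gradient descent loop. This is a minimal sketch, assuming noise-free toy data on the line y = 4 + 3x, zero initialization, and a fixed iteration count; none of these choices come from the original article.

```python
import numpy as np

def gradient_descent(x, y, alpha=0.1, n_iters=2000):
    """Batch gradient descent on the MSE cost for simple linear regression."""
    beta0, beta1 = 0.0, 0.0            # assumed starting point
    n = len(x)
    for _ in range(n_iters):
        residual = y - (beta0 + beta1 * x)        # y_i - y_hat_i
        grad0 = -2.0 / n * np.sum(residual)       # dJ/d(beta0)
        grad1 = -2.0 / n * np.sum(residual * x)   # dJ/d(beta1)
        beta0 -= alpha * grad0                    # simultaneous update
        beta1 -= alpha * grad1
    return beta0, beta1

# Noise-free toy data so the minimizer is known exactly
x = np.linspace(0, 2, 50)
y = 4 + 3 * x

beta0_gd, beta1_gd = gradient_descent(x, y)
print(f"intercept={beta0_gd:.6f}, slope={beta1_gd:.6f}")
```

Both gradients are computed from the same residual vector before either parameter changes, which matches the simultaneous update implied by the rule above.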
[Figure: Gradient Descent: Learning Rate Impact. Three panels: α = 0.1 (good learning rate), α = 1.0 (too large, oscillates), α = 0.001 (too small, too slow).]
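The effect of the learning rate can also be checked numerically. The sketch below runs 50 gradient descent steps for each of the three values of α on an assumed noise-free dataset (y = 4 + 3x with x in [0, 2]) and compares the final cost. Which values diverge depends on the scale of the data, so the outcome for α = 1.0 holds for this toy setup rather than in general.

```python
import numpy as np

def cost_after_gd(x, y, alpha, n_iters=50):
    """Run gradient descent from zero and return the final MSE cost."""
    beta0, beta1 = 0.0, 0.0
    n = len(x)
    for _ in range(n_iters):
        residual = y - (beta0 + beta1 * x)
        # Gradients are -2/n * sum(...), so subtracting them adds these terms
        beta0 += alpha * 2.0 / n * np.sum(residual)
        beta1 += alpha * 2.0 / n * np.sum(residual * x)
    return np.mean((y - (beta0 + beta1 * x)) ** 2)

# Assumed toy data: noise-free line
x = np.linspace(0, 2, 50)
y = 4 + 3 * x

initial_cost = np.mean(y ** 2)   # cost at beta0 = beta1 = 0
costs = {alpha: cost_after_gd(x, y, alpha) for alpha in (0.001, 0.1, 1.0)}

print(f"initial cost: {initial_cost:.3f}")
for alpha, cost in costs.items():
    print(f"alpha={alpha}: cost after 50 steps = {cost:.3e}")
```

On this data the smallest rate barely reduces the cost, the middle rate gets close to zero, and the largest rate ends far above the starting cost because each step overshoots the minimum.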
4. Multiple Linear Regression
Model:
$$\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p = \beta_0 + \sum_{j=1}^{p} \beta_j x_j$$
Matrix Form:
$$\hat{\mathbf{y}} = \mathbf{X}\boldsymbol{\beta}$$
Where $\mathbf{X} \in \mathbb{R}^{n \times (p+1)}$ (design matrix with intercept column)
Normal Equation (Matrix):
$$\boldsymbol{\hat{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$$
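A minimal NumPy sketch of the matrix normal equation follows. The data, the coefficient vector `true_beta`, and the noise level are assumptions made up for illustration. Rather than forming the inverse explicitly, the sketch solves the linear system $\mathbf{X}^T\mathbf{X}\boldsymbol{\beta} = \mathbf{X}^T\mathbf{y}$, which is usually preferred numerically, and compares the result with `np.linalg.lstsq`.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 3

# Hypothetical ground truth: intercept first, then p feature weights
true_beta = np.array([2.0, 1.0, -0.5, 3.0])

features = rng.normal(size=(n, p))
X = np.column_stack([np.ones(n), features])   # design matrix, shape (n, p + 1)
y = X @ true_beta + rng.normal(0, 0.1, size=n)

# Normal equation: solve (X^T X) beta = X^T y instead of inverting X^T X
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check with the least-squares solver
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print("normal equation:", np.round(beta_normal, 3))
print("lstsq:          ", np.round(beta_lstsq, 3))
```

The normal equation requires $\mathbf{X}^T\mathbf{X}$ to be invertible; when features are exactly collinear, `np.linalg.solve` fails or becomes unstable, while `np.linalg.lstsq` still returns a minimum-norm solution.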
[Figure: Multiple Regression: Multiple Features → Single Output. Inputs x₁ (Size), x₂ (Beds), x₃ (Age), and x₄ (Baths), weighted by β₁ to β₄, feed the linear model ŷ = β₀ + Σβⱼxⱼ, which outputs ŷ (Price).]
5. Model Evaluation Metrics
R² Score (Coefficient of Determination):
$$R^2 = 1 - \frac{SS_{res}}{SS_{tot}} = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}$$
$R^2 = 1$: Perfect fit
$R^2 = 0$: Model does no better than predicting the mean
$R^2 < 0$: Model is worse than predicting the mean
Adjusted R²:
$$R^2_{adj} = 1 - \frac{(1-R^2)(n-1)}{n-p-1}$$
Here $n$ is the number of samples and $p$ is the number of features, so adding features that do not improve the fit lowers $R^2_{adj}$.
[Figure: R² Score: How Well Does the Model Fit? SS_total = Σ(yᵢ - ȳ)² is the total variance. In the example, SS_explained = Σ(ŷᵢ - ȳ)² accounts for 70% and SS_residual for 30%, so R² = 1 - (30/100) = 0.70 (70% of the variance explained).]
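Both metrics are short enough to compute by hand. The sketch below uses hypothetical helper names (`r2_manual`, `adjusted_r2`) and four made-up data points chosen so the arithmetic is easy to follow: the mean of `y_true` is 6, so SS_tot = 20, and every residual is ±1, so SS_res = 4.

```python
import numpy as np

def r2_manual(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)            # unexplained variation
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)   # total variation
    return 1.0 - ss_res / ss_tot

def adjusted_r2(r2, n, p):
    """Adjusted R^2 for n samples and p features."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

# Made-up values for illustration
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([4.0, 4.0, 8.0, 8.0])

r2 = r2_manual(y_true, y_pred)                 # 1 - 4/20
r2_adj = adjusted_r2(r2, n=len(y_true), p=1)   # 1 - 0.2 * 3 / 2

# Predicting the mean everywhere gives R^2 = 0 by construction
r2_mean = r2_manual(y_true, np.full_like(y_true, y_true.mean()))

print(f"R2={r2:.2f}, adjusted R2={r2_adj:.2f}, mean baseline R2={r2_mean:.2f}")
```

For real work, `sklearn.metrics.r2_score` computes the same unadjusted quantity; the manual version is only meant to make the formula concrete.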
6. Assumptions of Linear Regression
5 Key Assumptions to Validate
1. Linearity: y = f(x) is linear
2. Independence: errors are independent of each other
3. Homoscedasticity: errors have constant variance
4. Normality of Errors: ε ~ N(0, σ²)
5. No Multicollinearity: features (e.g. X₁ and X₂) are not highly correlated with each other
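The multicollinearity assumption can be checked numerically. One common diagnostic, not mentioned in the original article and introduced here as an assumption, is the variance inflation factor (VIF): regress each feature on the others and compute 1 / (1 - R²). The sketch below implements it with NumPy only; the helper name and the synthetic features are hypothetical.

```python
import numpy as np

def variance_inflation_factors(features):
    """VIF for each column: 1 / (1 - R^2) from regressing it on the others."""
    n, p = features.shape
    vifs = []
    for j in range(p):
        target = features[:, j]
        others = np.delete(features, j, axis=1)
        design = np.column_stack([np.ones(n), others])   # add intercept
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residual = target - design @ coef
        ss_res = np.sum(residual ** 2)
        ss_tot = np.sum((target - target.mean()) ** 2)
        r2 = 1.0 - ss_res / ss_tot
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

# Synthetic features: x2 is nearly a copy of x1, x3 is independent
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = x1 + 0.1 * rng.normal(size=500)
x3 = rng.normal(size=500)

vifs = variance_inflation_factors(np.column_stack([x1, x2, x3]))
print(np.round(vifs, 1))
```

A VIF near 1 suggests a feature is not explained by the others. Values above roughly 5 to 10 are often treated as a warning sign, although that cutoff is a rule of thumb rather than a strict threshold.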
Checking Assumptions with Residual Plots
[Figure: Residual Analysis: What to Look For. Three residual plots: Good (random scatter), Bad (funnel shape), Bad (systematic pattern).]
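The funnel shape can also be detected without a plot. The sketch below is a rough heuristic, not a formal statistical test: it fits a line, sorts the residuals by fitted value, and compares the residual spread in the upper half with the spread in the lower half. The helper name `spread_ratio` and both synthetic datasets are assumptions for illustration, and the check does not cover the curved-pattern case.

```python
import numpy as np

def spread_ratio(x, y):
    """Std of residuals at high fitted values over std at low fitted values."""
    slope, intercept = np.polyfit(x, y, deg=1)   # highest degree first
    fitted = intercept + slope * x
    residuals = y - fitted
    order = np.argsort(fitted)                   # sort samples by fitted value
    half = len(x) // 2
    low = residuals[order[:half]]
    high = residuals[order[half:]]
    return np.std(high) / np.std(low)

rng = np.random.default_rng(0)
x = np.linspace(0.1, 2, 400)

# Constant noise variance (homoscedastic)
y_const = 4 + 3 * x + rng.normal(0, 1, size=x.size)
# Noise that grows with x, which produces a funnel in the residual plot
y_funnel = 4 + 3 * x + rng.normal(0, 1, size=x.size) * x

ratio_const = spread_ratio(x, y_const)
ratio_funnel = spread_ratio(x, y_funnel)
print(f"constant variance: {ratio_const:.2f}, funnel: {ratio_funnel:.2f}")
```

A ratio close to 1 is consistent with constant variance, while a ratio well above 1 indicates that the residuals fan out as the fitted values grow.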
7. Implementation in Python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
# Generate sample data
np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Fit model
model = LinearRegression()
model.fit(X_train, y_train)
# Predictions
y_pred = model.predict(X_test)
# Evaluate
print(f"Intercept (β₀): {model.intercept_[0]:.4f}")
print(f"Slope (β₁): {model.coef_[0][0]:.4f}")
print(f"R² Score: {r2_score(y_test, y_pred):.4f}")
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, y_pred)):.4f}")
# Visualize
plt.scatter(X_test, y_test, color='blue', alpha=0.6, label='Actual')
plt.plot(X_test, y_pred, color='red', linewidth=2, label='Predicted')
plt.xlabel('Feature')
plt.ylabel('Target')
plt.title('Linear Regression Fit')
plt.legend()
plt.show()
Key Takeaways
Linear regression finds the best-fit line through the data points
The cost function (MSE) measures prediction error; fitting the model means minimizing it
Gradient descent iteratively updates the weights to find the minimum
The R² score tells you how much of the variance the model explains
Validate the assumptions before trusting the model
Regularization (Ridge/Lasso) helps reduce overfitting