Math Foundations for Machine Learning

Math is the language of machine learning. This tutorial covers the essential math you need — with clear explanations, visual intuitions, and Python code.

Linear Algebra

Vectors

A vector is an ordered list of numbers:

v = [3, 1, 4, 2]

Geometrically:
- 2D vector: arrow in a plane
- 3D vector: arrow in space
- nD vector: arrow in n-dimensional space

Operations:
v = [1, 2, 3]
w = [4, 5, 6]

Addition:    v + w = [1+4, 2+5, 3+6] = [5, 7, 9]
Scalar mult: 2v = [2, 4, 6]
Dot product: v · w = 1×4 + 2×5 + 3×6 = 32
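
To see what each operation does component by component, here is a minimal plain-Python sketch without any libraries. The helper names add, scale, and dot are made up for this illustration and are not part of any package.

```python
# Plain-Python versions of the three vector operations (illustrative helpers).
v = [1, 2, 3]
w = [4, 5, 6]

def add(a, b):
    """Element-wise sum of two equal-length vectors."""
    return [x + y for x, y in zip(a, b)]

def scale(c, a):
    """Multiply every component of a by the scalar c."""
    return [c * x for x in a]

def dot(a, b):
    """Sum of the products of matching components."""
    return sum(x * y for x, y in zip(a, b))

print(add(v, w))    # [5, 7, 9]
print(scale(2, v))  # [2, 4, 6]
print(dot(v, w))    # 32
```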

Matrices

A matrix is a 2D array of numbers:

A = | 1  2  3 |
    | 4  5  6 |

Shape: (2, 3) — 2 rows, 3 columns

Operations:
A = | 1  2 |    B = | 5  6 |
    | 3  4 |        | 7  8 |

A + B = | 6   8 |
        | 10  12 |

A × B = | 1×5+2×7  1×6+2×8 | = | 19  22 |
        | 3×5+4×7  3×6+4×8 |   | 43  50 |

Transpose: A^T = | 1  3 |
                 | 2  4 |
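
The row-times-column rule used for A × B above can be sketched with nested lists. This is only an illustration of the rule; matmul and transpose are hypothetical helper names, and real code would normally use NumPy instead.

```python
# Matrix product written out as "row of X dot column of Y".
A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]

def matmul(X, Y):
    """Multiply an (n x k) matrix by a (k x m) matrix given as nested lists."""
    n, k, m = len(X), len(Y), len(Y[0])
    assert all(len(row) == k for row in X), "inner dimensions must match"
    return [[sum(X[i][t] * Y[t][j] for t in range(k)) for j in range(m)]
            for i in range(n)]

def transpose(X):
    """Swap rows and columns."""
    return [list(col) for col in zip(*X)]

print(matmul(A, B))   # [[19, 22], [43, 50]]
print(matmul(B, A))   # [[23, 34], [31, 46]]  (order matters)
print(transpose(A))   # [[1, 3], [2, 4]]
```

Note that A × B and B × A differ: matrix multiplication is not commutative.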

Matrix Multiplication in ML

Neural Network Layer:

Input: x = [1, 2, 3] (3 features)
Weights: W = | 0.2  0.8  0.1 | (2 neurons, one row of 3 weights each)
             | 0.5  0.3  0.9 |

Output: y = W × x = [0.2×1+0.8×2+0.1×3, 0.5×1+0.3×2+0.9×3]
                  = [2.1, 3.8]

This is how neural networks process data!
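
A minimal NumPy sketch of this layer, assuming each row of W holds the weights of one neuron and leaving out the bias and activation function that a real layer would usually add. Under that convention the output has one value per neuron and works out to about [2.1, 3.8].

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])      # 3 input features
W = np.array([[0.2, 0.8, 0.1],     # weights of neuron 1
              [0.5, 0.3, 0.9]])    # weights of neuron 2

y = W @ x                          # one dot product per neuron
print(W.shape, x.shape, y.shape)   # (2, 3) (3,) (2,)
print(y)                           # approximately [2.1 3.8]
```

The shapes are worth checking first: a (2, 3) matrix times a length-3 vector gives a length-2 vector.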

Calculus

Derivatives

The derivative measures RATE OF CHANGE:

f(x) = x²
f'(x) = 2x

At x=3: f'(3) = 6
Meaning: At x=3, the function is increasing at rate 6

In ML:
- We use derivatives to MINIMIZE loss functions
- The gradient is the vector of partial derivatives
- The gradient points in the direction of steepest increase

Gradient Descent

The core optimization algorithm in ML:

1. Start at a random point
2. Compute gradient (direction of steepest increase)
3. Move OPPOSITE to gradient (steepest decrease)
4. Repeat until convergence

θ_new = θ_old - α × ∇L(θ_old)

Where:
θ = model parameters
α = learning rate
∇L = gradient of loss function

Visualization:
Loss
│
│  ╲
│    ╲
│      ╲
│        ╲
│          ╲
│            ╲___
│                ╲_______ (minimum)
└──────────────────────── θ

Partial Derivatives

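When a function has multiple variables, a partial derivative measures the rate of change along one variable while the others are held fixed:

f(x, y) = x² + xy + y²

∂f/∂x = 2x + y  (rate of change in f as x changes, y held fixed)
∂f/∂y = x + 2y  (rate of change in f as y changes, x held fixed)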

Gradient: ∇f = [2x+y, x+2y]

At point (1, 2):
∇f = [2(1)+2, 1+2(2)] = [4, 5]
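
One way to sanity-check a hand-derived gradient is to compare it with a finite-difference estimate. The sketch below does this for the function above; numeric_grad and its step size h are assumptions made for illustration, not a standard API.

```python
def f(x, y):
    return x**2 + x*y + y**2

def grad_f(x, y):
    """Analytic gradient: [df/dx, df/dy]."""
    return [2*x + y, x + 2*y]

def numeric_grad(func, x, y, h=1e-6):
    """Central-difference estimate of the gradient (h is an assumed step size)."""
    dfdx = (func(x + h, y) - func(x - h, y)) / (2 * h)
    dfdy = (func(x, y + h) - func(x, y - h)) / (2 * h)
    return [dfdx, dfdy]

print(grad_f(1, 2))            # [4, 5]
print(numeric_grad(f, 1, 2))   # close to [4.0, 5.0]
```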

Probability

Basic Probability

P(event) = favorable outcomes / total outcomes  (when all outcomes are equally likely)

P(heads) = 1/2 = 0.5
P(rolling 6) = 1/6 ≈ 0.167

Rules:
P(A or B) = P(A) + P(B) - P(A and B)
P(A and B) = P(A) × P(B|A)
P(A|B) = P(B|A) × P(A) / P(B)  ← Bayes' theorem
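
The three rules can be checked by brute-force counting on a small example. The sketch below assumes a single fair die and two illustrative events (A: the roll is even, B: the roll is above 3); P and P_given are hypothetical helper names.

```python
from fractions import Fraction

outcomes = range(1, 7)                     # one fair six-sided die
A = {n for n in outcomes if n % 2 == 0}    # event A: roll is even  -> {2, 4, 6}
B = {n for n in outcomes if n > 3}         # event B: roll is above 3 -> {4, 5, 6}

def P(event):
    """Probability by counting equally likely outcomes."""
    return Fraction(len(event), len(outcomes))

def P_given(event, condition):
    """Conditional probability P(event | condition) by counting."""
    return Fraction(len(event & condition), len(condition))

print(P(A | B))        # 2/3  (for sets, | means "or")
print(P(A & B))        # 1/3  (for sets, & means "and")
print(P_given(A, B))   # 2/3

# Each rule holds exactly for these events:
print(P(A | B) == P(A) + P(B) - P(A & B))            # True
print(P(A & B) == P(A) * P_given(B, A))              # True
print(P_given(A, B) == P_given(B, A) * P(A) / P(B))  # True
```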

Bayes' Theorem

The foundation of Naive Bayes and probabilistic ML:

P(A|B) = P(B|A) × P(A) / P(B)

Example: Medical Testing
├─ P(Disease) = 0.01 (1% prevalence)
├─ P(Positive|Disease) = 0.99 (test sensitivity)
├─ P(Positive|No Disease) = 0.05 (false positive rate)
│
└─ P(Disease|Positive) = ?
   P(Positive) = 0.99×0.01 + 0.05×0.99 = 0.0594   (in the second term, 0.99 = P(No Disease))
   P(Disease|Positive) = 0.99 × 0.01 / 0.0594
   = 0.0099 / 0.0594
   = 0.1667 (16.7%)

Despite 99% sensitivity, only 16.7% of people who test positive actually have the disease!
This is why base rates matter.
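
The same calculation as a small function makes it easy to see how strongly the answer depends on the base rate. This is a sketch; the function name posterior and the second scenario with 10% prevalence are assumptions added for illustration.

```python
def posterior(prior, sensitivity, false_positive_rate):
    """P(Disease | Positive) via Bayes' theorem."""
    # Total probability of a positive test: true positives + false positives
    p_positive = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / p_positive

print(round(posterior(0.01, 0.99, 0.05), 4))   # 0.1667 (1% prevalence, as above)
print(round(posterior(0.10, 0.99, 0.05), 4))   # 0.6875 (hypothetical 10% prevalence)
```

With the same test, raising the prevalence from 1% to 10% lifts the posterior from about 17% to about 69%.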

Statistics

Descriptive Statistics

Mean: μ = (1/n) Σ xᵢ
Median: middle value
Mode: most frequent value
Variance: σ² = (1/n) Σ (xᵢ - μ)²
Standard Deviation: σ = √(variance)

Example: [2, 4, 4, 4, 5, 5, 7, 9]
Mean: 5
Median: 4.5
Mode: 4
Variance: 4
Std Dev: 2
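
These values can be computed with NumPy and the standard library. A minimal sketch; note that np.var and np.std divide by n by default (ddof=0), which matches the 1/n formula above.

```python
import statistics
import numpy as np

data = [2, 4, 4, 4, 5, 5, 7, 9]

print(np.mean(data))           # 5.0
print(np.median(data))         # 4.5
print(statistics.mode(data))   # 4
print(np.var(data))            # 4.0  (population variance, divides by n)
print(np.std(data))            # 2.0
```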

Distributions

Normal Distribution (Gaussian):
├─ Bell-shaped curve
├─ Mean = Median = Mode
├─ 68% within 1σ, 95% within 2σ, 99.7% within 3σ
└─ Foundation of many ML algorithms

Uniform Distribution:
├─ All values equally likely
└─ Used in random initialization

Poisson Distribution:
├─ Count of events in fixed interval
└─ Used in anomaly detection
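
A short sketch to make the three distributions concrete. The 68/95/99.7 figures are computed from the error function, and the uniform and Poisson samples are drawn with NumPy. The helper name within, the seed, the sample size, and the Poisson rate of 3 are all arbitrary choices for illustration.

```python
import math
import numpy as np

def within(k):
    """Probability that a normal value lies within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(k, round(within(k), 4))   # 0.6827, 0.9545, 0.9973

rng = np.random.default_rng(0)            # seed 0 is an arbitrary choice
u = rng.uniform(0.0, 1.0, size=100_000)   # every value in [0, 1) equally likely
p = rng.poisson(lam=3.0, size=100_000)    # event counts, average 3 per interval

print(u.mean())            # close to 0.5
print(p.mean(), p.var())   # both close to 3 (Poisson mean equals its variance)
```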

Python Code

import numpy as np

# Vectors
v = np.array([1, 2, 3])
w = np.array([4, 5, 6])

# Operations
print(v + w)           # [5 7 9]
print(np.dot(v, w))    # 32 (dot product)
print(np.linalg.norm(v))  # 3.74 (magnitude)

# Matrices
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

print(A @ B)           # Matrix multiplication
print(A.T)             # Transpose
print(np.linalg.inv(A))  # Inverse

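# Calculus
# scipy.misc.derivative is no longer available in recent SciPy releases,
# so approximate the derivative with a central difference instead.
def derivative(f, x, h=1e-5):
    return (f(x + h) - f(x - h)) / (2 * h)

f = lambda x: x**2
print(derivative(f, 3))  # ~6.0 (derivative at x=3)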

# Gradient Descent
def gradient_descent(f_grad, start, lr=0.01, steps=100):
    x = start
    history = [x]
    for _ in range(steps):
        x = x - lr * f_grad(x)
        history.append(x)
    return x, history

# Minimize f(x) = x^2
result, history = gradient_descent(lambda x: 2*x, start=10, lr=0.1)
print(f"Minimum at x={result:.4f}")  # ~0

Key Takeaways

- Vectors and matrices are the data structures of ML
- Matrix multiplication is how neural networks process data
- Derivatives tell us how to improve model parameters
- Gradient descent is the core optimization algorithm
- Probability underpins classification and generative models
- Bayes' theorem is fundamental to probabilistic ML
- Normal distribution appears everywhere in ML
- You don't need a math PhD: focus on intuition over proofs
- Python/NumPy handles the computation; you handle the understanding
- Math is learnable: practice with code examples
