Math Foundations for Machine Learning
Math is the language of machine learning. This tutorial covers the essential math you need — with clear explanations, visual intuitions, and Python code.
Linear Algebra
Vectors
A vector is an ordered list of numbers:
v = [3, 1, 4, 2]
Geometrically:
- 2D vector: arrow in a plane
- 3D vector: arrow in space
- nD vector: arrow in n-dimensional space
Operations:
v = [1, 2, 3]
w = [4, 5, 6]
Addition: v + w = [1+4, 2+5, 3+6] = [5, 7, 9]
Scalar mult: 2v = [2, 4, 6]
Dot product: v · w = 1×4 + 2×5 + 3×6 = 32
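The three operations above can be written out element by element. This is a minimal plain-Python sketch; the helper names add, scale, and dot are illustrative and not part of any library.

```python
# Plain-Python versions of the vector operations (helper names are illustrative).

def add(v, w):
    """Element-wise sum of two equal-length vectors."""
    if len(v) != len(w):
        raise ValueError("vectors must have the same length")
    return [a + b for a, b in zip(v, w)]

def scale(c, v):
    """Multiply every component of v by the scalar c."""
    return [c * a for a in v]

def dot(v, w):
    """Sum of the products of matching components."""
    if len(v) != len(w):
        raise ValueError("vectors must have the same length")
    return sum(a * b for a, b in zip(v, w))

v = [1, 2, 3]
w = [4, 5, 6]

print(add(v, w))    # [5, 7, 9]
print(scale(2, v))  # [2, 4, 6]
print(dot(v, w))    # 32
```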
Matrices
A matrix is a 2D array of numbers:
A = | 1 2 3 |
    | 4 5 6 |
Shape: (2, 3) — 2 rows, 3 columns
Operations:
A = | 1 2 |    B = | 5 6 |
    | 3 4 |        | 7 8 |
A + B = |  6  8 |
        | 10 12 |
A × B = | 1×5+2×7  1×6+2×8 | = | 19 22 |
        | 3×5+4×7  3×6+4×8 |   | 43 50 |
Transpose: A^T = | 1 3 |
                 | 2 4 |
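The row-by-column rule behind A × B can be spelled out as nested loops. This is a rough sketch using lists of rows; matmul and transpose are hypothetical helper names, and in practice NumPy would do this work.

```python
def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    n_inner = len(B)
    if any(len(row) != n_inner for row in A):
        raise ValueError("columns of A must equal rows of B")
    n_cols = len(B[0])
    # Entry (i, j) is the dot product of row i of A with column j of B.
    return [
        [sum(A[i][k] * B[k][j] for k in range(n_inner)) for j in range(n_cols)]
        for i in range(len(A))
    ]

def transpose(A):
    """Swap rows and columns."""
    return [list(col) for col in zip(*A)]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]

print(matmul(A, B))   # [[19, 22], [43, 50]]
print(matmul(B, A))   # [[23, 34], [31, 46]]
print(transpose(A))   # [[1, 3], [2, 4]]
```

The second product differs from the first: matrix multiplication depends on the order of its operands.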
Matrix Multiplication in ML
Neural Network Layer:
Input: x = [1, 2, 3] (3 features)
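Weights: W = | 0.2 0.8 0.1 | (2 neurons, one row of weights each)
             | 0.5 0.3 0.9 |
Output: y = W × x = [0.2×1+0.8×2+0.1×3, 0.5×1+0.3×2+0.9×3]
                  = [2.1, 3.8]
Each output value is the dot product of one neuron's weight row with the input.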
This is how neural networks process data!
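A minimal NumPy sketch of such a layer, assuming W is stored as a 2×3 array with one row of weights per output neuron, so that the layer computes W @ x. The batch input X is a made-up example, and bias and activation are left out.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])          # 3 input features
W = np.array([[0.2, 0.8, 0.1],         # weights of neuron 1
              [0.5, 0.3, 0.9]])        # weights of neuron 2

y = W @ x                              # one dot product per neuron
print(W.shape, x.shape, y.shape)       # (2, 3) (3,) (2,)
print(y)                               # approximately [2.1 3.8]

# A batch of inputs (one row per example) goes through the same layer
# by multiplying with the transposed weights.
X = np.array([[1.0, 2.0, 3.0],
              [0.0, 1.0, 0.0]])        # hypothetical second example
Y = X @ W.T                            # shape (2, 2): examples × neurons
print(Y.shape)                         # (2, 2)
```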
Calculus
Derivatives
The derivative measures RATE OF CHANGE:
f(x) = x²
f'(x) = 2x
At x=3: f'(3) = 6
Meaning: At x=3, the function is increasing at rate 6
In ML:
- We use derivatives to MINIMIZE loss functions
- Gradient = vector of partial derivatives
- The gradient points in the direction of steepest increase
Gradient Descent
The core optimization algorithm in ML:
1. Start at random point
2. Compute gradient (direction of steepest increase)
3. Move OPPOSITE to gradient (steepest decrease)
4. Repeat until convergence
θ_new = θ_old - α × ∇L(θ_old)
Where:
θ = model parameters
α = learning rate
∇L = gradient of loss function
Visualization:
Loss
│
│ ╲
│  ╲
│   ╲
│    ╲
│     ╲
│      ╲___
│          ╲_______ (minimum)
└──────────────────────── θ
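The descent pictured above can be traced numerically. This is a small sketch, assuming a made-up one-parameter loss L(θ) = (θ - 3)² whose minimum sits at θ = 3; the starting point, learning rate, and step count are arbitrary choices.

```python
def loss(theta):
    """Hypothetical loss with its minimum at theta = 3."""
    return (theta - 3.0) ** 2

def grad(theta):
    """Derivative of the loss with respect to theta."""
    return 2.0 * (theta - 3.0)

theta = 0.0      # starting point (arbitrary)
alpha = 0.1      # learning rate
losses = [loss(theta)]

for _ in range(25):
    theta = theta - alpha * grad(theta)   # θ_new = θ_old - α × ∇L(θ_old)
    losses.append(loss(theta))

# The loss shrinks at every step as theta approaches 3.
for step in (0, 5, 10, 25):
    print(f"step {step:2d}: loss = {losses[step]:.6f}")
print(f"theta = {theta:.4f}")
```

With a learning rate that is too large (for this loss, anything above 1.0) the same loop would overshoot and diverge instead.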
Partial Derivatives
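When a function has multiple variables, each partial derivative measures the rate of change with respect to one variable while the others are held fixed:
f(x, y) = x² + xy + y²
∂f/∂x = 2x + y (rate of change of f as x changes, y fixed)
∂f/∂y = x + 2y (rate of change of f as y changes, x fixed)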
Gradient: ∇f = [2x+y, x+2y]
At point (1, 2):
∇f = [2(1)+2, 1+2(2)] = [4, 5]
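One way to sanity-check a hand-derived gradient is to compare it with a finite-difference estimate. This is a short sketch; numerical_grad and its step size h are illustrative choices, not a library API.

```python
def f(x, y):
    return x**2 + x*y + y**2

def grad_f(x, y):
    """Analytic gradient built from the two partial derivatives."""
    return (2*x + y, x + 2*y)

def numerical_grad(func, x, y, h=1e-5):
    """Central-difference estimate of both partial derivatives."""
    df_dx = (func(x + h, y) - func(x - h, y)) / (2 * h)
    df_dy = (func(x, y + h) - func(x, y - h)) / (2 * h)
    return (df_dx, df_dy)

analytic = grad_f(1, 2)
numeric = numerical_grad(f, 1.0, 2.0)

print(analytic)   # (4, 5)
print(numeric)    # close to (4.0, 5.0), up to floating-point error
```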
Probability
Basic Probability
P(event) = favorable outcomes / total outcomes (when all outcomes are equally likely)
P(heads) = 1/2 = 0.5
P(rolling 6) = 1/6 ≈ 0.167
Rules:
P(A or B) = P(A) + P(B) - P(A and B)
P(A and B) = P(A) × P(B|A)
P(A|B) = P(B|A) × P(A) / P(B) ← Bayes' theorem
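The three rules can be checked exactly by enumerating a small sample space. This sketch assumes one fair six-sided die and two made-up events, A = "roll is even" and B = "roll is greater than 3"; fractions keep the arithmetic exact.

```python
from fractions import Fraction

outcomes = range(1, 7)   # faces of one fair die, all equally likely

def prob(event):
    """Probability of an event given as a predicate over the outcomes."""
    favorable = sum(1 for o in outcomes if event(o))
    return Fraction(favorable, len(outcomes))

def is_even(o):          # event A
    return o % 2 == 0

def gt3(o):              # event B
    return o > 3

p_a = prob(is_even)
p_b = prob(gt3)
p_and = prob(lambda o: is_even(o) and gt3(o))
p_or = prob(lambda o: is_even(o) or gt3(o))
p_b_given_a = p_and / p_a    # definition of conditional probability
p_a_given_b = p_and / p_b

print(p_or == p_a + p_b - p_and)                 # True (addition rule)
print(p_and == p_a * p_b_given_a)                # True (multiplication rule)
print(p_a_given_b == p_b_given_a * p_a / p_b)    # True (Bayes' theorem)
print(p_or, p_and, p_a_given_b)                  # 2/3 1/3 2/3
```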
Bayes' Theorem
The foundation of Naive Bayes and probabilistic ML:
P(A|B) = P(B|A) × P(A) / P(B)
Example: Medical Testing
├─ P(Disease) = 0.01 (1% prevalence)
├─ P(Positive|Disease) = 0.99 (test sensitivity)
├─ P(Positive|No Disease) = 0.05 (false positive rate)
│
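└─ P(Disease|Positive) = ?
P(No Disease) = 1 - 0.01 = 0.99
P(Positive) = P(Positive|Disease) × P(Disease) + P(Positive|No Disease) × P(No Disease)
            = 0.99×0.01 + 0.05×0.99 = 0.0594
P(Disease|Positive) = 0.99 × 0.01 / 0.0594
                    = 0.0099 / 0.0594
                    = 0.1667 (16.7%)
Despite 99% sensitivity, only 16.7% of people who test positive have the disease!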
This is why base rates matter.
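The medical-testing calculation fits in a few lines of Python. This is a sketch only; the function name posterior and its parameter names are illustrative. The loop varies the prevalence to show how strongly the result depends on the base rate.

```python
def posterior(prior, sensitivity, false_positive_rate):
    """P(Disease | Positive) via Bayes' theorem."""
    true_pos = sensitivity * prior                   # P(Positive and Disease)
    false_pos = false_positive_rate * (1 - prior)    # P(Positive and No Disease)
    return true_pos / (true_pos + false_pos)

p = posterior(prior=0.01, sensitivity=0.99, false_positive_rate=0.05)
print(f"{p:.4f}")   # 0.1667

# Same test, different prevalence: the posterior changes substantially.
for prior in (0.001, 0.01, 0.1):
    print(prior, round(posterior(prior, 0.99, 0.05), 4))
```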
Statistics
Descriptive Statistics
Mean: μ = (1/n) Σ xᵢ
Median: middle value
Mode: most frequent value
Variance: σ² = (1/n) Σ (xᵢ - μ)²
Standard Deviation: σ = √(variance)
Example: [2, 4, 4, 4, 5, 5, 7, 9]
Mean: 5.0
Median: 4.5
Mode: 4
Variance: 4.0
Std Dev: 2.0
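The statistics for the example list can be recomputed with Python's standard statistics module. This sketch uses pvariance and pstdev, which divide by n as in the variance formula above; statistics.variance, which divides by n - 1, is included for contrast.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(data)
median = statistics.median(data)        # average of the two middle values
mode = statistics.mode(data)
variance = statistics.pvariance(data)   # population variance: divides by n
std_dev = statistics.pstdev(data)       # square root of the population variance
sample_var = statistics.variance(data)  # sample variance: divides by n - 1

print(f"mean={mean:.2f} median={median:.2f} mode={mode}")
# mean=5.00 median=4.50 mode=4
print(f"variance={variance:.2f} std={std_dev:.2f}")
# variance=4.00 std=2.00
print(f"sample variance={sample_var:.2f}")
# sample variance=4.57
```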
Distributions
Normal Distribution (Gaussian):
├─ Bell-shaped curve
├─ Mean = Median = Mode
├─ About 68% of values within 1σ, 95% within 2σ, 99.7% within 3σ
└─ Foundation of many ML algorithms
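The 68 / 95 / 99.7 figures follow from the normal distribution itself. This is a small sketch using the standard identity P(|Z| < k) = erf(k / √2) for a standard normal variable Z; normal_within is an illustrative helper name.

```python
import math

def normal_within(k):
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within {k}σ: {normal_within(k):.4f}")
# within 1σ: 0.6827
# within 2σ: 0.9545
# within 3σ: 0.9973
```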
Uniform Distribution:
├─ All values equally likely
└─ Used in random initialization
Poisson Distribution:
├─ Count of events in fixed interval
└─ Used in anomaly detection
Python Code
import numpy as np
# Vectors
v = np.array([1, 2, 3])
w = np.array([4, 5, 6])
# Operations
print(v + w) # [5 7 9] (NumPy prints arrays without commas)
print(np.dot(v, w)) # 32 (dot product)
print(np.linalg.norm(v)) # 3.74 (magnitude)
# Matrices
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
print(A @ B) # Matrix multiplication
print(A.T) # Transpose
print(np.linalg.inv(A)) # Inverse
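# Calculus
# scipy.misc.derivative is deprecated and absent from recent SciPy releases,
# so this uses a small central-difference helper instead.
def derivative(f, x, h=1e-5):
    return (f(x + h) - f(x - h)) / (2 * h)
f = lambda x: x**2
print(derivative(f, 3)) # ~6.0 (derivative at x=3)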
# Gradient Descent
def gradient_descent(f_grad, start, lr=0.01, steps=100):
    x = start
    history = [x]
    for _ in range(steps):
        x = x - lr * f_grad(x)
        history.append(x)
    return x, history
# Minimize f(x) = x^2
result, history = gradient_descent(lambda x: 2*x, start=10, lr=0.1)
print(f"Minimum at x={result:.4f}") # ~0
Key Takeaways
- Vectors and matrices are the data structures of ML
- Matrix multiplication is how neural networks process data
- Derivatives tell us how to improve model parameters
- Gradient descent is the core optimization algorithm
- Probability underpins classification and generative models
- Bayes' theorem is fundamental to probabilistic ML
- Normal distribution appears everywhere in ML
- You don't need a math PhD — focus on intuition over proofs
- Python/NumPy handles the computation — you handle the understanding
- Math is learnable — practice with code examples