Math Foundations for Machine Learning — Linear Algebra, Calculus, Probability


Math Foundations for Machine Learning

Math is the language of machine learning. This tutorial covers the essential math you need — with clear explanations, visual intuitions, and Python code.


Linear Algebra

Vectors

A vector is an ordered list of numbers:

v = [3, 1, 4, 2]

Geometrically:
- 2D vector: arrow in a plane
- 3D vector: arrow in space
- nD vector: arrow in n-dimensional space

Operations:
v = [1, 2, 3]
w = [4, 5, 6]

Addition:    v + w = [1+4, 2+5, 3+6] = [5, 7, 9]
Scalar mult: 2v = [2, 4, 6]
Dot product: v · w = 1×4 + 2×5 + 3×6 = 32

Matrices

A matrix is a 2D array of numbers:

A = | 1  2  3 |
    | 4  5  6 |

Shape: (2, 3) — 2 rows, 3 columns

Operations:
A = | 1  2 |    B = | 5  6 |
    | 3  4 |        | 7  8 |

A + B = | 6   8 |
        | 10  12 |

A × B = | 1×5+2×7  1×6+2×8 | = | 19  22 |
        | 3×5+4×7  3×6+4×8 |   | 43  50 |

Transpose: A^T = | 1  3 |
                 | 2  4 |
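
The row-by-column rule used for A × B above can be written out by hand. This is only an illustrative sketch in plain Python; the helper name matmul is made up here, and in practice NumPy's @ operator does the same job.

```python
def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    n_rows, inner, n_cols = len(A), len(B), len(B[0])
    # The column count of A must equal the row count of B.
    if any(len(row) != inner for row in A):
        raise ValueError("columns of A must match rows of B")
    # Entry (i, j) is the dot product of row i of A with column j of B.
    return [
        [sum(A[i][k] * B[k][j] for k in range(inner)) for j in range(n_cols)]
        for i in range(n_rows)
    ]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
product = matmul(A, B)
print(product)  # [[19, 22], [43, 50]]
```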

Matrix Multiplication in ML

Neural Network Layer:

Input: x = [1, 2, 3] (3 features)
Weights: W = | 0.2  0.8  0.1 | (2 neurons, one row each)
             | 0.5  0.3  0.9 |

Output: y = W × x = [0.2×1+0.8×2+0.1×3, 0.5×1+0.3×2+0.9×3]
                  = [2.1, 3.8]

Each neuron's output is the dot product of its weights with the input, so a whole layer is a single matrix multiplication.
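
A minimal NumPy sketch of this layer, assuming each row of W holds the weights of one neuron and that no bias or activation function is applied. The shapes show why the product works: a (2, 3) matrix times a length-3 vector gives a length-2 output.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])          # 3 input features
W = np.array([[0.2, 0.8, 0.1],         # weights of neuron 1
              [0.5, 0.3, 0.9]])        # weights of neuron 2

y = W @ x                              # shape (2, 3) @ (3,) -> (2,)
print(W.shape, x.shape, y.shape)       # (2, 3) (3,) (2,)
print(y)                               # approximately [2.1 3.8]
```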

Calculus

Derivatives

The derivative measures RATE OF CHANGE:

f(x) = x²
f'(x) = 2x

At x=3: f'(3) = 6
Meaning: At x=3, a small increase in x raises f(x) by about 6 times that amount

In ML:
- We use derivatives to MINIMIZE loss functions
- Gradient = vector of partial derivatives
- The gradient points in the direction of steepest increase

Gradient Descent

The core optimization algorithm in ML:

1. Start at a random point
2. Compute gradient (direction of steepest increase)
3. Move OPPOSITE to gradient (steepest decrease)
4. Repeat until convergence

θ_new = θ_old - α × ∇L(θ_old)

Where:
θ = model parameters
α = learning rate
∇L = gradient of loss function

Visualization:
Loss
│
│  ╲
│    ╲
│      ╲
│        ╲
│          ╲
│            ╲___
│                ╲_______ (minimum)
└──────────────────────── θ
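
The update rule above also works when θ is a vector. The sketch below is one possible way to write it, with "repeat until convergence" interpreted as "stop when the gradient norm drops below a tolerance". The loss, the target values, and the tol and max_steps parameters are assumptions chosen for illustration.

```python
import numpy as np

def gradient_descent(grad, theta0, lr=0.1, tol=1e-6, max_steps=10_000):
    """Step against the gradient until its norm falls below tol."""
    theta = np.asarray(theta0, dtype=float)
    for step in range(max_steps):
        g = grad(theta)
        if np.linalg.norm(g) < tol:      # convergence check
            break
        theta = theta - lr * g           # θ_new = θ_old - α × ∇L(θ_old)
    return theta, step

# Hypothetical loss: L(θ) = ||θ - target||², whose gradient is 2(θ - target)
target = np.array([3.0, -1.0])
theta, steps = gradient_descent(lambda t: 2 * (t - target), theta0=[0.0, 0.0])
print(theta, steps)                      # theta ends up very close to target
```

Try a larger learning rate such as lr=1.1 to see the iterates overshoot and diverge instead of settling at the minimum.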

Partial Derivatives

When a function has multiple variables, differentiate with respect to one variable while holding the others fixed:

f(x, y) = x² + xy + y²

∂f/∂x = 2x + y  (change in f when x changes)
∂f/∂y = x + 2y  (change in f when y changes)

Gradient: ∇f = [2x+y, x+2y]

At point (1, 2):
∇f = [2(1)+2, 1+2(2)] = [4, 5]
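
One way to sanity-check a hand-derived gradient is to compare it with a finite-difference estimate. The sketch below does this for f(x, y) = x² + xy + y² at the point (1, 2); the helper names and the step size h are illustrative choices.

```python
import numpy as np

def f(p):
    x, y = p
    return x**2 + x*y + y**2

def analytic_grad(p):
    x, y = p
    return np.array([2*x + y, x + 2*y])

def numeric_grad(func, p, h=1e-6):
    """Central-difference estimate of each partial derivative."""
    p = np.asarray(p, dtype=float)
    grad = np.zeros_like(p)
    for i in range(p.size):
        step = np.zeros_like(p)
        step[i] = h                      # nudge only coordinate i
        grad[i] = (func(p + step) - func(p - step)) / (2 * h)
    return grad

point = np.array([1.0, 2.0])
print(analytic_grad(point))              # [4. 5.]
print(numeric_grad(f, point))            # close to [4. 5.]
```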

Probability

Basic Probability

P(event) = favorable outcomes / total outcomes  (when all outcomes are equally likely)

P(heads) = 1/2 = 0.5
P(rolling 6) = 1/6 ≈ 0.167

Rules:
P(A or B) = P(A) + P(B) - P(A and B)
P(A and B) = P(A) × P(B|A)
P(A|B) = P(B|A) × P(A) / P(B)  ← Bayes' theorem
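
The three rules can be checked exactly on a small example. The sketch below uses one fair six-sided die and two events picked only for illustration (A = even roll, B = roll above 3), with Fraction keeping the arithmetic exact.

```python
from fractions import Fraction

outcomes = range(1, 7)                       # one fair six-sided die

def prob(event):
    """Probability of an event, given as a set of equally likely outcomes."""
    return Fraction(len(event), len(outcomes))

A = {n for n in outcomes if n % 2 == 0}      # even roll: {2, 4, 6}
B = {n for n in outcomes if n > 3}           # roll above 3: {4, 5, 6}

p_a_and_b = prob(A & B)                      # {4, 6} -> 1/3
p_a_or_b = prob(A | B)                       # {2, 4, 5, 6} -> 2/3
p_b_given_a = Fraction(len(A & B), len(A))   # 2/3
p_a_given_b = Fraction(len(A & B), len(B))   # 2/3

assert p_a_or_b == prob(A) + prob(B) - p_a_and_b          # addition rule
assert p_a_and_b == prob(A) * p_b_given_a                 # product rule
assert p_a_given_b == p_b_given_a * prob(A) / prob(B)     # Bayes' theorem
print(p_a_or_b, p_a_and_b, p_a_given_b)      # 2/3 1/3 2/3
```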

Bayes' Theorem

The foundation of Naive Bayes and probabilistic ML:

P(A|B) = P(B|A) × P(A) / P(B)

Example: Medical Testing
├─ P(Disease) = 0.01 (1% prevalence)
├─ P(Positive|Disease) = 0.99 (test sensitivity)
├─ P(Positive|No Disease) = 0.05 (false positive rate)
│
└─ P(Disease|Positive) = ?
   = 0.99 × 0.01 / (0.99×0.01 + 0.05×0.99)
   = 0.0099 / 0.0594
   = 0.1667 (16.7%)

Despite 99% test sensitivity, only 16.7% of people who test positive actually have the disease!
This is why base rates matter.
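
The medical-testing calculation can be wrapped in a small function to see how the answer moves with the base rate. The function name and the 10% prevalence in the second call are illustrative assumptions, not part of the example above.

```python
def posterior(prior, sensitivity, false_positive_rate):
    """P(Disease | Positive) via Bayes' theorem."""
    true_pos = sensitivity * prior                       # P(Pos|D) × P(D)
    false_pos = false_positive_rate * (1 - prior)        # P(Pos|no D) × P(no D)
    return true_pos / (true_pos + false_pos)             # divide by P(Pos)

p = posterior(prior=0.01, sensitivity=0.99, false_positive_rate=0.05)
print(f"{p:.4f}")                                        # 0.1667

# The same test applied where the disease is ten times more common:
p_common = posterior(prior=0.10, sensitivity=0.99, false_positive_rate=0.05)
print(f"{p_common:.4f}")                                 # 0.6875
```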

Statistics

Descriptive Statistics

Mean: μ = (1/n) Σ xᵢ
Median: middle value
Mode: most frequent value
Variance: σ² = (1/n) Σ (xᵢ - μ)²
Standard Deviation: σ = √(variance)

Example: [2, 4, 4, 4, 5, 5, 7, 9]
Mean: 5
Median: 4.5
Mode: 4
Variance: 4
Std Dev: 2
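
These statistics can be computed directly with NumPy and the standard library. Note that np.var and np.std divide by n by default, which matches the variance formula given in this section.

```python
import statistics
import numpy as np

data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = np.mean(data)
median = np.median(data)
mode = statistics.mode(data)
variance = np.var(data)      # population variance (divides by n)
std_dev = np.std(data)       # population standard deviation

print(mean, median, mode, variance, std_dev)   # 5.0 4.5 4 4.0 2.0
```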

Distributions

Normal Distribution (Gaussian):
├─ Bell-shaped curve
├─ Mean = Median = Mode
├─ 68% within 1σ, 95% within 2σ, 99.7% within 3σ
└─ Foundation of many ML algorithms

Uniform Distribution:
├─ All values equally likely
└─ Used in random initialization

Poisson Distribution:
├─ Count of events in fixed interval
└─ Used in anomaly detection
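
Sampling from each distribution is a quick way to build intuition. The sketch below draws samples with NumPy's random Generator and checks the 68/95/99.7 rule empirically. The seed, sample size, and distribution parameters are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)      # seed chosen arbitrarily
n = 100_000

normal = rng.normal(loc=0.0, scale=1.0, size=n)
uniform = rng.uniform(low=-1.0, high=1.0, size=n)
poisson = rng.poisson(lam=3.0, size=n)

# Fraction of normal samples within 1, 2, and 3 standard deviations
for k in (1, 2, 3):
    within = np.mean(np.abs(normal) < k)
    print(f"within {k}σ: {within:.3f}")   # expect about 0.683, 0.954, 0.997

print(uniform.min(), uniform.max())       # stays inside [-1, 1)
print(poisson.mean())                     # close to lam = 3.0
```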

Python Code

import numpy as np

# Vectors
v = np.array([1, 2, 3])
w = np.array([4, 5, 6])

# Operations
print(v + w)           # [5 7 9]
print(np.dot(v, w))    # 32 (dot product)
print(np.linalg.norm(v))  # 3.74 (magnitude)

# Matrices
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

print(A @ B)           # Matrix multiplication
print(A.T)             # Transpose
print(np.linalg.inv(A))  # Inverse

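# Calculus
# scipy.misc.derivative was removed in SciPy 1.12, so use a central difference
def derivative(f, x, h=1e-5):
    return (f(x + h) - f(x - h)) / (2 * h)

f = lambda x: x**2
print(derivative(f, 3))  # ≈ 6.0 (derivative at x=3)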

# Gradient Descent
def gradient_descent(f_grad, start, lr=0.01, steps=100):
    x = start
    history = [x]
    for _ in range(steps):
        x = x - lr * f_grad(x)
        history.append(x)
    return x, history

# Minimize f(x) = x^2
result, history = gradient_descent(lambda x: 2*x, start=10, lr=0.1)
print(f"Minimum at x={result:.4f}")  # ~0

Key Takeaways

  1. Vectors and matrices are the data structures of ML
  2. Matrix multiplication is how neural networks process data
  3. Derivatives tell us how to improve model parameters
  4. Gradient descent is the core optimization algorithm
  5. Probability underpins classification and generative models
  6. Bayes' theorem is fundamental to probabilistic ML
  7. Normal distribution appears everywhere in ML
  8. You don't need a math PhD — focus on intuition over proofs
  9. Python/NumPy handles the computation — you handle the understanding
  10. Math is learnable — practice with code examples
