Math Foundations for Deep Learning
Deep learning is built on linear algebra, calculus, and probability. This tutorial covers the essential math you need to understand how neural networks learn.
See our Math Linear Algebra tutorial and Math Calculus tutorial for broader mathematical foundations.
Linear Algebra
Vectors and Matrices
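Definition: Vector and Matrix Operations
- Dot Product: $\mathbf{a} \cdot \mathbf{b} = \sum_{i} a_i b_i$
- Matrix Multiplication: $C = AB$, where $C_{ij} = \sum_{k} A_{ik} B_{kj}$
- Norm: $\|\mathbf{x}\|_2 = \sqrt{\sum_{i} x_i^2}$
- Transpose: $(A^\top)_{ij} = A_{ji}$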
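The four operations above can be tried out in a few lines. This is a minimal sketch that assumes NumPy is available (the worked examples later in this tutorial use PyTorch, which offers equivalent calls); the arrays `a`, `b`, `A`, and `B` are illustrative values, not taken from the text.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])
A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[0.0, 1.0], [1.0, 0.0]])

dot = a @ b                 # 1*4 + 2*5 + 3*6 = 32
product = A @ B             # B swaps the columns of A: [[2, 1], [4, 3]]
norm = np.linalg.norm(a)    # Euclidean norm: sqrt(1 + 4 + 9) = sqrt(14)
transposed = A.T            # rows become columns

print(dot, norm)
print(product)
print(transposed)
```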
Matrix Multiplication
$$C = AB, \qquad C_{ij} = \sum_{k=1}^{n} A_{ik} B_{kj}$$
Here,
- $A$ = Input matrix of shape (m x n)
- $B$ = Input matrix of shape (n x p)
- $C$ = Output matrix of shape (m x p)
- $n$ = Inner dimension (must match)
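To make the shape rule concrete, the sketch below spells out matrix multiplication with explicit loops: each output entry `C[i, j]` is the sum over `k` of `A[i, k] * B[k, j]`. The helper name `matmul_loops` is hypothetical and written only for illustration; in practice `A @ B` should be preferred. NumPy is an assumption here.

```python
import numpy as np

def matmul_loops(A, B):
    """Multiply an (m x n) matrix by an (n x p) matrix with explicit loops."""
    m, n = A.shape
    n_b, p = B.shape
    if n != n_b:
        raise ValueError(f"inner dimensions differ: {n} vs {n_b}")
    C = np.zeros((m, p))
    for i in range(m):          # rows of A
        for j in range(p):      # columns of B
            for k in range(n):  # shared inner dimension
                C[i, j] += A[i, k] * B[k, j]
    return C

A = np.arange(6, dtype=float).reshape(2, 3)    # shape (2, 3)
B = np.arange(12, dtype=float).reshape(3, 4)   # shape (3, 4)
C = matmul_loops(A, B)                         # shape (2, 4)
print(C.shape, np.allclose(C, A @ B))
```

Swapping the operands (`matmul_loops(B, A)`) raises the `ValueError`, because the inner dimensions 4 and 2 do not match.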
Eigenvalues and Eigenvectors
Definition: Eigenvalue Decomposition
For a square matrix $A$, an eigenvector $\mathbf{v} \neq \mathbf{0}$ and eigenvalue $\lambda$ satisfy:
$$A\mathbf{v} = \lambda \mathbf{v}$$
The eigendecomposition reveals the matrix's action: stretching along eigenvector directions by eigenvalue factors. This is critical for understanding:
- PCA: Principal components are eigenvectors of the covariance matrix
- Hessian analysis: Eigenvalues indicate curvature directions
- Spectral initialization: Eigenvectors of weight matrices
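As a small illustration of the eigenvalue relation, the sketch below decomposes a symmetric 2 x 2 matrix and checks that multiplying by the matrix only rescales each eigenvector. The matrix is an illustrative choice, and `np.linalg.eigh` is used on the assumption that the matrix is symmetric (as covariance matrices and Hessians are).

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# eigh is meant for symmetric matrices; it returns eigenvalues in ascending
# order and orthonormal eigenvectors as the columns of the second result.
eigenvalues, eigenvectors = np.linalg.eigh(A)

for lam, v in zip(eigenvalues, eigenvectors.T):
    # A v and lambda v should agree up to rounding error
    print(lam, np.allclose(A @ v, lam * v))

# Rebuild A from its eigendecomposition: A = V diag(lambda) V^T
reconstructed = eigenvectors @ np.diag(eigenvalues) @ eigenvectors.T
```

For this matrix the eigenvalues are 1 and 3, so the matrix stretches one eigenvector direction three times as much as the other.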
Singular Value Decomposition (SVD)
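Definition: SVD
Every matrix $A \in \mathbb{R}^{m \times n}$ can be decomposed as:
$$A = U \Sigma V^\top$$
where $U \in \mathbb{R}^{m \times m}$ and $V \in \mathbb{R}^{n \times n}$ are orthogonal, and $\Sigma \in \mathbb{R}^{m \times n}$ is diagonal with non-negative singular values. SVD is used in weight pruning, low-rank approximation, and understanding network expressivity.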
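Low-rank approximation is the easiest of these uses to show in code. The sketch below keeps only the `k` largest singular values of a random matrix `W`, which stands in for a weight matrix (an assumption for illustration). It relies on the Eckart-Young theorem: the spectral-norm error of the best rank-`k` approximation equals the first discarded singular value.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((6, 4))   # stand-in for a weight matrix

# Reduced SVD: U is (6, 4), S holds 4 singular values in descending order, Vt is (4, 4)
U, S, Vt = np.linalg.svd(W, full_matrices=False)

k = 2
W_k = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]   # best rank-k approximation

spectral_error = np.linalg.norm(W - W_k, ord=2)
print("rank of approximation:", np.linalg.matrix_rank(W_k))
print("error vs first dropped singular value:", spectral_error, S[k])
```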
Calculus for Deep Learning
Gradients
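Definition: Gradient
The gradient of a scalar function $f: \mathbb{R}^n \to \mathbb{R}$ is the vector of partial derivatives:
$$\nabla f(\mathbf{x}) = \left[\frac{\partial f}{\partial x_1}, \ldots, \frac{\partial f}{\partial x_n}\right]^\top$$
The gradient points in the direction of steepest ascent. Gradient descent minimizes $f$ by moving in the direction $-\nabla f(\mathbf{x})$.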
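Gradient of a Function
$$\nabla f(\mathbf{x}) = \left[\frac{\partial f}{\partial x_1}, \ldots, \frac{\partial f}{\partial x_n}\right]^\top$$
Here,
- $f$ = Scalar-valued function
- $\mathbf{x}$ = Input vector
- $\nabla f$ = Gradient vector (same shape as $\mathbf{x}$)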
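To see the gradient as a vector of partial derivatives, and gradient descent as repeated steps against it, here is a minimal sketch. The quadratic `f`, the learning rate, and the helper `numerical_grad` are illustrative assumptions; NumPy is used instead of an autograd library so that every step is visible.

```python
import numpy as np

def f(x):
    # Illustrative convex function with its minimum at (1, -3)
    return (x[0] - 1.0) ** 2 + 2.0 * (x[1] + 3.0) ** 2

def grad_f(x):
    # Analytical gradient: one partial derivative per input component
    return np.array([2.0 * (x[0] - 1.0), 4.0 * (x[1] + 3.0)])

def numerical_grad(func, x, eps=1e-6):
    # Central finite differences, one coordinate at a time
    g = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        g[i] = (func(x + step) - func(x - step)) / (2.0 * eps)
    return g

point = np.array([0.5, -1.0])
print(grad_f(point), numerical_grad(f, point))

# Gradient descent: repeatedly step in the direction of the negative gradient
x = np.array([0.0, 0.0])
learning_rate = 0.1
for _ in range(100):
    x = x - learning_rate * grad_f(x)
print(x)  # approaches the minimizer (1, -3)
```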
The Chain Rule
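Definition: Chain Rule
For composite functions, the chain rule gives:
$$\frac{d}{dx} f(g(x)) = f'(g(x)) \, g'(x)$$
For multivariate functions:
$$\frac{\partial z}{\partial x_i} = \sum_{j} \frac{\partial z}{\partial y_j} \frac{\partial y_j}{\partial x_i}$$
This is the foundation of backpropagation, the algorithm that computes the gradients used to train neural networks.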
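Chain Rule for Multivariate Functions
$$\frac{\partial z}{\partial x_i} = \sum_{j} \frac{\partial z}{\partial y_j} \frac{\partial y_j}{\partial x_i}$$
Here,
- $z$ = Output variable
- $y_j$ = Intermediate variables
- $x_i$ = Input variables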
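A small worked case may help here. The sketch below picks two intermediate variables, `y1 = x1 + x2` and `y2 = x1 * x2`, and an output `z = y1 * y2` (all illustrative choices, not from the text), then sums over the intermediate variables exactly as the multivariate chain rule prescribes.

```python
def forward(x1, x2):
    # Intermediate variables and output
    y1 = x1 + x2
    y2 = x1 * x2
    z = y1 * y2
    return z, (y1, y2)

def chain_rule_grad(x1, x2):
    _, (y1, y2) = forward(x1, x2)
    dz_dy = (y2, y1)        # dz/dy1 = y2, dz/dy2 = y1
    dy_dx1 = (1.0, x2)      # dy1/dx1 = 1, dy2/dx1 = x2
    dy_dx2 = (1.0, x1)      # dy1/dx2 = 1, dy2/dx2 = x1
    # Sum over intermediate variables j of (dz/dy_j) * (dy_j/dx_i)
    dz_dx1 = sum(a * b for a, b in zip(dz_dy, dy_dx1))
    dz_dx2 = sum(a * b for a, b in zip(dz_dy, dy_dx2))
    return dz_dx1, dz_dx2

print(chain_rule_grad(2.0, 3.0))  # (21.0, 16.0)
```

Expanding `z = x1**2 * x2 + x1 * x2**2` and differentiating directly gives `2*x1*x2 + x2**2 = 21` and `x1**2 + 2*x1*x2 = 16` at the point (2, 3), which matches.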
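Theorem: Chain Rule for Deep Networks (Composition)
For a deep network $f = f_L \circ f_{L-1} \circ \cdots \circ f_1$ with layer outputs $\mathbf{h}_k = f_k(\mathbf{h}_{k-1}; \theta_k)$, the gradient of the loss $\mathcal{L}$ with respect to parameters $\theta_l$ in layer $l$ is:
$$\frac{\partial \mathcal{L}}{\partial \theta_l} = \frac{\partial \mathcal{L}}{\partial \mathbf{h}_L} \, \frac{\partial \mathbf{h}_L}{\partial \mathbf{h}_{L-1}} \cdots \frac{\partial \mathbf{h}_{l+1}}{\partial \mathbf{h}_l} \, \frac{\partial \mathbf{h}_l}{\partial \theta_l}$$
Each factor $\partial \mathbf{h}_k / \partial \mathbf{h}_{k-1}$ is the Jacobian of layer $k$. When the network is deep, the product of these Jacobians can shrink or grow exponentially with depth, which causes vanishing or exploding gradients.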
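The effect of multiplying many Jacobians can be simulated directly. In the sketch below each layer Jacobian is modelled as a scaled random orthogonal matrix, which is a simplifying assumption (real Jacobians are not orthogonal) chosen so that the gradient norm changes by exactly `scale` per layer. The function name and its parameters are hypothetical.

```python
import numpy as np

def gradient_norm_through_depth(scale, depth, width=4, seed=0):
    """Norm of a unit upstream gradient after passing through `depth` Jacobians."""
    rng = np.random.default_rng(seed)
    grad = np.ones(width) / np.sqrt(width)       # unit-norm row vector dL/dh_L
    for _ in range(depth):
        q, _r = np.linalg.qr(rng.standard_normal((width, width)))
        jacobian = scale * q                     # orthogonal matrix times a scalar
        grad = grad @ jacobian                   # one more chain-rule factor
    return np.linalg.norm(grad)

for scale in (0.5, 1.0, 1.5):
    print(scale, gradient_norm_through_depth(scale, depth=30))
```

With `scale = 0.5` the norm after 30 layers is `0.5**30` (vanishing), with `scale = 1.5` it is `1.5**30` (exploding), and only `scale = 1.0` keeps it stable.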
The Hessian
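Definition: Hessian Matrix
The Hessian is the matrix of second-order partial derivatives:
$$H_{ij} = \frac{\partial^2 f}{\partial x_i \, \partial x_j}$$
It captures the curvature of $f$. At a critical point (where $\nabla f = \mathbf{0}$), a positive definite Hessian indicates a local minimum and an indefinite Hessian indicates a saddle point.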
Hessian Matrix
$$H_{ij} = \frac{\partial^2 f}{\partial x_i \, \partial x_j}$$
Here,
- $H$ = Hessian matrix (n x n)
- $f$ = Scalar function to differentiate
- $x_i$ = Input variable i
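ℹ️ Hessian and Optimization
- Positive definite Hessian ($H \succ 0$) at a critical point: Local minimum
- Negative definite Hessian ($H \prec 0$) at a critical point: Local maximum
- Indefinite Hessian at a critical point: Saddle point (common in high-dimensional spaces)
- Largest eigenvalue of the Hessian = maximum curvature, which limits the stable step size (for a quadratic, gradient descent needs a learning rate below $2/\lambda_{\max}$)
- Second-order and quasi-Newton optimizers (e.g., Newton's method, L-BFGS) use exact or approximate Hessian information but are expensive for large models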
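The classification rules above reduce to inspecting the signs of the Hessian's eigenvalues. The following sketch assumes the Hessian is already known and symmetric, and that it is evaluated at a critical point; `classify_critical_point` and the tolerance `tol` are hypothetical names introduced for illustration.

```python
import numpy as np

def classify_critical_point(hessian, tol=1e-10):
    """Classify a critical point from the eigenvalues of its symmetric Hessian."""
    eigenvalues = np.linalg.eigvalsh(hessian)
    if np.all(eigenvalues > tol):
        return "local minimum"      # positive definite
    if np.all(eigenvalues < -tol):
        return "local maximum"      # negative definite
    if np.any(eigenvalues > tol) and np.any(eigenvalues < -tol):
        return "saddle point"       # indefinite
    return "inconclusive"           # some eigenvalue is (numerically) zero

# Hessians at the origin of x^2 + y^2, x^2 - y^2, -x^2 - y^2, and x^2
print(classify_critical_point(np.diag([2.0, 2.0])))
print(classify_critical_point(np.diag([2.0, -2.0])))
print(classify_critical_point(np.diag([-2.0, -2.0])))
print(classify_critical_point(np.diag([2.0, 0.0])))
```

The last case shows why the test can fail to decide: with a zero eigenvalue, higher-order terms determine the behaviour.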
Probability Theory
Key Distributions
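Definition: Gaussian Distribution
The Gaussian (normal) distribution with mean $\mu$ and variance $\sigma^2$ has density:
$$p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$
It is fundamental to deep learning: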
- Weights are often initialized from Gaussians
- Batch normalization standardizes activations to zero mean and unit variance (it does not require them to be exactly Gaussian)
- VAEs model the latent space as Gaussian
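For reference, the Gaussian density with mean `mu` and standard deviation `sigma` can be written in a few lines. The sketch below uses only the standard library; the midpoint-rule helper `integrate` is a hypothetical addition used to check numerically that the density integrates to about 1.

```python
import math

def gaussian_pdf(x, mu=0.0, sigma=1.0):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    coeff = 1.0 / math.sqrt(2.0 * math.pi * sigma ** 2)
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

def integrate(func, lo, hi, steps=10_000):
    # Midpoint rule: sum of function values at interval midpoints times the width
    width = (hi - lo) / steps
    return sum(func(lo + (i + 0.5) * width) for i in range(steps)) * width

peak = gaussian_pdf(0.0)                      # 1 / sqrt(2 * pi) for the standard normal
total = integrate(gaussian_pdf, -8.0, 8.0)    # close to 1
print(peak, total)
```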
Definition: Bernoulli Distribution
The Bernoulli distribution models binary outcomes:
$$P(X = x) = p^{x} (1 - p)^{1 - x}, \qquad x \in \{0, 1\}$$
Used for binary classification, dropout masks, and binary cross-entropy loss.
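Dropout masks are a direct application: each unit is kept with probability `keep_prob` by drawing an independent Bernoulli sample. The sketch below assumes NumPy and uses the common "inverted dropout" scaling so that the expected activation is unchanged; the variable names and the value 0.8 are illustrative.

```python
import numpy as np

def bernoulli_pmf(x, p):
    """P(X = x) for x in {0, 1} with success probability p."""
    return p ** x * (1.0 - p) ** (1 - x)

rng = np.random.default_rng(0)
keep_prob = 0.8
activations = np.ones(10_000)

# One Bernoulli(keep_prob) draw per activation: 1 keeps the unit, 0 drops it
mask = rng.binomial(n=1, p=keep_prob, size=activations.shape)

# Inverted dropout: rescale the kept units so the mean activation is preserved
dropped = activations * mask / keep_prob

print(bernoulli_pmf(1, keep_prob), bernoulli_pmf(0, keep_prob))
print(mask.mean(), dropped.mean())
```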
Information Theory
Definition: Cross-Entropy
Cross-entropy measures the difference between two probability distributions:
$$H(p, q) = -\sum_{x} p(x) \log q(x)$$
In classification, $p$ is the true label distribution and $q$ is the predicted distribution. Minimizing cross-entropy is equivalent to maximizing the likelihood of the correct class.
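With a one-hot label, the sum collapses to a single term: the negative log of the probability assigned to the correct class. The sketch below shows this with the standard library only. The small constant `eps` is an assumption added to avoid `log(0)`, and the two predicted distributions are made-up values.

```python
import math

def cross_entropy(p, q, eps=1e-12):
    """Cross-entropy between a true distribution p and a predicted distribution q."""
    # Terms with p_i = 0 contribute nothing, so they are skipped
    return -sum(pi * math.log(qi + eps) for pi, qi in zip(p, q) if pi > 0)

true_dist = [0.0, 1.0, 0.0]          # one-hot label: class 1 is correct
good_prediction = [0.1, 0.7, 0.2]
bad_prediction = [0.6, 0.1, 0.3]

loss_good = cross_entropy(true_dist, good_prediction)   # about -log(0.7)
loss_bad = cross_entropy(true_dist, bad_prediction)     # about -log(0.1)
print(loss_good, loss_bad)
```

The lower loss for `good_prediction` reflects the higher likelihood it assigns to the correct class.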
Definition: KL Divergence
KL divergence measures how one distribution diverges from another:
$$D_{\mathrm{KL}}(p \,\|\, q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)}$$
It is always non-negative and equals zero only when $p = q$. Used in VAEs, knowledge distillation, and distribution matching.
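These properties are easy to check numerically. The sketch below uses only the standard library and assumes discrete distributions in which `q` is positive wherever `p` is; the two example distributions are illustrative. It also checks the identity that KL divergence equals cross-entropy minus the entropy of `p`.

```python
import math

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    # Assumes q_i > 0 wherever p_i > 0; terms with p_i = 0 contribute nothing
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]
q = [0.9, 0.1]

kl_pq = kl_divergence(p, q)
kl_qp = kl_divergence(q, p)
print(kl_pq, kl_qp)                          # both positive, and not equal
print(kl_divergence(p, p))                   # zero when the distributions match
print(cross_entropy(p, q) - entropy(p))      # same value as kl_pq
```

Because the two directions differ, KL divergence is not a distance metric; which direction to minimize is a modelling choice.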
Practical Example: Gradient Computation
📝Example: Manual Gradient Computation
import torch
# Define a simple computation: f(x, y) = (x^2 + y^2) * exp(-(x^2 + y^2))
# This is a radially symmetric function of r^2 = x^2 + y^2 (zero at the origin, with a ring-shaped ridge at r = 1)
x = torch.tensor(1.0, requires_grad=True)
y = torch.tensor(2.0, requires_grad=True)
# Forward pass
r_sq = x**2 + y**2
z = r_sq * torch.exp(-r_sq)
# Backward pass (computes gradients via the chain rule).
# retain_graph=True keeps the graph so it can be differentiated again further down;
# without it, the later torch.autograd.grad call raises a RuntimeError.
z.backward(retain_graph=True)
# Analytical gradients:
# df/dx = (2x - 2x(x^2 + y^2)) * exp(-(x^2 + y^2))
# df/dy = (2y - 2y(x^2 + y^2)) * exp(-(x^2 + y^2))
print(f"f(1, 2) = {z.item():.6f}")
print(f"df/dx = {x.grad.item():.6f}")
print(f"df/dy = {y.grad.item():.6f}")
# Verify with torch.autograd.grad, which returns the gradients
# instead of accumulating them into x.grad and y.grad
grad_x, grad_y = torch.autograd.grad(z, [x, y])
print(f"\nAutograd df/dx: {grad_x.item():.6f}")
print(f"Autograd df/dy: {grad_y.item():.6f}")
📝Example: Hessian-Vector Product
import torch
x = torch.tensor([2.0, 3.0], requires_grad=True)
# Quadratic function: f(x) = 0.5 * x^T H x
H = torch.tensor([[4.0, 1.0], [1.0, 3.0]])
f = 0.5 * x @ H @ x
# First-order gradient; create_graph=True records the computation so the gradient can itself be differentiated
grad = torch.autograd.grad(f, x, create_graph=True)[0]
print(f"Gradient: {grad}")
# Hessian-vector product: H @ v
v = torch.tensor([1.0, 1.0])
hvp = torch.autograd.grad(grad, x, grad_outputs=v, retain_graph=True)[0]
print(f"Hessian-vector product: {hvp}")
print(f"Direct H @ v: {H @ v}")
Summary
📋Summary: Math Foundations for Deep Learning
- Linear algebra: Vectors, matrices, eigendecomposition, SVD — underpin all neural network operations
- Chain rule: The backbone of backpropagation — enables efficient gradient computation
- Hessian: Captures curvature — positive definite for minima, indefinite for saddle points
- Gaussian distribution: Weight initialization, batch normalization, VAEs
- Cross-entropy: The standard loss for classification — equivalent to maximum likelihood
- KL divergence: Measures distribution differences — used in VAEs, distillation
- Matrix calculus: Extends gradient computation to functions of matrix-valued parameters (weight matrices)
Practice Exercises
- Linear Algebra: Given , compute its eigenvalues and eigenvectors. Verify that .
- Chain Rule: Let and . Compute using the chain rule. Verify with PyTorch autograd.
- Hessian: Compute the Hessian of . Find all critical points and classify them (min/max/saddle).
- Probability: Derive the gradient of the negative log-likelihood for a Gaussian distribution with respect to $\mu$ and $\sigma$. Implement it in PyTorch.
- Coding: Implement a function that computes the full Hessian matrix of a scalar function using PyTorch's autograd. Test it on a 2D quadratic function.