Math Foundations for Deep Learning — Linear Algebra, Calculus & Probability


Math Foundations for Deep Learning

Deep learning is built on linear algebra, calculus, and probability. This tutorial covers the essential math you need to understand how neural networks learn.

See our Math Linear Algebra tutorial and Math Calculus tutorial for broader mathematical foundations.


Linear Algebra

Vectors and Matrices

Definition: Vector and Matrix Operations

  • Dot Product: $\mathbf{u} \cdot \mathbf{v} = \sum_i u_i v_i = \|\mathbf{u}\| \|\mathbf{v}\| \cos\theta$
  • Matrix Multiplication: $\mathbf{C} = \mathbf{A}\mathbf{B}$ where $C_{ij} = \sum_k A_{ik} B_{kj}$
  • Norm: $\|\mathbf{x}\|_p = \left(\sum_i |x_i|^p\right)^{1/p}$
  • Transpose: $(\mathbf{A}^T)_{ij} = A_{ji}$
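
The operations above can be sketched in plain Python. This is a minimal illustration using lists rather than a tensor library; the helper names `dot`, `p_norm`, and `transpose` are chosen here for the example and do not come from any particular package.

```python
def dot(u, v):
    """Dot product: sum of element-wise products."""
    return sum(ui * vi for ui, vi in zip(u, v))

def p_norm(x, p=2):
    """p-norm: (sum_i |x_i|^p)^(1/p)."""
    return sum(abs(xi) ** p for xi in x) ** (1.0 / p)

def transpose(A):
    """Swap rows and columns of a list-of-lists matrix."""
    return [list(col) for col in zip(*A)]

u, v = [3.0, 4.0], [4.0, 3.0]

# cos(theta) follows from u . v = ||u|| ||v|| cos(theta).
cos_theta = dot(u, v) / (p_norm(u) * p_norm(v))

print(dot(u, v))       # 24.0
print(p_norm(u))       # 5.0  (Euclidean norm)
print(p_norm(u, p=1))  # 7.0  (sum of absolute values)
print(cos_theta)       # 0.96
print(transpose([[1, 2, 3], [4, 5, 6]]))  # [[1, 4], [2, 5], [3, 6]]
```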

Matrix Multiplication

$$C_{ij} = \sum_{k=1}^{n} A_{ik} \cdot B_{kj}$$

Here,

  • $A$ = Input matrix of shape (m x n)
  • $B$ = Input matrix of shape (n x p)
  • $C$ = Output matrix of shape (m x p)
  • $n$ = Inner dimension (must match)
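
To make the index formula and the shape rule concrete, here is a small sketch that implements the sum over k directly with loops. The function name `matmul` is an illustrative choice. In practice a library routine would be used instead of explicit loops.

```python
def matmul(A, B):
    """Multiply an (m x n) matrix by an (n x p) matrix using C_ij = sum_k A_ik * B_kj."""
    m, n = len(A), len(A[0])
    if len(B) != n:
        raise ValueError("inner dimensions must match")
    p = len(B[0])
    C = [[0.0] * p for _ in range(m)]
    for i in range(m):
        for j in range(p):
            C[i][j] = sum(A[i][k] * B[k][j] for k in range(n))
    return C

A = [[1, 2, 3], [4, 5, 6]]    # shape (2 x 3)
B = [[1, 0], [0, 1], [1, 1]]  # shape (3 x 2)
C = matmul(A, B)              # shape (2 x 2)
print(C)  # [[4, 5], [10, 11]]
```

Passing a `B` whose row count differs from the column count of `A` raises `ValueError`, which mirrors the "inner dimension must match" rule.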

Eigenvalues and Eigenvectors

Definition: Eigenvalue Decomposition

For a square matrix $\mathbf{A}$, a nonzero eigenvector $\mathbf{v}$ and eigenvalue $\lambda$ satisfy:

$$\mathbf{A}\mathbf{v} = \lambda \mathbf{v}$$

When $\mathbf{A}$ is real and symmetric, it has the eigendecomposition

$$\mathbf{A} = \mathbf{Q} \mathbf{\Lambda} \mathbf{Q}^T \quad \text{where} \quad \mathbf{\Lambda} = \text{diag}(\lambda_1, \ldots, \lambda_n)$$

where the columns of the orthogonal matrix $\mathbf{Q}$ are the eigenvectors. The decomposition reveals the matrix's action: stretching along eigenvector directions by eigenvalue factors. This is critical for understanding:

  • PCA: Principal components are eigenvectors of the covariance matrix
  • Hessian analysis: Eigenvalues give the curvature along each eigenvector direction
  • Spectral initialization: Eigenvectors of weight matrices
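
As a small worked sketch, the eigenpairs of a symmetric 2x2 matrix can be computed in closed form and checked against the defining equation. The helper `eig_sym_2x2` and the example matrix are assumptions made for illustration. The formula below assumes a nonzero off-diagonal entry.

```python
import math

def eig_sym_2x2(a, b, d):
    """Eigenpairs of the symmetric matrix [[a, b], [b, d]], assuming b != 0."""
    mean = (a + d) / 2.0
    radius = math.sqrt(((a - d) / 2.0) ** 2 + b ** 2)
    pairs = []
    for lam in (mean + radius, mean - radius):
        # (b, lam - a) solves (A - lam I) v = 0; normalize it to unit length.
        vx, vy = b, lam - a
        norm = math.hypot(vx, vy)
        pairs.append((lam, (vx / norm, vy / norm)))
    return pairs

a, b, d = 3.0, 1.0, 3.0
pairs = eig_sym_2x2(a, b, d)

for lam, (vx, vy) in pairs:
    # Residual of A v - lambda v; both components should be close to zero.
    rx = a * vx + b * vy - lam * vx
    ry = b * vx + d * vy - lam * vy
    print(f"lambda = {lam:.4f}, v = ({vx:.4f}, {vy:.4f}), residual = ({rx:.1e}, {ry:.1e})")

# Rebuild A from Q Lambda Q^T, written as sum_i lambda_i * v_i v_i^T.
rebuilt = [[sum(lam * v[i] * v[j] for lam, v in pairs) for j in range(2)]
           for i in range(2)]
print(rebuilt)
```

For this matrix the eigenvalues are 4 and 2, with eigenvectors along (1, 1) and (1, -1).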

Singular Value Decomposition (SVD)

Definition: SVD

Every matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$ can be decomposed as:

$$\mathbf{A} = \mathbf{U} \mathbf{\Sigma} \mathbf{V}^T$$

where $\mathbf{U} \in \mathbb{R}^{m \times m}$ and $\mathbf{V} \in \mathbb{R}^{n \times n}$ are orthogonal, and $\mathbf{\Sigma} \in \mathbb{R}^{m \times n}$ is zero except for the non-negative singular values on its diagonal. SVD is used in weight pruning, low-rank approximation, and understanding network expressivity.
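
One way to see where singular values come from: they are the square roots of the eigenvalues of $\mathbf{A}^T\mathbf{A}$. The sketch below uses that fact for a 2x2 matrix and checks two standard identities (the product of singular values equals |det A|, and the sum of their squares equals the squared Frobenius norm). The helper name and the example matrix are illustrative assumptions. A numerical library would normally be used for a full SVD.

```python
import math

def singular_values_2x2(A):
    """Singular values of a 2x2 matrix: square roots of the eigenvalues of A^T A."""
    (a, b), (c, d) = A
    # Entries of the symmetric matrix G = A^T A = [[g11, g12], [g12, g22]].
    g11 = a * a + c * c
    g12 = a * b + c * d
    g22 = b * b + d * d
    mean = (g11 + g22) / 2.0
    radius = math.sqrt(((g11 - g22) / 2.0) ** 2 + g12 ** 2)
    # max(..., 0.0) guards against tiny negative values caused by rounding.
    return math.sqrt(mean + radius), math.sqrt(max(mean - radius, 0.0))

A = [[3.0, 0.0], [4.0, 5.0]]
s1, s2 = singular_values_2x2(A)

det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
frob_sq = sum(x * x for row in A for x in row)

print(f"singular values: {s1:.4f}, {s2:.4f}")
print(f"s1 * s2 = {s1 * s2:.4f}, |det A| = {abs(det):.4f}")
print(f"s1^2 + s2^2 = {s1**2 + s2**2:.4f}, ||A||_F^2 = {frob_sq:.4f}")
```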


Calculus for Deep Learning

Gradients

Definition: Gradient

The gradient of a scalar function $f: \mathbb{R}^n \to \mathbb{R}$ is the vector of partial derivatives:

$$\nabla f = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix}$$

The gradient points in the direction of steepest ascent. Gradient descent minimizes $f$ by moving in the direction $-\nabla f$.
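
A minimal sketch of gradient descent on a two-variable quadratic shows the update rule in action. The function, learning rate, and step count are arbitrary choices made for this illustration.

```python
def f(x, y):
    """Quadratic bowl with its minimum at (1, -2)."""
    return (x - 1.0) ** 2 + 2.0 * (y + 2.0) ** 2

def grad_f(x, y):
    """Gradient of f: the vector of partial derivatives (df/dx, df/dy)."""
    return 2.0 * (x - 1.0), 4.0 * (y + 2.0)

def gradient_descent(x, y, lr=0.1, steps=100):
    """Repeatedly step in the direction of the negative gradient."""
    for _ in range(steps):
        gx, gy = grad_f(x, y)
        x, y = x - lr * gx, y - lr * gy
    return x, y

x_min, y_min = gradient_descent(5.0, 5.0)
print(f"minimum found near ({x_min:.6f}, {y_min:.6f}), f = {f(x_min, y_min):.2e}")
```

Starting from (5, 5), the iterates approach (1, -2), the point where the gradient is zero.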

Gradient of a Function

$$\nabla f(\mathbf{x}) = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix}$$

Here,

  • $f$ = Scalar-valued function
  • $\mathbf{x}$ = Input vector
  • $\nabla f$ = Gradient vector (same shape as $\mathbf{x}$)

The Chain Rule

Definition: Chain Rule

For composite functions, the chain rule gives:

$$\frac{d}{dx} f(g(x)) = f'(g(x)) \cdot g'(x)$$

For multivariate functions:

$$\frac{\partial z}{\partial x_i} = \sum_j \frac{\partial z}{\partial y_j} \cdot \frac{\partial y_j}{\partial x_i}$$

This is the foundation of backpropagation, the algorithm used to train nearly all modern neural networks.

Chain Rule for Multivariate Functions

$$\frac{\partial z}{\partial x_i} = \sum_{j} \frac{\partial z}{\partial y_j} \cdot \frac{\partial y_j}{\partial x_i}$$

Here,

  • $z$ = Output variable
  • $y_j$ = Intermediate variables
  • $x_i$ = Input variables
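
The multivariate chain rule can be checked by hand on a tiny computation. In this sketch the intermediate variables are y1 = x1 + x2 and y2 = x1 * x2, and the output is z = y1 * y2. These functions are made up for the example. The result is compared against a central finite-difference estimate.

```python
def forward(x1, x2):
    """Compute the intermediates y1, y2 and the output z."""
    y1 = x1 + x2
    y2 = x1 * x2
    z = y1 * y2
    return y1, y2, z

def chain_rule_grad(x1, x2):
    """dz/dx_i = sum_j (dz/dy_j) * (dy_j/dx_i)."""
    y1, y2, _ = forward(x1, x2)
    dz_dy = (y2, y1)     # dz/dy1 = y2, dz/dy2 = y1
    dy_dx1 = (1.0, x2)   # dy1/dx1 = 1, dy2/dx1 = x2
    dy_dx2 = (1.0, x1)   # dy1/dx2 = 1, dy2/dx2 = x1
    dz_dx1 = sum(a * b for a, b in zip(dz_dy, dy_dx1))
    dz_dx2 = sum(a * b for a, b in zip(dz_dy, dy_dx2))
    return dz_dx1, dz_dx2

def finite_difference_grad(x1, x2, eps=1e-6):
    """Central-difference estimate of the same two partial derivatives."""
    d1 = (forward(x1 + eps, x2)[2] - forward(x1 - eps, x2)[2]) / (2 * eps)
    d2 = (forward(x1, x2 + eps)[2] - forward(x1, x2 - eps)[2]) / (2 * eps)
    return d1, d2

analytic = chain_rule_grad(2.0, 3.0)
numeric = finite_difference_grad(2.0, 3.0)
print(analytic)  # (21.0, 16.0)
print(numeric)
```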

Theorem: Chain Rule for Deep Networks (Composition)

For a deep network $f_L \circ f_{L-1} \circ \cdots \circ f_1$, the gradient of the loss $\mathcal{L}$ with respect to parameters $\theta^{(l)}$ in layer $l$ is:

$$\frac{\partial \mathcal{L}}{\partial \theta^{(l)}} = \frac{\partial \mathcal{L}}{\partial \mathbf{h}^{(L)}} \cdot \prod_{k=l+1}^{L} \frac{\partial \mathbf{h}^{(k)}}{\partial \mathbf{h}^{(k-1)}} \cdot \frac{\partial \mathbf{h}^{(l)}}{\partial \theta^{(l)}}$$

Each factor $\frac{\partial \mathbf{h}^{(k)}}{\partial \mathbf{h}^{(k-1)}}$ is the Jacobian of layer $k$. When the depth $L$ is large, this product of Jacobians can shrink toward zero or grow without bound, which is the source of vanishing and exploding gradients.
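
The effect of multiplying many Jacobians can be seen in a scalar toy model, where each layer's Jacobian is a single number. This is an illustrative sketch, not a real network: `sigmoid_chain_gradient` assumes layers of the form h_k = sigmoid(w * h_{k-1}), and `linear_chain_gradient` assumes h_k = w * h_{k-1}.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def sigmoid_chain_gradient(w, depth, h0=0.5):
    """d h_L / d h_0 for h_k = sigmoid(w * h_{k-1}), as a product of per-layer derivatives."""
    h, grad = h0, 1.0
    for _ in range(depth):
        h = sigmoid(w * h)
        # Per-layer Jacobian: w * sigmoid'(w * h_prev) = w * h * (1 - h), and h * (1 - h) <= 0.25.
        grad *= w * h * (1.0 - h)
    return grad

def linear_chain_gradient(w, depth):
    """d h_L / d h_0 for h_k = w * h_{k-1}: the same factor w repeated `depth` times."""
    return w ** depth

for depth in (5, 20, 50):
    print(depth,
          f"sigmoid: {sigmoid_chain_gradient(1.0, depth):.3e}",
          f"linear w=0.5: {linear_chain_gradient(0.5, depth):.3e}",
          f"linear w=1.5: {linear_chain_gradient(1.5, depth):.3e}")
```

With factors below 1 in magnitude the product decays geometrically with depth, and with factors above 1 it grows geometrically.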

The Hessian

Definition: Hessian Matrix

The Hessian is the matrix of second-order partial derivatives:

$$\mathbf{H}_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}$$

It captures the curvature of $f$. At a critical point (where $\nabla f = \mathbf{0}$), a positive definite Hessian indicates a local minimum and an indefinite Hessian indicates a saddle point.

Hessian Matrix

$$\mathbf{H} = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots \\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots \\ \vdots & \vdots & \ddots \end{bmatrix}$$

Here,

  • $\mathbf{H}$ = Hessian matrix (n x n)
  • $f$ = Scalar function to differentiate
  • $x_i$ = Input variable i

ℹ️ Hessian and Optimization

The first three cases apply at a critical point, where the gradient is zero.

  • Positive definite Hessian ($\mathbf{H} \succ 0$): Local minimum
  • Negative definite Hessian ($\mathbf{H} \prec 0$): Local maximum
  • Indefinite Hessian: Saddle point (common in high-dimensional spaces)
  • Largest eigenvalue of the Hessian = maximum curvature, which limits the stable step size
  • Quasi-Newton optimizers such as L-BFGS approximate curvature from gradient history; methods that form the full Hessian are expensive for large models
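
The classification rules above can be sketched for a 2x2 Hessian by checking the signs of its two eigenvalues. The function name `classify_critical_point` and the tolerance are illustrative assumptions. The Hessian entries passed in are those of simple example functions evaluated at the origin.

```python
import math

def classify_critical_point(h11, h12, h22, tol=1e-12):
    """Classify a critical point from the symmetric Hessian [[h11, h12], [h12, h22]]."""
    mean = (h11 + h22) / 2.0
    radius = math.sqrt(((h11 - h22) / 2.0) ** 2 + h12 ** 2)
    lam_max, lam_min = mean + radius, mean - radius
    if lam_min > tol:
        return "local minimum"   # all eigenvalues positive
    if lam_max < -tol:
        return "local maximum"   # all eigenvalues negative
    if lam_min < -tol and lam_max > tol:
        return "saddle point"    # mixed signs
    return "inconclusive"        # a zero eigenvalue: the second-order test does not decide

print(classify_critical_point(2.0, 0.0, 6.0))    # f = x^2 + 3y^2  -> local minimum
print(classify_critical_point(-2.0, 0.0, -2.0))  # f = -x^2 - y^2  -> local maximum
print(classify_critical_point(2.0, 0.0, -2.0))   # f = x^2 - y^2   -> saddle point
print(classify_critical_point(0.0, 0.0, 0.0))    # f = x^4 + y^4   -> inconclusive
```

The last case is a reminder that a singular Hessian says nothing definite: x^4 + y^4 does have a minimum at the origin, but the second-order test cannot detect it.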

Probability Theory

Key Distributions

Definition: Gaussian Distribution

The Gaussian (normal) distribution is fundamental to deep learning:

$$p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$

  • Weights are often initialized from Gaussians
  • Batch normalization standardizes activations to zero mean and unit variance, matching the first two moments of a standard Gaussian
  • VAEs model the latent space as Gaussian
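
The density formula translates directly into code. The sketch below evaluates it and uses a simple Riemann sum over [-8, 8] to check numerically that the standard normal density integrates to about 1. The function name, grid, and step size are choices made for this example.

```python
import math

def gaussian_pdf(x, mu=0.0, sigma=1.0):
    """Density of N(mu, sigma^2) at x."""
    var = sigma ** 2
    return math.exp(-((x - mu) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

# The density peaks at x = mu with value 1 / sqrt(2 * pi * sigma^2).
peak = gaussian_pdf(0.0)

# Riemann sum of the standard normal density over [-8, 8].
step = 0.01
area = sum(gaussian_pdf(-8.0 + i * step) * step for i in range(1601))

print(f"peak of standard normal: {peak:.6f}")
print(f"approximate area under the curve: {area:.6f}")
```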

Definition: Bernoulli Distribution

The Bernoulli distribution models binary outcomes:

$$p(x) = \begin{cases} p & \text{if } x = 1 \\ 1-p & \text{if } x = 0 \end{cases}$$

Used for binary classification, dropout masks, and binary cross-entropy loss.
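
The link to binary cross-entropy is that the loss for one example is the negative log of the Bernoulli probability of the observed label. The sketch below shows this, along with a dropout-style mask drawn from independent Bernoulli samples. The function names, the keep probability, and the seed are illustrative assumptions.

```python
import math
import random

def bernoulli_pmf(x, p):
    """P(X = x) for x in {0, 1} with success probability p."""
    return p if x == 1 else 1.0 - p

def bernoulli_nll(x, p):
    """Negative log-likelihood of label x; this is the binary cross-entropy for one example."""
    return -(x * math.log(p) + (1 - x) * math.log(1.0 - p))

def dropout_mask(n, keep_prob, rng):
    """Sample n independent Bernoulli(keep_prob) values: 1 keeps a unit, 0 drops it."""
    return [1 if rng.random() < keep_prob else 0 for _ in range(n)]

for label in (0, 1):
    print(label, bernoulli_pmf(label, 0.3), bernoulli_nll(label, 0.3))

rng = random.Random(0)  # fixed seed so the mask is reproducible
mask = dropout_mask(10, keep_prob=0.8, rng=rng)
print(mask)
```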

Information Theory

Definition: Cross-Entropy

Cross-entropy measures the difference between two probability distributions:

$$H(p, q) = -\sum_x p(x) \log q(x)$$

In classification, $p$ is the true label distribution and $q$ is the predicted distribution. With one-hot labels, minimizing cross-entropy is equivalent to maximizing the likelihood of the correct class.

Averaged over $N$ examples and $C$ classes, with labels $y_{ic}$ and predicted probabilities $\hat{y}_{ic}$, the loss is:

$$\mathcal{L}_{\text{CE}} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} y_{ic} \log(\hat{y}_{ic})$$
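
The batch loss can be computed directly from its double sum. In this sketch the labels and predicted probabilities are made-up values, and terms with a zero label are skipped so that a predicted probability of zero on a wrong class does not trigger log(0).

```python
import math

def cross_entropy(y_true, y_pred):
    """Mean cross-entropy over a batch: -(1/N) * sum_i sum_c y_ic * log(yhat_ic)."""
    n = len(y_true)
    total = 0.0
    for y_row, q_row in zip(y_true, y_pred):
        for y, q in zip(y_row, q_row):
            if y > 0.0:  # zero-label terms contribute nothing to the sum
                total -= y * math.log(q)
    return total / n

y_true = [[1.0, 0.0, 0.0],
          [0.0, 1.0, 0.0]]   # one-hot labels
y_pred = [[0.7, 0.2, 0.1],
          [0.1, 0.8, 0.1]]   # predicted class probabilities

loss = cross_entropy(y_true, y_pred)
print(f"cross-entropy loss: {loss:.6f}")

# With one-hot labels only the correct-class probability matters.
same = -(math.log(0.7) + math.log(0.8)) / 2.0
print(f"mean negative log-probability of the correct class: {same:.6f}")
```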

Definition: KL Divergence

KL divergence measures how one distribution diverges from another:

$$D_{\text{KL}}(p \| q) = \sum_x p(x) \log \frac{p(x)}{q(x)}$$

It is always non-negative and equals zero only when $p = q$. Used in VAEs, knowledge distillation, and distribution matching.
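
A short sketch for discrete distributions illustrates three properties: KL divergence is zero for identical distributions, it is not symmetric, and it relates to cross-entropy through H(p, q) = H(p) + D_KL(p || q). The distributions below are arbitrary example values, and the helper names are chosen for this illustration.

```python
import math

def entropy(p):
    """H(p) = -sum_x p(x) log p(x), skipping zero-probability outcomes."""
    return -sum(px * math.log(px) for px in p if px > 0.0)

def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) log q(x)."""
    return -sum(px * math.log(qx) for px, qx in zip(p, q) if px > 0.0)

def kl_divergence(p, q):
    """D_KL(p || q) = sum_x p(x) log(p(x) / q(x))."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0.0)

p = [0.5, 0.5]
q = [0.9, 0.1]

kl_pq = kl_divergence(p, q)
kl_qp = kl_divergence(q, p)

print(f"KL(p || q) = {kl_pq:.4f}")
print(f"KL(q || p) = {kl_qp:.4f}  (differs: KL is not symmetric)")
print(f"KL(p || p) = {kl_divergence(p, p):.4f}")
print(f"H(p, q) - H(p) = {cross_entropy(p, q) - entropy(p):.4f}")
```

Because H(p) does not depend on q, minimizing cross-entropy over q and minimizing KL divergence over q give the same result.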


Practical Example: Gradient Computation

📝Example: Manual Gradient Computation

import torch

# Define a simple computation: f(x, y) = (x^2 + y^2) * exp(-(x^2 + y^2))
# This is a radially symmetric function: zero at the origin, peaking on the circle x^2 + y^2 = 1

x = torch.tensor(1.0, requires_grad=True)
y = torch.tensor(2.0, requires_grad=True)

# Forward pass
r_sq = x**2 + y**2
z = r_sq * torch.exp(-r_sq)

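# Backward pass (computes gradients via chain rule).
# retain_graph=True keeps the graph alive so it can be differentiated again
# by the torch.autograd.grad call further down.
z.backward(retain_graph=True)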

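# Analytical gradients, evaluated by hand for comparison:
# df/dx = (2x - 2x(x^2 + y^2)) * exp(-(x^2 + y^2))
# df/dy = (2y - 2y(x^2 + y^2)) * exp(-(x^2 + y^2))
with torch.no_grad():
    scale = (1 - r_sq) * torch.exp(-r_sq)
    manual_dx, manual_dy = 2 * x * scale, 2 * y * scale

print(f"f(1, 2) = {z.item():.6f}")
print(f"df/dx = {x.grad.item():.6f} (manual: {manual_dx.item():.6f})")
print(f"df/dy = {y.grad.item():.6f} (manual: {manual_dy.item():.6f})")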

# Verify with torch.autograd.grad
grad_x, grad_y = torch.autograd.grad(
    z, [x, y], create_graph=True
)
print(f"\nAutograd df/dx: {grad_x.item():.6f}")
print(f"Autograd df/dy: {grad_y.item():.6f}")

📝Example: Hessian-Vector Product

import torch

x = torch.tensor([2.0, 3.0], requires_grad=True)

# Quadratic function: f(x) = 0.5 * x^T H x
H = torch.tensor([[4.0, 1.0], [1.0, 3.0]])
f = 0.5 * x @ H @ x

# First-order gradient; create_graph=True records the graph so the gradient can be differentiated again
grad = torch.autograd.grad(f, x, create_graph=True)[0]
print(f"Gradient: {grad}")

# Hessian-vector product: H @ v
v = torch.tensor([1.0, 1.0])
hvp = torch.autograd.grad(grad, x, grad_outputs=v, retain_graph=True)[0]
print(f"Hessian-vector product: {hvp}")
print(f"Direct H @ v: {H @ v}")

Summary

📋Summary: Math Foundations for Deep Learning

  • Linear algebra: Vectors, matrices, eigendecomposition, SVD — underpin all neural network operations
  • Chain rule: The backbone of backpropagation — enables efficient gradient computation
  • Hessian: Captures curvature — positive definite for minima, indefinite for saddle points
  • Gaussian distribution: Weight initialization, batch normalization, VAEs
  • Cross-entropy: The standard loss for classification — equivalent to maximum likelihood
  • KL divergence: Measures distribution differences — used in VAEs, distillation
  • Matrix calculus: Extends gradient computation to matrix-valued functions (weight matrices)

Practice Exercises

  1. Linear Algebra: Given $\mathbf{A} = \begin{bmatrix} 2 & 1 \\ 1 & 3 \end{bmatrix}$, compute its eigenvalues and eigenvectors. Verify that $\mathbf{A}\mathbf{v} = \lambda\mathbf{v}$.

  2. Chain Rule: Let $f(x) = \sin(x^2)$ and $g(x) = e^{-x^2}$. Compute $\frac{d}{dx} f(g(x))$ using the chain rule. Verify with PyTorch autograd.

  3. Hessian: Compute the Hessian of $f(x, y) = x^4 + y^4 - 4xy + 1$. Find all critical points and classify them (min/max/saddle).

  4. Probability: Derive the gradient of the negative log-likelihood for a Gaussian distribution with respect to $\mu$ and $\sigma$. Implement it in PyTorch.

  5. Coding: Implement a function that computes the full Hessian matrix of a scalar function using PyTorch's autograd. Test it on a 2D quadratic function.
