← Math|21 of 100
Linear Algebra

Matrix Calculus

Master derivatives of matrix expressions, Jacobians, Hessians, backpropagation, and gradient computation for optimization and deep learning.

📂 Calculus📖 Lesson 21 of 100🎓 Free Course

Advertisement

Matrix Calculus


â„šī¸ Why It Matters

Matrix calculus is the mathematical backbone of modern machine learning. Every time a neural network learns, it computes gradients of a loss function with respect to millions of weights — that computation is matrix calculus. Backpropagation is simply the chain rule applied to compositions of matrix functions. Gradient descent, the optimizer that powers everything from linear regression to GPT, updates parameters by moving in the direction of the negative gradient — a vector of partial derivatives. Without matrix calculus, we cannot understand how models learn, why they fail, or how to design better architectures. From computing the gradient of a quadratic form in ridge regression to deriving the backward pass of a transformer, matrix calculus is the language of optimization.


Scalar by Matrix Derivatives

DfScalar Function of a Matrix

Let f:Rm×n→Rf: \mathbb{R}^{m \times n} \to \mathbb{R} be a scalar-valued function of a matrix A∈Rm×nA \in \mathbb{R}^{m \times n}. The gradient of ff with respect to AA is an m×nm \times n matrix where each entry is the partial derivative with respect to the corresponding entry of AA:

∂f∂A=[∂f∂a11∂f∂a12⋯∂f∂a1n∂f∂a21∂f∂a22⋯∂f∂a2n⋮⋮⋱⋮∂f∂am1∂f∂am2⋯∂f∂amn]\frac{\partial f}{\partial A} = \begin{bmatrix} \frac{\partial f}{\partial a_{11}} & \frac{\partial f}{\partial a_{12}} & \cdots & \frac{\partial f}{\partial a_{1n}} \\ \frac{\partial f}{\partial a_{21}} & \frac{\partial f}{\partial a_{22}} & \cdots & \frac{\partial f}{\partial a_{2n}} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f}{\partial a_{m1}} & \frac{\partial f}{\partial a_{m2}} & \cdots & \frac{\partial f}{\partial a_{mn}} \end{bmatrix}

📝Example: Frobenius Norm Squared

Let f(A)=âˆĨAâˆĨF2=∑i,jaij2f(A) = \|A\|_F^2 = \sum_{i,j} a_{ij}^2. Then:

∂f∂aij=2aij  ⟹  ∂f∂A=2A\frac{\partial f}{\partial a_{ij}} = 2a_{ij} \implies \frac{\partial f}{\partial A} = 2A

📝Example: Trace

Let f(A)=tr(A)=∑iaiif(A) = \text{tr}(A) = \sum_i a_{ii}. Then:

∂f∂aij={1if i=j0if i≠j  ⟹  ∂f∂A=I\frac{\partial f}{\partial a_{ij}} = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{if } i \neq j \end{cases} \implies \frac{\partial f}{\partial A} = I

Vector by Vector Derivatives

DfJacobian Matrix

For a vector-valued function f⃗:Rn→Rm\vec{f}: \mathbb{R}^n \to \mathbb{R}^m where f⃗(x⃗)=[f1(x⃗)f2(x⃗)⋮fm(x⃗)]\vec{f}(\vec{x}) = \begin{bmatrix} f_1(\vec{x}) \\ f_2(\vec{x}) \\ \vdots \\ f_m(\vec{x}) \end{bmatrix}, the Jacobian is the m×nm \times n matrix of all first-order partial derivatives:

Jij=∂fi∂xjJ_{ij} = \frac{\partial f_i}{\partial x_j}
J=[∂f1∂x1∂f1∂x2⋯∂f1∂xn∂f2∂x1∂f2∂x2⋯∂f2∂xn⋮⋮⋱⋮∂fm∂x1∂fm∂x2⋯∂fm∂xn]J = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \frac{\partial f_1}{\partial x_2} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \frac{\partial f_2}{\partial x_1} & \frac{\partial f_2}{\partial x_2} & \cdots & \frac{\partial f_2}{\partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \frac{\partial f_m}{\partial x_2} & \cdots & \frac{\partial f_m}{\partial x_n} \end{bmatrix}

â„šī¸ Jacobian vs Gradient

  • The Jacobian J∈Rm×nJ \in \mathbb{R}^{m \times n} generalizes the derivative to vector-to-vector maps.
  • The gradient ∇f∈Rn\nabla f \in \mathbb{R}^n is the Jacobian of a scalar function f:Rn→Rf: \mathbb{R}^n \to \mathbb{R} (a row vector or column vector depending on convention).
  • For a linear map f⃗(x⃗)=Ax⃗\vec{f}(\vec{x}) = A\vec{x}, the Jacobian is simply the matrix AA.

📝Jacobian of a Nonlinear Map

Let f⃗:R2→R2\vec{f}: \mathbb{R}^2 \to \mathbb{R}^2 be defined by f⃗(x1,x2)=[x1x2sin⁡(x1)+x22]\vec{f}(x_1, x_2) = \begin{bmatrix} x_1 x_2 \\ \sin(x_1) + x_2^2 \end{bmatrix}.

J=[∂f1∂x1∂f1∂x2∂f2∂x1∂f2∂x2]=[x2x1cos⁡(x1)2x2]J = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \frac{\partial f_1}{\partial x_2} \\ \frac{\partial f_2}{\partial x_1} & \frac{\partial f_2}{\partial x_2} \end{bmatrix} = \begin{bmatrix} x_2 & x_1 \\ \cos(x_1) & 2x_2 \end{bmatrix}

At (1,0)(1, 0): J=[01cos⁥(1)0]J = \begin{bmatrix} 0 & 1 \\ \cos(1) & 0 \end{bmatrix}.


Key Matrix Calculus Rules

ThFundamental Derivative Rules

The following are the most commonly used derivative identities in matrix calculus. Let x⃗∈Rn\vec{x} \in \mathbb{R}^n, A∈Rm×nA \in \mathbb{R}^{m \times n}, a⃗,b⃗∈Rn\vec{a}, \vec{b} \in \mathbb{R}^n, and B∈Rn×nB \in \mathbb{R}^{n \times n}.

ExpressionDerivativeNotes
∂∂x⃗(a⃗Tx⃗)\frac{\partial}{\partial \vec{x}}(\vec{a}^T\vec{x})a⃗\vec{a}Linear form — the "constant rule"
∂∂x⃗(x⃗Ta⃗)\frac{\partial}{\partial \vec{x}}(\vec{x}^T\vec{a})a⃗\vec{a}Transpose doesn't change the gradient
∂∂x⃗(Ax⃗)\frac{\partial}{\partial \vec{x}}(A\vec{x})ATA^TOutput is matrix; Jacobian is ATA^T
∂∂x⃗(x⃗TBx⃗)\frac{\partial}{\partial \vec{x}}(\vec{x}^T B \vec{x})(B+BT)x⃗(B + B^T)\vec{x}Quadratic form
∂∂x⃗(x⃗Tx⃗)\frac{\partial}{\partial \vec{x}}(\vec{x}^T \vec{x})2x⃗2\vec{x}Special case: B=IB = I
∂∂x⃗(âˆĨx⃗âˆĨ2)\frac{\partial}{\partial \vec{x}}(\|\vec{x}\|^2)2x⃗2\vec{x}Same as above
∂∂x⃗(a⃗TBx⃗)\frac{\partial}{\partial \vec{x}}(\vec{a}^T B \vec{x})BTa⃗B^T \vec{a}Bilinear form
∂∂x⃗(sin⁡(x⃗))\frac{\partial}{\partial \vec{x}}(\sin(\vec{x}))diag(cos⁡(x⃗))\text{diag}(\cos(\vec{x}))Elementwise — Jacobian is diagonal
∂∂x⃗(΃(x⃗))\frac{\partial}{\partial \vec{x}}(\sigma(\vec{x}))diag(΃(x⃗)⊙(1âˆ’Īƒ(x⃗)))\text{diag}(\sigma(\vec{x}) \odot (1 - \sigma(\vec{x})))Sigmoid derivative

âš ī¸ Layout Convention

Different textbooks use different numerator-layout vs denominator-layout conventions. In numerator layout, the gradient of a scalar is a row vector; in denominator layout, it is a column vector. This course uses denominator layout (column vector gradient) to match the convention used in deep learning frameworks like PyTorch and TensorFlow.

📝Deriving the Quadratic Rule

For f(x⃗)=x⃗TBx⃗=∑i,jbijxixjf(\vec{x}) = \vec{x}^T B \vec{x} = \sum_{i,j} b_{ij} x_i x_j:

∂f∂xk=∑jbkjxj+∑ibikxi=(Bx⃗)k+(BTx⃗)k\frac{\partial f}{\partial x_k} = \sum_j b_{kj} x_j + \sum_i b_{ik} x_i = (B\vec{x})_k + (B^T\vec{x})_k

Therefore: ∂f∂x⃗=Bx⃗+BTx⃗=(B+BT)x⃗\frac{\partial f}{\partial \vec{x}} = B\vec{x} + B^T\vec{x} = (B + B^T)\vec{x}.

If BB is symmetric (B=BTB = B^T): ∂f∂x⃗=2Bx⃗\frac{\partial f}{\partial \vec{x}} = 2B\vec{x}.


Chain Rule for Matrices

ThMultivariable Chain Rule

For a composition of functions g⃗:Rp→Rm\vec{g}: \mathbb{R}^p \to \mathbb{R}^m and f⃗:Rn→Rp\vec{f}: \mathbb{R}^n \to \mathbb{R}^p, defining h⃗=g⃗∘f⃗\vec{h} = \vec{g} \circ \vec{f}, the Jacobian of h⃗\vec{h} is the product of Jacobians:

Jh⃗=Jg⃗⋅Jf⃗J_{\vec{h}} = J_{\vec{g}} \cdot J_{\vec{f}}

In terms of partial derivatives:

∂hi∂xj=∑k∂hi∂fk⋅∂fk∂xj\frac{\partial h_i}{\partial x_j} = \sum_k \frac{\partial h_i}{\partial f_k} \cdot \frac{\partial f_k}{\partial x_j}

For a scalar loss L\mathcal{L} and intermediate variables z1,z2,â€Ļ,zkz_1, z_2, \ldots, z_k:

∂L∂xj=∑i=1k∂L∂zi⋅∂zi∂xj\frac{\partial \mathcal{L}}{\partial x_j} = \sum_{i=1}^{k} \frac{\partial \mathcal{L}}{\partial z_i} \cdot \frac{\partial z_i}{\partial x_j}

â„šī¸ Why This Matters for Deep Learning

Neural networks are compositions of many functions: y^=fL(fL−1(⋯f1(x⃗)⋯ ))\hat{y} = f_L(f_{L-1}(\cdots f_1(\vec{x}) \cdots)). The chain rule lets us decompose the gradient computation into local derivatives at each layer. This is the principle behind backpropagation: compute local Jacobians layer by layer, then multiply them together (in reverse order) to get the total derivative.

📝Chain Rule in Two Layers

Let z⃗=W1x⃗\vec{z} = W_1 \vec{x}, a⃗=΃(z⃗)\vec{a} = \sigma(\vec{z}), y^=W2a⃗\hat{y} = W_2 \vec{a}, and L=12(y^−y)2\mathcal{L} = \frac{1}{2}(\hat{y} - y)^2.

Forward pass: x⃗→z⃗→a⃗→y^→L\vec{x} \to \vec{z} \to \vec{a} \to \hat{y} \to \mathcal{L}

Backward pass (chain rule):

  • ∂L∂y^=y^−y\frac{\partial \mathcal{L}}{\partial \hat{y}} = \hat{y} - y
  • ∂L∂W2=(y^−y)a⃗T\frac{\partial \mathcal{L}}{\partial W_2} = (\hat{y} - y) \vec{a}^T
  • ∂L∂a⃗=W2T(y^−y)\frac{\partial \mathcal{L}}{\partial \vec{a}} = W_2^T (\hat{y} - y)
  • ∂L∂z⃗=Īƒâ€˛(z⃗)⊙W2T(y^−y)\frac{\partial \mathcal{L}}{\partial \vec{z}} = \sigma'(\vec{z}) \odot W_2^T(\hat{y} - y)
  • ∂L∂W1=∂L∂z⃗x⃗T\frac{\partial \mathcal{L}}{\partial W_1} = \frac{\partial \mathcal{L}}{\partial \vec{z}} \vec{x}^T

Gradient of Loss Functions

DfMean Squared Error (MSE)

For predictions y^i\hat{y}_i and targets yiy_i over NN samples:

LMSE=1N∑i=1N(y^i−yi)2\mathcal{L}_{\text{MSE}} = \frac{1}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)^2

Gradient with respect to predictions:

∂L∂y^i=2N(y^i−yi)\frac{\partial \mathcal{L}}{\partial \hat{y}_i} = \frac{2}{N}(\hat{y}_i - y_i)

For vector predictions y⃗^∈Rd\hat{\vec{y}} \in \mathbb{R}^d:

∇y⃗^L=2N(y⃗^−y⃗)\nabla_{\hat{\vec{y}}} \mathcal{L} = \frac{2}{N}(\hat{\vec{y}} - \vec{y})

DfBinary Cross-Entropy Loss

For binary predictions p^i∈[0,1]\hat{p}_i \in [0, 1] and targets yi∈{0,1}y_i \in \{0, 1\}:

LBCE=−1N∑i=1N[yilog⁡(p^i)+(1−yi)log⁡(1−p^i)]\mathcal{L}_{\text{BCE}} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(\hat{p}_i) + (1 - y_i) \log(1 - \hat{p}_i) \right]

Gradient with respect to predictions:

∂L∂p^i=−yip^i+1−yi1−p^i=p^i−yip^i(1−p^i)\frac{\partial \mathcal{L}}{\partial \hat{p}_i} = \frac{-y_i}{\hat{p}_i} + \frac{1 - y_i}{1 - \hat{p}_i} = \frac{\hat{p}_i - y_i}{\hat{p}_i(1 - \hat{p}_i)}

DfCategorical Cross-Entropy Loss

For KK-class predictions p^k\hat{p}_k (softmax outputs) and one-hot targets yky_k:

LCE=−1N∑i=1N∑k=1Kyiklog⁡(p^ik)\mathcal{L}_{\text{CE}} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} y_{ik} \log(\hat{p}_{ik})

Gradient with respect to logits zkz_k (before softmax):

∂L∂zk=p^k−yk\frac{\partial \mathcal{L}}{\partial z_k} = \hat{p}_k - y_k

This elegant result — the gradient is simply the prediction minus the target — is why cross-entropy + softmax is the standard output layer for classification.

📝MSE Derivation

Let f(w⃗)=1NâˆĨXw⃗−y⃗âˆĨ2=1N(Xw⃗−y⃗)T(Xw⃗−y⃗)f(\vec{w}) = \frac{1}{N}\|X\vec{w} - \vec{y}\|^2 = \frac{1}{N}(X\vec{w} - \vec{y})^T(X\vec{w} - \vec{y}).

Expanding: f=1N(w⃗TXTXw⃗−2y⃗TXw⃗+y⃗Ty⃗)f = \frac{1}{N}(\vec{w}^T X^T X \vec{w} - 2\vec{y}^T X \vec{w} + \vec{y}^T \vec{y}).

Using the rules from above:

∇w⃗f=1N(2XTXw⃗−2XTy⃗)=2NXT(Xw⃗−y⃗)\nabla_{\vec{w}} f = \frac{1}{N}(2X^T X \vec{w} - 2X^T \vec{y}) = \frac{2}{N}X^T(X\vec{w} - \vec{y})

Setting to zero gives the normal equation: XTXw⃗=XTy⃗X^T X \vec{w} = X^T \vec{y}.


Backpropagation Derivation

ThBackpropagation Through a Neural Network

Consider a feedforward network with LL layers. For layer ll:

  • Pre-activation: z⃗(l)=W(l)a⃗(l−1)+b⃗(l)\vec{z}^{(l)} = W^{(l)} \vec{a}^{(l-1)} + \vec{b}^{(l)}
  • Activation: a⃗(l)=΃(z⃗(l))\vec{a}^{(l)} = \sigma(\vec{z}^{(l)})

Forward pass:

a⃗(0)=x⃗,z⃗(l)=W(l)a⃗(l−1)+b⃗(l),a⃗(l)=΃(z⃗(l))\vec{a}^{(0)} = \vec{x}, \quad \vec{z}^{(l)} = W^{(l)}\vec{a}^{(l-1)} + \vec{b}^{(l)}, \quad \vec{a}^{(l)} = \sigma(\vec{z}^{(l)})

Output: y^=a⃗(L)\hat{y} = \vec{a}^{(L)} (for regression) or p⃗^=softmax(z⃗(L))\hat{\vec{p}} = \text{softmax}(\vec{z}^{(L)}) (for classification).

Backward pass — error signals:

δ(L)=∇a⃗(L)LâŠ™Īƒâ€˛(z⃗(L))\delta^{(L)} = \nabla_{\vec{a}^{(L)}} \mathcal{L} \odot \sigma'(\vec{z}^{(L)})
δ(l)=((W(l+1))Tδ(l+1))âŠ™Īƒâ€˛(z⃗(l))\delta^{(l)} = \left((W^{(l+1)})^T \delta^{(l+1)}\right) \odot \sigma'(\vec{z}^{(l)})

Weight gradients:

∂L∂W(l)=δ(l)(a⃗(l−1))T\frac{\partial \mathcal{L}}{\partial W^{(l)}} = \delta^{(l)} (\vec{a}^{(l-1)})^T
∂L∂b⃗(l)=δ(l)\frac{\partial \mathcal{L}}{\partial \vec{b}^{(l)}} = \delta^{(l)}

📝Full 2-Layer Network Derivation

Let x⃗→W(1)→z⃗(1)=W(1)x⃗+b⃗(1)→a⃗(1)=΃(z⃗(1))→W(2)→z⃗(2)=W(2)a⃗(1)+b⃗(2)→y^=z⃗(2)\vec{x} \to W^{(1)} \to \vec{z}^{(1)} = W^{(1)}\vec{x} + \vec{b}^{(1)} \to \vec{a}^{(1)} = \sigma(\vec{z}^{(1)}) \to W^{(2)} \to \vec{z}^{(2)} = W^{(2)}\vec{a}^{(1)} + \vec{b}^{(2)} \to \hat{y} = \vec{z}^{(2)}.

Loss: L=12(y^−y)2\mathcal{L} = \frac{1}{2}(\hat{y} - y)^2.

Backward pass:

  1. ∂L∂y^=y^−y\frac{\partial \mathcal{L}}{\partial \hat{y}} = \hat{y} - y
  2. ∂L∂W(2)=(y^−y)(a⃗(1))T\frac{\partial \mathcal{L}}{\partial W^{(2)}} = (\hat{y} - y)(\vec{a}^{(1)})^T
  3. ∂L∂b⃗(2)=y^−y\frac{\partial \mathcal{L}}{\partial \vec{b}^{(2)}} = \hat{y} - y
  4. ∂L∂a⃗(1)=W(2)T(y^−y)\frac{\partial \mathcal{L}}{\partial \vec{a}^{(1)}} = W^{(2)T}(\hat{y} - y)
  5. ∂L∂z⃗(1)=Īƒâ€˛(z⃗(1))⊙W(2)T(y^−y)\frac{\partial \mathcal{L}}{\partial \vec{z}^{(1)}} = \sigma'(\vec{z}^{(1)}) \odot W^{(2)T}(\hat{y} - y)
  6. ∂L∂W(1)=∂L∂z⃗(1)x⃗T\frac{\partial \mathcal{L}}{\partial W^{(1)}} = \frac{\partial \mathcal{L}}{\partial \vec{z}^{(1)}} \vec{x}^T
  7. ∂L∂b⃗(1)=∂L∂z⃗(1)\frac{\partial \mathcal{L}}{\partial \vec{b}^{(1)}} = \frac{\partial \mathcal{L}}{\partial \vec{z}^{(1)}}

Hessian Matrices

DfHessian Matrix

For a twice-differentiable scalar function f:Rn→Rf: \mathbb{R}^n \to \mathbb{R}, the Hessian is the n×nn \times n symmetric matrix of second-order partial derivatives:

Hij=∂2f∂xi∂xjH_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}
H=[∂2f∂x12∂2f∂x1∂x2⋯∂2f∂x1∂xn∂2f∂x2∂x1∂2f∂x22⋯∂2f∂x2∂xn⋮⋮⋱⋮∂2f∂xn∂x1∂2f∂xn∂x2⋯∂2f∂xn2]H = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_n \partial x_1} & \frac{\partial^2 f}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_n^2} \end{bmatrix}

By Clairaut's theorem, HH is symmetric: H=HTH = H^T (assuming continuous second partials).

ThHessian and Optimization

  • Positive definite Hâ‰ģ0H \succ 0: local minimum (all eigenvalues >0> 0)
  • Negative definite Hâ‰ē0H \prec 0: local maximum (all eigenvalues <0< 0)
  • Indefinite (mixed eigenvalues): saddle point
  • Semi-definite: inconclusive (higher-order test needed)

Newton's method uses the Hessian:

x⃗new=x⃗old−H−1∇f(x⃗old)\vec{x}_{\text{new}} = \vec{x}_{\text{old}} - H^{-1} \nabla f(\vec{x}_{\text{old}})

For a quadratic form f(x⃗)=12x⃗THx⃗+g⃗Tx⃗+cf(\vec{x}) = \frac{1}{2}\vec{x}^T H \vec{x} + \vec{g}^T \vec{x} + c, the Hessian is constant and the optimum is x⃗∗=−H−1g⃗\vec{x}^* = -H^{-1}\vec{g}.

📝Hessian of a Quadratic Form

Let f(x⃗)=x⃗TBx⃗f(\vec{x}) = \vec{x}^T B \vec{x}. We showed ∇f=(B+BT)x⃗\nabla f = (B + B^T)\vec{x}.

The Hessian is:

H=∂∂x⃗[(B+BT)x⃗]=B+BTH = \frac{\partial}{\partial \vec{x}} \left[(B + B^T)\vec{x}\right] = B + B^T

If BB is symmetric: H=2BH = 2B.

📝Hessian of MSE

For f(w⃗)=1NâˆĨXw⃗−y⃗âˆĨ2f(\vec{w}) = \frac{1}{N}\|X\vec{w} - \vec{y}\|^2, the gradient is 2NXT(Xw⃗−y⃗)\frac{2}{N}X^T(X\vec{w} - \vec{y}).

The Hessian is:

H=2NXTXH = \frac{2}{N}X^T X

This is always positive semi-definite, confirming MSE has a unique global minimum (when XX has full column rank).


Python Implementation

import numpy as np

# --- Gradient of f(x) = x^T A x ---
def grad_quadratic(x, A):
    return (A + A.T) @ x

A = np.array([[2, 1], [1, 3]])
x = np.array([1.0, 2.0])
print(f"Gradient of x^T A x: {grad_quadratic(x, A)}")
# Output: [4. 7.]

# --- Hessian of f(x) = x^T A x ---
def hessian_quadratic(A):
    return A + A.T

print(f"Hessian:\n{hessian_quadratic(A)}")
# [[4 2]
#  [2 6]]

# --- Manual Jacobian computation ---
def jacobian(f, x, eps=1e-7):
    """Compute Jacobian via finite differences."""
    n = len(x)
    m = len(f(x))
    J = np.zeros((m, n))
    fx = f(x)
    for j in range(n):
        x_plus = x.copy()
        x_plus[j] += eps
        J[:, j] = (f(x_plus) - fx) / eps
    return J

# Example: f(x) = [x1*x2, sin(x1) + x2^2]
f = lambda x: np.array([x[0]*x[1], np.sin(x[0]) + x[1]**2])
x0 = np.array([1.0, 0.0])
J = jacobian(f, x0)
print(f"Jacobian at (1,0):\n{J}")
# [[0.         1.        ]
#  [0.54030231 0.        ]]

# --- Autograd with JAX ---
import jax
import jax.numpy as jnp
from jax import grad, hessian

def loss_fn(x):
    return jnp.sum(x**2) + jnp.dot(x, jnp.array([1.0, 2.0]))

# Gradient
g = grad(loss_fn)
print(f"JAX gradient: {g(x)}")

# Hessian
H = hessian(loss_fn)
print(f"JAX Hessian:\n{H}")

# --- Backpropagation: 2-layer network ---
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1 - s)

def forward_pass(x, W1, b1, W2, b2):
    z1 = W1 @ x + b1
    a1 = sigmoid(z1)
    z2 = W2 @ a1 + b2
    return z2, z1, a1

def backward_pass(x, y, z2, z1, a1, W2):
    N = len(y)
    dz2 = (z2 - y) * 2 / N       # dL/dz2
    dW2 = np.outer(dz2, a1)       # dL/dW2
    da1 = W2.T @ dz2              # dL/da1
    dz1 = sigmoid_prime(z1) * da1 # dL/dz1
    dW1 = np.outer(dz1, x)        # dL/dW1
    db1 = dz1                     # dL/db1
    return dW1, db1, dW2

# Simple example
np.random.seed(42)
W1 = np.random.randn(3, 2) * 0.1
b1 = np.zeros(3)
W2 = np.random.randn(1, 3) * 0.1
b2 = np.zeros(1)
x = np.array([1.0, 0.5])
y = np.array([1.0])

z2, z1, a1 = forward_pass(x, W1, b1, W2, b2)
dW1, db1, dW2 = backward_pass(x, y, z2, z1, a1, W2)

print(f"dW1 shape: {dW1.shape}")  # (3, 2)
print(f"dW2 shape: {dW2.shape}")  # (1, 3)
print(f"db1 shape: {db1.shape}")  # (3,)

Applications in AI/ML

Gradient Descent Optimization

The fundamental update rule: w⃗t+1=w⃗t−η∇w⃗L(w⃗t)\vec{w}_{t+1} = \vec{w}_t - \eta \nabla_{\vec{w}} \mathcal{L}(\vec{w}_t). All variants — SGD, Adam, AdaGrad, RMSProp — modify this basic step using matrix calculus. Adam tracks running averages of both the gradient (first moment) and the gradient squared (second moment), requiring elementwise matrix operations.

Neural Network Training

Each layer performs a linear transformation Wx⃗+b⃗W\vec{x} + \vec{b} followed by a nonlinear activation ΃(⋅)\sigma(\cdot). Backpropagation computes ∂L∂W(l)=δ(l)(a⃗(l−1))T\frac{\partial \mathcal{L}}{\partial W^{(l)}} = \delta^{(l)} (\vec{a}^{(l-1)})^T for every layer — this is a rank-1 outer product, computed LL times per sample. The total computation for a network with NN parameters over BB samples is O(B⋅N)O(B \cdot N) — the backbone of deep learning scalability.

Attention Mechanisms (Transformers)

The attention function Attention(Q,K,V)=softmax(QKTdk)V\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V requires computing the gradient through softmax (a matrix function) and matrix multiplication. The derivative of softmax produces a full Jacobian matrix, not a diagonal one, making this one of the more complex backprop computations in modern architectures.

Regularization

L2 regularization adds Îģ2âˆĨWâˆĨF2\frac{\lambda}{2}\|W\|_F^2 to the loss, contributing ÎģW\lambda W to the weight gradient — a direct application of ∂∂WâˆĨWâˆĨF2=2W\frac{\partial}{\partial W}\|W\|_F^2 = 2W. L1 regularization uses the subgradient of âˆĨWâˆĨ1\|W\|_1, leading to sparse solutions.

Generative Models

In GANs, the discriminator's loss involves gradients through a minimax objective. The generator's gradient ∇z⃗D(G(z⃗))\nabla_{\vec{z}} D(G(\vec{z})) flows through both the generator and discriminator networks, requiring careful application of the chain rule through both networks simultaneously.


Common Mistakes

MistakeCorrection
Confusing ∂∂x(Ax)=AT\frac{\partial}{\partial x}(Ax) = A^T with AAThe Jacobian of a linear map AxAx is ATA^T in denominator layout
Forgetting to transpose in the chain ruleWhen chaining f(g(x))f(g(x)), multiply Jacobians: Jf⋅JgJ_f \cdot J_g, not Jg⋅JfJ_g \cdot J_f
Ignoring elementwise vs matrix operations΃(x⃗)\sigma(\vec{x}) applies elementwise; its Jacobian is diagonal, not Īƒâ‹…I\sigma \cdot I
Using ∇f=∂f∂x\nabla f = \frac{\partial f}{\partial x} without specifying layoutAlways state whether gradient is row or column vector
Assuming ∂∂x(xTAx)=2Ax\frac{\partial}{\partial x}(x^T Ax) = 2Ax for non-symmetric AAThe correct formula is (A+AT)x(A + A^T)x; only when A=ATA = A^T does it simplify to 2Ax2Ax
Forgetting the 1N\frac{1}{N} factor in loss gradientsMSE gradient is 2NXT(Xw−y)\frac{2}{N}X^T(Xw - y), not 2XT(Xw−y)2X^T(Xw - y)
Applying scalar chain rule to matrix chain ruleScalar: ddxf(g(x))=f′(g(x))g′(x)\frac{d}{dx}f(g(x)) = f'(g(x))g'(x). Matrix: Jf∘g=Jf⋅JgJ_{f \circ g} = J_f \cdot J_g (matrix multiply, not elementwise)

Interview Questions

Q1: What is the gradient of f(x⃗)=x⃗TAx⃗f(\vec{x}) = \vec{x}^T A \vec{x}?

(A+AT)x⃗(A + A^T)\vec{x}. If AA is symmetric, this simplifies to 2Ax⃗2A\vec{x}. Derivation: expand f=∑i,jaijxixjf = \sum_{i,j} a_{ij} x_i x_j, differentiate with respect to xkx_k, and collect terms to get (Ax⃗)k+(ATx⃗)k(A\vec{x})_k + (A^T\vec{x})_k.

Q2: Why is backpropagation O(N)O(N) where NN is the number of parameters?

Each layer's gradient is an outer product δ(l)(a⃗(l−1))T\delta^{(l)} (\vec{a}^{(l-1)})^T, which costs O(nl⋅nl−1)O(n_l \cdot n_{l-1}) — exactly the number of parameters in W(l)W^{(l)}. Summing over all LL layers gives O(∑lnl⋅nl−1)=O(N)O(\sum_l n_l \cdot n_{l-1}) = O(N). The backward pass has the same time complexity as the forward pass.

Q3: What is the Jacobian of the softmax function?

For softmax(z⃗)i=ezi∑kezk\text{softmax}(\vec{z})_i = \frac{e^{z_i}}{\sum_k e^{z_k}}, the Jacobian is:

Jij=softmax(zi)(δij−softmax(zj))J_{ij} = \text{softmax}(z_i)(\delta_{ij} - \text{softmax}(z_j))

where δij\delta_{ij} is the Kronecker delta. In matrix form: J=diag(p⃗)−p⃗p⃗TJ = \text{diag}(\vec{p}) - \vec{p}\vec{p}^T. This is NOT a diagonal matrix — it's a rank-1 update to a diagonal matrix.

Q4: How does the Hessian relate to Newton's method in optimization?

Newton's method uses the Hessian to compute the optimal step: x⃗new=x⃗old−H−1∇f\vec{x}_{\text{new}} = \vec{x}_{\text{old}} - H^{-1}\nabla f. If Hâ‰ģ0H \succ 0, the step always points toward the minimum. However, computing HH costs O(n2)O(n^2) space and O(n3)O(n^3) for the inverse, so quasi-Newton methods (BFGS, L-BFGS) approximate H−1H^{-1} instead.

Q5: Derive the gradient of the cross-entropy loss with respect to the logits.

For softmax output p^k=ezk∑jezj\hat{p}_k = \frac{e^{z_k}}{\sum_j e^{z_j}} and cross-entropy L=−∑kyklog⁡(p^k)\mathcal{L} = -\sum_k y_k \log(\hat{p}_k):

∂L∂zk=∑j∂L∂p^j⋅∂p^j∂zk=∑j(−yjp^j)⋅p^j(δjk−p^k)=−yk+p^k∑jyj=p^k−yk\frac{\partial \mathcal{L}}{\partial z_k} = \sum_j \frac{\partial \mathcal{L}}{\partial \hat{p}_j} \cdot \frac{\partial \hat{p}_j}{\partial z_k} = \sum_j \left(\frac{-y_j}{\hat{p}_j}\right) \cdot \hat{p}_j(\delta_{jk} - \hat{p}_k) = -y_k + \hat{p}_k \sum_j y_j = \hat{p}_k - y_k

(using ∑jyj=1\sum_j y_j = 1 for one-hot targets). This is why cross-entropy + softmax gives such clean gradients.

Q6: What is the difference between the Jacobian and the Hessian?

The Jacobian J∈Rm×nJ \in \mathbb{R}^{m \times n} is the matrix of first derivatives of a vector function f⃗:Rn→Rm\vec{f}: \mathbb{R}^n \to \mathbb{R}^m. The Hessian H∈Rn×nH \in \mathbb{R}^{n \times n} is the matrix of second derivatives of a scalar function f:Rn→Rf: \mathbb{R}^n \to \mathbb{R}. The Hessian is the Jacobian of the gradient: H=J∇fH = J_{\nabla f}.


Practice Problems

📝Problem 1

Compute ∂∂x⃗(x⃗TAx⃗)\frac{\partial}{\partial \vec{x}}(\vec{x}^T A \vec{x}) for A=[1002]A = \begin{bmatrix}1&0\\0&2\end{bmatrix}.

💡Solution

Since AA is symmetric, ∂∂x⃗(x⃗TAx⃗)=2Ax⃗\frac{\partial}{\partial \vec{x}}(\vec{x}^T A \vec{x}) = 2A\vec{x}.

2[1002][x1x2]=[2x14x2]2\begin{bmatrix}1&0\\0&2\end{bmatrix}\begin{bmatrix}x_1\\x_2\end{bmatrix} = \begin{bmatrix}2x_1\\4x_2\end{bmatrix}

Verification: f=x12+2x22f = x_1^2 + 2x_2^2, so ∂f∂x1=2x1\frac{\partial f}{\partial x_1} = 2x_1, ∂f∂x2=4x2\frac{\partial f}{\partial x_2} = 4x_2. ✓

📝Problem 2

Find the Jacobian of f⃗:R3→R2\vec{f}: \mathbb{R}^3 \to \mathbb{R}^2 defined by f⃗(x,y,z)=[xy+z2exsin⁡(y)]\vec{f}(x, y, z) = \begin{bmatrix} xy + z^2 \\ e^x \sin(y) \end{bmatrix}.

💡Solution

J=[∂f1∂x∂f1∂y∂f1∂z∂f2∂x∂f2∂y∂f2∂z]=[yx2zexsin⁡(y)excos⁡(y)0]J = \begin{bmatrix} \frac{\partial f_1}{\partial x} & \frac{\partial f_1}{\partial y} & \frac{\partial f_1}{\partial z} \\ \frac{\partial f_2}{\partial x} & \frac{\partial f_2}{\partial y} & \frac{\partial f_2}{\partial z} \end{bmatrix} = \begin{bmatrix} y & x & 2z \\ e^x \sin(y) & e^x \cos(y) & 0 \end{bmatrix}

This is a 2×32 \times 3 matrix, mapping from R3\mathbb{R}^3 to R2\mathbb{R}^2.

📝Problem 3

Derive the gradient of L=12âˆĨWx⃗−y⃗âˆĨ2\mathcal{L} = \frac{1}{2}\|W\vec{x} - \vec{y}\|^2 with respect to WW.

💡Solution

Let r⃗=Wx⃗−y⃗\vec{r} = W\vec{x} - \vec{y}. Then L=12r⃗Tr⃗\mathcal{L} = \frac{1}{2}\vec{r}^T\vec{r}.

By chain rule: ∂L∂Wij=∑k∂L∂rk⋅∂rk∂Wij\frac{\partial \mathcal{L}}{\partial W_{ij}} = \sum_k \frac{\partial \mathcal{L}}{\partial r_k} \cdot \frac{\partial r_k}{\partial W_{ij}}.

We have ∂L∂rk=rk\frac{\partial \mathcal{L}}{\partial r_k} = r_k and ∂rk∂Wij=δkixj\frac{\partial r_k}{\partial W_{ij}} = \delta_{ki} x_j.

Therefore: ∂L∂Wij=rixj\frac{\partial \mathcal{L}}{\partial W_{ij}} = r_i x_j, which in matrix form is:

∂L∂W=r⃗x⃗T=(Wx⃗−y⃗)x⃗T\frac{\partial \mathcal{L}}{\partial W} = \vec{r} \vec{x}^T = (W\vec{x} - \vec{y})\vec{x}^T

📝Problem 4

For a two-layer network h=W2΃(W1x+b1)+b2h = W_2 \sigma(W_1 x + b_1) + b_2, derive ∂h∂W1\frac{\partial h}{\partial W_1}.

💡Solution

Let z⃗1=W1x+b1\vec{z}_1 = W_1 x + b_1, a⃗1=΃(z⃗1)\vec{a}_1 = \sigma(\vec{z}_1), z⃗2=W2a⃗1+b2\vec{z}_2 = W_2 \vec{a}_1 + b_2, h=z⃗2h = \vec{z}_2.

Chain rule:

∂h∂W1=∂h∂a⃗1⋅∂a⃗1∂z⃗1⋅∂z⃗1∂W1\frac{\partial h}{\partial W_1} = \frac{\partial h}{\partial \vec{a}_1} \cdot \frac{\partial \vec{a}_1}{\partial \vec{z}_1} \cdot \frac{\partial \vec{z}_1}{\partial W_1}
  • ∂h∂a⃗1=W2\frac{\partial h}{\partial \vec{a}_1} = W_2 (vector: W2W_2 applied to scalar output)
  • ∂a⃗1∂z⃗1=diag(Īƒâ€˛(z⃗1))\frac{\partial \vec{a}_1}{\partial \vec{z}_1} = \text{diag}(\sigma'(\vec{z}_1))
  • ∂z⃗1∂W1=x\frac{\partial \vec{z}_1}{\partial W_1} = x (outer product structure)

Result: ∂h∂W1=W2⋅diag(Īƒâ€˛(z⃗1))⋅xT\frac{\partial h}{\partial W_1} = W_2 \cdot \text{diag}(\sigma'(\vec{z}_1)) \cdot x^T

In component form: ∂h∂(W1)ij=∑k(W2)kĪƒâ€˛((z1)k)δkixj=(W2)iĪƒâ€˛((z1)i)xj\frac{\partial h}{\partial (W_1)_{ij}} = \sum_k (W_2)_k \sigma'((z_1)_k) \delta_{ki} x_j = (W_2)_i \sigma'((z_1)_i) x_j


Quick Reference

ConceptFormulaKey Insight
Linear gradient∇x⃗(a⃗Tx⃗)=a⃗\nabla_{\vec{x}}(\vec{a}^T\vec{x}) = \vec{a}Derivative of linear form is the coefficient
Quadratic gradient∇x⃗(x⃗TBx⃗)=(B+BT)x⃗\nabla_{\vec{x}}(\vec{x}^T B \vec{x}) = (B+B^T)\vec{x}Symmetric: 2Bx⃗2B\vec{x}
Frobenius norm∇AâˆĨAâˆĨF2=2A\nabla_A \|A\|_F^2 = 2AUsed in weight decay / regularization
Linear map JacobianJAx⃗=ATJ_{A\vec{x}} = A^TDenominator layout convention
Chain ruleJg∘f=Jg⋅JfJ_{g \circ f} = J_g \cdot J_fMultiply Jacobians, not derivatives
JacobianJij=∂fi∂xjJ_{ij} = \frac{\partial f_i}{\partial x_j}m×nm \times n matrix for f⃗:Rn→Rm\vec{f}: \mathbb{R}^n \to \mathbb{R}^m
HessianHij=∂2f∂xi∂xjH_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}Symmetric matrix of second derivatives
Backprop weight grad∂L∂W(l)=δ(l)(a⃗(l−1))T\frac{\partial \mathcal{L}}{\partial W^{(l)}} = \delta^{(l)} (\vec{a}^{(l-1)})^TOuter product of error signal and activation
Backprop error signalδ(l)=(W(l+1)Tδ(l+1))âŠ™Īƒâ€˛(z⃗(l))\delta^{(l)} = (W^{(l+1)T}\delta^{(l+1)}) \odot \sigma'(\vec{z}^{(l)})Error propagated backward, scaled by derivative
MSE gradient∇y^L=2N(y⃗^−y⃗)\nabla_{\hat{y}} \mathcal{L} = \frac{2}{N}(\hat{\vec{y}} - \vec{y})Proportional to prediction error
Cross-entropy gradient∂L∂zk=p^k−yk\frac{\partial \mathcal{L}}{\partial z_k} = \hat{p}_k - y_kPrediction minus target (for softmax)

Cross-References

Lesson Progress21 / 100