Master derivatives of matrix expressions, Jacobians, Hessians, backpropagation, and gradient computation for optimization and deep learning.
đ Calculusđ Lesson 21 of 100đ Free Course
Advertisement
Matrix Calculus
âšī¸ Why It Matters
Matrix calculus is the mathematical backbone of modern machine learning. Every time a neural network learns, it computes gradients of a loss function with respect to millions of weights â that computation is matrix calculus. Backpropagation is simply the chain rule applied to compositions of matrix functions. Gradient descent, the optimizer that powers everything from linear regression to GPT, updates parameters by moving in the direction of the negative gradient â a vector of partial derivatives. Without matrix calculus, we cannot understand how models learn, why they fail, or how to design better architectures. From computing the gradient of a quadratic form in ridge regression to deriving the backward pass of a transformer, matrix calculus is the language of optimization.
Scalar by Matrix Derivatives
DfScalar Function of a Matrix
Let f:RmÃnâR be a scalar-valued function of a matrix AâRmÃn. The gradient of f with respect to A is an mÃn matrix where each entry is the partial derivative with respect to the corresponding entry of A:
For a vector-valued function fâ:RnâRm where fâ(x)=âf1â(x)f2â(x)âŽfmâ(x)ââ, the Jacobian is the mÃn matrix of all first-order partial derivatives:
The following are the most commonly used derivative identities in matrix calculus. Let xâRn, AâRmÃn, a,bâRn, and BâRnÃn.
Expression
Derivative
Notes
âxââ(aTx)
a
Linear form â the "constant rule"
âxââ(xTa)
a
Transpose doesn't change the gradient
âxââ(Ax)
AT
Output is matrix; Jacobian is AT
âxââ(xTBx)
(B+BT)x
Quadratic form
âxââ(xTx)
2x
Special case: B=I
âxââ(âĨxâĨ2)
2x
Same as above
âxââ(aTBx)
BTa
Bilinear form
âxââ(sin(x))
diag(cos(x))
Elementwise â Jacobian is diagonal
âxââ(Ī(x))
diag(Ī(x)â(1âĪ(x)))
Sigmoid derivative
â ī¸ Layout Convention
Different textbooks use different numerator-layout vs denominator-layout conventions. In numerator layout, the gradient of a scalar is a row vector; in denominator layout, it is a column vector. This course uses denominator layout (column vector gradient) to match the convention used in deep learning frameworks like PyTorch and TensorFlow.
Neural networks are compositions of many functions: y^â=fLâ(fLâ1â(â¯f1â(x)â¯)). The chain rule lets us decompose the gradient computation into local derivatives at each layer. This is the principle behind backpropagation: compute local Jacobians layer by layer, then multiply them together (in reverse order) to get the total derivative.
đChain Rule in Two Layers
Let z=W1âx, a=Ī(z), y^â=W2âa, and L=21â(y^âây)2.
Forward pass: xâzâaây^ââL
Backward pass (chain rule):
ây^ââLâ=y^âây
âW2ââLâ=(y^âây)aT
âaâLâ=W2Tâ(y^âây)
âzâLâ=Īâ˛(z)âW2Tâ(y^âây)
âW1ââLâ=âzâLâxT
Gradient of Loss Functions
DfMean Squared Error (MSE)
For predictions y^âiâ and targets yiâ over N samples:
LMSEâ=N1âi=1âNâ(y^âiââyiâ)2
Gradient with respect to predictions:
ây^âiââLâ=N2â(y^âiââyiâ)
For vector predictions yâ^ââRd:
âyâ^ââL=N2â(yâ^ââyâ)
DfBinary Cross-Entropy Loss
For binary predictions p^âiââ[0,1] and targets yiââ{0,1}:
Gradient with respect to logits zkâ (before softmax):
âzkââLâ=p^âkââykâ
This elegant result â the gradient is simply the prediction minus the target â is why cross-entropy + softmax is the standard output layer for classification.
đMSE Derivation
Let f(w)=N1ââĨXwâyââĨ2=N1â(Xwâyâ)T(Xwâyâ).
By Clairaut's theorem, H is symmetric: H=HT (assuming continuous second partials).
ThHessian and Optimization
Positive definiteHâģ0: local minimum (all eigenvalues >0)
Negative definiteHâē0: local maximum (all eigenvalues <0)
Indefinite (mixed eigenvalues): saddle point
Semi-definite: inconclusive (higher-order test needed)
Newton's method uses the Hessian:
xnewâ=xoldââHâ1âf(xoldâ)
For a quadratic form f(x)=21âxTHx+gâTx+c, the Hessian is constant and the optimum is xâ=âHâ1gâ.
đHessian of a Quadratic Form
Let f(x)=xTBx. We showed âf=(B+BT)x.
The Hessian is:
H=âxââ[(B+BT)x]=B+BT
If B is symmetric: H=2B.
đHessian of MSE
For f(w)=N1ââĨXwâyââĨ2, the gradient is N2âXT(Xwâyâ).
The Hessian is:
H=N2âXTX
This is always positive semi-definite, confirming MSE has a unique global minimum (when X has full column rank).
Python Implementation
import numpy as np
# --- Gradient of f(x) = x^T A x ---
def grad_quadratic(x, A):
return (A + A.T) @ x
A = np.array([[2, 1], [1, 3]])
x = np.array([1.0, 2.0])
print(f"Gradient of x^T A x: {grad_quadratic(x, A)}")
# Output: [4. 7.]
# --- Hessian of f(x) = x^T A x ---
def hessian_quadratic(A):
return A + A.T
print(f"Hessian:\n{hessian_quadratic(A)}")
# [[4 2]
# [2 6]]
# --- Manual Jacobian computation ---
def jacobian(f, x, eps=1e-7):
"""Compute Jacobian via finite differences."""
n = len(x)
m = len(f(x))
J = np.zeros((m, n))
fx = f(x)
for j in range(n):
x_plus = x.copy()
x_plus[j] += eps
J[:, j] = (f(x_plus) - fx) / eps
return J
# Example: f(x) = [x1*x2, sin(x1) + x2^2]
f = lambda x: np.array([x[0]*x[1], np.sin(x[0]) + x[1]**2])
x0 = np.array([1.0, 0.0])
J = jacobian(f, x0)
print(f"Jacobian at (1,0):\n{J}")
# [[0. 1. ]
# [0.54030231 0. ]]
# --- Autograd with JAX ---
import jax
import jax.numpy as jnp
from jax import grad, hessian
def loss_fn(x):
return jnp.sum(x**2) + jnp.dot(x, jnp.array([1.0, 2.0]))
# Gradient
g = grad(loss_fn)
print(f"JAX gradient: {g(x)}")
# Hessian
H = hessian(loss_fn)
print(f"JAX Hessian:\n{H}")
# --- Backpropagation: 2-layer network ---
def sigmoid(z):
return 1 / (1 + np.exp(-z))
def sigmoid_prime(z):
s = sigmoid(z)
return s * (1 - s)
def forward_pass(x, W1, b1, W2, b2):
z1 = W1 @ x + b1
a1 = sigmoid(z1)
z2 = W2 @ a1 + b2
return z2, z1, a1
def backward_pass(x, y, z2, z1, a1, W2):
N = len(y)
dz2 = (z2 - y) * 2 / N # dL/dz2
dW2 = np.outer(dz2, a1) # dL/dW2
da1 = W2.T @ dz2 # dL/da1
dz1 = sigmoid_prime(z1) * da1 # dL/dz1
dW1 = np.outer(dz1, x) # dL/dW1
db1 = dz1 # dL/db1
return dW1, db1, dW2
# Simple example
np.random.seed(42)
W1 = np.random.randn(3, 2) * 0.1
b1 = np.zeros(3)
W2 = np.random.randn(1, 3) * 0.1
b2 = np.zeros(1)
x = np.array([1.0, 0.5])
y = np.array([1.0])
z2, z1, a1 = forward_pass(x, W1, b1, W2, b2)
dW1, db1, dW2 = backward_pass(x, y, z2, z1, a1, W2)
print(f"dW1 shape: {dW1.shape}") # (3, 2)
print(f"dW2 shape: {dW2.shape}") # (1, 3)
print(f"db1 shape: {db1.shape}") # (3,)
Applications in AI/ML
Gradient Descent Optimization
The fundamental update rule: wt+1â=wtââΡâwâL(wtâ). All variants â SGD, Adam, AdaGrad, RMSProp â modify this basic step using matrix calculus. Adam tracks running averages of both the gradient (first moment) and the gradient squared (second moment), requiring elementwise matrix operations.
Neural Network Training
Each layer performs a linear transformation Wx+b followed by a nonlinear activation Ī(â ). Backpropagation computes âW(l)âLâ=δ(l)(a(lâ1))T for every layer â this is a rank-1 outer product, computed L times per sample. The total computation for a network with N parameters over B samples is O(Bâ N) â the backbone of deep learning scalability.
Attention Mechanisms (Transformers)
The attention function Attention(Q,K,V)=softmax(dkââQKTâ)V requires computing the gradient through softmax (a matrix function) and matrix multiplication. The derivative of softmax produces a full Jacobian matrix, not a diagonal one, making this one of the more complex backprop computations in modern architectures.
Regularization
L2 regularization adds 2ÎģââĨWâĨF2â to the loss, contributing ÎģW to the weight gradient â a direct application of âWâââĨWâĨF2â=2W. L1 regularization uses the subgradient of âĨWâĨ1â, leading to sparse solutions.
Generative Models
In GANs, the discriminator's loss involves gradients through a minimax objective. The generator's gradient âzâD(G(z)) flows through both the generator and discriminator networks, requiring careful application of the chain rule through both networks simultaneously.
Common Mistakes
Mistake
Correction
Confusing âxââ(Ax)=AT with A
The Jacobian of a linear map Ax is AT in denominator layout
Forgetting to transpose in the chain rule
When chaining f(g(x)), multiply Jacobians: Jfââ Jgâ, not Jgââ Jfâ
Ignoring elementwise vs matrix operations
Ī(x) applies elementwise; its Jacobian is diagonal, not Īâ I
Using âf=âxâfâ without specifying layout
Always state whether gradient is row or column vector
Assuming âxââ(xTAx)=2Ax for non-symmetric A
The correct formula is (A+AT)x; only when A=AT does it simplify to 2Ax
Forgetting the N1â factor in loss gradients
MSE gradient is N2âXT(Xwây), not 2XT(Xwây)
Applying scalar chain rule to matrix chain rule
Scalar: dxdâf(g(x))=fâ˛(g(x))gâ˛(x). Matrix: Jfâgâ=Jfââ Jgâ (matrix multiply, not elementwise)
Interview Questions
Q1: What is the gradient of f(x)=xTAx?
(A+AT)x. If A is symmetric, this simplifies to 2Ax. Derivation: expand f=âi,jâaijâxiâxjâ, differentiate with respect to xkâ, and collect terms to get (Ax)kâ+(ATx)kâ.
Q2: Why is backpropagation O(N) where N is the number of parameters?
Each layer's gradient is an outer product δ(l)(a(lâ1))T, which costs O(nlââ nlâ1â) â exactly the number of parameters in W(l). Summing over all L layers gives O(âlânlââ nlâ1â)=O(N). The backward pass has the same time complexity as the forward pass.
Q3: What is the Jacobian of the softmax function?
For softmax(z)iâ=âkâezkâeziââ, the Jacobian is:
Jijâ=softmax(ziâ)(δijââsoftmax(zjâ))
where δijâ is the Kronecker delta. In matrix form: J=diag(pâ)âpâpâT. This is NOT a diagonal matrix â it's a rank-1 update to a diagonal matrix.
Q4: How does the Hessian relate to Newton's method in optimization?
Newton's method uses the Hessian to compute the optimal step: xnewâ=xoldââHâ1âf. If Hâģ0, the step always points toward the minimum. However, computing H costs O(n2) space and O(n3) for the inverse, so quasi-Newton methods (BFGS, L-BFGS) approximate Hâ1 instead.
Q5: Derive the gradient of the cross-entropy loss with respect to the logits.
For softmax output p^âkâ=âjâezjâezkââ and cross-entropy L=ââkâykâlog(p^âkâ):
(using âjâyjâ=1 for one-hot targets). This is why cross-entropy + softmax gives such clean gradients.
Q6: What is the difference between the Jacobian and the Hessian?
The JacobianJâRmÃn is the matrix of first derivatives of a vector function fâ:RnâRm. The HessianHâRnÃn is the matrix of second derivatives of a scalar function f:RnâR. The Hessian is the Jacobian of the gradient: H=Jâfâ.
Practice Problems
đProblem 1
Compute âxââ(xTAx) for A=[10â02â].
đĄSolution
Since A is symmetric, âxââ(xTAx)=2Ax.
2[10â02â][x1âx2ââ]=[2x1â4x2ââ]
Verification: f=x12â+2x22â, so âx1ââfâ=2x1â, âx2ââfâ=4x2â. â
đProblem 2
Find the Jacobian of fâ:R3âR2 defined by fâ(x,y,z)=[xy+z2exsin(y)â].