← Math|32 of 100
Calculus

Multivariable Calculus

Master Jacobians, Hessians, multiple integrals, and change of variables.

📂 Advanced📖 Lesson 32 of 100🎓 Free Course

Advertisement

Multivariable Calculus

â„šī¸ Why It Matters

Multivariable calculus is the mathematical foundation of modern machine learning and artificial intelligence. Every neural network training step involves computing gradients in high-dimensional spaces, where the loss function depends on millions or billions of parameters. Understanding partial derivatives, the gradient vector, directional derivatives, and second-order information (Hessian matrix) enables you to diagnose training issues, design better optimization algorithms, and understand why certain methods converge while others fail. In generative models like VAEs and GANs, change of variables through the Jacobian determinant is essential for computing probability densities. Double and multiple integrals appear in Bayesian inference, expectation calculations, and probabilistic models. Without multivariable calculus, advanced AI research and implementation would be impossible.


Multivariable Functions

DfMultivariable Function

A multivariable function is a mapping f:Rn→Rmf: \mathbb{R}^n \to \mathbb{R}^m that takes a vector of nn inputs and produces a vector of mm outputs. When m=1m = 1, it is a scalar-valued function; when m>1m > 1, it is a vector-valued function. Common notation includes f(x1,x2,â€Ļ,xn)f(x_1, x_2, \ldots, x_n) for scalar functions and f⃗(x⃗)\vec{f}(\vec{x}) for vector functions.

Scalar Function

f:Rn→R,x⃗=(x1,x2,â€Ļ,xn)â†Ļf(x⃗)f: \mathbb{R}^n \to \mathbb{R}, \quad \vec{x} = (x_1, x_2, \ldots, x_n) \mapsto f(\vec{x})

Here,

  • nn=Number of input variables (dimension)
  • x⃗\vec{x}=Input vector in n-dimensional space
  • f(x⃗)f(\vec{x})=Scalar output value

Vector Function

vecf:mathbbRntomathbbRm,quadvecxmapsto(f1(vecx),f2(vecx),ldots,fm(vecx))\\vec{f}: \\mathbb{R}^n \\to \\mathbb{R}^m, \\quad \\vec{x} \\mapsto (f_1(\\vec{x}), f_2(\\vec{x}), \\ldots, f_m(\\vec{x}))

Here,

  • mm=Number of output components
  • fif_i=The i-th component function

Example: A neural network layer with 256 inputs and 64 outputs is a vector function f⃗:R256→R64\vec{f}: \mathbb{R}^{256} \to \mathbb{R}^{64}.


Partial Derivatives Review

â„šī¸ Quick Summary

The partial derivative of f(x1,â€Ļ,xn)f(x_1, \ldots, x_n) with respect to xix_i is the derivative of ff while holding all other variables constant. It measures how ff changes as only xix_i varies.

Partial Derivative

fracpartialfpartialxi=limhto0fracf(x1,ldots,xi+h,ldots,xn)−f(x1,ldots,xi,ldots,xn)h\\frac{\\partial f}{\\partial x_i} = \\lim_{h \\to 0} \\frac{f(x_1, \\ldots, x_i + h, \\ldots, x_n) - f(x_1, \\ldots, x_i, \\ldots, x_n)}{h}

Here,

  • ∂f∂xi\frac{\partial f}{\partial x_i}=Partial derivative with respect to x_i
  • hh=Infinitesimal increment in x_i

Key Rules:

  • Sum Rule: ∂∂xi(f+g)=∂f∂xi+∂g∂xi\frac{\partial}{\partial x_i}(f + g) = \frac{\partial f}{\partial x_i} + \frac{\partial g}{\partial x_i}
  • Product Rule: ∂∂xi(f⋅g)=∂f∂xig+f∂g∂xi\frac{\partial}{\partial x_i}(f \cdot g) = \frac{\partial f}{\partial x_i} g + f \frac{\partial g}{\partial x_i}
  • Chain Rule: ∂∂xif(g1(x⃗),â€Ļ)=∑j∂f∂gj∂gj∂xi\frac{\partial}{\partial x_i} f(g_1(\vec{x}), \ldots) = \sum_j \frac{\partial f}{\partial g_j} \frac{\partial g_j}{\partial x_i}
  • Clairaut's Theorem: ∂2f∂xi∂xj=∂2f∂xj∂xi\frac{\partial^2 f}{\partial x_i \partial x_j} = \frac{\partial^2 f}{\partial x_j \partial x_i} (for continuous second derivatives)

The Gradient Vector

DfGradient Vector

The gradient of a scalar-valued function f:Rn→Rf: \mathbb{R}^n \to \mathbb{R} is the vector of all first-order partial derivatives. It points in the direction of steepest ascent and has magnitude equal to the maximum rate of change of ff.

Gradient

nablaf=left(fracpartialfpartialx1,fracpartialfpartialx2,ldots,fracpartialfpartialxnright)\\nabla f = \\left( \\frac{\\partial f}{\\partial x_1}, \\frac{\\partial f}{\\partial x_2}, \\ldots, \\frac{\\partial f}{\\partial x_n} \\right)

Here,

  • ∇f\nabla f=Gradient vector (nabla f)
  • ∂f∂xi\frac{\partial f}{\partial x_i}=Partial derivative along i-th axis

â„šī¸ Geometric Interpretation

  • The gradient ∇f(x⃗0)\nabla f(\vec{x}_0) is perpendicular (normal) to the level curve f(x⃗)=cf(\vec{x}) = c passing through x⃗0\vec{x}_0.
  • The gradient points in the direction of steepest increase of ff.
  • The negative gradient −∇f-\nabla f points in the direction of steepest decrease.
  • The magnitude âˆŖâˆ‡fâˆŖ|\nabla f| equals the maximum rate of change at that point.

Gradient Magnitude as Directional Max

âˆŖnablafâˆŖ=maxâˆŖvecuâˆŖ=1Dvecuf=maxâˆŖvecuâˆŖ=1nablafcdotvecu|\\nabla f| = \\max_{\\|\\vec{u}\\| = 1} D_{\\vec{u}} f = \\max_{\\|\\vec{u}\\| = 1} \\nabla f \\cdot \\vec{u}

Here,

  • u⃗\vec{u}=Unit direction vector
  • Du⃗fD_{\vec{u}} f=Directional derivative in direction u

Example: For f(x,y)=x2y+sin⁥(y)f(x, y) = x^2 y + \sin(y):

∇f=(2xy,  x2+cos⁡(y))\nabla f = \left( 2xy, \; x^2 + \cos(y) \right)

At point (1,Ī€)(1, \pi): ∇f(1,Ī€)=(2Ī€,  1−1)=(2Ī€,0)\nabla f(1, \pi) = (2\pi, \; 1 - 1) = (2\pi, 0)


Directional Derivative

DfDirectional Derivative

The directional derivative of ff at point x⃗0\vec{x}_0 in the direction of unit vector u⃗\vec{u} measures the rate of change of ff as we move from x⃗0\vec{x}_0 in direction u⃗\vec{u}.

Directional Derivative

Dvecuf(vecx0)=limhto0fracf(vecx0+hvecu)−f(vecx0)h=nablaf(vecx0)cdotvecuD_{\\vec{u}} f(\\vec{x}_0) = \\lim_{h \\to 0} \\frac{f(\\vec{x}_0 + h\\vec{u}) - f(\\vec{x}_0)}{h} = \\nabla f(\\vec{x}_0) \\cdot \\vec{u}

Here,

  • Du⃗fD_{\vec{u}} f=Directional derivative in direction u
  • u⃗\vec{u}=Unit vector (||u|| = 1)
  • ∇f⋅u⃗\nabla f \cdot \vec{u}=Dot product with gradient

ThCauchy-Schwarz Bound

The directional derivative is bounded by:

âˆŖDu⃗fâˆŖâ‰¤âˆŖâˆ‡fâˆŖâ‹…âˆŖuâƒ—âˆŖ=âˆŖâˆ‡fâˆŖ|D_{\vec{u}} f| \leq |\nabla f| \cdot |\vec{u}| = |\nabla f|

Equality holds when u⃗\vec{u} is parallel to ∇f\nabla f (maximum ascent) or anti-parallel (maximum descent).

âš ī¸ Important

The direction vector u⃗\vec{u} must be a unit vector for the formula Du⃗f=∇f⋅u⃗D_{\vec{u}} f = \nabla f \cdot \vec{u} to give the rate of change per unit distance. If v⃗\vec{v} is not normalized, use u⃗=v⃗/âˆŖvâƒ—âˆŖ\vec{u} = \vec{v}/|\vec{v}|.

Example: For f(x,y)=x2+y2f(x,y) = x^2 + y^2 at (1,1)(1,1), direction u⃗=(1/2,1/2)\vec{u} = (1/\sqrt{2}, 1/\sqrt{2}):

∇f(1,1)=(2,2)\nabla f(1,1) = (2, 2), so Du⃗f=2⋅12+2⋅12=22D_{\vec{u}} f = 2 \cdot \frac{1}{\sqrt{2}} + 2 \cdot \frac{1}{\sqrt{2}} = 2\sqrt{2}


Hessian Matrix

DfHessian Matrix

The Hessian matrix of a scalar-valued function f:Rn→Rf: \mathbb{R}^n \to \mathbb{R} is the n×nn \times n matrix of all second-order partial derivatives. It captures the local curvature (convexity/concavity) of ff and is essential for classifying critical points and designing second-order optimization methods.

Hessian

Hij=fracpartial2fpartialxipartialxj=fracpartialpartialxileft(fracpartialfpartialxjright)H_{ij} = \\frac{\\partial^2 f}{\\partial x_i \\partial x_j} = \\frac{\\partial}{\\partial x_i} \\left( \\frac{\\partial f}{\\partial x_j} \\right)

Here,

  • HijH_{ij}=The (i,j) entry of the Hessian
  • ff=The scalar-valued function
  • nn=Dimension of the input space

â„šī¸ Properties

  • The Hessian is always symmetric (by Clairaut's theorem): Hij=HjiH_{ij} = H_{ji} for all i,ji, j.
  • It has n(n+1)/2n(n+1)/2 unique entries (not n2n^2).
  • At a critical point (∇f=0\nabla f = 0), the Hessian determines whether the point is a minimum, maximum, or saddle point.
  • The trace of the Hessian equals the Laplacian: tr(H)=∇2f=∑i∂2f∂xi2\text{tr}(H) = \nabla^2 f = \sum_i \frac{\partial^2 f}{\partial x_i^2}.

Example: For f(x,y)=x4+y4−4xy+1f(x, y) = x^4 + y^4 - 4xy + 1:

∇f=(4x3−4y,  4y3−4x)\nabla f = (4x^3 - 4y, \; 4y^3 - 4x) H=[12x2−4−412y2]H = \begin{bmatrix} 12x^2 & -4 \\ -4 & 12y^2 \end{bmatrix}

At critical point (1,1)(1,1): H=[12−4−412]H = \begin{bmatrix} 12 & -4 \\ -4 & 12 \end{bmatrix}


Multivariable Taylor Series

Second-Order Taylor Expansion

f(vecx0+vech)=f(vecx0)+nablaf(vecx0)Tvech+frac12vechTH(vecx0)vech+O(âˆŖvechâˆŖ3)f(\\vec{x}_0 + \\vec{h}) = f(\\vec{x}_0) + \\nabla f(\\vec{x}_0)^T \\vec{h} + \\frac{1}{2} \\vec{h}^T H(\\vec{x}_0) \\vec{h} + O(\\|\\vec{h}\\|^3)

Here,

  • x⃗0\vec{x}_0=Expansion point
  • h⃗\vec{h}=Small displacement vector
  • ∇f(x⃗0)\nabla f(\vec{x}_0)=Gradient at x_0
  • H(x⃗0)H(\vec{x}_0)=Hessian at x_0

â„šī¸ Interpretation of Terms

  • Zeroth order: f(x⃗0)f(\vec{x}_0) — the function value at the expansion point.
  • First order: ∇fTh⃗\nabla f^T \vec{h} — linear approximation using the gradient.
  • Second order: 12h⃗THh⃗\frac{1}{2} \vec{h}^T H \vec{h} — curvature correction using the Hessian.
  • Higher-order terms O(âˆĨh⃗âˆĨ3)O(\|\vec{h}\|^3) become negligible for small h⃗\vec{h}.

ThTaylor's Theorem with Remainder

For f:Rn→Rf: \mathbb{R}^n \to \mathbb{R} with continuous partial derivatives up to order k+1k+1:

f(x⃗0+h⃗)=âˆ‘âˆŖÎąâˆŖâ‰¤kDÎąf(x⃗0)Îą!h⃗α+Rk(h⃗)f(\vec{x}_0 + \vec{h}) = \sum_{|\alpha| \leq k} \frac{D^\alpha f(\vec{x}_0)}{\alpha!} \vec{h}^\alpha + R_k(\vec{h})

where Îą\alpha is a multi-index and RkR_k is the remainder term.

Second-order approximation (scalar case):

f(x0+h)≈f(x0)+f′(x0)h+12f′′(x0)h2f(x_0 + h) \approx f(x_0) + f'(x_0)h + \frac{1}{2}f''(x_0)h^2

Vector form: h⃗THh⃗=∑i,jHijhihj\vec{h}^T H \vec{h} = \sum_{i,j} H_{ij} h_i h_j is a quadratic form.


Critical Points Classification

DfCritical Point

A point x⃗0\vec{x}_0 is a critical point of ff if ∇f(x⃗0)=0⃗\nabla f(\vec{x}_0) = \vec{0} (all partial derivatives are zero simultaneously).

ThSecond Derivative Test

At a critical point x⃗0\vec{x}_0 where ∇f(x⃗0)=0⃗\nabla f(\vec{x}_0) = \vec{0}, classify using the Hessian H(x⃗0)H(\vec{x}_0):

  • Positive definite (all eigenvalues >0> 0): x⃗0\vec{x}_0 is a local minimum.
  • Negative definite (all eigenvalues <0< 0): x⃗0\vec{x}_0 is a local maximum.
  • Indefinite (eigenvalues of mixed signs): x⃗0\vec{x}_0 is a saddle point.
  • Semi-definite (some eigenvalues =0= 0): Test is inconclusive; need higher-order analysis.

Determinant Criterion (2D)

det(H)=H11H22−H122\\det(H) = H_{11}H_{22} - H_{12}^2

Here,

  • H11H_{11}=Second partial w.r.t. x
  • H22H_{22}=Second partial w.r.t. y
  • H12H_{12}=Mixed partial

Classification rules for f(x,y)f(x,y):

H11H_{11}det⁥(H)\det(H)Classification
>0> 0>0> 0Local minimum
<0< 0>0> 0Local maximum
—<0< 0Saddle point
—=0= 0Inconclusive

Example: f(x,y)=x2+y2−2x−4y+5f(x,y) = x^2 + y^2 - 2x - 4y + 5

∇f=(2x−2,  2y−4)=(0,0)  ⟹  (x,y)=(1,2)\nabla f = (2x - 2, \; 2y - 4) = (0,0) \implies (x,y) = (1,2)

H=[2002]H = \begin{bmatrix} 2 & 0 \\ 0 & 2 \end{bmatrix}, eigenvalues {2,2}>0  ⟹  \{2, 2\} > 0 \implies local minimum


Double Integrals

Double Integral over a Region

iintDf(x,y),dA=intabintg1(x)g2(x)f(x,y),dy,dx\\iint_D f(x,y) \\, dA = \\int_a^b \\int_{g_1(x)}^{g_2(x)} f(x,y) \\, dy \\, dx

Here,

  • DD=The region of integration in the xy-plane
  • f(x,y)f(x,y)=The integrand function
  • g1(x),g2(x)g_1(x), g_2(x)=Lower and upper bounds for y

â„šī¸ Geometric Interpretation

  • If f(x,y)â‰Ĩ0f(x,y) \geq 0, the double integral âˆŦDf(x,y) dA\iint_D f(x,y) \, dA represents the volume under the surface z=f(x,y)z = f(x,y) and above the region DD.
  • If f(x,y)=1f(x,y) = 1, the integral âˆŦD1 dA=Area(D)\iint_D 1 \, dA = \text{Area}(D) gives the area of region DD.
  • Probability density: âˆŦDp(x,y) dA\iint_D p(x,y) \, dA gives the probability that (X,Y)∈D(X,Y) \in D.

Triple Integral

iiintVf(x,y,z),dV=intabintg1(x)g2(x)inth1(x,y)h2(x,y)f(x,y,z),dz,dy,dx\\iiint_V f(x,y,z) \\, dV = \\int_a^b \\int_{g_1(x)}^{g_2(x)} \\int_{h_1(x,y)}^{h_2(x,y)} f(x,y,z) \\, dz \\, dy \\, dx

Here,

  • VV=Volume region in 3D space
  • dVdV=Volume element dx dy dz

â„šī¸ Polar Coordinates

For circular regions, use polar coordinates: x=rcos⁥θx = r\cos\theta, y=rsin⁥θy = r\sin\theta.

âˆŦDf(x,y) dx dy=âˆĢιβâˆĢ0Rf(rcos⁥θ,rsin⁥θ) r dr dθ\iint_D f(x,y) \, dx \, dy = \int_\alpha^\beta \int_0^R f(r\cos\theta, r\sin\theta) \, r \, dr \, d\theta

The factor rr is the Jacobian determinant of the polar coordinate transformation.


Change of Variables

DfJacobian Determinant

When changing variables in a multiple integral, the Jacobian determinant accounts for the distortion of area/volume caused by the coordinate transformation. For a transformation (x,y)=T(u,v)(x, y) = T(u, v), the area element transforms as dx dy=âˆŖJTâˆŖâ€‰du dvdx \, dy = |J_T| \, du \, dv.

Jacobian Determinant

\\det(J_T) = \\det \\begin{bmatrix} \\frac{\\partial x}{\\partial u} & \\frac{\\partial x}{\\partial v} \\\\ \\frac{\\partial y}{\\partial u} & \\frac{\\partial y}{\\partial v} \\end{bmatrix}

Here,

  • JTJ_T=Jacobian matrix of the transformation T
  • âˆŖdet⁥(JT)âˆŖ|\det(J_T)|=Absolute value of Jacobian determinant

Change of Variables Formula (2D)

iintT(D)f(x,y),dx,dy=iintDf(T(u,v)),âˆŖdet(JT)âˆŖ,du,dv\\iint_{T(D)} f(x,y) \\, dx \\, dy = \\iint_D f(T(u,v)) \\, |\\det(J_T)| \\, du \\, dv

Here,

  • TT=Transformation from (u,v) to (x,y)
  • DD=Region in the (u,v) plane
  • T(D)T(D)=Transformed region in the (x,y) plane

Common transformations:

TransformationxxyyâˆŖdet⁥(J)âˆŖ|\det(J)|
Polarrcos⁥θr\cos\thetarsin⁥θr\sin\thetarr
Cylindricalrcos⁥θr\cos\thetarsin⁥θr\sin\thetarr
Spherical΁sinâĄĪ•cos⁥θ\rho\sin\phi\cos\theta΁sinâĄĪ•sin⁥θ\rho\sin\phi\sin\theta΁2sinâĄĪ•\rho^2\sin\phi

ThInverse Function Theorem

If T:Rn→RnT: \mathbb{R}^n \to \mathbb{R}^n is continuously differentiable and det⁡(JT(x⃗0))≠0\det(J_T(\vec{x}_0)) \neq 0, then TT is locally invertible near x⃗0\vec{x}_0 and the Jacobian of T−1T^{-1} is JT−1(T(x⃗0))=JT(x⃗0)−1J_{T^{-1}}(T(\vec{x}_0)) = J_T(\vec{x}_0)^{-1}.


Python Implementation

import numpy as np
import sympy as sp

# ============================================
# 1. Partial Derivatives with SymPy
# ============================================
x, y = sp.symbols('x y')
f = x**2 * y + sp.sin(y)

# Partial derivatives
df_dx = sp.diff(f, x)
df_dy = sp.diff(f, y)
print(f"f = {f}")
print(f"∂f/∂x = {df_dx}")      # 2*x*y
print(f"∂f/∂y = {df_dy}")      # x**2 + cos(y)

# Gradient
grad_f = [df_dx, df_dy]
print(f"∇f = {grad_f}")

# ============================================
# 2. Hessian Matrix with SymPy
# ============================================
f_hess = x**4 + y**4 - 4*x*y + 1
H = sp.hessian(f_hess, (x, y))
print(f"\nHessian of {f_hess}:")
sp.pprint(H)

eigenvals = H.eigenvals()
print(f"Eigenvalues at (1,1): {H.subs({x:1, y:1}).eigenvals()}")

# ============================================
# 3. Gradient Descent Implementation
# ============================================
def gradient_descent(grad_f, x0, lr=0.01, max_iter=1000, tol=1e-8):
    """Simple gradient descent optimization."""
    x = np.array(x0, dtype=float)
    history = [x.copy()]
    
    for i in range(max_iter):
        g = np.array(grad_f(x), dtype=float)
        if np.linalg.norm(g) < tol:
            print(f"Converged in {i} iterations")
            break
        x = x - lr * g
        history.append(x.copy())
    
    return x, np.array(history)

# Minimize f(x,y) = x^2 + y^2 - 2x - 4y + 5
f = lambda v: v[0]**2 + v[1]**2 - 2*v[0] - 4*v[1] + 5
grad = lambda v: np.array([2*v[0]-2, 2*v[1]-4])

x_min, hist = gradient_descent(grad, [5.0, 5.0], lr=0.1)
print(f"\nMinimum at: {x_min}")  # Should be [1, 2]

# ============================================
# 4. Numerical Jacobian
# ============================================
def numerical_jacobian(f, x, h=1e-7):
    """Compute Jacobian matrix numerically using central differences."""
    n = len(x)
    m = len(f(x))
    J = np.zeros((m, n))
    for j in range(n):
        x_plus = x.copy()
        x_minus = x.copy()
        x_plus[j] += h
        x_minus[j] -= h
        J[:, j] = (f(x_plus) - f(x_minus)) / (2 * h)
    return J

# Example: f(x,y) = [x^2 + y, x*y^2]
f_vec = lambda v: np.array([v[0]**2 + v[1], v[0]*v[1]**2])
x_point = np.array([1.0, 2.0])
J = numerical_jacobian(f_vec, x_point)
print(f"\nJacobian at (1,2):\n{J}")
# Expected: [[2, 1], [4, 4]]

# ============================================
# 5. Numerical Hessian
# ============================================
def numerical_hessian(f, x, h=1e-5):
    """Compute Hessian matrix numerically."""
    n = len(x)
    H = np.zeros((n, n))
    f0 = f(x)
    
    for i in range(n):
        for j in range(n):
            x_pp = x.copy()
            x_pm = x.copy()
            x_mp = x.copy()
            x_mm = x.copy()
            
            x_pp[i] += h; x_pp[j] += h
            x_pm[i] += h; x_pm[j] -= h
            x_mp[i] -= h; x_mp[j] += h
            x_mm[i] -= h; x_mm[j] -= h
            
            H[i, j] = (f(x_pp) - f(x_pm) - f(x_mp) + f(x_mm)) / (4 * h**2)
    
    return H

# Test on f(x,y) = x^4 + y^4 - 4xy + 1
f_test = lambda v: v[0]**4 + v[1]**4 - 4*v[0]*v[1] + 1
H_num = numerical_hessian(f_test, np.array([1.0, 1.0]))
print(f"\nNumerical Hessian at (1,1):\n{H_num}")

# ============================================
# 6. Double Integration with SciPy
# ============================================
from scipy import integrate

def integrand(y, x):
    return np.exp(-(x**2 + y**2))

# Integrate over circle of radius R
result, error = integrate.dblquad(
    integrand,
    -2, 2,           # x bounds
    lambda x: -np.sqrt(4-x**2),  # y lower
    lambda x: np.sqrt(4-x**2)    # y upper
)
print(f"\nDouble integral over circle: {result:.6f}")
print(f"Expected (≈ Ī€(1-e^(-4))): {np.pi*(1-np.exp(-4)):.6f}")

Applications in AI/ML

â„šī¸ Optimization Landscapes

The loss function L(θ)L(\theta) in machine learning is a scalar function of the parameter vector θ∈Rd\theta \in \mathbb{R}^d. The geometry of this function — its gradient, curvature, and critical points — directly determines the behavior of training algorithms.

1. Gradient Descent: The most fundamental optimization algorithm uses θt+1=θt−η∇L(θt)\theta_{t+1} = \theta_t - \eta \nabla L(\theta_t). The gradient ∇L\nabla L provides the direction of steepest descent.

2. Second-Order Methods: Newton's method uses the Hessian to achieve quadratic convergence:

θt+1=θt−H−1(θt)∇L(θt)\theta_{t+1} = \theta_t - H^{-1}(\theta_t) \nabla L(\theta_t)

The Hessian's eigenvalues determine convergence rate: eigenvalues close to zero slow convergence, while large condition numbers indicate ill-conditioning.

3. Natural Gradient: The Fisher information matrix (an expected Hessian) accounts for the geometry of the parameter space, providing better updates than standard gradient descent.

4. Saddle Points in High Dimensions: In high-dimensional loss landscapes, saddle points are exponentially more common than local minima. The Hessian's indefinite eigenvalues reveal saddle points that trap first-order methods.

5. Normalizing Flows: Use the change of variables formula with the Jacobian determinant to transform simple distributions into complex ones:

log⁥pX(x)=log⁥pZ(f−1(x))+logâĄâˆŖdet⁥(Jf−1(x))âˆŖ\log p_X(x) = \log p_Z(f^{-1}(x)) + \log |\det(J_{f^{-1}}(x))|

6. Backpropagation: Each layer's Jacobian matrix propagates gradients through the chain rule. Vanishing/exploding gradients correspond to Jacobian eigenvalues less than or greater than 1.


Common Mistakes

📋Common Mistakes in Multivariable Calculus

MistakeCorrect Approach
Forgetting chain rule: ddxf(g(x))=f′(g(x))\frac{d}{dx}f(g(x)) = f'(g(x))ddxf(g(x))=f′(g(x))⋅g′(x)\frac{d}{dx}f(g(x)) = f'(g(x)) \cdot g'(x)
Ignoring mixed partialsVerify fxy=fyxf_{xy} = f_{yx} for continuous functions
Wrong integration orderCheck bounds: âˆĢabâˆĢg1(x)g2(x)dy dx\int_a^b \int_{g_1(x)}^{g_2(x)} dy \, dx vs âˆĢcdâˆĢh1(y)h2(y)dx dy\int_c^d \int_{h_1(y)}^{h_2(y)} dx \, dy
Missing Jacobian determinantAlways include âˆŖdet⁥(J)âˆŖ|\det(J)| when changing variables
Confusing ∇f\nabla f with dfdf∇f\nabla f is a vector; dfdf is a differential (covector)
Not normalizing direction vectoru⃗\vec{u} must satisfy âˆĨu⃗âˆĨ=1\|\vec{u}\| = 1 for directional derivative
Assuming Hessian is positive definiteCheck eigenvalues, not just diagonal entries
Forgetting absolute value in JacobianUse âˆŖdet⁥(J)âˆŖ|\det(J)| for area/volume element

Interview Questions

📝Question 1: Gradient Properties

Q: Why is the gradient perpendicular to level curves?

💡Answer

A level curve is defined by f(x⃗)=cf(\vec{x}) = c. Taking the total differential: ∇f⋅dx⃗=0\nabla f \cdot d\vec{x} = 0. Since dx⃗d\vec{x} is tangent to the level curve, ∇f\nabla f must be perpendicular to it. This is because the directional derivative along the level curve is zero (the function value doesn't change), so ∇f⋅t⃗=0\nabla f \cdot \vec{t} = 0 for any tangent vector t⃗\vec{t}.

📝Question 2: Hessian in Optimization

Q: Why can't we always use Newton's method with the Hessian?

💡Answer

Three main reasons: (1) Computing and storing the n×nn \times n Hessian requires O(n2)O(n^2) memory, infeasible for millions of parameters. (2) Inverting the Hessian costs O(n3)O(n^3). (3) At saddle points, the Hessian is not positive definite, causing Newton's method to move toward the saddle instead of a minimum. Quasi-Newton methods (L-BFGS) approximate the Hessian to address these issues.

📝Question 3: Jacobian in Neural Networks

Q: How does the Jacobian relate to backpropagation?

💡Answer

Backpropagation applies the chain rule: if layer ll computes a⃗l=΃(Wla⃗l−1+b⃗l)\vec{a}^l = \sigma(W^l \vec{a}^{l-1} + \vec{b}^l), then ∂L∂a⃗l−1=(Wl)Tdiag(Īƒâ€˛)∂L∂a⃗l\frac{\partial L}{\partial \vec{a}^{l-1}} = (W^l)^T \text{diag}(\sigma') \frac{\partial L}{\partial \vec{a}^l}. The Jacobian of layer ll is Jl=diag(Īƒâ€˛)WlJ_l = \text{diag}(\sigma') W^l. If eigenvalues of JlJ_l are all <1< 1, gradients vanish; if >1> 1, they explode.

📝Question 4: Directional Derivative

Q: In what direction does ff increase most rapidly at a point?

💡Answer

The direction of most rapid increase is the gradient direction ∇f/âˆĨ∇fâˆĨ\nabla f / \|\nabla f\|. The rate of increase is âˆĨ∇fâˆĨ\|\nabla f\|. This follows from the Cauchy-Schwarz inequality: Du⃗f=∇f⋅u⃗≤âˆĨ∇fâˆĨ⋅âˆĨu⃗âˆĨ=âˆĨ∇fâˆĨD_{\vec{u}} f = \nabla f \cdot \vec{u} \leq \|\nabla f\| \cdot \|\vec{u}\| = \|\nabla f\|, with equality when u⃗=∇f/âˆĨ∇fâˆĨ\vec{u} = \nabla f / \|\nabla f\|.

📝Question 5: Saddle Points

Q: Why are saddle points more common than local minima in high dimensions?

💡Answer

A critical point is a saddle point if the Hessian has at least one positive and one negative eigenvalue. In dd dimensions, the probability that all eigenvalues have the same sign decreases exponentially with dd. For a random matrix, the probability of being positive definite is approximately 2−d2^{-d}. Thus, in high dimensions (e.g., d=106d = 10^6), virtually all critical points are saddle points.

📝Question 6: Taylor Series

Q: When does the second-order Taylor approximation underestimate the true function value?

💡Answer

The second-order approximation f(x⃗0+h⃗)≈f(x⃗0)+∇fTh⃗+12h⃗THh⃗f(\vec{x}_0 + \vec{h}) \approx f(\vec{x}_0) + \nabla f^T \vec{h} + \frac{1}{2}\vec{h}^T H \vec{h} underestimates when the remainder R2=16∇3f(Ξ)h3R_2 = \frac{1}{6} \nabla^3 f(\xi) h^3 (in 1D) is positive. For multivariable functions, if ff is convex (Hâǰ0H \succeq 0) and we expand around a minimum (∇f=0\nabla f = 0), the approximation is exact for quadratic functions and underestimates for functions with positive higher-order derivatives.


Practice Problems

📝Problem 1: Gradient and Directional Derivative

Compute the gradient of f(x,y,z)=x2y+yz2f(x,y,z) = x^2 y + yz^2 at the point (1,2,3)(1,2,3) and find the directional derivative in the direction of v⃗=(1,1,1)\vec{v} = (1,1,1).

💡Solution

∇f=(2xy,  x2+z2,  2yz)\nabla f = (2xy, \; x^2 + z^2, \; 2yz)

At (1,2,3)(1,2,3): ∇f=(4,  10,  12)\nabla f = (4, \; 10, \; 12)

Normalize: u⃗=13(1,1,1)\vec{u} = \frac{1}{\sqrt{3}}(1,1,1)

Du⃗f=∇f⋅u⃗=13(4+10+12)=263≈15.01D_{\vec{u}} f = \nabla f \cdot \vec{u} = \frac{1}{\sqrt{3}}(4 + 10 + 12) = \frac{26}{\sqrt{3}} \approx 15.01

📝Problem 2: Hessian Classification

Classify the critical points of f(x,y)=x3−3xy2+y3f(x,y) = x^3 - 3xy^2 + y^3.

💡Solution

∇f=(3x2−3y2,  −6xy+3y2)=(0,0)\nabla f = (3x^2 - 3y^2, \; -6xy + 3y^2) = (0,0)

From first equation: x2=y2  ⟹  x=±yx^2 = y^2 \implies x = \pm y

If x=yx = y: −6y2+3y2=−3y2=0  ⟹  y=0  ⟹  (0,0)-6y^2 + 3y^2 = -3y^2 = 0 \implies y = 0 \implies (0,0)

If x=−yx = -y: 6y2+3y2=9y2=0  ⟹  y=0  ⟹  (0,0)6y^2 + 3y^2 = 9y^2 = 0 \implies y = 0 \implies (0,0)

H=[6x−6y−6y−6x+6y]H = \begin{bmatrix} 6x & -6y \\ -6y & -6x + 6y \end{bmatrix}

At (0,0)(0,0): H=[0000]H = \begin{bmatrix} 0 & 0 \\ 0 & 0 \end{bmatrix}

The Hessian is the zero matrix — test is inconclusive. Higher-order analysis shows (0,0)(0,0) is a saddle point (the function takes both positive and negative values in every neighborhood of the origin).

📝Problem 3: Change of Variables

Evaluate âˆŦD(x2+y2) dA\iint_D (x^2 + y^2) \, dA where DD is the disk x2+y2≤4x^2 + y^2 \leq 4 using polar coordinates.

💡Solution

Convert to polar: x=rcos⁡θx = r\cos\theta, y=rsin⁡θy = r\sin\theta, x2+y2=r2x^2 + y^2 = r^2, dA=r dr dθdA = r \, dr \, d\theta

âˆĢ02Ī€âˆĢ02r2⋅r dr dθ=âˆĢ02Ī€dθâˆĢ02r3 dr\int_0^{2\pi} \int_0^2 r^2 \cdot r \, dr \, d\theta = \int_0^{2\pi} d\theta \int_0^2 r^3 \, dr=2Ī€â‹…[r44]02=2Ī€â‹…164=8Ī€â‰ˆ25.13= 2\pi \cdot \left[\frac{r^4}{4}\right]_0^2 = 2\pi \cdot \frac{16}{4} = 8\pi \approx 25.13

📝Problem 4: Taylor Expansion

Find the second-order Taylor approximation of f(x,y)=ex+yf(x,y) = e^{x+y} at (0,0)(0,0) and evaluate at (0.1,0.1)(0.1, 0.1).

💡Solution

At (0,0)(0,0): f=1f = 1, ∇f=(1,1)\nabla f = (1,1), H=[1111]H = \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix}

f(h⃗)≈1+(h1+h2)+12(h12+2h1h2+h22)f(\vec{h}) \approx 1 + (h_1 + h_2) + \frac{1}{2}(h_1^2 + 2h_1 h_2 + h_2^2)

At (0.1,0.1)(0.1, 0.1): f≈1+0.2+12(0.01+0.02+0.01)=1+0.2+0.02=1.22f \approx 1 + 0.2 + \frac{1}{2}(0.01 + 0.02 + 0.01) = 1 + 0.2 + 0.02 = 1.22

Exact: e0.2≈1.2214e^{0.2} \approx 1.2214 (error ≈0.0014\approx 0.0014)

📝Problem 5: Jacobian Determinant

Compute the Jacobian determinant of the transformation x=u2−v2x = u^2 - v^2, y=2uvy = 2uv and use it to evaluate âˆŦR(x+y) dx dy\iint_R (x+y) \, dx \, dy where RR is the region bounded by x=0x = 0, x=1x = 1, y=0y = 0, y=1y = 1.

💡Solution

∂(x,y)∂(u,v)=det⁡[2u−2v2v2u]=4u2+4v2=4(u2+v2)\frac{\partial(x,y)}{\partial(u,v)} = \det \begin{bmatrix} 2u & -2v \\ 2v & 2u \end{bmatrix} = 4u^2 + 4v^2 = 4(u^2+v^2)

The transformation maps to the first quadrant. Setting u2−v2=0  ⟹  u=vu^2 - v^2 = 0 \implies u = v, and u2−v2=1u^2 - v^2 = 1 defines a hyperbola.

âˆŦR(x+y) dx dy=âˆŦD(u2−v2+2uv)⋅4(u2+v2) du dv\iint_R (x+y) \, dx \, dy = \iint_D (u^2 - v^2 + 2uv) \cdot 4(u^2+v^2) \, du \, dv

This integral evaluates to approximately 0.50.5 after computing the appropriate bounds in (u,v)(u,v) space.


Quick Reference

📋Multivariable Calculus Quick Reference

ConceptFormulaKey Property
Gradient∇f=(∂f∂x1,â€Ļ,∂f∂xn)\nabla f = \left(\frac{\partial f}{\partial x_1}, \ldots, \frac{\partial f}{\partial x_n}\right)Direction of steepest ascent
Directional DerivativeDu⃗f=∇f⋅u⃗D_{\vec{u}} f = \nabla f \cdot \vec{u}, âˆĨu⃗âˆĨ=1\|\vec{u}\| = 1Rate of change in direction u⃗\vec{u}
HessianHij=∂2f∂xi∂xjH_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}Symmetric matrix; determines curvature
Taylor (2nd order)f(x⃗0+h⃗)≈f+∇fTh⃗+12h⃗THh⃗f(\vec{x}_0 + \vec{h}) \approx f + \nabla f^T \vec{h} + \frac{1}{2}\vec{h}^T H \vec{h}Quadratic approximation
Critical point∇f(x⃗0)=0⃗\nabla f(\vec{x}_0) = \vec{0}Candidate for min/max/saddle
MinHâ‰ģ0H \succ 0 (all Îģi>0\lambda_i > 0)Convex curvature
MaxHâ‰ē0H \prec 0 (all Îģi<0\lambda_i < 0)Concave curvature
SaddleHH indefinite (mixed-sign eigenvalues)Curves up in some directions, down in others
Double integralâˆŦDf(x,y) dA=âˆĢâˆĢf(x,y) dy dx\iint_D f(x,y) \, dA = \int \int f(x,y) \, dy \, dxVolume under surface
JacobianJij=∂fi∂xjJ_{ij} = \frac{\partial f_i}{\partial x_j}Local linear approximation of f⃗\vec{f}
Jacobian det.det⁥(J)\det(J)Area/volume scaling factor
Polar coordsx=rcos⁡θx = r\cos\theta, y=rsin⁡θy = r\sin\theta, dA=r dr dθdA = r \, dr \, d\thetaUse for circular regions

Cross-References

Lesson Progress32 / 100