Calculus

Partial Derivatives and Gradients

Master partial derivatives, gradients, directional derivatives, tangent planes, and their applications in optimization and machine learning.

📂 Multivariable📖 Lesson 27 of 100🎓 Free Course


â„šī¸ Why It Matters

Machine learning models are multivariable functions: a neural network's loss depends on thousands (or billions) of parameters at once. Partial derivatives measure how the loss changes when one parameter is adjusted while all others are held fixed. The gradient vector, which stacks all the partial derivatives, points in the direction of steepest ascent. Gradient descent, the algorithm behind most model training, follows the negative gradient to reduce the loss. A solid grasp of partial derivatives and gradients is a prerequisite for understanding backpropagation and gradient-based optimization.


What is a Partial Derivative

Definition: Partial Derivative

The partial derivative of a multivariable function $f(x_1, x_2, \ldots, x_n)$ with respect to one variable $x_i$ is the limit of the difference quotient in $x_i$, holding all other variables constant. Geometrically, it measures the slope of the function along the $x_i$-axis while the other coordinates remain fixed.

Partial Derivative (Limit Definition)

$$\frac{\partial f}{\partial x_i} = \lim_{h \to 0}\frac{f(x_1,\ldots,x_i+h,\ldots,x_n) - f(x_1,\ldots,x_i,\ldots,x_n)}{h}$$

Here,

  • $f$ = The multivariable function
  • $x_i$ = The variable being differentiated with respect to
  • $h$ = A small increment in $x_i$ that tends to zero
  • $\frac{\partial f}{\partial x_i}$ = The partial derivative of $f$ with respect to $x_i$
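
The limit definition can be probed numerically. The following is a minimal sketch, not a production routine: the function `f`, the evaluation point $(1, 2)$, the helper name `difference_quotient_x`, and the step sizes are all assumptions chosen for illustration. It shows the forward difference quotient in $x$ approaching the analytic value $6xy - 2 = 10$ as $h$ shrinks.

```python
def f(x, y):
    # Illustrative function: f(x, y) = 3x^2 y + y^3 - 2x
    return 3 * x**2 * y + y**3 - 2 * x

def difference_quotient_x(f, x, y, h):
    # Forward difference quotient in x; y is held fixed
    return (f(x + h, y) - f(x, y)) / h

exact = 6 * 1.0 * 2.0 - 2  # analytic df/dx = 6xy - 2 at (1, 2)
for h in (1e-1, 1e-2, 1e-3, 1e-4):
    approx = difference_quotient_x(f, 1.0, 2.0, h)
    print(f"h={h:.0e}  quotient={approx:.6f}  error={abs(approx - exact):.2e}")
```

For this polynomial the quotient works out to $10 + 6h$, so the error shrinks in proportion to $h$.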

💡 Intuition

Think of standing on a hilly terrain described by $z = f(x, y)$. The partial derivative $\frac{\partial f}{\partial x}$ tells you how steep the hill is if you walk purely in the east-west ($x$) direction. The partial derivative $\frac{\partial f}{\partial y}$ tells you the slope in the north-south ($y$) direction. Neither tells you the full picture; that is what the gradient is for.

âš ī¸ Notation

The symbol $\partial$ (curly "d") is read "partial." It signals that $f$ depends on multiple variables and you are differentiating with respect to only one. Do not confuse $\frac{\partial f}{\partial x}$ with $\frac{df}{dx}$: the latter is the ordinary derivative of a single-variable function, or a total derivative when the other variables themselves depend on $x$.


Computing Partial Derivatives

To compute $\frac{\partial f}{\partial x_i}$, treat all variables except $x_i$ as constants and apply the standard single-variable differentiation rules.

📝Example 1: Polynomial

Problem: Find all partial derivatives of $f(x, y) = 3x^2y + y^3 - 2x$.

Solution:

  • $\frac{\partial f}{\partial x} = 6xy - 2$ (treat $y$ as a constant)
  • $\frac{\partial f}{\partial y} = 3x^2 + 3y^2$ (treat $x$ as a constant)

📝Example 2: Trigonometric

Problem: Find $\frac{\partial f}{\partial x}$ and $\frac{\partial f}{\partial y}$ for $f(x,y) = \sin(xy) + e^{x+y}$.

Solution:

  • $\frac{\partial f}{\partial x} = y\cos(xy) + e^{x+y}$ (chain rule on $\sin(xy)$: derivative of $\sin(u)$ is $\cos(u)$, times $\frac{\partial(xy)}{\partial x} = y$)
  • $\frac{\partial f}{\partial y} = x\cos(xy) + e^{x+y}$ (symmetric, with $x$ playing the role of $y$)

📝Example 3: Quotient

Problem: Find $\frac{\partial f}{\partial x}$ for $f(x,y) = \frac{x^2}{x^2 + y^2}$.

Solution: Using the quotient rule with $u = x^2$ and $v = x^2 + y^2$:

$$\frac{\partial f}{\partial x} = \frac{2x(x^2+y^2) - x^2 \cdot 2x}{(x^2+y^2)^2} = \frac{2x(x^2+y^2) - 2x^3}{(x^2+y^2)^2} = \frac{2xy^2}{(x^2+y^2)^2}$$
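
A hand-derived partial like this one can be sanity-checked against a finite-difference estimate. The sketch below is only a spot check, not a proof: the helper names `claimed_fx` and `central_diff_x`, the step size, and the three test points are assumptions made for illustration.

```python
def f(x, y):
    return x**2 / (x**2 + y**2)

def claimed_fx(x, y):
    # Result of the quotient-rule computation: 2xy^2 / (x^2 + y^2)^2
    return 2 * x * y**2 / (x**2 + y**2) ** 2

def central_diff_x(f, x, y, h=1e-6):
    # Central difference in x; y is held fixed
    return (f(x + h, y) - f(x - h, y)) / (2 * h)

points = [(1.0, 2.0), (-0.5, 1.5), (3.0, -1.0)]  # arbitrary points away from the origin
max_gap = max(abs(central_diff_x(f, x, y) - claimed_fx(x, y)) for x, y in points)
print(f"largest gap between numeric and analytic values: {max_gap:.2e}")
```

A gap near floating-point noise is consistent with the formula; a gap of order one would point to an algebra slip.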

📝Example 4: Exponential

Problem: Find all partial derivatives of $f(x,y,z) = x^2 y e^{-z}$.

Solution:

  • $\frac{\partial f}{\partial x} = 2xy e^{-z}$
  • $\frac{\partial f}{\partial y} = x^2 e^{-z}$
  • $\frac{\partial f}{\partial z} = -x^2 y e^{-z}$

Higher-Order Partial Derivatives

Just as we can take derivatives of derivatives in single-variable calculus, we can take partial derivatives of partial derivatives.

Definition: Second-Order Partial Derivatives

The second partial derivatives are obtained by differentiating a first partial derivative with respect to one of the variables. For a function $f(x, y)$, there are four second-order partials:

Second-Order Partial Derivatives

$$f_{xx} = \frac{\partial^2 f}{\partial x^2} = \frac{\partial}{\partial x}\left(\frac{\partial f}{\partial x}\right), \quad f_{yy} = \frac{\partial^2 f}{\partial y^2} = \frac{\partial}{\partial y}\left(\frac{\partial f}{\partial y}\right)$$

Here,

  • $f_{xx}$ = Second partial with respect to $x$ (twice)
  • $f_{yy}$ = Second partial with respect to $y$ (twice)

Mixed Partial Derivatives

$$f_{xy} = \frac{\partial^2 f}{\partial y\,\partial x} = \frac{\partial}{\partial y}\left(\frac{\partial f}{\partial x}\right), \quad f_{yx} = \frac{\partial^2 f}{\partial x\,\partial y} = \frac{\partial}{\partial x}\left(\frac{\partial f}{\partial y}\right)$$

Here,

  • $f_{xy}$ = First differentiate w.r.t. $x$, then w.r.t. $y$
  • $f_{yx}$ = First differentiate w.r.t. $y$, then w.r.t. $x$

Theorem: Clairaut's Theorem (Equality of Mixed Partials)

If $f$ and its partial derivatives $f_x$, $f_y$, $f_{xy}$, and $f_{yx}$ are all continuous on a region containing the point $(a, b)$, then the mixed partial derivatives are equal:

$$f_{xy}(a,b) = f_{yx}(a,b)$$

In Leibniz notation: $\frac{\partial^2 f}{\partial y\,\partial x} = \frac{\partial^2 f}{\partial x\,\partial y}$

This holds for most functions encountered in practice. Equality can fail only where the continuity hypotheses are violated, as in specially constructed pathological functions.

📝Verifying Clairaut's Theorem

Problem: Verify that $f_{xy} = f_{yx}$ for $f(x,y) = x^3 y^2 + \sin(xy)$.

Solution: First partials:

  • $f_x = 3x^2 y^2 + y\cos(xy)$
  • $f_y = 2x^3 y + x\cos(xy)$

Mixed partials:

  • $f_{xy} = \frac{\partial}{\partial y}(3x^2 y^2 + y\cos(xy)) = 6x^2 y + \cos(xy) - xy\sin(xy)$
  • $f_{yx} = \frac{\partial}{\partial x}(2x^3 y + x\cos(xy)) = 6x^2 y + \cos(xy) - xy\sin(xy)$

Confirmed: $f_{xy} = f_{yx}$.


The Gradient

Definition: Gradient

The gradient of a function $f: \mathbb{R}^n \to \mathbb{R}$ is the vector of all its partial derivatives. It is denoted $\nabla f$ (read "nabla f" or "grad f") and is the central object in gradient-based optimization.

Gradient Vector

$$\nabla f = \begin{bmatrix}\frac{\partial f}{\partial x_1}\\\frac{\partial f}{\partial x_2}\\\vdots\\\frac{\partial f}{\partial x_n}\end{bmatrix}$$

Here,

  • $\nabla f$ = The gradient of $f$ (an $n$-dimensional column vector)
  • $\frac{\partial f}{\partial x_i}$ = Partial derivative of $f$ with respect to $x_i$

Key Properties of the Gradient:

  • Points in the direction of steepest ascent of $f$
  • The magnitude $\|\nabla f\|$ equals the rate of increase of $f$ in that direction, which is the maximum rate over all directions
  • The gradient is perpendicular (orthogonal) to level sets (contour lines) of $f$
  • At an interior local maximum or minimum of a differentiable $f$, $\nabla f = \mathbf{0}$ (the zero vector)

📝Compute the Gradient

Problem: Find $\nabla f$ for $f(x,y,z) = x^2 y + y^2 z - z^3$.

Solution:

$$\nabla f = \begin{bmatrix} 2xy \\ x^2 + 2yz \\ y^2 - 3z^2 \end{bmatrix}$$

At the point $(1, 2, 1)$: $\nabla f(1,2,1) = \begin{bmatrix} 4 \\ 5 \\ 1 \end{bmatrix}$

This means the function increases most steeply in the direction $(4, 5, 1)$ at that point.


Directional Derivative

The gradient tells you the slope along each coordinate axis, but what if you want the slope in an arbitrary direction?

Definition: Directional Derivative

The directional derivative of $f$ at a point in the direction of a unit vector $\vec{u}$ measures the rate of change of $f$ as you move from that point in the direction $\vec{u}$.

Directional Derivative

$$D_{\vec{u}}f = \nabla f \cdot \vec{u} = \|\nabla f\| \|\vec{u}\| \cos\theta = \|\nabla f\| \cos\theta$$

Here,

  • $D_{\vec{u}}f$ = The directional derivative in direction $\vec{u}$
  • $\nabla f$ = The gradient of $f$
  • $\vec{u}$ = A unit vector specifying the direction
  • $\theta$ = Angle between the gradient and the direction vector

Important Facts:

  • $D_{\vec{u}}f$ is maximized when $\theta = 0$ (i.e., $\vec{u}$ is parallel to $\nabla f$). The maximum value is $\|\nabla f\|$.
  • $D_{\vec{u}}f$ is minimized when $\theta = \pi$ (i.e., $\vec{u}$ is antiparallel to $\nabla f$). The minimum value is $-\|\nabla f\|$.
  • $D_{\vec{u}}f = 0$ when $\vec{u} \perp \nabla f$, meaning you are moving along a level curve.
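
These facts can be checked by brute force. The sketch below assumes a gradient of $\langle 2, -4 \rangle$ (the gradient of $f(x,y) = x^2 - y^2$ at $(1, 2)$) and sweeps unit vectors around the circle in steps of 0.1 degrees; the sweep resolution and variable names are illustrative choices, not part of any standard routine.

```python
import math

grad = (2.0, -4.0)  # assumed gradient: f(x, y) = x^2 - y^2 evaluated at (1, 2)
grad_norm = math.hypot(*grad)

best_angle, best_value = 0.0, -math.inf
for k in range(3600):
    theta = 2 * math.pi * k / 3600
    u = (math.cos(theta), math.sin(theta))     # unit vector at angle theta
    value = grad[0] * u[0] + grad[1] * u[1]    # D_u f = grad . u
    if value > best_value:
        best_angle, best_value = theta, value

best_u = (math.cos(best_angle), math.sin(best_angle))
print(f"max D_u f ~ {best_value:.4f}, ||grad f|| = {grad_norm:.4f}")
print(f"best u ~ ({best_u[0]:.3f}, {best_u[1]:.3f})")
print(f"grad / ||grad|| = ({grad[0] / grad_norm:.3f}, {grad[1] / grad_norm:.3f})")
```

Up to the sweep resolution, the largest directional derivative matches $\|\nabla f\|$ and the best direction matches the normalized gradient.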

📝Directional Derivative Calculation

Problem: Find the directional derivative of $f(x,y) = x^2 - y^2$ at $(1, 2)$ in the direction $\vec{v} = \langle 3, 4 \rangle$.

Solution: Step 1: Compute the gradient. $\nabla f = \langle 2x, -2y \rangle$, so $\nabla f(1,2) = \langle 2, -4 \rangle$.

Step 2: Normalize the direction vector. $\|\vec{v}\| = \sqrt{9 + 16} = 5$, so $\vec{u} = \langle 3/5, 4/5 \rangle$.

Step 3: Compute the dot product.

$$D_{\vec{u}}f = \nabla f \cdot \vec{u} = 2 \cdot \frac{3}{5} + (-4) \cdot \frac{4}{5} = \frac{6}{5} - \frac{16}{5} = -2$$

The function decreases at a rate of 2 per unit step in direction $(3, 4)$.


Tangent Plane

Definition: Tangent Plane

The tangent plane to the surface $z = f(x, y)$ at the point $(x_0, y_0, f(x_0, y_0))$ is the plane that best approximates $f$ near that point. It is the multivariable analog of the tangent line.

Tangent Plane Equation

$$z = f(x_0, y_0) + f_x(x_0, y_0)(x - x_0) + f_y(x_0, y_0)(y - y_0)$$

Here,

  • $(x_0, y_0)$ = The point of tangency
  • $f(x_0, y_0)$ = Function value at the point
  • $f_x(x_0, y_0)$ = Partial derivative with respect to $x$ at the point
  • $f_y(x_0, y_0)$ = Partial derivative with respect to $y$ at the point

💡 Connection to Linear Approximation

The tangent plane is the first-order Taylor approximation of $f$ at $(x_0, y_0)$:

$$f(x, y) \approx f(x_0, y_0) + f_x(x_0, y_0)(x - x_0) + f_y(x_0, y_0)(y - y_0)$$

This approximation is accurate near $(x_0, y_0)$: for a differentiable $f$, the error shrinks faster than the distance from $(x, y)$ to $(x_0, y_0)$.
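
How quickly the linear approximation improves can be seen numerically. The sketch below assumes $f(x,y) = x^2 + y^2$ with tangency point $(1, 2)$; the helper name `tangent_plane` and the offsets are illustrative choices. It compares $f$ with its tangent plane at points displaced by $d$ in both coordinates.

```python
def f(x, y):
    return x**2 + y**2

def tangent_plane(x, y, x0=1.0, y0=2.0):
    # L(x, y) = f(x0, y0) + f_x(x0, y0)(x - x0) + f_y(x0, y0)(y - y0)
    fx0, fy0 = 2 * x0, 2 * y0  # analytic partials of x^2 + y^2
    return f(x0, y0) + fx0 * (x - x0) + fy0 * (y - y0)

for d in (1e-1, 1e-2, 1e-3):
    error = abs(f(1 + d, 2 + d) - tangent_plane(1 + d, 2 + d))
    print(f"offset={d:.0e}  error={error:.3e}  error/offset^2={error / d**2:.3f}")
```

For this quadratic the error is exactly $2d^2$, so dividing the offset by 10 divides the error by 100.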

📝Tangent Plane Example

Problem: Find the tangent plane to $f(x,y) = x^2 + y^2$ at $(1, 2)$.

Solution: $f(1,2) = 5$, $f_x(1,2) = 2(1) = 2$, $f_y(1,2) = 2(2) = 4$

Tangent plane: $z = 5 + 2(x - 1) + 4(y - 2) = 5 + 2x - 2 + 4y - 8 = 2x + 4y - 5$

Check: At $(1, 2)$, $z = 2 + 8 - 5 = 5 = f(1,2)$. Correct.


Total Derivative

Definition: Total Derivative

The total derivative of a function accounts for how the function changes with respect to all variables simultaneously, including cases where the variables themselves depend on other parameters.

Total Derivative (Two Variables)

$$df = \frac{\partial f}{\partial x}\,dx + \frac{\partial f}{\partial y}\,dy$$

Here,

  • $df$ = The total differential (infinitesimal change in $f$)
  • $dx$ = Infinitesimal change in $x$
  • $dy$ = Infinitesimal change in $y$

If $x$ and $y$ are both functions of a parameter $t$, the chain rule for total derivatives gives:

Total Derivative via Chain Rule

$$\frac{df}{dt} = \frac{\partial f}{\partial x}\frac{dx}{dt} + \frac{\partial f}{\partial y}\frac{dy}{dt}$$

Here,

  • $\frac{df}{dt}$ = Rate of change of $f$ with respect to $t$
  • $\frac{\partial f}{\partial x}$ = Partial derivative of $f$ with respect to $x$
  • $\frac{dx}{dt}$ = Rate of change of $x$ with respect to $t$

💡 Partial vs Total

A partial derivative $\frac{\partial f}{\partial x}$ holds all other variables constant. The total derivative $\frac{df}{dt}$ tracks how everything changes simultaneously. The distinction matters when variables are not independent.
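
The chain-rule form of the total derivative can be compared with differentiating along the path directly. The sketch below assumes $f(x,y) = x^2 y$ on the path $x = \cos t$, $y = \sin t$; the function, the path, and the helper names are illustrative assumptions.

```python
import math

def chain_rule_dfdt(t):
    x, y = math.cos(t), math.sin(t)
    df_dx, df_dy = 2 * x * y, x**2             # partials of f(x, y) = x^2 y
    dx_dt, dy_dt = -math.sin(t), math.cos(t)   # derivatives of the path
    return df_dx * dx_dt + df_dy * dy_dt

def direct_dfdt(t, h=1e-6):
    # Central difference of f restricted to the path: g(t) = cos(t)^2 * sin(t)
    g = lambda s: math.cos(s) ** 2 * math.sin(s)
    return (g(t + h) - g(t - h)) / (2 * h)

t = 0.7
print(f"chain rule: {chain_rule_dfdt(t):.8f}")
print(f"direct:     {direct_dfdt(t):.8f}")
```

Both routes should agree up to finite-difference error, because the total derivative accounts for the motion of $x$ and $y$ together.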


Chain Rule for Partial Derivatives

Theorem: Multivariable Chain Rule

If $z = f(u, v)$ where $u = g(x, y)$ and $v = h(x, y)$, then:

$$\frac{\partial z}{\partial x} = \frac{\partial f}{\partial u}\frac{\partial u}{\partial x} + \frac{\partial f}{\partial v}\frac{\partial v}{\partial x}$$

$$\frac{\partial z}{\partial y} = \frac{\partial f}{\partial u}\frac{\partial u}{\partial y} + \frac{\partial f}{\partial v}\frac{\partial v}{\partial y}$$

This generalizes the single-variable chain rule to multiple variables by summing over all intermediate paths.

📝Multivariable Chain Rule

Problem: Let $z = u^2 v$ where $u = x + y$ and $v = x - y$. Find $\frac{\partial z}{\partial x}$.

Solution: $\frac{\partial f}{\partial u} = 2uv$, $\frac{\partial f}{\partial v} = u^2$, $\frac{\partial u}{\partial x} = 1$, $\frac{\partial v}{\partial x} = 1$

$$\frac{\partial z}{\partial x} = 2uv \cdot 1 + u^2 \cdot 1 = 2uv + u^2 = u(2v + u)$$

Substituting: $\frac{\partial z}{\partial x} = (x+y)(2(x-y) + (x+y)) = (x+y)(3x - y)$

âš ī¸ Chain Rule Gotcha

When applying the multivariable chain rule, you must sum over all intermediate variables. Missing a term is the most common error. If $f$ depends on $u, v, w$ and each depends on $x$, you need three terms in the sum, not two.
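
The cost of dropping a term is easy to see numerically. The sketch below assumes $f(u,v,w) = uvw$ with $u = x^2$, $v = \sin x$, $w = e^x$; these functions and the evaluation point are hypothetical choices for illustration. It compares the full three-term sum and a two-term sum against a finite-difference estimate.

```python
import math

def z(x):
    # z = f(u, v, w) = u * v * w with u = x^2, v = sin(x), w = exp(x)
    return x**2 * math.sin(x) * math.exp(x)

def chain_terms(x):
    u, v, w = x**2, math.sin(x), math.exp(x)
    return (
        v * w * 2 * x,        # df/du * du/dx
        u * w * math.cos(x),  # df/dv * dv/dx
        u * v * math.exp(x),  # df/dw * dw/dx
    )

x0, h = 1.2, 1e-6
numeric = (z(x0 + h) - z(x0 - h)) / (2 * h)  # central-difference reference value
terms = chain_terms(x0)
print(f"numeric reference: {numeric:.6f}")
print(f"all three terms:   {sum(terms):.6f}")
print(f"two terms only:    {sum(terms[:2]):.6f}")
```

Only the full sum reproduces the reference value; the two-term version is off by the entire missing path.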


Python Implementation

Numerical Gradient (Finite Differences)

import numpy as np

def numerical_gradient(f, x, h=1e-7):
    """Compute the gradient of f at x using central differences."""
    x = np.asarray(x, dtype=float)  # integer input would break the in-place "+= h" below
    grad = np.zeros_like(x)
    for i in range(len(x)):
        x_plus = x.copy()
        x_minus = x.copy()
        x_plus[i] += h
        x_minus[i] -= h
        grad[i] = (f(x_plus) - f(x_minus)) / (2 * h)
    return grad

# Example: f(x, y) = x^2 + y^2, gradient should be [2x, 2y]
f = lambda x: x[0]**2 + x[1]**2
x = np.array([1.0, 2.0])
print(f"Numerical gradient at (1,2): {numerical_gradient(f, x)}")
# Prints values close to [2. 4.]; finite differences carry small rounding error

Symbolic Differentiation with SymPy

from sympy import symbols, diff, sin

x, y = symbols('x y')

# Define a multivariable function
f = x**2 * y + sin(x * y)

# Compute partial derivatives
fx = diff(f, x)
fy = diff(f, y)

print(f"f(x,y)  = {f}")
print(f"df/dx   = {fx}")
print(f"df/dy   = {fy}")

# Second-order partials
fxx = diff(f, x, 2)
fxy = diff(f, x, y)
print(f"d2f/dx2 = {fxx}")
print(f"d2f/dxdy = {fxy}")

# Gradient at a point
grad_at = {x: 1, y: 2}
print(f"Gradient at (1,2): df/dx={fx.subs(grad_at)}, df/dy={fy.subs(grad_at)}")

Vectorized Gradient with NumPy

import numpy as np

def gradient_field(f, grid_x, grid_y, h=1e-7):
    """Compute the gradient on a 2D grid using central differences."""
    dz_dx = (f(grid_x + h, grid_y) - f(grid_x - h, grid_y)) / (2 * h)
    dz_dy = (f(grid_x, grid_y + h) - f(grid_x, grid_y - h)) / (2 * h)
    return dz_dx, dz_dy

# Example: f(x,y) = sin(x) * cos(y)
f = lambda x, y: np.sin(x) * np.cos(y)
x = np.linspace(-np.pi, np.pi, 9)  # spacing pi/4, so pi/4 is a grid point (index 5)
y = np.linspace(-np.pi, np.pi, 9)
X, Y = np.meshgrid(x, y)

gx, gy = gradient_field(f, X, Y)
i = j = 5  # X[i, j] and Y[i, j] both equal pi/4 up to rounding
print("Gradient at (pi/4, pi/4):")
print(f"  df/dx numerical = {gx[i, j]:.4f}, analytic = {np.cos(np.pi/4) * np.cos(np.pi/4):.4f}")
print(f"  df/dy numerical = {gy[i, j]:.4f}, analytic = {-np.sin(np.pi/4) * np.sin(np.pi/4):.4f}")

Directional Derivative in Python

import numpy as np

def directional_derivative(f, point, direction, h=1e-7):
    """Compute directional derivative of f at point in given direction."""
    grad = np.zeros_like(point, dtype=float)
    for i in range(len(point)):
        p_plus = point.copy().astype(float)
        p_minus = point.copy().astype(float)
        p_plus[i] += h
        p_minus[i] -= h
        grad[i] = (f(p_plus) - f(p_minus)) / (2 * h)
    
    unit_dir = direction / np.linalg.norm(direction)
    return np.dot(grad, unit_dir), grad

# f(x,y) = x^2 - y^2 at (1,2) in direction (3,4)
f = lambda p: p[0]**2 - p[1]**2
point = np.array([1.0, 2.0])
direction = np.array([3.0, 4.0])

dd, grad = directional_derivative(f, point, direction)
print(f"Gradient: {grad}")       # [2, -4]
print(f"Directional derivative: {dd}")  # -2.0

Applications in AI/ML

Gradient Descent

The core training loop for neural networks:

Gradient Descent Update

$$\theta_{t+1} = \theta_t - \alpha \nabla_\theta \mathcal{L}(\theta_t)$$

Here,

  • $\theta_t$ = Model parameters at iteration $t$
  • $\alpha$ = Learning rate (step size)
  • $\nabla_\theta \mathcal{L}$ = Gradient of the loss with respect to parameters

Each component $\frac{\partial \mathcal{L}}{\partial \theta_i}$ tells the optimizer how to adjust parameter $\theta_i$ to reduce the loss. The negative sign means we move in the direction of steepest descent.
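
The update rule can be run directly on a toy problem. The sketch below is a minimal illustration, not a training loop for a real model: the quadratic `loss`, its hand-written gradient `grad_loss`, the learning rate, and the iteration count are all assumptions chosen so the behavior is easy to follow.

```python
import numpy as np

def loss(theta):
    # Hypothetical quadratic loss with its minimum at (3, -1)
    return (theta[0] - 3.0) ** 2 + 2.0 * (theta[1] + 1.0) ** 2

def grad_loss(theta):
    # Vector of partial derivatives of the loss above
    return np.array([2.0 * (theta[0] - 3.0), 4.0 * (theta[1] + 1.0)])

theta = np.array([0.0, 0.0])  # starting parameters
alpha = 0.1                   # learning rate
for _ in range(200):
    theta = theta - alpha * grad_loss(theta)  # theta_{t+1} = theta_t - alpha * grad

print("final parameters:", theta)
print("final loss:", loss(theta))
```

With this step size each coordinate's distance to the minimizer shrinks by a constant factor per iteration, so the parameters settle at $(3, -1)$. Too large a step (here, $\alpha \ge 0.5$) would stop the second coordinate from converging.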

Backpropagation

Backpropagation is the chain rule applied to a deep neural network. For a network with layers $f_1, f_2, \ldots, f_L$, where $a^{(l)}$ denotes the output (activation) of layer $l$ and $w^{(l)}$ its weights:

$$\frac{\partial \mathcal{L}}{\partial w^{(l)}} = \frac{\partial \mathcal{L}}{\partial a^{(L)}} \cdot \frac{\partial a^{(L)}}{\partial a^{(L-1)}} \cdots \frac{\partial a^{(l+1)}}{\partial a^{(l)}} \cdot \frac{\partial a^{(l)}}{\partial w^{(l)}}$$

Each factor is a Jacobian matrix of partial derivatives. The gradient is computed by chaining these Jacobians from output to input, which is why the chain rule is the backbone of deep learning.
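
A scalar example makes the chaining concrete. The sketch below assumes a hypothetical two-layer network with one unit per layer, $a^{(1)} = \tanh(w_1 x)$ and $a^{(2)} = w_2 a^{(1)}$, and a squared-error loss; the architecture, the input values, and the function names are illustrative assumptions. The hand-derived gradients are compared against central differences.

```python
import math

def forward(w1, w2, x):
    a1 = math.tanh(w1 * x)  # hidden activation
    a2 = w2 * a1            # network output
    return a1, a2

def loss(w1, w2, x, y):
    _, a2 = forward(w1, w2, x)
    return 0.5 * (a2 - y) ** 2

def backprop(w1, w2, x, y):
    a1, a2 = forward(w1, w2, x)
    dL_da2 = a2 - y                     # dL/da2 for the squared-error loss
    dL_dw2 = dL_da2 * a1                # da2/dw2 = a1
    dL_da1 = dL_da2 * w2                # da2/da1 = w2
    dL_dw1 = dL_da1 * (1 - a1**2) * x   # da1/dw1 = (1 - tanh^2(w1 x)) * x
    return dL_dw1, dL_dw2

w1, w2, x, y, h = 0.5, -1.5, 2.0, 1.0, 1e-6
g1, g2 = backprop(w1, w2, x, y)
n1 = (loss(w1 + h, w2, x, y) - loss(w1 - h, w2, x, y)) / (2 * h)
n2 = (loss(w1, w2 + h, x, y) - loss(w1, w2 - h, x, y)) / (2 * h)
print(f"dL/dw1: backprop={g1:.8f}  numeric={n1:.8f}")
print(f"dL/dw2: backprop={g2:.8f}  numeric={n2:.8f}")
```

The backward pass reuses `dL_da2` for both weights, which is the efficiency gain backpropagation offers over differentiating each weight from scratch.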

Hessian in Optimization

The matrix of second-order partials (the Hessian) determines the curvature of the loss surface:

Hessian Matrix

$$H_{ij} = \frac{\partial^2 \mathcal{L}}{\partial \theta_i \partial \theta_j}$$

Here,

  • $H_{ij}$ = The $(i,j)$-th entry of the Hessian
  • $\frac{\partial^2 \mathcal{L}}{\partial \theta_i \partial \theta_j}$ = Second mixed partial derivative

Key facts:

  • If $H$ is positive definite at a critical point, that point is a local minimum
  • Newton's method uses $H^{-1}$ and typically converges in fewer iterations than gradient descent near a minimum
  • In practice, $H$ is too large to compute and store for deep networks, so first-order methods (SGD, Adam) are used instead
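
The definiteness test can be carried out through the eigenvalues of the Hessian. The sketch below is one simple way to do it, assuming the Hessian is symmetric and small enough to form explicitly; the helper name `classify_critical_point` and the two example matrices are illustrative.

```python
import numpy as np

def classify_critical_point(hessian):
    eigenvalues = np.linalg.eigvalsh(hessian)  # real eigenvalues of a symmetric matrix
    if np.all(eigenvalues > 0):
        return "local minimum"   # positive definite
    if np.all(eigenvalues < 0):
        return "local maximum"   # negative definite
    if np.any(eigenvalues > 0) and np.any(eigenvalues < 0):
        return "saddle point"    # indefinite
    return "inconclusive"        # some eigenvalue is zero

bowl = np.array([[2.0, 0.0], [0.0, 2.0]])     # Hessian of x^2 + y^2
saddle = np.array([[2.0, 0.0], [0.0, -2.0]])  # Hessian of x^2 - y^2
print(classify_critical_point(bowl))
print(classify_critical_point(saddle))
```

Both example functions have a critical point at the origin, but only the first is a minimum; the gradient alone cannot tell them apart.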

Common Mistakes

| Mistake | Why It Is Wrong | Correct Approach |
|---|---|---|
| Forgetting to hold variables constant | Partial derivatives treat other variables as constants, not as functions of $x$ | When computing $\frac{\partial f}{\partial x}$, freeze every other variable |
| Confusing partial and total derivatives | $\frac{df}{dx}$ and $\frac{\partial f}{\partial x}$ have different meanings | Use $\partial$ when the function has multiple independent variables |
| Mixing up $f_{xy}$ vs $f_{yx}$ notation | In subscript notation $f_{xy}$ means differentiate w.r.t. $x$ first, then $y$; in Leibniz notation $\frac{\partial^2 f}{\partial y\,\partial x}$ the same order reads right to left, and some texts use the opposite subscript convention | State your convention; under Clairaut's theorem the two are equal |
| Neglecting the chain rule on composite functions | Forgetting to multiply by $\frac{\partial u}{\partial x}$ when $f$ depends on $u(x, y)$ | Always trace the dependency tree: every path from output to input contributes a term |
| Not normalizing direction vectors | The directional derivative formula requires $\|\vec{u}\| = 1$ | Always divide by $\|\vec{v}\|$ before computing $\nabla f \cdot \vec{u}$ |
| Assuming every critical point is a minimum | A critical point ($\nabla f = \mathbf{0}$) can also be a maximum or a saddle point | Check the Hessian (second derivative test) to distinguish minima, maxima, and saddle points |

Interview Questions

Q1: What does the gradient vector point toward?

💡Answer

The gradient $\nabla f$ points in the direction of steepest ascent, the direction in which $f$ increases most rapidly. Its magnitude $\|\nabla f\|$ is the rate of increase in that direction. The negative gradient $-\nabla f$ points toward steepest descent, which is why gradient descent uses $\theta \leftarrow \theta - \alpha \nabla f$.

Q2: When are mixed partial derivatives not equal?

💡Answer

Clairaut's theorem guarantees $f_{xy} = f_{yx}$ when $f_x$, $f_y$, $f_{xy}$, and $f_{yx}$ are all continuous. They can fail to be equal when these partials are discontinuous. A classic counterexample is:

$$f(x,y) = \begin{cases} \frac{xy(x^2 - y^2)}{x^2 + y^2} & (x,y) \neq (0,0) \\ 0 & (x,y) = (0,0) \end{cases}$$

Here $f_{xy}(0,0) = -1$ but $f_{yx}(0,0) = 1$. The function is continuous but the mixed partials are not continuous at the origin.
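
The asymmetry at the origin can be observed with nested finite differences. The sketch below is a numerical illustration rather than a proof; the step sizes are assumptions, with the outer step `k` deliberately much larger than the inner step `h` so that the inner differences resolve the first partials before the outer difference is taken.

```python
def f(x, y):
    if x == 0.0 and y == 0.0:
        return 0.0
    return x * y * (x**2 - y**2) / (x**2 + y**2)

def fx(x, y, h=1e-6):
    # Central-difference estimate of the first partial in x
    return (f(x + h, y) - f(x - h, y)) / (2 * h)

def fy(x, y, h=1e-6):
    # Central-difference estimate of the first partial in y
    return (f(x, y + h) - f(x, y - h)) / (2 * h)

k = 1e-3  # outer step, much larger than the inner step h
fxy_origin = (fx(0.0, k) - fx(0.0, -k)) / (2 * k)  # d/dy of f_x at (0, 0)
fyx_origin = (fy(k, 0.0) - fy(-k, 0.0)) / (2 * k)  # d/dx of f_y at (0, 0)
print(f"f_xy(0,0) ~ {fxy_origin:.4f}")
print(f"f_yx(0,0) ~ {fyx_origin:.4f}")
```

The two estimates land near $-1$ and $+1$ respectively, matching the analytic values and confirming that the order of differentiation matters for this function at the origin.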

Q3: Why is the gradient orthogonal to level curves?

💡Answer

Along a level curve $f(x,y) = c$, the total differential is zero: $\frac{\partial f}{\partial x}dx + \frac{\partial f}{\partial y}dy = 0$. This means $\nabla f \cdot \langle dx, dy \rangle = 0$ for any direction tangent to the level curve. Since the gradient is orthogonal to every tangent direction, it is perpendicular to the level curve itself. This is why following the gradient carries you straight across contour lines toward higher values, and following the negative gradient toward lower values.
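
The orthogonality can be checked on a concrete level curve. The sketch below assumes $f(x,y) = x^2 + y^2$, whose level curves are circles; the radius and the sample parameter values are arbitrary illustrative choices.

```python
import math

def grad_f(x, y):
    return (2 * x, 2 * y)  # gradient of f(x, y) = x^2 + y^2

radius = 2.0  # the level curve f = radius^2 is a circle of this radius
dots = []
for t in (0.0, 0.8, 2.5, 4.0):
    x, y = radius * math.cos(t), radius * math.sin(t)        # point on the level curve
    tangent = (-radius * math.sin(t), radius * math.cos(t))  # derivative of the curve in t
    g = grad_f(x, y)
    dots.append(g[0] * tangent[0] + g[1] * tangent[1])
    print(f"t={t:.1f}  grad . tangent = {dots[-1]:.2e}")
```

Every dot product vanishes up to rounding, as the argument above predicts.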

Q4: How does gradient descent use partial derivatives in neural network training?

💡Answer

Each weight $w_{ij}^{(l)}$ in the network is a variable in the loss function $\mathcal{L}$. The partial derivative $\frac{\partial \mathcal{L}}{\partial w_{ij}^{(l)}}$ measures how the loss changes when that specific weight changes. Backpropagation computes all these partial derivatives efficiently using the chain rule. The update rule $w_{ij}^{(l)} \leftarrow w_{ij}^{(l)} - \alpha \frac{\partial \mathcal{L}}{\partial w_{ij}^{(l)}}$ adjusts every weight simultaneously: each weight moves against the sign of its own partial derivative, and taken together the step points along the negative gradient, the direction of steepest descent.

Q5: What is the difference between the directional derivative and the gradient?

💡Answer

The directional derivative $D_{\vec{u}}f = \nabla f \cdot \vec{u}$ is a scalar that gives the rate of change of $f$ in a specific direction $\vec{u}$. The gradient $\nabla f$ is a vector that encodes the directional derivative for every possible direction simultaneously. The maximum directional derivative equals $\|\nabla f\|$ (in the direction of $\nabla f$), and the minimum is $-\|\nabla f\|$ (opposite to $\nabla f$).

Q6: Why can't we use the Hessian for large neural networks?

💡Answer

The Hessian is an $n \times n$ matrix where $n$ is the number of parameters. For a model with 1 million parameters, the Hessian has $10^{12}$ entries, far too large to store or invert. Computing it requires $O(n^2)$ second derivatives. This is why exact second-order methods (like Newton's method) are impractical for deep learning, and first-order methods (SGD, Adam) that only need the gradient ($O(n)$ entries) are used instead. Approximations like L-BFGS or natural gradient methods try to capture curvature information more cheaply.
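
The storage gap is worth working out once. The short calculation below assumes 4 bytes per entry (32-bit floats), which is a common but not universal choice, and uses decimal units (1 TB = $10^{12}$ bytes).

```python
n_params = 1_000_000
bytes_per_entry = 4  # assumption: float32 storage

hessian_entries = n_params ** 2  # n x n matrix of second partials
gradient_entries = n_params      # one partial derivative per parameter

print(f"Hessian entries:  {hessian_entries:.1e}")
print(f"Hessian storage:  {hessian_entries * bytes_per_entry / 1e12:.1f} TB")
print(f"Gradient storage: {gradient_entries * bytes_per_entry / 1e6:.1f} MB")
```

Under these assumptions the Hessian needs about 4 TB while the gradient needs about 4 MB, a factor of one million.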


Practice Problems

Problem 1: Partial Derivatives

📝Find All Partial Derivatives

Problem: Find $f_x$, $f_y$, and $f_z$ for $f(x,y,z) = x^2 y z + e^{xy} + \ln(yz)$.

💡Solution

  • $f_x = 2xyz + ye^{xy}$
  • $f_y = x^2 z + xe^{xy} + \frac{1}{y}$
  • $f_z = x^2 y + \frac{1}{z}$

Problem 2: Gradient and Directional Derivative

📝Directional Derivative

Problem: Let $f(x,y) = x^2 + 3xy$. Find the directional derivative at $(1, -1)$ in the direction of $\vec{v} = \langle 1, 1 \rangle$.

💡Solution

$\nabla f = \langle 2x + 3y, 3x \rangle$, so $\nabla f(1,-1) = \langle -1, 3 \rangle$.

Normalize: $\vec{u} = \langle 1/\sqrt{2}, 1/\sqrt{2} \rangle$.

$D_{\vec{u}}f = (-1)(1/\sqrt{2}) + (3)(1/\sqrt{2}) = \frac{2}{\sqrt{2}} = \sqrt{2}$.

Problem 3: Second-Order Partials and Clairaut's Theorem

📝Mixed Partials

Problem: Find all second-order partial derivatives of $f(x,y) = x^3 y^2 + xy e^{x+y}$ and verify Clairaut's theorem.

💡Solution

First partials:

  • $f_x = 3x^2 y^2 + ye^{x+y} + xye^{x+y} = 3x^2 y^2 + ye^{x+y}(1 + x)$
  • $f_y = 2x^3 y + xe^{x+y} + xye^{x+y} = 2x^3 y + xe^{x+y}(1 + y)$

Second partials:

  • $f_{xx} = 6xy^2 + ye^{x+y}(2 + x)$
  • $f_{yy} = 2x^3 + xe^{x+y}(2 + y)$
  • $f_{xy} = 6x^2 y + e^{x+y}(1 + x) + ye^{x+y}(1 + x) = 6x^2 y + e^{x+y}(1 + x)(1 + y)$
  • $f_{yx} = 6x^2 y + e^{x+y}(1 + y) + xe^{x+y}(1 + y) = 6x^2 y + e^{x+y}(1 + x)(1 + y)$

Confirmed: $f_{xy} = f_{yx}$.

Problem 4: Tangent Plane

📝Find the Tangent Plane

Problem: Find the tangent plane to $f(x,y) = x^2 - y^2$ at the point $(3, 4)$.

💡Solution

$f(3,4) = 9 - 16 = -7$, $f_x(3,4) = 2(3) = 6$, $f_y(3,4) = -2(4) = -8$

Tangent plane: $z = -7 + 6(x - 3) - 8(y - 4) = -7 + 6x - 18 - 8y + 32 = 6x - 8y + 7$

Verification at $(3,4)$: $z = 18 - 32 + 7 = -7 = f(3,4)$. Correct.

Problem 5: Multivariable Chain Rule

📝Chain Rule Application

Problem: Let $w = f(x, y, z)$ where $x = s + t$, $y = s - t$, $z = st$. Find $\frac{\partial w}{\partial s}$ and $\frac{\partial w}{\partial t}$.

💡Solution

$$\frac{\partial w}{\partial s} = \frac{\partial f}{\partial x}\frac{\partial x}{\partial s} + \frac{\partial f}{\partial y}\frac{\partial y}{\partial s} + \frac{\partial f}{\partial z}\frac{\partial z}{\partial s} = f_x \cdot 1 + f_y \cdot 1 + f_z \cdot t = f_x + f_y + t f_z$$

$$\frac{\partial w}{\partial t} = \frac{\partial f}{\partial x}\frac{\partial x}{\partial t} + \frac{\partial f}{\partial y}\frac{\partial y}{\partial t} + \frac{\partial f}{\partial z}\frac{\partial z}{\partial t} = f_x \cdot 1 + f_y \cdot (-1) + f_z \cdot s = f_x - f_y + s f_z$$

Quick Reference

📋Key Takeaways

  • Partial Derivative: $\frac{\partial f}{\partial x_i} = \lim_{h \to 0}\frac{f(\ldots, x_i+h, \ldots) - f(\ldots, x_i, \ldots)}{h}$. Differentiate with respect to one variable, holding all others constant.
  • Gradient: $\nabla f = \begin{bmatrix}\frac{\partial f}{\partial x_1} & \cdots & \frac{\partial f}{\partial x_n}\end{bmatrix}^T$. The vector of all partial derivatives; points toward steepest ascent.
  • Gradient Properties: $\nabla f \perp$ level sets; $\|\nabla f\|$ = maximum rate of change; at interior extrema of a differentiable $f$, $\nabla f = \mathbf{0}$.
  • Directional Derivative: $D_{\vec{u}}f = \nabla f \cdot \vec{u} = \|\nabla f\|\cos\theta$. Rate of change in the direction of the unit vector $\vec{u}$.
  • Tangent Plane: $z = f(x_0,y_0) + f_x(x_0,y_0)(x-x_0) + f_y(x_0,y_0)(y-y_0)$. Best linear approximation near $(x_0, y_0)$.
  • Total Derivative: $df = \frac{\partial f}{\partial x}dx + \frac{\partial f}{\partial y}dy$. Tracks all changes simultaneously.
  • Multivariable Chain Rule: $\frac{\partial z}{\partial x} = \sum_i \frac{\partial f}{\partial u_i}\frac{\partial u_i}{\partial x}$. Sum over all intermediate paths.
  • Clairaut's Theorem: If mixed partials are continuous, then $f_{xy} = f_{yx}$.
  • Hessian: $H_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}$. Matrix of second partials; determines curvature at critical points.
  • ML Connection: Gradient descent $\theta \leftarrow \theta - \alpha \nabla\mathcal{L}$ uses partial derivatives to minimize loss; backpropagation computes these partials via the chain rule.
