Partial Derivatives and Gradients
Why It Matters
Machine learning models depend on multivariable functions: a neural network loss is a function of thousands (or billions) of parameters simultaneously. Partial derivatives measure how the loss changes when we tweak one parameter while holding all others fixed. The gradient vector, which stacks all partial derivatives, points in the direction of steepest ascent. Gradient descent, the algorithm behind most model training, follows the negative gradient to reduce the loss. Backpropagation and gradient-based optimization are both built directly on partial derivatives and gradients.
What is a Partial Derivative
Definition: Partial Derivative
The partial derivative of a multivariable function $f(x_1, \dots, x_n)$ with respect to one variable $x_i$ is the limit of the difference quotient in $x_i$, holding all other variables constant. Geometrically, it measures the slope of the function along the $x_i$-axis while the other coordinates remain fixed.
Partial Derivative (Limit Definition)
$$\frac{\partial f}{\partial x_i} = \lim_{h \to 0} \frac{f(x_1, \dots, x_i + h, \dots, x_n) - f(x_1, \dots, x_i, \dots, x_n)}{h}$$
Here,
- $f$ = The multivariable function
- $x_i$ = The variable being differentiated with respect to
- $h$ = A small increment in $x_i$ that tends to zero
- $\frac{\partial f}{\partial x_i}$ = The partial derivative of $f$ with respect to $x_i$
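To make the limit definition concrete, the sketch below evaluates the difference quotient for shrinking increments. The function `f(x, y) = x^2 * y` and the helper name `partial_x` are illustrative assumptions, not taken from the text above.

```python
def partial_x(f, x, y, h):
    """Forward difference quotient in x, with y held fixed."""
    return (f(x + h, y) - f(x, y)) / h

# Hypothetical example function: f(x, y) = x^2 * y, so df/dx = 2xy
f = lambda x, y: x**2 * y

exact = 2 * 1.0 * 3.0  # analytic df/dx at the point (1, 3)
for h in (1e-1, 1e-2, 1e-3, 1e-4):
    approx = partial_x(f, 1.0, 3.0, h)
    print(f"h={h:g}  quotient={approx:.6f}  error={abs(approx - exact):.2e}")
```

For this particular function the quotient works out to $6 + 3h$, so the error shrinks in proportion to the increment.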
Intuition
Think of standing on a hilly terrain described by $z = f(x, y)$. The partial derivative $\frac{\partial f}{\partial x}$ tells you how steep the hill is if you walk purely in the east-west ($x$) direction. The partial derivative $\frac{\partial f}{\partial y}$ tells you the slope in the north-south ($y$) direction. Neither tells you the full picture; that is what the gradient is for.
Notation
The symbol $\partial$ (curly "d") is read "partial." It signals that $f$ depends on multiple variables and you are differentiating with respect to only one. Do not confuse $\frac{\partial f}{\partial x}$ with $\frac{df}{dx}$; the latter (total derivative) is used when $f$ depends on a single variable.
Computing Partial Derivatives
To compute $\frac{\partial f}{\partial x_i}$, treat all variables except $x_i$ as constants and apply the standard single-variable differentiation rules.
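As a quick illustration of this rule, the following SymPy sketch differentiates a made-up function (an assumption chosen for illustration) and checks that freezing $y$ at a constant before differentiating gives the same result as differentiating first and substituting afterwards.

```python
from sympy import symbols, diff, exp, simplify

x, y = symbols("x y")

# Hypothetical example function, not one of the worked examples in the text
f = x**3 * y**2 + exp(x * y)

df_dx = diff(f, x)  # y is treated as a constant
df_dy = diff(f, y)  # x is treated as a constant
print("df/dx =", df_dx)
print("df/dy =", df_dy)

# "Holding y constant" made literal: fix y = 2 first, then differentiate in x
frozen_then_diff = diff(f.subs(y, 2), x)
diff_then_frozen = df_dx.subs(y, 2)
print("Same result:", simplify(frozen_then_diff - diff_then_frozen) == 0)
```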
Example 1: Polynomial
Problem: Find all partial derivatives of .
Solution:
- (treat as a constant)
- (treat as a constant)
Example 2: Trigonometric
Problem: Find and for .
Solution:
- (chain rule on : derivative of is , times )
- (symmetric, with playing the role of )
Example 3: Quotient
Problem: Find for .
Solution: Using the quotient rule with and :
Example 4: Exponential
Problem: Find all partial derivatives of .
Solution:
Higher-Order Partial Derivatives
Just as we can take derivatives of derivatives in single-variable calculus, we can take partial derivatives of partial derivatives.
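Definition: Second-Order Partial Derivatives
The second partial derivatives are obtained by differentiating a first partial derivative with respect to one of the variables. For a function $f(x, y)$, there are four second-order partials:
Second-Order Partial Derivatives
$$f_{xx} = \frac{\partial^2 f}{\partial x^2}, \qquad f_{yy} = \frac{\partial^2 f}{\partial y^2}$$
Here,
- $f_{xx}$ = Second partial with respect to x (twice)
- $f_{yy}$ = Second partial with respect to y (twice)
Mixed Partial Derivatives
$$f_{xy} = \frac{\partial}{\partial y}\left(\frac{\partial f}{\partial x}\right) = \frac{\partial^2 f}{\partial y \, \partial x}, \qquad f_{yx} = \frac{\partial}{\partial x}\left(\frac{\partial f}{\partial y}\right) = \frac{\partial^2 f}{\partial x \, \partial y}$$
Here,
- $f_{xy}$ = First differentiate w.r.t. x, then w.r.t. y
- $f_{yx}$ = First differentiate w.r.t. y, then w.r.t. x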
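Theorem: Clairaut's Theorem (Equality of Mixed Partials)
If $f$ and its partial derivatives $f_x$, $f_y$, $f_{xy}$, and $f_{yx}$ are all continuous on a region containing the point $(a, b)$, then the mixed partial derivatives are equal:
$$f_{xy}(a, b) = f_{yx}(a, b)$$
In Leibniz notation:
$$\frac{\partial^2 f}{\partial y \, \partial x} = \frac{\partial^2 f}{\partial x \, \partial y}$$
This holds for most functions encountered in practice. Equality of the mixed partials fails only when the continuity hypotheses are violated, as in specially constructed pathological functions.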
Verifying Clairaut's Theorem
Problem: Verify that for .
Solution: First partials:
Mixed partials:
Confirmed: $f_{xy} = f_{yx}$.
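The same kind of check can be run symbolically. The sketch below uses a smooth function picked purely for illustration (it is not the function from the example above) and compares the two differentiation orders with SymPy.

```python
from sympy import symbols, diff, sin, simplify

x, y = symbols("x y")

# Hypothetical smooth function chosen for illustration
f = x**3 * y**2 + sin(x * y)

f_xy = diff(f, x, y)  # differentiate w.r.t. x first, then y
f_yx = diff(f, y, x)  # differentiate w.r.t. y first, then x

print("f_xy =", f_xy)
print("f_yx =", f_yx)
print("Difference simplifies to zero:", simplify(f_xy - f_yx) == 0)
```

Because this function is infinitely differentiable everywhere, the hypotheses of Clairaut's theorem hold and the two orders must agree.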
The Gradient
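Definition: Gradient
The gradient of a function $f: \mathbb{R}^n \to \mathbb{R}$ is the vector of all its partial derivatives. It is denoted $\nabla f$ (read "nabla f" or "grad f") and is the most important vector in multivariable optimization.
Gradient Vector
$$\nabla f = \begin{bmatrix} \dfrac{\partial f}{\partial x_1} \\ \vdots \\ \dfrac{\partial f}{\partial x_n} \end{bmatrix}$$
Here,
- $\nabla f$ = The gradient of f (an n-dimensional column vector)
- $\frac{\partial f}{\partial x_i}$ = Partial derivative of f with respect to x_i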
Key Properties of the Gradient:
- Points in the direction of steepest ascent of $f$
- The magnitude $\|\nabla f\|$ equals the rate of change of $f$ in that direction, which is the largest rate over all directions
- The gradient is perpendicular (orthogonal) to level sets (contour lines) of $f$
- At a local maximum or minimum of a differentiable $f$, $\nabla f = \mathbf{0}$ (the zero vector)
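The orthogonality property in the list above can be checked numerically. This is a minimal sketch that assumes the illustrative function $f(x, y) = x^2 + 3y^2$, whose level sets are ellipses with a known parametrization; the names `grad_f` and `max_abs_dot` are made up for the example.

```python
import numpy as np

def grad_f(p):
    """Analytic gradient of the hypothetical f(x, y) = x^2 + 3*y^2."""
    return np.array([2.0 * p[0], 6.0 * p[1]])

c = 4.0  # level set f(x, y) = c, an ellipse
angles = np.linspace(0.0, 2.0 * np.pi, 8, endpoint=False)

dots = []
for a in angles:
    # Point on the level set and the tangent direction there
    point = np.array([np.sqrt(c) * np.cos(a), np.sqrt(c / 3.0) * np.sin(a)])
    tangent = np.array([-np.sqrt(c) * np.sin(a), np.sqrt(c / 3.0) * np.cos(a)])
    dots.append(np.dot(grad_f(point), tangent))

max_abs_dot = np.max(np.abs(dots))
print("Largest |grad . tangent| on the level set:", max_abs_dot)
```

The dot products vanish up to floating-point rounding, which is the numerical face of "the gradient is perpendicular to level sets."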
Compute the Gradient
Problem: Find for .
Solution:
At the point :
This means the function increases most steeply in the direction at that point.
Directional Derivative
The gradient tells you the slope along each coordinate axis, but what if you want the slope in an arbitrary direction?
Definition: Directional Derivative
The directional derivative of $f$ at a point in the direction of a unit vector $\mathbf{u}$ measures the rate of change of $f$ as you move from that point in the direction $\mathbf{u}$.
Directional Derivative
$$D_{\mathbf{u}} f = \nabla f \cdot \mathbf{u} = \|\nabla f\| \cos\theta$$
Here,
- $D_{\mathbf{u}} f$ = The directional derivative in direction u
- $\nabla f$ = The gradient of f
- $\mathbf{u}$ = A unit vector specifying the direction
- $\theta$ = Angle between the gradient and the direction vector
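Important Facts:
- $D_{\mathbf{u}} f$ is maximized when $\theta = 0$ (i.e., $\mathbf{u}$ is parallel to $\nabla f$). The maximum value is $\|\nabla f\|$.
- $D_{\mathbf{u}} f$ is minimized when $\theta = \pi$ (i.e., $\mathbf{u}$ is antiparallel to $\nabla f$). The minimum value is $-\|\nabla f\|$.
- $D_{\mathbf{u}} f = 0$ when $\theta = \pi/2$, meaning you are moving along a level curve.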
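These facts can be seen by sweeping a unit vector around the full circle. The sketch below assumes the function $f(x, y) = x^2 - y^2$ at the point $(1, 2)$, where the gradient is $(2, -4)$; the grid resolution is an arbitrary choice.

```python
import numpy as np

# Gradient of f(x, y) = x^2 - y^2 at the point (1, 2)
grad = np.array([2.0, -4.0])

# Unit vectors u = (cos(phi), sin(phi)) sweeping the full circle
phis = np.linspace(0.0, 2.0 * np.pi, 3601)
units = np.stack([np.cos(phis), np.sin(phis)], axis=1)

# Directional derivative along every u at once: D_u f = grad . u
d_u = units @ grad

best = units[np.argmax(d_u)]
grad_norm = np.linalg.norm(grad)
print("max D_u f:", d_u.max(), " ||grad f||:", grad_norm)
print("min D_u f:", d_u.min())
print("maximizing direction:", best, " grad / ||grad||:", grad / grad_norm)
```

Up to the grid resolution, the largest value matches $\|\nabla f\|$, the smallest matches $-\|\nabla f\|$, and the maximizing direction lines up with the normalized gradient.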
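Directional Derivative Calculation
Problem: Find the directional derivative of $f(x, y) = x^2 - y^2$ at $(1, 2)$ in the direction $\mathbf{v} = (3, 4)$.
Solution: Step 1: Compute the gradient. $\nabla f = (2x, -2y)$, so $\nabla f(1, 2) = (2, -4)$.
Step 2: Normalize the direction vector. $\|\mathbf{v}\| = \sqrt{3^2 + 4^2} = 5$, so $\mathbf{u} = \left(\tfrac{3}{5}, \tfrac{4}{5}\right)$.
Step 3: Compute the dot product.
$$D_{\mathbf{u}} f = (2)\left(\tfrac{3}{5}\right) + (-4)\left(\tfrac{4}{5}\right) = \tfrac{6}{5} - \tfrac{16}{5} = -2$$
The function decreases at a rate of 2 per unit step in direction $\mathbf{u}$.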
Tangent Plane
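Definition: Tangent Plane
The tangent plane to the surface $z = f(x, y)$ at the point $(x_0, y_0, f(x_0, y_0))$ is the plane that best approximates $f$ near that point. It is the multivariable analog of the tangent line.
Tangent Plane Equation
$$z = f(x_0, y_0) + f_x(x_0, y_0)(x - x_0) + f_y(x_0, y_0)(y - y_0)$$
Here,
- $(x_0, y_0)$ = The point of tangency
- $f(x_0, y_0)$ = Function value at the point
- $f_x(x_0, y_0)$ = Partial derivative with respect to x at the point
- $f_y(x_0, y_0)$ = Partial derivative with respect to y at the point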
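Connection to Linear Approximation
The tangent plane is the first-order Taylor approximation of $f$ at $(x_0, y_0)$:
$$f(x, y) \approx f(x_0, y_0) + f_x(x_0, y_0)(x - x_0) + f_y(x_0, y_0)(y - y_0)$$
This approximation is accurate near $(x_0, y_0)$, and its error shrinks faster than the distance to $(x_0, y_0)$ as $(x, y) \to (x_0, y_0)$.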
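To see how quickly the linear approximation improves, the sketch below compares a function with its tangent plane at shrinking offsets. The function $f(x, y) = x^2 + y^2$ and the base point $(1, 2)$ are assumptions made for this illustration.

```python
def f(x, y):
    """Hypothetical surface z = x^2 + y^2."""
    return x**2 + y**2

def tangent_plane(x, y, x0=1.0, y0=2.0):
    """First-order Taylor approximation of f around (x0, y0)."""
    fx0, fy0 = 2.0 * x0, 2.0 * y0  # analytic partials of f at (x0, y0)
    return f(x0, y0) + fx0 * (x - x0) + fy0 * (y - y0)

errors = {}
for d in (1e-1, 1e-2, 1e-3):
    errors[d] = abs(f(1.0 + d, 2.0 + d) - tangent_plane(1.0 + d, 2.0 + d))
    print(f"offset={d:g}  error={errors[d]:.3e}  error/offset={errors[d] / d:.3e}")
```

For this function the error is exactly $2d^2$, so dividing it by the offset $d$ still leaves a quantity that goes to zero.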
Tangent Plane Example
Problem: Find the tangent plane to at .
Solution: , ,
Tangent plane:
Check: evaluating the plane at the point of tangency returns the function value 5, as it must.
Total Derivative
Definition: Total Derivative
The total derivative of a function accounts for how the function changes with respect to all variables simultaneously, including cases where the variables themselves depend on other parameters.
Total Derivative (Two Variables)
$$df = \frac{\partial f}{\partial x}\,dx + \frac{\partial f}{\partial y}\,dy$$
Here,
- $df$ = The total differential (infinitesimal change in f)
- $dx$ = Infinitesimal change in x
- $dy$ = Infinitesimal change in y
If $x$ and $y$ are both functions of a parameter $t$, the chain rule for total derivatives gives:
Total Derivative via Chain Rule
$$\frac{df}{dt} = \frac{\partial f}{\partial x}\frac{dx}{dt} + \frac{\partial f}{\partial y}\frac{dy}{dt}$$
Here,
- $\frac{df}{dt}$ = Rate of change of f with respect to t
- $\frac{\partial f}{\partial x}$ = Partial derivative of f with respect to x
- $\frac{dx}{dt}$ = Rate of change of x with respect to t
Partial vs Total
A partial derivative holds all other variables constant. The total derivative tracks how everything changes simultaneously. The distinction matters when variables are not independent.
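A small SymPy sketch can make the distinction tangible. It assumes an illustrative function $f = x^2 y$ with $x = \cos t$ and $y = \sin t$ (none of which come from the text) and compares the chain-rule total derivative with direct substitution.

```python
from sympy import symbols, diff, sin, cos, simplify

t, x, y = symbols("t x y")

# Hypothetical function and parametrization chosen for illustration
f = x**2 * y
x_t, y_t = cos(t), sin(t)
on_path = {x: x_t, y: y_t}

# Partial derivative: y is frozen, the dependence on t is ignored
partial_x = diff(f, x)

# Total derivative via the chain rule: df/dt = f_x * dx/dt + f_y * dy/dt
chain = (diff(f, x) * diff(x_t, t) + diff(f, y) * diff(y_t, t)).subs(on_path)

# Total derivative by substituting first and differentiating in t
direct = diff(f.subs(on_path), t)

print("partial df/dx =", partial_x)
print("total df/dt   =", simplify(chain))
print("Chain rule matches substitution:", simplify(chain - direct) == 0)
```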
Chain Rule for Partial Derivatives
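Theorem: Multivariable Chain Rule
If $z = f(x, y)$ where $x = x(s, t)$ and $y = y(s, t)$, then:
$$\frac{\partial z}{\partial s} = \frac{\partial z}{\partial x}\frac{\partial x}{\partial s} + \frac{\partial z}{\partial y}\frac{\partial y}{\partial s}, \qquad \frac{\partial z}{\partial t} = \frac{\partial z}{\partial x}\frac{\partial x}{\partial t} + \frac{\partial z}{\partial y}\frac{\partial y}{\partial t}$$
This generalizes the single-variable chain rule to multiple variables by summing over all intermediate paths.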
Multivariable Chain Rule
Problem: Let where and . Find .
Solution: , , ,
Substituting:
Chain Rule Gotcha
When applying the multivariable chain rule, you must sum over all intermediate variables. Missing a term is the most common error. If $f$ depends on $x$, $y$, $z$ and each depends on $t$, you need three terms in the sum, not two.
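The effect of a dropped term is easy to demonstrate. The sketch below assumes a made-up function $f = xyz$ with $x = t$, $y = t^2$, $z = t^3$, so that $f = t^6$ along the path and the correct derivative is $6t^5$.

```python
from sympy import symbols, diff, expand

t, x, y, z = symbols("t x y z")

# Hypothetical function and dependencies chosen for illustration
f = x * y * z
paths = {x: t, y: t**2, z: t**3}

# One chain-rule term per intermediate variable: (df/dv) * (dv/dt)
terms = [diff(f, v) * diff(expr, t) for v, expr in paths.items()]

full = expand(sum(terms).subs(paths))             # all three paths
missing_one = expand(sum(terms[:2]).subs(paths))  # the z-path is forgotten
direct = diff(f.subs(paths), t)                   # differentiate t^6 directly

print("All three terms:", full)
print("One term dropped:", missing_one)
print("Direct derivative:", direct)
```

Only the three-term sum agrees with the direct derivative; the two-term version is off by the contribution of the forgotten path.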
Python Implementation
Numerical Gradient (Finite Differences)
import numpy as np

def numerical_gradient(f, x, h=1e-7):
    """Compute the gradient of f at x using central differences."""
    x = np.asarray(x, dtype=float)  # avoid integer truncation when adding h
    grad = np.zeros_like(x)
    for i in range(len(x)):
        x_plus = x.copy()
        x_minus = x.copy()
        x_plus[i] += h
        x_minus[i] -= h
        grad[i] = (f(x_plus) - f(x_minus)) / (2 * h)
    return grad

# Example: f(x, y) = x^2 + y^2, gradient should be [2x, 2y]
f = lambda x: x[0]**2 + x[1]**2
x = np.array([1.0, 2.0])
print(f"Numerical gradient at (1,2): {numerical_gradient(f, x)}")
# Output: approximately [2. 4.] (up to finite-difference error)
Symbolic Differentiation with SymPy
from sympy import symbols, diff, sin
x, y = symbols('x y')
# Define a multivariable function
f = x**2 * y + sin(x * y)
# Compute partial derivatives
fx = diff(f, x)
fy = diff(f, y)
print(f"f(x,y) = {f}")
print(f"df/dx = {fx}")
print(f"df/dy = {fy}")
# Second-order partials
fxx = diff(f, x, 2)
fxy = diff(f, x, y)
print(f"d2f/dx2 = {fxx}")
print(f"d2f/dxdy = {fxy}")
# Gradient at a point
grad_at = {x: 1, y: 2}
print(f"Gradient at (1,2): df/dx={fx.subs(grad_at)}, df/dy={fy.subs(grad_at)}")
Vectorized Gradient with NumPy
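import numpy as np

def gradient_field(f, grid_x, grid_y, h=1e-7):
    """Compute the gradient on a 2D grid using central differences."""
    dz_dx = (f(grid_x + h, grid_y) - f(grid_x - h, grid_y)) / (2 * h)
    dz_dy = (f(grid_x, grid_y + h) - f(grid_x, grid_y - h)) / (2 * h)
    return dz_dx, dz_dy

# Example: f(x,y) = sin(x) * cos(y)
f = lambda x, y: np.sin(x) * np.cos(y)
x = np.linspace(-np.pi, np.pi, 5)
y = np.linspace(-np.pi, np.pi, 5)
X, Y = np.meshgrid(x, y)
gx, gy = gradient_field(f, X, Y)

# Compare the numerical field with the analytic gradient on the same grid
print("Max error in df/dx:", np.max(np.abs(gx - np.cos(X) * np.cos(Y))))
print("Max error in df/dy:", np.max(np.abs(gy + np.sin(X) * np.sin(Y))))

# Analytic gradient at (pi/4, pi/4), a point that is not on the grid
print("Analytic gradient at (pi/4, pi/4):")
print(f" df/dx = {np.cos(np.pi/4) * np.cos(np.pi/4):.4f}")
print(f" df/dy = {-np.sin(np.pi/4) * np.sin(np.pi/4):.4f}")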
Directional Derivative in Python
import numpy as np

def directional_derivative(f, point, direction, h=1e-7):
    """Compute directional derivative of f at point in given direction."""
    grad = np.zeros_like(point, dtype=float)
    for i in range(len(point)):
        p_plus = point.copy().astype(float)
        p_minus = point.copy().astype(float)
        p_plus[i] += h
        p_minus[i] -= h
        grad[i] = (f(p_plus) - f(p_minus)) / (2 * h)
    unit_dir = direction / np.linalg.norm(direction)  # normalize to a unit vector
    return np.dot(grad, unit_dir), grad

# f(x,y) = x^2 - y^2 at (1,2) in direction (3,4)
f = lambda p: p[0]**2 - p[1]**2
point = np.array([1.0, 2.0])
direction = np.array([3.0, 4.0])
dd, grad = directional_derivative(f, point, direction)
print(f"Gradient: {grad}")  # approximately [2, -4]
print(f"Directional derivative: {dd}")  # approximately -2.0
Applications in AI/ML
Gradient Descent
The core training loop for neural networks:
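Gradient Descent Update
$$\theta_{t+1} = \theta_t - \eta \, \nabla_\theta \mathcal{L}(\theta_t)$$
Here,
- $\theta_t$ = Model parameters at iteration t
- $\eta$ = Learning rate (step size)
- $\nabla_\theta \mathcal{L}(\theta_t)$ = Gradient of the loss with respect to parameters
Each component $\frac{\partial \mathcal{L}}{\partial \theta_i}$ tells the optimizer how to adjust parameter $\theta_i$ to reduce the loss. The negative sign means we move in the direction of steepest descent.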
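The update rule can be sketched in a few lines. The quadratic loss, the starting point, the learning rate, and the step count below are all illustrative assumptions; a real training loop would obtain the gradient from backpropagation instead of a hand-written formula.

```python
import numpy as np

def loss(theta):
    """Hypothetical quadratic loss with its minimum at (3, -1)."""
    return (theta[0] - 3.0) ** 2 + 2.0 * (theta[1] + 1.0) ** 2

def grad_loss(theta):
    """Analytic gradient: one partial derivative per parameter."""
    return np.array([2.0 * (theta[0] - 3.0), 4.0 * (theta[1] + 1.0)])

theta = np.array([0.0, 0.0])  # initial parameters
eta = 0.1                     # learning rate (step size)

for step in range(200):
    # Move against the gradient: theta <- theta - eta * grad
    theta = theta - eta * grad_loss(theta)

print("theta after 200 steps:", theta)
print("loss at theta:", loss(theta))
```

With this step size each coordinate's distance to the minimum shrinks by a constant factor per step (0.8 and 0.6 here), so the iterates settle at $(3, -1)$.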
Backpropagation
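Backpropagation is the chain rule applied to a deep neural network. For a network with layers $h_1, h_2, \dots, h_K$ and loss $\mathcal{L}$:
$$\frac{\partial \mathcal{L}}{\partial h_1} = \frac{\partial \mathcal{L}}{\partial h_K} \cdot \frac{\partial h_K}{\partial h_{K-1}} \cdots \frac{\partial h_2}{\partial h_1}$$
Each factor $\frac{\partial h_{k+1}}{\partial h_k}$ is a Jacobian matrix of partial derivatives. The gradient is computed by chaining these Jacobians from output to input, which is why the chain rule is the backbone of deep learning.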
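As a toy illustration of chaining Jacobians, the sketch below builds a two-step computation (a linear map followed by `tanh`, then a squared-error loss) and checks the chained result against central differences. The architecture, the random weights, and all names are assumptions for the example, not a description of any particular framework.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 2))    # hypothetical weights of a linear layer
x = rng.normal(size=2)         # input
target = rng.normal(size=3)    # regression target

def loss(inp):
    h1 = W @ inp               # layer 1: linear map
    h2 = np.tanh(h1)           # layer 2: elementwise nonlinearity
    return 0.5 * np.sum((h2 - target) ** 2)

# Forward pass, keeping the intermediate values
h1 = W @ x
h2 = np.tanh(h1)

# Local Jacobians of each step
dL_dh2 = h2 - target             # gradient of the loss w.r.t. h2, shape (3,)
dh2_dh1 = np.diag(1.0 - h2**2)   # Jacobian of tanh, shape (3, 3)
dh1_dx = W                       # Jacobian of the linear map, shape (3, 2)

# Chain the Jacobians from output back to input
grad_chain = dL_dh2 @ dh2_dh1 @ dh1_dx

# Independent check with central differences
h = 1e-6
grad_fd = np.array([(loss(x + h * e) - loss(x - h * e)) / (2 * h) for e in np.eye(2)])

print("chained Jacobians:  ", grad_chain)
print("finite differences: ", grad_fd)
```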
Hessian in Optimization
The matrix of second-order partials (the Hessian) determines the curvature of the loss surface:
Hessian Matrix
$$H_{ij} = \frac{\partial^2 f}{\partial x_i \, \partial x_j}$$
Here,
- $H_{ij}$ = The (i,j)-th entry of the Hessian
- $\frac{\partial^2 f}{\partial x_i \, \partial x_j}$ = Second mixed partial derivative
- If $H$ is positive definite at a critical point, that point is a local minimum
- Newton's method uses $H^{-1}$ to converge faster than gradient descent
- In practice, $H$ is too large to compute for deep networks, so first-order methods (SGD, Adam) are used instead
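The role of definiteness can be illustrated with SymPy's `hessian` helper. The two quadratic functions below are assumptions chosen so that the origin is a critical point and the Hessian is constant; the `classify` function is a hypothetical helper written for this sketch.

```python
from sympy import symbols, hessian, Matrix

x, y = symbols("x y")

def classify(f):
    """Classify a critical point from the eigenvalue signs of the Hessian.

    Assumes f is quadratic, so the Hessian is a constant matrix.
    """
    H = hessian(f, (x, y))
    eigs = list(H.eigenvals().keys())
    if all(e > 0 for e in eigs):
        return H, "local minimum"
    if all(e < 0 for e in eigs):
        return H, "local maximum"
    if any(e > 0 for e in eigs) and any(e < 0 for e in eigs):
        return H, "saddle point"
    return H, "inconclusive"

bowl = x**2 + x * y + y**2   # hypothetical: critical point at the origin
saddle = x**2 - y**2         # hypothetical: critical point at the origin

for name, f in (("bowl", bowl), ("saddle", saddle)):
    H, verdict = classify(f)
    print(name, H.tolist(), verdict)
```

A positive definite Hessian (all eigenvalues positive) signals a local minimum, while eigenvalues of mixed sign signal a saddle point.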
Common Mistakes
| Mistake | Why It Is Wrong | Correct Approach |
|---|---|---|
| Forgetting to hold variables constant | Partial derivatives treat other variables as constants, not as functions of $x$ | When computing $\frac{\partial f}{\partial x}$, freeze every other variable |
| Confusing partial and total derivatives | $\frac{\partial f}{\partial x}$ and $\frac{df}{dx}$ have different meanings | Use $\partial$ when the function has multiple independent variables |
| Mixing up $f_{xy}$ and $\frac{\partial^2 f}{\partial x \, \partial y}$ notation | $\frac{\partial^2 f}{\partial x \, \partial y}$ means differentiate w.r.t. $y$ first, then $x$, the opposite of subscript order in some conventions | Clarify your convention; under Clairaut's theorem they are equal |
| Neglecting the chain rule on composite functions | Forgetting to multiply by $\frac{dx}{dt}$ when $x$ depends on $t$ | Always trace the dependency tree: every path from output to input contributes a term |
| Not normalizing direction vectors | The directional derivative formula requires $\lVert \mathbf{u} \rVert = 1$ | Always divide by $\lVert \mathbf{v} \rVert$ before computing $D_{\mathbf{u}} f$ |
| Assuming every critical point is a minimum | A critical point ($\nabla f = \mathbf{0}$) can also be a maximum or a saddle point | Check the Hessian (or use the second derivative test) to distinguish minima from saddle points |
Interview Questions
Q1: What does the gradient vector point toward?
Answer
The gradient points in the direction of steepest ascent, the direction in which $f$ increases the most rapidly. Its magnitude is the rate of increase in that direction. The negative gradient points toward steepest descent, which is why gradient descent uses $-\nabla f$.
Q2: When are mixed partial derivatives not equal?
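Answer
Clairaut's theorem guarantees $f_{xy} = f_{yx}$ when $f_x$, $f_y$, $f_{xy}$, and $f_{yx}$ are all continuous. The mixed partials can fail to be equal when these partials are discontinuous. A classic counterexample is:
$$f(x, y) = \begin{cases} \dfrac{xy(x^2 - y^2)}{x^2 + y^2} & (x, y) \neq (0, 0) \\ 0 & (x, y) = (0, 0) \end{cases}$$
Here $f_{xy}(0, 0) = -1$ but $f_{yx}(0, 0) = 1$. The function is continuous, but its mixed partials are not continuous at the origin.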
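The disagreement at the origin can be observed numerically with nested central differences. This sketch assumes the standard textbook counterexample $f(x, y) = xy(x^2 - y^2)/(x^2 + y^2)$ with $f(0, 0) = 0$; the step sizes are arbitrary choices, with the inner step much smaller than the outer one.

```python
def f(x, y):
    """Assumed counterexample: xy(x^2 - y^2)/(x^2 + y^2), with f(0, 0) = 0."""
    if x == 0.0 and y == 0.0:
        return 0.0
    return x * y * (x**2 - y**2) / (x**2 + y**2)

def f_x(x, y, h=1e-6):
    """Central-difference estimate of the partial derivative in x."""
    return (f(x + h, y) - f(x - h, y)) / (2 * h)

def f_y(x, y, h=1e-6):
    """Central-difference estimate of the partial derivative in y."""
    return (f(x, y + h) - f(x, y - h)) / (2 * h)

k = 1e-3  # outer step, much larger than the inner step h

# f_xy: differentiate in x first, then differentiate the result in y
f_xy_origin = (f_x(0.0, k) - f_x(0.0, -k)) / (2 * k)
# f_yx: differentiate in y first, then differentiate the result in x
f_yx_origin = (f_y(k, 0.0) - f_y(-k, 0.0)) / (2 * k)

print("f_xy(0, 0) is approximately", f_xy_origin)
print("f_yx(0, 0) is approximately", f_yx_origin)
```

The two estimates land near $-1$ and $+1$ respectively, so the order of differentiation matters at this point.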
Q3: Why is the gradient orthogonal to level curves?
Answer
Along a level curve $f(x, y) = c$, the total derivative is zero: $df = \nabla f \cdot d\mathbf{r} = 0$. This means $\nabla f \cdot \mathbf{t} = 0$ for any direction $\mathbf{t}$ tangent to the level curve. Since the gradient is orthogonal to every tangent direction, it is perpendicular to the level curve itself. This is why a step along the gradient crosses contour lines at right angles, heading toward higher values (or toward lower values along the negative gradient).
Q4: How does gradient descent use partial derivatives in neural network training?
Answer
Each weight $w_i$ in the network is a variable in the loss function $\mathcal{L}(w_1, \dots, w_n)$. The partial derivative $\frac{\partial \mathcal{L}}{\partial w_i}$ measures how the loss changes when that specific weight changes. Backpropagation computes all these partial derivatives efficiently using the chain rule. The update rule $w_i \leftarrow w_i - \eta \frac{\partial \mathcal{L}}{\partial w_i}$ adjusts every weight simultaneously, so the weight vector as a whole moves in the direction that locally reduces the loss fastest.
Q5: What is the difference between the directional derivative and the gradient?
Answer
The directional derivative $D_{\mathbf{u}} f$ is a scalar that gives the rate of change of $f$ in a specific direction $\mathbf{u}$. The gradient $\nabla f$ is a vector that encodes the directional derivative for every possible direction simultaneously. The maximum directional derivative equals $\|\nabla f\|$ (in the direction of $\nabla f$), and the minimum is $-\|\nabla f\|$ (opposite to $\nabla f$).
Q6: Why can't we use the Hessian for large neural networks?
Answer
The Hessian is an $n \times n$ matrix where $n$ is the number of parameters. For a model with 1 million parameters, the Hessian has $10^{12}$ entries, far too large to store or invert. Computing it requires $O(n^2)$ second derivatives. This is why second-order methods (like Newton's method) are impractical for deep learning, and first-order methods (SGD, Adam) that only need the gradient ($n$ entries) are used instead. Approximations like L-BFGS or natural gradient methods try to capture curvature information more cheaply.
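The storage argument is simple arithmetic, sketched below. The 4 bytes per entry is an assumption (32-bit floats); other precisions scale the totals proportionally.

```python
n_params = 1_000_000
bytes_per_entry = 4  # assumption: 32-bit floating point values

gradient_entries = n_params      # one partial derivative per parameter
hessian_entries = n_params ** 2  # one second partial per parameter pair

gradient_bytes = gradient_entries * bytes_per_entry
hessian_bytes = hessian_entries * bytes_per_entry

print(f"Gradient: {gradient_entries:.0e} entries, {gradient_bytes / 1e6:.0f} MB")
print(f"Hessian:  {hessian_entries:.0e} entries, {hessian_bytes / 1e12:.0f} TB")
```

Under these assumptions the gradient fits in a few megabytes while the dense Hessian would need terabytes, before any cost of inverting it is counted.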
Practice Problems
Problem 1: Partial Derivatives
Find All Partial Derivatives
Problem: Find , , and for .
Solution
Problem 2: Gradient and Directional Derivative
Directional Derivative
Problem: Let . Find the directional derivative at in the direction of .
Solution
, so .
Normalize: .
.
Problem 3: Second-Order Partials and Clairaut's Theorem
Mixed Partials
Problem: Find all second-order partial derivatives of and verify Clairaut's theorem.
Solution
First partials:
Second partials:
Confirmed: $f_{xy} = f_{yx}$.
Problem 4: Tangent Plane
Find the Tangent Plane
Problem: Find the tangent plane to at the point .
Solution
Tangent plane:
Verification: substituting the point of tangency into the plane equation recovers the function value at that point.
Problem 5: Multivariable Chain Rule
Chain Rule Application
Problem: Let where , , . Find and .
Solution
Quick Reference
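Key Takeaways
- Partial Derivative: $\frac{\partial f}{\partial x_i}$: differentiate with respect to one variable, holding all others constant.
- Gradient: $\nabla f = \left[\frac{\partial f}{\partial x_1}, \dots, \frac{\partial f}{\partial x_n}\right]^T$: the vector of all partial derivatives; points toward steepest ascent.
- Gradient Properties: $\nabla f \perp$ level sets; $\|\nabla f\|$ = maximum rate of change; at extrema $\nabla f = \mathbf{0}$.
- Directional Derivative: $D_{\mathbf{u}} f = \nabla f \cdot \mathbf{u}$: rate of change in direction $\mathbf{u}$.
- Tangent Plane: $z = f(x_0, y_0) + f_x(x_0, y_0)(x - x_0) + f_y(x_0, y_0)(y - y_0)$: best linear approximation.
- Total Derivative: $df = \frac{\partial f}{\partial x}\,dx + \frac{\partial f}{\partial y}\,dy$: tracks all changes simultaneously.
- Multivariable Chain Rule: $\frac{\partial z}{\partial t} = \sum_i \frac{\partial z}{\partial x_i}\frac{\partial x_i}{\partial t}$: sum over all intermediate paths.
- Clairaut's Theorem: If mixed partials are continuous, then $f_{xy} = f_{yx}$.
- Hessian: $H_{ij} = \frac{\partial^2 f}{\partial x_i \, \partial x_j}$: matrix of second partials; determines curvature at critical points.
- ML Connection: Gradient descent uses partial derivatives to minimize loss; backpropagation computes these partials via the chain rule.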
Cross-References
- Derivatives: Prerequisite; single-variable differentiation rules → Derivatives and Differentiation
- Chain Rule: Single-variable chain rule and implicit differentiation → Chain Rule and Implicit Differentiation
- Multivariable Calculus: Extrema, Lagrange multipliers, Jacobians → Multivariable Calculus
- Optimization: Gradient descent, convergence, convexity → Optimization Fundamentals
- Gradient Descent: Practical GD variants and learning rate schedules → Gradient Descent
- Matrix Calculus: Derivatives of matrix-valued functions → Matrix Calculus
- Taylor Series: Polynomial approximations using higher-order derivatives → Taylor Series
- Newton's Method: Second-order optimization using the Hessian → Newton's Method