Partial Derivatives and Gradients

ℹ️ Why It Matters

Machine learning models depend on multivariable functions — a neural network loss is a function of thousands (or billions) of parameters simultaneously. Partial derivatives let us measure how the loss changes when we tweak one parameter while holding all others fixed. The gradient vector, which stacks all partial derivatives, points in the direction of steepest ascent. Gradient descent — the algorithm that powers virtually all model training — follows the negative gradient to minimize the loss. Without a deep understanding of partial derivatives and gradients, you cannot understand backpropagation, optimization, or any modern ML algorithm.

What is a Partial Derivative

Definition: Partial Derivative

The partial derivative of a multivariable function $f(x_1, x_2, \ldots, x_n)$ with respect to one variable $x_i$ is the limit of the difference quotient, holding all other variables constant. Geometrically, it measures the slope of the function along the $x_i$ -axis while the other coordinates remain fixed.

Partial Derivative (Limit Definition)

\frac{\partial f}{\partial x_i} = \lim_{h \to 0}\frac{f(x_1,\ldots,x_i+h,\ldots,x_n) - f(x_1,\ldots,x_i,\ldots,x_n)}{h}

Here,

$f$ =The multivariable function
$x_i$ =The variable being differentiated with respect to
$h$ =A small increment in x_i that shrinks to zero in the limit
$\frac{\partial f}{\partial x_i}$ =The partial derivative of f with respect to x_i
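The limit definition can be watched numerically. The sketch below is only illustrative: the function `f(x, y) = x^2 y + sin(y)`, the point `(1.5, 2.0)`, and the helper name `partial_x_quotient` are assumptions chosen for the demonstration, not part of the definition above. It evaluates the difference quotient in $x$ for shrinking $h$ while $y$ stays fixed.

```python
import math

def f(x, y):
    # Illustrative function (an assumption): f(x, y) = x^2 * y + sin(y)
    return x**2 * y + math.sin(y)

def partial_x_quotient(f, x, y, h):
    """Forward difference quotient in x: only x is perturbed, y is held fixed."""
    return (f(x + h, y) - f(x, y)) / h

x0, y0 = 1.5, 2.0
exact = 2 * x0 * y0  # analytic partial derivative: df/dx = 2xy

for h in (1e-1, 1e-2, 1e-3, 1e-4):
    approx = partial_x_quotient(f, x0, y0, h)
    print(f"h={h:.0e}  quotient={approx:.6f}  error={abs(approx - exact):.2e}")
```

For this particular function the quotient works out to $2xy + hy$, so the error shrinks in proportion to $h$.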

💡 Intuition

Think of standing on a hilly terrain described by $z = f(x, y)$ . The partial derivative $\frac{\partial f}{\partial x}$ tells you how steep the hill is if you walk purely in the east-west ( $x$ ) direction. The partial derivative $\frac{\partial f}{\partial y}$ tells you the slope in the north-south ( $y$ ) direction. Neither tells you the full picture — that is what the gradient is for.

⚠️ Notation

The symbol $\partial$ (curly "d") is read "partial." It signals that $f$ depends on multiple variables and you are differentiating with respect to only one. Do not confuse $\frac{\partial f}{\partial x}$ with $\frac{df}{dx}$ : the latter is the ordinary derivative of a single-variable function, or the total derivative when the other variables themselves depend on $x$ .

Computing Partial Derivatives

To compute $\frac{\partial f}{\partial x_i}$ , treat all variables except $x_i$ as constants and apply the standard single-variable differentiation rules.

📝Example 1: Polynomial

Problem: Find all partial derivatives of $f(x, y) = 3x^2y + y^3 - 2x$ .

Solution:

$\frac{\partial f}{\partial x} = 6xy - 2$ (treat $y$ as a constant)
$\frac{\partial f}{\partial y} = 3x^2 + 3y^2$ (treat $x$ as a constant)

📝Example 2: Trigonometric

Problem: Find $\frac{\partial f}{\partial x}$ and $\frac{\partial f}{\partial y}$ for $f(x,y) = \sin(xy) + e^{x+y}$ .

Solution:

$\frac{\partial f}{\partial x} = y\cos(xy) + e^{x+y}$ (chain rule on $\sin(xy)$ : derivative of $\sin(u)$ is $\cos(u)$ , times $\frac{\partial(xy)}{\partial x} = y$ )
$\frac{\partial f}{\partial y} = x\cos(xy) + e^{x+y}$ (symmetric, with $x$ playing the role of $y$ )

📝Example 3: Quotient

Problem: Find $\frac{\partial f}{\partial x}$ for $f(x,y) = \frac{x^2}{x^2 + y^2}$ .

Solution: Using the quotient rule with $u = x^2$ and $v = x^2 + y^2$ :

\frac{\partial f}{\partial x} = \frac{2x(x^2+y^2) - x^2 \cdot 2x}{(x^2+y^2)^2} = \frac{2x(x^2+y^2) - 2x^3}{(x^2+y^2)^2} = \frac{2xy^2}{(x^2+y^2)^2}
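A hand-derived quotient-rule result like this one is easy to double-check symbolically. The following is a minimal sketch using SymPy; the variable names `fx` and `expected` are illustrative.

```python
from sympy import symbols, diff, simplify

x, y = symbols("x y", real=True)
f = x**2 / (x**2 + y**2)

fx = simplify(diff(f, x))                        # SymPy's partial derivative in x
expected = 2 * x * y**2 / (x**2 + y**2) ** 2     # the result derived by hand above

print(fx)
print(simplify(fx - expected) == 0)  # True when both expressions agree
```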

📝Example 4: Exponential

Problem: Find all partial derivatives of $f(x,y,z) = x^2 y e^{-z}$ .

Solution:

$\frac{\partial f}{\partial x} = 2xy e^{-z}$
$\frac{\partial f}{\partial y} = x^2 e^{-z}$
$\frac{\partial f}{\partial z} = -x^2 y e^{-z}$

Higher-Order Partial Derivatives

Just as we can take derivatives of derivatives in single-variable calculus, we can take partial derivatives of partial derivatives.

Definition: Second-Order Partial Derivatives

The second partial derivatives are obtained by differentiating a first partial derivative with respect to one of the variables. For a function $f(x, y)$ , there are four second-order partials:

Second-Order Partial Derivatives

f_{xx} = \frac{\partial^2 f}{\partial x^2} = \frac{\partial}{\partial x}\left(\frac{\partial f}{\partial x}\right), \quad f_{yy} = \frac{\partial^2 f}{\partial y^2} = \frac{\partial}{\partial y}\left(\frac{\partial f}{\partial y}\right)

Here,

$f_{xx}$ =Second partial with respect to x (twice)
$f_{yy}$ =Second partial with respect to y (twice)

Mixed Partial Derivatives

f_{xy} = \frac{\partial^2 f}{\partial y\,\partial x} = \frac{\partial}{\partial y}\left(\frac{\partial f}{\partial x}\right), \quad f_{yx} = \frac{\partial^2 f}{\partial x\,\partial y} = \frac{\partial}{\partial x}\left(\frac{\partial f}{\partial y}\right)

Here,

$f_{xy}$ =First differentiate w.r.t. x, then w.r.t. y
$f_{yx}$ =First differentiate w.r.t. y, then w.r.t. x

Theorem: Clairaut's Theorem (Equality of Mixed Partials)

If $f$ and its partial derivatives $f_x$ , $f_y$ , $f_{xy}$ , and $f_{yx}$ are all continuous on a region containing the point $(a, b)$ , then the mixed partial derivatives are equal:

f_{xy}(a,b) = f_{yx}(a,b)

In Leibniz notation: $\frac{\partial^2 f}{\partial y\,\partial x} = \frac{\partial^2 f}{\partial x\,\partial y}$

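The continuity hypotheses hold for most functions encountered in practice, including polynomials, exponentials, and trigonometric functions on their domains. The mixed partials can differ only where those hypotheses fail, which takes a specially constructed function.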

📝Verifying Clairaut's Theorem

Problem: Verify that $f_{xy} = f_{yx}$ for $f(x,y) = x^3 y^2 + \sin(xy)$ .

Solution: First partials:

$f_x = 3x^2 y^2 + y\cos(xy)$
$f_y = 2x^3 y + x\cos(xy)$

Mixed partials:

$f_{xy} = \frac{\partial}{\partial y}(3x^2 y^2 + y\cos(xy)) = 6x^2 y + \cos(xy) - xy\sin(xy)$
$f_{yx} = \frac{\partial}{\partial x}(2x^3 y + x\cos(xy)) = 6x^2 y + \cos(xy) - xy\sin(xy)$

Confirmed: $f_{xy} = f_{yx}$ .

The Gradient

Definition: Gradient

The gradient of a function $f: \mathbb{R}^n \to \mathbb{R}$ is the vector of all its partial derivatives. It is denoted $\nabla f$ (read "nabla f" or "grad f") and is the most important vector in multivariable optimization.

Gradient Vector

\nabla f = \begin{bmatrix}\frac{\partial f}{\partial x_1}\\\frac{\partial f}{\partial x_2}\\\vdots\\\frac{\partial f}{\partial x_n}\end{bmatrix}

Here,

$\nabla f$ =The gradient of f (an n-dimensional column vector)
$\frac{\partial f}{\partial x_i}$ =Partial derivative of f with respect to x_i

Key Properties of the Gradient:

Points in the direction of steepest ascent of $f$
The magnitude $\|\nabla f\|$ equals the rate of steepest change in that direction
The gradient is perpendicular (orthogonal) to level sets (contour lines) of $f$
At a local maximum or minimum in the interior of the domain, where $f$ is differentiable, $\nabla f = \mathbf{0}$ (the zero vector)

📝Compute the Gradient

Problem: Find $\nabla f$ for $f(x,y,z) = x^2 y + y^2 z - z^3$ .

Solution:

\nabla f = \begin{bmatrix} 2xy \\ x^2 + 2yz \\ y^2 - 3z^2 \end{bmatrix}

At the point $(1, 2, 1)$ : $\nabla f(1,2,1) = \begin{bmatrix} 4 \\ 5 \\ 1 \end{bmatrix}$

This means the function increases most steeply in the direction $(4, 5, 1)$ at that point.
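The same gradient can be reproduced symbolically. This is a small sketch with SymPy; the names `grad` and `grad_at_point` are illustrative. It also reports the norm of the gradient, which is the rate of increase along the steepest direction.

```python
from sympy import symbols, diff, Matrix

x, y, z = symbols("x y z")
f = x**2 * y + y**2 * z - z**3

# Stack the three partial derivatives into a column vector
grad = Matrix([diff(f, v) for v in (x, y, z)])
grad_at_point = grad.subs({x: 1, y: 2, z: 1})

print(grad.T)                 # symbolic gradient, shown as a row for readability
print(grad_at_point.T)        # gradient at (1, 2, 1)
print(grad_at_point.norm())   # rate of steepest increase at that point
```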

Directional Derivative

The gradient tells you the slope along each coordinate axis, but what if you want the slope in an arbitrary direction?

Definition: Directional Derivative

The directional derivative of $f$ at a point in the direction of a unit vector $\vec{u}$ measures the rate of change of $f$ as you move from that point in the direction $\vec{u}$ .

Directional Derivative

D_{\vec{u}}f = \nabla f \cdot \vec{u} = \|\nabla f\| \|\vec{u}\| \cos\theta = \|\nabla f\| \cos\theta

Here,

$D_{\vec{u}}f$ =The directional derivative in direction u
$\nabla f$ =The gradient of f
$\vec{u}$ =A unit vector specifying the direction
$\theta$ =Angle between the gradient and the direction vector

Important Facts:

$D_{\vec{u}}f$ is maximized when $\theta = 0$ (i.e., $\vec{u}$ is parallel to $\nabla f$ ). The maximum value is $\|\nabla f\|$ .
$D_{\vec{u}}f$ is minimized when $\theta = \pi$ (i.e., $\vec{u}$ is antiparallel to $\nabla f$ ). The minimum value is $-\|\nabla f\|$ .
$D_{\vec{u}}f = 0$ when $\vec{u} \perp \nabla f$ , meaning you are moving along a level curve.
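These three facts can be checked by brute force. The sketch below assumes the gradient $\langle 2, -4 \rangle$ (that of $f(x,y) = x^2 - y^2$ at $(1, 2)$) and sweeps unit vectors around the circle on a grid of 0.1 degree steps, so the agreement is only as fine as the grid.

```python
import numpy as np

grad = np.array([2.0, -4.0])  # gradient of f(x, y) = x^2 - y^2 at (1, 2)

# Unit vectors u(theta) = (cos theta, sin theta) around the full circle
thetas = np.linspace(0.0, 2.0 * np.pi, 3601)
units = np.stack([np.cos(thetas), np.sin(thetas)], axis=1)
dir_derivs = units @ grad  # D_u f = grad . u for every sampled direction

best = units[np.argmax(dir_derivs)]
worst = units[np.argmin(dir_derivs)]

print("max D_u f:", dir_derivs.max(), " ||grad f||:", np.linalg.norm(grad))
print("best direction:", best, " grad / ||grad||:", grad / np.linalg.norm(grad))
print("worst direction:", worst)
```

The largest sampled value should sit close to $\|\nabla f\|$ with `best` nearly parallel to the gradient, and `worst` should point the opposite way.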

📝Directional Derivative Calculation

Problem: Find the directional derivative of $f(x,y) = x^2 - y^2$ at $(1, 2)$ in the direction $\vec{v} = \langle 3, 4 \rangle$ .

Solution: Step 1: Compute the gradient. $\nabla f = \langle 2x, -2y \rangle$ , so $\nabla f(1,2) = \langle 2, -4 \rangle$ .

Step 2: Normalize the direction vector. $\|\vec{v}\| = \sqrt{9 + 16} = 5$ , so $\vec{u} = \langle 3/5, 4/5 \rangle$ .

Step 3: Compute the dot product.

D_{\vec{u}}f = \nabla f \cdot \vec{u} = 2 \cdot \frac{3}{5} + (-4) \cdot \frac{4}{5} = \frac{6}{5} - \frac{16}{5} = -2

The function decreases at a rate of 2 per unit step in direction $(3, 4)$ .

Tangent Plane

Definition: Tangent Plane

The tangent plane to the surface $z = f(x, y)$ at the point $(x_0, y_0, f(x_0, y_0))$ is the plane that best approximates $f$ near that point. It is the multivariable analog of the tangent line.

Tangent Plane Equation

z = f(x_0, y_0) + f_x(x_0, y_0)(x - x_0) + f_y(x_0, y_0)(y - y_0)

Here,

$(x_0, y_0)$ =The point of tangency
$f(x_0, y_0)$ =Function value at the point
$f_x(x_0, y_0)$ =Partial derivative with respect to x at the point
$f_y(x_0, y_0)$ =Partial derivative with respect to y at the point

💡 Connection to Linear Approximation

The tangent plane is the first-order Taylor approximation of $f$ at $(x_0, y_0)$ :

f(x, y) \approx f(x_0, y_0) + f_x(x_0, y_0)(x - x_0) + f_y(x_0, y_0)(y - y_0)

This approximation is accurate near $(x_0, y_0)$ : for a differentiable $f$ , the error shrinks faster than the distance from $(x, y)$ to $(x_0, y_0)$ .

📝Tangent Plane Example

Problem: Find the tangent plane to $f(x,y) = x^2 + y^2$ at $(1, 2)$ .

Solution: $f(1,2) = 5$ , $f_x(1,2) = 2(1) = 2$ , $f_y(1,2) = 2(2) = 4$

Tangent plane: $z = 5 + 2(x - 1) + 4(y - 2) = 5 + 2x - 2 + 4y - 8 = 2x + 4y - 5$

Check: At $(1, 2)$ , $z = 2 + 8 - 5 = 5 = f(1, 2)$ , as required.
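To see how good the tangent plane is as an approximation, the sketch below compares it with $f(x,y) = x^2 + y^2$ at points that move toward $(1, 2)$. The helper name `tangent_plane` and the chosen offsets are assumptions made for illustration.

```python
def f(x, y):
    return x**2 + y**2

def tangent_plane(x, y, x0=1.0, y0=2.0):
    """First-order approximation of f at (x0, y0): f + f_x*(x - x0) + f_y*(y - y0)."""
    fx, fy = 2 * x0, 2 * y0  # analytic partials of x^2 + y^2
    return f(x0, y0) + fx * (x - x0) + fy * (y - y0)

for d in (0.1, 0.01, 0.001):
    x, y = 1.0 + d, 2.0 + d
    err = abs(f(x, y) - tangent_plane(x, y))
    print(f"offset={d:<6} f={f(x, y):.6f} plane={tangent_plane(x, y):.6f} error={err:.2e}")
```

For this function the error equals $(x-1)^2 + (y-2)^2$, so shrinking the offset by a factor of 10 shrinks the error by a factor of 100.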

Total Derivative

Definition: Total Derivative

The total derivative of a function accounts for how the function changes with respect to all variables simultaneously, including cases where the variables themselves depend on other parameters.

Total Derivative (Two Variables)

df = \frac{\partial f}{\partial x}\,dx + \frac{\partial f}{\partial y}\,dy

Here,

$df$ =The total differential (infinitesimal change in f)
$dx$ =Infinitesimal change in x
$dy$ =Infinitesimal change in y

If $x$ and $y$ are both functions of a parameter $t$ , the chain rule for total derivatives gives:

Total Derivative via Chain Rule

\frac{df}{dt} = \frac{\partial f}{\partial x}\frac{dx}{dt} + \frac{\partial f}{\partial y}\frac{dy}{dt}

Here,

$\frac{df}{dt}$ =Rate of change of f with respect to t
$\frac{\partial f}{\partial x}$ =Partial derivative of f with respect to x
$\frac{dx}{dt}$ =Rate of change of x with respect to t

💡 Partial vs Total

A partial derivative $\frac{\partial f}{\partial x}$ holds all other variables constant. The total derivative $\frac{df}{dt}$ tracks how everything changes simultaneously. The distinction matters when variables are not independent.
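A short symbolic check makes the chain-rule form of the total derivative concrete. In this sketch the function $f(x,y) = x^2 y$ and the path $x = \cos t$, $y = \sin t$ are assumptions picked for illustration; the code compares the chain-rule expression with direct differentiation after substitution.

```python
from sympy import symbols, diff, sin, cos, simplify

t, x, y = symbols("t x y")
f = x**2 * y                 # illustrative f(x, y), an assumption for this sketch
x_t, y_t = cos(t), sin(t)    # illustrative path x(t), y(t)

# Chain rule: df/dt = f_x * dx/dt + f_y * dy/dt, evaluated along the path
chain = (diff(f, x) * diff(x_t, t) + diff(f, y) * diff(y_t, t)).subs({x: x_t, y: y_t})

# Direct route: substitute the path first, then differentiate in t
direct = diff(f.subs({x: x_t, y: y_t}), t)

print(simplify(chain - direct))  # zero when the two routes agree
```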

Chain Rule for Partial Derivatives

Theorem: Multivariable Chain Rule

If $z = f(u, v)$ where $u = g(x, y)$ and $v = h(x, y)$ , then:

\frac{\partial z}{\partial x} = \frac{\partial f}{\partial u}\frac{\partial u}{\partial x} + \frac{\partial f}{\partial v}\frac{\partial v}{\partial x}

\frac{\partial z}{\partial y} = \frac{\partial f}{\partial u}\frac{\partial u}{\partial y} + \frac{\partial f}{\partial v}\frac{\partial v}{\partial y}

This generalizes the single-variable chain rule to multiple variables by summing over all intermediate paths.

📝Multivariable Chain Rule

Problem: Let $z = u^2 v$ where $u = x + y$ and $v = x - y$ . Find $\frac{\partial z}{\partial x}$ .

Solution: $\frac{\partial f}{\partial u} = 2uv$ , $\frac{\partial f}{\partial v} = u^2$ , $\frac{\partial u}{\partial x} = 1$ , $\frac{\partial v}{\partial x} = 1$

\frac{\partial z}{\partial x} = 2uv \cdot 1 + u^2 \cdot 1 = 2uv + u^2 = u(2v + u)

Substituting: $= (x+y)(2(x-y) + (x+y)) = (x+y)(3x - y)$
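The worked example can be verified symbolically. The sketch below uses SymPy to sum the contributions through $u$ and $v$, then compares the result with direct differentiation and with the factored answer; the names `via_chain` and `direct` are illustrative.

```python
from sympy import symbols, diff, expand

x, y, u, v = symbols("x y u v")
z = u**2 * v
u_xy, v_xy = x + y, x - y

# Sum over both intermediate paths: one through u, one through v
via_chain = (diff(z, u) * diff(u_xy, x) + diff(z, v) * diff(v_xy, x)).subs({u: u_xy, v: v_xy})

# Direct differentiation after substituting u and v
direct = diff(z.subs({u: u_xy, v: v_xy}), x)

print(expand(via_chain))
print(expand(via_chain - direct) == 0)                      # chain rule matches direct route
print(expand(via_chain - (x + y) * (3 * x - y)) == 0)       # matches the factored answer
```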

⚠️ Chain Rule Gotcha

When applying the multivariable chain rule, you must sum over all intermediate variables. Missing a term is the most common error. If $f$ depends on $u, v, w$ and each depends on $x$ , you need three terms in the sum, not two.

Python Implementation

Numerical Gradient (Finite Differences)

import numpy as np

def numerical_gradient(f, x, h=1e-7):
    """Compute the gradient of f at x using central differences."""
    x = np.asarray(x, dtype=float)  # work in floats so that x[i] += h is not truncated
    grad = np.zeros_like(x)
    for i in range(len(x)):
        x_plus = x.copy()
        x_minus = x.copy()
        x_plus[i] += h
        x_minus[i] -= h
        grad[i] = (f(x_plus) - f(x_minus)) / (2 * h)
    return grad

# Example: f(x, y) = x^2 + y^2, gradient should be [2x, 2y]
f = lambda x: x[0]**2 + x[1]**2
x = np.array([1.0, 2.0])
print(f"Numerical gradient at (1,2): {numerical_gradient(f, x)}")
# Output: approximately [2. 4.] (finite differences carry a small rounding error)

Symbolic Differentiation with SymPy

from sympy import symbols, diff, sin

x, y = symbols('x y')

# Define a multivariable function
f = x**2 * y + sin(x * y)

# Compute partial derivatives
fx = diff(f, x)
fy = diff(f, y)

print(f"f(x,y)  = {f}")
print(f"df/dx   = {fx}")
print(f"df/dy   = {fy}")

# Second-order partials
fxx = diff(f, x, 2)
fxy = diff(f, x, y)
print(f"d2f/dx2 = {fxx}")
print(f"d2f/dxdy = {fxy}")

# Gradient at a point
grad_at = {x: 1, y: 2}
print(f"Gradient at (1,2): df/dx={fx.subs(grad_at)}, df/dy={fy.subs(grad_at)}")

Vectorized Gradient with NumPy

import numpy as np

def gradient_field(f, grid_x, grid_y, h=1e-7):
    """Compute gradient on a 2D grid."""
    dz_dx = (f(grid_x + h, grid_y) - f(grid_x - h, grid_y)) / (2 * h)
    dz_dy = (f(grid_x, grid_y + h) - f(grid_x, grid_y - h)) / (2 * h)
    return dz_dx, dz_dy

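# Example: f(x,y) = sin(x) * cos(y)
f = lambda x, y: np.sin(x) * np.cos(y)
x = np.linspace(-np.pi, np.pi, 5)
y = np.linspace(-np.pi, np.pi, 5)
X, Y = np.meshgrid(x, y)

# Numerical gradient on the whole grid, compared with the analytic gradient
gx, gy = gradient_field(f, X, Y)
print("max error in df/dx:", np.max(np.abs(gx - np.cos(X) * np.cos(Y))))
print("max error in df/dy:", np.max(np.abs(gy + np.sin(X) * np.sin(Y))))

# The same function also accepts scalars; analytically df/dx = 0.5 and df/dy = -0.5 here
px, py = gradient_field(f, np.pi / 4, np.pi / 4)
print("Gradient at (pi/4, pi/4):")
print(f"  df/dx = {px:.4f}")
print(f"  df/dy = {py:.4f}")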

Directional Derivative in Python

import numpy as np

def directional_derivative(f, point, direction, h=1e-7):
    """Compute directional derivative of f at point in given direction."""
    grad = np.zeros_like(point, dtype=float)
    for i in range(len(point)):
        p_plus = point.copy().astype(float)
        p_minus = point.copy().astype(float)
        p_plus[i] += h
        p_minus[i] -= h
        grad[i] = (f(p_plus) - f(p_minus)) / (2 * h)
    
    unit_dir = direction / np.linalg.norm(direction)
    return np.dot(grad, unit_dir), grad

# f(x,y) = x^2 - y^2 at (1,2) in direction (3,4)
f = lambda p: p[0]**2 - p[1]**2
point = np.array([1.0, 2.0])
direction = np.array([3.0, 4.0])

dd, grad = directional_derivative(f, point, direction)
print(f"Gradient: {grad}")       # approximately [2, -4]
print(f"Directional derivative: {dd}")  # approximately -2.0

Applications in AI/ML

Gradient Descent

The core training loop for neural networks:

Gradient Descent Update

\theta_{t+1} = \theta_t - \alpha \nabla_\theta \mathcal{L}(\theta_t)

Here,

$\theta_t$ =Model parameters at iteration t
$\alpha$ =Learning rate (step size)
$\nabla_\theta \mathcal{L}$ =Gradient of the loss with respect to parameters

Each component $\frac{\partial \mathcal{L}}{\partial \theta_i}$ tells the optimizer how to adjust parameter $\theta_i$ to reduce the loss. The negative sign means we move in the direction of steepest descent.
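The update rule can be sketched in a few lines. The loss below, with its minimum at $(3, -1)$, the learning rate, and the step count are all assumptions chosen so the loop converges quickly; a real training loop would compute the gradient from data.

```python
import numpy as np

def loss(theta):
    # Illustrative convex loss (an assumption): minimum at theta = (3, -1)
    return (theta[0] - 3.0) ** 2 + 2.0 * (theta[1] + 1.0) ** 2

def grad_loss(theta):
    # Analytic gradient: one partial derivative per parameter
    return np.array([2.0 * (theta[0] - 3.0), 4.0 * (theta[1] + 1.0)])

theta = np.array([0.0, 0.0])
alpha = 0.1  # learning rate, chosen by hand for this example

for step in range(100):
    theta = theta - alpha * grad_loss(theta)  # theta_{t+1} = theta_t - alpha * gradient

print("theta:", theta, "loss:", loss(theta))
```

With this learning rate each coordinate's distance to the minimum shrinks by a constant factor per step (0.8 and 0.6 respectively), so `theta` ends up very close to $(3, -1)$.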

Backpropagation

Backpropagation is the chain rule applied to a deep neural network. For a network with layers $f_1, f_2, \ldots, f_L$ , write $a^{(l)}$ for the output (activation) of layer $l$ and $w^{(l)}$ for its weights. Then:

\frac{\partial \mathcal{L}}{\partial w^{(l)}} = \frac{\partial \mathcal{L}}{\partial a^{(L)}} \cdot \frac{\partial a^{(L)}}{\partial a^{(L-1)}} \cdots \frac{\partial a^{(l+1)}}{\partial a^{(l)}} \cdot \frac{\partial a^{(l)}}{\partial w^{(l)}}

Each factor is a Jacobian matrix of partial derivatives. The gradient is computed by chaining these Jacobians from output to input — this is why the chain rule is the backbone of deep learning.
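The chaining of local derivatives can be sketched on a deliberately tiny network. Everything here is hypothetical: a single input, one tanh hidden unit, one output weight, and a squared-error loss, so each "Jacobian" is just a scalar. The hand-written backward pass is compared against central differences.

```python
import math

def forward(w1, w2, x):
    z = w1 * x          # pre-activation of the hidden unit
    a = math.tanh(z)    # hidden activation
    y_hat = w2 * a      # network output
    return z, a, y_hat

def loss(w1, w2, x, y):
    return 0.5 * (forward(w1, w2, x)[2] - y) ** 2

def backward(w1, w2, x, y):
    """Chain the local derivatives from the loss back to each weight."""
    z, a, y_hat = forward(w1, w2, x)
    dL_dyhat = y_hat - y              # derivative of 0.5 * (y_hat - y)^2 in y_hat
    dL_dw2 = dL_dyhat * a             # because y_hat = w2 * a
    dL_da = dL_dyhat * w2
    dL_dz = dL_da * (1.0 - a**2)      # d tanh(z) / dz = 1 - tanh(z)^2
    dL_dw1 = dL_dz * x                # because z = w1 * x
    return dL_dw1, dL_dw2

w1, w2, x, y, h = 0.5, -1.2, 2.0, 0.3, 1e-6
g1, g2 = backward(w1, w2, x, y)
n1 = (loss(w1 + h, w2, x, y) - loss(w1 - h, w2, x, y)) / (2 * h)
n2 = (loss(w1, w2 + h, x, y) - loss(w1, w2 - h, x, y)) / (2 * h)
print(f"dL/dw1: backprop={g1:.6f} numeric={n1:.6f}")
print(f"dL/dw2: backprop={g2:.6f} numeric={n2:.6f}")
```

Comparing analytic gradients with finite differences in this way is a common sanity check when writing a backward pass by hand.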

Hessian in Optimization

The matrix of second-order partials (the Hessian) determines the curvature of the loss surface:

Hessian Matrix

H_{ij} = \frac{\partial^2 \mathcal{L}}{\partial \theta_i \partial \theta_j}

Here,

$H_{ij}$ =The (i,j)-th entry of the Hessian
$\frac{\partial^2 \mathcal{L}}{\partial \theta_i \partial \theta_j}$ =Second mixed partial derivative

If $H$ is positive definite at a critical point, that point is a local minimum
Newton's method rescales the gradient step by $H^{-1}$ and, near a minimum with positive definite $H$ , typically converges in far fewer iterations than gradient descent
In practice, $H$ is too large to compute or store for deep networks, so first-order methods (SGD, Adam) are used instead
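The role of the Hessian at a critical point can be sketched with two small functions that both have a critical point at the origin. The helper name `classify` and the two test functions are assumptions for illustration; the code builds the Hessian with SymPy and inspects the signs of its eigenvalues with NumPy.

```python
import numpy as np
from sympy import symbols, hessian

x, y = symbols("x y")

def classify(f):
    """Classify the critical point at the origin from the Hessian's eigenvalues."""
    H = np.array(hessian(f, (x, y)).subs({x: 0, y: 0}).tolist(), dtype=float)
    eig = np.linalg.eigvalsh(H)  # H is symmetric, so eigvalsh applies
    if np.all(eig > 0):
        return "local minimum"
    if np.all(eig < 0):
        return "local maximum"
    if np.any(eig > 0) and np.any(eig < 0):
        return "saddle point"
    return "inconclusive"

print(classify(x**2 + x * y + y**2))  # Hessian [[2, 1], [1, 2]]
print(classify(x**2 - y**2))          # Hessian [[2, 0], [0, -2]]
```

A zero eigenvalue leaves the second derivative test inconclusive, which is why the function has a fourth return value.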

Common Mistakes

Mistake	Why It Is Wrong	Correct Approach
Forgetting to hold variables constant	Partial derivatives treat other variables as constants, not as functions of $x$	When computing $\frac{\partial f}{\partial x}$ , freeze every other variable
Confusing partial and total derivatives	$\frac{df}{dx}$ vs $\frac{\partial f}{\partial x}$ have different meanings	Use $\partial$ when the function has multiple independent variables
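Mixing up $f_{xy}$ vs $f_{yx}$ notation	In subscript notation $f_{xy}$ means differentiate w.r.t. $x$ first, then $y$ (left to right), while the Leibniz form $\frac{\partial^2 f}{\partial y\,\partial x}$ lists the same order right to left; some texts use the opposite subscript convention	State your convention; when Clairaut's theorem applies, the two are equal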
Neglecting the chain rule on composite functions	Forgetting to multiply by $\frac{\partial u}{\partial x}$ when $f$ depends on $u(x, y)$	Always trace the dependency tree: every path from output to input contributes a term
Not normalizing direction vectors	The directional derivative formula requires $\|\vec{u}\| = 1$	Always divide by $\|\vec{v}\|$ before computing $\nabla f \cdot \vec{u}$
Assuming every critical point is a minimum	A critical point ( $\nabla f = 0$ ) can also be a maximum or a saddle point	Check the Hessian (the second derivative test) to distinguish minima, maxima, and saddle points

Interview Questions

Q1: What does the gradient vector point toward?

💡Answer

The gradient $\nabla f$ points in the direction of steepest ascent — the direction in which $f$ increases the most rapidly. Its magnitude $\|\nabla f\|$ is the rate of increase in that direction. The negative gradient $-\nabla f$ points toward steepest descent, which is why gradient descent uses $\theta \leftarrow \theta - \alpha \nabla f$ .

Q2: When are mixed partial derivatives not equal?

💡Answer

Clairaut's theorem guarantees $f_{xy} = f_{yx}$ when $f_x$ , $f_y$ , $f_{xy}$ , and $f_{yx}$ are all continuous. They can fail to be equal when these partials are discontinuous. A classic counterexample is:

f(x,y) = \begin{cases} \frac{xy(x^2 - y^2)}{x^2 + y^2} & (x,y) \neq (0,0) \\ 0 & (x,y) = (0,0) \end{cases}

Here $f_{xy}(0,0) = -1$ but $f_{yx}(0,0) = 1$ . The function is continuous but the mixed partials are not continuous at the origin.
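The asymmetry can be observed numerically with nested central differences. This is a rough sketch: the step sizes `h` and `k` are assumptions, with the inner step kept much smaller than the outer one so the inner derivative is resolved before the outer difference is taken.

```python
def f(x, y):
    if x == 0.0 and y == 0.0:
        return 0.0
    return x * y * (x**2 - y**2) / (x**2 + y**2)

def f_x(x, y, h=1e-6):
    # Central difference in x, holding y fixed
    return (f(x + h, y) - f(x - h, y)) / (2 * h)

def f_y(x, y, h=1e-6):
    # Central difference in y, holding x fixed
    return (f(x, y + h) - f(x, y - h)) / (2 * h)

k = 1e-2  # outer step, kept much larger than the inner step h
f_xy = (f_x(0.0, k) - f_x(0.0, -k)) / (2 * k)   # d/dy of f_x at the origin
f_yx = (f_y(k, 0.0) - f_y(-k, 0.0)) / (2 * k)   # d/dx of f_y at the origin
print(f"f_xy(0,0) ~ {f_xy:.4f}, f_yx(0,0) ~ {f_yx:.4f}")
```

The two estimates land near $-1$ and $1$ respectively, matching the values stated above.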

Q3: Why is the gradient orthogonal to level curves?

💡Answer

Along a level curve $f(x,y) = c$ , the total differential is zero: $\frac{\partial f}{\partial x}dx + \frac{\partial f}{\partial y}dy = 0$ . This means $\nabla f \cdot \langle dx, dy \rangle = 0$ for any direction tangent to the level curve. Since the gradient is orthogonal to every tangent direction, it is perpendicular to the level curve itself. This is why moving along the gradient crosses contour lines at right angles, heading toward higher values of $f$ .

Q4: How does gradient descent use partial derivatives in neural network training?

💡Answer

Each weight $w_{ij}^{(l)}$ in the network is a variable in the loss function $\mathcal{L}$ . The partial derivative $\frac{\partial \mathcal{L}}{\partial w_{ij}^{(l)}}$ measures how the loss changes when that specific weight changes. Backpropagation computes all these partial derivatives efficiently using the chain rule. The update rule $w_{ij}^{(l)} \leftarrow w_{ij}^{(l)} - \alpha \frac{\partial \mathcal{L}}{\partial w_{ij}^{(l)}}$ adjusts every weight simultaneously, with each weight moving in the direction that most reduces the loss.

Q5: What is the difference between the directional derivative and the gradient?

💡Answer

The directional derivative $D_{\vec{u}}f = \nabla f \cdot \vec{u}$ is a scalar that gives the rate of change of $f$ in a specific direction $\vec{u}$ . The gradient $\nabla f$ is a vector that encodes the directional derivative for every possible direction simultaneously. The maximum directional derivative equals $\|\nabla f\|$ (in the direction of $\nabla f$ ), and the minimum is $-\|\nabla f\|$ (opposite to $\nabla f$ ).

Q6: Why can't we use the Hessian for large neural networks?

💡Answer

The Hessian is an $n \times n$ matrix where $n$ is the number of parameters. For a model with 1 million parameters, the Hessian has $10^{12}$ entries — far too large to store or invert. Computing it requires $O(n^2)$ second derivatives. This is why second-order methods (like Newton's method) are impractical for deep learning, and first-order methods (SGD, Adam) that only need the gradient ( $O(n)$ entries) are used instead. Approximations like L-BFGS or natural gradient methods try to capture curvature information more cheaply.

Practice Problems

Problem 1: Partial Derivatives

📝Find All Partial Derivatives

Problem: Find $f_x$ , $f_y$ , and $f_z$ for $f(x,y,z) = x^2 y z + e^{xy} + \ln(yz)$ .

💡Solution

f_x = 2xyz + ye^{xy}

f_y = x^2 z + xe^{xy} + \frac{1}{y}

f_z = x^2 y + \frac{1}{z}

Problem 2: Gradient and Directional Derivative

📝Directional Derivative

Problem: Let $f(x,y) = x^2 + 3xy$ . Find the directional derivative at $(1, -1)$ in the direction of $\vec{v} = \langle 1, 1 \rangle$ .

💡Solution

$\nabla f = \langle 2x + 3y, 3x \rangle$ , so $\nabla f(1,-1) = \langle -1, 3 \rangle$ .

Normalize: $\vec{u} = \langle 1/\sqrt{2}, 1/\sqrt{2} \rangle$ .

$D_{\vec{u}}f = (-1)(1/\sqrt{2}) + (3)(1/\sqrt{2}) = \frac{2}{\sqrt{2}} = \sqrt{2}$ .

Problem 3: Second-Order Partials and Clairaut's Theorem

📝Mixed Partials

Problem: Find all second-order partial derivatives of $f(x,y) = x^3 y^2 + xy e^{x+y}$ and verify Clairaut's theorem.

💡Solution

First partials:

$f_x = 3x^2 y^2 + ye^{x+y} + xye^{x+y} = 3x^2 y^2 + ye^{x+y}(1 + x)$
$f_y = 2x^3 y + xe^{x+y} + xye^{x+y} = 2x^3 y + xe^{x+y}(1 + y)$

Second partials:

$f_{xx} = 6xy^2 + ye^{x+y}(2 + x)$
$f_{yy} = 2x^3 + xe^{x+y}(2 + y)$
$f_{xy} = 6x^2 y + e^{x+y}(1 + x) + ye^{x+y}(1 + x) = 6x^2 y + e^{x+y}(1 + x)(1 + y)$
$f_{yx} = 6x^2 y + e^{x+y}(1 + y) + xe^{x+y}(1 + y) = 6x^2 y + e^{x+y}(1 + x)(1 + y)$

Confirmed: $f_{xy} = f_{yx}$ .

Problem 4: Tangent Plane

📝Find the Tangent Plane

Problem: Find the tangent plane to $f(x,y) = x^2 - y^2$ at the point $(3, 4)$ .

💡Solution

f(3,4) = 9 - 16 = -7

f_x(3,4) = 2(3) = 6

f_y(3,4) = -2(4) = -8

Tangent plane: $z = -7 + 6(x - 3) - 8(y - 4) = -7 + 6x - 18 - 8y + 32 = 6x - 8y + 7$

Verification at $(3,4)$ : $z = 18 - 32 + 7 = -7 = f(3,4)$ , as required.

Problem 5: Multivariable Chain Rule

📝Chain Rule Application

Problem: Let $w = f(x, y, z)$ where $x = s + t$ , $y = s - t$ , $z = st$ . Find $\frac{\partial w}{\partial s}$ and $\frac{\partial w}{\partial t}$ .

💡Solution

\frac{\partial w}{\partial s} = \frac{\partial f}{\partial x}\frac{\partial x}{\partial s} + \frac{\partial f}{\partial y}\frac{\partial y}{\partial s} + \frac{\partial f}{\partial z}\frac{\partial z}{\partial s} = f_x \cdot 1 + f_y \cdot 1 + f_z \cdot t = f_x + f_y + t f_z

\frac{\partial w}{\partial t} = \frac{\partial f}{\partial x}\frac{\partial x}{\partial t} + \frac{\partial f}{\partial y}\frac{\partial y}{\partial t} + \frac{\partial f}{\partial z}\frac{\partial z}{\partial t} = f_x \cdot 1 + f_y \cdot (-1) + f_z \cdot s = f_x - f_y + s f_z

Quick Reference

📋Key Takeaways

Partial Derivative: $\frac{\partial f}{\partial x_i} = \lim_{h \to 0}\frac{f(\ldots, x_i+h, \ldots) - f(\ldots, x_i, \ldots)}{h}$ (differentiate with respect to one variable, holding all others constant).
Gradient: $\nabla f = \begin{bmatrix}\frac{\partial f}{\partial x_1} & \cdots & \frac{\partial f}{\partial x_n}\end{bmatrix}^T$ — the vector of all partial derivatives; points toward steepest ascent.
Gradient Properties: $\nabla f \perp$ level sets; $\|\nabla f\|$ = maximum rate of change; at extrema $\nabla f = \mathbf{0}$ .
Directional Derivative: $D_{\vec{u}}f = \nabla f \cdot \vec{u} = \|\nabla f\|\cos\theta$ — rate of change in direction $\vec{u}$ .
Tangent Plane: $z = f(x_0,y_0) + f_x(x_0,y_0)(x-x_0) + f_y(x_0,y_0)(y-y_0)$ — best linear approximation.
Total Derivative: $df = \frac{\partial f}{\partial x}dx + \frac{\partial f}{\partial y}dy$ — tracks all changes simultaneously.
Multivariable Chain Rule: $\frac{\partial z}{\partial x} = \sum_i \frac{\partial f}{\partial u_i}\frac{\partial u_i}{\partial x}$ — sum over all intermediate paths.
Clairaut's Theorem: If mixed partials are continuous, then $f_{xy} = f_{yx}$ .
Hessian: $H_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}$ — matrix of second partials; determines curvature at critical points.
ML Connection: Gradient descent $\theta \leftarrow \theta - \alpha \nabla\mathcal{L}$ uses partial derivatives to minimize loss; backpropagation computes these partials via the chain rule.

Cross-References

Derivatives: Prerequisite — single-variable differentiation rules → Derivatives and Differentiation
Chain Rule: Single-variable chain rule and implicit differentiation → Chain Rule and Implicit Differentiation
Multivariable Calculus: Extrema, Lagrange multipliers, Jacobians → Multivariable Calculus
Optimization: Gradient descent, convergence, convexity → Optimization Fundamentals
Gradient Descent: Practical GD variants and learning rate schedules → Gradient Descent
Matrix Calculus: Derivatives of matrix-valued functions → Matrix Calculus
Taylor Series: Polynomial approximations using higher-order derivatives → Taylor Series
Newton's Method: Second-order optimization using the Hessian → Newton's Method