Chain Rule and Implicit Differentiation

ℹ️ Why It Matters

The chain rule is arguably the most important differentiation rule in all of machine learning. Every single training step of a neural network — from a simple logistic regression to a billion-parameter large language model — relies on the chain rule to compute gradients. Backpropagation, the algorithm that makes deep learning possible, is nothing more than an efficient application of the chain rule through compositions of functions. When you update a weight in layer 100 based on the loss at the output, you are chaining together 100 local derivatives. Without the chain rule, we cannot compute how a change in any parameter affects the final output, which means we cannot train models. This single rule connects the abstract calculus of composite functions to the practical engineering of gradient-based optimization. Mastering the chain rule means understanding the engine that drives all of modern AI.

What is the Chain Rule

DfChain Rule (Single Variable)

If $y = f(u)$ and $u = g(x)$ , so that $y = f(g(x))$ is a composite function, then the derivative of $y$ with respect to $x$ is the product of the derivative of the outer function (evaluated at the inner function) and the derivative of the inner function:

Chain Rule (Single Variable)

\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx} = f'(g(x)) \cdot g'(x)

Here,

$f(g(x))$ =The composite function — outer function f applied to inner function g
$f'(g(x))$ =Derivative of the outer function, evaluated at g(x)
$g'(x)$ =Derivative of the inner function

💡 Intuition

Think of the chain rule as a "derivative amplifier." If you have a chain of transformations, the total sensitivity of the output to the input is the product of all the local sensitivities along the chain. Each factor tells you how much the intermediate value changes given a small change in the previous value. Multiplying them together gives you the total effect.
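As a rough numerical illustration of this "product of local sensitivities" view, the sketch below chains three functions and compares the product of their local derivatives with a finite-difference estimate. The three stages and the test point are arbitrary choices made for illustration; they do not come from the lesson.

```python
import math

# Hypothetical three-stage chain, chosen only for illustration: x -> u -> v -> y
def stage1(x):
    return 3.0 * x + 1.0   # u = 3x + 1

def stage2(u):
    return u ** 2          # v = u^2

def stage3(v):
    return math.sin(v)     # y = sin(v)

def total_sensitivity(x):
    """Multiply the local derivatives along the chain."""
    u = stage1(x)
    v = stage2(u)
    du_dx = 3.0            # local sensitivity of u to x
    dv_du = 2.0 * u        # local sensitivity of v to u
    dy_dv = math.cos(v)    # local sensitivity of y to v
    return dy_dv * dv_du * du_dx

def finite_difference(x, h=1e-6):
    """Central-difference estimate of dy/dx for the whole chain."""
    y = lambda t: stage3(stage2(stage1(t)))
    return (y(x + h) - y(x - h)) / (2 * h)

x0 = 0.2
chain_value = total_sensitivity(x0)
fd_value = finite_difference(x0)
print(f"product of local derivatives: {chain_value:.6f}")
print(f"finite-difference estimate:   {fd_value:.6f}")
```

The two printed values should agree to several decimal places, which is the "total effect equals product of local effects" statement in numerical form.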

Single Variable Chain Rule: Detailed Examples

ThOutside-In Differentiation

For a composition $y = f(g(x))$ , always differentiate from the outside inward, multiplying by the derivative of each inner function at each step.

📝Example 1: Power of a Trigonometric Function

Problem: Find $\frac{d}{dx}[\sin^3(x)]$

Solution:

Outer function: $u^3$ , inner function: $\sin(x)$
$\frac{d}{du}u^3 = 3u^2$ , $\frac{d}{dx}\sin(x) = \cos(x)$
Result: $3\sin^2(x) \cdot \cos(x)$

📝Example 2: Exponential of a Logarithm

Problem: Find $\frac{d}{dx}[e^{\ln(x^2+1)}]$

Solution:

Simplify first: $e^{\ln(x^2+1)} = x^2 + 1$ , so derivative is $2x$ .
Or apply chain rule directly: outer $e^u$ , inner $\ln(x^2+1)$ .
$\frac{d}{du}e^u = e^u$ , $\frac{d}{dx}\ln(x^2+1) = \frac{2x}{x^2+1}$
Result: $e^{\ln(x^2+1)} \cdot \frac{2x}{x^2+1} = (x^2+1) \cdot \frac{2x}{x^2+1} = 2x$ .

📝Example 3: Nested Square Root

Problem: Find $\frac{d}{dx}\sqrt{1 + \sqrt{x}}$

Solution:

Outer: $\sqrt{u}$ , inner: $1 + \sqrt{x}$
$\frac{d}{du}\sqrt{u} = \frac{1}{2\sqrt{u}}$ , $\frac{d}{dx}(1 + \sqrt{x}) = \frac{1}{2\sqrt{x}}$
Result: $\frac{1}{2\sqrt{1 + \sqrt{x}}} \cdot \frac{1}{2\sqrt{x}} = \frac{1}{4\sqrt{x}\sqrt{1 + \sqrt{x}}}$

📝Example 4: Trigonometric Composition

Problem: Find $\frac{d}{dx}[\tan(\cos(e^x))]$

Solution:

Three layers: outer $\tan(u)$ , middle $\cos(v)$ , inner $e^x$
$\frac{d}{du}\tan(u) = \sec^2(u)$ , $\frac{d}{dv}\cos(v) = -\sin(v)$ , $\frac{d}{dx}e^x = e^x$
Result: $\sec^2(\cos(e^x)) \cdot (-\sin(e^x)) \cdot e^x$
Simplified: $-e^x \sin(e^x) \sec^2(\cos(e^x))$
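The four results above can be sanity-checked numerically. The following sketch compares each hand-derived derivative with a central-difference estimate at a single test point. The helper name `central_diff` and the test point `x0 = 0.7` are assumptions for illustration.

```python
import math

def central_diff(f, x, h=1e-6):
    """Central-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

# (function, hand-derived derivative) pairs for Examples 1-4
examples = {
    "sin^3(x)": (
        lambda x: math.sin(x) ** 3,
        lambda x: 3 * math.sin(x) ** 2 * math.cos(x),
    ),
    "exp(ln(x^2+1))": (
        lambda x: math.exp(math.log(x ** 2 + 1)),
        lambda x: 2 * x,
    ),
    "sqrt(1+sqrt(x))": (
        lambda x: math.sqrt(1 + math.sqrt(x)),
        lambda x: 1 / (4 * math.sqrt(x) * math.sqrt(1 + math.sqrt(x))),
    ),
    "tan(cos(e^x))": (
        lambda x: math.tan(math.cos(math.exp(x))),
        # sec^2(u) is written as 1 / cos^2(u)
        lambda x: -math.exp(x) * math.sin(math.exp(x))
        / math.cos(math.cos(math.exp(x))) ** 2,
    ),
}

x0 = 0.7  # arbitrary test point inside every function's domain
errors = {}
for name, (f, df) in examples.items():
    errors[name] = abs(central_diff(f, x0) - df(x0))
    print(f"{name:18s} analytic={df(x0): .6f}  |error|={errors[name]:.2e}")
```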

Multivariable Chain Rule

DfChain Rule (Multivariable)

If $z = f(x, y)$ where $x = x(t)$ and $y = y(t)$ are both functions of a single variable $t$ , then $z$ is a function of $t$ through the intermediate variables $x$ and $y$ .

Multivariable Chain Rule (Single Parameter)

\frac{dz}{dt} = \frac{\partial f}{\partial x}\frac{dx}{dt} + \frac{\partial f}{\partial y}\frac{dy}{dt}

Here,

$z = f(x, y)$ =The dependent variable as a function of x and y
$x(t), y(t)$ =Intermediate variables parameterized by t
$\frac{\partial f}{\partial x}, \frac{\partial f}{\partial y}$ =Partial derivatives of f with respect to x and y

⚠️ Sum Over All Paths

When a variable depends on multiple intermediate variables, the chain rule sums contributions through every path from the dependent variable to the independent variable. Each path contributes its own product of partial derivatives.

General Multivariable Chain Rule

\frac{\partial z}{\partial u} = \sum_{i=1}^{n} \frac{\partial z}{\partial x_i} \cdot \frac{\partial x_i}{\partial u}

Here,

$z = f(x_1, x_2, \ldots, x_n)$ =Function of n intermediate variables
$x_i = x_i(u, v, \ldots)$ =Each intermediate variable may depend on multiple independent variables

📝Multivariable Chain Rule Example

Problem: Let $z = x^2y$ where $x = \cos(t)$ and $y = \sin(t)$ . Find $\frac{dz}{dt}$ .

Solution:

$\frac{\partial z}{\partial x} = 2xy$ , $\frac{\partial z}{\partial y} = x^2$
$\frac{dx}{dt} = -\sin(t)$ , $\frac{dy}{dt} = \cos(t)$
$\frac{dz}{dt} = 2xy(-\sin(t)) + x^2\cos(t)$
Substitute: $= 2\cos(t)\sin(t)(-\sin(t)) + \cos^2(t)\cos(t)$
$= -2\cos(t)\sin^2(t) + \cos^3(t)$
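A quick way to check this result is to evaluate the two-path sum, the closed form, and a finite-difference estimate at the same point. The sketch below does so; the function names and the test point `t0` are illustrative assumptions.

```python
import math

def z_of_t(t):
    """z = x^2 * y with x = cos(t), y = sin(t)."""
    x, y = math.cos(t), math.sin(t)
    return x ** 2 * y

def dz_dt_chain(t):
    """Sum over both paths: (dz/dx)(dx/dt) + (dz/dy)(dy/dt)."""
    x, y = math.cos(t), math.sin(t)
    dz_dx, dz_dy = 2 * x * y, x ** 2
    dx_dt, dy_dt = -math.sin(t), math.cos(t)
    return dz_dx * dx_dt + dz_dy * dy_dt

def dz_dt_closed_form(t):
    """Final expression from the worked example."""
    return -2 * math.cos(t) * math.sin(t) ** 2 + math.cos(t) ** 3

t0 = 0.9  # arbitrary test point
h = 1e-6
chain_val = dz_dt_chain(t0)
closed_val = dz_dt_closed_form(t0)
fd_val = (z_of_t(t0 + h) - z_of_t(t0 - h)) / (2 * h)
print(chain_val, closed_val, fd_val)
```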

📝Two-Parameter Multivariable Chain Rule

Problem: Let $z = f(x, y)$ where $x = u + v$ and $y = uv$ . Find $\frac{\partial z}{\partial u}$ and $\frac{\partial z}{\partial v}$ .

Solution:

$\frac{\partial z}{\partial u} = \frac{\partial f}{\partial x}\frac{\partial x}{\partial u} + \frac{\partial f}{\partial y}\frac{\partial y}{\partial u} = \frac{\partial f}{\partial x}(1) + \frac{\partial f}{\partial y}(v)$
$\frac{\partial z}{\partial v} = \frac{\partial f}{\partial x}\frac{\partial x}{\partial v} + \frac{\partial f}{\partial y}\frac{\partial y}{\partial v} = \frac{\partial f}{\partial x}(1) + \frac{\partial f}{\partial y}(u)$

Chain Rule with Nested Functions

ThChain Rule for Nested Compositions

For a function with $k$ nested layers $y = f_k(f_{k-1}(\cdots f_1(x) \cdots))$ , the derivative is the product of the derivatives of all $k$ layers, each evaluated at the output of the layers inside it:

Nested Chain Rule

\frac{dy}{dx} = f_k'(f_{k-1}(\cdots)) \cdot f_{k-1}'(f_{k-2}(\cdots)) \cdots f_2'(f_1(x)) \cdot f_1'(x)

Here,

$f_k$ =The outermost function
$f_1$ =The innermost function
$f_k' \cdot f_{k-1}' \cdots f_1'$ =Product of derivatives from outside inward

📝Three-Layer Nested Function

Problem: Find $\frac{d}{dx}[\sin(\ln(\tan(x)))]$ .

Solution:

Outermost: $\sin(u)$ , derivative $\cos(u)$
Middle: $\ln(v)$ , derivative $\frac{1}{v}$
Innermost: $\tan(x)$ , derivative $\sec^2(x)$
Result: $\cos(\ln(\tan(x))) \cdot \frac{1}{\tan(x)} \cdot \sec^2(x)$
Simplified: $\cos(\ln(\tan(x))) \cdot \frac{\sec^2(x)}{\tan(x)} = \frac{2\cos(\ln(\tan(x)))}{\sin(2x)}$

📝Four-Layer Nested Function

Problem: Find $\frac{d}{dx}[e^{\sqrt{\sin(\cos(x))}}]$ .

Solution:

Layer 4 (outermost): $e^u$ , derivative $e^u$
Layer 3: $\sqrt{v}$ , derivative $\frac{1}{2\sqrt{v}}$
Layer 2: $\sin(w)$ , derivative $\cos(w)$
Layer 1 (innermost): $\cos(x)$ , derivative $-\sin(x)$
Result: $e^{\sqrt{\sin(\cos(x))}} \cdot \frac{1}{2\sqrt{\sin(\cos(x))}} \cdot \cos(\cos(x)) \cdot (-\sin(x))$
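The nested formula can be turned into a small loop: walk from the innermost layer outward, and at each layer multiply the running derivative by that layer's derivative evaluated at its input. The sketch below is one possible way to write this; `nested_derivative` and the `(f, f_prime)` pair layout are hypothetical conventions, not an established API.

```python
import math

def nested_derivative(layers, x):
    """Evaluate a nested composition and its derivative at x.

    layers: list of (f, f_prime) pairs ordered from innermost to outermost.
    Returns (value, derivative).
    """
    value = x
    derivative = 1.0
    for f, f_prime in layers:
        derivative *= f_prime(value)  # local derivative at this layer's input
        value = f(value)              # feed the output to the next layer
    return value, derivative

# The four-layer example: e^{sqrt(sin(cos(x)))}
layers = [
    (math.cos, lambda x: -math.sin(x)),              # layer 1 (innermost)
    (math.sin, math.cos),                            # layer 2
    (math.sqrt, lambda v: 1 / (2 * math.sqrt(v))),   # layer 3
    (math.exp, math.exp),                            # layer 4 (outermost)
]

x0 = 0.5  # chosen so that sin(cos(x0)) > 0 and the square root is defined
value, deriv = nested_derivative(layers, x0)

# Closed form from the worked example, for comparison
inner = math.sin(math.cos(x0))
closed = (math.exp(math.sqrt(inner)) * (1 / (2 * math.sqrt(inner)))
          * math.cos(math.cos(x0)) * (-math.sin(x0)))
print(deriv, closed)
```

Accumulating from the inside out like this mirrors forward-mode differentiation; backpropagation multiplies the same factors starting from the outermost layer instead.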

Chain Rule for Implicit Functions

DfImplicit Differentiation

When $y$ is defined implicitly by an equation $F(x, y) = 0$ , we differentiate both sides with respect to $x$ treating $y$ as a function of $x$ , then solve for $\frac{dy}{dx}$ .

Implicit Derivative Formula

\frac{dy}{dx} = -\frac{F_x}{F_y} = -\frac{\frac{\partial F}{\partial x}}{\frac{\partial F}{\partial y}}

Here,

$F(x, y) = 0$ =The implicit equation defining y as a function of x
$F_x$ =Partial derivative of F with respect to x
$F_y$ =Partial derivative of F with respect to y (must be nonzero at the point of interest)

📝Circle Equation

Problem: Find $\frac{dy}{dx}$ for $x^2 + y^2 = 25$ .

Solution:

Let $F(x, y) = x^2 + y^2 - 25 = 0$
$F_x = 2x$ , $F_y = 2y$
$\frac{dy}{dx} = -\frac{2x}{2y} = -\frac{x}{y}$
This matches the geometric intuition: at point $(3, 4)$ , slope is $-\frac{3}{4}$ .
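To see the formula in action, the sketch below evaluates $-F_x/F_y$ at $(3, 4)$ and compares it with a finite-difference slope of the explicit upper branch $y = \sqrt{25 - x^2}$ . The helper `implicit_slope` is a hypothetical name introduced for illustration.

```python
import math

def implicit_slope(F_x, F_y, x, y):
    """dy/dx = -F_x / F_y, valid only where F_y is nonzero."""
    fy = F_y(x, y)
    if fy == 0:
        raise ZeroDivisionError("F_y is zero: the tangent is vertical here")
    return -F_x(x, y) / fy

# Circle x^2 + y^2 - 25 = 0
slope = implicit_slope(lambda x, y: 2 * x, lambda x, y: 2 * y, 3.0, 4.0)

# Explicit upper half of the circle, used only as a cross-check
upper = lambda x: math.sqrt(25.0 - x ** 2)
h = 1e-6
fd_slope = (upper(3.0 + h) - upper(3.0 - h)) / (2 * h)

print(slope, fd_slope)
```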

📝Ellipse with Implicit Differentiation

Problem: Find $\frac{dy}{dx}$ for $\frac{x^2}{4} + \frac{y^2}{9} = 1$ .

Solution:

Differentiate both sides: $\frac{2x}{4} + \frac{2y}{9}\frac{dy}{dx} = 0$
Solve: $\frac{dy}{dx} = -\frac{2x/4}{2y/9} = -\frac{9x}{4y}$

📝Higher-Order Implicit Derivatives

Problem: Find $\frac{d^2y}{dx^2}$ for $x^2 + y^2 = 25$ .

Solution:

First derivative: $\frac{dy}{dx} = -\frac{x}{y}$
Differentiate again: $\frac{d^2y}{dx^2} = \frac{d}{dx}\left(-\frac{x}{y}\right) = -\frac{y - x\frac{dy}{dx}}{y^2}$
Substitute $\frac{dy}{dx} = -\frac{x}{y}$ : $\frac{d^2y}{dx^2} = -\frac{y - x(-x/y)}{y^2} = -\frac{y^2 + x^2}{y^3} = -\frac{25}{y^3}$
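The second-derivative result can be checked the same way. This sketch compares $-25/y^3$ at the point $(3, 4)$ with a second-order finite difference on the upper branch of the circle; the step size `h` is an illustrative choice.

```python
import math

def y_upper(x):
    """Upper branch of the circle x^2 + y^2 = 25."""
    return math.sqrt(25.0 - x ** 2)

def second_derivative_formula(y):
    """Result of the implicit computation: d^2y/dx^2 = -25 / y^3."""
    return -25.0 / y ** 3

x0 = 3.0
h = 1e-4  # larger step than for first derivatives, to limit rounding error
fd_second = (y_upper(x0 + h) - 2 * y_upper(x0) + y_upper(x0 - h)) / h ** 2
formula_value = second_derivative_formula(y_upper(x0))

print(formula_value, fd_second)
```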

Chain Rule for Partial Derivatives

DfPartial Derivative Chain Rule

When a function depends on intermediate variables that are themselves functions of multiple independent variables, we use the multivariable chain rule with partial derivatives.

Partial Derivative Chain Rule (Two Intermediate Variables)

\frac{\partial z}{\partial u} = \frac{\partial z}{\partial x}\frac{\partial x}{\partial u} + \frac{\partial z}{\partial y}\frac{\partial y}{\partial u}

Here,

$z = f(x, y)$ =Function of intermediate variables x and y
$x = x(u, v), y = y(u, v)$ =Intermediate variables as functions of independent variables u and v

Full Partial Derivative System

\begin{cases} \frac{\partial z}{\partial u} = \frac{\partial z}{\partial x}\frac{\partial x}{\partial u} + \frac{\partial z}{\partial y}\frac{\partial y}{\partial u} \\[8pt] \frac{\partial z}{\partial v} = \frac{\partial z}{\partial x}\frac{\partial x}{\partial v} + \frac{\partial z}{\partial y}\frac{\partial y}{\partial v} \end{cases}

Here,

$\frac{\partial z}{\partial u}$ =Partial derivative of z with respect to u through all paths
$\frac{\partial z}{\partial v}$ =Partial derivative of z with respect to v through all paths

ThChain Rule as Matrix Multiplication

The chain rule for multivariable functions can be expressed compactly using the Jacobian matrix. If $\vec{z} = \vec{f}(\vec{x})$ and $\vec{x} = \vec{g}(\vec{u})$ , then:

Jacobian Chain Rule

J_{\vec{f} \circ \vec{g}} = J_{\vec{f}} \cdot J_{\vec{g}}

Here,

$J_{\vec{f}}$ =Jacobian of f with respect to x
$J_{\vec{g}}$ =Jacobian of g with respect to u
$J_{\vec{f}} \cdot J_{\vec{g}}$ =Matrix product gives the Jacobian of the composition
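The matrix form can be verified numerically. In the sketch below, the maps `f` and `g` are made-up examples (from $\mathbb{R}^3$ to $\mathbb{R}^2$ and from $\mathbb{R}^2$ to $\mathbb{R}^3$ respectively), and `numerical_jacobian` is a hypothetical finite-difference helper. Note that the Jacobian of $\vec{f}$ is evaluated at $\vec{g}(\vec{u})$ .

```python
import numpy as np

def g(u):
    """Hypothetical inner map R^2 -> R^3."""
    return np.array([u[0] * u[1], np.sin(u[0]), u[1] ** 2])

def f(x):
    """Hypothetical outer map R^3 -> R^2."""
    return np.array([x[0] + x[1] * x[2], np.exp(x[0])])

def numerical_jacobian(func, point, h=1e-6):
    """Central-difference Jacobian, one input coordinate per column."""
    point = np.asarray(point, dtype=float)
    out_dim = func(point).shape[0]
    J = np.zeros((out_dim, point.shape[0]))
    for j in range(point.shape[0]):
        step = np.zeros_like(point)
        step[j] = h
        J[:, j] = (func(point + step) - func(point - step)) / (2 * h)
    return J

u0 = np.array([0.3, -1.2])
J_g = numerical_jacobian(g, u0)                      # shape (3, 2)
J_f = numerical_jacobian(f, g(u0))                   # shape (2, 3), at g(u0)
J_comp = numerical_jacobian(lambda u: f(g(u)), u0)   # shape (2, 2)

max_gap = np.max(np.abs(J_comp - J_f @ J_g))
print("largest entrywise gap:", max_gap)
```

The shapes also show why the order of the product matters: `J_f @ J_g` is (2, 3) times (3, 2), matching the (2, 2) Jacobian of the composition, whereas `J_g @ J_f` would be a (3, 3) matrix and not the Jacobian of $\vec{f} \circ \vec{g}$ .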

📝Partial Derivative Chain Rule Application

Problem: Let $z = e^{xy}$ where $x = u + v$ and $y = uv$ . Find $\frac{\partial z}{\partial u}$ .

Solution:

$\frac{\partial z}{\partial x} = ye^{xy}$ , $\frac{\partial z}{\partial y} = xe^{xy}$
$\frac{\partial x}{\partial u} = 1$ , $\frac{\partial y}{\partial u} = v$
$\frac{\partial z}{\partial u} = ye^{xy}(1) + xe^{xy}(v) = e^{xy}(y + xv)$
Substitute: $e^{(u+v)(uv)}(uv + (u+v)v) = e^{uv(u+v)} \cdot v(2u + v)$
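As a brief check of the substituted result, the sketch below compares $e^{uv(u+v)} \cdot v(2u + v)$ with a finite difference in $u$ while holding $v$ fixed. The test values `u0` and `v0` are arbitrary.

```python
import math

def z_of(u, v):
    """z = exp(x*y) with x = u + v and y = u*v."""
    x, y = u + v, u * v
    return math.exp(x * y)

def dz_du(u, v):
    """Closed form from the worked example."""
    return math.exp(u * v * (u + v)) * v * (2 * u + v)

u0, v0 = 0.4, 0.7
h = 1e-6
fd_partial = (z_of(u0 + h, v0) - z_of(u0 - h, v0)) / (2 * h)  # v held fixed
analytic_partial = dz_du(u0, v0)
print(analytic_partial, fd_partial)
```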

Backpropagation: Full Derivation

ℹ️ Why Backpropagation Matters

Backpropagation is the algorithm that makes neural networks trainable. It computes the gradient of the loss function with respect to every weight in the network by applying the chain rule layer by layer in reverse order. Without it, training deep networks would be computationally intractable. Understanding backpropagation at the mathematical level is essential for debugging models, designing new architectures, and pushing the boundaries of AI.

ThChain Rule Through a Neural Network

Consider a simple feedforward neural network with one hidden layer. The forward pass computes:

Forward Pass (Single Hidden Layer)

\begin{aligned} z^{(1)} &= W^{(1)}x + b^{(1)} \\ a^{(1)} &= \sigma(z^{(1)}) \\ z^{(2)} &= W^{(2)}a^{(1)} + b^{(2)} \\ \hat{y} &= \sigma(z^{(2)}) \\ L &= \frac{1}{2}\|\hat{y} - y\|^2 \end{aligned}

Here,

$W^{(1)}, W^{(2)}$ =Weight matrices for layers 1 and 2
$b^{(1)}, b^{(2)}$ =Bias vectors for layers 1 and 2
$\sigma$ =Activation function (e.g., sigmoid)
$L$ =Loss function (half the squared error, i.e., MSE up to a constant factor)

Backward Pass (Gradient Computation)

\begin{aligned} \frac{\partial L}{\partial \hat{y}} &= \hat{y} - y \\ \frac{\partial L}{\partial z^{(2)}} &= \frac{\partial L}{\partial \hat{y}} \odot \sigma'(z^{(2)}) \\ \frac{\partial L}{\partial W^{(2)}} &= \frac{\partial L}{\partial z^{(2)}} \, (a^{(1)})^T \\ \frac{\partial L}{\partial z^{(1)}} &= \left((W^{(2)})^T \frac{\partial L}{\partial z^{(2)}}\right) \odot \sigma'(z^{(1)}) \\ \frac{\partial L}{\partial W^{(1)}} &= \frac{\partial L}{\partial z^{(1)}} \, x^T \end{aligned}

Here,

$\frac{\partial L}{\partial \hat{y}}$ =Gradient of loss with respect to output
$\frac{\partial L}{\partial z^{(2)}}$ =Error signal at the output layer
$\frac{\partial L}{\partial z^{(1)}}$ =Error signal at the hidden layer (propagated backward)
$\odot$ =Elementwise (Hadamard) product

💡 Key Insight

The gradient at each layer is the product of: (1) the gradient from the layer above, (2) the derivative of the activation function, and (3) the weight matrix transpose. This is the chain rule in action — each layer receives an "error signal" from above, modifies it by the local derivative, and passes it further back.

📝Concrete Backpropagation: Two-Layer Network

Setup: $x = 2$ , $W^{(1)} = 0.5$ , $b^{(1)} = 0.1$ , $W^{(2)} = 0.8$ , $b^{(2)} = 0.2$ , $y = 1$ . Use sigmoid activation and MSE loss.

Forward Pass:

$z^{(1)} = 0.5 \cdot 2 + 0.1 = 1.1$
$a^{(1)} = \sigma(1.1) = 0.7503$
$z^{(2)} = 0.8 \cdot 0.7503 + 0.2 = 0.8002$
$\hat{y} = \sigma(0.8002) = 0.6900$
$L = \frac{1}{2}(0.6900 - 1)^2 = 0.0480$

Backward Pass (Chain Rule):

$\frac{\partial L}{\partial \hat{y}} = 0.6900 - 1 = -0.3100$
$\sigma'(z^{(2)}) = 0.6900(1 - 0.6900) = 0.2139$
$\frac{\partial L}{\partial z^{(2)}} = -0.3100 \cdot 0.2139 = -0.0663$
$\frac{\partial L}{\partial W^{(2)}} = -0.0663 \cdot 0.7503 = -0.0497$
$\frac{\partial L}{\partial z^{(1)}} = 0.8 \cdot (-0.0663) \cdot \sigma'(1.1) = 0.8 \cdot (-0.0663) \cdot 0.1874 = -0.00994$
$\frac{\partial L}{\partial W^{(1)}} = -0.00994 \cdot 2 = -0.0199$

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1 - s)

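# Inputs and parameters, stored as 2-D arrays (column vectors / matrices)
# so that the products below keep their matrix shapes
x = np.array([[2.0]])
W1 = np.array([[0.5]])
b1 = np.array([[0.1]])
W2 = np.array([[0.8]])
b2 = np.array([[0.2]])
y_true = np.array([[1.0]])

# Forward pass
z1 = W1 @ x + b1
a1 = sigmoid(z1)
z2 = W2 @ a1 + b2
a2 = sigmoid(z2)
loss = 0.5 * (a2 - y_true) ** 2

# Backward pass (chain rule)
dL_da2 = a2 - y_true            # dL/d(y_hat)
da2_dz2 = sigmoid_grad(z2)      # sigma'(z2)
dL_dz2 = dL_da2 * da2_dz2       # error signal at the output layer
dL_dW2 = dL_dz2 @ a1.T          # outer product with the layer-2 input
dL_da1 = W2.T @ dL_dz2          # propagate the error back through W2
da1_dz1 = sigmoid_grad(z1)      # sigma'(z1)
dL_dz1 = dL_da1 * da1_dz1       # error signal at the hidden layer
dL_dW1 = dL_dz1 @ x.T           # outer product with the network input

print(f"Loss: {loss.item():.4f}")
print(f"dL/dW2: {dL_dW2.item():.4f}")
print(f"dL/dW1: {dL_dW1.item():.4f}")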
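A common way to sanity-check hand-derived gradients like these is a finite-difference gradient check. The sketch below re-implements the same scalar network with plain floats and compares the backpropagated gradients with central differences of the loss. The function names `loss_fn` and `backprop` are hypothetical, and the step size is an illustrative choice.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def loss_fn(W1, W2, x=2.0, b1=0.1, b2=0.2, y=1.0):
    """Forward pass of the scalar two-layer network, returning the loss."""
    a1 = sigmoid(W1 * x + b1)
    y_hat = sigmoid(W2 * a1 + b2)
    return 0.5 * (y_hat - y) ** 2

def backprop(W1, W2, x=2.0, b1=0.1, b2=0.2, y=1.0):
    """Gradients (dL/dW1, dL/dW2) via the chain rule."""
    a1 = sigmoid(W1 * x + b1)
    y_hat = sigmoid(W2 * a1 + b2)
    dz2 = (y_hat - y) * y_hat * (1 - y_hat)   # output error signal
    dW2 = dz2 * a1
    dz1 = W2 * dz2 * a1 * (1 - a1)            # hidden error signal
    dW1 = dz1 * x
    return dW1, dW2

W1, W2 = 0.5, 0.8
dW1, dW2 = backprop(W1, W2)

h = 1e-6
fd_dW1 = (loss_fn(W1 + h, W2) - loss_fn(W1 - h, W2)) / (2 * h)
fd_dW2 = (loss_fn(W1, W2 + h) - loss_fn(W1, W2 - h)) / (2 * h)

print(f"dL/dW1: backprop={dW1:.6f}  finite-diff={fd_dW1:.6f}")
print(f"dL/dW2: backprop={dW2:.6f}  finite-diff={fd_dW2:.6f}")
```

If a backward pass drops a factor such as $\sigma'(z)$ , the two columns will visibly disagree, which makes this check a useful debugging tool.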

Common Chain Rule Patterns

Composition	Outer	Inner	Derivative
$e^{kx}$	$e^u$	$kx$	$ke^{kx}$
$\sin(ax+b)$	$\sin(u)$	$ax+b$	$a\cos(ax+b)$
$\ln(x^2)$	$\ln(u)$	$x^2$	$\frac{2}{x}$
$\sqrt{1-x^2}$	$\sqrt{u}$	$1-x^2$	$\frac{-x}{\sqrt{1-x^2}}$
$e^{-x^2/2}$	$e^u$	$-x^2/2$	$-xe^{-x^2/2}$
$\sigma(x)$ (sigmoid)	$\frac{1}{1+e^{-u}}$	$x$	$\sigma(x)(1-\sigma(x))$
$\tanh(x)$	$\frac{e^u-e^{-u}}{e^u+e^{-u}}$	$x$	$1-\tanh^2(x)$
$\text{softmax}(x_i)$	$\frac{e^{x_i}}{\sum e^{x_j}}$	$x_i$	$\frac{\partial s_i}{\partial x_j} = s_i(1-s_i)$ if $i=j$ , $-s_is_j$ if $i\neq j$
$\text{ReLU}(x)$	$\max(0,u)$	$x$	$1$ if $x>0$ , $0$ otherwise
$\text{LayerNorm}(x)$	$\frac{x-\mu}{\sqrt{\sigma^2+\epsilon}}$	$x$	Complex — see LayerNorm derivation
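The softmax row is the one pattern in this table whose derivative is a full matrix rather than a scalar. Written compactly, the two cases combine into $\text{diag}(s) - ss^T$ . The sketch below builds that matrix and compares it with finite differences; the input vector is an arbitrary example.

```python
import numpy as np

def softmax(x):
    """Softmax with the usual max-shift for numerical stability."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def softmax_jacobian(x):
    """J[i, j] = ds_i/dx_j = s_i * (1 - s_i) if i == j else -s_i * s_j."""
    s = softmax(x)
    return np.diag(s) - np.outer(s, s)

x0 = np.array([1.0, -0.5, 2.0])
J = softmax_jacobian(x0)

# Finite-difference Jacobian, one input coordinate per column
h = 1e-6
J_fd = np.zeros((3, 3))
for j in range(3):
    step = np.zeros(3)
    step[j] = h
    J_fd[:, j] = (softmax(x0 + step) - softmax(x0 - step)) / (2 * h)

print(np.round(J, 4))
```

Each column of the Jacobian sums to zero because the softmax outputs always sum to 1, so any change in one input must be balanced across the outputs.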

💡 Pattern Recognition

The key to mastering the chain rule is pattern recognition. When you see a function composed of familiar pieces, identify the outer and inner functions immediately. With practice, you will differentiate composite functions mentally without writing out each step. The most important patterns in ML are: sigmoid $\sigma'(x) = \sigma(x)(1-\sigma(x))$ , ReLU $\text{ReLU}'(x) = \mathbb{1}[x > 0]$ , and tanh $\tanh'(x) = 1 - \tanh^2(x)$ .

Python Implementation: Autograd Examples

ℹ️ Autograd and the Chain Rule

Modern deep learning frameworks like PyTorch and TensorFlow implement automatic differentiation (autograd), which records operations in a computation graph and applies the chain rule to each one mechanically. The results are exact up to floating-point rounding, unlike finite-difference approximations. Understanding the manual chain rule helps you debug gradients, write custom backward passes, and reason about numerical stability.

import numpy as np

# ============================================
# Manual Chain Rule Implementation
# ============================================

def chain_rule_example():
    """Differentiate f(x) = sin(x^2) using the chain rule."""
    x = 1.5

    # Outer: sin(u), Inner: u = x^2
    u = x ** 2
    f = np.sin(u)

    # Derivatives
    df_du = np.cos(u)       # derivative of sin
    du_dx = 2 * x            # derivative of x^2
    df_dx = df_du * du_dx   # chain rule: multiply

    print(f"f({x}) = sin({x}^2) = {f:.4f}")
    print(f"f'({x}) = {df_dx:.4f}")

chain_rule_example()

# ============================================
# Numerical Gradient Verification
# ============================================

def numerical_gradient(f, x, h=1e-7):
    """Central difference approximation."""
    return (f(x + h) - f(x - h)) / (2 * h)

def analytical_chain_rule(x):
    """Derivative of sin(x^2) using chain rule."""
    return np.cos(x ** 2) * 2 * x

x_test = 1.5
numerical = numerical_gradient(lambda x: np.sin(x**2), x_test)
analytical = analytical_chain_rule(x_test)
print(f"Numerical:  {numerical:.6f}")
print(f"Analytical: {analytical:.6f}")

# ============================================
# Deep Learning: Manual Backward Pass
# ============================================

def manual_backprop():
    """Full backward pass for a 3-layer network."""
    np.random.seed(42)

    # Input and randomly initialized parameters
    x = np.random.randn(4, 1)
    W1 = np.random.randn(8, 4) * 0.5
    b1 = np.zeros((8, 1))
    W2 = np.random.randn(4, 8) * 0.5
    b2 = np.zeros((4, 1))
    W3 = np.random.randn(1, 4) * 0.5
    b3 = np.zeros((1, 1))

    def sigmoid(z):
        return 1 / (1 + np.exp(-np.clip(z, -500, 500)))

    # Forward
    z1 = W1 @ x + b1
    a1 = sigmoid(z1)
    z2 = W2 @ a1 + b2
    a2 = sigmoid(z2)
    z3 = W3 @ a2 + b3
    a3 = sigmoid(z3)

    y_true = np.array([[1.0]])
    loss = 0.5 * (a3 - y_true) ** 2

    # Backward (chain rule layer by layer)
    dL_da3 = a3 - y_true
    da3_dz3 = a3 * (1 - a3)
    dL_dz3 = dL_da3 * da3_dz3

    dL_dW3 = dL_dz3 @ a2.T
    dL_db3 = dL_dz3
    dL_da2 = W3.T @ dL_dz3

    da2_dz2 = a2 * (1 - a2)
    dL_dz2 = dL_da2 * da2_dz2
    dL_dW2 = dL_dz2 @ a1.T
    dL_db2 = dL_dz2
    dL_da1 = W2.T @ dL_dz2

    da1_dz1 = a1 * (1 - a1)
    dL_dz1 = dL_da1 * da1_dz1
    dL_dW1 = dL_dz1 @ x.T
    dL_db1 = dL_dz1

    print(f"Loss: {loss[0][0]:.6f}")
    print(f"dL/dW1 shape: {dL_dW1.shape}")
    print(f"dL/dW2 shape: {dL_dW2.shape}")
    print(f"dL/dW3 shape: {dL_dW3.shape}")

manual_backprop()

# ============================================
# PyTorch Autograd (Same Computation)
# ============================================

try:
    import torch

    x_t = torch.tensor([1.5], requires_grad=True)
    f_t = torch.sin(x_t ** 2)
    f_t.backward()
    print(f"PyTorch grad: {x_t.grad.item():.6f}")
except ImportError:
    print("PyTorch not available")

Applications in AI/ML

ℹ️ Chain Rule in Deep Learning

The chain rule is not just a mathematical convenience — it is the computational backbone of all gradient-based learning. Every major breakthrough in deep learning, from AlexNet to GPT-4, was enabled by efficient chain rule computation through ever-deeper networks.

ThGradient Flow in Deep Networks

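In a network with $L$ layers, the gradient of the loss with respect to the weights $W^{(l)}$ of layer $l$ has the schematic form below. The symbol $L$ does double duty here: it is the loss in $\partial L$ and the layer count in superscripts and product limits. The matrix factors in the product are ordered from the output layer back toward layer $l+1$ :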

Layer-wise Gradient via Chain Rule

\frac{\partial L}{\partial W^{(l)}} = \frac{\partial L}{\partial a^{(L)}} \cdot \prod_{k=l}^{L-1} \text{diag}(\sigma'(z^{(k+1)})) \cdot W^{(k+1)} \cdot \frac{\partial a^{(l)}}{\partial W^{(l)}}

Here,

$L$ =Total number of layers
$l$ =The layer whose gradient we are computing
$\prod_{k=l}^{L-1}$ =Product of Jacobians from layer l to the output

⚠️ Vanishing and Exploding Gradients

When the chain of derivatives contains many small factors (e.g., $\sigma'(z) \leq 0.25$ for sigmoid), the product shrinks exponentially with depth; this is the vanishing gradient problem. Conversely, if the factors are consistently larger than 1 in magnitude, gradients explode. This is why architecture choices (residual connections, normalization, proper initialization) and activation function choices (ReLU instead of sigmoid) are critical for training deep networks.
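To get a feel for the scale of the effect, the sketch below multiplies sigmoid derivatives across layers and compares the product with the $0.25^n$ upper bound. It looks only at the activation factors and ignores the weight matrices, and the pre-activations are random numbers drawn for illustration, not values from a trained network.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

results = {}
for depth in (1, 5, 10, 20):
    z = rng.normal(size=depth)               # one hypothetical pre-activation per layer
    local = sigmoid(z) * (1.0 - sigmoid(z))  # sigma'(z) at each layer, each <= 0.25
    results[depth] = (float(np.prod(local)), 0.25 ** depth)
    print(f"depth {depth:2d}: product of sigma' = {results[depth][0]:.3e}, "
          f"upper bound 0.25^n = {results[depth][1]:.3e}")
```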

Key Applications:

Application	How Chain Rule is Used
Backpropagation	Gradient of loss w.r.t. weights computed via chain rule through layers
Gradient Descent	Parameter update $\theta \leftarrow \theta - \alpha \nabla_\theta L$ requires $\nabla_\theta L$ from chain rule
Attention Mechanisms	Gradients flow through softmax, which requires the chain rule for softmax Jacobian
Normalization Layers	BatchNorm/LayerNorm backward pass uses chain rule through mean, variance, and affine transforms
Loss Functions	Cross-entropy + softmax combine via chain rule into a clean gradient: $\hat{y} - y$
Custom Operations	Writing custom autograd functions requires implementing the chain rule backward pass
Meta-Learning	MAML computes second-order gradients through the chain rule applied twice
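The "clean gradient" in the Loss Functions row can be checked directly. The sketch below assumes a one-hot target and takes the gradient with respect to the logits (the inputs to softmax), comparing $\hat{y} - y$ with a finite-difference gradient of the cross-entropy loss. The logits and target are made-up values.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def cross_entropy(logits, y):
    """Cross-entropy between the one-hot target y and softmax(logits)."""
    return -np.sum(y * np.log(softmax(logits)))

logits = np.array([2.0, 0.5, -1.0])
y = np.array([0.0, 1.0, 0.0])      # one-hot target (assumed for this sketch)

analytic_grad = softmax(logits) - y  # y_hat - y

# Central-difference gradient with respect to each logit
h = 1e-6
fd_grad = np.zeros_like(logits)
for j in range(logits.size):
    step = np.zeros_like(logits)
    step[j] = h
    fd_grad[j] = (cross_entropy(logits + step, y)
                  - cross_entropy(logits - step, y)) / (2 * h)

print(analytic_grad)
print(fd_grad)
```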

Common Mistakes

Mistake	Incorrect	Correct	Why
Forgetting inner derivative	$\frac{d}{dx}\sin(x^2) = \cos(x^2)$	$\cos(x^2) \cdot 2x$	Must multiply by derivative of inner function
Wrong order of multiplication	$g'(x) \cdot f'(g(x))$	$f'(g(x)) \cdot g'(x)$	Scalar factors commute, but Jacobian matrices do not: the product must be $J_f \cdot J_g$ for the dimensions to match
Differentiating inner first	Differentiate $g(x)$ , then compose with $f'$	Evaluate $f'$ at $g(x)$ , multiply by $g'(x)$	Outer derivative is evaluated at inner, not differentiated
Missing chain in nested functions	Only one derivative factor	Product of ALL inner derivatives	Each nested layer contributes one factor
Forgetting partial derivatives	Only one path in multivariable case	Sum over ALL paths	Multiple intermediate variables each contribute
Confusing $\frac{d}{dx}$ and $\frac{\partial}{\partial x}$	Using partial when total derivative needed	Use total derivative for single-variable compositions	Partial derivative holds other variables constant
Not applying chain rule to activation	Using raw activation derivative	$\sigma'(z) = \sigma(z)(1-\sigma(z))$	The sigmoid derivative depends on the output

⚠️ The Most Dangerous Mistake

The most common and dangerous mistake is forgetting the inner derivative. In a neural network, if you compute the gradient of the loss with respect to a pre-activation $z$ but forget to multiply by the derivative of the activation function $\sigma'(z)$ , your gradient will be wrong and your model will not train correctly. Always verify that every intermediate variable has its derivative accounted for in the chain.

Interview Questions

📝Question 1: Chain Rule Fundamentals

Q: State the chain rule for $y = f(g(x))$ and explain when you would use it.

A: The chain rule states $\frac{dy}{dx} = f'(g(x)) \cdot g'(x)$ . You use it whenever differentiating a composite function — a function inside another function. In ML, this applies to every layer of a neural network: the loss is a function of the output, which is a function of pre-activations, which are functions of weights. The chain rule lets us decompose this complex derivative into manageable local derivatives.

📝Question 2: Multivariable Chain Rule

Q: How does the chain rule change when $z = f(x, y)$ and both $x$ and $y$ depend on $t$ ?

A: When multiple intermediate variables depend on the same variable, you sum contributions through each path: $\frac{dz}{dt} = \frac{\partial f}{\partial x}\frac{dx}{dt} + \frac{\partial f}{\partial y}\frac{dy}{dt}$ . Each term represents the partial effect through one intermediate variable. This extends to any number of intermediate variables: $\frac{dz}{dt} = \sum_i \frac{\partial f}{\partial x_i}\frac{dx_i}{dt}$ .

📝Question 3: Backpropagation Derivation

Q: Derive the gradient of the loss with respect to $W^{(1)}$ in a two-layer network.

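A: Forward (biases omitted): $z^{(1)} = W^{(1)}x$ , $a^{(1)} = \sigma(z^{(1)})$ , $z^{(2)} = W^{(2)}a^{(1)}$ , $\hat{y} = \sigma(z^{(2)})$ , $L = \frac{1}{2}\|\hat{y}-y\|^2$ . Backward: $\frac{\partial L}{\partial W^{(1)}} = \left[\left((W^{(2)})^T\left[(\hat{y}-y) \odot \sigma'(z^{(2)})\right]\right) \odot \sigma'(z^{(1)})\right] x^T$ , where $\odot$ is the elementwise product. Reading from the inside out, the factors are the loss derivative, the output activation derivative, the layer-2 weights, the hidden activation derivative, and the input: one local derivative for each operation in the forward pass.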

📝Question 4: Why ReLU Over Sigmoid

Q: Explain why the chain rule makes ReLU preferred over sigmoid in deep networks.

A: For sigmoid, $\sigma'(z) = \sigma(z)(1-\sigma(z)) \leq 0.25$ , so each activation factor in the chain shrinks the gradient by at least 75%. After $n$ layers, the activation derivatives alone contribute a factor of at most $0.25^n$ , which vanishes exponentially unless the weights compensate. For ReLU, $\text{ReLU}'(z) = 1$ for $z > 0$ , so the gradient passes through active units unchanged (and is zeroed for inactive ones). This is why deep networks with ReLU are much easier to train, while deep sigmoid networks suffer from vanishing gradients.

📝Question 5: Implicit Differentiation in ML

Q: When would you use implicit differentiation instead of explicit differentiation in machine learning?

A: Implicit differentiation is used when the relationship between variables is defined by an equation rather than an explicit function. Examples include: (1) computing the gradient of the optimal solution in bilevel optimization (e.g., hyperparameter optimization), (2) deriving the update rule for implicit SGD, (3) computing exact Hessians of the loss, and (4) solving for the fixed point of an iterative algorithm and differentiating through it. The implicit function theorem guarantees that the derivative exists when $F$ is continuously differentiable and $\frac{\partial F}{\partial y}$ is invertible at the solution.

📝Question 6: Gradient Flow Analysis

Q: A network has 10 layers with sigmoid activations. Estimate how much the gradient is scaled at layer 1 compared to the output.

A: Each sigmoid derivative scales the gradient by at most $0.25$ . Ignoring the weight matrices, over 10 layers the gradient is scaled by at most $0.25^{10} \approx 9.5 \times 10^{-7}$ . The gradient at layer 1 can therefore be about one million times smaller than at the output. This is the vanishing gradient problem, and it explains why deep sigmoid networks are very hard to train with vanilla gradient descent.

📝Question 7: Custom Backward Pass

Q: You implement a custom function $f(x) = \text{softplus}(x) = \ln(1 + e^x)$ . Write the backward pass.

A: Forward: $f(x) = \ln(1 + e^x)$ . Backward: using the chain rule, $\frac{df}{dx} = \frac{1}{1+e^x} \cdot e^x = \frac{e^x}{1+e^x} = \sigma(x)$ , where $\sigma$ is the sigmoid function. So the softplus gradient is the sigmoid — a beautiful relationship that connects two important ML functions through the chain rule.
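In code, the answer might look like the sketch below. It uses plain Python floats rather than any framework's custom-function API, and the function names plus the `upstream` argument (the gradient arriving from later layers) are assumptions made for illustration. Both passes are written in a numerically stable form so that large $|x|$ does not overflow.

```python
import math

def softplus_forward(x):
    """ln(1 + e^x), rewritten as max(x, 0) + ln(1 + e^{-|x|}) to avoid overflow."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def softplus_backward(x, upstream=1.0):
    """Chain rule: upstream gradient times the local derivative sigmoid(x)."""
    if x >= 0:
        local = 1.0 / (1.0 + math.exp(-x))
    else:
        e = math.exp(x)          # safe: x < 0, so e^x cannot overflow
        local = e / (1.0 + e)
    return upstream * local

# Compare the backward pass with a finite difference of the forward pass
h = 1e-6
check_points = [-3.0, -0.5, 0.0, 1.2, 4.0]
gaps = []
for x in check_points:
    fd = (softplus_forward(x + h) - softplus_forward(x - h)) / (2 * h)
    gaps.append(abs(fd - softplus_backward(x)))
print(max(gaps))
```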

Practice Problems

📝Problem 1: Basic Chain Rule

Compute $\frac{d}{dx}[e^{\sin(3x)}]$ .

💡Solution

Outer: $e^u$ , inner: $\sin(3x)$
$\frac{d}{dx}e^{\sin(3x)} = e^{\sin(3x)} \cdot \frac{d}{dx}\sin(3x) = e^{\sin(3x)} \cdot \cos(3x) \cdot 3$
Answer: $3\cos(3x) \cdot e^{\sin(3x)}$

📝Problem 2: Multivariable Chain Rule

Let $w = xy + yz$ where $x = \cos(t)$ , $y = \sin(t)$ , $z = t$ . Find $\frac{dw}{dt}$ .

💡Solution

$\frac{\partial w}{\partial x} = y$ , $\frac{\partial w}{\partial y} = x + z$ , $\frac{\partial w}{\partial z} = y$
$\frac{dx}{dt} = -\sin(t)$ , $\frac{dy}{dt} = \cos(t)$ , $\frac{dz}{dt} = 1$
$\frac{dw}{dt} = y(-\sin(t)) + (x+z)\cos(t) + y(1)$
$= -\sin^2(t) + (\cos(t) + t)\cos(t) + \sin(t)$
$= -\sin^2(t) + \cos^2(t) + t\cos(t) + \sin(t)$
$= \cos(2t) + t\cos(t) + \sin(t)$

📝Problem 3: Implicit Differentiation

Find $\frac{dy}{dx}$ for $e^{xy} = x + y$ .

💡Solution

Differentiate both sides with respect to $x$ (chain rule + product rule on left):
$e^{xy}(y + x\frac{dy}{dx}) = 1 + \frac{dy}{dx}$
Expand: $ye^{xy} + xe^{xy}\frac{dy}{dx} = 1 + \frac{dy}{dx}$
Collect $\frac{dy}{dx}$ terms: $\frac{dy}{dx}(xe^{xy} - 1) = 1 - ye^{xy}$
Answer: $\frac{dy}{dx} = \frac{1 - ye^{xy}}{xe^{xy} - 1}$

📝Problem 4: Backpropagation Gradient

For $L = (a - y)^2$ where $a = \sigma(Wx + b)$ , compute $\frac{\partial L}{\partial W}$ , $\frac{\partial L}{\partial b}$ , and $\frac{\partial L}{\partial x}$ .

💡Solution

$\frac{\partial L}{\partial a} = 2(a - y)$
$\frac{\partial a}{\partial z} = \sigma(z)(1 - \sigma(z)) = a(1-a)$ where $z = Wx + b$
$\frac{\partial L}{\partial z} = 2(a-y) \cdot a(1-a)$
$\frac{\partial L}{\partial W} = \frac{\partial L}{\partial z} \cdot x^T = 2(a-y) \cdot a(1-a) \cdot x^T$
$\frac{\partial L}{\partial b} = \frac{\partial L}{\partial z} = 2(a-y) \cdot a(1-a)$
$\frac{\partial L}{\partial x} = W^T \cdot \frac{\partial L}{\partial z} = W^T \cdot 2(a-y) \cdot a(1-a)$

📝Problem 5: Higher-Order Chain Rule

Find $\frac{d^2}{dx^2}[\sin(x^2)]$ .

💡Solution

First derivative: $\frac{d}{dx}\sin(x^2) = 2x\cos(x^2)$ (chain rule)
Second derivative: $\frac{d}{dx}[2x\cos(x^2)]$ (product rule + chain rule)
$= 2\cos(x^2) + 2x \cdot (-\sin(x^2)) \cdot 2x$
$= 2\cos(x^2) - 4x^2\sin(x^2)$

Quick Reference

Topic	Formula	Key Idea
Single Variable	$\frac{d}{dx}f(g(x)) = f'(g(x)) \cdot g'(x)$	Differentiate outside, multiply by inner derivative
Multivariable	$\frac{dz}{dt} = \frac{\partial f}{\partial x}\frac{dx}{dt} + \frac{\partial f}{\partial y}\frac{dy}{dt}$	Sum over all paths
General	$\frac{\partial z}{\partial u} = \sum_i \frac{\partial z}{\partial x_i}\frac{\partial x_i}{\partial u}$	Sum contributions from each intermediate variable
Nested (k layers)	$\frac{dy}{dx} = f_k' \cdot f_{k-1}' \cdots f_1'$	Product of all inner derivatives
Implicit	$\frac{dy}{dx} = -\frac{F_x}{F_y}$	Differentiate both sides, solve for $dy/dx$
Jacobian	$J_{\vec{f} \circ \vec{g}} = J_{\vec{f}} \cdot J_{\vec{g}}$	Matrix multiplication of Jacobians
Sigmoid	$\sigma'(x) = \sigma(x)(1-\sigma(x))$	Gradient expressed in terms of output
Tanh	$\tanh'(x) = 1 - \tanh^2(x)$	Gradient expressed in terms of output
ReLU	$\text{ReLU}'(x) = \mathbb{1}[x > 0]$	1 if active, 0 if dead
Backprop	$\frac{\partial L}{\partial W^{(l)}} = \delta^{(l)} \cdot (a^{(l-1)})^T$	Error signal times input transpose

Cross-References

Topic	Related Lesson
Derivatives and Differentiation	Calculus Derivatives
Partial Derivatives and Gradients	Calculus Partial
Matrix Calculus and Jacobians	Linear Algebra Matrix Calculus
Multivariable Calculus	Calculus Multivariable
Gradient Descent	Optimization Gradient Descent
Stochastic Gradient Descent	Optimization SGD
Newton's Method	Optimization Newton
Optimization Overview	Calculus Optimization
Lagrange Multipliers	Calculus Lagrange
Information Theory (Cross-Entropy)	Info Theory Cross Entropy
Probability (Bayes' Theorem)	Probability Bayes
Differential Equations	Calculus Differential Equations