Chain Rule and Implicit Differentiation

Master the chain rule, implicit differentiation, and their critical role in backpropagation for neural networks.

â„šī¸ Why It Matters

The chain rule is arguably the most important differentiation rule in all of machine learning. Every single training step of a neural network — from a simple logistic regression to a billion-parameter large language model — relies on the chain rule to compute gradients. Backpropagation, the algorithm that makes deep learning possible, is nothing more than an efficient application of the chain rule through compositions of functions. When you update a weight in layer 100 based on the loss at the output, you are chaining together 100 local derivatives. Without the chain rule, we cannot compute how a change in any parameter affects the final output, which means we cannot train models. This single rule connects the abstract calculus of composite functions to the practical engineering of gradient-based optimization. Mastering the chain rule means understanding the engine that drives all of modern AI.


What is the Chain Rule

Definition: Chain Rule (Single Variable)

If $y = f(u)$ and $u = g(x)$, so that $y = f(g(x))$ is a composite function, then the derivative of $y$ with respect to $x$ is the product of the derivative of the outer function (evaluated at the inner function) and the derivative of the inner function:

Chain Rule (Single Variable)

$$\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx} = f'(g(x)) \cdot g'(x)$$

Here,

  • $f(g(x))$ = The composite function: outer function $f$ applied to inner function $g$
  • $f'(g(x))$ = Derivative of the outer function, evaluated at $g(x)$
  • $g'(x)$ = Derivative of the inner function

💡 Intuition

Think of the chain rule as a "derivative amplifier." If you have a chain of transformations, the total sensitivity of the output to the input is the product of all the local sensitivities along the chain. Each factor tells you how much the intermediate value changes given a small change in the previous value. Multiplying them together gives you the total effect.
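The "product of local sensitivities" picture can be checked numerically. The following is a minimal sketch, assuming an illustrative three-stage chain (the stage functions and the test point are hypothetical choices, not taken from the lesson): each local derivative is estimated with a central difference, and their product is compared with the end-to-end sensitivity.

```python
import math

def central_diff(f, x, h=1e-6):
    """Central-difference estimate of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

# Illustrative chain (an assumption for this sketch): x -> x^2 -> sin -> exp
stages = [lambda x: x ** 2, math.sin, math.exp]

def compose(x):
    """Apply every stage in order, innermost first."""
    for stage in stages:
        x = stage(x)
    return x

x0 = 0.7
value = x0
product_of_local = 1.0
for stage in stages:
    # Local sensitivity of this stage, evaluated at its own input
    product_of_local *= central_diff(stage, value)
    value = stage(value)

end_to_end = central_diff(compose, x0)
print(f"product of local sensitivities: {product_of_local:.6f}")
print(f"end-to-end sensitivity:         {end_to_end:.6f}")
```

The two printed numbers should agree to several decimal places, which is the chain rule stated in terms of measurements rather than formulas.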


Single Variable Chain Rule: Detailed Examples

Theorem: Differentiation by Parts

For a composition $y = f(g(x))$, always differentiate from the outside inward, multiplying by the derivative of each inner function at each step.

📝 Example 1: Power of a Trigonometric Function

Problem: Find $\frac{d}{dx}[\sin^3(x)]$

Solution:

  • Outer function: $u^3$, inner function: $\sin(x)$
  • $\frac{d}{du}u^3 = 3u^2$, $\frac{d}{dx}\sin(x) = \cos(x)$
  • Result: $3\sin^2(x) \cdot \cos(x)$

📝 Example 2: Exponential of a Logarithm

Problem: Find $\frac{d}{dx}[e^{\ln(x^2+1)}]$

Solution:

  • Simplify first: $e^{\ln(x^2+1)} = x^2 + 1$, so the derivative is $2x$.
  • Or apply the chain rule directly: outer $e^u$, inner $\ln(x^2+1)$.
  • $\frac{d}{du}e^u = e^u$, $\frac{d}{dx}\ln(x^2+1) = \frac{2x}{x^2+1}$
  • Result: $e^{\ln(x^2+1)} \cdot \frac{2x}{x^2+1} = (x^2+1) \cdot \frac{2x}{x^2+1} = 2x$.

📝 Example 3: Nested Square Root

Problem: Find $\frac{d}{dx}\sqrt{1 + \sqrt{x}}$

Solution:

  • Outer: $\sqrt{u}$, inner: $1 + \sqrt{x}$
  • $\frac{d}{du}\sqrt{u} = \frac{1}{2\sqrt{u}}$, $\frac{d}{dx}(1 + \sqrt{x}) = \frac{1}{2\sqrt{x}}$
  • Result: $\frac{1}{2\sqrt{1 + \sqrt{x}}} \cdot \frac{1}{2\sqrt{x}} = \frac{1}{4\sqrt{x}\sqrt{1 + \sqrt{x}}}$

📝 Example 4: Trigonometric Composition

Problem: Find $\frac{d}{dx}[\tan(\cos(e^x))]$

Solution:

  • Three layers: outer $\tan(u)$, middle $\cos(v)$, inner $e^x$
  • $\frac{d}{du}\tan(u) = \sec^2(u)$, $\frac{d}{dv}\cos(v) = -\sin(v)$, $\frac{d}{dx}e^x = e^x$
  • Result: $\sec^2(\cos(e^x)) \cdot (-\sin(e^x)) \cdot e^x$
  • Simplified: $-e^x \sin(e^x) \sec^2(\cos(e^x))$
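The four hand-derived results above can be sanity-checked against finite differences. This is a rough sketch, assuming an arbitrary test point `x0 = 0.8` that lies in the domain of every example; the helper names are illustrative.

```python
import math

def central_diff(f, x, h=1e-6):
    """Central-difference estimate of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

def sec2(u):
    """sec^2(u) = 1 / cos^2(u)."""
    return 1.0 / math.cos(u) ** 2

# (function, hand-derived derivative) pairs from Examples 1-4
examples = {
    "sin^3(x)": (
        lambda x: math.sin(x) ** 3,
        lambda x: 3 * math.sin(x) ** 2 * math.cos(x),
    ),
    "exp(ln(x^2+1))": (
        lambda x: math.exp(math.log(x ** 2 + 1)),
        lambda x: 2 * x,
    ),
    "sqrt(1+sqrt(x))": (
        lambda x: math.sqrt(1 + math.sqrt(x)),
        lambda x: 1 / (4 * math.sqrt(x) * math.sqrt(1 + math.sqrt(x))),
    ),
    "tan(cos(e^x))": (
        lambda x: math.tan(math.cos(math.exp(x))),
        lambda x: -math.exp(x) * math.sin(math.exp(x)) * sec2(math.cos(math.exp(x))),
    ),
}

x0 = 0.8  # arbitrary test point (assumption of this sketch)
max_error = 0.0
for name, (f, df) in examples.items():
    error = abs(central_diff(f, x0) - df(x0))
    max_error = max(max_error, error)
    print(f"{name:16s} analytic = {df(x0): .6f}   |error| = {error:.1e}")
```

A small error for every row suggests that no inner-derivative factor was dropped.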

Multivariable Chain Rule

Definition: Chain Rule (Multivariable)

If $z = f(x, y)$ where $x = x(t)$ and $y = y(t)$ are both functions of a single variable $t$, then $z$ is a function of $t$ through the intermediate variables $x$ and $y$.

Multivariable Chain Rule (Single Parameter)

$$\frac{dz}{dt} = \frac{\partial f}{\partial x}\frac{dx}{dt} + \frac{\partial f}{\partial y}\frac{dy}{dt}$$

Here,

  • $z = f(x, y)$ = The dependent variable as a function of $x$ and $y$
  • $x(t), y(t)$ = Intermediate variables parameterized by $t$
  • $\frac{\partial f}{\partial x}, \frac{\partial f}{\partial y}$ = Partial derivatives of $f$ with respect to $x$ and $y$

⚠️ Sum Over All Paths

When a variable depends on multiple intermediate variables, the chain rule sums contributions through every path from the dependent variable to the independent variable. Each path contributes its own product of partial derivatives.

General Multivariable Chain Rule

$$\frac{\partial z}{\partial u} = \sum_{i=1}^{n} \frac{\partial z}{\partial x_i} \cdot \frac{\partial x_i}{\partial u}$$

Here,

  • $z = f(x_1, x_2, \ldots, x_n)$ = Function of $n$ intermediate variables
  • $x_i = x_i(u, v, \ldots)$ = Each intermediate variable may depend on multiple independent variables

📝 Multivariable Chain Rule Example

Problem: Let $z = x^2y$ where $x = \cos(t)$ and $y = \sin(t)$. Find $\frac{dz}{dt}$.

Solution:

  • $\frac{\partial z}{\partial x} = 2xy$, $\frac{\partial z}{\partial y} = x^2$
  • $\frac{dx}{dt} = -\sin(t)$, $\frac{dy}{dt} = \cos(t)$
  • $\frac{dz}{dt} = 2xy(-\sin(t)) + x^2\cos(t)$
  • Substitute: $= 2\cos(t)\sin(t)(-\sin(t)) + \cos^2(t)\cos(t)$
  • $= -2\cos(t)\sin^2(t) + \cos^3(t)$
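The two-path sum in this example can be compared with a direct finite-difference derivative of $z(t)$. A minimal sketch, assuming the test point `t0 = 0.9` (an arbitrary choice):

```python
import math

def central_diff(f, t, h=1e-6):
    """Central-difference estimate of f'(t)."""
    return (f(t + h) - f(t - h)) / (2 * h)

def z_of_t(t):
    """z = x^2 * y with x = cos(t), y = sin(t)."""
    x, y = math.cos(t), math.sin(t)
    return x ** 2 * y

def dz_dt_chain(t):
    """Multivariable chain rule: one term per path (through x and through y)."""
    x, y = math.cos(t), math.sin(t)
    dz_dx, dz_dy = 2 * x * y, x ** 2
    dx_dt, dy_dt = -math.sin(t), math.cos(t)
    return dz_dx * dx_dt + dz_dy * dy_dt

t0 = 0.9  # arbitrary test point (assumption of this sketch)
chain = dz_dt_chain(t0)
closed_form = -2 * math.cos(t0) * math.sin(t0) ** 2 + math.cos(t0) ** 3
numeric = central_diff(z_of_t, t0)
print(f"chain rule:  {chain:.6f}")
print(f"closed form: {closed_form:.6f}")
print(f"numeric:     {numeric:.6f}")
```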

📝 Two-Parameter Multivariable Chain Rule

Problem: Let $z = f(x, y)$ where $x = u + v$ and $y = uv$. Find $\frac{\partial z}{\partial u}$ and $\frac{\partial z}{\partial v}$.

Solution:

  • $\frac{\partial z}{\partial u} = \frac{\partial f}{\partial x}\frac{\partial x}{\partial u} + \frac{\partial f}{\partial y}\frac{\partial y}{\partial u} = \frac{\partial f}{\partial x}(1) + \frac{\partial f}{\partial y}(v)$
  • $\frac{\partial z}{\partial v} = \frac{\partial f}{\partial x}\frac{\partial x}{\partial v} + \frac{\partial f}{\partial y}\frac{\partial y}{\partial v} = \frac{\partial f}{\partial x}(1) + \frac{\partial f}{\partial y}(u)$

Chain Rule with Nested Functions

Theorem: Chain Rule for Nested Compositions

For a function with $k$ nested layers $y = f_k(f_{k-1}(\cdots f_1(x) \cdots))$, the derivative is the product of the derivatives of all $k$ layers, each evaluated at the output of the layers nested inside it:

Nested Chain Rule

$$\frac{dy}{dx} = f_k'(f_{k-1}(\cdots)) \cdot f_{k-1}'(f_{k-2}(\cdots)) \cdots f_2'(f_1(x)) \cdot f_1'(x)$$

Here,

  • $f_k$ = The outermost function
  • $f_1$ = The innermost function
  • $f_k' \cdot f_{k-1}' \cdots f_1'$ = Product of derivatives from outside inward

📝 Three-Layer Nested Function

Problem: Find $\frac{d}{dx}[\sin(\ln(\tan(x)))]$.

Solution:

  • Outermost: $\sin(u)$, derivative $\cos(u)$
  • Middle: $\ln(v)$, derivative $\frac{1}{v}$
  • Innermost: $\tan(x)$, derivative $\sec^2(x)$
  • Result: $\cos(\ln(\tan(x))) \cdot \frac{1}{\tan(x)} \cdot \sec^2(x)$
  • Simplified: $\cos(\ln(\tan(x))) \cdot \frac{\sec^2(x)}{\tan(x)} = \frac{2\cos(\ln(\tan(x)))}{\sin(2x)}$

📝 Four-Layer Nested Function

Problem: Find $\frac{d}{dx}[e^{\sqrt{\sin(\cos(x))}}]$.

Solution:

  • Layer 4 (outermost): $e^u$, derivative $e^u$
  • Layer 3: $\sqrt{v}$, derivative $\frac{1}{2\sqrt{v}}$
  • Layer 2: $\sin(w)$, derivative $\cos(w)$
  • Layer 1 (innermost): $\cos(x)$, derivative $-\sin(x)$
  • Result: $e^{\sqrt{\sin(\cos(x))}} \cdot \frac{1}{2\sqrt{\sin(\cos(x))}} \cdot \cos(\cos(x)) \cdot (-\sin(x))$
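The nested rule lends itself to a short loop: walk the layers from the innermost outward, keep the running value, and multiply in each local derivative at that layer's input. The sketch below is one possible way to write this (the function name `nested_derivative` and the test point are illustrative assumptions), applied to the four-layer example.

```python
import math

def nested_derivative(layers, x):
    """layers: list of (f, f_prime) pairs ordered innermost first.
    Returns (value, derivative) of the full composition at x."""
    value, derivative = x, 1.0
    for f, f_prime in layers:
        derivative *= f_prime(value)  # local derivative at this layer's input
        value = f(value)
    return value, derivative

# Four-layer example: exp(sqrt(sin(cos(x))))
layers = [
    (math.cos, lambda x: -math.sin(x)),              # layer 1 (innermost)
    (math.sin, math.cos),                            # layer 2
    (math.sqrt, lambda v: 1 / (2 * math.sqrt(v))),   # layer 3
    (math.exp, math.exp),                            # layer 4 (outermost)
]

x0 = 0.5  # chosen so that sin(cos(x0)) > 0 (assumption of this sketch)
value, derivative = nested_derivative(layers, x0)

inner = math.sin(math.cos(x0))
hand_result = (math.exp(math.sqrt(inner)) / (2 * math.sqrt(inner))
               * math.cos(math.cos(x0)) * (-math.sin(x0)))
print(f"loop result: {derivative:.8f}")
print(f"hand result: {hand_result:.8f}")
```

Accumulating from the inside out like this is the forward-mode ordering; backpropagation multiplies the same factors starting from the outermost layer instead.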

Chain Rule for Implicit Functions

Definition: Implicit Differentiation

When $y$ is defined implicitly by an equation $F(x, y) = 0$, we differentiate both sides with respect to $x$ treating $y$ as a function of $x$, then solve for $\frac{dy}{dx}$.

Implicit Derivative Formula

$$\frac{dy}{dx} = -\frac{F_x}{F_y} = -\frac{\frac{\partial F}{\partial x}}{\frac{\partial F}{\partial y}}$$

Here,

  • $F(x, y) = 0$ = The implicit equation defining $y$ as a function of $x$
  • $F_x$ = Partial derivative of $F$ with respect to $x$
  • $F_y$ = Partial derivative of $F$ with respect to $y$ (the formula requires $F_y \neq 0$ at the point of interest)

📝 Circle Equation

Problem: Find $\frac{dy}{dx}$ for $x^2 + y^2 = 25$.

Solution:

  • Let $F(x, y) = x^2 + y^2 - 25 = 0$
  • $F_x = 2x$, $F_y = 2y$
  • $\frac{dy}{dx} = -\frac{2x}{2y} = -\frac{x}{y}$
  • This matches the geometric intuition: at point $(3, 4)$, slope is $-\frac{3}{4}$.

📝 Ellipse with Implicit Differentiation

Problem: Find $\frac{dy}{dx}$ for $\frac{x^2}{4} + \frac{y^2}{9} = 1$.

Solution:

  • Differentiate both sides: $\frac{2x}{4} + \frac{2y}{9}\frac{dy}{dx} = 0$
  • Solve: $\frac{dy}{dx} = -\frac{2x/4}{2y/9} = -\frac{9x}{4y}$

📝 Higher-Order Implicit Derivatives

Problem: Find $\frac{d^2y}{dx^2}$ for $x^2 + y^2 = 25$.

Solution:

  • First derivative: $\frac{dy}{dx} = -\frac{x}{y}$
  • Differentiate again: $\frac{d^2y}{dx^2} = \frac{d}{dx}\left(-\frac{x}{y}\right) = -\frac{y - x\frac{dy}{dx}}{y^2}$
  • Substitute $\frac{dy}{dx} = -\frac{x}{y}$: $\frac{d^2y}{dx^2} = -\frac{y - x(-x/y)}{y^2} = -\frac{y^2 + x^2}{y^3} = -\frac{25}{y^3}$
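For the circle, the implicit results can be compared with derivatives of the explicit upper branch $y = \sqrt{25 - x^2}$. A small sketch, assuming the point $(3, 4)$ from the example and illustrative step sizes:

```python
import math

def central_diff(f, x, h=1e-5):
    """Central-difference estimate of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

def second_diff(f, x, h=1e-4):
    """Central-difference estimate of f''(x)."""
    return (f(x + h) - 2 * f(x) + f(x - h)) / h ** 2

def y_upper(x):
    """Explicit upper branch of x^2 + y^2 = 25, valid for |x| < 5."""
    return math.sqrt(25 - x ** 2)

x0 = 3.0
y0 = y_upper(x0)                      # the point (3, 4) on the circle
implicit_slope = -x0 / y0             # dy/dx = -x/y
implicit_curvature = -25 / y0 ** 3    # d2y/dx2 = -25/y^3

numeric_slope = central_diff(y_upper, x0)
numeric_curvature = second_diff(y_upper, x0)
print(f"slope:     implicit {implicit_slope:.6f}, numeric {numeric_slope:.6f}")
print(f"curvature: implicit {implicit_curvature:.6f}, numeric {numeric_curvature:.6f}")
```

The implicit formulas never needed the explicit branch; it is used here only as an independent check.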

Chain Rule for Partial Derivatives

Definition: Partial Derivative Chain Rule

When a function depends on intermediate variables that are themselves functions of multiple independent variables, we use the multivariable chain rule with partial derivatives.

Partial Derivative Chain Rule (Two Intermediate Variables)

$$\frac{\partial z}{\partial u} = \frac{\partial z}{\partial x}\frac{\partial x}{\partial u} + \frac{\partial z}{\partial y}\frac{\partial y}{\partial u}$$

Here,

  • $z = f(x, y)$ = Function of intermediate variables $x$ and $y$
  • $x = x(u, v), y = y(u, v)$ = Intermediate variables as functions of independent variables $u$ and $v$

Full Partial Derivative System

$$\begin{cases} \frac{\partial z}{\partial u} = \frac{\partial z}{\partial x}\frac{\partial x}{\partial u} + \frac{\partial z}{\partial y}\frac{\partial y}{\partial u} \\[8pt] \frac{\partial z}{\partial v} = \frac{\partial z}{\partial x}\frac{\partial x}{\partial v} + \frac{\partial z}{\partial y}\frac{\partial y}{\partial v} \end{cases}$$

Here,

  • $\frac{\partial z}{\partial u}$ = Partial derivative of $z$ with respect to $u$ through all paths
  • $\frac{\partial z}{\partial v}$ = Partial derivative of $z$ with respect to $v$ through all paths

Theorem: Chain Rule as Matrix Multiplication

The chain rule for multivariable functions can be expressed compactly using the Jacobian matrix. If $\vec{z} = \vec{f}(\vec{x})$ and $\vec{x} = \vec{g}(\vec{u})$, then:

Jacobian Chain Rule

$$J_{\vec{f} \circ \vec{g}} = J_{\vec{f}} \cdot J_{\vec{g}}$$

Here,

  • $J_{\vec{f}}$ = Jacobian of $\vec{f}$ with respect to $\vec{x}$, evaluated at $\vec{x} = \vec{g}(\vec{u})$
  • $J_{\vec{g}}$ = Jacobian of $\vec{g}$ with respect to $\vec{u}$
  • $J_{\vec{f}} \cdot J_{\vec{g}}$ = Matrix product gives the Jacobian of the composition
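The Jacobian form can be illustrated with two small maps. In the sketch below, `f`, `g`, their hand-written Jacobians, and the test point are all hypothetical choices made for illustration; the product of the two Jacobians is compared with a finite-difference Jacobian of the composition.

```python
import numpy as np

def g(u):
    """Inner map R^2 -> R^3 (illustrative choice)."""
    return np.array([u[0] + u[1], u[0] * u[1], np.sin(u[0])])

def f(x):
    """Outer map R^3 -> R^2 (illustrative choice)."""
    return np.array([x[0] * x[1], np.exp(x[2]) + x[0]])

def jac_g(u):
    """Hand-derived 3x2 Jacobian of g."""
    return np.array([[1.0, 1.0],
                     [u[1], u[0]],
                     [np.cos(u[0]), 0.0]])

def jac_f(x):
    """Hand-derived 2x3 Jacobian of f."""
    return np.array([[x[1], x[0], 0.0],
                     [1.0, 0.0, np.exp(x[2])]])

def numerical_jacobian(func, u, h=1e-6):
    """Central-difference Jacobian, built one input coordinate at a time."""
    u = np.asarray(u, dtype=float)
    columns = []
    for i in range(u.size):
        step = np.zeros_like(u)
        step[i] = h
        columns.append((func(u + step) - func(u - step)) / (2 * h))
    return np.stack(columns, axis=1)

u0 = np.array([0.3, -1.2])
chain_jacobian = jac_f(g(u0)) @ jac_g(u0)   # (2x3) @ (3x2) -> 2x2
numeric_jacobian = numerical_jacobian(lambda u: f(g(u)), u0)
print(chain_jacobian)
print(numeric_jacobian)
```

Note that `jac_f` is evaluated at `g(u0)`, not at `u0`; the shapes only line up in the order outer times inner.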

📝 Partial Derivative Chain Rule Application

Problem: Let $z = e^{xy}$ where $x = u + v$ and $y = uv$. Find $\frac{\partial z}{\partial u}$.

Solution:

  • $\frac{\partial z}{\partial x} = ye^{xy}$, $\frac{\partial z}{\partial y} = xe^{xy}$
  • $\frac{\partial x}{\partial u} = 1$, $\frac{\partial y}{\partial u} = v$
  • $\frac{\partial z}{\partial u} = ye^{xy}(1) + xe^{xy}(v) = e^{xy}(y + xv)$
  • Substitute: $e^{(u+v)(uv)}(uv + (u+v)v) = e^{uv(u+v)} \cdot v(2u + v)$
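A quick numerical check of this result, and of the matching $\frac{\partial z}{\partial v}$ obtained from the same two-path sum with $\frac{\partial x}{\partial v} = 1$ and $\frac{\partial y}{\partial v} = u$, might look like the sketch below. The test point `(u0, v0)` is an arbitrary assumption.

```python
import math

def z(u, v):
    """z = exp(x * y) with x = u + v and y = u * v."""
    x, y = u + v, u * v
    return math.exp(x * y)

def dz_du(u, v):
    """Simplified result from the worked example."""
    return math.exp(u * v * (u + v)) * v * (2 * u + v)

def dz_dv(u, v):
    """Same two-path sum, now with dx/dv = 1 and dy/dv = u."""
    x, y = u + v, u * v
    return y * math.exp(x * y) * 1 + x * math.exp(x * y) * u

h = 1e-6
u0, v0 = 0.4, 0.7  # arbitrary test point (assumption of this sketch)
numeric_du = (z(u0 + h, v0) - z(u0 - h, v0)) / (2 * h)
numeric_dv = (z(u0, v0 + h) - z(u0, v0 - h)) / (2 * h)
print(f"dz/du: analytic {dz_du(u0, v0):.6f}, numeric {numeric_du:.6f}")
print(f"dz/dv: analytic {dz_dv(u0, v0):.6f}, numeric {numeric_dv:.6f}")
```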

Backpropagation: Full Derivation

â„šī¸ Why Backpropagation Matters

Backpropagation is the algorithm that makes neural networks trainable. It computes the gradient of the loss function with respect to every weight in the network by applying the chain rule layer by layer in reverse order. Without it, training deep networks would be computationally intractable. Understanding backpropagation at the mathematical level is essential for debugging models, designing new architectures, and pushing the boundaries of AI.

Theorem: Chain Rule Through a Neural Network

Consider a simple feedforward neural network with one hidden layer. The forward pass computes:

Forward Pass (Single Hidden Layer)

$$\begin{aligned} z^{(1)} &= W^{(1)}x + b^{(1)} \\ a^{(1)} &= \sigma(z^{(1)}) \\ z^{(2)} &= W^{(2)}a^{(1)} + b^{(2)} \\ \hat{y} &= \sigma(z^{(2)}) \\ L &= \frac{1}{2}\|\hat{y} - y\|^2 \end{aligned}$$

Here,

  • $W^{(1)}, W^{(2)}$ = Weight matrices for layers 1 and 2
  • $b^{(1)}, b^{(2)}$ = Bias vectors for layers 1 and 2
  • $\sigma$ = Activation function (e.g., sigmoid), applied elementwise
  • $L$ = Loss function (squared error with a factor of $\frac{1}{2}$)

Backward Pass (Gradient Computation)

$$\begin{aligned} \frac{\partial L}{\partial \hat{y}} &= \hat{y} - y \\ \frac{\partial L}{\partial z^{(2)}} &= \frac{\partial L}{\partial \hat{y}} \odot \sigma'(z^{(2)}) \\ \frac{\partial L}{\partial W^{(2)}} &= \frac{\partial L}{\partial z^{(2)}} \cdot (a^{(1)})^T \\ \frac{\partial L}{\partial z^{(1)}} &= \left((W^{(2)})^T \frac{\partial L}{\partial z^{(2)}}\right) \odot \sigma'(z^{(1)}) \\ \frac{\partial L}{\partial W^{(1)}} &= \frac{\partial L}{\partial z^{(1)}} \cdot x^T \end{aligned}$$

Here,

  • $\frac{\partial L}{\partial \hat{y}}$ = Gradient of loss with respect to output
  • $\frac{\partial L}{\partial z^{(2)}}$ = Error signal at the output layer
  • $\frac{\partial L}{\partial z^{(1)}}$ = Error signal at the hidden layer (propagated backward)
  • $\odot$ = Elementwise product (ordinary multiplication when a layer has a single unit)

💡 Key Insight

The gradient at each layer is the product of: (1) the gradient from the layer above, (2) the derivative of the activation function, and (3) the weight matrix transpose. This is the chain rule in action — each layer receives an "error signal" from above, modifies it by the local derivative, and passes it further back.

📝 Concrete Backpropagation: Two-Layer Network

Setup: $x = 2$, $W^{(1)} = 0.5$, $b^{(1)} = 0.1$, $W^{(2)} = 0.8$, $b^{(2)} = 0.2$, $y = 1$. Use sigmoid activation and the squared-error loss $L = \frac{1}{2}(\hat{y} - y)^2$. Values are rounded to four significant digits.

Forward Pass:

  • $z^{(1)} = 0.5 \cdot 2 + 0.1 = 1.1$
  • $a^{(1)} = \sigma(1.1) = 0.7503$
  • $z^{(2)} = 0.8 \cdot 0.7503 + 0.2 = 0.8002$
  • $\hat{y} = \sigma(0.8002) = 0.6900$
  • $L = \frac{1}{2}(0.6900 - 1)^2 = 0.0480$

Backward Pass (Chain Rule):

  • $\frac{\partial L}{\partial \hat{y}} = 0.6900 - 1 = -0.3100$
  • $\sigma'(z^{(2)}) = 0.6900(1 - 0.6900) = 0.2139$
  • $\frac{\partial L}{\partial z^{(2)}} = -0.3100 \cdot 0.2139 = -0.0663$
  • $\frac{\partial L}{\partial W^{(2)}} = -0.0663 \cdot 0.7503 = -0.0497$
  • $\frac{\partial L}{\partial z^{(1)}} = 0.8 \cdot (-0.0663) \cdot \sigma'(1.1) = 0.8 \cdot (-0.0663) \cdot 0.1874 = -0.00994$
  • $\frac{\partial L}{\partial W^{(1)}} = -0.00994 \cdot 2 = -0.0199$
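
The same computation in NumPy (all arrays are 2-D so that @ and .T behave like the matrix formulas):

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1 - s)

# Inputs and parameters as 1x1 matrices / column vectors.
# With 1-D arrays, `dL_dz2 @ a1.T` would collapse to a scalar and the
# [0][0] indexing in the print statements would fail.
x = np.array([[2.0]])
W1 = np.array([[0.5]])
b1 = np.array([[0.1]])
W2 = np.array([[0.8]])
b2 = np.array([[0.2]])
y_true = np.array([[1.0]])

# Forward pass
z1 = W1 @ x + b1
a1 = sigmoid(z1)
z2 = W2 @ a1 + b2
a2 = sigmoid(z2)
loss = 0.5 * np.sum((a2 - y_true) ** 2)

# Backward pass (chain rule)
dL_da2 = a2 - y_true              # dL/dy_hat
da2_dz2 = sigmoid_grad(z2)        # local derivative of the output activation
dL_dz2 = dL_da2 * da2_dz2         # error signal at the output layer
dL_dW2 = dL_dz2 @ a1.T
dL_da1 = W2.T @ dL_dz2            # propagate back through the weights
da1_dz1 = sigmoid_grad(z1)        # local derivative of the hidden activation
dL_dz1 = dL_da1 * da1_dz1         # error signal at the hidden layer
dL_dW1 = dL_dz1 @ x.T

print(f"Loss: {loss:.4f}")
print(f"dL/dW2: {dL_dW2[0][0]:.4f}")
print(f"dL/dW1: {dL_dW1[0][0]:.4f}")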
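A finite-difference gradient check is a common way to catch mistakes in a hand-written backward pass. The sketch below restates the worked example with plain scalars (the function names `loss_fn` and `backprop` are illustrative, not part of the lesson's code) and compares the chain-rule gradients with central differences.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_fn(W1, W2, x=2.0, b1=0.1, b2=0.2, y=1.0):
    """Scalar version of the two-layer network from the worked example."""
    a1 = sigmoid(W1 * x + b1)
    y_hat = sigmoid(W2 * a1 + b2)
    return 0.5 * (y_hat - y) ** 2

def backprop(W1, W2, x=2.0, b1=0.1, b2=0.2, y=1.0):
    """Hand-derived gradients following the backward-pass equations."""
    a1 = sigmoid(W1 * x + b1)
    y_hat = sigmoid(W2 * a1 + b2)
    delta2 = (y_hat - y) * y_hat * (1 - y_hat)   # dL/dz2
    delta1 = W2 * delta2 * a1 * (1 - a1)         # dL/dz1
    return delta1 * x, delta2 * a1               # dL/dW1, dL/dW2

W1, W2, h = 0.5, 0.8, 1e-6
grad_W1, grad_W2 = backprop(W1, W2)
num_W1 = (loss_fn(W1 + h, W2) - loss_fn(W1 - h, W2)) / (2 * h)
num_W2 = (loss_fn(W1, W2 + h) - loss_fn(W1, W2 - h)) / (2 * h)
print(f"dL/dW1: backprop {grad_W1:.6f}, numeric {num_W1:.6f}")
print(f"dL/dW2: backprop {grad_W2:.6f}, numeric {num_W2:.6f}")
```

If the two columns disagree beyond a few decimal places, a factor is usually missing from the chain, most often an activation derivative.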

Common Chain Rule Patterns

| Composition | Outer | Inner | Derivative |
|---|---|---|---|
| $e^{kx}$ | $e^u$ | $kx$ | $ke^{kx}$ |
| $\sin(ax+b)$ | $\sin(u)$ | $ax+b$ | $a\cos(ax+b)$ |
| $\ln(x^2)$ | $\ln(u)$ | $x^2$ | $\frac{2}{x}$ |
| $\sqrt{1-x^2}$ | $\sqrt{u}$ | $1-x^2$ | $\frac{-x}{\sqrt{1-x^2}}$ |
| $e^{-x^2/2}$ | $e^u$ | $-x^2/2$ | $-xe^{-x^2/2}$ |
| $\sigma(x)$ (sigmoid) | $\frac{1}{1+e^{-u}}$ | $x$ | $\sigma(x)(1-\sigma(x))$ |
| $\tanh(x)$ | $\frac{e^u-e^{-u}}{e^u+e^{-u}}$ | $x$ | $1-\tanh^2(x)$ |
| $\text{softmax}(x_i)$ | $\frac{e^{x_i}}{\sum e^{x_j}}$ | $x_i$ | $\frac{\partial s_i}{\partial x_j} = s_i(1-s_i)$ if $i=j$, $-s_is_j$ if $i\neq j$ |
| $\text{ReLU}(x)$ | $\max(0,u)$ | $x$ | $1$ if $x>0$, $0$ otherwise |
| $\text{LayerNorm}(x)$ | $\frac{x-\mu}{\sqrt{\sigma^2+\epsilon}}$ | $x$ | Complex (see LayerNorm derivation) |

💡 Pattern Recognition

The key to mastering the chain rule is pattern recognition. When you see a function composed of familiar pieces, identify the outer and inner functions immediately. With practice, you will differentiate composite functions mentally without writing out each step. The most important patterns in ML are: sigmoid $\sigma'(x) = \sigma(x)(1-\sigma(x))$, ReLU $\text{ReLU}'(x) = \mathbb{1}[x > 0]$, and tanh $\tanh'(x) = 1 - \tanh^2(x)$.
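These activation patterns are easy to confirm numerically. The following sketch, with arbitrary test inputs chosen for illustration, checks the sigmoid and tanh identities and builds the softmax Jacobian $\text{diag}(s) - ss^T$, which encodes $s_i(1-s_i)$ on the diagonal and $-s_is_j$ elsewhere.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    shifted = np.exp(x - np.max(x))   # subtract the max for numerical stability
    return shifted / shifted.sum()

def softmax_jacobian(x):
    """Entry (i, j) is ds_i/dx_j: s_i(1 - s_i) if i == j, else -s_i s_j."""
    s = softmax(x)
    return np.diag(s) - np.outer(s, s)

h = 1e-6
x0 = 0.3  # arbitrary scalar test point (assumption of this sketch)
sigmoid_numeric = (sigmoid(x0 + h) - sigmoid(x0 - h)) / (2 * h)
sigmoid_pattern = sigmoid(x0) * (1 - sigmoid(x0))
tanh_numeric = (np.tanh(x0 + h) - np.tanh(x0 - h)) / (2 * h)
tanh_pattern = 1 - np.tanh(x0) ** 2

logits = np.array([1.0, -0.5, 2.0])  # arbitrary test logits
jac_numeric = np.stack(
    [(softmax(logits + h * e) - softmax(logits - h * e)) / (2 * h) for e in np.eye(3)],
    axis=1,
)
jac_pattern = softmax_jacobian(logits)

print(f"sigmoid': pattern {sigmoid_pattern:.6f}, numeric {sigmoid_numeric:.6f}")
print(f"tanh':    pattern {tanh_pattern:.6f}, numeric {tanh_numeric:.6f}")
print("softmax Jacobian max abs difference:", np.max(np.abs(jac_pattern - jac_numeric)))
```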


Python Implementation: Autograd Examples

â„šī¸ Autograd and the Chain Rule

Modern deep learning frameworks like PyTorch and TensorFlow implement automatic differentiation (autograd), which applies the chain rule numerically through computation graphs. Understanding the manual chain rule helps you debug gradients, write custom backward passes, and reason about numerical stability.

import numpy as np

# ============================================
# Manual Chain Rule Implementation
# ============================================

def chain_rule_example():
    """Differentiate f(x) = sin(x^2) using the chain rule."""
    x = 1.5

    # Outer: sin(u), Inner: u = x^2
    u = x ** 2
    f = np.sin(u)

    # Derivatives
    df_du = np.cos(u)       # derivative of sin
    du_dx = 2 * x            # derivative of x^2
    df_dx = df_du * du_dx   # chain rule: multiply

    print(f"f({x}) = sin({x}^2) = {f:.4f}")
    print(f"f'({x}) = {df_dx:.4f}")

chain_rule_example()

# ============================================
# Numerical Gradient Verification
# ============================================

def numerical_gradient(f, x, h=1e-7):
    """Central difference approximation."""
    return (f(x + h) - f(x - h)) / (2 * h)

def analytical_chain_rule(x):
    """Derivative of sin(x^2) using chain rule."""
    return np.cos(x ** 2) * 2 * x

x_test = 1.5
numerical = numerical_gradient(lambda x: np.sin(x**2), x_test)
analytical = analytical_chain_rule(x_test)
print(f"Numerical:  {numerical:.6f}")
print(f"Analytical: {analytical:.6f}")

# ============================================
# Deep Learning: Manual Backward Pass
# ============================================

def manual_backprop():
    """Full backward pass for a 3-layer network."""
    np.random.seed(42)

    # Input and randomly initialized parameters
    x = np.random.randn(4, 1)
    W1 = np.random.randn(8, 4) * 0.5
    b1 = np.zeros((8, 1))
    W2 = np.random.randn(4, 8) * 0.5
    b2 = np.zeros((4, 1))
    W3 = np.random.randn(1, 4) * 0.5
    b3 = np.zeros((1, 1))

    def sigmoid(z):
        return 1 / (1 + np.exp(-np.clip(z, -500, 500)))

    # Forward
    z1 = W1 @ x + b1
    a1 = sigmoid(z1)
    z2 = W2 @ a1 + b2
    a2 = sigmoid(z2)
    z3 = W3 @ a2 + b3
    a3 = sigmoid(z3)

    y_true = np.array([[1.0]])
    loss = 0.5 * (a3 - y_true) ** 2

    # Backward (chain rule layer by layer)
    dL_da3 = a3 - y_true
    da3_dz3 = a3 * (1 - a3)
    dL_dz3 = dL_da3 * da3_dz3

    dL_dW3 = dL_dz3 @ a2.T
    dL_db3 = dL_dz3
    dL_da2 = W3.T @ dL_dz3

    da2_dz2 = a2 * (1 - a2)
    dL_dz2 = dL_da2 * da2_dz2
    dL_dW2 = dL_dz2 @ a1.T
    dL_db2 = dL_dz2
    dL_da1 = W2.T @ dL_dz2

    da1_dz1 = a1 * (1 - a1)
    dL_dz1 = dL_da1 * da1_dz1
    dL_dW1 = dL_dz1 @ x.T
    dL_db1 = dL_dz1

    print(f"Loss: {loss[0][0]:.6f}")
    print(f"dL/dW1 shape: {dL_dW1.shape}")
    print(f"dL/dW2 shape: {dL_dW2.shape}")
    print(f"dL/dW3 shape: {dL_dW3.shape}")

manual_backprop()

# ============================================
# PyTorch Autograd (Same Computation)
# ============================================

try:
    import torch

    x_t = torch.tensor([1.5], requires_grad=True)
    f_t = torch.sin(x_t ** 2)
    f_t.backward()
    print(f"PyTorch grad: {x_t.grad.item():.6f}")
except ImportError:
    print("PyTorch not available")

Applications in AI/ML

â„šī¸ Chain Rule in Deep Learning

The chain rule is not just a mathematical convenience — it is the computational backbone of all gradient-based learning. Every major breakthrough in deep learning, from AlexNet to GPT-4, was enabled by efficient chain rule computation through ever-deeper networks.

Theorem: Gradient Flow in Deep Networks

In a network with $L$ layers, the gradient of the loss with respect to a weight $W^{(l)}$ in layer $l$ is:

Layer-wise Gradient via Chain Rule

$$\frac{\partial L}{\partial W^{(l)}} = \frac{\partial L}{\partial a^{(L)}} \cdot \prod_{k=l}^{L-1} \text{diag}(\sigma'(z^{(k+1)})) \cdot W^{(k+1)} \cdot \frac{\partial a^{(l)}}{\partial W^{(l)}}$$

Here,

  • $L$ = Total number of layers when it appears as a superscript or product limit (the $L$ being differentiated is the loss)
  • $l$ = The layer whose gradient we are computing
  • $\prod_{k=l}^{L-1}$ = Product of Jacobians from layer $l$ to the output

âš ī¸ Vanishing and Exploding Gradients

When the chain of derivatives contains many small factors (e.g., Īƒâ€˛(z)<0.25\sigma'(z) < 0.25 for sigmoid), the product shrinks exponentially — this is the vanishing gradient problem. Conversely, if factors are large, gradients explode. This is why architecture choices (residual connections, normalization, proper initialization) and activation function choices (ReLU instead of sigmoid) are critical for training deep networks.
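The exponential shrinkage can be seen by multiplying sigmoid derivatives along a chain. This is a simplified sketch: it assumes pre-activations drawn from a standard normal distribution and deliberately ignores the weight matrices, so it isolates the contribution of the activation derivatives only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
depth = 20
# Pre-activations sampled from N(0, 1): an assumption made for this sketch
z = rng.standard_normal(depth)
local_derivatives = sigmoid(z) * (1 - sigmoid(z))   # each factor is at most 0.25

# Running product of activation derivatives (weights ignored on purpose)
running_product = np.cumprod(local_derivatives)
upper_bound = 0.25 ** np.arange(1, depth + 1)

for layers in (1, 5, 10, 20):
    print(f"{layers:2d} layers: product = {running_product[layers - 1]:.3e}, "
          f"bound 0.25^n = {upper_bound[layers - 1]:.3e}")
```

In a real network the weight matrices also enter the product, which is why initialization and normalization matter alongside the choice of activation.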

Key Applications:

| Application | How Chain Rule is Used |
|---|---|
| Backpropagation | Gradient of loss w.r.t. weights computed via chain rule through layers |
| Gradient Descent | Parameter update $\theta \leftarrow \theta - \alpha \nabla_\theta L$ requires $\nabla_\theta L$ from chain rule |
| Attention Mechanisms | Gradients flow through softmax, which requires the chain rule for softmax Jacobian |
| Normalization Layers | BatchNorm/LayerNorm backward pass uses chain rule through mean, variance, and affine transforms |
| Loss Functions | Cross-entropy + softmax combine via chain rule into a clean gradient: $\hat{y} - y$ |
| Custom Operations | Writing custom autograd functions requires implementing the chain rule backward pass |
| Meta-Learning | MAML computes second-order gradients through the chain rule applied twice |
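The "clean gradient" in the Loss Functions row can be verified directly. The sketch below assumes a single example with one-hot target and arbitrary logits; it compares $\hat{y} - y$ with a finite-difference gradient of the cross-entropy taken with respect to the logits.

```python
import numpy as np

def softmax(x):
    shifted = np.exp(x - np.max(x))   # subtract the max for numerical stability
    return shifted / shifted.sum()

def cross_entropy(logits, y_onehot):
    """Cross-entropy of softmax(logits) against a one-hot target."""
    return -np.sum(y_onehot * np.log(softmax(logits)))

logits = np.array([2.0, -1.0, 0.5])      # arbitrary test logits
y_onehot = np.array([0.0, 0.0, 1.0])     # true class is index 2

analytic = softmax(logits) - y_onehot    # the combined gradient y_hat - y

h = 1e-6
numeric = np.array([
    (cross_entropy(logits + h * e, y_onehot) - cross_entropy(logits - h * e, y_onehot)) / (2 * h)
    for e in np.eye(3)
])
print("analytic:", analytic)
print("numeric: ", numeric)
```

The softmax Jacobian and the derivative of the logarithm cancel term by term, which is why frameworks typically fuse the two operations.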

Common Mistakes

| Mistake | Incorrect | Correct | Why |
|---|---|---|---|
| Forgetting inner derivative | $\frac{d}{dx}\sin(x^2) = \cos(x^2)$ | $\cos(x^2) \cdot 2x$ | Must multiply by derivative of inner function |
| Wrong order of multiplication | $g'(x) \cdot f'(g(x))$ | $f'(g(x)) \cdot g'(x)$ | For scalars both orders agree, but for Jacobian matrices the order matters (dimensions must match) |
| Differentiating inner first | Differentiate $g(x)$, then compose with $f'$ | Evaluate $f'$ at $g(x)$, multiply by $g'(x)$ | Outer derivative is evaluated at inner, not differentiated |
| Missing chain in nested functions | Only one derivative factor | Product of ALL inner derivatives | Each nested layer contributes one factor |
| Forgetting partial derivatives | Only one path in multivariable case | Sum over ALL paths | Multiple intermediate variables each contribute |
| Confusing $\frac{d}{dx}$ and $\frac{\partial}{\partial x}$ | Using partial when total derivative needed | Use total derivative for single-variable compositions | Partial derivative holds other variables constant |
| Not applying chain rule to activation | Using raw activation derivative | $\sigma'(z) = \sigma(z)(1-\sigma(z))$ | The sigmoid derivative depends on the output |

âš ī¸ The Most Dangerous Mistake

The most common and dangerous mistake is forgetting the inner derivative. In a neural network, if you compute the gradient of the loss with respect to a pre-activation zz but forget to multiply by the derivative of the activation function Īƒâ€˛(z)\sigma'(z), your gradient will be wrong and your model will not train correctly. Always verify that every intermediate variable has its derivative accounted for in the chain.


Interview Questions

📝 Question 1: Chain Rule Fundamentals

Q: State the chain rule for $y = f(g(x))$ and explain when you would use it.

A: The chain rule states $\frac{dy}{dx} = f'(g(x)) \cdot g'(x)$. You use it whenever differentiating a composite function, that is, a function inside another function. In ML, this applies to every layer of a neural network: the loss is a function of the output, which is a function of pre-activations, which are functions of weights. The chain rule lets us decompose this complex derivative into manageable local derivatives.

📝 Question 2: Multivariable Chain Rule

Q: How does the chain rule change when $z = f(x, y)$ and both $x$ and $y$ depend on $t$?

A: When multiple intermediate variables depend on the same variable, you sum contributions through each path: $\frac{dz}{dt} = \frac{\partial f}{\partial x}\frac{dx}{dt} + \frac{\partial f}{\partial y}\frac{dy}{dt}$. Each term represents the partial effect through one intermediate variable. This extends to any number of intermediate variables: $\frac{dz}{dt} = \sum_i \frac{\partial f}{\partial x_i}\frac{dx_i}{dt}$.

📝 Question 3: Backpropagation Derivation

Q: Derive the gradient of the loss with respect to $W^{(1)}$ in a two-layer network.

A: Forward: $z^{(1)} = W^{(1)}x$, $a^{(1)} = \sigma(z^{(1)})$, $z^{(2)} = W^{(2)}a^{(1)}$, $\hat{y} = \sigma(z^{(2)})$, $L = \frac{1}{2}\|\hat{y}-y\|^2$. Backward: $\frac{\partial L}{\partial W^{(1)}} = \left[\left((W^{(2)})^T\left(\frac{\partial L}{\partial \hat{y}} \odot \sigma'(z^{(2)})\right)\right) \odot \sigma'(z^{(1)})\right] x^T$, where $\odot$ is the elementwise product. For scalars this reduces to $\frac{\partial L}{\partial \hat{y}} \cdot \sigma'(z^{(2)}) \cdot W^{(2)} \cdot \sigma'(z^{(1)}) \cdot x$. This is four chain rule multiplications, one per layer and activation.

📝 Question 4: Why ReLU Over Sigmoid

Q: Explain why the chain rule makes ReLU preferred over sigmoid in deep networks.

A: For sigmoid, $\sigma'(z) = \sigma(z)(1-\sigma(z)) \leq 0.25$, so each sigmoid factor in the chain shrinks the gradient by at least 75%. After $n$ layers, the activation derivatives alone contribute a factor of at most $0.25^n$, which vanishes exponentially unless the weights compensate. For ReLU, $\text{ReLU}'(z) = 1$ for $z > 0$, so the gradient passes through active units unchanged (no multiplication by a small factor). This is why deep networks with ReLU are much easier to train, while deep sigmoid networks suffer from vanishing gradients.

📝Question 5: Implicit Differentiation in ML

Q: When would you use implicit differentiation instead of explicit differentiation in machine learning?

A: Implicit differentiation is used when the relationship between variables is defined by an equation rather than an explicit function. Examples include: (1) computing the gradient of the optimal solution in bilevel optimization (e.g., hyperparameter optimization), (2) deriving the update rule for implicit SGD, (3) computing exact Hessians of the loss, and (4) solving for the fixed point of an iterative algorithm and differentiating through it. The implicit function theorem guarantees the derivative exists when $F$ is continuously differentiable and $\frac{\partial F}{\partial y}$ is nonzero (an invertible Jacobian in the multivariable case) at the point of interest.

📝 Question 6: Gradient Flow Analysis

Q: A network has 10 layers with sigmoid activations. Estimate how much the gradient is scaled at layer 1 compared to the output.

A: Each sigmoid activation scales the gradient by at most $0.25$. Over 10 layers, the activation derivatives scale the gradient by at most $0.25^{10} \approx 9.5 \times 10^{-7}$ (ignoring the weight matrices). This means the gradient at layer 1 can be roughly one million times smaller than at the output, which is essentially zero. This is the vanishing gradient problem, and it explains why deep sigmoid networks are very hard to train with vanilla gradient descent.

📝 Question 7: Custom Backward Pass

Q: You implement a custom function $f(x) = \text{softplus}(x) = \ln(1 + e^x)$. Write the backward pass.

A: Forward: $f(x) = \ln(1 + e^x)$. Backward: using the chain rule, $\frac{df}{dx} = \frac{1}{1+e^x} \cdot e^x = \frac{e^x}{1+e^x} = \sigma(x)$, where $\sigma$ is the sigmoid function. So the softplus gradient is the sigmoid, a relationship that connects two important ML functions through the chain rule.
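A minimal NumPy sketch of that forward and backward pair follows. The function names `softplus_forward` and `softplus_backward` are illustrative and do not correspond to any framework's API; the backward function multiplies an incoming upstream gradient by the local derivative $\sigma(x)$, as the chain rule prescribes.

```python
import numpy as np

def softplus_forward(x):
    """Numerically stable ln(1 + e^x), computed as logaddexp(0, x)."""
    return np.logaddexp(0.0, x)

def softplus_backward(x, upstream_grad):
    """Chain rule: upstream gradient times the local derivative sigmoid(x)."""
    local_grad = 1.0 / (1.0 + np.exp(-x))
    return upstream_grad * local_grad

x = np.array([-3.0, 0.0, 2.5])
upstream = np.ones_like(x)       # pretend dL/df = 1 for every element
analytic = softplus_backward(x, upstream)

h = 1e-6
numeric = (softplus_forward(x + h) - softplus_forward(x - h)) / (2 * h)
print("analytic:", analytic)
print("numeric: ", numeric)
```

In a framework with custom autograd functions, the same two pieces would typically become the forward and backward methods, with `x` saved during the forward pass.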


Practice Problems

📝 Problem 1: Basic Chain Rule

Compute $\frac{d}{dx}[e^{\sin(3x)}]$.

💡 Solution

  • Outer: $e^u$, inner: $\sin(3x)$
  • $\frac{d}{dx}e^{\sin(3x)} = e^{\sin(3x)} \cdot \frac{d}{dx}\sin(3x) = e^{\sin(3x)} \cdot \cos(3x) \cdot 3$
  • Answer: $3\cos(3x) \cdot e^{\sin(3x)}$

📝 Problem 2: Multivariable Chain Rule

Let $w = xy + yz$ where $x = \cos(t)$, $y = \sin(t)$, $z = t$. Find $\frac{dw}{dt}$.

💡 Solution

  • $\frac{\partial w}{\partial x} = y$, $\frac{\partial w}{\partial y} = x + z$, $\frac{\partial w}{\partial z} = y$
  • $\frac{dx}{dt} = -\sin(t)$, $\frac{dy}{dt} = \cos(t)$, $\frac{dz}{dt} = 1$
  • $\frac{dw}{dt} = y(-\sin(t)) + (x+z)\cos(t) + y(1)$
  • $= -\sin^2(t) + (\cos(t) + t)\cos(t) + \sin(t)$
  • $= -\sin^2(t) + \cos^2(t) + t\cos(t) + \sin(t)$
  • $= \cos(2t) + t\cos(t) + \sin(t)$

📝 Problem 3: Implicit Differentiation

Find $\frac{dy}{dx}$ for $e^{xy} = x + y$.

💡 Solution

  • Differentiate both sides with respect to $x$ (chain rule + product rule on left):
  • $e^{xy}(y + x\frac{dy}{dx}) = 1 + \frac{dy}{dx}$
  • Expand: $ye^{xy} + xe^{xy}\frac{dy}{dx} = 1 + \frac{dy}{dx}$
  • Collect $\frac{dy}{dx}$ terms: $\frac{dy}{dx}(xe^{xy} - 1) = 1 - ye^{xy}$
  • Answer: $\frac{dy}{dx} = \frac{1 - ye^{xy}}{xe^{xy} - 1}$
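Because this curve has no closed-form solution for $y$, one way to check the answer is to solve for $y$ numerically at nearby $x$ values and difference the results. The sketch below assumes the branch through $(0, 1)$, which satisfies the equation, and uses Newton's method in $y$; the evaluation point `x0 = 0.2` and the helper names are illustrative choices.

```python
import math

def F(x, y):
    """Implicit equation written as F(x, y) = 0."""
    return math.exp(x * y) - x - y

def solve_y(x, y_start=1.0, iterations=50):
    """Newton's method in y for F(x, y) = 0, starting near the branch through (0, 1)."""
    y = y_start
    for _ in range(iterations):
        F_y = x * math.exp(x * y) - 1      # partial derivative of F with respect to y
        y -= F(x, y) / F_y
    return y

def implicit_slope(x, y):
    """Answer from the worked solution."""
    return (1 - y * math.exp(x * y)) / (x * math.exp(x * y) - 1)

x0, h = 0.2, 1e-5
y0 = solve_y(x0)
formula = implicit_slope(x0, y0)
numeric = (solve_y(x0 + h) - solve_y(x0 - h)) / (2 * h)
print(f"point on curve: ({x0}, {y0:.6f})")
print(f"implicit formula: {formula:.6f}")
print(f"numeric slope:    {numeric:.6f}")
```

The Newton step divides by $F_y$, the same quantity that must be nonzero for the implicit derivative formula to apply.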

📝 Problem 4: Backpropagation Gradient

For $L = (a - y)^2$ where $a = \sigma(Wx + b)$, compute $\frac{\partial L}{\partial W}$, $\frac{\partial L}{\partial b}$, and $\frac{\partial L}{\partial x}$.

💡 Solution

  • $\frac{\partial L}{\partial a} = 2(a - y)$
  • $\frac{\partial a}{\partial z} = \sigma(z)(1 - \sigma(z)) = a(1-a)$ where $z = Wx + b$
  • $\frac{\partial L}{\partial z} = 2(a-y) \cdot a(1-a)$
  • $\frac{\partial L}{\partial W} = \frac{\partial L}{\partial z} \cdot x^T = 2(a-y) \cdot a(1-a) \cdot x^T$
  • $\frac{\partial L}{\partial b} = \frac{\partial L}{\partial z} = 2(a-y) \cdot a(1-a)$
  • $\frac{\partial L}{\partial x} = W^T \cdot \frac{\partial L}{\partial z} = W^T \cdot 2(a-y) \cdot a(1-a)$
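For vector inputs the shapes are a useful sanity check on these formulas. The sketch below assumes a layer with `m = 3` outputs and `n = 4` inputs, random values, and the loss summed over output units (all illustrative assumptions); it spot-checks one entry of $\frac{\partial L}{\partial W}$ and one of $\frac{\partial L}{\partial x}$ against finite differences.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(W, b, x, y):
    """L = sum over output units of (a - y)^2 with a = sigmoid(W x + b)."""
    a = sigmoid(W @ x + b)
    return float(np.sum((a - y) ** 2))

def gradients(W, b, x, y):
    """Gradients from the worked solution, written with explicit shapes."""
    a = sigmoid(W @ x + b)
    dL_dz = 2 * (a - y) * a * (1 - a)          # shape (m, 1)
    return dL_dz @ x.T, dL_dz, W.T @ dL_dz     # dL/dW (m, n), dL/db (m, 1), dL/dx (n, 1)

rng = np.random.default_rng(1)
m, n = 3, 4                                    # illustrative layer sizes
W, b = rng.standard_normal((m, n)), rng.standard_normal((m, 1))
x, y = rng.standard_normal((n, 1)), rng.standard_normal((m, 1))

dW, db, dx = gradients(W, b, x, y)

# Finite-difference spot checks on one entry of W and one entry of x
h = 1e-6
W_step = np.zeros_like(W)
W_step[1, 2] = h
num_dW_12 = (loss(W + W_step, b, x, y) - loss(W - W_step, b, x, y)) / (2 * h)
x_step = np.zeros_like(x)
x_step[0, 0] = h
num_dx_0 = (loss(W, b, x + x_step, y) - loss(W, b, x - x_step, y)) / (2 * h)

print("dL/dW shape:", dW.shape, " dL/db shape:", db.shape, " dL/dx shape:", dx.shape)
print(f"dL/dW[1,2]: analytic {dW[1, 2]:.6f}, numeric {num_dW_12:.6f}")
print(f"dL/dx[0]:   analytic {dx[0, 0]:.6f}, numeric {num_dx_0:.6f}")
```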

📝 Problem 5: Higher-Order Chain Rule

Find $\frac{d^2}{dx^2}[\sin(x^2)]$.

💡 Solution

  • First derivative: $\frac{d}{dx}\sin(x^2) = 2x\cos(x^2)$ (chain rule)
  • Second derivative: $\frac{d}{dx}[2x\cos(x^2)]$ (product rule + chain rule)
  • $= 2\cos(x^2) + 2x \cdot (-\sin(x^2)) \cdot 2x$
  • $= 2\cos(x^2) - 4x^2\sin(x^2)$

Quick Reference

| Topic | Formula | Key Idea |
|---|---|---|
| Single Variable | $\frac{d}{dx}f(g(x)) = f'(g(x)) \cdot g'(x)$ | Differentiate outside, multiply by inner derivative |
| Multivariable | $\frac{dz}{dt} = \frac{\partial f}{\partial x}\frac{dx}{dt} + \frac{\partial f}{\partial y}\frac{dy}{dt}$ | Sum over all paths |
| General | $\frac{\partial z}{\partial u} = \sum_i \frac{\partial z}{\partial x_i}\frac{\partial x_i}{\partial u}$ | Sum contributions from each intermediate variable |
| Nested (k layers) | $\frac{dy}{dx} = f_k' \cdot f_{k-1}' \cdots f_1'$ | Product of all inner derivatives |
| Implicit | $\frac{dy}{dx} = -\frac{F_x}{F_y}$ | Differentiate both sides, solve for $dy/dx$ |
| Jacobian | $J_{\vec{f} \circ \vec{g}} = J_{\vec{f}} \cdot J_{\vec{g}}$ | Matrix multiplication of Jacobians |
| Sigmoid | $\sigma'(x) = \sigma(x)(1-\sigma(x))$ | Gradient expressed in terms of output |
| Tanh | $\tanh'(x) = 1 - \tanh^2(x)$ | Gradient expressed in terms of output |
| ReLU | $\text{ReLU}'(x) = \mathbb{1}[x > 0]$ | 1 if active, 0 if dead |
| Backprop | $\frac{\partial L}{\partial W^{(l)}} = \delta^{(l)} \cdot (a^{(l-1)})^T$ | Error signal times input transpose |

Cross-References

| Topic | Related Lesson |
|---|---|
| Derivatives and Differentiation | Calculus Derivatives |
| Partial Derivatives and Gradients | Calculus Partial |
| Matrix Calculus and Jacobians | Linear Algebra Matrix Calculus |
| Multivariable Calculus | Calculus Multivariable |
| Gradient Descent | Optimization Gradient Descent |
| Stochastic Gradient Descent | Optimization SGD |
| Newton's Method | Optimization Newton |
| Optimization Overview | Calculus Optimization |
| Lagrange Multipliers | Calculus Lagrange |
| Information Theory (Cross-Entropy) | Info Theory Cross Entropy |
| Probability (Bayes' Theorem) | Probability Bayes |
| Differential Equations | Calculus Differential Equations |