Backpropagation Algorithm — Forward Pass, Backward Pass & Computational Graphs

FoundationsOptimizationFree Lesson

Advertisement

Backpropagation Algorithm — Forward Pass, Backward Pass & Computational Graphs

Backpropagation is the algorithm that enables neural networks to learn from data. It efficiently computes the gradient of the loss with respect to every parameter using the chain rule.

See our Neural Networks tutorial for the basics of perceptrons and multi-layer networks.


What Is Backpropagation?

DfBackpropagation

Backpropagation (backward propagation of errors) is an algorithm for computing the gradient of the loss function with respect to each weight in the network. It works by:

  1. Forward pass: Compute the output by propagating input through the network
  2. Compute loss: Measure the error between prediction and target
  3. Backward pass: Propagate the error backward, computing gradients via the chain rule
  4. Update weights: Adjust parameters to reduce the loss

The key insight: backpropagation computes all Lw\frac{\partial \mathcal{L}}{\partial w} in a single backward pass, reusing intermediate computations.


Computational Graphs

DfComputational Graph

A computational graph is a directed acyclic graph (DAG) where:

  • Nodes represent operations (add, multiply, activation)
  • Edges represent data flow (tensors)

Example: f(x,y)=(x+y)(xy)f(x, y) = (x + y) \cdot (x \cdot y)

Architecture Diagram
x ──→ [+] ──→ [×] ──→ f
x ──→ [×] ──→ [×] ──→ ↑
y ──→ [+]       ↑
y ──────────────┘

Each node computes a local function. Backpropagation traverses this graph in reverse to compute gradients.


Forward Pass

DfForward Pass

The forward pass computes the output of each layer sequentially:

z(l)=W(l)h(l1)+b(l)\mathbf{z}^{(l)} = \mathbf{W}^{(l)} \mathbf{h}^{(l-1)} + \mathbf{b}^{(l)}
h(l)=σ(z(l))\mathbf{h}^{(l)} = \sigma(\mathbf{z}^{(l)})

where h(0)=x\mathbf{h}^{(0)} = \mathbf{x} (input), and σ\sigma is the activation function.

The forward pass also stores all intermediate values (activations, pre-activations) needed for the backward pass.

Forward Pass Equations

z(l)=W(l)h(l1)+b(l),h(l)=σ(z(l))\mathbf{z}^{(l)} = \mathbf{W}^{(l)} \mathbf{h}^{(l-1)} + \mathbf{b}^{(l)}, \quad \mathbf{h}^{(l)} = \sigma(\mathbf{z}^{(l)})

Here,

  • W(l)\mathbf{W}^{(l)}=Weight matrix at layer l
  • b(l)\mathbf{b}^{(l)}=Bias vector at layer l
  • z(l)\mathbf{z}^{(l)}=Pre-activation at layer l
  • h(l)\mathbf{h}^{(l)}=Activation at layer l
  • σ\sigma=Activation function

Backward Pass

DfBackward Pass

The backward pass computes gradients starting from the output layer and moving backward. For each layer ll:

  1. Compute the error signal δ(l)=Lz(l)\delta^{(l)} = \frac{\partial \mathcal{L}}{\partial \mathbf{z}^{(l)}}
  2. Propagate error to previous layer: δ(l1)=(W(l))Tδ(l)σ(z(l1))\delta^{(l-1)} = (\mathbf{W}^{(l)})^T \delta^{(l)} \odot \sigma'(\mathbf{z}^{(l-1)})
  3. Compute parameter gradients: LW(l)=δ(l)(h(l1))T\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(l)}} = \delta^{(l)} (\mathbf{h}^{(l-1)})^T
δ(l)=(W(l+1)Tδ(l+1))σ(z(l))\delta^{(l)} = \left(\mathbf{W}^{(l+1)T} \delta^{(l+1)}\right) \odot \sigma'(\mathbf{z}^{(l)})

ThBackpropagation Correctness (Chain Rule Application)

Backpropagation is a specific application of the chain rule to neural network loss functions. For a network with LL layers:

LW(l)=δ(l)(h(l1))T\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(l)}} = \delta^{(l)} (\mathbf{h}^{(l-1)})^T

where the error signal satisfies the recurrence:

δ(l)=(W(l+1)Tδ(l+1))σ(z(l))\delta^{(l)} = \left(\mathbf{W}^{(l+1)T} \delta^{(l+1)}\right) \odot \sigma'(\mathbf{z}^{(l)})

This recurrence is mathematically equivalent to applying the chain rule sequentially from the output to each layer, but it is O(N)O(N) where NN is the number of parameters — the same complexity as a single forward pass.

Weight Gradient Computation

LW(l)=δ(l)(h(l1))T\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(l)}} = \delta^{(l)} (\mathbf{h}^{(l-1)})^T

Here,

  • δ(l)\delta^{(l)}=Error signal at layer l (gradient of loss w.r.t. pre-activation)
  • h(l1)\mathbf{h}^{(l-1)}=Activation from previous layer
  • W(l)\mathbf{W}^{(l)}=Weight matrix at layer l

Weight Update

DfGradient Descent Update

After computing gradients, update each parameter:

W(l)W(l)ηLW(l)\mathbf{W}^{(l)} \leftarrow \mathbf{W}^{(l)} - \eta \frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(l)}}

where η\eta is the learning rate. In practice, advanced optimizers (Adam, SGD with momentum) modify this update rule.

W(l)W(l)ηLW(l)\mathbf{W}^{(l)} \leftarrow \mathbf{W}^{(l)} - \eta \cdot \frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(l)}}

Backpropagation in PyTorch (Autograd)

📝Example: Backpropagation from Scratch

import torch

# ═══════════════════════════════════════════════════
# Manual Backpropagation
# ═══════════════════════════════════════════════════

# Define a 2-layer network manually
W1 = torch.randn(2, 4, requires_grad=True)
b1 = torch.zeros(4, requires_grad=True)
W2 = torch.randn(4, 1, requires_grad=True)
b2 = torch.zeros(1, requires_grad=True)

# Input and target
x = torch.tensor([[1.0, 2.0]])
y = torch.tensor([[3.0]])

# Forward pass
z1 = x @ W1 + b1          # Pre-activation layer 1
h1 = torch.relu(z1)       # Activation layer 1
z2 = h1 @ W2 + b2         # Pre-activation layer 2
y_pred = z2               # Linear output

# Loss (MSE)
loss = ((y_pred - y) ** 2).mean()

# Backward pass
loss.backward()

print("Gradients:")
print(f"W1: {W1.grad}")
print(f"b1: {b1.grad}")
print(f"W2: {W2.grad}")
print(f"b2: {b2.grad}")

📝Example: Using nn.Module with Autograd

import torch
import torch.nn as nn

# ═══════════════════════════════════════════════════
# PyTorch Autograd: Automatic Backpropagation
# ═══════════════════════════════════════════════════

class SimpleNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Linear(10, 32)
        self.layer2 = nn.Linear(32, 16)
        self.layer3 = nn.Linear(16, 1)

    def forward(self, x):
        # Forward pass - PyTorch builds computational graph automatically
        x = torch.relu(self.layer1(x))
        x = torch.relu(self.layer2(x))
        return self.layer3(x)

model = SimpleNet()
x = torch.randn(32, 10)  # Batch of 32 samples
y = torch.randn(32, 1)   # Targets

# Forward pass
y_pred = model(x)
loss = nn.MSELoss()(y_pred, y)

# Backward pass - PyTorch computes all gradients automatically
loss.backward()

# Inspect gradients
for name, param in model.named_parameters():
    print(f"{name}: grad shape = {param.grad.shape}")

📝Example: Custom Backward Pass with torch.autograd

import torch

# ═══════════════════════════════════════════════════
# Custom Backward via autograd.grad
# ═══════════════════════════════════════════════════

x = torch.tensor([3.0, 4.0], requires_grad=True)
y = torch.tensor([1.0, 2.0], requires_grad=True)

# Compute loss
z = (x ** 2 + y ** 2).sum()

# Compute gradients
grads = torch.autograd.grad(z, [x, y])
print(f"dz/dx = {grads[0]}")  # Should be [6, 8]
print(f"dz/dy = {grads[1]}")  # Should be [2, 4]

# Higher-order derivatives
grad_z = torch.autograd.grad(
    z, x, create_graph=True
)[0]
print(f"\ndz/dx = {grad_z}")

# Second derivative
d2z_dx2 = torch.autograd.grad(
    grad_z, x
)[0]
print(f"d2z/dx2 = {d2z_dx2}")  # Should be [2, 2] (constant for x^2)

ℹ️ Computational Complexity of Backpropagation

The forward pass requires O(l=1Lnlnl1)O\left(\sum_{l=1}^{L} n_l \cdot n_{l-1}\right) operations (matrix multiplications). The backward pass has approximately the same computational cost. This means backpropagation is roughly 2x the cost of a single forward pass — an extremely efficient way to compute gradients for all parameters simultaneously.

💡 Zeroing Gradients

Always call optimizer.zero_grad() before loss.backward() in PyTorch. By default, gradients are accumulated (summed) across backward passes. Not zeroing gradients causes incorrect gradient updates and training instability.


Common Backpropagation Pitfalls

DfCommon Issues

  1. Vanishing Gradients: Gradients shrink exponentially through layers. Causes: deep networks, sigmoid/tanh activations. Solutions: ReLU, skip connections, batch normalization.

  2. Exploding Gradients: Gradients grow exponentially. Causes: deep networks, large learning rates. Solutions: gradient clipping, proper initialization.

  3. Dead Neurons: ReLU neurons that always output zero. Caused by large negative pre-activations. Solutions: Leaky ReLU, careful initialization.

  4. Numerical Instability: log(0)\log(0), exp(large)\exp(\text{large}), division by zero. Solutions: numerical stability tricks, mixed precision.

⚠️ Vanishing/Exploding Gradients

In a network with LL layers, the gradient contains a product of LL Jacobian matrices:

h(L)h(1)=l=1L1h(l+1)h(l)\frac{\partial \mathbf{h}^{(L)}}{\partial \mathbf{h}^{(1)}} = \prod_{l=1}^{L-1} \frac{\partial \mathbf{h}^{(l+1)}}{\partial \mathbf{h}^{(l)}}

If each Jacobian has spectral radius >1> 1, gradients explode. If <1< 1, they vanish. This is why architecture design (skip connections, normalization) is critical for deep networks.


Summary

📋Summary: Backpropagation Algorithm

  • Backpropagation computes gradients via the chain rule applied to computational graphs
  • Forward pass: Compute outputs layer by layer, store intermediates
  • Backward pass: Propagate error signals backward, compute parameter gradients
  • Complexity: Same order as forward pass — O(N)O(N) where NN is number of parameters
  • PyTorch autograd handles backward pass automatically via dynamic computational graphs
  • Always zero gradients before each backward pass
  • Vanishing/exploding gradients are the main challenges for deep networks
  • Solutions: ReLU, skip connections, batch norm, proper initialization, gradient clipping

Practice Exercises

  1. Conceptual: Draw the computational graph for f(x,y,z)=(x+y)(y+z)f(x, y, z) = (x + y) \cdot (y + z). Compute fx\frac{\partial f}{\partial x}, fy\frac{\partial f}{\partial y}, fz\frac{\partial f}{\partial z} manually using backpropagation.

  2. Coding: Implement a 3-layer MLP from scratch using only torch.tensor operations (no nn.Module). Compute forward and backward passes manually. Verify gradients match PyTorch's autograd.

  3. Debugging: Create a network that exhibits vanishing gradients. Monitor gradient magnitudes across layers using hooks. Then fix the issue with ReLU and batch normalization.

  4. Numerical Verification: Implement numerical gradient checking: fxf(x+ϵ)f(xϵ)2ϵ\frac{\partial f}{\partial x} \approx \frac{f(x + \epsilon) - f(x - \epsilon)}{2\epsilon}. Verify your manual backpropagation implementation against numerical gradients.

  5. Profiling: Profile the forward and backward pass times for a 10-layer network. How does the 2:1 ratio (backward:forward) hold up in practice?

Advertisement

Need Expert Deep Learning Help?

Get personalized tutoring, project support, or professional consulting.

Advertisement