Backpropagation Algorithm — Forward Pass, Backward Pass & Computational Graphs
Backpropagation is the algorithm that enables neural networks to learn from data. It efficiently computes the gradient of the loss with respect to every parameter using the chain rule.
See our Neural Networks tutorial for the basics of perceptrons and multi-layer networks.
What Is Backpropagation?
DfBackpropagation
Backpropagation (backward propagation of errors) is an algorithm for computing the gradient of the loss function with respect to each weight in the network. It works by:
- Forward pass: Compute the output by propagating input through the network
- Compute loss: Measure the error between prediction and target
- Backward pass: Propagate the error backward, computing gradients via the chain rule
- Update weights: Adjust parameters to reduce the loss
The key insight: backpropagation computes all in a single backward pass, reusing intermediate computations.
Computational Graphs
DfComputational Graph
A computational graph is a directed acyclic graph (DAG) where:
- Nodes represent operations (add, multiply, activation)
- Edges represent data flow (tensors)
Example:
x ──→ [+] ──→ [×] ──→ f
x ──→ [×] ──→ [×] ──→ ↑
y ──→ [+] ↑
y ──────────────┘
Each node computes a local function. Backpropagation traverses this graph in reverse to compute gradients.
Forward Pass
DfForward Pass
The forward pass computes the output of each layer sequentially:
where (input), and is the activation function.
The forward pass also stores all intermediate values (activations, pre-activations) needed for the backward pass.
Forward Pass Equations
Here,
- =Weight matrix at layer l
- =Bias vector at layer l
- =Pre-activation at layer l
- =Activation at layer l
- =Activation function
Backward Pass
DfBackward Pass
The backward pass computes gradients starting from the output layer and moving backward. For each layer :
- Compute the error signal
- Propagate error to previous layer:
- Compute parameter gradients:
ThBackpropagation Correctness (Chain Rule Application)
Backpropagation is a specific application of the chain rule to neural network loss functions. For a network with layers:
where the error signal satisfies the recurrence:
This recurrence is mathematically equivalent to applying the chain rule sequentially from the output to each layer, but it is where is the number of parameters — the same complexity as a single forward pass.
Weight Gradient Computation
Here,
- =Error signal at layer l (gradient of loss w.r.t. pre-activation)
- =Activation from previous layer
- =Weight matrix at layer l
Weight Update
DfGradient Descent Update
After computing gradients, update each parameter:
where is the learning rate. In practice, advanced optimizers (Adam, SGD with momentum) modify this update rule.
Backpropagation in PyTorch (Autograd)
📝Example: Backpropagation from Scratch
import torch
# ═══════════════════════════════════════════════════
# Manual Backpropagation
# ═══════════════════════════════════════════════════
# Define a 2-layer network manually
W1 = torch.randn(2, 4, requires_grad=True)
b1 = torch.zeros(4, requires_grad=True)
W2 = torch.randn(4, 1, requires_grad=True)
b2 = torch.zeros(1, requires_grad=True)
# Input and target
x = torch.tensor([[1.0, 2.0]])
y = torch.tensor([[3.0]])
# Forward pass
z1 = x @ W1 + b1 # Pre-activation layer 1
h1 = torch.relu(z1) # Activation layer 1
z2 = h1 @ W2 + b2 # Pre-activation layer 2
y_pred = z2 # Linear output
# Loss (MSE)
loss = ((y_pred - y) ** 2).mean()
# Backward pass
loss.backward()
print("Gradients:")
print(f"W1: {W1.grad}")
print(f"b1: {b1.grad}")
print(f"W2: {W2.grad}")
print(f"b2: {b2.grad}")
📝Example: Using nn.Module with Autograd
import torch
import torch.nn as nn
# ═══════════════════════════════════════════════════
# PyTorch Autograd: Automatic Backpropagation
# ═══════════════════════════════════════════════════
class SimpleNet(nn.Module):
def __init__(self):
super().__init__()
self.layer1 = nn.Linear(10, 32)
self.layer2 = nn.Linear(32, 16)
self.layer3 = nn.Linear(16, 1)
def forward(self, x):
# Forward pass - PyTorch builds computational graph automatically
x = torch.relu(self.layer1(x))
x = torch.relu(self.layer2(x))
return self.layer3(x)
model = SimpleNet()
x = torch.randn(32, 10) # Batch of 32 samples
y = torch.randn(32, 1) # Targets
# Forward pass
y_pred = model(x)
loss = nn.MSELoss()(y_pred, y)
# Backward pass - PyTorch computes all gradients automatically
loss.backward()
# Inspect gradients
for name, param in model.named_parameters():
print(f"{name}: grad shape = {param.grad.shape}")
📝Example: Custom Backward Pass with torch.autograd
import torch
# ═══════════════════════════════════════════════════
# Custom Backward via autograd.grad
# ═══════════════════════════════════════════════════
x = torch.tensor([3.0, 4.0], requires_grad=True)
y = torch.tensor([1.0, 2.0], requires_grad=True)
# Compute loss
z = (x ** 2 + y ** 2).sum()
# Compute gradients
grads = torch.autograd.grad(z, [x, y])
print(f"dz/dx = {grads[0]}") # Should be [6, 8]
print(f"dz/dy = {grads[1]}") # Should be [2, 4]
# Higher-order derivatives
grad_z = torch.autograd.grad(
z, x, create_graph=True
)[0]
print(f"\ndz/dx = {grad_z}")
# Second derivative
d2z_dx2 = torch.autograd.grad(
grad_z, x
)[0]
print(f"d2z/dx2 = {d2z_dx2}") # Should be [2, 2] (constant for x^2)
ℹ️ Computational Complexity of Backpropagation
The forward pass requires operations (matrix multiplications). The backward pass has approximately the same computational cost. This means backpropagation is roughly 2x the cost of a single forward pass — an extremely efficient way to compute gradients for all parameters simultaneously.
💡 Zeroing Gradients
Always call optimizer.zero_grad() before loss.backward() in PyTorch. By default, gradients are accumulated (summed) across backward passes. Not zeroing gradients causes incorrect gradient updates and training instability.
Common Backpropagation Pitfalls
DfCommon Issues
-
Vanishing Gradients: Gradients shrink exponentially through layers. Causes: deep networks, sigmoid/tanh activations. Solutions: ReLU, skip connections, batch normalization.
-
Exploding Gradients: Gradients grow exponentially. Causes: deep networks, large learning rates. Solutions: gradient clipping, proper initialization.
-
Dead Neurons: ReLU neurons that always output zero. Caused by large negative pre-activations. Solutions: Leaky ReLU, careful initialization.
-
Numerical Instability: , , division by zero. Solutions: numerical stability tricks, mixed precision.
⚠️ Vanishing/Exploding Gradients
In a network with layers, the gradient contains a product of Jacobian matrices:
If each Jacobian has spectral radius , gradients explode. If , they vanish. This is why architecture design (skip connections, normalization) is critical for deep networks.
Summary
📋Summary: Backpropagation Algorithm
- Backpropagation computes gradients via the chain rule applied to computational graphs
- Forward pass: Compute outputs layer by layer, store intermediates
- Backward pass: Propagate error signals backward, compute parameter gradients
- Complexity: Same order as forward pass — where is number of parameters
- PyTorch autograd handles backward pass automatically via dynamic computational graphs
- Always zero gradients before each backward pass
- Vanishing/exploding gradients are the main challenges for deep networks
- Solutions: ReLU, skip connections, batch norm, proper initialization, gradient clipping
Practice Exercises
-
Conceptual: Draw the computational graph for . Compute , , manually using backpropagation.
-
Coding: Implement a 3-layer MLP from scratch using only
torch.tensoroperations (nonn.Module). Compute forward and backward passes manually. Verify gradients match PyTorch's autograd. -
Debugging: Create a network that exhibits vanishing gradients. Monitor gradient magnitudes across layers using hooks. Then fix the issue with ReLU and batch normalization.
-
Numerical Verification: Implement numerical gradient checking: . Verify your manual backpropagation implementation against numerical gradients.
-
Profiling: Profile the forward and backward pass times for a 10-layer network. How does the 2:1 ratio (backward:forward) hold up in practice?