Optimizers for Deep Learning — SGD, Adam, AdamW & Learning Rate Schedules
Optimizers determine how neural network parameters are updated based on computed gradients. The choice of optimizer and learning rate schedule significantly impacts training speed and final performance.
See our Training Deep Networks tutorial for practical training tips and debugging strategies.
Gradient Descent Foundation
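Definition: Stochastic Gradient Descent
SGD updates parameters using mini-batch gradients:
$$\theta_{t+1} = \theta_t - \eta \, \nabla_\theta L(\theta_t)$$
where $\eta$ is the learning rate and $\nabla_\theta L(\theta_t)$ is the gradient of the loss with respect to the parameters, estimated on the current mini-batch.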
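As a minimal sketch of this update in plain Python, assume a toy loss $L(\theta) = \frac{1}{2} \sum_i \theta_i^2$ whose exact gradient stands in for a mini-batch gradient. The names `sgd_step` and `grad_quadratic` are illustrative, not part of any library.

```python
def sgd_step(theta, grad, lr):
    """Apply one SGD update, theta <- theta - lr * grad, elementwise."""
    return [p - lr * g for p, g in zip(theta, grad)]


def grad_quadratic(theta):
    # Toy assumption: for L(theta) = 0.5 * sum(theta_i^2) the gradient is theta itself.
    return list(theta)


theta = [1.0, -2.0]
for _ in range(100):
    theta = sgd_step(theta, grad_quadratic(theta), lr=0.1)

# Each step multiplies every parameter by (1 - lr) = 0.9, so theta shrinks toward 0.
print(theta)
```

On this loss a learning rate above 2.0 would make each step overshoot and diverge, which illustrates why the learning rate has to be tuned.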
SGD with Momentum
Definition: Momentum
Momentum accelerates convergence by accumulating a velocity from past gradients. With a coefficient of typically $\beta = 0.9$, it smooths out noisy gradients and speeds up progress along directions where the gradient is consistent.
SGD with Momentum
$$v_t = \beta v_{t-1} + g_t, \qquad \theta_{t+1} = \theta_t - \eta v_t$$
Here,
- $v_t$ = Velocity (accumulated gradient)
- $\beta$ = Momentum coefficient (typically 0.9)
- $\eta$ = Learning rate
- $g_t$ = Current gradient
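The following sketch shows the velocity accumulation in plain Python. It assumes the convention where the velocity accumulates raw gradients and the learning rate is applied afterwards (the convention PyTorch's `SGD` uses); other texts fold the learning rate into the velocity. The function name `momentum_step` is illustrative.

```python
def momentum_step(theta, velocity, grad, lr=0.01, beta=0.9):
    """One SGD-with-momentum update: v <- beta * v + g, theta <- theta - lr * v."""
    velocity = [beta * v + g for v, g in zip(velocity, grad)]
    theta = [p - lr * v for p, v in zip(theta, velocity)]
    return theta, velocity


# With a constant gradient of 1.0 the velocity builds up over time.
theta, velocity = [0.0], [0.0]
for _ in range(50):
    theta, velocity = momentum_step(theta, velocity, grad=[1.0])

print(velocity)
```

Under a constant gradient the velocity approaches $1 / (1 - \beta)$ times that gradient, so $\beta = 0.9$ gives steps up to ten times larger than plain SGD in a consistent direction.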
AdaGrad
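Definition: AdaGrad
AdaGrad adapts the learning rate per parameter based on historical gradients. Parameters with large historical gradients get smaller updates, and vice versa. This works well for sparse data, but because the accumulator only grows, the effective learning rate decays too aggressively over long runs.
AdaGrad Update Rule
$$G_t = G_{t-1} + g_t^2, \qquad \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t} + \epsilon} \, g_t$$
Here,
- $G_t$ = Sum of squared gradients (accumulated per parameter)
- $\eta$ = Base learning rate
- $\epsilon$ = Small constant for numerical stability (e.g. 1e-8)
- $g_t$ = Current gradient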
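To make the aggressive decay concrete, here is a small plain-Python sketch. It assumes the common form in which $\epsilon$ is added outside the square root, and a constant gradient of 1.0 as a toy input; `adagrad_step` is an illustrative name.

```python
import math


def adagrad_step(theta, accum, grad, lr=0.1, eps=1e-8):
    """One AdaGrad update: accumulate squared gradients, then scale each step."""
    accum = [a + g * g for a, g in zip(accum, grad)]
    theta = [
        p - lr * g / (math.sqrt(a) + eps)
        for p, a, g in zip(theta, accum, grad)
    ]
    return theta, accum


theta, accum = [0.0], [0.0]
step_sizes = []
for _ in range(100):
    new_theta, accum = adagrad_step(theta, accum, grad=[1.0])
    step_sizes.append(theta[0] - new_theta[0])  # size of the step just taken
    theta = new_theta

print(step_sizes[0], step_sizes[-1])
```

With a constant gradient the step at iteration $t$ is about $\eta / \sqrt{t}$, so after 100 iterations it is already ten times smaller than the first step, and it keeps shrinking.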
RMSProp
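Definition: RMSProp
RMSProp fixes AdaGrad's aggressive decay by replacing the running sum of squared gradients with an exponential moving average. Because old gradients are gradually forgotten, the effective learning rate does not decay to zero, which makes RMSProp suitable for non-stationary objectives.
RMSProp Update Rule
$$E[g^2]_t = \rho \, E[g^2]_{t-1} + (1 - \rho) \, g_t^2, \qquad \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^2]_t} + \epsilon} \, g_t$$
Here,
- $E[g^2]_t$ = Exponential moving average of squared gradients
- $\rho$ = Decay rate (typically 0.9)
- $\eta$ = Learning rate
- $\epsilon$ = Small constant for numerical stability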
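The sketch below repeats the constant-gradient experiment for RMSProp in plain Python. The decay rate is called `rho` here and $\epsilon$ is added outside the square root; both are notational assumptions, and `rmsprop_step` is an illustrative name.

```python
import math


def rmsprop_step(theta, avg_sq, grad, lr=0.01, rho=0.9, eps=1e-8):
    """One RMSProp update using an exponential moving average of g^2."""
    avg_sq = [rho * s + (1 - rho) * g * g for s, g in zip(avg_sq, grad)]
    theta = [
        p - lr * g / (math.sqrt(s) + eps)
        for p, s, g in zip(theta, avg_sq, grad)
    ]
    return theta, avg_sq


theta, avg_sq = [0.0], [0.0]
for _ in range(100):
    prev = theta[0]
    theta, avg_sq = rmsprop_step(theta, avg_sq, grad=[1.0])
    last_step = prev - theta[0]  # size of the most recent step

print(last_step)
```

The moving average settles near the squared gradient, so the step size levels off near `lr` instead of shrinking toward zero as it does under AdaGrad.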
Adam
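Definition: Adam Optimizer
Adam combines momentum and RMSProp with bias correction:
$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) \, g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2) \, g_t^2$$
$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$
$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \, \hat{m}_t$$
Defaults: $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$. Adam is one of the most widely used optimizers in deep learning.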
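A plain-Python sketch of a single Adam step may help here. It assumes the standard formulation with first and second moment estimates `m` and `v`, a 1-based step counter `t`, and the defaults $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$; `adam_step` is an illustrative name, not a library function.

```python
import math


def adam_step(theta, m, v, grad, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step count used for bias correction."""
    m = [beta1 * mi + (1 - beta1) * g for mi, g in zip(m, grad)]
    v = [beta2 * vi + (1 - beta2) * g * g for vi, g in zip(v, grad)]
    # Bias correction: m and v start at zero, so early values are too small.
    m_hat = [mi / (1 - beta1 ** t) for mi in m]
    v_hat = [vi / (1 - beta2 ** t) for vi in v]
    theta = [
        p - lr * mh / (math.sqrt(vh) + eps)
        for p, mh, vh in zip(theta, m_hat, v_hat)
    ]
    return theta, m, v


# First step from theta = 1.0 with a tiny and a huge gradient.
theta_small, _, _ = adam_step([1.0], [0.0], [0.0], grad=[0.01], t=1)
theta_large, _, _ = adam_step([1.0], [0.0], [0.0], grad=[100.0], t=1)
print(theta_small, theta_large)
```

In both cases the first step has magnitude close to `lr`. Without bias correction the first moment would be only 10% of the gradient and the second moment only 0.1% of its square, so the very first steps would be scaled incorrectly.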
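Theorem: Adam Convergence Properties
Convergence analyses of Adam assume a decaying learning rate, classically one satisfying $\sum_t \eta_t = \infty$ and $\sum_t \eta_t^2 < \infty$. The bias correction terms compensate for the zero initialization of the moment estimates, so early estimates are not biased toward zero. These guarantees are limited, however. Adam is not guaranteed to reach the global optimum in non-convex settings, its adaptive step sizes can make it oscillate around a minimum rather than settle into it, and counterexamples exist where it fails to converge even on convex problems (Reddi et al., 2018).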
AdamW (Decoupled Weight Decay)
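Definition: AdamW
AdamW decouples weight decay from the gradient update:
$$\theta_{t+1} = \theta_t - \eta \left( \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda \theta_t \right)$$
where $\lambda$ is the weight decay coefficient and $\hat{m}_t$, $\hat{v}_t$ are Adam's bias-corrected moment estimates, computed from the loss gradient alone. Unlike L2 regularization (whose penalty gradient is rescaled by the adaptive learning rates), AdamW applies weight decay directly to the weights, which often improves generalization. AdamW is the default choice for many deep learning tasks.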
ℹ️ Adam vs. AdamW
Standard Adam with L2 regularization applies the penalty through the gradient, which gets scaled by the adaptive learning rate. AdamW applies weight decay directly to the parameters, independent of the adaptive scaling. This decoupling leads to better generalization and is why AdamW is preferred.
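The difference can be seen in a single scalar step. The sketch below is a simplified plain-Python illustration, with hypothetical helper names (`adam_direction`, `adam_l2_step`, `adamw_step`) and toy values chosen so that the loss gradient is much larger than the weight.

```python
import math


def adam_direction(m, v, grad, t, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return Adam's bias-corrected, adaptively scaled direction for a scalar."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return m_hat / (math.sqrt(v_hat) + eps), m, v


def adam_l2_step(theta, m, v, grad, t, lr, wd):
    # L2 penalty: the decay term joins the gradient and is rescaled with it.
    direction, m, v = adam_direction(m, v, grad + wd * theta, t)
    return theta - lr * direction, m, v


def adamw_step(theta, m, v, grad, t, lr, wd):
    # Decoupled decay: the decay term bypasses the adaptive scaling.
    direction, m, v = adam_direction(m, v, grad, t)
    return theta - lr * (direction + wd * theta), m, v


theta_l2, _, _ = adam_l2_step(1.0, 0.0, 0.0, grad=100.0, t=1, lr=0.01, wd=0.1)
theta_w, _, _ = adamw_step(1.0, 0.0, 0.0, grad=100.0, t=1, lr=0.01, wd=0.1)
print(theta_l2, theta_w)
```

With the L2 penalty the weight moves by about `lr`, and the decay term is almost invisible because it is divided by the large gradient magnitude. With decoupled decay the weight additionally shrinks by `lr * wd * theta`, regardless of the gradient scale.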
Learning Rate Schedules
Cosine Annealing
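Definition: Cosine Annealing
The learning rate follows a cosine curve from $\eta_{\max}$ to $\eta_{\min}$. The decay is smooth, and the schedule is widely used together with warmup. It is among the most common schedules in modern deep learning.
Cosine Annealing Schedule
$$\eta_t = \eta_{\min} + \frac{1}{2} (\eta_{\max} - \eta_{\min}) \left( 1 + \cos\left( \frac{\pi t}{T} \right) \right)$$
Here,
- $\eta_{\max}$ = Maximum learning rate
- $\eta_{\min}$ = Minimum learning rate
- $t$ = Current step
- $T$ = Total number of steps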
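As a quick sketch, the schedule can be written as a pure function of the step. The function name `cosine_lr` and the clamping of steps beyond the total are illustrative choices, not taken from any library.

```python
import math


def cosine_lr(step, total_steps, lr_max, lr_min=0.0):
    """Cosine-annealed learning rate at `step` (clamped to total_steps)."""
    progress = min(step, total_steps) / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))


for step in (0, 2500, 5000, 7500, 10000):
    print(step, cosine_lr(step, total_steps=10000, lr_max=3e-4, lr_min=1e-6))
```

The rate starts at `lr_max`, passes through the midpoint of the two bounds halfway through training, and ends at `lr_min`. The decay is slowest near both ends and fastest in the middle.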
Linear Warmup
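Definition: Linear Warmup
Start with a small learning rate and linearly increase it to the target over $T_{\text{warmup}}$ warmup steps:
$$\eta_t = \eta_{\max} \cdot \frac{t}{T_{\text{warmup}}}, \qquad t \le T_{\text{warmup}}$$
Warmup stabilizes early training, when gradients are noisy and the Adam moment estimates are still based on very few gradients.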
💡 Warmup Strategy
Warmup is especially important for large batch training and Transformers. Start with a learning rate at or near zero, increase it linearly to $\eta_{\max}$ over 1-5% of total steps, then apply cosine annealing. This helps prevent early divergence and often improves final performance.
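The warmup-then-cosine recipe can be sketched as one function of the step. This is a simplified illustration: the name `warmup_cosine_lr` is hypothetical, and the warmup here starts at `lr_max / warmup_steps` rather than exactly zero so that the first update is not a no-op.

```python
import math


def warmup_cosine_lr(step, total_steps, warmup_steps, lr_max, lr_min=0.0):
    """Linear warmup to lr_max, followed by cosine annealing down to lr_min."""
    if step < warmup_steps:
        # Linear ramp: reaches lr_max on the last warmup step.
        return lr_max * (step + 1) / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min((step - warmup_steps) / decay_steps, 1.0)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))


# 3% warmup out of 10,000 total steps.
schedule = [
    warmup_cosine_lr(s, total_steps=10000, warmup_steps=300, lr_max=3e-4)
    for s in range(10001)
]
print(schedule[0], schedule[299], schedule[10000])
```

Computing the whole schedule up front like this is also a cheap way to plot and sanity-check it before starting a long training run.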
Step Decay
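Definition: Step Decay
Reduce the learning rate by a constant factor at fixed intervals:
$$\eta_t = \eta_0 \cdot \gamma^{\lfloor t / s \rfloor}$$
where $\eta_0$ is the initial learning rate, $s$ is the step interval, and $\gamma$ is the decay factor (typically 0.1). Simple, but it requires manually choosing when to decay.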
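A short sketch of step decay as a function of the step count, assuming the floor-division form where the rate drops by `gamma` every `interval` steps. The name `step_decay_lr` is illustrative.

```python
def step_decay_lr(step, lr0, interval, gamma=0.1):
    """Learning rate after multiplying by gamma once per completed interval."""
    return lr0 * gamma ** (step // interval)


# Example: start at 0.1 and decay by 10x every 30 epochs.
for epoch in (0, 29, 30, 59, 60, 90):
    print(epoch, step_decay_lr(epoch, lr0=0.1, interval=30))
```

The rate stays flat within each interval and drops abruptly at the boundaries, in contrast to the smooth curve of cosine annealing.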
PyTorch Implementation
📝Example: Optimizers and Schedules
```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.optim.lr_scheduler import (
    CosineAnnealingLR, LinearLR, SequentialLR, StepLR
)

model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Linear(256, 128),
    nn.ReLU(),
    nn.Linear(128, 10)
)
criterion = nn.CrossEntropyLoss()  # created once, reused every step

# ═══════════════════════════════════════════════════
# AdamW with Cosine Annealing + Warmup
# ═══════════════════════════════════════════════════
optimizer = optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)

# Warmup: 1000 steps (from 1% of lr up to lr), then cosine annealing over 10000 steps
warmup_scheduler = LinearLR(
    optimizer, start_factor=0.01, total_iters=1000
)
cosine_scheduler = CosineAnnealingLR(
    optimizer, T_max=10000, eta_min=1e-6
)
scheduler = SequentialLR(
    optimizer,
    schedulers=[warmup_scheduler, cosine_scheduler],
    milestones=[1000]  # switch from warmup to cosine at step 1000
)

# Training loop
for step in range(11000):
    # Forward pass (dummy data)
    x = torch.randn(32, 784)
    y = torch.randint(0, 10, (32,))
    output = model(x)
    loss = criterion(output, y)

    # Backward pass
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()  # per-step schedule: call after optimizer.step()

    if step % 2000 == 0:
        print(f"Step {step}: LR={scheduler.get_last_lr()[0]:.6f}, Loss={loss.item():.4f}")

# ═══════════════════════════════════════════════════
# SGD with Momentum (for vision tasks)
# ═══════════════════════════════════════════════════
optimizer_sgd = optim.SGD(
    model.parameters(),
    lr=0.1,
    momentum=0.9,
    weight_decay=5e-4
)

# Step decay: reduce by 10x every 30 epochs
# (call scheduler_sgd.step() once per epoch, not once per batch)
scheduler_sgd = StepLR(optimizer_sgd, step_size=30, gamma=0.1)
```
ℹ️ When to Use Which Optimizer
- AdamW: Default for most tasks, NLP, Transformers, GANs
- SGD + Momentum: Computer vision (ResNet training), often achieves better generalization
- Adam: Quick prototyping, when you don't want to tune learning rate carefully
- LAMB: Large batch training (>8K samples)
Optimizer Comparison
| Optimizer | Adaptive LR | Memory (× parameter size, incl. optimizer state) | Speed | Generalization |
|---|---|---|---|---|
| SGD | No | 1x | Fast | Best |
| SGD + Momentum | No | 2x | Fast | Best |
| AdaGrad | Yes | 2x | Fast | Good |
| RMSProp | Yes | 2x | Fast | Good |
| Adam | Yes | 3x | Fast | Good |
| AdamW | Yes | 3x | Fast | Best (adaptive) |
Summary
📋Summary: Optimizers for Deep Learning
- SGD + Momentum: Often the best generalization (especially in vision), but converges more slowly than adaptive methods and requires careful LR tuning
- Adam: Fast convergence, adaptive LR per parameter, default for prototyping
- AdamW: Adam with decoupled weight decay, current default for most tasks
- Cosine annealing + warmup: The standard learning rate schedule
- Weight decay: With adaptive optimizers, apply it through AdamW's decoupled weight decay rather than as an L2 penalty on the loss
- Learning rate is usually the most important hyperparameter to tune
- Batch size and learning rate are coupled — scale LR with batch size
- Mixed precision training: Use `torch.amp` for up to roughly 2x speedup on supported hardware, with minimal accuracy loss
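The coupling between batch size and learning rate is often handled with a scaling heuristic. The sketch below shows two commonly used ones, linear and square-root scaling. Both are rules of thumb rather than guarantees, and the function name `scale_lr` and the baseline values are illustrative assumptions.

```python
def scale_lr(base_lr, base_batch_size, batch_size, rule="linear"):
    """Rescale a learning rate tuned at base_batch_size for a new batch size."""
    ratio = batch_size / base_batch_size
    if rule == "linear":
        return base_lr * ratio          # heuristic often used with SGD
    if rule == "sqrt":
        return base_lr * ratio ** 0.5   # more conservative alternative
    raise ValueError(f"unknown rule: {rule}")


# Hypothetical baseline: lr = 0.1 was tuned at batch size 256.
print(scale_lr(0.1, 256, 1024))           # linear scaling
print(scale_lr(0.1, 256, 1024, "sqrt"))   # square-root scaling
```

Whichever rule is used, the scaled value is a starting point for tuning, and large scaled learning rates usually need warmup to stay stable.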
Practice Exercises
- Conceptual: Explain why Adam's bias correction is necessary. What happens to the first few gradient estimates without it?
- Experiment: Train ResNet-18 on CIFAR-10 with SGD (lr=0.1, momentum=0.9) vs. Adam (lr=1e-3). Which achieves better final accuracy? Plot the training curves.
- Coding: Implement a custom optimizer class in PyTorch that combines Adam with gradient clipping. Test it on a simple regression task.
- Visualization: Plot the learning rate schedules (cosine, step decay, linear warmup) for 10000 training steps. How do they differ?
- Research: Read the AdamW paper (Loshchilov & Hutter, 2017). Why does decoupled weight decay lead to better generalization than L2 regularization?