Optimizers for Deep Learning — SGD, Adam, AdamW & Learning Rate Schedules

Optimizers determine how neural network parameters are updated based on computed gradients. The choice of optimizer and learning rate schedule significantly impacts training speed and final performance.

See our Training Deep Networks tutorial for practical training tips and debugging strategies.


Gradient Descent Foundation

Definition: Stochastic Gradient Descent

SGD updates parameters using mini-batch gradients:

$$\theta_{t+1} = \theta_t - \eta \nabla_\theta \mathcal{L}(\theta_t)$$

where $\eta$ is the learning rate and $\nabla_\theta \mathcal{L}$ is the gradient of the loss with respect to the parameters.
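As a minimal sketch of this update rule, the snippet below applies it to a toy quadratic loss in plain Python. The helper names `sgd_step` and `quadratic_grad` are hypothetical, and the example uses the exact gradient rather than a mini-batch estimate, so it only illustrates the arithmetic of the update.

```python
def sgd_step(theta, grad, lr):
    """One SGD update: theta <- theta - lr * grad, elementwise."""
    return [p - lr * g for p, g in zip(theta, grad)]


def quadratic_grad(theta):
    """Gradient of the toy loss L(theta) = 0.5 * sum(theta_i ** 2)."""
    return list(theta)


# Assumed toy setup: start away from the minimum at the origin.
theta = [1.0, -2.0]
for _ in range(100):
    theta = sgd_step(theta, quadratic_grad(theta), lr=0.1)

# Each step multiplies every coordinate by (1 - lr), so theta shrinks toward 0.
print(theta)
```

On this loss every step scales the parameters by $(1 - \eta)$, which is why a learning rate above 2 would diverge here.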

SGD with Momentum

Definition: Momentum

Momentum accelerates convergence by accumulating velocity from past gradients:

$$v_t = \beta v_{t-1} + \nabla_\theta \mathcal{L}(\theta_t)$$
$$\theta_{t+1} = \theta_t - \eta v_t$$

Momentum (typically $\beta = 0.9$) smooths out noisy gradients and accelerates convergence in consistent gradient directions.

Here,

  • $v_t$ = Velocity (accumulated gradient)
  • $\beta$ = Momentum coefficient (typically 0.9)
  • $\eta$ = Learning rate
  • $\nabla_\theta \mathcal{L}$ = Current gradient
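The effect of the velocity term can be seen in a small sketch for a single scalar parameter. The function name `momentum_step` and the constant-gradient setup are assumptions made for illustration only.

```python
def momentum_step(theta, velocity, grad, lr, beta=0.9):
    """v_t = beta * v_{t-1} + grad;  theta_{t+1} = theta_t - lr * v_t."""
    velocity = beta * velocity + grad
    theta = theta - lr * velocity
    return theta, velocity


# Assumed toy case: a constant gradient of 1.0 at every step.
# The velocity then builds up toward 1 / (1 - beta) = 10.
theta, velocity = 0.0, 0.0
for _ in range(200):
    theta, velocity = momentum_step(theta, velocity, grad=1.0, lr=0.01)

print(velocity)
```

With a consistent gradient direction the effective step approaches $\eta / (1 - \beta)$, ten times the plain SGD step for $\beta = 0.9$.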

AdaGrad

Definition: AdaGrad

AdaGrad adapts the learning rate per parameter based on historical gradients:

$$g_t = g_{t-1} + (\nabla_\theta \mathcal{L})^2$$
$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{g_t} + \epsilon} \nabla_\theta \mathcal{L}$$

Parameters with large historical gradients get smaller updates, and vice versa. This works well for sparse data, but because $g_t$ only grows, the effective learning rate shrinks monotonically and can become too small late in training.

Here,

  • $g_t$ = Sum of squared gradients (accumulated)
  • $\eta$ = Base learning rate
  • $\epsilon$ = Small constant for numerical stability (e.g. 1e-8)
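The shrinking step size can be made concrete with a scalar sketch. The name `adagrad_step`, the default `lr=0.1`, and the constant-gradient setup are assumptions chosen for illustration.

```python
import math


def adagrad_step(theta, g_sum, grad, lr=0.1, eps=1e-8):
    """Accumulate squared gradients, then scale the step by 1 / sqrt(sum)."""
    g_sum = g_sum + grad ** 2
    theta = theta - lr / (math.sqrt(g_sum) + eps) * grad
    return theta, g_sum


# Assumed toy case: constant gradient of 1.0, so g_t = t and the
# step taken at iteration t is roughly lr / sqrt(t).
theta, g_sum = 0.0, 0.0
step_sizes = []
for _ in range(100):
    new_theta, g_sum = adagrad_step(theta, g_sum, grad=1.0)
    step_sizes.append(theta - new_theta)
    theta = new_theta

print(step_sizes[0], step_sizes[-1])
```

Even though the gradient never changes, the hundredth step is about ten times smaller than the first.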

RMSProp

Definition: RMSProp

RMSProp fixes AdaGrad's aggressive decay by using an exponential moving average of squared gradients instead of a running sum:

$$v_t = \beta v_{t-1} + (1 - \beta)(\nabla_\theta \mathcal{L})^2$$
$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{v_t} + \epsilon} \nabla_\theta \mathcal{L}$$

Because old gradients are forgotten, the effective learning rate does not decay to zero, which makes RMSProp suitable for non-stationary objectives.

Here,

  • $v_t$ = Exponential moving average of squared gradients
  • $\beta$ = Decay rate (typically 0.9)
  • $\eta$ = Learning rate
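For contrast with AdaGrad, here is a scalar sketch of the RMSProp rule under the same assumed constant-gradient setup. The name `rmsprop_step` and the default values are illustrative choices, not taken from a specific library.

```python
import math


def rmsprop_step(theta, v, grad, lr=0.01, beta=0.9, eps=1e-8):
    """Exponential moving average of squared gradients, then a scaled step."""
    v = beta * v + (1 - beta) * grad ** 2
    theta = theta - lr / (math.sqrt(v) + eps) * grad
    return theta, v


# Assumed toy case: constant gradient of 1.0. The average v settles near 1,
# so the step size settles near lr instead of shrinking toward zero.
theta, v = 0.0, 0.0
step_sizes = []
for _ in range(200):
    new_theta, v = rmsprop_step(theta, v, grad=1.0)
    step_sizes.append(theta - new_theta)
    theta = new_theta

print(v, step_sizes[-1])
```

The early steps are larger than `lr` because `v` starts at zero and is not bias-corrected in this rule.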

Adam

Definition: Adam Optimizer

Adam combines momentum and RMSProp with bias correction:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) \nabla_\theta \mathcal{L}$$
$$v_t = \beta_2 v_{t-1} + (1 - \beta_2)(\nabla_\theta \mathcal{L})^2$$
$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$
$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t$$

Defaults: $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$. Adam is one of the most widely used optimizers for deep learning.
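The four equations translate directly into code. The sketch below is a scalar version with a hypothetical function name `adam_step`; it assumes `t` counts steps starting from 1, which the bias correction requires.

```python
import math


def adam_step(theta, m, v, grad, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad          # first moment (momentum)
    v = beta2 * v + (1 - beta2) * grad ** 2     # second moment (RMSProp-style)
    m_hat = m / (1 - beta1 ** t)                # undo the bias toward zero
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr / (math.sqrt(v_hat) + eps) * m_hat
    return theta, m, v


# First step with an assumed gradient of 5.0, starting from zero moments.
theta1, m1, v1 = adam_step(theta=0.0, m=0.0, v=0.0, grad=5.0, t=1)
print(theta1, m1, v1)
```

After one step the raw moment `m1` is only about 0.5, a tenth of the gradient, while the bias-corrected update moves the parameter by almost exactly `lr`. This is the bias toward zero that the correction terms remove.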

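Theorem: Adam Convergence Properties

Adam is not guaranteed to converge in general. The conditions $\sum_t \eta_t = \infty$ and $\sum_t \eta_t^2 < \infty$ are the classical step-size requirements for SGD; for Adam they are not sufficient on their own, and Reddi et al. (2018) give simple convex examples on which Adam fails to converge (their AMSGrad variant restores the guarantee). The bias correction terms ensure that early moment estimates, which start at zero, are not biased toward zero. In non-convex settings Adam, like other first-order methods, has no guarantee of reaching the global optimum, and its adaptive step sizes can make it oscillate around a minimum unless the learning rate is decayed.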


AdamW (Decoupled Weight Decay)

Definition: AdamW

AdamW decouples weight decay from the gradient update:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) \nabla_\theta \mathcal{L}$$
$$v_t = \beta_2 v_{t-1} + (1 - \beta_2)(\nabla_\theta \mathcal{L})^2$$
$$\theta_{t+1} = (1 - \eta \lambda)\theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t$$

where $\hat{m}_t$ and $\hat{v}_t$ are the bias-corrected moments defined for Adam and $\lambda$ is the weight decay coefficient.

Unlike L2 regularization, whose penalty gradient is rescaled by the adaptive learning rate, AdamW applies weight decay directly to the weights, which typically leads to better generalization. AdamW is a common default optimizer for many deep learning tasks.

ℹ️ Adam vs. AdamW

Standard Adam with L2 regularization applies the penalty through the gradient, which gets scaled by the adaptive learning rate. AdamW applies weight decay directly to the parameters, independent of the adaptive scaling. This decoupling leads to better generalization and is why AdamW is preferred.
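The difference is easiest to see when the loss gradient is zero, so that only the regularization acts. The sketch below is a scalar illustration under that assumption; the function names `adam_direction`, `adam_l2_step`, and `adamw_step` are hypothetical and do not mirror any library's internals.

```python
import math


def adam_direction(m, v, grad, t, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return updated moments and the bias-corrected Adam direction."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return m, v, m_hat / (math.sqrt(v_hat) + eps)


def adam_l2_step(theta, m, v, loss_grad, t, lr, wd):
    # L2 penalty: wd * theta is added to the gradient, then rescaled by Adam.
    m, v, direction = adam_direction(m, v, loss_grad + wd * theta, t)
    return theta - lr * direction, m, v


def adamw_step(theta, m, v, loss_grad, t, lr, wd):
    # Decoupled decay: the moments see only the loss gradient,
    # and the decay multiplies theta directly.
    m, v, direction = adam_direction(m, v, loss_grad, t)
    return (1 - lr * wd) * theta - lr * direction, m, v


# First step from theta = 2.0 with zero loss gradient (assumed toy case).
theta_l2, _, _ = adam_l2_step(2.0, 0.0, 0.0, loss_grad=0.0, t=1, lr=0.1, wd=0.01)
theta_w, _, _ = adamw_step(2.0, 0.0, 0.0, loss_grad=0.0, t=1, lr=0.1, wd=0.01)
print(theta_l2, theta_w)
```

With the L2 penalty, Adam normalizes the tiny penalty gradient and moves the weight by roughly `lr` regardless of how small `wd` is. With decoupled decay, the weight shrinks by exactly the factor `1 - lr * wd`.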


Learning Rate Schedules

Cosine Annealing

Definition: Cosine Annealing

The learning rate follows a cosine curve from $\eta_{\max}$ to $\eta_{\min}$:

$$\eta_t = \eta_{\min} + \frac{1}{2}(\eta_{\max} - \eta_{\min})\left(1 + \cos\left(\frac{\pi t}{T}\right)\right)$$

The decay is smooth and is widely used together with warmup. It is one of the most common schedules in modern deep learning.

Here,

  • $\eta_{\max}$ = Maximum learning rate
  • $\eta_{\min}$ = Minimum learning rate
  • $t$ = Current step
  • $T$ = Total number of steps
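The schedule is a one-line function of the step index. The sketch below uses a hypothetical helper name `cosine_lr` and illustrative values for the maximum and minimum learning rates.

```python
import math


def cosine_lr(t, total_steps, eta_max, eta_min=0.0):
    """Cosine annealing from eta_max at t = 0 to eta_min at t = total_steps."""
    cosine = math.cos(math.pi * t / total_steps)
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + cosine)


# Sample the schedule at the start, midpoint, and end (assumed values).
for t in (0, 50, 100):
    print(t, cosine_lr(t, total_steps=100, eta_max=1.0, eta_min=0.1))
```

The rate starts at the maximum, passes through the midpoint of the two bounds halfway through training, and ends at the minimum.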

Linear Warmup

Definition: Linear Warmup

Start with a small learning rate and linearly increase it to the target over $T_w$ warmup steps:

$$\eta_t = \eta_{\max} \cdot \frac{t}{T_w} \quad \text{for } t \leq T_w$$

Warmup stabilizes early training, when gradients are noisy and Adam's second-moment estimate is based on only a few samples and is therefore unreliable.

💡 Warmup Strategy

Warmup is critical for large batch training and Transformers. Start with $\eta_0 = 0$ or $\eta_0 = \eta_{\max}/100$, increase linearly to $\eta_{\max}$ over 1 to 5% of total steps, then apply cosine annealing. This prevents early divergence and improves final performance.
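One way to combine the two phases is sketched below. The helper name `warmup_cosine_lr` is hypothetical, warmup starts from zero, and the cosine phase is assumed to span exactly the steps that remain after warmup. Other conventions exist.

```python
import math


def warmup_cosine_lr(t, total_steps, warmup_steps, eta_max, eta_min=0.0):
    """Linear warmup from 0 to eta_max, then cosine decay down to eta_min."""
    if t < warmup_steps:
        return eta_max * t / warmup_steps
    # Fraction of the post-warmup phase completed, between 0 and 1.
    progress = (t - warmup_steps) / (total_steps - warmup_steps)
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * progress))


# Assumed example: 1000 warmup steps followed by 10000 cosine steps.
schedule = [warmup_cosine_lr(t, 11000, 1000, eta_max=3e-4, eta_min=1e-6)
            for t in range(11001)]
print(schedule[0], schedule[500], schedule[1000], schedule[-1])
```

The two branches agree at the end of warmup, so the schedule has no jump where the phases meet.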

Step Decay

Definition: Step Decay

Reduce the learning rate by a factor $\gamma$ at fixed intervals:

$$\eta_t = \eta_0 \cdot \gamma^{\lfloor t / S \rfloor}$$

where $S$ is the step interval and $\gamma$ is the decay factor (typically 0.1). Simple but requires manual tuning of when to decay.
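In code, the floor in the exponent is an integer division. The sketch below uses a hypothetical helper name `step_decay_lr`, and the interval of 30 is an illustrative assumption.

```python
def step_decay_lr(t, eta0, gamma=0.1, step_size=30):
    """eta_t = eta0 * gamma ** floor(t / step_size)."""
    return eta0 * gamma ** (t // step_size)


# The rate stays flat within each interval and drops at multiples of step_size.
for t in (0, 29, 30, 60):
    print(t, step_decay_lr(t, eta0=0.1))
```

Here `t` can count either steps or epochs, as long as `step_size` uses the same unit.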


PyTorch Implementation

📝 Example: Optimizers and Schedules

import torch
import torch.nn as nn
import torch.optim as optim
from torch.optim.lr_scheduler import (
    CosineAnnealingLR, LinearLR, SequentialLR, StepLR
)

model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Linear(256, 128),
    nn.ReLU(),
    nn.Linear(128, 10)
)

# ═══════════════════════════════════════════════════
# AdamW with Cosine Annealing + Warmup
# ═══════════════════════════════════════════════════

optimizer = optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)

# Warmup: 1000 steps, then cosine annealing over 10000 steps
warmup_scheduler = LinearLR(
    optimizer, start_factor=0.01, total_iters=1000
)
cosine_scheduler = CosineAnnealingLR(
    optimizer, T_max=10000, eta_min=1e-6
)
scheduler = SequentialLR(
    optimizer,
    schedulers=[warmup_scheduler, cosine_scheduler],
    milestones=[1000]
)

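# Build the loss function once instead of re-creating it on every step
criterion = nn.CrossEntropyLoss()

# Training loop: 1000 warmup steps + 10000 cosine annealing steps
for step in range(11000):
    # Forward pass (dummy data)
    x = torch.randn(32, 784)
    y = torch.randint(0, 10, (32,))
    output = model(x)
    loss = criterion(output, y)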

    # Backward pass
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()

    if step % 2000 == 0:
        print(f"Step {step}: LR={scheduler.get_last_lr()[0]:.6f}, Loss={loss.item():.4f}")

# ═══════════════════════════════════════════════════
# SGD with Momentum (for vision tasks)
# ═══════════════════════════════════════════════════

optimizer_sgd = optim.SGD(
    model.parameters(),
    lr=0.1,
    momentum=0.9,
    weight_decay=5e-4
)

# Step decay: reduce by 10x every 30 epochs
# (call scheduler_sgd.step() once per epoch, not once per batch)
scheduler_sgd = StepLR(optimizer_sgd, step_size=30, gamma=0.1)

ℹ️ When to Use Which Optimizer

  • AdamW: Default for most tasks, NLP, Transformers, GANs
  • SGD + Momentum: Computer vision (ResNet training), often achieves better generalization
  • Adam: Quick prototyping, when you don't want to tune learning rate carefully
  • LAMB: Large batch training (>8K samples)

Optimizer Comparison

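| Optimizer      | Adaptive LR | Memory (parameters + optimizer state) | Speed | Generalization  |
|----------------|-------------|---------------------------------------|-------|-----------------|
| SGD            | No          | 1x                                    | Fast  | Best            |
| SGD + Momentum | No          | 2x                                    | Fast  | Best            |
| AdaGrad        | Yes         | 2x                                    | Fast  | Good            |
| RMSProp        | Yes         | 2x                                    | Fast  | Good            |
| Adam           | Yes         | 3x                                    | Fast  | Good            |
| AdamW          | Yes         | 3x                                    | Fast  | Best (adaptive) |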

Summary

📋 Summary: Optimizers for Deep Learning

  • SGD + Momentum: Often the best generalization on vision tasks, but typically slower to converge and requires careful LR tuning
  • Adam: Fast convergence, adaptive LR per parameter, default for prototyping
  • AdamW: Adam with decoupled weight decay, a common default for most tasks
  • Cosine annealing + warmup: A widely used learning rate schedule
  • Weight decay: With adaptive optimizers, use decoupled weight decay (AdamW) rather than adding an L2 penalty to the loss
  • Learning rate is usually the single most important hyperparameter
  • Batch size and learning rate are coupled: when you increase the batch size, scale the LR up as well (linear scaling is a common heuristic for SGD)
  • Mixed precision training: torch.amp can give up to roughly 2x speedup on supported GPUs, usually with minimal accuracy loss

Practice Exercises

  1. Conceptual: Explain why Adam's bias correction is necessary. What happens to the first few gradient estimates without it?

  2. Experiment: Train ResNet-18 on CIFAR-10 with SGD (lr=0.1, momentum=0.9) vs. Adam (lr=1e-3). Which achieves better final accuracy? Plot the training curves.

  3. Coding: Implement a custom optimizer class in PyTorch that combines Adam with gradient clipping. Test it on a simple regression task.

  4. Visualization: Plot the learning rate schedules (cosine, step decay, linear warmup) for 10000 training steps. How do they differ?

  5. Research: Read the AdamW paper (Loshchilov & Hutter, 2017). Why does decoupled weight decay lead to better generalization than L2 regularization?
