← Math|63 of 100
Optimization

Stochastic Gradient Descent & Adaptive Optimizers

Master SGD, mini-batch sampling, adaptive learning rates, Adam, and modern optimizer design for deep learning.

šŸ“‚ StochasticšŸ“– Lesson 63 of 100šŸŽ“ Free Course

Advertisement

Stochastic Gradient Descent

šŸ’” Why It Matters

Stochastic Gradient Descent (SGD) is the foundation of all modern deep learning optimization. Every neural network trained today — from GPT to ResNet to diffusion models — relies on SGD or one of its adaptive variants. Understanding SGD is non-negotiable for any machine learning practitioner.

In batch gradient descent, computing the full gradient over the entire dataset is prohibitively expensive when datasets contain millions of examples. SGD solves this by approximating the gradient using a small random subset of data, enabling scalable training of large models on massive datasets. The key insight is that a noisy but cheap gradient estimate is often better than an exact but expensive one.


SGD Algorithm

Stochastic Gradient Descent Update

xk+1=xkāˆ’Ī±k gkx_{k+1} = x_k - \alpha_k \, g_k

Here,

  • xkx_k=Parameter vector at iteration k
  • αk\alpha_k=Learning rate (step size) at iteration k
  • gkg_k=Stochastic gradient estimate at iteration k

DfStochastic Gradient

The full batch gradient requires computing āˆ‡f(x) = (1/n) Σᵢ āˆ‡fįµ¢(x) over all n samples. SGD replaces this with āˆ‡f_{i_k}(x) for a single random index i_k, reducing per-iteration cost from O(n) to O(1).


Mini-Batch SGD

In practice, purely stochastic (batch size = 1) gradients are too noisy. Mini-batch SGD strikes a balance by sampling a small batch B of size b from the training set:

Mini-Batch Gradient

gk=1bāˆ‘i∈Bkāˆ‡fi(xk)g_k = \frac{1}{b} \sum_{i \in B_k} \nabla f_i(x_k)

Here,

  • BkB_k=Mini-batch sampled uniformly at random from training set
  • bb=Mini-batch size (typically 32, 64, 128, 256)
  • āˆ‡fi(xk)\nabla f_i(x_k)=Gradient on the i-th training example

Mini-Batch SGD Update

xk+1=xkāˆ’Ī±kā‹…1bāˆ‘i∈Bkāˆ‡fi(xk)x_{k+1} = x_k - \alpha_k \cdot \frac{1}{b} \sum_{i \in B_k} \nabla f_i(x_k)

Here,

  • bb=Mini-batch size
  • αk\alpha_k=Learning rate
Batch SizeGradient QualityCompute CostGPU Utilization
1Very noisyLowPoor
32–256Good trade-offModerateHigh
512–4096Near-exactHighExcellent
All dataExactVery highWasteful

Common batch sizes in practice: 32, 64, 128, 256. Larger batches provide more accurate gradients but require more memory and may generalize worse.


Why SGD Works

šŸ’” Unbiased Gradient Estimate

A key property that makes SGD valid is that the stochastic gradient is an unbiased estimator of the true gradient. This means that on average, SGD points in the correct downhill direction.

ThUnbiasedness of SGD

Proof sketch: Since each data point i is sampled uniformly at random:

E[g_k | x_k] = E[(1/b) Σᵢ∈B āˆ‡fįµ¢(x_k)] = (1/n) Ī£įµ¢ā‚Œā‚āæ āˆ‡fįµ¢(x_k) = āˆ‡f(x_k)

This unbiasedness guarantee ensures that, given a constant learning rate α and Lipschitz continuous gradients, SGD converges to a neighborhood of the optimum with radius proportional to α. Decreasing the learning rate over time shrinks this neighborhood, enabling convergence to the exact solution.

šŸ’” Convergence Behavior

Unlike batch GD which converges to the exact minimizer, SGD with constant step size converges to a neighborhood of the optimum. The size of this neighborhood depends on the learning rate and the variance of the stochastic gradients. A decaying learning rate schedule α_k → 0 is required for exact convergence.


Variance of SGD Gradients

DfGradient Variance

The variance σ² directly affects SGD convergence speed. High variance means noisier updates, requiring smaller learning rates and more iterations. This motivates variance reduction techniques.

ThSGD Convergence Rate (Convex)

The first term O(1/K) decreases with more iterations. The second term ασ²/2 is the "noise floor" — the irreducible error due to gradient variance. Reducing α shrinks the noise floor but slows convergence, creating a fundamental trade-off.


Variance Reduction

Several techniques reduce SGD variance without requiring full-batch gradients:

SVRG (Stochastic Variance Reduced Gradient): Periodically computes a full gradient to anchor the stochastic estimate, reducing variance to zero near the solution.

SVRG Gradient Estimate

g~k=āˆ‡fik(xk)āˆ’āˆ‡fik(x~)+āˆ‡f(x~)\tilde{g}_k = \nabla f_{i_k}(x_k) - \nabla f_{i_k}(\tilde{x}) + \nabla f(\tilde{x})

Here,

  • x~\tilde{x}=Snapshot point where full gradient was computed
  • āˆ‡f(x~)\nabla f(\tilde{x})=Full gradient at snapshot

SAGA: Maintains a table of individual gradients, providing unbiased variance reduction with linear convergence for strongly convex functions.

Importance Sampling: Sampling data points with probability proportional to their gradient magnitude reduces variance compared to uniform sampling.

šŸ’” When Variance Reduction Matters

Variance reduction is most useful for small-to-medium datasets where the cost of occasional full-gradient computations is acceptable. For large-scale deep learning, mini-batch SGD with adaptive methods (Adam) is typically preferred.


Momentum

SGD with momentum accelerates convergence by accumulating velocity in directions of consistent gradient.

SGD with Momentum

vk=β vkāˆ’1+āˆ‡f(xk)v_k = \beta \, v_{k-1} + \nabla f(x_k)

Here,

  • vkv_k=Velocity (momentum buffer) at iteration k
  • β\beta=Momentum coefficient (typically 0.9)

Position Update with Momentum

xk+1=xkāˆ’Ī±k vkx_{k+1} = x_k - \alpha_k \, v_k

Here,

  • xkx_k=Current parameter vector
  • αk\alpha_k=Learning rate

DfNesterov Accelerated Gradient

Why momentum helps:

  • Accumulates velocity in consistent gradient directions, accelerating convergence along ravines
  • Dampens oscillations in directions with high curvature
  • Effective learning rate is amplified by factor 1/(1-β) ā‰ˆ 10 for β=0.9
Momentum βEffective LR MultiplierBehavior
0.01ƗVanilla SGD
0.910ƗStandard momentum
0.99100ƗHeavy momentum
0.01ƗNo momentum

AdaGrad

DfAdaGrad

AdaGrad Accumulator

Gk=Gkāˆ’1+gkāŠ™gkG_k = G_{k-1} + g_k \odot g_k

Here,

  • GkG_k=Diagonal matrix of accumulated squared gradients
  • gkg_k=Current gradient
  • ϵ\epsilon=Small constant for numerical stability (typically 1e-8)

Key properties:

  • Parameters with large accumulated gradients get smaller effective learning rates
  • Parameters with sparse gradients get larger effective learning rates
  • Well-suited for sparse data (NLP, recommender systems)

Critical limitation: The accumulator G_k only grows, causing the learning rate to decay aggressively and eventually approach zero. This can halt training prematurely.

class AdaGrad:
    def __init__(self, lr=0.01, eps=1e-8):
        self.lr = lr
        self.eps = eps
        self.G = None

    def step(self, x, grad):
        if self.G is None:
            self.G = np.zeros_like(x)
        self.G += grad ** 2
        return x - self.lr * grad / (np.sqrt(self.G) + self.eps)

RMSProp

DfRMSProp

RMSProp Update

xk+1=xkāˆ’Ī±vk+ĻµāŠ™gkx_{k+1} = x_k - \frac{\alpha}{\sqrt{v_k + \epsilon}} \odot g_k

Here,

  • vkv_k=Exponential moving average of squared gradients
  • β\beta=Decay rate (typically 0.9)
  • α\alpha=Global learning rate (typically 0.001)

RMSProp vs AdaGrad: RMSProp replaces the growing accumulator with an exponential moving average, preventing the learning rate from decaying to zero. The effective window is approximately 1/(1-β) ā‰ˆ 10 recent gradients.

PropertyAdaGradRMSProp
AccumulatorSum of all past g²Exponential moving average
LR decayAggressive, monotoneBounded, adaptive
Best forSparse, convexNon-stationary, RNNs
class RMSProp:
    def __init__(self, lr=0.001, beta=0.9, eps=1e-8):
        self.lr = lr
        self.beta = beta
        self.eps = eps
        self.v = None

    def step(self, x, grad):
        if self.v is None:
            self.v = np.zeros_like(x)
        self.v = self.beta * self.v + (1 - self.beta) * grad ** 2
        return x - self.lr * grad / (np.sqrt(self.v) + self.eps)

Adam Optimizer

DfAdam (Adaptive Moment Estimation)

Adam First Moment (Momentum)

mk=β1 mkāˆ’1+(1āˆ’Ī²1) gkm_k = \beta_1 \, m_{k-1} + (1 - \beta_1) \, g_k

Here,

  • mkm_k=Exponential moving average of gradients (first moment)
  • β1\beta_1=First moment decay rate (typically 0.9)

Adam Second Moment (Variance)

vk=β2 vkāˆ’1+(1āˆ’Ī²2) gk2v_k = \beta_2 \, v_{k-1} + (1 - \beta_2) \, g_k^2

Here,

  • vkv_k=Exponential moving average of squared gradients (second moment)
  • β2\beta_2=Second moment decay rate (typically 0.999)

Bias Correction

m^k=mk1āˆ’Ī²1k,v^k=vk1āˆ’Ī²2k\hat{m}_k = \frac{m_k}{1 - \beta_1^k}, \quad \hat{v}_k = \frac{v_k}{1 - \beta_2^k}

Here,

  • m^k\hat{m}_k=Bias-corrected first moment estimate
  • v^k\hat{v}_k=Bias-corrected second moment estimate
  • β1k\beta_1^k=Decay factor raised to power k

šŸ’” Why Bias Correction Matters

At initialization, mā‚€ = 0 and vā‚€ = 0. The EMA estimates are biased toward zero in early iterations. Dividing by (1 - βᵗ) corrects this bias. After ~10 iterations with β₁=0.9, the correction factor is ā‰ˆ 1/0.65 ā‰ˆ 1.54, which is substantial.

Adam Full Algorithm

Architecture Diagram
Algorithm: Adam Optimizer
─────────────────────────────────────────────
Require: Learning rate α (default: 0.001)
Require: Decay rates β₁ = 0.9, β₂ = 0.999
Require: Numerical stability ε = 1e-8
Require: Initial parameters xā‚€

1:  Initialize mā‚€ ← 0, vā‚€ ← 0, t ← 0
2:  repeat
3:      t ← t + 1
4:      Sample mini-batch B_k
5:      g_k ← (1/|B_k|) Σᵢ∈B_k āˆ‡fįµ¢(x_{t-1})
6:      m_t ← β₁ Ā· m_{t-1} + (1 - β₁) Ā· g_t       [Update biased first moment]
7:      v_t ← β₂ Ā· v_{t-1} + (1 - β₂) Ā· g_t²       [Update biased second moment]
8:      mĢ‚_t ← m_t / (1 - β₁ᵗ)                       [Bias correction]
9:      vĢ‚_t ← v_t / (1 - β₂ᵗ)                       [Bias correction]
10:     x_t ← x_{t-1} - α Ā· mĢ‚_t / (√vĢ‚_t + ε)       [Parameter update]
11: until convergence
─────────────────────────────────────────────
class Adam:
    def __init__(self, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr = lr
        self.beta1 = beta1
        self.beta2 = beta2
        self.eps = eps
        self.m = None
        self.v = None
        self.t = 0

    def step(self, x, grad):
        self.t += 1
        if self.m is None:
            self.m = np.zeros_like(x)
            self.v = np.zeros_like(x)

        self.m = self.beta1 * self.m + (1 - self.beta1) * grad
        self.v = self.beta2 * self.v + (1 - self.beta2) * grad ** 2

        m_hat = self.m / (1 - self.beta1 ** self.t)
        v_hat = self.v / (1 - self.beta2 ** self.t)

        return x - self.lr * m_hat / (np.sqrt(v_hat) + self.eps)

AdamW: Decoupled Weight Decay

AdamW Update

xk+1=xkāˆ’Ī±(m^kv^k+ϵ+λ xk)x_{k+1} = x_k - \alpha \left( \frac{\hat{m}_k}{\sqrt{\hat{v}_k} + \epsilon} + \lambda \, x_k \right)

Here,

  • Ī»\lambda=Weight decay coefficient (typically 0.01)

šŸ’” Adam vs AdamW

In standard Adam with L2 regularization, weight decay is coupled with adaptive learning rates, weakening its regularization effect. AdamW decouples weight decay, applying it directly to parameters independent of the gradient, which provides better generalization.


Learning Rate Warmup

DfLinear Warmup

Why warmup is needed:

  • At initialization, Adam's second moment estimates vĢ‚ are noisy and unreliable
  • Large gradients in early training can cause divergent updates
  • Warmup allows moment estimates to stabilize before using full learning rates
  • Especially important for transformers and large batch training

Common warmup schedules:

ScheduleFormulaWhen to Use
Linear warmupα_t = α_target · t/T_warmupTransformers, large batches
Gradual warmupα_t = α_target · min(t/T, (1-t/T)·s + t/T)General purpose
No warmupα_t = α_targetSmall models, SGD with momentum
def warmup_lr(step, warmup_steps, target_lr):
    if step < warmup_steps:
        return target_lr * step / warmup_steps
    return target_lr

Python Implementation

import numpy as np

class SGD:
    def __init__(self, lr=0.01, momentum=0.0, weight_decay=0.0):
        self.lr = lr
        self.momentum = momentum
        self.weight_decay = weight_decay
        self.v = None

    def step(self, x, grad):
        if self.v is None:
            self.v = np.zeros_like(x)

        if self.weight_decay > 0:
            grad = grad + self.weight_decay * x

        self.v = self.momentum * self.v + grad
        return x - self.lr * self.v


class AdaGrad:
    def __init__(self, lr=0.01, eps=1e-8):
        self.lr = lr
        self.eps = eps
        self.G = None

    def step(self, x, grad):
        if self.G is None:
            self.G = np.zeros_like(x)
        self.G += grad ** 2
        return x - self.lr * grad / (np.sqrt(self.G) + self.eps)


class RMSProp:
    def __init__(self, lr=0.001, beta=0.9, eps=1e-8):
        self.lr = lr
        self.beta = beta
        self.eps = eps
        self.v = None

    def step(self, x, grad):
        if self.v is None:
            self.v = np.zeros_like(x)
        self.v = self.beta * self.v + (1 - self.beta) * grad ** 2
        return x - self.lr * grad / (np.sqrt(self.v) + self.eps)


class Adam:
    def __init__(self, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr = lr
        self.beta1 = beta1
        self.beta2 = beta2
        self.eps = eps
        self.m = None
        self.v = None
        self.t = 0

    def step(self, x, grad):
        self.t += 1
        if self.m is None:
            self.m = np.zeros_like(x)
            self.v = np.zeros_like(x)

        self.m = self.beta1 * self.m + (1 - self.beta1) * grad
        self.v = self.beta2 * self.v + (1 - self.beta2) * grad ** 2

        m_hat = self.m / (1 - self.beta1 ** self.t)
        v_hat = self.v / (1 - self.beta2 ** self.t)

        return x - self.lr * m_hat / (np.sqrt(v_hat) + self.eps)


class AdamW:
    def __init__(self, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=0.01):
        self.lr = lr
        self.beta1 = beta1
        self.beta2 = beta2
        self.eps = eps
        self.weight_decay = weight_decay
        self.m = None
        self.v = None
        self.t = 0

    def step(self, x, grad):
        self.t += 1
        if self.m is None:
            self.m = np.zeros_like(x)
            self.v = np.zeros_like(x)

        self.m = self.beta1 * self.m + (1 - self.beta1) * grad
        self.v = self.beta2 * self.v + (1 - self.beta2) * grad ** 2

        m_hat = self.m / (1 - self.beta1 ** self.t)
        v_hat = self.v / (1 - self.beta2 ** self.t)

        return x - self.lr * (m_hat / (np.sqrt(v_hat) + self.eps) + self.weight_decay * x)


# --- Demo: Compare optimizers on Rosenbrock function ---
def rosenbrock(x):
    return (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2

def rosenbrock_grad(x):
    dx0 = -2 * (1 - x[0]) + 200 * (x[1] - x[0]**2) * (-2 * x[0])
    dx1 = 200 * (x[1] - x[0]**2)
    return np.array([dx0, dx1])

optimizers = {
    "SGD": SGD(lr=0.001, momentum=0.9),
    "AdaGrad": AdaGrad(lr=0.05),
    "RMSProp": RMSProp(lr=0.001),
    "Adam": Adam(lr=0.005),
    "AdamW": AdamW(lr=0.005, weight_decay=0.01),
}

for name, opt in optimizers.items():
    x = np.array([-1.0, 1.0])
    for i in range(5000):
        grad = rosenbrock_grad(x)
        x = opt.step(x, grad)
    print(f"{name:10s} | Final loss: {rosenbrock(x):.6f} | Position: ({x[0]:.4f}, {x[1]:.4f})")

Applications in AI/ML

Deep Learning Training

SGD and its variants are used in virtually all neural network training:

ApplicationTypical OptimizerKey Consideration
Image classification (CNN)SGD + momentum (0.9) + cosine LRGeneralization gap vs Adam
Transformers (LLMs)AdamW + warmup + cosine LRLarge batch, stability
GANsAdam (β₁=0.0, β₂=0.9)Non-stationary objectives
Reinforcement learningAdamNon-stationary data distribution
NLP / EmbeddingsAdaGrad or AdamSparse gradients
Diffusion modelsAdamW + warmupLarge models, long training

šŸ’” SGD vs Adam for Generalization

SGD with momentum often achieves better generalization than Adam on vision tasks, despite slower convergence. This is because Adam's adaptive LR can overfit to sharp minima. Many practitioners train with Adam for speed, then fine-tune with SGD for final performance.

Learning Rate Schedules in Practice

Cosine Annealing:

Cosine Annealing

αt=αmin⁔+12(αmaxā”āˆ’Ī±min⁔)(1+cos⁔(Ļ€t/T))\alpha_t = \alpha_{\min} + \frac{1}{2}(\alpha_{\max} - \alpha_{\min})(1 + \cos(\pi t / T))

Here,

  • αmin⁔\alpha_{\min}=Minimum learning rate (typically 1e-6)
  • αmax⁔\alpha_{\max}=Maximum learning rate
  • TT=Total number of training steps

One Cycle Policy:

  • Linearly warmup from low LR to peak LR for first ~30% of training
  • Cosine decay from peak to very low LR for remaining ~70%
  • Popularized by Leslie Smith; used in fast.ai course

Common Mistakes

MistakeSymptomFix
Learning rate too highLoss oscillates or divergesUse LR finder; start with 1e-3 for Adam
Learning rate too lowExtremely slow convergenceIncrease by 10Ɨ until loss decreases faster
No warmup with large batchTraining instability, NaN lossAdd linear warmup for 5–10% of training
Using Adam without weight decayOverfitting, poor generalizationSwitch to AdamW with weight_decay=0.01
Ignoring gradient clippingExploding gradients in RNNsClip gradients at norm 1.0
Wrong β for AdamSlow or unstable trainingUse defaults: β₁=0.9, β₂=0.999
Batch size too largePoor generalizationReduce batch size or increase LR linearly
Not shuffling dataBiased gradient estimatesShuffle training data each epoch
Constant learning rateFails to converge to sharp minimumUse cosine decay or step decay
Evaluating on training lossMisleading convergence assessmentAlways monitor validation loss

Interview Questions

Q1: Why is SGD preferred over batch gradient descent for deep learning? A: Batch GD requires computing the gradient over the entire dataset per update, which is infeasible for millions of samples. SGD uses mini-batches, enabling faster iteration, lower memory usage, and the noise helps escape saddle points. The noise also acts as implicit regularization, often improving generalization.

Q2: What is the role of the learning rate in SGD, and how do you choose it? A: The learning rate controls step size. Too large causes divergence; too small causes slow convergence. In practice, use a learning rate finder (sweep log-scale LR and plot loss), start with Adam lr=1e-3 or SGD lr=0.1, and apply cosine annealing or step decay schedules.

Q3: Explain bias correction in Adam. Why is it necessary? A: At initialization mā‚€=vā‚€=0, so early EMA estimates are biased toward zero. Dividing by (1-βᵗ) corrects this. For β₁=0.9, the correction factor at t=1 is 1/0.1=10, which is critical. After ~20 steps, the correction becomes negligible (<5%).

Q4: Why does AdamW outperform Adam with L2 regularization? A: In Adam, L2 regularization adds Ī»x to the gradient, which is then scaled by 1/√vĢ‚. This means the effective weight decay varies per-parameter inversely to gradient magnitude. AdamW decouples weight decay, applying Ī»x directly without adaptive scaling, providing uniform regularization.

Q5: When would you use SGD+momentum over Adam? A: SGD+momentum often achieves better final generalization on vision tasks (ImageNet, etc.) despite slower convergence. Adam converges faster but may converge to sharper minima. Common pattern: train with Adam for speed, then switch to SGD for fine-tuning.

Q6: What happens if you use β₂=0.999 with a very large batch size? A: The second moment estimate vĢ‚ becomes a very long-running average, potentially stale for non-stationary objectives. Consider reducing β₂ to 0.99 or 0.95 for GANs or fast-changing loss landscapes.

Q7: Explain the relationship between batch size and learning rate. A: Linear scaling rule: when batch size is multiplied by k, multiply LR by k. This maintains the same signal-to-noise ratio in gradient updates. Must be combined with warmup to avoid early instability. (Goyal et al., 2017)


Practice Problems

šŸ“Problem 1: SGD Convergence

Show that for a quadratic function f(x) = (1/2)xįµ€Ax - bįµ€x with A positive definite, the expected value of the SGD update after one step satisfies E[x₁] = (I - αA)xā‚€ + αb, which is identical to batch gradient descent.

šŸ’”Solution

Since E[g] = āˆ‡f(x) = Ax - b for quadratic functions, we have E[x₁] = xā‚€ - α·E[g] = xā‚€ - α(Axā‚€ - b) = (I - αA)xā‚€ + αb. The expected update matches batch GD exactly — the noise only affects higher moments, not the mean trajectory.

šŸ“Problem 2: Why Momentum Helps

Consider minimizing f(x) = x₁² + 100x₂² (a narrow valley). Explain why SGD oscillates and how momentum fixes this.

šŸ’”Solution

The gradient in xā‚‚ is 100Ɨ larger than in x₁, causing oscillations along xā‚‚. Momentum accumulates velocity: oscillating directions average out (β alternates sign), while consistent x₁ direction accumulates. Effective learning rate becomes α/(1-β) ā‰ˆ 10α in x₁, dramatically speeding convergence along the valley floor.

šŸ“Problem 3: AdaGrad Decay

In AdaGrad, if a parameter has constant gradient g, show that the effective learning rate decays as O(1/√t).

šŸ’”Solution

After t steps, Gā‚œ = Ī£įµ¢ā‚Œā‚įµ— g² = tĀ·g². The effective learning rate is α/√(tĀ·g²) = α/(g·√t), which decays as O(1/√t). This aggressive decay is why AdaGrad can halt training prematurely — learning rates vanish for parameters with consistently large gradients.

šŸ“Problem 4: Adam Hyperparameters

If you double β₂ in Adam from 0.999 to 0.9999, what happens to the bias correction factor and the effective window of the second moment estimate?

šŸ’”Solution

The bias correction factor 1/(1-β₂ᵗ) increases: at t=1000, 1/(1-0.999¹⁰⁰⁰) ā‰ˆ 1/0.632 ā‰ˆ 1.58 vs 1/(1-0.9999¹⁰⁰⁰) ā‰ˆ 1/0.095 ā‰ˆ 10.5. The effective window increases from ~1000 steps to ~10,000 steps. This makes vĢ‚ smoother but slower to adapt to changes in gradient variance.

šŸ“Problem 5: Batch Size Scaling

A model trains with batch size 128 and learning rate 0.01. If you increase batch size to 1024 (8Ɨ larger), what should the new learning rate be according to the linear scaling rule? What warmup strategy would you use?

šŸ’”Solution

Linear scaling: new LR = 0.01 Ɨ 8 = 0.08. Use linear warmup: start at LR = 0.01 for first ~5 epochs (or ~5% of total training), then linearly ramp to 0.08. This prevents instability from large initial gradients amplified by the large LR.

šŸ“Problem 6: Gradient Variance Reduction

If you increase mini-batch size from 32 to 512, by what factor does the variance of the gradient estimate decrease (assuming independent samples)?

šŸ’”Solution

For independent samples, variance scales as σ²/b. Increasing b from 32 to 512 (16Ɨ) decreases variance by factor 16. The standard deviation decreases by factor √16 = 4. This is why larger batches give smoother gradients but with diminishing returns.

šŸ“Problem 7: Adam vs SGD Generalization

You observe that Adam converges 5Ɨ faster than SGD on a computer vision task, but SGD achieves 2% higher accuracy on the test set. Propose two strategies to get the best of both worlds.

šŸ’”Solution

Strategy 1: Train with Adam for the first 80% of training for fast convergence, then switch to SGD with momentum for the final 20% to settle into a flatter minimum. Strategy 2: Use AdamW with cosine annealing for full training but increase weight decay (e.g., 0.05–0.1) to regularize sharper minima. Both strategies leverage Adam's speed while improving generalization.


Quick Reference

AlgorithmUpdate RuleAdaptive LRMomentumBest For
SGDx - αgNoOptionalVision, final training
AdaGradx - αg/√(G+ε)YesNoSparse features, NLP
RMSPropx - αg/√(v+ε)YesNoRNNs, non-stationary
Adamx - αmĢ‚/(√vĢ‚+ε)YesYesDefault optimizer
AdamWx - α(mĢ‚/(√vĢ‚+ε) + Ī»x)YesYesTransformers, general

Default hyperparameters:

ParameterSGDAdamAdamW
Learning rate0.10.0010.001
β₁ (momentum)0.90.90.9
β₂—0.9990.999
Weight decay000.01
Epsilon—1e-81e-8

Key formulas:

  • SGD: x ← x - α·g
  • Momentum: v ← βv + g; x ← x - αv
  • AdaGrad: G ← G + g²; x ← x - αg/√(G+ε)
  • RMSProp: v ← 0.9v + 0.1g²; x ← x - αg/√(v+ε)
  • Adam: m ← 0.9m + 0.1g; v ← 0.999v + 0.001g²; x ← x - α·mĢ‚/(√vĢ‚+ε)

Cross-References

  • Gradient Descent (062): The deterministic foundation; batch gradient descent converges exactly but is slow.
  • Newton's Method (064): Second-order methods that use curvature information for faster convergence.
  • Convex Optimization (061): Theoretical convergence guarantees for SGD in convex settings.
  • Constrained Optimization (065): Extends SGD to constrained problems via projected variants.
  • Hyperparameter Optimization (069): Finding optimal learning rates and batch sizes systematically.
  • Calculus: Partial Derivatives (027): Foundation for computing gradients in multi-dimensional optimization.
  • Calculus: Optimization (030): Classical unconstrained optimization theory and optimality conditions.
  • Linear Algebra: Norms (015): Measuring gradient magnitude and parameter distance in optimization.
  • Probability: Expectation (039): Expected value analysis of SGD convergence.
  • Probability: CLT (042): Central Limit Theorem explains why mini-batch gradients concentrate around the true gradient.
Lesson Progress63 / 100