Stochastic Gradient Descent
š” Why It Matters
Stochastic Gradient Descent (SGD) is the foundation of all modern deep learning optimization. Every neural network trained today ā from GPT to ResNet to diffusion models ā relies on SGD or one of its adaptive variants. Understanding SGD is non-negotiable for any machine learning practitioner.
In batch gradient descent, computing the full gradient over the entire dataset is prohibitively expensive when datasets contain millions of examples. SGD solves this by approximating the gradient using a small random subset of data, enabling scalable training of large models on massive datasets. The key insight is that a noisy but cheap gradient estimate is often better than an exact but expensive one.
SGD Algorithm
Stochastic Gradient Descent Update
Here,
- =Parameter vector at iteration k
- =Learning rate (step size) at iteration k
- =Stochastic gradient estimate at iteration k
DfStochastic Gradient
The full batch gradient requires computing āf(x) = (1/n) Σᵢ āfįµ¢(x) over all n samples. SGD replaces this with āf_{i_k}(x) for a single random index i_k, reducing per-iteration cost from O(n) to O(1).
Mini-Batch SGD
In practice, purely stochastic (batch size = 1) gradients are too noisy. Mini-batch SGD strikes a balance by sampling a small batch B of size b from the training set:
Mini-Batch Gradient
Here,
- =Mini-batch sampled uniformly at random from training set
- =Mini-batch size (typically 32, 64, 128, 256)
- =Gradient on the i-th training example
Mini-Batch SGD Update
Here,
- =Mini-batch size
- =Learning rate
| Batch Size | Gradient Quality | Compute Cost | GPU Utilization |
|---|---|---|---|
| 1 | Very noisy | Low | Poor |
| 32ā256 | Good trade-off | Moderate | High |
| 512ā4096 | Near-exact | High | Excellent |
| All data | Exact | Very high | Wasteful |
Common batch sizes in practice: 32, 64, 128, 256. Larger batches provide more accurate gradients but require more memory and may generalize worse.
Why SGD Works
š” Unbiased Gradient Estimate
A key property that makes SGD valid is that the stochastic gradient is an unbiased estimator of the true gradient. This means that on average, SGD points in the correct downhill direction.
ThUnbiasedness of SGD
Proof sketch: Since each data point i is sampled uniformly at random:
E[g_k | x_k] = E[(1/b) ΣᵢāB āfįµ¢(x_k)] = (1/n) Σᵢāāāæ āfįµ¢(x_k) = āf(x_k)
This unbiasedness guarantee ensures that, given a constant learning rate α and Lipschitz continuous gradients, SGD converges to a neighborhood of the optimum with radius proportional to α. Decreasing the learning rate over time shrinks this neighborhood, enabling convergence to the exact solution.
š” Convergence Behavior
Unlike batch GD which converges to the exact minimizer, SGD with constant step size converges to a neighborhood of the optimum. The size of this neighborhood depends on the learning rate and the variance of the stochastic gradients. A decaying learning rate schedule α_k ā 0 is required for exact convergence.
Variance of SGD Gradients
DfGradient Variance
The variance ϲ directly affects SGD convergence speed. High variance means noisier updates, requiring smaller learning rates and more iterations. This motivates variance reduction techniques.
ThSGD Convergence Rate (Convex)
The first term O(1/K) decreases with more iterations. The second term αϲ/2 is the "noise floor" ā the irreducible error due to gradient variance. Reducing α shrinks the noise floor but slows convergence, creating a fundamental trade-off.
Variance Reduction
Several techniques reduce SGD variance without requiring full-batch gradients:
SVRG (Stochastic Variance Reduced Gradient): Periodically computes a full gradient to anchor the stochastic estimate, reducing variance to zero near the solution.
SVRG Gradient Estimate
Here,
- =Snapshot point where full gradient was computed
- =Full gradient at snapshot
SAGA: Maintains a table of individual gradients, providing unbiased variance reduction with linear convergence for strongly convex functions.
Importance Sampling: Sampling data points with probability proportional to their gradient magnitude reduces variance compared to uniform sampling.
š” When Variance Reduction Matters
Variance reduction is most useful for small-to-medium datasets where the cost of occasional full-gradient computations is acceptable. For large-scale deep learning, mini-batch SGD with adaptive methods (Adam) is typically preferred.
Momentum
SGD with momentum accelerates convergence by accumulating velocity in directions of consistent gradient.
SGD with Momentum
Here,
- =Velocity (momentum buffer) at iteration k
- =Momentum coefficient (typically 0.9)
Position Update with Momentum
Here,
- =Current parameter vector
- =Learning rate
DfNesterov Accelerated Gradient
Why momentum helps:
- Accumulates velocity in consistent gradient directions, accelerating convergence along ravines
- Dampens oscillations in directions with high curvature
- Effective learning rate is amplified by factor 1/(1-β) ā 10 for β=0.9
| Momentum β | Effective LR Multiplier | Behavior |
|---|---|---|
| 0.0 | 1Ć | Vanilla SGD |
| 0.9 | 10Ć | Standard momentum |
| 0.99 | 100Ć | Heavy momentum |
| 0.0 | 1Ć | No momentum |
AdaGrad
DfAdaGrad
AdaGrad Accumulator
Here,
- =Diagonal matrix of accumulated squared gradients
- =Current gradient
- =Small constant for numerical stability (typically 1e-8)
Key properties:
- Parameters with large accumulated gradients get smaller effective learning rates
- Parameters with sparse gradients get larger effective learning rates
- Well-suited for sparse data (NLP, recommender systems)
Critical limitation: The accumulator G_k only grows, causing the learning rate to decay aggressively and eventually approach zero. This can halt training prematurely.
class AdaGrad:
def __init__(self, lr=0.01, eps=1e-8):
self.lr = lr
self.eps = eps
self.G = None
def step(self, x, grad):
if self.G is None:
self.G = np.zeros_like(x)
self.G += grad ** 2
return x - self.lr * grad / (np.sqrt(self.G) + self.eps)
RMSProp
DfRMSProp
RMSProp Update
Here,
- =Exponential moving average of squared gradients
- =Decay rate (typically 0.9)
- =Global learning rate (typically 0.001)
RMSProp vs AdaGrad: RMSProp replaces the growing accumulator with an exponential moving average, preventing the learning rate from decaying to zero. The effective window is approximately 1/(1-β) ā 10 recent gradients.
| Property | AdaGrad | RMSProp |
|---|---|---|
| Accumulator | Sum of all past g² | Exponential moving average |
| LR decay | Aggressive, monotone | Bounded, adaptive |
| Best for | Sparse, convex | Non-stationary, RNNs |
class RMSProp:
def __init__(self, lr=0.001, beta=0.9, eps=1e-8):
self.lr = lr
self.beta = beta
self.eps = eps
self.v = None
def step(self, x, grad):
if self.v is None:
self.v = np.zeros_like(x)
self.v = self.beta * self.v + (1 - self.beta) * grad ** 2
return x - self.lr * grad / (np.sqrt(self.v) + self.eps)
Adam Optimizer
DfAdam (Adaptive Moment Estimation)
Adam First Moment (Momentum)
Here,
- =Exponential moving average of gradients (first moment)
- =First moment decay rate (typically 0.9)
Adam Second Moment (Variance)
Here,
- =Exponential moving average of squared gradients (second moment)
- =Second moment decay rate (typically 0.999)
Bias Correction
Here,
- =Bias-corrected first moment estimate
- =Bias-corrected second moment estimate
- =Decay factor raised to power k
š” Why Bias Correction Matters
At initialization, mā = 0 and vā = 0. The EMA estimates are biased toward zero in early iterations. Dividing by (1 - βįµ) corrects this bias. After ~10 iterations with βā=0.9, the correction factor is ā 1/0.65 ā 1.54, which is substantial.
Adam Full Algorithm
Algorithm: Adam Optimizer
āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā
Require: Learning rate α (default: 0.001)
Require: Decay rates βā = 0.9, βā = 0.999
Require: Numerical stability ε = 1e-8
Require: Initial parameters xā
1: Initialize mā ā 0, vā ā 0, t ā 0
2: repeat
3: t ā t + 1
4: Sample mini-batch B_k
5: g_k ā (1/|B_k|) ΣᵢāB_k āfįµ¢(x_{t-1})
6: m_t ā βā Ā· m_{t-1} + (1 - βā) Ā· g_t [Update biased first moment]
7: v_t ā βā Ā· v_{t-1} + (1 - βā) Ā· g_t² [Update biased second moment]
8: mĢ_t ā m_t / (1 - βāįµ) [Bias correction]
9: vĢ_t ā v_t / (1 - βāįµ) [Bias correction]
10: x_t ā x_{t-1} - α Ā· mĢ_t / (āvĢ_t + ε) [Parameter update]
11: until convergence
āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā
class Adam:
def __init__(self, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
self.lr = lr
self.beta1 = beta1
self.beta2 = beta2
self.eps = eps
self.m = None
self.v = None
self.t = 0
def step(self, x, grad):
self.t += 1
if self.m is None:
self.m = np.zeros_like(x)
self.v = np.zeros_like(x)
self.m = self.beta1 * self.m + (1 - self.beta1) * grad
self.v = self.beta2 * self.v + (1 - self.beta2) * grad ** 2
m_hat = self.m / (1 - self.beta1 ** self.t)
v_hat = self.v / (1 - self.beta2 ** self.t)
return x - self.lr * m_hat / (np.sqrt(v_hat) + self.eps)
AdamW: Decoupled Weight Decay
AdamW Update
Here,
- =Weight decay coefficient (typically 0.01)
š” Adam vs AdamW
In standard Adam with L2 regularization, weight decay is coupled with adaptive learning rates, weakening its regularization effect. AdamW decouples weight decay, applying it directly to parameters independent of the gradient, which provides better generalization.
Learning Rate Warmup
DfLinear Warmup
Why warmup is needed:
- At initialization, Adam's second moment estimates vĢ are noisy and unreliable
- Large gradients in early training can cause divergent updates
- Warmup allows moment estimates to stabilize before using full learning rates
- Especially important for transformers and large batch training
Common warmup schedules:
| Schedule | Formula | When to Use |
|---|---|---|
| Linear warmup | α_t = α_target · t/T_warmup | Transformers, large batches |
| Gradual warmup | α_t = α_target · min(t/T, (1-t/T)·s + t/T) | General purpose |
| No warmup | α_t = α_target | Small models, SGD with momentum |
def warmup_lr(step, warmup_steps, target_lr):
if step < warmup_steps:
return target_lr * step / warmup_steps
return target_lr
Python Implementation
import numpy as np
class SGD:
def __init__(self, lr=0.01, momentum=0.0, weight_decay=0.0):
self.lr = lr
self.momentum = momentum
self.weight_decay = weight_decay
self.v = None
def step(self, x, grad):
if self.v is None:
self.v = np.zeros_like(x)
if self.weight_decay > 0:
grad = grad + self.weight_decay * x
self.v = self.momentum * self.v + grad
return x - self.lr * self.v
class AdaGrad:
def __init__(self, lr=0.01, eps=1e-8):
self.lr = lr
self.eps = eps
self.G = None
def step(self, x, grad):
if self.G is None:
self.G = np.zeros_like(x)
self.G += grad ** 2
return x - self.lr * grad / (np.sqrt(self.G) + self.eps)
class RMSProp:
def __init__(self, lr=0.001, beta=0.9, eps=1e-8):
self.lr = lr
self.beta = beta
self.eps = eps
self.v = None
def step(self, x, grad):
if self.v is None:
self.v = np.zeros_like(x)
self.v = self.beta * self.v + (1 - self.beta) * grad ** 2
return x - self.lr * grad / (np.sqrt(self.v) + self.eps)
class Adam:
def __init__(self, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
self.lr = lr
self.beta1 = beta1
self.beta2 = beta2
self.eps = eps
self.m = None
self.v = None
self.t = 0
def step(self, x, grad):
self.t += 1
if self.m is None:
self.m = np.zeros_like(x)
self.v = np.zeros_like(x)
self.m = self.beta1 * self.m + (1 - self.beta1) * grad
self.v = self.beta2 * self.v + (1 - self.beta2) * grad ** 2
m_hat = self.m / (1 - self.beta1 ** self.t)
v_hat = self.v / (1 - self.beta2 ** self.t)
return x - self.lr * m_hat / (np.sqrt(v_hat) + self.eps)
class AdamW:
def __init__(self, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=0.01):
self.lr = lr
self.beta1 = beta1
self.beta2 = beta2
self.eps = eps
self.weight_decay = weight_decay
self.m = None
self.v = None
self.t = 0
def step(self, x, grad):
self.t += 1
if self.m is None:
self.m = np.zeros_like(x)
self.v = np.zeros_like(x)
self.m = self.beta1 * self.m + (1 - self.beta1) * grad
self.v = self.beta2 * self.v + (1 - self.beta2) * grad ** 2
m_hat = self.m / (1 - self.beta1 ** self.t)
v_hat = self.v / (1 - self.beta2 ** self.t)
return x - self.lr * (m_hat / (np.sqrt(v_hat) + self.eps) + self.weight_decay * x)
# --- Demo: Compare optimizers on Rosenbrock function ---
def rosenbrock(x):
return (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2
def rosenbrock_grad(x):
dx0 = -2 * (1 - x[0]) + 200 * (x[1] - x[0]**2) * (-2 * x[0])
dx1 = 200 * (x[1] - x[0]**2)
return np.array([dx0, dx1])
optimizers = {
"SGD": SGD(lr=0.001, momentum=0.9),
"AdaGrad": AdaGrad(lr=0.05),
"RMSProp": RMSProp(lr=0.001),
"Adam": Adam(lr=0.005),
"AdamW": AdamW(lr=0.005, weight_decay=0.01),
}
for name, opt in optimizers.items():
x = np.array([-1.0, 1.0])
for i in range(5000):
grad = rosenbrock_grad(x)
x = opt.step(x, grad)
print(f"{name:10s} | Final loss: {rosenbrock(x):.6f} | Position: ({x[0]:.4f}, {x[1]:.4f})")
Applications in AI/ML
Deep Learning Training
SGD and its variants are used in virtually all neural network training:
| Application | Typical Optimizer | Key Consideration |
|---|---|---|
| Image classification (CNN) | SGD + momentum (0.9) + cosine LR | Generalization gap vs Adam |
| Transformers (LLMs) | AdamW + warmup + cosine LR | Large batch, stability |
| GANs | Adam (βā=0.0, βā=0.9) | Non-stationary objectives |
| Reinforcement learning | Adam | Non-stationary data distribution |
| NLP / Embeddings | AdaGrad or Adam | Sparse gradients |
| Diffusion models | AdamW + warmup | Large models, long training |
š” SGD vs Adam for Generalization
SGD with momentum often achieves better generalization than Adam on vision tasks, despite slower convergence. This is because Adam's adaptive LR can overfit to sharp minima. Many practitioners train with Adam for speed, then fine-tune with SGD for final performance.
Learning Rate Schedules in Practice
Cosine Annealing:
Cosine Annealing
Here,
- =Minimum learning rate (typically 1e-6)
- =Maximum learning rate
- =Total number of training steps
One Cycle Policy:
- Linearly warmup from low LR to peak LR for first ~30% of training
- Cosine decay from peak to very low LR for remaining ~70%
- Popularized by Leslie Smith; used in fast.ai course
Common Mistakes
| Mistake | Symptom | Fix |
|---|---|---|
| Learning rate too high | Loss oscillates or diverges | Use LR finder; start with 1e-3 for Adam |
| Learning rate too low | Extremely slow convergence | Increase by 10Ć until loss decreases faster |
| No warmup with large batch | Training instability, NaN loss | Add linear warmup for 5ā10% of training |
| Using Adam without weight decay | Overfitting, poor generalization | Switch to AdamW with weight_decay=0.01 |
| Ignoring gradient clipping | Exploding gradients in RNNs | Clip gradients at norm 1.0 |
| Wrong β for Adam | Slow or unstable training | Use defaults: βā=0.9, βā=0.999 |
| Batch size too large | Poor generalization | Reduce batch size or increase LR linearly |
| Not shuffling data | Biased gradient estimates | Shuffle training data each epoch |
| Constant learning rate | Fails to converge to sharp minimum | Use cosine decay or step decay |
| Evaluating on training loss | Misleading convergence assessment | Always monitor validation loss |
Interview Questions
Q1: Why is SGD preferred over batch gradient descent for deep learning? A: Batch GD requires computing the gradient over the entire dataset per update, which is infeasible for millions of samples. SGD uses mini-batches, enabling faster iteration, lower memory usage, and the noise helps escape saddle points. The noise also acts as implicit regularization, often improving generalization.
Q2: What is the role of the learning rate in SGD, and how do you choose it? A: The learning rate controls step size. Too large causes divergence; too small causes slow convergence. In practice, use a learning rate finder (sweep log-scale LR and plot loss), start with Adam lr=1e-3 or SGD lr=0.1, and apply cosine annealing or step decay schedules.
Q3: Explain bias correction in Adam. Why is it necessary? A: At initialization mā=vā=0, so early EMA estimates are biased toward zero. Dividing by (1-βįµ) corrects this. For βā=0.9, the correction factor at t=1 is 1/0.1=10, which is critical. After ~20 steps, the correction becomes negligible (<5%).
Q4: Why does AdamW outperform Adam with L2 regularization? A: In Adam, L2 regularization adds Ī»x to the gradient, which is then scaled by 1/āvĢ. This means the effective weight decay varies per-parameter inversely to gradient magnitude. AdamW decouples weight decay, applying Ī»x directly without adaptive scaling, providing uniform regularization.
Q5: When would you use SGD+momentum over Adam? A: SGD+momentum often achieves better final generalization on vision tasks (ImageNet, etc.) despite slower convergence. Adam converges faster but may converge to sharper minima. Common pattern: train with Adam for speed, then switch to SGD for fine-tuning.
Q6: What happens if you use βā=0.999 with a very large batch size? A: The second moment estimate vĢ becomes a very long-running average, potentially stale for non-stationary objectives. Consider reducing βā to 0.99 or 0.95 for GANs or fast-changing loss landscapes.
Q7: Explain the relationship between batch size and learning rate. A: Linear scaling rule: when batch size is multiplied by k, multiply LR by k. This maintains the same signal-to-noise ratio in gradient updates. Must be combined with warmup to avoid early instability. (Goyal et al., 2017)
Practice Problems
šProblem 1: SGD Convergence
Show that for a quadratic function f(x) = (1/2)xįµAx - bįµx with A positive definite, the expected value of the SGD update after one step satisfies E[xā] = (I - αA)xā + αb, which is identical to batch gradient descent.
š”Solution
Since E[g] = āf(x) = Ax - b for quadratic functions, we have E[xā] = xā - α·E[g] = xā - α(Axā - b) = (I - αA)xā + αb. The expected update matches batch GD exactly ā the noise only affects higher moments, not the mean trajectory.
šProblem 2: Why Momentum Helps
Consider minimizing f(x) = xā² + 100xā² (a narrow valley). Explain why SGD oscillates and how momentum fixes this.
š”Solution
The gradient in xā is 100Ć larger than in xā, causing oscillations along xā. Momentum accumulates velocity: oscillating directions average out (β alternates sign), while consistent xā direction accumulates. Effective learning rate becomes α/(1-β) ā 10α in xā, dramatically speeding convergence along the valley floor.
šProblem 3: AdaGrad Decay
In AdaGrad, if a parameter has constant gradient g, show that the effective learning rate decays as O(1/āt).
š”Solution
After t steps, Gā = Σᵢāāįµ g² = tĀ·g². The effective learning rate is α/ā(tĀ·g²) = α/(gĀ·āt), which decays as O(1/āt). This aggressive decay is why AdaGrad can halt training prematurely ā learning rates vanish for parameters with consistently large gradients.
šProblem 4: Adam Hyperparameters
If you double βā in Adam from 0.999 to 0.9999, what happens to the bias correction factor and the effective window of the second moment estimate?
š”Solution
The bias correction factor 1/(1-βāįµ) increases: at t=1000, 1/(1-0.999¹ā°ā°ā°) ā 1/0.632 ā 1.58 vs 1/(1-0.9999¹ā°ā°ā°) ā 1/0.095 ā 10.5. The effective window increases from ~1000 steps to ~10,000 steps. This makes vĢ smoother but slower to adapt to changes in gradient variance.
šProblem 5: Batch Size Scaling
A model trains with batch size 128 and learning rate 0.01. If you increase batch size to 1024 (8Ć larger), what should the new learning rate be according to the linear scaling rule? What warmup strategy would you use?
š”Solution
Linear scaling: new LR = 0.01 Ć 8 = 0.08. Use linear warmup: start at LR = 0.01 for first ~5 epochs (or ~5% of total training), then linearly ramp to 0.08. This prevents instability from large initial gradients amplified by the large LR.
šProblem 6: Gradient Variance Reduction
If you increase mini-batch size from 32 to 512, by what factor does the variance of the gradient estimate decrease (assuming independent samples)?
š”Solution
For independent samples, variance scales as ϲ/b. Increasing b from 32 to 512 (16Ć) decreases variance by factor 16. The standard deviation decreases by factor ā16 = 4. This is why larger batches give smoother gradients but with diminishing returns.
šProblem 7: Adam vs SGD Generalization
You observe that Adam converges 5Ć faster than SGD on a computer vision task, but SGD achieves 2% higher accuracy on the test set. Propose two strategies to get the best of both worlds.
š”Solution
Strategy 1: Train with Adam for the first 80% of training for fast convergence, then switch to SGD with momentum for the final 20% to settle into a flatter minimum. Strategy 2: Use AdamW with cosine annealing for full training but increase weight decay (e.g., 0.05ā0.1) to regularize sharper minima. Both strategies leverage Adam's speed while improving generalization.
Quick Reference
| Algorithm | Update Rule | Adaptive LR | Momentum | Best For |
|---|---|---|---|---|
| SGD | x - αg | No | Optional | Vision, final training |
| AdaGrad | x - αg/ā(G+ε) | Yes | No | Sparse features, NLP |
| RMSProp | x - αg/ā(v+ε) | Yes | No | RNNs, non-stationary |
| Adam | x - αmĢ/(āvĢ+ε) | Yes | Yes | Default optimizer |
| AdamW | x - α(mĢ/(āvĢ+ε) + Ī»x) | Yes | Yes | Transformers, general |
Default hyperparameters:
| Parameter | SGD | Adam | AdamW |
|---|---|---|---|
| Learning rate | 0.1 | 0.001 | 0.001 |
| βā (momentum) | 0.9 | 0.9 | 0.9 |
| βā | ā | 0.999 | 0.999 |
| Weight decay | 0 | 0 | 0.01 |
| Epsilon | ā | 1e-8 | 1e-8 |
Key formulas:
- SGD: x ā x - α·g
- Momentum: v ā βv + g; x ā x - αv
- AdaGrad: G ā G + g²; x ā x - αg/ā(G+ε)
- RMSProp: v ā 0.9v + 0.1g²; x ā x - αg/ā(v+ε)
- Adam: m ā 0.9m + 0.1g; v ā 0.999v + 0.001g²; x ā x - α·mĢ/(āvĢ+ε)
Cross-References
- Gradient Descent (062): The deterministic foundation; batch gradient descent converges exactly but is slow.
- Newton's Method (064): Second-order methods that use curvature information for faster convergence.
- Convex Optimization (061): Theoretical convergence guarantees for SGD in convex settings.
- Constrained Optimization (065): Extends SGD to constrained problems via projected variants.
- Hyperparameter Optimization (069): Finding optimal learning rates and batch sizes systematically.
- Calculus: Partial Derivatives (027): Foundation for computing gradients in multi-dimensional optimization.
- Calculus: Optimization (030): Classical unconstrained optimization theory and optimality conditions.
- Linear Algebra: Norms (015): Measuring gradient magnitude and parameter distance in optimization.
- Probability: Expectation (039): Expected value analysis of SGD convergence.
- Probability: CLT (042): Central Limit Theorem explains why mini-batch gradients concentrate around the true gradient.