Stochastic Gradient Descent & Adaptive Optimizers

Stochastic Gradient Descent

💡 Why It Matters

Stochastic Gradient Descent (SGD) is the foundation of modern deep learning optimization. Nearly every neural network trained today, from GPT to ResNet to diffusion models, relies on SGD or one of its adaptive variants, so a working understanding of SGD is essential for machine learning practitioners.

In batch gradient descent, computing the full gradient over the entire dataset is prohibitively expensive when datasets contain millions of examples. SGD solves this by approximating the gradient using a small random subset of data, enabling scalable training of large models on massive datasets. The key insight is that a noisy but cheap gradient estimate is often better than an exact but expensive one.

SGD Algorithm

Stochastic Gradient Descent Update

x_{k+1} = x_k - \alpha_k \, g_k

Here,

$x_k$ =Parameter vector at iteration k
$\alpha_k$ =Learning rate (step size) at iteration k
$g_k$ =Stochastic gradient estimate at iteration k

Definition: Stochastic Gradient

The full batch gradient requires computing ∇f(x) = (1/n) Σᵢ ∇fᵢ(x) over all n samples. SGD replaces this with ∇f_{i_k}(x) for a single random index i_k, reducing per-iteration cost from O(n) to O(1).

Mini-Batch SGD

In practice, purely stochastic (batch size = 1) gradients are often too noisy and make poor use of vectorized hardware. Mini-batch SGD strikes a balance by sampling a small batch B_k of size b from the training set at each iteration:

Mini-Batch Gradient

g_k = \frac{1}{b} \sum_{i \in B_k} \nabla f_i(x_k)

Here,

$B_k$ =Mini-batch sampled uniformly at random from training set
$b$ =Mini-batch size (typically 32, 64, 128, 256)
$\nabla f_i(x_k)$ =Gradient on the i-th training example

Mini-Batch SGD Update

x_{k+1} = x_k - \alpha_k \cdot \frac{1}{b} \sum_{i \in B_k} \nabla f_i(x_k)

Here,

$b$ =Mini-batch size
$\alpha_k$ =Learning rate
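The mini-batch update above can be sketched on a small synthetic least-squares problem. Everything specific here is an assumption for illustration: the data, the per-example loss f_i(w) = 0.5 (a_i · w - y_i)², the constant learning rate 0.05, and the batch size 32.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical synthetic data: y = A @ w_true + small noise.
n, d = 1000, 3
A = rng.normal(size=(n, d))
w_true = np.array([2.0, -1.0, 0.5])
y = A @ w_true + 0.01 * rng.normal(size=n)


def minibatch_grad(w, batch_idx):
    """g_k = (1/b) * sum of per-example gradients over the batch."""
    A_b, y_b = A[batch_idx], y[batch_idx]
    return A_b.T @ (A_b @ w - y_b) / len(batch_idx)


w = np.zeros(d)
alpha, b = 0.05, 32
for k in range(2000):
    batch_idx = rng.choice(n, size=b, replace=False)  # sample B_k uniformly
    w = w - alpha * minibatch_grad(w, batch_idx)      # x_{k+1} = x_k - alpha * g_k

error = np.linalg.norm(w - w_true)
print(f"distance to true weights: {error:.4f}")
```

Each step touches only 32 of the 1000 examples, yet the iterate settles close to the generating weights.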

Batch Size	Gradient Quality	Compute Cost per Step	GPU Utilization
1	Very noisy	Low	Poor
32–256	Good trade-off	Moderate	High
512–4096	Low noise	High	Excellent
All data	Exact	Very high	High, but only one update per pass over the data

Common batch sizes in practice: 32, 64, 128, 256. Larger batches provide more accurate gradients but require more memory and may generalize worse.

Why SGD Works

💡 Unbiased Gradient Estimate

A key property that makes SGD valid is that the stochastic gradient is an unbiased estimator of the true gradient. This means that on average, SGD points in the correct downhill direction.

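Theorem: Unbiasedness of SGD

If the mini-batch B_k is sampled uniformly at random from the training set, then E[g_k | x_k] = ∇f(x_k).

Proof sketch: Since each data point i is equally likely to appear in B_k, linearity of expectation gives:

E[g_k | x_k] = E[(1/b) Σᵢ∈B_k ∇fᵢ(x_k)] = (1/n) Σᵢ₌₁ⁿ ∇fᵢ(x_k) = ∇f(x_k)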

Unbiasedness alone does not guarantee convergence. Combined with bounded gradient variance, Lipschitz continuous gradients, and strong convexity, it ensures that SGD with a sufficiently small constant learning rate α converges to a neighborhood of the optimum whose size is proportional to α. Decreasing the learning rate over time shrinks this neighborhood, enabling convergence to the exact solution.
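The unbiasedness property can be checked numerically. The sketch below assumes a hypothetical least-squares objective with per-example loss f_i(x) = 0.5 (a_i · x - y_i)² and batch size 8. It compares the full gradient with the exact average over all single-sample gradients and with a Monte Carlo average of many mini-batch gradients.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 200, 4
A = rng.normal(size=(n, d))   # hypothetical features
y = rng.normal(size=n)        # hypothetical targets
x = rng.normal(size=d)        # arbitrary evaluation point

# Per-example gradients of f_i(x) = 0.5 * (a_i . x - y_i)^2, one row per example.
per_example = (A @ x - y)[:, None] * A
full_grad = per_example.mean(axis=0)

# Exact expectation of one uniformly sampled gradient: each index has probability 1/n.
expected_single = sum(per_example[i] / n for i in range(n))

# Monte Carlo average of many mini-batch gradients (sampled with replacement).
b, trials = 8, 20000
idx = rng.integers(0, n, size=(trials, b))
batch_grads = per_example[idx].mean(axis=1)   # shape (trials, d)
mc_mean = batch_grads.mean(axis=0)

max_dev = np.abs(mc_mean - full_grad).max()
print("largest deviation of the Monte Carlo mean:", max_dev)
```

Individual mini-batch gradients scatter widely, but their average lines up with the full gradient.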

💡 Convergence Behavior

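Unlike batch GD, which under standard smoothness and convexity assumptions converges to the exact minimizer, SGD with a constant step size converges only to a neighborhood of the optimum. The size of this neighborhood depends on the learning rate and the variance of the stochastic gradients. Exact convergence requires a decaying schedule with α_k → 0 that does not shrink too quickly; the classical Robbins-Monro conditions are Σ α_k = ∞ and Σ α_k² < ∞ (for example, α_k ∝ 1/k).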

Variance of SGD Gradients

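Definition: Gradient Variance

\sigma^2 = \mathbb{E}\left[ \| g_k - \nabla f(x_k) \|^2 \mid x_k \right]

The variance σ² measures how far the stochastic gradient typically deviates from the true gradient, and it is usually assumed to be bounded uniformly over x_k. It directly affects SGD convergence speed: high variance means noisier updates, requiring smaller learning rates and more iterations. This motivates variance reduction techniques.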

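Theorem: SGD Convergence Rate (Convex)

A standard form of the bound, for convex f, constant step size α, and the averaged iterate \bar{x}_K after K iterations, is shown below. The exact constants depend on the assumptions; this classical version takes σ² to bound E‖g_k‖².

\mathbb{E}[f(\bar{x}_K)] - f(x^*) \le \frac{\| x_0 - x^* \|^2}{2 \alpha K} + \frac{\alpha \sigma^2}{2}

The first term is O(1/K) and decreases with more iterations. The second term ασ²/2 is the "noise floor": the error that remains at a fixed step size because of gradient noise. Reducing α shrinks the noise floor but slows convergence, creating a fundamental trade-off.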
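The noise floor can be observed on a toy problem. The sketch below assumes f(x) = x²/2 with additive Gaussian gradient noise of variance σ² (an idealized noise model, not a real dataset) and compares the average loss over the second half of the run for two constant step sizes.

```python
import numpy as np


def sgd_noise_floor(alpha, sigma=1.0, steps=20000, seed=0):
    """Run SGD on f(x) = x^2 / 2 with noisy gradients; return the mean loss over the last half."""
    rng = np.random.default_rng(seed)
    x = 5.0
    losses = np.empty(steps)
    for k in range(steps):
        g = x + sigma * rng.normal()   # unbiased gradient estimate with variance sigma^2
        x = x - alpha * g
        losses[k] = 0.5 * x * x
    return losses[steps // 2:].mean()


floor_large = sgd_noise_floor(alpha=0.5)
floor_small = sgd_noise_floor(alpha=0.05)
print(f"alpha=0.5  -> average loss {floor_large:.4f}")
print(f"alpha=0.05 -> average loss {floor_small:.4f}")
```

Neither run reaches zero loss; the smaller step size simply plateaus at a lower level, at the price of a slower initial descent.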

Variance Reduction

Several techniques reduce SGD variance without requiring full-batch gradients:

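SVRG (Stochastic Variance Reduced Gradient): Periodically computes a full gradient at a snapshot point and uses it to correct each stochastic gradient. The corrected estimate stays unbiased, and its variance shrinks toward zero as the iterates and the snapshot both approach the solution.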

SVRG Gradient Estimate

\tilde{g}_k = \nabla f_{i_k}(x_k) - \nabla f_{i_k}(\tilde{x}) + \nabla f(\tilde{x})

Here,

$\tilde{x}$ =Snapshot point where full gradient was computed
$\nabla f(\tilde{x})$ =Full gradient at snapshot
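A minimal SVRG sketch on a hypothetical noisy least-squares problem follows. The step size (0.1 divided by the largest per-example smoothness constant), the inner-loop length of 5n, the epoch count, and the choice of the last inner iterate as the next snapshot are all illustrative assumptions rather than tuned or prescribed values.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 200, 5
A = rng.normal(size=(n, d))
y = A @ rng.normal(size=d) + 0.5 * rng.normal(size=n)   # noisy targets


def grad_i(x, i):
    """Gradient of f_i(x) = 0.5 * (a_i . x - y_i)^2."""
    return (A[i] @ x - y[i]) * A[i]


def full_grad(x):
    return A.T @ (A @ x - y) / n


L_max = (A ** 2).sum(axis=1).max()   # largest per-example smoothness constant
eta = 0.1 / L_max                    # assumed step size
inner_steps = 5 * n                  # assumed inner-loop length

x = np.zeros(d)
for epoch in range(50):
    x_snap = x.copy()                # snapshot point
    mu = full_grad(x_snap)           # full gradient at the snapshot
    for _ in range(inner_steps):
        i = rng.integers(n)
        g = grad_i(x, i) - grad_i(x_snap, i) + mu   # variance-reduced estimate
        x = x - eta * g

x_star = np.linalg.lstsq(A, y, rcond=None)[0]       # exact least-squares solution
svrg_error = np.linalg.norm(x - x_star)
print(f"distance to least-squares solution: {svrg_error:.2e}")
```

Because the targets are noisy, plain SGD with the same constant step size would hover around the solution, while the corrected estimate lets the iterates keep contracting.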

SAGA: Maintains a table of individual gradients, providing unbiased variance reduction with linear convergence for strongly convex functions.

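Importance Sampling: Sampling data points with probability proportional to their gradient norm, and reweighting each sampled gradient by the inverse of its sampling probability so the estimate stays unbiased, can reduce variance compared to uniform sampling.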

💡 When Variance Reduction Matters

Variance reduction is most useful for small-to-medium datasets where the cost of occasional full-gradient computations is acceptable. For large-scale deep learning, mini-batch SGD with adaptive methods (Adam) is typically preferred.

Momentum

SGD with momentum accelerates convergence by accumulating velocity in directions of consistent gradient.

SGD with Momentum

v_k = \beta \, v_{k-1} + g_k

Here,

$v_k$ =Velocity (momentum buffer) at iteration k
$\beta$ =Momentum coefficient (typically 0.9)

Position Update with Momentum

x_{k+1} = x_k - \alpha_k \, v_k

Here,

$x_k$ =Current parameter vector
$\alpha_k$ =Learning rate

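Definition: Nesterov Accelerated Gradient

Nesterov's variant evaluates the gradient (or its stochastic estimate) at a look-ahead point, where the current velocity is about to carry the parameters, instead of at x_k. In the notation above, one common formulation is:

v_k = \beta \, v_{k-1} + \nabla f(x_k - \alpha_k \beta \, v_{k-1}), \quad x_{k+1} = x_k - \alpha_k \, v_k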

Why momentum helps:

Accumulates velocity in consistent gradient directions, accelerating convergence along ravines
Dampens oscillations in directions with high curvature
Effective learning rate is amplified by factor 1/(1-β) ≈ 10 for β=0.9

Momentum β	Effective LR Multiplier	Behavior
0.0	1×	Vanilla SGD
0.9	10×	Standard momentum
0.99	100×	Heavy momentum
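The effect of momentum on a ravine can be sketched with exact (noise-free) gradients on the quadratic f(x) = 0.5 (x₁² + 100 x₂²). The objective, the step size 0.01, the 200-step budget, and the starting point are assumptions chosen for illustration.

```python
import numpy as np


def grad(x):
    """Gradient of the ravine f(x) = 0.5 * (x1^2 + 100 * x2^2)."""
    return np.array([1.0, 100.0]) * x


def loss(x):
    return 0.5 * (x[0] ** 2 + 100.0 * x[1] ** 2)


def run(beta, alpha=0.01, steps=200):
    x = np.array([1.0, 1.0])
    v = np.zeros_like(x)
    for _ in range(steps):
        v = beta * v + grad(x)   # v_k = beta * v_{k-1} + g_k
        x = x - alpha * v        # x_{k+1} = x_k - alpha * v_k
    return loss(x)


loss_plain = run(beta=0.0)
loss_momentum = run(beta=0.9)
print(f"no momentum: {loss_plain:.3e}")
print(f"beta = 0.9 : {loss_momentum:.3e}")
```

With the same step size, the run without momentum is still crawling along the shallow x₁ direction after 200 steps, while the momentum run has moved much further down the valley.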

AdaGrad

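Definition: AdaGrad

AdaGrad adapts the learning rate per parameter by dividing each step by the square root of that parameter's accumulated squared gradients.

AdaGrad Accumulator

G_k = G_{k-1} + g_k \odot g_k

AdaGrad Update

x_{k+1} = x_k - \frac{\alpha}{\sqrt{G_k} + \epsilon} \odot g_k

Here,

$G_k$ =Accumulated squared gradients, one entry per parameter (the diagonal of the full outer-product matrix)
$g_k$ =Current gradient
$\alpha$ =Global learning rate
$\epsilon$ =Small constant for numerical stability (typically 1e-8)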

Key properties:

Parameters with large accumulated gradients get smaller effective learning rates
Parameters with sparse gradients get larger effective learning rates
Well-suited for sparse data (NLP, recommender systems)

Critical limitation: The accumulator G_k only grows, causing the learning rate to decay aggressively and eventually approach zero. This can halt training prematurely.

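import numpy as np


class AdaGrad:
    def __init__(self, lr=0.01, eps=1e-8):
        self.lr = lr
        self.eps = eps
        self.G = None  # running sum of squared gradients, one entry per parameter

    def step(self, x, grad):
        if self.G is None:
            self.G = np.zeros_like(x, dtype=float)
        # The accumulator only grows, so the effective step size only shrinks.
        self.G += grad ** 2
        return x - self.lr * grad / (np.sqrt(self.G) + self.eps)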

RMSProp

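Definition: RMSProp

RMSProp scales each parameter's step by an exponential moving average of its squared gradients.

RMSProp Accumulator

v_k = \beta \, v_{k-1} + (1 - \beta) \, g_k \odot g_k

RMSProp Update

x_{k+1} = x_k - \frac{\alpha}{\sqrt{v_k} + \epsilon} \odot g_k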

Here,

$v_k$ =Exponential moving average of squared gradients
$\beta$ =Decay rate (typically 0.9)
$\alpha$ =Global learning rate (typically 0.001)

RMSProp vs AdaGrad: RMSProp replaces the growing accumulator with an exponential moving average, preventing the learning rate from decaying to zero. The effective window is approximately 1/(1-β) ≈ 10 recent gradients.

Property	AdaGrad	RMSProp
Accumulator	Sum of all past g²	Exponential moving average
LR decay	Aggressive, monotone	Bounded, adaptive
Best for	Sparse, convex	Non-stationary, RNNs
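The difference in learning-rate decay can be made concrete by tracking the effective step size α/(√accumulator + ε) of both methods. The sketch assumes an idealized constant gradient g = 2.0 and uses plain scalars instead of parameter vectors.

```python
import numpy as np

alpha, eps, beta = 0.01, 1e-8, 0.9
g = 2.0        # constant gradient (idealized assumption)
steps = 200

G, v = 0.0, 0.0
adagrad_lr, rmsprop_lr = [], []
for t in range(1, steps + 1):
    G += g ** 2                            # AdaGrad: sum of all squared gradients
    v = beta * v + (1 - beta) * g ** 2     # RMSProp: exponential moving average
    adagrad_lr.append(alpha / (np.sqrt(G) + eps))
    rmsprop_lr.append(alpha / (np.sqrt(v) + eps))

print(f"AdaGrad effective LR at t=1, t=200: {adagrad_lr[0]:.5f}, {adagrad_lr[-1]:.5f}")
print(f"RMSProp effective LR at t=1, t=200: {rmsprop_lr[0]:.5f}, {rmsprop_lr[-1]:.5f}")
```

AdaGrad's effective rate follows α/(|g|√t) and keeps falling, while RMSProp's settles at α/|g| once the moving average has warmed up.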

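import numpy as np


class RMSProp:
    def __init__(self, lr=0.001, beta=0.9, eps=1e-8):
        self.lr = lr
        self.beta = beta
        self.eps = eps
        self.v = None  # exponential moving average of squared gradients

    def step(self, x, grad):
        if self.v is None:
            self.v = np.zeros_like(x, dtype=float)
        # Old squared gradients are forgotten at rate beta, so v stays bounded.
        self.v = self.beta * self.v + (1 - self.beta) * grad ** 2
        return x - self.lr * grad / (np.sqrt(self.v) + self.eps)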

Adam Optimizer

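Definition: Adam (Adaptive Moment Estimation)

Adam combines a momentum-style moving average of gradients (first moment) with an RMSProp-style moving average of squared gradients (second moment), and corrects both for their initialization bias.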

Adam First Moment (Momentum)

m_k = \beta_1 \, m_{k-1} + (1 - \beta_1) \, g_k

Here,

$m_k$ =Exponential moving average of gradients (first moment)
$\beta_1$ =First moment decay rate (typically 0.9)

Adam Second Moment (Variance)

v_k = \beta_2 \, v_{k-1} + (1 - \beta_2) \, g_k^2

Here,

$v_k$ =Exponential moving average of squared gradients (second moment)
$\beta_2$ =Second moment decay rate (typically 0.999)

Bias Correction

\hat{m}_k = \frac{m_k}{1 - \beta_1^k}, \quad \hat{v}_k = \frac{v_k}{1 - \beta_2^k}

Here,

$\hat{m}_k$ =Bias-corrected first moment estimate
$\hat{v}_k$ =Bias-corrected second moment estimate
$\beta_1^k$ =Decay factor raised to power k

💡 Why Bias Correction Matters

At initialization, m₀ = 0 and v₀ = 0, so the EMA estimates are biased toward zero in early iterations. Dividing by (1 - βᵏ) corrects this bias. At iteration 10 with β₁=0.9, the correction factor is still 1/(1 - 0.9¹⁰) ≈ 1/0.65 ≈ 1.54, which is substantial.
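A short sketch makes the bias visible. It assumes a constant gradient of 3.0 so the "true" first moment is known, and the helper names are hypothetical. The raw EMA starts far below 3.0, while the corrected value recovers it at every step.

```python
def bias_correction_factor(beta, t):
    """Factor 1 / (1 - beta^t) applied to the raw EMA at step t (t starts at 1)."""
    return 1.0 / (1.0 - beta ** t)


def corrected_ema(values, beta):
    """Return (raw, corrected) EMA pairs for a sequence, starting from m_0 = 0."""
    m = 0.0
    out = []
    for t, g in enumerate(values, start=1):
        m = beta * m + (1 - beta) * g
        out.append((m, m * bias_correction_factor(beta, t)))
    return out


history = corrected_ema([3.0] * 5, beta=0.9)
for t, (raw, corrected) in enumerate(history, start=1):
    print(f"t={t}: raw={raw:.4f}, corrected={corrected:.4f}")

print("factor at t=10:", bias_correction_factor(0.9, 10))
```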

Adam Full Algorithm

Algorithm: Adam Optimizer
─────────────────────────────────────────────
Require: Learning rate α (default: 0.001)
Require: Decay rates β₁ = 0.9, β₂ = 0.999
Require: Numerical stability ε = 1e-8
Require: Initial parameters x₀

Initialize m₀ ← 0, v₀ ← 0, t ← 0
repeat
    t ← t + 1
    Sample mini-batch B_t
    g_t ← (1/|B_t|) Σᵢ∈B_t ∇fᵢ(x_{t-1})
    m_t ← β₁ · m_{t-1} + (1 - β₁) · g_t       [Update biased first moment]
    v_t ← β₂ · v_{t-1} + (1 - β₂) · g_t²       [Update biased second moment]
    m̂_t ← m_t / (1 - β₁ᵗ)                       [Bias correction]
    v̂_t ← v_t / (1 - β₂ᵗ)                       [Bias correction]
    x_t ← x_{t-1} - α · m̂_t / (√v̂_t + ε)       [Parameter update]
until convergence
─────────────────────────────────────────────

import numpy as np


class Adam:
    def __init__(self, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr = lr
        self.beta1 = beta1
        self.beta2 = beta2
        self.eps = eps
        self.m = None  # first moment: EMA of gradients
        self.v = None  # second moment: EMA of squared gradients
        self.t = 0     # step counter used by the bias correction

    def step(self, x, grad):
        self.t += 1
        if self.m is None:
            self.m = np.zeros_like(x, dtype=float)
            self.v = np.zeros_like(x, dtype=float)

        # Update the biased moment estimates.
        self.m = self.beta1 * self.m + (1 - self.beta1) * grad
        self.v = self.beta2 * self.v + (1 - self.beta2) * grad ** 2

        # Undo the bias toward zero caused by the zero initialization.
        m_hat = self.m / (1 - self.beta1 ** self.t)
        v_hat = self.v / (1 - self.beta2 ** self.t)

        return x - self.lr * m_hat / (np.sqrt(v_hat) + self.eps)

AdamW: Decoupled Weight Decay

AdamW Update

x_{k+1} = x_k - \alpha \left( \frac{\hat{m}_k}{\sqrt{\hat{v}_k} + \epsilon} + \lambda \, x_k \right)

Here,

$\lambda$ =Weight decay coefficient (typically 0.01)

💡 Adam vs AdamW

In standard Adam with L2 regularization, the penalty gradient λx is added to the loss gradient and then rescaled by the adaptive denominator, so parameters with large gradient magnitudes are decayed less than intended. AdamW decouples weight decay, applying it directly to the parameters independent of the gradient statistics, which often improves generalization.
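The coupling can be isolated with a single step in which the loss gradient is assumed to be exactly zero, so only the regularization term acts. The helper `adam_first_step` is a hypothetical function written for this sketch; it computes the Adam update at t = 1 starting from zero moments.

```python
import numpy as np


def adam_first_step(grad, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam update (the quantity subtracted from x) at t = 1 with m_0 = v_0 = 0."""
    m = (1 - beta1) * grad
    v = (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1)
    v_hat = v / (1 - beta2)
    return lr * m_hat / (np.sqrt(v_hat) + eps)


x0 = np.array([1.0, -2.0])
loss_grad = np.zeros(2)   # assumption: zero loss gradient, to isolate the decay term
lr, wd = 0.001, 0.01

# Adam + L2: the penalty gradient wd * x passes through the adaptive scaling.
x_l2 = x0 - adam_first_step(loss_grad + wd * x0, lr=lr)

# AdamW: the decay is applied directly to the parameters.
x_adamw = x0 - adam_first_step(loss_grad, lr=lr) - lr * wd * x0

print("Adam + L2:", x_l2)
print("AdamW    :", x_adamw)
```

In this setting the L2 variant moves every weight by about lr toward zero regardless of λ, because the normalization cancels the penalty's magnitude, whereas AdamW shrinks each weight by the factor (1 - lr·λ).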

Learning Rate Warmup

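Definition: Linear Warmup

Linear warmup increases the learning rate linearly from near zero to its target value over the first T_warmup steps, after which the main schedule takes over: α_t = α_target · min(1, t/T_warmup).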

Why warmup is needed:

At initialization, Adam's second moment estimates v̂ are noisy and unreliable
Large gradients in early training can cause divergent updates
Warmup allows moment estimates to stabilize before using full learning rates
Especially important for transformers and large batch training

Common warmup schedules:

Schedule	Formula	When to Use
Linear warmup	α_t = α_target · t/T_warmup	Transformers, large batches
Gradual warmup	α_t = α_target · ((1-t/T)·s + t/T) for t ≤ T, starting from the fraction s of the target	General purpose
No warmup	α_t = α_target	Small models, SGD with momentum

def warmup_lr(step, warmup_steps, target_lr):
    if step < warmup_steps:
        return target_lr * step / warmup_steps
    return target_lr

Python Implementation

import numpy as np

class SGD:
    def __init__(self, lr=0.01, momentum=0.0, weight_decay=0.0):
        self.lr = lr
        self.momentum = momentum
        self.weight_decay = weight_decay
        self.v = None

    def step(self, x, grad):
        if self.v is None:
            self.v = np.zeros_like(x)

        if self.weight_decay > 0:
            grad = grad + self.weight_decay * x

        self.v = self.momentum * self.v + grad
        return x - self.lr * self.v


class AdaGrad:
    def __init__(self, lr=0.01, eps=1e-8):
        self.lr = lr
        self.eps = eps
        self.G = None

    def step(self, x, grad):
        if self.G is None:
            self.G = np.zeros_like(x)
        self.G += grad ** 2
        return x - self.lr * grad / (np.sqrt(self.G) + self.eps)


class RMSProp:
    def __init__(self, lr=0.001, beta=0.9, eps=1e-8):
        self.lr = lr
        self.beta = beta
        self.eps = eps
        self.v = None

    def step(self, x, grad):
        if self.v is None:
            self.v = np.zeros_like(x)
        self.v = self.beta * self.v + (1 - self.beta) * grad ** 2
        return x - self.lr * grad / (np.sqrt(self.v) + self.eps)


class Adam:
    def __init__(self, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr = lr
        self.beta1 = beta1
        self.beta2 = beta2
        self.eps = eps
        self.m = None
        self.v = None
        self.t = 0

    def step(self, x, grad):
        self.t += 1
        if self.m is None:
            self.m = np.zeros_like(x)
            self.v = np.zeros_like(x)

        self.m = self.beta1 * self.m + (1 - self.beta1) * grad
        self.v = self.beta2 * self.v + (1 - self.beta2) * grad ** 2

        m_hat = self.m / (1 - self.beta1 ** self.t)
        v_hat = self.v / (1 - self.beta2 ** self.t)

        return x - self.lr * m_hat / (np.sqrt(v_hat) + self.eps)


class AdamW:
    def __init__(self, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=0.01):
        self.lr = lr
        self.beta1 = beta1
        self.beta2 = beta2
        self.eps = eps
        self.weight_decay = weight_decay
        self.m = None
        self.v = None
        self.t = 0

    def step(self, x, grad):
        self.t += 1
        if self.m is None:
            self.m = np.zeros_like(x)
            self.v = np.zeros_like(x)

        self.m = self.beta1 * self.m + (1 - self.beta1) * grad
        self.v = self.beta2 * self.v + (1 - self.beta2) * grad ** 2

        m_hat = self.m / (1 - self.beta1 ** self.t)
        v_hat = self.v / (1 - self.beta2 ** self.t)

        return x - self.lr * (m_hat / (np.sqrt(v_hat) + self.eps) + self.weight_decay * x)


# --- Demo: Compare optimizers on Rosenbrock function ---
def rosenbrock(x):
    return (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2

def rosenbrock_grad(x):
    dx0 = -2 * (1 - x[0]) + 200 * (x[1] - x[0]**2) * (-2 * x[0])
    dx1 = 200 * (x[1] - x[0]**2)
    return np.array([dx0, dx1])

optimizers = {
    "SGD": SGD(lr=0.001, momentum=0.9),
    "AdaGrad": AdaGrad(lr=0.05),
    "RMSProp": RMSProp(lr=0.001),
    "Adam": Adam(lr=0.005),
    "AdamW": AdamW(lr=0.005, weight_decay=0.01),
}

for name, opt in optimizers.items():
    x = np.array([-1.0, 1.0])
    for i in range(5000):
        grad = rosenbrock_grad(x)
        x = opt.step(x, grad)
    print(f"{name:10s} | Final loss: {rosenbrock(x):.6f} | Position: ({x[0]:.4f}, {x[1]:.4f})")

Applications in AI/ML

Deep Learning Training

SGD and its variants are used in virtually all neural network training:

Application	Typical Optimizer	Key Consideration
Image classification (CNN)	SGD + momentum (0.9) + cosine LR	Generalization gap vs Adam
Transformers (LLMs)	AdamW + warmup + cosine LR	Large batch, stability
GANs	Adam (β₁=0.0, β₂=0.9)	Non-stationary objectives
Reinforcement learning	Adam	Non-stationary data distribution
NLP / Embeddings	AdaGrad or Adam	Sparse gradients
Diffusion models	AdamW + warmup	Large models, long training

💡 SGD vs Adam for Generalization

SGD with momentum often achieves better generalization than Adam on vision tasks, despite slower convergence. A commonly cited explanation is that adaptive methods tend to settle in sharper minima, although the evidence is mixed and careful tuning narrows the gap. Some practitioners train with Adam for speed, then switch to SGD for the final phase of training.

Learning Rate Schedules in Practice

Cosine Annealing:

Cosine Annealing

\alpha_t = \alpha_{\min} + \frac{1}{2}(\alpha_{\max} - \alpha_{\min})(1 + \cos(\pi t / T))

Here,

$\alpha_{\min}$ =Minimum learning rate (typically 1e-6)
$\alpha_{\max}$ =Maximum learning rate
$T$ =Total number of training steps
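The cosine formula translates directly into a small schedule function. The function name and the choice of 100 total steps with a peak of 0.1 are assumptions for illustration.

```python
import math


def cosine_lr(t, total_steps, lr_max, lr_min=1e-6):
    """Cosine annealing from lr_max at t = 0 down to lr_min at t = total_steps."""
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t / total_steps))


schedule = [cosine_lr(t, 100, lr_max=0.1) for t in range(101)]
print(f"start={schedule[0]:.6f}, middle={schedule[50]:.6f}, end={schedule[100]:.6f}")
```

The rate changes slowly near both ends and fastest in the middle of training, where it passes through the midpoint of α_max and α_min.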

One Cycle Policy:

Warm up linearly from a low LR to the peak LR over the first ~30% of training
Decay with a cosine curve from the peak to a very low LR over the remaining ~70%
Popularized by Leslie Smith; used in the fast.ai course

Common Mistakes

Mistake	Symptom	Fix
Learning rate too high	Loss oscillates or diverges	Use LR finder; start with 1e-3 for Adam
Learning rate too low	Extremely slow convergence	Increase by 10× until loss decreases faster
No warmup with large batch	Training instability, NaN loss	Add linear warmup for 5–10% of training
Using Adam without weight decay	Overfitting, poor generalization	Switch to AdamW with weight_decay=0.01
Ignoring gradient clipping	Exploding gradients in RNNs	Clip gradients at norm 1.0
Wrong β for Adam	Slow or unstable training	Use defaults: β₁=0.9, β₂=0.999
Batch size too large	Poor generalization	Reduce batch size, or scale the LR linearly with batch size and add warmup
Not shuffling data	Biased gradient estimates	Shuffle training data each epoch
Constant learning rate	Loss plateaus at a noise floor instead of settling into a minimum	Use cosine decay or step decay
Evaluating on training loss	Misleading convergence assessment	Always monitor validation loss

Interview Questions

Q1: Why is SGD preferred over batch gradient descent for deep learning? A: Batch GD requires computing the gradient over the entire dataset for every update, which is impractical for datasets with millions of samples. SGD uses mini-batches, which enables faster iteration and lower memory usage. The gradient noise also helps the iterates escape saddle points and acts as implicit regularization, often improving generalization.

Q2: What is the role of the learning rate in SGD, and how do you choose it? A: The learning rate controls step size. Too large causes divergence; too small causes slow convergence. In practice, use a learning rate finder (sweep log-scale LR and plot loss), start with Adam lr=1e-3 or SGD lr=0.1, and apply cosine annealing or step decay schedules.

Q3: Explain bias correction in Adam. Why is it necessary? A: At initialization m₀=v₀=0, so early EMA estimates are biased toward zero. Dividing by (1-βᵗ) corrects this. For β₁=0.9, the correction factor at t=1 is 1/0.1=10, meaning the raw first-moment estimate is ten times too small. After ~30 steps, the first-moment correction becomes negligible (<5%); for the second moment with β₂=0.999, the correction stays above 5% for roughly 3000 steps.

Q4: Why does AdamW outperform Adam with L2 regularization? A: In Adam, L2 regularization adds λx to the gradient, which is then scaled by 1/√v̂. This means the effective weight decay varies per-parameter inversely to gradient magnitude. AdamW decouples weight decay, applying λx directly without adaptive scaling, providing uniform regularization.

Q5: When would you use SGD+momentum over Adam? A: SGD+momentum often achieves better final generalization on vision tasks (ImageNet, etc.) despite slower convergence. Adam converges faster but may converge to sharper minima. Common pattern: train with Adam for speed, then switch to SGD for fine-tuning.

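Q6: What happens if you use β₂=0.999 with a very large batch size? A: With β₂=0.999 the second moment estimate v̂ averages over roughly 1000 steps. With a very large batch size, each step covers more data and a run has fewer steps in total, so this window can span a large fraction of training and v̂ becomes stale when gradient statistics change. Consider reducing β₂ to 0.99 or 0.95 for large-batch training, GANs, or fast-changing loss landscapes.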

Q7: Explain the relationship between batch size and learning rate. A: Linear scaling rule: when batch size is multiplied by k, multiply LR by k. This maintains the same signal-to-noise ratio in gradient updates. Must be combined with warmup to avoid early instability. (Goyal et al., 2017)

Practice Problems

📝Problem 1: SGD Convergence

Show that for a quadratic function f(x) = (1/2)xᵀAx - bᵀx with A positive definite, the expected value of the SGD update after one step satisfies E[x₁] = (I - αA)x₀ + αb, which is identical to batch gradient descent.

💡Solution

Since E[g] = ∇f(x) = Ax - b for quadratic functions, we have E[x₁] = x₀ - α·E[g] = x₀ - α(Ax₀ - b) = (I - αA)x₀ + αb. The expected update matches batch GD exactly — the noise only affects higher moments, not the mean trajectory.

📝Problem 2: Why Momentum Helps

Consider minimizing f(x) = x₁² + 100x₂² (a narrow valley). Explain why SGD oscillates and how momentum fixes this.

💡Solution

The curvature in x₂ is 100× larger than in x₁ (the gradient components are 200x₂ and 2x₁), so any learning rate large enough to make progress in x₁ overshoots and oscillates along x₂. Momentum accumulates velocity: in oscillating directions successive gradients alternate in sign and largely cancel, while the consistent x₁ direction accumulates. The effective learning rate becomes α/(1-β) ≈ 10α in x₁, dramatically speeding convergence along the valley floor.

📝Problem 3: AdaGrad Decay

In AdaGrad, if a parameter has constant gradient g, show that the effective learning rate decays as O(1/√t).

💡Solution

After t steps, Gₜ = Σᵢ₌₁ᵗ g² = t·g². The effective learning rate is α/√(t·g²) = α/(g·√t), which decays as O(1/√t). This aggressive decay is why AdaGrad can halt training prematurely — learning rates vanish for parameters with consistently large gradients.

📝Problem 4: Adam Hyperparameters

If you increase β₂ in Adam from 0.999 to 0.9999, what happens to the bias correction factor and the effective window of the second moment estimate?

💡Solution

The bias correction factor 1/(1-β₂ᵗ) increases: at t=1000, 1/(1-0.999¹⁰⁰⁰) ≈ 1/0.632 ≈ 1.58 vs 1/(1-0.9999¹⁰⁰⁰) ≈ 1/0.095 ≈ 10.5. The effective window increases from ~1000 steps to ~10,000 steps. This makes v̂ smoother but slower to adapt to changes in gradient variance.

📝Problem 5: Batch Size Scaling

A model trains with batch size 128 and learning rate 0.01. If you increase batch size to 1024 (8× larger), what should the new learning rate be according to the linear scaling rule? What warmup strategy would you use?

💡Solution

Linear scaling: new LR = 0.01 × 8 = 0.08. Use linear warmup: start at LR = 0.01 and ramp linearly to 0.08 over the first ~5 epochs (or ~5% of total training). This prevents instability from large initial gradients amplified by the large LR.

📝Problem 6: Gradient Variance Reduction

If you increase mini-batch size from 32 to 512, by what factor does the variance of the gradient estimate decrease (assuming independent samples)?

💡Solution

For independent samples, variance scales as σ²/b. Increasing b from 32 to 512 (16×) decreases variance by factor 16. The standard deviation decreases by factor √16 = 4. This is why larger batches give smoother gradients but with diminishing returns: compute per step grows 16× while the noise level drops only 4×.
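The σ²/b scaling can be checked empirically. The sketch assumes hypothetical scalar per-example gradients drawn once from a standard normal distribution, and it samples batches with replacement so the draws are independent.

```python
import numpy as np

rng = np.random.default_rng(3)
per_example = rng.normal(size=10_000)   # hypothetical scalar per-example gradients


def batch_mean_variance(b, trials=5000):
    """Empirical variance of the mini-batch mean over many independent batches."""
    idx = rng.integers(0, per_example.size, size=(trials, b))  # with replacement
    return per_example[idx].mean(axis=1).var()


var_32 = batch_mean_variance(32)
var_512 = batch_mean_variance(512)
ratio = var_32 / var_512
print(f"variance ratio (b=32 vs b=512): {ratio:.1f}")
```

The printed ratio should land near 16, up to sampling error from the finite number of trials.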

📝Problem 7: Adam vs SGD Generalization

You observe that Adam converges 5× faster than SGD on a computer vision task, but SGD achieves 2% higher accuracy on the test set. Propose two strategies to get the best of both worlds.

💡Solution

Strategy 1: Train with Adam for the first 80% of training for fast convergence, then switch to SGD with momentum for the final 20% to settle into a flatter minimum. Strategy 2: Use AdamW with cosine annealing for full training but increase weight decay (e.g., 0.05–0.1) to regularize sharper minima. Both strategies leverage Adam's speed while improving generalization.

Quick Reference

Algorithm	Update Rule	Adaptive LR	Momentum	Best For
SGD	x - αg	No	Optional	Vision, final training
AdaGrad	x - αg/(√G+ε)	Yes	No	Sparse features, NLP
RMSProp	x - αg/(√v+ε)	Yes	No	RNNs, non-stationary
Adam	x - αm̂/(√v̂+ε)	Yes	Yes	Default optimizer
AdamW	x - α(m̂/(√v̂+ε) + λx)	Yes	Yes	Transformers, general

Default hyperparameters:

Parameter	SGD	Adam	AdamW
Learning rate	0.1	0.001	0.001
β₁ (momentum)	0.9	0.9	0.9
β₂	—	0.999	0.999
Weight decay	0	0	0.01
Epsilon	—	1e-8	1e-8

Key formulas:

SGD: x ← x - α·g
Momentum: v ← βv + g; x ← x - αv
AdaGrad: G ← G + g²; x ← x - αg/(√G+ε)
RMSProp: v ← βv + (1-β)g²; x ← x - αg/(√v+ε)
Adam: m ← β₁m + (1-β₁)g; v ← β₂v + (1-β₂)g²; m̂ = m/(1-β₁ᵗ); v̂ = v/(1-β₂ᵗ); x ← x - α·m̂/(√v̂+ε)

Cross-References

Gradient Descent (062): The deterministic foundation; batch gradient descent converges exactly but is slow.
Newton's Method (064): Second-order methods that use curvature information for faster convergence.
Convex Optimization (061): Theoretical convergence guarantees for SGD in convex settings.
Constrained Optimization (065): Extends SGD to constrained problems via projected variants.
Hyperparameter Optimization (069): Finding optimal learning rates and batch sizes systematically.
Calculus: Partial Derivatives (027): Foundation for computing gradients in multi-dimensional optimization.
Calculus: Optimization (030): Classical unconstrained optimization theory and optimality conditions.
Linear Algebra: Norms (015): Measuring gradient magnitude and parameter distance in optimization.
Probability: Expectation (039): Expected value analysis of SGD convergence.
Probability: CLT (042): Central Limit Theorem explains why mini-batch gradients concentrate around the true gradient.