Gradient Descent
💡 Why It Matters
Gradient descent is the foundational optimization algorithm that powers nearly all of modern machine learning. From training a simple linear regression to fine-tuning billion-parameter large language models, gradient descent and its variants are the engine behind every weight update. Understanding its mechanics, convergence properties, and practical pitfalls is essential for any ML practitioner or researcher.
At its core, gradient descent is an iterative method for finding local minima of differentiable functions. The intuition is beautifully simple: if you are standing on a mountain in thick fog and want to reach the valley, feel the slope under your feet and take a step downhill. Repeat until you reach the bottom. Despite this simplicity, the algorithm's interplay with learning rate choice, batch size, and loss landscape geometry creates deep and rich behavior that demands careful study.
The Algorithm
Gradient Descent Update Rule
Here,
- =Current parameter vector at iteration k
- =Learning rate (step size) at iteration k
- =Gradient of the objective function f at x_k
- =Updated parameter vector after one step
ℹ️ Gradient Direction
The negative gradient points in the direction of steepest descent. Moving opposite the gradient is guaranteed to decrease for sufficiently small step sizes, since whenever . This is the fundamental justification for the algorithm.
Learning Rate
DfLearning Rate
The learning rate is the step size that controls how far we move along the negative gradient direction at each iteration. It is the single most important hyperparameter in gradient descent. The choice of determines whether the algorithm converges, diverges, or oscillates without reaching a minimum.
Choosing the Learning Rate
⚠️ Too Large
If is too large, gradient descent overshoots the minimum and may diverge. For a quadratic , the update becomes . If , the iterates grow without bound. The algorithm oscillates with increasing amplitude and never converges.
ℹ️ Too Small
If is too small, convergence is painfully slow. Each step makes negligible progress, and the algorithm may require millions of iterations to reach the minimum. This wastes compute and can get stuck in plateaus or shallow local minima before reaching the true optimum.
💡 Just Right
A well-chosen learning rate balances speed and stability. The iterates make large enough progress to converge in reasonable time while remaining stable enough to approach the minimum without overshooting. In practice, the optimal depends on the curvature of the loss landscape and is typically found via learning rate schedules or adaptive methods.
Batch Gradient Descent
Batch Gradient Descent (Full Gradient)
Here,
- =Total number of training samples
- =Loss for the i-th training sample
- =Full gradient averaged over the entire dataset
Batch gradient descent computes the exact gradient using all training samples before each update. This guarantees the gradient direction is accurate but becomes prohibitively expensive for large datasets. Computing the full gradient requires forward and backward passes per iteration, which is infeasible when is in the millions.
Stochastic and Mini-batch Variants
Stochastic Gradient Descent (SGD)
Here,
- =Randomly sampled index from {1, ..., n}
- =Gradient from a single sample (noisy estimate)
Mini-batch Gradient Descent
Here,
- =Random mini-batch of size b sampled at iteration k
- =Mini-batch size (typically 32 to 512)
| Variant | Batch Size | Gradient Noise | Per-Iteration Cost | Use Case |
|---|---|---|---|---|
| Batch GD | All | None | Small datasets, convex problems | |
| Stochastic GD | 1 | High | Online learning, large-scale | |
| Mini-batch GD | Moderate | Deep learning (default) |
Convergence Analysis
ThConvergence Rate for Convex Functions
Let be -Lipschitz continuous and -strongly convex. Then gradient descent with constant step size satisfies:
This gives a convergence rate of — sublinear convergence. The iterates decrease the optimality gap at a rate inversely proportional to the number of iterations.
ThLinear Convergence for Strongly Convex Functions
If is -strongly convex and -smooth, gradient descent with step size satisfies:
The condition number governs the convergence rate: . When (perfect conditioning), convergence is immediate. When (ill-conditioned), convergence is extremely slow — the iterates zigzag through narrow valleys.
ℹ️ Non-convex Landscape
For non-convex problems (like neural network training), gradient descent is not guaranteed to find the global minimum. It converges to a stationary point where , which could be a local minimum, global minimum, or saddle point. In high-dimensional spaces, saddle points are exponentially more prevalent than local minima, and saddle points with negative curvature are often the primary obstacle to convergence.
Line Search Methods
DfExact Line Search
After computing the search direction , exact line search finds the step size that minimizes along that direction:
This is optimal but expensive — it requires evaluating at many points along the ray. In practice, exact line search is rarely used because each evaluation costs and the minimization itself is an inner optimization problem.
DfBacktracking Line Search
Backtracking is a practical alternative. Starting with an initial step size (e.g., ), it repeatedly shrinks by a factor until the Armijo condition is satisfied:
where is a constant (typically ). The parameter (typically ) controls how aggressively we shrink. This guarantees sufficient decrease at each step while avoiding the cost of exact minimization.
Momentum
Momentum Update Rule
Here,
- =Velocity (accumulated gradient) at iteration k
- =Momentum coefficient (typically 0.9)
- =Learning rate
💡 Physics Analogy
Imagine a ball rolling downhill on a frictionless surface. It accumulates velocity as it rolls, gaining speed in directions of consistent descent and slowing down through oscillations. The momentum term retains a fraction of the previous velocity, so the ball speeds up in directions where the gradient consistently points downhill and decelerates where the gradient flips sign.
Why momentum helps: In narrow valleys, vanilla gradient descent oscillates wildly because the gradient alternates direction. Momentum averages these oscillations, smoothing the trajectory. On flat plateaus, momentum accumulates velocity and accelerates through regions where vanilla GD stalls. The effective learning rate is amplified by in consistent directions — with , this is a 10x speedup along persistent gradient directions.
Nesterov Accelerated Gradient
Nesterov Momentum
Here,
- =Gradient evaluated at the 'look-ahead' position
Nesterov momentum evaluates the gradient not at the current position but at the position we would arrive at if we continued with the current velocity. This "look-ahead" correction provides better convergence guarantees: for convex functions compared to for standard gradient descent. It is the theoretical foundation behind many practical accelerations.
Learning Rate Schedules
Step Decay Schedule
Here,
- =Initial learning rate
- =Decay factor (typically 0.1 or 0.5)
- =Step size — decay every s iterations
- =Floor division (integer part)
Cosine Annealing Schedule
Here,
- =Minimum learning rate (floor)
- =Maximum learning rate (typically = \alpha_0)
- =Total number of iterations for one cycle
- =Current iteration number
Cosine Annealing with Warm Restarts
Here,
- =Iteration within the current cycle (resets at each restart)
Warm restarts periodically reset the learning rate to a high value. The high learning rate helps escape sharp minima and saddle points, while the annealing within each cycle allows fine-grained convergence to flat, generalizable minima.
| Schedule | Formula | Key Benefit |
|---|---|---|
| Constant | Simple baseline | |
| Step decay | Adapts to progress | |
| Exponential decay | Smooth, monotonic decrease | |
| Cosine annealing | Smooth, periodic | |
| Warm restart | Periodic cosine reset | Escapes local minima |
| Linear warmup | for | Stabilizes early training |
Python Implementation
import numpy as np
import matplotlib.pyplot as plt
def gradient_descent(f, grad_f, x0, lr=0.01, n_iters=1000, tol=1e-6):
"""Vanilla gradient descent with convergence check."""
x = x0.copy()
history = [x.copy()]
losses = [f(x)]
for i in range(n_iters):
grad = grad_f(x)
x = x - lr * grad
history.append(x.copy())
losses.append(f(x))
if np.linalg.norm(grad) < tol:
break
return x, np.array(history), losses
def momentum_gd(f, grad_f, x0, lr=0.01, beta=0.9, n_iters=1000, tol=1e-6):
"""Gradient descent with momentum."""
x = x0.copy()
v = np.zeros_like(x)
history = [x.copy()]
losses = [f(x)]
for i in range(n_iters):
grad = grad_f(x)
v = beta * v + grad
x = x - lr * v
history.append(x.copy())
losses.append(f(x))
if np.linalg.norm(grad) < tol:
break
return x, np.array(history), losses
def nesterov_gd(f, grad_f, x0, lr=0.01, beta=0.9, n_iters=1000, tol=1e-6):
"""Gradient descent with Nesterov momentum."""
x = x0.copy()
v = np.zeros_like(x)
history = [x.copy()]
losses = [f(x)]
for i in range(n_iters):
grad = grad_f(x + lr * beta * (-v))
v = beta * v + grad
x = x - lr * v
history.append(x.copy())
losses.append(f(x))
if np.linalg.norm(grad) < tol:
break
return x, np.array(history), losses
def cosine_annealing_lr(epoch, lr_min, lr_max, T):
"""Cosine annealing learning rate schedule."""
return lr_min + 0.5 * (lr_max - lr_min) * (1 + np.cos(np.pi * epoch / T))
def step_decay_lr(epoch, lr_0, gamma, step_size):
"""Step decay learning rate schedule."""
return lr_0 * (gamma ** (epoch // step_size))
# --- Example usage ---
f = lambda x: (x[0] - 2) ** 2 + (x[1] - 3) ** 2
grad_f = lambda x: np.array([2 * (x[0] - 2), 2 * (x[1] - 3)])
x0 = np.array([0.0, 0.0])
x_vanilla, hist_vanilla, losses_vanilla = gradient_descent(
f, grad_f, x0, lr=0.1, n_iters=100
)
x_mom, hist_mom, losses_mom = momentum_gd(
f, grad_f, x0, lr=0.1, beta=0.9, n_iters=100
)
x_nest, hist_nest, losses_nest = nesterov_gd(
f, grad_f, x0, lr=0.1, beta=0.9, n_iters=100
)
print(f"Vanilla GD optimum: {x_vanilla}, iterations: {len(losses_vanilla)}")
print(f"Momentum GD optimum: {x_mom}, iterations: {len(losses_mom)}")
print(f"Nesterov GD optimum: {x_nest}, iterations: {len(losses_nest)}")
# Learning rate schedule visualization
epochs = range(200)
lr_cosine = [cosine_annealing_lr(e, 1e-4, 1e-1, 200) for e in epochs]
lr_step = [step_decay_lr(e, 1e-1, 0.5, 50) for e in epochs]
plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plt.plot(epochs, lr_cosine, label="Cosine Annealing")
plt.plot(epochs, lr_step, label="Step Decay")
plt.xlabel("Epoch")
plt.ylabel("Learning Rate")
plt.legend()
plt.title("Learning Rate Schedules")
plt.subplot(1, 2, 2)
plt.plot(losses_vanilla, label="Vanilla GD")
plt.plot(losses_mom, label="Momentum GD")
plt.plot(losses_nest, label="Nesterov GD")
plt.xlabel("Iteration")
plt.ylabel("Loss")
plt.legend()
plt.title("Convergence Comparison")
plt.tight_layout()
plt.savefig("gd_comparison.png", dpi=150)
plt.show()
Applications in AI/ML
Neural Network Training
ℹ️ Backpropagation and Gradient Descent
Training a neural network is a gradient descent problem over millions (or billions) of parameters. The loss measures the discrepancy between predictions and labels. Backpropagation computes efficiently via the chain rule. Each training step updates all weights in the direction that reduces the loss: .
In practice, deep learning uses mini-batch SGD with momentum or Adam because:
- Full-batch gradients are too expensive to compute for large datasets.
- The noise in mini-batch gradients acts as implicit regularization.
- Momentum accelerates convergence through consistent gradient directions.
- Adam's per-parameter adaptive learning rates handle sparse features and varying curvatures.
Convex Optimization Connection
For convex problems like linear regression and SVMs, gradient descent has strong theoretical guarantees. The objective function has no local minima that are not global minima, and the convergence rate depends on the condition number . Gradient descent with exact line search achieves linear convergence for strongly convex functions.
Regularized Optimization
Regularized Loss Minimization
Here,
- =Per-sample loss function (e.g., cross-entropy, MSE)
- =Regularization term (L1, L2, or other penalty)
- =Regularization strength (trade-off parameter)
Gradient descent naturally handles regularized objectives. The gradient becomes . For L2 regularization (), the update becomes , which shrinks weights toward zero at every step — a phenomenon called weight decay.
Online Learning
ℹ️ Online Gradient Descent
In online learning, data arrives one sample (or one mini-batch) at a time, and we must update the model before seeing the next sample. Online gradient descent processes each sample immediately: , where is the loss on the -th sample. The regret — cumulative loss compared to the best fixed decision in hindsight — grows as with appropriately decaying step sizes . This sublinear regret means the algorithm "almost" converges to the best decision, making it a cornerstone of online learning theory.
Common Mistakes
| Mistake | Why It's Wrong | Correct Approach |
|---|---|---|
| Using a constant high learning rate | Causes divergence and oscillation around the minimum | Use a learning rate schedule (cosine, step decay) |
| Ignoring gradient magnitude | A gradient norm of 0.001 and 100 require very different step sizes | Use adaptive methods (Adam) or gradient clipping |
| Not shuffling training data | Creates systematic bias in mini-batch gradients | Shuffle the dataset at each epoch |
| Setting batch size too small | Excessive gradient noise prevents convergence | Use batch sizes of 32–512 for stable training |
| Training without momentum | Slow convergence in ill-conditioned loss landscapes | Add momentum () as a default |
| Using vanilla GD for deep learning | Full-batch computation is prohibitively expensive | Use mini-batch SGD or Adam |
| Skipping learning rate warmup | Large initial gradients can destabilize early training | Warm up learning rate over the first few epochs |
| Not monitoring loss curves | Missing divergence, plateaus, or overfitting | Plot training and validation loss every epoch |
| Assuming convergence to global minimum | Non-convex problems may have many local minima and saddle points | Use multiple restarts or annealing to escape poor minima |
| Confusing gradient descent with Newton's method | GD uses first-order information only; Newton uses second-order curvature | Use Newton/BFGS for small problems, GD for large-scale |
Interview Questions
Q1: Why is the learning rate the most important hyperparameter in gradient descent?
💡Answer
The learning rate controls the step size at each iteration. If it's too large, the iterates overshoot and diverge. If it's too small, convergence is impractically slow. Unlike other hyperparameters that mainly affect model capacity or regularization, the learning rate directly determines whether the optimization converges at all. For a quadratic , the iteration becomes , which converges only if , i.e., . This critical dependence makes learning rate tuning essential.
Q2: What is the difference between batch gradient descent, SGD, and mini-batch GD?
💡Answer
Batch GD computes the exact gradient over all samples per update — stable but per step. SGD uses a single random sample — fast but noisy. Mini-batch GD uses samples — the practical compromise. The noise in SGD/mini-batch is not a bug but a feature: it provides implicit regularization, helps escape saddle points, and enables training on data streams. Modern deep learning universally uses mini-batch GD with .
Q3: How does momentum improve gradient descent?
💡Answer
Momentum accumulates past gradients: , then updates . This has two key benefits: (1) It averages out oscillations in narrow valleys, smoothing the trajectory toward the minimum. (2) It accelerates through flat regions where the gradient is small but consistent, as velocity accumulates over iterations. The effective learning rate in consistent directions is amplified by when . Nesterov momentum improves this further by evaluating the gradient at a look-ahead position, achieving convergence for convex functions.
Q4: Explain the convergence guarantees of gradient descent for convex functions.
💡Answer
For -Lipschitz smooth functions, GD with achieves convergence: . For -strongly convex and -smooth functions, GD with achieves linear convergence at rate per iteration, where is the condition number. For non-convex functions, GD converges to a stationary point () in iterations, but there is no guarantee of reaching the global minimum.
Q5: When would you choose SGD over Adam?
💡Answer
SGD with momentum often achieves better generalization than Adam in the final stages of training. Adam's adaptive per-parameter learning rates can converge faster initially but may generalize worse because they escape sharp minima less effectively. The common practice is: use Adam for initial rapid convergence, then switch to SGD with momentum (and a decaying learning rate) for the final training phase. This "warm start with Adam, finish with SGD" strategy combines the speed of adaptive methods with the generalization of SGD.
Second-Order Effects: Condition Number
DfCondition Number
The condition number measures the ratio of the largest to smallest curvature of the function. For with eigenvalues and , . A high condition number means the function has very different curvatures in different directions — creating narrow, elongated valleys where gradient descent oscillates.
⚠️ Ill-conditioned Problems
When , gradient descent requires iterations to reach accuracy . For example, if , you need roughly 1000 times more iterations than if . Preconditioning (transforming coordinates to reduce ) or using second-order methods is essential for such problems.
Practice Problems
Problem 1: Manual Gradient Descent Step
📝One Step of Gradient Descent
Problem: Given , starting point , and learning rate , perform one gradient descent update.
💡Solution
Gradient:
At :
Update:
New point:
Check: , . The function decreased, confirming the step was in the right direction.
Problem 2: Learning Rate Divergence
📝Divergent Learning Rate
Problem: For with and gradient descent , find the range of for convergence.
💡Solution
The update is . This is a geometric sequence: .
Convergence condition: .
- If : for all (optimal, converges in one step).
- If : (converges slowly).
- If : (oscillates, does not converge).
- If : (diverges).
Optimal: gives fastest convergence for this quadratic.
Problem 3: Momentum vs Vanilla GD
📝Momentum Acceleration
Problem: Consider a function with narrow valley along where . Explain why momentum helps and compute the effective step size after 5 iterations with and constant gradient .
💡Solution
The function has (from the -direction) and (from the -direction), so . Vanilla GD oscillates wildly in the -direction due to the large curvature.
Momentum with constant gradient : ,
After iterations:
The effective step in the -direction is — about the gradient magnitude. As , , amplifying the consistent direction by .
Problem 4: Cosine Annealing Schedule
📝Learning Rate at Specific Epochs
Problem: For cosine annealing with , , and epochs, compute the learning rate at epochs .
💡Solution
- :
- :
- :
- :
- :
The schedule starts near maximum, decays smoothly through the middle, and ends at the minimum — exactly the behavior needed for training neural networks.
Quick Reference
📋Key Takeaways
- Update Rule: — move opposite the gradient by step size .
- Learning Rate: The most critical hyperparameter. Too large diverges, too small is slow. Use schedules (cosine, step decay) instead of constant values.
- Batch vs. SGD vs. Mini-batch: Batch GD uses all data (stable but expensive), SGD uses one sample (fast but noisy), mini-batch GD (–) is the practical default for deep learning.
- Convergence Rate: for convex, linear convergence for strongly convex, to reach for non-convex.
- Line Search: Backtracking with Armijo condition provides adaptive step sizes without exact minimization.
- Momentum: with averages oscillations and accelerates through consistent gradient directions by .
- Nesterov Momentum: Evaluate gradient at the look-ahead position for convergence — the theoretical foundation for accelerated methods.
- Schedules: Cosine annealing is the most popular schedule in deep learning. Add warm restarts to escape poor minima.
- Deep Learning Practice: Start with Adam (lr=0.001) for fast convergence; switch to SGD+momentum for final training. Use gradient clipping when gradients explode.
- Python: Use
scipy.optimize.minimize()for small convex problems; implement custom loops with NumPy for full control over training dynamics.
Cross-References
- Stochastic Gradient Descent: Adam, RMSProp, and adaptive learning rate methods → SGD
- Newton's Method: Second-order optimization with Hessian information → Newton's Method
- Constrained Optimization: KKT conditions, penalty methods, and interior-point methods → Constrained Optimization
- Convex Optimization: Global optimality guarantees and convex function properties → Convex Optimization
- Hyperparameter Optimization: Learning rate tuning via grid search, Bayesian optimization → Hyperparameter Optimization
- Calculus Optimization: First and second derivative tests, unconstrained optimization → Calculus Optimization
- Partial Derivatives: Gradients and Jacobians needed for computing → Partial Derivatives
- Linear Algebra Norms: Vector norms used in convergence analysis → Linear Algebra Norms