← Math|62 of 100
Optimization

Gradient Descent

Master gradient descent variants, learning rate schedules, convergence analysis, and their applications in machine learning.

📂 First-order📖 Lesson 62 of 100🎓 Free Course

Advertisement

Gradient Descent

💡 Why It Matters

Gradient descent is the foundational optimization algorithm that powers nearly all of modern machine learning. From training a simple linear regression to fine-tuning billion-parameter large language models, gradient descent and its variants are the engine behind every weight update. Understanding its mechanics, convergence properties, and practical pitfalls is essential for any ML practitioner or researcher.

At its core, gradient descent is an iterative method for finding local minima of differentiable functions. The intuition is beautifully simple: if you are standing on a mountain in thick fog and want to reach the valley, feel the slope under your feet and take a step downhill. Repeat until you reach the bottom. Despite this simplicity, the algorithm's interplay with learning rate choice, batch size, and loss landscape geometry creates deep and rich behavior that demands careful study.


The Algorithm

Gradient Descent Update Rule

xk+1=xkαkf(xk)x_{k+1} = x_k - \alpha_k \nabla f(x_k)

Here,

  • xkx_k=Current parameter vector at iteration k
  • αk\alpha_k=Learning rate (step size) at iteration k
  • f(xk)\nabla f(x_k)=Gradient of the objective function f at x_k
  • xk+1x_{k+1}=Updated parameter vector after one step

ℹ️ Gradient Direction

The negative gradient f(xk)-\nabla f(x_k) points in the direction of steepest descent. Moving opposite the gradient is guaranteed to decrease ff for sufficiently small step sizes, since f(xk)T[f(xk)]=f(xk)2<0\nabla f(x_k)^T \cdot [-\nabla f(x_k)] = -\|\nabla f(x_k)\|^2 < 0 whenever f(xk)0\nabla f(x_k) \neq 0. This is the fundamental justification for the algorithm.


Learning Rate

DfLearning Rate

The learning rate α>0\alpha > 0 is the step size that controls how far we move along the negative gradient direction at each iteration. It is the single most important hyperparameter in gradient descent. The choice of α\alpha determines whether the algorithm converges, diverges, or oscillates without reaching a minimum.

Choosing the Learning Rate

⚠️ Too Large

If α\alpha is too large, gradient descent overshoots the minimum and may diverge. For a quadratic f(x)=12ax2f(x) = \frac{1}{2}ax^2, the update becomes xk+1=xk(1αa)x_{k+1} = x_k(1 - \alpha a). If 1αa>1|1 - \alpha a| > 1, the iterates grow without bound. The algorithm oscillates with increasing amplitude and never converges.

ℹ️ Too Small

If α\alpha is too small, convergence is painfully slow. Each step makes negligible progress, and the algorithm may require millions of iterations to reach the minimum. This wastes compute and can get stuck in plateaus or shallow local minima before reaching the true optimum.

💡 Just Right

A well-chosen learning rate balances speed and stability. The iterates make large enough progress to converge in reasonable time while remaining stable enough to approach the minimum without overshooting. In practice, the optimal α\alpha depends on the curvature of the loss landscape and is typically found via learning rate schedules or adaptive methods.


Batch Gradient Descent

Batch Gradient Descent (Full Gradient)

xk+1=xkα1ni=1nfi(xk)x_{k+1} = x_k - \alpha \cdot \frac{1}{n} \sum_{i=1}^{n} \nabla f_i(x_k)

Here,

  • nn=Total number of training samples
  • fi(x)f_i(x)=Loss for the i-th training sample
  • 1ni=1nfi(xk)\frac{1}{n} \sum_{i=1}^{n} \nabla f_i(x_k)=Full gradient averaged over the entire dataset

Batch gradient descent computes the exact gradient using all nn training samples before each update. This guarantees the gradient direction is accurate but becomes prohibitively expensive for large datasets. Computing the full gradient requires O(n)O(n) forward and backward passes per iteration, which is infeasible when nn is in the millions.

Stochastic and Mini-batch Variants

Stochastic Gradient Descent (SGD)

xk+1=xkαfik(xk)x_{k+1} = x_k - \alpha \cdot \nabla f_{i_k}(x_k)

Here,

  • iki_k=Randomly sampled index from {1, ..., n}
  • fik(xk)\nabla f_{i_k}(x_k)=Gradient from a single sample (noisy estimate)

Mini-batch Gradient Descent

xk+1=xkα1BkiBkfi(xk)x_{k+1} = x_k - \alpha \cdot \frac{1}{|B_k|} \sum_{i \in B_k} \nabla f_i(x_k)

Here,

  • BkB_k=Random mini-batch of size b sampled at iteration k
  • Bk=b|B_k| = b=Mini-batch size (typically 32 to 512)
VariantBatch SizeGradient NoisePer-Iteration CostUse Case
Batch GDAll nnNoneO(n)O(n)Small datasets, convex problems
Stochastic GD1HighO(1)O(1)Online learning, large-scale
Mini-batch GDbbModerateO(b)O(b)Deep learning (default)

Convergence Analysis

ThConvergence Rate for Convex Functions

Let ff be LL-Lipschitz continuous and μ\mu-strongly convex. Then gradient descent with constant step size α=1L\alpha = \frac{1}{L} satisfies:

f(xk)f(x)Lx0x22kf(x_k) - f(x^*) \leq \frac{L \|x_0 - x^*\|^2}{2k}

This gives a convergence rate of O(1/k)O(1/k) — sublinear convergence. The iterates decrease the optimality gap at a rate inversely proportional to the number of iterations.

ThLinear Convergence for Strongly Convex Functions

If ff is μ\mu-strongly convex and LL-smooth, gradient descent with step size α=2L+μ\alpha = \frac{2}{L + \mu} satisfies:

f(xk)f(x)(LμL+μ)2kx0x2f(x_k) - f(x^*) \leq \left(\frac{L - \mu}{L + \mu}\right)^{2k} \|x_0 - x^*\|^2

The condition number κ=L/μ\kappa = L/\mu governs the convergence rate: LμL+μ=κ1κ+1\frac{L - \mu}{L + \mu} = \frac{\kappa - 1}{\kappa + 1}. When κ=1\kappa = 1 (perfect conditioning), convergence is immediate. When κ1\kappa \gg 1 (ill-conditioned), convergence is extremely slow — the iterates zigzag through narrow valleys.

ℹ️ Non-convex Landscape

For non-convex problems (like neural network training), gradient descent is not guaranteed to find the global minimum. It converges to a stationary point where f(x)=0\nabla f(x) = 0, which could be a local minimum, global minimum, or saddle point. In high-dimensional spaces, saddle points are exponentially more prevalent than local minima, and saddle points with negative curvature are often the primary obstacle to convergence.


Line Search Methods

DfExact Line Search

After computing the search direction dk=f(xk)d_k = -\nabla f(x_k), exact line search finds the step size that minimizes ff along that direction:

αk=argminα>0f(xk+αdk)\alpha_k = \arg\min_{\alpha > 0} f(x_k + \alpha d_k)

This is optimal but expensive — it requires evaluating ff at many points along the ray. In practice, exact line search is rarely used because each evaluation costs O(n)O(n) and the minimization itself is an inner optimization problem.

DfBacktracking Line Search

Backtracking is a practical alternative. Starting with an initial step size (e.g., α=1\alpha = 1), it repeatedly shrinks α\alpha by a factor β(0,1)\beta \in (0, 1) until the Armijo condition is satisfied:

f(xk+αdk)f(xk)+cαf(xk)Tdkf(x_k + \alpha d_k) \leq f(x_k) + c \alpha \nabla f(x_k)^T d_k

where c(0,1)c \in (0, 1) is a constant (typically c=104c = 10^{-4}). The parameter β\beta (typically 0.50.5) controls how aggressively we shrink. This guarantees sufficient decrease at each step while avoiding the cost of exact minimization.


Momentum

Momentum Update Rule

vk=βvk1+f(xk)xk+1=xkαvk\begin{aligned} v_k &= \beta v_{k-1} + \nabla f(x_k) \\ x_{k+1} &= x_k - \alpha v_k \end{aligned}

Here,

  • vkv_k=Velocity (accumulated gradient) at iteration k
  • β\beta=Momentum coefficient (typically 0.9)
  • α\alpha=Learning rate

💡 Physics Analogy

Imagine a ball rolling downhill on a frictionless surface. It accumulates velocity as it rolls, gaining speed in directions of consistent descent and slowing down through oscillations. The momentum term βvk1\beta v_{k-1} retains a fraction of the previous velocity, so the ball speeds up in directions where the gradient consistently points downhill and decelerates where the gradient flips sign.

Why momentum helps: In narrow valleys, vanilla gradient descent oscillates wildly because the gradient alternates direction. Momentum averages these oscillations, smoothing the trajectory. On flat plateaus, momentum accumulates velocity and accelerates through regions where vanilla GD stalls. The effective learning rate is amplified by 11β\frac{1}{1-\beta} in consistent directions — with β=0.9\beta = 0.9, this is a 10x speedup along persistent gradient directions.


Nesterov Accelerated Gradient

Nesterov Momentum

vk=βvk1+f(xkαβvk1)xk+1=xkαvk\begin{aligned} v_k &= \beta v_{k-1} + \nabla f(x_k - \alpha \beta v_{k-1}) \\ x_{k+1} &= x_k - \alpha v_k \end{aligned}

Here,

  • f(xkαβvk1)\nabla f(x_k - \alpha \beta v_{k-1})=Gradient evaluated at the 'look-ahead' position

Nesterov momentum evaluates the gradient not at the current position but at the position we would arrive at if we continued with the current velocity. This "look-ahead" correction provides better convergence guarantees: O(1/k2)O(1/k^2) for convex functions compared to O(1/k)O(1/k) for standard gradient descent. It is the theoretical foundation behind many practical accelerations.


Learning Rate Schedules

Step Decay Schedule

αk=α0γk/s\alpha_k = \alpha_0 \cdot \gamma^{\lfloor k / s \rfloor}

Here,

  • α0\alpha_0=Initial learning rate
  • γ\gamma=Decay factor (typically 0.1 or 0.5)
  • ss=Step size — decay every s iterations
  • k/s\lfloor k/s \rfloor=Floor division (integer part)

Cosine Annealing Schedule

αk=αmin+12(αmaxαmin)(1+cos(πkT))\alpha_k = \alpha_{\min} + \frac{1}{2}(\alpha_{\max} - \alpha_{\min})\left(1 + \cos\left(\frac{\pi k}{T}\right)\right)

Here,

  • αmin\alpha_{\min}=Minimum learning rate (floor)
  • αmax\alpha_{\max}=Maximum learning rate (typically = \alpha_0)
  • TT=Total number of iterations for one cycle
  • kk=Current iteration number

Cosine Annealing with Warm Restarts

αk=αmin+12(αmaxαmin)(1+cos(π(kmodT)T))\alpha_k = \alpha_{\min} + \frac{1}{2}(\alpha_{\max} - \alpha_{\min})\left(1 + \cos\left(\frac{\pi \, (k \bmod T)}{T}\right)\right)

Here,

  • kmodTk \bmod T=Iteration within the current cycle (resets at each restart)

Warm restarts periodically reset the learning rate to a high value. The high learning rate helps escape sharp minima and saddle points, while the annealing within each cycle allows fine-grained convergence to flat, generalizable minima.

ScheduleFormulaKey Benefit
Constantαk=α0\alpha_k = \alpha_0Simple baseline
Step decayαk=α0γk/s\alpha_k = \alpha_0 \gamma^{\lfloor k/s \rfloor}Adapts to progress
Exponential decayαk=α0eλk\alpha_k = \alpha_0 e^{-\lambda k}Smooth, monotonic decrease
Cosine annealingαk=αmin+12(αmaxαmin)(1+cos(πk/T))\alpha_k = \alpha_{\min} + \frac{1}{2}(\alpha_{\max} - \alpha_{\min})(1 + \cos(\pi k / T))Smooth, periodic
Warm restartPeriodic cosine resetEscapes local minima
Linear warmupαk=α0k/K\alpha_k = \alpha_0 \cdot k / K for k<Kk < KStabilizes early training

Python Implementation

import numpy as np
import matplotlib.pyplot as plt


def gradient_descent(f, grad_f, x0, lr=0.01, n_iters=1000, tol=1e-6):
    """Vanilla gradient descent with convergence check."""
    x = x0.copy()
    history = [x.copy()]
    losses = [f(x)]

    for i in range(n_iters):
        grad = grad_f(x)
        x = x - lr * grad
        history.append(x.copy())
        losses.append(f(x))

        if np.linalg.norm(grad) < tol:
            break

    return x, np.array(history), losses


def momentum_gd(f, grad_f, x0, lr=0.01, beta=0.9, n_iters=1000, tol=1e-6):
    """Gradient descent with momentum."""
    x = x0.copy()
    v = np.zeros_like(x)
    history = [x.copy()]
    losses = [f(x)]

    for i in range(n_iters):
        grad = grad_f(x)
        v = beta * v + grad
        x = x - lr * v
        history.append(x.copy())
        losses.append(f(x))

        if np.linalg.norm(grad) < tol:
            break

    return x, np.array(history), losses


def nesterov_gd(f, grad_f, x0, lr=0.01, beta=0.9, n_iters=1000, tol=1e-6):
    """Gradient descent with Nesterov momentum."""
    x = x0.copy()
    v = np.zeros_like(x)
    history = [x.copy()]
    losses = [f(x)]

    for i in range(n_iters):
        grad = grad_f(x + lr * beta * (-v))
        v = beta * v + grad
        x = x - lr * v
        history.append(x.copy())
        losses.append(f(x))

        if np.linalg.norm(grad) < tol:
            break

    return x, np.array(history), losses


def cosine_annealing_lr(epoch, lr_min, lr_max, T):
    """Cosine annealing learning rate schedule."""
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + np.cos(np.pi * epoch / T))


def step_decay_lr(epoch, lr_0, gamma, step_size):
    """Step decay learning rate schedule."""
    return lr_0 * (gamma ** (epoch // step_size))


# --- Example usage ---
f = lambda x: (x[0] - 2) ** 2 + (x[1] - 3) ** 2
grad_f = lambda x: np.array([2 * (x[0] - 2), 2 * (x[1] - 3)])

x0 = np.array([0.0, 0.0])

x_vanilla, hist_vanilla, losses_vanilla = gradient_descent(
    f, grad_f, x0, lr=0.1, n_iters=100
)
x_mom, hist_mom, losses_mom = momentum_gd(
    f, grad_f, x0, lr=0.1, beta=0.9, n_iters=100
)
x_nest, hist_nest, losses_nest = nesterov_gd(
    f, grad_f, x0, lr=0.1, beta=0.9, n_iters=100
)

print(f"Vanilla GD optimum: {x_vanilla}, iterations: {len(losses_vanilla)}")
print(f"Momentum GD optimum: {x_mom}, iterations: {len(losses_mom)}")
print(f"Nesterov GD optimum: {x_nest}, iterations: {len(losses_nest)}")

# Learning rate schedule visualization
epochs = range(200)
lr_cosine = [cosine_annealing_lr(e, 1e-4, 1e-1, 200) for e in epochs]
lr_step = [step_decay_lr(e, 1e-1, 0.5, 50) for e in epochs]

plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plt.plot(epochs, lr_cosine, label="Cosine Annealing")
plt.plot(epochs, lr_step, label="Step Decay")
plt.xlabel("Epoch")
plt.ylabel("Learning Rate")
plt.legend()
plt.title("Learning Rate Schedules")

plt.subplot(1, 2, 2)
plt.plot(losses_vanilla, label="Vanilla GD")
plt.plot(losses_mom, label="Momentum GD")
plt.plot(losses_nest, label="Nesterov GD")
plt.xlabel("Iteration")
plt.ylabel("Loss")
plt.legend()
plt.title("Convergence Comparison")
plt.tight_layout()
plt.savefig("gd_comparison.png", dpi=150)
plt.show()

Applications in AI/ML

Neural Network Training

ℹ️ Backpropagation and Gradient Descent

Training a neural network is a gradient descent problem over millions (or billions) of parameters. The loss L(θ)\mathcal{L}(\theta) measures the discrepancy between predictions and labels. Backpropagation computes θL\nabla_\theta \mathcal{L} efficiently via the chain rule. Each training step updates all weights in the direction that reduces the loss: θt+1=θtαθL(θt)\theta_{t+1} = \theta_t - \alpha \nabla_\theta \mathcal{L}(\theta_t).

In practice, deep learning uses mini-batch SGD with momentum or Adam because:

  1. Full-batch gradients are too expensive to compute for large datasets.
  2. The noise in mini-batch gradients acts as implicit regularization.
  3. Momentum accelerates convergence through consistent gradient directions.
  4. Adam's per-parameter adaptive learning rates handle sparse features and varying curvatures.

Convex Optimization Connection

For convex problems like linear regression and SVMs, gradient descent has strong theoretical guarantees. The objective function has no local minima that are not global minima, and the convergence rate depends on the condition number κ=L/μ\kappa = L/\mu. Gradient descent with exact line search achieves linear convergence for strongly convex functions.

Regularized Optimization

Regularized Loss Minimization

minθ  1ni=1nL(fθ(xi),yi)+λR(θ)\min_\theta \; \frac{1}{n}\sum_{i=1}^{n} \mathcal{L}(f_\theta(x_i), y_i) + \lambda R(\theta)

Here,

  • L\mathcal{L}=Per-sample loss function (e.g., cross-entropy, MSE)
  • R(θ)R(\theta)=Regularization term (L1, L2, or other penalty)
  • λ\lambda=Regularization strength (trade-off parameter)

Gradient descent naturally handles regularized objectives. The gradient becomes θ[1nLi+λR(θ)]=1nθLi+λθR(θ)\nabla_\theta \left[\frac{1}{n}\sum \mathcal{L}_i + \lambda R(\theta)\right] = \frac{1}{n}\sum \nabla_\theta \mathcal{L}_i + \lambda \nabla_\theta R(\theta). For L2 regularization (R(θ)=θ2R(\theta) = \|\theta\|^2), the update becomes xk+1=(12αλ)xkαf(xk)x_{k+1} = (1 - 2\alpha\lambda) x_k - \alpha \nabla f(x_k), which shrinks weights toward zero at every step — a phenomenon called weight decay.

Online Learning

ℹ️ Online Gradient Descent

In online learning, data arrives one sample (or one mini-batch) at a time, and we must update the model before seeing the next sample. Online gradient descent processes each sample immediately: xt+1=xtαtft(xt)x_{t+1} = x_t - \alpha_t \nabla f_t(x_t), where ftf_t is the loss on the tt-th sample. The regret — cumulative loss compared to the best fixed decision in hindsight — grows as O(T)O(\sqrt{T}) with appropriately decaying step sizes αt=O(1/t)\alpha_t = O(1/\sqrt{t}). This sublinear regret means the algorithm "almost" converges to the best decision, making it a cornerstone of online learning theory.


Common Mistakes

MistakeWhy It's WrongCorrect Approach
Using a constant high learning rateCauses divergence and oscillation around the minimumUse a learning rate schedule (cosine, step decay)
Ignoring gradient magnitudeA gradient norm of 0.001 and 100 require very different step sizesUse adaptive methods (Adam) or gradient clipping
Not shuffling training dataCreates systematic bias in mini-batch gradientsShuffle the dataset at each epoch
Setting batch size too smallExcessive gradient noise prevents convergenceUse batch sizes of 32–512 for stable training
Training without momentumSlow convergence in ill-conditioned loss landscapesAdd momentum (β=0.9\beta = 0.9) as a default
Using vanilla GD for deep learningFull-batch computation is prohibitively expensiveUse mini-batch SGD or Adam
Skipping learning rate warmupLarge initial gradients can destabilize early trainingWarm up learning rate over the first few epochs
Not monitoring loss curvesMissing divergence, plateaus, or overfittingPlot training and validation loss every epoch
Assuming convergence to global minimumNon-convex problems may have many local minima and saddle pointsUse multiple restarts or annealing to escape poor minima
Confusing gradient descent with Newton's methodGD uses first-order information only; Newton uses second-order curvatureUse Newton/BFGS for small problems, GD for large-scale

Interview Questions

Q1: Why is the learning rate the most important hyperparameter in gradient descent?

💡Answer

The learning rate controls the step size at each iteration. If it's too large, the iterates overshoot and diverge. If it's too small, convergence is impractically slow. Unlike other hyperparameters that mainly affect model capacity or regularization, the learning rate directly determines whether the optimization converges at all. For a quadratic f(x)=12ax2f(x) = \frac{1}{2}ax^2, the iteration becomes xk+1=xk(1αa)x_{k+1} = x_k(1 - \alpha a), which converges only if 1αa<1|1 - \alpha a| < 1, i.e., α<2/a\alpha < 2/a. This critical dependence makes learning rate tuning essential.

Q2: What is the difference between batch gradient descent, SGD, and mini-batch GD?

💡Answer

Batch GD computes the exact gradient over all nn samples per update — stable but O(n)O(n) per step. SGD uses a single random sample — fast but noisy. Mini-batch GD uses bb samples — the practical compromise. The noise in SGD/mini-batch is not a bug but a feature: it provides implicit regularization, helps escape saddle points, and enables training on data streams. Modern deep learning universally uses mini-batch GD with b[32,512]b \in [32, 512].

Q3: How does momentum improve gradient descent?

💡Answer

Momentum accumulates past gradients: vk=βvk1+f(xk)v_k = \beta v_{k-1} + \nabla f(x_k), then updates xk+1=xkαvkx_{k+1} = x_k - \alpha v_k. This has two key benefits: (1) It averages out oscillations in narrow valleys, smoothing the trajectory toward the minimum. (2) It accelerates through flat regions where the gradient is small but consistent, as velocity accumulates over iterations. The effective learning rate in consistent directions is amplified by 11β10×\frac{1}{1-\beta} \approx 10\times when β=0.9\beta = 0.9. Nesterov momentum improves this further by evaluating the gradient at a look-ahead position, achieving O(1/k2)O(1/k^2) convergence for convex functions.

Q4: Explain the convergence guarantees of gradient descent for convex functions.

💡Answer

For LL-Lipschitz smooth functions, GD with α=1/L\alpha = 1/L achieves O(1/k)O(1/k) convergence: f(xk)f(x)Lx0x22kf(x_k) - f(x^*) \leq \frac{L\|x_0 - x^*\|^2}{2k}. For μ\mu-strongly convex and LL-smooth functions, GD with α=2/(L+μ)\alpha = 2/(L+\mu) achieves linear convergence at rate (κ1κ+1)2\left(\frac{\kappa-1}{\kappa+1}\right)^2 per iteration, where κ=L/μ\kappa = L/\mu is the condition number. For non-convex functions, GD converges to a stationary point (f<ϵ\|\nabla f\| < \epsilon) in O(1/ϵ2)O(1/\epsilon^2) iterations, but there is no guarantee of reaching the global minimum.

Q5: When would you choose SGD over Adam?

💡Answer

SGD with momentum often achieves better generalization than Adam in the final stages of training. Adam's adaptive per-parameter learning rates can converge faster initially but may generalize worse because they escape sharp minima less effectively. The common practice is: use Adam for initial rapid convergence, then switch to SGD with momentum (and a decaying learning rate) for the final training phase. This "warm start with Adam, finish with SGD" strategy combines the speed of adaptive methods with the generalization of SGD.


Second-Order Effects: Condition Number

DfCondition Number

The condition number κ=L/μ\kappa = L/\mu measures the ratio of the largest to smallest curvature of the function. For f(x)=12xTAxf(x) = \frac{1}{2}x^T A x with eigenvalues λmax\lambda_{\max} and λmin\lambda_{\min}, κ=λmax/λmin\kappa = \lambda_{\max}/\lambda_{\min}. A high condition number means the function has very different curvatures in different directions — creating narrow, elongated valleys where gradient descent oscillates.

⚠️ Ill-conditioned Problems

When κ1\kappa \gg 1, gradient descent requires O(κlog(1/ϵ))O(\kappa \log(1/\epsilon)) iterations to reach accuracy ϵ\epsilon. For example, if κ=1000\kappa = 1000, you need roughly 1000 times more iterations than if κ=1\kappa = 1. Preconditioning (transforming coordinates to reduce κ\kappa) or using second-order methods is essential for such problems.


Practice Problems

Problem 1: Manual Gradient Descent Step

📝One Step of Gradient Descent

Problem: Given f(x,y)=(x3)2+2(y+1)2f(x, y) = (x - 3)^2 + 2(y + 1)^2, starting point (x0,y0)=(0,0)(x_0, y_0) = (0, 0), and learning rate α=0.1\alpha = 0.1, perform one gradient descent update.

💡Solution

Gradient: f=(2(x3),  4(y+1))\nabla f = \left(2(x-3), \; 4(y+1)\right)

At (0,0)(0, 0): f(0,0)=(2(03),  4(0+1))=(6,  4)\nabla f(0, 0) = (2(0-3), \; 4(0+1)) = (-6, \; 4)

Update: x1=x0α(6)=0+0.6=0.6x_1 = x_0 - \alpha \cdot (-6) = 0 + 0.6 = 0.6

y1=y0α4=00.4=0.4y_1 = y_0 - \alpha \cdot 4 = 0 - 0.4 = -0.4

New point: (0.6,0.4)(0.6, -0.4)

Check: f(0,0)=9+2=11f(0, 0) = 9 + 2 = 11, f(0.6,0.4)=(0.63)2+2(0.4+1)2=5.76+0.72=6.48<11f(0.6, -0.4) = (0.6-3)^2 + 2(-0.4+1)^2 = 5.76 + 0.72 = 6.48 < 11. The function decreased, confirming the step was in the right direction.


Problem 2: Learning Rate Divergence

📝Divergent Learning Rate

Problem: For f(x)=x2f(x) = x^2 with x0=1x_0 = 1 and gradient descent xk+1=xkα2xkx_{k+1} = x_k - \alpha \cdot 2x_k, find the range of α\alpha for convergence.

💡Solution

The update is xk+1=xk(12α)x_{k+1} = x_k(1 - 2\alpha). This is a geometric sequence: xk=(12α)kx0x_k = (1 - 2\alpha)^k \cdot x_0.

Convergence condition: 12α<1    1<12α<1    0<α<1|1 - 2\alpha| < 1 \implies -1 < 1 - 2\alpha < 1 \implies 0 < \alpha < 1.

  • If α=0.5\alpha = 0.5: xk=0x_k = 0 for all k1k \geq 1 (optimal, converges in one step).
  • If α=0.3\alpha = 0.3: xk=(0.4)k0x_k = (0.4)^k \to 0 (converges slowly).
  • If α=1.0\alpha = 1.0: xk=(1)kx_k = (-1)^k (oscillates, does not converge).
  • If α=1.5\alpha = 1.5: xk=(2)kx_k = (-2)^k (diverges).

Optimal: α=0.5\alpha = 0.5 gives fastest convergence for this quadratic.


Problem 3: Momentum vs Vanilla GD

📝Momentum Acceleration

Problem: Consider a function with narrow valley along y=0y = 0 where f(x,y)=x2+100y2f(x, y) = x^2 + 100y^2. Explain why momentum helps and compute the effective step size after 5 iterations with β=0.9\beta = 0.9 and constant gradient (2,0)(2, 0).

💡Solution

The function has L=200L = 200 (from the yy-direction) and μ=2\mu = 2 (from the xx-direction), so κ=100\kappa = 100. Vanilla GD oscillates wildly in the yy-direction due to the large curvature.

Momentum with constant gradient g=(2,0)g = (2, 0): v0=0v_0 = 0, v1=β0+g=gv_1 = \beta \cdot 0 + g = g

v2=βv1+g=(1+β)gv_2 = \beta v_1 + g = (1 + \beta)gvk=(1+β+β2++βk1)g=1βk1βgv_k = (1 + \beta + \beta^2 + \cdots + \beta^{k-1})g = \frac{1 - \beta^k}{1 - \beta} g

After k=5k = 5 iterations: v5=10.9510.9(2,0)=10.590490.1(2,0)=4.0951(2,0)=(8.19,0)v_5 = \frac{1 - 0.9^5}{1 - 0.9} \cdot (2, 0) = \frac{1 - 0.59049}{0.1} \cdot (2, 0) = 4.0951 \cdot (2, 0) = (8.19, 0)

The effective step in the xx-direction is α8.19\alpha \cdot 8.19 — about 4×4\times the gradient magnitude. As kk \to \infty, vk11βg=10gv_k \to \frac{1}{1-\beta} g = 10 \cdot g, amplifying the consistent direction by 10×10\times.


Problem 4: Cosine Annealing Schedule

📝Learning Rate at Specific Epochs

Problem: For cosine annealing with αmin=105\alpha_{\min} = 10^{-5}, αmax=0.1\alpha_{\max} = 0.1, and T=100T = 100 epochs, compute the learning rate at epochs k=0,25,50,75,100k = 0, 25, 50, 75, 100.

💡Solution

αk=105+0.11052(1+cos(πk100))\alpha_k = 10^{-5} + \frac{0.1 - 10^{-5}}{2}\left(1 + \cos\left(\frac{\pi k}{100}\right)\right)
  • k=0k = 0: α0=105+0.05(1+cos0)=105+0.052=0.100010.1\alpha_0 = 10^{-5} + 0.05 \cdot (1 + \cos 0) = 10^{-5} + 0.05 \cdot 2 = 0.10001 \approx 0.1
  • k=25k = 25: α25=105+0.05(1+cos(π/4))=105+0.051.7071=0.08536\alpha_{25} = 10^{-5} + 0.05 \cdot (1 + \cos(\pi/4)) = 10^{-5} + 0.05 \cdot 1.7071 = 0.08536
  • k=50k = 50: α50=105+0.05(1+cos(π/2))=105+0.051=0.05001\alpha_{50} = 10^{-5} + 0.05 \cdot (1 + \cos(\pi/2)) = 10^{-5} + 0.05 \cdot 1 = 0.05001
  • k=75k = 75: α75=105+0.05(1+cos(3π/4))=105+0.050.2929=0.01466\alpha_{75} = 10^{-5} + 0.05 \cdot (1 + \cos(3\pi/4)) = 10^{-5} + 0.05 \cdot 0.2929 = 0.01466
  • k=100k = 100: α100=105+0.05(1+cosπ)=105+0.050=105\alpha_{100} = 10^{-5} + 0.05 \cdot (1 + \cos \pi) = 10^{-5} + 0.05 \cdot 0 = 10^{-5}

The schedule starts near maximum, decays smoothly through the middle, and ends at the minimum — exactly the behavior needed for training neural networks.


Quick Reference

📋Key Takeaways

  • Update Rule: xk+1=xkαkf(xk)x_{k+1} = x_k - \alpha_k \nabla f(x_k) — move opposite the gradient by step size αk\alpha_k.
  • Learning Rate: The most critical hyperparameter. Too large diverges, too small is slow. Use schedules (cosine, step decay) instead of constant values.
  • Batch vs. SGD vs. Mini-batch: Batch GD uses all data (stable but expensive), SGD uses one sample (fast but noisy), mini-batch GD (b=32b = 32512512) is the practical default for deep learning.
  • Convergence Rate: O(1/k)O(1/k) for convex, linear convergence (κ1κ+1)2k\left(\frac{\kappa-1}{\kappa+1}\right)^{2k} for strongly convex, O(1/ϵ2)O(1/\epsilon^2) to reach f<ϵ\|\nabla f\| < \epsilon for non-convex.
  • Line Search: Backtracking with Armijo condition f(x+αd)f(x)+cαfTdf(x + \alpha d) \leq f(x) + c\alpha \nabla f^T d provides adaptive step sizes without exact minimization.
  • Momentum: vk=βvk1+f(xk)v_k = \beta v_{k-1} + \nabla f(x_k) with β0.9\beta \approx 0.9 averages oscillations and accelerates through consistent gradient directions by 11β\frac{1}{1-\beta}.
  • Nesterov Momentum: Evaluate gradient at the look-ahead position for O(1/k2)O(1/k^2) convergence — the theoretical foundation for accelerated methods.
  • Schedules: Cosine annealing αk=αmin+12(αmaxαmin)(1+cos(πk/T))\alpha_k = \alpha_{\min} + \frac{1}{2}(\alpha_{\max} - \alpha_{\min})(1 + \cos(\pi k/T)) is the most popular schedule in deep learning. Add warm restarts to escape poor minima.
  • Deep Learning Practice: Start with Adam (lr=0.001) for fast convergence; switch to SGD+momentum for final training. Use gradient clipping when gradients explode.
  • Python: Use scipy.optimize.minimize() for small convex problems; implement custom loops with NumPy for full control over training dynamics.

Cross-References

  • Stochastic Gradient Descent: Adam, RMSProp, and adaptive learning rate methods → SGD
  • Newton's Method: Second-order optimization with Hessian information → Newton's Method
  • Constrained Optimization: KKT conditions, penalty methods, and interior-point methods → Constrained Optimization
  • Convex Optimization: Global optimality guarantees and convex function properties → Convex Optimization
  • Hyperparameter Optimization: Learning rate tuning via grid search, Bayesian optimization → Hyperparameter Optimization
  • Calculus Optimization: First and second derivative tests, unconstrained optimization → Calculus Optimization
  • Partial Derivatives: Gradients and Jacobians needed for computing f\nabla fPartial Derivatives
  • Linear Algebra Norms: Vector norms used in convergence analysis → Linear Algebra Norms
Lesson Progress62 / 100