← Math|64 of 100
Optimization

Newton's Method

Master Newton's method, quasi-Newton methods, and their use in optimization.

📂 Second-order📖 Lesson 64 of 100🎓 Free Course

Advertisement

Why It Matters

💡 Why It Matters

Newton's method is the cornerstone of second-order optimization. While gradient descent only uses first-derivative (slope) information, Newton's method leverages second-derivative (curvature) information to navigate the loss landscape far more efficiently. This makes it the algorithm of choice for medium-scale problems where exact Hessian computation is feasible. Understanding Newton's method is essential for appreciating why quasi-Newton methods like BFGS exist, and for grasping the trade-offs between convergence speed and computational cost in optimization.

ℹ️ When to Use Newton's Method

Use Newton's method or its variants when: (1) the problem has a moderate number of parameters (n < 10,000), (2) the objective function is twice-differentiable, (3) you need high-precision solutions, or (4) the Hessian is sparse or has special structure. Avoid it for very high-dimensional problems where Hessian computation becomes infeasible.


Newton's Update Rule

Newton's Update Rule

thetatextnew=thetaH1nablaf(theta)\\theta_{\\text{new}} = \\theta - H^{-1} \\nabla f(\\theta)

Here,

  • θ\theta=Current parameter vector
  • HH=Hessian matrix of second derivatives, H = \nabla^2 f(\theta)
  • f(θ)\nabla f(\theta)=Gradient vector at \theta
  • H1H^{-1}=Inverse of the Hessian matrix

The intuition is simple: gradient descent moves in the direction of steepest descent (-∇f), while Newton's method moves in a direction that accounts for the curvature of the function. By multiplying the gradient by H⁻¹, Newton's method effectively rescales each direction according to how curved the function is in that direction.

💡 Geometric Intuition

At each iteration, Newton's method fits a quadratic approximation (second-order Taylor expansion) to the function at the current point and jumps directly to the minimum of that quadratic. If the function is truly quadratic, Newton's method converges in a single step.


Quadratic Convergence

ThQuadratic Convergence of Newton's Method

ℹ️ Convergence Rates Compared

  • Linear convergence: Error at iteration k is O(ρᵏ) where 0 < ρ < 1. Each iteration reduces error by a constant factor.
  • Quadratic convergence: Error at iteration k is O(C^{2^k}). Each iteration roughly doubles the number of correct digits.
  • Superlinear convergence: Between linear and quadratic. BFGS achieves superlinear convergence.

Hessian Matrix

DfHessian Matrix

Hessian of a Quadratic Function

textIff(theta)=frac12thetaTAtheta+bTtheta+c,textthenH=A\\text{If } f(\\theta) = \\frac{1}{2}\\theta^T A \\theta + b^T \\theta + c, \\text{ then } H = A

Here,

  • AA=Symmetric positive definite matrix
  • bb=Linear coefficient vector
  • cc=Scalar constant

📝Hessian Computation

Compute the Hessian of f(x, y) = x⁴ + y⁴ - 4xy + 1.

💡Solution

∂f/∂x = 4x³ - 4y, ∂f/∂y = 4y³ - 4x

H = [[12x², -4], [-4, 12y²]]

At the critical point (1, 1): H = [[12, -4], [-4, 12]], eigenvalues = {8, 16}, both positive → local minimum.


Damped Newton Method

Damped Newton's Update Rule

thetatextnew=thetaalphakH1nablaf(theta)\\theta_{\\text{new}} = \\theta - \\alpha_k H^{-1} \\nabla f(\\theta)

Here,

  • αk\alpha_k=Step size determined by line search at iteration k
  • H1f(θ)H^{-1} \nabla f(\theta)=Newton direction (may not be a descent direction if H is not positive definite)

⚠️ Why Damping is Necessary

Pure Newton's method can diverge if: (1) the initial guess is far from the optimum, (2) the Hessian is not positive definite (e.g., near saddle points), or (3) the function is not well-approximated by its quadratic expansion. Damping (line search) ensures that each step actually decreases the objective function value.

DfLine Search Strategies for Damped Newton

The typical damped Newton algorithm at iteration k:

  1. Compute gradient g_k = ∇f(θ_k) and Hessian H_k = ∇²f(θ_k)
  2. Solve H_k d_k = -g_k for the Newton direction d_k
  3. If H_k is not positive definite, modify it (e.g., add λI to make it positive definite)
  4. Perform line search to find step size α_k
  5. Update θ_{k+1} = θ_k + α_k d_k

Quasi-Newton Methods

DfQuasi-Newton Methods

BFGS Update Rule

Bk+1=Bk+fracykykTykTskfracBkskskTBkskTBkskB_{k+1} = B_k + \\frac{y_k y_k^T}{y_k^T s_k} - \\frac{B_k s_k s_k^T B_k}{s_k^T B_k s_k}

Here,

  • BkB_k=Current approximation to the Hessian
  • sks_k=Step vector: s_k = \theta_{k+1} - \theta_k
  • yky_k=Gradient change: y_k = \nabla f(\theta_{k+1}) - \nabla f(\theta_k)

💡 BFGS Properties

  • Maintains positive definiteness of B_k (if B_0 is positive definite)
  • Superlinear convergence near the optimum
  • Requires O(n²) memory and O(n²) per iteration
  • The most popular quasi-Newton method in practice

DfL-BFGS (Limited-memory BFGS)

ℹ️ L-BFGS Two-loop Recursion

L-BFGS computes the product H⁻¹∇f using a two-loop recursion that does not require explicit matrix storage. The two-loop algorithm processes the m stored (s, y) pairs in reverse order in the first loop and forward order in the second loop, producing the result in O(mn) time.


Gauss-Newton Method

Gauss-Newton Update for Nonlinear Least Squares

thetatextnew=theta(JTJ)1JTr(theta)\\theta_{\\text{new}} = \\theta - (J^T J)^{-1} J^T r(\\theta)

Here,

  • JJ=Jacobian matrix of the residual function r(\theta)
  • r(θ)r(\theta)=Residual vector: r_i(\theta) = y_i - f(x_i; \theta)
  • JTJJ^T J=Approximation to the Hessian (ignores second-derivative terms)

💡 Why Gauss-Newton Works

For least squares problems f(θ) = ½‖r(θ)‖², the true Hessian is H = J^T J + Σ r_i ∇²r_i. The Gauss-Newton method drops the second term Σ r_i ∇²r_i, which is small when residuals are near zero. This gives a good approximation without computing second derivatives, and J^T J is always positive semidefinite.

📝Gauss-Newton Example

Fit the model y = ae^(bx) to data points (1, 2.5), (2, 4.8), (3, 8.9), (4, 16.2) using Gauss-Newton.

💡Solution

Residual: r_i(θ) = y_i - a·e^(b·x_i)

Jacobian row i: J_i = [-e^(b·x_i), -a·x_i·e^(b·x_i)]

Initial guess: a₀ = 1, b₀ = 0.5

J at initial guess: J = [[-e^0.5, -1·1·e^0.5], [-e^1, -1·2·e^1], [-e^1.5, -1·3·e^1.5], [-e^2, -1·4·e^2]]

Compute J^T J, J^T r, solve for update, iterate until convergence.


Comparison with First-Order Methods

MethodConvergence RateCost per IterationMemoryHessian RequiredBest For
Gradient DescentLinearO(n)O(n)NoLarge-scale, non-smooth
Momentum SGDLinear (faster)O(n)O(n)NoDeep learning training
Newton's MethodQuadraticO(n³)O(n²)Yes (exact)Small-scale, smooth
Damped NewtonQuadratic (near optimum)O(n³)O(n²)Yes (exact)Medium-scale, non-convex
BFGSSuperlinearO(n²)O(n²)No (approximated)Medium-scale, smooth
L-BFGSSuperlinearO(mn)O(mn)No (limited memory)Large-scale, smooth
Gauss-NewtonQuadratic (near solution)O(mn²)O(n²)No (Jacobian)Nonlinear least squares

ℹ️ Choosing the Right Method

  • n < 100: Use Newton's method or BFGS — the O(n³) cost is negligible.
  • 100 < n < 10,000: Use L-BFGS with m = 10-20 — balances speed and memory.
  • n > 10,000: Use first-order methods (Adam, SGD) — Hessian-based methods are too expensive.
  • Nonlinear least squares: Use Gauss-Newton or Levenberg-Marquardt.

Python Implementation

import numpy as np
from scipy.optimize import minimize, least_squares

# --- Newton's Method (using scipy with Hessian) ---

def rosenbrock(x):
    """Rosenbrock function: f(x,y) = (1-x)^2 + 100(y-x^2)^2"""
    return (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2

def rosenbrock_grad(x):
    """Gradient of Rosenbrock function."""
    dfdx = -2*(1 - x[0]) - 400*x[0]*(x[1] - x[0]**2)
    dfdy = 200*(x[1] - x[0]**2)
    return np.array([dfdx, dfdy])

def rosenbrock_hessian(x):
    """Hessian of Rosenbrock function."""
    h11 = 2 - 400*(x[1] - 3*x[0]**2)
    h12 = -400*x[0]
    h21 = -400*x[0]
    h22 = 200
    return np.array([[h11, h12], [h21, h22]])

# Newton's method with exact Hessian
result_newton = minimize(
    rosenbrock,
    x0=[-1.0, 1.0],
    jac=rosenbrock_grad,
    hess=rosenbrock_hessian,
    method='Newton-CG'
)
print(f"Newton: x = {result_newton.x}, f(x) = {result_newton.fun:.2e}")

# --- BFGS (Quasi-Newton) ---

result_bfgs = minimize(
    rosenbrock,
    x0=[-1.0, 1.0],
    method='BFGS'
)
print(f"BFGS:   x = {result_bfgs.x}, f(x) = {result_bfgs.fun:.2e}")

# --- L-BFGS-B (Bounded Quasi-Newton) ---

result_lbfgs = minimize(
    rosenbrock,
    x0=[-1.0, 1.0],
    method='L-BFGS-B',
    bounds=[(-2, 2), (-2, 2)]
)
print(f"L-BFGS: x = {result_lbfgs.x}, f(x) = {result_lbfgs.fun:.2e}")

# --- Gauss-Newton for Nonlinear Least Squares ---

def model(x, t):
    """Model: y = a * exp(b * t)"""
    return x[0] * np.exp(x[1] * t)

def residuals(x, t, y):
    """Residuals for least squares."""
    return y - model(x, t)

# Data
t_data = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_data = np.array([2.71, 7.39, 20.09, 54.60, 148.41])

# Levenberg-Marquardt (Gauss-Newton variant)
result_gn = least_squares(
    residuals,
    x0=[1.0, 0.5],
    args=(t_data, y_data),
    method='trf'
)
print(f"Gauss-Newton: a = {result_gn.x[0]:.4f}, b = {result_gn.x[1]:.4f}")

# --- Comparison of convergence rates ---

from scipy.optimize import minimize as minimize_comparison

def rosenbrock_only(x):
    return (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2

methods = ['BFGS', 'L-BFGS-B', 'Nelder-Mead', 'Powell']
for method in methods:
    res = minimize_comparison(
        rosenbrock_only,
        x0=[-1.0, 1.0],
        method=method,
        options={'maxiter': 1000}
    )
    print(f"{method:12s}: f(x) = {res.fun:.2e}, iters = {res.nit}")

Applications in AI/ML

ℹ️ Newton's Method in Machine Learning

Second-order methods appear in ML in several important contexts, even though they are not used for training large neural networks directly.

DfNatural Gradient Descent

Key applications include:

  • Logistic regression: Newton's method (IRLS) converges in 5-6 iterations for most problems
  • Gaussian process regression: Matrix inversion is equivalent to solving the normal equations
  • Variational inference: Natural gradients for mean-field and full-rank approximations
  • Reinforcement learning: Trust region policy optimization (TRPO) uses a second-order approximation
  • Hyperparameter optimization: L-BFGS for tuning differentiable hyperparameters
  • Computer vision: Bundle adjustment in SLAM uses Gauss-Newton/Levenberg-Marquardt
  • Scientific computing: Nonlinear PDE solvers, curve fitting, parameter estimation

Common Mistakes

MistakeWhy It's WrongCorrect Approach
Using Newton's method without checking H is positive definiteNewton direction may not be a descent directionAdd regularization: H + λI, or use modified Cholesky
Applying Newton's method to non-smooth functionsHessian does not existUse subgradient methods or smooth approximations
Forgetting line search in damped NewtonPure Newton can diverge far from optimumAlways use backtracking or Wolfe line search
Using exact Newton for neural networksO(n³) cost is prohibitive for millions of parametersUse Adam, SGD, or L-BFGS for small networks
Ignoring Hessian conditioningIll-conditioned Hessian causes numerical instabilityUse preconditioning or regularize the Hessian
Not scaling featuresPoorly scaled features lead to ill-conditioned HStandardize features before applying Newton's method
Assuming quadratic convergence from any starting pointQuadratic convergence only holds near the optimumUse damping far from optimum, pure Newton near optimum
Confusing BFGS with L-BFGS memory requirementsBFGS uses O(n²) memory, L-BFGS uses O(mn)Use L-BFGS for problems with n > 10,000

Interview Questions

📝Q1: Why is Newton's method not used for training deep neural networks?

A deep neural network with 1 million parameters has a Hessian with 10¹² entries. Computing and storing this requires petabytes of memory, and inversion costs O(10¹⁸) operations — completely infeasible.

📝Q2: What makes BFGS better than Newton's method in practice?

BFGS avoids direct Hessian computation by building an approximation from gradient differences. It achieves superlinear convergence with O(n²) cost instead of O(n³), making it practical for problems with thousands of parameters.

📝Q3: When would you use L-BFGS over Adam for training a model?

L-BFGS is preferred for smooth, convex or mildly non-convex problems with moderate dimensions (n < 100K) where you can compute the full gradient. Adam is better for very large-scale, noisy, non-convex problems like deep learning where stochastic gradients are necessary.

📝Q4: Explain the secant equation in quasi-Newton methods.

The secant equation B_{k+1}s_k = y_k ensures that the updated Hessian approximation B_{k+1} is consistent with the most recent gradient information. It is the multidimensional analog of the secant method for root finding, and it replaces the need to compute second derivatives.

📝Q5: What is the relationship between Gauss-Newton and Newton's method?

Gauss-Newton is a simplification of Newton's method for least squares problems. It approximates the Hessian as J^T J (ignoring the Σ r_i ∇²r_i term), which is cheaper to compute and always positive semidefinite. When residuals are small, this approximation is very accurate.


Practice Problems

📝Problem 1: Manual Newton Step

Apply one iteration of Newton's method to minimize f(x) = x⁴ - 4x³ + 6x² - 4x + 1 starting from x₀ = 2.0.

💡Solution

f(x) = (x-1)⁴, so f'(x) = 4(x-1)³, f''(x) = 12(x-1)²

At x₀ = 2: f'(2) = 4(1)³ = 4, f''(2) = 12(1)² = 12

Newton update: x₁ = x₀ - f'(x₀)/f''(x₀) = 2 - 4/12 = 2 - 1/3 = 5/3 ≈ 1.667

True minimum is at x = 1, so we are converging.

📝Problem 2: Hessian Positive Definiteness

Determine whether the Hessian of f(x,y) = x² + xy + y² at the origin is positive definite.

💡Solution

∂²f/∂x² = 2, ∂²f/∂x∂y = 1, ∂²f/∂y² = 1

H = [[2, 1], [1, 1]]

Leading minors: D₁ = 2 > 0, D₂ = 2·1 - 1·1 = 1 > 0

Both leading minors are positive, so H is positive definite. The origin is a strict local (and global) minimum.

📝Problem 3: BFGS Update

Given B_k = [[4, 0], [0, 2]], s_k = [1, 1]^T, y_k = [3, 1]^T, compute one BFGS update B_{k+1}.

💡Solution

y_k^T s_k = 3·1 + 1·1 = 4

B_k s_k = [4, 2]^T, s_k^T B_k s_k = [1, 1]·[4, 2] = 6

yy^T/4 = [[9/4, 3/4], [3/4, 1/4]]

B_k s_k s_k^T B_k / 6 = [[16/6, 8/6], [8/6, 4/6]] = [[8/3, 4/3], [4/3, 2/3]]

B_{k+1} = [[4, 0], [0, 2]] + [[9/4, 3/4], [3/4, 1/4]] - [[8/3, 4/3], [4/3, 2/3]]

= [[4 + 9/4 - 8/3, 0 + 3/4 - 4/3], [0 + 3/4 - 4/3, 2 + 1/4 - 2/3]]

= [[4 + 2.25 - 2.667, 0.75 - 1.333], [0.75 - 1.333, 2 + 0.25 - 0.667]]

= [[3.583, -0.583], [-0.583, 1.583]]

📝Problem 4: Gauss-Newton Step

For the model y = ae^(bx) with data points (1, 3), (2, 7), compute the Gauss-Newton Jacobian at a=1, b=0.5.

💡Solution

Model: f(x) = ae^(bx)

∂f/∂a = e^(bx), ∂f/∂b = ax·e^(bx)

At a=1, b=0.5:

  • x=1: ∂f/∂a = e^0.5 ≈ 1.649, ∂f/∂b = 1·1·e^0.5 ≈ 1.649
  • x=2: ∂f/∂a = e^1 ≈ 2.718, ∂f/∂b = 1·2·e^1 ≈ 5.436

J = [[1.649, 1.649], [2.718, 5.436]]

Residuals: r₁ = 3 - 1·e^0.5 ≈ 1.351, r₂ = 7 - 1·e^1 ≈ 4.282

J^T J and J^T r can be computed to find the update.


Quick Reference

📋Newton's Method Quick Reference

  • Update Rule: θ_{new} = θ - H⁻¹∇f(θ)
  • Convergence: Quadratic (near optimum), requires positive definite Hessian
  • Cost: O(n³) for Hessian computation and inversion, O(n²) memory
  • Damped Version: θ_{new} = θ - αH⁻¹∇f(θ) with line search for α
  • BFGS: Approximates H using gradient pairs (s_k, y_k), O(n²) cost, superlinear convergence
  • L-BFGS: Stores m recent pairs, O(mn) memory, best for large-scale smooth problems
  • Gauss-Newton: For least squares, uses J^T J as approximate Hessian
  • When to Use: Medium-scale smooth problems (n < 10K), when high precision is needed
  • When to Avoid: Very large n, non-smooth functions, noisy gradients

📋Key Formulas

FormulaDescription
θ - H⁻¹∇fNewton's update
θ - αH⁻¹∇fDamped Newton update
B_{k+1} = B_k + yy^T/(y^Ts) - B_k s s^T B_k/(s^T B_k s)BFGS Hessian update
θ - (J^T J)⁻¹ J^T rGauss-Newton update
θ - η F⁻¹ ∇LNatural gradient update

Cross-References

ℹ️ Related Topics

  • 063-optimization-gradient-descent.mdx — First-order methods — compare convergence rates and computational costs with Newton's method.
  • 065-optimization-convex.mdx — Convex optimization — Newton's method is guaranteed to converge for convex functions with positive definite Hessians.
  • 066-optimization-constrained.mdx — Constrained optimization — interior point methods use Newton's method to solve the KKT conditions.
  • 067-optimization-stochastic.mdx — Stochastic methods — compare with second-order methods for large-scale problems.
  • 068-optimization-learning-rate.mdx — Learning rate schedules — line search in damped Newton is analogous to adaptive learning rates.
  • 069-regularization.mdx: Regularization — Hessian regularization (adding λI) prevents ill-conditioning.
  • 070-information-geometry.mdx: Natural gradient — connects Newton's method to Riemannian geometry of parameter spaces.

💡 Next Steps

After mastering Newton's method, study trust-region methods (which bound the step size based on how well the quadratic model approximates the true function) and interior-point methods (which use Newton's method to solve constrained optimization problems via barrier functions).

Lesson Progress64 / 100