← Math|30 of 100
Calculus

Calculus Optimization

Master unconstrained and constrained optimization, derivative tests, gradient descent, and applications in machine learning and AI.

📂 Optimization📖 Lesson 30 of 100🎓 Free Course

Advertisement

Why It Matters

ℹ️ Why It Matters

Optimization is the mathematical backbone of machine learning. Every time a neural network trains, every time a model tunes its hyperparameters, every time an algorithm minimizes a loss function — optimization is happening. Understanding calculus-based optimization gives you the theoretical foundation to not just use ML tools, but to understand why they work, when they fail, and how to improve them. From finding the minimum of a simple quadratic to training billion-parameter models, the core principles are the same: find where the derivative is zero, verify it is a minimum, and iterate.

💡 The Big Picture

Optimization connects calculus to every quantitative discipline: physics (least action), economics (utility maximization), engineering (design optimization), and AI (loss minimization). Mastering this topic unlocks deep understanding across all these fields.


What is Optimization

DfOptimization

Optimization is the process of finding the input values that make a function achieve its maximum or minimum value. Formally, given a function f:RnRf: \mathbb{R}^n \to \mathbb{R}, we seek a point x\vec{x}^* such that f(x)f(x)f(\vec{x}^*) \leq f(\vec{x}) for all x\vec{x} in some domain (minimization) or f(x)f(x)f(\vec{x}^*) \geq f(\vec{x}) for all x\vec{x} (maximization).

Standard Optimization Problem

minxDf(x)\min_{\vec{x} \in \mathcal{D}} f(\vec{x})

Here,

  • f( ec{x})=The objective function to minimize
  • ec{x}=The decision variable (input vector)
  • mathcalDmathcal{D}=The feasible domain or constraint set
  • ec{x}^*=The optimal solution (minimizer)

ℹ️ Minimization vs Maximization

Maximizing f(x)f(x) is equivalent to minimizing f(x)-f(x). In practice, almost all optimization literature and software frames problems as minimization. When you see a "maximization" problem in ML (like maximizing log-likelihood), it is converted to minimization by taking the negative.


Critical Points

DfCritical Points

A critical point of a function f(x)f(x) is any point cc in the domain where either f(c)=0f'(c) = 0 or f(c)f'(c) does not exist. Critical points are candidates for local maxima and minima — they are necessary but not sufficient conditions for optimality.

Critical Point Conditions

f(c)=0orf(c) is undefinedf'(c) = 0 \quad \text{or} \quad f'(c) \text{ is undefined}

Here,

  • cc=A point in the domain of f
  • f(c)=0f'(c) = 0=Horizontal tangent (stationary point)
  • f(c)extundefinedf'(c) ext{ undefined}=Cusp, corner, or vertical tangent

⚠️ Critical ≠ Optimal

Not every critical point is an extremum. The function f(x)=x3f(x) = x^3 has f(0)=0f'(0) = 0, but x=0x = 0 is an inflection point, not a max or min. You must use the first or second derivative test to classify critical points.

📝Finding Critical Points

Problem: Find all critical points of f(x)=x36x2+9x+1f(x) = x^3 - 6x^2 + 9x + 1.

💡Solution

f(x)=3x212x+9=3(x24x+3)=3(x1)(x3)f'(x) = 3x^2 - 12x + 9 = 3(x^2 - 4x + 3) = 3(x-1)(x-3)

Setting f(x)=0f'(x) = 0: x=1x = 1 and x=3x = 3.

f(1)=16+9+1=5f(1) = 1 - 6 + 9 + 1 = 5 and f(3)=2754+27+1=1f(3) = 27 - 54 + 27 + 1 = 1.

So the critical points are (1,5)(1, 5) and (3,1)(3, 1).


First Derivative Test

ThFirst Derivative Test

Let ff be continuous at a critical point cc where f(c)=0f'(c) = 0. If ff' changes sign at cc:

  • If ff' changes from negative to positive at cc, then ff has a local minimum at cc.
  • If ff' changes from positive to negative at cc, then ff has a local maximum at cc.
  • If ff' does not change sign at cc, then cc is neither a local maximum nor minimum (it is an inflection point).

💡 Intuition

The first derivative test works because the sign of ff' tells you whether the function is increasing or decreasing. If ff is decreasing (negative derivative) before cc and increasing (positive derivative) after cc, the function must have "bottomed out" at cc — a local minimum.

📝First Derivative Test

Problem: Classify the critical points of f(x)=x44x3+6x24xf(x) = x^4 - 4x^3 + 6x^2 - 4x.

💡Solution

f(x)=4x312x2+12x4=4(x1)3f'(x) = 4x^3 - 12x^2 + 12x - 4 = 4(x-1)^3

Setting f(x)=0f'(x) = 0: x=1x = 1 is the only critical point.

Test signs: For x<1x < 1, f(x)=4(x1)3<0f'(x) = 4(x-1)^3 < 0. For x>1x > 1, f(x)=4(x1)3>0f'(x) = 4(x-1)^3 > 0.

Since ff' changes from negative to positive at x=1x = 1, there is a local minimum at x=1x = 1 with f(1)=14+64=1f(1) = 1 - 4 + 6 - 4 = -1.


Second Derivative Test

ThSecond Derivative Test

Let ff be twice differentiable at a critical point cc where f(c)=0f'(c) = 0:

  • If f(c)>0f''(c) > 0, then ff has a local minimum at cc.
  • If f(c)<0f''(c) < 0, then ff has a local maximum at cc.
  • If f(c)=0f''(c) = 0, the test is inconclusive (use the first derivative test or higher-order analysis).

ℹ️ Why It Works

If f(c)>0f''(c) > 0, the function is concave up at cc, meaning it curves upward like a cup — the bottom of the cup is a minimum. If f(c)<0f''(c) < 0, the function is concave down, curving downward like a dome — the top is a maximum.

Second Derivative Test Formula

f(c){>0    local min<0    local max=0    inconclusivef''(c) \begin{cases} > 0 \implies \text{local min} \\ < 0 \implies \text{local max} \\ = 0 \implies \text{inconclusive} \end{cases}

Here,

  • f(c)f''(c)=Second derivative evaluated at critical point c
  • f(c)=0f'(c) = 0=Prerequisite: c must be a stationary point

📝Second Derivative Test

Problem: Find and classify the extrema of f(x)=2x39x2+12x3f(x) = 2x^3 - 9x^2 + 12x - 3.

💡Solution

f(x)=6x218x+12=6(x1)(x2)=0    x=1,x=2f'(x) = 6x^2 - 18x + 12 = 6(x-1)(x-2) = 0 \implies x = 1, x = 2f(x)=12x18f''(x) = 12x - 18

At x=1x = 1: f(1)=1218=6<0f''(1) = 12 - 18 = -6 < 0local maximum, f(1)=2f(1) = 2.

At x=2x = 2: f(2)=2418=6>0f''(2) = 24 - 18 = 6 > 0local minimum, f(2)=1f(2) = 1.


Global vs Local Extrema

ThGlobal vs Local Extrema

  • A local (relative) minimum at cc means f(c)f(x)f(c) \leq f(x) for all xx in some open interval containing cc.
  • A global (absolute) minimum at cc means f(c)f(x)f(c) \leq f(x) for all xx in the entire domain.
  • Every global extremum is a local extremum, but not every local extremum is global.

ThExtreme Value Theorem

If ff is continuous on a closed, bounded interval [a,b][a, b], then ff attains both a global maximum and a global minimum on [a,b][a, b]. The global extrema occur either at critical points or at the endpoints aa and bb.

💡 Procedure for Global Extrema on [a,b]

To find the absolute maximum and minimum of ff on [a,b][a, b]:

  1. Find all critical points of ff in (a,b)(a, b).
  2. Evaluate ff at each critical point.
  3. Evaluate ff at the endpoints aa and bb.
  4. The largest value is the global maximum; the smallest is the global minimum.

⚠️ Unbounded Domains

On unbounded domains (e.g., (,)(-\infty, \infty)), a continuous function may have no global extrema. For example, f(x)=x2f(x) = x^2 has a global minimum at x=0x = 0, but f(x)=x3f(x) = x^3 has neither a global max nor min. Always check whether the function grows without bound in both directions.

📝Global Extrema

Problem: Find the global extrema of f(x)=x33x+2f(x) = x^3 - 3x + 2 on [2,2][-2, 2].

💡Solution

f(x)=3x23=3(x1)(x+1)=0    x=1,x=1f'(x) = 3x^2 - 3 = 3(x-1)(x+1) = 0 \implies x = -1, x = 1

Evaluate: f(2)=8+6+2=0f(-2) = -8 + 6 + 2 = 0, f(1)=1+3+2=4f(-1) = -1 + 3 + 2 = 4, f(1)=13+2=0f(1) = 1 - 3 + 2 = 0, f(2)=86+2=4f(2) = 8 - 6 + 2 = 4.

Global maximum: 44 at x=1x = -1 and x=2x = 2. Global minimum: 00 at x=2x = -2 and x=1x = 1.


Optimization Process

💡 Step-by-Step Optimization Procedure

For any optimization problem, follow this systematic approach:

Step 1: Identify the objective function. Determine what you are maximizing or minimizing. Write it as a function of one or more variables.

Step 2: Define the domain. Identify constraints on the variables (physical, logical, or problem-imposed). Determine whether the domain is bounded or unbounded.

Step 3: Express as a single-variable problem (if possible). Use constraints to eliminate variables. For example, if x+y=10x + y = 10, substitute y=10xy = 10 - x.

Step 4: Find critical points. Compute the derivative and solve f(x)=0f'(x) = 0. Also check where f(x)f'(x) is undefined.

Step 5: Classify critical points. Use the first derivative test (sign analysis) or the second derivative test.

Step 6: Check endpoints (if bounded). Evaluate the function at the boundaries of the domain.

Step 7: Compare all candidates. The largest value is the global maximum; the smallest is the global minimum.

Step 8: Verify and interpret. Check that the answer makes sense in the context of the problem.


Worked Examples

Example 1: Maximum Volume of a Box

📝Open Box from Cardboard

Problem: A square piece of cardboard of side 12 inches has squares cut from each corner. The sides are folded up to form an open box. Find the dimensions that maximize the volume.

💡Solution

Let xx = side length of cut squares. Then the box has dimensions (122x)×(122x)×x(12-2x) \times (12-2x) \times x.

V(x)=x(122x)2=x(14448x+4x2)=4x348x2+144xV(x) = x(12-2x)^2 = x(144 - 48x + 4x^2) = 4x^3 - 48x^2 + 144x

Domain: 0<x<60 < x < 6 (cut cannot exceed half the side).

V(x)=12x296x+144=12(x28x+12)=12(x2)(x6)V'(x) = 12x^2 - 96x + 144 = 12(x^2 - 8x + 12) = 12(x-2)(x-6)

V(x)=0    x=2V'(x) = 0 \implies x = 2 (since x=6x = 6 is at the boundary).

V(x)=24x96V''(x) = 24x - 96, so V(2)=4896=48<0V''(2) = 48 - 96 = -48 < 0 → local maximum.

Dimensions: x=2x = 2, base =8×8= 8 \times 8, height =2= 2. Maximum volume =32= 32 cubic inches.

Example 2: Shortest Distance

📝Shortest Distance from Point to Curve

Problem: Find the point on y=x2y = x^2 closest to (0,3)(0, 3).

💡Solution

Distance squared: D(x)=x2+(x23)2=x2+x46x2+9=x45x2+9D(x) = x^2 + (x^2 - 3)^2 = x^2 + x^4 - 6x^2 + 9 = x^4 - 5x^2 + 9

Minimizing D(x)D(x) is equivalent to minimizing distance (since \sqrt{\cdot} is monotonic).

D(x)=4x310x=2x(2x25)=0    x=0D'(x) = 4x^3 - 10x = 2x(2x^2 - 5) = 0 \implies x = 0 or x=±5/2x = \pm\sqrt{5/2}

D(x)=12x210D''(x) = 12x^2 - 10

At x=0x = 0: D(0)=10<0D''(0) = -10 < 0 → local maximum (not what we want).

At x=±5/2x = \pm\sqrt{5/2}: D(±5/2)=12(5/2)10=20>0D''(\pm\sqrt{5/2}) = 12(5/2) - 10 = 20 > 0 → local minimum.

Closest points: (5/2,5/2)\left(\sqrt{5/2}, 5/2\right) and (5/2,5/2)\left(-\sqrt{5/2}, 5/2\right).

Example 3: Minimum Surface Area

📝Cylinder with Fixed Volume

Problem: Find the radius and height of a cylinder with volume V=16πV = 16\pi that minimizes surface area.

💡Solution

Volume: πr2h=16π    h=16r2\pi r^2 h = 16\pi \implies h = \frac{16}{r^2}

Surface area: S=2πr2+2πrh=2πr2+32πrS = 2\pi r^2 + 2\pi r h = 2\pi r^2 + \frac{32\pi}{r}

S(r)=4πr32πr2=0    4r3=32    r3=8    r=2S'(r) = 4\pi r - \frac{32\pi}{r^2} = 0 \implies 4r^3 = 32 \implies r^3 = 8 \implies r = 2h=164=4h = \frac{16}{4} = 4

S(r)=4π+64πr3S''(r) = 4\pi + \frac{64\pi}{r^3}, so S(2)=4π+8π=12π>0S''(2) = 4\pi + 8\pi = 12\pi > 0 → minimum.

Optimal cylinder: radius =2= 2, height =4= 4 (height = diameter).


Constrained Optimization

DfConstrained Optimization

Constrained optimization involves minimizing or maximizing a function subject to one or more constraints on the variables. The constraints define a feasible region, and the optimum must lie within or on the boundary of this region.

Standard Constrained Problem

minxf(x)subject togi(x)=0,hj(x)0\min_{\vec{x}} f(\vec{x}) \quad \text{subject to} \quad g_i(\vec{x}) = 0, \quad h_j(\vec{x}) \leq 0

Here,

  • f( ec{x})=The objective function
  • g_i( ec{x}) = 0=Equality constraints
  • h_j( ec{x}) leq 0=Inequality constraints

DfLagrange Multipliers

The method of Lagrange multipliers converts a constrained optimization problem into a system of equations. To minimize f(x)f(\vec{x}) subject to g(x)=0g(\vec{x}) = 0, we solve f=λg\nabla f = \lambda \nabla g and g(x)=0g(\vec{x}) = 0 simultaneously, where λ\lambda is the Lagrange multiplier.

Lagrange Condition

f(x)=λg(x),g(x)=0\nabla f(\vec{x}^*) = \lambda \nabla g(\vec{x}^*), \quad g(\vec{x}^*) = 0

Here,

  • ablaf abla f=Gradient of the objective function
  • lambdalambda=Lagrange multiplier (sensitivity measure)
  • ablag abla g=Gradient of the constraint function
  • ec{x}^*=The optimal point

ℹ️ Geometric Interpretation

At the optimum, the gradient of the objective function is parallel to the gradient of the constraint. The multiplier λ\lambda measures how much the optimal value changes if the constraint is relaxed by one unit — it is the "shadow price" of the constraint.

ThKarush-Kuhn-Tucker (KKT) Conditions

For inequality constraints hj(x)0h_j(\vec{x}) \leq 0, the KKT conditions generalize Lagrange multipliers:

  1. Stationarity: f+jμjhj=0\nabla f + \sum_j \mu_j \nabla h_j = 0
  2. Primal feasibility: hj(x)0h_j(\vec{x}^*) \leq 0 for all jj
  3. Dual feasibility: μj0\mu_j \geq 0 for all jj
  4. Complementary slackness: μjhj(x)=0\mu_j h_j(\vec{x}^*) = 0 for all jj

📝Lagrange Multiplier Problem

Problem: Maximize f(x,y)=xyf(x, y) = xy subject to x+y=10x + y = 10.

💡Solution

f=(y,x)\nabla f = (y, x), g=(1,1)\nabla g = (1, 1) where g(x,y)=x+y10g(x,y) = x + y - 10.

f=λg    y=λ,x=λ    x=y=5\nabla f = \lambda \nabla g \implies y = \lambda, x = \lambda \implies x = y = 5.

Constraint: 5+5=105 + 5 = 10 ✓. Maximum value: f(5,5)=25f(5,5) = 25.


Python Implementation

Univariate Optimization with SciPy

import numpy as np
from scipy.optimize import minimize_scalar

# Minimize f(x) = x^4 - 4x^3 + 6x^2 - 4x + 1
f = lambda x: x**4 - 4*x**3 + 6*x**2 - 4*x + 1

# Bounded search
result = minimize_scalar(f, bounds=(-10, 10), method='bounded')
print(f"Minimum at x = {result.x:.6f}")
print(f"Minimum value = {result.fun:.6f}")

# With derivative information for faster convergence
def f_and_grad(x):
    val = x**4 - 4*x**3 + 6*x**2 - 4*x + 1
    grad = 4*x**3 - 12*x**2 + 12*x - 4
    return val, grad

from scipy.optimize import minimize
result = minimize(lambda x: f_and_grad(x)[0], x0=0.5,
                 jac=lambda x: f_and_grad(x)[1], method='BFGS')
print(f"Optimum at x = {result.x[0]:.6f}")

Multivariate Optimization

import numpy as np
from scipy.optimize import minimize

# Rosenbrock function: f(x,y) = (1-x)^2 + 100(y-x^2)^2
def rosenbrock(x):
    return (1 - x[0])**2 + 100*(x[1] - x[0]**2)**2

def rosenbrock_grad(x):
    dx = -2*(1 - x[0]) - 400*x[0]*(x[1] - x[0]**2)
    dy = 200*(x[1] - x[0]**2)
    return np.array([dx, dy])

# Minimize with gradient
result = minimize(rosenbrock, x0=[-1, 1], jac/rosenbrock_grad, method='L-BFGS-B')
print(f"Optimum: {result.x}, Value: {result.fun:.2e}")

# Without gradient (numerical approximation)
result = minimize(rosenbrock, x0=[-1, 1], method='Nelder-Mead')
print(f"Optimum: {result.x}, Value: {result.fun:.2e}")

Constrained Optimization

import numpy as np
from scipy.optimize import minimize

# Minimize f(x,y) = x^2 + y^2 subject to x + y = 1
def objective(x):
    return x[0]**2 + x[1]**2

def constraint_eq(x):
    return x[0] + x[1] - 1  # Must equal 0

from scipy.optimize import LinearConstraint, NonlinearConstraint

# Method 1: SLSQP with constraint dict
constraints = {'type': 'eq', 'fun': constraint_eq}
result = minimize(objective, x0=[0.5, 0.5], method='SLSQP', constraints=constraints)
print(f"Optimum: {result.x}, Value: {result.fun:.4f}")
# Expected: [0.5, 0.5] with value 0.5

# Method 2: Lagrange multiplier verification
from scipy.optimize import minimize

def lagrangian(x, lam):
    return x[0]**2 + x[1]**2 + lam * (x[0] + x[1] - 1)

# Solve the KKT system analytically
# x* = y* = 0.5, lambda* = -1
print(f"Verification: f(0.5, 0.5) = {0.5**2 + 0.5**2}")

Gradient Descent Implementation

import numpy as np

def gradient_descent(f, grad_f, x0, lr=0.01, n_iters=1000, tol=1e-8):
    """Basic gradient descent with convergence check."""
    x = np.array(x0, dtype=float)
    history = [x.copy()]

    for i in range(n_iters):
        g = grad_f(x)
        x_new = x - lr * g
        history.append(x_new.copy())

        if np.linalg.norm(x_new - x) < tol:
            print(f"Converged at iteration {i}")
            break
        x = x_new

    return x, np.array(history)

# Minimize f(x,y) = x^2 + 4y^2
f = lambda x: x[0]**2 + 4*x[1]**2
grad = lambda x: np.array([2*x[0], 8*x[1]])

x_opt, hist = gradient_descent(f, grad, [5.0, 5.0], lr=0.1)
print(f"Optimum: {x_opt}")  # Should be near [0, 0]

Applications in AI/ML

Loss Minimization

ℹ️ Loss Functions as Optimization

Every ML model training process is an optimization problem. The loss function L(θ)\mathcal{L}(\theta) measures how wrong the model is, and training seeks θ=argminθL(θ)\theta^* = \arg\min_\theta \mathcal{L}(\theta). Common losses include:

  • MSE for regression: L=1n(yiy^i)2\mathcal{L} = \frac{1}{n}\sum(y_i - \hat{y}_i)^2
  • Cross-entropy for classification: L=yilogy^i\mathcal{L} = -\sum y_i \log \hat{y}_i
  • Hinge loss for SVMs: L=max(0,1yiy^i)\mathcal{L} = \sum \max(0, 1 - y_i \hat{y}_i)

Gradient Descent in Neural Networks

Parameter Update Rule

θt+1=θtηθL(θt)\theta_{t+1} = \theta_t - \eta \nabla_\theta \mathcal{L}(\theta_t)

Here,

  • hetat heta_t=Model parameters at iteration t
  • etaeta=Learning rate (step size)
  • ablahetamathcalL abla_ heta mathcal{L}=Gradient of loss w.r.t. parameters

💡 Learning Rate Selection

The learning rate η\eta is the most important hyperparameter. Too large: the optimizer overshoots and diverges. Too small: convergence is painfully slow. Best practices: start with η=0.001\eta = 0.001 for Adam, η=0.01\eta = 0.01 for SGD; use learning rate scheduling (cosine annealing, warm-up); monitor the loss curve for divergence or plateau.

Hyperparameter Tuning as Optimization

ℹ️ Black-Box Optimization

Hyperparameter tuning (learning rate, batch size, architecture choices) is optimization where the objective function is the validation loss, and the "gradient" is not available. Methods include:

  • Grid search: Exhaustive search over a predefined grid.
  • Random search: Sample random configurations (often more efficient than grid).
  • Bayesian optimization: Build a surrogate model of the objective and use it to choose the next evaluation point.
  • Evolutionary algorithms: Maintain a population of candidates and evolve them.

Common Mistakes

MistakeIncorrectCorrectWhy It Matters
Assuming critical point is extremumf(c)=0    f'(c) = 0 \implies min or maxMust test with 1st or 2nd derivative testInflection points also have f=0f' = 0
Forgetting endpoints on bounded domainsOnly check critical pointsAlso check f(a)f(a) and f(b)f(b)Global extremum may be at boundary
Confusing local and global extremaLocal min = global minLocal min is only optimal in a neighborhoodML training often gets stuck in local minima
Wrong second derivative testf(c)=0    f''(c) = 0 \implies minimumTest is inconclusive; use 1st derivative testf(x)=x4f(x) = x^4 has f(0)=0f''(0) = 0 but x=0x = 0 is a minimum
Ignoring domain constraintsOptimize over all R\mathbb{R}Respect physical/logical constraintsNegative lengths, probabilities >1> 1 are invalid
Not checking convergenceRun gradient descent for fixed iterationsMonitor gradient norm or loss changeMay stop before reaching optimum
Using wrong learning rateη=1.0\eta = 1.0 for all problemsTune η\eta per problem (typically 10310^{-3} to 10110^{-1})Too large diverges, too small is slow
Ignoring saddle points in high dimensionsAll critical points are extremaSaddle points are common in high dimensionsNeural network loss landscapes have many saddle points

Interview Questions

Q1: What is the difference between a local and global minimum?

💡Answer

A local minimum at xx^* means f(x)f(x)f(x^*) \leq f(x) for all xx in some neighborhood of xx^*. A global minimum means f(x)f(x)f(x^*) \leq f(x) for all xx in the entire domain. Every global minimum is also a local minimum, but the reverse is not true. In non-convex optimization (like neural network training), there can be many local minima, and finding the global minimum is generally NP-hard.

Q2: Explain the second derivative test. When does it fail?

💡Answer

The second derivative test classifies a critical point cc (where f(c)=0f'(c) = 0): if f(c)>0f''(c) > 0, cc is a local minimum (concave up); if f(c)<0f''(c) < 0, cc is a local maximum (concave down). It fails when f(c)=0f''(c) = 0 — the point could be a min, max, or inflection point. Example: f(x)=x4f(x) = x^4 has f(0)=0f''(0) = 0 but x=0x = 0 is a minimum. f(x)=x4f(x) = -x^4 has f(0)=0f''(0) = 0 but x=0x = 0 is a maximum. f(x)=x3f(x) = x^3 has f(0)=0f''(0) = 0 and it is an inflection point. In these cases, use the first derivative test or examine higher-order derivatives.

Q3: Why is non-convex optimization hard for neural networks?

💡Answer

Neural network loss landscapes are non-convex, meaning they have many local minima, saddle points, and flat regions. Key challenges:

  1. Local minima: Gradient descent may converge to a suboptimal local minimum.
  2. Saddle points: Points where the gradient is zero but it is neither a min nor max; common in high dimensions.
  3. Vanishing/exploding gradients: Deep networks can have gradients that shrink or explode, stalling learning.
  4. Loss plateaus: Flat regions where gradients are near zero, slowing convergence.

Despite this, in practice, most local minima in large neural networks have similar loss values, so finding any "good enough" local minimum often suffices.

Q4: How does the learning rate affect gradient descent convergence?

💡Answer

The learning rate η\eta controls the step size in gradient descent. If η\eta is too large, the iterates overshoot the minimum and may diverge (oscillate with increasing amplitude). If η\eta is too small, convergence is extremely slow, and the algorithm may get stuck in flat regions. An optimal learning rate allows fast convergence without oscillation. Adaptive methods (Adam, RMSprop, Adagrad) adjust the learning rate per-parameter based on historical gradient information, often converging faster and more reliably than fixed learning rates. Common practice: start with η=0.001\eta = 0.001 for Adam, use learning rate decay, and reduce η\eta when the loss plateaus.

Q5: What are KKT conditions and when are they used?

💡Answer

The Karush-Kuhn-Tucker (KKT) conditions are first-order necessary conditions for a solution in nonlinear programming to be optimal, provided some regularity conditions are satisfied. For a problem with inequality constraints hj(x)0h_j(\vec{x}) \leq 0:

  1. Stationarity: f+μjhj=0\nabla f + \sum \mu_j \nabla h_j = 0
  2. Primal feasibility: hj(x)0h_j(\vec{x}^*) \leq 0
  3. Dual feasibility: μj0\mu_j \geq 0
  4. Complementary slackness: μjhj(x)=0\mu_j h_j(\vec{x}^*) = 0

They are used in constrained optimization, SVM derivation (the dual formulation), and any problem with inequality constraints. Complementary slackness means either the constraint is active (hj=0h_j = 0) or its multiplier is zero — the constraint does not affect the solution.

Q6: Explain the vanishing gradient problem and its connection to optimization.

💡Answer

The vanishing gradient problem occurs when gradients become exponentially small as they propagate backward through deep network layers. Since parameter updates are proportional to the gradient (θθηL\theta \leftarrow \theta - \eta \nabla\mathcal{L}), near-zero gradients mean early layers learn extremely slowly or not at all.

Causes: Saturating activation functions (sigmoid, tanh) have derivatives near zero for large inputs. Deep compositions multiply many small derivatives together.

Solutions: Use ReLU activations (derivative is 0 or 1), batch normalization, residual connections, careful weight initialization (Xavier/He), and gradient clipping.


Practice Problems

Problem 1: Find and Classify Critical Points

📝Critical Point Analysis

Problem: Find all critical points of f(x)=2x33x212x+5f(x) = 2x^3 - 3x^2 - 12x + 5 and classify each as a local maximum, local minimum, or neither.

💡Solution

f(x)=6x26x12=6(x2x2)=6(x2)(x+1)=0f'(x) = 6x^2 - 6x - 12 = 6(x^2 - x - 2) = 6(x-2)(x+1) = 0

Critical points: x=2x = 2 and x=1x = -1.

f(x)=12x6f''(x) = 12x - 6

At x=1x = -1: f(1)=18<0f''(-1) = -18 < 0local maximum, f(1)=23+12+5=12f(-1) = -2 - 3 + 12 + 5 = 12.

At x=2x = 2: f(2)=18>0f''(2) = 18 > 0local minimum, f(2)=161224+5=15f(2) = 16 - 12 - 24 + 5 = -15.

Problem 2: Optimization Word Problem

📝Minimizing Cost

Problem: A cylindrical can must hold 1000 cm³ of liquid. The material for the top and bottom costs 0.05/cm2andthesidecosts0.05/cm² and the side costs0.03/cm². Find the dimensions that minimize cost.

💡Solution

Volume: πr2h=1000    h=1000πr2\pi r^2 h = 1000 \implies h = \frac{1000}{\pi r^2}

Cost: C=2(0.05)πr2+0.03(2πrh)=0.1πr2+60rC = 2(0.05)\pi r^2 + 0.03(2\pi r h) = 0.1\pi r^2 + \frac{60}{r}

C(r)=0.2πr60r2=0    r3=600.2π=300π    r=(300π)1/34.57C'(r) = 0.2\pi r - \frac{60}{r^2} = 0 \implies r^3 = \frac{60}{0.2\pi} = \frac{300}{\pi} \implies r = \left(\frac{300}{\pi}\right)^{1/3} \approx 4.57 cm

h=1000π(4.57)215.24h = \frac{1000}{\pi (4.57)^2} \approx 15.24 cm

C(r)=0.2π+120r3>0C''(r) = 0.2\pi + \frac{120}{r^3} > 0 → confirmed minimum.

Problem 3: Prove Convexity Implies Global Minimum

📝Convex Function Property

Problem: Prove that if ff is strictly convex on an interval and f(c)=0f'(c) = 0, then cc is the unique global minimum.

💡Solution

By strict convexity, for any xcx \neq c: f(x)>f(c)+f(c)(xc)=f(c)+0=f(c)f(x) > f(c) + f'(c)(x - c) = f(c) + 0 = f(c).

Therefore f(x)>f(c)f(x) > f(c) for all xcx \neq c, which means cc is the unique global minimum.

This is why convex optimization is so powerful — any local minimum is automatically the global minimum. This property underpins the reliability of logistic regression, SVMs, and other convex ML models.

Problem 4: Multivariable Optimization

📝Find the Extrema

Problem: Find and classify all critical points of f(x,y)=x3+y33xyf(x, y) = x^3 + y^3 - 3xy.

💡Solution

fx=3x23y=0    y=x2f_x = 3x^2 - 3y = 0 \implies y = x^2fy=3y23x=0    x=y2f_y = 3y^2 - 3x = 0 \implies x = y^2

Substituting: x=(x2)2=x4    x4x=0    x(x31)=0x = (x^2)^2 = x^4 \implies x^4 - x = 0 \implies x(x^3 - 1) = 0

So x=0x = 0 or x=1x = 1. Corresponding: (0,0)(0, 0) and (1,1)(1, 1).

Hessian: H=(6x336y)H = \begin{pmatrix} 6x & -3 \\ -3 & 6y \end{pmatrix}

At (0,0)(0, 0): det(H)=09=9<0\det(H) = 0 - 9 = -9 < 0saddle point.

At (1,1)(1, 1): det(H)=369=27>0\det(H) = 36 - 9 = 27 > 0 and fxx=6>0f_{xx} = 6 > 0local minimum, f(1,1)=1f(1,1) = -1.


Quick Reference

📋Key Takeaways

  • Critical Points: Where f(x)=0f'(x) = 0 or is undefined; candidates for extrema but not guaranteed to be extrema.
  • First Derivative Test: Check sign change of ff' at critical point. Negative → positive = minimum; positive → negative = maximum; no change = inflection.
  • Second Derivative Test: f(c)>0f''(c) > 0 = minimum (concave up), f(c)<0f''(c) < 0 = maximum (concave down), f(c)=0f''(c) = 0 = inconclusive.
  • Global Extrema on [a,b]: Compare ff at all critical points and endpoints; largest is global max, smallest is global min.
  • Lagrange Multipliers: For constrained optimization, solve f=λg\nabla f = \lambda \nabla g with g=0g = 0; λ\lambda measures constraint sensitivity.
  • KKT Conditions: Generalize Lagrange to inequality constraints; complementary slackness means either constraint is active or its multiplier is zero.
  • Gradient Descent: θt+1=θtηL\theta_{t+1} = \theta_t - \eta \nabla\mathcal{L}; the workhorse of ML training; learning rate η\eta is the critical hyperparameter.
  • Convex Functions: Any local minimum is the global minimum; this property makes convex optimization tractable and reliable.
  • Saddle Points: Critical points that are neither max nor min; common in high-dimensional non-convex problems like neural networks.
  • Practical Tip: In ML, we rarely find exact minima — we iterate until the loss stops decreasing meaningfully (early stopping).

Cross-References

Lesson Progress30 / 100