← Math|31 of 100
Calculus

Lagrange Multipliers

Master constrained optimization with Lagrange multipliers, KKT conditions, and their applications in machine learning including SVM and regularization.

📂 Constrained Optimization📖 Lesson 31 of 100🎓 Free Course

Advertisement

Lagrange Multipliers

â„šī¸ Why It Matters

Many machine learning problems are inherently constrained: support vector machines maximize a margin subject to classification constraints, neural network training may impose fairness requirements, and regularization adds penalty constraints on model weights. Lagrange multipliers provide the mathematical framework for solving these constrained optimization problems by converting them into unconstrained ones that are far easier to solve with gradient-based methods.

In real-world ML pipelines, unconstrained optimization is the exception, not the rule. Every time you add a constraint — a budget on inference time, a fairness requirement across demographic groups, a maximum norm on weight vectors — you enter the domain of constrained optimization. Lagrange multipliers are the foundational tool that makes these problems tractable, and understanding them is essential for anyone building or analyzing ML systems.


What are Lagrange Multipliers

DfLagrange Multipliers

Lagrange multipliers are a strategy for finding the local maxima and minima of a function subject to equality constraints. The core idea is to introduce a new variable, the Lagrange multiplier Îģ\lambda, for each constraint, and then solve a system of equations derived from the gradients of the objective function and the constraints.

Lagrange Multiplier Condition

∇f(x⃗)=Îģ∇g(x⃗),g(x⃗)=0\nabla f(\vec{x}) = \lambda \nabla g(\vec{x}), \quad g(\vec{x}) = 0

Here,

  • f(x⃗)f(\vec{x})=The objective function to optimize
  • g(x⃗)=0g(\vec{x}) = 0=The equality constraint
  • Îģ\lambda=The Lagrange multiplier (a scalar)
  • ∇f\nabla f=The gradient of the objective function
  • ∇g\nabla g=The gradient of the constraint function

💡 Intuition

At the constrained optimum, the gradient of the objective must be parallel to the gradient of the constraint. If they were not parallel, there would be a direction along the constraint surface that could improve the objective, contradicting optimality. The multiplier Îģ\lambda quantifies how much the optimal value changes when the constraint is relaxed.


Geometric Interpretation

â„šī¸ Level Curves and Constraint Surfaces

Consider a 2D problem where we want to optimize f(x,y)f(x, y) subject to g(x,y)=0g(x, y) = 0. The constraint g(x,y)=0g(x, y) = 0 defines a curve in the plane. The level curves of ff (contours where ff is constant) form a family of curves. The constrained optimum occurs at the level curve that is tangent to the constraint curve — any level curve that crosses the constraint curve would not be optimal because nearby points on the constraint curve would yield higher or lower values of ff.

At the point of tangency:

  • ∇f\nabla f is perpendicular to the level curve of ff
  • ∇g\nabla g is perpendicular to the constraint curve g=0g = 0
  • Since both gradients are perpendicular to the same tangent line, they must be parallel
  • Therefore ∇f=Îģ∇g\nabla f = \lambda \nabla g for some scalar Îģ\lambda

If Îģ>0\lambda > 0, increasing the constraint (relaxing it in the direction of ∇g\nabla g) increases ff. If Îģ<0\lambda < 0, relaxing the constraint decreases ff. This makes Îģ\lambda the sensitivity of the optimal value to the constraint.


Method of Lagrange Multipliers

ThStep-by-Step Method

To optimize f(x⃗)f(\vec{x}) subject to g(x⃗)=0g(\vec{x}) = 0:

Step 1. Write down the Lagrangian function:

L(x⃗,Îģ)=f(x⃗)−Îģ g(x⃗)\mathcal{L}(\vec{x}, \lambda) = f(\vec{x}) - \lambda \, g(\vec{x})

Step 2. Take partial derivatives of L\mathcal{L} with respect to each variable xix_i and Îģ\lambda:

∂L∂xi=∂f∂xi−Îģ∂g∂xi=0\frac{\partial \mathcal{L}}{\partial x_i} = \frac{\partial f}{\partial x_i} - \lambda \frac{\partial g}{\partial x_i} = 0
∂L∂Îģ=−g(x⃗)=0\frac{\partial \mathcal{L}}{\partial \lambda} = -g(\vec{x}) = 0

Step 3. Solve the resulting system of n+1n + 1 equations (where nn is the number of variables) for the n+1n + 1 unknowns: the nn components of x⃗\vec{x} and Îģ\lambda.

Step 4. Evaluate ff at each solution to identify maxima and minima.

âš ī¸ Important Caveats

  • The method finds candidates for optima; you must verify which are maxima, minima, or saddle points.
  • The constraint gradient ∇g\nabla g must be nonzero at the solution (regularity condition).
  • For multiple constraints, add a multiplier per constraint.

Single Constraint Examples

📝Example 1: Rectangle of Maximum Area

Problem: Find the rectangle with the largest area that can be inscribed in a circle of radius RR.

💡Solution

Objective: Maximize f(x,y)=(2x)(2y)=4xyf(x, y) = (2x)(2y) = 4xy (area of rectangle with half-widths x,yx, y).

Constraint: g(x,y)=x2+y2−R2=0g(x, y) = x^2 + y^2 - R^2 = 0 (rectangle must lie on the circle).

Lagrangian: L=4xy−Îģ(x2+y2−R2)\mathcal{L} = 4xy - \lambda(x^2 + y^2 - R^2)

System of equations:

∂L∂x=4y−2Îģx=0  ⟹  Îģ=2yx\frac{\partial \mathcal{L}}{\partial x} = 4y - 2\lambda x = 0 \implies \lambda = \frac{2y}{x}
∂L∂y=4x−2Îģy=0  ⟹  Îģ=2xy\frac{\partial \mathcal{L}}{\partial y} = 4x - 2\lambda y = 0 \implies \lambda = \frac{2x}{y}
∂L∂Îģ=−(x2+y2−R2)=0\frac{\partial \mathcal{L}}{\partial \lambda} = -(x^2 + y^2 - R^2) = 0

Solving: From the first two equations: 2yx=2xy  ⟹  y2=x2  ⟹  y=x\frac{2y}{x} = \frac{2x}{y} \implies y^2 = x^2 \implies y = x (taking positive values).

Substituting into the constraint: x2+x2=R2  ⟹  x=R2x^2 + x^2 = R^2 \implies x = \frac{R}{\sqrt{2}}, y=R2y = \frac{R}{\sqrt{2}}.

Maximum area: 4xy=4⋅R2⋅R2=2R24xy = 4 \cdot \frac{R}{\sqrt{2}} \cdot \frac{R}{\sqrt{2}} = 2R^2.

The optimal rectangle is a square with side length R2R\sqrt{2}.


📝Example 2: Closest Point on a Plane

Problem: Find the point on the plane 2x+y−2z=42x + y - 2z = 4 closest to the origin.

💡Solution

Objective: Minimize f(x,y,z)=x2+y2+z2f(x, y, z) = x^2 + y^2 + z^2 (squared distance from origin).

Constraint: g(x,y,z)=2x+y−2z−4=0g(x, y, z) = 2x + y - 2z - 4 = 0.

Lagrangian: L=x2+y2+z2−Îģ(2x+y−2z−4)\mathcal{L} = x^2 + y^2 + z^2 - \lambda(2x + y - 2z - 4)

System:

∂L∂x=2x−2Îģ=0  ⟹  x=Îģ\frac{\partial \mathcal{L}}{\partial x} = 2x - 2\lambda = 0 \implies x = \lambda
∂L∂y=2y−Îģ=0  ⟹  y=Îģ2\frac{\partial \mathcal{L}}{\partial y} = 2y - \lambda = 0 \implies y = \frac{\lambda}{2}
∂L∂z=2z+2Îģ=0  ⟹  z=−Îģ\frac{\partial \mathcal{L}}{\partial z} = 2z + 2\lambda = 0 \implies z = -\lambda
2x+y−2z=42x + y - 2z = 4

Solving: Substituting: 2Îģ+Îģ2+2Îģ=4  ⟹  9Îģ2=4  ⟹  Îģ=892\lambda + \frac{\lambda}{2} + 2\lambda = 4 \implies \frac{9\lambda}{2} = 4 \implies \lambda = \frac{8}{9}.

Solution: x=89x = \frac{8}{9}, y=49y = \frac{4}{9}, z=−89z = -\frac{8}{9}.

Minimum distance: (89)2+(49)2+(89)2=453\sqrt{\left(\frac{8}{9}\right)^2 + \left(\frac{4}{9}\right)^2 + \left(\frac{8}{9}\right)^2} = \frac{4\sqrt{5}}{3}.


Multiple Constraints

Lagrange Multipliers with Multiple Equality Constraints

L(x⃗,Îģ1,â€Ļ,Îģm)=f(x⃗)−∑j=1mÎģj gj(x⃗)\mathcal{L}(\vec{x}, \lambda_1, \ldots, \lambda_m) = f(\vec{x}) - \sum_{j=1}^{m} \lambda_j \, g_j(\vec{x})

Here,

  • f(x⃗)f(\vec{x})=The objective function
  • gj(x⃗)=0g_j(\vec{x}) = 0=The j-th equality constraint (m total)
  • Îģj\lambda_j=The Lagrange multiplier for the j-th constraint

ThMultiple Constraints System

For nn variables and mm equality constraints (with m<nm < n), solve:

∇f=∑j=1mÎģj∇gj,gj(x⃗)=0 for j=1,â€Ļ,m\nabla f = \sum_{j=1}^{m} \lambda_j \nabla g_j, \quad g_j(\vec{x}) = 0 \text{ for } j = 1, \ldots, m

This gives n+mn + m equations (from the gradient condition and constraints) for n+mn + m unknowns (the nn components of x⃗\vec{x} and the mm multipliers Îģj\lambda_j).

📝Example 3: Two Constraints

Problem: Optimize f(x,y,z)=x+2y+3zf(x, y, z) = x + 2y + 3z subject to g1:x2+y2+z2=1g_1: x^2 + y^2 + z^2 = 1 and g2:x+y+z=0g_2: x + y + z = 0.

💡Solution

Lagrangian: L=x+2y+3z−Îģ1(x2+y2+z2−1)−Îģ2(x+y+z)\mathcal{L} = x + 2y + 3z - \lambda_1(x^2 + y^2 + z^2 - 1) - \lambda_2(x + y + z)

System:

∂L∂x=1−2Îģ1x−Îģ2=0\frac{\partial \mathcal{L}}{\partial x} = 1 - 2\lambda_1 x - \lambda_2 = 0
∂L∂y=2−2Îģ1y−Îģ2=0\frac{\partial \mathcal{L}}{\partial y} = 2 - 2\lambda_1 y - \lambda_2 = 0
∂L∂z=3−2Îģ1z−Îģ2=0\frac{\partial \mathcal{L}}{\partial z} = 3 - 2\lambda_1 z - \lambda_2 = 0
x2+y2+z2=1,x+y+z=0x^2 + y^2 + z^2 = 1, \quad x + y + z = 0

Solving: Subtracting equations pairwise:

(1−2Îģ1x)−(2−2Îģ1y)=0  ⟹  2Îģ1(y−x)=1  ⟹  y−x=12Îģ1(1 - 2\lambda_1 x) - (2 - 2\lambda_1 y) = 0 \implies 2\lambda_1(y - x) = 1 \implies y - x = \frac{1}{2\lambda_1}
(2−2Îģ1y)−(3−2Îģ1z)=0  ⟹  2Îģ1(z−y)=1  ⟹  z−y=12Îģ1(2 - 2\lambda_1 y) - (3 - 2\lambda_1 z) = 0 \implies 2\lambda_1(z - y) = 1 \implies z - y = \frac{1}{2\lambda_1}

So y−x=z−yy - x = z - y, meaning 2y=x+z2y = x + z. Combined with x+y+z=0x + y + z = 0: 3y=0  ⟹  y=03y = 0 \implies y = 0, so z=−xz = -x.

From the sphere constraint: x2+0+x2=1  ⟹  x=±12x^2 + 0 + x^2 = 1 \implies x = \pm\frac{1}{\sqrt{2}}.

Maximum: f(12,0,−12)=12+0−32=−2f\left(\frac{1}{\sqrt{2}}, 0, -\frac{1}{\sqrt{2}}\right) = \frac{1}{\sqrt{2}} + 0 - \frac{3}{\sqrt{2}} = -\sqrt{2}.

Minimum: f(−12,0,12)=−12+0+32=2f\left(-\frac{1}{\sqrt{2}}, 0, \frac{1}{\sqrt{2}}\right) = -\frac{1}{\sqrt{2}} + 0 + \frac{3}{\sqrt{2}} = \sqrt{2}.


KKT Conditions

DfKarush-Kuhn-Tucker (KKT) Conditions

The KKT conditions generalize Lagrange multipliers to handle inequality constraints. For the problem of minimizing f(x⃗)f(\vec{x}) subject to gi(x⃗)≤0g_i(\vec{x}) \leq 0 (inequality) and hj(x⃗)=0h_j(\vec{x}) = 0 (equality), the KKT conditions are necessary conditions for optimality (under constraint qualifications).

ThKKT Conditions

For a minimization problem with inequality constraints gi(x⃗)≤0g_i(\vec{x}) \leq 0 and equality constraints hj(x⃗)=0h_j(\vec{x}) = 0, at the optimal point x⃗∗\vec{x}^*:

1. Stationarity:

∇f(x⃗∗)+∑iÎģi∇gi(x⃗∗)+∑jÎŧj∇hj(x⃗∗)=0⃗\nabla f(\vec{x}^*) + \sum_{i} \lambda_i \nabla g_i(\vec{x}^*) + \sum_{j} \mu_j \nabla h_j(\vec{x}^*) = \vec{0}

2. Primal Feasibility:

gi(x⃗∗)≤0∀ i,hj(x⃗∗)=0∀ jg_i(\vec{x}^*) \leq 0 \quad \forall \, i, \qquad h_j(\vec{x}^*) = 0 \quad \forall \, j

3. Dual Feasibility:

Îģiâ‰Ĩ0∀ i\lambda_i \geq 0 \quad \forall \, i

4. Complementary Slackness:

Îģi gi(x⃗∗)=0∀ i\lambda_i \, g_i(\vec{x}^*) = 0 \quad \forall \, i

💡 Complementary Slackness Intuition

Complementary slackness means that for each inequality constraint, either the constraint is active (binding, gi=0g_i = 0) or its multiplier is zero (Îģi=0\lambda_i = 0). An inactive constraint (gi<0g_i < 0) has no influence on the optimum, so its multiplier must be zero. Only active constraints "pull" on the solution, and their multipliers indicate how strongly they pull.

âš ī¸ Minimization vs Maximization Sign Convention

Different textbooks use different sign conventions. Some write ∇f+∑Îģi∇gi=0\nabla f + \sum \lambda_i \nabla g_i = 0 for minimization (as above), while others write ∇f=∑Îģi∇gi\nabla f = \sum \lambda_i \nabla g_i for maximization. Always check which convention is being used. The sign of Îģi\lambda_i must be consistent with the convention and the stationarity condition.


Saddle Points

â„šī¸ When Lagrange Multipliers Find Saddle Points

The Lagrange multiplier method finds all stationary points of the Lagrangian, which include maxima, minima, and saddle points. The method itself does not distinguish between these. To classify a candidate solution, you must use the second-order conditions.

ThSecond-Order Conditions for Constrained Optimization

Let L=∇x⃗x⃗2LL = \nabla^2_{\vec{x}\vec{x}} \mathcal{L} be the Hessian of the Lagrangian with respect to x⃗\vec{x}, and JgJ_g be the Jacobian of the constraints.

For a constrained minimum: LL must be positive definite on the tangent space of the active constraints.

For a constrained maximum: LL must be negative definite on the tangent space of the active constraints.

If LL is indefinite on the tangent space, the point is a constrained saddle point.

📝Saddle Point Example

Problem: Optimize f(x,y)=xyf(x, y) = xy subject to g(x,y)=x2+y2−1=0g(x, y) = x^2 + y^2 - 1 = 0.

💡Solution

Lagrangian: L=xy−Îģ(x2+y2−1)\mathcal{L} = xy - \lambda(x^2 + y^2 - 1)

System:

∂L∂x=y−2Îģx=0\frac{\partial \mathcal{L}}{\partial x} = y - 2\lambda x = 0
∂L∂y=x−2Îģy=0\frac{\partial \mathcal{L}}{\partial y} = x - 2\lambda y = 0
x2+y2=1x^2 + y^2 = 1

From the first two equations: y=2Îģxy = 2\lambda x and x=2Îģyx = 2\lambda y, so x=4Îģ2xx = 4\lambda^2 x, giving Îģ=Âą12\lambda = \pm\frac{1}{2}.

  • If Îģ=12\lambda = \frac{1}{2}: y=xy = x, so 2x2=1  ⟹  x=Âą122x^2 = 1 \implies x = \pm\frac{1}{\sqrt{2}}.

    • (x,y)=(12,12)(x, y) = \left(\frac{1}{\sqrt{2}}, \frac{1}{\sqrt{2}}\right): f=12f = \frac{1}{2} (maximum)
    • (x,y)=(−12,−12)(x, y) = \left(-\frac{1}{\sqrt{2}}, -\frac{1}{\sqrt{2}}\right): f=12f = \frac{1}{2} (maximum)
  • If Îģ=−12\lambda = -\frac{1}{2}: y=−xy = -x, so 2x2=1  ⟹  x=Âą122x^2 = 1 \implies x = \pm\frac{1}{\sqrt{2}}.

    • (x,y)=(12,−12)(x, y) = \left(\frac{1}{\sqrt{2}}, -\frac{1}{\sqrt{2}}\right): f=−12f = -\frac{1}{2} (minimum)
    • (x,y)=(−12,12)(x, y) = \left(-\frac{1}{\sqrt{2}}, \frac{1}{\sqrt{2}}\right): f=−12f = -\frac{1}{2} (minimum)

All four points are genuine extrema (no saddle points on this constraint set). However, without checking second-order conditions, we would not know which are maxima and which are minima.

âš ī¸ Saddle Points in Practice

In ML, saddle points are common in non-convex optimization landscapes. While Lagrange multipliers can identify them analytically, in practice with high-dimensional neural networks, saddle points are far more prevalent than true local minima. Second-order methods (like Newton's method) can escape saddle points by following negative curvature directions.


Python Implementation

import numpy as np
from scipy.optimize import minimize, minimize_scalar

# ============================================
# Example 1: Single Equality Constraint
# Minimize x^2 + y^2 subject to x + y = 1
# ============================================

def lagrange_single_constraint():
    f = lambda x: x[0]**2 + x[1]**2
    constraint = {'type': 'eq', 'fun': lambda x: x[0] + x[1] - 1}
    
    result = minimize(f, x0=[0.5, 0.5], constraints=constraint)
    print(f"Optimal point: x = {result.x[0]:.4f}, y = {result.x[1]:.4f}")
    print(f"Optimal value: {result.fun:.4f}")
    print(f"Multiplier (approx): lambda = {2 * result.x[0]:.4f}")
    # Analytical solution: x = y = 1/2, lambda = 1

lagrange_single_constraint()

# ============================================
# Example 2: Minimize distance to a plane
# Minimize x^2 + y^2 + z^2 subject to 2x + y - 2z = 4
# ============================================

def closest_point_on_plane():
    f = lambda x: x[0]**2 + x[1]**2 + x[2]**2
    constraint = {'type': 'eq', 'fun': lambda x: 2*x[0] + x[1] - 2*x[2] - 4}
    
    result = minimize(f, x0=[1, 1, 1], constraints=constraint)
    print(f"\nClosest point: ({result.x[0]:.4f}, {result.x[1]:.4f}, {result.x[2]:.4f})")
    print(f"Distance: {np.sqrt(result.fun):.4f}")
    # Analytical: (8/9, 4/9, -8/9), distance = 4*sqrt(5)/3

closest_point_on_plane()

# ============================================
# Example 3: Multiple constraints
# Minimize x^2 + y^2 + z^2 subject to x+y+z=1 and x-y=0
# ============================================

def multiple_constraints():
    f = lambda x: x[0]**2 + x[1]**2 + x[2]**2
    constraints = [
        {'type': 'eq', 'fun': lambda x: x[0] + x[1] + x[2] - 1},
        {'type': 'eq', 'fun': lambda x: x[0] - x[1]}
    ]
    
    result = minimize(f, x0=[1/3, 1/3, 1/3], constraints=constraints)
    print(f"\nOptimal point: ({result.x[0]:.4f}, {result.x[1]:.4f}, {result.x[2]:.4f})")
    print(f"Optimal value: {result.fun:.4f}")

multiple_constraints()

# ============================================
# Example 4: KKT with inequality constraints
# Minimize x^2 + y^2 subject to x + y >= 1
# ============================================

def kkt_inequality():
    f = lambda x: x[0]**2 + x[1]**2
    # Convert >= to <=: -(x + y - 1) <= 0
    constraint = {'type': 'ineq', 'fun': lambda x: x[0] + x[1] - 1}
    
    result = minimize(f, x0=[0, 0], constraints=constraint)
    print(f"\nKKT optimal point: ({result.x[0]:.4f}, {result.x[1]:.4f})")
    print(f"Optimal value: {result.fun:.4f}")
    # Solution: x = y = 0.5, value = 0.5

kkt_inequality()

Applications in AI/ML

Support Vector Machines (SVM)

â„šī¸ SVM and Lagrange Multipliers

The SVM primal problem maximizes the margin 2âˆĨwâˆĨ\frac{2}{\|w\|} subject to classification constraints. Lagrange multipliers transform this into the dual problem, where each multiplier Îąi\alpha_i corresponds to a training point. Points with Îąi>0\alpha_i > 0 are the support vectors — the only points that matter for the decision boundary.

SVM Dual Formulation

max⁡α∑i=1nαi−12∑i=1n∑j=1nαiαjyiyj x⃗iTx⃗j\max_{\alpha} \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \, \vec{x}_i^T \vec{x}_j

Here,

  • Îąi\alpha_i=Lagrange multiplier for the i-th training point
  • yi∈{−1,+1}y_i \in \{-1, +1\}=Class label of the i-th point
  • x⃗i\vec{x}_i=Feature vector of the i-th point

Regularization as Constraint

💡 L1 and L2 Regularization

Ridge regression (L2) minimizes âˆĨy−XwâˆĨ2\|y - Xw\|^2 subject to âˆĨwâˆĨ22≤t\|w\|_2^2 \leq t. The Lagrangian form is âˆĨy−XwâˆĨ2+ÎģâˆĨwâˆĨ22\|y - Xw\|^2 + \lambda \|w\|_2^2. Lasso (L1) uses âˆĨwâˆĨ1≤t\|w\|_1 \leq t, leading to âˆĨy−XwâˆĨ2+ÎģâˆĨwâˆĨ1\|y - Xw\|^2 + \lambda \|w\|_1. The multiplier Îģ\lambda controls the regularization strength — it is the sensitivity of the loss to the constraint radius tt.

Fairness Constraints

â„šī¸ Fairness in ML via Lagrange Multipliers

Modern fairness-aware ML adds constraints like demographic parity (P(Y^=1âˆŖA=0)=P(Y^=1âˆŖA=1)P(\hat{Y}=1|A=0) = P(\hat{Y}=1|A=1)) or equalized odds. These are incorporated as equality or inequality constraints in the optimization problem, and Lagrange multipliers (or KKT conditions) provide the mechanism for solving the resulting constrained problem. The multipliers quantify the "cost of fairness" — how much accuracy is sacrificed per unit of fairness enforcement.

Portfolio Optimization

In finance-inspired ML, the Markowitz mean-variance portfolio optimizes return subject to risk constraints. Lagrange multipliers solve this, and the resulting efficient frontier maps out the tradeoff between risk and return — parameterized by Îģ\lambda.


Common Mistakes

MistakeWhy It's WrongCorrect Approach
Forgetting the constraint equationSolving ∇f=Îģ∇g\nabla f = \lambda \nabla g alone gives nn equations for n+1n+1 unknownsAlways include g(x⃗)=0g(\vec{x}) = 0 as an equation
Ignoring the sign of Îģ\lambdaÎģ>0\lambda > 0 vs Îģ<0\lambda < 0 indicates whether relaxing the constraint helps or hurtsTrack the sign; it has physical meaning
Confusing L=f+Îģg\mathcal{L} = f + \lambda g with L=f−Îģg\mathcal{L} = f - \lambda gBoth conventions work, but mixing them leads to sign errors in Îģ\lambdaPick one convention and be consistent
Assuming all critical points are extremaLagrange finds stationary points of L\mathcal{L}, including saddle pointsVerify with second-order conditions
Forgetting complementary slackness (KKT)With inequality constraints, assuming all Îģi>0\lambda_i > 0 is wrongCheck: either Îģi=0\lambda_i = 0 or gi=0g_i = 0
Not checking constraint qualificationIf ∇g=0\nabla g = 0 at the candidate, the method may failVerify regularity (e.g., ∇g≠0\nabla g \neq 0)
Using Lagrange for unconstrained problemsLagrange multipliers are for constrained optimizationUse gradient-based methods for unconstrained problems
Ignoring boundary effectsThe Lagrangian method finds interior critical points, not boundary solutionsCheck boundary conditions separately if the domain is bounded

Interview Questions

Q1: What is the geometric interpretation of a Lagrange multiplier?

💡Answer

The Lagrange multiplier Îģ\lambda measures the rate of change of the optimal value of the objective function with respect to relaxation of the constraint. Specifically, if the constraint is g(x⃗)=cg(\vec{x}) = c, then df∗dc=Îģ\frac{df^*}{dc} = \lambda, where f∗f^* is the optimal value. Geometrically, at the constrained optimum, the gradient of the objective is parallel to the gradient of the constraint, and Îģ\lambda is the proportionality constant. A large âˆŖÎģâˆŖ|\lambda| means the constraint is very "expensive" — relaxing it would significantly improve the objective.

Q2: Why does the Lagrange multiplier method work? What is the intuition behind ∇f=Îģ∇g\nabla f = \lambda \nabla g?

💡Answer

At the constrained optimum, there is no feasible direction that improves the objective. Any movement along the constraint surface g=0g = 0 must have zero first-order change in ff. The tangent space to the constraint consists of all directions v⃗\vec{v} such that ∇g⋅v⃗=0\nabla g \cdot \vec{v} = 0. For ff to have zero change in all such directions, ∇f\nabla f must be orthogonal to this tangent space, meaning ∇f\nabla f is parallel to ∇g\nabla g. Hence ∇f=Îģ∇g\nabla f = \lambda \nabla g.

Q3: How do Lagrange multipliers relate to the dual problem in SVM?

💡Answer

The SVM primal maximizes the margin subject to classification constraints yi(wTxi+b)â‰Ĩ1y_i(w^T x_i + b) \geq 1. Introducing Lagrange multipliers Îąi\alpha_i for each constraint and applying the KKT conditions yields the dual problem, which depends only on dot products x⃗iTx⃗j\vec{x}_i^T \vec{x}_j. The complementary slackness condition implies Îąi>0\alpha_i > 0 only for support vectors (points exactly on the margin boundary). The dual formulation enables the kernel trick, allowing nonlinear classification via implicit feature mapping.

Q4: What is complementary slackness and why does it matter?

💡Answer

Complementary slackness states that for each inequality constraint gi≤0g_i \leq 0, the product Îģigi=0\lambda_i g_i = 0. This means either the constraint is active (gi=0g_i = 0, binding) or the multiplier is zero (Îģi=0\lambda_i = 0, inactive). This matters because it identifies which constraints actually influence the solution. In SVMs, it tells us that only support vectors (Îąi>0\alpha_i > 0) lie on the margin boundary; all other points have Îąi=0\alpha_i = 0 and can be removed without changing the solution.

Q5: When might Lagrange multipliers fail or give misleading results?

💡Answer

Lagrange multipliers may fail when: (1) the constraint qualification is violated (e.g., ∇g=0\nabla g = 0 at the candidate point), meaning the constraint surface is degenerate; (2) the problem is non-smooth (e.g., L1 regularization involves a non-differentiable norm); (3) the problem is non-convex, and the method finds saddle points rather than global optima; (4) the domain is not open (boundary solutions may not satisfy ∇f=Îģ∇g\nabla f = \lambda \nabla g). In such cases, subgradient methods, penalty methods, or interior-point methods are preferred.

Q6: Explain the relationship between Lagrange multipliers and sensitivity analysis.

💡Answer

The Lagrange multiplier Îģ\lambda is the shadow price of the constraint. If the constraint is g(x⃗)=cg(\vec{x}) = c, then f∗(c+Īĩ)≈f∗(c)+ÎģĪĩf^*(c + \epsilon) \approx f^*(c) + \lambda \epsilon for small Īĩ\epsilon. In ML, this means Îģ\lambda tells you how much the loss would decrease if you slightly relaxed a constraint (e.g., increasing the allowed model complexity, loosening a fairness requirement, or expanding a resource budget). This makes multipliers valuable for resource allocation and cost-benefit analysis.


Practice Problems

Problem 1: Box Optimization

📝Maximum Volume Box

Problem: Find the dimensions of a rectangular box (with no lid) of maximum volume given a surface area of 12 cm212 \, \text{cm}^2.

💡Solution

Let x,yx, y be the base dimensions and zz the height.

Objective: Maximize f=xyzf = xyz.

Constraint: g=xy+2xz+2yz−12=0g = xy + 2xz + 2yz - 12 = 0.

Lagrangian: L=xyz−Îģ(xy+2xz+2yz−12)\mathcal{L} = xyz - \lambda(xy + 2xz + 2yz - 12).

System:

yz=Îģ(y+2z),xz=Îģ(x+2z),xy=Îģ(2x+2y)yz = \lambda(y + 2z), \quad xz = \lambda(x + 2z), \quad xy = \lambda(2x + 2y)

From the first two: yzy+2z=xzx+2z  ⟹  y(x+2z)=x(y+2z)  ⟹  2yz=2xz  ⟹  y=x\frac{yz}{y+2z} = \frac{xz}{x+2z} \implies y(x+2z) = x(y+2z) \implies 2yz = 2xz \implies y = x.

From the first and third: yzy+2z=xy2x+2y\frac{yz}{y+2z} = \frac{xy}{2x+2y}. With y=xy = x: xzx+2z=x24x=x4\frac{xz}{x+2z} = \frac{x^2}{4x} = \frac{x}{4}.

So 4xz=x(x+2z)  ⟹  4z=x+2z  ⟹  x=2z4xz = x(x + 2z) \implies 4z = x + 2z \implies x = 2z.

Substituting into constraint: (2z)(2z)+2(2z)z+2(2z)z=4z2+4z2+4z2=12z2=12  ⟹  z=1(2z)(2z) + 2(2z)z + 2(2z)z = 4z^2 + 4z^2 + 4z^2 = 12z^2 = 12 \implies z = 1.

Dimensions: x=2x = 2, y=2y = 2, z=1z = 1. Maximum volume: V=4V = 4.


Problem 2: Nearest Point on an Ellipsoid

📝Closest Point on Ellipsoid

Problem: Find the point on the ellipsoid x24+y29+z2=1\frac{x^2}{4} + \frac{y^2}{9} + z^2 = 1 closest to the origin.

💡Solution

Objective: Minimize f=x2+y2+z2f = x^2 + y^2 + z^2.

Constraint: g=x24+y29+z2−1=0g = \frac{x^2}{4} + \frac{y^2}{9} + z^2 - 1 = 0.

Lagrangian: L=x2+y2+z2−Îģ(x24+y29+z2−1)\mathcal{L} = x^2 + y^2 + z^2 - \lambda\left(\frac{x^2}{4} + \frac{y^2}{9} + z^2 - 1\right).

System:

2x=Îģ⋅x2  ⟹  x(2−Îģ2)=02x = \lambda \cdot \frac{x}{2} \implies x\left(2 - \frac{\lambda}{2}\right) = 0
2y=Îģ⋅2y9  ⟹  y(2−2Îģ9)=02y = \lambda \cdot \frac{2y}{9} \implies y\left(2 - \frac{2\lambda}{9}\right) = 0
2z=Îģ⋅2z  ⟹  z(2−2Îģ)=02z = \lambda \cdot 2z \implies z(2 - 2\lambda) = 0

Either each variable is zero or the multiplier equals the bracketed coefficient. Testing combinations:

If x≠0x \neq 0: Îģ=4\lambda = 4. Then 2−2(4)9=109≠02 - \frac{2(4)}{9} = \frac{10}{9} \neq 0, so y=0y = 0. And 2−8=−6≠02 - 8 = -6 \neq 0, so z=0z = 0. From constraint: x24=1  ⟹  x=Âą2\frac{x^2}{4} = 1 \implies x = \pm 2.

If y≠0y \neq 0: Îģ=9\lambda = 9. Then 2−92=−52≠02 - \frac{9}{2} = -\frac{5}{2} \neq 0, so x=0x = 0. And 2−18=−16≠02 - 18 = -16 \neq 0, so z=0z = 0. From constraint: y29=1  ⟹  y=Âą3\frac{y^2}{9} = 1 \implies y = \pm 3.

If z≠0z \neq 0: Îģ=1\lambda = 1. Then 2−12≠02 - \frac{1}{2} \neq 0, so x=0x = 0. And 2−29≠02 - \frac{2}{9} \neq 0, so y=0y = 0. From constraint: z2=1  ⟹  z=Âą1z^2 = 1 \implies z = \pm 1.

Candidates and values:

  • (Âą2,0,0)(\pm 2, 0, 0): f=4f = 4
  • (0,Âą3,0)(0, \pm 3, 0): f=9f = 9
  • (0,0,Âą1)(0, 0, \pm 1): f=1f = 1

Closest point: (0,0,Âą1)(0, 0, \pm 1) with distance 11.


Problem 3: Constrained Exponential

📝Maximum of Exponential on a Circle

Problem: Find the maximum of f(x,y)=ex+yf(x, y) = e^{x+y} subject to x2+y2=1x^2 + y^2 = 1.

💡Solution

Lagrangian: L=ex+y−Îģ(x2+y2−1)\mathcal{L} = e^{x+y} - \lambda(x^2 + y^2 - 1).

System:

ex+y=2Îģx,ex+y=2Îģy,x2+y2=1e^{x+y} = 2\lambda x, \quad e^{x+y} = 2\lambda y, \quad x^2 + y^2 = 1

From the first two: 2Îģx=2Îģy2\lambda x = 2\lambda y. If Îģ≠0\lambda \neq 0: x=yx = y.

From constraint: 2x2=1  ⟹  x=y=±122x^2 = 1 \implies x = y = \pm\frac{1}{\sqrt{2}}.

  • (x,y)=(12,12)(x, y) = \left(\frac{1}{\sqrt{2}}, \frac{1}{\sqrt{2}}\right): f=e2≈4.113f = e^{\sqrt{2}} \approx 4.113 (maximum)
  • (x,y)=(−12,−12)(x, y) = \left(-\frac{1}{\sqrt{2}}, -\frac{1}{\sqrt{2}}\right): f=e−2≈0.243f = e^{-\sqrt{2}} \approx 0.243 (minimum)

Maximum value: e2e^{\sqrt{2}} at (12,12)\left(\frac{1}{\sqrt{2}}, \frac{1}{\sqrt{2}}\right).


Problem 4: KKT with Two Inequality Constraints

📝KKT Problem

Problem: Minimize f(x,y)=(x−1)2+(y−1)2f(x, y) = (x-1)^2 + (y-1)^2 subject to x+y≤1x + y \leq 1 and x≤0x \leq 0.

💡Solution

Lagrangian: L=(x−1)2+(y−1)2+Îģ1(x+y−1)+Îģ2(x)\mathcal{L} = (x-1)^2 + (y-1)^2 + \lambda_1(x + y - 1) + \lambda_2(x).

KKT Conditions:

  1. ∂L∂x=2(x−1)+Îģ1+Îģ2=0\frac{\partial \mathcal{L}}{\partial x} = 2(x-1) + \lambda_1 + \lambda_2 = 0
  2. ∂L∂y=2(y−1)+Îģ1=0\frac{\partial \mathcal{L}}{\partial y} = 2(y-1) + \lambda_1 = 0
  3. x+y≤1x + y \leq 1, x≤0x \leq 0 (primal feasibility)
  4. Îģ1â‰Ĩ0\lambda_1 \geq 0, Îģ2â‰Ĩ0\lambda_2 \geq 0 (dual feasibility)
  5. Îģ1(x+y−1)=0\lambda_1(x + y - 1) = 0, Îģ2x=0\lambda_2 x = 0 (complementary slackness)

Case 1: Both constraints inactive (Îģ1=Îģ2=0\lambda_1 = \lambda_2 = 0). Then 2(x−1)=02(x-1) = 0 and 2(y−1)=02(y-1) = 0, giving x=1,y=1x = 1, y = 1. But x+y=2>1x + y = 2 > 1, violating feasibility. Infeasible.

Case 2: Only first constraint active (Îģ1>0,Îģ2=0,x+y=1\lambda_1 > 0, \lambda_2 = 0, x + y = 1). From conditions 1 and 2: 2(x−1)+Îģ1=02(x-1) + \lambda_1 = 0 and 2(y−1)+Îģ1=02(y-1) + \lambda_1 = 0. So x−1=y−1  ⟹  x=yx - 1 = y - 1 \implies x = y. With x+y=1x + y = 1: x=y=12x = y = \frac{1}{2}. Check x≤0x \leq 0: 12>0\frac{1}{2} > 0, violating feasibility. Infeasible.

Case 3: Only second constraint active (Îģ2>0,Îģ1=0,x=0\lambda_2 > 0, \lambda_1 = 0, x = 0). From condition 2: 2(y−1)=0  ⟹  y=12(y-1) = 0 \implies y = 1. Check x+y=1≤1x + y = 1 \leq 1: feasible. Check x=0≤0x = 0 \leq 0: feasible. From condition 1: 2(0−1)+0+Îģ2=0  ⟹  Îģ2=2>02(0-1) + 0 + \lambda_2 = 0 \implies \lambda_2 = 2 > 0. Valid. Solution: (x,y)=(0,1)(x, y) = (0, 1), f=(0−1)2+(1−1)2=1f = (0-1)^2 + (1-1)^2 = 1.

Case 4: Both constraints active (x=0x = 0, x+y=1  ⟹  y=1x + y = 1 \implies y = 1). Same as Case 3.

Optimal solution: (x,y)=(0,1)(x, y) = (0, 1) with f∗=1f^* = 1 and Îģ1=0\lambda_1 = 0, Îģ2=2\lambda_2 = 2.


Quick Reference

📋Key Takeaways

  • Lagrange Multiplier Method: To optimize f(x⃗)f(\vec{x}) subject to g(x⃗)=0g(\vec{x}) = 0, solve ∇f=Îģ∇g\nabla f = \lambda \nabla g and g(x⃗)=0g(\vec{x}) = 0 simultaneously, where Îģ\lambda is the multiplier.
  • Geometric Interpretation: At the optimum, the gradient of the objective is parallel to the gradient of the constraint; Îģ\lambda measures the sensitivity of the optimal value to constraint changes.
  • Multiple Constraints: For mm equality constraints, use mm multipliers: ∇f=∑jÎģj∇gj\nabla f = \sum_j \lambda_j \nabla g_j.
  • KKT Conditions: For inequality constraints gi(x⃗)≤0g_i(\vec{x}) \leq 0, the conditions are: stationarity (∇f+∑Îģi∇gi=0\nabla f + \sum \lambda_i \nabla g_i = 0), primal feasibility (gi≤0g_i \leq 0), dual feasibility (Îģiâ‰Ĩ0\lambda_i \geq 0), and complementary slackness (Îģigi=0\lambda_i g_i = 0).
  • Complementary Slackness: Either a constraint is active (gi=0g_i = 0) or its multiplier is zero (Îģi=0\lambda_i = 0) — never both non-zero simultaneously.
  • SVM Connection: Support vector machines maximize margin 2âˆĨwâˆĨ\frac{2}{\|w\|} subject to yi(wTxi+b)â‰Ĩ1y_i(w^T x_i + b) \geq 1, solved via Lagrange multipliers leading to the dual formulation where only support vectors have Îąi>0\alpha_i > 0.
  • Regularization: L2 regularization ÎģâˆĨwâˆĨ22\lambda\|w\|_2^2 is the Lagrangian penalty for constraining âˆĨwâˆĨ22≤t\|w\|_2^2 \leq t; the multiplier Îģ\lambda trades off fit vs. complexity.
  • Second-Order Conditions: Use the bordered Hessian to classify constrained optima as minima, maxima, or saddle points.
  • Python: Use scipy.optimize.minimize() with constraints= parameter for equality and inequality constraints.
  • Sensitivity Analysis: Îģ\lambda quantifies how much the optimal objective changes per unit change in the constraint — useful for resource allocation and cost-benefit analysis.

Cross-References

  • Optimization Fundamentals: Core optimization theory and unconstrained methods → Optimization Fundamentals
  • Constrained Optimization: KKT conditions, penalty methods, and interior-point methods → Constrained Optimization
  • Partial Derivatives: Gradients and partial derivatives needed for the Lagrangian → Partial Derivatives
  • Multivariable Calculus: Gradients, Jacobians, and Hessians in higher dimensions → Multivariable Calculus
  • Gradient Descent: Gradient-based optimization that complements Lagrange methods → Gradient Descent
  • Linear Algebra Norms: Norms used in regularization constraints → Linear Algebra Norms
  • Matrix Calculus: Derivatives of matrix-valued functions in SVM derivations → Matrix Calculus
  • Probability Foundations: Probability constraints in ML fairness → Probability Foundations
Lesson Progress31 / 100