← Math|65 of 100
Optimization

Constrained Optimization

Master constrained optimization, KKT conditions, penalty methods, and practical solvers for real-world problems.

πŸ“‚ ConstraintsπŸ“– Lesson 65 of 100πŸŽ“ Free Course

Advertisement

Constrained Optimization

πŸ’‘ Why It Matters

In the real world, almost every optimization problem operates under constraints. A portfolio manager cannot invest more than the available budget. An engineer cannot exceed the maximum load a beam can support. A machine learning model must satisfy fairness regulations. Constrained optimization provides the mathematical framework for finding the best solution while respecting these limits. Without understanding constraints, theoretical optima are meaningless β€” they may be physically, financially, or legally impossible to achieve.

ℹ️ Historical Context

Constrained optimization has roots in the calculus of variations (Euler, Lagrange) and was formalized in the 20th century by Karush (1939), Kuhn (1951), and Tucker (1951). Today it underpins operations research, control theory, economics, and machine learning β€” from training SVMs to designing fair allocation systems.


Constrained Optimization

DfConstrained Optimization Problem

A constrained optimization problem seeks to minimize (or maximize) an objective function f:Rnβ†’Rf: \mathbb{R}^n \to \mathbb{R} subject to a set of constraints on the decision variables x∈Rnx \in \mathbb{R}^n. The general form is:

min⁑x∈Rnf(x)subjectΒ togi(x)≀0β€…β€Šβˆ€i=1,…,m,hj(x)=0β€…β€Šβˆ€j=1,…,p\min_{x \in \mathbb{R}^n} f(x) \quad \text{subject to} \quad g_i(x) \leq 0 \; \forall i = 1, \ldots, m, \quad h_j(x) = 0 \; \forall j = 1, \ldots, p

where gi(x)g_i(x) are inequality constraints and hj(x)h_j(x) are equality constraints.

Constrained Optimization Standard Form

min⁑f(x)Β s.t.Β gi(x)≀0,β€…β€Šhj(x)=0\min f(x) \text{ s.t. } g_i(x) \leq 0, \; h_j(x) = 0

Here,

  • f(x)f(x)=Objective function to minimize
  • gi(x)g_i(x)=Inequality constraints (must be ≀ 0)
  • hj(x)h_j(x)=Equality constraints (must equal 0)
  • xx=Decision variable vector in R^n

DfFeasible Region

The feasible region (or feasible set) is the set of all points xx that satisfy every constraint:

F={x∈Rn∣gi(x)≀0β€…β€Šβˆ€i,β€…β€Šhj(x)=0β€…β€Šβˆ€j}\mathcal{F} = \{ x \in \mathbb{R}^n \mid g_i(x) \leq 0 \; \forall i, \; h_j(x) = 0 \; \forall j \}

A point is feasible if it belongs to F\mathcal{F}; otherwise it is infeasible. The optimization problem is feasible if Fβ‰ βˆ…\mathcal{F} \neq \emptyset and infeasible otherwise.

DfBinding vs Non-Binding Constraints

A constraint gi(xβˆ—)≀0g_i(x^*) \leq 0 is binding (or active) at xβˆ—x^* if gi(xβˆ—)=0g_i(x^*) = 0. It is non-binding (or inactive) if gi(xβˆ—)<0g_i(x^*) < 0. Equality constraints are always binding by definition. Only binding constraints influence the optimal solution β€” inactive constraints can be removed without changing the optimum.

⚠️ Common Pitfall

Many beginners forget to verify that the feasible region is non-empty and bounded before searching for an optimum. An unbounded feasible region can lead to no finite minimum. Always check feasibility first.


Equality Constraints (Lagrange Multipliers Review)

DfEquality Constrained Problem

Consider the problem with only equality constraints:

min⁑x∈Rnf(x)subjectΒ tohj(x)=0,β€…β€Šj=1,…,p\min_{x \in \mathbb{R}^n} f(x) \quad \text{subject to} \quad h_j(x) = 0, \; j = 1, \ldots, p

Assume ff and hjh_j are continuously differentiable and that the gradients βˆ‡hj(xβˆ—)\nabla h_j(x^*) are linearly independent at the solution (the regularity condition).

ThLagrange Multiplier Theorem

If xβˆ—x^* is a local minimum of f(x)f(x) subject to hj(x)=0h_j(x) = 0 for j=1,…,pj = 1, \ldots, p, and the gradients βˆ‡hj(xβˆ—)\nabla h_j(x^*) are linearly independent, then there exist unique scalars ΞΌ1,…,ΞΌp\mu_1, \ldots, \mu_p (Lagrange multipliers) such that:

βˆ‡f(xβˆ—)+βˆ‘j=1pΞΌjβˆ‡hj(xβˆ—)=0\nabla f(x^*) + \sum_{j=1}^{p} \mu_j \nabla h_j(x^*) = 0

Each ΞΌj\mu_j measures the sensitivity of the optimal objective value to changes in the jj-th constraint. Specifically, if hj(x)=cjh_j(x) = c_j is relaxed to hj(x)=cj+Ο΅h_j(x) = c_j + \epsilon, the optimal objective changes by approximately ΞΌjΟ΅\mu_j \epsilon.

Lagrangian Function (Equality Constraints)

L(x,ΞΌ)=f(x)+βˆ‘j=1pΞΌjhj(x)\mathcal{L}(x, \mu) = f(x) + \sum_{j=1}^{p} \mu_j h_j(x)

Here,

  • L\mathcal{L}=Lagrangian function
  • f(x)f(x)=Original objective function
  • ΞΌj\mu_j=Lagrange multiplier for equality constraint j
  • hj(x)h_j(x)=j-th equality constraint

πŸ“Example: Shortest Distance to a Line

Find the point on the line x+y=1x + y = 1 closest to the origin.

Objective: min⁑f(x,y)=x2+y2\min f(x,y) = x^2 + y^2 Constraint: h(x,y)=x+yβˆ’1=0h(x,y) = x + y - 1 = 0

The Lagrangian is L=x2+y2+ΞΌ(x+yβˆ’1)\mathcal{L} = x^2 + y^2 + \mu(x + y - 1).

Setting partial derivatives to zero:

  • βˆ‚Lβˆ‚x=2x+ΞΌ=0β€…β€ŠβŸΉβ€…β€Šx=βˆ’ΞΌ/2\frac{\partial \mathcal{L}}{\partial x} = 2x + \mu = 0 \implies x = -\mu/2
  • βˆ‚Lβˆ‚y=2y+ΞΌ=0β€…β€ŠβŸΉβ€…β€Šy=βˆ’ΞΌ/2\frac{\partial \mathcal{L}}{\partial y} = 2y + \mu = 0 \implies y = -\mu/2
  • βˆ‚Lβˆ‚ΞΌ=x+yβˆ’1=0\frac{\partial \mathcal{L}}{\partial \mu} = x + y - 1 = 0

Substituting: βˆ’ΞΌ/2+(βˆ’ΞΌ/2)=1β€…β€ŠβŸΉβ€…β€ŠΞΌ=βˆ’1-\mu/2 + (-\mu/2) = 1 \implies \mu = -1, so x=y=1/2x = y = 1/2.

Optimal point: (1/2,1/2)(1/2, 1/2), minimum distance: 1/21/\sqrt{2}.

πŸ’‘Solution

The Lagrange multiplier ΞΌ=βˆ’1\mu = -1 means that if we relax the constraint to x+y=1+Ο΅x + y = 1 + \epsilon, the squared distance changes by approximately βˆ’Ο΅-\epsilon. Moving the line farther from the origin decreases the minimum distance (which is counterintuitive until you realize the sign convention).


Inequality Constraints

DfInequality Constrained Problem

The general inequality constrained problem is:

min⁑x∈Rnf(x)subjectΒ togi(x)≀0,β€…β€Ši=1,…,m\min_{x \in \mathbb{R}^n} f(x) \quad \text{subject to} \quad g_i(x) \leq 0, \; i = 1, \ldots, m

Inequality constraints are more complex than equality constraints because the active set (which constraints are binding) is unknown a priori. At the solution, some constraints may be inactive (gi(xβˆ—)<0g_i(x^*) < 0) and do not affect the optimum.

DfConstraint Qualification

A constraint qualification (CQ) is a condition on the constraint functions that ensures the KKT conditions are necessary for optimality. The most common are:

  • Linear Independence Constraint Qualification (LICQ): The gradients of all active constraints are linearly independent at xβˆ—x^*.
  • Mangasarian-Fromovitz CQ (MFCQ): The gradients of active equality constraints are linearly independent, and there exists a direction dd such that βˆ‡gi(xβˆ—)Td<0\nabla g_i(x^*)^T d < 0 for all active inequality constraints.
  • Slater's Condition (for convex problems): There exists a strictly feasible point xΛ‰\bar{x} such that gi(xΛ‰)<0g_i(\bar{x}) < 0 for all ii and hj(xΛ‰)=0h_j(\bar{x}) = 0 for all jj.

πŸ’‘ Why Constraint Qualifications Matter

Without a constraint qualification, the KKT conditions may fail to hold even at a local minimum. For example, consider minimizing xx subject to x2≀0x^2 \leq 0. The only feasible point is xβˆ—=0x^* = 0, which is optimal. But βˆ‡g(0)=0\nabla g(0) = 0, so the KKT gradient condition cannot be satisfied for any multiplier. The LICQ fails here because the active constraint gradient is zero.


KKT Conditions (Karush-Kuhn-Tucker)

ThKKT Conditions (Necessary)

Let xβˆ—x^* be a local minimum of f(x)f(x) subject to gi(x)≀0g_i(x) \leq 0 (i=1,…,mi = 1, \ldots, m) and hj(x)=0h_j(x) = 0 (j=1,…,pj = 1, \ldots, p). Assume that ff, gig_i, and hjh_j are continuously differentiable, and that a constraint qualification holds at xβˆ—x^*. Then there exist multipliers Ξ»iβ‰₯0\lambda_i \geq 0 and ΞΌj\mu_j such that:

  1. Stationarity: βˆ‡f(xβˆ—)+βˆ‘i=1mΞ»iβˆ‡gi(xβˆ—)+βˆ‘j=1pΞΌjβˆ‡hj(xβˆ—)=0\nabla f(x^*) + \sum_{i=1}^{m} \lambda_i \nabla g_i(x^*) + \sum_{j=1}^{p} \mu_j \nabla h_j(x^*) = 0

  2. Primal Feasibility: gi(xβˆ—)≀0β€…β€Šβˆ€ig_i(x^*) \leq 0 \; \forall i and hj(xβˆ—)=0β€…β€Šβˆ€jh_j(x^*) = 0 \; \forall j

  3. Dual Feasibility: Ξ»iβ‰₯0β€…β€Šβˆ€i\lambda_i \geq 0 \; \forall i

  4. Complementary Slackness: Ξ»igi(xβˆ—)=0β€…β€Šβˆ€i\lambda_i g_i(x^*) = 0 \; \forall i

KKT Conditions β€” Stationarity

βˆ‡f(xβˆ—)+βˆ‘i=1mΞ»iβˆ‡gi(xβˆ—)+βˆ‘j=1pΞΌjβˆ‡hj(xβˆ—)=0\nabla f(x^*) + \sum_{i=1}^{m} \lambda_i \nabla g_i(x^*) + \sum_{j=1}^{p} \mu_j \nabla h_j(x^*) = 0

Here,

  • βˆ‡f(xβˆ—)\nabla f(x^*)=Gradient of the objective at the optimum
  • Ξ»i\lambda_i=Lagrange multiplier for inequality constraint i (β‰₯ 0)
  • βˆ‡gi(xβˆ—)\nabla g_i(x^*)=Gradient of inequality constraint i
  • ΞΌj\mu_j=Lagrange multiplier for equality constraint j
  • βˆ‡hj(xβˆ—)\nabla h_j(x^*)=Gradient of equality constraint j

KKT Conditions β€” Complementary Slackness

Ξ»i gi(xβˆ—)=0βˆ€i=1,…,m\lambda_i \, g_i(x^*) = 0 \quad \forall i = 1, \ldots, m

Here,

  • Ξ»i\lambda_i=Multiplier for inequality constraint i
  • gi(xβˆ—)g_i(x^*)=Value of inequality constraint i at the optimum

ℹ️ Interpreting Complementary Slackness

Complementary slackness means that for each inequality constraint, exactly one of two things holds: either the constraint is active (gi(xβˆ—)=0g_i(x^*) = 0, the constraint is tight) or its multiplier is zero (Ξ»i=0\lambda_i = 0, the constraint does not affect the optimum). This is powerful because it tells us which constraints actually matter at the solution.

ThKKT Conditions (Sufficient)

If ff is convex, gig_i are convex, hjh_j are affine, and there exist multipliers Ξ»iβ‰₯0\lambda_i \geq 0 and ΞΌj\mu_j satisfying the KKT conditions at xβˆ—x^*, then xβˆ—x^* is a global minimum.

πŸ“Example: KKT with One Inequality Constraint

Minimize f(x)=(xβˆ’2)2f(x) = (x - 2)^2 subject to x≀1x \leq 1.

Lagrangian: L(x,Ξ»)=(xβˆ’2)2+Ξ»(xβˆ’1)\mathcal{L}(x, \lambda) = (x - 2)^2 + \lambda(x - 1)

KKT conditions:

  1. Stationarity: 2(xβˆ’2)+Ξ»=02(x - 2) + \lambda = 0
  2. Primal feasibility: x≀1x \leq 1
  3. Dual feasibility: Ξ»β‰₯0\lambda \geq 0
  4. Complementary slackness: Ξ»(xβˆ’1)=0\lambda(x - 1) = 0

Case 1: Constraint inactive (Ξ»=0\lambda = 0): 2(xβˆ’2)=0β€…β€ŠβŸΉβ€…β€Šx=22(x - 2) = 0 \implies x = 2. But x=2x = 2 violates x≀1x \leq 1. Infeasible.

Case 2: Constraint active (x=1x = 1): 2(1βˆ’2)+Ξ»=0β€…β€ŠβŸΉβ€…β€ŠΞ»=2β‰₯02(1 - 2) + \lambda = 0 \implies \lambda = 2 \geq 0. βœ“

Optimal solution: xβˆ—=1x^* = 1, f(xβˆ—)=1f(x^*) = 1, Ξ»βˆ—=2\lambda^* = 2.

πŸ’‘Solution

The multiplier Ξ»βˆ—=2\lambda^* = 2 indicates that if we relax the constraint to x≀1+Ο΅x \leq 1 + \epsilon, the optimal objective decreases by approximately 2Ο΅2\epsilon. This makes sense: relaxing the constraint allows xx to move closer to the unconstrained optimum at x=2x = 2.


Active Set Methods

DfActive Set Method

An active set method solves a constrained optimization problem by iteratively guessing which constraints are active (binding) at the solution and solving the resulting equality-constrained subproblem. At each iteration kk:

  1. Given the current iterate xkx^k and an estimated active set AkβŠ†{1,…,m}\mathcal{A}^k \subseteq \{1, \ldots, m\}, solve the equality-constrained subproblem:
min⁑dβˆ‡f(xk)Td+12dTβˆ‡xx2L(xk,Ξ»k) ds.t.βˆ‡gi(xk)Td=0β€…β€Šβˆ€i∈Ak\min_{d} \nabla f(x^k)^T d + \frac{1}{2} d^T \nabla^2_{xx} \mathcal{L}(x^k, \lambda^k) \, d \quad \text{s.t.} \quad \nabla g_i(x^k)^T d = 0 \; \forall i \in \mathcal{A}^k
  1. If the step dkd^k is nonzero, take a step xk+1=xk+Ξ±kdkx^{k+1} = x^k + \alpha_k d^k with step size Ξ±k\alpha_k chosen to maintain feasibility.
  2. Update the active set: add constraints that become binding, remove constraints whose multipliers become negative.
  3. Repeat until no further improvement is possible and all multipliers are nonnegative.

πŸ’‘ Active Set in Practice

Active set methods are the basis of many quadratic programming (QP) solvers. They are efficient when the number of constraints is moderate and the active set changes slowly between iterations. However, they can be slow for large-scale problems because solving the equality-constrained subproblem at each step requires factoring a matrix that changes with the active set.

DfWorking Set

In active set methods, the working set is the current estimate of which constraints will be active at the solution. At each iteration, the algorithm solves a subproblem with the working set treated as equality constraints. Constraints may be added to or removed from the working set based on:

  • Addition: A constraint becomes violated (gi(xk)>0g_i(x^k) > 0) and is added to the working set.
  • Removal: A constraint's multiplier becomes negative (Ξ»i<0\lambda_i < 0), indicating the solution would improve by moving away from that constraint.

Interior Point Methods

DfInterior Point Method

Interior point methods solve constrained optimization problems by tracing a path through the interior of the feasible region, rather than moving along its boundary (as active set methods do). They introduce barrier functions that prevent iterates from reaching the boundary, then gradually reduce the barrier to approach the true solution.

Logarithmic Barrier Function

min⁑f(x)βˆ’ΞΌβˆ‘i=1mln⁑(βˆ’gi(x))subjectΒ tohj(x)=0\min f(x) - \mu \sum_{i=1}^{m} \ln(-g_i(x)) \quad \text{subject to} \quad h_j(x) = 0

Here,

  • f(x)f(x)=Original objective function
  • ΞΌ\mu=Barrier parameter (decreases toward 0)
  • gi(x)g_i(x)=Inequality constraints (g_i(x) < 0 for interior points)
  • hj(x)h_j(x)=Equality constraints

ℹ️ How Barrier Methods Work

The barrier term βˆ’ΞΌln⁑(βˆ’gi(x))-\mu \ln(-g_i(x)) approaches +∞+\infty as gi(x)β†’0βˆ’g_i(x) \to 0^- (approaching the boundary). This creates a "wall" that keeps iterates strictly feasible. As ΞΌβ†’0\mu \to 0, the barrier solution converges to the KKT point of the original problem. In practice, a sequence of barrier subproblems is solved with decreasing ΞΌ\mu values, using the previous solution as a warm start.

DfPrimal-Dual Interior Point Methods

Modern interior point methods work with the primal-dual KKT system directly. They reformulate the KKT conditions as a nonlinear system and solve it using Newton's method. The key advantage is that they handle both primal and dual variables simultaneously, achieving polynomial-time complexity O(mlog⁑(1/Ο΅))O(\sqrt{m} \log(1/\epsilon)) for convex problems β€” a significant improvement over the exponential worst-case of active set methods.

⚠️ Interior Point Limitations

Interior point methods require a strictly feasible starting point, which can be difficult to find. They also struggle with degenerate problems (where multiple constraints are active at a vertex). For small to medium QPs, active set methods are often faster in practice.


Penalty Methods

DfPenalty Method

A penalty method converts a constrained optimization problem into a sequence of unconstrained problems by adding a penalty term to the objective that penalizes constraint violations. The idea is simple: instead of enforcing constraints exactly, we penalize deviations from feasibility and drive the penalty toward infinity.

Quadratic Penalty Method

min⁑xf(x)+Οβˆ‘i=1m[max⁑(0,gi(x))]2\min_{x} f(x) + \rho \sum_{i=1}^{m} \left[\max(0, g_i(x))\right]^2

Here,

  • f(x)f(x)=Original objective function
  • ρ\rho=Penalty parameter (increases toward infinity)
  • gi(x)g_i(x)=Inequality constraints
  • max⁑(0,gi(x))\max(0, g_i(x))=Positive part: zero if feasible, positive if violated

Quadratic Penalty for Equality Constraints

min⁑xf(x)+Οβˆ‘j=1phj(x)2\min_{x} f(x) + \rho \sum_{j=1}^{p} h_j(x)^2

Here,

  • f(x)f(x)=Original objective function
  • ρ\rho=Penalty parameter
  • hj(x)h_j(x)=Equality constraint j

ℹ️ Penalty Parameter Schedule

In practice, we solve a sequence of subproblems with increasing ρ\rho values: ρ1<ρ2<β‹―\rho_1 < \rho_2 < \cdots. At each step, we use the previous solution as a warm start. As Οβ†’βˆž\rho \to \infty, the penalty solution converges to the constrained optimum. However, large ρ\rho makes the subproblem ill-conditioned (the Hessian becomes nearly singular), so adaptive strategies are needed to balance accuracy and numerical stability.

DfAugmented Lagrangian Method

The augmented Lagrangian method improves on plain penalty methods by incorporating Lagrange multipliers directly. For equality constraints:

LA(x,ΞΌ,ρ)=f(x)+βˆ‘j=1pΞΌjhj(x)+ρ2βˆ‘j=1phj(x)2\mathcal{L}_A(x, \mu, \rho) = f(x) + \sum_{j=1}^{p} \mu_j h_j(x) + \frac{\rho}{2} \sum_{j=1}^{p} h_j(x)^2

At each iteration, the multipliers are updated as ΞΌjk+1=ΞΌjk+ρhj(xk)\mu_j^{k+1} = \mu_j^k + \rho h_j(x^k). This avoids the ill-conditioning of pure penalty methods because the multipliers absorb the bias that would otherwise require Οβ†’βˆž\rho \to \infty.

πŸ’‘ When to Use Penalty Methods

Penalty methods are simple to implement and work well when high accuracy is not required. They are commonly used in training neural networks with constraints (e.g., fairness constraints) where approximate feasibility is acceptable. For high-precision solutions, use interior point methods or Augmented Lagrangian methods.


Python Implementation

scipy.optimize

from scipy.optimize import minimize, NonlinearConstraint, LinearConstraint
import numpy as np

# Example 1: Equality constraint
# Minimize x^2 + y^2 subject to x + y = 1
def objective(x):
    return x[0]**2 + x[1]**2

def eq_constraint(x):
    return x[0] + x[1] - 1

constraints = [{'type': 'eq', 'fun': eq_constraint}]
result = minimize(objective, x0=[0, 0], constraints=constraints, method='SLSQP')
print(f"Optimal point: {result.x}")       # [0.5, 0.5]
print(f"Optimal value: {result.fun}")     # 0.5
print(f"Multiplier: {result.v}")          # Lagrange multiplier

# Example 2: Inequality constraint
# Minimize (x-2)^2 + (y-1)^2 subject to x + y <= 1
def ineq_constraint(x):
    return 1 - x[0] - x[1]  # g(x) >= 0 for scipy (>= form)

result2 = minimize(objective, x0=[0, 0], constraints={'type': 'ineq', 'fun': ineq_constraint})
print(f"Optimal point: {result2.x}")      # [0.6667, 0.3333]

# Example 3: Multiple constraints with bounds
from scipy.optimize import minimize
bounds = [(0, None), (0, None)]  # x >= 0, y >= 0
constraints_multi = [
    {'type': 'eq', 'fun': lambda x: x[0] + x[1] - 1},
    {'type': 'ineq', 'fun': lambda x: 1 - x[0]**2 - x[1]**2}
]
result3 = minimize(objective, x0=[0.5, 0.5], constraints=constraints_multi, bounds=bounds)
print(f"Optimal point: {result3.x}")

cvxpy

import cvxpy as cp

# Example 1: Simple constrained problem
x = cp.Variable(2)
objective = cp.Minimize(cp.sum_squares(x))
constraints = [cp.sum(x) == 1]
prob = cp.Problem(objective, constraints)
prob.solve()
print(f"Optimal value: {prob.value}")       # 0.5
print(f"Optimal x: {x.value}")              # [0.5, 0.5]
print(f"Shadow price: {constraints[0].dual_value}")  # Lagrange multiplier

# Example 2: SVM-like problem
n_samples, n_features = 100, 5
np.random.seed(42)
X = np.random.randn(n_samples, n_features)
y = np.sign(np.random.randn(n_samples))

w = cp.Variable(n_features)
b = cp.Variable()
loss = cp.sum(cp.pos(1 - cp.multiply(y, X @ w + b)))
reg = 0.01 * cp.sum_squares(w)
prob = cp.Problem(cp.Minimize(reg + loss))
prob.solve()
print(f"SVM accuracy: {np.mean(np.sign(X @ w.value + b.value) == y):.2%}")

# Example 3: Portfolio optimization
n_assets = 5
mu = np.random.randn(n_assets) * 0.05 + 0.1  # expected returns
Sigma = np.random.randn(n_assets, n_assets) * 0.1
Sigma = Sigma @ Sigma.T + np.eye(n_assets) * 0.01  # covariance

w = cp.Variable(n_assets)
ret = mu @ w
risk = cp.quad_form(w, Sigma)
constraints = [cp.sum(w) == 1, w >= 0]  # fully invested, no shorting
prob = cp.Problem(cp.Minimize(risk - 0.5 * ret), constraints)  # risk-averse
prob.solve()
print(f"Portfolio weights: {np.round(w.value, 3)}")

πŸ’‘ Choosing a Solver

  • scipy.optimize.minimize: Good for small, general nonlinear problems. Methods: SLSQP (supports constraints), L-BFGS-B (bounds only), trust-constr.
  • cvxpy: Best for convex problems (linear, quadratic, SOCP, SDP). Automatically selects an appropriate solver (ECOS, SCS, MOSEK).
  • For large-scale: Use interior point methods (MOSEK, Gurobi) or first-order methods (ADMM, proximal gradient).

Applications in AI/ML

Support Vector Machines (SVM)

DfSVM as Constrained Optimization

The primal SVM problem is a constrained optimization problem:

min⁑w,b,ΞΎ12βˆ₯wβˆ₯2+Cβˆ‘i=1nΞΎi\min_{w, b, \xi} \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} \xi_i

subject to:

yi(wTxi+b)β‰₯1βˆ’ΞΎi,ΞΎiβ‰₯0βˆ€iy_i(w^T x_i + b) \geq 1 - \xi_i, \quad \xi_i \geq 0 \quad \forall i

The objective balances margin maximization (βˆ₯wβˆ₯2\|w\|^2) with classification error (βˆ‘ΞΎi\sum \xi_i). The KKT conditions of this problem yield the support vectors β€” the data points where Ξ»i>0\lambda_i > 0 β€” which define the decision boundary.

πŸ“SVM KKT Interpretation

For an SVM, complementary slackness Ξ»i(yi(wTxi+b)βˆ’1+ΞΎi)=0\lambda_i (y_i(w^T x_i + b) - 1 + \xi_i) = 0 tells us:

  • If Ξ»i=0\lambda_i = 0: the point is correctly classified and not a support vector.
  • If Ξ»i>0\lambda_i > 0: the point lies on the margin or is misclassified β€” it is a support vector.

This sparsity is what makes SVMs memory-efficient: only support vectors are needed for prediction.

Fairness in Machine Learning

DfFairness-Constrained Optimization

Many ML applications require fairness constraints. For example, in lending or hiring, the model must satisfy demographic parity, equalized odds, or calibration across groups. These are naturally expressed as constrained optimization problems:

min⁑θL(ΞΈ)subjectΒ to∣P(Y^=1∣A=0)βˆ’P(Y^=1∣A=1)βˆ£β‰€Ο΅\min_{\theta} \mathcal{L}(\theta) \quad \text{subject to} \quad |P(\hat{Y}=1 | A=0) - P(\hat{Y}=1 | A=1)| \leq \epsilon

where AA is the protected attribute (e.g., race, gender) and Ο΅\epsilon is the maximum allowed disparity.

ℹ️ Fairness Trade-offs

Enforcing fairness constraints typically reduces accuracy. The Pareto frontier between accuracy and fairness is explored by varying the constraint threshold Ο΅\epsilon. There is no universally accepted definition of fairness β€” different definitions (demographic parity, equalized odds, calibration) can be mutually incompatible, as shown by Chouldechova (2017) and Kleinberg et al. (2016).

Other ML Applications

DfConstrained Optimization in Neural Networks

Constrained optimization appears throughout deep learning:

  • Regularization: Weight decay (L2L_2 penalty) can be viewed as a soft constraint on parameter magnitude.
  • Projection methods: After each gradient step, project weights onto a feasible set (e.g., simplex for attention weights, norm balls for robustness).
  • Meta-learning: bilevel constrained optimization where the inner loop optimizes task-specific parameters subject to constraints.
  • Reinforcement learning: Policy optimization with constraints on safety, energy, or resource usage (constrained MDPs).

Common Mistakes

MistakeWhy It's WrongCorrect Approach
Ignoring constraint qualificationsKKT conditions may not hold without a valid CQAlways verify LICQ, MFCQ, or Slater's condition
Forgetting Ξ»iβ‰₯0\lambda_i \geq 0 for inequalitiesNegative multipliers violate dual feasibilityCheck sign of multipliers at the solution
Assuming all constraints are activeMost constraints are typically inactive at the optimumUse complementary slackness to identify active constraints
Using penalty methods with fixed ρ\rhoA single ρ\rho either gives infeasible or ill-conditioned solutionsUse an increasing sequence of ρ\rho values
Confusing β‰₯\geq and ≀\leq formsThe sign convention affects the KKT sign conditionsStandardize: write all inequalities as gi(x)≀0g_i(x) \leq 0
Not checking feasibility firstOptimizing over an empty feasible set wastes computationVerify the feasible region is non-empty before solving
Ignoring convexityNon-convex problems may have local minima that are not globalCheck if the problem is convex; if not, use global optimization
Treating equality constraints as inequalitiesEquality constraints require different multiplier sign conventionsKeep equality and inequality constraints separate in the KKT system

Interview Questions

Q1: What is the difference between a binding and non-binding constraint? A binding constraint is satisfied with equality at the optimum (gi(xβˆ—)=0g_i(x^*) = 0). A non-binding constraint is strictly satisfied (gi(xβˆ—)<0g_i(x^*) < 0) and does not affect the optimal solution. Only binding constraints have non-zero Lagrange multipliers.

Q2: Why do we need constraint qualifications for KKT conditions? Constraint qualifications ensure that the feasible region is "well-behaved" near the optimum. Without them, the KKT conditions may fail to hold even at a local minimum. For example, if the active constraint gradients are linearly dependent, the KKT gradient equation may have no solution for the multipliers.

Q3: How do penalty methods differ from augmented Lagrangian methods? Penalty methods add Οβˆ‘hj(x)2\rho \sum h_j(x)^2 and require Οβ†’βˆž\rho \to \infty for exactness, causing ill-conditioning. Augmented Lagrangian methods add both ΞΌjhj(x)\mu_j h_j(x) and ρ2hj(x)2\frac{\rho}{2} h_j(x)^2, updating ΞΌj\mu_j at each step. This avoids the need for Οβ†’βˆž\rho \to \infty and maintains better conditioning.

Q4: When should you use interior point vs. active set methods? Interior point methods are preferred for large-scale problems (thousands of variables and constraints) because they have polynomial complexity. Active set methods are faster for small to medium QPs, especially when the active set changes little between iterations. Interior point methods also require a strictly feasible start, which can be hard to find.

Q5: Explain complementary slackness in plain language. Complementary slackness says that at the optimum, each constraint is either "tight" (active, gi(xβˆ—)=0g_i(x^*) = 0) or "irrelevant" (its multiplier is zero, Ξ»i=0\lambda_i = 0). A constraint cannot be simultaneously loose and influential β€” if there's slack, the constraint doesn't matter.

Q6: How do SVMs use KKT conditions? The KKT conditions of the SVM dual problem yield support vectors. Complementary slackness ensures that only points on the margin or misclassified points (Ξ»i>0\lambda_i > 0) contribute to the decision boundary. All other points (Ξ»i=0\lambda_i = 0) can be discarded, making SVMs memory-efficient.

Q7: What happens if a constraint qualification fails? If LICQ fails (active constraint gradients are linearly dependent), the KKT multipliers may not exist or may not be unique. The KKT conditions are no longer necessary for optimality. You must use alternative methods or reformulate the problem.

Q8: Can you convert an equality constraint to two inequality constraints? Yes: h(x)=0h(x) = 0 is equivalent to h(x)≀0h(x) \leq 0 and βˆ’h(x)≀0-h(x) \leq 0. However, this doubles the number of constraints and may violate constraint qualifications at the solution (since both constraints are active with linearly dependent gradients).


Practice Problems

πŸ“Problem 1: Budget-Constrained Utility Maximization

A consumer maximizes U(x,y)=xyU(x, y) = xy subject to the budget constraint 2x+3y=122x + 3y = 12 where x,yβ‰₯0x, y \geq 0. Find the optimal consumption bundle.

πŸ’‘Solution

Set up the Lagrangian: L=xy+ΞΌ(12βˆ’2xβˆ’3y)\mathcal{L} = xy + \mu(12 - 2x - 3y).

KKT conditions:

  • βˆ‚Lβˆ‚x=yβˆ’2ΞΌ=0β€…β€ŠβŸΉβ€…β€Šy=2ΞΌ\frac{\partial \mathcal{L}}{\partial x} = y - 2\mu = 0 \implies y = 2\mu
  • βˆ‚Lβˆ‚y=xβˆ’3ΞΌ=0β€…β€ŠβŸΉβ€…β€Šx=3ΞΌ\frac{\partial \mathcal{L}}{\partial y} = x - 3\mu = 0 \implies x = 3\mu
  • 2x+3y=122x + 3y = 12

Substituting: 2(3ΞΌ)+3(2ΞΌ)=12β€…β€ŠβŸΉβ€…β€Š12ΞΌ=12β€…β€ŠβŸΉβ€…β€ŠΞΌ=12(3\mu) + 3(2\mu) = 12 \implies 12\mu = 12 \implies \mu = 1.

Optimal bundle: xβˆ—=3x^* = 3, yβˆ—=2y^* = 2, Uβˆ—=6U^* = 6.

The multiplier ΞΌ=1\mu = 1 means an extra dollar of budget increases utility by approximately 1 unit.

πŸ“Problem 2: KKT with Two Inequality Constraints

Minimize f(x)=x12+x22f(x) = x_1^2 + x_2^2 subject to x1+x2≀1x_1 + x_2 \leq 1 and x1≀0.5x_1 \leq 0.5. Find the optimal solution and interpret the multipliers.

πŸ’‘Solution

Lagrangian: L=x12+x22+Ξ»1(x1+x2βˆ’1)+Ξ»2(x1βˆ’0.5)\mathcal{L} = x_1^2 + x_2^2 + \lambda_1(x_1 + x_2 - 1) + \lambda_2(x_1 - 0.5).

KKT conditions:

  1. βˆ‚Lβˆ‚x1=2x1+Ξ»1+Ξ»2=0\frac{\partial \mathcal{L}}{\partial x_1} = 2x_1 + \lambda_1 + \lambda_2 = 0
  2. βˆ‚Lβˆ‚x2=2x2+Ξ»1=0\frac{\partial \mathcal{L}}{\partial x_2} = 2x_2 + \lambda_1 = 0
  3. Ξ»1(x1+x2βˆ’1)=0\lambda_1(x_1 + x_2 - 1) = 0, Ξ»2(x1βˆ’0.5)=0\lambda_2(x_1 - 0.5) = 0
  4. Ξ»1,Ξ»2β‰₯0\lambda_1, \lambda_2 \geq 0, x1+x2≀1x_1 + x_2 \leq 1, x1≀0.5x_1 \leq 0.5

Case: Both active (x1=0.5x_1 = 0.5, x1+x2=1β€…β€ŠβŸΉβ€…β€Šx2=0.5x_1 + x_2 = 1 \implies x_2 = 0.5): 2(0.5)+Ξ»1+Ξ»2=0β€…β€ŠβŸΉβ€…β€ŠΞ»1+Ξ»2=βˆ’12(0.5) + \lambda_1 + \lambda_2 = 0 \implies \lambda_1 + \lambda_2 = -1. Impossible since Ξ»iβ‰₯0\lambda_i \geq 0.

Case: Only first active (x1+x2=1x_1 + x_2 = 1, Ξ»2=0\lambda_2 = 0): 2x2+Ξ»1=02x_2 + \lambda_1 = 0, 2x1+Ξ»1=0β€…β€ŠβŸΉβ€…β€Šx1=x22x_1 + \lambda_1 = 0 \implies x_1 = x_2. x1+x1=1β€…β€ŠβŸΉβ€…β€Šx1=x2=0.5x_1 + x_1 = 1 \implies x_1 = x_2 = 0.5, Ξ»1=βˆ’1\lambda_1 = -1. Impossible.

Case: Neither active (Ξ»1=Ξ»2=0\lambda_1 = \lambda_2 = 0): x1=x2=0x_1 = x_2 = 0. Both constraints satisfied. Optimal.

Optimal solution: xβˆ—=(0,0)x^* = (0, 0), fβˆ—=0f^* = 0, Ξ»1βˆ—=Ξ»2βˆ—=0\lambda_1^* = \lambda_2^* = 0.

Both constraints are inactive β€” the unconstrained minimum is already feasible.

πŸ“Problem 3: Penalty Method by Hand

Solve min⁑x\min x subject to xβ‰₯1x \geq 1 using the quadratic penalty method. Start with ρ=1\rho = 1 and do two iterations.

πŸ’‘Solution

Rewrite as min⁑x\min x subject to 1βˆ’x≀01 - x \leq 0.

Penalty formulation: min⁑x+ρ[max⁑(0,1βˆ’x)]2\min x + \rho [\max(0, 1-x)]^2.

Iteration 1 (ρ=1\rho = 1): Since the unconstrained minimum (x=βˆ’βˆžx = -\infty) is infeasible, consider the region where x<1x < 1: min⁑x+(1βˆ’x)2β€…β€ŠβŸΉβ€…β€Š1βˆ’2(1βˆ’x)=0β€…β€ŠβŸΉβ€…β€Šx=0.5\min x + (1-x)^2 \implies 1 - 2(1-x) = 0 \implies x = 0.5. Objective: 0.5+0.25=0.750.5 + 0.25 = 0.75.

Iteration 2 (ρ=10\rho = 10): min⁑x+10(1βˆ’x)2β€…β€ŠβŸΉβ€…β€Š1βˆ’20(1βˆ’x)=0β€…β€ŠβŸΉβ€…β€Šx=0.95\min x + 10(1-x)^2 \implies 1 - 20(1-x) = 0 \implies x = 0.95. Objective: 0.95+0.025=0.9750.95 + 0.025 = 0.975.

As Οβ†’βˆž\rho \to \infty, xβ†’1x \to 1 (the true optimum). Each iteration brings us closer but with diminishing returns and increasing ill-conditioning.

πŸ“Problem 4: SVM Support Vectors

Given a linearly separable dataset with two classes, explain how the KKT conditions identify support vectors and why this leads to a sparse solution.

πŸ’‘Solution

The SVM dual problem maximizes βˆ‘Ξ»iβˆ’12βˆ‘i,jΞ»iΞ»jyiyjxiTxj\sum \lambda_i - \frac{1}{2} \sum_{i,j} \lambda_i \lambda_j y_i y_j x_i^T x_j subject to 0≀λi≀C0 \leq \lambda_i \leq C and βˆ‘Ξ»iyi=0\sum \lambda_i y_i = 0.

By complementary slackness: Ξ»i(yi(wTxi+b)βˆ’1+ΞΎi)=0\lambda_i (y_i(w^T x_i + b) - 1 + \xi_i) = 0.

  • If Ξ»i=0\lambda_i = 0: point is correctly classified with margin > 1. Not needed for the model.
  • If 0<Ξ»i<C0 < \lambda_i < C: point is exactly on the margin (yi(wTxi+b)=1y_i(w^T x_i + b) = 1). This is a free support vector.
  • If Ξ»i=C\lambda_i = C: point is on the wrong side of the margin or misclassified. This is a bounded support vector.

The decision boundary depends only on points with Ξ»i>0\lambda_i > 0 (support vectors). Typically, only a small fraction of training points are support vectors, leading to a sparse, efficient model.

πŸ“Problem 5: Portfolio Optimization

Formulate and solve the following portfolio problem: minimize risk 12wTΞ£w\frac{1}{2} w^T \Sigma w subject to expected return wTΞΌβ‰₯0.08w^T \mu \geq 0.08 and wT1=1w^T \mathbf{1} = 1 (fully invested), where Ξ£\Sigma is the covariance matrix and ΞΌ\mu is the expected return vector.

πŸ’‘Solution

Lagrangian: L=12wTΞ£wβˆ’Ξ»(wTΞΌβˆ’0.08)+Ξ³(wT1βˆ’1)\mathcal{L} = \frac{1}{2} w^T \Sigma w - \lambda(w^T \mu - 0.08) + \gamma(w^T \mathbf{1} - 1).

KKT conditions:

  • Ξ£wβˆ’Ξ»ΞΌ+Ξ³1=0\Sigma w - \lambda \mu + \gamma \mathbf{1} = 0 (stationarity)
  • wTΞΌβ‰₯0.08w^T \mu \geq 0.08, wT1=1w^T \mathbf{1} = 1 (primal feasibility)
  • Ξ»β‰₯0\lambda \geq 0 (dual feasibility)
  • Ξ»(wTΞΌβˆ’0.08)=0\lambda(w^T \mu - 0.08) = 0 (complementary slackness)

If the return constraint is active (wTΞΌ=0.08w^T \mu = 0.08): solve the 2Γ—22 \times 2 system: w=Ξ£βˆ’1(Ξ»ΞΌβˆ’Ξ³1)w = \Sigma^{-1}(\lambda \mu - \gamma \mathbf{1}), then find Ξ»,Ξ³\lambda, \gamma from the two equality constraints.

This is the classical Markowitz mean-variance optimization, and the multiplier Ξ»\lambda is the "price of return" β€” how much risk increases per unit increase in the return target.


Quick Reference

πŸ“‹Key Concepts and Formulas

Standard Form:

min⁑f(x)s.t.gi(x)≀0,β€…β€Šhj(x)=0\min f(x) \quad \text{s.t.} \quad g_i(x) \leq 0, \; h_j(x) = 0

KKT Conditions (Necessary):

  1. Stationarity: βˆ‡f+βˆ‘Ξ»iβˆ‡gi+βˆ‘ΞΌjβˆ‡hj=0\nabla f + \sum \lambda_i \nabla g_i + \sum \mu_j \nabla h_j = 0
  2. Primal Feasibility: gi(xβˆ—)≀0g_i(x^*) \leq 0, hj(xβˆ—)=0h_j(x^*) = 0
  3. Dual Feasibility: Ξ»iβ‰₯0\lambda_i \geq 0
  4. Complementary Slackness: Ξ»igi(xβˆ—)=0\lambda_i g_i(x^*) = 0

KKT Conditions (Sufficient): If ff is convex, gig_i are convex, hjh_j are affine, and KKT holds, then xβˆ—x^* is a global minimum.

Lagrangian: L(x,Ξ»,ΞΌ)=f(x)+βˆ‘Ξ»igi(x)+βˆ‘ΞΌjhj(x)\mathcal{L}(x, \lambda, \mu) = f(x) + \sum \lambda_i g_i(x) + \sum \mu_j h_j(x)

Penalty Method: min⁑f(x)+Οβˆ‘[max⁑(0,gi(x))]2+Οβˆ‘hj(x)2\min f(x) + \rho \sum [\max(0, g_i(x))]^2 + \rho \sum h_j(x)^2

Augmented Lagrangian: LA=f(x)+βˆ‘ΞΌjhj(x)+ρ2βˆ‘hj(x)2\mathcal{L}_A = f(x) + \sum \mu_j h_j(x) + \frac{\rho}{2} \sum h_j(x)^2

Barrier Method: min⁑f(x)βˆ’ΞΌβˆ‘ln⁑(βˆ’gi(x))\min f(x) - \mu \sum \ln(-g_i(x))

Solver Selection:

Problem TypeRecommended Solver
Convex QP/SOCP/SDPcvxpy (ECOS, SCS, MOSEK)
Small nonlinearscipy.optimize (SLSQP)
Large-scale convexMOSEK, Gurobi
Non-convexGlobal optimizers, multi-start
ML with constraintscvxpy, custom penalty methods

Cross-References

  • Previous: 064 - Gradient Descent β€” unconstrained optimization foundations
  • Previous: 061 - Linear Algebra for Optimization β€” matrix operations used in constrained solvers
  • Related: 066 - Convex Optimization β€” when KKT conditions become sufficient for global optimality
  • Related: 059 - Calculus β€” derivatives and gradients underlying KKT stationarity
  • Application: 060 - Probability & Statistics β€” expected values and variance in portfolio optimization
  • Application: 062 - Information Theory β€” entropy-constrained coding and rate-distortion theory
  • Advanced: Semi-definite programming (SDP) extends constrained optimization to matrix-valued constraints
  • Advanced: Bilevel optimization involves nested constrained problems (leader-follower structure)
Lesson Progress65 / 100