Constrained Optimization

💡 Why It Matters

In the real world, almost every optimization problem operates under constraints. A portfolio manager cannot invest more than the available budget. An engineer cannot exceed the maximum load a beam can support. A machine learning model must satisfy fairness regulations. Constrained optimization provides the mathematical framework for finding the best solution while respecting these limits. Without understanding constraints, theoretical optima are meaningless — they may be physically, financially, or legally impossible to achieve.

ℹ️ Historical Context

Constrained optimization has roots in the calculus of variations (Euler, Lagrange) and was formalized in the 20th century by Karush (1939), Kuhn (1951), and Tucker (1951). Today it underpins operations research, control theory, economics, and machine learning — from training SVMs to designing fair allocation systems.

Constrained Optimization

DfConstrained Optimization Problem

A constrained optimization problem seeks to minimize (or maximize) an objective function $f: \mathbb{R}^n \to \mathbb{R}$ subject to a set of constraints on the decision variables $x \in \mathbb{R}^n$ . The general form is:

\min_{x \in \mathbb{R}^n} f(x) \quad \text{subject to} \quad g_i(x) \leq 0 \; \forall i = 1, \ldots, m, \quad h_j(x) = 0 \; \forall j = 1, \ldots, p

where $g_i(x)$ are inequality constraints and $h_j(x)$ are equality constraints.

Constrained Optimization Standard Form

\min f(x) \text{ s.t. } g_i(x) \leq 0, \; h_j(x) = 0

Here,

$f(x)$ =Objective function to minimize
$g_i(x)$ =Inequality constraints (must be ≤ 0)
$h_j(x)$ =Equality constraints (must equal 0)
$x$ =Decision variable vector in $\mathbb{R}^n$

DfFeasible Region

The feasible region (or feasible set) is the set of all points $x$ that satisfy every constraint:

\mathcal{F} = \{ x \in \mathbb{R}^n \mid g_i(x) \leq 0 \; \forall i, \; h_j(x) = 0 \; \forall j \}

A point is feasible if it belongs to $\mathcal{F}$ ; otherwise it is infeasible. The optimization problem is feasible if $\mathcal{F} \neq \emptyset$ and infeasible otherwise.

DfBinding vs Non-Binding Constraints

A constraint $g_i(x) \leq 0$ is binding (or active) at $x^*$ if $g_i(x^*) = 0$ . It is non-binding (or inactive) if $g_i(x^*) < 0$ . Equality constraints are always binding by definition. Only binding constraints shape the solution locally: dropping an inactive constraint leaves $x^*$ a local optimum, and for convex problems it leaves the global optimum unchanged. In a non-convex problem, removing an inactive constraint can expose a better point elsewhere.
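The definitions of feasibility and active constraints can be checked numerically at any candidate point. The sketch below is illustrative only: the helper name `classify_point`, the tolerance, and the two sample constraints are assumptions, not part of the original text.

```python
import numpy as np

def classify_point(x, ineqs=(), eqs=(), tol=1e-8):
    """Return (feasible, active) for constraints g_i(x) <= 0 and h_j(x) = 0."""
    g = np.array([gi(x) for gi in ineqs], dtype=float)
    h = np.array([hj(x) for hj in eqs], dtype=float)
    # Feasible: every inequality holds and every equality is (numerically) zero.
    feasible = bool(np.all(g <= tol) and np.all(np.abs(h) <= tol))
    # Active inequalities: those satisfied with equality, up to the tolerance.
    active = [i for i, gi in enumerate(g) if abs(gi) <= tol]
    return feasible, active

# Hypothetical constraints: x1 + x2 - 1 <= 0 and -x1 <= 0.
ineqs = [lambda x: x[0] + x[1] - 1, lambda x: -x[0]]

on_boundary = classify_point(np.array([1.0, 0.0]), ineqs)  # first constraint is tight
outside = classify_point(np.array([2.0, 0.0]), ineqs)      # first constraint is violated
print(on_boundary, outside)
```

A tolerance is needed because floating-point solvers rarely return a point where $g_i(x) = 0$ holds exactly.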

⚠️ Common Pitfall

Many beginners forget to verify that the feasible region is non-empty, and that the objective is bounded below on it, before searching for an optimum. An unbounded feasible region can lead to no finite minimum (for example, minimizing $x$ subject to $x \leq 0$ ). Always check feasibility first.

Equality Constraints (Lagrange Multipliers Review)

DfEquality Constrained Problem

Consider the problem with only equality constraints:

\min_{x \in \mathbb{R}^n} f(x) \quad \text{subject to} \quad h_j(x) = 0, \; j = 1, \ldots, p

Assume $f$ and $h_j$ are continuously differentiable and that the gradients $\nabla h_j(x^*)$ are linearly independent at the solution (the regularity condition).

ThLagrange Multiplier Theorem

If $x^*$ is a local minimum of $f(x)$ subject to $h_j(x) = 0$ for $j = 1, \ldots, p$ , and the gradients $\nabla h_j(x^*)$ are linearly independent, then there exist unique scalars $\mu_1, \ldots, \mu_p$ (Lagrange multipliers) such that:

\nabla f(x^*) + \sum_{j=1}^{p} \mu_j \nabla h_j(x^*) = 0

Each $\mu_j$ measures the sensitivity of the optimal objective value to changes in the $j$ -th constraint. Specifically, with the Lagrangian convention $\mathcal{L} = f + \sum_j \mu_j h_j$ , if the constraint $h_j(x) = 0$ is perturbed to $h_j(x) = \epsilon$ , the optimal objective changes by approximately $-\mu_j \epsilon$ .

Lagrangian Function (Equality Constraints)

\mathcal{L}(x, \mu) = f(x) + \sum_{j=1}^{p} \mu_j h_j(x)

Here,

$\mathcal{L}$ =Lagrangian function
$f(x)$ =Original objective function
$\mu_j$ =Lagrange multiplier for equality constraint j
$h_j(x)$ =j-th equality constraint

📝Example: Shortest Distance to a Line

Find the point on the line $x + y = 1$ closest to the origin.

Objective: $\min f(x,y) = x^2 + y^2$ Constraint: $h(x,y) = x + y - 1 = 0$

The Lagrangian is $\mathcal{L} = x^2 + y^2 + \mu(x + y - 1)$ .

Setting partial derivatives to zero:

$\frac{\partial \mathcal{L}}{\partial x} = 2x + \mu = 0 \implies x = -\mu/2$
$\frac{\partial \mathcal{L}}{\partial y} = 2y + \mu = 0 \implies y = -\mu/2$
$\frac{\partial \mathcal{L}}{\partial \mu} = x + y - 1 = 0$

Substituting: $-\mu/2 + (-\mu/2) = 1 \implies \mu = -1$ , so $x = y = 1/2$ .

Optimal point: $(1/2, 1/2)$ , minimum distance: $1/\sqrt{2}$ .

💡Solution

The Lagrange multiplier $\mu = -1$ means that if we perturb the constraint to $x + y = 1 + \epsilon$ , the minimum squared distance changes by approximately $-\mu \epsilon = +\epsilon$ (exactly, it becomes $(1+\epsilon)^2/2$ ). Moving the line farther from the origin increases the minimum distance, as expected. The negative sign of $\mu$ is a consequence of the sign convention $\mathcal{L} = f + \mu h$ .
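Because the objective is quadratic and the constraint is linear, the three stationarity equations of this example form a linear system that can be solved directly. The following is a minimal check of the hand calculation; the variable names are illustrative.

```python
import numpy as np

# Stationarity equations of L = x^2 + y^2 + mu*(x + y - 1):
#   2x      + mu = 0
#        2y + mu = 0
#   x  + y       = 1
K = np.array([[2.0, 0.0, 1.0],
              [0.0, 2.0, 1.0],
              [1.0, 1.0, 0.0]])
rhs = np.array([0.0, 0.0, 1.0])

x, y, mu = np.linalg.solve(K, rhs)
distance = np.sqrt(x**2 + y**2)
print(x, y, mu, distance)
```

The solve reproduces $x = y = 1/2$ , $\mu = -1$ , and a distance of $1/\sqrt{2}$ .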

Inequality Constraints

DfInequality Constrained Problem

The general inequality constrained problem is:

\min_{x \in \mathbb{R}^n} f(x) \quad \text{subject to} \quad g_i(x) \leq 0, \; i = 1, \ldots, m

Inequality constraints are more complex than equality constraints because the active set (which constraints are binding) is unknown a priori. At the solution, some constraints may be inactive ( $g_i(x^*) < 0$ ) and do not affect the optimum.

DfConstraint Qualification

A constraint qualification (CQ) is a condition on the constraint functions that ensures the KKT conditions are necessary for optimality. The most common are:

Linear Independence Constraint Qualification (LICQ): The gradients of all active constraints are linearly independent at $x^*$ .
Mangasarian-Fromovitz CQ (MFCQ): The gradients of active equality constraints are linearly independent, and there exists a direction $d$ such that $\nabla g_i(x^*)^T d < 0$ for all active inequality constraints.
Slater's Condition (for convex problems): There exists a strictly feasible point $\bar{x}$ such that $g_i(\bar{x}) < 0$ for all $i$ and $h_j(\bar{x}) = 0$ for all $j$ .

💡 Why Constraint Qualifications Matter

Without a constraint qualification, the KKT conditions may fail to hold even at a local minimum. For example, consider minimizing $x$ subject to $x^2 \leq 0$ . The only feasible point is $x^* = 0$ , which is optimal. But $\nabla g(0) = 0$ , so the KKT gradient condition cannot be satisfied for any multiplier. The LICQ fails here because the active constraint gradient is zero.

KKT Conditions (Karush-Kuhn-Tucker)

ThKKT Conditions (Necessary)

Let $x^*$ be a local minimum of $f(x)$ subject to $g_i(x) \leq 0$ ( $i = 1, \ldots, m$ ) and $h_j(x) = 0$ ( $j = 1, \ldots, p$ ). Assume that $f$ , $g_i$ , and $h_j$ are continuously differentiable, and that a constraint qualification holds at $x^*$ . Then there exist multipliers $\lambda_i \geq 0$ and $\mu_j$ such that:

Stationarity: $\nabla f(x^*) + \sum_{i=1}^{m} \lambda_i \nabla g_i(x^*) + \sum_{j=1}^{p} \mu_j \nabla h_j(x^*) = 0$
Primal Feasibility: $g_i(x^*) \leq 0 \; \forall i$ and $h_j(x^*) = 0 \; \forall j$
Dual Feasibility: $\lambda_i \geq 0 \; \forall i$
Complementary Slackness: $\lambda_i g_i(x^*) = 0 \; \forall i$
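The four conditions can be turned into a numerical check for a candidate point and candidate multipliers. This is a sketch under stated assumptions: the function name `kkt_residuals` and its argument layout (constraint gradients stacked as matrix rows) are hypothetical, and the toy problem $\min (x-2)^2$ subject to $x - 1 \leq 0$ is used only as an illustration.

```python
import numpy as np

def kkt_residuals(grad_f, g, G, lam, h=None, H=None, mu=None):
    """KKT residuals at a candidate point; all four should be close to zero.

    grad_f: gradient of f; g: values g_i(x); G: rows are grad g_i(x);
    lam: inequality multipliers; h, H, mu: same for equality constraints.
    """
    grad_f = np.asarray(grad_f, dtype=float)
    g = np.asarray(g, dtype=float)
    lam = np.asarray(lam, dtype=float)
    G = np.asarray(G, dtype=float).reshape(len(g), -1)
    stationarity = grad_f + G.T @ lam
    primal = np.maximum(g, 0.0).sum()          # total inequality violation
    if h is not None:
        h = np.asarray(h, dtype=float)
        mu = np.asarray(mu, dtype=float)
        H = np.asarray(H, dtype=float).reshape(len(h), -1)
        stationarity = stationarity + H.T @ mu
        primal += np.abs(h).sum()              # total equality violation
    return {
        "stationarity": float(np.linalg.norm(stationarity)),
        "primal_feasibility": float(primal),
        "dual_feasibility": float(np.maximum(-lam, 0.0).sum()),
        "complementary_slackness": float(np.abs(lam * g).sum()),
    }

# Toy problem: f(x) = (x - 2)^2, g(x) = x - 1 <= 0.
good = kkt_residuals(grad_f=[2 * (1.0 - 2.0)], g=[0.0], G=[[1.0]], lam=[2.0])  # x = 1
bad = kkt_residuals(grad_f=[0.0], g=[1.0], G=[[1.0]], lam=[0.0])               # x = 2
print(good)
print(bad)
```

At $x = 1$ with $\lambda = 2$ every residual vanishes, while the unconstrained minimizer $x = 2$ fails primal feasibility.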

KKT Conditions — Stationarity

\nabla f(x^*) + \sum_{i=1}^{m} \lambda_i \nabla g_i(x^*) + \sum_{j=1}^{p} \mu_j \nabla h_j(x^*) = 0

Here,

$\nabla f(x^*)$ =Gradient of the objective at the optimum
$\lambda_i$ =Lagrange multiplier for inequality constraint i (≥ 0)
$\nabla g_i(x^*)$ =Gradient of inequality constraint i
$\mu_j$ =Lagrange multiplier for equality constraint j
$\nabla h_j(x^*)$ =Gradient of equality constraint j

KKT Conditions — Complementary Slackness

\lambda_i \, g_i(x^*) = 0 \quad \forall i = 1, \ldots, m

Here,

$\lambda_i$ =Multiplier for inequality constraint i
$g_i(x^*)$ =Value of inequality constraint i at the optimum

ℹ️ Interpreting Complementary Slackness

Complementary slackness means that for each inequality constraint, at least one of two things holds: either the constraint is active ( $g_i(x^*) = 0$ , the constraint is tight) or its multiplier is zero ( $\lambda_i = 0$ , the constraint does not affect the optimum). Both can hold at once in degenerate cases, where a constraint is tight but has a zero multiplier. This is powerful because it tells us which constraints actually matter at the solution.

ThKKT Conditions (Sufficient)

If $f$ is convex, $g_i$ are convex, $h_j$ are affine, and there exist multipliers $\lambda_i \geq 0$ and $\mu_j$ satisfying the KKT conditions at $x^*$ , then $x^*$ is a global minimum.

📝Example: KKT with One Inequality Constraint

Minimize $f(x) = (x - 2)^2$ subject to $x \leq 1$ .

Lagrangian: $\mathcal{L}(x, \lambda) = (x - 2)^2 + \lambda(x - 1)$

KKT conditions:

Stationarity: $2(x - 2) + \lambda = 0$
Primal feasibility: $x \leq 1$
Dual feasibility: $\lambda \geq 0$
Complementary slackness: $\lambda(x - 1) = 0$

Case 1: Constraint inactive ( $\lambda = 0$ ): $2(x - 2) = 0 \implies x = 2$ . But $x = 2$ violates $x \leq 1$ . Infeasible.

Case 2: Constraint active ( $x = 1$ ): $2(1 - 2) + \lambda = 0 \implies \lambda = 2 \geq 0$ . ✓

Optimal solution: $x^* = 1$ , $f(x^*) = 1$ , $\lambda^* = 2$ .

💡Solution

The multiplier $\lambda^* = 2$ indicates that if we relax the constraint to $x \leq 1 + \epsilon$ , the optimal objective decreases by approximately $2\epsilon$ . This makes sense: relaxing the constraint allows $x$ to move closer to the unconstrained optimum at $x = 2$ .
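The sensitivity interpretation can be checked with a finite difference. For this one-dimensional convex problem the constrained minimizer is simply the unconstrained minimizer $x = 2$ clipped to the bound, so no solver is needed. The helper name `optimal_value` is an assumption made for illustration.

```python
def optimal_value(eps):
    """Optimal value of min (x - 2)^2 subject to x <= 1 + eps."""
    x_star = min(2.0, 1.0 + eps)   # clip the unconstrained minimizer to the bound
    return (x_star - 2.0) ** 2

eps = 1e-4
slope = (optimal_value(eps) - optimal_value(0.0)) / eps
print(slope)   # close to -2, i.e. minus the multiplier
```

The slope of the optimal value with respect to the relaxation is approximately $-\lambda^* = -2$ , matching the statement above.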

Active Set Methods

DfActive Set Method

An active set method solves a constrained optimization problem by iteratively guessing which constraints are active (binding) at the solution and solving the resulting equality-constrained subproblem. At each iteration $k$ :

Given the current iterate $x^k$ and an estimated active set $\mathcal{A}^k \subseteq \{1, \ldots, m\}$ , solve the equality-constrained subproblem:

\min_{d} \nabla f(x^k)^T d + \frac{1}{2} d^T \nabla^2_{xx} \mathcal{L}(x^k, \lambda^k) \, d \quad \text{s.t.} \quad \nabla g_i(x^k)^T d = 0 \; \forall i \in \mathcal{A}^k

If the step $d^k$ is nonzero, take a step $x^{k+1} = x^k + \alpha_k d^k$ with step size $\alpha_k$ chosen to maintain feasibility.
Update the active set: add constraints that become binding, remove constraints whose multipliers become negative.
Repeat until no further improvement is possible and all multipliers are nonnegative.
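The steps above can be sketched for the special case of a convex quadratic program with linear inequality constraints, $\min \frac{1}{2} x^T Q x + c^T x$ subject to $Ax \leq b$ . In that case the Hessian of the Lagrangian is just $Q$ and the constraint gradients are the rows of $A$ . This is a simplified illustration, not a production solver: the function name `active_set_qp` and the test problem are assumptions, a feasible starting point is required, and degenerate cycling is not handled.

```python
import numpy as np

def active_set_qp(Q, c, A, b, x0, max_iter=50, tol=1e-9):
    """Minimize 0.5*x@Q@x + c@x subject to A@x <= b, from a feasible x0."""
    x = np.asarray(x0, dtype=float)
    n, m = x.size, len(b)
    # Working set: indices of constraints currently treated as equalities.
    working = [i for i in range(m) if abs(A[i] @ x - b[i]) <= tol]
    for _ in range(max_iter):
        k = len(working)
        Aw = A[np.array(working, dtype=int)]
        grad = Q @ x + c
        # KKT system of the subproblem: Q d + Aw^T lam = -grad, Aw d = 0.
        K = np.zeros((n + k, n + k))
        K[:n, :n] = Q
        K[:n, n:] = Aw.T
        K[n:, :n] = Aw
        sol = np.linalg.solve(K, np.concatenate([-grad, np.zeros(k)]))
        d, lam = sol[:n], sol[n:]
        if np.linalg.norm(d) <= tol:
            # No progress possible on this working set: check multiplier signs.
            if k == 0 or lam.min() >= -tol:
                return x, dict(zip(working, lam))
            working.pop(int(np.argmin(lam)))   # drop the most negative multiplier
            continue
        # Largest step in [0, 1] that keeps every inactive constraint satisfied.
        alpha, blocking = 1.0, None
        for i in range(m):
            slope = A[i] @ d
            if i not in working and slope > tol:
                step = (b[i] - A[i] @ x) / slope
                if step < alpha:
                    alpha, blocking = step, i
        x = x + alpha * d
        if blocking is not None:
            working.append(blocking)           # the blocking constraint becomes active
    raise RuntimeError("active set iteration did not converge")

# Illustrative problem: min (x1-2)^2 + (x2-1)^2  s.t.  x1 + x2 <= 1, x1 >= 0, x2 >= 0.
Q = 2.0 * np.eye(2)
c = np.array([-4.0, -2.0])
A = np.array([[1.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
b = np.array([1.0, 0.0, 0.0])
x_opt, multipliers = active_set_qp(Q, c, A, b, x0=[0.0, 0.0])
print(x_opt, multipliers)
```

Starting at the origin, the sketch first drops the bound $x_1 \geq 0$ (its multiplier is negative), moves until $x_1 + x_2 \leq 1$ blocks the step, and stops at $(1, 0)$ once all multipliers are nonnegative.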

💡 Active Set in Practice

Active set methods are the basis of many quadratic programming (QP) solvers. They are efficient when the number of constraints is moderate and the active set changes slowly between iterations. However, they can be slow for large-scale problems because solving the equality-constrained subproblem at each step requires factoring a matrix that changes with the active set.

DfWorking Set

In active set methods, the working set is the current estimate of which constraints will be active at the solution. At each iteration, the algorithm solves a subproblem with the working set treated as equality constraints. Constraints may be added to or removed from the working set based on:

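Addition: A constraint blocks the step, meaning a full step would violate it ( $g_i(x^k + d^k) > 0$ ). The step is shortened until that constraint becomes active, and it is added to the working set.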
Removal: A constraint's multiplier becomes negative ( $\lambda_i < 0$ ), indicating the solution would improve by moving away from that constraint.

Interior Point Methods

DfInterior Point Method

Interior point methods solve constrained optimization problems by tracing a path through the interior of the feasible region, rather than moving along its boundary (as active set methods do). They introduce barrier functions that prevent iterates from reaching the boundary, then gradually reduce the barrier to approach the true solution.

Logarithmic Barrier Function

\min f(x) - \mu \sum_{i=1}^{m} \ln(-g_i(x)) \quad \text{subject to} \quad h_j(x) = 0

Here,

$f(x)$ =Original objective function
$\mu$ =Barrier parameter (decreases toward 0)
$g_i(x)$ =Inequality constraints (g_i(x) < 0 for interior points)
$h_j(x)$ =Equality constraints

ℹ️ How Barrier Methods Work

The barrier term $-\mu \ln(-g_i(x))$ approaches $+\infty$ as $g_i(x) \to 0^-$ (approaching the boundary). This creates a "wall" that keeps iterates strictly feasible. As $\mu \to 0$ , the barrier solution converges to the KKT point of the original problem. In practice, a sequence of barrier subproblems is solved with decreasing $\mu$ values, using the previous solution as a warm start.
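A small sketch of this warm-started sequence, applied to the one-dimensional problem $\min (x-2)^2$ subject to $x \leq 1$ (so $g(x) = x - 1$ ). The function name `barrier_path`, the schedule of $\mu$ values, and the simple step-halving rule are assumptions chosen for illustration; practical solvers use more careful line searches.

```python
import math

def barrier_path(mu_values, x=0.0, newton_steps=50):
    """Minimize (x-2)^2 - mu*ln(1-x) for each mu, warm-starting from the last x."""
    path = []
    for mu in mu_values:
        for _ in range(newton_steps):
            grad = 2.0 * (x - 2.0) + mu / (1.0 - x)
            hess = 2.0 + mu / (1.0 - x) ** 2
            step = -grad / hess
            t = 1.0
            while x + t * step >= 1.0:   # halve the step to stay strictly feasible
                t *= 0.5
            x += t * step
            if abs(grad) < 1e-10:
                break
        path.append((mu, x))
    return path

path = barrier_path([1.0, 0.1, 0.01, 0.001])
for mu, x in path:
    # mu / (-g(x)) is the usual barrier estimate of the inequality multiplier.
    print(f"mu={mu:<6} x={x:.6f} multiplier estimate={mu / (1.0 - x):.4f}")
```

For this problem the barrier minimizer has the closed form $x(\mu) = (3 - \sqrt{1 + 2\mu})/2$ , which approaches the constrained optimum $x^* = 1$ from the interior as $\mu \to 0$ , while the multiplier estimate approaches 2.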

DfPrimal-Dual Interior Point Methods

Modern interior point methods work with the primal-dual KKT system directly. They reformulate the KKT conditions as a nonlinear system and solve it using Newton's method. The key advantage is that they handle both primal and dual variables simultaneously, and for well-behaved convex problems such as linear and convex quadratic programs they need on the order of $O(\sqrt{m} \log(1/\epsilon))$ Newton iterations to reach accuracy $\epsilon$ . This polynomial bound is a significant improvement over the exponential worst case of active set methods.

⚠️ Interior Point Limitations

Classical barrier methods require a strictly feasible starting point, which can be difficult to find; infeasible-start primal-dual variants relax this requirement. Interior point methods also struggle with degenerate problems (where multiple constraints are active at a vertex). For small to medium QPs, active set methods are often faster in practice.

Penalty Methods

DfPenalty Method

A penalty method converts a constrained optimization problem into a sequence of unconstrained problems by adding a penalty term to the objective that penalizes constraint violations. The idea is simple: instead of enforcing constraints exactly, we penalize deviations from feasibility and drive the penalty toward infinity.

Quadratic Penalty Method

\min_{x} f(x) + \rho \sum_{i=1}^{m} \left[\max(0, g_i(x))\right]^2

Here,

$f(x)$ =Original objective function
$\rho$ =Penalty parameter (increases toward infinity)
$g_i(x)$ =Inequality constraints
$\max(0, g_i(x))$ =Positive part: zero if feasible, positive if violated

Quadratic Penalty for Equality Constraints

\min_{x} f(x) + \rho \sum_{j=1}^{p} h_j(x)^2

Here,

$f(x)$ =Original objective function
$\rho$ =Penalty parameter
$h_j(x)$ =Equality constraint j

ℹ️ Penalty Parameter Schedule

In practice, we solve a sequence of subproblems with increasing $\rho$ values: $\rho_1 < \rho_2 < \cdots$ . At each step, we use the previous solution as a warm start. As $\rho \to \infty$ , the penalty solution converges to the constrained optimum. However, large $\rho$ makes the subproblem ill-conditioned (the condition number of the Hessian grows with $\rho$ ), so adaptive strategies are needed to balance accuracy and numerical stability.
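The trade-off between feasibility and conditioning can be seen on the earlier example, $\min x^2 + y^2$ subject to $x + y = 1$ . The penalized objective is quadratic, so each subproblem reduces to one linear solve and no warm start is needed here. The function name `penalty_schedule` and the $\rho$ values are illustrative assumptions.

```python
import numpy as np

def penalty_schedule(rhos):
    """Minimize x@x + rho*(sum(x) - 1)^2 for each rho; report violation and conditioning."""
    ones = np.ones(2)
    rows = []
    for rho in rhos:
        # Gradient: 2x + 2*rho*(sum(x) - 1)*ones = 0  <=>  H x = 2*rho*ones.
        H = 2.0 * np.eye(2) + 2.0 * rho * np.outer(ones, ones)
        x = np.linalg.solve(H, 2.0 * rho * ones)
        rows.append((rho, x, ones @ x - 1.0, np.linalg.cond(H)))
    return rows

rows = penalty_schedule([1.0, 10.0, 100.0, 1000.0])
for rho, x, violation, cond in rows:
    print(f"rho={rho:<7} x={x} violation={violation:.2e} cond(H)={cond:.0f}")
```

For this problem the minimizer is $x = y = \rho/(1 + 2\rho)$ : the constraint violation shrinks like $1/(1 + 2\rho)$ while the condition number of the Hessian grows like $1 + 2\rho$ .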

DfAugmented Lagrangian Method

The augmented Lagrangian method improves on plain penalty methods by incorporating Lagrange multipliers directly. For equality constraints:

\mathcal{L}_A(x, \mu, \rho) = f(x) + \sum_{j=1}^{p} \mu_j h_j(x) + \frac{\rho}{2} \sum_{j=1}^{p} h_j(x)^2

At each iteration, $\mathcal{L}_A$ is minimized over $x$ with the multipliers held fixed at $\mu^k$ , giving $x^k$ , and the multipliers are then updated as $\mu_j^{k+1} = \mu_j^k + \rho h_j(x^k)$ . This avoids the ill-conditioning of pure penalty methods because the multipliers absorb the bias that would otherwise require $\rho \to \infty$ .
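A minimal sketch of this iteration on $\min x^2 + y^2$ subject to $x + y = 1$ , with $\rho$ held fixed. The inner minimization is a linear solve because $\mathcal{L}_A$ is quadratic in $x$ . The function name `augmented_lagrangian` and the defaults $\rho = 10$ and 10 iterations are assumptions for illustration.

```python
import numpy as np

def augmented_lagrangian(rho=10.0, iters=10):
    """Method of multipliers for min x@x subject to sum(x) - 1 = 0."""
    ones = np.ones(2)
    mu = 0.0
    x = np.zeros(2)
    for _ in range(iters):
        # Inner step: grad_x L_A = 2x + mu*ones + rho*(sum(x) - 1)*ones = 0.
        H = 2.0 * np.eye(2) + rho * np.outer(ones, ones)
        x = np.linalg.solve(H, (rho - mu) * ones)
        # Multiplier update with the current constraint residual.
        mu = mu + rho * (ones @ x - 1.0)
    return x, mu

x_al, mu_al = augmented_lagrangian()
print(x_al, mu_al)
```

With $\rho$ fixed at 10, the multiplier error shrinks by a factor of $1/(1 + \rho)$ per iteration on this problem, so the iterates approach $x = (1/2, 1/2)$ and $\mu = -1$ without driving $\rho$ to infinity.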

💡 When to Use Penalty Methods

Penalty methods are simple to implement and work well when high accuracy is not required. They are commonly used in training neural networks with constraints (e.g., fairness constraints) where approximate feasibility is acceptable. For high-precision solutions, use interior point methods or augmented Lagrangian methods.

Python Implementation

scipy.optimize

from scipy.optimize import minimize, LinearConstraint
import numpy as np

# Example 1: Equality constraint
# Minimize x^2 + y^2 subject to x + y = 1
def objective(x):
    return x[0]**2 + x[1]**2

def eq_constraint(x):
    return x[0] + x[1] - 1

constraints = [{'type': 'eq', 'fun': eq_constraint}]
result = minimize(objective, x0=[0.0, 0.0], constraints=constraints, method='SLSQP')
print(f"Optimal point: {result.x}")       # approx. [0.5, 0.5]
print(f"Optimal value: {result.fun}")     # approx. 0.5

# `result.v` is populated by trust-constr, not by SLSQP, so the multiplier
# estimates are read from a trust-constr solve of the same problem.
result_tc = minimize(objective, x0=[0.0, 0.0], method='trust-constr',
                     constraints=[LinearConstraint([[1, 1]], 1, 1)])
print(f"Multipliers: {result_tc.v}")      # list with one array per constraint

# Example 2: Inequality constraint
# Minimize (x-2)^2 + (y-1)^2 subject to x + y <= 1
def objective2(x):
    return (x[0] - 2)**2 + (x[1] - 1)**2

def ineq_constraint(x):
    return 1 - x[0] - x[1]  # scipy 'ineq' constraints mean fun(x) >= 0

result2 = minimize(objective2, x0=[0.0, 0.0], method='SLSQP',
                   constraints={'type': 'ineq', 'fun': ineq_constraint})
print(f"Optimal point: {result2.x}")      # approx. [1.0, 0.0]

# Example 3: Multiple constraints with bounds
# Minimize x^2 + y^2 subject to x + y = 1, x^2 + y^2 <= 1, x >= 0, y >= 0
bounds = [(0, None), (0, None)]  # x >= 0, y >= 0
constraints_multi = [
    {'type': 'eq', 'fun': lambda x: x[0] + x[1] - 1},
    {'type': 'ineq', 'fun': lambda x: 1 - x[0]**2 - x[1]**2}
]
result3 = minimize(objective, x0=[0.5, 0.5], constraints=constraints_multi,
                   bounds=bounds, method='SLSQP')
print(f"Optimal point: {result3.x}")      # approx. [0.5, 0.5]

cvxpy

import cvxpy as cp
import numpy as np

# Example 1: Simple constrained problem
x = cp.Variable(2)
objective = cp.Minimize(cp.sum_squares(x))
constraints = [cp.sum(x) == 1]
prob = cp.Problem(objective, constraints)
prob.solve()
print(f"Optimal value: {prob.value}")       # 0.5
print(f"Optimal x: {x.value}")              # [0.5, 0.5]
print(f"Shadow price: {constraints[0].dual_value}")  # Lagrange multiplier

# Example 2: SVM-like problem
n_samples, n_features = 100, 5
np.random.seed(42)
X = np.random.randn(n_samples, n_features)
y = np.sign(np.random.randn(n_samples))

w = cp.Variable(n_features)
b = cp.Variable()
loss = cp.sum(cp.pos(1 - cp.multiply(y, X @ w + b)))
reg = 0.01 * cp.sum_squares(w)
prob = cp.Problem(cp.Minimize(reg + loss))
prob.solve()
print(f"SVM accuracy: {np.mean(np.sign(X @ w.value + b.value) == y):.2%}")

# Example 3: Portfolio optimization
n_assets = 5
mu = np.random.randn(n_assets) * 0.05 + 0.1  # expected returns
Sigma = np.random.randn(n_assets, n_assets) * 0.1
Sigma = Sigma @ Sigma.T + np.eye(n_assets) * 0.01  # covariance

w = cp.Variable(n_assets)
ret = mu @ w
risk = cp.quad_form(w, Sigma)
constraints = [cp.sum(w) == 1, w >= 0]  # fully invested, no shorting
prob = cp.Problem(cp.Minimize(risk - 0.5 * ret), constraints)  # risk-averse
prob.solve()
print(f"Portfolio weights: {np.round(w.value, 3)}")

💡 Choosing a Solver

scipy.optimize.minimize: Good for small, general nonlinear problems. Methods: SLSQP (supports constraints), L-BFGS-B (bounds only), trust-constr.
cvxpy: Best for convex problems (linear, quadratic, SOCP, SDP). Automatically selects an appropriate solver (ECOS, SCS, MOSEK).
For large-scale: Use interior point methods (MOSEK, Gurobi) or first-order methods (ADMM, proximal gradient).

Applications in AI/ML

Support Vector Machines (SVM)

DfSVM as Constrained Optimization

The primal SVM problem is a constrained optimization problem:

\min_{w, b, \xi} \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} \xi_i

subject to:

y_i(w^T x_i + b) \geq 1 - \xi_i, \quad \xi_i \geq 0 \quad \forall i

The objective balances margin maximization ( $\|w\|^2$ ) with classification error ( $\sum \xi_i$ ). The KKT conditions of this problem yield the support vectors — the data points where $\lambda_i > 0$ — which define the decision boundary.

📝SVM KKT Interpretation

For an SVM, complementary slackness $\lambda_i (y_i(w^T x_i + b) - 1 + \xi_i) = 0$ tells us:

If $\lambda_i = 0$ : the point lies on or outside the margin on the correct side ( $y_i(w^T x_i + b) \geq 1$ ) and is not a support vector.
If $\lambda_i > 0$ : the point lies on the margin, inside it, or is misclassified. It is a support vector.

This sparsity is what makes SVMs memory-efficient: only support vectors are needed for prediction.
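The sparsity claim can be illustrated on a tiny, made-up dataset by solving the primal soft-margin problem with a general-purpose solver and then checking which points satisfy the margin constraint with equality. Everything specific here is an assumption for illustration: the four data points, the value $C = 10$ , and the use of SciPy's SLSQP method. Support vectors are identified through their margins rather than through multipliers.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical linearly separable data: two points per class along the diagonal.
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
C = 10.0
n, d = X.shape

def unpack(z):
    """Split the stacked variable z into (w, b, xi)."""
    return z[:d], z[d], z[d + 1:]

def objective(z):
    w, _, xi = unpack(z)
    return 0.5 * w @ w + C * xi.sum()

def margin_constraints(z):
    w, b, xi = unpack(z)
    return y * (X @ w + b) - 1.0 + xi      # each entry must be >= 0

bounds = [(None, None)] * (d + 1) + [(0.0, None)] * n   # xi_i >= 0
res = minimize(objective, np.zeros(d + 1 + n), method="SLSQP", bounds=bounds,
               constraints=[{"type": "ineq", "fun": margin_constraints}],
               options={"ftol": 1e-10})

w, b, xi = unpack(res.x)
margins = y * (X @ w + b)
support = np.where(margins <= 1.0 + 1e-4)[0]   # points with a tight margin constraint
print(w, b, margins, support)
```

Only the two points closest to the boundary, $(2, 2)$ and $(0, 0)$ , end up with a tight margin constraint. The other two have margin 2 and could be removed without changing $w$ or $b$ .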

Fairness in Machine Learning

DfFairness-Constrained Optimization

Many ML applications require fairness constraints. For example, in lending or hiring, the model must satisfy demographic parity, equalized odds, or calibration across groups. These are naturally expressed as constrained optimization problems:

\min_{\theta} \mathcal{L}(\theta) \quad \text{subject to} \quad |P(\hat{Y}=1 | A=0) - P(\hat{Y}=1 | A=1)| \leq \epsilon

where $A$ is the protected attribute (e.g., race, gender) and $\epsilon$ is the maximum allowed disparity.
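The left-hand side of this constraint is straightforward to estimate from predictions. The sketch below computes the demographic parity gap on a small made-up sample and compares it with a threshold; the function name `demographic_parity_gap`, the data, and $\epsilon = 0.1$ are assumptions for illustration.

```python
import numpy as np

def demographic_parity_gap(y_pred, a):
    """Absolute difference in positive-prediction rates between groups a=0 and a=1."""
    y_pred = np.asarray(y_pred, dtype=float)
    a = np.asarray(a)
    rate_0 = y_pred[a == 0].mean()   # estimate of P(Y_hat = 1 | A = 0)
    rate_1 = y_pred[a == 1].mean()   # estimate of P(Y_hat = 1 | A = 1)
    return abs(rate_0 - rate_1)

# Hypothetical binary predictions and group memberships.
y_pred = [1, 0, 1, 1, 0, 0, 1, 0]
a = [0, 0, 0, 0, 1, 1, 1, 1]

gap = demographic_parity_gap(y_pred, a)
epsilon = 0.1
print(gap, gap <= epsilon)
```

Here group 0 receives positive predictions at a rate of 0.75 and group 1 at 0.25, so the gap of 0.5 violates the constraint for $\epsilon = 0.1$ .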

ℹ️ Fairness Trade-offs

Enforcing fairness constraints typically reduces accuracy. The Pareto frontier between accuracy and fairness is explored by varying the constraint threshold $\epsilon$ . There is no universally accepted definition of fairness — different definitions (demographic parity, equalized odds, calibration) can be mutually incompatible, as shown by Chouldechova (2017) and Kleinberg et al. (2016).

Other ML Applications

DfConstrained Optimization in Neural Networks

Constrained optimization appears throughout deep learning:

Regularization: Weight decay ( $L_2$ penalty) can be viewed as a soft constraint on parameter magnitude.
Projection methods: After each gradient step, project weights onto a feasible set (e.g., simplex for attention weights, norm balls for robustness).
Meta-learning: bilevel constrained optimization where the inner loop optimizes task-specific parameters subject to constraints.
Reinforcement learning: Policy optimization with constraints on safety, energy, or resource usage (constrained MDPs).
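As an illustration of the projection step, the sketch below implements the well-known sort-based Euclidean projection onto the probability simplex $\{ w : w \geq 0, \sum_i w_i = 1 \}$ . The function name `project_simplex` and the sample vectors are assumptions, not part of the original text.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1}."""
    v = np.asarray(v, dtype=float)
    u = np.sort(v)[::-1]                        # entries in decreasing order
    css = np.cumsum(u)
    k = np.arange(1, v.size + 1)
    # Largest index whose entry stays positive after the uniform shift.
    rho = np.nonzero(u * k > css - 1.0)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

p1 = project_simplex([2.0, 0.0])        # mass moves to the first coordinate
p2 = project_simplex([0.2, 0.2, 0.2])   # uniform shift up to sum to one
print(p1, p2)
```

In projected gradient descent this function would be applied after every gradient step, so the iterates never leave the feasible set.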

Common Mistakes

Mistake	Why It's Wrong	Correct Approach
Ignoring constraint qualifications	KKT conditions may not hold without a valid CQ	Always verify LICQ, MFCQ, or Slater's condition
Forgetting $\lambda_i \geq 0$ for inequalities	Negative multipliers violate dual feasibility	Check sign of multipliers at the solution
Assuming all constraints are active	Some constraints are often inactive at the optimum and have zero multipliers	Use complementary slackness to identify active constraints
Using penalty methods with fixed $\rho$	A single $\rho$ either gives infeasible or ill-conditioned solutions	Use an increasing sequence of $\rho$ values
Confusing $\geq$ and $\leq$ forms	The sign convention affects the KKT sign conditions	Standardize: write all inequalities as $g_i(x) \leq 0$
Not checking feasibility first	Optimizing over an empty feasible set wastes computation	Verify the feasible region is non-empty before solving
Ignoring convexity	Non-convex problems may have local minima that are not global	Check if the problem is convex; if not, use global optimization
Treating equality constraints as inequalities	Equality multipliers $\mu_j$ are unrestricted in sign, while inequality multipliers must satisfy $\lambda_i \geq 0$	Keep equality and inequality constraints separate in the KKT system

Interview Questions

Q1: What is the difference between a binding and non-binding constraint? A binding constraint is satisfied with equality at the optimum ( $g_i(x^*) = 0$ ). A non-binding constraint is strictly satisfied ( $g_i(x^*) < 0$ ) and does not affect the optimal solution. Only binding constraints have non-zero Lagrange multipliers.

Q2: Why do we need constraint qualifications for KKT conditions? Constraint qualifications ensure that the feasible region is "well-behaved" near the optimum. Without them, the KKT conditions may fail to hold even at a local minimum. For example, if the active constraint gradients are linearly dependent, the KKT gradient equation may have no solution for the multipliers.

Q3: How do penalty methods differ from augmented Lagrangian methods? Penalty methods add $\rho \sum h_j(x)^2$ and require $\rho \to \infty$ for exactness, causing ill-conditioning. Augmented Lagrangian methods add both $\mu_j h_j(x)$ and $\frac{\rho}{2} h_j(x)^2$ , updating $\mu_j$ at each step. This avoids the need for $\rho \to \infty$ and maintains better conditioning.

Q4: When should you use interior point vs. active set methods? Interior point methods are preferred for large-scale problems (thousands of variables and constraints) because they have polynomial complexity. Active set methods are faster for small to medium QPs, especially when the active set changes little between iterations. Classical barrier methods also require a strictly feasible start, which can be hard to find, although infeasible-start primal-dual variants avoid this.

Q5: Explain complementary slackness in plain language. Complementary slackness says that at the optimum, each constraint is either "tight" (active, $g_i(x^*) = 0$ ) or "irrelevant" (its multiplier is zero, $\lambda_i = 0$ ). A constraint cannot be simultaneously loose and influential — if there's slack, the constraint doesn't matter.

Q6: How do SVMs use KKT conditions? The KKT conditions of the SVM dual problem yield support vectors. Complementary slackness ensures that only points on the margin or misclassified points ( $\lambda_i > 0$ ) contribute to the decision boundary. All other points ( $\lambda_i = 0$ ) can be discarded, making SVMs memory-efficient.

Q7: What happens if a constraint qualification fails? If LICQ fails (active constraint gradients are linearly dependent), the KKT multipliers may not be unique, and if no weaker qualification such as MFCQ holds either, they may not exist at all. In that case the KKT conditions are no longer guaranteed to be necessary for optimality. You must use alternative methods or reformulate the problem.

Q8: Can you convert an equality constraint to two inequality constraints? Yes: $h(x) = 0$ is equivalent to $h(x) \leq 0$ and $-h(x) \leq 0$ . However, this doubles the number of constraints and violates LICQ and MFCQ at every feasible point: both constraints are always active, their gradients $\nabla h$ and $-\nabla h$ are linearly dependent, and no direction $d$ can make both $\nabla h^T d < 0$ and $-\nabla h^T d < 0$ .

Practice Problems

📝Problem 1: Budget-Constrained Utility Maximization

A consumer maximizes $U(x, y) = xy$ subject to the budget constraint $2x + 3y = 12$ where $x, y \geq 0$ . Find the optimal consumption bundle.

💡Solution

Set up the Lagrangian: $\mathcal{L} = xy + \mu(12 - 2x - 3y)$ .

KKT conditions:

$\frac{\partial \mathcal{L}}{\partial x} = y - 2\mu = 0 \implies y = 2\mu$
$\frac{\partial \mathcal{L}}{\partial y} = x - 3\mu = 0 \implies x = 3\mu$
$2x + 3y = 12$

Substituting: $2(3\mu) + 3(2\mu) = 12 \implies 12\mu = 12 \implies \mu = 1$ .

Optimal bundle: $x^* = 3$ , $y^* = 2$ , $U^* = 6$ .

The multiplier $\mu = 1$ means an extra dollar of budget increases utility by approximately 1 unit.

📝Problem 2: KKT with Two Inequality Constraints

Minimize $f(x) = x_1^2 + x_2^2$ subject to $x_1 + x_2 \leq 1$ and $x_1 \leq 0.5$ . Find the optimal solution and interpret the multipliers.

💡Solution

Lagrangian: $\mathcal{L} = x_1^2 + x_2^2 + \lambda_1(x_1 + x_2 - 1) + \lambda_2(x_1 - 0.5)$ .

KKT conditions:

$\frac{\partial \mathcal{L}}{\partial x_1} = 2x_1 + \lambda_1 + \lambda_2 = 0$
$\frac{\partial \mathcal{L}}{\partial x_2} = 2x_2 + \lambda_1 = 0$
$\lambda_1(x_1 + x_2 - 1) = 0$ , $\lambda_2(x_1 - 0.5) = 0$
$\lambda_1, \lambda_2 \geq 0$ , $x_1 + x_2 \leq 1$ , $x_1 \leq 0.5$

Case: Both active ( $x_1 = 0.5$ , $x_1 + x_2 = 1 \implies x_2 = 0.5$ ): $2(0.5) + \lambda_1 + \lambda_2 = 0 \implies \lambda_1 + \lambda_2 = -1$ . Impossible since $\lambda_i \geq 0$ .

Case: Only first active ( $x_1 + x_2 = 1$ , $\lambda_2 = 0$ ): $2x_2 + \lambda_1 = 0$ , $2x_1 + \lambda_1 = 0 \implies x_1 = x_2$ . $x_1 + x_1 = 1 \implies x_1 = x_2 = 0.5$ , $\lambda_1 = -1$ . Impossible.

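Case: Only second active ( $x_1 = 0.5$ , $\lambda_1 = 0$ ): $2x_2 = 0 \implies x_2 = 0$ , and $2(0.5) + \lambda_2 = 0 \implies \lambda_2 = -1$ . Impossible.

Case: Neither active ( $\lambda_1 = \lambda_2 = 0$ ): $x_1 = x_2 = 0$ . Both constraints satisfied. Optimal.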

Optimal solution: $x^* = (0, 0)$ , $f^* = 0$ , $\lambda_1^* = \lambda_2^* = 0$ .

Both constraints are inactive — the unconstrained minimum is already feasible.

📝Problem 3: Penalty Method by Hand

Solve $\min x$ subject to $x \geq 1$ using the quadratic penalty method. Start with $\rho = 1$ and do two iterations.

💡Solution

Rewrite as $\min x$ subject to $1 - x \leq 0$ .

Penalty formulation: $\min x + \rho [\max(0, 1-x)]^2$ .

Iteration 1 ( $\rho = 1$ ): The objective $x$ is unbounded below without the constraint, so the penalized minimizer lies in the infeasible region $x < 1$ , where the penalty equals $(1-x)^2$ : $\min x + (1-x)^2 \implies 1 - 2(1-x) = 0 \implies x = 0.5$ . Objective: $0.5 + 0.25 = 0.75$ .

Iteration 2 ( $\rho = 10$ ): $\min x + 10(1-x)^2 \implies 1 - 20(1-x) = 0 \implies x = 0.95$ . Objective: $0.95 + 0.025 = 0.975$ .

As $\rho \to \infty$ , $x \to 1$ (the true optimum). Each iteration brings us closer but with diminishing returns and increasing ill-conditioning.

📝Problem 4: SVM Support Vectors

Given a linearly separable dataset with two classes, explain how the KKT conditions identify support vectors and why this leads to a sparse solution.

💡Solution

The SVM dual problem maximizes $\sum \lambda_i - \frac{1}{2} \sum_{i,j} \lambda_i \lambda_j y_i y_j x_i^T x_j$ subject to $0 \leq \lambda_i \leq C$ and $\sum \lambda_i y_i = 0$ .

By complementary slackness: $\lambda_i (y_i(w^T x_i + b) - 1 + \xi_i) = 0$ .

If $\lambda_i = 0$ : point is correctly classified with functional margin $y_i(w^T x_i + b) \geq 1$ . Not needed for the model.
If $0 < \lambda_i < C$ : point is exactly on the margin ( $y_i(w^T x_i + b) = 1$ ). This is a free support vector.
If $\lambda_i = C$ : point is on the wrong side of the margin or misclassified. This is a bounded support vector.

The decision boundary depends only on points with $\lambda_i > 0$ (support vectors). Typically, only a small fraction of training points are support vectors, leading to a sparse, efficient model.

📝Problem 5: Portfolio Optimization

Formulate and solve the following portfolio problem: minimize risk $\frac{1}{2} w^T \Sigma w$ subject to expected return $w^T \mu \geq 0.08$ and $w^T \mathbf{1} = 1$ (fully invested), where $\Sigma$ is the covariance matrix and $\mu$ is the expected return vector.

💡Solution

Lagrangian: $\mathcal{L} = \frac{1}{2} w^T \Sigma w - \lambda(w^T \mu - 0.08) + \gamma(w^T \mathbf{1} - 1)$ .

KKT conditions:

$\Sigma w - \lambda \mu + \gamma \mathbf{1} = 0$ (stationarity)
$w^T \mu \geq 0.08$ , $w^T \mathbf{1} = 1$ (primal feasibility)
$\lambda \geq 0$ (dual feasibility)
$\lambda(w^T \mu - 0.08) = 0$ (complementary slackness)

If the return constraint is active ( $w^T \mu = 0.08$ ): solve the $2 \times 2$ system: $w = \Sigma^{-1}(\lambda \mu - \gamma \mathbf{1})$ , then find $\lambda, \gamma$ from the two equality constraints.

This is the classical Markowitz mean-variance optimization, and the multiplier $\lambda$ is the "price of return" — how much risk increases per unit increase in the return target.
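For the case where the return constraint is active, stationarity and the two equality conditions form one linear system in $(w, \lambda, \gamma)$ . The sketch below solves it for made-up inputs; the covariance matrix, the expected returns, and the variable names are assumptions chosen only so that the return target of 0.08 is binding.

```python
import numpy as np

# Hypothetical inputs: three uncorrelated assets.
Sigma = np.diag([0.04, 0.09, 0.01])
mu = np.array([0.10, 0.12, 0.05])
target = 0.08
n = mu.size
ones = np.ones(n)

# Rows: Sigma w - lam*mu + gamma*1 = 0,  mu^T w = target,  1^T w = 1.
K = np.zeros((n + 2, n + 2))
K[:n, :n] = Sigma
K[:n, n] = -mu
K[:n, n + 1] = ones
K[n, :n] = mu
K[n + 1, :n] = ones
rhs = np.concatenate([np.zeros(n), [target, 1.0]])

sol = np.linalg.solve(K, rhs)
w, lam, gamma = sol[:n], sol[n], sol[n + 1]
print(w, lam, gamma)

# Dual feasibility: the active-constraint assumption is consistent only if lam >= 0.
print(lam >= 0)
```

If the solve returned a negative $\lambda$ , the return constraint would be inactive and the problem would have to be re-solved with $\lambda = 0$ , dropping the return row. Since Problem 5 has no nonnegativity constraint on $w$ , the solution may contain short positions.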

Quick Reference

📋Key Concepts and Formulas

Standard Form:

\min f(x) \quad \text{s.t.} \quad g_i(x) \leq 0, \; h_j(x) = 0

KKT Conditions (Necessary):

Stationarity: $\nabla f + \sum \lambda_i \nabla g_i + \sum \mu_j \nabla h_j = 0$
Primal Feasibility: $g_i(x^*) \leq 0$ , $h_j(x^*) = 0$
Dual Feasibility: $\lambda_i \geq 0$
Complementary Slackness: $\lambda_i g_i(x^*) = 0$

KKT Conditions (Sufficient): If $f$ is convex, $g_i$ are convex, $h_j$ are affine, and KKT holds, then $x^*$ is a global minimum.

Lagrangian: $\mathcal{L}(x, \lambda, \mu) = f(x) + \sum \lambda_i g_i(x) + \sum \mu_j h_j(x)$

Penalty Method: $\min f(x) + \rho \sum [\max(0, g_i(x))]^2 + \rho \sum h_j(x)^2$

Augmented Lagrangian: $\mathcal{L}_A = f(x) + \sum \mu_j h_j(x) + \frac{\rho}{2} \sum h_j(x)^2$

Barrier Method: $\min f(x) - \mu \sum \ln(-g_i(x))$

Solver Selection:

Problem Type	Recommended Solver
Convex QP/SOCP/SDP	cvxpy (ECOS, SCS, MOSEK)
Small nonlinear	scipy.optimize (SLSQP)
Large-scale convex	MOSEK, Gurobi
Non-convex	Global optimizers, multi-start
ML with constraints	cvxpy, custom penalty methods

Cross-References

Previous: 064 - Gradient Descent — unconstrained optimization foundations
Previous: 061 - Linear Algebra for Optimization — matrix operations used in constrained solvers
Related: 066 - Convex Optimization — when KKT conditions become sufficient for global optimality
Related: 059 - Calculus — derivatives and gradients underlying KKT stationarity
Application: 060 - Probability & Statistics — expected values and variance in portfolio optimization
Application: 062 - Information Theory — entropy-constrained coding and rate-distortion theory
Advanced: Semi-definite programming (SDP) extends constrained optimization to matrix-valued constraints
Advanced: Bilevel optimization involves nested constrained problems (leader-follower structure)