Constrained Optimization
π‘ Why It Matters
In the real world, almost every optimization problem operates under constraints. A portfolio manager cannot invest more than the available budget. An engineer cannot exceed the maximum load a beam can support. A machine learning model must satisfy fairness regulations. Constrained optimization provides the mathematical framework for finding the best solution while respecting these limits. Without understanding constraints, theoretical optima are meaningless β they may be physically, financially, or legally impossible to achieve.
βΉοΈ Historical Context
Constrained optimization has roots in the calculus of variations (Euler, Lagrange) and was formalized in the 20th century by Karush (1939), Kuhn (1951), and Tucker (1951). Today it underpins operations research, control theory, economics, and machine learning β from training SVMs to designing fair allocation systems.
Constrained Optimization
DfConstrained Optimization Problem
A constrained optimization problem seeks to minimize (or maximize) an objective function subject to a set of constraints on the decision variables . The general form is:
where are inequality constraints and are equality constraints.
Constrained Optimization Standard Form
Here,
- =Objective function to minimize
- =Inequality constraints (must be β€ 0)
- =Equality constraints (must equal 0)
- =Decision variable vector in R^n
DfFeasible Region
The feasible region (or feasible set) is the set of all points that satisfy every constraint:
A point is feasible if it belongs to ; otherwise it is infeasible. The optimization problem is feasible if and infeasible otherwise.
DfBinding vs Non-Binding Constraints
A constraint is binding (or active) at if . It is non-binding (or inactive) if . Equality constraints are always binding by definition. Only binding constraints influence the optimal solution β inactive constraints can be removed without changing the optimum.
β οΈ Common Pitfall
Many beginners forget to verify that the feasible region is non-empty and bounded before searching for an optimum. An unbounded feasible region can lead to no finite minimum. Always check feasibility first.
Equality Constraints (Lagrange Multipliers Review)
DfEquality Constrained Problem
Consider the problem with only equality constraints:
Assume and are continuously differentiable and that the gradients are linearly independent at the solution (the regularity condition).
ThLagrange Multiplier Theorem
If is a local minimum of subject to for , and the gradients are linearly independent, then there exist unique scalars (Lagrange multipliers) such that:
Each measures the sensitivity of the optimal objective value to changes in the -th constraint. Specifically, if is relaxed to , the optimal objective changes by approximately .
Lagrangian Function (Equality Constraints)
Here,
- =Lagrangian function
- =Original objective function
- =Lagrange multiplier for equality constraint j
- =j-th equality constraint
πExample: Shortest Distance to a Line
Find the point on the line closest to the origin.
Objective: Constraint:
The Lagrangian is .
Setting partial derivatives to zero:
Substituting: , so .
Optimal point: , minimum distance: .
π‘Solution
The Lagrange multiplier means that if we relax the constraint to , the squared distance changes by approximately . Moving the line farther from the origin decreases the minimum distance (which is counterintuitive until you realize the sign convention).
Inequality Constraints
DfInequality Constrained Problem
The general inequality constrained problem is:
Inequality constraints are more complex than equality constraints because the active set (which constraints are binding) is unknown a priori. At the solution, some constraints may be inactive () and do not affect the optimum.
DfConstraint Qualification
A constraint qualification (CQ) is a condition on the constraint functions that ensures the KKT conditions are necessary for optimality. The most common are:
- Linear Independence Constraint Qualification (LICQ): The gradients of all active constraints are linearly independent at .
- Mangasarian-Fromovitz CQ (MFCQ): The gradients of active equality constraints are linearly independent, and there exists a direction such that for all active inequality constraints.
- Slater's Condition (for convex problems): There exists a strictly feasible point such that for all and for all .
π‘ Why Constraint Qualifications Matter
Without a constraint qualification, the KKT conditions may fail to hold even at a local minimum. For example, consider minimizing subject to . The only feasible point is , which is optimal. But , so the KKT gradient condition cannot be satisfied for any multiplier. The LICQ fails here because the active constraint gradient is zero.
KKT Conditions (Karush-Kuhn-Tucker)
ThKKT Conditions (Necessary)
Let be a local minimum of subject to () and (). Assume that , , and are continuously differentiable, and that a constraint qualification holds at . Then there exist multipliers and such that:
-
Stationarity:
-
Primal Feasibility: and
-
Dual Feasibility:
-
Complementary Slackness:
KKT Conditions β Stationarity
Here,
- =Gradient of the objective at the optimum
- =Lagrange multiplier for inequality constraint i (β₯ 0)
- =Gradient of inequality constraint i
- =Lagrange multiplier for equality constraint j
- =Gradient of equality constraint j
KKT Conditions β Complementary Slackness
Here,
- =Multiplier for inequality constraint i
- =Value of inequality constraint i at the optimum
βΉοΈ Interpreting Complementary Slackness
Complementary slackness means that for each inequality constraint, exactly one of two things holds: either the constraint is active (, the constraint is tight) or its multiplier is zero (, the constraint does not affect the optimum). This is powerful because it tells us which constraints actually matter at the solution.
ThKKT Conditions (Sufficient)
If is convex, are convex, are affine, and there exist multipliers and satisfying the KKT conditions at , then is a global minimum.
πExample: KKT with One Inequality Constraint
Minimize subject to .
Lagrangian:
KKT conditions:
- Stationarity:
- Primal feasibility:
- Dual feasibility:
- Complementary slackness:
Case 1: Constraint inactive (): . But violates . Infeasible.
Case 2: Constraint active (): . β
Optimal solution: , , .
π‘Solution
The multiplier indicates that if we relax the constraint to , the optimal objective decreases by approximately . This makes sense: relaxing the constraint allows to move closer to the unconstrained optimum at .
Active Set Methods
DfActive Set Method
An active set method solves a constrained optimization problem by iteratively guessing which constraints are active (binding) at the solution and solving the resulting equality-constrained subproblem. At each iteration :
- Given the current iterate and an estimated active set , solve the equality-constrained subproblem:
- If the step is nonzero, take a step with step size chosen to maintain feasibility.
- Update the active set: add constraints that become binding, remove constraints whose multipliers become negative.
- Repeat until no further improvement is possible and all multipliers are nonnegative.
π‘ Active Set in Practice
Active set methods are the basis of many quadratic programming (QP) solvers. They are efficient when the number of constraints is moderate and the active set changes slowly between iterations. However, they can be slow for large-scale problems because solving the equality-constrained subproblem at each step requires factoring a matrix that changes with the active set.
DfWorking Set
In active set methods, the working set is the current estimate of which constraints will be active at the solution. At each iteration, the algorithm solves a subproblem with the working set treated as equality constraints. Constraints may be added to or removed from the working set based on:
- Addition: A constraint becomes violated () and is added to the working set.
- Removal: A constraint's multiplier becomes negative (), indicating the solution would improve by moving away from that constraint.
Interior Point Methods
DfInterior Point Method
Interior point methods solve constrained optimization problems by tracing a path through the interior of the feasible region, rather than moving along its boundary (as active set methods do). They introduce barrier functions that prevent iterates from reaching the boundary, then gradually reduce the barrier to approach the true solution.
Logarithmic Barrier Function
Here,
- =Original objective function
- =Barrier parameter (decreases toward 0)
- =Inequality constraints (g_i(x) < 0 for interior points)
- =Equality constraints
βΉοΈ How Barrier Methods Work
The barrier term approaches as (approaching the boundary). This creates a "wall" that keeps iterates strictly feasible. As , the barrier solution converges to the KKT point of the original problem. In practice, a sequence of barrier subproblems is solved with decreasing values, using the previous solution as a warm start.
DfPrimal-Dual Interior Point Methods
Modern interior point methods work with the primal-dual KKT system directly. They reformulate the KKT conditions as a nonlinear system and solve it using Newton's method. The key advantage is that they handle both primal and dual variables simultaneously, achieving polynomial-time complexity for convex problems β a significant improvement over the exponential worst-case of active set methods.
β οΈ Interior Point Limitations
Interior point methods require a strictly feasible starting point, which can be difficult to find. They also struggle with degenerate problems (where multiple constraints are active at a vertex). For small to medium QPs, active set methods are often faster in practice.
Penalty Methods
DfPenalty Method
A penalty method converts a constrained optimization problem into a sequence of unconstrained problems by adding a penalty term to the objective that penalizes constraint violations. The idea is simple: instead of enforcing constraints exactly, we penalize deviations from feasibility and drive the penalty toward infinity.
Quadratic Penalty Method
Here,
- =Original objective function
- =Penalty parameter (increases toward infinity)
- =Inequality constraints
- =Positive part: zero if feasible, positive if violated
Quadratic Penalty for Equality Constraints
Here,
- =Original objective function
- =Penalty parameter
- =Equality constraint j
βΉοΈ Penalty Parameter Schedule
In practice, we solve a sequence of subproblems with increasing values: . At each step, we use the previous solution as a warm start. As , the penalty solution converges to the constrained optimum. However, large makes the subproblem ill-conditioned (the Hessian becomes nearly singular), so adaptive strategies are needed to balance accuracy and numerical stability.
DfAugmented Lagrangian Method
The augmented Lagrangian method improves on plain penalty methods by incorporating Lagrange multipliers directly. For equality constraints:
At each iteration, the multipliers are updated as . This avoids the ill-conditioning of pure penalty methods because the multipliers absorb the bias that would otherwise require .
π‘ When to Use Penalty Methods
Penalty methods are simple to implement and work well when high accuracy is not required. They are commonly used in training neural networks with constraints (e.g., fairness constraints) where approximate feasibility is acceptable. For high-precision solutions, use interior point methods or Augmented Lagrangian methods.
Python Implementation
scipy.optimize
from scipy.optimize import minimize, NonlinearConstraint, LinearConstraint
import numpy as np
# Example 1: Equality constraint
# Minimize x^2 + y^2 subject to x + y = 1
def objective(x):
return x[0]**2 + x[1]**2
def eq_constraint(x):
return x[0] + x[1] - 1
constraints = [{'type': 'eq', 'fun': eq_constraint}]
result = minimize(objective, x0=[0, 0], constraints=constraints, method='SLSQP')
print(f"Optimal point: {result.x}") # [0.5, 0.5]
print(f"Optimal value: {result.fun}") # 0.5
print(f"Multiplier: {result.v}") # Lagrange multiplier
# Example 2: Inequality constraint
# Minimize (x-2)^2 + (y-1)^2 subject to x + y <= 1
def ineq_constraint(x):
return 1 - x[0] - x[1] # g(x) >= 0 for scipy (>= form)
result2 = minimize(objective, x0=[0, 0], constraints={'type': 'ineq', 'fun': ineq_constraint})
print(f"Optimal point: {result2.x}") # [0.6667, 0.3333]
# Example 3: Multiple constraints with bounds
from scipy.optimize import minimize
bounds = [(0, None), (0, None)] # x >= 0, y >= 0
constraints_multi = [
{'type': 'eq', 'fun': lambda x: x[0] + x[1] - 1},
{'type': 'ineq', 'fun': lambda x: 1 - x[0]**2 - x[1]**2}
]
result3 = minimize(objective, x0=[0.5, 0.5], constraints=constraints_multi, bounds=bounds)
print(f"Optimal point: {result3.x}")
cvxpy
import cvxpy as cp
# Example 1: Simple constrained problem
x = cp.Variable(2)
objective = cp.Minimize(cp.sum_squares(x))
constraints = [cp.sum(x) == 1]
prob = cp.Problem(objective, constraints)
prob.solve()
print(f"Optimal value: {prob.value}") # 0.5
print(f"Optimal x: {x.value}") # [0.5, 0.5]
print(f"Shadow price: {constraints[0].dual_value}") # Lagrange multiplier
# Example 2: SVM-like problem
n_samples, n_features = 100, 5
np.random.seed(42)
X = np.random.randn(n_samples, n_features)
y = np.sign(np.random.randn(n_samples))
w = cp.Variable(n_features)
b = cp.Variable()
loss = cp.sum(cp.pos(1 - cp.multiply(y, X @ w + b)))
reg = 0.01 * cp.sum_squares(w)
prob = cp.Problem(cp.Minimize(reg + loss))
prob.solve()
print(f"SVM accuracy: {np.mean(np.sign(X @ w.value + b.value) == y):.2%}")
# Example 3: Portfolio optimization
n_assets = 5
mu = np.random.randn(n_assets) * 0.05 + 0.1 # expected returns
Sigma = np.random.randn(n_assets, n_assets) * 0.1
Sigma = Sigma @ Sigma.T + np.eye(n_assets) * 0.01 # covariance
w = cp.Variable(n_assets)
ret = mu @ w
risk = cp.quad_form(w, Sigma)
constraints = [cp.sum(w) == 1, w >= 0] # fully invested, no shorting
prob = cp.Problem(cp.Minimize(risk - 0.5 * ret), constraints) # risk-averse
prob.solve()
print(f"Portfolio weights: {np.round(w.value, 3)}")
π‘ Choosing a Solver
- scipy.optimize.minimize: Good for small, general nonlinear problems. Methods: SLSQP (supports constraints), L-BFGS-B (bounds only), trust-constr.
- cvxpy: Best for convex problems (linear, quadratic, SOCP, SDP). Automatically selects an appropriate solver (ECOS, SCS, MOSEK).
- For large-scale: Use interior point methods (MOSEK, Gurobi) or first-order methods (ADMM, proximal gradient).
Applications in AI/ML
Support Vector Machines (SVM)
DfSVM as Constrained Optimization
The primal SVM problem is a constrained optimization problem:
subject to:
The objective balances margin maximization () with classification error (). The KKT conditions of this problem yield the support vectors β the data points where β which define the decision boundary.
πSVM KKT Interpretation
For an SVM, complementary slackness tells us:
- If : the point is correctly classified and not a support vector.
- If : the point lies on the margin or is misclassified β it is a support vector.
This sparsity is what makes SVMs memory-efficient: only support vectors are needed for prediction.
Fairness in Machine Learning
DfFairness-Constrained Optimization
Many ML applications require fairness constraints. For example, in lending or hiring, the model must satisfy demographic parity, equalized odds, or calibration across groups. These are naturally expressed as constrained optimization problems:
where is the protected attribute (e.g., race, gender) and is the maximum allowed disparity.
βΉοΈ Fairness Trade-offs
Enforcing fairness constraints typically reduces accuracy. The Pareto frontier between accuracy and fairness is explored by varying the constraint threshold . There is no universally accepted definition of fairness β different definitions (demographic parity, equalized odds, calibration) can be mutually incompatible, as shown by Chouldechova (2017) and Kleinberg et al. (2016).
Other ML Applications
DfConstrained Optimization in Neural Networks
Constrained optimization appears throughout deep learning:
- Regularization: Weight decay ( penalty) can be viewed as a soft constraint on parameter magnitude.
- Projection methods: After each gradient step, project weights onto a feasible set (e.g., simplex for attention weights, norm balls for robustness).
- Meta-learning: bilevel constrained optimization where the inner loop optimizes task-specific parameters subject to constraints.
- Reinforcement learning: Policy optimization with constraints on safety, energy, or resource usage (constrained MDPs).
Common Mistakes
| Mistake | Why It's Wrong | Correct Approach |
|---|---|---|
| Ignoring constraint qualifications | KKT conditions may not hold without a valid CQ | Always verify LICQ, MFCQ, or Slater's condition |
| Forgetting for inequalities | Negative multipliers violate dual feasibility | Check sign of multipliers at the solution |
| Assuming all constraints are active | Most constraints are typically inactive at the optimum | Use complementary slackness to identify active constraints |
| Using penalty methods with fixed | A single either gives infeasible or ill-conditioned solutions | Use an increasing sequence of values |
| Confusing and forms | The sign convention affects the KKT sign conditions | Standardize: write all inequalities as |
| Not checking feasibility first | Optimizing over an empty feasible set wastes computation | Verify the feasible region is non-empty before solving |
| Ignoring convexity | Non-convex problems may have local minima that are not global | Check if the problem is convex; if not, use global optimization |
| Treating equality constraints as inequalities | Equality constraints require different multiplier sign conventions | Keep equality and inequality constraints separate in the KKT system |
Interview Questions
Q1: What is the difference between a binding and non-binding constraint? A binding constraint is satisfied with equality at the optimum (). A non-binding constraint is strictly satisfied () and does not affect the optimal solution. Only binding constraints have non-zero Lagrange multipliers.
Q2: Why do we need constraint qualifications for KKT conditions? Constraint qualifications ensure that the feasible region is "well-behaved" near the optimum. Without them, the KKT conditions may fail to hold even at a local minimum. For example, if the active constraint gradients are linearly dependent, the KKT gradient equation may have no solution for the multipliers.
Q3: How do penalty methods differ from augmented Lagrangian methods? Penalty methods add and require for exactness, causing ill-conditioning. Augmented Lagrangian methods add both and , updating at each step. This avoids the need for and maintains better conditioning.
Q4: When should you use interior point vs. active set methods? Interior point methods are preferred for large-scale problems (thousands of variables and constraints) because they have polynomial complexity. Active set methods are faster for small to medium QPs, especially when the active set changes little between iterations. Interior point methods also require a strictly feasible start, which can be hard to find.
Q5: Explain complementary slackness in plain language. Complementary slackness says that at the optimum, each constraint is either "tight" (active, ) or "irrelevant" (its multiplier is zero, ). A constraint cannot be simultaneously loose and influential β if there's slack, the constraint doesn't matter.
Q6: How do SVMs use KKT conditions? The KKT conditions of the SVM dual problem yield support vectors. Complementary slackness ensures that only points on the margin or misclassified points () contribute to the decision boundary. All other points () can be discarded, making SVMs memory-efficient.
Q7: What happens if a constraint qualification fails? If LICQ fails (active constraint gradients are linearly dependent), the KKT multipliers may not exist or may not be unique. The KKT conditions are no longer necessary for optimality. You must use alternative methods or reformulate the problem.
Q8: Can you convert an equality constraint to two inequality constraints? Yes: is equivalent to and . However, this doubles the number of constraints and may violate constraint qualifications at the solution (since both constraints are active with linearly dependent gradients).
Practice Problems
πProblem 1: Budget-Constrained Utility Maximization
A consumer maximizes subject to the budget constraint where . Find the optimal consumption bundle.
π‘Solution
Set up the Lagrangian: .
KKT conditions:
Substituting: .
Optimal bundle: , , .
The multiplier means an extra dollar of budget increases utility by approximately 1 unit.
πProblem 2: KKT with Two Inequality Constraints
Minimize subject to and . Find the optimal solution and interpret the multipliers.
π‘Solution
Lagrangian: .
KKT conditions:
- ,
- , ,
Case: Both active (, ): . Impossible since .
Case: Only first active (, ): , . , . Impossible.
Case: Neither active (): . Both constraints satisfied. Optimal.
Optimal solution: , , .
Both constraints are inactive β the unconstrained minimum is already feasible.
πProblem 3: Penalty Method by Hand
Solve subject to using the quadratic penalty method. Start with and do two iterations.
π‘Solution
Rewrite as subject to .
Penalty formulation: .
Iteration 1 (): Since the unconstrained minimum () is infeasible, consider the region where : . Objective: .
Iteration 2 (): . Objective: .
As , (the true optimum). Each iteration brings us closer but with diminishing returns and increasing ill-conditioning.
πProblem 4: SVM Support Vectors
Given a linearly separable dataset with two classes, explain how the KKT conditions identify support vectors and why this leads to a sparse solution.
π‘Solution
The SVM dual problem maximizes subject to and .
By complementary slackness: .
- If : point is correctly classified with margin > 1. Not needed for the model.
- If : point is exactly on the margin (). This is a free support vector.
- If : point is on the wrong side of the margin or misclassified. This is a bounded support vector.
The decision boundary depends only on points with (support vectors). Typically, only a small fraction of training points are support vectors, leading to a sparse, efficient model.
πProblem 5: Portfolio Optimization
Formulate and solve the following portfolio problem: minimize risk subject to expected return and (fully invested), where is the covariance matrix and is the expected return vector.
π‘Solution
Lagrangian: .
KKT conditions:
- (stationarity)
- , (primal feasibility)
- (dual feasibility)
- (complementary slackness)
If the return constraint is active (): solve the system: , then find from the two equality constraints.
This is the classical Markowitz mean-variance optimization, and the multiplier is the "price of return" β how much risk increases per unit increase in the return target.
Quick Reference
πKey Concepts and Formulas
Standard Form:
KKT Conditions (Necessary):
- Stationarity:
- Primal Feasibility: ,
- Dual Feasibility:
- Complementary Slackness:
KKT Conditions (Sufficient): If is convex, are convex, are affine, and KKT holds, then is a global minimum.
Lagrangian:
Penalty Method:
Augmented Lagrangian:
Barrier Method:
Solver Selection:
| Problem Type | Recommended Solver |
|---|---|
| Convex QP/SOCP/SDP | cvxpy (ECOS, SCS, MOSEK) |
| Small nonlinear | scipy.optimize (SLSQP) |
| Large-scale convex | MOSEK, Gurobi |
| Non-convex | Global optimizers, multi-start |
| ML with constraints | cvxpy, custom penalty methods |
Cross-References
- Previous: 064 - Gradient Descent β unconstrained optimization foundations
- Previous: 061 - Linear Algebra for Optimization β matrix operations used in constrained solvers
- Related: 066 - Convex Optimization β when KKT conditions become sufficient for global optimality
- Related: 059 - Calculus β derivatives and gradients underlying KKT stationarity
- Application: 060 - Probability & Statistics β expected values and variance in portfolio optimization
- Application: 062 - Information Theory β entropy-constrained coding and rate-distortion theory
- Advanced: Semi-definite programming (SDP) extends constrained optimization to matrix-valued constraints
- Advanced: Bilevel optimization involves nested constrained problems (leader-follower structure)