← Math|67 of 100
Optimization

Quadratic Programming

Master quadratic programming and its applications in SVMs, portfolio optimization, and machine learning.

📂 Quadratic📖 Lesson 67 of 100🎓 Free Course

Advertisement

Quadratic Programming

💡 Why It Matters

Quadratic programming (QP) is the bridge between linear programming and general nonlinear optimization. Nearly every machine learning model with a quadratic loss function—linear regression, SVMs, logistic regression regularization, portfolio optimization, and model predictive control—reduces to solving a QP. Mastering QP gives you the foundation for understanding how convex optimization solvers work under the hood.

ℹ️ Core Insight

A QP generalizes linear programming by allowing a quadratic objective while keeping constraints linear. If the quadratic term is positive semi-definite, the problem is convex and any local minimum is a global minimum.


Quadratic Programming: Standard Form

DfQuadratic Program (Standard Form)

Quadratic Programming — Standard Form

minxRn12xTQx+cTxs.t.Axb,Aeqx=beq\min_{x \in \mathbb{R}^n} \quad \frac{1}{2} x^T Q x + c^T x \quad \text{s.t.} \quad Ax \leq b, \quad A_{eq} x = b_{eq}

Here,

  • QQ=Symmetric n×n matrix defining the quadratic curvature
  • cc=Linear cost vector in ℝⁿ
  • xx=Decision variable vector
  • AA=Inequality constraint matrix (m×n)
  • bb=Inequality constraint right-hand side
  • AeqA_{eq}=Equality constraint matrix (p×n)
  • beqb_{eq}=Equality constraint right-hand side

⚠️ Convention

The factor ½ in front of xᵀQx is conventional—it simplifies the gradient to Qx + c rather than 2Qx + c. Always check whether a formulation uses this convention or not.


Positive Definite Hessian

ThConvexity of QP via Positive Definite Hessian

Hessian of the QP Objective

2f(x)=Qandf(x)=Qx+c\nabla^2 f(x) = Q \quad \text{and} \quad \nabla f(x) = Qx + c

Here,

  • QQ=The Hessian matrix — constant for QP, unlike general nonlinear problems
  • cc=Linear term determines the gradient at x = 0

💡 Why This Matters

Because the Hessian is constant, second-order methods (Newton's method) converge in a single step for unconstrained QPs. For constrained QPs, this structure enables efficient interior-point and active-set algorithms.


Active Set Methods

DfActive Set Method

Active Set Iteration

xk+1=xk+αkdk,where dk solves the QP restricted to the guessed active setx_{k+1} = x_k + \alpha_k d_k, \quad \text{where } d_k \text{ solves the QP restricted to the guessed active set}

Here,

  • dkd_k=Search direction from solving the equality-constrained QP with current active set
  • αk\alpha_k=Step length, typically 1 (full step) or limited by blocking constraints

ℹ️ Complexity

Active set methods are efficient for small to medium QPs (n < 10,000). The number of iterations is bounded by the number of constraints combinatorially, but in practice far fewer iterations are needed.


Interior Point Methods

Interior Point (Barrier) Formulation

minx,s12xTQx+cTxμi=1mln(si)s.t.Ax+s=b,s>0\min_{x,s} \quad \frac{1}{2}x^TQx + c^Tx - \mu \sum_{i=1}^{m} \ln(s_i) \quad \text{s.t.} \quad Ax + s = b, \quad s > 0

Here,

  • μ\mu=Barrier parameter — drives log-barrier term; decreased toward 0 in the central path
  • ss=Slack variables converting inequalities to equalities
  • ln(si)\ln(s_i)=Log-barrier enforces sᵢ > 0; approaches boundary as μ → 0

Central Path and Newton Step

[QATAdiag(S)1E][ΔxΔs]=[Qx+c+ATλAx+sbSΛeμe]\begin{bmatrix} Q & A^T \\ A & -\text{diag}(S)^{-1}E \end{bmatrix} \begin{bmatrix} \Delta x \\ \Delta s \end{bmatrix} = -\begin{bmatrix} Qx + c + A^T\lambda \\ Ax + s - b \\ S\Lambda e - \mu e \end{bmatrix}

Here,

  • SS=Diagonal matrix of slack variables s
  • Λ\Lambda=Diagonal matrix of Lagrange multipliers λ
  • ee=Vector of all ones
  • μ\mu=Barrier parameter (decreased each iteration)

💡 Practical Performance

Interior point methods solve QPs in O(n³·L) time where L is a logarithmic complexity measure. Modern implementations (MOSEK, Gurobi) are highly optimized and handle QPs with millions of variables.


Wolfe Dual

Lagrangian and Wolfe Dual of QP

L(x,λ,ν)=12xTQx+cTx+λT(Axb)+νT(Aeqxbeq)L(x, \lambda, \nu) = \frac{1}{2}x^TQx + c^Tx + \lambda^T(Ax - b) + \nu^T(A_{eq}x - b_{eq})

Here,

  • λ\lambda=Lagrange multipliers for inequality constraints (λ ≥ 0)
  • ν\nu=Lagrange multipliers for equality constraints (unrestricted)

Wolfe Dual Problem

maxλ0,ν12zTQ1zbTλbeqTνs.t.Qx+c+ATλ+AeqTν=0z=(c+ATλ+AeqTν)\max_{\lambda \geq 0, \nu} \quad -\frac{1}{2}z^TQ^{-1}z - b^T\lambda - b_{eq}^T\nu \quad \text{s.t.} \quad Qx + c + A^T\lambda + A_{eq}^T\nu = 0 \quad \Rightarrow \quad z = -(c + A^T\lambda + A_{eq}^T\nu)

Here,

  • zz=Substitution variable: z = Qx + c + Aᵀλ + A_{eq}ᵀν = 0 at optimality
  • Q1Q^{-1}=Inverse of Q — requires Q to be non-singular (positive definite)

ℹ️ Dual Insight

The Wolfe dual converts the QP into another QP in dual variables. Strong duality holds when Q ≻ 0 and constraints satisfy Slater's condition, meaning the primal and dual optima are equal.


Ridge Regression as QP

Ridge Regression — QP Formulation

minw12Xwy2+λ2w2=12wT(XTX+λI)w(XTy)Tw+12yTy\min_{w} \quad \frac{1}{2}\|Xw - y\|^2 + \frac{\lambda}{2}\|w\|^2 = \frac{1}{2}w^T(X^TX + \lambda I)w - (X^Ty)^Tw + \frac{1}{2}y^Ty

Here,

  • XX=n×d feature matrix
  • yy=n×1 target vector
  • λ\lambda=Regularization strength (controls bias-variance tradeoff)
  • ww=Weight vector to optimize

Ridge Regression — Closed-Form Solution

w=(XTX+λI)1XTyw^* = (X^TX + \lambda I)^{-1} X^Ty

Here,

  • XTX+λIX^TX + \lambda I=Modified Gram matrix — always invertible for λ > 0
  • λI\lambda I=Regularization term ensures positive definiteness even if XᵀX is singular

💡 QP Structure

Ridge regression is a QP with Q = XᵀX + λI (positive definite), c = -Xᵀy, and no constraints. The regularization term λI shifts the Hessian eigenvalues away from zero, making the problem strictly convex and numerically stable.


SVM as QP

SVM Primal (Hard Margin)

minw,b12w2s.t.yi(wTxi+b)1,i=1,,n\min_{w,b} \quad \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i(w^Tx_i + b) \geq 1, \quad i = 1,\ldots,n

Here,

  • ww=Weight vector — ||w||⁻¹ is the margin width
  • bb=Bias / decision threshold
  • yiy_i=Class label: +1 or -1
  • xix_i=Feature vector of sample i

SVM Primal (Soft Margin)

minw,b,ξ12w2+Ci=1nξis.t.yi(wTxi+b)1ξi,ξi0\min_{w,b,\xi} \quad \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}\xi_i \quad \text{s.t.} \quad y_i(w^Tx_i + b) \geq 1 - \xi_i, \quad \xi_i \geq 0

Here,

  • CC=Penalty parameter — trades off margin size vs. misclassification
  • ξi\xi_i=Slack variable for sample i (allows soft-margin violations)

SVM Dual

maxαi=1nαi12i,jαiαjyiyjxiTxjs.t.0αiC,iαiyi=0\max_{\alpha} \quad \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i,j}\alpha_i\alpha_j y_i y_j x_i^Tx_j \quad \text{s.t.} \quad 0 \leq \alpha_i \leq C, \quad \sum_{i}\alpha_i y_i = 0

Here,

  • αi\alpha_i=Dual variables — nonzero values identify support vectors
  • CC=Upper bound on each αᵢ (box constraint)
  • xiTxjx_i^Tx_j=Kernel trick replaces this with K(xᵢ, xⱼ) for nonlinear SVM

ℹ️ Key Insight

The SVM dual is a QP with Q_ij = yᵢyⱼxᵢᵀxⱼ. Only support vectors (αᵢ > 0) contribute to the decision function. This sparsity is why SVMs are memory-efficient at prediction time.


Python Implementation: cvxpy

import cvxpy as cp
import numpy as np

# Generate data for a QP: min 0.5*x'*Q*x + c'*x
np.random.seed(42)
n = 5
Q = np.random.randn(n, n)
Q = Q.T @ Q  # Make positive definite
c = np.random.randn(n)

# Decision variable
x = cp.Variable(n)

# Objective
objective = cp.Minimize(0.5 * cp.quad_form(x, Q) + c @ x)

# Constraints
A = np.random.randn(3, n)
b = np.ones(3)
constraints = [A @ x <= b, cp.sum(x) == 1]

# Solve
prob = cp.Problem(objective, constraints)
prob.solve()

print(f"Optimal value: {prob.value:.4f}")
print(f"Optimal x: {x.value}")
print(f"Status: {prob.status}")
print(f"Solver: {prob.solver_stats.solver_name}")

Python Implementation: scipy

from scipy.optimize import minimize
import numpy as np

np.random.seed(42)
n = 5
Q = np.random.randn(n, n)
Q = Q.T @ Q
c = np.random.randn(n)

# Unconstrained QP via scipy
def objective(x):
    return 0.5 * x @ Q @ x + c @ x

def gradient(x):
    return Q @ x + c

def hessian(x):
    return Q

x0 = np.zeros(n)
result = minimize(objective, x0, jac=gradient, hess=hessian, method='trust-constr')

print(f"Optimal value: {result.fun:.4f}")
print(f"Optimal x: {result.x}")
print(f"Converged: {result.success}")

# Constrained QP with bounds
from scipy.optimize import minimize
bounds = [(0, 1) for _ in range(n)]
eq_constraint = {'type': 'eq', 'fun': lambda x: np.sum(x) - 1}

result2 = minimize(objective, x0, jac=gradient, hess=hessian,
                   method='SLSQP', bounds=bounds, constraints=[eq_constraint])
print(f"Constrained optimal: {result2.fun:.4f}")

Applications in AI / Machine Learning

💡 Applications

QP appears throughout machine learning and AI — not just in SVMs but anywhere a quadratic loss is minimized subject to constraints.

📝Application: Model Predictive Control (MPC)

MPC solves a QP at each timestep to find optimal control actions over a finite horizon. The quadratic cost penalizes deviations from reference trajectories, and constraints enforce actuator limits and safety bounds. Real-time QP solvers ( qpOASES, HPIPM ) enable MPC for autonomous vehicles and robotics.

📝Application: Portfolio Optimization (Markowitz)

Markowitz Mean-Variance Optimization

minwwTΣws.t.wTμrmin,wT1=1,w0\min_w \quad w^T\Sigma w \quad \text{s.t.} \quad w^T\mu \geq r_{min}, \quad w^T\mathbf{1} = 1, \quad w \geq 0

Here,

  • ww=Portfolio weight vector
  • Σ\Sigma=Covariance matrix of asset returns
  • μ\mu=Expected return vector
  • rminr_{min}=Minimum required return

The covariance matrix Σ is positive semi-definite, making this a QP. The efficient frontier is obtained by varying r_{min}.

📝Application: Least Squares and Fitting

Any least squares problem min ‖Ax - b‖² is a QP with Q = AᵀA, c = -Aᵀb. Non-negative least squares (NNLS) adds x ≥ 0 constraints. Basis pursuit and compressed sensing add L1 terms to create QPs or second-order cone programs.


Common Mistakes

MistakeWhy It FailsCorrect Approach
Using a non-symmetric QGradient is ½(Q + Qᵀ)x, not QxSymmetrize: Q ← ½(Q + Qᵀ) before solving
Ignoring positive semi-definitenessNon-convex QP may have many local minimaVerify Q ≽ 0 via eigenvalue decomposition; use global solver if not
Confusing ½ conventionGradient off by factor of 2Always verify: ∇(½xᵀQx) = Qx, not 2Qx
Not scaling constraintsPoorly scaled A, b cause numerical issuesNormalize rows of A so ‖aᵢ‖ ≈ 1
Assuming unbounded feasible regionQP may be infeasible or unboundedCheck feasibility before solving; add bounds
Using QP for non-quadratic objectivesQP solver assumes quadratic curvatureUse nonlinear solver (IPOPT, SNOPT) for general NLP
Ignoring warm-startingRepeated QPs (MPC, online learning) waste timeUse previous solution as warm start for next solve

Interview Questions

📝Q: Why must Q be positive semi-definite for convex QP?

If Q has a negative eigenvalue λ with eigenvector v, then f(tv) = ½t²λ‖v‖² + t·cᵀv → -∞ as t → ∞ along v. The problem becomes unbounded with no finite minimum. Positive semi-definiteness ensures the quadratic bowl curves upward in every direction.

📝Q: What is the difference between active-set and interior-point methods?

Active-set methods work on the boundary of the feasible region by identifying which constraints are active. Interior-point methods stay strictly inside the feasible region and approach the boundary via a central path. Active-set is better for warm-starting; interior-point has better worst-case complexity.

📝Q: How does the kernel trick relate to QP?

The SVM dual is a QP involving dot products xᵢᵀxⱼ. The kernel trick replaces xᵢᵀxⱼ with K(xᵢ, xⱼ) = φ(xᵢ)ᵀφ(xⱼ), solving an infinite-dimensional QP without explicitly computing φ. Common kernels: RBF, polynomial, sigmoid.

📝Q: Why is ridge regression a QP with no constraints?

Ridge regression minimizes ½‖Xw - y‖² + (λ/2)‖w‖², which expands to ½wᵀ(XᵀX + λI)w - (Xᵀy)Tw + ½yᵀy. This is a QP with Q = XᵀX + λI ≻ 0 and no constraints. The positive definiteness of Q guarantees a unique global minimum.


Practice Problems

📝Problem 1: Formulate as QP

Rewrite the following as a standard QP: minimize f(x₁, x₂) = 3x₁² + 2x₁x₂ + x₂² - 4x₁ + 5x₂ subject to x₁ + x₂ ≤ 2 and x₁, x₂ ≥ 0.

💡Solution 1

Identify Q and c from the quadratic and linear terms:

f(x) = ½xᵀQx + cᵀx where Q = [[6, 2], [2, 2]], c = [-4, 5]ᵀ

Constraints in standard form: A = [[1, 1]], b = [2], with lower bounds x ≥ 0.

📝Problem 2: Solve Ridge Regression as QP

Given X = [[1, 0], [0, 1], [1, 1]] and y = [1, 2, 3]ᵀ with λ = 0.5, find the ridge regression solution using the QP formulation.

💡Solution 2

Compute Q = XᵀX + λI = [[2, 1], [1, 2]] + [[0.5, 0], [0, 0.5]] = [[2.5, 1], [1, 2.5]]

c = -Xᵀy = -[4, 4]ᵀ

Solve Qw = -c: w* = Q⁻¹(-c) = [[2.5, 1], [1, 2.5]]⁻¹ · [4, 4]ᵀ

det(Q) = 6.25 - 1 = 5.25

w* = (1/5.25)[[2.5, -1], [-1, 2.5]] · [4, 4]ᵀ = [10/5.25, 6/5.25]ᵀ ≈ [1.905, 1.143]ᵀ

📝Problem 3: Dual of a Simple QP

Derive the dual of: min ½x² s.t. x ≥ 1.

💡Solution 3

Lagrangian: L(x, λ) = ½x² - λ(x - 1) for λ ≥ 0.

∂L/∂x = 0 ⟹ x* = λ

Dual function: g(λ) = L(λ, λ) = ½λ² - λ² + λ = -½λ² + λ

Dual problem: max -½λ² + λ s.t. λ ≥ 0

Optimal: λ* = 1, x* = 1. Strong duality holds (Slater's condition satisfied).

📝Problem 4: SVM Dual Verification

For a 2D binary classification with support vectors x₁ = [1, 0], y₁ = 1 and x₂ = [0, 1], y₂ = -1, verify that α₁ = α₂ = 1 is optimal for C = 2.

💡Solution 4

Dual objective: α₁ + α₂ - ½(α₁²·1·1·1 + 2α₁α₂·(-1)·(0) + α₂²·1·1·1) = α₁ + α₂ - ½(α₁² + α₂²)

Gradient w.r.t. α₁: 1 - α₁ = 0 ⟹ α₁ = 1 Gradient w.r.t. α₂: 1 - α₂ = 0 ⟹ α₂ = 1

Both satisfy 0 ≤ αᵢ ≤ C = 2, and α₁y₁ + α₂y₂ = 1 - 1 = 0 ✓


Quick Reference

📋QP Standard Form

minx  12xTQx+cTxs.t.  Axb,  Aeqx=beq\min_x \; \frac{1}{2}x^TQx + c^Tx \quad \text{s.t.} \; Ax \leq b, \; A_{eq}x = b_{eq}

where Q is symmetric positive semi-definite (Q ≽ 0).

📋Key Properties

  • Convexity: Q ≽ 0 ⟹ convex QP; Q ≻ 0 ⟹ strictly convex, unique minimum
  • Gradient: ∇f = Qx + c; Hessian: ∇²f = Q (constant)
  • Duality gap is zero when Slater's condition holds (Q ≻ 0, feasible interior exists)
  • Time complexity: O(n³) per iteration for interior point; active set bounded by constraint count

📋Solver Selection

  • cvxpy: High-level modeling language; delegates to OSQP, ECOS, SCS
  • scipy.optimize: minimize() with method='trust-constr' or 'SLSQP' for QPs
  • OSQP: Operator Splitting QP; fast for large sparse QPs
  • Gurobi / MOSEK: Commercial solvers; best performance for large-scale QPs
  • qpOASES: Real-time QP solver for embedded MPC

📋ML / AI Connections

  • SVM training ⟹ QP (primal or dual form)
  • Ridge regression ⟹ unconstrained QP with Q = XᵀX + λI
  • Portfolio optimization ⟹ QP with covariance matrix
  • Model predictive control ⟹ repeated QP at each timestep
  • LASSO (L1) is not a QP but a second-order cone program

Cross-References

ℹ️ Related Topics

  • Convex Optimization — QP is a special case of convex optimization; see module on convex analysis
  • Linear Programming — LP is QP with Q = 0; simplex and interior-point methods generalize
  • KKT Conditions — Optimality conditions for constrained QP; see Lagrange duality
  • Newton's Method — Second-order unconstrained optimization; QP is the model subproblem
  • SVM Theory — SVM dual derivation relies on QP duality; see kernel methods module
  • Markowitz Portfolio Theory — Mean-variance optimization is a QP; see quantitative finance module
  • Model Predictive Control — Real-time QP solving; see control systems module
  • Regularization — Ridge (L2) and Lasso (L1) penalties in statistical learning
Lesson Progress67 / 100