← Math|61 of 100
Optimization

Convex Optimization

Master convex sets, convex functions, strong convexity, duality theory, and convex optimization problems with applications in machine learning.

📂 Convex📖 Lesson 61 of 100🎓 Free Course

Advertisement

Convex Optimization

💡 Why It Matters

Convex optimization is the backbone of modern machine learning. Every local minimum of a convex function is a global minimum, making these problems tractable and reliable. SVMs, linear regression, logistic regression, LASSO, ridge regression, and many neural network subproblems are all convex. Understanding convexity is the single most important concept for anyone building or analyzing optimization-based ML systems.

Convex optimization bridges pure mathematics and practical ML engineering. While non-convex problems (like training deep neural networks) dominate headlines, convex problems form the foundation: they have efficient solvers with provable guarantees, they appear as subroutines in larger algorithms, and many heuristics for non-convex problems work by solving convex relaxations. Mastering this topic means understanding why some optimization problems are "easy" and others are "hard," and having the tools to transform hard problems into tractable ones.


Convex Set

DfConvex Set

A set C⊆RnC \subseteq \mathbb{R}^n is convex if for all x,y∈Cx, y \in C and all θ∈[0,1]\theta \in [0, 1]:

θx+(1−θ)y∈C\theta x + (1 - \theta) y \in C

Geometrically, this means the line segment connecting any two points in CC lies entirely within CC. There are no "dents," indentations, or holes.

â„šī¸ Intuition

Imagine drawing a rubber band around the boundary of the set. If the rubber band touches every point on the boundary without cutting through the interior, the set is convex. A circle is convex; a crescent moon is not. The intersection of any collection of convex sets is convex, which is a powerful tool for constructing complex convex feasible regions.

ThExamples of Convex Sets

SetDescriptionConvex?
Rn\mathbb{R}^nEntire Euclidean spaceYes
Halfspace {x:aTx≤b}\{x : a^Tx \leq b\}Linear inequalityYes
Hyperplane {x:aTx=b}\{x : a^Tx = b\}Linear equalityYes
Ball {x:âˆĨx−câˆĨ2≤r}\{x : \|x - c\|_2 \leq r\}Euclidean ballYes
Nonnegative orthant R+n\mathbb{R}^n_+{x:xiâ‰Ĩ0}\{x : x_i \geq 0\}Yes
Simplex Δn\Delta_n{x:xiâ‰Ĩ0,∑xi=1}\{x : x_i \geq 0, \sum x_i = 1\}Yes
Positive semidefinite cone S+n\mathcal{S}^n_+{X:Xâǰ0}\{X : X \succeq 0\}Yes
Set with a bite taken out{x:âˆĨxâˆĨ2≤1}∖{x:âˆĨxâˆĨ2≤0.5}\{x : \|x\|_2 \leq 1\} \setminus \{x : \|x\|_2 \leq 0.5\}No

📝Problem: Is the Intersection Convex?

Is the set C={x∈R2:x12+x22≤1,  x1+x2â‰Ĩ1}C = \{x \in \mathbb{R}^2 : x_1^2 + x_2^2 \leq 1, \; x_1 + x_2 \geq 1\} convex?

💡Solution

Yes. The set CC is the intersection of a disk {x:x12+x22≤1}\{x : x_1^2 + x_2^2 \leq 1\} (convex) and a halfspace {x:x1+x2â‰Ĩ1}\{x : x_1 + x_2 \geq 1\} (convex). The intersection of convex sets is convex. Geometrically, CC is a circular segment.


Convex Function

DfConvex Function

A function f:Rn→Rf: \mathbb{R}^n \to \mathbb{R} is convex if its domain dom(f)\text{dom}(f) is a convex set and for all x,y∈dom(f)x, y \in \text{dom}(f) and θ∈[0,1]\theta \in [0, 1]:

f(θx+(1−θ)y)≤θf(x)+(1−θ)f(y)f(\theta x + (1 - \theta) y) \leq \theta f(x) + (1 - \theta) f(y)

This is the epigraph condition: the line segment connecting (x,f(x))(x, f(x)) and (y,f(y))(y, f(y)) lies on or above the graph of ff. A function is concave if −f-f is convex.

â„šī¸ Intuition

A convex function curves "upward" everywhere. If you stand on the graph and look along the curve, the ground always rises in both directions. This means there are no "dips" that create local minima distinct from the global minimum. The epigraph {(x,t):f(x)≤t}\{(x, t) : f(x) \leq t\} is a convex set — this connects convex functions to convex sets.

First-Order Condition

ThFirst-Order Optimality Condition

A differentiable function f:Rn→Rf: \mathbb{R}^n \to \mathbb{R} is convex if and only if for all x,y∈dom(f)x, y \in \text{dom}(f):

f(y)â‰Ĩf(x)+∇f(x)T(y−x)f(y) \geq f(x) + \nabla f(x)^T (y - x)

The tangent hyperplane at any point lies on or below the graph. This is the global linear lower bound characterization. Moreover, x∗x^* is a global minimum of a convex function ff if and only if:

∇f(x∗)=0\nabla f(x^*) = 0

Second-Order Condition

ThSecond-Order Condition

A twice-differentiable function f:Rn→Rf: \mathbb{R}^n \to \mathbb{R} is convex if and only if its Hessian is positive semidefinite everywhere:

∇2f(x)âǰ0∀x∈dom(f)\nabla^2 f(x) \succeq 0 \quad \forall x \in \text{dom}(f)

For strict convexity, we need ∇2f(x)â‰ģ0\nabla^2 f(x) \succ 0 (positive definite) everywhere. For functions of one variable, this reduces to f′′(x)â‰Ĩ0f''(x) \geq 0.

âš ī¸ Strict vs. Strong Convexity

Strict convexity (∇2fâ‰ģ0\nabla^2 f \succ 0) guarantees a unique minimizer but says nothing about the "flatness" near the minimum. Strong convexity (defined below) adds a curvature lower bound and gives much stronger convergence guarantees.

Common Convex and Non-Convex Functions

FunctionConvex?Notes
f(x)=cf(x) = cYesConstant functions are convex (and concave)
f(x)=x2f(x) = x^2YesStrictly convex
f(x)=exf(x) = e^xYesStrictly convex, f′′(x)=ex>0f''(x) = e^x > 0
f(x)=−log⁡(x)f(x) = -\log(x)YesStrictly convex on (0,∞)(0, \infty)
f(x)=âˆĨxâˆĨf(x) = \|x\|YesConvex but not differentiable at 0
f(x)=x4f(x) = x^4YesStrictly convex, though f′′(0)=0f''(0) = 0
f(x)=x3f(x) = x^3NoInflection point at x=0x = 0
f(x)=sin⁥(x)f(x) = \sin(x)NoConcave and convex regions
f(x)=1/xf(x) = 1/xNoOn R\mathbb{R}; convex on (0,∞)(0, \infty)
f(X)=log⁥det⁥(X)f(X) = \log\det(X)YesOn S++n\mathcal{S}^n_{++} (positive definite cone)
f(x)=âˆĨxâˆĨ22+âˆĨxâˆĨ1f(x) = \|x\|_2^2 + \|x\|_1YesSum of convex functions
f(x)=x1x2f(x) = x_1 x_2NoBilinear (saddle-shaped)

Properties of Convex Functions

ThFundamental Properties of Convex Functions

Let f,gf, g be convex functions and Îąâ‰Ĩ0\alpha \geq 0.

1. Non-negative weighted sums are convex:

h(x)=Îąf(x)+βg(x) is convex for ι,βâ‰Ĩ0h(x) = \alpha f(x) + \beta g(x) \text{ is convex for } \alpha, \beta \geq 0

2. Pointwise maximum preserves convexity:

h(x)=max⁥(f(x),g(x)) is convexh(x) = \max(f(x), g(x)) \text{ is convex}

3. Affine composition preserves convexity:

h(x)=f(Ax+b) is convexh(x) = f(Ax + b) \text{ is convex}

4. Perspective scaling preserves convexity:

h(x,t)=t⋅f(x/t) is convex on {(x,t):t>0}h(x, t) = t \cdot f(x/t) \text{ is convex on } \{(x, t) : t > 0\}

5. Partial minimization preserves convexity: If g(x,y)g(x, y) is convex in (x,y)(x, y) and CC is a convex set, then h(x)=inf⁥y∈Cg(x,y)h(x) = \inf_{y \in C} g(x, y) is convex.

6. Supremum of convex functions is convex:

f(x)=sup⁥ι∈Afι(x) is convex if each fι is convexf(x) = \sup_{\alpha \in \mathcal{A}} f_\alpha(x) \text{ is convex if each } f_\alpha \text{ is convex}

7. Restriction to a line preserves convexity: ff is convex if and only if g(t)=f(x+tv)g(t) = f(x + tv) is convex in tt for all x∈dom(f)x \in \text{dom}(f) and all directions vv.

💡 Building Complex Convex Functions

These closure properties are extraordinarily powerful. You can build complex convex functions from simple building blocks: norms, quadratics, logs, and exponentials. For example, f(x)=log⁡(ex1+ex2)f(x) = \log(e^{x_1} + e^{x_2}) is convex because it is the composition of the convex function log⁡(⋅)\log(\cdot) applied to the convex sum of exponentials (log-sum-exp). Most convex functions encountered in ML are constructed this way.


Strongly Convex Functions

DfStrongly Convex Function

A function f:Rn→Rf: \mathbb{R}^n \to \mathbb{R} is mm-strongly convex (m>0m > 0) if for all x,y∈dom(f)x, y \in \text{dom}(f) and θ∈[0,1]\theta \in [0, 1]:

f(θx+(1−θ)y)≤θf(x)+(1−θ)f(y)−m2θ(1−θ)âˆĨx−yâˆĨ2f(\theta x + (1-\theta)y) \leq \theta f(x) + (1-\theta)f(y) - \frac{m}{2}\theta(1-\theta)\|x - y\|^2

Equivalently, a differentiable ff is mm-strongly convex if and only if:

f(y)â‰Ĩf(x)+∇f(x)T(y−x)+m2âˆĨy−xâˆĨ2∀x,yf(y) \geq f(x) + \nabla f(x)^T(y - x) + \frac{m}{2}\|y - x\|^2 \quad \forall x, y

Or in terms of the Hessian: ∇2f(x)âǰmI\nabla^2 f(x) \succeq mI for all xx.

ThStrong Convexity Gives Linear Convergence

If ff is mm-strongly convex and LL-smooth (gradient is LL-Lipschitz), then gradient descent with step size Îą=1/L\alpha = 1/L satisfies:

f(xk)−f∗≤(L−mL+m)2k(f(x0)−f∗)f(x_k) - f^* \leq \left(\frac{L - m}{L + m}\right)^{2k} (f(x_0) - f^*)

The convergence rate ΁=L−mL+m<1\rho = \frac{L-m}{L+m} < 1 is linear. The condition number Îē=L/m\kappa = L/m controls the rate: smaller Îē\kappa means faster convergence.

📝Problem: Strong Convexity of Quadratic

Show that f(x)=12xTAx+bTx+cf(x) = \frac{1}{2}x^TAx + b^Tx + c with Aâ‰ģ0A \succ 0 is mm-strongly convex where m=Îģmin⁥(A)m = \lambda_{\min}(A).

💡Solution

The Hessian is ∇2f(x)=A\nabla^2 f(x) = A. Since AA is positive definite with minimum eigenvalue Îģmin⁥(A)>0\lambda_{\min}(A) > 0, we have AâǰÎģmin⁥(A)IA \succeq \lambda_{\min}(A) I. Thus ff is Îģmin⁥(A)\lambda_{\min}(A)-strongly convex. This gives linear convergence rate ΁=Îģmax⁥(A)−Îģmin⁥(A)Îģmax⁥(A)+Îģmin⁥(A)\rho = \frac{\lambda_{\max}(A) - \lambda_{\min}(A)}{\lambda_{\max}(A) + \lambda_{\min}(A)}.


Convex Optimization Problem

DfConvex Optimization Problem

A convex optimization problem has the form:

min⁥xf(x)s.t.gi(x)≤0  (i=1,â€Ļ,m),hj(x)=0  (j=1,â€Ļ,p)\min_x f(x) \quad \text{s.t.} \quad g_i(x) \leq 0 \; (i = 1, \ldots, m), \quad h_j(x) = 0 \; (j = 1, \ldots, p)

where ff and each gig_i are convex functions, and each hjh_j is affine (linear). The feasible set is convex (intersection of convex sets and hyperplanes), and the objective is convex over this set.

â„šī¸ Why Affine Equality Constraints Only

Equality constraints must be affine because a nonlinear equality h(x)=0h(x) = 0 defines a non-convex feasible set (it's a lower-dimensional manifold with no "interior"). For example, x2+y2=1x^2 + y^2 = 1 (a circle) is not a convex set — you can find two points on the circle whose midpoint is not on the circle.

ThKey Properties of Convex Optimization

1. Local = Global: Every local minimum of a convex problem is a global minimum.

2. Uniqueness: If ff is strictly convex, the minimizer x∗x^* is unique.

3. Optimality conditions: x∗x^* is optimal if and only if:

∇f(x∗)T(y−x∗)â‰Ĩ0∀y∈F\nabla f(x^*)^T (y - x^*) \geq 0 \quad \forall y \in \mathcal{F}

where F\mathcal{F} is the feasible set. For unconstrained problems, this reduces to ∇f(x∗)=0\nabla f(x^*) = 0.

4. Computational tractability: Convex problems can be solved in polynomial time (in theory and practice) using interior-point methods.

5. Duality: The dual of a convex problem provides lower bounds, optimality certificates, and insight into constraint sensitivity.


Common Convex Problems

Linear Programming (LP)

DfLinear Program

A linear program has the form:

min⁡xcTxs.t.Ax≤b,Aeqx=beq\min_x c^Tx \quad \text{s.t.} \quad Ax \leq b, \quad A_{eq}x = b_{eq}

The objective and constraints are all linear. LP is the simplest class of convex optimization and can be solved in polynomial time using interior-point methods (or the simplex method in practice).

â„šī¸ LP Applications

LP appears in resource allocation, scheduling, network flow, production planning, and as a relaxation for combinatorial problems. The dual of an LP is also an LP (LP duality is exact).

Quadratic Programming (QP)

DfQuadratic Program

A quadratic program has the form:

min⁡x12xTQx+cTxs.t.Ax≤b\min_x \frac{1}{2}x^TQx + c^Tx \quad \text{s.t.} \quad Ax \leq b

where Qâǰ0Q \succeq 0 (positive semidefinite) ensures convexity. If Qâ‰ģ0Q \succ 0 (positive definite), the problem is strictly convex.

â„šī¸ QP Applications

Ridge regression (min⁥âˆĨy−XwâˆĨ2+ÎģâˆĨwâˆĨ2\min \|y - Xw\|^2 + \lambda\|w\|^2) is an unconstrained QP. SVMs with linear kernel are QPs. Portfolio optimization (Markowitz) is a QP. Model predictive control (MPC) solves a QP at each time step.

Second-Order Cone Programming (SOCP)

DfSecond-Order Cone Program

An SOCP has the form:

min⁥xcTxs.t.âˆĨAix+biâˆĨ2≤ciTx+di(i=1,â€Ļ,m)\min_x c^Tx \quad \text{s.t.} \quad \|A_ix + b_i\|_2 \leq c_i^Tx + d_i \quad (i = 1, \ldots, m)

SOCPs generalize LPs and QPs. The constraint âˆĨAx+bâˆĨ2≤cTx+d\|Ax + b\|_2 \leq c^Tx + d is a second-order (Lorentz) cone constraint.

â„šī¸ SOCP Applications

SOCPs arise in robust optimization (worst-case constraints), portfolio optimization with transaction costs, antenna design, and signal processing. Many problems that appear nonlinear can be reformulated as SOCPs.

Semidefinite Programming (SDP)

DfSemidefinite Program

An SDP has the form:

min⁥Xtr(CX)s.t.tr(AiX)=bi  (i=1,â€Ļ,m),Xâǰ0\min_X \text{tr}(CX) \quad \text{s.t.} \quad \text{tr}(A_iX) = b_i \; (i = 1, \ldots, m), \quad X \succeq 0

The variable XX is a symmetric positive semidefinite matrix. SDPs generalize LPs (when XX is diagonal) and are among the most powerful convex optimization formulations.

â„šī¸ SDP Applications

SDPs appear in relaxation of combinatorial problems (MAX-CUT, graph coloring), control theory (LMI conditions), quantum information, and polynomial optimization. The Goemans-Williamson algorithm for MAX-CUT uses an SDP relaxation.

Problem Hierarchy

ThConvex Optimization Hierarchy

Each problem class is a special case of the one below it:

LP⊂QP⊂SOCP⊂SDP⊂Conic\text{LP} \subset \text{QP} \subset \text{SOCP} \subset \text{SDP} \subset \text{Conic}

Solvers like MOSEK and ECOS handle SOCPs and SDPs natively. For LPs, solvers like Gurobi and CLPK are extremely fast. CVXPY automatically transforms problems into the appropriate form.


Duality

DfLagrangian Dual

Given the convex optimization problem:

min⁥xf(x)s.t.gi(x)≤0  (i=1,â€Ļ,m),hj(x)=0  (j=1,â€Ļ,p)\min_x f(x) \quad \text{s.t.} \quad g_i(x) \leq 0 \; (i=1,\ldots,m), \quad h_j(x) = 0 \; (j=1,\ldots,p)

the Lagrangian is:

L(x,Îģ,ÎŊ)=f(x)+∑i=1mÎģigi(x)+∑j=1pÎŊjhj(x)\mathcal{L}(x, \lambda, \nu) = f(x) + \sum_{i=1}^{m} \lambda_i g_i(x) + \sum_{j=1}^{p} \nu_j h_j(x)

The Lagrangian dual function is:

g(Îģ,ÎŊ)=inf⁥x∈dom(f)L(x,Îģ,ÎŊ)g(\lambda, \nu) = \inf_{x \in \text{dom}(f)} \mathcal{L}(x, \lambda, \nu)

The dual problem is:

max⁥Îģâ‰Ĩ0, ÎŊg(Îģ,ÎŊ)\max_{\lambda \geq 0, \, \nu} g(\lambda, \nu)

â„šī¸ Intuition

The dual function g(Îģ,ÎŊ)g(\lambda, \nu) provides a lower bound on the optimal value p∗p^* of the primal problem for any Îģâ‰Ĩ0,ÎŊ\lambda \geq 0, \nu. By maximizing this lower bound, we find the tightest possible bound — this is the dual problem. The dual is always convex (even if the primal is not), because it is the pointwise infimum of affine functions in (Îģ,ÎŊ)(\lambda, \nu).

Weak and Strong Duality

ThWeak and Strong Duality

Let p∗p^* be the optimal value of the primal and d∗d^* be the optimal value of the dual.

Weak Duality (always holds):

d∗≤p∗d^* \leq p^*

The dual optimal value is always a lower bound on the primal optimal value. The gap p∗−d∗p^* - d^* is called the duality gap.

Strong Duality (for convex problems under constraint qualifications):

d∗=p∗d^* = p^*

Strong duality holds for convex problems when Slater's condition is satisfied: there exists a strictly feasible point xx with gi(x)<0g_i(x) < 0 for all ii and hj(x)=0h_j(x) = 0 for all jj.

💡 KKT Conditions and Strong Duality

When strong duality holds, the KKT conditions are both necessary and sufficient for optimality. The dual variables Îģi,ÎŊj\lambda_i, \nu_j at the optimum are the Lagrange multipliers. Complementary slackness (Îģigi(x∗)=0\lambda_i g_i(x^*) = 0) ensures that only active constraints have nonzero multipliers. The dual variables quantify how much the optimal value changes per unit relaxation of each constraint — they are shadow prices.

Dual Interpretation in ML

â„šī¸ Dual Variables in Machine Learning

In SVMs, the dual variables Îąi\alpha_i are nonzero only for support vectors. In regularization, the dual variable Îģ\lambda represents the trade-off between fit and complexity. In fairness-constrained ML, dual variables measure the "cost of fairness" — how much accuracy is sacrificed per unit of fairness enforcement. Understanding duality gives you insight into which constraints are binding and which data points matter.


Python Implementation

Using CVXPY

import cvxpy as cp
import numpy as np

# ============================================
# Example 1: Linear Program
# Minimize cost subject to constraints
# ============================================

n = 4  # number of variables
c = np.array([1.0, 2.0, 3.0, 4.0])  # cost vector
A = np.array([[1, 1, 0, 0],
              [0, 0, 1, 1],
              [1, 0, 1, 0]])
b = np.array([10.0, 8.0, 12.0])

x = cp.Variable(n)
constraints = [A @ x <= b, x >= 0]
objective = cp.Minimize(c @ x)
prob = cp.Problem(objective, constraints)
prob.solve()

print(f"Status: {prob.status}")
print(f"Optimal value: {prob.value:.4f}")
print(f"Optimal x: {x.value}")

# ============================================
# Example 2: Quadratic Program (Ridge Regression)
# ============================================

np.random.seed(42)
m, n = 50, 10
X = np.random.randn(m, n)
y = X @ np.random.randn(n) + 0.1 * np.random.randn(m)

w = cp.Variable(n)
lam = 0.5  # regularization parameter
objective = cp.Minimize(0.5 * cp.sum_squares(y - X @ w) + lam * cp.squares(w))
prob = cp.Problem(objective)
prob.solve()

print(f"\nRidge regression:")
print(f"Optimal w: {w.value[:3]}...")
print(f"Optimal loss: {prob.value:.4f}")

# ============================================
# Example 3: Second-Order Cone Program
# ============================================

x_socp = cp.Variable(3)
A_socp = np.array([[1, 0, 0], [0, 1, 0]])
b_socp = np.array([0.5, 0.5])
c_socp = np.array([1.0, 2.0, 3.0])

objective = cp.Minimize(c_socp @ x_socp)
constraints = [cp.norm(A_socp @ x_socp - b_socp, 2) <= 1.0]
prob = cp.Problem(objective, constraints)
prob.solve()

print(f"\nSOCP:")
print(f"Optimal value: {prob.value:.4f}")
print(f"Optimal x: {x_socp.value}")

# ============================================
# Example 4: Semidefinite Program
# ============================================

X_sdp = cp.Variable((3, 3), symmetric=True)
C = np.array([[1, 0, 0], [0, 2, 0], [0, 0, 3]])
A1 = np.array([[1, 0, 0], [0, 0, 0], [0, 0, 0]])

objective = cp.Minimize(cp.trace(C @ X_sdp))
constraints = [cp.trace(A1 @ X_sdp) == 1, X_sdp >> 0]  # X >> 0 means PSD
prob = cp.Problem(objective, constraints)
prob.solve()

print(f"\nSDP:")
print(f"Optimal value: {prob.value:.4f}")
print(f"X is PSD: {np.all(np.linalg.eigvals(X_sdp.value) > -1e-8)}")

Using SciPy

from scipy.optimize import minimize, LinearConstraint, NonlinearConstraint

# ============================================
# Example 1: Unconstrained convex function
# ============================================

f = lambda x: (x[0] - 1)**2 + (x[1] - 2)**2
grad_f = lambda x: np.array([2*(x[0] - 1), 2*(x[1] - 2)])

result = minimize(f, x0=[0.0, 0.0], jac=grad_f, method='L-BFGS-B')
print(f"Unconstrained minimum: x = {result.x}, f(x) = {result.fun:.6f}")

# ============================================
# Example 2: Linear constraints (box constraints)
# ============================================

f2 = lambda x: (x[0] - 2)**2 + (x[1] - 3)**2
bounds = [(0, 4), (0, 4)]

result = minimize(f2, x0=[0, 0], bounds=bounds, method='L-BFGS-B')
print(f"Box-constrained: x = {result.x}, f(x) = {result.fun:.6f}")

# ============================================
# Example 3: Equality and inequality constraints
# ============================================

f3 = lambda x: x[0]**2 + x[1]**2 + x[2]**2
jac_f3 = lambda x: np.array([2*x[0], 2*x[1], 2*x[2]])

constraints = [
    {'type': 'eq', 'fun': lambda x: x[0] + x[1] + x[2] - 1},
    {'type': 'ineq', 'fun': lambda x: x[0] - 0.1},  # x[0] >= 0.1
    {'type': 'ineq', 'fun': lambda x: x[1] - 0.1},  # x[1] >= 0.1
]

result = minimize(f3, x0=[1/3, 1/3, 1/3], jac=jac_f3, constraints=constraints, method='SLSQP')
print(f"Constrained: x = {result.x}, f(x) = {result.fun:.6f}")

Applications in AI/ML

Support Vector Machines

â„šī¸ SVM as Convex Optimization

The SVM primal problem is:

min⁥w,b12âˆĨwâˆĨ2s.t.yi(wTxi+b)â‰Ĩ1  ∀i\min_{w, b} \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i(w^Tx_i + b) \geq 1 \; \forall i

This is a convex QP. The dual is also a convex QP, and complementary slackness reveals that only support vectors (Îąi>0\alpha_i > 0) determine the decision boundary. The kernel trick works because the dual depends only on dot products xiTxjx_i^Tx_j.

Ridge Regression (Tikhonov Regularization)

Ridge Regression

w∗=arg⁥min⁥wâˆĨy−XwâˆĨ22+ÎģâˆĨwâˆĨ22w^* = \arg\min_w \|y - Xw\|_2^2 + \lambda\|w\|_2^2

Here,

  • XX=Design matrix (n x p)
  • yy=Response vector (n x 1)
  • Îģ\lambda=Regularization strength
  • ww=Parameter vector (p x 1)

The closed-form solution is w∗=(XTX+ÎģI)−1XTyw^* = (X^TX + \lambda I)^{-1}X^Ty. This is a strictly convex problem (quadratic with XTX+ÎģIâ‰ģ0X^TX + \lambda I \succ 0), guaranteeing a unique global minimum. Ridge regression is an unconstrained QP and is convex for any Îģ>0\lambda > 0.

LASSO Regression

LASSO Regression

w∗=arg⁥min⁥wâˆĨy−XwâˆĨ22+ÎģâˆĨwâˆĨ1w^* = \arg\min_w \|y - Xw\|_2^2 + \lambda\|w\|_1

Here,

  • âˆĨwâˆĨ1\|w\|_1=L1 norm inducing sparsity
  • Îģ\lambda=Regularization strength

LASSO is convex (the L1 norm is convex) but not differentiable at wi=0w_i = 0. It promotes sparsity — many coefficients are driven exactly to zero, performing automatic feature selection. The proximal gradient method (ISTA/FISTA) is the standard solver.

Logistic Regression

â„šī¸ Logistic Regression as Convex Optimization

The negative log-likelihood for logistic regression is:

f(w)=−∑i=1n[yilog⁥(΃(wTxi))+(1−yi)log⁥(1âˆ’Īƒ(wTxi))]f(w) = -\sum_{i=1}^{n} \left[ y_i \log(\sigma(w^Tx_i)) + (1-y_i)\log(1-\sigma(w^Tx_i)) \right]

This is convex in ww (though nonlinear), and is a fundamental building block of ML pipelines. It can be solved by gradient descent, Newton's method, or interior-point methods.

Neural Network Training (Non-Convex but Informed)

âš ī¸ Convexity in Deep Learning

Training a neural network is non-convex — the loss landscape has many local minima and saddle points. However, convex optimization concepts still apply: (1) many local minima are nearly as good as the global minimum; (2) batch normalization and skip connections make the landscape "more convex"; (3) second-order methods use Hessian information; (4) convex relaxations provide bounds on the global minimum.


Common Mistakes

MistakeWhy It's WrongCorrect Approach
Assuming f′′(x)>0f''(x) > 0 means convex everywheref′′(x)â‰Ĩ0f''(x) \geq 0 is necessary for convexity, but a function can be convex even if f′′=0f'' = 0 at isolated points (e.g., f(x)=x4f(x) = x^4)Check ∇2f(x)âǰ0\nabla^2 f(x) \succeq 0 for all xx in the domain
Confusing convex and concaveMany people mix up "convex up" vs "convex down"Convex = bowl-shaped (holds water); concave = umbrella-shaped
Treating local minima as global minima in non-convex problemsLocal minima in non-convex problems may be far from global minimaUse convex relaxations, multi-start, or second-order methods
Forgetting that equality constraints must be affineh(x)=x2+y2−1=0h(x) = x^2 + y^2 - 1 = 0 does NOT define a convex feasible setNonlinear equalities create non-convex manifolds
Ignoring Slater's conditionStrong duality requires strict feasibilityVerify ∃x:gi(x)<0\exists x: g_i(x) < 0 and hj(x)=0h_j(x) = 0
Assuming all norms lead to convex problemsSome non-convex norms (like L0L_0) do notUse L1L_1, L2L_2, or other convex norms
Using gradient descent without checking Lipschitz continuityNon-Lipschitz gradients can cause divergenceVerify or bound the Lipschitz constant LL
Not checking if the Hessian is PSDUsing Newton's method on a non-convex function may find saddle pointsCheck eigenvalues of ∇2f\nabla^2 f or use trust-region methods
Assuming the sum of convex functions is always strictly convexSum of convex functions is convex, but may not be strictly convexf(x)=x4+(−x4)=0f(x) = x^4 + (-x^4) = 0 is convex but not strictly convex
Forgetting perspective scalingf(x/t)⋅tf(x/t) \cdot t is convex in (x,t)(x, t) but many miss thisUse perspective to handle fractional objectives

Interview Questions

Q1: What is the difference between convex and strictly convex?

💡Answer

A convex function satisfies f(θx+(1−θ)y)≤θf(x)+(1−θ)f(y)f(\theta x + (1-\theta)y) \leq \theta f(x) + (1-\theta)f(y). A strictly convex function satisfies the strict inequality f(θx+(1−θ)y)<θf(x)+(1−θ)f(y)f(\theta x + (1-\theta)y) < \theta f(x) + (1-\theta)f(y) for θ∈(0,1)\theta \in (0,1) and x≠yx \neq y. Convexity guarantees local = global; strict convexity additionally guarantees the global minimum is unique. However, strict convexity does not imply strong convexity — f(x)=x4f(x) = x^4 is strictly convex but not strongly convex (the Hessian vanishes at 0).

Q2: Why are convex problems easier to solve than non-convex problems?

💡Answer

Convex problems have three key properties: (1) every local minimum is a global minimum, so first-order methods (gradient descent) can find the optimum without getting stuck in local minima; (2) the KKT conditions are necessary and sufficient for optimality, providing a verifiable optimality certificate; (3) polynomial-time algorithms exist (interior-point methods). Non-convex problems may have exponentially many local minima, no efficient way to certify global optimality, and can be NP-hard in the worst case.

Q3: What is Slater's condition and why does it matter?

💡Answer

Slater's condition requires the existence of a strictly feasible point: ∃x\exists x in the relative interior of dom(f)\text{dom}(f) such that gi(x)<0g_i(x) < 0 for all inequality constraints and hj(x)=0h_j(x) = 0 for all equality constraints. When Slater's condition holds for a convex problem, strong duality holds (p∗=d∗p^* = d^*), meaning the dual optimal value equals the primal optimal value and the duality gap is zero. This enables: (1) recovering the primal solution from the dual; (2) using dual methods to solve the primal; (3) obtaining sensitivity information from dual variables. For linear programs, strong duality always holds (no Slater's condition needed).

Q4: How does strong convexity improve convergence?

💡Answer

Strong convexity (∇2fâǰmI\nabla^2 f \succeq mI) ensures the gradient grows linearly away from the optimum: âˆĨ∇f(x)âˆĨâ‰ĨmâˆĨx−x∗âˆĨ\|\nabla f(x)\| \geq m\|x - x^*\|. This gives linear convergence for gradient descent with rate ΁=L−mL+m\rho = \frac{L-m}{L+m}, compared to sublinear O(1/k)O(1/k) for general convex functions. The condition number Îē=L/m\kappa = L/m controls the rate: Îē=1\kappa = 1 (perfectly conditioned) gives immediate convergence; large Îē\kappa gives slow convergence. Preconditioning reduces Îē\kappa by transforming the problem.

Q5: Explain the dual of an SVM and its significance.

💡Answer

The SVM primal maximizes the margin 2âˆĨwâˆĨ\frac{2}{\|w\|} subject to yi(wTxi+b)â‰Ĩ1y_i(w^Tx_i + b) \geq 1. Introducing dual variables Îąiâ‰Ĩ0\alpha_i \geq 0 and applying KKT yields the dual: max⁡α∑αi−12∑i,jÎąiÎąjyiyjxiTxj\max_\alpha \sum \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i\alpha_j y_iy_j x_i^Tx_j subject to Îąiâ‰Ĩ0\alpha_i \geq 0 and ∑αiyi=0\sum \alpha_iy_i = 0. Significance: (1) it depends only on dot products, enabling the kernel trick; (2) complementary slackness (Îąi>0\alpha_i > 0 only for support vectors) identifies which training points matter; (3) the dual is a QP with nn variables, independent of the feature dimension — useful when nâ‰Ēdn \ll d.

Q6: What happens when strong duality does not hold?

💡Answer

When strong duality fails, there is a nonzero duality gap p∗−d∗>0p^* - d^* > 0. The dual provides a lower bound but not the exact value. This can happen when: (1) the problem is not convex (Slater's condition is sufficient but not necessary for convex problems); (2) the problem is convex but Slater's condition fails (no strictly feasible point exists); (3) the problem involves integer constraints (combinatorial optimization). In practice, the duality gap can be used as a stopping criterion for interior-point methods — terminate when p∗−d∗<Īĩp^* - d^* < \epsilon.


Practice Problems

Problem 1: Verify Convexity

📝Convexity Check

Show that f(x,y)=ex+y+x2+y2f(x, y) = e^{x+y} + x^2 + y^2 is convex on R2\mathbb{R}^2.

💡Solution

Compute the Hessian:

∇2f=(ex+y+2ex+yex+yex+y+2)\nabla^2 f = \begin{pmatrix} e^{x+y} + 2 & e^{x+y} \\ e^{x+y} & e^{x+y} + 2 \end{pmatrix}

The eigenvalues of (abba)\begin{pmatrix} a & b \\ b & a \end{pmatrix} are a+ba + b and a−ba - b where a=ex+y+2a = e^{x+y} + 2 and b=ex+yb = e^{x+y}.

Eigenvalues: Îģ1=2ex+y+2>0\lambda_1 = 2e^{x+y} + 2 > 0 and Îģ2=2>0\lambda_2 = 2 > 0.

Since ∇2fâ‰ģ0\nabla^2 f \succ 0 everywhere, ff is strictly convex.


Problem 2: Duality Gap

📝Non-Convex Problem

Consider f(x)=x3f(x) = x^3 on the interval [−1,1][-1, 1]. What can you say about the duality gap if we form a Lagrangian dual?

💡Solution

f(x)=x3f(x) = x^3 is not convex (it has an inflection point at x=0x = 0). The unconstrained minimum on [−1,1][-1, 1] is at x=−1x = -1 with f(−1)=−1f(-1) = -1. However, since the problem is non-convex, we cannot guarantee that the dual provides a tight lower bound — there may be a duality gap. The KKT conditions may identify saddle points or local minima that are not global.


Problem 3: Ridge Regression Closed Form

📝Derive Ridge Regression Solution

Derive the closed-form solution for w∗=arg⁥min⁥wâˆĨy−XwâˆĨ2+ÎģâˆĨwâˆĨ2w^* = \arg\min_w \|y - Xw\|^2 + \lambda\|w\|^2.

💡Solution

Expand the objective:

f(w)=yTy−2yTXw+wTXTXw+ÎģwTwf(w) = y^Ty - 2y^TXw + w^TX^TXw + \lambda w^Tw

Take the gradient and set to zero:

∇f(w)=−2XTy+2XTXw+2Îģw=0\nabla f(w) = -2X^Ty + 2X^TXw + 2\lambda w = 0
XTXw+Îģw=XTyX^TXw + \lambda w = X^Ty
(XTX+ÎģI)w=XTy(X^TX + \lambda I)w = X^Ty
w∗=(XTX+ÎģI)−1XTyw^* = (X^TX + \lambda I)^{-1}X^Ty

This exists for all Îģ>0\lambda > 0 because XTX+ÎģIâ‰ģ0X^TX + \lambda I \succ 0. The problem is strictly convex, so this is the unique global minimum.


Problem 4: Strong Convexity Certificate

📝Strong Convexity of Logistic Loss

Show that the logistic loss ℓ(z)=log⁡(1+e−z)\ell(z) = \log(1 + e^{-z}) is convex but not strongly convex.

💡Solution

First derivative: ℓ′(z)=−e−z1+e−z=âˆ’Īƒ(−z)\ell'(z) = \frac{-e^{-z}}{1 + e^{-z}} = -\sigma(-z).

Second derivative: ℓ′′(z)=΃(z)(1âˆ’Īƒ(z))\ell''(z) = \sigma(z)(1 - \sigma(z)) where ΃(z)=11+e−z\sigma(z) = \frac{1}{1+e^{-z}}.

Since ΃(z)∈(0,1)\sigma(z) \in (0, 1), we have ℓ′′(z)>0\ell''(z) > 0 for all zz, so the logistic loss is strictly convex.

However, ℓ′′(z)→0\ell''(z) \to 0 as âˆŖzâˆŖâ†’âˆž|z| \to \infty (since ΃(z)(1âˆ’Īƒ(z))→0\sigma(z)(1-\sigma(z)) \to 0). There is no m>0m > 0 such that ℓ′′(z)â‰Ĩm\ell''(z) \geq m for all zz. Therefore the logistic loss is convex and strictly convex, but not strongly convex.


Quick Reference

📋Key Takeaways

  • Convex Set: A set CC is convex if every convex combination θx+(1−θ)y∈C\theta x + (1-\theta)y \in C for θ∈[0,1]\theta \in [0,1]. Intersections of convex sets are convex.
  • Convex Function: ff is convex if f(θx+(1−θ)y)≤θf(x)+(1−θ)f(y)f(\theta x + (1-\theta)y) \leq \theta f(x) + (1-\theta)f(y). First-order condition: tangent hyperplanes are global underestimators. Second-order: ∇2f(x)âǰ0\nabla^2 f(x) \succeq 0.
  • Strong Convexity: ff is mm-strongly convex if ∇2f(x)âǰmI\nabla^2 f(x) \succeq mI. This gives linear convergence rate ΁=L−mL+m\rho = \frac{L-m}{L+m} for gradient descent.
  • Convex Optimization Problem: Minimize convex ff subject to convex gi≤0g_i \leq 0 and affine hj=0h_j = 0. Every local minimum is a global minimum.
  • Problem Hierarchy: LP ⊂\subset QP ⊂\subset SOCP ⊂\subset SDP. Each class includes the previous one.
  • Duality: The Lagrangian dual provides a lower bound. Strong duality (d∗=p∗d^* = p^*) holds for convex problems under Slater's condition.
  • KKT Conditions: Stationarity, primal feasibility, dual feasibility (Îģâ‰Ĩ0\lambda \geq 0), and complementary slackness (Îģigi=0\lambda_i g_i = 0) are necessary and sufficient for optimality when strong duality holds.
  • Applications: SVMs (QP with kernel trick), ridge regression (closed-form QP), LASSO (convex but non-smooth), logistic regression (convex nonlinear).
  • Python: Use CVXPY for modeling and solving convex problems; use SciPy for smaller problems with manual constraint specification.
  • Key Insight: Convexity is the dividing line between "tractable" and "intractable" optimization. If your problem is convex, you can solve it efficiently and certify global optimality.

Cross-References

  • Constrained Optimization: KKT conditions, penalty methods, and interior-point methods → Constrained Optimization
  • Gradient Descent: First-order methods for convex optimization → Gradient Descent
  • Newton's Method: Second-order methods using Hessian curvature → Newton's Method
  • Linear Algebra: Positive Definite Matrices: Hessian and PSD conditions for convexity → Positive Definite
  • Linear Algebra: Norms: Norms used in regularization constraints → Norms
  • Matrix Calculus: Derivatives of matrix-valued functions for SDP and SVM → Matrix Calculus
  • Linear Programming: Specialized methods for LP problems → Linear Programming
  • Quadratic Programming: Specialized methods for QP problems → Quadratic Programming
  • Lagrange Multipliers: Foundation for duality and KKT → Lagrange Multipliers
  • Multivariable Calculus: Gradients, Jacobians, and Hessians → Multivariable Calculus
Lesson Progress61 / 100