Optimization — Finding the Best Solution

ℹ️ Why It Matters

Training an ML model is an optimization problem: find the model parameters that minimize the error. Optimization is the engine that powers learning.

What is Optimization?

Optimization means finding the best solution among all feasible solutions.

Optimization Problem

\min_x f(x) \quad \text{subject to constraints}

Here,

$f(x)$ =Objective function to minimize

Types:

Convex optimization: The good kind — any local minimum is also the global minimum
Non-convex optimization: The hard kind — many local minima (neural networks!)

Convex Functions

A function is convex if the line segment between any two points on its graph lies on or above the graph.

Convex Function

A function $f$ is convex if for all $x, y$ and $t \in [0,1]$ :

f(tx + (1-t)y) \leq tf(x) + (1-t)f(y)

Properties of convex functions:

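Any local minimum is a global minimum
Gradient descent with a suitably small step size converges to a global minimum (when one exists)
The set of convex functions is closed under addition and multiplication by nonnegative scalars (multiplying by a negative scalar turns a convex function into a concave one)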

Common convex functions:

$f(x) = x^2$ (quadratic)
$f(x) = e^x$ (exponential)
$f(x) = -\log(x)$ (negative logarithm)
$f(x) = |x|$ (absolute value)
Any norm: $\|x\|$
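The defining inequality can be spot-checked numerically. The sketch below samples random points and looks for a violation of $f(tx + (1-t)y) \leq tf(x) + (1-t)f(y)$. The helper name `violates_convexity`, the sampling intervals, and the tolerance are all assumptions made for illustration. Finding no violation is evidence of convexity on the sampled interval, not a proof.

```python
import math
import random

def violates_convexity(f, lo, hi, trials=10_000, tol=1e-9, seed=0):
    """Return True if some sampled (x, y, t) breaks the convexity inequality."""
    rng = random.Random(seed)
    for _ in range(trials):
        x, y = rng.uniform(lo, hi), rng.uniform(lo, hi)
        t = rng.random()
        lhs = f(t * x + (1 - t) * y)
        rhs = t * f(x) + (1 - t) * f(y)
        if lhs > rhs + tol:
            return True
    return False

# Functions from the list above, each with an illustrative interval.
convex_examples = {
    "x^2": (lambda x: x * x, -5.0, 5.0),
    "exp(x)": (math.exp, -5.0, 5.0),
    "-log(x)": (lambda x: -math.log(x), 0.1, 5.0),
    "|x|": (abs, -5.0, 5.0),
}
for name, (f, lo, hi) in convex_examples.items():
    print(name, "violation found:", violates_convexity(f, lo, hi))

# sin(x) is not convex on [0, 2*pi], so the check should find a violation.
print("sin(x) violation found:", violates_convexity(math.sin, 0.0, 2 * math.pi))
```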

ℹ️ Why convexity matters in ML

Linear regression → convex
Logistic regression → convex
SVM → convex
Neural networks → NON-convex (but still work well in practice!)

Gradient Descent Variants

Batch Gradient Descent

ℹ️ Batch Gradient Descent

Uses entire dataset for each update
Converges to the global minimum on convex problems, provided the learning rate is small enough
Slow for large datasets

Stochastic Gradient Descent (SGD)

ℹ️ Stochastic Gradient Descent

Uses one random sample per update
Fast but noisy
Can escape local minima due to noise

Mini-Batch SGD

ℹ️ Mini-Batch SGD

Uses a small batch of samples per update (typical sizes: 32, 64, 128, 256)
Most commonly used in practice
Balances speed and stability
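The three variants differ only in how many samples feed each update, so one function with a `batch_size` parameter can cover all of them. This is a minimal sketch on a made-up, noise-free dataset ($y = 3x + 1$); the function name `minibatch_gd`, the learning rate, and the epoch count are assumptions chosen so the toy problem converges quickly.

```python
import random

def minibatch_gd(xs, ys, batch_size, lr=0.1, epochs=200, seed=0):
    """Fit y ~ w*x + b by minimizing mean squared error.

    batch_size == len(xs) gives batch gradient descent,
    batch_size == 1 gives SGD, anything in between is mini-batch SGD.
    """
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)                      # new sample order each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            grad_w = grad_b = 0.0
            for i in batch:
                err = (w * xs[i] + b) - ys[i]
                grad_w += 2 * err * xs[i] / len(batch)
                grad_b += 2 * err / len(batch)
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b

xs = [i / 10 for i in range(-10, 11)]   # 21 points in [-1, 1]
ys = [3.0 * x + 1.0 for x in xs]        # synthetic, noise-free targets
for name, bs in [("batch", len(xs)), ("sgd", 1), ("mini-batch", 4)]:
    w, b = minibatch_gd(xs, ys, batch_size=bs)
    print(f"{name:10s} w={w:.3f} b={b:.3f}")
```

Because the data here contain no noise, all three variants recover the same line; on real data the smaller batch sizes would show noisier trajectories.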

Momentum

v_t = \beta \times v_{t-1} + \alpha \times \nabla L(\theta)

Here,

$v_t$ =Velocity at time t
$\beta$ =Momentum coefficient (typically 0.9)
$\alpha$ =Learning rate

\theta = \theta - v_t

💡 Why Momentum Helps

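The velocity $v_t$ is a decaying sum of past gradients, so steps build up along directions where successive gradients agree and cancel out where they oscillate. The accumulated velocity can also carry the parameters through shallow local minima and flat regions.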
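A small experiment can make the effect concrete. The sketch below applies the two update equations above to a made-up, poorly conditioned quadratic loss and counts the steps needed to drive the gradient norm below a tolerance. The loss, the starting point, and the values of `alpha` and `tol` are assumptions for illustration; setting `beta=0` reduces the update to plain gradient descent.

```python
def grad(theta):
    """Gradient of the toy loss L = 0.5 * (theta[0]**2 + 50 * theta[1]**2)."""
    return [theta[0], 50.0 * theta[1]]

def steps_to_converge(alpha=0.02, beta=0.0, tol=1e-6, max_steps=10_000):
    """Run momentum updates until the gradient norm drops below tol."""
    theta = [1.0, 1.0]
    v = [0.0, 0.0]
    for step in range(1, max_steps + 1):
        g = grad(theta)
        v = [beta * vi + alpha * gi for vi, gi in zip(v, g)]   # v_t
        theta = [ti - vi for ti, vi in zip(theta, v)]          # theta - v_t
        if sum(gi * gi for gi in grad(theta)) ** 0.5 < tol:
            return step
    return max_steps

plain = steps_to_converge(beta=0.0)      # beta = 0: plain gradient descent
momentum = steps_to_converge(beta=0.9)
print("plain GD steps:", plain, "| momentum steps:", momentum)
```

On this particular loss the momentum run should need noticeably fewer steps, because the slow direction (`theta[0]`) benefits from the accumulated velocity.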

AdaGrad

ℹ️ AdaGrad

Adapts learning rate for each parameter
Frequently updated parameters get smaller learning rates
Good for sparse data (NLP)
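The text above gives no formula for AdaGrad, so the sketch below uses the commonly cited form: keep a running sum of squared gradients per parameter and divide the learning rate by its square root. The gradient pattern is invented to mimic one dense and one sparse feature, and `adagrad_step` is a hypothetical helper name.

```python
import math

def adagrad_step(theta, grads, accum, lr=0.1, eps=1e-8):
    """One AdaGrad update; accum holds the running sum of squared gradients."""
    for i, g in enumerate(grads):
        accum[i] += g * g
        theta[i] -= lr * g / (math.sqrt(accum[i]) + eps)
    return theta, accum

theta = [0.0, 0.0]
accum = [0.0, 0.0]
for step in range(100):
    # Hypothetical gradients: parameter 0 receives a gradient every step,
    # parameter 1 only every 10th step (a "sparse" feature).
    grads = [1.0, 1.0 if step % 10 == 0 else 0.0]
    theta, accum = adagrad_step(theta, grads, accum)

effective_lr = [0.1 / (math.sqrt(a) + 1e-8) for a in accum]
print("effective learning rates:", effective_lr)
```

The frequently updated parameter ends with the smaller effective learning rate, which is the behavior described in the list above.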

RMSProp

ℹ️ RMSProp

Fixes AdaGrad's problem of ever-decreasing learning rates
Uses exponential moving average of squared gradients
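To see the difference from AdaGrad, the sketch below feeds a constant gradient of 1 to both accumulators and reports the resulting step scale. It assumes the usual formulations (a running sum for AdaGrad, an exponential moving average with decay `rho` for RMSProp); `effective_rates` and the chosen constants are illustrative only.

```python
import math

def effective_rates(steps, lr=0.01, rho=0.9, eps=1e-8):
    """Return the step scale lr / sqrt(accumulator) for AdaGrad and RMSProp."""
    adagrad_sum, rmsprop_avg = 0.0, 0.0
    for _ in range(steps):
        g = 1.0                                                # constant gradient
        adagrad_sum += g * g                                   # keeps growing
        rmsprop_avg = rho * rmsprop_avg + (1 - rho) * g * g    # forgets old gradients
    return (lr / (math.sqrt(adagrad_sum) + eps),
            lr / (math.sqrt(rmsprop_avg) + eps))

for steps in (10, 100, 1000):
    ada, rms = effective_rates(steps)
    print(f"steps={steps:5d}  AdaGrad scale={ada:.5f}  RMSProp scale={rms:.5f}")
```

The AdaGrad scale keeps shrinking as steps accumulate, while the RMSProp scale settles near `lr`.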

Adam (Adaptive Moment Estimation)

Adam Optimizer

m_t = \beta_1 \times m_{t-1} + (1-\beta_1) \times \nabla L

Here,

$m_t$ =First moment (mean)
$\beta_1$ =Exponential decay rate for first moment

Adam Second Moment

v_t = \beta_2 \times v_{t-1} + (1-\beta_2) \times (\nabla L)^2

Here,

$v_t$ =Second moment (uncentered variance: a running average of squared gradients)
$\beta_2$ =Exponential decay rate for second moment

Bias correction:

\hat{m}_t = \frac{m_t}{1 - \beta_1^t}

\hat{v}_t = \frac{v_t}{1 - \beta_2^t}

Adam Update Rule

\theta = \theta - \alpha \times \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}

Here,

$\alpha$ =Learning rate (typically 0.001)
$\epsilon$ =Small constant (typically 1e-8)
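Putting the four equations together, a minimal scalar version of Adam might look like the sketch below. The function name `adam_minimize`, the toy objective $(\theta - 3)^2$, and the step count are assumptions; the learning rate is set to 0.05 rather than the typical 0.001 only so that this tiny example converges in a couple of thousand steps.

```python
import math

def adam_minimize(grad_fn, theta, lr=0.05, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=2000):
    """Minimize a scalar function with Adam, given its gradient grad_fn."""
    m, v = 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * g          # first moment
        v = beta2 * v + (1 - beta2) * g * g      # second moment
        m_hat = m / (1 - beta1 ** t)             # bias correction
        v_hat = v / (1 - beta2 ** t)
        theta -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta

# Toy objective f(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3).
theta_star = adam_minimize(lambda th: 2.0 * (th - 3.0), theta=0.0)
print(f"theta after Adam: {theta_star:.4f}")
```

Real models hold one `m` and one `v` per parameter; the scalar form is used here only to keep the correspondence with the equations easy to read.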

💡 Why Adam is great

Adaptive learning rates for each parameter
Momentum helps escape local minima
Works well with default parameters
Handles sparse gradients

Constrained Optimization

Minimize an objective while restricting $x$ to a feasible set:

Constrained Optimization

\min_x f(x) \quad \text{subject to} \quad g_i(x) \leq 0, \; h_j(x) = 0

Here,

$g_i(x) \leq 0$ =Inequality constraints
$h_j(x) = 0$ =Equality constraints

Lagrange Multipliers

Idea: convert a constrained problem into an unconstrained one by adding a multiplier-weighted term for each constraint to the objective.

Lagrangian

L(x, \lambda) = f(x) + \sum_i \lambda_i \times g_i(x) + \sum_j \mu_j \times h_j(x)

Here,

$\lambda_i, \mu_j$ =Lagrange multipliers

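Set: $\frac{\partial L}{\partial x} = 0$ and, for each equality multiplier, $\frac{\partial L}{\partial \mu_j} = 0$ (which recovers $h_j(x) = 0$). Inequality multipliers $\lambda_i$ are handled by the KKT conditions.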

📝Example: Lagrange Multipliers

Minimize $f(x,y) = x^2 + y^2$ subject to: $x + y = 1$

L(x,y,\lambda) = x^2 + y^2 + \lambda(x + y - 1)

\frac{\partial L}{\partial x} = 2x + \lambda = 0

\frac{\partial L}{\partial y} = 2y + \lambda = 0

\frac{\partial L}{\partial \lambda} = x + y - 1 = 0

From first two: $x = y = -\lambda/2$

Substituting into the constraint: $-\lambda/2 - \lambda/2 - 1 = 0 \rightarrow \lambda = -1$

Solution: $x = 1/2$ , $y = 1/2$
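The three stationarity conditions in this example are linear in $(x, y, \lambda)$, so the result can be double-checked by solving them as a linear system. The sketch below does this with exact fractions; `solve_linear` is a small hypothetical helper written for this check, not a library routine.

```python
from fractions import Fraction

def solve_linear(A, b):
    """Solve A z = b by Gauss-Jordan elimination with exact fractions."""
    n = len(A)
    M = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        # Find a row with a nonzero entry in this column and move it up.
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [v / M[col][col] for v in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                factor = M[r][col]
                M[r] = [vr - factor * vc for vr, vc in zip(M[r], M[col])]
    return [row[-1] for row in M]

# Unknowns ordered as (x, y, lambda).
A = [[2, 0, 1],   # dL/dx      : 2x + lambda = 0
     [0, 2, 1],   # dL/dy      : 2y + lambda = 0
     [1, 1, 0]]   # dL/dlambda : x + y = 1
b = [0, 0, 1]
x, y, lam = solve_linear(A, b)
print(x, y, lam)
```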

KKT Conditions (Karush-Kuhn-Tucker)

The generalization of Lagrange multipliers for inequality constraints:

ℹ️ KKT Conditions

\frac{\partial L}{\partial x} = 0 \quad \text{(Stationarity)}

\lambda_i \times g_i(x) = 0 \quad \text{(Complementary slackness)}

\lambda_i \geq 0 \quad \text{(Dual feasibility)}

g_i(x) \leq 0 \quad \text{(Primal feasibility)}

Used in: SVM optimization, portfolio optimization
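A one-dimensional toy problem shows how the four conditions work together: minimize $(x - 2)^2$ subject to $x - 1 \leq 0$. The problem and the helper `kkt_satisfied` are invented for illustration. The Lagrangian is $L = (x-2)^2 + \lambda(x - 1)$, and the sketch tests three candidate pairs $(x, \lambda)$.

```python
def kkt_satisfied(x, lam, tol=1e-9):
    """Check the KKT conditions for: minimize (x - 2)^2 subject to x - 1 <= 0."""
    g = x - 1.0                                # constraint value g(x)
    stationarity = 2.0 * (x - 2.0) + lam       # dL/dx with L = f + lam * g
    return (abs(stationarity) <= tol           # stationarity
            and abs(lam * g) <= tol            # complementary slackness
            and lam >= -tol                    # lam >= 0
            and g <= tol)                      # g(x) <= 0

print(kkt_satisfied(x=1.0, lam=2.0))   # constrained optimum on the boundary
print(kkt_satisfied(x=2.0, lam=0.0))   # unconstrained minimizer, but infeasible
print(kkt_satisfied(x=0.0, lam=4.0))   # feasible and stationary, slackness fails
```

Only the first candidate passes: the constraint is active there ($g = 0$), which is what allows a nonzero multiplier.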

Convex Optimization Problems

Problem Type	Form	Example
Linear Programming	min cᵀx subject to Ax ≤ b	Resource allocation
Quadratic Programming	min xᵀQx + cᵀx (convex when Q ⪰ 0)	SVM
Second-Order Cone	min cᵀx subject to
Semidefinite Programming	min tr(CX) subject to AX = B, X ⪰ 0	Relaxations

Linear Programming

\min \mathbf{c}^T \mathbf{x}

Here,

$\mathbf{c}$ =Cost vector
$\mathbf{x}$ =Decision variables

Subject to: $A\mathbf{x} \leq \mathbf{b}$ , $\mathbf{x} \geq 0$

Methods: Simplex algorithm, Interior point methods
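For intuition, a two-variable LP can be solved by brute force: when the feasible region is bounded and non-empty, an optimum lies at a vertex, so it is enough to test every intersection of two constraint boundaries. The sketch below does exactly that. It is vertex enumeration, not the Simplex algorithm or an interior point method, and it scales poorly beyond tiny problems. The function name `solve_lp_2d` and the example numbers are assumptions.

```python
from itertools import combinations

def solve_lp_2d(c, A, b, tol=1e-9):
    """Brute-force a 2-variable LP: min c.x subject to A x <= b, x >= 0.

    Checks every intersection of two constraint boundaries and keeps the
    best feasible one. Assumes the feasible region is bounded and non-empty.
    """
    # Fold x >= 0 into the inequality list as -x <= 0 and -y <= 0.
    rows = [list(r) for r in A] + [[-1.0, 0.0], [0.0, -1.0]]
    rhs = list(b) + [0.0, 0.0]
    best = None
    for i, j in combinations(range(len(rows)), 2):
        (a1, a2), (a3, a4) = rows[i], rows[j]
        det = a1 * a4 - a2 * a3
        if abs(det) < tol:
            continue                                  # parallel boundaries
        x = (rhs[i] * a4 - a2 * rhs[j]) / det         # Cramer's rule
        y = (a1 * rhs[j] - rhs[i] * a3) / det
        feasible = all(r[0] * x + r[1] * y <= v + tol for r, v in zip(rows, rhs))
        if feasible:
            value = c[0] * x + c[1] * y
            if best is None or value < best[0]:
                best = (value, x, y)
    return best

# Example: min -x - 2y  subject to  x + y <= 4,  x + 3y <= 6,  x, y >= 0.
value, x, y = solve_lp_2d(c=[-1.0, -2.0], A=[[1.0, 1.0], [1.0, 3.0]], b=[4.0, 6.0])
print(f"optimum at x={x:.2f}, y={y:.2f} with objective {value:.2f}")
```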

Non-Convex Optimization

⚠️ Challenge

The loss surface can contain many local minima, saddle points, and plateaus.

Strategies:

Multiple random starts: Try many initial points, keep the best
Simulated annealing: Accept worse solutions sometimes to escape local minima
Evolutionary algorithms: Population-based search
Gradient-based methods with momentum: SGD with momentum, Adam
Quasi-Newton methods: L-BFGS (approximates the Hessian from recent gradients)

Saddle Points

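Points where the gradient is zero but which are neither a local minimum nor a local maximum: the function curves upward in some directions and downward in others.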

ℹ️ In high dimensions (like neural networks)

Local minima are rare
Saddle points are common
Most critical points are saddle points
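This is actually GOOD news: a saddle point always has some direction that leads downhill, so gradient descent is less likely to get stuck than the number of critical points would suggest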
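The textbook saddle $f(x, y) = x^2 - y^2$ illustrates the point. Its gradient vanishes at the origin, but the function curves down along $y$, so a tiny offset is enough for gradient descent to move away. The function, the starting point, and the step size below are assumptions chosen for the demonstration.

```python
def grad(x, y):
    """Gradient of the saddle-shaped toy function f(x, y) = x**2 - y**2."""
    return 2.0 * x, -2.0 * y

def descend(x, y, lr=0.1, steps=100):
    """Plain gradient descent starting from (x, y)."""
    for _ in range(steps):
        gx, gy = grad(x, y)
        x, y = x - lr * gx, y - lr * gy
    return x, y

# The origin is a critical point: both gradient components are zero.
print("gradient is zero at origin:", all(g == 0 for g in grad(0.0, 0.0)))

# Start almost on the x-axis, with a tiny offset along y.
x_end, y_end = descend(1.0, 1e-6)
f_end = x_end ** 2 - y_end ** 2
print(f"after descent: x={x_end:.2e}, y={y_end:.2e}, f={f_end:.2e}")
```

The `x` coordinate shrinks toward zero while the small `y` offset grows, so the iterate leaves the saddle and the function value drops below $f(0, 0) = 0$. This toy function is unbounded below, so the run shows escape from the saddle, not convergence.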

Hyperparameter Optimization

Finding the best settings for your ML model.

Grid Search:

ℹ️ Grid Search

Try all combinations:

learning_rate: [0.001, 0.01, 0.1]
batch_size: [32, 64, 128]
→ 9 combinations to try
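The grid above can be enumerated with `itertools.product`. In the sketch below, `validation_loss` is a hypothetical stand-in for training a model and measuring its validation error; its formula is made up so that the example runs instantly.

```python
from itertools import product

def validation_loss(learning_rate, batch_size):
    """Hypothetical stand-in for training a model and returning its validation loss."""
    return (learning_rate - 0.01) ** 2 + ((batch_size - 64) / 1000) ** 2

grid = {
    "learning_rate": [0.001, 0.01, 0.1],
    "batch_size": [32, 64, 128],
}
# Build one dict of hyperparameters per combination.
combos = [dict(zip(grid, values)) for values in product(*grid.values())]
best = min(combos, key=lambda params: validation_loss(**params))
print(len(combos), "combinations; best:", best)
```

The number of combinations is the product of the list lengths, so every added hyperparameter multiplies the cost.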

Random Search:

ℹ️ Random Search

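Randomly sample from the hyperparameter space. For the same number of trials it often beats grid search, because each trial tests a new value of every hyperparameter instead of re-testing the same few grid values.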
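A minimal random search might look like the sketch below. As before, `validation_loss` is a hypothetical stand-in for a real training run, and the sampling ranges (a log-uniform learning rate between 0.001 and 0.1, three batch sizes) are assumptions.

```python
import random

def validation_loss(learning_rate, batch_size):
    """Hypothetical stand-in for training a model and returning its validation loss."""
    return (learning_rate - 0.01) ** 2 + ((batch_size - 64) / 1000) ** 2

def random_search(n_trials=20, seed=0):
    """Sample hyperparameters at random and keep the best trial."""
    rng = random.Random(seed)
    best_params, best_loss = None, float("inf")
    for _ in range(n_trials):
        params = {
            "learning_rate": 10 ** rng.uniform(-3, -1),   # log-uniform in [0.001, 0.1]
            "batch_size": rng.choice([32, 64, 128]),
        }
        loss = validation_loss(**params)
        if loss < best_loss:
            best_params, best_loss = params, loss
    return best_params, best_loss

best_params, best_loss = random_search()
print(best_params, best_loss)
```

Sampling the learning rate on a log scale is a common choice because useful values often span several orders of magnitude.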

Bayesian Optimization:

ℹ️ Bayesian Optimization

Build a surrogate model of the objective function. Use it to decide which hyperparameters to try next.

Tools: Optuna, Hyperopt, SMAC

📋Key Takeaways

Optimization finds the best parameters by minimizing a loss function. $\min_x f(x)$ subject to constraints is the formal problem — every model training session is an optimization problem.
Convexity guarantees global solutions. A function is convex if $f(tx + (1-t)y) \leq tf(x) + (1-t)f(y)$ , meaning any local minimum is the global minimum — linear regression, logistic regression, and SVMs are all convex.
Adam is the go-to optimizer. It combines momentum ( $m_t = \beta_1 m_{t-1} + (1-\beta_1)\nabla L$ ) with adaptive learning rates ( $v_t = \beta_2 v_{t-1} + (1-\beta_2)(\nabla L)^2$ ), updating via $\theta = \theta - \alpha \times \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$ .
The learning rate $\alpha$ is often the most important hyperparameter. Too large causes overshooting and divergence; too small causes painfully slow convergence. A common starting point for Adam is 0.001.
Lagrange multipliers convert constrained to unconstrained problems. The Lagrangian $L(x, \lambda) = f(x) + \sum \lambda_i g_i(x)$ transforms constraints into penalty terms — foundational for SVM optimization (KKT conditions).
Non-convex landscapes have saddle points, not just local minima. In high dimensions, most critical points are saddle points, which gradient descent with momentum can escape — explaining why neural networks train well despite being non-convex.