← Math|4 of 100
Mathematics for Data Science & AI

Optimization — Finding the Best Solution

Master optimization for machine learning: gradient descent, Adam optimizer, convex optimization, and hyperparameter tuning.

📂 Optimization📖 Lesson 4 of 100🎓 Free Course

Advertisement

Optimization — Finding the Best Solution

ℹ️ Why It Matters

Every ML model training is an optimization problem. "Find the model parameters that minimize the error." Optimization is the engine that powers learning.


What is Optimization?

Find the best solution from all feasible solutions.

Optimization Problem

minxf(x)subject to constraints\min_x f(x) \quad \text{subject to constraints}

Here,

  • f(x)f(x)=Objective function to minimize

Types:

  • Convex optimization: The good kind — any local minimum is also the global minimum
  • Non-convex optimization: The hard kind — many local minima (neural networks!)

Convex Functions

A function is convex if a line between any two points on the function lies above the function.

DfConvex Function

A function ff is convex if for all x,yx, y and t[0,1]t \in [0,1]:

f(tx+(1t)y)tf(x)+(1t)f(y)f(tx + (1-t)y) \leq tf(x) + (1-t)f(y)

Properties of convex functions:

  • Any local minimum is the global minimum
  • Gradient descent finds the global minimum
  • The set of convex functions is closed under addition and scalar multiplication

Common convex functions:

  • f(x)=x2f(x) = x^2 (quadratic)
  • f(x)=exf(x) = e^x (exponential)
  • f(x)=log(x)f(x) = -\log(x) (negative logarithm)
  • f(x)=xf(x) = |x| (absolute value)
  • Any norm: x\|x\|

ℹ️ Why convexity matters in ML

  • Linear regression → convex
  • Logistic regression → convex
  • SVM → convex
  • Neural networks → NON-convex (but still work well in practice!)

Gradient Descent Variants

Batch Gradient Descent

ℹ️ Batch Gradient Descent

  • Uses entire dataset for each update
  • Guaranteed to converge to global minimum (for convex)
  • Slow for large datasets

Stochastic Gradient Descent (SGD)

ℹ️ Stochastic Gradient Descent

  • Uses one random sample per update
  • Fast but noisy
  • Can escape local minima due to noise

Mini-Batch SGD

ℹ️ Mini-Batch SGD

  • Uses a batch of samples (32, 64, 128, 256)
  • Most commonly used in practice
  • Balances speed and stability

Momentum

Momentum

vt=β×vt1+α×L(θ)v_t = \beta \times v_{t-1} + \alpha \times \nabla L(\theta)

Here,

  • vtv_t=Velocity at time t
  • β\beta=Momentum coefficient (typically 0.9)
θ=θvt\theta = \theta - v_t

💡 Why Momentum Helps

Adds "momentum" to escape shallow local minima.

AdaGrad

ℹ️ AdaGrad

  • Adapts learning rate for each parameter
  • Frequently updated parameters get smaller learning rates
  • Good for sparse data (NLP)

RMSProp

ℹ️ RMSProp

  • Fixes AdaGrad's problem of ever-decreasing learning rates
  • Uses exponential moving average of squared gradients

Adam (Adaptive Moment Estimation)

Adam Optimizer

mt=β1×mt1+(1β1)×Lm_t = \beta_1 \times m_{t-1} + (1-\beta_1) \times \nabla L

Here,

  • mtm_t=First moment (mean)
  • β1\beta_1=Exponential decay rate for first moment

Adam Second Moment

vt=β2×vt1+(1β2)×(L)2v_t = \beta_2 \times v_{t-1} + (1-\beta_2) \times (\nabla L)^2

Here,

  • vtv_t=Second moment (variance)
  • β2\beta_2=Exponential decay rate for second moment

Bias correction:

m^t=mt1β1t\hat{m}_t = \frac{m_t}{1 - \beta_1^t}
v^t=vt1β2t\hat{v}_t = \frac{v_t}{1 - \beta_2^t}

Adam Update Rule

θ=θα×m^tv^t+ϵ\theta = \theta - \alpha \times \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}

Here,

  • α\alpha=Learning rate (typically 0.001)
  • ϵ\epsilon=Small constant (typically 1e-8)

💡 Why Adam is great

  • Adaptive learning rates for each parameter
  • Momentum helps escape local minima
  • Works well with default parameters
  • Handles sparse gradients

Constrained Optimization

Subject to constraints:

Constrained Optimization

minxf(x)\min_x f(x)

Here,

  • gi(x)0g_i(x) \leq 0=Inequality constraints
  • hj(x)=0h_j(x) = 0=Equality constraints

Lagrange Multipliers

Idea: Convert constrained optimization to unconstrained by adding penalty terms.

Lagrangian

L(x,λ)=f(x)+iλi×gi(x)+jμj×hj(x)L(x, \lambda) = f(x) + \sum_i \lambda_i \times g_i(x) + \sum_j \mu_j \times h_j(x)

Here,

  • λi,μj\lambda_i, \mu_j=Lagrange multipliers

Set: Lx=0\frac{\partial L}{\partial x} = 0, Lλ=0\frac{\partial L}{\partial \lambda} = 0

📝Example: Lagrange Multipliers

Minimize f(x,y)=x2+y2f(x,y) = x^2 + y^2 subject to: x+y=1x + y = 1

L(x,y,λ)=x2+y2+λ(x+y1)L(x,y,\lambda) = x^2 + y^2 + \lambda(x + y - 1)
Lx=2x+λ=0\frac{\partial L}{\partial x} = 2x + \lambda = 0
Ly=2y+λ=0\frac{\partial L}{\partial y} = 2y + \lambda = 0
Lλ=x+y1=0\frac{\partial L}{\partial \lambda} = x + y - 1 = 0

From first two: x=y=λ/2x = y = -\lambda/2

Substituting: λ/2λ/21=0λ=1-\lambda/2 - \lambda/2 - 1 = 0 \rightarrow \lambda = -1

Solution: x=1/2x = 1/2, y=1/2y = 1/2

KKT Conditions (Karush-Kuhn-Tucker)

The generalization of Lagrange multipliers for inequality constraints:

ℹ️ KKT Conditions

Lx=0\frac{\partial L}{\partial x} = 0
λi×gi(x)=0(Complementary slackness)\lambda_i \times g_i(x) = 0 \quad \text{(Complementary slackness)}
λi0\lambda_i \geq 0
gi(x)0g_i(x) \leq 0

Used in: SVM optimization, portfolio optimization


Convex Optimization Problems

Problem TypeFormExample
Linear Programmingmin cᵀx subject to Ax ≤ bResource allocation
Quadratic Programmingmin xᵀQx + cᵀxSVM
Second-Order Conemin cᵀx subject to
Semidefinite Programmingmin tr(CX) subject to AX = B, X ⪰ 0Relaxations

Linear Programming

Linear Programming

mincTx\min \mathbf{c}^T \mathbf{x}

Here,

  • c\mathbf{c}=Cost vector
  • x\mathbf{x}=Decision variables

Subject to: AxbA\mathbf{x} \leq \mathbf{b}, x0\mathbf{x} \geq 0

Methods: Simplex algorithm, Interior point methods


Non-Convex Optimization

⚠️ Challenge

Many local minima, saddle points, plateaus.

Strategies:

  1. Multiple random starts: Try many initial points, keep the best
  2. Simulated annealing: Accept worse solutions sometimes to escape local minima
  3. Evolutionary algorithms: Population-based search
  4. Gradient-based methods with momentum: SGD, Adam
  5. Second-order methods: L-BFGS (approximates Hessian)

Saddle Points

Points where gradient is zero but it's NOT a local minimum:

ℹ️ In high dimensions (like neural networks)

  • Local minima are rare
  • Saddle points are common
  • Most critical points are saddle points
  • This is actually GOOD news — it means gradient descent works better than expected

Hyperparameter Optimization

Finding the best settings for your ML model.

Grid Search:

ℹ️ Grid Search

Try all combinations:

  • learning_rate: [0.001, 0.01, 0.1]
  • batch_size: [32, 64, 128]
  • → 9 combinations to try

Random Search:

ℹ️ Random Search

Randomly sample from the hyperparameter space. More efficient than grid search in practice.

Bayesian Optimization:

ℹ️ Bayesian Optimization

Build a surrogate model of the objective function. Use it to decide which hyperparameters to try next.

Tools: Optuna, Hyperopt, SMAC


📋Key Takeaways

  • Optimization finds the best parameters by minimizing a loss function. minxf(x)\min_x f(x) subject to constraints is the formal problem — every model training session is an optimization problem.

  • Convexity guarantees global solutions. A function is convex if f(tx+(1t)y)tf(x)+(1t)f(y)f(tx + (1-t)y) \leq tf(x) + (1-t)f(y), meaning any local minimum is the global minimum — linear regression, logistic regression, and SVMs are all convex.

  • Adam is the go-to optimizer. It combines momentum (mt=β1mt1+(1β1)Lm_t = \beta_1 m_{t-1} + (1-\beta_1)\nabla L) with adaptive learning rates (vt=β2vt1+(1β2)(L)2v_t = \beta_2 v_{t-1} + (1-\beta_2)(\nabla L)^2), updating via θ=θα×m^tv^t+ϵ\theta = \theta - \alpha \times \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}.

  • The learning rate α\alpha is the most important hyperparameter. Too large causes overshooting and divergence; too small causes painfully slow convergence. Start with 0.001 for Adam.

  • Lagrange multipliers convert constrained to unconstrained problems. The Lagrangian L(x,λ)=f(x)+λigi(x)L(x, \lambda) = f(x) + \sum \lambda_i g_i(x) transforms constraints into penalty terms — foundational for SVM optimization (KKT conditions).

  • Non-convex landscapes have saddle points, not just local minima. In high dimensions, most critical points are saddle points, which gradient descent with momentum can escape — explaining why neural networks train well despite being non-convex.

Lesson Progress4 / 100