Optimization — Finding the Best Solution
ℹ️ Why It Matters
Every ML model training is an optimization problem. "Find the model parameters that minimize the error." Optimization is the engine that powers learning.
What is Optimization?
Find the best solution from all feasible solutions.
Optimization Problem
Here,
- =Objective function to minimize
Types:
- Convex optimization: The good kind — any local minimum is also the global minimum
- Non-convex optimization: The hard kind — many local minima (neural networks!)
Convex Functions
A function is convex if a line between any two points on the function lies above the function.
DfConvex Function
A function is convex if for all and :
Properties of convex functions:
- Any local minimum is the global minimum
- Gradient descent finds the global minimum
- The set of convex functions is closed under addition and scalar multiplication
Common convex functions:
- (quadratic)
- (exponential)
- (negative logarithm)
- (absolute value)
- Any norm:
ℹ️ Why convexity matters in ML
- Linear regression → convex
- Logistic regression → convex
- SVM → convex
- Neural networks → NON-convex (but still work well in practice!)
Gradient Descent Variants
Batch Gradient Descent
ℹ️ Batch Gradient Descent
- Uses entire dataset for each update
- Guaranteed to converge to global minimum (for convex)
- Slow for large datasets
Stochastic Gradient Descent (SGD)
ℹ️ Stochastic Gradient Descent
- Uses one random sample per update
- Fast but noisy
- Can escape local minima due to noise
Mini-Batch SGD
ℹ️ Mini-Batch SGD
- Uses a batch of samples (32, 64, 128, 256)
- Most commonly used in practice
- Balances speed and stability
Momentum
Momentum
Here,
- =Velocity at time t
- =Momentum coefficient (typically 0.9)
💡 Why Momentum Helps
Adds "momentum" to escape shallow local minima.
AdaGrad
ℹ️ AdaGrad
- Adapts learning rate for each parameter
- Frequently updated parameters get smaller learning rates
- Good for sparse data (NLP)
RMSProp
ℹ️ RMSProp
- Fixes AdaGrad's problem of ever-decreasing learning rates
- Uses exponential moving average of squared gradients
Adam (Adaptive Moment Estimation)
Adam Optimizer
Here,
- =First moment (mean)
- =Exponential decay rate for first moment
Adam Second Moment
Here,
- =Second moment (variance)
- =Exponential decay rate for second moment
Bias correction:
Adam Update Rule
Here,
- =Learning rate (typically 0.001)
- =Small constant (typically 1e-8)
💡 Why Adam is great
- Adaptive learning rates for each parameter
- Momentum helps escape local minima
- Works well with default parameters
- Handles sparse gradients
Constrained Optimization
Subject to constraints:
Constrained Optimization
Here,
- =Inequality constraints
- =Equality constraints
Lagrange Multipliers
Idea: Convert constrained optimization to unconstrained by adding penalty terms.
Lagrangian
Here,
- =Lagrange multipliers
Set: ,
📝Example: Lagrange Multipliers
Minimize subject to:
From first two:
Substituting:
Solution: ,
KKT Conditions (Karush-Kuhn-Tucker)
The generalization of Lagrange multipliers for inequality constraints:
ℹ️ KKT Conditions
Used in: SVM optimization, portfolio optimization
Convex Optimization Problems
| Problem Type | Form | Example |
|---|---|---|
| Linear Programming | min cᵀx subject to Ax ≤ b | Resource allocation |
| Quadratic Programming | min xᵀQx + cᵀx | SVM |
| Second-Order Cone | min cᵀx subject to | |
| Semidefinite Programming | min tr(CX) subject to AX = B, X ⪰ 0 | Relaxations |
Linear Programming
Linear Programming
Here,
- =Cost vector
- =Decision variables
Subject to: ,
Methods: Simplex algorithm, Interior point methods
Non-Convex Optimization
⚠️ Challenge
Many local minima, saddle points, plateaus.
Strategies:
- Multiple random starts: Try many initial points, keep the best
- Simulated annealing: Accept worse solutions sometimes to escape local minima
- Evolutionary algorithms: Population-based search
- Gradient-based methods with momentum: SGD, Adam
- Second-order methods: L-BFGS (approximates Hessian)
Saddle Points
Points where gradient is zero but it's NOT a local minimum:
ℹ️ In high dimensions (like neural networks)
- Local minima are rare
- Saddle points are common
- Most critical points are saddle points
- This is actually GOOD news — it means gradient descent works better than expected
Hyperparameter Optimization
Finding the best settings for your ML model.
Grid Search:
ℹ️ Grid Search
Try all combinations:
- learning_rate: [0.001, 0.01, 0.1]
- batch_size: [32, 64, 128]
- → 9 combinations to try
Random Search:
ℹ️ Random Search
Randomly sample from the hyperparameter space. More efficient than grid search in practice.
Bayesian Optimization:
ℹ️ Bayesian Optimization
Build a surrogate model of the objective function. Use it to decide which hyperparameters to try next.
Tools: Optuna, Hyperopt, SMAC
📋Key Takeaways
-
Optimization finds the best parameters by minimizing a loss function. subject to constraints is the formal problem — every model training session is an optimization problem.
-
Convexity guarantees global solutions. A function is convex if , meaning any local minimum is the global minimum — linear regression, logistic regression, and SVMs are all convex.
-
Adam is the go-to optimizer. It combines momentum () with adaptive learning rates (), updating via .
-
The learning rate is the most important hyperparameter. Too large causes overshooting and divergence; too small causes painfully slow convergence. Start with 0.001 for Adam.
-
Lagrange multipliers convert constrained to unconstrained problems. The Lagrangian transforms constraints into penalty terms — foundational for SVM optimization (KKT conditions).
-
Non-convex landscapes have saddle points, not just local minima. In high dimensions, most critical points are saddle points, which gradient descent with momentum can escape — explaining why neural networks train well despite being non-convex.