Applications in Deep Learning
💡 Why It Matters
Optimization is the backbone of every modern machine learning system. From selecting the best portfolio of stocks to designing the architecture of a neural network, optimization algorithms determine how quickly we learn, how well we generalize, and whether we converge at all. Mastering these applications transforms you from someone who调用 model.fit() into someone who understands why it works and how to fix it when it doesn't.
Portfolio Optimization (Markowitz Mean-Variance)
Markowitz Portfolio Optimization
📝Example: Two-Asset Portfolio
Asset A: , . Asset B: , . Correlation .
Covariance: .
Covariance matrix:
💡Solution
Solving for , with target return :
Portfolio variance:
Portfolio std dev: (lower than either asset alone due to diversification).
Resource Allocation
DfResource Allocation Problem
Given resources with capacities () and tasks with resource requirements and profits , find task selections to maximize total profit:
This is the 0-1 Knapsack Problem (single resource) or Generalized Assignment Problem (multiple resources). Both are NP-hard, but efficient approximations and dynamic programming solutions exist.
ℹ️ LP Relaxation
The LP relaxation (allowing ) provides an upper bound. Rounding the LP solution yields a feasible integer solution within a factor of 2 of optimal. The integrality gap for the knapsack problem is bounded by , making it effectively a PTAS.
Network Design
DfNetwork Design Optimization
Design a network (e.g., telecommunications, transportation, or supply chain) by selecting edges to minimize cost while satisfying connectivity and flow requirements:
Where is the cost of edge , is the net flow at node , and is the capacity.
ThMinimum Spanning Tree (MST)
For undirected networks, the MST selects edges minimizing total cost while connecting all nodes. Kruskal's algorithm (greedy by edge weight) and Prim's algorithm (greedy by growing tree) both find the MST in time. The MST is the optimal solution to the network design problem when all demands are unit and all node degrees are unconstrained.
Scheduling Problems
DfJob-Shop Scheduling
Given jobs and machines , each job consists of a sequence of operations that must be processed on specific machines for specified durations. The objective is to minimize the makespan (total completion time) or total weighted completion time:
The job-shop problem is strongly NP-hard for . The flow-shop variant (all jobs follow the same machine sequence) is also NP-hard but admits effective heuristics like the Johnson's rule for .
⚠️ Complexity Trap
Don't confuse scheduling complexity classes: Single machine with release dates () is solvable in polynomial time. Two machines with release dates () is already NP-hard. Always check the problem variant before attempting an exact solution.
Feature Selection
L1 Regularization for Feature Selection
ThIrrepresentability Condition
The Lasso selects the correct support (nonzero features) with high probability if and only if the irrepresentability condition holds: the correlation between selected and unselected features must be bounded. Specifically, for some . When this condition fails, the Lasso may include irrelevant features or exclude relevant ones.
Neural Architecture Search (NAS)
DfNeural Architecture Search
NAS automates the design of neural network architectures by treating architecture selection as an optimization problem over a discrete search space :
Search strategies include: random search, grid search, evolutionary algorithms (mutation + crossover of architectures), reinforcement learning (controller network generates architectures, rewarded by validation accuracy), and differentiable NAS (DARTS) which relaxes the discrete search to continuous weights.
💡 DARTS Efficiency
DARTS (Differentiable Architecture Search) reduces NAS from GPU-years to GPU-hours by jointly optimizing architecture weights and network weights via bilevel optimization. The continuous relaxation allows gradient-based updates, but the discovered architectures often require careful post-processing to avoid performance collapse.
Reinforcement Learning as Optimization
Policy Gradient Optimization
ℹ️ Optimization Landscape
RL optimization is non-convex, stochastic, and has moving targets (the data distribution changes as the policy improves). Trust region methods (TRPO, PPO) and actor-critic architectures (A2C, SAC) stabilize training by restricting policy changes and reducing gradient variance.
AutoML
DfAutomated Machine Learning (AutoML)
AutoML automates the entire machine learning pipeline: feature engineering, model selection, hyperparameter tuning, and ensemble construction. The meta-optimization problem is:
Here represents hyperparameters (architecture, learning rate, regularization strength) and represents model parameters. AutoML frameworks include Auto-sklearn (Bayesian optimization + meta-learning), H2O AutoML (stacked ensembles), and Google AutoML (neural architecture search at scale).
⚠️ Computational Cost
AutoML can consume hundreds of GPU-hours. Always set a budget (time or number of trials) and use warm-starting from prior runs or meta-learning to reduce search cost. For most tabular problems, gradient boosting with tuned hyperparameters (via Optuna or Hyperopt) matches AutoML performance at a fraction of the cost.
Python Implementation
Portfolio Optimization
import numpy as np
from scipy.optimize import minimize
def markowitz_optimize(mu, Sigma, target_return):
n = len(mu)
def portfolio_variance(w):
return w @ Sigma @ w
constraints = [
{'type': 'eq', 'fun': lambda w: w @ mu - target_return},
{'type': 'eq', 'fun': lambda w: np.sum(w) - 1.0}
]
bounds = [(0, 1) for _ in range(n)]
result = minimize(portfolio_variance, x0=np.ones(n)/n,
method='SLSQP', bounds=bounds, constraints=constraints)
return result.x
# Example
mu = np.array([0.12, 0.08, 0.15])
Sigma = np.array([[0.04, 0.009, 0.012],
[0.009, 0.0225, 0.008],
[0.012, 0.008, 0.0625]])
w = markowitz_optimize(mu, Sigma, target_return=0.11)
print(f"Weights: {w}")
print(f"Portfolio return: {w @ mu:.4f}")
print(f"Portfolio risk: {np.sqrt(w @ Sigma @ w):.4f}")
Feature Selection with Lasso Path
import numpy as np
from sklearn.linear_model import lasso_path
def feature_selection_lasso(X, y, n_alphas=50):
"""Compute the full Lasso regularization path."""
alphas, coefs, _ = lasso_path(X, y, n_alphas=n_alphas)
# Identify feature entry order
selected_order = []
for i, alpha in enumerate(alphas[::-1]):
nonzero = np.where(np.abs(coefs[:, i]) > 1e-8)[0]
for f in nonzero:
if f not in selected_order:
selected_order.append(f)
return alphas, coefs, selected_order
# Example
np.random.seed(42)
n, p = 100, 20
X = np.random.randn(n, p)
true_support = [0, 3, 7, 12]
beta_true = np.zeros(p)
beta_true[true_support] = [2.0, -1.5, 3.0, -0.8]
y = X @ beta_true + 0.5 * np.random.randn(n)
alphas, coefs, order = feature_selection_lasso(X, y)
print(f"Feature entry order: {order}")
print(f"True support: {true_support}")
Neural Architecture Search (Simplified DARTS)
import torch
import torch.nn as nn
import torch.nn.functional as F
class MixedOperation(nn.Module):
"""DARTS-style mixed operation over candidate edges."""
def __init__(self, C_in, C_out, ops):
super().__init__()
self.ops = nn.ModuleList(ops)
self.arch_weights = nn.Parameter(torch.randn(len(ops)))
def forward(self, x):
weights = F.softmax(self.arch_weights, dim=0)
return sum(w * op(x) for w, op in zip(weights, self.ops))
# Candidate operations
ops = [
nn.Sequential(nn.Conv2d(16, 16, 3, padding=1), nn.BatchNorm2d(16), nn.ReLU()),
nn.Sequential(nn.Conv2d(16, 16, 5, padding=2), nn.BatchNorm2d(16), nn.ReLU()),
nn.MaxPool2d(3, stride=1, padding=1),
nn.Identity(),
]
mixed = MixedOperation(16, 16, ops)
x = torch.randn(1, 16, 32, 32)
out = mixed(x)
print(f"Output shape: {out.shape}")
# After bilevel optimization, extract best architecture
best_idx = mixed.arch_weights.argmax().item()
print(f"Best operation index: {best_idx}")
PPO Policy Optimization
import torch
import torch.nn as nn
class ActorCritic(nn.Module):
def __init__(self, state_dim, action_dim):
super().__init__()
self.shared = nn.Sequential(
nn.Linear(state_dim, 64), nn.ReLU(),
nn.Linear(64, 64), nn.ReLU()
)
self.policy = nn.Linear(64, action_dim)
self.value = nn.Linear(64, 1)
def forward(self, state):
features = self.shared(state)
return F.softmax(self.policy(features), dim=-1), self.value(features)
def ppo_loss(old_log_probs, new_log_probs, advantages, eps=0.2):
ratio = torch.exp(new_log_probs - old_log_probs)
clipped = torch.clamp(ratio, 1 - eps, 1 + eps)
return -torch.min(ratio * advantages, clipped * advantages).mean()
# Example usage
state_dim, action_dim = 4, 2
model = ActorCritic(state_dim, action_dim)
state = torch.randn(1, state_dim)
probs, value = model(state)
print(f"Action probs: {probs}, Value: {value.item():.4f}")
Common Mistakes
| Mistake | Why It Fails | Fix |
|---|---|---|
| Using SGD without momentum on deep networks | Converges slowly, gets stuck in saddle points | Add momentum () or use Adam |
| Setting learning rate too high in Adam | Loss diverges, gradients explode | Start with , use LR finder |
| Forgetting to zero gradients | Gradients accumulate across batches, wrong updates | Call optimizer.zero_grad() before each backward pass |
| Not scaling features for Lasso | Features with larger scale dominate the penalty | Standardize features to zero mean, unit variance |
| Using penalty instead of weight decay in Adam | L2 and weight decay are NOT equivalent in Adam | Use AdamW for proper weight decay |
| Ignoring covariance in portfolio optimization | Underestimates risk, over-concentrates positions | Always use full covariance matrix |
| Stopping NAS too early | Final architecture is suboptimal, no convergence | Use early stopping with patience, not fixed epochs |
| Not clipping gradients in RNNs | Exploding gradients cause NaN loss and divergence | Always clip gradients for recurrent architectures |
| Using discrete search for NAS without warm-start | Prohibitively expensive (years of GPU time) | Use DARTS or Bayesian optimization with warm-starting |
| Over-regularizing with high in Lasso | Model becomes too sparse, high bias | Cross-validate using the one-standard-error rule |
Interview Questions
-
Why is the Lasso able to perform feature selection while Ridge regression is not? The Lasso's penalty creates a diamond-shaped constraint region with corners on the axes. The solution (intersection of the loss ellipse with the constraint) often lands exactly on a corner, setting some coefficients to zero. Ridge's ball has no corners, so coefficients shrink but never reach exactly zero.
-
Explain the efficient frontier in portfolio theory. What assumption breaks it? The efficient frontier is the set of portfolios achieving minimum variance for each target return. It breaks when returns are not normally distributed (fat tails, skewness) or when the covariance matrix is estimated with error (estimation error amplifies with the number of assets).
-
Why is PPO preferred over vanilla policy gradient in practice? PPO constrains policy updates via clipping, preventing destructive large steps that can collapse performance. It uses off-policy data (collected by the old policy) efficiently, reduces variance with advantage estimation, and is more stable and sample-efficient than REINFORCE.
-
When does DARTS fail? What is the "performance collapse" problem? DARTS tends to select skip connections (identity operations) disproportionately, especially in deeper cells, because skip connections have lower training loss early in search. This leads to degenerate architectures. Solutions: enforce a minimum number of non-skip operations, use early stopping before convergence.
-
How would you scale AutoML to millions of hyperparameter configurations? Use Bayesian optimization (Gaussian processes or Tree-structured Parzen Estimators) with warm-starting from prior runs. Meta-learning from similar datasets reduces the search space. For extreme scale, use population-based training (PBT) which evolves a population of models in parallel.
-
What is the difference between the 0-1 knapsack and the fractional knapsack? The 0-1 knapsack (items cannot be divided) is NP-hard, solved by dynamic programming in . The fractional knapsack (items can be divided) is solvable greedily in by taking the highest value-to-weight ratio items first.
-
Why does gradient clipping use the norm rather than clipping each gradient component independently? Clipping by norm preserves the direction of the gradient (the relative magnitudes of components are maintained). Component-wise clipping distorts the gradient direction, which can harm convergence. Norm clipping rescales: .
-
Explain the irrepresentability condition for Lasso feature selection. It requires that the correlation between relevant and irrelevant features is bounded: . If violated, the Lasso includes irrelevant features (false positives) because they are correlated with relevant features. Group Lasso or adaptive Lasso can partially address this.
Practice Problems
📝Problem 1: Portfolio Risk
You have three assets with and covariance matrix where , . Find the minimum variance portfolio with expected return .
💡Solution
Using the Lagrangian with constraints and :
Solve:
Setting up the system and solving numerically:
Portfolio variance: , so .
📝Problem 2: Feature Selection
Given a dataset with 50 features, 20 of which are truly relevant, and you run Lasso with cross-validated . The model selects 25 features. What can you conclude?
💡Solution
The model has 5 false positives and likely some false negatives. Possible causes: (1) Irrepresentability condition is violated — some irrelevant features are correlated with relevant ones. (2) Features have different scales — standardize before applying Lasso. (3) Consider using adaptive Lasso (reweighted ) or stability selection to reduce false discoveries.
📝Problem 3: RL Optimization
In PPO, you set . The old policy gives , and the new policy gives . The advantage estimate is . What is the PPO surrogate loss for this transition?
💡Solution
Probability ratio:
Unclipped objective:
Clipped objective:
PPO loss:
The clipping prevents the update from being too aggressive, even though the advantage is positive (we want to increase this action's probability).
Quick Reference
| Topic | Key Formula / Concept | When to Use |
|---|---|---|
| Portfolio Optimization | s.t. | Asset allocation, risk management |
| Resource Allocation | s.t. | Knapsack, scheduling, budgeting |
| Network Design | s.t. flow constraints | Telecommunications, logistics |
| Scheduling | Minimize makespan with precedence constraints | Manufacturing, cloud computing |
| Feature Selection (Lasso) | High-dimensional regression, interpretability | |
| NAS (DARTS) | Bilevel optimization over architecture and weights | Automated model design |
| Policy Gradient | RL policy optimization | |
| PPO | Clipped surrogate: | Stable RL training |
| AutoML | Hyperparameter optimization + pipeline search | End-to-end ML automation |
Cross-References
- Optimization Fundamentals: See Convex Optimization Basics for foundations of convexity, duality, and gradient methods
- Gradient Descent: See Stochastic Gradient Descent for SGD variants and convergence analysis
- Regularization: See Regularization Techniques for , , elastic net, and their theoretical properties
- Reinforcement Learning: See Policy Gradient Methods for REINFORCE, A2C, and actor-critic algorithms
- Neural Architecture Search: See AutoML and NAS for DARTS, evolutionary search, and Bayesian optimization
- Portfolio Theory: See Markowitz Model for detailed derivation of the efficient frontier and CAPM
- Combinatorial Optimization: See Integer Programming for branch-and-bound, cutting planes, and LP relaxation
- Hyperparameter Tuning: See Bayesian Optimization for Gaussian process surrogates and acquisition functions
- Scaling Laws: See Neural Scaling Laws for compute-optimal training and Chinchilla-optimal model sizing
Key Takeaways
📋Summary
- Optimization pervades ML: From portfolio selection to architecture search, optimization is the common thread connecting diverse applications. Understanding the problem structure (convex vs. non-convex, continuous vs. discrete, constrained vs. unconstrained) determines the right algorithm.
- Portfolio Optimization: Markowitz mean-variance is quadratic programming. The efficient frontier gives the risk-return tradeoff; Sharpe ratio selects the optimal portfolio. Real-world challenges: estimation error in , non-normal returns.
- Resource Allocation: Knapsack and assignment problems are NP-hard but have efficient approximations (LP relaxation, greedy). Always check if the problem has special structure (matroid, submodularity) that enables better guarantees.
- Network Design: MST and min-cost flow are polynomial; general network design is NP-hard. Greedy and primal-dual algorithms give constant-factor approximations.
- Scheduling: Job-shop is strongly NP-hard for . Heuristics (dispatch rules, genetic algorithms) and LP-based bounds are practical. Always check the problem variant ( vs. ).
- Feature Selection: Lasso () selects sparse models; Ridge () shrinks but doesn't select. Adaptive Lasso and stability selection reduce false discoveries. Standardize features before applying Lasso.
- NAS: DARTS enables gradient-based architecture search but suffers from performance collapse. Use early stopping and enforce operation diversity. For tabular data, tuned gradient boosting often matches NAS-discovered architectures.
- RL as Optimization: Policy gradient methods optimize directly. PPO's clipping prevents destructive updates. Trust region methods (TRPO, PPO) are more stable than vanilla policy gradient.
- AutoML: Automates the ML pipeline but can be expensive. Bayesian optimization with warm-starting is the standard approach. For most problems, manual tuning of gradient boosting matches AutoML at lower cost.
- Python Implementation: Use
scipy.optimizefor portfolio problems,sklearn.linear_model.lasso_pathfor feature selection, and PyTorch/TensorFlow for NAS and RL. Always validate with cross-validation. - Common Pitfalls: Not zeroing gradients, using L2 penalty instead of weight decay in Adam, ignoring covariance estimation error, not clipping gradients in RNNs, and over-regularizing with high .
- Interview Essentials: Know why Lasso selects features (geometry of ball), why PPO is preferred (clipping prevents destructive updates), and the difference between 0-1 and fractional knapsack.