Applications in Deep Learning

💡 Why It Matters

Optimization is the backbone of every modern machine learning system. From selecting the best portfolio of stocks to designing the architecture of a neural network, optimization algorithms determine how quickly we learn, how well we generalize, and whether we converge at all. Mastering these applications transforms you from someone who调用 model.fit() into someone who understands why it works and how to fix it when it doesn't.

Portfolio Optimization (Markowitz Mean-Variance)

Markowitz Portfolio Optimization

📝Example: Two-Asset Portfolio

Asset A: $\mu_A = 0.12$ , $\sigma_A = 0.20$ . Asset B: $\mu_B = 0.08$ , $\sigma_B = 0.15$ . Correlation $\rho = 0.3$ .

Covariance: $\sigma_{AB} = \rho \cdot \sigma_A \cdot \sigma_B = 0.3 \times 0.20 \times 0.15 = 0.009$ .

Covariance matrix: $\boldsymbol{\Sigma} = \begin{pmatrix} 0.04 & 0.009 \\ 0.009 & 0.0225 \end{pmatrix}$

💡Solution

Solving for $w_A = 0.6$ , $w_B = 0.4$ with target return $r_{\text{target}} = 0.104$ :

Portfolio variance: $(0.6)^2(0.04) + 2(0.6)(0.4)(0.009) + (0.4)^2(0.0225) = 0.02246$

Portfolio std dev: $\sqrt{0.02246} = 0.1499$ (lower than either asset alone due to diversification).

Resource Allocation

DfResource Allocation Problem

Given $m$ resources with capacities $c_j$ ( $j = 1, \dots, m$ ) and $n$ tasks with resource requirements $a_{ij}$ and profits $p_i$ , find task selections $x_i \in \{0, 1\}$ to maximize total profit:

\max_{\boldsymbol{x}} \quad \sum_{i=1}^{n} p_i x_i

\text{s.t.} \quad \sum_{i=1}^{n} a_{ij} x_i \leq c_j \quad \forall j = 1, \dots, m

x_i \in \{0, 1\} \quad \forall i = 1, \dots, n

This is the 0-1 Knapsack Problem (single resource) or Generalized Assignment Problem (multiple resources). Both are NP-hard, but efficient approximations and dynamic programming solutions exist.

ℹ️ LP Relaxation

The LP relaxation (allowing $x_i \in [0, 1]$ ) provides an upper bound. Rounding the LP solution yields a feasible integer solution within a factor of 2 of optimal. The integrality gap for the knapsack problem is bounded by $(1 + 1/\text{OPT})$ , making it effectively a PTAS.

Network Design

DfNetwork Design Optimization

Design a network (e.g., telecommunications, transportation, or supply chain) by selecting edges to minimize cost while satisfying connectivity and flow requirements:

\min_{\boldsymbol{x}} \quad \sum_{(i,j) \in E} c_{ij} x_{ij}

\text{s.t.} \quad \sum_{j:(i,j)\in E} x_{ij} - \sum_{j:(j,i)\in E} x_{ji} = d_i \quad \forall i \in V

x_{ij} \geq 0, \quad x_{ij} \leq u_{ij} \quad \forall (i,j) \in E

Where $c_{ij}$ is the cost of edge $(i,j)$ , $d_i$ is the net flow at node $i$ , and $u_{ij}$ is the capacity.

ThMinimum Spanning Tree (MST)

For undirected networks, the MST selects edges minimizing total cost while connecting all nodes. Kruskal's algorithm (greedy by edge weight) and Prim's algorithm (greedy by growing tree) both find the MST in $O(E \log V)$ time. The MST is the optimal solution to the network design problem when all demands are unit and all node degrees are unconstrained.

Scheduling Problems

DfJob-Shop Scheduling

Given $n$ jobs $J_1, \dots, J_n$ and $m$ machines $M_1, \dots, M_m$ , each job $J_i$ consists of a sequence of operations that must be processed on specific machines for specified durations. The objective is to minimize the makespan (total completion time) or total weighted completion time:

\min \quad C_{\max} = \max_{i} C_i

\text{s.t.} \quad C_i \geq r_i + \sum_{k} p_{ik} \quad \text{(precedence)}

\text{No two operations on the same machine overlap.}

The job-shop problem is strongly NP-hard for $m \geq 2$ . The flow-shop variant (all jobs follow the same machine sequence) is also NP-hard but admits effective heuristics like the Johnson's rule for $m = 2$ .

⚠️ Complexity Trap

Don't confuse scheduling complexity classes: Single machine with release dates ( $1|r_i|C_{\max}$ ) is solvable in polynomial time. Two machines with release dates ( $2|r_i|C_{\max}$ ) is already NP-hard. Always check the problem variant before attempting an exact solution.

Feature Selection

L1 Regularization for Feature Selection

ThIrrepresentability Condition

The Lasso selects the correct support (nonzero features) with high probability if and only if the irrepresentability condition holds: the correlation between selected and unselected features must be bounded. Specifically, $\|\boldsymbol{X}_S^\top \boldsymbol{X}_{S^c}\|_\infty < 1 - \delta$ for some $\delta > 0$ . When this condition fails, the Lasso may include irrelevant features or exclude relevant ones.

Neural Architecture Search (NAS)

DfNeural Architecture Search

NAS automates the design of neural network architectures by treating architecture selection as an optimization problem over a discrete search space $\mathcal{A}$ :

\boldsymbol{\alpha}^* = \arg\min_{\boldsymbol{\alpha} \in \mathcal{A}} \mathcal{L}_{\text{val}}(\boldsymbol{w}^*(\boldsymbol{\alpha}), \boldsymbol{\alpha})

\text{where} \quad \boldsymbol{w}^*(\boldsymbol{\alpha}) = \arg\min_{\boldsymbol{w}} \mathcal{L}_{\text{train}}(\boldsymbol{w}, \boldsymbol{\alpha})

Search strategies include: random search, grid search, evolutionary algorithms (mutation + crossover of architectures), reinforcement learning (controller network generates architectures, rewarded by validation accuracy), and differentiable NAS (DARTS) which relaxes the discrete search to continuous weights.

💡 DARTS Efficiency

DARTS (Differentiable Architecture Search) reduces NAS from GPU-years to GPU-hours by jointly optimizing architecture weights $\boldsymbol{\alpha}$ and network weights $\boldsymbol{w}$ via bilevel optimization. The continuous relaxation allows gradient-based updates, but the discovered architectures often require careful post-processing to avoid performance collapse.

Reinforcement Learning as Optimization

Policy Gradient Optimization

ℹ️ Optimization Landscape

RL optimization is non-convex, stochastic, and has moving targets (the data distribution changes as the policy improves). Trust region methods (TRPO, PPO) and actor-critic architectures (A2C, SAC) stabilize training by restricting policy changes and reducing gradient variance.

AutoML

DfAutomated Machine Learning (AutoML)

AutoML automates the entire machine learning pipeline: feature engineering, model selection, hyperparameter tuning, and ensemble construction. The meta-optimization problem is:

\min_{\boldsymbol{\lambda} \in \Lambda} \quad \mathcal{L}_{\text{val}}(f_{\boldsymbol{\lambda}^*(\boldsymbol{\lambda})}, D_{\text{val}})

\text{where} \quad \boldsymbol{\lambda}^*(\boldsymbol{\lambda}) = \arg\min_{\boldsymbol{\theta}} \mathcal{L}_{\text{train}}(f_{\boldsymbol{\theta}}, D_{\text{train}}; \boldsymbol{\lambda})

Here $\boldsymbol{\lambda}$ represents hyperparameters (architecture, learning rate, regularization strength) and $\boldsymbol{\theta}$ represents model parameters. AutoML frameworks include Auto-sklearn (Bayesian optimization + meta-learning), H2O AutoML (stacked ensembles), and Google AutoML (neural architecture search at scale).

⚠️ Computational Cost

AutoML can consume hundreds of GPU-hours. Always set a budget (time or number of trials) and use warm-starting from prior runs or meta-learning to reduce search cost. For most tabular problems, gradient boosting with tuned hyperparameters (via Optuna or Hyperopt) matches AutoML performance at a fraction of the cost.

Python Implementation

Portfolio Optimization

import numpy as np
from scipy.optimize import minimize

def markowitz_optimize(mu, Sigma, target_return):
    n = len(mu)
    
    def portfolio_variance(w):
        return w @ Sigma @ w
    
    constraints = [
        {'type': 'eq', 'fun': lambda w: w @ mu - target_return},
        {'type': 'eq', 'fun': lambda w: np.sum(w) - 1.0}
    ]
    bounds = [(0, 1) for _ in range(n)]
    
    result = minimize(portfolio_variance, x0=np.ones(n)/n,
                      method='SLSQP', bounds=bounds, constraints=constraints)
    return result.x

# Example
mu = np.array([0.12, 0.08, 0.15])
Sigma = np.array([[0.04, 0.009, 0.012],
                  [0.009, 0.0225, 0.008],
                  [0.012, 0.008, 0.0625]])

w = markowitz_optimize(mu, Sigma, target_return=0.11)
print(f"Weights: {w}")
print(f"Portfolio return: {w @ mu:.4f}")
print(f"Portfolio risk: {np.sqrt(w @ Sigma @ w):.4f}")

Feature Selection with Lasso Path

import numpy as np
from sklearn.linear_model import lasso_path

def feature_selection_lasso(X, y, n_alphas=50):
    """Compute the full Lasso regularization path."""
    alphas, coefs, _ = lasso_path(X, y, n_alphas=n_alphas)
    
    # Identify feature entry order
    selected_order = []
    for i, alpha in enumerate(alphas[::-1]):
        nonzero = np.where(np.abs(coefs[:, i]) > 1e-8)[0]
        for f in nonzero:
            if f not in selected_order:
                selected_order.append(f)
    
    return alphas, coefs, selected_order

# Example
np.random.seed(42)
n, p = 100, 20
X = np.random.randn(n, p)
true_support = [0, 3, 7, 12]
beta_true = np.zeros(p)
beta_true[true_support] = [2.0, -1.5, 3.0, -0.8]
y = X @ beta_true + 0.5 * np.random.randn(n)

alphas, coefs, order = feature_selection_lasso(X, y)
print(f"Feature entry order: {order}")
print(f"True support: {true_support}")

Neural Architecture Search (Simplified DARTS)

import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOperation(nn.Module):
    """DARTS-style mixed operation over candidate edges."""
    def __init__(self, C_in, C_out, ops):
        super().__init__()
        self.ops = nn.ModuleList(ops)
        self.arch_weights = nn.Parameter(torch.randn(len(ops)))
    
    def forward(self, x):
        weights = F.softmax(self.arch_weights, dim=0)
        return sum(w * op(x) for w, op in zip(weights, self.ops))

# Candidate operations
ops = [
    nn.Sequential(nn.Conv2d(16, 16, 3, padding=1), nn.BatchNorm2d(16), nn.ReLU()),
    nn.Sequential(nn.Conv2d(16, 16, 5, padding=2), nn.BatchNorm2d(16), nn.ReLU()),
    nn.MaxPool2d(3, stride=1, padding=1),
    nn.Identity(),
]

mixed = MixedOperation(16, 16, ops)
x = torch.randn(1, 16, 32, 32)
out = mixed(x)
print(f"Output shape: {out.shape}")

# After bilevel optimization, extract best architecture
best_idx = mixed.arch_weights.argmax().item()
print(f"Best operation index: {best_idx}")

PPO Policy Optimization

import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU()
        )
        self.policy = nn.Linear(64, action_dim)
        self.value = nn.Linear(64, 1)
    
    def forward(self, state):
        features = self.shared(state)
        return F.softmax(self.policy(features), dim=-1), self.value(features)

def ppo_loss(old_log_probs, new_log_probs, advantages, eps=0.2):
    ratio = torch.exp(new_log_probs - old_log_probs)
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps)
    return -torch.min(ratio * advantages, clipped * advantages).mean()

# Example usage
state_dim, action_dim = 4, 2
model = ActorCritic(state_dim, action_dim)
state = torch.randn(1, state_dim)
probs, value = model(state)
print(f"Action probs: {probs}, Value: {value.item():.4f}")

Common Mistakes

Mistake	Why It Fails	Fix
Using SGD without momentum on deep networks	Converges slowly, gets stuck in saddle points	Add momentum ( $\beta = 0.9$ ) or use Adam
Setting learning rate too high in Adam	Loss diverges, gradients explode	Start with $\alpha = 0.001$ , use LR finder
Forgetting to zero gradients	Gradients accumulate across batches, wrong updates	Call `optimizer.zero_grad()` before each backward pass
Not scaling features for Lasso	Features with larger scale dominate the penalty	Standardize features to zero mean, unit variance
Using $\ell_2$ penalty instead of weight decay in Adam	L2 and weight decay are NOT equivalent in Adam	Use AdamW for proper weight decay
Ignoring covariance in portfolio optimization	Underestimates risk, over-concentrates positions	Always use full covariance matrix $\boldsymbol{\Sigma}$
Stopping NAS too early	Final architecture is suboptimal, no convergence	Use early stopping with patience, not fixed epochs
Not clipping gradients in RNNs	Exploding gradients cause NaN loss and divergence	Always clip gradients for recurrent architectures
Using discrete search for NAS without warm-start	Prohibitively expensive (years of GPU time)	Use DARTS or Bayesian optimization with warm-starting
Over-regularizing with high $\lambda$ in Lasso	Model becomes too sparse, high bias	Cross-validate $\lambda$ using the one-standard-error rule

Interview Questions

Why is the Lasso able to perform feature selection while Ridge regression is not? The Lasso's $\ell_1$ penalty creates a diamond-shaped constraint region with corners on the axes. The solution (intersection of the loss ellipse with the constraint) often lands exactly on a corner, setting some coefficients to zero. Ridge's $\ell_2$ ball has no corners, so coefficients shrink but never reach exactly zero.
Explain the efficient frontier in portfolio theory. What assumption breaks it? The efficient frontier is the set of portfolios achieving minimum variance for each target return. It breaks when returns are not normally distributed (fat tails, skewness) or when the covariance matrix is estimated with error (estimation error amplifies with the number of assets).
Why is PPO preferred over vanilla policy gradient in practice? PPO constrains policy updates via clipping, preventing destructive large steps that can collapse performance. It uses off-policy data (collected by the old policy) efficiently, reduces variance with advantage estimation, and is more stable and sample-efficient than REINFORCE.
When does DARTS fail? What is the "performance collapse" problem? DARTS tends to select skip connections (identity operations) disproportionately, especially in deeper cells, because skip connections have lower training loss early in search. This leads to degenerate architectures. Solutions: enforce a minimum number of non-skip operations, use early stopping before convergence.
How would you scale AutoML to millions of hyperparameter configurations? Use Bayesian optimization (Gaussian processes or Tree-structured Parzen Estimators) with warm-starting from prior runs. Meta-learning from similar datasets reduces the search space. For extreme scale, use population-based training (PBT) which evolves a population of models in parallel.
What is the difference between the 0-1 knapsack and the fractional knapsack? The 0-1 knapsack (items cannot be divided) is NP-hard, solved by dynamic programming in $O(nW)$ . The fractional knapsack (items can be divided) is solvable greedily in $O(n \log n)$ by taking the highest value-to-weight ratio items first.
Why does gradient clipping use the norm rather than clipping each gradient component independently? Clipping by norm preserves the direction of the gradient (the relative magnitudes of components are maintained). Component-wise clipping distorts the gradient direction, which can harm convergence. Norm clipping rescales: $\boldsymbol{g} \leftarrow \boldsymbol{g} \cdot \min(1, \tau / \|\boldsymbol{g}\|)$ .
Explain the irrepresentability condition for Lasso feature selection. It requires that the correlation between relevant and irrelevant features is bounded: $\|\boldsymbol{X}_S^\top \boldsymbol{X}_{S^c}\|_\infty < 1 - \delta$ . If violated, the Lasso includes irrelevant features (false positives) because they are correlated with relevant features. Group Lasso or adaptive Lasso can partially address this.

Practice Problems

📝Problem 1: Portfolio Risk

You have three assets with $\boldsymbol{\mu} = [0.10, 0.15, 0.08]^\top$ and covariance matrix $\boldsymbol{\Sigma}$ where $\sigma_{11} = 0.04, \sigma_{22} = 0.09, \sigma_{33} = 0.025$ , $\sigma_{12} = 0.006, \sigma_{13} = 0.004, \sigma_{23} = 0.003$ . Find the minimum variance portfolio with expected return $0.11$ .

💡Solution

Using the Lagrangian with constraints $\boldsymbol{w}^\top \boldsymbol{\mu} = 0.11$ and $\boldsymbol{w}^\top \mathbf{1} = 1$ :

Solve: $\boldsymbol{\Sigma} \boldsymbol{w} = \lambda_1 \boldsymbol{\mu} + \lambda_2 \mathbf{1}$

Setting up the system and solving numerically: $\boldsymbol{w} \approx [0.45, 0.25, 0.30]^\top$

Portfolio variance: $\boldsymbol{w}^\top \boldsymbol{\Sigma} \boldsymbol{w} \approx 0.0269$ , so $\sigma_p \approx 0.164$ .

📝Problem 2: Feature Selection

Given a dataset with 50 features, 20 of which are truly relevant, and you run Lasso with cross-validated $\lambda$ . The model selects 25 features. What can you conclude?

💡Solution

The model has 5 false positives and likely some false negatives. Possible causes: (1) Irrepresentability condition is violated — some irrelevant features are correlated with relevant ones. (2) Features have different scales — standardize before applying Lasso. (3) Consider using adaptive Lasso (reweighted $\ell_1$ ) or stability selection to reduce false discoveries.

📝Problem 3: RL Optimization

In PPO, you set $\epsilon = 0.2$ . The old policy gives $\pi_{\text{old}}(a|s) = 0.3$ , and the new policy gives $\pi_{\text{new}}(a|s) = 0.6$ . The advantage estimate is $\hat{A} = 1.5$ . What is the PPO surrogate loss for this transition?

💡Solution

Probability ratio: $r = 0.6 / 0.3 = 2.0$

Unclipped objective: $r \cdot \hat{A} = 2.0 \times 1.5 = 3.0$

Clipped objective: $\text{clip}(r, 0.8, 1.2) \cdot \hat{A} = 1.2 \times 1.5 = 1.8$

PPO loss: $-\min(3.0, 1.8) = -1.8$

The clipping prevents the update from being too aggressive, even though the advantage is positive (we want to increase this action's probability).

Quick Reference

Topic	Key Formula / Concept	When to Use
Portfolio Optimization	$\min \boldsymbol{w}^\top \boldsymbol{\Sigma} \boldsymbol{w}$ s.t. $\boldsymbol{w}^\top \boldsymbol{\mu} = r$	Asset allocation, risk management
Resource Allocation	$\max \sum p_i x_i$ s.t. $\sum a_{ij} x_i \leq c_j$	Knapsack, scheduling, budgeting
Network Design	$\min \sum c_{ij} x_{ij}$ s.t. flow constraints	Telecommunications, logistics
Scheduling	Minimize makespan $C_{\max}$ with precedence constraints	Manufacturing, cloud computing
Feature Selection (Lasso)	$\min \\|\boldsymbol{y} - \boldsymbol{X}\boldsymbol{\beta}\\|_2^2 + \lambda\\|\boldsymbol{\beta}\\|_1$	High-dimensional regression, interpretability
NAS (DARTS)	Bilevel optimization over architecture $\boldsymbol{\alpha}$ and weights $\boldsymbol{w}$	Automated model design
Policy Gradient	$\nabla_{\boldsymbol{\theta}} J = \mathbb{E}[\sum \gamma^t r_t \nabla \log \pi]$	RL policy optimization
PPO	Clipped surrogate: $\min(r\hat{A}, \text{clip}(r, 1\pm\epsilon)\hat{A})$	Stable RL training
AutoML	Hyperparameter optimization + pipeline search	End-to-end ML automation

Cross-References

Optimization Fundamentals: See Convex Optimization Basics for foundations of convexity, duality, and gradient methods
Gradient Descent: See Stochastic Gradient Descent for SGD variants and convergence analysis
Regularization: See Regularization Techniques for $\ell_1$ , $\ell_2$ , elastic net, and their theoretical properties
Reinforcement Learning: See Policy Gradient Methods for REINFORCE, A2C, and actor-critic algorithms
Neural Architecture Search: See AutoML and NAS for DARTS, evolutionary search, and Bayesian optimization
Portfolio Theory: See Markowitz Model for detailed derivation of the efficient frontier and CAPM
Combinatorial Optimization: See Integer Programming for branch-and-bound, cutting planes, and LP relaxation
Hyperparameter Tuning: See Bayesian Optimization for Gaussian process surrogates and acquisition functions
Scaling Laws: See Neural Scaling Laws for compute-optimal training and Chinchilla-optimal model sizing

Key Takeaways

📋Summary

Optimization pervades ML: From portfolio selection to architecture search, optimization is the common thread connecting diverse applications. Understanding the problem structure (convex vs. non-convex, continuous vs. discrete, constrained vs. unconstrained) determines the right algorithm.
Portfolio Optimization: Markowitz mean-variance is quadratic programming. The efficient frontier gives the risk-return tradeoff; Sharpe ratio selects the optimal portfolio. Real-world challenges: estimation error in $\boldsymbol{\Sigma}$ , non-normal returns.
Resource Allocation: Knapsack and assignment problems are NP-hard but have efficient approximations (LP relaxation, greedy). Always check if the problem has special structure (matroid, submodularity) that enables better guarantees.
Network Design: MST and min-cost flow are polynomial; general network design is NP-hard. Greedy and primal-dual algorithms give constant-factor approximations.
Scheduling: Job-shop is strongly NP-hard for $m \geq 2$ . Heuristics (dispatch rules, genetic algorithms) and LP-based bounds are practical. Always check the problem variant ( $1|r_i|C_{\max}$ vs. $J_m|C_{\max}$ ).
Feature Selection: Lasso ( $\ell_1$ ) selects sparse models; Ridge ( $\ell_2$ ) shrinks but doesn't select. Adaptive Lasso and stability selection reduce false discoveries. Standardize features before applying Lasso.
NAS: DARTS enables gradient-based architecture search but suffers from performance collapse. Use early stopping and enforce operation diversity. For tabular data, tuned gradient boosting often matches NAS-discovered architectures.
RL as Optimization: Policy gradient methods optimize $\mathbb{E}[\sum \gamma^t r_t]$ directly. PPO's clipping prevents destructive updates. Trust region methods (TRPO, PPO) are more stable than vanilla policy gradient.
AutoML: Automates the ML pipeline but can be expensive. Bayesian optimization with warm-starting is the standard approach. For most problems, manual tuning of gradient boosting matches AutoML at lower cost.
Python Implementation: Use scipy.optimize for portfolio problems, sklearn.linear_model.lasso_path for feature selection, and PyTorch/TensorFlow for NAS and RL. Always validate with cross-validation.
Common Pitfalls: Not zeroing gradients, using L2 penalty instead of weight decay in Adam, ignoring covariance estimation error, not clipping gradients in RNNs, and over-regularizing with high $\lambda$ .
Interview Essentials: Know why Lasso selects features (geometry of $\ell_1$ ball), why PPO is preferred (clipping prevents destructive updates), and the difference between 0-1 and fractional knapsack.