Applications in Deep Learning

See how optimization powers neural network training, GANs, and reinforcement learning.


💡 Why It Matters

Optimization is the backbone of every modern machine learning system. From selecting the best portfolio of stocks to designing the architecture of a neural network, optimization algorithms determine how quickly we learn, how well we generalize, and whether we converge at all. Mastering these applications transforms you from someone who calls model.fit() into someone who understands why it works and how to fix it when it doesn't.


Portfolio Optimization (Markowitz Mean-Variance)

Markowitz Portfolio Optimization

📝Example: Two-Asset Portfolio

Asset A: $\mu_A = 0.12$, $\sigma_A = 0.20$. Asset B: $\mu_B = 0.08$, $\sigma_B = 0.15$. Correlation $\rho = 0.3$.

Covariance: $\sigma_{AB} = \rho \cdot \sigma_A \cdot \sigma_B = 0.3 \times 0.20 \times 0.15 = 0.009$.

Covariance matrix: $\boldsymbol{\Sigma} = \begin{pmatrix} 0.04 & 0.009 \\ 0.009 & 0.0225 \end{pmatrix}$

💡Solution

With two assets, the budget constraint $w_A + w_B = 1$ and the target return $r_{\text{target}} = 0.104$ determine the weights: $0.12\,w_A + 0.08\,(1 - w_A) = 0.104$ gives $w_A = 0.6$, $w_B = 0.4$.

Portfolio variance: $(0.6)^2(0.04) + 2(0.6)(0.4)(0.009) + (0.4)^2(0.0225) = 0.0144 + 0.00432 + 0.0036 = 0.02232$

Portfolio std dev: $\sqrt{0.02232} \approx 0.1494$ (lower than either asset alone due to diversification).
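The two-asset arithmetic can be checked in a few lines of NumPy. This is a minimal sketch that only reuses the numbers from the example; the variable names are illustrative.

```python
import numpy as np

# Inputs from the two-asset example
mu = np.array([0.12, 0.08])        # expected returns of A and B
sigma = np.array([0.20, 0.15])     # standard deviations of A and B
rho = 0.3                          # correlation between A and B

# Build the covariance matrix from the standard deviations and correlation
cov_ab = rho * sigma[0] * sigma[1]
Sigma = np.array([[sigma[0] ** 2, cov_ab],
                  [cov_ab, sigma[1] ** 2]])

w = np.array([0.6, 0.4])           # portfolio weights

port_return = float(w @ mu)        # w^T mu
port_var = float(w @ Sigma @ w)    # w^T Sigma w
port_std = port_var ** 0.5

print(f"return={port_return:.4f}, variance={port_var:.5f}, std={port_std:.4f}")
```

The quadratic form `w @ Sigma @ w` expands to exactly the three terms written out in the solution.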


Resource Allocation

Definition: Resource Allocation Problem

Given $m$ resources with capacities $c_j$ ($j = 1, \dots, m$) and $n$ tasks with resource requirements $a_{ij}$ and profits $p_i$, find task selections $x_i \in \{0, 1\}$ to maximize total profit:

$$\max_{\boldsymbol{x}} \quad \sum_{i=1}^{n} p_i x_i$$
$$\text{s.t.} \quad \sum_{i=1}^{n} a_{ij} x_i \leq c_j \quad \forall j = 1, \dots, m$$
$$x_i \in \{0, 1\} \quad \forall i = 1, \dots, n$$

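With a single resource ($m = 1$) this is the 0-1 knapsack problem; with several resources it is the multidimensional knapsack problem. Both are NP-hard. The single-resource case has a pseudo-polynomial dynamic program and good approximation schemes; the multi-resource case is harder to approximate and is usually attacked with LP-based bounds and heuristics.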

ℹ️ LP Relaxation

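The LP relaxation (allowing $x_i \in [0, 1]$) provides an upper bound on the optimal profit. For the single-resource knapsack, the LP optimum has at most one fractional item, and taking the better of (a) the LP solution with that item dropped and (b) the single most profitable item that fits yields a feasible integer solution with at least half the optimal profit. The integrality gap of this relaxation can approach 2, so tighter guarantees come from elsewhere: dynamic programming with profit scaling gives a fully polynomial-time approximation scheme (FPTAS) for the knapsack problem.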
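The relationship between the integer optimum, the LP bound, and a rounded solution can be sketched for the single-resource case. The code below is an illustration under stated assumptions: weights are positive integers, and the function names (`knapsack_dp`, `lp_relaxation`) and the three-item instance are made up for this example.

```python
def knapsack_dp(profits, weights, capacity):
    """Exact 0-1 knapsack by dynamic programming, O(n * capacity)."""
    best = [0] * (capacity + 1)            # best[c] = max profit within capacity c
    for p, w in zip(profits, weights):
        # Iterate capacities downward so each item is used at most once
        for c in range(capacity, w - 1, -1):
            best[c] = max(best[c], best[c - w] + p)
    return best[capacity]

def lp_relaxation(profits, weights, capacity):
    """Greedy solution of the LP relaxation (fractional knapsack).

    Returns (lp_bound, rounded_down_profit), where the second value is the
    profit of the items taken whole before the first item that does not fit.
    """
    order = sorted(range(len(profits)),
                   key=lambda i: profits[i] / weights[i], reverse=True)
    remaining, whole = capacity, 0.0
    for i in order:
        if weights[i] <= remaining:
            whole += profits[i]
            remaining -= weights[i]
        else:
            fraction = remaining / weights[i]
            return whole + fraction * profits[i], whole
    return whole, whole

profits, weights, capacity = [60, 100, 120], [10, 20, 30], 50

opt = knapsack_dp(profits, weights, capacity)
lp_bound, rounded = lp_relaxation(profits, weights, capacity)
best_single = max(p for p, w in zip(profits, weights) if w <= capacity)
heuristic = max(rounded, best_single)      # the half-approximation

print(opt, lp_bound, heuristic)
```

On this instance the heuristic is well above half of the optimum, while the LP bound sits above it, as the theory predicts.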


Network Design

Definition: Network Design Optimization

Design a network (e.g., telecommunications, transportation, or supply chain) by selecting edges to minimize cost while satisfying connectivity and flow requirements:

$$\min_{\boldsymbol{x}} \quad \sum_{(i,j) \in E} c_{ij} x_{ij}$$
$$\text{s.t.} \quad \sum_{j:(i,j)\in E} x_{ij} - \sum_{j:(j,i)\in E} x_{ji} = d_i \quad \forall i \in V$$
$$x_{ij} \geq 0, \quad x_{ij} \leq u_{ij} \quad \forall (i,j) \in E$$

Where $c_{ij}$ is the cost of edge $(i,j)$, $d_i$ is the net flow at node $i$, and $u_{ij}$ is the capacity.

Theorem: Minimum Spanning Tree (MST)

For undirected networks, the MST selects edges minimizing total cost while connecting all nodes. Kruskal's algorithm (greedy by edge weight) and Prim's algorithm (greedy by growing tree) both find the MST in $O(E \log V)$ time. The MST is the optimal network design when the only requirement is that all nodes be connected, each edge has a fixed cost that does not depend on the flow it carries, and there are no capacity or degree constraints.
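Kruskal's greedy rule can be sketched with a small union-find structure. This is a minimal illustration, not a tuned implementation: the edge list and node numbering are invented for the example, and union by rank is omitted for brevity.

```python
def kruskal(num_nodes, edges):
    """Minimum spanning tree by Kruskal's algorithm.

    edges: list of (cost, u, v) tuples for an undirected graph
    with nodes numbered 0 .. num_nodes - 1.
    Returns (total_cost, list of chosen (u, v, cost) edges).
    """
    parent = list(range(num_nodes))

    def find(x):
        # Find the component representative, compressing the path as we go
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    total, tree = 0, []
    for cost, u, v in sorted(edges):       # cheapest edges first
        root_u, root_v = find(u), find(v)
        if root_u != root_v:               # edge joins two components: keep it
            parent[root_u] = root_v
            tree.append((u, v, cost))
            total += cost
    return total, tree

edges = [(1, 0, 1), (4, 0, 2), (2, 1, 2), (5, 1, 3), (3, 2, 3)]
total_cost, tree = kruskal(4, edges)
print(total_cost, tree)
```

Sorting dominates the running time, which is where the $O(E \log V)$ bound comes from (since $\log E = O(\log V)$).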


Scheduling Problems

Definition: Job-Shop Scheduling

Given $n$ jobs $J_1, \dots, J_n$ and $m$ machines $M_1, \dots, M_m$, each job $J_i$ consists of a sequence of operations that must be processed on specific machines for specified durations. The objective is to minimize the makespan (the completion time of the last job to finish) or the total weighted completion time. For the makespan:

$$\min \quad C_{\max} = \max_{i} C_i$$
$$\text{s.t.} \quad C_i \geq r_i + \sum_{k} p_{ik} \quad \text{(precedence)}$$
$$\text{No two operations on the same machine overlap.}$$

Here $C_i$ is the completion time of job $J_i$, $r_i$ is its release date, and $p_{ik}$ is the duration of its $k$-th operation.

The job-shop problem is strongly NP-hard for $m \geq 2$. The flow-shop variant (all jobs follow the same machine sequence) is NP-hard for $m \geq 3$, but for $m = 2$ Johnson's rule finds a makespan-optimal schedule in $O(n \log n)$ time.
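Johnson's rule for the two-machine flow shop is short enough to sketch directly: jobs that are faster on machine 1 go first in increasing order of their machine-1 time, and the remaining jobs go last in decreasing order of their machine-2 time. The job data below is invented for illustration, and the brute-force check is only feasible because the instance is tiny.

```python
from itertools import permutations

def makespan(order, jobs):
    """Makespan of a two-machine flow shop for a given job order.

    jobs[j] = (time on machine 1, time on machine 2).
    """
    t1 = t2 = 0
    for j in order:
        a, b = jobs[j]
        t1 += a                  # machine 1 runs jobs back to back
        t2 = max(t2, t1) + b     # machine 2 waits for machine 1 and for itself
    return t2

def johnson_order(jobs):
    """Job order produced by Johnson's rule."""
    first = sorted((j for j, (a, b) in enumerate(jobs) if a <= b),
                   key=lambda j: jobs[j][0])
    last = sorted((j for j, (a, b) in enumerate(jobs) if a > b),
                  key=lambda j: jobs[j][1], reverse=True)
    return first + last

jobs = [(3, 6), (5, 2), (1, 2), (6, 6), (7, 5)]

order = johnson_order(jobs)
best = makespan(order, jobs)

# Exhaustive check over all 5! orders, for illustration only
brute = min(makespan(p, jobs) for p in permutations(range(len(jobs))))
print(order, best, brute)
```

The rule's makespan matches the exhaustive minimum on this instance.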

⚠️ Complexity Trap

Don't confuse scheduling complexity classes: a single machine with release dates ($1|r_i|C_{\max}$) is solvable in polynomial time by processing jobs in order of release date. Two identical parallel machines with release dates ($P2|r_i|C_{\max}$) is already NP-hard, and remains so even without release dates. Always check the problem variant before attempting an exact solution.


Feature Selection

L1 Regularization for Feature Selection

Theorem: Irrepresentability Condition

The Lasso recovers the correct support $S$ (the set of nonzero features) with high probability essentially only when the irrepresentability condition holds: no irrelevant feature may be too well explained by the relevant ones. Specifically, $\|\boldsymbol{X}_{S^c}^\top \boldsymbol{X}_S (\boldsymbol{X}_S^\top \boldsymbol{X}_S)^{-1} \operatorname{sign}(\boldsymbol{\beta}_S)\|_\infty \leq 1 - \delta$ for some $\delta > 0$, where $S^c$ is the complement of $S$. When this condition fails, the Lasso may include irrelevant features or exclude relevant ones.
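To make the condition concrete, the quantity $\|\boldsymbol{X}_{S^c}^\top \boldsymbol{X}_S (\boldsymbol{X}_S^\top \boldsymbol{X}_S)^{-1} \operatorname{sign}(\boldsymbol{\beta}_S)\|_\infty$ (the standard form of the irrepresentability statistic) can be computed directly when the true support is known, as it is in a simulation. The sketch below is illustrative: the function name and the synthetic data are assumptions, and in real data the support and signs are unknown.

```python
import numpy as np

def irrepresentable_stat(X, support, signs):
    """Largest absolute entry of X_Sc^T X_S (X_S^T X_S)^{-1} sign(beta_S).

    Values below 1 indicate the condition holds for this design.
    """
    S = np.asarray(support)
    Sc = np.setdiff1d(np.arange(X.shape[1]), S)      # irrelevant features
    XS, XSc = X[:, S], X[:, Sc]
    # Solve the normal equations instead of forming an explicit inverse
    v = np.linalg.solve(XS.T @ XS, np.asarray(signs, dtype=float))
    return float(np.max(np.abs(XSc.T @ XS @ v)))

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 6))
support, signs = [0, 1], [1.0, 1.0]

# Independent features: irrelevant columns are nearly orthogonal to X_S
stat_indep = irrepresentable_stat(X, support, signs)

# Make feature 5 almost equal to the sum of the two relevant features
X_corr = X.copy()
X_corr[:, 5] = X[:, 0] + X[:, 1] + 0.1 * rng.standard_normal(200)
stat_corr = irrepresentable_stat(X_corr, support, signs)

print(f"independent design: {stat_indep:.3f}, correlated design: {stat_corr:.3f}")
```

In the correlated design the statistic exceeds 1, which is the regime where the Lasso is prone to picking up feature 5 even though its true coefficient is zero.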


Neural Architecture Search (NAS)

Definition: Neural Architecture Search

NAS automates the design of neural network architectures by treating architecture selection as an optimization problem over a discrete search space $\mathcal{A}$:

$$\boldsymbol{\alpha}^* = \arg\min_{\boldsymbol{\alpha} \in \mathcal{A}} \mathcal{L}_{\text{val}}(\boldsymbol{w}^*(\boldsymbol{\alpha}), \boldsymbol{\alpha})$$
$$\text{where} \quad \boldsymbol{w}^*(\boldsymbol{\alpha}) = \arg\min_{\boldsymbol{w}} \mathcal{L}_{\text{train}}(\boldsymbol{w}, \boldsymbol{\alpha})$$

Search strategies include: random search, grid search, evolutionary algorithms (mutation + crossover of architectures), reinforcement learning (controller network generates architectures, rewarded by validation accuracy), and differentiable NAS (DARTS) which relaxes the discrete search to continuous weights.

💡 DARTS Efficiency

DARTS (Differentiable Architecture Search) reduces NAS from GPU-years to GPU-hours by jointly optimizing architecture weights $\boldsymbol{\alpha}$ and network weights $\boldsymbol{w}$ via bilevel optimization. The continuous relaxation allows gradient-based updates, but the search often needs safeguards such as early stopping or limits on skip connections to avoid performance collapse.


Reinforcement Learning as Optimization

Policy Gradient Optimization

ℹ️ Optimization Landscape

RL optimization is non-convex, stochastic, and has moving targets (the data distribution changes as the policy improves). Trust region methods (TRPO, PPO) and actor-critic architectures (A2C, SAC) stabilize training by restricting policy changes and reducing gradient variance.


AutoML

Definition: Automated Machine Learning (AutoML)

AutoML automates the entire machine learning pipeline: feature engineering, model selection, hyperparameter tuning, and ensemble construction. The meta-optimization problem is:

$$\min_{\boldsymbol{\lambda} \in \Lambda} \quad \mathcal{L}_{\text{val}}(f_{\boldsymbol{\theta}^*(\boldsymbol{\lambda})}, D_{\text{val}})$$
$$\text{where} \quad \boldsymbol{\theta}^*(\boldsymbol{\lambda}) = \arg\min_{\boldsymbol{\theta}} \mathcal{L}_{\text{train}}(f_{\boldsymbol{\theta}}, D_{\text{train}}; \boldsymbol{\lambda})$$

Here $\boldsymbol{\lambda}$ represents hyperparameters (architecture, learning rate, regularization strength) and $\boldsymbol{\theta}$ represents model parameters. AutoML frameworks include Auto-sklearn (Bayesian optimization + meta-learning), H2O AutoML (stacked ensembles), and Google AutoML (neural architecture search at scale).

⚠️ Computational Cost

AutoML can consume hundreds of GPU-hours. Always set a budget (time or number of trials) and use warm-starting from prior runs or meta-learning to reduce search cost. For most tabular problems, gradient boosting with tuned hyperparameters (via Optuna or Hyperopt) matches AutoML performance at a fraction of the cost.
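The bilevel structure and the idea of a fixed trial budget can be sketched with the simplest possible search. In this illustration the inner problem is ridge regression (solved in closed form) and the outer problem is random search over the regularization strength; the data, the search range, and the function names are all assumptions made for the example, and real AutoML systems use far richer search spaces and smarter samplers.

```python
import numpy as np

def fit_ridge(X, y, lam):
    """Inner problem: theta*(lam) = argmin ||X theta - y||^2 + lam ||theta||^2."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def mse(theta, X, y):
    """Mean squared error, used here as the validation loss."""
    return float(np.mean((X @ theta - y) ** 2))

def random_search(X_tr, y_tr, X_val, y_val, n_trials=20, seed=0):
    """Outer problem: sample lam log-uniformly, keep the best validation loss."""
    rng = np.random.default_rng(seed)
    history = []
    for _ in range(n_trials):                  # hard budget on the number of trials
        lam = 10 ** rng.uniform(-4, 2)         # log-uniform on [1e-4, 1e2]
        theta = fit_ridge(X_tr, y_tr, lam)     # solve the inner problem
        history.append((mse(theta, X_val, y_val), lam))
    return min(history), history               # tuples compare by loss first

# Synthetic regression data, split into train and validation parts
rng = np.random.default_rng(1)
X = rng.standard_normal((120, 10))
beta = rng.standard_normal(10)
y = X @ beta + 0.5 * rng.standard_normal(120)
X_tr, y_tr, X_val, y_val = X[:80], y[:80], X[80:], y[80:]

(best_loss, best_lam), history = random_search(X_tr, y_tr, X_val, y_val)
print(f"best lambda={best_lam:.4g}, validation MSE={best_loss:.4f}")
```

Every outer step pays for a full inner fit, which is why the total budget, not the per-trial cost, is the quantity to control.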


Python Implementation

Portfolio Optimization

import numpy as np
from scipy.optimize import minimize

def markowitz_optimize(mu, Sigma, target_return):
    n = len(mu)
    
    def portfolio_variance(w):
        return w @ Sigma @ w
    
    constraints = [
        {'type': 'eq', 'fun': lambda w: w @ mu - target_return},
        {'type': 'eq', 'fun': lambda w: np.sum(w) - 1.0}
    ]
    bounds = [(0, 1) for _ in range(n)]
    
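    result = minimize(portfolio_variance, x0=np.ones(n)/n,
                      method='SLSQP', bounds=bounds, constraints=constraints)
    if not result.success:
        # Typical cause: target_return is unreachable with long-only weights
        raise ValueError(f"Optimization failed: {result.message}")
    return result.x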

# Example
mu = np.array([0.12, 0.08, 0.15])
Sigma = np.array([[0.04, 0.009, 0.012],
                  [0.009, 0.0225, 0.008],
                  [0.012, 0.008, 0.0625]])

w = markowitz_optimize(mu, Sigma, target_return=0.11)
print(f"Weights: {w}")
print(f"Portfolio return: {w @ mu:.4f}")
print(f"Portfolio risk: {np.sqrt(w @ Sigma @ w):.4f}")

Feature Selection with Lasso Path

import numpy as np
from sklearn.linear_model import lasso_path

def feature_selection_lasso(X, y, n_alphas=50):
    """Compute the full Lasso regularization path."""
    alphas, coefs, _ = lasso_path(X, y, n_alphas=n_alphas)
    
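    # lasso_path returns alphas in decreasing order, so column i of coefs is
    # the solution at the i-th largest alpha. Walking the columns left to
    # right therefore visits features in the order they enter the model.
    selected_order = []
    for i in range(coefs.shape[1]):
        nonzero = np.where(np.abs(coefs[:, i]) > 1e-8)[0]
        for f in nonzero:
            if f not in selected_order:
                selected_order.append(int(f))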
    
    return alphas, coefs, selected_order

# Example
np.random.seed(42)
n, p = 100, 20
X = np.random.randn(n, p)
true_support = [0, 3, 7, 12]
beta_true = np.zeros(p)
beta_true[true_support] = [2.0, -1.5, 3.0, -0.8]
y = X @ beta_true + 0.5 * np.random.randn(n)

alphas, coefs, order = feature_selection_lasso(X, y)
print(f"Feature entry order: {order}")
print(f"True support: {true_support}")

Neural Architecture Search (Simplified DARTS)

import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOperation(nn.Module):
    """DARTS-style mixed operation over candidate edges."""
    def __init__(self, C_in, C_out, ops):
        super().__init__()
        self.ops = nn.ModuleList(ops)
        self.arch_weights = nn.Parameter(torch.randn(len(ops)))
    
    def forward(self, x):
        weights = F.softmax(self.arch_weights, dim=0)
        return sum(w * op(x) for w, op in zip(weights, self.ops))

# Candidate operations
ops = [
    nn.Sequential(nn.Conv2d(16, 16, 3, padding=1), nn.BatchNorm2d(16), nn.ReLU()),
    nn.Sequential(nn.Conv2d(16, 16, 5, padding=2), nn.BatchNorm2d(16), nn.ReLU()),
    nn.MaxPool2d(3, stride=1, padding=1),
    nn.Identity(),
]

mixed = MixedOperation(16, 16, ops)
x = torch.randn(1, 16, 32, 32)
out = mixed(x)
print(f"Output shape: {out.shape}")

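# No search has been run here, so arch_weights are still at their random
# initialization. After bilevel optimization, the operation with the largest
# architecture weight would be kept:
best_idx = mixed.arch_weights.argmax().item()
print(f"Operation with largest architecture weight: {best_idx}")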

PPO Policy Optimization

import torch
import torch.nn as nn
import torch.nn.functional as F  # needed for F.softmax in ActorCritic.forward

class ActorCritic(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU()
        )
        self.policy = nn.Linear(64, action_dim)
        self.value = nn.Linear(64, 1)
    
    def forward(self, state):
        features = self.shared(state)
        return F.softmax(self.policy(features), dim=-1), self.value(features)

def ppo_loss(old_log_probs, new_log_probs, advantages, eps=0.2):
    ratio = torch.exp(new_log_probs - old_log_probs)
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps)
    return -torch.min(ratio * advantages, clipped * advantages).mean()

# Example usage
state_dim, action_dim = 4, 2
model = ActorCritic(state_dim, action_dim)
state = torch.randn(1, state_dim)
probs, value = model(state)
print(f"Action probs: {probs}, Value: {value.item():.4f}")

Common Mistakes

| Mistake | Why It Fails | Fix |
|---|---|---|
| Using SGD without momentum on deep networks | Converges slowly, gets stuck in saddle points | Add momentum ($\beta = 0.9$) or use Adam |
| Setting learning rate too high in Adam | Loss diverges, gradients explode | Start with $\alpha = 0.001$, use LR finder |
| Forgetting to zero gradients | Gradients accumulate across batches, wrong updates | Call optimizer.zero_grad() before each backward pass |
| Not scaling features for Lasso | Features with larger scale dominate the penalty | Standardize features to zero mean, unit variance |
| Using $\ell_2$ penalty instead of weight decay in Adam | L2 and weight decay are NOT equivalent in Adam | Use AdamW for proper weight decay |
| Ignoring covariance in portfolio optimization | Underestimates risk, over-concentrates positions | Always use full covariance matrix $\boldsymbol{\Sigma}$ |
| Stopping NAS too early | Final architecture is suboptimal, no convergence | Use early stopping with patience, not fixed epochs |
| Not clipping gradients in RNNs | Exploding gradients cause NaN loss and divergence | Always clip gradients for recurrent architectures |
| Using discrete search for NAS without warm-start | Prohibitively expensive (years of GPU time) | Use DARTS or Bayesian optimization with warm-starting |
| Over-regularizing with high $\lambda$ in Lasso | Model becomes too sparse, high bias | Cross-validate $\lambda$ using the one-standard-error rule |

Interview Questions

  1. Why is the Lasso able to perform feature selection while Ridge regression is not? The Lasso's $\ell_1$ penalty creates a diamond-shaped constraint region with corners on the axes. The solution (intersection of the loss ellipse with the constraint) often lands exactly on a corner, setting some coefficients to zero. Ridge's $\ell_2$ ball has no corners, so coefficients shrink but never reach exactly zero.

  2. Explain the efficient frontier in portfolio theory. What assumption breaks it? The efficient frontier is the set of portfolios achieving minimum variance for each target return. It breaks when returns are not normally distributed (fat tails, skewness) or when the covariance matrix is estimated with error (estimation error amplifies with the number of assets).

  3. Why is PPO preferred over vanilla policy gradient in practice? PPO constrains policy updates via clipping, preventing destructive large steps that can collapse performance. Although it is an on-policy method, it reuses each batch collected by the old policy for several epochs of minibatch updates, reduces variance with advantage estimation, and is typically more stable and sample-efficient than REINFORCE.

  4. When does DARTS fail? What is the "performance collapse" problem? DARTS tends to select skip connections (identity operations) disproportionately, especially in deeper cells, because skip connections have lower training loss early in search. This leads to degenerate architectures. Solutions: enforce a minimum number of non-skip operations, use early stopping before convergence.

  5. How would you scale AutoML to millions of hyperparameter configurations? Use Bayesian optimization (Gaussian processes or Tree-structured Parzen Estimators) with warm-starting from prior runs. Meta-learning from similar datasets reduces the search space. For extreme scale, use population-based training (PBT) which evolves a population of models in parallel.

  6. What is the difference between the 0-1 knapsack and the fractional knapsack? The 0-1 knapsack (items cannot be divided) is NP-hard; dynamic programming solves it in $O(nW)$ time, which is pseudo-polynomial because it depends on the capacity $W$ itself rather than on the number of bits needed to write it. The fractional knapsack (items can be divided) is solvable greedily in $O(n \log n)$ by taking the highest value-to-weight ratio items first.

  7. Why does gradient clipping use the norm rather than clipping each gradient component independently? Clipping by norm preserves the direction of the gradient (the relative magnitudes of components are maintained). Component-wise clipping distorts the gradient direction, which can harm convergence. Norm clipping rescales: $\boldsymbol{g} \leftarrow \boldsymbol{g} \cdot \min(1, \tau / \|\boldsymbol{g}\|)$.

  8. Explain the irrepresentability condition for Lasso feature selection. It requires that irrelevant features are not too well explained by the relevant ones: $\|\boldsymbol{X}_{S^c}^\top \boldsymbol{X}_S (\boldsymbol{X}_S^\top \boldsymbol{X}_S)^{-1} \operatorname{sign}(\boldsymbol{\beta}_S)\|_\infty \leq 1 - \delta$. If violated, the Lasso may include irrelevant features (false positives) because they are correlated with relevant features. The adaptive Lasso or stability selection can partially address this.
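The difference between the two clipping schemes in question 7 is easy to see on a two-dimensional gradient. This is a small NumPy sketch with an invented gradient and threshold; deep learning frameworks ship their own clipping utilities, which are not used here.

```python
import numpy as np

def clip_by_norm(g, tau):
    """Rescale g so that its Euclidean norm is at most tau."""
    norm = np.linalg.norm(g)
    if norm == 0.0:
        return g
    return g * min(1.0, tau / norm)

def clip_by_value(g, tau):
    """Clip every component of g to the interval [-tau, tau]."""
    return np.clip(g, -tau, tau)

def cosine(u, v):
    """Cosine of the angle between u and v (1.0 means same direction)."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

g = np.array([3.0, 4.0])        # gradient with norm 5
tau = 1.0

g_norm = clip_by_norm(g, tau)   # scaled copy of g with norm 1
g_val = clip_by_value(g, tau)   # both components saturate at 1

cos_norm = cosine(g, g_norm)
cos_val = cosine(g, g_val)
print(g_norm, g_val, cos_norm, cos_val)
```

Norm clipping keeps the 3:4 ratio between components, while value clipping flattens it to 1:1 and so changes the update direction.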


Practice Problems

📝Problem 1: Portfolio Risk

You have three assets with $\boldsymbol{\mu} = [0.10, 0.15, 0.08]^\top$ and covariance matrix $\boldsymbol{\Sigma}$ where $\sigma_{11} = 0.04, \sigma_{22} = 0.09, \sigma_{33} = 0.025$, $\sigma_{12} = 0.006, \sigma_{13} = 0.004, \sigma_{23} = 0.003$. Find the minimum variance portfolio with expected return $0.11$.

💡Solution

Using the Lagrangian with constraints $\boldsymbol{w}^\top \boldsymbol{\mu} = 0.11$ and $\boldsymbol{w}^\top \mathbf{1} = 1$:

Solve: $\boldsymbol{\Sigma} \boldsymbol{w} = \lambda_1 \boldsymbol{\mu} + \lambda_2 \mathbf{1}$ (the factor of 2 from differentiating $\boldsymbol{w}^\top \boldsymbol{\Sigma} \boldsymbol{w}$ is absorbed into the multipliers)

Setting up the system together with the two constraints and solving numerically: $\boldsymbol{w} \approx [0.341, 0.331, 0.328]^\top$. All three weights are nonnegative, so adding a long-only constraint would not change the answer.

Portfolio variance: $\boldsymbol{w}^\top \boldsymbol{\Sigma} \boldsymbol{w} \approx 0.0201$, so $\sigma_p \approx 0.142$.
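The stationarity condition and the two constraints form a 5-by-5 linear system in $(\boldsymbol{w}, \lambda_1, \lambda_2)$, which can be solved directly. The sketch below assembles that system with NumPy; the block layout of the matrix `K` is one possible arrangement, chosen for this example.

```python
import numpy as np

mu = np.array([0.10, 0.15, 0.08])
Sigma = np.array([[0.040, 0.006, 0.004],
                  [0.006, 0.090, 0.003],
                  [0.004, 0.003, 0.025]])
target = 0.11

n = len(mu)
ones = np.ones(n)

# Unknowns are (w, lambda_1, lambda_2). Rows encode:
#   Sigma w - lambda_1 mu - lambda_2 1 = 0   (stationarity)
#   mu^T w = target                          (return constraint)
#   1^T w  = 1                               (budget constraint)
K = np.zeros((n + 2, n + 2))
K[:n, :n] = Sigma
K[:n, n] = -mu
K[:n, n + 1] = -ones
K[n, :n] = mu
K[n + 1, :n] = ones
rhs = np.concatenate([np.zeros(n), [target, 1.0]])

solution = np.linalg.solve(K, rhs)
w = solution[:n]

variance = float(w @ Sigma @ w)
print("weights:", np.round(w, 4))
print(f"return={w @ mu:.4f}, variance={variance:.4f}, std={variance ** 0.5:.4f}")
```

Running it gives weights of roughly 0.341, 0.331, and 0.328, with variance near 0.0201. Because no sign constraints are imposed, this approach can return negative weights (short positions) for other targets; in that case a constrained solver is needed.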

📝Problem 2: Feature Selection

A dataset has 50 features, 20 of which are truly relevant. You run Lasso with cross-validated $\lambda$, and the model selects 25 features. What can you conclude?

💡Solution

The model has at least 5 false positives, and more if it also missed relevant features (each false negative implies one additional false positive). Possible causes: (1) The irrepresentability condition is violated: some irrelevant features are correlated with relevant ones. (2) Features have different scales; standardize before applying Lasso. (3) Cross-validation tunes $\lambda$ for prediction error, which tends to keep extra features. To reduce false discoveries, consider adaptive Lasso (reweighted $\ell_1$) or stability selection.

📝Problem 3: RL Optimization

In PPO, you set $\epsilon = 0.2$. The old policy gives $\pi_{\text{old}}(a|s) = 0.3$, and the new policy gives $\pi_{\text{new}}(a|s) = 0.6$. The advantage estimate is $\hat{A} = 1.5$. What is the PPO surrogate loss for this transition?

💡Solution

Probability ratio: $r = 0.6 / 0.3 = 2.0$

Unclipped objective: $r \cdot \hat{A} = 2.0 \times 1.5 = 3.0$

Clipped objective: $\text{clip}(r, 0.8, 1.2) \cdot \hat{A} = 1.2 \times 1.5 = 1.8$

PPO loss: $-\min(3.0, 1.8) = -1.8$

The clipping prevents the update from being too aggressive, even though the advantage is positive (we want to increase this action's probability).
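The same calculation can be reproduced in plain Python, which also makes it easy to try the opposite sign of the advantage. The helper name `ppo_clipped_objective` is made up for this sketch; it works on single scalars, not on batches.

```python
def ppo_clipped_objective(pi_old, pi_new, advantage, eps=0.2):
    """Clipped surrogate objective for a single transition."""
    ratio = pi_new / pi_old
    clipped_ratio = max(1 - eps, min(ratio, 1 + eps))
    # Take the more pessimistic of the unclipped and clipped terms
    return min(ratio * advantage, clipped_ratio * advantage)

# Values from the problem: positive advantage, ratio far above 1 + eps
loss_positive = -ppo_clipped_objective(0.3, 0.6, 1.5)

# Same ratio, negative advantage: the unclipped term is lower, so it is kept
loss_negative = -ppo_clipped_objective(0.3, 0.6, -1.5)

print(f"positive advantage: {loss_positive:.2f}")
print(f"negative advantage: {loss_negative:.2f}")
```

With a positive advantage the clipped term is selected, so the objective no longer depends on the ratio and this transition contributes no gradient. With a negative advantage and the same ratio, the unclipped term is selected and the full penalty of 3.0 applies, so the policy is still pushed to reduce that action's probability.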


Quick Reference

| Topic | Key Formula / Concept | When to Use |
|---|---|---|
| Portfolio Optimization | $\min \boldsymbol{w}^\top \boldsymbol{\Sigma} \boldsymbol{w}$ s.t. $\boldsymbol{w}^\top \boldsymbol{\mu} = r$ | Asset allocation, risk management |
| Resource Allocation | $\max \sum p_i x_i$ s.t. $\sum a_{ij} x_i \leq c_j$ | Knapsack, scheduling, budgeting |
| Network Design | $\min \sum c_{ij} x_{ij}$ s.t. flow constraints | Telecommunications, logistics |
| Scheduling | Minimize makespan $C_{\max}$ with precedence constraints | Manufacturing, cloud computing |
| Feature Selection (Lasso) | $\min \|\boldsymbol{y} - \boldsymbol{X}\boldsymbol{\beta}\|_2^2 + \lambda\|\boldsymbol{\beta}\|_1$ | High-dimensional regression, interpretability |
| NAS (DARTS) | Bilevel optimization over architecture $\boldsymbol{\alpha}$ and weights $\boldsymbol{w}$ | Automated model design |
| Policy Gradient | $\nabla_{\boldsymbol{\theta}} J = \mathbb{E}[\sum_t G_t \nabla_{\boldsymbol{\theta}} \log \pi_{\boldsymbol{\theta}}(a_t \mid s_t)]$, with $G_t$ the discounted return from step $t$ | RL policy optimization |
| PPO | Clipped surrogate: $\min(r\hat{A}, \text{clip}(r, 1\pm\epsilon)\hat{A})$ | Stable RL training |
| AutoML | Hyperparameter optimization + pipeline search | End-to-end ML automation |

Cross-References

  • Optimization Fundamentals: See Convex Optimization Basics for foundations of convexity, duality, and gradient methods
  • Gradient Descent: See Stochastic Gradient Descent for SGD variants and convergence analysis
  • Regularization: See Regularization Techniques for $\ell_1$, $\ell_2$, elastic net, and their theoretical properties
  • Reinforcement Learning: See Policy Gradient Methods for REINFORCE, A2C, and actor-critic algorithms
  • Neural Architecture Search: See AutoML and NAS for DARTS, evolutionary search, and Bayesian optimization
  • Portfolio Theory: See Markowitz Model for detailed derivation of the efficient frontier and CAPM
  • Combinatorial Optimization: See Integer Programming for branch-and-bound, cutting planes, and LP relaxation
  • Hyperparameter Tuning: See Bayesian Optimization for Gaussian process surrogates and acquisition functions
  • Scaling Laws: See Neural Scaling Laws for compute-optimal training and Chinchilla-optimal model sizing

Key Takeaways

📋Summary

  • Optimization pervades ML: From portfolio selection to architecture search, optimization is the common thread connecting diverse applications. Understanding the problem structure (convex vs. non-convex, continuous vs. discrete, constrained vs. unconstrained) determines the right algorithm.
  • Portfolio Optimization: Markowitz mean-variance is quadratic programming. The efficient frontier gives the risk-return tradeoff; Sharpe ratio selects the optimal portfolio. Real-world challenges: estimation error in $\boldsymbol{\Sigma}$, non-normal returns.
  • Resource Allocation: Knapsack and assignment problems are NP-hard but have efficient approximations (LP relaxation, greedy). Always check if the problem has special structure (matroid, submodularity) that enables better guarantees.
  • Network Design: MST and min-cost flow are polynomial; general network design is NP-hard. Greedy and primal-dual algorithms give constant-factor approximations.
  • Scheduling: Job-shop is strongly NP-hard for $m \geq 2$. Heuristics (dispatch rules, genetic algorithms) and LP-based bounds are practical. Always check the problem variant ($1|r_i|C_{\max}$ vs. $J_m||C_{\max}$).
  • Feature Selection: Lasso ($\ell_1$) selects sparse models; Ridge ($\ell_2$) shrinks but doesn't select. Adaptive Lasso and stability selection reduce false discoveries. Standardize features before applying Lasso.
  • NAS: DARTS enables gradient-based architecture search but suffers from performance collapse. Use early stopping and enforce operation diversity. For tabular data, tuned gradient boosting often matches NAS-discovered architectures.
  • RL as Optimization: Policy gradient methods optimize $\mathbb{E}[\sum \gamma^t r_t]$ directly. PPO's clipping prevents destructive updates. Trust region methods (TRPO, PPO) are more stable than vanilla policy gradient.
  • AutoML: Automates the ML pipeline but can be expensive. Bayesian optimization with warm-starting is the standard approach. For many tabular problems, carefully tuned gradient boosting often matches AutoML at lower cost.
  • Python Implementation: Use scipy.optimize for portfolio problems, sklearn.linear_model.lasso_path for feature selection, and PyTorch/TensorFlow for NAS and RL. Always validate with cross-validation.
  • Common Pitfalls: Not zeroing gradients, using L2 penalty instead of weight decay in Adam, ignoring covariance estimation error, not clipping gradients in RNNs, and over-regularizing with high $\lambda$.
  • Interview Essentials: Know why Lasso selects features (geometry of $\ell_1$ ball), why PPO is preferred (clipping prevents destructive updates), and the difference between 0-1 and fractional knapsack.