Neural Architecture Search
NAS automates the design of neural network architectures, replacing manual engineering with algorithmic search. NAS-derived architectures such as NASNet, AmoebaNet, and EfficientNet have matched or outperformed human-designed networks on image classification benchmarks.
NAS Framework
DfNAS Components
NAS consists of three components:
- Search space: Set of possible architectures (operations, connectivity)
- Search strategy: Algorithm to explore (reinforcement learning, evolutionary, gradient-based)
- Performance estimation: Evaluate architectures efficiently (proxy tasks, weight sharing)
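To make the three components concrete, here is a minimal sketch that uses random search as the strategy. The operation names, `NUM_EDGES`, and the scoring table in `estimate_performance` are made-up placeholders; a real system would replace the score with a short training run or a weight-sharing evaluation.

```python
import random

# Search space: each of NUM_EDGES edges picks one candidate operation.
OPERATIONS = ["conv3x3", "sep_conv5x5", "max_pool", "skip"]
NUM_EDGES = 4


def sample_architecture(rng):
    """Search strategy (random search): draw one operation per edge."""
    return tuple(rng.choice(OPERATIONS) for _ in range(NUM_EDGES))


def estimate_performance(arch):
    """Performance estimation: a made-up proxy score in [0, 1].

    A real system would train `arch` briefly (or reuse shared weights)
    and return validation accuracy; this toy simply favors convolutions.
    """
    score = {"conv3x3": 1.0, "sep_conv5x5": 0.8, "max_pool": 0.3, "skip": 0.1}
    return sum(score[op] for op in arch) / len(arch)


def random_search(num_trials=50, seed=0):
    """Sample num_trials architectures and keep the best-scoring one."""
    rng = random.Random(seed)
    candidates = [sample_architecture(rng) for _ in range(num_trials)]
    return max(candidates, key=estimate_performance)


best = random_search()
print(best, estimate_performance(best))
```

Swapping `random_search` for a reinforcement learning, evolutionary, or gradient-based routine changes only the strategy; the search space and the estimator stay the same.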
Search Space
DfCell-Based Search Space
Most modern NAS uses a cell-based search space:
- Normal cell: Preserves spatial dimensions
- Reduction cell: Halves spatial dimensions, doubles channels
- Each cell is a directed acyclic graph (DAG) with a small number of nodes (typically 4-6)
- Edges are operations (conv 3x3, sep conv 5x5, max pool, etc.)
A network is built by stacking these cells.
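The stacking step can be sketched as a simple shape plan. The placement of reduction cells at one third and two thirds of the depth is an assumption borrowed from common cell-based setups, and `stack_cells` is a hypothetical helper name.

```python
def stack_cells(num_cells, channels, size):
    """Return one (cell_type, channels, spatial_size) entry per stacked cell.

    Assumption: reduction cells sit at 1/3 and 2/3 of the depth;
    every other position uses a normal cell.
    """
    reduction_at = {num_cells // 3, 2 * num_cells // 3}
    plan = []
    for idx in range(num_cells):
        if idx in reduction_at:
            # Reduction cell: double the channels, halve height and width.
            channels, size = channels * 2, size // 2
            plan.append(("reduction", channels, size))
        else:
            # Normal cell: output shape equals input shape.
            plan.append(("normal", channels, size))
    return plan


plan = stack_cells(num_cells=8, channels=16, size=32)
for cell in plan:
    print(cell)
```

With 8 cells, 16 initial channels, and 32x32 inputs, the plan ends with 64 channels at 8x8 resolution after two reductions.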
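Cell Output
$$x^{(j)} = \sum_{i<j} o^{(i,j)}\left(x^{(i)}\right)$$
Here,
- $x^{(j)}$ = Output of intermediate node j
- $o^{(i,j)}(x^{(i)})$ = Output of operation on edge (i,j)
- $\sum_{i<j}$ = Sum over all incoming edges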
DARTS (Differentiable Architecture Search)
DfDARTS
DARTS (Liu et al., 2019) makes NAS differentiable by relaxing discrete architecture choices to continuous weights, so that each edge computes a softmax-weighted mixture of all candidate operations.
Architecture parameters $\alpha$ and network weights $w$ are optimized jointly via bilevel optimization.
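DARTS Continuous Relaxation
$$\bar{o}^{(i,j)}(x) = \sum_{o \in \mathcal{O}} \frac{\exp\left(\alpha_o^{(i,j)}\right)}{\sum_{o' \in \mathcal{O}} \exp\left(\alpha_{o'}^{(i,j)}\right)} \, o(x)$$
Here,
- $\alpha_o^{(i,j)}$ = Architecture weight for operation o on edge (i,j)
- $\mathcal{O}$ = Set of candidate operations
- $o(x)$ = Output of operation o on input x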
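A small numeric sketch may help here: each edge applies a softmax to its architecture weights and returns the weighted sum of all candidate operation outputs. The scalar operations in `CANDIDATE_OPS` are toy stand-ins for convolutions and pooling, and the names `mixed_op` and `discretize` are illustrative only.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy scalar "operations" standing in for conv / skip / zero.
CANDIDATE_OPS = {
    "double": lambda x: 2.0 * x,
    "identity": lambda x: x,
    "zero": lambda x: 0.0,
}


def mixed_op(x, alphas):
    """Continuous relaxation: softmax(alphas)-weighted sum of all op outputs."""
    weights = softmax(alphas)
    return sum(w * op(x) for w, op in zip(weights, CANDIDATE_OPS.values()))


def discretize(alphas):
    """After search, keep only the operation with the largest weight."""
    names = list(CANDIDATE_OPS)
    return names[max(range(len(alphas)), key=lambda k: alphas[k])]


print(mixed_op(3.0, [0.0, 0.0, 0.0]))  # equal weights: average of 6, 3, 0
print(discretize([0.1, 2.0, -1.0]))
```

Because `mixed_op` is a smooth function of `alphas`, gradients with respect to the architecture weights are well defined, which is what makes gradient-based search possible.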
ℹ️ DARTS Training
DARTS alternates between:
- Step 1: Update weights $w$ on training data (minimize $\mathcal{L}_{train}(w, \alpha)$)
- Step 2: Update architecture params $\alpha$ on validation data (minimize $\mathcal{L}_{val}(w, \alpha)$)
After search, the final architecture is derived by selecting the operation with the highest $\alpha$ on each edge.
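The alternating update can be illustrated on a toy problem with hand-derived gradients. This is a sketch of the first-order variant (the weights are treated as fixed during the architecture step), not the full second-order method. The two candidate operations, the data, and the learning rates are all assumptions chosen for illustration.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def mixed_output(x, w, alpha):
    # Two candidate ops on one edge: "linear" (w * x) and "zero" (0.0).
    # A softmax over the logits [alpha, 0] reduces to sigmoid(alpha).
    p = sigmoid(alpha)
    return p * (w * x) + (1.0 - p) * 0.0


def grads(data, w, alpha):
    """Mean-squared-error gradients w.r.t. w and alpha, derived by hand."""
    p = sigmoid(alpha)
    gw = ga = 0.0
    for x, t in data:
        err = mixed_output(x, w, alpha) - t
        gw += 2.0 * err * p * x                   # d loss / d w
        ga += 2.0 * err * w * x * p * (1.0 - p)   # d loss / d alpha
    return gw / len(data), ga / len(data)


def search(train, val, steps=500, lr_w=0.1, lr_alpha=0.1):
    w, alpha = 0.5, 0.0
    for _ in range(steps):
        # Step 1: update the weight on training data.
        gw, _ = grads(train, w, alpha)
        w -= lr_w * gw
        # Step 2: update the architecture parameter on validation data.
        _, ga = grads(val, w, alpha)
        alpha -= lr_alpha * ga
    return w, alpha


# Targets follow t = 2 * x, so the "linear" op should win over "zero".
train = [(0.5, 1.0), (1.0, 2.0), (1.5, 3.0)]
val = [(0.75, 1.5), (1.25, 2.5)]
w, alpha = search(train, val)
print(f"w={w:.3f}, P(linear)={sigmoid(alpha):.3f}")
```

In a PyTorch implementation the same pattern typically uses two optimizers, one over the network weights and one over the architecture parameters, with autograd replacing the manual gradients.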
EfficientNet
DfEfficientNet
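EfficientNet (Tan & Le, 2019) uses compound scaling to jointly scale depth, width, and resolution:
$$d = \alpha^{\phi}, \quad w = \beta^{\phi}, \quad r = \gamma^{\phi}$$
subject to
$$\alpha \cdot \beta^{2} \cdot \gamma^{2} \approx 2, \quad \alpha \ge 1,\ \beta \ge 1,\ \gamma \ge 1$$
- $\alpha$: depth coefficient
- $\beta$: width coefficient
- $\gamma$: resolution coefficient
- $\phi$: compound coefficient controlling overall scale
The baseline network (EfficientNet-B0) is found via NAS, then scaled uniformly with these coefficients to obtain larger models.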
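Compound Scaling
$$d = \alpha^{\phi}, \quad w = \beta^{\phi}, \quad r = \gamma^{\phi}$$
Here,
- $d$ = Depth multiplier (number of layers relative to the baseline)
- $w$ = Width multiplier (number of channels relative to the baseline)
- $r$ = Resolution multiplier (input image size relative to the baseline)
- $\phi$ = Compound coefficient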
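As a worked example, the sketch below evaluates the scaling rule for a few values of the compound coefficient. It assumes the coefficients reported by Tan & Le (depth 1.2, width 1.1, resolution 1.15); the function names are illustrative, and the released models round these multipliers to practical layer counts and image sizes.

```python
# Coefficients reported in the EfficientNet paper for the B0 baseline.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # depth, width, resolution


def compound_scale(phi):
    """Return (depth, width, resolution) multipliers for compound coefficient phi."""
    return ALPHA ** phi, BETA ** phi, GAMMA ** phi


def flops_multiplier(phi):
    """FLOPs grow roughly with depth * width^2 * resolution^2."""
    d, w, r = compound_scale(phi)
    return d * w ** 2 * r ** 2


for phi in range(4):
    d, w, r = compound_scale(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, "
          f"resolution x{r:.2f}, FLOPs x{flops_multiplier(phi):.2f}")
```

Since the product of depth, squared width, and squared resolution coefficients is about 1.92, each unit increase of the compound coefficient roughly doubles the FLOPs.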
Once-for-All (OFA)
DfOnce-for-All Network
OFA (Cai et al., 2020) trains a single supernetwork that contains all possible sub-networks:
- Progressive shrinking: Train full network, then progressively fine-tune smaller subnets
- Elastic depth/width/kernel: Support flexible depth, width, and kernel size
- Deployment-specific search: Find optimal subnet for target hardware without retraining
This decouples training from search: each new deployment target needs only a cheap subnet search, with no retraining of the network.
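Deployment-specific search can be sketched as picking the best predicted subnet under a cost budget. The elastic choices below mirror the depth, expansion-ratio, and kernel-size options used by OFA, but the single-stage layout and both predictor functions are hypothetical stand-ins. In practice the predictors would be an accuracy predictor and a latency lookup table for the target device, and the search would be evolutionary over a far larger space.

```python
import itertools
import math

# Elastic dimensions of a single-stage toy supernetwork.
DEPTHS = [2, 3, 4]
EXPAND_RATIOS = [3, 4, 6]
KERNEL_SIZES = [3, 5, 7]


def predicted_cost(depth, expand, kernel):
    """Hypothetical cost proxy in arbitrary units, not a measured latency."""
    return depth * expand * kernel ** 2


def predicted_accuracy(depth, expand, kernel):
    """Hypothetical predictor: larger subnets score higher, with diminishing returns."""
    return 1.0 - 1.0 / math.log(2.0 + depth * expand * kernel)


def search_subnet(cost_budget):
    """Return the best-scoring (depth, expand, kernel) within budget, or None."""
    feasible = [
        cfg
        for cfg in itertools.product(DEPTHS, EXPAND_RATIOS, KERNEL_SIZES)
        if predicted_cost(*cfg) <= cost_budget
    ]
    if not feasible:
        return None
    return max(feasible, key=lambda cfg: predicted_accuracy(*cfg))


for budget in (100, 400, 2000):
    print(budget, search_subnet(budget))
```

No training happens inside `search_subnet`; the selected subnet would inherit its weights directly from the trained supernetwork.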
Search Strategies
DfReinforcement Learning (NASNet)
Use an RNN controller to generate architecture descriptions:
- Controller predicts architecture as a sequence of decisions
- Train architecture, evaluate on validation set
- Use accuracy as reward to update controller via a policy-gradient method such as REINFORCE
Pros: Flexible search space. Cons: Very expensive (thousands of GPU days).
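The controller loop can be sketched with a much simpler policy than an RNN. The version below keeps one independent softmax per decision (a tabular controller, which is a simplification) and uses a toy reward in place of training each sampled architecture. `CHOICES`, `TARGET`, and the hyperparameters are assumptions for illustration.

```python
import math
import random

CHOICES = ["conv3x3", "conv5x5", "max_pool", "skip"]
TARGET = [0, 2, 1]  # hidden "best" choice per decision, used only by the toy reward


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reward(arch):
    """Toy stand-in for validation accuracy: fraction of decisions matching TARGET."""
    return sum(a == t for a, t in zip(arch, TARGET)) / len(TARGET)


def train_controller(steps=3000, lr=0.1, seed=0):
    rng = random.Random(seed)
    logits = [[0.0] * len(CHOICES) for _ in TARGET]
    baseline = 0.0  # moving average of rewards, reduces gradient variance
    for _ in range(steps):
        probs = [softmax(row) for row in logits]
        # Sample one architecture: one choice index per decision.
        arch = [rng.choices(range(len(CHOICES)), weights=p)[0] for p in probs]
        r = reward(arch)
        advantage = r - baseline
        baseline = 0.9 * baseline + 0.1 * r
        # REINFORCE: d log pi(a) / d logit_k = 1[k == a] - p_k
        for row, p, a in zip(logits, probs, arch):
            for k in range(len(CHOICES)):
                row[k] += lr * advantage * ((1.0 if k == a else 0.0) - p[k])
    return [softmax(row) for row in logits]


policy = train_controller()
print([CHOICES[p.index(max(p))] for p in policy])
```

The expense in a real search comes from `reward`: each call there means training a network, which is why this family of methods consumes so much compute.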
DfEvolutionary Search (AmoebaNet)
Use genetic algorithms to evolve architectures:
- Population of architectures
- Tournament selection based on fitness (validation accuracy)
- Mutation: change operations, connectivity
- Crossover: combine architectures (optional; AmoebaNet itself relies on mutation only)
Pros: Parallelizable. Cons: Still expensive.
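A minimal sketch of tournament-based evolution with aging is shown below; it follows the mutation-only pattern and omits crossover for brevity. The fitness table is a made-up proxy for validation accuracy, and the population and tournament sizes are arbitrary assumptions.

```python
import collections
import random

CHOICES = ["sep_conv3x3", "sep_conv5x5", "max_pool", "skip"]
NUM_EDGES = 6


def fitness(arch):
    """Toy stand-in for validation accuracy, in [0, 1]."""
    score = {"sep_conv3x3": 1.0, "sep_conv5x5": 0.7, "max_pool": 0.3, "skip": 0.1}
    return sum(score[op] for op in arch) / len(arch)


def mutate(arch, rng):
    """Replace the operation on one randomly chosen edge."""
    child = list(arch)
    child[rng.randrange(len(child))] = rng.choice(CHOICES)
    return tuple(child)


def evolve(population_size=20, sample_size=5, cycles=200, seed=0):
    rng = random.Random(seed)
    population = collections.deque(
        tuple(rng.choice(CHOICES) for _ in range(NUM_EDGES))
        for _ in range(population_size)
    )
    history = list(population)
    for _ in range(cycles):
        # Tournament selection: the fittest of a random sample becomes the parent.
        tournament = rng.sample(list(population), sample_size)
        parent = max(tournament, key=fitness)
        child = mutate(parent, rng)
        population.append(child)
        population.popleft()  # aging: remove the oldest member, not the worst
        history.append(child)
    return max(history, key=fitness)


best_arch = evolve()
print(best_arch, fitness(best_arch))
```

Each cycle is independent of gradient information, so many children can be evaluated in parallel across workers.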
PyTorch Implementation
📝Example: DARTS Implementation
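import torch
import torch.nn as nn
import torch.nn.functional as F


class Zero(nn.Module):
    """The 'none' operation: outputs zeros, i.e. no connection on this edge."""

    def forward(self, x):
        return torch.zeros_like(x)


class SepConv(nn.Module):
    """Depthwise-separable conv (simplified): ReLU -> depthwise -> pointwise -> BN."""

    def __init__(self, C_in, C_out, kernel_size, stride, padding):
        super().__init__()
        self.op = nn.Sequential(
            nn.ReLU(),
            nn.Conv2d(C_in, C_in, kernel_size, stride, padding, groups=C_in, bias=False),
            nn.Conv2d(C_in, C_out, 1, bias=False),
            nn.BatchNorm2d(C_out, affine=False),
        )

    def forward(self, x):
        return self.op(x)


class DilConv(nn.Module):
    """Dilated depthwise-separable conv: ReLU -> dilated depthwise -> pointwise -> BN."""

    def __init__(self, C_in, C_out, kernel_size, stride, padding, dilation):
        super().__init__()
        self.op = nn.Sequential(
            nn.ReLU(),
            nn.Conv2d(C_in, C_in, kernel_size, stride, padding,
                      dilation=dilation, groups=C_in, bias=False),
            nn.Conv2d(C_in, C_out, 1, bias=False),
            nn.BatchNorm2d(C_out, affine=False),
        )

    def forward(self, x):
        return self.op(x)


# Candidate operations; every op preserves the input shape (stride 1, C channels)
OPS = {
    'none': lambda C: Zero(),
    'skip_connect': lambda C: nn.Identity(),
    'sep_conv_3x3': lambda C: SepConv(C, C, 3, 1, 1),
    'sep_conv_5x5': lambda C: SepConv(C, C, 5, 1, 2),
    'dil_conv_3x3': lambda C: DilConv(C, C, 3, 1, 2, 2),
    'max_pool_3x3': lambda C: nn.MaxPool2d(3, 1, 1),
    'avg_pool_3x3': lambda C: nn.AvgPool2d(3, 1, 1),
}


class MixedOp(nn.Module):
    """Mixed operation: weighted sum of all candidate ops on one edge."""

    def __init__(self, C):
        super().__init__()
        self.ops = nn.ModuleList([OPS[name](C) for name in OPS])

    def forward(self, x, weights):
        return sum(w * op(x) for w, op in zip(weights, self.ops))


class DARTSCell(nn.Module):
    """Normal cell: every intermediate node sums mixed ops over all earlier nodes."""

    def __init__(self, C, num_nodes=4):
        super().__init__()
        self.num_nodes = num_nodes
        self.ops = nn.ModuleList()
        for j in range(2, num_nodes + 2):  # input nodes = 0, 1
            for i in range(j):
                self.ops.append(MixedOp(C))

    def forward(self, s0, s1, weights):
        # weights: (num_edges, num_ops) softmaxed architecture parameters
        states = [s0, s1]
        edge = 0
        for j in range(2, self.num_nodes + 2):
            node_inputs = []
            for i in range(j):
                node_inputs.append(self.ops[edge](states[i], weights[edge]))
                edge += 1
            states.append(sum(node_inputs))
        # Simplification: the original DARTS concatenates all intermediate nodes;
        # returning the last node keeps the channel count at C.
        return states[-1]


class DARTS(nn.Module):
    """Search network built from normal cells only (no reduction cells)."""

    def __init__(self, C=16, num_classes=10, layers=8, num_nodes=4):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(3, C, 3, 1, 1, bias=False),
            nn.BatchNorm2d(C)
        )
        self.cells = nn.ModuleList([
            DARTSCell(C, num_nodes) for _ in range(layers)
        ])
        self.classifier = nn.Linear(C, num_classes)
        # Architecture parameters, shared by all cells as in DARTS
        num_edges = sum(range(2, num_nodes + 2))
        self.alphas_normal = nn.Parameter(
            torch.randn(num_edges, len(OPS)) * 1e-3
        )

    def forward(self, x):
        weights = F.softmax(self.alphas_normal, dim=-1)
        s0 = s1 = self.stem(x)
        for cell in self.cells:
            s0, s1 = s1, cell(s0, s1, weights)
        out = F.adaptive_avg_pool2d(s1, 1).flatten(1)
        return self.classifier(out)


def derive_architecture(model):
    """Return the name of the strongest non-'none' operation on each edge.

    Full DARTS additionally keeps only the two strongest incoming edges per node.
    """
    op_names = list(OPS)
    probs = F.softmax(model.alphas_normal.detach(), dim=-1)
    probs[:, op_names.index('none')] = -1.0  # never select the zero op
    return [op_names[k] for k in probs.argmax(dim=-1).tolist()]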
Practical Considerations
💡 NAS Best Practices
- Search on proxy dataset: Use CIFAR-10 for search, transfer to ImageNet
- Use weight sharing: Train supernetwork once, evaluate subnets by weight inheritance
- Set GPU budget: DARTS: ~1-4 GPU days, RL: ~1000-2000 GPU days
- Regularize search: Add latency constraint for hardware-aware NAS
- Avoid degenerate solutions: DARTS may collapse to skip connections; counter this with regularization such as decay on skip ops
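One common way to express the latency constraint is as a soft penalty: the expected latency of each edge under the current softmax weights is added to the task loss. The sketch below assumes a hypothetical per-operation latency table (`LATENCY_MS`) and a penalty weight `lam`; in practice the table would come from measurements on the target device.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical per-operation latencies (milliseconds) for one edge on the target device.
LATENCY_MS = {
    "sep_conv_3x3": 1.8,
    "sep_conv_5x5": 3.1,
    "max_pool_3x3": 0.6,
    "skip_connect": 0.0,
}


def expected_latency(alphas):
    """Latency of one edge, averaged under softmax(alphas)."""
    weights = softmax(alphas)
    return sum(w * lat for w, lat in zip(weights, LATENCY_MS.values()))


def search_loss(task_loss, edge_alphas, lam=0.1):
    """Task loss plus a latency penalty summed over all edges."""
    return task_loss + lam * sum(expected_latency(a) for a in edge_alphas)


uniform = [0.0, 0.0, 0.0, 0.0]
print(expected_latency(uniform))
print(search_loss(1.0, [uniform, uniform]))
```

Because the penalty is differentiable in the architecture weights, it can be minimized together with the validation loss; a larger `lam` pushes the search toward cheaper operations.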
Practice Exercises
- DARTS on CIFAR-10: Implement and run DARTS search. Visualize the discovered architecture.
- Compound scaling: Reproduce EfficientNet scaling experiments. Plot accuracy vs. FLOPs.
- Hardware-aware NAS: Add latency objective to DARTS. Find Pareto-optimal architectures.
- Once-for-All: Train OFA supernetwork on MNIST. Extract subnets for different FLOP budgets.
Key Takeaways
📋Summary: Neural Architecture Search
- NAS automates architecture design with search space, strategy, and estimation
- DARTS: Differentiable relaxation enables gradient-based architecture search
- Cell-based search: Discover cells, then stack to form network
- EfficientNet: Compound scaling of depth, width, and resolution
- Once-for-All: Train one supernetwork, deploy many subnets
- Weight sharing reduces search cost from 1000+ GPU days to 1-4 GPU days
- Hardware-aware NAS optimizes for target latency/memory
- Discovered architectures often outperform human-designed ones
- Practical NAS: search on proxy tasks, transfer to target
- See also: MLOps for deployment tracking