Neural Architecture Search — Automated ML



Neural Architecture Search

NAS automates the design of neural network architectures, replacing manual engineering with algorithmic search. Architectures found this way, such as NASNet, AmoebaNet, and EfficientNet, have matched or outperformed hand-designed networks on image classification benchmarks.


NAS Framework

Definition: NAS Components

NAS consists of three components:

  1. Search space $\mathcal{A}$: Set of possible architectures (operations, connectivity)
  2. Search strategy: Algorithm to explore $\mathcal{A}$ (reinforcement learning, evolutionary, gradient-based)
  3. Performance estimation: Evaluate architectures efficiently (proxy tasks, weight sharing)
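
The three components can be made concrete with the simplest possible search strategy, uniform random sampling. The sketch below is illustrative only: the options in `SEARCH_SPACE` and the scoring formula in `estimate_performance` are made up and stand in for a real search space and a real (proxy) training run.

```python
import random

# 1. Search space: one choice per design decision (hypothetical options).
SEARCH_SPACE = {
    "num_layers": [4, 8, 12],
    "width": [16, 32, 64],
    "op": ["conv3x3", "sep_conv5x5", "max_pool"],
}


def sample_architecture(rng):
    """2. Search strategy: here, plain uniform random sampling."""
    return {name: rng.choice(options) for name, options in SEARCH_SPACE.items()}


def estimate_performance(arch):
    """3. Performance estimation: a made-up score standing in for the
    validation accuracy measured after a short proxy training run."""
    op_bonus = {"conv3x3": 0.02, "sep_conv5x5": 0.03, "max_pool": 0.0}
    return 0.5 + 0.01 * arch["num_layers"] + 0.002 * arch["width"] + op_bonus[arch["op"]]


def random_search(num_trials, seed=0):
    """Sample architectures and keep the best-scoring one."""
    rng = random.Random(seed)
    best_arch, best_score = None, float("-inf")
    for _ in range(num_trials):
        arch = sample_architecture(rng)
        score = estimate_performance(arch)
        if score > best_score:
            best_arch, best_score = arch, score
    return best_arch, best_score


best_arch, best_score = random_search(num_trials=50)
print(best_arch, round(best_score, 3))
```

Swapping `sample_architecture` for an RL controller, an evolutionary loop, or a gradient step changes the search strategy without touching the other two components.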

Search Space

Definition: Cell-Based Search Space

Most modern NAS uses a cell-based search space:

  • Normal cell: Preserves spatial dimensions
  • Reduction cell: Halves spatial dimensions, doubles channels
  • Each cell is a DAG with $N$ nodes (typically 4-6)
  • Edges are operations (conv 3x3, sep conv 5x5, max pool, etc.)

A network is built by stacking these cells.

Cell Output

$$o_j = \sum_{i < j} \bar{o}_i^{(i,j)}$$

Here,

  • $o_j$ = Output of intermediate node $j$
  • $\bar{o}_i^{(i,j)}$ = Output of the operation on edge $(i,j)$, applied to the output of node $i$
  • $\sum$ = Sum over all incoming edges
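
As a small worked instance of this sum, the sketch below evaluates a cell with two input nodes and two intermediate nodes. Scalars stand in for feature maps, and the operations in `edge_ops` are arbitrary placeholders chosen only to make the arithmetic easy to follow.

```python
def cell_forward(inputs, edge_ops, num_nodes):
    """Compute every node output of a cell DAG.

    inputs:    outputs of the cell's input nodes
    edge_ops:  dict mapping an edge (i, j) to the callable applied on it
    num_nodes: number of intermediate nodes
    """
    states = list(inputs)
    for j in range(len(inputs), len(inputs) + num_nodes):
        # Node j sums the transformed outputs of all earlier nodes i < j.
        states.append(sum(edge_ops[(i, j)](states[i]) for i in range(j)))
    return states


# Nodes 0 and 1 are inputs; nodes 2 and 3 are intermediate nodes.
edge_ops = {
    (0, 2): lambda x: x,        # stand-in for a skip connection
    (1, 2): lambda x: 2 * x,    # stand-in for a convolution
    (0, 3): lambda x: 0 * x,    # stand-in for the zero ("none") operation
    (1, 3): lambda x: x,
    (2, 3): lambda x: x + 1,
}

states = cell_forward([1.0, 3.0], edge_ops, num_nodes=2)
print(states)  # [1.0, 3.0, 7.0, 11.0]
```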

DARTS (Differentiable Architecture Search)

Definition: DARTS

DARTS (Liu et al., 2019) makes NAS differentiable by relaxing discrete architecture choices to continuous weights:

$$\bar{o}^{(i,j)}(x) = \sum_{o \in \mathcal{O}} \frac{\exp(\alpha_o^{(i,j)})}{\sum_{o'} \exp(\alpha_{o'}^{(i,j)})} \cdot o(x)$$

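Architecture parameters $\alpha$ and network weights $w$ are optimized jointly via bilevel optimization: the weights are fit on the training set for a given $\alpha$, and $\alpha$ is chosen to minimize the validation loss of those weights.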

$$\min_\alpha \mathcal{L}_{\text{val}}(w^*(\alpha), \alpha) \quad \text{s.t.} \quad w^*(\alpha) = \arg\min_w \mathcal{L}_{\text{train}}(w, \alpha)$$

DARTS Continuous Relaxation

$$\bar{o}^{(i,j)}(x) = \sum_{o \in \mathcal{O}} \frac{\exp(\alpha_o^{(i,j)})}{\sum_{o'} \exp(\alpha_{o'}^{(i,j)})} \cdot o(x)$$

Here,

  • $\alpha_o^{(i,j)}$ = Architecture weight for operation $o$ on edge $(i,j)$
  • $\mathcal{O}$ = Set of candidate operations
  • $o(x)$ = Output of operation $o$ on input $x$
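
A numeric sketch of the relaxation, using only the standard library: the mixed output is a softmax-weighted average of the candidate outputs, so equal $\alpha$ values blend all operations evenly, while one large $\alpha$ makes the edge behave almost like a single discrete choice. The three toy operations below are placeholders for real layers.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def mixed_op(x, alphas, ops):
    """Continuous relaxation: softmax(alpha)-weighted sum of all candidate ops."""
    weights = softmax(alphas)
    return sum(w * op(x) for w, op in zip(weights, ops))


# Toy stand-ins for the zero op, a skip connection, and a convolution.
ops = [lambda x: 0.0, lambda x: x, lambda x: 2.0 * x]

uniform = mixed_op(3.0, [0.0, 0.0, 0.0], ops)   # every op weighted 1/3
peaked = mixed_op(3.0, [0.0, 0.0, 10.0], ops)   # almost all weight on the last op
print(uniform, peaked)
```

With uniform weights the output is the plain average of 0, 3, and 6; with the peaked weights it approaches 6, the output of the last operation alone.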

ℹ️ DARTS Training

DARTS alternates between:

  1. Step 1: Update weights $w$ on training data (minimize $\mathcal{L}_{\text{train}}$)
  2. Step 2: Update architecture params $\alpha$ on validation data (minimize $\mathcal{L}_{\text{val}}$)

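After search, the final architecture is derived by selecting the operation with the highest $\alpha$ on each edge. In the original DARTS paper, the zero ('none') operation is excluded from this choice and each intermediate node keeps only its two strongest incoming edges.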
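
The alternating schedule can be sketched on a toy problem. This is a minimal illustration of the first-order variant (each step uses the current weights rather than differentiating through $w^*(\alpha)$), not a faithful DARTS implementation: `TinyMixedNet`, the random data, the optimizers, and the learning rates are all assumptions made for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyMixedNet(nn.Module):
    """A single mixed edge with two candidate ops, followed by a classifier."""

    def __init__(self, dim=8, num_classes=2):
        super().__init__()
        self.ops = nn.ModuleList([
            nn.Linear(dim, dim),
            nn.Sequential(nn.Linear(dim, dim), nn.Tanh()),
        ])
        self.head = nn.Linear(dim, num_classes)
        self.alpha = nn.Parameter(torch.zeros(len(self.ops)))  # architecture params

    def forward(self, x):
        weights = F.softmax(self.alpha, dim=0)
        mixed = sum(w * op(x) for w, op in zip(weights, self.ops))
        return self.head(mixed)

    def weight_parameters(self):
        """All parameters except the architecture parameters."""
        return [p for name, p in self.named_parameters() if name != "alpha"]


torch.manual_seed(0)
model = TinyMixedNet()
w_opt = torch.optim.SGD(model.weight_parameters(), lr=0.1)
a_opt = torch.optim.Adam([model.alpha], lr=0.01)

# Random tensors stand in for the training and validation splits.
x_train, y_train = torch.randn(64, 8), torch.randint(0, 2, (64,))
x_val, y_val = torch.randn(64, 8), torch.randint(0, 2, (64,))

alpha_before = model.alpha.detach().clone()
for step in range(20):
    # Step 1: update network weights w on the training split.
    w_opt.zero_grad()
    F.cross_entropy(model(x_train), y_train).backward()
    w_opt.step()

    # Step 2: update architecture parameters alpha on the validation split.
    a_opt.zero_grad()
    F.cross_entropy(model(x_val), y_val).backward()
    a_opt.step()

print(F.softmax(model.alpha, dim=0))
```

Keeping two optimizers with disjoint parameter groups is what separates the two steps: each `backward()` fills gradients for every parameter, but only the optimizer that is stepped applies them.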


EfficientNet

Definition: EfficientNet

EfficientNet (Tan & Le, 2019) uses compound scaling to jointly scale depth, width, and resolution:

$$d = \alpha^\phi, \quad w = \beta^\phi, \quad r = \gamma^\phi$$

subject to $\alpha \cdot \beta^2 \cdot \gamma^2 \approx 2$

  • $\alpha$: depth coefficient
  • $\beta$: width coefficient
  • $\gamma$: resolution coefficient
  • $\phi$: compound coefficient controlling overall scale

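The baseline network (EfficientNet-B0) is found via NAS; larger variants are obtained by increasing $\phi$ with $\alpha$, $\beta$, and $\gamma$ held fixed.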

Compound Scaling

$$d = \alpha^\phi, \quad w = \beta^\phi, \quad r = \gamma^\phi \quad \text{s.t.} \quad \alpha \cdot \beta^2 \cdot \gamma^2 \approx 2$$

Here,

  • $d$ = Depth (number of layers)
  • $w$ = Width (number of channels)
  • $r$ = Resolution (input image size)
  • $\phi$ = Compound coefficient
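
A short calculation shows why the constraint matters. Convolutional FLOPs grow roughly in proportion to $d \cdot w^2 \cdot r^2$, so with $\alpha \cdot \beta^2 \cdot \gamma^2 \approx 2$ each unit increase in $\phi$ roughly doubles the compute. The sketch uses the coefficients reported in the EfficientNet paper ($\alpha = 1.2$, $\beta = 1.1$, $\gamma = 1.15$); the helper names are illustrative.

```python
# Coefficients reported by Tan & Le for the EfficientNet-B0 baseline.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15


def compound_scale(phi):
    """Return the (depth, width, resolution) multipliers for a given phi."""
    return ALPHA ** phi, BETA ** phi, GAMMA ** phi


def flops_multiplier(phi):
    """Approximate FLOPs growth, assuming FLOPs scale with d * w^2 * r^2."""
    d, w, r = compound_scale(phi)
    return d * w ** 2 * r ** 2


for phi in range(4):
    d, w, r = compound_scale(phi)
    print(phi, round(d, 3), round(w, 3), round(r, 3), round(flops_multiplier(phi), 2))
```

The multipliers apply to the baseline's layer count, channel count, and input resolution; in practice the scaled values are rounded to valid integers.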

Once-for-All (OFA)

Definition: Once-for-All Network

OFA (Cai et al., 2020) trains a single supernetwork that contains all possible sub-networks:

  1. Progressive shrinking: Train the full network, then progressively fine-tune smaller subnets
  2. Elastic depth/width/kernel: Support flexible depth, width, and kernel size
  3. Deployment-specific search: Find optimal subnet for target hardware without retraining

This removes the need to retrain a network for each deployment target; only a lightweight search over the already-trained supernetwork is required.
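
The deployment-specific search step can be sketched as picking the best subnet configuration that fits a latency budget. Everything below is hypothetical: the elastic choices are illustrative, and `predicted_latency_ms` and `predicted_accuracy` are made-up formulas standing in for the latency lookup table and accuracy predictor a real system would use.

```python
import itertools

# Illustrative elastic dimensions of the supernetwork.
DEPTHS = [2, 3, 4]
WIDTHS = [3, 4, 6]
KERNELS = [3, 5, 7]


def predicted_latency_ms(depth, width, kernel):
    """Made-up cost model: latency grows with every elastic dimension."""
    return depth * width * kernel / 6.0


def predicted_accuracy(depth, width, kernel):
    """Made-up accuracy predictor: larger subnets score higher."""
    return 0.60 + 0.02 * depth + 0.01 * width + 0.005 * kernel


def search_subnet(latency_budget_ms):
    """Return the most accurate (depth, width, kernel) config within budget."""
    feasible = [
        cfg for cfg in itertools.product(DEPTHS, WIDTHS, KERNELS)
        if predicted_latency_ms(*cfg) <= latency_budget_ms
    ]
    if not feasible:
        return None
    return max(feasible, key=lambda cfg: predicted_accuracy(*cfg))


print(search_subnet(4.6))    # tight budget, small subnet
print(search_subnet(100.0))  # loose budget, largest subnet
```

No training happens inside `search_subnet`: the selected subnet would inherit its weights from the supernetwork, which is why the search is cheap enough to repeat per device.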


Search Strategies

Definition: Reinforcement Learning (NASNet)

Use an RNN controller to generate architecture descriptions:

  • Controller predicts architecture as a sequence of decisions
  • Train architecture, evaluate on validation set
  • Use accuracy as reward to update controller via REINFORCE

Pros: Flexible search space. Cons: Very expensive (roughly 1000-2000 GPU days).
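
The REINFORCE update at the core of this loop can be shown on a drastically simplified controller: a single categorical choice over three operations instead of an RNN emitting a sequence. The rewards in `TOY_REWARD` are invented stand-ins for validation accuracy, and the moving-average baseline is one common variance-reduction choice, assumed here for illustration.

```python
import math
import random

OPS = ["conv3x3", "sep_conv5x5", "max_pool"]
# Made-up validation accuracies used as rewards.
TOY_REWARD = {"conv3x3": 0.90, "sep_conv5x5": 0.93, "max_pool": 0.70}


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_update(logits, action, reward, baseline, lr):
    """One REINFORCE step: logits += lr * (reward - baseline) * grad log pi(action).

    For a softmax policy, grad log pi(action) w.r.t. logit k is 1[k == action] - p_k.
    """
    probs = softmax(logits)
    advantage = reward - baseline
    return [
        logit + lr * advantage * ((1.0 if k == action else 0.0) - probs[k])
        for k, logit in enumerate(logits)
    ]


def train_controller(steps=500, lr=0.1, seed=0):
    rng = random.Random(seed)
    logits = [0.0] * len(OPS)
    baseline = sum(TOY_REWARD.values()) / len(TOY_REWARD)
    for _ in range(steps):
        # Controller samples an architecture decision.
        action = rng.choices(range(len(OPS)), weights=softmax(logits))[0]
        # "Train and evaluate" the sampled architecture (toy lookup here).
        reward = TOY_REWARD[OPS[action]]
        logits = reinforce_update(logits, action, reward, baseline, lr)
        baseline = 0.9 * baseline + 0.1 * reward  # moving-average baseline
    return softmax(logits)


probs = train_controller()
print(dict(zip(OPS, (round(p, 3) for p in probs))))
```

In a real search, the reward lookup is replaced by training the sampled child network, which is where nearly all of the cost comes from.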

Definition: Evolutionary Search (AmoebaNet)

Use genetic algorithms to evolve architectures:

  • Population of architectures
  • Tournament selection based on fitness (validation accuracy)
  • Mutation: change operations, connectivity
  • Crossover (optional): combine two parent architectures; AmoebaNet's regularized evolution uses mutation only

Pros: Parallelizable. Cons: Still expensive.
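
A minimal sketch of tournament selection with mutation, in the style of aging (regularized) evolution where the oldest individual is removed each cycle. The encoding (one operation per edge) and the `fitness` function are toy assumptions; a real run would train each child and use its validation accuracy.

```python
import random
from collections import deque

OPS = ["conv3x3", "sep_conv5x5", "max_pool", "skip"]
NUM_EDGES = 6


def fitness(arch):
    """Made-up fitness standing in for validation accuracy."""
    score = {"conv3x3": 2, "sep_conv5x5": 3, "max_pool": 1, "skip": 0}
    return sum(score[op] for op in arch)


def mutate(arch, rng):
    """Replace the operation on one randomly chosen edge."""
    child = list(arch)
    child[rng.randrange(len(child))] = rng.choice(OPS)
    return tuple(child)


def evolve(pop_size=20, tournament_size=5, cycles=200, seed=0):
    rng = random.Random(seed)
    population = deque(
        tuple(rng.choice(OPS) for _ in range(NUM_EDGES)) for _ in range(pop_size)
    )
    best = max(population, key=fitness)
    initial_best = fitness(best)
    for _ in range(cycles):
        # Tournament selection: the fittest of a random sample becomes the parent.
        contenders = rng.sample(list(population), tournament_size)
        parent = max(contenders, key=fitness)
        child = mutate(parent, rng)
        population.append(child)
        population.popleft()  # aging: discard the oldest individual
        if fitness(child) > fitness(best):
            best = child
    return best, initial_best


best, initial_best = evolve()
print(best, fitness(best), initial_best)
```

Because each child is evaluated independently, the fitness calls inside the loop are the part that parallelizes across workers.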


PyTorch Implementation

📝Example: DARTS Implementation

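import torch
import torch.nn as nn
import torch.nn.functional as F


# Simplified stand-ins for the helper ops used below. The original DARTS
# code defines richer versions (e.g. SepConv applies its stack twice).
class Zero(nn.Module):
    """'none' operation: outputs zeros, i.e. no connection on this edge."""
    def forward(self, x):
        return torch.zeros_like(x)


class SepConv(nn.Module):
    """Depthwise-separable conv: ReLU -> depthwise -> pointwise -> BN."""
    def __init__(self, C_in, C_out, kernel_size, stride, padding, dilation=1):
        super().__init__()
        self.op = nn.Sequential(
            nn.ReLU(inplace=False),
            nn.Conv2d(C_in, C_in, kernel_size, stride, padding,
                      dilation=dilation, groups=C_in, bias=False),
            nn.Conv2d(C_in, C_out, 1, bias=False),
            nn.BatchNorm2d(C_out, affine=False),
        )

    def forward(self, x):
        return self.op(x)


class DilConv(SepConv):
    """Dilated separable conv; same layout as SepConv with dilation > 1."""


# Candidate operations (all preserve the spatial size and channel count)
OPS = {
    'none': lambda C: Zero(),
    'skip_connect': lambda C: nn.Identity(),
    'sep_conv_3x3': lambda C: SepConv(C, C, 3, 1, 1),
    'sep_conv_5x5': lambda C: SepConv(C, C, 5, 1, 2),
    'dil_conv_3x3': lambda C: DilConv(C, C, 3, 1, 2, 2),
    'max_pool_3x3': lambda C: nn.MaxPool2d(3, 1, 1),
    'avg_pool_3x3': lambda C: nn.AvgPool2d(3, 1, 1),
}


class MixedOp(nn.Module):
    """Mixed operation: weighted sum of all candidate ops on one edge."""
    def __init__(self, C):
        super().__init__()
        self.ops = nn.ModuleList([OPS[name](C) for name in OPS])

    def forward(self, x, weights):
        return sum(w * op(x) for w, op in zip(weights, self.ops))


class DARTSCell(nn.Module):
    """DARTS cell. Architecture weights are passed in so all cells share them."""
    def __init__(self, C, num_nodes=4):
        super().__init__()
        self.num_nodes = num_nodes
        self.ops = nn.ModuleList()

        for j in range(2, num_nodes + 2):  # nodes 0 and 1 are the cell inputs
            for i in range(j):
                self.ops.append(MixedOp(C))

    def forward(self, s0, s1, weights):
        states = [s0, s1]
        edge = 0

        for j in range(2, self.num_nodes + 2):
            node_inputs = []
            for i in range(j):
                node_inputs.append(self.ops[edge](states[i], weights[edge]))
                edge += 1
            states.append(sum(node_inputs))

        # Simplification: return the last node so the channel count stays C.
        # The original DARTS cell concatenates all intermediate nodes.
        return states[-1]


class DARTS(nn.Module):
    def __init__(self, C=16, num_classes=10, layers=8, num_nodes=4):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(3, C, 3, 1, 1, bias=False),
            nn.BatchNorm2d(C)
        )
        self.cells = nn.ModuleList([
            DARTSCell(C, num_nodes) for _ in range(layers)
        ])
        self.classifier = nn.Linear(C, num_classes)

        # Architecture parameters, shared by every cell. Only normal cells
        # are built here; reduction cells would need their own alphas_reduce.
        num_edges = sum(range(2, num_nodes + 2))
        self.alphas_normal = nn.Parameter(
            torch.randn(num_edges, len(OPS)) * 1e-3
        )

    def forward(self, x):
        # One softmax per edge (row) over the candidate operations.
        weights = F.softmax(self.alphas_normal, dim=-1)
        s0 = s1 = self.stem(x)
        for cell in self.cells:
            s0, s1 = s1, cell(s0, s1, weights)
        out = F.adaptive_avg_pool2d(s1, 1).flatten(1)
        return self.classifier(out)


def derive_architecture(model):
    """Return, for each edge, the index into list(OPS) of the strongest op."""
    # Softmax is monotonic, so argmax over raw alphas gives the same result.
    ops_per_edge = model.alphas_normal.argmax(dim=1)
    return ops_per_edge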

Practical Considerations

💡 NAS Best Practices

  1. Search on proxy dataset: Use CIFAR-10 for search, transfer to ImageNet
  2. Use weight sharing: Train supernetwork once, evaluate subnets by weight inheritance
  3. Set GPU budget: DARTS: ~1-4 GPU days, RL: ~1000-2000 GPU days
  4. Regularize search: Add latency constraint for hardware-aware NAS
  5. Avoid degenerate solutions: DARTS may collapse to skip connections — use decay on skip ops
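
One common way to express the latency constraint from item 4 in a differentiable search is to add the expected latency under the softmax weights to the validation loss. The sketch below illustrates that idea with the standard library only; the per-operation latencies in `LATENCY_MS` and the trade-off weight `lam` are invented for the example, and a real setup would measure latencies on the target device.

```python
import math

# Hypothetical per-op latencies (ms): none, skip, sep_conv_3x3, sep_conv_5x5.
LATENCY_MS = [0.0, 0.1, 1.8, 3.2]


def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def expected_latency(alphas_per_edge):
    """Sum over edges of the softmax-weighted average op latency."""
    return sum(
        sum(w * lat for w, lat in zip(softmax(alphas), LATENCY_MS))
        for alphas in alphas_per_edge
    )


def search_loss(val_loss, alphas_per_edge, lam=0.05):
    """Validation loss plus a latency penalty; lam sets the trade-off."""
    return val_loss + lam * expected_latency(alphas_per_edge)


uniform_edges = [[0.0] * 4, [0.0] * 4]
cheap_edges = [[0.0, 5.0, 0.0, 0.0], [0.0, 5.0, 0.0, 0.0]]  # favors skip
print(expected_latency(uniform_edges), expected_latency(cheap_edges))
print(search_loss(1.0, uniform_edges, lam=0.1))
```

Because the penalty is a smooth function of $\alpha$, it can be minimized with the same gradient step as the validation loss. Note that it pushes toward cheap operations such as skip connections, so it interacts with item 5 and `lam` needs tuning.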

Practice Exercises

  1. DARTS on CIFAR-10: Implement and run DARTS search. Visualize the discovered architecture.

  2. Compound scaling: Reproduce EfficientNet scaling experiments. Plot accuracy vs. FLOPs.

  3. Hardware-aware NAS: Add latency objective to DARTS. Find Pareto-optimal architectures.

  4. Once-for-All: Train OFA supernetwork on MNIST. Extract subnets for different FLOP budgets.


Key Takeaways

📋Summary: Neural Architecture Search

  • NAS automates architecture design with search space, strategy, and estimation
  • DARTS: Differentiable relaxation enables gradient-based architecture search
  • Cell-based search: Discover cells, then stack to form network
  • EfficientNet: Compound scaling of depth, width, and resolution
  • Once-for-All: Train one supernetwork, deploy many subnets
  • Weight sharing reduces search cost from 1000+ GPU days to 1-4 GPU days
  • Hardware-aware NAS optimizes for target latency/memory
  • Discovered architectures often outperform human-designed ones
  • Practical NAS: search on proxy tasks, transfer to target
  • See also: MLOps for deployment tracking
