Deep Learning Systems Design — Distributed Training and Production

ProductionSystemsFree Lesson

Advertisement

Deep Learning Systems Design

Building production ML systems requires understanding distributed training, mixed precision, monitoring, and deployment. This covers the engineering side of deep learning at scale.


Distributed Data Parallelism

DfData Parallelism

Split mini-batches across GPUs, each holding a full model copy:

  1. Each GPU processes a different data shard
  2. Compute gradients independently
  3. AllReduce to average gradients across GPUs
  4. Each GPU updates its model copy

Effective when model fits in single GPU memory.

AllReduce Gradient Averaging

gˉ=1Ni=1Ngi\bar{g} = \frac{1}{N} \sum_{i=1}^{N} g_i

Here,

  • NN=Number of GPUs
  • gig_i=Gradient from GPU i
  • gˉ\bar{g}=Averaged gradient

Model Parallelism

DfModel Parallelism

Split the model across GPUs when it's too large for a single GPU:

  • Tensor parallelism: Split individual layers across GPUs (e.g., split weight matrices)
  • Pipeline parallelism: Split layers across GPUs, process micro-batches in pipeline
  • Expert parallelism: In MoE, different experts on different GPUs

Pipeline parallelism schedules micro-batches to minimize idle time (GPipe, 1F1B scheduling).

Pipeline Bubble Overhead

Bubble fraction=(p1)m+p1\text{Bubble fraction} = \frac{(p-1)}{m + p - 1}

Here,

  • pp=Number of pipeline stages (GPUs)
  • mm=Number of micro-batches

Mixed Precision Training

DfMixed Precision

Use FP16 for compute (2x speedup, 2x memory savings) while keeping FP32 master weights for numerical stability:

  1. Maintain FP32 master weights
  2. Convert to FP16 for forward/backward
  3. Compute gradients in FP16
  4. Convert gradients to FP32 for update
  5. Loss scaling: Multiply loss by large scalar to prevent gradient underflow

💡 PyTorch AMP

from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()

for data, target in loader:
    optimizer.zero_grad()
    with autocast():  # FP16 forward
        output = model(data)
        loss = criterion(output, target)

    scaler.scale(loss).backward()  # Scale gradients
    scaler.step(optimizer)         # Unscale + clip + step
    scaler.update()                # Adjust scale factor

Gradient Accumulation

DfGradient Accumulation

Simulate larger batch sizes by accumulating gradients over multiple forward-backward passes before updating:

geffective=1Kk=1Kgkg_{\text{effective}} = \frac{1}{K} \sum_{k=1}^{K} g_k

where KK is the accumulation steps. Effective batch size = K×micro_batch×num_GPUsK \times \text{micro\_batch} \times \text{num\_GPUs}.

Effective Batch Size

Beffective=K×Bmicro×NGPUsB_{\text{effective}} = K \times B_{\text{micro}} \times N_{\text{GPUs}}

Here,

  • KK=Gradient accumulation steps
  • BmicroB_{\text{micro}}=Per-GPU batch size
  • NGPUsN_{\text{GPUs}}=Number of GPUs

Communication Overhead

ThCommunication-Computation Tradeoff

Distributed training speedup is limited by communication overhead:

Speedup=N1+BparamsBcompute\text{Speedup} = \frac{N}{1 + \frac{B_{\text{params}}}{B_{\text{compute}}}}

where BparamsB_{\text{params}} is parameter communication cost and BcomputeB_{\text{compute}} is computation time. Large batch sizes amortize communication cost.

Communication-Computation Ratio

ρ=Communication timeComputation time=2Psize(param)/bandwidthFLOPs/FLOPS\rho = \frac{\text{Communication time}}{\text{Computation time}} = \frac{2 \cdot P \cdot \text{size}(\text{param}) / \text{bandwidth}}{FLOPs / \text{FLOPS}}

Here,

  • PP=Number of GPUs
  • FLOPsFLOPs=Floating-point operations per iteration
  • bandwidthbandwidth=Interconnect bandwidth (GB/s)

ℹ️ Communication Backends

  • NCCL (NVIDIA): GPU-to-GPU optimized, supports NVLink, NVSwitch
  • Gloo: CPU and GPU, good for heterogeneous setups
  • MPI: Standard for HPC clusters
  • NVLink: 600 GB/s per GPU (A100) vs PCIe 64 GB/s

ML System Architecture

DfML System Components

A production ML system typically includes:

  1. Data pipeline: Ingestion, preprocessing, feature store
  2. Training: Distributed training with experiment tracking
  3. Model registry: Versioned model storage
  4. Serving: Low-latency inference (batch/real-time)
  5. Monitoring: Data drift, model performance, latency
  6. Feedback loop: A/B testing, online learning

Monitoring

DfML Monitoring

Continuous monitoring of:

  • Data drift: Distribution shift in input features (PSI, KS test)
  • Concept drift: Relationship between inputs and outputs changes
  • Model performance: Accuracy, latency, throughput
  • System health: GPU utilization, memory, network I/O
  • Fairness: Bias metrics across demographic groups

Population Stability Index (PSI)

PSI=i=1B(piqi)ln(piqi)\text{PSI} = \sum_{i=1}^{B} (p_i - q_i) \cdot \ln\left(\frac{p_i}{q_i}\right)

Here,

  • pip_i=Proportion of observations in bin i for reference
  • qiq_i=Proportion of observations in bin i for current
  • BB=Number of bins

A/B Testing

DfA/B Testing for ML

Randomly assign users to control (old model) and treatment (new model) groups:

  1. Define metric (accuracy, revenue, engagement)
  2. Determine sample size (power analysis)
  3. Run experiment for sufficient time
  4. Statistical significance testing (t-test, chi-squared)
  5. Deploy winner

For DL models: compare inference latency, accuracy, and business metrics.

Minimum Sample Size

n=(Zα/2+Zβ)22σ2δ2n = \frac{(Z_{\alpha/2} + Z_\beta)^2 \cdot 2\sigma^2}{\delta^2}

Here,

  • Zα/2Z_{\alpha/2}=Critical value for significance level \alpha
  • ZβZ_\beta=Critical value for power 1-\beta
  • σ\sigma=Standard deviation of metric
  • δ\delta=Minimum detectable effect size

PyTorch Implementation

📝Example: Distributed Data Parallel

import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler


def setup(rank, world_size):
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)


def cleanup():
    dist.destroy_process_group()


def train(rank, world_size):
    setup(rank, world_size)

    model = nn.Sequential(
        nn.Linear(784, 256), nn.ReLU(),
        nn.Linear(256, 128), nn.ReLU(),
        nn.Linear(128, 10)
    ).to(rank)

    # Wrap with DDP
    model = DDP(model, device_ids=[rank])

    # Distributed sampler
    sampler = DistributedSampler(
        dataset, num_replicas=world_size, rank=rank, shuffle=True
    )
    loader = DataLoader(dataset, batch_size=64, sampler=sampler)

    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    for epoch in range(10):
        sampler.set_epoch(epoch)  # Important for shuffling
        for data, target in loader:
            data, target = data.to(rank), target.to(rank)
            optimizer.zero_grad()
            output = model(data)
            loss = nn.functional.cross_entropy(output, target)
            loss.backward()  # DDP averages gradients automatically
            optimizer.step()

    cleanup()


# Launch with: torchrun --nproc_per_node=4 train.py

📝Example: Mixed Precision + Gradient Accumulation

from torch.cuda.amp import autocast, GradScaler

def train_with_accumulation(model, loader, optimizer, accumulation_steps=4):
    scaler = GradScaler()
    model.train()
    optimizer.zero_grad()

    for i, (data, target) in enumerate(loader):
        data, target = data.cuda(), target.cuda()

        with autocast():
            output = model(data)
            loss = nn.functional.cross_entropy(output, target)
            loss = loss / accumulation_steps

        scaler.scale(loss).backward()

        if (i + 1) % accumulation_steps == 0:
            scaler.unscale_(optimizer)
            nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            scaler.step(optimizer)
            scaler.update()
            optimizer.zero_grad()

# Effective batch size = 32 * 4 * 4 GPUs = 512

System Design Patterns

💡 DL System Architecture Patterns

  1. Lambda architecture: Batch layer + speed layer + serving layer
  2. Kappa architecture: Stream-only, reprocess all data as events
  3. Feature store: Centralized feature computation and serving
  4. Model registry: Versioned model storage with metadata
  5. Shadow deployment: Run new model alongside old, compare without affecting users
  6. Canary deployment: Gradually route traffic to new model (1% → 10% → 100%)

Practice Exercises

  1. DDP training: Train ResNet-50 on 4 GPUs using DDP. Measure speedup vs. single GPU.

  2. Mixed precision: Compare FP32 vs AMP training time and memory on ImageNet.

  3. Pipeline parallelism: Implement pipeline parallelism for a 24-layer Transformer.

  4. Monitoring dashboard: Set up data drift monitoring with PSI for a production model.


Key Takeaways

📋Summary: DL Systems Design

  • Data parallelism: Split data across GPUs, AllReduce gradients — most common
  • Model parallelism: Split model when too large for single GPU
  • Mixed precision: FP16 computation + FP32 master weights = 2x speedup
  • Gradient accumulation: Simulate larger batch sizes with limited memory
  • Communication overhead: Limiting factor — use large batches, efficient interconnects
  • Monitoring: Track data drift, concept drift, model performance, fairness
  • A/B testing: Statistical testing for model comparison in production
  • Pipeline parallelism: GPipe/1F1B scheduling to minimize bubble overhead
  • NCCL: Standard for GPU-to-GPU communication
  • See also: MLOps for experiment tracking

Advertisement

Need Expert Deep Learning Help?

Get personalized tutoring, project support, or professional consulting.

Advertisement