Deep Learning Systems Design
Building production ML systems requires understanding distributed training, mixed precision, monitoring, and deployment. This covers the engineering side of deep learning at scale.
Distributed Data Parallelism
DfData Parallelism
Split mini-batches across GPUs, each holding a full model copy:
- Each GPU processes a different data shard
- Compute gradients independently
- AllReduce to average gradients across GPUs
- Each GPU updates its model copy
Effective when model fits in single GPU memory.
AllReduce Gradient Averaging
Here,
- =Number of GPUs
- =Gradient from GPU i
- =Averaged gradient
Model Parallelism
DfModel Parallelism
Split the model across GPUs when it's too large for a single GPU:
- Tensor parallelism: Split individual layers across GPUs (e.g., split weight matrices)
- Pipeline parallelism: Split layers across GPUs, process micro-batches in pipeline
- Expert parallelism: In MoE, different experts on different GPUs
Pipeline parallelism schedules micro-batches to minimize idle time (GPipe, 1F1B scheduling).
Pipeline Bubble Overhead
Here,
- =Number of pipeline stages (GPUs)
- =Number of micro-batches
Mixed Precision Training
DfMixed Precision
Use FP16 for compute (2x speedup, 2x memory savings) while keeping FP32 master weights for numerical stability:
- Maintain FP32 master weights
- Convert to FP16 for forward/backward
- Compute gradients in FP16
- Convert gradients to FP32 for update
- Loss scaling: Multiply loss by large scalar to prevent gradient underflow
💡 PyTorch AMP
from torch.cuda.amp import autocast, GradScaler
scaler = GradScaler()
for data, target in loader:
optimizer.zero_grad()
with autocast(): # FP16 forward
output = model(data)
loss = criterion(output, target)
scaler.scale(loss).backward() # Scale gradients
scaler.step(optimizer) # Unscale + clip + step
scaler.update() # Adjust scale factor
Gradient Accumulation
DfGradient Accumulation
Simulate larger batch sizes by accumulating gradients over multiple forward-backward passes before updating:
where is the accumulation steps. Effective batch size = .
Effective Batch Size
Here,
- =Gradient accumulation steps
- =Per-GPU batch size
- =Number of GPUs
Communication Overhead
ThCommunication-Computation Tradeoff
Distributed training speedup is limited by communication overhead:
where is parameter communication cost and is computation time. Large batch sizes amortize communication cost.
Communication-Computation Ratio
Here,
- =Number of GPUs
- =Floating-point operations per iteration
- =Interconnect bandwidth (GB/s)
ℹ️ Communication Backends
- NCCL (NVIDIA): GPU-to-GPU optimized, supports NVLink, NVSwitch
- Gloo: CPU and GPU, good for heterogeneous setups
- MPI: Standard for HPC clusters
- NVLink: 600 GB/s per GPU (A100) vs PCIe 64 GB/s
ML System Architecture
DfML System Components
A production ML system typically includes:
- Data pipeline: Ingestion, preprocessing, feature store
- Training: Distributed training with experiment tracking
- Model registry: Versioned model storage
- Serving: Low-latency inference (batch/real-time)
- Monitoring: Data drift, model performance, latency
- Feedback loop: A/B testing, online learning
Monitoring
DfML Monitoring
Continuous monitoring of:
- Data drift: Distribution shift in input features (PSI, KS test)
- Concept drift: Relationship between inputs and outputs changes
- Model performance: Accuracy, latency, throughput
- System health: GPU utilization, memory, network I/O
- Fairness: Bias metrics across demographic groups
Population Stability Index (PSI)
Here,
- =Proportion of observations in bin i for reference
- =Proportion of observations in bin i for current
- =Number of bins
A/B Testing
DfA/B Testing for ML
Randomly assign users to control (old model) and treatment (new model) groups:
- Define metric (accuracy, revenue, engagement)
- Determine sample size (power analysis)
- Run experiment for sufficient time
- Statistical significance testing (t-test, chi-squared)
- Deploy winner
For DL models: compare inference latency, accuracy, and business metrics.
Minimum Sample Size
Here,
- =Critical value for significance level \alpha
- =Critical value for power 1-\beta
- =Standard deviation of metric
- =Minimum detectable effect size
PyTorch Implementation
📝Example: Distributed Data Parallel
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler
def setup(rank, world_size):
dist.init_process_group("nccl", rank=rank, world_size=world_size)
torch.cuda.set_device(rank)
def cleanup():
dist.destroy_process_group()
def train(rank, world_size):
setup(rank, world_size)
model = nn.Sequential(
nn.Linear(784, 256), nn.ReLU(),
nn.Linear(256, 128), nn.ReLU(),
nn.Linear(128, 10)
).to(rank)
# Wrap with DDP
model = DDP(model, device_ids=[rank])
# Distributed sampler
sampler = DistributedSampler(
dataset, num_replicas=world_size, rank=rank, shuffle=True
)
loader = DataLoader(dataset, batch_size=64, sampler=sampler)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
for epoch in range(10):
sampler.set_epoch(epoch) # Important for shuffling
for data, target in loader:
data, target = data.to(rank), target.to(rank)
optimizer.zero_grad()
output = model(data)
loss = nn.functional.cross_entropy(output, target)
loss.backward() # DDP averages gradients automatically
optimizer.step()
cleanup()
# Launch with: torchrun --nproc_per_node=4 train.py
📝Example: Mixed Precision + Gradient Accumulation
from torch.cuda.amp import autocast, GradScaler
def train_with_accumulation(model, loader, optimizer, accumulation_steps=4):
scaler = GradScaler()
model.train()
optimizer.zero_grad()
for i, (data, target) in enumerate(loader):
data, target = data.cuda(), target.cuda()
with autocast():
output = model(data)
loss = nn.functional.cross_entropy(output, target)
loss = loss / accumulation_steps
scaler.scale(loss).backward()
if (i + 1) % accumulation_steps == 0:
scaler.unscale_(optimizer)
nn.utils.clip_grad_norm_(model.parameters(), 1.0)
scaler.step(optimizer)
scaler.update()
optimizer.zero_grad()
# Effective batch size = 32 * 4 * 4 GPUs = 512
System Design Patterns
💡 DL System Architecture Patterns
- Lambda architecture: Batch layer + speed layer + serving layer
- Kappa architecture: Stream-only, reprocess all data as events
- Feature store: Centralized feature computation and serving
- Model registry: Versioned model storage with metadata
- Shadow deployment: Run new model alongside old, compare without affecting users
- Canary deployment: Gradually route traffic to new model (1% → 10% → 100%)
Practice Exercises
-
DDP training: Train ResNet-50 on 4 GPUs using DDP. Measure speedup vs. single GPU.
-
Mixed precision: Compare FP32 vs AMP training time and memory on ImageNet.
-
Pipeline parallelism: Implement pipeline parallelism for a 24-layer Transformer.
-
Monitoring dashboard: Set up data drift monitoring with PSI for a production model.
Key Takeaways
📋Summary: DL Systems Design
- Data parallelism: Split data across GPUs, AllReduce gradients — most common
- Model parallelism: Split model when too large for single GPU
- Mixed precision: FP16 computation + FP32 master weights = 2x speedup
- Gradient accumulation: Simulate larger batch sizes with limited memory
- Communication overhead: Limiting factor — use large batches, efficient interconnects
- Monitoring: Track data drift, concept drift, model performance, fairness
- A/B testing: Statistical testing for model comparison in production
- Pipeline parallelism: GPipe/1F1B scheduling to minimize bubble overhead
- NCCL: Standard for GPU-to-GPU communication
- See also: MLOps for experiment tracking