Model Parallelism for Inference — Serving Models Too Large for One GPU
When a model exceeds single-GPU memory, model parallelism splits it across multiple GPUs. This guide covers tensor parallelism, pipeline parallelism, and hybrid strategies for inference.
- Tensor Parallelism — Split individual layers across GPUs
- Pipeline Parallelism — Split model stages across GPUs
- Expert Parallelism — Distribute MoE experts across GPUs
The model does not care how many GPUs it runs on — it only cares about serving responses.
Model Parallelism for LLM Inference
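Modern LLMs like LLaMA-2 70B (about 140GB of weights in FP16) and GPT-4 (estimated here at ~360GB; its size has not been officially disclosed) exceed the memory of widely used single GPUs such as the 80GB A100 and H100. Model parallelism distributes the model across multiple GPUs to enable inference on models that cannot fit in one device.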
Definition: Inference Parallelism
Inference parallelism distributes a model's parameters, computations, or both across multiple GPUs to enable serving models that exceed single-GPU memory capacity. Unlike training parallelism, inference parallelism focuses on latency reduction and memory distribution.
Tensor Parallelism for Inference
How It Works
Definition: Tensor Parallel Inference
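In tensor parallel inference, individual layers are split across GPUs. Each GPU holds a portion of the weight matrices and computes a partial result. Column-split outputs are concatenated (all-gather), while row-split outputs are summed (all-reduce). Pairing a column-split layer with a row-split layer, as in attention and MLP blocks, needs only one all-reduce per block.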
Linear Layer Y = XW:
GPU 0: Y_0 = X @ W_0 (W_0 = first half of columns)
GPU 1: Y_1 = X @ W_1 (W_1 = second half of columns)
Y = [Y_0 | Y_1] (concatenated)
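The column split above can be checked numerically. The following is a minimal single-process sketch in plain Python, with lists standing in for tensors and each column block standing in for one GPU's shard; the helper names (`matmul`, `split_columns`, `column_parallel_matmul`) are made up for illustration and do not come from any library.

```python
def matmul(x, w):
    """Plain-Python matrix product of x (m x k) and w (k x n)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def split_columns(w, parts):
    """Split w into `parts` equal column blocks (one block per simulated GPU)."""
    n = len(w[0])
    assert n % parts == 0, "column count must divide evenly"
    step = n // parts
    return [[row[i * step:(i + 1) * step] for row in w] for i in range(parts)]


def column_parallel_matmul(x, w, parts):
    """Compute X @ W shard by shard, then concatenate the column blocks."""
    partials = [matmul(x, shard) for shard in split_columns(w, parts)]
    # Concatenating along the column axis plays the role of an all-gather.
    return [sum((p[r] for p in partials), []) for r in range(len(x))]


x = [[1, 2], [3, 4]]
w = [[1, 0, 2, 1], [0, 1, 1, 3]]
# The sharded result matches the unsharded product.
assert column_parallel_matmul(x, w, parts=2) == matmul(x, w)
```

Each simulated GPU only ever touches its own column block of W, which is why the weight memory per GPU shrinks with the number of shards.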
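Tensor Parallel Memory per GPU
$$M_{\text{GPU}} = \frac{M_{\text{total}}}{N_{\text{TP}}}$$
Here,
- $M_{\text{GPU}}$ = Memory per GPU with tensor parallelism
- $M_{\text{total}}$ = Total model memory
- $N_{\text{TP}}$ = Tensor parallel degree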
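As a quick illustration of the relationship described above (total model memory divided by the tensor parallel degree), here is a small sketch. The function name `tp_memory_per_gpu` is hypothetical, and the estimate assumes weights shard perfectly evenly and ignores KV cache, activations, and framework overhead, so real deployments need headroom beyond these numbers.

```python
def tp_memory_per_gpu(total_gb, tp_degree):
    """Weight memory per GPU when weights are sharded evenly across tp_degree GPUs."""
    if tp_degree < 1:
        raise ValueError("tp_degree must be at least 1")
    return total_gb / tp_degree


# Example: a 350GB (FP16) model at several tensor parallel degrees.
for tp in (1, 2, 4, 8):
    print(f"TP={tp}: {tp_memory_per_gpu(350, tp):.2f} GB of weights per GPU")
```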
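import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.parallel import parallelize_module, ColwiseParallel, RowwiseParallel

def tensor_parallel_inference(model, world_size):
    """Apply tensor parallelism for inference."""
    # Build a 1-D device mesh over all ranks (assumes the process group was launched, e.g. with torchrun)
    tp_mesh = init_device_mesh("cuda", (world_size,))
    # Define parallel plan; module names are relative to one decoder layer
    tp_plan = {
        "self_attn.q_proj": ColwiseParallel(),
        "self_attn.k_proj": ColwiseParallel(),
        "self_attn.v_proj": ColwiseParallel(),
        "self_attn.o_proj": RowwiseParallel(),
        "mlp.gate_proj": ColwiseParallel(),
        "mlp.up_proj": ColwiseParallel(),
        "mlp.down_proj": RowwiseParallel(),
    }
    # Parallelize each decoder layer (assumes a LLaMA-style model exposing `model.model.layers`;
    # the per-rank attention head count may also need adjusting for the model in use)
    for layer in model.model.layers:
        parallelize_module(layer, tp_mesh, tp_plan)
    return model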
Communication Pattern
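With the column-then-row sharding above, each transformer layer needs two all-reduces in the forward pass: one after the attention output projection and one after the MLP down projection. Training adds matching all-reduces in the backward pass, but inference runs only the forward pass. Generating each token therefore takes 2 x L all-reduce operations, where L is the number of layers.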
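To see why a column-split layer followed by a row-split layer needs only a single reduction, consider this single-process sketch of an MLP block. It is an illustration under simplifying assumptions: plain Python lists stand in for tensors, ReLU stands in for the real activation, and the names `tp_mlp` and `relu` are invented for this example. The final element-wise sum plays the role of the all-reduce.

```python
def matmul(x, w):
    """Plain-Python matrix product of x (m x k) and w (k x n)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def relu(m):
    """Element-wise ReLU, so it can be applied to each column shard independently."""
    return [[max(0, v) for v in row] for row in m]


def tp_mlp(x, w_up, w_down, tp):
    """Column-split w_up, row-split w_down, and sum the partial outputs once."""
    hidden = len(w_up[0])
    assert hidden % tp == 0, "hidden size must divide evenly"
    step = hidden // tp
    partial_outputs = []
    for rank in range(tp):
        cols = slice(rank * step, (rank + 1) * step)
        w_up_shard = [row[cols] for row in w_up]  # this rank's columns of w_up
        w_down_shard = w_down[cols]               # the matching rows of w_down
        h = relu(matmul(x, w_up_shard))           # no communication needed here
        partial_outputs.append(matmul(h, w_down_shard))
    # Element-wise sum across ranks: this is the single all-reduce for the block.
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partial_outputs)]


x = [[1, -2]]
w_up = [[1, 2, -1, 0], [0, 1, 1, -3]]
w_down = [[1, 0], [0, 1], [2, 1], [1, 1]]
reference = matmul(relu(matmul(x, w_up)), w_down)
assert tp_mlp(x, w_up, w_down, tp=2) == reference
```

The attention block follows the same pattern (column-split Q/K/V projections, row-split output projection), which gives the second all-reduce in each layer.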
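Tensor Parallel Inference Latency
$$T_{\text{total}} = T_{\text{compute}} + 2L \cdot T_{\text{comm}}$$
Here,
- $T_{\text{total}}$ = Total inference time
- $T_{\text{compute}}$ = Computation time per GPU
- $T_{\text{comm}}$ = Communication time per all-reduce
- $L$ = Number of layers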
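A rough way to explore this latency model is to treat total time as per-GPU compute time plus two all-reduces per layer. The sketch below assumes exactly that additive form with no overlap between compute and communication; the function name `tp_latency_ms` and the input numbers are illustrative placeholders, not measurements.

```python
def tp_latency_ms(compute_ms, allreduce_ms, num_layers, allreduces_per_layer=2):
    """Estimated time per forward pass: compute plus serialized all-reduces."""
    return compute_ms + allreduces_per_layer * num_layers * allreduce_ms


# Hypothetical inputs: 80 layers, 20 ms of compute, 0.05 ms per all-reduce.
estimate = tp_latency_ms(compute_ms=20.0, allreduce_ms=0.05, num_layers=80)
print(f"Estimated time per token: {estimate:.1f} ms")
```

Raising the tensor parallel degree typically shrinks the compute term but not the number of all-reduces, so the communication term becomes a larger share of the total.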
Pipeline Parallelism for Inference
How It Works
Definition: Pipeline Parallel Inference
In pipeline parallel inference, different model layers reside on different GPUs. Tokens flow sequentially through the pipeline stages. For single-request inference, there is no parallelism benefit — pipeline parallelism is useful for serving multiple requests simultaneously.
Pipeline Stages:
GPU 0: Layers 0-19 -> KV cache for layers 0-19
GPU 1: Layers 20-39 -> KV cache for layers 20-39
GPU 2: Layers 40-59 -> KV cache for layers 40-59
GPU 3: Layers 60-79 -> KV cache for layers 60-79, final output
Request 1: GPU0 -> GPU1 -> GPU2 -> GPU3 -> Response
Request 2: GPU0 -> GPU1 -> GPU2 -> GPU3 -> Response
Request 3: GPU0 -> GPU1 -> GPU2 -> GPU3 -> Response
Pipeline parallelism alone does not reduce latency for a single request — it only enables serving models too large for one GPU. To reduce latency, combine with tensor parallelism.
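The difference between single-request latency and multi-request throughput can be sketched with an idealized timing model. The function below assumes every stage takes the same time, ignores inter-stage transfer cost, and treats each request as a single pass through the pipeline (real autoregressive decoding makes many passes); `pipeline_times` is a hypothetical name.

```python
def pipeline_times(num_stages, num_requests, stage_ms):
    """Return (single-request latency, total time, requests per ms) for an ideal pipeline."""
    # One request must visit every stage in order, so its latency is unchanged.
    single_request_latency = num_stages * stage_ms
    # With pipelining, the last request enters after (num_requests - 1) steps
    # and then needs num_stages more steps to finish.
    makespan = (num_stages + num_requests - 1) * stage_ms
    throughput = num_requests / makespan
    return single_request_latency, makespan, throughput


latency, total, rate = pipeline_times(num_stages=4, num_requests=3, stage_ms=10)
sequential_total = 3 * latency  # same 3 requests served strictly one after another
print(f"latency={latency} ms, pipelined total={total} ms, sequential total={sequential_total} ms")
```

In this model the per-request latency stays at 40 ms regardless of load, while three overlapping requests finish in 60 ms instead of 120 ms.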
Hybrid Parallelism
TP + PP Combination
Definition: Hybrid Parallelism
Hybrid parallelism combines tensor parallelism (for latency reduction) and pipeline parallelism (for memory distribution). Tensor parallelism is applied within pipeline stages, and pipeline parallelism splits the model across stages.
Hybrid TP=2, PP=4 on 8 GPUs:
Pipeline Stage 0: GPU0 (TP rank 0) + GPU1 (TP rank 1) -> Layers 0-19
Pipeline Stage 1: GPU2 (TP rank 0) + GPU3 (TP rank 1) -> Layers 20-39
Pipeline Stage 2: GPU4 (TP rank 0) + GPU5 (TP rank 1) -> Layers 40-59
Pipeline Stage 3: GPU6 (TP rank 0) + GPU7 (TP rank 1) -> Layers 60-79
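The layout above can be generated programmatically. This sketch assumes consecutive GPU ranks form one tensor parallel group (a common but not universal convention) and that layers divide evenly across stages; `hybrid_layout` is an illustrative helper, not a framework API.

```python
def hybrid_layout(world_size, tp, pp, num_layers):
    """Map each GPU rank to its pipeline stage, TP rank, and inclusive layer range."""
    assert tp * pp == world_size, "TP x PP must equal the number of GPUs"
    assert num_layers % pp == 0, "layers must divide evenly across stages"
    layers_per_stage = num_layers // pp
    layout = []
    for rank in range(world_size):
        # Consecutive ranks share a stage; the remainder is the TP rank.
        stage, tp_rank = divmod(rank, tp)
        first = stage * layers_per_stage
        layout.append({
            "gpu": rank,
            "pp_stage": stage,
            "tp_rank": tp_rank,
            "layers": (first, first + layers_per_stage - 1),
        })
    return layout


for entry in hybrid_layout(world_size=8, tp=2, pp=4, num_layers=80):
    print(entry)
```

Keeping each tensor parallel group on adjacent ranks is usually chosen so that the frequent all-reduces stay on the fastest links, while the less frequent stage-to-stage transfers cross slower ones.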
Serving a 175B Model on 8 GPUs
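For a 175B parameter model (350GB in FP16), all three layouts below shard the weights across 8 GPUs, so each GPU holds about 44GB of weights:
- Tensor parallel only (TP=8): 2 all-reduces per layer, each spanning all 8 GPUs
- Pipeline only (PP=8): no all-reduces, but no latency reduction for a single request
- Hybrid (TP=2, PP=4): 2 all-reduces per layer, each spanning only 2 GPUs, across 4 pipeline stages
TP=2, PP=4 trades some latency for much less all-reduce traffic, which makes it a reasonable choice when interconnect bandwidth is limited. With fast intra-node links such as NVLink, TP=8 often gives lower single-request latency.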
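The per-GPU weight footprint of each configuration can be checked with a short calculation. This sketch assumes the weights are sharded evenly over all TP x PP GPUs and ignores KV cache and activations; the helper name `weights_per_gpu_gb` is hypothetical.

```python
def weights_per_gpu_gb(total_gb, tp, pp):
    """Weights per GPU when sharded evenly over tp * pp GPUs."""
    return total_gb / (tp * pp)


configs = {"TP=8": (8, 1), "PP=8": (1, 8), "TP=2, PP=4": (2, 4)}
for name, (tp, pp) in configs.items():
    gb = weights_per_gpu_gb(350, tp, pp)
    # The all-reduce group size equals the TP degree; PP alone needs no all-reduce.
    print(f"{name}: {gb:.2f} GB per GPU, all-reduce group size {tp}, {pp} pipeline stage(s)")
```

Because every configuration uses all 8 GPUs, the weight memory per GPU is the same in each case; what differs is the all-reduce group size and the pipeline depth.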
Expert Parallelism for MoE Models
Definition: Expert Parallelism
Expert parallelism distributes Mixture of Experts (MoE) model experts across GPUs. Each GPU holds a subset of experts, and tokens are routed to the appropriate GPU via all-to-all communication.
def expert_parallel_forward(x, expert_indices, expert_params, expert_owner, local_rank):
    """Forward pass with expert parallelism (pseudocode).

    all_to_all_dispatch, expert_forward, and all_to_all_combine are
    placeholder helpers that are not defined here.
    """
    # Experts owned by this rank; expert_owner[i] is the rank that holds expert i
    local_experts = [i for i in range(len(expert_owner)) if expert_owner[i] == local_rank]
    # First all-to-all: gather the tokens routed to local experts
    local_tokens = all_to_all_dispatch(x, expert_indices, local_experts)
    # Compute local expert outputs
    local_outputs = []
    for expert_id in local_experts:
        expert_output = expert_forward(local_tokens[expert_id], expert_params[expert_id])
        local_outputs.append(expert_output)
    # Second all-to-all: scatter outputs back to the ranks that own the tokens
    output = all_to_all_combine(local_outputs, expert_indices)
    return output
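Expert parallelism can be communication-efficient: each MoE layer needs two all-to-all exchanges (one to dispatch tokens to experts, one to combine the outputs), and each token travels only to the GPUs holding its selected experts rather than being reduced across every GPU as in tensor parallelism's all-reduce. For MoE models with 8-64 experts, expert parallelism can scale close to linearly when tokens are routed evenly across experts.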
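The dispatch, compute, and combine steps can be simulated in a single process to make the data movement concrete. This is a toy sketch assuming top-1 routing, with Python callables standing in for experts and dictionaries standing in for the all-to-all exchanges; `simulate_expert_parallel` and its arguments are invented for illustration.

```python
def simulate_expert_parallel(tokens, expert_indices, expert_owner, experts, world_size):
    """Route each token to the rank owning its expert, compute, and restore order."""
    # Dispatch (first all-to-all): each token goes to the rank that owns its expert.
    inbox = {rank: [] for rank in range(world_size)}
    for pos, (tok, eid) in enumerate(zip(tokens, expert_indices)):
        inbox[expert_owner[eid]].append((pos, eid, tok))
    # Local compute: every rank applies only the experts it holds.
    outbox = []
    for rank in range(world_size):
        for pos, eid, tok in inbox[rank]:
            outbox.append((pos, experts[eid](tok)))
    # Combine (second all-to-all): results return to their original positions.
    output = [None] * len(tokens)
    for pos, value in outbox:
        output[pos] = value
    load = {rank: len(items) for rank, items in inbox.items()}
    return output, load


experts = [lambda t: t + 1, lambda t: t * 2, lambda t: -t, lambda t: t * t]
expert_owner = [0, 0, 1, 1]  # experts 0-1 on rank 0, experts 2-3 on rank 1
output, load = simulate_expert_parallel([1, 2, 3, 4, 5], [3, 0, 2, 1, 3], expert_owner, experts, 2)
print(output)  # [1, 3, -3, 8, 25]
print(load)    # {0: 2, 1: 3}
```

The `load` dictionary shows how uneven routing translates into uneven work per GPU, which is the main thing that limits scaling in practice.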
Serving Frameworks
Tensor Parallelism in Practice
| Framework | TP Support | Typical TP (single node) | Communication | Notes |
|---|---|---|---|---|
| vLLM | Yes | Up to 8 GPUs | NCCL | General-purpose serving |
| TensorRT-LLM | Yes | Up to 8 GPUs | NCCL | NVIDIA-optimized kernels |
| SGLang | Yes | Up to 8 GPUs | NCCL | Strong at structured generation |
| TGI | Yes | Up to 8 GPUs | NCCL | Production-ready |
Loading Models with TP
# vLLM tensor parallel serving
from vllm import LLM, SamplingParams

# Serve LLaMA-2 70B with TP=2 (about 70GB of FP16 weights per GPU, so this needs 80GB-class GPUs)
llm = LLM(
    model="meta-llama/Llama-2-70b-chat-hf",
    tensor_parallel_size=2,
    gpu_memory_utilization=0.9,
    max_model_len=4096,
)

# Generate
sampling_params = SamplingParams(temperature=0.7, max_tokens=512)
outputs = llm.generate(["What is machine learning?"], sampling_params)
# Each result holds one or more completions; print the first one
print(outputs[0].outputs[0].text)
Communication Optimization
Overlapping Communication and Computation
Definition: Communication-Computation Overlap
Overlap communication (all-reduce) with computation (matmul) by splitting the all-reduce into reduce-scatter and all-gather phases, executing them concurrently with layer computation.
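import torch.distributed as dist

def overlapped_tp_forward(layer, x_next, partial_prev, comm_group):
    """Overlap the reduce-scatter of one micro-batch with compute on another."""
    world_size = dist.get_world_size(group=comm_group)
    # Each rank receives 1/world_size of the reduced tensor (dim 0 must divide evenly)
    shard = partial_prev.new_empty((partial_prev.shape[0] // world_size, *partial_prev.shape[1:]))
    # Start async reduce-scatter for the previous partial output
    handle = dist.reduce_scatter_tensor(shard, partial_prev, group=comm_group, async_op=True)
    # Compute on an independent micro-batch while communication proceeds
    output = layer(x_next)
    # Wait for communication to complete before using the reduced shard
    handle.wait()
    return output, shard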
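How much overlap helps can be estimated with a simple per-layer timing model. The sketch below assumes that a given fraction of the communication can run concurrently with compute and that hidden communication can never exceed the compute time available to cover it; the function name and the numbers are illustrative, not measurements.

```python
def layer_time_ms(compute_ms, comm_ms, overlap_fraction=0.0):
    """Per-layer time when part of the communication hides behind compute."""
    if not 0.0 <= overlap_fraction <= 1.0:
        raise ValueError("overlap_fraction must be between 0 and 1")
    # Communication can only be hidden while there is compute to hide it behind.
    hidden = min(compute_ms, overlap_fraction * comm_ms)
    return compute_ms + comm_ms - hidden


for fraction in (0.0, 0.5, 1.0):
    t = layer_time_ms(compute_ms=1.0, comm_ms=0.4, overlap_fraction=fraction)
    print(f"overlap={fraction:.1f}: {t:.2f} ms per layer")
```

With full overlap the per-layer time approaches the larger of the two terms, so overlap helps most when compute and communication times are comparable.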
Practice Exercises
- Parallelism Design: Design a parallelism strategy for serving a 405B parameter model on 16 GPUs. Justify your choice of TP and PP degrees.
- Latency Analysis: Calculate the inference latency for a 70B model with TP=2 vs TP=4 on 4 GPUs. Assume 100GB/s interconnect bandwidth.
- Memory Analysis: If a 70B model requires 140GB in FP16, how much memory does each GPU need with TP=2, TP=4, and TP=8?
- Communication Volume: For a 70B model with TP=4, calculate the total communication volume per token generation step.
Key Takeaways
Summary: Model Parallelism for Inference
- Tensor parallelism splits layers across GPUs, reducing latency at the cost of 2 all-reduces per layer
- Pipeline parallelism splits model stages, enables serving models too large for one GPU
- Hybrid TP+PP balances memory, latency, and communication overhead
- Expert parallelism distributes MoE experts with efficient all-to-all communication
- Communication bandwidth is the key constraint for tensor parallelism
- TP=2-4 is typical for inference; higher TP requires very high-bandwidth interconnects
- Production frameworks (vLLM, TensorRT-LLM) implement all parallelism strategies
- Overlapping communication with computation can hide much of the communication overhead
What to Learn Next
-> Distributed Training for LLMs: Parallelism strategies for training.
-> Mixture of Experts: MoE architectures and expert routing.
-> Flash Attention and Memory Efficiency: IO-aware attention algorithms.
-> KV Cache Optimization: Reducing memory usage of the key-value cache.
-> LLM Inference Optimization: Broader inference optimization strategies.
-> Building Production LLM Applications: End-to-end production systems.