Inference Optimization

Model Parallelism for Inference — Serving Models Too Large for One GPU

When a model exceeds single-GPU memory, model parallelism splits it across multiple GPUs. This guide covers tensor parallelism, pipeline parallelism, and hybrid strategies for inference.

Tensor Parallelism — Split individual layers across GPUs
Pipeline Parallelism — Split model stages across GPUs
Expert Parallelism — Distribute MoE experts across GPUs

The model does not care how many GPUs it runs on — it only cares about serving responses.

Model Parallelism for LLM Inference

Modern LLMs like LLaMA-2 70B (140GB in FP16) and GPT-4 (~360GB) exceed the memory of any single GPU. Model parallelism distributes the model across multiple GPUs to enable inference on models that cannot fit in one device.

DfInference Parallelism

Inference parallelism distributes a model's parameters, computations, or both across multiple GPUs to enable serving models that exceed single-GPU memory capacity. Unlike training parallelism, inference parallelism focuses on latency reduction and memory distribution.

Tensor Parallelism for Inference

How It Works

DfTensor Parallel Inference

In tensor parallel inference, individual layers are split across GPUs. Each GPU holds a portion of the weight matrices, and results are combined via all-reduce after each layer.

Architecture Diagram

Linear Layer Y = XW:

GPU 0: Y_0 = X @ W_0  (W_0 = first half of columns)
GPU 1: Y_1 = X @ W_1  (W_1 = second half of columns)

Y = [Y_0 | Y_1]  (concatenated)

Tensor Parallel Memory per GPU

M_{\text{TP}} = \frac{M_{\text{total}}}{G_{\text{TP}}}

Here,

$M_{\text{TP}}$ =Memory per GPU with tensor parallelism
$M_{\text{total}}$ =Total model memory
$G_{\text{TP}}$ =Tensor parallel degree

import torch.distributed as dist
from torch.distributed.tensor.parallel import parallelize_module, ColwiseParallel, RowwiseParallel

def tensor_parallel_inference(model, world_size):
    """Apply tensor parallelism for inference."""
    # Define parallel plan
    tp_plan = {
        "self_attn.q_proj": ColwiseParallel(),
        "self_attn.k_proj": ColwiseParallel(),
        "self_attn.v_proj": ColwiseParallel(),
        "self_attn.o_proj": RowwiseParallel(),
        "mlp.gate_proj": ColwiseParallel(),
        "mlp.up_proj": ColwiseParallel(),
        "mlp.down_proj": RowwiseParallel(),
    }
    
    # Parallelize the model
    model = parallelize_module(model, tp_plan)
    return model

Communication Pattern

Tensor parallelism requires all-reduce after every layer (forward) or two all-reduces per layer (backward). For inference, this means 2 x L all-reduce operations per token, where L is the number of layers.

Tensor Parallel Inference Latency

T_{\text{TP}} = T_{\text{compute}} + T_{\text{comm}} \times 2L

Here,

$T_{\text{TP}}$ =Total inference time
$T_{\text{compute}}$ =Computation time per GPU
$T_{\text{comm}}$ =Communication time per all-reduce
$L$ =Number of layers

Pipeline Parallelism for Inference

How It Works

DfPipeline Parallel Inference

In pipeline parallel inference, different model layers reside on different GPUs. Tokens flow sequentially through the pipeline stages. For single-request inference, there is no parallelism benefit — pipeline parallelism is useful for serving multiple requests simultaneously.

Architecture Diagram

Pipeline Stages:
GPU 0: Layers 0-19   -> KV cache for layers 0-19
GPU 1: Layers 20-39  -> KV cache for layers 20-39
GPU 2: Layers 40-59  -> KV cache for layers 40-59
GPU 3: Layers 60-79  -> Final output

Request 1: GPU0 -> GPU1 -> GPU2 -> GPU3 -> Response
Request 2:         GPU0 -> GPU1 -> GPU2 -> GPU3 -> Response
Request 3:                 GPU0 -> GPU1 -> GPU2 -> GPU3 -> Response

Pipeline parallelism alone does not reduce latency for a single request — it only enables serving models too large for one GPU. To reduce latency, combine with tensor parallelism.

Hybrid Parallelism

TP + PP Combination

DfHybrid Parallelism

Hybrid parallelism combines tensor parallelism (for latency reduction) and pipeline parallelism (for memory distribution). Tensor parallelism is applied within pipeline stages, and pipeline parallelism splits the model across stages.

Architecture Diagram

Hybrid TP=2, PP=4 on 8 GPUs:

Pipeline Stage 0: GPU0 (TP rank 0) + GPU1 (TP rank 1) -> Layers 0-19
Pipeline Stage 1: GPU2 (TP rank 0) + GPU3 (TP rank 1) -> Layers 20-39
Pipeline Stage 2: GPU4 (TP rank 0) + GPU5 (TP rank 1) -> Layers 40-59
Pipeline Stage 3: GPU6 (TP rank 0) + GPU7 (TP rank 1) -> Layers 60-79

Serving a 175B Model on 8 GPUs

For a 175B parameter model (350GB in FP16):

Tensor parallel only (TP=8): 44GB per GPU, but 8 all-reduces per layer
Pipeline only (PP=8): 44GB per GPU, but no latency reduction
Hybrid (TP=2, PP=4): 88GB per GPU, 2 all-reduces per layer, 4 stages

Best choice: TP=2, PP=4 balances memory, latency, and communication overhead.

Expert Parallelism for MoE Models

DfExpert Parallelism

Expert parallelism distributes Mixture of Experts (MoE) model experts across GPUs. Each GPU holds a subset of experts, and tokens are routed to the appropriate GPU via all-to-all communication.

def expert_parallel_forward(x, expert_indices, expert_params):
    """Forward pass with expert parallelism."""
    # Route tokens to appropriate GPUs
    local_experts = [i for i in range(num_experts) if expert_owner[i] == local_rank]
    
    # Gather tokens for local experts
    local_tokens = all_to_all_dispatch(x, expert_indices, local_experts)
    
    # Compute local expert outputs
    local_outputs = []
    for expert_id in local_experts:
        expert_output = expert_forward(local_tokens[expert_id], expert_params[expert_id])
        local_outputs.append(expert_output)
    
    # Scatter outputs back
    output = all_to_all_combine(local_outputs, expert_indices)
    return output

Expert parallelism is particularly efficient because it requires only one all-to-all communication per layer, compared to tensor parallelism's all-reduce. For MoE models with 8-64 experts, expert parallelism scales almost linearly.

Serving Frameworks

Tensor Parallelism in Practice

Framework	TP Support	Max TP	Communication	Notes
vLLM	Yes	8 GPUs	NCCL	Best for general serving
TensorRT-LLM	Yes	8 GPUs	NCCL	Fastest inference
SGLang	Yes	8 GPUs	NCCL	Best for structured generation
TGI	Yes	8 GPUs	NCCL	Production-ready

Loading Models with TP

# vLLM tensor parallel serving
from vllm import LLM, SamplingParams

# Serve LLaMA-2 70B with TP=2
llm = LLM(
    model="meta-llama/Llama-2-70b-chat-hf",
    tensor_parallel_size=2,
    gpu_memory_utilization=0.9,
    max_model_len=4096,
)

# Generate
sampling_params = SamplingParams(temperature=0.7, max_tokens=512)
outputs = llm.generate(["What is machine learning?"], sampling_params)

Communication Optimization

Overlapping Communication and Computation

DfCommunication-Computation Overlap

Overlap communication (all-reduce) with computation (matmul) by splitting the all-reduce into reduce-scatter and all-gather phases, executing them concurrently with layer computation.

def overlapped_tp_forward(layer, x, comm_group):
    """Tensor parallel forward with communication overlap."""
    # Start async reduce-scatter for previous layer output
    handle = dist.reduce_scatter_async(x, group=comm_group, async_op=True)
    
    # Compute current layer while communication proceeds
    output = layer(x)
    
    # Wait for communication to complete
    handle.wait()
    return output

Practice Exercises

Parallelism Design: Design a parallelism strategy for serving a 405B parameter model on 16 GPUs. Justify your choice of TP and PP degrees.
Latency Analysis: Calculate the inference latency for a 70B model with TP=2 vs TP=4 on 4 GPUs. Assume 100GB/s interconnect bandwidth.
Memory Analysis: If a 70B model requires 140GB in FP16, how much memory does each GPU need with TP=2, TP=4, and TP=8?
Communication Volume: For a 70B model with TP=4, calculate the total communication volume per token generation step.

Key Takeaways

Summary: Model Parallelism for Inference

Tensor parallelism splits layers across GPUs, reducing latency with 2 all-reduces per layer
Pipeline parallelism splits model stages, enables serving models too large for one GPU
Hybrid TP+PP balances memory, latency, and communication overhead
Expert parallelism distributes MoE experts with efficient all-to-all communication
Communication bandwidth is the key constraint for tensor parallelism
TP=2-4 is typical for inference; higher TP requires very high-bandwidth interconnects
Production frameworks (vLLM, TensorRT-LLM) implement all parallelism strategies
Overlapping communication with computation hides latency overhead

What to Learn Next

-> Distributed Training for LLMs Parallelism strategies for training.

-> Mixture of Experts MoE architectures and expert routing.

-> Flash Attention and Memory Efficiency IO-aware attention algorithms.

-> KV Cache Optimization Reducing memory usage of the key-value cache.

-> LLM Inference Optimization Broader inference optimization strategies.

-> Building Production LLM Applications End-to-end production systems.

Model Parallelism and Tensor Parallelism

Model Parallelism for Inference — Serving Models Too Large for One GPU

Model Parallelism for LLM Inference

DfInference Parallelism

Tensor Parallelism for Inference

How It Works

DfTensor Parallel Inference

Tensor Parallel Memory per GPU

Communication Pattern

Tensor Parallel Inference Latency

Pipeline Parallelism for Inference

How It Works

DfPipeline Parallel Inference

Hybrid Parallelism

TP + PP Combination

DfHybrid Parallelism

Serving a 175B Model on 8 GPUs

Expert Parallelism for MoE Models

DfExpert Parallelism

Serving Frameworks

Tensor Parallelism in Practice

Loading Models with TP

Communication Optimization

Overlapping Communication and Computation

DfCommunication-Computation Overlap

Practice Exercises

Key Takeaways

Summary: Model Parallelism for Inference

What to Learn Next

Need Expert LLM Help?