Model Parallelism and Tensor Parallelism

Model Parallelism for Inference — Serving Models Too Large for One GPU

When a model exceeds single-GPU memory, model parallelism splits it across multiple GPUs. This guide covers tensor parallelism, pipeline parallelism, and hybrid strategies for inference.

  • Tensor Parallelism — Split individual layers across GPUs
  • Pipeline Parallelism — Split model stages across GPUs
  • Expert Parallelism — Distribute MoE experts across GPUs

The model does not care how many GPUs it runs on — it only cares about serving responses.

Model Parallelism for LLM Inference

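Modern LLMs such as LLaMA-2 70B (about 140GB of weights in FP16) exceed the memory of a single 80GB GPU such as the A100 or H100, and larger models need several hundred gigabytes (the ~360GB figure sometimes quoted for GPT-4 is an unofficial estimate, since its size has not been published). Model parallelism distributes the model across multiple GPUs so that models too large for one device can still be served.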

Definition: Inference Parallelism

Inference parallelism distributes a model's parameters, computations, or both across multiple GPUs to enable serving models that exceed single-GPU memory capacity. Unlike training parallelism, inference parallelism focuses on latency reduction and memory distribution.

Tensor Parallelism for Inference

How It Works

Definition: Tensor Parallel Inference

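In tensor parallel inference, individual layers are split across GPUs. Each GPU holds a slice of every weight matrix. A column-split layer produces a slice of the output on each GPU, and a following row-split layer produces partial sums that are combined with an all-reduce, so each attention block and each MLP block ends with one all-reduce.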

Architecture Diagram
Linear Layer Y = XW:

GPU 0: Y_0 = X @ W_0  (W_0 = first half of columns)
GPU 1: Y_1 = X @ W_1  (W_1 = second half of columns)

Y = [Y_0 | Y_1]  (concatenated)
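
The column split in the diagram, and the row split that usually follows it, can be checked on small matrices. The following is a minimal single-process sketch in pure Python: the "GPUs" are simulated as list slices, and the helper names (`matmul`, `split_columns`, `split_rows`, `column_parallel`, `row_parallel`) are illustrative, not part of any library.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def split_columns(m, parts):
    """Split matrix m into `parts` equal column blocks."""
    step = len(m[0]) // parts
    return [[row[i * step:(i + 1) * step] for row in m] for i in range(parts)]

def split_rows(m, parts):
    """Split matrix m into `parts` equal row blocks."""
    step = len(m) // parts
    return [m[i * step:(i + 1) * step] for i in range(parts)]

def column_parallel(x, w, parts=2):
    """Each simulated GPU computes X @ W_i; outputs are concatenated column-wise."""
    shards = [matmul(x, w_i) for w_i in split_columns(w, parts)]
    return [sum((shard[r] for shard in shards), []) for r in range(len(x))]

def row_parallel(x, w, parts=2):
    """Each simulated GPU multiplies a column block of X by the matching row
    block of W; the partial results are summed, which is what all-reduce does."""
    partials = [matmul(x_i, w_i)
                for x_i, w_i in zip(split_columns(x, parts), split_rows(w, parts))]
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]

X = [[1, 2, 3, 4], [5, 6, 7, 8]]
W = [[1, 0, 2, 1], [0, 1, 1, 3], [2, 1, 0, 0], [1, 1, 1, 1]]

# Both sharded versions reproduce the unsharded product.
assert column_parallel(X, W) == matmul(X, W)
assert row_parallel(X, W) == matmul(X, W)
```

Pairing a column split with a row split is what lets two consecutive layers share a single all-reduce: the column-split output is already laid out the way the row-split layer expects its input.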

Tensor Parallel Memory per GPU

$$M_{\text{TP}} = \frac{M_{\text{total}}}{G_{\text{TP}}}$$

Here,

  • $M_{\text{TP}}$ = Memory per GPU with tensor parallelism
  • $M_{\text{total}}$ = Total model memory
  • $G_{\text{TP}}$ = Tensor parallel degree
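
As a quick check of the formula, the sketch below computes the per-GPU weight memory for a hypothetical 13B-parameter model. The function name and the example model size are assumptions for illustration; the result covers weights only and ignores the KV cache, activations, and any layers that are replicated rather than sharded.

```python
def tp_memory_per_gpu_gb(num_params, bytes_per_param=2, tp_degree=1):
    """Weight memory per GPU in GB: M_TP = M_total / G_TP."""
    total_gb = num_params * bytes_per_param / 1e9  # M_total (FP16 = 2 bytes)
    return total_gb / tp_degree

# Hypothetical 13B-parameter model in FP16
for degree in (1, 2, 4):
    print(degree, tp_memory_per_gpu_gb(13e9, tp_degree=degree))
```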
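
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.parallel import (
    ColwiseParallel,
    RowwiseParallel,
    parallelize_module,
)

def tensor_parallel_inference(model, world_size):
    """Apply tensor parallelism for inference.

    Assumes one process per GPU (for example launched with torchrun) and a
    LLaMA-style model whose decoder blocks live in model.model.layers.
    """
    # 1-D device mesh: every rank belongs to one tensor-parallel group
    tp_mesh = init_device_mesh("cuda", (world_size,))

    # Parallel plan; module names are relative to one decoder block
    tp_plan = {
        "self_attn.q_proj": ColwiseParallel(),
        "self_attn.k_proj": ColwiseParallel(),
        "self_attn.v_proj": ColwiseParallel(),
        "self_attn.o_proj": RowwiseParallel(),
        "mlp.gate_proj": ColwiseParallel(),
        "mlp.up_proj": ColwiseParallel(),
        "mlp.down_proj": RowwiseParallel(),
    }

    # Shard each decoder block in place. Attention code that reshapes by head
    # count must use the per-rank head count (num_heads // world_size).
    for block in model.model.layers:
        parallelize_module(block, tp_mesh, tp_plan)
    return model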

Communication Pattern

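In the Megatron-style layout used above, each transformer layer needs two all-reduces in the forward pass: one after the attention output projection and one after the MLP down projection. Training adds two more per layer in the backward pass. For inference, this means 2 x L all-reduce operations per forward pass (one pass per generated token during decoding), where L is the number of layers.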
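
To turn the all-reduce count into bytes, the sketch below assumes a ring all-reduce, in which each GPU sends 2 x (G - 1) / G times the message size, and assumes each all-reduce carries one hidden-state vector per token in the batch. The function names and the example model shape (32 layers, hidden size 4096) are hypothetical.

```python
def allreduce_bytes_per_gpu(message_bytes, tp_degree):
    """Bytes each GPU sends in a ring all-reduce of `message_bytes`."""
    if tp_degree <= 1:
        return 0.0  # nothing to communicate on a single GPU
    return 2 * (tp_degree - 1) / tp_degree * message_bytes

def tp_traffic_per_step(num_layers, hidden_size, batch_tokens, tp_degree,
                        bytes_per_elem=2):
    """Bytes each GPU sends per forward pass with 2 all-reduces per layer."""
    message = batch_tokens * hidden_size * bytes_per_elem  # one activation tensor
    return 2 * num_layers * allreduce_bytes_per_gpu(message, tp_degree)

# Hypothetical model: 32 layers, hidden size 4096, one token per step, FP16
print(tp_traffic_per_step(num_layers=32, hidden_size=4096,
                          batch_tokens=1, tp_degree=2))
```

During decoding the messages are small, so the per-call latency of each all-reduce usually matters more than the raw byte count.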

Tensor Parallel Inference Latency

$$T_{\text{TP}} = T_{\text{compute}} + T_{\text{comm}} \times 2L$$

Here,

  • $T_{\text{TP}}$ = Total inference time
  • $T_{\text{compute}}$ = Computation time per GPU
  • $T_{\text{comm}}$ = Communication time per all-reduce
  • $L$ = Number of layers
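
The latency model can be evaluated directly. The numbers in this sketch are made up for illustration (times in microseconds), and the function name is hypothetical; it shows how quickly the communication term grows with the number of layers.

```python
def tp_latency(t_compute, t_comm, num_layers):
    """T_TP = T_compute + T_comm * 2L, all times in the same unit."""
    return t_compute + t_comm * 2 * num_layers

# Illustrative values: 20,000 us of compute per GPU, 50 us per all-reduce,
# 80 layers. Communication adds 2 * 80 * 50 = 8,000 us.
print(tp_latency(t_compute=20_000, t_comm=50, num_layers=80))
```

Raising the TP degree shrinks the compute term but leaves the number of all-reduces unchanged, which is why communication ends up dominating at high TP degrees.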

Pipeline Parallelism for Inference

How It Works

Definition: Pipeline Parallel Inference

In pipeline parallel inference, different model layers reside on different GPUs. Activations flow sequentially through the pipeline stages. For single-request inference, there is no parallelism benefit, because only one stage is active at a time; pipeline parallelism pays off when serving multiple requests simultaneously.

Architecture Diagram
Pipeline Stages:
GPU 0: Layers 0-19   -> KV cache for layers 0-19
GPU 1: Layers 20-39  -> KV cache for layers 20-39
GPU 2: Layers 40-59  -> KV cache for layers 40-59
GPU 3: Layers 60-79  -> KV cache for layers 60-79, final output

Request 1: GPU0 -> GPU1 -> GPU2 -> GPU3 -> Response
Request 2:         GPU0 -> GPU1 -> GPU2 -> GPU3 -> Response
Request 3:                 GPU0 -> GPU1 -> GPU2 -> GPU3 -> Response

Pipeline parallelism alone does not reduce latency for a single request — it only enables serving models too large for one GPU. To reduce latency, combine with tensor parallelism.
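
The staggered schedule in the diagram can be modeled with a few lines of Python. This is a simplified sketch that assumes every stage takes the same time, a new request enters stage 0 as soon as it is free, and transfers between GPUs are free; the function name is illustrative.

```python
def pipeline_finish_times(num_requests, num_stages, stage_time):
    """Finish time of each request in an evenly balanced pipeline."""
    finish = []
    for r in range(num_requests):
        start = r * stage_time                  # wait for stage 0 to free up
        finish.append(start + num_stages * stage_time)
    return finish

# 3 requests through 4 stages, 1 time unit per stage
print(pipeline_finish_times(num_requests=3, num_stages=4, stage_time=1))
```

The first request still needs all 4 time units, the same as running every layer on one device, but after that one request completes per time unit. That is the throughput gain without any latency gain.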

Hybrid Parallelism

TP + PP Combination

Definition: Hybrid Parallelism

Hybrid parallelism combines tensor parallelism (for latency reduction) and pipeline parallelism (for memory distribution). Tensor parallelism is applied within pipeline stages, and pipeline parallelism splits the model across stages.

Architecture Diagram
Hybrid TP=2, PP=4 on 8 GPUs:

Pipeline Stage 0: GPU0 (TP rank 0) + GPU1 (TP rank 1) -> Layers 0-19
Pipeline Stage 1: GPU2 (TP rank 0) + GPU3 (TP rank 1) -> Layers 20-39
Pipeline Stage 2: GPU4 (TP rank 0) + GPU5 (TP rank 1) -> Layers 40-59
Pipeline Stage 3: GPU6 (TP rank 0) + GPU7 (TP rank 1) -> Layers 60-79
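
The layout above follows a simple rule: consecutive GPU ranks form a tensor-parallel group, and each group owns one contiguous block of layers. The sketch below reproduces the mapping; the function name is hypothetical, and it assumes the layer count divides evenly by the number of stages.

```python
def hybrid_layout(world_size, tp, pp, num_layers):
    """Map each global rank to its pipeline stage, TP rank, and layer range."""
    if tp * pp != world_size:
        raise ValueError("tp * pp must equal world_size")
    per_stage = num_layers // pp
    layout = {}
    for rank in range(world_size):
        stage, tp_rank = divmod(rank, tp)   # consecutive ranks share a stage
        first = stage * per_stage
        layout[rank] = {
            "stage": stage,
            "tp_rank": tp_rank,
            "layers": (first, first + per_stage - 1),
        }
    return layout

layout = hybrid_layout(world_size=8, tp=2, pp=4, num_layers=80)
for rank, info in layout.items():
    print(rank, info)
```

Keeping each tensor-parallel group on adjacent ranks matters in practice, because those GPUs exchange data twice per layer and should sit on the fastest links.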

Serving a 175B Model on 8 GPUs

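For a 175B parameter model (350GB of weights in FP16), all three layouts below use 8 GPUs, so each GPU holds about 350 / 8 ≈ 44GB of weights. They differ in communication and latency:

  • Tensor parallel only (TP=8): 44GB per GPU, 2 all-reduces per layer, each spanning all 8 GPUs
  • Pipeline only (PP=8): 44GB per GPU, no all-reduces, but no latency reduction for a single request
  • Hybrid (TP=2, PP=4): 44GB per GPU, 2 all-reduces per layer within each 2-GPU group, 4 stages

TP=2, PP=4 is a reasonable middle ground: it keeps each all-reduce inside a small group while still spreading the weights over all 8 GPUs. On a single node with a fast interconnect such as NVLink, TP=8 is also common because it gives the lowest per-request latency.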

Expert Parallelism for MoE Models

Definition: Expert Parallelism

Expert parallelism distributes Mixture of Experts (MoE) model experts across GPUs. Each GPU holds a subset of experts, and tokens are routed to the appropriate GPU via all-to-all communication.

def expert_parallel_forward(x, expert_indices, expert_params,
                            expert_owner, local_rank):
    """Forward pass with expert parallelism (pseudocode).

    all_to_all_dispatch, all_to_all_combine and expert_forward are placeholder
    helpers, not library functions. expert_owner[i] is the rank holding expert i.
    """
    # Experts stored on this GPU
    local_experts = [i for i, owner in enumerate(expert_owner) if owner == local_rank]

    # First all-to-all: receive the tokens routed to local experts,
    # grouped by expert id
    local_tokens = all_to_all_dispatch(x, expert_indices, local_experts)

    # Compute local expert outputs
    local_outputs = []
    for expert_id in local_experts:
        expert_output = expert_forward(local_tokens[expert_id], expert_params[expert_id])
        local_outputs.append(expert_output)

    # Second all-to-all: return outputs to the ranks that own the tokens
    output = all_to_all_combine(local_outputs, expert_indices)
    return output

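Expert parallelism needs two all-to-all exchanges per MoE layer, one to dispatch tokens to their experts and one to combine the results, and each token is sent only to the GPUs that own its selected experts. For MoE models with 8-64 experts, it can scale close to linearly as long as the router spreads tokens evenly across experts; skewed routing leaves some GPUs idle while others are overloaded.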
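
The dispatch, compute, and combine steps can be traced on a single process. The following is a toy simulation, not a distributed implementation: it assumes top-1 routing, represents tokens as plain numbers and experts as Python functions, and replaces each all-to-all with list operations. All names are illustrative.

```python
def expert_parallel_sim(tokens, expert_indices, expert_owner, experts):
    """Simulate dispatch -> local expert compute -> combine."""
    num_ranks = max(expert_owner) + 1

    # Dispatch (stands in for the first all-to-all): each rank receives the
    # tokens whose selected expert it owns, tagged with original position.
    inbox = [[] for _ in range(num_ranks)]
    for pos, (tok, eid) in enumerate(zip(tokens, expert_indices)):
        inbox[expert_owner[eid]].append((pos, eid, tok))

    # Local compute: every rank runs only its own experts.
    outbox = [[(pos, experts[eid](tok)) for pos, eid, tok in items]
              for items in inbox]

    # Combine (stands in for the second all-to-all): results return to the
    # original token order.
    output = [None] * len(tokens)
    for items in outbox:
        for pos, value in items:
            output[pos] = value

    load = [len(items) for items in inbox]  # tokens processed per rank
    return output, load

experts = [lambda t: t + 1, lambda t: t * 2, lambda t: t - 1, lambda t: t * t]
expert_owner = [0, 0, 1, 1]          # experts 0-1 on rank 0, experts 2-3 on rank 1
tokens = [10, 20, 30, 40]
expert_indices = [3, 0, 1, 3]        # router's top-1 choice per token

output, load = expert_parallel_sim(tokens, expert_indices, expert_owner, experts)
print(output, load)
```

The `load` list makes the balance issue visible: if the router sent every token to experts on one rank, that rank would do all the work while the other sat idle.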

Serving Frameworks

Tensor Parallelism in Practice

Framework      TP Support   Max TP   Communication   Notes
vLLM           Yes          8 GPUs   NCCL            Best for general serving
TensorRT-LLM   Yes          8 GPUs   NCCL            Fastest inference
SGLang         Yes          8 GPUs   NCCL            Best for structured generation
TGI            Yes          8 GPUs   NCCL            Production-ready

Loading Models with TP

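# vLLM tensor parallel serving
from vllm import LLM, SamplingParams

# Serve LLaMA-2 70B with TP=4. The FP16 weights take about 140GB, so TP=2 on
# 80GB GPUs would leave almost no room for the KV cache.
llm = LLM(
    model="meta-llama/Llama-2-70b-chat-hf",
    tensor_parallel_size=4,      # must divide the number of attention heads
    gpu_memory_utilization=0.9,  # fraction of each GPU's memory vLLM may use
    max_model_len=4096,
)

# Generate
sampling_params = SamplingParams(temperature=0.7, max_tokens=512)
outputs = llm.generate(["What is machine learning?"], sampling_params)

# Each result holds one or more completions; print the first one
for output in outputs:
    print(output.outputs[0].text)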

Communication Optimization

Overlapping Communication and Computation

Definition: Communication-Computation Overlap

Overlap communication (all-reduce) with computation (matmul) by splitting the all-reduce into reduce-scatter and all-gather phases, executing them concurrently with layer computation.

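import torch
import torch.distributed as dist

def overlapped_tp_forward(layer, x, comm_group, num_chunks=4):
    """Tensor parallel forward with communication overlap.

    Assumes `layer` is a row-parallel layer whose local output is a partial
    sum that must be all-reduced across `comm_group`.
    """
    outputs, handles = [], []
    for chunk in x.chunk(num_chunks, dim=0):
        # Compute this chunk while earlier chunks are still being reduced
        partial = layer(chunk)
        # Start an async in-place all-reduce and move on to the next chunk
        handles.append(dist.all_reduce(partial, group=comm_group, async_op=True))
        outputs.append(partial)

    # Wait for all communication to complete
    for handle in handles:
        handle.wait()
    return torch.cat(outputs, dim=0)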
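
How much overlap helps can be estimated with a simple two-stage pipeline model. The sketch below assumes the input is processed in equal chunks, that compute and communication can run fully concurrently, and that each chunk's communication can start only after its own compute finishes; the function names and timings are illustrative.

```python
def sequential_time(n_chunks, compute, comm):
    """No overlap: every chunk computes, then communicates."""
    return n_chunks * (compute + comm)

def overlapped_time(n_chunks, compute, comm):
    """Chunk i's communication runs while chunk i+1 is being computed."""
    return compute + (n_chunks - 1) * max(compute, comm) + comm

# 4 chunks, 3 time units of compute and 1 unit of communication per chunk
print(sequential_time(4, 3, 1), overlapped_time(4, 3, 1))
```

Under this model the overlapped time is bounded below by whichever of compute or communication is slower, so overlap hides communication only while there is enough computation to cover it.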

Practice Exercises

  1. Parallelism Design: Design a parallelism strategy for serving a 405B parameter model on 16 GPUs. Justify your choice of TP and PP degrees.

  2. Latency Analysis: Calculate the inference latency for a 70B model with TP=2 vs TP=4 on 4 GPUs. Assume 100GB/s interconnect bandwidth.

  3. Memory Analysis: If a 70B model requires 140GB in FP16, how much memory does each GPU need with TP=2, TP=4, and TP=8?

  4. Communication Volume: For a 70B model with TP=4, calculate the total communication volume per token generation step.

Key Takeaways

Summary: Model Parallelism for Inference

  • Tensor parallelism splits layers across GPUs, reducing latency with 2 all-reduces per layer
  • Pipeline parallelism splits the model into stages and enables serving models too large for one GPU
  • Hybrid TP+PP balances memory, latency, and communication overhead
  • Expert parallelism distributes MoE experts with efficient all-to-all communication
  • Communication bandwidth is the key constraint for tensor parallelism
  • TP=2-4 is typical for inference; higher TP requires very high-bandwidth interconnects
  • Production frameworks (vLLM, TensorRT-LLM) implement all parallelism strategies
  • Overlapping communication with computation hides part of the communication latency

What to Learn Next

-> Distributed Training for LLMs: Parallelism strategies for training.

-> Mixture of Experts: MoE architectures and expert routing.

-> Flash Attention and Memory Efficiency: IO-aware attention algorithms.

-> KV Cache Optimization: Reducing memory usage of the key-value cache.

-> LLM Inference Optimization: Broader inference optimization strategies.

-> Building Production LLM Applications: End-to-end production systems.
