Inference Optimization
Continuous Batching — Maximizing GPU Utilization
Static batching wastes GPU resources waiting for the longest request to complete. Continuous batching dynamically adds and removes requests during generation, keeping the GPU fully utilized.
- Dynamic Scheduling — Add new requests as soon as slots free up
- Iteration-Level Scheduling — Make scheduling decisions every generation step
- Throughput Optimization — 3-5x throughput improvement over static batching
The GPU should never wait for a request, and no request should wait for the GPU.
Continuous Batching for LLMs
Traditional static batching requires all requests in a batch to complete before processing the next batch. This leads to significant waste — if one request finishes early, its GPU resources sit idle until the entire batch completes.
Definition: Continuous Batching
Continuous batching (also called iteration-level scheduling or in-flight batching) allows new requests to join a batch as soon as any request in the batch completes its generation. This maintains high GPU utilization throughout the serving process.
Static vs Continuous Batching
The Problem with Static Batching
Static Batching:
Request 1: [====done====]......waiting......
Request 2: [========done========]...waiting..
Request 3: [================done================]
(The dotted spans mark GPU idle time: slots held by finished requests are wasted until the longest request completes.)
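The waste in the diagram above can be quantified with a few lines of Python. This is a minimal sketch that assumes each request occupies one batch slot and emits one token per step; the function name `static_batch_idle_fraction` and the example lengths are illustrative and do not come from any library.

```python
def static_batch_idle_fraction(output_lengths):
    """Fraction of slot-steps wasted when a static batch runs until
    its longest request finishes (one token per request per step)."""
    steps = max(output_lengths)                  # the batch runs this many steps
    total_slot_steps = steps * len(output_lengths)
    useful_slot_steps = sum(output_lengths)      # steps that produced a token
    return 1 - useful_slot_steps / total_slot_steps


# Hypothetical output lengths mirroring the three requests in the diagram
print(static_batch_idle_fraction([12, 20, 36]))
```

With these assumed lengths, roughly 37% of the slot-steps produce no token. The more the output lengths within a batch differ, the larger this fraction becomes.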
Continuous Batching Solution
Continuous Batching:
Step 1: [Req1, Req2, Req3]
Step 2: [Req1, Req2, Req3] -> Req1 done, slot free
Step 3: [Req4, Req2, Req3] -> Req4 joins immediately
Step 4: [Req4, Req5, Req3] -> Req5 joins when Req2 done
Continuous batching requires iteration-level scheduling — making scheduling decisions at every generation step rather than at the batch level. This is implemented in vLLM, TensorRT-LLM, and TGI.
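The step-by-step trace above can be reproduced with a small simulation. This is a toy sketch of iteration-level scheduling, not the scheduler of vLLM, TensorRT-LLM, or TGI: it assumes a FIFO pending queue, one token per request per step, and output lengths of at least one token. The function name and the token counts are hypothetical, chosen so the trace matches the diagram.

```python
from collections import deque


def simulate_continuous_batching(output_lengths, num_slots):
    """Simulate iteration-level scheduling.

    output_lengths maps a request name to the number of tokens it generates.
    Returns, for every step, the list of request names occupying the slots.
    """
    pending = deque(output_lengths.items())    # FIFO queue of (name, length)
    slots = [None] * num_slots                 # each slot holds [name, remaining]
    trace = []
    while pending or any(slots):
        # Scheduling decision at every step: refill any free slot
        for i in range(num_slots):
            if slots[i] is None and pending:
                name, length = pending.popleft()
                slots[i] = [name, length]
        trace.append([s[0] if s else None for s in slots])
        # One generation step: every active request emits one token
        for i, s in enumerate(slots):
            if s is not None:
                s[1] -= 1
                if s[1] == 0:
                    slots[i] = None            # slot is free for the next step
    return trace


# Hypothetical token counts chosen to reproduce the diagram above
lengths = {"Req1": 2, "Req2": 3, "Req3": 5, "Req4": 3, "Req5": 2}
for step, batch in enumerate(simulate_continuous_batching(lengths, 3), start=1):
    print(f"Step {step}: {batch}")
```

The key design point is that the refill loop runs inside the generation loop, so a freed slot is reused on the very next step instead of waiting for the whole batch to drain.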
Throughput Analysis
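Batch Throughput
$$\text{Throughput} = \frac{N \cdot T_{\text{avg}}}{t_{\text{total}}}$$
Here,
- $N$ = Number of requests processed
- $T_{\text{avg}}$ = Average output tokens per request
- $t_{\text{total}}$ = Total processing time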
Throughput Comparison
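With 100 requests averaging 50 output tokens each (5,000 tokens in total) and 10 batch slots:
- Static batching (batch=10): each batch runs until its longest request finishes. If, for illustration, the longest request in every batch generates 200 tokens, that is 10 batches x 200 steps = 2,000 GPU steps
- Continuous batching: freed slots are refilled immediately, so the run takes about 5,000 / 10 = 500 GPU steps, plus a short tail while the last requests drain
- Speedup: 2,000 / 500 = 4x throughput improvement under these assumptions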
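The comparison can also be checked numerically. The sketch below counts GPU steps for both strategies on a synthetic workload; it assumes one token per request per step, FIFO admission, and a made-up length distribution, so the printed speedup is illustrative rather than a benchmark. The function names are hypothetical.

```python
import heapq
import random


def static_steps(lengths, batch_size):
    """Each batch runs until its longest request finishes."""
    return sum(max(lengths[i:i + batch_size])
               for i in range(0, len(lengths), batch_size))


def continuous_steps(lengths, batch_size):
    """Each request takes whichever slot frees up earliest."""
    slot_free_at = [0] * batch_size            # step at which each slot is free
    heapq.heapify(slot_free_at)
    finish = 0
    for n in lengths:
        start = heapq.heappop(slot_free_at)    # earliest available slot
        heapq.heappush(slot_free_at, start + n)
        finish = max(finish, start + n)
    return finish


random.seed(0)
# Hypothetical skewed workload: mostly short outputs, a few long ones
lengths = [random.choice([10, 20, 30, 200]) for _ in range(100)]
s, c = static_steps(lengths, 10), continuous_steps(lengths, 10)
print(f"static={s} steps, continuous={c} steps, speedup={s / c:.2f}x")
```

The gap grows with the variance of the output lengths: if every request had the same length, both functions would return the same step count.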
Implementation Architecture
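    import asyncio
    from dataclasses import dataclass
    from typing import List

    import torch
    from torch.nn.utils.rnn import pad_sequence

    EOS_TOKEN = 2  # placeholder; use your tokenizer's eos_token_id
    PAD_TOKEN = 0  # placeholder; use your tokenizer's pad_token_id


    def sample_token(logits: torch.Tensor) -> torch.Tensor:
        """Greedy decoding: return the most likely next token as a 0-dim tensor."""
        return torch.argmax(logits, dim=-1)


    @dataclass
    class Request:
        request_id: str
        input_ids: torch.Tensor  # 1-D tensor: prompt plus tokens generated so far
        max_tokens: int
        priority: int = 0
        tokens_generated: int = 0
        done: bool = False


    class ContinuousBatchScheduler:
        def __init__(self, model, max_batch_size=32):
            # model is assumed to be a causal LM whose output exposes .logits
            self.model = model
            self.max_batch_size = max_batch_size
            self.active_requests: List[Request] = []
            self.pending_requests: asyncio.Queue = asyncio.Queue()

        async def add_request(self, request: Request):
            await self.pending_requests.put(request)

        async def generate(self):
            while True:
                # Fill free batch slots from the pending queue
                while (len(self.active_requests) < self.max_batch_size
                       and not self.pending_requests.empty()):
                    self.active_requests.append(self.pending_requests.get_nowait())

                if not self.active_requests:
                    await asyncio.sleep(0.001)
                    continue

                # Run one generation step. Sequences differ in length, so right-pad
                # them and mask the padding. For simplicity the full sequence is
                # recomputed each step; a real server would reuse the KV cache.
                seqs = [r.input_ids for r in self.active_requests]
                lengths = torch.tensor([len(s) for s in seqs], device=seqs[0].device)
                input_batch = pad_sequence(seqs, batch_first=True, padding_value=PAD_TOKEN)
                positions = torch.arange(input_batch.size(1), device=input_batch.device)
                attention_mask = (positions[None, :] < lengths[:, None]).long()
                with torch.no_grad():
                    outputs = self.model(input_ids=input_batch, attention_mask=attention_mask)

                # Sample one token per request from its last non-padded position
                for i, request in enumerate(self.active_requests):
                    next_token = sample_token(outputs.logits[i, lengths[i] - 1])
                    request.input_ids = torch.cat([request.input_ids, next_token.unsqueeze(0)])
                    request.tokens_generated += 1
                    if next_token.item() == EOS_TOKEN or request.tokens_generated >= request.max_tokens:
                        request.done = True

                # Drop completed requests so their slots are free on the next iteration
                self.active_requests = [r for r in self.active_requests if not r.done]

                # Yield to the event loop so add_request() callers can run
                await asyncio.sleep(0)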
Scheduling Policies
First-In-First-Out (FIFO)
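    class FIFOScheduler:
        def select_next(self, pending, active, max_batch):
            """Return the oldest pending request, or None if none can be admitted."""
            if len(active) >= max_batch:
                return None  # batch is full
            if not pending:
                return None  # nothing is waiting
            # pending is a list in arrival order; for long queues,
            # collections.deque.popleft() avoids the O(n) cost of pop(0)
            return pending.pop(0)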
Shortest-Job-First (SJF)
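Definition: Shortest-Job-First Scheduling
SJF scheduling prioritizes requests expected to produce shorter outputs. This reduces average latency because short requests no longer wait behind long ones, so more requests complete per unit time. Output lengths are not known in advance and must be estimated, and long requests can starve unless the policy also ages waiting requests.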
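    class SJFScheduler:
        def select_next(self, pending, active, max_batch):
            """Return the pending request with the shortest estimated output."""
            if len(active) >= max_batch:
                return None  # batch is full
            if not pending:
                return None  # nothing is waiting
            # estimated_output_length is assumed to be set on each request by a
            # length predictor; it is not a field of the Request dataclass above
            shortest = min(range(len(pending)),
                           key=lambda i: pending[i].estimated_output_length)
            return pending.pop(shortest)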
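The latency benefit of SJF over FIFO is easy to see on a toy workload. The sketch below assumes a single slot, all requests arriving at time zero, and exact knowledge of output lengths (which a real SJF scheduler would only estimate); the lengths are hypothetical.

```python
def mean_completion_time(job_lengths):
    """Mean completion time when jobs run one after another on a single slot."""
    clock, total = 0, 0
    for length in job_lengths:
        clock += length        # this job finishes at the current clock value
        total += clock
    return total / len(job_lengths)


# Hypothetical output lengths (in generation steps), listed in arrival order
arrivals = [500, 10, 20, 5]
fifo = mean_completion_time(arrivals)
sjf = mean_completion_time(sorted(arrivals))
print(f"FIFO mean latency: {fifo}, SJF mean latency: {sjf}")
```

Here the mean completion time drops from 518.75 steps under FIFO to 147.5 under SJF, while the total work is unchanged. The long request finishes at step 535 under both policies in this example, but with a steady stream of short arrivals it could be delayed indefinitely.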
Preemptive Scheduling
Definition: Preemptive Scheduling
Preemptive scheduling can pause low-priority requests to make room for high-priority ones. Paused requests are saved to CPU memory and resumed when GPU resources become available.
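    class PreemptiveScheduler:
        def __init__(self, preempt_threshold=0.1):
            self.preempt_threshold = preempt_threshold  # unused in this sketch
            self.paused = []  # preempted requests waiting to be resumed

        def move_to_cpu(self, request):
            # Placeholder: a real server would also swap the KV cache to host memory
            request.input_ids = request.input_ids.cpu()
            self.paused.append(request)

        def maybe_preempt(self, active_requests, new_request):
            """Swap new_request in for the lowest-priority active request."""
            if not active_requests:
                return False  # nothing to preempt
            lowest = min(range(len(active_requests)),
                         key=lambda i: active_requests[i].priority)
            if new_request.priority <= active_requests[lowest].priority:
                return False  # the new request does not outrank anyone
            self.move_to_cpu(active_requests.pop(lowest))
            active_requests.append(new_request)
            return True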
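Preempted requests need somewhere to wait until they can resume. One plausible option, sketched below with the standard-library `heapq` module, is a priority queue shared by new and preempted requests. The class name `PriorityRequestQueue` and the request ids are hypothetical, and the sketch tracks only ids and priorities, not KV cache state.

```python
import heapq
import itertools


class PriorityRequestQueue:
    """Waiting queue that returns the highest-priority request first.

    Preempted requests are pushed back with their original priority, so they
    resume before lower-priority newcomers.
    """

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()    # tie-breaker keeps FIFO order

    def push(self, request_id, priority):
        # heapq is a min-heap, so negate the priority to pop the largest first
        heapq.heappush(self._heap, (-priority, next(self._counter), request_id))

    def pop(self):
        _, _, request_id = heapq.heappop(self._heap)
        return request_id

    def __len__(self):
        return len(self._heap)


queue = PriorityRequestQueue()
queue.push("batch-job", priority=0)
queue.push("preempted-chat", priority=5)    # a victim waiting to resume
queue.push("new-chat", priority=5)
order = [queue.pop() for _ in range(len(queue))]
print(order)
```

The counter breaks ties between equal priorities in insertion order, which also prevents `heapq` from ever comparing the request payloads themselves.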
Memory Management
KV Cache Management
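Definition: KV Cache Memory
The KV cache stores key and value tensors for all previous tokens. For a 70B-class model (assuming 80 layers, 64 attention heads of dimension 128, FP16 values, and no grouped-query attention), a single 4,096-token sequence needs about 10 GiB of KV cache, so four concurrent sequences already require ~40GB. This is often the memory bottleneck in serving.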
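KV Cache Memory per Request
$$M_{\text{KV}} = 2 \cdot L \cdot H \cdot d_{\text{head}} \cdot S \cdot b$$
Here,
- $L$ = Number of layers
- $H$ = Number of attention heads
- $d_{\text{head}}$ = Head dimension
- $S$ = Sequence length
The leading factor of 2 accounts for storing both keys and values, and $b$ is the number of bytes per element (2 for FP16).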
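The per-request memory follows directly from the quantities listed above. The helper below is a minimal sketch; the function name is hypothetical, and the 70B-class configuration (80 layers, 64 heads of dimension 128, FP16) is an assumption used for illustration, not a specification of any particular model.

```python
def kv_cache_bytes(num_layers, num_heads, head_dim, seq_len, bytes_per_element=2):
    """KV cache size for one sequence; the leading 2 covers keys and values."""
    return 2 * num_layers * num_heads * head_dim * seq_len * bytes_per_element


# Assumed 70B-class configuration: 80 layers, 64 heads of dimension 128, FP16
per_request = kv_cache_bytes(80, 64, 128, 4096)
print(f"{per_request / 2**30:.1f} GiB per 4096-token request")

# With grouped-query attention only the KV heads count (8 assumed here)
per_request_gqa = kv_cache_bytes(80, 8, 128, 4096)
print(f"{per_request_gqa / 2**30:.2f} GiB per request with 8 KV heads")
```

Under these assumptions a full-length request needs 10 GiB with one KV head per attention head and 1.25 GiB with 8 KV heads, which is why the number of KV heads strongly affects how many requests fit in a batch.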
For continuous batching, KV cache must be managed dynamically as requests join and leave the batch. PagedAttention (vLLM) and RadixAttention (SGLang) are key innovations for efficient KV cache management.
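The core idea behind paged KV cache management can be illustrated with a toy allocator: memory is split into fixed-size blocks, and each request holds a table of block ids that grows one block at a time. This sketch is only in the spirit of PagedAttention; it is not vLLM's API, the class and method names are hypothetical, and it tracks block bookkeeping only, not actual tensors.

```python
class PagedKVAllocator:
    """Toy block allocator in the spirit of PagedAttention (not vLLM's API)."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}     # request_id -> list of physical block ids
        self.num_tokens = {}       # request_id -> tokens stored so far

    def append_token(self, request_id):
        """Reserve space for one more token; return False if out of memory."""
        count = self.num_tokens.get(request_id, 0)
        if count % self.block_size == 0:           # current block full (or none yet)
            if not self.free_blocks:
                return False                        # caller must wait or preempt
            self.block_tables.setdefault(request_id, []).append(self.free_blocks.pop())
        self.num_tokens[request_id] = count + 1
        return True

    def free(self, request_id):
        """Return all blocks of a finished request to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.num_tokens.pop(request_id, None)


allocator = PagedKVAllocator(num_blocks=4, block_size=16)
for _ in range(20):
    allocator.append_token("req-a")               # 20 tokens occupy 2 blocks
print(len(allocator.block_tables["req-a"]), len(allocator.free_blocks))
allocator.free("req-a")
print(len(allocator.free_blocks))
```

Because blocks are allocated on demand, a request never reserves memory for its maximum possible length, and the blocks of a finished request become available to new requests immediately.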
Production Systems
vLLM Architecture
    User Request -> [API Server] -> [Scheduler] -> [Model Worker]
                         |               |                |
                         v               v                v
                  [Request Queue] [Batch Manager] [KV Cache Manager]
                                                          |
                                                          v
                                                   [Token Sampler]
                                                          |
                                                          v
                                                  [Response Stream]
Performance Comparison
| System | Throughput (tokens/s) | Latency (p99) | GPU Utilization |
|---|---|---|---|
| HuggingFace (naive) | 1,000 | 500ms | 30-40% |
| vLLM (continuous) | 5,000 | 200ms | 80-90% |
| TensorRT-LLM | 8,000 | 150ms | 85-95% |
| SGLang | 6,000 | 180ms | 82-92% |
Practice Exercises
- Throughput Analysis: Compare the throughput of static vs continuous batching for a workload with 50% short (<10 tokens) and 50% long (>500 tokens) requests.
- Scheduler Design: Implement a priority-based scheduler that minimizes tail latency (p99) while maintaining high throughput.
- KV Cache Budget: If you have 80GB GPU memory and the model uses 40GB, how many concurrent requests can you serve with continuous batching?
- Preemption Policy: Design a preemption policy that minimizes the number of preempted requests while ensuring high-priority requests complete on time.
Key Takeaways
Summary: Continuous Batching
- Static batching wastes GPU resources waiting for the longest request
- Continuous batching adds/removes requests at every generation step
- 3-5x throughput improvement over static batching
- Scheduling policies (FIFO, SJF, preemptive) affect latency and fairness
- The KV cache is the main memory bottleneck, so it must be managed dynamically
- PagedAttention and RadixAttention enable efficient memory sharing
- Production systems (vLLM, TensorRT-LLM, SGLang) implement continuous batching
- GPU utilization increases from 30-40% to 80-95% with continuous batching
What to Learn Next
-> KV Cache Optimization: Reducing memory usage of the key-value cache.
-> Speculative Decoding: Generating multiple tokens per step for faster inference.
-> LLM Inference Optimization: Broader strategies for making LLM inference faster.
-> Flash Attention and Memory Efficiency: IO-aware attention algorithms.
-> Building Production LLM Applications: End-to-end production LLM systems.
-> Model Parallelism and Tensor Parallelism: Splitting models across GPUs.