LLM Production
Cost Optimization for LLMs — Maximizing Value Per Token
LLM inference costs can dominate production budgets. Strategic optimization across token usage, caching, batching, and model selection yields 10-100x cost reductions.
- Token Economics — Understanding and reducing token consumption
- Caching — Semantic caching and prefix caching strategies
- Model Selection — Right-sizing models for task complexity
The cheapest token is the one you never generate.
Cost Optimization for LLMs
LLM inference is expensive. A single GPT-4 query costs 10-100x more than a traditional API call. At scale, inference costs can exceed $100K/month. This guide provides a systematic framework for optimizing LLM costs while maintaining quality.
DfLLM Cost Function
The total cost of an LLM interaction includes input tokens, output tokens, and infrastructure overhead: C_total = C_input + C_output + C_infra, where each component depends on model choice, token counts, and serving infrastructure.
Token Economics
Understanding Token Costs
Cost Per Request
Here,
- =Total cost per request
- =Number of input tokens
- =Price per input token
- =Number of output tokens
- =Price per output token
- =Infrastructure overhead per request
Token Reduction Strategies
DfPrompt Compression
Prompt compression reduces input token count through: (1) removing redundant context, (2) using shorter system prompts, (3) summarizing conversation history, (4) few-shot example selection, and (5) structured output constraints.
Strategy Impact:
| Strategy | Token Reduction | Quality Impact |
|---|---|---|
| System prompt optimization | 10-30% | Minimal |
| Conversation summarization | 20-50% | Low-Medium |
| Few-shot selection | 15-40% | Task-dependent |
| Output length constraints | N/A (output) | Medium |
| Structured output | 20-40% (output) | Minimal |
Output tokens are typically 3-10x more expensive than input tokens. Constraining output length or using structured output formats (JSON vs. free text) can significantly reduce costs.
Caching Strategies
Semantic Caching
DfSemantic Caching
Semantic caching stores responses based on semantic similarity rather than exact match. When a new query is sufficiently similar to a cached query, the cached response is returned without invoking the LLM.
Cache Hit Rate
Here,
- =Cache hit rate
- =Requests served from cache
- =Total requests
Cost Savings from Caching
If 30% of requests hit the semantic cache and each request costs $0.05:
- Monthly requests: 1,000,000
- Cost without cache: $50,000
- Cost with cache (30% hit rate): $35,000
- Monthly savings: $15,000 (30% reduction)
Prefix Caching
For workloads with shared system prompts, prefix caching avoids redundant prefill computation.
Prefill Savings from Prefix Caching
Here,
- =Length of cached prefix tokens
- =Total input length
Batching and Batching Optimization
Dynamic Batching
DfDynamic Batching
Dynamic batching groups multiple requests into a single forward pass, amortizing GPU overhead across requests. Unlike static batching, dynamic batching allows requests to join and leave the batch dynamically.
Throughput from Batching
Here,
- =Batch size
- =Parallelization efficiency (0 < α < 1)
- =Single-request throughput
Request Coalescing
For high-traffic applications, buffer requests for 50-100ms before processing. This allows natural batching without significant latency increase, often improving throughput by 3-5x.
Model Selection and Right-Sizing
Task-Model Matching
DfTask-Model Matching
Task-model matching assigns queries to the smallest model that can handle them adequately. Simple queries (FAQ, formatting) use small models; complex queries (reasoning, code) use larger models.
Decision Framework:
| Query Complexity | Recommended Model | Cost Ratio |
|---|---|---|
| Simple (FAQ, formatting) | 7B / Small API | 1x |
| Medium (summarization, extraction) | 13-30B / Medium API | 3-5x |
| Complex (reasoning, code generation) | 70B+ / Large API | 10-20x |
| Expert (multi-step reasoning, research) | GPT-4 / Claude Opus | 50-100x |
Cascading
DfModel Cascading
Model cascading routes requests through a sequence of models, starting with the cheapest. If the smaller model's confidence is below a threshold, the request is escalated to a larger model.
Expected Cost with Cascading
Here,
- =Cost of small model
- =Probability small model is confident
- =Cost of large model
Infrastructure Optimization
Quantization for Cost Reduction
DfQuantization Cost Savings
Quantization reduces model size and memory requirements, enabling deployment on cheaper hardware. INT8 quantization reduces memory by ~50%, INT4 by ~75%, with varying quality impact.
| Precision | Memory Reduction | Quality Impact | Cost Reduction |
|---|---|---|---|
| FP16 | Baseline | None | Baseline |
| INT8 | 50% | Minimal | 30-40% |
| INT4 | 75% | Low-Medium | 50-60% |
| GPTQ/INT4 | 75% | Low | 50-60% |
Self-hosting with quantized models can be 5-10x cheaper than API calls at scale (>100K requests/day), but requires DevOps expertise and infrastructure management.
Practice Exercises
-
Conceptual: Compare the cost profiles of semantic caching versus prefix caching. Under what workloads is each approach more effective?
-
Mathematical: Calculate the break-even point between self-hosting a quantized 70B model versus using GPT-4 API, given: API cost 2/hour for A100 GPU, average 500 tokens per request.
-
Practical: Design a model cascading system that routes queries to 7B, 13B, or 70B models based on estimated complexity. What signals would you use for routing?
-
Research: Compare the cost-effectiveness of LoRA fine-tuning a small model versus prompting a large model for a domain-specific task.
Key Takeaways:
- Output tokens are 3-10x more expensive than input tokens; constrain output when possible
- Semantic caching can reduce costs by 20-40% for repetitive workloads
- Dynamic batching improves throughput by 3-5x with minimal latency increase
- Model cascading routes simple queries to cheap models, complex queries to expensive ones
- Self-hosting with quantization becomes cost-effective at scale (>100K requests/day)
What to Learn Next
-> LLM Serving Architectures vLLM, TGI, TensorRT-LLM, and serving patterns for production deployments.
-> LLM Monitoring and Observability Logging, tracing, metrics, and drift detection for production systems.
-> LLM Fine-Tuning Pipelines End-to-end fine-tuning infrastructure and data management.
-> LLM Security Best Practices Protecting systems from adversarial attacks and data privacy risks.
-> Multi-Tenant LLM Systems Tenant isolation, resource sharing, and customization at scale.
-> AB Testing for LLMs Experiment design, statistical significance, and canary deployments.