Cost Optimization for LLMs

Cost Optimization for LLMs — Maximizing Value Per Token

LLM inference costs can dominate production budgets. Strategic optimization across token usage, caching, batching, and model selection yields 10-100x cost reductions.

  • Token Economics — Understanding and reducing token consumption
  • Caching — Semantic caching and prefix caching strategies
  • Model Selection — Right-sizing models for task complexity

The cheapest token is the one you never generate.

LLM inference is expensive. A single GPT-4 query costs 10-100x more than a traditional API call. At scale, inference costs can exceed $100K/month. This guide provides a systematic framework for optimizing LLM costs while maintaining quality.

Definition: LLM Cost Function

The total cost of an LLM interaction includes input tokens, output tokens, and infrastructure overhead: C_total = C_input + C_output + C_infra, where each component depends on model choice, token counts, and serving infrastructure.

Token Economics

Understanding Token Costs

Cost Per Request

$$C_{request} = n_{input} \cdot p_{input} + n_{output} \cdot p_{output} + C_{overhead}$$

Here,

  • $C_{request}$ = Total cost per request
  • $n_{input}$ = Number of input tokens
  • $p_{input}$ = Price per input token
  • $n_{output}$ = Number of output tokens
  • $p_{output}$ = Price per output token
  • $C_{overhead}$ = Infrastructure overhead per request
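The per-request formula translates directly into a small helper. The sketch below is illustrative only: the function name and the prices in the example are assumptions, not figures from any provider's price list.

```python
def request_cost(n_input: int, n_output: int,
                 p_input: float, p_output: float,
                 c_overhead: float = 0.0) -> float:
    """Cost of one request: n_input * p_input + n_output * p_output + c_overhead."""
    return n_input * p_input + n_output * p_output + c_overhead


# Hypothetical prices: $0.01 per 1K input tokens, $0.03 per 1K output tokens.
cost = request_cost(
    n_input=1200,
    n_output=300,
    p_input=0.01 / 1000,   # price per single input token
    p_output=0.03 / 1000,  # price per single output token
)
print(f"${cost:.4f} per request")
```

Prices are usually quoted per 1K or per 1M tokens, so converting them to a per-token price before multiplying avoids off-by-a-thousand errors.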

Token Reduction Strategies

Definition: Prompt Compression

Prompt compression reduces input token count through: (1) removing redundant context, (2) using shorter system prompts, (3) summarizing conversation history, (4) few-shot example selection, and (5) structured output constraints.

Strategy Impact:

| Strategy | Token Reduction | Quality Impact |
| --- | --- | --- |
| System prompt optimization | 10-30% | Minimal |
| Conversation summarization | 20-50% | Low-Medium |
| Few-shot selection | 15-40% | Task-dependent |
| Output length constraints | N/A (output) | Medium |
| Structured output | 20-40% (output) | Minimal |

Output tokens are typically 3-10x more expensive than input tokens. Constraining output length or using structured output formats (JSON vs. free text) can significantly reduce costs.
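One of the simplest ways to shrink conversation history is to keep the system prompt and only the most recent turns that fit a token budget. The sketch below does that; note that it truncates rather than summarizes, and that `approx_tokens` is a crude stand-in (an assumption of roughly 4 characters per token) for a real tokenizer. The message format with `role` and `content` keys is also an assumption.

```python
def approx_tokens(text: str) -> int:
    """Crude token estimate (assumption): about 4 characters per token."""
    return max(1, len(text) // 4)


def trim_history(messages: list[dict], budget: int) -> list[dict]:
    """Keep system messages plus the newest turns that fit within `budget` tokens."""
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]

    used = sum(approx_tokens(m["content"]) for m in system)
    kept = []
    for message in reversed(turns):      # walk from newest to oldest
        cost = approx_tokens(message["content"])
        if used + cost > budget:
            break                        # older turns are dropped
        kept.append(message)
        used += cost
    return system + kept[::-1]           # restore chronological order
```

A summarization-based variant would replace the dropped turns with a short model-written summary instead of discarding them, trading one extra (cheap) call for better context retention.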

Caching Strategies

Semantic Caching

Definition: Semantic Caching

Semantic caching stores responses based on semantic similarity rather than exact match. When a new query is sufficiently similar to a cached query, the cached response is returned without invoking the LLM.
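A minimal sketch of the lookup logic follows. To stay self-contained it uses a toy bag-of-words "embedding" and a linear scan; both are assumptions for illustration, and a production cache would use a real embedding model and a vector index. The class name, the `call_llm` callable, and the 0.8 threshold are hypothetical.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding (assumption): bag-of-words counts instead of a real model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    def __init__(self, threshold: float = 0.8):
        self.threshold = threshold      # minimum similarity to count as a hit
        self.entries = []               # list of (embedding, response) pairs

    def get(self, query: str):
        q = embed(query)
        best_score, best_response = 0.0, None
        for vec, response in self.entries:
            score = cosine(q, vec)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None

    def put(self, query: str, response: str) -> None:
        self.entries.append((embed(query), response))


def answer(query: str, cache: SemanticCache, call_llm) -> str:
    """Return a cached response on a hit; otherwise call the model and cache it."""
    cached = cache.get(query)
    if cached is not None:
        return cached
    response = call_llm(query)
    cache.put(query, response)
    return response
```

The threshold is the main tuning knob: set it too low and users receive answers to questions they did not ask; set it too high and the hit rate collapses toward that of an exact-match cache.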

Cache Hit Rate

$$R_{hit} = \frac{N_{cached}}{N_{total}} \times 100\%$$

Here,

  • $R_{hit}$ = Cache hit rate
  • $N_{cached}$ = Requests served from cache
  • $N_{total}$ = Total requests

Cost Savings from Caching

If 30% of requests hit the semantic cache and each request costs $0.05:

  • Monthly requests: 1,000,000
  • Cost without cache: $50,000
  • Cost with cache (30% hit rate): $35,000
  • Monthly savings: $15,000 (30% reduction)
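The arithmetic above treats a cache hit as free. The sketch below reproduces those numbers and adds an optional `cost_per_hit` parameter (a hypothetical name) for the embedding and lookup cost that a real semantic cache incurs on every hit.

```python
def monthly_cost(requests: int, cost_per_request: float,
                 hit_rate: float, cost_per_hit: float = 0.0) -> float:
    """Monthly spend when a fraction `hit_rate` of requests is served from cache."""
    hits = requests * hit_rate
    misses = requests - hits
    return misses * cost_per_request + hits * cost_per_hit


baseline = monthly_cost(1_000_000, 0.05, hit_rate=0.0)
cached = monthly_cost(1_000_000, 0.05, hit_rate=0.30)
savings = baseline - cached

print(f"without cache: ${baseline:,.0f}")
print(f"with cache:    ${cached:,.0f}")
print(f"savings:       ${savings:,.0f} ({savings / baseline:.0%})")
```

Passing a non-zero `cost_per_hit` shows how quickly lookup overhead erodes the savings when the underlying requests are already cheap.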

Prefix Caching

For workloads with shared system prompts, prefix caching avoids redundant prefill computation.

Prefill Savings from Prefix Caching

$$\text{Savings}_{prefill} = \frac{L_{prefix}}{L_{total}} \times 100\%$$

Here,

  • $L_{prefix}$ = Length of cached prefix tokens
  • $L_{total}$ = Total input length
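As a quick worked example of the formula, the helper below computes the share of prefill work that a cached prefix avoids. The function name and the token counts are illustrative assumptions.

```python
def prefill_savings(prefix_tokens: int, total_tokens: int) -> float:
    """Percentage of prefill tokens covered by the cached prefix."""
    if total_tokens <= 0:
        raise ValueError("total_tokens must be positive")
    if not 0 <= prefix_tokens <= total_tokens:
        raise ValueError("prefix_tokens must be between 0 and total_tokens")
    return prefix_tokens / total_tokens * 100


# Hypothetical workload: 1,500-token shared system prompt, 2,000-token total input.
print(f"{prefill_savings(1500, 2000):.0f}% of prefill avoided")
```

The saving applies only to prefill (input processing); output tokens are still generated one at a time, so the end-to-end cost reduction is smaller than this percentage.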

Batching and Throughput Optimization

Dynamic Batching

Definition: Dynamic Batching

Dynamic batching groups multiple requests into a single forward pass, amortizing GPU overhead across requests. Unlike static batching, dynamic batching allows requests to join and leave the batch dynamically.

Throughput from Batching

$$\text{Throughput}_{batch} = B \times \frac{\text{Throughput}_{single}}{B^{\alpha}}$$

Here,

  • $B$ = Batch size
  • $\alpha$ = Batching overhead exponent (0 < α < 1); values closer to 0 mean closer to linear scaling
  • $\text{Throughput}_{single}$ = Single-request throughput
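The throughput formula simplifies to Throughput_single × B^(1 − α), which makes the diminishing returns easy to see. The sketch below evaluates it for a few batch sizes; the single-request throughput and the value of α are made-up inputs that would need to be measured on real hardware.

```python
def batch_throughput(single: float, batch_size: int, alpha: float) -> float:
    """Throughput_batch = B * Throughput_single / B**alpha."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return batch_size * single / batch_size ** alpha


# Hypothetical inputs: 20 tokens/s for a single request, alpha = 0.3.
for b in (1, 4, 16, 64):
    total = batch_throughput(single=20.0, batch_size=b, alpha=0.3)
    print(f"B={b:>2}: {total:7.1f} tokens/s total, {total / b:5.1f} per request")
```

Total throughput keeps rising with batch size, but per-request throughput falls, which is the latency cost that batching trades for efficiency.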

Request Coalescing

For high-traffic applications, buffer requests for 50-100ms before processing. This allows natural batching without significant latency increase, often improving throughput by 3-5x.
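The buffering idea can be sketched with `asyncio`: the first request in an empty buffer starts a timer, later requests join the buffer, and everything is sent as one batch when the window closes or the buffer fills. The `Coalescer` class and the `process_batch` callable are hypothetical names; this is a minimal sketch, not a production batching server.

```python
import asyncio


class Coalescer:
    """Buffers requests for `window_s` seconds, then processes them as one batch."""

    def __init__(self, process_batch, window_s: float = 0.05, max_batch: int = 32):
        self.process_batch = process_batch   # async callable: list of prompts -> list of results
        self.window_s = window_s
        self.max_batch = max_batch
        self.pending = []                    # list of (prompt, future) pairs
        self.timer = None                    # task that flushes when the window closes

    async def submit(self, prompt: str) -> str:
        future = asyncio.get_running_loop().create_future()
        self.pending.append((prompt, future))
        if len(self.pending) >= self.max_batch:
            if self.timer is not None:       # batch is full: flush now, drop the timer
                self.timer.cancel()
                self.timer = None
            await self._flush()
        elif self.timer is None:             # first request in the buffer starts the window
            self.timer = asyncio.create_task(self._flush_after_window())
        return await future

    async def _flush_after_window(self) -> None:
        await asyncio.sleep(self.window_s)
        self.timer = None
        await self._flush()

    async def _flush(self) -> None:
        batch, self.pending = self.pending, []
        if not batch:
            return
        try:
            results = await self.process_batch([prompt for prompt, _ in batch])
        except Exception as exc:             # propagate failures to every waiting caller
            for _, future in batch:
                future.set_exception(exc)
            return
        for (_, future), result in zip(batch, results):
            future.set_result(result)
```

Each caller simply awaits `coalescer.submit(prompt)`; the added latency is bounded by `window_s`, while `max_batch` caps how much work a single forward pass receives.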

Model Selection and Right-Sizing

Task-Model Matching

Definition: Task-Model Matching

Task-model matching assigns queries to the smallest model that can handle them adequately. Simple queries (FAQ, formatting) use small models; complex queries (reasoning, code) use larger models.

Decision Framework:

| Query Complexity | Recommended Model | Cost Ratio |
| --- | --- | --- |
| Simple (FAQ, formatting) | 7B / Small API | 1x |
| Medium (summarization, extraction) | 13-30B / Medium API | 3-5x |
| Complex (reasoning, code generation) | 70B+ / Large API | 10-20x |
| Expert (multi-step reasoning, research) | GPT-4 / Claude Opus | 50-100x |
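A router that implements this framework needs some way to estimate complexity before calling any model. The sketch below uses simple keyword and length heuristics; the tier names, hint lists, and word-count cutoff are all illustrative assumptions, and real systems often use a small trained classifier instead.

```python
TIERS = ("small", "medium", "large", "expert")    # hypothetical tier names

MEDIUM_HINTS = ("summarize", "summarise", "extract", "classify")
COMPLEX_HINTS = ("debug", "prove", "derive", "refactor", "write code")
LONG_QUERY_WORDS = 200                            # assumed cutoff for "expert" escalation


def estimate_tier(query: str) -> str:
    """Map a query to the cheapest tier that its surface features suggest."""
    q = query.lower()
    tier = 0
    if any(hint in q for hint in MEDIUM_HINTS):
        tier = 1
    if any(hint in q for hint in COMPLEX_HINTS):
        tier = 2
    if tier == 2 and len(q.split()) > LONG_QUERY_WORDS:
        tier = 3                                  # long and complex: escalate further
    return TIERS[tier]


for query in (
    "What are your opening hours?",
    "Summarize this article in two sentences.",
    "Debug this function: it raises KeyError on empty input.",
):
    print(f"{estimate_tier(query):>6}  <-  {query}")
```

Keyword heuristics are cheap but brittle, so logging routing decisions alongside quality feedback is a sensible way to find queries that were under-routed.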

Cascading

Definition: Model Cascading

Model cascading routes requests through a sequence of models, starting with the cheapest. If the smaller model's confidence is below a threshold, the request is escalated to a larger model.

Expected Cost with Cascading

$$E[C] = C_{small} + (1 - p_{confident}) \cdot C_{large}$$

Here,

  • $C_{small}$ = Cost of small model
  • $p_{confident}$ = Probability small model is confident
  • $C_{large}$ = Cost of large model
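The expected-cost formula and a two-stage cascade can be sketched together. In the code below, the model callables are assumed to return an `(answer, confidence)` pair; how confidence is obtained (log-probabilities, a verifier, self-evaluation) is left open, and the costs in the example are hypothetical.

```python
def expected_cascade_cost(c_small: float, c_large: float, p_confident: float) -> float:
    """E[C] = C_small + (1 - p_confident) * C_large."""
    return c_small + (1 - p_confident) * c_large


def cascade(query: str, small_model, large_model, threshold: float = 0.7):
    """Try the small model first; escalate when its confidence is below `threshold`."""
    answer, confidence = small_model(query)
    if confidence >= threshold:
        return answer, "small"
    answer, _ = large_model(query)
    return answer, "large"


# Hypothetical per-request costs: $0.001 (small) and $0.02 (large).
cost = expected_cascade_cost(c_small=0.001, c_large=0.02, p_confident=0.7)
print(f"expected cost: ${cost:.4f} vs ${0.02:.4f} for always using the large model")
```

Because the small model is paid for on every request, the cascade only beats always calling the large model when C_small < p_confident × C_large; with a weak small model that rarely clears the threshold, cascading costs more than it saves.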

Infrastructure Optimization

Quantization for Cost Reduction

Definition: Quantization Cost Savings

Quantization reduces model size and memory requirements, enabling deployment on cheaper hardware. INT8 quantization reduces memory by ~50%, INT4 by ~75%, with varying quality impact.

| Precision | Memory Reduction | Quality Impact | Cost Reduction |
| --- | --- | --- | --- |
| FP16 | Baseline | None | Baseline |
| INT8 | 50% | Minimal | 30-40% |
| INT4 | 75% | Low-Medium | 50-60% |
| GPTQ/INT4 | 75% | Low | 50-60% |

Self-hosting with quantized models can be 5-10x cheaper than API calls at scale (>100K requests/day), but requires DevOps expertise and infrastructure management.
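The memory-reduction figures follow from bits per parameter. The sketch below estimates weight memory only; it deliberately ignores the KV cache, activations, and quantization metadata, so real deployments need headroom beyond these numbers. The function name is an illustrative assumption.

```python
BITS_PER_PARAM = {"fp16": 16, "int8": 8, "int4": 4}


def weight_memory_gb(params_billion: float, precision: str) -> float:
    """Approximate weight memory in GB (10^9 bytes): parameters * bits / 8."""
    return params_billion * BITS_PER_PARAM[precision] / 8


for precision in ("fp16", "int8", "int4"):
    gb = weight_memory_gb(70, precision)
    reduction = 1 - gb / weight_memory_gb(70, "fp16")
    print(f"70B @ {precision}: {gb:6.1f} GB ({reduction:.0%} smaller than FP16)")
```

Halving or quartering the weight footprint is what allows a model to fit on fewer or smaller GPUs, which is where the hardware cost reduction comes from.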

Practice Exercises

  1. Conceptual: Compare the cost profiles of semantic caching versus prefix caching. Under what workloads is each approach more effective?

  2. Mathematical: Calculate the break-even point between self-hosting a quantized 70B model versus using GPT-4 API, given: API cost $0.03/1K tokens, self-hosting cost $2/hour for A100 GPU, average 500 tokens per request.

  3. Practical: Design a model cascading system that routes queries to 7B, 13B, or 70B models based on estimated complexity. What signals would you use for routing?

  4. Research: Compare the cost-effectiveness of LoRA fine-tuning a small model versus prompting a large model for a domain-specific task.

Key Takeaways:

  • Output tokens are 3-10x more expensive than input tokens; constrain output when possible
  • Semantic caching can reduce costs by 20-40% for repetitive workloads
  • Batching with a short request-coalescing window (50-100ms) often improves throughput by 3-5x with minimal latency increase
  • Model cascading routes simple queries to cheap models, complex queries to expensive ones
  • Self-hosting with quantization becomes cost-effective at scale (>100K requests/day)

What to Learn Next

-> LLM Serving Architectures: vLLM, TGI, TensorRT-LLM, and serving patterns for production deployments.

-> LLM Monitoring and Observability: Logging, tracing, metrics, and drift detection for production systems.

-> LLM Fine-Tuning Pipelines: End-to-end fine-tuning infrastructure and data management.

-> LLM Security Best Practices: Protecting systems from adversarial attacks and data privacy risks.

-> Multi-Tenant LLM Systems: Tenant isolation, resource sharing, and customization at scale.

-> A/B Testing for LLMs: Experiment design, statistical significance, and canary deployments.
