LLM Production

Cost Optimization for LLMs — Maximizing Value Per Token

LLM inference costs can dominate production budgets. Strategic optimization across token usage, caching, batching, and model selection yields 10-100x cost reductions.

Token Economics — Understanding and reducing token consumption
Caching — Semantic caching and prefix caching strategies
Model Selection — Right-sizing models for task complexity

The cheapest token is the one you never generate.

Cost Optimization for LLMs

LLM inference is expensive. A single GPT-4 query costs 10-100x more than a traditional API call. At scale, inference costs can exceed $100K/month. This guide provides a systematic framework for optimizing LLM costs while maintaining quality.

DfLLM Cost Function

The total cost of an LLM interaction includes input tokens, output tokens, and infrastructure overhead: C_total = C_input + C_output + C_infra, where each component depends on model choice, token counts, and serving infrastructure.

Token Economics

Understanding Token Costs

Cost Per Request

C_{request} = n_{input} \\cdot p_{input} + n_{output} \\cdot p_{output} + C_{overhead}

Here,

$C_{request}$ =Total cost per request
$n_{input}$ =Number of input tokens
$p_{input}$ =Price per input token
$n_{output}$ =Number of output tokens
$p_{output}$ =Price per output token
$C_{overhead}$ =Infrastructure overhead per request

Token Reduction Strategies

DfPrompt Compression

Prompt compression reduces input token count through: (1) removing redundant context, (2) using shorter system prompts, (3) summarizing conversation history, (4) few-shot example selection, and (5) structured output constraints.

Strategy Impact:

Strategy	Token Reduction	Quality Impact
System prompt optimization	10-30%	Minimal
Conversation summarization	20-50%	Low-Medium
Few-shot selection	15-40%	Task-dependent
Output length constraints	N/A (output)	Medium
Structured output	20-40% (output)	Minimal

Output tokens are typically 3-10x more expensive than input tokens. Constraining output length or using structured output formats (JSON vs. free text) can significantly reduce costs.

Caching Strategies

Semantic Caching

DfSemantic Caching

Semantic caching stores responses based on semantic similarity rather than exact match. When a new query is sufficiently similar to a cached query, the cached response is returned without invoking the LLM.

Cache Hit Rate

R_{hit} = \\frac{N_{cached}}{N_{total}} \\times 100\\%

Here,

$R_{hit}$ =Cache hit rate
$N_{cached}$ =Requests served from cache
$N_{total}$ =Total requests

Cost Savings from Caching

If 30% of requests hit the semantic cache and each request costs $0.05:

Monthly requests: 1,000,000
Cost without cache: $50,000
Cost with cache (30% hit rate): $35,000
Monthly savings: $15,000 (30% reduction)

Prefix Caching

For workloads with shared system prompts, prefix caching avoids redundant prefill computation.

Prefill Savings from Prefix Caching

\\text{Savings}_{prefill} = \\frac{L_{prefix}}{L_{total}} \\times 100\\%

Here,

$L_{prefix}$ =Length of cached prefix tokens
$L_{total}$ =Total input length

Batching and Batching Optimization

Dynamic Batching

DfDynamic Batching

Dynamic batching groups multiple requests into a single forward pass, amortizing GPU overhead across requests. Unlike static batching, dynamic batching allows requests to join and leave the batch dynamically.

Throughput from Batching

\\text{Throughput}_{batch} = B \\times \\frac{\\text{Throughput}_{single}}{B^{\\alpha}}

Here,

$B$ =Batch size
$\alpha$ =Parallelization efficiency (0 < α < 1)
$\text{Throughput}_{single}$ =Single-request throughput

Request Coalescing

For high-traffic applications, buffer requests for 50-100ms before processing. This allows natural batching without significant latency increase, often improving throughput by 3-5x.

Model Selection and Right-Sizing

Task-Model Matching

DfTask-Model Matching

Task-model matching assigns queries to the smallest model that can handle them adequately. Simple queries (FAQ, formatting) use small models; complex queries (reasoning, code) use larger models.

Decision Framework:

Query Complexity	Recommended Model	Cost Ratio
Simple (FAQ, formatting)	7B / Small API	1x
Medium (summarization, extraction)	13-30B / Medium API	3-5x
Complex (reasoning, code generation)	70B+ / Large API	10-20x
Expert (multi-step reasoning, research)	GPT-4 / Claude Opus	50-100x

Cascading

DfModel Cascading

Model cascading routes requests through a sequence of models, starting with the cheapest. If the smaller model's confidence is below a threshold, the request is escalated to a larger model.

Expected Cost with Cascading

E[C] = C_{small} + (1 - p_{confident}) \\cdot C_{large}

Here,

$C_{small}$ =Cost of small model
$p_{confident}$ =Probability small model is confident
$C_{large}$ =Cost of large model

Infrastructure Optimization

Quantization for Cost Reduction

DfQuantization Cost Savings

Quantization reduces model size and memory requirements, enabling deployment on cheaper hardware. INT8 quantization reduces memory by ~50%, INT4 by ~75%, with varying quality impact.

Precision	Memory Reduction	Quality Impact	Cost Reduction
FP16	Baseline	None	Baseline
INT8	50%	Minimal	30-40%
INT4	75%	Low-Medium	50-60%
GPTQ/INT4	75%	Low	50-60%

Self-hosting with quantized models can be 5-10x cheaper than API calls at scale (>100K requests/day), but requires DevOps expertise and infrastructure management.

Practice Exercises

Conceptual: Compare the cost profiles of semantic caching versus prefix caching. Under what workloads is each approach more effective?
Mathematical: Calculate the break-even point between self-hosting a quantized 70B model versus using GPT-4 API, given: API cost $0.03/1K tokens, self-hosting cost$ 2/hour for A100 GPU, average 500 tokens per request.
Practical: Design a model cascading system that routes queries to 7B, 13B, or 70B models based on estimated complexity. What signals would you use for routing?
Research: Compare the cost-effectiveness of LoRA fine-tuning a small model versus prompting a large model for a domain-specific task.

Key Takeaways:

Output tokens are 3-10x more expensive than input tokens; constrain output when possible
Semantic caching can reduce costs by 20-40% for repetitive workloads
Dynamic batching improves throughput by 3-5x with minimal latency increase
Model cascading routes simple queries to cheap models, complex queries to expensive ones
Self-hosting with quantization becomes cost-effective at scale (>100K requests/day)

What to Learn Next

-> LLM Serving Architectures vLLM, TGI, TensorRT-LLM, and serving patterns for production deployments.

-> LLM Monitoring and Observability Logging, tracing, metrics, and drift detection for production systems.

-> LLM Fine-Tuning Pipelines End-to-end fine-tuning infrastructure and data management.

-> LLM Security Best Practices Protecting systems from adversarial attacks and data privacy risks.

-> Multi-Tenant LLM Systems Tenant isolation, resource sharing, and customization at scale.

-> AB Testing for LLMs Experiment design, statistical significance, and canary deployments.

Cost Optimization for LLMs

Cost Optimization for LLMs — Maximizing Value Per Token

Cost Optimization for LLMs

DfLLM Cost Function

Token Economics

Understanding Token Costs

Cost Per Request

Token Reduction Strategies

DfPrompt Compression

Caching Strategies

Semantic Caching

DfSemantic Caching

Cache Hit Rate

Cost Savings from Caching

Prefix Caching

Prefill Savings from Prefix Caching

Batching and Batching Optimization

Dynamic Batching

DfDynamic Batching

Throughput from Batching

Request Coalescing

Model Selection and Right-Sizing

Task-Model Matching

DfTask-Model Matching

Cascading

DfModel Cascading

Expected Cost with Cascading

Infrastructure Optimization

Quantization for Cost Reduction

DfQuantization Cost Savings

Practice Exercises

What to Learn Next

Need Expert LLM Help?