A/B Testing for LLMs — Rigorous Experimentation at Scale
A/B testing LLMs requires careful statistical design due to high variance in output quality, subjective evaluation criteria, and the high cost of inference.
- Experiment Design — Randomization, power analysis, and sample sizing
- Statistical Methods — Hypothesis testing, confidence intervals, sequential testing
- Deployment Strategies — Canary releases, shadow deployments, multi-armed bandits
In God we trust; all others bring data.
A/B Testing for LLMs
A/B testing LLMs presents unique challenges compared to traditional software experimentation. Output quality is often subjective, variance is high, and the cost of running experiments (inference compute) is significant. This guide provides a rigorous framework for LLM experimentation.
Definition: LLM A/B Testing
LLM A/B testing is a controlled experiment where users are randomly assigned to receive responses from different model versions (or configurations), enabling causal inference about the impact of changes on quality metrics, user satisfaction, and business outcomes.
Experimental Design
Randomization and Assignment
Definition: Random Assignment
Random assignment ensures each user has an equal probability of being assigned to any treatment group, eliminating selection bias. For LLM experiments, randomization should be at the user level (not request level) to prevent inconsistent experiences within sessions.
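A common way to get stable user-level assignment is to hash the user ID together with an experiment name. The following is a minimal sketch under that assumption; the function name `assign_variant`, the experiment label, and the variant names are illustrative and not part of any specific framework.

```python
import hashlib

def assign_variant(user_id: str, experiment: str,
                   variants=("control", "treatment")) -> str:
    """Deterministically map a user to one variant with equal probability."""
    # Salting with the experiment name keeps assignments independent
    # across different experiments for the same user.
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Interpret the first 8 bytes as an integer and bucket it.
    bucket = int.from_bytes(digest[:8], "big")
    return variants[bucket % len(variants)]

# The same user always sees the same model version within an experiment.
print(assign_variant("user-123", "rag-vs-baseline"))
```

Because the assignment depends only on the user ID and the experiment name, every request from a user lands in the same group without storing any state.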
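Minimum Sample Size
$$n = \frac{2\,(z_{\alpha/2} + z_{\beta})^2\,\sigma^2}{\delta^2}$$
Here,
- $n$ = Required sample size per group
- $z_{\alpha/2}$ = Critical value for significance level $\alpha$ (two-sided)
- $z_{\beta}$ = Critical value for power ($1-\beta$)
- $\sigma$ = Standard deviation of the metric
- $\delta$ = Minimum detectable effect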
Sample Size Calculation
To detect a 5 percentage point improvement in user satisfaction (from 70% to 75%) with 80% power at a 5% significance level (two-sided):
- σ ≈ 0.45 (satisfaction is Bernoulli, so σ = sqrt(p(1 − p)), which is about 0.46 at p = 0.70 and 0.43 at p = 0.75)
- δ = 0.05
- z_0.025 = 1.96, z_0.20 = 0.84
- n = (1.96 + 0.84)^2 × 2 × 0.45^2 / 0.05^2 ≈ 1,270 users per group (about 2,540 in total)
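The same calculation can be scripted so that it is easy to rerun for other effect sizes. This is a small sketch using only the Python standard library; it evaluates n = 2 (z_{α/2} + z_β)^2 σ^2 / δ^2 with exact normal quantiles rather than the rounded values 1.96 and 0.84, and the function name is an illustrative choice.

```python
import math
from statistics import NormalDist

def sample_size_per_group(sigma: float, delta: float,
                          alpha: float = 0.05, power: float = 0.80) -> int:
    """Per-group sample size for a two-sided, two-sample comparison."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 for 80% power
    n = 2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / delta ** 2
    return math.ceil(n)  # round up: a fractional user cannot be sampled

n = sample_size_per_group(sigma=0.45, delta=0.05)
print(f"{n} users per group, {2 * n} in total")
```

With these inputs the result is roughly 1,270 users per group, or about 2,540 across both groups. Because n scales with 1/δ^2, halving the detectable effect roughly quadruples the required sample.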
Metrics Framework
Primary Metrics (Statistical Power):
- User satisfaction rating
- Task completion rate
- Response helpfulness score
Secondary Metrics (Directional):
- Response latency
- Token usage
- Follow-up question rate
Guardrail Metrics (Must Not Degrade):
- Safety violation rate
- Hallucination rate
- P95 latency
Always define your primary metric and minimum detectable effect before running the experiment. Post-hoc metric selection invalidates statistical guarantees.
Statistical Testing
Two-Sample Z-Test for Proportions
For binary metrics like user satisfaction:
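Z-Test for Proportions
$$z = \frac{\hat{p}_A - \hat{p}_B}{\sqrt{\hat{p}\,(1-\hat{p})\left(\frac{1}{n_A} + \frac{1}{n_B}\right)}}, \qquad \hat{p} = \frac{n_A \hat{p}_A + n_B \hat{p}_B}{n_A + n_B}$$
Here,
- $\hat{p}_A, \hat{p}_B$ = Observed proportions in groups A and B
- $\hat{p}$ = Pooled proportion
- $n_A, n_B$ = Sample sizes per group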
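As an illustration, the pooled two-proportion z-test can be computed directly from success counts. The sketch below uses only the standard library and reports a two-sided p-value; the function name and the example counts are made up for demonstration.

```python
import math
from statistics import NormalDist

def two_proportion_z_test(successes_a: int, n_a: int,
                          successes_b: int, n_b: int):
    """Return (z, two-sided p-value) for H0: p_A == p_B."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    # Pooled proportion under the null hypothesis of no difference.
    p_pool = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical data: 700/1000 satisfied in group A, 750/1000 in group B.
z, p = two_proportion_z_test(700, 1000, 750, 1000)
print(f"z = {z:.3f}, p = {p:.4f}")
```

A negative z here means group B has the higher observed proportion. The test is only valid as a fixed-sample analysis, evaluated once at the planned sample size.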
Sequential Testing
Traditional hypothesis testing requires a fixed sample size. Sequential testing allows early stopping when results are conclusive.
Definition: Sequential Testing
Sequential testing (also called adaptive testing) allows periodic evaluation of experiment results without inflating Type I error rates, using methods like alpha spending functions or always-valid confidence intervals.
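Always-Valid P-Value (mixture LRT)
$$p_t = \min\!\left(1,\ \min_{s \le t} \frac{1}{\Lambda_s}\right), \qquad \Lambda_s = \int \frac{L_s(\theta)}{L_s(\theta_0)}\, dH(\theta)$$
Reject the null hypothesis at the first time $t$ with $p_t \le \alpha$.
Here,
- $p_t$ = Always-valid p-value at time t
- $\theta$ = True effect size parameter ($\theta_0$ is its value under the null hypothesis, $L_s$ is the likelihood of the data up to time s, and $H$ is the mixing distribution over $\theta$)
- $\hat{\theta}_t$ = MLE at time t (for normally distributed data with a normal mixing distribution, $\Lambda_t$ has a closed form in terms of $\hat{\theta}_t$ and its variance)
- $\alpha$ = Significance level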
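To make this concrete, here is a minimal sketch of a mixture likelihood ratio p-value under simplifying assumptions that are not stated in the text above: the effect estimate at each look is treated as normal with known variance, the null effect is 0, and the mixing distribution is N(0, τ²). The parameter `tau` and both function names are illustrative choices.

```python
import math

def log_mixture_lr(theta_hat: float, var_hat: float, tau: float) -> float:
    """Log mixture likelihood ratio against theta = 0.

    Assumes theta_hat ~ N(theta, var_hat) and a N(0, tau^2) mixing distribution.
    """
    total = var_hat + tau ** 2
    return (0.5 * math.log(var_hat / total)
            + theta_hat ** 2 * tau ** 2 / (2 * var_hat * total))

def always_valid_p_values(estimates, variances, tau: float = 0.05):
    """Running p-values: p_t = min(p_{t-1}, 1 / Lambda_t), starting at 1."""
    p = 1.0
    history = []
    for theta_hat, var_hat in zip(estimates, variances):
        # exp(-log Lambda_t) equals 1 / Lambda_t; the log form avoids overflow.
        p = min(p, math.exp(-log_mixture_lr(theta_hat, var_hat, tau)))
        history.append(p)
    return history

# Hypothetical looks: the estimate stays near 0.05 while its variance shrinks.
print(always_valid_p_values([0.04, 0.05, 0.05], [0.001, 0.0004, 0.0001]))
```

The p-value sequence never increases, so the experiment can be stopped the first time it falls below α without a correction for the number of looks. The choice of `tau` affects power but not validity under these assumptions.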
Deployment Strategies
Canary Deployment
Definition: Canary Deployment
A canary deployment routes a small percentage of traffic (typically 1-5%) to a new model version while monitoring key metrics. If metrics remain stable, traffic is gradually increased until full rollout.
Traffic Split Over Time (█ = production model, ░ = canary):
Week 1: ███████████████████░ 5% canary
Week 2: ███████████████░░░░░ 25% canary
Week 3: ██████████░░░░░░░░░░ 50% canary
Week 4: ░░░░░░░░░░░░░░░░░░░░ 100% rollout
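The ramp logic behind such a schedule can be sketched in a few lines. This example assumes the 5%, 25%, 50%, 100% stages shown above and a single boolean summarizing whether guardrail metrics are healthy; the names `RAMP_STEPS` and `next_canary_share` are hypothetical, and rolling back to 0% on any guardrail breach is one possible policy among several.

```python
RAMP_STEPS = [0.05, 0.25, 0.50, 1.00]  # canary traffic share per stage

def next_canary_share(current: float, guardrails_ok: bool) -> float:
    """Advance one stage if guardrails hold; otherwise roll back to 0."""
    if not guardrails_ok:
        return 0.0  # stop sending traffic to the new model
    for step in RAMP_STEPS:
        if step > current:
            return step
    return RAMP_STEPS[-1]  # already at full rollout

share = 0.0
for week, healthy in enumerate([True, True, True, True], start=1):
    share = next_canary_share(share, healthy)
    print(f"Week {week}: {share:.0%} canary")
```

In practice the health signal would come from the guardrail metrics (safety violation rate, hallucination rate, P95 latency) compared against pre-agreed thresholds.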
Shadow Deployment
Definition: Shadow Deployment
A shadow deployment runs the new model in parallel with the production model, processing the same inputs. The new model's outputs are logged but not served to users, enabling offline comparison without user impact.
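A minimal sketch of the request path is shown below. The models are represented as plain callables and the log as an in-memory list, both of which are stand-ins for real inference clients and a logging pipeline; `handle_request` is a hypothetical name.

```python
import logging
from typing import Callable

logger = logging.getLogger("shadow")

def handle_request(prompt: str,
                   production_model: Callable[[str], str],
                   shadow_model: Callable[[str], str],
                   shadow_log: list) -> str:
    """Serve the production response; record the shadow response for offline comparison."""
    response = production_model(prompt)
    try:
        shadow_response = shadow_model(prompt)
        shadow_log.append({"prompt": prompt,
                           "production": response,
                           "shadow": shadow_response})
    except Exception:
        # A shadow failure must never affect what the user receives.
        logger.exception("shadow model failed")
    return response

log = []
answer = handle_request("What is RAG?", lambda p: "prod answer",
                        lambda p: "shadow answer", log)
print(answer, len(log))
```

This sketch calls the shadow model synchronously for simplicity. A production system would typically run it asynchronously or from a replayed request queue so that it adds no user-facing latency.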
Multi-Armed Bandits
Definition: Multi-Armed Bandit
A multi-armed bandit approach dynamically allocates traffic to the best-performing model variant, balancing exploration (gathering information) with exploitation (serving the best model). Compared with fixed-ratio A/B testing, this typically sends less traffic to inferior variants, although the adaptive allocation makes unbiased effect estimation harder.
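UCB1 Selection Rule
$$a_t = \arg\max_{a} \left[ \hat{\mu}_a + \sqrt{\frac{2 \ln t}{N_a(t)}} \right]$$
Here,
- $\hat{\mu}_a$ = Estimated mean reward for arm a
- $t$ = Current time step
- $N_a(t)$ = Number of times arm a was pulled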
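The UCB1 rule picks the arm with the highest estimated mean reward plus an exploration bonus of sqrt(2 ln t / N_a). A short sketch follows, assuming rewards in [0, 1] (for example, a thumbs-up indicator) and that each arm is pulled once before the bound is applied; the function name and the example counts are illustrative.

```python
import math

def ucb1_select(counts, reward_sums, t):
    """Return the index of the arm to pull at time step t (1-indexed)."""
    # Pull every arm once first; the bonus is undefined for unpulled arms.
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    scores = [
        reward_sums[a] / counts[a] + math.sqrt(2 * math.log(t) / counts[a])
        for a in range(len(counts))
    ]
    return max(range(len(counts)), key=scores.__getitem__)

# Hypothetical state: model 0 served 100 times (90 positive), model 1 once (0 positive).
print(ucb1_select(counts=[100, 1], reward_sums=[90, 0], t=102))
```

In this example the rarely tried arm is selected despite its poor record, because its exploration bonus is still large. The bonus shrinks as an arm accumulates pulls.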
Common Pitfalls
- Peeking problem: Repeatedly checking a fixed-sample test and stopping at the first significant result inflates false positive rates
- Metric gaming: Optimizing proxy metrics that don't reflect true quality
- Novelty effects: Users react differently to new features initially
- Population shift: User base changes during long experiments
- Interference: Treatment effects leak between groups (network effects)
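The effect of peeking can be seen in a small simulation. The sketch below runs A/A tests (both groups share the same true satisfaction rate, so every rejection is a false positive) and applies an unadjusted two-proportion z-test at each interim look; all parameter values are illustrative.

```python
import math
import random
from statistics import NormalDist

def false_positive_rate(n_sims=500, n_per_group=1000, looks=10,
                        p=0.7, alpha=0.05, seed=0):
    """Share of A/A experiments declared significant at any of `looks` checks."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    step = n_per_group // looks
    rejections = 0
    for _ in range(n_sims):
        succ_a = succ_b = 0
        for look in range(1, looks + 1):
            succ_a += sum(rng.random() < p for _ in range(step))
            succ_b += sum(rng.random() < p for _ in range(step))
            n = look * step
            pooled = (succ_a + succ_b) / (2 * n)
            se = math.sqrt(pooled * (1 - pooled) * 2 / n)
            if se > 0 and abs(succ_a - succ_b) / n / se > z_crit:
                rejections += 1  # stop at the first "significant" look
                break
    return rejections / n_sims

print("single look:", false_positive_rate(looks=1))
print("ten looks:  ", false_positive_rate(looks=10))
```

With a single look the rate should sit near the nominal 5%, while ten unadjusted looks push it well above that. This is the inflation that sequential methods are designed to control.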
Use a pre-registration document that specifies your hypothesis, primary metric, sample size calculation, and analysis plan before starting the experiment. This prevents p-hacking and post-hoc rationalization.
Practice Exercises
- Conceptual: Explain the difference between user-level and request-level randomization for LLM experiments. What are the implications for statistical validity?
- Mathematical: Calculate the minimum sample size needed to detect a 2 percentage point improvement in task completion rate (from 85% to 87%) with 90% power at a 5% significance level.
- Practical: Design an A/B test comparing a model with and without RAG augmentation. What metrics would you track and why?
- Research: Compare the sample efficiency of A/B testing versus Thompson Sampling for LLM model selection. Under what conditions is each approach preferred?
Key Takeaways:
- LLM A/B testing requires careful statistical design due to high output variance
- Always pre-define primary metrics, effect sizes, and sample sizes
- Sequential testing enables early stopping without inflating error rates
- Canary deployments reduce risk by gradually shifting traffic
- Multi-armed bandits send less traffic to inferior variants than fixed-ratio experiments, at the cost of harder effect estimation
What to Learn Next
-> LLM Versioning and Rollouts: Model versioning, artifact management, and gradual rollout strategies.
-> LLM Monitoring and Observability: Logging, tracing, metrics, and drift detection for production systems.
-> LLM Evaluation in Production: Online evaluation, user feedback loops, and quality assurance.
-> Cost Optimization for LLMs: Token economics, caching strategies, and batching for cost efficiency.
-> Multi-Tenant LLM Systems: Tenant isolation, resource sharing, and customization at scale.
-> LLM Disaster Recovery: Failover, backup models, and graceful degradation strategies.