LLM Versioning and Rollouts: Safe Model Evolution
Managing model versions in production requires careful coordination of artifact storage, rollback capabilities, and gradual traffic shifting to minimize risk.
- Version Management: Model artifacts, metadata, and reproducibility
- Rollout Strategies: Canary, blue-green, and shadow deployments
- Rollback: Automated triggers and recovery procedures
Ship fast, but ship safely.
Deploying LLMs in production requires a systematic approach to versioning, testing, and rolling out changes. Unlike traditional software, model changes can have non-obvious behavioral impacts that only surface under real-world conditions.
Definition: LLM Versioning
LLM versioning is the practice of uniquely identifying, storing, and managing all artifacts required to reproduce a specific model's behavior: model weights, tokenizer, system prompt, configuration, and dependencies. Each version must be immutable and auditable.
Version Management
Model Artifacts
A complete LLM version includes:
Definition: Model Artifact Bundle
A model artifact bundle is a versioned collection of: (1) model weights/checkpoint, (2) tokenizer configuration, (3) system prompt template, (4) generation parameters (temperature, top_p, etc.), (5) safety configuration, and (6) dependency versions. All components must be stored together for reproducibility.
Versioning Schema:
model-name/v{MAJOR}.{MINOR}.{PATCH}-{hash}
├── model/
│   ├── weights/          # Model checkpoint files
│   ├── tokenizer/        # Tokenizer config + vocab
│   └── config.json       # Model configuration
├── prompt/
│   ├── system.txt        # System prompt template
│   └── few_shot/         # Few-shot examples
├── config/
│   ├── generation.json   # Temperature, top_p, etc.
│   ├── safety.json       # Guardrails configuration
│   └── routing.json      # Request routing rules
├── metadata.json         # Version metadata
└── tests/                # Evaluation results
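One way to make a bundle immutable and auditable is to derive an identifier from its contents. The sketch below is illustrative only: the function names `bundle_digest` and `make_demo_bundle` are hypothetical, the demo bundle contains just two small files rather than real weights, and it assumes a content hash is an acceptable stand-in for (or companion to) the `{hash}` suffix in the schema.

```python
import hashlib
import tempfile
from pathlib import Path


def bundle_digest(bundle_dir: Path) -> str:
    """Return a SHA-256 digest over every file path and file body in the bundle."""
    digest = hashlib.sha256()
    # Sort by relative POSIX path so the digest does not depend on
    # filesystem iteration order or on the host platform's path separator.
    files = sorted(
        (p.relative_to(bundle_dir).as_posix(), p)
        for p in bundle_dir.rglob("*")
        if p.is_file()
    )
    for rel_path, path in files:
        digest.update(rel_path.encode("utf-8"))
        digest.update(b"\0")
        # read_bytes() loads the whole file; real weight files would
        # likely need chunked reads instead.
        digest.update(path.read_bytes())
        digest.update(b"\0")
    return digest.hexdigest()


def make_demo_bundle(root: Path, system_prompt: str) -> Path:
    """Create a tiny stand-in bundle (hypothetical contents, no real weights)."""
    (root / "prompt").mkdir(parents=True, exist_ok=True)
    (root / "config").mkdir(parents=True, exist_ok=True)
    (root / "prompt" / "system.txt").write_text(system_prompt, encoding="utf-8")
    (root / "config" / "generation.json").write_text(
        '{"temperature": 0.7}', encoding="utf-8"
    )
    return root


with tempfile.TemporaryDirectory() as tmp:
    a = bundle_digest(make_demo_bundle(Path(tmp) / "a", "You are a helpful assistant."))
    b = bundle_digest(make_demo_bundle(Path(tmp) / "b", "You are a helpful assistant."))
    c = bundle_digest(make_demo_bundle(Path(tmp) / "c", "You are a terse assistant."))
    print(a == b)  # identical contents give the same digest
    print(a != c)  # a prompt-only change gives a different digest
```

Because the prompt and configuration files feed the digest alongside everything else, a prompt-only edit produces a new identifier even when the weights are untouched.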
Semantic Versioning for Models
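Model Version Format
$$\text{version} = \mathrm{v}\,\mathrm{MAJOR}.\mathrm{MINOR}.\mathrm{PATCH}\text{-}\mathrm{hash}$$
Here,
- $\mathrm{MAJOR}$ = Breaking changes (new base model, architecture)
- $\mathrm{MINOR}$ = New capabilities (fine-tune, prompt update)
- $\mathrm{PATCH}$ = Bug fixes, parameter tuning
- $\mathrm{hash}$ = Git hash for reproducibility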
Model versioning differs from software versioning because model behavior is emergent from the combination of weights, prompts, and configuration. Even a change that looks like a "patch", such as a small wording fix to the system prompt, can have as much behavioral impact as a "minor" model update.
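The version format can be parsed and bumped mechanically. The following is a minimal sketch, assuming the hash is a 7 to 40 character lowercase hexadecimal Git hash; the `ModelVersion` class and its methods are hypothetical names introduced for illustration.

```python
import re
from dataclasses import dataclass

# Assumption: the hash part is 7-40 lowercase hex characters (short or full Git hash).
VERSION_RE = re.compile(r"^v(\d+)\.(\d+)\.(\d+)-([0-9a-f]{7,40})$")


@dataclass(frozen=True)
class ModelVersion:
    major: int
    minor: int
    patch: int
    commit: str

    @classmethod
    def parse(cls, text: str) -> "ModelVersion":
        match = VERSION_RE.match(text)
        if match is None:
            raise ValueError(f"not a valid model version: {text!r}")
        major, minor, patch, commit = match.groups()
        return cls(int(major), int(minor), int(patch), commit)

    def bump(self, level: str, commit: str) -> "ModelVersion":
        """Return the next version; lower-order fields reset to zero."""
        if level == "major":
            return ModelVersion(self.major + 1, 0, 0, commit)
        if level == "minor":
            return ModelVersion(self.major, self.minor + 1, 0, commit)
        if level == "patch":
            return ModelVersion(self.major, self.minor, self.patch + 1, commit)
        raise ValueError(f"unknown bump level: {level!r}")

    def __str__(self) -> str:
        return f"v{self.major}.{self.minor}.{self.patch}-{self.commit}"


current = ModelVersion.parse("v1.2.3-a1b2c3d")
print(current.bump("minor", "9f8e7d6"))  # v1.3.0-9f8e7d6
```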
Rollout Strategies
Canary Deployment
Definition: Canary Deployment
A canary deployment routes a small percentage of production traffic (typically 1-5%) to the new model version while monitoring key metrics. Traffic is gradually increased if metrics remain stable, or rolled back if degradation is detected.
Traffic Ramp Schedule:
| Stage | Traffic | Duration | Gate Criteria |
|---|---|---|---|
| Stage 0 | 0% | - | All tests pass |
| Stage 1 | 1% | 1 hour | Latency p99 < threshold |
| Stage 2 | 5% | 4 hours | Quality metrics stable |
| Stage 3 | 25% | 24 hours | No safety regressions |
| Stage 4 | 50% | 48 hours | User satisfaction maintained |
| Stage 5 | 100% | - | Full rollout |
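A common way to implement a ramp like this is to hash a stable request attribute into buckets and compare the bucket to the current stage's percentage. The sketch below assumes routing is keyed on a user ID and uses 10,000 buckets; `bucket_for` and `routes_to_canary` are hypothetical helpers, not part of any particular serving stack.

```python
import hashlib

# Traffic percentages from the ramp schedule above, indexed by stage number.
STAGE_TRAFFIC_PERCENT = [0, 1, 5, 25, 50, 100]

BUCKETS = 10_000  # 100 buckets per percentage point


def bucket_for(user_id: str) -> int:
    """Map a user ID to a stable bucket in [0, BUCKETS)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % BUCKETS


def routes_to_canary(user_id: str, stage: int) -> bool:
    """Return True if this user should be served by the new version at this stage."""
    percent = STAGE_TRAFFIC_PERCENT[stage]
    return bucket_for(user_id) < percent * (BUCKETS // 100)


users = [f"user-{i}" for i in range(10_000)]
for stage, percent in enumerate(STAGE_TRAFFIC_PERCENT):
    share = sum(routes_to_canary(u, stage) for u in users) / len(users)
    print(f"stage {stage}: target {percent}%, observed {share:.1%}")
```

Hashing rather than random sampling keeps each user on the same version within a stage, and because the threshold only grows, a user who has seen the new version stays on it as traffic ramps up.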
Blue-Green Deployment
Definition: Blue-Green Deployment
Blue-green deployment maintains two identical production environments. The "blue" environment serves production traffic while the "green" environment hosts the new version. Traffic is switched instantly after validation, enabling instant rollback.
Normal State:
┌──────────────┐     ┌──────────────┐
│ Blue (v1.2)  │────▶│    Users     │
│    Active    │     │              │
└──────────────┘     └──────────────┘
┌──────────────┐
│ Green (v1.3) │
│   Standby    │
└──────────────┘
After Switch:
┌──────────────┐
│ Blue (v1.2)  │
│   Standby    │
└──────────────┘
┌──────────────┐     ┌──────────────┐
│ Green (v1.3) │────▶│    Users     │
│    Active    │     │              │
└──────────────┘     └──────────────┘
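The switch itself can be modeled as swapping a single pointer to the active environment. This is a minimal in-process sketch, assuming each environment is represented by a callable; `BlueGreenRouter` and the health-check callback are hypothetical, and a real deployment would typically flip a load balancer or service-mesh route instead.

```python
import threading
from typing import Callable, Dict, Optional

Handler = Callable[[str], str]


class BlueGreenRouter:
    """Route all traffic to one active environment; switch or roll back atomically."""

    def __init__(self, environments: Dict[str, Handler], active: str) -> None:
        if active not in environments:
            raise KeyError(active)
        self._environments = dict(environments)
        self._active = active
        self._previous: Optional[str] = None
        self._lock = threading.Lock()

    @property
    def active(self) -> str:
        with self._lock:
            return self._active

    def handle(self, prompt: str) -> str:
        # Look up the handler under the lock, but run it outside the lock
        # so slow generations do not block a switch.
        with self._lock:
            handler = self._environments[self._active]
        return handler(prompt)

    def switch_to(self, name: str, health_check: Callable[[Handler], bool]) -> bool:
        """Validate the standby environment, then make it active."""
        candidate = self._environments[name]
        if not health_check(candidate):
            return False
        with self._lock:
            self._previous, self._active = self._active, name
        return True

    def rollback(self) -> None:
        """Return to the environment that was active before the last switch."""
        with self._lock:
            if self._previous is None:
                raise RuntimeError("no previous environment to roll back to")
            self._active, self._previous = self._previous, self._active


router = BlueGreenRouter(
    {"blue": lambda p: f"v1.2: {p}", "green": lambda p: f"v1.3: {p}"},
    active="blue",
)
router.switch_to("green", health_check=lambda handler: handler("ping") != "")
print(router.handle("hello"))  # v1.3: hello
router.rollback()
print(router.handle("hello"))  # v1.2: hello
```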
Shadow Deployment
Definition: Shadow Deployment
A shadow deployment runs the new model version in parallel with production, processing identical inputs but not serving responses to users. This enables offline comparison of outputs without user impact.
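The core of a shadow deployment is that the user only ever receives the production response, while the shadow output is recorded for later comparison. The sketch below makes several simplifying assumptions: models are plain callables, the shadow call runs synchronously (in practice it would usually run off the request path), and exact string equality stands in for a real comparison metric. `serve_with_shadow` and `ShadowRecord` are hypothetical names.

```python
from typing import Callable, List, NamedTuple, Optional

Model = Callable[[str], str]


class ShadowRecord(NamedTuple):
    prompt: str
    primary_output: str
    shadow_output: Optional[str]  # None if the shadow model raised
    identical: bool


def serve_with_shadow(
    prompt: str, primary: Model, shadow: Model, log: List[ShadowRecord]
) -> str:
    """Serve the primary response; record the shadow response without exposing it."""
    response = primary(prompt)
    try:
        shadow_output: Optional[str] = shadow(prompt)
    except Exception:
        # A failing shadow model must never affect the user-facing response.
        shadow_output = None
    log.append(ShadowRecord(prompt, response, shadow_output, shadow_output == response))
    return response


log: List[ShadowRecord] = []
print(serve_with_shadow("hello world", str.upper, str.title, log))  # HELLO WORLD
print(log[0].shadow_output, log[0].identical)  # Hello World False
```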
Automated Rollback
Rollback Triggers
Definition: Rollback Trigger
A rollback trigger is an automated condition that reverts a deployment to the previous version when detected metrics breach predefined thresholds. Common triggers include latency spikes, error rate increases, quality score drops, and safety violations.
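Rollback Decision Function
$$R(m) = \begin{cases} 1 & \text{if } m > \tau_{\max} \text{ or } m < \tau_{\min} \\ 0 & \text{otherwise} \end{cases}$$
Here,
- $R(m)$ = Rollback decision (1 = rollback, 0 = continue)
- $m$ = Monitored metric value
- $\tau_{\max}$ = Maximum acceptable threshold
- $\tau_{\min}$ = Minimum acceptable threshold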
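In code, the rollback decision reduces to a range check on each monitored metric: return 1 when the value leaves its acceptable band and 0 otherwise. The sketch below assumes either bound may be omitted for one-sided metrics, and that a breach on any single metric is enough to roll back; the function names, metric names, and threshold values are illustrative only.

```python
from typing import Dict, Optional, Tuple


def rollback_decision(
    metric: float,
    max_threshold: Optional[float] = None,
    min_threshold: Optional[float] = None,
) -> int:
    """Return 1 (rollback) if the metric is outside its bounds, else 0 (continue)."""
    if max_threshold is not None and metric > max_threshold:
        return 1
    if min_threshold is not None and metric < min_threshold:
        return 1
    return 0


def should_rollback(
    metrics: Dict[str, float],
    bounds: Dict[str, Tuple[Optional[float], Optional[float]]],
) -> bool:
    """Roll back if any metric breaches its (min, max) bounds."""
    return any(
        rollback_decision(metrics[name], max_threshold=hi, min_threshold=lo) == 1
        for name, (lo, hi) in bounds.items()
    )


# Hypothetical thresholds: latency and error rate have upper bounds,
# quality score has a lower bound.
bounds = {
    "latency_p99_ms": (None, 2000.0),
    "error_rate": (None, 0.02),
    "quality_score": (0.80, None),
}
print(should_rollback({"latency_p99_ms": 1500.0, "error_rate": 0.01, "quality_score": 0.85}, bounds))  # False
print(should_rollback({"latency_p99_ms": 1500.0, "error_rate": 0.01, "quality_score": 0.70}, bounds))  # True
```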
Recovery Procedures
Automated Recovery:
- Detect threshold breach
- Pause new traffic to canary
- Drain in-flight requests (graceful shutdown)
- Redirect traffic to previous version
- Notify on-call team
- Log incident for post-mortem
Implement a "dead man's switch" that automatically rolls back if the new version stops responding entirely. This catches failures that simple metric monitoring might miss.
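A dead man's switch inverts the usual monitoring logic: instead of waiting for a bad metric, it acts when an expected heartbeat fails to arrive. The following is a minimal sketch, assuming the new version (or its health probe) calls `heartbeat()` periodically and a supervisor calls `check()` on a timer; the class name, the 30-second timeout, and the injectable clock are all illustrative choices.

```python
import time
from typing import Callable


class DeadMansSwitch:
    """Trigger a rollback callback once if heartbeats stop arriving."""

    def __init__(
        self,
        timeout_s: float,
        on_expire: Callable[[], None],
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._timeout_s = timeout_s
        self._on_expire = on_expire
        self._clock = clock  # injectable so the switch can be tested without sleeping
        self._last_beat = clock()
        self._fired = False

    def heartbeat(self) -> None:
        """Record that the new version is still responding."""
        self._last_beat = self._clock()

    def check(self) -> bool:
        """Fire on_expire at most once if the last heartbeat is too old."""
        if not self._fired and self._clock() - self._last_beat > self._timeout_s:
            self._fired = True
            self._on_expire()
        return self._fired


# Simulated clock for demonstration.
now = [0.0]
switch = DeadMansSwitch(30.0, on_expire=lambda: print("rolling back"), clock=lambda: now[0])
now[0] = 20.0
switch.heartbeat()
now[0] = 45.0
print(switch.check())  # False: last heartbeat was 25 seconds ago
now[0] = 51.0
print(switch.check())  # prints "rolling back", then True
```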
Evaluation Gates
Pre-Deployment Evaluation
Before any rollout begins, the new version must pass automated evaluation:
| Gate | Metric | Threshold | Action on Failure |
|---|---|---|---|
| Safety | Toxicity score | < 0.05 | Block deployment |
| Quality | Win rate vs. current | > 50% | Block deployment |
| Performance | Latency p99 | < current + 20% | Warning |
| Robustness | Adversarial test suite | 100% pass | Block deployment |
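The gate table maps directly onto a small data structure: each gate is a predicate over evaluation metrics plus a flag saying whether failure blocks deployment or only warns. The sketch below is one possible encoding; the metric key names are hypothetical, and "< current + 20%" is interpreted as less than 1.2 times the current version's p99 latency.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Metrics = Dict[str, float]


@dataclass(frozen=True)
class Gate:
    name: str
    passes: Callable[[Metrics], bool]
    blocking: bool  # True: block deployment on failure; False: warn only


# Thresholds taken from the table above; metric key names are assumptions.
GATES: List[Gate] = [
    Gate("Safety", lambda m: m["toxicity_score"] < 0.05, blocking=True),
    Gate("Quality", lambda m: m["win_rate_vs_current"] > 0.50, blocking=True),
    Gate(
        "Performance",
        lambda m: m["latency_p99_ms"] < 1.20 * m["current_latency_p99_ms"],
        blocking=False,
    ),
    Gate("Robustness", lambda m: m["adversarial_pass_rate"] == 1.0, blocking=True),
]


def evaluate_gates(metrics: Metrics) -> Tuple[bool, List[str], List[str]]:
    """Return (deploy_allowed, blocking_failures, warnings)."""
    blockers: List[str] = []
    warnings: List[str] = []
    for gate in GATES:
        if not gate.passes(metrics):
            (blockers if gate.blocking else warnings).append(gate.name)
    return len(blockers) == 0, blockers, warnings


allowed, blockers, warnings = evaluate_gates({
    "toxicity_score": 0.01,
    "win_rate_vs_current": 0.56,
    "latency_p99_ms": 1300.0,
    "current_latency_p99_ms": 1000.0,
    "adversarial_pass_rate": 1.0,
})
print(allowed, blockers, warnings)  # True [] ['Performance']
```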
Post-Deployment Monitoring
Definition: Shadow Period Monitoring
Shadow period monitoring compares new version outputs against the current version during the canary phase, measuring behavioral differences that may not appear in offline evaluation.
Practice Exercises
- Conceptual: Explain why model versioning must include the system prompt and configuration, not just the model weights. What can go wrong with weight-only versioning?
- Mathematical: Design a canary deployment schedule for a 70B model rollout, given that latency regression can only be detected with 95% confidence after 10,000 requests.
- Practical: Implement a blue-green deployment system for an LLM service that supports instant rollback, traffic splitting, and automated health checks.
- Research: Compare the risk profiles of canary deployments versus shadow deployments for LLM systems. Under what conditions is each approach preferable?
Key Takeaways:
- LLM versioning must include all artifacts: weights, tokenizer, prompt, and configuration
- Semantic versioning provides clear communication of change scope
- Canary deployments enable gradual rollout with metric-based gates
- Blue-green deployments enable instant rollback but require double infrastructure
- Automated rollback triggers prevent extended exposure to degraded models
What to Learn Next
-> AB Testing for LLMs: Experiment design, statistical significance, and canary deployments.
-> LLM Fine-Tuning Pipelines: End-to-end fine-tuning infrastructure and data management.
-> LLM Monitoring and Observability: Logging, tracing, metrics, and drift detection for production systems.
-> LLM Evaluation in Production: Online evaluation, user feedback loops, and quality assurance.
-> LLM Disaster Recovery: Failover, backup models, and graceful degradation strategies.
-> LLM Serving Architectures: vLLM, TGI, TensorRT-LLM, and serving patterns for production deployments.