LLM Versioning and Rollouts — Safe Model Evolution

Managing model versions in production requires careful coordination of artifact storage, rollback capabilities, and gradual traffic shifting to minimize risk.

Version Management — Model artifacts, metadata, and reproducibility
Rollout Strategies — Canary, blue-green, and shadow deployments
Rollback — Automated triggers and recovery procedures

Ship fast, but ship safely.

Deploying LLMs in production requires a systematic approach to versioning, testing, and rolling out changes. Unlike traditional software, model changes can have non-obvious behavioral impacts that only surface under real-world conditions.

Definition: LLM Versioning

LLM versioning is the practice of uniquely identifying, storing, and managing all artifacts required to reproduce a specific model's behavior: model weights, tokenizer, system prompt, configuration, and dependencies. Each version must be immutable and auditable.

Version Management

Model Artifacts

A complete LLM version includes:

Definition: Model Artifact Bundle

A model artifact bundle is a versioned collection of: (1) model weights/checkpoint, (2) tokenizer configuration, (3) system prompt template, (4) generation parameters (temperature, top_p, etc.), (5) safety configuration, and (6) dependency versions. All components must be stored together for reproducibility.

Versioning Schema:

model-name/v{MAJOR}.{MINOR}.{PATCH}-{commit_hash}
├── model/
│   ├── weights/           # Model checkpoint files
│   ├── tokenizer/         # Tokenizer config + vocab
│   └── config.json        # Model configuration
├── prompt/
│   ├── system.txt         # System prompt template
│   └── few_shot/          # Few-shot examples
├── config/
│   ├── generation.json    # Temperature, top_p, etc.
│   ├── safety.json        # Guardrails configuration
│   └── routing.json       # Request routing rules
├── metadata.json          # Version metadata
└── tests/                 # Evaluation results
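
One way to make a bundle laid out like this tamper-evident is to derive a single digest from every file it contains. The following is a minimal sketch, not a prescribed implementation: the function name `bundle_digest` is hypothetical, and the choice of SHA-256 over relative paths plus file bytes is an assumption.

```python
import hashlib
from pathlib import Path


def bundle_digest(bundle_dir: Path) -> str:
    """Return a SHA-256 hex digest covering every file path and its bytes."""
    digest = hashlib.sha256()
    files = [p for p in bundle_dir.rglob("*") if p.is_file()]
    # Sort by relative POSIX path so the result does not depend on listing order.
    files.sort(key=lambda p: p.relative_to(bundle_dir).as_posix())
    for path in files:
        digest.update(path.relative_to(bundle_dir).as_posix().encode("utf-8"))
        digest.update(b"\0")
        # Read in 1 MiB chunks so large weight files are never loaded whole.
        with path.open("rb") as fh:
            for chunk in iter(lambda: fh.read(1 << 20), b""):
                digest.update(chunk)
        digest.update(b"\0")
    return digest.hexdigest()
```

Because the prompt and configuration files feed the same digest as the weights, editing `prompt/system.txt` alone produces a different digest, which is the property the bundle definition asks for.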

Semantic Versioning for Models

Model Version Format

V = v{MAJOR}.{MINOR}.{PATCH}-{commit_hash}

Here,

$MAJOR$ = Breaking changes (new base model, architecture)
$MINOR$ = New capabilities (fine-tune, prompt update)
$PATCH$ = Bug fixes, parameter tuning
$commit_hash$ = Git hash for reproducibility

Model versioning differs from software versioning because model behavior emerges from the combination of weights, prompts, and configuration. A system prompt edit that looks like a "patch" can change behavior as much as a "minor" model update, so the version level should reflect the measured behavioral impact rather than the size of the diff.
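
The version format can be checked and compared mechanically. Below is a small sketch; `ModelVersion`, `parse_version`, and `change_scope` are hypothetical names, and the assumption that the hash is 7 to 40 lowercase hex characters (an abbreviated or full Git SHA-1) is not stated in the format itself.

```python
import re
from typing import NamedTuple

# Assumption: the hash is 7-40 lowercase hex characters.
VERSION_RE = re.compile(r"^v(\d+)\.(\d+)\.(\d+)-([0-9a-f]{7,40})$")


class ModelVersion(NamedTuple):
    major: int
    minor: int
    patch: int
    commit_hash: str


def parse_version(tag: str) -> ModelVersion:
    """Parse 'v{MAJOR}.{MINOR}.{PATCH}-{commit_hash}' or raise ValueError."""
    match = VERSION_RE.match(tag)
    if match is None:
        raise ValueError(f"not a valid model version: {tag!r}")
    major, minor, patch, commit_hash = match.groups()
    return ModelVersion(int(major), int(minor), int(patch), commit_hash)


def change_scope(old: ModelVersion, new: ModelVersion) -> str:
    """Name the most significant version component that differs."""
    if new.major != old.major:
        return "major"
    if new.minor != old.minor:
        return "minor"
    if new.patch != old.patch:
        return "patch"
    return "none"
```

For example, `change_scope(parse_version("v1.2.4-abcdef0"), parse_version("v1.3.0-a1b2c3d"))` returns `"minor"`, which a rollout pipeline could use to pick how cautious the ramp should be.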

Rollout Strategies

Canary Deployment

Definition: Canary Deployment

A canary deployment routes a small percentage of production traffic (typically 1-5%) to the new model version while monitoring key metrics. Traffic is gradually increased if metrics remain stable, or rolled back if degradation is detected.

Traffic Ramp Schedule:

Stage	Traffic	Duration	Gate Criteria
Stage 0	0%	-	All tests pass
Stage 1	1%	1 hour	Latency p99 < threshold
Stage 2	5%	4 hours	Quality metrics stable
Stage 3	25%	24 hours	No safety regressions
Stage 4	50%	48 hours	User satisfaction maintained
Stage 5	100%	-	Full rollout
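
The ramp schedule needs a way to decide which requests go to the canary at each stage. A common approach, sketched here under assumptions, is to hash a stable identifier into a bucket so the same user stays on the same version as the percentage grows. The names `bucket`, `serves_canary`, and `next_stage` are hypothetical, and routing by user ID is an assumption; the stage percentages come from the table above.

```python
import hashlib

# Canary traffic percent per stage, copied from the ramp table above.
RAMP_PERCENT = [0, 1, 5, 25, 50, 100]


def bucket(user_id: str) -> int:
    """Map a user ID to a stable bucket in 0..9999 (basis points)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 10_000


def serves_canary(user_id: str, canary_percent: int) -> bool:
    """True if this user falls inside the canary slice."""
    return bucket(user_id) < canary_percent * 100


def next_stage(stage: int, gate_passed: bool) -> int:
    """Advance one stage when the gate passes; otherwise drop to stage 0."""
    if not gate_passed:
        return 0
    return min(stage + 1, len(RAMP_PERCENT) - 1)
```

Because the bucket is fixed per user and the threshold only grows, every user who saw the canary at 5% still sees it at 25%, which avoids users flipping between versions mid-rollout.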

Blue-Green Deployment

Definition: Blue-Green Deployment

Blue-green deployment maintains two identical production environments. The "blue" environment serves production traffic while the "green" environment hosts the new version. After validation, all traffic is switched to green at once; because blue keeps running on standby, rollback is the same switch in reverse.

Architecture Diagram

Normal State:
┌──────────────┐     ┌──────────────┐
│  Blue (v1.2) │────▶│   Users      │
│  Active      │     │              │
└──────────────┘     └──────────────┘
┌──────────────┐
│ Green (v1.3) │
│ Standby      │
└──────────────┘

After Switch:
┌──────────────┐
│  Blue (v1.2) │
│  Standby     │
└──────────────┘
┌──────────────┐     ┌──────────────┐
│ Green (v1.3) │────▶│   Users      │
│  Active      │     │              │
└──────────────┘     └──────────────┘
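
The switch shown in the diagram amounts to swapping a single pointer. The sketch below is illustrative only: `BlueGreenRouter` is a hypothetical class, the health check is passed in as a plain callable, and a real system would flip a load balancer or service mesh rule rather than an in-process variable.

```python
import threading
from typing import Callable


class BlueGreenRouter:
    """Tracks which environment is active and swaps it atomically."""

    def __init__(self, active: str, standby: str) -> None:
        self._lock = threading.Lock()
        self._active = active
        self._standby = standby

    @property
    def active(self) -> str:
        with self._lock:
            return self._active

    def switch(self, is_healthy: Callable[[str], bool]) -> str:
        """Promote the standby if it is healthy. Calling again rolls back."""
        with self._lock:
            if not is_healthy(self._standby):
                raise RuntimeError(f"standby {self._standby!r} failed health check")
            self._active, self._standby = self._standby, self._active
            return self._active
```

Starting from `BlueGreenRouter("v1.2", "v1.3")`, one `switch` call makes `v1.3` active and a second call restores `v1.2`, mirroring the "Normal State" and "After Switch" panels.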

Shadow Deployment

Definition: Shadow Deployment

A shadow deployment runs the new model version in parallel with production, processing identical inputs but not serving responses to users. This enables offline comparison of outputs without user impact.
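
In code, the defining property is that the shadow model sees the same input but its output never reaches the user. The following is a simplified sketch: `serve_with_shadow` is a hypothetical function, models are represented as plain callables, and the shadow call runs synchronously for clarity, whereas a production system would likely run it asynchronously so it adds no user-facing latency.

```python
from typing import Callable, List, Tuple

Model = Callable[[str], str]
ShadowLog = List[Tuple[str, str, str]]  # (prompt, primary output, shadow output)


def serve_with_shadow(prompt: str, primary: Model, shadow: Model, log: ShadowLog) -> str:
    """Return the primary response; record the shadow response for comparison."""
    response = primary(prompt)
    try:
        shadow_response = shadow(prompt)
    except Exception as exc:
        # A shadow failure is logged but must never affect the user.
        shadow_response = f"<shadow error: {exc}>"
    log.append((prompt, response, shadow_response))
    return response
```

The collected log is what makes the offline comparison possible: each entry pairs the two outputs for an identical input.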

Automated Rollback

Rollback Triggers

Definition: Rollback Trigger

A rollback trigger is an automated condition that reverts a deployment to the previous version when a monitored metric breaches a predefined threshold. Common triggers include latency spikes, error rate increases, quality score drops, and safety violations.

Rollback Decision Function

R = \begin{cases} 1 & \text{if } \exists m : m > \theta_m \lor m < \theta_{min} \\ 0 & \text{otherwise} \end{cases}

Here,

$R$ = Rollback decision (1 = rollback, 0 = continue)
$m$ = Monitored metric value
$\theta_m$ = Maximum acceptable threshold
$\theta_{min}$ = Minimum acceptable threshold
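
The decision function translates directly into a loop over monitored metrics. In this sketch, `should_rollback` and the metric names are hypothetical, and giving each metric its own optional lower and upper bound is an assumption layered on the formula, since latency typically has only a maximum and a quality score only a minimum.

```python
from typing import Dict, Optional, Tuple

# (theta_min, theta_max) per metric; None means that side is unbounded.
Bounds = Tuple[Optional[float], Optional[float]]


def should_rollback(metrics: Dict[str, float], bounds: Dict[str, Bounds]) -> int:
    """Return 1 if any metric is above its maximum or below its minimum, else 0."""
    for name, value in metrics.items():
        theta_min, theta_max = bounds.get(name, (None, None))
        if theta_max is not None and value > theta_max:
            return 1
        if theta_min is not None and value < theta_min:
            return 1
    return 0
```

With bounds `{"latency_p99_s": (None, 2.0), "quality": (0.8, None)}`, a reading of `{"latency_p99_s": 2.5, "quality": 0.9}` returns 1 because latency exceeds its maximum.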

Recovery Procedures

Automated Recovery:

1. Detect threshold breach
2. Pause new traffic to canary
3. Drain in-flight requests (graceful shutdown)
4. Redirect traffic to previous version
5. Notify on-call team
6. Log incident for post-mortem
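
The steps above can be sketched as an ordered list of actions run by a small driver. Everything here is hypothetical: `run_recovery` is an illustrative name, each step is a caller-supplied callable, and the choice to keep going after a failed step (so that traffic is still redirected and the on-call team is still notified) is a design assumption rather than something the procedure specifies.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], None]]


def run_recovery(steps: List[Step]) -> List[Tuple[str, str]]:
    """Run each step in order and record 'ok' or the failure message."""
    results: List[Tuple[str, str]] = []
    for name, action in steps:
        try:
            action()
            results.append((name, "ok"))
        except Exception as exc:
            # Keep going: a failed drain should not stop the redirect or the page.
            results.append((name, f"failed: {exc}"))
    return results
```

The returned list doubles as the raw material for the incident log, since it records what ran and what failed in order.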

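Implement a "dead man's switch" that automatically rolls back if the new version stops responding entirely. A version that is completely down may emit no metrics at all, so threshold-based triggers never fire; the switch catches this case by treating silence itself as a failure.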
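
A dead man's switch can be as small as a timestamp and a timeout. In the sketch below, `DeadMansSwitch`, its parameters, and the polling-style `check` method are all hypothetical; the clock is injectable so the behavior can be tested without sleeping.

```python
import time
from typing import Callable


class DeadMansSwitch:
    """Fires on_expire once if no heartbeat arrives within timeout_s seconds."""

    def __init__(
        self,
        timeout_s: float,
        on_expire: Callable[[], None],
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._timeout_s = timeout_s
        self._on_expire = on_expire
        self._clock = clock
        self._last_beat = clock()
        self._fired = False

    def heartbeat(self) -> None:
        """Called by the new version (or its health probe) while it is alive."""
        self._last_beat = self._clock()

    def check(self) -> bool:
        """Call periodically; triggers the rollback the first time silence is too long."""
        if not self._fired and self._clock() - self._last_beat > self._timeout_s:
            self._fired = True
            self._on_expire()
        return self._fired
```

The watcher that calls `check` should run outside the new version's own process; otherwise the switch dies together with the thing it is guarding.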

Evaluation Gates

Pre-Deployment Evaluation

Before any rollout begins, the new version must pass automated evaluation:

Gate	Metric	Threshold	Action on Failure
Safety	Toxicity score	< 0.05	Block deployment
Quality	Win rate vs. current	> 50%	Block deployment
Performance	Latency p99	< 1.2 × current p99	Warning
Robustness	Adversarial test suite	100% pass	Block deployment
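
The gate table can be encoded as data and evaluated in one pass. This is a sketch under assumptions: `Gate`, `GATES`, `evaluate`, and the result keys such as `toxicity` and `win_rate` are hypothetical names, and the latency gate is read as "new p99 below 1.2 times the current p99".

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Results = Dict[str, float]


@dataclass(frozen=True)
class Gate:
    name: str
    passes: Callable[[Results], bool]
    blocking: bool  # True = block deployment on failure, False = warn only


GATES: List[Gate] = [
    Gate("Safety", lambda r: r["toxicity"] < 0.05, True),
    Gate("Quality", lambda r: r["win_rate"] > 0.50, True),
    Gate("Performance", lambda r: r["latency_p99"] < 1.2 * r["current_latency_p99"], False),
    Gate("Robustness", lambda r: r["adversarial_pass_rate"] == 1.0, True),
]


def evaluate(results: Results) -> Dict[str, object]:
    """Apply every gate and separate blocking failures from warnings."""
    failed = [g for g in GATES if not g.passes(results)]
    blocked_by = [g.name for g in failed if g.blocking]
    warnings = [g.name for g in failed if not g.blocking]
    return {"deploy": not blocked_by, "blocked_by": blocked_by, "warnings": warnings}
```

A candidate that is 30% slower at p99 but passes every other gate still deploys, with `"Performance"` listed under warnings, matching the "Warning" action in the table.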

Post-Deployment Monitoring

Definition: Shadow Period Monitoring

Shadow period monitoring compares the new version's outputs against the current version's on live traffic, during the shadow or canary phase, to measure behavioral differences that offline evaluation may not reveal.

Practice Exercises

Conceptual: Explain why model versioning must include the system prompt and configuration, not just the model weights. What can go wrong with weight-only versioning?
Mathematical: Design a canary deployment schedule for a 70B model rollout, given that latency regression can only be detected with 95% confidence after 10,000 requests.
Practical: Implement a blue-green deployment system for an LLM service that supports instant rollback, traffic splitting, and automated health checks.
Research: Compare the risk profiles of canary deployments versus shadow deployments for LLM systems. Under what conditions is each approach preferable?

Key Takeaways:

LLM versioning must include all artifacts: weights, tokenizer, prompt, and configuration
Semantic versioning provides clear communication of change scope
Canary deployments enable gradual rollout with metric-based gates
Blue-green deployments enable instant rollback but require double infrastructure
Automated rollback triggers prevent extended exposure to degraded models

What to Learn Next

-> AB Testing for LLMs: Experiment design, statistical significance, and canary deployments.

-> LLM Fine-Tuning Pipelines: End-to-end fine-tuning infrastructure and data management.

-> LLM Monitoring and Observability: Logging, tracing, metrics, and drift detection for production systems.

-> LLM Evaluation in Production: Online evaluation, user feedback loops, and quality assurance.

-> LLM Disaster Recovery: Failover, backup models, and graceful degradation strategies.

-> LLM Serving Architectures: vLLM, TGI, TensorRT-LLM, and serving patterns for production deployments.
