
LLM Versioning and Rollouts


LLM Versioning and Rollouts: Safe Model Evolution

Managing model versions in production requires careful coordination of artifact storage, rollback capabilities, and gradual traffic shifting to minimize risk.

  • Version Management: Model artifacts, metadata, and reproducibility
  • Rollout Strategies: Canary, blue-green, and shadow deployments
  • Rollback: Automated triggers and recovery procedures

Ship fast, but ship safely.


Deploying LLMs in production requires a systematic approach to versioning, testing, and rolling out changes. Unlike traditional software, model changes can have non-obvious behavioral impacts that only surface under real-world conditions.

Definition: LLM Versioning

LLM versioning is the practice of uniquely identifying, storing, and managing all artifacts required to reproduce a specific model's behavior: model weights, tokenizer, system prompt, configuration, and dependencies. Each version must be immutable and auditable.

Version Management

Model Artifacts

A complete LLM version includes:

Definition: Model Artifact Bundle

A model artifact bundle is a versioned collection of: (1) model weights/checkpoint, (2) tokenizer configuration, (3) system prompt template, (4) generation parameters (temperature, top_p, etc.), (5) safety configuration, and (6) dependency versions. All components must be stored together for reproducibility.

Versioning Schema:

Architecture Diagram
model-name/v{MAJOR}.{MINOR}.{PATCH}-{hash}
├── model/
│   ├── weights/           # Model checkpoint files
│   ├── tokenizer/         # Tokenizer config + vocab
│   └── config.json        # Model configuration
├── prompt/
│   ├── system.txt         # System prompt template
│   └── few_shot/          # Few-shot examples
├── config/
│   ├── generation.json    # Temperature, top_p, etc.
│   ├── safety.json        # Guardrails configuration
│   └── routing.json       # Request routing rules
├── metadata.json          # Version metadata
└── tests/                 # Evaluation results
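
One way to make a bundle like this immutable and auditable is to derive a content digest from every file in it. The sketch below is an illustration only: the function name `bundle_digest` and the choice of SHA-256 are assumptions, not part of the schema above.

```python
import hashlib
from pathlib import Path


def bundle_digest(bundle_dir: str) -> str:
    """Return a SHA-256 hex digest over every file in a bundle directory.

    Files are visited in sorted order, and each relative path is hashed
    together with its contents, so renaming or editing any artifact
    (weights, prompt, config) changes the digest.
    """
    root = Path(bundle_dir)
    digest = hashlib.sha256()
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(b"\0")
        # Read in 1 MiB chunks so large checkpoint files are not loaded whole.
        with path.open("rb") as handle:
            for chunk in iter(lambda: handle.read(1 << 20), b""):
                digest.update(chunk)
        digest.update(b"\0")
    return digest.hexdigest()
```

Storing the digest in metadata.json and re-checking it at load time would detect a bundle that was modified after release.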

Semantic Versioning for Models

Model Version Format

V = v{MAJOR}.{MINOR}.{PATCH}-{commit_hash}

Here,

  • MAJOR = Breaking changes (new base model, architecture)
  • MINOR = New capabilities (fine-tune, prompt update)
  • PATCH = Bug fixes, parameter tuning
  • commit_hash = Git hash for reproducibility

Model versioning differs from software versioning because model behavior is emergent from the combination of weights, prompts, and configuration. A "patch" change to the system prompt can have as much behavioral impact as a "minor" model update.
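
The version format can be parsed and compared mechanically. The following is a minimal sketch: the class name `ModelVersion`, the `change_scope` helper, and the assumption that the commit hash is 7 to 40 lowercase hex characters are all illustrative choices, not something the format itself prescribes.

```python
import re
from dataclasses import dataclass

# Assumption: commit hashes are abbreviated or full Git SHA-1 hex strings.
VERSION_RE = re.compile(
    r"^v(?P<major>\d+)\.(?P<minor>\d+)\.(?P<patch>\d+)-(?P<commit>[0-9a-f]{7,40})$"
)


@dataclass(frozen=True)
class ModelVersion:
    major: int
    minor: int
    patch: int
    commit: str

    @classmethod
    def parse(cls, text: str) -> "ModelVersion":
        """Parse a string such as 'v1.3.0-def5678'."""
        match = VERSION_RE.match(text)
        if match is None:
            raise ValueError(f"not a valid model version: {text!r}")
        return cls(
            int(match["major"]),
            int(match["minor"]),
            int(match["patch"]),
            match["commit"],
        )

    def change_scope(self, previous: "ModelVersion") -> str:
        """Name the highest-order component that differs from `previous`."""
        if self.major != previous.major:
            return "major"
        if self.minor != previous.minor:
            return "minor"
        if self.patch != previous.patch:
            return "patch"
        return "none" if self.commit == previous.commit else "rebuild"
```

A rollout pipeline could use `change_scope` to pick a stricter evaluation suite for major changes, while still treating prompt-only changes with care for the reason given above.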

Rollout Strategies

Canary Deployment

Definition: Canary Deployment

A canary deployment routes a small percentage of production traffic (typically 1-5%) to the new model version while monitoring key metrics. Traffic is gradually increased if metrics remain stable, or rolled back if degradation is detected.

Traffic Ramp Schedule:

Stage   | Traffic | Duration | Gate Criteria
Stage 0 | 0%      | -        | All tests pass
Stage 1 | 1%      | 1 hour   | Latency p99 < threshold
Stage 2 | 5%      | 4 hours  | Quality metrics stable
Stage 3 | 25%     | 24 hours | No safety regressions
Stage 4 | 50%     | 48 hours | User satisfaction maintained
Stage 5 | 100%    | -        | Full rollout
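
A ramp schedule like this needs a routing rule that sends a stable subset of traffic to the new version. The sketch below shows one common approach, hashing a request key into buckets. The function name `routes_to_canary`, the use of a user or session ID as the key, and the 10,000-bucket granularity are assumptions made for illustration.

```python
import hashlib

# Traffic percentages from the ramp schedule, indexed by stage number.
STAGE_TRAFFIC_PERCENT = [0, 1, 5, 25, 50, 100]


def routes_to_canary(request_key: str, stage: int) -> bool:
    """Decide whether a request goes to the new version at a given stage.

    The key (for example a user or session ID) is hashed into one of
    10,000 buckets. The same key always lands in the same bucket, so a
    user who is on the canary at one stage stays on it as traffic grows.
    """
    percent = STAGE_TRAFFIC_PERCENT[stage]
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < percent * 100
```

Hashing on a stable key rather than sampling randomly per request keeps each user's experience consistent within a stage, which also makes per-user quality comparisons easier to interpret.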

Blue-Green Deployment

Definition: Blue-Green Deployment

Blue-green deployment maintains two identical production environments. The "blue" environment serves production traffic while the "green" environment hosts the new version. After validation, all traffic is switched to green at once; because blue stays running, rollback is the same switch in reverse.

Architecture Diagram
Normal State:
┌──────────────┐     ┌──────────────┐
│  Blue (v1.2) │────▶│   Users      │
│  Active      │     │              │
└──────────────┘     └──────────────┘
┌──────────────┐
│ Green (v1.3) │
│ Standby      │
└──────────────┘

After Switch:
┌──────────────┐
│  Blue (v1.2) │
│  Standby     │
└──────────────┘
┌──────────────┐     ┌──────────────┐
│ Green (v1.3) │────▶│   Users      │
│  Active      │     │              │
└──────────────┘     └──────────────┘
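
The switch in the diagram amounts to swapping which environment is marked active. The sketch below models only that bookkeeping; the class name `BlueGreenRouter` and the `standby_healthy` flag are hypothetical, and a real system would apply the change at a load balancer or service mesh.

```python
from dataclasses import dataclass


@dataclass
class BlueGreenRouter:
    """Tracks which of two environments receives production traffic."""

    active: str = "blue"
    standby: str = "green"

    def switch(self, standby_healthy: bool) -> bool:
        """Promote the standby environment if it passed validation."""
        if not standby_healthy:
            return False
        self.active, self.standby = self.standby, self.active
        return True

    def rollback(self) -> None:
        """Swap back; the previous version is still running as standby."""
        self.active, self.standby = self.standby, self.active
```

Because both environments stay provisioned, `rollback` is the same swap as `switch` without a validation step, which is the property the definition above relies on.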

Shadow Deployment

Definition: Shadow Deployment

A shadow deployment runs the new model version in parallel with production, processing identical inputs but not serving responses to users. This enables offline comparison of outputs without user impact.
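
The shadow pattern can be sketched as a wrapper that returns only the production response and records the shadow output for later comparison. The names `serve_with_shadow` and `comparison_log` are hypothetical, and the models are stand-in callables. For brevity the shadow call runs inline here; a real service would run it off the request path so it cannot add latency.

```python
from typing import Callable, Dict, List

Model = Callable[[str], str]


def serve_with_shadow(
    prompt: str,
    production: Model,
    shadow: Model,
    comparison_log: List[Dict[str, object]],
) -> str:
    """Serve the production response and log the shadow response."""
    response = production(prompt)
    try:
        shadow_response = shadow(prompt)
        comparison_log.append({
            "prompt": prompt,
            "production": response,
            "shadow": shadow_response,
            "match": response == shadow_response,
        })
    except Exception as exc:  # a shadow failure must never reach the user
        comparison_log.append({
            "prompt": prompt,
            "production": response,
            "shadow_error": repr(exc),
        })
    return response
```

Exact string equality is a crude comparison for generative output; in practice the logged pairs would more likely feed a quality scorer or human review.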

Automated Rollback

Rollback Triggers

Definition: Rollback Trigger

A rollback trigger is an automated condition that reverts a deployment to the previous version when monitored metrics breach predefined thresholds. Common triggers include latency spikes, error rate increases, quality score drops, and safety violations.

Rollback Decision Function

R = \begin{cases} 1 & \text{if } \exists m : m > \theta_m \lor m < \theta_{min} \\ 0 & \text{otherwise} \end{cases}

Here,

  • R = Rollback decision (1 = rollback, 0 = continue)
  • m = Monitored metric value
  • \theta_m = Maximum acceptable threshold
  • \theta_{min} = Minimum acceptable threshold
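
The decision function translates directly into code. In this sketch each metric has an optional minimum and maximum bound, corresponding to \theta_{min} and \theta_m. The function name `should_rollback` and the metric names in the usage note are illustrative assumptions.

```python
from typing import Mapping, Optional, Tuple

# (theta_min, theta_max) per metric; None means that side is unbounded.
Bounds = Tuple[Optional[float], Optional[float]]


def should_rollback(
    metrics: Mapping[str, float],
    bounds: Mapping[str, Bounds],
) -> int:
    """Return 1 if any metric is outside its bounds, otherwise 0."""
    for name, value in metrics.items():
        low, high = bounds.get(name, (None, None))
        if high is not None and value > high:
            return 1
        if low is not None and value < low:
            return 1
    return 0
```

For example, a latency metric would typically carry only a maximum, such as `(None, 800.0)`, while a quality score would carry only a minimum, such as `(0.7, None)`. Metrics without configured bounds never trigger a rollback in this sketch.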

Recovery Procedures

Automated Recovery:

  1. Detect threshold breach
  2. Pause new traffic to canary
  3. Drain in-flight requests (graceful shutdown)
  4. Redirect traffic to previous version
  5. Notify on-call team
  6. Log incident for post-mortem

Implement a "dead man's switch" that automatically rolls back if the new version stops responding entirely. This catches failures that simple metric monitoring might miss.
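
A dead man's switch can be sketched as a timer that the new version must keep resetting. Everything named here (`DeadMansSwitch`, `heartbeat`, `check`, the injectable `clock`) is a hypothetical design for illustration; the `on_expire` callback stands in for whatever actually performs the rollback.

```python
import time
from typing import Callable


class DeadMansSwitch:
    """Calls `on_expire` once if no heartbeat arrives within the timeout."""

    def __init__(
        self,
        timeout_seconds: float,
        on_expire: Callable[[], None],
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._timeout = timeout_seconds
        self._on_expire = on_expire
        self._clock = clock
        self._last_heartbeat = clock()
        self._fired = False

    def heartbeat(self) -> None:
        """Record that the new version responded successfully."""
        self._last_heartbeat = self._clock()

    def check(self) -> bool:
        """Fire the rollback if the heartbeat is overdue; return fired state."""
        overdue = self._clock() - self._last_heartbeat > self._timeout
        if overdue and not self._fired:
            self._fired = True
            self._on_expire()
        return self._fired
```

The key design point is that silence triggers the rollback: a version that has crashed emits no metrics at all, so threshold checks on metric values alone would never fire. `check` would be called periodically by a process that is independent of the new version.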

Evaluation Gates

Pre-Deployment Evaluation

Before any rollout begins, the new version must pass automated evaluation:

Gate        | Metric                 | Threshold       | Action on Failure
Safety      | Toxicity score         | < 0.05          | Block deployment
Quality     | Win rate vs. current   | > 50%           | Block deployment
Performance | Latency p99            | < current + 20% | Warning
Robustness  | Adversarial test suite | 100% pass       | Block deployment
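
The gate table can be encoded as a single check that separates blocking failures from warnings. This is a sketch under stated assumptions: the function name `evaluate_gates` and the metric keys are hypothetical, rates are expressed as fractions between 0 and 1, and "current + 20%" is read as 1.2 times the current version's p99 latency.

```python
from typing import Dict, List, Tuple


def evaluate_gates(
    candidate: Dict[str, float],
    current_latency_p99: float,
) -> Tuple[bool, List[str], List[str]]:
    """Return (deployable, blocking gate names, warning gate names)."""
    blockers: List[str] = []
    warnings: List[str] = []

    if not candidate["toxicity"] < 0.05:
        blockers.append("safety")
    if not candidate["win_rate"] > 0.50:
        blockers.append("quality")
    if not candidate["adversarial_pass_rate"] == 1.0:
        blockers.append("robustness")
    # Assumption: "+ 20%" is relative to the current version's p99.
    if not candidate["latency_p99"] < current_latency_p99 * 1.20:
        warnings.append("performance")

    return (not blockers, blockers, warnings)
```

A latency regression alone leaves the candidate deployable but surfaces a warning, matching the "Warning" action in the table, while any safety, quality, or robustness failure blocks the rollout.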

Post-Deployment Monitoring

Definition: Shadow Period Monitoring

Shadow period monitoring compares new version outputs against the current version during the canary phase, measuring behavioral differences that may not appear in offline evaluation.

Practice Exercises

  1. Conceptual: Explain why model versioning must include the system prompt and configuration, not just the model weights. What can go wrong with weight-only versioning?

  2. Mathematical: Design a canary deployment schedule for a 70B model rollout, given that latency regression can only be detected with 95% confidence after 10,000 requests.

  3. Practical: Implement a blue-green deployment system for an LLM service that supports instant rollback, traffic splitting, and automated health checks.

  4. Research: Compare the risk profiles of canary deployments versus shadow deployments for LLM systems. Under what conditions is each approach preferable?

Key Takeaways:

  • LLM versioning must include all artifacts: weights, tokenizer, prompt, and configuration
  • Semantic versioning provides clear communication of change scope
  • Canary deployments enable gradual rollout with metric-based gates
  • Blue-green deployments enable instant rollback but require double infrastructure
  • Automated rollback triggers prevent extended exposure to degraded models

