CW

LLM Disaster Recovery

ProductionResilienceFree Lesson

Advertisement

LLM Production

LLM Disaster Recovery β€” Ensuring Business Continuity

LLM systems face unique failure modes: provider outages, model degradation, safety incidents, and capacity constraints. Robust disaster recovery planning ensures service continuity.

  • Failover Strategies β€” Multi-provider, multi-region, and fallback chains
  • Backup Models β€” Alternative models for graceful degradation
  • Recovery Procedures β€” Automated failover and manual recovery playbooks

Hope for the best, plan for the worst.

LLM Disaster Recovery

LLM systems have unique failure modes that traditional disaster recovery plans don't address. Model providers can go down, safety incidents can require immediate model retirement, and capacity constraints can throttle service. This guide covers comprehensive disaster recovery for LLM systems.

DfLLM Disaster Recovery

LLM disaster recovery is the set of strategies, procedures, and infrastructure that ensures continuous LLM service availability despite failures including: provider outages, model degradation, safety incidents, capacity exhaustion, and data loss.

Failure Modes

Provider-Level Failures

DfProvider Failure

A provider failure occurs when an LLM API provider experiences downtime, rate limiting, or degraded performance. This is the most common failure mode for systems relying on external APIs.

Failure TypeImpactRecovery Time
Complete outageAll requests failMinutes to hours
Rate limitingPartial request failureSeconds to minutes
Quality degradationIncorrect outputsUnclear until detected
Cost spikeBudget overrunImmediate (financial)

Model-Level Failures

DfModel Failure

A model failure occurs when the model produces incorrect, harmful, or inconsistent outputs due to: distribution shift, adversarial attacks, or undiscovered bugs in the model or serving infrastructure.

Infrastructure Failures

DfInfrastructure Failure

An infrastructure failure includes GPU failures, memory exhaustion, network partitions, and storage failures that prevent model serving regardless of model health.

Failover Strategies

Multi-Provider Architecture

DfMulti-Provider Failover

Multi-provider failover routes requests across multiple LLM providers (OpenAI, Anthropic, Google, self-hosted) with automatic failover when the primary provider fails. This eliminates single-provider dependency.

Failover Chain Architecture:

Architecture Diagram
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚               Request Router                     β”‚
β”‚    (Health checks, latency-based routing)        β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         β”‚            β”‚            β”‚
    β”Œβ”€β”€β”€β”€β”΄β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”΄β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”΄β”€β”€β”€β”€β”
    β”‚ Primary  β”‚  β”‚Secondaryβ”‚  β”‚ Tertiaryβ”‚
    β”‚ GPT-4    β”‚  β”‚ Claude  β”‚  β”‚ LLaMA  β”‚
    β”‚ (API)    β”‚  β”‚ (API)   β”‚  β”‚(Self)  β”‚
    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         β”‚            β”‚            β”‚
    β”Œβ”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”
    β”‚         Health Monitor              β”‚
    β”‚  (Continuous probing, latency)     β”‚
    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Fallback Models

DfFallback Model

A fallback model is a smaller, cheaper, or self-hosted model that can handle requests when the primary model is unavailable. Fallback models provide degraded but functional service.

Fallback Strategy Matrix:

PrimaryFallbackTriggerQuality Impact
GPT-4GPT-3.5Latency > 5sMedium
Claude OpusClaude SonnetRate limitedLow
Self-hosted 70BSelf-hosted 7BGPU failureHigh
Any APISelf-hostedProvider downMedium-High

Graceful Degradation

DfGraceful Degradation

Graceful degradation reduces service quality rather than failing completely. For LLMs, this can mean: shorter responses, simpler models, cached responses, or rule-based fallbacks for structured tasks.

Degradation Level Score

Dlevel=minleft(fracAavailableArequired,1right)D_{level} = \\min\\left(\\frac{A_{available}}{A_{required}}, 1\\right)

Here,

  • AavailableA_{available}=Available resources/capability
  • ArequiredA_{required}=Required resources/capability

Degradation Tiers:

TierConditionBehavior
FullAll systems operationalFull model capability
DegradedPartial outageSmaller model, shorter responses
MinimalMajor outageCached responses, rule-based fallback
EmergencyComplete failureMaintenance message, retry queue

Recovery Procedures

Automated Recovery

DfAutomated Recovery

Automated recovery uses health checks, circuit breakers, and retry logic to detect failures and switch to fallback systems without human intervention. Recovery actions are predefined and tested.

Circuit Breaker Pattern:

Architecture Diagram
States:
CLOSED ──(failure threshold)──▢ OPEN
  β–²                              β”‚
  β”‚                        (timeout)
  β”‚                              β”‚
  └──(success threshold)── HALF-OPEN

Circuit Breaker Threshold

\\text{Open} = \\begin{cases} \\text{true} & \\text{if } \\frac{\\text{failures}}{\\text{total}} > \\theta_{fail} \\land \\text{total} > N_{min} \\\\ \\text{false} & \\text{otherwise} \\end{cases}

Here,

  • ΞΈfail\theta_{fail}=Failure rate threshold (e.g., 0.5)
  • NminN_{min}=Minimum requests before evaluation

Manual Recovery Playbook

Incident Response Steps:

  1. Detect: Automated monitoring triggers alert
  2. Triage: Assess severity (P0-P3) and impact
  3. Mitigate: Activate failover, reroute traffic
  4. Investigate: Root cause analysis
  5. Recover: Restore primary service
  6. Post-mortem: Document and improve

Maintain runbooks for each failure mode with specific commands, contacts, and escalation paths. Runbooks should be tested quarterly through game day exercises.

Backup and State Management

Conversation Backup

DfConversation Backup

Conversation backup preserves conversation history and context so that users can resume sessions after a failover. This requires persisting conversation state independently of the LLM serving infrastructure.

Configuration Backup

DfConfiguration Backup

Configuration backup stores all system configurations (model parameters, routing rules, safety settings, prompts) in version-controlled, disaster-recoverable storage. Configuration should be restorable independently of model weights.

Business Continuity

Recovery Time Objectives

Recovery Time Objective

RTO=Tdetect+Ttriage+Tmitigate+TrecoverRTO = T_{detect} + T_{triage} + T_{mitigate} + T_{recover}

Here,

  • TdetectT_{detect}=Time to detect failure
  • TtriageT_{triage}=Time to assess severity
  • TmitigateT_{mitigate}=Time to activate failover
  • TrecoverT_{recover}=Time to restore primary service

Recovery Point Objectives

DfRecovery Point Objective

Recovery Point Objective (RPO) defines the maximum acceptable data loss in a disaster. For LLM systems, this typically applies to conversation history, user preferences, and cached results.

ComponentRTO TargetRPO Target
API routing< 30 seconds0 (stateless)
Conversation history< 5 minutes< 1 minute
Model serving< 5 minutes0 (stateless)
User preferences< 15 minutes< 5 minutes

Practice Exercises

  1. Conceptual: Design a multi-provider failover architecture for a customer service chatbot that uses GPT-4 as primary and must maintain 99.9% availability.

  2. Mathematical: Calculate the expected monthly downtime for an LLM service with three independent providers, each with 99.5% uptime, using a failover architecture.

  3. Practical: Create a disaster recovery runbook for a safety incident where the primary model must be immediately retired and replaced with a backup.

  4. Research: Compare the cost-effectiveness of multi-provider redundancy versus self-hosted backup models for disaster recovery.

Key Takeaways:

  • LLM systems have unique failure modes: provider outages, model degradation, safety incidents
  • Multi-provider failover eliminates single-provider dependency
  • Graceful degradation provides reduced but functional service during failures
  • Automated recovery with circuit breakers reduces mean time to recovery
  • Regular disaster recovery testing (game days) ensures procedures work when needed

What to Learn Next

-> LLM Serving Architectures vLLM, TGI, TensorRT-LLM, and serving patterns for production deployments.

-> LLM Monitoring and Observability Logging, tracing, metrics, and drift detection for production systems.

-> LLM Versioning and Rollouts Model versioning, artifact management, and gradual rollout strategies.

-> LLM Security Best Practices Protecting systems from adversarial attacks and data privacy risks.

-> Multi-Tenant LLM Systems Tenant isolation, resource sharing, and customization at scale.

-> LLM Evaluation in Production Online evaluation, user feedback loops, and quality assurance.

Advertisement

Need Expert LLM Help?

Get personalized tutoring, RAG system design, or production LLM consulting.

Advertisement