Disaster Recovery Patterns

Difficulty: Senior Level | Companies: AWS, Google, Microsoft, Netflix, Uber

DR Strategy Overview

Disaster Recovery (DR) ensures business continuity when systems fail. The right strategy depends on your RTO (Recovery Time Objective) and RPO (Recovery Point Objective).

ℹ️

RPO = how much data you can afford to lose. RTO = how quickly you need to recover. Lower RPO/RTO = higher cost.

DR Strategy Comparison

Architecture Diagram

                    RPO/RTO
Cost ◀─────────────────────────────────────────▶ Recovery
  │                                               │
  │  ┌─────────┐  ┌─────────┐  ┌─────────┐  ┌──┴────────┐
  │  │Backup & │  │Pilot    │  │Warm     │  │Hot        │
  │  │Restore  │  │Light    │  │Standby  │  │Standby    │
  │  │         │  │         │  │         │  │           │
  │  │RPO:24hr │  │RPO:1hr  │  │RPO:15m  │  │RPO:~0     │
  │  │RTO:24hr │  │RTO:30m  │  │RTO:5m   │  │RTO:~0     │
  │  │$        │  │$$       │  │$$$      │  │$$$$       │
  │  └─────────┘  └─────────┘  └─────────┘  └───────────┘
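
To make the trade-off concrete, the sketch below picks the cheapest strategy whose RPO and RTO both meet a given target. The figures are the illustrative ones from the diagram (Hot Standby is treated as zero), not guarantees for any real system, and `Strategy`, `STRATEGIES`, and `pick_strategy` are hypothetical names introduced here.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class Strategy:
    name: str
    rpo: timedelta   # worst-case data loss
    rto: timedelta   # worst-case time to recover
    cost_tier: int   # 1 = "$" ... 4 = "$$$$"


# Illustrative figures taken from the comparison diagram.
STRATEGIES = [
    Strategy("Backup & Restore", timedelta(hours=24), timedelta(hours=24), 1),
    Strategy("Pilot Light", timedelta(hours=1), timedelta(minutes=30), 2),
    Strategy("Warm Standby", timedelta(minutes=15), timedelta(minutes=5), 3),
    Strategy("Hot Standby", timedelta(0), timedelta(0), 4),
]


def pick_strategy(rpo_target: timedelta, rto_target: timedelta) -> Optional[Strategy]:
    """Return the cheapest strategy that meets both targets, or None."""
    candidates = [
        s for s in STRATEGIES
        if s.rpo <= rpo_target and s.rto <= rto_target
    ]
    return min(candidates, key=lambda s: s.cost_tier, default=None)
```

For example, a workload that tolerates 2 hours of data loss and 1 hour of downtime lands on Pilot Light, since Backup & Restore misses both targets and the remaining options cost more.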

Pattern 1: Backup and Restore

The simplest and lowest-cost DR strategy: take regular backups and restore them into a new environment after a failure.

# Automated backup with AWS Backup
import boto3

backup = boto3.client('backup')

# Create backup plan
response = backup.create_backup_plan(
    BackupPlan={
        'BackupPlanName': 'disaster-recovery-plan',
        'Rules': [
            {
                'RuleName': 'daily-backup',
                'TargetBackupVaultName': 'dr-vault',
                'ScheduleExpression': 'cron(0 3 * * ? *)',
                'StartWindowMinutes': 60,
                'CompletionWindowMinutes': 180,
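                'Lifecycle': {
                    # No cold-storage transition here: AWS Backup requires
                    # DeleteAfterDays to be at least 90 days greater than
                    # MoveToColdStorageAfterDays, which a 35-day retention
                    # cannot satisfy.
                    'DeleteAfterDays': 35,
                },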
            },
            {
                'RuleName': 'monthly-backup',
                'TargetBackupVaultName': 'dr-vault',
                'ScheduleExpression': 'cron(0 3 1 * ? *)',
                'StartWindowMinutes': 60,
                'CompletionWindowMinutes': 360,
                'Lifecycle': {
                    'DeleteAfterDays': 365,
                    'MoveToColdStorageAfterDays': 90,
                },
            },
        ],
        'AdvancedBackupSettings': [
            {
                'ResourceType': 'EC2',
                'BackupOptions': {
                    'WindowsVSS': 'disabled',
                },
            },
        ],
    }
)

# Cross-region backup copy
backup.create_backup_plan(
    BackupPlan={
        'BackupPlanName': 'cross-region-dr-copy',
        'Rules': [
            {
                'RuleName': 'copy-to-dr-region',
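                # Vault in the source region where the recovery point is created.
                # A backup rule has no SourceBackupVaultName field; the copy
                # destination is set in CopyActions below.
                'TargetBackupVaultName': 'dr-vault',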
                'ScheduleExpression': 'cron(0 6 * * ? *)',
                'StartWindowMinutes': 60,
                'CopyActions': [
                    {
                        # Destination vault in the DR region (placeholder 12-digit account ID)
                        'DestinationBackupVaultArn': 'arn:aws:backup:us-west-2:123456789012:backup-vault:dr-vault-west',
                        'Lifecycle': {
                            'DeleteAfterDays': 35,
                        },
                    },
                ],
            },
        ],
    }
)

⚠️

Backup and Restore has the longest RTO. Use it for non-critical workloads where 24-hour recovery is acceptable.
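
Backups only protect the RPO if they keep completing on schedule. As a minimal sketch, the check below flags an RPO breach when the newest completed backup is older than the target. `rpo_breached` is a hypothetical helper; the timestamps are assumed to come from your backup inventory (for example, the completion dates of recovery points listed from the backup vault).

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable


def rpo_breached(completed_at: Iterable[datetime], rpo: timedelta, now: datetime) -> bool:
    """Return True if the newest completed backup is older than the RPO allows."""
    newest = max(completed_at, default=None)
    if newest is None:
        return True  # no backup at all is the worst case
    return now - newest > rpo


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    backups = [now - timedelta(hours=30), now - timedelta(hours=6)]
    # The newest backup is 6 hours old, inside a 24-hour RPO
    print(rpo_breached(backups, timedelta(hours=24), now))
```

Running a check like this on a schedule and alerting on a breach catches silently failing backup jobs before a disaster does.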

Pattern 2: Pilot Light (Minimal Standby)

Keep the core infrastructure (typically the data layer) running in the DR region at minimal capacity, and provision the rest only when failover is triggered.

# Terraform for Pilot Light DR in us-west-2
provider "aws" {
  alias  = "dr"
  region = "us-west-2"
}

# DR RDS Replica
resource "aws_db_instance" "dr_replica" {
  provider = aws.dr
  
  identifier           = "app-database-dr"
  replicate_source_db = "arn:aws:rds:us-east-1:123456789012:db:app-database"  # placeholder 12-digit account ID
  instance_class       = "db.r6g.large"  # Smaller instance for DR
  
  vpc_security_group_ids = [aws_security_group.dr_db.id]
  db_subnet_group_name    = aws_db_subnet_group.dr.name
  
  # Enable automated backups in DR region
  backup_retention_period = 7
  skip_final_snapshot     = true
  
  tags = {
    Environment = "dr"
    Role        = "pilot-light"
  }
}

# Lambda functions (CodeDeploy across regions)
resource "aws_lambda_function" "api_handler_dr" {
  provider = aws.dr
  
  function_name = "api-handler"
  role          = aws_iam_role.dr_lambda.arn  # required argument; role assumed to be defined elsewhere
  runtime       = "nodejs20.x"
  handler       = "index.handler"
  s3_bucket     = "deployment-artifacts-dr"
  s3_key        = "api-handler.zip"
  
  vpc_config {
    subnet_ids         = var.dr_subnet_ids
    security_group_ids = [aws_security_group.dr_lambda.id]
  }
  
  environment {
    variables = {
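      # Placeholder credentials: load real secrets from a secrets manager
      # instead of hardcoding them. The replica is read-only until promoted.
      DATABASE_URL = "postgresql://user:pass@${aws_db_instance.dr_replica.address}:5432/app"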
      ENVIRONMENT  = "dr"
    }
  }
}

# SNS topic for DR failover notifications
resource "aws_sns_topic" "dr_failover" {
  provider = aws.dr
  name     = "dr-failover-notifications"
}
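
When failover is triggered, the read replica has to become a writable primary. A minimal sketch of that step with boto3's RDS client follows; `promote_replica` is a hypothetical helper, and the client is passed in so the function can be exercised without AWS access. Note that the instance can briefly still report `available` right after the promote call, so a production runbook would likely poll more defensively than this waiter does.

```python
def promote_replica(rds_client, replica_id: str) -> str:
    """Promote a read replica to a standalone primary and return its endpoint.

    rds_client is expected to behave like boto3.client("rds") in the DR region.
    Promotion is one-way: the instance stops replicating from the source.
    """
    rds_client.promote_read_replica(DBInstanceIdentifier=replica_id)

    # Block until the promoted instance reports "available"
    waiter = rds_client.get_waiter("db_instance_available")
    waiter.wait(DBInstanceIdentifier=replica_id)

    described = rds_client.describe_db_instances(DBInstanceIdentifier=replica_id)
    return described["DBInstances"][0]["Endpoint"]["Address"]


# Usage (requires AWS credentials):
#   import boto3
#   rds = boto3.client("rds", region_name="us-west-2")
#   endpoint = promote_replica(rds, "app-database-dr")
```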

Pattern 3: Warm Standby

Maintain a scaled-down but fully functional copy of the infrastructure in the DR region, ready to scale up.

# Warm Standby configuration
AWSTemplateFormatVersion: '2010-09-09'
Description: Warm Standby DR Infrastructure

Parameters:
  Environment:
    Type: String
  PrimaryRegion:
    Type: String
    Default: us-east-1
  DRRegion:
    Type: String
    Default: us-west-2

Resources:
  # DR Aurora cluster, attached as a secondary to the global database
  AuroraCluster:
    Type: AWS::RDS::DBCluster
    Properties:
      Engine: aurora-postgresql
      EngineVersion: '15.4'
      # A secondary cluster inherits the database name and master credentials
      # from the primary cluster, so they are not set here
      GlobalClusterIdentifier: !Sub '${Environment}-global-cluster'
      StorageEncrypted: true
      
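  # Scaled-down compute in DR region
  DRScalingGroup:
    Type: AWS::AutoScaling::AutoScalingGroup
    Properties:
      # An instance source is required; DRLaunchTemplate is assumed to be defined elsewhere
      LaunchTemplate:
        LaunchTemplateId: !Ref DRLaunchTemplate
        Version: !GetAtt DRLaunchTemplate.LatestVersionNumber
      MinSize: '2'  # Minimal capacity
      MaxSize: '20'  # Can scale to match primary
      DesiredCapacity: '2'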
      VPCZoneIdentifier: !Ref DRSubnetIds
      TargetGroupARNs:
        - !Ref DRTargetGroup
      HealthCheckType: ELB
      HealthCheckGracePeriod: 300
      
  # DNS failover record
  DNSFailover:
    Type: AWS::Route53::RecordSet
    Properties:
      HostedZoneId: !Ref HostedZoneId
      Name: api.example.com
      Type: A
      SetIdentifier: primary
      Failover: PRIMARY
      TTL: 60
      ResourceRecords:
        - !Ref PrimaryEndpoint
      HealthCheckId: !Ref PrimaryHealthCheck

ℹ️

Warm Standby typically provides an RTO of 5-15 minutes, because the DR infrastructure is already running and only needs to scale up when failover is triggered.
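
The scale-up step can be sketched as follows, assuming the DR Auto Scaling group should match the primary's capacity without exceeding its own `MaxSize`. `scale_up_dr` is a hypothetical helper; the client is passed in and is expected to behave like `boto3.client("autoscaling")` in the DR region.

```python
def scale_up_dr(asg_client, group_name: str, primary_capacity: int) -> int:
    """Raise the DR group's desired capacity to match the primary.

    The target is clamped to the group's own MinSize/MaxSize so the
    request is never rejected for being out of bounds.
    """
    response = asg_client.describe_auto_scaling_groups(
        AutoScalingGroupNames=[group_name]
    )
    group = response["AutoScalingGroups"][0]
    target = max(group["MinSize"], min(primary_capacity, group["MaxSize"]))

    asg_client.set_desired_capacity(
        AutoScalingGroupName=group_name,
        DesiredCapacity=target,
        HonorCooldown=False,  # do not wait out a cooldown during failover
    )
    return target


# Usage (requires AWS credentials; the group name is illustrative):
#   import boto3
#   asg = boto3.client("autoscaling", region_name="us-west-2")
#   scale_up_dr(asg, "dr-scaling-group", primary_capacity=12)
```

Clamping matters because a primary that has grown past the DR group's `MaxSize` would otherwise make the call fail at the worst possible moment; it also surfaces that the DR limits need revisiting.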

Pattern 4: Multi-Active (Hot Standby)

Run full infrastructure in multiple regions simultaneously.

// Multi-active traffic routing
import * as route53 from 'aws-cdk-lib/aws-route53';
import * as elbv2 from 'aws-cdk-lib/aws-elasticloadbalancingv2';

// Health checks for both regions
const primaryHealthCheck = new route53.CfnHealthCheck(this, 'PrimaryHealthCheck', {
  healthCheckConfig: {
    // An ALB exposes a DNS name, not a fixed IP, so check it by domain name
    fullyQualifiedDomainName: primaryAlb.loadBalancerDnsName,
    port: 443,
    type: 'HTTPS',
    resourcePath: '/health',
    requestInterval: 10,
    failureThreshold: 3,
  },
});

const secondaryHealthCheck = new route53.CfnHealthCheck(this, 'SecondaryHealthCheck', {
  healthCheckConfig: {
    // An ALB exposes a DNS name, not a fixed IP, so check it by domain name
    fullyQualifiedDomainName: secondaryAlb.loadBalancerDnsName,
    port: 443,
    type: 'HTTPS',
    resourcePath: '/health',
    requestInterval: 10,
    failureThreshold: 3,
  },
});

// Latency-based routing: Route 53 answers with the lowest-latency healthy region.
// A record set uses a single routing policy, so `region` is not combined with `failover`.
new route53.CfnRecordSet(this, 'MultiActiveRecord', {
  hostedZoneId: hostedZoneId,
  name: 'api.example.com',
  type: 'A',
  setIdentifier: 'us-east-1',
  region: 'us-east-1',
  // Alias records point at the ALB directly and take no TTL
  aliasTarget: {
    dnsName: primaryAlb.loadBalancerDnsName,
    hostedZoneId: primaryAlb.loadBalancerCanonicalHostedZoneId,
    evaluateTargetHealth: true,
  },
  healthCheckId: primaryHealthCheck.ref,
});

new route53.CfnRecordSet(this, 'MultiActiveRecordDR', {
  hostedZoneId: hostedZoneId,
  name: 'api.example.com',
  type: 'A',
  setIdentifier: 'us-west-2',
  region: 'us-west-2',
  aliasTarget: {
    dnsName: secondaryAlb.loadBalancerDnsName,
    hostedZoneId: secondaryAlb.loadBalancerCanonicalHostedZoneId,
    evaluateTargetHealth: true,
  },
  healthCheckId: secondaryHealthCheck.ref,
});
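
The routing decision itself is simple to model. The sketch below is a simplified simulation of latency-based routing with health checks, not Route 53's actual implementation: it picks the lowest-latency region whose health check passes. `route` is a hypothetical function, and it returns `None` when every region is unhealthy, whereas Route 53 fails open in that case and treats all records as healthy.

```python
from typing import Dict, Optional


def route(latency_ms: Dict[str, float], healthy: Dict[str, bool]) -> Optional[str]:
    """Pick the lowest-latency region whose health check is passing."""
    candidates = {
        region: ms
        for region, ms in latency_ms.items()
        if healthy.get(region, False)
    }
    if not candidates:
        return None  # simplification: no healthy region to answer with
    return min(candidates, key=candidates.get)


if __name__ == "__main__":
    latency = {"us-east-1": 20.0, "us-west-2": 85.0}
    # Normal operation: the closest region serves the request
    print(route(latency, {"us-east-1": True, "us-west-2": True}))
    # Regional outage: traffic shifts to the surviving region
    print(route(latency, {"us-east-1": False, "us-west-2": True}))
```

Because every region already serves traffic, failover here is just the health check removing a region from the candidate set; nothing has to be promoted or scaled first.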

Pattern 5: DR Testing Automation

Automate DR drills to validate recovery procedures.

# dr_testing/automated_drill.py
import boto3
import json
from datetime import datetime

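class DRDrillManager:
    """Orchestrates a DR drill and reports the outcome.

    The _promote_dr_replica, _scale_dr_infrastructure, _switch_dns_to_dr,
    _rollback_dr_changes and _send_drill_report helpers are not shown.
    """
    def __init__(self):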
        self.cf = boto3.client('cloudformation')
        self.rds = boto3.client('rds')
        self.route53 = boto3.client('route53')
    
    def execute_dr_drill(self, drill_type: str = 'full'):
        """Execute DR drill and validate recovery."""
        results = {
            'drill_id': f"dr-drill-{datetime.now().strftime('%Y%m%d-%H%M%S')}",
            'start_time': datetime.now().isoformat(),
            'steps': [],
        }
        
        try:
            # Step 1: Promote DR database replica
            if drill_type in ['full', 'database']:
                db_result = self._promote_dr_replica()
                results['steps'].append({
                    'name': 'promote_database',
                    'status': 'success',
                    'details': db_result,
                })
            
            # Step 2: Scale up DR compute
            if drill_type in ['full', 'compute']:
                compute_result = self._scale_dr_infrastructure()
                results['steps'].append({
                    'name': 'scale_compute',
                    'status': 'success',
                    'details': compute_result,
                })
            
            # Step 3: Switch DNS
            if drill_type == 'full':
                dns_result = self._switch_dns_to_dr()
                results['steps'].append({
                    'name': 'dns_failover',
                    'status': 'success',
                    'details': dns_result,
                })
            
            # Step 4: Validate application
            app_result = self._validate_application()
            results['steps'].append({
                'name': 'validate_app',
                'status': 'success' if app_result['healthy'] else 'failed',
                'details': app_result,
            })
            
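            # A drill whose validation step failed must not be reported as completed
            results['status'] = 'completed' if app_result['healthy'] else 'failed'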
            
        except Exception as e:
            results['status'] = 'failed'
            results['error'] = str(e)
        
        finally:
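            # Always roll back after the drill. Promoting a read replica is
            # one-way, so the rollback has to re-create the replica rather
            # than demote it.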
            self._rollback_dr_changes()
            results['end_time'] = datetime.now().isoformat()
        
        # Send results
        self._send_drill_report(results)
        
        return results
    
    def _validate_application(self):
        """Run health checks against DR endpoint."""
        import requests
        
        checks = [
            ('https://dr-api.example.com/health', 200),
            ('https://dr-api.example.com/api/v1/status', 200),
        ]
        
        results = []
        for url, expected_status in checks:
            try:
                response = requests.get(url, timeout=10)
                results.append({
                    'url': url,
                    'status_code': response.status_code,
                    'healthy': response.status_code == expected_status,
                })
            except Exception as e:
                results.append({
                    'url': url,
                    'error': str(e),
                    'healthy': False,
                })
        
        return {
            'healthy': all(r['healthy'] for r in results),
            'checks': results,
        }
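
A drill is only useful if its outcome is compared against the recovery target. As a minimal sketch, the hypothetical helper below takes the `results` dictionary built by `execute_dr_drill` and checks that every step succeeded and that the elapsed time fits within the RTO. Since `end_time` is recorded after rollback, the measured duration overstates the real recovery time, which keeps the check conservative.

```python
from datetime import datetime, timedelta


def drill_met_rto(results: dict, rto_target: timedelta) -> bool:
    """Return True if the drill completed, all steps passed, and it fit the RTO."""
    started = datetime.fromisoformat(results["start_time"])
    ended = datetime.fromisoformat(results["end_time"])
    all_steps_ok = all(step["status"] == "success" for step in results["steps"])
    return (
        results.get("status") == "completed"
        and all_steps_ok
        and (ended - started) <= rto_target
    )


if __name__ == "__main__":
    sample = {
        "status": "completed",
        "start_time": "2024-01-01T03:00:00",
        "end_time": "2024-01-01T03:12:00",
        "steps": [{"name": "validate_app", "status": "success"}],
    }
    # A 12-minute drill fits a 15-minute RTO
    print(drill_met_rto(sample, timedelta(minutes=15)))
```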

DR Drill Schedule

Drill Type          Frequency      Duration   Team Involved
Tabletop Exercise   Quarterly      2 hours    Leadership, Engineering
Component Failover  Monthly        1 hour     Engineering
Full DR Drill       Semi-annually  4-8 hours  All teams
Chaos Engineering   Weekly         1 hour     SRE

Follow-Up Questions

How do you handle data consistency during a cross-region failover?
What strategies would you use to test DR procedures without impacting production?
How do you design a DR strategy for a multi-tenant SaaS application?