Disaster Recovery Patterns
Difficulty: Senior Level | Companies: AWS, Google, Microsoft, Netflix, Uber
DR Strategy Overview
Disaster Recovery (DR) ensures business continuity when systems fail. The right strategy depends on your RTO (Recovery Time Objective) and RPO (Recovery Point Objective).
Note: RPO is how much data you can afford to lose; RTO is how quickly you need to recover. Lower RPO and RTO targets cost more.
DR Strategy Comparison
Cost rises as RPO and RTO shrink:
| Strategy | RPO | RTO | Relative Cost |
|---|---|---|---|
| Backup & Restore | 24 hr | 24 hr | $ |
| Pilot Light | 1 hr | 30 min | $$ |
| Warm Standby | 15 min | 5 min | $$$ |
| Hot Standby | 0 | 0 | $$$$ |
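One way to read the comparison above is as a lookup: given the RPO and RTO the business will tolerate, pick the cheapest strategy that satisfies both. The sketch below encodes the illustrative figures from the comparison; the `Strategy` class and `cheapest_strategy` function are hypothetical names introduced here, and real targets should come from your own measurements rather than these round numbers.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class Strategy:
    name: str
    rpo: timedelta
    rto: timedelta
    cost_tier: int  # 1 = "$", 4 = "$$$$"


# Illustrative figures copied from the comparison above, cheapest first.
STRATEGIES = [
    Strategy("Backup & Restore", timedelta(hours=24), timedelta(hours=24), 1),
    Strategy("Pilot Light", timedelta(hours=1), timedelta(minutes=30), 2),
    Strategy("Warm Standby", timedelta(minutes=15), timedelta(minutes=5), 3),
    Strategy("Hot Standby", timedelta(0), timedelta(0), 4),
]


def cheapest_strategy(max_rpo: timedelta, max_rto: timedelta) -> Optional[Strategy]:
    """Return the lowest-cost strategy whose RPO and RTO both fit the targets."""
    candidates = [s for s in STRATEGIES if s.rpo <= max_rpo and s.rto <= max_rto]
    return min(candidates, key=lambda s: s.cost_tier) if candidates else None


if __name__ == "__main__":
    choice = cheapest_strategy(max_rpo=timedelta(hours=1), max_rto=timedelta(hours=1))
    print(choice.name if choice else "no strategy fits")
```

With a one-hour tolerance for both data loss and downtime, the function selects Pilot Light: Backup & Restore is cheaper but misses both targets.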
Pattern 1: Backup and Restore
The simplest and cheapest DR strategy: back up data on a schedule and restore it into fresh infrastructure after a failure.
# Automated backup with AWS Backup
import boto3

backup = boto3.client('backup')

# Create backup plan
response = backup.create_backup_plan(
    BackupPlan={
        'BackupPlanName': 'disaster-recovery-plan',
        'Rules': [
            {
                'RuleName': 'daily-backup',
                'TargetBackupVaultName': 'dr-vault',
                'ScheduleExpression': 'cron(0 3 * * ? *)',  # daily at 03:00 UTC
                'StartWindowMinutes': 60,
                'CompletionWindowMinutes': 180,
                # No cold-storage transition here: AWS Backup requires
                # DeleteAfterDays to be at least 90 days greater than
                # MoveToColdStorageAfterDays, which a 35-day retention cannot satisfy.
                'Lifecycle': {
                    'DeleteAfterDays': 35,
                },
            },
            {
                'RuleName': 'monthly-backup',
                'TargetBackupVaultName': 'dr-vault',
                'ScheduleExpression': 'cron(0 3 1 * ? *)',  # 03:00 UTC on the 1st of each month
                'StartWindowMinutes': 60,
                'CompletionWindowMinutes': 360,
                'Lifecycle': {
                    'DeleteAfterDays': 365,
                    'MoveToColdStorageAfterDays': 90,
                },
            },
        ],
        'AdvancedBackupSettings': [
            {
                'ResourceType': 'EC2',
                'BackupOptions': {
                    'WindowsVSS': 'disabled',
                },
            },
        ],
    }
)
# A plan backs up nothing until resources are assigned to it
# with backup.create_backup_selection().
# Cross-region backup copy
# CopyActions copies each recovery point to the destination vault after the
# backup job in this rule completes. The rule itself still targets a vault in
# the source region; a rule has no 'SourceBackupVaultName' key.
backup.create_backup_plan(
    BackupPlan={
        'BackupPlanName': 'cross-region-dr-copy',
        'Rules': [
            {
                'RuleName': 'copy-to-dr-region',
                'TargetBackupVaultName': 'dr-vault',
                'ScheduleExpression': 'cron(0 6 * * ? *)',
                'StartWindowMinutes': 60,
                'CopyActions': [
                    {
                        # Placeholder 12-digit account ID
                        'DestinationBackupVaultArn': 'arn:aws:backup:us-west-2:123456789012:backup-vault:dr-vault-west',
                        'Lifecycle': {
                            'DeleteAfterDays': 35,
                        },
                    },
                ],
            },
        ],
    }
)
Warning: Backup and Restore has the longest RTO of the four patterns. Use it for non-critical workloads where 24-hour recovery is acceptable.
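The RPO side of this pattern can be estimated from the backup schedule itself. The sketch below is a rough upper bound, not an AWS formula: it assumes a backup captures state as of the moment its job starts, and that a failure can strike just before the next backup finishes, which forces a restore from the previous one. The `worst_case_rpo` function is a hypothetical helper introduced for illustration.

```python
from datetime import timedelta


def worst_case_rpo(
    backup_interval: timedelta,
    start_window: timedelta,
    completion_window: timedelta,
) -> timedelta:
    """Upper bound on data loss when restoring from scheduled backups.

    The previous job may start right on schedule, while the next one may start
    as late as the start window allows and finish as late as the completion
    window allows. A failure just before that point loses everything written
    since the previous job started.
    """
    return backup_interval + start_window + completion_window


if __name__ == "__main__":
    rpo = worst_case_rpo(
        backup_interval=timedelta(hours=24),
        start_window=timedelta(minutes=60),
        completion_window=timedelta(minutes=180),
    )
    print(f"Worst-case RPO: {rpo.total_seconds() / 3600:.0f} hours")
```

For a daily schedule with a 60-minute start window and a 180-minute completion window, the bound works out to 28 hours, somewhat more than the nominal 24. If the whole region is lost, the time taken by the cross-region copy adds to this figure.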
Pattern 2: Pilot Light (Minimal Standby)
Keep core infrastructure running in the DR region at minimal capacity.
# Terraform for Pilot Light DR in us-west-2
provider "aws" {
  alias  = "dr"
  region = "us-west-2"
}

# DR RDS replica (cross-region read replica of the primary database)
resource "aws_db_instance" "dr_replica" {
  provider               = aws.dr
  identifier             = "app-database-dr"
  replicate_source_db    = "arn:aws:rds:us-east-1:123456789012:db:app-database" # placeholder account ID
  instance_class         = "db.r6g.large" # Smaller instance for DR
  vpc_security_group_ids = [aws_security_group.dr_db.id]
  db_subnet_group_name   = aws_db_subnet_group.dr.name

  # Enable automated backups in DR region
  backup_retention_period = 7
  skip_final_snapshot     = true

  tags = {
    Environment = "dr"
    Role        = "pilot-light"
  }
}

# Lambda function deployed to the DR region from a DR-region artifact bucket
resource "aws_lambda_function" "api_handler_dr" {
  provider      = aws.dr
  function_name = "api-handler"
  role          = aws_iam_role.dr_lambda.arn # required argument; role assumed to be defined elsewhere
  runtime       = "nodejs20.x"
  handler       = "index.handler"
  s3_bucket     = "deployment-artifacts-dr"
  s3_key        = "api-handler.zip"

  vpc_config {
    subnet_ids         = var.dr_subnet_ids
    security_group_ids = [aws_security_group.dr_lambda.id]
  }

  environment {
    variables = {
      # Example only: keep real credentials in a secrets store, not in plain-text variables
      DATABASE_URL = "postgresql://user:pass@${aws_db_instance.dr_replica.address}:5432/app"
      ENVIRONMENT  = "dr"
    }
  }
}

# SNS topic for DR failover notifications
resource "aws_sns_topic" "dr_failover" {
  provider = aws.dr
  name     = "dr-failover-notifications"
}
Pattern 3: Warm Standby
Maintain scaled-down infrastructure in the DR region, ready to scale up.
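# Warm Standby configuration
AWSTemplateFormatVersion: '2010-09-09'
Description: Warm Standby DR Infrastructure

# Fragment: DRSubnetIds, DRTargetGroup, HostedZoneId, PrimaryEndpoint, and
# PrimaryHealthCheck are assumed to be defined elsewhere in the template.
Parameters:
  Environment:
    Type: String
  PrimaryRegion:
    Type: String
    Default: us-east-1
  DRRegion:
    Type: String
    Default: us-west-2

Resources:
  # DR Aurora cluster, joined to the global database as a secondary
  AuroraCluster:
    Type: AWS::RDS::DBCluster
    Properties:
      Engine: aurora-postgresql
      EngineVersion: '15.4'
      # A secondary cluster inherits the database name and master credentials
      # from the primary, so DatabaseName, MasterUsername, and
      # MasterUserPassword are omitted here.
      GlobalClusterIdentifier: !Sub '${Environment}-global-cluster'
      StorageEncrypted: true

  # Scaled-down compute in DR region
  DRScalingGroup:
    Type: AWS::AutoScaling::AutoScalingGroup
    Properties:
      # A launch template (or equivalent) is required. DRLaunchTemplateId and
      # DRLaunchTemplateVersion are hypothetical parameters.
      LaunchTemplate:
        LaunchTemplateId: !Ref DRLaunchTemplateId
        Version: !Ref DRLaunchTemplateVersion
      MinSize: '2' # Minimal capacity
      MaxSize: '20' # Can scale to match primary
      DesiredCapacity: '2'
      VPCZoneIdentifier: !Ref DRSubnetIds
      TargetGroupARNs:
        - !Ref DRTargetGroup
      HealthCheckType: ELB
      HealthCheckGracePeriod: 300

  # DNS failover record (primary). Failover routing also needs a matching
  # record with Failover: SECONDARY that points at the DR endpoint.
  DNSFailover:
    Type: AWS::Route53::RecordSet
    Properties:
      HostedZoneId: !Ref HostedZoneId
      Name: api.example.com
      Type: A
      SetIdentifier: primary
      Failover: PRIMARY
      TTL: '60'
      ResourceRecords:
        - !Ref PrimaryEndpoint
      HealthCheckId: !Ref PrimaryHealthCheck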
Note: Warm Standby provides a 5-15 minute RTO. Scale up the DR infrastructure when failover is triggered.
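The scale-up step might look like the sketch below. It calls `update_auto_scaling_group`, which is a real boto3 Auto Scaling operation, but the function name `scale_up_dr_group`, the group name in the usage comment, and the default maximum of 20 (taken from the template's `MaxSize`) are assumptions for illustration. The client is passed in as an argument so the logic can be exercised with a stub before it touches a real account.

```python
from typing import Any, Dict


def scale_up_dr_group(
    autoscaling_client: Any,
    group_name: str,
    target_capacity: int,
    max_size: int = 20,
) -> Dict[str, int]:
    """Raise a warm-standby Auto Scaling group to production capacity.

    `autoscaling_client` is expected to be boto3.client('autoscaling') for the
    DR region.
    """
    if not 0 < target_capacity <= max_size:
        raise ValueError(f"target_capacity must be between 1 and {max_size}")

    # Raising MinSize along with DesiredCapacity stops scale-in policies from
    # shrinking the group again while it carries production traffic.
    new_sizes = {
        "MinSize": target_capacity,
        "MaxSize": max_size,
        "DesiredCapacity": target_capacity,
    }
    autoscaling_client.update_auto_scaling_group(
        AutoScalingGroupName=group_name, **new_sizes
    )
    return new_sizes


# Usage (requires AWS credentials; "dr-app-asg" is a hypothetical group name):
#   import boto3
#   client = boto3.client("autoscaling", region_name="us-west-2")
#   scale_up_dr_group(client, "dr-app-asg", target_capacity=10)
```

New instances still need time to launch and pass health checks, so the scale-up time counts toward the RTO.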
Pattern 4: Multi-Active (Hot Standby)
Run full infrastructure in multiple regions simultaneously.
// Multi-active traffic routing
import * as route53 from 'aws-cdk-lib/aws-route53';
import * as elbv2 from 'aws-cdk-lib/aws-elasticloadbalancingv2';

// primaryAlb and secondaryAlb are assumed to be elbv2.ApplicationLoadBalancer
// instances, and hostedZoneId the ID of an existing hosted zone.

// Health checks for both regions
const primaryHealthCheck = new route53.CfnHealthCheck(this, 'PrimaryHealthCheck', {
  healthCheckConfig: {
    // An ALB exposes a DNS name rather than a fixed IP, so use fullyQualifiedDomainName
    fullyQualifiedDomainName: primaryAlb.loadBalancerDnsName,
    port: 443,
    type: 'HTTPS',
    resourcePath: '/health',
    requestInterval: 10,
    failureThreshold: 3,
  },
});

const secondaryHealthCheck = new route53.CfnHealthCheck(this, 'SecondaryHealthCheck', {
  healthCheckConfig: {
    fullyQualifiedDomainName: secondaryAlb.loadBalancerDnsName,
    port: 443,
    type: 'HTTPS',
    resourcePath: '/health',
    requestInterval: 10,
    failureThreshold: 3,
  },
});

// Latency-based routing: Route 53 answers with the lowest-latency region whose
// health check passes. A record set cannot combine `region` with `failover`,
// so `failover` is dropped; unhealthy regions are skipped automatically.
new route53.CfnRecordSet(this, 'MultiActiveRecord', {
  hostedZoneId: hostedZoneId,
  name: 'api.example.com',
  type: 'A',
  setIdentifier: 'us-east-1',
  region: 'us-east-1',
  // Alias to the ALB: an A record cannot hold a DNS name, and alias records take no TTL
  aliasTarget: {
    dnsName: primaryAlb.loadBalancerDnsName,
    hostedZoneId: primaryAlb.loadBalancerCanonicalHostedZoneId,
    evaluateTargetHealth: true,
  },
  healthCheckId: primaryHealthCheck.ref,
});

new route53.CfnRecordSet(this, 'MultiActiveRecordDR', {
  hostedZoneId: hostedZoneId,
  name: 'api.example.com',
  type: 'A',
  setIdentifier: 'us-west-2',
  region: 'us-west-2',
  aliasTarget: {
    dnsName: secondaryAlb.loadBalancerDnsName,
    hostedZoneId: secondaryAlb.loadBalancerCanonicalHostedZoneId,
    evaluateTargetHealth: true,
  },
  healthCheckId: secondaryHealthCheck.ref,
});
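To make the routing behaviour concrete, the following is a simplified local model of latency-based routing with health checks. It is not Route 53's actual algorithm (which differs in detail, for example in how endpoints recover and in what happens when every endpoint is unhealthy); `RegionEndpoint` and `route` are hypothetical names, and the latency figures are made up.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class RegionEndpoint:
    region: str
    failure_threshold: int = 3  # mirrors failureThreshold in the health checks
    consecutive_failures: int = 0

    def record_check(self, passed: bool) -> None:
        """Track consecutive failed checks; any success resets the count."""
        self.consecutive_failures = 0 if passed else self.consecutive_failures + 1

    @property
    def healthy(self) -> bool:
        return self.consecutive_failures < self.failure_threshold


def route(endpoints: List[RegionEndpoint], latency_ms: Dict[str, float]) -> Optional[str]:
    """Pick the healthy region with the lowest latency for this client."""
    healthy = [e for e in endpoints if e.healthy]
    if not healthy:
        return None  # simplification: this model gives no answer when all regions are down
    return min(healthy, key=lambda e: latency_ms[e.region]).region


if __name__ == "__main__":
    east, west = RegionEndpoint("us-east-1"), RegionEndpoint("us-west-2")
    client_latency = {"us-east-1": 20.0, "us-west-2": 80.0}  # made-up sample latencies

    print(route([east, west], client_latency))
    for _ in range(3):
        east.record_check(passed=False)
    print(route([east, west], client_latency))
```

The client is first routed to us-east-1, the closer region. After three consecutive failed checks, us-east-1 is treated as unhealthy and the same client is routed to us-west-2 with no explicit failover step.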
Pattern 5: DR Testing Automation
Automate DR drills to validate recovery procedures.
# dr_testing/automated_drill.py
from datetime import datetime

import boto3
import requests


class DRDrillManager:
    # The helpers _promote_dr_replica, _scale_dr_infrastructure,
    # _switch_dns_to_dr, _rollback_dr_changes, and _send_drill_report are
    # environment-specific and are not shown; implement them before use.

    def __init__(self, dr_region: str = 'us-west-2'):
        # Regional clients target the DR region; Route 53 is a global service
        self.cf = boto3.client('cloudformation', region_name=dr_region)
        self.rds = boto3.client('rds', region_name=dr_region)
        self.route53 = boto3.client('route53')

    def execute_dr_drill(self, drill_type: str = 'full'):
        """Execute DR drill and validate recovery."""
        results = {
            'drill_id': f"dr-drill-{datetime.now().strftime('%Y%m%d-%H%M%S')}",
            'start_time': datetime.now().isoformat(),
            'steps': [],
        }
        try:
            # Step 1: Promote DR database replica
            if drill_type in ['full', 'database']:
                db_result = self._promote_dr_replica()
                results['steps'].append({
                    'name': 'promote_database',
                    'status': 'success',
                    'details': db_result,
                })

            # Step 2: Scale up DR compute
            if drill_type in ['full', 'compute']:
                compute_result = self._scale_dr_infrastructure()
                results['steps'].append({
                    'name': 'scale_compute',
                    'status': 'success',
                    'details': compute_result,
                })

            # Step 3: Switch DNS
            if drill_type == 'full':
                dns_result = self._switch_dns_to_dr()
                results['steps'].append({
                    'name': 'dns_failover',
                    'status': 'success',
                    'details': dns_result,
                })

            # Step 4: Validate application
            app_result = self._validate_application()
            results['steps'].append({
                'name': 'validate_app',
                'status': 'success' if app_result['healthy'] else 'failed',
                'details': app_result,
            })

            # A drill only counts as completed if the application validated
            results['status'] = 'completed' if app_result['healthy'] else 'failed'
        except Exception as e:
            results['status'] = 'failed'
            results['error'] = str(e)
        finally:
            # Always roll back after the drill. A promoted RDS replica cannot be
            # demoted, so rollback has to recreate the replica.
            self._rollback_dr_changes()
            results['end_time'] = datetime.now().isoformat()
            # Send results
            self._send_drill_report(results)
        return results

    def _validate_application(self):
        """Run health checks against DR endpoint."""
        checks = [
            ('https://dr-api.example.com/health', 200),
            ('https://dr-api.example.com/api/v1/status', 200),
        ]
        results = []
        for url, expected_status in checks:
            try:
                response = requests.get(url, timeout=10)
                results.append({
                    'url': url,
                    'status_code': response.status_code,
                    'healthy': response.status_code == expected_status,
                })
            except Exception as e:
                results.append({
                    'url': url,
                    'error': str(e),
                    'healthy': False,
                })
        return {
            'healthy': all(r['healthy'] for r in results),
            'checks': results,
        }
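A drill is most useful when its result is compared against the RTO target. The sketch below assumes the `results` dictionary shape produced by `execute_dr_drill` above (`start_time`, `end_time`, `steps`, `status`); `evaluate_drill` is a hypothetical helper and the sample data is made up. Because `end_time` is recorded after rollback, the elapsed time slightly overstates the real recovery time.

```python
from datetime import datetime, timedelta
from typing import Any, Dict


def evaluate_drill(results: Dict[str, Any], rto_target: timedelta) -> Dict[str, Any]:
    """Compare a drill's elapsed time and step outcomes against an RTO target."""
    started = datetime.fromisoformat(results["start_time"])
    ended = datetime.fromisoformat(results["end_time"])
    elapsed = ended - started

    failed_steps = [
        step["name"] for step in results.get("steps", []) if step["status"] != "success"
    ]
    met_rto = (
        results.get("status") == "completed"
        and not failed_steps
        and elapsed <= rto_target
    )
    return {
        "drill_id": results.get("drill_id"),
        "elapsed_seconds": elapsed.total_seconds(),
        "failed_steps": failed_steps,
        "met_rto": met_rto,
    }


if __name__ == "__main__":
    # Made-up sample in the same shape as the drill results
    sample = {
        "drill_id": "dr-drill-20240101-100000",
        "start_time": "2024-01-01T10:00:00",
        "end_time": "2024-01-01T10:12:00",
        "status": "completed",
        "steps": [
            {"name": "promote_database", "status": "success"},
            {"name": "validate_app", "status": "success"},
        ],
    }
    print(evaluate_drill(sample, rto_target=timedelta(minutes=15)))
```

Tracking `elapsed_seconds` across drills shows whether recovery time is drifting toward the target before a real incident exposes it.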
DR Drill Schedule
| Drill Type | Frequency | Duration | Team Involved |
|---|---|---|---|
| Tabletop Exercise | Quarterly | 2 hours | Leadership, Engineering |
| Component Failover | Monthly | 1 hour | Engineering |
| Full DR Drill | Semi-annually | 4-8 hours | All teams |
| Chaos Engineering | Weekly | 1 hour | SRE |
Follow-Up Questions
- How do you handle data consistency during a cross-region failover?
- What strategies would you use to test DR procedures without impacting production?
- How do you design a DR strategy for a multi-tenant SaaS application?