Topic: Airflow Production Best Practices

Interview Question

ℹ️Interview Context

Company: Netflix / Uber Role: Senior Data Engineer / Platform Engineer Difficulty: Advanced Time: 60-90 minutes

Question: "Describe the best practices for running Airflow in production at scale. How do you handle scaling, migration, disaster recovery, and operational excellence?"

Detailed Theory

Production Fundamentals

# production_fundamentals.py
"""
Production Best Practices:

1. Scaling:
   - Horizontal scaling
   - Vertical scaling
   - Auto-scaling

2. High Availability:
   - Redundancy
   - Failover
   - Disaster recovery

3. Monitoring:
   - Health checks
   - Alerting
   - Performance monitoring

4. Operations:
   - Runbooks
   - Incident response
   - Change management

5. Cost Optimization:
   - Resource right-sizing
   - Spot instances
   - Scheduling optimization
"""

1. Scaling Strategies

# scaling_strategies.py
"""
Scaling Airflow:

1. Horizontal Scaling:
   - Add more workers
   - Add more scheduler instances
   - Use Kubernetes for dynamic scaling

2. Vertical Scaling:
   - Increase worker resources
   - Increase database resources
   - Increase webserver resources

3. Auto-scaling:
   - Kubernetes HPA
   - Celery worker auto-scaling
   - Database read replicas
"""

# Kubernetes HPA configuration
HPA_CONFIG = """
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: airflow-worker-hpa
  namespace: airflow
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: airflow-worker
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Pods
        value: 4
        periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Pods
        value: 2
        periodSeconds: 120
"""

# Database read replicas
DATABASE_SCALING = """
# PostgreSQL read replicas

1. Primary database for writes
2. Read replicas for read-heavy operations
3. Connection pooling with PgBouncer

Configuration:
- Write operations -> Primary
- Read operations -> Read replicas
- DAG parsing -> Read replicas
"""

# Celery worker auto-scaling
CELERY_SCALING = """
# Celery worker auto-scaling

1. Monitor queue depth
2. Scale workers based on queue size
3. Use cloud auto-scaling groups

Metrics to monitor:
- Queue depth
- Worker utilization
- Task execution time
"""

ℹ️Pro Tip

Use Kubernetes for dynamic scaling. It allows you to scale workers based on demand and release resources when not needed.

2. High Availability

# high_availability.py
"""
High Availability:

Ensure Airflow remains operational during failures.
"""

# Multi-AZ deployment
MULTI_AZ_CONFIG = """
# Deploy across multiple Availability Zones

1. Webserver: Multi-AZ with ALB
2. Scheduler: Active-Passive with database locking
3. Workers: Multi-AZ with auto-scaling
4. Database: Multi-AZ with automated failover
"""

# Scheduler HA
SCHEDULER_HA = """
# Scheduler High Availability

1. Multiple scheduler instances
2. Database-based locking
3. Automatic failover

Configuration:
[scheduler]
scheduler_heartbeat_sec = 5
max_tis_per_query = 512
"""

# Database HA
DATABASE_HA = """
# Database High Availability

1. Primary-Replica setup
2. Automated failover
3. Connection pooling

Configuration:
[database]
sql_alchemy_pool_size = 20
sql_alchemy_max_overflow = 30
sql_alchemy_pool_pre_ping = True
"""

# Disaster recovery
DISASTER_RECOVERY = """
# Disaster Recovery Plan

1. Regular backups
2. Cross-region replication
3. Recovery time objectives (RTO)
4. Recovery point objectives (RPO)

Backup strategy:
- Daily full backups
- Hourly incremental backups
- WAL archiving for point-in-time recovery
"""

3. Migration Strategies

# migration_strategies.py
"""
Migration Strategies:

Migrate Airflow deployments with minimal downtime.
"""

# Airflow 1.x to 2.x migration
MIGRATION_1_TO_2 = """
# Airflow 1.x to 2.x Migration

1. Database migration
   - Backup existing database
   - Run airflow db migrate
   - Verify migration

2. DAG migration
   - Update imports
   - Replace SubDAGs with TaskGroups
   - Update operator imports

3. Configuration migration
   - Update airflow.cfg
   - Migrate to environment variables
   - Test configuration

4. Testing
   - Test all DAGs
   - Verify connections
   - Test scheduling
"""

# Cloud migration
CLOUD_MIGRATION = """
# On-premise to Cloud Migration

1. Assessment
   - Inventory current setup
   - Identify dependencies
   - Plan migration phases

2. Preparation
   - Set up cloud infrastructure
   - Configure networking
   - Set up secrets management

3. Migration
   - Migrate DAGs
   - Migrate connections
   - Migrate variables

4. Validation
   - Test all pipelines
   - Verify monitoring
   - Performance testing
"""

# Multi-region migration
MULTI_REGION_MIGRATION = """
# Multi-region Migration

1. Set up target region
2. Replicate DAGs and configurations
3. Set up database replication
4. Test cross-region failover
5. Gradual traffic migration
"""

4. Operational Excellence

# operational_excellence.py
"""
Operational Excellence:

Runbooks, incident response, and best practices.
"""

# Runbook template
RUNBOOK_TEMPLATE = """
# Runbook: [Incident Type]

## Overview
- Description of the incident
- Impact assessment
- Severity level

## Detection
- How to detect the incident
- Monitoring alerts
- Symptoms

## Investigation
- Initial investigation steps
- Logs to check
- Metrics to review

## Resolution
- Step-by-step resolution
- Commands to run
- Configuration changes

## Prevention
- How to prevent recurrence
- Improvements to implement
- Monitoring enhancements
"""

# Incident response
INCIDENT_RESPONSE = """
# Incident Response Process

1. Detection
   - Monitoring alerts
   - User reports
   - Automated checks

2. Triage
   - Assess severity
   - Identify impact
   - Assign owner

3. Investigation
   - Gather information
   - Identify root cause
   - Determine scope

4. Resolution
   - Implement fix
   - Verify resolution
   - Communicate status

5. Post-mortem
   - Document incident
   - Identify improvements
   - Update runbooks
"""

# Change management
CHANGE_MANAGEMENT = """
# Change Management Process

1. Change request
   - Document change
   - Risk assessment
   - Approval process

2. Testing
   - Test in staging
   - Validate changes
   - Rollback plan

3. Deployment
   - Deploy to production
   - Monitor impact
   - Verify success

4. Documentation
   - Update documentation
   - Communicate changes
   - Train users
"""

⚠️Important

Always have a rollback plan for changes. Test changes in staging before production deployment.

5. Cost Optimization

# cost_optimization.py
"""
Cost Optimization:

Reduce costs while maintaining performance.
"""

# Resource right-sizing
RIGHT_SIZING = """
# Resource Right-sizing

1. Monitor resource usage
2. Identify over-provisioned resources
3. Right-size based on actual usage
4. Review regularly

Tools:
- Cloud provider cost explorer
- Kubernetes resource metrics
- Application performance monitoring
"""

# Spot instances
SPOT_INSTANCES = """
# Use Spot Instances for Workers

1. Spot instances for non-critical workloads
2. On-demand for critical workloads
3. Reserved instances for base capacity

Benefits:
- Up to 90% cost reduction
- Good for fault-tolerant workloads
- Auto-scaling with spot fleet
"""

# Scheduling optimization
SCHEDULING_OPTIMIZATION = """
# Optimize Scheduling

1. Avoid peak hours
2. Use idle resources
3. Batch processing
4. Resource-aware scheduling

Benefits:
- Lower costs
- Better resource utilization
- Reduced contention
"""

Real-World Scenarios

Scenario 1: Netflix's Production Setup

# netflix_production.py
"""
Netflix-style production setup:
- Global deployment
- Multi-region
- Cost optimization
"""

from airflow.decorators import dag, task
from datetime import datetime
from typing import Dict, Any

@dag(
    dag_id='netflix_global_pipeline',
    schedule_interval='@hourly',
    start_date=datetime(2024, 1, 1),
    catchup=False,
    max_active_runs=10,
    max_active_tasks=100,
    tags=['netflix', 'global', 'production'],
)
def netflix_global():
    @task
    def extract_global() -> Dict[str, Any]:
        """Extract from global sources"""
        return {'sources': ['us', 'eu', 'apac']}
    
    @task
    def process_region(region: str) -> Dict[str, Any]:
        """Process by region"""
        return {'region': region, 'processed': True}
    
    @task
    def aggregate_global(results: list) -> Dict[str, Any]:
        """Aggregate global results"""
        return {'total_regions': len(results)}
    
    # Global pipeline
    sources = extract_global()
    regional = process_region.expand(sources['sources'])
    aggregate = aggregate_global(regional)

netflix_global()

QuizBox

Best Practices

# best_practices.py
"""
Production Best Practices:

1. Scaling:
   - Implement auto-scaling
   - Monitor resource usage
   - Right-size resources

2. High Availability:
   - Deploy across multiple AZs
   - Implement failover
   - Regular backups

3. Operations:
   - Create runbooks
   - Implement incident response
   - Regular training

4. Cost Optimization:
   - Monitor costs
   - Use spot instances
   - Optimize scheduling

5. Continuous Improvement:
   - Regular reviews
   - Performance tuning
   - Process improvement
"""

ℹ️Netflix Interview Tip

At Netflix, they emphasize operational excellence. When discussing production best practices, highlight the importance of monitoring, runbooks, and continuous improvement. Also mention how they handle incidents and optimize costs at scale.

Summary

Production best practices are critical for reliable Airflow operation. Key takeaways:

Scaling for performance
High availability for reliability
Migration for upgrades
Operations for maintainability
Cost optimization for efficiency

For Netflix and Uber interviews, focus on:

Scaling strategies
High availability patterns
Operational excellence
Cost optimization
Continuous improvement

This question is part of the Apache Airflow Advanced interview preparation series. Practice explaining these concepts before your interview.