Managed Airflow: MWAA and Cloud Composer

Detailed Explanation

Managed Airflow Services

Managed Airflow services run the scheduler, workers, web server, and metadata database for you, so you can focus on DAG development. The three major offerings are Amazon MWAA (Managed Workflows for Apache Airflow), Google Cloud Composer, and Azure's managed Airflow offering.

Key Insight: Managed services reduce operational overhead, but you remain responsible for writing idempotent, well-tested DAGs.
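
To make "idempotent" concrete, here is a minimal sketch of a write that is keyed only by the run's logical date, so a retry or backfill replaces the same object instead of adding a new one. `FakeObjectStore` is a hypothetical in-memory stand-in for S3, GCS, or Blob Storage, and the key layout is only an illustration.

```python
# Sketch: an idempotent write keyed by the run's logical date.
# FakeObjectStore is a hypothetical in-memory stand-in for a real object store.


class FakeObjectStore:
    """Minimal in-memory object store: writing to an existing key overwrites it."""

    def __init__(self):
        self.objects = {}

    def put(self, key: str, body: str) -> None:
        self.objects[key] = body


def output_key(ds: str) -> str:
    """Derive the object key from the logical date only, never from wall-clock time."""
    return f"extracted/{ds}/data.csv"


def run_extract(store: FakeObjectStore, ds: str) -> str:
    """Write one partition; re-running for the same ds replaces the earlier output."""
    key = output_key(ds)
    store.put(key, f"date,status\n{ds},completed\n")
    return key


store = FakeObjectStore()
run_extract(store, "2024-01-01")
run_extract(store, "2024-01-01")  # simulated retry of the same run
run_extract(store, "2024-01-02")
print(sorted(store.objects))
```

Because the key depends only on `ds`, the simulated retry leaves a single object for that date rather than a duplicate.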

Service Comparison

Feature	MWAA	Cloud Composer	Azure Managed Airflow
Airflow Version	2.x	2.x	2.x
DAG Storage	S3	GCS	Blob Storage
Metadata DB	Aurora PostgreSQL	Cloud SQL	Azure Database
Scaling	Auto-scaling (between min and max workers)	Auto-scaling	Manual
Monitoring	CloudWatch	Cloud Monitoring	Azure Monitor
Security	IAM, VPC	IAM, VPC	Azure AD, VNet
Version Upgrade	Manual	Auto or manual	Manual

When to Use Managed vs Self-Managed

Scenario	Recommendation	Reason
Small team, limited ops	Managed	Reduces operational burden
Large scale, custom needs	Self-managed	More control and flexibility
Multi-cloud strategy	Self-managed	Consistent across clouds
Regulatory compliance	Self-managed	Full control over infrastructure
Rapid prototyping	Managed	Faster time to production

MWAA Setup

# mwaa_environment.tf (Terraform)
resource "aws_mwaa_environment" "airflow" {
  name              = "production-airflow"
  airflow_version   = "2.8.1"
  environment_class = "mw1.medium"

  source_bucket_arn = aws_s3_bucket.dags.arn
  dag_s3_path       = "dags/"

  execution_role_arn = aws_iam_role.mwaa_role.arn

  network_configuration {
    security_group_ids = [aws_security_group.mwaa.id]
    subnet_ids         = var.private_subnet_ids # MWAA expects two private subnets
  }

  logging_configuration {
    dag_processing_logs {
      enabled   = true
      log_level = "INFO"
    }
    scheduler_logs {
      enabled   = true
      log_level = "INFO"
    }
    task_logs {
      enabled   = true
      log_level = "INFO"
    }
    webserver_logs {
      enabled   = true
      log_level = "WARNING"
    }
    worker_logs {
      enabled   = true
      log_level = "INFO"
    }
  }

  # PUBLIC_ONLY makes the UI reachable from the internet (still behind IAM auth).
  # Use PRIVATE_ONLY to restrict access to the VPC.
  webserver_access_mode = "PUBLIC_ONLY"

  # MWAA scales workers automatically between these bounds.
  max_workers = 10
  min_workers = 2

  # Airflow settings are passed as "section.key" overrides. MWAA manages the
  # metadata database itself, so no SQLAlchemy connection string is set here.
  airflow_configuration_options = {
    "core.load_examples"      = "False"
    "webserver.expose_config" = "False" # keep the config page hidden in production
  }
}

# webserver_url and arn are computed attributes, so they are exposed as outputs
# rather than set inside the resource.
output "mwaa_webserver_url" {
  value = aws_mwaa_environment.airflow.webserver_url
}

output "mwaa_arn" {
  value = aws_mwaa_environment.airflow.arn
}

Cloud Composer Setup

# composer_environment.tf (Terraform)
resource "google_composer_environment" "airflow" {
  name   = "production-airflow"
  region = "us-central1"

  config {
    # node_count, database_config.machine_type, and web_server_config.machine_type
    # apply to Composer 1 only. With a Composer 2 image, component sizing is set
    # through workloads_config and environment_size.

    software_config {
      image_version = "composer-2.6.1-airflow-2.8.1"
      pypi_packages = {
        "apache-airflow-providers-google" = ">=10.0.0"
        "apache-airflow-providers-amazon" = ">=8.0.0"
        "pandas"                          = ">=2.0.0"
      }
      # Composer reserves AIRFLOW__SECTION__KEY environment variable names, so
      # Airflow settings go in airflow_config_overrides as "section-key".
      airflow_config_overrides = {
        "core-load_examples" = "False"
      }
    }

    workloads_config {
      scheduler {
        cpu        = 2
        memory_gb  = 4
        storage_gb = 10
        count      = 2
      }
      worker {
        cpu        = 2
        memory_gb  = 8
        storage_gb = 20
        min_count  = 2
        max_count  = 10
      }
      triggerer {
        cpu       = 0.5
        memory_gb = 1
        count     = 2
      }
    }

    environment_size = "ENVIRONMENT_SIZE_MEDIUM"

    private_environment_config {
      enable_private_endpoint = true
      master_ipv4_cidr_block  = "172.16.0.0/28"
    }
  }
}

MWAA DAG with S3 Storage

# dags/mwaa_example_dag.py
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.amazon.aws.hooks.s3 import S3Hook
from airflow.providers.amazon.aws.transfers.s3_to_redshift import S3ToRedshiftOperator


def extract_to_s3(**context):
    """Extract data and upload it to the data-lake S3 bucket."""
    import pandas as pd
    import io

    data = pd.DataFrame({
        'date': [context['ds']],
        'record_count': [1000],
        'status': ['completed'],
    })

    csv_buffer = io.StringIO()
    data.to_csv(csv_buffer, index=False)

    hook = S3Hook(aws_conn_id='aws_default')
    hook.load_string(
        string_data=csv_buffer.getvalue(),
        key=f'extracted/{context["ds"]}/data.csv',
        bucket_name='data-lake',
        replace=True,
    )


with DAG(
    dag_id='mwaa_managed_pipeline',
    start_date=datetime(2024, 1, 1),
    schedule='@daily',  # "schedule" replaces the deprecated schedule_interval (Airflow 2.4+)
    catchup=False,
    default_args={
        'retries': 2,
        'retry_delay': timedelta(minutes=5),
    },
    tags=['mwaa', 'managed'],
) as dag:

    extract = PythonOperator(
        task_id='extract',
        python_callable=extract_to_s3,
    )

    load = S3ToRedshiftOperator(
        task_id='load_to_redshift',
        schema='public',
        table='daily_metrics',
        s3_bucket='data-lake',
        s3_key='extracted/{{ ds }}/data.csv',
        copy_options=['FORMAT AS CSV', 'IGNOREHEADER 1'],
        aws_conn_id='aws_default',
        redshift_conn_id='redshift_default',
    )

    extract >> load

Service Comparison: Capacity and Pricing

Feature	MWAA	Cloud Composer	Azure Managed Airflow
Min Workers	2	2	2
Max Workers	Configurable	Auto-scaled	Configurable
Price Model	Per environment + workers	Per environment + workers	Per environment + workers

Scaling Thresholds

Metric	Small	Medium	Large
DAG Count	<50	50-200	200+
Task Count/Day	<1000	1000-10000	10000+
Workers	2-4	4-10	10-20
Scheduler	1	1-2	2+
DB Instance (self-managed on RDS)	db.t3.medium	db.r5.large	db.r5.xlarge
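
The thresholds above can be turned into a quick sizing check. This is a sketch under stated assumptions: the function name is made up for illustration, the larger of the two signals wins, and a value sitting exactly on a boundary (for example 200 DAGs) is placed in the larger tier because the table leaves that case open.

```python
# Sketch: map DAG count and daily task volume onto the tiers from the table.
# The tie-breaking rules are assumptions, not part of the table itself.

SUGGESTED_WORKERS = {
    "small": (2, 4),
    "medium": (4, 10),
    "large": (10, 20),
}


def _level(value: int, medium_from: int, large_from: int) -> int:
    """Return 0 (small), 1 (medium), or 2 (large) for a single metric."""
    if value >= large_from:
        return 2
    if value >= medium_from:
        return 1
    return 0


def environment_tier(dag_count: int, tasks_per_day: int) -> str:
    """Pick the tier implied by whichever metric is more demanding."""
    level = max(
        _level(dag_count, medium_from=50, large_from=200),
        _level(tasks_per_day, medium_from=1000, large_from=10000),
    )
    return ("small", "medium", "large")[level]


tier = environment_tier(dag_count=30, tasks_per_day=5000)
low, high = SUGGESTED_WORKERS[tier]
print(f"{tier}: start with {low}-{high} workers")
```

Taking the maximum of the two signals reflects that a few DAGs with very many tasks can load workers as heavily as many small DAGs.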

Best Practices

Security

Use managed secrets: Leverage AWS Secrets Manager, GCP Secret Manager, or Azure Key Vault instead of environment variables for sensitive data.
Use private networking: Run MWAA/Composer in private subnets and reach databases and storage over VPC peering or private endpoints rather than the public internet.
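
As one illustration of the managed-secrets advice, the sketch below builds the two Airflow configuration overrides that point Airflow at AWS Secrets Manager, in the `section.key` form MWAA uses for configuration options. The helper name and the prefix values are assumptions chosen for the example; the backend class path comes from the Amazon provider package.

```python
# Sketch: build Airflow configuration overrides for a Secrets Manager backend.
# The prefixes are example values; pick ones that match your secret naming.
import json


def secrets_backend_options(connections_prefix: str, variables_prefix: str) -> dict:
    """Return 'section.key' overrides that enable the Secrets Manager backend."""
    backend_kwargs = {
        "connections_prefix": connections_prefix,
        "variables_prefix": variables_prefix,
    }
    return {
        "secrets.backend": (
            "airflow.providers.amazon.aws.secrets."
            "secrets_manager.SecretsManagerBackend"
        ),
        # Airflow expects backend_kwargs as a JSON-encoded string.
        "secrets.backend_kwargs": json.dumps(backend_kwargs),
    }


options = secrets_backend_options("airflow/connections", "airflow/variables")
for key, value in options.items():
    print(f"{key} = {value}")
```

With these overrides in place, a connection such as `redshift_default` would be looked up under the connections prefix before Airflow falls back to the metadata database.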

Scaling and Performance

Monitor worker utilization: Scale workers based on task queue depth and execution time, not just DAG count.
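Use Airflow Variables for environment-specific configuration instead of hardcoding values in DAG files, and read them inside tasks or templates rather than at module level, since a top-level read queries the metadata database on every parse.
Optimize parse time: Keep top-level DAG code cheap. Defer heavy imports into task callables, avoid network or database calls at module level, and drive any dynamic DAG generation from local, fast-to-read configuration.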
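
The advice to scale on queue depth can be sketched as a small calculation: divide the tasks that are queued or running by the number of tasks one worker can run at once, then clamp the result to the configured bounds. The function name and the `tasks_per_worker` value are assumptions for illustration; the real per-worker concurrency depends on the environment class and Celery settings.

```python
# Sketch: derive a target worker count from queue depth.
# tasks_per_worker is an assumed concurrency figure, not a service default.
import math


def desired_workers(
    queued: int,
    running: int,
    tasks_per_worker: int,
    min_workers: int,
    max_workers: int,
) -> int:
    """Workers needed for the current load, clamped to [min_workers, max_workers]."""
    if tasks_per_worker <= 0:
        raise ValueError("tasks_per_worker must be positive")
    needed = math.ceil((queued + running) / tasks_per_worker)
    return max(min_workers, min(max_workers, needed))


# 30 queued + 10 running tasks, 5 tasks per worker, bounds of 2 and 10 workers
print(desired_workers(queued=30, running=10, tasks_per_worker=5,
                      min_workers=2, max_workers=10))
```

A DAG count alone would not capture this: the same number of DAGs can produce very different queue depths depending on schedules and task fan-out.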

Operations

Implement CI/CD: Use S3 sync, GCS sync, or Git-based deployment for DAG updates.
Enable logging: Configure all log types (DAG processing, scheduler, task, webserver, worker) for debugging.
Test upgrades: Use staging environments to test Airflow version upgrades before production.
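
A CI/CD pipeline can reject broken DAG files before they are synced to the bucket. The sketch below is a standard-library-only syntax check with a hypothetical function name. It does not import Airflow, so it catches syntax errors but not import errors or DAG cycles; a fuller check would load the files in an environment that matches the target Airflow version.

```python
# Sketch: pre-deployment syntax check for DAG files (standard library only).
# This complements, and does not replace, loading the DAGs with Airflow itself.
import ast
from pathlib import Path


def find_syntax_errors(dag_dir: str) -> dict:
    """Return {file path: error message} for every .py file that fails to parse."""
    errors = {}
    for path in sorted(Path(dag_dir).rglob("*.py")):
        try:
            ast.parse(path.read_text(encoding="utf-8"), filename=str(path))
        except SyntaxError as exc:
            errors[str(path)] = f"line {exc.lineno}: {exc.msg}"
    return errors


failures = find_syntax_errors("dags")
for file_path, message in failures.items():
    print(f"{file_path}: {message}")
```

In a pipeline, a non-empty result would fail the build so the sync step never runs with a file the scheduler cannot parse.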

Cost Optimization

Strategy	Estimated Savings	Implementation
Auto-scaling workers	30-50%	Scale based on queue depth
Right-size environment	20-40%	Match class to workload
Off-peak scheduling	10-20%	Schedule heavy jobs during low-cost hours
Spot instances	60-70%	Use spot capacity for non-critical workloads on self-managed or external compute; managed workers generally do not expose a spot option
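
To see where an auto-scaling saving in that range could come from, the arithmetic can be sketched with the "per environment + workers" price model. The hourly rates below are hypothetical placeholders, not published prices, and the function names are made up for the example.

```python
# Sketch: compare a fixed worker fleet with an auto-scaled one.
# Both hourly rates are hypothetical placeholders, not real list prices.
HOURS_PER_MONTH = 730


def monthly_cost(env_hourly: float, worker_hourly: float, avg_workers: float,
                 hours: int = HOURS_PER_MONTH) -> float:
    """Environment fee plus the average number of billed workers, per month."""
    return hours * (env_hourly + worker_hourly * avg_workers)


def savings_fraction(baseline: float, optimized: float) -> float:
    """Share of the baseline cost that the optimized setup avoids."""
    return (baseline - optimized) / baseline


fixed = monthly_cost(env_hourly=0.50, worker_hourly=0.10, avg_workers=10)
scaled = monthly_cost(env_hourly=0.50, worker_hourly=0.10, avg_workers=4)
print(f"fixed fleet: {fixed:.2f}, auto-scaled: {scaled:.2f}")
print(f"savings: {savings_fraction(fixed, scaled):.0%}")
```

With these placeholder rates, dropping from 10 always-on workers to an average of 4 cuts the bill by roughly 40 percent. The fixed environment fee limits how far worker scaling alone can reduce cost.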
