πŸŽ‰ 75% of content is free forever β€” Unlock Premium from $10/mo β†’
CW
Search courses…
πŸ’Ό Servicesℹ️ Aboutβœ‰οΈ ContactView Pricing Plansfrom $10

Amazon MWAA (Managed Airflow)

AWS Data EngineeringMWAA Configuration, Scaling & DAGs⭐ Premium

Advertisement

πŸ”„ Amazon MWAA

Master Amazon Managed Workflows for Apache Airflow (MWAA) on AWS.

Module: AWS Data Engineering β€’ Topic 32 of 65 β€’ Premium Content

MWAA Architecture

Architecture Diagram
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    AMAZON MWAA ARCHITECTURE                                   β”‚
β”‚                                                                             β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”‚
β”‚  β”‚  MWAA ENVIRONMENT (Managed)                                         β”‚    β”‚
β”‚  β”‚                                                                     β”‚    β”‚
β”‚  β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚    β”‚
β”‚  β”‚  β”‚  Scheduler (2+)     β”‚    Web Server        β”‚    Workers (2+)  β”‚  β”‚    β”‚
β”‚  β”‚  β”‚  Parse DAGs         β”‚    Airflow UI        β”‚    Execute Tasks  β”‚  β”‚    β”‚
β”‚  β”‚  β”‚  Schedule runs      β”‚    REST API          β”‚    Run Operators  β”‚  β”‚    β”‚
β”‚  β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚    β”‚
β”‚  β”‚                                                                     β”‚    β”‚
β”‚  β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚    β”‚
β”‚  β”‚  β”‚  Metadata Database (RDS PostgreSQL - Managed)                 β”‚  β”‚    β”‚
β”‚  β”‚  β”‚  β€’ DAG runs, task instances, connections, variables           β”‚  β”‚    β”‚
β”‚  β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚    β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚
β”‚                                β”‚                                           β”‚
β”‚              β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                         β”‚
β”‚              β–Ό                 β–Ό                 β–Ό                         β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”            β”‚
β”‚  β”‚  S3 (DAGs/      β”‚  β”‚  Glue/EMR/Lambda β”‚  β”‚  CloudWatch     β”‚            β”‚
β”‚  β”‚  Plugins/Logs)  β”‚  β”‚  (Task Targets)  β”‚  β”‚  (Monitoring)   β”‚            β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜            β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Example DAG

from datetime import datetime, timedelta
from airflow import DAG
from airflow.providers.amazon.aws.operators.glue import GlueJobOperator
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor
from airflow.operators.python import PythonOperator

default_args = {
    'owner': 'data-engineering',
    'depends_on_past': False,
    'email_on_failure': True,
    'email': ['team@company.com'],
    'retries': 3,
    'retry_delay': timedelta(minutes=5),
}

with DAG(
    'daily_etl_pipeline',
    default_args=default_args,
    description='Daily ETL pipeline',
    schedule_interval='0 2 * * *',
    start_date=datetime(2024, 1, 1),
    catchup=False,
    tags=['etl', 'production'],
) as dag:

    wait_for_data = S3KeySensor(
        task_id='wait_for_raw_data',
        bucket_name='data-lake-raw',
        bucket_key='landing/{{ ds }}/',
        wildcard_match=True,
        timeout=3600,
        poke_interval=60,
    )

    run_glue_job = GlueJobOperator(
        task_id='run_etl_job',
        job_name='process-daily-data',
        script_location='s3://glue-scripts/process_data.py',
        s3_bucket='glue-scripts',
        iam_role_name='GlueRole',
        create_job_kwargs={
            'GlueVersion': '4.0',
            'NumberOfWorkers': 10,
            'WorkerType': 'G.1X'
        },
        job_poll_interval=30,
        aws_conn_id='aws_default',
        dag=dag,
    )

    wait_for_data >> run_glue_job

Interview Q&A

Q1: MWAA vs. Step Functions?

Answer: MWAA for complex multi-step workflows with Python expertise. Step Functions for serverless, visual AWS-native workflows.

Q2: How does MWAA scale?

Answer: MWAA scales workers based on queued tasks. Configure min/max workers and scheduler count based on workload.

Q3: What are Airflow Connections?

Answer: Connections store credentials and configuration for external services (AWS, databases, APIs). Managed securely in the metadata database.

Summary

  • MWAA: Managed Apache Airflow on AWS
  • Components: Scheduler, Web Server, Workers, Metadata DB
  • Storage: S3 for DAGs, plugins, and logs
  • Integration: Native AWS operators for Glue, Lambda, EMR, S3
  • Scaling: Auto-scaling workers based on task queue
  • Use Case: Complex multi-step workflow orchestration

Advertisement