🔄 ETL Pipeline Patterns

Master ETL patterns with AWS Glue, Step Functions, and Lambda orchestration.

Module: AWS Data Engineering • Topic 18 of 65 • Premium Content

ETL vs ELT Patterns

Architecture Diagram

┌─────────────────────────────────────────────────────────────────────────────┐
│                    ETL vs ELT PATTERNS                                       │
│                                                                             │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │  ETL (Extract, Transform, Load)                                      │    │
│  │                                                                     │    │
│  │  Source ──► Extract ──► Transform ──► Load ──► Destination          │    │
│  │              │            │            │            │               │    │
│  │           S3/DB       Glue/Lambda    Redshift    Warehouse         │    │
│  │                                                                     │    │
│  │  Best for: Small-medium datasets, complex transformations          │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
│                                                                             │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │  ELT (Extract, Load, Transform)                                      │    │
│  │                                                                     │    │
│  │  Source ──► Extract ──► Load ──► Transform ──► Result               │    │
│  │              │            │            │            │               │    │
│  │           S3/DB         S3/Redshift  SQL/Lambda   Views            │    │
│  │                                                                     │    │
│  │  Best for: Large datasets, simple transformations, data lakes       │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────────────────────┘

Glue ETL Job Patterns

Pattern 1: Full Load

# Full load ETL job
raw_df = spark.read.parquet("s3://raw-data/sales/")

transformed_df = raw_df \
    .dropDuplicates() \
    .filter(F.col("amount") > 0) \
    .withColumn("processed_date", F.current_date())

transformed_df.write \
    .mode("overwrite") \
    .partitionBy("year", "month") \
    .parquet("s3://processed-data/sales/")

Pattern 2: Incremental Load (Job Bookmarks)

# Incremental load using job bookmarks
source = glueContext.create_dynamic_frame.from_catalog(
    database="raw_db",
    table_name="sales",
    transformation_ctx="source"
)

# Job bookmark tracks processed data
processed = ApplyMapping.apply(
    frame=source,
    transformation_ctx="processed"
)

glueContext.write_dynamic_frame.from_options(
    frame=processed,
    connection_type="s3",
    connection_options={"path": "s3://processed-data/sales/"},
    format="parquet",
    transformation_ctx="sink"
)

job.commit()

Pattern 3: Slowly Changing Dimensions (SCD Type 2)

def apply_scd_type2(new_data, existing_data, key_columns):
    """Implement SCD Type 2 logic."""
    from pyspark.sql import Window
    
    # Add metadata columns
    new_df = new_data \
        .withColumn("effective_date", F.current_date()) \
        .withColumn("expiry_date", F.lit("9999-12-31").cast("date")) \
        .withColumn("is_current", F.lit(True))
    
    # Find changed records
    join_condition = [new_df[k] == existing_data[k] for k in key_columns]
    changes = new_df.join(existing_data, join_condition, "left_anti")
    
    # Expire old records
    expired = existing_data.join(new_df, join_condition, "inner") \
        .withColumn("expiry_date", F.current_date()) \
        .withColumn("is_current", F.lit(False))
    
    # Combine: unchanged + expired + new
    result = existing_data.join(new_df, join_condition, "left_anti") \
        .unionByName(expired) \
        .unionByName(changes)
    
    return result

Step Functions Orchestration

{
  "Comment": "ETL Pipeline Orchestration",
  "StartAt": "Extract",
  "States": {
    "Extract": {
      "Type": "Task",
      "Resource": "arn:aws:states:::glue:startJobRun.sync",
      "Parameters": {"JobName": "extract-job"},
      "Next": "Validate",
      "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "HandleError"}]
    },
    "Validate": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:...:function:validate-data",
      "Next": "Transform",
      "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "HandleError"}]
    },
    "Transform": {
      "Type": "Task",
      "Resource": "arn:aws:states:::glue:startJobRun.sync",
      "Parameters": {"JobName": "transform-job"},
      "Next": "Load",
      "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "HandleError"}]
    },
    "Load": {
      "Type": "Task",
      "Resource": "arn:aws:states:::glue:startJobRun.sync",
      "Parameters": {"JobName": "load-job"},
      "Next": "Notify",
      "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "HandleError"}]
    },
    "Notify": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:...:function:send-notification",
      "End": true
    },
    "HandleError": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:...:function:handle-error",
      "End": true
    }
  }
}

Interview Q&A

Q1: When should you use Glue vs. Lambda for ETL?

Answer: Glue for large datasets (>1GB), complex transformations, Spark-based processing. Lambda for small datasets (<1GB), simple transformations, event-driven triggers.

Q2: How do you implement idempotent ETL?

Answer: Use job bookmarks, timestamp-based partitioning, DynamoDB dedup, or overwrite mode with atomic writes.

Q3: What is the advantage of using Step Functions over simple Lambda chains?

Answer: Visual workflows, built-in error handling, retry logic, state tracking, human approval steps, and execution history.

Summary

ETL: Transform before loading - best for complex transformations
ELT: Load then transform - best for large datasets, data lakes
Glue: Serverless Spark for ETL with job bookmarks
Step Functions: Orchestrate multi-step ETL workflows
SCD Type 2: Track historical dimension changes
Idempotency: Ensure ETL jobs can be safely retried

ETL Pipeline Patterns on AWS