πŸŽ‰ 75% of content is free forever β€” Unlock Premium from $10/mo β†’
CW
Search courses…
πŸ’Ό Servicesℹ️ Aboutβœ‰οΈ ContactView Pricing Plansfrom $10

ETL Pipeline Patterns on AWS

AWS Data EngineeringETL Patterns with Glue, Step Functions & Lambda⭐ Premium

Advertisement

πŸ”„ ETL Pipeline Patterns

Master ETL patterns with AWS Glue, Step Functions, and Lambda orchestration.

Module: AWS Data Engineering β€’ Topic 18 of 65 β€’ Premium Content

ETL vs ELT Patterns

Architecture Diagram
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    ETL vs ELT PATTERNS                                       β”‚
β”‚                                                                             β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”‚
β”‚  β”‚  ETL (Extract, Transform, Load)                                      β”‚    β”‚
β”‚  β”‚                                                                     β”‚    β”‚
β”‚  β”‚  Source ──► Extract ──► Transform ──► Load ──► Destination          β”‚    β”‚
β”‚  β”‚              β”‚            β”‚            β”‚            β”‚               β”‚    β”‚
β”‚  β”‚           S3/DB       Glue/Lambda    Redshift    Warehouse         β”‚    β”‚
β”‚  β”‚                                                                     β”‚    β”‚
β”‚  β”‚  Best for: Small-medium datasets, complex transformations          β”‚    β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚
β”‚                                                                             β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”‚
β”‚  β”‚  ELT (Extract, Load, Transform)                                      β”‚    β”‚
β”‚  β”‚                                                                     β”‚    β”‚
β”‚  β”‚  Source ──► Extract ──► Load ──► Transform ──► Result               β”‚    β”‚
β”‚  β”‚              β”‚            β”‚            β”‚            β”‚               β”‚    β”‚
β”‚  β”‚           S3/DB         S3/Redshift  SQL/Lambda   Views            β”‚    β”‚
β”‚  β”‚                                                                     β”‚    β”‚
β”‚  β”‚  Best for: Large datasets, simple transformations, data lakes       β”‚    β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Glue ETL Job Patterns

Pattern 1: Full Load

# Full load ETL job
raw_df = spark.read.parquet("s3://raw-data/sales/")

transformed_df = raw_df \
    .dropDuplicates() \
    .filter(F.col("amount") > 0) \
    .withColumn("processed_date", F.current_date())

transformed_df.write \
    .mode("overwrite") \
    .partitionBy("year", "month") \
    .parquet("s3://processed-data/sales/")

Pattern 2: Incremental Load (Job Bookmarks)

# Incremental load using job bookmarks
source = glueContext.create_dynamic_frame.from_catalog(
    database="raw_db",
    table_name="sales",
    transformation_ctx="source"
)

# Job bookmark tracks processed data
processed = ApplyMapping.apply(
    frame=source,
    transformation_ctx="processed"
)

glueContext.write_dynamic_frame.from_options(
    frame=processed,
    connection_type="s3",
    connection_options={"path": "s3://processed-data/sales/"},
    format="parquet",
    transformation_ctx="sink"
)

job.commit()

Pattern 3: Slowly Changing Dimensions (SCD Type 2)

def apply_scd_type2(new_data, existing_data, key_columns):
    """Implement SCD Type 2 logic."""
    from pyspark.sql import Window
    
    # Add metadata columns
    new_df = new_data \
        .withColumn("effective_date", F.current_date()) \
        .withColumn("expiry_date", F.lit("9999-12-31").cast("date")) \
        .withColumn("is_current", F.lit(True))
    
    # Find changed records
    join_condition = [new_df[k] == existing_data[k] for k in key_columns]
    changes = new_df.join(existing_data, join_condition, "left_anti")
    
    # Expire old records
    expired = existing_data.join(new_df, join_condition, "inner") \
        .withColumn("expiry_date", F.current_date()) \
        .withColumn("is_current", F.lit(False))
    
    # Combine: unchanged + expired + new
    result = existing_data.join(new_df, join_condition, "left_anti") \
        .unionByName(expired) \
        .unionByName(changes)
    
    return result

Step Functions Orchestration

{
  "Comment": "ETL Pipeline Orchestration",
  "StartAt": "Extract",
  "States": {
    "Extract": {
      "Type": "Task",
      "Resource": "arn:aws:states:::glue:startJobRun.sync",
      "Parameters": {"JobName": "extract-job"},
      "Next": "Validate",
      "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "HandleError"}]
    },
    "Validate": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:...:function:validate-data",
      "Next": "Transform",
      "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "HandleError"}]
    },
    "Transform": {
      "Type": "Task",
      "Resource": "arn:aws:states:::glue:startJobRun.sync",
      "Parameters": {"JobName": "transform-job"},
      "Next": "Load",
      "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "HandleError"}]
    },
    "Load": {
      "Type": "Task",
      "Resource": "arn:aws:states:::glue:startJobRun.sync",
      "Parameters": {"JobName": "load-job"},
      "Next": "Notify",
      "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "HandleError"}]
    },
    "Notify": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:...:function:send-notification",
      "End": true
    },
    "HandleError": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:...:function:handle-error",
      "End": true
    }
  }
}

Interview Q&A

Q1: When should you use Glue vs. Lambda for ETL?

Answer: Glue for large datasets (>1GB), complex transformations, Spark-based processing. Lambda for small datasets (<1GB), simple transformations, event-driven triggers.

Q2: How do you implement idempotent ETL?

Answer: Use job bookmarks, timestamp-based partitioning, DynamoDB dedup, or overwrite mode with atomic writes.

Q3: What is the advantage of using Step Functions over simple Lambda chains?

Answer: Visual workflows, built-in error handling, retry logic, state tracking, human approval steps, and execution history.

Summary

  • ETL: Transform before loading - best for complex transformations
  • ELT: Load then transform - best for large datasets, data lakes
  • Glue: Serverless Spark for ETL with job bookmarks
  • Step Functions: Orchestrate multi-step ETL workflows
  • SCD Type 2: Track historical dimension changes
  • Idempotency: Ensure ETL jobs can be safely retried

Advertisement