ETL vs ELT Patterns
Architecture Diagram
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β ETL vs ELT PATTERNS β
β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β ETL (Extract, Transform, Load) β β
β β β β
β β Source βββΊ Extract βββΊ Transform βββΊ Load βββΊ Destination β β
β β β β β β β β
β β S3/DB Glue/Lambda Redshift Warehouse β β
β β β β
β β Best for: Small-medium datasets, complex transformations β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β ELT (Extract, Load, Transform) β β
β β β β
β β Source βββΊ Extract βββΊ Load βββΊ Transform βββΊ Result β β
β β β β β β β β
β β S3/DB S3/Redshift SQL/Lambda Views β β
β β β β
β β Best for: Large datasets, simple transformations, data lakes β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Glue ETL Job Patterns
Pattern 1: Full Load
# Full load ETL job
raw_df = spark.read.parquet("s3://raw-data/sales/")
transformed_df = raw_df \
.dropDuplicates() \
.filter(F.col("amount") > 0) \
.withColumn("processed_date", F.current_date())
transformed_df.write \
.mode("overwrite") \
.partitionBy("year", "month") \
.parquet("s3://processed-data/sales/")
Pattern 2: Incremental Load (Job Bookmarks)
# Incremental load using job bookmarks
source = glueContext.create_dynamic_frame.from_catalog(
database="raw_db",
table_name="sales",
transformation_ctx="source"
)
# Job bookmark tracks processed data
processed = ApplyMapping.apply(
frame=source,
transformation_ctx="processed"
)
glueContext.write_dynamic_frame.from_options(
frame=processed,
connection_type="s3",
connection_options={"path": "s3://processed-data/sales/"},
format="parquet",
transformation_ctx="sink"
)
job.commit()
Pattern 3: Slowly Changing Dimensions (SCD Type 2)
def apply_scd_type2(new_data, existing_data, key_columns):
"""Implement SCD Type 2 logic."""
from pyspark.sql import Window
# Add metadata columns
new_df = new_data \
.withColumn("effective_date", F.current_date()) \
.withColumn("expiry_date", F.lit("9999-12-31").cast("date")) \
.withColumn("is_current", F.lit(True))
# Find changed records
join_condition = [new_df[k] == existing_data[k] for k in key_columns]
changes = new_df.join(existing_data, join_condition, "left_anti")
# Expire old records
expired = existing_data.join(new_df, join_condition, "inner") \
.withColumn("expiry_date", F.current_date()) \
.withColumn("is_current", F.lit(False))
# Combine: unchanged + expired + new
result = existing_data.join(new_df, join_condition, "left_anti") \
.unionByName(expired) \
.unionByName(changes)
return result
Step Functions Orchestration
{
"Comment": "ETL Pipeline Orchestration",
"StartAt": "Extract",
"States": {
"Extract": {
"Type": "Task",
"Resource": "arn:aws:states:::glue:startJobRun.sync",
"Parameters": {"JobName": "extract-job"},
"Next": "Validate",
"Catch": [{"ErrorEquals": ["States.ALL"], "Next": "HandleError"}]
},
"Validate": {
"Type": "Task",
"Resource": "arn:aws:lambda:...:function:validate-data",
"Next": "Transform",
"Catch": [{"ErrorEquals": ["States.ALL"], "Next": "HandleError"}]
},
"Transform": {
"Type": "Task",
"Resource": "arn:aws:states:::glue:startJobRun.sync",
"Parameters": {"JobName": "transform-job"},
"Next": "Load",
"Catch": [{"ErrorEquals": ["States.ALL"], "Next": "HandleError"}]
},
"Load": {
"Type": "Task",
"Resource": "arn:aws:states:::glue:startJobRun.sync",
"Parameters": {"JobName": "load-job"},
"Next": "Notify",
"Catch": [{"ErrorEquals": ["States.ALL"], "Next": "HandleError"}]
},
"Notify": {
"Type": "Task",
"Resource": "arn:aws:lambda:...:function:send-notification",
"End": true
},
"HandleError": {
"Type": "Task",
"Resource": "arn:aws:lambda:...:function:handle-error",
"End": true
}
}
}
Interview Q&A
Q1: When should you use Glue vs. Lambda for ETL?
Answer: Glue for large datasets (>1GB), complex transformations, Spark-based processing. Lambda for small datasets (<1GB), simple transformations, event-driven triggers.
Q2: How do you implement idempotent ETL?
Answer: Use job bookmarks, timestamp-based partitioning, DynamoDB dedup, or overwrite mode with atomic writes.
Q3: What is the advantage of using Step Functions over simple Lambda chains?
Answer: Visual workflows, built-in error handling, retry logic, state tracking, human approval steps, and execution history.
Summary
- ETL: Transform before loading - best for complex transformations
- ELT: Load then transform - best for large datasets, data lakes
- Glue: Serverless Spark for ETL with job bookmarks
- Step Functions: Orchestrate multi-step ETL workflows
- SCD Type 2: Track historical dimension changes
- Idempotency: Ensure ETL jobs can be safely retried