Step Functions Architecture
Architecture Diagram
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β STEP FUNCTIONS ARCHITECTURE β
β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β STATE MACHINE DEFINITION β β
β β β β
β β ββββββββββββ ββββββββββββ ββββββββββββ ββββββββββββ β β
β β β Start βββββΊβ Validate βββββΊβ Process βββββΊβ Store β β β
β β β State β β Data β β Data β β Results β β β
β β ββββββββββββ ββββββ¬ββββββ ββββββ¬ββββββ ββββββ¬ββββββ β β
β β β β β β β
β β βΌ βΌ βΌ β β
β β ββββββββββββ ββββββββββββ ββββββββββββ β β
β β β Error β β Retry β β Success β β β
β β β Handler β β Logic β β State β β β
β β ββββββββββββ ββββββββββββ ββββββββββββ β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β WORKFLOW TYPES β β
β β β β
β β βββββββββββββββββββββββ βββββββββββββββββββββββ β β
β β β STANDARD WORKFLOW β β EXPRESS WORKFLOW β β β
β β β β β β β β
β β β β’ Long-running β β β’ High-volume β β β
β β β β’ Up to 1 year β β β’ Up to 5 min β β β
β β β β’ Human approval β β β’ Event processing β β β
β β β β’ Audit history β β β’ Sub-second β β β
β β β β’ $0.025/1000 β β β’ $1 per million β β β
β β β state transitionsβ β executions β β β
β β βββββββββββββββββββββββ βββββββββββββββββββββββ β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
State Machine Definition
{
"Comment": "ETL Pipeline Workflow",
"StartAt": "ValidateInput",
"States": {
"ValidateInput": {
"Type": "Task",
"Resource": "arn:aws:lambda:us-east-1:123456789012:function:validate-data",
"Parameters": {
"bucket.$": "$.source_bucket",
"key.$": "$.source_key"
},
"ResultPath": "$.validation",
"Next": "CheckValidation",
"Retry": [
{
"ErrorEquals": ["States.TaskFailed"],
"IntervalSeconds": 3,
"MaxAttempts": 3,
"BackoffRate": 2
}
],
"Catch": [
{
"ErrorEquals": ["States.ALL"],
"Next": "HandleError",
"ResultPath": "$.error"
}
]
},
"CheckValidation": {
"Type": "Choice",
"Choices": [
{
"Variable": "$.validation.isValid",
"BooleanEquals": true,
"Next": "ProcessData"
}
],
"Default": "ValidationFailed"
},
"ProcessData": {
"Type": "Task",
"Resource": "arn:aws:states:::glue:startJobRun.sync",
"Parameters": {
"JobName": "process-data-job",
"Arguments": {
"--source_path.$": "$.source_path",
"--output_path.$": "$.output_path"
}
},
"ResultPath": "$.glueResult",
"Next": "CheckJobStatus",
"Retry": [
{
"ErrorEquals": ["Glue.ConcurrentRunsExceededException"],
"IntervalSeconds": 60,
"MaxAttempts": 5,
"BackoffRate": 2
}
]
},
"CheckJobStatus": {
"Type": "Choice",
"Choices": [
{
"Variable": "$.glueResult.JobRunStatus",
"StringEquals": "SUCCEEDED",
"Next": "StoreResults"
}
],
"Default": "HandleError"
},
"StoreResults": {
"Type": "Task",
"Resource": "arn:aws:lambda:us-east-1:123456789012:function:store-results",
"Next": "SuccessState"
},
"ValidationFailed": {
"Type": "Fail",
"Cause": "Data validation failed",
"Error": "ValidationError"
},
"HandleError": {
"Type": "Task",
"Resource": "arn:aws:lambda:us-east-1:123456789012:function:handle-error",
"Next": "ErrorState"
},
"SuccessState": {
"Type": "Succeed"
},
"ErrorState": {
"Type": "Fail",
"Cause": "Pipeline failed",
"Error": "PipelineError"
}
}
}
Step Functions with Distributed Map
{
"Comment": "Process large dataset with Distributed Map",
"StartAt": "ListFiles",
"States": {
"ListFiles": {
"Type": "Task",
"Resource": "arn:aws:lambda:us-east-1:123456789012:function:list-files",
"ResultPath": "$.files",
"Next": "ProcessAllFiles"
},
"ProcessAllFiles": {
"Type": "Map",
"ItemProcessor": {
"ProcessorConfig": {
"Mode": "DISTRIBUTED",
"ExecutionType": "EXPRESS"
},
"StartAt": "ProcessFile",
"States": {
"ProcessFile": {
"Type": "Task",
"Resource": "arn:aws:lambda:us-east-1:123456789012:function:process-file",
"End": true
}
}
},
"ItemsPath": "$.files",
"MaxConcurrency": 100,
"ResultPath": "$.results",
"End": true
}
}
}
βΉοΈ
Pro Tip: Use Distributed Map for processing large datasets (millions of files). It parallelizes Lambda invocations across thousands of concurrent executions.
Error Handling Patterns
Architecture Diagram
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β ERROR HANDLING PATTERNS β
β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β RETRY PATTERN β β
β β β β
β β Task βββΊ Fail βββΊ Wait βββΊ Retry βββΊ Fail βββΊ Wait βββΊ Retry β β
β β (3s) (6s) (12s) β β
β β β β
β β Config: MaxAttempts=3, BackoffRate=2 β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β CATCH PATTERN β β
β β β β
β β Task βββΊ Error βββΊ Catch Handler βββΊ Fallback Task β β
β β β β
β β ErrorEquals: ["States.TaskFailed", "CustomError"] β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β CHOICE PATTERN β β
β β β β
β β Decision βββΊ Condition True βββΊ Task A β β
β β β β β
β β βββββββββ Condition False βββΊ Task B β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Interview Questions & Answers
Q1: What is the difference between Standard and Express workflows?
Answer:
- Standard: Up to 1 year, $0.025/1000 transitions, audit history
- Express: Up to 5 min, $1/million executions, high volume
Use Standard for long-running pipelines with human approval. Use Express for high-volume event processing.
Q2: How do you handle idempotency in Step Functions?
Answer:
- Use unique execution IDs
- Check if result exists before processing
- Use DynamoDB for state tracking
- Implement conditional writes
Q3: What is Distributed Map?
Answer: Distributed Map allows Step Functions to orchestrate large-scale parallel processing:
- Up to 10,000 parallel Lambda invocations
- Process millions of files
- Results aggregated automatically
- Great for ETL over large datasets
Q4: How do you debug failed Step Functions executions?
Answer:
- CloudWatch Logs for Lambda errors
- Step Functions execution history
- X-Ray tracing for distributed tracing
- Input/Output inspection at each state
Q5: When should you use Step Functions vs. other orchestration tools?
Answer:
- Step Functions: AWS-native, serverless, visual workflows
- Airflow: Complex DAGs, cross-cloud, Python-based
- Glue Workflows: Glue-specific orchestration
Use Step Functions for serverless AWS workflows. Use Airflow for complex multi-service orchestration.
Summary
Step Functions is AWS's serverless orchestration service. Key takeaways:
- Standard: Long-running workflows (up to 1 year), audit trail
- Express: High-volume, short-duration (up to 5 min), cost-effective
- Distributed Map: Parallel processing of large datasets
- Error Handling: Retry, catch, choice states for robust workflows
- Integration: Native integration with Lambda, Glue, EMR, and more
- Pricing: Pay per state transition or execution