CI/CD for Data Pipelines
Master CI/CD patterns with CodePipeline, Glue versioning, and data pipeline testing.
Module: AWS Data Engineering • Topic 30 of 65 • Premium Content
CI/CD Architecture
Architecture Diagram
┌────────────────────────────────────────────────────────────────────────┐
│                        CI/CD FOR DATA PIPELINES                        │
│                                                                        │
│  ┌──────────────────────────────────────────────────────────────────┐  │
│  │                       CODECOMMIT (Source)                        │  │
│  │  ┌────────────────────────────────────────────────────────────┐  │  │
│  │  │ /data-pipelines/                                           │  │  │
│  │  │ ├── /glue-jobs/         (PySpark scripts)                  │  │  │
│  │  │ ├── /lambda-functions/  (Python handlers)                  │  │  │
│  │  │ ├── /step-functions/    (State machine definitions)        │  │  │
│  │  │ ├── /infrastructure/    (CloudFormation/Terraform)         │  │  │
│  │  │ └── /tests/             (Unit + Integration tests)         │  │  │
│  │  └────────────────────────────────────────────────────────────┘  │  │
│  └────────────────────────────────┬─────────────────────────────────┘  │
│                                   │                                    │
│                                   ▼                                    │
│  ┌──────────────────────────────────────────────────────────────────┐  │
│  │                   CODEPIPELINE (Orchestration)                   │  │
│  │                                                                  │  │
│  │  Source → Build → Test → Deploy-Staging → Approve → Deploy-Prod  │  │
│  │                                                                  │  │
│  │ ┌─────────┐  ┌─────────┐  ┌─────────┐  ┌─────────┐  ┌─────────┐  │  │
│  │ │ Source  │─▶│  Build  │─▶│  Test   │─▶│ Staging │─▶│  Prod   │  │  │
│  │ │ (S3/    │  │ (Code-  │  │ (pytest/│  │ (manual │  │ (auto-  │  │  │
│  │ │ Commit) │  │ Build)  │  │ Glue)   │  │ gate)   │  │ matic)  │  │  │
│  │ └─────────┘  └─────────┘  └─────────┘  └─────────┘  └─────────┘  │  │
│  └──────────────────────────────────────────────────────────────────┘  │
│                                                                        │
│  ┌──────────────────────────────────────────────────────────────────┐  │
│  │                        DEPLOYMENT TARGETS                        │  │
│  │      ┌──────────────┐   ┌──────────────┐   ┌──────────────┐      │  │
│  │      │  Glue Jobs   │   │    Lambda    │   │     Step     │      │  │
│  │      │              │   │  Functions   │   │  Functions   │      │  │
│  │      └──────────────┘   └──────────────┘   └──────────────┘      │  │
│  └──────────────────────────────────────────────────────────────────┘  │
└────────────────────────────────────────────────────────────────────────┘
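The six-stage flow in the diagram can be written down as a pipeline definition. The following is a minimal sketch, not a complete setup: it only builds a dictionary shaped like the `pipeline` argument of CodePipeline's `create_pipeline` call. The function name, the CodeBuild project names (`<name>-build`, `<name>-test`), the stack names, and the template path `infrastructure/template.yaml` are all assumptions made for illustration.

```python
def build_pipeline_definition(name, role_arn, cfn_role_arn, artifact_bucket,
                              repo, branch="main"):
    """Return a dict shaped like the `pipeline` argument of create_pipeline.

    All project, stack, and path names below are hypothetical placeholders.
    """

    def action(action_name, category, provider, configuration,
               inputs=(), outputs=()):
        # One action per stage keeps the sketch short; runOrder is always 1.
        return {
            "name": action_name,
            "actionTypeId": {
                "category": category,
                "owner": "AWS",
                "provider": provider,
                "version": "1",
            },
            "configuration": configuration,
            "inputArtifacts": [{"name": n} for n in inputs],
            "outputArtifacts": [{"name": n} for n in outputs],
            "runOrder": 1,
        }

    def deploy(env):
        # CloudFormation deploy of the template produced by the build stage.
        return action(
            f"Deploy-{env}", "Deploy", "CloudFormation",
            {
                "ActionMode": "CREATE_UPDATE",
                "StackName": f"{name}-{env.lower()}",
                "TemplatePath": "BuildOutput::infrastructure/template.yaml",
                "Capabilities": "CAPABILITY_NAMED_IAM",
                "RoleArn": cfn_role_arn,
            },
            inputs=["BuildOutput"],
        )

    stages = [
        ("Source", action("Source", "Source", "CodeCommit",
                          {"RepositoryName": repo, "BranchName": branch},
                          outputs=["SourceOutput"])),
        ("Build", action("Build", "Build", "CodeBuild",
                         {"ProjectName": f"{name}-build"},
                         inputs=["SourceOutput"], outputs=["BuildOutput"])),
        ("Test", action("Test", "Test", "CodeBuild",
                        {"ProjectName": f"{name}-test"},
                        inputs=["SourceOutput"])),
        ("Deploy-Staging", deploy("Staging")),
        # The manual approval gate sits between staging and production.
        ("Approve", action("ManualApproval", "Approval", "Manual", {})),
        ("Deploy-Prod", deploy("Prod")),
    ]

    return {
        "name": name,
        "roleArn": role_arn,
        "artifactStore": {"type": "S3", "location": artifact_bucket},
        "stages": [{"name": n, "actions": [a]} for n, a in stages],
    }
```

The returned dictionary would typically be passed as `pipeline=` to `boto3.client("codepipeline").create_pipeline(...)`; in practice many teams declare the same structure in CloudFormation or Terraform instead.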
Glue Job Versioning
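import boto3

glue = boto3.client('glue')

# Settings shared by create_job and update_job. update_job overwrites the
# whole job definition, so every setting is resent on each update.
job_settings = {
    'Role': 'arn:aws:iam::123456789012:role/GlueRole',
    'DefaultArguments': {
        '--job-language': 'python',
        '--TempDir': 's3://glue-temp/',
        '--enable-metrics': 'true',
        '--enable-continuous-cloudwatch-log': 'true',
        '--enable-job-insights': 'true',
    },
    'GlueVersion': '4.0',
    'NumberOfWorkers': 10,
    'WorkerType': 'G.1X',
}


def script_command(script_location):
    """Build the Command block for a Spark ETL job running the given script."""
    return {
        'Name': 'glueetl',
        'ScriptLocation': script_location,
        'PythonVersion': '3',
    }


# Create the job and record its version as a tag
response = glue.create_job(
    Name='process-sales-data',
    Command=script_command('s3://glue-scripts/process-sales-data.py'),
    Tags={'Version': 'v1.0', 'Environment': 'production'},
    **job_settings,
)

# Update the job to run the v2 script
glue.update_job(
    JobName='process-sales-data',
    JobUpdate={
        'Command': script_command('s3://glue-scripts/process-sales-data-v2.py'),
        **job_settings,
    },
)

# JobUpdate has no Tags field; tags are changed with tag_resource.
# The region in this ARN (us-east-1) is a placeholder.
glue.tag_resource(
    ResourceArn='arn:aws:glue:us-east-1:123456789012:job/process-sales-data',
    TagsToAdd={'Version': 'v2.0'},
)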
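Because `update_job` replaces the entire job definition, a deployment script usually reads the current definition first and changes only the script location. The sketch below shows one way to do that. The helper names `build_job_update` and `deploy_script_version` are hypothetical, and the allow-list of keys is an assumption about which fields are safe to send back; `get_job` also returns read-only fields such as `Name` and `CreatedOn` that `JobUpdate` does not accept.

```python
# Keys copied from the get_job response into JobUpdate. This allow-list is an
# assumption for illustration; extend it if your jobs use other settings.
UPDATABLE_KEYS = (
    "Description", "LogUri", "Role", "ExecutionProperty", "DefaultArguments",
    "NonOverridableArguments", "Connections", "MaxRetries", "Timeout",
    "WorkerType", "NumberOfWorkers", "SecurityConfiguration",
    "NotificationProperty", "GlueVersion",
)


def build_job_update(current_job, new_script_location):
    """Copy the updatable settings and point Command at the new script."""
    update = {k: current_job[k] for k in UPDATABLE_KEYS if k in current_job}
    update["Command"] = {
        **current_job["Command"],
        "ScriptLocation": new_script_location,
    }
    return update


def deploy_script_version(glue, job_name, new_script_location):
    """Read the current job, then resend it with only the script changed.

    `glue` is expected to behave like boto3.client("glue").
    """
    current = glue.get_job(JobName=job_name)["Job"]
    job_update = build_job_update(current, new_script_location)
    glue.update_job(JobName=job_name, JobUpdate=job_update)
    return job_update
```

Passing the client in as an argument keeps the helper easy to test with a stub, and rolling back is the same call with the previous script location.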
Testing Patterns
import pytest
from unittest.mock import Mock, patch


class TestGlueJob:
    def test_data_transformation(self):
        """Test data transformation logic."""
        from process_sales import transform_data

        input_data = [
            {'sale_id': 1, 'amount': 100, 'status': 'completed'},
            {'sale_id': 2, 'amount': -50, 'status': 'refunded'},
            {'sale_id': 3, 'amount': None, 'status': 'pending'}
        ]
        result = transform_data(input_data)

        # Only sale 1 has a positive, non-null amount; the other two are filtered
        assert len(result) == 1
        assert all(r['amount'] > 0 for r in result)

    def test_schema_validation(self):
        """Test output schema matches expectations."""
        from process_sales import get_output_schema

        schema = get_output_schema()
        expected_columns = ['sale_id', 'amount', 'status', 'processed_date']
        assert all(col in schema for col in expected_columns)

    # Patching boto3.client only works if write_to_s3 creates its client
    # when called, not when the process_sales module is imported.
    @patch('boto3.client')
    def test_s3_write(self, mock_boto_client):
        """Test S3 write operations."""
        from process_sales import write_to_s3

        mock_client = Mock()
        mock_boto_client.return_value = mock_client
        write_to_s3({'key': 'value'}, 'bucket', 'path/')

        mock_client.put_object.assert_called_once()
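The tests above import `transform_data` and `get_output_schema` from a `process_sales` module that is not shown. Purely as an illustration, here is a minimal sketch of what those two functions might look like. It assumes records are plain dictionaries, that a valid record has a non-null amount greater than zero (which is what the `amount > 0` assertion checks), and that `processed_date` is stamped as an ISO date string. The real Glue job would most likely express the same rule on a Spark DataFrame.

```python
from datetime import date

# Column names taken from the schema test; the order is an assumption.
OUTPUT_COLUMNS = ["sale_id", "amount", "status", "processed_date"]


def get_output_schema():
    """Return the list of column names every output record carries."""
    return list(OUTPUT_COLUMNS)


def transform_data(records, processed_date=None):
    """Drop records with a null or non-positive amount and stamp the rest."""
    stamp = (processed_date or date.today()).isoformat()
    result = []
    for record in records:
        amount = record.get("amount")
        if amount is None or amount <= 0:
            continue  # invalid record, filtered out
        result.append({**record, "processed_date": stamp})
    return result
```

Keeping the filter rule in a plain function like this is what makes it unit-testable without a Spark session or any AWS access.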
Interview Q&A
Q1: Why is CI/CD important for data pipelines?
Answer: CI/CD makes deployments repeatable, catches bugs before they reach production data, ties every deployed change to a commit in version control, and makes rollbacks safe. These properties are essential for reliable data operations.
Q2: How do you test ETL jobs?
Answer: Combine unit tests for transformation logic, integration tests against sample data, schema validation, and data quality checks. For Python-based Glue jobs, pytest is a natural fit.
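As an illustration of the data quality checks mentioned in this answer, here is a minimal sketch of a checker that could run in the Test stage against a sample of job output. The function name and the two rules it applies (no null required columns, no duplicate keys) are assumptions chosen for the example, not a fixed standard.

```python
def check_data_quality(rows, required_columns, key_column):
    """Return a list of human-readable issues; an empty list means clean."""
    issues = []
    seen_keys = set()
    for index, row in enumerate(rows):
        # Rule 1: every required column must be present and non-null.
        missing = [c for c in required_columns if row.get(c) is None]
        if missing:
            issues.append(f"row {index}: null or missing {missing}")
        # Rule 2: the key column must be unique across the batch.
        key = row.get(key_column)
        if key is not None:
            if key in seen_keys:
                issues.append(f"row {index}: duplicate {key_column}={key!r}")
            seen_keys.add(key)
    return issues
```

A pytest test can then assert that `check_data_quality(sample, ...)` returns an empty list, so a bad batch fails the pipeline before the deploy stages run.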
Q3: What is the blue-green deployment pattern?
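Answer: Run the new version (green) alongside the current version (blue), validate green, then switch all traffic to it in one step. If problems appear, switch back to blue, which is still deployed. Shifting traffic gradually is the related canary pattern. Both reduce deployment risk.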
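The switch-and-rollback mechanics can be modelled in a few lines. This is a toy sketch held entirely in memory: the class name is hypothetical, and for a data pipeline the "traffic" being switched is assumed to be which job name the scheduler or state machine invokes.

```python
class BlueGreenSwitch:
    """Tracks which of two deployed versions is live."""

    def __init__(self, blue, green):
        self.live = blue   # currently serving version
        self.idle = green  # newly deployed candidate

    def promote(self, validate):
        """Make the idle version live if `validate(idle)` returns True."""
        if not validate(self.idle):
            return False  # live version is left untouched
        self.live, self.idle = self.idle, self.live
        return True

    def rollback(self):
        """Swap back to the previous version, which is still deployed."""
        self.live, self.idle = self.idle, self.live
```

For example, `BlueGreenSwitch("process-sales-data-blue", "process-sales-data-green")` followed by `promote(run_smoke_test)` would point callers at the green job only if the smoke test passes; both job names and `run_smoke_test` are placeholders.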
Summary
- Source: CodeCommit for version control
- Build: CodeBuild for compilation and testing
- Test: pytest for ETL job testing
- Deploy: CodePipeline for orchestration
- Versioning: Tag-based or script location versioning
- Secrets: Secrets Manager for secure credential storage
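The Secrets Manager point in the summary can be illustrated with a short helper. This is a sketch under two assumptions: the secret is stored as a JSON string, and the caller passes in a client that behaves like `boto3.client("secretsmanager")`. The function name and the secret ID used in the usage note are hypothetical.

```python
import json


def load_credentials(secrets_client, secret_id):
    """Fetch a secret and parse its JSON payload into a dict.

    `secrets_client` is expected to expose get_secret_value(SecretId=...),
    as the boto3 Secrets Manager client does.
    """
    response = secrets_client.get_secret_value(SecretId=secret_id)
    return json.loads(response["SecretString"])
```

A Glue job or Lambda handler could call `load_credentials(boto3.client("secretsmanager"), "prod/sales-db")` at startup, which keeps credentials out of the repository and out of job arguments.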