Data Quality on AWS
Master Glue DataBrew, validation patterns, and data quality frameworks.
Module: AWS Data Engineering • Topic 24 of 65 • Premium Content
Data Quality Framework
Architecture Diagram
DATA QUALITY FRAMEWORK

  VALIDATION LAYERS
    1. COMPLETENESS: Are all expected records present?
    2. ACCURACY:     Do values match expected ranges?
    3. CONSISTENCY:  Are values consistent across sources?
    4. TIMELINESS:   Is data arriving within SLA?
    5. UNIQUENESS:   Are there duplicate records?
    6. VALIDITY:     Do values conform to formats/rules?

  AWS SERVICES
    [ Glue DataBrew ]    [ Lambda Validation ]    [ CloudWatch Metrics ]
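To make the six validation layers concrete, here is a minimal pure-Python sketch that scores a small batch on each dimension. The record fields, `EXPECTED_ROWS`, `SOURCE_TOTAL`, the 24-hour SLA, and the amount range are all hypothetical values chosen for illustration; a real pipeline would take them from its own contracts.

```python
import re
from datetime import datetime, timedelta, timezone

# Hypothetical batch; field names and values are illustrative only.
NOW = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
records = [
    {"order_id": "A1", "amount": 25.0, "email": "a@example.com", "loaded_at": NOW - timedelta(hours=1)},
    {"order_id": "A2", "amount": -5.0, "email": "b@example.com", "loaded_at": NOW - timedelta(hours=2)},
    {"order_id": "A2", "amount": 40.0, "email": "not-an-email", "loaded_at": NOW - timedelta(hours=30)},
    {"order_id": "A3", "amount": None, "email": "c@example.com", "loaded_at": NOW - timedelta(hours=3)},
]
EXPECTED_ROWS = 5            # assumed row count reported by the source system
SOURCE_TOTAL = 60.0          # assumed sum of amounts reported by the source system
SLA = timedelta(hours=24)    # assumed freshness requirement
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def fraction(rows, predicate):
    """Share of rows for which predicate(row) is true."""
    return sum(1 for row in rows if predicate(row)) / len(rows)


loaded_total = sum(r["amount"] for r in records if r["amount"] is not None)

scores = {
    # 1. Completeness: rows received versus rows expected
    "completeness": len(records) / EXPECTED_ROWS,
    # 2. Accuracy: amount present and inside the expected range
    "accuracy": fraction(records, lambda r: r["amount"] is not None and 0 < r["amount"] <= 10_000),
    # 3. Consistency: loaded total agrees with the source system's total
    "consistency": 1.0 if abs(loaded_total - SOURCE_TOTAL) < 0.01 else 0.0,
    # 4. Timeliness: row arrived within the SLA window
    "timeliness": fraction(records, lambda r: NOW - r["loaded_at"] <= SLA),
    # 5. Uniqueness: distinct keys versus total rows
    "uniqueness": len({r["order_id"] for r in records}) / len(records),
    # 6. Validity: value conforms to the expected format
    "validity": fraction(records, lambda r: bool(EMAIL_RE.match(r["email"]))),
}

for name, score in scores.items():
    print(f"{name:13s} {score:.2f}")
```

Each score is a ratio between 0 and 1, so the same numbers can later be compared against thresholds or published as metrics.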
DataBrew Profile Job
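import boto3

databrew = boto3.client('databrew')

# Create a profile job that also validates the dataset against a ruleset.
# The ruleset must already exist and target the same dataset.
response = databrew.create_profile_job(
    DatasetName='sales-data',
    Name='sales-quality-profile',
    RoleArn='arn:aws:iam::123456789012:role/DataBrewRole',
    OutputLocation={
        'Bucket': 'data-quality-reports',
        'Key': 'profiles/sales/'
    },
    # Profile every row instead of a sample
    JobSample={'Mode': 'FULL_DATASET'},
    # Rulesets are attached through ValidationConfigurations;
    # CHECK_ALL evaluates every rule in the ruleset
    ValidationConfigurations=[
        {
            'RulesetArn': 'arn:aws:databrew:us-east-1:123456789012:ruleset/sales-validation-rules',
            'ValidationMode': 'CHECK_ALL'
        }
    ],
    Tags={'Team': 'data-engineering'}
)

# Start the profile job; the response carries the RunId of the new run
run = databrew.start_job_run(
    Name='sales-quality-profile'
)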
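`start_job_run` returns as soon as the run is queued, so a caller usually has to wait for the result. The helper below is a minimal sketch of that wait loop using the DataBrew `describe_job_run` call; the function name `wait_for_job_run`, the polling interval, and the retry cap are assumptions for illustration, not part of the original example.

```python
import time

# States after which a DataBrew job run will not change any further
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT"}


def wait_for_job_run(client, job_name, run_id, poll_seconds=30, max_polls=120, sleep=time.sleep):
    """Poll describe_job_run until the run reaches a terminal state.

    `client` is expected to be a boto3 DataBrew client (or any object with a
    compatible describe_job_run method, which keeps the helper testable).
    """
    for _ in range(max_polls):
        state = client.describe_job_run(Name=job_name, RunId=run_id)["State"]
        if state in TERMINAL_STATES:
            return state
        sleep(poll_seconds)
    raise TimeoutError(f"{job_name} run {run_id} did not finish after {max_polls} polls")


# Usage with a real client (requires AWS credentials):
#   import boto3
#   databrew = boto3.client("databrew")
#   run = databrew.start_job_run(Name="sales-quality-profile")
#   state = wait_for_job_run(databrew, "sales-quality-profile", run["RunId"])
```

Passing the client and the `sleep` function as arguments is a design choice that lets the loop be exercised with a stub instead of a live AWS account.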
Validation Rules
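import boto3

databrew = boto3.client('databrew')

# Create a ruleset attached to the dataset it validates.
# Columns and values are referenced through ':' substitution variables that
# are defined in SubstitutionMap (column names in backticks).
# NOTE: the condition names used here (is_not_missing, is_between) are
# assumed from DataBrew's check-expression syntax; confirm them against the
# "Available checks" documentation before use.
databrew.create_ruleset(
    Name='sales-validation-rules',
    Description='Validation rules for sales data',
    # TargetArn is required: the ARN of the DataBrew dataset being validated
    TargetArn='arn:aws:databrew:us-east-1:123456789012:dataset/sales-data',
    Rules=[
        {
            # At least 99% of rows must have an amount
            'Name': 'not-null-amount',
            'CheckExpression': ':col1 is_not_missing',
            'SubstitutionMap': {':col1': '`amount`'},
            'Threshold': {
                'Value': 99.0,
                'Type': 'GREATER_THAN_OR_EQUAL',
                'Unit': 'PERCENTAGE'
            }
        },
        {
            # At least 99.5% of rows must have a positive amount
            'Name': 'positive-amount',
            'CheckExpression': ':col1 > :val1',
            'SubstitutionMap': {':col1': '`amount`', ':val1': '0'},
            'Threshold': {
                'Value': 99.5,
                'Type': 'GREATER_THAN_OR_EQUAL',
                'Unit': 'PERCENTAGE'
            }
        },
        {
            # Every row must have a sale date inside the accepted window
            'Name': 'valid-date',
            'CheckExpression': ':col1 is_between :val1 and :val2',
            'SubstitutionMap': {
                ':col1': '`sale_date`',
                ':val1': "'2020-01-01'",
                ':val2': "'2025-12-31'"
            },
            'Threshold': {
                'Value': 100.0,
                'Type': 'GREATER_THAN_OR_EQUAL',
                'Unit': 'PERCENTAGE'
            }
        }
    ]
)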
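The percentage thresholds in the rules above can be read as "the share of rows passing the check must meet this bar". The sketch below illustrates that idea in plain Python; it is not DataBrew's internal implementation, and the helper name `rule_passes` and the sample rows are hypothetical.

```python
import operator

# Comparison names mirror the threshold types used in the rules above
COMPARISONS = {
    "GREATER_THAN_OR_EQUAL": operator.ge,
    "GREATER_THAN": operator.gt,
    "LESS_THAN_OR_EQUAL": operator.le,
    "LESS_THAN": operator.lt,
}


def rule_passes(rows, predicate, threshold_value, comparison="GREATER_THAN_OR_EQUAL"):
    """Return (passed, pass_percentage) for one row-level rule."""
    if not rows:
        return False, 0.0
    passing = sum(1 for row in rows if predicate(row))
    pass_pct = 100.0 * passing / len(rows)
    return COMPARISONS[comparison](pass_pct, threshold_value), pass_pct


# Hypothetical data: 200 rows, one missing amount and one negative amount
rows = [{"amount": 10.0} for _ in range(200)]
rows[0] = {"amount": None}
rows[1] = {"amount": -1.0}

not_null = rule_passes(rows, lambda r: r["amount"] is not None, 99.0)
positive = rule_passes(rows, lambda r: r["amount"] is not None and r["amount"] > 0, 99.5)

print("not-null-amount:", not_null)
print("positive-amount:", positive)
```

With this data the not-null rule clears its 99% bar (199 of 200 rows pass), while the positive-amount rule misses its 99.5% bar (198 of 200 rows pass), which shows how a single bad row can matter once thresholds are tight.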
Interview Q&A
Q1: What are the 6 dimensions of data quality?
Answer: Completeness, Accuracy, Consistency, Timeliness, Uniqueness, Validity. Each measures a different aspect of data trustworthiness.
Q2: How does DataBrew help with data quality?
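Answer: DataBrew provides data profiling, built-in transformations, and validation rules. A profile job computes per-column statistics (such as missing values, distinct values, and value distributions) and can evaluate a ruleset against the dataset, so quality issues surface without hand-written checks.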
Q3: How do you implement data quality in Glue ETL?
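Answer: Add validation steps inside the ETL job (for example, row-count and null checks before writing output), evaluate DataBrew rulesets against the data, run Lambda validation functions on incoming data, and publish quality metrics to CloudWatch with alarms on them.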
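As one way to picture the Lambda and CloudWatch parts of that answer, here is a minimal sketch of a handler that validates records and publishes the pass rate as a custom metric through `put_metric_data`. The event shape, the `DataQuality` namespace, the `ValidRecordPercent` metric name, and the validation rule itself are assumptions made for illustration.

```python
def is_valid(record):
    """Hypothetical rule: amount is a positive number and sale_date is present."""
    amount = record.get("amount")
    return isinstance(amount, (int, float)) and amount > 0 and bool(record.get("sale_date"))


def build_metric_data(dataset, valid, total):
    """Build the MetricData payload for CloudWatch put_metric_data."""
    percent = 100.0 * valid / total if total else 0.0
    return [{
        "MetricName": "ValidRecordPercent",          # assumed metric name
        "Dimensions": [{"Name": "Dataset", "Value": dataset}],
        "Value": percent,
        "Unit": "Percent",
    }]


def lambda_handler(event, context, cloudwatch=None):
    """Validate event['records'] and publish the pass rate to CloudWatch."""
    if cloudwatch is None:
        import boto3  # imported lazily so the sketch can run without AWS in tests
        cloudwatch = boto3.client("cloudwatch")

    records = event.get("records", [])
    valid = sum(1 for record in records if is_valid(record))
    metric_data = build_metric_data(event.get("dataset", "unknown"), valid, len(records))

    # "DataQuality" is an assumed custom namespace
    cloudwatch.put_metric_data(Namespace="DataQuality", MetricData=metric_data)
    return {"valid": valid, "total": len(records), "valid_percent": metric_data[0]["Value"]}
```

A CloudWatch alarm on `ValidRecordPercent` (for example, alerting when it drops below an agreed level) would then turn the metric into a notification.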
Summary
- Dimensions: Completeness, Accuracy, Consistency, Timeliness, Uniqueness, Validity
- DataBrew: Profiling, validation rules, built-in transformations
- Rulesets: Define validation expressions with pass/fail thresholds
- Monitoring: CloudWatch metrics for quality scores and alerts
- Best Practice: Validate at ingestion, transformation, and serving layers