Data Quality Testing
Testing Architecture
Architecture Diagram
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β DATA QUALITY TESTING ARCHITECTURE β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β QUALITY DIMENSIONS β β
β β β β
β β ββββββββββββββββ ββββββββββββββββ ββββββββββββββββββββββββββββ β β
β β β ACCURACY β β COMPLETENESSβ β CONSISTENCY β β β
β β β β β β β β β β
β β β β’ Valid data β β β’ No missing β β β’ Same data everywhere β β β
β β β β’ Correct β β β’ Required β β β’ Aligned values β β β
β β β values β β fields β β β’ Cross-system sync β β β
β β ββββββββββββββββ ββββββββββββββββ ββββββββββββββββββββββββββββ β β
β β β β
β β ββββββββββββββββ ββββββββββββββββ ββββββββββββββββββββββββββββ β β
β β β TIMELINESS β β UNIQUENESS β β VALIDITY β β β
β β β β β β β β β β
β β β β’ Fresh data β β β’ No dups β β β’ Format correct β β β
β β β β’ SLA met β β β’ Distinct β β β’ Range valid β β β
β β β β’ Update β β keys β β β’ Rules met β β β
β β β frequency β β β β β β β
β β ββββββββββββββββ ββββββββββββββββ ββββββββββββββββββββββββββββ β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β β
β βΌ β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β TEST EXECUTION β β
β β β β
β β βββββββββββ βββββββββββ βββββββββββ βββββββββββ β β
β β β RUN βββββΆβ COLLECT βββββΆβ ANALYZE βββββΆβ REPORT β β β
β β β TESTS β β RESULTS β β FAILS β β OUTPUT β β β
β β βββββββββββ βββββββββββ βββββββββββ βββββββββββ β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Test Categories
Architecture Diagram
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β TEST CATEGORIES AND COVERAGE β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β STRUCTURAL TESTS β β
β β β β
β β β’ Schema validation β’ Column existence β’ Data type checks β β
β β β’ Table existence β’ View definitions β’ Index verification β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β β
β βΌ β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β CONTENT TESTS β β
β β β β
β β β’ Uniqueness checks β’ Not-null validation β’ Accepted values β β
β β β’ Range validation β’ Pattern matching β’ Statistical tests β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β β
β βΌ β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β REFERENTIAL TESTS β β
β β β β
β β β’ Foreign key validity β’ Orphan detection β’ Relationship tests β β
β β β’ Cross-table sync β’ ID mapping β’ Link validation β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β β
β βΌ β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β FRESHNESS TESTS β β
β β β β
β β β’ Source freshness β’ Update frequency β’ SLA monitoring β β
β β β’ Latency checks β’ Delay detection β’ Alert thresholds β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Test Results Dashboard
Architecture Diagram
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β TEST RESULTS SUMMARY β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β MODEL: fct_orders β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β STRUCTURAL TESTS β β
β β β Table exists β β
β β β All columns present β β
β β β Data types correct β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β CONTENT TESTS β β
β β β unique_order_id (0 duplicates) β β
β β β not_null_order_id (0 nulls) β β
β β β not_null_customer_id (0 nulls) β β
β β β valid_status (all values valid) β β
β β β positive_amount (142 negative values) FAILED β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β REFERENTIAL TESTS β β
β β β fk_customer_id (all valid) β β
β β β fk_product_id (all valid) β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β FRESHNESS TESTS β β
β β β source_freshness (last update: 2 hours ago) β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
β SUMMARY: 12 passed, 1 failed, 0 errors β
β COVERAGE: 92% β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Detailed Explanation
Data quality testing in dbt ensures that your transformed data meets business requirements and maintains integrity across the pipeline.
Quality Dimensions
- Accuracy: Data correctly represents real-world entities
- Completeness: All required data is present
- Consistency: Data is uniform across systems
- Timeliness: Data is available when needed
- Uniqueness: No duplicate records exist
- Validity: Data conforms to defined formats and rules
Test Types
Structural Tests
Validate the structure of your data:
- Table existence
- Column presence
- Data type correctness
Content Tests
Validate the content of your data:
- Uniqueness constraints
- Not-null requirements
- Accepted value ranges
- Pattern matching
Referential Tests
Validate relationships between tables:
- Foreign key validity
- Orphan detection
- Cross-table consistency
Freshness Tests
Validate data timeliness:
- Source freshness checks
- Update frequency monitoring
- SLA compliance
Test Execution
dbt tests work by:
- Compiling test SQL queries
- Executing against the target warehouse
- Checking for failing rows
- Reporting pass/fail results
Custom Test Patterns
Custom tests enable domain-specific validations:
- Business rule assertions
- Statistical validations
- Cross-system comparisons
- Aggregation checks
Code Examples
Schema Test Definitions
# models/staging/_sources.yml
version: 2
sources:
- name: raw
database: raw_data
schema: public
freshness:
warn_after: {count: 6, period: hour}
error_after: {count: 24, period: hour}
tables:
- name: orders
columns:
- name: id
data_tests:
- unique
- not_null
- name: customer_id
data_tests:
- not_null
- relationships:
to: ref('stg_customers')
field: customer_id
- name: status
data_tests:
- accepted_values:
values: ['pending', 'shipped', 'delivered', 'cancelled']
- name: amount
data_tests:
- not_null
Custom Data Tests
-- tests/test_order_amount_positive.sql
-- Business rule: Order amounts must be positive
select
order_id,
amount,
customer_id,
order_date
from {{ ref('fct_orders') }}
where amount < 0
Advanced Custom Test
-- tests/test_monthly_revenue_anomaly.sql
-- Statistical test for revenue anomalies
with monthly_revenue as (
select
date_trunc('month', order_date) as month,
sum(amount) as revenue
from {{ ref('fct_orders') }}
group by 1
),
revenue_stats as (
select
avg(revenue) as avg_revenue,
stddev(revenue) as stddev_revenue
from monthly_revenue
),
anomalies as (
select
mr.month,
mr.revenue,
rs.avg_revenue,
rs.stddev_revenue,
(mr.revenue - rs.avg_revenue) / nullif(rs.stddev_revenue, 0) as z_score
from monthly_revenue mr
cross join revenue_stats rs
)
select
month,
revenue,
avg_revenue,
z_score
from anomalies
where abs(z_score) > 3 -- Flag anomalies beyond 3 standard deviations
Generic Test Macro
-- macros/schema_tests/test_accepted_range.sql
{% test accepted_range(model, column_name, min_value, max_value) %}
with validation as (
select
{{ column_name }} as field_name,
case
when {{ column_name }} < {{ min_value }} then 'below_min'
when {{ column_name }} > {{ max_value }} then 'above_max'
else 'valid'
end as validation_status
from {{ model }}
where {{ column_name }} is not null
)
select *
from validation
where validation_status != 'valid'
{% endtest %}
Test Configuration
# dbt_project.yml
name: 'my_project'
version: '1.0.0'
tests:
my_project:
+severity: error
+store_failures: true
+schema: test_results
staging:
+severity: warn
marts:
+severity: error
+fail_calc: "count(*)"
+warn_if: ">10"
+error_if: ">100"
on-run-end:
- "{{ log('Test execution completed', info=True) }}"
Freshness Configuration
# models/staging/_sources.yml
version: 2
sources:
- name: shopify
database: raw
schema: shopify
loader: fivetran
loaded_at_field: _fivetran_synced
freshness:
warn_after: {count: 6, period: hour}
error_after: {count: 24, period: hour}
tables:
- name: orders
description: "Shopify orders"
loaded_at_field: _fivetran_synced
freshness:
warn_after: {count: 1, period: hour}
error_after: {count: 4, period: hour}
Performance Metrics
| Test Type | Execution Time | Resource Usage | Coverage |
|---|---|---|---|
| unique | Low | Low | High |
| not_null | Low | Low | High |
| accepted_values | Medium | Low | Medium |
| relationships | Medium | Medium | High |
| freshness | Low | Low | High |
| Custom SQL | Variable | Variable | Variable |
Best Practices
- Test all critical columns - uniqueness, not_null, relationships
- Use relationships tests - validate foreign key integrity
- Test business rules - custom assertions for domain logic
- Store test failures - for debugging and analysis
- Use severity levels - warn vs error based on impact
- Test freshness - ensure data is current
- Use packages - dbt_expectations for advanced tests
- Document tests - explain what each test validates
- Run tests in CI/CD - catch issues before production
- Monitor test results - track quality metrics over time