dbt SCD Type 2
SCD Type 2 Architecture
SCD Type 2 Pipeline
Formal Definitions
DfSCD Type 2
Slowly Changing Dimension Type 2 tracks all historical changes by creating new records when attributes change. Each version of a row has a unique surrogate key, a valid-from date, a valid-to date, and a current flag. When a source record changes, the previous version is expired (valid-to set to change date, current flag set to false) and a new version is inserted (valid-from set to change date, valid-to set to NULL, current flag set to true). This preserves complete history.
DfHash Diff
A hash diff is a deterministic hash computed from all tracked columns of a row. It detects changes between the current source state and the existing version. If hashdiff(current) != hashdiff(existing), a change has occurred. Formally: HD = HASH(col_1, col_2, ..., col_n) where each column is coalesced to handle nulls consistently.
DfEffective Period
The effective period defines when a version of a record is valid. It consists of valid_from (when this version became active) and valid_to (when this version was superseded or NULL if currently active). For any timestamp T, exactly one version satisfies: valid_from <= T AND (valid_to > T OR valid_to IS NULL).
Detailed Explanation
SCD Type 2 is the gold standard for dimension history in dimensional modeling. Unlike Type 1 (overwrite) or Type 2 with flag (add column), Type 2 creates separate rows for each version, enabling point-in-time analysis with complete accuracy.
dbt Snapshot Approach
dbt provides built-in SCD Type 2 support through the snapshot feature. Snapshots use the check or timestamp strategy to detect changes and automatically manage the dbt_valid_from, dbt_valid_to, dbt_scd_id, and dbt_updated_at columns.
dbt snapshots use two strategies: check (compares all columns or specified columns) and timestamp (monitors a source timestamp column). The check strategy is more reliable for detecting attribute changes, while the timestamp strategy is simpler but may miss changes if the source timestamp isn't updated.
For production SCD Type 2, prefer the check strategy with check_cols: all or explicitly listed columns. This ensures all attribute changes are captured, even if the source system doesn't update a timestamp column. Combine with invalidate_hard_deletes: true to handle deleted source records.
Code Examples
Basic Snapshot Configuration
# snapshots/customers.yml
snapshots:
- name: scd_customers
strategy: check
unique_key: customer_id
check_cols:
- customer_name
- email
- segment
- address
config:
target_schema: snapshots
tags: ['scd', 'dimensions']
invalidate_hard_deletes: true
Snapshot SQL File
-- snapshots/scd_customers.sql
-- dbt snapshot config: scd_customers
-- config(
-- target_schema='snapshots',
-- unique_key='customer_id',
-- strategy='check',
-- check_cols=['customer_name', 'email', 'segment', 'address'],
-- invalidate_hard_deletes=true
-- )
select * from 'erp.customers'
-- endsnapshot
Custom SCD Type 2 with Incremental Model
-- models/marts/dim_customers_scd2.sql
-- dbt config: materialized='incremental', unique_key='surrogate_key',
-- incremental_strategy='merge', tags=['dimensions', 'scd2']
with source_customers as (
select
customer_id,
customer_name,
email,
segment,
address,
updated_at as source_updated_at
from 'erp.customers'
),
-- if is_incremental():
current_versions as (
select
surrogate_key,
customer_id,
customer_name,
email,
segment,
address,
valid_from,
valid_to,
is_current
from 'this_model'
where is_current = true
),
changed_records as (
select
s.customer_id,
s.customer_name,
s.email,
s.segment,
s.address,
s.source_updated_at
from source_customers s
left join current_versions c
on s.customer_id = c.customer_id
where c.customer_id is null -- new records
or 'surrogate_key' != 'surrogate_key'
),
expired_records as (
update 'this_model'
set
valid_to = current_timestamp(),
is_current = false
where customer_id in (select customer_id from changed_records)
and is_current = true
returning *
),
new_versions as (
select
'surrogate_key' as surrogate_key,
customer_id,
customer_name,
email,
segment,
address,
source_updated_at as valid_from,
cast(null as timestamp) as valid_to,
true as is_current
from changed_records
),
combined as (
select * from expired_records
union all
select * from new_versions
)
select * from combined
-- else:
initial_versions as (
select
'surrogate_key' as surrogate_key,
customer_id,
customer_name,
email,
segment,
address,
source_updated_at as valid_from,
cast(null as timestamp) as valid_to,
true as is_current
from source_customers
)
select * from initial_versions
-- endif
SCD Type 2 with Hard Delete Detection
-- models/marts/dim_products_scd2.sql
-- dbt config: materialized='incremental', unique_key='surrogate_key',
-- incremental_strategy='merge'
with source_products as (
select
product_id,
product_name,
category,
price,
is_active
from 'erp.products'
),
-- if is_incremental():
active_versions as (
select *
from 'this_model'
where is_current = true
),
-- Detect inserts and updates
changed_products as (
select
s.product_id,
s.product_name,
s.category,
s.price,
s.is_active
from source_products s
left join active_versions a on s.product_id = a.product_id
where a.product_id is null
or 'surrogate_key' != 'surrogate_key'
),
-- Detect hard deletes
deleted_products as (
select a.product_id
from active_versions a
left join source_products s on a.product_id = s.product_id
where s.product_id is null
),
expired_updates as (
update 'this_model'
set
valid_to = current_timestamp(),
is_current = false
where product_id in (select product_id from changed_products)
and is_current = true
returning *
),
expired_deletes as (
update 'this_model'
set
valid_to = current_timestamp(),
is_current = false,
is_deleted = true
where product_id in (select product_id from deleted_products)
and is_current = true
returning *
),
new_records as (
select
'surrogate_key' as surrogate_key,
product_id,
product_name,
category,
price,
is_active,
current_timestamp() as valid_from,
cast(null as timestamp) as valid_to,
true as is_current,
false as is_deleted
from changed_products
),
combined as (
select * from expired_updates
union all
select * from expired_deletes
union all
select * from new_records
)
select * from combined
-- else:
initial_records as (
select
'surrogate_key' as surrogate_key,
product_id,
product_name,
category,
price,
is_active,
current_timestamp() as valid_from,
cast(null as timestamp) as valid_to,
true as is_current,
false as is_deleted
from source_products
)
select * from initial_records
-- endif
Querying SCD Type 2 Tables
-- Get current version of all records
select * from 'dim_customers_scd2'
where is_current = true
-- Point-in-time lookup
select *
from 'dim_customers_scd2'
where valid_from <= '2024-06-01'
and (valid_to > '2024-06-01' OR valid_to IS NULL)
-- Get all versions of a specific customer
select *
from 'dim_customers_scd2'
where customer_id = 'CUST-001'
order by valid_from
-- Find records that changed in a specific period
select *
from 'dim_customers_scd2'
where valid_from >= '2024-01-01'
and valid_from < '2024-02-01'
Comparison: SCD Strategies
| Strategy | History | Storage | Query Complexity | Use Case |
|---|---|---|---|---|
| Type 1 | None (overwrite) | Minimal | Simple | Current state only |
| Type 2 | Full (new rows) | High | Moderate | Audit and history |
| Type 3 | Limited (one prev) | Low | Moderate | Previous value only |
| Type 4 | Full (history table) | High | High | Separate history table |
| Type 6 | Hybrid (flag + history) | High | High | Current + history |
Performance Metrics
| Metric | Description | Target |
|---|---|---|
| Snapshot Duration | Time to detect changes | < 10 minutes |
| Record Growth Rate | New versions per run | Monitor for anomalies |
| Query Performance | Point-in-time lookup | < 2 seconds |
| Storage Efficiency | Row count vs unique keys | < 10x growth |
| Change Detection Accuracy | Missed changes | 0% |
Best Practices
- Use check strategy - More reliable than timestamp for detecting changes
- Track all important columns - Include all columns needed for analysis
- Handle hard deletes - Enable
invalidate_hard_deletesfor source deletions - Index surrogate_key - Ensure efficient joins with fact tables
- Partition by valid_from - Optimize historical queries
- Create current views - Materialized views for
is_current = true - Monitor row growth - Alert on unexpected version proliferation
- Document grain - Clearly state the SCD strategy in model descriptions
See Also
- Models, Seeds, and Snapshots — dbt snapshot fundamentals
- Incremental Models — Incremental materialization patterns
- Data Vault Modeling — Alternative history tracking approach
- dbt Best Practices — Documentation and testing patterns