CW

dbt SCD Type 2

Free Lesson

Advertisement

dbt SCD Type 2

SCD Type 2 Architecture

SCD Type 2 Pipeline

Formal Definitions

DfSCD Type 2

Slowly Changing Dimension Type 2 tracks all historical changes by creating new records when attributes change. Each version of a row has a unique surrogate key, a valid-from date, a valid-to date, and a current flag. When a source record changes, the previous version is expired (valid-to set to change date, current flag set to false) and a new version is inserted (valid-from set to change date, valid-to set to NULL, current flag set to true). This preserves complete history.

DfHash Diff

A hash diff is a deterministic hash computed from all tracked columns of a row. It detects changes between the current source state and the existing version. If hashdiff(current) != hashdiff(existing), a change has occurred. Formally: HD = HASH(col_1, col_2, ..., col_n) where each column is coalesced to handle nulls consistently.

DfEffective Period

The effective period defines when a version of a record is valid. It consists of valid_from (when this version became active) and valid_to (when this version was superseded or NULL if currently active). For any timestamp T, exactly one version satisfies: valid_from <= T AND (valid_to > T OR valid_to IS NULL).

Detailed Explanation

SCD Type 2 is the gold standard for dimension history in dimensional modeling. Unlike Type 1 (overwrite) or Type 2 with flag (add column), Type 2 creates separate rows for each version, enabling point-in-time analysis with complete accuracy.

dbt Snapshot Approach

dbt provides built-in SCD Type 2 support through the snapshot feature. Snapshots use the check or timestamp strategy to detect changes and automatically manage the dbt_valid_from, dbt_valid_to, dbt_scd_id, and dbt_updated_at columns.

dbt snapshots use two strategies: check (compares all columns or specified columns) and timestamp (monitors a source timestamp column). The check strategy is more reliable for detecting attribute changes, while the timestamp strategy is simpler but may miss changes if the source timestamp isn't updated.

For production SCD Type 2, prefer the check strategy with check_cols: all or explicitly listed columns. This ensures all attribute changes are captured, even if the source system doesn't update a timestamp column. Combine with invalidate_hard_deletes: true to handle deleted source records.

Code Examples

Basic Snapshot Configuration

# snapshots/customers.yml
snapshots:
  - name: scd_customers
    strategy: check
    unique_key: customer_id
    check_cols:
      - customer_name
      - email
      - segment
      - address
    config:
      target_schema: snapshots
      tags: ['scd', 'dimensions']
      invalidate_hard_deletes: true

Snapshot SQL File

-- snapshots/scd_customers.sql
-- dbt snapshot config: scd_customers

-- config(
--     target_schema='snapshots',
--     unique_key='customer_id',
--     strategy='check',
--     check_cols=['customer_name', 'email', 'segment', 'address'],
--     invalidate_hard_deletes=true
-- )

select * from 'erp.customers'

-- endsnapshot

Custom SCD Type 2 with Incremental Model

-- models/marts/dim_customers_scd2.sql
-- dbt config: materialized='incremental', unique_key='surrogate_key',
-- incremental_strategy='merge', tags=['dimensions', 'scd2']

with source_customers as (
    select
        customer_id,
        customer_name,
        email,
        segment,
        address,
        updated_at as source_updated_at
    from 'erp.customers'
),

-- if is_incremental():
current_versions as (
    select
        surrogate_key,
        customer_id,
        customer_name,
        email,
        segment,
        address,
        valid_from,
        valid_to,
        is_current
    from 'this_model'
    where is_current = true
),

changed_records as (
    select
        s.customer_id,
        s.customer_name,
        s.email,
        s.segment,
        s.address,
        s.source_updated_at
    from source_customers s
    left join current_versions c
        on s.customer_id = c.customer_id
    where c.customer_id is null  -- new records
       or 'surrogate_key' != 'surrogate_key'
),

expired_records as (
    update 'this_model'
    set
        valid_to = current_timestamp(),
        is_current = false
    where customer_id in (select customer_id from changed_records)
      and is_current = true
    returning *
),

new_versions as (
    select
        'surrogate_key' as surrogate_key,
        customer_id,
        customer_name,
        email,
        segment,
        address,
        source_updated_at as valid_from,
        cast(null as timestamp) as valid_to,
        true as is_current
    from changed_records
),

combined as (
    select * from expired_records
    union all
    select * from new_versions
)
select * from combined

-- else:
initial_versions as (
    select
        'surrogate_key' as surrogate_key,
        customer_id,
        customer_name,
        email,
        segment,
        address,
        source_updated_at as valid_from,
        cast(null as timestamp) as valid_to,
        true as is_current
    from source_customers
)
select * from initial_versions
-- endif

SCD Type 2 with Hard Delete Detection

-- models/marts/dim_products_scd2.sql
-- dbt config: materialized='incremental', unique_key='surrogate_key',
-- incremental_strategy='merge'

with source_products as (
    select
        product_id,
        product_name,
        category,
        price,
        is_active
    from 'erp.products'
),

-- if is_incremental():
active_versions as (
    select *
    from 'this_model'
    where is_current = true
),

-- Detect inserts and updates
changed_products as (
    select
        s.product_id,
        s.product_name,
        s.category,
        s.price,
        s.is_active
    from source_products s
    left join active_versions a on s.product_id = a.product_id
    where a.product_id is null
       or 'surrogate_key' != 'surrogate_key'
),

-- Detect hard deletes
deleted_products as (
    select a.product_id
    from active_versions a
    left join source_products s on a.product_id = s.product_id
    where s.product_id is null
),

expired_updates as (
    update 'this_model'
    set
        valid_to = current_timestamp(),
        is_current = false
    where product_id in (select product_id from changed_products)
      and is_current = true
    returning *
),

expired_deletes as (
    update 'this_model'
    set
        valid_to = current_timestamp(),
        is_current = false,
        is_deleted = true
    where product_id in (select product_id from deleted_products)
      and is_current = true
    returning *
),

new_records as (
    select
        'surrogate_key' as surrogate_key,
        product_id,
        product_name,
        category,
        price,
        is_active,
        current_timestamp() as valid_from,
        cast(null as timestamp) as valid_to,
        true as is_current,
        false as is_deleted
    from changed_products
),

combined as (
    select * from expired_updates
    union all
    select * from expired_deletes
    union all
    select * from new_records
)
select * from combined

-- else:
initial_records as (
    select
        'surrogate_key' as surrogate_key,
        product_id,
        product_name,
        category,
        price,
        is_active,
        current_timestamp() as valid_from,
        cast(null as timestamp) as valid_to,
        true as is_current,
        false as is_deleted
    from source_products
)
select * from initial_records
-- endif

Querying SCD Type 2 Tables

-- Get current version of all records
select * from 'dim_customers_scd2'
where is_current = true

-- Point-in-time lookup
select *
from 'dim_customers_scd2'
where valid_from <= '2024-06-01'
  and (valid_to > '2024-06-01' OR valid_to IS NULL)

-- Get all versions of a specific customer
select *
from 'dim_customers_scd2'
where customer_id = 'CUST-001'
order by valid_from

-- Find records that changed in a specific period
select *
from 'dim_customers_scd2'
where valid_from >= '2024-01-01'
  and valid_from < '2024-02-01'

Comparison: SCD Strategies

StrategyHistoryStorageQuery ComplexityUse Case
Type 1None (overwrite)MinimalSimpleCurrent state only
Type 2Full (new rows)HighModerateAudit and history
Type 3Limited (one prev)LowModeratePrevious value only
Type 4Full (history table)HighHighSeparate history table
Type 6Hybrid (flag + history)HighHighCurrent + history

Performance Metrics

MetricDescriptionTarget
Snapshot DurationTime to detect changes< 10 minutes
Record Growth RateNew versions per runMonitor for anomalies
Query PerformancePoint-in-time lookup< 2 seconds
Storage EfficiencyRow count vs unique keys< 10x growth
Change Detection AccuracyMissed changes0%

Best Practices

  1. Use check strategy - More reliable than timestamp for detecting changes
  2. Track all important columns - Include all columns needed for analysis
  3. Handle hard deletes - Enable invalidate_hard_deletes for source deletions
  4. Index surrogate_key - Ensure efficient joins with fact tables
  5. Partition by valid_from - Optimize historical queries
  6. Create current views - Materialized views for is_current = true
  7. Monitor row growth - Alert on unexpected version proliferation
  8. Document grain - Clearly state the SCD strategy in model descriptions

See Also

Advertisement

Need Expert dbt Help?

Get personalized tutoring, project support, or professional consulting.

Advertisement