CW

dbt Data Vault Modeling

Free Lesson

Advertisement

dbt Data Vault Modeling

Data Vault Architecture

Data Vault Pipeline

Formal Definitions

DfHub

A hub stores the unique business keys that represent core business entities (customers, products, orders). Each hub contains: a hash key (MD5/SHA256 of business key columns), the business key value(s), a load date timestamp, and a record source identifier. Hubs are inserted only when a new business key is encountered. Formally, a hub H = {HK, BK, LD, RS} where HK is the hash key, BK is the business key, LD is load date, and RS is record source.

DfLink

A link represents a many-to-many relationship between business keys from one or more hubs. Links contain: a hash key (from the combination of related hub hash keys), hash keys referencing each hub, a load date, and a record source. Links capture associations like customer-to-order or order-to-product. Formally, a link L = {HK, HK_1, HK_2, ..., HK_n, LD, RS}.

DfSatellite

A satellite stores descriptive attributes and context for a hub or link, tracked over time. Each satellite contains: a parent hash key (from the hub or link), a hash difference (detecting changes), all descriptive columns, a load date, and a record source. Satellites are inserted whenever attributes change. Formally, a satellite S = {PK, HD, Attr_1, ..., Attr_n, LD, RS}.

DfHash Key

A hash key is a deterministic, fixed-length identifier generated by hashing the business key columns using MD5 or SHA256. For hubs, HK = HASH(BK_1, BK_2, ..., BK_n). For links, HK = HASH(HK_hub_1, HK_hub_2, ..., HK_hub_n). Hash keys enable efficient joins, detect duplicates, and decouple the vault from surrogate key generation.

Detailed Explanation

Data Vault 2.0 is an agile, auditable data modeling methodology designed for enterprise data warehousing. Unlike normalized or dimensional models, Data Vault separates business keys (hubs), relationships (links), and descriptive attributes (satellites) into distinct entity types. This separation enables parallel loading, full auditability, and graceful handling of source system changes.

Why Data Vault with dbt

dbt's SQL-based transformation approach aligns perfectly with Data Vault's append-only philosophy. Hubs and links are inserted when new keys appear (INSERT-only), while satellites track all historical changes. dbt's incremental materialization, combined with hash-based deduplication, makes Data Vault implementation efficient and maintainable.

Data Vault 2.0 uses hash keys instead of sequence-based surrogate keys. This enables parallel loading across hubs, links, and satellites without coordination. Hash keys are deterministic: the same business key always produces the same hash, regardless of load order.

Use dbt's dbt_utils.generate_surrogate_key or custom macros for hash key generation. SHA256 is recommended over MD5 for collision resistance. Store hash keys as strings (VARCHAR) rather than binary to simplify debugging and downstream joins.

Code Examples

Hub Model

-- models/vault/hub_customer.sql
{{
    config(
        materialized='incremental',
        unique_key='hub_customer_hashkey',
        incremental_strategy='merge',
        tags=['vault', 'hub']
    )
}}

{%- set source_models = ['stg_crm_customers', 'stg_erp_customers'] -%}

with source_union as (
    {%- for source in source_models %}
    select
        customer_id as business_key,
        'source_{{ source }}' as record_source,
        _loaded_at as load_date
    from {{ ref(source) }}
    {%- if not loop.last %}
    union all
    {%- endif %}
    {%- endfor %}
),

deduplicated as (
    select
        {{ dbt_utils.generate_surrogate_key(['business_key']) }} as hub_customer_hashkey,
        business_key,
        min(load_date) as load_date,
        min(record_source) as record_source
    from source_union
    group by business_key
)

select
    hub_customer_hashkey,
    business_key,
    load_date,
    record_source
from deduplicated

{% if is_incremental() %}
where hub_customer_hashkey not in (
    select hub_customer_hashkey from {{ this }}
)
{% endif %}

Link Model

-- models/vault/link_order_customer.sql
{{
    config(
        materialized='incremental',
        unique_key='link_order_customer_hashkey',
        incremental_strategy='merge',
        tags=['vault', 'link']
    )
}}

with source_data as (
    select
        order_id as order_business_key,
        customer_id as customer_business_key,
        'stg_orders' as record_source,
        _loaded_at as load_date
    from {{ ref('stg_orders') }}
),

hash_keys as (
    select
        {{ dbt_utils.generate_surrogate_key(['order_business_key']) }} as hub_order_hashkey,
        {{ dbt_utils.generate_surrogate_key(['customer_business_key']) }} as hub_customer_hashkey,
        {{ dbt_utils.generate_surrogate_key(['order_business_key', 'customer_business_key']) }} as link_order_customer_hashkey,
        record_source,
        load_date
    from source_data
)

select distinct
    link_order_customer_hashkey,
    hub_order_hashkey,
    hub_customer_hashkey,
    load_date,
    record_source
from hash_keys

{% if is_incremental() %}
where link_order_customer_hashkey not in (
    select link_order_customer_hashkey from {{ this }}
)
{% endif %}

Satellite Model

-- models/vault/sat_order_details.sql
{{
    config(
        materialized='incremental',
        unique_key=['hub_order_hashkey', 'load_date'],
        incremental_strategy='merge',
        tags=['vault', 'satellite']
    )
}}

with source_data as (
    select
        order_id as business_key,
        status,
        amount,
        currency,
        order_date,
        _loaded_at as load_date,
        'stg_orders' as record_source
    from {{ ref('stg_orders') }}
),

hash_keys as (
    select
        {{ dbt_utils.generate_surrogate_key(['business_key']) }} as hub_order_hashkey,
        status,
        amount,
        currency,
        order_date,
        load_date,
        record_source,
        {{ dbt_utils.generate_surrogate_key(['status', 'amount', 'currency', 'order_date']) }} as hashdiff
    from source_data
),

{% if is_incremental() %}
previous_hashes as (
    select hub_order_hashkey, hashdiff
    from {{ this }}
),

new_records as (
    select
        h.hub_order_hashkey,
        h.status,
        h.amount,
        h.currency,
        h.order_date,
        h.load_date,
        h.record_source,
        h.hashdiff
    from hash_keys h
    left join previous_hashes p on h.hub_order_hashkey = p.hub_order_hashkey
    where p.hub_order_hashkey is null
       or h.hashdiff != p.hashdiff
)
select * from new_records
{% else %}
select * from hash_keys
{% endif %}

Hash Key Macro

-- macros/generate_hashkey.sql
{% macro generate_hashkey(columns, algorithm='sha256') %}
    {% if target.type == 'bigquery' %}
        {{ algorithm }}(concat({% for col in columns %}coalesce(cast({{ col }} as string), '^^')
        {%- if not loop.last %}, '||'{% endif %}{% endfor %}))
    {% elif target.type == 'snowflake' %}
        hash({% for col in columns %}coalesce(cast({{ col }} as varchar), '^^')
        {%- if not loop.last %}, '||'{% endif %}{% endfor %}, '{{ algorithm }}')
    {% else %}
        md5({% for col in columns %}coalesce(cast({{ col }} as varchar), '^^')
        {%- if not loop.last %}, '||'{% endif %}{% endfor %})
    {% endif %}
{% endmacro %}

Point-in-Time Table

-- models/vault/pit_order.sql
{{
    config(
        materialized='incremental',
        unique_key='hub_order_hashkey',
        tags=['vault', 'pit']
    )
}}

with dates as (
    select * from {{ ref('dim_date') }}
    where date_value >= '2024-01-01'
),

satellites as (
    select
        hub_order_hashkey,
        load_date,
        status,
        amount
    from {{ ref('sat_order_details') }}
),

pit_records as (
    select
        d.date_value as pit_date,
        s.hub_order_hashkey,
        s.status,
        s.amount,
        s.load_date as sat_load_date
    from dates d
    cross join (
        select distinct hub_order_hashkey from satellites
    ) s
    left join satellites s
        on s.hub_order_hashkey = s.hub_order_hashkey
        and s.load_date = (
            select max(load_date)
            from satellites
            where hub_order_hashkey = s.hub_order_hashkey
            and load_date <= d.date_value
        )
)

select * from pit_records

Comparison: Data Vault vs Dimensional

AspectData VaultDimensional
Design GoalAuditability, agilityQuery performance
StructureHubs, Links, SatellitesFacts, Dimensions
LoadingAppend-only (INSERT)Upsert (MERGE)
HistoryFull history in satellitesSCD Type 1 or 2
KeysHash keys (MD5/SHA256)Surrogate keys
PerformanceRequires views for queriesPre-joined star schema
ScalabilityExcellent parallel loadingGood with partitioning
ComplexityHigher modeling overheadLower modeling overhead

Performance Metrics

MetricDescriptionTarget
Hub Load TimeTime to insert new keys< 5 minutes
Satellite Change DetectionHashdiff comparison speed< 10 minutes
View MaterializationBusiness view creation< 15 minutes
Hash Collision RateDuplicate hash keys< 1 in 10^15
Audit CompletenessRecord source tracking100%

Best Practices

  1. Hash key consistency - Use the same algorithm and null handling across all models
  2. Record source tracking - Always capture the originating system
  3. Load date accuracy - Use source system timestamps, not dbt run timestamps
  4. Incremental strategy - Use merge for hubs/links, append for satellites
  5. Hashdiff optimization - Only include changed attributes in hashdiff
  6. Business vault - Apply business rules in a separate business vault layer
  7. PIT tables - Create point-in-time tables for efficient querying
  8. Documentation - Document business keys and relationships explicitly

See Also

Advertisement

Need Expert dbt Help?

Get personalized tutoring, project support, or professional consulting.

Advertisement