dbt Data Vault Modeling
Data Vault Architecture
Data Vault Pipeline
Formal Definitions
DfHub
A hub stores the unique business keys that represent core business entities (customers, products, orders). Each hub contains: a hash key (MD5/SHA256 of business key columns), the business key value(s), a load date timestamp, and a record source identifier. Hubs are inserted only when a new business key is encountered. Formally, a hub H = {HK, BK, LD, RS} where HK is the hash key, BK is the business key, LD is load date, and RS is record source.
DfLink
A link represents a many-to-many relationship between business keys from one or more hubs. Links contain: a hash key (from the combination of related hub hash keys), hash keys referencing each hub, a load date, and a record source. Links capture associations like customer-to-order or order-to-product. Formally, a link L = {HK, HK_1, HK_2, ..., HK_n, LD, RS}.
DfSatellite
A satellite stores descriptive attributes and context for a hub or link, tracked over time. Each satellite contains: a parent hash key (from the hub or link), a hash difference (detecting changes), all descriptive columns, a load date, and a record source. Satellites are inserted whenever attributes change. Formally, a satellite S = {PK, HD, Attr_1, ..., Attr_n, LD, RS}.
DfHash Key
A hash key is a deterministic, fixed-length identifier generated by hashing the business key columns using MD5 or SHA256. For hubs, HK = HASH(BK_1, BK_2, ..., BK_n). For links, HK = HASH(HK_hub_1, HK_hub_2, ..., HK_hub_n). Hash keys enable efficient joins, detect duplicates, and decouple the vault from surrogate key generation.
Detailed Explanation
Data Vault 2.0 is an agile, auditable data modeling methodology designed for enterprise data warehousing. Unlike normalized or dimensional models, Data Vault separates business keys (hubs), relationships (links), and descriptive attributes (satellites) into distinct entity types. This separation enables parallel loading, full auditability, and graceful handling of source system changes.
Why Data Vault with dbt
dbt's SQL-based transformation approach aligns perfectly with Data Vault's append-only philosophy. Hubs and links are inserted when new keys appear (INSERT-only), while satellites track all historical changes. dbt's incremental materialization, combined with hash-based deduplication, makes Data Vault implementation efficient and maintainable.
Data Vault 2.0 uses hash keys instead of sequence-based surrogate keys. This enables parallel loading across hubs, links, and satellites without coordination. Hash keys are deterministic: the same business key always produces the same hash, regardless of load order.
Use dbt's dbt_utils.generate_surrogate_key or custom macros for hash key generation. SHA256 is recommended over MD5 for collision resistance. Store hash keys as strings (VARCHAR) rather than binary to simplify debugging and downstream joins.
Code Examples
Hub Model
-- models/vault/hub_customer.sql
{{
config(
materialized='incremental',
unique_key='hub_customer_hashkey',
incremental_strategy='merge',
tags=['vault', 'hub']
)
}}
{%- set source_models = ['stg_crm_customers', 'stg_erp_customers'] -%}
with source_union as (
{%- for source in source_models %}
select
customer_id as business_key,
'source_{{ source }}' as record_source,
_loaded_at as load_date
from {{ ref(source) }}
{%- if not loop.last %}
union all
{%- endif %}
{%- endfor %}
),
deduplicated as (
select
{{ dbt_utils.generate_surrogate_key(['business_key']) }} as hub_customer_hashkey,
business_key,
min(load_date) as load_date,
min(record_source) as record_source
from source_union
group by business_key
)
select
hub_customer_hashkey,
business_key,
load_date,
record_source
from deduplicated
{% if is_incremental() %}
where hub_customer_hashkey not in (
select hub_customer_hashkey from {{ this }}
)
{% endif %}
Link Model
-- models/vault/link_order_customer.sql
{{
config(
materialized='incremental',
unique_key='link_order_customer_hashkey',
incremental_strategy='merge',
tags=['vault', 'link']
)
}}
with source_data as (
select
order_id as order_business_key,
customer_id as customer_business_key,
'stg_orders' as record_source,
_loaded_at as load_date
from {{ ref('stg_orders') }}
),
hash_keys as (
select
{{ dbt_utils.generate_surrogate_key(['order_business_key']) }} as hub_order_hashkey,
{{ dbt_utils.generate_surrogate_key(['customer_business_key']) }} as hub_customer_hashkey,
{{ dbt_utils.generate_surrogate_key(['order_business_key', 'customer_business_key']) }} as link_order_customer_hashkey,
record_source,
load_date
from source_data
)
select distinct
link_order_customer_hashkey,
hub_order_hashkey,
hub_customer_hashkey,
load_date,
record_source
from hash_keys
{% if is_incremental() %}
where link_order_customer_hashkey not in (
select link_order_customer_hashkey from {{ this }}
)
{% endif %}
Satellite Model
-- models/vault/sat_order_details.sql
{{
config(
materialized='incremental',
unique_key=['hub_order_hashkey', 'load_date'],
incremental_strategy='merge',
tags=['vault', 'satellite']
)
}}
with source_data as (
select
order_id as business_key,
status,
amount,
currency,
order_date,
_loaded_at as load_date,
'stg_orders' as record_source
from {{ ref('stg_orders') }}
),
hash_keys as (
select
{{ dbt_utils.generate_surrogate_key(['business_key']) }} as hub_order_hashkey,
status,
amount,
currency,
order_date,
load_date,
record_source,
{{ dbt_utils.generate_surrogate_key(['status', 'amount', 'currency', 'order_date']) }} as hashdiff
from source_data
),
{% if is_incremental() %}
previous_hashes as (
select hub_order_hashkey, hashdiff
from {{ this }}
),
new_records as (
select
h.hub_order_hashkey,
h.status,
h.amount,
h.currency,
h.order_date,
h.load_date,
h.record_source,
h.hashdiff
from hash_keys h
left join previous_hashes p on h.hub_order_hashkey = p.hub_order_hashkey
where p.hub_order_hashkey is null
or h.hashdiff != p.hashdiff
)
select * from new_records
{% else %}
select * from hash_keys
{% endif %}
Hash Key Macro
-- macros/generate_hashkey.sql
{% macro generate_hashkey(columns, algorithm='sha256') %}
{% if target.type == 'bigquery' %}
{{ algorithm }}(concat({% for col in columns %}coalesce(cast({{ col }} as string), '^^')
{%- if not loop.last %}, '||'{% endif %}{% endfor %}))
{% elif target.type == 'snowflake' %}
hash({% for col in columns %}coalesce(cast({{ col }} as varchar), '^^')
{%- if not loop.last %}, '||'{% endif %}{% endfor %}, '{{ algorithm }}')
{% else %}
md5({% for col in columns %}coalesce(cast({{ col }} as varchar), '^^')
{%- if not loop.last %}, '||'{% endif %}{% endfor %})
{% endif %}
{% endmacro %}
Point-in-Time Table
-- models/vault/pit_order.sql
{{
config(
materialized='incremental',
unique_key='hub_order_hashkey',
tags=['vault', 'pit']
)
}}
with dates as (
select * from {{ ref('dim_date') }}
where date_value >= '2024-01-01'
),
satellites as (
select
hub_order_hashkey,
load_date,
status,
amount
from {{ ref('sat_order_details') }}
),
pit_records as (
select
d.date_value as pit_date,
s.hub_order_hashkey,
s.status,
s.amount,
s.load_date as sat_load_date
from dates d
cross join (
select distinct hub_order_hashkey from satellites
) s
left join satellites s
on s.hub_order_hashkey = s.hub_order_hashkey
and s.load_date = (
select max(load_date)
from satellites
where hub_order_hashkey = s.hub_order_hashkey
and load_date <= d.date_value
)
)
select * from pit_records
Comparison: Data Vault vs Dimensional
| Aspect | Data Vault | Dimensional |
|---|---|---|
| Design Goal | Auditability, agility | Query performance |
| Structure | Hubs, Links, Satellites | Facts, Dimensions |
| Loading | Append-only (INSERT) | Upsert (MERGE) |
| History | Full history in satellites | SCD Type 1 or 2 |
| Keys | Hash keys (MD5/SHA256) | Surrogate keys |
| Performance | Requires views for queries | Pre-joined star schema |
| Scalability | Excellent parallel loading | Good with partitioning |
| Complexity | Higher modeling overhead | Lower modeling overhead |
Performance Metrics
| Metric | Description | Target |
|---|---|---|
| Hub Load Time | Time to insert new keys | < 5 minutes |
| Satellite Change Detection | Hashdiff comparison speed | < 10 minutes |
| View Materialization | Business view creation | < 15 minutes |
| Hash Collision Rate | Duplicate hash keys | < 1 in 10^15 |
| Audit Completeness | Record source tracking | 100% |
Best Practices
- Hash key consistency - Use the same algorithm and null handling across all models
- Record source tracking - Always capture the originating system
- Load date accuracy - Use source system timestamps, not dbt run timestamps
- Incremental strategy - Use merge for hubs/links, append for satellites
- Hashdiff optimization - Only include changed attributes in hashdiff
- Business vault - Apply business rules in a separate business vault layer
- PIT tables - Create point-in-time tables for efficient querying
- Documentation - Document business keys and relationships explicitly
See Also
- Incremental Models — Incremental materialization strategies
- Materializations — Table, view, and ephemeral strategies
- dbt Best Practices — Project structure and conventions
- Slowly Changing Dimensions — SCD patterns with dbt