Models, Seeds, and Snapshots
Component Architecture
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β DBT DATA TRANSFORMATION COMPONENTS β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β ββββββββββββββββ ββββββββββββββββ ββββββββββββββββ β
β β SOURCES β β SEEDS β β SNAPSHOTS β β
β β β β β β β β
β β External β β CSV Files β β Historical β β
β β Data Tables β β Static Data β β Tracking β β
β ββββββββ¬ββββββββ ββββββββ¬ββββββββ ββββββββ¬ββββββββ β
β β β β β
β βΌ βΌ βΌ β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β STAGING MODELS β β
β β (1:1 mapping to sources) β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β β
β βΌ β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β INTERMEDIATE MODELS β β
β β (Business logic transformations) β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β β
β βΌ β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β MART MODELS β β
β β (Analytical datasets) β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Data Flow Diagram
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β DATA FLOW THROUGH DBT β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β EXTERNAL STAGING INTERMEDIATE MART β
β SOURCES LAYER LAYER LAYER β
β β
β βββββββββββ βββββββββββ βββββββββββ βββββββββββ β
β β Shopify βββββββΆβ stg_ ββββββββΆβ int_ βββββββββΆβ fct_ β β
β β Orders β β orders β β orders β β orders β β
β βββββββββββ βββββββββββ βββββββββββ βββββββββββ β
β β
β βββββββββββ βββββββββββ βββββββββββ βββββββββββ β
β β Stripe βββββββΆβ stg_ ββββββββΆβ int_ βββββββββΆβ dim_ β β
β βPayments β β paymentsβ β paymentsβ β customersβ β
β βββββββββββ βββββββββββ βββββββββββ βββββββββββ β
β β
β βββββββββββ βββββββββββ βββββββββββ βββββββββββ β
β β CSV βββββββΆβ seed_ ββββββββΆβ int_ βββββββββΆβ fct_ β β
β β Productsβ β productsβ β productsβ β productsβ β
β βββββββββββ βββββββββββ βββββββββββ βββββββββββ β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Snapshot Lifecycle
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β SLOWLY CHANGING DIMENSION (SCD) β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β TIME βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββΆ β
β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β SNAPSHOT TABLE STATE β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€ β
β β β β
β β Record ID Status Valid From Valid To Current? β β
β β βββββββββ ββββββ ββββββββββ ββββββββ ββββββββ β β
β β 1001 Active 2024-01-01 2024-03-15 No β β
β β 1001 Inactive 2024-03-15 9999-12-31 Yes β β
β β 1002 Active 2024-02-01 9999-12-31 Yes β β
β β 1003 Pending 2024-03-01 2024-03-10 No β β
β β 1003 Active 2024-03-10 9999-12-31 Yes β β
β β β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Detailed Explanation
dbt provides several core components for data transformation: models, seeds, snapshots, and sources. Understanding when and how to use each component is crucial for building robust data pipelines.
Models
Models are the primary building blocks of dbt projects. They are SQL files (or Python scripts) that define transformations on your data. Models can be materialized as tables, views, incremental models, or ephemeral CTEs.
Key Characteristics:
- Models are defined in SQL or Python
- They support Jinja templating for dynamic SQL generation
- Models are organized into a DAG (Directed Acyclic Graph) based on dependencies
- Each model can have different materialization strategies
- Models support testing and documentation
Seeds
Seeds are CSV files that dbt can load into your data warehouse. They are useful for:
- Static lookup tables (e.g., country codes, product categories)
- Configuration data that needs to be queryable
- Small datasets that complement your main data sources
Key Characteristics:
- CSV files stored in the
seeds/directory - Loaded as tables in your warehouse
- Support dbt tests (uniqueness, not_null, relationships)
- Can be referenced in models using
{{ ref('seed_name') }}
Snapshots
Snapshots capture the historical state of data over time, implementing Slowly Changing Dimension (SCD) Type 2 logic. They track changes in source data and maintain a complete history of all changes.
Key Characteristics:
- Track changes in source data over time
- Maintain valid_from and valid_to timestamps
- Support current row identification
- Use either timestamp-based or check-based change detection
- Stored in the
snapshots/directory
Sources
Sources are declarations that tell dbt about your existing tables in the warehouse. They provide metadata about external data and enable source freshness monitoring.
Key Characteristics:
- Declared in YAML files under
sources: - Enable source freshness checks
- Provide documentation for upstream tables
- Support testing at the source level
- Track data lineage from source to mart
Code Examples
Model Materialization Patterns
-- models/staging/stg_orders.sql
{{
config(
materialized='view',
schema='staging'
)
}}
with source as (
select * from {{ source('raw', 'orders') }}
),
renamed as (
select
id as order_id,
customer_id,
status,
created_at as order_date,
updated_at as modified_date,
_fivetran_synced as synced_at
from source
)
select * from renamed
Incremental Model with Merge Strategy
-- models/marts/fct_orders.sql
{{
config(
materialized='incremental',
unique_key='order_id',
incremental_strategy='merge',
partition_by={
"field": "order_date",
"data_type": "timestamp"
},
cluster_by=['customer_id', 'order_status']
)
}}
with orders as (
select * from {{ ref('stg_orders') }}
),
customers as (
select * from {{ ref('dim_customers') }}
),
final as (
select
orders.order_id,
orders.customer_id,
customers.customer_name,
orders.order_date,
orders.status as order_status,
orders.amount,
current_timestamp() as dbt_updated_at
from orders
left join customers on orders.customer_id = customers.customer_id
)
select * from final
{% if is_incremental() %}
where dbt_updated_at > (select max(dbt_updated_at) from {{ this }})
{% endif %}
Snapshot Definition
-- snapshots/customers_snapshot.sql
{% snapshot customers_snapshot %}
{{
config(
target_schema='snapshots',
unique_key='customer_id',
strategy='timestamp',
updated_at='updated_at',
invalidate_hard_deletes=True
)
}}
select * from {{ source('raw', 'customers') }}
{% endsnapshot %}
Seed Configuration
# seeds/country_codes.yml
version: 2
seeds:
- name: country_codes
description: "ISO country codes for geo analysis"
config:
column_types:
country_code: varchar(2)
country_name: varchar(100)
continent: varchar(50)
region: varchar(50)
columns:
- name: country_code
description: "ISO 3166-1 alpha-2 country code"
data_tests:
- unique
- not_null
- name: country_name
description: "Full country name"
data_tests:
- not_null
Source Freshness Configuration
# models/staging/_sources.yml
version: 2
sources:
- name: shopify
database: raw
schema: shopify
loader: fivetran
loaded_at_field: _fivetran_synced
freshness:
warn_after: {count: 6, period: hour}
error_after: {count: 24, period: hour}
tables:
- name: orders
description: "Shopify orders"
data_tests:
- unique:
column_name: id
- not_null:
column_name: id
- name: customers
description: "Shopify customers"
columns:
- name: id
data_tests:
- unique
- not_null
- name: email
data_tests:
- not_null
Performance Metrics
| Component | Load Time | Storage | Query Performance | Use Case |
|---|---|---|---|---|
| Table | Medium | High | Excellent | Full refresh |
| View | Fast | Low | Good | Simple transforms |
| Incremental | Variable | Medium | Excellent | Large datasets |
| Ephemeral | N/A | None | Varies | Reusable CTEs |
| Seed | Fast | Low | Good | Static data |
| Snapshot | Medium | High | Good | Historical tracking |
Best Practices
- Use staging models for 1:1 mappings with source tables
- Keep models small and focused on single responsibilities
- Use ephemeral models for intermediate transformations that don't need persistence
- Implement incremental models for large fact tables
- Use snapshots for SCD Type 2 requirements
- Load seeds for static lookup data
- Always add tests to models and sources
- Document everything with descriptions in YAML files