dbt Documentation and Lineage
Documentation Architecture
Architecture Diagram
+-------------------------------------------------------------------------+
|                     DBT DOCUMENTATION ARCHITECTURE                      |
+-------------------------------------------------------------------------+
|                                                                         |
|     +---------------+      +---------------+      +---------------+     |
|     |    SOURCES    |      |    MODELS     |      |   EXPOSURES   |     |
|     |               |      |               |      |               |     |
|     | _sources.yml  |      |   model.yml   |      | exposure.yml  |     |
|     +-------+-------+      +-------+-------+      +-------+-------+     |
|             |                      |                      |             |
|             v                      v                      v             |
|   +-----------------------------------------------------------------+   |
|   |                      METADATA AGGREGATION                       |   |
|   |                                                                 |   |
|   |   * Column descriptions        * Data types                     |   |
|   |   * Model descriptions         * Test results                   |   |
|   |   * Source freshness           * Lineage information            |   |
|   |   * Custom metadata            * Tags and labels                |   |
|   +-----------------------------------------------------------------+   |
|                                    |                                    |
|                                    v                                    |
|   +-----------------------------------------------------------------+   |
|   |                       DOCUMENTATION SITE                        |   |
|   |                                                                 |   |
|   |   +---------+ +---------+ +---------+ +---------+ +---------+   |   |
|   |   | Browse  | | Lineage | | Schema  | | Source  | | Search  |   |   |
|   |   | Models  | |  Graph  | | Details | | Fresh.  | | Content |   |   |
|   |   +---------+ +---------+ +---------+ +---------+ +---------+   |   |
|   +-----------------------------------------------------------------+   |
|                                                                         |
+-------------------------------------------------------------------------+
Lineage Graph Structure
Architecture Diagram
+-------------------------------------------------------------------------+
|                       DATA LINEAGE VISUALIZATION                        |
+-------------------------------------------------------------------------+
|                                                                         |
|                           +-----------------+                           |
|                           |    RAW DATA     |                           |
|                           |     SOURCES     |                           |
|                           +--------+--------+                           |
|                                    |                                    |
|             +----------------------+----------------------+             |
|             |                      |                      |             |
|             v                      v                      v             |
|     +---------------+      +---------------+      +---------------+     |
|     |  stg_orders   |      | stg_customers |      | stg_payments  |     |
|     |               |      |               |      |               |     |
|     |  Model: view  |      |  Model: view  |      |  Model: view  |     |
|     |   Tests: 3    |      |   Tests: 2    |      |   Tests: 2    |     |
|     +-------+-------+      +-------+-------+      +-------+-------+     |
|             |                      |                      |             |
|             v                      v                      v             |
|     +---------------+      +---------------+      +---------------+     |
|     |  int_orders   |      | int_customers |      | int_payments  |     |
|     |               |      |               |      |               |     |
|     |  Model: CTE   |      |  Model: CTE   |      |  Model: CTE   |     |
|     |   Tests: 1    |      |   Tests: 1    |      |   Tests: 1    |     |
|     +-------+-------+      +-------+-------+      +-------+-------+     |
|             |                      |                      |             |
|             +----------------------+----------------------+             |
|                                    |                                    |
|                                    v                                    |
|                            +---------------+                            |
|                            |  fct_orders   |                            |
|                            |               |                            |
|                            |  Model: incr  |                            |
|                            |   Tests: 5    |                            |
|                            +---------------+                            |
|                                                                         |
+-------------------------------------------------------------------------+
Documentation Generation Flow
Architecture Diagram
+-------------------------------------------------------------------------+
|                        DOCUMENTATION GENERATION                         |
+-------------------------------------------------------------------------+
|                                                                         |
|   +-----------------------------------------------------------------+   |
|   |                          INPUT SOURCES                          |   |
|   |                                                                 |   |
|   | +----------------+  +---------------+  +---------------------+  |   |
|   | | YAML Files     |  | SQL Models    |  | Manifest JSON       |  |   |
|   | |                |  |               |  |                     |  |   |
|   | | * Descriptions |  | * Column refs |  | * Compiled metadata |  |   |
|   | | * Tags         |  | * References  |  | * Graph structure   |  |   |
|   | | * Tests        |  | * Comments    |  | * Test results      |  |   |
|   | +----------------+  +---------------+  +---------------------+  |   |
|   +-----------------------------------------------------------------+   |
|                                    |                                    |
|                                    v                                    |
|   +-----------------------------------------------------------------+   |
|   |                           PROCESSING                            |   |
|   |                                                                 |   |
|   |   1. Parse YAML definitions                                     |   |
|   |   2. Extract SQL metadata                                       |   |
|   |   3. Build lineage graph                                        |   |
|   |   4. Compile documentation                                      |   |
|   |   5. Generate search index                                      |   |
|   +-----------------------------------------------------------------+   |
|                                    |                                    |
|                                    v                                    |
|   +-----------------------------------------------------------------+   |
|   |                        OUTPUT ARTIFACTS                         |   |
|   |                                                                 |   |
|   |   * catalog.json       * manifest.json       * index.html       |   |
|   |   * run_results.json   * compiled/           * search/          |   |
|   +-----------------------------------------------------------------+   |
|                                                                         |
+-------------------------------------------------------------------------+
Detailed Explanation
dbt's documentation features let you describe models, sources, and exposures in YAML, track the lineage between them, and generate an interactive documentation site from that metadata.
Documentation Components
1. Model Documentation
Models can be documented with descriptions, tags, and metadata:
- Description: What the model does and its business context
- Columns: Detailed descriptions for each column
- Tests: Data quality tests attached to models
- Tags: Organizational tags for filtering and grouping
2. Source Documentation
Sources provide metadata about external data:
- Freshness: Monitoring data freshness
- Descriptions: What each source table contains
- Columns: Column-level documentation
- Loader: Information about the data loading process
3. Exposure Documentation
Exposures define how data is used downstream:
- Dashboards: BI tool dashboards
- Applications: Data applications
- Exports: Data exports to external systems
Lineage Tracking
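dbt builds its dependency graph (DAG) from the ref() and source() calls in your models, which gives model-level lineage automatically:
- Column-level lineage: Track how columns flow through transformations (not derived from ref() and source() alone; with dbt Core this needs additional tooling)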
- Model-level lineage: See dependencies between models
- Impact analysis: Understand downstream effects of changes
- Root cause analysis: Trace issues back to their source
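Impact analysis amounts to walking the dependency graph downstream from a changed node. The following is a minimal Python sketch, assuming a dictionary shaped like the child_map section of target/manifest.json (each unique_id mapped to the unique_ids that depend on it). The project name shop, the node names, and the downstream helper are hypothetical and chosen only for illustration.

```python
from collections import deque

# Hypothetical excerpt shaped like the "child_map" section of manifest.json:
# each unique_id maps to the unique_ids of nodes that depend on it.
child_map = {
    "source.shop.shopify.orders": ["model.shop.stg_orders"],
    "model.shop.stg_orders": ["model.shop.int_orders"],
    "model.shop.int_orders": ["model.shop.fct_orders"],
    "model.shop.fct_orders": ["exposure.shop.order_analytics_dashboard"],
    "exposure.shop.order_analytics_dashboard": [],
}


def downstream(child_map, start):
    """Return every node reachable from `start`, in breadth-first order."""
    seen = []
    queue = deque(child_map.get(start, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue  # already visited via another path
        seen.append(node)
        queue.extend(child_map.get(node, []))
    return seen


impacted = downstream(child_map, "model.shop.stg_orders")
print(impacted)
```

Running the same walk over parent_map instead of child_map gives the upstream direction, which is the root cause analysis case.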
Documentation Site Generation
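Running dbt docs generate writes the files for a static documentation site to the target/ directory, and dbt docs serve hosts that site locally:
- Catalog: catalog.json, holding schema information (column names and types) read from the warehouse
- Manifest: manifest.json, holding the complete project metadata, including descriptions, tests, and the dependency graph
- Lineage Graph: Interactive visualization of the DAG
- Search: Full-text search across documentation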
Metadata Management
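dbt collects and manages several kinds of metadata, most of it written as JSON artifacts in the target/ directory:
- Schema information: Column names, types, descriptions
- Test results: Pass/fail status for the tests in the most recent invocation (run_results.json)
- Run history: Execution times and status; run_results.json covers only the latest invocation, so keeping a history means archiving the artifact after each run
- Freshness: Data freshness metrics, written to sources.json by dbt source freshness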
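As an illustration of how this metadata can be consumed, here is a minimal Python sketch that summarizes a dictionary shaped like target/run_results.json. It assumes each entry in results carries unique_id, status, and execution_time; the sample values and the summarize helper are hypothetical.

```python
from collections import Counter

# Hypothetical excerpt shaped like run_results.json (sample values only).
run_results = {
    "results": [
        {"unique_id": "model.shop.stg_orders", "status": "success", "execution_time": 1.5},
        {"unique_id": "model.shop.fct_orders", "status": "success", "execution_time": 4.0},
        {"unique_id": "test.shop.not_null_fct_orders_order_id", "status": "pass", "execution_time": 0.5},
        {"unique_id": "test.shop.unique_fct_orders_order_id", "status": "fail", "execution_time": 0.5},
    ]
}


def summarize(run_results):
    """Return (status counts, total seconds, unique_id of the slowest node)."""
    results = run_results["results"]
    status_counts = Counter(r["status"] for r in results)
    total_seconds = sum(r["execution_time"] for r in results)
    slowest = max(results, key=lambda r: r["execution_time"])["unique_id"]
    return dict(status_counts), total_seconds, slowest


counts, total_seconds, slowest = summarize(run_results)
print(counts, total_seconds, slowest)
```

In a real project the dictionary would come from json.load on the artifact file rather than from an inline literal.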
Code Examples
Model Documentation YAML
# models/marts/fct_orders.yml
version: 2

models:
  - name: fct_orders
    description: >
      Fact table containing all order transactions. This is the central
      fact table for the order analytics domain. It contains one row
      per order with aggregated metrics and dimension attributes.
    config:
      tags: ['finance', 'core']
      meta:
        owner: data-engineering
        team: analytics
        cost_center: marketing
    columns:
      - name: order_id
        description: "Unique identifier for each order"
        data_tests:
          - unique
          - not_null
        meta:
          system: shopify
          pii: false
      - name: customer_id
        description: "Foreign key to dim_customers"
        data_tests:
          - not_null
          - relationships:
              to: ref('dim_customers')
              field: customer_id
        meta:
          pii: false
      - name: order_date
        description: "Date when the order was placed"
        data_tests:
          - not_null
        meta:
          format: YYYY-MM-DD
Source Documentation
# models/staging/_sources.yml
version: 2

sources:
  - name: shopify
    description: "Raw data from Shopify e-commerce platform"
    database: raw
    schema: shopify
    loader: fivetran
    loaded_at_field: _fivetran_synced
    freshness:
      warn_after: {count: 6, period: hour}
      error_after: {count: 24, period: hour}
    meta:
      owner: data-engineering
      team: ecommerce
      cost_center: platform
    tables:
      - name: orders
        description: "All orders from Shopify"
        meta:
          incremental: true
          partition_column: created_at
        columns:
          - name: id
            description: "Order ID from Shopify"
            data_tests:
              - unique
              - not_null
          - name: customer_id
            description: "Customer ID from Shopify"
            data_tests:
              - not_null
              - relationships:
                  to: ref('stg_customers')
                  field: customer_id
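The warn_after and error_after thresholds above can be read as a classification of the age of the newest loaded row. The sketch below shows that logic in Python under stated assumptions: the function names are hypothetical, and the strict greater-than comparison is an assumption rather than a statement of how dbt treats exact boundary values.

```python
from datetime import datetime, timedelta, timezone

# Period names as used in the freshness config above.
PERIODS = {
    "minute": timedelta(minutes=1),
    "hour": timedelta(hours=1),
    "day": timedelta(days=1),
}

WARN_AFTER = {"count": 6, "period": "hour"}
ERROR_AFTER = {"count": 24, "period": "hour"}


def threshold(rule):
    """Convert a {count, period} rule into a timedelta."""
    return rule["count"] * PERIODS[rule["period"]]


def freshness_status(max_loaded_at, now, warn_after, error_after):
    """Classify the age of the newest row as 'pass', 'warn', or 'error'."""
    age = now - max_loaded_at
    if age > threshold(error_after):
        return "error"
    if age > threshold(warn_after):
        return "warn"
    return "pass"


now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
for hours in (1, 8, 30):
    loaded = now - timedelta(hours=hours)
    print(hours, freshness_status(loaded, now, WARN_AFTER, ERROR_AFTER))
```

With the thresholds from the YAML, data that is 1 hour old passes, 8 hours old warns, and 30 hours old is an error.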
Exposure Documentation
# models/exposures/order_dashboard.yml
# (exposure YAML files need to live under one of the project's model paths)
version: 2

exposures:
  - name: order_analytics_dashboard
    type: dashboard
    description: "Main dashboard for order analytics"
    url: https://looker.example.com/dashboards/order_analytics
    depends_on:
      - ref('fct_orders')
      - ref('dim_customers')
      - ref('fct_order_items')
    meta:
      owner: analytics-team
      team: business-intelligence
      refresh_frequency: hourly
    owner:
      name: Data Engineering
      email: data-eng@company.com
Custom Metadata Tags
# models/marts/dim_customers.yml
version: 2

models:
  - name: dim_customers
    description: "Customer dimension table"
    config:
      tags: ['customer', 'dimension', 'pii']
      meta:
        data_classification: confidential
        retention_days: 730
        backup_policy: daily
        compliance:
          - gdpr
          - ccpa
    columns:
      - name: customer_id
        description: "Unique customer identifier"
        meta:
          pii: false
          system: primary_key
      - name: email
        description: "Customer email address"
        meta:
          pii: true
          encryption: aes-256
          masking_policy: hash
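Custom metadata like the pii flag above becomes useful once something reads it. The following Python sketch lists PII columns from a dictionary shaped like the nodes section of manifest.json. It assumes the flag appears under each column's meta key (depending on the dbt version it may sit elsewhere in the column entry); the project name shop and the pii_columns helper are hypothetical.

```python
# Hypothetical excerpt shaped like the "nodes" section of manifest.json.
manifest = {
    "nodes": {
        "model.shop.dim_customers": {
            "resource_type": "model",
            "name": "dim_customers",
            "columns": {
                "customer_id": {"name": "customer_id", "meta": {"pii": False}},
                "email": {
                    "name": "email",
                    "meta": {"pii": True, "masking_policy": "hash"},
                },
            },
        },
        "test.shop.not_null_dim_customers_email": {
            "resource_type": "test",
            "name": "not_null_dim_customers_email",
            "columns": {},
        },
    }
}


def pii_columns(manifest):
    """Return sorted (model, column) pairs whose column meta sets pii: true."""
    found = []
    for node in manifest["nodes"].values():
        if node.get("resource_type") != "model":
            continue  # skip tests, seeds, snapshots, and so on
        for column in node.get("columns", {}).values():
            if column.get("meta", {}).get("pii") is True:
                found.append((node["name"], column["name"]))
    return sorted(found)


print(pii_columns(manifest))
```

A check like this could run in CI to confirm that every column flagged as PII also declares a masking policy.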
Lineage Query
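-- Query to find all downstream dependencies of fct_orders
-- (using metadata tables). dbt_model_dependencies is assumed to be a
-- project-specific model with one row per (source_model, target_model)
-- edge; dbt does not provide it out of the box.
-- The recursive keyword is required by engines such as PostgreSQL.
with recursive model_dependencies as (
    select
        source_model,
        target_model
    from {{ ref('dbt_model_dependencies') }}
),

recursive_lineage as (
    -- Anchor: direct children of fct_orders
    select
        target_model as model_name,
        1 as level,
        cast(target_model as varchar(1000)) as path
    from model_dependencies
    where source_model = 'fct_orders'

    union all

    -- Recursive step: children of the models found so far
    select
        md.target_model,
        rl.level + 1,
        cast(rl.path || ' -> ' || md.target_model as varchar(1000))
    from model_dependencies md
    inner join recursive_lineage rl on md.source_model = rl.model_name
    where rl.level < 10  -- depth cap guards against cycles
)

select distinct
    model_name,
    level,
    path
from recursive_lineage
order by level, model_name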
Performance Metrics
| Metric | Description | Typical Value |
|---|---|---|
| Doc Generation Time | Time to build docs site | 10-30 seconds |
| Search Index Size | Size of search index | 1-5 MB |
| Lineage Graph Size | Nodes in lineage graph | 100-1000+ |
| Documentation Coverage | % of models documented | 80-100% |
| Test Coverage | % of columns tested | 70-90% |
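The Documentation Coverage row can be computed from the manifest. Here is a minimal Python sketch, assuming a dictionary shaped like the nodes section of manifest.json and treating a model as documented when its description is non-empty. That definition, the sample nodes, and the documentation_coverage helper are assumptions made for illustration.

```python
# Hypothetical excerpt shaped like the "nodes" section of manifest.json.
manifest = {
    "nodes": {
        "model.shop.stg_orders": {"resource_type": "model", "description": "Staged orders"},
        "model.shop.stg_customers": {"resource_type": "model", "description": ""},
        "model.shop.fct_orders": {"resource_type": "model", "description": "Order facts"},
        "model.shop.dim_customers": {"resource_type": "model", "description": "Customers"},
        "test.shop.unique_fct_orders_order_id": {"resource_type": "test", "description": ""},
    }
}


def documentation_coverage(manifest):
    """Return the percentage of models that have a non-empty description."""
    models = [
        node for node in manifest["nodes"].values()
        if node.get("resource_type") == "model"
    ]
    if not models:
        return 0.0
    documented = [m for m in models if m.get("description", "").strip()]
    return 100.0 * len(documented) / len(models)


print(documentation_coverage(manifest))  # 3 of 4 models documented -> 75.0
```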
Best Practices
- Document everything - Add descriptions to all models and columns
- Use consistent naming - Follow a naming convention for tags and metadata
- Track lineage - Use ref() and source() for complete lineage
- Monitor freshness - Configure source freshness checks
- Tag appropriately - Use tags for filtering and organization
- Define exposures - Document how data is consumed downstream
- Review regularly - Keep documentation up to date
- Use metadata - Add custom metadata for governance and compliance