dbt Core Architecture
Architecture Overview
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β dbt CORE ARCHITECTURE β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β ββββββββββββββββ ββββββββββββββββ ββββββββββββββββ β
β β MANIFEST βββββΆβ COMPILER βββββΆβ RUNNER β β
β β (YAML/ β β (Jinja + β β (Execution β β
β β SQL) β β SQL) β β Engine) β β
β ββββββββββββββββ ββββββββββββββββ ββββββββββββββββ β
β β β β β
β βΌ βΌ βΌ β
β ββββββββββββββββ ββββββββββββββββ ββββββββββββββββ β
β β PROJECT β β GRAPH β β RESULTS β β
β β CONFIG β β OPERATOR β β (JSON) β β
β ββββββββββββββββ ββββββββββββββββ ββββββββββββββββ β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Compilation Pipeline
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β COMPILATION PIPELINE FLOW β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β βββββββββββ βββββββββββ βββββββββββ βββββββββββ βββββββββββ β
β β PARSE ββββΆβ RESOLVE ββββΆβ COMPILE ββββΆβ EXECUTE ββββΆβ TEST β β
β β β β β β β β β β β β
β ββ’YAML β ββ’Refs β ββ’Jinja β ββ’SQL β ββ’Schema β β
β ββ’SQL β ββ’Sources β ββ’Render β ββ’BQ/SF β ββ’Data β β
β ββ’Graph β ββ’Packagesβ ββ’Validateβ ββ’Redshiftβ ββ’Custom β β
β βββββββββββ βββββββββββ βββββββββββ βββββββββββ βββββββββββ β
β β β β β β β
β βΌ βΌ βΌ βΌ βΌ β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β ARTIFACTS OUTPUT β β
β β β’ manifest.json β’ run_results.json β’ catalog.json β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Project Structure
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β DBT PROJECT DIRECTORY STRUCTURE β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β my_dbt_project/ β
β βββ dbt_project.yml βββ Project configuration β
β βββ packages.yml βββ Package dependencies β
β βββ profiles.yml βββ Connection profiles β
β βββ models/ β
β β βββ staging/ βββ Staging models (1:1 with sources) β
β β β βββ _sources.yml βββ Source definitions β
β β β βββ stg_customers.sql β
β β β βββ stg_orders.sql β
β β βββ intermediate/ βββ Business logic transformations β
β β β βββ int_orders.sql β
β β βββ marts/ βββ Final analytical models β
β β βββ finance/ β
β β β βββ fct_orders.sql β
β β βββ customers/ β
β β βββ dim_customers.sql β
β βββ seeds/ βββ CSV files loaded as tables β
β βββ snapshots/ βββ Slowly changing dimensions β
β βββ tests/ βββ Custom data tests β
β βββ macros/ βββ Reusable Jinja macros β
β βββ analysis/ βββ Ad-hoc analysis queries β
β βββ target/ βββ Compiled output (gitignored) β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Detailed Explanation
dbt (data build tool) represents a paradigm shift in data transformation, embodying the principles of software engineering applied to analytics code. At its core, dbt is a SQL-based transformation tool that enables data teams to build clean, tested, documented data models using SQL and Python.
Compilation Process
The compilation process in dbt follows a sophisticated pipeline that transforms human-readable SQL and YAML configurations into executable queries. When you invoke dbt, it first parses the project manifest by reading all YAML configuration files and SQL model files. The manifest serves as the single source of truth for the entire project, containing metadata about models, sources, tests, macros, and their relationships.
During parsing, dbt constructs a directed acyclic graph (DAG) that represents all dependencies between models. This graph is crucial for determining execution order, as dbt must process models in topological order to ensure that all upstream dependencies are materialized before downstream models reference them.
The resolution phase handles the ref() function calls, which are special dbt functions that resolve to the actual database object references. When dbt encounters {{ ref('model_name') }}, it looks up the model in the manifest and replaces it with the appropriate database-specific reference (e.g., database.schema.model_name for Snowflake or project.dataset.table for BigQuery).
Execution Engine
dbt's execution engine manages the actual SQL execution against the target data warehouse. It handles connection pooling, query submission, error handling, and result collection. The engine supports multiple materialization strategies, each with its own execution pattern:
- Table Materialization: Drops and recreates the target table on each run
- View Materialization: Creates or replaces a view definition
- Incremental Materialization: Only processes new or changed records
- Ephemeral Materialization: Injects CTEs into downstream models
Graph-Based Processing
dbt's DAG-based execution model provides several advantages:
- Parallel Execution: Independent models can be executed concurrently
- Selective Materialization: Only changed models and their dependents are rebuilt
- Lineage Tracking: Complete audit trail of data transformations
- Impact Analysis: Understand downstream effects of schema changes
Key Concepts
| Concept | Description | Purpose |
|---|---|---|
| Manifest | JSON representation of the project | Central metadata store |
| DAG | Directed Acyclic Graph of dependencies | Execution ordering |
| Materialization | How models are persisted | Storage strategy |
| Ref Function | Model reference resolution | Dependency management |
| Source Function | Source table references | Data lineage |
| Test | Data quality assertions | Validation |
| Macro | Reusable Jinja code blocks | Code reuse |
| Snapshot | Historical state tracking | SCD Type 2 |
Code Examples
dbt_project.yml Configuration
# dbt_project.yml
name: 'my_analytics_project'
version: '1.0.0'
config-version: 2
profile: 'analytics'
model-paths: ["models"]
analysis-paths: ["analysis"]
test-paths: ["tests"]
seed-paths: ["seeds"]
macro-paths: ["macros"]
snapshot-paths: ["snapshots"]
docs-paths: ["docs"]
clean-targets:
- "target"
- "dbt_packages"
- "dbt_modules"
models:
my_analytics_project:
staging:
+materialized: view
+schema: staging
intermediate:
+materialized: ephemeral
marts:
+materialized: incremental
+schema: analytics
vars:
start_date: '2020-01-01'
enable_debug: false
Model Compilation Example
-- models/marts/fct_orders.sql
{{
config(
materialized='incremental',
unique_key='order_id',
partition_by={
"field": "order_date",
"data_type": "date",
"granularity": "day"
},
cluster_by=['customer_id', 'product_category']
)
}}
with orders as (
select * from {{ ref('stg_orders') }}
),
customers as (
select * from {{ ref('dim_customers') }}
),
order_items as (
select * from {{ ref('stg_order_items') }}
),
final as (
select
orders.order_id,
orders.order_date,
orders.status,
customers.customer_id,
customers.customer_name,
customers.segment,
sum(order_items.quantity * order_items.unit_price) as order_total,
count(order_items.item_id) as item_count,
{{ dbt_utils.current_timestamp() }} as updated_at
from orders
left join customers on orders.customer_id = customers.customer_id
left join order_items on orders.order_id = order_items.order_id
group by 1, 2, 3, 4, 5, 6
)
select * from final
{% if is_incremental() %}
where updated_at > (select max(updated_at) from {{ this }})
{% endif %}
Source Definition
# models/staging/_sources.yml
version: 2
sources:
- name: raw
database: raw_data
schema: public
loader: fivetran
loaded_at_field: _fivetran_synced
freshness:
warn_after: {count: 6, period: hour}
error_after: {count: 12, period: hour}
tables:
- name: orders
description: "Raw orders from Shopify"
columns:
- name: id
description: "Primary key"
data_tests:
- unique
- not_null
- name: customer_id
description: "Foreign key to customers"
data_tests:
- not_null
- relationships:
to: ref('stg_customers')
field: customer_id
- name: status
description: "Order status"
data_tests:
- accepted_values:
values: ['pending', 'shipped', 'delivered', 'cancelled']
Performance Metrics
| Metric | Description | Typical Value |
|---|---|---|
| Parse Time | Time to read manifest | 2-5 seconds |
| Compilation | Jinja rendering time | 1-3 seconds |
| Execution | SQL execution time | Variable |
| Test Time | Data test execution | 1-10 seconds |
| Documentation | Docs generation | 5-15 seconds |
Best Practices
- Layered Architecture: Organize models into staging, intermediate, and mart layers
- Naming Conventions: Use
stg_prefix for staging,fct_for facts,dim_for dimensions - Single Source of Truth: Define all sources in
_sources.ymlfiles - Incremental Models: Use incremental materialization for large tables
- Testing: Add tests to all critical models and columns
- Documentation: Document every model and column with descriptions
- Version Control: Use Git for all dbt code with proper branching strategies
- CI/CD: Implement automated testing and deployment pipelines