dbt Core Architecture

Architecture Overview

Architecture Diagram
┌─────────────────────────────────────────────────────────────────────────────┐
│                           dbt CORE ARCHITECTURE                             │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐                   │
│  │   MANIFEST   │───▶│  COMPILER    │───▶│   RUNNER     │                   │
│  │   (YAML/     │    │  (Jinja +    │    │  (Execution  │                   │
│  │    SQL)      │    │   SQL)       │    │   Engine)    │                   │
│  └──────────────┘    └──────────────┘    └──────────────┘                   │
│         │                   │                   │                           │
│         ▼                   ▼                   ▼                           │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐                   │
│  │   PROJECT    │    │  GRAPH       │    │   RESULTS    │                   │
│  │   CONFIG     │    │  OPERATOR    │    │   (JSON)     │                   │
│  └──────────────┘    └──────────────┘    └──────────────┘                   │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

Compilation Pipeline

Architecture Diagram
┌─────────────────────────────────────────────────────────────────────────────┐
│                          COMPILATION PIPELINE FLOW                          │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  ┌─────────┐   ┌─────────┐   ┌─────────┐   ┌─────────┐   ┌─────────┐        │
│  │ PARSE   │──▶│ RESOLVE │──▶│ COMPILE │──▶│ EXECUTE │──▶│  TEST   │        │
│  │         │   │         │   │         │   │         │   │         │        │
│  │•YAML    │   │•Refs    │   │•Jinja   │   │•SQL     │   │•Schema  │        │
│  │•SQL     │   │•Sources │   │•Render  │   │•BQ/SF   │   │•Data    │        │
│  │•Graph   │   │•Packages│   │•Validate│   │•Redshift│   │•Custom  │        │
│  └─────────┘   └─────────┘   └─────────┘   └─────────┘   └─────────┘        │
│       │             │             │             │             │             │
│       ▼             ▼             ▼             ▼             ▼             │
│  ┌─────────────────────────────────────────────────────────────────┐        │
│  │                        ARTIFACTS OUTPUT                         │        │
│  │  • manifest.json    • run_results.json    • catalog.json        │        │
│  └─────────────────────────────────────────────────────────────────┘        │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

Project Structure

Architecture Diagram
┌─────────────────────────────────────────────────────────────────────────────┐
│                       DBT PROJECT DIRECTORY STRUCTURE                       │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  my_dbt_project/                                                            │
│  ├── dbt_project.yml          ◄── Project configuration                     │
│  ├── packages.yml             ◄── Package dependencies                      │
│  ├── profiles.yml             ◄── Connection profiles                       │
│  ├── models/                                                                │
│  │   ├── staging/             ◄── Staging models (1:1 with sources)         │
│  │   │   ├── _sources.yml     ◄── Source definitions                        │
│  │   │   ├── stg_customers.sql                                              │
│  │   │   └── stg_orders.sql                                                 │
│  │   ├── intermediate/        ◄── Business logic transformations            │
│  │   │   └── int_orders.sql                                                 │
│  │   └── marts/               ◄── Final analytical models                   │
│  │       ├── finance/                                                       │
│  │       │   └── fct_orders.sql                                             │
│  │       └── customers/                                                     │
│  │           └── dim_customers.sql                                          │
│  ├── seeds/                   ◄── CSV files loaded as tables                │
│  ├── snapshots/               ◄── Slowly changing dimensions                │
│  ├── tests/                   ◄── Custom data tests                         │
│  ├── macros/                  ◄── Reusable Jinja macros                     │
│  ├── analysis/                ◄── Ad-hoc analysis queries                   │
│  └── target/                  ◄── Compiled output (gitignored)              │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

Detailed Explanation

dbt (data build tool) applies software engineering practices such as version control, testing, and documentation to analytics code. At its core, dbt is a SQL-first transformation tool that lets data teams build clean, tested, documented data models in SQL and, on adapters that support it, Python.

Compilation Process

Compilation turns human-readable SQL and YAML configuration into executable queries. When you invoke dbt, it first parses the project, reading all YAML configuration files and SQL model files, and builds the manifest from them. The manifest serves as the single source of truth for the entire project, containing metadata about models, sources, tests, macros, and their relationships.
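To make the manifest's role concrete, here is a minimal Python sketch that pulls model dependencies out of a manifest-shaped dictionary. The sample data is hand-written for illustration and far smaller than a real manifest.json; the helper names (model_dependencies, load_manifest) are hypothetical, and the default path assumes the standard target/ directory.

```python
import json
from pathlib import Path


def model_dependencies(manifest: dict) -> dict[str, list[str]]:
    """Map each model's unique_id to the unique_ids it depends on."""
    deps = {}
    for unique_id, node in manifest.get("nodes", {}).items():
        if node.get("resource_type") != "model":
            continue  # skip tests, seeds, snapshots, and other node types
        deps[unique_id] = node.get("depends_on", {}).get("nodes", [])
    return deps


def load_manifest(path: str = "target/manifest.json") -> dict:
    """Read the manifest artifact from disk (path is an assumption)."""
    return json.loads(Path(path).read_text())


# Hand-written stand-in for manifest.json, not real dbt output.
sample_manifest = {
    "nodes": {
        "model.my_analytics_project.stg_orders": {
            "resource_type": "model",
            "depends_on": {"nodes": ["source.my_analytics_project.raw.orders"]},
        },
        "model.my_analytics_project.fct_orders": {
            "resource_type": "model",
            "depends_on": {"nodes": ["model.my_analytics_project.stg_orders"]},
        },
        "test.my_analytics_project.not_null_stg_orders_order_id": {
            "resource_type": "test",
            "depends_on": {"nodes": ["model.my_analytics_project.stg_orders"]},
        },
    }
}

dependencies = model_dependencies(sample_manifest)
for model, parents in dependencies.items():
    print(model, "<-", parents)
```

Against a real project, calling model_dependencies(load_manifest()) after a dbt run or dbt parse should give the same kind of mapping, though real nodes carry many more fields.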

During parsing, dbt constructs a directed acyclic graph (DAG) that represents all dependencies between models. This graph is crucial for determining execution order, as dbt must process models in topological order to ensure that all upstream dependencies are materialized before downstream models reference them.
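The ordering idea can be sketched with Python's standard graphlib module. This is not dbt's own scheduler, only an illustration of topological ordering and cycle detection; the model names are borrowed from this lesson's examples and the has_cycle helper is hypothetical.

```python
from graphlib import CycleError, TopologicalSorter

# Each model maps to the set of models it depends on (its upstream parents).
dag = {
    "stg_orders": set(),
    "stg_customers": set(),
    "stg_order_items": set(),
    "dim_customers": {"stg_customers"},
    "fct_orders": {"stg_orders", "dim_customers", "stg_order_items"},
}

# static_order() yields every node after all of its parents.
run_order = list(TopologicalSorter(dag).static_order())
print(run_order)


def has_cycle(graph: dict[str, set[str]]) -> bool:
    """Return True if the dependency graph contains a cycle."""
    try:
        list(TopologicalSorter(graph).static_order())
    except CycleError:
        return True
    return False
```

A graph such as {"a": {"b"}, "b": {"a"}} makes has_cycle return True, which is why the graph has to stay acyclic: with a cycle there is no valid build order.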

The resolution phase handles ref() calls. Each {{ ref('model_name') }} does two things: it records a dependency edge in the DAG, and at compile time dbt looks up the model in the manifest and replaces the call with the appropriate database-specific reference (e.g., database.schema.model_name for Snowflake or project.dataset.table for BigQuery).
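As a rough illustration of that substitution, the sketch below replaces ref() calls using a regular expression and a lookup table. Real dbt renders the template with Jinja and evaluates ref() as a function, so this is a simplification; RELATIONS, the database and schema names, and resolve_refs are all made up for the example.

```python
import re

# Hypothetical lookup standing in for the manifest: model name -> relation parts.
RELATIONS = {
    "stg_orders": ("analytics_db", "staging", "stg_orders"),
    "dim_customers": ("analytics_db", "analytics", "dim_customers"),
}

# Matches {{ ref('name') }} or {{ ref("name") }} with optional whitespace.
REF_PATTERN = re.compile(r"""\{\{\s*ref\(\s*['"]([^'"]+)['"]\s*\)\s*\}\}""")


def resolve_refs(sql: str, relations: dict[str, tuple[str, str, str]]) -> str:
    """Replace each ref() call with a fully qualified relation name."""

    def replace(match: re.Match) -> str:
        name = match.group(1)
        if name not in relations:
            raise KeyError(f"ref('{name}') does not match any known model")
        return ".".join(relations[name])

    return REF_PATTERN.sub(replace, sql)


compiled = resolve_refs("select * from {{ ref('stg_orders') }}", RELATIONS)
print(compiled)
```

Raising an error for an unknown name mirrors the general behavior described above: a ref() that points at a model missing from the project cannot be compiled.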

Execution Engine

dbt's execution engine manages the actual SQL execution against the target data warehouse. It handles connection pooling, query submission, error handling, and result collection. The engine supports multiple materialization strategies, each with its own execution pattern:

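  1. Table Materialization: Rebuilds the target table from scratch on each run (the exact statements, such as create-or-replace or create-and-swap, depend on the adapter)
  2. View Materialization: Creates or replaces a view definition
  3. Incremental Materialization: Inserts or updates only the rows selected by the model's incremental filter, typically new or changed records
  4. Ephemeral Materialization: Creates no database object; the model's SQL is injected as a CTE into downstream models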
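The four strategies can be contrasted with a small Python sketch that returns the kind of statement each one might issue. The SQL strings are deliberately simplified and do not match what any particular adapter emits; build_statements is a hypothetical helper, not part of dbt.

```python
def build_statements(relation: str, select_sql: str, materialized: str) -> list[str]:
    """Return simplified SQL statements for one model (illustrative only)."""
    if materialized == "table":
        return [f"create or replace table {relation} as (\n{select_sql}\n)"]
    if materialized == "view":
        return [f"create or replace view {relation} as (\n{select_sql}\n)"]
    if materialized == "incremental":
        # Assumes the table already exists; a first run would build it like a table.
        return [f"insert into {relation}\n{select_sql}"]
    if materialized == "ephemeral":
        # No database object is created; downstream models inline this SQL as a CTE.
        return []
    raise ValueError(f"unknown materialization: {materialized!r}")


for kind in ("table", "view", "incremental", "ephemeral"):
    print(kind, "->", build_statements("analytics.fct_orders", "select 1 as id", kind))
```

The empty list for ephemeral models is the point of the comparison: they never reach the warehouse as their own statement.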

Graph-Based Processing

dbt's DAG-based execution model provides several advantages:

  • Parallel Execution: Independent models can be executed concurrently
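  • Selective Materialization: With node selection (for example, dbt run --select state:modified+ --state <path-to-previous-artifacts>), only changed models and their dependents are rebuilt
  • Lineage Tracking: The graph records which sources and models feed each model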
  • Impact Analysis: Understand downstream effects of schema changes
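Two of these advantages, parallel execution and impact analysis, can be sketched in Python with graphlib. dbt itself hands nodes to worker threads as they become ready rather than running strict batches, so the batching here is a simplification; parallel_batches and downstream_of are hypothetical helpers, and the model names come from this lesson's examples.

```python
from graphlib import TopologicalSorter

# Each model maps to the set of models it depends on.
parents = {
    "stg_orders": set(),
    "stg_customers": set(),
    "stg_order_items": set(),
    "dim_customers": {"stg_customers"},
    "fct_orders": {"stg_orders", "dim_customers", "stg_order_items"},
}


def parallel_batches(graph: dict[str, set[str]]) -> list[list[str]]:
    """Group models into batches whose members do not depend on each other."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything whose parents are done
        batches.append(ready)
        sorter.done(*ready)
    return batches


def downstream_of(model: str, graph: dict[str, set[str]]) -> set[str]:
    """Return every model that directly or transitively depends on `model`."""
    affected: set[str] = set()
    frontier = [model]
    while frontier:
        current = frontier.pop()
        for child, upstream in graph.items():
            if current in upstream and child not in affected:
                affected.add(child)
                frontier.append(child)
    return affected


batches = parallel_batches(parents)
impacted = downstream_of("stg_customers", parents)
print(batches)
print(impacted)
```

Here the three staging models form the first batch, and a change to stg_customers is reported as affecting dim_customers and fct_orders.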

Key Concepts

Concept          Description                             Purpose
---------------  --------------------------------------  ----------------------
Manifest         JSON representation of the project      Central metadata store
DAG              Directed Acyclic Graph of dependencies  Execution ordering
Materialization  How models are persisted                Storage strategy
Ref Function     Model reference resolution              Dependency management
Source Function  Source table references                 Data lineage
Test             Data quality assertions                 Validation
Macro            Reusable Jinja code blocks              Code reuse
Snapshot         Historical state tracking               SCD Type 2

Code Examples

dbt_project.yml Configuration

# dbt_project.yml
name: 'my_analytics_project'
version: '1.0.0'
config-version: 2

profile: 'analytics'

model-paths: ["models"]
analysis-paths: ["analysis"]
test-paths: ["tests"]
seed-paths: ["seeds"]
macro-paths: ["macros"]
snapshot-paths: ["snapshots"]
docs-paths: ["docs"]

clean-targets:
  - "target"
  - "dbt_packages"
  - "dbt_modules"

models:
  my_analytics_project:
    staging:
      +materialized: view
      +schema: staging
    intermediate:
      +materialized: ephemeral
    marts:
      +materialized: incremental
      +schema: analytics

vars:
  start_date: '2020-01-01'
  enable_debug: false

Model Compilation Example

-- models/marts/finance/fct_orders.sql
{{
    config(
        materialized='incremental',
        unique_key='order_id',
        partition_by={
            "field": "order_date",
            "data_type": "date",
            "granularity": "day"
        },
        -- cluster only on columns that exist in the final select
        cluster_by=['customer_id', 'status']
    )
}}

with orders as (
    select * from {{ ref('stg_orders') }}
),

customers as (
    select * from {{ ref('dim_customers') }}
),

order_items as (
    select * from {{ ref('stg_order_items') }}
),

final as (
    select
        orders.order_id,
        orders.order_date,
        orders.status,
        customers.customer_id,
        customers.customer_name,
        customers.segment,
        sum(order_items.quantity * order_items.unit_price) as order_total,
        count(order_items.item_id) as item_count,
        {{ dbt.current_timestamp() }} as updated_at
    from orders
    left join customers on orders.customer_id = customers.customer_id
    left join order_items on orders.order_id = order_items.order_id
    group by 1, 2, 3, 4, 5, 6
)

select * from final

{% if is_incremental() %}
-- updated_at is generated at run time, so filter on a source-derived column instead
where order_date >= (select max(order_date) from {{ this }})
{% endif %}
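The effect of unique_key on an incremental run can be pictured as an upsert. The Python sketch below mimics that behavior on in-memory rows; it illustrates the semantics only and is not how any warehouse executes a merge. merge_incremental and the sample rows are invented for the example.

```python
def merge_incremental(existing: list[dict], new_rows: list[dict], unique_key: str) -> list[dict]:
    """Upsert new_rows into existing, matching rows on unique_key."""
    merged = {row[unique_key]: row for row in existing}
    for row in new_rows:
        merged[row[unique_key]] = row  # replace a matching row or add a new one
    return list(merged.values())


existing_rows = [
    {"order_id": 1, "status": "pending"},
    {"order_id": 2, "status": "shipped"},
]
incoming_rows = [
    {"order_id": 2, "status": "delivered"},  # changed row: replaces order 2
    {"order_id": 3, "status": "pending"},    # new row: appended
]

result = merge_incremental(existing_rows, incoming_rows, "order_id")
print(result)
```

Without a unique_key, an incremental model would typically just append the incoming rows, so order 2 would appear twice.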

Source Definition

# models/staging/_sources.yml
version: 2

sources:
  - name: raw
    database: raw_data
    schema: public
    loader: fivetran
    loaded_at_field: _fivetran_synced
    freshness:
      warn_after: {count: 6, period: hour}
      error_after: {count: 12, period: hour}
    tables:
      - name: orders
        description: "Raw orders from Shopify"
        columns:
          - name: id
            description: "Primary key"
            data_tests:
              - unique
              - not_null
          - name: customer_id
            description: "Foreign key to customers"
            data_tests:
              - not_null
              - relationships:
                  to: ref('stg_customers')
                  field: customer_id
          - name: status
            description: "Order status"
            data_tests:
              - accepted_values:
                  values: ['pending', 'shipped', 'delivered', 'cancelled']
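
The freshness thresholds in this source definition can be read as a simple age check on loaded_at_field. The Python sketch below shows that logic under the assumption that the status is decided by comparing the newest load time against warn_after and error_after; the exact boundary handling in dbt may differ, and freshness_status is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone


def freshness_status(
    max_loaded_at: datetime,
    now: datetime,
    warn_after: timedelta = timedelta(hours=6),
    error_after: timedelta = timedelta(hours=12),
) -> str:
    """Classify a source as pass, warn, or error from the age of its newest row."""
    age = now - max_loaded_at
    if age > error_after:
        return "error"
    if age > warn_after:
        return "warn"
    return "pass"


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
recent = freshness_status(now - timedelta(hours=2), now)
stale = freshness_status(now - timedelta(hours=8), now)
very_stale = freshness_status(now - timedelta(hours=13), now)
print(recent, stale, very_stale)
```

The defaults mirror the 6-hour and 12-hour thresholds configured above, so data synced 8 hours ago would warn and data synced 13 hours ago would fail.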

Performance Metrics

Metric         Description             Typical Value
-------------  ----------------------  -------------
Parse Time     Time to read manifest   2-5 seconds
Compilation    Jinja rendering time    1-3 seconds
Execution      SQL execution time      Variable
Test Time      Data test execution     1-10 seconds
Documentation  Docs generation         5-15 seconds
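
Rather than relying on typical values, you can measure your own project from the run_results.json artifact. The Python sketch below ranks nodes by execution time; the sample data is hand-written and much smaller than a real artifact, and slowest_nodes and load_run_results are hypothetical helpers with an assumed default path.

```python
import json
from pathlib import Path


def slowest_nodes(run_results: dict, top_n: int = 3) -> list[tuple[str, float]]:
    """Return the top_n (unique_id, execution_time) pairs, slowest first."""
    timings = [
        (result["unique_id"], result["execution_time"])
        for result in run_results.get("results", [])
    ]
    return sorted(timings, key=lambda pair: pair[1], reverse=True)[:top_n]


def load_run_results(path: str = "target/run_results.json") -> dict:
    """Read the run results artifact from disk (path is an assumption)."""
    return json.loads(Path(path).read_text())


# Hand-written stand-in for run_results.json, not real dbt output.
sample_run_results = {
    "elapsed_time": 42.5,
    "results": [
        {"unique_id": "model.my_analytics_project.stg_orders", "status": "success", "execution_time": 1.2},
        {"unique_id": "model.my_analytics_project.fct_orders", "status": "success", "execution_time": 35.0},
        {"unique_id": "test.my_analytics_project.unique_stg_orders_id", "status": "pass", "execution_time": 0.8},
    ],
}

slowest = slowest_nodes(sample_run_results, top_n=2)
print(slowest)
```

Tracking these numbers across runs usually says more about where time goes than any generic benchmark.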

Best Practices

  1. Layered Architecture: Organize models into staging, intermediate, and mart layers
  2. Naming Conventions: Use stg_ prefix for staging, fct_ for facts, dim_ for dimensions
  3. Single Source of Truth: Define all sources in _sources.yml files
  4. Incremental Models: Use incremental materialization for large tables
  5. Testing: Add tests to all critical models and columns
  6. Documentation: Document every model and column with descriptions
  7. Version Control: Use Git for all dbt code with proper branching strategies
  8. CI/CD: Implement automated testing and deployment pipelines
