Analytics Engineering: dbt

Module 4: Specialization + CareerFree Lesson

Advertisement

What is Analytics Engineering?

DfAnalytics Engineering

Analytics Engineering sits at the intersection of data engineering and data analysis. It's the discipline of transforming raw data into clean, tested, documented datasets that power business intelligence and data science. It brings software engineering practices (version control, testing, modularity) to the data transformation layer.

Architecture Diagram
+------------------------------------------------------------------+
|                    The Analytics Engineering Bridge                |
|                                                                   |
|  Data Engineering          Analytics Eng.          Data Analysis  |
|  +--------------+        +--------------+        +--------------+|
|  |              |        |              |        |              ||
|  |  Raw Data    |------->|  Clean,      |------->|  Dashboards  ||
|  |  Pipelines   |        |  Modeled     |        |  Reports     ||
|  |  (Ingest)    |        |  (dbt)       |        |  Insights    ||
|  |              |        |              |        |              ||
|  +--------------+        +--------------+        +--------------+|
|                                                                   |
|  "Get data in"          "Transform + Test"       "Get insights"  |
+------------------------------------------------------------------+

Why Analytics Engineering Matters

â„šī¸ Why Analytics Engineering Matters

Without analytics engineering, data teams face inconsistent metrics, no data quality checks, undocumented transformations, and fragile pipelines. Analytics engineering solves these problems by bringing rigor to the transformation layer.

  • Inconsistent metrics — different teams compute "revenue" differently
  • No data quality checks — broken data flows silently into dashboards
  • Undocumented transformations — nobody knows what a column means
  • Fragile pipelines — one schema change breaks everything downstream

dbt Fundamentals

dbt (data build tool) is the industry standard for analytics engineering. It brings software engineering practices to data transformation.

The Core Concept: SQL + Python + Git

Architecture Diagram
+-----------------------------------------------------+
|                   dbt Architecture                   |
|                                                      |
|  +----------+    +----------+    +----------+      |
|  |  Source   |    |   dbt    |    |  Target  |      |
|  |  Data     |-->| Transform|-->|  Tables  |      |
|  |(Snowflake)|    |  Models  |    |(Analytics)|     |
|  +----------+    +----------+    +----------+      |
|                      |                                |
|               +------+------+                       |
|               |   Git Repo  |                       |
|               |  Versioned  |                       |
|               |  Tested     |                       |
|               |  Documented |                       |
|               +-------------+                       |
+-----------------------------------------------------+

dbt Project Structure

Architecture Diagram
my_dbt_project/
+-- dbt_project.yml          # Project configuration
+-- packages.yml             # Package dependencies
+-- profiles.yml             # Connection config (local)
+-- models/
|   +-- staging/             # 1:1 with source tables
|   |   +-- _staging__sources.yml
|   |   +-- _staging__models.yml
|   |   +-- stg_orders.sql
|   +-- intermediate/        # Business logic transformations
|   |   +-- int_orders_full.sql
|   +-- marts/               # Final business tables
|       +-- finance/
|       |   +-- fct_revenue.sql
|       +-- sales/
|           +-- fct_deals.sql
+-- tests/                   # Custom data tests
+-- macros/                  # Reusable SQL functions
+-- seeds/                   # CSV files loaded as tables
+-- snapshots/               # Slowly changing dimensions

dbt_project.yml

name: 'my_analytics_project'
version: '1.0.0'
config-version: 2

profile: 'my_profile'

model-paths: ["models"]
analysis-paths: ["analyses"]
test-paths: ["tests"]
seed-paths: ["data"]
macro-paths: ["macros"]
snapshot-paths: ["snapshots"]

target-path: "target"
clean-targets:
  - "target"
  - "dbt_packages"

models:
  my_analytics_project:
    staging:
      +materialized: view
    intermediate:
      +materialized: ephemeral
    marts:
      +materialized: table

Data Modeling with dbt

The Medallion Architecture (Bronze -> Silver -> Gold)

DfMedallion Architecture

A multi-layer data architecture pattern: Bronze (raw), Silver (cleaned), and Gold (business-ready) tables. Each layer adds quality, documentation, and business logic.

Architecture Diagram
+-------------------------------------------------------------+
|                    Data Modeling Layers                       |
|                                                              |
|  +-------------+   +-------------+   +-------------+      |
|  |   BRONZE    |   |   SILVER    |   |    GOLD     |      |
|  |             |   |             |   |             |      |
|  | Raw data    |-->| Cleaned     |-->| Business    |      |
|  | from source |   | + deduped   |   | aggregates  |      |
|  |             |   | + typed     |   | + joined    |      |
|  | (staging)   |   |(intermediate|   |  (marts)   |      |
|  +-------------+   +-------------+   +-------------+      |
|                                                              |
|  Staging:  1:1 mapping to source                             |
|  Intermediate: Business logic, reusable CTEs                |
|  Marts: Final tables for dashboards & science               |
+-------------------------------------------------------------+

Staging Models

-- models/staging/stg_orders.sql
{{ config(materialized='view') }}

with source as (
    select * from {{ source('raw', 'orders') }}
),

renamed as (
    select
        id as order_id,
        user_id,
        status,
        cast(created_at as timestamp) as ordered_at,
        cast(updated_at as timestamp) as updated_at,
        credit_card as payment_method,
        coupon as coupon_code,
        amount as order_amount
    from source
    where status != 'deleted'
)

select * from renamed
# models/staging/_staging__sources.yml
version: 2

sources:
  - name: raw
    database: raw_data
    schema: production
    tables:
      - name: orders
        loaded_at_field: updated_at
        freshness:
          warn_after: {count: 12, period: hour}
          error_after: {count: 24, period: hour}
      - name: users
      - name: products

Intermediate Models

-- models/intermediate/int_orders_full.sql
{{ config(materialized='ephemeral') }}

with orders as (
    select * from {{ ref('stg_orders') }}
),

customers as (
    select * from {{ ref('stg_customers') }}
),

products as (
    select * from {{ ref('stg_products') }}
),

order_items as (
    select * from {{ ref('stg_order_items') }}
),

order_payments as (
    select * from {{ ref('stg_order_payments') }}
),

full_orders as (
    select
        orders.order_id,
        orders.user_id,
        customers.customer_name,
        customers.email,
        orders.ordered_at,
        orders.status,
        orders.payment_method,
        order_items.product_id,
        products.product_name,
        products.category,
        products.price as product_price,
        order_payments.amount as payment_amount,
        order_payments.discount_amount,
        order_payments.amount - order_payments.discount_amount as net_amount
    from orders
    left join customers on orders.user_id = customers.user_id
    left join order_items on orders.order_id = order_items.order_id
    left join products on order_items.product_id = products.product_id
    left join order_payments on orders.order_id = order_payments.order_id
)

select * from full_orders

Mart Models

-- models/marts/finance/fct_revenue.sql
{{ config(materialized='table') }}

with orders as (
    select * from {{ ref('int_orders_full') }}
),

monthly_revenue as (
    select
        date_trunc('month', ordered_at) as revenue_month,
        count(distinct order_id) as total_orders,
        count(distinct user_id) as unique_customers,
        sum(net_amount) as total_revenue,
        avg(net_amount) as avg_order_value,
        sum(net_amount) / count(distinct user_id) as revenue_per_customer
    from orders
    where status = 'completed'
    group by 1
)

select
    revenue_month,
    total_orders,
    unique_customers,
    total_revenue,
    avg_order_value,
    revenue_per_customer,
    -- Month-over-month growth
    (total_revenue - lag(total_revenue) over (order by revenue_month))
        / lag(total_revenue) over (order by revenue_month) * 100
        as revenue_growth_pct
from monthly_revenue
order by revenue_month

Testing and Documentation

dbt Tests

Dfdbt Tests

dbt tests are assertions about your data. They catch data quality issues before they reach dashboards. dbt supports schema tests (built-in) and data tests (custom SQL).

# models/marts/finance/_finance__models.yml
version: 2

models:
  - name: fct_revenue
    description: "Monthly revenue metrics for the finance team"
    columns:
      - name: revenue_month
        description: "First day of the month"
        tests:
          - unique
          - not_null

      - name: total_revenue
        description: "Sum of completed order amounts"
        tests:
          - not_null
          - dbt_utils.greater_than:
              value: 0

      - name: unique_customers
        tests:
          - not_null
          - dbt_utils.greater_than:
              value: 0

Custom Data Tests

-- tests/assert_revenue_matches_payments.sql
-- This test checks that revenue in our mart matches source payments

with revenue as (
    select sum(total_revenue) as mart_revenue
    from {{ ref('fct_revenue') }}
),

payments as (
    select sum(amount) as source_revenue
    from {{ source('raw', 'payments') }}
    where status = 'completed'
    and date_trunc('month', created_at) >=
        (select min(revenue_month) from {{ ref('fct_revenue') }})
)

select
    mart_revenue,
    source_revenue,
    abs(mart_revenue - source_revenue) as difference
from revenue, payments
where abs(mart_revenue - source_revenue) > 1.00  -- Allow $1 rounding tolerance

Documentation

dbt generates automatic documentation including:

  • Column descriptions from YAML files
  • Lineage graphs showing model dependencies
  • Test results showing data quality
  • Source freshness monitoring
# Generate and view documentation
dbt docs generate
dbt docs serve
Architecture Diagram
+----------------------------------------------------------+
|               dbt Documentation Lineage Graph              |
|                                                           |
|  +----------+     +----------+     +----------+        |
|  |  stg_    |     |  int_    |     |  fct_    |        |
|  |  orders  |---->|  orders  |---->|  revenue |        |
|  +----------+     |  _full   |     +----------+        |
|                   +----+----+                           |
|  +----------+          |        +----------+            |
|  |  stg_    |----------+------->|  dim_    |            |
|  |  customers|          |        | customers|            |
|  +----------+          |        +----------+            |
|  +----------+          |                                 |
|  |  stg_    |----------+                                |
|  | products |                                           |
|  +----------+                                           |
+----------------------------------------------------------+

dbt Macros and Packages

Reusable Macros

-- macros/generate_schema_name.sql
{% macro generate_schema_name(custom_schema_name, node) -%}
    {%- set default_schema = target.schema -%}
    {%- if custom_schema_name is none -%}
        {{ default_schema }}
    {%- else -%}
        {{ default_schema }}_{{ custom_schema_name }}
    {%- endif -%}
{%- endmacro %}
-- macros/calculate_growth_rate.sql
{% macro calculate_growth_rate(current_col, previous_col) %}
    case
        when {{ previous_col }} = 0 then null
        else ({{ current_col }} - {{ previous_col }}) / {{ previous_col }} * 100
    end
{% endmacro %}

Essential Packages

# packages.yml
packages:
  - package: dbt-labs/dbt_utils
    version: ">=1.0.0"

  - package: dbt-labs/codegen
    version: ">=0.4.0"

  - package: calogica/dbt_expectations
    version: ">=0.8.0"
# Usage: dbt_utils tests
columns:
  - name: user_id
    tests:
      - dbt_utils.unique_combination_of_columns:
          combination_of_columns:
            - user_id
            - order_date

Complete dbt Project Example

# scripts/run_dbt.py
import subprocess
import sys
import json
from datetime import datetime

class DbtRunner:
    def __init__(self, project_dir="."):
        self.project_dir = project_dir
        self.results = {}

    def run_command(self, cmd: list[str]) -> dict:
        """Run a dbt command and capture output."""
        print(f"\n{'='*60}")
        print(f"Running: {' '.join(cmd)}")
        print(f"{'='*60}")

        result = subprocess.run(
            cmd,
            cwd=self.project_dir,
            capture_output=True,
            text=True
        )

        output = {
            "command": ' '.join(cmd),
            "returncode": result.returncode,
            "stdout": result.stdout,
            "stderr": result.stderr,
            "timestamp": datetime.now().isoformat()
        }

        if result.returncode != 0:
            print(f"ERROR: {result.stderr}")
        else:
            print(result.stdout)

        return output

    def run_full_pipeline(self):
        """Execute a complete dbt pipeline."""
        commands = [
            ["dbt", "deps"],           # Install packages
            ["dbt", "seed"],           # Load CSV seeds
            ["dbt", "run", "--full-refresh"],  # Build models
            ["dbt", "test"],           # Run tests
            ["dbt", "docs", "generate"],  # Generate docs
        ]

        for cmd in commands:
            result = self.run_command(cmd)
            self.results[cmd[1]] = result

            if result["returncode"] != 0:
                print(f"\nPipeline failed at step: {cmd[1]}")
                return False

        print("\n+ Full pipeline completed successfully!")
        return True

    def run_specific_models(self, selector: str):
        """Run specific models using selectors."""
        return self.run_command(
            ["dbt", "run", "--select", selector]
        )

    def generate_test_report(self) -> str:
        """Generate a test summary report."""
        report = f"""
dbt Pipeline Report
Generated: {datetime.now().isoformat()}

Results:
"""
        for step, result in self.results.items():
            status = "+ PASS" if result["returncode"] == 0 else "- FAIL"
            report += f"  {step}: {status}\n"

        return report


if __name__ == "__main__":
    runner = DbtRunner()

    if len(sys.argv) > 1 and sys.argv[1] == "--select":
        selector = sys.argv[2] if len(sys.argv) > 2 else "fct_revenue"
        runner.run_specific_models(selector)
    else:
        runner.run_full_pipeline()
        report = runner.generate_test_report()
        print(report)

Real-World dbt Patterns

Slowly Changing Dimensions (SCD Type 2)

-- snapshots/customers_snapshot.sql
{% snapshot customers_snapshot %}

{{
    config(
        target_schema='snapshots',
        unique_key='customer_id',
        strategy='timestamp',
        updated_at='updated_at',
        invalidate_hard_deletes=True
    )
}}

select * from {{ ref('stg_customers') }}

{% endsnapshot %}

Incremental Models

-- models/marts/marketing/fct_user_events.sql
{{ config(
    materialized='incremental',
    unique_key='event_id',
    incremental_strategy='merge'
) }}

with events as (
    select * from {{ ref('stg_events') }}

    {% if is_incremental() %}
        where event_time > (select max(event_time) from {{ this }})
    {% endif %}
)

select
    event_id,
    user_id,
    event_type,
    event_time,
    event_properties
from events

Key Takeaways

📋Summary: Analytics Engineering with dbt

  1. Analytics Engineering bridges data engineering and analysis with tested, documented transforms — bringing software engineering rigor to the data layer
  2. dbt brings software engineering practices (Git, testing, modularity) to SQL — enabling version-controlled, reproducible transformations
  3. Use the staging -> intermediate -> marts pattern for clean data modeling — each layer has a clear purpose and responsibility
  4. Tests catch data quality issues before they reach dashboards — schema tests and data tests provide defense-in-depth
  5. Documentation and lineage graphs make data self-service — stakeholders can understand what metrics mean and where they come from
  6. Incremental models and snapshots handle large-scale data efficiently — avoiding full table rebuilds

Practice Exercises

  1. Set up a dbt project connecting to a sample database (PostgreSQL or SQLite)
  2. Build a staging layer for 3 source tables with proper naming and types
  3. Create a mart that joins 2+ tables and computes business metrics
  4. Write 5 tests including uniqueness, not-null, and a custom test
  5. Generate documentation and explore the lineage graph
  6. Add an incremental model for a high-volume event table
  7. Use a package (dbt_utils) to test composite uniqueness

Advertisement

Need Expert Data Science Help?

Get personalized tutoring, project support, or professional consulting.

Advertisement