What is Analytics Engineering?
DfAnalytics Engineering
Analytics Engineering sits at the intersection of data engineering and data analysis. It's the discipline of transforming raw data into clean, tested, documented datasets that power business intelligence and data science. It brings software engineering practices (version control, testing, modularity) to the data transformation layer.
+------------------------------------------------------------------+
| The Analytics Engineering Bridge |
| |
| Data Engineering Analytics Eng. Data Analysis |
| +--------------+ +--------------+ +--------------+|
| | | | | | ||
| | Raw Data |------->| Clean, |------->| Dashboards ||
| | Pipelines | | Modeled | | Reports ||
| | (Ingest) | | (dbt) | | Insights ||
| | | | | | ||
| +--------------+ +--------------+ +--------------+|
| |
| "Get data in" "Transform + Test" "Get insights" |
+------------------------------------------------------------------+
Why Analytics Engineering Matters
âšī¸ Why Analytics Engineering Matters
Without analytics engineering, data teams face inconsistent metrics, no data quality checks, undocumented transformations, and fragile pipelines. Analytics engineering solves these problems by bringing rigor to the transformation layer.
- Inconsistent metrics â different teams compute "revenue" differently
- No data quality checks â broken data flows silently into dashboards
- Undocumented transformations â nobody knows what a column means
- Fragile pipelines â one schema change breaks everything downstream
dbt Fundamentals
dbt (data build tool) is the industry standard for analytics engineering. It brings software engineering practices to data transformation.
The Core Concept: SQL + Python + Git
+-----------------------------------------------------+
| dbt Architecture |
| |
| +----------+ +----------+ +----------+ |
| | Source | | dbt | | Target | |
| | Data |-->| Transform|-->| Tables | |
| |(Snowflake)| | Models | |(Analytics)| |
| +----------+ +----------+ +----------+ |
| | |
| +------+------+ |
| | Git Repo | |
| | Versioned | |
| | Tested | |
| | Documented | |
| +-------------+ |
+-----------------------------------------------------+
dbt Project Structure
my_dbt_project/
+-- dbt_project.yml # Project configuration
+-- packages.yml # Package dependencies
+-- profiles.yml # Connection config (local)
+-- models/
| +-- staging/ # 1:1 with source tables
| | +-- _staging__sources.yml
| | +-- _staging__models.yml
| | +-- stg_orders.sql
| +-- intermediate/ # Business logic transformations
| | +-- int_orders_full.sql
| +-- marts/ # Final business tables
| +-- finance/
| | +-- fct_revenue.sql
| +-- sales/
| +-- fct_deals.sql
+-- tests/ # Custom data tests
+-- macros/ # Reusable SQL functions
+-- seeds/ # CSV files loaded as tables
+-- snapshots/ # Slowly changing dimensions
dbt_project.yml
name: 'my_analytics_project'
version: '1.0.0'
config-version: 2
profile: 'my_profile'
model-paths: ["models"]
analysis-paths: ["analyses"]
test-paths: ["tests"]
seed-paths: ["data"]
macro-paths: ["macros"]
snapshot-paths: ["snapshots"]
target-path: "target"
clean-targets:
- "target"
- "dbt_packages"
models:
my_analytics_project:
staging:
+materialized: view
intermediate:
+materialized: ephemeral
marts:
+materialized: table
Data Modeling with dbt
The Medallion Architecture (Bronze -> Silver -> Gold)
DfMedallion Architecture
A multi-layer data architecture pattern: Bronze (raw), Silver (cleaned), and Gold (business-ready) tables. Each layer adds quality, documentation, and business logic.
+-------------------------------------------------------------+
| Data Modeling Layers |
| |
| +-------------+ +-------------+ +-------------+ |
| | BRONZE | | SILVER | | GOLD | |
| | | | | | | |
| | Raw data |-->| Cleaned |-->| Business | |
| | from source | | + deduped | | aggregates | |
| | | | + typed | | + joined | |
| | (staging) | |(intermediate| | (marts) | |
| +-------------+ +-------------+ +-------------+ |
| |
| Staging: 1:1 mapping to source |
| Intermediate: Business logic, reusable CTEs |
| Marts: Final tables for dashboards & science |
+-------------------------------------------------------------+
Staging Models
-- models/staging/stg_orders.sql
{{ config(materialized='view') }}
with source as (
select * from {{ source('raw', 'orders') }}
),
renamed as (
select
id as order_id,
user_id,
status,
cast(created_at as timestamp) as ordered_at,
cast(updated_at as timestamp) as updated_at,
credit_card as payment_method,
coupon as coupon_code,
amount as order_amount
from source
where status != 'deleted'
)
select * from renamed
# models/staging/_staging__sources.yml
version: 2
sources:
- name: raw
database: raw_data
schema: production
tables:
- name: orders
loaded_at_field: updated_at
freshness:
warn_after: {count: 12, period: hour}
error_after: {count: 24, period: hour}
- name: users
- name: products
Intermediate Models
-- models/intermediate/int_orders_full.sql
{{ config(materialized='ephemeral') }}
with orders as (
select * from {{ ref('stg_orders') }}
),
customers as (
select * from {{ ref('stg_customers') }}
),
products as (
select * from {{ ref('stg_products') }}
),
order_items as (
select * from {{ ref('stg_order_items') }}
),
order_payments as (
select * from {{ ref('stg_order_payments') }}
),
full_orders as (
select
orders.order_id,
orders.user_id,
customers.customer_name,
customers.email,
orders.ordered_at,
orders.status,
orders.payment_method,
order_items.product_id,
products.product_name,
products.category,
products.price as product_price,
order_payments.amount as payment_amount,
order_payments.discount_amount,
order_payments.amount - order_payments.discount_amount as net_amount
from orders
left join customers on orders.user_id = customers.user_id
left join order_items on orders.order_id = order_items.order_id
left join products on order_items.product_id = products.product_id
left join order_payments on orders.order_id = order_payments.order_id
)
select * from full_orders
Mart Models
-- models/marts/finance/fct_revenue.sql
{{ config(materialized='table') }}
with orders as (
select * from {{ ref('int_orders_full') }}
),
monthly_revenue as (
select
date_trunc('month', ordered_at) as revenue_month,
count(distinct order_id) as total_orders,
count(distinct user_id) as unique_customers,
sum(net_amount) as total_revenue,
avg(net_amount) as avg_order_value,
sum(net_amount) / count(distinct user_id) as revenue_per_customer
from orders
where status = 'completed'
group by 1
)
select
revenue_month,
total_orders,
unique_customers,
total_revenue,
avg_order_value,
revenue_per_customer,
-- Month-over-month growth
(total_revenue - lag(total_revenue) over (order by revenue_month))
/ lag(total_revenue) over (order by revenue_month) * 100
as revenue_growth_pct
from monthly_revenue
order by revenue_month
Testing and Documentation
dbt Tests
Dfdbt Tests
dbt tests are assertions about your data. They catch data quality issues before they reach dashboards. dbt supports schema tests (built-in) and data tests (custom SQL).
# models/marts/finance/_finance__models.yml
version: 2
models:
- name: fct_revenue
description: "Monthly revenue metrics for the finance team"
columns:
- name: revenue_month
description: "First day of the month"
tests:
- unique
- not_null
- name: total_revenue
description: "Sum of completed order amounts"
tests:
- not_null
- dbt_utils.greater_than:
value: 0
- name: unique_customers
tests:
- not_null
- dbt_utils.greater_than:
value: 0
Custom Data Tests
-- tests/assert_revenue_matches_payments.sql
-- This test checks that revenue in our mart matches source payments
with revenue as (
select sum(total_revenue) as mart_revenue
from {{ ref('fct_revenue') }}
),
payments as (
select sum(amount) as source_revenue
from {{ source('raw', 'payments') }}
where status = 'completed'
and date_trunc('month', created_at) >=
(select min(revenue_month) from {{ ref('fct_revenue') }})
)
select
mart_revenue,
source_revenue,
abs(mart_revenue - source_revenue) as difference
from revenue, payments
where abs(mart_revenue - source_revenue) > 1.00 -- Allow $1 rounding tolerance
Documentation
dbt generates automatic documentation including:
- Column descriptions from YAML files
- Lineage graphs showing model dependencies
- Test results showing data quality
- Source freshness monitoring
# Generate and view documentation
dbt docs generate
dbt docs serve
+----------------------------------------------------------+
| dbt Documentation Lineage Graph |
| |
| +----------+ +----------+ +----------+ |
| | stg_ | | int_ | | fct_ | |
| | orders |---->| orders |---->| revenue | |
| +----------+ | _full | +----------+ |
| +----+----+ |
| +----------+ | +----------+ |
| | stg_ |----------+------->| dim_ | |
| | customers| | | customers| |
| +----------+ | +----------+ |
| +----------+ | |
| | stg_ |----------+ |
| | products | |
| +----------+ |
+----------------------------------------------------------+
dbt Macros and Packages
Reusable Macros
-- macros/generate_schema_name.sql
{% macro generate_schema_name(custom_schema_name, node) -%}
{%- set default_schema = target.schema -%}
{%- if custom_schema_name is none -%}
{{ default_schema }}
{%- else -%}
{{ default_schema }}_{{ custom_schema_name }}
{%- endif -%}
{%- endmacro %}
-- macros/calculate_growth_rate.sql
{% macro calculate_growth_rate(current_col, previous_col) %}
case
when {{ previous_col }} = 0 then null
else ({{ current_col }} - {{ previous_col }}) / {{ previous_col }} * 100
end
{% endmacro %}
Essential Packages
# packages.yml
packages:
- package: dbt-labs/dbt_utils
version: ">=1.0.0"
- package: dbt-labs/codegen
version: ">=0.4.0"
- package: calogica/dbt_expectations
version: ">=0.8.0"
# Usage: dbt_utils tests
columns:
- name: user_id
tests:
- dbt_utils.unique_combination_of_columns:
combination_of_columns:
- user_id
- order_date
Complete dbt Project Example
# scripts/run_dbt.py
import subprocess
import sys
import json
from datetime import datetime
class DbtRunner:
def __init__(self, project_dir="."):
self.project_dir = project_dir
self.results = {}
def run_command(self, cmd: list[str]) -> dict:
"""Run a dbt command and capture output."""
print(f"\n{'='*60}")
print(f"Running: {' '.join(cmd)}")
print(f"{'='*60}")
result = subprocess.run(
cmd,
cwd=self.project_dir,
capture_output=True,
text=True
)
output = {
"command": ' '.join(cmd),
"returncode": result.returncode,
"stdout": result.stdout,
"stderr": result.stderr,
"timestamp": datetime.now().isoformat()
}
if result.returncode != 0:
print(f"ERROR: {result.stderr}")
else:
print(result.stdout)
return output
def run_full_pipeline(self):
"""Execute a complete dbt pipeline."""
commands = [
["dbt", "deps"], # Install packages
["dbt", "seed"], # Load CSV seeds
["dbt", "run", "--full-refresh"], # Build models
["dbt", "test"], # Run tests
["dbt", "docs", "generate"], # Generate docs
]
for cmd in commands:
result = self.run_command(cmd)
self.results[cmd[1]] = result
if result["returncode"] != 0:
print(f"\nPipeline failed at step: {cmd[1]}")
return False
print("\n+ Full pipeline completed successfully!")
return True
def run_specific_models(self, selector: str):
"""Run specific models using selectors."""
return self.run_command(
["dbt", "run", "--select", selector]
)
def generate_test_report(self) -> str:
"""Generate a test summary report."""
report = f"""
dbt Pipeline Report
Generated: {datetime.now().isoformat()}
Results:
"""
for step, result in self.results.items():
status = "+ PASS" if result["returncode"] == 0 else "- FAIL"
report += f" {step}: {status}\n"
return report
if __name__ == "__main__":
runner = DbtRunner()
if len(sys.argv) > 1 and sys.argv[1] == "--select":
selector = sys.argv[2] if len(sys.argv) > 2 else "fct_revenue"
runner.run_specific_models(selector)
else:
runner.run_full_pipeline()
report = runner.generate_test_report()
print(report)
Real-World dbt Patterns
Slowly Changing Dimensions (SCD Type 2)
-- snapshots/customers_snapshot.sql
{% snapshot customers_snapshot %}
{{
config(
target_schema='snapshots',
unique_key='customer_id',
strategy='timestamp',
updated_at='updated_at',
invalidate_hard_deletes=True
)
}}
select * from {{ ref('stg_customers') }}
{% endsnapshot %}
Incremental Models
-- models/marts/marketing/fct_user_events.sql
{{ config(
materialized='incremental',
unique_key='event_id',
incremental_strategy='merge'
) }}
with events as (
select * from {{ ref('stg_events') }}
{% if is_incremental() %}
where event_time > (select max(event_time) from {{ this }})
{% endif %}
)
select
event_id,
user_id,
event_type,
event_time,
event_properties
from events
Key Takeaways
đSummary: Analytics Engineering with dbt
- Analytics Engineering bridges data engineering and analysis with tested, documented transforms â bringing software engineering rigor to the data layer
- dbt brings software engineering practices (Git, testing, modularity) to SQL â enabling version-controlled, reproducible transformations
- Use the staging -> intermediate -> marts pattern for clean data modeling â each layer has a clear purpose and responsibility
- Tests catch data quality issues before they reach dashboards â schema tests and data tests provide defense-in-depth
- Documentation and lineage graphs make data self-service â stakeholders can understand what metrics mean and where they come from
- Incremental models and snapshots handle large-scale data efficiently â avoiding full table rebuilds
Practice Exercises
- Set up a dbt project connecting to a sample database (PostgreSQL or SQLite)
- Build a staging layer for 3 source tables with proper naming and types
- Create a mart that joins 2+ tables and computes business metrics
- Write 5 tests including uniqueness, not-null, and a custom test
- Generate documentation and explore the lineage graph
- Add an incremental model for a high-volume event table
- Use a package (dbt_utils) to test composite uniqueness