dbt with Redshift
Redshift Architecture
Architecture Diagram
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β REDSHIFT + DBT ARCHITECTURE β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β REDSHIFT CLUSTER β β
β β β β
β β ββββββββββββββββ ββββββββββββββββ ββββββββββββββββββββββββββββ β β
β β β LEADER β β NODES β β STORAGE β β β
β β β NODE β β β β β β β
β β β β’ Query β β β’ Compute β β β’ Columnar β β β
β β β parsing β β β’ Parallel β β β’ Compressed β β β
β β β β’ Planning β β β’ Distributedβ β β’ Distributed β β β
β β ββββββββββββββββ ββββββββββββββββ ββββββββββββββββββββββββββββ β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β β
β βΌ β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β DISTRIBUTION STYLES β β
β β β β
β β ββββββββββββββββ ββββββββββββββββ ββββββββββββββββββββββββββββ β β
β β β EVEN β β KEY β β ALL β β β
β β β β β β β β β β
β β β β’ Round-robinβ β β’ Hash-based β β β’ Copy to all nodes β β β
β β β β’ Default β β β’ Join- β β β’ Dimension tables β β β
β β β β’εεεεΈ β β optimized β β β’ Small tables β β β
β β ββββββββββββββββ ββββββββββββββββ ββββββββββββββββββββββββββββ β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Sort Key Strategy
Architecture Diagram
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β SORT KEY OPTIMIZATION β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β SORT KEY TYPES β β
β β β β
β β ββββββββββββββββ ββββββββββββββββ ββββββββββββββββββββββββββββ β β
β β β COMPOUND β β INTERLEAVEDβ β AUTO β β β
β β β β β β β β β β
β β β β’ Multi-col β β β’ Equal β β β’ System managed β β β
β β β β’ Prefix β β weight β β β’ Automatic β β β
β β β matching β β β’ Flexible β β β’ Optimized β β β
β β ββββββββββββββββ ββββββββββββββββ ββββββββββββββββββββββββββββ β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β β
β βΌ β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β SORT KEY QUERY OPTIMIZATION β β
β β β β
β β Query: SELECT * FROM orders WHERE order_date = '2024-01-01' β β
β β β β
β β Without sort key: β β
β β βββ Full table scan: 100M rows β β
β β βββ Cost: $5.00 β β
β β βββ Time: 30 seconds β β
β β β β
β β With compound sort key (order_date, customer_id): β β
β β βββ Zone map pruning: 1M rows (99% reduction) β β
β β βββ Cost: $0.05 β β
β β βββ Time: 0.3 seconds β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Distribution Strategy
Architecture Diagram
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β DISTRIBUTION STRATEGY β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β DATA DISTRIBUTION β β
β β β β
β β ββββββββββββββββ ββββββββββββββββ ββββββββββββββββββββββββββββ β β
β β β DISTRIBUTIONβ β DISTRIBUTIONβ β DISTRIBUTION β β β
β β β EVEN β β KEY β β ALL β β β
β β β β β β β β β β
β β β ββββ ββββ β β ββββ ββββ β β ββββ ββββ β β β
β β β β1 β β2 β β β β1 β β2 β β β βA β βA β β β β
β β β β3 β β4 β β β β3 β β4 β β β βA β βA β β β β
β β β ββββ ββββ β β ββββ ββββ β β ββββ ββββ β β β
β β β Node1 Node2 β β Node1 Node2 β β Node1 Node2 β β β
β β ββββββββββββββββ ββββββββββββββββ ββββββββββββββββββββββββββββ β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
β DISTRIBUTION KEY SELECTION: β
β βββ Fact tables: Distribute by join key (customer_id, order_id) β
β βββ Dimension tables: Use ALL distribution (small tables) β
β βββ Log tables: Use EVEN distribution (uniform access) β
β βββ Join optimization: Match distribution keys across tables β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Detailed Explanation
Amazon Redshift is a fully managed, petabyte-scale data warehouse that provides powerful analytics capabilities.
Redshift + dbt Features
- Columnar Storage: Optimized for analytics queries
- Massively Parallel Processing (MPP): Distribute queries across nodes
- Distribution Keys: Optimize data distribution for joins
- Sort Keys: Optimize data ordering for queries
- Compression: Automatic data compression
Distribution Styles
Redshift offers three distribution styles:
- EVEN: Round-robin distribution (default)
- KEY: Hash-based distribution on a column
- ALL: Copy all data to every node (for small tables)
Sort Key Types
Redshift provides different sort key options:
- Compound: Multi-column sort with prefix matching
- Interleaved: Equal-weight multi-column sort
- Auto: System-managed sort key selection
Best Practices for Redshift
- Choose distribution keys wisely - Match join patterns
- Use compound sort keys - For prefix-matching queries
- Distribute dimension tables as ALL - For small tables
- Use vacuum - Reclaim space after deletes
- Analyze statistics - Keep query planner informed
- Monitor query performance - Use STL tables
- Use result caching - Cache frequent queries
- Implement workload management - Prioritize queries
Code Examples
Redshift Profile Configuration
# profiles.yml
my_profile:
target: dev
outputs:
dev:
type: redshift
host: my-cluster.xxxxxxxxxxxx.us-west-2.redshift.amazonaws.com
port: 5439
user: my_user
password: "{{ env_var('REDSHIFT_PASSWORD') }}"
dbname: analytics_dev
schema: dbt_dev
threads: 4
connect_timeout: 10
ra3_node: true
prod:
type: redshift
host: my-prod-cluster.xxxxxxxxxxxx.us-west-2.redshift.amazonaws.com
port: 5439
user: service_account
password: "{{ env_var('REDSHIFT_PASSWORD') }}"
dbname: analytics_prod
schema: public
threads: 8
connect_timeout: 30
ra3_node: true
Distribution Key Configuration
# models/marts/fct_orders.yml
version: 2
models:
- name: fct_orders
description: "Fact table for orders"
config:
materialized: incremental
unique_key: order_id
incremental_strategy: merge
dist_style: key
dist_key: customer_id
sort_type: compound
sort_key:
- order_date
- customer_id
post_hook:
- "vacuum sortkey order by order_date"
- "analyze"
Sort Key Configuration
-- models/marts/fct_events.sql
{{
config(
materialized='incremental',
unique_key='event_id',
incremental_strategy='merge',
dist_style='key',
dist_key='user_id',
sort_type='compound',
sort_key=['event_date', 'user_id', 'event_type']
)
}}
with events as (
select * from {{ ref('stg_events') }}
),
final as (
select
event_id,
user_id,
event_type,
event_timestamp,
cast(event_timestamp as date) as event_date,
event_properties,
current_timestamp() as updated_at
from events
)
select * from final
{% if is_incremental() %}
where event_date >= date_sub(
(select max(event_date) from {{ this }}),
interval 7 day
)
{% endif %}
ALL Distribution for Dimension Tables
-- models/marts/dim_customers.sql
{{
config(
materialized='table',
dist_style='all',
sort_type='compound',
sort_key=['customer_id']
)
}}
with customers as (
select * from {{ ref('stg_customers') }}
),
final as (
select
customer_id,
customer_name,
email,
segment,
current_timestamp() as updated_at
from customers
)
select * from final
Redshift-Specific Macros
-- macros/redshift/vacuum_table.sql
{% macro vacuum_table(table_name, sort_key=None) %}
{% if sort_key %}
vacuum sortkey order by {{ sort_key }} on {{ table_name }};
{% else %}
vacuum on {{ table_name }};
{% endif %}
{% endmacro %}
-- macros/redshift/analyze_table.sql
{% macro analyze_table(table_name) %}
analyze {{ table_name }};
{% endmacro %}
-- macros/redshift/grant_permissions.sql
{% macro grant_permissions(table_name, role='analytics_reader') %}
grant select on {{ table_name }} to role {{ role }};
{% endmacro %}
Workload Management Configuration
-- macros/redshift/create_wlm_queue.sql
{% macro create_wlm_queue(queue_name, memory_percent=40, concurrency=5) %}
-- Note: WLM configuration is typically done via AWS Console
-- This macro provides a template for documentation
/*
WLM Queue Configuration:
Queue: {{ queue_name }}
Memory: {{ memory_percent }}%
Concurrency: {{ concurrency }}
Query Timeout: 600 seconds
Use Cases:
- dbt runs: Use dedicated queue
- Ad-hoc queries: Use separate queue
- ETL jobs: Use high-memory queue
*/
{% endmacro %}
Optimized Redshift Model
-- models/marts/fct_orders_optimized.sql
{{
config(
materialized='incremental',
unique_key='order_id',
incremental_strategy='merge',
dist_style='key',
dist_key='customer_id',
sort_type='compound',
sort_key=['order_date', 'customer_id'],
post_hook=[
"{{ vacuum_table(this, 'order_date') }}",
"{{ analyze_table(this) }}"
]
)
}}
with orders as (
select * from {{ ref('stg_orders') }}
),
final as (
select
order_id,
customer_id,
order_date,
order_status,
amount,
current_timestamp() as updated_at
from orders
)
select * from final
{% if is_incremental() %}
where order_date >= date_sub(
(select max(order_date) from {{ this }}),
interval 7 day
)
{% endif %}
Performance Metrics
| Metric | Description | Target |
|---|---|---|
| Query Time | Average query execution | <10s |
| Vacuum Time | Time to vacuum tables | <5min |
| Analyze Time | Time to analyze tables | <1min |
| Storage Usage | Percentage of disk used | <70% |
| Query Queue Time | Time waiting in queue | <5s |
Best Practices
- Choose distribution keys wisely - Match join patterns
- Use compound sort keys - For prefix-matching queries
- Distribute dimension tables as ALL - For small tables
- Use vacuum - Reclaim space after deletes
- Analyze statistics - Keep query planner informed
- Monitor query performance - Use STL tables
- Use result caching - Cache frequent queries
- Implement workload management - Prioritize queries