dbt with BigQuery
BigQuery Architecture
Architecture Diagram
āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā
ā BIGQUERY + DBT ARCHITECTURE ā
āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā¤
ā ā
ā āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā ā
ā ā BIGQUERY COMPONENTS ā ā
ā ā ā ā
ā ā āāāāāāāāāāāāāāāā āāāāāāāāāāāāāāāā āāāāāāāāāāāāāāāāāāāāāāāāāāāā ā ā
ā ā ā PROJECTS ā ā DATASETS ā ā TABLES ā ā ā
ā ā ā ā ā ā ā ā ā ā
ā ā ā ⢠GCP ā ā ⢠Schema ā ā ⢠Native ā ā ā
ā ā ā project ā ā equivalent ā ā ⢠External ā ā ā
ā ā ā ⢠Billing ā ā ⢠Access ā ā ⢠Materialized ā ā ā
ā ā ā account ā ā control ā ā views ā ā ā
ā ā āāāāāāāāāāāāāāāā āāāāāāāāāāāāāāāā āāāāāāāāāāāāāāāāāāāāāāāāāāāā ā ā
ā āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā ā
ā ā ā
ā ā¼ ā
ā āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā ā
ā ā DBT + BIGQUERY ā ā
ā ā ā ā
ā ā āāāāāāāāāāāāāāāā āāāāāāāāāāāāāāāā āāāāāāāāāāāāāāāāāāāāāāāāāāāā ā ā
ā ā ā PARTITION ā ā CLUSTER ā ā OPTIMIZATION ā ā ā
ā ā ā ā ā ā ā ā ā ā
ā ā ā ⢠Date ā ā ⢠Column ā ā ⢠Slot management ā ā ā
ā ā ā ⢠Timestamp ā ā ordering ā ā ⢠Cost control ā ā ā
ā ā ā ⢠Integer ā ā ⢠Up to 4 ā ā ⢠Query optimization ā ā ā
ā ā āāāāāāāāāāāāāāāā āāāāāāāāāāāāāāāā āāāāāāāāāāāāāāāāāāāāāāāāāāāā ā ā
ā āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā ā
ā ā
āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā
Partitioning Strategy
Architecture Diagram
āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā
ā BIGQUERY PARTITIONING ā
āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā¤
ā ā
ā āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā ā
ā ā PARTITION TYPES ā ā
ā ā ā ā
ā ā āāāāāāāāāāāāāāāā āāāāāāāāāāāāāāāā āāāāāāāāāāāāāāāāāāāāāāāāāāāā ā ā
ā ā ā DAY ā ā HOUR ā ā INTEGER RANGE ā ā ā
ā ā ā ā ā ā ā ā ā ā
ā ā ā 365 ā ā 8760 ā ā Custom ā ā ā
ā ā ā partitions ā ā partitions ā ā ranges ā ā ā
ā ā ā per year ā ā per year ā ā ā ā ā
ā ā āāāāāāāāāāāāāāāā āāāāāāāāāāāāāāāā āāāāāāāāāāāāāāāāāāāāāāāāāāāā ā ā
ā āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā ā
ā ā ā
ā ā¼ ā
ā āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā ā
ā ā PARTITION PRUNING ā ā
ā ā ā ā
ā ā Query: SELECT * FROM table WHERE date = '2024-01-01' ā ā
ā ā ā ā
ā ā Without partitioning: ā ā
ā ā āāā Scan: 100M rows ā ā
ā ā āāā Cost: $5.00 ā ā
ā ā āāā Time: 10 seconds ā ā
ā ā ā ā
ā ā With partitioning: ā ā
ā ā āāā Scan: 274K rows (1 day) ā ā
ā ā āāā Cost: $0.01 ā ā
ā ā āāā Time: 0.5 seconds ā ā
ā āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā ā
ā ā
āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā
Clustering Strategy
Architecture Diagram
āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā
ā BIGQUERY CLUSTERING ā
āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā¤
ā ā
ā āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā ā
ā ā CLUSTERING COLUMNS ā ā
ā ā ā ā
ā ā Table: fct_orders ā ā
ā ā Clustered by: customer_id, order_status, product_category ā ā
ā ā ā ā
ā ā āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā ā ā
ā ā ā Data blocks organized by: ā ā ā
ā ā ā ā ā ā
ā ā ā Block 1: customer_id=1001, status=active, category=books ā ā ā
ā ā ā Block 2: customer_id=1001, status=active, category=tech ā ā ā
ā ā ā Block 3: customer_id=1002, status=pending, category=books ā ā ā
ā ā ā Block 4: customer_id=1002, status=shipped, category=tech ā ā ā
ā ā āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā ā ā
ā ā ā ā
ā ā Query optimization: ā ā
ā ā WHERE customer_id = 1001 AND order_status = 'active' ā ā
ā ā ā Only scan Blocks 1-2 (50% reduction) ā ā
ā āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā ā
ā ā
āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā
Detailed Explanation
BigQuery is a serverless, highly scalable data warehouse that integrates seamlessly with dbt for data transformation.
BigQuery + dbt Features
- Native Integration: Direct SQL execution
- Partitioning: Time and integer-based partitioning
- Clustering: Column-based data organization
- Materialized Views: Pre-computed aggregations
- Slot Management: Compute resource allocation
Partitioning in BigQuery
BigQuery supports several partitioning strategies:
- Day-partitioned: Partition by date column
- Hour-partitioned: Partition by timestamp (hour granularity)
- Integer-range: Partition by integer column
- Ingestion-time: Partition by data insertion time
Clustering in BigQuery
Clustering organizes data within partitions:
- Up to 4 clustering columns
- Automatic re-clustering
- Improved query performance
- Reduced costs
Cost Optimization
BigQuery pricing factors:
- Storage: $0.02/GB/month
- Query: $5/TB scanned
- Streaming: $0.01/200MB
- Slots: Variable pricing
Best Practices for BigQuery
- Always partition large tables
- Cluster by query patterns - filter/join columns
- Use incremental models - minimize data scanned
- Partition expiration - manage storage costs
- Use materialized views - for frequent aggregations
- Monitor slot usage - optimize compute resources
- Use dry runs - estimate query costs
- Leverage caching - reduce repeated queries
Code Examples
BigQuery Profile Configuration
# profiles.yml
my_profile:
target: dev
outputs:
dev:
type: bigquery
method: oauth
project: my-gcp-project
dataset: dbt_dev
threads: 8
timeout_seconds: 300
location: US
priority: interactive
retries: 1
maximum_bytes_billed: 1000000000 # 1GB limit
prod:
type: bigquery
method: service-account
project: my-gcp-project
dataset: analytics_prod
threads: 16
timeout_seconds: 600
location: US
priority: batch
retries: 3
maximum_bytes_billed: 100000000000 # 100GB limit
Partitioned Model
-- models/marts/fct_orders.sql
{{
config(
materialized='incremental',
unique_key='order_id',
partition_by={
"field": "order_date",
"data_type": "date",
"granularity": "day"
},
cluster_by=['customer_id', 'order_status', 'product_category'],
incremental_strategy='merge'
)
}}
with orders as (
select * from {{ ref('stg_orders') }}
),
final as (
select
order_id,
customer_id,
order_date,
order_status,
product_category,
amount,
current_timestamp() as updated_at
from orders
)
select * from final
{% if is_incremental() %}
where order_date >= date_sub(
(select max(order_date) from {{ this }}),
interval 7 day
)
{% endif %}
Materialized View
-- models/marts/mv_daily_revenue.sql
{{
config(
materialized='materialized_view',
partition_by={
"field": "order_date",
"data_type": "date"
},
enable_refresh=true,
refresh_interval_minutes=60
)
}}
with daily_revenue as (
select
order_date,
product_category,
sum(amount) as total_revenue,
count(*) as order_count
from {{ ref('fct_orders') }}
group by 1, 2
)
select * from daily_revenue
External Table
-- models/staging/stg_external_data.sql
{{
config(
materialized='view'
)
}}
with external_data as (
select * from {{ source('external', 'gcs_data') }}
)
select * from external_data
BigQuery-Specific Macros
-- macros/bigquery/safe_divide.sql
{% macro safe_divide(numerator, denominator) %}
safe_divide({{ numerator }}, {{ denominator }})
{% endmacro %}
-- macros/bigquery/array_agg.sql
{% macro array_agg(column_name) %}
array_agg({{ column_name }} ignore nulls)
{% endmacro %}
-- macros/bigquery/struct.sql
{% macro make_struct(columns) %}
struct({{ columns | join(', ') }})
{% endmacro %}
Incremental with BigQuery Optimizations
-- models/marts/fct_events.sql
{{
config(
materialized='incremental',
unique_key='event_id',
partition_by={
"field": "event_timestamp",
"data_type": "timestamp",
"granularity": "hour"
},
cluster_by=['user_id', 'event_type'],
incremental_strategy='merge',
post_hook=[
"analyze table {{ this }} compute statistics for all columns"
]
)
}}
with events as (
select * from {{ ref('stg_events') }}
),
final as (
select
event_id,
user_id,
event_type,
event_timestamp,
event_properties,
current_timestamp() as updated_at
from events
)
select * from final
{% if is_incremental() %}
where event_timestamp >= timestamp_sub(
(select max(event_timestamp) from {{ this }}),
interval 3 day
)
{% endif %}
Performance Metrics
| Metric | Without Optimization | With Optimization | Improvement |
|---|---|---|---|
| Query Time | 10s | 0.5s | 95% faster |
| Cost | $5.00 | $0.01 | 99.8% cheaper |
| Storage | 100GB | 100GB | Same |
| Scan | 100M rows | 274K rows | 99.7% less |
Best Practices
- Always partition large tables by date
- Cluster by query patterns - filter/join columns
- Use incremental models - minimize data scanned
- Partition expiration - manage storage costs
- Use materialized views - for frequent aggregations
- Monitor slot usage - optimize compute resources
- Use dry runs - estimate query costs
- Leverage caching - reduce repeated queries