BigQuery Performance Optimization

Master BigQuery performance optimization including partitioning, clustering, slot management, query optimization, and cost control.

20 min readAdvanced

Performance Optimization Framework

📊 BigQuery Architecture for Data Engineering

Interview Tip: BigQuery separates storage and compute. Queries are charged by slots (compute) + bytes scanned. Always partition and cluster tables to reduce costs.

Partitioning Strategies

-- Time-unit partitioning (recommended)
CREATE TABLE `project.dataset.events`
PARTITION BY DATE(event_timestamp)
OPTIONS (
  partition_expiration_days = 730,
  require_partition_filter = true
);

-- Integer-range partitioning
CREATE TABLE `project.dataset.sales`
PARTITION BY RANGE_BUCKET(sale_id, GENERATE_ARRAY(0, 10000000, 100000));

-- Partition pruning in queries
SELECT * FROM `project.dataset.events`
WHERE event_date >= '2025-01-01'  -- Partition pruning
  AND event_type = 'purchase';    -- Cluster pruning

Clustering Optimization

-- Clustering for multi-column filtering
CREATE TABLE `project.dataset.sales`
PARTITION BY DATE(order_date)
CLUSTER BY customer_id, product_id, region;

-- Cluster pruning in queries
SELECT
  customer_id,
  SUM(amount) as total
FROM `project.dataset.sales`
WHERE order_date = '2025-01-15'
  AND customer_id = 'user_123'    -- Cluster pruning
  AND region = 'US';              -- Cluster pruning

Query Optimization

-- Bad: Full table scan
SELECT * FROM `project.dataset.events`;

-- Good: Partition and cluster pruning
SELECT
  event_type,
  COUNT(*) as count
FROM `project.dataset.events`
WHERE event_date = '2025-01-15'
GROUP BY 1;

-- Use approximate aggregation for large datasets
SELECT
  APPROX_COUNT_DISTINCT(user_id) as approx_users
FROM `project.dataset.events`
WHERE event_date = '2025-01-15';

-- Optimize JOINs with broadcast hints
SELECT /*+ BROADCAST(p) */
  o.order_id,
  p.product_name
FROM `project.dataset.orders` o
JOIN `project.dataset.small_products` p ON o.product_id = p.product_id;

Slot Monitoring

-- Monitor slot usage
SELECT
  job_creation_time,
  job_id,
  total_bytes_processed,
  slot_ms_used,
  ROUND(slot_ms_used / 1000 / 60, 2) as slot_minutes
FROM `region-us-central1`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE DATE(job_creation_time) = CURRENT_DATE()
ORDER BY slot_ms_used DESC;

-- Check slot utilization
SELECT
  TIMESTAMP_TRUNC(timestamp, MINUTE) as minute,
  AVG(num_slots) as avg_slots,
  MAX(num_slots) as max_slots
FROM `region-us-central1`.INFORMATION_SCHEMA.JOBS_TIMELINE
WHERE timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
GROUP BY 1
ORDER BY 1;

ℹ️

Pro Tip: Use require_partition_filter = true on all partitioned tables to prevent expensive full-table scans. Set maximum_bytes_billed at the session level to cap query costs. Monitor slot utilization to right-size committed slot purchases.

💬

Common Interview Questions

Q1: What is the difference between partitioning and clustering?

Answer: Partitioning divides tables into segments (max 4,000), best for date/time data. Clustering sorts data within partitions (up to 4 fields), best for high-cardinality filtering. Use both together for optimal performance.

Q2: How do you prevent expensive queries?

Answer: 1) Use require_partition_filter, 2) Set maximum_bytes_billed, 3) Use query dry-run for cost estimation, 4) Avoid SELECT *, 5) Use partition and cluster pruning, 6) Monitor slot usage.

Q3: When should you use materialized views?

Answer: Materialized views are best for frequently-run aggregations on large tables. They're automatically maintained by BigQuery and provide faster query performance. Use them for dashboard queries and common aggregations.

Q4: How do you optimize JOINs in BigQuery?

Answer: 1) Use broadcast JOINs for small tables (<1GB), 2) Join on partitioned/clustering columns, 3) Filter before JOINing, 4) Use APPROX_COUNT_DISTINCT for large cardinality, 5) Consider denormalization for complex joins.

Q5: What is the benefit of BI Engine?

Answer: BI Engine caches hot data in-memory for sub-second dashboard queries. It automatically manages cache invalidation and works with Looker and Connected Sheets. Reserve capacity based on your hot data size.

BigQuery Optimization: Partitioning, Clustering & Slot Management