Materialized Views: Design Patterns & Performance
Difficulty: Senior Level | Companies: Snowflake, Databricks, Google Cloud, AWS, Netflix, Uber
This question tests your ability to design, optimize, and maintain materialized views for high-performance data warehousing and analytics workloads.
1. Materialized Views Fundamentals
What is a Materialized View?
A materialized view is a pre-computed query result stored as a physical table:
Unlike a standard view (virtual):
Core Benefits
Typical speedup: 10x - 1000x depending on query complexity.
2. Refresh Strategies
Full Refresh
Complete rebuild from scratch:
-- Snowflake
CREATE MATERIALIZED VIEW sales_summary_mv AS
SELECT
region,
product_category,
DATE_TRUNC('month', sale_date) AS month,
SUM(amount) AS total_sales,
COUNT(*) AS transaction_count,
AVG(amount) AS avg_order_value
FROM sales
WHERE sale_date >= DATEADD('year', -2, CURRENT_DATE())
GROUP BY 1, 2, 3;
-- Manual refresh
ALTER MATERIALIZED VIEW sales_summary_mv REFRESH;
-- Auto-refresh (Snowflake)
ALTER MATERIALIZED VIEW sales_summary_mv
SET DATA_RETENTION_TIME_IN_DAYS = 1
REFRESH_DATA_AHEAD_FOR_ANALYTICS = TRUE;
Incremental Refresh
Only update changed data:
Where represents the merge operation.
-- PostgreSQL incremental materialized view
CREATE MATERIALIZED VIEW daily_metrics_mv AS
WITH daily_agg AS (
SELECT
DATE(event_time) AS event_date,
COUNT(*) AS event_count,
COUNT(DISTINCT user_id) AS unique_users
FROM events
GROUP BY 1
)
SELECT * FROM daily_agg;
-- Concurrent refresh (PostgreSQL specific)
REFRESH MATERIALIZED VIEW CONCURRENTLY daily_metrics_mv;
-- For concurrent refresh, need unique index
CREATE UNIQUE INDEX idx_daily_metrics_mv_date
ON daily_metrics_mv(event_date);
Delta Lake Incremental Pattern
from pyspark.sql import SparkSession
from delta import DeltaTable
from pyspark.sql.functions import col, max as spark_max, lit
spark = SparkSession.builder \
.appName("IncrementalMV") \
.config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
.getOrCreate()
# Incremental merge pattern
def incremental_mv_update(spark, base_table_path, mv_table_path, watermark_col):
# Read base table incrementally
last_mv_watermark = spark.read.format("delta").load(mv_table_path) \
.agg(spark_max(watermark_col)).collect()[0][0]
if last_mv_watermark is None:
last_mv_watermark = "1970-01-01"
# Get incremental data
incremental_df = spark.read.format("delta") \
.load(base_table_path) \
.filter(col(watermark_col) > lit(last_mv_watermark)) \
.groupBy("region", "product_category") \
.agg(
F.sum("amount").alias("total_sales"),
F.count("*").alias("transaction_count")
)
# Merge into materialized view
mv_table = DeltaTable.forPath(spark, mv_table_path)
mv_table.alias("mv").merge(
incremental_df.alias("inc"),
"mv.region = inc.region AND mv.product_category = inc.product_category"
).whenMatchedUpdate(
set={
"mv.total_sales": col("mv.total_sales") + col("inc.total_sales"),
"mv.transaction_count": col("mv.transaction_count") + col("inc.transaction_count")
}
).whenNotMatchedInsertAll().execute()
3. Design Patterns
Pattern 1: Aggregate Rollup MV
Pre-aggregate at multiple granularity levels:
-- Base aggregate
CREATE MATERIALIZED VIEW regional_sales_mv AS
SELECT
region,
DATE_TRUNC('week', sale_date) AS week,
SUM(amount) AS weekly_sales
FROM sales
GROUP BY 1, 2;
-- Higher-level rollup
CREATE MATERIALIZED VIEW regional_monthly_mv AS
SELECT
region,
DATE_TRUNC('month', week) AS month,
SUM(weekly_sales) AS monthly_sales,
COUNT(*) AS weeks_with_data
FROM regional_sales_mv
GROUP BY 1, 2;
Pattern 2: Join Pre-computation
Pre-compute expensive joins:
-- Snowflake: Pre-computed enriched events
CREATE MATERIALIZED VIEW enriched_events_mv AS
SELECT
e.event_id,
e.event_time,
e.event_type,
u.user_name,
u.segment,
p.product_name,
p.category,
e.amount
FROM events e
JOIN users u ON e.user_id = u.user_id
JOIN products p ON e.product_id = p.product_id;
-- Query now only needs simple scan
SELECT segment, category, SUM(amount)
FROM enriched_events_mv
WHERE event_time >= '2024-01-01'
GROUP BY 1, 2;
Pattern 3: Window Function Materialization
-- Materialize complex window functions
CREATE MATERIALIZED VIEW user_session_metrics_mv AS
WITH session_bounds AS (
SELECT
user_id,
session_id,
MIN(event_time) AS session_start,
MAX(event_time) AS session_end,
COUNT(*) AS event_count,
SUM(amount) AS session_value,
-- Session duration in seconds
EXTRACT(EPOCH FROM (MAX(event_time) - MIN(event_time))) AS duration_seconds
FROM events
GROUP BY user_id, session_id
)
SELECT
*,
-- Classification based on metrics
CASE
WHEN duration_seconds < 30 THEN 'bounce'
WHEN duration_seconds < 300 THEN 'engaged'
ELSE 'power_user'
END AS session_type
FROM session_bounds;
Pattern 4: Slowly Changing Dimension (SCD) MV
-- Type 2 SCD materialization
CREATE MATERIALIZED VIEW dim_users_scd2_mv AS
WITH ranked_users AS (
SELECT
user_id,
email,
segment,
effective_date,
expiry_date,
ROW_NUMBER() OVER (
PARTITION BY user_id
ORDER BY effective_date DESC
) AS rn
FROM dim_users_history
)
SELECT
user_id,
email,
segment,
effective_date,
COALESCE(expiry_date, '9999-12-31') AS expiry_date,
CASE WHEN rn = 1 THEN TRUE ELSE FALSE END AS is_current
FROM ranked_users;
4. Performance Optimization
Predicate Pushdown into MVs
The query optimizer should push predicates into MV access:
-- Optimized query that can use MV
SELECT region, SUM(weekly_sales)
FROM regional_sales_mv
WHERE week >= '2024-01-01' AND week < '2024-02-01'
GROUP BY region;
-- Explain shows MV access instead of base table scan
EXPLAIN
SELECT region, SUM(weekly_sales)
FROM regional_sales_mv
WHERE week >= '2024-01-01'
GROUP BY region;
Partitioning Strategy for MVs
-- Partition MV by time for efficient pruning
CREATE MATERIALIZED VIEW events_daily_mv
PARTITION BY (event_date, event_type) AS
SELECT
DATE(event_time) AS event_date,
event_type,
user_id,
COUNT(*) AS event_count,
SUM(amount) AS total_amount
FROM events
GROUP BY 1, 2, 3;
-- Query benefits from partition pruning
SELECT event_type, SUM(total_amount)
FROM events_daily_mv
WHERE event_date BETWEEN '2024-01-01' AND '2024-01-31'
GROUP BY event_type;
Statistics and Cost-Based Optimization
-- Update statistics for MV
ANALYZE MATERIALIZED VIEW sales_summary_mv;
-- Snowflake: Auto-refresh statistics
ALTER MATERIALIZED VIEW sales_summary_mv
SET REFRESH_ON_CREATE = TRUE;
-- Check MV freshness and size
SELECT
name,
refresh_lag,
stale_after,
data_provenance,
TEXT(parent_queries)
FROM information_schema.materialized_views;
5. Refresh Lag and Freshness
Freshness Guarantees
Lag Detection and Alerting
-- Monitor MV freshness
CREATE OR REPLACE PROCEDURE check_mv_freshness()
RETURNS TABLE (mv_name STRING, freshness_minutes FLOAT, status STRING)
AS
$$
DECLARE
max_lag_minutes FLOAT;
BEGIN
CREATE OR REPLACE TEMP TABLE mv_freshness AS
SELECT
name,
DATEDIFF('minute', last_refresh_time, CURRENT_TIMESTAMP()) AS lag_minutes
FROM information_schema.materialized_view_refresh_history
WHERE refresh_state = 'STALE';
FOR rec IN SELECT * FROM mv_freshness DO
IF rec.lag_minutes > 60 THEN
-- Alert on stale MVs
CALL SYSTEM$SEND_EMAIL(
'alerts@company.com',
'MV Stale Alert: ' || rec.name,
'Materialized view ' || rec.name || ' is ' || rec.lag_minutes || ' minutes stale'
);
END IF;
END FOR;
RETURN TABLE (
SELECT name, lag_minutes,
CASE WHEN lag_minutes > 60 THEN 'CRITICAL'
WHEN lag_minutes > 30 THEN 'WARNING'
ELSE 'OK' END
FROM mv_freshness
);
END;
$$;
6. Advanced: Query Routing to MVs
Automatic Query Rewrite
-- PostgreSQL: Enable query rewrite
CREATE FUNCTION rewrite_query() RETURNS TRIGGER AS $$
BEGIN
-- Check if query can be answered from MV
IF check_mv_coverage(OLD) THEN
RAISE NOTICE 'Query can use materialized view';
END IF;
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
-- Snowflake automatic rewrite
ALTER SESSION SET use_cached_result = TRUE;
-- Manual rewrite detection
SELECT
query_text,
rewrite_status,
materialized_view_name
FROM query_rewrite_results
WHERE query_id = CURRENT_QUERY_ID();
Cost-Based MV Selection
# Programmatic MV selection logic
def select_optimal_mv(query_filters, available_mvs):
"""Select best MV based on filter coverage"""
best_mv = None
best_cost = float('inf')
for mv in available_mvs:
# Calculate filter coverage
covered_filters = set(query_filters) & set(mv['indexed_columns'])
coverage = len(covered_filters) / len(query_filters) if query_filters else 0
# Cost = size * (1 - coverage)
cost = mv['size_bytes'] * (1 - coverage)
if cost < best_cost:
best_cost = cost
best_mv = mv
return best_mv
# Example usage
query_filters = ['region', 'date', 'product_category']
available_mvs = [
{'name': 'sales_by_region', 'size_bytes': 1024**3, 'indexed_columns': ['region', 'date']},
{'name': 'sales_by_category', 'size_bytes': 2 * 1024**3, 'indexed_columns': ['product_category', 'date']},
{'name': 'sales_full', 'size_bytes': 5 * 1024**3, 'indexed_columns': ['region', 'date', 'product_category']},
]
optimal = select_optimal_mv(query_filters, available_mvs)
print(f"Selected MV: {optimal['name']}")
7. Common Pitfalls
⚠️
MV Explosion: Creating too many MVs can lead to excessive storage and refresh overhead. Monitor total MV size vs base table size ratio. Ideally keep MV storage under 20% of base tables.
⚠️
Stale Data Risk: Always check MV freshness before making business decisions. Implement SLA monitoring for critical MVs.
💡
Refresh Ordering: When MVs depend on other MVs, create a refresh dependency graph to avoid stale data propagation.
ℹ️
Naming Convention: Use consistent naming like {table}_mv or {purpose}_mv_{granularity} to make MV relationships clear.
Follow-up Questions
- How would you implement incremental refresh for a materialized view containing window functions?
- Explain the trade-offs between query rewrite optimization and manual MV selection in Snowflake.
- How do you handle schema evolution in materialized views when base tables change?
- Design a monitoring dashboard for materialized view health and freshness across a data platform.
- How would you implement a materialized view that supports both real-time and batch queries?
- Explain the cost implications of maintaining materialized views vs using query caching.
- How do you handle concurrent refreshes and query consistency in PostgreSQL materialized views?
- Design a strategy for materialized view consolidation in a legacy data warehouse migration.