Materialized Views: Design Patterns & Performance

Difficulty: Senior Level | Companies: Snowflake, Databricks, Google Cloud, AWS, Netflix, Uber

This question tests your ability to design, optimize, and maintain materialized views for high-performance data warehousing and analytics workloads.

1. Materialized Views Fundamentals

What is a Materialized View?

A materialized view is a pre-computed query result stored as a physical table:

\text{MV}(Q) = \text{Execute}(Q, \text{BaseTables}) \rightarrow \text{StoredResult}

Unlike a standard view (virtual):

\text{View}(Q) = \text{Definition}(Q) \text{ - executes on each access}

Core Benefits

\text{Speedup Factor} = \frac{\text{Time}_{\text{base query}}}{\text{Time}_{\text{MV access}}}

Typical speedup: 10x - 1000x depending on query complexity.

2. Refresh Strategies

Full Refresh

Complete rebuild from scratch:

-- Snowflake
CREATE MATERIALIZED VIEW sales_summary_mv AS
SELECT 
    region,
    product_category,
    DATE_TRUNC('month', sale_date) AS month,
    SUM(amount) AS total_sales,
    COUNT(*) AS transaction_count,
    AVG(amount) AS avg_order_value
FROM sales
WHERE sale_date >= DATEADD('year', -2, CURRENT_DATE())
GROUP BY 1, 2, 3;

-- Manual refresh
ALTER MATERIALIZED VIEW sales_summary_mv REFRESH;

-- Auto-refresh (Snowflake)
ALTER MATERIALIZED VIEW sales_summary_mv
    SET DATA_RETENTION_TIME_IN_DAYS = 1
    REFRESH_DATA_AHEAD_FOR_ANALYTICS = TRUE;

Incremental Refresh

Only update changed data:

\text{MV}_{new} = \text{MV}_{old} \oplus \Delta(\text{BaseTable})

Where $\oplus$ represents the merge operation.

-- PostgreSQL incremental materialized view
CREATE MATERIALIZED VIEW daily_metrics_mv AS
WITH daily_agg AS (
    SELECT 
        DATE(event_time) AS event_date,
        COUNT(*) AS event_count,
        COUNT(DISTINCT user_id) AS unique_users
    FROM events
    GROUP BY 1
)
SELECT * FROM daily_agg;

-- Concurrent refresh (PostgreSQL specific)
REFRESH MATERIALIZED VIEW CONCURRENTLY daily_metrics_mv;

-- For concurrent refresh, need unique index
CREATE UNIQUE INDEX idx_daily_metrics_mv_date 
ON daily_metrics_mv(event_date);

Delta Lake Incremental Pattern

from pyspark.sql import SparkSession
from delta import DeltaTable
from pyspark.sql.functions import col, max as spark_max, lit

spark = SparkSession.builder \
    .appName("IncrementalMV") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .getOrCreate()

# Incremental merge pattern
def incremental_mv_update(spark, base_table_path, mv_table_path, watermark_col):
    # Read base table incrementally
    last_mv_watermark = spark.read.format("delta").load(mv_table_path) \
        .agg(spark_max(watermark_col)).collect()[0][0]
    
    if last_mv_watermark is None:
        last_mv_watermark = "1970-01-01"
    
    # Get incremental data
    incremental_df = spark.read.format("delta") \
        .load(base_table_path) \
        .filter(col(watermark_col) > lit(last_mv_watermark)) \
        .groupBy("region", "product_category") \
        .agg(
            F.sum("amount").alias("total_sales"),
            F.count("*").alias("transaction_count")
        )
    
    # Merge into materialized view
    mv_table = DeltaTable.forPath(spark, mv_table_path)
    
    mv_table.alias("mv").merge(
        incremental_df.alias("inc"),
        "mv.region = inc.region AND mv.product_category = inc.product_category"
    ).whenMatchedUpdate(
        set={
            "mv.total_sales": col("mv.total_sales") + col("inc.total_sales"),
            "mv.transaction_count": col("mv.transaction_count") + col("inc.transaction_count")
        }
    ).whenNotMatchedInsertAll().execute()

3. Design Patterns

Pattern 1: Aggregate Rollup MV

Pre-aggregate at multiple granularity levels:

-- Base aggregate
CREATE MATERIALIZED VIEW regional_sales_mv AS
SELECT 
    region,
    DATE_TRUNC('week', sale_date) AS week,
    SUM(amount) AS weekly_sales
FROM sales
GROUP BY 1, 2;

-- Higher-level rollup
CREATE MATERIALIZED VIEW regional_monthly_mv AS
SELECT 
    region,
    DATE_TRUNC('month', week) AS month,
    SUM(weekly_sales) AS monthly_sales,
    COUNT(*) AS weeks_with_data
FROM regional_sales_mv
GROUP BY 1, 2;

Pattern 2: Join Pre-computation

Pre-compute expensive joins:

-- Snowflake: Pre-computed enriched events
CREATE MATERIALIZED VIEW enriched_events_mv AS
SELECT 
    e.event_id,
    e.event_time,
    e.event_type,
    u.user_name,
    u.segment,
    p.product_name,
    p.category,
    e.amount
FROM events e
JOIN users u ON e.user_id = u.user_id
JOIN products p ON e.product_id = p.product_id;

-- Query now only needs simple scan
SELECT segment, category, SUM(amount)
FROM enriched_events_mv
WHERE event_time >= '2024-01-01'
GROUP BY 1, 2;

Pattern 3: Window Function Materialization

-- Materialize complex window functions
CREATE MATERIALIZED VIEW user_session_metrics_mv AS
WITH session_bounds AS (
    SELECT 
        user_id,
        session_id,
        MIN(event_time) AS session_start,
        MAX(event_time) AS session_end,
        COUNT(*) AS event_count,
        SUM(amount) AS session_value,
        -- Session duration in seconds
        EXTRACT(EPOCH FROM (MAX(event_time) - MIN(event_time))) AS duration_seconds
    FROM events
    GROUP BY user_id, session_id
)
SELECT 
    *,
    -- Classification based on metrics
    CASE 
        WHEN duration_seconds < 30 THEN 'bounce'
        WHEN duration_seconds < 300 THEN 'engaged'
        ELSE 'power_user'
    END AS session_type
FROM session_bounds;

Pattern 4: Slowly Changing Dimension (SCD) MV

-- Type 2 SCD materialization
CREATE MATERIALIZED VIEW dim_users_scd2_mv AS
WITH ranked_users AS (
    SELECT 
        user_id,
        email,
        segment,
        effective_date,
        expiry_date,
        ROW_NUMBER() OVER (
            PARTITION BY user_id 
            ORDER BY effective_date DESC
        ) AS rn
    FROM dim_users_history
)
SELECT 
    user_id,
    email,
    segment,
    effective_date,
    COALESCE(expiry_date, '9999-12-31') AS expiry_date,
    CASE WHEN rn = 1 THEN TRUE ELSE FALSE END AS is_current
FROM ranked_users;

4. Performance Optimization

Predicate Pushdown into MVs

The query optimizer should push predicates into MV access:

\text{Query} = \sigma_{\text{predicate}}(\text{MV})

-- Optimized query that can use MV
SELECT region, SUM(weekly_sales)
FROM regional_sales_mv
WHERE week >= '2024-01-01' AND week < '2024-02-01'
GROUP BY region;

-- Explain shows MV access instead of base table scan
EXPLAIN 
SELECT region, SUM(weekly_sales)
FROM regional_sales_mv
WHERE week >= '2024-01-01'
GROUP BY region;

Partitioning Strategy for MVs

-- Partition MV by time for efficient pruning
CREATE MATERIALIZED VIEW events_daily_mv
PARTITION BY (event_date, event_type) AS
SELECT 
    DATE(event_time) AS event_date,
    event_type,
    user_id,
    COUNT(*) AS event_count,
    SUM(amount) AS total_amount
FROM events
GROUP BY 1, 2, 3;

-- Query benefits from partition pruning
SELECT event_type, SUM(total_amount)
FROM events_daily_mv
WHERE event_date BETWEEN '2024-01-01' AND '2024-01-31'
GROUP BY event_type;

Statistics and Cost-Based Optimization

-- Update statistics for MV
ANALYZE MATERIALIZED VIEW sales_summary_mv;

-- Snowflake: Auto-refresh statistics
ALTER MATERIALIZED VIEW sales_summary_mv 
    SET REFRESH_ON_CREATE = TRUE;

-- Check MV freshness and size
SELECT 
    name,
    refresh_lag,
    stale_after,
    data_provenance,
    TEXT(parent_queries)
FROM information_schema.materialized_views;

5. Refresh Lag and Freshness

Freshness Guarantees

\text{Freshness} = \max_{t \in \text{MV}} (t_{\text{now}} - t_{\text{last\_refresh}})

Lag Detection and Alerting

-- Monitor MV freshness
CREATE OR REPLACE PROCEDURE check_mv_freshness()
RETURNS TABLE (mv_name STRING, freshness_minutes FLOAT, status STRING)
AS
$$
DECLARE
    max_lag_minutes FLOAT;
BEGIN
    CREATE OR REPLACE TEMP TABLE mv_freshness AS
    SELECT 
        name,
        DATEDIFF('minute', last_refresh_time, CURRENT_TIMESTAMP()) AS lag_minutes
    FROM information_schema.materialized_view_refresh_history
    WHERE refresh_state = 'STALE';
    
    FOR rec IN SELECT * FROM mv_freshness DO
        IF rec.lag_minutes > 60 THEN
            -- Alert on stale MVs
            CALL SYSTEM$SEND_EMAIL(
                'alerts@company.com',
                'MV Stale Alert: ' || rec.name,
                'Materialized view ' || rec.name || ' is ' || rec.lag_minutes || ' minutes stale'
            );
        END IF;
    END FOR;
    
    RETURN TABLE (
        SELECT name, lag_minutes, 
            CASE WHEN lag_minutes > 60 THEN 'CRITICAL' 
                 WHEN lag_minutes > 30 THEN 'WARNING' 
                 ELSE 'OK' END
        FROM mv_freshness
    );
END;
$$;

6. Advanced: Query Routing to MVs

Automatic Query Rewrite

-- PostgreSQL: Enable query rewrite
CREATE FUNCTION rewrite_query() RETURNS TRIGGER AS $$
BEGIN
    -- Check if query can be answered from MV
    IF check_mv_coverage(OLD) THEN
        RAISE NOTICE 'Query can use materialized view';
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

-- Snowflake automatic rewrite
ALTER SESSION SET use_cached_result = TRUE;

-- Manual rewrite detection
SELECT 
    query_text,
    rewrite_status,
    materialized_view_name
FROM query_rewrite_results
WHERE query_id = CURRENT_QUERY_ID();

Cost-Based MV Selection

\text{Cost}(\text{MV}_i) = \text{Size}(\text{MV}_i) \times \text{Selectivity}(\text{Query})

\text{Best MV} = \arg\min_{i} \text{Cost}(\text{MV}_i)

# Programmatic MV selection logic
def select_optimal_mv(query_filters, available_mvs):
    """Select best MV based on filter coverage"""
    best_mv = None
    best_cost = float('inf')
    
    for mv in available_mvs:
        # Calculate filter coverage
        covered_filters = set(query_filters) & set(mv['indexed_columns'])
        coverage = len(covered_filters) / len(query_filters) if query_filters else 0
        
        # Cost = size * (1 - coverage)
        cost = mv['size_bytes'] * (1 - coverage)
        
        if cost < best_cost:
            best_cost = cost
            best_mv = mv
    
    return best_mv

# Example usage
query_filters = ['region', 'date', 'product_category']
available_mvs = [
    {'name': 'sales_by_region', 'size_bytes': 1024**3, 'indexed_columns': ['region', 'date']},
    {'name': 'sales_by_category', 'size_bytes': 2 * 1024**3, 'indexed_columns': ['product_category', 'date']},
    {'name': 'sales_full', 'size_bytes': 5 * 1024**3, 'indexed_columns': ['region', 'date', 'product_category']},
]

optimal = select_optimal_mv(query_filters, available_mvs)
print(f"Selected MV: {optimal['name']}")

7. Common Pitfalls

⚠️

MV Explosion: Creating too many MVs can lead to excessive storage and refresh overhead. Monitor total MV size vs base table size ratio. Ideally keep MV storage under 20% of base tables.

⚠️

Stale Data Risk: Always check MV freshness before making business decisions. Implement SLA monitoring for critical MVs.

💡

Refresh Ordering: When MVs depend on other MVs, create a refresh dependency graph to avoid stale data propagation.

ℹ️

Naming Convention: Use consistent naming like {table}_mv or {purpose}_mv_{granularity} to make MV relationships clear.

Follow-up Questions

How would you implement incremental refresh for a materialized view containing window functions?
Explain the trade-offs between query rewrite optimization and manual MV selection in Snowflake.
How do you handle schema evolution in materialized views when base tables change?
Design a monitoring dashboard for materialized view health and freshness across a data platform.
How would you implement a materialized view that supports both real-time and batch queries?
Explain the cost implications of maintaining materialized views vs using query caching.
How do you handle concurrent refreshes and query consistency in PostgreSQL materialized views?
Design a strategy for materialized view consolidation in a legacy data warehouse migration.