Optimization Techniques: Clustering, Search Optimization & Materialized Views

Free Lesson

Advertisement

Optimization Techniques: Clustering, Search Optimization & Materialized Views

Architecture Diagram 1: Clustering Key Architecture

Architecture Diagram
┌─────────────────────────────────────────────────────────────────────────────┐
│                    CLUSTERING KEY ARCHITECTURE                               │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  WITHOUT CLUSTERING KEY:                                                    │
│  ═══════════════════════                                                    │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │                                                                      │   │
│  │  Micro-Partitions are randomly distributed:                         │   │
│  │                                                                      │   │
│  │  ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐     │   │
│  │  │ MP_001  │ │ MP_002  │ │ MP_003  │ │ MP_004  │ │ MP_005  │     │   │
│  │  │         │ │         │ │         │ │         │ │         │     │   │
│  │  │ US: 30% │ │ EU: 40% │ │ APAC:25%│ │ US: 35% │ │ EU: 20% │     │   │
│  │  │ EU: 40% │ │ US: 30% │ │ US: 40% │ │ EU: 30% │ │ US: 45% │     │   │
│  │  │APAC:30% │ │APAC:30% │ │ EU: 35% │ │APAC:35% │ │APAC:35% │     │   │
│  │  └─────────┘ └─────────┘ └─────────┘ └─────────┘ └─────────┘     │   │
│  │                                                                      │   │
│  │  Query: SELECT * FROM sales WHERE region = 'US'                     │   │
│  │  Result: Must scan ALL 5 micro-partitions (US data scattered)       │   │
│  │                                                                      │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                    │                                        │
│                                    │  CLUSTER BY (region)                   │
│                                    ▼                                        │
│  WITH CLUSTERING KEY:                                                       │
│  ═════════════════════                                                      │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │                                                                      │   │
│  │  Micro-Partitions are organized by clustering key:                  │   │
│  │                                                                      │   │
│  │  ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐     │   │
│  │  │ MP_001  │ │ MP_002  │ │ MP_003  │ │ MP_004  │ │ MP_005  │     │   │
│  │  │         │ │         │ │         │ │         │ │         │     │   │
│  │  │ US: 95% │ │ US: 90% │ │ EU: 85% │ │ EU: 90% │ │APAC:95% │     │   │
│  │  │ EU: 5%  │ │APAC:10% │ │APAC:15% │ │ US: 10% │ │ US: 5%  │     │   │
│  │  │APAC: 0% │ │ EU: 0%  │ │ US: 0%  │ │APAC: 0% │ │ EU: 0%  │     │   │
│  │  └─────────┘ └─────────┘ └─────────┘ └─────────┘ └─────────┘     │   │
│  │                                                                      │   │
│  │  Query: SELECT * FROM sales WHERE region = 'US'                     │   │
│  │  Result: Only scan MP_001 & MP_002 (US data concentrated)           │   │
│  │                                                                      │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                                                             │
│  CLUSTERING DEPTH & OVERLAP:                                                │
│  ═══════════════════════════                                                │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │                                                                      │   │
│  │  Deep Clustering (Good):                                            │   │
│  │  ┌──────────────────────────────────────────────────────────────┐   │   │
│  │  │  MP_001: US [95%]  |  MP_002: US [90%]  |  MP_003: EU [85%] │   │   │
│  │  │  Deepness: 1.0     |  Deepness: 1.0     |  Deepness: 1.0     │   │   │
│  │  │  Overlap: 5%       |  Overlap: 10%      |  Overlap: 15%      │   │   │
│  │  └──────────────────────────────────────────────────────────────┘   │   │
│  │                                                                      │   │
│  │  Shallow Clustering (Bad):                                          │   │
│  │  ┌──────────────────────────────────────────────────────────────┐   │   │
│  │  │  MP_001: US [60%] EU [30%] APAC [10%]                       │   │   │
│  │  │  Deepness: 3.5     (High overlap, poor pruning)              │   │   │
│  │  │  Overlap: 40%      (Data not well-separated)                 │   │   │
│  │  └──────────────────────────────────────────────────────────────┘   │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

Architecture Diagram 2: Search Optimization Service

Architecture Diagram
┌─────────────────────────────────────────────────────────────────────────────┐
│                    SEARCH OPTIMIZATION SERVICE                               │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  PROBLEM: Point lookups on non-clustering columns                           │
│  ═══════════════════════════════════════════════                             │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │                                                                      │   │
│  │  Table: customers (1 billion rows)                                  │   │
│  │  Clustering Key: (country)                                          │   │
│  │                                                                      │   │
│  │  Query: SELECT * FROM customers WHERE email = 'john@example.com'    │   │
│  │                                                                      │   │
│  │  Without Search Optimization:                                       │   │
│  │  ┌──────────────────────────────────────────────────────────────┐   │   │
│  │  │  Must scan: All micro-partitions (1000+)                      │   │   │
│  │  │  Reason: email is NOT the clustering key                      │   │   │
│  │  │  Time: 30-60 seconds                                          │   │   │
│  │  └──────────────────────────────────────────────────────────────┘   │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                    │                                        │
│                                    │  SEARCH OPTIMIZATION ON               │
│                                    ▼                                        │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │                                                                      │   │
│  │  SOLUTION: Search Optimization Service                              │   │
│  │                                                                      │   │
│  │  ┌──────────────────────────────────────────────────────────────┐   │   │
│  │  │  Search Optimization Data Structure:                          │   │   │
│  │  │                                                               │   │   │
│  │  │  ┌────────────────────────────────────────────────────────┐  │   │   │
│  │  │  │  Point Lookup Index (POI)                               │  │   │   │
│  │  │  │                                                         │  │   │   │
│  │  │  │  Email Value        │ Micro-Partition IDs              │  │   │   │
│  │  │  │  ──────────────────┼───────────────────────────────── │  │   │   │
│  │  │  │  aaron@test.com    │ MP_001, MP_156, MP_892           │  │   │   │
│  │  │  │  alice@company.com │ MP_023, MP_445, MP_1001          │  │   │   │
│  │  │  │  bob@example.com   │ MP_012, MP_234                   │  │   │   │
│  │  │  │  ...              │ ...                               │  │   │   │
│  │  │  │  john@example.com │ MP_045, MP_678                   │  │   │   │
│  │  │  │                                                         │  │   │   │
│  │  │  │  Total entries: ~500M unique emails                     │  │   │   │
│  │  │  │  Storage overhead: ~10-20% of table size                │  │   │   │
│  │  │  └────────────────────────────────────────────────────────┘  │   │   │
│  │  └──────────────────────────────────────────────────────────────┘   │   │
│  │                                                                      │   │
│  │  With Search Optimization:                                          │   │
│  │  ┌──────────────────────────────────────────────────────────────┐   │   │
│  │  │  Must scan: Only 2 micro-partitions (MP_045, MP_678)         │   │   │
│  │  │  Reason: POI directly maps email to micro-partitions         │   │   │
│  │  │  Time: 0.1-0.5 seconds                                       │   │   │
│  │  │  Improvement: 60-600x faster                                 │   │   │
│  │  └──────────────────────────────────────────────────────────────┘   │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                                                             │
│  SEARCH OPTIMIZATION MAINTENANCE:                                          │
│  ═════════════════════════════════                                          │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │                                                                      │   │
│  │  Automatic Background Processes:                                     │   │
│  │                                                                      │   │
│  │  ┌──────────────────────────────────────────────────────────────┐   │   │
│  │  │  1. Re-clustering:                                            │   │   │
│  │  │     • Continuously optimizes data placement                   │   │   │
│  │  │     • Runs when table statistics change                       │   │   │
│  │  │     • Cost: Compute credits (proportional to data changes)    │   │   │
│  │  │                                                               │   │   │
│  │  │  2. Index Maintenance:                                        │   │   │
│  │  │     • Updates POI when data is inserted/updated/deleted       │   │   │
│  │  │     • Maintains consistency with table data                   │   │   │
│  │  │     • Cost: Minimal (background thread)                       │   │   │
│  │  │                                                               │   │   │
│  │  │  3. Statistics Collection:                                     │   │   │
│  │  │     • Gathers column-level statistics                         │   │   │
│  │  │     • Updates min/max/distinct counts                         │   │   │
│  │  │     • Cost: Minimal                                           │   │   │
│  │  └──────────────────────────────────────────────────────────────┘   │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

Architecture Diagram 3: Materialized Views Architecture

Architecture Diagram
┌─────────────────────────────────────────────────────────────────────────────┐
│                    MATERIALIZED VIEWS ARCHITECTURE                           │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  TRADITIONAL VIEW:                                                          │
│  ═════════════════                                                          │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │                                                                      │   │
│  │  SELECT * FROM sales_summary;                                       │   │
│  │                                                                      │   │
│  │  ┌──────────────────────────────────────────────────────────────┐   │   │
│  │  │  Execution Plan:                                              │   │   │
│  │  │                                                               │   │   │
│  │  │  1. Parse view definition                                     │   │   │
│  │  │  2. Rewrite query as inline SQL                               │   │   │
│  │  │  3. Execute against base table                                │   │   │
│  │  │  4. Return results                                            │   │   │
│  │  │                                                               │   │   │
│  │  │  Every query re-executes the full aggregation                 │   │   │
│  │  │  Cost: O(n) per query where n = base table size              │   │   │
│  │  └──────────────────────────────────────────────────────────────┘   │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                    │                                        │
│                                    │  CREATE MATERIALIZED VIEW             │
│                                    ▼                                        │
│  MATERIALIZED VIEW:                                                         │
│  ══════════════════                                                         │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │                                                                      │   │
│  │  SELECT * FROM sales_mv;                                            │   │
│  │                                                                      │   │
│  │  ┌──────────────────────────────────────────────────────────────┐   │   │
│  │  │  Pre-computed Results:                                        │   │   │
│  │  │                                                               │   │   │
│  │  │  ┌────────────────────────────────────────────────────────┐  │   │   │
│  │  │  │  sales_mv (Materialized)                               │  │   │   │
│  │  │  │                                                         │  │   │   │
│  │  │  │  ┌─────────┬─────────┬─────────┬─────────┬──────────┐ │  │   │   │
│  │  │  │  │ Region  │ Product │ Q1 2024 │ Q2 2024 │ Total    │ │  │   │   │
│  │  │  │  ├─────────┼─────────┼─────────┼─────────┼──────────┤ │  │   │   │
│  │  │  │  │ US      │ Laptop  │ 150,000 │ 180,000 │ 330,000  │ │  │   │   │
│  │  │  │  │ US      │ Phone   │ 200,000 │ 220,000 │ 420,000  │ │  │   │   │
│  │  │  │  │ EU      │ Laptop  │ 120,000 │ 140,000 │ 260,000  │ │  │   │   │
│  │  │  │  │ EU      │ Phone   │ 180,000 │ 190,000 │ 370,000  │ │  │   │   │
│  │  │  │  └─────────┴─────────┴─────────┴─────────┴──────────┘ │  │   │   │
│  │  │  │                                                         │  │   │   │
│  │  │  │  Results stored on disk (not recomputed)                │  │   │   │
│  │  │  │  Cost: O(1) per query                                   │  │   │   │
│  │  │  └────────────────────────────────────────────────────────┘  │   │   │
│  │  └──────────────────────────────────────────────────────────────┘   │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                    │                                        │
│                                    ▼                                        │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │                                                                      │   │
│  │  REFRESH MECHANISMS:                                                │   │
│  │                                                                      │   │
│  │  ┌──────────────────────────────────────────────────────────────┐   │   │
│  │  │                                                               │   │   │
│  │  │  Automatic Refresh (Background):                              │   │   │
│  │  │  ┌────────────────────────────────────────────────────────┐  │   │   │
│  │  │  │  • Detects base table changes                            │  │   │   │
│  │  │  │  • Incrementally updates materialized results           │  │   │   │
│  │  │  │  • Runs periodically (not real-time)                    │  │   │   │
│  │  │  │  • Cost: Background compute credits                     │  │   │   │
│  │  │  └────────────────────────────────────────────────────────┘  │   │   │
│  │  │                                                               │   │   │
│  │  │  Manual Refresh:                                              │   │   │
│  │  │  ┌────────────────────────────────────────────────────────┐  │   │   │
│  │  │  │  ALTER MATERIALIZED VIEW sales_mv REFRESH;              │  │   │   │
│  │  │  │                                                         │  │   │   │
│  │  │  │  • Forces immediate refresh                            │  │   │   │
│  │  │  │  • Waits for completion                                │  │   │   │
│  │  │  │  • Cost: Compute credits for refresh                   │  │   │   │
│  │  │  └────────────────────────────────────────────────────────┘  │   │   │
│  │  └──────────────────────────────────────────────────────────────┘   │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

A clustering key defines how data is organized within micro-partitions. Snowflake automatically reorganizes micro-partitions to group rows with similar key values together, improving partition pruning effectiveness for queries that filter on key columns.

The Search Optimization Service creates auxiliary point lookup indexes (POI) that map specific column values to their micro-partition locations. This enables efficient equality searches on non-clustering columns, providing 60–600× performance improvements for point lookups.

A materialized view pre-computes and stores query results on disk, eliminating re-execution of complex aggregations. Unlike standard views (recomputed each access), materialized views return pre-computed results at O(1) per query, with automatic or manual refresh.

Materialized views support automatic incremental refresh (not real-time). The optimizer transparently rewrites queries against standard views to use materialized views when data freshness requirements are met.

  1. Analyze query patterns: Use QUERY_HISTORY to identify frequent WHERE/JOIN columns
  2. Choose clustering keys: Select low-to-medium cardinality columns used in filters
  3. Check clustering depth: Use CLUSTERING_INFORMATION(); target depth < 1.5
  4. Consider search optimization: Enable for high-cardinality equality lookups only
  5. Materialize aggregations: Create materialized views for complex, frequently-run aggregations
  6. Monitor: Track pruning efficiency, depth, and materialized view freshness
  • Clustering keys: Organize data for better partition pruning (target depth < 1.5)
  • Search optimization: 60–600× faster point lookups; 10–20% storage overhead
  • Materialized views: Pre-computed results for complex aggregations
  • Automatic clustering: Background service optimizes data placement continuously
  • Balance: Not all tables need clustering — consider query patterns and cost

Detailed Explanation

Clustering Keys: Data Organization Strategy

Clustering keys define how data is organized within micro-partitions, optimizing query performance by improving data locality. When you specify a clustering key, Snowflake automatically reorganizes micro-partitions to group rows with similar key values together. This organization enables more effective partition pruning during query execution, eliminating entire micro-partitions that don't contain qualifying data.

The selection of clustering keys requires understanding query patterns and data distribution. Effective clustering keys are columns frequently used in WHERE, JOIN, and GROUP BY clauses. Low-to-medium cardinality columns (e.g., date, region, status) typically provide better clustering effectiveness than high-cardinality columns (e.g., UUID, timestamp with microseconds).

Snowflake maintains clustering metadata that tracks the distribution of key values across micro-partitions. This metadata includes minimum and maximum key values per micro-partition, enabling the optimizer to eliminate partitions that don't contain relevant data. The clustering depth metric indicates how well data is organized, with lower values indicating better clustering (less overlap between micro-partitions).

Automatic Clustering

Automatic clustering is a background service that continuously optimizes data placement based on clustering keys. Unlike manual re-clustering, automatic clustering runs asynchronously without user intervention, adapting to data changes and query pattern evolution. The service uses a cost-based approach, only re-clustering when the benefits outweigh the compute costs.

The automatic clustering algorithm analyzes micro-partition statistics and identifies opportunities to improve data organization. It moves rows between micro-partitions to reduce overlap and improve clustering depth. This process runs in the background, consuming minimal compute resources while maintaining optimal data organization.

Search Optimization Service

The search optimization service addresses a specific performance gap: point lookups on non-clustering columns. Without search optimization, finding a specific value in a non-clustering column requires scanning all micro-partitions, even if the table is well-clustered. The search optimization service creates auxiliary data structures (point lookup indexes) that map specific values to their micro-partition locations.

The service maintains a point lookup index (POI) for each column with search optimization enabled. When a query performs an equality search on an indexed column, the optimizer uses the POI to identify the exact micro-partitions containing the target value, eliminating unnecessary scans. This can provide 60-600x performance improvements for point lookup queries.

Search optimization is most effective for tables with high-cardinality columns (millions of distinct values) that are frequently used in equality searches. The service incurs storage overhead (10-20% of table size) and maintenance costs for index updates. It's not recommended for columns with low cardinality or columns primarily used in range queries.

Materialized Views

Materialized views pre-compute and store query results, eliminating the need to re-execute complex aggregations or joins for each query. Unlike standard views that execute their defining query each time they're accessed, materialized views store results on disk and return them directly. This provides significant performance improvements for queries that repeatedly access the same aggregated data.

Materialized views support automatic and manual refresh mechanisms. Automatic refresh detects changes to base tables and incrementally updates materialized results. This refresh is not real-time; there's a lag between base table changes and materialized view updates. Manual refresh forces immediate updates but requires explicit execution and consumes compute resources.

The optimizer automatically rewrites queries against standard views to use materialized views when possible. This transparency means users don't need to know whether they're querying a view or materialized view. The optimizer considers data freshness requirements and chooses the appropriate data source based on query characteristics.

Performance Monitoring and Tuning

Effective optimization requires continuous monitoring and tuning. Snowflake provides several tools for analyzing query performance, including the QUERY_HISTORY view for execution statistics, PROFILE for detailed execution plans, and WAREHOUSE_METERING_HISTORY for resource utilization.

Key metrics to monitor include partition pruning effectiveness (percentage of micro-partitions eliminated), clustering depth (overlap between micro-partitions), and materialized view freshness (time since last refresh). These metrics indicate whether optimization strategies are effective and identify opportunities for improvement.

Key Concepts Table

OptimizationPurposeOverheadBest For
Clustering KeysData organizationLowRange queries, JOINs
Search OptimizationPoint lookupsHighEquality searches
Materialized ViewsPre-computed resultsMediumComplex aggregations
Result CacheQuery result cachingNoneRepeated identical queries
Clustering Key SelectionCardinalityQuery PatternEffectiveness
Date columnLow-MediumRange queriesHigh
Region/CountryLowGROUP BY, WHEREHigh
Customer IDHighPoint lookupsMedium
TimestampHighRange queriesMedium
Composite keyVariesComplex queriesHigh
Search OptimizationColumn TypeQuery PatternStorage Overhead
High cardinalitySTRING/NUMBEREquality (=)10-20%
Low cardinalitySTRINGEquality (=)5-10%
Range queriesNUMBER/DATERange (>, &lt;)Not recommended

Code Examples

-- Example 1: Create table with clustering key
CREATE TABLE sales_data (
    id INTEGER,
    transaction_date DATE,
    region VARCHAR(10),
    product_id VARCHAR(20),
    amount NUMBER(10,2)
)
CLUSTER BY (transaction_date, region);

-- Example 2: Alter table to add clustering key
ALTER TABLE sales_data CLUSTER BY (transaction_date, region);

-- Example 3: Check clustering information
SELECT *
FROM TABLE(INFORMATION_SCHEMA.CLUSTERING_INFORMATION(
    TABLE_NAME => 'sales_data',
    SCHEMA_NAME => 'PUBLIC'
));

-- Example 4: Create materialized view
CREATE MATERIALIZED VIEW sales_summary_mv
AS
SELECT 
    transaction_date,
    region,
    COUNT(*) as transaction_count,
    SUM(amount) as total_amount,
    AVG(amount) as avg_amount
FROM sales_data
GROUP BY 1, 2;

-- Example 5: Enable search optimization
ALTER TABLE sales_data ADD SEARCH OPTIMIZATION ON EQUALITY (product_id);

-- Example 6: Refresh materialized view
ALTER MATERIALIZED VIEW sales_summary_mv REFRESH;

-- Example 7: Check materialized view freshness
SELECT 
    name,
    refresh_group,
    last_refresh_time,
    seconds_since_refresh
FROM INFORMATION_SCHEMA.MATERIALIZED_VIEWS
WHERE name = 'SALES_SUMMARY_MV';

-- Example 8: Analyze clustering effectiveness
SELECT 
    table_name,
    clustering_key,
    total_partition_count,
    total_constant_partition_count,
    average_partition_depth,
    average_overlap_bytes,
    average_partition_depth * average_overlap_bytes as clustering_score
FROM TABLE(INFORMATION_SCHEMA.CLUSTERING_INFORMATION(
    TABLE_NAME => 'sales_data'
));

-- Example 9: Check search optimization status
SHOW SEARCH OPTIMIZATION IN TABLE sales_data;

-- Example 10: Compare query performance
-- Without materialized view
EXPLAIN SELECT 
    transaction_date,
    region,
    SUM(amount)
FROM sales_data
WHERE transaction_date >= '2024-01-01'
GROUP BY 1, 2;

-- With materialized view (optimizer should rewrite)
EXPLAIN SELECT 
    transaction_date,
    region,
    SUM(amount)
FROM sales_summary_mv
WHERE transaction_date >= '2024-01-01'
GROUP BY 1, 2;

Performance Metrics

MetricTargetWarningCriticalDescription
Clustering Depth< 1.51.5-2.5> 2.5Average micro-partition overlap
Partition Pruning> 80%60-80%< 60%Percentage of partitions eliminated
Search Optimization Hit> 90%70-90%< 70%POI lookup success rate
Materialized View Freshness< 1 hour1-24 hours> 24 hoursTime since last refresh
Cache Hit Rate> 50%30-50%< 30%Result cache effectiveness

Best Practices

  1. Choose clustering keys wisely: Select columns frequently used in WHERE, JOIN, and GROUP BY clauses. Prefer low-to-medium cardinality columns for better pruning.

  2. Monitor clustering depth: Regularly check clustering information to identify tables with poor clustering. Consider re-clustering for tables with depth > 2.0.

  3. Use search optimization selectively: Enable search optimization only for high-cardinality columns used in equality searches. Avoid for low-cardinality or range query columns.

  4. Create materialized views for aggregations: Materialize complex aggregations that are queried frequently. Consider refresh frequency requirements and storage costs.

  5. Monitor materialized view freshness: Ensure materialized views are refreshed within SLA requirements. Use manual refresh for time-critical scenarios.

  6. Leverage result cache: Ensure identical queries can hit the cache by avoiding non-deterministic functions. Use QUERY_TAG to group similar queries.

  7. Analyze query patterns: Use QUERY_HISTORY to identify slow queries and optimize accordingly. Focus on queries with poor partition pruning.

  8. Consider composite clustering keys: Use multiple columns for clustering when queries frequently filter on multiple columns. Order columns by selectivity.

  9. Balance optimization overhead: Consider storage and compute costs when implementing optimization features. Not all tables need clustering or search optimization.

  10. Regular performance reviews: Conduct weekly optimization reviews to identify performance trends and adjust strategies accordingly.


See Also

Advertisement

Need Expert Snowflake Help?

Get personalized warehouse optimization, data modeling, or Snowflake platform consulting.

Advertisement