Optimization Techniques: Clustering, Search Optimization & Materialized Views
Architecture Diagram 1: Clustering Key Architecture
┌─────────────────────────────────────────────────────────────────────────────┐
│ CLUSTERING KEY ARCHITECTURE │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ WITHOUT CLUSTERING KEY: │
│ ═══════════════════════ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ │ │
│ │ Micro-Partitions are randomly distributed: │ │
│ │ │ │
│ │ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │ │
│ │ │ MP_001 │ │ MP_002 │ │ MP_003 │ │ MP_004 │ │ MP_005 │ │ │
│ │ │ │ │ │ │ │ │ │ │ │ │ │
│ │ │ US: 30% │ │ EU: 40% │ │ APAC:25%│ │ US: 35% │ │ EU: 20% │ │ │
│ │ │ EU: 40% │ │ US: 30% │ │ US: 40% │ │ EU: 30% │ │ US: 45% │ │ │
│ │ │APAC:30% │ │APAC:30% │ │ EU: 35% │ │APAC:35% │ │APAC:35% │ │ │
│ │ └─────────┘ └─────────┘ └─────────┘ └─────────┘ └─────────┘ │ │
│ │ │ │
│ │ Query: SELECT * FROM sales WHERE region = 'US' │ │
│ │ Result: Must scan ALL 5 micro-partitions (US data scattered) │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ │ CLUSTER BY (region) │
│ ▼ │
│ WITH CLUSTERING KEY: │
│ ═════════════════════ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ │ │
│ │ Micro-Partitions are organized by clustering key: │ │
│ │ │ │
│ │ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │ │
│ │ │ MP_001 │ │ MP_002 │ │ MP_003 │ │ MP_004 │ │ MP_005 │ │ │
│ │ │ │ │ │ │ │ │ │ │ │ │ │
│ │ │ US: 95% │ │ US: 90% │ │ EU: 85% │ │ EU: 90% │ │APAC:95% │ │ │
│ │ │ EU: 5% │ │APAC:10% │ │APAC:15% │ │ US: 10% │ │ US: 5% │ │ │
│ │ │APAC: 0% │ │ EU: 0% │ │ US: 0% │ │APAC: 0% │ │ EU: 0% │ │ │
│ │ └─────────┘ └─────────┘ └─────────┘ └─────────┘ └─────────┘ │ │
│ │ │ │
│ │ Query: SELECT * FROM sales WHERE region = 'US' │ │
│ │ Result: Only scan MP_001 & MP_002 (US data concentrated) │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ CLUSTERING DEPTH & OVERLAP: │
│ ═══════════════════════════ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ │ │
│ │ Deep Clustering (Good): │ │
│ │ ┌──────────────────────────────────────────────────────────────┐ │ │
│ │ │ MP_001: US [95%] | MP_002: US [90%] | MP_003: EU [85%] │ │ │
│ │ │ Deepness: 1.0 | Deepness: 1.0 | Deepness: 1.0 │ │ │
│ │ │ Overlap: 5% | Overlap: 10% | Overlap: 15% │ │ │
│ │ └──────────────────────────────────────────────────────────────┘ │ │
│ │ │ │
│ │ Shallow Clustering (Bad): │ │
│ │ ┌──────────────────────────────────────────────────────────────┐ │ │
│ │ │ MP_001: US [60%] EU [30%] APAC [10%] │ │ │
│ │ │ Deepness: 3.5 (High overlap, poor pruning) │ │ │
│ │ │ Overlap: 40% (Data not well-separated) │ │ │
│ │ └──────────────────────────────────────────────────────────────┘ │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Architecture Diagram 2: Search Optimization Service
┌─────────────────────────────────────────────────────────────────────────────┐
│ SEARCH OPTIMIZATION SERVICE │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ PROBLEM: Point lookups on non-clustering columns │
│ ═══════════════════════════════════════════════ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ │ │
│ │ Table: customers (1 billion rows) │ │
│ │ Clustering Key: (country) │ │
│ │ │ │
│ │ Query: SELECT * FROM customers WHERE email = 'john@example.com' │ │
│ │ │ │
│ │ Without Search Optimization: │ │
│ │ ┌──────────────────────────────────────────────────────────────┐ │ │
│ │ │ Must scan: All micro-partitions (1000+) │ │ │
│ │ │ Reason: email is NOT the clustering key │ │ │
│ │ │ Time: 30-60 seconds │ │ │
│ │ └──────────────────────────────────────────────────────────────┘ │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ │ SEARCH OPTIMIZATION ON │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ │ │
│ │ SOLUTION: Search Optimization Service │ │
│ │ │ │
│ │ ┌──────────────────────────────────────────────────────────────┐ │ │
│ │ │ Search Optimization Data Structure: │ │ │
│ │ │ │ │ │
│ │ │ ┌────────────────────────────────────────────────────────┐ │ │ │
│ │ │ │ Point Lookup Index (POI) │ │ │ │
│ │ │ │ │ │ │ │
│ │ │ │ Email Value │ Micro-Partition IDs │ │ │ │
│ │ │ │ ──────────────────┼───────────────────────────────── │ │ │ │
│ │ │ │ aaron@test.com │ MP_001, MP_156, MP_892 │ │ │ │
│ │ │ │ alice@company.com │ MP_023, MP_445, MP_1001 │ │ │ │
│ │ │ │ bob@example.com │ MP_012, MP_234 │ │ │ │
│ │ │ │ ... │ ... │ │ │ │
│ │ │ │ john@example.com │ MP_045, MP_678 │ │ │ │
│ │ │ │ │ │ │ │
│ │ │ │ Total entries: ~500M unique emails │ │ │ │
│ │ │ │ Storage overhead: ~10-20% of table size │ │ │ │
│ │ │ └────────────────────────────────────────────────────────┘ │ │ │
│ │ └──────────────────────────────────────────────────────────────┘ │ │
│ │ │ │
│ │ With Search Optimization: │ │
│ │ ┌──────────────────────────────────────────────────────────────┐ │ │
│ │ │ Must scan: Only 2 micro-partitions (MP_045, MP_678) │ │ │
│ │ │ Reason: POI directly maps email to micro-partitions │ │ │
│ │ │ Time: 0.1-0.5 seconds │ │ │
│ │ │ Improvement: 60-600x faster │ │ │
│ │ └──────────────────────────────────────────────────────────────┘ │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ SEARCH OPTIMIZATION MAINTENANCE: │
│ ═════════════════════════════════ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ │ │
│ │ Automatic Background Processes: │ │
│ │ │ │
│ │ ┌──────────────────────────────────────────────────────────────┐ │ │
│ │ │ 1. Re-clustering: │ │ │
│ │ │ • Continuously optimizes data placement │ │ │
│ │ │ • Runs when table statistics change │ │ │
│ │ │ • Cost: Compute credits (proportional to data changes) │ │ │
│ │ │ │ │ │
│ │ │ 2. Index Maintenance: │ │ │
│ │ │ • Updates POI when data is inserted/updated/deleted │ │ │
│ │ │ • Maintains consistency with table data │ │ │
│ │ │ • Cost: Minimal (background thread) │ │ │
│ │ │ │ │ │
│ │ │ 3. Statistics Collection: │ │ │
│ │ │ • Gathers column-level statistics │ │ │
│ │ │ • Updates min/max/distinct counts │ │ │
│ │ │ • Cost: Minimal │ │ │
│ │ └──────────────────────────────────────────────────────────────┘ │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Architecture Diagram 3: Materialized Views Architecture
┌─────────────────────────────────────────────────────────────────────────────┐
│ MATERIALIZED VIEWS ARCHITECTURE │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ TRADITIONAL VIEW: │
│ ═════════════════ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ │ │
│ │ SELECT * FROM sales_summary; │ │
│ │ │ │
│ │ ┌──────────────────────────────────────────────────────────────┐ │ │
│ │ │ Execution Plan: │ │ │
│ │ │ │ │ │
│ │ │ 1. Parse view definition │ │ │
│ │ │ 2. Rewrite query as inline SQL │ │ │
│ │ │ 3. Execute against base table │ │ │
│ │ │ 4. Return results │ │ │
│ │ │ │ │ │
│ │ │ Every query re-executes the full aggregation │ │ │
│ │ │ Cost: O(n) per query where n = base table size │ │ │
│ │ └──────────────────────────────────────────────────────────────┘ │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ │ CREATE MATERIALIZED VIEW │
│ ▼ │
│ MATERIALIZED VIEW: │
│ ══════════════════ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ │ │
│ │ SELECT * FROM sales_mv; │ │
│ │ │ │
│ │ ┌──────────────────────────────────────────────────────────────┐ │ │
│ │ │ Pre-computed Results: │ │ │
│ │ │ │ │ │
│ │ │ ┌────────────────────────────────────────────────────────┐ │ │ │
│ │ │ │ sales_mv (Materialized) │ │ │ │
│ │ │ │ │ │ │ │
│ │ │ │ ┌─────────┬─────────┬─────────┬─────────┬──────────┐ │ │ │ │
│ │ │ │ │ Region │ Product │ Q1 2024 │ Q2 2024 │ Total │ │ │ │ │
│ │ │ │ ├─────────┼─────────┼─────────┼─────────┼──────────┤ │ │ │ │
│ │ │ │ │ US │ Laptop │ 150,000 │ 180,000 │ 330,000 │ │ │ │ │
│ │ │ │ │ US │ Phone │ 200,000 │ 220,000 │ 420,000 │ │ │ │ │
│ │ │ │ │ EU │ Laptop │ 120,000 │ 140,000 │ 260,000 │ │ │ │ │
│ │ │ │ │ EU │ Phone │ 180,000 │ 190,000 │ 370,000 │ │ │ │ │
│ │ │ │ └─────────┴─────────┴─────────┴─────────┴──────────┘ │ │ │ │
│ │ │ │ │ │ │ │
│ │ │ │ Results stored on disk (not recomputed) │ │ │ │
│ │ │ │ Cost: O(1) per query │ │ │ │
│ │ │ └────────────────────────────────────────────────────────┘ │ │ │
│ │ └──────────────────────────────────────────────────────────────┘ │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ │ │
│ │ REFRESH MECHANISMS: │ │
│ │ │ │
│ │ ┌──────────────────────────────────────────────────────────────┐ │ │
│ │ │ │ │ │
│ │ │ Automatic Refresh (Background): │ │ │
│ │ │ ┌────────────────────────────────────────────────────────┐ │ │ │
│ │ │ │ • Detects base table changes │ │ │ │
│ │ │ │ • Incrementally updates materialized results │ │ │ │
│ │ │ │ • Runs periodically (not real-time) │ │ │ │
│ │ │ │ • Cost: Background compute credits │ │ │ │
│ │ │ └────────────────────────────────────────────────────────┘ │ │ │
│ │ │ │ │ │
│ │ │ Manual Refresh: │ │ │
│ │ │ ┌────────────────────────────────────────────────────────┐ │ │ │
│ │ │ │ ALTER MATERIALIZED VIEW sales_mv REFRESH; │ │ │ │
│ │ │ │ │ │ │ │
│ │ │ │ • Forces immediate refresh │ │ │ │
│ │ │ │ • Waits for completion │ │ │ │
│ │ │ │ • Cost: Compute credits for refresh │ │ │ │
│ │ │ └────────────────────────────────────────────────────────┘ │ │ │
│ │ └──────────────────────────────────────────────────────────────┘ │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
A clustering key defines how data is organized within micro-partitions. Snowflake automatically reorganizes micro-partitions to group rows with similar key values together, improving partition pruning effectiveness for queries that filter on key columns.
The Search Optimization Service creates auxiliary point lookup indexes (POI) that map specific column values to their micro-partition locations. This enables efficient equality searches on non-clustering columns, providing 60–600× performance improvements for point lookups.
A materialized view pre-computes and stores query results on disk, eliminating re-execution of complex aggregations. Unlike standard views (recomputed each access), materialized views return pre-computed results at O(1) per query, with automatic or manual refresh.
Materialized views support automatic incremental refresh (not real-time). The optimizer transparently rewrites queries against standard views to use materialized views when data freshness requirements are met.
- Analyze query patterns: Use QUERY_HISTORY to identify frequent WHERE/JOIN columns
- Choose clustering keys: Select low-to-medium cardinality columns used in filters
- Check clustering depth: Use CLUSTERING_INFORMATION(); target depth < 1.5
- Consider search optimization: Enable for high-cardinality equality lookups only
- Materialize aggregations: Create materialized views for complex, frequently-run aggregations
- Monitor: Track pruning efficiency, depth, and materialized view freshness
- Clustering keys: Organize data for better partition pruning (target depth < 1.5)
- Search optimization: 60–600× faster point lookups; 10–20% storage overhead
- Materialized views: Pre-computed results for complex aggregations
- Automatic clustering: Background service optimizes data placement continuously
- Balance: Not all tables need clustering — consider query patterns and cost
Detailed Explanation
Clustering Keys: Data Organization Strategy
Clustering keys define how data is organized within micro-partitions, optimizing query performance by improving data locality. When you specify a clustering key, Snowflake automatically reorganizes micro-partitions to group rows with similar key values together. This organization enables more effective partition pruning during query execution, eliminating entire micro-partitions that don't contain qualifying data.
The selection of clustering keys requires understanding query patterns and data distribution. Effective clustering keys are columns frequently used in WHERE, JOIN, and GROUP BY clauses. Low-to-medium cardinality columns (e.g., date, region, status) typically provide better clustering effectiveness than high-cardinality columns (e.g., UUID, timestamp with microseconds).
Snowflake maintains clustering metadata that tracks the distribution of key values across micro-partitions. This metadata includes minimum and maximum key values per micro-partition, enabling the optimizer to eliminate partitions that don't contain relevant data. The clustering depth metric indicates how well data is organized, with lower values indicating better clustering (less overlap between micro-partitions).
Automatic Clustering
Automatic clustering is a background service that continuously optimizes data placement based on clustering keys. Unlike manual re-clustering, automatic clustering runs asynchronously without user intervention, adapting to data changes and query pattern evolution. The service uses a cost-based approach, only re-clustering when the benefits outweigh the compute costs.
The automatic clustering algorithm analyzes micro-partition statistics and identifies opportunities to improve data organization. It moves rows between micro-partitions to reduce overlap and improve clustering depth. This process runs in the background, consuming minimal compute resources while maintaining optimal data organization.
Search Optimization Service
The search optimization service addresses a specific performance gap: point lookups on non-clustering columns. Without search optimization, finding a specific value in a non-clustering column requires scanning all micro-partitions, even if the table is well-clustered. The search optimization service creates auxiliary data structures (point lookup indexes) that map specific values to their micro-partition locations.
The service maintains a point lookup index (POI) for each column with search optimization enabled. When a query performs an equality search on an indexed column, the optimizer uses the POI to identify the exact micro-partitions containing the target value, eliminating unnecessary scans. This can provide 60-600x performance improvements for point lookup queries.
Search optimization is most effective for tables with high-cardinality columns (millions of distinct values) that are frequently used in equality searches. The service incurs storage overhead (10-20% of table size) and maintenance costs for index updates. It's not recommended for columns with low cardinality or columns primarily used in range queries.
Materialized Views
Materialized views pre-compute and store query results, eliminating the need to re-execute complex aggregations or joins for each query. Unlike standard views that execute their defining query each time they're accessed, materialized views store results on disk and return them directly. This provides significant performance improvements for queries that repeatedly access the same aggregated data.
Materialized views support automatic and manual refresh mechanisms. Automatic refresh detects changes to base tables and incrementally updates materialized results. This refresh is not real-time; there's a lag between base table changes and materialized view updates. Manual refresh forces immediate updates but requires explicit execution and consumes compute resources.
The optimizer automatically rewrites queries against standard views to use materialized views when possible. This transparency means users don't need to know whether they're querying a view or materialized view. The optimizer considers data freshness requirements and chooses the appropriate data source based on query characteristics.
Performance Monitoring and Tuning
Effective optimization requires continuous monitoring and tuning. Snowflake provides several tools for analyzing query performance, including the QUERY_HISTORY view for execution statistics, PROFILE for detailed execution plans, and WAREHOUSE_METERING_HISTORY for resource utilization.
Key metrics to monitor include partition pruning effectiveness (percentage of micro-partitions eliminated), clustering depth (overlap between micro-partitions), and materialized view freshness (time since last refresh). These metrics indicate whether optimization strategies are effective and identify opportunities for improvement.
Key Concepts Table
| Optimization | Purpose | Overhead | Best For |
|---|---|---|---|
| Clustering Keys | Data organization | Low | Range queries, JOINs |
| Search Optimization | Point lookups | High | Equality searches |
| Materialized Views | Pre-computed results | Medium | Complex aggregations |
| Result Cache | Query result caching | None | Repeated identical queries |
| Clustering Key Selection | Cardinality | Query Pattern | Effectiveness |
|---|---|---|---|
| Date column | Low-Medium | Range queries | High |
| Region/Country | Low | GROUP BY, WHERE | High |
| Customer ID | High | Point lookups | Medium |
| Timestamp | High | Range queries | Medium |
| Composite key | Varies | Complex queries | High |
| Search Optimization | Column Type | Query Pattern | Storage Overhead |
|---|---|---|---|
| High cardinality | STRING/NUMBER | Equality (=) | 10-20% |
| Low cardinality | STRING | Equality (=) | 5-10% |
| Range queries | NUMBER/DATE | Range (>, <) | Not recommended |
Code Examples
-- Example 1: Create table with clustering key
CREATE TABLE sales_data (
id INTEGER,
transaction_date DATE,
region VARCHAR(10),
product_id VARCHAR(20),
amount NUMBER(10,2)
)
CLUSTER BY (transaction_date, region);
-- Example 2: Alter table to add clustering key
ALTER TABLE sales_data CLUSTER BY (transaction_date, region);
-- Example 3: Check clustering information
SELECT *
FROM TABLE(INFORMATION_SCHEMA.CLUSTERING_INFORMATION(
TABLE_NAME => 'sales_data',
SCHEMA_NAME => 'PUBLIC'
));
-- Example 4: Create materialized view
CREATE MATERIALIZED VIEW sales_summary_mv
AS
SELECT
transaction_date,
region,
COUNT(*) as transaction_count,
SUM(amount) as total_amount,
AVG(amount) as avg_amount
FROM sales_data
GROUP BY 1, 2;
-- Example 5: Enable search optimization
ALTER TABLE sales_data ADD SEARCH OPTIMIZATION ON EQUALITY (product_id);
-- Example 6: Refresh materialized view
ALTER MATERIALIZED VIEW sales_summary_mv REFRESH;
-- Example 7: Check materialized view freshness
SELECT
name,
refresh_group,
last_refresh_time,
seconds_since_refresh
FROM INFORMATION_SCHEMA.MATERIALIZED_VIEWS
WHERE name = 'SALES_SUMMARY_MV';
-- Example 8: Analyze clustering effectiveness
SELECT
table_name,
clustering_key,
total_partition_count,
total_constant_partition_count,
average_partition_depth,
average_overlap_bytes,
average_partition_depth * average_overlap_bytes as clustering_score
FROM TABLE(INFORMATION_SCHEMA.CLUSTERING_INFORMATION(
TABLE_NAME => 'sales_data'
));
-- Example 9: Check search optimization status
SHOW SEARCH OPTIMIZATION IN TABLE sales_data;
-- Example 10: Compare query performance
-- Without materialized view
EXPLAIN SELECT
transaction_date,
region,
SUM(amount)
FROM sales_data
WHERE transaction_date >= '2024-01-01'
GROUP BY 1, 2;
-- With materialized view (optimizer should rewrite)
EXPLAIN SELECT
transaction_date,
region,
SUM(amount)
FROM sales_summary_mv
WHERE transaction_date >= '2024-01-01'
GROUP BY 1, 2;
Performance Metrics
| Metric | Target | Warning | Critical | Description |
|---|---|---|---|---|
| Clustering Depth | < 1.5 | 1.5-2.5 | > 2.5 | Average micro-partition overlap |
| Partition Pruning | > 80% | 60-80% | < 60% | Percentage of partitions eliminated |
| Search Optimization Hit | > 90% | 70-90% | < 70% | POI lookup success rate |
| Materialized View Freshness | < 1 hour | 1-24 hours | > 24 hours | Time since last refresh |
| Cache Hit Rate | > 50% | 30-50% | < 30% | Result cache effectiveness |
Best Practices
-
Choose clustering keys wisely: Select columns frequently used in WHERE, JOIN, and GROUP BY clauses. Prefer low-to-medium cardinality columns for better pruning.
-
Monitor clustering depth: Regularly check clustering information to identify tables with poor clustering. Consider re-clustering for tables with depth > 2.0.
-
Use search optimization selectively: Enable search optimization only for high-cardinality columns used in equality searches. Avoid for low-cardinality or range query columns.
-
Create materialized views for aggregations: Materialize complex aggregations that are queried frequently. Consider refresh frequency requirements and storage costs.
-
Monitor materialized view freshness: Ensure materialized views are refreshed within SLA requirements. Use manual refresh for time-critical scenarios.
-
Leverage result cache: Ensure identical queries can hit the cache by avoiding non-deterministic functions. Use QUERY_TAG to group similar queries.
-
Analyze query patterns: Use QUERY_HISTORY to identify slow queries and optimize accordingly. Focus on queries with poor partition pruning.
-
Consider composite clustering keys: Use multiple columns for clustering when queries frequently filter on multiple columns. Order columns by selectivity.
-
Balance optimization overhead: Consider storage and compute costs when implementing optimization features. Not all tables need clustering or search optimization.
-
Regular performance reviews: Conduct weekly optimization reviews to identify performance trends and adjust strategies accordingly.
See Also
- PySpark Iceberg - Iceberg table optimization
- Delta Lake on Databricks - Delta Lake optimization
- Data Warehouse Concepts - Data warehouse design principles