Snowflake: Clustering, Time Travel, Zero-Copy Cloning
Deep dive into Snowflake architecture and features
Interview Question
"You have a 5TB Snowflake table with 1 billion rows. Queries filtering by 'customer_id' and 'order_date' are slow. Design: (1) clustering strategy, (2) time travel configuration, (3) zero-copy cloning for testing, (4) cost optimization. Include specific Snowflake SQL commands."
Difficulty: Hard | Frequently asked at Snowflake, Databricks, Netflix, Uber
Theoretical Foundation
Snowflake Architecture
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Snowflake Architecture β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β Cloud Services Layer β β
β β - Authentication β β
β β - Infrastructure management β β
β β - Query optimization β β
β β - Metadata management β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β Query Processing Layer β β
β β - Virtual warehouses (compute) β β
β β - Auto-scaling β β
β β - Query result cache β β
β β - Multi-cluster warehouses β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β Storage Layer β β
β β - Micro-partitions (columnar) β β
β β - Automatic clustering β β
β β - Compression β β
β β - Metadata β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Micro-Partitions
Snowflake stores data in micro-partitions (50-500MB compressed).
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Micro-Partitions β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β Table: orders β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β Micro-Partition 1: rows 1-1,000,000 β β
β β - customer_id range: 1-1000 β β
β β - order_date range: 2024-01-01 to 2024-01-31 β β
β β - Min/Max values stored in metadata β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β Micro-Partition 2: rows 1,000,001-2,000,000 β β
β β - customer_id range: 1001-2000 β β
β β - order_date range: 2024-02-01 to 2024-02-29 β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β ... β
β β
β Pruning: Skip partitions where query filters don't match β
β Query: WHERE customer_id = 500 AND order_date = '2024-01-15'β
β Only scans Micro-Partition 1 (metadata check) β
β β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Clustering
Clustering organizes micro-partitions by column values for better pruning.
Without Clustering:
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Micro-Partition 1: customer_id [1-1000], date [Jan] β
β Micro-Partition 2: customer_id [500-1500], date [Feb] β
β Micro-Partition 3: customer_id [1-500], date [Mar] β
β Micro-Partition 4: customer_id [1000-2000], date [Jan] β
β β
β Query: WHERE customer_id = 500 β
β Must scan ALL 4 partitions (no pruning) β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
With Clustering on customer_id:
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Micro-Partition 1: customer_id [1-500], date [Jan, Mar] β
β Micro-Partition 2: customer_id [500-1000], date [Jan, Feb] β
β Micro-Partition 3: customer_id [1000-1500], date [Feb] β
β Micro-Partition 4: customer_id [1500-2000], date [Jan] β
β β
β Query: WHERE customer_id = 500 β
β Only scans Micro-Partition 2 (effective pruning) β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Clustering metrics:
Target: Depth close to 1, Overlap close to 0.
Time Travel
Snowflake retains historical data for a configurable period.
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Time Travel β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β Current State: β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β orders: {id: 1, amount: 100}, {id: 2, amount: 200} β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
β After UPDATE: β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β Current: {id: 1, amount: 150}, {id: 2, amount: 200} β β
β β Historical (1 hour ago): {id: 1, amount: 100}, ... β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
β Query Historical: β
β SELECT * FROM orders AT (OFFSET => -3600*24); -- 24 hours agoβ
β SELECT * FROM orders AT (TIMESTAMP => '2024-01-15 10:00:00');β
β SELECT * FROM orders BEFORE (STATEMENT => 'abc123'); β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Time Travel retention:
- Standard: 1 day (default)
- Enterprise: 90 days
- Business Critical: 90 days
- Custom: Up to 90 days
Zero-Copy Cloning
Create a clone without copying data (metadata operation only).
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Zero-Copy Cloning β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β Original Table: orders (5TB) β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β Micro-partitions (shared) β β
β β Metadata: {table: orders, rows: 1B} β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
β Clone: orders_clone (0 bytes additional storage) β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β Micro-partitions (SHARED with original) β β
β β Metadata: {table: orders_clone, rows: 1B} β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
β After DML on clone: β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β Original: Unchanged β β
β β Clone: New micro-partitions for changed rows β β
β β (Copy-on-Write) β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Snowflake Warehouse Sizes
| Size | vCPUs | Memory | Cost/Hour |
|---|---|---|---|
| X-Small | 1 | 2GB | $0.00/credit |
| Small | 2 | 4GB | 1 credit |
| Medium | 4 | 8GB | 2 credits |
| Large | 8 | 16GB | 4 credits |
| X-Large | 16 | 32GB | 8 credits |
| 2X-Large | 32 | 64GB | 16 credits |
| 3X-Large | 64 | 128GB | 32 credits |
| 4X-Large | 128 | 256GB | 64 credits |
Code Implementation
Clustering Strategy
-- ============================================================
-- CLUSTERING STRATEGY
-- ============================================================
-- 1. Analyze current clustering
SELECT SYSTEM$CLUSTERING_DEPTH('orders');
SELECT SYSTEM$CLUSTERING_INFORMATION('orders');
-- 2. Create clustering key
ALTER TABLE orders CLUSTER BY (customer_id, order_date);
-- 3. Recluster table
ALTER TABLE orders RECLUSTER;
-- 4. Monitor clustering depth
SELECT
partition_id,
SYSTEM$CLUSTERING_DEPTH('orders', partition_id) as depth,
SYSTEM$CLUSTERING_INFORMATION('orders', partition_id) as info
FROM (
SELECT partition_id
FROM TABLE(INFORMATION_SCHEMA.PARTITIONS('orders'))
);
-- 5. Automatic clustering (recommended)
-- Snowflake automatically maintains clustering
-- Monitor with:
SELECT * FROM TABLE(INFORMATION_SCHEMA.TABLE_STORAGE_METRICS('orders'));
-- 6. Clustering for specific query patterns
-- For queries filtering by customer_id and date range
ALTER TABLE orders CLUSTER BY (customer_id, DATE(order_date));
-- 7. Multi-level clustering
-- For queries that filter by customer_id first, then date
ALTER TABLE orders CLUSTER BY (customer_id, order_date, status);
Time Travel Configuration
-- ============================================================
-- TIME TRAVEL CONFIGURATION
-- ============================================================
-- 1. Set time travel retention
ALTER TABLE orders SET DATA_RETENTION_TIME_IN_DAYS = 7;
-- 2. Query historical data
-- By offset (seconds ago)
SELECT * FROM orders AT (OFFSET => -3600*24); -- 24 hours ago
-- By timestamp
SELECT * FROM orders AT (TIMESTAMP => '2024-01-15 10:00:00');
-- By statement ID
SELECT * FROM orders BEFORE (STATEMENT => 'abc123');
-- 3. Create table at specific point in time
CREATE TABLE orders_backup
CLONE orders
AT (TIMESTAMP => '2024-01-15 10:00:00');
-- 4. Time travel for schema changes
-- See what table looked like before a column was added
SELECT * FROM orders AT (OFFSET => -3600)
COLUMNS(* EXCLUDE new_column);
-- 5. Restore from time travel
CREATE OR REPLACE TABLE orders
CLONE orders AT (TIMESTAMP => '2024-01-15 00:00:00');
-- 6. Monitor time travel storage
SELECT
table_name,
time_travel_bytes,
time_travel_bytes / 1024 / 1024 / 1024 AS time_travel_gb
FROM TABLE(INFORMATION_SCHEMA.TABLE_STORAGE_METRICS('orders'));
Zero-Copy Cloning
-- ============================================================
-- ZERO-COPY CLONING
-- ============================================================
-- 1. Create clone (instant, no storage cost)
CREATE TABLE orders_clone CLONE orders;
-- 2. Clone with data modification
CREATE TABLE orders_test AS
SELECT * FROM orders
WHERE order_date >= '2024-01-01';
-- 3. Clone for testing
CREATE TABLE orders_staging CLONE orders;
-- Perform tests on clone
UPDATE orders_staging SET amount = amount * 1.1 WHERE status = 'pending';
-- Verify results
SELECT * FROM orders_staging WHERE amount != orders.amount;
-- Drop clone when done (no cost)
DROP TABLE orders_staging;
-- 4. Clone for development
CREATE TABLE dev_orders CLONE production.orders;
-- 5. Clone for backup
CREATE TABLE orders_backup_20240115
CLONE orders
AT (TIMESTAMP => '2024-01-15 00:00:00');
-- 6. Monitor clone storage
SELECT
table_name,
clone_group_id,
is_clone,
table_bytes,
time_travel_bytes
FROM TABLE(INFORMATION_SCHEMA.TABLE_STORAGE_METRICS('orders'));
Performance Optimization
-- ============================================================
-- PERFORMANCE OPTIMIZATION
-- ============================================================
-- 1. Materialized views for frequent queries
CREATE MATERIALIZED VIEW mv_daily_revenue AS
SELECT
order_date,
customer_segment,
SUM(amount) AS total_revenue,
COUNT(*) AS order_count
FROM orders
JOIN customers ON orders.customer_id = customers.customer_id
GROUP BY order_date, customer_segment;
-- Refresh materialized view
ALTER MATERIALIZED VIEW mv_daily_revenue REFRESH;
-- 2. Search optimization for point queries
ALTER TABLE orders SET SEARCH_OPTIMIZATION = ON;
-- 3. Result caching
-- Automatic, but can force:
SELECT * FROM orders WHERE customer_id = 123;
-- 4. Warehouse sizing
-- Use larger warehouse for complex queries
ALTER WAREHOUSE analytics_wh SET WAREHOUSE_SIZE = 'X-LARGE';
-- 5. Auto-suspend and auto-resume
ALTER WAREHOUSE analytics_wh
SET AUTO_SUSPEND = 60
AUTO_RESUME = TRUE;
-- 6. Multi-cluster warehouse for concurrency
ALTER WAREHOUSE analytics_wh
SET MIN_CLUSTER_COUNT = 1
MAX_CLUSTER_COUNT = 5;
-- 7. Query profiling
SELECT * FROM TABLE(GET_QUERY_OPERATOR_STATS('query_id'));
Cost Optimization
-- ============================================================
-- COST OPTIMIZATION
-- ============================================================
-- 1. Monitor credit usage
SELECT
warehouse_name,
SUM(credits_used) AS total_credits,
SUM(credits_used) * 3 AS total_cost_usd
FROM SNOWFLAKE.ACCOUNT_USAGE.WAREHOUSE_METERING_HISTORY
WHERE start_date >= DATEADD(day, -30, CURRENT_DATE())
GROUP BY warehouse_name
ORDER BY total_credits DESC;
-- 2. Identify expensive queries
SELECT
query_id,
query_text,
execution_time_ms,
partitions_scanned,
bytes_scanned / 1024 / 1024 AS mb_scanned,
credits_used
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE start_date >= DATEADD(day, -7, CURRENT_DATE())
ORDER BY credits_used DESC
LIMIT 10;
-- 3. Right-size warehouses
-- If average utilization < 50%, use smaller warehouse
SELECT
warehouse_name,
AVG(used_clusters) AS avg_clusters,
MAX(cluster_number) AS max_clusters
FROM SNOWFLAKE.ACCOUNT_USAGE.WAREHOUSE_LOAD_HISTORY
WHERE start_date >= DATEADD(day, -7, CURRENT_DATE())
GROUP BY warehouse_name;
-- 4. Use serverless for ad-hoc queries
-- Serverless auto-scales and charges per second
-- 5. Compact micro-partitions
ALTER TABLE orders COMPACT;
-- 6. Drop unused tables
SELECT table_name, last_altered
FROM INFORMATION_SCHEMA.TABLES
WHERE last_altered < DATEADD(day, -90, CURRENT_DATE());
Data Sharing
-- ============================================================
-- DATA SHARING
-- ============================================================
-- 1. Create share
CREATE SHARE analytics_share;
-- 2. Grant usage on database
GRANT USAGE ON DATABASE analytics_db TO SHARE analytics_share;
-- 3. Grant usage on schema
GRANT USAGE ON SCHEMA analytics_db.public TO SHARE analytics_share;
-- 4. Grant SELECT on table
GRANT SELECT ON TABLE analytics_db.public.orders TO SHARE analytics_share;
-- 5. Add consumer account
ALTER SHARE analytics_share ADD ACCOUNTS = consumer_account_id;
-- 6. Consumer side: Create database from share
CREATE DATABASE analytics_shared FROM SHARE provider_account.analytics_share;
π‘
Production Tip: Snowflake's automatic clustering is usually sufficient. Only use manual clustering for very specific query patterns. Monitor clustering depth regularly and recluster if depth > 10.
Common Follow-Up Questions
Q1: How do you handle data skew in Snowflake?
-- Monitor skew
SELECT
partition_id,
row_count,
bytes / 1024 / 1024 AS size_mb
FROM TABLE(INFORMATION_SCHEMA.PARTITIONS('orders'))
ORDER BY row_count DESC;
-- Recluster to fix skew
ALTER TABLE orders RECLUSTER;
Q2: How do you migrate to Snowflake?
-- 1. Use Snowpipe for initial load
CREATE PIPE orders_pipe
AUTO_INGEST = TRUE
AS
COPY INTO orders
FROM @s3_stage/orders/
FILE_FORMAT = (TYPE = PARQUET);
-- 2. Use CTAS for transformation
CREATE TABLE orders_snowflake AS
SELECT * FROM orders_external;
-- 3. Validate migration
SELECT COUNT(*) FROM orders_external;
SELECT COUNT(*) FROM orders_snowflake;
Q3: How do you handle schema evolution?
-- Snowflake handles schema evolution automatically
-- New columns are added with ALTER TABLE
ALTER TABLE orders ADD COLUMN discount DECIMAL(10,2);
-- Or use COPY with MATCH_BY_COLUMN_NAME
COPY INTO orders
FROM @s3_stage/orders/
FILE_FORMAT = (TYPE = PARQUET)
MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;
Q4: How do you optimize for concurrent users?
-- Use multi-cluster warehouse
ALTER WAREHOUSE analytics_wh
SET MIN_CLUSTER_COUNT = 2
MAX_CLUSTER_COUNT = 10;
-- Use result cache
-- Queries with same logic return cached results
-- Use materialized views
-- Pre-compute frequent aggregations
β οΈ
Critical Consideration: Snowflake costs can spiral with unlimited auto-scaling. Always set MAX_CLUSTER_COUNT and MONITOR credit usage. Use AUTO_SUSPEND to stop idle warehouses.
Company-Specific Tips
Snowflake Interview Tips
- Discuss micro-partitions and pruning
- Explain clustering vs partitioning
- Mention time travel use cases
- Talk about zero-copy cloning benefits
Databricks Interview Tips
- Compare Snowflake vs Databricks
- Discuss Delta Lake vs Snowflake time travel
- Mention Photon vs Snowflake performance
- Talk about lakehouse vs warehouse
Netflix Interview Tips
- Discuss cost optimization with Snowflake
- Explain multi-tenant warehouse strategies
- Mention data sharing for partnerships
- Talk about real-time analytics with Snowflake
βΉοΈ
Final Takeaway: Snowflake excels at separating storage and compute, enabling independent scaling. Use clustering for query performance, time travel for data recovery, and zero-copy cloning for testing. Always monitor costs and right-size warehouses.