Snowflake & Databricks Interview

Snowflake: Clustering, Time Travel, Zero-Copy Cloning

Deep dive into Snowflake architecture and features

Interview Question

"You have a 5TB Snowflake table with 1 billion rows. Queries filtering by 'customer_id' and 'order_date' are slow. Design: (1) clustering strategy, (2) time travel configuration, (3) zero-copy cloning for testing, (4) cost optimization. Include specific Snowflake SQL commands."

Difficulty: Hard | Frequently asked at Snowflake, Databricks, Netflix, Uber

Theoretical Foundation

Snowflake Architecture

Architecture Diagram

┌─────────────────────────────────────────────────────────────┐
│                Snowflake Architecture                       │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌─────────────────────────────────────────────────────┐   │
│  │              Cloud Services Layer                    │   │
│  │  - Authentication                                    │   │
│  │  - Infrastructure management                         │   │
│  │  - Query optimization                                │   │
│  │  - Metadata management                               │   │
│  └─────────────────────────────────────────────────────┘   │
│                                                             │
│  ┌─────────────────────────────────────────────────────┐   │
│  │              Query Processing Layer                  │   │
│  │  - Virtual warehouses (compute)                      │   │
│  │  - Auto-scaling                                      │   │
│  │  - Query result cache                                │   │
│  │  - Multi-cluster warehouses                          │   │
│  └─────────────────────────────────────────────────────┘   │
│                                                             │
│  ┌─────────────────────────────────────────────────────┐   │
│  │              Storage Layer                           │   │
│  │  - Micro-partitions (columnar)                       │   │
│  │  - Automatic clustering                              │   │
│  │  - Compression                                        │   │
│  │  - Metadata                                          │   │
│  └─────────────────────────────────────────────────────┘   │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Micro-Partitions

Snowflake stores data in micro-partitions (50-500MB compressed).

Architecture Diagram

┌─────────────────────────────────────────────────────────────┐
│                    Micro-Partitions                         │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Table: orders                                              │
│  ┌─────────────────────────────────────────────────────┐   │
│  │ Micro-Partition 1: rows 1-1,000,000                 │   │
│  │ - customer_id range: 1-1000                         │   │
│  │ - order_date range: 2024-01-01 to 2024-01-31        │   │
│  │ - Min/Max values stored in metadata                 │   │
│  └─────────────────────────────────────────────────────┘   │
│  ┌─────────────────────────────────────────────────────┐   │
│  │ Micro-Partition 2: rows 1,000,001-2,000,000        │   │
│  │ - customer_id range: 1001-2000                      │   │
│  │ - order_date range: 2024-02-01 to 2024-02-29        │   │
│  └─────────────────────────────────────────────────────┘   │
│  ...                                                        │
│                                                             │
│  Pruning: Skip partitions where query filters don't match  │
│  Query: WHERE customer_id = 500 AND order_date = '2024-01-15'│
│  Only scans Micro-Partition 1 (metadata check)            │
│                                                             │
└─────────────────────────────────────────────────────

Clustering

Clustering organizes micro-partitions by column values for better pruning.

Architecture Diagram

Without Clustering:
┌─────────────────────────────────────────────────────────────┐
│ Micro-Partition 1: customer_id [1-1000], date [Jan]        │
│ Micro-Partition 2: customer_id [500-1500], date [Feb]      │
│ Micro-Partition 3: customer_id [1-500], date [Mar]         │
│ Micro-Partition 4: customer_id [1000-2000], date [Jan]     │
│                                                             │
│ Query: WHERE customer_id = 500                              │
│ Must scan ALL 4 partitions (no pruning)                    │
└─────────────────────────────────────────────────────────────┘

With Clustering on customer_id:
┌─────────────────────────────────────────────────────────────┐
│ Micro-Partition 1: customer_id [1-500], date [Jan, Mar]    │
│ Micro-Partition 2: customer_id [500-1000], date [Jan, Feb] │
│ Micro-Partition 3: customer_id [1000-1500], date [Feb]     │
│ Micro-Partition 4: customer_id [1500-2000], date [Jan]     │
│                                                             │
│ Query: WHERE customer_id = 500                              │
│ Only scans Micro-Partition 2 (effective pruning)           │
└─────────────────────────────────────────────────────────────┘

Clustering metrics:

\text{Depth} = \frac{\text{Number of micro-partitions}}{\text{Number of distinct clustering key values}}

\text{Overlap} = \frac{\text{Partitions with overlapping ranges}}{\text{Total partitions}}

Target: Depth close to 1, Overlap close to 0.

Time Travel

Snowflake retains historical data for a configurable period.

Architecture Diagram

┌─────────────────────────────────────────────────────────────┐
│                    Time Travel                              │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Current State:                                             │
│  ┌─────────────────────────────────────────────────────┐   │
│  │ orders: {id: 1, amount: 100}, {id: 2, amount: 200}  │   │
│  └─────────────────────────────────────────────────────┘   │
│                                                             │
│  After UPDATE:                                              │
│  ┌─────────────────────────────────────────────────────┐   │
│  │ Current: {id: 1, amount: 150}, {id: 2, amount: 200} │   │
│  │ Historical (1 hour ago): {id: 1, amount: 100}, ...  │   │
│  └─────────────────────────────────────────────────────┘   │
│                                                             │
│  Query Historical:                                          │
│  SELECT * FROM orders AT (OFFSET => -3600*24);  -- 24 hours ago│
│  SELECT * FROM orders AT (TIMESTAMP => '2024-01-15 10:00:00');│
│  SELECT * FROM orders BEFORE (STATEMENT => 'abc123');       │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Time Travel retention:

Standard: 1 day (default)
Enterprise: 90 days
Business Critical: 90 days
Custom: Up to 90 days

Zero-Copy Cloning

Create a clone without copying data (metadata operation only).

Architecture Diagram

┌─────────────────────────────────────────────────────────────┐
│              Zero-Copy Cloning                              │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Original Table: orders (5TB)                               │
│  ┌─────────────────────────────────────────────────────┐   │
│  │ Micro-partitions (shared)                            │   │
│  │ Metadata: {table: orders, rows: 1B}                 │   │
│  └─────────────────────────────────────────────────────┘   │
│                                                             │
│  Clone: orders_clone (0 bytes additional storage)          │
│  ┌─────────────────────────────────────────────────────┐   │
│  │ Micro-partitions (SHARED with original)              │   │
│  │ Metadata: {table: orders_clone, rows: 1B}           │   │
│  └─────────────────────────────────────────────────────┘   │
│                                                             │
│  After DML on clone:                                        │
│  ┌─────────────────────────────────────────────────────┐   │
│  │ Original: Unchanged                                 │   │
│  │ Clone: New micro-partitions for changed rows        │   │
│  │ (Copy-on-Write)                                     │   │
│  └─────────────────────────────────────────────────────┘   │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Snowflake Warehouse Sizes

Size	vCPUs	Memory	Cost/Hour
X-Small	1	2GB	$0.00/credit
Small	2	4GB	1 credit
Medium	4	8GB	2 credits
Large	8	16GB	4 credits
X-Large	16	32GB	8 credits
2X-Large	32	64GB	16 credits
3X-Large	64	128GB	32 credits
4X-Large	128	256GB	64 credits

Code Implementation

Clustering Strategy

-- ============================================================
-- CLUSTERING STRATEGY
-- ============================================================

-- 1. Analyze current clustering
SELECT SYSTEM$CLUSTERING_DEPTH('orders');
SELECT SYSTEM$CLUSTERING_INFORMATION('orders');

-- 2. Create clustering key
ALTER TABLE orders CLUSTER BY (customer_id, order_date);

-- 3. Recluster table
ALTER TABLE orders RECLUSTER;

-- 4. Monitor clustering depth
SELECT 
    partition_id,
    SYSTEM$CLUSTERING_DEPTH('orders', partition_id) as depth,
    SYSTEM$CLUSTERING_INFORMATION('orders', partition_id) as info
FROM (
    SELECT partition_id 
    FROM TABLE(INFORMATION_SCHEMA.PARTITIONS('orders'))
);

-- 5. Automatic clustering (recommended)
-- Snowflake automatically maintains clustering
-- Monitor with:
SELECT * FROM TABLE(INFORMATION_SCHEMA.TABLE_STORAGE_METRICS('orders'));

-- 6. Clustering for specific query patterns
-- For queries filtering by customer_id and date range
ALTER TABLE orders CLUSTER BY (customer_id, DATE(order_date));

-- 7. Multi-level clustering
-- For queries that filter by customer_id first, then date
ALTER TABLE orders CLUSTER BY (customer_id, order_date, status);

Time Travel Configuration

-- ============================================================
-- TIME TRAVEL CONFIGURATION
-- ============================================================

-- 1. Set time travel retention
ALTER TABLE orders SET DATA_RETENTION_TIME_IN_DAYS = 7;

-- 2. Query historical data
-- By offset (seconds ago)
SELECT * FROM orders AT (OFFSET => -3600*24);  -- 24 hours ago

-- By timestamp
SELECT * FROM orders AT (TIMESTAMP => '2024-01-15 10:00:00');

-- By statement ID
SELECT * FROM orders BEFORE (STATEMENT => 'abc123');

-- 3. Create table at specific point in time
CREATE TABLE orders_backup 
    CLONE orders 
    AT (TIMESTAMP => '2024-01-15 10:00:00');

-- 4. Time travel for schema changes
-- See what table looked like before a column was added
SELECT * FROM orders AT (OFFSET => -3600) 
    COLUMNS(* EXCLUDE new_column);

-- 5. Restore from time travel
CREATE OR REPLACE TABLE orders 
    CLONE orders AT (TIMESTAMP => '2024-01-15 00:00:00');

-- 6. Monitor time travel storage
SELECT 
    table_name,
    time_travel_bytes,
    time_travel_bytes / 1024 / 1024 / 1024 AS time_travel_gb
FROM TABLE(INFORMATION_SCHEMA.TABLE_STORAGE_METRICS('orders'));

Zero-Copy Cloning

-- ============================================================
-- ZERO-COPY CLONING
-- ============================================================

-- 1. Create clone (instant, no storage cost)
CREATE TABLE orders_clone CLONE orders;

-- 2. Clone with data modification
CREATE TABLE orders_test AS 
SELECT * FROM orders 
WHERE order_date >= '2024-01-01';

-- 3. Clone for testing
CREATE TABLE orders_staging CLONE orders;

-- Perform tests on clone
UPDATE orders_staging SET amount = amount * 1.1 WHERE status = 'pending';

-- Verify results
SELECT * FROM orders_staging WHERE amount != orders.amount;

-- Drop clone when done (no cost)
DROP TABLE orders_staging;

-- 4. Clone for development
CREATE TABLE dev_orders CLONE production.orders;

-- 5. Clone for backup
CREATE TABLE orders_backup_20240115 
    CLONE orders 
    AT (TIMESTAMP => '2024-01-15 00:00:00');

-- 6. Monitor clone storage
SELECT 
    table_name,
    clone_group_id,
    is_clone,
    table_bytes,
    time_travel_bytes
FROM TABLE(INFORMATION_SCHEMA.TABLE_STORAGE_METRICS('orders'));

Performance Optimization

-- ============================================================
-- PERFORMANCE OPTIMIZATION
-- ============================================================

-- 1. Materialized views for frequent queries
CREATE MATERIALIZED VIEW mv_daily_revenue AS
SELECT 
    order_date,
    customer_segment,
    SUM(amount) AS total_revenue,
    COUNT(*) AS order_count
FROM orders
JOIN customers ON orders.customer_id = customers.customer_id
GROUP BY order_date, customer_segment;

-- Refresh materialized view
ALTER MATERIALIZED VIEW mv_daily_revenue REFRESH;

-- 2. Search optimization for point queries
ALTER TABLE orders SET SEARCH_OPTIMIZATION = ON;

-- 3. Result caching
-- Automatic, but can force:
SELECT * FROM orders WHERE customer_id = 123;

-- 4. Warehouse sizing
-- Use larger warehouse for complex queries
ALTER WAREHOUSE analytics_wh SET WAREHOUSE_SIZE = 'X-LARGE';

-- 5. Auto-suspend and auto-resume
ALTER WAREHOUSE analytics_wh 
    SET AUTO_SUSPEND = 60 
    AUTO_RESUME = TRUE;

-- 6. Multi-cluster warehouse for concurrency
ALTER WAREHOUSE analytics_wh 
    SET MIN_CLUSTER_COUNT = 1 
    MAX_CLUSTER_COUNT = 5;

-- 7. Query profiling
SELECT * FROM TABLE(GET_QUERY_OPERATOR_STATS('query_id'));

Cost Optimization

-- ============================================================
-- COST OPTIMIZATION
-- ============================================================

-- 1. Monitor credit usage
SELECT 
    warehouse_name,
    SUM(credits_used) AS total_credits,
    SUM(credits_used) * 3 AS total_cost_usd
FROM SNOWFLAKE.ACCOUNT_USAGE.WAREHOUSE_METERING_HISTORY
WHERE start_date >= DATEADD(day, -30, CURRENT_DATE())
GROUP BY warehouse_name
ORDER BY total_credits DESC;

-- 2. Identify expensive queries
SELECT 
    query_id,
    query_text,
    execution_time_ms,
    partitions_scanned,
    bytes_scanned / 1024 / 1024 AS mb_scanned,
    credits_used
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE start_date >= DATEADD(day, -7, CURRENT_DATE())
ORDER BY credits_used DESC
LIMIT 10;

-- 3. Right-size warehouses
-- If average utilization < 50%, use smaller warehouse
SELECT 
    warehouse_name,
    AVG(used_clusters) AS avg_clusters,
    MAX(cluster_number) AS max_clusters
FROM SNOWFLAKE.ACCOUNT_USAGE.WAREHOUSE_LOAD_HISTORY
WHERE start_date >= DATEADD(day, -7, CURRENT_DATE())
GROUP BY warehouse_name;

-- 4. Use serverless for ad-hoc queries
-- Serverless auto-scales and charges per second

-- 5. Compact micro-partitions
ALTER TABLE orders COMPACT;

-- 6. Drop unused tables
SELECT table_name, last_altered 
FROM INFORMATION_SCHEMA.TABLES 
WHERE last_altered < DATEADD(day, -90, CURRENT_DATE());

Data Sharing

-- ============================================================
-- DATA SHARING
-- ============================================================

-- 1. Create share
CREATE SHARE analytics_share;

-- 2. Grant usage on database
GRANT USAGE ON DATABASE analytics_db TO SHARE analytics_share;

-- 3. Grant usage on schema
GRANT USAGE ON SCHEMA analytics_db.public TO SHARE analytics_share;

-- 4. Grant SELECT on table
GRANT SELECT ON TABLE analytics_db.public.orders TO SHARE analytics_share;

-- 5. Add consumer account
ALTER SHARE analytics_share ADD ACCOUNTS = consumer_account_id;

-- 6. Consumer side: Create database from share
CREATE DATABASE analytics_shared FROM SHARE provider_account.analytics_share;

💡

Production Tip: Snowflake's automatic clustering is usually sufficient. Only use manual clustering for very specific query patterns. Monitor clustering depth regularly and recluster if depth > 10.

Common Follow-Up Questions

Q1: How do you handle data skew in Snowflake?

-- Monitor skew
SELECT 
    partition_id,
    row_count,
    bytes / 1024 / 1024 AS size_mb
FROM TABLE(INFORMATION_SCHEMA.PARTITIONS('orders'))
ORDER BY row_count DESC;

-- Recluster to fix skew
ALTER TABLE orders RECLUSTER;

Q2: How do you migrate to Snowflake?

-- 1. Use Snowpipe for initial load
CREATE PIPE orders_pipe
    AUTO_INGEST = TRUE
    AS
    COPY INTO orders
    FROM @s3_stage/orders/
    FILE_FORMAT = (TYPE = PARQUET);

-- 2. Use CTAS for transformation
CREATE TABLE orders_snowflake AS
SELECT * FROM orders_external;

-- 3. Validate migration
SELECT COUNT(*) FROM orders_external;
SELECT COUNT(*) FROM orders_snowflake;

Q3: How do you handle schema evolution?

-- Snowflake handles schema evolution automatically
-- New columns are added with ALTER TABLE
ALTER TABLE orders ADD COLUMN discount DECIMAL(10,2);

-- Or use COPY with MATCH_BY_COLUMN_NAME
COPY INTO orders
FROM @s3_stage/orders/
FILE_FORMAT = (TYPE = PARQUET)
MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;

Q4: How do you optimize for concurrent users?

-- Use multi-cluster warehouse
ALTER WAREHOUSE analytics_wh 
    SET MIN_CLUSTER_COUNT = 2 
    MAX_CLUSTER_COUNT = 10;

-- Use result cache
-- Queries with same logic return cached results

-- Use materialized views
-- Pre-compute frequent aggregations

⚠️

Critical Consideration: Snowflake costs can spiral with unlimited auto-scaling. Always set MAX_CLUSTER_COUNT and MONITOR credit usage. Use AUTO_SUSPEND to stop idle warehouses.

Company-Specific Tips

Snowflake Interview Tips

Discuss micro-partitions and pruning
Explain clustering vs partitioning
Mention time travel use cases
Talk about zero-copy cloning benefits

Databricks Interview Tips

Compare Snowflake vs Databricks
Discuss Delta Lake vs Snowflake time travel
Mention Photon vs Snowflake performance
Talk about lakehouse vs warehouse

Netflix Interview Tips

Discuss cost optimization with Snowflake
Explain multi-tenant warehouse strategies
Mention data sharing for partnerships
Talk about real-time analytics with Snowflake

ℹ️

Final Takeaway: Snowflake excels at separating storage and compute, enabling independent scaling. Use clustering for query performance, time travel for data recovery, and zero-copy cloning for testing. Always monitor costs and right-size warehouses.