Snowflake Advanced Β· Interview Prep
Zero-Copy Cloning & Data Branching
Difficulty: Medium-Hard Β· Commonly asked at Google, Microsoft, Amazon
Interview Question
"Explain how zero-copy cloning works in Snowflake. How would you use it to create isolated dev/test environments from production data without duplicating storage? What are the implications for data modification?"
βΉοΈ
Companies Asking This: Google (Senior Cloud Data Engineer), Microsoft (Principal Data Architect), Amazon (L5 Data Engineer), Meta (Data Platform Engineer)
Zero-Copy Cloning Fundamentals
Zero-copy cloning creates a new table (or schema, database) that initially shares the same underlying micro-partitions as the source. No data is physically copied β only metadata is created.
How It Works Internally
Basic Cloning Syntax
-- Clone a table (zero-copy initially)
CREATE TABLE orders_dev CLONE orders;
-- Clone with explicit schema
CREATE TABLE dev_schema.orders_dev CLONE production.orders;
-- Clone entire schema
CREATE SCHEMA dev_clone CLONE production;
-- Clone entire database
CREATE DATABASE dev_clone CLONE production_db;
-- Clone with data retention
CREATE TABLE orders_dev CLONE orders
DATA_RETENTION_TIME_IN_DAYS = 30;
Copy-on-Write Behavior
When either the source or clone is modified, Snowflake uses copy-on-write semantics. Only the affected micro-partitions are physically duplicated.
-- Initial state: source and clone share micro-partitions
SELECT
table_name,
bytes / (1024*1024*1024) AS storage_gb,
row_count,
clone_group_id
FROM information_schema.tables
WHERE table_name IN ('ORDERS', 'ORDERS_DEV');
-- After modifying the clone, only new micro-partitions are created
UPDATE orders_dev SET status = 'TEST' WHERE order_id < 1000;
-- Check storage impact
SELECT
table_name,
bytes / (1024*1024*1024) AS storage_gb,
row_count,
clone_group_id
FROM information_schema.tables
WHERE table_name IN ('ORDERS', 'ORDERS_DEV');
-- The source table storage remains unchanged
-- The clone now has its own micro-partitions for modified rows
βΉοΈ
Key Insight: Cloning is metadata-only for reads. Storage divergence only occurs when data is modified in either the source or clone. For read-heavy clone use cases (like reporting), the storage savings are massive.
Real-World Scenario: Google
Question: "How would you set up a complete dev/test environment for a data engineering team using cloning, and what guardrails would you put in place?"
Solution: Environment Cloning Strategy
-- 1. Clone the entire production database for dev
CREATE DATABASE prod_clone
CLONE company_db
DATA_RETENTION_TIME_IN_DAYS = 7;
-- 2. Create a dedicated warehouse for dev work
CREATE WAREHOUSE dev_wh
WAREHOUSE_SIZE = 'SMALL'
AUTO_SUSPEND = 60
AUTO_RESUME = TRUE;
-- 3. Set up access controls for dev environment
GRANT USAGE ON DATABASE prod_clone TO ROLE dev_role;
GRANT USAGE ON SCHEMA prod_clone.public TO ROLE dev_role;
GRANT SELECT ON ALL TABLES IN SCHEMA prod_clone.public TO ROLE dev_role;
GRANT CREATE TABLE ON SCHEMA prod_clone.public TO ROLE dev_role;
-- 4. Create a procedure to refresh dev clone periodically
CREATE OR REPLACE PROCEDURE refresh_dev_clone()
RETURNS VARCHAR
LANGUAGE SQL
AS
$$
DECLARE
refresh_start TIMESTAMP_NTZ;
BEGIN
refresh_start := CURRENT_TIMESTAMP();
-- Drop and recreate the clone
DROP TABLE IF EXISTS prod_clone.public.orders_dev;
CREATE TABLE prod_clone.public.orders_dev
CLONE company_db.public.orders;
DROP TABLE IF EXISTS prod_clone.public.customers_dev;
CREATE TABLE prod_clone.public.customers_dev
CLONE company_db.public.customers;
RETURN 'Dev clone refreshed at ' ||
TO_CHAR(refresh_start, 'YYYY-MM-DD HH24:MI:SS');
END;
$$;
-- 5. Schedule daily refresh
CREATE OR REPLACE TASK daily_dev_refresh
WAREHOUSE = admin_wh
SCHEDULE = 'USING CRON 0 6 * * * UTC'
AS
CALL refresh_dev_clone();
Branching Strategy for Feature Development
-- Create a branch for a new feature
CREATE TABLE orders_feature_v2 CLONE orders;
-- Make changes on the branch
ALTER TABLE orders_feature_v2 ADD COLUMN discount_pct NUMBER(5,2);
UPDATE orders_feature_v2
SET discount_pct = ROUND(RANDOM() * 0.3, 2)
WHERE order_date >= '2024-01-01';
-- Test the feature
SELECT
order_id,
total_amount,
discount_pct,
total_amount * (1 - discount_pct / 100) AS final_amount
FROM orders_feature_v2
WHERE discount_pct > 0;
-- If successful, merge changes to production
ALTER TABLE orders ADD COLUMN discount_pct NUMBER(5,2);
UPDATE orders o
SET discount_pct = (
SELECT discount_pct
FROM orders_feature_v2 v
WHERE v.order_id = o.order_id
);
-- Clean up the branch
DROP TABLE orders_feature_v2;
Real-World Scenario: Microsoft
Question: "How do you handle schema evolution when cloning tables? If the source table gets a new column, does the clone automatically get it?"
Schema Evolution with Clones
-- Clone a table
CREATE TABLE orders_dev CLONE orders;
-- Add a column to the source
ALTER TABLE orders ADD COLUMN priority VARCHAR(10) DEFAULT 'NORMAL';
-- Check if clone has the new column
DESCRIBE TABLE orders_dev;
-- Result: The clone does NOT automatically get the new column
-- Cloning is a point-in-time operation
-- To get the new column in the clone, you need to re-clone
DROP TABLE orders_dev;
CREATE TABLE orders_dev CLONE orders;
-- Alternative: Manually add the column to the clone
ALTER TABLE orders_dev ADD COLUMN priority VARCHAR(10) DEFAULT 'NORMAL';
-- Update clone with data from source
UPDATE orders_dev d
SET priority = (
SELECT priority
FROM orders o
WHERE o.order_id = d.order_id
);
β οΈ
Important: Cloning is a point-in-time snapshot. After cloning, changes to the source do NOT automatically propagate to the clone. This is by design β it provides isolation but requires manual sync if you need the latest schema.
Advanced Cloning Patterns
Pattern: Selective Cloning with FILTER
-- Clone only recent data using CTAS + clone
CREATE TABLE orders_recent CLONE orders
FILTER => DATEADD(day, -30, CURRENT_DATE()); -- Not supported
-- Alternative: Use CTAS for filtered clone
CREATE TABLE orders_recent AS
SELECT * FROM orders
WHERE order_date >= DATEADD(day, -30, CURRENT_DATE());
-- Clone specific partitions
CREATE TABLE orders_q1_2024 CLONE orders
FILTER => order_date BETWEEN '2024-01-01' AND '2024-03-31';
Pattern: Clone with Transformation
-- Create a transformed clone for analytics
CREATE TABLE orders_analytics AS
SELECT
o.*,
c.customer_segment,
c.customer_lifetime_value,
DATEDIFF(day, o.order_date, CURRENT_DATE()) AS days_since_order,
NTILE(10) OVER (ORDER BY o.total_amount DESC) AS amount_decile
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
WHERE o.order_date >= DATEADD(year, -1, CURRENT_DATE());
-- This is not zero-copy (data is transformed), but useful for analytics
Pattern: Cloning for Testing Data Quality Rules
-- Clone production data for testing quality rules
CREATE TABLE orders_quality_test CLONE orders;
-- Test quality rules on the clone
SELECT
'NULL_CUSTOMER_ID' AS rule_name,
COUNT(*) AS violations
FROM orders_quality_test WHERE customer_id IS NULL
UNION ALL
SELECT
'NEGATIVE_AMOUNT' AS rule_name,
COUNT(*) AS violations
FROM orders_quality_test WHERE total_amount < 0
UNION ALL
SELECT
'FUTURE_DATE' AS rule_name,
COUNT(*) AS violations
FROM orders_quality_test WHERE order_date > CURRENT_DATE()
UNION ALL
SELECT
'INVALID_STATUS' AS rule_name,
COUNT(*) AS violations
FROM orders_quality_test
WHERE status NOT IN ('PENDING', 'SHIPPED', 'DELIVERED', 'CANCELLED');
-- If violations found, fix the rules before applying to production
DROP TABLE orders_quality_test;
Cloning Performance Considerations
-- Cloning is fast because it's metadata-only
-- But monitor the operation for large tables
CREATE OR REPLACE TASK clone_monitor
WAREHOUSE = admin_wh
SCHEDULE = 'USING CRON 0 0 1 * * UTC' -- Monthly
AS
SELECT
table_name,
bytes / (1024*1024*1024) AS size_gb,
row_count,
clone_group_id,
created_on
FROM information_schema.tables
WHERE clone_group_id IS NOT NULL
ORDER BY bytes DESC;
-- Check clone relationship
SELECT
t1.table_name AS source_table,
t2.table_name AS clone_table,
t1.bytes / (1024*1024*1024) AS source_size_gb,
t2.bytes / (1024*1024*1024) AS clone_size_gb,
ROUND(
(1 - t2.bytes / NULLIF(t1.bytes, 0)) * 100, 2
) AS storage_savings_pct
FROM information_schema.tables t1
JOIN information_schema.tables t2
ON t1.clone_group_id = t2.clone_group_id
AND t1.table_name != t2.table_name
WHERE t1.clone_group_id IS NOT NULL;
Best Practices
| Use Case | Approach | Storage Impact |
|---|---|---|
| Dev/Test environments | Clone entire database | Zero initially, grows with modifications |
| Feature branches | Clone specific tables | Zero initially |
| Data quality testing | Clone + filter | Minimal |
| Analytics sandbox | Clone + transform | Depends on transformation |
| Reporting isolation | Clone for BI | Zero if read-only |
β οΈ
Anti-Patterns:
- Cloning for backups β Use Time Travel instead; cloning is not a backup strategy
- Never cleaning up clones β Old clones consume storage when modified
- Cloning large datasets for small tests β Use FILTER or WHERE to limit data
- Expecting auto-sync β Clones are point-in-time, not live replicas