Iceberg Tables: External Volumes, Hybrid Tables & Open Formats
Architecture Diagram 1: Iceberg Table Architecture
āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā
ā ICEBERG TABLE ARCHITECTURE ā
āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā¤
ā ā
ā SNOWFLAKE COMPUTE ā
ā āāāāāāāāāāāāāāāāāā ā
ā āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā ā
ā ā ā ā
ā ā āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā ā ā
ā ā ā Iceberg Table Definition: ā ā ā
ā ā ā ā ā ā
ā ā ā CREATE ICEBERG TABLE sales_iceberg ā ā ā
ā ā ā CATALOG = 'snowflake' ā ā ā
ā ā ā EXTERNAL_VOLUME = 'my_volume' ā ā ā
ā ā ā BASE_LOCATION = 'iceberg/sales' ā ā ā
ā ā ā CATALOG_SYNC = 'my_catalog_sync' ā ā ā
ā ā ā STORAGE_SERIALIZATION_FORMAT = 'PARQUET' ā ā ā
ā ā ā STORAGE_COMPRESSION_TYPE = 'ZSTD' ā ā ā
ā ā ā AS SELECT * FROM raw_sales; ā ā ā
ā ā āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā ā ā
ā āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā ā
ā ā ā
ā ā Metadata Layer ā
ā ā¼ ā
ā ICEBERG METADATA LAYER ā
ā āāāāāāāāāāāāāāāāāāāāāāā ā
ā āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā ā
ā ā ā ā
ā ā āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā ā ā
ā ā ā Metadata Files (Stored in External Volume): ā ā ā
ā ā ā ā ā ā
ā ā ā āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā ā ā ā
ā ā ā ā metadata/ ā ā ā ā
ā ā ā ā āāā v1.metadata.json (Table metadata) ā ā ā ā
ā ā ā ā āāā v1.parquet (Manifest list) ā ā ā ā
ā ā ā ā āāā v2.metadata.json (Updated metadata) ā ā ā ā
ā ā ā ā āāā v2.parquet (Updated manifest list) ā ā ā ā
ā ā ā ā āāā snap-<uuid>.avro (Snapshot metadata) ā ā ā ā
ā ā ā āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā ā ā ā
ā ā ā ā ā ā
ā ā ā Metadata Content: ā ā ā
ā ā ā āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā ā ā ā
ā ā ā ā ⢠Table schema (column names, types, nullability) ā ā ā ā
ā ā ā ā ⢠Partitioning information ā ā ā ā
ā ā ā ā ⢠Snapshot history (versions, timestamps) ā ā ā ā
ā ā ā ā ⢠Manifest file locations ā ā ā ā
ā ā ā ā ⢠Data file statistics (min/max, null counts) ā ā ā ā
ā ā ā ā ⢠Cleanup policies ā ā ā ā
ā ā ā āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā ā ā ā
ā ā āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā ā ā
ā āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā ā
ā ā ā
ā ā Data Layer ā
ā ā¼ ā
ā EXTERNAL VOLUME (Cloud Storage) ā
ā āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā ā
ā āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā ā
ā ā ā ā
ā ā S3 Bucket / Azure Blob / GCS Bucket ā ā
ā ā āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā ā ā
ā ā ā iceberg/sales/ ā ā ā
ā ā ā āāā data/ ā ā ā
ā ā ā ā āāā 0001-1000.parquet ā ā ā
ā ā ā ā āāā 0002-1000.parquet ā ā ā
ā ā ā ā āāā 0003-500.parquet ā ā ā
ā ā ā āāā metadata/ ā ā ā
ā ā ā āāā (as shown above) ā ā ā
ā ā āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā ā ā
ā ā ā ā
ā ā External Volume Configuration: ā ā
ā ā āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā ā ā
ā ā ā CREATE EXTERNAL VOLUME my_volume ā ā ā
ā ā ā STORAGE_PROVIDER = 'S3' ā ā ā
ā ā ā STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::123:role/snowflake' ā ā ā
ā ā ā STORAGE_BASE_LOCATIONS = ('s3://my-bucket/iceberg') ā ā ā
ā ā ā COMMENT = 'External volume for Iceberg tables'; ā ā ā
ā ā āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā ā ā
ā āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā ā
ā ā ā
ā ā Query Layer ā
ā ā¼ ā
ā QUERY ACCESS ā
ā āāāāāāāāāāāāāā ā
ā āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā ā
ā ā ā ā
ā ā āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā ā ā
ā ā ā SQL Query Execution: ā ā ā
ā ā ā ā ā ā
ā ā ā SELECT * FROM sales_iceberg WHERE region = 'US'; ā ā ā
ā ā ā ā ā ā
ā ā ā Execution Plan: ā ā ā
ā ā ā āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā ā ā ā
ā ā ā ā 1. Read metadata (from external volume) ā ā ā ā
ā ā ā ā 2. Identify relevant data files (partition pruning) ā ā ā ā
ā ā ā ā 3. Read data files (from external volume) ā ā ā ā
ā ā ā ā 4. Apply filters (region = 'US') ā ā ā ā
ā ā ā ā 5. Return results ā ā ā ā
ā ā ā āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā ā ā ā
ā ā āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā ā ā
ā āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā ā
ā ā
āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā
Architecture Diagram 2: Hybrid Tables Architecture
āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā
ā HYBRID TABLES ARCHITECTURE ā
āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā¤
ā ā
ā HYBRID TABLE DEFINITION ā
ā āāāāāāāāāāāāāāāāāāāāāāā ā
ā āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā ā
ā ā ā ā
ā ā CREATE HYBRID TABLE orders_hybrid ( ā ā
ā ā order_id INTEGER PRIMARY KEY, ā ā
ā ā customer_id INTEGER, ā ā
ā ā order_date DATE, ā ā
ā ā amount DECIMAL(10,2), ā ā
ā ā status VARCHAR(20), ā ā
ā ā created_at TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP() ā ā
ā ā ) ā ā
ā ā CLUSTER BY (order_date, status) ā ā
ā ā COMMENT = 'Hybrid table supporting both OLAP and OLTP'; ā ā
ā ā ā ā
ā āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā ā
ā ā ā
ā ā Hybrid Capabilities ā
ā ā¼ ā
ā HYBRID TABLE CAPABILITIES ā
ā āāāāāāāāāāāāāāāāāāāāāāāāāā ā
ā āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā ā
ā ā ā ā
ā ā āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā ā ā
ā ā ā OLAP Capabilities (Analytics): ā ā ā
ā ā ā ā ā ā
ā ā ā ā Complex analytical queries ā ā ā
ā ā ā ā Aggregations and joins ā ā ā
ā ā ā ā Full SQL support ā ā ā
ā ā ā ā Materialized views ā ā ā
ā ā ā ā Time Travel ā ā ā
ā ā ā ā Clustering and search optimization ā ā ā
ā ā āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā ā ā
ā ā ā ā
ā ā āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā ā ā
ā ā ā OLTP Capabilities (Transactional): ā ā ā
ā ā ā ā ā ā
ā ā ā ā Primary key enforcement ā ā ā
ā ā ā ā Unique constraints ā ā ā
ā ā ā ā Referential integrity ā ā ā
ā ā ā ā Single-row inserts/updates/deletes ā ā ā
ā ā ā ā Transaction support (ACID) ā ā ā
ā ā ā ā Upsert operations ā ā ā
ā ā āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā ā ā
ā āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā ā
ā ā ā
ā ā Operations ā
ā ā¼ ā
ā HYBRID TABLE OPERATIONS ā
ā āāāāāāāāāāāāāāāāāāāāāāā ā
ā āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā ā
ā ā ā ā
ā ā āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā ā ā
ā ā ā INSERT (Single Row): ā ā ā
ā ā ā āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā ā ā ā
ā ā ā ā INSERT INTO orders_hybrid (order_id, customer_id, ā ā ā ā
ā ā ā ā order_date, amount, status) ā ā ā ā
ā ā ā ā VALUES (1, 101, '2024-01-15', 99.99, 'PENDING'); ā ā ā ā
ā ā ā ā ā ā ā ā
ā ā ā ā -- Primary key enforced ā ā ā ā
ā ā ā ā -- ACID transaction ā ā ā ā
ā ā ā ā -- Immediate visibility ā ā ā ā
ā ā ā āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā ā ā ā
ā ā ā ā ā ā
ā ā ā UPSERT (Merge): ā ā ā
ā ā ā āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā ā ā ā
ā ā ā ā MERGE INTO orders_hybrid t ā ā ā ā
ā ā ā ā USING staging_orders s ā ā ā ā
ā ā ā ā ON t.order_id = s.order_id ā ā ā ā
ā ā ā ā WHEN MATCHED THEN UPDATE SET ā ā ā ā
ā ā ā ā amount = s.amount, status = s.status ā ā ā ā
ā ā ā ā WHEN NOT MATCHED THEN INSERT ā ā ā ā
ā ā ā ā (order_id, customer_id, order_date, amount, status) ā ā ā ā
ā ā ā ā VALUES (s.order_id, s.customer_id, s.order_date, ā ā ā ā
ā ā ā ā s.amount, s.status); ā ā ā ā
ā ā ā āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā ā ā ā
ā ā ā ā ā ā
ā ā ā DELETE (Single Row): ā ā ā
ā ā ā āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā ā ā ā
ā ā ā ā DELETE FROM orders_hybrid WHERE order_id = 1; ā ā ā ā
ā ā ā ā -- Primary key lookup ā ā ā ā
ā ā ā ā -- Transaction atomicity ā ā ā ā
ā ā ā ā -- Time Travel preserved ā ā ā ā
ā ā ā āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā ā ā ā
ā ā āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā ā ā
ā āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā ā
ā ā ā
ā ā Performance Comparison ā
ā ā¼ ā
ā PERFORMANCE CHARACTERISTICS ā
ā āāāāāāāāāāāāāāāāāāāāāāāāāāā ā
ā āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā ā
ā ā ā ā
ā ā āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā ā ā
ā ā ā Operation Type ā Regular Table ā Hybrid Table ā ā ā
ā ā ā āāāāāāāāāāāāāāāāāāāāā¼āāāāāāāāāāāāāāāā¼āāāāāāāāāāāāāāāāāāāāā ā ā ā
ā ā ā Single-row INSERT ā Medium ā Fast (PK lookup) ā ā ā
ā ā ā Bulk INSERT ā Fast ā Fast ā ā ā
ā ā ā Single-row UPDATE ā Slow (full) ā Fast (PK lookup) ā ā ā
ā ā ā Bulk UPDATE ā Fast ā Fast ā ā ā
ā ā ā Single-row DELETE ā Slow (full) ā Fast (PK lookup) ā ā ā
ā ā ā Analytics query ā Fast ā Fast ā ā ā
ā ā ā JOIN operations ā Fast ā Fast ā ā ā
ā ā āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā ā ā
ā āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā ā
ā ā
āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā
Architecture Diagram 3: Data Lakehouse Architecture
āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā
ā DATA LAKEHOUSE ARCHITECTURE ā
āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā¤
ā ā
ā āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā ā
ā ā ā ā
ā ā UNIFIED DATA LAKEHOUSE ā ā
ā ā ā ā
ā ā āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā ā ā
ā ā ā Storage Layer (Open Formats): ā ā ā
ā ā ā ā ā ā
ā ā ā āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā ā ā ā
ā ā ā ā Cloud Storage (S3/Azure/GCS) ā ā ā ā
ā ā ā ā āāā raw/ (Original data) ā ā ā ā
ā ā ā ā ā āāā parquet/ ā ā ā ā
ā ā ā ā ā āāā csv/ ā ā ā ā
ā ā ā ā ā āāā json/ ā ā ā ā
ā ā ā ā āāā processed/ (Cleaned data) ā ā ā ā
ā ā ā ā ā āāā parquet/ ā ā ā ā
ā ā ā ā ā āāā orc/ ā ā ā ā
ā ā ā ā āāā analytics/ (Business-ready) ā ā ā ā
ā ā ā ā āāā parquet/ ā ā ā ā
ā ā ā ā āāā iceberg/ ā ā ā ā
ā ā ā āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā ā ā ā
ā ā āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā ā ā
ā ā ā ā ā
ā ā ā Table Formats ā ā
ā ā ā¼ ā ā
ā ā āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā ā ā
ā ā ā Open Table Formats: ā ā ā
ā ā ā ā ā ā
ā ā ā āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā ā ā ā
ā ā ā ā Iceberg: ā ā ā ā
ā ā ā ā ⢠ACID transactions ā ā ā ā
ā ā ā ā ⢠Schema evolution ā ā ā ā
ā ā ā ā ⢠Time travel ā ā ā ā
ā ā ā ā ⢠Partition evolution ā ā ā ā
ā ā ā ā ⢠Hidden partitioning ā ā ā ā
ā ā ā āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā ā ā ā
ā ā ā ā ā ā
ā ā ā āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā ā ā ā
ā ā ā ā Delta Lake: ā ā ā ā
ā ā ā ā ⢠ACID transactions ā ā ā ā
ā ā ā ā ⢠Schema enforcement ā ā ā ā
ā ā ā ā ⢠Time travel ā ā ā ā
ā ā ā ā ⢠Data quality constraints ā ā ā ā
ā ā ā āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā ā ā ā
ā ā ā ā ā ā
ā ā ā āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā ā ā ā
ā ā ā ā Hudi: ā ā ā ā
ā ā ā ā ⢠Incremental processing ā ā ā ā
ā ā ā ā ⢠Record-level updates ā ā ā ā
ā ā ā ā ⢠Near real-time ingestion ā ā ā ā
ā ā ā āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā ā ā ā
ā ā āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā ā ā
ā ā ā ā ā
ā ā ā Compute Layer ā ā
ā ā ā¼ ā ā
ā ā āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā ā ā
ā ā ā Compute Engines: ā ā ā
ā ā ā ā ā ā
ā ā ā āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā ā ā ā
ā ā ā ā Snowflake: ā ā ā ā
ā ā ā ā ⢠Native Iceberg support ā ā ā ā
ā ā ā ā ⢠Managed compute ā ā ā ā
ā ā ā ā ⢠SQL interface ā ā ā ā
ā ā ā ā ⢠Performance optimization ā ā ā ā
ā ā ā āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā ā ā ā
ā ā ā ā ā ā
ā ā ā āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā ā ā ā
ā ā ā ā Spark: ā ā ā ā
ā ā ā ā ⢠Batch processing ā ā ā ā
ā ā ā ā ⢠ETL pipelines ā ā ā ā
ā ā ā ā ⢠Data engineering ā ā ā ā
ā ā ā āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā ā ā ā
ā ā ā ā ā ā
ā ā ā āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā ā ā ā
ā ā ā ā Trino/Presto: ā ā ā ā
ā ā ā ā ⢠Interactive queries ā ā ā ā
ā ā ā ā ⢠Federated access ā ā ā ā
ā ā ā ā ⢠Multi-engine compatibility ā ā ā ā
ā ā ā āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā ā ā ā
ā ā āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā ā ā
ā ā ā ā ā
ā ā ā Governance Layer ā ā
ā ā ā¼ ā ā
ā ā āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā ā ā
ā ā ā Data Governance: ā ā ā
ā ā ā ā ā ā
ā ā ā āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā ā ā ā
ā ā ā ā ⢠Schema registry (Enforce data contracts) ā ā ā ā
ā ā ā ā ⢠Data lineage (Track data provenance) ā ā ā ā
ā ā ā ā ⢠Access control (RBAC on shared tables) ā ā ā ā
ā ā ā ā ⢠Data quality (Validate constraints) ā ā ā ā
ā ā ā ā ⢠Cost management (Track compute usage) ā ā ā ā
ā ā ā āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā ā ā ā
ā ā āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā ā ā
ā āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā ā
ā ā
āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā
Apache Iceberg Tables are open table format standards that provide ACID transactions, schema evolution, time travel, and hidden partitioning for data lakes. Snowflake supports Iceberg as an external table format, enabling open data lake architectures while leveraging Snowflake's query engine.
The Iceberg metadata layer is a catalog abstraction that tracks table state via metadata files and manifest files. It enables atomic operations (commit, rollback) and snapshot isolation without requiring Spark or Hive ā Snowflake reads Iceberg metadata directly.
Use Iceberg for: multi-engine data lakes, vendor independence, compliance-driven data retention, and gradual migration from legacy systems. Use Snowflake native for: performance-critical analytics, managed infrastructure, and simplified operations.
- Open format: Apache Iceberg tables work across Spark, Trino, DuckDB, Snowflake
- ACID transactions: Atomic commits with snapshot isolation on data lakes
- Schema evolution: Add/rename columns without table recreation
- Time travel: Query historical snapshots for auditing and reproducibility
- Snowflake integration: Native Iceberg catalog support (2024+)
Detailed Explanation
Iceberg Tables Fundamentals
Snowflake Iceberg tables bring the open Apache Iceberg table format to Snowflake's managed infrastructure. Iceberg is an open table format designed for large analytic datasets, providing ACID transactions, schema evolution, time travel, and hidden partitioning. By supporting Iceberg, Snowflake enables interoperability with other compute engines (Spark, Trino, Flink) while maintaining Snowflake's performance and management benefits.
Iceberg tables store data in open formats (Parquet, ORC) in external cloud storage, with metadata files that track table structure, snapshots, and data file locations. This separation of metadata and data enables efficient query planning, concurrent access, and schema evolution without rewriting data files. Snowflake manages the metadata layer while data remains in your cloud storage, giving you full control over data placement and lifecycle.
External Volumes and Storage
External volumes define the cloud storage locations where Iceberg table data is stored. An external volume maps to a specific cloud storage bucket and path, with IAM credentials for secure access. Multiple Iceberg tables can share an external volume, enabling centralized storage management and cost optimization.
The external volume configuration includes storage provider (S3, Azure Blob, GCS), IAM role credentials, and base storage locations. Snowflake uses these credentials to read and write data files and metadata to your cloud storage. This architecture ensures data remains in your account while Snowflake manages the compute and metadata layers.
Hybrid Tables: OLAP + OLTP
Hybrid tables extend Snowflake's analytics capabilities with transactional features, enabling both OLAP (analytical) and OLTP (transactional) workloads on the same table. Unlike regular tables optimized for bulk operations, hybrid tables support single-row inserts, updates, and deletes with primary key enforcement and ACID transactions.
Hybrid tables maintain Snowflake's columnar storage and micro-partition architecture while adding transactional optimizations. Single-row operations use primary key lookups instead of full table scans, providing performance comparable to traditional OLTP databases. This capability enables use cases like real-time dashboards, operational analytics, and mixed workload scenarios.
Open Table Formats
Snowflake supports multiple open table formats beyond Iceberg, including Delta Lake and Hudi. Each format provides different strengths: Iceberg excels at schema evolution and time travel, Delta Lake offers strong schema enforcement, and Hudi specializes in incremental processing. Choosing the right format depends on your workload requirements and ecosystem compatibility.
The open format approach enables multi-engine access to the same data. You can query Iceberg tables using Snowflake for analytics, Spark for ETL, and Trino for ad-hoc queries. This flexibility prevents vendor lock-in and enables best-of-breed compute selection for different workloads.
Data Lakehouse Architecture
The data lakehouse pattern combines data lake flexibility with data warehouse reliability. By using open table formats like Iceberg, you can store data in cost-effective cloud storage while providing ACID transactions, schema enforcement, and time travel. Snowflake serves as the primary compute engine for analytics, while other engines handle specialized workloads.
The lakehouse architecture typically organizes data into layers: raw (original data), processed (cleaned and transformed), and analytics (business-ready). Each layer can use appropriate table formats and storage strategies. Snowflake's Iceberg support enables seamless querying across all layers while maintaining data governance and access control.
Key Concepts Table
| Feature | Iceberg Tables | Hybrid Tables | Regular Tables |
|---|---|---|---|
| Storage Location | External volume | Snowflake-managed | Snowflake-managed |
| ACID Transactions | Yes | Yes | Yes |
| Schema Evolution | Yes | Yes | Limited |
| Time Travel | Yes | Yes | Yes |
| Primary Key | No | Yes | No |
| Single-row DML | No | Yes | Limited |
| External Access | Yes | No | No |
| Open Format | ACID | Schema Evolution | Time Travel | Best For |
|---|---|---|---|---|
| Iceberg | Yes | Yes | Yes | Multi-engine, schema evolution |
| Delta Lake | Yes | Yes | Yes | Spark ecosystem, schema enforcement |
| Hudi | Yes | Yes | Yes | Incremental processing, real-time |
| Metric | Iceberg | Hybrid | Regular |
|---|---|---|---|
| Single-row INSERT | Slow | Fast | Medium |
| Bulk INSERT | Fast | Fast | Fast |
| Single-row UPDATE | Slow | Fast | Slow |
| Analytics Query | Fast | Fast | Fast |
| Storage Cost | External (lower) | Internal | Internal |
Code Examples
-- Example 1: Create external volume
CREATE EXTERNAL VOLUME my_iceberg_volume
STORAGE_PROVIDER = 'S3'
STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/SnowflakeAccessRole'
STORAGE_BASE_LOCATIONS = ('s3://my-bucket/iceberg-data')
COMMENT = 'External volume for Iceberg tables';
-- Example 2: Create Iceberg table
CREATE ICEBERG TABLE sales_iceberg (
order_id INTEGER,
customer_id INTEGER,
order_date DATE,
amount DECIMAL(10,2),
region VARCHAR(10)
)
CATALOG = 'snowflake'
EXTERNAL_VOLUME = 'my_iceberg_volume'
BASE_LOCATION = 'sales'
STORAGE_SERIALIZATION_FORMAT = 'PARQUET'
STORAGE_COMPRESSION_TYPE = 'ZSTD'
COMMENT = 'Sales data in Iceberg format';
-- Example 3: Insert data into Iceberg table
INSERT INTO sales_iceberg
SELECT * FROM raw_sales_data;
-- Example 4: Query Iceberg table
SELECT
region,
COUNT(*) as order_count,
SUM(amount) as total_amount
FROM sales_iceberg
WHERE order_date >= '2024-01-01'
GROUP BY region;
-- Example 5: Create hybrid table
CREATE HYBRID TABLE orders_hybrid (
order_id INTEGER PRIMARY KEY,
customer_id INTEGER NOT NULL,
order_date DATE NOT NULL,
amount DECIMAL(10,2),
status VARCHAR(20) DEFAULT 'PENDING',
created_at TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP()
)
CLUSTER BY (order_date, status)
COMMENT = 'Hybrid table for order management';
-- Example 6: Single-row operations on hybrid table
-- Insert
INSERT INTO orders_hybrid (order_id, customer_id, order_date, amount, status)
VALUES (1, 101, '2024-01-15', 99.99, 'PENDING');
-- Update
UPDATE orders_hybrid
SET status = 'SHIPPED', updated_at = CURRENT_TIMESTAMP()
WHERE order_id = 1;
-- Delete
DELETE FROM orders_hybrid WHERE order_id = 1;
-- Example 7: Upsert operation
MERGE INTO orders_hybrid t
USING staging_orders s
ON t.order_id = s.order_id
WHEN MATCHED THEN
UPDATE SET
amount = s.amount,
status = s.status,
updated_at = CURRENT_TIMESTAMP()
WHEN NOT MATCHED THEN
INSERT (order_id, customer_id, order_date, amount, status)
VALUES (s.order_id, s.customer_id, s.order_date, s.amount, s.status);
-- Example 8: Query with time travel
SELECT * FROM sales_iceberg
AT (TIMESTAMP => '2024-01-15 10:00:00'::TIMESTAMP_TZ)
WHERE region = 'US';
-- Example 9: Schema evolution
ALTER TABLE sales_iceberg ADD COLUMN country VARCHAR(10);
ALTER TABLE sales_iceberg RENAME COLUMN region TO sales_region;
-- Example 10: Monitor Iceberg table metadata
SELECT *
FROM TABLE(INFORMATION_SCHEMA.ICEBERG_TABLES())
WHERE table_name = 'SALES_ICEBERG';
Performance Metrics
| Metric | Target | Warning | Critical |
|---|---|---|---|
| Iceberg Query Latency | < 30s | 30-120s | > 120s |
| Hybrid Single-row Ops | < 10ms | 10-50ms | > 50ms |
| Metadata Read Time | < 1s | 1-5s | > 5s |
| Data File Scan Rate | > 100 MB/s | 50-100 MB/s | < 50 MB/s |
Best Practices
-
Choose appropriate table format: Use Iceberg for multi-engine access and schema evolution, Hybrid for mixed OLAP/OLTP workloads.
-
Optimize external volumes: Place external volumes in the same region as your Snowflake account to minimize data transfer costs.
-
Manage Iceberg snapshots: Configure snapshot retention to balance time travel needs with storage costs.
-
Use hybrid tables wisely: Reserve hybrid tables for workloads requiring single-row transactions. Use regular tables for pure analytics.
-
Implement data partitioning: Use Iceberg's hidden partitioning to optimize query performance without exposing partition columns.
-
Monitor storage costs: Track external storage usage and implement lifecycle policies for old data files.
-
Test multi-engine access: Validate that open table formats work correctly across different compute engines.
-
Implement data quality: Use Iceberg's schema enforcement and constraints to maintain data quality.
-
Plan for scale: Design table structures to handle expected data growth without performance degradation.
-
Document data lineage: Track data provenance and transformations for compliance and debugging.
See Also
- PySpark Iceberg - Deep dive on Iceberg tables
- Delta Lake on Databricks - Delta Lake comparison
- Data Warehouse Concepts - Data warehouse design principles