Snowflake Data Lakehouse Architecture
The Data Lakehouse architecture represents a paradigm shift in data platform design, unifying the best aspects of data lakes and data warehouses into a single, cohesive platform.
What is a Data Lakehouse?
A Data Lakehouse combines the low-cost, flexible storage of data lakes with the performance, reliability, and ACID transactions of data warehouses.
Key Characteristics
| Feature | Data Lake | Data Warehouse | Data Lakehouse |
|---|---|---|---|
| Storage Cost | Low | High | Low |
| Schema Enforcement | None | Strict | Flexible |
| ACID Transactions | No | Yes | Yes |
| Data Formats | Proprietary | Proprietary | Open (Parquet, ORC) |
| ML Support | Excellent | Limited | Excellent |
| BI Support | Limited | Excellent | Excellent |
Snowflake's Lakehouse Implementation
Snowflake implements the Lakehouse pattern through several key features:
External Tables
External tables allow Snowflake to query data stored externally without copying it:
-- Create external table from S3
CREATE OR REPLACE EXTERNAL TABLE my_external_table
WITH LOCATION = @my_s3_stage
FILE_FORMAT = (TYPE = PARQUET)
AUTO_REFRESH = true;
-- Query external data directly
SELECT * FROM my_external_table
WHERE date >= '2024-01-01';
Iceberg Tables (Snowflake Native Iceberg)
Snowflake supports Apache Iceberg for open table formats:
-- Create Iceberg table
CREATE OR REPLACE EXTERNAL TABLE my_iceberg_table
CATALOG = 'SNOWFLAKE'
EXTERNAL_VOLUME = 'my_volume'
TABLE_FORMAT = ICEBERG
DATA_FILES_FORMAT = (TYPE = PARQUET);
-- Time travel on Iceberg tables
SELECT * FROM my_iceberg_table
AT (OFFSET => -3600); -- 1 hour ago
Hybrid Tables
Hybrid tables combine transactional and analytical workloads:
-- Create hybrid table for OLTP-like operations
CREATE OR REPLACE HYBRID TABLE orders (
order_id INTEGER PRIMARY KEY,
customer_id INTEGER,
order_date TIMESTAMP_NTZ,
total_amount DECIMAL(10,2),
status VARCHAR(20)
);
-- ACID transactions on hybrid tables
BEGIN TRANSACTION;
INSERT INTO orders VALUES (1, 100, CURRENT_TIMESTAMP(), 99.99, 'PENDING');
UPDATE orders SET status = 'COMPLETED' WHERE order_id = 1;
COMMIT;
Data Lakehouse Patterns in Snowflake
Medallion Architecture
The Medallion (Multi-Hop) architecture organizes data into bronze, silver, and gold layers:
-- Bronze Layer: Raw data ingestion
CREATE OR REPLACE TABLE bronze.raw_orders (
raw_data VARIANT
);
-- Silver Layer: Cleaned and validated data
CREATE OR REPLACE TABLE silver.orders AS
SELECT
raw_data:order_id::INTEGER AS order_id,
raw_data:customer_id::INTEGER AS customer_id,
raw_data:order_date::DATE AS order_date,
raw_data:amount::DECIMAL(10,2) AS amount,
CURRENT_TIMESTAMP() AS ingested_at
FROM bronze.raw_orders;
-- Gold Layer: Business-ready aggregates
CREATE OR REPLACE TABLE gold.daily_sales_summary AS
SELECT
order_date,
COUNT(*) AS total_orders,
SUM(amount) AS total_revenue,
AVG(amount) AS avg_order_value
FROM silver.orders
GROUP BY order_date;
Data Sharing Across Domains
-- Share data across business domains
CREATE SHARE sales_share;
GRANT USAGE ON DATABASE analytics_db TO SHARE sales_share;
GRANT USAGE ON SCHEMA gold TO SHARE sales_share;
GRANT SELECT ON gold.daily_sales_summary TO SHARE sales_share;
-- Consumers can query shared data
CREATE DATABASE sales_from_share FROM SHARE sales_share;
SELECT * FROM sales_from_share.gold.daily_sales_summary;
Performance Optimization for Lakehouse
Partitioning Strategies
-- Partition external data by date
CREATE OR REPLACE EXTERNAL TABLE daily_events
PARTITION BY (event_date)
WITH LOCATION = @events_stage
FILE_FORMAT = (TYPE = PARQUET);
-- Query optimization through partition pruning
SELECT * FROM daily_events
WHERE event_date = '2024-01-15';
Clustering for Large Tables
-- Cluster large fact tables
CREATE OR REPLACE TABLE large_fact_table CLUSTER BY (event_date, region)
AS SELECT * FROM staging_events;
-- Manual re-clustering
ALTER TABLE large_fact_table RECLUSTER;
The Lakehouse pattern is ideal for organizations that need both the flexibility of a data lake and the reliability of a data warehouse. Snowflake's architecture makes this particularly seamless because the compute and storage layers are independent.
Cost Optimization in Lakehouse
| Strategy | Implementation | Benefit |
|---|---|---|
| Auto-scaling | Multi-cluster warehouses | Pay only for active compute |
| Storage tiering | Time-based partitioning | Move old data to cheaper storage |
| Query optimization | Partition pruning | Reduce data scanned |
| Materialized views | Pre-aggregate common queries | Faster response times |
Key Takeaways:
- Data Lakehouse combines lake flexibility with warehouse reliability
- Snowflake supports Iceberg, Hybrid Tables, and External Tables
- Medallion Architecture (Bronze/Silver/Gold) organizes data progressively
- ACID transactions ensure data integrity in the lakehouse
- Cost optimization through partitioning, clustering, and auto-scaling