CW

Snowflake Data Lakehouse Architecture

Free Lesson

Advertisement

Snowflake Data Lakehouse Architecture

The Data Lakehouse architecture represents a paradigm shift in data platform design, unifying the best aspects of data lakes and data warehouses into a single, cohesive platform.

What is a Data Lakehouse?

A Data Lakehouse combines the low-cost, flexible storage of data lakes with the performance, reliability, and ACID transactions of data warehouses.

Key Characteristics

FeatureData LakeData WarehouseData Lakehouse
Storage CostLowHighLow
Schema EnforcementNoneStrictFlexible
ACID TransactionsNoYesYes
Data FormatsProprietaryProprietaryOpen (Parquet, ORC)
ML SupportExcellentLimitedExcellent
BI SupportLimitedExcellentExcellent

Snowflake's Lakehouse Implementation

Snowflake implements the Lakehouse pattern through several key features:

External Tables

External tables allow Snowflake to query data stored externally without copying it:

-- Create external table from S3
CREATE OR REPLACE EXTERNAL TABLE my_external_table
  WITH LOCATION = @my_s3_stage
  FILE_FORMAT = (TYPE = PARQUET)
  AUTO_REFRESH = true;

-- Query external data directly
SELECT * FROM my_external_table
WHERE date >= '2024-01-01';

Iceberg Tables (Snowflake Native Iceberg)

Snowflake supports Apache Iceberg for open table formats:

-- Create Iceberg table
CREATE OR REPLACE EXTERNAL TABLE my_iceberg_table
  CATALOG = 'SNOWFLAKE'
  EXTERNAL_VOLUME = 'my_volume'
  TABLE_FORMAT = ICEBERG
  DATA_FILES_FORMAT = (TYPE = PARQUET);

-- Time travel on Iceberg tables
SELECT * FROM my_iceberg_table
  AT (OFFSET => -3600);  -- 1 hour ago

Hybrid Tables

Hybrid tables combine transactional and analytical workloads:

-- Create hybrid table for OLTP-like operations
CREATE OR REPLACE HYBRID TABLE orders (
  order_id INTEGER PRIMARY KEY,
  customer_id INTEGER,
  order_date TIMESTAMP_NTZ,
  total_amount DECIMAL(10,2),
  status VARCHAR(20)
);

-- ACID transactions on hybrid tables
BEGIN TRANSACTION;
  INSERT INTO orders VALUES (1, 100, CURRENT_TIMESTAMP(), 99.99, 'PENDING');
  UPDATE orders SET status = 'COMPLETED' WHERE order_id = 1;
COMMIT;

Data Lakehouse Patterns in Snowflake

Medallion Architecture

The Medallion (Multi-Hop) architecture organizes data into bronze, silver, and gold layers:

-- Bronze Layer: Raw data ingestion
CREATE OR REPLACE TABLE bronze.raw_orders (
  raw_data VARIANT
);

-- Silver Layer: Cleaned and validated data
CREATE OR REPLACE TABLE silver.orders AS
  SELECT
    raw_data:order_id::INTEGER AS order_id,
    raw_data:customer_id::INTEGER AS customer_id,
    raw_data:order_date::DATE AS order_date,
    raw_data:amount::DECIMAL(10,2) AS amount,
    CURRENT_TIMESTAMP() AS ingested_at
  FROM bronze.raw_orders;

-- Gold Layer: Business-ready aggregates
CREATE OR REPLACE TABLE gold.daily_sales_summary AS
  SELECT
    order_date,
    COUNT(*) AS total_orders,
    SUM(amount) AS total_revenue,
    AVG(amount) AS avg_order_value
  FROM silver.orders
  GROUP BY order_date;

Data Sharing Across Domains

-- Share data across business domains
CREATE SHARE sales_share;
GRANT USAGE ON DATABASE analytics_db TO SHARE sales_share;
GRANT USAGE ON SCHEMA gold TO SHARE sales_share;
GRANT SELECT ON gold.daily_sales_summary TO SHARE sales_share;

-- Consumers can query shared data
CREATE DATABASE sales_from_share FROM SHARE sales_share;
SELECT * FROM sales_from_share.gold.daily_sales_summary;

Performance Optimization for Lakehouse

Partitioning Strategies

-- Partition external data by date
CREATE OR REPLACE EXTERNAL TABLE daily_events
  PARTITION BY (event_date)
  WITH LOCATION = @events_stage
  FILE_FORMAT = (TYPE = PARQUET);

-- Query optimization through partition pruning
SELECT * FROM daily_events
  WHERE event_date = '2024-01-15';

Clustering for Large Tables

-- Cluster large fact tables
CREATE OR REPLACE TABLE large_fact_table CLUSTER BY (event_date, region)
  AS SELECT * FROM staging_events;

-- Manual re-clustering
ALTER TABLE large_fact_table RECLUSTER;

The Lakehouse pattern is ideal for organizations that need both the flexibility of a data lake and the reliability of a data warehouse. Snowflake's architecture makes this particularly seamless because the compute and storage layers are independent.

Cost Optimization in Lakehouse

StrategyImplementationBenefit
Auto-scalingMulti-cluster warehousesPay only for active compute
Storage tieringTime-based partitioningMove old data to cheaper storage
Query optimizationPartition pruningReduce data scanned
Materialized viewsPre-aggregate common queriesFaster response times

Key Takeaways:

  • Data Lakehouse combines lake flexibility with warehouse reliability
  • Snowflake supports Iceberg, Hybrid Tables, and External Tables
  • Medallion Architecture (Bronze/Silver/Gold) organizes data progressively
  • ACID transactions ensure data integrity in the lakehouse
  • Cost optimization through partitioning, clustering, and auto-scaling

Advertisement

Need Expert Snowflake Help?

Get personalized warehouse optimization, data modeling, or Snowflake platform consulting.

Advertisement