Snowflake ETL Pipeline Patterns

Effective ETL pipelines in Snowflake combine data ingestion, transformation, and loading, and rely on built-in features such as Snowpipe, MERGE, Streams, Tasks, and Dynamic Tables for reliability and scale.

ETL vs ELT in Snowflake Context

Traditional ETL tools extract data from sources, transform it on external compute (Informatica, DataStage, SSIS), and then load the cleaned result into the target. ELT reverses the last two steps: raw data lands in Snowflake first, and Snowflake's compute then handles the transformations. Snowflake's architecture suits ELT because storage and compute are separate layers, so virtual warehouses can be sized for transformation workloads without touching storage. The key advantages of ELT over ETL in Snowflake include:

Reduced data movement: Raw data loads directly into Snowflake without intermediate staging servers
Leveraging Snowflake compute: Transformations use the same powerful MPP engine that handles queries
Schema-on-read flexibility: Raw data can be retransformed without re-extracting from sources
Cost efficiency: Pay only for compute used during transformation, not for maintaining separate ETL infrastructure
Auditability: Raw data is always available in the staging layer for lineage and compliance
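The load-first, transform-later flow described above can be sketched as follows. This is a minimal illustration rather than a prescribed layout: the table raw_events, the view stg_events, the stage path @raw_stage/events/, and the JSON field names are all hypothetical.

```sql
-- Hypothetical landing table: the raw payload is kept untouched in a VARIANT column
CREATE TABLE IF NOT EXISTS raw_events (
  payload   VARIANT,
  loaded_at TIMESTAMP_LTZ DEFAULT CURRENT_TIMESTAMP()
);

-- Load step: no transformation, just land the files (stage path is assumed)
COPY INTO raw_events (payload)
FROM (SELECT $1 FROM @raw_stage/events/)
FILE_FORMAT = (TYPE = 'JSON');

-- Transform step: runs on Snowflake compute and can be redefined at any time
-- without re-extracting from the source system
CREATE OR REPLACE VIEW stg_events AS
SELECT
  payload:event_id::STRING        AS event_id,
  payload:customer_id::NUMBER     AS customer_id,
  payload:event_ts::TIMESTAMP_NTZ AS event_ts,
  loaded_at
FROM raw_events;
```

Because the raw payload stays in raw_events, changing the transformation only means replacing the view.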

Snowpipe and COPY INTO for Ingestion

Snowpipe provides continuous, serverless data ingestion by automatically loading data from files in cloud storage. COPY INTO is the bulk loading command for one-time or scheduled loads.

-- Create a Snowpipe for continuous ingestion
CREATE OR REPLACE PIPE orders_pipe
  AUTO_INGEST = TRUE
  AS
  COPY INTO raw_orders
  FROM @raw_stage/orders/
  FILE_FORMAT = (TYPE = 'PARQUET')
  MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;

-- Bulk load with COPY INTO
COPY INTO raw_sales
FROM @my_s3_stage/sales/
FILE_FORMAT = (
  TYPE = 'CSV'
  FIELD_OPTIONALLY_ENCLOSED_BY = '"'
  SKIP_HEADER = 1
  NULL_IF = ('NULL', 'null', '')
)
ON_ERROR = 'SKIP_FILE';

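-- Check pipe status (returns JSON with the execution state and pending file count)
SELECT SYSTEM$PIPE_STATUS('orders_pipe');

-- Review recent load history for the target table
SELECT *
FROM TABLE(INFORMATION_SCHEMA.COPY_HISTORY(
  TABLE_NAME => 'RAW_ORDERS',
  START_TIME => DATEADD('hour', -24, CURRENT_TIMESTAMP())
));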
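A few operational commands are often useful once a pipe exists. The sketch below assumes the orders_pipe defined above; whether a refresh or pause is appropriate depends on the pipeline.

```sql
-- List the pipe; for AUTO_INGEST pipes the notification_channel column holds
-- the queue identifier to wire to cloud storage event notifications
SHOW PIPES LIKE 'orders_pipe';

-- Queue files that were staged before the pipe or its notifications existed
-- (only recently staged files are considered)
ALTER PIPE orders_pipe REFRESH;

-- Pause and resume the pipe, for example around maintenance on raw_orders
ALTER PIPE orders_pipe SET PIPE_EXECUTION_PAUSED = TRUE;
ALTER PIPE orders_pipe SET PIPE_EXECUTION_PAUSED = FALSE;
```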

MERGE Statement for Upserts

The MERGE statement combines INSERT, UPDATE, and DELETE operations in a single statement, enabling efficient incremental loads and slowly changing dimension processing.

-- MERGE for incremental load with SCD Type 1
MERGE INTO dim_customers AS target
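USING (
  -- Keep one row per customer (from the latest order) so the MERGE is deterministic
  SELECT
    customer_id,
    customer_name,
    email,
    phone,
    CURRENT_TIMESTAMP() AS load_date
  FROM raw_orders
  WHERE order_date >= '2024-01-01'
  QUALIFY ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date DESC) = 1
) AS source
ON target.customer_id = source.customer_id
-- IS DISTINCT FROM is NULL-safe, unlike !=
WHEN MATCHED AND (
  target.customer_name IS DISTINCT FROM source.customer_name
  OR target.email IS DISTINCT FROM source.email
  OR target.phone IS DISTINCT FROM source.phone
) THEN UPDATE SET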
  customer_name = source.customer_name,
  email = source.email,
  phone = source.phone,
  load_date = source.load_date,
  is_current = TRUE
WHEN NOT MATCHED THEN INSERT (
  customer_id, customer_name, email, phone, load_date, is_current
) VALUES (
  source.customer_id,
  source.customer_name,
  source.email,
  source.phone,
  source.load_date,
  TRUE
);
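Change detection in a WHEN MATCHED clause depends on how the comparison treats NULL. The short query below illustrates the difference between != and IS DISTINCT FROM; the column aliases are only for illustration.

```sql
SELECT
  NULL != 'a'                AS not_equal_result,    -- NULL, so a WHEN MATCHED AND (...) branch would not fire
  NULL IS DISTINCT FROM 'a'  AS is_distinct_result,  -- TRUE
  NULL IS DISTINCT FROM NULL AS both_null_result;    -- FALSE
```

A comparison written with != evaluates to NULL whenever either side is NULL, so a value changing from NULL to an actual email would go undetected, while IS DISTINCT FROM reports it as a change.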

Task-Based ETL Orchestration

Tasks and Streams enable automated, event-driven ETL pipelines that run on schedules or in response to new data arrival.

-- Create a task for nightly ETL
CREATE OR REPLACE TASK nightly_etl_task
  WAREHOUSE = etl_wh
  SCHEDULE = 'USING CRON 0 2 * * * UTC'
  ALLOW_OVERLAPPING_EXECUTION = FALSE
AS
BEGIN
  -- Step 1: Load new orders
  COPY INTO raw_orders
  FROM @raw_stage/orders/
  FILE_FORMAT = (TYPE = 'PARQUET')
  ON_ERROR = 'SKIP_FILE';

  -- Step 2: Merge into dimension table
  MERGE INTO dim_orders AS target
  USING raw_orders AS source
  ON target.order_id = source.order_id
  WHEN MATCHED THEN UPDATE SET
    order_amount = source.order_amount,
    order_status = source.order_status
  WHEN NOT MATCHED THEN INSERT
    (order_id, customer_id, order_date, order_amount, order_status)
  VALUES
    (source.order_id, source.customer_id, source.order_date,
     source.order_amount, source.order_status);
END;

-- Create a child task that runs after nightly_etl_task completes
CREATE OR REPLACE TASK transform_customers
  WAREHOUSE = etl_wh
  AFTER nightly_etl_task
AS
  MERGE INTO dim_customers AS target
  USING (SELECT DISTINCT customer_id, customer_name, email FROM raw_orders) AS source
  ON target.customer_id = source.customer_id
  WHEN NOT MATCHED THEN INSERT (customer_id, customer_name, email)
  VALUES (source.customer_id, source.customer_name, source.email);
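Newly created tasks start in a suspended state, so the definitions above do not run until they are resumed. A minimal sketch of enabling, testing, and inspecting the two tasks, assuming the task names used above:

```sql
-- Resume child tasks before the root task
ALTER TASK transform_customers RESUME;
ALTER TASK nightly_etl_task RESUME;

-- Alternatively, resume a root task and all of its dependents in one call
SELECT SYSTEM$TASK_DEPENDENTS_ENABLE('nightly_etl_task');

-- Trigger a run manually, which is useful for testing outside the schedule
EXECUTE TASK nightly_etl_task;

-- Inspect runs from the last 24 hours
SELECT name, state, scheduled_time, completed_time, error_message
FROM TABLE(INFORMATION_SCHEMA.TASK_HISTORY(
  SCHEDULED_TIME_RANGE_START => DATEADD('hour', -24, CURRENT_TIMESTAMP()),
  TASK_NAME => 'NIGHTLY_ETL_TASK'
))
ORDER BY scheduled_time DESC;
```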

Dynamic Tables for Declarative ETL

Dynamic Tables provide a declarative approach to ETL where Snowflake automatically manages refreshes based on changes to underlying data.

-- Create a dynamic table for an aggregation kept at most one hour behind its sources
CREATE OR REPLACE DYNAMIC TABLE daily_sales_summary
  WAREHOUSE = analytics_wh
  TARGET_LAG = '1 hour'
AS
  SELECT
    DATE_TRUNC('day', order_date) AS sale_date,
    product_category,
    COUNT(DISTINCT order_id) AS order_count,
    SUM(order_amount) AS total_revenue,
    AVG(order_amount) AS avg_order_value
  FROM raw_orders o
  JOIN dim_products p ON o.product_id = p.product_id
  GROUP BY 1, 2;

-- Create a dynamic table for customer 360 view
CREATE OR REPLACE DYNAMIC TABLE customer_360
  WAREHOUSE = analytics_wh
  TARGET_LAG = '30 minutes'
AS
  SELECT
    c.customer_id,
    c.customer_name,
    c.email,
    COUNT(o.order_id) AS total_orders,
    SUM(o.order_amount) AS lifetime_value,
    MAX(o.order_date) AS last_order_date,
    DATEDIFF('day', MAX(o.order_date), CURRENT_DATE()) AS days_since_last_order
  FROM dim_customers c
  LEFT JOIN raw_orders o ON c.customer_id = o.customer_id
  GROUP BY 1, 2, 3;
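Although refreshes are managed automatically, it is often useful to trigger or inspect them by hand. The following sketch assumes the two dynamic tables defined above; the new lag value is arbitrary.

```sql
-- Force an immediate refresh, for example after a backfill of raw_orders
ALTER DYNAMIC TABLE daily_sales_summary REFRESH;

-- Review recent refreshes (the name may need to be fully qualified,
-- depending on the current database and schema)
SELECT name, state, refresh_action, refresh_start_time, refresh_end_time
FROM TABLE(INFORMATION_SCHEMA.DYNAMIC_TABLE_REFRESH_HISTORY(
  NAME => 'DAILY_SALES_SUMMARY'
))
ORDER BY refresh_start_time DESC
LIMIT 10;

-- Relax the freshness target, or pause and resume refreshes
ALTER DYNAMIC TABLE customer_360 SET TARGET_LAG = '1 hour';
ALTER DYNAMIC TABLE customer_360 SUSPEND;
ALTER DYNAMIC TABLE customer_360 RESUME;
```

A longer TARGET_LAG generally means fewer refreshes and less warehouse time, at the cost of staler results.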

Common ETL Patterns

Incremental Load Pattern

-- Use Streams for CDC-based incremental loads
CREATE OR REPLACE STREAM raw_orders_stream
  ON TABLE raw_orders
  SHOW_INITIAL_ROWS = FALSE;

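-- Task runs only when the stream has data and appends newly inserted rows
-- (updates appear in the stream as DELETE/INSERT pairs and are skipped here)
CREATE OR REPLACE TASK process_incremental_orders
  WAREHOUSE = etl_wh
  SCHEDULE = '5 MINUTE'
  WHEN SYSTEM$STREAM_HAS_DATA('raw_orders_stream')
AS
  INSERT INTO dim_orders (order_id, customer_id, order_date, order_amount, order_status)
  SELECT order_id, customer_id, order_date, order_amount, order_status
  FROM raw_orders_stream
  WHERE METADATA$ACTION = 'INSERT'
    AND METADATA$ISUPDATE = FALSE;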
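When the source table also receives updates and deletes, an insert-only task is not enough. One common approach, sketched here under the assumption that raw_orders and dim_orders share the columns shown (including order_status), is to MERGE from the stream and branch on its metadata columns.

```sql
MERGE INTO dim_orders AS target
USING (
  SELECT *
  FROM raw_orders_stream
  -- An update appears as a DELETE/INSERT pair; drop the DELETE half
  -- because the INSERT half carries the new values
  WHERE NOT (METADATA$ACTION = 'DELETE' AND METADATA$ISUPDATE = TRUE)
) AS source
ON target.order_id = source.order_id
-- True deletes in the source remove the row from the target
WHEN MATCHED AND source.METADATA$ACTION = 'DELETE' THEN DELETE
-- Updated rows overwrite the existing target row
WHEN MATCHED AND source.METADATA$ACTION = 'INSERT' THEN UPDATE SET
  customer_id  = source.customer_id,
  order_date   = source.order_date,
  order_amount = source.order_amount,
  order_status = source.order_status
-- Brand-new rows are inserted
WHEN NOT MATCHED AND source.METADATA$ACTION = 'INSERT' THEN INSERT
  (order_id, customer_id, order_date, order_amount, order_status)
VALUES
  (source.order_id, source.customer_id, source.order_date,
   source.order_amount, source.order_status);
```

Consuming the stream in a DML statement advances its offset, so each change is processed once.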

Full Refresh Pattern

-- Truncate and reload for small dimension tables
CREATE OR REPLACE TASK full_refresh_dim_products
  WAREHOUSE = etl_wh
  SCHEDULE = 'USING CRON 0 6 * * * UTC'
AS
  -- INSERT OVERWRITE truncates and reloads in a single atomic statement,
  -- so readers never see an empty dim_products and a failed load leaves
  -- the previous contents in place
  INSERT OVERWRITE INTO dim_products
  SELECT * FROM raw_products WHERE is_active = TRUE;
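For larger tables, or when the new data should be validated before it becomes visible, one alternative to reloading in place is to build a staging copy and swap it in. This is a sketch; the name dim_products_staging is hypothetical.

```sql
-- Build the replacement with the same column definitions as the target
CREATE OR REPLACE TABLE dim_products_staging LIKE dim_products;

INSERT INTO dim_products_staging
SELECT * FROM raw_products WHERE is_active = TRUE;

-- Optional validation before promoting, for example a row-count check
SELECT COUNT(*) AS staged_rows FROM dim_products_staging;

-- Exchange the two tables in a single operation;
-- the previous contents end up in dim_products_staging
ALTER TABLE dim_products SWAP WITH dim_products_staging;
```

After the swap, the old data remains available in the staging table until the next run replaces it, which gives a simple rollback path.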

Slowly Changing Dimensions

-- SCD Type 2: Track historical changes
MERGE INTO dim_customers_scd2 AS target
USING raw_customers AS source
ON target.customer_id = source.customer_id
  AND target.is_current = TRUE
-- IS DISTINCT FROM is NULL-safe, unlike !=
WHEN MATCHED AND (
  target.customer_name IS DISTINCT FROM source.customer_name
  OR target.email IS DISTINCT FROM source.email
) THEN UPDATE SET
  is_current = FALSE,
  end_date = CURRENT_DATE()
WHEN NOT MATCHED THEN INSERT
  (customer_id, customer_name, email, start_date, end_date, is_current)
VALUES
  (source.customer_id, source.customer_name, source.email,
   CURRENT_DATE(), NULL, TRUE);

-- Insert new version for changed records
INSERT INTO dim_customers_scd2
  (customer_id, customer_name, email, start_date, end_date, is_current)
SELECT
  customer_id, customer_name, email,
  CURRENT_DATE(), NULL, TRUE
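FROM raw_customers r
WHERE EXISTS (
  SELECT 1 FROM dim_customers_scd2 d
  WHERE d.customer_id = r.customer_id
    AND d.is_current = FALSE
    AND d.end_date = CURRENT_DATE()
)
-- Guard against duplicate current rows if this statement is re-run the same day
AND NOT EXISTS (
  SELECT 1 FROM dim_customers_scd2 c
  WHERE c.customer_id = r.customer_id
    AND c.is_current = TRUE
);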
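Once history is tracked this way, the dimension can be queried for the current state or for a past point in time. The queries below are a sketch against the dim_customers_scd2 columns used above; the date literal is arbitrary.

```sql
-- Current version of every customer
SELECT customer_id, customer_name, email
FROM dim_customers_scd2
WHERE is_current = TRUE;

-- Version that was valid on a given date; each version is treated as the
-- half-open interval [start_date, end_date), with NULL meaning "still open"
SELECT customer_id, customer_name, email
FROM dim_customers_scd2
WHERE start_date <= '2024-06-30'::DATE
  AND (end_date > '2024-06-30'::DATE OR end_date IS NULL);

-- Sanity check: any rows returned indicate more than one current version
SELECT customer_id, COUNT(*) AS current_versions
FROM dim_customers_scd2
WHERE is_current = TRUE
GROUP BY customer_id
HAVING COUNT(*) > 1;
```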

ETL Pipeline Design Best Practices

Practice	Description	Impact
Use ELT over ETL	Load raw data first, transform in Snowflake	Reduces infrastructure, leverages Snowflake compute
Implement CDC with Streams	Track only changed rows for incremental loads	Minimizes processing and cost
Set auto-suspend on warehouses	Idle warehouses consume credits unnecessarily	Reduces compute costs, especially for intermittent workloads
Use resource monitors	Set credit quotas per warehouse	Prevents runaway costs
Validate data quality post-load	Add NOT NULL constraints and quality queries (Snowflake does not enforce CHECK constraints)	Catches errors before downstream impact
Version control pipeline code	Store SQL in Git for auditability	Enables rollback and collaboration
Use staging tables for complex transforms	Stage, validate, then promote to target	Improves reliability and debugging
Schedule during off-peak hours	Run heavy ETL when warehouse utilization is low	Less contention with interactive queries
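Several of the practices in the table map directly to SQL. The sketch below assumes the etl_wh warehouse and raw_orders table used earlier; the monitor name etl_monthly_monitor, the 60-second suspend time, and the 100-credit quota are illustrative values, and creating a resource monitor typically requires the ACCOUNTADMIN role.

```sql
-- Suspend the warehouse after 60 seconds of inactivity and wake it on demand
ALTER WAREHOUSE etl_wh SET AUTO_SUSPEND = 60 AUTO_RESUME = TRUE;

-- Cap monthly credit usage: notify at 80 percent, suspend at 100 percent
CREATE OR REPLACE RESOURCE MONITOR etl_monthly_monitor
  WITH CREDIT_QUOTA = 100
       FREQUENCY = MONTHLY
       START_TIMESTAMP = IMMEDIATELY
  TRIGGERS
    ON 80 PERCENT DO NOTIFY
    ON 100 PERCENT DO SUSPEND;

ALTER WAREHOUSE etl_wh SET RESOURCE_MONITOR = etl_monthly_monitor;

-- Post-load quality check: missing keys and repeated keys in the landed data
SELECT
  COUNT(*)                            AS total_rows,
  COUNT_IF(order_id IS NULL)          AS null_order_ids,
  COUNT(*) - COUNT(DISTINCT order_id) AS duplicate_or_null_order_ids
FROM raw_orders;
```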

Performance Metrics for ETL Pipelines

Metric	Target	Monitoring Query
Load latency	Under 15 minutes	`SELECT DATEDIFF('minute', MAX(load_time), CURRENT_TIMESTAMP()) FROM raw_orders`
Data freshness	Under 1 hour	Compare source and target timestamps
Error rate	Under 0.1%	`SELECT COUNT(*) FROM SNOWFLAKE.ACCOUNT_USAGE.COPY_HISTORY WHERE status ILIKE 'load failed'`
Pipeline duration	Track trends	Monitor task execution times in SNOWFLAKE.ACCOUNT_USAGE.TASK_HISTORY
Credit consumption	Within budget	`SELECT SUM(credits_used) FROM SNOWFLAKE.ACCOUNT_USAGE.WAREHOUSE_METERING_HISTORY`
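The duration and credit metrics can be tracked with queries along these lines. This is a sketch: the 7-day and 30-day windows are arbitrary, and ACCOUNT_USAGE views lag behind real time, so very recent runs may not appear yet.

```sql
-- Task duration trend: average and maximum runtime per task per day
SELECT
  name                                                      AS task_name,
  DATE_TRUNC('day', scheduled_time)                         AS run_date,
  COUNT(*)                                                  AS runs,
  AVG(DATEDIFF('second', query_start_time, completed_time)) AS avg_seconds,
  MAX(DATEDIFF('second', query_start_time, completed_time)) AS max_seconds
FROM SNOWFLAKE.ACCOUNT_USAGE.TASK_HISTORY
WHERE scheduled_time >= DATEADD('day', -7, CURRENT_TIMESTAMP())
  AND state = 'SUCCEEDED'
GROUP BY 1, 2
ORDER BY 1, 2;

-- Credit consumption per warehouse per day
SELECT
  warehouse_name,
  DATE_TRUNC('day', start_time) AS usage_date,
  SUM(credits_used)             AS credits
FROM SNOWFLAKE.ACCOUNT_USAGE.WAREHOUSE_METERING_HISTORY
WHERE start_time >= DATEADD('day', -30, CURRENT_TIMESTAMP())
GROUP BY 1, 2
ORDER BY 2, 1;
```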
