Snowflake Real-Time Data Ingestion

Snowflake provides multiple mechanisms for real-time data ingestion, enabling near-instantaneous data availability for analytics and operations.

Real-Time Ingestion Options

Snowflake offers four primary mechanisms for real-time data ingestion, each with distinct latency profiles and use cases. The choice depends on your latency requirements, data volume, and whether you are generating files or raw event streams.
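The selection logic described above can be sketched as a small decision helper. This is only an illustration: the `Workload` fields and both cutoff constants are hypothetical values chosen for the example, not Snowflake recommendations.

```python
from dataclasses import dataclass

# Hypothetical cutoffs, chosen only to make the example concrete
LOW_VOLUME_ROWS_PER_SECOND = 1.0   # below this, plain DML is assumed acceptable
FILE_LATENCY_SECONDS = 60.0        # at or above this, staging files is assumed acceptable


@dataclass
class Workload:
    max_latency_seconds: float  # how stale data may be when it is queried
    source: str                 # "files", "kafka", or "events"
    rows_per_second: float      # expected sustained event rate


def choose_ingestion_method(w: Workload) -> str:
    """Map a workload to one of the four mechanisms described above."""
    if w.source == "kafka":
        return "Kafka Connector"
    if w.source == "files":
        return "Snowpipe"
    # Raw events: a trickle can use plain DML
    if w.rows_per_second < LOW_VOLUME_ROWS_PER_SECOND:
        return "Direct DML"
    # Sustained events with a relaxed latency budget can be written to files
    if w.max_latency_seconds >= FILE_LATENCY_SECONDS:
        return "Snowpipe"
    return "Snowpipe Streaming"
```

For example, `choose_ingestion_method(Workload(1, "events", 5000))` returns `"Snowpipe Streaming"`.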

Snowpipe (File-Based Auto-Ingestion)

Snowpipe loads new files from cloud storage locations (S3, Azure Blob Storage, GCS) automatically as they arrive. With auto-ingest enabled, cloud event notifications (SQS/SNS for AWS, Event Grid for Azure, Pub/Sub for GCP) trigger the load; without it, a client calls the Snowpipe REST API to report new files. Data lands in files first, and Snowpipe then loads them within seconds to minutes.

Snowpipe Streaming (API-Based)

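Snowpipe Streaming writes rows directly into Snowflake tables without staging files first. Applications send rows through the Snowflake Ingest SDK, and Snowflake buffers and commits them with latency typically on the order of seconds. The /v1/data/pipes/ REST endpoint, by contrast, belongs to file-based Snowpipe and accepts file paths rather than rows. Snowpipe Streaming is the lowest-latency option available natively in Snowflake.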

Kafka Connector for Snowflake

The Snowflake Kafka Connector streams data continuously from Apache Kafka topics into Snowflake. It manages batching, file creation, and error handling, and provides exactly-once delivery semantics when configured correctly.

Direct DML

Standard SQL INSERT, UPDATE, and DELETE statements provide immediate data availability but are not optimized for high-throughput streaming. Best for low-volume, event-driven updates where immediate consistency is required.
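For the low-volume case described above, a minimal sketch of a parameterized insert might look like the following. The helper name `insert_events` is hypothetical, and the function accepts any DB-API cursor; with the Snowflake Python connector, `%s` placeholders match its default `pyformat` parameter style. The three columns are a subset of the `clickstream_events` table used elsewhere in this article.

```python
def insert_events(cursor, events):
    """Insert a small batch of events with one parameterized statement.

    `cursor` is any DB-API cursor, for example one returned by
    snowflake.connector.connect(...).cursor().
    """
    sql = (
        "INSERT INTO clickstream_events (event_id, user_id, event_type) "
        "VALUES (%s, %s, %s)"
    )
    # Bind values as parameters instead of formatting them into the SQL text
    rows = [(e["event_id"], e["user_id"], e["event_type"]) for e in events]
    if rows:
        cursor.executemany(sql, rows)
    return len(rows)
```

Sending several rows per call, rather than one statement per event, keeps the number of round trips down, which matters because each statement carries fixed overhead.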

Snowpipe REST API

import snowflake.connector

# Establish connection
conn = snowflake.connector.connect(
    account='your_account',
    user='your_user',
    password='your_password',
    warehouse='streaming_wh',
    database='realtime_db',
    schema='events'
)

cursor = conn.cursor()

# Create a target table for streaming data
cursor.execute("""
    CREATE OR REPLACE TABLE clickstream_events (
        event_id VARCHAR(50),
        user_id VARCHAR(50),
        page_url VARCHAR(500),
        event_type VARCHAR(50),
        event_timestamp TIMESTAMP_NTZ,
        session_id VARCHAR(100),
        ip_address VARCHAR(45),
        user_agent VARCHAR(500)
    )
""")

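# Create an internal stage and a file-based pipe that loads from it
cursor.execute("CREATE STAGE IF NOT EXISTS streaming_stage")

# MATCH_BY_COLUMN_NAME maps JSON keys to table columns; without it, JSON
# can only be loaded into a single VARIANT column
cursor.execute("""
    CREATE OR REPLACE PIPE clickstream_pipe
    AUTO_INGEST = FALSE
    AS
    COPY INTO clickstream_events
    FROM @streaming_stage
    FILE_FORMAT = (TYPE = 'JSON' STRIP_OUTER_ARRAY = TRUE)
    MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE
""")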

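# Load events through the Snowpipe REST API
# Note: /v1/data/pipes/<pipe>/insertFiles takes the paths of files already in
# the pipe's stage, not row payloads, so the events are staged as a file first
import json
import uuid
import requests

account_url = "https://your_account.snowflakecomputing.com"
token = "your_token"  # JWT generated from your key pair

headers = {
    "Authorization": f"Bearer {token}",
    "Content-Type": "application/json",
    "Accept": "application/json"
}

# Events to load, written as a JSON array (the pipe strips the outer array)
events = [
    {
        "event_id": "evt_001",
        "user_id": "user_123",
        "page_url": "/products/item-456",
        "event_type": "page_view",
        "event_timestamp": "2024-01-15T10:30:00Z",
        "session_id": "sess_abc",
        "ip_address": "192.168.1.1",
        "user_agent": "Mozilla/5.0"
    }
]

# Write the events to a local file and upload it to the stage
local_path = "/tmp/clickstream_0001.json"
with open(local_path, "w") as f:
    json.dump(events, f)
cursor.execute(f"PUT file://{local_path} @streaming_stage AUTO_COMPRESS = FALSE")

# Tell Snowpipe which staged files to load; the pipe name is fully qualified
response = requests.post(
    f"{account_url}/v1/data/pipes/realtime_db.events.clickstream_pipe/insertFiles",
    headers=headers,
    params={"requestId": str(uuid.uuid4())},
    json={"files": [{"path": "clickstream_0001.json"}]}
)

print(f"Snowpipe response: {response.status_code}")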

Kafka Connector Configuration

{
  "name": "snowflake-sink-connector",
  "config": {
    "connector.class": "com.snowflake.kafka.connector.SnowflakeSinkConnector",
    "tasks.max": "3",
    "topics": "user_events,page_views,transactions",
    "snowflake.url.name": "your_account.snowflakecomputing.com:443",
    "snowflake.user.name": "your_user",
    "snowflake.role.name": "your_role",
    "snowflake.private.key": "your_private_key",
    "snowflake.private.key.passphrase": "your_passphrase",
    "snowflake.database.name": "REALTIME_DB",
    "snowflake.schema.name": "KAFKA_SCHEMA",
    "snowflake.topic2table.map": "user_events:KAFKA_EVENTS,page_views:KAFKA_EVENTS,transactions:KAFKA_EVENTS",
    "snowflake.ingestion.method": "SNOWPIPE_STREAMING",
    "key.converter": "org.apache.kafka.connect.storage.StringConverter",
    "value.converter": "org.apache.kafka.connect.json.JsonConverter",
    "value.converter.schemas.enable": "false",
    "buffer.count.records": "10000",
    "buffer.flush.time": "30",
    "buffer.size.bytes": "5242880"
  }
}
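To see how the three buffer settings interact, the sketch below estimates which threshold triggers a flush first for a steady stream. It assumes the connector flushes a partition's buffer as soon as any one limit is reached; the function name and its inputs are hypothetical, and the defaults mirror the values in the configuration above.

```python
def estimate_flush(rate_per_sec, avg_record_bytes,
                   count_limit=10_000, time_limit_sec=30,
                   size_limit_bytes=5_242_880):
    """Return (seconds_until_flush, trigger) for one partition's buffer.

    Assumes a steady record rate and that the first limit reached wins.
    """
    candidates = {
        "buffer.count.records": count_limit / rate_per_sec,
        "buffer.size.bytes": size_limit_bytes / (rate_per_sec * avg_record_bytes),
        "flush time": float(time_limit_sec),
    }
    # The smallest time-to-threshold is the one that fires
    trigger = min(candidates, key=candidates.get)
    return candidates[trigger], trigger
```

With these defaults, `estimate_flush(1000, 500)` returns `(10.0, 'buffer.count.records')`, while a slower stream such as `estimate_flush(100, 500)` is flushed by the time limit after 30 seconds. If latency matters more than file or batch size, the time limit is usually the setting to lower.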

Streams for Change Data Capture

-- Create a stream on source table for CDC
CREATE OR REPLACE STREAM orders_cdc_stream
  ON TABLE source_orders
  SHOW_INITIAL_ROWS = FALSE
  APPEND_ONLY = FALSE;

-- Create a task to process stream records
CREATE OR REPLACE TASK process_cdc_events
  WAREHOUSE = streaming_wh
  SCHEDULE = '1 MINUTE'
WHEN SYSTEM$STREAM_HAS_DATA('orders_cdc_stream')
AS
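BEGIN
  -- Run all statements in one transaction: the stream offset advances only
  -- on commit, so every statement sees the same set of changes
  BEGIN TRANSACTION;

  -- Process inserts (exclude the INSERT half of an update)
  INSERT INTO target_orders (order_id, customer_id, amount, status, is_current, event_time)
  SELECT
    order_id,
    customer_id,
    amount,
    'ACTIVE' AS status,
    TRUE AS is_current,
    CURRENT_TIMESTAMP() AS event_time
  FROM orders_cdc_stream
  WHERE METADATA$ACTION = 'INSERT'
    AND METADATA$ISUPDATE = FALSE;

  -- Process updates (soft delete + re-insert). A stream reports an update as
  -- a DELETE/INSERT pair with METADATA$ISUPDATE = TRUE; there is no 'UPDATE' action
  UPDATE target_orders t
  SET is_current = FALSE, end_date = CURRENT_TIMESTAMP()
  WHERE t.is_current = TRUE
    AND EXISTS (
      SELECT 1 FROM orders_cdc_stream s
      WHERE s.order_id = t.order_id
        AND s.METADATA$ACTION = 'DELETE'
        AND s.METADATA$ISUPDATE = TRUE
    );

  INSERT INTO target_orders (order_id, customer_id, amount, status, is_current)
  SELECT s.order_id, s.customer_id, s.amount, 'ACTIVE', TRUE
  FROM orders_cdc_stream s
  WHERE s.METADATA$ACTION = 'INSERT'
    AND s.METADATA$ISUPDATE = TRUE;

  -- Process deletes (exclude the DELETE half of an update)
  UPDATE target_orders t
  SET is_current = FALSE, status = 'DELETED'
  WHERE t.is_current = TRUE
    AND EXISTS (
      SELECT 1 FROM orders_cdc_stream s
      WHERE s.order_id = t.order_id
        AND s.METADATA$ACTION = 'DELETE'
        AND s.METADATA$ISUPDATE = FALSE
    );

  COMMIT;
END;

-- Tasks are created suspended; resume the task to start the schedule
ALTER TASK process_cdc_events RESUME;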
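When reading a standard stream, it helps to remember that `METADATA$ACTION` only takes the values `INSERT` and `DELETE`; an update appears as a `DELETE`/`INSERT` pair with `METADATA$ISUPDATE` set to `TRUE`. The sketch below models that mapping in plain Python. The function names and the `(order_id, action, is_update)` tuple layout are hypothetical and stand in for rows fetched from a stream.

```python
def classify_change(action: str, is_update: bool) -> str:
    """Translate a stream row's metadata columns into a logical change type."""
    if action == "INSERT":
        return "update_new_image" if is_update else "insert"
    if action == "DELETE":
        return "update_old_image" if is_update else "delete"
    raise ValueError(f"unexpected METADATA$ACTION: {action!r}")


def summarize(rows):
    """Count logical changes in rows shaped (order_id, action, is_update)."""
    counts = {"insert": 0, "update": 0, "delete": 0}
    for _order_id, action, is_update in rows:
        kind = classify_change(action, is_update)
        if kind == "update_new_image":
            counts["update"] += 1   # count each update once, via its new image
        elif kind in ("insert", "delete"):
            counts[kind] += 1
    return counts
```

A filter on an `'UPDATE'` action never matches any stream row, which is why update handling has to key on `METADATA$ISUPDATE` instead.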

Latency Comparison Table

Ingestion Method	Typical Latency	Throughput	Best For
Snowpipe Streaming	A few seconds	100 MB/s per pipe	Real-time dashboards, IoT
Snowpipe (auto-ingest)	1-60 seconds	1 GB per minute	Operational analytics
Kafka Connector	1-10 seconds	10+ MB/s	Event streaming platforms
COPY INTO (batch)	Minutes to hours	10+ GB per load	Historical backfill
Direct DML (INSERT)	Instant	Limited by warehouse	Low-volume event triggers

Best Practices for Real-Time Ingestion

Practice	Description	Impact
Use Snowpipe Streaming for lowest latency	Direct API ingestion bypasses file staging	Lowest latency available
Set appropriate auto-suspend	Warehouses that run stream-processing tasks should use a longer AUTO_SUSPEND interval	Avoids cold starts
Monitor pipe health	Query the `INFORMATION_SCHEMA.COPY_HISTORY` table function regularly	Catches ingestion failures early
Use VARIANT for semi-structured	Store raw JSON before parsing	Flexibility for schema evolution
Implement dead-letter queues	Route failed records to error tables	Prevents data loss
Batch small events	Accumulate records before sending	Reduces API call overhead
Use Streams for downstream CDC	Track changes in target tables	Enables event-driven analytics
Set up alerts on lag	Monitor `SYSTEM$PIPE_STATUS` and `COPY_HISTORY`	Proactive failure detection
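Two of the practices above, batching small events and keeping failed records for replay, can be sketched together in a small client-side buffer. Everything here is hypothetical: `MicroBatcher`, its parameters, and the `send` callable stand in for whatever client actually delivers a batch to Snowflake.

```python
import time


class MicroBatcher:
    """Accumulate records and hand them to `send` in batches."""

    def __init__(self, send, max_records=500, max_age_seconds=1.0,
                 clock=time.monotonic):
        self._send = send                # callable that delivers one batch
        self._max_records = max_records
        self._max_age = max_age_seconds
        self._clock = clock
        self._buffer = []
        self._oldest = None              # arrival time of the oldest buffered record
        self.dead_letters = []           # (record, error message) pairs for replay

    def add(self, record):
        if not self._buffer:
            self._oldest = self._clock()
        self._buffer.append(record)
        # Flush when the batch is full or its oldest record is too old.
        # Age is only checked when a record arrives, so idle buffers need
        # an explicit flush() from the caller.
        if (len(self._buffer) >= self._max_records
                or self._clock() - self._oldest >= self._max_age):
            self.flush()

    def flush(self):
        if not self._buffer:
            return
        batch, self._buffer = self._buffer, []
        try:
            self._send(batch)
        except Exception as exc:
            # Keep failed records instead of dropping them
            self.dead_letters.extend((rec, str(exc)) for rec in batch)
```

In practice the `dead_letters` list would be written to an error table or queue; it is kept in memory here only to keep the sketch short.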
