Snowflake Unstructured Data Handling

Snowflake extends beyond traditional structured data by providing robust support for unstructured and semi-structured data, enabling organizations to store, query, and analyze diverse data types.

Architecture Overview

The Snowflake unstructured data architecture flows through four layers:

Layer 1: Cloud Storage

Stores raw files in S3, Azure Blob, or Google Cloud Storage
Supports images, videos, documents, audio, JSON, and XML
Files remain in native cloud format until staged

Layer 2: Snowflake Stage Layer

Internal, External, Named, User, or Table stages
Stages files for efficient loading into Snowflake
Supports compression and encryption

Layer 3: Processing Layer

COPY INTO for bulk loading
Snowpipe for continuous ingestion
Streams and Tasks for real-time processing

Layer 4: Storage Layer

Data lands in VARIANT columns for semi-structured data
Semi-structured tables for JSON/Parquet data
Hybrid Tables for structured + unstructured references

Key Concepts

What is a VARIANT Data Type?

Schema-on-read data type for semi-structured data
Stores JSON, Avro, ORC, Parquet natively in Snowflake
No need to define schema before loading
Supports nested structures and arrays

What is a File Format?

Named object that defines file structure
Specifies type (JSON, Parquet, CSV, etc.)
Controls compression, encoding, and error handling
Reusable across multiple loading operations

File Format Creation

Creating a JSON File Format

CREATE OR REPLACE FILE FORMAT json_format
  TYPE = 'JSON'
  STRIP_OUTER_ARRAY = TRUE
  TRIM_SPACE = TRUE
  IGNORE_UTF8_ERRORS = FALSE,
  DATE_FORMAT = 'YYYY-MM-DD'
  COMMENT = 'Standard JSON format for API responses';

Creating a Parquet File Format

CREATE OR REPLACE FILE FORMAT parquet_format
  TYPE = 'PARQUET'
  TRIM_SPACE = TRUE
  REPLACE_INVALID_CHARACTERS = TRUE
  COMMENT = 'Parquet format for columnar data';

File Format Options Comparison

Format	Compression	Nesting	Schema	Use Case
JSON	GZIP/SNAPPY	Deep	Schema-on-read	APIs, logs
AVRO	DEFLATE	Nested	Schema-in-file	Streaming
ORC	ZLIB	Nested	Schema-in-file	Hadoop ecosystem
PARQUET	SNAPPY	Nested	Schema-in-file	Analytics

Stage Operations

Internal Stage Types

-- User stage (~/)
PUT file://data.json @~/staged/ AUTO_COMPRESS=TRUE;

-- Table stage
PUT file://data.csv @%my_table/;

-- Named stage
CREATE STAGE my_named_stage
  FILE_FORMAT = json_format
  COMMENT = 'Named stage for daily loads';

PUT file://data.json @my_named_stage/ OVERWRITE=TRUE;

External Stage (S3 Example)

CREATE OR REPLACE STAGE s3_stage
  URL = 's3://my-bucket/data/'
  STORAGE_INTEGRATION = my_s3_integration
  FILE_FORMAT = json_format
  COMMENT = 'S3 bucket for raw data';

LIST @s3_stage;

Loading Unstructured Data

COPY INTO with Variant

CREATE TABLE raw_json_data (
  id INT AUTOINCREMENT PRIMARY KEY,
  raw_data VARIANT,
  file_name VARCHAR,
  load_timestamp TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP()
);

COPY INTO raw_json_data (raw_data, file_name)
FROM @s3_stage
FILE_FORMAT = json_format
ON_ERROR = 'CONTINUE'
FORCE = FALSE;

Querying Semi-Structured Data

SELECT
  raw_data:id::STRING as user_id,
  raw_data:name::STRING as user_name,
  raw_data:email::STRING as email,
  raw_data:address.city::STRING as city,
  raw_data:preferences.theme::STRING as theme,
  load_timestamp
FROM raw_json_data
WHERE raw_data:active::BOOLEAN = TRUE;

Flattening Nested Arrays

SELECT
  raw_data:id::INT as user_id,
  f.value:product::STRING as product_name,
  f.value:quantity::INT as quantity,
  f.value:price::DECIMAL(10,2) as price
FROM raw_json_data,
LATERAL FLATTEN(input => raw_data:orders) f;

Directory Table Pattern

-- Create table with directory table
CREATE TABLE documents (
  id INT AUTOINCREMENT PRIMARY KEY,
  document_name VARCHAR,
  metadata VARIANT
) DIRECTORY = (ENABLE = TRUE);

-- Refresh directory table
ALTER TABLE documents REFRESH;

-- Query directory table metadata
SELECT
  directory:relativePath as file_path,
  directory:path as file_path_full,
  directory:size as file_size,
  directory:lastModified as last_modified
FROM (TABLE(INFORMATION_SCHEMA.DIRECTORY_TABLE) WHERE table_name = 'DOCUMENTS');

Data Quality for Unstructured Data

Why Validate Unstructured Data?

Prevents bad data from entering your warehouse
Ensures downstream analytics are accurate
Reduces debugging time for data pipelines
Supports compliance and data governance requirements

Key Validation Options

SKIP_BLANK_LINES — Ignores empty lines in files
ERROR_ON_COLUMN_COUNT_MISMATCH — Fails on schema mismatches
TRIM_SPACE — Removes leading/trailing whitespace
IGNORE_UTF8_ERRORS — Handles encoding issues gracefully

Key Takeaway: Always validate unstructured data before loading. Use file format options and custom error handling to ensure data quality.

-- Create a validation query
SELECT
  $1:id::STRING IS NOT NULL as has_id,
  $1:name::STRING IS NOT NULL as has_name,
  METADATA$FILENAME as file_name,
  COUNT(*) as record_count
FROM @s3_stage
FILE_FORMAT = json_format
GROUP BY has_id, has_name, file_name;

Hybrid Tables for Unstructured References

CREATE HYBRID TABLE product_images (
  product_id INT PRIMARY KEY,
  product_name VARCHAR(100),
  image_url VARCHAR(500),
  thumbnail_url VARCHAR(500),
  metadata VARIANT,
  created_at TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP()
);

-- Insert with metadata
INSERT INTO product_images (product_id, product_name, image_url, metadata)
VALUES
  (1, 'Widget A', 's3://images/widget_a.jpg', '{"format":"JPEG","size":1024,"dimensions":{"width":800,"height":600}}');

Summary

What You Learned

Snowflake stores unstructured data in cloud storage via stages
VARIANT column type enables schema-on-read for semi-structured data
FLATTEN function is essential for querying nested arrays
Directory tables provide automatic metadata tracking
Hybrid tables combine structured and unstructured data capabilities

Best Practices

Use named file formats for consistency across loading operations
Validate data before loading to prevent pipeline failures
Leverage FLATTEN for nested array queries
Use directory tables for automatic metadata management
Consider Hybrid Tables for mixed structured/unstructured use cases

Snowflake Unstructured Data Handling

Snowflake Unstructured Data Handling

Architecture Overview

Layer 1: Cloud Storage

Layer 2: Snowflake Stage Layer

Layer 3: Processing Layer

Layer 4: Storage Layer

Key Concepts

What is a VARIANT Data Type?

What is a File Format?

File Format Creation

Creating a JSON File Format

Creating a Parquet File Format

File Format Options Comparison

Stage Operations

Internal Stage Types

External Stage (S3 Example)

Loading Unstructured Data

COPY INTO with Variant

Querying Semi-Structured Data

Flattening Nested Arrays

Directory Table Pattern

Data Quality for Unstructured Data

Why Validate Unstructured Data?

Key Validation Options

Hybrid Tables for Unstructured References

Summary

What You Learned

Best Practices

See Also

Need Expert Snowflake Help?