Snowflake Unstructured Data Handling
Snowflake extends beyond traditional structured data by providing robust support for unstructured and semi-structured data, enabling organizations to store, query, and analyze diverse data types.
Architecture Overview
<svg width="800" height="500" viewBox="0 0 800 500" xmlns="http://www.w3.org/2000/svg">
<defs>
<linearGradient id="grad1" x1="0%" y1="0%" x2="100%" y2="0%">
<stop offset="0%" style="stop-color:#1E90FF;stop-opacity:1" />
<stop offset="100%" style="stop-color:#00BFFF;stop-opacity:1" />
</linearGradient>
<linearGradient id="grad2" x1="0%" y1="0%" x2="100%" y2="0%">
<stop offset="0%" style="stop-color:#32CD32;stop-opacity:1" />
<stop offset="100%" style="stop-color:#90EE90;stop-opacity:1" />
</linearGradient>
</defs>
<text x="400" y="40" text-anchor="middle" font-size="20" font-weight="bold" fill="#333">Snowflake Unstructured Data Architecture</text>
<rect x="50" y="80" width="700" height="100" rx="10" fill="url(#grad1)" opacity="0.9"/>
<text x="400" y="110" text-anchor="middle" font-size="14" fill="white" font-weight="bold">Cloud Storage Layer (S3/Azure/GCS)</text>
<rect x="80" y="120" width="120" height="40" rx="5" fill="white" opacity="0.3"/>
<text x="140" y="145" text-anchor="middle" font-size="11" fill="white">Images</text>
<rect x="220" y="120" width="120" height="40" rx="5" fill="white" opacity="0.3"/>
<text x="280" y="145" text-anchor="middle" font-size="11" fill="white">Videos</text>
<rect x="360" y="120" width="120" height="40" rx="5" fill="white" opacity="0.3"/>
<text x="420" y="145" text-anchor="middle" font-size="11" fill="white">Documents</text>
<rect x="500" y="120" width="120" height="40" rx="5" fill="white" opacity="0.3"/>
<text x="560" y="145" text-anchor="middle" font-size="11" fill="white">Audio</text>
<rect x="640" y="120" width="90" height="40" rx="5" fill="white" opacity="0.3"/>
<text x="685" y="145" text-anchor="middle" font-size="11" fill="white">JSON/XML</text>
<path d="M400 180 L400 220" stroke="#333" stroke-width="2" fill="none" marker-end="url(#arrow)"/>
<rect x="50" y="220" width="700" height="80" rx="10" fill="#FF6347" opacity="0.85"/>
<text x="400" y="245" text-anchor="middle" font-size="14" fill="white" font-weight="bold">Snowflake Stage Layer</text>
<rect x="80" y="260" width="100" height="30" rx="5" fill="white" opacity="0.3"/>
<text x="130" y="280" text-anchor="middle" font-size="11" fill="white">Internal</text>
<rect x="200" y="260" width="100" height="30" rx="5" fill="white" opacity="0.3"/>
<text x="250" y="280" text-anchor="middle" font-size="11" fill="white">External</text>
<rect x="320" y="260" width="100" height="30" rx="5" fill="white" opacity="0.3"/>
<text x="370" y="280" text-anchor="middle" font-size="11" fill="white">Named</text>
<rect x="440" y="260" width="100" height="30" rx="5" fill="white" opacity="0.3"/>
<text x="490" y="280" text-anchor="middle" font-size="11" fill="white">User</text>
<rect x="560" y="260" width="100" height="30" rx="5" fill="white" opacity="0.3"/>
<text x="610" y="280" text-anchor="middle" font-size="11" fill="white">Table</text>
<path d="M400 300 L400 340" stroke="#333" stroke-width="2" fill="none" marker-end="url(#arrow)"/>
<rect x="50" y="340" width="700" height="60" rx="10" fill="url(#grad2)" opacity="0.85"/>
<text x="400" y="375" text-anchor="middle" font-size="14" fill="white" font-weight="bold">Processing: COPY INTO | pipes | Streams | Tasks</text>
<path d="M400 400 L400 430" stroke="#333" stroke-width="2" fill="none" marker-end="url(#arrow)"/>
<rect x="50" y="430" width="700" height="50" rx="10" fill="#9370DB" opacity="0.85"/>
<text x="400" y="460" text-anchor="middle" font-size="14" fill="white" font-weight="bold">Variant Column | Semi-Structured | Hybrid Tables</text>
<defs>
<marker id="arrow" markerWidth="10" markerHeight="10" refX="9" refY="3" orient="auto" markerUnits="strokeWidth">
<path d="M0,0 L0,6 L9,3 z" fill="#333"/>
</marker>
</defs>
</svg>
Key Concepts
DfVariant Data Type
DfFile Format
File Format Creation
Creating a JSON File Format
CREATE OR REPLACE FILE FORMAT json_format
TYPE = 'JSON'
STRIP_OUTER_ARRAY = TRUE
TRIM_SPACE = TRUE
IGNORE_UTF8_ERRORS = FALSE,
DATE_FORMAT = 'YYYY-MM-DD'
COMMENT = 'Standard JSON format for API responses';
Creating a Parquet File Format
CREATE OR REPLACE FILE FORMAT parquet_format
TYPE = 'PARQUET'
TRIM_SPACE = TRUE
REPLACE_INVALID_CHARACTERS = TRUE
COMMENT = 'Parquet format for columnar data';
File Format Options Comparison
| Format | Compression | Nesting | Schema | Use Case |
|---|---|---|---|---|
| JSON | GZIP/SNAPPY | Deep | Schema-on-read | APIs, logs |
| AVRO | DEFLATE | Nested | Schema-in-file | Streaming |
| ORC | ZLIB | Nested | Schema-in-file | Hadoop ecosystem |
| PARQUET | SNAPPY | Nested | Schema-in-file | Analytics |
Stage Operations
Internal Stage Types
-- User stage (~/)
PUT file://data.json @~/staged/ AUTO_COMPRESS=TRUE;
-- Table stage
PUT file://data.csv @%my_table/;
-- Named stage
CREATE STAGE my_named_stage
FILE_FORMAT = json_format
COMMENT = 'Named stage for daily loads';
PUT file://data.json @my_named_stage/ OVERWRITE=TRUE;
External Stage (S3 Example)
CREATE OR REPLACE STAGE s3_stage
URL = 's3://my-bucket/data/'
STORAGE_INTEGRATION = my_s3_integration
FILE_FORMAT = json_format
COMMENT = 'S3 bucket for raw data';
LIST @s3_stage;
Loading Unstructured Data
COPY INTO with Variant
CREATE TABLE raw_json_data (
id INT AUTOINCREMENT PRIMARY KEY,
raw_data VARIANT,
file_name VARCHAR,
load_timestamp TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP()
);
COPY INTO raw_json_data (raw_data, file_name)
FROM @s3_stage
FILE_FORMAT = json_format
ON_ERROR = 'CONTINUE'
FORCE = FALSE;
Querying Semi-Structured Data
SELECT
raw_data:id::STRING as user_id,
raw_data:name::STRING as user_name,
raw_data:email::STRING as email,
raw_data:address.city::STRING as city,
raw_data:preferences.theme::STRING as theme,
load_timestamp
FROM raw_json_data
WHERE raw_data:active::BOOLEAN = TRUE;
Flattening Nested Arrays
SELECT
raw_data:id::INT as user_id,
f.value:product::STRING as product_name,
f.value:quantity::INT as quantity,
f.value:price::DECIMAL(10,2) as price
FROM raw_json_data,
LATERAL FLATTEN(input => raw_data:orders) f;
Directory Table Pattern
-- Create table with directory table
CREATE TABLE documents (
id INT AUTOINCREMENT PRIMARY KEY,
document_name VARCHAR,
metadata VARIANT
) DIRECTORY = (ENABLE = TRUE);
-- Refresh directory table
ALTER TABLE documents REFRESH;
-- Query directory table metadata
SELECT
directory:relativePath as file_path,
directory:path as file_path_full,
directory:size as file_size,
directory:lastModified as last_modified
FROM (TABLE(INFORMATION_SCHEMA.DIRECTORY_TABLE) WHERE table_name = 'DOCUMENTS');
Data Quality for Unstructured Data
Always validate unstructured data before loading. Use file format options like SKIP_BLANK_LINES, ERROR_ON_COLUMN_COUNT_MISMATCH, and custom error handling to ensure data quality.
-- Create a validation query
SELECT
$1:id::STRING IS NOT NULL as has_id,
$1:name::STRING IS NOT NULL as has_name,
METADATA$FILENAME as file_name,
COUNT(*) as record_count
FROM @s3_stage
FILE_FORMAT = json_format
GROUP BY has_id, has_name, file_name;
Hybrid Tables for Unstructured References
CREATE HYBRID TABLE product_images (
product_id INT PRIMARY KEY,
product_name VARCHAR(100),
image_url VARCHAR(500),
thumbnail_url VARCHAR(500),
metadata VARIANT,
created_at TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP()
);
-- Insert with metadata
INSERT INTO product_images (product_id, product_name, image_url, metadata)
VALUES
(1, 'Widget A', 's3://images/widget_a.jpg', '{"format":"JPEG","size":1024,"dimensions":{"width":800,"height":600}}');
- Snowflake stores unstructured data in cloud storage via stages
- VARIANT column type enables schema-on-read for semi-structured data
- FLATTEN function is essential for querying nested arrays
- Directory tables provide automatic metadata tracking
- Hybrid tables combine structured and unstructured data capabilities