CW

Snowflake Unstructured Data Handling

Free Lesson

Advertisement

Snowflake Unstructured Data Handling

Snowflake extends beyond traditional structured data by providing robust support for unstructured and semi-structured data, enabling organizations to store, query, and analyze diverse data types.

Architecture Overview

<svg width="800" height="500" viewBox="0 0 800 500" xmlns="http://www.w3.org/2000/svg">
  <defs>
    <linearGradient id="grad1" x1="0%" y1="0%" x2="100%" y2="0%">
      <stop offset="0%" style="stop-color:#1E90FF;stop-opacity:1" />
      <stop offset="100%" style="stop-color:#00BFFF;stop-opacity:1" />
    </linearGradient>
    <linearGradient id="grad2" x1="0%" y1="0%" x2="100%" y2="0%">
      <stop offset="0%" style="stop-color:#32CD32;stop-opacity:1" />
      <stop offset="100%" style="stop-color:#90EE90;stop-opacity:1" />
    </linearGradient>
  </defs>
  <text x="400" y="40" text-anchor="middle" font-size="20" font-weight="bold" fill="#333">Snowflake Unstructured Data Architecture</text>
  <rect x="50" y="80" width="700" height="100" rx="10" fill="url(#grad1)" opacity="0.9"/>
  <text x="400" y="110" text-anchor="middle" font-size="14" fill="white" font-weight="bold">Cloud Storage Layer (S3/Azure/GCS)</text>
  <rect x="80" y="120" width="120" height="40" rx="5" fill="white" opacity="0.3"/>
  <text x="140" y="145" text-anchor="middle" font-size="11" fill="white">Images</text>
  <rect x="220" y="120" width="120" height="40" rx="5" fill="white" opacity="0.3"/>
  <text x="280" y="145" text-anchor="middle" font-size="11" fill="white">Videos</text>
  <rect x="360" y="120" width="120" height="40" rx="5" fill="white" opacity="0.3"/>
  <text x="420" y="145" text-anchor="middle" font-size="11" fill="white">Documents</text>
  <rect x="500" y="120" width="120" height="40" rx="5" fill="white" opacity="0.3"/>
  <text x="560" y="145" text-anchor="middle" font-size="11" fill="white">Audio</text>
  <rect x="640" y="120" width="90" height="40" rx="5" fill="white" opacity="0.3"/>
  <text x="685" y="145" text-anchor="middle" font-size="11" fill="white">JSON/XML</text>
  <path d="M400 180 L400 220" stroke="#333" stroke-width="2" fill="none" marker-end="url(#arrow)"/>
  <rect x="50" y="220" width="700" height="80" rx="10" fill="#FF6347" opacity="0.85"/>
  <text x="400" y="245" text-anchor="middle" font-size="14" fill="white" font-weight="bold">Snowflake Stage Layer</text>
  <rect x="80" y="260" width="100" height="30" rx="5" fill="white" opacity="0.3"/>
  <text x="130" y="280" text-anchor="middle" font-size="11" fill="white">Internal</text>
  <rect x="200" y="260" width="100" height="30" rx="5" fill="white" opacity="0.3"/>
  <text x="250" y="280" text-anchor="middle" font-size="11" fill="white">External</text>
  <rect x="320" y="260" width="100" height="30" rx="5" fill="white" opacity="0.3"/>
  <text x="370" y="280" text-anchor="middle" font-size="11" fill="white">Named</text>
  <rect x="440" y="260" width="100" height="30" rx="5" fill="white" opacity="0.3"/>
  <text x="490" y="280" text-anchor="middle" font-size="11" fill="white">User</text>
  <rect x="560" y="260" width="100" height="30" rx="5" fill="white" opacity="0.3"/>
  <text x="610" y="280" text-anchor="middle" font-size="11" fill="white">Table</text>
  <path d="M400 300 L400 340" stroke="#333" stroke-width="2" fill="none" marker-end="url(#arrow)"/>
  <rect x="50" y="340" width="700" height="60" rx="10" fill="url(#grad2)" opacity="0.85"/>
  <text x="400" y="375" text-anchor="middle" font-size="14" fill="white" font-weight="bold">Processing: COPY INTO | pipes | Streams | Tasks</text>
  <path d="M400 400 L400 430" stroke="#333" stroke-width="2" fill="none" marker-end="url(#arrow)"/>
  <rect x="50" y="430" width="700" height="50" rx="10" fill="#9370DB" opacity="0.85"/>
  <text x="400" y="460" text-anchor="middle" font-size="14" fill="white" font-weight="bold">Variant Column | Semi-Structured | Hybrid Tables</text>
  <defs>
    <marker id="arrow" markerWidth="10" markerHeight="10" refX="9" refY="3" orient="auto" markerUnits="strokeWidth">
      <path d="M0,0 L0,6 L9,3 z" fill="#333"/>
    </marker>
  </defs>
</svg>

Key Concepts

DfVariant Data Type

DfFile Format

File Format Creation

Creating a JSON File Format

CREATE OR REPLACE FILE FORMAT json_format
  TYPE = 'JSON'
  STRIP_OUTER_ARRAY = TRUE
  TRIM_SPACE = TRUE
  IGNORE_UTF8_ERRORS = FALSE,
  DATE_FORMAT = 'YYYY-MM-DD'
  COMMENT = 'Standard JSON format for API responses';

Creating a Parquet File Format

CREATE OR REPLACE FILE FORMAT parquet_format
  TYPE = 'PARQUET'
  TRIM_SPACE = TRUE
  REPLACE_INVALID_CHARACTERS = TRUE
  COMMENT = 'Parquet format for columnar data';

File Format Options Comparison

FormatCompressionNestingSchemaUse Case
JSONGZIP/SNAPPYDeepSchema-on-readAPIs, logs
AVRODEFLATENestedSchema-in-fileStreaming
ORCZLIBNestedSchema-in-fileHadoop ecosystem
PARQUETSNAPPYNestedSchema-in-fileAnalytics

Stage Operations

Internal Stage Types

-- User stage (~/)
PUT file://data.json @~/staged/ AUTO_COMPRESS=TRUE;

-- Table stage
PUT file://data.csv @%my_table/;

-- Named stage
CREATE STAGE my_named_stage
  FILE_FORMAT = json_format
  COMMENT = 'Named stage for daily loads';

PUT file://data.json @my_named_stage/ OVERWRITE=TRUE;

External Stage (S3 Example)

CREATE OR REPLACE STAGE s3_stage
  URL = 's3://my-bucket/data/'
  STORAGE_INTEGRATION = my_s3_integration
  FILE_FORMAT = json_format
  COMMENT = 'S3 bucket for raw data';

LIST @s3_stage;

Loading Unstructured Data

COPY INTO with Variant

CREATE TABLE raw_json_data (
  id INT AUTOINCREMENT PRIMARY KEY,
  raw_data VARIANT,
  file_name VARCHAR,
  load_timestamp TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP()
);

COPY INTO raw_json_data (raw_data, file_name)
FROM @s3_stage
FILE_FORMAT = json_format
ON_ERROR = 'CONTINUE'
FORCE = FALSE;

Querying Semi-Structured Data

SELECT
  raw_data:id::STRING as user_id,
  raw_data:name::STRING as user_name,
  raw_data:email::STRING as email,
  raw_data:address.city::STRING as city,
  raw_data:preferences.theme::STRING as theme,
  load_timestamp
FROM raw_json_data
WHERE raw_data:active::BOOLEAN = TRUE;

Flattening Nested Arrays

SELECT
  raw_data:id::INT as user_id,
  f.value:product::STRING as product_name,
  f.value:quantity::INT as quantity,
  f.value:price::DECIMAL(10,2) as price
FROM raw_json_data,
LATERAL FLATTEN(input => raw_data:orders) f;

Directory Table Pattern

-- Create table with directory table
CREATE TABLE documents (
  id INT AUTOINCREMENT PRIMARY KEY,
  document_name VARCHAR,
  metadata VARIANT
) DIRECTORY = (ENABLE = TRUE);

-- Refresh directory table
ALTER TABLE documents REFRESH;

-- Query directory table metadata
SELECT
  directory:relativePath as file_path,
  directory:path as file_path_full,
  directory:size as file_size,
  directory:lastModified as last_modified
FROM (TABLE(INFORMATION_SCHEMA.DIRECTORY_TABLE) WHERE table_name = 'DOCUMENTS');

Data Quality for Unstructured Data

Always validate unstructured data before loading. Use file format options like SKIP_BLANK_LINES, ERROR_ON_COLUMN_COUNT_MISMATCH, and custom error handling to ensure data quality.

-- Create a validation query
SELECT
  $1:id::STRING IS NOT NULL as has_id,
  $1:name::STRING IS NOT NULL as has_name,
  METADATA$FILENAME as file_name,
  COUNT(*) as record_count
FROM @s3_stage
FILE_FORMAT = json_format
GROUP BY has_id, has_name, file_name;

Hybrid Tables for Unstructured References

CREATE HYBRID TABLE product_images (
  product_id INT PRIMARY KEY,
  product_name VARCHAR(100),
  image_url VARCHAR(500),
  thumbnail_url VARCHAR(500),
  metadata VARIANT,
  created_at TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP()
);

-- Insert with metadata
INSERT INTO product_images (product_id, product_name, image_url, metadata)
VALUES
  (1, 'Widget A', 's3://images/widget_a.jpg', '{"format":"JPEG","size":1024,"dimensions":{"width":800,"height":600}}');
  • Snowflake stores unstructured data in cloud storage via stages
  • VARIANT column type enables schema-on-read for semi-structured data
  • FLATTEN function is essential for querying nested arrays
  • Directory tables provide automatic metadata tracking
  • Hybrid tables combine structured and unstructured data capabilities

Advertisement

Need Expert Snowflake Help?

Get personalized warehouse optimization, data modeling, or Snowflake platform consulting.

Advertisement