Semi-Structured Data: VARIANT, JSON, XML & Array Functions

Architecture Diagram 1: VARIANT Storage Architecture

Architecture Diagram 2: Semi-Structured Data Parsing Pipeline

Architecture Diagram 3: VARIANT Storage Architecture

Architecture Diagram 4: JSON Parsing Pipeline

Architecture Diagram 5: XML Processing Architecture

Detailed Explanation

What is the VARIANT Data Type?

VARIANT is Snowflake's universal container for semi-structured data (JSON, XML, BSON, Avro, Parquet). It automatically parses and stores hierarchical data in compressed binary format.

VARIANT Storage and Access

Hybrid storage: short strings inline, long strings via pointers (8–16 bytes)
Two access patterns:
- Dot notation — data:customer.name for object navigation
- Bracket notation — data:items[0] for special characters or dynamic paths
Path resolution: O(depth) metadata lookup
Path statistics cache improves performance for frequently queried paths

JSON Processing Deep Dive

Handles all standard JSON types: objects, arrays, strings, numbers, booleans, null
Edge cases: malformed JSON, duplicate keys, non-standard numerics (configurable)
FLATTEN converts arrays to rows for relational processing (lazy evaluation)
Type casting required: data:amount::NUMBER

XML Processing

Native via VARIANT + XMLGET function
Load with TYPE = XML file format
Supports namespaces, attributes, mixed content, CDATA sections
Incremental processing for large documents

Array and Object Functions

Function	Purpose
`ARRAY_AGG`	Aggregate rows into arrays
`ARRAY_CONSTRUCT`	Build arrays from elements
`ARRAY_CAT`	Concatenate arrays
`OBJECT_CONSTRUCT`	Create JSON objects
`OBJECT_KEYS`	Extract object keys
`XMLGET`	Extract XML elements

Schema Evolution and Flexibility

New fields added without DDL changes (missing fields return NULL)
Tradeoff: flexibility vs. performance — queries on many paths have metadata overhead
Recommendation: extract hot paths into dedicated columns via materialized views for analytics

Key Takeaway: Use VARIANT for flexible ingestion, extract frequently queried fields into relational columns for performance.

Key Concepts Table

Function	Purpose	Input	Output
FLATTEN	Convert arrays to rows	ARRAY VARIANT	Table of VARIANT
PARSE_JSON	Parse JSON string	STRING	VARIANT
TRY_PARSE_JSON	Safe JSON parse	STRING	VARIANT (NULL on error)
OBJECT_CONSTRUCT	Create JSON object	Key-value pairs	VARIANT
ARRAY_CONSTRUCT	Create JSON array	Elements	VARIANT
XMLGET	Extract XML element	VARIANT, path	VARIANT
TO_VARIANT	Convert to VARIANT	Any type	VARIANT

Data Type	Storage	Access Pattern	Performance
STRING	Raw text	Full scan required	Slow
VARIANT	Parsed binary	Path navigation	Fast
OBJECT	Parsed binary	Key lookup	Very fast
ARRAY	Parsed binary	Index access	Very fast
GEOGRAPHY	Binary geometry	Spatial functions	Fast

Metric	VARIANT	Flattened Columns	Description
Storage Efficiency	High (compressed)	Medium (redundant)	VARIANT stores once
Query Performance	Medium	High	Flattened columns avoid parsing
Flexibility	Very High	Low	VARIANT adapts to schema changes
Analytics	Limited	Full	Flattened columns enable all SQL
Indexing	Limited	Full	Flattened columns support clustering

Code Examples

-- Example 1: Create table with VARIANT column
CREATE TABLE raw_json_data (
    id INTEGER,
    data VARIANT,
    load_timestamp TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP()
);

-- Example 2: Ingest JSON data
COPY INTO raw_json_data (id, data)
FROM @json_stage/
FILE_FORMAT = (
    TYPE = JSON
    STRIP_OUTER_ARRAY = FALSE
    IGNORE_UTF8_ERRORS = TRUE
);

-- Example 3: Query JSON data with dot notation
SELECT 
    data:order_id::STRING as order_id,
    data:customer.name::STRING as customer_name,
    data:customer.email::STRING as customer_email,
    data:total::NUMBER(10,2) as order_total,
    data:currency::STRING as currency
FROM raw_json_data
WHERE data:customer.name::STRING LIKE '%Smith%';

-- Example 4: Flatten JSON arrays
SELECT 
    data:order_id::STRING as order_id,
    f.value:product_id::STRING as product_id,
    f.value:name::STRING as product_name,
    f.value:quantity::INTEGER as quantity,
    f.value:price::NUMBER(10,2) as price
FROM raw_json_data,
     LATERAL FLATTEN(input => data:items) f;

-- Example 5: Aggregate flattened data back to arrays
SELECT 
    data:order_id::STRING as order_id,
    ARRAY_AGG(
        OBJECT_CONSTRUCT(
            'product_id', f.value:product_id::STRING,
            'name', f.value:name::STRING,
            'quantity', f.value:quantity::INTEGER
        )
    ) as items_array
FROM raw_json_data,
     LATERAL FLATTEN(input => data:items) f
GROUP BY 1;

-- Example 6: Create structured table from semi-structured data
CREATE TABLE orders_structured AS
SELECT 
    data:order_id::STRING as order_id,
    data:customer.name::STRING as customer_name,
    data:customer.email::STRING as customer_email,
    data:total::NUMBER(10,2) as order_total,
    data:currency::STRING as currency,
    data:timestamp::TIMESTAMP_NTZ as order_timestamp,
    COUNT(f.value) as item_count
FROM raw_json_data,
     LATERAL FLATTEN(input => data:items) f
GROUP BY 1, 2, 3, 4, 5, 6;

-- Example 7: XML processing
CREATE TABLE xml_data (
    id INTEGER,
    data VARIANT
);

COPY INTO xml_data
FROM @xml_stage/
FILE_FORMAT = (TYPE = XML);

-- Query XML data
SELECT 
    XMLGET(data, 'order_id'):t::STRING as order_id,
    XMLGET(data, 'customer'):t:name:t::STRING as customer_name,
    XMLGET(data, 'total'):t::NUMBER(10,2) as total
FROM xml_data;

-- Example 8: Advanced array manipulation
SELECT 
    data:order_id::STRING as order_id,
    ARRAY_CONSTRUCT('item1', 'item2', 'item3') as static_array,
    ARRAY_CAT(
        data:tags::ARRAY,
        ARRAY_CONSTRUCT('new_tag')
    ) as extended_tags,
    ARRAY_INTERSECTION(
        data:tags::ARRAY,
        ARRAY_CONSTRUCT('tag1', 'tag3')
    ) as matching_tags
FROM raw_json_data
WHERE ARRAY_SIZE(data:tags::ARRAY) > 0;

-- Example 9: Object manipulation
SELECT 
    data:order_id::STRING as order_id,
    OBJECT_CONSTRUCT(
        'order_id', data:order_id::STRING,
        'total', data:total::NUMBER,
        'item_count', ARRAY_SIZE(data:items::ARRAY)
    ) as order_summary,
    OBJECT_KEYS(data:customer::OBJECT) as customer_fields
FROM raw_json_data;

-- Example 10: Schema evolution handling
-- Add new field to JSON without schema change
INSERT INTO raw_json_data (id, data)
SELECT 
    1001,
    PARSE_JSON('{
        "order_id": "ORD-NEW-001",
        "customer": {"name": "New Customer"},
        "new_field": "This field didn''t exist before",
        "nested": {"level1": {"level2": "deep value"}}
    }');

-- Query handles missing fields gracefully
SELECT 
    data:order_id::STRING,
    data:new_field::STRING,
    data:nonexistent::STRING,  -- Returns NULL
    data:nested.level1.level2::STRING
FROM raw_json_data;

Performance Metrics

Operation	Time Complexity	Typical Latency	Optimization
JSON Parse	O(n) where n = document size	1-10ms per document	Use STRIP_OUTER_ARRAY
Path Resolution	O(log m) where m = depth	0.1-1ms	Use explicit paths
FLATTEN	O(k) where k = array size	1-100ms	Limit array size
XMLGET	O(d) where d = document depth	1-5ms	Cache path results
OBJECT_CONSTRUCT	O(p) where p = properties	0.5-2ms	Minimize nesting

File Format	Ingestion Speed	Query Performance	Storage Efficiency
JSON	Fast	Medium	Medium
Parquet	Very Fast	Fast	High
Avro	Fast	Medium	High
XML	Slow	Slow	Medium
CSV	Very Fast	Fast (if structured)	Low

Best Practices

Extract frequently queried fields: For high-performance analytics, extract commonly accessed JSON fields into dedicated columns using materialized views or ETL processes.
Use appropriate file formats: Prefer Parquet over JSON for analytics workloads due to built-in compression, schema information, and predicate pushdown.
Limit array sizes: Avoid extremely large arrays in VARIANT columns. Use FLATTEN with LIMIT for sampling and consider breaking large arrays into multiple rows.
Implement schema validation: Use JSON Schema validation in ingestion pipelines to ensure data quality before loading into VARIANT columns.
Use TRY_ functions: Prefer TRY_PARSE_JSON over PARSE_JSON to handle malformed data gracefully without failing the entire load.
Optimize FLATTEN usage: Use FLATTEN only when necessary and consider materializing flattened results for repeated queries.
Leverage path statistics: Query performance improves with consistent path usage. Avoid dynamic paths that vary per document.
Consider data modeling: For frequently joined or aggregated data, normalize semi-structured data into relational tables.
Monitor VARIANT column sizes: Large VARIANT documents can impact query performance. Consider splitting very large documents into multiple columns or tables.
Use appropriate compression: Enable compression on VARIANT columns to reduce storage costs and improve I/O performance.

Semi-Structured Data: VARIANT, JSON, XML & Array Functions

Semi-Structured Data: VARIANT, JSON, XML & Array Functions

Architecture Diagram 1: VARIANT Storage Architecture

Architecture Diagram 2: Semi-Structured Data Parsing Pipeline

Architecture Diagram 3: VARIANT Storage Architecture

Architecture Diagram 4: JSON Parsing Pipeline

Architecture Diagram 5: XML Processing Architecture

Detailed Explanation

What is the VARIANT Data Type?

VARIANT Storage and Access

JSON Processing Deep Dive

XML Processing

Array and Object Functions

Schema Evolution and Flexibility

Key Concepts Table

Code Examples

Performance Metrics

Best Practices

See Also

Need Expert Snowflake Help?