Semi-Structured Data: VARIANT, JSON, XML & Array Functions
Architecture Diagram 1: VARIANT Storage Architecture
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β VARIANT DATA TYPE ARCHITECTURE β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β INPUT: Semi-structured data (JSON, XML, Arrays) β
β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β Example JSON Document: β β
β β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β
β β β { β β β
β β β "order_id": "ORD-2024-001", β β β
β β β "customer": { β β β
β β β "name": "John Smith", β β β
β β β "email": "john@example.com", β β β
β β β "address": { β β β
β β β "street": "123 Main St", β β β
β β β "city": "New York", β β β
β β β "state": "NY", β β β
β β β "zip": "10001" β β β
β β β } β β β
β β β }, β β β
β β β "items": [ β β β
β β β { β β β
β β β "product_id": "PROD-001", β β β
β β β "name": "Laptop", β β β
β β β "quantity": 1, β β β
β β β "price": 999.99 β β β
β β β }, β β β
β β β { β β β
β β β "product_id": "PROD-002", β β β
β β β "name": "Mouse", β β β
β β β "quantity": 2, β β β
β β β "price": 29.99 β β β
β β β } β β β
β β β ], β β β
β β β "total": 1059.97, β β β
β β β "currency": "USD", β β β
β β β "timestamp": "2024-01-15T10:30:00Z" β β β
β β β } β β β
β β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β β
β βΌ β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β VARIANT STORAGE FORMAT β β
β β β β
β β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β
β β β Column: data (VARIANT) β β β
β β β β β β
β β β Internal Representation: β β β
β β β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β β
β β β β βββββββββββ¬ββββββββββ¬ββββββββββ¬ββββββββββ¬ββββββββββββ β β β
β β β β β Type β Key β Value β Child β Next ββ β β β
β β β β β Byte β Index β Offset β Pointerβ Pointer ββ β β β
β β β β βββββββββββΌββββββββββΌββββββββββΌββββββββββΌβββββββββββ€β β β β
β β β β β 0x01 β NULL β NULL β 0x0001 β 0x0000 ββ β β β
β β β β β (OBJECT)β β β β ββ β β β
β β β β βββββββββββΌββββββββββΌββββββββββΌββββββββββΌβββββββββββ€β β β β
β β β β β 0x02 β 0x0001 β 0x0010 β NULL β 0x0002 ββ β β β
β β β β β (STRING)β order_idβ β β ββ β β β
β β β β βββββββββββΌββββββββββΌββββββββββΌββββββββββΌβββββββββββ€β β β β
β β β β β 0x01 β 0x0002 β 0x0020 β 0x0003 β 0x0004 ββ β β β
β β β β β (OBJECT)β customerβ β β ββ β β β
β β β β βββββββββββ΄ββββββββββ΄ββββββββββ΄ββββββββββ΄ββββββββββββ β β β
β β β β β β β β
β β β β Value Storage: β β β β
β β β β β’ Short strings: Inline in metadata β β β β
β β β β β’ Long strings: Pointers to column storage β β β β
β β β β β’ Numbers: Direct storage in metadata β β β β
β β β β β’ Booleans: Single byte flags β β β β
β β β β β’ Nulls: Type indicator only β β β β
β β β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β β
β β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β β
β βΌ β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β QUERY ACCESS PATHS β β
β β β β
β β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β
β β β β β β
β β β Dot Notation: data:customer.name β β β
β β β β β β β β
β β β β βΌ β β β
β β β β βββββββββββββββββββ β β β
β β β βββββββββββββΆβ Path Resolver β β β β
β β β β (Metadata β β β β
β β β β Lookup) β β β β
β β β ββββββββββ¬βββββββββ β β β
β β β β β β β
β β β βΌ β β β
β β β βββββββββββββββββββ β β β
β β β β Value Extractorβ β β β
β β β β (Type Check) β β β β
β β β ββββββββββ¬βββββββββ β β β
β β β β β β β
β β β βΌ β β β
β β β βββββββββββββββββββ β β β
β β β β Return Value β β β β
β β β β "John Smith" β β β β
β β β βββββββββββββββββββ β β β
β β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Architecture Diagram 2: Semi-Structured Data Parsing Pipeline
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β SEMI-STRUCTURED DATA PARSING PIPELINE β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β STAGE 1: RAW DATA INGESTION β
β ββββββββββββββββββββββββββββββ β
β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β File: events.json β β
β β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β
β β β [ β β β
β β β {"event": "click", "user": {"id": 123, "name": "Alice"}},β β β
β β β {"event": "view", "user": {"id": 456, "name": "Bob"}}, β β β
β β β {"event": "purchase", "items": [{"id": 1, "qty": 2}]} β β β
β β β ] β β β
β β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β
β β β β
β β Target Table: raw_events (data VARIANT) β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β β
β βΌ β
β STAGE 2: VARIANT PARSING β
β βββββββββββββββββββββββββ β
β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β β β
β β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β
β β β VARIANT Column Contents: β β β
β β β β β β
β β β Row 1: β β β
β β β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β β
β β β β { β β β β
β β β β "event": "click", β β β β
β β β β "user": { β β β β
β β β β "id": 123, β β β β
β β β β "name": "Alice" β β β β
β β β β } β β β β
β β β β } β β β β
β β β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β β
β β β β β β
β β β Row 2: β β β
β β β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β β
β β β β { β β β β
β β β β "event": "view", β β β β
β β β β "user": { β β β β
β β β β "id": 456, β β β β
β β β β "name": "Bob" β β β β
β β β β } β β β β
β β β β } β β β β
β β β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β β
β β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β β
β βΌ β
β STAGE 3: QUERY AND EXTRACTION β
β βββββββββββββββββββββββββββββββ β
β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β β β
β β SELECT β β
β β data:event::STRING as event_type, β β
β β data:user:id::INTEGER as user_id, β β
β β data:user:name::STRING as user_name, β β
β β data:items[0]:id::INTEGER as first_item_id β β
β β FROM raw_events; β β
β β β β
β β Result: β β
β β ββββββββββββββββ¬ββββββββββ¬βββββββββββββ¬βββββββββββββββββ β β
β β β event_type β user_id β user_name β first_item_id β β β
β β ββββββββββββββββΌββββββββββΌβββββββββββββΌβββββββββββββββββ€ β β
β β β click β 123 β Alice β NULL β β β
β β β view β 456 β Bob β NULL β β β
β β β purchase β NULL β NULL β 1 β β β
β β ββββββββββββββββ΄ββββββββββ΄βββββββββββββ΄βββββββββββββββββ β β
β β β β
β β Note: NULL values where path doesn't exist in document β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β β
β βΌ β
β STAGE 4: FLATTENING ARRAYS β
β ββββββββββββββββββββββββββββ β
β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β β β
β β SELECT β β
β β data:event::STRING as event_type, β β
β β f.value:id::INTEGER as item_id, β β
β β f.value:qty::INTEGER as quantity β β
β β FROM raw_events, β β
β β LATERAL FLATTEN(input => data:items) f; β β
β β β β
β β Result: β β
β β ββββββββββββββββ¬ββββββββββ¬βββββββββββ β β
β β β event_type β item_id β quantity β β β
β β ββββββββββββββββΌββββββββββΌβββββββββββ€ β β
β β β purchase β 1 β 2 β β β
β β ββββββββββββββββ΄ββββββββββ΄βββββββββββ β β
β β β β
β β Note: Only rows with items array are expanded β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Architecture Diagram 3: XML Processing Architecture
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β XML PROCESSING ARCHITECTURE β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β INPUT: XML Document β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β <?xml version="1.0" encoding="UTF-8"?> β β
β β <order> β β
β β <order_id>ORD-2024-001</order_id> β β
β β <customer> β β
β β <name>John Smith</name> β β
β β <email>john@example.com</email> β β
β β </customer> β β
β β <items> β β
β β <item> β β
β β <product_id>PROD-001</product_id> β β
β β <name>Laptop</name> β β
β β <quantity>1</quantity> β β
β β </item> β β
β β <item> β β
β β <product_id>PROD-002</product_id> β β
β β <name>Mouse</name> β β
β β <quantity>2</quantity> β β
β β </item> β β
β β </items> β β
β β <total>1059.97</total> β β
β β </order> β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β β
β βΌ β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β XML PARSING PIPELINE β β
β β β β
β β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β
β β β Step 1: Ingest as VARIANT (automatic XML parsing) β β β
β β β β β β
β β β COPY INTO xml_orders β β β
β β β FROM @xml_stage/ β β β
β β β FILE_FORMAT = (TYPE = XML); β β β
β β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β
β β β β β
β β βΌ β β
β β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β
β β β Step 2: Query with FLATTEN for hierarchical access β β β
β β β β β β
β β β SELECT β β β
β β β XMLGET(data, 'order_id'):t::STRING as order_id, β β β
β β β XMLGET(data, 'customer'):t:customer:name:t::STRING β β β
β β β FROM xml_orders; β β β
β β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β
β β β β β
β β βΌ β β
β β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β
β β β Step 3: Flatten repeated elements (items) β β β
β β β β β β
β β β SELECT β β β
β β β XMLGET(data, 'order_id'):t::STRING as order_id, β β β
β β β XMLGET(f.value, 'product_id'):t::STRING as product_id, β β β
β β β XMLGET(f.value, 'name'):t::STRING as item_name, β β β
β β β XMLGET(f.value, 'quantity'):t::INTEGER as quantity β β β
β β β FROM xml_orders, β β β
β β β LATERAL FLATTEN(input => XMLGET(data, 'items'):t) f; β β β
β β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β β
β βΌ β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β OUTPUT: Relational Table β β
β β β β
β β ββββββββββββββββββββ¬βββββββββββββ¬βββββββββββ¬βββββββββββ β β
β β β order_id β product_id β item_nameβ quantity β β β
β β ββββββββββββββββββββΌβββββββββββββΌβββββββββββΌβββββββββββ€ β β
β β β ORD-2024-001 β PROD-001 β Laptop β 1 β β β
β β β ORD-2024-001 β PROD-002 β Mouse β 2 β β β
β β ββββββββββββββββββββ΄βββββββββββββ΄βββββββββββ΄βββββββββββ β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
VARIANT is Snowflake's universal container for semi-structured data (JSON, XML, BSON, Avro, Parquet). It automatically parses and stores hierarchical data in a compressed binary format, enabling SQL queries on nested structures using dot notation and bracket notation without explicit schema definitions.
FLATTEN converts semi-structured arrays into individual rows for relational processing. It supports output modes (OBJECT, ARRAY, BOTH) and preserves array indexes. The function uses lazy evaluation β materializing elements only when accessed by the query.
Use TRY_PARSE_JSON over PARSE_JSON to handle malformed data gracefully. Extract frequently queried VARIANT fields into dedicated columns using materialized views for high-performance analytics. Prefer Parquet over raw JSON for analytics workloads.
- VARIANT stores JSON/XML/BSON natively with automatic parsing
- Dot notation (data:key) for object access, bracket notation for arrays
- FLATTEN converts arrays to rows for relational processing
- Schema evolution: New fields added without DDL changes (missing fields return NULL)
- Performance tradeoff: VARIANT is flexible but slower than extracted columns for analytics
Detailed Explanation
VARIANT Data Type Fundamentals
Snowflake's VARIANT data type is a universal container for semi-structured data, capable of storing JSON, XML, BSON, Avro, Parquet, and other hierarchical data formats natively. Unlike traditional databases that require rigid schema definitions, VARIANT automatically parses and stores semi-structured data while maintaining the ability to query individual elements using familiar SQL syntax. The data type uses an optimized internal storage format that preserves the hierarchical structure while enabling efficient access patterns.
When semi-structured data is loaded into a VARIANT column, Snowflake performs automatic parsing and validation. The parser identifies the data format, extracts key-value pairs and array elements, and stores them in a compressed binary format optimized for query performance. Short string values are stored inline within the metadata structure, while longer values are stored in separate column storage with pointers from the metadata. This hybrid storage approach balances query performance with storage efficiency.
VARIANT columns support two access patterns: dot notation for simple hierarchical navigation and bracket notation for paths containing special characters or dynamic elements. The path resolution engine uses metadata to quickly locate values without scanning entire documents. For frequently queried paths, Snowflake maintains path statistics that enable the optimizer to make efficient access decisions.
JSON Processing Deep Dive
JSON is the most common semi-structured format in Snowflake, and the platform provides comprehensive support for JSON data handling. The JSON parser handles all standard JSON data types including objects, arrays, strings, numbers, booleans, and null values. It also handles edge cases like malformed JSON, duplicate keys, and non-standard numeric formats through configurable error handling options.
When querying JSON data, Snowflake uses a combination of metadata lookups and value extraction to efficiently retrieve requested elements. The dot notation syntax (data:customer.name) traverses the object hierarchy, while array indexing (data:items[0]) accesses specific elements. Type casting is required when extracting values for comparison or aggregation, using the ::TYPE syntax (data:amount::NUMBER).
The FLATTEN function is essential for working with JSON arrays, converting array elements into individual rows that can be joined, aggregated, and analyzed like relational data. FLATTEN supports different output modes (OBJECT, ARRAY, BOTH) and can preserve the array index for ordering operations. The function performs lazy evaluation, only materializing array elements when they're actually accessed by the query.
XML Processing Capabilities
Snowflake provides native XML processing through the VARIANT data type and XMLGET function. When XML data is loaded with TYPE = XML file format, Snowflake automatically converts the hierarchical XML structure into a VARIANT representation that supports the same query patterns as JSON data. The XMLGET function provides explicit access to XML elements, supporting both element content and attribute values.
XML processing in Snowflake handles common XML patterns including namespaces, attributes, mixed content, and CDATA sections. The parser maintains the document structure while converting it to an optimized internal representation. For large XML documents, Snowflake can process them incrementally, avoiding memory limitations associated with loading entire documents into memory.
Array and Object Functions
Snowflake provides a rich set of functions for manipulating semi-structured data. Array functions (ARRAY_AGG, ARRAY_CONSTRUCT, ARRAY_CAT, ARRAY_INTERSECTION) enable array construction, manipulation, and analysis. Object functions (OBJECT_CONSTRUCT, OBJECT_KEYS, OBJECT_DELETE) support object creation and modification. These functions operate directly on VARIANT data, eliminating the need to extract data into separate tables for processing.
The ARRAY_AGG function is particularly powerful for converting relational data into semi-structured formats, enabling aggregation of multiple rows into single array values. This capability supports scenarios like building nested JSON structures from flat tables or creating complex data objects for API responses. The function supports ORDER BY and DISTINCT clauses for controlling array element ordering.
Schema Evolution and Flexibility
Semi-structured data in Snowflake provides natural schema evolution capabilities. New fields can be added to JSON or XML documents without modifying table structures, and queries gracefully handle missing fields by returning NULL values. This flexibility enables iterative data model changes without requiring DDL operations or data migration.
However, this flexibility comes with performance implications. Queries that access many different paths may experience slower performance due to metadata lookup overhead. For high-performance analytics on frequently accessed fields, consider extracting commonly queried elements into dedicated columns using INSERT...SELECT or materialized views. This hybrid approach combines the flexibility of semi-structured storage with the performance of relational queries.
Key Concepts Table
| Function | Purpose | Input | Output |
|---|---|---|---|
| FLATTEN | Convert arrays to rows | ARRAY VARIANT | Table of VARIANT |
| PARSE_JSON | Parse JSON string | STRING | VARIANT |
| TRY_PARSE_JSON | Safe JSON parse | STRING | VARIANT (NULL on error) |
| OBJECT_CONSTRUCT | Create JSON object | Key-value pairs | VARIANT |
| ARRAY_CONSTRUCT | Create JSON array | Elements | VARIANT |
| XMLGET | Extract XML element | VARIANT, path | VARIANT |
| TO_VARIANT | Convert to VARIANT | Any type | VARIANT |
| Data Type | Storage | Access Pattern | Performance |
|---|---|---|---|
| STRING | Raw text | Full scan required | Slow |
| VARIANT | Parsed binary | Path navigation | Fast |
| OBJECT | Parsed binary | Key lookup | Very fast |
| ARRAY | Parsed binary | Index access | Very fast |
| GEOGRAPHY | Binary geometry | Spatial functions | Fast |
| Metric | VARIANT | Flattened Columns | Description |
|---|---|---|---|
| Storage Efficiency | High (compressed) | Medium (redundant) | VARIANT stores once |
| Query Performance | Medium | High | Flattened columns avoid parsing |
| Flexibility | Very High | Low | VARIANT adapts to schema changes |
| Analytics | Limited | Full | Flattened columns enable all SQL |
| Indexing | Limited | Full | Flattened columns support clustering |
Code Examples
-- Example 1: Create table with VARIANT column
CREATE TABLE raw_json_data (
id INTEGER,
data VARIANT,
load_timestamp TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP()
);
-- Example 2: Ingest JSON data
COPY INTO raw_json_data (id, data)
FROM @json_stage/
FILE_FORMAT = (
TYPE = JSON
STRIP_OUTER_ARRAY = FALSE
IGNORE_UTF8_ERRORS = TRUE
);
-- Example 3: Query JSON data with dot notation
SELECT
data:order_id::STRING as order_id,
data:customer.name::STRING as customer_name,
data:customer.email::STRING as customer_email,
data:total::NUMBER(10,2) as order_total,
data:currency::STRING as currency
FROM raw_json_data
WHERE data:customer.name::STRING LIKE '%Smith%';
-- Example 4: Flatten JSON arrays
SELECT
data:order_id::STRING as order_id,
f.value:product_id::STRING as product_id,
f.value:name::STRING as product_name,
f.value:quantity::INTEGER as quantity,
f.value:price::NUMBER(10,2) as price
FROM raw_json_data,
LATERAL FLATTEN(input => data:items) f;
-- Example 5: Aggregate flattened data back to arrays
SELECT
data:order_id::STRING as order_id,
ARRAY_AGG(
OBJECT_CONSTRUCT(
'product_id', f.value:product_id::STRING,
'name', f.value:name::STRING,
'quantity', f.value:quantity::INTEGER
)
) as items_array
FROM raw_json_data,
LATERAL FLATTEN(input => data:items) f
GROUP BY 1;
-- Example 6: Create structured table from semi-structured data
CREATE TABLE orders_structured AS
SELECT
data:order_id::STRING as order_id,
data:customer.name::STRING as customer_name,
data:customer.email::STRING as customer_email,
data:total::NUMBER(10,2) as order_total,
data:currency::STRING as currency,
data:timestamp::TIMESTAMP_NTZ as order_timestamp,
COUNT(f.value) as item_count
FROM raw_json_data,
LATERAL FLATTEN(input => data:items) f
GROUP BY 1, 2, 3, 4, 5, 6;
-- Example 7: XML processing
CREATE TABLE xml_data (
id INTEGER,
data VARIANT
);
COPY INTO xml_data
FROM @xml_stage/
FILE_FORMAT = (TYPE = XML);
-- Query XML data
SELECT
XMLGET(data, 'order_id'):t::STRING as order_id,
XMLGET(data, 'customer'):t:name:t::STRING as customer_name,
XMLGET(data, 'total'):t::NUMBER(10,2) as total
FROM xml_data;
-- Example 8: Advanced array manipulation
SELECT
data:order_id::STRING as order_id,
ARRAY_CONSTRUCT('item1', 'item2', 'item3') as static_array,
ARRAY_CAT(
data:tags::ARRAY,
ARRAY_CONSTRUCT('new_tag')
) as extended_tags,
ARRAY_INTERSECTION(
data:tags::ARRAY,
ARRAY_CONSTRUCT('tag1', 'tag3')
) as matching_tags
FROM raw_json_data
WHERE ARRAY_SIZE(data:tags::ARRAY) > 0;
-- Example 9: Object manipulation
SELECT
data:order_id::STRING as order_id,
OBJECT_CONSTRUCT(
'order_id', data:order_id::STRING,
'total', data:total::NUMBER,
'item_count', ARRAY_SIZE(data:items::ARRAY)
) as order_summary,
OBJECT_KEYS(data:customer::OBJECT) as customer_fields
FROM raw_json_data;
-- Example 10: Schema evolution handling
-- Add new field to JSON without schema change
INSERT INTO raw_json_data (id, data)
SELECT
1001,
PARSE_JSON('{
"order_id": "ORD-NEW-001",
"customer": {"name": "New Customer"},
"new_field": "This field didn't exist before",
"nested": {"level1": {"level2": "deep value"}}
}');
-- Query handles missing fields gracefully
SELECT
data:order_id::STRING,
data:new_field::STRING,
data:nonexistent::STRING, -- Returns NULL
data:nested.level1.level2::STRING
FROM raw_json_data;
Performance Metrics
| Operation | Time Complexity | Typical Latency | Optimization |
|---|---|---|---|
| JSON Parse | O(n) where n = document size | 1-10ms per document | Use STRIP_OUTER_ARRAY |
| Path Resolution | O(log m) where m = depth | 0.1-1ms | Use explicit paths |
| FLATTEN | O(k) where k = array size | 1-100ms | Limit array size |
| XMLGET | O(d) where d = document depth | 1-5ms | Cache path results |
| OBJECT_CONSTRUCT | O(p) where p = properties | 0.5-2ms | Minimize nesting |
| File Format | Ingestion Speed | Query Performance | Storage Efficiency |
|---|---|---|---|
| JSON | Fast | Medium | Medium |
| Parquet | Very Fast | Fast | High |
| Avro | Fast | Medium | High |
| XML | Slow | Slow | Medium |
| CSV | Very Fast | Fast (if structured) | Low |
Best Practices
-
Extract frequently queried fields: For high-performance analytics, extract commonly accessed JSON fields into dedicated columns using materialized views or ETL processes.
-
Use appropriate file formats: Prefer Parquet over JSON for analytics workloads due to built-in compression, schema information, and predicate pushdown.
-
Limit array sizes: Avoid extremely large arrays in VARIANT columns. Use FLATTEN with LIMIT for sampling and consider breaking large arrays into multiple rows.
-
Implement schema validation: Use JSON Schema validation in ingestion pipelines to ensure data quality before loading into VARIANT columns.
-
Use TRY_ functions: Prefer TRY_PARSE_JSON over PARSE_JSON to handle malformed data gracefully without failing the entire load.
-
Optimize FLATTEN usage: Use FLATTEN only when necessary and consider materializing flattened results for repeated queries.
-
Leverage path statistics: Query performance improves with consistent path usage. Avoid dynamic paths that vary per document.
-
Consider data modeling: For frequently joined or aggregated data, normalize semi-structured data into relational tables.
-
Monitor VARIANT column sizes: Large VARIANT documents can impact query performance. Consider splitting very large documents into multiple columns or tables.
-
Use appropriate compression: Enable compression on VARIANT columns to reduce storage costs and improve I/O performance.
See Also
- PySpark Iceberg - Semi-structured data with Iceberg
- Delta Lake on Databricks - Delta Lake semi-structured support
- Data Warehouse Concepts - Data warehouse design principles