Snowflake Advanced Β· Interview Prep
UDFs, Stored Procedures & Snowpark
Difficulty: Hard Β· Commonly asked at Microsoft, Meta, Amazon
Interview Question
"When would you use a UDF vs a stored procedure vs Snowpark? Walk me through the performance implications and use cases for each. How do you handle state management in UDFs?"
βΉοΈ
Companies Asking This: Microsoft (Principal Data Engineer), Meta (Staff Data Engineer), Amazon (L6 Data Engineer), Google (Senior Data Engineer)
UDFs (User-Defined Functions)
UDFs are custom functions that can be used in SQL queries. They run per-row (scalar) or per-group (aggregate) and are pushed down to the warehouse.
Scalar UDFs
-- Create a scalar UDF for tax calculation
CREATE OR REPLACE FUNCTION calculate_tax(amount NUMBER, tax_rate NUMBER)
RETURNS NUMBER
LANGUAGE SQL
AS
'amount * tax_rate / 100';
-- Use in queries
SELECT
order_id,
total_amount,
calculate_tax(total_amount, 8.5) AS tax,
total_amount + calculate_tax(total_amount, 8.5) AS total_with_tax
FROM orders;
-- UDF with more complex logic
CREATE OR REPLACE FUNCTION categorize_amount(amount NUMBER)
RETURNS VARCHAR
LANGUAGE SQL
AS
$$
CASE
WHEN amount < 100 THEN 'LOW'
WHEN amount < 1000 THEN 'MEDIUM'
WHEN amount < 10000 THEN 'HIGH'
ELSE 'PREMIUM'
END
$$;
-- UDF with string manipulation
CREATE OR REPLACE FUNCTION mask_email(email VARCHAR)
RETURNS VARCHAR
LANGUAGE SQL
AS
$$
CONCAT(
LEFT(email, 2),
REPEAT('*', LENGTH(email) - POSITION('@' IN email)),
'@',
RIGHT(email, LENGTH(email) - POSITION('@' IN email) + 1)
)
$$;
-- Use mask_email
SELECT mask_email('john.doe@example.com') AS masked_email;
-- Result: jd*****@example.com
JavaScript UDFs
-- JavaScript UDF for complex string processing
CREATE OR REPLACE FUNCTION parse_user_agent(ua_string VARCHAR)
RETURNS OBJECT
LANGUAGE JAVASCRIPT
AS
$$
var result = {};
if (ua_string.includes('Chrome')) {
result.browser = 'Chrome';
} else if (ua_string.includes('Firefox')) {
result.browser = 'Firefox';
} else if (ua_string.includes('Safari')) {
result.browser = 'Safari';
} else {
result.browser = 'Unknown';
}
if (ua_string.includes('Windows')) {
result.os = 'Windows';
} else if (ua_string.includes('Mac')) {
result.os = 'MacOS';
} else if (ua_string.includes('Linux')) {
result.os = 'Linux';
} else {
result.os = 'Unknown';
}
return result;
$$;
-- Use JavaScript UDF
SELECT
user_agent,
parse_user_agent(user_agent):browser::VARCHAR AS browser,
parse_user_agent(user_agent):os::VARCHAR AS operating_system
FROM user_sessions;
Aggregate UDFs
-- Create an aggregate UDF for weighted average
CREATE OR REPLACE FUNCTION weighted_avg(value NUMBER, weight NUMBER)
RETURNS NUMBER
LANGUAGE SQL
AS
'SUM(value * weight) / NULLIF(SUM(weight), 0)';
-- Use in aggregate context
SELECT
product_category,
weighted_avg(rating, review_count) AS weighted_avg_rating
FROM product_reviews
GROUP BY 1;
Stored Procedures
Stored procedures encapsulate business logic and can perform DDL/DML operations. They support transactions and can call other procedures.
SQL Stored Procedures
-- Create a stored procedure for data quality checks
CREATE OR REPLACE PROCEDURE check_data_quality(table_name VARCHAR)
RETURNS VARCHAR
LANGUAGE SQL
AS
$$
DECLARE
total_rows NUMBER;
null_count NUMBER;
duplicate_count NUMBER;
result VARCHAR;
BEGIN
-- Count total rows
EXECUTE IMMEDIATE 'SELECT COUNT(*) FROM ' || table_name INTO total_rows;
-- Count nulls in key columns
EXECUTE IMMEDIATE
'SELECT COUNT(*) FROM ' || table_name ||
' WHERE order_id IS NULL OR customer_id IS NULL'
INTO null_count;
-- Count duplicates
EXECUTE IMMEDIATE
'SELECT COUNT(*) FROM (SELECT order_id, COUNT(*) c FROM ' ||
table_name || ' GROUP BY order_id HAVING c > 1)'
INTO duplicate_count;
result := 'Table: ' || table_name ||
' | Total: ' || total_rows ||
' | Nulls: ' || null_count ||
' | Duplicates: ' || duplicate_count;
RETURN result;
END;
$$;
-- Call the procedure
CALL check_data_quality('ORDERS');
-- Procedure with error handling
CREATE OR REPLACE PROCEDURE safe_delete(schema_name VARCHAR, table_name VARCHAR)
RETURNS VARCHAR
LANGUAGE SQL
AS
$$
DECLARE
row_count NUMBER;
BEGIN
EXECUTE IMMEDIATE 'SELECT COUNT(*) FROM ' || schema_name || '.' || table_name
INTO row_count;
IF row_count > 1000000 THEN
RETURN 'ERROR: Table has ' || row_count || ' rows. Use TRUNCATE for large tables.';
ELSE
EXECUTE IMMEDIATE 'DELETE FROM ' || schema_name || '.' || table_name;
RETURN 'Deleted ' || row_count || ' rows from ' || table_name;
END IF;
EXCEPTION
WHEN OTHER THEN
RETURN 'ERROR: ' || SQLERRM;
END;
$$;
Python Stored Procedures (Snowpark)
-- Create Python stored procedure
CREATE OR REPLACE PROCEDURE process_data_python(input_table VARCHAR, output_table VARCHAR)
RETURNS VARCHAR
LANGUAGE PYTHON
RUNTIME_VERSION = '3.10'
PACKAGES = ('snowflake-snowpark-python')
HANDLER = 'main'
AS
$$
from snowflake.snowpark import Session
from snowflake.snowpark.functions import col, avg, count, sum as snowpark_sum
def main(session: Session, input_table: str, output_table: str) -> str:
# Read input data
df = session.table(input_table)
# Perform transformations
result = df.groupBy("region", "product_category").agg(
count("*").alias("transaction_count"),
snowpark_sum("amount").alias("total_revenue"),
avg("amount").alias("avg_transaction")
).filter(col("total_revenue") > 10000)
# Write to output table
result.write.mode("overwrite").save_as_table(output_table)
return f"Processed data into {output_table}"
$$;
Real-World Scenario: Microsoft
Question: "How would you implement a real-time data quality framework using UDFs and stored procedures?"
Solution: Data Quality Framework
-- 1. Create quality rules table
CREATE TABLE data_quality_rules (
rule_id NUMBER AUTOINCREMENT,
table_name VARCHAR(100),
column_name VARCHAR(100),
rule_type VARCHAR(50),
rule_expression VARCHAR(2000),
severity VARCHAR(20),
enabled BOOLEAN DEFAULT TRUE,
created_at TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP()
);
-- Insert quality rules
INSERT INTO data_quality_rules (table_name, column_name, rule_type, rule_expression, severity)
VALUES
('ORDERS', 'ORDER_ID', 'NOT_NULL', 'ORDER_ID IS NOT NULL', 'CRITICAL'),
('ORDERS', 'AMOUNT', 'RANGE', 'AMOUNT > 0 AND AMOUNT < 1000000', 'HIGH'),
('ORDERS', 'STATUS', 'ENUM', 'STATUS IN (''PENDING'', ''SHIPPED'', ''DELIVERED'', ''CANCELLED'')', 'MEDIUM'),
('ORDERS', 'ORDER_DATE', 'FUTURE_CHECK', 'ORDER_DATE <= CURRENT_DATE()', 'HIGH');
-- 2. Create quality check UDF
CREATE OR REPLACE FUNCTION check_quality_rule(
column_value VARCHAR,
rule_expression VARCHAR
)
RETURNS BOOLEAN
LANGUAGE SQL
AS
'TRY_CAST(column_value AS NUMBER) IS NOT NULL OR column_value IS NULL';
-- 3. Create quality check procedure
CREATE OR REPLACE PROCEDURE run_data_quality_checks(table_name VARCHAR)
RETURNS VARIANT
LANGUAGE SQL
AS
$$
DECLARE
results ARRAY DEFAULT ARRAY_CONSTRUCT();
rule_record RECORD;
violation_count NUMBER;
total_count NUMBER;
BEGIN
-- Get all enabled rules for the table
FOR rule_record IN
SELECT rule_id, column_name, rule_type, rule_expression, severity
FROM data_quality_rules
WHERE table_name = :table_name AND enabled = TRUE
DO
-- Execute quality check
EXECUTE IMMEDIATE
'SELECT COUNT(*) FROM ' || :table_name ||
' WHERE NOT (' || rule_record.rule_expression || ')'
INTO violation_count;
EXECUTE IMMEDIATE
'SELECT COUNT(*) FROM ' || :table_name
INTO total_count;
-- Add result
results := ARRAY_APPEND(results, OBJECT_CONSTRUCT(
'rule_id', rule_record.rule_id,
'column', rule_record.column_name,
'rule_type', rule_record.rule_type,
'severity', rule_record.severity,
'violations', violation_count,
'total_rows', total_count,
'violation_pct', ROUND(violation_count * 100.0 / NULLIF(total_count, 0), 2)
));
END FOR;
-- Log results
INSERT INTO data_quality_results (table_name, check_timestamp, results)
VALUES (:table_name, CURRENT_TIMESTAMP(), PARSE_JSON(TO_VARIANT(:results)));
RETURN OBJECT_CONSTRUCT(
'table', :table_name,
'rules_checked', ARRAY_SIZE(:results),
'timestamp', CURRENT_TIMESTAMP()
);
END;
$$;
-- 4. Schedule quality checks
CREATE OR REPLACE TASK daily_quality_checks
WAREHOUSE = admin_wh
SCHEDULE = 'USING CRON 0 6 * * * UTC'
AS
CALL run_data_quality_checks('ORDERS');
Real-World Scenario: Meta
Question: "How do you handle performance issues with UDFs? The UDF is slowing down queries significantly."
UDF Performance Optimization
-- 1. Identify slow UDF calls
SELECT
query_id,
query_text,
total_elapsed_time_ms,
compilation_time_ms,
execution_time_ms
FROM snowflake.account_usage.query_history
WHERE query_text LIKE '%calculate_tax%' -- Your UDF name
AND start_time >= DATEADD(hour, -24, CURRENT_TIMESTAMP())
ORDER BY total_elapsed_time_ms DESC;
-- 2. Alternative: Inline UDF logic in SQL
-- Instead of UDF:
SELECT order_id, calculate_tax(amount, 8.5) AS tax FROM orders;
-- Inline the logic:
SELECT
order_id,
amount * 8.5 / 100 AS tax
FROM orders;
-- 3. Use Table Functions for complex transformations
CREATE OR REPLACE FUNCTION complex_transform(data ARRAY)
RETURNS TABLE (key VARCHAR, value VARCHAR)
LANGUAGE SQL
AS
$$
SELECT
f.key,
f.value::VARCHAR
FROM LATERAL FLATTEN(input => data) f
$$;
-- 4. Consider Snowpark for complex transformations
-- (Better performance than JavaScript UDFs for complex logic)
Best Practices
| Use Case | Recommended Approach |
|---|---|
| Simple row-level transforms | SQL UDFs |
| Complex string/JSON parsing | JavaScript UDFs |
| DDL/DML operations | Stored Procedures |
| ML/Complex data processing | Snowpark (Python/Java) |
| Data quality checks | Stored Procedures + UDFs |
| Aggregations | SQL Aggregate UDFs |
β οΈ
Performance Tips:
- Prefer SQL UDFs over JavaScript β SQL UDFs are compiled and optimized
- Avoid UDFs in WHERE clauses β They prevent predicate pushdown
- Use lateral flatten sparingly β It's expensive in UDFs
- Consider inlining β Sometimes inline SQL is faster than a UDF call
- Test with PROFILE β Use PROFILE to identify UDF bottlenecks