Beyond the Basics
Advanced SQL separates competent data engineers from exceptional ones. These techniques enable you to solve complex data problems, optimize performance, and build sophisticated transformations.
āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā
ā ADVANCED SQL TECHNIQUES ā
āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā¤
ā Complex Window Functions ā Recursive CTEs ā
ā Pivot / Unpivot ā LATERAL Joins ā
ā Query Execution Plans ā Performance Tuning ā
ā Partitioning Strategies ā Indexing Optimization ā
ā Advanced Aggregations ā Conditional Logic ā
āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā
Complex Window Functions
Advanced Window Frames
-- Cumulative distribution
SELECT
employee_name,
salary,
department,
CUME_DIST() OVER (
PARTITION BY department
ORDER BY salary DESC
) AS percentile,
PERCENT_RANK() OVER (
PARTITION BY department
ORDER BY salary DESC
) AS percentile_rank
FROM employees;
-- First and last values in window
SELECT
order_date,
daily_revenue,
FIRST_VALUE(daily_revenue) OVER (
ORDER BY order_date
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) AS first_day_revenue,
LAST_VALUE(daily_revenue) OVER (
ORDER BY order_date
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) AS last_day_revenue
FROM daily_sales;
-- NTH value
SELECT
product_name,
category,
total_sales,
NTH_VALUE(product_name, 3) OVER (
PARTITION BY category
ORDER BY total_sales DESC
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) AS third_best_seller
FROM product_sales;
Window Functions for Data Quality
-- Detect consecutive days with zero sales
WITH daily_sales AS (
SELECT generate_series('2024-01-01'::date, '2024-12-31'::date, '1 day') AS sale_date
),
actual_sales AS (
SELECT DISTINCT order_date::date AS sale_date
FROM orders
),
missing_days AS (
SELECT d.sale_date
FROM daily_sales d
LEFT JOIN actual_sales a ON d.sale_date = a.sale_date
WHERE a.sale_date IS NULL
),
consecutive_groups AS (
SELECT
sale_date,
sale_date - ROW_NUMBER() OVER (ORDER BY sale_date)::int AS grp
FROM missing_days
)
SELECT
MIN(sale_date) AS gap_start,
MAX(sage_date) AS gap_end,
COUNT(*) AS consecutive_days
FROM consecutive_groups
GROUP BY grp
HAVING COUNT(*) >= 3
ORDER BY gap_start;
-- Find gaps in sequential IDs
WITH numbered AS (
SELECT
order_id,
order_id - ROW_NUMBER() OVER (ORDER BY order_id) AS gap_group
FROM orders
)
SELECT
MIN(order_id) AS gap_start,
MAX(order_id) AS gap_end,
MAX(order_id) - MIN(order_id) - COUNT(*) + 1 AS gap_size
FROM numbered
GROUP BY gap_group
HAVING MAX(order_id) - MIN(order_id) > COUNT(*) - 1
ORDER BY gap_start;
Recursive CTEs
Recursive CTEs enable hierarchical and graph-like queries ā essential for org charts, bill of materials, and pathfinding.
Syntax
WITH RECURSIVE cte_name AS (
-- Base case (anchor member)
SELECT ... FROM table WHERE condition
UNION ALL
-- Recursive member
SELECT ... FROM table
INNER JOIN cte_name ON ...
)
SELECT * FROM cte_name;
Practical Examples
-- Organizational hierarchy
WITH RECURSIVE org_chart AS (
-- Base case: CEO (no manager)
SELECT
employee_id,
name,
manager_id,
1 AS level,
name::text AS path
FROM employees
WHERE manager_id IS NULL
UNION ALL
-- Recursive: employees with managers
SELECT
e.employee_id,
e.name,
e.manager_id,
oc.level + 1,
oc.path || ' ā ' || e.name
FROM employees e
INNER JOIN org_chart oc ON e.manager_id = oc.employee_id
)
SELECT
employee_id,
name,
level,
path
FROM org_chart
ORDER BY path;
-- Bill of materials (product components)
WITH RECURSIVE bom AS (
-- Base case: top-level product
SELECT
product_id,
component_id,
quantity,
1 AS depth,
product_id::text AS path
FROM bill_of_materials
WHERE product_id = 1001
UNION ALL
-- Recursive: sub-components
SELECT
b.product_id,
b.component_id,
b.quantity * bom.quantity AS quantity,
bom.depth + 1,
bom.path || '.' || b.component_id::text
FROM bill_of_materials b
INNER JOIN bom ON b.product_id = bom.component_id
)
SELECT
component_id,
SUM(quantity) AS total_quantity,
MAX(depth) AS max_depth
FROM bom
GROUP BY component_id
ORDER BY component_id;
-- Find all ancestors of a node
WITH RECURSIVE ancestors AS (
SELECT
employee_id,
name,
manager_id,
ARRAY[employee_id] AS visited
FROM employees
WHERE employee_id = 42 -- Target employee
UNION ALL
SELECT
e.employee_id,
e.name,
e.manager_id,
a.visited || e.employee_id
FROM employees e
INNER JOIN ancestors a ON e.employee_id = a.manager_id
WHERE e.employee_id != ALL(a.visited) -- Prevent cycles
)
SELECT employee_id, name FROM ancestors;
Pivot and Unpivot
Manual Pivot (Works in all databases)
-- Pivot: rows to columns
SELECT
product_name,
SUM(CASE WHEN EXTRACT(quarter FROM order_date) = 1 THEN amount ELSE 0 END) AS q1_sales,
SUM(CASE WHEN EXTRACT(quarter FROM order_date) = 2 THEN amount ELSE 0 END) AS q2_sales,
SUM(CASE WHEN EXTRACT(quarter FROM order_date) = 3 THEN amount ELSE 0 END) AS q3_sales,
SUM(CASE WHEN EXTRACT(quarter FROM order_date) = 4 THEN amount ELSE 0 END) AS q4_sales
FROM orders
GROUP BY product_name;
-- Pivot with dynamic columns (PostgreSQL)
DO $$
DECLARE
rec RECORD;
sql TEXT := 'SELECT product_name';
BEGIN
FOR rec IN
SELECT DISTINCT EXTRACT(quarter FROM order_date) AS q
FROM orders
ORDER BY q
LOOP
sql := sql || ', SUM(CASE WHEN EXTRACT(quarter FROM order_date) = '
|| rec.q || ' THEN amount ELSE 0 END) AS q' || rec.q || '_sales';
END LOOP;
sql := sql || ' FROM orders GROUP BY product_name';
RAISE NOTICE '%', sql;
END $$;
Native Pivot (Database-Specific)
-- PostgreSQL: Using crosstab
CREATE EXTENSION IF NOT EXISTS tablefunc;
SELECT *
FROM crosstab(
'SELECT product_name,
EXTRACT(quarter FROM order_date)::int AS quarter,
SUM(amount)::numeric AS sales
FROM orders
GROUP BY product_name, EXTRACT(quarter FROM order_date)
ORDER BY 1, 2',
'SELECT DISTINCT EXTRACT(quarter FROM order_date)::int
FROM orders ORDER BY 1'
) AS ct(product_name text, q1 numeric, q2 numeric, q3 numeric, q4 numeric);
-- SQL Server: PIVOT operator
SELECT *
FROM (
SELECT
product_name,
'Q' + CAST(EXTRACT(quarter FROM order_date) AS VARCHAR) AS quarter,
amount
FROM orders
) src
PIVOT (
SUM(amount)
FOR quarter IN ([Q1], [Q2], [Q3], [Q4])
) pvt;
LATERAL Joins
LATERAL joins allow subqueries to reference columns from preceding tables ā essential for "top-N per group" and correlated calculations.
-- Top 3 orders per customer (LATERAL approach)
SELECT
c.customer_id,
c.first_name,
top_orders.*
FROM customers c
CROSS JOIN LATERAL (
SELECT
order_id,
order_date,
amount
FROM orders o
WHERE o.customer_id = c.customer_id
ORDER BY amount DESC
LIMIT 3
) top_orders;
-- Get most recent order per customer with full details
SELECT
c.customer_id,
c.first_name,
c.last_name,
recent.order_id,
recent.order_date,
recent.total_amount
FROM customers c
CROSS JOIN LATERAL (
SELECT *
FROM orders o
WHERE o.customer_id = c.customer_id
ORDER BY o.order_date DESC
LIMIT 1
) recent;
-- Running totals per customer
SELECT
c.customer_id,
o.order_date,
o.amount,
running.total AS running_total
FROM customers c
INNER JOIN orders o ON c.customer_id = o.customer_id
CROSS JOIN LATERAL (
SELECT SUM(o2.amount) AS total
FROM orders o2
WHERE o2.customer_id = c.customer_id
AND o2.order_date <= o.order_date
) running;
Query Execution Plans
Reading EXPLAIN Output
-- Basic EXPLAIN
EXPLAIN SELECT * FROM orders WHERE customer_id = 123;
-- With actual execution statistics
EXPLAIN ANALYZE
SELECT
c.first_name,
COUNT(o.order_id)
FROM customers c
INNER JOIN orders o ON c.customer_id = o.customer_id
WHERE c.country = 'USA'
GROUP BY c.first_name;
-- Output example:
/*
HashAggregate (cost=12345.67..12345.89 rows=22 width=20)
Group Key: c.first_name
-> Hash Join (cost=100.00..12345.00 rows=50000 width=16)
Hash Cond: (o.customer_id = c.customer_id)
-> Seq Scan on orders o (cost=0.00..10000.00 rows=500000 width=12)
-> Hash (cost=100.00..100.00 rows=5000 width=16)
-> Seq Scan on customers c (cost=0.00..100.00 rows=5000 width=16)
Filter: (country = 'USA')
Planning Time: 0.123 ms
Execution Time: 123.456 ms
*/
Common Execution Plan Operations
| Operation | Description | Performance Impact |
|---|---|---|
| Seq Scan | Reads entire table | Slow on large tables |
| Index Scan | Uses index to find rows | Fast |
| Index Only Scan | Data from index only | Very fast |
| Hash Join | Builds hash table for join | Good for large joins |
| Nested Loop | Iterates and looks up | Good for small datasets |
| Sort | Orders result set | Can be expensive |
| Aggregate | Groups and aggregates | May require sorting |
Optimization Strategies
-- Before: Suboptimal query
EXPLAIN ANALYZE
SELECT *
FROM orders
WHERE EXTRACT(year FROM order_date) = 2024;
-- Problem: Function on column prevents index use
-- Seq Scan on orders (cost=0.00..15000.00 rows=100000)
-- After: Optimized query
EXPLAIN ANALYZE
SELECT *
FROM orders
WHERE order_date >= '2024-01-01'
AND order_date < '2025-01-01';
-- Improvement: Range scan can use index
-- Index Scan using idx_orders_date on orders (cost=0.43..5000.00 rows=100000)
Partitioning Strategies
Range Partitioning
-- PostgreSQL: Create partitioned table
CREATE TABLE orders (
order_id BIGINT,
customer_id INTEGER,
order_date DATE,
amount DECIMAL(10,2)
) PARTITION BY RANGE (order_date);
-- Create partitions
CREATE TABLE orders_2023 PARTITION OF orders
FOR VALUES FROM ('2023-01-01') TO ('2024-01-01');
CREATE TABLE orders_2024 PARTITION OF orders
FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');
CREATE TABLE orders_2025 PARTITION OF orders
FOR VALUES FROM ('2025-01-01') TO ('2026-01-01');
-- Query automatically uses partition pruning
SELECT * FROM orders
WHERE order_date >= '2024-06-01'
AND order_date < '2024-07-01';
-- Only scans orders_2024 partition
Hash Partitioning
-- Distribute data evenly across partitions
CREATE TABLE events (
event_id BIGINT,
user_id INTEGER,
event_type VARCHAR(50),
event_time TIMESTAMP
) PARTITION BY HASH (user_id);
-- Create partitions
CREATE TABLE events_p0 PARTITION OF events
FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE events_p1 PARTITION OF events
FOR VALUES WITH (MODULUS 4, REMAINDER 1);
CREATE TABLE events_p2 PARTITION OF events
FOR VALUES WITH (MODULUS 4, REMAINDER 2);
CREATE TABLE events_p3 PARTITION OF events
FOR VALUES WITH (MODULUS 4, REMAINDER 3);
Advanced Aggregations
-- GROUPING SETS: Multiple aggregation levels
SELECT
COALESCE(country, '(All Countries)') AS country,
COALESCE(city, '(All Cities)') AS city,
SUM(amount) AS total_revenue
FROM orders
GROUP BY GROUPING SETS (
(country, city), -- Country + City level
(country), -- Country level
() -- Grand total
);
-- ROLLUP: Hierarchical aggregation
SELECT
country,
city,
SUM(amount) AS total_revenue
FROM orders
GROUP BY ROLLUP (country, city);
-- Produces: country+city, country, grand total
-- CUBE: All combinations
SELECT
country,
city,
category,
SUM(amount) AS total_revenue
FROM orders
GROUP BY CUBE (country, city, category);
-- Produces: all possible combinations of the 3 columns
-- FILTER clause (PostgreSQL)
SELECT
order_date,
COUNT(*) AS total_orders,
COUNT(*) FILTER (WHERE status = 'completed') AS completed_orders,
COUNT(*) FILTER (WHERE status = 'cancelled') AS cancelled_orders,
SUM(amount) FILTER (WHERE amount > 100) AS high_value_revenue
FROM orders
GROUP BY order_date;
Performance Tuning Checklist
āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā
ā SQL PERFORMANCE TUNING CHECKLIST ā
āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā¤
ā ā
ā ā” Use EXPLAIN ANALYZE to identify bottlenecks ā
ā ā” Ensure indexes exist on JOIN and WHERE columns ā
ā ā” Avoid SELECT * ā fetch only needed columns ā
ā ā” Use covering indexes for frequent queries ā
ā ā” Avoid functions on indexed columns in WHERE clauses ā
ā ā” Use LIMIT/OFFSET for pagination ā
ā ā” Prefer EXISTS over IN for subqueries ā
ā ā” Use CTEs for readability, but inline for performance ā
ā ā” Partition large tables by date ā
ā ā” Analyze tables after bulk loads (ANALYZE) ā
ā ā” Use materialized views for expensive aggregations ā
ā ā” Monitor slow query logs regularly ā
ā ā
āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā
Key Takeaways
- Window functions are essential ā master frame specifications, ranking, and analytical functions
- Recursive CTEs enable hierarchical queries ā org charts, bill of materials, pathfinding
- LATERAL joins solve top-N per group ā more flexible than window functions in some cases
- Understand execution plans ā EXPLAIN ANALYZE is your best debugging tool
- Partitioning improves query performance ā use range partitioning for time-series data
- GROUPING SETS, ROLLUP, CUBE ā provide flexible multi-level aggregations
- Always optimize before scaling ā a well-tuned query can outperform a faster database
Practice Exercises
-
Window functions: Write a query that calculates the 7-day moving average of daily revenue and identifies days where actual revenue is more than 20% above or below the moving average.
-
Recursive CTE: Write a query to find the shortest path between two nodes in a graph represented by an adjacency list table.
-
Pivot: Transform a table of monthly sales into a quarterly summary with columns for Q1-Q4 and year-over-year comparison.
-
LATERAL: For each customer, find their most recent order and the time since their previous order.
-
Performance: Take a complex query from your work, create proper indexes, and demonstrate at least 10x improvement using EXPLAIN ANALYZE.