Advanced SQL for Data Engineers

Module 1: FoundationsAdvanced SQLFree Lesson

Advertisement

Beyond the Basics

Advanced SQL separates competent data engineers from exceptional ones. These techniques enable you to solve complex data problems, optimize performance, and build sophisticated transformations.

ā”Œā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”
│                  ADVANCED SQL TECHNIQUES                      │
ā”œā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”¤
│  Complex Window Functions    │  Recursive CTEs              │
│  Pivot / Unpivot             │  LATERAL Joins               │
│  Query Execution Plans       │  Performance Tuning          │
│  Partitioning Strategies     │  Indexing Optimization       │
│  Advanced Aggregations       │  Conditional Logic           │
ā””ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”˜

Complex Window Functions

Advanced Window Frames

-- Cumulative distribution
SELECT 
    employee_name,
    salary,
    department,
    CUME_DIST() OVER (
        PARTITION BY department 
        ORDER BY salary DESC
    ) AS percentile,
    PERCENT_RANK() OVER (
        PARTITION BY department 
        ORDER BY salary DESC
    ) AS percentile_rank
FROM employees;

-- First and last values in window
SELECT 
    order_date,
    daily_revenue,
    FIRST_VALUE(daily_revenue) OVER (
        ORDER BY order_date
        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
    ) AS first_day_revenue,
    LAST_VALUE(daily_revenue) OVER (
        ORDER BY order_date
        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
    ) AS last_day_revenue
FROM daily_sales;

-- NTH value
SELECT 
    product_name,
    category,
    total_sales,
    NTH_VALUE(product_name, 3) OVER (
        PARTITION BY category 
        ORDER BY total_sales DESC
        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
    ) AS third_best_seller
FROM product_sales;

Window Functions for Data Quality

-- Detect consecutive days with zero sales
WITH daily_sales AS (
    SELECT generate_series('2024-01-01'::date, '2024-12-31'::date, '1 day') AS sale_date
),
actual_sales AS (
    SELECT DISTINCT order_date::date AS sale_date
    FROM orders
),
missing_days AS (
    SELECT d.sale_date
    FROM daily_sales d
    LEFT JOIN actual_sales a ON d.sale_date = a.sale_date
    WHERE a.sale_date IS NULL
),
consecutive_groups AS (
    SELECT 
        sale_date,
        sale_date - ROW_NUMBER() OVER (ORDER BY sale_date)::int AS grp
    FROM missing_days
)
SELECT 
    MIN(sale_date) AS gap_start,
    MAX(sage_date) AS gap_end,
    COUNT(*) AS consecutive_days
FROM consecutive_groups
GROUP BY grp
HAVING COUNT(*) >= 3
ORDER BY gap_start;

-- Find gaps in sequential IDs
WITH numbered AS (
    SELECT 
        order_id,
        order_id - ROW_NUMBER() OVER (ORDER BY order_id) AS gap_group
    FROM orders
)
SELECT 
    MIN(order_id) AS gap_start,
    MAX(order_id) AS gap_end,
    MAX(order_id) - MIN(order_id) - COUNT(*) + 1 AS gap_size
FROM numbered
GROUP BY gap_group
HAVING MAX(order_id) - MIN(order_id) > COUNT(*) - 1
ORDER BY gap_start;

Recursive CTEs

Recursive CTEs enable hierarchical and graph-like queries — essential for org charts, bill of materials, and pathfinding.

Syntax

WITH RECURSIVE cte_name AS (
    -- Base case (anchor member)
    SELECT ... FROM table WHERE condition
    
    UNION ALL
    
    -- Recursive member
    SELECT ... FROM table 
    INNER JOIN cte_name ON ...
)
SELECT * FROM cte_name;

Practical Examples

-- Organizational hierarchy
WITH RECURSIVE org_chart AS (
    -- Base case: CEO (no manager)
    SELECT 
        employee_id,
        name,
        manager_id,
        1 AS level,
        name::text AS path
    FROM employees
    WHERE manager_id IS NULL
    
    UNION ALL
    
    -- Recursive: employees with managers
    SELECT 
        e.employee_id,
        e.name,
        e.manager_id,
        oc.level + 1,
        oc.path || ' → ' || e.name
    FROM employees e
    INNER JOIN org_chart oc ON e.manager_id = oc.employee_id
)
SELECT 
    employee_id,
    name,
    level,
    path
FROM org_chart
ORDER BY path;

-- Bill of materials (product components)
WITH RECURSIVE bom AS (
    -- Base case: top-level product
    SELECT 
        product_id,
        component_id,
        quantity,
        1 AS depth,
        product_id::text AS path
    FROM bill_of_materials
    WHERE product_id = 1001
    
    UNION ALL
    
    -- Recursive: sub-components
    SELECT 
        b.product_id,
        b.component_id,
        b.quantity * bom.quantity AS quantity,
        bom.depth + 1,
        bom.path || '.' || b.component_id::text
    FROM bill_of_materials b
    INNER JOIN bom ON b.product_id = bom.component_id
)
SELECT 
    component_id,
    SUM(quantity) AS total_quantity,
    MAX(depth) AS max_depth
FROM bom
GROUP BY component_id
ORDER BY component_id;

-- Find all ancestors of a node
WITH RECURSIVE ancestors AS (
    SELECT 
        employee_id,
        name,
        manager_id,
        ARRAY[employee_id] AS visited
    FROM employees
    WHERE employee_id = 42  -- Target employee
    
    UNION ALL
    
    SELECT 
        e.employee_id,
        e.name,
        e.manager_id,
        a.visited || e.employee_id
    FROM employees e
    INNER JOIN ancestors a ON e.employee_id = a.manager_id
    WHERE e.employee_id != ALL(a.visited)  -- Prevent cycles
)
SELECT employee_id, name FROM ancestors;

Pivot and Unpivot

Manual Pivot (Works in all databases)

-- Pivot: rows to columns
SELECT 
    product_name,
    SUM(CASE WHEN EXTRACT(quarter FROM order_date) = 1 THEN amount ELSE 0 END) AS q1_sales,
    SUM(CASE WHEN EXTRACT(quarter FROM order_date) = 2 THEN amount ELSE 0 END) AS q2_sales,
    SUM(CASE WHEN EXTRACT(quarter FROM order_date) = 3 THEN amount ELSE 0 END) AS q3_sales,
    SUM(CASE WHEN EXTRACT(quarter FROM order_date) = 4 THEN amount ELSE 0 END) AS q4_sales
FROM orders
GROUP BY product_name;

-- Pivot with dynamic columns (PostgreSQL)
DO $$
DECLARE
    rec RECORD;
    sql TEXT := 'SELECT product_name';
BEGIN
    FOR rec IN 
        SELECT DISTINCT EXTRACT(quarter FROM order_date) AS q
        FROM orders
        ORDER BY q
    LOOP
        sql := sql || ', SUM(CASE WHEN EXTRACT(quarter FROM order_date) = ' 
             || rec.q || ' THEN amount ELSE 0 END) AS q' || rec.q || '_sales';
    END LOOP;
    
    sql := sql || ' FROM orders GROUP BY product_name';
    RAISE NOTICE '%', sql;
END $$;

Native Pivot (Database-Specific)

-- PostgreSQL: Using crosstab
CREATE EXTENSION IF NOT EXISTS tablefunc;

SELECT *
FROM crosstab(
    'SELECT product_name, 
            EXTRACT(quarter FROM order_date)::int AS quarter, 
            SUM(amount)::numeric AS sales
     FROM orders
     GROUP BY product_name, EXTRACT(quarter FROM order_date)
     ORDER BY 1, 2',
    'SELECT DISTINCT EXTRACT(quarter FROM order_date)::int 
     FROM orders ORDER BY 1'
) AS ct(product_name text, q1 numeric, q2 numeric, q3 numeric, q4 numeric);

-- SQL Server: PIVOT operator
SELECT *
FROM (
    SELECT 
        product_name,
        'Q' + CAST(EXTRACT(quarter FROM order_date) AS VARCHAR) AS quarter,
        amount
    FROM orders
) src
PIVOT (
    SUM(amount)
    FOR quarter IN ([Q1], [Q2], [Q3], [Q4])
) pvt;

LATERAL Joins

LATERAL joins allow subqueries to reference columns from preceding tables — essential for "top-N per group" and correlated calculations.

-- Top 3 orders per customer (LATERAL approach)
SELECT 
    c.customer_id,
    c.first_name,
    top_orders.*
FROM customers c
CROSS JOIN LATERAL (
    SELECT 
        order_id,
        order_date,
        amount
    FROM orders o
    WHERE o.customer_id = c.customer_id
    ORDER BY amount DESC
    LIMIT 3
) top_orders;

-- Get most recent order per customer with full details
SELECT 
    c.customer_id,
    c.first_name,
    c.last_name,
    recent.order_id,
    recent.order_date,
    recent.total_amount
FROM customers c
CROSS JOIN LATERAL (
    SELECT *
    FROM orders o
    WHERE o.customer_id = c.customer_id
    ORDER BY o.order_date DESC
    LIMIT 1
) recent;

-- Running totals per customer
SELECT 
    c.customer_id,
    o.order_date,
    o.amount,
    running.total AS running_total
FROM customers c
INNER JOIN orders o ON c.customer_id = o.customer_id
CROSS JOIN LATERAL (
    SELECT SUM(o2.amount) AS total
    FROM orders o2
    WHERE o2.customer_id = c.customer_id
      AND o2.order_date <= o.order_date
) running;

Query Execution Plans

Reading EXPLAIN Output

-- Basic EXPLAIN
EXPLAIN SELECT * FROM orders WHERE customer_id = 123;

-- With actual execution statistics
EXPLAIN ANALYZE
SELECT 
    c.first_name,
    COUNT(o.order_id)
FROM customers c
INNER JOIN orders o ON c.customer_id = o.customer_id
WHERE c.country = 'USA'
GROUP BY c.first_name;

-- Output example:
/*
HashAggregate  (cost=12345.67..12345.89 rows=22 width=20)
  Group Key: c.first_name
  ->  Hash Join  (cost=100.00..12345.00 rows=50000 width=16)
        Hash Cond: (o.customer_id = c.customer_id)
        ->  Seq Scan on orders o  (cost=0.00..10000.00 rows=500000 width=12)
        ->  Hash  (cost=100.00..100.00 rows=5000 width=16)
              ->  Seq Scan on customers c  (cost=0.00..100.00 rows=5000 width=16)
                    Filter: (country = 'USA')
Planning Time: 0.123 ms
Execution Time: 123.456 ms
*/

Common Execution Plan Operations

OperationDescriptionPerformance Impact
Seq ScanReads entire tableSlow on large tables
Index ScanUses index to find rowsFast
Index Only ScanData from index onlyVery fast
Hash JoinBuilds hash table for joinGood for large joins
Nested LoopIterates and looks upGood for small datasets
SortOrders result setCan be expensive
AggregateGroups and aggregatesMay require sorting

Optimization Strategies

-- Before: Suboptimal query
EXPLAIN ANALYZE
SELECT *
FROM orders
WHERE EXTRACT(year FROM order_date) = 2024;

-- Problem: Function on column prevents index use
-- Seq Scan on orders (cost=0.00..15000.00 rows=100000)

-- After: Optimized query
EXPLAIN ANALYZE
SELECT *
FROM orders
WHERE order_date >= '2024-01-01'
  AND order_date < '2025-01-01';

-- Improvement: Range scan can use index
-- Index Scan using idx_orders_date on orders (cost=0.43..5000.00 rows=100000)

Partitioning Strategies

Range Partitioning

-- PostgreSQL: Create partitioned table
CREATE TABLE orders (
    order_id BIGINT,
    customer_id INTEGER,
    order_date DATE,
    amount DECIMAL(10,2)
) PARTITION BY RANGE (order_date);

-- Create partitions
CREATE TABLE orders_2023 PARTITION OF orders
    FOR VALUES FROM ('2023-01-01') TO ('2024-01-01');

CREATE TABLE orders_2024 PARTITION OF orders
    FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');

CREATE TABLE orders_2025 PARTITION OF orders
    FOR VALUES FROM ('2025-01-01') TO ('2026-01-01');

-- Query automatically uses partition pruning
SELECT * FROM orders 
WHERE order_date >= '2024-06-01' 
  AND order_date < '2024-07-01';
-- Only scans orders_2024 partition

Hash Partitioning

-- Distribute data evenly across partitions
CREATE TABLE events (
    event_id BIGINT,
    user_id INTEGER,
    event_type VARCHAR(50),
    event_time TIMESTAMP
) PARTITION BY HASH (user_id);

-- Create partitions
CREATE TABLE events_p0 PARTITION OF events
    FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE events_p1 PARTITION OF events
    FOR VALUES WITH (MODULUS 4, REMAINDER 1);
CREATE TABLE events_p2 PARTITION OF events
    FOR VALUES WITH (MODULUS 4, REMAINDER 2);
CREATE TABLE events_p3 PARTITION OF events
    FOR VALUES WITH (MODULUS 4, REMAINDER 3);

Advanced Aggregations

-- GROUPING SETS: Multiple aggregation levels
SELECT 
    COALESCE(country, '(All Countries)') AS country,
    COALESCE(city, '(All Cities)') AS city,
    SUM(amount) AS total_revenue
FROM orders
GROUP BY GROUPING SETS (
    (country, city),  -- Country + City level
    (country),        -- Country level
    ()                -- Grand total
);

-- ROLLUP: Hierarchical aggregation
SELECT 
    country,
    city,
    SUM(amount) AS total_revenue
FROM orders
GROUP BY ROLLUP (country, city);
-- Produces: country+city, country, grand total

-- CUBE: All combinations
SELECT 
    country,
    city,
    category,
    SUM(amount) AS total_revenue
FROM orders
GROUP BY CUBE (country, city, category);
-- Produces: all possible combinations of the 3 columns

-- FILTER clause (PostgreSQL)
SELECT 
    order_date,
    COUNT(*) AS total_orders,
    COUNT(*) FILTER (WHERE status = 'completed') AS completed_orders,
    COUNT(*) FILTER (WHERE status = 'cancelled') AS cancelled_orders,
    SUM(amount) FILTER (WHERE amount > 100) AS high_value_revenue
FROM orders
GROUP BY order_date;

Performance Tuning Checklist

ā”Œā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”
│              SQL PERFORMANCE TUNING CHECKLIST                 │
ā”œā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”¤
│                                                             │
│  ā–” Use EXPLAIN ANALYZE to identify bottlenecks              │
│  ā–” Ensure indexes exist on JOIN and WHERE columns           │
│  ā–” Avoid SELECT * — fetch only needed columns               │
│  ā–” Use covering indexes for frequent queries                │
│  ā–” Avoid functions on indexed columns in WHERE clauses      │
│  ā–” Use LIMIT/OFFSET for pagination                          │
│  ā–” Prefer EXISTS over IN for subqueries                     │
│  ā–” Use CTEs for readability, but inline for performance     │
│  ā–” Partition large tables by date                           │
│  ā–” Analyze tables after bulk loads (ANALYZE)                │
│  ā–” Use materialized views for expensive aggregations        │
│  ā–” Monitor slow query logs regularly                        │
│                                                             │
ā””ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”˜

Key Takeaways

  1. Window functions are essential — master frame specifications, ranking, and analytical functions
  2. Recursive CTEs enable hierarchical queries — org charts, bill of materials, pathfinding
  3. LATERAL joins solve top-N per group — more flexible than window functions in some cases
  4. Understand execution plans — EXPLAIN ANALYZE is your best debugging tool
  5. Partitioning improves query performance — use range partitioning for time-series data
  6. GROUPING SETS, ROLLUP, CUBE — provide flexible multi-level aggregations
  7. Always optimize before scaling — a well-tuned query can outperform a faster database

Practice Exercises

  1. Window functions: Write a query that calculates the 7-day moving average of daily revenue and identifies days where actual revenue is more than 20% above or below the moving average.

  2. Recursive CTE: Write a query to find the shortest path between two nodes in a graph represented by an adjacency list table.

  3. Pivot: Transform a table of monthly sales into a quarterly summary with columns for Q1-Q4 and year-over-year comparison.

  4. LATERAL: For each customer, find their most recent order and the time since their previous order.

  5. Performance: Take a complex query from your work, create proper indexes, and demonstrate at least 10x improvement using EXPLAIN ANALYZE.

Advertisement

Need Expert Data Engineering Help?

Professional DE consulting, pipeline architecture, and data platform services.

Advertisement