Performance Patterns: Anti-Join, Semi-Join, Materialized CTE

Interview Question: "Explain the difference between NOT EXISTS and LEFT JOIN...IS NULL for anti-joins. When would you use a materialized CTE? How do you optimize subqueries?" — Asked at Amazon, Google, Microsoft for Senior Backend roles

ℹ️

Difficulty: Advanced | Companies: Amazon, Google, Microsoft, Meta, Apple | Time: 60-75 minutes

Semi-Join Patterns

Semi-join returns rows from left table that match right table:

\text{Semi-Join}(A, B) = \{a \in A : \exists b \in B, a.key = b.key\}

-- Create sample tables
CREATE TABLE customers (
    customer_id INT PRIMARY KEY,
    name VARCHAR(100),
    email VARCHAR(255)
);

CREATE TABLE orders (
    order_id INT PRIMARY KEY,
    customer_id INT,
    order_date DATE,
    total_amount DECIMAL(12,2)
);

-- Generate sample data
INSERT INTO customers VALUES
(1, 'Alice', 'alice@example.com'),
(2, 'Bob', 'bob@example.com'),
(3, 'Charlie', 'charlie@example.com'),
(4, 'Diana', 'diana@example.com'),
(5, 'Eve', 'eve@example.com');

INSERT INTO orders VALUES
(1, 1, '2024-01-01', 100.00),
(2, 1, '2024-01-15', 150.00),
(3, 2, '2024-02-01', 200.00),
(4, 3, '2024-02-15', 175.00);

-- Semi-Join: IN clause
SELECT * FROM customers
WHERE customer_id IN (SELECT customer_id FROM orders);

-- Semi-Join: EXISTS (often more efficient)
SELECT * FROM customers c
WHERE EXISTS (
    SELECT 1 FROM orders o
    WHERE o.customer_id = c.customer_id
);

-- Semi-Join: INTERSECT
SELECT * FROM customers
WHERE customer_id IN (
    SELECT customer_id FROM orders
    INTERSECT
    SELECT customer_id FROM orders WHERE total_amount > 100
);

Output:

customer_id	name	email
1	Alice	alice@example.com
2	Bob	bob@example.com
3	Charlie	charlie@example.com

Anti-Join Patterns

Anti-join returns rows from left table that don't match right table:

\text{Anti-Join}(A, B) = \{a \in A : \nexists b \in B, a.key = b.key\}

-- Anti-Join: NOT IN (beware of NULLs!)
SELECT * FROM customers
WHERE customer_id NOT IN (SELECT customer_id FROM orders);

-- Anti-Join: NOT EXISTS (usually more efficient)
SELECT * FROM customers c
WHERE NOT EXISTS (
    SELECT 1 FROM orders o
    WHERE o.customer_id = c.customer_id
);

-- Anti-Join: LEFT JOIN...IS NULL
SELECT c.* FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
WHERE o.customer_id IS NULL;

-- Anti-Join: EXCEPT
SELECT customer_id FROM customers
EXCEPT
SELECT customer_id FROM orders;

Output:

customer_id	name	email
4	Diana	diana@example.com
5	Eve	eve@example.com

⚠️

NULL Trap: NOT IN with NULL values returns empty results. Always use NOT EXISTS or ensure the subquery excludes NULLs.

Materialized CTE

Materialized CTEs compute results once and reuse:

\text{Materialized CTE} = \text{Compute once} \rightarrow \text{Store in temp} \rightarrow \text{Reuse}

-- Non-materialized CTE (computed multiple times)
WITH order_stats AS (
    SELECT 
        customer_id,
        COUNT(*) AS order_count,
        SUM(total_amount) AS total_spent
    FROM orders
    GROUP BY customer_id
)
SELECT 
    c.name,
    s.order_count,
    s.total_spent,
    s.total_spent / NULLIF(s.order_count, 0) AS avg_order_value
FROM customers c
JOIN order_stats s ON c.customer_id = s.customer_id;

-- Materialized CTE (PostgreSQL 12+)
WITH order_stats AS MATERIALIZED (
    SELECT 
        customer_id,
        COUNT(*) AS order_count,
        SUM(total_amount) AS total_spent
    FROM orders
    GROUP BY customer_id
)
SELECT 
    c.name,
    s.order_count,
    s.total_spent
FROM customers c
JOIN order_stats s ON c.customer_id = s.customer_id;

-- Not Materialized (inline, may be optimized away)
WITH order_stats AS NOT MATERIALIZED (
    SELECT 
        customer_id,
        COUNT(*) AS order_count
    FROM orders
    GROUP BY customer_id
)
SELECT * FROM order_stats WHERE order_count > 1;

Subquery Optimization

-- Correlated subquery (can be slow)
SELECT 
    c.*,
    (SELECT COUNT(*) FROM orders o WHERE o.customer_id = c.customer_id) AS order_count
FROM customers c;

-- Rewrite with JOIN (usually faster)
SELECT 
    c.*,
    COALESCE(s.order_count, 0) AS order_count
FROM customers c
LEFT JOIN (
    SELECT customer_id, COUNT(*) AS order_count
    FROM orders
    GROUP BY customer_id
) s ON c.customer_id = s.customer_id;

-- EXISTS vs IN for large datasets
-- EXISTS is often faster for large right tables
SELECT * FROM customers c
WHERE EXISTS (
    SELECT 1 FROM orders o
    WHERE o.customer_id = c.customer_id
        AND o.order_date > '2024-01-01'
);

Join Optimization Patterns

-- Filter early (reduce join size)
SELECT o.*, c.name
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
WHERE o.order_date > '2024-01-01'  -- Filter before join
    AND o.total_amount > 100;

-- Use covering indexes
CREATE INDEX idx_orders_covering ON orders(customer_id, order_date, total_amount);

-- Join order optimization
-- PostgreSQL optimizer usually handles this, but you can hint
SET join_collapse_limit = 12;  -- Allow more join reordering

-- LATERAL for dependent subqueries
SELECT 
    c.*,
    latest.order_date,
    latest.total_amount
FROM customers c
CROSS JOIN LATERAL (
    SELECT order_date, total_amount
    FROM orders
    WHERE customer_id = c.customer_id
    ORDER BY order_date DESC
    LIMIT 1
) latest;

Performance Comparison Table

Pattern	Time Complexity	When to Use
Semi-Join (EXISTS)	$O(n \log m)$	Large right table
Anti-Join (NOT EXISTS)	$O(n \log m)$	Exclusion queries
Materialized CTE	$O(n + m)$	Reuse results
Correlated Subquery	$O(n \times m)$	Avoid if possible
LATERAL JOIN	$O(n \times k)$	Top-N per group

-- Compare execution plans
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM customers c
WHERE EXISTS (
    SELECT 1 FROM orders o WHERE o.customer_id = c.customer_id
);

EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM customers c
WHERE c.customer_id IN (SELECT customer_id FROM orders);

Window Function Optimization

-- Avoid multiple window passes
-- Bad: Multiple window functions
SELECT 
    customer_id,
    order_date,
    SUM(total_amount) OVER (PARTITION BY customer_id ORDER BY order_date) AS running_total,
    AVG(total_amount) OVER (PARTITION BY customer_id ORDER BY order_date) AS running_avg,
    COUNT(*) OVER (PARTITION BY customer_id ORDER BY order_date) AS running_count
FROM orders;

-- Good: Combine if possible
WITH windowed AS (
    SELECT 
        customer_id,
        order_date,
        total_amount,
        SUM(total_amount) OVER w AS running_total,
        AVG(total_amount) OVER w AS running_avg,
        COUNT(*) OVER w AS running_count
    FROM orders
    WINDOW w AS (PARTITION BY customer_id ORDER BY order_date)
)
SELECT * FROM windowed;

Mathematical Formulas

For join performance:

\text{Hash Join Cost} = O(n + m)

\text{Nested Loop Cost} = O(n \times m)

\text{Merge Join Cost} = O(n \log n + m \log m)

The optimizer chooses based on:

\text{Choice} = \arg\min_{\text{plan}} \text{Cost}(\text{plan})

ℹ️

Pro Tip: Use EXPLAIN (ANALYZE, BUFFERS) to compare actual I/O between different query patterns. Buffer hits indicate cached data, reads indicate disk I/O.

Anti-Patterns to Avoid

SELECT * in subqueries
NOT IN with nullable columns
Correlated subqueries in SELECT
Multiple scans of same table
Unnecessary DISTINCT or ORDER BY

-- Bad: SELECT * in subquery
SELECT * FROM customers
WHERE customer_id IN (SELECT * FROM orders);

-- Good: Specify columns
SELECT * FROM customers
WHERE customer_id IN (SELECT customer_id FROM orders);

-- Bad: Correlated subquery
SELECT *, (SELECT MAX(order_date) FROM orders WHERE customer_id = c.customer_id)
FROM customers c;

-- Good: Join
SELECT c.*, o.max_date
FROM customers c
LEFT JOIN (
    SELECT customer_id, MAX(order_date) AS max_date
    FROM orders
    GROUP BY customer_id
) o ON c.customer_id = o.customer_id;

⚠️

Performance Warning: Always test queries with realistic data volumes. Small test datasets may not reveal performance issues.