Interview Question: "Explain the difference between NOT EXISTS and LEFT JOIN...IS NULL for anti-joins. When would you use a materialized CTE? How do you optimize subqueries?" — Asked at Amazon, Google, Microsoft for Senior Backend roles
ℹ️
Difficulty: Advanced | Companies: Amazon, Google, Microsoft, Meta, Apple | Time: 60-75 minutes
Semi-Join Patterns
Semi-join returns rows from left table that match right table:
-- Create sample tables
CREATE TABLE customers (
customer_id INT PRIMARY KEY,
name VARCHAR(100),
email VARCHAR(255)
);
CREATE TABLE orders (
order_id INT PRIMARY KEY,
customer_id INT,
order_date DATE,
total_amount DECIMAL(12,2)
);
-- Generate sample data
INSERT INTO customers VALUES
(1, 'Alice', 'alice@example.com'),
(2, 'Bob', 'bob@example.com'),
(3, 'Charlie', 'charlie@example.com'),
(4, 'Diana', 'diana@example.com'),
(5, 'Eve', 'eve@example.com');
INSERT INTO orders VALUES
(1, 1, '2024-01-01', 100.00),
(2, 1, '2024-01-15', 150.00),
(3, 2, '2024-02-01', 200.00),
(4, 3, '2024-02-15', 175.00);
-- Semi-Join: IN clause
SELECT * FROM customers
WHERE customer_id IN (SELECT customer_id FROM orders);
-- Semi-Join: EXISTS (often more efficient)
SELECT * FROM customers c
WHERE EXISTS (
SELECT 1 FROM orders o
WHERE o.customer_id = c.customer_id
);
-- Semi-Join: INTERSECT
SELECT * FROM customers
WHERE customer_id IN (
SELECT customer_id FROM orders
INTERSECT
SELECT customer_id FROM orders WHERE total_amount > 100
);
Output:
| customer_id | name | |
|---|---|---|
| 1 | Alice | alice@example.com |
| 2 | Bob | bob@example.com |
| 3 | Charlie | charlie@example.com |
Anti-Join Patterns
Anti-join returns rows from left table that don't match right table:
-- Anti-Join: NOT IN (beware of NULLs!)
SELECT * FROM customers
WHERE customer_id NOT IN (SELECT customer_id FROM orders);
-- Anti-Join: NOT EXISTS (usually more efficient)
SELECT * FROM customers c
WHERE NOT EXISTS (
SELECT 1 FROM orders o
WHERE o.customer_id = c.customer_id
);
-- Anti-Join: LEFT JOIN...IS NULL
SELECT c.* FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
WHERE o.customer_id IS NULL;
-- Anti-Join: EXCEPT
SELECT customer_id FROM customers
EXCEPT
SELECT customer_id FROM orders;
Output:
| customer_id | name | |
|---|---|---|
| 4 | Diana | diana@example.com |
| 5 | Eve | eve@example.com |
⚠️
NULL Trap: NOT IN with NULL values returns empty results. Always use NOT EXISTS or ensure the subquery excludes NULLs.
Materialized CTE
Materialized CTEs compute results once and reuse:
-- Non-materialized CTE (computed multiple times)
WITH order_stats AS (
SELECT
customer_id,
COUNT(*) AS order_count,
SUM(total_amount) AS total_spent
FROM orders
GROUP BY customer_id
)
SELECT
c.name,
s.order_count,
s.total_spent,
s.total_spent / NULLIF(s.order_count, 0) AS avg_order_value
FROM customers c
JOIN order_stats s ON c.customer_id = s.customer_id;
-- Materialized CTE (PostgreSQL 12+)
WITH order_stats AS MATERIALIZED (
SELECT
customer_id,
COUNT(*) AS order_count,
SUM(total_amount) AS total_spent
FROM orders
GROUP BY customer_id
)
SELECT
c.name,
s.order_count,
s.total_spent
FROM customers c
JOIN order_stats s ON c.customer_id = s.customer_id;
-- Not Materialized (inline, may be optimized away)
WITH order_stats AS NOT MATERIALIZED (
SELECT
customer_id,
COUNT(*) AS order_count
FROM orders
GROUP BY customer_id
)
SELECT * FROM order_stats WHERE order_count > 1;
Subquery Optimization
-- Correlated subquery (can be slow)
SELECT
c.*,
(SELECT COUNT(*) FROM orders o WHERE o.customer_id = c.customer_id) AS order_count
FROM customers c;
-- Rewrite with JOIN (usually faster)
SELECT
c.*,
COALESCE(s.order_count, 0) AS order_count
FROM customers c
LEFT JOIN (
SELECT customer_id, COUNT(*) AS order_count
FROM orders
GROUP BY customer_id
) s ON c.customer_id = s.customer_id;
-- EXISTS vs IN for large datasets
-- EXISTS is often faster for large right tables
SELECT * FROM customers c
WHERE EXISTS (
SELECT 1 FROM orders o
WHERE o.customer_id = c.customer_id
AND o.order_date > '2024-01-01'
);
Join Optimization Patterns
-- Filter early (reduce join size)
SELECT o.*, c.name
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
WHERE o.order_date > '2024-01-01' -- Filter before join
AND o.total_amount > 100;
-- Use covering indexes
CREATE INDEX idx_orders_covering ON orders(customer_id, order_date, total_amount);
-- Join order optimization
-- PostgreSQL optimizer usually handles this, but you can hint
SET join_collapse_limit = 12; -- Allow more join reordering
-- LATERAL for dependent subqueries
SELECT
c.*,
latest.order_date,
latest.total_amount
FROM customers c
CROSS JOIN LATERAL (
SELECT order_date, total_amount
FROM orders
WHERE customer_id = c.customer_id
ORDER BY order_date DESC
LIMIT 1
) latest;
Performance Comparison Table
| Pattern | Time Complexity | When to Use |
|---|---|---|
| Semi-Join (EXISTS) | Large right table | |
| Anti-Join (NOT EXISTS) | Exclusion queries | |
| Materialized CTE | Reuse results | |
| Correlated Subquery | Avoid if possible | |
| LATERAL JOIN | Top-N per group |
-- Compare execution plans
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM customers c
WHERE EXISTS (
SELECT 1 FROM orders o WHERE o.customer_id = c.customer_id
);
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM customers c
WHERE c.customer_id IN (SELECT customer_id FROM orders);
Window Function Optimization
-- Avoid multiple window passes
-- Bad: Multiple window functions
SELECT
customer_id,
order_date,
SUM(total_amount) OVER (PARTITION BY customer_id ORDER BY order_date) AS running_total,
AVG(total_amount) OVER (PARTITION BY customer_id ORDER BY order_date) AS running_avg,
COUNT(*) OVER (PARTITION BY customer_id ORDER BY order_date) AS running_count
FROM orders;
-- Good: Combine if possible
WITH windowed AS (
SELECT
customer_id,
order_date,
total_amount,
SUM(total_amount) OVER w AS running_total,
AVG(total_amount) OVER w AS running_avg,
COUNT(*) OVER w AS running_count
FROM orders
WINDOW w AS (PARTITION BY customer_id ORDER BY order_date)
)
SELECT * FROM windowed;
Mathematical Formulas
For join performance:
The optimizer chooses based on:
ℹ️
Pro Tip: Use EXPLAIN (ANALYZE, BUFFERS) to compare actual I/O between different query patterns. Buffer hits indicate cached data, reads indicate disk I/O.
Anti-Patterns to Avoid
- SELECT * in subqueries
- NOT IN with nullable columns
- Correlated subqueries in SELECT
- Multiple scans of same table
- Unnecessary DISTINCT or ORDER BY
-- Bad: SELECT * in subquery
SELECT * FROM customers
WHERE customer_id IN (SELECT * FROM orders);
-- Good: Specify columns
SELECT * FROM customers
WHERE customer_id IN (SELECT customer_id FROM orders);
-- Bad: Correlated subquery
SELECT *, (SELECT MAX(order_date) FROM orders WHERE customer_id = c.customer_id)
FROM customers c;
-- Good: Join
SELECT c.*, o.max_date
FROM customers c
LEFT JOIN (
SELECT customer_id, MAX(order_date) AS max_date
FROM orders
GROUP BY customer_id
) o ON c.customer_id = o.customer_id;
⚠️
Performance Warning: Always test queries with realistic data volumes. Small test datasets may not reveal performance issues.