β‘ Query Optimization & EXPLAIN
Google & Uber Interview Deep Dive
π Interview Question
βΉοΈπ΄ Google/Uber Interview Question
"Given a slow-performing query on a large orders table (100M+ rows), walk through your optimization process using EXPLAIN ANALYZE. Identify the bottlenecks, suggest improvements, and verify the improvements. What metrics would you track?"
Companies: Google, Uber | Difficulty: Hard | Time: 45 minutes
π Setup: Large Dataset
-- Create a large orders table
CREATE TABLE orders (
order_id SERIAL PRIMARY KEY,
customer_id INT NOT NULL,
product_id INT NOT NULL,
order_date TIMESTAMP NOT NULL,
quantity INT DEFAULT 1,
unit_price DECIMAL(10, 2),
total_amount DECIMAL(12, 2),
status VARCHAR(20) DEFAULT 'pending',
shipping_address TEXT,
payment_method VARCHAR(30)
);
-- Create indexes (we'll analyze which help)
CREATE INDEX idx_orders_customer_id ON orders(customer_id);
CREATE INDEX idx_orders_order_date ON orders(order_date);
CREATE INDEX idx_orders_status ON orders(status);
CREATE INDEX idx_orders_composite ON orders(customer_id, order_date);
-- Insert 100M rows (using generate_series for speed)
INSERT INTO orders (customer_id, product_id, order_date, quantity, unit_price, total_amount, status)
SELECT
(random() * 1000000)::INT + 1,
(random() * 50000)::INT + 1,
'2020-01-01'::timestamp + (random() * 1825 || ' days')::interval,
(random() * 10)::INT + 1,
(random() * 1000)::decimal(10,2),
(random() * 10000)::decimal(12,2),
CASE (random() * 4)::INT
WHEN 0 THEN 'pending'
WHEN 1 THEN 'shipped'
WHEN 2 THEN 'delivered'
WHEN 3 THEN 'cancelled'
ELSE 'refunded'
END
FROM generate_series(1, 10000000);
π Part 1: Reading EXPLAIN Output
Basic EXPLAIN
-- Simple EXPLAIN shows the query plan without executing
EXPLAIN
SELECT * FROM orders WHERE customer_id = 12345;
Output:
Seq Scan on orders (cost=0.00..183347.00 rows=10 width=124)
Filter: (customer_id = 12345)
βΉοΈπ Reading EXPLAIN Output
- cost: Estimated startup cost..total cost (in arbitrary units)
- rows: Estimated number of rows returned
- width: Estimated average width of rows in bytes
- Seq Scan: Full table scan (usually bad for large tables)
- Index Scan: Using an index (usually good)
EXPLAIN with Index Usage
-- Same query with index
EXPLAIN
SELECT * FROM orders WHERE customer_id = 12345;
Output (with index):
Index Scan using idx_orders_customer_id on orders (cost=0.43..8.45 rows=10 width=124)
Index Cond: (customer_id = 12345)
EXPLAIN ANALYZE (Actual Execution)
-- EXPLAIN ANALYZE actually runs the query
EXPLAIN ANALYZE
SELECT
customer_id,
COUNT(*) AS order_count,
SUM(total_amount) AS total_spent,
AVG(total_amount) AS avg_order_value
FROM orders
WHERE order_date >= '2024-01-01'
GROUP BY customer_id
HAVING COUNT(*) > 5
ORDER BY total_spent DESC
LIMIT 10;
Sample Output:
Limit (cost=2847532.43..2847532.45 rows=10 width=20) (actual time=45231.234..45231.256 rows=10 loops=1)
-> Sort (cost=2847532.43..2847532.68 rows=100 width=20) (actual time=45231.230..45231.243 rows=10 loops=1)
Sort Key: (sum(orders.total_amount)) DESC
Sort Method: top-N heapsort Memory: 25kB
-> HashAggregate (cost=2847520.00..2847530.00 rows=1000 width=20) (actual time=45230.100..45230.890 rows=1234 loops=1)
Group Key: customer_id
Filter: (count(*) > 5)
Batches: 1 Memory Usage: 241kB
-> Seq Scan on orders (cost=0.00..2013470.00 rows=5000000 width=12) (actual time=0.012..18234.567 rows=5000000 loops=1)
Filter: (order_date >= '2024-01-01')
Rows Removed by Filter: 5000000
Planning Time: 0.085 ms
Execution Time: 45231.312 ms
π Part 2: Key Metrics to Analyze
Understanding Key Values
β οΈβ οΈ Critical EXPLAIN ANALYZE Metrics
1. Rows Removed by Filter: High values mean many rows are scanned but not used β indicates missing indexes.
2. Actual Time vs Planning Time: High planning time may indicate complex queries; high execution time needs optimization.
3. Memory Usage: HashAggregate memory can indicate if work_mem needs increase.
4. Sort Method: "external merge" means disk-based sorting β increase work_mem or optimize ORDER BY.
-- Analyze specific bottleneck
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT
o.order_id,
o.order_date,
c.customer_name,
p.product_name
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
JOIN products p ON o.product_id = p.product_id
WHERE o.order_date >= '2024-01-01'
AND o.status = 'shipped';
Enhanced Output with BUFFERS:
Nested Loop (cost=0.87..123456.78 rows=1000 width=50) (actual time=0.025..1234.567 rows=1000 loops=1)
Buffers: shared hit=54321 read=1234
-> Index Scan using idx_orders_status_date on orders o (cost=0.43..98765.43 rows=1000 width=20) (actual time=0.018..987.654 rows=1000 loops=1)
Index Cond: ((status = 'shipped') AND (order_date >= '2024-01-01'))
Buffers: shared hit=43210 read=1234
-> Index Scan using customers_pkey on customers c (cost=0.29..8.31 rows=1 width=30) (actual time=0.002..0.003 rows=1 loops=1000)
Index Cond: (customer_id = o.customer_id)
Buffers: shared hit=3000
-> Index Scan using products_pkey on products p (cost=0.29..16.61 rows=1 width=30) (actual time=0.002..0.003 rows=1 loops=1000)
Index Cond: (product_id = o.product_id)
Buffers: shared hit=2000
Planning Time: 0.125 ms
Execution Time: 1234.789 ms
π§ Part 3: Common Bottlenecks & Solutions
Bottleneck 1: Sequential Scans on Large Tables
-- BAD: Full table scan
EXPLAIN ANALYZE
SELECT * FROM orders WHERE total_amount > 5000;
-- Solution: Create appropriate index
CREATE INDEX idx_orders_amount ON orders(total_amount);
-- Verify improvement
EXPLAIN ANALYZE
SELECT * FROM orders WHERE total_amount > 5000;
Bottleneck 2: Nested Loop on Large Datasets
-- BAD: Nested loop with large intermediate result
EXPLAIN ANALYZE
SELECT *
FROM orders o
WHERE o.customer_id IN (
SELECT customer_id
FROM orders
WHERE order_date >= '2024-01-01'
);
-- Solution 1: Use JOIN instead
EXPLAIN ANALYZE
SELECT DISTINCT o.*
FROM orders o
JOIN orders o2 ON o.customer_id = o2.customer_id
WHERE o2.order_date >= '2024-01-01';
-- Solution 2: Use EXISTS
EXPLAIN ANALYZE
SELECT *
FROM orders o
WHERE EXISTS (
SELECT 1
FROM orders o2
WHERE o2.customer_id = o.customer_id
AND o2.order_date >= '2024-01-01'
);
Bottleneck 3: Sort Operations
-- BAD: Large sort without index
EXPLAIN ANALYZE
SELECT * FROM orders
ORDER BY order_date DESC
LIMIT 100;
-- Solution: Create covering index
CREATE INDEX idx_orders_date_desc ON orders(order_date DESC);
-- Or use INCLUDE for covering index (PostgreSQL 11+)
CREATE INDEX idx_orders_date_covering
ON orders(order_date DESC)
INCLUDE (customer_id, total_amount, status);
Bottleneck 4: Hash Aggregate Memory
-- BAD: Hash aggregate with high memory usage
EXPLAIN (ANALYZE, BUFFERS)
SELECT customer_id, COUNT(*), SUM(total_amount)
FROM orders
GROUP BY customer_id;
-- Solution: Increase work_mem for this session
SET work_mem = '256MB';
-- Or optimize with pre-aggregation
CREATE MATERIALIZED VIEW customer_order_summary AS
SELECT
customer_id,
COUNT(*) AS order_count,
SUM(total_amount) AS total_spent
FROM orders
GROUP BY customer_id;
π Part 4: EXPLAIN Options
-- VERBOSE: Shows detailed output
EXPLAIN (ANALYZE, VERBOSE, BUFFERS)
SELECT * FROM orders WHERE customer_id = 123;
-- FORMAT options: TEXT, JSON, YAML, XML
EXPLAIN (ANALYZE, FORMAT JSON)
SELECT * FROM orders WHERE customer_id = 123;
-- SUMMARY: Shows planning/execution statistics
EXPLAIN (ANALYZE, SUMMARY)
SELECT * FROM orders WHERE customer_id = 123;
-- COSTS: Toggle cost display
EXPLAIN (ANALYZE, COSTS FALSE)
SELECT * FROM orders WHERE customer_id = 123;
-- TIMING: Toggle timing information
EXPLAIN (ANALYZE, TIMING FALSE)
SELECT * FROM orders WHERE customer_id = 123;
-- WAL: Show WAL (Write-Ahead Log) information
EXPLAIN (ANALYZE, WAL)
INSERT INTO orders SELECT * FROM orders WHERE customer_id = 123;
π― Part 5: Query Rewriting Techniques
π‘π‘ Query Rewriting Tips
- Replace IN with JOIN or EXISTS
- Move complex subqueries to CTEs
- Use UNION ALL instead of UNION when duplicates don't matter
- Avoid SELECT * β specify only needed columns
- Push predicates as close to source as possible
Example: Rewriting Complex Queries
-- ORIGINAL: Slow query with IN clause
EXPLAIN ANALYZE
SELECT *
FROM orders
WHERE customer_id IN (
SELECT customer_id
FROM orders
WHERE order_date >= '2024-01-01'
AND status = 'shipped'
)
AND order_date < '2024-01-01';
-- REWRITTEN: Using CTE and JOIN
WITH recent_shipped AS (
SELECT DISTINCT customer_id
FROM orders
WHERE order_date >= '2024-01-01'
AND status = 'shipped'
)
EXPLAIN ANALYZE
SELECT o.*
FROM orders o
JOIN recent_shipped rs ON o.customer_id = rs.customer_id
WHERE o.order_date < '2024-01-01';
Example: Eliminating Subqueries
-- ORIGINAL: Correlated subquery
EXPLAIN ANALYZE
SELECT
*,
(SELECT COUNT(*) FROM orders o2 WHERE o2.customer_id = o1.customer_id) AS customer_order_count
FROM orders o1
WHERE order_date >= '2024-01-01';
-- REWRITTEN: Using window function
EXPLAIN ANALYZE
SELECT *,
COUNT(*) OVER (PARTITION BY customer_id) AS customer_order_count
FROM orders
WHERE order_date >= '2024-01-01';
π Part 6: Statistics and ANALYZE
β οΈβ οΈ Stale Statistics
PostgreSQL uses statistics to estimate costs. Stale statistics lead to poor plan choices. Run ANALYZE regularly on large tables.
-- Check table statistics
SELECT
schemaname,
relname,
n_live_tup,
n_dead_tup,
last_vacuum,
last_autovacuum,
last_analyze,
last_autoanalyze
FROM pg_stat_user_tables
WHERE relname = 'orders';
-- Update statistics manually
ANALYZE orders;
-- Check index usage
SELECT
schemaname,
relname,
indexrelname,
idx_scan,
idx_tup_read,
idx_tup_fetch
FROM pg_stat_user_indexes
WHERE relname = 'orders';
-- Check index bloat
SELECT
indexrelname,
pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE relname = 'orders';
ποΈ Part 7: Real-World Optimization Workflow
Step-by-Step Process
-- Step 1: Run EXPLAIN ANALYZE to establish baseline
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT ... ;
-- Step 2: Check for sequential scans
-- Look for: Seq Scan on large tables
-- Step 3: Check for nested loops with large row estimates
-- Look for: Nested Loop with high row estimates
-- Step 4: Check for sort operations
-- Look for: Sort with external merge or high memory
-- Step 5: Create indexes based on findings
CREATE INDEX idx_xxx ON table(column);
-- Step 6: Verify improvement
EXPLAIN (ANALYZE, BUFFERS) SELECT ...;
-- Step 7: Compare metrics
-- Before: actual time=X rows=Y
-- After: actual time=X' rows=Y'
Comparison Template
-- BEFORE optimization
EXPLAIN (ANALYZE, BUFFERS, SUMMARY)
SELECT
DATE_TRUNC('month', order_date) AS month,
customer_id,
COUNT(*) AS order_count,
SUM(total_amount) AS revenue
FROM orders
WHERE order_date >= '2023-01-01'
AND status != 'cancelled'
GROUP BY DATE_TRUNC('month', order_date), customer_id
HAVING SUM(total_amount) > 10000
ORDER BY revenue DESC;
-- Apply optimization...
CREATE INDEX idx_orders_status_date_amount
ON orders(status, order_date DESC)
INCLUDE (total_amount, customer_id);
-- AFTER optimization
EXPLAIN (ANALYZE, BUFFERS, SUMMARY)
SELECT ... ; -- Same query
π― Quiz Section
π Best Practices for Interviews
π‘β EXPLAIN Best Practices
1. Always Use ANALYZE for Actual Performance:
-- Planning-only EXPLAIN can be misleading
EXPLAIN SELECT ...;
-- Always verify with actual execution
EXPLAIN (ANALYZE, BUFFERS) SELECT ...;
2. Check Buffers for I/O Insights:
EXPLAIN (ANALYZE, BUFFERS)
-- High 'read' vs 'hit' indicates disk I/O
-- Consider increasing shared_buffers or optimizing queries
3. Use FORMAT for Programmatic Analysis:
EXPLAIN (ANALYZE, FORMAT JSON) SELECT ...;
-- Useful for automated monitoring tools
4. Compare Before and After:
- Document baseline metrics
- Create indexes
- Re-run and compare
- Measure improvement percentage
5. Watch for These Red Flags:
- High row estimates vs actual (stale statistics)
- Sort Method: external merge (insufficient work_mem)
- Hash Batches > 1 (insufficient work_mem)
- Nested Loop with high row estimates (missing index)
π Common Follow-Up Questions
- "How do you identify if statistics are stale?" β Compare estimated vs actual rows in EXPLAIN output
- "When should you use a covering index?" β When query only needs columns in index (INCLUDE)
- "How do you handle queries that can't use indexes?" β Consider materialized views, pre-aggregation
- "What's the difference between index scan and index only scan?" β Index only scan reads from index without touching table
β οΈβ οΈ Interview Tip
Always explain your thought process when reading EXPLAIN output. Interviewers want to see that you can:
- Identify the most expensive operation
- Understand why the planner chose that plan
- Suggest specific improvements
- Verify improvements with metrics