⚡ Query Optimization & EXPLAIN

Google & Uber Interview Deep Dive

🏢 Google🏢 Uber⚡ Difficulty: Hard⏱️ 45 min

📋 Interview Question

ℹ️🔴 Google/Uber Interview Question

"Given a slow-performing query on a large orders table (100M+ rows), walk through your optimization process using EXPLAIN ANALYZE. Identify the bottlenecks, suggest improvements, and verify the improvements. What metrics would you track?"

Companies: Google, Uber | Difficulty: Hard | Time: 45 minutes

📊 Setup: Large Dataset

-- Create a large orders table
CREATE TABLE orders (
    order_id SERIAL PRIMARY KEY,
    customer_id INT NOT NULL,
    product_id INT NOT NULL,
    order_date TIMESTAMP NOT NULL,
    quantity INT DEFAULT 1,
    unit_price DECIMAL(10, 2),
    total_amount DECIMAL(12, 2),
    status VARCHAR(20) DEFAULT 'pending',
    shipping_address TEXT,
    payment_method VARCHAR(30)
);

-- Create indexes (we'll analyze which help)
CREATE INDEX idx_orders_customer_id ON orders(customer_id);
CREATE INDEX idx_orders_order_date ON orders(order_date);
CREATE INDEX idx_orders_status ON orders(status);
CREATE INDEX idx_orders_composite ON orders(customer_id, order_date);

-- Insert 100M rows (using generate_series for speed)
INSERT INTO orders (customer_id, product_id, order_date, quantity, unit_price, total_amount, status)
SELECT
    (random() * 1000000)::INT + 1,
    (random() * 50000)::INT + 1,
    '2020-01-01'::timestamp + (random() * 1825 || ' days')::interval,
    (random() * 10)::INT + 1,
    (random() * 1000)::decimal(10,2),
    (random() * 10000)::decimal(12,2),
    CASE (random() * 4)::INT
        WHEN 0 THEN 'pending'
        WHEN 1 THEN 'shipped'
        WHEN 2 THEN 'delivered'
        WHEN 3 THEN 'cancelled'
        ELSE 'refunded'
    END
FROM generate_series(1, 10000000);

🔍 Part 1: Reading EXPLAIN Output

Basic EXPLAIN

-- Simple EXPLAIN shows the query plan without executing
EXPLAIN
SELECT * FROM orders WHERE customer_id = 12345;

Output:

Architecture Diagram

Seq Scan on orders  (cost=0.00..183347.00 rows=10 width=124)
  Filter: (customer_id = 12345)

ℹ️🔍 Reading EXPLAIN Output

cost: Estimated startup cost..total cost (in arbitrary units)
rows: Estimated number of rows returned
width: Estimated average width of rows in bytes
Seq Scan: Full table scan (usually bad for large tables)
Index Scan: Using an index (usually good)

EXPLAIN with Index Usage

-- Same query with index
EXPLAIN
SELECT * FROM orders WHERE customer_id = 12345;

Output (with index):

Architecture Diagram

Index Scan using idx_orders_customer_id on orders  (cost=0.43..8.45 rows=10 width=124)
  Index Cond: (customer_id = 12345)

EXPLAIN ANALYZE (Actual Execution)

-- EXPLAIN ANALYZE actually runs the query
EXPLAIN ANALYZE
SELECT
    customer_id,
    COUNT(*) AS order_count,
    SUM(total_amount) AS total_spent,
    AVG(total_amount) AS avg_order_value
FROM orders
WHERE order_date >= '2024-01-01'
GROUP BY customer_id
HAVING COUNT(*) > 5
ORDER BY total_spent DESC
LIMIT 10;

Sample Output:

Architecture Diagram

Limit  (cost=2847532.43..2847532.45 rows=10 width=20) (actual time=45231.234..45231.256 rows=10 loops=1)
  ->  Sort  (cost=2847532.43..2847532.68 rows=100 width=20) (actual time=45231.230..45231.243 rows=10 loops=1)
        Sort Key: (sum(orders.total_amount)) DESC
        Sort Method: top-N heapsort  Memory: 25kB
        ->  HashAggregate  (cost=2847520.00..2847530.00 rows=1000 width=20) (actual time=45230.100..45230.890 rows=1234 loops=1)
              Group Key: customer_id
              Filter: (count(*) > 5)
              Batches: 1  Memory Usage: 241kB
              ->  Seq Scan on orders  (cost=0.00..2013470.00 rows=5000000 width=12) (actual time=0.012..18234.567 rows=5000000 loops=1)
                    Filter: (order_date >= '2024-01-01')
                    Rows Removed by Filter: 5000000
Planning Time: 0.085 ms
Execution Time: 45231.312 ms

📈 Part 2: Key Metrics to Analyze

Understanding Key Values

⚠️⚠️ Critical EXPLAIN ANALYZE Metrics

1. Rows Removed by Filter: High values mean many rows are scanned but not used — indicates missing indexes.

2. Actual Time vs Planning Time: High planning time may indicate complex queries; high execution time needs optimization.

3. Memory Usage: HashAggregate memory can indicate if work_mem needs increase.

4. Sort Method: "external merge" means disk-based sorting — increase work_mem or optimize ORDER BY.

-- Analyze specific bottleneck
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT
    o.order_id,
    o.order_date,
    c.customer_name,
    p.product_name
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
JOIN products p ON o.product_id = p.product_id
WHERE o.order_date >= '2024-01-01'
AND o.status = 'shipped';

Enhanced Output with BUFFERS:

Architecture Diagram

Nested Loop  (cost=0.87..123456.78 rows=1000 width=50) (actual time=0.025..1234.567 rows=1000 loops=1)
  Buffers: shared hit=54321 read=1234
  ->  Index Scan using idx_orders_status_date on orders o  (cost=0.43..98765.43 rows=1000 width=20) (actual time=0.018..987.654 rows=1000 loops=1)
        Index Cond: ((status = 'shipped') AND (order_date >= '2024-01-01'))
        Buffers: shared hit=43210 read=1234
  ->  Index Scan using customers_pkey on customers c  (cost=0.29..8.31 rows=1 width=30) (actual time=0.002..0.003 rows=1 loops=1000)
        Index Cond: (customer_id = o.customer_id)
        Buffers: shared hit=3000
  ->  Index Scan using products_pkey on products p  (cost=0.29..16.61 rows=1 width=30) (actual time=0.002..0.003 rows=1 loops=1000)
        Index Cond: (product_id = o.product_id)
        Buffers: shared hit=2000
Planning Time: 0.125 ms
Execution Time: 1234.789 ms

🔧 Part 3: Common Bottlenecks & Solutions

Bottleneck 1: Sequential Scans on Large Tables

-- BAD: Full table scan
EXPLAIN ANALYZE
SELECT * FROM orders WHERE total_amount > 5000;

-- Solution: Create appropriate index
CREATE INDEX idx_orders_amount ON orders(total_amount);

-- Verify improvement
EXPLAIN ANALYZE
SELECT * FROM orders WHERE total_amount > 5000;

Bottleneck 2: Nested Loop on Large Datasets

-- BAD: Nested loop with large intermediate result
EXPLAIN ANALYZE
SELECT *
FROM orders o
WHERE o.customer_id IN (
    SELECT customer_id
    FROM orders
    WHERE order_date >= '2024-01-01'
);

-- Solution 1: Use JOIN instead
EXPLAIN ANALYZE
SELECT DISTINCT o.*
FROM orders o
JOIN orders o2 ON o.customer_id = o2.customer_id
WHERE o2.order_date >= '2024-01-01';

-- Solution 2: Use EXISTS
EXPLAIN ANALYZE
SELECT *
FROM orders o
WHERE EXISTS (
    SELECT 1
    FROM orders o2
    WHERE o2.customer_id = o.customer_id
    AND o2.order_date >= '2024-01-01'
);

Bottleneck 3: Sort Operations

-- BAD: Large sort without index
EXPLAIN ANALYZE
SELECT * FROM orders
ORDER BY order_date DESC
LIMIT 100;

-- Solution: Create covering index
CREATE INDEX idx_orders_date_desc ON orders(order_date DESC);

-- Or use INCLUDE for covering index (PostgreSQL 11+)
CREATE INDEX idx_orders_date_covering
ON orders(order_date DESC)
INCLUDE (customer_id, total_amount, status);

Bottleneck 4: Hash Aggregate Memory

-- BAD: Hash aggregate with high memory usage
EXPLAIN (ANALYZE, BUFFERS)
SELECT customer_id, COUNT(*), SUM(total_amount)
FROM orders
GROUP BY customer_id;

-- Solution: Increase work_mem for this session
SET work_mem = '256MB';

-- Or optimize with pre-aggregation
CREATE MATERIALIZED VIEW customer_order_summary AS
SELECT
    customer_id,
    COUNT(*) AS order_count,
    SUM(total_amount) AS total_spent
FROM orders
GROUP BY customer_id;

📊 Part 4: EXPLAIN Options

-- VERBOSE: Shows detailed output
EXPLAIN (ANALYZE, VERBOSE, BUFFERS)
SELECT * FROM orders WHERE customer_id = 123;

-- FORMAT options: TEXT, JSON, YAML, XML
EXPLAIN (ANALYZE, FORMAT JSON)
SELECT * FROM orders WHERE customer_id = 123;

-- SUMMARY: Shows planning/execution statistics
EXPLAIN (ANALYZE, SUMMARY)
SELECT * FROM orders WHERE customer_id = 123;

-- COSTS: Toggle cost display
EXPLAIN (ANALYZE, COSTS FALSE)
SELECT * FROM orders WHERE customer_id = 123;

-- TIMING: Toggle timing information
EXPLAIN (ANALYZE, TIMING FALSE)
SELECT * FROM orders WHERE customer_id = 123;

-- WAL: Show WAL (Write-Ahead Log) information
EXPLAIN (ANALYZE, WAL)
INSERT INTO orders SELECT * FROM orders WHERE customer_id = 123;

🎯 Part 5: Query Rewriting Techniques

💡💡 Query Rewriting Tips

Replace IN with JOIN or EXISTS
Move complex subqueries to CTEs
Use UNION ALL instead of UNION when duplicates don't matter
Avoid SELECT * — specify only needed columns
Push predicates as close to source as possible

Example: Rewriting Complex Queries

-- ORIGINAL: Slow query with IN clause
EXPLAIN ANALYZE
SELECT *
FROM orders
WHERE customer_id IN (
    SELECT customer_id
    FROM orders
    WHERE order_date >= '2024-01-01'
    AND status = 'shipped'
)
AND order_date < '2024-01-01';

-- REWRITTEN: Using CTE and JOIN
WITH recent_shipped AS (
    SELECT DISTINCT customer_id
    FROM orders
    WHERE order_date >= '2024-01-01'
    AND status = 'shipped'
)
EXPLAIN ANALYZE
SELECT o.*
FROM orders o
JOIN recent_shipped rs ON o.customer_id = rs.customer_id
WHERE o.order_date < '2024-01-01';

Example: Eliminating Subqueries

-- ORIGINAL: Correlated subquery
EXPLAIN ANALYZE
SELECT
    *,
    (SELECT COUNT(*) FROM orders o2 WHERE o2.customer_id = o1.customer_id) AS customer_order_count
FROM orders o1
WHERE order_date >= '2024-01-01';

-- REWRITTEN: Using window function
EXPLAIN ANALYZE
SELECT *,
    COUNT(*) OVER (PARTITION BY customer_id) AS customer_order_count
FROM orders
WHERE order_date >= '2024-01-01';

📈 Part 6: Statistics and ANALYZE

⚠️⚠️ Stale Statistics

PostgreSQL uses statistics to estimate costs. Stale statistics lead to poor plan choices. Run ANALYZE regularly on large tables.

-- Check table statistics
SELECT
    schemaname,
    relname,
    n_live_tup,
    n_dead_tup,
    last_vacuum,
    last_autovacuum,
    last_analyze,
    last_autoanalyze
FROM pg_stat_user_tables
WHERE relname = 'orders';

-- Update statistics manually
ANALYZE orders;

-- Check index usage
SELECT
    schemaname,
    relname,
    indexrelname,
    idx_scan,
    idx_tup_read,
    idx_tup_fetch
FROM pg_stat_user_indexes
WHERE relname = 'orders';

-- Check index bloat
SELECT
    indexrelname,
    pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE relname = 'orders';

🏗️ Part 7: Real-World Optimization Workflow

Step-by-Step Process

-- Step 1: Run EXPLAIN ANALYZE to establish baseline
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT ... ;

-- Step 2: Check for sequential scans
-- Look for: Seq Scan on large tables

-- Step 3: Check for nested loops with large row estimates
-- Look for: Nested Loop with high row estimates

-- Step 4: Check for sort operations
-- Look for: Sort with external merge or high memory

-- Step 5: Create indexes based on findings
CREATE INDEX idx_xxx ON table(column);

-- Step 6: Verify improvement
EXPLAIN (ANALYZE, BUFFERS) SELECT ...;

-- Step 7: Compare metrics
-- Before: actual time=X rows=Y
-- After: actual time=X' rows=Y'

Comparison Template

-- BEFORE optimization
EXPLAIN (ANALYZE, BUFFERS, SUMMARY)
SELECT
    DATE_TRUNC('month', order_date) AS month,
    customer_id,
    COUNT(*) AS order_count,
    SUM(total_amount) AS revenue
FROM orders
WHERE order_date >= '2023-01-01'
AND status != 'cancelled'
GROUP BY DATE_TRUNC('month', order_date), customer_id
HAVING SUM(total_amount) > 10000
ORDER BY revenue DESC;

-- Apply optimization...
CREATE INDEX idx_orders_status_date_amount
ON orders(status, order_date DESC)
INCLUDE (total_amount, customer_id);

-- AFTER optimization
EXPLAIN (ANALYZE, BUFFERS, SUMMARY)
SELECT ... ;  -- Same query

🎯 Quiz Section

🏆 Best Practices for Interviews

💡✅ EXPLAIN Best Practices

1. Always Use ANALYZE for Actual Performance:

-- Planning-only EXPLAIN can be misleading
EXPLAIN SELECT ...;

-- Always verify with actual execution
EXPLAIN (ANALYZE, BUFFERS) SELECT ...;

2. Check Buffers for I/O Insights:

EXPLAIN (ANALYZE, BUFFERS)
-- High 'read' vs 'hit' indicates disk I/O
-- Consider increasing shared_buffers or optimizing queries

3. Use FORMAT for Programmatic Analysis:

EXPLAIN (ANALYZE, FORMAT JSON) SELECT ...;
-- Useful for automated monitoring tools

4. Compare Before and After:

Document baseline metrics
Create indexes
Re-run and compare
Measure improvement percentage

5. Watch for These Red Flags:

High row estimates vs actual (stale statistics)
Sort Method: external merge (insufficient work_mem)
Hash Batches > 1 (insufficient work_mem)
Nested Loop with high row estimates (missing index)

🔗 Common Follow-Up Questions

"How do you identify if statistics are stale?" — Compare estimated vs actual rows in EXPLAIN output
"When should you use a covering index?" — When query only needs columns in index (INCLUDE)
"How do you handle queries that can't use indexes?" — Consider materialized views, pre-aggregation
"What's the difference between index scan and index only scan?" — Index only scan reads from index without touching table

⚠️⚠️ Interview Tip

Always explain your thought process when reading EXPLAIN output. Interviewers want to see that you can:

Identify the most expensive operation
Understand why the planner chose that plan
Suggest specific improvements
Verify improvements with metrics

Topic: Query Optimization with EXPLAIN/ANALYZE for FAANG Interviews

⚡ Query Optimization & EXPLAIN

📋 Interview Question

📊 Setup: Large Dataset

🔍 Part 1: Reading EXPLAIN Output

Basic EXPLAIN

EXPLAIN with Index Usage

EXPLAIN ANALYZE (Actual Execution)

📈 Part 2: Key Metrics to Analyze

Understanding Key Values

🔧 Part 3: Common Bottlenecks & Solutions

Bottleneck 1: Sequential Scans on Large Tables

Bottleneck 2: Nested Loop on Large Datasets

Bottleneck 3: Sort Operations

Bottleneck 4: Hash Aggregate Memory

📊 Part 4: EXPLAIN Options

🎯 Part 5: Query Rewriting Techniques

Example: Rewriting Complex Queries

Example: Eliminating Subqueries

📈 Part 6: Statistics and ANALYZE

🏗️ Part 7: Real-World Optimization Workflow

Step-by-Step Process

Comparison Template

🎯 Quiz Section

🏆 Best Practices for Interviews

🔗 Common Follow-Up Questions