Interview Question: "Explain the difference between a cost-based optimizer and a rule-based optimizer. How does PostgreSQL estimate row counts? What are statistics and how do they affect query plans?" — Asked at Oracle, Microsoft, Snowflake for Database Engineer roles
ℹ️
Difficulty: Advanced | Companies: Oracle, Microsoft, Snowflake, Databricks, Amazon Redshift | Time: 60-75 minutes
Cost Model Fundamentals
The query optimizer estimates cost using:
For a sequential scan:
For an index scan:
-- Create sample tables for optimization analysis
CREATE TABLE orders (
order_id SERIAL PRIMARY KEY,
customer_id INT,
order_date DATE,
total_amount DECIMAL(12,2),
status VARCHAR(20),
region VARCHAR(50)
);
CREATE TABLE order_items (
item_id SERIAL PRIMARY KEY,
order_id INT REFERENCES orders(order_id),
product_id INT,
quantity INT,
unit_price DECIMAL(10,2)
);
CREATE TABLE customers (
customer_id SERIAL PRIMARY KEY,
name VARCHAR(100),
email VARCHAR(255),
region VARCHAR(50),
credit_limit DECIMAL(12,2)
);
-- Generate sample data (1M orders)
INSERT INTO orders (customer_id, order_date, total_amount, status, region)
SELECT
(random() * 10000)::INT + 1,
CURRENT_DATE - (random() * 365)::INT,
(random() * 10000)::DECIMAL(12,2),
(ARRAY['pending', 'shipped', 'delivered', 'cancelled'])[floor(random()*4+1)],
(ARRAY['North', 'South', 'East', 'West'])[floor(random()*4+1)]
FROM generate_series(1, 1000000);
-- Generate order items
INSERT INTO order_items (order_id, product_id, quantity, unit_price)
SELECT
order_id,
(random() * 1000)::INT + 1,
(random() * 10)::INT + 1,
(random() * 500)::DECIMAL(10,2)
FROM orders, generate_series(1, (random() * 5)::INT + 1);
-- Generate customers
INSERT INTO customers (name, email, region, credit_limit)
SELECT
'Customer ' || i,
'customer' || i || '@example.com',
(ARRAY['North', 'South', 'East', 'West'])[floor(random()*4+1)],
(random() * 50000)::DECIMAL(12,2)
FROM generate_series(1, 10000) i;
Understanding EXPLAIN Output
-- Basic EXPLAIN analysis
EXPLAIN SELECT * FROM orders WHERE region = 'North';
Output:
Seq Scan on orders (cost=0.00..18334.00 rows=250000 width=48)
Filter: (region = 'North')
Rows Removed by Filter: 750000
Cost Breakdown:
0.00= startup cost (before first row)18334.00= total cost (all rows processed)250000= estimated rows48= estimated row width in bytes
-- Detailed EXPLAIN with ANALYZE
EXPLAIN ANALYZE
SELECT o.order_id, o.total_amount, c.name
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
WHERE o.region = 'North'
AND o.total_amount > 1000;
Output:
Hash Join (cost=2567.89..19456.78 rows=62500 width=64) (actual time=12.345..89.012 rows=62489 loops=1)
Hash Cond: (o.customer_id = c.customer_id)
-> Seq Scan on orders o (cost=0.00..15678.90 rows=250000 width=16) (actual time=0.012..45.678 rows=250000 loops=1)
Filter: ((region = 'North') AND (total_amount > 1000))
Rows Removed by Filter: 750000
-> Hash (cost=1234.56..1234.56 rows=10000 width=48) (actual time=8.901..8.901 rows=10000 loops=1)
Buckets: 16384 Batches: 1 Memory Usage: 1234kB
-> Seq Scan on customers c (cost=0.00..1234.56 rows=10000 width=48) (actual time=0.005..4.567 rows=10000 loops=1)
Planning Time: 0.234 ms
Execution Time: 89.567 ms
ℹ️
Key Metrics: Compare estimated rows vs actual rows. Large discrepancies indicate stale statistics or poor estimates.
Statistics and Histograms
-- View table statistics
SELECT
attname,
null_frac,
avg_width,
n_distinct,
correlation,
most_common_vals,
most_common_freqs,
histogram_bounds
FROM pg_stats
WHERE tablename = 'orders'
AND attname = 'region';
-- Manually update statistics
ANALYZE orders;
-- Check statistics target
SELECT
attname,
attstattarget,
pg_stats.ext_stats
FROM pg_attribute
JOIN pg_stats ON attname = attname
WHERE attrelid = 'orders'::regclass;
Histogram Representation:
For a column with values :
Each bucket contains approximately values.
-- Create custom statistics for better estimates
CREATE STATISTICS orders_region_status (region, status) ON orders;
-- Check correlation statistics
SELECT
correlation,
-- Physical correlation affects index scan cost
CASE
WHEN correlation > 0.8 THEN 'High correlation - index scan efficient'
WHEN correlation > 0.5 THEN 'Medium correlation'
ELSE 'Low correlation - index scan less efficient'
END AS analysis
FROM pg_stats
WHERE tablename = 'orders'
AND attname = 'order_date';
Cost Calculation Formulas
Sequential Scan Cost:
Index Scan Cost:
Hash Join Cost:
Merge Join Cost:
-- Compare different join strategies
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT o.*, c.*
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
WHERE o.order_date > '2024-01-01';
-- Force specific join method
SET enable_hashjoin = off;
SET enable_mergejoin = off;
-- Only nested loop remains
-- Reset
RESET enable_hashjoin;
RESET enable_mergejoin;
Index Selection Algorithm
The optimizer selects indexes based on:
-- Create various indexes for comparison
CREATE INDEX idx_orders_region ON orders(region);
CREATE INDEX idx_orders_date ON orders(order_date);
CREATE INDEX idx_orders_composite ON orders(region, order_date, total_amount);
CREATE INDEX idx_orders_partial ON orders(total_amount) WHERE status = 'shipped';
-- Compare plans with different indexes
EXPLAIN SELECT * FROM orders WHERE region = 'North' AND order_date > '2024-06-01';
-- Drop indexes to see plan changes
DROP INDEX idx_orders_region;
EXPLAIN SELECT * FROM orders WHERE region = 'North' AND order_date > '2024-06-01';
Index Cost Comparison:
| Index Type | Random I/O | Sequential I/O | Memory |
|---|---|---|---|
| Seq Scan | 0 | High | Low |
| B-Tree | Medium | Low | Medium |
| Hash | Low | Low | High |
| GiST | High | Low | Medium |
⚠️
Index Overhead: Each index adds write overhead. The optimizer must balance read performance vs write cost. Monitor index usage with pg_stat_user_indexes.
Advanced Cost Estimation
Correlation Factor:
Selectivity Estimation:
-- Analyze selectivity of different predicates
EXPLAIN SELECT * FROM orders
WHERE total_amount BETWEEN 1000 AND 5000;
EXPLAIN SELECT * FROM orders
WHERE total_amount > 1000;
EXPLAIN SELECT * FROM orders
WHERE status = 'shipped' AND region = 'North';
-- Check estimated vs actual for different ranges
EXPLAIN ANALYZE SELECT * FROM orders
WHERE order_date BETWEEN '2024-01-01' AND '2024-03-31';
EXPLAIN ANALYZE SELECT * FROM orders
WHERE order_date BETWEEN '2024-01-01' AND '2024-12-31';
Query Plan Cache Analysis
-- Check plan cache statistics
SELECT
query,
calls,
total_time,
mean_time,
rows
FROM pg_stat_statements
ORDER BY total_time DESC
LIMIT 10;
-- Analyze plan stability
EXPLAIN (ANALYZE, BUFFERS, FORMAT JSON)
SELECT * FROM orders WHERE region = 'North';
-- Compare with different parameter values
PREPARE test_query AS SELECT * FROM orders WHERE region = $1;
EXPLAIN ANALYZE EXECUTE test_query('North');
EXPLAIN ANALYZE EXECUTE test_query('South');
Performance Tuning Checklist
- Statistics Freshness:
ANALYZEafter major data changes - Index Usage: Check
pg_stat_user_indexes - Join Order: Verify optimal join sequence
- Memory Settings: Adjust work_mem, shared_buffers
- Parallel Queries: Enable parallel workers for large scans
ℹ️
Pro Tip: Use EXPLAIN (ANALYZE, BUFFERS, FORMAT JSON) with tools like pganalyze or explain.dalibo.com for visual plan analysis.
Mathematical Model for Plan Selection
The optimizer uses a dynamic programming approach:
Where is the query and is a candidate plan.
For joins with tables:
With pruning and heuristics:
⚠️
Cardinality Estimation Errors: The #1 cause of bad query plans. Monitor with pg_stat_statements and tune statistics targets for critical columns.