π Indexing Strategies
Netflix & Amazon Interview Deep Dive
π Interview Question
βΉοΈπ΄ Netflix/Amazon Interview Question
"You have a users table with 500M rows and queries are slow. Design an indexing strategy that handles: 1) lookup by email, 2) search by city and age range, 3) full-text search on bio, 4) frequent filter on active users. Explain trade-offs of each index type."
Companies: Netflix, Amazon | Difficulty: Hard | Time: 40 minutes
π Setup: Users Table
CREATE TABLE users (
user_id BIGSERIAL PRIMARY KEY,
email VARCHAR(255) NOT NULL,
username VARCHAR(100),
first_name VARCHAR(100),
last_name VARCHAR(100),
city VARCHAR(100),
state VARCHAR(50),
age INT,
bio TEXT,
is_active BOOLEAN DEFAULT true,
created_at TIMESTAMP DEFAULT NOW(),
last_login TIMESTAMP,
profile_data JSONB
);
-- Insert 500M rows
INSERT INTO users (email, username, first_name, last_name, city, state, age, bio, is_active, created_at, last_login, profile_data)
SELECT
'user' || i || '@example.com',
'user' || i,
'First' || i,
'Last' || i,
CASE (random() * 4)::INT
WHEN 0 THEN 'New York'
WHEN 1 THEN 'San Francisco'
WHEN 2 THEN 'Chicago'
WHEN 3 THEN 'Seattle'
ELSE 'Austin'
END,
CASE (random() * 4)::INT
WHEN 0 THEN 'NY'
WHEN 1 THEN 'CA'
WHEN 2 THEN 'IL'
WHEN 3 THEN 'WA'
ELSE 'TX'
END,
(random() * 80 + 18)::INT,
'Bio text for user ' || i,
(random() > 0.1),
NOW() - (random() * 3650 || ' days')::interval,
NOW() - (random() * 365 || ' days')::interval,
jsonb_build_object(
'theme', CASE (random() * 2)::INT WHEN 0 THEN 'light' WHEN 1 THEN 'dark' ELSE 'auto' END,
'notifications', (random() > 0.5)
)
FROM generate_series(1, 50000000) AS i;
π² Part 1: B-Tree Index (Default)
βΉοΈπ B-Tree Index
B-tree is the default index type in PostgreSQL. It's optimal for:
- Equality comparisons (=, <, >, <=, >=)
- Range queries
- Pattern matching with prefix (LIKE 'abc%')
- Sorting (ORDER BY)
- Combining multiple conditions (AND)
-- Basic B-tree index for email lookup
CREATE INDEX idx_users_email ON users(email);
-- Verify with EXPLAIN
EXPLAIN ANALYZE
SELECT * FROM users WHERE email = 'user12345@example.com';
-- Output: Index Scan using idx_users_email
-- actual time=0.025..0.026 rows=1 loops=1
Composite B-tree Index
-- Composite index for city + age queries
CREATE INDEX idx_users_city_age ON users(city, age);
-- Queries that benefit:
EXPLAIN ANALYZE SELECT * FROM users WHERE city = 'New York' AND age = 25;
EXPLAIN ANALYZE SELECT * FROM users WHERE city = 'New York';
EXPLAIN ANALYZE SELECT * FROM users WHERE city = 'New York' AND age BETWEEN 25 AND 35;
-- Query that WON'T use this index efficiently:
EXPLAIN ANALYZE SELECT * FROM users WHERE age = 25; -- Skips city prefix
β οΈβ οΈ Index Column Order Rule
B-tree indexes follow the leftmost prefix rule. A composite index on (city, age) can be used for:
city = 'X'city = 'X' AND age = 25city = 'X' AND age BETWEEN 25 AND 35
But NOT efficiently for:
age = 25(skips city)city LIKE '%York'(prefix not fixed)
B-tree with INCLUDE (Covering Index)
-- Covering index: includes columns needed by query
CREATE INDEX idx_users_email_covering
ON users(email)
INCLUDE (username, first_name, last_name, created_at);
-- Query can be satisfied entirely from index
EXPLAIN ANALYZE
SELECT email, username, first_name, last_name
FROM users
WHERE email LIKE 'user1000%';
-- Output: Index Only Scan (no table access needed)
π’ Part 2: Hash Index
βΉοΈπ Hash Index
Hash indexes are optimal for:
- Equality comparisons only (=)
- No range queries
- No sorting
- Smaller than B-tree for single-column lookups
- PostgreSQL 10+ supports WAL-logged hash indexes
-- Hash index for exact email matching
CREATE INDEX idx_users_email_hash ON users USING hash(email);
-- Only equality queries benefit
EXPLAIN ANALYZE
SELECT * FROM users WHERE email = 'user12345@example.com';
-- Range queries will NOT use hash index
EXPLAIN ANALYZE
SELECT * FROM users WHERE email > 'user10000@example.com'; -- Seq Scan
Hash vs B-tree Comparison
-- Performance test
-- B-tree: 0.025ms for equality, 15ms for range
-- Hash: 0.018ms for equality, N/A for range
-- Index sizes (50M rows):
-- B-tree email: ~2.1 GB
-- Hash email: ~1.8 GB
π Part 3: GIN Index (Generalized Inverted Index)
βΉοΈπ GIN Index
GIN indexes are optimal for:
- Full-text search (tsvector)
- JSONB containment (@>, ?)
- Array operations (@>, &&)
- hstore operations
- Any multi-valued data
Full-Text Search GIN
-- Add tsvector column for full-text search
ALTER TABLE users ADD COLUMN bio_tsv/tsvector
GENERATED ALWAYS AS (
to_tsvector('english', coalesce(bio, ''))
) STORED;
-- Create GIN index
CREATE INDEX idx_users_bio_tsv ON users USING gin(bio_tsv);
-- Full-text search query
EXPLAIN ANALYZE
SELECT user_id, username, bio
FROM users
WHERE bio_tsv @@ to_tsquery('english', 'engineer & software');
-- Ranking with ts_rank
SELECT
user_id,
username,
bio,
ts_rank(bio_tsv, to_tsquery('english', 'engineer & software')) AS rank
FROM users
WHERE bio_tsv @@ to_tsquery('english', 'engineer & software')
ORDER BY rank DESC
LIMIT 10;
JSONB GIN Index
-- GIN index for JSONB containment
CREATE INDEX idx_users_profile ON users USING gin(profile_data);
-- Containment queries
EXPLAIN ANALYZE
SELECT * FROM users
WHERE profile_data @> '{"theme": "dark"}';
-- Key existence queries
EXPLAIN ANALYZE
SELECT * FROM users
WHERE profile_data ? 'notifications';
-- Path queries
EXPLAIN ANALYZE
SELECT * FROM users
WHERE profile_data @> '{"notifications": true}';
Array GIN Index
-- Example with tags array
ALTER TABLE users ADD COLUMN tags TEXT[];
CREATE INDEX idx_users_tags ON users USING gin(tags);
-- Array overlap query
EXPLAIN ANALYZE
SELECT * FROM users
WHERE tags && ARRAY['premium', 'beta', 'enterprise'];
-- Array contains query
EXPLAIN ANALYZE
SELECT * FROM users
WHERE tags @> ARRAY['premium'];
πΊοΈ Part 4: GiST Index (Generalized Search Tree)
βΉοΈπ GiST Index
GiST indexes are optimal for:
- Geometric/spatial data ( PostGIS)
- Range types (int4range, tsrange)
- Full-text search (alternative to GIN)
- Nearest-neighbor searches
- Pattern matching (pg_trgm)
Trigram Index for LIKE Queries
-- Enable pg_trgm extension
CREATE EXTENSION IF NOT EXISTS pg_trgm;
-- GiST index for trigram similarity
CREATE INDEX idx_users_username_gist
ON users USING gist(username gist_trgm_ops);
-- Fuzzy search with similarity
EXPLAIN ANALYZE
SELECT username, similarity(username, 'johndoe') AS sim
FROM users
WHERE username % 'johndoe' -- Trigram match
ORDER BY sim DESC
LIMIT 10;
-- LIKE queries with leading wildcard
EXPLAIN ANALYZE
SELECT * FROM users
WHERE username LIKE '%john%';
Range Type GiST Index
-- Example with session time ranges
CREATE TABLE user_sessions (
session_id SERIAL PRIMARY KEY,
user_id INT,
active_range TSRANGE
);
CREATE INDEX idx_sessions_range ON user_sessions USING gist(active_range);
-- Overlap query
EXPLAIN ANALYZE
SELECT * FROM user_sessions
WHERE active_range && '[2024-01-01, 2024-01-31]'::tsrange;
-- Contains query
EXPLAIN ANALYZE
SELECT * FROM user_sessions
WHERE active_range @> '2024-01-15'::timestamp;
π Part 5: Partial Index
π‘π‘ Partial Index
Partial indexes are smaller and more efficient when you frequently filter on a specific condition. They only index rows matching the WHERE clause.
-- Partial index for active users only
CREATE INDEX idx_users_active ON users(user_id, last_login)
WHERE is_active = true;
-- This query uses the partial index
EXPLAIN ANALYZE
SELECT * FROM users
WHERE is_active = true AND last_login > NOW() - INTERVAL '30 days';
-- Partial index for recent orders
CREATE INDEX idx_orders_recent ON orders(order_date, customer_id)
WHERE order_date >= '2024-01-01';
-- Partial index for unprocessed records
CREATE INDEX idx_orders_pending ON orders(order_id, customer_id)
WHERE status = 'pending';
-- Partial index for high-value transactions
CREATE INDEX idx_transactions_high_value ON transactions(user_id, amount)
WHERE amount > 10000;
-- Partial index with expressions
CREATE INDEX idx_users_active_recent ON users(email)
WHERE is_active = true AND last_login > NOW() - INTERVAL '90 days';
Size Comparison
-- Full index size
SELECT pg_size_pretty(pg_relation_size('idx_users_email'));
-- ~2.1 GB
-- Partial index size
CREATE INDEX idx_users_email_active ON users(email)
WHERE is_active = true;
SELECT pg_size_pretty(pg_relation_size('idx_users_email_active'));
-- ~1.9 GB (90% of users are active)
π Part 6: Expression Index
-- Index on expression/function result
CREATE INDEX idx_users_lower_email ON users(LOWER(email));
-- This query uses the expression index
EXPLAIN ANALYZE
SELECT * FROM users
WHERE LOWER(email) = 'user12345@example.com';
-- Index on computed values
CREATE INDEX idx_users_age_group ON users(
CASE
WHEN age < 18 THEN 'minor'
WHEN age < 65 THEN 'adult'
ELSE 'senior'
END
);
-- Index on date truncation
CREATE INDEX idx_users_created_month ON users(DATE_TRUNC('month', created_at));
-- Index with coalesce
CREATE INDEX idx_users_display_name ON users(
COALESCE(first_name, '') || ' ' || COALESCE(last_name, '')
);
π― Part 7: Index Strategy Decision Matrix
π‘β Index Selection Guide
| Query Pattern | Index Type | Example |
|---|---|---|
| Equality (=) | B-tree or Hash | WHERE id = 123 |
| Range (>, <, BETWEEN) | B-tree | WHERE date > '2024-01-01' |
| Pattern (LIKE 'abc%') | B-tree | WHERE name LIKE 'John%' |
| Pattern (LIKE '%abc%') | GiST/GIN (pg_trgm) | WHERE name LIKE '%ohn%' |
| Full-text search | GIN | WHERE tsv @@ tsquery(...) |
| JSONB containment | GIN | WHERE data @> '{"key":"val"}' |
| Array operations | GIN | WHERE tags @> ARRAY['x'] |
| Geometric/spatial | GiST | ST_DWithin(location, ...) |
| Nearest neighbor | GiST | ORDER BY location <-> point |
| Frequent filter | Partial | WHERE status = 'active' |
π§ Part 8: Index Maintenance
Monitoring Index Usage
-- Check index usage statistics
SELECT
schemaname,
relname AS table_name,
indexrelname AS index_name,
idx_scan AS times_used,
idx_tup_read AS rows_read,
idx_tup_fetch AS rows_fetched,
pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE schemaname = 'public'
ORDER BY idx_scan DESC;
-- Find unused indexes
SELECT
indexrelname,
pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
idx_scan AS times_used
FROM pg_stat_user_indexes
WHERE idx_scan = 0
AND schemaname = 'public'
ORDER BY pg_relation_size(indexrelid) DESC;
-- Check index bloat
SELECT
indexrelname,
pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
idx_scan
FROM pg_stat_user_indexes
WHERE relname = 'users'
ORDER BY pg_relation_size(indexrelid) DESC;
Index Rebuild and Reindex
-- Rebuild index (non-blocking with CONCURRENTLY)
REINDEX INDEX CONCURRENTLY idx_users_email;
-- Rebuild all indexes on a table
REINDEX TABLE CONCURRENTLY users;
-- Check if reindex is needed
SELECT
indexrelname,
pg_size_pretty(pg_relation_size(indexrelid)) AS current_size,
CASE
WHEN pg_relation_size(indexrelid) > 1073741824 THEN 'Consider rebuild'
ELSE 'OK'
END AS status
FROM pg_stat_user_indexes
WHERE relname = 'users';
π― Quiz Section
π Best Practices for Interviews
π‘β Indexing Best Practices
1. Index Selectively:
-- BAD: Index every column
CREATE INDEX idx_users_col1 ON users(col1);
CREATE INDEX idx_users_col2 ON users(col2);
-- ... 50 more indexes
-- GOOD: Index based on query patterns
-- Analyze slow queries first
SELECT * FROM pg_stat_user_statements
WHERE query LIKE '%users%'
ORDER BY mean_exec_time DESC;
2. Use Composite Indexes for Multi-Column Queries:
-- Instead of separate indexes, create one composite
-- BAD
CREATE INDEX idx_users_city ON users(city);
CREATE INDEX idx_users_age ON users(age);
-- GOOD
CREATE INDEX idx_users_city_age ON users(city, age);
3. Consider Covering Indexes:
-- Include frequently selected columns
CREATE INDEX idx_users_email_covering
ON users(email)
INCLUDE (username, first_name, created_at);
4. Monitor and Clean Up:
-- Regularly check for unused indexes
SELECT indexrelname, idx_scan
FROM pg_stat_user_indexes
WHERE idx_scan = 0 AND schemaname = 'public';
5. Use Partial Indexes When Appropriate:
-- When queries always filter on a condition
CREATE INDEX idx_users_active ON users(email)
WHERE is_active = true;
π Common Follow-Up Questions
- "How do you choose between B-tree and Hash?" β Hash for equality only, B-tree for range/sort
- "When would you drop an index?" β Unused indexes waste space and slow writes
- "How do indexes affect INSERT/UPDATE performance?" β Each index adds overhead to write operations
- "What is index-only scan?" β Query answered entirely from index without table access
β οΈβ οΈ Performance Trade-off
Every index improves read performance but degrades write performance. The optimal number of indexes depends on your read/write ratio. For analytics-heavy workloads, more indexes are beneficial. For write-heavy workloads, be selective.