📇 Indexing Strategies

Netflix & Amazon Interview Deep Dive

🏢 Netflix🏢 Amazon⚡ Difficulty: Hard⏱️ 40 min

📋 Interview Question

ℹ️🔴 Netflix/Amazon Interview Question

"You have a users table with 500M rows and queries are slow. Design an indexing strategy that handles: 1) lookup by email, 2) search by city and age range, 3) full-text search on bio, 4) frequent filter on active users. Explain trade-offs of each index type."

Companies: Netflix, Amazon | Difficulty: Hard | Time: 40 minutes

📊 Setup: Users Table

CREATE TABLE users (
    user_id BIGSERIAL PRIMARY KEY,
    email VARCHAR(255) NOT NULL,
    username VARCHAR(100),
    first_name VARCHAR(100),
    last_name VARCHAR(100),
    city VARCHAR(100),
    state VARCHAR(50),
    age INT,
    bio TEXT,
    is_active BOOLEAN DEFAULT true,
    created_at TIMESTAMP DEFAULT NOW(),
    last_login TIMESTAMP,
    profile_data JSONB
);

-- Insert 500M rows
INSERT INTO users (email, username, first_name, last_name, city, state, age, bio, is_active, created_at, last_login, profile_data)
SELECT
    'user' || i || '@example.com',
    'user' || i,
    'First' || i,
    'Last' || i,
    CASE (random() * 4)::INT
        WHEN 0 THEN 'New York'
        WHEN 1 THEN 'San Francisco'
        WHEN 2 THEN 'Chicago'
        WHEN 3 THEN 'Seattle'
        ELSE 'Austin'
    END,
    CASE (random() * 4)::INT
        WHEN 0 THEN 'NY'
        WHEN 1 THEN 'CA'
        WHEN 2 THEN 'IL'
        WHEN 3 THEN 'WA'
        ELSE 'TX'
    END,
    (random() * 80 + 18)::INT,
    'Bio text for user ' || i,
    (random() > 0.1),
    NOW() - (random() * 3650 || ' days')::interval,
    NOW() - (random() * 365 || ' days')::interval,
    jsonb_build_object(
        'theme', CASE (random() * 2)::INT WHEN 0 THEN 'light' WHEN 1 THEN 'dark' ELSE 'auto' END,
        'notifications', (random() > 0.5)
    )
FROM generate_series(1, 50000000) AS i;

🌲 Part 1: B-Tree Index (Default)

ℹ️🔍 B-Tree Index

B-tree is the default index type in PostgreSQL. It's optimal for:

Equality comparisons (=, <, >, <=, >=)
Range queries
Pattern matching with prefix (LIKE 'abc%')
Sorting (ORDER BY)
Combining multiple conditions (AND)

-- Basic B-tree index for email lookup
CREATE INDEX idx_users_email ON users(email);

-- Verify with EXPLAIN
EXPLAIN ANALYZE
SELECT * FROM users WHERE email = 'user12345@example.com';

-- Output: Index Scan using idx_users_email
-- actual time=0.025..0.026 rows=1 loops=1

Composite B-tree Index

-- Composite index for city + age queries
CREATE INDEX idx_users_city_age ON users(city, age);

-- Queries that benefit:
EXPLAIN ANALYZE SELECT * FROM users WHERE city = 'New York' AND age = 25;
EXPLAIN ANALYZE SELECT * FROM users WHERE city = 'New York';
EXPLAIN ANALYZE SELECT * FROM users WHERE city = 'New York' AND age BETWEEN 25 AND 35;

-- Query that WON'T use this index efficiently:
EXPLAIN ANALYZE SELECT * FROM users WHERE age = 25;  -- Skips city prefix

⚠️⚠️ Index Column Order Rule

B-tree indexes follow the leftmost prefix rule. A composite index on (city, age) can be used for:

city = 'X'
city = 'X' AND age = 25
city = 'X' AND age BETWEEN 25 AND 35

But NOT efficiently for:

age = 25 (skips city)
city LIKE '%York' (prefix not fixed)

B-tree with INCLUDE (Covering Index)

-- Covering index: includes columns needed by query
CREATE INDEX idx_users_email_covering
ON users(email)
INCLUDE (username, first_name, last_name, created_at);

-- Query can be satisfied entirely from index
EXPLAIN ANALYZE
SELECT email, username, first_name, last_name
FROM users
WHERE email LIKE 'user1000%';

-- Output: Index Only Scan (no table access needed)

🔢 Part 2: Hash Index

ℹ️🔍 Hash Index

Hash indexes are optimal for:

Equality comparisons only (=)
No range queries
No sorting
Smaller than B-tree for single-column lookups
PostgreSQL 10+ supports WAL-logged hash indexes

-- Hash index for exact email matching
CREATE INDEX idx_users_email_hash ON users USING hash(email);

-- Only equality queries benefit
EXPLAIN ANALYZE
SELECT * FROM users WHERE email = 'user12345@example.com';

-- Range queries will NOT use hash index
EXPLAIN ANALYZE
SELECT * FROM users WHERE email > 'user10000@example.com';  -- Seq Scan

Hash vs B-tree Comparison

-- Performance test
-- B-tree: 0.025ms for equality, 15ms for range
-- Hash:   0.018ms for equality, N/A for range

-- Index sizes (50M rows):
-- B-tree email: ~2.1 GB
-- Hash email:   ~1.8 GB

🌐 Part 3: GIN Index (Generalized Inverted Index)

ℹ️🔍 GIN Index

GIN indexes are optimal for:

Full-text search (tsvector)
JSONB containment (@>, ?)
Array operations (@>, &&)
hstore operations
Any multi-valued data

Full-Text Search GIN

-- Add tsvector column for full-text search
ALTER TABLE users ADD COLUMN bio_tsv/tsvector
GENERATED ALWAYS AS (
    to_tsvector('english', coalesce(bio, ''))
) STORED;

-- Create GIN index
CREATE INDEX idx_users_bio_tsv ON users USING gin(bio_tsv);

-- Full-text search query
EXPLAIN ANALYZE
SELECT user_id, username, bio
FROM users
WHERE bio_tsv @@ to_tsquery('english', 'engineer & software');

-- Ranking with ts_rank
SELECT
    user_id,
    username,
    bio,
    ts_rank(bio_tsv, to_tsquery('english', 'engineer & software')) AS rank
FROM users
WHERE bio_tsv @@ to_tsquery('english', 'engineer & software')
ORDER BY rank DESC
LIMIT 10;

JSONB GIN Index

-- GIN index for JSONB containment
CREATE INDEX idx_users_profile ON users USING gin(profile_data);

-- Containment queries
EXPLAIN ANALYZE
SELECT * FROM users
WHERE profile_data @> '{"theme": "dark"}';

-- Key existence queries
EXPLAIN ANALYZE
SELECT * FROM users
WHERE profile_data ? 'notifications';

-- Path queries
EXPLAIN ANALYZE
SELECT * FROM users
WHERE profile_data @> '{"notifications": true}';

Array GIN Index

-- Example with tags array
ALTER TABLE users ADD COLUMN tags TEXT[];
CREATE INDEX idx_users_tags ON users USING gin(tags);

-- Array overlap query
EXPLAIN ANALYZE
SELECT * FROM users
WHERE tags && ARRAY['premium', 'beta', 'enterprise'];

-- Array contains query
EXPLAIN ANALYZE
SELECT * FROM users
WHERE tags @> ARRAY['premium'];

🗺️ Part 4: GiST Index (Generalized Search Tree)

ℹ️🔍 GiST Index

GiST indexes are optimal for:

Geometric/spatial data ( PostGIS)
Range types (int4range, tsrange)
Full-text search (alternative to GIN)
Nearest-neighbor searches
Pattern matching (pg_trgm)

Trigram Index for LIKE Queries

-- Enable pg_trgm extension
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- GiST index for trigram similarity
CREATE INDEX idx_users_username_gist
ON users USING gist(username gist_trgm_ops);

-- Fuzzy search with similarity
EXPLAIN ANALYZE
SELECT username, similarity(username, 'johndoe') AS sim
FROM users
WHERE username % 'johndoe'  -- Trigram match
ORDER BY sim DESC
LIMIT 10;

-- LIKE queries with leading wildcard
EXPLAIN ANALYZE
SELECT * FROM users
WHERE username LIKE '%john%';

Range Type GiST Index

-- Example with session time ranges
CREATE TABLE user_sessions (
    session_id SERIAL PRIMARY KEY,
    user_id INT,
    active_range TSRANGE
);

CREATE INDEX idx_sessions_range ON user_sessions USING gist(active_range);

-- Overlap query
EXPLAIN ANALYZE
SELECT * FROM user_sessions
WHERE active_range && '[2024-01-01, 2024-01-31]'::tsrange;

-- Contains query
EXPLAIN ANALYZE
SELECT * FROM user_sessions
WHERE active_range @> '2024-01-15'::timestamp;

📍 Part 5: Partial Index

💡💡 Partial Index

Partial indexes are smaller and more efficient when you frequently filter on a specific condition. They only index rows matching the WHERE clause.

-- Partial index for active users only
CREATE INDEX idx_users_active ON users(user_id, last_login)
WHERE is_active = true;

-- This query uses the partial index
EXPLAIN ANALYZE
SELECT * FROM users
WHERE is_active = true AND last_login > NOW() - INTERVAL '30 days';

-- Partial index for recent orders
CREATE INDEX idx_orders_recent ON orders(order_date, customer_id)
WHERE order_date >= '2024-01-01';

-- Partial index for unprocessed records
CREATE INDEX idx_orders_pending ON orders(order_id, customer_id)
WHERE status = 'pending';

-- Partial index for high-value transactions
CREATE INDEX idx_transactions_high_value ON transactions(user_id, amount)
WHERE amount > 10000;

-- Partial index with expressions
CREATE INDEX idx_users_active_recent ON users(email)
WHERE is_active = true AND last_login > NOW() - INTERVAL '90 days';

Size Comparison

-- Full index size
SELECT pg_size_pretty(pg_relation_size('idx_users_email'));
-- ~2.1 GB

-- Partial index size
CREATE INDEX idx_users_email_active ON users(email)
WHERE is_active = true;
SELECT pg_size_pretty(pg_relation_size('idx_users_email_active'));
-- ~1.9 GB (90% of users are active)

📊 Part 6: Expression Index

-- Index on expression/function result
CREATE INDEX idx_users_lower_email ON users(LOWER(email));

-- This query uses the expression index
EXPLAIN ANALYZE
SELECT * FROM users
WHERE LOWER(email) = 'user12345@example.com';

-- Index on computed values
CREATE INDEX idx_users_age_group ON users(
    CASE
        WHEN age < 18 THEN 'minor'
        WHEN age < 65 THEN 'adult'
        ELSE 'senior'
    END
);

-- Index on date truncation
CREATE INDEX idx_users_created_month ON users(DATE_TRUNC('month', created_at));

-- Index with coalesce
CREATE INDEX idx_users_display_name ON users(
    COALESCE(first_name, '') || ' ' || COALESCE(last_name, '')
);

🎯 Part 7: Index Strategy Decision Matrix

💡✅ Index Selection Guide

Query Pattern	Index Type	Example
Equality (=)	B-tree or Hash	`WHERE id = 123`
Range (>, <, BETWEEN)	B-tree	`WHERE date > '2024-01-01'`
Pattern (LIKE 'abc%')	B-tree	`WHERE name LIKE 'John%'`
Pattern (LIKE '%abc%')	GiST/GIN (pg_trgm)	`WHERE name LIKE '%ohn%'`
Full-text search	GIN	`WHERE tsv @@ tsquery(...)`
JSONB containment	GIN	`WHERE data @> '{"key":"val"}'`
Array operations	GIN	`WHERE tags @> ARRAY['x']`
Geometric/spatial	GiST	ST_DWithin(location, ...)
Nearest neighbor	GiST	`ORDER BY location <-> point`
Frequent filter	Partial	`WHERE status = 'active'`

🔧 Part 8: Index Maintenance

Monitoring Index Usage

-- Check index usage statistics
SELECT
    schemaname,
    relname AS table_name,
    indexrelname AS index_name,
    idx_scan AS times_used,
    idx_tup_read AS rows_read,
    idx_tup_fetch AS rows_fetched,
    pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE schemaname = 'public'
ORDER BY idx_scan DESC;

-- Find unused indexes
SELECT
    indexrelname,
    pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
    idx_scan AS times_used
FROM pg_stat_user_indexes
WHERE idx_scan = 0
AND schemaname = 'public'
ORDER BY pg_relation_size(indexrelid) DESC;

-- Check index bloat
SELECT
    indexrelname,
    pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
    idx_scan
FROM pg_stat_user_indexes
WHERE relname = 'users'
ORDER BY pg_relation_size(indexrelid) DESC;

Index Rebuild and Reindex

-- Rebuild index (non-blocking with CONCURRENTLY)
REINDEX INDEX CONCURRENTLY idx_users_email;

-- Rebuild all indexes on a table
REINDEX TABLE CONCURRENTLY users;

-- Check if reindex is needed
SELECT
    indexrelname,
    pg_size_pretty(pg_relation_size(indexrelid)) AS current_size,
    CASE
        WHEN pg_relation_size(indexrelid) > 1073741824 THEN 'Consider rebuild'
        ELSE 'OK'
    END AS status
FROM pg_stat_user_indexes
WHERE relname = 'users';

🎯 Quiz Section

🏆 Best Practices for Interviews

💡✅ Indexing Best Practices

1. Index Selectively:

-- BAD: Index every column
CREATE INDEX idx_users_col1 ON users(col1);
CREATE INDEX idx_users_col2 ON users(col2);
-- ... 50 more indexes

-- GOOD: Index based on query patterns
-- Analyze slow queries first
SELECT * FROM pg_stat_user_statements
WHERE query LIKE '%users%'
ORDER BY mean_exec_time DESC;

2. Use Composite Indexes for Multi-Column Queries:

-- Instead of separate indexes, create one composite
-- BAD
CREATE INDEX idx_users_city ON users(city);
CREATE INDEX idx_users_age ON users(age);

-- GOOD
CREATE INDEX idx_users_city_age ON users(city, age);

3. Consider Covering Indexes:

-- Include frequently selected columns
CREATE INDEX idx_users_email_covering
ON users(email)
INCLUDE (username, first_name, created_at);

4. Monitor and Clean Up:

-- Regularly check for unused indexes
SELECT indexrelname, idx_scan
FROM pg_stat_user_indexes
WHERE idx_scan = 0 AND schemaname = 'public';

5. Use Partial Indexes When Appropriate:

-- When queries always filter on a condition
CREATE INDEX idx_users_active ON users(email)
WHERE is_active = true;

🔗 Common Follow-Up Questions

"How do you choose between B-tree and Hash?" — Hash for equality only, B-tree for range/sort
"When would you drop an index?" — Unused indexes waste space and slow writes
"How do indexes affect INSERT/UPDATE performance?" — Each index adds overhead to write operations
"What is index-only scan?" — Query answered entirely from index without table access

⚠️⚠️ Performance Trade-off

Every index improves read performance but degrades write performance. The optimal number of indexes depends on your read/write ratio. For analytics-heavy workloads, more indexes are beneficial. For write-heavy workloads, be selective.

Topic: SQL Indexing Strategies for FAANG Interviews