Design Search Autocomplete / Typeahead System
Building real-time query suggestions with sub-50ms latency for billions of users
Interview Question
"Design a search autocomplete system like Google Search that can provide real-time query suggestions as users type, handling billions of queries with sub-50ms latency, while personalizing suggestions based on user history and trending topics."
Difficulty: Hard | Frequently asked at Google, Microsoft/Bing, Amazon, Uber
1. Requirements Gathering
Functional Requirements
- Prefix Matching: Return suggestions matching typed prefix
- Real-time Updates: Suggestions update as user types each character
- Personalization: Suggestions based on user search history
- Trending Topics: Include currently trending queries
- Spell Correction: Handle typos and misspellings
- Multi-language: Support multiple languages and scripts
- Safe Search: Filter inappropriate suggestions
Non-Functional Requirements
- Latency: < 50ms for suggestion generation (critical for UX)
- Throughput: 100,000+ queries per second, billions daily
- Freshness: Trending topics update within minutes
- Relevance: Top suggestion should be clicked > 40% of the time
- Availability: 99.99% uptime
- Scale: 8.5B+ daily searches, 1.8B+ users
- Privacy: User history must be private
βΉοΈ
Scale Perspective: Google processes over 8.5 billion searches daily. Their autocomplete system provides suggestions after just 3-4 characters, with responses in under 50ms. The system must handle massive concurrent users while providing personalized, relevant suggestions.
2. High-Level Architecture Overview
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β CLIENT LAYER β
β Chrome Browser β Mobile App β Google Search β Voice Assistant β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β
βΌ
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β EDGE / CDN LAYER β
β Request Routing β Response Caching β Rate Limiting β DDoS Protection β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β
βββββββββββββββββΌββββββββββββββββ
βΌ βΌ βΌ
ββββββββββββββββββββββββββ βββββββββββββββββ ββββββββββββββββββββββββ
β SUGGESTION SERVICE β β PERSONALIZATIONβ β TRENDING SERVICE β
β (Trie + ML Ranking) β β SERVICE β β (Real-time) β
β (< 20ms) β β (< 10ms) β β (< 15ms) β
ββββββββββββββββββββββββββ βββββββββββββββββ ββββββββββββββββββββββββ
β
βΌ
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β DATA LAYER β
β Trie Index β Query Logs β User History β Trending Data β Feature Store β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
π‘
Key Insight: Autocomplete requires extremely fast prefix matching. The trie data structure is ideal for this, but must be distributed across servers for scale. ML ranking adds relevance but must be optimized for latency.
3. Data Pipeline Design
3.1 Query Data Model
from dataclasses import dataclass
from typing import List, Dict, Optional
from datetime import datetime
@dataclass
class QuerySuggestion:
query: str
frequency: int
last_updated: datetime
category: Optional[str]
language: str
is_trending: bool
trending_score: Optional[float]
personalization_score: Optional[float]
@dataclass
class QueryLog:
query: str
user_id: str
timestamp: datetime
position_clicked: Optional[int]
session_id: str
device_type: str
country: str
3.2 Trie-Based Index
class TrieNode:
def __init__(self):
self.children = {}
self.is_end = False
self.frequency = 0
self.query = None
self.last_updated = None
class DistributedTrie:
def __init__(self, num_shards=1000):
self.num_shards = num_shards
self.shards = [TrieShard(i) for i in range(num_shards)]
def get_shard(self, prefix):
shard_id = hash(prefix) % self.num_shards
return self.shards[shard_id]
def insert(self, query, frequency=1):
shard = self.get_shard(query[:3])
shard.insert(query, frequency)
def search(self, prefix, k=10):
shard = self.get_shard(prefix[:3])
return shard.search(prefix, k)
class TrieShard:
def __init__(self, shard_id):
self.shard_id = shard_id
self.root = TrieNode()
def insert(self, query, frequency=1):
node = self.root
for char in query:
if char not in node.children:
node.children[char] = TrieNode()
node = node.children[char]
node.is_end = True
node.frequency += frequency
node.query = query
node.last_updated = datetime.now()
def search(self, prefix, k=10):
node = self.root
for char in prefix:
if char not in node.children:
return []
node = node.children[char]
# Collect all queries with this prefix
results = []
self._collect_queries(node, results, k)
# Sort by frequency
results.sort(key=lambda x: x.frequency, reverse=True)
return results[:k]
def _collect_queries(self, node, results, k):
if len(results) >= k:
return
if node.is_end:
results.append((node.query, node.frequency))
for child in node.children.values():
self._collect_queries(child, results, k)
3.3 Real-time Query Processing
class QueryProcessor:
def __init__(self):
self.trie = DistributedTrie()
self.user_history = RedisClient()
self.trending_service = TrendingService()
async def get_suggestions(self, prefix, user_id=None, k=10):
# Get trie-based suggestions
trie_suggestions = self.trie.search(prefix, k * 2)
# Get personalized suggestions
if user_id:
personalized = await self.get_personalized_suggestions(prefix, user_id)
else:
personalized = []
# Get trending suggestions
trending = await self.trending_service.get_trending(prefix)
# Merge and rank
all_suggestions = self.merge_suggestions(
trie_suggestions, personalized, trending
)
# Apply ML ranking
ranked = await self.rank_suggestions(all_suggestions, user_id)
return ranked[:k]
def merge_suggestions(self, trie_suggestions, personalized, trending):
merged = {}
# Add trie suggestions
for query, freq in trie_suggestions:
merged[query] = {'query': query, 'frequency': freq, 'source': 'trie'}
# Add personalized suggestions (higher weight)
for query, score in personalized:
if query in merged:
merged[query]['personalization_score'] = score
else:
merged[query] = {'query': query, 'frequency': 0, 'source': 'personalized'}
# Add trending suggestions (boost score)
for query, trend_score in trending:
if query in merged:
merged[query]['trending_score'] = trend_score
else:
merged[query] = {'query': query, 'frequency': 0, 'source': 'trending'}
return list(merged.values())
β οΈ
Critical Design Considerations:
- Memory efficiency: Tries can be memory-intensive - use compressed tries
- Distribution: Shard by prefix for even distribution
- Consistency: Handle concurrent updates gracefully
- Privacy: Don't store user search history in plain text
4. Model Selection and Training
4.1 ML-Based Ranking
class AutocompleteRanker:
def __init__(self):
self.model = self.build_model()
def build_model(self):
model = tf.keras.Sequential([
tf.keras.layers.Dense(128, activation='relu'),
tf.keras.layers.BatchNormalization(),
tf.keras.layers.Dropout(0.3),
tf.keras.layers.Dense(64, activation='relu'),
tf.keras.layers.Dense(1, activation='sigmoid')
])
return model
def extract_features(self, query, user_context):
return {
'query_length': len(query),
'query_frequency': self.get_query_frequency(query),
'recency_score': self.get_recency_score(query),
'personalization_score': self.get_personalization_score(query, user_context),
'trending_score': self.get_trending_score(query),
'category_match': self.get_category_match(query, user_context),
'language_match': self.get_language_match(query, user_context)
}
async def rank_suggestions(self, suggestions, user_context):
features = []
for suggestion in suggestions:
feat = self.extract_features(suggestion['query'], user_context)
features.append(feat)
# Predict click probability
X = np.array([list(f.values()) for f in features])
scores = self.model.predict(X)
# Combine with original frequency
for i, suggestion in enumerate(suggestions):
suggestion['ml_score'] = scores[i][0]
suggestion['final_score'] = (
0.4 * suggestion.get('frequency', 0) / 1000000 +
0.3 * suggestion.get('personalization_score', 0) +
0.2 * suggestion.get('trending_score', 0) +
0.1 * suggestion['ml_score']
)
# Sort by final score
suggestions.sort(key=lambda x: x['final_score'], reverse=True)
return suggestions
4.2 Personalization Model
class PersonalizationModel:
def __init__(self):
self.user_embeddings = None
self.query_embeddings = None
async def get_personalized_suggestions(self, prefix, user_id):
# Get user's search history
history = await self.get_user_history(user_id)
# Get user embedding
user_emb = await self.get_user_embedding(user_id)
# Find similar queries from history
similar_queries = []
for query in history:
if query.startswith(prefix):
query_emb = await self.get_query_embedding(query)
similarity = np.dot(user_emb, query_emb)
similar_queries.append((query, similarity))
# Sort by similarity
similar_queries.sort(key=lambda x: x[1], reverse=True)
return similar_queries[:10]
βΉοΈ
Personalization Best Practices:
- Use embedding-based similarity for better matching
- Balance personalization with diversity
- Handle cold start for new users
- Respect user privacy preferences
5. Serving Architecture
5.1 Real-time Serving Pipeline
User Input β Load Balancer β Suggestion Service β Response (< 50ms)
β
βββ Trie Lookup (< 10ms)
βββ ML Ranking (< 15ms)
βββ Personalization (< 10ms)
βββ Response Assembly (< 5ms)
5.2 Caching Strategy
class AutocompleteCache:
def __init__(self):
self.l1_cache = LRUCache(maxsize=100000) # In-memory
self.l2_cache = RedisCluster() # Distributed
async def get_suggestions(self, prefix, user_id=None):
cache_key = f"{prefix}:{user_id or 'global'}"
# Check L1 cache
result = self.l1_cache.get(cache_key)
if result:
return result
# Check L2 cache
result = await self.l2_cache.get(cache_key)
if result:
self.l1_cache.set(cache_key, result, ttl=60)
return result
# Compute fresh results
result = await self.compute_suggestions(prefix, user_id)
# Cache results
self.l1_cache.set(cache_key, result, ttl=60)
await self.l2_cache.set(cache_key, result, ttl=300)
return result
π‘
Caching Tips:
- Cache prefix-based results aggressively
- Use shorter TTL for personalized results
- Cache trending topics separately
- Implement cache warming for popular prefixes
6. Monitoring and Observability
6.1 Key Metrics
class AutocompleteMetrics:
QUALITY_METRICS = ['click_through_rate', 'top_suggestion_ctr', 'suggestion_relevance']
OPERATIONAL_METRICS = ['latency_p50', 'latency_p95', 'latency_p99', 'qps', 'error_rate']
BUSINESS_METRICS = ['search_completion_rate', 'query_refinement_rate', 'user_satisfaction']
7. Scale Considerations and Trade-offs
7.1 Horizontal Scaling
Query Volume: Shard trie by prefix hash
ML Models: Horizontal scaling with load balancing
Feature Store: Redis cluster with read replicas
Cache: Multi-level caching with L1 (in-memory) and L2 (distributed)
7.2 Cost vs Performance Trade-offs
| Dimension | Option A (Cost Optimized) | Option B (Performance Optimized) |
|---|---|---|
| Data Structure | Hash-based prefix matching | Trie with compression |
| Ranking | Frequency-based only | ML-based ranking |
| Personalization | Minimal personalization | Deep personalization |
| Caching | Aggressive caching | Minimal caching |
8. Advanced Topics
8.1 Spell Correction
class SpellCorrector:
def __init__(self):
self.edit_distance_cache = {}
def get_suggestions(self, query, max_distance=2):
suggestions = []
for known_query in self.known_queries:
distance = self.edit_distance(query, known_query)
if distance <= max_distance:
suggestions.append((known_query, distance))
suggestions.sort(key=lambda x: x[1])
return suggestions[:5]
8.2 Trending Detection
class TrendingDetector:
def __init__(self):
self.query_counts = {}
self.baseline_counts = {}
async def detect_trending(self, time_window_minutes=5):
trending = []
current_counts = await self.get_query_counts(time_window_minutes)
for query, count in current_counts.items():
baseline = self.baseline_counts.get(query, 0)
if baseline > 0:
trend_score = count / baseline
if trend_score > 2.0: # 2x increase
trending.append((query, trend_score))
trending.sort(key=lambda x: x[1], reverse=True)
return trending[:100]
9. Implementation Roadmap
Phase 1: Basic Trie (Weeks 1-3)
- Implement distributed trie
- Basic prefix matching
- Frequency-based ranking
Phase 2: ML Ranking (Weeks 4-6)
- Feature engineering pipeline
- Train ranking model
- A/B testing framework
Phase 3: Advanced Features (Weeks 7-10)
- Personalization
- Trending detection
- Spell correction
Phase 4: Optimization (Weeks 11-12)
- Latency optimization
- Caching strategy
- Cost optimization
10. Summary and Key Takeaways
Architecture Recap
- Distributed trie: For fast prefix matching at scale
- ML ranking: For relevance optimization
- Personalization: For user-specific suggestions
- Multi-level caching: For low latency
Key Metrics
- Latency: < 50ms for suggestion generation
- CTR: Top suggestion should be clicked > 40%
- Freshness: Trending topics update within minutes
Common Interview Mistakes
- Not discussing distributed trie design
- Ignoring personalization requirements
- Forgetting about trending topics
- Not considering privacy requirements
βΉοΈ
Final Interview Tip: Start with the trie data structure, then discuss how to distribute it. Show understanding of both algorithmic efficiency and ML-based relevance. Emphasize the latency requirements and caching strategy.
Further Reading
- "Trie-based Autocomplete Systems" (Google Research)
- "Real-time Query Suggestion" (Microsoft Research)
- "Personalized Search Autocomplete" (Amazon)
- "Distributed Data Structures" (ACM Computing Surveys)