Clustering: K-Means, DBSCAN & Hierarchical
Unsupervised learning for discovering natural groupings
Interview Question
"Compare K-Means, DBSCAN, and hierarchical clustering. What are the strengths and weaknesses of each? How do you choose the number of clusters and evaluate clustering quality?"
Difficulty: Medium | Frequently asked at Google, LinkedIn, Amazon
Theoretical Foundation
K-Means Clustering
Algorithm:
- Initialize centroids randomly
- Assign each point to nearest centroid
- Update centroids as mean of assigned points
- Repeat until convergence
Objective:
Properties:
- Assumes spherical clusters
- Sensitive to initialization (use K-Means++)
- per iteration
- Converges to local minimum
DBSCAN (Density-Based Spatial Clustering)
Key Concepts:
- Core point: Has at least neighbors within
- Border point: Within of a core point but not a core point itself
- Noise point: Neither core nor border
Algorithm:
- Find all core points
- Expand clusters from core points
- Border points join nearest core point cluster
- Noise points remain unclustered
Properties:
- Finds arbitrary-shaped clusters
- Handles outliers naturally
- No need to specify
- Sensitive to and
Hierarchical Clustering
Agglomerative (Bottom-Up):
- Start with each point as its own cluster
- Merge closest pair of clusters
- Repeat until clusters remain
Divisive (Top-Down):
- Start with all points in one cluster
- Split clusters recursively
- Stop when clusters remain
Linkage Methods:
- Single: Minimum distance between clusters
- Complete: Maximum distance between clusters
- Average: Average distance between clusters
- Ward: Minimize within-cluster variance
Clustering Comparison
| Method | Shape | Outliers | Scalability | Parameters |
|---|---|---|---|---|
| K-Means | Spherical | Sensitive | High | |
| DBSCAN | Arbitrary | Robust | Medium | , |
| Hierarchical | Flexible | Moderate | Low | Linkage, |
Choosing the Number of Clusters
Elbow Method
Plot within-cluster sum of squares (WCSS) vs . Look for the "elbow" where improvement slows.
Silhouette Score
where is average distance to same cluster, is average distance to nearest other cluster.
Gap Statistic
Compares WCSS to expected WCSS under null reference distribution.
βΉοΈ
Key Insight: The elbow method can be ambiguous. The silhouette score provides a more principled approach, measuring both cohesion (within-cluster) and separation (between-cluster).
Evaluation Metrics
Silhouette Coefficient
Range: . Higher is better.
Adjusted Rand Index (ARI)
Corrected for chance. Range: . 1 = perfect.
Normalized Mutual Information (NMI)
Measures agreement between cluster assignments. Range: .
Code Implementation
Real-World Applications
Google: Search Clustering
- Query Clustering: Grouping similar search queries
- Document Clustering: Organizing web pages by topic
- User Segmentation: Clustering users by behavior
LinkedIn: People You May Know
- Social Network Clustering: Identifying communities
- Skill Clustering: Grouping professionals by expertise
- Company Clustering: Categorizing businesses
π‘
Google Interview Tip: Be prepared to discuss scalability. Mention Mini-Batch K-Means for large datasets and distributed clustering with MapReduce.
Common Follow-Up Questions
Q1: How do you handle the scalability of K-Means? Use Mini-Batch K-Means, which processes small random batches. Also consider distributed implementations with MapReduce or Spark.
Q2: When would you choose DBSCAN over K-Means? When clusters have arbitrary shapes, when you expect noise/outliers, or when you don't know the number of clusters.
Q3: How do you evaluate clustering without ground truth? Use intrinsic metrics: silhouette score, Davies-Bouldin index, Calinski-Harabasz index.
Q4: What is the curse of dimensionality for clustering? In high dimensions, distance metrics become less meaningful. Use dimensionality reduction before clustering.