Google & LinkedIn Interview

Clustering: K-Means, DBSCAN & Hierarchical

Unsupervised learning for discovering natural groupings

Interview Question

"Compare K-Means, DBSCAN, and hierarchical clustering. What are the strengths and weaknesses of each? How do you choose the number of clusters and evaluate clustering quality?"

Difficulty: Medium | Frequently asked at Google, LinkedIn, Amazon

Theoretical Foundation

K-Means Clustering

Algorithm:

Initialize $K$ centroids randomly
Assign each point to nearest centroid
Update centroids as mean of assigned points
Repeat until convergence
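The four steps above can be sketched in plain Python. This is a minimal illustration of Lloyd's algorithm, not a production implementation; the function names (`kmeans`, `squared_distance`), the `seed` and `max_iter` parameters, and the toy data are assumptions made for the example.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Step 1: pick k distinct data points as the initial centroids.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Step 4: stop once the assignment no longer changes.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 3: move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centroids = kmeans(points, k=2)
print(labels, centroids)
```

On this toy data the two groups are far enough apart that the procedure separates them regardless of which two points are drawn as initial centroids; on harder data the result depends on the initialization.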

Objective:

$$\min_{C_1, \dots, C_K} \sum_{k=1}^{K} \sum_{x_i \in C_k} \|x_i - \mu_k\|^2$$

where $\mu_k$ is the mean (centroid) of cluster $C_k$.

Properties:

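Implicitly assumes roughly spherical clusters of similar size
Sensitive to initialization (use K-Means++)
$O(nKd)$ time per iteration for $n$ points in $d$ dimensions
Converges to a local minimum, not necessarily the global one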
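The K-Means++ seeding mentioned above picks each new centroid with probability proportional to its squared distance from the nearest centroid chosen so far, which tends to spread the seeds out. A minimal sketch, assuming points are tuples and at least `k` of them are distinct; the names `kmeans_plus_plus` and `seed` are illustrative.

```python
import random

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans_plus_plus(points, k, seed=0):
    rng = random.Random(seed)
    # First centroid: uniformly at random from the data.
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        # D(x)^2: squared distance from each point to its nearest chosen centroid.
        weights = [min(squared_distance(p, c) for c in centroids) for p in points]
        # Sample the next centroid with probability proportional to D(x)^2.
        # Already-chosen points have weight 0 and are never picked again.
        centroids.append(rng.choices(points, weights=weights, k=1)[0])
    return centroids

points = [(0.0, 0.0), (0.0, 0.1), (0.1, 0.0),
          (10.0, 10.0), (10.0, 10.1), (10.1, 10.0)]
seeds = kmeans_plus_plus(points, k=2)
print(seeds)
```

With two tight, distant groups like these, the second seed lands in the other group with overwhelming probability, which is exactly the situation where purely random initialization tends to go wrong.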

DBSCAN (Density-Based Spatial Clustering)

Key Concepts:

Core point: Has at least $\text{minPts}$ points within distance $\epsilon$ (most implementations count the point itself)
Border point: Within $\epsilon$ of a core point but not a core point itself
Noise point: Neither core nor border

Algorithm:

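Find all core points
Expand clusters by linking core points that lie within $\epsilon$ of each other
Border points join the cluster of a core point within $\epsilon$ (if several clusters qualify, the assignment depends on processing order)
Noise points remain unclustered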
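The procedure above can be sketched as follows. This is a brute-force $O(n^2)$ illustration under stated assumptions: neighborhoods include the point itself, noise is labeled `-1`, and the names `dbscan`, `eps`, and `min_pts` are chosen for the example.

```python
from collections import deque

NOISE = -1

def dbscan(points, eps, min_pts):
    n = len(points)

    def sq_dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    # Neighborhood of each point (includes the point itself).
    neighbors = [
        [j for j in range(n) if sq_dist(points[i], points[j]) <= eps * eps]
        for i in range(n)
    ]
    is_core = [len(nb) >= min_pts for nb in neighbors]

    labels = [None] * n
    cluster = 0
    for i in range(n):
        if labels[i] is not None or not is_core[i]:
            continue
        # Start a new cluster at an unvisited core point and expand it.
        labels[i] = cluster
        queue = deque([i])
        while queue:
            j = queue.popleft()          # j is always a core point
            for m in neighbors[j]:
                if labels[m] is None:
                    labels[m] = cluster  # core or border point joins
                    if is_core[m]:
                        queue.append(m)  # only core points keep expanding
        cluster += 1
    # Anything never reached from a core point is noise.
    return [NOISE if lab is None else lab for lab in labels]

points = [(0, 0), (0, 1), (1, 0), (1, 1),          # dense group A
          (2.2, 0.5),                               # border point of A
          (10, 10), (10, 11), (11, 10), (11, 11),   # dense group B
          (5, 20)]                                  # isolated point
labels = dbscan(points, eps=1.5, min_pts=4)
print(labels)  # [0, 0, 0, 0, 0, 1, 1, 1, 1, -1]
```

The point `(2.2, 0.5)` has only three points in its neighborhood, so it is not a core point, but it is reachable from two core points of group A and is absorbed as a border point. The isolated point stays labeled as noise.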

Properties:

Finds arbitrary-shaped clusters
Handles outliers naturally
No need to specify $K$
Sensitive to $\epsilon$ and $\text{minPts}$

Hierarchical Clustering

Agglomerative (Bottom-Up):

Start with each point as its own cluster
Merge closest pair of clusters
Repeat until $K$ clusters remain (or until a single cluster remains, which yields the full dendrogram)

Divisive (Top-Down):

Start with all points in one cluster
Split clusters recursively
Stop when $K$ clusters remain

Linkage Methods:

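Single: Minimum pairwise distance between points in the two clusters (prone to chaining)
Complete: Maximum pairwise distance between points in the two clusters (favors compact clusters)
Average: Mean pairwise distance between points in the two clusters
Ward: Merge the pair of clusters that gives the smallest increase in within-cluster variance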
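To make the effect of the linkage choice concrete, here is a naive agglomerative sketch supporting single, complete, and average linkage (Ward is omitted for brevity). It recomputes all pairwise cluster distances at every merge, so it is only suitable for tiny inputs; the function name `agglomerative` and the 1-D toy data are assumptions for the example.

```python
import math

def agglomerative(points, k, linkage="single"):
    # Each cluster is a list of point indices; start with singletons.
    clusters = [[i] for i in range(len(points))]

    def cluster_distance(a, b):
        dists = [math.dist(points[i], points[j]) for i in a for j in b]
        if linkage == "single":
            return min(dists)
        if linkage == "complete":
            return max(dists)
        if linkage == "average":
            return sum(dists) / len(dists)
        raise ValueError(f"unknown linkage: {linkage}")

    while len(clusters) > k:
        # Find the closest pair of clusters under the chosen linkage.
        _, a, b = min(
            (cluster_distance(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        # Merge cluster b into cluster a.
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Points on a line with slowly growing gaps: 1.0, 1.1, 1.2, 1.3, 1.4
points = [(0.0,), (1.0,), (2.1,), (3.3,), (4.6,), (6.0,)]
single = sorted(sorted(c) for c in agglomerative(points, 2, "single"))
complete = sorted(sorted(c) for c in agglomerative(points, 2, "complete"))
print(single)    # [[0, 1, 2, 3, 4], [5]]
print(complete)  # [[0, 1, 2, 3], [4, 5]]
```

Single linkage chains the first five points together and leaves the last one alone, because it only looks at the nearest pair. Complete linkage penalizes the growing diameter of the chain and produces a more balanced split.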

Clustering Comparison

Method	Shape	Outliers	Scalability	Parameters
K-Means	Spherical	Sensitive	High	$K$
DBSCAN	Arbitrary	Robust	Medium	$\epsilon$, $\text{minPts}$
Hierarchical	Flexible	Moderate	Low	Linkage, $K$

Choosing the Number of Clusters

Elbow Method

Plot the within-cluster sum of squares (WCSS) against $K$. WCSS always decreases as $K$ grows, so look for the "elbow" where the improvement slows sharply.
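A small sketch of the WCSS curve that the elbow method inspects. To keep the output reproducible, this example seeds K-Means with a deterministic farthest-point heuristic rather than random initialization; that choice, the helper names, and the toy data are assumptions for illustration.

```python
def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def farthest_first(points, k):
    # Deterministic seeding: start at the first point, then repeatedly add
    # the point farthest from all seeds chosen so far.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    return centroids

def kmeans_wcss(points, k, max_iter=100):
    centroids = farthest_first(points, k)
    for _ in range(max_iter):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            groups[nearest].append(p)
        new = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centroids[i]
            for i, g in enumerate(groups)
        ]
        if new == centroids:
            break
        centroids = new
    # WCSS: squared distance from every point to its nearest centroid.
    return sum(min(sq_dist(p, c) for c in centroids) for p in points)

# Three small square-shaped groups.
offsets = [(0, 0), (0, 1), (1, 0), (1, 1)]
bases = [(0, 0), (10, 0), (5, 9)]
points = [(bx + ox, by + oy) for bx, by in bases for ox, oy in offsets]

wcss = {k: kmeans_wcss(points, k) for k in range(1, 6)}
for k, value in wcss.items():
    print(k, round(value, 2))
```

On this data WCSS falls steeply up to $K = 3$ and only marginally afterwards, which is the elbow pattern. Real data rarely shows such a clean break.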

Silhouette Score

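$$s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}$$

where $a(i)$ is the mean distance from point $i$ to the other points in its own cluster, and $b(i)$ is the smallest mean distance from point $i$ to the points of any other cluster. Choose the $K$ that maximizes the mean of $s(i)$ over all points.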
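The formula can be computed directly. A minimal sketch, assuming at least two clusters and using the common convention that a point in a singleton cluster scores 0; the function name `silhouette_samples` is chosen for the example.

```python
import math

def silhouette_samples(points, labels):
    # Group points by cluster label.
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        # a(i): mean distance to the other members of the same cluster
        # (the point's distance to itself is 0, so divide by len - 1).
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b(i): smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        scores.append((b - a) / max(a, b))
    return scores

points = [(0.0,), (1.0,), (10.0,), (11.0,)]
good = silhouette_samples(points, [0, 0, 1, 1])
bad = silhouette_samples(points, [0, 1, 0, 1])
print(sum(good) / len(good), sum(bad) / len(bad))
```

For the first point under the good labeling, $a = 1$ and $b = 10.5$, so $s = 9.5 / 10.5 \approx 0.905$. Under the bad labeling every point sits closer to the other cluster than to its own, and all scores are negative.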

Gap Statistic

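Compares $\log(\text{WCSS})$ at each $K$ with its expected value under a null reference distribution (for example, points drawn uniformly over the data's bounding box). A larger gap suggests stronger cluster structure; in practice, pick the smallest $K$ whose gap is within one standard error of the gap at $K + 1$.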

Key Insight: The elbow method can be ambiguous. The silhouette score provides a more principled approach, measuring both cohesion (within-cluster) and separation (between-cluster).

Evaluation Metrics

Silhouette Coefficient

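Range: $[-1, 1]$. Higher is better: values near 1 indicate compact, well-separated clusters, values near 0 indicate overlapping clusters, and negative values suggest points assigned to the wrong cluster. Needs no ground-truth labels.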

Adjusted Rand Index (ARI)

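Measures pairwise agreement with ground-truth labels, corrected for chance. 1 = perfect agreement, values near 0 = chance-level agreement, and negative values = worse than chance.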
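ARI can be computed from pair counts in the contingency table of the two labelings. A minimal sketch using only the standard library; the function name and the choice to return 1.0 in the degenerate case where the denominator vanishes are assumptions of this example.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    n = len(labels_true)
    # Pairs of points that share a cell, a true class, or a predicted cluster.
    same_cell = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    same_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    same_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = same_true * same_pred / comb(n, 2)   # expected under chance
    maximum = (same_true + same_pred) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both labelings are a single cluster
    return (same_cell - expected) / (maximum - expected)

print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # same partition, renamed labels
print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))  # maximally discordant split
```

The first call gives 1.0 because ARI ignores label names and only compares partitions. The second gives -0.5 (up to floating-point rounding), showing that ARI can fall below zero.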

Normalized Mutual Information (NMI)

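Measures the mutual information between cluster assignments and ground-truth labels, normalized by their entropies. Range: $[0, 1]$. Unlike ARI, it is not corrected for chance.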
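A minimal sketch of NMI, assuming normalization by the arithmetic mean of the two entropies (other normalizations, such as the geometric mean or the maximum, are also in use). The function names and the value returned when both entropies are zero are assumptions of this example.

```python
from collections import Counter
from math import log

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())

def normalized_mutual_info(labels_a, labels_b):
    n = len(labels_a)
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    # Mutual information: sum over non-empty cells of p(x, y) * log(p(x, y) / (p(x) p(y))).
    mi = sum(
        (c / n) * log(c * n / (count_a[x] * count_b[y]))
        for (x, y), c in joint.items()
    )
    denom = (entropy(labels_a) + entropy(labels_b)) / 2
    return mi / denom if denom > 0 else 1.0  # both labelings trivial

print(normalized_mutual_info([0, 0, 1, 1], [1, 1, 0, 0]))
print(normalized_mutual_info([0, 0, 1, 1], [0, 1, 0, 1]))
```

The first pair describes the same partition under different label names and scores 1; the second pair is statistically independent and scores 0.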

Code Implementation
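
A minimal sketch of how the three methods might be compared side by side, assuming scikit-learn is installed. The synthetic blobs, the parameter values (`eps=1.0`, `min_samples=5`, `n_init=10`), and the variable names are illustrative assumptions, not tuned recommendations.

```python
from sklearn.cluster import AgglomerativeClustering, DBSCAN, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, silhouette_score

# Three well-separated Gaussian blobs with known labels.
X, y_true = make_blobs(
    n_samples=300,
    centers=[[0, 0], [6, 6], [-6, 6]],
    cluster_std=0.6,
    random_state=0,
)

models = {
    "K-Means": KMeans(n_clusters=3, n_init=10, random_state=0),
    "DBSCAN": DBSCAN(eps=1.0, min_samples=5),
    "Agglomerative (Ward)": AgglomerativeClustering(n_clusters=3, linkage="ward"),
}

results = {}
for name, model in models.items():
    labels = model.fit_predict(X)  # DBSCAN labels noise points as -1
    results[name] = {
        # External metric: needs the ground-truth labels.
        "ari": adjusted_rand_score(y_true, labels),
        # Intrinsic metric: uses only the data and the predicted labels.
        "silhouette": silhouette_score(X, labels),
    }
    print(f"{name}: ARI={results[name]['ari']:.3f}, "
          f"silhouette={results[name]['silhouette']:.3f}")
```

On data this clean all three methods should recover the blobs almost perfectly. The differences described in the comparison table only show up on harder inputs, such as non-convex shapes, varying densities, or heavy noise.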

Real-World Applications

Google: Search Clustering

Query Clustering: Grouping similar search queries
Document Clustering: Organizing web pages by topic
User Segmentation: Clustering users by behavior

LinkedIn: People You May Know

Social Network Clustering: Identifying communities
Skill Clustering: Grouping professionals by expertise
Company Clustering: Categorizing businesses

Google Interview Tip: Be prepared to discuss scalability. Mention Mini-Batch K-Means for large datasets and distributed clustering with MapReduce.

Common Follow-Up Questions

Q1: How do you handle the scalability of K-Means? Use Mini-Batch K-Means, which processes small random batches. Also consider distributed implementations with MapReduce or Spark.
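
The mini-batch idea from Q1 can be sketched as follows: each step samples a small batch, assigns the batch points to their nearest centroids, and nudges each centroid toward its points with a per-centroid learning rate of 1/count. This is an illustration in plain Python, not a drop-in replacement for a library implementation; the function name, the caller-supplied `init_centroids`, and the default parameters are assumptions.

```python
import random

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mini_batch_kmeans(points, init_centroids, batch_size=4, n_iter=50, seed=0):
    rng = random.Random(seed)
    centroids = [list(c) for c in init_centroids]
    k = len(centroids)
    counts = [0] * k  # number of points each centroid has absorbed so far
    for _ in range(n_iter):
        batch = rng.sample(points, batch_size)
        # Assign the whole batch first, using the centroids as they stand.
        nearest = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in batch
        ]
        # Then update: each centroid becomes a running mean of its points.
        for p, c in zip(batch, nearest):
            counts[c] += 1
            lr = 1.0 / counts[c]
            centroids[c] = [(1 - lr) * m + lr * x for m, x in zip(centroids[c], p)]
    return centroids

points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11)]
centroids = mini_batch_kmeans(points, init_centroids=[(0, 0), (10, 10)])
print(centroids)
```

Each iteration touches only `batch_size` points instead of the full dataset, which is where the speedup on large datasets comes from. The price is a noisier estimate of the centroids than full-batch K-Means would give.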

Q2: When would you choose DBSCAN over K-Means? When clusters have arbitrary shapes, when you expect noise/outliers, or when you don't know the number of clusters.

Q3: How do you evaluate clustering without ground truth? Use intrinsic metrics: silhouette score, Davies-Bouldin index, Calinski-Harabasz index.

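Q4: What is the curse of dimensionality for clustering? In high dimensions, pairwise distances concentrate: the nearest and farthest neighbors of a point become almost equally far away, so distance-based assignments lose contrast. Apply dimensionality reduction (for example, PCA) before clustering.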
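
The loss of contrast described in Q4 is easy to observe numerically. The sketch below draws uniform random points in the unit hypercube and reports the relative contrast, (max distance - min distance) / min distance, from a random query point; the function name, sample size, and seed are assumptions for the example.

```python
import math
import random

def relative_contrast(dim, n_points=200, seed=0):
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    # Euclidean distances from the query to n_points uniform random points.
    dists = [
        math.dist(query, [rng.random() for _ in range(dim)])
        for _ in range(n_points)
    ]
    return (max(dists) - min(dists)) / min(dists)

for dim in (2, 10, 100, 1000):
    print(dim, round(relative_contrast(dim), 3))
```

The contrast shrinks steadily as the dimension grows: in 2 dimensions the farthest point is many times farther than the nearest, while in 1000 dimensions all points are at nearly the same distance from the query.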
