Clustering — K-Means, DBSCAN, Hierarchical Complete Guide

ML FoundationsUnsupervised LearningFree Lesson

Advertisement

Clustering — Complete Guide

Clustering groups similar data points together without labels. It's the most common unsupervised learning task.


K-Means Clustering

Algorithm:
1. Initialize K random centroids
2. Assign each point to nearest centroid
3. Update centroids as mean of assigned points
4. Repeat steps 2-3 until convergence

Visualization:
Iteration 0:          Iteration 1:          Iteration 3:
⊙   ⊙               ⊙   ⊙               ⊙   ⊙
  ◇   ◇    →         ◇   ◇    →         ◇   ◇
⊙   ⊙               ⊙   ⊙               ⊙   ⊙
(Random)            (Better)            (Converged)

Choosing K (Elbow Method)

inertias = []
K_range = range(1, 11)
for k in K_range:
    kmeans = KMeans(n_clusters=k)
    kmeans.fit(X)
    inertias.append(kmeans.inertia_)

# Plot elbow curve
plt.plot(K_range, inertias, 'bo-')
plt.xlabel('K')
plt.ylabel('Inertia')
plt.title('Elbow Method')
# Look for the "elbow" — where inertia stops decreasing fast

DBSCAN

Algorithm:
1. Pick unvisited point
2. Find all points within ε (epsilon) radius
3. If enough points (minPts), create cluster
4. Expand cluster by checking neighbors
5. Repeat for unvisited points

Advantages over K-Means:
├─ No need to specify K
├─ Finds arbitrary shapes
├─ Handles outliers (labels them as noise)
└─ Works with varying densities

Parameters:
ε (epsilon): Radius to search for neighbors
minPts: Minimum points to form a cluster

Hierarchical Clustering

Agglomerative (bottom-up):
1. Start with each point as its own cluster
2. Merge closest pair of clusters
3. Repeat until only K clusters remain

Dendrogram:
┌─────────────────────────────┐
│         ┌───────┐          │
│    ┌────┤       ├────┐     │
│    │    │       │    │     │
│    │    └───────┘    │     │
│    │                 │     │
│────┴─────────────────┴─────│
A    B    C    D    E    F

Cut at height h → K clusters

Python Implementation

from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score

# Generate data
X, y_true = make_blobs(n_samples=300, centers=4, random_state=42)
X = StandardScaler().fit_transform(X)

# K-Means
kmeans = KMeans(n_clusters=4, random_state=42)
labels_km = kmeans.fit_predict(X)
print(f"K-Means Silhouette: {silhouette_score(X, labels_km):.3f}")

# DBSCAN
dbscan = DBSCAN(eps=0.5, min_samples=5)
labels_db = dbscan.fit_predict(X)
n_clusters = len(set(labels_db)) - (1 if -1 in labels_db else 0)
print(f"DBSCAN found {n_clusters} clusters")

# Hierarchical
hier = AgglomerativeClustering(n_clusters=4)
labels_hier = hier.fit_predict(X)
print(f"Hierarchical Silhouette: {silhouette_score(X, labels_hier):.3f}")

Evaluation

Silhouette Score: [-1, 1]
├─ +1: Well-clustered
├─  0: On boundary
└─ -1: Wrong cluster

Davies-Bouldin Index: [0, ∞)
├─ Lower = better clustering
└─ Measures cluster similarity

Calinski-Harabasz Index: [0, ∞)
├─ Higher = better clustering
└─ Ratio of between/within variance

Key Takeaways

  1. K-Means is fast but requires specifying K
  2. DBSCAN finds arbitrary shapes and handles outliers
  3. Hierarchical produces interpretable dendrograms
  4. Silhouette score is the most common evaluation metric
  5. Always scale features before clustering
  6. Use the elbow method to choose K
  7. Clustering is exploratory — interpret results carefully
  8. No single algorithm works best for all data

Advertisement

Need Expert Machine Learning Help?

Get personalized tutoring, project support, or professional consulting.

Advertisement