Clustering — Complete Guide

Clustering groups similar data points together without labels. It's the most common unsupervised learning task.

K-Means Clustering

Algorithm:
1. Initialize K random centroids
2. Assign each point to nearest centroid
3. Update centroids as mean of assigned points
4. Repeat steps 2-3 until convergence

Visualization:
Iteration 0:          Iteration 1:          Iteration 3:
⊙   ⊙               ⊙   ⊙               ⊙   ⊙
  ◇   ◇    →         ◇   ◇    →         ◇   ◇
⊙   ⊙               ⊙   ⊙               ⊙   ⊙
(Random)            (Better)            (Converged)

Choosing K (Elbow Method)

inertias = []
K_range = range(1, 11)
for k in K_range:
    kmeans = KMeans(n_clusters=k)
    kmeans.fit(X)
    inertias.append(kmeans.inertia_)

# Plot elbow curve
plt.plot(K_range, inertias, 'bo-')
plt.xlabel('K')
plt.ylabel('Inertia')
plt.title('Elbow Method')
# Look for the "elbow" — where inertia stops decreasing fast

DBSCAN

Algorithm:
1. Pick unvisited point
2. Find all points within ε (epsilon) radius
3. If enough points (minPts), create cluster
4. Expand cluster by checking neighbors
5. Repeat for unvisited points

Advantages over K-Means:
├─ No need to specify K
├─ Finds arbitrary shapes
├─ Handles outliers (labels them as noise)
└─ Works with varying densities

Parameters:
ε (epsilon): Radius to search for neighbors
minPts: Minimum points to form a cluster

Hierarchical Clustering

Agglomerative (bottom-up):
1. Start with each point as its own cluster
2. Merge closest pair of clusters
3. Repeat until only K clusters remain

Dendrogram:
┌─────────────────────────────┐
│         ┌───────┐          │
│    ┌────┤       ├────┐     │
│    │    │       │    │     │
│    │    └───────┘    │     │
│    │                 │     │
│────┴─────────────────┴─────│
A    B    C    D    E    F

Cut at height h → K clusters

Python Implementation

from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score

# Generate data
X, y_true = make_blobs(n_samples=300, centers=4, random_state=42)
X = StandardScaler().fit_transform(X)

# K-Means
kmeans = KMeans(n_clusters=4, random_state=42)
labels_km = kmeans.fit_predict(X)
print(f"K-Means Silhouette: {silhouette_score(X, labels_km):.3f}")

# DBSCAN
dbscan = DBSCAN(eps=0.5, min_samples=5)
labels_db = dbscan.fit_predict(X)
n_clusters = len(set(labels_db)) - (1 if -1 in labels_db else 0)
print(f"DBSCAN found {n_clusters} clusters")

# Hierarchical
hier = AgglomerativeClustering(n_clusters=4)
labels_hier = hier.fit_predict(X)
print(f"Hierarchical Silhouette: {silhouette_score(X, labels_hier):.3f}")

Evaluation

Silhouette Score: [-1, 1]
├─ +1: Well-clustered
├─  0: On boundary
└─ -1: Wrong cluster

Davies-Bouldin Index: [0, ∞)
├─ Lower = better clustering
└─ Measures cluster similarity

Calinski-Harabasz Index: [0, ∞)
├─ Higher = better clustering
└─ Ratio of between/within variance

Key Takeaways

K-Means is fast but requires specifying K
DBSCAN finds arbitrary shapes and handles outliers
Hierarchical produces interpretable dendrograms
Silhouette score is the most common evaluation metric
Always scale features before clustering
Use the elbow method to choose K
Clustering is exploratory — interpret results carefully
No single algorithm works best for all data

Clustering — K-Means, DBSCAN, Hierarchical Complete Guide

Clustering — Complete Guide

K-Means Clustering

Choosing K (Elbow Method)

DBSCAN

Hierarchical Clustering

Python Implementation

Evaluation

Key Takeaways

Need Expert Machine Learning Help?