Clustering — Complete Guide
Clustering groups similar data points together without labels. It's the most common unsupervised learning task.
K-Means Clustering
Algorithm:
1. Initialize K random centroids
2. Assign each point to nearest centroid
3. Update centroids as mean of assigned points
4. Repeat steps 2-3 until convergence
Visualization:
Iteration 0: Iteration 1: Iteration 3:
⊙ ⊙ ⊙ ⊙ ⊙ ⊙
◇ ◇ → ◇ ◇ → ◇ ◇
⊙ ⊙ ⊙ ⊙ ⊙ ⊙
(Random) (Better) (Converged)
Choosing K (Elbow Method)
inertias = []
K_range = range(1, 11)
for k in K_range:
kmeans = KMeans(n_clusters=k)
kmeans.fit(X)
inertias.append(kmeans.inertia_)
# Plot elbow curve
plt.plot(K_range, inertias, 'bo-')
plt.xlabel('K')
plt.ylabel('Inertia')
plt.title('Elbow Method')
# Look for the "elbow" — where inertia stops decreasing fast
DBSCAN
Algorithm:
1. Pick unvisited point
2. Find all points within ε (epsilon) radius
3. If enough points (minPts), create cluster
4. Expand cluster by checking neighbors
5. Repeat for unvisited points
Advantages over K-Means:
├─ No need to specify K
├─ Finds arbitrary shapes
├─ Handles outliers (labels them as noise)
└─ Works with varying densities
Parameters:
ε (epsilon): Radius to search for neighbors
minPts: Minimum points to form a cluster
Hierarchical Clustering
Agglomerative (bottom-up):
1. Start with each point as its own cluster
2. Merge closest pair of clusters
3. Repeat until only K clusters remain
Dendrogram:
┌─────────────────────────────┐
│ ┌───────┐ │
│ ┌────┤ ├────┐ │
│ │ │ │ │ │
│ │ └───────┘ │ │
│ │ │ │
│────┴─────────────────┴─────│
A B C D E F
Cut at height h → K clusters
Python Implementation
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
# Generate data
X, y_true = make_blobs(n_samples=300, centers=4, random_state=42)
X = StandardScaler().fit_transform(X)
# K-Means
kmeans = KMeans(n_clusters=4, random_state=42)
labels_km = kmeans.fit_predict(X)
print(f"K-Means Silhouette: {silhouette_score(X, labels_km):.3f}")
# DBSCAN
dbscan = DBSCAN(eps=0.5, min_samples=5)
labels_db = dbscan.fit_predict(X)
n_clusters = len(set(labels_db)) - (1 if -1 in labels_db else 0)
print(f"DBSCAN found {n_clusters} clusters")
# Hierarchical
hier = AgglomerativeClustering(n_clusters=4)
labels_hier = hier.fit_predict(X)
print(f"Hierarchical Silhouette: {silhouette_score(X, labels_hier):.3f}")
Evaluation
Silhouette Score: [-1, 1]
├─ +1: Well-clustered
├─ 0: On boundary
└─ -1: Wrong cluster
Davies-Bouldin Index: [0, ∞)
├─ Lower = better clustering
└─ Measures cluster similarity
Calinski-Harabasz Index: [0, ∞)
├─ Higher = better clustering
└─ Ratio of between/within variance
Key Takeaways
- K-Means is fast but requires specifying K
- DBSCAN finds arbitrary shapes and handles outliers
- Hierarchical produces interpretable dendrograms
- Silhouette score is the most common evaluation metric
- Always scale features before clustering
- Use the elbow method to choose K
- Clustering is exploratory — interpret results carefully
- No single algorithm works best for all data