K-Means Clustering
Unsupervised Learning Concepts
Unlike supervised learning, unsupervised learning works with data that has no labels. The goal is to discover hidden patterns, structures, or groupings in the data.
Types of Unsupervised Learning
āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā
ā UNSUPERVISED LEARNING ā
āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā¤
ā ā
ā CLUSTERING: Group similar data points ā
ā āāā K-Means ā
ā āāā DBSCAN ā
ā āāā Hierarchical Clustering ā
ā āāā Gaussian Mixture Models ā
ā ā
ā DIMENSIONALITY REDUCTION: ā
ā āāā PCA (Principal Component Analysis) ā
ā āāā t-SNE ā
ā āāā UMAP ā
ā ā
ā ASSOCIATION RULES: ā
ā āāā Market Basket Analysis ā
ā āāā Apriori Algorithm ā
ā ā
āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā
Why Clustering Matters
Applications:
- Customer segmentation
- Document topic grouping
- Image segmentation
- Anomaly detection
- Gene expression analysis
- Social network analysis
K-Means Algorithm
K-Means is one of the most popular and straightforward clustering algorithms.
Algorithm Steps
K-Means Algorithm Visualization:
Step 1: Initialize K centroids randomly
āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā
ā ā ā ā ā ā
ā ā ā (centroids)ā
ā ā ā ā ā ā
ā ā ā
āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā
Step 2: Assign each point to nearest centroid
āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā
ā ā ā ā ā ā ā ā
ā [C1] [C2] ā
ā ā ā ā ā ā ā ā
ā [C3] ā
āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā
Step 3: Update centroids to mean of assigned points
āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā
ā ā ā ā ā ā ā ā
ā ā ā (new centroids)ā
ā ā ā ā ā ā ā ā
ā ā ā
āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā
Step 4: Repeat until convergence
āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā
ā A A A B B B ā
ā ā ā ā
ā A A A B B B ā
ā ā ā
āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā
A = Cluster 1, B = Cluster 2
Mathematical Formulation
DfK-Means Assignment Step
DfK-Means Update Step
ā¹ļø K-Means Convergence
K-Means is guaranteed to converge because each iteration strictly decreases the objective . However, it may converge to a local minimum. The algorithm is sensitive to initialization, which is why scikit-learn runs it n_init times (default 10) with different seeds and keeps the best result.
Complete Python Implementation
šK-Means Clustering with Optimal K Selection
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score, silhouette_samples, adjusted_rand_score
from sklearn.decomposition import PCA
np.random.seed(42)
# Generate synthetic data with known clusters
X, y_true = make_blobs(n_samples=500, centers=4, cluster_std=0.8, random_state=42)
print(f"Dataset shape: {X.shape}")
print(f"True number of clusters: {len(np.unique(y_true))}")
# Apply K-Means
kmeans = KMeans(n_clusters=4, random_state=42, n_init=10, max_iter=300)
y_kmeans = kmeans.fit_predict(X)
print(f"\nK-Means Results:")
print(f" Converged: {kmeans.n_iter_} iterations")
print(f" Inertia: {kmeans.inertia_:.2f}")
# Find optimal K: Elbow Method + Silhouette
inertias = []
silhouette_scores = []
K_range = range(2, 11)
for k in K_range:
kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
kmeans.fit(X)
inertias.append(kmeans.inertia_)
silhouette_scores.append(silhouette_score(X, kmeans.labels_))
print(f"K={k:2d}: Inertia={kmeans.inertia_:8.2f}, Silhouette={silhouette_scores[-1]:.4f}")
best_k = list(K_range)[np.argmax(silhouette_scores)]
print(f"\nOptimal K (by Silhouette): {best_k}")
Finding Optimal K
Elbow Method
Plot inertia vs K and look for the "elbow" ā the point where adding more clusters provides diminishing returns.
Silhouette Score
Silhouette Coefficient
Here,
- =Mean intra-cluster distance for sample i
- =Mean nearest-cluster distance for sample i
- : Well-clustered
- : On cluster boundary
- : Likely in wrong cluster
Elbow plot + Silhouette plot:
K=2: Inertia high, Silhouette moderate
K=3: Inertia dropping, Silhouette high
K=4: Inertia elbow, Silhouette highest ā Optimal
K=5+: Inertia plateaus, Silhouette decreasing
K-Means Variants
Mini-Batch K-Means
For large datasets, Mini-Batch K-Means uses random subsets for faster training:
from sklearn.cluster import MiniBatchKMeans
import time
# Standard K-Means
start = time.time()
kmeans_standard = KMeans(n_clusters=4, random_state=42, n_init=10)
y_standard = kmeans_standard.fit_predict(X)
time_standard = time.time() - start
# Mini-Batch K-Means
start = time.time()
kmeans_mini = MiniBatchKMeans(n_clusters=4, random_state=42, batch_size=100, n_init=10)
y_mini = kmeans_mini.fit_predict(X)
time_mini = time.time() - start
print(f"Standard K-Means: Inertia={kmeans_standard.inertia_:.2f}, Time={time_standard:.4f}s")
print(f"Mini-Batch K-Means: Inertia={kmeans_mini.inertia_:.2f}, Time={time_mini:.4f}s")
Limitations and Alternatives
K-Means Limitations
K-Means Fails On:
1. NON-SPHERICAL CLUSTERS:
ā ā ā K-Means assumes spherical clusters
ā ā ā and equal variance.
ā ā ā
Will incorrectly split these!
2. DIFFERENT SIZES:
āāāāāāā K-Means tends to find clusters
āāāā of similar size.
āāāāāāā
May split the larger cluster.
3. OUTLIERS:
ā ā ā ā Centroids are sensitive to outliers.
ā ā ā ā Outliers can pull centroids away
ā
from true cluster centers.
4. UNEQUAL DENSITIES:
āāāāā K-Means assumes uniform density.
āāāāāāāāāā Will merge sparse clusters and
āāāāāāāāāā split dense ones.
Alternatives Comparison
from sklearn.cluster import DBSCAN, AgglomerativeClustering
from sklearn.datasets import make_moons, make_circles
from sklearn.metrics import silhouette_score, adjusted_rand_score
# Generate challenging datasets
X_moons, y_moons = make_moons(n_samples=500, noise=0.1, random_state=42)
X_circles, y_circles = make_circles(n_samples=500, noise=0.05, factor=0.5, random_state=42)
algorithms = {
'K-Means': lambda X: KMeans(n_clusters=3, random_state=42, n_init=10).fit_predict(X),
'DBSCAN': lambda X: DBSCAN(eps=0.3, min_samples=5).fit_predict(X),
'Hierarchical': lambda X: AgglomerativeClustering(n_clusters=3).fit_predict(X),
}
print("Clustering Algorithm Comparison:")
print("-" * 60)
print(f"{'Algorithm':<20} {'Moons ARI':<12} {'Circles ARI':<12}")
print("-" * 60)
for name, alg_func in algorithms.items():
labels_moons = alg_func(StandardScaler().fit_transform(X_moons))
labels_circles = alg_func(StandardScaler().fit_transform(X_circles))
ari_moons = adjusted_rand_score(y_moons, labels_moons)
ari_circles = adjusted_rand_score(y_circles, labels_circles)
print(f"{name:<20} {ari_moons:<12.3f} {ari_circles:<12.3f}")
Algorithm Comparison Summary
āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā
ā CLUSTERING ALGORITHMS COMPARISON ā
āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā¤
ā ā
ā K-MEANS: ā
ā āāā Pros: Fast, simple, scales well ā
ā āāā Cons: Assumes spherical clusters, needs K ā
ā āāā Complexity: O(n Ć K Ć iterations) ā
ā āāā Best for: Spherical clusters of similar size ā
ā ā
ā DBSCAN: ā
ā āāā Pros: Finds arbitrary shapes, handles outliers ā
ā āāā Cons: Sensitive to eps, struggles with varying density ā
ā āāā Complexity: O(n log n) with spatial index ā
ā āāā Best for: Arbitrary shapes, noisy data ā
ā ā
ā HIERARCHICAL: ā
ā āāā Pros: No need for K, creates dendrogram ā
ā āāā Cons: Slow for large datasets, hard to cut ā
ā āāā Complexity: O(n³) or O(n² log n) ā
ā āāā Best for: Small datasets, when hierarchy matters ā
ā ā
ā GAUSSIAN MIXTURE: ā
ā āāā Pros: Soft assignments, handles elliptical clusters ā
ā āāā Cons: Assumes Gaussian distribution, needs K ā
ā āāā Complexity: O(n Ć K Ć iterations) ā
ā āāā Best for: Overlapping clusters, probabilistic assignment ā
ā ā
āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā
Practical Application
šCustomer Segmentation with K-Means
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
from sklearn.decomposition import PCA
np.random.seed(42)
n_customers = 1000
annual_income = np.random.normal(50000, 20000, n_customers)
spending_score = np.random.normal(50, 25, n_customers)
age = np.random.normal(40, 15, n_customers)
loyalty_score = np.random.uniform(0, 100, n_customers)
segment_1 = annual_income < 40000
segment_2 = (annual_income >= 40000) & (annual_income < 70000)
segment_3 = annual_income >= 70000
spending_score[segment_1] *= 0.6
spending_score[segment_3] *= 1.4
X_customers = np.column_stack([annual_income, spending_score, age, loyalty_score])
scaler = StandardScaler()
X_customers_scaled = scaler.fit_transform(X_customers)
# Find optimal K
for k in range(2, 8):
kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
kmeans.fit(X_customers_scaled)
sil = silhouette_score(X_customers_scaled, kmeans.labels_)
print(f"K={k}: Silhouette={sil:.4f}, Inertia={kmeans.inertia_:.0f}")
# Apply with optimal K
best_k = 4
kmeans_final = KMeans(n_clusters=best_k, random_state=42, n_init=10)
customer_segments = kmeans_final.fit_predict(X_customers_scaled)
print(f"\nCustomer Segments (K={best_k}):")
for i in range(best_k):
mask = customer_segments == i
segment_data = X_customers[mask]
print(f"\nSegment {i+1} ({mask.sum()} customers):")
print(f" Avg Income: ${segment_data[:, 0].mean():,.0f}")
print(f" Avg Spending: {segment_data[:, 1].mean():.1f}")
print(f" Avg Age: {segment_data[:, 2].mean():.1f}")
Key Takeaways
šSummary: K-Means Clustering
- K-Means is fast and simple but assumes spherical clusters of equal size
- Objective: Minimize within-cluster sum of squares:
- Elbow Method and Silhouette Score help find optimal K
- DBSCAN handles arbitrary shapes and outliers without requiring K
- Hierarchical clustering provides dendrograms for understanding structure
- Always scale features before clustering
- Mini-Batch K-Means is essential for large datasets ( vs )
- Visualization is crucial for validating clustering results
Practice Exercises
Exercise 1: Image Segmentation
# Use K-Means to segment an image into K colors:
# 1. Load an image (or use random pixels)
# 2. Reshape to (n_pixels, 3)
# 3. Apply K-Means with K=2, 4, 8, 16
# 4. Recreate image with cluster colors
# 5. Analyze compression ratio
Exercise 2: Anomaly Detection
# Use clustering for anomaly detection:
# 1. Generate normal data + anomalies
# 2. Apply K-Means and DBSCAN
# 3. Identify points far from centroids (K-Means)
# 4. Identify noise points (DBSCAN)
# 5. Compare detection rates
Exercise 3: Compare Clustering Algorithms
# Compare K-Means, DBSCAN, and Hierarchical:
# 1. Create 5 different synthetic datasets
# 2. Apply all three algorithms
# 3. Evaluate with ARI and Silhouette
# 4. Create visualization
# 5. Discuss which works best for each dataset
Exercise 4: Choosing K for Real Data
from sklearn.datasets import load_iris
X, y = load_iris(return_X_y=True)
# 1. Apply K-Means with K=2,3,4,5
# 2. Use elbow method and silhouette analysis
# 3. Compare with true labels (ARI)
# 4. Visualize using PCA
Summary Table
| Algorithm | K Required | Shape Assumption | Complexity | Best For |
|---|---|---|---|---|
| K-Means | Yes | Spherical | O(nK) | Spherical clusters |
| DBSCAN | No | Arbitrary | O(n log n) | Noisy, irregular shapes |
| Hierarchical | No | Any | O(n²) | Small datasets |
| GMM | Yes | Elliptical | O(nK) | Soft assignments |