K-Means Clustering
Unsupervised Learning Concepts
Unlike supervised learning, unsupervised learning works with data that has no labels. The goal is to discover hidden patterns, structures, or groupings in the data.
Types of Unsupervised Learning
+-----------------------------------------------------------------+
| UNSUPERVISED LEARNING |
+-----------------------------------------------------------------+
| |
| CLUSTERING: Group similar data points |
| +-- K-Means |
| +-- DBSCAN |
| +-- Hierarchical Clustering |
| +-- Gaussian Mixture Models |
| |
| DIMENSIONALITY REDUCTION: |
| +-- PCA (Principal Component Analysis) |
| +-- t-SNE |
| +-- UMAP |
| |
| ASSOCIATION RULES: |
| +-- Market Basket Analysis |
| +-- Apriori Algorithm |
| |
+-----------------------------------------------------------------+
Why Clustering Matters
Applications:
- Customer segmentation
- Document topic grouping
- Image segmentation
- Anomaly detection
- Gene expression analysis
- Social network analysis
K-Means Algorithm
K-Means is one of the most popular and straightforward clustering algorithms.
Algorithm Steps
K-Means Algorithm Visualization:
Step 1: Initialize K centroids randomly
+------------------------------+
| o o o o |
| * * (centroids)|
| o o o o |
| * |
+------------------------------+
Step 2: Assign each point to nearest centroid
+------------------------------+
| o o o o o o |
| [C1] [C2] |
| o o o o o o |
| [C3] |
+------------------------------+
Step 3: Update centroids to mean of assigned points
+------------------------------+
| o o o o o o |
| ā ā (new centroids)|
| o o o o o o |
| ā |
+------------------------------+
Step 4: Repeat until convergence
+------------------------------+
| A A A B B B |
| ā ā |
| A A A B B B |
| ā |
+------------------------------+
A = Cluster 1, B = Cluster 2
Mathematical Formulation
Here,
- =Number of clusters
- =Set of points assigned to cluster k
- =Centroid of cluster k
- =Squared Euclidean distance
DfK-Means Assignment Step
DfK-Means Update Step
K-Means Convergence
K-Means is guaranteed to converge because each iteration strictly decreases the objective . However, it may converge to a local minimum. The algorithm is sensitive to initialization, which is why scikit-learn runs it n_init times (default 10) with different seeds and keeps the best result.
Complete Python Implementation
K-Means Clustering with Optimal K Selection
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score, silhouette_samples, adjusted_rand_score
from sklearn.decomposition import PCA
np.random.seed(42)
# Generate synthetic data with known clusters
X, y_true = make_blobs(n_samples=500, centers=4, cluster_std=0.8, random_state=42)
print(f"Dataset shape: {X.shape}")
print(f"True number of clusters: {len(np.unique(y_true))}")
# Apply K-Means
kmeans = KMeans(n_clusters=4, random_state=42, n_init=10, max_iter=300)
y_kmeans = kmeans.fit_predict(X)
print(f"\nK-Means Results:")
print(f" Converged: {kmeans.n_iter_} iterations")
print(f" Inertia: {kmeans.inertia_:.2f}")
# Find optimal K: Elbow Method + Silhouette
inertias = []
silhouette_scores = []
K_range = range(2, 11)
for k in K_range:
kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
kmeans.fit(X)
inertias.append(kmeans.inertia_)
silhouette_scores.append(silhouette_score(X, kmeans.labels_))
print(f"K={k:2d}: Inertia={kmeans.inertia_:8.2f}, Silhouette={silhouette_scores[-1]:.4f}")
best_k = list(K_range)[np.argmax(silhouette_scores)]
print(f"\nOptimal K (by Silhouette): {best_k}")
Finding Optimal K
Elbow Method
Plot inertia vs K and look for the "elbow" ā the point where adding more clusters provides diminishing returns.
Silhouette Score
Silhouette Coefficient
Here,
- =Mean intra-cluster distance for sample i
- =Mean nearest-cluster distance for sample i
- : Well-clustered
- : On cluster boundary
- : Likely in wrong cluster
Elbow plot + Silhouette plot:
K=2: Inertia high, Silhouette moderate
K=3: Inertia dropping, Silhouette high
K=4: Inertia elbow, Silhouette highest <- Optimal
K=5+: Inertia plateaus, Silhouette decreasing
K-Means Variants
Mini-Batch K-Means
For large datasets, Mini-Batch K-Means uses random subsets for faster training:
from sklearn.cluster import MiniBatchKMeans
import time
# Standard K-Means
start = time.time()
kmeans_standard = KMeans(n_clusters=4, random_state=42, n_init=10)
y_standard = kmeans_standard.fit_predict(X)
time_standard = time.time() - start
# Mini-Batch K-Means
start = time.time()
kmeans_mini = MiniBatchKMeans(n_clusters=4, random_state=42, batch_size=100, n_init=10)
y_mini = kmeans_mini.fit_predict(X)
time_mini = time.time() - start
print(f"Standard K-Means: Inertia={kmeans_standard.inertia_:.2f}, Time={time_standard:.4f}s")
print(f"Mini-Batch K-Means: Inertia={kmeans_mini.inertia_:.2f}, Time={time_mini:.4f}s")
Limitations and Alternatives
K-Means Limitations
K-Means Fails On:
1. NON-SPHERICAL CLUSTERS:
o o o K-Means assumes spherical clusters
o o o and equal variance.
o o o
Will incorrectly split these!
2. DIFFERENT SIZES:
ooooooo K-Means tends to find clusters
**** of similar size.
ooooooo
May split the larger cluster.
3. OUTLIERS:
o o o o Centroids are sensitive to outliers.
o o o o Outliers can pull centroids away
ā
from true cluster centers.
4. UNEQUAL DENSITIES:
ooooo K-Means assumes uniform density.
********** Will merge sparse clusters and
********** split dense ones.
Alternatives Comparison
from sklearn.cluster import DBSCAN, AgglomerativeClustering
from sklearn.datasets import make_moons, make_circles
from sklearn.metrics import silhouette_score, adjusted_rand_score
# Generate challenging datasets
X_moons, y_moons = make_moons(n_samples=500, noise=0.1, random_state=42)
X_circles, y_circles = make_circles(n_samples=500, noise=0.05, factor=0.5, random_state=42)
algorithms = {
'K-Means': lambda X: KMeans(n_clusters=3, random_state=42, n_init=10).fit_predict(X),
'DBSCAN': lambda X: DBSCAN(eps=0.3, min_samples=5).fit_predict(X),
'Hierarchical': lambda X: AgglomerativeClustering(n_clusters=3).fit_predict(X),
}
print("Clustering Algorithm Comparison:")
print("-" * 60)
print(f"{'Algorithm':<20} {'Moons ARI':<12} {'Circles ARI':<12}")
print("-" * 60)
for name, alg_func in algorithms.items():
labels_moons = alg_func(StandardScaler().fit_transform(X_moons))
labels_circles = alg_func(StandardScaler().fit_transform(X_circles))
ari_moons = adjusted_rand_score(y_moons, labels_moons)
ari_circles = adjusted_rand_score(y_circles, labels_circles)
print(f"{name:<20} {ari_moons:<12.3f} {ari_circles:<12.3f}")
Algorithm Comparison Summary
+-----------------------------------------------------------------+
| CLUSTERING ALGORITHMS COMPARISON |
+-----------------------------------------------------------------+
| |
| K-MEANS: |
| +-- Pros: Fast, simple, scales well |
| +-- Cons: Assumes spherical clusters, needs K |
| +-- Complexity: O(n Ć K Ć iterations) |
| +-- Best for: Spherical clusters of similar size |
| |
| DBSCAN: |
| +-- Pros: Finds arbitrary shapes, handles outliers |
| +-- Cons: Sensitive to eps, struggles with varying density |
| +-- Complexity: O(n log n) with spatial index |
| +-- Best for: Arbitrary shapes, noisy data |
| |
| HIERARCHICAL: |
| +-- Pros: No need for K, creates dendrogram |
| +-- Cons: Slow for large datasets, hard to cut |
| +-- Complexity: O(n³) or O(n² log n) |
| +-- Best for: Small datasets, when hierarchy matters |
| |
| GAUSSIAN MIXTURE: |
| +-- Pros: Soft assignments, handles elliptical clusters |
| +-- Cons: Assumes Gaussian distribution, needs K |
| +-- Complexity: O(n Ć K Ć iterations) |
| +-- Best for: Overlapping clusters, probabilistic assignment |
| |
+-----------------------------------------------------------------+
Practical Application
Customer Segmentation with K-Means
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
from sklearn.decomposition import PCA
np.random.seed(42)
n_customers = 1000
annual_income = np.random.normal(50000, 20000, n_customers)
spending_score = np.random.normal(50, 25, n_customers)
age = np.random.normal(40, 15, n_customers)
loyalty_score = np.random.uniform(0, 100, n_customers)
segment_1 = annual_income < 40000
segment_2 = (annual_income >= 40000) & (annual_income < 70000)
segment_3 = annual_income >= 70000
spending_score[segment_1] *= 0.6
spending_score[segment_3] *= 1.4
X_customers = np.column_stack([annual_income, spending_score, age, loyalty_score])
scaler = StandardScaler()
X_customers_scaled = scaler.fit_transform(X_customers)
# Find optimal K
for k in range(2, 8):
kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
kmeans.fit(X_customers_scaled)
sil = silhouette_score(X_customers_scaled, kmeans.labels_)
print(f"K={k}: Silhouette={sil:.4f}, Inertia={kmeans.inertia_:.0f}")
# Apply with optimal K
best_k = 4
kmeans_final = KMeans(n_clusters=best_k, random_state=42, n_init=10)
customer_segments = kmeans_final.fit_predict(X_customers_scaled)
print(f"\nCustomer Segments (K={best_k}):")
for i in range(best_k):
mask = customer_segments == i
segment_data = X_customers[mask]
print(f"\nSegment {i+1} ({mask.sum()} customers):")
print(f" Avg Income: ${segment_data[:, 0].mean():,.0f}")
print(f" Avg Spending: {segment_data[:, 1].mean():.1f}")
print(f" Avg Age: {segment_data[:, 2].mean():.1f}")
Key Takeaways
Summary: K-Means Clustering
- K-Means is fast and simple but assumes spherical clusters of equal size
- Objective: Minimize within-cluster sum of squares:
- Elbow Method and Silhouette Score help find optimal K
- DBSCAN handles arbitrary shapes and outliers without requiring K
- Hierarchical clustering provides dendrograms for understanding structure
- Always scale features before clustering
- Mini-Batch K-Means is essential for large datasets ( vs )
- Visualization is crucial for validating clustering results
Practice Exercises
Exercise 1: Image Segmentation
# Use K-Means to segment an image into K colors:
# 1. Load an image (or use random pixels)
# 2. Reshape to (n_pixels, 3)
# 3. Apply K-Means with K=2, 4, 8, 16
# 4. Recreate image with cluster colors
# 5. Analyze compression ratio
Exercise 2: Anomaly Detection
# Use clustering for anomaly detection:
# 1. Generate normal data + anomalies
# 2. Apply K-Means and DBSCAN
# 3. Identify points far from centroids (K-Means)
# 4. Identify noise points (DBSCAN)
# 5. Compare detection rates
Exercise 3: Compare Clustering Algorithms
# Compare K-Means, DBSCAN, and Hierarchical:
# 1. Create 5 different synthetic datasets
# 2. Apply all three algorithms
# 3. Evaluate with ARI and Silhouette
# 4. Create visualization
# 5. Discuss which works best for each dataset
Exercise 4: Choosing K for Real Data
from sklearn.datasets import load_iris
X, y = load_iris(return_X_y=True)
# 1. Apply K-Means with K=2,3,4,5
# 2. Use elbow method and silhouette analysis
# 3. Compare with true labels (ARI)
# 4. Visualize using PCA
Summary Table
| Algorithm | K Required | Shape Assumption | Complexity | Best For |
|---|---|---|---|---|
| K-Means | Yes | Spherical | O(nK) | Spherical clusters |
| DBSCAN | No | Arbitrary | O(n log n) | Noisy, irregular shapes |
| Hierarchical | No | Any | O(n²) | Small datasets |
| GMM | Yes | Elliptical | O(nK) | Soft assignments |