K-Means Clustering

Module 2: Machine LearningFree Lesson

Advertisement

K-Means Clustering

Unsupervised Learning Concepts

Unlike supervised learning, unsupervised learning works with data that has no labels. The goal is to discover hidden patterns, structures, or groupings in the data.

Types of Unsupervised Learning

Architecture Diagram
ā”Œā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”
│                   UNSUPERVISED LEARNING                         │
ā”œā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”¤
│                                                                 │
│  CLUSTERING: Group similar data points                          │
│  ā”œā”€ā”€ K-Means                                                    │
│  ā”œā”€ā”€ DBSCAN                                                     │
│  ā”œā”€ā”€ Hierarchical Clustering                                    │
│  └── Gaussian Mixture Models                                    │
│                                                                 │
│  DIMENSIONALITY REDUCTION:                                      │
│  ā”œā”€ā”€ PCA (Principal Component Analysis)                         │
│  ā”œā”€ā”€ t-SNE                                                      │
│  └── UMAP                                                       │
│                                                                 │
│  ASSOCIATION RULES:                                             │
│  ā”œā”€ā”€ Market Basket Analysis                                     │
│  └── Apriori Algorithm                                          │
│                                                                 │
ā””ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”˜

Why Clustering Matters

Applications:

  • Customer segmentation
  • Document topic grouping
  • Image segmentation
  • Anomaly detection
  • Gene expression analysis
  • Social network analysis

K-Means Algorithm

K-Means is one of the most popular and straightforward clustering algorithms.

Algorithm Steps

Architecture Diagram
K-Means Algorithm Visualization:

Step 1: Initialize K centroids randomly
        ā”Œā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”
        │  ā—‹   ā—‹   ā—‹   ā—‹              │
        │    ā—           ā—  (centroids)│
        │  ā—‹   ā—‹   ā—‹   ā—‹              │
        │        ā—                     │
        ā””ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”˜

Step 2: Assign each point to nearest centroid
        ā”Œā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”
        │  ā—‹ ā—‹ ā—‹   ā—‹ ā—‹ ā—‹              │
        │   [C1]    [C2]               │
        │  ā—‹ ā—‹ ā—‹   ā—‹ ā—‹ ā—‹              │
        │          [C3]                │
        ā””ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”˜

Step 3: Update centroids to mean of assigned points
        ā”Œā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”
        │  ā—‹ ā—‹ ā—‹   ā—‹ ā—‹ ā—‹              │
        │    āŠ™       āŠ™  (new centroids)│
        │  ā—‹ ā—‹ ā—‹   ā—‹ ā—‹ ā—‹              │
        │        āŠ™                     │
        ā””ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”˜

Step 4: Repeat until convergence
        ā”Œā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”
        │  A A A   B B B              │
        │   āŠ™       āŠ™                 │
        │  A A A   B B B              │
        │        āŠ™                    │
        ā””ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”˜
        A = Cluster 1, B = Cluster 2

Mathematical Formulation

J=āˆ‘k=1Kāˆ‘xi∈Ck∄xiāˆ’Ī¼k∄2J = \sum_{k=1}^{K} \sum_{x_i \in C_k} \|x_i - \mu_k\|^2

DfK-Means Assignment Step

Ck={xi:∄xiāˆ’Ī¼kāˆ„ā‰¤āˆ„xiāˆ’Ī¼jāˆ„ā€…ā€Šāˆ€ā€‰j≠k}C_k = \{x_i : \|x_i - \mu_k\| \leq \|x_i - \mu_j\| \; \forall \, j \neq k\}

DfK-Means Update Step

μk=1∣Ckāˆ£āˆ‘xi∈Ckxi\mu_k = \frac{1}{|C_k|} \sum_{x_i \in C_k} x_i

ā„¹ļø K-Means Convergence

K-Means is guaranteed to converge because each iteration strictly decreases the objective JJ. However, it may converge to a local minimum. The algorithm is sensitive to initialization, which is why scikit-learn runs it n_init times (default 10) with different seeds and keeps the best result.

Complete Python Implementation

šŸ“K-Means Clustering with Optimal K Selection

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score, silhouette_samples, adjusted_rand_score
from sklearn.decomposition import PCA

np.random.seed(42)

# Generate synthetic data with known clusters
X, y_true = make_blobs(n_samples=500, centers=4, cluster_std=0.8, random_state=42)

print(f"Dataset shape: {X.shape}")
print(f"True number of clusters: {len(np.unique(y_true))}")

# Apply K-Means
kmeans = KMeans(n_clusters=4, random_state=42, n_init=10, max_iter=300)
y_kmeans = kmeans.fit_predict(X)

print(f"\nK-Means Results:")
print(f"  Converged: {kmeans.n_iter_} iterations")
print(f"  Inertia: {kmeans.inertia_:.2f}")

# Find optimal K: Elbow Method + Silhouette
inertias = []
silhouette_scores = []
K_range = range(2, 11)

for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(X)
    inertias.append(kmeans.inertia_)
    silhouette_scores.append(silhouette_score(X, kmeans.labels_))
    print(f"K={k:2d}: Inertia={kmeans.inertia_:8.2f}, Silhouette={silhouette_scores[-1]:.4f}")

best_k = list(K_range)[np.argmax(silhouette_scores)]
print(f"\nOptimal K (by Silhouette): {best_k}")

Finding Optimal K

Elbow Method

Plot inertia vs K and look for the "elbow" — the point where adding more clusters provides diminishing returns.

Silhouette Score

Silhouette Coefficient

s(i)=b(i)āˆ’a(i)max⁔(a(i),b(i))s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}

Here,

  • a(i)a(i)=Mean intra-cluster distance for sample i
  • b(i)b(i)=Mean nearest-cluster distance for sample i
  • s(i)approx1s(i) \\approx 1: Well-clustered
  • s(i)approx0s(i) \\approx 0: On cluster boundary
  • s(i)<0s(i) < 0: Likely in wrong cluster
Architecture Diagram
Elbow plot + Silhouette plot:
K=2: Inertia high, Silhouette moderate
K=3: Inertia dropping, Silhouette high
K=4: Inertia elbow, Silhouette highest ← Optimal
K=5+: Inertia plateaus, Silhouette decreasing

K-Means Variants

Mini-Batch K-Means

For large datasets, Mini-Batch K-Means uses random subsets for faster training:

from sklearn.cluster import MiniBatchKMeans
import time

# Standard K-Means
start = time.time()
kmeans_standard = KMeans(n_clusters=4, random_state=42, n_init=10)
y_standard = kmeans_standard.fit_predict(X)
time_standard = time.time() - start

# Mini-Batch K-Means
start = time.time()
kmeans_mini = MiniBatchKMeans(n_clusters=4, random_state=42, batch_size=100, n_init=10)
y_mini = kmeans_mini.fit_predict(X)
time_mini = time.time() - start

print(f"Standard K-Means: Inertia={kmeans_standard.inertia_:.2f}, Time={time_standard:.4f}s")
print(f"Mini-Batch K-Means: Inertia={kmeans_mini.inertia_:.2f}, Time={time_mini:.4f}s")

Limitations and Alternatives

K-Means Limitations

Architecture Diagram
K-Means Fails On:

1. NON-SPHERICAL CLUSTERS:
   ā—‹ ā—‹ ā—‹             K-Means assumes spherical clusters
      ā—‹ ā—‹ ā—‹           and equal variance.
   ā—‹ ā—‹ ā—‹             
                     Will incorrectly split these!

2. DIFFERENT SIZES:
   ā—‹ā—‹ā—‹ā—‹ā—‹ā—‹ā—‹           K-Means tends to find clusters
   ā—ā—ā—ā—              of similar size.
   ā—‹ā—‹ā—‹ā—‹ā—‹ā—‹ā—‹           
                     May split the larger cluster.

3. OUTLIERS:
   ā—‹ ā—‹ ā—‹ ā—‹           Centroids are sensitive to outliers.
   ā—‹ ā—‹ ā—‹ ā—‹           Outliers can pull centroids away
         ā˜…           from true cluster centers.

4. UNEQUAL DENSITIES:
   ā—‹ā—‹ā—‹ā—‹ā—‹             K-Means assumes uniform density.
   ā—ā—ā—ā—ā—ā—ā—ā—ā—ā—        Will merge sparse clusters and
   ā—ā—ā—ā—ā—ā—ā—ā—ā—ā—        split dense ones.

Alternatives Comparison

from sklearn.cluster import DBSCAN, AgglomerativeClustering
from sklearn.datasets import make_moons, make_circles
from sklearn.metrics import silhouette_score, adjusted_rand_score

# Generate challenging datasets
X_moons, y_moons = make_moons(n_samples=500, noise=0.1, random_state=42)
X_circles, y_circles = make_circles(n_samples=500, noise=0.05, factor=0.5, random_state=42)

algorithms = {
    'K-Means': lambda X: KMeans(n_clusters=3, random_state=42, n_init=10).fit_predict(X),
    'DBSCAN': lambda X: DBSCAN(eps=0.3, min_samples=5).fit_predict(X),
    'Hierarchical': lambda X: AgglomerativeClustering(n_clusters=3).fit_predict(X),
}

print("Clustering Algorithm Comparison:")
print("-" * 60)
print(f"{'Algorithm':<20} {'Moons ARI':<12} {'Circles ARI':<12}")
print("-" * 60)

for name, alg_func in algorithms.items():
    labels_moons = alg_func(StandardScaler().fit_transform(X_moons))
    labels_circles = alg_func(StandardScaler().fit_transform(X_circles))
    
    ari_moons = adjusted_rand_score(y_moons, labels_moons)
    ari_circles = adjusted_rand_score(y_circles, labels_circles)
    
    print(f"{name:<20} {ari_moons:<12.3f} {ari_circles:<12.3f}")

Algorithm Comparison Summary

Architecture Diagram
ā”Œā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”
│                 CLUSTERING ALGORITHMS COMPARISON                 │
ā”œā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”¤
│                                                                 │
│  K-MEANS:                                                       │
│  ā”œā”€ā”€ Pros: Fast, simple, scales well                           │
│  ā”œā”€ā”€ Cons: Assumes spherical clusters, needs K                  │
│  ā”œā”€ā”€ Complexity: O(n Ɨ K Ɨ iterations)                         │
│  └── Best for: Spherical clusters of similar size               │
│                                                                 │
│  DBSCAN:                                                        │
│  ā”œā”€ā”€ Pros: Finds arbitrary shapes, handles outliers            │
│  ā”œā”€ā”€ Cons: Sensitive to eps, struggles with varying density    │
│  ā”œā”€ā”€ Complexity: O(n log n) with spatial index                 │
│  └── Best for: Arbitrary shapes, noisy data                    │
│                                                                 │
│  HIERARCHICAL:                                                  │
│  ā”œā”€ā”€ Pros: No need for K, creates dendrogram                   │
│  ā”œā”€ā”€ Cons: Slow for large datasets, hard to cut                │
│  ā”œā”€ā”€ Complexity: O(n³) or O(n² log n)                          │
│  └── Best for: Small datasets, when hierarchy matters          │
│                                                                 │
│  GAUSSIAN MIXTURE:                                              │
│  ā”œā”€ā”€ Pros: Soft assignments, handles elliptical clusters       │
│  ā”œā”€ā”€ Cons: Assumes Gaussian distribution, needs K              │
│  ā”œā”€ā”€ Complexity: O(n Ɨ K Ɨ iterations)                         │
│  └── Best for: Overlapping clusters, probabilistic assignment  │
│                                                                 │
ā””ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”˜

Practical Application

šŸ“Customer Segmentation with K-Means

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
from sklearn.decomposition import PCA

np.random.seed(42)
n_customers = 1000

annual_income = np.random.normal(50000, 20000, n_customers)
spending_score = np.random.normal(50, 25, n_customers)
age = np.random.normal(40, 15, n_customers)
loyalty_score = np.random.uniform(0, 100, n_customers)

segment_1 = annual_income < 40000
segment_2 = (annual_income >= 40000) & (annual_income < 70000)
segment_3 = annual_income >= 70000

spending_score[segment_1] *= 0.6
spending_score[segment_3] *= 1.4

X_customers = np.column_stack([annual_income, spending_score, age, loyalty_score])

scaler = StandardScaler()
X_customers_scaled = scaler.fit_transform(X_customers)

# Find optimal K
for k in range(2, 8):
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(X_customers_scaled)
    sil = silhouette_score(X_customers_scaled, kmeans.labels_)
    print(f"K={k}: Silhouette={sil:.4f}, Inertia={kmeans.inertia_:.0f}")

# Apply with optimal K
best_k = 4
kmeans_final = KMeans(n_clusters=best_k, random_state=42, n_init=10)
customer_segments = kmeans_final.fit_predict(X_customers_scaled)

print(f"\nCustomer Segments (K={best_k}):")
for i in range(best_k):
    mask = customer_segments == i
    segment_data = X_customers[mask]
    print(f"\nSegment {i+1} ({mask.sum()} customers):")
    print(f"  Avg Income:   ${segment_data[:, 0].mean():,.0f}")
    print(f"  Avg Spending: {segment_data[:, 1].mean():.1f}")
    print(f"  Avg Age:      {segment_data[:, 2].mean():.1f}")

Key Takeaways

šŸ“‹Summary: K-Means Clustering

  1. K-Means is fast and simple but assumes spherical clusters of equal size
  2. Objective: Minimize within-cluster sum of squares: J=sumk=1KsumxiinCk∣xiāˆ’muk∣2J = \\sum_{k=1}^{K} \\sum_{x_i \\in C_k} \\|x_i - \\mu_k\\|^2
  3. Elbow Method and Silhouette Score help find optimal K
  4. DBSCAN handles arbitrary shapes and outliers without requiring K
  5. Hierarchical clustering provides dendrograms for understanding structure
  6. Always scale features before clustering
  7. Mini-Batch K-Means is essential for large datasets (O(nK)O(nK) vs O(nKcdottextiter)O(nK \\cdot \\text{iter}))
  8. Visualization is crucial for validating clustering results

Practice Exercises

Exercise 1: Image Segmentation

# Use K-Means to segment an image into K colors:
# 1. Load an image (or use random pixels)
# 2. Reshape to (n_pixels, 3)
# 3. Apply K-Means with K=2, 4, 8, 16
# 4. Recreate image with cluster colors
# 5. Analyze compression ratio

Exercise 2: Anomaly Detection

# Use clustering for anomaly detection:
# 1. Generate normal data + anomalies
# 2. Apply K-Means and DBSCAN
# 3. Identify points far from centroids (K-Means)
# 4. Identify noise points (DBSCAN)
# 5. Compare detection rates

Exercise 3: Compare Clustering Algorithms

# Compare K-Means, DBSCAN, and Hierarchical:
# 1. Create 5 different synthetic datasets
# 2. Apply all three algorithms
# 3. Evaluate with ARI and Silhouette
# 4. Create visualization
# 5. Discuss which works best for each dataset

Exercise 4: Choosing K for Real Data

from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
# 1. Apply K-Means with K=2,3,4,5
# 2. Use elbow method and silhouette analysis
# 3. Compare with true labels (ARI)
# 4. Visualize using PCA

Summary Table

AlgorithmK RequiredShape AssumptionComplexityBest For
K-MeansYesSphericalO(nK)Spherical clusters
DBSCANNoArbitraryO(n log n)Noisy, irregular shapes
HierarchicalNoAnyO(n²)Small datasets
GMMYesEllipticalO(nK)Soft assignments

Advertisement

Need Expert Data Science Help?

Get personalized tutoring, project support, or professional consulting.

Advertisement