Clustering Algorithms

What Are Clustering Algorithms?

Clustering algorithms are a core family of unsupervised machine learning methods. They group data points so that points in the same cluster are more similar to each other than to points in other clusters.

Core Idea: A clustering algorithm partitions unlabeled data into groups using a distance or similarity measure. There is no target variable to learn from, so the result is a set of cluster assignments rather than predictions of a known label.

Key Concepts

Clustering Fundamentals

K-Means objective:

$\min_{C_1,\dots,C_k} \sum_{j=1}^k \sum_{x \in C_j} \|x - \mu_j\|^2$

where $\mu_j$ is the mean (centroid) of the points in cluster $C_j$.
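A short derivation may help explain why the centroid update uses the mean. It is a sketch under the assumption that the assignments $C_j$ are held fixed and only $\mu_j$ is varied. For one cluster, the objective is

$J_j(\mu_j) = \sum_{x \in C_j} \|x - \mu_j\|^2$

Setting the gradient with respect to $\mu_j$ to zero gives

$\nabla_{\mu_j} J_j = -2 \sum_{x \in C_j} (x - \mu_j) = 0 \;\Rightarrow\; \mu_j = \frac{1}{|C_j|} \sum_{x \in C_j} x$

So, for fixed assignments, the mean of the assigned points is the minimiser of the K-Means objective for that cluster.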

Algorithm:

1. Initialise k centroids (K-Means++ for smart init)
2. Assign each point to nearest centroid
3. Update centroids = mean of assigned points
4. Repeat steps 2-3 until the assignments stop changing (or the centroids move less than a small tolerance)
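The four steps above can be sketched in a few lines of NumPy. This is an illustrative version only: the function name `kmeans_lloyd`, its parameters and the toy data are assumptions made for this example, and it uses plain random initialisation rather than K-Means++.

```python
import numpy as np

def kmeans_lloyd(X, k, n_iter=100, tol=1e-8, seed=0):
    """Plain Lloyd's algorithm with random initialisation (not K-Means++)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned points
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop once the centroids no longer move
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:
            break
    return centroids, labels

# Toy data: two well-separated blobs of 50 points each
rng = np.random.default_rng(42)
X_toy = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
centroids, labels = kmeans_lloyd(X_toy, k=2)
print(centroids)
```

In practice `sklearn.cluster.KMeans` is the better choice, since it adds K-Means++ initialisation and multiple restarts.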

Choosing k — Elbow method:

$\text{WCSS}(k) = \sum_{j=1}^k \sum_{x \in C_j} \|x - \mu_j\|^2$

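Plot WCSS against k and choose the k at the "elbow", the point after which adding more clusters yields only small reductions in WCSS. WCSS keeps decreasing as k grows, so its minimum alone is not informative.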
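To connect the formula to code: scikit-learn exposes WCSS as the `inertia_` attribute of a fitted `KMeans` model. The sketch below recomputes it directly from the definition on synthetic data (the `make_blobs` data and variable names are assumptions for illustration, not the dataset used later).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data, for illustration only
X_demo, _ = make_blobs(n_samples=300, centers=3, random_state=0)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_demo)

# WCSS from the formula: squared distance of each point to its own centroid
wcss_manual = sum(
    np.sum((X_demo[km.labels_ == j] - km.cluster_centers_[j]) ** 2)
    for j in range(3)
)
print(f"Manual WCSS : {wcss_manual:.4f}")
print(f"km.inertia_ : {km.inertia_:.4f}")
```

The two printed values should agree up to floating-point error.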

Silhouette score:

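$s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))} \in [-1, 1]$

where $a(i)$ is the mean distance from point $i$ to the other points in its own cluster, and $b(i)$ is the mean distance from $i$ to the points of the nearest other cluster. Values near 1 indicate a well-placed point, values near 0 a point on a cluster boundary, and negative values a point that may belong to a different cluster.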
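As a worked example, the silhouette of each point can be computed straight from the definition and compared with scikit-learn's `silhouette_samples`. The six-point dataset and the helper `silhouette_of` are hypothetical and exist only for this illustration; Euclidean distance is assumed.

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# Tiny hand-made example: two clusters on a line
X_small = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
labels_small = np.array([0, 0, 0, 1, 1, 1])

def silhouette_of(i, X, labels):
    """s(i) computed directly from the definition (Euclidean distance)."""
    d = np.linalg.norm(X - X[i], axis=1)
    same = labels == labels[i]
    same[i] = False                       # exclude the point itself
    a = d[same].mean()                    # mean distance within own cluster
    b = min(d[labels == c].mean()         # mean distance to nearest other cluster
            for c in np.unique(labels) if c != labels[i])
    return (b - a) / max(a, b)

manual = np.array([silhouette_of(i, X_small, labels_small)
                   for i in range(len(X_small))])
reference = silhouette_samples(X_small, labels_small)
print(np.round(manual, 4))
print(np.round(reference, 4))
```

For the first point, a = (1 + 2) / 2 = 1.5 and b = (10 + 11 + 12) / 3 = 11, so s = 9.5 / 11. The overall silhouette score is the mean of these per-point values.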

Algorithm	Shape Assumption	Noise Robust	Scalable
K-Means	Spherical	❌	✅
DBSCAN	Arbitrary	✅	Moderate
Hierarchical	Any	❌	❌
Gaussian Mixture	Elliptical	Partial	Moderate
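The "Shape Assumption" column can be illustrated on two interleaved half-moons, which are not spherical. This is a sketch on synthetic data; the `eps` and `min_samples` values are hand-picked assumptions for this toy set, not general recommendations.

```python
from sklearn.cluster import KMeans, DBSCAN
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaved half-moons: non-spherical clusters (synthetic, illustrative)
X_moons, y_moons = make_moons(n_samples=400, noise=0.05, random_state=0)

km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_moons)
# eps and min_samples are tuned by hand for this toy data
db_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X_moons)

# Adjusted Rand Index compares a clustering with the known grouping
# and is unaffected by how the cluster IDs are numbered
ari_km = adjusted_rand_score(y_moons, km_labels)
ari_db = adjusted_rand_score(y_moons, db_labels)
print(f"K-Means ARI: {ari_km:.2f}")
print(f"DBSCAN  ARI: {ari_db:.2f}")
```

K-Means tends to cut the moons with a straight boundary, while a density-based method can follow their curved shape, which is the behaviour the table summarises.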

Python Implementation

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_breast_cancer, load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report
import warnings
warnings.filterwarnings("ignore")

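# Load example dataset (the labels y are kept only to check the clusters afterwards;
# the clustering itself never sees them)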
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# Prepare data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s  = scaler.transform(X_test)

print(f"Dataset: {X.shape[0]} samples, {X.shape[1]} features")
print(f"Class distribution: {dict(pd.Series(y).value_counts())}")
print(f"Train / Test split: {len(X_train)} / {len(X_test)}")

from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import silhouette_score

# Find optimal k
wcss = []
sil_scores = []
K_range = range(2, 11)

for k in K_range:
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = km.fit_predict(X_train_s)
    wcss.append(km.inertia_)
    sil_scores.append(silhouette_score(X_train_s, labels))

# Elbow plot
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
ax1.plot(K_range, wcss, "bo-"); ax1.set_title("Elbow Method (WCSS)")
ax2.plot(K_range, sil_scores, "go-"); ax2.set_title("Silhouette Score")
plt.tight_layout(); plt.show()

best_k = K_range[sil_scores.index(max(sil_scores))]
model = KMeans(n_clusters=best_k, random_state=42, n_init=10)
print(f"Best k: {best_k}")

Evaluation & Results

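# Evaluate clustering quality
from sklearn.metrics import adjusted_rand_score

# Fit on the training features only; the labels are never used for fitting
model.fit(X_train_s)
y_pred = model.predict(X_test_s)

# Cluster IDs are arbitrary, so accuracy and F1 are not meaningful here.
# Internal metric: silhouette (no labels). External metric: ARI (held-out labels).
print(f"Silhouette (test)          : {silhouette_score(X_test_s, y_pred):.4f}")
print(f"Adjusted Rand Index (test) : {adjusted_rand_score(y_test, y_pred):.4f}")

# Cross-validated ARI: fit on four folds, score cluster assignments on the fifth
cv_scores = cross_val_score(model, X_train_s, y_train, cv=5,
                            scoring="adjusted_rand_score")
print(f"\n5-Fold CV ARI: {cv_scores.mean():.4f} ± {cv_scores.std():.4f}")

# Visualise results
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Contingency table: true classes (rows) vs cluster IDs (columns)
ct = pd.crosstab(pd.Series(data.target_names[y_test], name="True class"),
                 pd.Series(y_pred, name="Cluster"))
axes[0].imshow(ct.values, cmap="Blues")
axes[0].set_xticks(range(ct.shape[1])); axes[0].set_xticklabels(ct.columns)
axes[0].set_yticks(range(ct.shape[0])); axes[0].set_yticklabels(ct.index)
for (r, c), v in np.ndenumerate(ct.values):
    axes[0].text(c, r, str(v), ha="center", va="center")
axes[0].set_xlabel("Cluster"); axes[0].set_title("True Classes vs Clusters")

# CV scores
axes[1].bar(range(1, 6), cv_scores, color="#3b82f6", edgecolor="white")
axes[1].axhline(cv_scores.mean(), color="red", linestyle="--",
                label=f"Mean={cv_scores.mean():.4f}")
axes[1].set_xlabel("Fold"); axes[1].set_ylabel("Adjusted Rand Index")
axes[1].set_title("Cross-Validation Scores")
axes[1].legend(); axes[1].grid(True, alpha=0.3, axis="y")

plt.tight_layout()
plt.show()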

Comparison with Related Methods

Method	Strengths	Weaknesses	Best For
K-Means Clustering	Needs no labels, fast	Must choose k, assumes spherical clusters	Grouping unlabeled data
Random Forest	Robust, handles missing data	Slow inference	Tabular data
XGBoost	High accuracy, fast	Many hyperparameters	Competitions, production
Logistic Reg.	Interpretable, fast	Linear boundary only	Binary baseline
SVM	Good in high-dim	Slow on large data	Text, images

Hyperparameter Tuning

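from sklearn.model_selection import GridSearchCV

# K-Means has no C, gamma or kernel parameters; tune its own parameters instead
param_grid = {
    "n_clusters": [2, 3, 4, 5],
    "init":       ["k-means++", "random"],
}

# No labels are needed: score each candidate by silhouette on the held-out fold
def silhouette_scorer(estimator, X, y=None):
    return silhouette_score(X, estimator.predict(X))

grid = GridSearchCV(model, param_grid, cv=5,
                    scoring=silhouette_scorer, n_jobs=-1, verbose=0)
grid.fit(X_train_s)

print(f"Best params        : {grid.best_params_}")
print(f"Best CV silhouette : {grid.best_score_:.4f}")
print(f"Test silhouette    : {grid.score(X_test_s):.4f}")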

Key Takeaways

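Clustering algorithms group unlabeled data; there is no target variable, so supervised metrics such as accuracy do not apply directly
Always scale features before applying distance-based methods such as K-Means or DBSCAN
Validate clusters on held-out data where possible, using internal metrics (silhouette) or external ones (adjusted Rand index) when labels exist
Start simple: K-Means is a strong baseline before trying DBSCAN, hierarchical clustering or Gaussian mixtures
Visualise everything: elbow and silhouette curves, plus the clusters themselves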
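The scaling takeaway can be checked with a small experiment. The sketch below uses made-up data in which one feature separates two groups and a second feature is pure noise on a much larger scale; all names and numbers are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 200
# Feature 0 separates the two groups; feature 1 is noise on a far larger scale
informative = np.concatenate([rng.normal(0, 1, n), rng.normal(10, 1, n)])
noise = rng.normal(0, 1000, 2 * n)
X_raw = np.column_stack([informative, noise])
y_true = np.repeat([0, 1], n)

def cluster_ari(X):
    """Cluster with K-Means (k=2) and compare with the known grouping."""
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    return adjusted_rand_score(y_true, labels)

ari_raw = cluster_ari(X_raw)
ari_scaled = cluster_ari(StandardScaler().fit_transform(X_raw))
print(f"ARI without scaling: {ari_raw:.2f}")
print(f"ARI with scaling   : {ari_scaled:.2f}")
```

Without scaling, Euclidean distance is dominated by the large-scale noise feature, so K-Means splits along it and largely ignores the feature that actually separates the groups.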