Clustering Algorithms


What Are Clustering Algorithms?

Clustering algorithms are a core family of unsupervised methods in Machine Learning. They partition unlabeled data into groups (clusters) so that points within a group are more similar to each other than to points in other groups.

Core Idea: A clustering algorithm discovers structure in data without target labels, typically by optimising a distance or density criterion. The resulting groups can then support exploration, segmentation, or downstream decisions.


Key Concepts

Clustering Fundamentals

K-Means objective:

$$\min_{C_1,...,C_k} \sum_{j=1}^k \sum_{x \in C_j} \|x - \mu_j\|^2$$

Algorithm:

1. Initialise k centroids (K-Means++ for smart init)
2. Assign each point to nearest centroid
3. Update centroids = mean of assigned points
4. Repeat 2-3 until convergence
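
The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration, assuming plain random initialisation rather than K-Means++; the function name `kmeans`, its parameters, and the synthetic data are assumptions made for this example, not part of any library.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm with random initialisation (not K-Means++)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned points
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated synthetic blobs (illustrative data only)
rng = np.random.default_rng(42)
X_demo = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
labels, centroids = kmeans(X_demo, k=2)
print(centroids.round(2))
```

On data this well separated the sketch should recover one centroid near each blob. In practice, `sklearn.cluster.KMeans` is the better choice because it adds K-Means++ initialisation and multiple restarts.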

Choosing k — Elbow method:

$$\text{WCSS}(k) = \sum_{j=1}^k \sum_{x \in C_j} \|x - \mu_j\|^2$$

Plot WCSS vs k; choose k at the "elbow", the point beyond which adding another cluster no longer reduces WCSS substantially.
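
To make the WCSS formula concrete, the sketch below computes it by hand and compares it with the `inertia_` attribute of scikit-learn's `KMeans`. The helper `wcss` and the three synthetic blobs are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Three synthetic blobs (illustrative data, not the lesson's dataset)
X_blobs, _ = make_blobs(n_samples=300, centers=[[0, 0], [5, 5], [0, 5]],
                        cluster_std=0.8, random_state=0)

def wcss(X, labels, centroids):
    """Sum of squared distances from each point to its assigned centroid."""
    return float(((X - centroids[labels]) ** 2).sum())

results = {}
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_blobs)
    manual = wcss(X_blobs, km.labels_, km.cluster_centers_)
    results[k] = (manual, km.inertia_)
    print(f"k={k}: WCSS={manual:8.1f}  (inertia_={km.inertia_:8.1f})")
```

With three true blobs, the printed values should drop sharply up to k=3 and flatten afterwards, which is the elbow the plot is meant to reveal.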

Silhouette score:

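$$s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))} \in [-1, 1]$$

Here a(i) is the mean distance from point i to the other points in its own cluster, and b(i) is the mean distance from point i to the points of the nearest other cluster. Values near 1 indicate a well-separated point; values near -1 suggest the point sits closer to a different cluster.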
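
As a worked example of the silhouette formula, the sketch below evaluates s(i) directly from the definition on four one-dimensional points and checks the result against scikit-learn's `silhouette_samples`. The helper `silhouette_point` and the toy data are assumptions for illustration; the helper does not handle single-point clusters.

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# Tiny 1-D example: two tight clusters
X_toy = np.array([[0.0], [1.0], [10.0], [11.0]])
labels_toy = np.array([0, 0, 1, 1])

def silhouette_point(X, labels, i):
    """s(i) for one point, computed directly from the definition."""
    d = np.linalg.norm(X - X[i], axis=1)
    same = labels == labels[i]
    same[i] = False  # a(i) excludes the point itself
    a = d[same].mean()
    # b(i): smallest mean distance to the points of any other cluster
    b = min(d[labels == c].mean() for c in np.unique(labels) if c != labels[i])
    return (b - a) / max(a, b)

manual = np.array([silhouette_point(X_toy, labels_toy, i)
                   for i in range(len(X_toy))])
print(manual.round(4))                      # [0.9048 0.8947 0.8947 0.9048]
print(silhouette_samples(X_toy, labels_toy).round(4))
```

For the first point, a = 1 and b = (10 + 11) / 2 = 10.5, so s = 9.5 / 10.5 ≈ 0.9048.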

| Algorithm | Shape Assumption | Noise Robust | Scalable |
|---|---|---|---|
| K-Means | Spherical | No | Yes |
| DBSCAN | Arbitrary | Yes | Moderate |
| Hierarchical | Any | No | No |
| Gaussian Mixture | Elliptical | Partial | Moderate |
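
The shape assumptions in the table can be seen on a non-spherical dataset. The sketch below runs K-Means and DBSCAN on scikit-learn's two-moons data and scores each against the known grouping with the adjusted Rand index. The `eps` and `min_samples` values are assumptions chosen for this particular dataset, not general recommendations.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaved half-circles: clusters that are not spherical
X_moons, y_moons = make_moons(n_samples=400, noise=0.05, random_state=0)

km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_moons)
db_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X_moons)

km_ari = adjusted_rand_score(y_moons, km_labels)
db_ari = adjusted_rand_score(y_moons, db_labels)
print(f"K-Means ARI: {km_ari:.3f}")
print(f"DBSCAN  ARI: {db_ari:.3f}")
```

K-Means cuts the moons with a straight boundary, while DBSCAN follows the dense curved regions, so DBSCAN should score much higher here.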

Python Implementation

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_breast_cancer, load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report
import warnings
warnings.filterwarnings("ignore")

# Load example dataset
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# Prepare data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s  = scaler.transform(X_test)

print(f"Dataset: {X.shape[0]} samples, {X.shape[1]} features")
print(f"Class distribution: {dict(pd.Series(y).value_counts())}")
print(f"Train / Test split: {len(X_train)} / {len(X_test)}")

# Cluster the scaled training data with K-Means
from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import silhouette_score

# Find optimal k
wcss = []
sil_scores = []
K_range = range(2, 11)

for k in K_range:
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = km.fit_predict(X_train_s)
    wcss.append(km.inertia_)
    sil_scores.append(silhouette_score(X_train_s, labels))

# Elbow plot
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
ax1.plot(K_range, wcss, "bo-"); ax1.set_title("Elbow Method (WCSS)")
ax2.plot(K_range, sil_scores, "go-"); ax2.set_title("Silhouette Score")
plt.tight_layout(); plt.show()

best_k = K_range[sil_scores.index(max(sil_scores))]
model = KMeans(n_clusters=best_k, random_state=42, n_init=10)
print(f"Best k: {best_k}")

Evaluation & Results

# Evaluate clustering quality on the held-out test set
from sklearn.metrics import (adjusted_rand_score,
                             normalized_mutual_info_score, silhouette_score)

model.fit(X_train_s)               # the model above was created but never fitted
y_pred = model.predict(X_test_s)   # cluster IDs, not class labels

# Cluster IDs are arbitrary, so compare against the known labels with
# label-invariant scores rather than accuracy or a classification report.
print(f"Adjusted Rand Index : {adjusted_rand_score(y_test, y_pred):.4f}")
print(f"Normalised MI       : {normalized_mutual_info_score(y_test, y_pred):.4f}")
print(f"Silhouette (test)   : {silhouette_score(X_test_s, y_pred):.4f}")

# Cross-validation for a robust estimate (ARI on each held-out fold)
cv_scores = cross_val_score(model, X_train_s, y_train, cv=5,
                            scoring="adjusted_rand_score")
print(f"\n5-Fold CV ARI: {cv_scores.mean():.4f} ± {cv_scores.std():.4f}")

# Visualise results
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Contingency table: true class (rows) vs assigned cluster (columns)
table = pd.crosstab(pd.Series(data.target_names[y_test], name="True class"),
                    pd.Series(y_pred, name="Cluster"))
sns.heatmap(table, annot=True, fmt="d", cmap="Blues", cbar=False, ax=axes[0])
axes[0].set_title("Classes vs Clusters")

# CV scores
axes[1].bar(range(1, 6), cv_scores, color="#3b82f6", edgecolor="white")
axes[1].axhline(cv_scores.mean(), color="red", linestyle="--",
                label=f"Mean={cv_scores.mean():.4f}")
axes[1].set_xlabel("Fold"); axes[1].set_ylabel("ARI")
axes[1].set_title("Cross-Validation Scores")
axes[1].legend(); axes[1].grid(True, alpha=0.3, axis="y")

plt.tight_layout()
plt.show()
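
Cluster IDs returned by K-Means are arbitrary integers, so comparing them with class labels needs a metric that ignores how clusters are numbered. The sketch below uses hypothetical toy labels to show this, then aligns cluster IDs to classes with the Hungarian method from SciPy. The alignment step assumes the number of clusters equals the number of classes.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import accuracy_score, adjusted_rand_score, confusion_matrix

y_true = np.array([0, 0, 0, 1, 1, 1])
clusters = np.array([1, 1, 1, 0, 0, 0])  # perfect grouping, swapped IDs

print(accuracy_score(y_true, clusters))        # 0.0
print(adjusted_rand_score(y_true, clusters))   # 1.0

# Align cluster IDs to classes by maximising the matched counts
cm = confusion_matrix(y_true, clusters)        # rows: classes, cols: clusters
class_idx, cluster_idx = linear_sum_assignment(-cm)
mapping = dict(zip(cluster_idx, class_idx))
aligned = np.array([mapping[c] for c in clusters])
print(accuracy_score(y_true, aligned))         # 1.0
```

The adjusted Rand index is unaffected by the swap, whereas raw accuracy reports a perfect clustering as completely wrong until the IDs are aligned.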

Comparison with Related Methods

| Method | Strengths | Weaknesses | Best For |
|---|---|---|---|
| Clustering Algorithms | Need no labels | Need tuning (e.g. k, eps) | Grouping unlabeled data |
| Random Forest | Robust, handles missing data | Slow inference | Tabular data |
| XGBoost | High accuracy, fast | Many hyperparameters | Competitions, production |
| Logistic Reg. | Interpretable, fast | Linear boundary only | Binary baseline |
| SVM | Good in high-dim | Slow on large data | Text, images |

All methods other than clustering in this table are supervised and require labeled training data.

Hyperparameter Tuning

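from sklearn.model_selection import GridSearchCV
from sklearn.metrics import silhouette_score

def silhouette_scorer(estimator, X, y=None):
    """Score a fitted clusterer by the silhouette of its assignments on X."""
    return silhouette_score(X, estimator.predict(X))

# KMeans has no C, gamma or kernel; tune its own parameters instead
param_grid = {
    "n_clusters": [2, 3, 4, 5],
    "init":       ["k-means++", "random"],
    "max_iter":   [100, 300],
}

grid = GridSearchCV(model, param_grid, cv=5,
                    scoring=silhouette_scorer, n_jobs=-1, verbose=0)
grid.fit(X_train_s)

print(f"Best params        : {grid.best_params_}")
print(f"Best CV silhouette : {grid.best_score_:.4f}")
print(f"Test silhouette    : {grid.score(X_test_s):.4f}")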

Key Takeaways

  1. Clustering algorithms group unlabeled data and are a core tool for unsupervised learning tasks
  2. Always scale features before applying distance-based methods such as K-Means or DBSCAN
  3. Evaluate on held-out data or with internal metrics such as the silhouette score; when ground-truth labels exist, use label-invariant metrics such as the adjusted Rand index
  4. Start simple: a strong baseline such as K-Means prevents over-engineering
  5. Visualise everything: elbow curves, silhouette scores, and class-versus-cluster tables
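
The scaling advice can be illustrated with a small synthetic experiment. In the sketch below, one feature carries the group structure on a small scale and a second feature is pure noise on a large scale. All names and values are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
group = np.repeat([0, 1], 200)
informative = group + rng.normal(0, 0.05, 400)   # separates the groups, small scale
nuisance = rng.normal(0, 1000, 400)              # pure noise, large scale
X_raw = np.column_stack([informative, nuisance])

def cluster_ari(X):
    """Cluster X into two groups and score agreement with the true grouping."""
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    return adjusted_rand_score(group, labels)

ari_raw = cluster_ari(X_raw)
ari_scaled = cluster_ari(StandardScaler().fit_transform(X_raw))
print(f"ARI without scaling: {ari_raw:.3f}")
print(f"ARI with scaling   : {ari_scaled:.3f}")
```

Without scaling, Euclidean distance is dominated by the large-scale noise feature, so K-Means splits on noise. After standardisation the informative feature can drive the split.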
