Imbalanced Data: SMOTE, Class Weights

Why Class Imbalance Matters

In many real-world problems, the distribution of classes is heavily skewed. Fraud detection, disease diagnosis, and defect identification all share one trait: the event of interest is rare.

Definition: Class Imbalance

A situation in a classification problem where the classes are not represented equally in the dataset — one class (the majority) significantly outnumbers the other(s) (the minority). Class imbalance can cause models to be biased toward the majority class, failing to learn the patterns of the minority class.

Dataset Distribution Example:
┌─────────────────────────────────────────────────────────┐
│  Class 0 (Normal):     ████████████████████████  98.5%  │
│  Class 1 (Fraud):      █                          1.5%  │
└─────────────────────────────────────────────────────────┘

A naive classifier that always predicts the majority class achieves 98.5% accuracy — yet catches zero fraud cases. This is the accuracy paradox.

The accuracy paradox is a direct consequence of using accuracy as the sole metric for imbalanced problems. When the minority class is rare, a trivial classifier (always predicting the majority) can achieve high accuracy while being completely useless. This is why precision, recall, F1, and AUPRC are essential metrics for imbalanced datasets.

The Cost of Imbalance

Confusion Matrix for Imbalanced Data:
                    Predicted
                  Neg       Pos
Actual  Neg  [  TN=9850  FP=100 ]  ← Majority class
        Pos  [   FN=45    TP=5  ]  ← Minority class (rarely detected)

Accuracy = (9850 + 5) / 10000 = 98.55%  ← Looks great!
Recall   = 5 / (5 + 45) = 10%           ← Catches only 10% of fraud!

Theorem: Accuracy Paradox

For a dataset with class prevalence $p$ for the minority class, a trivial classifier that always predicts the majority class achieves accuracy $1 - p$, regardless of the classifier's ability to distinguish classes. Therefore, accuracy alone is an unreliable metric when $p \ll 0.5$.
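
The statement above can be checked numerically. The sketch below is illustrative only: the helper name `majority_baseline_accuracy` and the 985/15 label split are assumptions chosen to mirror the 98.5% / 1.5% example, not part of any library.

```python
import numpy as np

def majority_baseline_accuracy(y):
    """Accuracy of a classifier that always predicts the most frequent label."""
    y = np.asarray(y)
    majority_label = np.bincount(y).argmax()
    y_pred = np.full_like(y, majority_label)
    return float((y_pred == y).mean())

# Illustrative labels: 985 negatives and 15 positives, so p = 0.015
y = np.array([0] * 985 + [1] * 15)
p = float(y.mean())
acc = majority_baseline_accuracy(y)
print(f"prevalence p = {p:.3f}, baseline accuracy = {acc:.3f}")
```

The baseline accuracy equals 1 - p even though the classifier never predicts the minority class, which is why the metrics discussed later look at the positive class directly.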

Resampling Techniques

1. Random Oversampling

Duplicate minority class examples to balance the dataset.

from imblearn.over_sampling import RandomOverSampler
import numpy as np

# Synthetic data: 950 majority samples around 0, 50 minority samples shifted by +3
np.random.seed(42)
X = np.vstack([
    np.random.randn(950, 5),
    np.random.randn(50, 5) + 3
])
y = np.array([0]*950 + [1]*50)

print(f"Before oversampling: {np.bincount(y)}")

ros = RandomOverSampler(random_state=42)
X_res, y_res = ros.fit_resample(X, y)

print(f"After oversampling:  {np.bincount(y_res)}")

Pros: Simple, increases minority class size.
Cons: Can cause overfitting by duplicating exact samples.

2. Random Undersampling

Remove majority class examples to balance the dataset.

from imblearn.under_sampling import RandomUnderSampler

rus = RandomUnderSampler(random_state=42)
X_res, y_res = rus.fit_resample(X, y)

print(f"After undersampling:  {np.bincount(y_res)}")

Pros: Reduces training time and memory use.
Cons: Discards potentially useful majority class information, since the removed samples are chosen at random.

3. SMOTE (Synthetic Minority Over-sampling Technique)

SMOTE generates synthetic examples rather than duplicating existing ones. It interpolates between minority class neighbors.

Definition: SMOTE

Synthetic Minority Over-sampling Technique creates synthetic samples by interpolating between existing minority class instances and their k-nearest minority neighbors. Each synthetic sample lies on the line segment connecting a minority sample to one of its neighbors, so the new data stays within the region already occupied by the minority class. This does not guarantee that it follows the true minority distribution, particularly where the classes overlap.

$$x_{\text{new}} = x_i + \lambda \cdot (\hat{x} - x_i), \quad \lambda \sim U(0, 1)$$

SMOTE Algorithm:
┌──────────────────────────────────────────────────────────┐
│  1. Pick a minority class sample x_i                     │
│  2. Find k nearest minority neighbors                    │
│  3. Randomly select a neighbor x_hat                     │
│  4. Generate synthetic sample:                           │
│     x_new = x_i + lambda * (x_hat - x_i)                 │
│     where lambda ~ U(0,1)                                │
│                                                          │
│  Visual (lambda = 0.4):                                  │
│     x_i *-------*-----------* x_hat                      │
│                 ^                                        │
│               x_new                                      │
└──────────────────────────────────────────────────────────┘

The choice of k_neighbors in SMOTE affects how local the generated samples are. A small k (e.g., 3) keeps each synthetic sample close to its original minority instance, while a large k (e.g., 10) interpolates toward more distant neighbors and produces more diverse synthetic samples. A good default is k=5. Be careful when the minority class has k or fewer samples: each point needs k neighbors other than itself, so SMOTE will fail.
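
To make the interpolation step concrete, here is a minimal NumPy sketch of the four algorithm steps. It is not the imbalanced-learn implementation: the function name `smote_sketch`, its parameters, and the brute-force neighbor search are all assumptions made for illustration, and it only handles numeric features.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic rows from minority samples X_min (illustration only)."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    if n <= k:
        # Each point needs k neighbors other than itself
        raise ValueError("need more than k minority samples")

    # Brute-force pairwise Euclidean distances between minority samples
    diffs = X_min[:, None, :] - X_min[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)                # a point is not its own neighbor
    neighbors = np.argsort(dist, axis=1)[:, :k]   # step 2: k nearest minority neighbors

    base = rng.integers(0, n, size=n_new)         # step 1: pick x_i
    pick = rng.integers(0, k, size=n_new)         # step 3: pick one of its neighbors
    x_i = X_min[base]
    x_hat = X_min[neighbors[base, pick]]
    lam = rng.random((n_new, 1))                  # lambda drawn uniformly from [0, 1)
    return x_i + lam * (x_hat - x_i)              # step 4: interpolate

rng = np.random.default_rng(42)
X_min = rng.normal(loc=3.0, size=(50, 2))         # hypothetical minority samples
X_new = smote_sketch(X_min, n_new=100, k=5, seed=1)
print(X_new.shape)
```

Because every synthetic row is a convex combination of two existing minority rows, it can never fall outside the bounding box of the original minority samples.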

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from collections import Counter

X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=5,
    weights=[0.95, 0.05], random_state=42
)

print(f"Original distribution: {Counter(y)}")

smote = SMOTE(sampling_strategy='auto', random_state=42, k_neighbors=5)
X_res, y_res = smote.fit_resample(X, y)

print(f"After SMOTE:           {Counter(y_res)}")

Variants of SMOTE

from imblearn.over_sampling import (
    BorderlineSMOTE,
    SVMSMOTE,
    ADASYN
)

# BorderlineSMOTE: focuses on samples near the decision boundary
bsmote = BorderlineSMOTE(random_state=42)
X_bs, y_bs = bsmote.fit_resample(X, y)

# SVMSMOTE: uses SVM to find decision boundary
svmsmote = SVMSMOTE(random_state=42)
X_svm, y_svm = svmsmote.fit_resample(X, y)

# ADASYN: adaptive, generates more samples for harder-to-learn instances
adasyn = ADASYN(random_state=42)
X_ada, y_ada = adasyn.fit_resample(X, y)

Definition: ADASYN

Adaptive Synthetic Sampling generates more synthetic samples for minority class instances that are harder to learn (those with more majority class neighbors). This focuses the synthetic generation on the decision boundary where misclassifications occur.
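
The "harder to learn" idea can be sketched as a weighting step: for each minority sample, measure the fraction of its k nearest neighbors that belong to the majority class, then split the synthetic-sample budget in proportion to that fraction. The code below is an illustrative sketch under that assumption; `adasyn_allocation` is a hypothetical helper, not the imbalanced-learn API, and the toy data is invented.

```python
import numpy as np

def adasyn_allocation(X, y, n_new, k=5, minority_label=1):
    """Return per-minority-sample difficulty ratios and synthetic-sample counts."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    min_idx = np.flatnonzero(y == minority_label)

    # Distances from each minority sample to every sample in the dataset
    diffs = X[min_idx][:, None, :] - X[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=2))
    dist[np.arange(len(min_idx)), min_idx] = np.inf   # exclude the point itself
    nn = np.argsort(dist, axis=1)[:, :k]

    # r_i: fraction of the k nearest neighbors that are majority-class
    r = (y[nn] != minority_label).mean(axis=1)
    if r.sum() == 0:
        return r, np.zeros(len(min_idx), dtype=int)
    counts = np.rint(r / r.sum() * n_new).astype(int)
    return r, counts

rng = np.random.default_rng(0)
X_maj = rng.normal(0.0, 1.0, size=(200, 2))
X_easy = rng.normal(10.0, 0.5, size=(10, 2))   # minority points far from the majority
X_hard = rng.normal(0.0, 0.5, size=(10, 2))    # minority points inside the majority cloud
X = np.vstack([X_maj, X_easy, X_hard])
y = np.array([0] * 200 + [1] * 20)

r, counts = adasyn_allocation(X, y, n_new=100, k=5)
print("easy:", counts[:10].sum(), "hard:", counts[10:].sum())
```

In this toy setup the isolated minority cluster receives no synthetic samples, while the points embedded in the majority cloud receive the whole budget.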

4. Combination: SMOTE + Undersampling

from imblearn.combine import SMOTEENN, SMOTETomek

# SMOTEENN: SMOTE + Edited Nearest Neighbours cleanup
smote_enn = SMOTEENN(random_state=42)
X_se, y_se = smote_enn.fit_resample(X, y)

# SMOTETomek: SMOTE + Tomek links cleanup
smote_tomek = SMOTETomek(random_state=42)
X_st, y_st = smote_tomek.fit_resample(X, y)

Class Weights

Instead of resampling, adjust the loss function to penalize misclassification of the minority class more heavily.

Mathematical Formulation

Weighted Cross-Entropy Loss

$$L = -\frac{1}{N}\sum_{i=1}^{N} \left[ w_1 \cdot y_i \log(\hat{y}_i) + w_0 \cdot (1 - y_i) \log(1 - \hat{y}_i) \right]$$

Here,

  • $w_0$ = Weight for the majority class (label 0)
  • $w_1$ = Weight for the minority class (label 1)
  • $y_i$ = True label (0 or 1)
  • $\hat{y}_i$ = Predicted probability of class 1
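
A minimal NumPy sketch of this loss follows. It assumes the minority class is label 1, so the weight `w1` scales the term for true positives and `w0` scales the term for true negatives; the function name `weighted_bce` and the example probabilities are hypothetical.

```python
import numpy as np

def weighted_bce(y_true, y_prob, w0=1.0, w1=1.0, eps=1e-12):
    """Weighted binary cross-entropy; w1 applies to label 1, w0 to label 0."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip to avoid log(0) for probabilities of exactly 0 or 1
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)
    pos_term = w1 * y_true * np.log(y_prob)
    neg_term = w0 * (1 - y_true) * np.log(1 - y_prob)
    return float(-(pos_term + neg_term).mean())

y_true = np.array([0, 0, 0, 1])
y_prob = np.array([0.1, 0.2, 0.1, 0.3])   # the single positive is under-predicted

loss_plain = weighted_bce(y_true, y_prob)
loss_weighted = weighted_bce(y_true, y_prob, w1=10.0)
print(f"unweighted: {loss_plain:.4f}, minority weight 10: {loss_weighted:.4f}")
```

Raising `w1` increases only the contribution of the missed positive, so a model trained on this loss is pushed harder toward higher minority-class probabilities.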

Where weights are typically:

Inverse Frequency Weights

$$w_c = \frac{N}{C \cdot n_c}$$

Here,

  • $N$ = Total samples
  • $C$ = Number of classes
  • $n_c$ = Samples in class c
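
As a quick worked example of the inverse-frequency formula, the sketch below computes the weights directly with NumPy. The helper name `inverse_frequency_weights` and the 900/100 split are assumptions for illustration.

```python
import numpy as np

def inverse_frequency_weights(y):
    """Compute w_c = N / (C * n_c) for every class present in y."""
    classes, counts = np.unique(y, return_counts=True)
    weights = len(y) / (len(classes) * counts)
    return dict(zip(classes.tolist(), weights.tolist()))

y_demo = np.array([0] * 900 + [1] * 100)
weights_demo = inverse_frequency_weights(y_demo)
print(weights_demo)
```

With 900 majority and 100 minority samples, the majority weight is 1000 / (2 * 900) ≈ 0.556 and the minority weight is 1000 / (2 * 100) = 5, so each minority sample counts nine times as much as a majority sample.
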
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=10,
    weights=[0.9, 0.1], random_state=42
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Without class weights
lr_default = LogisticRegression(random_state=42)
lr_default.fit(X_train, y_train)

# With class weights
lr_weighted = LogisticRegression(class_weight='balanced', random_state=42)
lr_weighted.fit(X_train, y_train)

print("Without class weights:")
print(classification_report(y_test, lr_default.predict(X_test)))

print("With class weights:")
print(classification_report(y_test, lr_weighted.predict(X_test)))

Manual Weight Assignment

# Custom weights: heavily penalize missing the minority class
custom_weights = {0: 1, 1: 10}

lr_custom = LogisticRegression(class_weight=custom_weights, random_state=42)
lr_custom.fit(X_train, y_train)

# Or use compute_class_weight
from sklearn.utils.class_weight import compute_class_weight

classes = np.unique(y_train)
weights = compute_class_weight('balanced', classes=classes, y=y_train)
weight_dict = dict(zip(classes, weights))
print(f"Computed weights: {weight_dict}")

Class weights and resampling are closely related: for a loss that sums over samples, assigning weight $n_{\text{majority}} / n_{\text{minority}}$ to the minority class gives the same objective as duplicating each minority sample that many times (exactly so when the ratio is an integer). The difference is where the change happens: class weights act inside the loss function and leave the data untouched, while resampling changes the training set itself. Class weights are usually cheaper because the dataset does not grow, whereas SMOTE adds new interpolated points, which can help when the existing minority samples cover their region poorly.
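
The relationship between weighting and duplication can be verified on a tiny example. The sketch below is illustrative: the labels, the predicted probabilities, and the helper `log_loss_sum` are invented, and the comparison holds for a fixed set of predictions and an integer factor.

```python
import numpy as np

def log_loss_sum(y_true, y_prob, sample_weight=None):
    """Summed (not averaged) binary log loss with optional per-sample weights."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    per_sample = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    if sample_weight is None:
        sample_weight = np.ones_like(per_sample)
    return float((sample_weight * per_sample).sum())

y_toy = np.array([0, 0, 0, 0, 0, 0, 1, 1])                  # 6 majority, 2 minority
p_toy = np.array([0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.6, 0.4])  # hypothetical predictions
factor = 3                                                   # n_majority / n_minority

# Option A: weight every minority sample by the factor
loss_weighted = log_loss_sum(y_toy, p_toy, sample_weight=np.where(y_toy == 1, factor, 1))

# Option B: physically repeat every minority sample `factor` times
repeats = np.where(y_toy == 1, factor, 1)
loss_duplicated = log_loss_sum(np.repeat(y_toy, repeats), np.repeat(p_toy, repeats))

print(loss_weighted, loss_duplicated)
```

Both options give the same summed loss, which is why a class weight behaves like exact duplication rather than like SMOTE's interpolated points.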

Evaluation Metrics for Imbalanced Data

Confusion Matrix Components

                    Predicted
                  Neg        Pos
Actual  Neg  [    TN    |    FP   ]
        Pos  [    FN    |    TP   ]

Precision = TP / (TP + FP)   → Of predicted positives, how many correct?
Recall    = TP / (TP + FN)   → Of actual positives, how many detected?
F1-Score  = 2 × (Precision × Recall) / (Precision + Recall)
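
These formulas can be applied directly to the counts from the earlier confusion matrix (TN=9850, FP=100, FN=45, TP=5). The helper `precision_recall_f1` below is a hypothetical name used for illustration; in practice `sklearn.metrics` provides equivalent functions that work from label arrays.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 from raw confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

tn, fp, fn, tp = 9850, 100, 45, 5
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
# accuracy=0.9855 precision=0.0476 recall=0.1000 f1=0.0645
```

The same predictions that score 98.55% accuracy have an F1 of about 0.06, which reflects how poorly the minority class is handled.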

šŸ“Comprehensive Imbalanced Data Evaluation

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    roc_auc_score,
    average_precision_score,
    precision_recall_curve,
    roc_curve
)
import matplotlib.pyplot as plt

# Train model
rf = RandomForestClassifier(n_estimators=100, random_state=42, class_weight='balanced')
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
y_prob = rf.predict_proba(X_test)[:, 1]

# Standard metrics
print("Classification Report:")
print(classification_report(y_test, y_pred, target_names=['Normal', 'Minority']))

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
print(f"Confusion Matrix:\n{cm}")

# ROC-AUC (can be misleading for imbalanced data)
roc_auc = roc_auc_score(y_test, y_prob)
print(f"ROC-AUC: {roc_auc:.4f}")

# Average Precision (better metric for imbalanced data)
ap = average_precision_score(y_test, y_prob)
print(f"Average Precision: {ap:.4f}")

Precision-Recall vs ROC Curve

ROC Curve:                      Precision-Recall Curve:
  TPR ↑                           Precision ↑
  1.0 ┤  ╭────────               1.0 ┤
      │ ╭╯                           │ ╭──────╮
      │╭╯                            ││      │
  0.5─┤╯                         0.5─┤│      ╰────
      │                             │╰╯
  0.0 ┼──────────→ FPR          0.0 ┼──────────→ Recall
      0.0  0.5  1.0                 0.0  0.5  1.0

  ⚠ Can be overly optimistic    ✓ More informative for imbalance

For imbalanced data, ROC curves can be misleadingly optimistic. The false positive rate, FPR = FP / (FP + TN), is normalized by the number of actual negatives, so when the majority class is very large, even many false positives produce a small FPR. PR curves avoid this because precision, TP / (TP + FP), compares false positives directly with true positives and never uses TN.
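
A small arithmetic example shows the gap. The counts below are hypothetical (100 positives, 100,000 negatives) and the helper `fpr_and_precision` is only for illustration.

```python
def fpr_and_precision(tp, fp, tn):
    """False positive rate and precision from confusion-matrix counts."""
    fpr = fp / (fp + tn)
    precision = tp / (tp + fp)
    return fpr, precision

# Hypothetical operating point: 90 of 100 positives found, 1,000 false alarms
fpr, precision = fpr_and_precision(tp=90, fp=1_000, tn=99_000)
print(f"FPR = {fpr:.3f}, precision = {precision:.3f}")
# FPR = 0.010, precision = 0.083
```

On a ROC curve this point looks excellent (TPR 0.90 at FPR 0.01), yet fewer than one in ten flagged cases is a real positive, which the PR curve makes visible.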

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# ROC Curve
fpr, tpr, _ = roc_curve(y_test, y_prob)
axes[0].plot(fpr, tpr, label=f'ROC-AUC = {roc_auc:.3f}')
axes[0].plot([0, 1], [0, 1], 'k--', alpha=0.5)
axes[0].set_xlabel('False Positive Rate')
axes[0].set_ylabel('True Positive Rate')
axes[0].set_title('ROC Curve')
axes[0].legend()

# Precision-Recall Curve
precision, recall, _ = precision_recall_curve(y_test, y_prob)
axes[1].plot(recall, precision, label=f'AP = {ap:.3f}')
axes[1].set_xlabel('Recall')
axes[1].set_ylabel('Precision')
axes[1].set_title('Precision-Recall Curve')
axes[1].legend()

plt.tight_layout()
plt.savefig('evaluation_curves.png', dpi=150)
plt.show()

Complete Pipeline: Resampling + Evaluation

šŸ“SMOTE Pipeline with Cross-Validation

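from imblearn.pipeline import Pipeline as ImbPipeline
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score, StratifiedKFold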

# Build pipeline with SMOTE + classifier
pipeline = ImbPipeline([
    ('scaler', StandardScaler()),
    ('smote', SMOTE(random_state=42)),
    ('classifier', RandomForestClassifier(
        n_estimators=100, random_state=42
    ))
])

# Cross-validate with StratifiedKFold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(
    pipeline, X, y, cv=cv,
    scoring='average_precision', n_jobs=-1
)

print(f"Average Precision: {scores.mean():.4f} ± {scores.std():.4f}")

# Compare multiple strategies; every pipeline scales first so the comparison is fair
strategies = {
    'Baseline': ImbPipeline([
        ('scaler', StandardScaler()),
        ('clf', LogisticRegression(random_state=42))
    ]),
    'Class Weight': ImbPipeline([
        ('scaler', StandardScaler()),
        ('clf', LogisticRegression(class_weight='balanced', random_state=42))
    ]),
    'SMOTE': ImbPipeline([
        ('scaler', StandardScaler()),
        ('smote', SMOTE(random_state=42)),
        ('clf', LogisticRegression(random_state=42))
    ]),
}

print("\nStrategy Comparison:")
for name, pipe in strategies.items():
    # Resampling steps run on the training folds only; validation folds stay untouched
    scores = cross_val_score(
        pipe, X, y, cv=cv,
        scoring='average_precision'
    )
    print(f"{name:>15}: {scores.mean():.4f} ± {scores.std():.4f}")

Always apply SMOTE (or any resampling) inside the cross-validation loop, not before. If you resample before splitting, synthetic samples may leak into the validation set, giving overly optimistic estimates. The imblearn.pipeline.Pipeline ensures resampling happens only on the training folds.
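
The leakage can be made visible with plain duplication, where leaked rows are exact copies and therefore easy to count. The sketch below is an illustration under that simplification: `oversample` and `shared_rows` are hypothetical helpers, the data is random, and a single train/validation split stands in for one cross-validation fold.

```python
import numpy as np

def oversample(X, y, rng):
    """Randomly duplicate minority rows (label 1) until both classes are equal in size."""
    min_idx = np.flatnonzero(y == 1)
    n_extra = int((y == 0).sum()) - len(min_idx)
    extra = rng.choice(min_idx, size=n_extra, replace=True)
    return np.vstack([X, X[extra]]), np.concatenate([y, y[extra]])

def shared_rows(X_a, X_b):
    """Number of rows of X_b that also appear verbatim in X_a."""
    seen = {row.tobytes() for row in X_a}
    return sum(row.tobytes() in seen for row in X_b)

rng = np.random.default_rng(0)
X_all = rng.normal(size=(500, 3))
y_all = np.array([0] * 450 + [1] * 50)

# Wrong: resample first, then split
X_res, y_res = oversample(X_all, y_all, rng)
perm = rng.permutation(len(y_res))
train, val = perm[:720], perm[720:]
leaked = shared_rows(X_res[train], X_res[val])

# Right: split first, then resample the training part only
perm = rng.permutation(len(y_all))
train, val = perm[:400], perm[400:]
X_tr, y_tr = oversample(X_all[train], y_all[train], rng)
clean = shared_rows(X_tr, X_all[val])

print(f"validation rows also present in training: wrong={leaked}, right={clean}")
```

With SMOTE the leaked rows are interpolations rather than exact copies, so they cannot be counted this way, but the effect on the validation score is the same kind of optimism.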

Decision Framework

Which technique to choose?
┌─────────────────────────────────────────────────────────────────┐
│                                                                 │
│  Is the dataset very large (>100k)?                             │
│  ├─ Yes → Undersampling or class weights (SMOTE is slow)        │
│  └─ No                                                          │
│      │                                                          │
│      Is the minority class <50 samples?                         │
│      ├─ Yes → SMOTE with caution (k_neighbors must be < n)      │
│      └─ No → SMOTE or SMOTE variants                            │
│                                                                 │
│  Priority: Speed → Class weights                                │
│  Priority: Performance → SMOTE + ensemble                       │
│  Priority: Simplicity → Class weights                           │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Key Takeaways

  1. Accuracy is misleading for imbalanced datasets; use precision, recall, F1, and AUPRC instead
  2. SMOTE creates synthetic minority samples by interpolation, which reduces the overfitting risk of duplicating exact samples
  3. Class weights adjust the loss function without changing the data; they are simpler and faster, but may underperform resampling on some problems
  4. Combine techniques: SMOTE + undersampling hybrids (SMOTEENN, SMOTETomek) are worth comparing against either method alone
  5. Always evaluate on a stratified holdout set; resampling should only be applied to training data

Summary: Imbalanced Data: SMOTE, Class Weights

  1. Class imbalance causes the accuracy paradox: a trivial majority-class classifier can achieve high accuracy while being useless for the minority class.
  2. Accuracy is unreliable for imbalanced data; use precision, recall, F1, and Average Precision (AUPRC) instead.
  3. Random oversampling duplicates minority samples (risk of overfitting); random undersampling discards majority samples (risk of information loss).
  4. SMOTE generates synthetic minority samples via interpolation between k-nearest neighbors: $x_{\text{new}} = x_i + \lambda (\hat{x} - x_i)$.
  5. Class weights penalize minority misclassifications more heavily in the loss function: $w_c = N / (C \cdot n_c)$.
  6. ADASYN adaptively generates more synthetic samples for harder-to-learn minority instances near the decision boundary.
  7. Always resample inside cross-validation to prevent data leakage; use imblearn.pipeline.Pipeline.
  8. PR curves are preferred over ROC curves for imbalanced evaluation because ROC AUC can be misleadingly optimistic.
  9. Choose class weights for speed and simplicity; choose SMOTE + ensemble for maximum performance on difficult problems.
  10. Combine strategies (e.g., SMOTE + Tomek links, class weights + undersampling) for robust handling of real-world imbalance.

Practice Exercises

  1. Credit Card Fraud: Train a Random Forest on an imbalanced fraud dataset. Compare performance with and without SMOTE. Which metric matters most?
  2. Medical Diagnosis: Build a pipeline that uses BorderlineSMOTE with XGBoost. Tune the k_neighbors parameter.
  3. Cost-Sensitive Learning: Implement custom loss weights where missing a positive case costs 10x more than a false alarm. Compare with SMOTE.
  4. Evaluation: Plot both ROC and Precision-Recall curves. Under what conditions does ROC give a misleading picture?
