Imbalanced Data: SMOTE, Class Weights
Why Class Imbalance Matters
In many real-world problems, the distribution of classes is heavily skewed. Fraud detection, disease diagnosis, and defect identification all share one trait: the event of interest is rare.
Definition: Class Imbalance
A situation in a classification problem where the classes are not represented equally in the dataset: one class (the majority) significantly outnumbers the other(s) (the minority). Class imbalance can cause models to be biased toward the majority class, failing to learn the patterns of the minority class.
Dataset Distribution Example:
Class 0 (Normal): ######################## 98.5%
Class 1 (Fraud):  #                         1.5%
A naive classifier that always predicts the majority class achieves 98.5% accuracy, yet catches zero fraud cases. This is the accuracy paradox.
The accuracy paradox is a direct consequence of using accuracy as the sole metric for imbalanced problems. When the minority class is rare, a trivial classifier (always predicting the majority) can achieve high accuracy while being completely useless. This is why precision, recall, F1, and AUPRC are essential metrics for imbalanced datasets.
The Cost of Imbalance
Confusion Matrix for Imbalanced Data:
                 Predicted
                 Neg        Pos
Actual  Neg  [ TN=9850    FP=100 ]  ← Majority class
        Pos  [ FN=45      TP=5   ]  ← Minority class (rarely detected)
Accuracy = (9850 + 5) / 10000 = 98.55%  ← Looks great!
Recall   = 5 / (5 + 45)       = 10%     ← Catches only 10% of fraud!
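The arithmetic above can be reproduced directly from the four counts. The short sketch below also derives precision and F1 from the same matrix, which the example does not print; the variable names are chosen here for illustration.

```python
# Counts taken from the confusion matrix above
tn, fp, fn, tp = 9850, 100, 45, 5

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
recall = tp / (tp + fn)      # share of actual positives that were caught
precision = tp / (tp + fp)   # share of predicted positives that were right
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} recall={recall:.4f} "
      f"precision={precision:.4f} f1={f1:.4f}")
```

Precision and F1 come out even lower than recall here, which is the part of the picture that accuracy hides.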
Theorem: Accuracy Paradox
For a dataset in which the minority class has prevalence $\pi$, a trivial classifier that always predicts the majority class achieves accuracy $1 - \pi$, regardless of the classifier's ability to distinguish classes. Therefore, accuracy alone is an unreliable metric when $\pi \ll 0.5$.
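A quick way to check this statement empirically is scikit-learn's `DummyClassifier` with the `most_frequent` strategy. The sketch below assumes a minority prevalence of 1.5% (as in the fraud example) on 10,000 toy labels; `pi`, `X_toy`, and `y_toy` are names introduced here for illustration.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

pi = 0.015                      # assumed minority prevalence
n = 10_000
n_pos = int(round(pi * n))      # 150 minority samples
y_toy = np.array([0] * (n - n_pos) + [1] * n_pos)
X_toy = np.zeros((n, 1))        # features are irrelevant to this baseline

# Always predict the most frequent (majority) class
baseline = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
y_pred = baseline.predict(X_toy)

acc = accuracy_score(y_toy, y_pred)
rec = recall_score(y_toy, y_pred)
print(f"accuracy={acc:.3f}  (1 - pi = {1 - pi:.3f}),  recall={rec:.3f}")
```

The baseline's accuracy equals one minus the prevalence, while its recall on the minority class is zero.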
Resampling Techniques
1. Random Oversampling
Duplicate minority class examples to balance the dataset.
from imblearn.over_sampling import RandomOverSampler
import numpy as np
# Toy data: 950 majority points around 0, 50 minority points shifted by +3
np.random.seed(42)
X = np.vstack([
    np.random.randn(950, 5),
    np.random.randn(50, 5) + 3
])
y = np.array([0]*950 + [1]*50)
print(f"Before oversampling: {np.bincount(y)}")
ros = RandomOverSampler(random_state=42)
X_res, y_res = ros.fit_resample(X, y)
print(f"After oversampling: {np.bincount(y_res)}")
Pros: Simple, increases minority class size. Cons: Can cause overfitting by duplicating exact samples.
2. Random Undersampling
Remove majority class examples to balance the dataset.
from imblearn.under_sampling import RandomUnderSampler
rus = RandomUnderSampler(random_state=42)
X_res, y_res = rus.fit_resample(X, y)
print(f"After undersampling: {np.bincount(y_res)}")
Pros: Reduces training time and memory use. Cons: Discards potentially useful information, since the dropped majority samples are chosen at random.
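To make the two random resamplers less of a black box, here is a minimal NumPy sketch of what they do conceptually: oversampling draws minority indices with replacement, undersampling keeps a random subset of majority indices. This is an illustration under those assumptions, not imbalanced-learn's implementation, and the toy data and variable names are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(42)
y = np.array([0] * 950 + [1] * 50)
X = rng.normal(size=(1000, 5))

maj_idx = np.flatnonzero(y == 0)
min_idx = np.flatnonzero(y == 1)

# Oversampling: draw minority indices with replacement until classes match
extra = rng.choice(min_idx, size=len(maj_idx) - len(min_idx), replace=True)
over_idx = np.concatenate([maj_idx, min_idx, extra])
X_over, y_over = X[over_idx], y[over_idx]

# Undersampling: keep a random majority subset, drawn without replacement
keep = rng.choice(maj_idx, size=len(min_idx), replace=False)
under_idx = np.concatenate([keep, min_idx])
X_under, y_under = X[under_idx], y[under_idx]

print(np.bincount(y_over))   # [950 950]
print(np.bincount(y_under))  # [50 50]
```

Oversampling adds no new information (every extra row is a copy), and undersampling here throws away 900 of the 950 majority rows, which is the trade-off listed in the pros and cons.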
3. SMOTE (Synthetic Minority Over-sampling Technique)
SMOTE generates synthetic examples rather than duplicating existing ones. It interpolates between minority class neighbors.
Definition: SMOTE
Synthetic Minority Over-sampling Technique creates synthetic samples by interpolating between existing minority class instances and their k-nearest minority neighbors. Each synthetic sample lies on the line segment connecting a minority sample to one of its neighbors, so new points stay within the region already spanned by the minority class. Where the classes overlap, such a segment can still pass through majority territory.
SMOTE Algorithm:
  1. Pick a minority class sample x_i
  2. Find its k nearest minority neighbors
  3. Randomly select one neighbor x_hat
  4. Generate a synthetic sample:
       x_new = x_i + lambda * (x_hat - x_i),  where lambda ~ U(0,1)
  Visual (lambda = 0.4, o marks x_new):
       x_i *--------o-----------* x_hat
The choice of k_neighbors in SMOTE affects the smoothness of the generated samples. Small k (e.g., 3) produces samples close to the original minority instances, while large k (e.g., 10) produces more diverse synthetic samples. A good default is k=5. Be careful with very small minority classes: SMOTE needs more than k minority samples and raises an error otherwise.
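The interpolation steps can be sketched in a few lines of NumPy. This is a simplified illustration, not imbalanced-learn's implementation: `smote_sketch` is a hypothetical helper, it uses brute-force distances, and it picks base samples uniformly at random.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic points from minority samples X_min."""
    rng = np.random.default_rng(seed)
    n = len(X_min)
    if n <= k:
        raise ValueError("need more than k minority samples")

    # Brute-force squared distances between all minority samples
    d = ((X_min[:, None, :] - X_min[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d, np.inf)                 # a sample is not its own neighbor
    neighbors = np.argsort(d, axis=1)[:, :k]    # k nearest minority neighbors

    base = rng.integers(0, n, size=n_new)                     # pick x_i
    nb = neighbors[base, rng.integers(0, k, size=n_new)]      # pick x_hat
    lam = rng.random((n_new, 1))                              # lambda in [0, 1)
    return X_min[base] + lam * (X_min[nb] - X_min[base])      # interpolate

rng = np.random.default_rng(42)
X_min = rng.normal(loc=3.0, size=(50, 5))   # toy minority class
X_syn = smote_sketch(X_min, n_new=900)
print(X_syn.shape)
```

Because every synthetic point is a convex combination of two existing points, it can never fall outside the per-feature range of the original minority samples.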
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from collections import Counter
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=5,
    weights=[0.95, 0.05], random_state=42
)
print(f"Original distribution: {Counter(y)}")
smote = SMOTE(sampling_strategy='auto', random_state=42, k_neighbors=5)
X_res, y_res = smote.fit_resample(X, y)
print(f"After SMOTE: {Counter(y_res)}")
Variants of SMOTE
from imblearn.over_sampling import (
    BorderlineSMOTE,
    SVMSMOTE,
    ADASYN
)
# BorderlineSMOTE: focuses on samples near the decision boundary
bsmote = BorderlineSMOTE(random_state=42)
X_bs, y_bs = bsmote.fit_resample(X, y)
# SVMSMOTE: uses SVM to find decision boundary
svmsmote = SVMSMOTE(random_state=42)
X_svm, y_svm = svmsmote.fit_resample(X, y)
# ADASYN: adaptive, generates more samples for harder-to-learn instances
adasyn = ADASYN(random_state=42)
X_ada, y_ada = adasyn.fit_resample(X, y)
Definition: ADASYN
Adaptive Synthetic Sampling generates more synthetic samples for minority class instances that are harder to learn (those with more majority class neighbors). This focuses the synthetic generation on the decision boundary where misclassifications occur.
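The "harder to learn" idea can be made concrete by counting, for each minority sample, what fraction of its k nearest neighbors belong to the majority class, then sharing out the synthetic samples in proportion. The sketch below covers only this allocation step; `adasyn_allocation` is a hypothetical helper, not imbalanced-learn's code, and it assumes the data has no exact duplicate points.

```python
import numpy as np

def adasyn_allocation(X, y, n_new, k=5, minority_label=1):
    """Return how many synthetic samples to generate per minority sample."""
    X_min = X[y == minority_label]
    # Squared distances from each minority sample to every sample
    d = ((X_min[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    # Take k+1 nearest and drop the first, which is the sample itself
    # (distance 0, assuming no exact duplicates)
    nn = np.argsort(d, axis=1)[:, 1:k + 1]
    # Hardness: fraction of majority-class points among the k neighbors
    r = (y[nn] != minority_label).mean(axis=1)
    if r.sum() == 0:
        return np.zeros(len(X_min), dtype=int)
    return np.rint(r / r.sum() * n_new).astype(int)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(950, 2)),
               rng.normal(2.0, 1.0, size=(50, 2))])
y = np.array([0] * 950 + [1] * 50)

counts = adasyn_allocation(X, y, n_new=900)
print(counts.min(), counts.max(), counts.sum())
```

Minority samples surrounded by other minority points receive few or no synthetic samples, while those sitting among majority points receive the most.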
4. Combination: SMOTE + Undersampling
from imblearn.combine import SMOTEENN, SMOTETomek
# SMOTEENN: SMOTE + Edited Nearest Neighbours cleanup
smote_enn = SMOTEENN(random_state=42)
X_se, y_se = smote_enn.fit_resample(X, y)
# SMOTETomek: SMOTE + Tomek links cleanup
smote_tomek = SMOTETomek(random_state=42)
X_st, y_st = smote_tomek.fit_resample(X, y)
Class Weights
Instead of resampling, adjust the loss function to penalize misclassification of the minority class more heavily.
Mathematical Formulation
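Weighted Cross-Entropy Loss
$$L = -\frac{1}{N} \sum_{i=1}^{N} \left[ w_1 \, y_i \log(\hat{p}_i) + w_0 \, (1 - y_i) \log(1 - \hat{p}_i) \right]$$
Here,
- $w_0$ = Weight for the majority class
- $w_1$ = Weight for the minority class
- $y_i$ = True label (0 or 1)
- $\hat{p}_i$ = Predicted probability of the positive class
Where weights are typically:
Inverse Frequency Weights
$$w_c = \frac{N}{K \cdot N_c}$$
Here,
- $N$ = Total samples
- $K$ = Number of classes
- $N_c$ = Samples in class c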
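As a worked example, the sketch below computes inverse-frequency weights of the form N / (K * N_c), which is the formula scikit-learn documents for `class_weight='balanced'`, and plugs them into a weighted binary cross-entropy. Averaging over N is an assumption made here; `balanced_weights` and `weighted_bce` are illustrative names.

```python
import numpy as np

def balanced_weights(y):
    """Inverse-frequency weights: N / (K * N_c) for each class c."""
    classes, counts = np.unique(y, return_counts=True)
    w = len(y) / (len(classes) * counts)
    return dict(zip(classes.tolist(), w.tolist()))

def weighted_bce(y_true, p_hat, w0, w1, eps=1e-12):
    """Weighted binary cross-entropy, averaged over samples (an assumption)."""
    p = np.clip(p_hat, eps, 1 - eps)   # avoid log(0)
    per_sample = -(w1 * y_true * np.log(p)
                   + w0 * (1 - y_true) * np.log(1 - p))
    return per_sample.mean()

y_demo = np.array([0] * 900 + [1] * 100)
w = balanced_weights(y_demo)
print(w)

# A model that outputs 0.1 for everyone: cheap unweighted, costly weighted
p_const = np.full(y_demo.shape, 0.1)
loss_plain = weighted_bce(y_demo, p_const, w0=1.0, w1=1.0)
loss_weighted = weighted_bce(y_demo, p_const, w0=w[0], w1=w[1])
print(f"unweighted={loss_plain:.4f}  weighted={loss_weighted:.4f}")
```

With a 9:1 split the minority weight is 5.0 and the majority weight is about 0.56, so each class contributes the same total weight (500) to the loss.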
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.datasets import make_classification
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=10,
    weights=[0.9, 0.1], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
# Without class weights
lr_default = LogisticRegression(random_state=42)
lr_default.fit(X_train, y_train)
# With class weights
lr_weighted = LogisticRegression(class_weight='balanced', random_state=42)
lr_weighted.fit(X_train, y_train)
print("Without class weights:")
print(classification_report(y_test, lr_default.predict(X_test)))
print("With class weights:")
print(classification_report(y_test, lr_weighted.predict(X_test)))
Manual Weight Assignment
# Custom weights: heavily penalize missing the minority class
custom_weights = {0: 1, 1: 10}
lr_custom = LogisticRegression(class_weight=custom_weights, random_state=42)
lr_custom.fit(X_train, y_train)
# Or use compute_class_weight
from sklearn.utils.class_weight import compute_class_weight
classes = np.unique(y_train)
weights = compute_class_weight('balanced', classes=classes, y=y_train)
weight_dict = dict(zip(classes, weights))
print(f"Computed weights: {weight_dict}")
Class weights and resampling are mathematically related: assigning weight $w$ to the minority class has the same effect on the loss as duplicating every minority sample $w$ times, that is, random oversampling by that factor. The difference is where the change happens: class weights operate in the loss function (no data modification), while resampling changes the training set, and SMOTE goes further by creating new, interpolated data points. Class weights are generally preferred for speed, while SMOTE can capture more complex minority class structure.
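The link between weighting and duplication is easy to verify numerically. The sketch below uses made-up predictions and an integer minority weight of 9 (both assumptions for illustration) and compares the total log loss computed two ways.

```python
import numpy as np

rng = np.random.default_rng(0)
y_demo = np.array([0] * 90 + [1] * 10)
p_hat = rng.uniform(0.05, 0.95, size=y_demo.size)   # stand-in predictions

def log_loss_terms(y_true, p):
    """Per-sample binary cross-entropy."""
    return -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

w = 9  # integer weight for the minority class

# (a) Weighted loss: each minority term counts w times
sample_w = np.where(y_demo == 1, w, 1)
weighted_total = (sample_w * log_loss_terms(y_demo, p_hat)).sum()

# (b) Oversampled loss: physically repeat each minority row w times
y_dup = np.repeat(y_demo, sample_w)
p_dup = np.repeat(p_hat, sample_w)
dup_total = log_loss_terms(y_dup, p_dup).sum()

print(np.isclose(weighted_total, dup_total))  # True
```

The equality holds for exact duplication only. SMOTE adds points at new locations, so no choice of weights reproduces it.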
Evaluation Metrics for Imbalanced Data
Confusion Matrix Components
Predicted
Neg Pos
Actual Neg [ TN | FP ]
Pos [ FN | TP ]
Precision = TP / (TP + FP)   ← Of predicted positives, how many correct?
Recall    = TP / (TP + FN)   ← Of actual positives, how many detected?
F1-Score  = 2 × (Precision × Recall) / (Precision + Recall)
Comprehensive Imbalanced Data Evaluation
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    roc_auc_score,
    average_precision_score,
    precision_recall_curve,
    roc_curve
)
import matplotlib.pyplot as plt
# Train model
rf = RandomForestClassifier(n_estimators=100, random_state=42, class_weight='balanced')
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
y_prob = rf.predict_proba(X_test)[:, 1]
# Standard metrics
print("Classification Report:")
print(classification_report(y_test, y_pred, target_names=['Normal', 'Minority']))
# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
print(f"Confusion Matrix:\n{cm}")
# ROC-AUC (can be misleading for imbalanced data)
roc_auc = roc_auc_score(y_test, y_prob)
print(f"ROC-AUC: {roc_auc:.4f}")
# Average Precision (better metric for imbalanced data)
ap = average_precision_score(y_test, y_prob)
print(f"Average Precision: {ap:.4f}")
Precision-Recall vs ROC Curve
[Figure: side-by-side sketches of a ROC curve (TPR vs. FPR, axes from 0.0 to 1.0; can be overly optimistic) and a Precision-Recall curve (Precision vs. Recall, axes from 0.0 to 1.0; more informative for imbalance).]
For imbalanced data, ROC curves can be misleadingly optimistic. The FPR is FP / (FP + TN), so when the majority (negative) class is very large, even many false positives result in a small FPR. PR curves avoid this issue because precision, TP / (TP + FP), measures false positives relative to true positives rather than relative to the large pool of negatives.
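A small numeric example shows how far the two views can diverge. The counts below are invented for illustration: 100,000 negatives, 100 positives, and a classifier that finds 80 positives at the cost of 1,000 false alarms.

```python
# Hypothetical counts for a heavily imbalanced test set
n_neg, n_pos = 100_000, 100
tp, fp = 80, 1_000
fn = n_pos - tp
tn = n_neg - fp

tpr = tp / (tp + fn)         # recall: 80 of 100 positives found
fpr = fp / (fp + tn)         # false alarms relative to all negatives
precision = tp / (tp + fp)   # true positives relative to all alarms

print(f"TPR={tpr:.2f}  FPR={fpr:.2f}  precision={precision:.3f}")
```

On a ROC plot this operating point sits at (0.01, 0.80), close to the ideal top-left corner, yet fewer than 8 in 100 alarms are real, which is what the PR curve exposes.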
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# ROC Curve
fpr, tpr, _ = roc_curve(y_test, y_prob)
axes[0].plot(fpr, tpr, label=f'ROC-AUC = {roc_auc:.3f}')
axes[0].plot([0, 1], [0, 1], 'k--', alpha=0.5)
axes[0].set_xlabel('False Positive Rate')
axes[0].set_ylabel('True Positive Rate')
axes[0].set_title('ROC Curve')
axes[0].legend()
# Precision-Recall Curve
precision, recall, _ = precision_recall_curve(y_test, y_prob)
axes[1].plot(recall, precision, label=f'AP = {ap:.3f}')
axes[1].set_xlabel('Recall')
axes[1].set_ylabel('Precision')
axes[1].set_title('Precision-Recall Curve')
axes[1].legend()
plt.tight_layout()
plt.savefig('evaluation_curves.png', dpi=150)
plt.show()
Complete Pipeline: Resampling + Evaluation
SMOTE Pipeline with Cross-Validation
from imblearn.pipeline import Pipeline as ImbPipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score, StratifiedKFold
# Build pipeline with SMOTE + classifier
pipeline = ImbPipeline([
    ('scaler', StandardScaler()),
    ('smote', SMOTE(random_state=42)),
    ('classifier', RandomForestClassifier(
        n_estimators=100, random_state=42
    ))
])
# Cross-validate with StratifiedKFold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(
    pipeline, X, y, cv=cv,
    scoring='average_precision', n_jobs=-1
)
print(f"Average Precision: {scores.mean():.4f} ± {scores.std():.4f}")
# Compare multiple strategies; each entry lists the steps that follow scaling
strategies = {
    'Baseline': [('clf', LogisticRegression(random_state=42))],
    'Class Weight': [
        ('clf', LogisticRegression(class_weight='balanced', random_state=42))
    ],
    'SMOTE': [
        ('smote', SMOTE(random_state=42)),
        ('clf', LogisticRegression(random_state=42))
    ],
}
print("\nStrategy Comparison:")
for name, steps in strategies.items():
    # Every strategy shares the same scaler so the comparison stays fair
    pipe = ImbPipeline([('scaler', StandardScaler())] + steps)
    scores = cross_val_score(
        pipe, X, y, cv=cv,
        scoring='average_precision'
    )
    print(f"{name:>15}: {scores.mean():.4f} ± {scores.std():.4f}")
Always apply SMOTE (or any resampling) inside the cross-validation loop, not before. If you resample before splitting, synthetic samples may leak into the validation set, giving overly optimistic estimates. The imblearn.pipeline.Pipeline ensures resampling happens only on the training folds.
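Written out by hand, "resampling inside the loop" looks roughly like the sketch below. To keep it dependency-light, plain random oversampling in NumPy stands in for SMOTE here (a deliberate substitution), and the dataset and model settings are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score
from sklearn.model_selection import StratifiedKFold

X_cv, y_cv = make_classification(n_samples=1000, n_features=10, n_informative=5,
                                 weights=[0.95, 0.05], random_state=42)
rng = np.random.default_rng(42)
cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

fold_scores, val_pos_rates = [], []
for train_idx, val_idx in cv_demo.split(X_cv, y_cv):
    # Resample the training fold only (random oversampling stands in for SMOTE)
    min_idx = train_idx[y_cv[train_idx] == 1]
    maj_idx = train_idx[y_cv[train_idx] == 0]
    extra = rng.choice(min_idx, size=len(maj_idx) - len(min_idx), replace=True)
    fit_idx = np.concatenate([train_idx, extra])

    clf = LogisticRegression(max_iter=1000).fit(X_cv[fit_idx], y_cv[fit_idx])

    # The validation fold is left untouched, so it keeps the real class ratio
    prob = clf.predict_proba(X_cv[val_idx])[:, 1]
    fold_scores.append(average_precision_score(y_cv[val_idx], prob))
    val_pos_rates.append(y_cv[val_idx].mean())

print(f"Average Precision: {np.mean(fold_scores):.4f}")
print(f"Validation minority share per fold: {np.round(val_pos_rates, 3)}")
```

No validation row ever enters the fitting set, and each validation fold stays as imbalanced as the original data, so the score reflects the conditions the model will face in deployment.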
Decision Framework
Which technique to choose?
Is the dataset very large (>100k samples)?
├─ Yes → Undersampling or class weights (SMOTE is slow)
└─ No
   Is the minority class <50 samples?
   ├─ Yes → SMOTE with caution (k_neighbors must be < number of minority samples)
   └─ No  → SMOTE or SMOTE variants
Priority: Speed       → Class weights
Priority: Performance → SMOTE + ensemble
Priority: Simplicity  → Class weights
Key Takeaways
- Accuracy is misleading for imbalanced datasets; use precision, recall, F1, and AUPRC instead
- SMOTE creates synthetic minority samples by interpolation, which reduces the overfitting that comes from duplicating exact samples in random oversampling
- Class weights adjust the loss function without changing the data: simpler, but may underperform
- Combine techniques: SMOTE + undersampling hybrids can outperform either technique alone
- Always evaluate on a stratified holdout set; resampling should only be applied to training data
Summary: Imbalanced Data (SMOTE, Class Weights)
- Class imbalance causes the accuracy paradox: a trivial majority-class classifier can achieve high accuracy while being useless for the minority class.
- Accuracy is unreliable for imbalanced data; use precision, recall, F1, and Average Precision (AUPRC) instead.
- Random oversampling duplicates minority samples (risk of overfitting); random undersampling discards majority samples (risk of information loss).
- SMOTE generates synthetic minority samples via interpolation between k-nearest neighbors: $x_{\text{new}} = x_i + \lambda (\hat{x} - x_i)$, with $\lambda \sim U(0,1)$.
- Class weights penalize minority misclassifications more heavily in the loss function: $w_c = \frac{N}{K \cdot N_c}$.
- ADASYN adaptively generates more synthetic samples for harder-to-learn minority instances near the decision boundary.
- Always resample inside cross-validation to prevent data leakage; use imblearn.pipeline.Pipeline.
- PR curves are preferred over ROC curves for imbalanced evaluation because ROC AUC can be misleadingly optimistic.
- Choose class weights for speed and simplicity; choose SMOTE + ensemble for maximum performance on difficult problems.
- Combine strategies (e.g., SMOTE + Tomek links, class weights + undersampling) for robust handling of real-world imbalance.
Practice Exercises
- Credit Card Fraud: Train a Random Forest on an imbalanced fraud dataset. Compare performance with and without SMOTE. Which metric matters most?
- Medical Diagnosis: Build a pipeline that uses BorderlineSMOTE with XGBoost. Tune the k_neighbors parameter.
- Cost-Sensitive Learning: Implement custom loss weights where missing a positive case costs 10x more than a false alarm. Compare with SMOTE.
- Evaluation: Plot both ROC and Precision-Recall curves. Under what conditions does ROC give a misleading picture?