Model Evaluation: ROC, AUC, PR Curves

Classification Metrics Deep Dive

Beyond accuracy, we need metrics that capture different aspects of model performance.

The Confusion Matrix

Definition: Confusion Matrix

A table that summarizes the performance of a classification model by comparing predicted labels against true labels. For binary classification, it consists of four entries: True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN).

CONFUSION MATRIX LAYOUT:

                    Predicted
                    Negative    Positive
Actual  Negative  [  TN    |    FP   ]
        Positive  [  FN    |    TP   ]

Where:
• TP (True Positive):  Correctly predicted positive
• TN (True Negative):  Correctly predicted negative
• FP (False Positive): Incorrectly predicted positive (Type I Error)
• FN (False Negative): Incorrectly predicted negative (Type II Error)
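
To make the four cells concrete, here is a minimal sketch that tallies them by hand from two label lists. The function name `confusion_counts` and the sample labels are illustrative assumptions, not part of the lesson's dataset; labels are assumed to be coded as 0 (negative) and 1 (positive).

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels coded as 0/1."""
    tn = fp = fn = tp = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            tp += 1          # correctly predicted positive
        elif actual == 0 and predicted == 0:
            tn += 1          # correctly predicted negative
        elif actual == 0 and predicted == 1:
            fp += 1          # false alarm (Type I error)
        else:
            fn += 1          # missed positive (Type II error)
    return tn, fp, fn, tp


# Hypothetical labels, chosen only for illustration
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 0, 0, 1, 0, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_counts(y_true, y_pred)
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")  # TN=5 FP=1 FN=1 TP=3
```

The return order (tn, fp, fn, tp) mirrors the row-by-row layout of the matrix above, which is also the order produced by flattening scikit-learn's `confusion_matrix` for binary labels.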

Derived Metrics

Accuracy:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

Here,

  • TP = True Positives
  • TN = True Negatives
  • FP = False Positives (Type I error)
  • FN = False Negatives (Type II error)

Precision (Positive Predictive Value):

$$\text{Precision} = \frac{TP}{TP + FP}$$

Here,

  • TP = True Positives
  • FP = False Positives

Recall (Sensitivity, True Positive Rate):

$$\text{Recall} = \frac{TP}{TP + FN}$$

Here,

  • TP = True Positives
  • FN = False Negatives

Specificity (True Negative Rate):

$$\text{Specificity} = \frac{TN}{TN + FP}$$

Here,

  • TN = True Negatives
  • FP = False Positives

F1 Score (Harmonic Mean):

$$F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2TP}{2TP + FP + FN}$$

The F1 score is the harmonic mean of precision and recall, which penalizes extreme values more than the arithmetic mean. A classifier must have both high precision AND high recall to achieve a high F1 score. The F-beta generalization allows weighting recall more heavily than precision (beta > 1) or vice versa (beta < 1).
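
As a rough illustration of the paragraph above, the sketch below computes F-beta directly from counts using the standard form F_beta = (1 + beta^2) * P * R / (beta^2 * P + R). The helper names and the counts (TP=8, FP=2, FN=8) are hypothetical and chosen so that precision (0.8) and recall (0.5) differ noticeably.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from raw counts (0.0 when undefined)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def f_beta(tp, fp, fn, beta=1.0):
    """F-beta score; beta > 1 favours recall, beta < 1 favours precision."""
    precision, recall = precision_recall(tp, fp, fn)
    denom = beta ** 2 * precision + recall
    if denom == 0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / denom


# Hypothetical counts: precision = 0.8, recall = 0.5
tp, fp, fn = 8, 2, 8
p, r = precision_recall(tp, fp, fn)

print(f"Arithmetic mean: {(p + r) / 2:.4f}")
print(f"F1   (harmonic): {f_beta(tp, fp, fn):.4f}")
print(f"F2   (recall-weighted):    {f_beta(tp, fp, fn, beta=2):.4f}")
print(f"F0.5 (precision-weighted): {f_beta(tp, fp, fn, beta=0.5):.4f}")
```

With these counts F1 falls below the arithmetic mean of 0.65, F2 moves toward the lower recall value, and F0.5 moves toward the higher precision value.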

Complete Metrics Implementation

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_predict
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.metrics import (
    confusion_matrix, classification_report, 
    precision_score, recall_score, f1_score, accuracy_score
)
import warnings
warnings.filterwarnings('ignore')

np.random.seed(42)

print("=" * 70)
print("CLASSIFICATION METRICS DEEP DIVE")
print("=" * 70)

# Generate imbalanced dataset
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=10,
    n_classes=2, weights=[0.7, 0.3],
    random_state=42
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training set: {X_train.shape[0]} samples")
print(f"Test set: {X_test.shape[0]} samples")
print(f"Class distribution (train): {np.bincount(y_train)}")
print(f"Class distribution (test): {np.bincount(y_test)}")

# Train model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Get predictions
y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)[:, 1]

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)

print("\nConfusion Matrix:")
print(cm)
print(f"\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=['Negative', 'Positive']))

ROC Curve and AUC

The Receiver Operating Characteristic (ROC) curve plots True Positive Rate vs False Positive Rate at different thresholds.

Definition: ROC Curve

A graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied. It is created by plotting the True Positive Rate (TPR) against the False Positive Rate (FPR) at various threshold settings.

Mathematical Definition

True Positive Rate (Recall/Sensitivity):

$$TPR = \frac{TP}{TP + FN}$$

Here,

  • TP = True Positives
  • FN = False Negatives

False Positive Rate (1 - Specificity):

$$FPR = \frac{FP}{FP + TN}$$

Here,

  • FP = False Positives
  • TN = True Negatives

AUC (Area Under Curve):

$$AUC = \int_0^1 TPR(FPR^{-1}(t)) \, dt = P(\hat{f}(x^+) > \hat{f}(x^-))$$

Theorem: AUC Probabilistic Interpretation

The AUC of a classifier is equivalent to the probability that a randomly chosen positive instance is ranked higher (has a higher predicted probability) than a randomly chosen negative instance. That is, $AUC = P(\hat{f}(x^+) > \hat{f}(x^-))$, where $x^+$ and $x^-$ are drawn from the positive and negative classes respectively.
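
The equivalence can be checked numerically on a small example. The sketch below is an illustration under stated assumptions, not a reference implementation: it computes AUC once by counting positive/negative pairs (a tie is counted as one half, the usual convention) and once by trapezoidal integration of the ROC curve. The function names and the eight scores are hypothetical.

```python
def auc_pairwise(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def auc_trapezoid(y_true, scores):
    """Area under the ROC curve, sweeping the threshold from high to low."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    fpr_prev = tpr_prev = 0.0
    area = 0.0
    # One ROC point per distinct score, so tied samples enter together
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 0)
        tpr, fpr = tp / n_pos, fp / n_neg
        area += (fpr - fpr_prev) * (tpr + tpr_prev) / 2
        fpr_prev, tpr_prev = fpr, tpr
    return area


# Hypothetical scores, including one tie at 0.6 between a positive and a negative
y_true = [1, 1, 0, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.6, 0.4, 0.3, 0.2]

print(f"Pairwise AUC:    {auc_pairwise(y_true, scores):.5f}")   # 0.71875
print(f"Trapezoidal AUC: {auc_trapezoid(y_true, scores):.5f}")  # 0.71875
```

The pairwise version is quadratic in the number of samples, so it is useful for building intuition rather than for large test sets.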

Visual Representation

ROC CURVE INTERPRETATION:

TPR ↑
1.0 │         ╭─────────────── Perfect Classifier
    │        ╱
    │       ╱   AUC = 1.0
0.8 │      ╱
    │     ╱    ● Good Classifier
    │    ╱     AUC = 0.9
0.6 │   ╱
    │  ╱      ○ Random Classifier
    │ ╱       AUC = 0.5
0.4 │╱
    │         × Worst Classifier
    │         AUC = 0.0
0.2 │
    │
0.0 └─────────────────────→ FPR
    0.0  0.2  0.4  0.6  0.8  1.0

Interpretation:
• AUC = 0.5: Random guessing (diagonal)
• AUC > 0.5: Better than random
• AUC = 1.0: Perfect classifier
• AUC < 0.5: Worse than random (inverted predictions)

AUC is threshold-independent: it evaluates the classifier across all possible thresholds. This makes it useful for comparing models without committing to a specific operating point. However, ROC AUC can look overly optimistic on highly imbalanced datasets, because FPR is computed over the large pool of negatives and stays small even when false positives outnumber true positives. In such cases, use PR AUC (Average Precision) instead.

Complete ROC Implementation

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
import warnings
warnings.filterwarnings('ignore')

np.random.seed(42)

print("\n" + "=" * 70)
print("ROC CURVE AND AUC")
print("=" * 70)

# Generate dataset
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=10,
    n_classes=2, weights=[0.7, 0.3], random_state=42
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Train classifier
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# Get predicted probabilities
y_proba = clf.predict_proba(X_test)[:, 1]

# Compute ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_proba)
roc_auc = auc(fpr, tpr)

print(f"ROC AUC: {roc_auc:.4f}")

# Visualize ROC Curve
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Plot 1: ROC Curve
ax1 = axes[0]
ax1.plot(fpr, tpr, 'b-', linewidth=2, label=f'ROC Curve (AUC = {roc_auc:.4f})')
ax1.plot([0, 1], [0, 1], 'k--', linewidth=1, label='Random Classifier (AUC = 0.5)')
ax1.fill_between(fpr, tpr, alpha=0.2)
ax1.set_xlabel('False Positive Rate (1 - Specificity)')
ax1.set_ylabel('True Positive Rate (Recall)')
ax1.set_title('ROC Curve')
ax1.legend(loc='lower right')
ax1.grid(True, alpha=0.3)
ax1.set_xlim([0, 1])
ax1.set_ylim([0, 1.05])

# Mark optimal threshold (Youden's J statistic)
J = tpr - fpr
optimal_idx = np.argmax(J)
optimal_threshold = thresholds[optimal_idx]
ax1.scatter(fpr[optimal_idx], tpr[optimal_idx], c='red', marker='o', s=100, 
            label=f'Optimal Threshold = {optimal_threshold:.3f}')
ax1.legend(loc='lower right')

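# Plot 2: Threshold vs TPR/FPR
# Skip index 0: roc_curve prepends a sentinel threshold (np.inf in recent
# scikit-learn releases, max score + 1 in older ones) that no sample reaches.
ax2 = axes[1]
ax2.plot(thresholds[1:], tpr[1:], 'b-', linewidth=2, label='TPR (Recall)')
ax2.plot(thresholds[1:], fpr[1:], 'r-', linewidth=2, label='FPR')
ax2.plot(thresholds[1:], tpr[1:] - fpr[1:], 'g--', linewidth=2, label="Youden's J")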
ax2.axvline(x=optimal_threshold, color='gray', linestyle='--', alpha=0.5)
ax2.set_xlabel('Threshold')
ax2.set_ylabel('Score')
ax2.set_title('TPR and FPR vs Threshold')
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('roc_curve.png', dpi=150, bbox_inches='tight')
plt.show()

Comparing Multiple Classifiers with ROC

# Compare multiple classifiers
classifiers = {
    'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, random_state=42),
    'SVM (probability)': SVC(probability=True, random_state=42)
}

fig, ax = plt.subplots(figsize=(10, 8))
colors = ['blue', 'red', 'green', 'orange']

for (name, clf), color in zip(classifiers.items(), colors):
    clf.fit(X_train, y_train)
    y_proba = clf.predict_proba(X_test)[:, 1]
    fpr, tpr, _ = roc_curve(y_test, y_proba)
    roc_auc = auc(fpr, tpr)
    ax.plot(fpr, tpr, color=color, linewidth=2, 
            label=f'{name} (AUC = {roc_auc:.4f})')
    print(f"{name:25s} AUC: {roc_auc:.4f}")

ax.plot([0, 1], [0, 1], 'k--', linewidth=1, label='Random (AUC = 0.5)')
ax.set_xlabel('False Positive Rate', fontsize=12)
ax.set_ylabel('True Positive Rate', fontsize=12)
ax.set_title('ROC Curves for Multiple Classifiers', fontsize=14)
ax.legend(loc='lower right', fontsize=10)
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('roc_comparison.png', dpi=150, bbox_inches='tight')
plt.show()

Precision-Recall Curves

PR curves are especially useful for imbalanced datasets.

Definition: Precision-Recall Curve

A plot of precision (y-axis) versus recall (x-axis) for different threshold settings. Unlike ROC curves, PR curves focus on the positive (minority) class and are more informative when classes are imbalanced.

When to Use PR Curves

ROC vs PR CURVES:

ROC CURVE:                           PR CURVE:
• Uses FPR and TPR                   • Uses Precision and Recall
• Can mislead under class imbalance  • Better for imbalanced data
• Good when classes are balanced     • Good when positive class is rare
• Shows overall performance          • Focuses on positive class

Example: Fraud Detection (1% fraud)
• ROC AUC might be 0.95 (looks good)
• PR AUC might be 0.30 (reality check!)

USE PR CURVES WHEN:
• Positive class is rare (< 20%)
• Cost of false positives is high
• You care more about positive predictions
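
The gap between the two views in the fraud example follows from simple arithmetic. As a hedged sketch, the helper below (a hypothetical name, `precision_from_rates`) converts a fixed ROC operating point into the precision it implies at a given prevalence. The operating point (TPR = 0.90, FPR = 0.05) is an assumed value for illustration, not a result from the lesson's models.

```python
def precision_from_rates(tpr, fpr, prevalence):
    """Precision implied by a ROC operating point at a given positive rate."""
    true_pos = tpr * prevalence            # expected fraction of samples that are TP
    false_pos = fpr * (1 - prevalence)     # expected fraction of samples that are FP
    return true_pos / (true_pos + false_pos)


# Same ROC point, three different class balances
for prevalence in (0.5, 0.1, 0.01):
    prec = precision_from_rates(tpr=0.90, fpr=0.05, prevalence=prevalence)
    print(f"prevalence={prevalence:>5.0%}  precision={prec:.3f}")
```

TPR and FPR are each computed within a single class, so the ROC point does not move as prevalence changes. Precision mixes both classes, which is why it drops from roughly 0.95 at 50% prevalence to roughly 0.15 at 1%.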

Mathematical Definition

Average Precision (AP):

$$AP = \sum_{n} (R_n - R_{n-1}) P_n$$

Here,

  • R_n = Recall at threshold n
  • P_n = Precision at threshold n

PR AUC:

$$PR\text{-}AUC = \int_0^1 P(R^{-1}(r)) \, dr$$

The baseline for a PR curve is the positive class prevalence (e.g., 0.10 for 10% positive cases), not 0.5 as in ROC curves. A model with PR AUC below the baseline is worse than random. The area under the PR curve (Average Precision) provides a single summary statistic, but the shape of the curve reveals trade-offs at different operating points.
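
The AP sum can be evaluated by hand on a toy ranking. The following is a minimal sketch, assuming the step-wise definition given above with one term per distinct score threshold. The function name `average_precision` and the ten scored examples (30% positives) are hypothetical.

```python
def average_precision(y_true, scores):
    """AP = sum_n (R_n - R_{n-1}) * P_n, lowering the threshold step by step."""
    n_pos = sum(y_true)
    ap = 0.0
    recall_prev = 0.0
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 1)
        predicted_pos = sum(1 for s in scores if s >= t)
        precision = tp / predicted_pos
        recall = tp / n_pos
        # Only thresholds that raise recall contribute to the sum
        ap += (recall - recall_prev) * precision
        recall_prev = recall
    return ap


# Hypothetical ranking: positives sit at ranks 1, 3 and 6 out of 10
y_true = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]
scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]

ap = average_precision(y_true, scores)
baseline = sum(y_true) / len(y_true)   # positive class prevalence
print(f"Average Precision: {ap:.4f}")       # 13/18 = 0.7222
print(f"Baseline:          {baseline:.4f}")  # 0.3000
```

Each positive contributes the precision at its rank (1/1, 2/3, 3/6) weighted by the recall step of 1/3, giving 13/18. The comparison point is the 0.30 prevalence, not 0.5.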

Complete PR Curve Implementation

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import (
    precision_recall_curve, average_precision_score,
    PrecisionRecallDisplay
)
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
import warnings
warnings.filterwarnings('ignore')

np.random.seed(42)

print("\n" + "=" * 70)
print("PRECISION-RECALL CURVES")
print("=" * 70)

# Generate imbalanced dataset
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=10,
    n_classes=2, weights=[0.9, 0.1],
    random_state=42
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Class distribution: {np.bincount(y)}")
print(f"Positive class ratio: {y.mean():.2%}")

# Train classifier
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# Get probabilities
y_proba = clf.predict_proba(X_test)[:, 1]

# Compute PR curve
precision, recall, thresholds = precision_recall_curve(y_test, y_proba)
avg_precision = average_precision_score(y_test, y_proba)

print(f"\nAverage Precision: {avg_precision:.4f}")

# Visualize PR Curve
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Plot 1: PR Curve
ax1 = axes[0]
ax1.plot(recall, precision, 'b-', linewidth=2, label=f'PR Curve (AP = {avg_precision:.4f})')
ax1.fill_between(recall, precision, alpha=0.2)
ax1.set_xlabel('Recall')
ax1.set_ylabel('Precision')
ax1.set_title('Precision-Recall Curve')
ax1.legend(loc='lower left')
ax1.grid(True, alpha=0.3)

# Baseline
baseline = y_test.mean()
ax1.axhline(y=baseline, color='gray', linestyle='--', alpha=0.5, 
            label=f'Baseline (prevalence = {baseline:.3f})')

# Plot 2: Precision and Recall vs Threshold
ax2 = axes[1]
ax2.plot(thresholds, precision[:-1], 'b-', linewidth=2, label='Precision')
ax2.plot(thresholds, recall[:-1], 'r-', linewidth=2, label='Recall')
ax2.plot(thresholds, 2 * (precision[:-1] * recall[:-1]) / 
         (precision[:-1] + recall[:-1] + 1e-10), 'g--', linewidth=2, label='F1 Score')
ax2.set_xlabel('Threshold')
ax2.set_ylabel('Score')
ax2.set_title('Precision, Recall, and F1 vs Threshold')
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('pr_curve.png', dpi=150, bbox_inches='tight')
plt.show()

Threshold Tuning

Optimal Threshold Selection

Definition: Youden's J Statistic

A threshold selection method that maximizes the sum of sensitivity and specificity minus 1. Because specificity equals 1 - FPR, this is the same as finding the threshold where $J = TPR - FPR$ is maximized, which balances capturing positives against raising false alarms.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, precision_recall_curve, f1_score
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
import warnings
warnings.filterwarnings('ignore')

np.random.seed(42)

print("\n" + "=" * 70)
print("THRESHOLD TUNING")
print("=" * 70)

# Generate dataset
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=10,
    n_classes=2, weights=[0.7, 0.3], random_state=42
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Train model
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
y_proba = clf.predict_proba(X_test)[:, 1]

# Method 1: Youden's J
fpr, tpr, thresholds_roc = roc_curve(y_test, y_proba)
J = tpr - fpr
optimal_idx_youden = np.argmax(J)
optimal_threshold_youden = thresholds_roc[optimal_idx_youden]

# Method 2: Maximize F1 Score
precision, recall, thresholds_pr = precision_recall_curve(y_test, y_proba)
f1_scores = 2 * (precision * recall) / (precision + recall + 1e-10)
optimal_idx_f1 = np.argmax(f1_scores[:-1])
optimal_threshold_f1 = thresholds_pr[optimal_idx_f1]

# Method 3: Equal error rate (FPR = FNR)
fnr = 1 - tpr
equal_error_idx = np.argmin(np.abs(fpr - fnr))
optimal_threshold_eer = thresholds_roc[equal_error_idx]

print(f"Method                  Optimal Threshold")
print("-" * 50)
print(f"Youden's J:             {optimal_threshold_youden:.4f}")
print(f"Max F1:                 {optimal_threshold_f1:.4f}")
print(f"Equal Error Rate:       {optimal_threshold_eer:.4f}")

The choice of threshold depends on the cost structure of your problem. If false negatives are costly (e.g., missing a disease), lower the threshold to increase recall. If false positives are costly (e.g., spam filter blocking legitimate email), raise the threshold to increase precision. Always choose the threshold based on the business context, not just the default 0.5.

Cost-Sensitive Threshold Selection

Total Classification Cost

$$C_{\text{total}} = C_{FP} \cdot FP + C_{FN} \cdot FN$$

Here,

  • C_FP = Cost of a false positive
  • C_FN = Cost of a false negative
  • FP = Number of false positives
  • FN = Number of false negatives

from sklearn.metrics import confusion_matrix

def calculate_cost(y_true, y_pred, cost_fp=1, cost_fn=10):
    """Calculate total cost given predictions."""
    # labels=[0, 1] keeps the matrix 2x2 even if only one class is predicted
    cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
    tn, fp, fn, tp = cm.ravel()
    return fp * cost_fp + fn * cost_fn

thresholds_to_try = np.arange(0.1, 0.9, 0.05)
costs = []

print(f"\n{'Threshold':<12} {'FP Cost':<10} {'FN Cost':<10} {'Total Cost':<12} {'Accuracy'}")
print("-" * 60)

for threshold in thresholds_to_try:
    y_pred = (y_proba >= threshold).astype(int)
    cost = calculate_cost(y_test, y_pred, cost_fp=1, cost_fn=10)
    accuracy = (y_test == y_pred).mean()
    costs.append(cost)

    # np.arange accumulates floating-point error (e.g. 0.30000000000000004),
    # so compare with np.isclose instead of an exact `in` membership test
    if np.isclose(threshold, [0.3, 0.5, 0.7, optimal_threshold_youden]).any():
        cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
        fp_cost = cm[0, 1] * 1
        fn_cost = cm[1, 0] * 10
        print(f"{threshold:<12.2f} {fp_cost:<10} {fn_cost:<10} {cost:<12} {accuracy:.4f}")

optimal_cost_threshold = thresholds_to_try[np.argmin(costs)]
print(f"\nOptimal threshold for cost sensitivity: {optimal_cost_threshold:.4f}")

Multi-Class Evaluation Strategies

One-vs-Rest and One-vs-One

MULTI-CLASS EVALUATION STRATEGIES:

Given 3 classes: A, B, C

ONE-vs-REST (OvR):
• Train K binary classifiers
• A vs (B+C), B vs (A+C), C vs (A+B)
• Combine predictions

ONE-vs-ONE (OvO):
• Train K(K-1)/2 binary classifiers
• A vs B, A vs C, B vs C
• Majority voting

MACRO vs MICRO vs WEIGHTED:
• Macro: Average metric across classes (treats all classes equally)
• Micro: Aggregate TP, FP, FN across classes (biased toward majority)
• Weighted: Weight by class frequency

Definition: Macro vs Micro Averaging

Macro averaging computes the metric independently for each class and takes the average, treating all classes equally regardless of size. Micro averaging aggregates the contributions of all classes to compute the average metric, which is dominated by the majority class. Weighted averaging weights each class's metric by its support (number of true instances).
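
The difference between the three schemes shows up clearly on a small set of per-class counts. The sketch below uses precision as the metric and hypothetical one-vs-rest counts for classes A, B and C, with support taken as TP + FN. The function name and all numbers are illustrative assumptions.

```python
def averaged_precision(counts):
    """counts maps class -> (tp, fp, fn), each counted one-vs-rest."""
    per_class = {c: tp / (tp + fp) for c, (tp, fp, fn) in counts.items()}

    # Macro: unweighted mean of the per-class values
    macro = sum(per_class.values()) / len(per_class)

    # Micro: pool the counts first, then compute a single ratio
    total_tp = sum(tp for tp, fp, fn in counts.values())
    total_fp = sum(fp for tp, fp, fn in counts.values())
    micro = total_tp / (total_tp + total_fp)

    # Weighted: per-class values weighted by support (true instances)
    support = {c: tp + fn for c, (tp, fp, fn) in counts.items()}
    weighted = sum(per_class[c] * support[c] for c in counts) / sum(support.values())

    return per_class, macro, micro, weighted


# Hypothetical counts: class A dominates, B and C are small and harder
counts = {"A": (90, 10, 5), "B": (5, 5, 10), "C": (2, 8, 8)}

per_class, macro, micro, weighted = averaged_precision(counts)
print(f"Per-class: {per_class}")
print(f"Macro:     {macro:.4f}")
print(f"Micro:     {micro:.4f}")
print(f"Weighted:  {weighted:.4f}")
```

Macro precision lands near 0.53 because the weak minority classes count as much as class A, while micro (about 0.81) and weighted (about 0.79) are pulled toward the majority class.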

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    roc_curve, auc, precision_recall_curve, average_precision_score,
    roc_auc_score
)
from sklearn.preprocessing import label_binarize
from itertools import cycle
import warnings
warnings.filterwarnings('ignore')

np.random.seed(42)

print("\n" + "=" * 70)
print("MULTI-CLASS EVALUATION")
print("=" * 70)

# Load digits dataset (10 classes)
digits = load_digits()
X, y = digits.data, digits.target

# Use only first 5 classes for clarity
X = X[y < 5]
y = y[y < 5]

n_classes = len(np.unique(y))
print(f"Dataset: {X.shape}")
print(f"Number of classes: {n_classes}")

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Binarize labels for ROC
y_test_bin = label_binarize(y_test, classes=range(n_classes))

# Train classifier
clf = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
clf.fit(X_train, y_train)
y_proba = clf.predict_proba(X_test)

# Calculate different AUC averaging methods
macro_auc = roc_auc_score(y_test_bin, y_proba, average='macro')
micro_auc = roc_auc_score(y_test_bin, y_proba, average='micro')
weighted_auc = roc_auc_score(y_test_bin, y_proba, average='weighted')

print(f"\nMacro AUC:    {macro_auc:.4f} (treats all classes equally)")
print(f"Micro AUC:    {micro_auc:.4f} (aggregates all predictions)")
print(f"Weighted AUC: {weighted_auc:.4f} (weights by class frequency)")

Key Takeaways

  1. Confusion Matrix is the foundation for all metrics
  2. Precision measures how many positive predictions are correct
  3. Recall measures how many actual positives are captured
  4. F1 Score balances precision and recall
  5. ROC Curve shows performance across all thresholds
  6. AUC provides a single number summary (threshold-independent)
  7. PR Curves are better for imbalanced datasets
  8. Threshold tuning is crucial for real-world applications
  9. Multi-class evaluation requires macro/micro/weighted averaging
  10. Always consider costs when choosing thresholds

Summary Table

Metric      Formula        Best For           Interpretation
Accuracy    (TP+TN)/N      Balanced data      Overall correctness
Precision   TP/(TP+FP)     High FP cost       Positive prediction quality
Recall      TP/(TP+FN)     High FN cost       Capture all positives
F1          2·P·R/(P+R)    Balance P&R        Harmonic mean
AUC-ROC     ∫TPR dFPR      Balanced data      Threshold-independent
AP          ∫P dR          Imbalanced data    Positive class focus

When to Use What

DECISION GUIDE:

Is your data balanced?
├── Yes → Use AUC-ROC
└── No → Use PR AUC (Average Precision)

What matters more?
├── Catch all positives (e.g., disease) → Optimize Recall
├── Don't raise false alarms (e.g., spam) → Optimize Precision
└── Balance both → Optimize F1 or F-beta

Multi-class?
├── All classes equally important → Macro average
├── Care more about majority → Micro average
└── Weighted by importance → Weighted average

Summary: Model Evaluation (ROC, AUC, PR Curves)

  1. The confusion matrix (TP, TN, FP, FN) is the foundation from which all classification metrics are derived.
  2. Accuracy = (TP+TN)/N is misleading for imbalanced data; prefer precision, recall, and F1.
  3. The F1 score is the harmonic mean of precision and recall; it penalizes extreme imbalances between the two.
  4. The ROC curve plots TPR vs FPR across thresholds; AUC gives a threshold-independent summary. AUC = P(positive ranked higher than negative).
  5. PR curves are more informative than ROC for imbalanced datasets because they focus on the positive class. The baseline is the positive class prevalence, not 0.5.
  6. Average Precision (AP) is the area under the PR curve and provides a single-number summary for imbalanced evaluation.
  7. Threshold selection should be driven by the cost structure: Youden's J for balanced costs, cost-sensitive optimization for asymmetric costs.
  8. For multi-class problems, use macro (equal class weight), micro (aggregate all), or weighted (frequency-based) averaging of AUC.
  9. Always report confidence intervals (via bootstrap) for AUC to understand uncertainty in your evaluation.
  10. Match your metric to your problem: ROC-AUC for balanced data, PR-AUC for imbalanced data, cost-sensitive metrics when error types have different consequences.
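
Point 9 mentions bootstrap confidence intervals without showing one. Below is a minimal sketch of a percentile bootstrap for AUC, under several assumptions: the pairwise AUC helper, the function name `bootstrap_auc_ci`, the synthetic Gaussian scores, and the choice of 500 resamples are all illustrative, and other interval constructions exist.

```python
import random


def auc_pairwise(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc_ci(y_true, scores, n_boot=500, alpha=0.05, seed=0):
    """Percentile bootstrap interval for AUC (resampling test examples)."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]   # sample with replacement
        ys = [y_true[i] for i in idx]
        if 0 < sum(ys) < n:                          # AUC needs both classes
            stats.append(auc_pairwise(ys, [scores[i] for i in idx]))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


# Synthetic scores (assumption): negatives ~ N(0, 1), positives ~ N(1, 1)
rng = random.Random(42)
y_true = [0] * 40 + [1] * 20
scores = [rng.gauss(0.0, 1.0) for _ in range(40)] + \
         [rng.gauss(1.0, 1.0) for _ in range(20)]

point = auc_pairwise(y_true, scores)
lo, hi = bootstrap_auc_ci(y_true, scores)
print(f"AUC = {point:.3f}, 95% bootstrap CI = [{lo:.3f}, {hi:.3f}]")
```

With only 60 test examples the interval is typically wide, which is a reminder that small differences in AUC between models may not be meaningful.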

Practice Exercises

Exercise 1: Medical Diagnosis

"""
Build a medical diagnosis system:
1. Generate imbalanced dataset (5% disease prevalence)
2. Train multiple classifiers
3. Compare ROC and PR curves
4. Choose threshold minimizing FN (missing disease)
5. Calculate cost of different error types
"""

# Your code here

Exercise 2: Anomaly Detection

"""
Anomaly detection evaluation:
1. Generate data with 1% anomalies
2. Train anomaly detector
3. Evaluate with PR curve (not ROC!)
4. Find threshold with 95% recall
5. Report precision at that threshold
"""

# Your code here

Exercise 3: Multi-class Comparison

"""
Compare multi-class evaluation strategies:
1. Use Iris or Digits dataset
2. Train 3 classifiers
3. Calculate macro, micro, weighted AUC
4. Analyze per-class performance
5. Identify worst-performing classes
"""

# Your code here

Exercise 4: Cost-Sensitive Learning

"""
Implement cost-sensitive evaluation:
1. Define cost matrix for different errors
2. Train standard classifier
3. Find cost-optimal threshold
4. Compare with default threshold (0.5)
5. Visualize cost landscape
"""

# Your code here

Congratulations! You've completed Module 2: Machine Learning. You now have a solid foundation in:

  • Cross-validation and model evaluation
  • Hyperparameter tuning strategies
  • Unsupervised learning (clustering)
  • Dimensionality reduction (PCA)
  • Comprehensive model evaluation metrics

Next: We'll explore Deep Learning fundamentals in Module 3!
