Model Evaluation: ROC, AUC, PR Curves

Classification Metrics Deep Dive

Beyond accuracy, we need metrics that capture different aspects of model performance.

The Confusion Matrix

Definition: Confusion Matrix

A table that summarizes the performance of a classification model by comparing predicted labels against true labels. For binary classification, it consists of four entries: True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN).

CONFUSION MATRIX LAYOUT:

                    Predicted
                    Negative    Positive
Actual  Negative  [  TN    |    FP   ]
        Positive  [  FN    |    TP   ]

Where:
• TP (True Positive):  Correctly predicted positive
• TN (True Negative):  Correctly predicted negative
• FP (False Positive): Incorrectly predicted positive (Type I Error)
• FN (False Negative): Incorrectly predicted negative (Type II Error)
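As a minimal sketch of how the four cells map onto code, the example below runs scikit-learn's `confusion_matrix` on a small hand-made label vector (the labels are invented for illustration). For binary labels 0/1, scikit-learn puts actual classes on the rows and predicted classes on the columns, so `ravel()` returns the cells in the order TN, FP, FN, TP, matching the layout above.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels, made up for illustration only
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 1, 0, 0, 1, 0, 1, 0, 1, 1])

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()

print(cm)
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")  # TN=4, FP=2, FN=1, TP=3
```

Passing `labels=[0, 1]` explicitly fixes the row and column order, which avoids surprises when one class happens to be absent from a small sample.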

Derived Metrics

Accuracy:

\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}

Here,

TP = True Positives
TN = True Negatives
FP = False Positives (Type I error)
FN = False Negatives (Type II error)

Precision (Positive Predictive Value):

\text{Precision} = \frac{TP}{TP + FP}

Here,

TP = True Positives
FP = False Positives

Recall (Sensitivity, True Positive Rate):

\text{Recall} = \frac{TP}{TP + FN}

Here,

TP = True Positives
FN = False Negatives

Specificity (True Negative Rate):

\text{Specificity} = \frac{TN}{TN + FP}

Here,

TN = True Negatives
FP = False Positives

F1 Score (Harmonic Mean):

F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2TP}{2TP + FP + FN}

The F1 score is the harmonic mean of precision and recall, which penalizes extreme values more than the arithmetic mean. A classifier must have both high precision AND high recall to achieve a high F1 score. The F-beta generalization allows weighting recall more heavily than precision (beta > 1) or vice versa (beta < 1).
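The F-beta generalization is commonly written as F_β = (1 + β²) · P · R / (β² · P + R). The sketch below, on invented toy labels, compares a hand-written version of that formula (the helper name `f_beta` is just for illustration) with scikit-learn's `fbeta_score`. Because recall exceeds precision in this toy example, the score rises as beta grows.

```python
import numpy as np
from sklearn.metrics import fbeta_score, precision_score, recall_score

def f_beta(precision, recall, beta):
    """F-beta from precision and recall; beta > 1 weights recall more heavily."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Toy labels, made up for illustration only
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 1, 0, 0, 1, 0, 1, 0, 1, 1])

p = precision_score(y_true, y_pred)  # 3 / (3 + 2) = 0.60
r = recall_score(y_true, y_pred)     # 3 / (3 + 1) = 0.75

for beta in (0.5, 1.0, 2.0):
    manual = f_beta(p, r, beta)
    library = fbeta_score(y_true, y_pred, beta=beta)
    print(f"beta={beta}: manual={manual:.4f}, sklearn={library:.4f}")
```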

Complete Metrics Implementation

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_predict
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.metrics import (
    confusion_matrix, classification_report, 
    precision_score, recall_score, f1_score, accuracy_score
)
import warnings
warnings.filterwarnings('ignore')

np.random.seed(42)

print("=" * 70)
print("CLASSIFICATION METRICS DEEP DIVE")
print("=" * 70)

# Generate imbalanced dataset
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=10,
    n_classes=2, weights=[0.7, 0.3],
    random_state=42
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training set: {X_train.shape[0]} samples")
print(f"Test set: {X_test.shape[0]} samples")
print(f"Class distribution (train): {np.bincount(y_train)}")
print(f"Class distribution (test): {np.bincount(y_test)}")

# Train model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Get predictions
y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)[:, 1]

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)

print("\nConfusion Matrix:")
print(cm)
print(f"\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=['Negative', 'Positive']))

ROC Curve and AUC

The Receiver Operating Characteristic (ROC) curve plots True Positive Rate vs False Positive Rate at different thresholds.

Definition: ROC Curve

A graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied. It is created by plotting the True Positive Rate (TPR) against the False Positive Rate (FPR) at various threshold settings.

Mathematical Definition

True Positive Rate (Recall/Sensitivity):

TPR = \frac{TP}{TP + FN}

Here,

TP = True Positives
FN = False Negatives

False Positive Rate (1 - Specificity):

FPR = \frac{FP}{FP + TN}

Here,

FP = False Positives
TN = True Negatives

AUC (Area Under Curve):

AUC = \int_0^1 TPR(FPR^{-1}(t)) \, dt = P(\hat{f}(x^+) > \hat{f}(x^-))

Theorem: AUC Probabilistic Interpretation

The AUC of a classifier is equivalent to the probability that a randomly chosen positive instance is ranked higher (has a higher predicted probability) than a randomly chosen negative instance. That is, $AUC = P(\hat{f}(x^+) > \hat{f}(x^-))$ where $x^+$ and $x^-$ are drawn from the positive and negative classes respectively.
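The ranking interpretation can be checked numerically. The sketch below counts, over every (positive, negative) pair, how often the positive receives the higher score, and compares the result with `roc_auc_score`. The helper name `pairwise_auc` and the toy scores are assumptions for illustration. Ties are counted as half a win, which is the usual convention when scores are not continuous.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def pairwise_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    # diff[i, j] = score of positive i minus score of negative j
    diff = pos[:, None] - neg[None, :]
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return wins / (len(pos) * len(neg))

# Toy data, made up for illustration only
y_true = np.array([0, 0, 1, 1, 0, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])

print(f"Pairwise estimate: {pairwise_auc(y_true, scores):.4f}")  # 8 of 9 pairs
print(f"roc_auc_score:     {roc_auc_score(y_true, scores):.4f}")
```

The pairwise count costs O(n_pos · n_neg) memory and time, so it is meant as a check on small data rather than a replacement for the library routine.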

Visual Representation

ROC CURVE INTERPRETATION:

TPR ↑
1.0 │         ╭─────────────── Perfect Classifier
    │        ╱
    │       ╱   AUC = 1.0
0.8 │      ╱
    │     ╱    ● Good Classifier
    │    ╱     AUC = 0.9
0.6 │   ╱
    │  ╱      ○ Random Classifier
    │ ╱       AUC = 0.5
0.4 │╱
    │         × Worst Classifier
    │         AUC = 0.0
0.2 │
    │
0.0 └─────────────────────→ FPR
    0.0  0.2  0.4  0.6  0.8  1.0

Interpretation:
• AUC = 0.5: Random guessing (diagonal)
• AUC > 0.5: Better than random
• AUC = 1.0: Perfect classifier
• AUC < 0.5: Worse than random (inverted predictions)

AUC is threshold-independent — it evaluates the classifier across all possible thresholds. This makes it useful for comparing models without committing to a specific operating point. However, AUC can be misleading for highly imbalanced datasets; in such cases, use PR AUC (Average Precision) instead.

Complete ROC Implementation

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
import warnings
warnings.filterwarnings('ignore')

np.random.seed(42)

print("\n" + "=" * 70)
print("ROC CURVE AND AUC")
print("=" * 70)

# Generate dataset
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=10,
    n_classes=2, weights=[0.7, 0.3], random_state=42
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Train classifier
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# Get predicted probabilities
y_proba = clf.predict_proba(X_test)[:, 1]

# Compute ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_proba)
roc_auc = auc(fpr, tpr)

print(f"ROC AUC: {roc_auc:.4f}")

# Visualize ROC Curve
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Plot 1: ROC Curve
ax1 = axes[0]
ax1.plot(fpr, tpr, 'b-', linewidth=2, label=f'ROC Curve (AUC = {roc_auc:.4f})')
ax1.plot([0, 1], [0, 1], 'k--', linewidth=1, label='Random Classifier (AUC = 0.5)')
ax1.fill_between(fpr, tpr, alpha=0.2)
ax1.set_xlabel('False Positive Rate (1 - Specificity)')
ax1.set_ylabel('True Positive Rate (Recall)')
ax1.set_title('ROC Curve')
ax1.legend(loc='lower right')
ax1.grid(True, alpha=0.3)
ax1.set_xlim([0, 1])
ax1.set_ylim([0, 1.05])

# Mark optimal threshold (Youden's J statistic)
J = tpr - fpr
optimal_idx = np.argmax(J)
optimal_threshold = thresholds[optimal_idx]
ax1.scatter(fpr[optimal_idx], tpr[optimal_idx], c='red', marker='o', s=100, 
            label=f'Optimal Threshold = {optimal_threshold:.3f}')
ax1.legend(loc='lower right')

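# Plot 2: Threshold vs TPR/FPR
# Skip the first threshold: roc_curve prepends an artificial value
# (np.inf in recent scikit-learn, max(score) + 1 in older versions)
ax2 = axes[1]
ax2.plot(thresholds[1:], tpr[1:], 'b-', linewidth=2, label='TPR (Recall)')
ax2.plot(thresholds[1:], fpr[1:], 'r-', linewidth=2, label='FPR')
ax2.plot(thresholds[1:], tpr[1:] - fpr[1:], 'g--', linewidth=2, label="Youden's J")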
ax2.axvline(x=optimal_threshold, color='gray', linestyle='--', alpha=0.5)
ax2.set_xlabel('Threshold')
ax2.set_ylabel('Score')
ax2.set_title('TPR and FPR vs Threshold')
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('roc_curve.png', dpi=150, bbox_inches='tight')
plt.show()

Example: Comparing Multiple Classifiers with ROC

# Compare multiple classifiers
classifiers = {
    'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, random_state=42),
    'SVM (probability)': SVC(probability=True, random_state=42)
}

fig, ax = plt.subplots(figsize=(10, 8))
colors = ['blue', 'red', 'green', 'orange']

for (name, clf), color in zip(classifiers.items(), colors):
    clf.fit(X_train, y_train)
    y_proba = clf.predict_proba(X_test)[:, 1]
    fpr, tpr, _ = roc_curve(y_test, y_proba)
    roc_auc = auc(fpr, tpr)
    ax.plot(fpr, tpr, color=color, linewidth=2, 
            label=f'{name} (AUC = {roc_auc:.4f})')
    print(f"{name:25s} AUC: {roc_auc:.4f}")

ax.plot([0, 1], [0, 1], 'k--', linewidth=1, label='Random (AUC = 0.5)')
ax.set_xlabel('False Positive Rate', fontsize=12)
ax.set_ylabel('True Positive Rate', fontsize=12)
ax.set_title('ROC Curves for Multiple Classifiers', fontsize=14)
ax.legend(loc='lower right', fontsize=10)
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('roc_comparison.png', dpi=150, bbox_inches='tight')
plt.show()

Precision-Recall Curves

PR curves are especially useful for imbalanced datasets.

Definition: Precision-Recall Curve

A plot of precision (y-axis) versus recall (x-axis) for different threshold settings. Unlike ROC curves, PR curves focus on the positive (minority) class and are more informative when classes are imbalanced.

When to Use PR Curves

ROC vs PR CURVES:

ROC CURVE:                           PR CURVE:
• Uses FPR and TPR                   • Uses Precision and Recall
• Optimistic under imbalance         • Better for imbalanced data
• Good when classes are balanced     • Good when positive class is rare
• Shows overall performance          • Focuses on positive class

Example: Fraud Detection (1% fraud)
• ROC AUC might be 0.95 (looks good)
• PR AUC might be 0.30 (reality check!)

USE PR CURVES WHEN:
• Positive class is rare (< 20%)
• Cost of false positives is high
• You care more about positive predictions
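One way to see why ROC and PR can disagree is that TPR and FPR are each computed within a single true class, so a fixed ROC operating point does not change with prevalence, while precision does. By Bayes' rule, precision = TPR · π / (TPR · π + FPR · (1 − π)), where π is the positive-class prevalence. The sketch below evaluates this for a hypothetical operating point (TPR = 0.90, FPR = 0.05, chosen only for illustration).

```python
def precision_from_rates(tpr, fpr, prevalence):
    """Precision implied by a fixed (TPR, FPR) point at a given prevalence."""
    true_pos = tpr * prevalence            # expected share of true positives
    false_pos = fpr * (1.0 - prevalence)   # expected share of false positives
    return true_pos / (true_pos + false_pos)

tpr, fpr = 0.90, 0.05  # hypothetical operating point, same ROC position throughout
for prevalence in (0.50, 0.10, 0.01):
    p = precision_from_rates(tpr, fpr, prevalence)
    print(f"prevalence={prevalence:.2f} -> precision={p:.3f}")
# prevalence=0.50 -> precision=0.947
# prevalence=0.10 -> precision=0.667
# prevalence=0.01 -> precision=0.154
```

The ROC point is identical in all three rows, yet at 1% prevalence roughly five out of six flagged cases are false alarms, which is the kind of gap the fraud example describes.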

Mathematical Definition

Average Precision (AP):

AP = \sum_{n} (R_n - R_{n-1}) P_n

Here,

R_n = Recall at threshold n
P_n = Precision at threshold n

PR AUC:

PR\text{-}AUC = \int_0^1 P(R^{-1}(r)) \, dr

The baseline for a PR curve is the positive class prevalence (e.g., 0.10 for 10% positive cases), not 0.5 as in ROC curves. A model with PR AUC below the baseline is worse than random. The area under the PR curve (Average Precision) provides a single summary statistic, but the shape of the curve reveals trade-offs at different operating points.
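The AP sum can be reproduced directly from the output of `precision_recall_curve`. The sketch below, on invented toy scores, applies the formula by hand and compares it with `average_precision_score`, then prints the prevalence baseline. Note that this is a step-wise sum, so it can differ slightly from a trapezoidal area computed over the same points.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, average_precision_score

# Toy data, made up for illustration only (3 positives out of 8)
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 0])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.5, 0.05])

precision, recall, _ = precision_recall_curve(y_true, scores)

# recall is returned in decreasing order, so np.diff(recall) is <= 0;
# the leading minus sign turns the sum into (R_n - R_{n-1}) * P_n
ap_manual = -np.sum(np.diff(recall) * precision[:-1])
ap_sklearn = average_precision_score(y_true, scores)
baseline = y_true.mean()

print(f"AP (manual):  {ap_manual:.4f}")
print(f"AP (sklearn): {ap_sklearn:.4f}")
print(f"Baseline (prevalence): {baseline:.4f}")
```

Here the positives sit at ranks 1, 2 and 5, so AP = (1/1 + 2/2 + 3/5) / 3 ≈ 0.867, well above the 0.375 baseline.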

Complete PR Curve Implementation

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import (
    precision_recall_curve, average_precision_score,
    PrecisionRecallDisplay
)
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
import warnings
warnings.filterwarnings('ignore')

np.random.seed(42)

print("\n" + "=" * 70)
print("PRECISION-RECALL CURVES")
print("=" * 70)

# Generate imbalanced dataset
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=10,
    n_classes=2, weights=[0.9, 0.1],
    random_state=42
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Class distribution: {np.bincount(y)}")
print(f"Positive class ratio: {y.mean():.2%}")

# Train classifier
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# Get probabilities
y_proba = clf.predict_proba(X_test)[:, 1]

# Compute PR curve
precision, recall, thresholds = precision_recall_curve(y_test, y_proba)
avg_precision = average_precision_score(y_test, y_proba)

print(f"\nAverage Precision: {avg_precision:.4f}")

# Visualize PR Curve
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Plot 1: PR Curve
ax1 = axes[0]
ax1.plot(recall, precision, 'b-', linewidth=2, label=f'PR Curve (AP = {avg_precision:.4f})')
ax1.fill_between(recall, precision, alpha=0.2)
ax1.set_xlabel('Recall')
ax1.set_ylabel('Precision')
ax1.set_title('Precision-Recall Curve')
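ax1.grid(True, alpha=0.3)

# Baseline: a random classifier's precision equals the positive-class prevalence
baseline = y_test.mean()
ax1.axhline(y=baseline, color='gray', linestyle='--', alpha=0.5, 
            label=f'Baseline (prevalence = {baseline:.3f})')
# Build the legend after all labelled artists exist so the baseline is listed
ax1.legend(loc='lower left')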

# Plot 2: Precision and Recall vs Threshold
ax2 = axes[1]
ax2.plot(thresholds, precision[:-1], 'b-', linewidth=2, label='Precision')
ax2.plot(thresholds, recall[:-1], 'r-', linewidth=2, label='Recall')
ax2.plot(thresholds, 2 * (precision[:-1] * recall[:-1]) / 
         (precision[:-1] + recall[:-1] + 1e-10), 'g--', linewidth=2, label='F1 Score')
ax2.set_xlabel('Threshold')
ax2.set_ylabel('Score')
ax2.set_title('Precision, Recall, and F1 vs Threshold')
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('pr_curve.png', dpi=150, bbox_inches='tight')
plt.show()

Threshold Tuning

Optimal Threshold Selection

Definition: Youden's J Statistic

A threshold selection method that maximizes the sum of sensitivity and specificity minus 1. It finds the threshold where $J = TPR - FPR$ is maximized, providing a balance between capturing positives and avoiding false alarms.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, precision_recall_curve, f1_score
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
import warnings
warnings.filterwarnings('ignore')

np.random.seed(42)

print("\n" + "=" * 70)
print("THRESHOLD TUNING")
print("=" * 70)

# Generate dataset
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=10,
    n_classes=2, weights=[0.7, 0.3], random_state=42
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Train model
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
y_proba = clf.predict_proba(X_test)[:, 1]

# Method 1: Youden's J
fpr, tpr, thresholds_roc = roc_curve(y_test, y_proba)
J = tpr - fpr
optimal_idx_youden = np.argmax(J)
optimal_threshold_youden = thresholds_roc[optimal_idx_youden]

# Method 2: Maximize F1 Score
precision, recall, thresholds_pr = precision_recall_curve(y_test, y_proba)
f1_scores = 2 * (precision * recall) / (precision + recall + 1e-10)
optimal_idx_f1 = np.argmax(f1_scores[:-1])
optimal_threshold_f1 = thresholds_pr[optimal_idx_f1]

# Method 3: Equal error rate (FPR = FNR)
fnr = 1 - tpr
equal_error_idx = np.argmin(np.abs(fpr - fnr))
optimal_threshold_eer = thresholds_roc[equal_error_idx]

print(f"Method                  Optimal Threshold")
print("-" * 50)
print(f"Youden's J:             {optimal_threshold_youden:.4f}")
print(f"Max F1:                 {optimal_threshold_f1:.4f}")
print(f"Equal Error Rate:       {optimal_threshold_eer:.4f}")

The choice of threshold depends on the cost structure of your problem. If false negatives are costly (e.g., missing a disease), lower the threshold to increase recall. If false positives are costly (e.g., spam filter blocking legitimate email), raise the threshold to increase precision. Always choose the threshold based on the business context, not just the default 0.5.
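When the requirement is stated as a minimum recall (for example, "miss no more than 5% of disease cases"), one possible approach is to take the highest threshold that still meets the target and then read off the precision it implies. The helper `threshold_for_recall` below is a hypothetical sketch of that idea on invented toy scores. In practice the threshold would normally be chosen on a validation split rather than the test set, so that the reported test metrics are not optimistic.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def threshold_for_recall(y_true, scores, target_recall):
    """Highest threshold whose recall is at least target_recall (hypothetical helper)."""
    precision, recall, thresholds = precision_recall_curve(y_true, scores)
    # precision and recall have one more entry than thresholds;
    # drop the final point (recall = 0), which has no threshold
    ok = np.where(recall[:-1] >= target_recall)[0]
    # thresholds are increasing and recall is non-increasing,
    # so the last qualifying index is the highest usable threshold
    best = ok[-1]
    return thresholds[best], precision[best], recall[best]

# Toy data, made up for illustration only
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 0])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.5, 0.05])

for target in (1.0, 0.6):
    t, p, r = threshold_for_recall(y_true, scores, target)
    print(f"target recall {target:.2f}: threshold={t:.2f}, precision={p:.2f}, recall={r:.2f}")
```

In this toy example, demanding full recall forces the threshold down to 0.35 and precision drops to 0.60, while relaxing the target to 0.6 allows a threshold of 0.70 with no false positives.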

Cost-Sensitive Threshold Selection

Total Classification Cost:

C_{\text{total}} = C_{FP} \cdot FP + C_{FN} \cdot FN

Here,

C_{FP} = Cost of a false positive
C_{FN} = Cost of a false negative
FP = Number of false positives
FN = Number of false negatives

from sklearn.metrics import confusion_matrix

def calculate_cost(y_true, y_pred, cost_fp=1, cost_fn=10):
    """Calculate total cost given predictions."""
    # labels=[0, 1] keeps the matrix 2x2 even if only one class is predicted
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return fp * cost_fp + fn * cost_fn

# Continues from the threshold-tuning code above
# (uses np, y_test, y_proba, and optimal_threshold_youden)
thresholds_to_try = np.arange(0.1, 0.9, 0.05)
thresholds_to_report = [0.3, 0.5, 0.7, optimal_threshold_youden]
costs = []

print(f"\n{'Threshold':<12} {'FP Cost':<10} {'FN Cost':<10} {'Total Cost':<12} {'Accuracy'}")
print("-" * 60)

for threshold in thresholds_to_try:
    y_pred = (y_proba >= threshold).astype(int)
    cost = calculate_cost(y_test, y_pred, cost_fp=1, cost_fn=10)
    accuracy = (y_test == y_pred).mean()
    costs.append(cost)

    # np.arange accumulates floating-point error (e.g. 0.30000000000000004),
    # so compare with a tolerance instead of using `in`
    if np.isclose(threshold, thresholds_to_report).any():
        tn, fp, fn, tp = confusion_matrix(y_test, y_pred, labels=[0, 1]).ravel()
        print(f"{threshold:<12.2f} {fp * 1:<10} {fn * 10:<10} {cost:<12} {accuracy:.4f}")

optimal_cost_threshold = thresholds_to_try[np.argmin(costs)]
print(f"\nOptimal threshold for cost sensitivity: {optimal_cost_threshold:.4f}")

Multi-Class Evaluation Strategies

One-vs-Rest and One-vs-One

MULTI-CLASS EVALUATION STRATEGIES:

Given 3 classes: A, B, C

ONE-vs-REST (OvR):
• Train K binary classifiers
• A vs (B+C), B vs (A+C), C vs (A+B)
• Combine predictions

ONE-vs-ONE (OvO):
• Train K(K-1)/2 binary classifiers
• A vs B, A vs C, B vs C
• Majority voting

MACRO vs MICRO vs WEIGHTED:
• Macro: Average metric across classes (treats all classes equally)
• Micro: Aggregate TP, FP, FN across classes (biased toward majority)
• Weighted: Weight by class frequency
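The classifier counts for the two decomposition strategies can be checked with scikit-learn's `OneVsRestClassifier` and `OneVsOneClassifier` wrappers. The sketch below uses a synthetic four-class dataset (so that K and K(K-1)/2 differ) and a decision tree as an arbitrary base estimator; both choices are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic 4-class data, for illustration only
X, y = make_classification(
    n_samples=400, n_features=10, n_informative=6,
    n_classes=4, n_clusters_per_class=1, random_state=0
)
k = len(np.unique(y))

ovr = OneVsRestClassifier(DecisionTreeClassifier(random_state=0)).fit(X, y)
ovo = OneVsOneClassifier(DecisionTreeClassifier(random_state=0)).fit(X, y)

print(f"Classes (K):            {k}")
print(f"OvR binary classifiers: {len(ovr.estimators_)}")  # K = 4
print(f"OvO binary classifiers: {len(ovo.estimators_)}")  # K(K-1)/2 = 6
```

OvO trains more models, but each one sees only the samples of two classes, so the trade-off between the two depends on K and on how the base estimator scales with sample size.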

Definition: Macro vs Micro Averaging

Macro averaging computes the metric independently for each class and takes the average, treating all classes equally regardless of size. Micro averaging aggregates the contributions of all classes to compute the average metric, which is dominated by the majority class. Weighted averaging weights each class's metric by its support (number of true instances).
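A small worked example makes the difference between the averaging modes concrete. The sketch below uses invented three-class labels in which class 0 dominates, and computes F1 with each `average` option of scikit-learn's `f1_score`.

```python
import numpy as np
from sklearn.metrics import f1_score

# Toy 3-class labels, made up for illustration: class 0 has 8 of 12 samples
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 1])

per_class = f1_score(y_true, y_pred, average=None)
macro = f1_score(y_true, y_pred, average="macro")
micro = f1_score(y_true, y_pred, average="micro")
weighted = f1_score(y_true, y_pred, average="weighted")

print("Per-class F1:", np.round(per_class, 4))  # [0.9412 0.5    0.6667]
print(f"Macro F1:    {macro:.4f}")              # 0.7026
print(f"Micro F1:    {micro:.4f}")              # 0.8333
print(f"Weighted F1: {weighted:.4f}")           # 0.8219
```

Macro F1 is pulled down by the two small classes, while micro and weighted F1 stay close to the majority-class score. A large gap between macro and micro is often a hint that minority classes are being handled poorly.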

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    roc_curve, auc, precision_recall_curve, average_precision_score,
    roc_auc_score
)
from sklearn.preprocessing import label_binarize
from itertools import cycle
import warnings
warnings.filterwarnings('ignore')

np.random.seed(42)

print("\n" + "=" * 70)
print("MULTI-CLASS EVALUATION")
print("=" * 70)

# Load digits dataset (10 classes)
digits = load_digits()
X, y = digits.data, digits.target

# Use only first 5 classes for clarity
X = X[y < 5]
y = y[y < 5]

n_classes = len(np.unique(y))
print(f"Dataset: {X.shape}")
print(f"Number of classes: {n_classes}")

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Binarize labels for ROC
y_test_bin = label_binarize(y_test, classes=range(n_classes))

# Train classifier
clf = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
clf.fit(X_train, y_train)
y_proba = clf.predict_proba(X_test)

# Calculate different AUC averaging methods
macro_auc = roc_auc_score(y_test_bin, y_proba, average='macro')
micro_auc = roc_auc_score(y_test_bin, y_proba, average='micro')
weighted_auc = roc_auc_score(y_test_bin, y_proba, average='weighted')

print(f"\nMacro AUC:    {macro_auc:.4f} (treats all classes equally)")
print(f"Micro AUC:    {micro_auc:.4f} (aggregates all predictions)")
print(f"Weighted AUC: {weighted_auc:.4f} (weights by class frequency)")

Key Takeaways

Confusion Matrix is the foundation for all metrics
Precision measures how many positive predictions are correct
Recall measures how many actual positives are captured
F1 Score balances precision and recall
ROC Curve shows performance across all thresholds
AUC provides a single number summary (threshold-independent)
PR Curves are better for imbalanced datasets
Threshold tuning is crucial for real-world applications
Multi-class evaluation requires macro/micro/weighted averaging
Always consider costs when choosing thresholds

Summary Table

Metric	Formula	Best For	Interpretation
Accuracy	(TP+TN)/N	Balanced data	Overall correctness
Precision	TP/(TP+FP)	High FP cost	Positive prediction quality
Recall	TP/(TP+FN)	High FN cost	Capture all positives
F1	2·P·R/(P+R)	Balance P&R	Harmonic mean
AUC-ROC	∫TPR dFPR	Balanced data	Threshold-independent
AP	∫P dR	Imbalanced data	Positive class focus

When to Use What

DECISION GUIDE:

Is your data balanced?
├── Yes → Use AUC-ROC
└── No → Use PR AUC (Average Precision)

What matters more?
├── Catch all positives (e.g., disease) → Optimize Recall
├── Don't raise false alarms (e.g., spam) → Optimize Precision
└── Balance both → Optimize F1 or F-beta

Multi-class?
├── All classes equally important → Macro average
├── Care more about majority → Micro average
└── Account for class frequency → Weighted average

Summary: Model Evaluation (ROC, AUC, PR Curves)

The confusion matrix (TP, TN, FP, FN) is the foundation from which all classification metrics are derived.
Accuracy = (TP+TN)/N is misleading for imbalanced data; prefer precision, recall, and F1.
The F1 score is the harmonic mean of precision and recall — it penalizes extreme imbalances between the two.
The ROC curve plots TPR vs FPR across thresholds; AUC gives a threshold-independent summary. AUC = P(positive ranked higher than negative).
PR curves are more informative than ROC for imbalanced datasets because they focus on the positive class. The baseline is the positive class prevalence, not 0.5.
Average Precision (AP) is the area under the PR curve and provides a single-number summary for imbalanced evaluation.
Threshold selection should be driven by the cost structure: Youden's J for balanced costs, cost-sensitive optimization for asymmetric costs.
For multi-class problems, use macro (equal class weight), micro (aggregate all), or weighted (frequency-based) averaging of AUC.
Always report confidence intervals (via bootstrap) for AUC to understand uncertainty in your evaluation.
Match your metric to your problem: ROC-AUC for balanced data, PR-AUC for imbalanced data, cost-sensitive metrics when error types have different consequences.
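The bootstrap confidence interval mentioned in the summary can be sketched as follows. This is a minimal percentile-bootstrap version, with a hypothetical helper name (`bootstrap_auc_ci`) and synthetic scores generated only for illustration; other bootstrap variants exist and may be preferable for small samples.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auc_ci(y_true, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for ROC AUC (hypothetical helper)."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    rng = np.random.default_rng(seed)
    n = len(y_true)
    aucs = []
    for _ in range(n_boot):
        # Resample test cases with replacement
        idx = rng.integers(0, n, size=n)
        # AUC is undefined when a resample contains a single class; skip it
        if len(np.unique(y_true[idx])) < 2:
            continue
        aucs.append(roc_auc_score(y_true[idx], scores[idx]))
    lower, upper = np.percentile(aucs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return roc_auc_score(y_true, scores), lower, upper

# Synthetic labels and scores: positives score higher on average
rng = np.random.default_rng(42)
y_true = rng.integers(0, 2, size=200)
scores = y_true * 0.5 + rng.normal(0, 0.5, size=200)

point, lower, upper = bootstrap_auc_ci(y_true, scores)
print(f"AUC = {point:.3f}, 95% CI = [{lower:.3f}, {upper:.3f}]")
```

The interval reflects only the sampling variability of the test set. It does not account for variability from retraining the model, which would require resampling the training data as well.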

Practice Exercises

Exercise 1: Medical Diagnosis

"""
Build a medical diagnosis system:
1. Generate imbalanced dataset (5% disease prevalence)
2. Train multiple classifiers
3. Compare ROC and PR curves
4. Choose threshold minimizing FN (missing disease)
5. Calculate cost of different error types
"""

# Your code here

Exercise 2: Anomaly Detection

"""
Anomaly detection evaluation:
1. Generate data with 1% anomalies
2. Train anomaly detector
3. Evaluate with PR curve (not ROC!)
4. Find threshold with 95% recall
5. Report precision at that threshold
"""

# Your code here

Exercise 3: Multi-class Comparison

"""
Compare multi-class evaluation strategies:
1. Use Iris or Digits dataset
2. Train 3 classifiers
3. Calculate macro, micro, weighted AUC
4. Analyze per-class performance
5. Identify worst-performing classes
"""

# Your code here

Exercise 4: Cost-Sensitive Learning

"""
Implement cost-sensitive evaluation:
1. Define cost matrix for different errors
2. Train standard classifier
3. Find cost-optimal threshold
4. Compare with default threshold (0.5)
5. Visualize cost landscape
"""

# Your code here

Congratulations! You've completed Module 2: Machine Learning. You now have a solid foundation in:

Cross-validation and model evaluation
Hyperparameter tuning strategies
Unsupervised learning (clustering)
Dimensionality reduction (PCA)
Comprehensive model evaluation metrics

Next: We'll explore Deep Learning fundamentals in Module 3!
