Model Evaluation — Complete Guide

Choosing the right metric and evaluation strategy is critical. A model with 99% accuracy can still be useless on imbalanced data, for example when 99% of the samples belong to a single class.
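
To make the imbalance point concrete, here is a minimal sketch. The label counts are invented for illustration, and the "model" is a hypothetical baseline that always predicts the negative class.

```python
# Hypothetical data: 1,000 samples, only 10 of them positive (1%).
y_true = [1] * 10 + [0] * 990

# A "model" that ignores its input and always predicts the negative class.
y_pred = [0] * len(y_true)

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall on the positive class: how many of the 10 positives were found?
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / sum(y_true)

print(f"accuracy = {accuracy:.2f}, recall = {recall:.2f}")
# prints: accuracy = 0.99, recall = 0.00
```

The accuracy looks excellent even though not a single positive sample is detected.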

Classification Metrics

Confusion Matrix:
                    Predicted
                  0      1
Actual    0  [  TN  |  FP  ]
          1  [  FN  |  TP  ]

Accuracy:  (TP + TN) / (TP + TN + FP + FN)
           = overall correctness

Precision: TP / (TP + FP)
           = of predicted positives, how many correct?
           = important when FP is costly (spam filter)

Recall:    TP / (TP + FN)
           = of actual positives, how many found?
           = important when FN is costly (cancer detection)

F1 Score:  2 × (Precision × Recall) / (Precision + Recall)
           = harmonic mean of precision and recall
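
The four formulas above can be sketched directly from the confusion-matrix counts. The function name, the example counts, and the choice to return 0.0 when a denominator is zero are assumptions made for this illustration.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    # Assumed convention: report 0.0 when a denominator would be zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Illustrative counts only (not from a real dataset).
metrics = classification_metrics(tp=40, tn=900, fp=10, fn=50)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts accuracy is 0.940 while recall is only 0.444, which shows how the metrics can disagree on the same predictions.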

AUC-ROC:   Area under the ROC curve (true positive rate vs false positive rate)
           = threshold-independent measure of ranking quality
           = 0.5 is random guessing, 1.0 is perfect
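
One way to read AUC-ROC is as the probability that a randomly chosen positive sample gets a higher score than a randomly chosen negative one. The sketch below computes it that way by counting pairs. The function name and sample scores are illustrative, and the pairwise loop is meant for small inputs only; in practice `sklearn.metrics.roc_auc_score` would be the usual choice.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a pair.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(y_true, scores))  # 3 of 4 pairs are ordered correctly -> 0.75
```

Because only the ordering of the scores matters, no classification threshold appears anywhere in the computation.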

Regression Metrics

MSE:  (1/n) Σ(yᵢ - ŷᵢ)²
      Penalizes large errors heavily

RMSE: √MSE
      In same units as target

MAE:  (1/n) Σ|yᵢ - ŷᵢ|
      Average absolute error

R²:   1 - SS_res / SS_tot
      Proportion of variance explained
      1.0 = perfect, 0.0 = no better than predicting the mean, negative = worse than the mean

Adjusted R²: R² penalized for the number of features
             Rises only when a new feature improves the fit more than chance would
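
The regression formulas above can be sketched in plain Python as follows. The function name and the sample values are made up for illustration. The adjusted R² line uses the standard textbook form 1 - (1 - R²)(n - 1)/(n - p - 1), which is an addition here because the text above does not spell the formula out.

```python
import math


def regression_metrics(y_true, y_pred, n_features):
    """Return MSE, RMSE, MAE, R² and adjusted R² for paired lists of numbers."""
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(r * r for r in residuals)
    mse = ss_res / n
    rmse = math.sqrt(mse)
    mae = sum(abs(r) for r in residuals) / n
    mean_y = sum(y_true) / n
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot
    # Standard textbook form; requires n > n_features + 1.
    adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)
    return {"mse": mse, "rmse": rmse, "mae": mae,
            "r2": r2, "adjusted_r2": adjusted_r2}


# Illustrative values only.
y_true = [3.0, 5.0, 7.0, 9.0, 11.0]
y_pred = [2.5, 5.5, 7.0, 8.0, 12.0]
reg = regression_metrics(y_true, y_pred, n_features=1)
for name, value in reg.items():
    print(f"{name}: {value:.4f}")
```

Here MSE is 0.5 while MAE is 0.6 and RMSE is about 0.707: the two errors of size 1 dominate the squared metrics, which is the "penalizes large errors" effect in action.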

Cross-Validation

Problem: A single train/test split gives a noisy estimate that depends on which samples land in the test set

Solution: K-Fold Cross-Validation

Data split into K folds:
Fold:  [1] [2] [3] [4] [5]
Run 1: Test Train Train Train Train
Run 2: Train Test Train Train Train
Run 3: Train Train Test Train Train
Run 4: Train Train Train Test Train
Run 5: Train Train Train Train Test

Final score = average of all K runs

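from sklearn.model_selection import cross_val_score

# model: any scikit-learn estimator; X, y: feature matrix and labels (defined elsewhere)
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')  # one score per fold
print(f"CV Accuracy: {scores.mean():.3f} ± {scores.std():.3f}")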
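
To show what the fold table describes, here is a minimal sketch of the index bookkeeping behind K-fold splitting. The helper name is hypothetical and the folds are contiguous and unshuffled to keep the example short; scikit-learn's own splitters also offer shuffling and stratification.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous, unshuffled folds."""
    # Spread any remainder so that fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


for run, (train_idx, test_idx) in enumerate(k_fold_indices(10, 5), start=1):
    print(f"Run {run}: test={test_idx} train={train_idx}")
```

Every sample lands in the test set exactly once, so each of the K scores is computed on data the model did not see during that run.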

Bias-Variance Tradeoff

Total Error = Bias² + Variance + Irreducible Error

High Bias (Underfitting):
├─ Model too simple
├─ Misses patterns
├─ Low training accuracy
└─ Low test accuracy

High Variance (Overfitting):
├─ Model too complex
├─ Memorizes noise
├─ High training accuracy
└─ Low test accuracy

Sweet Spot:
├─ Balanced complexity
├─ Captures patterns, not noise
├─ Good training accuracy
└─ Good test accuracy
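
The decomposition can be checked numerically with a small simulation. Everything in this sketch is an assumption chosen for illustration: the true function x², the noise level, the query point 0.9, and the two toy models (a training-mean predictor as the high-bias case and a 1-nearest-neighbor predictor as the high-variance case). The error is measured against the noise-free target, so the irreducible term (the noise variance) is left out of the sum.

```python
import random
import statistics


def true_f(x):
    return x * x


def sample_training_set(rng, n=20, noise_sd=0.3):
    xs = [rng.uniform(0.0, 1.0) for _ in range(n)]
    ys = [true_f(x) + rng.gauss(0.0, noise_sd) for x in xs]
    return xs, ys


def predict_mean(xs, ys, x0):
    # High-bias model: ignores x0 and predicts the training mean.
    return sum(ys) / len(ys)


def predict_1nn(xs, ys, x0):
    # High-variance model: copies the label of the nearest training point.
    nearest = min(range(len(xs)), key=lambda i: abs(xs[i] - x0))
    return ys[nearest]


def bias_variance(predict, x0, trials=2000, seed=0):
    """Estimate bias², variance and MSE of predictions at x0 over many training sets."""
    rng = random.Random(seed)
    preds = []
    for _ in range(trials):
        xs, ys = sample_training_set(rng)
        preds.append(predict(xs, ys, x0))
    bias_sq = (statistics.fmean(preds) - true_f(x0)) ** 2
    variance = statistics.pvariance(preds)
    # Mean squared error against the noise-free target f(x0).
    mse = statistics.fmean([(p - true_f(x0)) ** 2 for p in preds])
    return bias_sq, variance, mse


results = {}
for name, model in [("mean", predict_mean), ("1-NN", predict_1nn)]:
    results[name] = bias_variance(model, x0=0.9)
    b2, var, mse = results[name]
    print(f"{name:>4}: bias^2={b2:.4f}  variance={var:.4f}  "
          f"sum={b2 + var:.4f}  mse={mse:.4f}")
```

In each row the sum of bias² and variance matches the measured MSE. The mean predictor is dominated by bias, the 1-NN predictor by variance.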

Visualization:
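Error
│╲ ╲                       ╱   Test error
│ ╲  ╲                   ╱
│  ╲   ╲_______________╱
│   ╲_____
│         ╲________________    Training error
└──────────────────────────── Model Complexity
      Underfitting | Overfitting

Training error keeps falling as complexity grows. Test error falls, reaches its minimum at the sweet spot, then rises again.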

Learning Curves

Training and test score vs training set size:

High Bias:
Score │ ───────────── (both converge low)
      │
      └──────────────── Training Size

High Variance:
Score │ ╱─────── (gap between train and test)
      │╱
      └──────────────── Training Size

Good Fit:
Score │    ╱───────── (both converge high)
      │   ╱
      └──────────────── Training Size
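
The first two patterns can be reproduced with a small, self-contained sketch. The synthetic data, the 20% label noise, and the two toy models (a majority-class predictor for high bias, a 1-nearest-neighbor classifier for high variance) are assumptions made for illustration. For real models, `sklearn.model_selection.learning_curve` computes cross-validated versions of the same curves.

```python
import random


def make_data(rng, n, flip=0.2):
    """1-D points in [0, 1]; label is 1 when x > 0.5, flipped with probability `flip`."""
    data = []
    for _ in range(n):
        x = rng.random()
        label = int(x > 0.5)
        if rng.random() < flip:
            label = 1 - label
        data.append((x, label))
    return data


def fit_majority(train):
    # High-bias model: always predicts the most common training label.
    ones = sum(label for _, label in train)
    majority = int(2 * ones >= len(train))
    return lambda x: majority


def fit_1nn(train):
    # High-variance model: copies the label of the nearest training point.
    return lambda x: min(train, key=lambda pair: abs(pair[0] - x))[1]


def accuracy(model, data):
    return sum(model(x) == label for x, label in data) / len(data)


def learning_curve_points(fit, train_pool, test_set, sizes):
    """Return (size, train_score, test_score) for each training-set size."""
    points = []
    for size in sizes:
        subset = train_pool[:size]
        model = fit(subset)
        points.append((size, accuracy(model, subset), accuracy(model, test_set)))
    return points


rng = random.Random(0)
train_pool = make_data(rng, 400)
test_set = make_data(rng, 500)
sizes = [25, 50, 100, 200, 400]

curves = {}
for name, fit in [("majority", fit_majority), ("1-NN", fit_1nn)]:
    curves[name] = learning_curve_points(fit, train_pool, test_set, sizes)
    print(name)
    for size, train_score, test_score in curves[name]:
        print(f"  n={size:>3}  train={train_score:.2f}  test={test_score:.2f}")
```

The majority-class model should show both scores settling at a similar low value, while the 1-NN model keeps a perfect training score and a clearly lower test score no matter how much data it sees.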

Key Takeaways

Accuracy is misleading for imbalanced datasets
Use precision when false positives are costly
Use recall when false negatives are costly
F1 score balances precision and recall
AUC-ROC is threshold-independent
Prefer cross-validation over a single train/test split for more reliable estimates
Bias-variance tradeoff is the central challenge
Learning curves diagnose underfitting vs overfitting
Choose metrics that match your business objective
No free lunch — no single model works best for all problems
