Model Evaluation — Metrics, Cross-Validation & Selection

Model Evaluation — Complete Guide

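Choosing the right metric and evaluation strategy matters as much as choosing the model. A model with 99% accuracy can be useless on imbalanced data: if only 1% of samples are positive, a model that always predicts the negative class reaches 99% accuracy while finding no positives at all.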


Classification Metrics

Confusion Matrix:
                    Predicted
                  0      1
Actual    0  [  TN  |  FP  ]
          1  [  FN  |  TP  ]

Accuracy:  (TP + TN) / (TP + TN + FP + FN)
           = overall correctness

Precision: TP / (TP + FP)
           = of predicted positives, how many correct?
           = important when FP is costly (spam filter)

Recall:    TP / (TP + FN)
           = of actual positives, how many found?
           = important when FN is costly (cancer detection)

F1 Score:  2 × (Precision × Recall) / (Precision + Recall)
           = harmonic mean of precision and recall
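
As a minimal sketch of the formulas above, the following plain-Python function computes the four metrics from confusion-matrix counts. The function name `classification_metrics` and the example counts are illustrative assumptions, not part of the lesson.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    # Guard against division by zero when there are no predicted or actual positives.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical imbalanced case: 1000 samples, only 10 positives.
# The model finds 2 of the positives and raises 3 false alarms.
metrics = classification_metrics(tp=2, tn=987, fp=3, fn=8)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these made-up counts, accuracy comes out at 0.989 while recall is only 0.2, which is the imbalance problem expressed in numbers.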

AUC-ROC:   Area under the ROC curve (true positive rate vs. false positive rate)
           = performance summarized across all thresholds
           0.5 = random ranking, 1.0 = perfect ranking
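
One way to see why AUC does not depend on a threshold: it equals the fraction of (positive, negative) pairs in which the positive sample receives the higher score. The sketch below uses that pairwise view. It is an O(n²) illustration rather than how a library would compute it, and the function name `roc_auc` and the sample scores are assumptions made for this example.

```python
def roc_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


# Hypothetical labels and predicted scores.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(y_true, y_score)
print(f"AUC: {auc:.2f}")
```

Here three of the four positive/negative pairs are ordered correctly, so the AUC is 0.75.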

Regression Metrics

MSE:  (1/n) Σ(yᵢ - ŷᵢ)²
      Penalizes large errors heavily

RMSE: √MSE
      In same units as target

MAE:  (1/n) Σ|yᵢ - ŷᵢ|
      Average absolute error

R²:   1 - SS_res / SS_tot
      Proportion of variance explained
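      1.0 = perfect, 0.0 = no better than predicting the mean
      Can be negative when the model is worse than predicting the mean

Adjusted R²: 1 - (1 - R²)(n - 1) / (n - p - 1), with n samples and p features
             R² penalized for the number of features
             Discourages adding features that do not improve the fit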
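
The regression formulas can be checked by hand with a short plain-Python sketch. The function name `regression_metrics`, the `n_features` parameter, and the sample values are assumptions for illustration. Adjusted R² uses the common form 1 - (1 - R²)(n - 1)/(n - p - 1).

```python
import math


def regression_metrics(y_true, y_pred, n_features):
    """Compute MSE, RMSE, MAE, R², and adjusted R² for paired lists of values."""
    n = len(y_true)
    residuals = [yt - yp for yt, yp in zip(y_true, y_pred)]
    ss_res = sum(r * r for r in residuals)
    mse = ss_res / n
    mae = sum(abs(r) for r in residuals) / n
    mean_y = sum(y_true) / n
    ss_tot = sum((yt - mean_y) ** 2 for yt in y_true)
    r2 = 1 - ss_res / ss_tot
    # Penalize R² for the number of features p (requires n > p + 1).
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)
    return {"mse": mse, "rmse": math.sqrt(mse), "mae": mae, "r2": r2, "adj_r2": adj_r2}


# Hypothetical targets and predictions from a model with one feature.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.0, 7.5, 10.0]
reg = regression_metrics(y_true, y_pred, n_features=1)
for name, value in reg.items():
    print(f"{name}: {value:.4f}")
```

For these values MSE is 0.375 and MAE is 0.5: the single error of 1.0 contributes two thirds of the squared error but only half of the absolute error, which is what "penalizes large errors heavily" means in practice.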

Cross-Validation

Problem: A single train/test split gives a noisy estimate, because the score depends on which samples land in the test set

Solution: K-Fold Cross-Validation

Data split into K folds:
Fold:  [1] [2] [3] [4] [5]
Run 1: Test Train Train Train Train
Run 2: Train Test Train Train Train
Run 3: Train Train Test Train Train
Run 4: Train Train Train Test Train
Run 5: Train Train Train Train Test

Final score = average of all K runs
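
To make the fold table concrete, here is a minimal plain-Python sketch that generates the train/test indices for each run. The helper `k_fold_indices` is a hypothetical name for this example. It uses contiguous folds without shuffling, whereas in practice the data is usually shuffled (or stratified by class) first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


folds = list(k_fold_indices(n_samples=10, k=5))
for run, (train, test) in enumerate(folds, start=1):
    print(f"Run {run}: test={test} train={train}")
```

Every sample appears in the test set exactly once across the K runs, so each one contributes to the final averaged score.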

from sklearn.model_selection import cross_val_score

# model: any scikit-learn estimator; X, y: feature matrix and labels (defined elsewhere)
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
print(f"CV Accuracy: {scores.mean():.3f} ± {scores.std():.3f}")

Bias-Variance Tradeoff

Total Error = Bias² + Variance + Irreducible Error
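
The decomposition can be verified numerically. The sketch below is an assumed toy setup, not from the lesson: it estimates a known mean with a deliberately shrunken sample average, repeats this over many simulated datasets, and compares bias² + variance with the mean squared error. Because the error is measured against the true value rather than a noisy observation, the irreducible error term is zero here.

```python
import random

random.seed(0)

TRUE_MEAN = 5.0   # quantity being estimated (assumed for the simulation)
SHRINK = 0.8      # hypothetical shrinkage factor that introduces bias
N_SAMPLES = 10    # observations per simulated dataset
N_TRIALS = 5000   # number of simulated datasets

estimates = []
for _ in range(N_TRIALS):
    sample = [random.gauss(TRUE_MEAN, 2.0) for _ in range(N_SAMPLES)]
    estimates.append(SHRINK * sum(sample) / N_SAMPLES)

mean_estimate = sum(estimates) / N_TRIALS
bias_sq = (mean_estimate - TRUE_MEAN) ** 2
variance = sum((e - mean_estimate) ** 2 for e in estimates) / N_TRIALS
mse = sum((e - TRUE_MEAN) ** 2 for e in estimates) / N_TRIALS

print(f"bias^2 + variance  = {bias_sq + variance:.4f}")
print(f"mean squared error = {mse:.4f}")
```

The two printed values agree up to floating-point rounding. Moving `SHRINK` closer to 1.0 lowers the bias term and raises the variance term, which is the tradeoff in miniature.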

High Bias (Underfitting):
├─ Model too simple
├─ Misses patterns
├─ Low training accuracy
└─ Low test accuracy

High Variance (Overfitting):
├─ Model too complex
├─ Memorizes noise
├─ High training accuracy
└─ Low test accuracy

Sweet Spot:
├─ Balanced complexity
├─ Captures patterns, not noise
├─ Good training accuracy
└─ Good test accuracy

Error vs. model complexity:
├─ Training error keeps falling as complexity grows
├─ Test error falls, reaches a minimum, then rises again
├─ Left of the minimum: underfitting
└─ Right of the minimum: overfitting
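
A small experiment can reproduce this pattern. The sketch below is a toy setup under stated assumptions: a noisy sine wave, and a k-nearest-neighbors regressor written in plain Python, where a smaller k means a more flexible model. All names and constants are illustrative.

```python
import math
import random

random.seed(0)


def make_data(n):
    """Noisy samples of a sine wave on [0, 1] (a made-up toy problem)."""
    xs = [random.random() for _ in range(n)]
    ys = [math.sin(2 * math.pi * x) + random.gauss(0, 0.3) for x in xs]
    return xs, ys


def knn_predict(x, train_x, train_y, k):
    """Average the targets of the k training points closest to x."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    return sum(train_y[i] for i in nearest) / k


def mse(xs, ys, train_x, train_y, k):
    """Mean squared error of k-NN predictions on the points (xs, ys)."""
    return sum((y - knn_predict(x, train_x, train_y, k)) ** 2
               for x, y in zip(xs, ys)) / len(xs)


train_x, train_y = make_data(60)
test_x, test_y = make_data(200)

results = {}
for k in (60, 7, 1):  # from simplest (global average) to most flexible
    results[k] = (mse(train_x, train_y, train_x, train_y, k),
                  mse(test_x, test_y, train_x, train_y, k))
    print(f"k={k:2d}  train MSE={results[k][0]:.3f}  test MSE={results[k][1]:.3f}")
```

With k=60 the model predicts the global average everywhere (high bias), and with k=1 it reproduces every training point exactly, so its training error is zero while its test error is not (high variance). An intermediate k typically gives the lowest test error, though the exact numbers depend on the random seed.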

Learning Curves

Training and validation score vs. training set size:

High Bias:
Score │ ───────────── (both converge low)
      │
      └──────────────── Training Size

High Variance:
Score │ ╱─────── (gap between train and test)
      │╱
      └──────────────── Training Size

Good Fit:
Score │    ╱───────── (both converge high)
      │   ╱
      └──────────────── Training Size
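
A learning curve can be computed by refitting the same model on growing subsets of the training data. The sketch below is a minimal plain-Python illustration under assumed conditions: synthetic linear data and a one-feature least-squares fit. It reports error (lower is better) rather than score, so the curves are mirrored relative to the plots above.

```python
import random

random.seed(0)


def make_data(n):
    """Hypothetical linear data: y = 3x + 1 plus Gaussian noise (std 2)."""
    xs = [random.uniform(0, 10) for _ in range(n)]
    ys = [3 * x + 1 + random.gauss(0, 2.0) for x in xs]
    return xs, ys


def fit_line(xs, ys):
    """Ordinary least squares for one feature; returns (slope, intercept)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def mse(xs, ys, slope, intercept):
    """Mean squared error of the fitted line on the points (xs, ys)."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys)) / len(xs)


train_x, train_y = make_data(200)
val_x, val_y = make_data(500)

curve = []
for m in (5, 10, 20, 50, 100, 200):
    slope, intercept = fit_line(train_x[:m], train_y[:m])
    train_err = mse(train_x[:m], train_y[:m], slope, intercept)
    val_err = mse(val_x, val_y, slope, intercept)
    curve.append((m, train_err, val_err))
    print(f"n={m:3d}  train MSE={train_err:.2f}  validation MSE={val_err:.2f}")
```

Because the model family matches the data here, this corresponds to the "Good Fit" case: as the training size grows, both errors should settle near the noise variance (4.0 in this setup). A persistent gap between the two would point to high variance, and two curves that converge at a high error would point to high bias.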

Key Takeaways

  1. Accuracy is misleading for imbalanced datasets
  2. Use precision when false positives are costly
  3. Use recall when false negatives are costly
  4. F1 score balances precision and recall
  5. AUC-ROC is threshold-independent
  6. Prefer cross-validation over a single split for more reliable estimates
  7. Bias-variance tradeoff is the central challenge
  8. Learning curves diagnose underfitting vs overfitting
  9. Choose metrics that match your business objective
  10. No free lunch — no single model works best for all problems
