Model Evaluation — Complete Guide
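Choosing the right metric and evaluation strategy is critical. A model with 99% accuracy can still be useless on imbalanced data: if only 1% of the samples are positive, a model that always predicts the negative class reaches 99% accuracy while finding no positives at all.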
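The arithmetic behind that claim can be checked in a few lines. This is a minimal sketch with made-up labels (1,000 samples, 10 of them positive) and a hypothetical "model" that always predicts the negative class; none of these numbers come from a real dataset.

```python
# Illustrative labels (an assumption, not data from this guide):
# 990 negatives and 10 positives.
y_true = [0] * 990 + [1] * 10

# A "model" that ignores its input and always predicts the majority class.
y_pred = [0] * len(y_true)

# Accuracy: fraction of predictions that match the true label.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Fraction of the actual positives that the model found.
found = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
positives_found = found / sum(y_true)

print(f"Accuracy: {accuracy:.2f}")                # 0.99
print(f"Positives found: {positives_found:.2f}")  # 0.00
```

The accuracy looks excellent, yet every positive case is missed, which is why the metrics below look at the error types separately.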
Classification Metrics
Confusion Matrix:
             Predicted
              0      1
Actual  0  [ TN  |  FP ]
        1  [ FN  |  TP ]
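The four cells above can be read off with scikit-learn's `confusion_matrix`. The sketch below uses made-up labels purely for illustration; for binary labels 0 and 1, the returned array follows the same layout as the diagram.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration only.
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 0, 0, 1, 0, 1, 0, 1, 1]

# Rows are actual classes, columns are predicted classes, ordered by label,
# so the binary layout is [[TN, FP], [FN, TP]].
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

print(cm)
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")  # TN=4 FP=2 FN=1 TP=3
```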
Accuracy: (TP + TN) / (TP + TN + FP + FN)
= overall correctness
Precision: TP / (TP + FP)
= of predicted positives, how many correct?
= important when FP is costly (spam filter)
Recall: TP / (TP + FN)
= of actual positives, how many found?
= important when FN is costly (cancer detection)
F1 Score: 2 × (Precision × Recall) / (Precision + Recall)
= harmonic mean of precision and recall
AUC-ROC: Area under the ROC curve (true positive rate vs. false positive rate)
= summarizes performance across all classification thresholds
= 0.5 is random ranking, 1.0 is perfect ranking
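As a rough illustration of the formulas above, the following sketch computes each metric with `sklearn.metrics` on made-up labels and scores. The 0.5 threshold is an assumption for the example; note that `roc_auc_score` takes the raw scores rather than the thresholded predictions.

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, roc_auc_score)

# Made-up labels and predicted scores for illustration only.
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_score = [0.2, 0.6, 0.1, 0.3, 0.9, 0.4, 0.8, 0.2, 0.7, 0.55]

# Hard predictions at an assumed threshold of 0.5.
# This gives TP=3, FP=2, FN=1, TN=4.
y_pred = [1 if s >= 0.5 else 0 for s in y_score]

accuracy = accuracy_score(y_true, y_pred)    # (3 + 4) / 10
precision = precision_score(y_true, y_pred)  # 3 / (3 + 2)
recall = recall_score(y_true, y_pred)        # 3 / (3 + 1)
f1 = f1_score(y_true, y_pred)                # harmonic mean of the two
auc = roc_auc_score(y_true, y_score)         # uses scores, not y_pred

print(f"Accuracy:  {accuracy:.3f}")   # 0.700
print(f"Precision: {precision:.3f}")  # 0.600
print(f"Recall:    {recall:.3f}")     # 0.750
print(f"F1:        {f1:.3f}")         # 0.667
print(f"AUC-ROC:   {auc:.3f}")        # 0.917
```

Moving the threshold changes precision, recall, and F1, but leaves AUC-ROC unchanged, since it depends only on how the scores rank positives against negatives.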
Regression Metrics
MSE: (1/n) Σ(yᵢ - ŷᵢ)²
Penalizes large errors heavily
RMSE: √MSE
In same units as target
MAE: (1/n) Σ|yᵢ - ŷᵢ|
Average absolute error
R²: 1 - SS_res / SS_tot
Proportion of variance explained
1.0 = perfect, 0.0 = no better than predicting the mean, negative = worse than the mean
Adjusted R²: R² penalized for number of features
Unlike plain R², it does not rise just because more features were added
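The regression formulas above translate almost directly into NumPy. The sketch below uses made-up targets and predictions; the adjusted R² line uses the common definition 1 - (1 - R²)(n - 1)/(n - p - 1), where the number of features p is an assumed value for the example.

```python
import numpy as np

# Made-up targets and predictions for illustration only.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 11.0])
n_features = 1  # assumed number of predictors p, needed for adjusted R²

residuals = y_true - y_pred

mse = np.mean(residuals ** 2)     # penalizes large errors heavily
rmse = np.sqrt(mse)               # same units as the target
mae = np.mean(np.abs(residuals))  # average absolute error

ss_res = np.sum(residuals ** 2)                 # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2 = 1 - ss_res / ss_tot

n = len(y_true)
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)

print(f"MSE={mse:.3f} RMSE={rmse:.3f} MAE={mae:.3f}")
# MSE=1.500 RMSE=1.225 MAE=1.000
print(f"R²={r2:.3f} Adjusted R²={adj_r2:.3f}")
# R²=0.700 Adjusted R²=0.550
```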
Cross-Validation
Problem: A single train/test split gives a noisy estimate that depends on which samples land in the test set
Solution: K-Fold Cross-Validation
Data split into K folds:
Fold: [1] [2] [3] [4] [5]
Run 1: Test Train Train Train Train
Run 2: Train Test Train Train Train
Run 3: Train Train Test Train Train
Run 4: Train Train Train Test Train
Run 5: Train Train Train Train Test
Final score = average of all K runs
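from sklearn.model_selection import cross_val_score
# model is any scikit-learn estimator; X and y are the feature matrix and labels
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
# Report the mean and spread of the five per-fold accuracies
print(f"CV Accuracy: {scores.mean():.3f} ± {scores.std():.3f}")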
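To show what happens inside each run of the diagram, here is a sketch of the same procedure written as an explicit loop. The synthetic dataset and the `LogisticRegression` model are assumptions chosen for illustration. The loop uses a shuffled `KFold`, whereas `cross_val_score` with an integer `cv` and a classifier uses stratified, unshuffled folds, so the scores will not match it exactly.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Synthetic data and model chosen only for illustration.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
model = LogisticRegression(max_iter=1000)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = []
for train_idx, test_idx in kf.split(X):
    # Fit on the K-1 training folds, then score on the held-out fold.
    model.fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))

scores = np.array(scores)
print(f"Per-fold accuracy: {np.round(scores, 3)}")
print(f"CV Accuracy: {scores.mean():.3f} ± {scores.std():.3f}")
```

Every sample is used for testing exactly once, so the averaged score depends far less on one lucky or unlucky split.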
Bias-Variance Tradeoff
Total Error (expected squared error) = Bias² + Variance + Irreducible Error
High Bias (Underfitting):
├─ Model too simple
├─ Misses patterns
├─ Low training accuracy
└─ Low test accuracy
High Variance (Overfitting):
├─ Model too complex
├─ Memorizes noise
├─ High training accuracy
└─ Low test accuracy
Sweet Spot:
├─ Balanced complexity
├─ Captures patterns, not noise
├─ Good training accuracy
└─ Good test accuracy
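Visualization:
Training error keeps falling as complexity grows; test error falls, then rises.
Error
│ ╲                        ╱  Test
│  ╲ ╲                   ╱
│   ╲   ╲_____________╱
│    ╲____
│         ╲_______________    Training
└──────────────────────────── Model Complexity
  Underfitting | Sweet Spot | Overfitting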
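The shape of these curves can be reproduced with a small experiment. The sketch below fits polynomials of increasing degree to noisy samples from an assumed ground truth, sin(2πx); the function, noise level, sample sizes, and degrees are all illustrative choices, with polynomial degree standing in for model complexity.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    # Noisy samples from an assumed ground truth, sin(2*pi*x).
    x = rng.uniform(0, 1, n)
    y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, n)
    return x, y

x_train, y_train = make_data(30)
x_test, y_test = make_data(200)

results = {}
for degree in (1, 3, 9):
    # Least-squares polynomial fit; higher degree means a more complex model.
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    results[degree] = (train_mse, test_mse)
    print(f"degree={degree}: train MSE={train_mse:.3f}, test MSE={test_mse:.3f}")
```

Training error can only go down as the degree grows, because each higher-degree model contains the lower-degree ones. Test error typically drops from degree 1 to degree 3 and then stops improving or worsens, although the exact numbers depend on the random seed.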
Learning Curves
Training and test score vs. training set size:
High Bias:
Score │  ─────────────  (both curves converge to a low score)
      │
      └──────────────── Training Size
High Variance:
Score │  ╱───────  (persistent gap between train and test)
      │ ╱
      └──────────────── Training Size
Good Fit:
Score │   ╱─────────  (both curves converge to a high score)
      │  ╱
      └──────────────── Training Size
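The numbers behind such curves can be produced with scikit-learn's `learning_curve`. The sketch below uses a synthetic dataset and a `LogisticRegression` model, both assumptions for illustration. The function reports the held-out folds of cross-validation as its test scores.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic data chosen only for illustration.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Scores have shape (number of training sizes, number of CV folds).
train_sizes, train_scores, test_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5, scoring="accuracy",
)

# Average over folds to get one point per training size on each curve.
train_mean = train_scores.mean(axis=1)
test_mean = test_scores.mean(axis=1)
for size, tr, te in zip(train_sizes, train_mean, test_mean):
    print(f"n={int(size):4d}  train={tr:.3f}  test={te:.3f}")
```

Two curves that meet at a low score suggest high bias, while a gap that stays wide as the training size grows suggests high variance.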
Key Takeaways
- Accuracy can be misleading for imbalanced datasets
- Use precision when false positives are costly
- Use recall when false negatives are costly
- F1 score balances precision and recall
- AUC-ROC is threshold-independent
- Prefer cross-validation to a single train/test split for more reliable estimates
- Bias-variance tradeoff is the central challenge in choosing model complexity
- Learning curves diagnose underfitting vs overfitting
- Choose metrics that match your business objective
- No free lunch — no single model works best for all problems