Why Model Evaluation Matters
A model is only as good as our ability to measure its performance. Choosing the wrong metric can lead to catastrophically poor deployment decisions. This lesson covers the essential evaluation toolkit every ML practitioner must master.
1. The Confusion Matrix
For a binary classifier, the confusion matrix partitions predictions into four categories:
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | True Positive (TP) | False Negative (FN) |
| Actual Negative | False Positive (FP) | True Negative (TN) |
Key Definitions
- Positive class = the class of interest (e.g., "has cancer", "fraudulent transaction")
- TP: Model predicts positive, actual is positive
- FP: Model predicts positive, actual is negative (Type I error)
- FN: Model predicts negative, actual is positive (Type II error)
- TN: Model predicts negative, actual is negative
The choice of which class is "positive" is arbitrary but critical. In disease screening, the diseased class is typically positive. Changing this flips TP↔TN and FP↔FN.
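As a minimal sketch of the four cells, the snippet below counts them from two label lists and then recounts with the other class treated as positive. The function name `confusion_counts` and the toy labels are made up for illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, FN, TN), treating `positive` as the positive class."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive and actual == positive:
            tp += 1
        elif predicted == positive:
            fp += 1  # predicted positive, actually negative (Type I error)
        elif actual == positive:
            fn += 1  # predicted negative, actually positive (Type II error)
        else:
            tn += 1
    return tp, fp, fn, tn


# Toy labels, invented for this example: 1 = positive, 0 = negative.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 1, 0, 1, 1, 0, 0]

counts = confusion_counts(y_true, y_pred, positive=1)
flipped = confusion_counts(y_true, y_pred, positive=0)
print(counts)   # (2, 2, 1, 3)
print(flipped)  # (3, 1, 2, 2)
```

With class 0 as positive, the old TN count becomes TP and the old FN count becomes FP, which is the flip described above.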
2. Classification Metrics
2.1 Accuracy
The fraction of all predictions that are correct: $\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$. Dangerous when classes are imbalanced. A model predicting "no cancer" for all patients in a population with 1% prevalence achieves 99% accuracy while detecting no cases at all.
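The 1% prevalence example can be checked with a few lines of arithmetic. The population size of 1,000 is an assumption chosen for round numbers.

```python
# Hypothetical screening population: 1,000 patients, 1% prevalence.
n_patients = 1000
n_sick = 10
y_true = [1] * n_sick + [0] * (n_patients - n_sick)
y_pred = [0] * n_patients  # the model always predicts "no cancer"

correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
accuracy = correct / n_patients

caught = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
recall = caught / n_sick  # fraction of sick patients the model finds

print(f"accuracy = {accuracy:.2f}, recall = {recall:.2f}")
# accuracy = 0.99, recall = 0.00
```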
2.2 Precision (Positive Predictive Value)
Of all instances the model predicted as positive, what fraction actually are positive? $\text{Precision} = \frac{TP}{TP + FP}$. High precision means few false alarms.
Use when: False positives are costly (e.g., spam filter marking legitimate emails as spam).
2.3 Recall (Sensitivity, True Positive Rate)
Of all actual positive instances, what fraction did the model catch? $\text{Recall} = \frac{TP}{TP + FN}$. High recall means few missed positives.
Use when: False negatives are costly (e.g., failing to detect cancer).
2.4 F1-Score
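The harmonic mean of precision and recall: $F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$. It balances both concerns, and it stays low whenever either precision or recall is low, which the arithmetic mean does not.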
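A small sketch of precision, recall, and F1 computed from confusion-matrix counts. The counts are invented to represent a model that flags almost everything as positive, and returning 0.0 on a zero denominator is a convention chosen here, not a universal rule.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up counts: high recall bought with many false alarms.
precision, recall, f1 = precision_recall_f1(tp=90, fp=810, fn=10)
arithmetic_mean = (precision + recall) / 2

print(f"precision={precision:.2f} recall={recall:.2f}")
print(f"F1={f1:.2f} arithmetic mean={arithmetic_mean:.2f}")
```

Here precision is 0.10 and recall is 0.90. The arithmetic mean is 0.50, while F1 is only 0.18, reflecting the weak precision.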
2.5 Fβ-Score (Generalized)
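$$F_\beta = (1 + \beta^2) \cdot \frac{\text{Precision} \cdot \text{Recall}}{\beta^2 \cdot \text{Precision} + \text{Recall}}$$
- $\beta = 1$: Equal weight (standard F1)
- $\beta = 2$: Recall weighted 2× more than precision
- $\beta = 0.5$: Precision weighted 2× more than recall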
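The effect of β can be seen with a short sketch using the standard definition $F_\beta = (1 + \beta^2) \cdot PR / (\beta^2 P + R)$. The precision and recall values below are arbitrary.

```python
def fbeta(precision, recall, beta=1.0):
    """F-beta score; beta > 1 favors recall, beta < 1 favors precision."""
    b2 = beta ** 2
    denom = b2 * precision + recall
    if denom == 0:
        return 0.0  # convention chosen here for the degenerate case
    return (1 + b2) * precision * recall / denom


# Arbitrary example: a model with better recall than precision.
precision, recall = 0.5, 0.8
f1 = fbeta(precision, recall, beta=1.0)
f2 = fbeta(precision, recall, beta=2.0)
f_half = fbeta(precision, recall, beta=0.5)

print(f"F0.5={f_half:.3f}  F1={f1:.3f}  F2={f2:.3f}")
```

Because recall is the stronger of the two here, the score rises as β grows: F0.5 is lowest and F2 is highest.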
Comparison Table
| Metric | Range | Best When | Weakness |
|---|---|---|---|
| Accuracy | [0, 1] | Balanced classes | Misleading with imbalance |
| Precision | [0, 1] | FP costly | Ignores FN |
| Recall | [0, 1] | FN costly | Ignores FP |
| F1 | [0, 1] | Balanced need | Ignores TN |
Multi-class extensions: For K classes, compute precision, recall, F1 per class, then aggregate:
- Macro-average: Unweighted mean across classes (treats all classes equally)
- Micro-average: Aggregate TP, FP, FN globally, then compute (dominated by the most frequent classes; for single-label problems it equals accuracy)
- Weighted-average: Weighted by class support (number of instances per class)
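The three aggregation schemes can be compared on a tiny imbalanced example. This is a sketch with hypothetical helper names (`per_class_counts`, `averaged_f1`) and made-up labels. It uses the identity F1 = 2TP / (2TP + FP + FN).

```python
def per_class_counts(y_true, y_pred, labels):
    """One-vs-rest (TP, FP, FN) for each class label."""
    counts = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        counts[label] = (tp, fp, fn)
    return counts


def f1_from_counts(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def averaged_f1(y_true, y_pred, labels):
    counts = per_class_counts(y_true, y_pred, labels)
    per_class = {c: f1_from_counts(*counts[c]) for c in labels}
    support = {c: sum(1 for t in y_true if t == c) for c in labels}

    macro = sum(per_class.values()) / len(labels)
    micro = f1_from_counts(
        sum(v[0] for v in counts.values()),
        sum(v[1] for v in counts.values()),
        sum(v[2] for v in counts.values()),
    )
    weighted = sum(per_class[c] * support[c] for c in labels) / len(y_true)
    return {"macro": macro, "micro": micro, "weighted": weighted}


# Made-up data: class "a" is common, class "b" is rare and half missed.
y_true = ["a"] * 8 + ["b"] * 2
y_pred = ["a"] * 8 + ["a", "b"]
result = averaged_f1(y_true, y_pred, labels=["a", "b"])
print(result)
```

The macro average is pulled down most by the weak rare class, while the micro average (0.9 here, equal to accuracy) mostly reflects the common class.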
3. The ROC Curve
3.1 Concept
The Receiver Operating Characteristic (ROC) curve plots the tradeoff between catching positives and raising false alarms as the classification threshold varies.
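For each possible threshold $t$:
- Classify instances with score ≥ $t$ as positive
- Compute the (FPR, TPR) pair, where $\text{FPR} = \frac{FP}{FP + TN}$ and $\text{TPR} = \frac{TP}{TP + FN}$
- Plot TPR (y-axis) against FPR (x-axis)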
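The threshold sweep above can be sketched as follows. The function name `roc_points` and the scores are illustrative, and the code assumes both classes are present so that neither denominator is zero.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) points, sweeping the threshold from high to low."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    # A threshold above every score predicts nothing as positive.
    points = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


# Made-up labels and model scores.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(y_true, scores)
for fpr, tpr in points:
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

The curve starts at (0, 0) and ends at (1, 1). Lowering the threshold can only add predicted positives, so both rates are non-decreasing along the sweep.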
3.2 Interpreting the ROC Curve
- Top-left corner (0, 1): Perfect classifier — catches all positives with no false alarms
- Diagonal line: Random guessing — no discriminative power
- Below diagonal: Worse than random (usually indicates flipped labels)
- Closer to top-left: Better model
3.3 Threshold Selection
The ROC curve does not tell you which threshold to use. Common strategies:
- Closest to (0, 1): Minimize $\sqrt{\text{FPR}^2 + (1 - \text{TPR})^2}$
- Youden's J statistic: Maximize $J = \text{TPR} - \text{FPR}$
- Cost-based: Assign costs to FP and FN, minimize expected cost
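The three strategies can be compared in a short sketch. The helper names, the data, and the cost values (a false positive assumed five times as costly as a false negative) are all hypothetical, and candidate thresholds are simply the distinct scores.

```python
import math


def threshold_table(y_true, scores):
    """For each distinct score t, count errors when predicting score >= t."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    rows = []
    for t in sorted(set(scores)):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 0)
        rows.append({"t": t, "fp": fp, "fn": n_pos - tp,
                     "fpr": fp / n_neg, "tpr": tp / n_pos})
    return rows


def select_thresholds(y_true, scores, cost_fp, cost_fn):
    rows = threshold_table(y_true, scores)
    closest = min(rows, key=lambda r: math.hypot(r["fpr"], 1 - r["tpr"]))
    youden = max(rows, key=lambda r: r["tpr"] - r["fpr"])
    cheapest = min(rows, key=lambda r: cost_fp * r["fp"] + cost_fn * r["fn"])
    return {"closest_to_corner": closest["t"],
            "youden_j": youden["t"],
            "min_cost": cheapest["t"]}


# Made-up labels and scores.
y_true = [1, 1, 0, 1, 1, 0, 0, 0]
scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.5, 0.3, 0.1]

chosen = select_thresholds(y_true, scores, cost_fp=5.0, cost_fn=1.0)
print(chosen)
# {'closest_to_corner': 0.6, 'youden_j': 0.6, 'min_cost': 0.9}
```

The two geometric criteria agree on 0.6 for this data, while the cost-based rule moves the threshold up to 0.9 because false positives were assumed to be expensive.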
4. AUC — Area Under the ROC Curve
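AUC is the area under the ROC curve: $\text{AUC} = \int_0^1 \text{TPR} \, d(\text{FPR})$. Equivalently, AUC is the probability that a randomly chosen positive instance is ranked higher than a randomly chosen negative instance:
$$\text{AUC} = P\big(s(x^+) > s(x^-)\big)$$
where $s(x)$ is the model's score for instance $x$.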
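The ranking interpretation can be computed directly by comparing every positive-negative pair. This is a brute-force sketch with made-up data. Counting ties as half a win follows the usual Mann-Whitney convention.

```python
def auc_pairwise(y_true, scores):
    """Fraction of (positive, negative) pairs where the positive scores higher."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(pos) * len(neg))


# Made-up labels and scores: one positive (0.6) ranks below one negative (0.7).
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

auc = auc_pairwise(y_true, scores)
print(f"AUC = {auc:.3f}")  # 8 of 9 pairs are ordered correctly
```

The pairwise loop costs O(positives × negatives), so it is meant for intuition rather than large datasets.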
AUC Interpretation
| AUC Range | Interpretation |
|---|---|
| 0.90 – 1.00 | Excellent |
| 0.80 – 0.90 | Good |
| 0.70 – 0.80 | Fair |
| 0.60 – 0.70 | Poor |
| 0.50 – 0.60 | Fail (little better than random) |
AUC is threshold-invariant: It evaluates the model's ranking ability across all thresholds simultaneously.
AUC is class-prevalence-invariant: Unlike accuracy, it doesn't change when class proportions shift.
AUC can be misleading with severe class imbalance. A model with AUC = 0.95 may still have poor precision at the operating point needed for deployment. Always examine the ROC curve itself, not just the single AUC number.
5. Precision-Recall (PR) Curve
When the positive class is rare, the PR curve provides a more informative view than ROC.
The PR curve plots Precision (y-axis) vs. Recall (x-axis) across all thresholds.
Average Precision (AP)
$$\text{AP} = \sum_n (R_n - R_{n-1}) \, P_n$$
where $R_n$ and $P_n$ are recall and precision at the $n$-th threshold. AP summarizes the PR curve into a single number (area under the PR curve).
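A minimal sketch of the AP sum, stepping through thresholds from the highest score down and weighting each precision by the recall gained at that step. The function name and data are illustrative, and the code assumes at least one positive label.

```python
def average_precision(y_true, scores):
    """AP as the sum of (recall gain) * precision over descending thresholds."""
    n_pos = sum(y_true)
    ap = 0.0
    prev_recall = 0.0
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 0)
        precision = tp / (tp + fp)  # at least one score is >= t
        recall = tp / n_pos
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap


# Made-up labels and scores.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

ap = average_precision(y_true, scores)
print(f"AP = {ap:.4f}")  # 1/3 + 1/3 + 1/4 = 0.9167
```

Only thresholds that pick up a new positive contribute, since the recall gain is zero everywhere else.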
When to Use PR vs. ROC
| Scenario | Prefer |
|---|---|
| Balanced classes | ROC |
| Severe imbalance (positive class rare) | PR |
| Both classes matter equally | ROC |
| False positives very costly | PR |
| False negatives very costly | ROC (or PR with recall focus) |
6. Regression Metrics
6.1 Mean Squared Error (MSE)
$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$. Penalizes large errors quadratically. Sensitive to outliers. Units are squared.
6.2 Root Mean Squared Error (RMSE)
$\text{RMSE} = \sqrt{\text{MSE}}$. Same units as the target variable, so more interpretable than MSE.
6.3 Mean Absolute Error (MAE)
$\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|$. More robust to outliers than MSE because the penalty grows linearly with the error.
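To see how differently the three metrics react to one bad prediction, the sketch below scores the same made-up targets twice: once with every error equal to 1, and once with a single error of 10.

```python
import math


def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    return math.sqrt(mse(y_true, y_pred))


def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up targets and predictions.
y_true = [10.0, 12.0, 14.0, 16.0]
y_pred_clean = [11.0, 11.0, 15.0, 15.0]    # every error is 1
y_pred_outlier = [11.0, 11.0, 15.0, 26.0]  # last error is 10

for name, y_pred in [("clean", y_pred_clean), ("outlier", y_pred_outlier)]:
    print(f"{name:8s} MSE={mse(y_true, y_pred):6.2f} "
          f"RMSE={rmse(y_true, y_pred):5.2f} MAE={mae(y_true, y_pred):5.2f}")
```

The single outlier raises MAE from 1.00 to 3.25 but raises MSE from 1.00 to 25.75, which shows the quadratic penalty at work.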
6.4 R² (Coefficient of Determination)
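$$R^2 = 1 - \frac{\sum_{i} (y_i - \hat{y}_i)^2}{\sum_{i} (y_i - \bar{y})^2}$$
- $R^2 = 1$: Perfect prediction
- $R^2 = 0$: Model performs no better than predicting the mean
- $R^2 < 0$: Model is worse than the mean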
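The three reference cases can be reproduced with a short sketch using the usual definition, one minus the ratio of residual to total sum of squares. The data is made up, and the function assumes the targets are not all identical.

```python
def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot (assumes y_true is not constant)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0, 5.0]

perfect = r_squared(y_true, [1.0, 2.0, 3.0, 4.0, 5.0])
mean_only = r_squared(y_true, [3.0] * 5)                 # always predict the mean
reversed_fit = r_squared(y_true, [5.0, 4.0, 3.0, 2.0, 1.0])

print(perfect, mean_only, reversed_fit)  # 1.0 0.0 -3.0
```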
Metric Comparison
| Metric | Sensitive to Outliers | Interpretable Units | Penalizes Large Errors |
|---|---|---|---|
| MSE | Yes (quadratic) | No (squared) | Strongly |
| RMSE | Yes (quadratic) | Yes | Strongly |
| MAE | Less (linear) | Yes | Linearly |
| R² | Depends | No (normalized) | Relative |
Adjusted R² penalizes model complexity: $R^2_{\text{adj}} = 1 - \frac{(1 - R^2)(n - 1)}{n - p - 1}$, where $n$ is the number of samples and $p$ is the number of predictors. Use this to compare models with different numbers of features.
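A quick sketch of the adjustment, using the standard formula $1 - (1 - R^2)(n - 1)/(n - p - 1)$ with $n$ samples and $p$ predictors. The R² value and the sample and predictor counts are hypothetical.

```python
def adjusted_r_squared(r2, n_samples, n_predictors):
    """Adjusted R^2; requires n_samples > n_predictors + 1."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)


# Two hypothetical models with the same raw R^2 on 50 samples.
small_model = adjusted_r_squared(0.80, n_samples=50, n_predictors=5)
large_model = adjusted_r_squared(0.80, n_samples=50, n_predictors=20)

print(f"5 predictors:  {small_model:.3f}")
print(f"20 predictors: {large_model:.3f}")
```

With the same raw R² of 0.80, the model with 20 predictors is penalized more (about 0.662) than the model with 5 (about 0.777).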
7. Bias-Variance Tradeoff
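For squared-error loss, the expected prediction error of a model $\hat{f}$ at a point $x$ can be decomposed as:
$$\mathbb{E}\big[(y - \hat{f}(x))^2\big] = \text{Bias}\big[\hat{f}(x)\big]^2 + \text{Var}\big[\hat{f}(x)\big] + \sigma^2$$
where $\sigma^2$ is the irreducible noise in the data.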
Definitions
Bias (systematic error): Error from erroneous assumptions in the learning algorithm. High bias causes the model to miss relevant relations between features and targets.
Variance (sensitivity to data): Error from sensitivity to small fluctuations in the training set. High variance causes the model to fit random noise rather than the underlying signal.
The Tradeoff
- Increasing model complexity → decreases bias, increases variance
- Decreasing model complexity → increases bias, decreases variance
- Optimal complexity → minimizes total error
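The tradeoff can be illustrated with a deliberately simple stand-in for model complexity: estimating a mean from a small noisy sample and shrinking the estimate toward zero. This is a toy simulation under assumed values (true mean 1.0, noise standard deviation 3.0, 5 samples per trial), not a general recipe. It measures the error of the estimate itself, so the irreducible noise term does not appear.

```python
import random


def simulate(shrink, true_mean=1.0, noise_sd=3.0, n_samples=5,
             n_trials=5000, seed=0):
    """Return (bias^2, variance, mse) of the estimator shrink * sample_mean."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_trials):
        sample = [rng.gauss(true_mean, noise_sd) for _ in range(n_samples)]
        estimates.append(shrink * sum(sample) / n_samples)

    mean_est = sum(estimates) / n_trials
    bias_sq = (mean_est - true_mean) ** 2
    variance = sum((e - mean_est) ** 2 for e in estimates) / n_trials
    mse = sum((e - true_mean) ** 2 for e in estimates) / n_trials
    return bias_sq, variance, mse


# shrink=1.0 is the flexible estimator; shrink=0.0 ignores the data entirely.
results = {shrink: simulate(shrink) for shrink in (1.0, 0.5, 0.0)}
for shrink, (bias_sq, variance, mse) in results.items():
    print(f"shrink={shrink:.1f}  bias^2={bias_sq:.3f}  "
          f"variance={variance:.3f}  mse={mse:.3f}")
```

Under these assumptions the intermediate setting gives the lowest error. The unshrunk estimator has almost no bias but high variance, and the fully shrunk one has zero variance but a large bias.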
Practical Implications
| Symptom | Diagnosis | Remedies |
|---|---|---|
| High training error | High bias (underfitting) | More features, more complex model, reduce regularization |
| Low training error, high test error | High variance (overfitting) | More data, regularization, ensemble methods, feature selection |
| High training and test error | Very high bias | Fundamentally wrong model class |
8. Learning Curves
Learning curves plot model performance as a function of training set size, revealing whether the model suffers from bias, variance, or both.
Reading Learning Curves
High Variance (overfitting):
- Training error: Low and stays low
- Validation error: Starts high, slowly decreases
- Large gap between curves
- Fix: More data, regularization, simpler model, feature reduction
High Bias (underfitting):
- Training error: Rises as data is added, then plateaus at a moderate/high level
- Validation error: Starts high, converges toward training error
- Small gap, both converge to high error
- Fix: More complex model, more features, reduce regularization
The Irreducible Error
Even with unlimited data, validation error cannot drop below a floor above zero: the irreducible error ($\sigma^2$), the noise inherent in the data that no model can eliminate.
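A learning curve can be computed by refitting the same model on growing prefixes of the training set and scoring each fit on a fixed validation set. The sketch below uses a closed-form least-squares line on synthetic data. The data-generating process (y = 2x + 1 plus Gaussian noise with variance 1) and all function names are assumptions made for illustration.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b (closed form)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, mean_y - a * mean_x


def mse(xs, ys, a, b):
    return sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys)) / len(xs)


def make_data(n, rng, noise_sd=1.0):
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [2.0 * x + 1.0 + rng.gauss(0, noise_sd) for x in xs]
    return xs, ys


def learning_curve(train_sizes, n_val=500, seed=0):
    """Return (size, training MSE, validation MSE) for each training size."""
    rng = random.Random(seed)
    train_x, train_y = make_data(max(train_sizes), rng)
    val_x, val_y = make_data(n_val, rng)
    rows = []
    for n in train_sizes:
        a, b = fit_line(train_x[:n], train_y[:n])
        rows.append((n, mse(train_x[:n], train_y[:n], a, b),
                     mse(val_x, val_y, a, b)))
    return rows


rows = learning_curve([5, 10, 20, 50, 200])
for n, train_err, val_err in rows:
    print(f"n={n:4d}  train MSE={train_err:.3f}  validation MSE={val_err:.3f}")
```

Because the model class matches the data here, both errors should settle near the noise variance of 1.0 as the training set grows, which is the floor described above.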
9. Summary Table
| Metric | Type | Best For | Sensitive to Imbalance? |
|---|---|---|---|
| Accuracy | Classification | Balanced classes | Yes (misleading) |
| Precision | Classification | FP costly | Yes (depends on prevalence) |
| Recall | Classification | FN costly | No |
| F1-Score | Classification | Balanced P/R | Moderate |
| AUC-ROC | Classification | Ranking quality | No |
| AP / AUC-PR | Classification | Imbalanced positive class | Yes (random baseline equals prevalence) |
| MSE/RMSE | Regression | Large error penalty | N/A |
| MAE | Regression | Robust to outliers | N/A |
| R² | Regression | Variance explained | N/A |
10. Key Takeaways
- Never use accuracy alone with imbalanced classes — use precision, recall, F1, or AUC.
- ROC curves show the full threshold tradeoff; AUC summarizes ranking ability.
- PR curves are more informative than ROC when the positive class is rare.
- Bias-variance decomposition diagnoses underfitting vs. overfitting.
- Learning curves reveal whether more data or a more complex model will help.
- Choose metrics aligned with business costs — what matters more: false alarms or missed detections?
Next up: Module 8 covers hyperparameter tuning and model selection — how to systematically optimize the models you've learned to evaluate.