Model Evaluation: ROC, AUC, Precision, Recall and F1

Why Model Evaluation Matters

A model is only as good as our ability to measure its performance. Choosing the wrong metric can lead to catastrophically poor deployment decisions. This lesson covers the essential evaluation toolkit every ML practitioner must master.

1. The Confusion Matrix

For a binary classifier, the confusion matrix partitions predictions into four categories:

	Predicted Positive	Predicted Negative
Actual Positive	True Positive (TP)	False Negative (FN)
Actual Negative	False Positive (FP)	True Negative (TN)

Figure: Confusion Matrix. Rows are actual classes and columns are predicted classes. TP (correctly predicted positive) and TN (correctly predicted negative) are correct predictions; FN (missed positive, Type II) and FP (false alarm, Type I) are errors.

Key Definitions

Positive class = the class of interest (e.g., "has cancer", "fraudulent transaction")
TP: Model predicts positive, actual is positive
FP: Model predicts positive, actual is negative (Type I error)
FN: Model predicts negative, actual is positive (Type II error)
TN: Model predicts negative, actual is negative
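The four definitions above can be counted directly from paired labels. The following is a minimal sketch in plain Python; the function name `confusion_counts` and the toy labels are illustrative assumptions, not part of any library.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, FN, TN), treating `positive` as the class of interest."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1  # predicted positive, actually positive
            else:
                fp += 1  # predicted positive, actually negative (Type I)
        else:
            if actual == positive:
                fn += 1  # predicted negative, actually positive (Type II)
            else:
                tn += 1  # predicted negative, actually negative
    return tp, fp, fn, tn


# Toy labels, made up for illustration.
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0, 0, 0]

print(confusion_counts(y_true, y_pred))              # (3, 1, 2, 4)
print(confusion_counts(y_true, y_pred, positive=0))  # (4, 2, 1, 3)
```

The second call treats class 0 as positive, which swaps TP with TN and FP with FN.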

⚠️

The choice of which class is "positive" is arbitrary but critical. In disease screening, the diseased class is typically positive. Changing this swaps TP↔TN and FP↔FN.

2. Classification Metrics

2.1 Accuracy

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

The fraction of all predictions that are correct. It is misleading when classes are imbalanced: a model predicting "no cancer" for all patients in a population with 1% prevalence achieves 99% accuracy while detecting no cases at all.
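The arithmetic behind the 1% prevalence example can be checked in a few lines. This sketch assumes a hypothetical population of 1,000 patients, which is an illustrative number only.

```python
n_patients = 1000  # assumed population size, for illustration
n_positive = 10    # 1% prevalence

# A degenerate classifier that predicts "no cancer" for everyone.
tp, fp = 0, 0
fn, tn = n_positive, n_patients - n_positive

accuracy = (tp + tn) / (tp + fp + fn + tn)
recall = tp / (tp + fn)  # fraction of actual positives that were caught

print(f"accuracy={accuracy:.2f}, recall={recall:.2f}")  # accuracy=0.99, recall=0.00
```

Accuracy looks excellent only because the true negatives dominate the count; every positive case is missed.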

2.2 Precision (Positive Predictive Value)

$$\text{Precision} = \frac{TP}{TP + FP}$$

Of all instances the model predicted as positive, what fraction actually are positive? High precision means few false alarms.

Use when: False positives are costly (e.g., spam filter marking legitimate emails as spam).

2.3 Recall (Sensitivity, True Positive Rate)

$$\text{Recall} = \frac{TP}{TP + FN}$$

Of all actual positive instances, what fraction did the model catch? High recall means few missed positives.

Use when: False negatives are costly (e.g., failing to detect cancer).

2.4 F1-Score

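$$F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2\,TP}{2\,TP + FP + FN}$$

The harmonic mean of precision and recall, balancing both concerns. The harmonic mean is pulled toward the smaller of the two values, so a model cannot score well by excelling at only one: precision 1.0 with recall 0.1 gives F1 ≈ 0.18, whereas the arithmetic mean would be 0.55.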
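Precision, recall, and F1 all follow from three of the confusion-matrix counts. A minimal sketch follows; the helper name `precision_recall_f1` is hypothetical, and returning 0.0 when a denominator is zero is an assumed convention rather than a universal rule.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from raw counts.

    Assumed convention: a metric with a zero denominator is reported as 0.0.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Illustrative counts: 8 true positives, 2 false alarms, 8 missed positives.
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=8)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")

# Extreme imbalance between the two: perfect precision, very low recall.
p2, r2, f1_extreme = precision_recall_f1(tp=1, fp=0, fn=9)
print(f"f1={f1_extreme:.3f} vs arithmetic mean={(p2 + r2) / 2:.3f}")
```

In the second case F1 is far below the arithmetic mean, which is the penalty for neglecting one of the two metrics.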

2.5 Fβ-Score (Generalized)

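$$F_\beta = (1 + \beta^2) \cdot \frac{\text{Precision} \cdot \text{Recall}}{\beta^2 \cdot \text{Precision} + \text{Recall}}$$

$\beta = 1$: Equal weight (standard F1)
$\beta = 2$: Recall weighted 2× more than precision
$\beta = 0.5$: Precision weighted 2× more than recall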
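To see how β shifts the balance, the standard definition F_β = (1 + β²)·P·R / (β²·P + R) can be evaluated for one precision/recall pair. The function name `fbeta` and the example values are assumptions made for illustration.

```python
def fbeta(precision, recall, beta=1.0):
    """F-beta score; beta > 1 favors recall, beta < 1 favors precision."""
    b2 = beta ** 2
    denom = b2 * precision + recall
    # Assumed convention: report 0.0 when both precision and recall are zero.
    return (1 + b2) * precision * recall / denom if denom else 0.0


precision, recall = 0.8, 0.5  # illustrative values: precision is the stronger metric

f_half = fbeta(precision, recall, beta=0.5)
f_one = fbeta(precision, recall, beta=1.0)
f_two = fbeta(precision, recall, beta=2.0)
print(f"F0.5={f_half:.3f}  F1={f_one:.3f}  F2={f_two:.3f}")
```

Because recall is the weaker metric here, the score drops as β grows and recall receives more weight.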

Comparison Table

Metric	Range	Best When	Weakness
Accuracy	[0, 1]	Balanced classes	Misleading with imbalance
Precision	[0, 1]	FP costly	Ignores FN
Recall	[0, 1]	FN costly	Ignores FP
F1	[0, 1]	Balanced need	Ignores TN

ℹ️

Multi-class extensions: For K classes, compute precision, recall, F1 per class, then aggregate:

Macro-average: Unweighted mean across classes (treats all classes equally)
Micro-average: Aggregate TP, FP, FN globally, then compute (biased toward majority class)
Weighted-average: Weighted by class support (number of instances per class)
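The three aggregation schemes can be compared on a small imbalanced example. This is a sketch in plain Python; the helper names and the two-class toy data are illustrative assumptions.

```python
def per_class_counts(y_true, y_pred, labels):
    """Return {label: (TP, FP, FN)} using a one-vs-rest view of each class."""
    counts = {}
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        counts[c] = (tp, fp, fn)
    return counts


def f1_from_counts(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def averaged_f1(y_true, y_pred, labels):
    counts = per_class_counts(y_true, y_pred, labels)
    per_class = {c: f1_from_counts(*counts[c]) for c in labels}
    support = {c: counts[c][0] + counts[c][2] for c in labels}  # actual instances per class
    total = sum(support.values())

    macro = sum(per_class.values()) / len(labels)
    weighted = sum(per_class[c] * support[c] for c in labels) / total
    micro = f1_from_counts(
        sum(counts[c][0] for c in labels),
        sum(counts[c][1] for c in labels),
        sum(counts[c][2] for c in labels),
    )
    return macro, micro, weighted


# Toy data: class "a" is common and predicted well, class "b" is rare and half missed.
y_true = ["a"] * 8 + ["b"] * 2
y_pred = ["a"] * 8 + ["a", "b"]

macro, micro, weighted = averaged_f1(y_true, y_pred, labels=["a", "b"])
print(f"macro={macro:.3f}  micro={micro:.3f}  weighted={weighted:.3f}")
```

The macro average is the lowest of the three here because the poorly predicted rare class counts as much as the common one.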

3. The ROC Curve

3.1 Concept

The Receiver Operating Characteristic (ROC) curve plots the tradeoff between catching positives and raising false alarms as the classification threshold varies.

For each possible threshold $t$:

Classify instances with score $\geq t$ as positive
Compute the (FPR, TPR) pair, where TPR = TP / (TP + FN) and FPR = FP / (FP + TN)
Plot the resulting points to form the curve
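The threshold sweep above can be sketched as follows. The function name `roc_points` and the toy scores are assumptions; the sketch expects 0/1 labels with both classes present and favors clarity over speed (it rescans the data for every threshold).

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs, from the strictest threshold to the loosest."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing is predicted positive
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


# Toy labels and model scores, made up for illustration.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

for fpr, tpr in roc_points(y_true, scores):
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

The curve always starts at (0, 0) and ends at (1, 1); what distinguishes models is the path taken between those two corners.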

3.2 Interpreting the ROC Curve

Top-left corner (0, 1): Perfect classifier — catches all positives with no false alarms
Diagonal line: Random guessing — no discriminative power
Below diagonal: Worse than random (usually indicates flipped labels)
Closer to top-left: Better model

3.3 Threshold Selection

The ROC curve does not tell you which threshold to use. Common strategies:

Closest to (0, 1): Minimize $\sqrt{\text{FPR}^2 + (1 - \text{TPR})^2}$
Youden's J statistic: Maximize $J = \text{TPR} - \text{FPR}$
Cost-based: Assign costs to FP and FN, minimize expected cost
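The three strategies can be compared on a handful of candidate operating points. In this sketch the (threshold, FPR, TPR) triples, the prevalence, and the cost values are all made-up assumptions, and the expected-cost formula is one simple way to express the cost-based strategy.

```python
import math

# Hypothetical (threshold, FPR, TPR) triples, e.g. read off an ROC curve.
candidates = [
    (0.9, 0.00, 0.40),
    (0.7, 0.10, 0.70),
    (0.5, 0.20, 0.90),
    (0.3, 0.50, 0.95),
    (0.1, 1.00, 1.00),
]


def closest_to_corner(cands):
    """Pick the point with the smallest Euclidean distance to (FPR=0, TPR=1)."""
    return min(cands, key=lambda c: math.hypot(c[1], 1.0 - c[2]))


def youden(cands):
    """Pick the point maximizing J = TPR - FPR."""
    return max(cands, key=lambda c: c[2] - c[1])


def min_expected_cost(cands, prevalence, cost_fp, cost_fn):
    """Pick the point minimizing expected per-instance misclassification cost."""
    def cost(c):
        _, fpr, tpr = c
        return cost_fp * fpr * (1 - prevalence) + cost_fn * (1 - tpr) * prevalence
    return min(cands, key=cost)


print("closest to (0, 1):", closest_to_corner(candidates)[0])
print("Youden's J:       ", youden(candidates)[0])
print("cost-based:       ", min_expected_cost(candidates, 0.01, 1.0, 10.0)[0])
```

With 1% prevalence and a false negative costing ten times a false positive, the cost-based rule in this example picks a stricter threshold than the two geometric rules, because false positives are drawn from a much larger pool of negatives.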

4. AUC — Area Under the ROC Curve

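$$\text{AUC} = \int_0^1 \text{TPR} \, d(\text{FPR})$$

Equivalently, AUC is the probability that a randomly chosen positive instance is ranked higher than a randomly chosen negative instance:

$$\text{AUC} = P\big(s(x^+) > s(x^-)\big)$$

where $s(\cdot)$ is the model's score, $x^+$ is a random positive instance, and $x^-$ is a random negative instance.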
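The ranking interpretation suggests a direct, if quadratic-time, way to compute AUC: compare every positive score with every negative score. This sketch assumes 0/1 labels and counts ties as half a win, a common convention; the name `auc_pairwise` is hypothetical.

```python
def auc_pairwise(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # assumed convention: a tie counts as half
    return wins / (len(pos) * len(neg))


# Toy labels and scores, made up for illustration.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

auc = auc_pairwise(y_true, scores)
print(f"AUC={auc:.3f}")  # AUC=0.889
```

Of the nine positive/negative pairs, only one is misordered (the positive scored 0.6 against the negative scored 0.7), giving 8/9.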

AUC Interpretation

AUC Range	Interpretation
0.90 – 1.00	Excellent
0.80 – 0.90	Good
0.70 – 0.80	Fair
0.60 – 0.70	Poor
0.50 – 0.60	Fail (little or no better than random)

AUC is threshold-invariant: It evaluates the model's ranking ability across all thresholds simultaneously.

AUC is class-prevalence-invariant: Unlike accuracy, it doesn't change when class proportions shift.

⚠️

AUC can be misleading with severe class imbalance. A model with AUC = 0.95 may still have poor precision at the operating point needed for deployment. Always examine the ROC curve itself, not just the single AUC number.

5. Precision-Recall (PR) Curve

When the positive class is rare, the PR curve provides a more informative view than ROC.

The PR curve plots Precision (y-axis) vs. Recall (x-axis) across all thresholds.

Average Precision (AP)

$$\text{AP} = \sum_n (R_n - R_{n-1}) \, P_n$$

where $R_n$ and $P_n$ are recall and precision at the $n$-th threshold. AP summarizes the PR curve in a single number (an estimate of the area under the PR curve).
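Average precision can be computed as a sum of precision values weighted by the increase in recall at each threshold, AP = Σ (R_n − R_{n−1})·P_n. A minimal sketch, assuming 0/1 labels with at least one positive; the function name and toy data are illustrative.

```python
def average_precision(y_true, scores):
    """Step-wise area under the precision-recall curve."""
    n_pos = sum(y_true)
    ap = 0.0
    prev_recall = 0.0
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 0)
        precision = tp / (tp + fp)  # at least one instance scores >= t
        recall = tp / n_pos
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap


# Toy labels and scores, made up for illustration.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

ap = average_precision(y_true, scores)
print(f"AP={ap:.3f}")  # AP=0.917
```

Only thresholds that add a new true positive change recall, so thresholds that admit only negatives contribute nothing to the sum.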

When to Use PR vs. ROC

Scenario	Prefer
Balanced classes	ROC
Severe imbalance (positive class rare)	PR
Both classes matter equally	ROC
False positives very costly	PR
False negatives very costly	ROC (or PR with recall focus)

6. Regression Metrics

6.1 Mean Squared Error (MSE)

$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

Penalizes large errors quadratically. Sensitive to outliers. Units are the square of the target's units.

6.2 Root Mean Squared Error (RMSE)

$$\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$$

Same units as the target variable. More interpretable than MSE.

6.3 Mean Absolute Error (MAE)

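$$\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|$$

More robust to outliers than MSE or RMSE, because errors are penalized linearly rather than quadratically.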

6.4 R² (Coefficient of Determination)

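$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$

$R^2 = 1$: Perfect prediction
$R^2 = 0$: Model performs no better than predicting the mean
$R^2 < 0$: Model is worse than the mean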
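The four regression metrics can be computed side by side to see how one large error affects each. This is a plain-Python sketch; the function name `regression_metrics` and the toy values are illustrative assumptions.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE, MAE, and R^2 for paired targets and predictions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)            # residual sum of squares
    mean_y = sum(y_true) / n
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)  # variation around the mean
    mse = ss_res / n
    return {
        "mse": mse,
        "rmse": math.sqrt(mse),
        "mae": sum(abs(e) for e in errors) / n,
        "r2": 1.0 - ss_res / ss_tot,
    }


# Toy data: the last prediction is off by 4, the others by at most 1.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 13.0]

metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

The single large error contributes 16 of the 18 units of squared error, so RMSE (about 2.12) sits well above MAE (1.5).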

Metric Comparison

Metric	Sensitive to Outliers	Interpretable Units	Penalizes Large Errors
MSE	Yes (quadratic)	No (squared)	Strongly
RMSE	Yes (quadratic)	Yes	Strongly
MAE	No (linear)	Yes	Linearly
R²	Depends	No (normalized)	Relative

ℹ️

Adjusted R² penalizes model complexity: $\bar{R}^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}$, where $n$ is the number of samples and $p$ is the number of predictors. Use this to compare models with different numbers of features.
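A short sketch shows how the penalty grows with the number of predictors, using the standard form 1 − (1 − R²)(n − 1)/(n − p − 1). The function name and the example numbers are assumptions chosen for illustration.

```python
def adjusted_r2(r2, n_samples, n_predictors):
    """Adjusted R^2; requires n_samples > n_predictors + 1."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)


# Two hypothetical models with the same raw R^2 on 50 samples.
small_model = adjusted_r2(0.80, n_samples=50, n_predictors=5)
large_model = adjusted_r2(0.80, n_samples=50, n_predictors=20)
print(f"5 predictors:  {small_model:.3f}")
print(f"20 predictors: {large_model:.3f}")
```

Both adjusted values fall below the raw 0.80, and the model with more predictors is penalized more heavily.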

7. Bias-Variance Tradeoff

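The expected prediction error for a model can be decomposed as:

$$\mathbb{E}\big[(y - \hat{f}(x))^2\big] = \text{Bias}\big[\hat{f}(x)\big]^2 + \text{Var}\big[\hat{f}(x)\big] + \sigma^2$$

where $\hat{f}(x)$ is the model's prediction and $\sigma^2$ is the irreducible noise in the data.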

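Figure: Bias-Variance Decomposition. Predictions are shown relative to the target (true value) for three cases: high bias with low variance (underfitting), low bias with high variance (overfitting), and low bias with low variance (good fit). Horizontal axis: model complexity, from simple to complex.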

Definitions

Bias (systematic error): Error from erroneous assumptions in the learning algorithm. High bias causes the model to miss relevant relations between features and targets.

Variance (sensitivity to data): Error from sensitivity to small fluctuations in the training set. High variance causes the model to model random noise.

The Tradeoff

Increasing model complexity → decreases bias, increases variance
Decreasing model complexity → increases bias, decreases variance
Optimal complexity → minimizes total error
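The tradeoff can be observed empirically by refitting a very simple and a very flexible model on many resampled training sets and inspecting their predictions at one point. Everything in this sketch is an assumption made for illustration: the "true" function x², the noise level, the two toy models (a constant predictor and 1-nearest-neighbor), and the query point 0.9.

```python
import random
import statistics


def f(x):
    return x * x  # hypothetical true function


def sample_training_set(rng, n=20, noise_sd=0.3):
    xs = [rng.random() for _ in range(n)]
    ys = [f(x) + rng.gauss(0.0, noise_sd) for x in xs]
    return xs, ys


def predict_constant(xs, ys, x0):
    """Very simple model: always predict the mean training target."""
    return sum(ys) / len(ys)


def predict_1nn(xs, ys, x0):
    """Very flexible model: copy the target of the nearest training point."""
    nearest = min(range(len(xs)), key=lambda i: abs(xs[i] - x0))
    return ys[nearest]


def bias2_and_variance(predict, x0=0.9, trials=2000, seed=0):
    """Estimate squared bias and variance of predictions at x0 over many training sets."""
    rng = random.Random(seed)
    preds = [predict(*sample_training_set(rng), x0) for _ in range(trials)]
    bias2 = (statistics.mean(preds) - f(x0)) ** 2
    variance = statistics.pvariance(preds)
    return bias2, variance


bias2_c, var_c = bias2_and_variance(predict_constant)
bias2_n, var_n = bias2_and_variance(predict_1nn)
print(f"constant model: bias^2={bias2_c:.4f}  variance={var_c:.4f}")
print(f"1-NN model:     bias^2={bias2_n:.4f}  variance={var_n:.4f}")
```

The constant model is stable across training sets but systematically wrong at the query point, while the nearest-neighbor model is right on average but inherits the noise of whichever single point it copies.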

Practical Implications

Symptom	Diagnosis	Remedies
High training error	High bias (underfitting)	More features, more complex model, reduce regularization
Low training error, high test error	High variance (overfitting)	More data, regularization, ensemble methods, feature selection
High training and test error	Very high bias	Reconsider the model class; it may be fundamentally unsuited to the problem

8. Learning Curves

Learning curves plot model performance as a function of training set size, revealing whether the model suffers from bias, variance, or both.

Reading Learning Curves

High Variance (overfitting):

Training error: Low and stays low
Validation error: Starts high, slowly decreases
Large gap between curves
Fix: More data, regularization, simpler model, feature reduction

High Bias (underfitting):

Training error: Starts moderate, stays moderate/high
Validation error: Starts high, converges toward training error
Small gap, both converge to high error
Fix: More complex model, more features, reduce regularization

The Irreducible Error

Both curves converge to a floor above zero. This floor is the irreducible error ($\sigma^2$), the noise inherent in the data that no model can eliminate.
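A learning curve can be produced by refitting the same model on growing prefixes of the training data and recording both errors. The sketch below uses synthetic linear data with Gaussian noise and a one-variable least-squares line; the data-generating function, noise level, sample sizes, and helper names are all illustrative assumptions. Because the model class matches the data here, both curves should approach the noise variance (0.25 with the assumed noise standard deviation of 0.5).

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def mse(model, xs, ys):
    slope, intercept = model
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys)) / len(xs)


def make_data(rng, n, noise_sd=0.5):
    xs = [rng.uniform(0.0, 1.0) for _ in range(n)]
    ys = [2.0 * x + 1.0 + rng.gauss(0.0, noise_sd) for x in xs]
    return xs, ys


def learning_curve(sizes, seed=0):
    """Return (n, training MSE, validation MSE) for each training set size."""
    rng = random.Random(seed)
    train_x, train_y = make_data(rng, max(sizes))
    val_x, val_y = make_data(rng, 2000)  # held-out data, never used for fitting
    rows = []
    for n in sizes:
        model = fit_line(train_x[:n], train_y[:n])
        rows.append((n, mse(model, train_x[:n], train_y[:n]), mse(model, val_x, val_y)))
    return rows


for n, train_err, val_err in learning_curve([5, 20, 100, 1000]):
    print(f"n={n:5d}  train MSE={train_err:.3f}  val MSE={val_err:.3f}")
```

Plotting the two error columns against n gives the curves described in this section; a persistent gap would point to variance, and a high shared plateau would point to bias.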

9. Summary Table

Metric	Type	Best For	Sensitive to Imbalance?
Accuracy	Classification	Balanced classes	Yes (misleading)
Precision	Classification	FP costly	Yes (value depends on class prevalence)
Recall	Classification	FN costly	No
F1-Score	Classification	Balanced P/R	Moderate
AUC-ROC	Classification	Ranking quality	No
AP / AUC-PR	Classification	Imbalanced positive class	Yes (baseline equals positive-class prevalence)
MSE/RMSE	Regression	Large error penalty	N/A
MAE	Regression	Robust to outliers	N/A
R²	Regression	Variance explained	N/A

10. Key Takeaways

Never use accuracy alone with imbalanced classes — use precision, recall, F1, or AUC.
ROC curves show the full threshold tradeoff; AUC summarizes ranking ability.
PR curves are more informative than ROC when the positive class is rare.
Bias-variance decomposition diagnoses underfitting vs. overfitting.
Learning curves reveal whether more data or a more complex model will help.
Choose metrics aligned with business costs — what matters more: false alarms or missed detections?

ℹ️

Next up: Module 8 covers hyperparameter tuning and model selection — how to systematically optimize the models you've learned to evaluate.