Model Evaluation: ROC, AUC, Precision, Recall and F1

Module 7: Machine Learning Fundamentals

Why Model Evaluation Matters

A model is only as good as our ability to measure its performance. Choosing the wrong metric can lead to catastrophically poor deployment decisions. This lesson covers the essential evaluation toolkit every ML practitioner must master.


1. The Confusion Matrix

For a binary classifier, the confusion matrix partitions predictions into four categories:

|                 | Predicted Positive  | Predicted Negative  |
| --------------- | ------------------- | ------------------- |
| Actual Positive | True Positive (TP)  | False Negative (FN) |
| Actual Negative | False Positive (FP) | True Negative (TN)  |

[Figure: Confusion Matrix. TP and TN are correct predictions; FP (false alarm, Type I) and FN (missed positive, Type II) are errors.]

Key Definitions

  • Positive class = the class of interest (e.g., "has cancer", "fraudulent transaction")
  • TP: Model predicts positive, actual is positive
  • FP: Model predicts positive, actual is negative (Type I error)
  • FN: Model predicts negative, actual is positive (Type II error)
  • TN: Model predicts negative, actual is negative

The choice of which class is "positive" is arbitrary but critical. In disease screening, the diseased class is typically positive. Changing this flips TP↔TN and FP↔FN.
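To make the four cells concrete, here is a minimal sketch that counts them from two label lists. The function name `confusion_counts` and the toy labels are illustrative assumptions, not part of the lesson; the second call shows how choosing the other class as "positive" swaps TP with TN and FP with FN.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, FN, TN), treating `positive` as the positive class."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1  # predicted positive, actually positive
            else:
                fp += 1  # predicted positive, actually negative (Type I)
        else:
            if actual == positive:
                fn += 1  # predicted negative, actually positive (Type II)
            else:
                tn += 1  # predicted negative, actually negative
    return tp, fp, fn, tn


# Hypothetical toy labels (1 = class of interest)
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 1, 0, 0, 0]

print(confusion_counts(y_true, y_pred, positive=1))  # (1, 1, 2, 4)
print(confusion_counts(y_true, y_pred, positive=0))  # (4, 2, 1, 1)
```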


2. Classification Metrics

2.1 Accuracy

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

The fraction of all predictions that are correct. Dangerous when classes are imbalanced. A model predicting "no cancer" for all patients in a population with 1% prevalence achieves 99% accuracy while being completely useless.
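The arithmetic behind that example can be checked directly. The population size of 10,000 below is an assumed figure chosen only to make the 1% prevalence concrete.

```python
n_patients = 10_000  # assumed population size, for illustration only
n_sick = 100         # 1% prevalence

# A "model" that predicts "no cancer" for every patient
tp, fn = 0, n_sick               # every sick patient is missed
tn, fp = n_patients - n_sick, 0  # every healthy patient is correct

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)

print(accuracy, recall)  # 0.99 0.0
```

Accuracy looks excellent while recall shows that not a single positive case was found.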

2.2 Precision (Positive Predictive Value)

$$\text{Precision} = \frac{TP}{TP + FP}$$

Of all instances the model predicted as positive, what fraction actually are positive? High precision means few false alarms.

Use when: False positives are costly (e.g., spam filter marking legitimate emails as spam).

2.3 Recall (Sensitivity, True Positive Rate)

$$\text{Recall} = \frac{TP}{TP + FN}$$

Of all actual positive instances, what fraction did the model catch? High recall means few missed positives.

Use when: False negatives are costly (e.g., failing to detect cancer).

2.4 F1-Score

$$F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

The harmonic mean of precision and recall. Balances both concerns. The harmonic mean penalizes extreme imbalances more than the arithmetic mean.
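A small sketch of the three formulas, under the assumption (a convention chosen here, not stated in the lesson) that an undefined 0/0 ratio is reported as 0.0. The counts are hypothetical and chosen so that precision is high but recall is low, which shows how the harmonic mean stays close to the weaker of the two values.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # assumed 0/0 -> 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # assumed 0/0 -> 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts: few false alarms, many missed positives
p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=810)

print(f"precision={p:.2f} recall={r:.2f} F1={f1:.2f} arithmetic mean={(p + r) / 2:.2f}")
# precision=0.90 recall=0.10 F1=0.18 arithmetic mean=0.50
```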

2.5 Fβ-Score (Generalized)

$$F_\beta = (1 + \beta^2) \cdot \frac{\text{Precision} \cdot \text{Recall}}{\beta^2 \cdot \text{Precision} + \text{Recall}}$$
  • $\beta = 1$: Equal weight (standard F1)
  • $\beta = 2$: Recall weighted 2× more than precision
  • $\beta = 0.5$: Precision weighted 2× more than recall
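The effect of β is easiest to see on a single precision/recall pair. The sketch below is a direct transcription of the formula; the helper name `fbeta` and the example values (precision 0.5, recall 1.0) are assumptions for illustration.

```python
def fbeta(precision, recall, beta=1.0):
    """F-beta score from precision and recall (0.0 if both are zero)."""
    b2 = beta ** 2
    denom = b2 * precision + recall
    return (1 + b2) * precision * recall / denom if denom else 0.0


# Hypothetical model: perfect recall, mediocre precision
precision, recall = 0.5, 1.0

print(round(fbeta(precision, recall, beta=1.0), 3))  # 0.667
print(round(fbeta(precision, recall, beta=2.0), 3))  # 0.833 (rewards the high recall)
print(round(fbeta(precision, recall, beta=0.5), 3))  # 0.556 (punishes the low precision)
```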

Comparison Table

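| Metric    | Range  | Best When        | Weakness                  |
| --------- | ------ | ---------------- | ------------------------- |
| Accuracy  | [0, 1] | Balanced classes | Misleading with imbalance |
| Precision | [0, 1] | FP costly        | Ignores FN                |
| Recall    | [0, 1] | FN costly        | Ignores FP                |
| F1        | [0, 1] | Balanced need    | Ignores TN                |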

Multi-class extensions: For K classes, compute precision, recall, F1 per class, then aggregate:

  • Macro-average: Unweighted mean across classes (treats all classes equally)
  • Micro-average: Aggregate TP, FP, FN globally, then compute (dominated by the most frequent classes)
  • Weighted-average: Weighted by class support (number of instances per class)
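The three aggregation schemes can give noticeably different numbers on the same predictions. The sketch below computes them for precision only; the function name `precision_averages`, the toy labels, and the choice to score a never-predicted class as 0.0 are all assumptions made for illustration.

```python
from collections import Counter


def safe_div(numerator, denominator):
    # Assumed convention: an undefined ratio (0/0) is reported as 0.0.
    return numerator / denominator if denominator else 0.0


def precision_averages(y_true, y_pred):
    """Return per-class precision plus macro, micro and weighted averages."""
    classes = sorted(set(y_true) | set(y_pred))
    tp, fp = Counter(), Counter()
    for actual, predicted in zip(y_true, y_pred):
        if actual == predicted:
            tp[predicted] += 1
        else:
            fp[predicted] += 1  # a false positive for the predicted class
    support = Counter(y_true)   # number of true instances per class

    per_class = {c: safe_div(tp[c], tp[c] + fp[c]) for c in classes}
    macro = sum(per_class.values()) / len(classes)
    micro = safe_div(sum(tp.values()), sum(tp.values()) + sum(fp.values()))
    weighted = sum(per_class[c] * support[c] for c in classes) / len(y_true)
    return per_class, macro, micro, weighted


# Hypothetical imbalanced 3-class problem: 6 x "a", 3 x "b", 1 x "c"
y_true = ["a"] * 6 + ["b"] * 3 + ["c"]
y_pred = ["a"] * 5 + ["b"] + ["b", "b", "a"] + ["a"]

per_class, macro, micro, weighted = precision_averages(y_true, y_pred)
print(f"macro={macro:.3f} micro={micro:.3f} weighted={weighted:.3f}")
# macro=0.460 micro=0.700 weighted=0.629
```

The rare class "c" is never predicted correctly, which pulls the macro average down sharply while barely affecting the micro average.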

3. The ROC Curve

3.1 Concept

The Receiver Operating Characteristic (ROC) curve plots the tradeoff between catching positives and raising false alarms as the classification threshold varies.

$$\text{True Positive Rate (TPR)} = \text{Recall} = \frac{TP}{TP + FN}$$
$$\text{False Positive Rate (FPR)} = \frac{FP}{FP + TN}$$

For each possible threshold $t \in [0, 1]$:

  1. Classify instances with score ≥ $t$ as positive
  2. Compute (FPR, TPR) pair
  3. Plot the curve
[Figure: ROC Curve. x-axis: False Positive Rate (FPR); y-axis: True Positive Rate (TPR). Curves shown: Random (AUC = 0.5), Good model (AUC ≈ 0.92), Perfect (AUC = 1.0).]
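The threshold sweep described above can be sketched in a few lines. This version uses each distinct score as a threshold and assumes both classes are present in `y_true`; the function name `roc_points` and the toy scores are illustrative assumptions.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs, sweeping the threshold from high to low."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    # A threshold above every score predicts nothing positive: the point (0, 0).
    points = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


# Hypothetical labels and model scores
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(y_true, scores)
print([(round(fpr, 2), round(tpr, 2)) for fpr, tpr in points])
# [(0.0, 0.0), (0.0, 0.33), (0.0, 0.67), (0.33, 0.67), (0.33, 1.0), (0.67, 1.0), (1.0, 1.0)]
```

Plotting these pairs and joining them gives the ROC curve; the lowest threshold always lands on (1, 1) because every instance is then predicted positive.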

3.2 Interpreting the ROC Curve

  • Top-left corner (0, 1): Perfect classifier — catches all positives with no false alarms
  • Diagonal line: Random guessing — no discriminative power
  • Below diagonal: Worse than random (usually indicates flipped labels)
  • Closer to top-left: Better model

3.3 Threshold Selection

The ROC curve does not tell you which threshold to use. Common strategies:

  1. Closest to (0, 1): Minimize $\sqrt{\text{FPR}^2 + (1 - \text{TPR})^2}$
  2. Youden's J statistic: Maximize $J = \text{TPR} - \text{FPR}$
  3. Cost-based: Assign costs to FP and FN, minimize expected cost
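The three strategies can disagree, as the sketch below illustrates. The operating points, class counts, and unit costs are all hypothetical numbers invented for this example, and the expected-cost expression (cost per FP times number of FPs, plus cost per FN times number of FNs) is one simple way to formalize the cost-based rule.

```python
import math

# Hypothetical operating points: (threshold, FPR, TPR)
candidates = [
    (0.9, 0.00, 0.40),
    (0.7, 0.05, 0.70),
    (0.5, 0.20, 0.90),
    (0.3, 0.50, 0.97),
    (0.1, 1.00, 1.00),
]


def closest_to_corner(points):
    """Threshold minimizing the distance to the perfect point (0, 1)."""
    return min(points, key=lambda p: math.hypot(p[1], 1 - p[2]))[0]


def youden(points):
    """Threshold maximizing J = TPR - FPR."""
    return max(points, key=lambda p: p[2] - p[1])[0]


def min_cost(points, n_pos, n_neg, cost_fp, cost_fn):
    """Threshold minimizing total cost of false positives and false negatives."""
    def cost(p):
        _, fpr, tpr = p
        return cost_fp * fpr * n_neg + cost_fn * (1 - tpr) * n_pos
    return min(points, key=cost)[0]


print(closest_to_corner(candidates))  # 0.5
print(youden(candidates))             # 0.5
# Assumed setting: 100 positives, 900 negatives, a miss costs 5x a false alarm
print(min_cost(candidates, n_pos=100, n_neg=900, cost_fp=1, cost_fn=5))  # 0.7
```

With many more negatives than positives, even a modest FPR produces a large number of false alarms, so the cost-based rule here settles on a stricter threshold than the two geometric rules.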

4. AUC — Area Under the ROC Curve

$$\text{AUC} = \int_0^1 \text{TPR}(\text{FPR}^{-1}(x)) \, dx$$

Equivalently, AUC is the probability that a randomly chosen positive instance is ranked higher than a randomly chosen negative instance:

$$\text{AUC} = P(\hat{f}(x^+) > \hat{f}(x^-))$$
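The ranking interpretation translates directly into code: compare every positive score with every negative score and count how often the positive one is higher. Counting a tie as half a win is a common convention assumed here, and the function name `auc_pairwise` and toy data are illustrative. The double loop is quadratic, so this is a sketch for small examples rather than an efficient implementation.

```python
def auc_pairwise(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # assumed convention: a tie counts as half
    return wins / (len(pos) * len(neg))


# Hypothetical labels and model scores
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

print(round(auc_pairwise(y_true, scores), 4))  # 0.8889
```

Eight of the nine positive/negative pairs are ordered correctly; the only mistake is the positive scored 0.6 falling below the negative scored 0.7.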

AUC Interpretation

| AUC Range   | Interpretation                |
| ----------- | ----------------------------- |
| 0.90 – 1.00 | Excellent                     |
| 0.80 – 0.90 | Good                          |
| 0.70 – 0.80 | Fair                          |
| 0.60 – 0.70 | Poor                          |
| 0.50 – 0.60 | Fail (no better than random)  |

AUC is threshold-invariant: It evaluates the model's ranking ability across all thresholds simultaneously.

AUC is class-prevalence-invariant: Unlike accuracy, it doesn't change when class proportions shift.

AUC can be misleading with severe class imbalance. A model with AUC = 0.95 may still have poor precision at the operating point needed for deployment. Always examine the ROC curve itself, not just the single AUC number.


5. Precision-Recall (PR) Curve

When the positive class is rare, the PR curve provides a more informative view than ROC.

$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}$$

The PR curve plots Precision (y-axis) vs. Recall (x-axis) across all thresholds.

[Figure: Precision-Recall Curve. x-axis: Recall; y-axis: Precision. Curves shown: Baseline (prevalence = 0.1), Good model (AP ≈ 0.88), Poor model (AP ≈ 0.35).]

Average Precision (AP)

$$\text{AP} = \sum_{k=1}^{n} (R_k - R_{k-1}) \cdot P_k$$

where $R_k$ and $P_k$ are recall and precision at the $k$-th threshold, with thresholds ordered so that recall is non-decreasing and $R_0 = 0$. AP summarizes the PR curve into a single number (area under the PR curve).
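The sum can be computed by walking down the ranking from the highest score to the lowest. The sketch below assumes there are no tied scores, so that each rank acts as its own threshold; the function name `average_precision` and the toy data are illustrative.

```python
def average_precision(y_true, scores):
    """AP = sum over ranks of (recall increase) * (precision at that rank)."""
    n_pos = sum(y_true)
    # Rank instances from highest to lowest score (assumes no tied scores).
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    ap = 0.0
    tp = 0
    prev_recall = 0.0  # R_0 = 0
    for k, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            tp += 1
        precision = tp / k
        recall = tp / n_pos
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap


# Hypothetical labels and model scores
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

print(round(average_precision(y_true, scores), 4))  # 0.9167
```

Only ranks where a positive is found raise recall, so only those ranks contribute: here 1/3 · 1 + 1/3 · 1 + 1/3 · 0.75.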

When to Use PR vs. ROC

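| Scenario                               | Prefer                        |
| -------------------------------------- | ----------------------------- |
| Balanced classes                       | ROC                           |
| Severe imbalance (positive class rare) | PR                            |
| Both classes matter equally            | ROC                           |
| False positives very costly            | PR                            |
| False negatives very costly            | ROC (or PR with recall focus) |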

6. Regression Metrics

6.1 Mean Squared Error (MSE)

$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

Penalizes large errors quadratically. Sensitive to outliers. Units are squared.

6.2 Root Mean Squared Error (RMSE)

$$\text{RMSE} = \sqrt{\text{MSE}} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$$

Same units as the target variable. More interpretable than MSE.

6.3 Mean Absolute Error (MAE)

$$\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|$$

More robust to outliers than MSE, because errors are penalized linearly rather than quadratically.

6.4 R² (Coefficient of Determination)

$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} = 1 - \frac{\text{SS}_{\text{res}}}{\text{SS}_{\text{tot}}}$$
  • $R^2 = 1$: Perfect prediction
  • $R^2 = 0$: Model performs no better than predicting the mean
  • $R^2 < 0$: Model is worse than the mean
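All four regression metrics can be computed from the same list of residuals. The sketch below follows the formulas above; the function name `regression_metrics` and the toy data, which include one deliberately large error, are assumptions for illustration.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE, MAE and R^2 for paired targets and predictions."""
    n = len(y_true)
    errors = [y - y_hat for y, y_hat in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)
    y_bar = sum(y_true) / n
    ss_tot = sum((y - y_bar) ** 2 for y in y_true)
    mse = ss_res / n
    return {
        "mse": mse,
        "rmse": math.sqrt(mse),
        "mae": sum(abs(e) for e in errors) / n,
        "r2": 1 - ss_res / ss_tot,
    }


# Hypothetical data: three small errors and one large one (9 predicted as 13)
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 13.0]

m = regression_metrics(y_true, y_pred)
print(f"MSE={m['mse']:.2f} RMSE={m['rmse']:.2f} MAE={m['mae']:.2f} R2={m['r2']:.2f}")
# MSE=4.50 RMSE=2.12 MAE=1.50 R2=0.10
```

The single large error accounts for 16 of the 18 units of squared error, which is why RMSE ends up well above MAE.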

Metric Comparison

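| Metric | Sensitive to Outliers | Interpretable Units | Penalizes Large Errors |
| ------ | --------------------- | ------------------- | ---------------------- |
| MSE    | Yes (quadratic)       | No (squared)        | Strongly               |
| RMSE   | Yes (quadratic)       | Yes                 | Strongly               |
| MAE    | Less (linear)         | Yes                 | Linearly               |
| R²     | Depends               | No (normalized)     | Relative               |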

Adjusted R² penalizes model complexity: $R^2_{\text{adj}} = 1 - \frac{(1 - R^2)(n-1)}{n-p-1}$, where $n$ is the number of observations and $p$ is the number of predictors. Use this to compare models with different numbers of features.
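A short sketch of the adjustment, showing how the same raw R² is discounted more heavily as predictors are added. The function name `adjusted_r2` and the example values (R² = 0.80, n = 50) are assumptions for illustration.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 for n observations and p predictors."""
    if n - p - 1 <= 0:
        raise ValueError("need n > p + 1 for adjusted R^2 to be defined")
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


# Same raw R^2, different numbers of predictors
print(f"{adjusted_r2(0.80, n=50, p=5):.3f}")   # 0.777
print(f"{adjusted_r2(0.80, n=50, p=20):.3f}")  # 0.662
```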


7. Bias-Variance Tradeoff

The expected prediction error for a model can be decomposed as:

$$\text{Expected Error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Noise}$$
[Figure: Bias-Variance Decomposition. Panels relative to the target (true value): High Bias, Low Variance (underfitting); Low Bias, High Variance (overfitting); Low Bias, Low Variance (good fit). Axis: model complexity, from simple to complex.]

Definitions

Bias (systematic error): Error from erroneous assumptions in the learning algorithm. High bias causes the model to miss relevant relations between features and targets.

$$\text{Bias}[\hat{f}(x)] = E[\hat{f}(x)] - f(x)$$

Variance (sensitivity to data): Error from sensitivity to small fluctuations in the training set. High variance causes the model to fit random noise in the training data rather than the underlying signal.

$$\text{Var}[\hat{f}(x)] = E\left[(\hat{f}(x) - E[\hat{f}(x)])^2\right]$$

The Tradeoff

  • Increasing model complexity → decreases bias, increases variance
  • Decreasing model complexity → increases bias, decreases variance
  • Optimal complexity → minimizes total error
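The decomposition and the tradeoff can be checked numerically on a deliberately simple estimator. In the sketch below, the quantity being estimated is a single true mean, and a hypothetical `shrink` factor stands in for a complexity knob: shrinking the sample mean toward zero constrains the estimator, which raises bias and lowers variance. Error is measured against the true value rather than against noisy observations, so the irreducible-noise term does not appear. All parameter values are assumptions chosen for illustration.

```python
import random


def simulate(shrink, true_mean=5.0, noise_sd=2.0, n_samples=10,
             n_trials=5000, seed=0):
    """Estimate bias^2, variance and MSE of the estimator shrink * sample_mean."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_trials):
        # Each trial draws a fresh training set and refits the estimator.
        sample = [rng.gauss(true_mean, noise_sd) for _ in range(n_samples)]
        estimates.append(shrink * sum(sample) / n_samples)

    mean_est = sum(estimates) / n_trials
    bias_sq = (mean_est - true_mean) ** 2
    variance = sum((e - mean_est) ** 2 for e in estimates) / n_trials
    mse = sum((e - true_mean) ** 2 for e in estimates) / n_trials
    return bias_sq, variance, mse


for shrink in (1.0, 0.9, 0.5):
    bias_sq, variance, mse = simulate(shrink)
    print(f"shrink={shrink}: bias^2={bias_sq:.3f} variance={variance:.3f} "
          f"sum={bias_sq + variance:.3f} mse={mse:.3f}")
```

In each row the `sum` and `mse` columns agree, and moving from `shrink=1.0` to `shrink=0.5` trades lower variance for higher bias.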

Practical Implications

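| Symptom                             | Diagnosis                   | Remedies                                                      |
| ----------------------------------- | --------------------------- | ------------------------------------------------------------- |
| High training error                 | High bias (underfitting)    | More features, more complex model, reduce regularization      |
| Low training error, high test error | High variance (overfitting) | More data, regularization, ensemble methods, feature selection |
| High training and test error        | Very high bias              | Fundamentally wrong model class                               |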

8. Learning Curves

Learning curves plot model performance as a function of training set size, revealing whether the model suffers from bias, variance, or both.

[Figure: Learning Curves. Error vs. training set size, with training and validation error curves. Panel 1, High Variance (Overfitting): large gap, training error low. Panel 2, High Bias (Underfitting): both errors high, small gap.]

Reading Learning Curves

High Variance (overfitting):

  • Training error: Low and stays low
  • Validation error: Starts high, slowly decreases
  • Large gap between curves
  • Fix: More data, regularization, simpler model, feature reduction

High Bias (underfitting):

  • Training error: Starts moderate, stays moderate/high
  • Validation error: Starts high, converges toward training error
  • Small gap, both converge to high error
  • Fix: More complex model, more features, reduce regularization
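A learning curve can be produced by refitting the same model on growing prefixes of the training data and scoring each fit on a fixed validation set. The sketch below illustrates the high-bias case by fitting a straight line to data generated from a quadratic; the data-generating process, the set sizes, and all function names are assumptions made for this example.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def mse(xs, ys, slope, intercept):
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys)) / len(xs)


def learning_curve(sizes, n_val=500, seed=0):
    """Return (n, training MSE, validation MSE) for each training set size."""
    rng = random.Random(seed)

    def make_data(n):
        # Assumed ground truth: y = x^2 plus Gaussian noise with variance 1
        xs = [rng.uniform(-3, 3) for _ in range(n)]
        ys = [x * x + rng.gauss(0, 1) for x in xs]
        return xs, ys

    train_x, train_y = make_data(max(sizes))
    val_x, val_y = make_data(n_val)
    rows = []
    for n in sizes:
        slope, intercept = fit_line(train_x[:n], train_y[:n])
        rows.append((n,
                     mse(train_x[:n], train_y[:n], slope, intercept),
                     mse(val_x, val_y, slope, intercept)))
    return rows


rows = learning_curve([5, 10, 25, 50, 100, 200, 400])
for n, train_err, val_err in rows:
    print(f"n={n:4d}  train MSE={train_err:7.3f}  validation MSE={val_err:7.3f}")
```

Because a line cannot represent the curvature, both errors should remain well above the noise variance of 1 even at the largest training size, which is the high-bias signature described above.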

The Irreducible Error

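As the training set grows, both curves flatten out at a floor above zero. Part of that floor is the irreducible error ($\sigma^2_\epsilon$), the noise inherent in the data that no model can eliminate; in the high-bias case the floor also includes the model's squared bias.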


9. Summary Table

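| Metric      | Type           | Best For                  | Sensitive to Imbalance?                                |
| ----------- | -------------- | ------------------------- | ------------------------------------------------------ |
| Accuracy    | Classification | Balanced classes          | Yes (misleading)                                       |
| Precision   | Classification | FP costly                 | No (but its value depends on prevalence)               |
| Recall      | Classification | FN costly                 | No                                                     |
| F1-Score    | Classification | Balanced P/R              | Moderate                                               |
| AUC-ROC     | Classification | Ranking quality           | No (but can look optimistic when positives are rare)   |
| AP / AUC-PR | Classification | Imbalanced positive class | No (baseline equals the positive prevalence)           |
| MSE/RMSE    | Regression     | Large error penalty       | N/A                                                    |
| MAE         | Regression     | Robust to outliers        | N/A                                                    |
| R²          | Regression     | Variance explained        | N/A                                                    |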

10. Key Takeaways

  1. Never use accuracy alone with imbalanced classes — use precision, recall, F1, or AUC.
  2. ROC curves show the full threshold tradeoff; AUC summarizes ranking ability.
  3. PR curves are more informative than ROC when the positive class is rare.
  4. Bias-variance decomposition diagnoses underfitting vs. overfitting.
  5. Learning curves reveal whether more data or a more complex model will help.
  6. Choose metrics aligned with business costs — what matters more: false alarms or missed detections?

Next up: Module 8 covers hyperparameter tuning and model selection — how to systematically optimize the models you've learned to evaluate.
