Evaluation Metrics: Precision, Recall, F1, AUC-ROC & Confusion Matrix
Choosing the right metric for your business problem
Interview Question
"When would you optimize for precision vs recall? Explain the ROC curve and AUC. How do you evaluate models on imbalanced datasets?"
Difficulty: Medium | Frequently asked at Meta, Netflix, Amazon
Theoretical Foundation
Confusion Matrix
For binary classification:
- True Positive (TP): Correctly predicted positive
- False Positive (FP): Incorrectly predicted positive (Type I error)
- False Negative (FN): Incorrectly predicted negative (Type II error)
- True Negative (TN): Correctly predicted negative
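As a minimal sketch of the four definitions above, the counts can be tallied directly from labels and predictions. The function name `confusion_counts` and the sample data are illustrative assumptions, with 1 taken to be the positive class.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels, where 1 is the positive class."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == 1 and actual == 1:
            tp += 1          # correctly predicted positive
        elif predicted == 1 and actual == 0:
            fp += 1          # incorrectly predicted positive (Type I error)
        elif predicted == 0 and actual == 1:
            fn += 1          # incorrectly predicted negative (Type II error)
        else:
            tn += 1          # correctly predicted negative
    return tp, fp, fn, tn


# Hypothetical labels and predictions, for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
tp, fp, fn, tn = confusion_counts(y_true, y_pred)
print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
```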
Classification Metrics
Accuracy
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
- Intuitive but misleading for imbalanced datasets
- Example: 99% accuracy is trivial if 99% of samples are negative
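The 99% example can be reproduced in a few lines. This sketch assumes a made-up dataset of 1,000 samples with 10 positives and a "model" that always predicts negative.

```python
# Hypothetical dataset: 10 positives, 990 negatives (1% positive rate).
y_true = [1] * 10 + [0] * 990
# A degenerate classifier that predicts "negative" for every sample.
y_pred = [0] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
positives_caught = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))

print(f"accuracy = {accuracy:.2%}, positives caught = {positives_caught}")
```

The accuracy looks excellent even though the classifier never finds a single positive.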
Precision (Positive Predictive Value)
$$\text{Precision} = \frac{TP}{TP + FP}$$
- "Of all predicted positives, how many are actually positive?"
- Important when FP is costly (spam detection, ad targeting)
Recall (Sensitivity, True Positive Rate)
$$\text{Recall} = \frac{TP}{TP + FN}$$
- "Of all actual positives, how many did we catch?"
- Important when FN is costly (disease detection, fraud detection)
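F1 Score
$$F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2\,TP}{2\,TP + FP + FN}$$
- Harmonic mean of precision and recall
- Low whenever either precision or recall is low, so it rewards a balance between the two
- Range: $[0, 1]$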
Specificity (True Negative Rate)
$$\text{Specificity} = \frac{TN}{TN + FP}$$
- "Of all actual negatives, how many did we correctly identify?"
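The metrics in this section all derive from the four confusion-matrix counts. The sketch below computes them using the standard definitions; the helper names and the example counts are assumptions chosen for illustration, and a zero denominator is mapped to 0.0 as one possible convention.

```python
def safe_div(numerator, denominator):
    """Divide, returning 0.0 when the denominator is zero (a chosen convention)."""
    return numerator / denominator if denominator else 0.0


def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, specificity, and F1 from counts."""
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    return {
        "accuracy": safe_div(tp + tn, tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "specificity": safe_div(tn, tn + fp),
        "f1": safe_div(2 * precision * recall, precision + recall),
    }


# Hypothetical counts: 120 actual positives, 880 actual negatives.
metrics = classification_metrics(tp=80, fp=20, fn=40, tn=860)
for name, value in metrics.items():
    print(f"{name:12s} {value:.3f}")
```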
When to Optimize Which Metric
| Metric | Optimize When | Example |
|---|---|---|
| Precision | FP is costly | Spam detection (don't mark legitimate emails as spam) |
| Recall | FN is costly | Disease detection (don't miss positive cases) |
| F1 | Both FP and FN are important | General classification |
| Accuracy | Classes are balanced | Balanced datasets |
ℹ️
Key Insight: Precision and recall usually trade off against each other. Raising the decision threshold can never increase recall and typically (though not always) increases precision. The optimal threshold depends on the business cost of FP vs FN.
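A small threshold sweep makes the tradeoff concrete. The scores and labels below are invented for illustration, and `precision_recall_at` is a hypothetical helper; on this toy data, each higher threshold buys precision at the expense of recall.

```python
# Hypothetical model scores and true labels (1 = positive).
scores = [0.95, 0.85, 0.70, 0.60, 0.45, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 1, 0, 0, 0]


def precision_recall_at(threshold, labels, scores):
    """Precision and recall when predicting positive for score >= threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    # With no predicted positives, precision is set to 1.0 by convention.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


for threshold in (0.25, 0.50, 0.80):
    p, r = precision_recall_at(threshold, labels, scores)
    print(f"threshold={threshold:.2f}  precision={p:.3f}  recall={r:.3f}")
```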
ROC Curve and AUC
ROC Curve
Plots True Positive Rate (Recall) vs False Positive Rate at different thresholds:
$$\text{TPR} = \frac{TP}{TP + FN}, \qquad \text{FPR} = \frac{FP}{FP + TN}$$
Interpretation:
- Upper-left corner: Perfect classifier (TPR=1, FPR=0)
- Diagonal: Random classifier (TPR=FPR)
- Area under curve (AUC): Probability that classifier ranks a random positive higher than a random negative
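To show how the curve is traced, the sketch below lowers the threshold one distinct score at a time, records each (FPR, TPR) point, and integrates with the trapezoidal rule. The function names and data are assumptions for illustration; both classes must be present in `labels`.

```python
def roc_points(labels, scores):
    """Return ROC points (fpr, tpr), from (0, 0) to (1, 1), one per distinct score."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), reverse=True)  # highest score first
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every sample tied at this score before emitting a point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def trapezoid_auc(points):
    """Area under a piecewise-linear curve given as (x, y) points sorted by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical scores and labels (1 = positive).
scores = [0.95, 0.85, 0.70, 0.60, 0.45, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 1, 0, 0, 0]
points = roc_points(labels, scores)
print(points)
print("AUC =", trapezoid_auc(points))
```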
AUC (Area Under ROC Curve)
- AUC = 1.0: Perfect classifier
- AUC = 0.5: Random classifier
- AUC < 0.5: Worse than random (inverting the scores yields an AUC of 1 - AUC)
AUC Interpretation: AUC is the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative instance, with ties counted as one half: $\text{AUC} = P\big(s(x^+) > s(x^-)\big)$.
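The ranking interpretation can be checked by brute force: compare every positive with every negative and count how often the positive wins. This is a quadratic-time sketch meant for intuition rather than production use; `pairwise_auc` and the data are illustrative assumptions.

```python
def pairwise_auc(labels, scores):
    """Fraction of (positive, negative) pairs where the positive scores higher.

    Ties count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical scores and labels (1 = positive).
scores = [0.95, 0.85, 0.70, 0.60, 0.45, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 1, 0, 0, 0]

auc = pairwise_auc(labels, scores)
# Negating the scores reverses the ranking, which turns AUC into 1 - AUC.
flipped_auc = pairwise_auc(labels, [-s for s in scores])
print(auc, flipped_auc)
```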
Precision-Recall Curve
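For imbalanced datasets, the precision-recall curve is more informative than ROC. It is commonly summarized by average precision (AP):
$$\text{AP} = \sum_n (R_n - R_{n-1})\, P_n$$
where $R_n$ and $P_n$ are recall and precision at the $n$-th threshold.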
⚠️
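Common Misconception: A high ROC-AUC means the model performs well on an imbalanced dataset. In practice, ROC curves can look optimistic when negatives vastly outnumber positives: a classifier can have high AUC but poor precision. Always check the precision-recall curve for imbalanced problems.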
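A constructed example shows how far the two summaries can diverge. The sketch assumes a hypothetical ranking of 1,010 items in which 20 negatives outrank all 10 positives; average precision is computed as the recall-weighted sum of precision over the ranked list, assuming no tied scores.

```python
def average_precision(labels, scores):
    """Sum of (recall gain) * precision down the ranked list; assumes no ties."""
    ranked = sorted(zip(scores, labels), reverse=True)
    n_pos = sum(labels)
    tp = 0
    ap = 0.0
    prev_recall = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            tp += 1
        precision = tp / rank
        recall = tp / n_pos
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap


# Hypothetical ranking: 20 negatives on top, then 10 positives, then 980 negatives.
labels = [0] * 20 + [1] * 10 + [0] * 980
scores = [float(len(labels) - i) for i in range(len(labels))]  # strictly decreasing

pos = [s for y, s in zip(labels, scores) if y == 1]
neg = [s for y, s in zip(labels, scores) if y == 0]
# ROC-AUC as the fraction of (positive, negative) pairs ranked correctly.
auc = sum(p > n for p in pos for n in neg) / (len(pos) * len(neg))
ap = average_precision(labels, scores)

print(f"ROC-AUC = {auc:.3f}, average precision = {ap:.3f}")
```

Each positive still outranks 98% of the negatives, so ROC-AUC stays high, yet the top of the ranking is dominated by false positives and average precision is low.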
Confusion Matrix Derivatives
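Matthews Correlation Coefficient (MCC)
$$\text{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$
- Range: $[-1, 1]$
- 1 = perfect, 0 = random, -1 = inverse
- Remains informative with imbalanced classes because it uses all four cells of the confusion matrix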
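A minimal sketch of MCC from the four counts, using the standard formula. Returning 0.0 when the denominator is zero is a common convention adopted here as an assumption; the example counts are made up.

```python
import math


def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # convention when any row or column of the matrix is empty
    return (tp * tn - fp * fn) / denominator


print(mcc(tp=50, fp=0, fn=0, tn=50))    # perfect predictions
print(mcc(tp=0, fp=50, fn=50, tn=0))    # perfectly inverted predictions
print(mcc(tp=0, fp=0, fn=10, tn=990))   # always predicts negative
```

The last case has 99% accuracy, but MCC reports no correlation between predictions and labels.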
Cohen's Kappa
$$\kappa = \frac{p_o - p_e}{1 - p_e}$$
where $p_o$ is observed agreement and $p_e$ is the agreement expected by chance.
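For a binary confusion matrix, observed agreement is the accuracy, and expected agreement follows from the row and column totals. The sketch below applies the standard definition to made-up counts; `cohens_kappa` is an illustrative name.

```python
def cohens_kappa(tp, fp, fn, tn):
    """Cohen's kappa between predictions and labels for a binary problem."""
    n = tp + fp + fn + tn
    p_o = (tp + tn) / n  # observed agreement
    # Chance agreement: both say "positive" plus both say "negative".
    p_e = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    if p_e == 1:
        return 0.0  # degenerate case: only one class appears anywhere
    return (p_o - p_e) / (1 - p_e)


# Hypothetical counts: p_o = 0.7 and p_e = 0.5, so kappa = 0.4.
print(cohens_kappa(tp=20, fp=5, fn=10, tn=15))
```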
Regression Metrics
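Mean Squared Error (MSE)
$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
Root Mean Squared Error (RMSE)
$$\text{RMSE} = \sqrt{\text{MSE}}$$
Mean Absolute Error (MAE)
$$\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|$$
R-Squared (Coefficient of Determination)
$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$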
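The four regression metrics can be computed together from the residuals. This is a plain-Python sketch of the standard definitions; the function name and the sample values are assumptions for illustration, and `y_true` must not be constant (otherwise R² is undefined).

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE, MAE, and R-squared for paired targets and predictions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)                  # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total sum of squares
    mse = ss_res / n
    return {
        "mse": mse,
        "rmse": math.sqrt(mse),
        "mae": sum(abs(e) for e in errors) / n,
        "r2": 1 - ss_res / ss_tot,
    }


# Hypothetical targets and predictions.
result = regression_metrics([3.0, 5.0, 7.0, 9.0], [2.0, 5.0, 8.0, 11.0])
print(result)
```

RMSE is in the same units as the target and penalizes large errors more heavily than MAE, which is why the two differ here even though both summarize the same residuals.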
Code Implementation
Explanation of Code
- Confusion Matrix: Shows TP, FP, FN, TN and derived metrics.
- Precision vs Recall: Demonstrates the tradeoff at different thresholds.
- ROC Curve: Plots TPR vs FPR and finds optimal threshold.
- Precision-Recall Curve: Shows performance for imbalanced datasets.
- Regression Metrics: Demonstrates MSE, RMSE, MAE, and R².
- Metric Selection: Provides guidance on when to use each metric.
Real-World Applications
Meta: Content Moderation
Meta optimizes for:
- Recall: Catch all harmful content (minimize FN)
- Precision: Don't remove legitimate content (minimize FP)
- F1: Balance between the two
Netflix: Recommendation Ranking
Netflix uses:
- NDCG: Ranking quality (top recommendations matter)
- AUC-ROC: Classification of relevant vs irrelevant items
- Coverage: Ensure recommendations span all content
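Since NDCG is named above, here is a minimal sketch of one common formulation (linear gain with a log2 position discount). It is a generic illustration, not a description of any company's actual pipeline; the graded relevance values are made up.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain: relevance at rank r is divided by log2(r + 1)."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))


def ndcg(relevances):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal else 0.0


# Hypothetical graded relevance of items in the order they were recommended.
ranked_relevance = [3, 2, 0, 1]
score = ndcg(ranked_relevance)
print(f"NDCG = {score:.4f}")
```

Because the discount grows with rank, swapping items near the top changes the score more than swapping items near the bottom.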
💡
Meta Interview Tip: Be prepared to discuss how business costs affect metric selection. For example, in content moderation, missing harmful content (FN) is often treated as costlier than removing legitimate content (FP).
Common Follow-Up Questions
Q1: Why is accuracy misleading for imbalanced datasets?
If 99% of samples are negative, a classifier that predicts "negative" for everything achieves 99% accuracy but catches zero positives. Precision, recall, and F1 are more informative.
Q2: What is the difference between ROC-AUC and PR-AUC?
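ROC-AUC is built from TPR and FPR. Because FPR is normalized by the number of negatives, a large absolute number of false positives barely moves it when negatives dominate, so ROC-AUC can look optimistic on imbalanced datasets. PR-AUC is built from precision and recall; precision is penalized directly by false positives, which makes it more informative when the positive class is rare.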
Q3: How do you choose the optimal classification threshold?
It depends on the objective. Common approaches:
- Youden's J: Maximizes TPR - FPR (weights both error rates equally)
- Cost-based: Minimize total cost of FP and FN
- F1-based: Maximize F1 score
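The three criteria above can share one search loop that differs only in the objective being maximized. The sketch below scans every observed score as a candidate threshold; the data, the helper names, and the cost values (a false positive assumed to be five times as costly as a false negative) are all hypothetical.

```python
def counts_at(threshold, labels, scores):
    """Confusion counts when predicting positive for score >= threshold."""
    tp = fp = fn = tn = 0
    for y, s in zip(labels, scores):
        predicted = 1 if s >= threshold else 0
        if predicted and y:
            tp += 1
        elif predicted and not y:
            fp += 1
        elif y:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def youden_j(tp, fp, fn, tn):
    return tp / (tp + fn) - fp / (fp + tn)  # TPR - FPR


def f1_score(tp, fp, fn, tn):
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def make_cost_objective(fp_cost, fn_cost):
    """Build an objective that is higher when total misclassification cost is lower."""
    def negative_cost(tp, fp, fn, tn):
        return -(fp_cost * fp + fn_cost * fn)
    return negative_cost


def best_threshold(labels, scores, objective):
    """Return the observed score that maximizes the objective (lowest wins ties)."""
    candidates = sorted(set(scores))
    return max(candidates, key=lambda t: objective(*counts_at(t, labels, scores)))


# Hypothetical scores and labels (1 = positive).
scores = [0.95, 0.85, 0.70, 0.60, 0.45, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 1, 0, 0, 0]

t_youden = best_threshold(labels, scores, youden_j)
t_f1 = best_threshold(labels, scores, f1_score)
t_cost = best_threshold(labels, scores, make_cost_objective(fp_cost=5.0, fn_cost=1.0))
print(t_youden, t_f1, t_cost)
```

On this toy data Youden's J and F1 agree on the same threshold, while the cost-based criterion moves to a stricter one because false positives were assumed to be expensive.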
Q4: Can you use accuracy for multi-class problems?
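Yes, but accuracy hides per-class performance when classes are imbalanced, so also consider:
- Macro-averaged F1: Averages per-class F1 with equal weight, so rare classes count as much as common ones
- Weighted F1: Averages per-class F1 weighted by class support, so frequent classes dominate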
- Confusion matrix: Shows per-class performance
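The difference between the two averages shows up as soon as a rare class is missed entirely. This sketch computes one-vs-rest F1 per class and then both averages; the three-class data and function names are invented for illustration.

```python
def per_class_f1(y_true, y_pred, cls):
    """One-vs-rest F1 for a single class."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def macro_and_weighted_f1(y_true, y_pred):
    """Return (macro F1, support-weighted F1) over the classes present in y_true."""
    classes = sorted(set(y_true))
    f1_by_class = {c: per_class_f1(y_true, y_pred, c) for c in classes}
    support = {c: y_true.count(c) for c in classes}
    macro = sum(f1_by_class.values()) / len(classes)
    weighted = sum(f1_by_class[c] * support[c] for c in classes) / len(y_true)
    return macro, weighted


# Hypothetical 3-class problem: class "c" is rare and never predicted.
y_true = ["a"] * 6 + ["b"] * 3 + ["c"] * 1
y_pred = ["a"] * 6 + ["b", "b", "a"] + ["a"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
macro, weighted = macro_and_weighted_f1(y_true, y_pred)
print(f"accuracy={accuracy:.3f}  macro F1={macro:.3f}  weighted F1={weighted:.3f}")
```

Accuracy and weighted F1 stay fairly high because class "a" dominates, while macro F1 drops sharply because the missed class "c" contributes a zero with full weight.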
Company-Specific Tips
Meta Interview Tips
- Discuss multi-objective optimization (precision, recall, latency)
- Be ready to explain calibration of probabilities
- Mention fairness metrics across different groups
- Talk about online evaluation (A/B testing)
Netflix Interview Tips
- Focus on ranking metrics (NDCG, MRR)
- Discuss coverage and diversity metrics
- Be prepared to explain business-aligned metrics
- Mention user study design