Meta & Netflix Interview

Evaluation Metrics: Precision, Recall, F1, AUC-ROC & Confusion Matrix

Choosing the right metric for your business problem

Interview Question

"When would you optimize for precision vs recall? Explain the ROC curve and AUC. How do you evaluate models on imbalanced datasets?"

Difficulty: Medium | Frequently asked at Meta, Netflix, Amazon

Theoretical Foundation

Confusion Matrix

For binary classification, with rows indexed by the predicted class and columns by the actual class (positive first):

\text{Confusion Matrix} = \begin{bmatrix} TP & FP \\ FN & TN \end{bmatrix}

True Positive (TP): Correctly predicted positive
False Positive (FP): Incorrectly predicted positive (Type I error)
False Negative (FN): Incorrectly predicted negative (Type II error)
True Negative (TN): Correctly predicted negative
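
As a minimal sketch of these four definitions, the counts can be tallied directly from paired labels. The function name `confusion_counts` and the toy labels are illustrative assumptions, not part of the original material; labels are assumed to be encoded as 1 (positive) and 0 (negative).

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, FN, TN for binary labels (1 = positive, 0 = negative)."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == 1 and actual == 1:
            tp += 1          # correctly predicted positive
        elif predicted == 1 and actual == 0:
            fp += 1          # incorrectly predicted positive (Type I error)
        elif predicted == 0 and actual == 1:
            fn += 1          # incorrectly predicted negative (Type II error)
        else:
            tn += 1          # correctly predicted negative
    return tp, fp, fn, tn


# Toy data, made up for illustration.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
tp, fp, fn, tn = confusion_counts(y_true, y_pred)
print(tp, fp, fn, tn)  # 3 1 1 3
```

Note that libraries may lay the matrix out differently. For example, scikit-learn's `confusion_matrix` puts actual classes on the rows, so for labels 0 and 1 it returns [[TN, FP], [FN, TP]].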

Classification Metrics

Accuracy

\text{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN}

Intuitive, but misleading for imbalanced datasets
Example: if 99% of samples are negative, a model that always predicts "negative" reaches 99% accuracy without finding a single positive
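
The accuracy example can be checked numerically. This is a small sketch on synthetic labels (10 positives, 990 negatives); the helper names `accuracy` and `recall` are hypothetical, and recall is reported as 0.0 when there are no actual positives, which is a convention chosen here.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    correct = sum(1 for a, p in zip(y_true, y_pred) if a == p)
    return correct / len(y_true)


def recall(y_true, y_pred):
    """Fraction of actual positives that were predicted positive."""
    tp = sum(1 for a, p in zip(y_true, y_pred) if a == 1 and p == 1)
    fn = sum(1 for a, p in zip(y_true, y_pred) if a == 1 and p == 0)
    return tp / (tp + fn) if (tp + fn) else 0.0


# Synthetic imbalanced data: 1% positives.
y_true = [1] * 10 + [0] * 990
# A trivial "model" that always predicts negative.
y_pred = [0] * 1000

print(accuracy(y_true, y_pred))  # 0.99
print(recall(y_true, y_pred))    # 0.0
```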

Precision (Positive Predictive Value)

\text{Precision} = \frac{TP}{TP + FP}

"Of all predicted positives, how many are actually positive?"
Important when FP is costly (spam detection, ad targeting)

Recall (Sensitivity, True Positive Rate)

\text{Recall} = \frac{TP}{TP + FN}

"Of all actual positives, how many did we catch?"
Important when FN is costly (disease detection, fraud detection)

F1 Score

F1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}

Harmonic mean of precision and recall
High only when both precision and recall are high; a large gap between them pulls the score toward the lower value
Range: $[0, 1]$
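
The precision, recall, and F1 formulas above can be sketched as one small function over confusion-matrix counts. The name `precision_recall_f1` and the example counts are assumptions for illustration; returning 0.0 for an undefined ratio is one common convention, not the only one.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion-matrix counts."""
    # Assumption: an undefined ratio (zero denominator) is reported as 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up counts: 8 true positives, 2 false positives, 8 false negatives.
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=8)
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.8 0.5 0.615
print(round((p + r) / 2, 3))                   # arithmetic mean, 0.65
```

With precision 0.8 and recall 0.5, F1 is about 0.615, below the arithmetic mean of 0.65, which shows how the harmonic mean leans toward the weaker of the two.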

Specificity (True Negative Rate)

\text{Specificity} = \frac{TN}{TN + FP}

"Of all actual negatives, how many did we correctly identify?"

When to Optimize Which Metric

Metric	Optimize When	Example
Precision	FP is costly	Spam detection (don't mark legitimate emails as spam)
Recall	FN is costly	Disease detection (don't miss positive cases)
F1	Both FP and FN are important	General classification
Accuracy	Classes are balanced	Balanced datasets

Key Insight: Precision and recall usually trade off against each other. Raising the decision threshold can only lower recall (or leave it unchanged) and typically, though not always, raises precision. The optimal threshold depends on the business cost of FP vs FN.
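
To make the threshold effect concrete, here is a small sketch that sweeps three thresholds over made-up scores. The function name `precision_recall_at`, the scores, and the thresholds are all illustrative assumptions.

```python
def precision_recall_at(y_true, scores, threshold):
    """Precision and recall when predicting positive for score >= threshold."""
    tp = fp = fn = 0
    for label, score in zip(y_true, scores):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 1:
            fn += 1
    # Assumption: an undefined ratio (zero denominator) is reported as 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy labels and model scores, made up for illustration.
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.1, 0.2, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9]

for threshold in (0.3, 0.5, 0.75):
    precision, recall = precision_recall_at(y_true, scores, threshold)
    print(threshold, round(precision, 3), round(recall, 3))
# 0.3 0.667 1.0
# 0.5 0.75 0.75
# 0.75 0.5 0.25
```

On this toy data recall falls at every step, while precision first rises and then drops once a high-scoring negative dominates the few remaining predictions. The tradeoff is a tendency, not a strict law for precision.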

ROC Curve and AUC

ROC Curve

Plots True Positive Rate (Recall) vs False Positive Rate at different thresholds:

\text{TPR} = \frac{TP}{TP + FN}, \quad \text{FPR} = \frac{FP}{FP + TN}

Interpretation:

Upper-left corner: Perfect classifier (TPR=1, FPR=0)
Diagonal: Random classifier (TPR=FPR)
Area under curve (AUC): Probability that classifier ranks a random positive higher than a random negative
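
A rough sketch of how the curve is traced: each distinct score is used as a threshold, and the resulting (FPR, TPR) points are joined. The helper names `roc_points` and `trapezoid_area` and the toy data are assumptions for illustration; the code assumes at least one positive and one negative label.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) points, from the strictest threshold to the loosest."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    # Threshold above every score: nothing is predicted positive.
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def trapezoid_area(points):
    """Area under a piecewise-linear curve given as (x, y) points sorted by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Toy labels and model scores, made up for illustration.
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.1, 0.2, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9]

points = roc_points(y_true, scores)
print(points[0], points[-1])   # (0.0, 0.0) (1.0, 1.0)
print(trapezoid_area(points))  # 0.75
```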

AUC (Area Under ROC Curve)

AUC = 1.0: Perfect classifier
AUC = 0.5: Random classifier
AUC < 0.5: Worse than random (inverting the scores gives an AUC of 1 - AUC)

AUC Interpretation: Because AUC is the probability that a randomly chosen positive instance is ranked higher than a randomly chosen negative instance, it depends only on the ordering of the scores, not on any particular threshold.
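
The ranking interpretation can be computed literally by comparing every positive score with every negative score. This is a quadratic-time sketch meant for intuition rather than production use; the name `roc_auc_pairwise` and the toy data are assumptions, and ties are counted as half a win, which is the usual convention.

```python
def roc_auc_pairwise(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))


# Toy labels and model scores, made up for illustration.
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.1, 0.2, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9]

print(roc_auc_pairwise(y_true, scores))  # 0.75
```

Here 12 of the 16 positive-negative pairs are ordered correctly, giving 0.75.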

Precision-Recall Curve

For imbalanced datasets, the precision-recall curve is more informative than the ROC curve. It is commonly summarized by average precision:

\text{Average Precision (AP)} = \sum_{k} (R_k - R_{k-1}) P_k

where $R_k$ and $P_k$ are the recall and precision at the $k$-th threshold.
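
The AP sum can be sketched by walking down the ranked list and accumulating precision weighted by each increase in recall. The function name `average_precision` and the toy data are illustrative assumptions; the sketch assumes no tied scores and at least one positive label.

```python
def average_precision(y_true, scores):
    """AP = sum over ranks of (R_k - R_{k-1}) * P_k, assuming no tied scores."""
    ranked = sorted(zip(scores, y_true), reverse=True)  # highest score first
    n_pos = sum(y_true)
    hits = 0
    ap = 0.0
    prev_recall = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
        precision = hits / rank
        recall = hits / n_pos
        # Recall only changes at positive items, so negatives add nothing.
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap


# Toy labels and model scores, made up for illustration.
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.1, 0.2, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9]

print(round(average_precision(y_true, scores), 4))  # 0.7708
```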

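Common Misconception: A high ROC-AUC does not guarantee good performance on an imbalanced dataset. FPR is normalized by the number of negatives, so when negatives vastly outnumber positives, even a small FPR can produce more false positives than true positives, and precision stays poor. Always check the precision-recall curve for imbalanced problems.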
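
A short worked calculation illustrates the point. The same ROC operating point (TPR = 0.90, FPR = 0.05) is applied to a balanced and an imbalanced dataset; the class sizes and the helper name `precision_from_rates` are hypothetical values chosen for illustration.

```python
def precision_from_rates(tpr, fpr, n_pos, n_neg):
    """Precision implied by a ROC operating point and the class sizes."""
    tp = tpr * n_pos   # expected true positives
    fp = fpr * n_neg   # expected false positives
    return tp / (tp + fp)


# Same operating point, different class balance (made-up sizes).
balanced = precision_from_rates(tpr=0.90, fpr=0.05, n_pos=100, n_neg=100)
imbalanced = precision_from_rates(tpr=0.90, fpr=0.05, n_pos=100, n_neg=10_000)

print(round(balanced, 3))    # about 0.947
print(round(imbalanced, 3))  # about 0.153
```

The ROC point is identical in both cases, yet precision drops from roughly 0.95 to roughly 0.15 once negatives outnumber positives 100 to 1.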

Confusion Matrix Derivatives

Matthews Correlation Coefficient (MCC)

MCC = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}

Range: $[-1, 1]$
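1 = perfect prediction, 0 = no better than random, -1 = total disagreement between predictions and labels
Uses all four cells of the confusion matrix, so it remains informative when classes are imbalanced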
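
A minimal sketch of the MCC formula, using only the standard library. The function name `mcc` and the example counts are assumptions; returning 0.0 when the denominator is zero is a common convention adopted here, not something the formula itself defines.

```python
import math


def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # assumption: treat the undefined case as "no correlation"
    return (tp * tn - fp * fn) / denom


# Made-up scenarios on 10 positives and 990 negatives.
print(mcc(tp=10, fp=0, fn=0, tn=990))   # perfect classifier: 1.0
print(mcc(tp=0, fp=0, fn=10, tn=990))   # always predicts negative: 0.0
print(mcc(tp=0, fp=990, fn=10, tn=0))   # every prediction inverted: -1.0
```

The always-negative classifier scores 99% accuracy on this data but an MCC of 0, which is why MCC is often preferred for imbalanced problems.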

Cohen's Kappa

\kappa = \frac{p_o - p_e}{1 - p_e}

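where $p_o$ is the observed agreement between predictions and labels (the accuracy) and $p_e$ is the agreement expected by chance given the marginal class frequencies.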
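
For the binary case, both agreement terms can be read off the confusion matrix. The sketch below assumes that setting; the name `cohens_kappa`, the example counts, and the choice to return 0.0 when $p_e = 1$ are illustrative assumptions.

```python
def cohens_kappa(tp, fp, fn, tn):
    """Cohen's kappa for binary predictions vs. labels."""
    n = tp + fp + fn + tn
    p_o = (tp + tn) / n  # observed agreement (accuracy)
    # Chance agreement: both say "positive" or both say "negative" independently.
    p_pos = ((tp + fp) / n) * ((tp + fn) / n)
    p_neg = ((tn + fn) / n) * ((tn + fp) / n)
    p_e = p_pos + p_neg
    if p_e == 1:
        return 0.0  # assumption: undefined case reported as 0.0
    return (p_o - p_e) / (1 - p_e)


# Made-up counts for illustration.
print(cohens_kappa(tp=3, fp=1, fn=1, tn=3))     # 0.5
print(cohens_kappa(tp=0, fp=0, fn=10, tn=990))  # 0.0 despite 99% accuracy
```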

Regression Metrics

Mean Squared Error (MSE)

MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2

Root Mean Squared Error (RMSE)

RMSE = \sqrt{MSE}

Mean Absolute Error (MAE)

MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|

R-Squared (Coefficient of Determination)

R^2 = 1 - \frac{SS_{res}}{SS_{tot}} = 1 - \frac{\sum(y_i - \hat{y}_i)^2}{\sum(y_i - \bar{y})^2}
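
The four regression formulas above can be sketched together in plain Python. The function name `regression_metrics` and the sample values are assumptions for illustration; the sketch assumes the targets are not all identical, since $SS_{tot}$ would otherwise be zero.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE, MAE, and R^2 for paired targets and predictions."""
    n = len(y_true)
    errors = [y - y_hat for y, y_hat in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)            # residual sum of squares
    mse = ss_res / n
    rmse = math.sqrt(mse)
    mae = sum(abs(e) for e in errors) / n
    mean_y = sum(y_true) / n
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)  # total sum of squares
    r2 = 1 - ss_res / ss_tot
    return mse, rmse, mae, r2


# Made-up targets and predictions.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 11.0]

mse, rmse, mae, r2 = regression_metrics(y_true, y_pred)
print(mse, round(rmse, 4), mae, round(r2, 4))  # 1.5 1.2247 1.0 0.7
```

The single error of 2 contributes 4 of the 6 units of squared error, which is why MSE and RMSE are more sensitive to outliers than MAE.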

Code Implementation

Explanation of Code

Confusion Matrix: Shows TP, FP, FN, TN and derived metrics.
Precision vs Recall: Demonstrates the tradeoff at different thresholds.
ROC Curve: Plots TPR vs FPR and finds optimal threshold.
Precision-Recall Curve: Shows performance for imbalanced datasets.
Regression Metrics: Demonstrates MSE, RMSE, MAE, and R².
Metric Selection: Provides guidance on when to use each metric.

Real-World Applications

Meta: Content Moderation

Meta optimizes for:

Recall: Catch all harmful content (minimize FN)
Precision: Don't remove legitimate content (minimize FP)
F1: Balance between the two

Netflix: Recommendation Ranking

Netflix uses:

NDCG: Ranking quality (top recommendations matter)
AUC-ROC: Classification of relevant vs irrelevant items
Coverage: Ensure recommendations span all content

Meta Interview Tip: Be prepared to discuss how business costs affect metric selection. For example, in content moderation, missing harmful content (FN) has higher cost than removing legitimate content (FP).

Common Follow-Up Questions

Q1: Why is accuracy misleading for imbalanced datasets?

If 99% of samples are negative, a classifier that predicts "negative" for everything achieves 99% accuracy but catches zero positives. Precision, recall, and F1 are more informative.

Q2: What is the difference between ROC-AUC and PR-AUC?

ROC-AUC uses TPR and FPR, which can be misleading for imbalanced datasets. PR-AUC uses precision and recall, which are more informative when the positive class is rare.

Q3: How do you choose the optimal classification threshold?

It depends on the objective. Common approaches:

Youden's J: Maximizes TPR - FPR (balanced)
Cost-based: Minimize total cost of FP and FN
F1-based: Maximize F1 score
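
The three strategies can be sketched as interchangeable objectives over the same threshold search. All names here (`counts_at`, `youden_j`, `f1_score`, `make_negative_cost`, `best_threshold`), the toy data, and the cost values are hypothetical; candidate thresholds are simply the distinct scores, and the code assumes both classes are present.

```python
def counts_at(y_true, scores, threshold):
    """TP, FP, FN, TN when predicting positive for score >= threshold."""
    tp = fp = fn = tn = 0
    for y, s in zip(y_true, scores):
        predicted = 1 if s >= threshold else 0
        if predicted and y:
            tp += 1
        elif predicted and not y:
            fp += 1
        elif y:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def youden_j(tp, fp, fn, tn):
    return tp / (tp + fn) - fp / (fp + tn)   # TPR - FPR


def f1_score(tp, fp, fn, tn):
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def make_negative_cost(cost_fp, cost_fn):
    """Build an objective that is higher when the total error cost is lower."""
    def negative_cost(tp, fp, fn, tn):
        return -(cost_fp * fp + cost_fn * fn)
    return negative_cost


def best_threshold(y_true, scores, objective):
    """Pick the candidate threshold that maximizes the objective."""
    candidates = sorted(set(scores))
    return max(candidates, key=lambda t: objective(*counts_at(y_true, scores, t)))


# Toy labels and model scores, made up for illustration.
y_true = [0, 0, 1, 0, 1, 1, 1, 1]
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]

print(best_threshold(y_true, scores, youden_j))                  # 0.5
print(best_threshold(y_true, scores, f1_score))                  # 0.3
print(best_threshold(y_true, scores, make_negative_cost(1, 5)))  # 0.3
```

On this toy data, Youden's J settles on 0.5, while F1 and a cost model that prices a false negative at five times a false positive both prefer the lower threshold of 0.3, accepting one false positive to avoid a miss.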

Q4: Can you use accuracy for multi-class problems?

Yes, but accuracy alone can hide poor performance on rare classes. Also report:

Macro-averaged F1: Treats all classes equally
Weighted F1: Accounts for class imbalance
Confusion matrix: Shows per-class performance
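
The difference between macro and weighted averaging shows up clearly on a small imbalanced example. This is a sketch with made-up labels; the helper names are hypothetical, classes are taken from the true labels only, and a class with no true or predicted instances would score 0.0 by the convention used here.

```python
def per_class_f1(y_true, y_pred, label):
    """One-vs-rest F1 for a single class."""
    tp = sum(1 for a, p in zip(y_true, y_pred) if a == label and p == label)
    fp = sum(1 for a, p in zip(y_true, y_pred) if a != label and p == label)
    fn = sum(1 for a, p in zip(y_true, y_pred) if a == label and p != label)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_and_weighted_f1(y_true, y_pred):
    """Macro F1 (plain mean) and weighted F1 (mean weighted by class support)."""
    labels = sorted(set(y_true))
    f1s = {label: per_class_f1(y_true, y_pred, label) for label in labels}
    support = {label: y_true.count(label) for label in labels}
    macro = sum(f1s.values()) / len(labels)
    weighted = sum(f1s[label] * support[label] for label in labels) / len(y_true)
    return macro, weighted


# Made-up 3-class data: class "a" dominates, the single "b" is misclassified.
y_true = ["a"] * 8 + ["b"] + ["c"]
y_pred = ["a"] * 8 + ["a"] + ["c"]

macro, weighted = macro_and_weighted_f1(y_true, y_pred)
print(round(macro, 3), round(weighted, 3))  # 0.647 0.853
```

Accuracy here is 0.9 and weighted F1 is about 0.85, but macro F1 drops to about 0.65 because the completely missed class "b" counts as much as the dominant class "a".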

Company-Specific Tips

Meta Interview Tips

Discuss multi-objective optimization (precision, recall, latency)
Be ready to explain calibration of probabilities
Mention fairness metrics across different groups
Talk about online evaluation (A/B testing)

Netflix Interview Tips

Focus on ranking metrics (NDCG, MRR)
Discuss coverage and diversity metrics
Be prepared to explain business-aligned metrics
Mention user study design
