Why Accuracy Alone Lies to You
Imagine building a cancer detection model. Your dataset has 995 healthy patients and only 5 with cancer. A model that predicts "healthy" for everyone achieves 99.5% accuracy — yet it misses every single cancer case.
This is why we need a richer vocabulary for evaluating classifiers.
Core Insight: Accuracy answers "how often are we right?", while precision, recall, and F1 answer "in what ways are we right and wrong?"
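To make the opening example concrete, here is a minimal sketch in plain Python. The label lists are built by hand to match the 995/5 split described above; nothing here is a real dataset or a trained model.

```python
# Hand-built labels matching the example: 995 healthy (0), 5 with cancer (1).
y_true = [0] * 995 + [1] * 5

# A "model" that predicts healthy for every patient.
y_pred = [0] * len(y_true)

# Accuracy: fraction of predictions that match the true label.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Cancer cases the model actually flagged.
caught = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))

print(f"Accuracy: {accuracy:.1%}")            # Accuracy: 99.5%
print(f"Cancer cases caught: {caught} of 5")  # Cancer cases caught: 0 of 5
```

The accuracy figure looks excellent, while the second line shows the model is useless for the task it was built for.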
The Confusion Matrix — The Foundation of Everything
Every classification metric derives from the confusion matrix — a 2×2 table of prediction outcomes.
| | Predicted Positive (+) | Predicted Negative (-) |
|---|---|---|
| Actual Positive (+) | TP (True Positive) ✅ Correctly said "YES" | FN (False Negative) ❌ Missed positive (Type II Error) |
| Actual Negative (-) | FP (False Positive) ❌ False alarm (Type I Error) | TN (True Negative) ✅ Correctly said "NO" |
Medical Analogy
COVID Test Results
─────────────────────────────────────────────────
Patient actually HAS COVID → Test says POSITIVE → TP ✅
Patient actually HAS COVID → Test says NEGATIVE → FN ❌ (Dangerous! Sent home sick)
Patient DOESN'T have COVID → Test says POSITIVE → FP ❌ (Unnecessary quarantine)
Patient DOESN'T have COVID → Test says NEGATIVE → TN ✅
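The four outcomes can be tallied mechanically. The sketch below uses made-up results for eight patients (the `actual` and `predicted` lists are hypothetical) and maps each (actual, predicted) pair to its confusion-matrix cell.

```python
from collections import Counter

# Map each (actual, predicted) pair to its confusion-matrix cell.
CELL = {
    (1, 1): "TP",  # has COVID, test positive
    (1, 0): "FN",  # has COVID, test negative
    (0, 1): "FP",  # no COVID, test positive
    (0, 0): "TN",  # no COVID, test negative
}

# Hypothetical results for eight patients (1 = has COVID / tests positive).
actual    = [1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 1, 0, 0, 0, 0]

counts = Counter(CELL[pair] for pair in zip(actual, predicted))
print(dict(counts))  # {'TP': 2, 'FN': 1, 'FP': 1, 'TN': 4}
```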
The Four Metrics Explained
1. Accuracy
"Out of ALL predictions, what fraction were correct?"
Accuracy = (TP + TN) / (TP + TN + FP + FN)
When to trust it: Balanced classes (roughly equal positives and negatives)
When it lies: Imbalanced classes (99% negative → predict all negative → 99% accuracy!)
2. Precision
"Of all the times we said POSITIVE, how often were we right?"
Precision = TP / (TP + FP)
High Precision = Few False Alarms
Low Precision = Crying wolf too often
Use case: SPAM detection
→ We don't want legitimate emails flagged as spam (FP is costly)
→ Optimize for HIGH precision
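As a quick illustration, precision can be computed from two counts. The helper name `precision` and the spam-filter numbers below are made up for this sketch; returning 0.0 when nothing is flagged is one common convention, not the only one.

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of positive predictions that were correct: TP / (TP + FP)."""
    flagged = tp + fp
    # With no positive predictions, precision is undefined; 0.0 is an assumed convention.
    return tp / flagged if flagged else 0.0

# Hypothetical cautious filter: 50 emails flagged, 45 of them really spam.
print(precision(tp=45, fp=5))     # 0.9

# Hypothetical aggressive filter: 200 emails flagged, 90 of them really spam.
print(precision(tp=90, fp=110))   # 0.45
```

The aggressive filter catches more spam in absolute terms, but more than half of what it flags is legitimate mail.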
3. Recall (Sensitivity / True Positive Rate)
"Of all ACTUAL positives, how many did we catch?"
Recall = TP / (TP + FN)
High Recall = We catch nearly everything
Low Recall = We miss many positives
Use case: CANCER detection
→ We can't afford to miss a cancer case (FN is deadly)
→ Optimize for HIGH recall
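A small sketch of the same idea for recall. The helper name `recall` is illustrative; the first call reuses the "everyone is healthy" model from the opening example, and the second uses a hypothetical screening model.

```python
def recall(tp: int, fn: int) -> float:
    """Fraction of actual positives that were caught: TP / (TP + FN)."""
    actual_positives = tp + fn
    # With no actual positives, recall is undefined; 0.0 is an assumed convention.
    return tp / actual_positives if actual_positives else 0.0

# "Everyone is healthy" model: 5 cancer cases, none flagged.
print(recall(tp=0, fn=5))  # 0.0

# Hypothetical screening model that flags 4 of the 5 cases.
print(recall(tp=4, fn=1))  # 0.8
```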
4. F1 Score
"The harmonic mean of precision and recall: a single balanced number."
F1 = 2 × Precision × Recall / (Precision + Recall)
Why harmonic mean (not arithmetic)?
If Precision = 1.0, Recall = 0.0:
Arithmetic mean = 0.5 ← falsely suggests decent performance
Harmonic mean = 0.0 ← correctly shows failure
F1 is LOW when either precision OR recall is low.
F1 only approaches 1.0 when BOTH are high.
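The difference between the two means is easy to check numerically. This sketch compares them on a few hand-picked precision/recall pairs; the helper name `f1` and the pairs are chosen for illustration only.

```python
def f1(p: float, r: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0

# Hand-picked (precision, recall) pairs.
pairs = [(1.0, 0.0), (0.9, 0.1), (0.5, 0.5), (0.9, 0.9)]

for p, r in pairs:
    arithmetic = (p + r) / 2
    print(f"P={p:.1f} R={r:.1f}  arithmetic={arithmetic:.2f}  harmonic (F1)={f1(p, r):.2f}")
```

The first three pairs all have an arithmetic mean of 0.50, yet their F1 scores are 0.00, 0.18, and 0.50: the harmonic mean is pulled toward the weaker of the two values.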
Visual Intuition: The Precision-Recall Tradeoff
[Figure: precision-recall curve. Precision (vertical axis, 0.0 to 1.0) falls as recall (horizontal axis, 0.0 to 1.0) rises; a "sweet spot" with high F1 is marked partway along the curve.]
Moving the decision threshold:
→ Lower threshold: Model says YES more often
→ Recall ↑ (catch more positives)
→ Precision ↓ (more false alarms)
→ Higher threshold: Model says YES less often
→ Precision ↑ (fewer false alarms)
→ Recall ↓ (miss more positives)
The area under this curve (PR-AUC) summarizes performance across all thresholds in a single number; higher is better.
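The sketch below traces a few points on a precision-recall curve by hand and then summarizes the curve with average precision, one common step-wise estimate of the area under it. The `scores` and `labels` lists are toy values invented for this example.

```python
# Toy model scores and true labels (1 = positive), invented for illustration.
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.10]
labels = [1,    1,    0,    1,    0,    1,    0,    0]

def pr_point(threshold):
    """Precision and recall when predicting positive for score >= threshold."""
    preds = [s >= threshold for s in scores]
    tp = sum(p and y == 1 for p, y in zip(preds, labels))
    fp = sum(p and y == 0 for p, y in zip(preds, labels))
    fn = sum((not p) and y == 1 for p, y in zip(preds, labels))
    precision = tp / (tp + fp) if (tp + fp) else 1.0  # assumed convention when nothing is flagged
    recall = tp / (tp + fn)
    return precision, recall

# Lowering the threshold trades precision for recall.
for t in (0.85, 0.65, 0.35):
    p, r = pr_point(t)
    print(f"threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}")

# Average precision: mean of the precision values at each rank
# where a true positive appears in the score-sorted list.
ranked = sorted(zip(scores, labels), reverse=True)
hits, precisions = 0, []
for rank, (_, y) in enumerate(ranked, start=1):
    if y == 1:
        hits += 1
        precisions.append(hits / rank)
average_precision = sum(precisions) / len(precisions)
print(f"average precision = {average_precision:.3f}")  # average precision = 0.854
```

On this toy data the three thresholds give (precision, recall) of (1.00, 0.50), (0.75, 0.75), and (0.67, 1.00), which is the tradeoff described above.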
Numeric Example: Fraud Detection
A fraud detection model processes 10,000 transactions:
- 100 are actual fraud (positive class)
- 9,900 are legitimate (negative class)
Results: TP = 80, FP = 20, FN = 20, TN = 9,880
Confusion Matrix:
| | Predicted Fraud | Predicted Legit | Total |
|---|---|---|---|
| Actual Fraud | 80 | 20 | 100 |
| Actual Legit | 20 | 9,880 | 9,900 |
| Total | 100 | 9,900 | 10,000 |
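Computing the metrics:
Accuracy = (TP + TN) / Total = (80 + 9,880) / 10,000 = 0.996
Precision = TP / (TP + FP) = 80 / (80 + 20) = 0.80
Recall = TP / (TP + FN) = 80 / (80 + 20) = 0.80
F1 = 2 × 0.80 × 0.80 / (0.80 + 0.80) = 0.80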
Insight: Accuracy (99.6%) looks amazing, but that's mostly because 99% of transactions are legitimate. The model catches 80% of fraud but lets 20% slip through — that's the number that matters for the business.
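One way to see how little the accuracy figure tells us is to compare the model with a trivial baseline that labels every transaction as legitimate. The sketch below does that with the counts from this example; the `metrics` helper and the baseline are illustrative additions, not part of any library.

```python
def metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Counts from the fraud example above.
model = metrics(tp=80, fp=20, fn=20, tn=9_880)

# Hypothetical baseline that predicts "legit" for all 10,000 transactions.
baseline = metrics(tp=0, fp=0, fn=100, tn=9_900)

for name, m in (("model", model), ("always-legit baseline", baseline)):
    print(name, {k: round(v, 3) for k, v in m.items()})
```

The baseline reaches 99.0% accuracy with zero recall, so the model's 99.6% accuracy is only 0.6 points better, while its recall is 80 points better.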
Complete Python Implementation
```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, classification_report, roc_auc_score,
)

# ─── 1. Simulate imbalanced fraud dataset ──────────────────────────
X, y = make_classification(
    n_samples=10_000,
    n_features=20,
    n_informative=10,
    weights=[0.99, 0.01],  # 99% legit, 1% fraud
    flip_y=0.01,           # label noise: about 1% of labels are reassigned at random
    random_state=42,
)
print(f"Class distribution: {np.bincount(y)}")  # about 1% fraud; flip_y shifts the exact counts

# ─── 2. Train/test split ───────────────────────────────────────────
# stratify=y keeps the fraud rate the same in the train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# ─── 3. Fit model ──────────────────────────────────────────────────
# class_weight='balanced' up-weights the rare fraud class during training;
# max_iter is raised so the solver converges without warnings
clf = LogisticRegression(class_weight='balanced', max_iter=1000, random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
y_proba = clf.predict_proba(X_test)[:, 1]  # probability of the fraud class

# ─── 4. All metrics in one place ──────────────────────────────────
print("\n" + "="*55)
print(" CLASSIFICATION METRICS REPORT")
print("="*55)
print(f" Accuracy : {accuracy_score(y_test, y_pred):.4f}")
print(f" Precision : {precision_score(y_test, y_pred):.4f}")
print(f" Recall : {recall_score(y_test, y_pred):.4f}")
print(f" F1 Score : {f1_score(y_test, y_pred):.4f}")
print(f" ROC-AUC : {roc_auc_score(y_test, y_proba):.4f}")
print("="*55)

# ─── 5. Detailed confusion matrix ─────────────────────────────────
# For binary labels, ravel() returns the cells in the order TN, FP, FN, TP
cm = confusion_matrix(y_test, y_pred)
tn, fp, fn, tp = cm.ravel()
print("\nConfusion Matrix:")
print(f" True Negatives (TN): {tn}")
print(f" False Positives (FP): {fp} ← False alarms")
print(f" False Negatives (FN): {fn} ← Missed fraud!")
print(f" True Positives (TP): {tp}")

# ─── 6. Full sklearn report ───────────────────────────────────────
print("\nFull Classification Report:")
print(classification_report(y_test, y_pred, target_names=['Legit', 'Fraud']))
```
Illustrative output (the numbers show the shape of the report; exact values depend on your scikit-learn version and will differ):
```text
Class distribution: [9899  101]

=======================================================
 CLASSIFICATION METRICS REPORT
=======================================================
 Accuracy : 0.9925
 Precision : 0.6000
 Recall : 0.7500
 F1 Score : 0.6667
 ROC-AUC : 0.9412
=======================================================

Confusion Matrix:
 True Negatives (TN): 1970
 False Positives (FP): 10 ← False alarms
 False Negatives (FN): 5 ← Missed fraud!
 True Positives (TP): 15
```
Computing From Scratch (No sklearn)
```python
import numpy as np

def compute_all_metrics(y_true, y_pred):
    """Compute all classification metrics from scratch."""
    y_true = np.array(y_true)
    y_pred = np.array(y_pred)

    # Count each cell of the confusion matrix
    TP = np.sum((y_pred == 1) & (y_true == 1))
    TN = np.sum((y_pred == 0) & (y_true == 0))
    FP = np.sum((y_pred == 1) & (y_true == 0))
    FN = np.sum((y_pred == 0) & (y_true == 1))

    # Guard every ratio against a zero denominator
    accuracy = (TP + TN) / (TP + TN + FP + FN)
    precision = TP / (TP + FP) if (TP + FP) > 0 else 0
    recall = TP / (TP + FN) if (TP + FN) > 0 else 0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) > 0 else 0)
    specificity = TN / (TN + FP) if (TN + FP) > 0 else 0  # True Negative Rate

    return {
        "TP": int(TP), "TN": int(TN), "FP": int(FP), "FN": int(FN),
        "Accuracy": round(accuracy, 4),
        "Precision": round(precision, 4),
        "Recall": round(recall, 4),
        "F1": round(f1, 4),
        "Specificity": round(specificity, 4),
    }

# Test it
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
metrics = compute_all_metrics(y_true, y_pred)
for k, v in metrics.items():
    print(f" {k:<12}: {v}")
```
Tuning the Decision Threshold
By default, a binary classifier such as the logistic regression above labels a sample positive when its predicted probability exceeds 0.5. Moving this threshold shifts the precision-recall tradeoff.
```python
# Try different thresholds
# (reuses y_test and y_proba from the logistic regression example above)
thresholds = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
print(f"{'Threshold':>10} {'Precision':>10} {'Recall':>10} {'F1':>10}")
print("-" * 44)
for t in thresholds:
    y_pred_t = (y_proba >= t).astype(int)
    p = precision_score(y_test, y_pred_t, zero_division=0)
    r = recall_score(y_test, y_pred_t, zero_division=0)
    f = f1_score(y_test, y_pred_t, zero_division=0)
    print(f"{t:>10.1f} {p:>10.4f} {r:>10.4f} {f:>10.4f}")
```
Illustrative output (selected thresholds):
```text
 Threshold  Precision     Recall         F1
--------------------------------------------
       0.1     0.0500     1.0000     0.0952  ← Catches everything, terrible precision
       0.3     0.1667     0.9000     0.2813
       0.5     0.6000     0.7500     0.6667  ← Default
       0.7     0.8571     0.6000     0.7059  ← Higher precision, lower recall
       0.9     1.0000     0.1000     0.1818  ← Only predicts when very certain
```
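In practice the threshold is often chosen against a business constraint rather than by eye. The sketch below picks the highest-precision threshold that still meets a minimum recall. The helper names (`precision_recall`, `pick_threshold`), the toy scores, and the 0.75 recall floor are all assumptions made for illustration.

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall when predicting positive for score >= threshold."""
    tp = sum(s >= threshold and y == 1 for s, y in zip(scores, labels))
    fp = sum(s >= threshold and y == 0 for s, y in zip(scores, labels))
    fn = sum(s < threshold and y == 1 for s, y in zip(scores, labels))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

def pick_threshold(scores, labels, candidates, min_recall):
    """Highest-precision candidate threshold whose recall is at least min_recall."""
    best = None
    for t in candidates:
        p, r = precision_recall(scores, labels, t)
        if r >= min_recall and (best is None or p > best[1]):
            best = (t, p, r)
    return best  # None if no candidate meets the recall floor

# Toy scores and labels (1 = positive), invented for this sketch.
scores = [0.92, 0.85, 0.77, 0.64, 0.58, 0.45, 0.33, 0.21, 0.15, 0.05]
labels = [1,    1,    0,    1,    0,    1,    0,    0,    0,    0]
candidates = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

# Require at least 75% recall, then take the most precise threshold.
print(pick_threshold(scores, labels, candidates, min_recall=0.75))  # (0.6, 0.75, 0.75)
```

Thresholds should be selected on a validation set, not the test set, so that the final test metrics stay an honest estimate.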
Multi-Class Metrics
When there are 3+ classes, metrics are computed per class and aggregated:
```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score, f1_score

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Three averaging strategies:
for avg in ['macro', 'weighted', 'micro']:
    p = precision_score(y_test, y_pred, average=avg)
    r = recall_score(y_test, y_pred, average=avg)
    f = f1_score(y_test, y_pred, average=avg)
    print(f" {avg:<10}: P={p:.3f} R={r:.3f} F1={f:.3f}")
```
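Average Strategy Explained:
macro → Unweighted mean of the per-class scores, so every class counts equally
Use when class sizes are similar, or when rare classes matter as much as common ones
weighted → Mean of the per-class scores, weighted by class support (size)
Reflects overall performance on imbalanced data, but large classes dominate the score
micro → Aggregate TP/FP/FN globally, then compute the metric
In single-label multi-class problems, micro precision, recall, and F1 all equal accuracy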
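To see how the three averages can disagree, here is a from-scratch sketch for precision on a small, deliberately imbalanced three-class example. The labels are invented, and scoring a class with no predictions as 0.0 is an assumed convention.

```python
from collections import Counter

# Invented labels: class 0 is common, class 2 is rare and never predicted.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 2]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

classes = sorted(set(y_true))
support = Counter(y_true)  # number of true samples per class

# Per-class true positives and false positives (one-vs-rest).
tp = {c: sum(t == c and p == c for t, p in zip(y_true, y_pred)) for c in classes}
fp = {c: sum(t != c and p == c for t, p in zip(y_true, y_pred)) for c in classes}

# Per-class precision; a class that is never predicted scores 0.0 here.
precision = {c: tp[c] / (tp[c] + fp[c]) if (tp[c] + fp[c]) else 0.0 for c in classes}

macro = sum(precision.values()) / len(classes)
weighted = sum(precision[c] * support[c] for c in classes) / len(y_true)
micro = sum(tp.values()) / (sum(tp.values()) + sum(fp.values()))
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(f"macro={macro:.3f}  weighted={weighted:.3f}  micro={micro:.3f}  accuracy={accuracy:.3f}")
# macro=0.460  weighted=0.629  micro=0.700  accuracy=0.700
```

The macro average is dragged down by the rare class the model never predicts, while the weighted and micro averages largely hide that failure.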
The Metrics Decision Guide
Which kind of error costs you more?
HIGH cost of False Negative (e.g., cancer, fraud, critical failure)
→ Prioritize RECALL
→ Accept more false alarms to catch real positives
→ Lower the decision threshold
HIGH cost of False Positive (spam filter, loan approval)
→ Prioritize PRECISION
→ Better to miss some positives than cause false alarms
→ Raise the decision threshold
Balanced (general classification)
→ Use F1 SCORE
→ Harmonic mean balances both concerns
Imbalanced classes
→ Use PRECISION-RECALL AUC (PR-AUC)
→ Avoid accuracy — it's misleading
→ Consider Matthews Correlation Coefficient (MCC)
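The Matthews Correlation Coefficient uses all four confusion-matrix cells and ranges from -1 to +1, with 0 meaning no better than chance. A minimal sketch follows, applied to the fraud counts from the numeric example; the helper name `mcc` and the choice to return 0.0 when the denominator is zero are assumptions of this sketch.

```python
import math

def mcc(tp: int, fp: int, fn: int, tn: int) -> float:
    """Matthews Correlation Coefficient from confusion-matrix counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # If any row or column of the matrix is empty, MCC is undefined; return 0.0 by convention.
    return (tp * tn - fp * fn) / denominator if denominator else 0.0

# Fraud model from the numeric example: TP=80, FP=20, FN=20, TN=9,880.
print(round(mcc(tp=80, fp=20, fn=20, tn=9_880), 3))  # 0.798

# Hypothetical baseline that predicts "legit" for every transaction.
print(mcc(tp=0, fp=0, fn=100, tn=9_900))             # 0.0
```

Unlike accuracy, which stays at 99% for the always-legit baseline, MCC drops to zero.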
Quick Reference Summary
| Metric | Formula | Answers | Best When |
|---|---|---|---|
| Accuracy | (TP+TN)/Total | How often correct overall? | Balanced classes |
| Precision | TP/(TP+FP) | When we say YES, how often right? | Cost of false alarm is high |
| Recall | TP/(TP+FN) | Of all real positives, how many found? | Cost of missing a positive is high |
| F1 | 2×P×R/(P+R) | Single balanced score | Want to balance P and R |
| Specificity | TN/(TN+FP) | Of all actual negatives, how many correctly identified? | Confirmatory tests, where false alarms are costly |
| ROC-AUC | Area under ROC curve | Discrimination across thresholds | Comparing models |
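ROC-AUC has a useful reading that needs no curve at all: it is the probability that a randomly chosen positive receives a higher score than a randomly chosen negative. The sketch below computes it that way on toy data; the function name `roc_auc_pairwise` and the scores are made up, and the pairwise loop is only practical for small samples.

```python
def roc_auc_pairwise(scores, labels):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct ranking.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))

# Toy scores and labels (1 = positive), invented for this sketch.
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.10]
labels = [1,    1,    0,    1,    0,    1,    0,    0]

print(roc_auc_pairwise(scores, labels))  # 0.8125
```

Here 13 of the 16 positive-negative pairs are ordered correctly, giving 0.8125.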
Key Takeaways
- Accuracy is misleading on imbalanced datasets — a 99% accuracy can mean a useless model
- Precision = quality of positive predictions; Recall = coverage of actual positives
- F1 score is the go-to when you want a single number that respects both P and R
- Always tune the threshold — the default 0.5 is rarely optimal for real problems
- Use the confusion matrix to understand your model's failure modes, not just overall numbers
- Match your metric to your business goal — missing cancer (FN) is far worse than a false alarm (FP)