Accuracy, Precision, Recall & F1 Score

Why Accuracy Alone Lies to You

Imagine building a cancer detection model. Your dataset has 995 healthy patients and only 5 with cancer. A model that predicts "healthy" for everyone achieves 99.5% accuracy — yet it misses every single cancer case.

This is why we need a richer vocabulary for evaluating classifiers.

Core Insight: Accuracy answers "how often are we right?", while precision, recall, and F1 tell us which kinds of mistakes we make.
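To make the cancer example concrete, here is a minimal sketch in plain Python. The `always_healthy` list is a hypothetical stand-in for the "predict healthy for everyone" model described above, not a real classifier.

```python
# Labels: 1 = cancer, 0 = healthy (995 healthy patients, 5 with cancer)
y_true = [0] * 995 + [1] * 5

# Hypothetical "model" that predicts healthy for every patient
always_healthy = [0] * len(y_true)

# Accuracy: fraction of predictions that match the true label
correct = sum(t == p for t, p in zip(y_true, always_healthy))
accuracy = correct / len(y_true)

# Recall: fraction of actual cancer cases the model flagged
caught = sum(1 for t, p in zip(y_true, always_healthy) if t == 1 and p == 1)
recall = caught / sum(y_true)

print(f"Accuracy: {accuracy:.1%}")  # 99.5%
print(f"Recall:   {recall:.1%}")    # 0.0%
```

The same predictions score 99.5% on one metric and 0% on another, which is the gap the rest of this guide is about.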

The Confusion Matrix — The Foundation of Everything

Every classification metric derives from the confusion matrix — a 2×2 table of prediction outcomes.

                    ┌─────────────────────────────────────────────┐
                    │         PREDICTED LABEL                      │
                    │    Positive (+)        Negative (-)          │
         ┌──────────┼──────────────────────┬──────────────────────┤
  ACTUAL │Positive  │  TP (True Positive)  │  FN (False Negative) │
  LABEL  │  (+)     │  ✅ Correctly said   │  ❌ Missed positive   │
         │          │     "YES"            │     (Type II Error)   │
         ├──────────┼──────────────────────┼──────────────────────┤
         │Negative  │  FP (False Positive) │  TN (True Negative)  │
         │   (-)    │  ❌ False alarm       │  ✅ Correctly said   │
         │          │     (Type I Error)   │     "NO"             │
         └──────────┴──────────────────────┴──────────────────────┘

Medical Analogy

COVID Test Results
─────────────────────────────────────────────────
Patient actually HAS COVID  →  Test says POSITIVE  →  TP ✅
Patient actually HAS COVID  →  Test says NEGATIVE  →  FN ❌ (Dangerous! Sent home sick)
Patient DOESN'T have COVID  →  Test says POSITIVE  →  FP ❌ (Unnecessary quarantine)
Patient DOESN'T have COVID  →  Test says NEGATIVE  →  TN ✅
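The four outcomes above can be tallied mechanically. The sketch below uses a made-up batch of seven test results (the `results` list and the `outcome` helper are illustrative names, not part of any library) and counts how many land in each cell of the confusion matrix.

```python
from collections import Counter

# (actual, predicted) pairs for a hypothetical batch of COVID tests.
# First value: patient has COVID. Second value: test says positive.
results = [
    (True, True), (True, True), (True, False),
    (False, True), (False, False), (False, False), (False, False),
]

def outcome(actual, predicted):
    """Name the confusion-matrix cell for one (actual, predicted) pair."""
    if actual:
        return "TP" if predicted else "FN"
    return "FP" if predicted else "TN"

counts = Counter(outcome(a, p) for a, p in results)
print(dict(counts))  # {'TP': 2, 'FN': 1, 'FP': 1, 'TN': 3}
```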

The Four Metrics Explained

1. Accuracy

$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$

"Out of ALL predictions, what fraction were correct?"

When to trust it: Balanced classes (roughly equal positives and negatives)
When it lies:     Imbalanced classes (99% negative → predict all negative → 99% accuracy!)

2. Precision

$\text{Precision} = \frac{TP}{TP + FP}$

"Of all the times we said POSITIVE, how often were we right?"

High Precision = Few False Alarms
Low  Precision = Crying wolf too often

Use case: SPAM detection
  → We don't want legitimate emails flagged as spam (FP is costly)
  → Optimize for HIGH precision

3. Recall (Sensitivity / True Positive Rate)

$\text{Recall} = \frac{TP}{TP + FN}$

"Of all ACTUAL positives, how many did we catch?"

High Recall = We catch nearly everything
Low  Recall = We miss many positives

Use case: CANCER detection
  → We can't afford to miss a cancer case (FN is deadly)
  → Optimize for HIGH recall

4. F1 Score

$\text{F1} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2 \cdot TP}{2 \cdot TP + FP + FN}$
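The two forms of the F1 formula are equivalent. As a short derivation, substitute the definitions of precision and recall, then multiply numerator and denominator by $(TP + FP)(TP + FN)$ (this assumes $TP > 0$, so that nothing divides by zero):

$\text{F1} = \frac{2 \cdot \frac{TP}{TP + FP} \cdot \frac{TP}{TP + FN}}{\frac{TP}{TP + FP} + \frac{TP}{TP + FN}} = \frac{2 \cdot TP^2}{TP \cdot (TP + FN) + TP \cdot (TP + FP)} = \frac{2 \cdot TP}{2 \cdot TP + FP + FN}$

The count-based form on the right shows that TN never enters F1, which is why F1 stays informative when negatives vastly outnumber positives.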

"The harmonic mean of precision and recall — balanced single number."

Why harmonic mean (not arithmetic)?
  If Precision = 1.0, Recall = 0.0:
    Arithmetic mean = 0.5  ← falsely suggests decent performance
    Harmonic mean   = 0.0  ← correctly shows failure

F1 is LOW when either precision OR recall is low.
F1 only approaches 1.0 when BOTH are high.
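The comparison above is easy to check numerically. This is a small sketch in plain Python; the helper names `arithmetic_mean` and `f1` are chosen here for illustration, and F1 is defined as 0 when precision and recall are both 0.

```python
def arithmetic_mean(p, r):
    """Plain average of precision and recall."""
    return (p + r) / 2

def f1(p, r):
    """Harmonic mean of precision and recall; 0 when both are 0."""
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0

# Lopsided pairs first, balanced pairs last
for p, r in [(1.0, 0.0), (0.9, 0.1), (0.5, 0.5), (0.9, 0.9)]:
    print(f"P={p:.1f} R={r:.1f}  arithmetic={arithmetic_mean(p, r):.2f}  F1={f1(p, r):.2f}")
# P=1.0 R=0.0  arithmetic=0.50  F1=0.00
# P=0.9 R=0.1  arithmetic=0.50  F1=0.18
# P=0.5 R=0.5  arithmetic=0.50  F1=0.50
# P=0.9 R=0.9  arithmetic=0.90  F1=0.90
```

The first three pairs share the same arithmetic mean, but only the balanced one keeps an F1 of 0.50.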

Visual Intuition: The Precision-Recall Tradeoff

         Precision ▲
       1.0 │ ●
           │   ●
       0.8 │     ●
           │       ●   ← Sweet spot (high F1)
       0.6 │         ●
           │           ●
       0.4 │             ●
           │               ●
       0.2 │                 ●
           │                   ●
       0.0 └───────────────────────── Recall ▶
           0.0  0.2  0.4  0.6  0.8  1.0

Moving the decision threshold:
  → Lower threshold: Model says YES more often
    → Recall ↑  (catch more positives)
    → Precision ↓ (more false alarms)

  → Higher threshold: Model says YES less often
    → Precision ↑ (fewer false alarms)
    → Recall ↓  (miss more positives)

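The PR-AUC, also written AUC-PR (the area under this curve), summarizes performance across all thresholds. A classifier that guesses at random scores roughly the positive-class prevalence, so judge PR-AUC against that baseline rather than against 0.5.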
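As a rough sketch of how that area can be estimated, the function below ranks examples by score and adds up the precision at each rank where a positive is found, weighted by the recall gained there. This step-wise sum is commonly called average precision. The `labels` and `scores` values are made-up illustration data, and the sketch assumes no two scores are tied.

```python
def average_precision(labels, scores):
    """Step-wise area under the precision-recall curve (assumes no tied scores)."""
    ranked = sorted(zip(scores, labels), reverse=True)  # highest score first
    total_pos = sum(labels)
    tp = 0
    ap = 0.0
    for k, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            tp += 1
            # Recall rises by 1/total_pos here; precision at this rank is tp/k
            ap += (tp / k) / total_pos
    return ap

labels = [1, 1, 0, 1, 0, 0, 1, 0]
scores = [0.95, 0.85, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10]

ap = average_precision(labels, scores)
prevalence = sum(labels) / len(labels)  # what random guessing would score

print(f"Average precision: {ap:.4f}")          # 0.8304
print(f"Random baseline:   {prevalence:.4f}")  # 0.5000
```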

Numeric Example: Fraud Detection

A fraud detection model processes 10,000 transactions:

100 are actual fraud (positive class)
9,900 are legitimate (negative class)

Results: TP = 80, FP = 20, FN = 20, TN = 9,880

Confusion Matrix:
                    Predicted Fraud    Predicted Legit
Actual Fraud   │        80                  20       │  100 total fraud
Actual Legit   │        20                9,880      │  9,900 total legit
               └─────────────────────────────────────
                      100 predicted        9,900 predicted

Computing the metrics:

$\text{Accuracy} = \frac{80 + 9880}{10000} = \frac{9960}{10000} = 99.6\%$

$\text{Precision} = \frac{80}{80 + 20} = \frac{80}{100} = 80\%$

$\text{Recall} = \frac{80}{80 + 20} = \frac{80}{100} = 80\%$

$\text{F1} = \frac{2 \times 0.80 \times 0.80}{0.80 + 0.80} = 0.80 = 80\%$

Insight: Accuracy (99.6%) looks amazing, but that's mostly because 99% of transactions are legitimate. The model catches 80% of fraud but lets 20% slip through — that's the number that matters for the business.
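The arithmetic above can be double-checked in a few lines of plain Python. The "always legit" baseline is an extra comparison added here for context: it shows how much of the 99.6% comes from class imbalance alone.

```python
# Counts from the fraud example above
tp, fp, fn, tn = 80, 20, 20, 9_880
total = tp + fp + fn + tn

accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * tp / (2 * tp + fp + fn)

# Baseline: label every transaction "legit", so only actual negatives are correct
baseline_accuracy = (fp + tn) / total

print(f"accuracy={accuracy:.3f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# accuracy=0.996 precision=0.80 recall=0.80 f1=0.80
print(f"always-legit baseline accuracy={baseline_accuracy:.3f}")
# always-legit baseline accuracy=0.990
```

The model beats the do-nothing baseline by only 0.6 percentage points of accuracy, while its recall moves from 0% to 80%.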

Complete Python Implementation

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, classification_report, roc_auc_score
)

# ─── 1. Simulate imbalanced fraud dataset ──────────────────────────
X, y = make_classification(
    n_samples=10_000,
    n_features=20,
    n_informative=10,
    weights=[0.99, 0.01],   # 99% legit, 1% fraud
    flip_y=0.01,
    random_state=42
)

print(f"Class distribution: {np.bincount(y)}")  # roughly 99:1; flip_y label noise shifts the exact counts

# ─── 2. Train/test split ───────────────────────────────────────────
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# ─── 3. Fit model ──────────────────────────────────────────────────
clf = LogisticRegression(class_weight='balanced', max_iter=1000, random_state=42)
clf.fit(X_train, y_train)
y_pred  = clf.predict(X_test)
y_proba = clf.predict_proba(X_test)[:, 1]

# ─── 4. All metrics in one place ──────────────────────────────────
print("\n" + "="*55)
print("          CLASSIFICATION METRICS REPORT")
print("="*55)
print(f"  Accuracy  : {accuracy_score(y_test, y_pred):.4f}")
print(f"  Precision : {precision_score(y_test, y_pred):.4f}")
print(f"  Recall    : {recall_score(y_test, y_pred):.4f}")
print(f"  F1 Score  : {f1_score(y_test, y_pred):.4f}")
print(f"  ROC-AUC   : {roc_auc_score(y_test, y_proba):.4f}")
print("="*55)

# ─── 5. Detailed confusion matrix ─────────────────────────────────
cm = confusion_matrix(y_test, y_pred)
tn, fp, fn, tp = cm.ravel()

print(f"\nConfusion Matrix:")
print(f"  True Negatives  (TN): {tn}")
print(f"  False Positives (FP): {fp}  ← False alarms")
print(f"  False Negatives (FN): {fn}  ← Missed fraud!")
print(f"  True Positives  (TP): {tp}")

# ─── 6. Full sklearn report ───────────────────────────────────────
print("\nFull Classification Report:")
print(classification_report(y_test, y_pred, target_names=['Legit', 'Fraud']))

Illustrative output (exact numbers depend on your scikit-learn version and the generated data):

Class distribution: [9899  101]

=======================================================
          CLASSIFICATION METRICS REPORT
=======================================================
  Accuracy  : 0.9925
  Precision : 0.6000
  Recall    : 0.7500
  F1 Score  : 0.6667
  ROC-AUC   : 0.9412
=======================================================

Confusion Matrix:
  True Negatives  (TN): 1970
  False Positives (FP): 10  ← False alarms
  False Negatives (FN): 5  ← Missed fraud!
  True Positives  (TP): 15

Computing From Scratch (No sklearn)

import numpy as np

def compute_all_metrics(y_true, y_pred):
    """Compute all classification metrics from scratch."""
    y_true = np.array(y_true)
    y_pred = np.array(y_pred)

    TP = np.sum((y_pred == 1) & (y_true == 1))
    TN = np.sum((y_pred == 0) & (y_true == 0))
    FP = np.sum((y_pred == 1) & (y_true == 0))
    FN = np.sum((y_pred == 0) & (y_true == 1))

    accuracy  = (TP + TN) / (TP + TN + FP + FN)
    precision = TP / (TP + FP) if (TP + FP) > 0 else 0
    recall    = TP / (TP + FN) if (TP + FN) > 0 else 0
    f1        = (2 * precision * recall / (precision + recall)
                 if (precision + recall) > 0 else 0)
    specificity = TN / (TN + FP) if (TN + FP) > 0 else 0  # True Negative Rate

    return {
        "TP": int(TP), "TN": int(TN), "FP": int(FP), "FN": int(FN),
        "Accuracy":    round(accuracy, 4),
        "Precision":   round(precision, 4),
        "Recall":      round(recall, 4),
        "F1":          round(f1, 4),
        "Specificity": round(specificity, 4),
    }

# Test it: here TP=3, FN=2, FP=2, TN=3, so all five ratio metrics come out to 0.6
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

metrics = compute_all_metrics(y_true, y_pred)
for k, v in metrics.items():
    print(f"  {k:<12}: {v}")

Tuning the Decision Threshold

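By default, classifiers predict positive when the predicted probability is at least 0.5. Changing this threshold shifts the precision-recall tradeoff. The sweep below uses the test set for illustration; in practice, pick the threshold on a validation set so the reported test metrics stay unbiased.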

# Try different thresholds
thresholds = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

print(f"{'Threshold':>10} {'Precision':>10} {'Recall':>10} {'F1':>10}")
print("-" * 44)

for t in thresholds:
    y_pred_t = (y_proba >= t).astype(int)
    p = precision_score(y_test, y_pred_t, zero_division=0)
    r = recall_score(y_test, y_pred_t, zero_division=0)
    f = f1_score(y_test, y_pred_t, zero_division=0)
    print(f"{t:>10.1f} {p:>10.4f} {r:>10.4f} {f:>10.4f}")

Illustrative output (selected rows):

Threshold  Precision     Recall         F1
--------------------------------------------
       0.1     0.0500     1.0000     0.0952   ← Catches everything, terrible precision
       0.3     0.1667     0.9000     0.2813
       0.5     0.6000     0.7500     0.6667   ← Default
       0.7     0.8571     0.6000     0.7059   ← Higher precision, lower recall
       0.9     1.0000     0.1000     0.1818   ← Only predicts when very certain
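Rather than reading a threshold off the table by eye, you can state the business constraint and search for it. The sketch below, in plain Python with made-up scores, returns the highest threshold that still meets a minimum recall. The helper names `precision_recall_at` and `pick_threshold` are hypothetical, and in practice the search should run on validation data.

```python
def precision_recall_at(labels, scores, threshold):
    """Precision and recall when predicting positive for score >= threshold."""
    tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
    fn = sum(1 for y, s in zip(labels, scores) if s < threshold and y == 1)
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall

def pick_threshold(labels, scores, min_recall):
    """Highest cutoff whose recall meets min_recall, as (threshold, precision, recall)."""
    # Recall can only grow as the threshold drops, so the first hit is the highest
    for t in sorted(set(scores), reverse=True):
        p, r = precision_recall_at(labels, scores, t)
        if r >= min_recall:
            return t, p, r
    return None

labels = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
scores = [0.92, 0.81, 0.77, 0.64, 0.58, 0.45, 0.33, 0.29, 0.18, 0.07]

print(pick_threshold(labels, scores, min_recall=0.75))  # (0.64, 0.75, 0.75)
```

Raising `min_recall` to 1.0 on the same data pushes the threshold down to 0.33 and precision down to 4/7, which is the tradeoff from the table in miniature.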

Multi-Class Metrics

When there are 3+ classes, metrics are computed per class and aggregated:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Three averaging strategies:
for avg in ['macro', 'weighted', 'micro']:
    p = precision_score(y_test, y_pred, average=avg)
    r = recall_score(y_test, y_pred, average=avg)
    f = f1_score(y_test, y_pred, average=avg)
    print(f"  {avg:<10}: P={p:.3f}  R={r:.3f}  F1={f:.3f}")

Average Strategy Explained:
  macro    → Unweighted mean of per-class scores, equal weight per class
             Exposes poor performance on rare classes
  weighted → Mean of per-class scores weighted by class support (size)
             Reflects the class mix, so large classes dominate
  micro    → Aggregate TP/FP/FN globally, then compute
             In single-label problems, micro P = R = F1 = accuracy
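To see how the three strategies differ, here is a from-scratch sketch in plain Python on a tiny, made-up 3-class example in which the rare class (label 2) is never predicted. The helper names are illustrative, and a per-class F1 with no true or predicted members is treated as 0.

```python
def per_class_counts(y_true, y_pred, cls):
    """One-vs-rest TP, FP, FN for a single class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    return tp, fp, fn

def f1_from_counts(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0

y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 2]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]
classes = sorted(set(y_true))

counts = {c: per_class_counts(y_true, y_pred, c) for c in classes}
support = {c: y_true.count(c) for c in classes}

# macro: plain mean of per-class F1
macro = sum(f1_from_counts(*counts[c]) for c in classes) / len(classes)
# weighted: per-class F1 weighted by how many true samples each class has
weighted = sum(support[c] * f1_from_counts(*counts[c]) for c in classes) / len(y_true)
# micro: sum TP/FP/FN over classes first, then compute F1 once
micro = f1_from_counts(*(sum(col) for col in zip(*counts.values())))
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(f"macro={macro:.3f} weighted={weighted:.3f} micro={micro:.3f} accuracy={accuracy:.3f}")
# macro=0.479 weighted=0.662 micro=0.700 accuracy=0.700
```

Macro F1 drops the most because the missed rare class counts as a full third of the average, while micro F1 matches accuracy exactly.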

The Metrics Decision Guide

Which kind of error costs you more?

HIGH cost of False Negative (e.g., cancer, fraud, critical failure)
  → Prioritize RECALL
  → Accept more false alarms to catch real positives
  → Lower the decision threshold

HIGH cost of False Positive (e.g., spam filter, loan approval)
  → Prioritize PRECISION
  → Better to miss some positives than cause false alarms
  → Raise the decision threshold

Balanced (general classification)
  → Use F1 SCORE
  → Harmonic mean balances both concerns

Imbalanced classes
  → Use PRECISION-RECALL AUC (PR-AUC)
  → Avoid accuracy — it's misleading
  → Consider Matthews Correlation Coefficient (MCC)
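For reference, MCC can be computed directly from the four confusion-matrix counts. The sketch below uses the standard formula and, as a convention assumed here, returns 0 when the denominator is zero. The first call uses the counts from the fraud example; the second is a hypothetical model that labels everything legit.

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient; 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom > 0 else 0.0

print(f"fraud model:  {mcc(tp=80, tn=9_880, fp=20, fn=20):.3f}")  # 0.798
print(f"always legit: {mcc(tp=0, tn=9_900, fp=0, fn=100):.3f}")   # 0.000
```

Both models have accuracy of 99% or higher, yet MCC separates them clearly because it uses all four cells of the confusion matrix.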

Quick Reference Summary

Metric       Formula               Answers                                 Best When
-----------  --------------------  --------------------------------------  ----------------------------------
Accuracy     (TP+TN)/Total         How often correct overall?              Balanced classes
Precision    TP/(TP+FP)            When we say YES, how often right?       Cost of false alarm is high
Recall       TP/(TP+FN)            Of all real positives, how many found?  Cost of missing a positive is high
F1           2×P×R/(P+R)           Single balanced score                   Want to balance P and R
Specificity  TN/(TN+FP)            Of all negatives, how many identified?  False alarms are costly (e.g., confirmatory tests)
ROC-AUC      Area under ROC curve  Discrimination across thresholds        Comparing models

Key Takeaways

Accuracy is misleading on imbalanced datasets — a 99% accuracy can mean a useless model
Precision = quality of positive predictions; Recall = coverage of actual positives
F1 score is the go-to when you want a single number that respects both P and R
Tune the decision threshold on validation data: the default 0.5 is often suboptimal, especially with imbalanced classes
Use the confusion matrix to understand your model's failure modes, not just overall numbers
Match your metric to your business goal: in cancer screening, a missed case (FN) is far worse than a false alarm (FP)