Accuracy, Precision, Recall & F1 Score

Why Accuracy Alone Lies to You

Imagine building a cancer detection model. Your dataset has 995 healthy patients and only 5 with cancer. A model that predicts "healthy" for everyone achieves 99.5% accuracy — yet it misses every single cancer case.

This is why we need a richer vocabulary for evaluating classifiers.

Core Insight: Accuracy answers "how often are we right?" — but precision, recall, and F1 answer what kind of right and wrong we are.
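
The cancer example can be checked in a few lines. This is a minimal sketch in plain Python; the label lists are built here from the counts in the example (995 healthy, 5 with cancer) and do not come from a real dataset.

```python
# 1 = cancer (positive), 0 = healthy (negative); counts taken from the example
y_true = [1] * 5 + [0] * 995
# A "model" that predicts healthy for every patient
y_pred = [0] * len(y_true)

# Accuracy: fraction of all predictions that match the true label
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall: fraction of actual cancer cases the model caught
caught = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
recall = caught / sum(y_true)

print(f"Accuracy: {accuracy:.3f}")  # 0.995
print(f"Recall:   {recall:.3f}")    # 0.000
```

The two numbers describe the same predictions, which is why accuracy on its own says little when positives are rare.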


The Confusion Matrix — The Foundation of Everything

Every classification metric derives from the confusion matrix — a 2×2 table of prediction outcomes.

                    ┌─────────────────────────────────────────────┐
                    │         PREDICTED LABEL                      │
                    │    Positive (+)        Negative (-)          │
         ┌──────────┼──────────────────────┬──────────────────────┤
  ACTUAL │Positive  │  TP (True Positive)  │  FN (False Negative) │
  LABEL  │  (+)     │  ✅ Correctly said   │  ❌ Missed positive   │
         │          │     "YES"            │     (Type II Error)   │
         ├──────────┼──────────────────────┼──────────────────────┤
         │Negative  │  FP (False Positive) │  TN (True Negative)  │
         │   (-)    │  ❌ False alarm       │  ✅ Correctly said   │
         │          │     (Type I Error)   │     "NO"             │
         └──────────┴──────────────────────┴──────────────────────┘

Medical Analogy

COVID Test Results
─────────────────────────────────────────────────
Patient actually HAS COVID  →  Test says POSITIVE  →  TP ✅
Patient actually HAS COVID  →  Test says NEGATIVE  →  FN ❌ (Dangerous! Sent home sick)
Patient DOESN'T have COVID  →  Test says POSITIVE  →  FP ❌ (Unnecessary quarantine)
Patient DOESN'T have COVID  →  Test says NEGATIVE  →  TN ✅
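
The four outcomes can be tallied mechanically. The sketch below assumes each test result is stored as a pair of booleans; the `outcome` helper and the `results` list are hypothetical and exist only to illustrate the mapping.

```python
from collections import Counter

def outcome(actual, predicted):
    """Name the confusion-matrix cell for one (actual, predicted) pair of booleans."""
    if actual and predicted:
        return "TP"
    if actual and not predicted:
        return "FN"
    if not actual and predicted:
        return "FP"
    return "TN"

# Hypothetical test results: (patient has COVID, test says positive)
results = [
    (True, True), (True, True), (True, False),
    (False, True), (False, False), (False, False), (False, False),
]

counts = Counter(outcome(a, p) for a, p in results)
for cell in ("TP", "FN", "FP", "TN"):
    print(f"{cell}: {counts[cell]}")
```

Every metric in this lesson is a ratio built from these four counts.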

The Four Metrics Explained

1. Accuracy

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

"Out of ALL predictions, what fraction were correct?"

When to trust it: Balanced classes (roughly equal positives and negatives)
When it lies:     Imbalanced classes (99% negative → predict all negative → 99% accuracy!)

2. Precision

$$\text{Precision} = \frac{TP}{TP + FP}$$

"Of all the times we said POSITIVE, how often were we right?"

High Precision = Few False Alarms
Low  Precision = Crying wolf too often

Use case: SPAM detection
  → We don't want legitimate emails flagged as spam (FP is costly)
  → Optimize for HIGH precision

3. Recall (Sensitivity / True Positive Rate)

$$\text{Recall} = \frac{TP}{TP + FN}$$

"Of all ACTUAL positives, how many did we catch?"

High Recall = We catch nearly everything
Low  Recall = We miss many positives

Use case: CANCER detection
  → We can't afford to miss a cancer case (FN is deadly)
  → Optimize for HIGH recall

4. F1 Score

$$\text{F1} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2 \cdot TP}{2 \cdot TP + FP + FN}$$

"The harmonic mean of precision and recall — balanced single number."

Why harmonic mean (not arithmetic)?
  If Precision = 1.0, Recall = 0.0:
    Arithmetic mean = 0.5  ← falsely suggests decent performance
    Harmonic mean   = 0.0  ← correctly shows failure

F1 is LOW when either precision OR recall is low.
F1 only approaches 1.0 when BOTH are high.
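
The comparison between the two means, and the equivalence of the two F1 formulas, can be sketched as follows. The function names `f1_from_pr` and `f1_from_counts` are illustrative, and returning 0 when the denominator is zero is a convention assumed here.

```python
def f1_from_pr(precision, recall):
    """Harmonic mean of precision and recall; treated as 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def f1_from_counts(tp, fp, fn):
    """The same quantity written directly in confusion-matrix counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Arithmetic mean vs. harmonic mean for a few precision/recall pairs
for p, r in [(1.0, 0.0), (0.9, 0.1), (0.8, 0.8)]:
    arithmetic = (p + r) / 2
    print(f"P={p:.1f} R={r:.1f}  arithmetic={arithmetic:.2f}  F1={f1_from_pr(p, r):.2f}")

# Both formulas agree: TP=3, FP=1, FN=2 gives P=0.75, R=0.60
print(f1_from_pr(3 / 4, 3 / 5), f1_from_counts(3, 1, 2))
```

For the pair (0.9, 0.1) the arithmetic mean is still 0.50 while F1 drops to 0.18, which is the penalty the harmonic mean applies to lopsided results.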

Visual Intuition: The Precision-Recall Tradeoff

         Precision ▲
       1.0 │ ●
           │   ●
       0.8 │     ●
           │       ●   ← Sweet spot (high F1)
       0.6 │         ●
           │           ●
       0.4 │             ●
           │               ●
       0.2 │                 ●
           │                   ●
       0.0 └───────────────────────── Recall ▶
           0.0  0.2  0.4  0.6  0.8  1.0

Moving the decision threshold:
  → Lower threshold: Model says YES more often
    → Recall ↑  (catch more positives)
    → Precision ↓ (more false alarms)

  → Higher threshold: Model says YES less often
    → Precision ↑ (fewer false alarms)
    → Recall ↓  (miss more positives)

The area under this curve (PR-AUC) summarizes performance across all thresholds in a single number.
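
One common estimate of the area under a precision-recall curve is average precision: the precision at each rank where a true positive appears, weighted by the recall gained at that rank. The sketch below is a simplified plain-Python version that assumes no two samples share the same score; the function name `average_precision` and the toy data are made up for illustration. For real work, scikit-learn provides `average_precision_score`.

```python
def average_precision(y_true, scores):
    """Sum of precision at each positive's rank, weighted by the recall step.

    Simplified sketch: assumes binary labels (0/1) and no tied scores.
    """
    # Rank samples from highest to lowest score
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    total_pos = sum(y_true)
    if total_pos == 0:
        return 0.0
    tp = 0
    ap = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            tp += 1
            # precision at this rank times the recall increment (1 / total_pos)
            ap += (tp / rank) / total_pos
    return ap

# Toy labels and scores (illustrative only)
y_true_toy = [1, 0, 1, 1, 0, 0]
scores_toy = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]
print(f"Average precision: {average_precision(y_true_toy, scores_toy):.4f}")  # 0.8056
```

A ranking that places every positive ahead of every negative scores 1.0.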

Numeric Example: Fraud Detection

A fraud detection model processes 10,000 transactions:

  • 100 are actual fraud (positive class)
  • 9,900 are legitimate (negative class)

Results: TP = 80, FP = 20, FN = 20, TN = 9,880

Confusion Matrix:
                    Predicted Fraud    Predicted Legit
Actual Fraud   │        80                  20       │  100 total fraud
Actual Legit   │        20                9,880      │  9,900 total legit
               └─────────────────────────────────────
                      100 predicted        9,900 predicted

Computing the metrics:

$$\text{Accuracy} = \frac{80 + 9880}{10000} = \frac{9960}{10000} = 99.6\%$$

$$\text{Precision} = \frac{80}{80 + 20} = \frac{80}{100} = 80\%$$

$$\text{Recall} = \frac{80}{80 + 20} = \frac{80}{100} = 80\%$$

$$\text{F1} = \frac{2 \times 0.80 \times 0.80}{0.80 + 0.80} = 0.80 = 80\%$$

Insight: Accuracy (99.6%) looks amazing, but that's mostly because 99% of transactions are legitimate. The model catches 80% of fraud but lets 20% slip through — that's the number that matters for the business.
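
The arithmetic above can be reproduced directly from the four counts. This is a small sketch using only the numbers given in the example; the `baseline_accuracy` line is an added comparison against a model that labels every transaction as legitimate.

```python
# Counts from the fraud example
tp, fp, fn, tn = 80, 20, 20, 9880
total = tp + fp + fn + tn

accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * tp / (2 * tp + fp + fn)

# A model that predicts "legit" for everything gets all 9,900 negatives right
baseline_accuracy = (tn + fp) / total

print(f"Accuracy:          {accuracy:.3f}")           # 0.996
print(f"Precision:         {precision:.3f}")          # 0.800
print(f"Recall:            {recall:.3f}")             # 0.800
print(f"F1:                {f1:.3f}")                 # 0.800
print(f"Baseline accuracy: {baseline_accuracy:.3f}")  # 0.990
```

The do-nothing baseline already reaches 99.0% accuracy, so the model's 99.6% is a much smaller improvement than it first appears.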


Complete Python Implementation

import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, classification_report,
    precision_recall_curve, roc_auc_score
)
import warnings
warnings.filterwarnings('ignore')

# ─── 1. Simulate imbalanced fraud dataset ──────────────────────────
X, y = make_classification(
    n_samples=10_000,
    n_features=20,
    n_informative=10,
    weights=[0.99, 0.01],   # 99% legit, 1% fraud
    flip_y=0.01,
    random_state=42
)

print(f"Class distribution: {np.bincount(y)}")  # about 99% legit, 1% fraud; flip_y adds label noise, so counts are approximate

# ─── 2. Train/test split ───────────────────────────────────────────
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# ─── 3. Fit model ──────────────────────────────────────────────────
clf = LogisticRegression(class_weight='balanced', random_state=42)
clf.fit(X_train, y_train)
y_pred  = clf.predict(X_test)
y_proba = clf.predict_proba(X_test)[:, 1]

# ─── 4. All metrics in one place ──────────────────────────────────
print("\n" + "="*55)
print("          CLASSIFICATION METRICS REPORT")
print("="*55)
print(f"  Accuracy  : {accuracy_score(y_test, y_pred):.4f}")
print(f"  Precision : {precision_score(y_test, y_pred):.4f}")
print(f"  Recall    : {recall_score(y_test, y_pred):.4f}")
print(f"  F1 Score  : {f1_score(y_test, y_pred):.4f}")
print(f"  ROC-AUC   : {roc_auc_score(y_test, y_proba):.4f}")
print("="*55)

# ─── 5. Detailed confusion matrix ─────────────────────────────────
cm = confusion_matrix(y_test, y_pred)
tn, fp, fn, tp = cm.ravel()

print(f"\nConfusion Matrix:")
print(f"  True Negatives  (TN): {tn}")
print(f"  False Positives (FP): {fp}  ← False alarms")
print(f"  False Negatives (FN): {fn}  ← Missed fraud!")
print(f"  True Positives  (TP): {tp}")

# ─── 6. Full sklearn report ───────────────────────────────────────
print("\nFull Classification Report:")
print(classification_report(y_test, y_pred, target_names=['Legit', 'Fraud']))

Example output (illustrative; exact values depend on your scikit-learn version):

Class distribution: [9899  101]

=======================================================
          CLASSIFICATION METRICS REPORT
=======================================================
  Accuracy  : 0.9925
  Precision : 0.6000
  Recall    : 0.7500
  F1 Score  : 0.6667
  ROC-AUC   : 0.9412
=======================================================

Confusion Matrix:
  True Negatives  (TN): 1970
  False Positives (FP): 10  ← False alarms
  False Negatives (FN): 5  ← Missed fraud!
  True Positives  (TP): 15

Computing From Scratch (No sklearn)

import numpy as np

def compute_all_metrics(y_true, y_pred):
    """Compute all classification metrics from scratch."""
    y_true = np.array(y_true)
    y_pred = np.array(y_pred)

    TP = np.sum((y_pred == 1) & (y_true == 1))
    TN = np.sum((y_pred == 0) & (y_true == 0))
    FP = np.sum((y_pred == 1) & (y_true == 0))
    FN = np.sum((y_pred == 0) & (y_true == 1))

    accuracy  = (TP + TN) / (TP + TN + FP + FN)
    precision = TP / (TP + FP) if (TP + FP) > 0 else 0
    recall    = TP / (TP + FN) if (TP + FN) > 0 else 0
    f1        = (2 * precision * recall / (precision + recall)
                 if (precision + recall) > 0 else 0)
    specificity = TN / (TN + FP) if (TN + FP) > 0 else 0  # True Negative Rate

    return {
        "TP": int(TP), "TN": int(TN), "FP": int(FP), "FN": int(FN),
        "Accuracy":    round(accuracy, 4),
        "Precision":   round(precision, 4),
        "Recall":      round(recall, 4),
        "F1":          round(f1, 4),
        "Specificity": round(specificity, 4),
    }

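# Test it: each class has 5 samples and 3 of each are classified correctly,
# so TP=3, TN=3, FP=2, FN=2 and every metric below comes out to 0.6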
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

metrics = compute_all_metrics(y_true, y_pred)
for k, v in metrics.items():
    print(f"  {k:<12}: {v}")

Tuning the Decision Threshold

By default, classifiers predict positive when probability ≥ 0.5. Changing this threshold shifts the precision-recall tradeoff.

# Try different thresholds
thresholds = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

print(f"{'Threshold':>10} {'Precision':>10} {'Recall':>10} {'F1':>10}")
print("-" * 44)

for t in thresholds:
    y_pred_t = (y_proba >= t).astype(int)
    p = precision_score(y_test, y_pred_t, zero_division=0)
    r = recall_score(y_test, y_pred_t, zero_division=0)
    f = f1_score(y_test, y_pred_t, zero_division=0)
    print(f"{t:>10.1f} {p:>10.4f} {r:>10.4f} {f:>10.4f}")

Sample output (selected thresholds):

 Threshold  Precision     Recall         F1
--------------------------------------------
       0.1     0.0500     1.0000     0.0952   ← Catches everything, terrible precision
       0.3     0.1667     0.9000     0.2813
       0.5     0.6000     0.7500     0.6667   ← Default
       0.7     0.8571     0.6000     0.7059   ← Higher precision, lower recall
       0.9     1.0000     0.1000     0.1818   ← Only predicts when very certain
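
Once precision and recall are known at several thresholds, picking one becomes a constrained search. The sketch below chooses the candidate with the highest precision among those that still meet a minimum recall. The helper names `precision_recall_at` and `best_threshold`, the recall floor, and the toy probabilities are all assumptions made for illustration; they are not part of scikit-learn.

```python
def precision_recall_at(y_true, y_proba, threshold):
    """Precision and recall when predicting positive for probability >= threshold."""
    tp = sum(1 for t, p in zip(y_true, y_proba) if p >= threshold and t == 1)
    fp = sum(1 for t, p in zip(y_true, y_proba) if p >= threshold and t == 0)
    fn = sum(1 for t, p in zip(y_true, y_proba) if p < threshold and t == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def best_threshold(y_true, y_proba, candidates, min_recall):
    """Return (threshold, precision, recall) with the highest precision among
    candidates whose recall is at least min_recall, or None if none qualify."""
    best = None
    for t in candidates:
        precision, recall = precision_recall_at(y_true, y_proba, t)
        if recall >= min_recall and (best is None or precision > best[1]):
            best = (t, precision, recall)
    return best

# Toy labels and predicted probabilities (made up for illustration)
y_true_toy = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_proba_toy = [0.95, 0.80, 0.55, 0.30, 0.60, 0.40, 0.20, 0.15, 0.10, 0.05]

choice = best_threshold(y_true_toy, y_proba_toy, [0.1, 0.3, 0.5, 0.7, 0.9], min_recall=0.75)
print(choice)  # (0.5, 0.75, 0.75)
```

Raising `min_recall` to 1.0 on the same data moves the choice down to 0.3, trading precision for coverage. In practice the threshold should be chosen on a validation set rather than the test set, so the reported test metrics stay unbiased.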

Multi-Class Metrics

When there are 3+ classes, metrics are computed per class and aggregated:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Three averaging strategies:
for avg in ['macro', 'weighted', 'micro']:
    p = precision_score(y_test, y_pred, average=avg)
    r = recall_score(y_test, y_pred, average=avg)
    f = f1_score(y_test, y_pred, average=avg)
    print(f"  {avg:<10}: P={p:.3f}  R={r:.3f}  F1={f:.3f}")

Average Strategy Explained:
  macro    → Unweighted mean of the per-class scores, so every class counts equally
             Useful when small classes matter as much as large ones
  weighted → Mean of the per-class scores weighted by class support (size)
             Reflects the class mix, so large classes dominate the score
  micro    → Aggregate TP/FP/FN globally, then compute the metric
             In single-label multi-class problems this equals accuracy
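
To see how the three strategies diverge, the sketch below computes averaged precision by hand from a small, deliberately imbalanced 3-class confusion matrix. The matrix values and the function name `precision_averages` are invented for illustration; rows are actual classes and columns are predicted classes.

```python
def precision_averages(cm):
    """Return (macro, weighted, micro) precision for a square confusion matrix.

    cm[i][j] = number of samples of actual class i predicted as class j.
    """
    n = len(cm)
    support = [sum(row) for row in cm]                               # actual count per class
    predicted = [sum(cm[i][j] for i in range(n)) for j in range(n)]  # predicted count per class
    tp = [cm[k][k] for k in range(n)]
    per_class = [tp[k] / predicted[k] if predicted[k] else 0.0 for k in range(n)]

    macro = sum(per_class) / n
    weighted = sum(p * s for p, s in zip(per_class, support)) / sum(support)
    micro = sum(tp) / sum(predicted)
    return macro, weighted, micro

# Hypothetical counts: two small classes (10 samples each) and one large class (80)
cm = [
    [8, 1, 1],
    [2, 6, 2],
    [5, 5, 70],
]

macro, weighted, micro = precision_averages(cm)
accuracy = sum(cm[k][k] for k in range(3)) / sum(sum(row) for row in cm)
print(f"macro={macro:.3f}  weighted={weighted:.3f}  micro={micro:.3f}  accuracy={accuracy:.3f}")
```

Here the macro score is pulled down by the two small classes, the weighted score is dominated by the large one, and the micro score matches accuracy.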

The Metrics Decision Guide

Which kind of error costs you more?

HIGH cost of False Negative (e.g., cancer, fraud, critical failure)
  → Prioritize RECALL
  → Accept more false alarms to catch real positives
  → Lower the decision threshold

HIGH cost of False Positive (spam filter, loan approval)
  → Prioritize PRECISION
  → Better to miss some positives than cause false alarms
  → Raise the decision threshold

Balanced (general classification)
  → Use F1 SCORE
  → Harmonic mean balances both concerns

Imbalanced classes
  → Use PRECISION-RECALL AUC (PR-AUC)
  → Avoid accuracy — it's misleading
  → Consider Matthews Correlation Coefficient (MCC)
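
The Matthews Correlation Coefficient uses all four cells of the confusion matrix and ranges from -1 to +1, with 0 meaning no better than chance. A minimal sketch of the standard formula follows; the function name `mcc` and the choice to return 0 when the denominator is zero are conventions assumed here. The counts reuse the fraud example from earlier in this lesson.

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from confusion-matrix counts.

    Returns 0.0 when any marginal total is zero (a common convention).
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator if denominator else 0.0

# Fraud example: TP=80, TN=9,880, FP=20, FN=20
model_mcc = mcc(tp=80, tn=9880, fp=20, fn=20)

# A model that labels every transaction as legitimate: 99% accuracy, no signal
baseline_mcc = mcc(tp=0, tn=9900, fp=0, fn=100)

print(f"Model MCC:    {model_mcc:.3f}")     # 0.798
print(f"Baseline MCC: {baseline_mcc:.3f}")  # 0.000
```

Unlike accuracy, the coefficient gives the always-legitimate baseline no credit for the class imbalance.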

Quick Reference Summary

Metric       Formula               Answers                                 Best When
-----------  --------------------  --------------------------------------  ----------------------------------
Accuracy     (TP+TN)/Total         How often correct overall?              Balanced classes
Precision    TP/(TP+FP)            When we say YES, how often right?       Cost of false alarm is high
Recall       TP/(TP+FN)            Of all real positives, how many found?  Cost of missing a positive is high
F1           2×P×R/(P+R)           Single balanced score                   Want to balance P and R
Specificity  TN/(TN+FP)            Of all negatives, how many identified?  Screening tests
ROC-AUC      Area under ROC curve  Discrimination across thresholds        Comparing models

Key Takeaways

  1. Accuracy is misleading on imbalanced datasets — a 99% accuracy can mean a useless model
  2. Precision = quality of positive predictions; Recall = coverage of actual positives
  3. F1 score is the go-to when you want a single number that respects both P and R
  4. Always tune the threshold — the default 0.5 is rarely optimal for real problems
  5. Use the confusion matrix to understand your model's failure modes, not just overall numbers
  6. Match your metric to your business goal — missing cancer (FN) is far worse than a false alarm (FP)
