Object Detection — YOLO, Faster R-CNN, Anchor Boxes & mAP

Object detection localizes and classifies objects in images, predicting both bounding boxes and class labels. It is one of the most impactful applications of deep learning.

See our CNN Architecture tutorial for the backbone architectures used in detection.


Detection vs. Classification

Definition: Object Detection

Object detection extends image classification by predicting:

  1. Bounding box: (x, y, w, h) coordinates for each object
  2. Class label: What the object is
  3. Confidence score: How certain the model is

Input: Image → Output: Set of {(x, y, w, h, class, confidence)}
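
The (x, y, w, h) output format above is often converted to corner coordinates (x1, y1, x2, y2) before computing overlaps. The sketch below shows one way to do that conversion. It assumes (x, y) is the box center, which the text does not state explicitly; the function names and the sample detection are hypothetical.

```python
def cxcywh_to_xyxy(box):
    """Convert (center_x, center_y, width, height) to (x1, y1, x2, y2)."""
    cx, cy, w, h = box
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


def xyxy_to_cxcywh(box):
    """Convert (x1, y1, x2, y2) back to (center_x, center_y, width, height)."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2, x2 - x1, y2 - y1)


# One detection as a plain dictionary (illustrative values only)
detection = {"box": (50.0, 50.0, 100.0, 100.0), "label": "dog", "confidence": 0.92}

corners = cxcywh_to_xyxy(detection["box"])
print(corners)                  # (0.0, 0.0, 100.0, 100.0)
print(xyxy_to_cxcywh(corners))  # (50.0, 50.0, 100.0, 100.0)
```

If a dataset instead stores (x, y) as the top-left corner, the conversion changes accordingly, so it is worth checking the convention before mixing box sources.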


IoU (Intersection over Union)

Definition: IoU Metric

IoU measures the overlap between predicted and ground truth bounding boxes:

$$\text{IoU} = \frac{\text{Area of Intersection}}{\text{Area of Union}} = \frac{|B_{\text{pred}} \cap B_{\text{gt}}|}{|B_{\text{pred}} \cup B_{\text{gt}}|}$$

Here,

  • $B_{\text{pred}}$ = Predicted bounding box
  • $B_{\text{gt}}$ = Ground truth bounding box
  • $|\cdot|$ = Area of the region

Interpreting the value:

  • IoU = 1: Perfect overlap
  • IoU = 0: No overlap
  • IoU ≥ 0.5: Common threshold for "correct" detection
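
As a worked instance of the IoU formula above, take two hypothetical boxes in corner format (x1, y1, x2, y2): a prediction (20, 20, 80, 80) and a ground truth (40, 40, 100, 100). These numbers are chosen only for illustration.

```latex
\begin{aligned}
|B_{\text{pred}} \cap B_{\text{gt}}| &= (80 - 40) \times (80 - 40) = 1600 \\
|B_{\text{pred}}| = |B_{\text{gt}}| &= 60 \times 60 = 3600 \\
|B_{\text{pred}} \cup B_{\text{gt}}| &= 3600 + 3600 - 1600 = 5600 \\
\text{IoU} &= \frac{1600}{5600} = \frac{2}{7} \approx 0.286
\end{aligned}
```

At the common 0.5 threshold this prediction would not count as a correct detection, even though the boxes visibly overlap.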

Anchor Boxes

Definition: Anchor Boxes

Anchor boxes are predefined reference boxes of fixed sizes and aspect ratios. Instead of predicting absolute coordinates, the network predicts offsets relative to these anchors:

$$\hat{x} = t_x \cdot w_a + x_a, \quad \hat{y} = t_y \cdot h_a + y_a$$
$$\hat{w} = w_a \cdot e^{t_w}, \quad \hat{h} = h_a \cdot e^{t_h}$$

Here,

  • $x_a, y_a, w_a, h_a$ = Anchor box center and dimensions
  • $t_x, t_y, t_w, t_h$ = Predicted offsets
  • $\hat{x}, \hat{y}, \hat{w}, \hat{h}$ = Final predicted box

Multiple anchor boxes at each spatial location capture objects of different shapes and aspect ratios.
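
The offset equations above can be sketched directly in Python. The function name `decode_box` and the sample anchor are hypothetical; the arithmetic follows the formulas as written.

```python
import math


def decode_box(anchor, offsets):
    """Apply offsets (t_x, t_y, t_w, t_h) to an anchor (x_a, y_a, w_a, h_a)."""
    x_a, y_a, w_a, h_a = anchor
    t_x, t_y, t_w, t_h = offsets
    x = t_x * w_a + x_a       # center shift is scaled by the anchor width
    y = t_y * h_a + y_a       # center shift is scaled by the anchor height
    w = w_a * math.exp(t_w)   # exponent keeps the width positive
    h = h_a * math.exp(t_h)   # exponent keeps the height positive
    return (x, y, w, h)


anchor = (100.0, 100.0, 64.0, 32.0)  # illustrative anchor: center (100, 100), 64 wide, 32 tall

# Zero offsets reproduce the anchor unchanged
print(decode_box(anchor, (0.0, 0.0, 0.0, 0.0)))

# Shift right by half a width, up by a quarter height, and roughly double the width
print(decode_box(anchor, (0.5, -0.25, math.log(2), 0.0)))
```

Scaling the center offsets by the anchor size and predicting log-scale width and height keeps the regression targets in a similar numeric range for small and large anchors.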

Non-Maximum Suppression (NMS)

Definition: Non-Maximum Suppression

NMS removes duplicate detections for the same object:

  1. Sort all detections by confidence score (highest first)
  2. Keep the detection with the highest confidence
  3. Remove every remaining detection whose IoU with it exceeds the threshold
  4. Repeat steps 2 and 3 on the remaining detections until none are left

Typical IoU threshold: 0.5


YOLO (You Only Look Once)

Definition: YOLO Architecture

YOLO treats detection as a single regression problem:

  1. Divide the image into an S × S grid
  2. Each cell predicts B bounding boxes with confidence scores and C class probabilities
  3. Output tensor: S × S × (B × 5 + C), where each box contributes 5 values (x, y, w, h, confidence)
  4. Single forward pass — extremely fast (real-time)

Advantages: Real-time speed, global context, learns general representations

Disadvantages: Struggles with small objects and grouped objects

$$\mathcal{L} = \lambda_{\text{coord}} \sum \mathcal{L}_{\text{box}} + \sum \mathcal{L}_{\text{conf}} + \lambda_{\text{noobj}} \sum \mathcal{L}_{\text{noobj}} + \sum \mathcal{L}_{\text{class}}$$
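
To make the grid and output tensor described above concrete, here is a minimal sketch. The values S = 7, B = 2, C = 20 and the 448×448 input are the YOLOv1 settings for Pascal VOC; they are not stated in this lesson and are used here as an assumption. The rule that the cell containing an object's center is responsible for it also comes from YOLOv1, and the function names are hypothetical.

```python
def yolo_output_shape(S, B, C):
    """Shape of the YOLOv1-style output tensor: S x S x (B * 5 + C)."""
    return (S, S, B * 5 + C)


def responsible_cell(cx, cy, image_w, image_h, S):
    """Return (row, col) of the grid cell that contains the object center."""
    col = min(int(cx / image_w * S), S - 1)  # clamp so cx == image_w stays in range
    row = min(int(cy / image_h * S), S - 1)
    return (row, col)


print(yolo_output_shape(7, 2, 20))              # (7, 7, 30)
print(responsible_cell(224, 100, 448, 448, 7))  # (1, 3)
```

Because each cell predicts only B boxes and one set of class probabilities, several small objects whose centers fall in the same cell compete for the same predictions, which is one way to see the grouped-object weakness listed above.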

ℹ️ YOLO Family Evolution

  • YOLOv1: Original single-shot detector
  • YOLOv2/v3: Anchor boxes, multi-scale predictions, Darknet backbone
  • YOLOv4/v5: CSPNet, PANet, Mosaic augmentation
  • YOLOv8: Anchor-free, decoupled head, modern training tricks
  • RT-DETR: Transformer-based real-time detection (no NMS needed)

SSD (Single Shot Detector)

Definition: SSD

SSD predicts bounding boxes at multiple feature map scales:

  • Uses VGG-16 as backbone (truncated at conv5_3)
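  • Predicts from feature maps at decreasing scales (38×38, 19×19, 10×10, 5×5, 3×3, 1×1 for a 300×300 input): the largest map comes from inside the backbone (conv4_3), and the smaller ones come from convolution layers added after it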
  • Each scale predicts boxes with different anchor sizes
  • Multi-scale predictions handle objects of varying sizes
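
A quick way to see what multi-scale prediction means in practice is to count the default (anchor) boxes across all feature maps. The feature map sizes come from the list above; the number of boxes per location (4, 6, 6, 6, 4, 4) is the SSD300 configuration and is an assumption not stated in this lesson.

```python
# Feature map side lengths listed above (SSD with a 300x300 input)
feature_map_sizes = [38, 19, 10, 5, 3, 1]

# Assumption: default boxes per spatial location in the SSD300 configuration
boxes_per_location = [4, 6, 6, 6, 4, 4]

per_scale = [size * size * k for size, k in zip(feature_map_sizes, boxes_per_location)]
total_boxes = sum(per_scale)

for size, count in zip(feature_map_sizes, per_scale):
    print(f"{size:>2}x{size:<2} map -> {count} boxes")
print(f"Total default boxes: {total_boxes}")  # Total default boxes: 8732
```

Most of the boxes sit on the high-resolution 38×38 map, which is where small objects are matched; the coarse maps contribute few boxes but cover large objects.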

Faster R-CNN

Definition: Faster R-CNN

Two-stage detector with Region Proposal Network (RPN):

  1. Backbone: CNN extracts feature maps
  2. RPN: Proposes regions of interest (RoIs) that likely contain objects
  3. RoI Pooling: Extracts fixed-size features for each RoI
  4. Head: Classifies each RoI and refines bounding boxes

Advantages: High accuracy, strong on small objects

Disadvantages: Slower than single-shot detectors, more complex
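
The RoI Pooling step listed above turns a region of any size into a fixed-size grid of features. The sketch below is a simplified, single-channel version on nested Python lists, with a hypothetical function name and RoI coordinates already expressed in feature-map cells; real implementations work on multi-channel tensors and handle the mapping from image to feature-map coordinates.

```python
import math


def roi_max_pool(feature_map, roi, output_size):
    """Max-pool one RoI of a 2D feature map into an output_size x output_size grid.

    roi = (x1, y1, x2, y2) in feature-map cells, with x2 and y2 exclusive.
    """
    x1, y1, x2, y2 = roi
    roi_w, roi_h = x2 - x1, y2 - y1
    pooled = []
    for i in range(output_size):
        # Bin rows: floor for the start, ceil for the end, so no bin is empty
        r0 = y1 + (i * roi_h) // output_size
        r1 = y1 + math.ceil((i + 1) * roi_h / output_size)
        row = []
        for j in range(output_size):
            c0 = x1 + (j * roi_w) // output_size
            c1 = x1 + math.ceil((j + 1) * roi_w / output_size)
            row.append(max(feature_map[r][c]
                           for r in range(r0, r1)
                           for c in range(c0, c1)))
        pooled.append(row)
    return pooled


# Toy 4x6 feature map whose value at (row, col) is row * 6 + col
feature_map = [[r * 6 + c for c in range(6)] for r in range(4)]

print(roi_max_pool(feature_map, (0, 0, 4, 4), 2))  # [[7, 9], [19, 21]]
print(roi_max_pool(feature_map, (1, 0, 6, 3), 2))  # [[9, 11], [15, 17]]
```

Both RoIs have different sizes (4×4 and 5×3 cells) but produce the same 2×2 output, which is what lets the detection head use fixed-size layers.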

RPN Loss

$$\mathcal{L}_{\text{RPN}} = \mathcal{L}_{\text{cls}}(p_i, p_i^*) + \lambda \cdot p_i^* \cdot \mathcal{L}_{\text{reg}}(t_i, t_i^*)$$

Here,

  • $p_i$ = Predicted probability of being foreground
  • $p_i^*$ = Ground truth label (0 or 1)
  • $t_i$ = Predicted bounding box parameters
  • $t_i^*$ = Ground truth bounding box parameters
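
The per-anchor RPN loss above can be sketched in plain Python. The lesson does not specify the two component losses; this sketch assumes binary cross-entropy for the classification term and smooth L1 for the regression term, as in the Faster R-CNN paper. The function names and the default `lam=1.0` are illustrative choices.

```python
import math


def smooth_l1(x):
    """Smooth L1: quadratic near zero, linear for larger errors."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def rpn_anchor_loss(p, p_star, t, t_star, lam=1.0):
    """Loss for a single anchor: L_cls + lam * p_star * L_reg."""
    eps = 1e-12  # avoids log(0)
    # Binary cross-entropy between predicted foreground probability and label
    cls_loss = -(p_star * math.log(p + eps) + (1 - p_star) * math.log(1 - p + eps))
    # Regression loss summed over the four box parameters
    reg_loss = sum(smooth_l1(a - b) for a, b in zip(t, t_star))
    # p_star gates the regression term: background anchors contribute none
    return cls_loss + lam * p_star * reg_loss


# Background anchor: the (large) box error is ignored because p_star = 0
print(rpn_anchor_loss(0.1, 0, (5.0, 5.0, 5.0, 5.0), (0.0, 0.0, 0.0, 0.0)))

# Foreground anchor: both terms contribute
print(rpn_anchor_loss(0.9, 1, (0.5, 0.0, 0.0, 2.0), (0.0, 0.0, 0.0, 0.0)))
```

The multiplication by the ground truth label is the key design point: box regression is only trained on anchors that actually match an object.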

Evaluation: mAP

Definition: Mean Average Precision (mAP)

mAP is the standard metric for object detection:

  1. For each class, compute Precision-Recall curve
  2. Compute Average Precision (AP) as area under PR curve
  3. mAP = mean of AP across all classes

COCO mAP: Average over IoU thresholds from 0.5 to 0.95 (step 0.05)

Pascal VOC mAP: Uses IoU threshold of 0.5
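
The AP and mAP steps above can be sketched for a single class as follows. This is a minimal version under stated assumptions: detections are already sorted by descending confidence, each one has already been marked as a true or false positive at a chosen IoU threshold, and the number of ground truth boxes is greater than zero. The function name, the per-class AP values, and the all-point interpolation are illustrative choices, not a reference implementation of any benchmark.

```python
def average_precision(is_true_positive, num_ground_truth):
    """Area under the interpolated precision-recall curve for one class."""
    tp = fp = 0
    precisions, recalls = [], []
    for hit in is_true_positive:
        if hit:
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / num_ground_truth)

    # Interpolate: make precision non-increasing as recall grows
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])

    # Sum rectangle areas wherever recall increases
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


# Five detections for one class, four ground truth boxes
flags = [True, False, True, True, False]
print(average_precision(flags, num_ground_truth=4))  # 0.625

# mAP is the mean of per-class AP values (hypothetical numbers)
ap_per_class = {"person": 0.625, "car": 0.8}
print(sum(ap_per_class.values()) / len(ap_per_class))

# The ten IoU thresholds used for COCO-style averaging
coco_thresholds = [round(0.5 + 0.05 * i, 2) for i in range(10)]
print(coco_thresholds)
```

For a COCO-style score, the true/false positive flags would be recomputed at each threshold in `coco_thresholds` and the resulting mAP values averaged.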

💡 mAP Interpretation

  • mAP@0.5: How well does the model detect objects (loose localization)?
  • mAP@0.75: How well does the model localize objects precisely?
  • mAP@[0.5:0.95]: The primary COCO metric, balances detection and localization
  • A small gap between mAP@0.5 and mAP@0.75 indicates precise localization; a large drop at 0.75 means objects are found but their boxes are only loosely aligned

PyTorch Detection Example

📝Example: Simple Object Detection Pipeline

import torch
import torch.nn as nn
import torchvision

# ═══════════════════════════════════════════════════
# Pre-trained Faster R-CNN
# ═══════════════════════════════════════════════════

# torchvision >= 0.13 uses the `weights` argument; `pretrained=True` is deprecated
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(
    weights="DEFAULT"
)
model.eval()

# Dummy input: a list with one image tensor (3 channels, 480x640), values in [0, 1]
images = [torch.rand(3, 480, 640)]

with torch.no_grad():
    predictions = model(images)

# Results
pred = predictions[0]
print(f"Detected {len(pred['boxes'])} objects")
print(f"Boxes shape: {pred['boxes'].shape}")
print(f"Labels: {pred['labels']}")
print(f"Scores: {pred['scores']}")

# ═══════════════════════════════════════════════════
# IoU Computation
# ═══════════════════════════════════════════════════

def compute_iou(box1, box2):
    """Compute IoU between two boxes in (x1, y1, x2, y2) format."""
    x1 = max(box1[0], box2[0])
    y1 = max(box1[1], box2[1])
    x2 = min(box1[2], box2[2])
    y2 = min(box1[3], box2[3])

    intersection = max(0, x2 - x1) * max(0, y2 - y1)
    area1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
    area2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
    union = area1 + area2 - intersection

    return intersection / union if union > 0 else 0

# Test
box1 = [0, 0, 100, 100]
box2 = [50, 50, 150, 150]
print(f"\nIoU: {compute_iou(box1, box2):.4f}")

# ═══════════════════════════════════════════════════
# Non-Maximum Suppression
# ═══════════════════════════════════════════════════

def nms(boxes, scores, iou_threshold=0.5):
    """Simple NMS implementation."""
    indices = scores.argsort(descending=True)
    keep = []

    while len(indices) > 0:
        idx = indices[0]
        keep.append(idx.item())  # store a plain int instead of a 0-dim tensor

        if len(indices) == 1:
            break

        remaining = indices[1:]
        ious = torch.tensor([
            compute_iou(boxes[idx].tolist(), boxes[i].tolist())
            for i in remaining
        ])

        mask = ious <= iou_threshold
        indices = remaining[mask]

    return keep

# Test NMS
boxes = torch.tensor([
    [10, 10, 50, 50],
    [12, 12, 52, 52],
    [100, 100, 150, 150],
])
scores = torch.tensor([0.9, 0.85, 0.7])

keep = nms(boxes, scores, iou_threshold=0.5)
print(f"\nNMS kept boxes: {keep}")

Summary

📋Summary: Object Detection

  • Detection: Predict bounding boxes + class labels + confidence
  • IoU: Overlap metric for evaluating box accuracy (threshold 0.5)
  • Anchor boxes: Pre-defined box shapes, network predicts offsets
  • NMS: Removes duplicate detections (IoU > threshold)
  • YOLO: Single-shot, real-time, grid-based prediction
  • SSD: Multi-scale single-shot detection
  • Faster R-CNN: Two-stage, high accuracy, RPN proposes regions
  • mAP: Primary metric — AP averaged across classes
  • COCO mAP: Average over IoU thresholds 0.5 to 0.95

Practice Exercises

  1. Mathematical: Given two bounding boxes $B_1 = (0, 0, 100, 100)$ and $B_2 = (50, 50, 150, 150)$, compute the IoU. What happens if $B_2 = (110, 110, 200, 200)$?

  2. Coding: Implement YOLO's grid-based prediction: divide a 416×416 image into a 13×13 grid. For each cell, predict 3 anchor boxes with class probabilities.

  3. Experiment: Fine-tune a pre-trained Faster R-CNN on a custom dataset (e.g., pedestrians, cars). Compare mAP with and without transfer learning.

  4. Research: Read the YOLOv8 documentation (Ultralytics released YOLOv8 without a formal paper). What improvements were made over YOLOv5? How does the anchor-free approach differ?

  5. Evaluation: Compute mAP@0.5 and mAP@0.75 for a set of 10 predictions and 5 ground truth boxes. Implement the computation from scratch.
