Object Detection — YOLO, Faster R-CNN, Anchor Boxes & mAP
Object detection localizes and classifies objects in images, predicting both bounding boxes and class labels. It is one of the most impactful applications of deep learning.
See our CNN Architecture tutorial for the backbone architectures used in detection.
Detection vs. Classification
Definition: Object Detection
Object detection extends image classification by predicting, for each object in the image:
- Bounding box: coordinates for the object, e.g. $(x, y, w, h)$
- Class label: what the object is
- Confidence score: how certain the model is
Input: Image → Output: Set of (bounding box, class label, confidence score) tuples
IoU (Intersection over Union)
Definition: IoU Metric
IoU measures the overlap between predicted and ground truth bounding boxes:
- IoU = 1: Perfect overlap
- IoU = 0: No overlap
- IoU ≥ 0.5: Common threshold for "correct" detection
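IoU (Intersection over Union)
$$\mathrm{IoU}(B_p, B_{gt}) = \frac{|B_p \cap B_{gt}|}{|B_p \cup B_{gt}|}$$
Here,
- $B_p$ = Predicted bounding box
- $B_{gt}$ = Ground truth bounding box
- $|\cdot|$ = Area of the box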
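As a quick worked example, take a predicted box $B_p = (0, 0, 100, 100)$ and a ground truth box $B_{gt} = (50, 50, 150, 150)$, both written as $(x_1, y_1, x_2, y_2)$ corners. These are the same two boxes used in the IoU test of the PyTorch example later in this tutorial; the corner format is an assumption that matches that code.

```latex
\begin{aligned}
|B_p \cap B_{gt}| &= (100 - 50) \times (100 - 50) = 2500 \\
|B_p \cup B_{gt}| &= 100^2 + 100^2 - 2500 = 17500 \\
\mathrm{IoU} &= \frac{2500}{17500} \approx 0.143
\end{aligned}
```

With the common 0.5 threshold, this pair would not count as a correct detection.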
Anchor Boxes
Definition: Anchor Boxes
Anchor boxes are pre-defined reference boxes, with fixed sizes and width/height ratios, placed at every spatial location. Instead of predicting absolute coordinates, the network predicts offsets relative to each anchor box.
Multiple anchor boxes at each spatial location capture objects of different shapes and aspect ratios.
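Anchor Box Offset Prediction (Faster R-CNN parameterization)
$$x = x_a + t_x w_a, \quad y = y_a + t_y h_a, \quad w = w_a e^{t_w}, \quad h = h_a e^{t_h}$$
Here,
- $(x_a, y_a, w_a, h_a)$ = Anchor box center and dimensions
- $(t_x, t_y, t_w, t_h)$ = Predicted offsets
- $(x, y, w, h)$ = Final predicted box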
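The offset mechanism can be sketched in a few lines of plain Python. This is a minimal illustration that assumes the Faster R-CNN style parameterization (center shifts scaled by the anchor size, width and height scaled in log-space); other detectors, such as the YOLO variants, use different formulas. The function names `decode_box` and `encode_box` are hypothetical.

```python
import math

def decode_box(anchor, offsets):
    """Apply predicted offsets (tx, ty, tw, th) to an anchor (xa, ya, wa, ha).

    Boxes use center format: (center_x, center_y, width, height).
    """
    xa, ya, wa, ha = anchor
    tx, ty, tw, th = offsets
    x = xa + tx * wa        # shift the center by a fraction of the anchor width
    y = ya + ty * ha        # shift the center by a fraction of the anchor height
    w = wa * math.exp(tw)   # log-space scaling keeps the width positive
    h = ha * math.exp(th)
    return (x, y, w, h)

def encode_box(anchor, box):
    """Inverse of decode_box: regression targets for a ground-truth box."""
    xa, ya, wa, ha = anchor
    x, y, w, h = box
    return ((x - xa) / wa, (y - ya) / ha, math.log(w / wa), math.log(h / ha))

anchor = (100.0, 100.0, 50.0, 80.0)
print(decode_box(anchor, (0.0, 0.0, 0.0, 0.0)))   # zero offsets return the anchor itself
print(decode_box(anchor, (0.2, -0.1, 0.0, 0.0)))  # center moves, size is unchanged
```

During training, `encode_box` would produce the regression targets from a matched ground-truth box, and `decode_box` turns the network's outputs back into image coordinates at inference time.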
Non-Maximum Suppression (NMS)
Definition: Non-Maximum Suppression
NMS removes duplicate detections of the same object:
- Sort all detections by confidence score
- Select the detection with the highest confidence and keep it
- Remove every remaining detection whose IoU with the selected one exceeds the threshold
- Repeat on the remaining detections until none are left
Typical IoU threshold: 0.5
YOLO (You Only Look Once)
Definition: YOLO Architecture
YOLO treats detection as a single regression problem:
- Divide the image into an $S \times S$ grid
- Each cell predicts $B$ bounding boxes with confidence scores, plus $C$ class probabilities
- Output tensor: $S \times S \times (5B + C)$
- Single forward pass, which makes real-time inference possible
Advantages: Real-time speed, global context, learns general representations
Disadvantages: Struggles with small objects and with objects grouped closely together
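To make the grid idea concrete, here is a small sketch of the bookkeeping behind a YOLOv1-style output. It assumes an S×S grid, B boxes per cell (each with 5 numbers: x, y, w, h, confidence), and C class probabilities shared per cell; the values S=7, B=2, C=20 and the 448×448 input are the original YOLOv1 settings, used here only as an example. Both function names are hypothetical.

```python
def yolo_output_shape(S, B, C):
    """Shape of a YOLOv1-style output tensor: S x S x (5*B + C)."""
    return (S, S, 5 * B + C)

def responsible_cell(cx, cy, img_w, img_h, S):
    """Return (row, col) of the grid cell that contains the box center (cx, cy)."""
    col = min(int(cx * S // img_w), S - 1)  # clamp centers lying on the right edge
    row = min(int(cy * S // img_h), S - 1)  # clamp centers lying on the bottom edge
    return row, col

print(yolo_output_shape(S=7, B=2, C=20))                             # (7, 7, 30)
print(responsible_cell(cx=300, cy=100, img_w=448, img_h=448, S=7))   # (1, 4)
```

Only the cell returned by `responsible_cell` is trained to predict a given object, which is one reason closely grouped objects are hard for this design.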
ℹ️ YOLO Family Evolution
- YOLOv1: Original single-shot detector
- YOLOv2/v3: Anchor boxes, multi-scale predictions, Darknet backbone
- YOLOv4/v5: CSPNet, PANet, Mosaic augmentation
- YOLOv8: Anchor-free, decoupled head, modern training tricks
- RT-DETR: Transformer-based real-time detection (no NMS needed)
SSD (Single Shot Detector)
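Definition: SSD
SSD predicts bounding boxes at multiple feature map scales:
- Uses VGG-16 as backbone (truncated after conv5_3, with the fully connected layers fc6 and fc7 converted to convolutions)
- Predicts from six feature maps at decreasing scales (38×38, 19×19, 10×10, 5×5, 3×3, 1×1 for a 300×300 input): the first two come from the backbone, the rest from auxiliary convolution layers added on top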
- Each scale predicts boxes with different anchor sizes
- Multi-scale predictions handle objects of varying sizes
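A quick way to see what multi-scale prediction costs is to count the candidate boxes. The sketch below uses the feature map sizes listed above; the number of default boxes per location (4, 6, 6, 6, 4, 4) is the commonly reported SSD300 configuration and is an assumption here, not something stated in this tutorial.

```python
# Feature-map side lengths from the list above
feature_maps = [38, 19, 10, 5, 3, 1]
# Default (anchor) boxes per location: assumed SSD300 configuration
boxes_per_location = [4, 6, 6, 6, 4, 4]

# Each scale contributes side * side locations, each with k default boxes
per_scale = [side * side * k for side, k in zip(feature_maps, boxes_per_location)]
total_boxes = sum(per_scale)

for side, count in zip(feature_maps, per_scale):
    print(f"{side:>2}x{side:<2} -> {count} boxes")
print(f"total: {total_boxes}")  # total: 8732
```

Most of the boxes come from the 38×38 map, which is the scale responsible for small objects; NMS then reduces these thousands of candidates to a handful of final detections.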
Faster R-CNN
Definition: Faster R-CNN
Two-stage detector with Region Proposal Network (RPN):
- Backbone: CNN extracts feature maps
- RPN: Proposes regions of interest (RoIs) that likely contain objects
- RoI Pooling: Extracts fixed-size features for each RoI
- Head: Classifies each RoI and refines bounding boxes
Advantages: High accuracy, strong on small objects
Disadvantages: Slower than single-shot detectors, more complex
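RPN Loss
$$L(\{p_i\}, \{t_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*) + \lambda \frac{1}{N_{reg}} \sum_i p_i^* L_{reg}(t_i, t_i^*)$$
Here,
- $p_i$ = Predicted probability of anchor $i$ being foreground
- $p_i^*$ = Ground truth label (0 or 1)
- $t_i$ = Predicted bounding box parameters
- $t_i^*$ = Ground truth bounding box parameters
$L_{cls}$ is the log loss, $L_{reg}$ is the smooth L1 loss, $N_{cls}$ and $N_{reg}$ are normalization constants, and $\lambda$ balances the two terms. The factor $p_i^*$ means only foreground anchors contribute to the regression term.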
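The structure of this loss, a classification term over all sampled anchors plus a box-regression term that counts only for foreground anchors, can be sketched in plain Python. This is an illustration under stated assumptions: log loss for classification, smooth L1 for regression, and the regression term normalized by the number of foreground anchors (a simplification; the original paper normalizes by the number of anchor locations). The names `smooth_l1` and `rpn_loss` are hypothetical.

```python
import math

def smooth_l1(x):
    """Smooth L1 penalty on one residual: quadratic near zero, linear beyond 1."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5

def rpn_loss(p, p_star, t, t_star, lam=1.0, eps=1e-12):
    """Classification + box-regression loss over a mini-batch of anchors.

    p      : predicted foreground probabilities, one per anchor
    p_star : ground-truth labels (1 = foreground, 0 = background)
    t      : predicted box parameters, one 4-tuple per anchor
    t_star : ground-truth box parameters, one 4-tuple per anchor
    lam    : weight balancing the two terms
    """
    n_cls = len(p)
    n_reg = max(sum(p_star), 1)  # assumption: normalize by foreground anchors
    cls_loss = 0.0
    reg_loss = 0.0
    for pi, yi, ti, ti_star in zip(p, p_star, t, t_star):
        # Binary log loss: object vs. not object
        cls_loss += -(yi * math.log(pi + eps) + (1 - yi) * math.log(1 - pi + eps))
        # The label yi switches the regression term off for background anchors
        reg_loss += yi * sum(smooth_l1(a - b) for a, b in zip(ti, ti_star))
    return cls_loss / n_cls + lam * reg_loss / n_reg
```

For example, `rpn_loss([0.9, 0.2], [1, 0], [(0.1, 0, 0, 0), (0, 0, 0, 0)], [(0, 0, 0, 0), (5, 5, 5, 5)])` ignores the large box error on the second anchor because it is labeled background.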
Evaluation: mAP
Definition: Mean Average Precision (mAP)
mAP is the standard metric for object detection:
- For each class, compute Precision-Recall curve
- Compute Average Precision (AP) as area under PR curve
- mAP = mean of AP across all classes
COCO mAP: Average over IoU thresholds from 0.5 to 0.95 (step 0.05)
Pascal VOC mAP: Uses IoU threshold of 0.5
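The three steps above can be sketched from scratch. The version below assumes each detection has already been matched against ground truth at one fixed IoU threshold (so it arrives as a score plus a true-positive flag) and uses all-point interpolation of the precision-recall curve; benchmark toolkits differ in interpolation details, so treat this as an illustration. The function names are hypothetical.

```python
def average_precision(detections, num_gt):
    """AP for one class as the area under the interpolated PR curve.

    detections : list of (score, is_true_positive) pairs
    num_gt     : number of ground-truth boxes for this class
    """
    if num_gt == 0:
        return 0.0
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    precisions, recalls = [], []
    tp = fp = 0
    for _, is_tp in ranked:
        if is_tp:
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / num_gt)

    # Precision envelope: make precision non-increasing as recall grows
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])

    # Sum rectangle areas wherever recall increases
    ap, prev_recall = 0.0, 0.0
    for prec, rec in zip(precisions, recalls):
        ap += (rec - prev_recall) * prec
        prev_recall = rec
    return ap

def mean_average_precision(per_class):
    """mAP: mean of per-class APs. per_class maps class -> (detections, num_gt)."""
    aps = [average_precision(dets, n) for dets, n in per_class.values()]
    return sum(aps) / len(aps) if aps else 0.0

dets = [(0.9, True), (0.8, False), (0.7, True), (0.6, False)]
print(f"AP = {average_precision(dets, num_gt=2):.4f}")  # AP = 0.8333
```

A COCO-style score would repeat the matching and this computation at each IoU threshold from 0.5 to 0.95 and average the resulting mAP values.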
💡 mAP Interpretation
- mAP@0.5: How well does the model detect objects (loose localization)?
- mAP@0.75: How well does the model localize objects precisely?
- mAP@[0.5:0.95]: The primary COCO metric, balances detection and localization
- A detector that localizes precisely shows only a small drop from mAP@0.5 to mAP@0.75
PyTorch Detection Example
📝Example: Simple Object Detection Pipeline
import torch
import torchvision
# ═══════════════════════════════════════════════════
# Pre-trained Faster R-CNN
# ═══════════════════════════════════════════════════
# weights="DEFAULT" replaces the deprecated pretrained=True (torchvision >= 0.13)
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(
    weights="DEFAULT"
)
model.eval()  # inference mode: the model returns predictions instead of losses
# Dummy input: a list with one 3x480x640 image tensor, values in [0, 1]
images = [torch.rand(3, 480, 640)]
with torch.no_grad():
    predictions = model(images)
# Results: one dict per input image with 'boxes', 'labels', and 'scores'
pred = predictions[0]
print(f"Detected {len(pred['boxes'])} objects")
print(f"Boxes shape: {pred['boxes'].shape}")
print(f"Labels: {pred['labels']}")
print(f"Scores: {pred['scores']}")
# ═══════════════════════════════════════════════════
# IoU Computation
# ═══════════════════════════════════════════════════
def compute_iou(box1, box2):
    """Compute IoU between two boxes in (x1, y1, x2, y2) format."""
    # Corners of the intersection rectangle
    x1 = max(box1[0], box2[0])
    y1 = max(box1[1], box2[1])
    x2 = min(box1[2], box2[2])
    y2 = min(box1[3], box2[3])
    # Clamp at zero so non-overlapping boxes give an empty intersection
    intersection = max(0, x2 - x1) * max(0, y2 - y1)
    area1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
    area2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
    union = area1 + area2 - intersection
    return intersection / union if union > 0 else 0.0
# Test
box1 = [0, 0, 100, 100]
box2 = [50, 50, 150, 150]
print(f"\nIoU: {compute_iou(box1, box2):.4f}")
# ═══════════════════════════════════════════════════
# Non-Maximum Suppression
# ═══════════════════════════════════════════════════
def nms(boxes, scores, iou_threshold=0.5):
    """Simple NMS: returns indices of the kept boxes, highest score first."""
    indices = scores.argsort(descending=True)
    keep = []
    while len(indices) > 0:
        idx = indices[0]
        keep.append(idx.item())  # store a plain int rather than a 0-dim tensor
        if len(indices) == 1:
            break
        remaining = indices[1:]
        ious = torch.tensor([
            compute_iou(boxes[idx].tolist(), boxes[i].tolist())
            for i in remaining
        ])
        # Keep only boxes that do not overlap the selected box too much
        mask = ious <= iou_threshold
        indices = remaining[mask]
    return keep
# Test NMS
boxes = torch.tensor([
    [10, 10, 50, 50],
    [12, 12, 52, 52],
    [100, 100, 150, 150],
])
scores = torch.tensor([0.9, 0.85, 0.7])
keep = nms(boxes, scores, iou_threshold=0.5)
print(f"\nNMS kept boxes: {keep}")  # [0, 2]: box 1 overlaps box 0 with IoU of about 0.82
Summary
📋Summary: Object Detection
- Detection: Predict bounding boxes + class labels + confidence
- IoU: Overlap metric for evaluating box accuracy (threshold 0.5)
- Anchor boxes: Pre-defined box shapes, network predicts offsets
- NMS: Removes duplicate detections (IoU > threshold)
- YOLO: Single-shot, real-time, grid-based prediction
- SSD: Multi-scale single-shot detection
- Faster R-CNN: Two-stage, high accuracy, RPN proposes regions
- mAP: Primary metric — AP averaged across classes
- COCO mAP: Average over IoU thresholds 0.5 to 0.95
Practice Exercises
- Mathematical: Given two bounding boxes and , compute the IoU. What happens if ?
- Coding: Implement YOLO's grid-based prediction: divide a 416×416 image into a 13×13 grid. For each cell, predict 3 anchor boxes with class probabilities.
- Experiment: Fine-tune a pre-trained Faster R-CNN on a custom dataset (e.g., pedestrians, cars). Compare mAP with and without transfer learning.
- Research: Read the YOLOv8 documentation (Ultralytics has not published a formal paper for it). What improvements were made over YOLOv5? How does the anchor-free approach differ?
- Evaluation: Compute mAP@0.5 and mAP@0.75 for a set of 10 predictions and 5 ground truth boxes. Implement the computation from scratch.