Object Detection: YOLO, Faster R-CNN, Anchor Boxes — Asked at Tesla & NVIDIA

🎯 The Interview Question

"Explain the difference between one-stage and two-stage object detectors. How does Faster R-CNN work, including the Region Proposal Network (RPN)? What is the YOLO approach and why is it faster? How do anchor boxes work, and what are the trade-offs between anchor-based and anchor-free detection?"

This question is critical for autonomous driving roles at Tesla and NVIDIA where real-time detection is essential.

📚 Detailed Answer

Object Detection Pipeline

Object detection combines:

Classification: What object is in the bounding box?
Localization: Where is the bounding box?

Loss function:

\mathcal{L} = \mathcal{L}_{cls} + \lambda_1 \mathcal{L}_{loc} + \lambda_2 \mathcal{L}_{obj}

where $\mathcal{L}_{cls}$ is the classification loss, $\mathcal{L}_{loc}$ is the bounding-box regression loss (commonly smooth L1), $\mathcal{L}_{obj}$ is the objectness loss, and $\lambda_1$, $\lambda_2$ are weighting hyperparameters.
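As a rough illustration of the regression term and the weighted sum, here is a minimal sketch in plain Python. The function names, the `beta` transition point, and the default weights of 1.0 are assumptions chosen for the example, not values from any particular detector.

```python
def smooth_l1(pred, target, beta=1.0):
    """Smooth L1 loss summed over box coordinates (beta is an assumed default)."""
    total = 0.0
    for p, t in zip(pred, target):
        diff = abs(p - t)
        # Quadratic near zero, linear for large errors (less sensitive to outliers).
        total += 0.5 * diff * diff / beta if diff < beta else diff - 0.5 * beta
    return total


def detection_loss(l_cls, l_loc, l_obj, lambda_1=1.0, lambda_2=1.0):
    """Weighted sum L = L_cls + lambda_1 * L_loc + lambda_2 * L_obj."""
    return l_cls + lambda_1 * l_loc + lambda_2 * l_obj


# Two small coordinate errors (quadratic regime) and two large ones (linear regime).
l_loc = smooth_l1([0.5, 0.5, 2.0, 2.0], [0.0, 0.0, 0.0, 0.0])
print(l_loc)                                            # 3.25
print(detection_loss(1.0, l_loc, 0.5, lambda_1=2.0))    # 8.0
```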

Two-Stage Detectors: Faster R-CNN

Architecture Overview

Input → Backbone (ResNet) → Feature Map
              ↓
         RPN → Region Proposals
              ↓
         ROI Pooling → Fixed-size features
              ↓
         Classification + Bounding Box Regression

Region Proposal Network (RPN)

The RPN slides a small network over the backbone feature map:

Anchor generation: At each spatial location, generate $k$ anchor boxes (different scales and aspect ratios)
Feature extraction: At each location, a $3 \times 3$ convolution produces one feature vector that is shared by all $k$ anchors centered there
Classification: $2k$ scores (object vs background for each anchor)
Regression: $4k$ coordinates (refined bounding box for each anchor)
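The anchor-generation step can be sketched as below. This is an illustrative version only: the function name and the `stride` parameter are assumptions, and the default scales and ratios follow the commonly cited Faster R-CNN setting of 3 scales times 3 aspect ratios ($k = 9$).

```python
import math


def generate_anchors(feat_h, feat_w, stride,
                     scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return (cx, cy, w, h) anchors; k = len(scales) * len(ratios) per location."""
    anchors = []
    for row in range(feat_h):
        for col in range(feat_w):
            # Center of this feature-map cell, expressed in input-image pixels.
            cx = (col + 0.5) * stride
            cy = (row + 0.5) * stride
            for scale in scales:
                for ratio in ratios:  # ratio = height / width
                    # Keep the area at scale**2 while changing the shape.
                    w = scale / math.sqrt(ratio)
                    h = scale * math.sqrt(ratio)
                    anchors.append((cx, cy, w, h))
    return anchors


anchors = generate_anchors(feat_h=2, feat_w=3, stride=16)
print(len(anchors))  # 2 * 3 locations * 9 anchors = 54
```

The RPN heads then emit $2k$ scores and $4k$ offsets per location, one set for each anchor in this list.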

Anchor box regression:

t_x = \frac{x - x_a}{w_a}, \quad t_y = \frac{y - y_a}{h_a}

t_w = \log\left(\frac{w}{w_a}\right), \quad t_h = \log\left(\frac{h}{h_a}\right)

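where $(x_a, y_a, w_a, h_a)$ are the anchor's center coordinates, width, and height, and $(x, y, w, h)$ are those of the ground-truth box. The network's predicted offsets use the same parameterization, so the regression loss compares predicted and target $(t_x, t_y, t_w, t_h)$.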
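A small sketch of this parameterization, assuming boxes are given as center-format tuples `(x, y, w, h)`. The names `encode` and `decode` are hypothetical helpers; `decode` simply inverts the equations above to recover a box from offsets.

```python
import math


def encode(box, anchor):
    """Compute regression targets (t_x, t_y, t_w, t_h) for a box given an anchor."""
    x, y, w, h = box
    xa, ya, wa, ha = anchor
    return ((x - xa) / wa, (y - ya) / ha, math.log(w / wa), math.log(h / ha))


def decode(deltas, anchor):
    """Invert encode(): turn offsets back into a center-format box."""
    tx, ty, tw, th = deltas
    xa, ya, wa, ha = anchor
    return (xa + tx * wa, ya + ty * ha, wa * math.exp(tw), ha * math.exp(th))


anchor = (100.0, 100.0, 64.0, 128.0)
gt = (116.0, 90.0, 80.0, 120.0)
t = encode(gt, anchor)
print(t[0], t[1])          # 0.25 -0.078125
print(decode(t, anchor))   # approximately the original ground-truth box
```

Normalizing the center shift by the anchor size and using log-space for width and height keeps the targets on a similar scale for small and large anchors.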

ROI Pooling

Converts variable-size proposals to fixed-size features:

Divide proposal into grid (e.g., 7×7)
Max pool each grid cell
Output: $C \times 7 \times 7$ feature map

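Problem: Quantization error. Proposal coordinates and bin boundaries are rounded to integer feature-map cells, so the pooled features are slightly misaligned with the proposal. The effect is most noticeable for small objects.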
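The pooling steps above can be sketched for a single channel as follows. This is a simplified illustration, not a library implementation: it assumes the ROI is already expressed in feature-map cells as `(x0, y0, x1, y1)` with exclusive upper bounds, lies inside the feature map, and covers at least one cell. The function name is hypothetical.

```python
def roi_max_pool(feature, roi, out_size=2):
    """Max-pool one channel (a list of rows) inside roi into out_size x out_size."""
    # Rounding the float coordinates is the first source of quantization.
    x0, y0, x1, y1 = (int(round(v)) for v in roi)
    h, w = y1 - y0, x1 - x0
    pooled = []
    for i in range(out_size):
        # Integer bin boundaries (floor start, ceil end): a second rounding step.
        r0 = y0 + (i * h) // out_size
        r1 = max(y0 + ((i + 1) * h + out_size - 1) // out_size, r0 + 1)
        row = []
        for j in range(out_size):
            c0 = x0 + (j * w) // out_size
            c1 = max(x0 + ((j + 1) * w + out_size - 1) // out_size, c0 + 1)
            row.append(max(feature[r][c]
                           for r in range(r0, r1) for c in range(c0, c1)))
        pooled.append(row)
    return pooled


feature = [[0, 1, 2, 3],
           [4, 5, 6, 7],
           [8, 9, 10, 11],
           [12, 13, 14, 15]]
# The float ROI is snapped to (0, 0, 4, 4) before pooling.
print(roi_max_pool(feature, (0.4, 0.2, 3.6, 4.1)))  # [[5, 7], [13, 15]]
```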

ROI Align (Improved)

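Avoids quantization by keeping proposal and bin coordinates as floating-point values. Each bin is sampled at a few regularly spaced points, each sample is computed by bilinear interpolation from the four nearest feature-map cells, and the samples are then max- or average-pooled. For a sample at $(x, y)$ on feature map $F$:

F(x, y) = \sum_{i,j} \max(0, 1 - |x - i|) \, \max(0, 1 - |y - j|) \, F_{ij}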
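The bilinear interpolation step at a single fractional location can be sketched in plain Python. The helper name is hypothetical, and the sketch assumes a single channel stored as a list of rows, with the sample point inside the map ($0 \le x \le W - 1$, $0 \le y \le H - 1$).

```python
import math


def bilinear_sample(feature, x, y):
    """Interpolate one channel of `feature` (list of rows) at a float (x, y)."""
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    # Clamp the right/bottom neighbours so border samples stay in range.
    x1 = min(x0 + 1, len(feature[0]) - 1)
    y1 = min(y0 + 1, len(feature) - 1)
    dx, dy = x - x0, y - y0
    # Interpolate along x on the two neighbouring rows, then along y.
    top = (1 - dx) * feature[y0][x0] + dx * feature[y0][x1]
    bottom = (1 - dx) * feature[y1][x0] + dx * feature[y1][x1]
    return (1 - dy) * top + dy * bottom


feature = [[0.0, 10.0],
           [20.0, 30.0]]
print(bilinear_sample(feature, 0.5, 0.5))   # 15.0
print(bilinear_sample(feature, 0.25, 0.0))  # 2.5
```

Because the sample location is never rounded, a small shift in the proposal produces a small change in the pooled value, which is what makes the operation better aligned than ROI Pooling.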

💡

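The original Faster R-CNN was reported at roughly 5 FPS with a VGG-16 backbone on GPUs of that era; lighter backbones and newer hardware run faster. For real-time detection (30+ FPS), one-stage detectors such as YOLO or SSD are the usual choice. When accuracy matters more than latency (e.g., detection competitions), two-stage or cascade detectors are preferred.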

One-Stage Detectors: YOLO

YOLO v1

Directly predicts bounding boxes and class probabilities:

Divide image into $S \times S$ grid
Each cell predicts $B$ bounding boxes and $C$ class probabilities
Output tensor: $S \times S \times (B \times 5 + C)$

Prediction format per box: $(x, y, w, h, confidence)$

$(x, y)$ : center offset (0-1 relative to cell)
$(w, h)$ : size relative to image
$confidence = P(\text{object}) \times \text{IoU}$
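To make the output layout concrete, the sketch below computes the tensor shape and converts one cell-relative prediction to pixel coordinates. The defaults $S = 7$, $B = 2$, $C = 20$ are the values usually quoted for YOLO v1 on Pascal VOC; the function names and the 448-pixel input size in the example are assumptions for illustration.

```python
def yolo_v1_output_shape(S=7, B=2, C=20):
    """Shape of the prediction tensor: S x S x (B * 5 + C)."""
    return (S, S, B * 5 + C)


def decode_cell_box(row, col, pred, S, img_w, img_h):
    """Convert one (x, y, w, h, confidence) prediction to pixel units.

    (x, y) is the center offset inside cell (row, col); (w, h) is relative
    to the whole image.
    """
    x, y, w, h, conf = pred
    cx = (col + x) / S * img_w
    cy = (row + y) / S * img_h
    return (cx, cy, w * img_w, h * img_h, conf)


print(yolo_v1_output_shape())  # (7, 7, 30)
# A box centered in the middle cell of a 7 x 7 grid on a 448 x 448 image.
print(decode_cell_box(3, 3, (0.5, 0.5, 0.25, 0.5, 0.9), S=7, img_w=448, img_h=448))
# (224.0, 224.0, 112.0, 224.0, 0.9)
```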

Loss function:

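\mathcal{L} = \lambda_{coord}\sum \text{box loss} + \sum \text{confidence loss (object)} + \lambda_{noobj}\sum \text{confidence loss (no object)} + \sum \text{classification loss}

The YOLO v1 paper uses $\lambda_{coord} = 5$ and $\lambda_{noobj} = 0.5$, which up-weights localization and down-weights the many grid cells that contain no object.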

YOLO v3-v8 Improvements

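Version	Key Innovation	Reported mAP	Reported FPS
v1	Single-shot detection	63.4	45
v3	Multi-scale prediction, FPN	57.9	20
v5	Anchor-based with auto-anchor	68.9	140
v8	Decoupled head, anchor-free	71.0	80

These figures are as reported for each release on different benchmarks, metrics, model sizes, and hardware (for example, the v1 number is Pascal VOC 2007 mAP, while the v3 number is COCO AP at IoU 0.5), so they are not directly comparable across rows.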

Anchor Boxes

Traditional Approach (Anchor-Based)

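Define a fixed set of anchor boxes and predict offsets relative to them:

# (width, height) in pixels; these values match the YOLOv3 COCO defaults
anchors = [
    (10, 13), (16, 30), (33, 23),  # Small
    (30, 61), (62, 45), (59, 119),  # Medium
    (116, 90), (156, 198), (373, 326)  # Large
]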

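Problem: Anchor shapes are dataset-dependent. They are usually chosen by clustering the training-set box sizes (for example with k-means), which adds hyperparameters that must be re-tuned for each dataset.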
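The clustering step might look like the sketch below: a bare-bones k-means over (width, height) pairs that uses IoU as the similarity measure, in the spirit of the dimension clustering popularized by YOLOv2. The function names, the naive initialization from the first `k` boxes, and the fixed iteration count are all simplifying assumptions.

```python
def wh_iou(a, b):
    """IoU of two boxes given only (w, h), assuming they share a corner."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    return inter / (a[0] * a[1] + b[0] * b[1] - inter)


def kmeans_anchors(sizes, k, iters=20):
    """Cluster (w, h) pairs into k anchor shapes, sorted by area."""
    centers = list(sizes[:k])  # naive init: first k boxes (an assumption)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for wh in sizes:
            # Assign each box to the center it overlaps most.
            best = max(range(k), key=lambda i: wh_iou(wh, centers[i]))
            clusters[best].append(wh)
        for i, members in enumerate(clusters):
            if members:  # keep the old center if a cluster is empty
                centers[i] = (sum(w for w, _ in members) / len(members),
                              sum(h for _, h in members) / len(members))
    return sorted(centers, key=lambda c: c[0] * c[1])


sizes = [(10, 12), (100, 110), (12, 10), (110, 100), (11, 11), (105, 105)]
print(kmeans_anchors(sizes, k=2))  # [(11.0, 11.0), (105.0, 105.0)]
```

Using IoU rather than Euclidean distance keeps large boxes from dominating the clustering objective.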

Anchor-Free Approach

Predict centers and sizes directly:

CenterNet: Predict heatmap of object centers, regress sizes.

FCOS: Predict distance from point to box edges:

\text{prediction} = (l, t, r, b) = (x - x_{min}, y - y_{min}, x_{max} - x, y_{max} - y)
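A minimal sketch of the FCOS-style target for one location, assuming boxes are given as `(x_min, y_min, x_max, y_max)`. The helper names are hypothetical, and treating a point outside the box as "no target" is a simplification of the full assignment rules.

```python
def fcos_targets(point, box):
    """Return distances (l, t, r, b) from a point to the four box edges."""
    x, y = point
    x_min, y_min, x_max, y_max = box
    l, t, r, b = x - x_min, y - y_min, x_max - x, y_max - y
    if min(l, t, r, b) <= 0:
        return None  # point is not strictly inside the box: not a positive sample
    return (l, t, r, b)


def box_from_distances(point, ltrb):
    """Invert fcos_targets(): rebuild the box from a point and its distances."""
    x, y = point
    l, t, r, b = ltrb
    return (x - l, y - t, x + r, y + b)


target = fcos_targets((50, 40), (20, 10, 100, 60))
print(target)                                # (30, 30, 50, 20)
print(box_from_distances((50, 40), target))  # (20, 10, 100, 60)
print(fcos_targets((5, 5), (20, 10, 100, 60)))  # None
```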

Advantages:

No anchor hyperparameters
More flexible for unusual aspect ratios
Simpler training (no anchor-to-ground-truth IoU matching step)

Non-Maximum Suppression (NMS)

Post-processing to remove duplicate detections:

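Sort boxes by confidence (highest first)
Select the box with the highest confidence and add it to the output
Remove every remaining box whose IoU with the selected box exceeds a threshold (e.g., 0.5)
Repeat on the remaining boxes until none are left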
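The steps above can be sketched as a greedy loop in plain Python. This is an illustrative single-class version with a quadratic cost; the function names are assumptions, and boxes are taken to be `(x_min, y_min, x_max, y_max)` tuples.

```python
def iou(a, b):
    """IoU of two boxes in (x_min, y_min, x_max, y_max) format."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Return indices of the boxes kept by greedy non-maximum suppression."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)  # highest-scoring remaining box
        keep.append(best)
        # Drop everything that overlaps the selected box too much.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]: box 1 overlaps box 0 with IoU of about 0.68
```

In multi-class detectors this procedure is typically run separately for each class.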

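Soft-NMS: Instead of removing overlapping boxes, decay their confidence. With $b$ the currently selected box and $b_i$ each remaining box (Gaussian variant):

s_i \leftarrow s_i \cdot e^{-\frac{\text{IoU}(b, b_i)^2}{\sigma}}

Boxes with high overlap are heavily down-weighted rather than discarded, which helps in crowded scenes where distinct objects legitimately overlap.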
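A sketch of the Gaussian Soft-NMS update in plain Python. The defaults `sigma=0.5` and `score_threshold=0.001` are assumed illustrative values, the function names are hypothetical, and boxes are `(x_min, y_min, x_max, y_max)` tuples.

```python
import math


def iou(a, b):
    """IoU of two boxes in (x_min, y_min, x_max, y_max) format."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def soft_nms(boxes, scores, sigma=0.5, score_threshold=0.001):
    """Return (index, decayed_score) pairs in the order boxes are selected."""
    scores = list(scores)  # work on a copy
    remaining = list(range(len(boxes)))
    keep = []
    while remaining:
        best = max(remaining, key=lambda i: scores[i])
        remaining.remove(best)
        keep.append((best, scores[best]))
        for i in remaining:
            # Gaussian decay: the larger the overlap, the stronger the penalty.
            overlap = iou(boxes[best], boxes[i])
            scores[i] *= math.exp(-(overlap ** 2) / sigma)
        remaining = [i for i in remaining if scores[i] >= score_threshold]
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
result = soft_nms(boxes, [0.9, 0.8, 0.7])
print([i for i, _ in result])  # [0, 2, 1]: box 1 survives with a reduced score
```

Unlike hard NMS, the overlapping box is kept with a lower score, so a downstream confidence threshold decides whether it counts as a detection.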

Evaluation Metrics

IoU (Intersection over Union):

\text{IoU} = \frac{|A \cap B|}{|A \cup B|}

mAP (mean Average Precision):

Compute AP for each class (precision-recall curve area), average across classes.

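COCO metrics: mAP@0.5:0.95, the AP averaged over IoU thresholds from 0.50 to 0.95 in steps of 0.05 (and then over classes). A detection counts as a true positive at a given threshold only if its IoU with a matched ground-truth box meets that threshold.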
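As an illustration of the per-class AP computation, the sketch below takes detections that have already been matched to ground truth at a single IoU threshold and integrates the interpolated precision-recall curve. The function names and input format are assumptions; the exact interpolation scheme differs slightly between benchmarks.

```python
def average_precision(scores, is_true_positive, num_ground_truth):
    """Area under the interpolated precision-recall curve for one class."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    recalls, precisions = [], []
    for i in order:
        if is_true_positive[i]:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_ground_truth)
        precisions.append(tp / (tp + fp))
    # Interpolate: make precision non-increasing as recall grows.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    ap, prev_recall = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


def mean_average_precision(ap_per_class):
    """mAP: the plain average of per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)


# Four detections, the second is a false positive; four ground-truth objects.
ap = average_precision([0.9, 0.8, 0.7, 0.6], [True, False, True, True], 4)
print(ap)                                 # 0.625
print(mean_average_precision([ap, 1.0]))  # 0.8125
```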

Follow-Up Questions

Q: How do you handle small object detection? A: Use FPN (Feature Pyramid Network) for multi-scale features, higher input resolution, and attention mechanisms that focus on small regions.

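Q: What is the difference between IoU loss and GIoU loss? A: IoU loss = $1 - \text{IoU}$, which gives no useful gradient when the boxes do not overlap (IoU is 0 regardless of how far apart they are). GIoU adds a penalty based on the smallest box $C$ that encloses both $A$ and $B$: $\text{GIoU} = \text{IoU} - \frac{|C \setminus (A \cup B)|}{|C|}$ , and the loss is $1 - \text{GIoU}$. The penalty shrinks as non-overlapping boxes move closer together.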
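A short sketch of GIoU for axis-aligned boxes in `(x_min, y_min, x_max, y_max)` format, assuming both boxes have positive area. The function name is an illustrative choice.

```python
def giou(a, b):
    """Generalized IoU of two boxes in (x_min, y_min, x_max, y_max) format."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    # C: the smallest axis-aligned box enclosing both a and b.
    c_area = (max(a[2], b[2]) - min(a[0], b[0])) * (max(a[3], b[3]) - min(a[1], b[1]))
    return inter / union - (c_area - union) / c_area


a = (0, 0, 2, 2)
far = (3, 0, 5, 2)    # no overlap, gap of 1
farther = (6, 0, 8, 2)  # no overlap, gap of 4
print(giou(a, a))        # 1.0 for identical boxes
print(giou(a, far))      # about -0.2; plain IoU would be 0
print(giou(a, farther))  # -0.5; the penalty grows with distance
```

Both non-overlapping pairs have an IoU of 0, but their GIoU values differ, which is what lets the loss $1 - \text{GIoU}$ pull distant boxes toward the target.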

Q: How does NMS affect recall in crowded scenes? A: Standard NMS removes valid overlapping detections. Use Soft-NMS or learned NMS for crowded scenes.