🎯 The Interview Question
"Explain the difference between one-stage and two-stage object detectors. How does Faster R-CNN work, including the Region Proposal Network (RPN)? What is the YOLO approach and why is it faster? How do anchor boxes work, and what are the trade-offs between anchor-based and anchor-free detection?"
This question is critical for autonomous driving roles at Tesla and NVIDIA where real-time detection is essential.
📚 Detailed Answer
Object Detection Pipeline
Object detection combines:
- Classification: What object is in the bounding box?
- Localization: Where is the bounding box?
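Loss function:
$$L = L_{cls} + \lambda_{box} L_{box} + \lambda_{obj} L_{obj}$$
where $L_{cls}$ is the classification loss, $L_{box}$ is the bounding box regression loss (smooth L1), $L_{obj}$ is the objectness loss, and $\lambda_{box}$, $\lambda_{obj}$ are weighting hyperparameters.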
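As a rough illustration of how these terms combine, the sketch below implements smooth L1 and a weighted sum of the three losses. The names `w_box`, `w_obj`, and the `beta` threshold are illustrative assumptions, not values taken from any specific detector.

```python
def smooth_l1(error, beta=1.0):
    """Quadratic for |error| < beta, linear beyond it (beta is an assumed default)."""
    abs_err = abs(error)
    if abs_err < beta:
        return 0.5 * abs_err ** 2 / beta
    return abs_err - 0.5 * beta


def box_loss(pred, target):
    """Sum of smooth L1 over the four box coordinates."""
    return sum(smooth_l1(p - t) for p, t in zip(pred, target))


def detection_loss(cls_loss, box_loss_value, obj_loss, w_box=1.0, w_obj=1.0):
    """Weighted sum of the three terms; the weights are hypothetical hyperparameters."""
    return cls_loss + w_box * box_loss_value + w_obj * obj_loss


print(smooth_l1(0.5), smooth_l1(2.0))  # 0.125 1.5
```

The quadratic region keeps gradients small near zero error, while the linear region limits the influence of outlier boxes.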
Two-Stage Detectors: Faster R-CNN
Architecture Overview
Input → Backbone (ResNet) → Feature Map
↓
RPN → Region Proposals
↓
ROI Pooling → Fixed-size features
↓
Classification + Bounding Box Regression
Region Proposal Network (RPN)
The RPN slides a small network over the feature map:
- Anchor generation: At each spatial location, generate $k$ anchor boxes (different scales and aspect ratios)
- Feature extraction: A 3×3 convolution at each location produces a feature vector shared by all $k$ anchors at that location
- Classification: $2k$ scores (object vs. background for each anchor)
- Regression: $4k$ coordinates (refined bounding box for each anchor)
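The anchor generation step can be sketched as follows. This is a minimal illustration that assumes three scales and three aspect ratios (nine anchors per location) and a feature stride of 16; the function name `generate_anchors` and its parameters are hypothetical.

```python
import math


def generate_anchors(feat_h, feat_w, stride, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return anchors as (cx, cy, w, h) in image pixels, one set per feature-map cell."""
    anchors = []
    for row in range(feat_h):
        for col in range(feat_w):
            # Center of this feature-map cell, mapped back to image coordinates.
            cx = (col + 0.5) * stride
            cy = (row + 0.5) * stride
            for scale in scales:
                for ratio in ratios:
                    # ratio = h / w, with the anchor area held at scale ** 2.
                    w = scale / math.sqrt(ratio)
                    h = scale * math.sqrt(ratio)
                    anchors.append((cx, cy, w, h))
    return anchors


anchors = generate_anchors(feat_h=2, feat_w=3, stride=16)
print(len(anchors))  # 54 = 2 * 3 * 9
```

With nine anchors per location, the classification head outputs 18 scores and the regression head 36 coordinates per location.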
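Anchor box regression:
$$t_x^* = \frac{x^* - x_a}{w_a}, \quad t_y^* = \frac{y^* - y_a}{h_a}, \quad t_w^* = \log\frac{w^*}{w_a}, \quad t_h^* = \log\frac{h^*}{h_a}$$
where $(x_a, y_a, w_a, h_a)$ are the anchor coordinates (center, width, height) and $(x^*, y^*, w^*, h^*)$ are the ground truth.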
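The anchor-relative parameterization is easiest to check with a round trip: encode a ground-truth box against an anchor, then decode the offsets back. The sketch below assumes boxes in (center x, center y, width, height) form; the helper names `encode` and `decode` are illustrative.

```python
import math


def encode(box, anchor):
    """Regression targets (tx, ty, tw, th) for a box relative to an anchor."""
    x, y, w, h = box
    xa, ya, wa, ha = anchor
    return ((x - xa) / wa, (y - ya) / ha, math.log(w / wa), math.log(h / ha))


def decode(offsets, anchor):
    """Invert encode(): recover the box from predicted offsets and the anchor."""
    tx, ty, tw, th = offsets
    xa, ya, wa, ha = anchor
    return (xa + tx * wa, ya + ty * ha, wa * math.exp(tw), ha * math.exp(th))


anchor = (100.0, 100.0, 50.0, 80.0)
gt_box = (110.0, 90.0, 100.0, 40.0)
targets = encode(gt_box, anchor)
print(targets)
print(decode(targets, anchor))
```

Normalizing the center shift by the anchor size and taking the log of the size ratio keeps the targets on a similar scale for small and large anchors.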
ROI Pooling
Converts variable-size proposals to fixed-size features:
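- Divide each proposal into a fixed $k \times k$ grid (e.g., 7×7)
- Max pool each grid cell
- Output: $k \times k \times C$ feature map, where $C$ is the number of channels
Problem: Quantization error. Proposal coordinates and bin boundaries are rounded to integer feature-map cells, which misaligns the pooled features with the original region.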
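A single-channel sketch of ROI pooling makes the two rounding steps visible. It assumes the feature map is a 2D list, that the ROI is given in feature-map coordinates and lies inside the map, and it uses a 2×2 output for brevity; `roi_max_pool` is a hypothetical helper, not a library function.

```python
import math


def roi_max_pool(feature, roi, out_size=2):
    """Max-pool an ROI (x0, y0, x1, y1) from a 2D feature map into out_size x out_size."""
    # Quantization 1: snap the real-valued ROI to integer cells.
    x0, y0, x1, y1 = (int(round(v)) for v in roi)
    roi_w = max(x1 - x0 + 1, 1)
    roi_h = max(y1 - y0 + 1, 1)
    pooled = []
    for i in range(out_size):
        # Quantization 2: bin boundaries are also forced onto integer cells.
        ys = y0 + math.floor(i * roi_h / out_size)
        ye = y0 + math.ceil((i + 1) * roi_h / out_size)
        row = []
        for j in range(out_size):
            xs = x0 + math.floor(j * roi_w / out_size)
            xe = x0 + math.ceil((j + 1) * roi_w / out_size)
            row.append(max(feature[y][x] for y in range(ys, ye) for x in range(xs, xe)))
        pooled.append(row)
    return pooled


feature = [[4 * y + x for x in range(4)] for y in range(4)]  # values 0..15
print(roi_max_pool(feature, (0.4, 0.4, 3.4, 3.4)))  # [[5, 7], [13, 15]]
```

The ROI `(0.4, 0.4, 3.4, 3.4)` yields the same output as `(0, 0, 3, 3)`: the sub-cell offset is discarded by rounding.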
ROI Align (Improved)
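ROI Align samples the feature map at real-valued points using bilinear interpolation instead of rounding, which avoids quantization:
$$f(x, y) = \sum_{i,j} f(x_i, y_j)\, \max(0, 1 - |x - x_i|)\, \max(0, 1 - |y - y_j|)$$
where $(x_i, y_j)$ are integer grid points of the feature map; only the four neighbors of $(x, y)$ contribute.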
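A minimal single-channel sketch of ROI Align follows. It assumes a 2D-list feature map, a fixed number of sample points per bin (`samples`, a hypothetical parameter), and averaging of the samples within each bin.

```python
import math


def bilinear(feature, x, y):
    """Bilinearly interpolate a 2D feature map at a real-valued point (x, y)."""
    h, w = len(feature), len(feature[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = math.floor(x), math.floor(y)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    dx, dy = x - x0, y - y0
    top = feature[y0][x0] * (1 - dx) + feature[y0][x1] * dx
    bottom = feature[y1][x0] * (1 - dx) + feature[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def roi_align(feature, roi, out_size=2, samples=2):
    """Average bilinear samples in each bin; ROI coordinates are never rounded."""
    x0, y0, x1, y1 = roi
    bin_w = (x1 - x0) / out_size
    bin_h = (y1 - y0) / out_size
    pooled = []
    for i in range(out_size):
        row = []
        for j in range(out_size):
            values = []
            for sy in range(samples):
                for sx in range(samples):
                    # Sample points are spread evenly inside the bin.
                    px = x0 + (j + (sx + 0.5) / samples) * bin_w
                    py = y0 + (i + (sy + 0.5) / samples) * bin_h
                    values.append(bilinear(feature, px, py))
            row.append(sum(values) / len(values))
        pooled.append(row)
    return pooled


feature = [[4 * y + x for x in range(4)] for y in range(4)]
print(roi_align(feature, (0.0, 0.0, 2.0, 2.0)))
```

Because the ROI boundaries stay real-valued, shifting the ROI by a fraction of a cell changes the output smoothly rather than in jumps.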
💡 Faster R-CNN runs at roughly 5 FPS (as reported in the original paper on a GPU). For real-time detection (30+ FPS), use YOLO or SSD. For maximum accuracy (e.g., detection competitions), use two-stage or cascade detectors.
One-Stage Detectors: YOLO
YOLO v1
Directly predicts bounding boxes and class probabilities:
- Divide the image into an $S \times S$ grid
- Each cell predicts $B$ bounding boxes and $C$ class probabilities
- Output tensor: $S \times S \times (5B + C)$
Prediction format per box: $(x, y, w, h, \text{confidence})$
- $(x, y)$: center offset (0-1 relative to cell)
- $(w, h)$: size relative to image
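To make the prediction format concrete, the sketch below converts one YOLO v1 box prediction into pixel corner coordinates. It assumes the settings from the original YOLO v1 paper (a 7×7 grid, 2 boxes per cell, 20 classes, 448×448 input); the function name `decode_yolo_v1_box` is illustrative.

```python
def decode_yolo_v1_box(pred, row, col, grid_size, img_w, img_h):
    """Convert (x, y, w, h, confidence) from cell (row, col) to pixel corners."""
    x, y, w, h, confidence = pred
    # Center: cell index plus in-cell offset, scaled from grid units to pixels.
    cx = (col + x) / grid_size * img_w
    cy = (row + y) / grid_size * img_h
    # Size is predicted relative to the whole image.
    box_w = w * img_w
    box_h = h * img_h
    return (cx - box_w / 2, cy - box_h / 2, cx + box_w / 2, cy + box_h / 2, confidence)


S, B, C = 7, 2, 20
print((S, S, 5 * B + C))  # (7, 7, 30)

box = decode_yolo_v1_box((0.5, 0.5, 0.2, 0.4, 0.9), row=3, col=3, grid_size=S, img_w=448, img_h=448)
print(box)
```

A prediction with offset (0.5, 0.5) in the middle cell of the grid decodes to a box centered on the image center.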
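Loss function (sum of squared errors, as in the YOLO v1 paper):
$$
\begin{aligned}
L ={}& \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
&+ \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left[ (\sqrt{w_i} - \sqrt{\hat{w}_i})^2 + (\sqrt{h_i} - \sqrt{\hat{h}_i})^2 \right] \\
&+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} (C_i - \hat{C}_i)^2 + \lambda_{noobj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{noobj} (C_i - \hat{C}_i)^2 \\
&+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{obj} \sum_{c \in \text{classes}} (p_i(c) - \hat{p}_i(c))^2
\end{aligned}
$$
where $\mathbb{1}_{ij}^{obj}$ is 1 when box $j$ in cell $i$ is responsible for an object, $C_i$ is the box confidence score, and $p_i(c)$ is the class probability. The paper uses $\lambda_{coord} = 5$ and $\lambda_{noobj} = 0.5$.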
YOLO v3-v8 Improvements
| Version | Key Innovation | mAP | FPS |
|---|---|---|---|
| v1 | Single-shot detection | 63.4 | 45 |
| v3 | Multi-scale prediction, FPN | 57.9 | 20 |
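| v5 | Anchor-based with auto-anchor | 68.9 | 140 |
| v8 | Decoupled head, anchor-free | 71.0 | 80 |

Note: these figures are not directly comparable. The v1 score is PASCAL VOC 2007 mAP, the v3 score is COCO AP at IoU 0.5, and FPS depends on model size and hardware.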
Anchor Boxes
Traditional Approach (Anchor-Based)
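Define fixed anchor boxes as (width, height) priors and predict offsets relative to them. The values below are the nine YOLOv3 COCO anchors, in pixels: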
anchors = [
(10, 13), (16, 30), (33, 23), # Small
(30, 61), (62, 45), (59, 119), # Medium
(116, 90), (156, 198), (373, 326) # Large
]
Problem: Anchors are dataset-dependent and are typically chosen by k-means clustering over the training-set box sizes.
Anchor-Free Approach
Predict centers and sizes directly:
CenterNet: Predict heatmap of object centers, regress sizes.
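FCOS: Predict the distances from a point $(x, y)$ to the four edges of its box $(x_0, y_0, x_1, y_1)$:
$$l = x - x_0, \quad t = y - y_0, \quad r = x_1 - x, \quad b = y_1 - y$$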
Advantages:
- No anchor hyperparameters
- More flexible for unusual aspect ratios
- Simpler training
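The FCOS-style parameterization described above can be sketched as a pair of helpers: one computes the four edge distances for a point inside a box, the other recovers the box from them. The function names are illustrative, and boxes are assumed to be in (x0, y0, x1, y1) corner form.

```python
def fcos_targets(point, box):
    """Distances (left, top, right, bottom) from a point to the box edges.

    Returns None when the point lies outside the box (or on its border),
    since such points are not positive samples for that box.
    """
    x, y = point
    x0, y0, x1, y1 = box
    left, top, right, bottom = x - x0, y - y0, x1 - x, y1 - y
    if min(left, top, right, bottom) <= 0:
        return None
    return (left, top, right, bottom)


def fcos_decode(point, distances):
    """Recover the box corners from a point and its predicted edge distances."""
    x, y = point
    left, top, right, bottom = distances
    return (x - left, y - top, x + right, y + bottom)


targets = fcos_targets((50, 40), (10, 20, 110, 80))
print(targets)                         # (40, 20, 60, 40)
print(fcos_decode((50, 40), targets))  # (10, 20, 110, 80)
```

No anchor shapes appear anywhere in this encoding, which is why unusual aspect ratios need no special handling.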
Non-Maximum Suppression (NMS)
Post-processing to remove duplicate detections:
- Sort boxes by confidence
- Select box with highest confidence
- Remove every remaining box whose IoU with the selected box exceeds a threshold (e.g., 0.5)
- Repeat on the remaining boxes until none are left
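The steps above can be sketched in a few lines of plain Python. This is a minimal greedy version for a single class, assuming boxes in (x0, y0, x1, y1) form; production code would normally use a vectorized library routine instead.

```python
def iou(a, b):
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)  # highest-confidence box still in play
        keep.append(best)
        # Drop everything that overlaps the selected box too much.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]
```

The second box overlaps the first with IoU 81/119 (about 0.68), so it is suppressed; the third box does not overlap and survives.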
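Soft-NMS: Instead of removing overlapping boxes, decay their confidence. With a Gaussian penalty:
$$s_i \leftarrow s_i \exp\left(-\frac{\mathrm{IoU}(M, b_i)^2}{\sigma}\right)$$
where $M$ is the currently selected box and $b_i$ is a remaining box with score $s_i$.
This works better in crowded scenes, where heavily overlapping boxes may belong to different objects.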
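A minimal sketch of Soft-NMS with a Gaussian decay follows. The values `sigma=0.5` and `score_threshold=0.001` are assumed defaults for illustration, and boxes are in (x0, y0, x1, y1) form.

```python
import math


def iou(a, b):
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def soft_nms(boxes, scores, sigma=0.5, score_threshold=0.001):
    """Return (index, decayed_score) pairs in selection order."""
    scores = list(scores)  # work on a copy; the caller's list is left untouched
    remaining = list(range(len(boxes)))
    kept = []
    while remaining:
        best = max(remaining, key=lambda i: scores[i])
        remaining.remove(best)
        if scores[best] < score_threshold:
            break  # everything left has decayed below the cutoff
        kept.append((best, scores[best]))
        for i in remaining:
            # Gaussian decay: heavy overlap shrinks the score, zero overlap leaves it.
            overlap = iou(boxes[best], boxes[i])
            scores[i] *= math.exp(-(overlap ** 2) / sigma)
    return kept


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
for index, score in soft_nms(boxes, scores):
    print(index, round(score, 4))
```

Here the overlapping second box is kept with a reduced score instead of being discarded, so a downstream confidence threshold decides its fate.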
Evaluation Metrics
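IoU (Intersection over Union) of a predicted box $A$ and a ground-truth box $B$:
$$\mathrm{IoU}(A, B) = \frac{|A \cap B|}{|A \cup B|}$$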
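mAP (mean Average Precision):
$$AP = \int_0^1 p(r)\,dr, \qquad mAP = \frac{1}{N} \sum_{c=1}^{N} AP_c$$
Compute AP for each class as the area under its precision-recall curve $p(r)$, then average across the $N$ classes.
COCO metrics: mAP@0.5:0.95, i.e. AP averaged over IoU thresholds from 0.5 to 0.95 in steps of 0.05.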
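The sketch below computes AP for one class from a list of scored detections. It assumes each detection has already been marked as a true or false positive at a chosen IoU threshold, and it uses all-point interpolation (precision envelope); benchmark toolkits differ in the exact interpolation, so treat this as illustrative.

```python
def average_precision(scores, is_true_positive, num_ground_truth):
    """Area under the precision-recall curve for one class."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    recalls, precisions = [], []
    for i in order:
        if is_true_positive[i]:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_ground_truth)
        precisions.append(tp / (tp + fp))
    # Precision envelope: make precision non-increasing as recall grows.
    for k in range(len(precisions) - 2, -1, -1):
        precisions[k] = max(precisions[k], precisions[k + 1])
    ap, previous_recall = 0.0, 0.0
    for recall, precision in zip(recalls, precisions):
        ap += (recall - previous_recall) * precision
        previous_recall = recall
    return ap


def mean_average_precision(per_class_ap):
    """Average the per-class AP values."""
    return sum(per_class_ap) / len(per_class_ap)


ap = average_precision([0.9, 0.8, 0.7], [True, False, True], num_ground_truth=2)
print(round(ap, 4))  # 0.8333
```

For a COCO-style score, the same computation would be repeated at each IoU threshold and the results averaged.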
Follow-Up Questions
Q: How do you handle small object detection? A: Use FPN (Feature Pyramid Network) for multi-scale features, higher input resolution, and attention mechanisms that focus on small regions.
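Q: What is the difference between IoU loss and GIoU loss? A: IoU loss = $1 - \mathrm{IoU}$, which gives no useful gradient when boxes do not overlap (IoU is 0 regardless of distance). GIoU adds a penalty for non-overlapping boxes: $L_{GIoU} = 1 - \mathrm{IoU} + \frac{|C \setminus (A \cup B)|}{|C|}$, where $C$ is the smallest box enclosing both $A$ and $B$.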
Q: How does NMS affect recall in crowded scenes? A: Standard NMS removes valid overlapping detections. Use Soft-NMS or learned NMS for crowded scenes.
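For the IoU vs. GIoU follow-up, a short numeric check shows why GIoU helps with non-overlapping boxes. The sketch assumes non-degenerate boxes in (x0, y0, x1, y1) form; `iou_and_giou` is an illustrative helper name.

```python
def iou_and_giou(a, b):
    """Return (IoU, GIoU) for two boxes with positive area."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    iou = inter / union
    # Smallest axis-aligned box that encloses both inputs.
    enclosing_area = (max(a[2], b[2]) - min(a[0], b[0])) * (max(a[3], b[3]) - min(a[1], b[1]))
    giou = iou - (enclosing_area - union) / enclosing_area
    return iou, giou


target = (0, 0, 1, 1)
for prediction in [(2, 0, 3, 1), (4, 0, 5, 1)]:
    iou, giou = iou_and_giou(target, prediction)
    print(f"IoU loss = {1 - iou:.3f}, GIoU loss = {1 - giou:.3f}")
```

Both predictions have IoU loss 1.0, but the GIoU loss is larger for the more distant box, so the loss still decreases as the prediction moves toward the target.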