
Object Detection: YOLO, Faster R-CNN, Anchor Boxes — Asked at Tesla & NVIDIA



🎯 The Interview Question

"Explain the difference between one-stage and two-stage object detectors. How does Faster R-CNN work, including the Region Proposal Network (RPN)? What is the YOLO approach and why is it faster? How do anchor boxes work, and what are the trade-offs between anchor-based and anchor-free detection?"

This question is critical for autonomous driving roles at Tesla and NVIDIA, where real-time detection is essential.


📚 Detailed Answer

Object Detection Pipeline

Object detection combines:

  1. Classification: What object is in the bounding box?
  2. Localization: Where is the bounding box?

Loss function:

$$\mathcal{L} = \mathcal{L}_{cls} + \lambda_1 \mathcal{L}_{loc} + \lambda_2 \mathcal{L}_{obj}$$

where $\mathcal{L}_{cls}$ is the classification loss, $\mathcal{L}_{loc}$ is the bounding box regression loss (smooth L1), and $\mathcal{L}_{obj}$ is the objectness loss.
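As a rough illustration of the localization term, smooth L1 can be sketched in a few lines. The function name and the `beta` threshold parameter are assumptions of this sketch, not something the text above specifies.

```python
def smooth_l1(pred, target, beta=1.0):
    """Smooth L1 loss summed over box coordinates (illustrative sketch)."""
    total = 0.0
    for p, t in zip(pred, target):
        diff = abs(p - t)
        if diff < beta:
            total += 0.5 * diff * diff / beta  # quadratic for small errors
        else:
            total += diff - 0.5 * beta         # linear for large errors
    return total
```

The quadratic region keeps gradients small near the target, while the linear region limits the influence of badly mislocalized boxes.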

Two-Stage Detectors: Faster R-CNN

Architecture Overview

Architecture Diagram
Input → Backbone (ResNet) → Feature Map
              ↓
         RPN → Region Proposals
              ↓
         ROI Pooling → Fixed-size features
              ↓
         Classification + Bounding Box Regression

Region Proposal Network (RPN)

The RPN slides a small network over the backbone feature map:

  1. Anchor generation: At each spatial location, generate $k$ anchor boxes (different scales and aspect ratios)

  2. Feature extraction: At each location, apply a $3 \times 3$ convolution to the feature map; the resulting feature vector is shared by all $k$ anchors at that location

  3. Classification: $2k$ scores (object vs background for each anchor)

  4. Regression: $4k$ coordinates (refined bounding box for each anchor)
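The anchor generation step can be sketched as follows. The default scales and aspect ratios below are the three-by-three configuration commonly cited for Faster R-CNN (giving k = 9); the function name and the (cx, cy, w, h) layout are assumptions of this sketch.

```python
import math

def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return k = len(scales) * len(ratios) anchors as (cx, cy, w, h)."""
    anchors = []
    for s in scales:
        for r in ratios:            # r is height / width
            w = s / math.sqrt(r)    # width and height chosen so that
            h = s * math.sqrt(r)    # the area stays close to s * s
            anchors.append((cx, cy, w, h))
    return anchors
```

Calling `make_anchors` once per feature-map location (mapped back to image coordinates) yields the full anchor set that the classification and regression heads score.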

Anchor box regression:

$$t_x = \frac{x - x_a}{w_a}, \quad t_y = \frac{y - y_a}{h_a}$$

$$t_w = \log\left(\frac{w}{w_a}\right), \quad t_h = \log\left(\frac{h}{h_a}\right)$$

where $(x_a, y_a, w_a, h_a)$ are the anchor's center coordinates, width, and height, and $(x, y, w, h)$ are those of the ground-truth box.
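A minimal sketch of this parameterization, assuming boxes are given as (center x, center y, width, height). The `encode` and `decode` names are illustrative; `decode` simply inverts the four equations to recover a box from predicted offsets.

```python
import math

def encode(box, anchor):
    """Regression targets (t_x, t_y, t_w, t_h) of a box relative to an anchor."""
    x, y, w, h = box
    xa, ya, wa, ha = anchor
    return ((x - xa) / wa, (y - ya) / ha, math.log(w / wa), math.log(h / ha))

def decode(t, anchor):
    """Invert encode(): apply offsets to an anchor to recover the box."""
    tx, ty, tw, th = t
    xa, ya, wa, ha = anchor
    return (xa + tx * wa, ya + ty * ha, wa * math.exp(tw), ha * math.exp(th))
```

Normalizing by the anchor size and using log-space for width and height keeps the targets on a similar scale regardless of how large the object is.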

ROI Pooling

Converts variable-size proposals to fixed-size features:

  1. Divide proposal into grid (e.g., 7×7)
  2. Max pool each grid cell
  3. Output: $C \times 7 \times 7$ feature map

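Problem: Quantization error. Proposal coordinates and bin boundaries are rounded to integer feature-map cells, which misaligns the pooled features with the proposal.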

ROI Align (Improved)

Samples the feature map at fractional coordinates using bilinear interpolation, so no coordinates are rounded:

$$\text{ROIAlign}(p) = \sum_{i,j} w_{ij} \cdot \text{bilinear}(\text{grid}(p), (i,j))$$
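The bilinear sampling at the heart of ROI Align can be illustrated on a single-channel feature map. This is a sketch under stated assumptions: the map is a plain list of rows, the sample point lies inside the map, and the function name is hypothetical.

```python
import math

def bilinear_sample(fmap, y, x):
    """Sample a 2-D feature map (list of rows) at fractional (y, x).

    Assumes 0 <= y <= height - 1 and 0 <= x <= width - 1.
    """
    h, w = len(fmap), len(fmap[0])
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    # Interpolate along x on the two neighboring rows, then along y.
    top = fmap[y0][x0] * (1 - dx) + fmap[y0][x1] * dx
    bottom = fmap[y1][x0] * (1 - dx) + fmap[y1][x1] * dx
    return top * (1 - dy) + bottom * dy
```

ROI Align evaluates a few such sample points inside each output bin and pools them, instead of snapping the bin to whole cells.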

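💡 Faster R-CNN runs at roughly 5 FPS in its original VGG-16 configuration. For real-time detection (30+ FPS), prefer a one-stage detector such as YOLO or SSD. When accuracy matters more than latency, as in detection competitions, two-stage or cascade detectors are the usual choice.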

One-Stage Detectors: YOLO

YOLO v1

Directly predicts bounding boxes and class probabilities:

  1. Divide image into $S \times S$ grid
  2. Each cell predicts $B$ bounding boxes and $C$ class probabilities
  3. Output tensor: $S \times S \times (B \times 5 + C)$

Prediction format per box: $(x, y, w, h, \text{confidence})$

  • $(x, y)$: center offset (0-1 relative to cell)
  • $(w, h)$: size relative to image
  • $\text{confidence} = P(\text{object}) \times \text{IoU}$
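To make the tensor shape and the prediction format concrete, here is a small sketch. The defaults S = 7, B = 2, C = 20 and the 448 × 448 input match the commonly cited YOLO v1 setup on PASCAL VOC; the function names are assumptions of this example.

```python
def yolo_v1_output_shape(S=7, B=2, C=20):
    """Shape of the prediction tensor: S x S x (B * 5 + C)."""
    return (S, S, B * 5 + C)

def decode_cell_box(row, col, x, y, w, h, S=7, img_w=448, img_h=448):
    """Convert one cell-relative prediction to pixel (cx, cy, w, h)."""
    cx = (col + x) / S * img_w   # x, y are offsets inside the cell
    cy = (row + y) / S * img_h
    return (cx, cy, w * img_w, h * img_h)  # w, h are image-relative
```

With the defaults, each of the 49 cells emits 30 numbers: two boxes of five values each plus 20 class probabilities.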

Loss function:

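$$\mathcal{L} = \lambda_{coord} \sum \text{box loss} + \sum \text{confidence loss}_{obj} + \lambda_{noobj} \sum \text{confidence loss}_{noobj} + \sum \text{classification loss}$$

All terms are sum-squared errors. The original paper sets $\lambda_{coord} = 5$ and $\lambda_{noobj} = 0.5$, so localization errors count more and confidence errors in cells without objects count less.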

YOLO v3-v8 Improvements

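| Version | Key Innovation | mAP | FPS |
|---------|----------------|-----|-----|
| v1 | Single-shot detection | 63.4 | 45 |
| v3 | Multi-scale prediction, FPN | 57.9 | 20 |
| v5 | Anchor-based with auto-anchor | 68.9 | 140 |
| v8 | Decoupled head, anchor-free | 71.0 | 80 |

Note: these figures come from different benchmarks and hardware (for example, the v1 row is PASCAL VOC 2007 mAP, while the v3 row is COCO AP at IoU 0.5), so they are not directly comparable across rows.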

Anchor Boxes

Traditional Approach (Anchor-Based)

Define a fixed set of anchor boxes as (width, height) priors in pixels, and have the network predict offsets relative to them:

anchors = [
    (10, 13), (16, 30), (33, 23),  # Small
    (30, 61), (62, 45), (59, 119),  # Medium
    (116, 90), (156, 198), (373, 326)  # Large
]

Problem: Anchors are dataset-dependent and are typically chosen by k-means clustering over the box dimensions in the training set.
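One way to see why anchors depend on the dataset is to look at how a ground-truth box is matched to them. The sketch below compares shapes only, treating both boxes as sharing a center; this shape-only IoU is also a common similarity measure for anchor clustering. The helper names are hypothetical.

```python
def wh_iou(wh_a, wh_b):
    """IoU of two boxes that share a center, given only (w, h)."""
    inter = min(wh_a[0], wh_b[0]) * min(wh_a[1], wh_b[1])
    union = wh_a[0] * wh_a[1] + wh_b[0] * wh_b[1] - inter
    return inter / union

def best_anchor(box_wh, anchors):
    """Index of the anchor whose shape overlaps the box the most."""
    return max(range(len(anchors)), key=lambda i: wh_iou(box_wh, anchors[i]))
```

For example, `best_anchor((60, 50), [(10, 13), (62, 45), (373, 326)])` picks the middle anchor. If a dataset's objects are mostly shaped unlike every anchor, all matches have low IoU and regression becomes harder.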

Anchor-Free Approach

Predict centers and sizes directly:

CenterNet: Predict heatmap of object centers, regress sizes.

FCOS: Predict distance from point to box edges:

$$\text{prediction} = (l, t, r, b) = (x - x_{min}, y - y_{min}, x_{max} - x, y_{max} - y)$$
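The FCOS-style targets can be computed directly from this formula. In this sketch the box is assumed to be (x_min, y_min, x_max, y_max), and the `inside` flag (all four distances positive) is a simplified stand-in for the positive-sample test; the function name is illustrative.

```python
def fcos_targets(x, y, box):
    """Distances (l, t, r, b) from point (x, y) to the four box sides."""
    x_min, y_min, x_max, y_max = box
    l, t, r, b = x - x_min, y - y_min, x_max - x, y_max - y
    inside = min(l, t, r, b) > 0  # the point lies strictly inside the box
    return (l, t, r, b), inside
```

A point outside the box yields at least one negative distance, so it would not be used as a regression target for that box.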

Advantages:

  • No anchor hyperparameters
  • More flexible for unusual aspect ratios
  • Simpler training

Non-Maximum Suppression (NMS)

Post-processing to remove duplicate detections:

  1. Sort boxes by confidence
  2. Select the box with the highest confidence and add it to the output
  3. Remove the remaining boxes whose IoU with the selected box exceeds a threshold (e.g. 0.5)
  4. Repeat from step 2 until no boxes remain
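The steps above can be sketched in plain Python. Boxes are assumed to be (x1, y1, x2, y2) corners, and `iou` is a helper written for this example rather than a library function.

```python
def iou(a, b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, threshold=0.5):
    """Return indices of the boxes kept by greedy NMS."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)   # highest remaining confidence
        keep.append(best)
        # Drop every remaining box that overlaps the selected one too much.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= threshold]
    return keep
```

In practice NMS is usually run per class, so that overlapping boxes of different classes do not suppress each other.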

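Soft-NMS: Instead of removing overlapping boxes, decay their confidence with a Gaussian penalty, where $b$ is the currently selected box and $b_i$ is a box that overlaps it:

$$s_i = s_i \cdot e^{-\frac{\text{IoU}(b, b_i)^2}{\sigma}}$$

This works better in crowded scenes, where heavily overlapping boxes may belong to distinct objects.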
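A minimal sketch of Gaussian Soft-NMS follows. The `sigma` default and the `score_threshold` cutoff for discarding boxes whose score has decayed to almost nothing are assumptions of this example, as is the (x1, y1, x2, y2) box format.

```python
import math

def iou(a, b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def soft_nms(boxes, scores, sigma=0.5, score_threshold=0.001):
    """Return (kept indices in selection order, decayed scores)."""
    scores = list(scores)                  # do not mutate the caller's list
    remaining = list(range(len(boxes)))
    keep = []
    while remaining:
        best = max(remaining, key=lambda i: scores[i])
        remaining.remove(best)
        keep.append(best)
        for i in remaining:                # decay instead of deleting
            overlap = iou(boxes[best], boxes[i])
            scores[i] *= math.exp(-(overlap ** 2) / sigma)
        remaining = [i for i in remaining if scores[i] >= score_threshold]
    return keep, scores
```

A box that overlaps a stronger detection stays in the output with a reduced score, so a later confidence threshold decides whether it survives.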

Evaluation Metrics

IoU (Intersection over Union):

$$\text{IoU} = \frac{|A \cap B|}{|A \cup B|}$$

mAP (mean Average Precision):

Compute AP for each class as the area under its precision-recall curve, then average across classes.

COCO metrics: mAP@0.5:0.95, the mean of mAP over IoU thresholds from 0.5 to 0.95 in steps of 0.05
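The per-class AP computation can be sketched as below. It assumes detections are already sorted by descending confidence and already marked as true or false positives at a chosen IoU threshold; it uses all-point interpolation, which is one of several conventions. The function name and input format are assumptions of this example.

```python
def average_precision(is_tp, num_gt):
    """Area under the precision-recall curve for one class.

    is_tp: list of booleans, one per detection, sorted by descending score.
    num_gt: number of ground-truth boxes for the class.
    """
    tp = fp = 0
    precisions, recalls = [], []
    for hit in is_tp:
        tp += hit
        fp += not hit
        precisions.append(tp / (tp + fp))
        recalls.append(tp / num_gt)
    # Replace each precision with the maximum precision at any higher recall.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p   # rectangle under the envelope
        prev_recall = r
    return ap
```

mAP is then the mean of this value over classes, and the COCO-style metric repeats the whole procedure at each IoU threshold and averages the results.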

Follow-Up Questions

Q: How do you handle small object detection? A: Use FPN (Feature Pyramid Network) for multi-scale features, higher input resolution, and attention mechanisms that focus on small regions.

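Q: What is the difference between IoU loss and GIoU loss? A: IoU loss is $1 - \text{IoU}$, which gives no gradient once the boxes stop overlapping. GIoU subtracts a penalty based on the smallest box $C$ that encloses both $A$ and $B$: $\text{GIoU} = \text{IoU} - \frac{|C \setminus (A \cup B)|}{|C|}$. The GIoU loss is $1 - \text{GIoU}$, so non-overlapping boxes still receive a useful training signal.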

Q: How does NMS affect recall in crowded scenes? A: Standard NMS removes valid overlapping detections. Use Soft-NMS or learned NMS for crowded scenes.
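The GIoU definition from the second follow-up question can be checked numerically with a short sketch. Boxes are assumed to be axis-aligned (x1, y1, x2, y2) with positive area; the function name is illustrative.

```python
def giou(a, b):
    """Generalized IoU of two (x1, y1, x2, y2) boxes with positive area."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    # Smallest axis-aligned box C that encloses both a and b.
    enclose = (max(a[2], b[2]) - min(a[0], b[0])) * (max(a[3], b[3]) - min(a[1], b[1]))
    return inter / union - (enclose - union) / enclose
```

Identical boxes give 1, while two disjoint boxes give a negative value that moves toward -1 as they move apart, which is what lets the loss keep pulling separated boxes together.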
