
Semantic Segmentation: U-Net, Mask R-CNN, Panoptic — Asked at Google & Meta



Premium Interview Preparation — Segmentation Mastery

🎯 The Interview Question

"Explain the difference between semantic segmentation, instance segmentation, and panoptic segmentation. How does U-Net achieve precise segmentation with skip connections? How does Mask R-CNN extend Faster R-CNN for instance segmentation? What are the modern approaches using Transformers for segmentation?"

This question tests understanding of dense prediction tasks — critical for Google (maps, photos) and Meta (AR effects).


📚 Detailed Answer

Segmentation Types

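| Type | Output | Example |
|---|---|---|
| Semantic | Class label per pixel | "This is a car" (all car pixels share one label) |
| Instance | Class and mask per object | "This is car #1, this is car #2" |
| Panoptic | Class per pixel, plus an instance ID for countable objects | Road and sky as regions; car #1 and car #2 as separate objects |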

Formally:

  • Semantic: $f: \mathbb{R}^{H \times W \times 3} \rightarrow \{1, \ldots, K\}^{H \times W}$
  • Instance: $f: \mathbb{R}^{H \times W \times 3} \rightarrow \{(\text{class}_i, \text{mask}_i)\}_{i=1}^N$
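  • Panoptic: combination of both. Every pixel gets a class label, and pixels belonging to countable objects also get an instance ID.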
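To make the three output formats concrete, here is a small NumPy sketch on a toy 4×6 scene. The class ids (0 = road, 1 = car), the array names, and the convention that stuff pixels carry instance ID 0 are assumptions chosen for illustration, not a standard encoding.

```python
import numpy as np

# Toy scene: class 0 = road (stuff), class 1 = car (thing).
semantic = np.array([
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 1],
    [0, 1, 1, 0, 1, 1],
    [0, 0, 0, 0, 0, 0],
])

# Instance segmentation: one (class, binary mask) pair per object.
car_a = np.zeros(semantic.shape, dtype=bool)
car_a[1:3, 1:3] = True
car_b = np.zeros(semantic.shape, dtype=bool)
car_b[1:3, 4:6] = True
instances = [(1, car_a), (1, car_b)]

# Panoptic: every pixel gets (class, instance_id); stuff pixels keep id 0.
instance_id = np.zeros(semantic.shape, dtype=int)
for idx, (_, mask) in enumerate(instances, start=1):
    instance_id[mask] = idx
panoptic = np.stack([semantic, instance_id], axis=-1)  # shape (H, W, 2)
```

The semantic map cannot tell the two cars apart, the instance list says nothing about the road, and the panoptic array carries both pieces of information for every pixel.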

U-Net: Encoder-Decoder with Skip Connections

Architecture

Encoder (Contracting Path):
Input → Conv→ReLU→Conv→ReLU → MaxPool
        Conv→ReLU→Conv→ReLU → MaxPool
        Conv→ReLU→Conv→ReLU → MaxPool
        Conv→ReLU→Conv→ReLU → MaxPool
        Conv→ReLU→Conv→ReLU (Bottleneck)

Decoder (Expanding Path):
UpConv → Concat(skip) → Conv→ReLU→Conv→ReLU
UpConv → Concat(skip) → Conv→ReLU→Conv→ReLU
UpConv → Concat(skip) → Conv→ReLU→Conv→ReLU
UpConv → Concat(skip) → Conv→ReLU→Conv→ReLU
Conv 1×1 → Output
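The diagram can be traced numerically. The sketch below follows channel counts and spatial sizes through a four-level U-Net. It assumes "same"-padded convolutions (the original paper used unpadded convolutions, which shrink the feature maps) and the paper's channel widths of 64 up to 1024; the function name `unet_shapes` is hypothetical.

```python
def unet_shapes(size=256, base_channels=64, depth=4):
    """Trace (channels, height/width) through a U-Net with 'same' padding."""
    encoder = []
    channels = base_channels
    for _ in range(depth):
        encoder.append((channels, size))  # double-conv output, saved as the skip
        size //= 2                        # 2x2 max pool halves the resolution
        channels *= 2                     # next level doubles the channels
    bottleneck = (channels, size)

    decoder = []
    for skip_channels, skip_size in reversed(encoder):
        size *= 2                         # UpConv doubles the resolution
        channels //= 2                    # and halves the channels
        assert (channels, size) == (skip_channels, skip_size)
        concat_channels = channels + skip_channels
        # (channels entering the double conv, channels leaving it, size)
        decoder.append((concat_channels, channels, size))
    return encoder, bottleneck, decoder


encoder, bottleneck, decoder = unet_shapes()
```

Under these assumptions each decoder level receives twice as many channels as it emits, because the concatenation stacks the upsampled map on top of a skip map of equal size.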

Skip Connections

Concatenate encoder features with decoder features:

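$$\mathbf{y}_i = \text{Conv}([\mathbf{u}_i; \mathbf{s}_i])$$

where $\mathbf{u}_i$ is the upsampled decoder feature map at level $i$, $\mathbf{s}_i$ is the encoder (skip) feature map at the same resolution, and $[\cdot\,;\cdot]$ denotes channel-wise concatenation.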

Why skip connections help:

  • Preserve spatial details lost during downsampling
  • Enable precise localization
  • Help gradient flow
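A tiny NumPy sketch illustrates the first point: pooling followed by upsampling blurs a one-pixel edge, while the concatenated skip map still holds its exact position. Nearest-neighbour upsampling stands in for the learned UpConv here, and the helper names are hypothetical.

```python
import numpy as np

def max_pool2x2(x):
    """2x2 max pooling on a (C, H, W) array with even H and W."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).max(axis=(2, 4))

def upsample2x(x):
    """Nearest-neighbour 2x upsampling, a stand-in for the learned UpConv."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

# Encoder feature map with a one-pixel-wide vertical edge at column 3.
skip = np.zeros((1, 4, 4))
skip[0, :, 3] = 1.0

pooled = max_pool2x2(skip)                   # (1, 2, 2): exact column is lost
up = upsample2x(pooled)                      # (1, 4, 4): edge covers columns 2 and 3
merged = np.concatenate([up, skip], axis=0)  # [u_i ; s_i] with shape (2, 4, 4)
```

The convolution that follows the concatenation sees both channels, so it can use the coarse context in `up` together with the precise edge location in `skip`.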

U-Net Loss Function

The original U-Net was trained with a pixel-weighted cross-entropy loss. A common modern choice combines cross-entropy with Dice loss:

$$\mathcal{L} = \mathcal{L}_{CE} + \lambda \mathcal{L}_{Dice}$$

$$\mathcal{L}_{Dice} = 1 - \frac{2\sum_i p_i g_i + \epsilon}{\sum_i p_i + \sum_i g_i + \epsilon}$$
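As a rough sketch of the combined loss, the NumPy code below handles the binary (foreground vs. background) case, where $p_i$ is the predicted foreground probability of pixel $i$ and $g_i \in \{0, 1\}$ is its ground-truth label. The binary simplification, the function names, and the default `lam=1.0` are assumptions for illustration; multi-class training would typically average Dice over classes.

```python
import numpy as np

def dice_loss(p, g, eps=1e-6):
    """Soft Dice loss for foreground probabilities p and binary targets g."""
    p, g = p.ravel(), g.ravel()
    return 1.0 - (2.0 * np.sum(p * g) + eps) / (np.sum(p) + np.sum(g) + eps)

def bce_loss(p, g, eps=1e-7):
    """Mean binary cross-entropy; probabilities are clipped for stability."""
    p = np.clip(p.ravel(), eps, 1.0 - eps)
    g = g.ravel()
    return float(-np.mean(g * np.log(p) + (1 - g) * np.log(1 - p)))

def combined_loss(p, g, lam=1.0):
    """L = L_CE + lam * L_Dice, matching the formula above."""
    return bce_loss(p, g) + lam * dice_loss(p, g)

g = np.array([[0, 1], [1, 1]], dtype=float)  # ground-truth mask
perfect = g.copy()                           # prediction equal to the target
poor = 1.0 - g                               # prediction that is wrong everywhere
```

Dice is computed from overlap over the whole mask, so it is less dominated by a large background than per-pixel cross-entropy, which is one reason the two are often paired.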

💡

U-Net was designed for medical imaging where training data is scarce. The skip connections help recover fine details, and data augmentation is crucial. Modern variants use attention gates and dense connections.

Mask R-CNN

Extends Faster R-CNN for instance segmentation:

Architecture

Input → Backbone → FPN → RPN → ROI Align
                                    ↓
                    ┌───────────────┼───────────────┐
                    ↓               ↓               ↓
              Classification   Box Regression   Mask Head
                    ↓               ↓               ↓
                 Class          Refined Box      Binary Mask

Key Innovation: ROI Align

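The earlier RoIPool operation rounds box coordinates and bin boundaries to the feature-map grid, which misaligns the extracted features with the input. ROI Align keeps the coordinates fractional and reads each sampling point by bilinear interpolation:

$$\text{ROIAlign}(r) = \sum_{i,j} w_{ij} \cdot \text{bilinear}(\mathbf{x}, (i, j))$$

Here $\mathbf{x}$ is the feature map, $(i, j)$ ranges over the sampling points inside a bin of ROI $r$, and $w_{ij}$ are the pooling weights (uniform for average pooling).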

This preserves spatial alignment, crucial for pixel-accurate masks.
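The sketch below is a minimal single-channel version of this idea: each output bin averages a few bilinearly interpolated samples taken at fractional coordinates. The coordinate convention (integer coordinates sit on feature-map cells), the default of 2×2 samples per bin, and the function names are assumptions for illustration; library implementations differ in details such as half-pixel offsets.

```python
import numpy as np

def bilinear(feat, y, x):
    """Bilinearly interpolate a 2D feature map at a fractional (y, x)."""
    h, w = feat.shape
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    return ((1 - dy) * (1 - dx) * feat[y0, x0] + (1 - dy) * dx * feat[y0, x1]
            + dy * (1 - dx) * feat[y1, x0] + dy * dx * feat[y1, x1])

def roi_align(feat, box, out_size=2, samples=2):
    """Average samples x samples bilinear reads in each output bin.

    box = (y1, x1, y2, x2) in feature-map coordinates, never rounded.
    """
    y1, x1, y2, x2 = box
    bin_h, bin_w = (y2 - y1) / out_size, (x2 - x1) / out_size
    out = np.zeros((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            vals = [bilinear(feat,
                             y1 + (i + (a + 0.5) / samples) * bin_h,
                             x1 + (j + (b + 0.5) / samples) * bin_w)
                    for a in range(samples) for b in range(samples)]
            out[i, j] = np.mean(vals)
    return out

feat = np.arange(36, dtype=float).reshape(6, 6)  # feat[y, x] = 6*y + x
pooled = roi_align(feat, box=(0.5, 1.25, 3.5, 4.25))
```

Because this test feature map is linear in `y` and `x`, each pooled value equals the map evaluated at the bin centre, which makes the result easy to check by hand. Rounding the box to integers first would shift every bin.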

Mask Head

Small FCN applied to each ROI:

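import torch
import torch.nn as nn

class MaskHead(nn.Module):
    """FCN mask branch: four 3x3 convs, one 2x deconv, then a per-class 1x1 conv."""

    def __init__(self, in_channels=256, num_classes=80):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, 256, 3, padding=1)
        self.conv2 = nn.Conv2d(256, 256, 3, padding=1)
        self.conv3 = nn.Conv2d(256, 256, 3, padding=1)
        self.conv4 = nn.Conv2d(256, 256, 3, padding=1)
        # Transposed conv doubles the spatial resolution (e.g. 14x14 -> 28x28).
        self.deconv = nn.ConvTranspose2d(256, 256, 2, stride=2)
        # One mask logit map per class.
        self.mask = nn.Conv2d(256, num_classes, 1)

    def forward(self, x):
        # x: (num_rois, in_channels, H, W) features from ROI Align
        x = torch.relu(self.conv1(x))
        x = torch.relu(self.conv2(x))
        x = torch.relu(self.conv3(x))
        x = torch.relu(self.conv4(x))
        x = torch.relu(self.deconv(x))
        # Raw logits of shape (num_rois, num_classes, 2H, 2W).
        return self.mask(x)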
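The head emits one logit map per class, so a final step is needed to get a single binary mask per ROI. In Mask R-CNN the channel of the class predicted by the classification branch is selected and passed through a per-pixel sigmoid. The NumPy sketch below shows that step on random logits; the 28×28 size (assuming 14×14 ROI features), the 0.5 threshold, and the name `select_mask` are illustrative assumptions.

```python
import numpy as np

def select_mask(mask_logits, class_id, threshold=0.5):
    """Pick the predicted class's channel, apply a sigmoid, and binarise."""
    probs = 1.0 / (1.0 + np.exp(-mask_logits[class_id]))
    return probs > threshold

rng = np.random.default_rng(0)
mask_logits = rng.normal(size=(80, 28, 28))  # one ROI: (num_classes, 28, 28)
binary_mask = select_mask(mask_logits, class_id=17)
```

Since each class has its own sigmoid channel, masks do not compete across classes; the classification branch alone decides which channel is used.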

Panoptic Segmentation

Combines semantic and instance segmentation:

$$\text{Panoptic} = \sum_{i \in \text{stuff}} \text{semantic}_i + \sum_{j \in \text{things}} \text{instance}_j$$

  • Stuff classes: amorphous regions (sky, grass, road)
  • Thing classes: countable objects (person, car, dog)

Panoptic FPN

Unified architecture for both tasks:

Backbone → FPN → Semantic Head (per-pixel class)
               → Instance Head (Mask R-CNN)
               → Panoptic Fusion

Fusion algorithm:

  1. Predict semantic segmentation
  2. Predict instance segmentation
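  3. Where instance masks overlap, give the contested pixels to the instance with the higher confidence score
  4. Fill pixels not claimed by any instance with the stuff class from the semantic prediction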
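The fusion steps can be sketched in NumPy as follows. This is a simplified heuristic under stated assumptions: the function name `panoptic_fuse`, the `(score, class_id, mask)` tuple layout, and the use of id 0 for "no instance" and class -1 for unlabelled pixels are all illustrative, and practical pipelines usually add score and area thresholds that are left out here.

```python
import numpy as np

def panoptic_fuse(semantic, instances, stuff_classes):
    """Fuse a semantic map with scored instance masks.

    semantic:  (H, W) int array of per-pixel class ids
    instances: list of (score, class_id, bool mask of shape (H, W))
    Returns (class_map, id_map); id 0 means no instance, class -1 means void.
    """
    class_map = np.full(semantic.shape, -1)
    id_map = np.zeros(semantic.shape, dtype=int)
    # Higher-confidence instances claim their pixels first.
    ordered = sorted(instances, key=lambda inst: inst[0], reverse=True)
    for next_id, (_, cls, mask) in enumerate(ordered, start=1):
        free = mask & (id_map == 0)
        class_map[free] = cls
        id_map[free] = next_id
    # Unclaimed pixels take the semantic label when it is a stuff class.
    fill = (id_map == 0) & np.isin(semantic, list(stuff_classes))
    class_map[fill] = semantic[fill]
    return class_map, id_map

# Toy example: class 0 = road (stuff), class 1 = car (thing).
semantic = np.array([[0, 0, 0, 0],
                     [1, 1, 1, 0]])
mask_a = np.array([[False, False, False, False],
                   [True, True, False, False]])
mask_b = np.array([[False, False, False, False],
                   [False, True, True, False]])
instances = [(0.6, 1, mask_b), (0.9, 1, mask_a)]
class_map, id_map = panoptic_fuse(semantic, instances, stuff_classes={0})
```

In the toy example the two car masks overlap on one pixel; the instance with score 0.9 keeps it, and the lower-scoring instance retains only its uncontested pixel.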

Transformer-Based Segmentation

SegFormer

Encoder-decoder with Transformer:

  • Encoder: Hierarchical Transformer (like Swin)
  • Decoder: Lightweight MLP that aggregates multi-scale features

Advantages:

  • Global context (attention captures long-range dependencies)
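  • No positional encoding needed (a 3×3 convolution inside each feed-forward block supplies position information), so test resolutions can differ from training without interpolating position embeddings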
  • Efficient inference

MaskFormer

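Reframes segmentation as set prediction: instead of classifying each pixel, the model outputs a fixed-size set of $N$ binary masks, each with one class prediction:

$$\text{Prediction} = \text{mask}_i \cdot \text{class}_i$$

During training, Hungarian (bipartite) matching assigns predictions to ground-truth segments; unmatched predictions are trained to predict a "no object" class.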
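The matching step can be illustrated with a toy cost matrix. The sketch below finds the minimum-cost one-to-one assignment by brute force over permutations, which is only a stand-in for the Hungarian algorithm (the Hungarian algorithm gives the same result in polynomial time). The cost values and the function name `match_predictions` are made up for illustration; in practice the cost would combine classification and mask terms.

```python
from itertools import permutations

def match_predictions(cost):
    """Return (assignment, total) for the cheapest one-to-one matching.

    cost[i][j] is the cost of matching prediction i to ground truth j.
    assignment[j] is the prediction index matched to ground truth j.
    Assumes at least as many predictions as ground-truth segments.
    """
    n_pred, n_gt = len(cost), len(cost[0])
    best, best_total = None, float("inf")
    for perm in permutations(range(n_pred), n_gt):
        total = sum(cost[perm[j]][j] for j in range(n_gt))
        if total < best_total:
            best, best_total = perm, total
    return list(best), best_total

# Three predictions (rows), two ground-truth segments (columns).
cost = [
    [0.9, 0.2],
    [0.1, 0.8],
    [0.7, 0.6],
]
assignment, total = match_predictions(cost)
```

Here prediction 1 is matched to ground truth 0 and prediction 0 to ground truth 1; prediction 2 stays unmatched and would be supervised as "no object".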

Evaluation Metrics

IoU (Jaccard Index):

$$\text{IoU} = \frac{|A \cap B|}{|A \cup B|}$$

Dice Coefficient:

$$\text{Dice} = \frac{2|A \cap B|}{|A| + |B|}$$

Panoptic Quality (PQ):

$$PQ = \underbrace{\frac{\sum_{(p,g) \in TP} \text{IoU}(p,g)}{|TP|}}_{\text{Segmentation Quality}} \times \underbrace{\frac{|TP|}{|TP| + \frac{1}{2}|FP| + \frac{1}{2}|FN|}}_{\text{Recognition Quality}}$$
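A short NumPy sketch of the three metrics on boolean masks follows. It assumes the usual PQ matching rule that a predicted segment and a ground-truth segment form a true positive when their IoU exceeds 0.5, and it handles a single class; the helper names and the 1D toy masks are illustrative.

```python
import numpy as np

def iou(a, b):
    """Intersection over union of two boolean masks."""
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union else 0.0

def dice(a, b):
    """Dice coefficient of two boolean masks."""
    total = a.sum() + b.sum()
    return 2 * np.logical_and(a, b).sum() / total if total else 0.0

def panoptic_quality(preds, gts):
    """PQ for one class; a pair is a true positive when IoU > 0.5."""
    matched_gt, tp_ious = set(), []
    for p in preds:
        for k, g in enumerate(gts):
            if k not in matched_gt and iou(p, g) > 0.5:
                matched_gt.add(k)
                tp_ious.append(iou(p, g))
                break
    tp = len(tp_ious)
    fp, fn = len(preds) - tp, len(gts) - tp
    denom = tp + 0.5 * fp + 0.5 * fn
    return sum(tp_ious) / denom if denom else 0.0

def span(lo, hi, n=8):
    """Boolean 1D mask of length n that is True on [lo, hi)."""
    m = np.zeros(n, dtype=bool)
    m[lo:hi] = True
    return m

gt1, gt2 = span(0, 4), span(4, 8)
pred1, pred2 = span(0, 3), span(7, 8)
pq = panoptic_quality([pred1, pred2], [gt1, gt2])
```

In this example `pred1` matches `gt1` with IoU 0.75, while `pred2` overlaps `gt2` too little to count, giving one true positive, one false positive, and one false negative. Segmentation quality is 0.75, recognition quality is 0.5, and PQ is their product.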

Follow-Up Questions

Q: How do you handle class imbalance in segmentation?
A: Use weighted cross-entropy, Dice loss, or focal loss. Apply class-weighted sampling during training.

Q: What is the difference between transposed convolution and bilinear upsampling?
A: Transposed convolution (deconvolution) learns upsampling; bilinear is fixed. Transposed can cause checkerboard artifacts; bilinear is smoother.

Q: How do Transformer-based segmentors compare to CNN-based?
A: Transformers capture global context better but need more data. CNNs are more efficient for local patterns. Hybrid models often work best.
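On the checkerboard artifacts mentioned in the upsampling answer: one way to see where they come from is to count how many kernel copies land on each output position of a transposed convolution. The 1D NumPy sketch below uses all-ones inputs and kernels so the output is exactly that count; the function name and sizes are chosen for illustration.

```python
import numpy as np

def transposed_conv1d(x, kernel, stride):
    """1D transposed convolution: each input scatters a scaled kernel copy."""
    out = np.zeros((len(x) - 1) * stride + len(kernel))
    for i, value in enumerate(x):
        out[i * stride:i * stride + len(kernel)] += value * kernel
    return out

x = np.ones(4)
uneven = transposed_conv1d(x, np.ones(3), stride=2)  # kernel size not divisible by stride
even = transposed_conv1d(x, np.ones(4), stride=2)    # kernel size divisible by stride
```

With kernel size 3 and stride 2 the interior alternates between one and two contributions per position, which is the 1D analogue of a checkerboard. With kernel size 4 the interior overlap is uniform. Choosing a kernel size divisible by the stride, or using fixed upsampling followed by a convolution, are common ways to reduce the effect.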
