Semantic Segmentation — FCN, U-Net, DeepLab & Medical Imaging
Semantic segmentation assigns a class label to every pixel in an image, enabling fine-grained scene understanding. It is critical for autonomous driving, medical imaging, and robotics.
See our CNN Architecture tutorial for the backbone architectures used in segmentation.
Segmentation Types
Definition: Types of Segmentation
- Semantic Segmentation: Each pixel gets a class label (no distinction between instances)
- Instance Segmentation: Each instance of each class gets a separate mask
- Panoptic Segmentation: Combines semantic + instance segmentation
Image: 🚗 🚶 🚗 (two cars, one person)
Semantic: 🚗 🚶 🚗 (car, person, car labels)
Instance: 🚗¹ 🚶 🚗² (car₁, person, car₂ labels)
Panoptic: 🚗¹ 🚶 🚗² (all three distinguished)
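To make the distinction concrete, here is a small sketch of how the three outputs might be stored for a single image row containing a car, a person, and a second car. The class ids, the instance numbering, and the `class_id * 1000 + instance_id` panoptic encoding are assumptions chosen for illustration, not a fixed standard.

```python
# Hypothetical class ids, for illustration only.
BACKGROUND, CAR, PERSON = 0, 1, 2

# One image row: background, car, car, person, car, background.
semantic = [BACKGROUND, CAR, CAR, PERSON, CAR, BACKGROUND]

# Instance ids per pixel: 0 means "no instance"; the two cars are separate objects.
instance = [0, 1, 1, 1, 2, 0]  # car #1, car #1, person #1, car #2


def to_panoptic(semantic_row, instance_row, divisor=1000):
    """Fuse class id and instance id into one integer per pixel."""
    return [c * divisor + i for c, i in zip(semantic_row, instance_row)]


panoptic = to_panoptic(semantic, instance)
print(panoptic)  # [0, 1001, 1001, 2001, 1002, 0]
```

The semantic row gives both cars the same value, while the panoptic row keeps the class and also separates car #1 (1001) from car #2 (1002).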
Fully Convolutional Network (FCN)
Definition: FCN
FCN replaces fully connected layers with 1×1 convolutions, producing pixel-wise predictions:
- Use a pre-trained CNN (VGG, ResNet) as encoder
- Replace FC layers with 1×1 convolutions
- Upsample to original resolution using transposed convolution
- Output: a score map of shape $C \times H \times W$ (where $C$ is the number of classes)
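The steps above can be sketched as a toy model. This is a minimal illustration, not the original FCN: the two-layer encoder stands in for a pre-trained VGG or ResNet backbone, and the class name `TinyFCN`, the channel widths, and the 4x upsampling factor are all assumptions.

```python
import torch
import torch.nn as nn


class TinyFCN(nn.Module):
    """Toy FCN: small conv encoder, 1x1 classifier, transposed-conv upsampling."""

    def __init__(self, in_channels=3, num_classes=21):
        super().__init__()
        # Stand-in encoder (a real FCN would use a pre-trained backbone here).
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels, 16, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),  # H/2 x W/2
            nn.Conv2d(16, 32, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),  # H/4 x W/4
        )
        # The 1x1 convolution plays the role of the former fully connected classifier.
        self.classifier = nn.Conv2d(32, num_classes, kernel_size=1)
        # Transposed convolution upsamples the coarse score map by 4x.
        self.upsample = nn.ConvTranspose2d(
            num_classes, num_classes, kernel_size=8, stride=4, padding=2
        )

    def forward(self, x):
        return self.upsample(self.classifier(self.encoder(x)))


model = TinyFCN()
x = torch.randn(1, 3, 64, 64)
scores = model(x)              # per-pixel class scores, shape [1, 21, 64, 64]
labels = scores.argmax(dim=1)  # per-pixel predicted class, shape [1, 64, 64]
print(scores.shape, labels.shape)
```

Because there are no fully connected layers, the same model accepts other input sizes whose height and width are divisible by 4.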
U-Net Architecture
Definition: U-Net
U-Net is the dominant architecture for medical image segmentation:
- Encoder (contracting path): Downsamples with conv + max pool
- Decoder (expanding path): Upsamples with transposed convolution
- Skip connections: Concatenate encoder features with decoder features at each scale
Skip connections recover spatial detail lost during downsampling, enabling precise boundary localization.
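U-Net Skip Connection
$$d_l = \mathrm{Conv}\big(e_l \oplus \mathrm{Up}(d_{l+1})\big)$$
Here,
- $d_l$ = Decoder features at scale $l$
- $e_l$ = Encoder features at scale $l$
- $\mathrm{Up}(d_{l+1})$ = Upsampled features from deeper scale
- $\oplus$ = Channel-wise concatenation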
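The channel bookkeeping behind the skip connections can be traced without any framework. The sketch below assumes the common 64-128-256-512-1024 channel layout, 2x pooling, and size-preserving (padded) convolutions; the function name `trace_unet_shapes` is hypothetical.

```python
def trace_unet_shapes(size=256, widths=(64, 128, 256, 512), bottleneck=1024):
    """Return (stage, channels, spatial size) tuples for a padded U-Net."""
    trace, skips = [], []
    s = size
    # Encoder: each stage keeps the spatial size, then 2x2 max pooling halves it.
    for i, w in enumerate(widths, start=1):
        trace.append((f"enc{i}", w, s))
        skips.append((w, s))
        s //= 2
    trace.append(("bottleneck", bottleneck, s))
    # Decoder: upsample, concatenate the matching skip, then convolve.
    ch = bottleneck
    for i in range(len(widths), 0, -1):
        skip_ch, skip_size = skips[i - 1]
        s *= 2
        up_ch = ch // 2        # the transposed conv halves the channel count
        assert s == skip_size  # concatenation needs equal spatial sizes
        trace.append((f"cat{i}", up_ch + skip_ch, s))
        ch = skip_ch           # the double conv reduces channels again
        trace.append((f"dec{i}", ch, s))
    return trace


for stage, channels, spatial in trace_unet_shapes():
    print(f"{stage:<10} {channels:>5} ch  {spatial}x{spatial}")
```

Each `cat` row has twice the channels of the following `dec` row, which is why the decoder convolutions in a U-Net of this layout take 1024, 512, 256, and 128 input channels.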
💡 U-Net for Medical Imaging
U-Net excels at medical imaging because:
- Small datasets: Skip connections reuse encoder features and ease gradient flow, which helps training on limited data
- Precise boundaries: Multi-scale features capture fine details
- Efficient: Encoder-decoder structure is memory-efficient
- Data augmentation: Critical for small medical datasets; the original U-Net relied heavily on it (e.g., elastic deformations)
DeepLab Architecture
Definition: DeepLab
DeepLab uses three key innovations:
- Atrous (Dilated) Convolution: Expands receptive field without pooling
- ASPP (Atrous Spatial Pyramid Pooling): Multi-scale feature extraction
- CRF Refinement: Post-processing with a fully connected CRF for crisp boundaries (used in DeepLabv1/v2, dropped from DeepLabv3 onward)
Dilated convolution inserts zeros between kernel elements, increasing the receptive field while maintaining resolution.
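A simplified ASPP module shows both ideas at once: each branch is a dilated 3×3 convolution that keeps the spatial resolution, and the branches are fused into one multi-scale feature map. This is a sketch under assumptions, not DeepLab's exact module: the name `SimpleASPP` and the rates are illustrative, and batch normalization and the image-level pooling branch are omitted.

```python
import torch
import torch.nn as nn


class SimpleASPP(nn.Module):
    """Parallel 3x3 dilated convolutions at several rates, fused by a 1x1 conv."""

    def __init__(self, in_ch, out_ch, rates=(1, 6, 12, 18)):
        super().__init__()
        # For a 3x3 kernel, padding == dilation keeps H x W unchanged.
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r) for r in rates]
        )
        self.project = nn.Conv2d(out_ch * len(rates), out_ch, kernel_size=1)

    def forward(self, x):
        feats = [branch(x) for branch in self.branches]
        return self.project(torch.cat(feats, dim=1))


aspp = SimpleASPP(in_ch=32, out_ch=16)
x = torch.randn(1, 32, 40, 40)
y = aspp(x)
print(y.shape)  # torch.Size([1, 16, 40, 40])
```

Every branch sees the same input at a different receptive field, and none of them reduces the 40×40 resolution.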
Atrous (Dilated) Convolution
$$y[i] = \sum_{k} x[i + r \cdot k]\, w[k]$$
Here,
- $r$ = Dilation rate (spacing between kernel elements)
- $x$ = Input feature map
- $w$ = Convolution kernel
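A one-dimensional version in plain Python makes the indexing explicit. This sketch computes a "valid" convolution with no padding, so the output is shorter than the input; segmentation networks usually add padding to keep the resolution. The helper names are hypothetical.

```python
def dilated_conv1d(x, w, rate=1):
    """Valid 1-D atrous convolution: y[i] = sum_k x[i + rate * k] * w[k]."""
    span = rate * (len(w) - 1)  # distance between the first and last kernel tap
    return [
        sum(x[i + rate * k] * w[k] for k in range(len(w)))
        for i in range(len(x) - span)
    ]


def effective_kernel_size(kernel_size, rate):
    """Width of the input window seen by one output element."""
    return kernel_size + (kernel_size - 1) * (rate - 1)


x = [1, 2, 3, 4, 5, 6, 7, 8]
w = [1, 0, -1]  # simple difference kernel

print(dilated_conv1d(x, w, rate=1))  # taps at i, i+1, i+2 -> [-2, -2, -2, -2, -2, -2]
print(dilated_conv1d(x, w, rate=2))  # taps at i, i+2, i+4 -> [-4, -4, -4, -4]
print(effective_kernel_size(3, 2))   # 5
```

With rate 2 the same three weights cover a window of five inputs, which is how dilation enlarges the receptive field without adding parameters.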
Loss Functions for Segmentation
Dice Loss
Definition: Dice Loss
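Dice loss directly optimizes the overlap metric:
$$\mathcal{L}_{\text{Dice}} = 1 - \frac{2\sum_i p_i g_i}{\sum_i p_i + \sum_i g_i}$$
where $p_i$ is the predicted foreground probability and $g_i$ the ground-truth label of pixel $i$. It is robust to class imbalance and differentiable, and it is often combined with cross-entropy.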
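A small plain-Python sketch shows why an overlap-based loss copes with class imbalance better than pixel accuracy. The smoothing term of 1 and the 2-in-100 foreground ratio are assumptions for illustration.

```python
def soft_dice_loss(probs, target, smooth=1.0):
    """1 - Dice over flattened foreground probabilities and a 0/1 mask."""
    intersection = sum(p * g for p, g in zip(probs, target))
    return 1.0 - (2.0 * intersection + smooth) / (sum(probs) + sum(target) + smooth)


def pixel_accuracy(probs, target, threshold=0.5):
    """Fraction of pixels whose thresholded prediction matches the mask."""
    hits = sum((p > threshold) == bool(g) for p, g in zip(probs, target))
    return hits / len(target)


# 100 pixels, only 2 of them foreground (heavy class imbalance).
target = [1, 1] + [0] * 98
all_background = [0.0] * 100  # a model that never predicts foreground
perfect = [float(g) for g in target]

print(pixel_accuracy(all_background, target))  # 0.98, despite missing the object
print(soft_dice_loss(all_background, target))  # about 0.667
print(soft_dice_loss(perfect, target))         # 0.0
```

The all-background prediction is 98% accurate yet still receives a large Dice loss, because the loss depends only on overlap with the foreground.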
IoU Metric
Definition: Segmentation IoU (Jaccard Index)
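IoU measures the overlap between predicted and ground truth masks:
$$\mathrm{IoU} = \frac{|P \cap G|}{|P \cup G|}$$
where $P$ is the predicted mask and $G$ the ground truth mask. Mean IoU (mIoU) averages IoU across all classes and is the standard metric for segmentation evaluation.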
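For the multi-class case, a minimal plain-Python sketch of mIoU over flattened label maps might look like this. Skipping classes that appear in neither map is one common convention and an assumption here; benchmarks differ on that detail.

```python
def mean_iou(pred, target, num_classes):
    """Mean of per-class IoU over flattened integer label maps.

    Classes absent from both prediction and ground truth are skipped.
    """
    ious = []
    for c in range(num_classes):
        inter = sum(p == c and t == c for p, t in zip(pred, target))
        union = sum(p == c or t == c for p, t in zip(pred, target))
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / len(ious), ious


target = [0, 0, 0, 0, 1, 1, 2, 2]
pred = [0, 0, 0, 1, 1, 1, 2, 0]

miou, per_class = mean_iou(pred, target, num_classes=3)
print(per_class)  # class 0: 3/5, class 1: 2/3, class 2: 1/2
print(round(miou, 4))  # 0.5889
```

Each class contributes equally to the mean regardless of how many pixels it covers, so small classes are not swamped by large ones.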
Segmentation IoU
$$\mathrm{IoU} = \frac{TP}{TP + FP + FN}$$
Here,
- $TP$ = True Positives (correctly predicted foreground)
- $FP$ = False Positives (background predicted as foreground)
- $FN$ = False Negatives (foreground predicted as background)
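The same three counts also give the Dice coefficient, which makes the relationship between the two metrics explicit. A short derivation, using only the standard definitions in terms of true positives, false positives, and false negatives:

```latex
\mathrm{Dice} = \frac{2\,TP}{2\,TP + FP + FN}, \qquad
\mathrm{IoU} = \frac{TP}{TP + FP + FN}
\quad\Longrightarrow\quad
\mathrm{Dice} = \frac{2\,\mathrm{IoU}}{1 + \mathrm{IoU}}
```

For example, an IoU of 0.5 corresponds to a Dice score of about 0.667. Dice is never lower than IoU for the same prediction, so the two numbers should not be compared directly across papers.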
PyTorch Implementation
📝Example: U-Net in PyTorch
import torch
import torch.nn as nn


class DoubleConv(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.conv(x)


class UNet(nn.Module):
    def __init__(self, in_channels=3, num_classes=1):
        super().__init__()
        # Encoder
        self.enc1 = DoubleConv(in_channels, 64)
        self.enc2 = DoubleConv(64, 128)
        self.enc3 = DoubleConv(128, 256)
        self.enc4 = DoubleConv(256, 512)
        self.pool = nn.MaxPool2d(2)
        # Bottleneck
        self.bottleneck = DoubleConv(512, 1024)
        # Decoder
        self.up4 = nn.ConvTranspose2d(1024, 512, 2, stride=2)
        self.dec4 = DoubleConv(1024, 512)
        self.up3 = nn.ConvTranspose2d(512, 256, 2, stride=2)
        self.dec3 = DoubleConv(512, 256)
        self.up2 = nn.ConvTranspose2d(256, 128, 2, stride=2)
        self.dec2 = DoubleConv(256, 128)
        self.up1 = nn.ConvTranspose2d(128, 64, 2, stride=2)
        self.dec1 = DoubleConv(128, 64)
        # Output
        self.out_conv = nn.Conv2d(64, num_classes, 1)

    def forward(self, x):
        # Encoder
        e1 = self.enc1(x)
        e2 = self.enc2(self.pool(e1))
        e3 = self.enc3(self.pool(e2))
        e4 = self.enc4(self.pool(e3))
        # Bottleneck
        b = self.bottleneck(self.pool(e4))
        # Decoder with skip connections
        d4 = self.dec4(torch.cat([self.up4(b), e4], dim=1))
        d3 = self.dec3(torch.cat([self.up3(d4), e3], dim=1))
        d2 = self.dec2(torch.cat([self.up2(d3), e2], dim=1))
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))
        return self.out_conv(d1)


# Test
model = UNet(in_channels=3, num_classes=1)
x = torch.randn(2, 3, 256, 256)
out = model(x)
print(f"Input: {x.shape}, Output: {out.shape}")  # [2, 1, 256, 256]


# Dice + BCE combined loss
def dice_bce_loss(pred, target, alpha=0.5):
    bce = nn.BCEWithLogitsLoss()(pred, target)
    pred_sigmoid = torch.sigmoid(pred)
    intersection = (pred_sigmoid * target).sum()
    dice = 1 - (2. * intersection + 1) / (pred_sigmoid.sum() + target.sum() + 1)
    return alpha * bce + (1 - alpha) * dice
📝Example: Evaluation Metrics
import torch


def compute_iou(pred, target, threshold=0.5):
    """Compute IoU for binary segmentation."""
    pred_mask = (torch.sigmoid(pred) > threshold).float()
    intersection = (pred_mask * target).sum()
    union = pred_mask.sum() + target.sum() - intersection
    return (intersection / (union + 1e-8)).item()


def compute_dice(pred, target, threshold=0.5):
    """Compute Dice coefficient for binary segmentation."""
    pred_mask = (torch.sigmoid(pred) > threshold).float()
    intersection = (pred_mask * target).sum()
    return (2 * intersection / (pred_mask.sum() + target.sum() + 1e-8)).item()


# Test
pred = torch.randn(1, 1, 256, 256)
target = torch.randint(0, 2, (1, 1, 256, 256)).float()
print(f"IoU: {compute_iou(pred, target):.4f}")
print(f"Dice: {compute_dice(pred, target):.4f}")
Summary
📋Summary: Semantic Segmentation
- Semantic: Pixel-wise class labels, no instance distinction
- Instance: Distinguishes individual object instances
- Panoptic: Combines semantic + instance
- FCN: First fully convolutional architecture for segmentation
- U-Net: Encoder-decoder with skip connections, dominant for medical imaging
- DeepLab: Dilated convolution + ASPP for multi-scale features
- Loss: Combining Dice loss with BCE is a common choice
- Metric: Mean IoU (mIoU) is the primary evaluation metric
- Applications: Autonomous driving, medical imaging, satellite imagery
Practice Exercises
- Conceptual: Why do skip connections help segmentation? What information is lost during downsampling that skip connections recover?
- Coding: Implement a simple FCN with a ResNet-18 backbone for Pascal VOC segmentation (21 classes). Use pre-trained weights for the encoder.
- Experiment: Train U-Net on a medical imaging dataset (e.g., ISIC skin lesion). Compare Dice loss vs. BCE vs. combined. Which works best?
- Research: Look up DeepLabv3+ architecture. How does the encoder-decoder structure improve over DeepLabv3?
- Application: Build a panoptic segmentation pipeline using a pre-trained Mask R-CNN on COCO. How does panoptic quality differ from mIoU?