Semantic Segmentation — FCN, U-Net, DeepLab & Medical Imaging

Semantic segmentation assigns a class label to every pixel in an image, enabling fine-grained scene understanding. It is critical for autonomous driving, medical imaging, and robotics.

See our CNN Architecture tutorial for the backbone architectures used in segmentation.


Segmentation Types

Definition: Types of Segmentation

  1. Semantic Segmentation: Each pixel gets a class label; instances of the same class are not distinguished
  2. Instance Segmentation: Each object instance gets its own mask, even when several instances share a class
  3. Panoptic Segmentation: Combines both: every pixel gets a class label, and pixels belonging to countable objects also get an instance ID

Diagram: Segmentation Types
Image:          Semantic:         Instance:         Panoptic:
🚗 🚶 🚗       🚗 🚶 🚗          🚗¹ 🚶 🚗²         🚗¹ 🚶 🚗²
 (two cars,     (car, person,     (car₁, person,     (all three
  one person)    car labels)       car₂ labels)       distinguished)
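
To make the distinction concrete, here is a minimal sketch of how the three label types might be stored for a toy one-row image containing two cars and one person. The class IDs (0 = background, 1 = car, 2 = person) and the `class * 1000 + instance` encoding are illustrative assumptions, not something the lesson prescribes.

```python
import torch

# Toy 1x6 "image": car, car, person, background, car, car
# Hypothetical class IDs: 0 = background, 1 = car, 2 = person
semantic = torch.tensor([[1, 1, 2, 0, 1, 1]])

# Instance IDs: 0 = no object, otherwise one unique ID per object
instance = torch.tensor([[1, 1, 2, 0, 3, 3]])

# Panoptic label: class and instance together (assumed encoding)
panoptic = semantic * 1000 + instance

# Semantic labels merge both cars into one "car" region
print(semantic.unique().tolist())   # [0, 1, 2]
# Panoptic labels keep the class and separate the two cars
print(panoptic.unique().tolist())   # [0, 1001, 1003, 2002]
```

The semantic map cannot tell the two cars apart, while the panoptic labels 1001 and 1003 share the class (car) but differ in instance.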

Fully Convolutional Network (FCN)

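Definition: FCN

FCN converts the fully connected layers of a classification CNN into convolutional layers (commonly described as 1×1 convolutions), so the network outputs a spatial map of pixel-wise predictions instead of a single label: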

  1. Use a pre-trained CNN (VGG, ResNet) as encoder
  2. Replace FC layers with 1×1 convolutions
  3. Upsample to original resolution using transposed convolution
  4. Output: $H \times W \times C$ (where $C$ is the number of classes)

For a transposed convolution (with no output padding and dilation 1), the output size along each spatial dimension is:

$$\text{Output size} = (\text{Input size} - 1) \times \text{Stride} - 2 \times \text{Padding} + \text{Kernel size}$$
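
The steps and the output-size formula above can be sketched with a toy FCN-style head. This is a minimal sketch under stated assumptions: the three-layer `encoder` is a stand-in for a pre-trained backbone, and `num_classes = 21` and the kernel/stride/padding values are illustrative choices, not values from the lesson.

```python
import torch
import torch.nn as nn

def transposed_conv_out(size, kernel, stride, padding):
    """Output-size formula for a transposed convolution
    (no output_padding, dilation 1)."""
    return (size - 1) * stride - 2 * padding + kernel

num_classes = 21  # illustrative assumption

# Toy stand-in for a pre-trained encoder: three stride-2 convs, 8x downsampling
encoder = nn.Sequential(
    nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
)
# 1x1 convolution scores every spatial location for every class
score = nn.Conv2d(64, num_classes, kernel_size=1)
# Transposed convolution upsamples the coarse score map by 8x
upsample = nn.ConvTranspose2d(num_classes, num_classes,
                              kernel_size=16, stride=8, padding=4)

x = torch.randn(1, 3, 64, 64)
feats = encoder(x)                 # shape (1, 64, 8, 8)
logits = upsample(score(feats))    # shape (1, 21, 64, 64)

expected = transposed_conv_out(feats.shape[-1], kernel=16, stride=8, padding=4)
print(logits.shape, expected)      # spatial size 64 matches the formula
```

With an 8×8 feature map the formula gives (8 − 1) × 8 − 2 × 4 + 16 = 64, which restores the input resolution.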

U-Net Architecture

Definition: U-Net

U-Net is one of the most widely used architectures for medical image segmentation. It has three main components:

  • Encoder (contracting path): Downsamples with conv + max pool
  • Decoder (expanding path): Upsamples with transposed convolution
  • Skip connections: Concatenate encoder features with decoder features at each scale

Skip connections recover spatial detail lost during downsampling, enabling precise boundary localization.

U-Net Skip Connection

$$\mathbf{h}^{(l)}_{\text{decoder}} = \text{Conv}\left(\text{Concat}\left(\mathbf{h}^{(l)}_{\text{encoder}}, \mathbf{h}^{(l+1)}_{\text{up}}\right)\right)$$

Here,

  • $\mathbf{h}^{(l)}_{\text{encoder}}$ = Encoder features at scale $l$
  • $\mathbf{h}^{(l+1)}_{\text{up}}$ = Upsampled features from the deeper scale $l+1$
  • $\text{Concat}$ = Channel-wise concatenation
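
A single skip connection can be traced shape by shape. The sketch below follows the equation above for one decoder level; the channel counts and spatial sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Encoder features at scale l and decoder features from the deeper scale l+1
h_encoder = torch.randn(1, 64, 128, 128)
h_deeper = torch.randn(1, 128, 64, 64)

# Upsample the deeper features to the encoder's resolution
up = nn.ConvTranspose2d(128, 64, kernel_size=2, stride=2)
h_up = up(h_deeper)                           # (1, 64, 128, 128)

# Channel-wise concatenation doubles the channel count
merged = torch.cat([h_encoder, h_up], dim=1)  # (1, 128, 128, 128)

# A convolution fuses the two sources back to 64 channels
conv = nn.Conv2d(128, 64, kernel_size=3, padding=1)
h_decoder = conv(merged)                      # (1, 64, 128, 128)

print(h_up.shape, merged.shape, h_decoder.shape)
```

The concatenation only works when both tensors share the same height and width, which is why the upsampling step must exactly undo the corresponding pooling step.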

💡 U-Net for Medical Imaging

U-Net excels at medical imaging because:

  1. Small datasets: Trains well from relatively few annotated images, since skip connections reuse encoder features and ease gradient flow
  2. Precise boundaries: Multi-scale features capture fine details
  3. Efficient: The fully convolutional design labels a whole image (or tile) in a single forward pass
  4. Data augmentation: Critical for small medical datasets

DeepLab Architecture

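Definition: DeepLab

The DeepLab family is built around three key ideas:

  1. Atrous (Dilated) Convolution: Expands the receptive field without pooling
  2. ASPP (Atrous Spatial Pyramid Pooling): Multi-scale feature extraction using parallel dilated convolutions with different rates
  3. CRF Refinement: Post-processing for crisp boundaries (used in the early DeepLab versions and dropped from DeepLabv3 onward)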

Dilated convolution inserts zeros between kernel elements, increasing the receptive field while maintaining resolution.
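
As a quick check of this claim, the sketch below applies a 3×3 convolution with several dilation rates using PyTorch's `dilation` argument. The helper `effective_kernel` is a hypothetical name introduced here; it computes the span of the dilated kernel, k + (k − 1)(r − 1).

```python
import torch
import torch.nn as nn

def effective_kernel(k, r):
    """Span covered by a k x k kernel with dilation rate r."""
    return k + (k - 1) * (r - 1)

x = torch.randn(1, 1, 32, 32)
results = []
for r in (1, 2, 4):
    # padding = r keeps the output the same size as the input for a 3x3 kernel
    conv = nn.Conv2d(1, 1, kernel_size=3, dilation=r, padding=r)
    y = conv(x)
    results.append((r, effective_kernel(3, r), tuple(y.shape[-2:])))
    print(results[-1])
# (1, 3, (32, 32))
# (2, 5, (32, 32))
# (4, 9, (32, 32))
```

The kernel still has nine weights in every case, but it covers a 3×3, 5×5, or 9×9 window while the feature map stays at 32×32.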

Atrous (Dilated) Convolution

$$(\mathbf{I} *_{r} \mathbf{K})(i,j) = \sum_m \sum_n \mathbf{I}(i + rm, j + rn) \cdot \mathbf{K}(m,n)$$

Here,

  • $r$ = Dilation rate (spacing between kernel elements)
  • $\mathbf{I}$ = Input feature map
  • $\mathbf{K}$ = Convolution kernel
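
Building on the dilated convolution defined above, ASPP can be sketched as several dilated branches run in parallel and then fused. This is a deliberately simplified sketch: the class name `MiniASPP`, the rates, and the channel counts are assumptions for illustration, and the published DeepLab modules include further components (such as normalization layers and an image-level pooling branch) that are omitted here.

```python
import torch
import torch.nn as nn

class MiniASPP(nn.Module):
    """Simplified ASPP: parallel 3x3 dilated convs, concatenated and fused."""

    def __init__(self, in_ch, out_ch, rates=(1, 6, 12, 18)):
        super().__init__()
        # One branch per dilation rate; padding = rate preserves resolution
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=r, dilation=r)
            for r in rates
        ])
        # 1x1 conv fuses the concatenated multi-scale features
        self.project = nn.Conv2d(out_ch * len(rates), out_ch, kernel_size=1)

    def forward(self, x):
        feats = [branch(x) for branch in self.branches]
        return self.project(torch.cat(feats, dim=1))

aspp = MiniASPP(in_ch=256, out_ch=64)
x = torch.randn(1, 256, 33, 33)
out = aspp(x)
print(out.shape)  # torch.Size([1, 64, 33, 33])
```

Each branch sees the same feature map at a different receptive field, so the fused output mixes context from several scales without changing the spatial resolution.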

Loss Functions for Segmentation

Dice Loss

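Definition: Dice Loss

Dice loss optimizes a differentiable (soft) version of the Dice overlap coefficient:

$$\mathcal{L}_{\text{Dice}} = 1 - \frac{2 \sum_i p_i g_i + \epsilon}{\sum_i p_i + \sum_i g_i + \epsilon}$$

Here, $p_i$ is the predicted foreground probability for pixel $i$, $g_i \in \{0, 1\}$ is the ground truth label, and $\epsilon$ is a small smoothing constant that avoids division by zero. Because the loss is normalized by the total predicted and ground truth foreground, it is less sensitive to class imbalance than pixel-wise cross-entropy. It is often combined with cross-entropy in practice.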
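
A small worked example shows why imbalance matters. The sketch below uses a synthetic mask with 1% foreground (an assumption chosen for illustration) and compares pixel accuracy with the Dice loss for a model that predicts background everywhere.

```python
import torch

def soft_dice_loss(probs, target, eps=1.0):
    """Dice loss as in the formula above; probs are foreground probabilities."""
    intersection = (probs * target).sum()
    return 1 - (2 * intersection + eps) / (probs.sum() + target.sum() + eps)

# Synthetic 100x100 mask with a 10x10 foreground square (1% of pixels)
target = torch.zeros(100, 100)
target[:10, :10] = 1.0

all_background = torch.zeros(100, 100)   # predicts no foreground at all
perfect = target.clone()                 # predicts the mask exactly

accuracy = (all_background == target).float().mean().item()
loss_bg = soft_dice_loss(all_background, target).item()
loss_perfect = soft_dice_loss(perfect, target).item()

print(f"Pixel accuracy (all background): {accuracy:.4f}")   # 0.9900
print(f"Dice loss (all background):      {loss_bg:.4f}")    # 0.9901
print(f"Dice loss (perfect):             {loss_perfect:.4f}")  # 0.0000
```

The all-background prediction scores 99% pixel accuracy yet receives a Dice loss close to 1, so the loss still pushes the model to find the small foreground region.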

IoU Metric

Definition: Segmentation IoU (Jaccard Index)

IoU measures the overlap between the predicted mask $P$ and the ground truth mask $G$:

$$\text{IoU} = \frac{|P \cap G|}{|P \cup G|} = \frac{\text{TP}}{\text{TP} + \text{FP} + \text{FN}}$$

Mean IoU (mIoU) averages IoU across all classes. This is the standard metric for segmentation evaluation.

Segmentation IoU

$$\text{IoU} = \frac{\text{TP}}{\text{TP} + \text{FP} + \text{FN}}$$

Here,

  • $\text{TP}$ = True Positives (correctly predicted foreground)
  • $\text{FP}$ = False Positives (background predicted as foreground)
  • $\text{FN}$ = False Negatives (foreground predicted as background)
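
For the multi-class case, mIoU can be computed by applying the formula above to each class in turn. The sketch below does this through a confusion matrix; the function name `mean_iou` and the choice to skip classes that appear in neither the prediction nor the ground truth are assumptions, and benchmark toolkits may handle such cases differently.

```python
import torch

def mean_iou(pred, target, num_classes):
    """Mean IoU over classes for integer class maps of the same shape."""
    # Confusion matrix: rows = ground truth class, columns = predicted class
    idx = target.flatten() * num_classes + pred.flatten()
    cm = torch.bincount(idx, minlength=num_classes ** 2)
    cm = cm.reshape(num_classes, num_classes)

    tp = cm.diag().float()
    fp = cm.sum(dim=0).float() - tp   # predicted as class c but wrong
    fn = cm.sum(dim=1).float() - tp   # class c pixels that were missed
    denom = tp + fp + fn

    present = denom > 0               # skip classes absent from both maps
    iou = tp[present] / denom[present]
    return iou.mean().item(), iou

target = torch.tensor([[0, 0, 1, 1],
                       [0, 2, 2, 1]])
pred = torch.tensor([[0, 0, 1, 0],
                     [0, 2, 1, 1]])

miou, per_class = mean_iou(pred, target, num_classes=3)
print(per_class.tolist())     # [0.75, 0.5, 0.5]
print(f"mIoU: {miou:.4f}")    # mIoU: 0.5833
```

Averaging per class rather than per pixel means a rare class counts as much as a frequent one in the final score.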

PyTorch Implementation

📝Example: U-Net in PyTorch

import torch
import torch.nn as nn

class DoubleConv(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.conv(x)

class UNet(nn.Module):
    def __init__(self, in_channels=3, num_classes=1):
        super().__init__()
        # Encoder
        self.enc1 = DoubleConv(in_channels, 64)
        self.enc2 = DoubleConv(64, 128)
        self.enc3 = DoubleConv(128, 256)
        self.enc4 = DoubleConv(256, 512)

        self.pool = nn.MaxPool2d(2)

        # Bottleneck
        self.bottleneck = DoubleConv(512, 1024)

        # Decoder
        self.up4 = nn.ConvTranspose2d(1024, 512, 2, stride=2)
        self.dec4 = DoubleConv(1024, 512)
        self.up3 = nn.ConvTranspose2d(512, 256, 2, stride=2)
        self.dec3 = DoubleConv(512, 256)
        self.up2 = nn.ConvTranspose2d(256, 128, 2, stride=2)
        self.dec2 = DoubleConv(256, 128)
        self.up1 = nn.ConvTranspose2d(128, 64, 2, stride=2)
        self.dec1 = DoubleConv(128, 64)

        # Output
        self.out_conv = nn.Conv2d(64, num_classes, 1)

    def forward(self, x):
        # Encoder
        e1 = self.enc1(x)
        e2 = self.enc2(self.pool(e1))
        e3 = self.enc3(self.pool(e2))
        e4 = self.enc4(self.pool(e3))

        # Bottleneck
        b = self.bottleneck(self.pool(e4))

        # Decoder with skip connections
        d4 = self.dec4(torch.cat([self.up4(b), e4], dim=1))
        d3 = self.dec3(torch.cat([self.up3(d4), e3], dim=1))
        d2 = self.dec2(torch.cat([self.up2(d3), e2], dim=1))
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))

        return self.out_conv(d1)

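# Test (height and width must be divisible by 16 so the four pooling
# steps and their matching upsampling steps line up at the skip connections)
model = UNet(in_channels=3, num_classes=1)
x = torch.randn(2, 3, 256, 256)
out = model(x)
print(f"Input: {x.shape}, Output: {out.shape}")  # Output: torch.Size([2, 1, 256, 256])

# Dice + BCE combined loss
def dice_bce_loss(pred, target, alpha=0.5):
    """Weighted sum of BCE and soft Dice loss.

    pred: raw logits; target: float mask of 0s and 1s with the same shape.
    """
    bce = nn.BCEWithLogitsLoss()(pred, target)
    pred_sigmoid = torch.sigmoid(pred)
    intersection = (pred_sigmoid * target).sum()
    # Smoothing constant of 1 keeps the ratio defined for empty masks
    dice = 1 - (2. * intersection + 1) / (pred_sigmoid.sum() + target.sum() + 1)
    return alpha * bce + (1 - alpha) * dice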

📝Example: Evaluation Metrics

import torch

def compute_iou(pred, target, threshold=0.5):
    """Compute IoU for binary segmentation."""
    pred_mask = (torch.sigmoid(pred) > threshold).float()
    intersection = (pred_mask * target).sum()
    union = pred_mask.sum() + target.sum() - intersection
    return (intersection / (union + 1e-8)).item()

def compute_dice(pred, target, threshold=0.5):
    """Compute Dice coefficient for binary segmentation."""
    pred_mask = (torch.sigmoid(pred) > threshold).float()
    intersection = (pred_mask * target).sum()
    return (2 * intersection / (pred_mask.sum() + target.sum() + 1e-8)).item()

# Test
pred = torch.randn(1, 1, 256, 256)
target = torch.randint(0, 2, (1, 1, 256, 256)).float()

print(f"IoU: {compute_iou(pred, target):.4f}")
print(f"Dice: {compute_dice(pred, target):.4f}")

Summary

📋Summary: Semantic Segmentation

  • Semantic: Pixel-wise class labels, no instance distinction
  • Instance: Distinguishes individual object instances
  • Panoptic: Combines semantic + instance
  • FCN: First fully convolutional architecture for segmentation
  • U-Net: Encoder-decoder with skip connections, widely used for medical imaging
  • DeepLab: Dilated convolution + ASPP for multi-scale features
  • Loss: Dice loss + BCE is a common combination for binary masks
  • Metric: Mean IoU (mIoU) is the primary evaluation metric
  • Applications: Autonomous driving, medical imaging, satellite imagery

Practice Exercises

  1. Conceptual: Why do skip connections help segmentation? What information is lost during downsampling that skip connections recover?

  2. Coding: Implement a simple FCN with a ResNet-18 backbone for Pascal VOC segmentation (21 classes). Use pre-trained weights for the encoder.

  3. Experiment: Train U-Net on a medical imaging dataset (e.g., ISIC skin lesion). Compare Dice loss vs. BCE vs. combined. Which works best?

  4. Research: Look up DeepLabv3+ architecture. How does the encoder-decoder structure improve over DeepLabv3?

  5. Application: Build a panoptic segmentation pipeline on COCO by combining a pre-trained Mask R-CNN (instance masks) with a semantic segmentation model (background classes). How does panoptic quality differ from mIoU?
