
CNNs for Image Data: Convolution, Pooling and Architectures

Module 13: Computer Vision


Why CNNs for Images?

Fully connected networks treat images as flat vectors, destroying spatial structure. A 224×224×3 image flattened becomes 150,528 features — connecting each to a hidden layer of 1,000 neurons requires 150 million parameters just for the first layer. CNNs solve this through two key principles:

Parameter Sharing: A single filter (kernel) slides across the entire image, reusing the same weights at every spatial location. A 3×3 kernel has only 9 parameters per channel, regardless of image size.

Translation Equivariance: Because the same filter scans all positions, a feature that shifts in the input produces a correspondingly shifted response in the output, so a pattern learned at one location is recognized anywhere. The network learns what to detect, not where.

[Figure: Fully connected network (input → hidden, 150M+ parameters, no spatial awareness) vs. CNN (Conv → Pool → FC, ~1-25M parameters, spatial hierarchy preserved). Key insights: parameter sharing, translation invariance, spatial hierarchies, local connectivity.]
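The effect of reusing one filter at every position can be checked on a toy 1D signal. The sketch below is illustrative only: the helper `correlate1d`, the two-tap edge filter, and the signals are assumptions made for this example, not part of the lesson.

```python
def correlate1d(x, k):
    """Slide filter k over signal x (stride 1, no padding)."""
    n_out = len(x) - len(k) + 1
    return [sum(x[i + m] * k[m] for m in range(len(k))) for i in range(n_out)]

edge = [-1, 1]                      # hypothetical step-edge detector
signal = [0, 0, 1, 1, 1, 0, 0, 0]   # a bump starting at index 2
shifted = [0, 0, 0, 0, 1, 1, 1, 0]  # the same bump moved right by 2

resp = correlate1d(signal, edge)
resp_shifted = correlate1d(shifted, edge)

# Equivariance: shifting the input by 2 shifts the response by 2
print(resp)          # [0, 1, 0, 0, -1, 0, 0]
print(resp_shifted)  # [0, 0, 0, 1, 0, 0, -1]
```

The same two weights detect the rising and falling edges wherever the bump sits, which is the point of parameter sharing.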

The Convolution Operation

Convolution is the core building block. A small kernel (filter) slides over the input, computing element-wise multiplications and summing the results at each position.

Mathematical Definition

For a 2D input $X$ and kernel $K$, the output at position $(i, j)$ is:

$$Y[i, j] = \sum_{m=0}^{k_h-1} \sum_{n=0}^{k_w-1} X[i+m, \, j+n] \cdot K[m, n]$$

where $k_h$ and $k_w$ are the kernel height and width.

[Figure: Convolution operation with a 5×5 input, 3×3 kernel, stride 1, and valid padding, producing a 3×3 output. At each position the kernel and the input window are multiplied element-wise and summed. The kernel slides across the entire input, computing one output value per position.]
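The definition above translates almost line for line into code. This is a minimal pure-Python sketch for stride 1 and no padding; the function name `conv2d_valid` and the small input and kernel are made up for illustration.

```python
def conv2d_valid(x, k):
    """Slide kernel k over input x (stride 1, no padding) and return the output map."""
    kh, kw = len(k), len(k[0])
    out_h = len(x) - kh + 1
    out_w = len(x[0]) - kw + 1
    out = [[0] * out_w for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            # Element-wise multiply the window with the kernel, then sum
            out[i][j] = sum(
                x[i + m][j + n] * k[m][n]
                for m in range(kh)
                for n in range(kw)
            )
    return out

# Hypothetical 4×4 input and a 2×2 kernel that adds each pixel to its lower-right neighbor
x = [
    [1, 2, 3, 0],
    [0, 1, 2, 3],
    [3, 0, 1, 2],
    [2, 3, 0, 1],
]
k = [
    [1, 0],
    [0, 1],
]
y = conv2d_valid(x, k)
print(y)  # [[2, 4, 6], [0, 2, 4], [6, 0, 2]]
```

Production code would use a library routine such as `torch.nn.functional.conv2d`; the loops here only make the indexing explicit.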

Stride and Padding

Stride ($S$): How many pixels the kernel shifts at each step.

Padding ($P$): Number of zero-valued pixels added on each side of the input border.

$$O = \left\lfloor \frac{W - K + 2P}{S} \right\rfloor + 1$$

where $W$ = input size, $K$ = kernel size, $P$ = padding, $S$ = stride.

[Figure: Padding and stride effects on a 7×7 input with K=3. Valid (P=0, S=1): 7×7 → 5×5. Same (P=1, S=1): 7×7 → 7×7. Stride 2 (P=0, S=2): 7×7 → 3×3. Output formula: O = ⌊(W-K+2P)/S⌋+1.]
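The three cases in the figure follow directly from the output-size formula. A small helper (the name `conv_output_size` is chosen here for illustration) reproduces them:

```python
def conv_output_size(w, k, p=0, s=1):
    """Output length along one spatial axis: floor((W - K + 2P) / S) + 1."""
    return (w - k + 2 * p) // s + 1

valid = conv_output_size(7, 3, p=0, s=1)    # no padding
same = conv_output_size(7, 3, p=1, s=1)     # padding 1 keeps the size
strided = conv_output_size(7, 3, p=0, s=2)  # stride 2 roughly halves it

print(valid, same, strided)  # 5 7 3
```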

Multi-Channel Convolution

For an input with $C_{in}$ channels (3 for RGB), each filter has shape $K \times K \times C_{in}$. The filter spans all input channels at once and produces a single output channel:

$$Y[i, j] = \sum_{c=0}^{C_{in}-1} \sum_{m=0}^{K-1} \sum_{n=0}^{K-1} X[i+m, \, j+n, \, c] \cdot K[m, n, c] + b$$

Here $K$ in the summation limits is the kernel size, $K[m, n, c]$ is a kernel weight, and $b$ is the filter's bias. A convolution layer with $C_{out}$ filters produces $C_{out}$ channels. Total parameters: $C_{out} \times (C_{in} \times K \times K + 1)$.
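To make the parameter formula concrete, the sketch below compares a first convolution layer with the fully connected layer from the opening section. The helper names and the choice of 64 filters are assumptions for this example.

```python
def conv_params(c_in, c_out, k):
    """Weights plus one bias per filter: C_out * (C_in * K * K + 1)."""
    return c_out * (c_in * k * k + 1)

def fc_params(n_in, n_out):
    """Weights plus one bias per output neuron."""
    return n_out * (n_in + 1)

# 224×224×3 image flattened into a 1,000-unit hidden layer
fc_first = fc_params(224 * 224 * 3, 1000)

# 3×3 convolution from 3 input channels to 64 output channels
conv_first = conv_params(3, 64, 3)

print(fc_first)    # 150529000
print(conv_first)  # 1792
```

The convolution's count does not depend on the image size at all, only on the kernel size and channel counts.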


Pooling Layers

Pooling reduces spatial dimensions, which cuts computation and makes the features less sensitive to small shifts in the input (local translation invariance).

Max Pooling

$$Y[i, j] = \max_{m \in [0, P_h), \, n \in [0, P_w)} X[i \cdot S + m, \, j \cdot S + n]$$

Here $P_h \times P_w$ is the pooling window size (not the padding $P$ used earlier) and $S$ is the stride. Max pooling selects the maximum value within each pooling window, preserving the strongest feature activations.

Average Pooling

$$Y[i, j] = \frac{1}{P_h \cdot P_w} \sum_{m=0}^{P_h-1} \sum_{n=0}^{P_w-1} X[i \cdot S + m, \, j \cdot S + n]$$

Computes the mean value within each window. Global Average Pooling (GAP) averages each entire feature map to a single value, commonly replacing fully connected layers.

[Figure: Pooling a 4×4 input with a 2×2 window and stride 2. Input rows: (1, 3, 2, 4), (5, 6, 1, 2), (3, 2, 8, 1), (4, 1, 3, 5). Max pool output (4×4 → 2×2): (6, 4), (4, 8). Average pool output (4×4 → 2×2): (3.8, 2.3), (2.5, 4.3). Max pooling captures the strongest features; average pooling smooths feature maps; both reduce downstream parameters and increase invariance. Global Average Pooling (GAP): 7×7×512 → 1×1×512 = 512 values; replaces FC layers and reduces overfitting.]
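The numbers in the pooling figure can be reproduced with a few lines of pure Python. The function `pool2d` is a sketch written for this example, not a library call; the 4×4 input is the one shown in the figure.

```python
def pool2d(x, size=2, stride=2, mode="max"):
    """Apply max or average pooling with a square window."""
    out_h = (len(x) - size) // stride + 1
    out_w = (len(x[0]) - size) // stride + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            window = [
                x[i * stride + m][j * stride + n]
                for m in range(size)
                for n in range(size)
            ]
            row.append(max(window) if mode == "max" else sum(window) / len(window))
        out.append(row)
    return out

x = [
    [1, 3, 2, 4],
    [5, 6, 1, 2],
    [3, 2, 8, 1],
    [4, 1, 3, 5],
]
max_out = pool2d(x, mode="max")
avg_out = pool2d(x, mode="avg")

# Global average pooling collapses the whole map to a single value
gap = sum(sum(row) for row in x) / (len(x) * len(x[0]))

print(max_out)  # [[6, 4], [4, 8]]
print(avg_out)  # [[3.75, 2.25], [2.5, 4.25]]
print(gap)      # 3.1875
```

The figure's average-pool values (3.8, 2.3, 2.5, 4.3) are these results rounded to one decimal place.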

CNN Architecture: The Pattern

The canonical CNN follows a repeating pattern:

Architecture Diagram
[Conv → ReLU → Pool] × N  →  [FC] × M  →  Output

Convolutional blocks extract hierarchical features:

  • Early layers: Low-level features (edges, textures, colors)
  • Middle layers: Mid-level features (patterns, parts, shapes)
  • Late layers: High-level features (objects, scenes, concepts)

[Figure: CNN architecture pattern. Input image 224×224×3 → (Conv 3×3, 64 filters → ReLU → Max Pool 2×2, S=2) → 112×112×64 → (Conv 3×3, 128 filters → ReLU → Max Pool 2×2, S=2) → 56×56×128 → (Conv 3×3, 256 filters → ReLU → Max Pool 2×2, S=2) → 28×28×256 → (Conv 3×3, 512 filters → ReLU → Max Pool 2×2, S=2) → 14×14×512 → Global Avg Pool → 1×1×512 → FC 1000 → Softmax. Feature hierarchy: edges/textures → patterns/parts → objects/scenes.]
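The feature-map sizes in this pattern can be traced with the output-size formula from earlier. The sketch assumes each 3×3 convolution uses padding 1 (the figure does not state the padding, but this is the setting that keeps 224 → 224 before each pool); `conv_output_size` is an illustrative helper name.

```python
def conv_output_size(w, k, p=0, s=1):
    """floor((W - K + 2P) / S) + 1 along one spatial axis."""
    return (w - k + 2 * p) // s + 1

size, channels = 224, 3
shapes = [(size, size, channels)]
for filters in (64, 128, 256, 512):
    size = conv_output_size(size, k=3, p=1, s=1)  # 3×3 conv, assumed padding 1: size unchanged
    size = conv_output_size(size, k=2, p=0, s=2)  # 2×2 max pool, stride 2: size halved
    channels = filters                            # one output channel per filter
    shapes.append((size, size, channels))

for shape in shapes:
    print(shape)
# (224, 224, 3)
# (112, 112, 64)
# (56, 56, 128)
# (28, 28, 256)
# (14, 14, 512)
```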

Receptive Field

The receptive field is the region of the original input that influences a particular neuron. As we stack layers, the receptive field grows:

$$RF_l = RF_{l-1} + (K_l - 1) \times \prod_{i=1}^{l-1} S_i$$

where $RF_0 = 1$ (a single input pixel), $K_l$ is the kernel size of layer $l$, and $S_i$ is the stride of layer $i$.

[Figure: Receptive field growth. Layer 1: 3×3 → Layer 2: 5×5 → Layer 3: 7×7 → Layer 4: 9×9 → deep layer: full image.]
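The recurrence is easy to evaluate layer by layer. In this sketch, `receptive_field` is a hypothetical helper that takes a list of (kernel size, stride) pairs; the second layer stack mixing convolutions and pooling is an invented example.

```python
def receptive_field(layers):
    """Return the receptive field after each layer.

    layers: list of (kernel_size, stride) pairs, ordered from input to output.
    """
    rf = 1    # RF_0: a single input pixel
    jump = 1  # product of the strides of all earlier layers
    sizes = []
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
        sizes.append(rf)
    return sizes

# Four stacked 3×3, stride-1 convolutions
stacked = receptive_field([(3, 1)] * 4)

# Conv 3×3 → Pool 2×2 (stride 2) → Conv 3×3 → Pool 2×2 (stride 2)
with_pool = receptive_field([(3, 1), (2, 2), (3, 1), (2, 2)])

print(stacked)    # [3, 5, 7, 9]
print(with_pool)  # [3, 4, 8, 10]
```

Strided layers multiply the growth contributed by every later layer, which is why pooling makes the receptive field expand faster than stacking stride-1 convolutions alone.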

Famous Architectures

LeNet-5 (1998)

Pioneering CNN for handwritten digit recognition. Introduced the Conv→Pool→FC paradigm.

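| Layer | Output | Kernel | Filters | Parameters |
|-------|--------|--------|---------|------------|
| Conv1 | 28×28×6 | 5×5 | 6 | 156 |
| Pool1 | 14×14×6 | 2×2 | - | 0 |
| Conv2 | 10×10×16 | 5×5 | 16 | 1,516 |
| Pool2 | 5×5×16 | 2×2 | - | 0 |
| FC1 | 120 | - | - | 48,120 |
| FC2 | 84 | - | - | 10,164 |
| FC3 | 10 | - | - | 850 |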

Total: ~60K parameters
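The table's parameter counts can be checked with the formula $C_{out} \times (C_{in} \times K \times K + 1)$. One caveat: a fully connected Conv2 would have 2,416 parameters, so the table's 1,516 is taken here to reflect the sparse connection scheme of the original LeNet-5, in which most output maps see only 3 or 4 of the 6 input maps. That scheme is an assumption brought in from the original paper; the table itself does not spell it out.

```python
def conv_params(c_in, c_out, k):
    """C_out * (C_in * K * K + 1), biases included."""
    return c_out * (c_in * k * k + 1)

def fc_params(n_in, n_out):
    """Weights plus one bias per output unit."""
    return n_out * (n_in + 1)

conv1 = conv_params(1, 6, 5)        # grayscale input, six 5×5 filters
conv2_full = conv_params(6, 16, 5)  # if every filter saw all 6 input maps

# Assumed sparse scheme: 6 maps see 3 inputs, 9 maps see 4 inputs, 1 map sees all 6
conv2_sparse = 6 * (3 * 25 + 1) + 9 * (4 * 25 + 1) + 1 * (6 * 25 + 1)

fc1 = fc_params(5 * 5 * 16, 120)    # flattened 5×5×16 feature maps
fc2 = fc_params(120, 84)
fc3 = fc_params(84, 10)

total = conv1 + conv2_sparse + fc1 + fc2 + fc3
print(conv1, conv2_full, conv2_sparse)  # 156 2416 1516
print(fc1, fc2, fc3)                    # 48120 10164 850
print(total)                            # 60806
```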

AlexNet (2012)

Won the ImageNet ILSVRC 2012 classification challenge by a large margin. Key innovations: ReLU activation, dropout, data augmentation, GPU training.

  • 5 conv layers + 3 FC layers
  • ~60M parameters
  • ReLU instead of tanh → faster training
  • Overlapping pooling (3×3, S=2)

VGG (2014)

Demonstrated that depth matters. Used only 3×3 convolutions with stride 1 and padding 1.

Key insight: Two stacked 3×3 conv layers have the same receptive field as one 5×5 layer but with fewer parameters. With $C$ input and output channels and ignoring biases:

$$2 \times (3 \times 3 \times C^2) = 18C^2 \quad < \quad 5 \times 5 \times C^2 = 25C^2$$

VGG-16: 13 conv layers + 3 FC layers, ~138M parameters in total.
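The saving from stacking small kernels can be checked numerically. The sketch below ignores biases, as the inequality above does, and uses C = 64 as an arbitrary example width; the helper names are illustrative.

```python
def stacked_3x3_params(c, n_layers):
    """Weights in n stacked 3×3 conv layers with c input and c output channels."""
    return n_layers * 3 * 3 * c * c

def single_kxk_params(c, k):
    """Weights in one k×k conv layer with c input and c output channels."""
    return k * k * c * c

c = 64
two_3x3 = stacked_3x3_params(c, 2)    # receptive field 5×5
one_5x5 = single_kxk_params(c, 5)
three_3x3 = stacked_3x3_params(c, 3)  # receptive field 7×7
one_7x7 = single_kxk_params(c, 7)

print(two_3x3, one_5x5)    # 73728 102400
print(three_3x3, one_7x7)  # 110592 200704
```

The stacked version also inserts a nonlinearity between the layers, so it gains expressiveness while using fewer weights.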

ResNet (2015)

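Introduced skip connections to solve the degradation problem: beyond a certain depth, plain networks show higher training error than shallower ones, even though a deeper network could in principle match the shallower one by setting its extra layers to the identity. Skip connections make that identity mapping easy to learn.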

[Figure: ResNet skip connection (residual block). Input x → Conv 3×3 → BN → ReLU → Conv 3×3 → BN → add the skip (identity) connection → ReLU → output F(x) + x. Why skip connections: they address vanishing gradients, enable networks of 152+ layers, allow an identity mapping when F(x) → 0, make residuals easier to learn, and resolve the degradation problem.]

Architecture Comparison

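| Architecture | Year | Depth | Parameters | Top-5 Error | Key Innovation |
|--------------|------|-------|------------|-------------|----------------|
| LeNet-5 | 1998 | 7 | 60K | - | First practical CNN |
| AlexNet | 2012 | 8 | 60M | 16.4% | ReLU, dropout, GPU |
| VGG-16 | 2014 | 16 | 138M | 7.3% | Small 3×3 filters |
| GoogLeNet | 2014 | 22 | 6.8M | 6.7% | Inception modules |
| ResNet-50 | 2015 | 50 | 25.6M | 3.6% | Skip connections |
| EfficientNet | 2019 | - | 5.3M | 2.9% | Compound scaling |

Note: the 3.6% figure matches the ResNet ensemble that won ILSVRC 2015 rather than a single ResNet-50, and the EfficientNet row appears to pair the parameter count of EfficientNet-B0 (5.3M) with the top-5 error of the much larger EfficientNet-B7.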

EfficientNet: Compound Scaling

EfficientNet scales three dimensions jointly using a compound coefficient $\phi$:

$$\text{depth: } d = \alpha^\phi, \quad \text{width: } w = \beta^\phi, \quad \text{resolution: } r = \gamma^\phi$$

subject to $\alpha \cdot \beta^2 \cdot \gamma^2 \approx 2$, so that total FLOPs grow by roughly $2^\phi$. This achieves better accuracy-efficiency tradeoffs than scaling any single dimension.
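A short calculation shows how the multipliers behave. The constants α = 1.2, β = 1.1, γ = 1.15 are the values reported in the EfficientNet paper; the lesson itself does not state them, so treat them as an assumption of this sketch. The function name `compound_scale` is likewise illustrative.

```python
# Assumed constants (from the EfficientNet paper, not from this lesson)
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15

def compound_scale(phi):
    """Return (depth, width, resolution) multipliers for coefficient phi."""
    return ALPHA ** phi, BETA ** phi, GAMMA ** phi

# The constraint alpha * beta^2 * gamma^2 should be close to 2
constraint = ALPHA * BETA ** 2 * GAMMA ** 2

for phi in range(4):
    d, w, r = compound_scale(phi)
    flops = constraint ** phi  # approximate FLOPs multiplier
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, resolution x{r:.2f}, FLOPs x{flops:.2f}")
```

Width and resolution appear squared in the constraint because doubling either one roughly quadruples the convolution cost, while doubling depth only doubles it.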


Feature Visualization

CNNs learn interpretable feature hierarchies:

  • Layer 1: Edge detectors (Gabor-like filters), color blobs
  • Layer 2: Corners, textures, simple patterns
  • Layer 3: Object parts (eyes, wheels, textures)
  • Layer 4: Object-level features (faces, dogs, buildings)
  • Layer 5: Full objects and scenes

[Figure: Feature hierarchy visualization. Layer 1: edges → Layer 2: textures → Layer 3: parts → Layer 4: objects.]

Implementation in PyTorch

Basic CNN

import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),

            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),

            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d((1, 1))
        )
        self.classifier = nn.Linear(128, num_classes)

    def forward(self, x):
        x = self.features(x)       # (N, 3, H, W) -> (N, 128, 1, 1)
        x = torch.flatten(x, 1)    # (N, 128, 1, 1) -> (N, 128)
        return self.classifier(x)  # (N, num_classes) logits

Residual Block

class ResidualBlock(nn.Module):
    def __init__(self, channels, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, stride, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, 1, 1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

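        # Identity shortcut by default. When the main path downsamples
        # (stride != 1), a strided 1×1 conv makes the shortcut match its shape.
        self.shortcut = nn.Sequential()
        if stride != 1:
            self.shortcut = nn.Sequential(
                nn.Conv2d(channels, channels, 1, stride, bias=False),
                nn.BatchNorm2d(channels)
            )

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = out + self.shortcut(x)  # residual addition: F(x) + x
        return torch.relu(out)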

Transfer Learning

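import torch
import torch.nn as nn
import torchvision.models as models

num_classes = 10  # placeholder: set to the number of classes in your task

# Load ImageNet weights; `weights=` replaces the deprecated `pretrained=True`
# in torchvision 0.13 and later
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)

# Freeze the pretrained backbone
for param in model.parameters():
    param.requires_grad = False

# Replace the classification head; the new layer is trainable by default
model.fc = nn.Linear(model.fc.in_features, num_classes)

# Optimize only the new head
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)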

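Transfer learning strategy:

  1. Load pretrained weights (ImageNet)
  2. Freeze the pretrained layers so they act as a fixed feature extractor
  3. Replace the final layer with one sized for your task and train it
  4. Optionally unfreeze some or all pretrained layers and fine-tune with a small learning rate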
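The last step, fine-tuning with a small learning rate, is often done with per-group learning rates in the optimizer. The sketch below uses a tiny stand-in network instead of a real pretrained model so that it runs without downloading weights; the `backbone` and `head` names, layer sizes, and learning rates are all assumptions for illustration.

```python
import torch
import torch.nn as nn

# Tiny stand-in for a pretrained backbone plus a new task head (hypothetical;
# a real workflow would load a pretrained model such as a torchvision ResNet)
backbone = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d((1, 1)),
    nn.Flatten(),
)
head = nn.Linear(8, 5)
model = nn.Sequential(backbone, head)

# Phase 1: freeze the backbone and train only the head
for p in backbone.parameters():
    p.requires_grad = False
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)

# Phase 2: unfreeze the backbone and give it a much smaller learning rate
for p in backbone.parameters():
    p.requires_grad = True
optimizer = torch.optim.Adam([
    {"params": backbone.parameters(), "lr": 1e-5},
    {"params": head.parameters(), "lr": 1e-3},
])

logits = model(torch.randn(2, 3, 32, 32))
print(logits.shape)  # torch.Size([2, 5])
```

Keeping the backbone's learning rate a couple of orders of magnitude below the head's is a common heuristic for preserving pretrained features while still adapting them; the exact ratio is task dependent.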

Key Takeaways

Summary

  • CNNs exploit spatial structure through parameter sharing and local connectivity
  • Convolution extracts features; pooling provides invariance and dimensionality reduction
  • Deeper networks learn hierarchical features: edges → textures → parts → objects
  • Skip connections (ResNet) enable training of very deep networks (152+ layers)
  • Transfer learning from pretrained models is the dominant paradigm in practice
  • Output size: $O = \lfloor (W - K + 2P) / S \rfloor + 1$
