CNNs for Image Data: Convolution, Pooling and Architectures

Why CNNs for Images?

Fully connected networks treat images as flat vectors, destroying spatial structure. A 224×224×3 image, once flattened, becomes 150,528 features; connecting each of them to a hidden layer of 1,000 neurons requires about 150 million parameters for the first layer alone. CNNs solve this through two key principles:

Parameter Sharing: A single filter (kernel) slides across the entire image, reusing the same weights at every spatial location. A 3×3 kernel has only 9 parameters per channel, regardless of image size.

Translation Equivariance: Because the same filter scans all positions, a feature is detected the same way wherever it appears; shifting the input simply shifts the feature map by the same amount. The network learns what to detect, not where.

How this diagram works: This comparison shows why CNNs suit image data. The left side shows a fully connected (FC) network that flattens a 224×224×3 image into 150,528 features, requiring over 150 million parameters for the first hidden layer alone, which makes it computationally expensive and prone to overfitting. The right side shows a CNN that uses convolutional layers with small filters, preserving spatial structure while reducing parameters to 1-25 million. The key insights box highlights why CNNs work: parameter sharing means the same filter detects features everywhere, translation invariance allows recognizing objects regardless of position, and spatial hierarchies let the network learn increasingly complex patterns layer by layer.
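The parameter arithmetic above can be checked with a few lines of Python. This is a minimal sketch: the 1,000-neuron hidden layer comes from the text, while the 64-filter 3×3 convolution used for comparison is an illustrative assumption, not a layer from any specific network.

```python
# First-layer parameter count: fully connected vs. convolutional.
height, width, channels = 224, 224, 3
hidden_units = 1000                      # hidden layer size used in the text

flat_features = height * width * channels
fc_params = flat_features * hidden_units + hidden_units   # weights + one bias per neuron

# Assumed comparison layer: 64 filters of size 3x3 spanning all 3 input channels.
kernel, num_filters = 3, 64
conv_params = (kernel * kernel * channels + 1) * num_filters  # +1 bias per filter

print(flat_features)  # 150528
print(fc_params)      # 150529000
print(conv_params)    # 1792
```

The convolutional count does not change if the image grows to 512×512, whereas the fully connected count scales with the number of pixels.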

The Convolution Operation

Convolution is the core building block. A small kernel (filter) slides over the input, computing element-wise multiplications and summing the results at each position.

Mathematical Definition

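For a 2D input $I$ and kernel $F$, the output at position $(i, j)$ is:

$$O(i, j) = \sum_{m=0}^{k_h - 1} \sum_{n=0}^{k_w - 1} I(i + m,\, j + n)\, F(m, n)$$

where $k_h$ and $k_w$ are the kernel height and width.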

[Figure: Convolution operation with a 5×5 input, 3×3 kernel, stride 1 and valid padding, producing a 3×3 output. The kernel slides across the entire input, computing one element-wise multiply-and-sum per position.]
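The sliding multiply-and-sum can be sketched in plain Python. The function name `conv2d_valid` and the example values are illustrative assumptions; like deep learning libraries, the sketch does not flip the kernel (strictly speaking it computes a cross-correlation).

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (lists of lists) with stride 1 and no padding."""
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = len(image) - k_h + 1
    out_w = len(image[0]) - k_w + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Multiply the current window element-wise with the kernel, then sum.
            row.append(sum(
                image[i + m][j + n] * kernel[m][n]
                for m in range(k_h)
                for n in range(k_w)
            ))
        output.append(row)
    return output

# 5x5 input with a vertical edge between columns 1 and 2.
image = [[0, 0, 1, 1, 1] for _ in range(5)]
# 3x3 kernel that responds to left-to-right intensity increases.
vertical_edge = [[-1, 0, 1],
                 [-1, 0, 1],
                 [-1, 0, 1]]

edges = conv2d_valid(image, vertical_edge)
print(edges)  # [[3, 3, 0], [3, 3, 0], [3, 3, 0]]
```

The output is 3×3 for a 5×5 input, and the response is strongest where the window straddles the edge.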

Stride and Padding

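Stride ($S$): How many pixels the kernel shifts at each step.

Padding ($P$): Zero-padding added around the input border.

The output size along each spatial dimension is:

$$O = \left\lfloor \frac{W - K + 2P}{S} \right\rfloor + 1$$

where $W$ = input size, $K$ = kernel size, $P$ = padding, $S$ = stride.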

Padding and Stride Effects
Valid (P=0): Input 7×7 → 5×5 (K=3, S=1)
Same (P=1): Input 7×7 → 7×7 (K=3, S=1)
Stride 2 (S=2): Input 7×7 → 3×3 (K=3, P=0)
Output Formula: O = floor((W-K+2P)/S)+1
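The output formula is easy to turn into a helper. The function name `conv_output_size` is an assumption for illustration; the last call uses a 7×7 kernel with stride 2 and padding 3 on a 224-pixel input, a common stem-layer setting chosen here only as an extra example.

```python
def conv_output_size(w, k, p=0, s=1):
    """Output size along one spatial dimension: floor((W - K + 2P) / S) + 1."""
    return (w - k + 2 * p) // s + 1

print(conv_output_size(7, 3, p=0, s=1))    # 5  (valid padding)
print(conv_output_size(7, 3, p=1, s=1))    # 7  (same padding)
print(conv_output_size(7, 3, p=0, s=2))    # 3  (stride 2)
print(conv_output_size(224, 7, p=3, s=2))  # 112
```

The same helper applies to pooling layers, which use a window size in place of the kernel size.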

Multi-Channel Convolution

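For an RGB input with $C_{in} = 3$ channels, each filter has shape $K \times K \times C_{in}$. The filter slides over all channels simultaneously, producing a single output channel:

$$O(i, j) = \sum_{c=0}^{C_{in} - 1} \sum_{m=0}^{K - 1} \sum_{n=0}^{K - 1} I_c(i + m,\, j + n)\, F_c(m, n) + b$$

where $I_c$ is channel $c$ of the input, $F_c$ is the matching slice of the filter, and $b$ is the filter's bias.

A convolution layer with $C_{out}$ filters produces $C_{out}$ channels. Total parameters: $(K \cdot K \cdot C_{in} + 1) \cdot C_{out}$, where the $+1$ counts one bias per filter.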
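A small sketch of how one filter collapses several input channels into a single output map, together with a parameter counter. The names `conv2d_multichannel` and `conv_layer_params` and the toy values are assumptions for illustration, and the count assumes one bias per filter.

```python
def conv2d_multichannel(image, filt, bias=0):
    """image: [C][H][W], filt: [C][K][K]. Returns one output map (stride 1, no padding)."""
    c_in, k = len(filt), len(filt[0])
    out_h = len(image[0]) - k + 1
    out_w = len(image[0][0]) - k + 1
    return [
        [
            # Sum over every channel and every kernel position, then add the bias.
            bias + sum(
                image[c][i + m][j + n] * filt[c][m][n]
                for c in range(c_in)
                for m in range(k)
                for n in range(k)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

def conv_layer_params(k, c_in, c_out, bias=True):
    """(K * K * C_in + 1) * C_out when each filter carries one bias."""
    return (k * k * c_in + (1 if bias else 0)) * c_out

image = [
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],   # channel 0
    [[1, 0, 1], [0, 1, 0], [1, 0, 1]],   # channel 1
]
filt = [
    [[1, 0], [0, 1]],   # channel 0 slice: adds the window's main diagonal
    [[1, 1], [1, 1]],   # channel 1 slice: adds the whole window
]

feature_map = conv2d_multichannel(image, filt)
print(feature_map)                   # [[8, 10], [14, 16]]
print(conv_layer_params(3, 3, 64))   # 1792
```

Two input channels go in and one feature map comes out; a layer with 64 such filters would stack 64 of these maps.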

Pooling Layers

Pooling reduces spatial dimensions, providing translational invariance and reducing computation.

Max Pooling

Selects the maximum value within each pooling window. Preserves the strongest feature activations.

Average Pooling

Computes the mean value within each window. Global Average Pooling (GAP) averages each entire feature map to a single value, commonly replacing fully connected layers.

Pooling: 4×4 Input, 2×2 Pool, Stride=2
[Figure: a 4×4 input reduced to 2×2 by max pooling (captures the strongest activations) and by average pooling (smooths feature maps).]
Both: fewer downstream parameters, more invariance
Global Average Pooling (GAP): 7×7×512 → 1×1×512 = 512 values
Replaces FC layers, reduces overfitting
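Max, average and global average pooling can be compared side by side on one small feature map. This is a plain-Python sketch; the helper names and the input values are made up for illustration.

```python
def pool2d(feature_map, size=2, stride=2, mode="max"):
    """Apply max or average pooling to a 2D feature map (list of lists)."""
    reducer = max if mode == "max" else (lambda window: sum(window) / len(window))
    out_h = (len(feature_map) - size) // stride + 1
    out_w = (len(feature_map[0]) - size) // stride + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            window = [
                feature_map[i * stride + m][j * stride + n]
                for m in range(size)
                for n in range(size)
            ]
            row.append(reducer(window))
        output.append(row)
    return output

def global_average_pool(feature_map):
    """Collapse an entire feature map to its mean value."""
    values = [v for row in feature_map for v in row]
    return sum(values) / len(values)

x = [
    [1, 3, 2, 4],
    [5, 6, 1, 2],
    [7, 2, 9, 0],
    [1, 4, 3, 8],
]

print(pool2d(x, mode="max"))    # [[6, 4], [7, 9]]
print(pool2d(x, mode="avg"))    # [[3.75, 2.25], [3.5, 5.0]]
print(global_average_pool(x))   # 3.625
```

Neither operation has learnable parameters; applied to a 7×7×512 tensor, global average pooling would run once per channel and return 512 values.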

CNN Architecture: The Pattern

The canonical CNN follows a repeating pattern:

Architecture Diagram

[Conv → ReLU → Pool] × N → [FC] × M → Output

Convolutional blocks extract hierarchical features:

Early layers: Low-level features (edges, textures, colors)
Middle layers: Mid-level features (patterns, parts, shapes)
Late layers: High-level features (objects, scenes, concepts)

Receptive Field

The receptive field is the region of the original input that influences a particular neuron. As we stack layers, the receptive field grows:

Receptive Field Growth
Layer 1: RF=3×3 → Layer 2: RF=5×5 → Layer 3: RF=7×7 → Layer 4: RF=9×9
Deep Layer: RF=Full image
RF_l = RF_{l-1} + (K_l - 1) × ∏_{i=1}^{l-1} S_i
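The recurrence can be evaluated layer by layer, starting from a receptive field of 1 (a single input pixel). The function name `receptive_fields` and the second, mixed conv/pool stack are illustrative assumptions.

```python
def receptive_fields(layers):
    """layers: list of (kernel_size, stride). Returns the receptive field after each layer."""
    rf = 1      # a single input pixel
    jump = 1    # product of the strides of all previous layers
    sizes = []
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
        sizes.append(rf)
    return sizes

# Four stacked 3x3 convolutions with stride 1.
print(receptive_fields([(3, 1)] * 4))
# [3, 5, 7, 9]

# 3x3 conv, 2x2 pool (stride 2), 3x3 conv, 2x2 pool (stride 2), 3x3 conv.
print(receptive_fields([(3, 1), (2, 2), (3, 1), (2, 2), (3, 1)]))
# [3, 4, 8, 10, 18]
```

Strided layers multiply the growth contributed by every later layer, which is why pooling lets deep layers cover the full image quickly.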

Famous Architectures

LeNet-5 (1998)

Pioneering CNN for handwritten digit recognition. Introduced the Conv → Pool → FC paradigm.

Layer	Output	Kernel	Filters	Parameters
Conv1	28×28×6	5×5	6	156
Pool1	14×14×6	2×2		0
Conv2	10×10×16	5×5	16	1,516
Pool2	5×5×16	2×2		0
FC1	120			48,120
FC2	84			10,164
FC3	10			850

Total: ~60K parameters
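The table's parameter counts can be reproduced by hand. The sketch below assumes one bias per output map or neuron, and it models Conv2 with the sparse connection scheme of the original LeNet-5 (6 maps reading 3 input maps, 9 reading 4, and 1 reading all 6), which is why the count is 1,516 rather than the 2,416 of a fully connected 5×5 layer. The helper names are illustrative.

```python
def conv_params(kernel, in_maps, out_maps):
    """Weights plus one bias for each output map."""
    return (kernel * kernel * in_maps + 1) * out_maps

def fc_params(n_in, n_out):
    """Weights plus one bias for each output neuron."""
    return (n_in + 1) * n_out

conv1 = conv_params(5, 1, 6)

# Sparse Conv2: each group of output maps sees only a subset of the 6 input maps.
conv2 = (
    conv_params(5, 3, 6)      # 6 output maps, 3 input maps each
    + conv_params(5, 4, 9)    # 9 output maps, 4 input maps each
    + conv_params(5, 6, 1)    # 1 output map, all 6 input maps
)
conv2_dense = conv_params(5, 6, 16)   # what a fully connected Conv2 would cost

fc1 = fc_params(5 * 5 * 16, 120)
fc2 = fc_params(120, 84)
fc3 = fc_params(84, 10)
total = conv1 + conv2 + fc1 + fc2 + fc3

print(conv1, conv2, conv2_dense)   # 156 1516 2416
print(fc1, fc2, fc3)               # 48120 10164 850
print(total)                       # 60806
```

Roughly 97% of the parameters sit in the fully connected layers, a pattern that recurs in AlexNet and VGG.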

AlexNet (2012)

Won ImageNet by a large margin. Key innovations: ReLU activation, dropout, data augmentation, GPU training.

5 conv layers + 3 FC layers
~60M parameters
ReLU instead of tanh → faster training
Overlapping pooling (3×3, S=2)

VGG (2014)

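Demonstrated that depth matters. Used only 3×3 convolutions with stride 1 and padding 1.

Key insight: Two stacked 3×3 conv layers have the same receptive field as one 5×5 layer but with fewer parameters. With $C$ input and output channels (ignoring biases):

$$2 \times (3 \times 3 \times C^2) = 18C^2 \quad < \quad 5 \times 5 \times C^2 = 25C^2$$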

VGG-16: 13 conv layers + 3 FC layers = ~138M parameters.
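Both claims can be checked numerically. The sketch below compares two 3×3 layers with one 5×5 layer for an assumed channel count of 64, then totals the parameters of VGG-16 using the commonly published layer layout (configuration "D" with a 224×224×3 input and a 1,000-class output); that layout is not given in the text above and is included here as an assumption.

```python
def conv_params(kernel, c_in, c_out):
    """Weights plus one bias per output channel."""
    return (kernel * kernel * c_in + 1) * c_out

def fc_params(n_in, n_out):
    """Weights plus one bias per output neuron."""
    return (n_in + 1) * n_out

# Two stacked 3x3 layers vs. one 5x5 layer, weights only, C channels in and out.
C = 64                                   # illustrative channel count
two_3x3_weights = 2 * (3 * 3 * C * C)    # 18 * C^2
one_5x5_weights = 5 * 5 * C * C          # 25 * C^2

# Numbers are output channels of 3x3 convs; "M" is a 2x2 max pool with stride 2.
VGG16_LAYOUT = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
                512, 512, 512, "M", 512, 512, 512, "M"]

channels, size, conv_total = 3, 224, 0
for item in VGG16_LAYOUT:
    if item == "M":
        size //= 2                       # pooling halves height and width
    else:
        conv_total += conv_params(3, channels, item)
        channels = item

fc_total = (fc_params(size * size * channels, 4096)
            + fc_params(4096, 4096)
            + fc_params(4096, 1000))
vgg16_total = conv_total + fc_total

print(two_3x3_weights, one_5x5_weights)    # 73728 102400
print(conv_total, fc_total, vgg16_total)   # 14714688 123642856 138357544
```

Under this layout the 13 conv layers hold about 14.7M parameters, while the three FC layers hold about 123.6M of the ~138M total.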

ResNet (2015)

Introduced skip connections to solve the degradation problem: a deeper network should not have higher training error than a shallower one, yet plain deep networks did in practice.

ResNet Skip Connection (Residual Block)
Input x → Conv 3×3 → BN → ReLU → Conv 3×3 → BN → (+) → ReLU → Output ReLU(F(x) + x)
Skip / identity connection: x bypasses both conv layers and is added back at (+)
Why Skip Connections?
• Mitigates vanishing gradients
• Enables 152+ layer networks
• Identity mapping: F(x) → 0
• Easier to learn residuals
• Degradation problem resolved
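A scalar toy model gives some intuition for the gradient argument. It is a deliberately simplified sketch, not a model of a real ResNet: each layer is treated as a scalar function with local derivative f'(x), so a plain stack multiplies the f'(x) values while a residual stack multiplies (1 + f'(x)), because the derivative of x + f(x) is 1 + f'(x). The function name and the value 0.1 are assumptions for illustration.

```python
import math

def gradient_through_stack(local_grads, residual):
    """Chain-rule product for a stack of scalar layers.

    Plain layer:    y = f(x)      -> dy/dx = f'(x)
    Residual layer: y = x + f(x)  -> dy/dx = 1 + f'(x)
    """
    factors = [(1.0 + g) if residual else g for g in local_grads]
    return math.prod(factors)

local_grads = [0.1] * 20   # 20 layers whose local derivative is small

plain_grad = gradient_through_stack(local_grads, residual=False)
residual_grad = gradient_through_stack(local_grads, residual=True)

print(f"plain:    {plain_grad:.3e}")     # shrinks towards zero
print(f"residual: {residual_grad:.3f}")  # stays at a usable magnitude
```

The identity path contributes the constant 1 to every factor, so the gradient signal cannot be driven to zero by small layer derivatives alone.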

Architecture Comparison

Architecture	Year	Depth	Parameters	Top-5 Error	Key Innovation
LeNet-5	1998	7	60K		First practical CNN
AlexNet	2012	8	60M	16.4%	ReLU, dropout, GPU
VGG-16	2014	16	138M	7.3%	Small 3×3 filters
GoogLeNet	2014	22	6.8M	6.7%	Inception modules
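ResNet-50	2015	50	25.6M	3.6%	Skip connections
EfficientNet	2019		5.3M	2.9%	Compound scaling

Note: the 3.6% figure is the ILSVRC 2015 result of a ResNet ensemble rather than of a single ResNet-50. In the EfficientNet row, 5.3M is the parameter count of the smallest model (B0), while 2.9% is the top-5 error reported for the much larger B7.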

EfficientNet: Compound Scaling

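EfficientNet scales three dimensions jointly using a compound coefficient $\phi$:

$$\text{depth: } d = \alpha^{\phi}, \qquad \text{width: } w = \beta^{\phi}, \qquad \text{resolution: } r = \gamma^{\phi}$$

subject to $\alpha \cdot \beta^2 \cdot \gamma^2 \approx 2$ with $\alpha \ge 1$, $\beta \ge 1$, $\gamma \ge 1$. This achieves better accuracy-efficiency tradeoffs than scaling any single dimension.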
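A short sketch of compound scaling. The constants 1.2, 1.1 and 1.15 are the values the EfficientNet authors report from their grid search on the baseline network; they are not stated in the text above, so treat them as an assumption, and note that released model variants may round the resulting multipliers.

```python
# Base coefficients (assumed, as reported for the EfficientNet baseline search).
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15

def compound_scale(phi):
    """Depth, width and resolution multipliers for compound coefficient phi."""
    return {
        "depth": ALPHA ** phi,
        "width": BETA ** phi,
        "resolution": GAMMA ** phi,
    }

# FLOPs grow roughly with depth * width^2 * resolution^2, so each unit of phi
# multiplies the cost by about this factor (the constraint asks for ~2).
cost_factor = ALPHA * BETA ** 2 * GAMMA ** 2

print(round(cost_factor, 3))   # 1.92
for phi in (0, 1, 3):
    print(phi, {name: round(value, 3) for name, value in compound_scale(phi).items()})
```

Because the cost factor is close to 2, raising $\phi$ by one roughly doubles the compute budget, split across all three dimensions instead of spent on one.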

Feature Visualization

CNNs learn interpretable feature hierarchies:

Layer 1: Edge detectors (Gabor-like filters), color blobs
Layer 2: Corners, textures, simple patterns
Layer 3: Object parts (eyes, wheels, textures)
Layer 4: Object-level features (faces, dogs, buildings)
Layer 5: Full objects and scenes

Implementation in PyTorch

Basic CNN

import torch
import torch.nn as nn

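class SimpleCNN(nn.Module):
    """Three Conv-BN-ReLU blocks followed by global average pooling and a linear head."""

    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            # Block 1: 3 -> 32 channels; padding=1 keeps H and W, pooling halves them
            nn.Conv2d(3, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),

            # Block 2: 32 -> 64 channels, spatial size halved again
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),

            # Block 3: 64 -> 128 channels, then global average pooling to 1x1
            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d((1, 1))
        )
        # One 128-dimensional vector per image feeds the classifier
        self.classifier = nn.Linear(128, num_classes)

    def forward(self, x):
        x = self.features(x)          # (N, 128, 1, 1)
        x = x.view(x.size(0), -1)     # flatten to (N, 128)
        return self.classifier(x)     # (N, num_classes) logits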
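To see why the classifier always receives 128 features, the layer sequence of SimpleCNN can be traced with the output-size formula alone. This is a plain-Python sketch that mirrors the layer settings above without needing PyTorch; the helper names are illustrative.

```python
def conv_out(size, kernel, padding=0, stride=1):
    """Output size along one dimension: floor((W - K + 2P) / S) + 1."""
    return (size + 2 * padding - kernel) // stride + 1

def simple_cnn_shapes(size):
    """(channels, height, width) after each stage of SimpleCNN for a square RGB input."""
    shapes = [(3, size, size)]
    for out_channels in (32, 64):
        size = conv_out(size, kernel=3, padding=1)   # 3x3 conv, padding 1: size unchanged
        size = conv_out(size, kernel=2, stride=2)    # 2x2 max pool, stride 2: size halved
        shapes.append((out_channels, size, size))
    size = conv_out(size, kernel=3, padding=1)       # third conv, no pooling after it
    shapes.append((128, size, size))
    shapes.append((128, 1, 1))                       # AdaptiveAvgPool2d((1, 1))
    return shapes

print(simple_cnn_shapes(32))
# [(3, 32, 32), (32, 16, 16), (64, 8, 8), (128, 8, 8), (128, 1, 1)]
print(simple_cnn_shapes(224))
# [(3, 224, 224), (32, 112, 112), (64, 56, 56), (128, 56, 56), (128, 1, 1)]
```

Because global average pooling collapses any spatial size to 1×1, the same network accepts both 32×32 and 224×224 inputs without changing the linear layer.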

Residual Block

class ResidualBlock(nn.Module):
    """Basic residual block: two 3x3 convs plus a skip connection."""

    def __init__(self, channels, stride=1):
        super().__init__()
        # Positional Conv2d arguments: kernel_size=3, stride, padding=1
        self.conv1 = nn.Conv2d(channels, channels, 3, stride, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, 1, 1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

        # Identity shortcut by default (an empty Sequential returns its input)
        self.shortcut = nn.Sequential()
        if stride != 1:
            # A strided 1x1 conv downsamples x so it matches the shape of F(x)
            self.shortcut = nn.Sequential(
                nn.Conv2d(channels, channels, 1, stride, bias=False),
                nn.BatchNorm2d(channels)
            )

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))   # F(x)
        out += self.shortcut(x)           # F(x) + x
        return torch.relu(out)

Transfer Learning

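import torch
import torch.nn as nn
import torchvision.models as models

num_classes = 10  # placeholder: set this to the number of classes in your task

# Load ImageNet weights (torchvision >= 0.13; older releases used pretrained=True)
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)

# Freeze the pretrained backbone
for param in model.parameters():
    param.requires_grad = False

# Replace the final layer; the new layer's parameters are trainable by default
model.fc = nn.Linear(model.fc.in_features, num_classes)

# Optimize only the new classification head
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)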

Transfer learning strategy:

Load pretrained weights (ImageNet)
Freeze early layers (feature extraction)
Replace final layer for your task
Fine-tune with small learning rate

Key Takeaways

Summary

CNNs exploit spatial structure through parameter sharing and local connectivity
Convolution extracts features; pooling provides invariance and dimensionality reduction
Deeper networks learn hierarchical features: edges → textures → parts → objects
Skip connections (ResNet) enable training of very deep networks (152+ layers)
Transfer learning from pretrained models is the dominant paradigm in practice
Output size: $O = \lfloor (W - K + 2P) / S \rfloor + 1$
