CNNs for Image Data: Convolution, Pooling and Architectures
Why CNNs for Images?
Fully connected networks treat images as flat vectors, destroying spatial structure. A 224×224×3 image flattened becomes 150,528 features — connecting each to a hidden layer of 1,000 neurons requires 150 million parameters just for the first layer. CNNs solve this through two key principles:
Parameter Sharing: A single filter (kernel) slides across the entire image, reusing the same weights at every spatial location. A 3×3 kernel has only 9 parameters per channel, regardless of image size.
Translation Equivariance: Because the same filter scans all positions, shifting the input shifts the feature map by the same amount, so a feature learned at one location can be recognized anywhere. The network learns what to detect, not where.
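Both principles can be seen in one dimension. The sketch below is a minimal pure-Python illustration; the helper `correlate1d` and the toy signals are assumptions made for the example, not part of any library. One two-weight kernel is reused at every position, and moving the pattern in the input moves the response by the same amount.

```python
def correlate1d(signal, kernel):
    """Valid 1D cross-correlation: slide the kernel and take dot products."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


# The same two weights are reused at every position (parameter sharing).
kernel = [-1, 1]  # computes signal[i + 1] - signal[i], i.e. responds to edges

signal = [0, 0, 1, 1, 0, 0, 0, 0]
shifted = [0, 0, 0, 0, 1, 1, 0, 0]  # same pattern moved 2 steps to the right

out = correlate1d(signal, kernel)
out_shifted = correlate1d(shifted, kernel)

print(out)          # [0, 1, 0, -1, 0, 0, 0]
print(out_shifted)  # [0, 0, 0, 1, 0, -1, 0]
```

The response pattern `1, 0, -1` appears in both outputs, two positions later in the second one, which is the equivariance property in miniature.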
The Convolution Operation
Convolution is the core building block. A small kernel (filter) slides over the input, computing element-wise multiplications and summing the results at each position.
Mathematical Definition
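For a 2D input $X$ and kernel $W$, the output at position $(i, j)$ is:

$$Y(i, j) = \sum_{m=0}^{k_h - 1} \sum_{n=0}^{k_w - 1} X(i + m,\, j + n)\, W(m, n)$$

where $k_h$ and $k_w$ are the kernel height and width. Strictly speaking this is cross-correlation (the kernel is not flipped), which is what deep learning libraries implement under the name convolution.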
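The sliding sum of element-wise products can be written out directly. The following is a deliberately naive sketch in pure Python (no padding, stride 1); the function name `conv2d_valid` and the toy arrays are illustrative assumptions, and real frameworks use far faster implementations.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (lists of lists), no padding, stride 1."""
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = len(image) - k_h + 1
    out_w = len(image[0]) - k_w + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Element-wise multiply the window with the kernel, then sum.
            total = 0
            for m in range(k_h):
                for n in range(k_w):
                    total += image[i + m][j + n] * kernel[m][n]
            row.append(total)
        output.append(row)
    return output


image = [
    [1, 2, 3, 0],
    [4, 5, 6, 1],
    [7, 8, 9, 2],
    [0, 1, 2, 3],
]
kernel = [
    [1, 0],
    [0, -1],
]

result = conv2d_valid(image, kernel)
print(result)  # [[-4, -4, 2], [-4, -4, 4], [6, 6, 6]]
```

A 4×4 input with a 2×2 kernel gives a 3×3 output, since the kernel fits in three positions along each axis.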
Stride and Padding
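Stride ($S$): How many pixels the kernel shifts at each step.
Padding ($P$): Number of zero pixels added on each side of the input border.

$$N_{out} = \left\lfloor \frac{N_{in} - K + 2P}{S} \right\rfloor + 1$$

where $N_{in}$ = input size, $K$ = kernel size, $P$ = padding, $S$ = stride.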
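The output-size rule is easy to check numerically. This is a small sketch with a hypothetical helper `conv_output_size`; the 32×32 input used for the LeNet-style chain is an assumption (the standard LeNet-5 input size), chosen so the results can be compared with the LeNet-5 table later in this article.

```python
def conv_output_size(n_in, kernel, padding=0, stride=1):
    """Spatial output size of a convolution or pooling layer along one axis."""
    return (n_in - kernel + 2 * padding) // stride + 1


# A 3x3 convolution with padding 1 and stride 1 preserves the size.
same = conv_output_size(224, kernel=3, padding=1, stride=1)

# LeNet-style chain, assuming a 32x32 input.
conv1 = conv_output_size(32, kernel=5)               # 5x5 conv
pool1 = conv_output_size(conv1, kernel=2, stride=2)  # 2x2 pool, stride 2
conv2 = conv_output_size(pool1, kernel=5)            # 5x5 conv
pool2 = conv_output_size(conv2, kernel=2, stride=2)  # 2x2 pool, stride 2

print(same)                        # 224
print(conv1, pool1, conv2, pool2)  # 28 14 10 5
```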
Multi-Channel Convolution
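For an input with $C_{in}$ channels ($C_{in} = 3$ for RGB), each filter has shape $K \times K \times C_{in}$. The filter slides over all channels simultaneously, producing a single output channel:

$$Y(i, j) = \sum_{c=1}^{C_{in}} \sum_{m=0}^{K-1} \sum_{n=0}^{K-1} X_c(i + m,\, j + n)\, W_c(m, n) + b$$

where $X_c$ and $W_c$ are channel $c$ of the input and of the filter, and $b$ is the filter's bias.
A convolution layer with $C_{out}$ filters produces $C_{out}$ output channels. Total parameters: $(K \cdot K \cdot C_{in} + 1) \cdot C_{out}$, where the $+1$ accounts for one bias per filter.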
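To make the parameter counts concrete, the sketch below compares a convolution layer with the fully connected layer from the opening example. The helper names `conv_params` and `dense_params` are illustrative assumptions; the layer sizes are the ones quoted in this article.

```python
def conv_params(k_h, k_w, c_in, c_out, bias=True):
    """Weights (plus optional biases) of a standard convolution layer."""
    return (k_h * k_w * c_in + (1 if bias else 0)) * c_out


def dense_params(n_in, n_out, bias=True):
    """Weights (plus optional biases) of a fully connected layer."""
    return (n_in + (1 if bias else 0)) * n_out


# Fully connected: 224x224x3 image flattened, 1,000 hidden units (weights only).
fc_weights = dense_params(224 * 224 * 3, 1000, bias=False)

# Convolutional: 32 filters of size 3x3 over 3 input channels.
conv_first = conv_params(3, 3, c_in=3, c_out=32)

# LeNet-5 Conv1: 6 filters of size 5x5 over 1 input channel.
lenet_conv1 = conv_params(5, 5, c_in=1, c_out=6)

print(fc_weights)   # 150528000
print(conv_first)   # 896
print(lenet_conv1)  # 156
```

The convolution count does not depend on the image size at all, which is the practical payoff of parameter sharing.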
Pooling Layers
Pooling reduces spatial dimensions, which lowers computation and gives the network a degree of invariance to small translations.
Max Pooling
Selects the maximum value within each pooling window. Preserves the strongest feature activations.
Average Pooling
Computes the mean value within each window. Global Average Pooling (GAP) averages each entire feature map to a single value, commonly replacing fully connected layers.
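The three pooling variants differ only in how each window is reduced. Here is a minimal pure-Python sketch; `pool2d` and `global_average_pool` are hypothetical helper names, and the feature map is made up for illustration.

```python
def pool2d(feature_map, size=2, stride=2, mode="max"):
    """Max or average pooling over a 2D feature map (list of lists)."""
    if mode not in ("max", "avg"):
        raise ValueError("mode must be 'max' or 'avg'")
    out_h = (len(feature_map) - size) // stride + 1
    out_w = (len(feature_map[0]) - size) // stride + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Collect the values inside the current pooling window.
            window = [
                feature_map[i * stride + m][j * stride + n]
                for m in range(size)
                for n in range(size)
            ]
            row.append(max(window) if mode == "max" else sum(window) / len(window))
        output.append(row)
    return output


def global_average_pool(feature_map):
    """Collapse an entire feature map to its mean value."""
    values = [v for row in feature_map for v in row]
    return sum(values) / len(values)


fmap = [
    [1, 3, 2, 0],
    [5, 7, 4, 4],
    [0, 2, 9, 1],
    [2, 0, 3, 3],
]

print(pool2d(fmap, mode="max"))   # [[7, 4], [2, 9]]
print(pool2d(fmap, mode="avg"))   # [[4.0, 2.5], [1.0, 4.0]]
print(global_average_pool(fmap))  # 2.875
```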
CNN Architecture: The Pattern
The canonical CNN follows a repeating pattern:
[Conv → ReLU → Pool] × N → [FC] × M → Output
Convolutional blocks extract hierarchical features:
- Early layers: Low-level features (edges, textures, colors)
- Middle layers: Mid-level features (patterns, parts, shapes)
- Late layers: High-level features (objects, scenes, concepts)
Receptive Field
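The receptive field is the region of the original input that influences a particular neuron. As we stack layers, the receptive field grows:

$$r_l = r_{l-1} + (k_l - 1) \prod_{i=1}^{l-1} s_i, \qquad r_0 = 1$$

where $k_l$ and $s_l$ are the kernel size and stride of layer $l$.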
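The growth of the receptive field can be computed layer by layer: each layer adds (kernel size − 1) input pixels, scaled by the product of all earlier strides. The sketch below assumes that standard recurrence; `receptive_field` is a hypothetical helper and the layer lists are examples.

```python
def receptive_field(layers):
    """Receptive field after a stack of layers.

    `layers` is a list of (kernel_size, stride) pairs, ordered input to output.
    """
    rf = 1    # a single input pixel sees only itself
    jump = 1  # distance in input pixels between adjacent units of the current layer
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf


two_3x3 = receptive_field([(3, 1), (3, 1)])
three_3x3 = receptive_field([(3, 1), (3, 1), (3, 1)])

# Conv3 -> Pool2/2 -> Conv3 -> Pool2/2 -> Conv3
small_cnn = receptive_field([(3, 1), (2, 2), (3, 1), (2, 2), (3, 1)])

print(two_3x3)    # 5
print(three_3x3)  # 7
print(small_cnn)  # 18
```

Pooling with stride 2 doubles how fast later layers widen the receptive field, which is why the five-layer stack reaches 18 pixels while three plain 3×3 layers reach only 7.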
Famous Architectures
LeNet-5 (1998)
Pioneering CNN for handwritten digit recognition. Introduced the Conv→Pool→FC paradigm.
| Layer | Output | Kernel | Filters | Parameters |
|---|---|---|---|---|
| Conv1 | 28×28×6 | 5×5 | 6 | 156 |
| Pool1 | 14×14×6 | 2×2 | — | 0 |
| Conv2 | 10×10×16 | 5×5 | 16 | 1,516 |
| Pool2 | 5×5×16 | 2×2 | — | 0 |
| FC1 | 120 | — | — | 48,120 |
| FC2 | 84 | — | — | 10,164 |
| FC3 | 10 | — | — | 850 |
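Total: ~60K parameters. Conv2 has 1,516 parameters rather than the 2,416 of a fully connected 5×5 layer because the original design connects each Conv2 filter to only a subset of the 6 input feature maps.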
AlexNet (2012)
Won the ImageNet challenge (ILSVRC 2012) by a large margin. Key innovations: ReLU activation, dropout, data augmentation, GPU training.
- 5 conv layers + 3 FC layers
- ~60M parameters
- ReLU instead of tanh → faster training
- Overlapping pooling (3×3, S=2)
VGG (2014)
Demonstrated that depth matters. Used only 3×3 convolutions with stride 1 and padding 1.
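Key insight: Two stacked 3×3 conv layers have the same receptive field as one 5×5 layer but with fewer parameters. With $C$ input and output channels (ignoring biases):

$$2 \times (3 \times 3 \times C \times C) = 18C^2 \quad \text{vs.} \quad 5 \times 5 \times C \times C = 25C^2$$

The stack also applies an extra non-linearity between the two layers.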
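The saving is easy to quantify for a concrete channel count. A short sketch, assuming equal input and output channels and no biases; `stacked_conv_weights` is a hypothetical helper and `C = 64` is just an example width.

```python
def stacked_conv_weights(kernel, num_layers, channels):
    """Weights in `num_layers` stacked kernel x kernel convs, C -> C channels."""
    return num_layers * kernel * kernel * channels * channels


C = 64

two_3x3 = stacked_conv_weights(3, num_layers=2, channels=C)    # 18 * C^2
one_5x5 = stacked_conv_weights(5, num_layers=1, channels=C)    # 25 * C^2
three_3x3 = stacked_conv_weights(3, num_layers=3, channels=C)  # 27 * C^2
one_7x7 = stacked_conv_weights(7, num_layers=1, channels=C)    # 49 * C^2

print(two_3x3, one_5x5)    # 73728 102400
print(three_3x3, one_7x7)  # 110592 200704
```

The same argument extends to three 3×3 layers versus one 7×7 layer, where the stack needs roughly 55% of the weights.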
VGG-16: 13 conv layers + 3 FC layers = ~138M parameters.
ResNet (2015)
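Introduced skip connections to address the degradation problem: beyond a certain depth, plain networks showed higher training error, even though a deeper network should be able to match a shallower one by learning identity mappings in the extra layers. A residual block learns a residual function $F(x)$ and adds the input back:

$$y = F(x) + x$$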
Architecture Comparison
| Architecture | Year | Depth | Parameters | Top-5 Error | Key Innovation |
|---|---|---|---|---|---|
| LeNet-5 | 1998 | 7 | 60K | — | First practical CNN |
| AlexNet | 2012 | 8 | 60M | 16.4% | ReLU, dropout, GPU |
| VGG-16 | 2014 | 16 | 138M | 7.3% | Small 3×3 filters |
| GoogLeNet | 2014 | 22 | 6.8M | 6.7% | Inception modules |
| ResNet-50 | 2015 | 50 | 25.6M | 3.6% (ResNet ensemble) | Skip connections |
| EfficientNet | 2019 | — | 5.3M (B0) | 2.9% (B7) | Compound scaling |
EfficientNet: Compound Scaling
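EfficientNet scales three dimensions jointly using a compound coefficient $\phi$:

$$\text{depth: } d = \alpha^{\phi}, \qquad \text{width: } w = \beta^{\phi}, \qquad \text{resolution: } r = \gamma^{\phi}$$

subject to $\alpha \cdot \beta^2 \cdot \gamma^2 \approx 2$ with $\alpha, \beta, \gamma \geq 1$. This achieves better accuracy-efficiency tradeoffs than scaling any single dimension.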
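As a worked example of compound scaling, the sketch below computes the three multipliers for a few values of the compound coefficient. The base constants 1.2, 1.1 and 1.15 are the values reported in the EfficientNet paper for the B0 baseline; the function names are assumptions made for this illustration.

```python
# Base coefficients for depth, width and resolution (from the EfficientNet paper).
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15


def compound_scale(phi):
    """Depth, width and resolution multipliers for compound coefficient phi."""
    return {
        "depth": ALPHA ** phi,
        "width": BETA ** phi,
        "resolution": GAMMA ** phi,
    }


def flops_multiplier(phi):
    """Approximate FLOPs growth: FLOPs scale with depth * width^2 * resolution^2."""
    return (ALPHA * BETA ** 2 * GAMMA ** 2) ** phi


for phi in range(4):
    scale = compound_scale(phi)
    print(
        phi,
        round(scale["depth"], 3),
        round(scale["width"], 3),
        round(scale["resolution"], 3),
        round(flops_multiplier(phi), 2),
    )
```

Because the constraint product is close to 2, each increment of the coefficient roughly doubles the compute budget.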
Feature Visualization
CNNs learn interpretable feature hierarchies:
- Layer 1: Edge detectors (Gabor-like filters), color blobs
- Layer 2: Corners, textures, simple patterns
- Layer 3: Object parts (eyes, wheels, textures)
- Layer 4: Object-level features (faces, dogs, buildings)
- Layer 5: Full objects and scenes
Implementation in PyTorch
Basic CNN
```python
import torch
import torch.nn as nn


class SimpleCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            # Block 1: 3 -> 32 channels, pooling halves the spatial size
            nn.Conv2d(3, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),
            # Block 2: 32 -> 64 channels, pooling halves the spatial size again
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),
            # Block 3: 64 -> 128 channels, then global average pooling to 1x1
            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d((1, 1)),
        )
        self.classifier = nn.Linear(128, num_classes)

    def forward(self, x):
        x = self.features(x)
        x = x.view(x.size(0), -1)  # (N, 128, 1, 1) -> (N, 128)
        return self.classifier(x)
```
Residual Block
```python
import torch
import torch.nn as nn


class ResidualBlock(nn.Module):
    def __init__(self, channels, stride=1):
        super().__init__()
        # Main path: two 3x3 convs; bias is omitted because BatchNorm follows
        self.conv1 = nn.Conv2d(channels, channels, 3, stride, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, 1, 1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

        # Shortcut: identity, or a strided 1x1 conv when the main path downsamples
        self.shortcut = nn.Sequential()
        if stride != 1:
            self.shortcut = nn.Sequential(
                nn.Conv2d(channels, channels, 1, stride, bias=False),
                nn.BatchNorm2d(channels),
            )

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = out + self.shortcut(x)  # y = F(x) + x
        return torch.relu(out)
```
Transfer Learning
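```python
import torch
import torch.nn as nn
import torchvision.models as models

num_classes = 10  # placeholder: set to the number of classes in your task
# torchvision >= 0.13 API; older releases used pretrained=True instead
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
for param in model.parameters():
    param.requires_grad = False  # freeze the pretrained backbone
model.fc = nn.Linear(model.fc.in_features, num_classes)  # new head is trainable
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
```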
Transfer learning strategy:
- Load pretrained weights (ImageNet)
- Freeze the pretrained layers (feature extraction)
- Replace the final layer for your task
- Train the new layer, then optionally unfreeze later layers and fine-tune with a small learning rate
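The two training stages in this strategy can be sketched with per-parameter-group learning rates. To keep the example small and free of downloads, a tiny hypothetical `backbone` stands in for the pretrained network and `head` for the replaced final layer; the learning rates are illustrative assumptions, not recommended values.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for a pretrained network plus a new task-specific head.
backbone = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d((1, 1)),
    nn.Flatten(),
)
head = nn.Linear(8, 5)
model = nn.Sequential(backbone, head)

# Stage 1: feature extraction. Freeze the backbone and train only the head.
for param in backbone.parameters():
    param.requires_grad = False
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)

# ... train the head for a few epochs here ...

# Stage 2: fine-tuning. Unfreeze the backbone and give it a much smaller
# learning rate than the head so the pretrained features change slowly.
for param in backbone.parameters():
    param.requires_grad = True
optimizer = torch.optim.Adam(
    [
        {"params": backbone.parameters(), "lr": 1e-5},
        {"params": head.parameters(), "lr": 1e-3},
    ]
)

logits = model(torch.randn(2, 3, 32, 32))
print(logits.shape)  # torch.Size([2, 5])
```

With a real pretrained model the same pattern applies, with the frozen and unfrozen parts chosen from the model's own submodules.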
Key Takeaways
Summary
- CNNs exploit spatial structure through parameter sharing and local connectivity
- Convolution extracts features; pooling provides invariance and dimensionality reduction
- Deeper networks learn hierarchical features: edges → textures → parts → objects
- Skip connections (ResNet) enable training of very deep networks (152+ layers)
- Transfer learning from pretrained models is the dominant paradigm in practice
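- Output size: $\lfloor (N_{in} - K + 2P) / S \rfloor + 1$ for input size $N_{in}$, kernel size $K$, padding $P$, and stride $S$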