CNNs for Image Data
💡 Convolutional Neural Networks (CNNs) are the backbone of modern computer vision. This lesson covers the convolution operation, pooling, landmark architectures (LeNet, VGG, ResNet), and transfer learning with PyTorch.
1. Why CNNs for Images?
Fully connected networks treat each pixel as an unrelated input, so they ignore spatial structure. A 224×224×3 image flattens to 150,528 inputs; a single dense layer with 1000 neurons then needs about 150 million parameters. CNNs exploit three key priors:
| Property | Benefit |
|---|---|
| Local connectivity | Each neuron connects to a small spatial patch, not the whole image |
| Weight sharing | The same filter slides across the entire image |
| Translation equivariance | A cat in the top-left or bottom-right activates the same filters |
This reduces parameters dramatically: a single 3×3 filter over a 224×224×3 image needs only 27 weights (plus a bias), regardless of image size.
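The arithmetic behind these numbers can be checked directly. The sketch below is plain Python; the helper names `dense_params` and `conv_params` are illustrative and do not come from any library.

```python
def dense_params(num_inputs: int, num_neurons: int) -> int:
    """Weights plus one bias per neuron for a fully connected layer."""
    return num_inputs * num_neurons + num_neurons


def conv_params(kernel_size: int, in_channels: int, num_filters: int = 1) -> int:
    """Weights plus one bias per filter for a square convolution kernel."""
    return (kernel_size * kernel_size * in_channels + 1) * num_filters


flattened_inputs = 224 * 224 * 3
print(flattened_inputs)                      # 150528
print(dense_params(flattened_inputs, 1000))  # 150529000
print(conv_params(3, 3))                     # 28 (27 weights + 1 bias)
```

The convolution count does not depend on the image size at all, which is the point of weight sharing.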
2. The Convolution Operation
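Given an input image $X$ (height $H$, width $W$, channels $C$) and a kernel $K$ (size $k \times k$, channels $C$), the 2D convolution at position $(i, j)$ is:
2D Convolution Operation
$$Y(i, j) = \sum_{m=0}^{k-1} \sum_{n=0}^{k-1} \sum_{c=0}^{C-1} X(i+m,\, j+n,\, c)\, K(m, n, c)$$
Here,
- $X$ = Input feature map ($H \times W \times C$)
- $K$ = Kernel/filter ($k \times k \times C$)
- $H, W$ = Spatial dimensions of input
- $C$ = Number of input channels
- $k$ = Kernel spatial size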
ℹ️ Convolution as Matrix Multiplication
A convolution can be expressed as a matrix multiplication by unrolling the input patches into the columns of a matrix (im2col) and flattening each kernel into a row. This is one way GPU libraries implement convolutions efficiently: as highly optimized matrix multiplications (GEMM operations). Separately, the convolution theorem states that convolution in the spatial domain equals pointwise multiplication in the frequency domain.
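To make the im2col idea concrete, here is a minimal pure-Python sketch for a single-channel image with stride 1 and no padding. The function names `im2col` and `conv_as_matmul` are illustrative; real libraries use far more optimized layouts.

```python
def im2col(image, k):
    """Unroll every k x k patch of a 2D image into one column."""
    h, w = len(image), len(image[0])
    patches = [
        [image[i + m][j + n] for m in range(k) for n in range(k)]
        for i in range(h - k + 1)
        for j in range(w - k + 1)
    ]
    # Transpose so each patch becomes a column: shape (k*k, num_patches)
    return [list(col) for col in zip(*patches)]


def conv_as_matmul(image, kernel):
    """Stride-1, no-padding convolution as (flattened kernel) x (patch matrix)."""
    k = len(kernel)
    cols = im2col(image, k)
    flat_kernel = [v for row in kernel for v in row]
    out_w = len(image[0]) - k + 1
    flat_out = [
        sum(flat_kernel[r] * cols[r][p] for r in range(k * k))
        for p in range(len(cols[0]))
    ]
    # Fold the flat result back into rows of the output feature map
    return [flat_out[r:r + out_w] for r in range(0, len(flat_out), out_w)]


image = [[1, 2, 0],
         [0, 1, 3],
         [2, 1, 1]]
kernel = [[1, 0],
          [0, 1]]
print(conv_as_matmul(image, kernel))  # [[2, 5], [1, 2]]
```

With several filters, each flattened kernel becomes one row of a weight matrix, so the whole layer is a single matrix product.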
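For a filter bank with $C_{out}$ output channels, the output tensor has shape:
CNN Output Shape
$$H_{out} = \left\lfloor \frac{H + 2p - k}{s} \right\rfloor + 1, \qquad W_{out} = \left\lfloor \frac{W + 2p - k}{s} \right\rfloor + 1, \qquad \text{shape} = H_{out} \times W_{out} \times C_{out}$$
Here,
- $H, W$ = Input height and width
- $k$ = Kernel size
- $p$ = Padding
- $s$ = Stride
- $C_{out}$ = Number of output filters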
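The output size rule is easy to get wrong by one, so a small helper is handy. This is a sketch in plain Python; `conv_output_size` and `conv_output_shape` are hypothetical names, and the third call assumes a 7×7, stride-2 first layer of the kind used in ResNet-style networks.

```python
def conv_output_size(n: int, k: int, p: int = 0, s: int = 1) -> int:
    """Output length along one spatial dimension: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * p - k) // s + 1


def conv_output_shape(h, w, k, p, s, c_out):
    """Output shape as (height, width, channels)."""
    return (conv_output_size(h, k, p, s), conv_output_size(w, k, p, s), c_out)


print(conv_output_shape(5, 5, k=3, p=0, s=1, c_out=1))       # (3, 3, 1)
print(conv_output_shape(32, 32, k=3, p=1, s=1, c_out=32))    # (32, 32, 32)
print(conv_output_shape(224, 224, k=7, p=3, s=2, c_out=64))  # (112, 112, 64)
```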
Visual: 3×3 Convolution on a 5×5 Input (No Padding, Stride=1)
```text
Input (5×5):            Kernel (3×3):     Output (3×3):
┌───┬───┬───┬───┬───┐   ┌───┬───┬───┐     ┌───┬───┬───┐
│ 1 │ 2 │ 3 │ 0 │ 1 │   │ 1 │ 0 │ 1 │     │ 7 │ 6 │ 7 │
├───┼───┼───┼───┼───┤   ├───┼───┼───┤     ├───┼───┼───┤
│ 0 │ 1 │ 2 │ 1 │ 0 │   │ 0 │ 1 │ 0 │     │ 4 │ 5 │ 6 │
├───┼───┼───┼───┼───┤   ├───┼───┼───┤     ├───┼───┼───┤
│ 1 │ 0 │ 1 │ 2 │ 1 │   │ 1 │ 0 │ 1 │     │ 4 │ 4 │ 5 │
├───┼───┼───┼───┼───┤   └───┴───┴───┘     └───┴───┴───┘
│ 2 │ 1 │ 0 │ 1 │ 2 │
├───┼───┼───┼───┼───┤   Position (0,0):
│ 0 │ 2 │ 1 │ 0 │ 1 │   1·1 + 2·0 + 3·1 +   = 1+0+3+0+1+0+1+0+1 = 7
└───┴───┴───┴───┴───┘   0·0 + 1·1 + 2·0 +
                        1·1 + 0·0 + 1·1
```
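The same computation can be written as a direct loop. The sketch below uses the input and kernel values from the visual; `conv2d_valid` is an illustrative name. Like deep learning libraries, it slides the kernel without flipping it, which makes no difference here because the kernel is symmetric. Summing the nine products at position (0,0) gives 1+3+1+1+1 = 7.

```python
def conv2d_valid(image, kernel):
    """Single-channel convolution with stride 1 and no padding."""
    k = len(kernel)
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    return [
        [
            sum(image[i + m][j + n] * kernel[m][n]
                for m in range(k) for n in range(k))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


image = [
    [1, 2, 3, 0, 1],
    [0, 1, 2, 1, 0],
    [1, 0, 1, 2, 1],
    [2, 1, 0, 1, 2],
    [0, 2, 1, 0, 1],
]
kernel = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
]
for row in conv2d_valid(image, kernel):
    print(row)
# [7, 6, 7]
# [4, 5, 6]
# [4, 4, 5]
```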
Key Parameters
| Parameter | Typical Values | Effect |
|---|---|---|
| Kernel size | 3×3, 5×5 | Larger kernels capture broader patterns |
| Stride | 1, 2 | Larger stride reduces output size |
| Padding | 'same' (0-filled) | Preserves spatial dimensions |
| Dilation | 1, 2 | Expands receptive field without more params |
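A short sketch of how stride, padding, and dilation interact. It assumes the usual convention that a dilated kernel covers `dilation * (k - 1) + 1` input positions; the helper names are hypothetical.

```python
def effective_kernel_size(k: int, dilation: int = 1) -> int:
    """Span of input positions covered by a dilated kernel."""
    return dilation * (k - 1) + 1


def output_size(n: int, k: int, padding: int = 0, stride: int = 1, dilation: int = 1) -> int:
    """Output length along one dimension, including dilation."""
    k_eff = effective_kernel_size(k, dilation)
    return (n + 2 * padding - k_eff) // stride + 1


print(effective_kernel_size(3, dilation=2))        # 5
print(output_size(32, 3, padding=1))               # 32
print(output_size(32, 3, padding=1, stride=2))     # 16
print(output_size(32, 3, padding=2, dilation=2))   # 32
```

A 3×3 kernel with dilation 2 sees a 5×5 window while still using only nine weights per channel.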
3. Activation Function: ReLU
After convolution, apply a non-linear activation:
ReLU Activation
$$f(x) = \max(0, x)$$
Here,
- $x$ = Input value
- $f(x)$ = Output value (0 if $x < 0$, $x$ otherwise)
ReLU is preferred because it:
- Is computationally cheap (a single comparison)
- Avoids vanishing gradients for positive inputs
- Induces sparsity in activations
Variants include Leaky ReLU ($f(x) = \max(\alpha x, x)$ with a small slope such as $\alpha = 0.01$) and GELU (used in transformers).
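For comparison, the three activations can be written as scalar functions. This is a sketch using only the standard library; the slope `alpha=0.01` is a common default rather than a value fixed by the text, and `gelu` uses the exact erf-based form.

```python
import math


def relu(x: float) -> float:
    return max(0.0, x)


def leaky_relu(x: float, alpha: float = 0.01) -> float:
    # Keeps a small gradient for negative inputs instead of zeroing them
    return x if x > 0 else alpha * x


def gelu(x: float) -> float:
    # x times the standard normal CDF evaluated at x
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


for x in (-2.0, 0.0, 3.0):
    print(f"x={x:+.1f}  relu={relu(x):.4f}  "
          f"leaky={leaky_relu(x):.4f}  gelu={gelu(x):.4f}")
```

For negative inputs ReLU outputs exactly zero, Leaky ReLU a small negative value, and GELU a small negative value that fades toward zero as the input becomes more negative.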
4. Pooling Layers
Pooling reduces spatial dimensions, which provides a degree of translation invariance and reduces computation.
Max Pooling
$$Y(i, j) = \max_{(m, n) \in R_{ij}} X(m, n)$$
Here,
- $X$ = Input feature map
- $R_{ij}$ = Pooling region at position $(i, j)$
```text
Input (4×4):            MaxPool 2×2, stride=2:        Output (2×2):
┌────┬────┬────┬────┐   ┌───────────┬───────────┐     ┌────┬────┐
│ 1  │ 3  │ 2  │ 1  │   │ [1,3,0,1] │ [2,1,4,2] │     │ 3  │ 4  │
├────┼────┼────┼────┤   ├───────────┼───────────┤     ├────┼────┤
│ 0  │ 1  │ 4  │ 2  │   │ [5,1,3,0] │ [2,3,4,1] │     │ 5  │ 4  │
├────┼────┼────┼────┤   └───────────┴───────────┘     └────┴────┘
│ 5  │ 1  │ 2  │ 3  │
├────┼────┼────┼────┤   MaxPool picks largest value
│ 3  │ 0  │ 4  │ 1  │   in each 2×2 window
└────┴────┴────┴────┘
```
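A minimal pure-Python version of max pooling, run on the 4×4 input from the visual. The function name `max_pool2d` is illustrative. The four windows are {1,3,0,1}, {2,1,4,2}, {5,1,3,0} and {2,3,4,1}, whose maxima are 3, 4, 5 and 4.

```python
def max_pool2d(image, size=2, stride=2):
    """Take the maximum over each size x size window."""
    out_h = (len(image) - size) // stride + 1
    out_w = (len(image[0]) - size) // stride + 1
    return [
        [
            max(image[i * stride + m][j * stride + n]
                for m in range(size) for n in range(size))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


feature_map = [
    [1, 3, 2, 1],
    [0, 1, 4, 2],
    [5, 1, 2, 3],
    [3, 0, 4, 1],
]
print(max_pool2d(feature_map))  # [[3, 4], [5, 4]]
```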
Average Pooling
$$Y(i, j) = \frac{1}{|R_{ij}|} \sum_{(m, n) \in R_{ij}} X(m, n)$$
Here,
- $X$ = Input feature map
- $R_{ij}$ = Pooling region at position $(i, j)$
- $|R_{ij}|$ = Number of elements in the pooling region
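Average pooling differs from max pooling only in the reduction applied to each window. The sketch below (with the illustrative name `avg_pool2d`) uses the same 4×4 values as the max pooling example so the two outputs can be compared.

```python
def avg_pool2d(image, size=2, stride=2):
    """Average the values in each size x size window."""
    out_h = (len(image) - size) // stride + 1
    out_w = (len(image[0]) - size) // stride + 1
    return [
        [
            sum(image[i * stride + m][j * stride + n]
                for m in range(size) for n in range(size)) / (size * size)
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


feature_map = [
    [1, 3, 2, 1],
    [0, 1, 4, 2],
    [5, 1, 2, 3],
    [3, 0, 4, 1],
]
print(avg_pool2d(feature_map))  # [[1.25, 2.25], [2.25, 2.5]]
```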
💡 Global Average Pooling (GAP)
Global Average Pooling averages each channel into a single value, reducing an $H \times W \times C$ feature map to $1 \times 1 \times C$. This removes the need for large fully connected layers at the end of the network, significantly reducing parameters. GAP is a key component of modern architectures like ResNet and EfficientNet.
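A small sketch of global average pooling on a channels-first feature map, using nested lists in place of tensors. The function name and the example values are made up for illustration.

```python
def global_average_pool(feature_map):
    """Reduce a list of C channels (each H x W) to a list of C means."""
    return [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_map
    ]


feature_map = [
    [[1, 2], [3, 4]],   # channel 0
    [[0, 0], [0, 8]],   # channel 1
    [[5, 5], [5, 5]],   # channel 2
]
print(global_average_pool(feature_map))  # [2.5, 2.0, 5.0]
```

The classifier that follows then sees C inputs instead of H·W·C, independent of the input resolution.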
5. Building a Complete CNN
Architecture Pipeline
```text
Input Image → [Conv → ReLU → Pool] × N → Flatten → FC → Output
  3×32×32     32×32×32     32×16×16               256     10
```
Complete CNN in PyTorch
```python
import torch
import torch.nn as nn


class SimpleCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            # Block 1: 3 → 32 channels
            nn.Conv2d(3, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),  # 32×32×32 → 16×16×32
            # Block 2: 32 → 64 channels
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),  # 16×16×64 → 8×8×64
            # Block 3: 64 → 128 channels
            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),  # 8×8×128 → 4×4×128
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * 4 * 4, 256),
            nn.ReLU(inplace=True),
            nn.Dropout(0.5),
            nn.Linear(256, num_classes),
        )

    def forward(self, x):
        x = self.features(x)
        x = self.classifier(x)
        return x


model = SimpleCNN(num_classes=10)
x = torch.randn(1, 3, 32, 32)  # one random 32×32 RGB image
print(model(x).shape)  # torch.Size([1, 10])
print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")
```
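The parameter count printed by the model can be cross-checked by hand. The sketch below redoes the count in plain Python from the layer sizes in `SimpleCNN`; the helper names are illustrative, and it assumes BatchNorm contributes only its learnable scale and shift (running statistics are buffers, not parameters).

```python
def conv_params(c_in, c_out, k=3):
    return c_in * c_out * k * k + c_out      # weights + biases


def batchnorm_params(channels):
    return 2 * channels                      # scale (gamma) + shift (beta)


def linear_params(n_in, n_out):
    return n_in * n_out + n_out              # weights + biases


layers = {
    "conv1": conv_params(3, 32),
    "bn1": batchnorm_params(32),
    "conv2": conv_params(32, 64),
    "bn2": batchnorm_params(64),
    "conv3": conv_params(64, 128),
    "bn3": batchnorm_params(128),
    "fc1": linear_params(128 * 4 * 4, 256),
    "fc2": linear_params(256, 10),
}
for name, count in layers.items():
    print(f"{name:>5}: {count:,}")

total = sum(layers.values())
print(f"total: {total:,}")  # total: 620,810
```

Most of the parameters sit in the first fully connected layer, which is the pattern global average pooling is meant to avoid.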
6. Landmark CNN Architectures
LeNet-5 (1998)
One of the first successful CNNs, designed for handwritten digit recognition.
```text
Input   → Conv(6) → AvgPool → Conv(16) → AvgPool → FC(120) → FC(84) → FC(10)
1×32×32   6×28×28   6×14×14   16×10×10   16×5×5
```
💡 LeNet-5 Key Insight
Alternating convolution and subsampling progressively extracts higher-level features.
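The shapes in the LeNet-5 pipeline follow from the output size rule. The sketch below traces them, assuming 5×5 convolutions without padding and 2×2 pooling with stride 2; these kernel sizes are not stated in the listing above but are the ones that reproduce its shapes.

```python
def conv_out(size, kernel, stride=1, padding=0):
    return (size + 2 * padding - kernel) // stride + 1


# (name, output channels, kernel size, stride) -- assumed layer settings
layers = [
    ("conv1", 6, 5, 1),
    ("pool1", 6, 2, 2),
    ("conv2", 16, 5, 1),
    ("pool2", 16, 2, 2),
]

size = 32
shapes = [("input", 1, size)]
for name, channels, kernel, stride in layers:
    size = conv_out(size, kernel, stride)
    shapes.append((name, channels, size))

for name, channels, side in shapes:
    print(f"{name:>5}: {channels}x{side}x{side}")

flattened = shapes[-1][1] * shapes[-1][2] ** 2
print(flattened)  # 400 inputs to FC(120)
```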
VGG-16 (2014)
A very deep network for its time, built using only 3×3 convolutions.
```text
Block 1: Conv(64)  ×2 → MaxPool → 224→112
Block 2: Conv(128) ×2 → MaxPool → 112→56
Block 3: Conv(256) ×3 → MaxPool → 56→28
Block 4: Conv(512) ×3 → MaxPool → 28→14
Block 5: Conv(512) ×3 → MaxPool → 14→7
→ Flatten → FC(4096) → FC(4096) → FC(1000)
```
💡 VGG-16 Key Insight
Depth matters. Two stacked 3×3 convolutions have the same receptive field as one 5×5, but with fewer parameters and more non-linearity.
ℹ️ Parameter Efficiency Comparison
VGG-16 has 138M parameters, mostly in the fully connected layers. In contrast, ResNet-50 achieves better accuracy with only 25.6M parameters by using 1x1 convolutions for dimensionality reduction and global average pooling instead of large FC layers. This shows that architecture design can matter more than raw parameter count.
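The claim that most VGG-16 parameters live in the fully connected layers can be verified from the block listing. This is a sketch in plain Python with illustrative helper names; it assumes 3×3 kernels with biases and a 3-channel input.

```python
def conv_params(c_in, c_out, k=3):
    return c_in * c_out * k * k + c_out


def linear_params(n_in, n_out):
    return n_in * n_out + n_out


# (output channels, number of conv layers) for the five blocks
blocks = [(64, 2), (128, 2), (256, 3), (512, 3), (512, 3)]

conv_total, c_in = 0, 3
for c_out, repeats in blocks:
    for _ in range(repeats):
        conv_total += conv_params(c_in, c_out)
        c_in = c_out

fc_total = (
    linear_params(512 * 7 * 7, 4096)
    + linear_params(4096, 4096)
    + linear_params(4096, 1000)
)
total = conv_total + fc_total

print(f"conv:  {conv_total:,}")               # conv:  14,714,688
print(f"fc:    {fc_total:,}")                 # fc:    123,642,856
print(f"total: {total:,}")                    # total: 138,357,544
print(f"fc share: {fc_total / total:.1%}")    # fc share: 89.4%
```

The first fully connected layer alone (25,088 inputs to 4,096 outputs) accounts for roughly 103M of the 138M parameters.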
ResNet (2015)
Introduces skip connections to train very deep networks (50, 101, 152 layers).
ResNet Skip Connection
$$y = F(x) + x$$
Here,
- $x$ = Input to the residual block
- $F(x)$ = Residual function (convolution layers)
- $y$ = Output of the residual block
```text
Input ──────────────────┐
  │                     │
Conv → BN → ReLU        │  (skip connection)
  │                     │
Conv → BN ───────────→ (+) → ReLU → Output
```
💡 ResNet Key Insight
Skip connections address the degradation problem: a block can fall back to the identity mapping by driving its residual toward zero, so adding layers should not make performance worse.
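A toy illustration of why the identity mapping is easy for a residual block. The sketch uses a single linear map as a stand-in for the residual function and omits BatchNorm and ReLU; all names are illustrative.

```python
def linear(x, weights):
    """Matrix-vector product standing in for the conv layers F(x)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def plain_block(x, weights):
    return linear(x, weights)


def residual_block(x, weights):
    # y = F(x) + x
    return [f + xi for f, xi in zip(linear(x, weights), x)]


x = [1.0, -2.0, 0.5]
zero_weights = [[0.0] * 3 for _ in range(3)]

print(plain_block(x, zero_weights))     # [0.0, 0.0, 0.0]
print(residual_block(x, zero_weights))  # [1.0, -2.0, 0.5]
```

With all weights at zero the plain block destroys its input, while the residual block passes it through unchanged, so an extra residual block can default to doing nothing.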
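ResNet Skip Connection Gradient Flow
With identity skip connections $x_{i+1} = x_i + F(x_i)$, the gradient at layer $L$ has a direct path to layer 0: $\frac{\partial \mathcal{L}}{\partial x_0} = \frac{\partial \mathcal{L}}{\partial x_L}\left(1 + \frac{\partial}{\partial x_0}\sum_{i=0}^{L-1} F(x_i)\right)$. The identity term $1$ carries $\partial \mathcal{L} / \partial x_L$ back unchanged, so the gradient is unlikely to vanish even when the residual terms are small, enabling training of networks with 100+ layers.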
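A scalar toy model shows the effect of the identity term. It assumes every layer is a simple scaling by the same weight `w`, which is a strong simplification of a real network; the function names are hypothetical.

```python
def plain_gradient(w: float, depth: int) -> float:
    """d(output)/d(input) for depth stacked layers x -> w * x."""
    return w ** depth


def residual_gradient(w: float, depth: int) -> float:
    """d(output)/d(input) for depth stacked layers x -> x + w * x."""
    return (1.0 + w) ** depth


depth, w = 50, 0.01
print(plain_gradient(w, depth))     # vanishingly small, on the order of 1e-100
print(residual_gradient(w, depth))  # roughly 1.64
```

In the plain stack the gradient is a product of small factors; in the residual stack each factor is one plus a small term, so the product stays near one.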
Receptive Field Growth
In a VGG-style network with only 3x3 convolutions:
- After 1 layer: receptive field is 3x3
- After 2 layers: receptive field is 5x5
- After 3 layers: receptive field is 7x7
In general, $n$ stacked 3x3 convolutions have receptive field $(2n+1) \times (2n+1)$. This is why VGG uses stacks of 3x3 convolutions: they achieve the same receptive field as larger kernels but with fewer parameters and more non-linearities. With $C$ input and output channels, two 3x3 convolutions have $2 \cdot 3^2 C^2 = 18C^2$ parameters vs. one 5x5 convolution with $5^2 C^2 = 25C^2$ parameters.
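Both the receptive field rule and the parameter comparison are quick to check numerically. The sketch assumes stride-1 convolutions, equal input and output channel counts, and ignores biases; the channel count of 64 is just an example value.

```python
def receptive_field(num_layers: int, k: int = 3) -> int:
    """Receptive field side length of stacked stride-1 k x k convolutions."""
    return num_layers * (k - 1) + 1


def stacked_conv_weights(num_layers: int, k: int, channels: int) -> int:
    """Weight count (no biases) with `channels` inputs and outputs per layer."""
    return num_layers * k * k * channels * channels


for n in (1, 2, 3):
    print(n, receptive_field(n))          # 1 3, then 2 5, then 3 7

channels = 64
print(stacked_conv_weights(2, 3, channels))  # 73728  (18 * C^2)
print(stacked_conv_weights(1, 5, channels))  # 102400 (25 * C^2)
```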
Parameter Comparison
| Architecture | Depth | Parameters | Top-5 Error (ImageNet) |
|---|---|---|---|
| LeNet-5 | 5 | 60K | N/A |
| VGG-16 | 16 | 138M | 7.3% |
| ResNet-50 | 50 | 25.6M | 5.3% |
| ResNet-152 | 152 | 60.2M | 4.5% |
7. Training a CNN on CIFAR-10
```python
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
from torch.utils.data import DataLoader

# Data augmentation + normalization (per-channel CIFAR-10 mean and std)
transform_train = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465),
                         (0.2470, 0.2435, 0.2616)),
])
# No augmentation at test time, only normalization
transform_test = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465),
                         (0.2470, 0.2435, 0.2616)),
])

trainset = torchvision.datasets.CIFAR10(root='./data', train=True,
                                        download=True, transform=transform_train)
testset = torchvision.datasets.CIFAR10(root='./data', train=False,
                                       download=True, transform=transform_test)
trainloader = DataLoader(trainset, batch_size=128, shuffle=True, num_workers=2)
testloader = DataLoader(testset, batch_size=128, shuffle=False, num_workers=2)

# Model, loss, optimizer (SimpleCNN is the class defined in Section 5)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = SimpleCNN(num_classes=10).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)

# Training loop
for epoch in range(50):
    model.train()
    running_loss, correct, total = 0.0, 0, 0
    for images, labels in trainloader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        outputs = model(images)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        running_loss += loss.item()
        _, predicted = outputs.max(1)
        total += labels.size(0)
        correct += predicted.eq(labels).sum().item()

    scheduler.step()  # update the learning rate once per epoch
    train_acc = 100. * correct / total

    # Evaluation
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for images, labels in testloader:
            images, labels = images.to(device), labels.to(device)
            outputs = model(images)
            _, predicted = outputs.max(1)
            total += labels.size(0)
            correct += predicted.eq(labels).sum().item()
    test_acc = 100. * correct / total

    print(f"Epoch {epoch+1:2d} | Loss: {running_loss/len(trainloader):.3f} | "
          f"Train: {train_acc:.1f}% | Test: {test_acc:.1f}%")
```
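The learning rate schedule used in the loop follows a half cosine from the base rate down to a minimum. The sketch below reproduces the closed-form schedule in plain Python, assuming a minimum rate of 0 and one scheduler step per epoch; `cosine_annealing_lr` is an illustrative name, not the PyTorch API.

```python
import math


def cosine_annealing_lr(epoch, base_lr=1e-3, t_max=50, eta_min=0.0):
    """Closed-form cosine annealing: base_lr at epoch 0, eta_min at t_max."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * epoch / t_max))


for epoch in (0, 25, 50):
    print(epoch, f"{cosine_annealing_lr(epoch):.6f}")
# 0 0.001000
# 25 0.000500
# 50 0.000000
```

The rate falls slowly at first, fastest around the midpoint, and flattens out again near the end of training.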
8. Transfer Learning
Instead of training from scratch, use a pre-trained model:
```python
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision.models as models

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Load ResNet-18 pre-trained on ImageNet.
# torchvision >= 0.13 uses the `weights` argument; `pretrained=True` is deprecated.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze the pre-trained backbone
for param in model.parameters():
    param.requires_grad = False

# Replace the final layer for CIFAR-10; a newly created layer is trainable by default
num_features = model.fc.in_features
model.fc = nn.Linear(num_features, 10)
model = model.to(device)

# Train only the new classifier head
optimizer = optim.Adam(model.fc.parameters(), lr=1e-3)
```
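Freezing the backbone leaves very few parameters to train. The arithmetic below is a sketch that assumes the commonly reported total of 11,689,512 parameters for the 1000-class ResNet-18 and a 512-feature input to its final layer; both numbers are assumptions here and can be confirmed on a real model with `sum(p.numel() for p in model.parameters())`.

```python
RESNET18_IMAGENET_PARAMS = 11_689_512  # assumed total for the 1000-class model
FC_IN_FEATURES = 512                   # assumed input size of the final layer


def linear_params(n_in, n_out):
    return n_in * n_out + n_out


old_head = linear_params(FC_IN_FEATURES, 1000)
new_head = linear_params(FC_IN_FEATURES, 10)
total = RESNET18_IMAGENET_PARAMS - old_head + new_head

print(f"new head (trainable): {new_head:,}")   # new head (trainable): 5,130
print(f"whole model:          {total:,}")      # whole model:          11,181,642
print(f"trainable share:      {new_head / total:.3%}")
```

Training only a few thousand parameters is why this setup works with little data and little compute; unfreezing deeper layers trades that off against flexibility.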
9. Data Augmentation Strategies
```text
Original:    Flipped:     Rotated:     Cropped:     Color Jitter:
┌─────────┐  ┌─────────┐  ┌─────────┐  ┌─────────┐  ┌─────────┐
│   🐱    │  │    🐱   │  │   🐱    │  │ 🐱      │  │   🐱    │
│         │  │         │  │  /      │  │(cropped)│  │(shifted │
│         │  │         │  │ /       │  │         │  │ color)  │
└─────────┘  └─────────┘  └─────────┘  └─────────┘  └─────────┘
```
Common augmentations: horizontal flip, random crop, color jitter, rotation, cutout, random erasing.
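Two of these augmentations are simple enough to sketch on a nested-list "image". This is an illustration in plain Python, not how torchvision implements them; the function names and the tiny 3×3 image are made up for the example.

```python
import random


def horizontal_flip(image):
    """Mirror each row left to right."""
    return [row[::-1] for row in image]


def random_crop(image, size, padding=0, rng=random):
    """Zero-pad the image, then cut out a random size x size window."""
    width = len(image[0])
    blank_row = [0] * (width + 2 * padding)
    padded = (
        [blank_row[:] for _ in range(padding)]
        + [[0] * padding + list(row) + [0] * padding for row in image]
        + [blank_row[:] for _ in range(padding)]
    )
    top = rng.randint(0, len(padded) - size)
    left = rng.randint(0, len(padded[0]) - size)
    return [row[left:left + size] for row in padded[top:top + size]]


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
print(horizontal_flip(image))  # [[3, 2, 1], [6, 5, 4], [9, 8, 7]]

rng = random.Random(0)         # seeded so the crop is repeatable
cropped = random_crop(image, size=3, padding=1, rng=rng)
print(cropped)                 # a 3x3 window shifted by at most one pixel
```

Padding before cropping keeps the output the same size as the input while shifting the content, which is the same idea as `RandomCrop(32, padding=4)` in the training pipeline.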
10. Key Takeaways
Summary: CNNs for Image Data
- Convolution extracts local spatial features using shared-weight filters; the output size along each spatial dimension is $\lfloor (H + 2p - k)/s \rfloor + 1$
- Pooling (max, average, global average) reduces spatial dimensions, adds translation invariance, and reduces computation
- Deeper networks learn hierarchical features (edges → textures → parts → objects); stacking 3x3 convolutions is more efficient than using larger kernels
- ResNet skip connections address the degradation problem by letting gradients flow through identity shortcuts, which makes 100+ layer networks trainable
- Batch Normalization stabilizes training by normalizing activations, allowing higher learning rates
- Transfer learning leverages pre-trained features for new tasks with limited data; lower layers capture universal features, higher layers capture task-specific features
- Data augmentation (random crop, flip, color jitter) is an important regularizer that helps reduce overfitting on small datasets
11. Practice Exercises
Exercise 1: Build a CNN from Scratch
```python
import torch.nn as nn

# TODO: Build a CNN that achieves >85% accuracy on CIFAR-10
# Requirements:
# - At least 3 convolutional blocks
# - Use BatchNorm and Dropout
# - Train for 30+ epochs


class YourCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        # Your code here
        pass

    def forward(self, x):
        # Your code here
        pass
```
Exercise 2: Experiment with Architectures
- Replace max pooling with strided convolutions
- Try different kernel sizes (1×1, 3×3, 5×5)
- Add squeeze-and-excitation blocks
- Compare parameter counts and accuracy
Exercise 3: Visualize Learned Filters
```python
# TODO: Extract and visualize first-layer filters
first_conv = model.features[0]
filters = first_conv.weight.data.cpu()
# Plot the 3×3 filters using matplotlib
```
Exercise 4: Grad-CAM Visualization
```python
# TODO: Implement Grad-CAM to see which regions
# the model focuses on for classification
# Hint: Use hooks to capture intermediate gradients
```