CNNs for Image Data

💡 Convolutional Neural Networks (CNNs) are the backbone of modern computer vision. This lesson covers the convolution operation, pooling, landmark architectures (LeNet, VGG, ResNet), and transfer learning with PyTorch.

1. Why CNNs for Images?

Fully connected networks treat every pixel as an independent input and ignore spatial structure. A 224×224×3 image flattens to 150,528 inputs, so a single dense layer with 1,000 neurons already needs about 150 million weights. CNNs exploit three key properties:

Property	Benefit
Local connectivity	Each neuron connects to a small spatial patch, not the whole image
Weight sharing	The same filter slides across the entire image
Translation equivariance	A cat in the top-left or bottom-right activates the same filters

This reduces parameters dramatically: a 3×3 filter over a 224×224×3 image needs only 27 weights (plus bias), regardless of image size.
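The parameter counts quoted above can be checked with a few lines of arithmetic. This is a minimal sketch; the helper names `dense_params` and `conv_params` are illustrative and do not come from any library.

```python
def dense_params(num_inputs: int, num_neurons: int, bias: bool = True) -> int:
    """Weights (and optional biases) of one fully connected layer."""
    return num_inputs * num_neurons + (num_neurons if bias else 0)


def conv_params(kernel_size: int, in_channels: int, num_filters: int = 1,
                bias: bool = True) -> int:
    """Weights (and optional biases) of a square 2D convolution layer."""
    per_filter = kernel_size * kernel_size * in_channels
    return num_filters * (per_filter + (1 if bias else 0))


flattened = 224 * 224 * 3
print(flattened)                                  # 150528
print(dense_params(flattened, 1000, bias=False))  # 150528000
print(conv_params(3, 3, bias=False))              # 27
print(conv_params(3, 3, num_filters=64))          # 1792
```

The convolution count depends only on the kernel size and channel counts, never on the image height or width.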

2. The Convolution Operation

Given an input image $I$ (height $H$ , width $W$ , channels $C$ ) and a kernel $K$ (size $k \times k$ , channels $C$ ), the 2D convolution at position $(i,j)$ is:

2D Convolution Operation

(I * K)(i,j) = \sum_{m=0}^{k-1}\sum_{n=0}^{k-1}\sum_{c=0}^{C-1} I(i+m,\; j+n,\; c) \cdot K(m,\; n,\; c)

Here,

$I$ =Input feature map (H x W x C)
$K$ =Kernel/filter (k x k x C)
$H, W$ =Spatial dimensions of input
$C$ =Number of input channels
$k$ =Kernel spatial size
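The triple sum can be written out directly in NumPy. The sketch below assumes no padding, stride 1, and a single filter; the function name `conv2d_single_filter` and the small test arrays are made up for illustration.

```python
import numpy as np


def conv2d_single_filter(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Apply one k×k×C filter to an H×W×C image (no padding, stride 1)."""
    h, w, c = image.shape
    k = kernel.shape[0]
    assert kernel.shape == (k, k, c), "kernel must be k×k×C"
    out = np.zeros((h - k + 1, w - k + 1))
    for i in range(h - k + 1):
        for j in range(w - k + 1):
            # Sum over m, n, c of I(i+m, j+n, c) * K(m, n, c)
            out[i, j] = np.sum(image[i:i + k, j:j + k, :] * kernel)
    return out


image = np.arange(18, dtype=float).reshape(3, 3, 2)  # H=3, W=3, C=2
kernel = np.ones((2, 2, 2))                          # k=2, C=2
result = conv2d_single_filter(image, kernel)
print(result.tolist())  # [[36.0, 52.0], [84.0, 100.0]]
```

Note that the kernel is not flipped, so this is strictly a cross-correlation, which is also what deep learning frameworks compute under the name "convolution".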

ℹ️ Convolution as Matrix Multiplication

A convolution can be expressed as a matrix multiplication by unrolling the input patches into the columns of a matrix (im2col) and flattening the kernel into a row. This is one common way GPU libraries implement convolutions: as highly optimized general matrix multiplications (GEMM). Separately, the convolution theorem states that convolution in the spatial domain equals element-wise multiplication in the frequency domain, which is the basis of FFT-based convolution.
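The im2col idea can be sketched for a single-channel image as follows. This is an illustrative NumPy version under simplifying assumptions (no padding, stride 1, one filter), not how any particular GPU library is written; `im2col`, `conv2d_gemm`, and `conv2d_loop` are names chosen for this example.

```python
import numpy as np


def im2col(image: np.ndarray, k: int) -> np.ndarray:
    """Unroll every k×k patch of a 2D image into one column."""
    h, w = image.shape
    out_h, out_w = h - k + 1, w - k + 1
    cols = np.empty((k * k, out_h * out_w))
    for i in range(out_h):
        for j in range(out_w):
            cols[:, i * out_w + j] = image[i:i + k, j:j + k].ravel()
    return cols


def conv2d_gemm(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Convolution (no padding, stride 1) as a single matrix product."""
    k = kernel.shape[0]
    out_h, out_w = image.shape[0] - k + 1, image.shape[1] - k + 1
    row = kernel.reshape(1, -1)  # shape 1 × k²
    return (row @ im2col(image, k)).reshape(out_h, out_w)


def conv2d_loop(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Reference implementation with an explicit sliding window."""
    k = kernel.shape[0]
    out_h, out_w = image.shape[0] - k + 1, image.shape[1] - k + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + k, j:j + k] * kernel)
    return out


rng = np.random.default_rng(0)
image = rng.standard_normal((6, 6))
kernel = rng.standard_normal((3, 3))
print(np.allclose(conv2d_gemm(image, kernel), conv2d_loop(image, kernel)))  # True
```

With $F$ filters, the kernel row becomes an $F \times k^2$ matrix and the same product yields all output channels at once.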

For a filter bank with $F$ output channels, the output tensor has shape:

CNN Output Shape

\text{Output shape} = \left\lfloor\frac{H - k + 2p}{s} + 1\right\rfloor \times \left\lfloor\frac{W - k + 2p}{s} + 1\right\rfloor \times F

Here,

$H, W$ =Input height and width
$k$ =Kernel size
$p$ =Padding
$s$ =Stride
$F$ =Number of output filters
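The output-shape formula is easy to wrap in a small helper for sanity-checking layer configurations. This is a sketch; `conv_output_shape` is a hypothetical name and its arguments mirror the symbols defined above.

```python
def conv_output_shape(h: int, w: int, k: int, p: int = 0, s: int = 1,
                      f: int = 1) -> tuple:
    """Return (out_h, out_w, f) for a k×k convolution, padding p, stride s."""
    out_h = (h - k + 2 * p) // s + 1  # integer division implements the floor
    out_w = (w - k + 2 * p) // s + 1
    return out_h, out_w, f


print(conv_output_shape(5, 5, k=3))                       # (3, 3, 1)
print(conv_output_shape(224, 224, k=3, p=1, s=1, f=64))   # (224, 224, 64)
print(conv_output_shape(224, 224, k=7, p=3, s=2, f=64))   # (112, 112, 64)
print(conv_output_shape(32, 32, k=3, p=1, s=2, f=16))     # (16, 16, 16)
```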

Visual: 3×3 Convolution on a 5×5 Input (No Padding, Stride=1)

Architecture Diagram

Input (5×5):              Kernel (3×3):          Output (3×3):
┌───┬───┬───┬───┬───┐     ┌───┬───┬───┐          ┌────┬────┬────┐
│ 1 │ 2 │ 3 │ 0 │ 1 │     │ 1 │ 0 │ 1 │          │  7 │  6 │  7 │
├───┼───┼───┼───┼───┤     ├───┼───┼───┤          ├────┼────┼────┤
│ 0 │ 1 │ 2 │ 1 │ 0 │     │ 0 │ 1 │ 0 │          │  4 │  5 │  6 │
├───┼───┼───┼───┼───┤     ├───┼───┼───┤          ├────┼────┼────┤
│ 1 │ 0 │ 1 │ 2 │ 1 │     │ 1 │ 0 │ 1 │          │  4 │  4 │  5 │
├───┼───┼───┼───┼───┤     └───┴───┴───┘          └────┴────┴────┘
│ 2 │ 1 │ 0 │ 1 │ 2 │
├───┼───┼───┼───┼───┤     Position (0,0):
│ 0 │ 2 │ 1 │ 0 │ 1 │     1·1 + 2·0 + 3·1 +      = 1+0+3+0+1+0+1+0+1 = 7
└───┴───┴───┴───┴───┘     0·0 + 1·1 + 2·0 +
                           1·1 + 0·0 + 1·1

Key Parameters

Parameter	Typical Values	Effect
Kernel size	3×3, 5×5	Larger kernels capture broader patterns
Stride	1, 2	Larger stride reduces output size
Padding	'same' (0-filled)	Preserves spatial dimensions (at stride 1)
Dilation	1, 2	Expands receptive field without more params
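To see how these parameters interact, the sketch below computes the output size along one spatial dimension, including dilation. A dilated kernel covers $d(k-1)+1$ input positions while keeping only $k$ weights per dimension. The helper names are illustrative.

```python
def effective_kernel_size(k: int, dilation: int = 1) -> int:
    """Input span covered by a k-tap kernel with the given dilation."""
    return dilation * (k - 1) + 1


def output_size(n: int, k: int, p: int = 0, s: int = 1, dilation: int = 1) -> int:
    """Output length along one spatial dimension of size n."""
    return (n + 2 * p - effective_kernel_size(k, dilation)) // s + 1


print(output_size(32, k=3, p=1))              # 32: 'same' padding at stride 1
print(output_size(32, k=3, p=1, s=2))         # 16: stride 2 halves the size
print(output_size(32, k=5, p=2))              # 32: larger kernel, more padding
print(effective_kernel_size(3, dilation=2))   # 5: a 3×3 kernel spans 5×5
print(output_size(32, k=3, p=2, dilation=2))  # 32
```

With dilation 1 this reduces to the output-shape formula given earlier.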

3. Activation Function: ReLU

After convolution, apply a non-linear activation:

ReLU Activation

\text{ReLU}(x) = \max(0, x)

Here,

$x$ =Input value
$\text{ReLU}(x)$ =Output value (0 if x < 0, x otherwise)

ReLU is preferred because it:

Is computationally cheap (a single comparison)
Avoids vanishing gradients for positive inputs (the derivative is exactly 1)
Induces sparsity in activations (negative inputs map to 0)

Variants include Leaky ReLU ( $\max(\alpha x, x)$ with $\alpha \approx 0.01$ ) and GELU (used in transformers).
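The three activations can be compared on scalar inputs with plain Python. This is a sketch for intuition only; the GELU shown is the exact form $x \cdot \Phi(x)$ with $\Phi$ the standard normal CDF, which the lesson itself does not spell out.

```python
import math


def relu(x: float) -> float:
    return max(0.0, x)


def leaky_relu(x: float, alpha: float = 0.01) -> float:
    # Valid for 0 < alpha < 1: negative inputs are scaled instead of zeroed
    return max(alpha * x, x)


def gelu(x: float) -> float:
    # Exact GELU: x times the standard normal CDF evaluated at x
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


for x in (-2.0, 0.0, 3.0):
    print(f"x={x:+.1f}  relu={relu(x):.4f}  "
          f"leaky={leaky_relu(x):.4f}  gelu={gelu(x):.4f}")
```

In PyTorch the corresponding modules are `nn.ReLU`, `nn.LeakyReLU`, and `nn.GELU`.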

4. Pooling Layers

Pooling reduces spatial dimensions, which adds a degree of translation invariance and cuts computation in later layers.

Max Pooling

\text{MaxPool}(X)_{i,j} = \max_{(m,n) \in \mathcal{R}_{i,j}} X(m,n)

Here,

$X$ =Input feature map
$\mathcal{R}_{i,j}$ =Pooling region at position (i,j)

Architecture Diagram

Input (4×4):              MaxPool 2×2, stride=2:    Output (2×2):
┌────┬────┬────┬────┐     ┌──────────┬──────────┐   ┌────┬────┐
│  1 │  3 │  2 │  1 │     │ [1,3]    │ [2,1]    │   │  3 │  4 │
├────┼────┼────┼────┤     │ [0,1]    │ [4,2]    │   ├────┼────┤
│  0 │  1 │  4 │  2 │     ├──────────┼──────────┤   │  5 │  4 │
├────┼────┼────┼────┤     │ [5,1]    │ [2,3]    │   └────┴────┘
│  5 │  1 │  2 │  3 │     │ [3,0]    │ [4,1]    │
├────┼────┼────┼────┤     └──────────┴──────────┘
│  3 │  0 │  4 │  1 │     MaxPool picks the largest value
└────┴────┴────┴────┘     in each 2×2 window

Average Pooling

\text{AvgPool}(X)_{i,j} = \frac{1}{|\mathcal{R}_{i,j}|} \sum_{(m,n) \in \mathcal{R}_{i,j}} X(m,n)

Here,

$X$ =Input feature map
$\mathcal{R}_{i,j}$ =Pooling region at position (i,j)
$|\mathcal{R}_{i,j}|$ =Number of elements in the pooling region
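Both pooling formulas differ only in the reduction applied to each window, which the following pure-Python sketch makes explicit. The function `pool2d` and the 4×4 input are made up for this example.

```python
def pool2d(x, size=2, stride=2, reduce=max):
    """Apply `reduce` to each size×size window of a 2D list."""
    out_h = (len(x) - size) // stride + 1
    out_w = (len(x[0]) - size) // stride + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Collect the pooling region R_{i,j}
            window = [x[i * stride + m][j * stride + n]
                      for m in range(size) for n in range(size)]
            row.append(reduce(window))
        out.append(row)
    return out


def mean(values):
    return sum(values) / len(values)


x = [[1, 2, 5, 6],
     [3, 4, 7, 8],
     [9, 10, 13, 14],
     [11, 12, 15, 16]]

print(pool2d(x, reduce=max))   # [[4, 8], [12, 16]]
print(pool2d(x, reduce=mean))  # [[2.5, 6.5], [10.5, 14.5]]
```

In PyTorch the equivalent layers are `nn.MaxPool2d(2, 2)` and `nn.AvgPool2d(2, 2)`.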

💡 Global Average Pooling (GAP)

Global Average Pooling averages each channel into a single value, reducing a $C \times H \times W$ feature map to $C \times 1 \times 1$ . This replaces the large fully connected layers at the end of the network with, typically, a single linear classifier, significantly reducing parameters. GAP is a key component of modern architectures like ResNet and EfficientNet.
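The parameter saving can be illustrated in PyTorch by comparing a flatten-based head with a GAP-based head. This is a sketch that assumes a 128-channel 4×4 feature map and 10 classes; the variable names are illustrative.

```python
import torch
import torch.nn as nn

features = torch.randn(8, 128, 4, 4)  # (N, C, H, W)

# GAP is an adaptive average pool with a 1×1 output
gap = nn.AdaptiveAvgPool2d(1)
pooled = gap(features)
print(pooled.shape)  # torch.Size([8, 128, 1, 1])

# Equivalent to averaging over the two spatial dimensions
print(torch.allclose(pooled.flatten(1), features.mean(dim=(2, 3)), atol=1e-6))

flatten_head = nn.Sequential(nn.Flatten(), nn.Linear(128 * 4 * 4, 10))
gap_head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(128, 10))


def count_params(module: nn.Module) -> int:
    return sum(p.numel() for p in module.parameters())


print(count_params(flatten_head))  # 20490
print(count_params(gap_head))      # 1290
```

Both heads map the same feature map to 10 logits; the GAP head also works unchanged for any input resolution.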

5. Building a Complete CNN

Architecture Pipeline

Architecture Diagram

Input Image → [Conv → ReLU → Pool] × N → Flatten → FC → Output
   3×32×32       32×32×32  32×16×16          256    10

Complete CNN in PyTorch

import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            # Block 1: 3 → 32 channels
            nn.Conv2d(3, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),  # 32×32×32 → 32×16×16 (C×H×W)

            # Block 2: 32 → 64 channels
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),  # 64×16×16 → 64×8×8

            # Block 3: 64 → 128 channels
            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),  # 128×8×8 → 128×4×4
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * 4 * 4, 256),
            nn.ReLU(inplace=True),
            nn.Dropout(0.5),
            nn.Linear(256, num_classes),
        )

    def forward(self, x):
        x = self.features(x)
        x = self.classifier(x)
        return x

model = SimpleCNN(num_classes=10)
x = torch.randn(1, 3, 32, 32)
print(model(x).shape)  # torch.Size([1, 10])
print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")

6. Landmark CNN Architectures

LeNet-5 (1998)

One of the first successful CNNs, designed for handwritten digit recognition.

Architecture Diagram

Input → Conv(6) → AvgPool → Conv(16) → AvgPool → FC(120) → FC(84) → FC(10)
1×32×32  6×28×28   6×14×14   16×10×10   16×5×5

💡 LeNet-5 Key Insight

Alternating convolution and subsampling progressively extracts higher-level features.

VGG-16 (2014)

Very deep network using only 3×3 convolutions.

Architecture Diagram

Block 1: Conv(64) ×2  → MaxPool  → 224→112
Block 2: Conv(128) ×2 → MaxPool  → 112→56
Block 3: Conv(256) ×3 → MaxPool  → 56→28
Block 4: Conv(512) ×3 → MaxPool  → 28→14
Block 5: Conv(512) ×3 → MaxPool  → 14→7
         → Flatten → FC(4096) → FC(4096) → FC(1000)

💡 VGG-16 Key Insight

Depth matters. Two 3×3 convolutions have the same receptive field as one 5×5, but with fewer parameters and more non-linearity.

\text{VGG-16 parameters:} \quad \approx 138\text{M}

ℹ️ Parameter Efficiency Comparison

VGG-16 has about 138M parameters, roughly 89% of them in the three fully connected layers. In contrast, ResNet-50 achieves better accuracy with only 25.6M parameters by using 1x1 convolutions for dimensionality reduction and global average pooling in place of the large FC layers (a single linear classifier remains). This demonstrates that architecture design matters more than raw parameter count.
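The VGG-16 total can be reproduced from the block layout listed earlier. The sketch below assumes 3×3 convolutions with biases, a 7×7×512 feature map before the first FC layer, and 1000 output classes; the variable names are illustrative.

```python
# (in_channels, out_channels) for each 3×3 conv layer of VGG-16
conv_cfg = [(3, 64), (64, 64),
            (64, 128), (128, 128),
            (128, 256), (256, 256), (256, 256),
            (256, 512), (512, 512), (512, 512),
            (512, 512), (512, 512), (512, 512)]

# (in_features, out_features) for the three fully connected layers
fc_cfg = [(512 * 7 * 7, 4096), (4096, 4096), (4096, 1000)]

conv_total = sum(cin * cout * 3 * 3 + cout for cin, cout in conv_cfg)
fc_total = sum(nin * nout + nout for nin, nout in fc_cfg)
total = conv_total + fc_total

print(f"conv layers: {conv_total:,}")     # conv layers: 14,714,688
print(f"FC layers:   {fc_total:,}")       # FC layers:   123,642,856
print(f"total:       {total:,}")          # total:       138,357,544
print(f"FC share:    {fc_total / total:.1%}")  # FC share:    89.4%
```

The first FC layer alone (25,088 × 4,096 weights) accounts for about 103M parameters.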

ResNet (2015)

Introduces skip connections to train very deep networks (50, 101, 152 layers).

ResNet Skip Connection

\mathbf{y} = \mathcal{F}(\mathbf{x}, \{W_i\}) + \mathbf{x}

Here,

$\mathbf{x}$ =Input to the residual block
$\mathcal{F}(\mathbf{x}, \{W_i\})$ =Residual function (convolution layers)
$\mathbf{y}$ =Output of the residual block

Architecture Diagram

Input ──────────────────┐
  │                      │
  Conv → BN → ReLU      │ (skip connection)
  Conv → BN      → (+) → ReLU → Output
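The diagram corresponds to a block like the following PyTorch sketch. It assumes the input and output have the same number of channels and spatial size, so the skip connection is a plain identity; `BasicResidualBlock` is an illustrative name, not torchvision's implementation.

```python
import torch
import torch.nn as nn


class BasicResidualBlock(nn.Module):
    """y = ReLU(F(x) + x), with F = Conv → BN → ReLU → Conv → BN."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3,
                               padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3,
                               padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        residual = self.relu(self.bn1(self.conv1(x)))
        residual = self.bn2(self.conv2(residual))
        return self.relu(residual + x)  # skip connection, then ReLU


block = BasicResidualBlock(16)
x = torch.randn(2, 16, 8, 8)
y = block(x)
print(y.shape)  # torch.Size([2, 16, 8, 8])
```

When the block changes the channel count or stride, the shortcut needs a projection (commonly a 1×1 convolution) so the two tensors can be added.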

💡 ResNet Key Insight

Skip connections solve the degradation problem — deeper networks can learn identity mappings, ensuring performance doesn't degrade with depth.

ResNet Skip Connection Gradient Flow

With identity skip connections, the gradient at any block $l$ receives a direct contribution from the final block $L$ : $\frac{\partial \mathcal{L}}{\partial \mathbf{x}_l} = \frac{\partial \mathcal{L}}{\partial \mathbf{x}_L} \left(1 + \frac{\partial}{\partial \mathbf{x}_l} \sum_{i=l}^{L-1} \mathcal{F}(\mathbf{x}_i)\right)$ . Because of the additive identity term, part of the gradient reaches $\mathbf{x}_l$ without passing through any weight layer, so it is unlikely to vanish even when the residual terms are small. This is what makes networks with 100+ layers trainable.
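A scalar toy model gives some intuition for the identity term. Assume every layer is the linear map $\mathcal{F}(x) = w x$ with a small weight $w$; this is a deliberate oversimplification and both function names are hypothetical.

```python
def plain_chain_gradient(w: float, depth: int) -> float:
    """d x_L / d x_0 for a plain stack x_{i+1} = w * x_i."""
    return w ** depth


def residual_chain_gradient(w: float, depth: int) -> float:
    """d x_L / d x_0 for a residual stack x_{i+1} = x_i + w * x_i."""
    return (1.0 + w) ** depth


w, depth = 0.01, 50
plain = plain_chain_gradient(w, depth)
residual = residual_chain_gradient(w, depth)
print(f"plain:    {plain:.3e}")     # vanishingly small
print(f"residual: {residual:.3f}")  # stays on the order of 1
```

Expanding $(1 + w)^{L}$ gives $1 + Lw + \dots$; the leading 1 is the identity path that survives no matter how small $w$ is.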

📝 Receptive Field Growth

In a VGG-style network with only 3x3 convolutions:

After 1 layer: receptive field is 3x3
After 2 layers: receptive field is 5x5
After 3 layers: receptive field is 7x7

In general, $n$ stacked 3x3 convolutions (stride 1) have receptive field $(2n+1) \times (2n+1)$ . This is why VGG uses stacks of 3x3 convolutions: they achieve the same receptive field as larger kernels but with fewer parameters and more non-linearities. Per input/output channel pair, two 3x3 convolutions have $2 \times 3^2 = 18$ weights vs. $5^2 = 25$ for one 5x5 convolution.
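The receptive-field and weight counts can be tabulated with a short script. It assumes stride-1 convolutions and counts weights per input/output channel pair, ignoring biases; the helper names are illustrative.

```python
def receptive_field(num_layers: int, k: int = 3) -> int:
    """Side length of the receptive field of stacked k×k, stride-1 convs."""
    return num_layers * (k - 1) + 1


def stacked_weights(num_layers: int, k: int = 3) -> int:
    """Weights per channel pair for num_layers stacked k×k kernels."""
    return num_layers * k * k


for n in (1, 2, 3):
    rf = receptive_field(n)
    single = rf * rf  # one large kernel with the same receptive field
    print(f"{n} × 3x3: field {rf}x{rf}, "
          f"{stacked_weights(n)} weights vs {single} for one {rf}x{rf}")
```

The gap widens with depth: three 3x3 layers use 27 weights where a single 7x7 kernel needs 49.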

Parameter Comparison

Architecture	Depth	Parameters	Top-5 Error (ImageNet)
LeNet-5	5	60K	N/A
VGG-16	16	138M	7.3%
ResNet-50	50	25.6M	5.3%
ResNet-152	152	60.2M	4.5%

7. Training a CNN on CIFAR-10

import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
from torch.utils.data import DataLoader

# Data augmentation + normalization
transform_train = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465),
                         (0.2470, 0.2435, 0.2616)),
])

transform_test = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465),
                         (0.2470, 0.2435, 0.2616)),
])

trainset = torchvision.datasets.CIFAR10(root='./data', train=True,
                                         download=True, transform=transform_train)
testset = torchvision.datasets.CIFAR10(root='./data', train=False,
                                        download=True, transform=transform_test)

trainloader = DataLoader(trainset, batch_size=128, shuffle=True, num_workers=2)
testloader = DataLoader(testset, batch_size=128, shuffle=False, num_workers=2)

# Model, loss, optimizer
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = SimpleCNN(num_classes=10).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)

# Training loop
for epoch in range(50):
    model.train()
    running_loss, correct, total = 0.0, 0, 0

    for images, labels in trainloader:
        images, labels = images.to(device), labels.to(device)

        optimizer.zero_grad()
        outputs = model(images)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        running_loss += loss.item()
        _, predicted = outputs.max(1)
        total += labels.size(0)
        correct += predicted.eq(labels).sum().item()

    scheduler.step()
    train_acc = 100. * correct / total

    # Evaluation
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for images, labels in testloader:
            images, labels = images.to(device), labels.to(device)
            outputs = model(images)
            _, predicted = outputs.max(1)
            total += labels.size(0)
            correct += predicted.eq(labels).sum().item()

    test_acc = 100. * correct / total
    print(f"Epoch {epoch+1:2d} | Loss: {running_loss/len(trainloader):.3f} | "
          f"Train: {train_acc:.1f}% | Test: {test_acc:.1f}%")

8. Transfer Learning

Instead of training from scratch, use a pre-trained model:

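import torch
import torch.nn as nn
import torch.optim as optim
import torchvision.models as models

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Load ResNet-18 with ImageNet weights
# (torchvision >= 0.13; older versions use pretrained=True)
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze the pre-trained backbone
for param in model.parameters():
    param.requires_grad = False

# Replace the final layer for CIFAR-10; new layers are trainable by default
num_features = model.fc.in_features
model.fc = nn.Linear(num_features, 10)
model = model.to(device)

# Train only the new classifier head.
# Note: ImageNet models expect larger inputs, so resizing the 32×32
# CIFAR-10 images (e.g. to 224×224) usually helps.
optimizer = optim.Adam(model.fc.parameters(), lr=1e-3)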
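To confirm that only the new head will be updated, it helps to count trainable versus total parameters. The sketch below uses a tiny stand-in network instead of downloading real weights; `freeze_except`, `count_parameters`, and `toy_model` are hypothetical names for illustration.

```python
import torch.nn as nn


def freeze_except(model: nn.Module, trainable: nn.Module) -> None:
    """Freeze every parameter of `model`, then unfreeze `trainable`."""
    for param in model.parameters():
        param.requires_grad = False
    for param in trainable.parameters():
        param.requires_grad = True


def count_parameters(model: nn.Module) -> tuple:
    """Return (trainable, total) parameter counts."""
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    return trainable, total


# Small stand-in for a pre-trained backbone plus a new classifier head
backbone = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
)
head = nn.Linear(8, 10)
toy_model = nn.Sequential(backbone, head)

freeze_except(toy_model, head)
print(count_parameters(toy_model))  # (90, 314)
```

A common next step is to unfreeze some of the later backbone layers and continue training with a smaller learning rate.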

9. Data Augmentation Strategies

Architecture Diagram

Original:     Flipped:      Rotated:      Cropped:      Color Jitter:
┌─────────┐   ┌─────────┐   ┌─────────┐   ┌─────────┐   ┌─────────┐
│  🐱     │   │     🐱  │   │   🐱    │   │  🐱     │   │  🐱     │
│         │   │         │   │ /       │   │(cropped) │   │(shifted │
│         │   │         │   │/        │   │         │   │ color)  │
└─────────┘   └─────────┘   └─────────┘   └─────────┘   └─────────┘

Common augmentations: horizontal flip, random crop, color jitter, rotation, cutout, random erasing.

10. Key Takeaways

📋 Summary: CNNs for Image Data

Convolution extracts local spatial features using shared-weight filters; each spatial output dimension is $\lfloor(H - k + 2p)/s + 1\rfloor$
Pooling (max, average, global average) reduces spatial dimensions, adds translation invariance, and reduces computation
Deeper networks learn hierarchical features (edges → textures → parts → objects); stacking 3x3 convolutions is more efficient than using larger kernels
ResNet skip connections address the degradation problem: identity shortcuts give gradients a direct path, which makes 100+ layer networks trainable
Batch Normalization stabilizes training by normalizing activations, allowing higher learning rates
Transfer learning leverages pre-trained features for new tasks with limited data; lower layers capture universal features, higher layers capture task-specific features
Data augmentation (random crop, flip, color jitter) is critical for regularization and prevents overfitting on small datasets

11. Practice Exercises

Exercise 1: Build a CNN from Scratch

# TODO: Build a CNN that achieves >85% accuracy on CIFAR-10
# Requirements:
# - At least 3 convolutional blocks
# - Use BatchNorm and Dropout
# - Train for 30+ epochs
class YourCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        # Your code here
        pass

    def forward(self, x):
        # Your code here
        pass

Exercise 2: Experiment with Architectures

Replace max pooling with strided convolutions
Try different kernel sizes (1×1, 3×3, 5×5)
Add squeeze-and-excitation blocks
Compare parameter counts and accuracy

Exercise 3: Visualize Learned Filters

# TODO: Extract and visualize first-layer filters
first_conv = model.features[0]
filters = first_conv.weight.data.cpu()
# Plot the 3×3 filters using matplotlib

Exercise 4: Grad-CAM Visualization

# TODO: Implement Grad-CAM to see which regions
# the model focuses on for classification
# Hint: Use hooks to capture intermediate gradients
