CNNs for Image Data


💡 Convolutional Neural Networks (CNNs) are the backbone of modern computer vision. This lesson covers the convolution operation, pooling, landmark architectures (LeNet, VGG, ResNet), and transfer learning with PyTorch.


1. Why CNNs for Images?

Fully connected networks treat every pixel as a separate input and ignore spatial structure. A 224×224×3 image flattens to 150,528 inputs, so a single dense layer with 1000 neurons needs about 150 million weights. CNNs exploit three key properties:

Property | Benefit
Local connectivity | Each neuron connects to a small spatial patch, not the whole image
Weight sharing | The same filter slides across the entire image
Translation equivariance | A cat in the top-left or bottom-right activates the same filters

This reduces parameters dramatically: a 3×3 filter over a 224×224×3 image needs only 27 weights (plus a bias), regardless of image size.
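The two parameter counts can be checked in a few lines. This is a minimal sketch: the dense-layer count is computed arithmetically (actually allocating it would take roughly 600 MB in float32), and the single-filter `nn.Conv2d` layer is an illustrative choice rather than part of the lesson's model.

```python
import torch.nn as nn

# Dense layer: each of the 224*224*3 inputs connects to each of 1000 neurons.
height, width, channels, neurons = 224, 224, 3, 1000
dense_weights = height * width * channels * neurons
print(f"Dense weights: {dense_weights:,}")

# Convolution: one 3x3 filter spanning 3 input channels, shared across all positions.
conv = nn.Conv2d(in_channels=3, out_channels=1, kernel_size=3)
conv_weights = conv.weight.numel()   # 3 * 3 * 3
conv_bias = conv.bias.numel()        # one bias per filter
print(f"Conv weights: {conv_weights}, bias: {conv_bias}")
```

The convolution count depends only on kernel size and channel counts, which is why it stays fixed when the image grows.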


2. The Convolution Operation

Given an input image $I$ (height $H$, width $W$, channels $C$) and a kernel $K$ (size $k \times k$, channels $C$), the 2D convolution at position $(i,j)$ is:

2D Convolution Operation

$$(I * K)(i,j) = \sum_{m=0}^{k-1}\sum_{n=0}^{k-1}\sum_{c=0}^{C-1} I(i+m,\; j+n,\; c) \cdot K(m,\; n,\; c)$$

Here,

  • $I$ = Input feature map ($H \times W \times C$)
  • $K$ = Kernel/filter ($k \times k \times C$)
  • $H, W$ = Spatial dimensions of input
  • $C$ = Number of input channels
  • $k$ = Kernel spatial size

ℹ️ Convolution as Matrix Multiplication

A convolution can be expressed as a matrix multiplication by unrolling the input patches into the columns of a matrix (im2col) and flattening each kernel into a row. This is one common way GPUs implement convolutions efficiently: as highly optimized matrix multiplications (GEMM operations). The convolution theorem also states that convolution in the spatial domain equals element-wise multiplication in the frequency domain.
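The im2col idea can be sketched with `torch.nn.functional.unfold`, which extracts sliding patches as columns. The tensor sizes below are arbitrary assumptions chosen for illustration; the point is only that one matrix product reproduces `F.conv2d`.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(1, 3, 8, 8)        # (batch, channels, height, width)
weight = torch.randn(4, 3, 3, 3)   # (filters, channels, k, k)

# im2col: each column holds one flattened 3x3x3 patch -> shape (1, 27, 36)
cols = F.unfold(x, kernel_size=3)

# Kernels as a (4 x 27) matrix; a single matmul computes every output position.
out_gemm = weight.view(4, -1) @ cols      # (1, 4, 36)
out_gemm = out_gemm.view(1, 4, 6, 6)      # fold the 36 positions back into a 6x6 grid

out_conv = F.conv2d(x, weight)            # reference result
matches = torch.allclose(out_gemm, out_conv, atol=1e-4)
print(matches)
```

Production libraries choose among several algorithms, so this shows the principle rather than what any particular backend executes.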

For a filter bank with $F$ output channels, the output tensor has shape:

CNN Output Shape

$$\text{Output shape} = \left\lfloor\frac{H - k + 2p}{s} + 1\right\rfloor \times \left\lfloor\frac{W - k + 2p}{s} + 1\right\rfloor \times F$$

Here,

  • $H, W$ = Input height and width
  • $k$ = Kernel size
  • $p$ = Padding
  • $s$ = Stride
  • $F$ = Number of output filters
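The output-shape formula translates directly into a small helper. The function name `conv_output_size` and the 7×7, stride-2 layer are assumptions made for this sketch; the result is compared against what `nn.Conv2d` actually produces.

```python
import torch
import torch.nn as nn

def conv_output_size(size: int, k: int, p: int = 0, s: int = 1) -> int:
    """Spatial output size: floor((size - k + 2p) / s) + 1."""
    return (size - k + 2 * p) // s + 1

# 224x224 input, 7x7 kernel, padding 3, stride 2, 64 filters
h_out = conv_output_size(224, k=7, p=3, s=2)
w_out = conv_output_size(224, k=7, p=3, s=2)

layer = nn.Conv2d(3, 64, kernel_size=7, padding=3, stride=2)
actual = tuple(layer(torch.zeros(1, 3, 224, 224)).shape)  # PyTorch order: (N, F, H, W)
print(h_out, w_out, actual)
```

Note that PyTorch stores tensors channels-first, so the $F$ dimension appears before height and width.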

Visual: 3×3 Convolution on a 5×5 Input (No Padding, Stride=1)

Input (5×5):              Kernel (3×3):          Output (3×3):
┌───┬───┬───┬───┬───┐     ┌───┬───┬───┐          ┌────┬────┬────┐
│ 1 │ 2 │ 3 │ 0 │ 1 │     │ 1 │ 0 │ 1 │          │  7 │  6 │  7 │
├───┼───┼───┼───┼───┤     ├───┼───┼───┤          ├────┼────┼────┤
│ 0 │ 1 │ 2 │ 1 │ 0 │     │ 0 │ 1 │ 0 │          │  4 │  5 │  6 │
├───┼───┼───┼───┼───┤     ├───┼───┼───┤          ├────┼────┼────┤
│ 1 │ 0 │ 1 │ 2 │ 1 │     │ 1 │ 0 │ 1 │          │  4 │  4 │  5 │
├───┼───┼───┼───┼───┤     └───┴───┴───┘          └────┴────┴────┘
│ 2 │ 1 │ 0 │ 1 │ 2 │
├───┼───┼───┼───┼───┤     Position (0,0):
│ 0 │ 2 │ 1 │ 0 │ 1 │     1·1 + 2·0 + 3·1 +      = 1+0+3+0+1+0+1+0+1 = 7
└───┴───┴───┴───┴───┘     0·0 + 1·1 + 2·0 +
                           1·1 + 0·0 + 1·1
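The 5×5 input and 3×3 kernel from the diagram can be passed through `F.conv2d` to check the arithmetic by machine. This is a small verification sketch; the only assumption is the reshaping needed to fit PyTorch's 4D tensor layout. Position (0,0) sums the element-wise products of the top-left 3×3 patch and the kernel, which gives 7.

```python
import torch
import torch.nn.functional as F

image = torch.tensor([[1., 2., 3., 0., 1.],
                      [0., 1., 2., 1., 0.],
                      [1., 0., 1., 2., 1.],
                      [2., 1., 0., 1., 2.],
                      [0., 2., 1., 0., 1.]])
kernel = torch.tensor([[1., 0., 1.],
                       [0., 1., 0.],
                       [1., 0., 1.]])

# conv2d expects (batch, channels, H, W) inputs and (filters, channels, k, k) weights.
output = F.conv2d(image.view(1, 1, 5, 5), kernel.view(1, 1, 3, 3)).view(3, 3)
print(output)
# tensor([[7., 6., 7.],
#         [4., 5., 6.],
#         [4., 4., 5.]])
```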

Key Parameters

Parameter | Typical Values | Effect
Kernel size | 3×3, 5×5 | Larger kernels capture broader patterns
Stride | 1, 2 | Larger stride reduces output size
Padding | 'same' (0-filled) | Preserves spatial dimensions
Dilation | 1, 2 | Expands receptive field without more params
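To see these parameters in action, the sketch below applies four `nn.Conv2d` configurations to the same 32×32 input and records the output size and weight count. The configuration names and the single-channel input are assumptions chosen to keep the comparison readable.

```python
import torch
import torch.nn as nn

x = torch.zeros(1, 1, 32, 32)

configs = {
    "baseline":   dict(kernel_size=3),
    "same pad":   dict(kernel_size=3, padding=1),
    "stride 2":   dict(kernel_size=3, padding=1, stride=2),
    "dilation 2": dict(kernel_size=3, dilation=2),
}

shapes, weights = {}, {}
for name, kwargs in configs.items():
    conv = nn.Conv2d(1, 1, bias=False, **kwargs)
    shapes[name] = tuple(conv(x).shape[-2:])   # (H_out, W_out)
    weights[name] = conv.weight.numel()        # always 3 * 3 = 9 here
    print(f"{name:10s} output={shapes[name]} weights={weights[name]}")
```

The dilated layer covers a 5×5 area with the same nine weights, which is why its output shrinks as if the kernel were 5×5.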

3. Activation Function: ReLU

After convolution, apply a non-linear activation:

ReLU Activation

$$\text{ReLU}(x) = \max(0, x)$$

Here,

  • $x$ = Input value
  • $\text{ReLU}(x)$ = Output value (0 if $x < 0$, $x$ otherwise)

ReLU is preferred because it:

  • Is computationally cheap (a single comparison)
  • Avoids vanishing gradients for positive inputs (the gradient there is exactly 1)
  • Induces sparsity in activations

Variants include Leaky ReLU ($\max(\alpha x, x)$ with $\alpha \approx 0.01$) and GELU (used in transformers).
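A short comparison of the three activations on the same inputs makes the differences concrete. The sample values are arbitrary; the functions are the standard ones in `torch.nn.functional`.

```python
import torch
import torch.nn.functional as F

x = torch.tensor([-2.0, -0.5, 0.0, 1.5])

relu_out = F.relu(x)                               # max(0, x)
leaky_out = F.leaky_relu(x, negative_slope=0.01)   # max(alpha * x, x) with alpha = 0.01
gelu_out = F.gelu(x)                               # smooth gate: x * Phi(x)

print(relu_out)    # negatives are zeroed
print(leaky_out)   # negatives are scaled by 0.01
print(gelu_out)    # small negative outputs for negative inputs
```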


4. Pooling Layers

Pooling reduces spatial dimensions, which lowers computation and gives the network a degree of invariance to small translations.

Max Pooling

$$\text{MaxPool}(X)_{i,j} = \max_{(m,n) \in \mathcal{R}_{i,j}} X(m,n)$$

Here,

  • $X$ = Input feature map
  • $\mathcal{R}_{i,j}$ = Pooling region at position $(i,j)$

Input (4×4):              MaxPool 2×2, stride=2:    Output (2×2):
┌────┬────┬────┬────┐     ┌──────────┬──────────┐   ┌────┬────┐
│  1 │  3 │  2 │  1 │     │ [1,3]    │ [2,1]    │   │  3 │  4 │
├────┼────┼────┼────┤     │ [0,1]    │ [4,2]    │   ├────┼────┤
│  0 │  1 │  4 │  2 │     ├──────────┼──────────┤   │  5 │  4 │
├────┼────┼────┼────┤     │ [5,1]    │ [2,3]    │   └────┴────┘
│  5 │  1 │  2 │  3 │     │ [3,0]    │ [4,1]    │
├────┼────┼────┼────┤     └──────────┴──────────┘
│  3 │  0 │  4 │  1 │
└────┴────┴────┴────┘     MaxPool picks the largest value in each 2×2 window
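The 4×4 input from the diagram can be pooled with `F.max_pool2d` to confirm the result by machine. This is a verification sketch; the reshape to 4D is the only addition. The four non-overlapping 2×2 windows have maxima 3, 4, 5, and 4.

```python
import torch
import torch.nn.functional as F

x = torch.tensor([[1., 3., 2., 1.],
                  [0., 1., 4., 2.],
                  [5., 1., 2., 3.],
                  [3., 0., 4., 1.]])

# max_pool2d expects (batch, channels, H, W)
pooled = F.max_pool2d(x.view(1, 1, 4, 4), kernel_size=2, stride=2).view(2, 2)
print(pooled)
# tensor([[3., 4.],
#         [5., 4.]])
```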

Average Pooling

$$\text{AvgPool}(X)_{i,j} = \frac{1}{|\mathcal{R}_{i,j}|} \sum_{(m,n) \in \mathcal{R}_{i,j}} X(m,n)$$

Here,

  • $X$ = Input feature map
  • $\mathcal{R}_{i,j}$ = Pooling region at position $(i,j)$
  • $|\mathcal{R}_{i,j}|$ = Number of elements in the pooling region

💡 Global Average Pooling (GAP)

Global Average Pooling averages each channel into a single value, reducing a $C \times H \times W$ feature map to $C \times 1 \times 1$. This removes the need for large fully connected layers at the end of the network (typically only a single linear classifier remains), significantly reducing parameters. GAP is a key component of modern architectures like ResNet and EfficientNet.
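Both average pooling and GAP are one-liners in PyTorch. The sketch below uses a small 4×4 example for windowed averaging and a hypothetical 128-channel 7×7 feature map for GAP; the shapes are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.tensor([[1., 3., 2., 1.],
                  [0., 1., 4., 2.],
                  [5., 1., 2., 3.],
                  [3., 0., 4., 1.]]).view(1, 1, 4, 4)

avg = F.avg_pool2d(x, kernel_size=2, stride=2).view(2, 2)   # mean of each 2x2 window

# Global average pooling: one value per channel, whatever the spatial size.
features = torch.randn(8, 128, 7, 7)                 # (batch, C, H, W), hypothetical
gap = nn.AdaptiveAvgPool2d(1)(features)              # -> (8, 128, 1, 1)
manual = features.mean(dim=(2, 3), keepdim=True)     # same thing, written explicitly
print(avg, tuple(gap.shape))
```

Because GAP output size does not depend on $H$ and $W$, a network ending in GAP can accept inputs of varying resolution.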


5. Building a Complete CNN

Architecture Pipeline

Input Image → [Conv → ReLU → Pool] × N → Flatten → FC → Output
   3×32×32       32×32×32  32×16×16          256    10

Complete CNN in PyTorch

import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            # Block 1: 3 → 32 channels
            nn.Conv2d(3, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),  # 32×32×32 → 16×16×32

            # Block 2: 32 → 64 channels
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),  # 16×16×64 → 8×8×64

            # Block 3: 64 → 128 channels
            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),  # 8×8×128 → 4×4×128
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * 4 * 4, 256),
            nn.ReLU(inplace=True),
            nn.Dropout(0.5),
            nn.Linear(256, num_classes),
        )

    def forward(self, x):
        x = self.features(x)
        x = self.classifier(x)
        return x

model = SimpleCNN(num_classes=10)
x = torch.randn(1, 3, 32, 32)
print(model(x).shape)  # torch.Size([1, 10])
print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")
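The parameter total printed above can also be derived by hand, which shows where the parameters sit. The helper names below (`conv_params`, `linear_params`, `bn_params`) are made up for this sketch; the layer sizes are those of `SimpleCNN`.

```python
import torch.nn as nn

def conv_params(c_in: int, c_out: int, k: int) -> int:
    """Weights (c_out * c_in * k * k) plus one bias per output channel."""
    return c_out * c_in * k * k + c_out

def linear_params(n_in: int, n_out: int) -> int:
    """Weight matrix plus one bias per output unit."""
    return n_in * n_out + n_out

def bn_params(c: int) -> int:
    """Learnable scale and shift per channel (running statistics are buffers)."""
    return 2 * c

conv_total = conv_params(3, 32, 3) + conv_params(32, 64, 3) + conv_params(64, 128, 3)
bn_total = bn_params(32) + bn_params(64) + bn_params(128)
fc_total = linear_params(128 * 4 * 4, 256) + linear_params(256, 10)
total = conv_total + bn_total + fc_total
print(f"conv={conv_total:,} bn={bn_total:,} fc={fc_total:,} total={total:,}")

# Cross-check one formula against an actual layer.
layer = nn.Conv2d(32, 64, kernel_size=3, padding=1)
layer_count = sum(p.numel() for p in layer.parameters())
```

Most of the 620,810 parameters belong to the first fully connected layer, not to the convolutions.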

6. Landmark CNN Architectures

LeNet-5 (1998)

One of the first successful CNNs, designed for handwritten digit recognition.

Input → Conv(6) → AvgPool → Conv(16) → AvgPool → FC(120) → FC(84) → FC(10)
1×32×32  6×28×28   6×14×14   16×10×10   16×5×5

💡 LeNet-5 Key Insight

Alternating convolution and subsampling progressively extracts higher-level features.
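The LeNet-5 layout can be sketched in PyTorch as follows. This is a modern simplification, not a faithful reproduction: plain average pooling, tanh activations, and full connectivity between feature maps are assumptions here, and the 1998 network differs in those details.

```python
import torch
import torch.nn as nn

class LeNet5(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5),    # 1x32x32 -> 6x28x28
            nn.Tanh(),
            nn.AvgPool2d(2),                   # -> 6x14x14
            nn.Conv2d(6, 16, kernel_size=5),   # -> 16x10x10
            nn.Tanh(),
            nn.AvgPool2d(2),                   # -> 16x5x5
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 5 * 5, 120),
            nn.Tanh(),
            nn.Linear(120, 84),
            nn.Tanh(),
            nn.Linear(84, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

lenet = LeNet5()
logits = lenet(torch.zeros(1, 1, 32, 32))
n_params = sum(p.numel() for p in lenet.parameters())
print(tuple(logits.shape), n_params)
```

This simplified version has 61,706 parameters, in line with the roughly 60K usually quoted for LeNet-5.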

VGG-16 (2014)

A very deep network (for its time) that uses only 3×3 convolutions.

Block 1: Conv(64) ×2  → MaxPool  → 224→112
Block 2: Conv(128) ×2 → MaxPool  → 112→56
Block 3: Conv(256) ×3 → MaxPool  → 56→28
Block 4: Conv(512) ×3 → MaxPool  → 28→14
Block 5: Conv(512) ×3 → MaxPool  → 14→7
         → Flatten → FC(4096) → FC(4096) → FC(1000)

💡 VGG-16 Key Insight

Depth matters. Two stacked 3×3 convolutions have the same receptive field as one 5×5, but with fewer parameters and more non-linearity.

$$\text{VGG-16 parameters:} \quad \approx 138\text{M}$$

ℹ️ Parameter Efficiency Comparison

VGG-16 has 138M parameters, mostly in the fully connected layers. In contrast, ResNet-50 achieves better accuracy with only 25.6M parameters by using 1x1 convolutions for dimensionality reduction and global average pooling instead of FC layers. This demonstrates that architecture design matters more than raw parameter count.
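The claim that VGG-16's parameters sit mostly in the fully connected layers can be checked with plain arithmetic over the block layout listed above. The configuration list format (channel counts with `"M"` for max pooling) is a convention assumed for this sketch; no model is instantiated.

```python
# VGG-16 layout: output channels per 3x3 conv layer, "M" marks a 2x2 max pool.
VGG16_CFG = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
             512, 512, 512, "M", 512, 512, 512, "M"]

conv_params, c_in, size = 0, 3, 224
for item in VGG16_CFG:
    if item == "M":
        size //= 2                                  # pooling halves height and width
    else:
        conv_params += item * c_in * 3 * 3 + item   # weights + biases
        c_in = item

fc_sizes = [c_in * size * size, 4096, 4096, 1000]   # 512 * 7 * 7 = 25088 inputs
fc_params = sum(n_in * n_out + n_out for n_in, n_out in zip(fc_sizes, fc_sizes[1:]))

total_params = conv_params + fc_params
print(f"conv: {conv_params:,}  fc: {fc_params:,}  total: {total_params:,}")
print(f"FC share: {fc_params / total_params:.1%}")
```

The three FC layers account for roughly 89% of the total, with the first one alone contributing over 100M weights.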

ResNet (2015)

Introduces skip connections to train very deep networks (50, 101, 152 layers).

ResNet Skip Connection

$$\mathbf{y} = \mathcal{F}(\mathbf{x}, \{W_i\}) + \mathbf{x}$$

Here,

  • $\mathbf{x}$ = Input to the residual block
  • $\mathcal{F}(\mathbf{x}, \{W_i\})$ = Residual function (convolution layers)
  • $\mathbf{y}$ = Output of the residual block

Input ──────────────────┐
  │                      │
  Conv → BN → ReLU      │ (skip connection)
  Conv → BN      → (+) → ReLU → Output

💡 ResNet Key Insight

Skip connections address the degradation problem: a block can fall back to an identity mapping, so adding layers should not make performance worse.

ResNet Skip Connection Gradient Flow

With skip connections, the gradient at any layer $l$ has a direct path from the final layer $L$: $\frac{\partial \mathcal{L}}{\partial \mathbf{x}_l} = \frac{\partial \mathcal{L}}{\partial \mathbf{x}_L} \left(1 + \frac{\partial}{\partial \mathbf{x}_l} \sum_{i=l}^{L-1} \mathcal{F}(\mathbf{x}_i)\right)$. This form assumes identity shortcuts with no non-linearity after the addition. The additive identity term carries $\partial \mathcal{L} / \partial \mathbf{x}_L$ straight back to layer $l$, so the gradient is unlikely to vanish even when the residual-branch term is small. This is what makes networks with 100+ layers trainable.
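A basic residual block following the diagram might look like the sketch below. It assumes equal input and output channels with stride 1, so the shortcut is a pure identity; the deeper ResNets (50 layers and up) use bottleneck blocks and projection shortcuts that are not shown here.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """y = ReLU(F(x) + x), with F = Conv-BN-ReLU-Conv-BN and an identity shortcut."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()

    def forward(self, x):
        residual = self.relu(self.bn1(self.conv1(x)))
        residual = self.bn2(self.conv2(residual))
        return self.relu(residual + x)      # skip connection

block = ResidualBlock(16)
x = torch.randn(2, 16, 8, 8)
y = block(x)

# If the residual branch outputs zero, the block reduces to ReLU(x).
nn.init.zeros_(block.bn2.weight)
identity_like = torch.allclose(block(x), torch.relu(x))
print(tuple(y.shape), identity_like)
```

Zeroing the last BatchNorm scale makes the residual branch output zero, which demonstrates how easily a block can represent an (almost) identity mapping.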

Receptive Field Growth

In a VGG-style network with only 3×3 convolutions (stride 1):

  • After 1 layer: receptive field is 3×3
  • After 2 layers: receptive field is 5×5
  • After 3 layers: receptive field is 7×7

In general, $n$ stacked 3×3 convolutions have a receptive field of $(2n+1) \times (2n+1)$. This is why VGG uses stacks of 3×3 convolutions: they achieve the same receptive field as larger kernels but with fewer parameters and more non-linearities. Per input-output channel pair, two 3×3 convolutions have $2 \times 3^2 = 18$ weights vs. $5^2 = 25$ for one 5×5 convolution.
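The receptive-field rule and the weight comparison are easy to tabulate. The function names and the 64-channel example below are assumptions for this sketch; biases are ignored, and every layer is taken to have stride 1.

```python
def receptive_field(num_layers: int, kernel_size: int = 3) -> int:
    """Receptive field of stacked stride-1 convolutions: 1 + n * (k - 1)."""
    return 1 + num_layers * (kernel_size - 1)

def stack_weights(num_layers: int, kernel_size: int, channels: int = 1) -> int:
    """Weights in a stack with `channels` inputs and outputs per layer (no biases)."""
    return num_layers * kernel_size**2 * channels**2

for n in (1, 2, 3):
    print(f"{n} layer(s): {receptive_field(n)}x{receptive_field(n)}")

two_3x3 = stack_weights(2, 3)                      # vs. one 5x5
one_5x5 = stack_weights(1, 5)
three_3x3_c64 = stack_weights(3, 3, channels=64)   # vs. one 7x7 at 64 channels
one_7x7_c64 = stack_weights(1, 7, channels=64)
print(two_3x3, one_5x5, three_3x3_c64, one_7x7_c64)
```

With 64 channels, three 3×3 layers use 110,592 weights against 200,704 for a single 7×7 layer with the same receptive field.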

Parameter Comparison

Architecture | Depth | Parameters | Top-5 Error (ImageNet)
LeNet-5 | 5 | 60K | N/A
VGG-16 | 16 | 138M | 7.3%
ResNet-50 | 50 | 25.6M | 5.3%
ResNet-152 | 152 | 60.2M | 4.5%

7. Training a CNN on CIFAR-10

import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
from torch.utils.data import DataLoader

# Data augmentation + normalization
transform_train = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465),
                         (0.2470, 0.2435, 0.2616)),
])

transform_test = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465),
                         (0.2470, 0.2435, 0.2616)),
])

trainset = torchvision.datasets.CIFAR10(root='./data', train=True,
                                         download=True, transform=transform_train)
testset = torchvision.datasets.CIFAR10(root='./data', train=False,
                                        download=True, transform=transform_test)

trainloader = DataLoader(trainset, batch_size=128, shuffle=True, num_workers=2)
testloader = DataLoader(testset, batch_size=128, shuffle=False, num_workers=2)

# Model, loss, optimizer
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = SimpleCNN(num_classes=10).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)

# Training loop
for epoch in range(50):
    model.train()
    running_loss, correct, total = 0.0, 0, 0

    for images, labels in trainloader:
        images, labels = images.to(device), labels.to(device)

        optimizer.zero_grad()
        outputs = model(images)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        running_loss += loss.item()
        _, predicted = outputs.max(1)
        total += labels.size(0)
        correct += predicted.eq(labels).sum().item()

    scheduler.step()
    train_acc = 100. * correct / total

    # Evaluation
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for images, labels in testloader:
            images, labels = images.to(device), labels.to(device)
            outputs = model(images)
            _, predicted = outputs.max(1)
            total += labels.size(0)
            correct += predicted.eq(labels).sum().item()

    test_acc = 100. * correct / total
    print(f"Epoch {epoch+1:2d} | Loss: {running_loss/len(trainloader):.3f} | "
          f"Train: {train_acc:.1f}% | Test: {test_acc:.1f}%")

8. Transfer Learning

Instead of training from scratch, use a pre-trained model:

import torch
import torch.nn as nn
import torch.optim as optim
import torchvision.models as models

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Load ResNet-18 pre-trained on ImageNet
# (torchvision >= 0.13; older releases used pretrained=True)
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze the pre-trained backbone
for param in model.parameters():
    param.requires_grad = False

# Replace the final layer for CIFAR-10; a new layer is trainable by default
num_features = model.fc.in_features
model.fc = nn.Linear(num_features, 10)
model = model.to(device)

# Optimize only the new classifier head
optimizer = optim.Adam(model.fc.parameters(), lr=1e-3)

9. Data Augmentation Strategies

Original:     Flipped:      Rotated:      Cropped:      Color Jitter:
┌─────────┐   ┌─────────┐   ┌─────────┐   ┌─────────┐   ┌─────────┐
│  🐱     │   │     🐱  │   │   🐱    │   │  🐱     │   │  🐱     │
│         │   │         │   │ /       │   │(cropped)│   │(shifted │
│         │   │         │   │/        │   │         │   │ color)  │
└─────────┘   └─────────┘   └─────────┘   └─────────┘   └─────────┘

Common augmentations: horizontal flip, random crop, color jitter, rotation, cutout, random erasing.


10. Key Takeaways

📋 Summary: CNNs for Image Data

  • Convolution extracts local spatial features using shared-weight filters; each spatial output dimension is $\lfloor (H - k + 2p)/s + 1 \rfloor$
  • Pooling (max, average, global average) reduces spatial dimensions, adds translation invariance, and reduces computation
  • Deeper networks learn hierarchical features (edges → textures → parts → objects); stacking 3×3 convolutions is more efficient than using larger kernels
  • ResNet skip connections address the degradation problem by letting gradients flow through identity shortcuts, which makes 100+ layer networks trainable
  • Batch Normalization stabilizes training by normalizing activations, allowing higher learning rates
  • Transfer learning leverages pre-trained features for new tasks with limited data; lower layers capture universal features, higher layers capture task-specific features
  • Data augmentation (random crop, flip, color jitter) is an effective regularizer and helps reduce overfitting on small datasets

11. Practice Exercises

Exercise 1: Build a CNN from Scratch

# TODO: Build a CNN that achieves >85% accuracy on CIFAR-10
# Requirements:
# - At least 3 convolutional blocks
# - Use BatchNorm and Dropout
# - Train for 30+ epochs
class YourCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        # Your code here
        pass

    def forward(self, x):
        # Your code here
        pass

Exercise 2: Experiment with Architectures

  • Replace max pooling with strided convolutions
  • Try different kernel sizes (1×1, 3×3, 5×5)
  • Add squeeze-and-excitation blocks
  • Compare parameter counts and accuracy

Exercise 3: Visualize Learned Filters

# TODO: Extract and visualize first-layer filters
first_conv = model.features[0]
filters = first_conv.weight.data.cpu()
# Plot the 3×3 filters using matplotlib

Exercise 4: Grad-CAM Visualization

# TODO: Implement Grad-CAM to see which regions
# the model focuses on for classification
# Hint: Use hooks to capture intermediate gradients
