CNN Architecture Deep Dive — LeNet to ResNet to EfficientNet
Convolutional Neural Networks are the foundation of computer vision. This tutorial covers the convolution operation in depth and the evolution of CNN architectures.
See our CNN tutorial for a general overview of convolutional networks.
The Convolution Operation
Definition: 2D Convolution
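Given an input image $X$ (height $H$, width $W$, channels $C$) and a kernel $w$ (size $K \times K$, $C$ input channels, $F$ output channels), the convolution at position $(i, j)$ for output channel $f$ is:
$$Y_{i,j,f} = \sum_{c=1}^{C} \sum_{m=0}^{K-1} \sum_{n=0}^{K-1} X_{i+m,\,j+n,\,c} \; w_{m,n,c,f}$$
Each output channel is a different filter applied to all input channels, with the per-channel results summed together.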
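The definition above can be sketched directly as nested loops. This is a minimal pure-Python illustration, assuming no padding, stride 1, and a channels-last nested-list layout; the name `conv2d_naive` and the optional `bias` argument are hypothetical, and real frameworks use far faster implementations.

```python
def conv2d_naive(image, kernels, bias=None):
    """Valid (no padding), stride-1 convolution, computed as cross-correlation.

    image:   H x W x C nested lists
    kernels: F x K x K x C nested lists (one K x K x C filter per output channel)
    returns: H_out x W_out x F nested lists
    """
    H, W, C = len(image), len(image[0]), len(image[0][0])
    F, K = len(kernels), len(kernels[0])
    if bias is None:
        bias = [0.0] * F
    H_out, W_out = H - K + 1, W - K + 1
    out = [[[0.0] * F for _ in range(W_out)] for _ in range(H_out)]
    for i in range(H_out):
        for j in range(W_out):
            for f in range(F):
                total = bias[f]
                # Sum over the kernel window and over all input channels
                for m in range(K):
                    for n in range(K):
                        for c in range(C):
                            total += image[i + m][j + n][c] * kernels[f][m][n][c]
                out[i][j][f] = total
    return out


# 3x3 single-channel image, one 2x2 filter of ones: each output is a window sum
image = [[[1.0], [2.0], [3.0]],
         [[4.0], [5.0], [6.0]],
         [[7.0], [8.0], [9.0]]]
kernels = [[[[1.0], [1.0]],
            [[1.0], [1.0]]]]
result = conv2d_naive(image, kernels)
print(result)  # [[[12.0], [16.0]], [[24.0], [28.0]]]
```

The innermost loop over `c` is what "applied to all input channels, summed together" means in practice: one filter produces one number per position, no matter how many input channels there are.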
Output Size Formula
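$$H_{out} = \left\lfloor \frac{H - K + 2P}{S} \right\rfloor + 1, \qquad W_{out} = \left\lfloor \frac{W - K + 2P}{S} \right\rfloor + 1$$
The output has shape $H_{out} \times W_{out} \times F$.
Here,
- $H, W$ = Input spatial dimensions
- $K$ = Kernel size
- $P$ = Zero-padding
- $S$ = Stride
- $F$ = Number of output filters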
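The output size formula is easy to check numerically. The helper below is a small sketch for one spatial dimension; the function name `conv_output_size` and its argument names are illustrative choices, not part of any library.

```python
def conv_output_size(n, k, p=0, s=1):
    """Output length along one spatial dimension: floor((n - k + 2p) / s) + 1."""
    return (n - k + 2 * p) // s + 1


print(conv_output_size(32, 5))            # 28  (5x5 kernel, no padding)
print(conv_output_size(32, 3, p=1))       # 32  (3x3 kernel, padding 1)
print(conv_output_size(32, 3, p=1, s=2))  # 16  (stride 2 halves the size)
```

Applying the function once per layer gives the spatial size at any depth of a network.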
Padding and Stride
Definition: Padding
- Valid: No padding, output is smaller than input
- Same: Pad so output has same spatial size as input ($P = (K - 1)/2$ for stride 1 and odd $K$)
- Zero-padding: Pad with zeros (most common)
Definition: Stride
Stride is the step size of the convolution. Stride 2 reduces spatial dimensions by approximately half, acting as a learnable downsampling.
Stride 1: [1][2][3][4][5] → [■][■][■]   (window moves 1 position)
Stride 2: [1][2][3][4][5] → [■][■]      (window moves 2 positions)
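The diagram can be reproduced by listing the windows a 1D convolution visits. The sketch below assumes a kernel of size 3 and no padding, which is consistent with the output counts shown (3 and 2) but is not stated in the diagram; `sliding_windows` is a hypothetical helper name.

```python
def sliding_windows(sequence, kernel_size, stride):
    """Return the input windows a 1D convolution visits (no padding)."""
    last_start = len(sequence) - kernel_size
    return [sequence[start:start + kernel_size]
            for start in range(0, last_start + 1, stride)]


inputs = [1, 2, 3, 4, 5]
print(sliding_windows(inputs, 3, stride=1))  # [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
print(sliding_windows(inputs, 3, stride=2))  # [[1, 2, 3], [3, 4, 5]]
```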
Pooling
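Definition: Pooling Layers
Pooling reduces spatial dimensions and provides a degree of translation invariance. For a pooling window $R$:
- Max Pooling: $y = \max_{(m,n) \in R} x_{m,n}$, keeps the strongest activation
- Average Pooling: $y = \frac{1}{|R|} \sum_{(m,n) \in R} x_{m,n}$, smooths activations
- Global Average Pooling (GAP): Averages each feature map to a single value; commonly replaces the large fully connected layers before the classifier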
💡 Global Average Pooling (GAP)
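GAP reduces each feature map to a single value by averaging. For a tensor of shape $H \times W \times C$, GAP produces a $1 \times 1 \times C$ tensor. This removes the parameters that flattening into large fully connected layers would require, and modern architectures (ResNet, EfficientNet) use it in place of those layers, keeping only a final linear classifier.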
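The three pooling variants can be compared on a small feature map. This is a pure-Python sketch, assuming non-overlapping 2×2 windows and a channels-first list of feature maps for GAP; the helper names `pool2d`, `mean`, and `global_average_pool` are hypothetical.

```python
def pool2d(feature_map, size, stride, reduce_fn):
    """Apply reduce_fn to each size x size window of a 2D feature map."""
    H, W = len(feature_map), len(feature_map[0])
    out = []
    for i in range(0, H - size + 1, stride):
        row = []
        for j in range(0, W - size + 1, stride):
            window = [feature_map[i + m][j + n]
                      for m in range(size) for n in range(size)]
            row.append(reduce_fn(window))
        out.append(row)
    return out


def mean(values):
    return sum(values) / len(values)


def global_average_pool(feature_maps):
    """Reduce each of C feature maps (C x H x W) to a single average value."""
    return [mean([v for row in fm for v in row]) for fm in feature_maps]


fm = [[1, 3, 2, 4],
      [5, 7, 6, 8],
      [9, 2, 1, 0],
      [3, 4, 5, 6]]
max_pooled = pool2d(fm, 2, 2, max)
avg_pooled = pool2d(fm, 2, 2, mean)
gap = global_average_pool([fm])
print(max_pooled)  # [[7, 8], [9, 6]]
print(avg_pooled)  # [[4.0, 5.0], [4.5, 3.0]]
print(gap)         # [4.125]
```

Note that GAP has no parameters and its output length depends only on the number of channels, not on the spatial size of the input.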
Architecture Evolution
LeNet-5 (1998)
Definition: LeNet-5
Yann LeCun's pioneering CNN for digit recognition:
Input (32x32) → [Conv 5x5, 6] → [AvgPool 2x2] →
[Conv 5x5, 16] → [AvgPool 2x2] → [FC 120] → [FC 84] → [FC 10]
- 5 layers (2 conv, 3 FC), ~60K parameters
- One of the first practical CNNs, used for check reading
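The ~60K figure can be roughly reproduced by counting weights and biases layer by layer. The sketch below makes simplifying assumptions that differ from the original network: a single-channel input, a fully connected second convolution (the original used a sparse connection table), and parameter-free pooling. The helper names are hypothetical.

```python
def conv_params(kernel, in_ch, out_ch):
    """Weights plus one bias per output channel."""
    return kernel * kernel * in_ch * out_ch + out_ch


def fc_params(in_features, out_features):
    """Weights plus one bias per output unit."""
    return in_features * out_features + out_features


# Spatial sizes: 32 -> 28 (conv) -> 14 (pool) -> 10 (conv) -> 5 (pool)
layers = {
    "conv1 (5x5, 1 -> 6)": conv_params(5, 1, 6),
    "conv2 (5x5, 6 -> 16)": conv_params(5, 6, 16),
    "fc1 (16*5*5 -> 120)": fc_params(16 * 5 * 5, 120),
    "fc2 (120 -> 84)": fc_params(120, 84),
    "fc3 (84 -> 10)": fc_params(84, 10),
}
total = sum(layers.values())
for name, count in layers.items():
    print(f"{name}: {count:,}")
print(f"total: {total:,}")  # total: 61,706
```

Most of the parameters sit in the first fully connected layer, a pattern that recurs in AlexNet and VGG.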
AlexNet (2012)
Definition: AlexNet
The architecture that launched the deep learning revolution:
Input (224x224x3) → [Conv 11x11, 96] → [MaxPool 3x3] →
[Conv 5x5, 256] → [MaxPool 3x3] →
[Conv 3x3, 384] → [Conv 3x3, 384] → [Conv 3x3, 256] →
[MaxPool 3x3] → [FC 4096] → [FC 4096] → [FC 1000]
- 8 learned layers (5 conv, 3 FC), ~60M parameters
- Used ReLU, dropout, data augmentation, GPU training
- Won ImageNet 2012 with top-5 error of 15.3% (vs. 26.2% runner-up)
VGGNet (2014)
Definition: VGGNet
Very deep networks using only 3×3 convolutions:
Key insight: Two 3×3 convolutions have the same receptive field as one 5×5, but with fewer parameters and more non-linearity.
- VGG-16: 16 layers, 138M parameters
- VGG-19: 19 layers, 144M parameters
- All convolutions are 3×3 with stride 1 and padding 1
- Max pooling with stride 2 for downsampling
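Theorem: Receptive Field of Stacked 3×3 Convolutions
Two stacked 3×3 convolutions have a receptive field of 5×5. Three stacked 3×3 convolutions have a receptive field of 7×7. In general, $n$ stacked stride-1 3×3 convolutions have a receptive field of $(2n+1) \times (2n+1)$. With $C$ input and output channels, the number of weights for $n$ layers of 3×3 convolutions is $9nC^2$, which is less than the $(2n+1)^2 C^2$ of a single layer with a larger kernel covering the same receptive field (for $n \geq 2$).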
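A quick calculation makes the comparison concrete. The sketch below assumes stride-1 convolutions with the same number of channels (here 64, chosen arbitrarily) going in and out, and ignores biases; the function names are hypothetical.

```python
def stacked_receptive_field(num_layers, kernel=3):
    """Receptive field of num_layers stacked stride-1 convolutions."""
    return 1 + num_layers * (kernel - 1)


def conv_weights(kernel, channels):
    """Weights (no bias) of a conv layer with `channels` inputs and outputs."""
    return kernel * kernel * channels * channels


C = 64
for n in (2, 3):
    rf = stacked_receptive_field(n)
    stacked = n * conv_weights(3, C)
    single = conv_weights(rf, C)
    print(f"{n} x 3x3: RF {rf}x{rf}, {stacked:,} weights "
          f"vs {single:,} for one {rf}x{rf}")
```

For three layers this is 110,592 weights against 200,704, a saving of about 45%, with two extra non-linearities in between.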
ResNet (2015)
Definition: ResNet (Residual Network)
ResNet introduced skip connections to enable training of very deep networks:
$$y = F(x) + x$$
where $F(x)$ is the residual function (typically 2-3 conv layers). Skip connections allow gradients to flow directly through the identity shortcut, which greatly mitigates the vanishing gradient problem.
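Theorem: ResNet Skip Connection Gradient Flow
With skip connections $x_{l+1} = x_l + F(x_l)$, the gradient of the loss $\mathcal{L}$ at layer $l$ contains:
$$\frac{\partial \mathcal{L}}{\partial x_l} = \frac{\partial \mathcal{L}}{\partial x_L} \left( 1 + \frac{\partial}{\partial x_l} \sum_{i=l}^{L-1} F(x_i) \right)$$
The identity term $1$ passes the gradient from the deepest layer $L$ back to layer $l$ unattenuated, so it is unlikely to vanish however deep the network is. This enables training networks with 100+ layers.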
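The effect of the identity term can be seen in a toy scalar model. The sketch below assumes, purely for illustration, that every layer has the same small local derivative (0.01); real networks have varying, matrix-valued Jacobians, so this shows the mechanism rather than realistic magnitudes.

```python
def gradient_through_layers(local_derivative, depth, skip):
    """Product of per-layer derivatives for a scalar chain.

    Plain layer:    x_{l+1} = F(x_l)        -> derivative F'(x_l)
    Residual layer: x_{l+1} = x_l + F(x_l)  -> derivative 1 + F'(x_l)
    """
    factor = 1.0 + local_derivative if skip else local_derivative
    grad = 1.0
    for _ in range(depth):
        grad *= factor
    return grad


plain = gradient_through_layers(0.01, depth=50, skip=False)
residual = gradient_through_layers(0.01, depth=50, skip=True)
print(f"plain:    {plain:.3e}")     # vanishingly small
print(f"residual: {residual:.3f}")  # stays on the order of 1
```

Without the shortcut the gradient is a product of 50 small factors; with it, each factor is close to 1, so the signal reaching early layers stays usable.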
EfficientNet (2019)
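Definition: EfficientNet
EfficientNet uses compound scaling to uniformly scale width, depth, and resolution:
$$d = \alpha^{\phi}, \qquad w = \beta^{\phi}, \qquad r = \gamma^{\phi}$$
subject to $\alpha \cdot \beta^2 \cdot \gamma^2 \approx 2$ with $\alpha \geq 1$, $\beta \geq 1$, $\gamma \geq 1$, where $d$, $w$, and $r$ are the depth, width, and resolution multipliers. With $\phi$ as a compound coefficient, EfficientNet achieves a better accuracy-efficiency tradeoff than scaling any single dimension.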
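Compound scaling can be illustrated with a few lines of arithmetic. The base values below (1.2, 1.1, 1.15 for depth, width, and resolution) are the ones reported by Tan & Le for the B0 baseline; treat them as illustrative inputs here, and note that `compound_scale` is a hypothetical helper, not an API from any library.

```python
# Depth, width, and resolution bases (values reported in the EfficientNet paper)
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15


def compound_scale(phi):
    """Return (depth, width, resolution) multipliers for compound coefficient phi."""
    return ALPHA ** phi, BETA ** phi, GAMMA ** phi


# FLOPs grow roughly with depth * width^2 * resolution^2, so this should be near 2
flops_factor = ALPHA * BETA ** 2 * GAMMA ** 2
print(f"alpha * beta^2 * gamma^2 = {flops_factor:.3f}")  # 1.920

for phi in range(0, 4):
    d, w, r = compound_scale(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, resolution x{r:.2f}")
```

Because the constraint keeps the product near 2, each increment of the compound coefficient roughly doubles the compute budget.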
PyTorch ResNet Example
📝Example: Custom ResNet Block in PyTorch
import torch
import torch.nn as nn


class ResidualBlock(nn.Module):
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, 3,
                               stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3,
                               stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        # Shortcut connection
        self.shortcut = nn.Sequential()
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, 1,
                          stride=stride, bias=False),
                nn.BatchNorm2d(out_channels)
            )

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out += self.shortcut(x)  # Skip connection
        out = torch.relu(out)
        return out


class MiniResNet(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.prep = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1, bias=False),
            nn.BatchNorm2d(64),
            nn.ReLU()
        )
        self.layer1 = self._make_layer(64, 64, 2, stride=1)
        self.layer2 = self._make_layer(64, 128, 2, stride=2)
        self.layer3 = self._make_layer(128, 256, 2, stride=2)
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(256, num_classes)

    def _make_layer(self, in_ch, out_ch, num_blocks, stride):
        layers = [ResidualBlock(in_ch, out_ch, stride)]
        for _ in range(1, num_blocks):
            layers.append(ResidualBlock(out_ch, out_ch, 1))
        return nn.Sequential(*layers)

    def forward(self, x):
        x = self.prep(x)
        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)
        x = self.avg_pool(x)
        x = x.view(x.size(0), -1)
        return self.fc(x)


# Test
model = MiniResNet(num_classes=10)
x = torch.randn(4, 3, 32, 32)
out = model(x)
print(f"Output shape: {out.shape}")  # [4, 10]
params = sum(p.numel() for p in model.parameters())
print(f"Parameters: {params:,}")
Summary
📋Summary: CNN Architecture Deep Dive
- Convolution: Local pattern detection with parameter sharing and translation equivariance
- Output shape: $\left\lfloor (H - K + 2P)/S \right\rfloor + 1$ per spatial dimension, with $F$ output channels
- Architecture evolution: LeNet → AlexNet → VGG → ResNet → EfficientNet
- VGG insight: Stacked 3×3 convolutions cover the receptive field of a larger kernel with fewer parameters and more non-linearity
- ResNet: Skip connections mitigate vanishing gradients, enable 100+ layer networks
- EfficientNet: Compound scaling of width, depth, and resolution
- Modern practice: Use pre-trained models (transfer learning) as the starting point
- Trend: Vision Transformers (ViT) are challenging CNN dominance
Practice Exercises
- Mathematical: Compute the output shape of a CNN with input through: Conv2d(3, 64, 7, stride=2, padding=3) → MaxPool2d(3, stride=2, padding=1) → Conv2d(64, 128, 3, stride=1, padding=1).
- Coding: Implement a VGG-16-like architecture from scratch. Count parameters and compare with the original VGG-16. How many parameters are in the fully connected layers vs. convolutional layers?
- Experiment: Train ResNet-18 on CIFAR-10 with and without skip connections. Compare convergence speed and final accuracy. Plot the training curves.
- Research: Look up the EfficientNet paper (Tan & Le, 2019). How is the compound coefficient determined? What are the scaling coefficients for EfficientNet-B0 through B7?
- Architecture Design: Design a CNN for 100×100 color images with 50 classes. Use the principles from this tutorial (3×3 convolutions, skip connections, GAP). Justify each design choice.