CNN Architecture Deep Dive — LeNet to ResNet to EfficientNet


Convolutional Neural Networks (CNNs) underpin much of modern computer vision. This tutorial covers the convolution operation in depth and traces the evolution of CNN architectures from LeNet to EfficientNet.

See our CNN tutorial for a general overview of convolutional networks.


The Convolution Operation

Definition: 2D Convolution

Given an input image $\mathbf{I}$ (height $H$, width $W$, channels $C$) and a kernel $\mathbf{K}$ (size $k \times k$, $C$ input channels, $F$ output channels), the convolution at position $(i, j)$ is:

$$\mathbf{O}(i, j) = \sum_{m=0}^{k-1} \sum_{n=0}^{k-1} \sum_{c=0}^{C-1} \mathbf{I}(i+m, j+n, c) \cdot \mathbf{K}(m, n, c)$$

Each output channel is a different filter applied to all input channels, summed together.
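
The triple sum above can be sketched directly as nested loops. This is a minimal pure-Python illustration for a single filter, assuming no padding and stride 1; the function name `conv2d_single_filter` and the nested-list layout `image[row][col][channel]` are choices made for this example, not part of any library.

```python
def conv2d_single_filter(image, kernel):
    """Valid (no padding), stride-1 convolution of one filter.

    image:  nested list indexed as image[row][col][channel], shape H x W x C
    kernel: nested list indexed as kernel[m][n][channel], shape k x k x C
    """
    H, W, C = len(image), len(image[0]), len(image[0][0])
    k = len(kernel)
    out_h, out_w = H - k + 1, W - k + 1
    output = [[0.0] * out_w for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            total = 0.0
            # Sum over the k x k window and over all input channels
            for m in range(k):
                for n in range(k):
                    for c in range(C):
                        total += image[i + m][j + n][c] * kernel[m][n][c]
            output[i][j] = total
    return output


# 4 x 4 single-channel image holding 0..15, and a 3 x 3 kernel of ones
image = [[[float(4 * r + col)] for col in range(4)] for r in range(4)]
kernel = [[[1.0] for _ in range(3)] for _ in range(3)]
result = conv2d_single_filter(image, kernel)
print(result)  # [[45.0, 54.0], [81.0, 90.0]]
```

A layer with $F$ output channels would repeat this computation once per filter and stack the results.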

$$\text{Output} = \left\lfloor\frac{H - k + 2p}{s} + 1\right\rfloor \times \left\lfloor\frac{W - k + 2p}{s} + 1\right\rfloor \times F$$

Output Size Formula

$$\text{Output size} = \left\lfloor\frac{\text{Input size} - \text{Kernel size} + 2 \times \text{Padding}}{\text{Stride}} + 1\right\rfloor$$

Here,

  • $H, W$ = Input spatial dimensions
  • $k$ = Kernel size
  • $p$ = Zero-padding
  • $s$ = Stride
  • $F$ = Number of output filters
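
As a quick check of the output size formula, the following sketch evaluates it with integer floor division. The helper names `conv_output_size` and `conv_output_shape` are hypothetical, and the example inputs are arbitrary values chosen for illustration.

```python
def conv_output_size(size, kernel, padding=0, stride=1):
    """floor((size - kernel + 2 * padding) / stride + 1) along one spatial axis."""
    return (size - kernel + 2 * padding) // stride + 1


def conv_output_shape(height, width, kernel, padding, stride, filters):
    """Full (H_out, W_out, F) output shape for a square kernel."""
    return (
        conv_output_size(height, kernel, padding, stride),
        conv_output_size(width, kernel, padding, stride),
        filters,
    )


# 5x5 kernel, no padding, stride 1 on a 32-pixel axis
print(conv_output_size(32, kernel=5))  # 28

# 3x3 kernel, padding 1, stride 2, 16 filters on a 64x48 input
shape = conv_output_shape(64, 48, kernel=3, padding=1, stride=2, filters=16)
print(shape)  # (32, 24, 16)
```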

Padding and Stride

Definition: Padding

  • Valid: No padding, output is smaller than input
  • Same: Pad so output has same spatial size as input ($p = \lfloor k/2 \rfloor$ for stride 1 and odd $k$)
  • Zero-padding: Pad with zeros (most common)
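
The difference between valid and same padding can be checked numerically. This sketch reuses the output size formula with a hypothetical helper `conv_output_size`; the input size of 32 is an arbitrary choice. It also shows that the rule $p = \lfloor k/2 \rfloor$ preserves the size only for odd kernels.

```python
def conv_output_size(size, kernel, padding=0, stride=1):
    return (size - kernel + 2 * padding) // stride + 1


size = 32

# Valid padding: the output shrinks by kernel - 1
valid = conv_output_size(size, kernel=3, padding=0)

# Same padding with p = k // 2 keeps the size for odd kernels at stride 1
same_sizes = {k: conv_output_size(size, kernel=k, padding=k // 2) for k in (1, 3, 5, 7)}

# For an even kernel, symmetric padding of k // 2 overshoots by one
even_kernel = conv_output_size(size, kernel=4, padding=2)

print(valid)        # 30
print(same_sizes)   # {1: 32, 3: 32, 5: 32, 7: 32}
print(even_kernel)  # 33
```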

Definition: Stride

Stride is the step size of the convolution. Stride 2 reduces spatial dimensions by approximately half, acting as a learnable downsampling.

Architecture Diagram
Stride 1:                          Stride 2:
[1][2][3][4][5] → [■][■][■]      [1][2][3][4][5] → [■][■]
 stride moves 1 position           stride moves 2 positions
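
The diagram can be reproduced by listing the windows the kernel visits. This sketch assumes a 1D input and a kernel width of 3, which is what the diagram implies (five inputs give three outputs at stride 1); `sliding_windows` is a hypothetical helper.

```python
def sliding_windows(values, kernel_size, stride):
    """Return every window of length kernel_size, stepping by stride."""
    last_start = len(values) - kernel_size
    return [values[s:s + kernel_size] for s in range(0, last_start + 1, stride)]


values = [1, 2, 3, 4, 5]
stride1 = sliding_windows(values, kernel_size=3, stride=1)
stride2 = sliding_windows(values, kernel_size=3, stride=2)

print(stride1)  # [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
print(stride2)  # [[1, 2, 3], [3, 4, 5]]
```

Each window yields one output value, so stride 1 produces three outputs and stride 2 produces two.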

Pooling

Definition: Pooling Layers

Pooling reduces spatial dimensions and provides a degree of local translation invariance:

  • Max Pooling: $\text{MaxPool}(x) = \max_{(i,j) \in \text{window}} x_{i,j}$, which keeps the strongest activation
  • Average Pooling: $\text{AvgPool}(x) = \frac{1}{|W|} \sum_{(i,j) \in \text{window}} x_{i,j}$, where $|W|$ is the number of elements in the window (not the image width); this smooths activations
  • Global Average Pooling (GAP): Averages each feature map to a single value, which lets it replace the large fully connected layers before the classifier

💡 Global Average Pooling (GAP)

GAP reduces each feature map to a single value by averaging. For a tensor of shape $(B, C, H, W)$, GAP produces $(B, C)$. This dramatically reduces parameters and is used in modern architectures (ResNet, EfficientNet) in place of large fully connected layers, typically followed by a single linear classifier.
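
The three pooling variants can be compared on a small feature map. This is a pure-Python sketch with no padding; `pool2d`, `mean`, and `global_average_pool` are hypothetical helpers, and the 4×4 values are made up for illustration.

```python
def mean(values):
    return sum(values) / len(values)


def pool2d(feature_map, size, stride, reduce_fn):
    """Apply reduce_fn (e.g. max or mean) to each size x size window."""
    H, W = len(feature_map), len(feature_map[0])
    pooled = []
    for i in range(0, H - size + 1, stride):
        row = []
        for j in range(0, W - size + 1, stride):
            window = [feature_map[i + m][j + n]
                      for m in range(size) for n in range(size)]
            row.append(reduce_fn(window))
        pooled.append(row)
    return pooled


def global_average_pool(feature_maps):
    """Reduce a list of C feature maps (each H x W) to C averages."""
    return [mean([v for row in fmap for v in row]) for fmap in feature_maps]


fmap = [
    [1, 3, 2, 0],
    [5, 6, 1, 2],
    [0, 2, 4, 4],
    [1, 1, 8, 0],
]
ones = [[1, 1], [1, 1]]

max_pooled = pool2d(fmap, size=2, stride=2, reduce_fn=max)
avg_pooled = pool2d(fmap, size=2, stride=2, reduce_fn=mean)
gap = global_average_pool([fmap, ones])

print(max_pooled)  # [[6, 2], [2, 8]]
print(avg_pooled)  # [[3.75, 1.25], [1.0, 4.0]]
print(gap)         # [2.5, 1.0]
```

Note that GAP returns one number per channel regardless of the spatial size, which is why the classifier that follows it does not depend on the input resolution.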


Architecture Evolution

LeNet-5 (1998)

Definition: LeNet-5

Yann LeCun's pioneering CNN for digit recognition:

Architecture Diagram
Input (32x32) → [Conv 5x5, 6] → [AvgPool 2x2] →
[Conv 5x5, 16] → [AvgPool 2x2] → [FC 120] → [FC 84] → [FC 10]
  • 5 layers (2 conv, 3 FC), ~60K parameters
  • First practical CNN, used for check reading
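
The ~60K figure can be roughly reproduced from the diagram. This sketch makes several simplifying assumptions that are not stated in the text: a single-channel input, no padding, stride-2 pooling with no trainable parameters, biases on every layer, and full connectivity in the second convolution (the original LeNet-5 used a sparser connection scheme there, so its exact count differs slightly).

```python
def out_size(size, kernel, padding=0, stride=1):
    return (size - kernel + 2 * padding) // stride + 1


def conv_params(kernel, c_in, c_out):
    """Weights plus one bias per output filter."""
    return (kernel * kernel * c_in + 1) * c_out


def fc_params(n_in, n_out):
    """Weights plus one bias per output unit."""
    return (n_in + 1) * n_out


# Trace the spatial size through the diagram
size = 32
size = out_size(size, kernel=5)            # conv 5x5 -> 28
size = out_size(size, kernel=2, stride=2)  # pool 2x2 -> 14
size = out_size(size, kernel=5)            # conv 5x5 -> 10
size = out_size(size, kernel=2, stride=2)  # pool 2x2 -> 5
flattened = 16 * size * size               # 16 maps of 5x5 -> 400

layer_params = {
    "conv1": conv_params(5, 1, 6),
    "conv2": conv_params(5, 6, 16),
    "fc1": fc_params(flattened, 120),
    "fc2": fc_params(120, 84),
    "fc3": fc_params(84, 10),
}
total = sum(layer_params.values())

print(layer_params)
print(f"Total: {total:,}")  # Total: 61,706
```

Under these assumptions most of the parameters sit in the first fully connected layer, not in the convolutions.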

AlexNet (2012)

Definition: AlexNet

The architecture that launched the deep learning revolution:

Architecture Diagram
Input (224x224x3) → [Conv 11x11, 96] → [MaxPool 3x3] →
[Conv 5x5, 256] → [MaxPool 3x3] →
[Conv 3x3, 384] → [Conv 3x3, 384] → [Conv 3x3, 256] →
[MaxPool 3x3] → [FC 4096] → [FC 4096] → [FC 1000]
  • 8 layers, 60M parameters
  • Used ReLU, dropout, data augmentation, GPU training
  • Won ImageNet 2012 with top-5 error of 15.3% (vs. 26.2% runner-up)

VGGNet (2014)

Definition: VGGNet

Very deep networks using only 3×3 convolutions:

Key insight: Two 3×3 convolutions have the same receptive field as one 5×5, but with fewer parameters and more non-linearity.

  • VGG-16: 16 layers, 138M parameters
  • VGG-19: 19 layers, 144M parameters
  • All convolutions are 3×3 with stride 1 and padding 1
  • Max pooling with stride 2 for downsampling

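Theorem: Receptive Field of Stacked 3×3 Convolutions

Two stacked 3×3 convolutions have a receptive field of 5×5. Three stacked 3×3 convolutions have a receptive field of 7×7. In general, $L$ stacked 3×3 convolutions (stride 1) cover a $(2L+1) \times (2L+1)$ receptive field. The number of parameters for $L$ layers of 3×3 convolutions is $L \cdot (C_{\text{in}} \cdot C_{\text{out}} \cdot 9)$ (ignoring biases), which is less than the $(2L+1)^2 \cdot C_{\text{in}} \cdot C_{\text{out}}$ of a single layer with a larger kernel covering the same receptive field, for any $L \geq 2$.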
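
The receptive field and parameter claims can be checked with a few lines of arithmetic. This sketch assumes stride 1, equal input and output channel counts, and no biases; the helper names and the choice of 64 channels are illustrative only.

```python
def stacked_receptive_field(num_layers, kernel=3):
    """Receptive field of num_layers stacked stride-1 convolutions."""
    return 1 + num_layers * (kernel - 1)


def conv_weights(kernel, channels):
    """Weight count of one kernel x kernel conv with channels in and out."""
    return kernel * kernel * channels * channels


channels = 64
comparison = []
for num_layers in (2, 3):
    rf = stacked_receptive_field(num_layers)
    stacked = num_layers * conv_weights(3, channels)
    single = conv_weights(rf, channels)
    comparison.append((num_layers, rf, stacked, single))
    print(f"{num_layers} x 3x3: receptive field {rf}x{rf}, "
          f"{stacked:,} weights vs {single:,} for one {rf}x{rf} conv")

# comparison == [(2, 5, 73728, 102400), (3, 7, 110592, 200704)]
```

The saving grows with depth: two stacked layers use 18/25 of the weights of a 5×5 layer, and three use 27/49 of the weights of a 7×7 layer.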

ResNet (2015)

Definition: ResNet (Residual Network)

ResNet introduced skip connections to enable training of very deep networks:

$$\mathbf{h}^{(l+1)} = \mathbf{h}^{(l)} + \mathcal{F}(\mathbf{h}^{(l)})$$

where $\mathcal{F}$ is the residual function (typically 2-3 conv layers). Skip connections let gradients flow directly through the identity shortcut, which mitigates the vanishing gradient problem and makes very deep networks much easier to optimize.

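Theorem: ResNet Skip Connection Gradient Flow

With pure identity shortcuts, unrolling the recurrence from layer $l$ to a deeper layer $L$ gives $\mathbf{h}^{(L)} = \mathbf{h}^{(l)} + \sum_{i=l}^{L-1} \mathcal{F}(\mathbf{h}^{(i)})$, so the gradient at layer $l$ is:

$$\frac{\partial \mathcal{L}}{\partial \mathbf{h}^{(l)}} = \frac{\partial \mathcal{L}}{\partial \mathbf{h}^{(L)}} \left(1 + \frac{\partial}{\partial \mathbf{h}^{(l)}} \sum_{i=l}^{L-1} \mathcal{F}(\mathbf{h}^{(i)})\right)$$

The identity term $1$ passes $\partial \mathcal{L} / \partial \mathbf{h}^{(L)}$ back to layer $l$ unchanged, so the gradient is unlikely to vanish even when the residual terms are small, however deep the network is. This enables training networks with 100+ layers.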
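
A scalar toy model gives some intuition for the identity term. The sketch below assumes each layer's residual function has the same small local derivative (0.01, an arbitrary value) and compares the backpropagated factor through 20 plain layers, where the derivatives multiply, with 20 residual layers, where each factor is one plus the derivative. It is an illustration of the effect, not a model of a real network.

```python
def backprop_factor(local_derivs, residual):
    """Product of per-layer derivatives along the backward pass.

    Plain layer:    h_next = F(h)      -> factor F'(h)
    Residual layer: h_next = h + F(h)  -> factor 1 + F'(h)
    """
    factor = 1.0
    for d in local_derivs:
        factor *= (1.0 + d) if residual else d
    return factor


local_derivs = [0.01] * 20  # assumed small derivative at every layer

plain = backprop_factor(local_derivs, residual=False)
with_skip = backprop_factor(local_derivs, residual=True)

print(f"plain:    {plain:.3e}")      # vanishingly small, on the order of 1e-40
print(f"residual: {with_skip:.3f}")  # roughly 1.22
```

Expanding the residual product yields a leading $1$ plus terms involving the local derivatives, which mirrors the structure of the gradient expression above.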

EfficientNet (2019)

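Definition: EfficientNet

EfficientNet uses compound scaling to uniformly scale depth, width, and resolution:

$$\text{depth}: d = \alpha^\phi, \quad \text{width}: w = \beta^\phi, \quad \text{resolution}: r = \gamma^\phi$$

subject to $\alpha \cdot \beta^2 \cdot \gamma^2 \approx 2$ with $\alpha, \beta, \gamma \geq 1$. Width and resolution are squared in the constraint because FLOPs grow roughly linearly with depth but quadratically with width and resolution, so each unit increase in the compound coefficient $\phi$ roughly doubles the FLOPs. With $\phi$ controlling all three dimensions together, EfficientNet achieves a better accuracy-efficiency tradeoff than scaling any single dimension.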
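
The scaling rule can be evaluated for a few values of $\phi$. This sketch uses the coefficients reported by Tan & Le (2019) for the baseline search ($\alpha = 1.2$ for depth, $\beta = 1.1$ for width, $\gamma = 1.15$ for resolution); the function `compound_scale` is a hypothetical helper, and released model variants may round these multipliers differently.

```python
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # depth, width, resolution bases


def compound_scale(phi, alpha=ALPHA, beta=BETA, gamma=GAMMA):
    """Multipliers for depth, width, resolution, and approximate FLOPs."""
    return {
        "depth": alpha ** phi,
        "width": beta ** phi,
        "resolution": gamma ** phi,
        # FLOPs scale linearly with depth, quadratically with width and resolution
        "flops": (alpha * beta ** 2 * gamma ** 2) ** phi,
    }


constraint = ALPHA * BETA ** 2 * GAMMA ** 2
print(f"alpha * beta^2 * gamma^2 = {constraint:.3f}")  # close to 2

for phi in range(4):
    scaled = compound_scale(phi)
    print(phi, {name: round(value, 3) for name, value in scaled.items()})
```

Because the constraint product is close to 2, the FLOPs multiplier roughly doubles with each unit increase in `phi`.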


PyTorch ResNet Example

📝Example: Custom ResNet Block in PyTorch

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, 3,
                               stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3,
                               stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)

        # Shortcut connection
        self.shortcut = nn.Sequential()
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, 1,
                         stride=stride, bias=False),
                nn.BatchNorm2d(out_channels)
            )

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out += self.shortcut(x)  # Skip connection
        out = torch.relu(out)
        return out

class MiniResNet(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.prep = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1, bias=False),
            nn.BatchNorm2d(64),
            nn.ReLU()
        )
        self.layer1 = self._make_layer(64, 64, 2, stride=1)
        self.layer2 = self._make_layer(64, 128, 2, stride=2)
        self.layer3 = self._make_layer(128, 256, 2, stride=2)
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(256, num_classes)

    def _make_layer(self, in_ch, out_ch, num_blocks, stride):
        layers = [ResidualBlock(in_ch, out_ch, stride)]
        for _ in range(1, num_blocks):
            layers.append(ResidualBlock(out_ch, out_ch, 1))
        return nn.Sequential(*layers)

    def forward(self, x):
        x = self.prep(x)
        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)
        x = self.avg_pool(x)
        x = x.view(x.size(0), -1)
        return self.fc(x)

# Test
model = MiniResNet(num_classes=10)
x = torch.randn(4, 3, 32, 32)
out = model(x)
print(f"Output shape: {out.shape}")  # [4, 10]
params = sum(p.numel() for p in model.parameters())
print(f"Parameters: {params:,}")
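
To see why the test input ends up as a `[4, 10]` output, the spatial sizes can be traced by hand with the output size formula. This pure-Python sketch uses the kernel sizes, paddings, strides, and channel counts from the model above; `conv_output_size` is a hypothetical helper. It also checks that the 1×1 shortcut convolution produces the same spatial size as the main path, which the addition in `forward` requires.

```python
def conv_output_size(size, kernel, padding, stride):
    return (size - kernel + 2 * padding) // stride + 1


# (stage name, output channels, stride of the first 3x3 conv in the stage)
stages = [("prep", 64, 1), ("layer1", 64, 1), ("layer2", 128, 2), ("layer3", 256, 2)]

size = 32  # spatial size of the 32x32 test input
trace = []
for name, channels, stride in stages:
    main_path = conv_output_size(size, kernel=3, padding=1, stride=stride)
    # A 1x1 shortcut conv with no padding and the same stride must match
    # (prep has no shortcut; the check is trivially true at stride 1)
    shortcut = conv_output_size(size, kernel=1, padding=0, stride=stride)
    assert main_path == shortcut
    size = main_path
    trace.append((name, channels, size, size))

for name, channels, h, w in trace:
    print(f"{name}: (B, {channels}, {h}, {w})")
# Global average pooling then gives (B, 256, 1, 1), flattened to (B, 256),
# and the final linear layer maps that to (B, num_classes).
```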

Summary

📋Summary: CNN Architecture Deep Dive

  • Convolution: Local pattern detection with parameter sharing and translation equivariance
  • Output shape: $\lfloor(H - k + 2p)/s + 1\rfloor \times \lfloor(W - k + 2p)/s + 1\rfloor \times F$
  • Architecture evolution: LeNet → AlexNet → VGG → ResNet → EfficientNet
  • VGG insight: Stacked 3×3 convolutions match the receptive field of a larger kernel with fewer parameters and more non-linearity
  • ResNet: Skip connections mitigate vanishing gradients, enable 100+ layer networks
  • EfficientNet: Compound scaling of width, depth, and resolution
  • Modern practice: Use pre-trained models (transfer learning) as the starting point
  • Trend: Vision Transformers (ViT) are challenging CNN dominance

Practice Exercises

  1. Mathematical: Compute the output shape of a CNN with input $(3, 224, 224)$ through: Conv2d(3, 64, 7, stride=2, padding=3) → MaxPool2d(3, stride=2, padding=1) → Conv2d(64, 128, 3, stride=1, padding=1).

  2. Coding: Implement a VGG-16-like architecture from scratch. Count parameters and compare with the original VGG-16. How many parameters are in the fully connected layers vs. convolutional layers?

  3. Experiment: Train ResNet-18 on CIFAR-10 with and without skip connections. Compare convergence speed and final accuracy. Plot the training curves.

  4. Research: Look up the EfficientNet paper (Tan & Le, 2019). How is the compound coefficient $\phi$ determined? What are the scaling coefficients for EfficientNet-B0 through B7?

  5. Architecture Design: Design a CNN for 100×100 color images with 50 classes. Use the principles from this tutorial (3×3 convolutions, skip connections, GAP). Justify each design choice.
