Convolutional Neural Networks — Complete Guide
CNNs are the backbone of computer vision — they automatically learn spatial features from images.
How CNNs Work
Image → [Conv → ReLU → Pool] × N → Flatten → FC → Output
Conv Layer:
├─ Applies filters/kernels to detect features
├─ Edge detection, texture, shapes
├─ Local receptive field
└─ Parameter sharing (same filter everywhere)
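To make the effect of parameter sharing concrete, the following sketch counts the parameters of one convolutional layer and of a dense layer that would produce an output of the same size. The 32×32 image size and the channel counts are illustrative assumptions, and the helper names are hypothetical.

```python
# Parameter count: convolution (shared filters) vs. a dense layer.
# All sizes here are illustrative assumptions, not values from a real model.

def conv_params(in_channels, out_channels, kernel_size):
    """Weights plus one bias per output channel; independent of image size."""
    return out_channels * in_channels * kernel_size * kernel_size + out_channels

def dense_params(in_features, out_features):
    """One weight per input/output pair plus one bias per output."""
    return in_features * out_features + out_features

height = width = 32  # assumed image size
conv = conv_params(in_channels=3, out_channels=32, kernel_size=3)
dense = dense_params(3 * height * width, 32 * height * width)

print(conv)   # 896
print(dense)  # 100696064
```

Because the same 3×3 filters slide over every position, the convolutional count stays at 896 even if the image grows, while the dense count grows with the number of pixels.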
Pooling Layer:
├─ Reduces spatial dimensions
├─ Max pooling: keep maximum value
├─ Reduces computation
└─ Adds invariance to small translations
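A minimal pure-Python sketch of 2×2 max pooling may help here. The feature-map values are made up, and `max_pool2d` is a hypothetical helper, not the PyTorch layer.

```python
def max_pool2d(grid, size=2):
    """Non-overlapping max pooling: the stride equals the window size."""
    rows, cols = len(grid), len(grid[0])
    return [
        [
            max(grid[r + dr][c + dc] for dr in range(size) for dc in range(size))
            for c in range(0, cols - size + 1, size)
        ]
        for r in range(0, rows - size + 1, size)
    ]

feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 1, 5, 6],
    [2, 2, 7, 8],
]
# Same map, but the strongest top-left activation moved by one pixel
# inside its 2×2 window.
shifted = [
    [4, 3, 2, 0],
    [1, 2, 1, 1],
    [0, 1, 5, 6],
    [2, 2, 7, 8],
]

print(max_pool2d(feature_map))  # [[4, 2], [2, 8]]
print(max_pool2d(shifted))      # [[4, 2], [2, 8]]
```

The 4×4 map shrinks to 2×2, and the small shift inside a pooling window leaves the output unchanged, which is the sense in which pooling tolerates small translations.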
Fully Connected Layer:
├─ Combines features
└─ Makes final prediction
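As a rough illustration of the last stage, the sketch below flattens two tiny feature maps, applies one dense layer, and picks the most probable class. The weights are made up, and the softmax step is an assumption about how scores are turned into probabilities; the helper names are hypothetical.

```python
import math

def flatten(feature_maps):
    """Turn a list of 2-D feature maps into one flat vector."""
    return [v for fmap in feature_maps for row in fmap for v in row]

def linear(x, weights, biases):
    """Dense layer: one weighted sum of all inputs per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]

def softmax(logits):
    """Convert raw scores to probabilities (numerically stable form)."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Two 2×2 feature maps (made-up values) and a 3-class dense layer.
feature_maps = [[[1.0, 0.0], [0.0, 1.0]],
                [[0.5, 0.5], [0.5, 0.5]]]
x = flatten(feature_maps)                      # 8 features
weights = [[0.1] * 8, [0.2] * 8, [-0.1] * 8]   # hypothetical weights
biases = [0.0, 0.0, 0.0]

probs = softmax(linear(x, weights, biases))
predicted = max(range(len(probs)), key=probs.__getitem__)
print(predicted)  # 1
```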
Convolution Operation
Input (5×5):      Filter (3×3):      Output (3×3):
1 2 3 4 5
6 7 8 9 0         1 0 1              15 20 25
1 2 3 4 5    *    0 1 0      =       30 35 20
6 7 8 9 0         1 0 1              15 20 25
1 2 3 4 5
Output[i,j] = Σ_k Σ_l Input[i+k, j+l] × Filter[k, l]   (stride 1, no padding)
Example: Output[0,0] = 1 + 3 + 7 + 1 + 3 = 15
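The formula can be checked with a short pure-Python sketch applied to the 5×5 input and 3×3 filter from the example. It assumes stride 1 and no padding; `conv2d_valid` is a hypothetical helper name.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image (stride 1, no padding) and sum products."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(image) - k_rows + 1
    out_cols = len(image[0]) - k_cols + 1
    return [
        [
            sum(image[i + k][j + l] * kernel[k][l]
                for k in range(k_rows) for l in range(k_cols))
            for j in range(out_cols)
        ]
        for i in range(out_rows)
    ]

image = [
    [1, 2, 3, 4, 5],
    [6, 7, 8, 9, 0],
    [1, 2, 3, 4, 5],
    [6, 7, 8, 9, 0],
    [1, 2, 3, 4, 5],
]
kernel = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
]

output = conv2d_valid(image, kernel)
for row in output:
    print(row)
# [15, 20, 25]
# [30, 35, 20]
# [15, 20, 25]
```

Like most deep learning libraries, this computes a cross-correlation: the kernel is not flipped before it is applied.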
Architecture
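import torch.nn as nn


class CNN(nn.Module):
    """Three Conv → ReLU → Pool blocks followed by a two-layer classifier."""

    def __init__(self):
        super().__init__()
        # Feature extractor: padding=1 keeps the spatial size for 3×3 kernels,
        # and each MaxPool2d(2) halves the height and width.
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        # Classifier: 128 * 4 * 4 implies 32×32 inputs (32 → 16 → 8 → 4);
        # the 10 outputs are raw class scores (logits).
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * 4 * 4, 256),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(256, 10),
        )

    def forward(self, x):
        x = self.features(x)
        x = self.classifier(x)
        return x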
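The `128 * 4 * 4` input size of the first linear layer follows from the spatial size after three pooling steps. The sketch below traces that arithmetic in plain Python, assuming a 32×32 input (the size implied by the layer dimensions); `conv_out` and `pool_out` are hypothetical helpers using the standard output-size formula.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1

def pool_out(size, kernel):
    """Output size of pooling whose stride equals its window size."""
    return (size - kernel) // kernel + 1

size = 32      # assumed input height and width
channels = 3   # RGB input
for out_channels in (32, 64, 128):
    size = conv_out(size, kernel=3, padding=1)  # 3×3 conv with padding=1 keeps the size
    size = pool_out(size, kernel=2)             # 2×2 max pooling halves it
    channels = out_channels
    print(f"{channels} channels, {size}x{size}")

flat_features = channels * size * size
print(flat_features)  # 2048
```

With a different input resolution, the value passed to `nn.Linear` would have to change accordingly.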
Key Takeaways
- CNNs automatically learn spatial features from images
- Convolution detects local patterns (edges, textures)
- Pooling reduces spatial dimensions and adds invariance to small shifts
- Transfer learning with pre-trained models is standard practice
- ResNet enables training very deep networks
- Data augmentation is crucial for small datasets
- CNNs have matched or exceeded estimated human accuracy on some image classification benchmarks, such as ImageNet
- Modern architectures: EfficientNet and ConvNeXt (CNNs), plus the Vision Transformer (an attention-based alternative to convolution)
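As one small illustration of the data augmentation point above, the sketch below applies a random horizontal flip and a random crop to an image stored as a nested list. The image, crop size, and function names are all hypothetical; real pipelines usually rely on a library for this.

```python
import random

def horizontal_flip(image):
    """Mirror each row left to right."""
    return [row[::-1] for row in image]

def random_crop(image, crop_h, crop_w, rng):
    """Cut a crop_h × crop_w window at a random position."""
    top = rng.randint(0, len(image) - crop_h)
    left = rng.randint(0, len(image[0]) - crop_w)
    return [row[left:left + crop_w] for row in image[top:top + crop_h]]

def augment(image, rng, crop_h=3, crop_w=3, flip_prob=0.5):
    """Randomly flip, then randomly crop, producing a new training variant."""
    if rng.random() < flip_prob:
        image = horizontal_flip(image)
    return random_crop(image, crop_h, crop_w, rng)

rng = random.Random(0)  # fixed seed so runs are repeatable
image = [[r * 4 + c for c in range(4)] for r in range(4)]  # toy 4×4 "image"
samples = [augment(image, rng) for _ in range(3)]
for sample in samples:
    print(sample)
```

Each call yields a slightly different 3×3 view of the same image, which gives the network more varied examples when the dataset is small.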