
Convolutional Neural Networks — How Computers See Images

Master CNNs and learn how computers extract visual features through convolution, pooling, and learned filters.

Convolution operations — detect edges, textures, and shapes
Pooling layers — reduce spatial dimensions efficiently
Modern architectures — ResNet, EfficientNet, and beyond

A picture is worth a thousand words — and a CNN learns all of them.

Convolutional Neural Networks — Complete Guide

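CNNs exploit the spatial structure of images through two key principles: local connectivity (each neuron connects to a small region) and weight sharing (the same filter is applied everywhere). A convolutional layer with a k×k kernel, C_in input channels, and C_out output channels therefore needs only k × k × C_in × C_out weights, independent of image size, whereas a fully connected layer between the same input and output volumes needs (H × W × C_in) × (H' × W' × C_out) weights.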
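To make the gap concrete, here is a minimal sketch that counts parameters for both layer types. The helper names and the layer sizes (a 32×32 RGB input, 32 filters of size 3×3, no padding) are assumptions chosen for illustration, and both counts include one bias per output unit.

```python
def conv_params(k, c_in, c_out, bias=True):
    """Parameters in one k x k convolution layer (weights shared across positions)."""
    return k * k * c_in * c_out + (c_out if bias else 0)


def dense_params(n_in, n_out, bias=True):
    """Parameters in a fully connected layer mapping n_in inputs to n_out outputs."""
    return n_in * n_out + (n_out if bias else 0)


# Illustrative sizes (assumed): 32x32 RGB input, 32 filters of size 3x3, no padding.
h, w, c_in, c_out, k = 32, 32, 3, 32, 3
h_out, w_out = h - k + 1, w - k + 1  # 30 x 30 output map

conv_count = conv_params(k, c_in, c_out)
dense_count = dense_params(h * w * c_in, h_out * w_out * c_out)

print(conv_count)   # 896
print(dense_count)  # 88502400
```

With these assumed sizes, the convolutional layer uses roughly 100,000 times fewer parameters than a dense layer producing the same output volume.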

Convolution Operation

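The discrete 2D convolution (implemented as cross-correlation in practice) slides a learnable kernel K over the input I: output[i, j] = Σ_m Σ_n I[i + m, j + n] · K[m, n], where m and n range over the kernel's rows and columns.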

How this diagram works: The convolution operation slides a 3×3 kernel across the 5×5 input grid. At each position, the kernel is multiplied element-wise with the overlapping input region, and the products are summed to produce a single output value. The highlighted yellow region shows the current receptive field, the 3×3 area the kernel is currently processing. The computation panel on the right breaks down how output[0,0] = 4 is calculated: each row of the kernel multiplies the corresponding input row, and the three row-wise partial sums add up to 4. With no padding and stride 1, a 5×5 input with a 3×3 kernel produces a 3×3 output, shrinking each spatial dimension by k-1 = 2 pixels. This operation is the fundamental building block of every CNN.
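The sliding-window computation can be sketched in plain Python. The function name `conv2d_valid` is hypothetical, and the input and kernel values below are assumed for illustration rather than read off the diagram; they were picked so that output[0][0] comes out to 4, matching the value discussed above.

```python
def conv2d_valid(image, kernel, stride=1):
    """Cross-correlate a 2D image with a 2D kernel, using no padding."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = (len(image) - kh) // stride + 1
    out_w = (len(image[0]) - kw) // stride + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = 0
            # Multiply the kernel with the overlapping region and sum the products.
            for m in range(kh):
                for n in range(kw):
                    total += image[i * stride + m][j * stride + n] * kernel[m][n]
            row.append(total)
        output.append(row)
    return output


# Assumed example values (not taken from the diagram).
image = [
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
]
kernel = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
]

result = conv2d_valid(image, kernel)
print(result)  # [[4, 3, 4], [2, 4, 3], [2, 3, 4]]
```

The 5×5 input and 3×3 kernel give a 3×3 result, in line with the k-1 = 2 shrinkage per dimension. Deep learning libraries compute the same quantity with heavily optimized kernels instead of nested loops.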

Pooling

Pooling reduces spatial dimensions, providing translation invariance and reducing computation.

How pooling works: Pooling reduces the spatial size of feature maps while retaining the most important information. Max Pooling (left) divides the 4×4 input into 2×2 regions and keeps only the maximum value from each, which captures the strongest activation (e.g., the strongest edge or texture detected). Average Pooling (center) computes the mean of each region, providing a smoother summary. Global Average Pooling (right) collapses an entire feature map (C×H×W) into a single value per channel by averaging all spatial positions. In modern architectures like ResNet it replaces the large flatten-plus-fully-connected stack, leaving only a single final classification layer and drastically reducing parameters. The comparison table shows that max pooling preserves the strongest feature and routes gradients only to the winning neuron, while average pooling distributes gradients evenly. Pooling also provides a degree of translation invariance: small shifts in the input don't change the pooled output significantly.
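The three pooling variants can be sketched in a few lines of plain Python. The helper names and the 4×4 feature-map values are assumptions made for illustration, and the window stride is taken to equal the window size (non-overlapping regions).

```python
def mean(values):
    return sum(values) / len(values)


def pool2d(feature_map, size=2, reduce=max):
    """Non-overlapping pooling: the stride equals the window size."""
    pooled = []
    for i in range(0, len(feature_map) - size + 1, size):
        row = []
        for j in range(0, len(feature_map[0]) - size + 1, size):
            window = [feature_map[i + di][j + dj]
                      for di in range(size) for dj in range(size)]
            row.append(reduce(window))
        pooled.append(row)
    return pooled


def global_average_pool(feature_maps):
    """Collapse each H x W channel into a single mean value."""
    return [mean([v for row in channel for v in row]) for channel in feature_maps]


# Assumed 4x4 feature map (a single channel).
fmap = [
    [1, 3, 2, 4],
    [5, 6, 1, 2],
    [7, 2, 9, 0],
    [1, 4, 3, 8],
]

max_pooled = pool2d(fmap, 2, max)    # [[6, 4], [7, 9]]
avg_pooled = pool2d(fmap, 2, mean)   # [[3.75, 2.25], [3.5, 5.0]]
gap = global_average_pool([fmap])    # [3.625]
print(max_pooled, avg_pooled, gap)
```

Max pooling keeps one value per window, so only that position receives gradient during backpropagation, whereas the averaging variants spread it over every position in the window.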

CNN Architecture

A typical CNN follows the pattern: Conv → ReLU → Pool repeated N times, followed by Flatten → FC → Output.

How this architecture flows: This diagram shows a classic CNN pipeline processing a 32×32×3 color image through progressively deeper layers. The input passes through three convolutional blocks (blue); the first two are followed by max pooling (green) that halves the spatial dimensions, rounding down. Notice the key pattern: as spatial dimensions decrease (32→30→15→13→6→4), channel depth increases (3→32→64→128), trading spatial resolution for feature richness. The flatten layer converts the 4×4×128 feature map into a 1D vector of 2,048 values, which feeds into two fully connected layers (pink) for classification into 10 classes. Early layers learn low-level features (edges, colors), middle layers learn mid-level patterns (textures, shapes), and deep layers learn high-level concepts (objects, parts). The bottom annotations track how dimensions change at each stage.
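The dimension sequence can be checked with a small shape tracker. This sketch assumes 3×3 convolutions with stride 1 and no padding, 2×2 max pooling after the first two convolutions only (inferred from the 32→30→15→13→6→4 sequence), and hypothetical helper names.

```python
def conv_out(size, kernel=3, stride=1, padding=0):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size, window=2):
    """Spatial output size of non-overlapping pooling (rounds down)."""
    return size // window


size, channels = 32, 3
trace = [(size, channels)]
for block, out_channels in enumerate([32, 64, 128]):
    size, channels = conv_out(size), out_channels
    trace.append((size, channels))
    if block < 2:  # assumed: only the first two blocks are followed by pooling
        size = pool_out(size)
        trace.append((size, channels))

flat = size * size * channels
print(trace)  # [(32, 3), (30, 32), (15, 32), (13, 64), (6, 64), (4, 128)]
print(flat)   # 2048
```

The final 4×4×128 volume flattens to 2,048 values, which is the input width of the first fully connected layer.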

ResNet and Skip Connections

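The residual connection addresses the degradation problem: a deeper network should perform at least as well as a shallower one, yet plain deep networks often reach higher training error as layers are added. Instead of learning the target mapping H(x) directly, each block learns the residual F(x) = H(x) - x, so its output is y = F(x) + x.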

How skip connections solve vanishing gradients: The red dashed line is the key innovation. It creates a "shortcut" that copies the input x directly to the addition operation, bypassing the two convolution layers. Instead of learning the full transformation H(x), the network only needs to learn the residual F(x) = H(x) - x, and the output becomes y = F(x) + x. This works because if the optimal transformation is close to identity (i.e., the layer doesn't need to change anything), F(x) only has to be near zero, which is much easier than learning an identity mapping from scratch. For backpropagation, the gradient now has two paths: since ∂y/∂x = ∂F/∂x + I (the identity), it flows through F(x) and through the identity shortcut, so a small ∂F/∂x no longer drives the overall gradient toward zero, even in networks with 152 or more layers. The identity path acts as a gradient highway, making very deep networks practical to train.
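The effect of the identity path on gradients can be illustrated with a toy scalar model. This is a sketch under strong simplifying assumptions, not a real network: each layer is reduced to a single local derivative F'(x), the depth (50) and the derivative value (0.01) are made up, and the function names are hypothetical.

```python
def residual_block(x, f):
    """y = F(x) + x: the skip connection adds the input back to the block output."""
    return f(x) + x


def plain_gradient(local_grads):
    """Gradient through stacked plain layers y = F(x): the product of F'."""
    grad = 1.0
    for d in local_grads:
        grad *= d
    return grad


def residual_gradient(local_grads):
    """Gradient through stacked residual layers y = F(x) + x: the product of (F' + 1)."""
    grad = 1.0
    for d in local_grads:
        grad *= d + 1.0
    return grad


# If F outputs zero, the block reduces to the identity mapping.
identity_out = residual_block(3.0, lambda x: 0.0)  # 3.0

# Assumed: 50 layers, each with a small local derivative of 0.01.
local = [0.01] * 50
plain = plain_gradient(local)        # about 1e-100, effectively vanished
resid = residual_gradient(local)     # about 1.64, still usable
print(identity_out, plain, resid)
```

In this toy setting the plain stack multiplies fifty small factors together, while the residual stack multiplies factors close to 1, which is the intuition behind the "gradient highway" description.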

PyTorch Implementation

CNN Architecture Comparison

Key Takeaways

What to Learn Next

-> Vision Transformers: Apply the Transformer architecture to vision tasks.

-> Transfer Learning: Leverage pre-trained models for new tasks.

-> Object Detection: Find and locate objects in images.

-> Neural Networks: Understand the foundation of deep learning.

-> Semantic Segmentation: Classify every pixel in an image.

-> Training Deep Networks: Master optimizers, batch norm, and regularization.
