Convolutional Neural Networks: Filters, Pooling, Architectures — Asked at NVIDIA & Meta

🎯 The Interview Question

"Explain the convolution operation mathematically, including how different filter sizes affect feature extraction. How does pooling work, and what are the trade-offs between max pooling and average pooling? Walk us through the evolution of CNN architectures from LeNet to EfficientNet."

This question tests your understanding of the building blocks of computer vision — critical for roles at NVIDIA (hardware-optimized CV) and Meta (social media vision systems).

📚 Detailed Answer

The Convolution Operation

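A 2D convolution slides a kernel (filter) $\mathbf{K} \in \mathbb{R}^{k \times k}$ over an input feature map $\mathbf{X} \in \mathbb{R}^{H \times W}$ and computes a weighted sum at each position $(i, j)$:

$$(\mathbf{X} * \mathbf{K})(i,j) = \sum_{m=0}^{k-1} \sum_{n=0}^{k-1} \mathbf{X}(i+m, j+n) \cdot \mathbf{K}(m, n)$$

Strictly speaking this is cross-correlation, since the kernel is not flipped. Deep learning frameworks use this form and still call it convolution; because the kernel is learned, the distinction does not change what the network can represent.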
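The double sum above can be sketched in a few lines of plain Python. This is a minimal illustration assuming stride 1 and no padding; the function name `conv2d_valid` and the toy input are hypothetical, not taken from any library.

```python
def conv2d_valid(x, k):
    """Slide kernel k over x with stride 1 and no padding (no kernel flip)."""
    H, W = len(x), len(x[0])
    kh, kw = len(k), len(k[0])
    out = []
    for i in range(H - kh + 1):
        row = []
        for j in range(W - kw + 1):
            acc = 0.0
            # Weighted sum over the kh x kw window anchored at (i, j)
            for m in range(kh):
                for n in range(kw):
                    acc += x[i + m][j + n] * k[m][n]
            row.append(acc)
        out.append(row)
    return out

# Hypothetical 4x4 input with a vertical edge between columns 1 and 2
x = [[0, 0, 1, 1] for _ in range(4)]
# Horizontal-difference kernel: responds where intensity rises left to right
k = [[-1, 1],
     [-1, 1]]

edges = conv2d_valid(x, k)
print(edges)  # [[0.0, 2.0, 0.0], [0.0, 2.0, 0.0], [0.0, 2.0, 0.0]]
```

The output is nonzero only in the column where the window straddles the edge, which is the sense in which a kernel "detects" a feature.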

For multi-channel inputs (e.g., RGB images), the convolution sums across all input channels:

$$\mathbf{Y}_{c_{out}} = \sum_{c_{in}=0}^{C_{in}-1} \mathbf{X}_{c_{in}} * \mathbf{K}_{c_{out}, c_{in}} + b_{c_{out}}$$
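As a rough sketch of the multi-channel sum, the loop below adds the per-channel responses and the bias for each output channel. The function name, the nested-list layout, and the toy values are assumptions chosen for illustration (stride 1, no padding).

```python
def conv2d_multichannel(x, kernels, bias):
    """x: [C_in][H][W], kernels: [C_out][C_in][k][k], bias: [C_out]."""
    c_in, H, W = len(x), len(x[0]), len(x[0][0])
    k = len(kernels[0][0])
    out = []
    for co in range(len(kernels)):
        fmap = []
        for i in range(H - k + 1):
            row = []
            for j in range(W - k + 1):
                acc = bias[co]
                # Sum the kernel response over every input channel
                for ci in range(c_in):
                    for m in range(k):
                        for n in range(k):
                            acc += x[ci][i + m][j + n] * kernels[co][ci][m][n]
                row.append(acc)
            fmap.append(row)
        out.append(fmap)
    return out

# Hypothetical input: two 3x3 channels, constant 1.0 and constant 2.0
x = [[[1.0] * 3 for _ in range(3)], [[2.0] * 3 for _ in range(3)]]
# One output channel with a 2x2 kernel of ones for each input channel
kernels = [[[[1.0, 1.0], [1.0, 1.0]], [[1.0, 1.0], [1.0, 1.0]]]]
bias = [0.5]

y = conv2d_multichannel(x, kernels, bias)
print(y)  # [[[12.5, 12.5], [12.5, 12.5]]]
```

Each output value is 4 × 1.0 from the first channel, 4 × 2.0 from the second, plus the 0.5 bias.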

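💡 The number of parameters in a convolutional layer is $C_{out} \times C_{in} \times k \times k + C_{out}$. This count is independent of the input's height and width, and it is dramatically smaller than that of a fully connected layer connecting the same input and output feature maps, which needs one weight per input-output pair. This parameter sharing across spatial positions is what makes CNNs practical for images.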
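To make the gap concrete, here is a small arithmetic sketch. The layer sizes (3 to 64 channels, 3×3 kernel, 224×224 input with spatial size preserved) are hypothetical and chosen only for illustration.

```python
def conv_params(c_in, c_out, k):
    """Weights plus biases of a k x k convolutional layer."""
    return c_out * c_in * k * k + c_out

def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

# Hypothetical layer: 3 -> 64 channels on a 224x224 image
conv = conv_params(3, 64, 3)
dense = dense_params(3 * 224 * 224, 64 * 224 * 224)

print(conv)           # 1792
print(dense // conv)  # how many times larger the dense layer would be
```

The convolutional count stays at 1,792 regardless of image size, while the dense count grows with the product of input and output sizes.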

Output Size Calculation

The output size of a convolution with input $H_{in} \times W_{in}$, kernel $k \times k$, stride $s$, and padding $p$ is:

$$H_{out} = \left\lfloor \frac{H_{in} + 2p - k}{s} \right\rfloor + 1$$

$$W_{out} = \left\lfloor \frac{W_{in} + 2p - k}{s} \right\rfloor + 1$$

Common configurations:

$k=3, s=1, p=1$ : Preserves spatial dimensions (same convolution)
$k=3, s=2, p=1$ : Downsamples by 2×
$k=1, s=1, p=0$ : 1×1 convolution for channel mixing
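The formula and the three configurations above can be checked with a one-line helper. The function name and the 32-pixel input are assumptions for illustration.

```python
def conv_output_size(size_in, k, s=1, p=0):
    """Output length along one spatial axis: floor((size_in + 2p - k) / s) + 1."""
    return (size_in + 2 * p - k) // s + 1

# Hypothetical 32-pixel axis, using the three configurations listed above
same = conv_output_size(32, k=3, s=1, p=1)
down = conv_output_size(32, k=3, s=2, p=1)
pointwise = conv_output_size(32, k=1, s=1, p=0)

print(same, down, pointwise)  # 32 16 32
```

Because the same formula applies independently to height and width, calling the helper once per axis handles non-square inputs.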

Filter Behavior: What Do Different Sizes Detect?

Filter Size	Receptive Field	Typical Use
1×1	Single pixel	Channel mixing, dimensionality reduction
3×3	Small local patterns	Edges, textures (most common)
5×5	Medium patterns	Larger textures, simple shapes
7×7	Large patterns	Coarse features, initial layers

Modern architectures prefer stacking 3×3 convolutions over using larger kernels because:

Same receptive field with fewer parameters: two stacked 3×3 layers cover a 5×5 region using $2 \times (3 \times 3) = 18$ weights per input-output channel pair vs. $5 \times 5 = 25$ for a single 5×5 layer
More non-linearities (two ReLUs vs. one)
Better gradient flow
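The receptive-field and weight-count comparison can be verified with a short sketch. It assumes stride-1 convolutions with no dilation, and counts weights per input-output channel pair; both helper names are hypothetical.

```python
def receptive_field(kernel_sizes):
    """Receptive field of stacked stride-1, undilated convolutions."""
    rf = 1
    for k in kernel_sizes:
        rf += k - 1  # each layer widens the field by k - 1 pixels
    return rf

def weights_per_channel_pair(kernel_sizes):
    """Kernel weights for one input-output channel pair, ignoring biases."""
    return sum(k * k for k in kernel_sizes)

print(receptive_field([3, 3]), weights_per_channel_pair([3, 3]))        # 5 18
print(receptive_field([5]), weights_per_channel_pair([5]))              # 5 25
print(receptive_field([3, 3, 3]), weights_per_channel_pair([3, 3, 3]))  # 7 27
print(receptive_field([7]), weights_per_channel_pair([7]))              # 7 49
```

The saving widens with depth: three 3×3 layers match a 7×7 receptive field with 27 weights instead of 49.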

Pooling: Downsampling Strategies

Max Pooling:

$$y_{i,j} = \max_{(m,n) \in \mathcal{R}_{i,j}} x_{m,n}$$

where $\mathcal{R}_{i,j}$ is the pooling region. Max pooling selects the maximum value in each window, providing:

Local translation invariance (a small shift that keeps the strongest activation inside the same window does not change the output)
Robustness to noise
Dimensionality reduction
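A minimal sketch of max pooling on nested lists, assuming non-overlapping 2×2 windows by default; the function name and the 4×4 input are hypothetical.

```python
def max_pool2d(x, size=2, stride=2):
    """Take the maximum of each size x size window of a 2D feature map."""
    H, W = len(x), len(x[0])
    return [
        [
            max(x[i + m][j + n] for m in range(size) for n in range(size))
            for j in range(0, W - size + 1, stride)
        ]
        for i in range(0, H - size + 1, stride)
    ]

# Hypothetical 4x4 feature map
x = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 0, 5, 6],
    [1, 2, 7, 8],
]

pooled_max = max_pool2d(x)
print(pooled_max)  # [[4, 2], [2, 8]]
```

Moving the 4 in the top-left window to any other cell of that window leaves the output unchanged, which illustrates the invariance to small shifts.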

Average Pooling:

$$y_{i,j} = \frac{1}{|\mathcal{R}_{i,j}|} \sum_{(m,n) \in \mathcal{R}_{i,j}} x_{m,n}$$

Average pooling computes the mean, providing:

Smoother gradients
Retains a contribution from every activation in the window, not only the strongest one
Commonly used in final layers as Global Average Pooling
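For comparison, average pooling over the same kind of window can be sketched as follows, again assuming non-overlapping 2×2 windows; the function name and input are hypothetical.

```python
def avg_pool2d(x, size=2, stride=2):
    """Take the mean of each size x size window of a 2D feature map."""
    H, W = len(x), len(x[0])
    area = size * size
    return [
        [
            sum(x[i + m][j + n] for m in range(size) for n in range(size)) / area
            for j in range(0, W - size + 1, stride)
        ]
        for i in range(0, H - size + 1, stride)
    ]

# Hypothetical 4x4 feature map
x = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 0, 5, 6],
    [1, 2, 7, 8],
]

pooled_avg = avg_pool2d(x)
print(pooled_avg)  # [[2.5, 1.0], [0.75, 6.5]]
```

Every input value influences the result here, so the gradient is spread evenly over the window, whereas max pooling routes it only to the winning position.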

Global Average Pooling (GAP):

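$$y_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} x_{c,i,j}$$

GAP collapses each channel to a single value, so it can replace the large fully connected layers that would otherwise follow a flattened feature map. This cuts the parameter count and reduces overfitting.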
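A small sketch of GAP and of the parameter saving in the classifier head. The feature values and the head sizes (512 channels of 7×7 feeding 1000 classes) are hypothetical numbers chosen for illustration.

```python
def global_average_pool(x):
    """Reduce a [C][H][W] tensor (nested lists) to one mean per channel."""
    return [
        sum(sum(row) for row in fmap) / (len(fmap) * len(fmap[0]))
        for fmap in x
    ]

def classifier_params(n_features, n_classes):
    """Weights plus biases of a single linear classifier layer."""
    return n_features * n_classes + n_classes

features = [
    [[1.0, 2.0], [3.0, 4.0]],  # channel 0
    [[0.0, 0.0], [0.0, 8.0]],  # channel 1
]
pooled = global_average_pool(features)
print(pooled)  # [2.5, 2.0]

# Hypothetical head: flatten 512x7x7 vs. GAP to 512 before a 1000-way classifier
flatten_head = classifier_params(512 * 7 * 7, 1000)
gap_head = classifier_params(512, 1000)
print(flatten_head, gap_head)  # 25089000 513000
```

In this illustrative setting the GAP head needs about 49 times fewer parameters, matching the 7×7 spatial positions it averages away.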

Evolution of CNN Architectures

LeNet-5 (1998)

First practical CNN for digit recognition:

2 conv layers (both 5×5)
2 pooling layers (2×2 average)
3 fully connected layers
~60K parameters

AlexNet (2012)

Breakthrough on ImageNet:

5 conv layers, 3 FC layers
ReLU activation (popularized for deep CNNs, though not the first network to use it)
Dropout regularization
Data augmentation
~60M parameters

VGGNet (2014)

Deep, uniform architecture:

16-19 weight layers (VGG-16 and VGG-19)
2×2 max pooling for downsampling
3×3 filters throughout
~138M parameters (VGG-16)
Key insight: depth matters

GoogLeNet/Inception (2014)

Multi-scale processing:

Inception modules: parallel 1×1, 3×3, 5×5 convolutions
1×1 convolutions for dimensionality reduction
Auxiliary classifiers for training
~6.8M parameters (much more efficient)
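The saving from a 1×1 reduction before an expensive 5×5 branch can be sketched with simple arithmetic. The channel counts below (192 in, reduced to 16, expanded to 32) are illustrative assumptions, and biases are ignored.

```python
def conv_weights(c_in, c_out, k):
    """Kernel weights of a k x k convolution, ignoring biases."""
    return k * k * c_in * c_out

# Illustrative channel counts (assumed, not quoted from the source)
c_in, c_reduced, c_out = 192, 16, 32

direct = conv_weights(c_in, c_out, 5)
bottleneck = conv_weights(c_in, c_reduced, 1) + conv_weights(c_reduced, c_out, 5)

print(direct)      # 153600
print(bottleneck)  # 15872
```

Under these assumptions the reduced branch uses roughly a tenth of the weights, which is how Inception-style modules afford parallel large-kernel paths.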

ResNet (2015)

Skip connections for very deep networks:

Identity shortcut: $\mathbf{y} = f(\mathbf{x}) + \mathbf{x}$
Enabled training of 50, 101, 152+ layer networks
Batch normalization after each conv
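The identity shortcut can be sketched on plain vectors. Here `f` stands in for the residual branch (convolutions, batch normalization, and activations in a real block); the two example branches are hypothetical placeholders.

```python
def residual_block(x, f):
    """Compute y = f(x) + x element-wise for a flat list of activations."""
    fx = f(x)
    return [a + b for a, b in zip(fx, x)]

def zero_branch(v):
    """Hypothetical branch that has learned nothing yet: outputs zeros."""
    return [0.0 for _ in v]

def half_branch(v):
    """Hypothetical branch that adds a small correction to the input."""
    return [0.5 * a for a in v]

x = [1.0, -2.0, 3.0]
print(residual_block(x, zero_branch))  # [1.0, -2.0, 3.0]
print(residual_block(x, half_branch))  # [1.5, -3.0, 4.5]
```

With a zero branch the block is exactly the identity, so adding more blocks does not have to degrade the signal; each block only needs to learn a correction on top of its input.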

EfficientNet (2019)

Compound scaling:

Scales depth, width, and resolution together
Uses mobile inverted bottleneck (MBConv)
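State-of-the-art ImageNet accuracy at the time of publication, with far fewer parameters than comparably accurate networks
Compound coefficient $\phi$: depth $d = \alpha^\phi$, width $w = \beta^\phi$, resolution $r = \gamma^\phi$, subject to $\alpha \cdot \beta^2 \cdot \gamma^2 \approx 2$ so that FLOPs grow by roughly $2^\phi$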
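A short sketch of compound scaling. The constants $\alpha = 1.2$, $\beta = 1.1$, $\gamma = 1.15$ are the grid-search values commonly quoted from the EfficientNet paper and should be treated as an assumption here; the FLOPs estimate uses the usual approximation that cost scales with depth × width² × resolution².

```python
# Assumed constants (commonly quoted EfficientNet grid-search values)
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15

def compound_scale(phi):
    """Return (depth, width, resolution) multipliers for coefficient phi."""
    return ALPHA ** phi, BETA ** phi, GAMMA ** phi

def flops_multiplier(phi):
    """Approximate FLOPs growth: depth * width^2 * resolution^2."""
    return (ALPHA * BETA ** 2 * GAMMA ** 2) ** phi

for phi in range(4):
    d, w, r = compound_scale(phi)
    print(phi, round(d, 2), round(w, 2), round(r, 2), round(flops_multiplier(phi), 2))
```

With these constants each unit increase in $\phi$ multiplies the estimated FLOPs by about 1.92, close to the factor of 2 that the scaling rule targets.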

Modern Trends

NVIDIA-Specific Considerations

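Tensor Cores: Hardware units for small-tile matrix multiply-accumulate (e.g., $4 \times 4$ tiles on Volta); use mixed precision (FP16/BF16) to engage them
cuDNN: Library-optimized convolution algorithms; auto-tune by setting torch.backends.cudnn.benchmark = True, which helps most when input shapes are fixed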
Memory Bandwidth: Limit memory transfers; fuse operations where possible

Meta-Specific Considerations

Efficient Inference: Models must serve billions of users with low latency
ONNX Runtime: Cross-platform inference optimization
Model Pruning: Remove redundant parameters for mobile deployment

Follow-Up Questions

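Q: What is depthwise separable convolution and why is it faster? A: It separates spatial filtering from channel mixing: a depthwise convolution applies one $k \times k$ filter per input channel ($C_{in}$ filters in total), and a pointwise 1×1 convolution then mixes the channels. This reduces the weight count from $k^2 \cdot C_{in} \cdot C_{out}$ to $k^2 \cdot C_{in} + C_{in} \cdot C_{out}$, a factor of $1/C_{out} + 1/k^2$, and the number of multiply-accumulate operations shrinks by the same factor.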
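The two weight counts can be compared with a short calculation. The layer sizes ($k = 3$, 64 input channels, 128 output channels) are hypothetical, and biases are ignored.

```python
def standard_conv_weights(k, c_in, c_out):
    """Weights of a standard k x k convolution."""
    return k * k * c_in * c_out

def depthwise_separable_weights(k, c_in, c_out):
    """Depthwise (one k x k filter per input channel) plus pointwise 1x1."""
    return k * k * c_in + c_in * c_out

# Hypothetical layer sizes
k, c_in, c_out = 3, 64, 128
standard = standard_conv_weights(k, c_in, c_out)
separable = depthwise_separable_weights(k, c_in, c_out)

print(standard, separable)             # 73728 8768
print(round(standard / separable, 2))  # 8.41
```

The ratio matches $1/C_{out} + 1/k^2$ inverted, so for 3×3 kernels and many output channels the saving approaches a factor of 9.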

Q: How do you handle variable-size images in CNNs? A: Global Average Pooling produces fixed-size output regardless of spatial dimensions. Adaptive pooling layers can output any desired size.
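A rough sketch of adaptive average pooling on nested lists shows how inputs of different sizes map to the same output size. The bin boundaries (floor for the start, ceiling for the end) follow the scheme commonly described for framework implementations and are an assumption here, as is the function name.

```python
def adaptive_avg_pool2d(x, out_h, out_w):
    """Average-pool a 2D map of any size down to exactly out_h x out_w."""
    H, W = len(x), len(x[0])

    def bins(n, out):
        # Start at floor(i*n/out), end at ceil((i+1)*n/out); bins may overlap
        return [((i * n) // out, -((-(i + 1) * n) // out)) for i in range(out)]

    result = []
    for r0, r1 in bins(H, out_h):
        row = []
        for c0, c1 in bins(W, out_w):
            vals = [x[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            row.append(sum(vals) / len(vals))
        result.append(row)
    return result

# Two hypothetical inputs of different sizes, both filled with 3.0
small = [[3.0] * 4 for _ in range(4)]
large = [[3.0] * 5 for _ in range(6)]

print(adaptive_avg_pool2d(small, 2, 2))  # [[3.0, 3.0], [3.0, 3.0]]
print(adaptive_avg_pool2d(large, 2, 2))  # [[3.0, 3.0], [3.0, 3.0]]
```

Setting the output size to 1×1 reduces this to Global Average Pooling, which is why a classifier placed after it accepts any input resolution.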

Q: What is the relationship between CNNs and Vision Transformers? A: ViTs treat image patches as tokens and use self-attention instead of convolutions. Hybrid models combine both. ViTs excel at capturing global relationships, while CNNs build in locality and translation equivariance, which suits local patterns.