Vision Transformers — Applying Transformers to Images

Vision Transformers bring attention to computer vision by treating image patches as tokens. With patch embedding, positional encoding, and self-attention, ViT captures global relationships from the first layer and can outperform CNNs when pretrained on sufficiently large datasets.

Key point 1: Patch embedding converts an image into a sequence of flattened, linearly projected patches
Key point 2: Weak inductive bias means ViT needs large datasets to learn locality
Key point 3: DeiT improves data efficiency; Swin Transformer improves scalability to high-resolution images

"When images become sequences, attention becomes vision."

Vision Transformers — ViT and Beyond

Vision Transformers (ViT) apply the Transformer architecture to image recognition by treating image patches as tokens, achieving performance competitive with or superior to CNNs when pretrained at sufficient scale.

From NLP to Vision

Patch Embedding

Positional Encoding

ViT Architecture

How Vision Transformers process images: This diagram shows the complete ViT pipeline. A 224×224 input image is first split into 196 patches of 16×16 pixels each (a 14×14 grid). The Patch Embedding layer (green) is implemented as a Conv2d with kernel=16 and stride=16, which cuts the image into patches and projects each one to D=768 dimensions in a single operation. A special [CLS] token (yellow) is prepended to the sequence, giving 197 tokens; its final representation is used for classification. Positional embeddings (purple) are added to supply spatial information, because self-attention is permutation-invariant and would otherwise ignore where each patch came from. The sequence then passes through the Transformer Encoder (pink), which has L=12 layers of multi-head self-attention and feed-forward networks. The key advantage over CNNs is that self-attention captures global relationships from layer 1: a patch in the top-left can attend directly to a patch in the bottom-right without stacking many convolutional layers. The Classification Head (red) maps the [CLS] token output to scores (logits) over 1000 classes, which a softmax turns into probabilities.
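
The embedding stage described above can be sketched in PyTorch as follows. This is a minimal illustration, not a reference implementation: the class name `PatchEmbedding` and its argument names are assumptions made for this example, and the zero initialization of the [CLS] token and positional embeddings is only a placeholder for whatever scheme a real model uses.

```python
import torch
import torch.nn as nn


class PatchEmbedding(nn.Module):
    """Hypothetical helper: patchify, project, prepend [CLS], add positions."""

    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2  # 14 * 14 = 196
        # A convolution whose kernel and stride equal the patch size cuts the
        # image into non-overlapping patches and projects each one to embed_dim.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        # Learnable [CLS] token and one positional embedding per token.
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(
            torch.zeros(1, self.num_patches + 1, embed_dim))

    def forward(self, x):
        batch = x.shape[0]
        x = self.proj(x)                      # (B, D, 14, 14)
        x = x.flatten(2).transpose(1, 2)      # (B, 196, D)
        cls = self.cls_token.expand(batch, -1, -1)
        x = torch.cat([cls, x], dim=1)        # (B, 197, D)
        return x + self.pos_embed             # broadcast over the batch


embed = PatchEmbedding()
tokens = embed(torch.randn(2, 3, 224, 224))
print(tokens.shape)  # torch.Size([2, 197, 768])
```

The output has 197 tokens per image: 196 patch tokens plus the prepended [CLS] token.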

ViT vs CNN

Aspect	CNN	ViT
Inductive bias	Local connectivity, translation equivariance	Minimal (global attention; locality must be learned)
Data requirement	Works with less data	Needs large-scale pretraining (e.g., JFT-300M)
Computational cost	O(k² · C_in · C_out · H · W) per layer	O(N² · D) self-attention per layer (N = number of tokens)
Patch size	N/A	Determines sequence length
Position info	Built-in (conv)	Explicit positional encoding
Transfer learning	Excellent	Excellent with pretraining
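
To make the "Patch size" and "Computational cost" rows concrete, the following sketch computes how many tokens a 224×224 image produces at different patch sizes and how the attention cost scales with that count. The function names are hypothetical, and the cost figure is a rough multiply-accumulate estimate for the attention scores and weighted sum only; it ignores the linear projections and the feed-forward layers.

```python
def num_tokens(img_size: int, patch_size: int) -> int:
    """Number of patch tokens for a square image (excluding [CLS])."""
    if img_size % patch_size != 0:
        raise ValueError("img_size must be divisible by patch_size")
    return (img_size // patch_size) ** 2


def attention_macs(n_tokens: int, dim: int) -> int:
    """Rough multiply-accumulate count for one self-attention layer.

    QK^T costs N * N * D and the attention-weighted sum over V costs
    another N * N * D, summed over all heads.
    """
    return 2 * n_tokens * n_tokens * dim


for patch in (32, 16, 8):
    n = num_tokens(224, patch)
    print(f"patch={patch:2d}  tokens={n:4d}  attention MACs={attention_macs(n, 768):,}")
```

Halving the patch size quadruples the token count (49, 196, 784), so the attention cost grows by a factor of 16 at each step.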

DeiT (Data-efficient Image Transformers)

Swin Transformer

PyTorch Implementation
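
As a starting point, here is a minimal sketch of a ViT classifier built from standard PyTorch modules. It is one possible arrangement under stated assumptions rather than the original ViT code: the class name `TinyViT`, the pre-norm layers, the GELU activation, the MLP ratio of 4, and the truncated-normal initialization are choices made for this example. The defaults mirror the 224×224, 16×16-patch, D=768, L=12 configuration described earlier, but the usage example below runs a much smaller model so it executes quickly.

```python
import torch
import torch.nn as nn


class TinyViT(nn.Module):
    """Illustrative ViT: patch embedding -> encoder -> [CLS] classification head."""

    def __init__(self, img_size=224, patch_size=16, in_chans=3, num_classes=1000,
                 embed_dim=768, depth=12, num_heads=12, mlp_ratio=4):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # Strided convolution = split into patches + linear projection.
        self.patch_embed = nn.Conv2d(in_chans, embed_dim,
                                     kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))
        nn.init.trunc_normal_(self.pos_embed, std=0.02)
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads,
            dim_feedforward=embed_dim * mlp_ratio,
            dropout=0.0, activation="gelu",
            batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.norm = nn.LayerNorm(embed_dim)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        x = self.patch_embed(x).flatten(2).transpose(1, 2)   # (B, N, D)
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed      # (B, N + 1, D)
        x = self.encoder(x)
        return self.head(self.norm(x[:, 0]))                 # logits from [CLS]


# Small configuration for a quick shape check (values chosen for illustration).
model = TinyViT(img_size=32, patch_size=8, num_classes=10,
                embed_dim=64, depth=2, num_heads=4)
with torch.no_grad():
    logits = model(torch.randn(2, 3, 32, 32))
print(logits.shape)  # torch.Size([2, 10])
```

Only the [CLS] token's final representation reaches the classification head; the other tokens influence the prediction through self-attention inside the encoder.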

Practice Exercises

Patch visualization: Split an image into patches and visualize the embedding space using t-SNE.
Position embedding analysis: Train a ViT and visualize learned positional embeddings. Do they capture spatial structure?
Swin Transformer: Implement shifted window attention. Verify linear complexity on large images.
Data augmentation: Compare ViT performance with and without DeiT-style augmentation on ImageNet-1K.
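
For the Swin Transformer exercise, a quick sanity check on the complexity claim can be done before writing any attention code. The sketch below counts query-key pairs for global attention versus attention restricted to non-overlapping windows. The function names are hypothetical, the 7×7 window is an assumed default, and the shifted-window step itself is not implemented here.

```python
def global_attention_pairs(height: int, width: int) -> int:
    """Query-key pairs when every token attends to every other token."""
    n = height * width
    return n * n


def window_attention_pairs(height: int, width: int, window: int = 7) -> int:
    """Query-key pairs when attention is limited to window x window blocks."""
    if height % window or width % window:
        raise ValueError("feature map must be divisible by the window size")
    num_windows = (height // window) * (width // window)
    tokens_per_window = window * window
    return num_windows * tokens_per_window ** 2


for side in (56, 112, 224):
    g = global_attention_pairs(side, side)
    w = window_attention_pairs(side, side)
    print(f"{side}x{side} tokens: global={g:,}  windowed={w:,}")
```

Doubling the side length quadruples the token count. The windowed pair count also quadruples (linear in the number of tokens), while the global pair count grows by a factor of 16 (quadratic).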

Key Takeaways

What to Learn Next

-> Attention Mechanisms: Discover how attention solves the information bottleneck in sequence models.

-> CNN Architecture Deep Dive: Master convolutional layers, pooling, and modern CNN architectures.

-> Object Detection: Learn to detect and localize objects in images with YOLO and Faster R-CNN.

-> Semantic Segmentation: Classify every pixel in an image for detailed scene understanding.

-> DL Systems Design: Master distributed training, monitoring, and production deployment of deep learning models.

-> Model Compression: Make deep learning models fast and efficient for production deployment.
