PyTorch Fundamentals
Introduction
PyTorch is an open-source deep learning framework known for its dynamic computation graphs, Pythonic design, and strong GPU acceleration. It is widely used in research and increasingly in production.
PyTorch Ecosystem:
+---------------------------------------------------------------+
|                         PyTorch Core                          |
|    +---------+    +---------+    +---------+    +---------+   |
|    | Tensors |    |Autograd |    |nn.Module|    |  Optim  |   |
|    +---------+    +---------+    +---------+    +---------+   |
+---------------------------------------------------------------+
          |              |              |              |
          v              v              v              v
    +-----------+  +-----------+  +-----------+  +-----------+
    |   CUDA    |  |   Data    |  |torchvision|  |torchaudio |
    |   (GPU)   |  |  Loading  |  | (Vision)  |  |  (Audio)  |
    +-----------+  +-----------+  +-----------+  +-----------+
Tensors
Creating Tensors
Definition: Tensor
A tensor is a multi-dimensional array, the fundamental data structure in PyTorch. Tensors are similar to NumPy arrays but can run on GPUs for accelerated computing. Formally, a tensor of order $n$ is a multilinear map from $n$ vector spaces to the real numbers.
Note: Tensor vs. NumPy Array
The key difference between a PyTorch tensor and a NumPy array is that tensors support automatic differentiation (autograd) and can reside in GPU memory. The requires_grad flag determines whether PyTorch tracks operations on the tensor for gradient computation.
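The two differences described above can be sketched in a few lines. This is a minimal illustration, not part of the original walkthrough: the variable names (shared, copied, w) are made up for the example, and it runs on the CPU only.

```python
import numpy as np
import torch

# torch.from_numpy shares memory with the source array (CPU only),
# so an in-place change to the array is visible through the tensor.
np_array = np.array([1.0, 2.0, 3.0])
shared = torch.from_numpy(np_array)
np_array[0] = 10.0
print(shared)  # first element is now 10.0

# torch.tensor copies the data instead, so later changes do not propagate.
copied = torch.tensor(np_array)
np_array[1] = 20.0
print(copied)  # second element is still 2.0

# requires_grad=True asks autograd to track operations on the tensor.
w = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
loss = (w ** 2).sum()
loss.backward()
print(w.grad)  # gradient of sum(w^2) is 2 * w
```

A NumPy array has no counterpart to the last step: gradients would have to be derived and coded by hand.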
import torch
import numpy as np
# ---------------------------------------------------
# Tensor Creation Methods
# ---------------------------------------------------
# From Python lists
t1 = torch.tensor([1, 2, 3, 4])
print(f"From list: {t1}")
# From NumPy array
np_array = np.array([[1, 2], [3, 4]])
t2 = torch.from_numpy(np_array)
print(f"From NumPy: {t2}")
# Common creation functions
zeros = torch.zeros(3, 4) # 3x4 zeros
ones = torch.ones(3, 4) # 3x4 ones
rand = torch.rand(3, 4) # Uniform [0, 1)
randn = torch.randn(3, 4) # Standard normal
eye = torch.eye(3) # Identity matrix
arange = torch.arange(0, 10, 2) # [0, 2, 4, 6, 8]
linspace = torch.linspace(0, 1, 5) # [0, 0.25, 0.5, 0.75, 1]
print(f"\nZeros:\n{zeros}")
print(f"\nRandom:\n{randn}")
print(f"\nArange: {arange}")
# ---------------------------------------------------
# Tensor Properties
# ---------------------------------------------------
t = torch.randn(3, 4, 5)
print(f"\nShape: {t.shape}") # torch.Size([3, 4, 5])
print(f"Dimensions: {t.ndim}") # 3
print(f"Numel: {t.numel()}") # 60
print(f"Dtype: {t.dtype}") # torch.float32
print(f"Device: {t.device}") # cpu
print(f"Requires grad: {t.requires_grad}") # False
# ---------------------------------------------------
# Tensor Operations
# ---------------------------------------------------
# Element-wise operations
a = torch.tensor([1.0, 2.0, 3.0])
b = torch.tensor([4.0, 5.0, 6.0])
print(f"\nAddition: {a + b}")
print(f"Multiplication: {a * b}")
print(f"Power: {a ** 2}")
print(f"Square root: {torch.sqrt(a)}")
# Matrix operations
A = torch.randn(3, 4)
B = torch.randn(4, 5)
# Matrix multiplication
C = torch.mm(A, B) # 3x5
C_alt = A @ B # Same as mm
print(f"\nMatrix multiply shape: {C.shape}")
# Dot product
v1 = torch.randn(5)
v2 = torch.randn(5)
dot = torch.dot(v1, v2)
print(f"Dot product: {dot}")
# Broadcasting
x = torch.randn(3, 4)
y = torch.randn(4)
result = x + y # y broadcasts to (3, 4)
print(f"\nBroadcasting result shape: {result.shape}")
Matrix Multiplication (Core Operation)
$$C = AB, \qquad C_{ij} = \sum_{k=1}^{n} A_{ik} B_{kj}$$
Here,
- $A$ = Input matrix of shape ($m \times n$)
- $B$ = Input matrix of shape ($n \times p$)
- $C$ = Output matrix of shape ($m \times p$)
- $n$ = Inner dimension (must match)
Tip: Tensor Operations in PyTorch
PyTorch provides multiple ways to perform matrix multiplication: torch.mm, torch.matmul, and the @ operator. For batched operations, torch.bmm performs batched matrix multiplication. The @ operator is the most Pythonic and recommended approach.
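As a rough check of the options listed above, the sketch below multiplies the same pair of matrices three ways, compares the result against an explicit sum over the inner dimension, and then runs a batched product. The seed and the shapes are arbitrary choices for the example.

```python
import torch

torch.manual_seed(0)  # arbitrary seed, only for reproducibility
A = torch.randn(3, 4)
B = torch.randn(4, 5)

# Three equivalent ways to multiply two 2-D matrices.
C_mm = torch.mm(A, B)
C_matmul = torch.matmul(A, B)
C_op = A @ B
print(torch.allclose(C_mm, C_matmul), torch.allclose(C_mm, C_op))

# Explicit sum over the inner dimension: C[i, j] = sum_k A[i, k] * B[k, j].
C_manual = torch.zeros(3, 5)
for i in range(3):
    for j in range(5):
        C_manual[i, j] = (A[i, :] * B[:, j]).sum()
print(torch.allclose(C_mm, C_manual, atol=1e-5))

# Batched multiplication: torch.bmm expects 3-D inputs
# of shape (batch, m, n) and (batch, n, p).
batch_A = torch.randn(8, 3, 4)
batch_B = torch.randn(8, 4, 5)
batch_C = torch.bmm(batch_A, batch_B)
print(batch_C.shape)  # torch.Size([8, 3, 5])
```

The explicit double loop is only there to mirror the formula; it is far slower than the built-in calls and should not be used in practice.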
Tensor Manipulation
# ---------------------------------------------------
# Reshaping Operations
# ---------------------------------------------------
t = torch.arange(12)
print(f"Original: {t.shape} -> {t}")
# Reshape
t_reshaped = t.reshape(3, 4)
print(f"Reshape (3,4): {t_reshaped.shape}")
# View (shares memory)
t_view = t.view(4, 3)
print(f"View (4,3): {t_view.shape}")
# Transpose
t_transposed = t_reshaped.T
print(f"Transpose: {t_transposed.shape}")
# Permute
t_permuted = torch.randn(2, 3, 4).permute(2, 0, 1)
print(f"Permute (2,3,4) -> (4,2,3): {t_permuted.shape}")
# Flatten
t_flat = t_reshaped.flatten()
print(f"Flatten: {t_flat.shape}")
# Squeeze/Unsqueeze
t_squeeze = torch.randn(1, 3, 1, 4).squeeze()
print(f"Squeeze: {t_squeeze.shape}")
t_unsqueeze = t_squeeze.unsqueeze(0)
print(f"Unsqueeze: {t_unsqueeze.shape}")
# ---------------------------------------------------
# Indexing and Slicing
# ---------------------------------------------------
t = torch.arange(24).reshape(2, 3, 4)
print(f"\nOriginal shape: {t.shape}")
print(f"t[0]: {t[0].shape}") # First batch
print(f"t[:, 1]: {t[:, 1].shape}") # Second row
print(f"t[0, :, 2]: {t[0, :, 2].shape}") # Third column
# Boolean indexing
mask = t > 15
filtered = t[mask]
print(f"\nFiltered (>15): {filtered}")
# Fancy indexing
indices = torch.tensor([0, 2])
selected = t[:, indices, :]
print(f"Fancy indexing: {selected.shape}")
# ---------------------------------------------------
# Device Management (CPU/GPU)
# ---------------------------------------------------
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"\nUsing device: {device}")
# Move tensor to GPU
t_gpu = t.to(device)
print(f"Tensor device: {t_gpu.device}")
# Create directly on GPU
t_on_gpu = torch.randn(3, 4, device=device)
print(f"Created on GPU: {t_on_gpu.device}")
# Check GPU name and memory usage
if torch.cuda.is_available():
    print(f"GPU Name: {torch.cuda.get_device_name(0)}")
    print(f"Memory Allocated: {torch.cuda.memory_allocated(0) / 1e9:.2f} GB")
Autograd (Automatic Differentiation)
Computation Graphs
Definition: Autograd (Automatic Differentiation)
Autograd is PyTorch's automatic differentiation engine. It records operations on tensors to build a dynamic computation graph (a directed acyclic graph, DAG), then computes gradients via backpropagation when .backward() is called. This is a form of reverse-mode automatic differentiation.
Theorem: Chain Rule in Autograd
For a composition of functions $y = f(g(x))$, the derivative is computed as $\frac{dy}{dx} = f'(g(x)) \cdot g'(x)$. Autograd implements this automatically by traversing the computation graph in reverse topological order, applying the chain rule at each node.
Note: Dynamic vs. Static Graphs
PyTorch uses dynamic computation graphs (define-by-run): the graph is built on-the-fly during each forward pass. This differs from TensorFlow 1.x's static graphs, which required pre-defining the entire computation. Dynamic graphs enable natural Python control flow (if/else, loops) within the model, making debugging and experimentation much easier.
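To make the define-by-run idea concrete, here is a small sketch in which the number of recorded operations depends on the input value. The function dynamic_forward and its threshold of 10 are hypothetical, chosen only for illustration.

```python
import torch

def dynamic_forward(x, max_steps=5):
    """Hypothetical example: double x until it exceeds 10 (or max_steps is reached)."""
    y = x
    steps = 0
    # Ordinary Python control flow; each loop iteration adds one node to the graph.
    while y.item() <= 10 and steps < max_steps:
        y = y * 2
        steps += 1
    return y, steps

results = []
for value in (1.5, 4.0):
    x = torch.tensor(value, requires_grad=True)
    y, steps = dynamic_forward(x)
    y.backward()
    # y = x * 2**steps, so dy/dx = 2**steps
    results.append((steps, x.grad.item()))
    print(f"x={value}: steps={steps}, y={y.item()}, dy/dx={x.grad.item()}")
```

The two inputs take a different number of loop iterations, so each forward pass builds a different graph and yields a different gradient (8 and 4 respectively).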
Autograd Computation Graph:

Forward Pass:
    x (requires_grad=True)
        |
        v
    [ * 2 ]      y = x * 2
        |
        v
    [ + 3 ]      z = y + 3
        |
        v
    L = z² (loss)

Backward Pass (Auto):
    [ dL/dz ]    dL/dz = 2z
        |
        v
    [ dL/dy ]    dL/dy = 2z * 1
        |
        v
    [ dL/dx ]    dL/dx = 2z * 2
# ---------------------------------------------------
# Autograd Basics
# ---------------------------------------------------
# Create tensor with gradient tracking
x = torch.tensor(2.0, requires_grad=True)
print(f"x: {x}")
print(f"x.grad_fn: {x.grad_fn}") # None for leaf tensors
# Forward pass
y = x ** 2 + 3 * x + 1
print(f"\ny = x² + 3x + 1")
print(f"y: {y}")
print(f"y.grad_fn: {y.grad_fn}") # AddBackward0
# Backward pass (compute gradients)
y.backward()
# dy/dx = 2x + 3 = 2(2) + 3 = 7
print(f"\ndy/dx at x=2: {x.grad}") # tensor(7.)
# ---------------------------------------------------
# Gradient Computation Example
# ---------------------------------------------------
x = torch.randn(3, requires_grad=True)
w = torch.randn(3, requires_grad=True)
print(f"\nx: {x}")
print(f"w: {w}")
# Forward pass
y = torch.dot(x, w)
z = y ** 2
print(f"\ny = x · w = {y}")
print(f"z = y² = {z}")
# Backward pass
z.backward()
print(f"\ndz/dx = 2y · w = {x.grad}")
print(f"dz/dw = 2y · x = {w.grad}")
# Verify manually
manual_grad_x = 2 * y.detach() * w
manual_grad_w = 2 * y.detach() * x
print(f"\nManual dz/dx: {manual_grad_x}")
print(f"Manual dz/dw: {manual_grad_w}")
# ---------------------------------------------------
# Gradient Control
# ---------------------------------------------------
# Zero gradients (important before each step!)
x.grad.zero_()
# Stop gradient tracking
with torch.no_grad():
    y = x * 2
print(f"\nNo grad - y: {y}")
print(f"y.grad_fn: {y.grad_fn}") # None
# Detach from graph
y_detached = y.detach()
print(f"Detached: {y_detached}")
# Gradient accumulation (default behavior)
x.grad.zero_()
for _ in range(3):
    y = (x * 2).sum()  # backward() without arguments requires a scalar output
    y.backward()
    print(f"Gradient after step: {x.grad}")
# Gradients accumulate! Must zero manually
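Gradient of Composite Function
$$\frac{\partial L}{\partial x} = \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial x}$$
Here,
- $L$ = Output loss function
- $y$ = Intermediate variable
- $x$ = Input variable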
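The chain of partial derivatives can be inspected directly on the small graph from the computation-graph diagram (y = x * 2, z = y + 3, L = z²). The sketch below assumes x = 1 as an arbitrary starting value and uses retain_grad() so that the intermediate gradients are kept for printing.

```python
import torch

x = torch.tensor(1.0, requires_grad=True)
y = x * 2      # intermediate variable
z = y + 3
L = z ** 2     # scalar loss

# Non-leaf tensors do not keep .grad by default; retain_grad() opts in.
y.retain_grad()
z.retain_grad()
L.backward()

print(z.grad)  # dL/dz = 2z = 10
print(y.grad)  # dL/dy = dL/dz * dz/dy = 10 * 1 = 10
print(x.grad)  # dL/dx = dL/dy * dy/dx = 10 * 2 = 20
```

Each printed value is the previous one multiplied by a local derivative, which is the chain rule applied node by node in reverse order.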
Tip: Zeroing Gradients
Gradients accumulate by default in PyTorch. You must call optimizer.zero_grad() before each backward pass to prevent gradient accumulation across mini-batches. Alternatively, use model.zero_grad() to zero the gradients of a specific model's parameters.
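The effect of skipping the zeroing step can be seen on a toy model. This is a sketch under assumed settings: a single nn.Linear layer, random data, and SGD with an arbitrary learning rate stand in for a real model and dataset.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

torch.manual_seed(0)  # arbitrary seed, only for reproducibility
model = nn.Linear(4, 1)
optimizer = optim.SGD(model.parameters(), lr=0.1)
x = torch.randn(8, 4)
target = torch.randn(8, 1)

# Without zeroing: the second backward() adds to the first gradient.
F.mse_loss(model(x), target).backward()
first_grad = model.weight.grad.clone()
F.mse_loss(model(x), target).backward()
accumulated = model.weight.grad.clone()
print(torch.allclose(accumulated, 2 * first_grad))  # True

# With zeroing: the next backward() starts from a clean slate.
optimizer.zero_grad()
F.mse_loss(model(x), target).backward()
print(torch.allclose(model.weight.grad, first_grad))  # True
```

No optimizer.step() is called here, so the weights stay fixed and the same loss produces the same gradient each time; that is what makes the doubling easy to verify.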
nn.Module and Layers
Building Custom Layers
import torch.nn as nn
import torch.nn.functional as F
# ---------------------------------------------------
# Custom Linear Layer
# ---------------------------------------------------
class CustomLinear(nn.Module):
    def __init__(self, in_features, out_features, bias=True):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features))
        if bias:
            self.bias = nn.Parameter(torch.randn(out_features))
        else:
            self.bias = None
        # Initialize weights
        nn.init.kaiming_normal_(self.weight)
        if self.bias is not None:
            nn.init.zeros_(self.bias)

    def forward(self, x):
        return F.linear(x, self.weight, self.bias)
# Test custom layer
custom_linear = CustomLinear(10, 5)
x = torch.randn(32, 10) # Batch of 32
y = custom_linear(x)
print(f"Custom Linear output: {y.shape}")
# ---------------------------------------------------
# Built-in Layers
# ---------------------------------------------------
# Linear (fully connected)
linear = nn.Linear(10, 5)
print(f"\nLinear: {linear}")
# Convolutional
conv2d = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
x_img = torch.randn(1, 3, 32, 32) # Batch, channels, H, W
y_img = conv2d(x_img)
print(f"Conv2d: {y_img.shape}") # [1, 16, 32, 32]
# Pooling
pool = nn.MaxPool2d(2, 2)
y_pool = pool(y_img)
print(f"MaxPool2d: {y_pool.shape}") # [1, 16, 16, 16]
# Batch Normalization
bn = nn.BatchNorm2d(16)
y_bn = bn(y_img)
print(f"BatchNorm2d: {y_bn.shape}")
# Dropout
dropout = nn.Dropout(0.5)
y_drop = dropout(y_img)
print(f"Dropout: {y_drop.shape}")
# ---------------------------------------------------
# Complete Model Architecture
# ---------------------------------------------------
class CNNClassifier(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        # Feature extractor
        self.features = nn.Sequential(
            # Block 1: 3 -> 32 channels
            nn.Conv2d(3, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(inplace=True),
            nn.Conv2d(32, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),
            nn.Dropout2d(0.25),
            # Block 2: 32 -> 64 channels
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),
            nn.Dropout2d(0.25),
            # Block 3: 64 -> 128 channels
            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),
            nn.Dropout2d(0.25)
        )
        # Classifier (32x32 input -> 4x4 feature maps after three 2x2 poolings)
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * 4 * 4, 256),
            nn.ReLU(inplace=True),
            nn.Dropout(0.5),
            nn.Linear(256, 128),
            nn.ReLU(inplace=True),
            nn.Dropout(0.5),
            nn.Linear(128, num_classes)
        )

    def forward(self, x):
        x = self.features(x)
        x = self.classifier(x)
        return x
# Initialize model
model = CNNClassifier(num_classes=10)
print(f"\nModel parameters: {sum(p.numel() for p in model.parameters()):,}")
# Test forward pass
x = torch.randn(8, 3, 32, 32)
y = model(x)
print(f"Output shape: {y.shape}")
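Example: Autograd Gradient Computation
Given $y = x^2 + 3x + 1$, compute $\frac{dy}{dx}$ at $x = 2$.
Forward pass: $y = 2^2 + 3(2) + 1 = 11$
Manual derivative: $\frac{dy}{dx} = 2x + 3 = 2(2) + 3 = 7$
PyTorch autograd: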
x = torch.tensor(2.0, requires_grad=True)
y = x**2 + 3*x + 1
y.backward()
print(x.grad) # tensor(7.)
Autograd automatically computes the same result by tracking the computation graph and applying the chain rule in reverse.
Dataset and DataLoader
from torch.utils.data import Dataset, DataLoader, random_split
from torchvision import transforms
# ---------------------------------------------------
# Custom Dataset
# ---------------------------------------------------
class CustomDataset(Dataset):
    def __init__(self, num_samples=1000, transform=None):
        self.X = torch.randn(num_samples, 10)
        self.y = (self.X.sum(dim=1) > 0).long()
        self.transform = transform

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        x = self.X[idx]
        y = self.y[idx]
        if self.transform:
            x = self.transform(x)
        return x, y
# Create dataset
dataset = CustomDataset(num_samples=1000)
print(f"Dataset size: {len(dataset)}")
# Sample
x, y = dataset[0]
print(f"Sample x: {x.shape}, y: {y}")
# ---------------------------------------------------
# Train/Validation Split
# ---------------------------------------------------
train_size = int(0.8 * len(dataset))
val_size = len(dataset) - train_size
train_dataset, val_dataset = random_split(dataset, [train_size, val_size])
print(f"Train: {len(train_dataset)}, Val: {len(val_dataset)}")
# ---------------------------------------------------
# DataLoader
# ---------------------------------------------------
train_loader = DataLoader(
    train_dataset,
    batch_size=32,
    shuffle=True,
    num_workers=0,
    pin_memory=torch.cuda.is_available(),  # pinning only helps host-to-GPU copies
    drop_last=True
)
val_loader = DataLoader(
    val_dataset,
    batch_size=64,
    shuffle=False,
    num_workers=0
)
# Iterate through batches
for batch_idx, (x_batch, y_batch) in enumerate(train_loader):
    print(f"\nBatch {batch_idx}: x={x_batch.shape}, y={y_batch.shape}")
    if batch_idx >= 2:
        break
# ---------------------------------------------------
# Image Dataset with Transforms
# ---------------------------------------------------
transform = transforms.Compose([
    transforms.Resize((32, 32)),
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(10),
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.4914, 0.4822, 0.4465],
        std=[0.2470, 0.2435, 0.2616]
    )
])
# Using torchvision datasets
from torchvision.datasets import CIFAR10
cifar_train = CIFAR10(
    root='./data',
    train=True,
    download=True,
    transform=transform
)
cifar_loader = DataLoader(
    cifar_train,
    batch_size=64,
    shuffle=True,
    num_workers=2
)
print(f"\nCIFAR-10 training samples: {len(cifar_train)}")
Cross-Entropy Loss (Classification)
$$L = -\sum_{c=1}^{C} y_c \log(\hat{y}_c)$$
Here,
- $C$ = Number of classes
- $y_c$ = True label (one-hot encoded)
- $\hat{y}_c$ = Predicted probability for class $c$
Note: Why Cross-Entropy with Logits
PyTorch provides nn.CrossEntropyLoss, which combines nn.LogSoftmax and nn.NLLLoss in one operation. This is numerically more stable than applying softmax separately and then computing cross-entropy, because the fused log-softmax computation avoids numerical overflow and underflow.
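Both claims in the note can be checked with a few lines. The logits below are made-up values; the second case uses deliberately extreme logits to provoke the instability. Note that torch.softmax is itself stabilized internally, so the failure shown here comes from a probability underflowing to zero before the logarithm is taken.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Made-up logits for two samples and three classes.
logits = torch.tensor([[2.0, 0.5, -1.0],
                       [0.1, 0.2, 3.0]])
target = torch.tensor([0, 2])

# CrossEntropyLoss on raw logits matches LogSoftmax followed by NLLLoss.
loss_ce = nn.CrossEntropyLoss()(logits, target)
loss_two_step = nn.NLLLoss()(nn.LogSoftmax(dim=1)(logits), target)
print(torch.allclose(loss_ce, loss_two_step))  # True

# Extreme logits: softmax followed by log breaks down, the fused version does not.
big_logits = torch.tensor([[1000.0, 0.0, -1000.0]])
big_target = torch.tensor([2])
naive = -torch.log(torch.softmax(big_logits, dim=1))[0, 2]
stable = F.cross_entropy(big_logits, big_target)
print(naive)   # inf: the class probability underflows to 0, and log(0) = -inf
print(stable)  # finite, approximately 2000
```

This is why the model in this tutorial returns raw logits from its final linear layer rather than applying softmax itself.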
Complete Training Loop
import torch.optim as optim
from torch.optim.lr_scheduler import CosineAnnealingLR
# ---------------------------------------------------
# Training Loop Implementation
# ---------------------------------------------------
def train_epoch(model, loader, criterion, optimizer, device):
    model.train()
    total_loss = 0
    correct = 0
    total = 0
    for batch_idx, (data, target) in enumerate(loader):
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
        _, predicted = output.max(1)
        total += target.size(0)
        correct += predicted.eq(target).sum().item()
    accuracy = 100. * correct / total
    avg_loss = total_loss / len(loader)
    return avg_loss, accuracy

def validate(model, loader, criterion, device):
    model.eval()
    total_loss = 0
    correct = 0
    total = 0
    with torch.no_grad():
        for data, target in loader:
            data, target = data.to(device), target.to(device)
            output = model(data)
            loss = criterion(output, target)
            total_loss += loss.item()
            _, predicted = output.max(1)
            total += target.size(0)
            correct += predicted.eq(target).sum().item()
    accuracy = 100. * correct / total
    avg_loss = total_loss / len(loader)
    return avg_loss, accuracy
# ---------------------------------------------------
# Full Training Pipeline
# ---------------------------------------------------
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# CNNClassifier expects 3x32x32 images, so train on CIFAR-10 rather than
# the 10-feature CustomDataset loaders defined earlier.
# The CIFAR-10 test split is used for validation here, without augmentation.
eval_transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.4914, 0.4822, 0.4465],
        std=[0.2470, 0.2435, 0.2616]
    )
])
cifar_val = CIFAR10(root='./data', train=False, download=True, transform=eval_transform)
train_loader = cifar_loader
val_loader = DataLoader(cifar_val, batch_size=64, shuffle=False, num_workers=2)
# Initialize model, loss, optimizer
num_epochs = 50
model = CNNClassifier(num_classes=10).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-4)
scheduler = CosineAnnealingLR(optimizer, T_max=num_epochs)
# Training loop
history = {'train_loss': [], 'val_loss': [],
           'train_acc': [], 'val_acc': []}
for epoch in range(num_epochs):
    train_loss, train_acc = train_epoch(
        model, train_loader, criterion, optimizer, device
    )
    val_loss, val_acc = validate(
        model, val_loader, criterion, device
    )
    scheduler.step()
    history['train_loss'].append(train_loss)
    history['val_loss'].append(val_loss)
    history['train_acc'].append(train_acc)
    history['val_acc'].append(val_acc)
    if (epoch + 1) % 10 == 0:
        print(f"Epoch {epoch+1}/{num_epochs}:")
        print(f" Train Loss: {train_loss:.4f}, Acc: {train_acc:.2f}%")
        print(f" Val Loss: {val_loss:.4f}, Acc: {val_acc:.2f}%")
# Save model
torch.save({
    'epoch': num_epochs,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'val_acc': val_acc,
}, 'model_checkpoint.pth')
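A checkpoint saved in this dictionary form can be restored along the following lines. The sketch is a self-contained round trip: a small nn.Linear model stands in for the CNN, the file goes to a temporary directory, and the epoch value is a placeholder. These are all assumptions made so the example runs on its own.

```python
import os
import tempfile

import torch
import torch.nn as nn
import torch.optim as optim

# Small stand-in model (hypothetical); the same pattern applies to any nn.Module.
model = nn.Linear(4, 2)
optimizer = optim.Adam(model.parameters(), lr=0.001)

path = os.path.join(tempfile.mkdtemp(), "model_checkpoint.pth")
torch.save({
    "epoch": 50,  # placeholder value
    "model_state_dict": model.state_dict(),
    "optimizer_state_dict": optimizer.state_dict(),
}, path)

# Rebuild the same architecture first, then load the saved state into it.
restored = nn.Linear(4, 2)
restored_optimizer = optim.Adam(restored.parameters(), lr=0.001)
checkpoint = torch.load(path, map_location="cpu")
restored.load_state_dict(checkpoint["model_state_dict"])
restored_optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
restored.eval()  # switch to inference mode before evaluating

print(checkpoint["epoch"])
print(torch.equal(model.weight, restored.weight))
```

Saving state dictionaries rather than the whole model object keeps the checkpoint independent of the class definition's file location, at the cost of having to reconstruct the architecture before loading.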
Key Takeaways
- Tensors are PyTorch's core data structure: multi-dimensional arrays with GPU support and automatic differentiation
- Autograd builds dynamic computation graphs and computes gradients via reverse-mode automatic differentiation (backpropagation)
- nn.Module is the base class for all neural network layers and models; the forward() method defines the computation
- Dataset/DataLoader provide efficient data loading with batching, shuffling, and parallelism via multiprocessing
- Always zero gradients with optimizer.zero_grad() before each backward pass to prevent accumulation
- Use torch.no_grad() during evaluation to disable gradient tracking and save memory
- Cross-Entropy Loss should be used with raw logits (not softmax outputs) for numerical stability
Practice Exercises
- Tensor Operations: Implement matrix multiplication and verify with torch.mm and @
- Gradient Flow: Create a deep network and visualize gradient magnitudes across layers
- Custom Layer: Implement a custom attention mechanism as an nn.Module
- Data Pipeline: Create a custom dataset for CSV data with preprocessing transforms