PyTorch Fundamentals: Tensors, Autograd and GPU Computing

Module 12: Deep Learning

Prerequisites: Python proficiency, basic linear algebra (vectors, matrices, tensor operations), calculus (derivatives, chain rule), and introductory programming concepts.

1. PyTorch vs TensorFlow: A Comparative Analysis

Understanding the landscape of deep learning frameworks is essential for choosing the right tool for your research or production needs.

1.1 Historical Context

| Feature | PyTorch | TensorFlow |
| --- | --- | --- |
| Developer | Meta AI (Facebook) | Google Brain |
| Initial Release | 2016 | 2015 |
| Design Philosophy | Pythonic, imperative | Static graphs, production-oriented |
| Primary Users | Researchers, Academia | Industry, Production |
| Computation Graph | Dynamic (Define-by-Run) | Static (TF 1.x) / Eager (TF 2.x) |
| Deployment | TorchScript, ONNX | TF Serving, TF Lite, TF.js |
| Ecosystem | torchtext, torchaudio, torchvision | TFX, Keras, TF Hub |

1.2 Fundamental Differences

Dynamic Computation Graphs (PyTorch):

import torch

# Each forward pass creates a new graph
x = torch.randn(3, requires_grad=True)
for i in range(3):
    y = x ** 2 + 2 * x  # Graph rebuilt each iteration
    z = y.sum()
    z.backward()         # Gradients computed immediately
    print(f"Step {i}: grad = {x.grad}")
    x.grad.zero_()       # Reset gradients

Static Computation Graphs (TensorFlow 1.x paradigm):

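# The TF 1.x API used here was removed from the top-level namespace in TF 2.x;
# it remains available through the compat.v1 module
import tensorflow.compat.v1 as tf

tf.disable_eager_execution()  # Restore graph mode when running on TF 2.x

# Graph defined once, executed many times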
x = tf.placeholder(tf.float32, shape=[3])
y = tf.pow(x, 2) + 2 * x
z = tf.reduce_sum(y)
with tf.Session() as sess:
    for i in range(3):
        result = sess.run(z, feed_dict={x: [1.0, 2.0, 3.0]})
        print(f"Step {i}: result = {result}")

1.3 When to Choose Which?

  • PyTorch: Research prototyping, dynamic architectures (RNNs with variable lengths), debugging flexibility
  • TensorFlow: Production deployment, mobile/edge inference, established pipelines
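To make the "dynamic architectures" point concrete, here is a minimal sketch in which the number of graph nodes depends on the length of each input sequence. The `encode` helper, the sizes, and the random data are illustrative assumptions, not part of the lesson's code.

```python
import torch

torch.manual_seed(0)

# A single recurrent cell reused for every time step
cell = torch.nn.RNNCell(input_size=4, hidden_size=8)

def encode(seq: torch.Tensor) -> torch.Tensor:
    """Run the cell over a (T, 4) sequence; T may differ between calls."""
    h = torch.zeros(1, 8)
    for step in seq:                      # Plain Python loop: the graph grows with T
        h = cell(step.unsqueeze(0), h)
    return h

short_seq = torch.randn(3, 4)             # 3 time steps
long_seq = torch.randn(7, 4)              # 7 time steps

h_short = encode(short_seq)
h_long = encode(long_seq)

# One backward pass covers both graphs, although they have different depths
loss = h_short.sum() + h_long.sum()
loss.backward()
print(h_short.shape, h_long.shape, cell.weight_ih.grad.shape)
```

Because the graph is rebuilt on every call, no padding or fixed unroll length is needed for this to work; batching variable-length sequences efficiently is a separate concern.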

2. Tensors: The Fundamental Data Structure

2.1 What is a Tensor?

A tensor is a generalization of scalars (0-D), vectors (1-D), and matrices (2-D) to arbitrary dimensions. Mathematically, a tensor $T \in \mathbb{R}^{n_1 \times n_2 \times \cdots \times n_d}$ is a multidimensional array with $d$ dimensions (axes) and shape $(n_1, n_2, \ldots, n_d)$.

[Figure: PyTorch Ecosystem Overview. Components shown: PyTorch Core, torchvision, torchaudio, torchtext, torchmetrics, TorchScript, ONNX, torch.compile, torch.distributed, CUDA / ROCm (GPU), CPU / TPU.]

2.2 Tensor Creation

import torch

# === From Python data structures ===
scalar_tensor = torch.tensor(5.0)                     # 0-D
vector_tensor = torch.tensor([1.0, 2.0, 3.0])        # 1-D
matrix_tensor = torch.tensor([[1, 2], [3, 4]])        # 2-D
cube_tensor = torch.tensor([[[1, 2], [3, 4]],         # 3-D
                             [[5, 6], [7, 8]]])

# === Factory functions ===
zeros = torch.zeros(3, 4)                              # All zeros
ones = torch.ones(2, 3, 4)                             # All ones
rand_uniform = torch.rand(5, 5)                        # Uniform [0, 1)
rand_normal = torch.randn(5, 5)                        # Standard normal
eye = torch.eye(4)                                     # Identity matrix
arange = torch.arange(0, 10, 2)                        # Range
linspace = torch.linspace(0, 1, 100)                   # Linear spacing
empty = torch.empty(2, 3)                              # Uninitialized

# === From NumPy ===
import numpy as np
np_array = np.array([[1, 2], [3, 4]])
tensor_from_numpy = torch.from_numpy(np_array)         # Shares memory
tensor_copy = torch.tensor(np_array)                   # Copies data

# === With specific dtype ===
float32_tensor = torch.tensor([1.0, 2.0], dtype=torch.float32)
float64_tensor = torch.tensor([1.0, 2.0], dtype=torch.float64)
int64_tensor = torch.tensor([1, 2], dtype=torch.int64)
bool_tensor = torch.tensor([True, False], dtype=torch.bool)

2.3 Tensor Attributes and Properties

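# Fall back to the CPU so the example also runs on machines without a GPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'
x = torch.randn(3, 4, 5, dtype=torch.float32, device=device)

# Essential attributes
print(f"Shape:    {x.shape}")           # torch.Size([3, 4, 5])
print(f"Strides:  {x.stride()}")        # (20, 5, 1)
print(f"Offset:   {x.storage_offset()}")# 0
print(f"Dtype:    {x.dtype}")           # torch.float32
print(f"Device:   {x.device}")          # cuda:0 (or cpu without a GPU)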
print(f"Numel:    {x.numel()}")         # 60
print(f"Dim:      {x.ndim}")            # 3
print(f"Requires grad: {x.requires_grad}")  # False
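The stride and offset attributes are easier to read with a small worked example. The sketch below, using an illustrative `arange` tensor, shows how an index maps to a position in the underlying storage and how `transpose` changes strides without moving data.

```python
import torch

x = torch.arange(60).reshape(3, 4, 5)     # Contiguous tensor, strides (20, 5, 1)

# Element x[i, j, k] lives at offset + i*s0 + j*s1 + k*s2 in flat storage
i, j, k = 2, 1, 3
s0, s1, s2 = x.stride()
flat_index = x.storage_offset() + i * s0 + j * s1 + k * s2
print(flat_index, x.flatten()[flat_index].item(), x[i, j, k].item())

# transpose() returns a view: same memory, permuted strides
xt = x.transpose(0, 2)
print(xt.shape, xt.stride(), xt.is_contiguous())
print(xt.data_ptr() == x.data_ptr())      # Both views start at the same address

# contiguous() materializes a copy laid out in row-major order
xc = xt.contiguous()
print(xc.stride(), xc.is_contiguous())
```

This is also why `view` can fail on a transposed tensor while `reshape` succeeds: `reshape` falls back to a copy when the strides cannot express the new shape.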

2.4 Tensor Operations

# === Element-wise operations ===
a = torch.tensor([1.0, 2.0, 3.0])
b = torch.tensor([4.0, 5.0, 6.0])

c = a + b          # tensor([5., 7., 9.])
c = a * b          # tensor([4., 10., 18.])
c = a ** 2         # tensor([1., 4., 9.])
c = torch.exp(a)   # tensor([2.7183, 7.3891, 20.0855])
c = torch.log(b)   # tensor([1.3863, 1.6094, 1.7918])

# === Matrix operations ===
A = torch.randn(3, 4)
B = torch.randn(4, 5)

C = A @ B          # Matrix multiply (3, 4) x (4, 5) -> (3, 5)
C = torch.mm(A, B) # Equivalent
C = torch.matmul(A, B)  # Batch-aware

# === Reduction operations ===
x = torch.randn(3, 4)

print(x.sum())            # Scalar sum
print(x.sum(dim=0))       # Sum along dim 0 -> shape (4,)
print(x.sum(dim=1))       # Sum along dim 1 -> shape (3,)
print(x.mean())           # Mean
print(x.std())            # Standard deviation
print(x.max())            # Max value
print(x.argmax())         # Index of max

# === Reshaping operations ===
x = torch.arange(12)
y = x.view(3, 4)          # Reshape (no copy)
z = x.reshape(3, 4)       # May copy if needed
w = x.unsqueeze(0)        # Add dimension: (1, 12)
v = w.squeeze(0)          # Remove dimension: (12,)
u = x.view(2, 2, 3)       # 3-D reshape

# === Permutation and transposition ===
x = torch.randn(2, 3, 4)
y = x.permute(2, 0, 1)    # Reorder dimensions -> (4, 2, 3)
z = x.transpose(0, 2)     # Swap two dimensions

# === Broadcasting ===
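# PyTorch follows NumPy broadcasting rules:
# 1. Shapes are aligned from the trailing (rightmost) dimension; if the tensors
#    have different ndim, 1s are prepended to the shorter shape
# 2. Two dimensions are compatible if they are equal or one of them is 1
# 3. Dimensions of size 1 are stretched (without copying) to match the other size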
a = torch.randn(3, 1)     # (3, 1)
b = torch.randn(1, 4)     # (1, 4)
c = a + b                  # (3, 4) -- broadcast to common shape

# Example with higher dimensions
x = torch.randn(8, 1, 6, 1)
y = torch.randn(7, 1, 5)
z = x + y                  # (8, 7, 6, 5) -- broadcast along all dims
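The broadcasting rules can be checked mechanically. The following sketch implements them in a hypothetical helper, `broadcast_shape`, and compares the result with PyTorch's own `torch.broadcast_shapes`.

```python
from itertools import zip_longest

import torch

def broadcast_shape(*shapes):
    """Compute the broadcast result shape by hand (illustrative helper)."""
    result = []
    # Walk the shapes from the trailing dimension; missing dims count as 1
    for dims in zip_longest(*(reversed(s) for s in shapes), fillvalue=1):
        sizes = {d for d in dims if d != 1}
        if len(sizes) > 1:
            raise ValueError(f"incompatible dimensions: {dims}")
        result.append(sizes.pop() if sizes else 1)
    return tuple(reversed(result))

manual = broadcast_shape((8, 1, 6, 1), (7, 1, 5))
builtin = tuple(torch.broadcast_shapes((8, 1, 6, 1), (7, 1, 5)))
print(manual, builtin)

# Incompatible shapes are rejected: 3 and 4 are unequal and neither is 1
try:
    broadcast_shape((3, 4), (3,))
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
print(mismatch_detected)
```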

2.5 Advanced Indexing and Slicing

x = torch.arange(24).reshape(2, 3, 4)

# Basic slicing
print(x[0])                # First batch element (3, 4)
print(x[:, 1, :])          # All batches, second row, all cols
print(x[..., -1])          # Ellipsis notation, last element along last dim

# Boolean masking
mask = x > 10
print(x[mask])              # 1-D tensor of elements > 10

# Fancy indexing
indices = torch.tensor([[0, 2], [1, 2]])  # Row indices must be < 3, since x[0] has 3 rows
print(x[0, indices])       # Advanced indexing over rows -> shape (2, 2, 4)

# Where (conditional selection)
x = torch.randn(3, 4)
y = torch.where(x > 0, x, torch.zeros_like(x))  # ReLU-like

2.6 In-place Operations

# Operations with trailing underscore modify tensor in-place
x = torch.randn(3, 4)

x.zero_()                  # In-place zero fill
x.add_(5)                  # In-place addition
x.mul_(2)                  # In-place multiplication
x.clamp_(min=0)            # In-place ReLU

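# In-place operations avoid allocating a new tensor, which saves memory
# But autograd raises an error if they overwrite values needed for the backward
# pass, and they are not allowed on leaf tensors that require gradients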
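How in-place operations interact with autograd depends on what the backward pass needs. The sketch below is an illustration with arbitrary values; it shows one case that works and two that raise `RuntimeError`.

```python
import torch

w = torch.ones(3, requires_grad=True)

# Case 1: allowed. Multiplying by a constant does not need its own output
# for the backward pass, so modifying that output in place is safe.
z = w * 2
z.add_(1.0)
z.sum().backward()
safe_grad = w.grad.clone()

# Case 2: error. A leaf tensor that requires grad cannot be modified in place.
try:
    w.add_(1.0)
    leaf_error = None
except RuntimeError as err:
    leaf_error = str(err)

# Case 3: error at backward time. exp() saves its output to compute the
# gradient, and the in-place add overwrites that saved value.
y = torch.exp(w)
y.add_(1.0)
try:
    y.sum().backward()
    backward_error = None
except RuntimeError as err:
    backward_error = str(err)

# Updating a leaf in place is fine when autograd is not recording
with torch.no_grad():
    w.add_(1.0)

print(safe_grad, leaf_error is not None, backward_error is not None, w)
```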

3. Autograd: Automatic Differentiation

3.1 Computational Graph Concept

Autograd builds a Directed Acyclic Graph (DAG) where:

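  • Leaf nodes: Tensors created directly by the user rather than by an operation (inputs and parameters); leaves with requires_grad=True accumulate gradients in .grad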
  • Intermediate nodes: Operations producing outputs
  • Root node: The scalar loss (backpropagation starts here)

[Figure: Computational Graph for Autograd. Forward pass: x (leaf), w (leaf) → matmul → + bias → logits → CrossEntropy → loss. The backward pass traverses the same graph in reverse.]

Leaf tensors (requires_grad=True) accumulate gradients in .grad when .backward() is called. The buffers saved in the graph for the backward pass are freed after .backward() unless retain_graph=True is specified.

3.2 Enabling Gradient Tracking

# requires_grad=True enables gradient computation for a tensor
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# Operations on x will track gradients
y = x ** 2
z = y.sum()
z.backward()

print(x.grad)  # tensor([2., 4., 6.]) -- dz/dx = 2x

# Detach from graph
y_detached = y.detach()  # Shares storage, no grad tracking

# torch.no_grad() context manager
with torch.no_grad():
    y = x ** 2  # No graph constructed here

3.3 The Chain Rule in Autograd

For a composition $f(g(h(x)))$, autograd computes:

$$\frac{df}{dx} = \frac{df}{dg} \cdot \frac{dg}{dh} \cdot \frac{dh}{dx}$$

# Manual chain rule verification
x = torch.tensor(2.0, requires_grad=True)

# Forward: x -> x^2 -> x^2 + 1 -> (x^2 + 1)^3
y = x ** 2            # dy/dx = 2x = 4
z = y + 1             # dz/dy = 1
w = z ** 3            # dw/dz = 3(z^2) = 3(25) = 75
w.backward()

# dw/dx = dw/dz * dz/dy * dy/dx = 75 * 1 * 4 = 300
print(x.grad)  # tensor(300.)
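The analytic result of 300 can also be cross-checked numerically. The sketch below compares autograd with a central finite difference and with `torch.autograd.gradcheck`, in double precision to keep rounding error small. The function name `f` and the step size are illustrative choices.

```python
import torch

def f(x: torch.Tensor) -> torch.Tensor:
    return (x ** 2 + 1) ** 3

# Autograd result at x = 2
x = torch.tensor([2.0], dtype=torch.float64, requires_grad=True)
f(x).sum().backward()
analytic = x.grad.item()

# Central finite difference: (f(x + eps) - f(x - eps)) / (2 * eps)
eps = 1e-6
with torch.no_grad():
    numeric = ((f(x + eps) - f(x - eps)) / (2 * eps)).item()

# gradcheck performs the same kind of comparison for every input element
x_check = torch.tensor([2.0], dtype=torch.float64, requires_grad=True)
passed = torch.autograd.gradcheck(f, (x_check,))

print(analytic, numeric, passed)
```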

3.4 Gradient Accumulation

# IMPORTANT: Gradients accumulate by default!
x = torch.tensor(2.0, requires_grad=True)

for i in range(3):
    y = x ** 2
    y.backward()
    print(f"Iteration {i}: grad = {x.grad}")
    # Iteration 0: grad = 4.0
    # Iteration 1: grad = 8.0   (4 + 4)
    # Iteration 2: grad = 12.0  (8 + 4)
    # x.grad is NOT reset between iterations!

# Always zero gradients before backward pass
x.grad.zero_()
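Accumulation is not only a pitfall; it is also what makes training with micro-batches possible. The sketch below, on illustrative random data, shows that summing the gradients of four equally sized micro-batches (each loss divided by the number of micro-batches) reproduces the full-batch gradient.

```python
import torch

torch.manual_seed(0)
w = torch.randn(5, requires_grad=True)
X = torch.randn(8, 5)
t = torch.randn(8)

# Reference: gradient of the mean squared error over the full batch
loss_full = ((X @ w - t) ** 2).mean()
loss_full.backward()
full_grad = w.grad.clone()
w.grad.zero_()                            # Reset before the second experiment

# Same gradient from 4 micro-batches of 2 samples each
num_chunks = 4
for X_chunk, t_chunk in zip(X.chunk(num_chunks), t.chunk(num_chunks)):
    loss = ((X_chunk @ w - t_chunk) ** 2).mean() / num_chunks
    loss.backward()                       # Adds into w.grad
accumulated_grad = w.grad.clone()

print(torch.allclose(full_grad, accumulated_grad, atol=1e-5))
```

In a real training loop, `optimizer.step()` and `optimizer.zero_grad()` would be called once after the last micro-batch rather than after every backward pass.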

3.5 Higher-Order Derivatives

x = torch.tensor(3.0, requires_grad=True)

# First derivative
y = x ** 3              # y = x^3
first_grad = torch.autograd.grad(y, x, create_graph=True)[0]
# dy/dx = 3x^2 = 27

# Second derivative
second_grad = torch.autograd.grad(first_grad, x, create_graph=True)[0]
# d^2y/dx^2 = 6x = 18

# Third derivative
third_grad = torch.autograd.grad(second_grad, x)[0]
# d^3y/dx^3 = 6

print(f"1st: {first_grad}, 2nd: {second_grad}, 3rd: {third_grad}")

3.6 Jacobian and Hessian Computation

# Jacobian: matrix of all first-order partial derivatives
def f(x):
    return torch.stack([x[0]**2 + x[1], x[1]**3 * x[0]])

x = torch.tensor([1.0, 2.0], requires_grad=True)
y = f(x)

# Manual Jacobian computation: one backward pass per output row
jacobian = torch.zeros(2, 2)
for i in range(2):
    # Gradient of y[i] with respect to every component of x
    grad = torch.autograd.grad(y[i], x, retain_graph=True)[0]
    jacobian[i] = grad

# J = [[2x0,  1  ],    At (1,2): J = [[2, 1],
#      [x1^3, 3x1^2*x0]]                      [8, 12]]
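PyTorch also ships helpers that compute these matrices directly. The sketch below uses `torch.autograd.functional.jacobian` on the same function `f` and `torch.autograd.functional.hessian` on a scalar function `g`. Here `g` is an illustrative function chosen for this example, since a Hessian is only defined for scalar outputs.

```python
import torch
from torch.autograd.functional import hessian, jacobian

def f(x):
    # Vector-valued function from the manual example
    return torch.stack([x[0] ** 2 + x[1], x[1] ** 3 * x[0]])

def g(x):
    # Scalar-valued function (illustrative) for the Hessian
    return x[0] ** 2 * x[1] + x[1] ** 3

x = torch.tensor([1.0, 2.0])

J = jacobian(f, x)   # J[i, j] = d f_i / d x_j
H = hessian(g, x)    # H[i, j] = d^2 g / (d x_i d x_j)

print(J)
print(H)
```

For `g`, the second derivatives are $\partial^2 g/\partial x_0^2 = 2x_1$, $\partial^2 g/\partial x_0 \partial x_1 = 2x_0$, and $\partial^2 g/\partial x_1^2 = 6x_1$, so at $(1, 2)$ the Hessian is $[[4, 2], [2, 12]]$.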

4. GPU Computing: CUDA Tensors and Device Management

4.1 GPU vs CPU Computation Flow

[Figure: GPU vs CPU Computation Flow. CPU path: Python / NumPy data → torch.tensor(..., device='cpu') → CPU compute (sequential) → result on CPU; best for small tensors and debugging. GPU path (CUDA): Python / NumPy data → torch.tensor(..., device='cuda') → CUDA kernels (massive parallelism) → result on GPU → .cpu() to transfer back to the host; best for large models and batch training.]

4.2 Device Detection and Management

import torch

# Check CUDA availability
print(torch.cuda.is_available())          # True if a usable GPU is present
print(torch.cuda.device_count())          # Number of GPUs
if torch.cuda.is_available():             # The calls below raise without a GPU
    print(torch.cuda.get_device_name(0))      # GPU name
    print(torch.cuda.get_device_properties(0))# Full properties

# Device agnostic code
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

# Move tensors to device
x = torch.randn(3, 4)
x_gpu = x.to(device)           # Transfer to the selected device (no-op if already there)
# x_gpu = x.cuda()             # Same as x.to('cuda'); raises if no GPU is available
x_cpu = x_gpu.cpu()            # Transfer back to CPU

# Create directly on device
x = torch.randn(3, 4, device=device)

4.3 CUDA Operations

# Pin memory for faster CPU -> GPU transfer
x = torch.randn(1000, 1000).pin_memory()
y = x.cuda(non_blocking=True)  # Async transfer

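# Mixed precision training
# (model, inputs, target, loss_fn, and optimizer are assumed to be defined elsewhere)
# Note: recent PyTorch releases deprecate torch.cuda.amp in favor of torch.amp
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()

optimizer.zero_grad()
with autocast():
    output = model(inputs)
    loss = loss_fn(output, target)

scaler.scale(loss).backward()   # Scale the loss to avoid fp16 gradient underflow
scaler.step(optimizer)          # Unscale gradients; skip the step on inf/NaN
scaler.update()                 # Adjust the scale factor for the next iteration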

# Memory management
print(torch.cuda.memory_allocated())    # Bytes allocated
print(torch.cuda.memory_reserved())     # Bytes reserved
torch.cuda.empty_cache()                # Release cached, unused blocks back to the driver
torch.cuda.synchronize()                # Wait for all ops
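Because CUDA kernels are launched asynchronously, `torch.cuda.synchronize()` matters most when measuring time. The sketch below is a device-agnostic timing helper; the name `time_matmul`, the matrix size, and the repeat count are illustrative assumptions.

```python
import time

import torch

def time_matmul(size: int, device: torch.device, repeats: int = 5) -> float:
    """Average seconds per (size x size) matmul on the given device."""
    a = torch.randn(size, size, device=device)
    b = torch.randn(size, size, device=device)

    a @ b                                 # Warm-up run (kernel launch, caches)
    if device.type == "cuda":
        torch.cuda.synchronize()          # Make sure the warm-up has finished

    start = time.perf_counter()
    for _ in range(repeats):
        a @ b
    if device.type == "cuda":
        torch.cuda.synchronize()          # Wait for queued kernels before stopping the clock
    return (time.perf_counter() - start) / repeats

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
seconds = time_matmul(256, device)
print(f"{device}: {seconds * 1e3:.3f} ms per matmul")
```

Without the second synchronize call, a GPU measurement would mostly capture the time to enqueue the kernels rather than the time to execute them.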

4.4 Multi-GPU Training

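# Option 1: DataParallel (simple but less efficient)
import torch.nn as nn
model = nn.DataParallel(model)

# Option 2: DistributedDataParallel (recommended), one process per GPU.
# Assumes the script is launched with torchrun, which sets LOCAL_RANK.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group('nccl')
local_rank = int(os.environ['LOCAL_RANK'])
torch.cuda.set_device(local_rank)
model = DDP(model.to(local_rank), device_ids=[local_rank])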

5. nn.Module: Building Custom Layers

5.1 Module Architecture

[Figure: Neural Network Module Hierarchy. nn.Module (base class) is the parent of built-in layers such as nn.Linear, nn.Conv2d, nn.LSTM, nn.BatchNorm2d, nn.Dropout, and nn.ReLU. A custom module is an nn.Module subclass that combines layers and defines the forward pass.]

5.2 Custom Layer Implementation

import torch
import torch.nn as nn
import torch.nn.functional as F

class CustomLinear(nn.Module):
    """Custom linear layer: y = xW^T + b"""
    
    def __init__(self, in_features, out_features, bias=True):
        super().__init__()
        self.weight = nn.Parameter(
            torch.randn(out_features, in_features)
        )
        if bias:
            self.bias = nn.Parameter(torch.zeros(out_features))
        else:
            self.register_parameter('bias', None)
    
    def forward(self, x):
        return F.linear(x, self.weight, self.bias)

# Usage
layer = CustomLinear(128, 64)
x = torch.randn(32, 128)  # batch_size=32, features=128
y = layer(x)               # shape: (32, 64)

print(layer.weight.shape)  # torch.Size([64, 128])
print(layer.bias.shape)    # torch.Size([64])

5.3 Complete Network Architecture

class DeepNet(nn.Module):
    """Multi-layer network with skip connections"""
    
    def __init__(self, input_dim, hidden_dim, output_dim, num_layers=4, dropout=0.1):
        super().__init__()
        
        self.input_proj = nn.Linear(input_dim, hidden_dim)
        self.layers = nn.ModuleList([
            nn.Sequential(
                nn.Linear(hidden_dim, hidden_dim),
                nn.LayerNorm(hidden_dim),
                nn.GELU(),
                nn.Dropout(dropout),
            ) for _ in range(num_layers)
        ])
        self.output_proj = nn.Linear(hidden_dim, output_dim)
        self.dropout = nn.Dropout(dropout)
    
    def forward(self, x):
        h = self.input_proj(x)
        h = F.gelu(h)
        
        for layer in self.layers:
            h = h + layer(h)  # Skip connection
        
        h = self.dropout(h)
        return self.output_proj(h)

# Initialize model
model = DeepNet(input_dim=784, hidden_dim=256, output_dim=10)

# Count parameters
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Total parameters: {total_params:,}")
print(f"Trainable parameters: {trainable_params:,}")

5.4 Parameter Management

# Named parameters
for name, param in model.named_parameters():
    print(f"{name}: {param.shape}, requires_grad={param.requires_grad}")

# Freeze specific layers
for param in model.layers[:2].parameters():
    param.requires_grad = False

# Custom weight initialization
def init_weights(m):
    if isinstance(m, nn.Linear):
        nn.init.kaiming_normal_(m.weight, nonlinearity='relu')
        if m.bias is not None:
            nn.init.zeros_(m.bias)

model.apply(init_weights)
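Freezing can be verified by checking that frozen weights stay unchanged after an optimizer step. The sketch below uses a small stand-in `nn.Sequential` model and random data, both assumptions made for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))

# Freeze the first layer
for param in model[0].parameters():
    param.requires_grad = False

frozen_before = model[0].weight.clone()
head_before = model[2].weight.clone()

# Hand only the trainable parameters to the optimizer
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=0.1)

loss = model(torch.randn(16, 4)).pow(2).mean()
loss.backward()
optimizer.step()

print(model[0].weight.grad)                        # No gradient was computed
print(torch.equal(model[0].weight, frozen_before)) # Frozen layer unchanged
print(torch.equal(model[2].weight, head_before))   # Trainable layer was updated
```

Filtering the parameter list is optional for plain SGD, since parameters without gradients are skipped, but it keeps optimizer state small and makes the intent explicit.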

6. DataLoader and Dataset

6.1 Data Pipeline Overview

[Figure: PyTorch Data Pipeline. Stages shown: Dataset (__getitem__(), __len__()), DataLoader (batch, shuffle, num_workers, collate), Transform (Compose, Lambda, Normalize, Resize), Model (forward()). Common dataset types: TensorDataset, ImageFolder, CelebA. Common transforms: ToTensor, Normalize, RandomCrop, ToPILImage.]

6.2 Custom Dataset Implementation

from torch.utils.data import Dataset, DataLoader

class CustomDataset(Dataset):
    def __init__(self, features, labels, transform=None):
        self.features = torch.tensor(features, dtype=torch.float32)
        self.labels = torch.tensor(labels, dtype=torch.long)
        self.transform = transform
    
    def __len__(self):
        return len(self.labels)
    
    def __getitem__(self, idx):
        x = self.features[idx]
        y = self.labels[idx]
        
        if self.transform:
            x = self.transform(x)
        
        return x, y

# Example usage
import numpy as np
X = np.random.randn(1000, 20)
y = np.random.randint(0, 5, 1000)

dataset = CustomDataset(X, y)
print(f"Dataset size: {len(dataset)}")
x_sample, y_sample = dataset[0]
print(f"Sample shape: {x_sample.shape}, label: {y_sample}")

6.3 DataLoader Configuration

# Create DataLoader
dataloader = DataLoader(
    dataset,
    batch_size=32,
    shuffle=True,
    num_workers=4,          # Parallel data loading
    pin_memory=True,        # Faster GPU transfer
    drop_last=True,         # Drop incomplete batches
    collate_fn=None,        # Custom collation
    sampler=None,           # Custom sampling strategy
)

# Iterate over batches (assumes model, loss_fn, optimizer, and device are already defined)
for batch_idx, (inputs, targets) in enumerate(dataloader):
    inputs = inputs.to(device)
    targets = targets.to(device)
    
    # Forward pass
    outputs = model(inputs)
    loss = loss_fn(outputs, targets)
    
    # Backward pass
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    
    if batch_idx % 100 == 0:
        print(f"Batch {batch_idx}, Loss: {loss.item():.4f}")
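The `collate_fn` argument becomes relevant when samples cannot be stacked directly, for example sequences of different lengths. The sketch below pads each batch to its longest sequence. `VarLenDataset` and `pad_collate` are hypothetical names introduced for this illustration.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader, Dataset

class VarLenDataset(Dataset):
    """Toy dataset whose samples are 1-D sequences of different lengths."""

    def __init__(self, lengths):
        self.lengths = lengths

    def __len__(self):
        return len(self.lengths)

    def __getitem__(self, idx):
        sequence = torch.full((self.lengths[idx],), float(idx))
        label = idx % 2
        return sequence, label

def pad_collate(batch):
    """Pad sequences in a batch to equal length and keep the true lengths."""
    sequences, labels = zip(*batch)
    lengths = torch.tensor([len(s) for s in sequences])
    padded = pad_sequence(list(sequences), batch_first=True)  # Zero padding
    return padded, lengths, torch.tensor(labels)

loader = DataLoader(
    VarLenDataset([3, 5, 2, 4]),
    batch_size=2,
    shuffle=False,
    collate_fn=pad_collate,
)

batches = list(loader)
for padded, lengths, labels in batches:
    print(padded.shape, lengths.tolist(), labels.tolist())
```

Returning the original lengths alongside the padded tensor lets the model mask out padding later.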

6.4 Built-in Datasets and Transforms

from torchvision import datasets, transforms

# Standard transforms for image classification
transform_train = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomCrop(32, padding=4),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.4914, 0.4822, 0.4465],
        std=[0.2023, 0.1994, 0.2010]
    ),
])

transform_test = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.4914, 0.4822, 0.4465],
        std=[0.2023, 0.1994, 0.2010]
    ),
])

# Load CIFAR-10
train_dataset = datasets.CIFAR10(
    root='./data',
    train=True,
    download=True,
    transform=transform_train
)

test_dataset = datasets.CIFAR10(
    root='./data',
    train=False,
    download=True,
    transform=transform_test
)

train_loader = DataLoader(
    train_dataset,
    batch_size=128,
    shuffle=True,
    num_workers=2,
    pin_memory=True
)

7. Training Loop Pattern

7.1 Training Loop Overview

[Figure: Neural Network Training Loop. Initialize model; for each epoch, call model.train() (training mode); for each batch in train_loader: (1) forward pass, (2) compute loss, (3) backward and optimize with optimizer.zero_grad(), loss.backward(), optimizer.step().]

7.2 Complete Training Implementation

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
from torch.cuda.amp import autocast, GradScaler

def train_epoch(model, dataloader, optimizer, criterion, device, scaler=None):
    model.train()
    total_loss = 0.0
    correct = 0
    total = 0
    
    for batch_idx, (inputs, targets) in enumerate(dataloader):
        inputs, targets = inputs.to(device), targets.to(device)
        
        optimizer.zero_grad()
        
        # Mixed precision training (autocast is disabled when no scaler is used, e.g. on CPU)
        with autocast(enabled=scaler is not None):
            outputs = model(inputs)
            loss = criterion(outputs, targets)
        
        if scaler is not None:
            scaler.scale(loss).backward()
            scaler.step(optimizer)
            scaler.update()
        else:
            loss.backward()
            optimizer.step()
        
        total_loss += loss.item() * inputs.size(0)
        _, predicted = outputs.max(1)
        total += targets.size(0)
        correct += predicted.eq(targets).sum().item()
    
    return total_loss / total, 100.0 * correct / total

def validate(model, dataloader, criterion, device):
    model.eval()
    total_loss = 0.0
    correct = 0
    total = 0
    
    with torch.no_grad():
        for inputs, targets in dataloader:
            inputs, targets = inputs.to(device), targets.to(device)
            outputs = model(inputs)
            loss = criterion(outputs, targets)
            
            total_loss += loss.item() * inputs.size(0)
            _, predicted = outputs.max(1)
            total += targets.size(0)
            correct += predicted.eq(targets).sum().item()
    
    return total_loss / total, 100.0 * correct / total

# Full training loop
def train_model(model, train_loader, val_loader, num_epochs=100, lr=1e-3):
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model = model.to(device)
    
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.AdamW(model.parameters(), lr=lr, weight_decay=1e-4)
    scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=num_epochs)
    scaler = GradScaler() if device.type == 'cuda' else None
    
    best_val_acc = 0.0
    
    for epoch in range(num_epochs):
        train_loss, train_acc = train_epoch(
            model, train_loader, optimizer, criterion, device, scaler
        )
        val_loss, val_acc = validate(model, val_loader, criterion, device)
        scheduler.step()
        
        print(f"Epoch {epoch+1}/{num_epochs}")
        print(f"  Train Loss: {train_loss:.4f}, Acc: {train_acc:.2f}%")
        print(f"  Val Loss:   {val_loss:.4f}, Acc: {val_acc:.2f}%")
        
        if val_acc > best_val_acc:
            best_val_acc = val_acc
            torch.save(model.state_dict(), 'best_model.pth')
    
    return model

7.3 Optimizer Selection Guide

| Optimizer | Best For | Key Parameters |
| --- | --- | --- |
| SGD | ConvNets, generalization | lr, momentum, weight_decay |
| Adam | Transformers, NLP, fast convergence | lr, betas, eps |
| AdamW | Transformers with weight decay | lr, weight_decay |
| RMSprop | RNNs | lr, alpha, momentum |

# SGD with momentum (classic)
optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)

# Adam (default betas)
optimizer = optim.Adam(model.parameters(), lr=1e-3)

# AdamW (proper weight decay)
optimizer = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)

# Learning rate schedulers (alternatives; use one per optimizer)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)
# ReduceLROnPlateau must be stepped with the monitored metric: scheduler.step(val_loss)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', patience=5)
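To see what a scheduler actually does to the learning rate, it helps to record the value at each epoch. The sketch below uses a dummy parameter and shortened schedules (`step_size=2`, `patience=1`), which are illustrative settings rather than recommended values.

```python
import torch

param = torch.nn.Parameter(torch.zeros(1))   # Dummy parameter; no real training

# StepLR: multiply the learning rate by gamma every step_size epochs
optimizer = torch.optim.SGD([param], lr=0.1)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=2, gamma=0.1)

step_lrs = []
for epoch in range(6):
    step_lrs.append(scheduler.get_last_lr()[0])
    optimizer.step()                      # Call optimizer.step() before scheduler.step()
    scheduler.step()
print(step_lrs)

# ReduceLROnPlateau: reduce the rate when the monitored metric stops improving
plateau_optimizer = torch.optim.SGD([param], lr=0.1)
plateau = torch.optim.lr_scheduler.ReduceLROnPlateau(
    plateau_optimizer, mode='min', factor=0.5, patience=1
)

plateau_lrs = []
for val_loss in [1.0, 1.0, 1.0, 1.0]:     # A validation loss that never improves
    plateau_optimizer.step()
    plateau.step(val_loss)                # The metric is passed to step()
    plateau_lrs.append(plateau_optimizer.param_groups[0]['lr'])
print(plateau_lrs)
```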

8. Saving and Loading Models

8.1 Model Persistence Strategies

import torch

# === Save/Load State Dict (recommended) ===
# Save
torch.save(model.state_dict(), 'model_weights.pth')

# Load (weights_only=True restricts unpickling to tensors and basic types)
model = MyModel()  # Must create model instance first
model.load_state_dict(torch.load('model_weights.pth', weights_only=True))
model.eval()

# === Save entire model (less flexible) ===
torch.save(model, 'model_complete.pth')

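# Load (unpickles arbitrary Python objects, so only do this with trusted files;
# newer PyTorch versions require weights_only=False to be passed explicitly)
model = torch.load('model_complete.pth', weights_only=False)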

# === Save checkpoint with training state ===
checkpoint = {
    'epoch': epoch,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'scheduler_state_dict': scheduler.state_dict(),
    'loss': loss,
    'best_acc': best_acc,
}
torch.save(checkpoint, 'checkpoint.pth')

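# Load checkpoint (map_location lets a GPU checkpoint load on the current device)
checkpoint = torch.load('checkpoint.pth', map_location=device, weights_only=True)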
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
scheduler.load_state_dict(checkpoint['scheduler_state_dict'])
start_epoch = checkpoint['epoch'] + 1
best_acc = checkpoint['best_acc']
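A quick way to gain confidence in a checkpoint format is a save and restore round trip. The sketch below uses a tiny stand-in model, a temporary directory, and a hypothetical helper `make_training_objects`, and checks that weights, optimizer state, and the epoch counter all survive.

```python
import os
import tempfile

import torch
import torch.nn as nn

def make_training_objects():
    """Build a fresh model and optimizer (illustrative stand-ins)."""
    model = nn.Linear(4, 2)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
    return model, optimizer

torch.manual_seed(0)
model, optimizer = make_training_objects()

# One training step so the optimizer has momentum buffers worth saving
model(torch.randn(8, 4)).pow(2).mean().backward()
optimizer.step()

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "checkpoint.pth")
    torch.save(
        {
            "epoch": 3,
            "model_state_dict": model.state_dict(),
            "optimizer_state_dict": optimizer.state_dict(),
        },
        path,
    )

    # Simulate a restart: new objects, then restore their state
    restored_model, restored_optimizer = make_training_objects()
    checkpoint = torch.load(path, map_location="cpu", weights_only=True)
    restored_model.load_state_dict(checkpoint["model_state_dict"])
    restored_optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
    start_epoch = checkpoint["epoch"] + 1

print(start_epoch, torch.equal(model.weight, restored_model.weight))
```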

8.2 Serialization Formats

| Format | Description | Use Case |
| --- | --- | --- |
| .pth | PyTorch native | Most common, simplest |
| .pt | PyTorch native (alias) | Same as .pth |
| .bin | PyTorch binary | HuggingFace models |
| .safetensors | Safe, fast format | Recommended for sharing |
| .onnx | Open Neural Network Exchange | Cross-framework deployment |

# SafeTensors (recommended for security)
from safetensors.torch import save_file, load_file

save_file(model.state_dict(), 'model.safetensors')
state_dict = load_file('model.safetensors')
model.load_state_dict(state_dict)

# ONNX export
dummy_input = torch.randn(1, 3, 224, 224)
torch.onnx.export(
    model, dummy_input, 'model.onnx',
    export_params=True,
    opset_version=17,
    do_constant_folding=True,
    input_names=['input'],
    output_names=['output'],
)

8.3 Multi-GPU and Distributed Checkpoints

# Save DDP model
torch.save(model.module.state_dict(), 'ddp_model.pth')

# Load on single GPU
model = MyModel()
model.load_state_dict(torch.load('ddp_model.pth', map_location='cuda:0', weights_only=True))

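# Distributed checkpoint (torch.distributed.checkpoint; the save/load functions
# under these names are available in recent PyTorch 2.x releases)
from torch.distributed.checkpoint import save, load

save(model.state_dict(), checkpoint_id="./checkpoints/ckpt")

# load() fills an existing state_dict in place from the checkpoint directory
state_dict = model.state_dict()
load(state_dict, checkpoint_id="./checkpoints/ckpt")
model.load_state_dict(state_dict)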

Summary

Key Takeaways:

  • Tensors are PyTorch's fundamental data structure, supporting GPU acceleration and automatic differentiation
  • Autograd builds dynamic computation graphs and implements reverse-mode automatic differentiation
  • nn.Module provides the base class for all neural network layers and models
  • DataLoader/Dataset handle efficient data loading with parallelism and transformations
  • Training loop follows the pattern: zero_grad -> forward -> loss -> backward -> step
  • Model saving should use state_dict for flexibility and reproducibility

Mathematical Foundations

The core operations in PyTorch implement fundamental mathematical transformations:

  1. Matrix Multiplication: $\mathbf{y} = \mathbf{W}\mathbf{x} + \mathbf{b}$, where $\mathbf{W} \in \mathbb{R}^{m \times n}$, $\mathbf{x} \in \mathbb{R}^n$

  2. Backpropagation: Using the chain rule, gradients flow backward through the computational graph:

$$\frac{\partial \mathcal{L}}{\partial \mathbf{W}} = \frac{\partial \mathcal{L}}{\partial \mathbf{y}} \cdot \mathbf{x}^\top$$

  3. Gradient Descent Update: $\mathbf{W}_{t+1} = \mathbf{W}_t - \eta \nabla_{\mathbf{W}} \mathcal{L}$, where $\eta$ is the learning rate
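These three formulas can be checked against autograd directly. The sketch below uses an illustrative loss $\mathcal{L} = \sum_i y_i^2$, for which $\partial \mathcal{L} / \partial \mathbf{y} = 2\mathbf{y}$; the dimensions and learning rate are arbitrary choices for the example.

```python
import torch

torch.manual_seed(0)
m, n = 3, 4
W = torch.randn(m, n, requires_grad=True)
b = torch.randn(m, requires_grad=True)
x = torch.randn(n)

# 1. Forward: y = Wx + b
y = W @ x + b
loss = (y ** 2).sum()

# 2. Backpropagation via autograd
loss.backward()

# Manual gradients: dL/dy = 2y, so dL/dW = (dL/dy) x^T and dL/db = dL/dy
grad_y = 2 * y.detach()
manual_grad_W = torch.outer(grad_y, x)
print(torch.allclose(W.grad, manual_grad_W), torch.allclose(b.grad, grad_y))

# 3. Gradient descent update: W_{t+1} = W_t - eta * grad
eta = 0.1
with torch.no_grad():
    W_next = W - eta * W.grad
    b_next = b - eta * b.grad
    next_loss = ((W_next @ x + b_next) ** 2).sum()
print(loss.item(), next_loss.item())
```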

Best Practices

  1. Always call optimizer.zero_grad() before each backward pass
  2. Use torch.no_grad() for inference to save memory
  3. Prefer state_dict over saving entire models
  4. Use pin_memory=True and num_workers>0 in DataLoader for GPU training
  5. Use mixed precision (torch.cuda.amp) for faster training on modern GPUs
  6. Move model and data to the same device before training

Common Pitfalls

  1. Forgetting to call model.eval() during validation/inference
  2. Not zeroing gradients between iterations (accumulation)
  3. Using in-place operations on tensors that require gradients
  4. Device mismatch between model and data tensors
  5. Using torch.load() without weights_only=True for untrusted files
