PyTorch Fundamentals: Tensors, Autograd and GPU Computing

Prerequisites: Python proficiency, basic linear algebra (vectors, matrices, tensor operations), calculus (derivatives, chain rule), and introductory programming concepts.

1. PyTorch vs TensorFlow: A Comparative Analysis

Understanding the landscape of deep learning frameworks is essential for choosing the right tool for your research or production needs.

1.1 Historical Context

Feature	PyTorch	TensorFlow
Developer	Meta AI (Facebook)	Google Brain
Initial Release	2016	2015
Design Philosophy	Pythonic, imperative	Static graphs, production-oriented
Primary Users	Researchers, Academia	Industry, Production
Computation Graph	Dynamic (Define-by-Run)	Static (TF 1.x) / Eager (TF 2.x)
Deployment	TorchScript, ONNX	TF Serving, TF Lite, TF.js
Ecosystem	torchtext, torchaudio, torchvision	TFX, Keras, TF Hub

1.2 Fundamental Differences

Dynamic Computation Graphs (PyTorch):

import torch

# Each forward pass creates a new graph
x = torch.randn(3, requires_grad=True)
for i in range(3):
    y = x ** 2 + 2 * x  # Graph rebuilt each iteration
    z = y.sum()
    z.backward()         # Gradients computed immediately
    print(f"Step {i}: grad = {x.grad}")
    x.grad.zero_()       # Reset gradients
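
Because the graph is rebuilt on every forward pass, ordinary Python control flow can depend on tensor values. The following is a minimal sketch of that idea; the function name `repeated_square` and the threshold of 100 are hypothetical choices made for illustration only.

```python
import torch

def repeated_square(x, limit=100.0):
    """Square x until its value exceeds `limit`; the loop count depends on the data."""
    steps = 0
    while x.item() <= limit:   # Python-level branch on a tensor value
        x = x ** 2
        steps += 1
    return x, steps

x = torch.tensor(2.0, requires_grad=True)
y, steps = repeated_square(x)   # 2 -> 4 -> 16 -> 256, so three squarings
y.backward()

# y = x^(2^steps) = x^8, so dy/dx = 8 * x^7 = 8 * 128 = 1024
print(steps, y.item(), x.grad.item())
```

Autograd records only the operations that actually ran, so the gradient matches the three squarings taken for this particular input.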

Static Computation Graphs (TensorFlow 1.x paradigm):

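# Requires TensorFlow 1.x; on 2.x, import tensorflow.compat.v1 as tf
# and call tf.disable_eager_execution() first
import tensorflow as tf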

# Graph defined once, executed many times
x = tf.placeholder(tf.float32, shape=[3])
y = tf.pow(x, 2) + 2 * x
z = tf.reduce_sum(y)
with tf.Session() as sess:
    for i in range(3):
        result = sess.run(z, feed_dict={x: [1.0, 2.0, 3.0]})
        print(f"Step {i}: result = {result}")

1.3 When to Choose Which?

PyTorch: Research prototyping, dynamic architectures (RNNs with variable lengths), debugging flexibility
TensorFlow: Production deployment, mobile/edge inference, established pipelines

2. Tensors: The Fundamental Data Structure

2.1 What is a Tensor?

A tensor is a generalization of scalars (0-D), vectors (1-D), and matrices (2-D) to arbitrary dimensions. Mathematically, a tensor is a multidimensional array with $n$ dimensions (axes) and shape $(d_1, d_2, \dots, d_n)$.

[Figure: PyTorch Ecosystem Overview. Components shown: PyTorch Core; torchvision, torchaudio, torchtext, torchmetrics; TorchScript, ONNX, torch.compile, torch.distributed; CUDA / ROCm (GPU), CPU / TPU.]

2.2 Tensor Creation

import torch

# === From Python data structures ===
scalar_tensor = torch.tensor(5.0)                     # 0-D
vector_tensor = torch.tensor([1.0, 2.0, 3.0])        # 1-D
matrix_tensor = torch.tensor([[1, 2], [3, 4]])        # 2-D
cube_tensor = torch.tensor([[[1, 2], [3, 4]],         # 3-D
                             [[5, 6], [7, 8]]])

# === Factory functions ===
zeros = torch.zeros(3, 4)                              # All zeros
ones = torch.ones(2, 3, 4)                             # All ones
rand_uniform = torch.rand(5, 5)                        # Uniform [0, 1)
rand_normal = torch.randn(5, 5)                        # Standard normal
eye = torch.eye(4)                                     # Identity matrix
arange = torch.arange(0, 10, 2)                        # Range
linspace = torch.linspace(0, 1, 100)                   # Linear spacing
empty = torch.empty(2, 3)                              # Uninitialized

# === From NumPy ===
import numpy as np
np_array = np.array([[1, 2], [3, 4]])
tensor_from_numpy = torch.from_numpy(np_array)         # Shares memory
tensor_copy = torch.tensor(np_array)                   # Copies data

# === With specific dtype ===
float32_tensor = torch.tensor([1.0, 2.0], dtype=torch.float32)
float64_tensor = torch.tensor([1.0, 2.0], dtype=torch.float64)
int64_tensor = torch.tensor([1, 2], dtype=torch.int64)
bool_tensor = torch.tensor([True, False], dtype=torch.bool)

2.3 Tensor Attributes and Properties

x = torch.randn(3, 4, 5, dtype=torch.float32, device='cuda')

# Essential attributes
print(f"Shape:    {x.shape}")           # torch.Size([3, 4, 5])
print(f"Strides:  {x.stride()}")        # (20, 5, 1)
print(f"Offset:   {x.storage_offset()}")# 0
print(f"Dtype:    {x.dtype}")           # torch.float32
print(f"Device:   {x.device}")          # cuda:0
print(f"Numel:    {x.numel()}")         # 60
print(f"Dim:      {x.ndim}")            # 3
print(f"Requires grad: {x.requires_grad}")  # False
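
The stride values follow directly from the shape when a tensor is stored contiguously in row-major order. The sketch below reproduces that arithmetic in plain Python; the helper names `contiguous_strides` and `flat_offset` are hypothetical and are not PyTorch functions.

```python
def contiguous_strides(shape):
    """Return row-major (C-contiguous) strides, measured in elements."""
    strides = []
    step = 1
    for size in reversed(shape):
        strides.append(step)
        step *= size
    return tuple(reversed(strides))

def flat_offset(index, strides, storage_offset=0):
    """Map a multi-dimensional index to a position in flat storage."""
    return storage_offset + sum(i * s for i, s in zip(index, strides))

strides = contiguous_strides((3, 4, 5))
print(strides)                          # (20, 5, 1)
print(flat_offset((2, 1, 3), strides))  # 2*20 + 1*5 + 3*1 = 48
```

Operations that only rearrange strides and offsets can therefore return a different view of the same storage without copying data.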

2.4 Tensor Operations

# === Element-wise operations ===
a = torch.tensor([1.0, 2.0, 3.0])
b = torch.tensor([4.0, 5.0, 6.0])

c = a + b          # tensor([5., 7., 9.])
c = a * b          # tensor([4., 10., 18.])
c = a ** 2         # tensor([1., 4., 9.])
c = torch.exp(a)   # tensor([2.7183, 7.3891, 20.0855])
c = torch.log(b)   # tensor([1.3863, 1.6094, 1.7918])

# === Matrix operations ===
A = torch.randn(3, 4)
B = torch.randn(4, 5)

C = A @ B          # Matrix multiply (3, 4) x (4, 5) -> (3, 5)
C = torch.mm(A, B) # Equivalent
C = torch.matmul(A, B)  # Batch-aware

# === Reduction operations ===
x = torch.randn(3, 4)

print(x.sum())            # Scalar sum
print(x.sum(dim=0))       # Sum along dim 0 -> shape (4,)
print(x.sum(dim=1))       # Sum along dim 1 -> shape (3,)
print(x.mean())           # Mean
print(x.std())            # Standard deviation
print(x.max())            # Max value
print(x.argmax())         # Index of max

# === Reshaping operations ===
x = torch.arange(12)
y = x.view(3, 4)          # Reshape (no copy)
z = x.reshape(3, 4)       # May copy if needed
w = x.unsqueeze(0)        # Add dimension: (1, 12)
v = w.squeeze(0)          # Remove dimension: (12,)
u = x.view(2, 2, 3)       # 3-D reshape

# === Permutation and transposition ===
x = torch.randn(2, 3, 4)
y = x.permute(2, 0, 1)    # Reorder dimensions -> (4, 2, 3)
z = x.transpose(0, 2)     # Swap two dimensions

# === Broadcasting ===
# PyTorch follows NumPy broadcasting rules:
# 1. If tensors have different ndim, prepend 1s to the smaller shape
# 2. Two dimensions are compatible if they are equal or one of them is 1
# 3. Dimensions of size 1 are stretched (without copying) to match the other tensor
a = torch.randn(3, 1)     # (3, 1)
b = torch.randn(1, 4)     # (1, 4)
c = a + b                  # (3, 4) -- broadcast to common shape

# Example with higher dimensions
x = torch.randn(8, 1, 6, 1)
y = torch.randn(7, 1, 5)
z = x + y                  # (8, 7, 6, 5) -- broadcast along all dims
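
To make the broadcasting procedure concrete, here is a small plain-Python sketch that computes the result shape for any number of input shapes. The function name `broadcast_shape` is a hypothetical helper written for this illustration, and it assumes all dimension sizes are positive.

```python
from itertools import zip_longest

def broadcast_shape(*shapes):
    """Compute the broadcast result shape, or raise ValueError if incompatible."""
    result = []
    # Walk the shapes from the trailing dimension; missing dims count as 1
    for dims in zip_longest(*(reversed(s) for s in shapes), fillvalue=1):
        sizes = {d for d in dims if d != 1}
        if len(sizes) > 1:
            raise ValueError(f"cannot broadcast dimensions {dims}")
        result.append(sizes.pop() if sizes else 1)
    return tuple(reversed(result))

print(broadcast_shape((3, 1), (1, 4)))            # (3, 4)
print(broadcast_shape((8, 1, 6, 1), (7, 1, 5)))   # (8, 7, 6, 5)
```

Shapes are aligned from the right, which is why the 3-D tensor in the second example lines up with the last three axes of the 4-D tensor.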

2.5 Advanced Indexing and Slicing

x = torch.arange(24).reshape(2, 3, 4)

# Basic slicing
print(x[0])                # First batch element (3, 4)
print(x[:, 1, :])          # All batches, second row, all cols
print(x[..., -1])          # Ellipsis notation, last element along last dim

# Boolean masking
mask = x > 10
print(x[mask])              # 1-D tensor of elements > 10

# Fancy indexing
indices = torch.tensor([[0, 2], [1, 2]])   # Must be valid for the indexed dim (size 3)
print(x[0, indices])       # Picks rows of x[0] by index -> shape (2, 2, 4)

# Where (conditional selection)
x = torch.randn(3, 4)
y = torch.where(x > 0, x, torch.zeros_like(x))  # ReLU-like

2.6 In-place Operations

# Operations with trailing underscore modify tensor in-place
x = torch.randn(3, 4)

x.zero_()                  # In-place zero fill
x.add_(5)                  # In-place addition
x.mul_(2)                  # In-place multiplication
x.clamp_(min=0)            # In-place ReLU

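# In-place operations save memory, but they can break autograd:
# overwriting a value needed for the backward pass raises a RuntimeError,
# and leaf tensors that require grad cannot be modified in place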
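
The interaction between in-place operations and autograd is easiest to see by triggering the errors deliberately. The sketch below runs on CPU and shows two failure cases plus the usual workaround; the variable names are illustrative only.

```python
import torch

x = torch.tensor([1.0, 2.0], requires_grad=True)

# Case 1: exp() saves its output for the backward pass, so overwriting
# that output in place invalidates the saved value.
y = x.exp()
y.add_(1)
try:
    y.sum().backward()
    saved_value_error = None
except RuntimeError as err:
    saved_value_error = str(err)

# Case 2: a leaf tensor that requires grad cannot be modified in place
# while gradient tracking is on.
try:
    x.add_(1)
    leaf_error = None
except RuntimeError as err:
    leaf_error = str(err)

# Inside torch.no_grad() the same update is allowed (this is how
# optimizers update parameters).
with torch.no_grad():
    x.add_(1)

print(saved_value_error is not None, leaf_error is not None, x)
```

In-place operations are therefore safe on tensors that autograd does not need, and worth avoiding elsewhere unless memory pressure justifies the risk.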

3. Autograd: Automatic Differentiation

3.1 Computational Graph Concept

Autograd builds a Directed Acyclic Graph (DAG) where:

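Leaf nodes: Tensors created directly by the user, such as inputs and parameters; leaves with requires_grad=True accumulate gradients in .grad
Intermediate nodes: Results of operations; each records the function (grad_fn) that produced it
Root node: The output on which .backward() is called, typically the scalar loss (backpropagation starts here)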

[Figure: Computational Graph for Autograd. Forward and backward passes through x (leaf), w (leaf), matmul, + bias, logits, CrossEntropy, loss.]

Leaf tensors with requires_grad=True accumulate gradients in .grad when .backward() is called. The values saved in the graph for the backward pass are freed after .backward() unless retain_graph=True is specified.

3.2 Enabling Gradient Tracking

# requires_grad=True enables gradient computation for a tensor
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# Operations on x will track gradients
y = x ** 2
z = y.sum()
z.backward()

print(x.grad)  # tensor([2., 4., 6.]) -- dz/dx = 2x

# Detach from graph
y_detached = y.detach()  # Shares storage, no grad tracking

# torch.no_grad() context manager
with torch.no_grad():
    y = x ** 2  # No graph constructed here

3.3 The Chain Rule in Autograd

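For a composition $w = f(g(h(x)))$ with $y = h(x)$ and $z = g(y)$, autograd computes:

$$\frac{dw}{dx} = \frac{dw}{dz} \cdot \frac{dz}{dy} \cdot \frac{dy}{dx}$$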

# Manual chain rule verification
x = torch.tensor(2.0, requires_grad=True)

# Forward: x -> x^2 -> x^2 + 1 -> (x^2 + 1)^3
y = x ** 2            # dy/dx = 2x = 4
z = y + 1             # dz/dy = 1
w = z ** 3            # dw/dz = 3(z^2) = 3(25) = 75
w.backward()

# dw/dx = dw/dz * dz/dy * dy/dx = 75 * 1 * 4 = 300
print(x.grad)  # tensor(300.)
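
A quick way to sanity-check a gradient like this one is to compare it against a finite-difference estimate. The plain-Python sketch below does that for the same composition; the helper `central_difference` and the step size `h=1e-5` are assumptions chosen for illustration.

```python
def w(x):
    """The composition from the example: w = (x^2 + 1)^3."""
    return (x ** 2 + 1) ** 3

def central_difference(f, x, h=1e-5):
    """Approximate f'(x) with a symmetric finite difference."""
    return (f(x + h) - f(x - h)) / (2 * h)

analytic = 3 * (2.0 ** 2 + 1) ** 2 * (2 * 2.0)   # chain rule: 3z^2 * 2x
numeric = central_difference(w, 2.0)
print(analytic, numeric)
```

The two values should agree to several decimal places; a large gap usually points to a mistake in the analytic derivation.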

3.4 Gradient Accumulation

# IMPORTANT: Gradients accumulate by default!
x = torch.tensor(2.0, requires_grad=True)

for i in range(3):
    y = x ** 2
    y.backward()
    print(f"Iteration {i}: grad = {x.grad}")
    # Iteration 0: grad = 4.0
    # Iteration 1: grad = 8.0   (4 + 4)
    # Iteration 2: grad = 12.0  (8 + 4)
    # x.grad is NOT reset between iterations!

# Always zero gradients before backward pass
x.grad.zero_()
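
Accumulation is not only a pitfall; it can also be used deliberately to emulate a large batch with several small ones. The sketch below is one way to check that idea on CPU, assuming a mean-reduced loss; the toy data, the helper `mse`, and the choice of four micro-batches are hypothetical.

```python
import torch

torch.manual_seed(0)
inputs = torch.randn(8, 3)
targets = torch.randn(8)
w = torch.zeros(3, requires_grad=True)

def mse(batch_x, batch_y):
    return ((batch_x @ w - batch_y) ** 2).mean()

# Reference: one backward pass over the full batch
mse(inputs, targets).backward()
full_grad = w.grad.clone()
w.grad.zero_()

# Accumulate over 4 micro-batches of 2 samples; dividing each loss by the
# number of micro-batches makes the summed gradient match the full-batch mean
num_micro = 4
for chunk_x, chunk_y in zip(inputs.chunk(num_micro), targets.chunk(num_micro)):
    (mse(chunk_x, chunk_y) / num_micro).backward()

print(torch.allclose(w.grad, full_grad, atol=1e-6))
```

The division by the number of micro-batches matters only because the loss averages over samples; with a sum-reduced loss it would be omitted.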

3.5 Higher-Order Derivatives

x = torch.tensor(3.0, requires_grad=True)

# First derivative
y = x ** 3              # y = x^3
first_grad = torch.autograd.grad(y, x, create_graph=True)[0]
# dy/dx = 3x^2 = 27

# Second derivative
second_grad = torch.autograd.grad(first_grad, x, create_graph=True)[0]
# d^2y/dx^2 = 6x = 18

# Third derivative
third_grad = torch.autograd.grad(second_grad, x)[0]
# d^3y/dx^3 = 6

print(f"1st: {first_grad}, 2nd: {second_grad}, 3rd: {third_grad}")

3.6 Jacobian and Hessian Computation

# Jacobian: matrix of all first-order partial derivatives
def f(x):
    return torch.stack([x[0]**2 + x[1], x[1]**3 * x[0]])

x = torch.tensor([1.0, 2.0], requires_grad=True)
y = f(x)

# Manual Jacobian computation: one backward pass per output gives one row
jacobian = torch.zeros(2, 2)
for i in range(2):
    grad = torch.autograd.grad(y[i], x, retain_graph=True)[0]
    jacobian[i] = grad

# J = [[2x0,  1  ],    At (1,2): J = [[2, 1],
#      [x1^3, 3x1^2*x0]]                      [8, 12]]
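
PyTorch also ships helpers in `torch.autograd.functional` that compute these matrices without a manual loop. The sketch below applies `jacobian` to the same function `f` and `hessian` to a scalar function `g`; note that `g` is a hypothetical function introduced here only to illustrate the Hessian.

```python
import torch
from torch.autograd.functional import jacobian, hessian

def f(x):
    # Same vector-valued function as in the manual computation
    return torch.stack([x[0] ** 2 + x[1], x[1] ** 3 * x[0]])

def g(x):
    # Hypothetical scalar function, used only to illustrate the Hessian
    return x[0] ** 2 * x[1]

x = torch.tensor([1.0, 2.0])

J = jacobian(f, x)   # shape (2, 2): rows are outputs, columns are inputs
H = hessian(g, x)    # shape (2, 2): second partial derivatives of g

print(J)  # [[2, 1], [8, 12]]
print(H)  # [[2*x1, 2*x0], [2*x0, 0]] = [[4, 2], [2, 0]]
```

The Hessian is symmetric here, as expected for a function with continuous second derivatives.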

4. GPU Computing: CUDA Tensors and Device Management

4.1 GPU vs CPU Computation Flow

[Figure: GPU vs CPU Computation Flow. CPU path: Python / NumPy data -> torch.tensor(..., device='cpu') -> CPU compute (sequential) -> result on CPU; best for small tensors and debugging. GPU path (CUDA): Python / NumPy data -> torch.tensor(..., device='cuda') -> CUDA kernels (massive parallelism) -> result on GPU -> .cpu() to transfer back to host; best for large models and batch training.]

4.2 Device Detection and Management

import torch

# Check CUDA availability
print(torch.cuda.is_available())          # True if GPU present
print(torch.cuda.device_count())          # Number of GPUs
print(torch.cuda.get_device_name(0))      # GPU name
print(torch.cuda.get_device_properties(0))# Full properties

# Device agnostic code
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

# Move tensors to device
x = torch.randn(3, 4)
x_gpu = x.to(device)           # Transfer to the selected device
x_gpu = x.cuda()               # Shortcut for .to('cuda'); raises an error without a GPU
x_cpu = x_gpu.cpu()            # Transfer back to CPU

# Create directly on device
x = torch.randn(3, 4, device=device)

4.3 CUDA Operations

# Pin memory for faster CPU -> GPU transfer
x = torch.randn(1000, 1000).pin_memory()
y = x.cuda(non_blocking=True)  # Async transfer

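# Mixed precision training
# (model, inputs, target, loss_fn, and optimizer are assumed to be defined elsewhere)
# Note: recent PyTorch releases prefer torch.amp.autocast('cuda') and torch.amp.GradScaler('cuda')
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()

optimizer.zero_grad()
with autocast():
    output = model(inputs)
    loss = loss_fn(output, target)

scaler.scale(loss).backward()  # Scale the loss to avoid float16 gradient underflow
scaler.step(optimizer)         # Unscales gradients; skips the step on inf/NaN
scaler.update()                # Adjust the scale factor for the next iteration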

# Memory management
print(torch.cuda.memory_allocated())    # Bytes allocated
print(torch.cuda.memory_reserved())     # Bytes reserved
torch.cuda.empty_cache()                # Free unused memory
torch.cuda.synchronize()                # Wait for all ops

4.4 Multi-GPU Training

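import os
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Option A: DataParallel (simple but less efficient); `model` is an existing nn.Module
model = nn.DataParallel(model)

# Option B: DistributedDataParallel (recommended); use instead of Option A.
# Launch with torchrun, which sets LOCAL_RANK for each process.
dist.init_process_group('nccl')
local_rank = int(os.environ['LOCAL_RANK'])
torch.cuda.set_device(local_rank)
model = DDP(model.to(local_rank), device_ids=[local_rank])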

5. nn.Module: Building Custom Layers

5.1 Module Architecture

[Figure: Neural Network Module Hierarchy. nn.Module (base class) -> nn.Linear, nn.Conv2d, nn.LSTM, nn.BatchNorm2d, nn.Dropout, nn.ReLU; a custom module (nn.Module subclass) combines layers and defines the forward pass.]

5.2 Custom Layer Implementation

import torch
import torch.nn as nn
import torch.nn.functional as F

class CustomLinear(nn.Module):
    """Custom linear layer: y = xW^T + b"""
    
    def __init__(self, in_features, out_features, bias=True):
        super().__init__()
        self.weight = nn.Parameter(
            torch.randn(out_features, in_features)
        )
        if bias:
            self.bias = nn.Parameter(torch.zeros(out_features))
        else:
            self.register_parameter('bias', None)
    
    def forward(self, x):
        return F.linear(x, self.weight, self.bias)

# Usage
layer = CustomLinear(128, 64)
x = torch.randn(32, 128)  # batch_size=32, features=128
y = layer(x)               # shape: (32, 64)

print(layer.weight.shape)  # torch.Size([64, 128])
print(layer.bias.shape)    # torch.Size([64])

5.3 Complete Network Architecture

class DeepNet(nn.Module):
    """Multi-layer network with skip connections"""
    
    def __init__(self, input_dim, hidden_dim, output_dim, num_layers=4, dropout=0.1):
        super().__init__()
        
        self.input_proj = nn.Linear(input_dim, hidden_dim)
        self.layers = nn.ModuleList([
            nn.Sequential(
                nn.Linear(hidden_dim, hidden_dim),
                nn.LayerNorm(hidden_dim),
                nn.GELU(),
                nn.Dropout(dropout),
            ) for _ in range(num_layers)
        ])
        self.output_proj = nn.Linear(hidden_dim, output_dim)
        self.dropout = nn.Dropout(dropout)
    
    def forward(self, x):
        h = self.input_proj(x)
        h = F.gelu(h)
        
        for layer in self.layers:
            h = h + layer(h)  # Skip connection
        
        h = self.dropout(h)
        return self.output_proj(h)

# Initialize model
model = DeepNet(input_dim=784, hidden_dim=256, output_dim=10)

# Count parameters
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Total parameters: {total_params:,}")
print(f"Trainable parameters: {trainable_params:,}")
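
The parameter count can also be worked out by hand, which is a useful check that the architecture is what you intended. The plain-Python sketch below does the arithmetic for the configuration used above; the helper names are hypothetical, and it assumes default `nn.Linear` (with bias) and `nn.LayerNorm` (with learnable scale and shift) settings.

```python
def linear_params(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features

def layer_norm_params(dim):
    """LayerNorm has one scale and one shift per feature."""
    return 2 * dim

input_dim, hidden_dim, output_dim, num_layers = 784, 256, 10, 4

# GELU and Dropout have no learnable parameters
block = linear_params(hidden_dim, hidden_dim) + layer_norm_params(hidden_dim)
total = (
    linear_params(input_dim, hidden_dim)
    + num_layers * block
    + linear_params(hidden_dim, output_dim)
)
print(f"{total:,}")  # 468,746
```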

5.4 Parameter Management

# Named parameters
for name, param in model.named_parameters():
    print(f"{name}: {param.shape}, requires_grad={param.requires_grad}")

# Freeze specific layers
for param in model.layers[:2].parameters():
    param.requires_grad = False

# Custom weight initialization
def init_weights(m):
    if isinstance(m, nn.Linear):
        nn.init.kaiming_normal_(m.weight, nonlinearity='relu')
        if m.bias is not None:
            nn.init.zeros_(m.bias)

model.apply(init_weights)

6. DataLoader and Dataset

6.1 Data Pipeline Overview

[Figure: PyTorch Data Pipeline]

6.2 Custom Dataset Implementation

from torch.utils.data import Dataset, DataLoader

class CustomDataset(Dataset):
    def __init__(self, features, labels, transform=None):
        self.features = torch.tensor(features, dtype=torch.float32)
        self.labels = torch.tensor(labels, dtype=torch.long)
        self.transform = transform
    
    def __len__(self):
        return len(self.labels)
    
    def __getitem__(self, idx):
        x = self.features[idx]
        y = self.labels[idx]
        
        if self.transform:
            x = self.transform(x)
        
        return x, y

# Example usage
import numpy as np
X = np.random.randn(1000, 20)
y = np.random.randint(0, 5, 1000)

dataset = CustomDataset(X, y)
print(f"Dataset size: {len(dataset)}")
x_sample, y_sample = dataset[0]
print(f"Sample shape: {x_sample.shape}, label: {y_sample}")

6.3 DataLoader Configuration

# Create DataLoader
dataloader = DataLoader(
    dataset,
    batch_size=32,
    shuffle=True,
    num_workers=4,          # Parallel data loading
    pin_memory=True,        # Faster GPU transfer
    drop_last=True,         # Drop incomplete batches
    collate_fn=None,        # Custom collation
    sampler=None,           # Custom sampling strategy
)

# Iterate over batches
for batch_idx, (inputs, targets) in enumerate(dataloader):
    inputs = inputs.to(device)
    targets = targets.to(device)
    
    # Forward pass
    outputs = model(inputs)
    loss = loss_fn(outputs, targets)
    
    # Backward pass
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    
    if batch_idx % 100 == 0:
        print(f"Batch {batch_idx}, Loss: {loss.item():.4f}")
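
The `collate_fn` argument becomes useful when samples cannot be stacked directly, for example sequences of different lengths. The following is a minimal sketch under that assumption; the toy `samples` list and the `pad_collate` function are hypothetical and only illustrate the mechanism.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

# Toy dataset: (sequence, label) pairs with different sequence lengths
samples = [
    (torch.tensor([1, 2, 3]), 0),
    (torch.tensor([4, 5]), 1),
    (torch.tensor([6]), 0),
    (torch.tensor([7, 8, 9, 10]), 1),
]

def pad_collate(batch):
    """Pad sequences in a batch to the longest one and keep true lengths."""
    sequences, labels = zip(*batch)
    lengths = torch.tensor([len(s) for s in sequences])
    padded = pad_sequence(list(sequences), batch_first=True, padding_value=0)
    return padded, lengths, torch.tensor(labels)

loader = DataLoader(samples, batch_size=2, shuffle=False, collate_fn=pad_collate)
for padded, lengths, labels in loader:
    print(padded.shape, lengths.tolist(), labels.tolist())
```

Returning the true lengths alongside the padded tensor lets downstream code mask out the padding.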

6.4 Built-in Datasets and Transforms

from torchvision import datasets, transforms

# Standard transforms for image classification
transform_train = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomCrop(32, padding=4),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.4914, 0.4822, 0.4465],
        std=[0.2023, 0.1994, 0.2010]
    ),
])

transform_test = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.4914, 0.4822, 0.4465],
        std=[0.2023, 0.1994, 0.2010]
    ),
])

# Load CIFAR-10
train_dataset = datasets.CIFAR10(
    root='./data',
    train=True,
    download=True,
    transform=transform_train
)

test_dataset = datasets.CIFAR10(
    root='./data',
    train=False,
    download=True,
    transform=transform_test
)

train_loader = DataLoader(
    train_dataset,
    batch_size=128,
    shuffle=True,
    num_workers=2,
    pin_memory=True
)

7. Training Loop Pattern

7.1 Training Loop Overview

[Figure: Neural Network Training Loop. Initialize model -> for epoch in epochs: model.train() (training mode) -> for batch in train_loader: 1. forward pass, 2. compute loss, 3. backward + optimize (optimizer.zero_grad(), loss.backward(), optimizer.step()).]

7.2 Complete Training Implementation

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
from torch.cuda.amp import autocast, GradScaler

def train_epoch(model, dataloader, optimizer, criterion, device, scaler=None):
    model.train()
    total_loss = 0.0
    correct = 0
    total = 0
    
    for batch_idx, (inputs, targets) in enumerate(dataloader):
        inputs, targets = inputs.to(device), targets.to(device)
        
        optimizer.zero_grad()
        
        # Mixed precision training (autocast is enabled only when a GradScaler is supplied)
        with autocast(enabled=scaler is not None):
            outputs = model(inputs)
            loss = criterion(outputs, targets)
        
        if scaler is not None:
            scaler.scale(loss).backward()
            scaler.step(optimizer)
            scaler.update()
        else:
            loss.backward()
            optimizer.step()
        
        total_loss += loss.item() * inputs.size(0)
        _, predicted = outputs.max(1)
        total += targets.size(0)
        correct += predicted.eq(targets).sum().item()
    
    return total_loss / total, 100.0 * correct / total

def validate(model, dataloader, criterion, device):
    model.eval()
    total_loss = 0.0
    correct = 0
    total = 0
    
    with torch.no_grad():
        for inputs, targets in dataloader:
            inputs, targets = inputs.to(device), targets.to(device)
            outputs = model(inputs)
            loss = criterion(outputs, targets)
            
            total_loss += loss.item() * inputs.size(0)
            _, predicted = outputs.max(1)
            total += targets.size(0)
            correct += predicted.eq(targets).sum().item()
    
    return total_loss / total, 100.0 * correct / total

# Full training loop
def train_model(model, train_loader, val_loader, num_epochs=100, lr=1e-3):
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model = model.to(device)
    
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.AdamW(model.parameters(), lr=lr, weight_decay=1e-4)
    scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=num_epochs)
    scaler = GradScaler() if device.type == 'cuda' else None
    
    best_val_acc = 0.0
    
    for epoch in range(num_epochs):
        train_loss, train_acc = train_epoch(
            model, train_loader, optimizer, criterion, device, scaler
        )
        val_loss, val_acc = validate(model, val_loader, criterion, device)
        scheduler.step()
        
        print(f"Epoch {epoch+1}/{num_epochs}")
        print(f"  Train Loss: {train_loss:.4f}, Acc: {train_acc:.2f}%")
        print(f"  Val Loss:   {val_loss:.4f}, Acc: {val_acc:.2f}%")
        
        if val_acc > best_val_acc:
            best_val_acc = val_acc
            torch.save(model.state_dict(), 'best_model.pth')
    
    return model

7.3 Optimizer Selection Guide

Optimizer	Best For	Key Parameters
SGD	ConvNets, generalization	lr, momentum, weight_decay
Adam	Transformers, NLP, fast convergence	lr, betas, eps
AdamW	Transformers with weight decay	lr, weight_decay
RMSprop	RNNs	lr, alpha, momentum

# SGD with momentum (classic)
optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)

# Adam (default betas)
optimizer = optim.Adam(model.parameters(), lr=1e-3)

# AdamW (proper weight decay)
optimizer = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)

# Learning rate schedulers
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', patience=5)
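
It can help to see what learning rates these schedules produce. The plain-Python sketch below evaluates the closed-form expressions for step decay and cosine annealing; the function names are hypothetical, and the cosine form covers a single half-period without restarts.

```python
import math

def step_lr(base_lr, epoch, step_size=30, gamma=0.1):
    """StepLR: multiply the rate by gamma every step_size epochs."""
    return base_lr * gamma ** (epoch // step_size)

def cosine_annealing_lr(base_lr, epoch, t_max=100, eta_min=0.0):
    """Cosine annealing in closed form (first half-period, no restarts)."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * epoch / t_max))

for epoch in (0, 30, 50, 100):
    print(epoch, step_lr(0.1, epoch), cosine_annealing_lr(0.1, epoch))
```

Step decay drops abruptly at fixed epochs, while cosine annealing decreases smoothly from the base rate to `eta_min` over `t_max` epochs.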

8. Saving and Loading Models

8.1 Model Persistence Strategies

import torch

# === Save/Load State Dict (recommended) ===
# Save
torch.save(model.state_dict(), 'model_weights.pth')

# Load
model = MyModel()  # Must create model instance first
model.load_state_dict(torch.load('model_weights.pth', weights_only=True))
model.eval()

# === Save entire model (less flexible) ===
torch.save(model, 'model_complete.pth')

# Load (unpickles arbitrary Python objects, so use only with trusted files)
model = torch.load('model_complete.pth', weights_only=False)

# === Save checkpoint with training state ===
checkpoint = {
    'epoch': epoch,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'scheduler_state_dict': scheduler.state_dict(),
    'loss': loss.item(),  # Store a Python float rather than a tensor attached to the graph
    'best_acc': best_acc,
}
torch.save(checkpoint, 'checkpoint.pth')

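# Load checkpoint (map_location lets a GPU checkpoint load on a CPU-only machine)
checkpoint = torch.load('checkpoint.pth', map_location='cpu', weights_only=True)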
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
scheduler.load_state_dict(checkpoint['scheduler_state_dict'])
start_epoch = checkpoint['epoch'] + 1
best_acc = checkpoint['best_acc']
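
A checkpoint round trip is easy to verify on a small model before relying on it in a long training run. The sketch below uses an in-memory buffer in place of a file; the tiny `nn.Linear` model and the SGD settings are assumptions chosen only to keep the example short.

```python
import io
import torch
import torch.nn as nn

model = nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# Take one step so the optimizer has momentum buffers worth saving
model(torch.randn(8, 4)).sum().backward()
optimizer.step()

buffer = io.BytesIO()   # stands in for a file on disk
torch.save(
    {"epoch": 3,
     "model_state_dict": model.state_dict(),
     "optimizer_state_dict": optimizer.state_dict()},
    buffer,
)

# Restore into freshly constructed objects
buffer.seek(0)
checkpoint = torch.load(buffer, weights_only=True)
restored = nn.Linear(4, 2)
restored_opt = torch.optim.SGD(restored.parameters(), lr=0.1, momentum=0.9)
restored.load_state_dict(checkpoint["model_state_dict"])
restored_opt.load_state_dict(checkpoint["optimizer_state_dict"])

same_weights = all(
    torch.equal(a, b) for a, b in zip(model.parameters(), restored.parameters())
)
print(same_weights, checkpoint["epoch"] + 1)
```

Restoring the optimizer state along with the weights keeps momentum buffers intact, so resumed training continues from the same trajectory.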

8.2 Serialization Formats

Format	Description	Use Case
`.pth`	PyTorch native	Most common, simplest
`.pt`	PyTorch native (alias)	Same as .pth
`.bin`	PyTorch binary	HuggingFace models
`.safetensors`	Safe, fast format	Recommended for sharing
`.onnx`	Open Neural Network Exchange	Cross-framework deployment

# SafeTensors (recommended for security)
from safetensors.torch import save_file, load_file

save_file(model.state_dict(), 'model.safetensors')
state_dict = load_file('model.safetensors')
model.load_state_dict(state_dict)

# ONNX export
dummy_input = torch.randn(1, 3, 224, 224)
torch.onnx.export(
    model, dummy_input, 'model.onnx',
    export_params=True,
    opset_version=17,
    do_constant_folding=True,
    input_names=['input'],
    output_names=['output'],
)

8.3 Multi-GPU and Distributed Checkpoints

# Save DDP model
torch.save(model.module.state_dict(), 'ddp_model.pth')

# Load on single GPU
model = MyModel()
model.load_state_dict(torch.load('ddp_model.pth', weights_only=True))

# Distributed checkpoint (torch.distributed.checkpoint, recent PyTorch 2.x releases)
import torch.distributed.checkpoint as dcp

state_dict = model.state_dict()
dcp.save(state_dict, checkpoint_id="./checkpoints/ckpt")

# dcp.load fills the provided state_dict in place
dcp.load(state_dict, checkpoint_id="./checkpoints/ckpt")
model.load_state_dict(state_dict)

Summary

Key Takeaways:

Tensors are PyTorch's fundamental data structure, supporting GPU acceleration and automatic differentiation
Autograd builds dynamic computation graphs and implements reverse-mode automatic differentiation
nn.Module provides the base class for all neural network layers and models
DataLoader/Dataset handle efficient data loading with parallelism and transformations
Training loop follows the pattern: zero_grad -> forward -> loss -> backward -> step
Model saving should use state_dict for flexibility and reproducibility

Mathematical Foundations

The core operations in PyTorch implement fundamental mathematical transformations:

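Matrix Multiplication: $C = AB$, where $A \in \mathbb{R}^{m \times n}$, $B \in \mathbb{R}^{n \times p}$, and $C \in \mathbb{R}^{m \times p}$
Backpropagation: Using the chain rule, gradients flow backward through the computational graph:

$$\frac{\partial L}{\partial x} = \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial x}$$

Gradient Descent Update: $\theta \leftarrow \theta - \eta \nabla_\theta L$, where $\eta$ is the learning rate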
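
The gradient descent update moves the parameters against the gradient, scaled by the learning rate. A minimal plain-Python sketch of that rule follows; the quadratic objective, the learning rate of 0.1, and the 50 steps are hypothetical choices for illustration.

```python
def gradient_descent(grad_fn, theta, lr=0.1, steps=50):
    """Repeatedly apply theta <- theta - lr * grad(theta)."""
    for _ in range(steps):
        theta = theta - lr * grad_fn(theta)
    return theta

# Hypothetical objective L(theta) = (theta - 3)^2 with gradient 2 * (theta - 3)
theta = gradient_descent(lambda t: 2 * (t - 3), theta=0.0)
print(theta)   # approaches the minimizer theta = 3
```

Each step shrinks the distance to the minimizer by a constant factor here, which is why a fixed learning rate converges for this simple objective.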

Best Practices

Always call optimizer.zero_grad() before each backward pass
Use torch.no_grad() for inference to save memory
Prefer state_dict over saving entire models
Use pin_memory=True and num_workers>0 in DataLoader for GPU training
Use mixed precision (torch.cuda.amp) for faster training on modern GPUs
Move model and data to the same device before training

Common Pitfalls

Forgetting to call model.eval() during validation/inference
Not zeroing gradients between iterations (accumulation)
Using in-place operations on tensors that require gradients
Device mismatch between model and data tensors
Using torch.load() without weights_only=True for untrusted files
