PyTorch Fundamentals: Tensors, Autograd and GPU Computing
Prerequisites: Python proficiency, basic linear algebra (vectors, matrices, tensor operations), calculus (derivatives, chain rule), and introductory programming concepts.
1. PyTorch vs TensorFlow: A Comparative Analysis
Understanding the landscape of deep learning frameworks is essential for choosing the right tool for your research or production needs.
1.1 Historical Context
| Feature | PyTorch | TensorFlow |
|---|---|---|
| Developer | Meta AI (Facebook) | Google Brain |
| Initial Release | 2016 | 2015 |
| Design Philosophy | Pythonic, imperative | Static graphs, production-oriented |
| Primary Users | Researchers, Academia | Industry, Production |
| Computation Graph | Dynamic (Define-by-Run) | Static (TF 1.x) / Eager (TF 2.x) |
| Deployment | TorchScript, ONNX | TF Serving, TF Lite, TF.js |
| Ecosystem | torchtext, torchaudio, torchvision | TFX, Keras, TF Hub |
1.2 Fundamental Differences
Dynamic Computation Graphs (PyTorch):
import torch
# Each forward pass creates a new graph
x = torch.randn(3, requires_grad=True)
for i in range(3):
    y = x ** 2 + 2 * x  # Graph rebuilt each iteration
    z = y.sum()
    z.backward()  # Gradients computed immediately
    print(f"Step {i}: grad = {x.grad}")
    x.grad.zero_()  # Reset gradients
Static Computation Graphs (TensorFlow 1.x paradigm):
import tensorflow as tf
# Graph defined once, executed many times
# (TF 1.x API: under TF 2.x, placeholder and Session are only available via tf.compat.v1)
x = tf.placeholder(tf.float32, shape=[3])
y = tf.pow(x, 2) + 2 * x
z = tf.reduce_sum(y)
with tf.Session() as sess:
    for i in range(3):
        result = sess.run(z, feed_dict={x: [1.0, 2.0, 3.0]})
        print(f"Step {i}: result = {result}")
1.3 When to Choose Which?
- PyTorch: Research prototyping, dynamic architectures (RNNs with variable lengths), debugging flexibility
- TensorFlow: Production deployment, mobile/edge inference, established pipelines
2. Tensors: The Fundamental Data Structure
2.1 What is a Tensor?
A tensor is a generalization of scalars (0-D), vectors (1-D), and matrices (2-D) to arbitrary dimensions. In PyTorch, a tensor is a multidimensional array with $n$ dimensions (axes) and shape $(d_1, d_2, \ldots, d_n)$, where $d_i$ is the size along axis $i$.
2.2 Tensor Creation
import torch
# === From Python data structures ===
scalar_tensor = torch.tensor(5.0) # 0-D
vector_tensor = torch.tensor([1.0, 2.0, 3.0]) # 1-D
matrix_tensor = torch.tensor([[1, 2], [3, 4]]) # 2-D
cube_tensor = torch.tensor([[[1, 2], [3, 4]], # 3-D
[[5, 6], [7, 8]]])
# === Factory functions ===
zeros = torch.zeros(3, 4) # All zeros
ones = torch.ones(2, 3, 4) # All ones
rand_uniform = torch.rand(5, 5) # Uniform [0, 1)
rand_normal = torch.randn(5, 5) # Standard normal
eye = torch.eye(4) # Identity matrix
arange = torch.arange(0, 10, 2) # Range
linspace = torch.linspace(0, 1, 100) # Linear spacing
empty = torch.empty(2, 3) # Uninitialized
# === From NumPy ===
import numpy as np
np_array = np.array([[1, 2], [3, 4]])
tensor_from_numpy = torch.from_numpy(np_array) # Shares memory
tensor_copy = torch.tensor(np_array) # Copies data
# === With specific dtype ===
float32_tensor = torch.tensor([1.0, 2.0], dtype=torch.float32)
float64_tensor = torch.tensor([1.0, 2.0], dtype=torch.float64)
int64_tensor = torch.tensor([1, 2], dtype=torch.int64)
bool_tensor = torch.tensor([True, False], dtype=torch.bool)
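The difference between `torch.from_numpy` (shares memory) and `torch.tensor` (copies) is easiest to see by mutating the source array. The following is a minimal sketch; the variable names `shared` and `copied` are illustrative only.

```python
import numpy as np
import torch

np_array = np.array([[1, 2], [3, 4]])
shared = torch.from_numpy(np_array)  # view onto the NumPy buffer
copied = torch.tensor(np_array)      # independent copy of the data

np_array[0, 0] = 99                  # mutate the NumPy array in place

print(shared[0, 0].item())  # 99 -- the change is visible through the shared tensor
print(copied[0, 0].item())  # 1  -- the copy is unaffected
```

Because of this aliasing, `from_numpy` is the cheaper choice for large read-only arrays, while `torch.tensor` is safer when the array may be modified later.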
2.3 Tensor Attributes and Properties
x = torch.randn(3, 4, 5, dtype=torch.float32, device='cuda') # Requires a CUDA-capable GPU
# Essential attributes
print(f"Shape: {x.shape}") # torch.Size([3, 4, 5])
print(f"Strides: {x.stride()}") # (20, 5, 1)
print(f"Offset: {x.storage_offset()}")# 0
print(f"Dtype: {x.dtype}") # torch.float32
print(f"Device: {x.device}") # cuda:0
print(f"Numel: {x.numel()}") # 60
print(f"Dim: {x.ndim}") # 3
print(f"Requires grad: {x.requires_grad}") # False
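Strides explain why some operations are free and others copy. As a small CPU-only sketch (the names `xt` and `xc` are illustrative), a transpose only swaps the strides of the same storage, while `.contiguous()` materializes a new layout:

```python
import torch

x = torch.arange(12).reshape(3, 4)
xt = x.t()  # transpose: same storage, swapped strides

print(x.stride(), xt.stride())  # (4, 1) (1, 4)
print(xt.is_contiguous())       # False

# Both tensors point at the same underlying memory
same_storage = x.data_ptr() == xt.data_ptr()
print(same_storage)             # True

xc = xt.contiguous()            # copies into a row-major layout
print(xc.stride())              # (3, 1)
```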
2.4 Tensor Operations
# === Element-wise operations ===
a = torch.tensor([1.0, 2.0, 3.0])
b = torch.tensor([4.0, 5.0, 6.0])
c = a + b # tensor([5., 7., 9.])
c = a * b # tensor([4., 10., 18.])
c = a ** 2 # tensor([1., 4., 9.])
c = torch.exp(a) # tensor([2.7183, 7.3891, 20.0855])
c = torch.log(b) # tensor([1.3863, 1.6094, 1.7918])
# === Matrix operations ===
A = torch.randn(3, 4)
B = torch.randn(4, 5)
C = A @ B # Matrix multiply (3, 4) x (4, 5) -> (3, 5)
C = torch.mm(A, B) # Equivalent
C = torch.matmul(A, B) # Batch-aware
# === Reduction operations ===
x = torch.randn(3, 4)
print(x.sum()) # Scalar sum
print(x.sum(dim=0)) # Sum along dim 0 -> shape (4,)
print(x.sum(dim=1)) # Sum along dim 1 -> shape (3,)
print(x.mean()) # Mean
print(x.std()) # Standard deviation
print(x.max()) # Max value
print(x.argmax()) # Index of max
# === Reshaping operations ===
x = torch.arange(12)
y = x.view(3, 4) # Reshape (no copy)
z = x.reshape(3, 4) # May copy if needed
w = x.unsqueeze(0) # Add dimension: (1, 12)
v = w.squeeze(0) # Remove dimension: (12,)
u = x.view(2, 2, 3) # 3-D reshape
# === Permutation and transposition ===
x = torch.randn(2, 3, 4)
y = x.permute(2, 0, 1) # Reorder dimensions -> (4, 2, 3)
z = x.transpose(0, 2) # Swap two dimensions
# === Broadcasting ===
# PyTorch follows NumPy broadcasting rules:
# 1. Shapes are compared from the trailing (rightmost) dimension backward
# 2. If tensors have different ndim, 1s are prepended to the smaller shape
# 3. Two dimensions are compatible if they are equal or one of them is 1;
#    a size-1 dimension is stretched to match the other
a = torch.randn(3, 1) # (3, 1)
b = torch.randn(1, 4) # (1, 4)
c = a + b # (3, 4) -- broadcast to common shape
# Example with higher dimensions
x = torch.randn(8, 1, 6, 1)
y = torch.randn(7, 1, 5)
z = x + y # (8, 7, 6, 5) -- broadcast along all dims
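To check whether two shapes broadcast without allocating any data, `torch.broadcast_shapes` can be used. The sketch below reuses the shapes from the example above and adds one deliberately incompatible pair; the flag name `compatible` is illustrative.

```python
import torch

# Result shape for the higher-dimensional example above
shape = torch.broadcast_shapes((8, 1, 6, 1), (7, 1, 5))
print(shape)  # torch.Size([8, 7, 6, 5])

# Trailing dimensions 2 and 3 are neither equal nor 1, so this pair fails
compatible = True
try:
    torch.randn(3, 2) + torch.randn(3)
except RuntimeError:
    compatible = False
print(compatible)  # False

# Adding an explicit size-1 axis makes the intent clear and the shapes compatible
fixed = torch.randn(3, 2) + torch.randn(3).unsqueeze(1)
print(fixed.shape)  # torch.Size([3, 2])
```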
2.5 Advanced Indexing and Slicing
x = torch.arange(24).reshape(2, 3, 4)
# Basic slicing
print(x[0]) # First batch element (3, 4)
print(x[:, 1, :]) # All batches, second row, all cols
print(x[..., -1]) # Ellipsis notation, last element along last dim
# Boolean masking
mask = x > 10
print(x[mask]) # 1-D tensor of elements > 10
# Fancy indexing (indices select rows of x[0], so they must be < 3)
indices = torch.tensor([[0, 2], [1, 2]])
print(x[0, indices]) # Advanced indexing -> shape (2, 2, 4)
# Where (conditional selection)
x = torch.randn(3, 4)
y = torch.where(x > 0, x, torch.zeros_like(x)) # ReLU-like
2.6 In-place Operations
# Operations with trailing underscore modify tensor in-place
x = torch.randn(3, 4)
x.zero_() # In-place zero fill
x.add_(5) # In-place addition
x.mul_(2) # In-place multiplication
x.clamp_(min=0) # In-place ReLU
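# In-place operations avoid allocating a new tensor, which saves memory on CPU and GPU
# But autograd raises an error if an in-place op overwrites a value needed for backward,
# and in-place ops are not allowed on leaf tensors that require grad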
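The interaction between in-place operations and autograd can be demonstrated directly. The sketch below shows two typical failure modes and the out-of-place alternative; the flag names `leaf_error` and `backward_error` are illustrative.

```python
import torch

w = torch.ones(3, requires_grad=True)

# Case 1: in-place op on a leaf tensor that requires grad
leaf_error = False
try:
    w.add_(1.0)
except RuntimeError:
    leaf_error = True

# Case 2: in-place op overwrites a value autograd saved for backward
# (exp saves its output, because d/dx exp(x) = exp(x))
backward_error = False
y = torch.exp(w)
y.add_(1.0)
try:
    y.sum().backward()
except RuntimeError:
    backward_error = True

# Out-of-place version works as expected
w.grad = None  # start from a clean slate
y = torch.exp(w) + 1.0
y.sum().backward()
print(leaf_error, backward_error)  # True True
print(w.grad)                      # each entry equals exp(1)
```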
3. Autograd: Automatic Differentiation
3.1 Computational Graph Concept
Autograd builds a Directed Acyclic Graph (DAG) where:
- Leaf nodes: Tensors created directly by the user, such as inputs and parameters; leaves with requires_grad=True accumulate gradients in .grad
- Intermediate nodes: Operations producing outputs
- Root node: The scalar loss (backpropagation starts here)
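These roles can be inspected on live tensors through the `is_leaf` and `grad_fn` attributes. A minimal sketch:

```python
import torch

x = torch.tensor([1.0, 2.0], requires_grad=True)  # leaf: created by the user
y = x * 3                                         # intermediate: result of an op
loss = y.sum()                                    # root: scalar to differentiate

print(x.is_leaf, x.grad_fn)                 # True None
print(y.is_leaf, type(y.grad_fn).__name__)  # False MulBackward0
print(type(loss.grad_fn).__name__)          # SumBackward0

loss.backward()
print(x.grad)  # tensor([3., 3.]) -- gradients land on the leaf
```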
3.2 Enabling Gradient Tracking
# requires_grad=True enables gradient computation for a tensor
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
# Operations on x will track gradients
y = x ** 2
z = y.sum()
z.backward()
print(x.grad) # tensor([2., 4., 6.]) -- dz/dx = 2x
# Detach from graph
y_detached = y.detach() # Shares storage, no grad tracking
# torch.no_grad() context manager
with torch.no_grad():
    y = x ** 2  # No graph constructed here
3.3 The Chain Rule in Autograd
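For a composition $w = f(g(h(x)))$ with $y = h(x)$ and $z = g(y)$, autograd computes $\frac{dw}{dx} = \frac{dw}{dz} \cdot \frac{dz}{dy} \cdot \frac{dy}{dx}$: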
# Manual chain rule verification
x = torch.tensor(2.0, requires_grad=True)
# Forward: x -> x^2 -> x^2 + 1 -> (x^2 + 1)^3
y = x ** 2 # dy/dx = 2x = 4
z = y + 1 # dz/dy = 1
w = z ** 3 # dw/dz = 3(z^2) = 3(25) = 75
w.backward()
# dw/dx = dw/dz * dz/dy * dy/dx = 75 * 1 * 4 = 300
print(x.grad) # tensor(300.)
3.4 Gradient Accumulation
# IMPORTANT: Gradients accumulate by default!
x = torch.tensor(2.0, requires_grad=True)
for i in range(3):
    y = x ** 2
    y.backward()
    print(f"Iteration {i}: grad = {x.grad}")
# Iteration 0: grad = 4.0
# Iteration 1: grad = 8.0 (4 + 4)
# Iteration 2: grad = 12.0 (8 + 4)
# x.grad is NOT reset between iterations!
# Always zero gradients before backward pass
x.grad.zero_()
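Accumulation is not only a pitfall; it can also be used deliberately to emulate a larger batch with several small ones. The sketch below, on a hypothetical least-squares problem with made-up data, checks that four accumulated micro-batch gradients match the full-batch gradient when each micro-batch loss is divided by the number of micro-batches.

```python
import torch

torch.manual_seed(0)
w = torch.randn(5, requires_grad=True)
X = torch.randn(8, 5)
t = torch.randn(8)

# Reference: gradient of the mean loss over the full batch of 8
loss = ((X @ w - t) ** 2).mean()
loss.backward()
full_grad = w.grad.clone()
w.grad.zero_()

# Same gradient from 4 micro-batches of 2, relying on accumulation
for Xb, tb in zip(X.chunk(4), t.chunk(4)):
    micro_loss = ((Xb @ w - tb) ** 2).mean() / 4  # scale so the sum is a mean
    micro_loss.backward()                         # adds into w.grad
acc_grad = w.grad.clone()

print(torch.allclose(full_grad, acc_grad, atol=1e-6))  # True
```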
3.5 Higher-Order Derivatives
x = torch.tensor(3.0, requires_grad=True)
# First derivative
y = x ** 3 # y = x^3
first_grad = torch.autograd.grad(y, x, create_graph=True)[0]
# dy/dx = 3x^2 = 27
# Second derivative
second_grad = torch.autograd.grad(first_grad, x, create_graph=True)[0]
# d^2y/dx^2 = 6x = 18
# Third derivative
third_grad = torch.autograd.grad(second_grad, x)[0]
# d^3y/dx^3 = 6
print(f"1st: {first_grad}, 2nd: {second_grad}, 3rd: {third_grad}")
3.6 Jacobian and Hessian computation
# Jacobian: matrix of all first-order partial derivatives
def f(x):
    return torch.stack([x[0]**2 + x[1], x[1]**3 * x[0]])
x = torch.tensor([1.0, 2.0], requires_grad=True)
y = f(x)
# Manual Jacobian computation: one backward pass per output row
jacobian = torch.zeros(2, 2)
for i in range(2):
    # grad holds dy[i]/dx, i.e. row i of the Jacobian
    grad = torch.autograd.grad(y[i], x, retain_graph=True)[0]
    jacobian[i] = grad
# J = [[2*x0, 1        ],    At (1, 2): J = [[2,  1],
#      [x1^3, 3*x1^2*x0]]                    [8, 12]]
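The same Jacobian, and a Hessian, can also be obtained with the helpers in `torch.autograd.functional`. This is a sketch rather than the only approach; the scalar function `g` is a hypothetical example introduced here because a Hessian requires a scalar output.

```python
import torch
from torch.autograd.functional import jacobian, hessian

def f(x):
    # Vector-valued function from the example above
    return torch.stack([x[0] ** 2 + x[1], x[1] ** 3 * x[0]])

def g(x):
    # Hypothetical scalar function: g = x0^2 * x1 + x1^3
    return x[0] ** 2 * x[1] + x[1] ** 3

x = torch.tensor([1.0, 2.0])

J = jacobian(f, x)
print(J)  # [[2, 1], [8, 12]]

# Hessian of g is [[2*x1, 2*x0], [2*x0, 6*x1]]
H = hessian(g, x)
print(H)  # [[4, 2], [2, 12]]
```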
4. GPU Computing: CUDA Tensors and Device Management
4.1 GPU vs CPU Computation Flow
4.2 Device Detection and Management
import torch
# Check CUDA availability
print(torch.cuda.is_available()) # True if GPU present
print(torch.cuda.device_count()) # Number of GPUs
print(torch.cuda.get_device_name(0)) # GPU name
print(torch.cuda.get_device_properties(0))# Full properties
# Device agnostic code
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")
# Move tensors to device
x = torch.randn(3, 4)
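x_gpu = x.to(device) # Transfer to the selected device (no-op if already there)
x_gpu = x.cuda() # Same result on a CUDA machine; raises an error if no GPU is available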
x_cpu = x_gpu.cpu() # Transfer back to CPU
# Create directly on device
x = torch.randn(3, 4, device=device)
4.3 CUDA Operations
# Pin memory for faster CPU -> GPU transfer
x = torch.randn(1000, 1000).pin_memory()
y = x.cuda(non_blocking=True) # Async transfer
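# Mixed precision training
# (model, inputs, target, loss_fn and optimizer are assumed to be defined;
# recent PyTorch releases deprecate torch.cuda.amp in favor of torch.amp)
from torch.cuda.amp import autocast, GradScaler
scaler = GradScaler()
optimizer.zero_grad()
with autocast():
    output = model(inputs)
    loss = loss_fn(output, target)
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()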
# Memory management
print(torch.cuda.memory_allocated()) # Bytes allocated
print(torch.cuda.memory_reserved()) # Bytes reserved
torch.cuda.empty_cache() # Release cached, unused blocks back to the driver
torch.cuda.synchronize() # Block until all queued GPU work has finished
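CUDA kernels are launched asynchronously, so wall-clock timing is one place where `torch.cuda.synchronize()` matters. The helper below is a hypothetical sketch (the name `timed_matmul` is not a PyTorch API) that also runs on CPU-only machines.

```python
import time
import torch

def timed_matmul(n, device):
    """Time one n x n matrix multiply; returns (seconds, result)."""
    a = torch.randn(n, n, device=device)
    b = torch.randn(n, n, device=device)
    if device.type == "cuda":
        torch.cuda.synchronize()  # finish pending work before starting the clock
    start = time.perf_counter()
    c = a @ b
    if device.type == "cuda":
        torch.cuda.synchronize()  # wait for the kernel itself, not just its launch
    return time.perf_counter() - start, c

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
elapsed, c = timed_matmul(512, device)
print(f"{device}: {elapsed * 1e3:.2f} ms")
```

Without the second synchronize, a GPU measurement would mostly capture the launch overhead rather than the computation.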
4.4 Multi-GPU Training
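# DataParallel (simple but less efficient; single process, multiple GPUs)
import torch.nn as nn
model = nn.DataParallel(model)
# DistributedDataParallel (recommended; one process per GPU)
import os
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
dist.init_process_group('nccl')
local_rank = int(os.environ['LOCAL_RANK']) # Set by the torchrun launcher
torch.cuda.set_device(local_rank)
model = DDP(model.to(local_rank), device_ids=[local_rank])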
5. nn.Module: Building Custom Layers
5.1 Module Architecture
5.2 Custom Layer Implementation
import torch
import torch.nn as nn
import torch.nn.functional as F
class CustomLinear(nn.Module):
    """Custom linear layer: y = xW^T + b"""
    def __init__(self, in_features, out_features, bias=True):
        super().__init__()
        self.weight = nn.Parameter(
            torch.randn(out_features, in_features)
        )
        if bias:
            self.bias = nn.Parameter(torch.zeros(out_features))
        else:
            self.register_parameter('bias', None)
    def forward(self, x):
        return F.linear(x, self.weight, self.bias)
# Usage
layer = CustomLinear(128, 64)
x = torch.randn(32, 128) # batch_size=32, features=128
y = layer(x) # shape: (32, 64)
print(layer.weight.shape) # torch.Size([64, 128])
print(layer.bias.shape) # torch.Size([64])
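The docstring formula y = xW^T + b can be checked numerically against `F.linear`. A short sketch with randomly generated tensors (shapes chosen to match the usage above):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(32, 128)  # batch of 32 inputs
W = torch.randn(64, 128)  # (out_features, in_features)
b = torch.randn(64)

out = F.linear(x, W, b)   # what CustomLinear.forward calls
manual = x @ W.T + b      # the formula written out

matches = torch.allclose(out, manual, atol=1e-4)
print(out.shape, matches)
```

Storing the weight as (out_features, in_features) is why the transpose appears in the formula.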
5.3 Complete Network Architecture
class DeepNet(nn.Module):
    """Multi-layer network with skip connections"""
    def __init__(self, input_dim, hidden_dim, output_dim, num_layers=4, dropout=0.1):
        super().__init__()
        self.input_proj = nn.Linear(input_dim, hidden_dim)
        self.layers = nn.ModuleList([
            nn.Sequential(
                nn.Linear(hidden_dim, hidden_dim),
                nn.LayerNorm(hidden_dim),
                nn.GELU(),
                nn.Dropout(dropout),
            ) for _ in range(num_layers)
        ])
        self.output_proj = nn.Linear(hidden_dim, output_dim)
        self.dropout = nn.Dropout(dropout)
    def forward(self, x):
        h = self.input_proj(x)
        h = F.gelu(h)
        for layer in self.layers:
            h = h + layer(h)  # Skip connection
        h = self.dropout(h)
        return self.output_proj(h)
# Initialize model
model = DeepNet(input_dim=784, hidden_dim=256, output_dim=10)
# Count parameters
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Total parameters: {total_params:,}")
print(f"Trainable parameters: {trainable_params:,}")
5.4 Parameter Management
# Named parameters
for name, param in model.named_parameters():
    print(f"{name}: {param.shape}, requires_grad={param.requires_grad}")
# Freeze specific layers
for param in model.layers[:2].parameters():
    param.requires_grad = False
# Custom weight initialization
def init_weights(m):
    if isinstance(m, nn.Linear):
        nn.init.kaiming_normal_(m.weight, nonlinearity='relu')
        if m.bias is not None:
            nn.init.zeros_(m.bias)
model.apply(init_weights)
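After freezing, it is common to hand only the trainable parameters to the optimizer. The sketch below uses a small stand-in model (not the `DeepNet` above) to confirm that frozen weights receive no gradient and do not change.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))

# Freeze the first layer
for p in model[0].parameters():
    p.requires_grad = False

# Only parameters that still require grad go to the optimizer
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=0.1)
n_trainable = sum(p.numel() for p in trainable)

before = model[0].weight.clone()
loss = model(torch.randn(16, 4)).sum()
loss.backward()
optimizer.step()

frozen_unchanged = torch.equal(before, model[0].weight)
print(n_trainable)           # 18 (8*2 weights + 2 biases)
print(model[0].weight.grad)  # None -- no gradient was computed
print(frozen_unchanged)      # True
```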
6. DataLoader and Dataset
6.1 Data Pipeline Overview
6.2 Custom Dataset Implementation
from torch.utils.data import Dataset, DataLoader
class CustomDataset(Dataset):
    def __init__(self, features, labels, transform=None):
        self.features = torch.tensor(features, dtype=torch.float32)
        self.labels = torch.tensor(labels, dtype=torch.long)
        self.transform = transform
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, idx):
        x = self.features[idx]
        y = self.labels[idx]
        if self.transform:
            x = self.transform(x)
        return x, y
# Example usage
import numpy as np
X = np.random.randn(1000, 20)
y = np.random.randint(0, 5, 1000)
dataset = CustomDataset(X, y)
print(f"Dataset size: {len(dataset)}")
x_sample, y_sample = dataset[0]
print(f"Sample shape: {x_sample.shape}, label: {y_sample}")
6.3 DataLoader Configuration
# Create DataLoader
dataloader = DataLoader(
    dataset,
    batch_size=32,
    shuffle=True,
    num_workers=4,    # Parallel data loading
    pin_memory=True,  # Faster GPU transfer
    drop_last=True,   # Drop incomplete batches
    collate_fn=None,  # Custom collation (None uses the default)
    sampler=None,     # Custom sampling strategy (must stay None when shuffle=True)
)
# Iterate over batches (model, loss_fn, optimizer and device are assumed to be defined)
for batch_idx, (inputs, targets) in enumerate(dataloader):
    inputs = inputs.to(device)
    targets = targets.to(device)
    # Forward pass
    outputs = model(inputs)
    loss = loss_fn(outputs, targets)
    # Backward pass
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if batch_idx % 100 == 0:
        print(f"Batch {batch_idx}, Loss: {loss.item():.4f}")
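The `collate_fn` argument becomes useful when samples have different lengths and cannot be stacked directly. The following sketch pads variable-length sequences per batch; the function name `pad_collate` and the toy data are assumptions made for illustration.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

# Toy data: four sequences of different lengths with integer labels
sequences = [torch.arange(n, dtype=torch.float32) for n in (3, 5, 2, 4)]
labels = [0, 1, 0, 1]
data = list(zip(sequences, labels))  # a plain list works as a map-style dataset

def pad_collate(batch):
    """Pad each batch to its longest sequence and keep the true lengths."""
    seqs, labs = zip(*batch)
    lengths = torch.tensor([len(s) for s in seqs])
    padded = pad_sequence(list(seqs), batch_first=True, padding_value=0.0)
    return padded, lengths, torch.tensor(labs)

loader = DataLoader(data, batch_size=2, shuffle=False, collate_fn=pad_collate)
batches = list(loader)
for padded, lengths, labs in batches:
    print(padded.shape, lengths.tolist())
# torch.Size([2, 5]) [3, 5]
# torch.Size([2, 4]) [2, 4]
```

Returning the lengths alongside the padded tensor lets downstream code mask out the padding.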
6.4 Built-in Datasets and Transforms
from torchvision import datasets, transforms
# Standard transforms for image classification
transform_train = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomCrop(32, padding=4),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.4914, 0.4822, 0.4465],
        std=[0.2023, 0.1994, 0.2010]
    ),
])
transform_test = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.4914, 0.4822, 0.4465],
        std=[0.2023, 0.1994, 0.2010]
    ),
])
# Load CIFAR-10
train_dataset = datasets.CIFAR10(
    root='./data',
    train=True,
    download=True,
    transform=transform_train
)
test_dataset = datasets.CIFAR10(
    root='./data',
    train=False,
    download=True,
    transform=transform_test
)
train_loader = DataLoader(
    train_dataset,
    batch_size=128,
    shuffle=True,
    num_workers=2,
    pin_memory=True
)
7. Training Loop Pattern
7.1 Training Loop Overview
7.2 Complete Training Implementation
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
from torch.cuda.amp import autocast, GradScaler
def train_epoch(model, dataloader, optimizer, criterion, device, scaler=None):
    model.train()
    total_loss = 0.0
    correct = 0
    total = 0
    for batch_idx, (inputs, targets) in enumerate(dataloader):
        inputs, targets = inputs.to(device), targets.to(device)
        optimizer.zero_grad()
        # Mixed precision only when a GradScaler is supplied (i.e. on CUDA)
        with autocast(enabled=scaler is not None):
            outputs = model(inputs)
            loss = criterion(outputs, targets)
        if scaler is not None:
            scaler.scale(loss).backward()
            scaler.step(optimizer)
            scaler.update()
        else:
            loss.backward()
            optimizer.step()
        total_loss += loss.item() * inputs.size(0)
        _, predicted = outputs.max(1)
        total += targets.size(0)
        correct += predicted.eq(targets).sum().item()
    return total_loss / total, 100.0 * correct / total
def validate(model, dataloader, criterion, device):
    model.eval()
    total_loss = 0.0
    correct = 0
    total = 0
    with torch.no_grad():
        for inputs, targets in dataloader:
            inputs, targets = inputs.to(device), targets.to(device)
            outputs = model(inputs)
            loss = criterion(outputs, targets)
            total_loss += loss.item() * inputs.size(0)
            _, predicted = outputs.max(1)
            total += targets.size(0)
            correct += predicted.eq(targets).sum().item()
    return total_loss / total, 100.0 * correct / total
# Full training loop
def train_model(model, train_loader, val_loader, num_epochs=100, lr=1e-3):
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model = model.to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.AdamW(model.parameters(), lr=lr, weight_decay=1e-4)
    scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=num_epochs)
    scaler = GradScaler() if device.type == 'cuda' else None
    best_val_acc = 0.0
    for epoch in range(num_epochs):
        train_loss, train_acc = train_epoch(
            model, train_loader, optimizer, criterion, device, scaler
        )
        val_loss, val_acc = validate(model, val_loader, criterion, device)
        scheduler.step()
        print(f"Epoch {epoch+1}/{num_epochs}")
        print(f"  Train Loss: {train_loss:.4f}, Acc: {train_acc:.2f}%")
        print(f"  Val Loss: {val_loss:.4f}, Acc: {val_acc:.2f}%")
        if val_acc > best_val_acc:
            best_val_acc = val_acc
            torch.save(model.state_dict(), 'best_model.pth')
    return model
7.3 Optimizer Selection Guide
| Optimizer | Best For | Key Parameters |
|---|---|---|
| SGD | ConvNets, generalization | lr, momentum, weight_decay |
| Adam | Transformers, NLP, fast convergence | lr, betas, eps |
| AdamW | Transformers with weight decay | lr, weight_decay |
| RMSprop | RNNs | lr, alpha, momentum |
# SGD with momentum (classic)
optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)
# Adam (default betas)
optimizer = optim.Adam(model.parameters(), lr=1e-3)
# AdamW (proper weight decay)
optimizer = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)
# Learning rate schedulers
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', patience=5) # Step with the monitored metric: scheduler.step(val_loss)
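How a scheduler changes the learning rate is easiest to see by logging it per epoch. The sketch below uses a single dummy parameter and shortened settings (`step_size=2`, `patience=1`, `factor=0.5`), all chosen for illustration rather than as recommended values.

```python
import torch

# StepLR: multiply the learning rate by gamma every step_size epochs
param = torch.nn.Parameter(torch.zeros(1))
optimizer = torch.optim.SGD([param], lr=0.1)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=2, gamma=0.1)

lrs = []
for epoch in range(6):
    lrs.append(optimizer.param_groups[0]["lr"])
    optimizer.step()   # optimizer first ...
    scheduler.step()   # ... then the scheduler, once per epoch
print(lrs)  # approximately [0.1, 0.1, 0.01, 0.01, 0.001, 0.001]

# ReduceLROnPlateau: step() takes the metric being monitored
param2 = torch.nn.Parameter(torch.zeros(1))
optimizer2 = torch.optim.SGD([param2], lr=0.1)
plateau = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer2, mode="min", factor=0.5, patience=1
)
for val_loss in [1.0, 1.0, 1.0, 1.0]:  # a validation loss that never improves
    plateau.step(val_loss)
plateau_lr = optimizer2.param_groups[0]["lr"]
print(plateau_lr)  # 0.05 -- halved once after the loss stalled
```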
8. Saving and Loading Models
8.1 Model Persistence Strategies
import torch
# === Save/Load State Dict (recommended) ===
# Save
torch.save(model.state_dict(), 'model_weights.pth')
# Load
model = MyModel() # Must create model instance first
model.load_state_dict(torch.load('model_weights.pth', weights_only=True))
model.eval()
# === Save entire model (less flexible) ===
torch.save(model, 'model_complete.pth')
# Load (unpickles arbitrary objects, so only do this with trusted files)
model = torch.load('model_complete.pth', weights_only=False)
# === Save checkpoint with training state ===
checkpoint = {
    'epoch': epoch,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'scheduler_state_dict': scheduler.state_dict(),
    'loss': loss,
    'best_acc': best_acc,
}
torch.save(checkpoint, 'checkpoint.pth')
# Load checkpoint
checkpoint = torch.load('checkpoint.pth')
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
scheduler.load_state_dict(checkpoint['scheduler_state_dict'])
start_epoch = checkpoint['epoch'] + 1
best_acc = checkpoint['best_acc']
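A quick way to convince yourself that a state dict round-trips correctly is to save it to an in-memory buffer and load it into a fresh instance. This sketch uses a small `nn.Linear` as a stand-in for a real model and `io.BytesIO` in place of a file on disk.

```python
import io
import torch
import torch.nn as nn

model = nn.Linear(4, 2)

# Save the state dict to an in-memory buffer instead of a file
buffer = io.BytesIO()
torch.save(model.state_dict(), buffer)
buffer.seek(0)

# A fresh instance starts with different random weights
restored = nn.Linear(4, 2)
restored.load_state_dict(torch.load(buffer, weights_only=True))

same_weight = torch.equal(model.weight, restored.weight)
same_bias = torch.equal(model.bias, restored.bias)
print(same_weight, same_bias)  # True True
```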
8.2 Serialization Formats
| Format | Description | Use Case |
|---|---|---|
| .pth | PyTorch native | Most common, simplest |
| .pt | PyTorch native (alias) | Same as .pth |
| .bin | PyTorch binary | HuggingFace models |
| .safetensors | Safe, fast format | Recommended for sharing |
| .onnx | Open Neural Network Exchange | Cross-framework deployment |
# SafeTensors (recommended for security)
from safetensors.torch import save_file, load_file
save_file(model.state_dict(), 'model.safetensors')
state_dict = load_file('model.safetensors')
model.load_state_dict(state_dict)
# ONNX export
dummy_input = torch.randn(1, 3, 224, 224)
torch.onnx.export(
    model, dummy_input, 'model.onnx',
    export_params=True,
    opset_version=17,
    do_constant_folding=True,
    input_names=['input'],
    output_names=['output'],
)
8.3 Multi-GPU and Distributed Checkpoints
# Save DDP model
torch.save(model.module.state_dict(), 'ddp_model.pth')
# Load on single GPU
model = MyModel()
model.load_state_dict(torch.load('ddp_model.pth', weights_only=True))
# Distributed checkpoint (torch.distributed.checkpoint; the save/load entry points need a recent PyTorch 2.x release)
from torch.distributed.checkpoint import save, load
save(model.state_dict(), checkpoint_id="./checkpoints/ckpt")
state_dict = model.state_dict() # load() fills an existing state dict in place
load(state_dict, checkpoint_id="./checkpoints/ckpt")
model.load_state_dict(state_dict)
Summary
Key Takeaways:
- Tensors are PyTorch's fundamental data structure, supporting GPU acceleration and automatic differentiation
- Autograd builds dynamic computation graphs and implements reverse-mode automatic differentiation
- nn.Module provides the base class for all neural network layers and models
- DataLoader/Dataset handle efficient data loading with parallelism and transformations
- Training loop follows the pattern: zero_grad -> forward -> loss -> backward -> step
- Model saving should use state_dict for flexibility and reproducibility
Mathematical Foundations
The core operations in PyTorch implement fundamental mathematical transformations:
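- Matrix Multiplication: $C = AB$, where $A \in \mathbb{R}^{m \times n}$, $B \in \mathbb{R}^{n \times p}$, and therefore $C \in \mathbb{R}^{m \times p}$
- Backpropagation: Using the chain rule, gradients flow backward through the computational graph: $\frac{\partial L}{\partial x} = \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial x}$
- Gradient Descent Update: $\theta \leftarrow \theta - \eta \nabla_\theta L$, where $\eta$ is the learning rate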
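The gradient descent update can be written out by hand with autograd supplying the gradient. The sketch below minimizes the hypothetical loss (theta - 3)^2, chosen only because its minimum at theta = 3 is known in advance.

```python
import torch

theta = torch.tensor(0.0, requires_grad=True)
eta = 0.1  # learning rate

for step in range(100):
    loss = (theta - 3.0) ** 2
    loss.backward()                # computes dL/dtheta = 2 * (theta - 3)
    with torch.no_grad():
        theta -= eta * theta.grad  # theta <- theta - eta * grad
    theta.grad.zero_()             # reset before the next iteration

print(theta.item())  # close to 3.0
```

An optimizer such as `optim.SGD` performs the same update inside `optimizer.step()`.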
Best Practices
- Always call optimizer.zero_grad() before each backward pass
- Use torch.no_grad() for inference to save memory
- Prefer state_dict over saving entire models
- Use pin_memory=True and num_workers>0 in DataLoader for GPU training
- Use mixed precision (torch.cuda.amp) for faster training on modern GPUs
- Move model and data to the same device before training
Common Pitfalls
- Forgetting to call model.eval() during validation/inference
- Not zeroing gradients between iterations (accumulation)
- Using in-place operations on tensors that require gradients
- Device mismatch between model and data tensors
- Using torch.load() without weights_only=True for untrusted files