PyTorch Fundamentals

Introduction

PyTorch is an open-source deep learning framework known for its dynamic computation graphs, Pythonic design, and strong GPU acceleration. It is widely used in research and is increasingly used in production.

Architecture Diagram

PyTorch Ecosystem:
═══════════════════════════════════════════════════════════════════

 ┌─────────────────────────────────────────────────────────────┐
 │                    PyTorch Core                             │
 │  ┌─────────┐  ┌─────────┐  ┌─────────┐  ┌─────────┐      │
 │  │ Tensors │  │Autograd │  │  nn     │  │Optim    │      │
 │  │         │  │         │  │ Module  │  │         │      │
 │  └─────────┘  └─────────┘  └─────────┘  └─────────┘      │
 └─────────────────────────────────────────────────────────────┘
        │              │              │              │
        ▼              ▼              ▼              ▼
 ┌─────────────┐ ┌──────────┐ ┌─────────────┐ ┌──────────┐
 │   CUDA      │ │  Data    │ │  torchvision│ │ torchaudio│
 │  (GPU)      │ │ Loading  │ │  (Vision)   │ │ (Audio)  │
 └─────────────┘ └──────────┘ └─────────────┘ └──────────┘
═══════════════════════════════════════════════════════════════════

Tensors

Creating Tensors

Definition: Tensor

A tensor is a multi-dimensional array, the fundamental data structure in PyTorch. Tensors are similar to NumPy arrays but can run on GPUs for accelerated computing. Formally, a tensor of order $k$ is a multi-linear map from $k$ vector spaces to the real numbers.

\mathbf{C} = \mathbf{A} \cdot \mathbf{B} \quad \text{where} \quad \mathbf{A} \in \mathbb{R}^{m \times n}, \; \mathbf{B} \in \mathbb{R}^{n \times p}, \; \mathbf{C} \in \mathbb{R}^{m \times p}

ℹ️ Tensor vs. NumPy Array

The key difference between a PyTorch tensor and a NumPy array is that tensors support automatic differentiation (autograd) and can reside in GPU memory. The requires_grad flag determines whether PyTorch tracks operations on the tensor for gradient computation.
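A minimal sketch of both points, assuming a CPU-only run. It shows that torch.from_numpy returns a tensor sharing memory with the source array, and that only a tensor created with requires_grad=True receives a gradient after backward(). The variable names are illustrative.

```python
import numpy as np
import torch

# torch.from_numpy shares memory with the source array (no copy is made).
arr = np.array([1.0, 2.0, 3.0])
shared = torch.from_numpy(arr)
arr[0] = 10.0                      # the change is visible through the tensor

# requires_grad=True makes autograd record operations on the tensor.
w = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
loss = (w ** 2).sum()
loss.backward()                    # d(loss)/dw = 2 * w

# Without the flag, results are not attached to a computation graph.
plain = torch.tensor([1.0, 2.0, 3.0])
untracked = (plain ** 2).sum()

print(shared[0].item())            # 10.0
print(w.grad)                      # tensor([2., 4., 6.])
print(untracked.requires_grad)     # False
```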

import torch
import numpy as np

# ═══════════════════════════════════════════════════
# Tensor Creation Methods
# ═══════════════════════════════════════════════════

# From Python lists
t1 = torch.tensor([1, 2, 3, 4])
print(f"From list: {t1}")

# From NumPy array
np_array = np.array([[1, 2], [3, 4]])
t2 = torch.from_numpy(np_array)
print(f"From NumPy: {t2}")

# Common creation functions
zeros = torch.zeros(3, 4)          # 3x4 zeros
ones = torch.ones(3, 4)           # 3x4 ones
rand = torch.rand(3, 4)           # Uniform [0, 1)
randn = torch.randn(3, 4)         # Standard normal
eye = torch.eye(3)                # Identity matrix
arange = torch.arange(0, 10, 2)   # [0, 2, 4, 6, 8]
linspace = torch.linspace(0, 1, 5) # [0, 0.25, 0.5, 0.75, 1]

print(f"\nZeros:\n{zeros}")
print(f"\nRandom:\n{randn}")
print(f"\nArange: {arange}")

# ═══════════════════════════════════════════════════
# Tensor Properties
# ═══════════════════════════════════════════════════
t = torch.randn(3, 4, 5)

print(f"\nShape: {t.shape}")           # torch.Size([3, 4, 5])
print(f"Dimensions: {t.ndim}")         # 3
print(f"Numel: {t.numel()}")           # 60
print(f"Dtype: {t.dtype}")             # torch.float32
print(f"Device: {t.device}")           # cpu
print(f"Requires grad: {t.requires_grad}")  # False

# ═══════════════════════════════════════════════════
# Tensor Operations
# ═══════════════════════════════════════════════════

# Element-wise operations
a = torch.tensor([1.0, 2.0, 3.0])
b = torch.tensor([4.0, 5.0, 6.0])

print(f"\nAddition: {a + b}")
print(f"Multiplication: {a * b}")
print(f"Power: {a ** 2}")
print(f"Square root: {torch.sqrt(a)}")

# Matrix operations
A = torch.randn(3, 4)
B = torch.randn(4, 5)

# Matrix multiplication
C = torch.mm(A, B)          # 3x5
C_alt = A @ B               # Same as mm
print(f"\nMatrix multiply shape: {C.shape}")

# Dot product
v1 = torch.randn(5)
v2 = torch.randn(5)
dot = torch.dot(v1, v2)
print(f"Dot product: {dot}")

# Broadcasting
x = torch.randn(3, 4)
y = torch.randn(4)
result = x + y  # y broadcasts to (3, 4)
print(f"\nBroadcasting result shape: {result.shape}")

Matrix Multiplication (Core Operation)

C_{ij} = \sum_{k=1}^{n} A_{ik} \cdot B_{kj}

Here,

$A$ = Input matrix of shape (m x n)
$B$ = Input matrix of shape (n x p)
$C$ = Output matrix of shape (m x p)
$n$ = Inner dimension (must match)

💡 Tensor Operations in PyTorch

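PyTorch provides several ways to perform matrix multiplication: torch.mm(A, B), torch.matmul(A, B), and the @ operator. torch.mm accepts only 2-D matrices, while torch.matmul and @ (which calls torch.matmul) also broadcast over leading batch dimensions. For batched operations on 3-D tensors, torch.bmm(A, B) performs batched matrix multiplication. The @ operator is the most Pythonic and is generally the recommended approach.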
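The sketch below compares these entry points on small random inputs. The shapes and variable names are chosen for illustration only; the point is that the 2-D variants agree and that @ also handles batched and broadcast cases.

```python
import torch

torch.manual_seed(0)
A = torch.randn(3, 4)
B = torch.randn(4, 5)

# Three equivalent ways to multiply 2-D matrices.
C_mm = torch.mm(A, B)
C_matmul = torch.matmul(A, B)
C_at = A @ B

# Batched multiplication: one (3x4) @ (4x5) product per batch element.
batch_A = torch.randn(8, 3, 4)
batch_B = torch.randn(8, 4, 5)
C_bmm = torch.bmm(batch_A, batch_B)   # requires 3-D inputs
C_batched = batch_A @ batch_B         # same result via matmul

# matmul broadcasts a 2-D matrix across the batch dimension.
C_broadcast = batch_A @ B             # (8, 3, 4) @ (4, 5) -> (8, 3, 5)

print(C_mm.shape)         # torch.Size([3, 5])
print(C_bmm.shape)        # torch.Size([8, 3, 5])
print(C_broadcast.shape)  # torch.Size([8, 3, 5])
```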

Tensor Manipulation

# ═══════════════════════════════════════════════════
# Reshaping Operations
# ═══════════════════════════════════════════════════
t = torch.arange(12)
print(f"Original: {t.shape} -> {t}")

# Reshape
t_reshaped = t.reshape(3, 4)
print(f"Reshape (3,4): {t_reshaped.shape}")

# View (shares memory)
t_view = t.view(4, 3)
print(f"View (4,3): {t_view.shape}")

# Transpose
t_transposed = t_reshaped.T
print(f"Transpose: {t_transposed.shape}")

# Permute
t_permuted = torch.randn(2, 3, 4).permute(2, 0, 1)
print(f"Permute (2,3,4) -> (4,2,3): {t_permuted.shape}")

# Flatten
t_flat = t_reshaped.flatten()
print(f"Flatten: {t_flat.shape}")

# Squeeze/Unsqueeze
t_squeeze = torch.randn(1, 3, 1, 4).squeeze()
print(f"Squeeze: {t_squeeze.shape}")

t_unsqueeze = t_squeeze.unsqueeze(0)
print(f"Unsqueeze: {t_unsqueeze.shape}")

# ═══════════════════════════════════════════════════
# Indexing and Slicing
# ═══════════════════════════════════════════════════
t = torch.arange(24).reshape(2, 3, 4)

print(f"\nOriginal shape: {t.shape}")
print(f"t[0]: {t[0].shape}")           # First batch
print(f"t[:, 1]: {t[:, 1].shape}")     # Second row
print(f"t[0, :, 2]: {t[0, :, 2].shape}")  # Third column

# Boolean indexing
mask = t > 15
filtered = t[mask]
print(f"\nFiltered (>15): {filtered}")

# Fancy indexing
indices = torch.tensor([0, 2])
selected = t[:, indices, :]
print(f"Fancy indexing: {selected.shape}")

# ═══════════════════════════════════════════════════
# Device Management (CPU/GPU)
# ═══════════════════════════════════════════════════
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"\nUsing device: {device}")

# Move tensor to GPU
t_gpu = t.to(device)
print(f"Tensor device: {t_gpu.device}")

# Create directly on GPU
t_on_gpu = torch.randn(3, 4, device=device)
print(f"Created on GPU: {t_on_gpu.device}")

# Check GPU memory
if torch.cuda.is_available():
    print(f"GPU Name: {torch.cuda.get_device_name(0)}")
    print(f"Memory Allocated: {torch.cuda.memory_allocated(0) / 1e9:.2f} GB")

Autograd (Automatic Differentiation)

Computation Graphs

Definition: Autograd (Automatic Differentiation)

Autograd is PyTorch's automatic differentiation engine. It records operations on tensors to build a dynamic computation graph (a directed acyclic graph, or DAG), then computes gradients via backpropagation when .backward() is called. This is a form of reverse-mode automatic differentiation.

Theorem: Chain Rule in Autograd

For a composition of functions $y = f(g(h(x)))$, the derivative is $\frac{dy}{dx} = f'(g(h(x))) \cdot g'(h(x)) \cdot h'(x)$. Autograd implements this automatically by traversing the computation graph in reverse topological order, applying the chain rule at each node.
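As a small check of the chain rule, the sketch below picks an arbitrary composition (f = sin, g = exp, h = x², chosen purely for illustration) and compares the autograd gradient against the hand-derived product of the three local derivatives.

```python
import torch

x = torch.tensor(0.5, requires_grad=True)

# y = f(g(h(x))) with h(x) = x^2, g(u) = exp(u), f(v) = sin(v)
h = x ** 2
g = torch.exp(h)
y = torch.sin(g)
y.backward()

# Hand-derived chain rule: f'(g(h(x))) * g'(h(x)) * h'(x)
with torch.no_grad():
    manual = torch.cos(torch.exp(x ** 2)) * torch.exp(x ** 2) * (2 * x)

print(f"autograd: {x.grad.item():.6f}")
print(f"manual:   {manual.item():.6f}")
```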

ℹ️ Dynamic vs. Static Graphs

PyTorch uses dynamic computation graphs (define-by-run): the graph is built on-the-fly during each forward pass. This differs from TensorFlow 1.x's static graphs, which required pre-defining the entire computation. Dynamic graphs enable natural Python control flow (if/else, loops) within the model, making debugging and experimentation much easier.
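To illustrate define-by-run, the sketch below uses a hypothetical toy forward function whose loop count and branch depend on the input values. Autograd records whichever operations actually ran, so the gradient reflects the path taken for this particular input.

```python
import torch

def forward(x):
    """Toy forward pass whose graph depends on the input values."""
    y = x
    steps = 0
    while y.norm() < 10:      # data-dependent loop
        y = y * 2
        steps += 1
    if y.sum() > 0:           # data-dependent branch
        out = y.sum()
    else:
        out = -y.sum()
    return out, steps

x = torch.tensor([1.0, 2.0], requires_grad=True)
out, steps = forward(x)
out.backward()

# This input is doubled 3 times, so out = 8 * (x1 + x2) and d(out)/dx = 8.
print(steps)     # 3
print(x.grad)    # tensor([8., 8.])
```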

Architecture Diagram

Autograd Computation Graph:
═══════════════════════════════════════════════════

 Forward Pass:
 ─────────────

 x (requires_grad=True)
 │
 ▼
 ┌─────┐
 │  *  │ ← y = x * 2
 │  2  │
 └──┬──┘
    │
    ▼
 ┌─────┐
 │  +  │ ← z = y + 3
 │  3  │
 └──┬──┘
    │
    ▼
   L = z²  (loss)

 Backward Pass (Auto):
 ─────────────────────
    ▲
    │
 ┌─────┐
 │ dL  │ ← dL/dz = 2z
 │ /dz │
 └──┬──┘
    │
    ▼
 ┌─────┐
 │ dL  │ ← dL/dy = 2z * 1
 │ /dy │
 └──┬──┘
    │
    ▼
 ┌─────┐
 │ dL  │ ← dL/dx = 2z * 2
 │ /dx │
 └─────┘
═══════════════════════════════════════════════════

# ═══════════════════════════════════════════════════
# Autograd Basics
# ═══════════════════════════════════════════════════

# Create tensor with gradient tracking
x = torch.tensor(2.0, requires_grad=True)
print(f"x: {x}")
print(f"x.grad_fn: {x.grad_fn}")  # None for leaf tensors

# Forward pass
y = x ** 2 + 3 * x + 1
print(f"\ny = x² + 3x + 1")
print(f"y: {y}")
print(f"y.grad_fn: {y.grad_fn}")  # <AddBackward0 object at ...>

# Backward pass (compute gradients)
y.backward()

# dy/dx = 2x + 3 = 2(2) + 3 = 7
print(f"\ndy/dx at x=2: {x.grad}")  # tensor(7.)

# ═══════════════════════════════════════════════════
# Gradient Computation Example
# ═══════════════════════════════════════════════════
x = torch.randn(3, requires_grad=True)
w = torch.randn(3, requires_grad=True)

print(f"\nx: {x}")
print(f"w: {w}")

# Forward pass
y = torch.dot(x, w)
z = y ** 2

print(f"\ny = x · w = {y}")
print(f"z = y² = {z}")

# Backward pass
z.backward()

print(f"\ndz/dx = 2y · w = {x.grad}")
print(f"dz/dw = 2y · x = {w.grad}")

# Verify manually
manual_grad_x = 2 * y.detach() * w
manual_grad_w = 2 * y.detach() * x
print(f"\nManual dz/dx: {manual_grad_x}")
print(f"Manual dz/dw: {manual_grad_w}")

# ═══════════════════════════════════════════════════
# Gradient Control
# ═══════════════════════════════════════════════════

# Zero gradients (important before each step!)
x.grad.zero_()

# Stop gradient tracking
with torch.no_grad():
    y = x * 2
    print(f"\nNo grad - y: {y}")
    print(f"y.grad_fn: {y.grad_fn}")  # None

# Detach from graph
y_detached = y.detach()
print(f"Detached: {y_detached}")

# Gradient accumulation (default behavior)
x.grad.zero_()
for _ in range(3):
    y = (x * 2).sum()  # backward() without arguments needs a scalar output
    y.backward()
    print(f"Gradient after step: {x.grad}")
    # Gradients accumulate! Must zero manually

Gradient of Composite Function

\frac{dz}{dx} = \frac{dz}{dy} \cdot \frac{dy}{dx} = 2y \cdot (2x + 3) = 2(x^2 + 3x + 1)(2x + 3) \quad \text{where} \quad z = y^2, \; y = x^2 + 3x + 1

Here,

$z$ = Output loss function
$y$ = Intermediate variable
$x$ = Input variable
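The composite gradient can be checked numerically. The sketch below assumes $z = y^2$ with $y = x^2 + 3x + 1$ at $x = 2$, and calls retain_grad() on the intermediate tensor so that $dz/dy$ is also available after the backward pass.

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
y = x ** 2 + 3 * x + 1     # y = 11 at x = 2
y.retain_grad()            # keep the gradient of this non-leaf tensor
z = y ** 2                 # z = 121

z.backward()

# dz/dy = 2y = 22, dy/dx = 2x + 3 = 7, so dz/dx = 22 * 7 = 154
print(y.grad)   # tensor(22.)
print(x.grad)   # tensor(154.)
```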

💡 Zeroing Gradients

Gradients accumulate by default in PyTorch. Call optimizer.zero_grad() before each backward pass to prevent gradients from accumulating across mini-batches. Alternatively, use model.zero_grad() to zero the gradients of a specific model's parameters.
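A minimal sketch of the accumulation behavior, using a small nn.Linear layer and an SGD optimizer as stand-ins (both chosen only for illustration). Two backward passes without zeroing double the stored gradient; calling optimizer.zero_grad() first restores the single-pass value.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(3, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x = torch.randn(4, 3)

# First backward pass.
model(x).sum().backward()
g1 = model.weight.grad.clone()

# Second backward pass without zeroing: the new gradient is added to the old.
model(x).sum().backward()
g2 = model.weight.grad.clone()

# Clear the gradients. Depending on the PyTorch version, .grad is either
# reset to None or filled with zeros.
optimizer.zero_grad()
grad_after_zero = model.weight.grad
cleared = grad_after_zero is None or bool((grad_after_zero == 0).all())

# Third backward pass after zeroing: back to the single-pass gradient.
model(x).sum().backward()
g3 = model.weight.grad.clone()

print(torch.allclose(g2, 2 * g1))  # True
print(cleared)                     # True
print(torch.allclose(g3, g1))      # True
```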

nn.Module and Layers

Building Custom Layers

import torch.nn as nn
import torch.nn.functional as F

# ═══════════════════════════════════════════════════
# Custom Linear Layer
# ═══════════════════════════════════════════════════
class CustomLinear(nn.Module):
    def __init__(self, in_features, out_features, bias=True):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features))
        if bias:
            self.bias = nn.Parameter(torch.randn(out_features))
        else:
            self.bias = None

        # Initialize weights
        nn.init.kaiming_normal_(self.weight)
        if self.bias is not None:
            nn.init.zeros_(self.bias)

    def forward(self, x):
        return F.linear(x, self.weight, self.bias)

# Test custom layer
custom_linear = CustomLinear(10, 5)
x = torch.randn(32, 10)  # Batch of 32
y = custom_linear(x)
print(f"Custom Linear output: {y.shape}")

# ═══════════════════════════════════════════════════
# Built-in Layers
# ═══════════════════════════════════════════════════

# Linear (fully connected)
linear = nn.Linear(10, 5)
print(f"\nLinear: {linear}")

# Convolutional
conv2d = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
x_img = torch.randn(1, 3, 32, 32)  # Batch, channels, H, W
y_img = conv2d(x_img)
print(f"Conv2d: {y_img.shape}")  # [1, 16, 32, 32]

# Pooling
pool = nn.MaxPool2d(2, 2)
y_pool = pool(y_img)
print(f"MaxPool2d: {y_pool.shape}")  # [1, 16, 16, 16]

# Batch Normalization
bn = nn.BatchNorm2d(16)
y_bn = bn(y_img)
print(f"BatchNorm2d: {y_bn.shape}")

# Dropout
dropout = nn.Dropout(0.5)
y_drop = dropout(y_img)
print(f"Dropout: {y_drop.shape}")

# ═══════════════════════════════════════════════════
# Complete Model Architecture
# ═══════════════════════════════════════════════════
class CNNClassifier(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()

        # Feature extractor
        self.features = nn.Sequential(
            # Block 1: 3 -> 32 channels
            nn.Conv2d(3, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(inplace=True),
            nn.Conv2d(32, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),
            nn.Dropout2d(0.25),

            # Block 2: 32 -> 64 channels
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),
            nn.Dropout2d(0.25),

            # Block 3: 64 -> 128 channels
            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),
            nn.Dropout2d(0.25)
        )

        # Classifier
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * 4 * 4, 256),
            nn.ReLU(inplace=True),
            nn.Dropout(0.5),
            nn.Linear(256, 128),
            nn.ReLU(inplace=True),
            nn.Dropout(0.5),
            nn.Linear(128, num_classes)
        )

    def forward(self, x):
        x = self.features(x)
        x = self.classifier(x)
        return x

# Initialize model
model = CNNClassifier(num_classes=10)
print(f"\nModel parameters: {sum(p.numel() for p in model.parameters()):,}")

# Test forward pass
x = torch.randn(8, 3, 32, 32)
y = model(x)
print(f"Output shape: {y.shape}")

📝 Autograd Gradient Computation

Given $x = 2.0$ , compute $\frac{d}{dx}(x^2 + 3x + 1)$ .

Forward pass: $y = x^2 + 3x + 1 = 4 + 6 + 1 = 11$

Manual derivative: $\frac{dy}{dx} = 2x + 3 = 2(2) + 3 = 7$

PyTorch autograd:

x = torch.tensor(2.0, requires_grad=True)
y = x**2 + 3*x + 1
y.backward()
print(x.grad)  # tensor(7.)

Autograd automatically computes the same result by tracking the computation graph and applying the chain rule in reverse.

Dataset and DataLoader

from torch.utils.data import Dataset, DataLoader, random_split
from torchvision import transforms

# ═══════════════════════════════════════════════════
# Custom Dataset
# ═══════════════════════════════════════════════════
class CustomDataset(Dataset):
    def __init__(self, num_samples=1000, transform=None):
        self.X = torch.randn(num_samples, 10)
        self.y = (self.X.sum(dim=1) > 0).long()
        self.transform = transform

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        x = self.X[idx]
        y = self.y[idx]

        if self.transform:
            x = self.transform(x)

        return x, y

# Create dataset
dataset = CustomDataset(num_samples=1000)
print(f"Dataset size: {len(dataset)}")

# Sample
x, y = dataset[0]
print(f"Sample x: {x.shape}, y: {y}")

# ═══════════════════════════════════════════════════
# Train/Validation Split
# ═══════════════════════════════════════════════════
train_size = int(0.8 * len(dataset))
val_size = len(dataset) - train_size
train_dataset, val_dataset = random_split(dataset, [train_size, val_size])

print(f"Train: {len(train_dataset)}, Val: {len(val_dataset)}")

# ═══════════════════════════════════════════════════
# DataLoader
# ═══════════════════════════════════════════════════
train_loader = DataLoader(
    train_dataset,
    batch_size=32,
    shuffle=True,
    num_workers=0,
    pin_memory=True,
    drop_last=True
)

val_loader = DataLoader(
    val_dataset,
    batch_size=64,
    shuffle=False,
    num_workers=0
)

# Iterate through batches
for batch_idx, (x_batch, y_batch) in enumerate(train_loader):
    print(f"\nBatch {batch_idx}: x={x_batch.shape}, y={y_batch.shape}")
    if batch_idx >= 2:
        break

# ═══════════════════════════════════════════════════
# Image Dataset with Transforms
# ═══════════════════════════════════════════════════
transform = transforms.Compose([
    transforms.Resize((32, 32)),
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(10),
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.4914, 0.4822, 0.4465],
        std=[0.2470, 0.2435, 0.2616]
    )
])

# Using torchvision datasets
from torchvision.datasets import CIFAR10

cifar_train = CIFAR10(
    root='./data',
    train=True,
    download=True,
    transform=transform
)

cifar_loader = DataLoader(
    cifar_train,
    batch_size=64,
    shuffle=True,
    num_workers=2
)

print(f"\nCIFAR-10 training samples: {len(cifar_train)}")

Cross-Entropy Loss (Classification)

\mathcal{L} = -\sum_{c=1}^{C} y_c \log(\hat{y}_c)

Here,

$C$ = Number of classes
$y_c$ = True label (one-hot encoded)
$\hat{y}_c$ = Predicted probability for class c

ℹ️ Why Cross-Entropy with Logits

PyTorch provides nn.CrossEntropyLoss(), which combines nn.LogSoftmax() and nn.NLLLoss() in one operation. This is numerically more stable than applying softmax separately and then taking the logarithm, because log-softmax is computed with the log-sum-exp trick, which avoids overflow in the exponentials and taking the log of zero.
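The sketch below checks this equivalence on random logits (batch size and class count are arbitrary), compares it with the cross-entropy formula averaged over the batch, and then shows a deliberately extreme logit where a naive softmax-then-log breaks down while log-softmax stays finite.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 3)              # 4 samples, 3 classes (raw scores)
target = torch.tensor([0, 2, 1, 2])     # class indices

# CrossEntropyLoss on logits equals NLLLoss on log-softmax outputs.
ce = nn.CrossEntropyLoss()(logits, target)
nll = nn.NLLLoss()(nn.LogSoftmax(dim=1)(logits), target)

# The formula -sum_c y_c * log(y_hat_c), averaged over the batch.
probs = torch.softmax(logits, dim=1)
one_hot = F.one_hot(target, num_classes=3).float()
manual = -(one_hot * probs.log()).sum(dim=1).mean()

# Stability: exp(1000) overflows in float32, so the naive version fails.
big = torch.tensor([[1000.0, 0.0]])
naive = torch.log(torch.exp(big) / torch.exp(big).sum(dim=1, keepdim=True))
stable = F.log_softmax(big, dim=1)

print(ce.item(), nll.item(), manual.item())
print(naive)    # contains nan and -inf
print(stable)   # finite values
```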

Complete Training Loop

import torch.optim as optim
from torch.optim.lr_scheduler import CosineAnnealingLR

# ═══════════════════════════════════════════════════
# Training Loop Implementation
# ═══════════════════════════════════════════════════
def train_epoch(model, loader, criterion, optimizer, device):
    model.train()
    total_loss = 0
    correct = 0
    total = 0

    for batch_idx, (data, target) in enumerate(loader):
        data, target = data.to(device), target.to(device)

        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()

        total_loss += loss.item()
        _, predicted = output.max(1)
        total += target.size(0)
        correct += predicted.eq(target).sum().item()

    accuracy = 100. * correct / total
    avg_loss = total_loss / len(loader)
    return avg_loss, accuracy

def validate(model, loader, criterion, device):
    model.eval()
    total_loss = 0
    correct = 0
    total = 0

    with torch.no_grad():
        for data, target in loader:
            data, target = data.to(device), target.to(device)
            output = model(data)
            loss = criterion(output, target)

            total_loss += loss.item()
            _, predicted = output.max(1)
            total += target.size(0)
            correct += predicted.eq(target).sum().item()

    accuracy = 100. * correct / total
    avg_loss = total_loss / len(loader)
    return avg_loss, accuracy

# ═══════════════════════════════════════════════════
# Full Training Pipeline
# ═══════════════════════════════════════════════════
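device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# CNNClassifier expects 3x32x32 images, so train on CIFAR-10 rather than
# the 10-feature CustomDataset loaders defined earlier.
# The CIFAR-10 test split serves here as a held-out validation set.
val_transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.4914, 0.4822, 0.4465],
        std=[0.2470, 0.2435, 0.2616]
    )
])
cifar_val = CIFAR10(root='./data', train=False, download=True,
                    transform=val_transform)
cifar_val_loader = DataLoader(cifar_val, batch_size=64, shuffle=False,
                              num_workers=2)

# Initialize model, loss, optimizer
model = CNNClassifier(num_classes=10).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-4)
scheduler = CosineAnnealingLR(optimizer, T_max=50)

# Training loop
num_epochs = 50
history = {'train_loss': [], 'val_loss': [],
           'train_acc': [], 'val_acc': []}

for epoch in range(num_epochs):
    train_loss, train_acc = train_epoch(
        model, cifar_loader, criterion, optimizer, device
    )
    val_loss, val_acc = validate(
        model, cifar_val_loader, criterion, device
    )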
    scheduler.step()

    history['train_loss'].append(train_loss)
    history['val_loss'].append(val_loss)
    history['train_acc'].append(train_acc)
    history['val_acc'].append(val_acc)

    if (epoch + 1) % 10 == 0:
        print(f"Epoch {epoch+1}/{num_epochs}:")
        print(f"  Train Loss: {train_loss:.4f}, Acc: {train_acc:.2f}%")
        print(f"  Val Loss: {val_loss:.4f}, Acc: {val_acc:.2f}%")

# Save model
torch.save({
    'epoch': num_epochs,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'val_acc': val_acc,
}, 'model_checkpoint.pth')
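A checkpoint saved in this dictionary layout can be restored with torch.load and load_state_dict. The sketch below is self-contained: it uses a small nn.Linear model as a stand-in for CNNClassifier, a temporary directory, and an arbitrary epoch value, all of which are assumptions for illustration.

```python
import os
import tempfile

import torch
import torch.nn as nn
import torch.optim as optim

# Stand-in model and optimizer (the tutorial's CNNClassifier works the same way).
model = nn.Linear(4, 2)
optimizer = optim.Adam(model.parameters(), lr=1e-3)

path = os.path.join(tempfile.mkdtemp(), "model_checkpoint.pth")
torch.save({
    'epoch': 5,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
}, path)

# Rebuild the same architecture, then load the saved state into it.
restored = nn.Linear(4, 2)
restored_optimizer = optim.Adam(restored.parameters(), lr=1e-3)

checkpoint = torch.load(path, map_location='cpu')
restored.load_state_dict(checkpoint['model_state_dict'])
restored_optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
start_epoch = checkpoint['epoch']

restored.eval()  # switch to evaluation mode before inference
print(f"Resuming from epoch {start_epoch}")
```

Passing map_location='cpu' lets a checkpoint written on a GPU machine load on a CPU-only one; the model can be moved to another device afterwards with .to(device).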

📋 Key Takeaways

Tensors are PyTorch's core data structure: multi-dimensional arrays with GPU support and automatic differentiation
Autograd builds dynamic computation graphs and computes gradients via reverse-mode automatic differentiation (backpropagation)
nn.Module is the base class for all neural network layers and models; the forward() method defines the computation
Dataset/DataLoader provide efficient data loading with batching, shuffling, and parallelism via multiprocessing
Always zero gradients with optimizer.zero_grad() before each backward pass to prevent accumulation
Use torch.no_grad() during evaluation to disable gradient tracking and save memory
Cross-Entropy Loss should be used with raw logits (not softmax outputs) for numerical stability

Practice Exercises

Tensor Operations: Implement matrix multiplication and verify with torch.mm and @
Gradient Flow: Create a deep network and visualize gradient magnitudes across layers
Custom Layer: Implement a custom attention mechanism as an nn.Module
Data Pipeline: Create a custom dataset for CSV data with preprocessing transforms
