Optimization
Low-Rank Factorization β Compressing Weights Through Structure
Large weight matrices contain significant redundancy. Low-rank factorization decomposes these matrices into products of smaller matrices, reducing parameters by 10-100Γ while preserving the essential computation. This underpins LoRA, weight sharing, and efficient inference.
- SVD Decomposition β Finding the optimal low-rank approximation
- LoRA Theory β Why low-rank adaptation works for fine-tuning
- Weight Sharing β Reusing parameters across layers
- Practical Methods β SVD, truncated SVD, and randomized factorization
The best compression is finding the structure that was always there.
Low-Rank Factorization
Neural network weight matrices are often low-rank, meaning most of their information can be captured by a much smaller set of dimensions. Low-rank factorization exploits this structure to reduce parameters and computation.
DfLow-Rank Matrix
A matrix W β β^{mΓn} has rank r if it can be expressed as W = AB where A β β^{mΓr} and B β β^{rΓn} with r << min(m, n). The rank r represents the intrinsic dimensionality of the matrix.
Singular Value Decomposition (SVD)
Theory
Singular Value Decomposition
Here,
- =Original matrix (m Γ n)
- =Left singular vectors (m Γ m, orthogonal)
- =Singular values (m Γ n, diagonal)
- =Right singular vectors (n Γ n, orthogonal)
SVD decomposes any matrix into orthogonal rotations (U, V) and scaling (Ξ£). The singular values in Ξ£ are sorted in decreasing order, and the first r values capture the most important information.
Truncated SVD
Truncated SVD
Here,
- =First r columns of U
- =Top r singular values
- =First r columns of V
- =Target rank
Approximation Error
Here,
- =Frobenius norm
- =i-th singular value
- =Truncation rank
SVD Compression
Original weight matrix: 4096 Γ 4096 = 16,777,216 parameters
Low-rank factorization with r = 256:
- A: 4096 Γ 256 = 1,048,576 parameters
- B: 256 Γ 4096 = 1,048,576 parameters
- Total: 2,097,152 parameters (12.5% of original)
Compression ratio: 8Γ with minimal quality loss if the matrix is truly low-rank.
Implementation
import torch
import torch.nn as nn
import numpy as np
class SVDLinear(nn.Module):
"""Linear layer using SVD factorization."""
def __init__(self, in_features, out_features, rank=None):
super().__init__()
self.in_features = in_features
self.out_features = out_features
if rank is None:
rank = min(in_features, out_features) // 4
self.rank = rank
# Factorized weight: W = U @ diag(S) @ V^T
self.U = nn.Parameter(torch.randn(out_features, rank))
self.S = nn.Parameter(torch.ones(rank))
self.V = nn.Parameter(torch.randn(rank, in_features))
# Optional bias
self.bias = nn.Parameter(torch.zeros(out_features))
def forward(self, x):
# Reconstruct weight: W = U @ diag(S) @ V^T
weight = self.U @ torch.diag(self.S) @ self.V
return F.linear(x, weight, self.bias)
@classmethod
def from_linear(cls, linear, rank):
"""Create factorized layer from existing linear layer."""
W = linear.weight.data
# Compute SVD
U, S, Vh = torch.linalg.svd(W, full_matrices=False)
# Truncate to rank
U_r = U[:, :rank]
S_r = S[:rank]
V_r = Vh[:rank, :]
# Create factorized layer
factorized = cls(linear.in_features, linear.out_features, rank)
factorized.U.data = U_r
factorized.S.data = S_r
factorized.V.data = V_r
factorized.bias.data = linear.bias.data.clone()
return factorized
LoRA Theory
Low-Rank Adaptation
DfLoRA
LoRA (Low-Rank Adaptation, Hu et al., 2022) freezes the pre-trained model weights and injects trainable low-rank decomposition matrices into each layer of the transformer. This enables efficient fine-tuning with 0.1-1% of the original parameters.
LoRA Update
Here,
- =Original frozen weights (d Γ d)
- =Low-rank matrix (d Γ r)
- =Low-rank matrix (r Γ d)
- =LoRA rank (typically 8-64)
LoRA works because the weight updates during fine-tuning are typically low-rank. Instead of updating all dΒ² parameters, LoRA only trains 2dr parameters (r << d), achieving the same effect with far fewer parameters.
Why LoRA Works
class LoRALayer(nn.Module):
"""LoRA adaptation layer."""
def __init__(self, original_layer, rank=8, alpha=16):
super().__init__()
self.original_layer = original_layer
self.rank = rank
self.alpha = alpha
self.scaling = alpha / rank
d = original_layer.in_features
# Low-rank matrices
self.lora_A = nn.Parameter(torch.randn(rank, d) * 0.01)
self.lora_B = nn.Parameter(torch.zeros(d, rank))
# Freeze original weights
for param in self.original_layer.parameters():
param.requires_grad = False
def forward(self, x):
# Original forward pass
original_output = self.original_layer(x)
# LoRA contribution
lora_output = (x @ self.lora_A.T @ self.lora_B.T) * self.scaling
return original_output + lora_output
LoRA Rank Selection
LoRA Parameter Count
Here,
- =Model dimension
- =LoRA rank
- =Number of layers
| Model | d_model | n_layers | r=8 | r=16 | r=64 |
|---|---|---|---|---|---|
| LLaMA-7B | 4096 | 32 | 2M | 4M | 16M |
| LLaMA-13B | 5120 | 40 | 3.3M | 6.6M | 26M |
| LLaMA-70B | 8192 | 80 | 10.5M | 21M | 84M |
For most tasks, rank 8-16 provides good results. Higher ranks (32-64) are needed for complex tasks requiring significant adaptation, while lower ranks (4-8) suffice for simple fine-tuning.
Weight Sharing
Cross-Layer Weight Sharing
DfWeight Sharing
Weight sharing reuses weights across multiple layers, reducing the total number of parameters. This can be done through tying embeddings, sharing attention weights, or using the same weights across transformer blocks.
class WeightSharedTransformer(nn.Module):
"""Transformer with weight sharing across layers."""
def __init__(self, d_model, n_layers, n_heads):
super().__init__()
# Shared layers
self.shared_attention = MultiHeadAttention(d_model, n_heads)
self.shared_ffn = FeedForward(d_model)
self.shared_norm1 = nn.LayerNorm(d_model)
self.shared_norm2 = nn.LayerNorm(d_model)
# Layer-specific parameters (lightweight)
self.layer_params = nn.ModuleList([
nn.ParameterDict({
'scale1': nn.Parameter(torch.ones(d_model)),
'scale2': nn.Parameter(torch.ones(d_model)),
})
for _ in range(n_layers)
])
def forward(self, x):
for i, params in enumerate(self.layer_params):
# Shared computation with layer-specific scaling
residual = x
x = self.shared_norm1(x) * params['scale1']
x = self.shared_attention(x, x, x)
x = residual + x
residual = x
x = self.shared_norm2(x) * params['scale2']
x = self.shared_ffn(x)
x = residual + x
return x
Embedding Tying
DfTied Embeddings
Tied embeddings share the word embedding matrix with the output projection matrix. Since both maps between the vocabulary and hidden dimension, they can share weights, reducing parameters by VΓd.
Embedding Tying
Here,
- =Input embedding matrix (V Γ d)
- =Output projection (d Γ V)
class TiedEmbeddingTransformer(nn.Module):
"""Transformer with tied embeddings."""
def __init__(self, vocab_size, d_model):
super().__init__()
# Shared embedding
self.embedding = nn.Embedding(vocab_size, d_model)
# Output head reuses embedding weights
self.output_head = lambda x: F.linear(x, self.embedding.weight)
# Alternative: explicit weight tying
# self.output_proj = nn.Linear(d_model, vocab_size, bias=False)
# self.output_proj.weight = self.embedding.weight
def forward(self, x):
h = self.embedding(x)
logits = self.output_head(h)
return logits
GPT-2 and many modern LLMs tie input and output embeddings. This reduces parameters by VΓd (e.g., 50,000 Γ 4,096 = 204M parameters for a 4096-dim model) with minimal quality loss.
Randomized Factorization
Random SVD
DfRandomized SVD
Randomized SVD uses random projections to approximate the SVD of large matrices efficiently. It's much faster than exact SVD for computing low-rank approximations of large matrices.
def randomized_svd(M, rank, n_oversamples=10, n_iter=4):
"""Randomized SVD algorithm."""
m, n = M.shape
# Random projection
P = torch.randn(n, rank + n_oversamples)
Z = M @ P
# Power iteration for better approximation
for _ in range(n_iter):
Z = M @ (M.T @ Z)
# Compute QR decomposition
Q, _ = torch.linalg.qr(Z)
# Project and compute exact SVD
B = Q.T @ M
U_hat, S, Vt = torch.linalg.svd(B, full_matrices=False)
U = Q @ U_hat
return U[:, :rank], S[:rank], Vt[:rank, :]
NystrΓΆm Approximation
NystrΓΆm Approximation
Here,
- =Subset of columns/rows (m Γ m)
- =Cross terms (m Γ n)
Practical Applications
SVD for Model Compression
def compress_model_svd(model, compression_ratio=0.5):
"""Compress model using SVD on weight matrices."""
compressed = {}
for name, param in model.named_parameters():
if param.dim() >= 2:
# Compute SVD
U, S, Vh = torch.linalg.svd(param.data, full_matrices=False)
# Determine rank based on compression ratio
total_params = U.shape[0] * U.shape[1] + S.shape[0] + Vh.shape[0] * Vh.shape[1]
target_params = int(param.numel() * compression_ratio)
# Find optimal rank
rank = 1
while rank < min(S.shape):
current_params = U.shape[0] * rank + rank + rank * Vh.shape[1]
if current_params > target_params:
break
rank += 1
# Store factorized version
compressed[name] = {
'U': U[:, :rank],
'S': S[:rank],
'V': Vh[:rank, :],
'original_shape': param.shape
}
else:
compressed[name] = param.data
return compressed
LoRA Fine-Tuning
from peft import LoraConfig, get_peft_model, TaskType
def setup_lora_finetuning(model_name, rank=16):
"""Set up LoRA fine-tuning for a pre-trained model."""
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(model_name)
lora_config = LoraConfig(
task_type=TaskType.CAUSAL_LM,
r=rank,
lora_alpha=2 * rank,
lora_dropout=0.1,
target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
bias="none",
)
model = get_peft_model(model, lora_config)
# Print parameter statistics
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
total_params = sum(p.numel() for p in model.parameters())
print(f"Trainable: {trainable_params:,} ({100 * trainable_params / total_params:.2f}%)")
print(f"Total: {total_params:,}")
return model, lora_config
Low-Rank Analysis
Singular Value Spectrum
Most neural network weight matrices have rapidly decaying singular values, meaning they are approximately low-rank. For a 4096Γ4096 matrix, the top 256 singular values typically capture 95%+ of the total energy (sum of squared singular values).
def analyze_rank_spectrum(weight, n_components=50):
"""Analyze the singular value spectrum of a weight matrix."""
U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
# Compute cumulative energy
energy = S ** 2
cumulative_energy = torch.cumsum(energy, dim=0) / energy.sum()
# Find rank for different energy thresholds
thresholds = [0.9, 0.95, 0.99, 0.999]
ranks = {}
for threshold in thresholds:
rank = torch.searchsorted(cumulative_energy, threshold).item() + 1
ranks[f"{threshold:.1%}"] = rank
return ranks, S[:n_components]
Practice Exercises
-
Mathematical: For a dΓd weight matrix, compute the compression ratio of LoRA with rank r compared to the full matrix. What is the break-even point where LoRA uses more parameters?
-
Implementation: Implement truncated SVD compression on a transformer layer and measure the perplexity degradation at different ranks (r = 16, 32, 64, 128).
-
Analysis: Analyze the singular value spectrum of attention projection matrices in a pre-trained LLM. How low-rank are these matrices in practice?
-
Research: Compare LoRA with different rank selections across tasks (classification, generation, reasoning). Is there a universal optimal rank or does it depend on the task?
Key Takeaways:
- SVD provides the optimal low-rank approximation in the Frobenius norm
- LoRA trains only 0.1-1% of parameters by exploiting low-rank weight updates
- Weight sharing (tied embeddings, cross-layer sharing) reduces parameters significantly
- Randomized SVD enables efficient factorization of large matrices
- Most weight matrices are approximately low-rank (top 5-10% singular values capture 95%+ energy)
- LoRA rank 8-16 works well for most tasks; higher ranks for complex adaptation
- SVD compression and LoRA are complementary: SVD for inference, LoRA for training
What to Learn Next
-> LoRA and PEFT Efficient fine-tuning using low-rank adaptation.
-> Quantization Techniques Deep Dive GPTQ, AWQ, GGUF, and INT4/INT8 methods.
-> Pruning for LLMs Reducing model size through weight pruning.
-> Model Merging and Fusion Combining multiple fine-tuned models.
-> Mixture of Experts Sparse architectures that scale efficiently.
-> Knowledge Distillation for LLMs Training smaller models from larger teachers.