Optimization

Low-Rank Factorization — Compressing Weights Through Structure

Large weight matrices contain significant redundancy. Low-rank factorization decomposes these matrices into products of smaller matrices, reducing parameters by 10-100× while preserving the essential computation. This underpins LoRA, weight sharing, and efficient inference.

SVD Decomposition — Finding the optimal low-rank approximation
LoRA Theory — Why low-rank adaptation works for fine-tuning
Weight Sharing — Reusing parameters across layers
Practical Methods — SVD, truncated SVD, and randomized factorization

The best compression is finding the structure that was always there.

Low-Rank Factorization

Neural network weight matrices are often low-rank, meaning most of their information can be captured by a much smaller set of dimensions. Low-rank factorization exploits this structure to reduce parameters and computation.

DfLow-Rank Matrix

A matrix W ∈ ℝ^{m×n} has rank r if it can be expressed as W = AB where A ∈ ℝ^{m×r} and B ∈ ℝ^{r×n} with r << min(m, n). The rank r represents the intrinsic dimensionality of the matrix.

Singular Value Decomposition (SVD)

Theory

Singular Value Decomposition

W = U \\Sigma V^T

Here,

$W$ =Original matrix (m × n)
$U$ =Left singular vectors (m × m, orthogonal)
$\Sigma$ =Singular values (m × n, diagonal)
$V$ =Right singular vectors (n × n, orthogonal)

SVD decomposes any matrix into orthogonal rotations (U, V) and scaling (Σ). The singular values in Σ are sorted in decreasing order, and the first r values capture the most important information.

Truncated SVD

W \\approx U_r \\Sigma_r V_r^T

Here,

$U_r$ =First r columns of U
$\Sigma_r$ =Top r singular values
$V_r$ =First r columns of V
$r$ =Target rank

Approximation Error

\\|W - W_r\\|_F = \\sqrt{\\sum_{i=r+1}^{\\min(m,n)} \\sigma_i^2}

Here,

$\|\cdot\|_F$ =Frobenius norm
$\sigma_i$ =i-th singular value
$r$ =Truncation rank

SVD Compression

Original weight matrix: 4096 × 4096 = 16,777,216 parameters

Low-rank factorization with r = 256:

A: 4096 × 256 = 1,048,576 parameters
B: 256 × 4096 = 1,048,576 parameters
Total: 2,097,152 parameters (12.5% of original)

Compression ratio: 8× with minimal quality loss if the matrix is truly low-rank.

Implementation

import torch
import torch.nn as nn
import numpy as np

class SVDLinear(nn.Module):
    """Linear layer using SVD factorization."""
    
    def __init__(self, in_features, out_features, rank=None):
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features
        
        if rank is None:
            rank = min(in_features, out_features) // 4
        
        self.rank = rank
        
        # Factorized weight: W = U @ diag(S) @ V^T
        self.U = nn.Parameter(torch.randn(out_features, rank))
        self.S = nn.Parameter(torch.ones(rank))
        self.V = nn.Parameter(torch.randn(rank, in_features))
        
        # Optional bias
        self.bias = nn.Parameter(torch.zeros(out_features))
    
    def forward(self, x):
        # Reconstruct weight: W = U @ diag(S) @ V^T
        weight = self.U @ torch.diag(self.S) @ self.V
        return F.linear(x, weight, self.bias)
    
    @classmethod
    def from_linear(cls, linear, rank):
        """Create factorized layer from existing linear layer."""
        W = linear.weight.data
        
        # Compute SVD
        U, S, Vh = torch.linalg.svd(W, full_matrices=False)
        
        # Truncate to rank
        U_r = U[:, :rank]
        S_r = S[:rank]
        V_r = Vh[:rank, :]
        
        # Create factorized layer
        factorized = cls(linear.in_features, linear.out_features, rank)
        factorized.U.data = U_r
        factorized.S.data = S_r
        factorized.V.data = V_r
        factorized.bias.data = linear.bias.data.clone()
        
        return factorized

LoRA Theory

Low-Rank Adaptation

DfLoRA

LoRA (Low-Rank Adaptation, Hu et al., 2022) freezes the pre-trained model weights and injects trainable low-rank decomposition matrices into each layer of the transformer. This enables efficient fine-tuning with 0.1-1% of the original parameters.

LoRA Update

W' = W + \\Delta W = W + BA

Here,

$W$ =Original frozen weights (d × d)
$B$ =Low-rank matrix (d × r)
$A$ =Low-rank matrix (r × d)
$r$ =LoRA rank (typically 8-64)

LoRA works because the weight updates during fine-tuning are typically low-rank. Instead of updating all d² parameters, LoRA only trains 2dr parameters (r << d), achieving the same effect with far fewer parameters.

Why LoRA Works

class LoRALayer(nn.Module):
    """LoRA adaptation layer."""
    
    def __init__(self, original_layer, rank=8, alpha=16):
        super().__init__()
        self.original_layer = original_layer
        self.rank = rank
        self.alpha = alpha
        self.scaling = alpha / rank
        
        d = original_layer.in_features
        
        # Low-rank matrices
        self.lora_A = nn.Parameter(torch.randn(rank, d) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(d, rank))
        
        # Freeze original weights
        for param in self.original_layer.parameters():
            param.requires_grad = False
    
    def forward(self, x):
        # Original forward pass
        original_output = self.original_layer(x)
        
        # LoRA contribution
        lora_output = (x @ self.lora_A.T @ self.lora_B.T) * self.scaling
        
        return original_output + lora_output

LoRA Rank Selection

LoRA Parameter Count

\\text{LoRA params} = 2 \\cdot d_{\\text{model}} \\cdot r \\cdot n_{\\text{layers}}

Here,

$d_{\text{model}}$ =Model dimension
$r$ =LoRA rank
$n_{\text{layers}}$ =Number of layers

Model	d_model	n_layers	r=8	r=16	r=64
LLaMA-7B	4096	32	2M	4M	16M
LLaMA-13B	5120	40	3.3M	6.6M	26M
LLaMA-70B	8192	80	10.5M	21M	84M

For most tasks, rank 8-16 provides good results. Higher ranks (32-64) are needed for complex tasks requiring significant adaptation, while lower ranks (4-8) suffice for simple fine-tuning.

Weight Sharing

Cross-Layer Weight Sharing

DfWeight Sharing

Weight sharing reuses weights across multiple layers, reducing the total number of parameters. This can be done through tying embeddings, sharing attention weights, or using the same weights across transformer blocks.

class WeightSharedTransformer(nn.Module):
    """Transformer with weight sharing across layers."""
    
    def __init__(self, d_model, n_layers, n_heads):
        super().__init__()
        
        # Shared layers
        self.shared_attention = MultiHeadAttention(d_model, n_heads)
        self.shared_ffn = FeedForward(d_model)
        self.shared_norm1 = nn.LayerNorm(d_model)
        self.shared_norm2 = nn.LayerNorm(d_model)
        
        # Layer-specific parameters (lightweight)
        self.layer_params = nn.ModuleList([
            nn.ParameterDict({
                'scale1': nn.Parameter(torch.ones(d_model)),
                'scale2': nn.Parameter(torch.ones(d_model)),
            })
            for _ in range(n_layers)
        ])
    
    def forward(self, x):
        for i, params in enumerate(self.layer_params):
            # Shared computation with layer-specific scaling
            residual = x
            x = self.shared_norm1(x) * params['scale1']
            x = self.shared_attention(x, x, x)
            x = residual + x
            
            residual = x
            x = self.shared_norm2(x) * params['scale2']
            x = self.shared_ffn(x)
            x = residual + x
        
        return x

Embedding Tying

DfTied Embeddings

Tied embeddings share the word embedding matrix with the output projection matrix. Since both maps between the vocabulary and hidden dimension, they can share weights, reducing parameters by V×d.

Embedding Tying

W_{\\text{output}} = W_{\\text{embedding}}^T

Here,

$W_{\text{embedding}}$ =Input embedding matrix (V × d)
$W_{\text{output}}$ =Output projection (d × V)

class TiedEmbeddingTransformer(nn.Module):
    """Transformer with tied embeddings."""
    
    def __init__(self, vocab_size, d_model):
        super().__init__()
        
        # Shared embedding
        self.embedding = nn.Embedding(vocab_size, d_model)
        
        # Output head reuses embedding weights
        self.output_head = lambda x: F.linear(x, self.embedding.weight)
        
        # Alternative: explicit weight tying
        # self.output_proj = nn.Linear(d_model, vocab_size, bias=False)
        # self.output_proj.weight = self.embedding.weight
    
    def forward(self, x):
        h = self.embedding(x)
        logits = self.output_head(h)
        return logits

GPT-2 and many modern LLMs tie input and output embeddings. This reduces parameters by V×d (e.g., 50,000 × 4,096 = 204M parameters for a 4096-dim model) with minimal quality loss.

Randomized Factorization

Random SVD

DfRandomized SVD

Randomized SVD uses random projections to approximate the SVD of large matrices efficiently. It's much faster than exact SVD for computing low-rank approximations of large matrices.

def randomized_svd(M, rank, n_oversamples=10, n_iter=4):
    """Randomized SVD algorithm."""
    m, n = M.shape
    
    # Random projection
    P = torch.randn(n, rank + n_oversamples)
    Z = M @ P
    
    # Power iteration for better approximation
    for _ in range(n_iter):
        Z = M @ (M.T @ Z)
    
    # Compute QR decomposition
    Q, _ = torch.linalg.qr(Z)
    
    # Project and compute exact SVD
    B = Q.T @ M
    U_hat, S, Vt = torch.linalg.svd(B, full_matrices=False)
    
    U = Q @ U_hat
    
    return U[:, :rank], S[:rank], Vt[:rank, :]

Nyström Approximation

W \\approx W_{mm} W_{mn}^T (W_{mm})^{-1} W_{mn}

Here,

$W_{mm}$ =Subset of columns/rows (m × m)
$W_{mn}$ =Cross terms (m × n)

Practical Applications

SVD for Model Compression

def compress_model_svd(model, compression_ratio=0.5):
    """Compress model using SVD on weight matrices."""
    compressed = {}
    
    for name, param in model.named_parameters():
        if param.dim() >= 2:
            # Compute SVD
            U, S, Vh = torch.linalg.svd(param.data, full_matrices=False)
            
            # Determine rank based on compression ratio
            total_params = U.shape[0] * U.shape[1] + S.shape[0] + Vh.shape[0] * Vh.shape[1]
            target_params = int(param.numel() * compression_ratio)
            
            # Find optimal rank
            rank = 1
            while rank < min(S.shape):
                current_params = U.shape[0] * rank + rank + rank * Vh.shape[1]
                if current_params > target_params:
                    break
                rank += 1
            
            # Store factorized version
            compressed[name] = {
                'U': U[:, :rank],
                'S': S[:rank],
                'V': Vh[:rank, :],
                'original_shape': param.shape
            }
        else:
            compressed[name] = param.data
    
    return compressed

LoRA Fine-Tuning

from peft import LoraConfig, get_peft_model, TaskType

def setup_lora_finetuning(model_name, rank=16):
    """Set up LoRA fine-tuning for a pre-trained model."""
    from transformers import AutoModelForCausalLM
    
    model = AutoModelForCausalLM.from_pretrained(model_name)
    
    lora_config = LoraConfig(
        task_type=TaskType.CAUSAL_LM,
        r=rank,
        lora_alpha=2 * rank,
        lora_dropout=0.1,
        target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
        bias="none",
    )
    
    model = get_peft_model(model, lora_config)
    
    # Print parameter statistics
    trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total_params = sum(p.numel() for p in model.parameters())
    
    print(f"Trainable: {trainable_params:,} ({100 * trainable_params / total_params:.2f}%)")
    print(f"Total: {total_params:,}")
    
    return model, lora_config

Low-Rank Analysis

Singular Value Spectrum

Most neural network weight matrices have rapidly decaying singular values, meaning they are approximately low-rank. For a 4096×4096 matrix, the top 256 singular values typically capture 95%+ of the total energy (sum of squared singular values).

def analyze_rank_spectrum(weight, n_components=50):
    """Analyze the singular value spectrum of a weight matrix."""
    U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
    
    # Compute cumulative energy
    energy = S ** 2
    cumulative_energy = torch.cumsum(energy, dim=0) / energy.sum()
    
    # Find rank for different energy thresholds
    thresholds = [0.9, 0.95, 0.99, 0.999]
    ranks = {}
    
    for threshold in thresholds:
        rank = torch.searchsorted(cumulative_energy, threshold).item() + 1
        ranks[f"{threshold:.1%}"] = rank
    
    return ranks, S[:n_components]

Practice Exercises

Mathematical: For a d×d weight matrix, compute the compression ratio of LoRA with rank r compared to the full matrix. What is the break-even point where LoRA uses more parameters?
Implementation: Implement truncated SVD compression on a transformer layer and measure the perplexity degradation at different ranks (r = 16, 32, 64, 128).
Analysis: Analyze the singular value spectrum of attention projection matrices in a pre-trained LLM. How low-rank are these matrices in practice?
Research: Compare LoRA with different rank selections across tasks (classification, generation, reasoning). Is there a universal optimal rank or does it depend on the task?

Key Takeaways:

SVD provides the optimal low-rank approximation in the Frobenius norm
LoRA trains only 0.1-1% of parameters by exploiting low-rank weight updates
Weight sharing (tied embeddings, cross-layer sharing) reduces parameters significantly
Randomized SVD enables efficient factorization of large matrices
Most weight matrices are approximately low-rank (top 5-10% singular values capture 95%+ energy)
LoRA rank 8-16 works well for most tasks; higher ranks for complex adaptation
SVD compression and LoRA are complementary: SVD for inference, LoRA for training

What to Learn Next

-> LoRA and PEFT Efficient fine-tuning using low-rank adaptation.

-> Quantization Techniques Deep Dive GPTQ, AWQ, GGUF, and INT4/INT8 methods.

-> Pruning for LLMs Reducing model size through weight pruning.

-> Model Merging and Fusion Combining multiple fine-tuned models.

-> Mixture of Experts Sparse architectures that scale efficiently.

-> Knowledge Distillation for LLMs Training smaller models from larger teachers.

Low-Rank Factorization

Low-Rank Factorization — Compressing Weights Through Structure

Low-Rank Factorization

DfLow-Rank Matrix

Singular Value Decomposition (SVD)

Theory

Singular Value Decomposition

Truncated SVD

Truncated SVD

Approximation Error

SVD Compression

Implementation

LoRA Theory

Low-Rank Adaptation

DfLoRA

LoRA Update

Why LoRA Works

LoRA Rank Selection

LoRA Parameter Count

Weight Sharing

Cross-Layer Weight Sharing

DfWeight Sharing

Embedding Tying

DfTied Embeddings

Embedding Tying

Randomized Factorization

Random SVD

DfRandomized SVD

Nyström Approximation

Nyström Approximation

Practical Applications

SVD for Model Compression

LoRA Fine-Tuning

Low-Rank Analysis

Singular Value Spectrum

Practice Exercises

What to Learn Next

Need Expert LLM Help?