CW

Low-Rank Factorization

OptimizationMatrix FactorizationFree Lesson

Advertisement

Optimization

Low-Rank Factorization β€” Compressing Weights Through Structure

Large weight matrices contain significant redundancy. Low-rank factorization decomposes these matrices into products of smaller matrices, reducing parameters by 10-100Γ— while preserving the essential computation. This underpins LoRA, weight sharing, and efficient inference.

  • SVD Decomposition β€” Finding the optimal low-rank approximation
  • LoRA Theory β€” Why low-rank adaptation works for fine-tuning
  • Weight Sharing β€” Reusing parameters across layers
  • Practical Methods β€” SVD, truncated SVD, and randomized factorization

The best compression is finding the structure that was always there.

Low-Rank Factorization

Neural network weight matrices are often low-rank, meaning most of their information can be captured by a much smaller set of dimensions. Low-rank factorization exploits this structure to reduce parameters and computation.

DfLow-Rank Matrix

A matrix W ∈ ℝ^{mΓ—n} has rank r if it can be expressed as W = AB where A ∈ ℝ^{mΓ—r} and B ∈ ℝ^{rΓ—n} with r << min(m, n). The rank r represents the intrinsic dimensionality of the matrix.

Singular Value Decomposition (SVD)

Theory

Singular Value Decomposition

W=USigmaVTW = U \\Sigma V^T

Here,

  • WW=Original matrix (m Γ— n)
  • UU=Left singular vectors (m Γ— m, orthogonal)
  • Ξ£\Sigma=Singular values (m Γ— n, diagonal)
  • VV=Right singular vectors (n Γ— n, orthogonal)

SVD decomposes any matrix into orthogonal rotations (U, V) and scaling (Ξ£). The singular values in Ξ£ are sorted in decreasing order, and the first r values capture the most important information.

Truncated SVD

Truncated SVD

WapproxUrSigmarVrTW \\approx U_r \\Sigma_r V_r^T

Here,

  • UrU_r=First r columns of U
  • Ξ£r\Sigma_r=Top r singular values
  • VrV_r=First r columns of V
  • rr=Target rank

Approximation Error

∣Wβˆ’Wr∣F=sqrtsumi=r+1min(m,n)sigmai2\\|W - W_r\\|_F = \\sqrt{\\sum_{i=r+1}^{\\min(m,n)} \\sigma_i^2}

Here,

  • βˆ₯β‹…βˆ₯F\|\cdot\|_F=Frobenius norm
  • Οƒi\sigma_i=i-th singular value
  • rr=Truncation rank

SVD Compression

Original weight matrix: 4096 Γ— 4096 = 16,777,216 parameters

Low-rank factorization with r = 256:

  • A: 4096 Γ— 256 = 1,048,576 parameters
  • B: 256 Γ— 4096 = 1,048,576 parameters
  • Total: 2,097,152 parameters (12.5% of original)

Compression ratio: 8Γ— with minimal quality loss if the matrix is truly low-rank.

Implementation

import torch
import torch.nn as nn
import numpy as np

class SVDLinear(nn.Module):
    """Linear layer using SVD factorization."""
    
    def __init__(self, in_features, out_features, rank=None):
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features
        
        if rank is None:
            rank = min(in_features, out_features) // 4
        
        self.rank = rank
        
        # Factorized weight: W = U @ diag(S) @ V^T
        self.U = nn.Parameter(torch.randn(out_features, rank))
        self.S = nn.Parameter(torch.ones(rank))
        self.V = nn.Parameter(torch.randn(rank, in_features))
        
        # Optional bias
        self.bias = nn.Parameter(torch.zeros(out_features))
    
    def forward(self, x):
        # Reconstruct weight: W = U @ diag(S) @ V^T
        weight = self.U @ torch.diag(self.S) @ self.V
        return F.linear(x, weight, self.bias)
    
    @classmethod
    def from_linear(cls, linear, rank):
        """Create factorized layer from existing linear layer."""
        W = linear.weight.data
        
        # Compute SVD
        U, S, Vh = torch.linalg.svd(W, full_matrices=False)
        
        # Truncate to rank
        U_r = U[:, :rank]
        S_r = S[:rank]
        V_r = Vh[:rank, :]
        
        # Create factorized layer
        factorized = cls(linear.in_features, linear.out_features, rank)
        factorized.U.data = U_r
        factorized.S.data = S_r
        factorized.V.data = V_r
        factorized.bias.data = linear.bias.data.clone()
        
        return factorized

LoRA Theory

Low-Rank Adaptation

DfLoRA

LoRA (Low-Rank Adaptation, Hu et al., 2022) freezes the pre-trained model weights and injects trainable low-rank decomposition matrices into each layer of the transformer. This enables efficient fine-tuning with 0.1-1% of the original parameters.

LoRA Update

Wβ€²=W+DeltaW=W+BAW' = W + \\Delta W = W + BA

Here,

  • WW=Original frozen weights (d Γ— d)
  • BB=Low-rank matrix (d Γ— r)
  • AA=Low-rank matrix (r Γ— d)
  • rr=LoRA rank (typically 8-64)

LoRA works because the weight updates during fine-tuning are typically low-rank. Instead of updating all dΒ² parameters, LoRA only trains 2dr parameters (r << d), achieving the same effect with far fewer parameters.

Why LoRA Works

class LoRALayer(nn.Module):
    """LoRA adaptation layer."""
    
    def __init__(self, original_layer, rank=8, alpha=16):
        super().__init__()
        self.original_layer = original_layer
        self.rank = rank
        self.alpha = alpha
        self.scaling = alpha / rank
        
        d = original_layer.in_features
        
        # Low-rank matrices
        self.lora_A = nn.Parameter(torch.randn(rank, d) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(d, rank))
        
        # Freeze original weights
        for param in self.original_layer.parameters():
            param.requires_grad = False
    
    def forward(self, x):
        # Original forward pass
        original_output = self.original_layer(x)
        
        # LoRA contribution
        lora_output = (x @ self.lora_A.T @ self.lora_B.T) * self.scaling
        
        return original_output + lora_output

LoRA Rank Selection

LoRA Parameter Count

textLoRAparams=2cdotdtextmodelcdotrcdotntextlayers\\text{LoRA params} = 2 \\cdot d_{\\text{model}} \\cdot r \\cdot n_{\\text{layers}}

Here,

  • dmodeld_{\text{model}}=Model dimension
  • rr=LoRA rank
  • nlayersn_{\text{layers}}=Number of layers
Modeld_modeln_layersr=8r=16r=64
LLaMA-7B4096322M4M16M
LLaMA-13B5120403.3M6.6M26M
LLaMA-70B81928010.5M21M84M

For most tasks, rank 8-16 provides good results. Higher ranks (32-64) are needed for complex tasks requiring significant adaptation, while lower ranks (4-8) suffice for simple fine-tuning.

Weight Sharing

Cross-Layer Weight Sharing

DfWeight Sharing

Weight sharing reuses weights across multiple layers, reducing the total number of parameters. This can be done through tying embeddings, sharing attention weights, or using the same weights across transformer blocks.

class WeightSharedTransformer(nn.Module):
    """Transformer with weight sharing across layers."""
    
    def __init__(self, d_model, n_layers, n_heads):
        super().__init__()
        
        # Shared layers
        self.shared_attention = MultiHeadAttention(d_model, n_heads)
        self.shared_ffn = FeedForward(d_model)
        self.shared_norm1 = nn.LayerNorm(d_model)
        self.shared_norm2 = nn.LayerNorm(d_model)
        
        # Layer-specific parameters (lightweight)
        self.layer_params = nn.ModuleList([
            nn.ParameterDict({
                'scale1': nn.Parameter(torch.ones(d_model)),
                'scale2': nn.Parameter(torch.ones(d_model)),
            })
            for _ in range(n_layers)
        ])
    
    def forward(self, x):
        for i, params in enumerate(self.layer_params):
            # Shared computation with layer-specific scaling
            residual = x
            x = self.shared_norm1(x) * params['scale1']
            x = self.shared_attention(x, x, x)
            x = residual + x
            
            residual = x
            x = self.shared_norm2(x) * params['scale2']
            x = self.shared_ffn(x)
            x = residual + x
        
        return x

Embedding Tying

DfTied Embeddings

Tied embeddings share the word embedding matrix with the output projection matrix. Since both maps between the vocabulary and hidden dimension, they can share weights, reducing parameters by VΓ—d.

Embedding Tying

Wtextoutput=WtextembeddingTW_{\\text{output}} = W_{\\text{embedding}}^T

Here,

  • WembeddingW_{\text{embedding}}=Input embedding matrix (V Γ— d)
  • WoutputW_{\text{output}}=Output projection (d Γ— V)
class TiedEmbeddingTransformer(nn.Module):
    """Transformer with tied embeddings."""
    
    def __init__(self, vocab_size, d_model):
        super().__init__()
        
        # Shared embedding
        self.embedding = nn.Embedding(vocab_size, d_model)
        
        # Output head reuses embedding weights
        self.output_head = lambda x: F.linear(x, self.embedding.weight)
        
        # Alternative: explicit weight tying
        # self.output_proj = nn.Linear(d_model, vocab_size, bias=False)
        # self.output_proj.weight = self.embedding.weight
    
    def forward(self, x):
        h = self.embedding(x)
        logits = self.output_head(h)
        return logits

GPT-2 and many modern LLMs tie input and output embeddings. This reduces parameters by VΓ—d (e.g., 50,000 Γ— 4,096 = 204M parameters for a 4096-dim model) with minimal quality loss.

Randomized Factorization

Random SVD

DfRandomized SVD

Randomized SVD uses random projections to approximate the SVD of large matrices efficiently. It's much faster than exact SVD for computing low-rank approximations of large matrices.

def randomized_svd(M, rank, n_oversamples=10, n_iter=4):
    """Randomized SVD algorithm."""
    m, n = M.shape
    
    # Random projection
    P = torch.randn(n, rank + n_oversamples)
    Z = M @ P
    
    # Power iteration for better approximation
    for _ in range(n_iter):
        Z = M @ (M.T @ Z)
    
    # Compute QR decomposition
    Q, _ = torch.linalg.qr(Z)
    
    # Project and compute exact SVD
    B = Q.T @ M
    U_hat, S, Vt = torch.linalg.svd(B, full_matrices=False)
    
    U = Q @ U_hat
    
    return U[:, :rank], S[:rank], Vt[:rank, :]

NystrΓΆm Approximation

NystrΓΆm Approximation

WapproxWmmWmnT(Wmm)βˆ’1WmnW \\approx W_{mm} W_{mn}^T (W_{mm})^{-1} W_{mn}

Here,

  • WmmW_{mm}=Subset of columns/rows (m Γ— m)
  • WmnW_{mn}=Cross terms (m Γ— n)

Practical Applications

SVD for Model Compression

def compress_model_svd(model, compression_ratio=0.5):
    """Compress model using SVD on weight matrices."""
    compressed = {}
    
    for name, param in model.named_parameters():
        if param.dim() >= 2:
            # Compute SVD
            U, S, Vh = torch.linalg.svd(param.data, full_matrices=False)
            
            # Determine rank based on compression ratio
            total_params = U.shape[0] * U.shape[1] + S.shape[0] + Vh.shape[0] * Vh.shape[1]
            target_params = int(param.numel() * compression_ratio)
            
            # Find optimal rank
            rank = 1
            while rank < min(S.shape):
                current_params = U.shape[0] * rank + rank + rank * Vh.shape[1]
                if current_params > target_params:
                    break
                rank += 1
            
            # Store factorized version
            compressed[name] = {
                'U': U[:, :rank],
                'S': S[:rank],
                'V': Vh[:rank, :],
                'original_shape': param.shape
            }
        else:
            compressed[name] = param.data
    
    return compressed

LoRA Fine-Tuning

from peft import LoraConfig, get_peft_model, TaskType

def setup_lora_finetuning(model_name, rank=16):
    """Set up LoRA fine-tuning for a pre-trained model."""
    from transformers import AutoModelForCausalLM
    
    model = AutoModelForCausalLM.from_pretrained(model_name)
    
    lora_config = LoraConfig(
        task_type=TaskType.CAUSAL_LM,
        r=rank,
        lora_alpha=2 * rank,
        lora_dropout=0.1,
        target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
        bias="none",
    )
    
    model = get_peft_model(model, lora_config)
    
    # Print parameter statistics
    trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total_params = sum(p.numel() for p in model.parameters())
    
    print(f"Trainable: {trainable_params:,} ({100 * trainable_params / total_params:.2f}%)")
    print(f"Total: {total_params:,}")
    
    return model, lora_config

Low-Rank Analysis

Singular Value Spectrum

Most neural network weight matrices have rapidly decaying singular values, meaning they are approximately low-rank. For a 4096Γ—4096 matrix, the top 256 singular values typically capture 95%+ of the total energy (sum of squared singular values).

def analyze_rank_spectrum(weight, n_components=50):
    """Analyze the singular value spectrum of a weight matrix."""
    U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
    
    # Compute cumulative energy
    energy = S ** 2
    cumulative_energy = torch.cumsum(energy, dim=0) / energy.sum()
    
    # Find rank for different energy thresholds
    thresholds = [0.9, 0.95, 0.99, 0.999]
    ranks = {}
    
    for threshold in thresholds:
        rank = torch.searchsorted(cumulative_energy, threshold).item() + 1
        ranks[f"{threshold:.1%}"] = rank
    
    return ranks, S[:n_components]

Practice Exercises

  1. Mathematical: For a dΓ—d weight matrix, compute the compression ratio of LoRA with rank r compared to the full matrix. What is the break-even point where LoRA uses more parameters?

  2. Implementation: Implement truncated SVD compression on a transformer layer and measure the perplexity degradation at different ranks (r = 16, 32, 64, 128).

  3. Analysis: Analyze the singular value spectrum of attention projection matrices in a pre-trained LLM. How low-rank are these matrices in practice?

  4. Research: Compare LoRA with different rank selections across tasks (classification, generation, reasoning). Is there a universal optimal rank or does it depend on the task?

Key Takeaways:

  • SVD provides the optimal low-rank approximation in the Frobenius norm
  • LoRA trains only 0.1-1% of parameters by exploiting low-rank weight updates
  • Weight sharing (tied embeddings, cross-layer sharing) reduces parameters significantly
  • Randomized SVD enables efficient factorization of large matrices
  • Most weight matrices are approximately low-rank (top 5-10% singular values capture 95%+ energy)
  • LoRA rank 8-16 works well for most tasks; higher ranks for complex adaptation
  • SVD compression and LoRA are complementary: SVD for inference, LoRA for training

What to Learn Next

-> LoRA and PEFT Efficient fine-tuning using low-rank adaptation.

-> Quantization Techniques Deep Dive GPTQ, AWQ, GGUF, and INT4/INT8 methods.

-> Pruning for LLMs Reducing model size through weight pruning.

-> Model Merging and Fusion Combining multiple fine-tuned models.

-> Mixture of Experts Sparse architectures that scale efficiently.

-> Knowledge Distillation for LLMs Training smaller models from larger teachers.

Advertisement

Need Expert LLM Help?

Get personalized tutoring, RAG system design, or production LLM consulting.

Advertisement