LoRA and PEFT
Parameter-Efficient Fine-Tuning (PEFT) methods enable adaptation of large language models by updating only a small subset of parameters. This tutorial provides a rigorous treatment of LoRA, its variants, and practical implementation.
Definition: Parameter-Efficient Fine-Tuning (PEFT)
PEFT is a family of techniques that fine-tune large models by updating only a small number of parameters while the rest stay frozen. The observation that motivates LoRA in particular is that the weight updates learned during fine-tuning tend to have low intrinsic dimensionality, so they can be approximated well by a product of low-rank matrices.
LoRA: Low-Rank Adaptation
LoRA (Hu et al., 2021) is the most widely used PEFT method. It freezes the pre-trained weights and injects trainable low-rank decomposition matrices into each layer of the Transformer.
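Definition: LoRA
LoRA keeps the pre-trained weight W_0 frozen and represents the weight update as the product of two low-rank matrices B and A, so the adapted weight is W = W_0 + BA, where B is d x r and A is r x d, with r << d. Only B and A are trained, which is what makes the update parameter-efficient.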
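LoRA Decomposition
$$W = W_0 + \Delta W, \quad \Delta W = BA, \quad B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times d},\; r \ll d$$
The forward pass with LoRA becomes:
LoRA Forward Pass
$$h = W_0 x + \frac{\alpha}{r} B A x$$
Here,
- $W_0$ = Frozen pre-trained weight
- $B, A$ = LoRA matrices
- $\alpha$ = Scaling factor (typically 16 or 32)
- $r$ = Rank
- $x$ = Input activation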
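The following is a minimal NumPy sketch, assuming the forward pass has the usual LoRA form h = W_0 x + (alpha / r) B A x. It checks numerically that running the adapter as a separate low-rank path gives the same output as folding the scaled update into one dense matrix, which is what merging an adapter relies on. The tiny sizes and random values are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 8, 2, 16          # tiny sizes, for illustration only

W0 = rng.normal(size=(d, d))    # frozen pre-trained weight
A = rng.normal(size=(r, d))     # trainable, r x d
B = rng.normal(size=(d, r))     # trainable, d x r (nonzero, as if after training)
x = rng.normal(size=(d,))       # input activation

scaling = alpha / r

# Adapter form: base path plus scaled low-rank path
h_adapter = W0 @ x + scaling * (B @ (A @ x))

# Merged form: fold the scaled update into a single dense matrix
W_merged = W0 + scaling * (B @ A)
h_merged = W_merged @ x

print("outputs match:", np.allclose(h_adapter, h_merged))
print("rank of BA:", np.linalg.matrix_rank(B @ A), "(at most r =", r, ")")
```

Computing `B @ (A @ x)` rather than `(B @ A) @ x` keeps the adapter path at O(r d) work per input instead of O(d^2), which is why the unmerged form stays cheap during training.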
📝 LoRA Parameter Calculation
For a 7B model with d_model = 4096, 32 layers, rank r = 16, applying LoRA to Q and V:
- Full parameters: 7B
- LoRA Q params: 2 x 16 x 4096 = 131,072 per layer (A and B together), about 4.2M over 32 layers
- LoRA V params: 2 x 16 x 4096 = 131,072 per layer, about 4.2M over 32 layers
- Total LoRA: about 8.4M params (0.12% of the full model)
This typically achieves 95%+ of full fine-tuning performance while training well under 1% of the parameters.
Parameter Efficiency
LoRA Parameter Count
$$N_{\text{LoRA}} = 2 \cdot r \cdot d \cdot L \cdot m$$
Here,
- $r$ = Rank
- $d$ = Model dimension
- $L$ = Number of layers
- $m$ = Number of LoRA matrices per layer (typically 2-4)
For a 7B model with r=16, applying LoRA to Q and V projections:
- Full parameters: 7B
- LoRA parameters: 2 x 16 x 4096 x 32 x 2 = 8.4M (0.12% of total)
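The arithmetic above can be reproduced with a few lines of Python. This is a small sketch that assumes every adapted matrix is square (d x d), as in the worked example; the function name `lora_param_count` is hypothetical.

```python
def lora_param_count(r, d, n_layers, n_matrices):
    """Trainable LoRA parameters, assuming square d x d target matrices."""
    per_matrix = 2 * r * d      # A is r x d and B is d x r
    return per_matrix * n_layers * n_matrices


total = lora_param_count(r=16, d=4096, n_layers=32, n_matrices=2)
fraction = total / 7e9
print(f"{total:,} trainable parameters ({fraction:.2%} of 7B)")
# 8,388,608 trainable parameters (0.12% of 7B)
```

Because the count is linear in r, doubling the rank doubles the adapter size, which is one reason to start small and increase the rank only when needed.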
LoRA rank r is a critical hyperparameter. Start with r=8 or r=16; increase to 32-64 only if performance is insufficient. Higher rank does not always mean better performance.
Which Layers to Adapt?
Common strategies for applying LoRA:
- Q, V projections (default): Most impactful, fewest parameters
- Q, K, V, O projections: Full attention adaptation
- Q, K, V, O + FFN: Maximum expressiveness
- All linear layers: Maximum coverage
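To make the trade-off between these strategies concrete, the sketch below counts LoRA parameters for three of them. The module names and shapes are assumptions modelled on a Llama-2-7B-style decoder layer (hidden size 4096, feed-forward size 11008); other architectures use different names and dimensions.

```python
# Assumed (in_features, out_features) per module for a Llama-2-7B-like layer
SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 4096),
    "v_proj": (4096, 4096),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 11008),
    "up_proj": (4096, 11008),
    "down_proj": (11008, 4096),
}

STRATEGIES = {
    "Q, V": ["q_proj", "v_proj"],
    "Q, K, V, O": ["q_proj", "k_proj", "v_proj", "o_proj"],
    "Q, K, V, O + FFN": list(SHAPES),
}


def lora_params(modules, r=16, n_layers=32):
    # Each adapted module adds A (r x d_in) and B (d_out x r)
    per_layer = sum(r * (SHAPES[m][0] + SHAPES[m][1]) for m in modules)
    return per_layer * n_layers


for name, modules in STRATEGIES.items():
    print(f"{name:>18}: {lora_params(modules):>12,}")
```

In a decoder of this shape, "all linear layers" covers essentially the same modules as the last strategy, so the two differ mainly in how the targets are specified.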
LoRA Initialization
The initialization strategy is crucial for LoRA performance:
```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Wraps a frozen nn.Linear and adds a trainable low-rank update."""

    def __init__(self, original_linear, r=8, alpha=16):
        super().__init__()
        self.original = original_linear
        # Freeze the pre-trained weight (and the bias, if the layer has one)
        self.original.weight.requires_grad = False
        if self.original.bias is not None:
            self.original.bias.requires_grad = False

        d_out, d_in = original_linear.weight.shape
        # A: small Gaussian init; B: zeros, so the initial update BA is zero
        self.lora_A = nn.Parameter(torch.randn(r, d_in) * (1 / d_in ** 0.5))
        self.lora_B = nn.Parameter(torch.zeros(d_out, r))
        self.scaling = alpha / r

    def forward(self, x):
        base_out = self.original(x)
        lora_out = (x @ self.lora_A.T @ self.lora_B.T) * self.scaling
        return base_out + lora_out
```
B is initialized to zeros so that LoRA starts with zero update (the model behaves identically to the pre-trained model at initialization). A uses Kaiming or Gaussian initialization.
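A short NumPy sketch can illustrate why this asymmetric initialization works. It assumes a toy loss L = sum(h) and hand-derived gradients, purely for illustration: with B = 0 the output matches the base model, the gradient with respect to B is nonzero so training can start, and the gradient with respect to A is zero until B moves away from zero.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 6, 4, 2, 16
scaling = alpha / r

W0 = rng.normal(size=(d_out, d_in))               # frozen base weight
A = rng.normal(size=(r, d_in)) / np.sqrt(d_in)    # Gaussian init
B = np.zeros((d_out, r))                          # zero init
x = rng.normal(size=(d_in,))

h = W0 @ x + scaling * (B @ (A @ x))
same_as_base = np.allclose(h, W0 @ x)

# Toy loss L = sum(h), so dL/dh is a vector of ones
g = np.ones(d_out)
grad_B = scaling * np.outer(g, A @ x)     # nonzero: B can start learning
grad_A = scaling * np.outer(B.T @ g, x)   # zero while B is still zero

print("output equals base model:", same_as_base)
print("grad_B has nonzero entries:", bool(np.any(grad_B != 0)))
print("grad_A is all zeros:", bool(np.all(grad_A == 0)))
```

If both matrices were initialized to zero, both gradients would vanish and the adapter could never leave its starting point, which is why only one of the two is zeroed.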
QLoRA: Quantized LoRA
QLoRA (Dettmers et al., 2023) combines 4-bit quantization with LoRA, enabling fine-tuning of large models on consumer GPUs.
Definition: QLoRA
QLoRA quantizes the pre-trained model to 4-bit precision (NF4 format) and applies LoRA adapters in 16-bit precision (FP16 here; the original paper computes in BF16). The quantized base weights stay frozen, and only the LoRA matrices are trained.
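A configuration along these lines is one common way to set this up with the HuggingFace stack. It is a sketch, not a definitive recipe: it assumes `transformers`, `peft`, and `bitsandbytes` are installed, that a CUDA GPU is available, and that you have access to the example checkpoint. The rank, alpha, and target modules are illustrative choices.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit NF4 storage for the frozen base weights
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    # 16-bit compute dtype; torch.float16 is the alternative on GPUs without BF16
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
# Prepares the quantized model for training (for example, casting norm layers)
model = prepare_model_for_kbit_training(model)

# Illustrative adapter settings, assumed rather than prescribed
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
```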
Memory Savings
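QLoRA Memory
$$M_{\text{QLoRA}} \approx \frac{M_{\text{FP16}}}{4} + M_{\text{LoRA}} + M_{\text{opt}}$$
Here,
- $M_{\text{FP16}}$ = Full FP16 model memory
- $M_{\text{LoRA}}$ = LoRA adapter memory (small)
- $M_{\text{opt}}$ = Adam states for LoRA params only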
| Model Size | FP16 Memory | QLoRA Memory | Can Fit on |
|---|---|---|---|
| 7B | 14 GB | ~4.5 GB | RTX 3060 (12GB) |
| 13B | 26 GB | ~8 GB | RTX 3090 (24GB) |
| 70B | 140 GB | ~36 GB | A100 (80GB) |
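A rough estimate of the weight and optimizer memory can be sketched as follows. The assumptions are mine: 4-bit base weights take a quarter of their FP16 size, adapters are stored in 16 bits, Adam keeps two FP32 moments per adapter weight, and the adapter size is held at 8.4M parameters for every model size to keep the example simple. The table's figures are higher because real runs also hold quantization constants, activations, and framework overhead.

```python
GB = 1e9  # decimal gigabytes, matching "7B params x 2 bytes = 14 GB"


def qlora_memory_gb(n_base_params, n_lora_params):
    base_4bit = n_base_params * 0.5     # 4 bits = 0.5 bytes per weight
    adapters = n_lora_params * 2        # 16-bit adapter weights
    adam = n_lora_params * 2 * 4        # two FP32 Adam moments per adapter weight
    return (base_4bit + adapters + adam) / GB


for name, n in [("7B", 7e9), ("13B", 13e9), ("70B", 70e9)]:
    print(f"{name}: about {qlora_memory_gb(n, 8.4e6):.2f} GB before activations")
```

The base weights dominate by a wide margin, so the choice of rank has little effect on total memory compared with the choice of quantization.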
NF4 Quantization
Definition: NormalFloat 4-bit (NF4)
NF4 is a 4-bit data type designed for normally distributed weights. It uses quantile-based quantization: the 16 levels are placed so that each quantization bin is expected to hold an equal number of values. For neural network weights, which are roughly normally distributed, this gives lower quantization error than uniform INT4 at the same bit width.
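The idea of quantile-based levels can be sketched with the standard library alone. This is a simplified construction for illustration, not the exact NF4 code table used by real implementations: it takes the midpoints of 16 equal-probability bins of a standard normal, rescales them to [-1, 1], and quantizes a block of weights after absmax scaling. All function names are hypothetical.

```python
from statistics import NormalDist


def quantile_levels(bits=4):
    """Levels at the midpoints of 2**bits equal-probability bins of N(0, 1)."""
    n = 2 ** bits
    normal = NormalDist()
    qs = [normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    largest = max(abs(q) for q in qs)
    return [q / largest for q in qs]    # rescaled to [-1, 1]


def quantize_block(weights, levels):
    """Map each weight to the index of its nearest level after absmax scaling."""
    absmax = max(abs(w) for w in weights) or 1.0
    codes = [
        min(range(len(levels)), key=lambda i: abs(levels[i] - w / absmax))
        for w in weights
    ]
    return codes, absmax


def dequantize_block(codes, absmax, levels):
    return [levels[c] * absmax for c in codes]


levels = quantile_levels()
block = [0.02, -0.15, 0.4, -0.07, 0.11, -0.33]
codes, absmax = quantize_block(block, levels)
restored = dequantize_block(codes, absmax, levels)
print("codes:", codes)
print("max abs error:", max(abs(a - b) for a, b in zip(block, restored)))
```

The levels are packed more densely near zero, where most weights of a roughly normal distribution lie, and more sparsely in the tails.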
LoRA Variants
AdaLoRA
AdaLoRA (Zhang et al., 2023) adapts the rank allocation across layers based on importance:
AdaLoRA Rank Allocation
$$r_l = R \cdot \frac{s_l}{\sum_{j} s_j}$$
Here,
- $r_l$ = Rank allocated to layer l
- $s_l$ = Importance score of layer l
- $R$ = Total rank budget
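As a simplified illustration of budgeted allocation, the sketch below splits a total rank budget across layers in proportion to their importance scores and rounds to integers with a largest-remainder rule. This is an assumption-laden toy, not the AdaLoRA algorithm itself, which works on an SVD-style parameterization and prunes individual singular values during training. The function name and scores are hypothetical.

```python
def allocate_ranks(importance, total_rank):
    """Split total_rank across layers in proportion to importance scores."""
    total = sum(importance)
    raw = [total_rank * s / total for s in importance]
    ranks = [int(x) for x in raw]               # floor of each share
    leftover = total_rank - sum(ranks)
    # Hand the remaining units to the layers with the largest fractional parts
    by_fraction = sorted(
        range(len(raw)), key=lambda i: raw[i] - ranks[i], reverse=True
    )
    for i in by_fraction[:leftover]:
        ranks[i] += 1
    return ranks


scores = [3.0, 1.0, 1.0, 1.0]       # hypothetical per-layer importance scores
ranks = allocate_ranks(scores, total_rank=32)
print(ranks, "sum =", sum(ranks))
```

The rounding step matters because ranks must be integers while the budget has to be met exactly.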
DoRA
DoRA (Liu et al., 2024) decomposes weights into magnitude and direction:
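DoRA Decomposition
$$W' = m \cdot \frac{W_0 + BA}{\lVert W_0 + BA \rVert_c}$$
Here,
- $m$ = Magnitude vector (learnable)
- $W_0 + BA$ = LoRA-updated weight
- $\lVert \cdot \rVert_c$ = Column-wise norm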
DoRA often outperforms standard LoRA by separating magnitude and directional learning, especially for tasks requiring significant weight changes.
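The decomposition can be checked numerically with a small NumPy sketch. It assumes the form W' = m * (W_0 + BA) / ||W_0 + BA||_c with the norm taken per column, and that m is initialized to the column norms of W_0; sizes and random values are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 6, 5, 2

W0 = rng.normal(size=(d_out, d_in))     # frozen pre-trained weight
A = rng.normal(size=(r, d_in))
B = np.zeros((d_out, r))                # LoRA-style zero init


def column_norm(W):
    # L2 norm of each column, shape (1, d_in)
    return np.linalg.norm(W, axis=0, keepdims=True)


m = column_norm(W0)                     # learnable magnitude, initialized from W0


def dora_weight(m, W0, B, A):
    V = W0 + B @ A                      # LoRA-updated weight
    return m * V / column_norm(V)       # unit-norm direction, rescaled by m


W_init = dora_weight(m, W0, B, A)
print("starts at W0:", np.allclose(W_init, W0))

# After a (simulated) update to B, direction changes but column norms stay at m
B_trained = rng.normal(size=(d_out, r))
W_new = dora_weight(m, W0, B_trained, A)
print("column norms equal m:", np.allclose(column_norm(W_new), m))
```

This shows the separation directly: B and A can only change the direction of each column, while the magnitude changes only through m.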
HuggingFace PEFT Library
```python
import torch
from peft import LoraConfig, get_peft_model, TaskType, PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",
)

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    bias="none",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# For this config: 4 projections x 32 layers x 16 x (4096 + 4096)
# = 16,777,216 trainable params, roughly 0.25% of the ~6.7B total

# Train the model...
# trainer.train()

# Save only the LoRA weights
model.save_pretrained("./lora-adapter")

# Load and merge later
base_model = AutoModelForCausalLM.from_pretrained(model_name)
model = PeftModel.from_pretrained(base_model, "./lora-adapter")
merged_model = model.merge_and_unload()
```
When to Use LoRA vs Full Fine-tuning
| Factor | LoRA | Full Fine-tuning |
|---|---|---|
| GPU Memory (7B model) | Low (about 4-8 GB with QLoRA) | High (14-28 GB for the weights alone, plus gradients and optimizer states) |
| Training Speed | Fast | Slow |
| Quality | 95-99% of full FT | 100% |
| Storage | MBs for adapters | GBs for full model |
| Multi-task | Easy (swap adapters) | Expensive |
| Catastrophic Forgetting | Less | More |
For most use cases, LoRA with r=16-32 applied to Q, K, V, O projections achieves 95-99% of full fine-tuning performance at a fraction of the cost. Use full fine-tuning only when maximum performance is critical and compute budget allows.
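The storage side of this trade-off is easy to estimate. The sketch below assumes square 4096 x 4096 projections, 32 layers, four adapted matrices per layer (Q, K, V, O), and 16-bit storage for both the adapter and the full 7B model; the function name is hypothetical.

```python
def adapter_size_mb(r, d, n_layers, n_matrices, bytes_per_param=2):
    """Approximate on-disk size of a LoRA adapter in decimal megabytes."""
    n_params = 2 * r * d * n_layers * n_matrices
    return n_params * bytes_per_param / 1e6


full_model_gb = 7e9 * 2 / 1e9   # 7B parameters at 2 bytes each

for r in (16, 32):
    size = adapter_size_mb(r, d=4096, n_layers=32, n_matrices=4)
    print(f"r={r}: about {size:.1f} MB adapter vs {full_model_gb:.0f} GB full model")
```

At this scale, dozens of task-specific adapters fit in the space of a single full checkpoint, which is what makes swapping adapters per task practical.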
Practice Exercises
- Mathematical: Calculate the number of trainable parameters for LoRA applied to Q and V projections of a model with 32 layers, d_model=4096, and rank r=16. Compare with the total model parameters.
- Implementation: Use the HuggingFace PEFT library to apply LoRA to Mistral-7B and fine-tune on a small dataset. Compare training time and memory with full fine-tuning.
- Analysis: Experiment with different LoRA ranks (4, 8, 16, 32, 64). Plot the relationship between rank, trainable parameters, and task performance.
- Research: Compare LoRA, QLoRA, and DoRA on the same task. Which provides the best trade-off between performance and efficiency?
Key Takeaways:
- LoRA decomposes weight updates as W = W_0 + BA with r << d
- LoRA trains only 0.1-0.5% of total parameters while achieving 95-99% of full FT performance
- QLoRA combines 4-bit NF4 quantization with LoRA for consumer GPU training
- AdaLoRA adapts rank allocation based on layer importance
- DoRA separates magnitude and direction for improved performance
- Use PEFT for most fine-tuning tasks; full fine-tuning only when maximum performance is needed