LoRA and PEFT

Parameter-Efficient Fine-Tuning (PEFT) methods enable adaptation of large language models by updating only a small subset of parameters. This tutorial provides a rigorous treatment of LoRA, its variants, and practical implementation.

Definition: Parameter-Efficient Fine-Tuning (PEFT)

PEFT is a family of techniques that fine-tune large models by updating only a small number of parameters while keeping the rest frozen. The key insight behind LoRA-style methods is that the weight updates learned during fine-tuning have low intrinsic dimensionality, meaning they can be well approximated by a product of low-rank matrices.

LoRA: Low-Rank Adaptation

LoRA (Hu et al., 2021) is the most widely used PEFT method. It freezes the pre-trained weights and injects trainable low-rank decomposition matrices into each layer of the Transformer.

Definition: LoRA

LoRA represents the weight update ΔW as the product of two low-rank matrices B and A, so the adapted weight is W = W_0 + BA, where B is d x r and A is r x d, with r << d. Only B and A are trained, which is what makes the update parameter-efficient.

LoRA Decomposition

$$W = W_0 + \Delta W = W_0 + BA$$

The forward pass with LoRA becomes:

LoRA Forward Pass

$$h = W_0 x + \frac{\alpha}{r} BAx$$

Here,

  • $W_0$ = Frozen pre-trained weight
  • $B, A$ = LoRA matrices
  • $\alpha$ = Scaling factor (typically 16 or 32)
  • $r$ = Rank
  • $x$ = Input activation
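
The forward pass above can be checked numerically. The following is a minimal NumPy sketch rather than a reference implementation; the toy dimensions, the random seed, and the helper name `lora_forward` are assumptions made for illustration.

```python
import numpy as np

def lora_forward(W0, A, B, x, alpha):
    """Compute h = W0 x + (alpha / r) * B A x for a single input vector x."""
    r = A.shape[0]                     # the rank is the inner dimension of B A
    return W0 @ x + (alpha / r) * (B @ (A @ x))

rng = np.random.default_rng(0)
d, r, alpha = 64, 4, 8                 # toy sizes (assumed for illustration)
W0 = rng.normal(size=(d, d))           # frozen pre-trained weight
A = rng.normal(size=(r, d))            # trainable, r x d
B = np.zeros((d, r))                   # trainable, d x r, all zeros here
x = rng.normal(size=d)                 # input activation

h = lora_forward(W0, A, B, x, alpha)
# With B = 0 the adapter contributes nothing, so h equals the frozen output.
same_as_base = np.allclose(h, W0 @ x)

# The update B A can never have rank above r, whatever values B and A take.
B_trained = rng.normal(size=(d, r))    # stand-in for a trained B
update_rank = np.linalg.matrix_rank(B_trained @ A)
print(same_as_base, update_rank)
```

Computing `B @ (A @ x)` instead of `(B @ A) @ x` avoids materializing the full d x d update, which is the usual reason to keep the adapter in factored form during training.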

Example: LoRA Parameter Calculation

For a 7B model with d_model = 4096, 32 layers, rank r = 16, applying LoRA to Q and V:

  • Full parameters: 7B
  • LoRA Q params: 2 x 16 x 4096 = 131K per layer (A and B together), 4.2M across 32 layers
  • LoRA V params: 2 x 16 x 4096 = 131K per layer (A and B together), 4.2M across 32 layers
  • Total LoRA: 8.4M params (0.12% of full model)

This often achieves 95%+ of full fine-tuning performance with well under 1% of the parameters.

Parameter Efficiency

LoRA Parameter Count

$$\text{LoRA params} = 2 \times r \times d \times n_{\text{layers}} \times n_{\text{matrices}}$$

Here,

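  • $r$ = Rank
  • $d$ = Model dimension (the formula assumes each adapted weight matrix is d x d)
  • $n_{\text{layers}}$ = Number of layers
  • $n_{\text{matrices}}$ = Number of adapted weight matrices per layer (typically 2-4); each one gets its own A and B, hence the factor of 2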

For a 7B model with r=16, applying LoRA to Q and V projections:

  • Full parameters: 7B
  • LoRA parameters: 2 x 16 x 4096 x 32 x 2 = 8.4M (0.12% of total)
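
The arithmetic above is easy to reproduce. This is a small sketch of the parameter-count formula from this section; the function name `lora_param_count` is made up for illustration, and `7e9` stands in for the nominal "7B" total.

```python
def lora_param_count(r, d, n_layers, n_matrices):
    """Trainable LoRA parameters, assuming every adapted matrix is d x d."""
    per_matrix = 2 * r * d             # A is r x d and B is d x r
    return per_matrix * n_layers * n_matrices

total = lora_param_count(r=16, d=4096, n_layers=32, n_matrices=2)
fraction = total / 7e9                 # nominal 7B-parameter base model
print(f"{total:,} trainable parameters ({fraction:.2%} of 7B)")

# The count grows linearly with the rank.
for r in (4, 8, 16, 32, 64):
    print(r, f"{lora_param_count(r, 4096, 32, 2):,}")
```

Because the count is linear in r, doubling the rank doubles the adapter size, which is worth keeping in mind when sweeping ranks.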

LoRA rank r is a critical hyperparameter. Start with r=8 or r=16; increase to 32-64 only if performance is insufficient. Higher rank does not always mean better performance.

Which Layers to Adapt?

Common strategies for applying LoRA:

  • Q, V projections (default): Most impactful, fewest parameters
  • Q, K, V, O projections: Full attention adaptation
  • Q, K, V, O + FFN: Maximum expressiveness
  • All linear layers: Maximum coverage
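
To get a feel for what each strategy costs, the sketch below counts adapter parameters per strategy. The layer shapes are an assumption (Llama-2-7B-style: hidden size 4096, FFN size 11008, three FFN projections) and are not taken from the text above; for a non-square weight, a LoRA pair costs r x (d_in + d_out).

```python
# Assumed Llama-2-7B-style shapes as (d_in, d_out); illustrative only.
D, D_FFN, N_LAYERS = 4096, 11008, 32
SHAPES = {
    "q": (D, D), "k": (D, D), "v": (D, D), "o": (D, D),
    "gate": (D, D_FFN), "up": (D, D_FFN), "down": (D_FFN, D),
}
STRATEGIES = {
    "Q, V": ["q", "v"],
    "Q, K, V, O": ["q", "k", "v", "o"],
    "Q, K, V, O + FFN": list(SHAPES),
}

def lora_params(modules, r):
    """A is r x d_in and B is d_out x r, so each pair costs r * (d_in + d_out)."""
    per_layer = sum(r * (SHAPES[m][0] + SHAPES[m][1]) for m in modules)
    return per_layer * N_LAYERS

counts = {name: lora_params(mods, r=16) for name, mods in STRATEGIES.items()}
for name, n in counts.items():
    print(f"{name:<18} {n:>12,}")
```

Under these assumed shapes, adding the FFN projections more than doubles the adapter size relative to attention-only adaptation, so the broader strategies trade parameters for expressiveness.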

LoRA Initialization

The initialization strategy is crucial for LoRA performance:

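```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Wraps a frozen nn.Linear and adds a trainable low-rank update."""

    def __init__(self, original_linear, r=8, alpha=16):
        super().__init__()
        self.original = original_linear
        # Freeze the pre-trained weight (and the bias, if the layer has one).
        self.original.weight.requires_grad = False
        if self.original.bias is not None:
            self.original.bias.requires_grad = False

        d_out, d_in = original_linear.weight.shape
        # A gets a small Gaussian init; B starts at zero so that BA = 0.
        self.lora_A = nn.Parameter(torch.randn(r, d_in) * (1 / d_in ** 0.5))
        self.lora_B = nn.Parameter(torch.zeros(d_out, r))
        self.scaling = alpha / r

    def forward(self, x):
        base_out = self.original(x)
        # x @ A^T @ B^T applies BA to each row of x without forming BA.
        lora_out = (x @ self.lora_A.T @ self.lora_B.T) * self.scaling
        return base_out + lora_out
```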

B is initialized to zeros so that LoRA starts with zero update (the model behaves identically to the pre-trained model at initialization). A uses Kaiming or Gaussian initialization.
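
One way to see why only B is zeroed is to look at the first-step gradients. The sketch below derives them by hand for a single input vector; the toy sizes, the made-up upstream gradient `g`, and the helper name `lora_grads` are assumptions for illustration.

```python
import numpy as np

def lora_grads(A, B, x, g, scaling):
    """Gradients of a scalar loss w.r.t. A and B, given g = dLoss/dh.

    From h = W0 x + scaling * B (A x):
      dLoss/dB = scaling * outer(g, A x)
      dLoss/dA = scaling * outer(B^T g, x)
    """
    grad_B = scaling * np.outer(g, A @ x)
    grad_A = scaling * np.outer(B.T @ g, x)
    return grad_A, grad_B

rng = np.random.default_rng(0)
d_in, d_out, r = 32, 32, 4             # toy sizes (assumed)
x = rng.normal(size=d_in)              # input activation
g = rng.normal(size=d_out)             # upstream gradient (made up for the demo)

# Standard LoRA init: A random, B zero.
A = rng.normal(size=(r, d_in)) / d_in ** 0.5
B = np.zeros((d_out, r))
grad_A, grad_B = lora_grads(A, B, x, g, scaling=2.0)
b_gets_signal = bool(np.any(grad_B != 0))   # B can move on the first step
a_gets_signal = bool(np.any(grad_A != 0))   # A waits until B is non-zero

# If A were zero as well, neither matrix would receive a gradient.
zero_A = np.zeros((r, d_in))
stuck_A, stuck_B = lora_grads(zero_A, B, x, g, scaling=2.0)
both_stuck = not np.any(stuck_A) and not np.any(stuck_B)
print(b_gets_signal, a_gets_signal, both_stuck)
```

So zeroing B alone keeps the initial update at zero while still letting training start, whereas zeroing both matrices would leave the adapter stuck at zero.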

QLoRA: Quantized LoRA

QLoRA (Dettmers et al., 2023) combines 4-bit quantization with LoRA, enabling fine-tuning of large models on consumer GPUs.

Definition: QLoRA

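QLoRA quantizes the pre-trained model to 4-bit precision (NF4 format) and keeps those weights frozen. LoRA adapters are added on top and trained in 16-bit precision (the original paper uses BF16 as the compute type), with the 4-bit weights dequantized on the fly for each forward and backward pass.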

Memory Savings

QLoRA Memory

$$\text{Memory}_{\text{QLoRA}} = \frac{\text{Memory}_{\text{FP16}}}{4} + \text{Memory}_{\text{LoRA}} + \text{Optimizer States}$$

Here,

  • $\text{Memory}_{\text{FP16}}$ = Full FP16 model memory
  • $\text{Memory}_{\text{LoRA}}$ = LoRA adapter memory (small)
  • $\text{Optimizer States}$ = Adam states for LoRA params only

| Model Size | FP16 Memory | QLoRA Memory | Can Fit On |
|---|---|---|---|
| 7B | 14 GB | ~4.5 GB | RTX 3060 (12 GB) |
| 13B | 26 GB | ~8 GB | RTX 3090 (24 GB) |
| 70B | 140 GB | ~36 GB | A100 (80 GB) |
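
The memory formula can be turned into a quick estimator. This is a rough sketch under stated assumptions: 16-bit LoRA weights, two FP32 Adam moments per LoRA parameter, and no accounting for activations, quantization constants, or framework overhead. The function name `qlora_memory_gb` is hypothetical.

```python
def qlora_memory_gb(n_params, n_lora_params,
                    lora_bytes=2, optimizer_bytes_per_param=8):
    """Rough lower bound in GB from the QLoRA memory formula.

    Ignores activations, quantization constants, and framework overhead.
    """
    fp16_bytes = 2 * n_params                      # Memory_FP16
    base_4bit = fp16_bytes / 4                     # Memory_FP16 / 4
    lora = lora_bytes * n_lora_params              # Memory_LoRA
    optimizer = optimizer_bytes_per_param * n_lora_params
    return (base_4bit + lora + optimizer) / 1e9

# 7B base model with the 8.4M-parameter adapter (Q and V, r=16).
estimate = qlora_memory_gb(n_params=7e9, n_lora_params=8_388_608)
print(f"{estimate:.2f} GB")
```

The estimate lands below the ~4.5 GB listed for the 7B model, presumably because real training runs also hold activations and other overhead that the formula leaves out.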

NF4 Quantization

Definition: NormalFloat 4-bit (NF4)

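NF4 is a 4-bit data type optimized for normally distributed weights. It uses quantile-based quantization: the 16 representable values are placed at quantiles of a normal distribution, so that each quantization bin is expected to hold roughly the same number of weights. For neural network weights, which cluster near zero, this loses less precision than uniform INT4 quantization, which spends as many levels on the sparse tails as on the dense center.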
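
The idea of quantile-based levels can be illustrated with the standard library alone. This is a simplified sketch, not the NF4 code book used in practice, which is constructed somewhat differently (for example, it represents zero exactly). The function names and sample weights are made up for illustration.

```python
from statistics import NormalDist

def quantile_levels(bits=4):
    """Levels at evenly spaced quantiles of N(0, 1), rescaled to [-1, 1]."""
    n = 2 ** bits
    q = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    top = max(abs(v) for v in q)
    return [v / top for v in q]

def quantize_block(weights, levels):
    """Absmax-scale a block, then snap each weight to its nearest level."""
    scale = max(abs(w) for w in weights) or 1.0    # avoid dividing by zero
    codes = [min(range(len(levels)), key=lambda i: abs(levels[i] - w / scale))
             for w in weights]
    return codes, scale

def dequantize_block(codes, scale, levels):
    """Map 4-bit codes back to approximate weights."""
    return [levels[c] * scale for c in codes]

levels = quantile_levels()
weights = [0.02, -0.15, 0.40, -0.03, 0.11, -0.40]   # made-up sample block
codes, scale = quantize_block(weights, levels)
restored = dequantize_block(codes, scale, levels)
print(codes)
print([round(v, 3) for v in restored])
```

The levels are packed more densely near zero than near the ends of the range, which is where most normally distributed weights fall.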

LoRA Variants

AdaLoRA

AdaLoRA (Zhang et al., 2023) adapts the rank allocation across layers based on importance:

AdaLoRA Rank Allocation

$$r_l = \text{round}\left(r_{\text{total}} \cdot \frac{s_l}{\sum_{i} s_i}\right)$$

Here,

  • $r_l$ = Rank allocated to layer l
  • $s_l$ = Importance score of layer l
  • $r_{\text{total}}$ = Total rank budget
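
The proportional rule above is simple to sketch. Note that this only implements the formula as written in this section; the published AdaLoRA method works on an SVD-style parameterization and prunes singular values during training. The importance scores below are hypothetical.

```python
def allocate_ranks(scores, r_total):
    """r_l = round(r_total * s_l / sum_i s_i) for each layer l."""
    total = sum(scores)
    return [round(r_total * s / total) for s in scores]

scores = [1.0, 1.0, 2.0, 4.0]          # hypothetical importance scores
ranks = allocate_ranks(scores, r_total=32)
print(ranks, sum(ranks))
```

Because each layer is rounded independently, the allocated ranks are not guaranteed to sum exactly to the budget in general, so a practical version would need a step that redistributes the remainder.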

DoRA

DoRA (Liu et al., 2024) decomposes weights into magnitude and direction:

DoRA Decomposition

$$W = m \cdot \frac{W_0 + BA}{\|W_0 + BA\|_c}$$

Here,

  • $m$ = Magnitude vector (learnable)
  • $W_0 + BA$ = LoRA-updated weight
  • $\|\cdot\|_c$ = Column-wise norm

DoRA often outperforms standard LoRA by separating magnitude and directional learning, especially for tasks requiring significant weight changes.
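
The decomposition can be sketched in a few lines of NumPy. This is an illustration of the formula above, not the reference implementation. The toy shapes and the helper name `dora_weight` are assumptions, as is the choice to initialize m to the column norms of W_0 so that training starts from the pre-trained weight.

```python
import numpy as np

def dora_weight(W0, A, B, m):
    """W = m * (W0 + B A) / ||W0 + B A||_c, with one norm per column."""
    V = W0 + B @ A                                        # LoRA-updated weight
    col_norm = np.linalg.norm(V, axis=0, keepdims=True)   # shape (1, d_in)
    return m * V / col_norm

rng = np.random.default_rng(0)
d_out, d_in, r = 16, 12, 2             # toy sizes (assumed)
W0 = rng.normal(size=(d_out, d_in))
A = rng.normal(size=(r, d_in))
B = np.zeros((d_out, r))               # zero init, as in LoRA
# Assumed init: m starts as the column norms of W0, so W equals W0 at step 0.
m = np.linalg.norm(W0, axis=0, keepdims=True)

W = dora_weight(W0, A, B, m)
starts_at_pretrained = np.allclose(W, W0)

# Doubling m doubles every column's length without changing its direction.
W_scaled = dora_weight(W0, A, B, 2 * m)
scales_only_magnitude = np.allclose(W_scaled, 2 * W0)
print(starts_at_pretrained, scales_only_magnitude)
```

The normalization means B and A only steer the direction of each column, while m alone controls its length.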

HuggingFace PEFT Library

```python
import torch
from peft import LoraConfig, PeftModel, TaskType, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",
)

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    bias="none",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# With r=16 on four 4096 x 4096 projections in each of 32 layers, this reports
# 4 x 32 x 2 x 16 x 4096 = 16,777,216 trainable params (about 0.25% of the model).

# Train the model (assumes `trainer` is an already configured transformers.Trainer)
# trainer.train()

# Save only LoRA weights
model.save_pretrained("./lora-adapter")

# Load and merge later
base_model = AutoModelForCausalLM.from_pretrained(model_name)
model = PeftModel.from_pretrained(base_model, "./lora-adapter")
merged_model = model.merge_and_unload()
```
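
Conceptually, merging folds the adapter into the base weight so that inference needs no extra matrix multiplications. The sketch below shows only the arithmetic behind that idea in NumPy; it is not how the PEFT library implements `merge_and_unload`, and the toy sizes and the helper name `merge_lora` are assumptions.

```python
import numpy as np

def merge_lora(W0, A, B, alpha):
    """Return W0 + (alpha / r) * B A as a single dense matrix."""
    r = A.shape[0]
    return W0 + (alpha / r) * (B @ A)

rng = np.random.default_rng(0)
d, r, alpha = 32, 4, 8                 # toy sizes (assumed)
W0 = rng.normal(size=(d, d))           # frozen pre-trained weight
A = rng.normal(size=(r, d))            # stand-in for trained values
B = rng.normal(size=(d, r))            # stand-in for trained values
x = rng.normal(size=d)

unmerged = W0 @ x + (alpha / r) * (B @ (A @ x))    # adapter kept separate
merged = merge_lora(W0, A, B, alpha) @ x           # adapter folded in
outputs_match = np.allclose(unmerged, merged)
print(outputs_match)
```

Since the merged matrix has the same shape as W_0, a merged model has the same inference cost as the original, at the price of no longer being able to swap the adapter out.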

When to Use LoRA vs Full Fine-tuning

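| Factor | LoRA | Full Fine-tuning |
|---|---|---|
| GPU Memory | Low (4-8 GB with QLoRA for 7B-13B models) | High (14-28 GB for the FP16 weights alone at 7B-13B, plus gradients and optimizer states) |
| Training Speed | Fast | Slow |
| Quality | 95-99% of full FT | 100% |
| Storage | MBs for adapters | GBs for full model |
| Multi-task | Easy (swap adapters) | Expensive |
| Catastrophic Forgetting | Less | More |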

For most use cases, LoRA with r=16-32 applied to Q, K, V, O projections achieves 95-99% of full fine-tuning performance at a fraction of the cost. Use full fine-tuning only when maximum performance is critical and compute budget allows.

Practice Exercises

  1. Mathematical: Calculate the number of trainable parameters for LoRA applied to Q and V projections of a model with 32 layers, d_model=4096, and rank r=16. Compare with the total model parameters.

  2. Implementation: Use the HuggingFace PEFT library to apply LoRA to Mistral-7B and fine-tune on a small dataset. Compare training time and memory with full fine-tuning.

  3. Analysis: Experiment with different LoRA ranks (4, 8, 16, 32, 64). Plot the relationship between rank, trainable parameters, and task performance.

  4. Research: Compare LoRA, QLoRA, and DoRA on the same task. Which provides the best trade-off between performance and efficiency?

Key Takeaways:

  • LoRA decomposes weight updates as W = W_0 + BA with r << d
  • LoRA trains only 0.1-0.5% of total parameters while achieving 95-99% of full FT performance
  • QLoRA combines 4-bit NF4 quantization with LoRA for consumer GPU training
  • AdaLoRA adapts rank allocation based on layer importance
  • DoRA separates magnitude and direction for improved performance
  • Use PEFT for most fine-tuning tasks; full fine-tuning only when maximum performance is needed
