LoRA and PEFT — Efficient Fine-Tuning Without Full Retraining

Parameter-Efficient Fine-Tuning methods enable adaptation of large language models by updating only a small subset of parameters. This guide provides a rigorous treatment of LoRA, its variants, and practical implementation for cost-effective fine-tuning.

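Low-Rank Adaptation: decomposes weight updates into a product of two small low-rank matrices
0.1-0.5% of parameters: often recovers most of full fine-tuning quality (95-99% is commonly cited, though results are task-dependent)
AdaLoRA and DoRA: variants that adapt rank allocation across layers and separate weight magnitude from direction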

LoRA: Low-Rank Adaptation

LoRA (Hu et al., 2021) is the most widely used PEFT method. It freezes the pre-trained weights and injects trainable low-rank decomposition matrices into each layer of the Transformer.

LoRA Decomposition

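For a frozen pre-trained weight matrix $W_0 \in \mathbb{R}^{d \times k}$, LoRA represents the weight update as a product of two low-rank matrices, $\Delta W = BA$, with $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$, and rank $r \ll \min(d, k)$. The forward pass with LoRA becomes:

$$h = W_0 x + \frac{\alpha}{r} B A x$$

where $\alpha$ is a constant scaling hyperparameter. Only $A$ and $B$ receive gradients; $W_0$ stays fixed.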

Parameter Efficiency

For a 7B model with hidden size 4096 and 32 layers, applying LoRA with r=16 to the Q and V projections:

Full parameters: 7B
LoRA parameters: 2 (A and B) x 16 (rank) x 4096 (hidden size) x 32 (layers) x 2 (Q and V) = 8.4M (0.12% of total)

Which Layers to Adapt?

Common strategies for applying LoRA:

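Q, V projections (the setup highlighted in the original LoRA paper): Fewest parameters, a common starting point
Q, K, V, O projections: Full attention adaptation
Q, K, V, O + FFN: Adds the feed-forward projections for more capacity
All linear layers: Maximum coverage, and the most trainable parameters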
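To compare these strategies concretely, the sketch below counts the trainable LoRA parameters for each one. It assumes Llama-2-7B-style shapes (hidden size 4096, feed-forward size 11008, 32 layers) and the module names used by HuggingFace Llama checkpoints; `lora_param_count` is a hypothetical helper written for this illustration, not a library function.

```python
# Assumed Llama-2-7B-style dimensions (an assumption, not stated in the text above)
D_MODEL = 4096
D_FFN = 11008
N_LAYERS = 32

# (d_in, d_out) of each linear layer inside one Transformer block
SHAPES = {
    "q_proj": (D_MODEL, D_MODEL),
    "k_proj": (D_MODEL, D_MODEL),
    "v_proj": (D_MODEL, D_MODEL),
    "o_proj": (D_MODEL, D_MODEL),
    "gate_proj": (D_MODEL, D_FFN),
    "up_proj": (D_MODEL, D_FFN),
    "down_proj": (D_FFN, D_MODEL),
}

ATTENTION = ["q_proj", "k_proj", "v_proj", "o_proj"]
FFN = ["gate_proj", "up_proj", "down_proj"]

STRATEGIES = {
    "Q, V": ["q_proj", "v_proj"],
    "Q, K, V, O": ATTENTION,
    "Q, K, V, O + FFN": ATTENTION + FFN,
}


def lora_param_count(targets, r):
    """Each adapted layer adds A (r x d_in) and B (d_out x r)."""
    per_block = sum(r * (SHAPES[t][0] + SHAPES[t][1]) for t in targets)
    return per_block * N_LAYERS


for name, targets in STRATEGIES.items():
    print(f"{name}: {lora_param_count(targets, r=16):,} trainable parameters")
```

Under these assumptions the Q, V strategy gives 8,388,608 parameters, which matches the 8.4M figure above. For this architecture, "Q, K, V, O + FFN" and "All linear layers" select the same modules as long as the output head is left out.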

LoRA Initialization

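Initialization matters for LoRA: A starts with small random values and B starts at zero, so the update BA is exactly zero at the beginning and training starts from the pre-trained model's behavior: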

import torch
import torch.nn as nn

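class LoRALinear(nn.Module):
    """Wraps a frozen nn.Linear and adds a trainable low-rank update."""

    def __init__(self, original_linear, r=8, alpha=16):
        super().__init__()
        self.original = original_linear
        # Freeze the pre-trained weights; the bias is None when bias=False
        self.original.weight.requires_grad = False
        if self.original.bias is not None:
            self.original.bias.requires_grad = False

        d_out, d_in = original_linear.weight.shape
        # A: small random values; B: zeros, so B @ A = 0 at initialization
        self.lora_A = nn.Parameter(torch.randn(r, d_in) * (1 / d_in ** 0.5))
        self.lora_B = nn.Parameter(torch.zeros(d_out, r))
        self.scaling = alpha / r

    def forward(self, x):
        base_out = self.original(x)
        # x: (..., d_in) -> (..., r) -> (..., d_out)
        lora_out = (x @ self.lora_A.T @ self.lora_B.T) * self.scaling
        return base_out + lora_out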
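Two properties of this design are easy to check numerically: the adapter is a no-op at initialization, and a trained adapter can be folded into the base weight without changing the output. The following is a minimal sketch using raw tensors rather than the class above; the sizes and the "pretend training" step are illustrative assumptions.

```python
import torch

torch.manual_seed(0)
d_in, d_out, r, alpha = 64, 32, 4, 8
scaling = alpha / r

W0 = torch.randn(d_out, d_in)            # frozen pre-trained weight
A = torch.randn(r, d_in) / d_in ** 0.5   # random init
B = torch.zeros(d_out, r)                # zero init
x = torch.randn(5, d_in)


def lora_forward(x, W0, A, B, scaling):
    """Base projection plus the scaled low-rank path."""
    return x @ W0.T + (x @ A.T @ B.T) * scaling


# With B = 0 the LoRA path contributes nothing
same_at_init = torch.allclose(lora_forward(x, W0, A, B, scaling), x @ W0.T)

# Stand-in for training: move B away from zero
B = torch.randn(d_out, r) * 0.01

# Fold the update into a single weight matrix
W_merged = W0 + scaling * (B @ A)
same_after_merge = torch.allclose(
    lora_forward(x, W0, A, B, scaling), x @ W_merged.T, atol=1e-4
)

print(same_at_init, same_after_merge)
```

Merging is why a LoRA-adapted model need not add inference latency: once the update is folded in, the layer is an ordinary linear layer again.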

QLoRA: Quantized LoRA

QLoRA (Dettmers et al., 2023) combines 4-bit quantization with LoRA, enabling fine-tuning of large models on consumer GPUs.
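In the HuggingFace stack, a QLoRA-style setup is usually expressed as a 4-bit quantization config on the base model plus an ordinary LoRA config on top. The fragment below is a sketch of that pattern, not a tuned recipe: it assumes the `bitsandbytes` package and a CUDA GPU are available, and the model name, rank, and target modules are illustrative choices.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit NF4 storage for the frozen base weights, bf16 for computation
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",  # illustrative; any causal LM checkpoint
    quantization_config=bnb_config,
    device_map="auto",
)

# Prepares the quantized model for training (for example, casting norm layers)
model = prepare_model_for_kbit_training(model)

# The adapters themselves stay in higher precision and are the only trained weights
lora_config = LoraConfig(
    task_type="CAUSAL_LM",
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
)
model = get_peft_model(model, lora_config)
```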

Memory Savings

Model Size	FP16 Weights	QLoRA Memory (approx.)	QLoRA Fits On
7B	14 GB	~4.5 GB	RTX 3060 (12GB)
13B	26 GB	~8 GB	RTX 3090 (24GB)
70B	140 GB	~36 GB	A100 (80GB)
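The FP16 column follows directly from the parameter count, and the same arithmetic gives a lower bound for the 4-bit case. This sketch counts weight storage only; `weight_memory_gb` is a hypothetical helper, and 1 GB is taken as 1e9 bytes.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


for name, n_params in [("7B", 7e9), ("13B", 13e9), ("70B", 70e9)]:
    fp16 = weight_memory_gb(n_params, 16)
    four_bit = weight_memory_gb(n_params, 4)
    print(f"{name}: FP16 {fp16:.1f} GB, 4-bit {four_bit:.1f} GB")
```

The 4-bit weight-only figures (3.5, 6.5, and 35 GB) sit a little below the QLoRA column in the table. The remainder is plausibly overhead such as quantization constants, the adapters and their optimizer states, and activations, all of which vary with batch size and sequence length.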

NF4 Quantization
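NF4 (4-bit NormalFloat) is the data type introduced with QLoRA. It places its 16 quantization levels according to the quantiles of a normal distribution, which suits the roughly normal distribution of pre-trained weights, and it stores one scaling constant per block of weights. The sketch below illustrates the idea in plain Python under simplifying assumptions: the levels are computed from evenly spaced normal quantiles rather than taken from the exact NF4 table used in practice (which, among other differences, represents zero exactly), and the function names and block contents are hypothetical.

```python
from statistics import NormalDist


def normal_quantile_levels(bits=4):
    """2**bits levels at normal quantiles, rescaled to span [-1, 1]."""
    n = 2 ** bits
    nd = NormalDist()
    # Bin midpoints avoid the infinite quantiles at probabilities 0 and 1
    quantiles = [nd.inv_cdf((i + 0.5) / n) for i in range(n)]
    largest = max(abs(q) for q in quantiles)
    return [q / largest for q in quantiles]


def quantize_block(block, levels):
    """Scale a block by its absolute maximum, then snap to the nearest level."""
    absmax = max(abs(v) for v in block) or 1.0  # guard against an all-zero block
    codes = [
        min(range(len(levels)), key=lambda i: abs(v / absmax - levels[i]))
        for v in block
    ]
    return codes, absmax


def dequantize_block(codes, absmax, levels):
    """Look up each 4-bit code and undo the block scaling."""
    return [levels[c] * absmax for c in codes]


levels = normal_quantile_levels()
block = [0.02, -0.15, 0.31, -0.07, 0.0, 0.11, -0.29, 0.05]
codes, absmax = quantize_block(block, levels)
restored = dequantize_block(codes, absmax, levels)

for original, approx in zip(block, restored):
    print(f"{original:+.3f} -> {approx:+.3f}")
```

Each weight is stored as a 4-bit code, and each block carries a single `absmax` constant. Because the levels are denser near zero, small weights, which are the most common, are represented more finely than large ones.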

LoRA Variants

AdaLoRA

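AdaLoRA (Zhang et al., 2023) adapts the rank allocation across layers based on importance. It parameterizes each update in an SVD-like form:

$$W = W_0 + P \Lambda Q$$

where $\Lambda$ is a diagonal matrix of singular values. During training, the singular values with the lowest importance scores are pruned, so the rank budget goes to the weight matrices that benefit most.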
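The budget-allocation idea can be sketched in a few lines: score every rank component in every adapted matrix, keep the globally highest-scoring ones, and let each matrix end up with however many components it retains. This is a simplification and not the AdaLoRA algorithm itself, which uses sensitivity-based importance scores with smoothing and a budget schedule; the scores and layer names below are invented for illustration.

```python
def allocate_rank_budget(importance, total_budget):
    """Keep the `total_budget` highest-scoring components across all layers.

    importance: dict mapping layer name -> list of per-component scores.
    Returns a dict mapping layer name -> number of components kept (its rank).
    """
    scored = [
        (score, name)
        for name, scores in importance.items()
        for score in scores
    ]
    scored.sort(key=lambda item: item[0], reverse=True)

    ranks = {name: 0 for name in importance}
    for _, name in scored[:total_budget]:
        ranks[name] += 1
    return ranks


# Made-up importance scores: 3 matrices, 4 candidate components each
importance = {
    "layer0.q_proj": [0.90, 0.50, 0.10, 0.05],
    "layer0.v_proj": [0.80, 0.70, 0.60, 0.40],
    "layer1.q_proj": [0.30, 0.20, 0.02, 0.01],
}

ranks = allocate_rank_budget(importance, total_budget=6)
print(ranks)
```

With a budget of 6 out of 12 components, the matrix with consistently high scores keeps its full rank while the least important one is pruned to rank 0, which is the behavior a uniform-rank LoRA cannot express.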

DoRA

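DoRA (Liu et al., 2024) decomposes each pre-trained weight matrix into a magnitude vector and a direction matrix, trains the magnitude directly, and applies a LoRA update to the direction:

$$W' = m \, \frac{W_0 + BA}{\lVert W_0 + BA \rVert_c}$$

where $m$ is the trainable magnitude vector, initialized to the column norms of $W_0$, and $\lVert \cdot \rVert_c$ is the column-wise vector norm.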
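A small numerical sketch makes the decomposition concrete. It follows the paper's notation, where the adapted weight is a magnitude vector times the column-normalized sum of the frozen weight and a LoRA update; library implementations may organize the norm differently, and the sizes and helper name `dora_weight` are illustrative assumptions.

```python
import torch

torch.manual_seed(0)
d_out, d_in, r = 16, 32, 4

W0 = torch.randn(d_out, d_in)            # frozen pre-trained weight
A = torch.randn(r, d_in) / d_in ** 0.5   # LoRA A, random init
B = torch.zeros(d_out, r)                # LoRA B, zero init

# Magnitude: one trainable scalar per column, initialized to W0's column norms
m = W0.norm(dim=0, keepdim=True)         # shape (1, d_in)


def dora_weight(W0, A, B, m):
    """Magnitude times the column-normalized direction (W0 + B @ A)."""
    V = W0 + B @ A
    return m * V / V.norm(dim=0, keepdim=True)


# At initialization the reparameterization reproduces W0
W_init = dora_weight(W0, A, B, m)
unchanged_at_init = torch.allclose(W_init, W0, atol=1e-5)

# After a stand-in update to B, the direction changes but column norms stay at m
B = torch.randn(d_out, r) * 0.1
W_new = dora_weight(W0, A, B, m)
norms_match_m = torch.allclose(W_new.norm(dim=0, keepdim=True), m, atol=1e-4)

print(unchanged_at_init, norms_match_m)
```

In the PEFT library this variant is enabled with `use_dora=True` in `LoraConfig` on versions that support it.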

HuggingFace PEFT Library

import torch
from peft import (
    LoraConfig,
    get_peft_model,
    TaskType,
    PeftModel
)
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    bias="none",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# For this config (r=16 on q, k, v, o in 32 layers of width 4096), expect
# about 16.8M trainable parameters, roughly 0.25% of the model

# Train the model...
# trainer.train()

# Save only LoRA weights
model.save_pretrained("./lora-adapter")

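# Load and merge later: reload the base model, attach the adapter, then fold
# the LoRA update into the base weights so inference needs no PEFT wrapper
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
)
model = PeftModel.from_pretrained(base_model, "./lora-adapter")
merged_model = model.merge_and_unload()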

When to Use LoRA vs Full Fine-tuning

Factor	LoRA	Full Fine-tuning
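GPU Memory	Low: base weights stay frozen (and can be 4-bit with QLoRA); only adapters need gradients and optimizer states	High: weights, gradients, and optimizer states for every parameter, well beyond the FP16 weight footprint
Training Speed	Faster (fewer gradients and optimizer updates)	Slower
Quality	Often close to full FT (95-99% is commonly cited; task-dependent)	100% (baseline)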
Storage	MBs for adapters	GBs for full model
Multi-task	Easy (swap adapters)	Expensive
Catastrophic Forgetting	Less	More

Practice Exercises

Mathematical: Calculate the number of trainable parameters for LoRA applied to Q and V projections of a model with 32 layers, d_model=4096, and rank r=16. Compare with the total model parameters.
Implementation: Use the HuggingFace PEFT library to apply LoRA to Mistral-7B and fine-tune on a small dataset. Compare training time and memory with full fine-tuning.
Analysis: Experiment with different LoRA ranks (4, 8, 16, 32, 64). Plot the relationship between rank, trainable parameters, and task performance.
Research: Compare LoRA, QLoRA, and DoRA on the same task. Which provides the best trade-off between performance and efficiency?
