Fine-Tuning LLMs with LoRA/QLoRA
PEFT + Transformers + Custom Datasets | Memory-Efficient Fine-Tuning
Project Overview
Problem Statement
Fine-tuning a full 7B+ parameter LLM requires 28GB+ VRAM. LoRA and QLoRA enable fine-tuning on consumer GPUs by reducing trainable parameters by 90%+ while maintaining 95%+ performance.
Objectives
- Fine-tune LLaMA 3 8B on custom domain data using QLoRA
- Achieve measurable improvement on domain-specific benchmarks
- Implement efficient data preprocessing and formatting
- Deploy the fine-tuned model with proper evaluation
- Track experiments with MLflow and Weights & Biases
| Component | Technology |
|---|---|
| Base Model | LLaMA 3 8B / Mistral 7B |
| Fine-Tuning | PEFT (LoRA/QLoRA) |
| Training | Hugging Face Transformers + TRL |
| Quantization | bitsandbytes (4-bit NF4) |
| Dataset | Custom + Alpaca format |
| Evaluation | lm-evaluation-harness |
| Tracking | MLflow + W&B |
| Deployment | vLLM + Docker |
Architecture Diagram
+-------------------------------------------------------------------+
| Fine-Tuning Pipeline Architecture |
+-------------------------------------------------------------------+
| +--------------+ +--------------+ +------------------+ |
| | Raw Data |--->| Data Prep |--->| Tokenization | |
| | (JSON/CSV) | | & Formatting| | & Formatting | |
| +--------------+ +--------------+ +--------+---------+ |
| | |
| v |
| +--------------+ +--------------+ +------------------+ |
| | Base Model |--->| 4-bit Quant |--->| LoRA Adapter | |
| | (LLaMA 3) | | (NF4) | | Injection | |
| +--------------+ +--------------+ +--------+---------+ |
| | |
| v |
| +--------------+ +--------------+ +------------------+ |
| | Evaluation |<---| Training |<---| SFTTrainer | |
| | & Metrics | | (Mixed Prec)| | (PEFT) | |
| +--------------+ +--------------+ +------------------+ |
+-------------------------------------------------------------------+
Step-by-Step Implementation
Step 1: Environment Setup
conda create -n llm-finetune python=3.10 -y
conda activate llm-finetune
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install transformers>=4.40.0 datasets accelerate
pip install peft>=0.10.0 trl>=0.8.0
pip install bitsandbytes>=0.43.0
pip install scipy sentencepiece protobuf
pip install mlflow wandb tensorboard
pip install lm-eval
wandb login
huggingface-cli login
Step 2: Data Preparation
The dataset preparation is critical for fine-tuning success. We support Alpaca, ShareGPT, and ChatML formats.
# src/data/prepare_dataset.py
import json
import hashlib
import logging
import pandas as pd
from dataclasses import dataclass
from typing import List, Dict, Optional
from datasets import Dataset, DatasetDict
from sklearn.model_selection import train_test_split
logger = logging.getLogger(__name__)
@dataclass
class DataConfig:
max_length: int = 2048
train_split: float = 0.9
eval_split: float = 0.1
seed: int = 42
min_response_length: int = 10
max_response_length: int = 2000
class DatasetPreparer:
def __init__(self, config: DataConfig):
self.config = config
def load_alpaca(self, path: str) -> List[Dict]:
with open(path) as f:
data = json.load(f)
formatted = []
for item in data:
text = self.format_alpaca(item)
if len(text) >= self.config.min_response_length:
formatted.append({"text": text})
return formatted
def format_alpaca(self, item: Dict) -> str:
if item.get("input"):
return (
f"Below is an instruction that describes a task.\n\n"
f"### Instruction:\n{item['instruction']}\n\n"
f"### Input:\n{item['input']}\n\n"
f"### Response:\n{item['output']}"
)
return (
f"Below is an instruction that describes a task.\n\n"
f"### Instruction:\n{item['instruction']}\n\n"
f"### Response:\n{item['output']}"
)
def create_dataset(self, data: List[Dict]) -> DatasetDict:
dataset = Dataset.from_list(data)
split = dataset.train_test_split(
test_size=self.config.eval_split,
seed=self.config.seed
)
return DatasetDict({
"train": split["train"],
"eval": split["test"],
})
Step 3: LoRA Configuration and Model Loading
Configure LoRA adapters targeting specific model layers for parameter-efficient fine-tuning.
# src/model/lora_config.py
from peft import LoraConfig, get_peft_model, TaskType, prepare_model_for_kbit_training
from transformers import (
AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig,
TrainingArguments
)
import torch
def load_model_with_lora(
model_name: str = "meta-llama/Meta-Llama-3-8B",
lora_r: int = 16,
lora_alpha: int = 32,
lora_dropout: float = 0.05,
use_4bit: bool = True,
):
# 4-bit quantization config
bnb_config = None
if use_4bit:
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16,
bnb_4bit_use_double_quant=True,
)
# Load base model
model = AutoModelForCausalLM.from_pretrained(
model_name,
quantization_config=bnb_config,
device_map="auto",
torch_dtype=torch.bfloat16,
trust_remote_code=True,
)
# Prepare for k-bit training
if use_4bit:
model = prepare_model_for_kbit_training(model)
# Configure LoRA
lora_config = LoraConfig(
task_type=TaskType.CAUSAL_LM,
r=lora_r,
lora_alpha=lora_alpha,
lora_dropout=lora_dropout,
target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj"],
bias="none",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
return model, tokenizer
Step 4: Training with SFTTrainer
Use the TRL SFTTrainer for efficient supervised fine-tuning with PEFT integration.
# src/train.py
from transformers import TrainingArguments, TrainerCallback
from trl import SFTTrainer, SFTConfig
from peft import PeftModel
import mlflow
import os
class MLflowCallback(TrainerCallback):
def on_log(self, args, state, control, logs=None, **kwargs):
if logs:
for k, v in logs.items():
if isinstance(v, (int, float)):
mlflow.log_metric(k, v, step=state.global_step)
def on_save(self, args, state, control, **kwargs):
mlflow.log_metric("checkpoint_step", state.global_step)
def train(model, tokenizer, dataset, output_dir="./output"):
training_args = SFTConfig(
output_dir=output_dir,
num_train_epochs=3,
per_device_train_batch_size=4,
per_device_eval_batch_size=4,
gradient_accumulation_steps=4,
learning_rate=2e-4,
weight_decay=0.01,
warmup_ratio=0.03,
lr_scheduler_type="cosine",
logging_steps=10,
eval_strategy="steps",
eval_steps=100,
save_steps=200,
save_total_limit=3,
bf16=True,
max_seq_length=2048,
dataset_text_field="text",
report_to="none",
optim="paged_adamw_32bit",
max_grad_norm=0.3,
group_by_length=True,
)
mlflow.set_experiment("llm-finetuning")
with mlflow.start_run(run_name="qlora-llama3-8b"):
mlflow.log_params({
"model": "llama-3-8b",
"lora_r": 16,
"lora_alpha": 32,
"learning_rate": 2e-4,
"epochs": 3,
"quantization": "qlora-nf4",
})
trainer = SFTTrainer(
model=model,
args=training_args,
train_dataset=dataset["train"],
eval_dataset=dataset["eval"],
processing_class=tokenizer,
callbacks=[MLflowCallback()],
)
trainer.train()
# Save LoRA adapter
trainer.save_model(os.path.join(output_dir, "lora-adapter"))
mlflow.log_artifact(os.path.join(output_dir, "lora-adapter"))
return trainer
Step 5: Model Evaluation
Evaluate the fine-tuned model on held-out test data using multiple metrics.
# src/evaluate.py
import torch
import numpy as np
from sklearn.metrics import accuracy_score, f1_score
from transformers import pipeline
def evaluate_model(model, tokenizer, test_data, metrics=None):
pipe = pipeline(
"text-generation",
model=model,
tokenizer=tokenizer,
max_new_tokens=256,
temperature=0.1,
)
predictions = []
references = []
for item in test_data:
prompt = f"### Instruction:\n{item['instruction']}\n\n### Response:\n"
output = pipe(prompt)[0]["generated_text"]
pred = output.split("### Response:\n")[-1].strip()
predictions.append(pred)
references.append(item["output"])
results = {}
results["exact_match"] = accuracy_score(references, predictions)
print(f"Evaluation Results:")
print(f" Exact Match: {results['exact_match']:.4f}")
return results
def compute_perplexity(model, tokenizer, texts):
encodings = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
with torch.no_grad():
outputs = model(**encodings, labels=encodings["input_ids"])
return torch.exp(outputs.loss).item()
Step 6: Merge and Export
Merge LoRA adapters back into the base model for optimized inference.
# src/merge_export.py
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
def merge_and_export(
base_model_name: str,
adapter_path: str,
output_path: str,
):
base_model = AutoModelForCausalLM.from_pretrained(
base_model_name,
torch_dtype=torch.bfloat16,
device_map="auto",
)
model = PeftModel.from_pretrained(base_model, adapter_path)
model = model.merge_and_unload()
model.save_pretrained(output_path)
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
tokenizer.save_pretrained(output_path)
print(f"Merged model saved to {output_path}")
Step 7: Deployment with vLLM
Deploy the merged model using vLLM for high-throughput inference.
# Dockerfile
FROM vllm/vllm-openai:latest
COPY merged_model/ /models/llama3-8b-qlora
CMD ["python", "-m", "vllm.entrypoints.openai.api_server",
"--model", "/models/llama3-8b-qlora",
"--tensor-parallel-size", "1",
"--max-model-len", "4096",
"--gpu-memory-utilization", "0.9"]
# Deploy with Docker
docker build -t llama3-qlora-served .
docker run -p 8000:8000 --gpus all llama3-qlora-served
# Test inference
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{"model": "llama3-8b-qlora", "prompt": "Hello!", "max_tokens": 100}'
βΉοΈ
Always save checkpoints during training. Fine-tuning runs can be expensive and time-consuming. Use MLflow or W&B to track all experiments and reproduce results.
π‘
Start with a smaller model and dataset for rapid prototyping. Only scale up to larger models once your pipeline is validated.
Performance Metrics
| Metric | Target | Description |
|---|---|---|
| Training Loss | Converging | Should decrease steadily |
| Validation Loss | Minimal gap | No overfitting |
| Domain Accuracy | > 90% | On benchmark tasks |
| Inference Latency | < 100ms | Per token generation |
| Memory Usage | < 16GB | Fits single GPU |
Interview Talking Points
- Parameter Efficiency: LoRA reduces trainable parameters from billions to millions by learning low-rank decomposition matrices.
- QLoRA Innovation: Quantizing the base model to 4-bit NF4 while computing gradients in 16-bit BFloat16 enables fine-tuning 65B models on a single 48GB GPU.
- Data Quality: High-quality instruction-following data is more important than quantity. 10K curated examples often outperform 100K noisy ones.
- Hyperparameter Sensitivity: Learning rate (1e-4 to 3e-4), rank (8-64), and target modules significantly impact results.
- Evaluation Strategy: Combine automatic metrics with human evaluation and LLM-as-judge approaches.
- Deployment: Merge LoRA adapters into the base model for faster inference. Use vLLM or TGI for serving.
β οΈ
Always save checkpoints during training. Fine-tuning runs can be expensive and time-consuming. Use MLflow or W&B to track all experiments.
βΉοΈ
This project demonstrates production-grade LLM fine-tuning. For the complete implementation, refer to the accompanying repository.