Fine-Tuning LLMs with LoRA/QLoRA

PEFT + Transformers + Custom Datasets | Memory-Efficient Fine-Tuning

Expert16+ HoursGPU Required

Project Overview

Problem Statement

Fine-tuning a full 7B+ parameter LLM requires 28GB+ VRAM. LoRA and QLoRA enable fine-tuning on consumer GPUs by reducing trainable parameters by 90%+ while maintaining 95%+ performance.

Objectives

Fine-tune LLaMA 3 8B on custom domain data using QLoRA
Achieve measurable improvement on domain-specific benchmarks
Implement efficient data preprocessing and formatting
Deploy the fine-tuned model with proper evaluation
Track experiments with MLflow and Weights & Biases

Component	Technology
Base Model	LLaMA 3 8B / Mistral 7B
Fine-Tuning	PEFT (LoRA/QLoRA)
Training	Hugging Face Transformers + TRL
Quantization	bitsandbytes (4-bit NF4)
Dataset	Custom + Alpaca format
Evaluation	lm-evaluation-harness
Tracking	MLflow + W&B
Deployment	vLLM + Docker

Architecture Diagram

+-------------------------------------------------------------------+
|                  Fine-Tuning Pipeline Architecture                |
+-------------------------------------------------------------------+
|  +--------------+    +--------------+    +------------------+     |
|  |  Raw Data     |--->|  Data Prep   |--->|  Tokenization    |     |
|  |  (JSON/CSV)  |    |  & Formatting|    |  & Formatting    |     |
|  +--------------+    +--------------+    +--------+---------+     |
|                                                 |                 |
|                                                 v                 |
|  +--------------+    +--------------+    +------------------+     |
|  |  Base Model   |--->|  4-bit Quant |--->|  LoRA Adapter    |     |
|  |  (LLaMA 3)   |    |  (NF4)       |    |  Injection       |     |
|  +--------------+    +--------------+    +--------+---------+     |
|                                                 |                 |
|                                                 v                 |
|  +--------------+    +--------------+    +------------------+     |
|  |  Evaluation   |<---|  Training    |<---|  SFTTrainer      |     |
|  |  & Metrics   |    |  (Mixed Prec)|    |  (PEFT)          |     |
|  +--------------+    +--------------+    +------------------+     |
+-------------------------------------------------------------------+

Step-by-Step Implementation

Step 1: Environment Setup

conda create -n llm-finetune python=3.10 -y
conda activate llm-finetune
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install transformers>=4.40.0 datasets accelerate
pip install peft>=0.10.0 trl>=0.8.0
pip install bitsandbytes>=0.43.0
pip install scipy sentencepiece protobuf
pip install mlflow wandb tensorboard
pip install lm-eval
wandb login
huggingface-cli login

Step 2: Data Preparation

The dataset preparation is critical for fine-tuning success. We support Alpaca, ShareGPT, and ChatML formats.

# src/data/prepare_dataset.py
import json
import hashlib
import logging
import pandas as pd
from dataclasses import dataclass
from typing import List, Dict, Optional
from datasets import Dataset, DatasetDict
from sklearn.model_selection import train_test_split

logger = logging.getLogger(__name__)


@dataclass
class DataConfig:
    max_length: int = 2048
    train_split: float = 0.9
    eval_split: float = 0.1
    seed: int = 42
    min_response_length: int = 10
    max_response_length: int = 2000


class DatasetPreparer:
    def __init__(self, config: DataConfig):
        self.config = config

    def load_alpaca(self, path: str) -> List[Dict]:
        with open(path) as f:
            data = json.load(f)
        formatted = []
        for item in data:
            text = self.format_alpaca(item)
            if len(text) >= self.config.min_response_length:
                formatted.append({"text": text})
        return formatted

    def format_alpaca(self, item: Dict) -> str:
        if item.get("input"):
            return (
                f"Below is an instruction that describes a task.\n\n"
                f"### Instruction:\n{item['instruction']}\n\n"
                f"### Input:\n{item['input']}\n\n"
                f"### Response:\n{item['output']}"
            )
        return (
            f"Below is an instruction that describes a task.\n\n"
            f"### Instruction:\n{item['instruction']}\n\n"
            f"### Response:\n{item['output']}"
        )

    def create_dataset(self, data: List[Dict]) -> DatasetDict:
        dataset = Dataset.from_list(data)
        split = dataset.train_test_split(
            test_size=self.config.eval_split,
            seed=self.config.seed
        )
        return DatasetDict({
            "train": split["train"],
            "eval": split["test"],
        })

Step 3: LoRA Configuration and Model Loading

Configure LoRA adapters targeting specific model layers for parameter-efficient fine-tuning.

# src/model/lora_config.py
from peft import LoraConfig, get_peft_model, TaskType, prepare_model_for_kbit_training
from transformers import (
    AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig,
    TrainingArguments
)
import torch


def load_model_with_lora(
    model_name: str = "meta-llama/Meta-Llama-3-8B",
    lora_r: int = 16,
    lora_alpha: int = 32,
    lora_dropout: float = 0.05,
    use_4bit: bool = True,
):
    # 4-bit quantization config
    bnb_config = None
    if use_4bit:
        bnb_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=torch.bfloat16,
            bnb_4bit_use_double_quant=True,
        )

    # Load base model
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=bnb_config,
        device_map="auto",
        torch_dtype=torch.bfloat16,
        trust_remote_code=True,
    )

    # Prepare for k-bit training
    if use_4bit:
        model = prepare_model_for_kbit_training(model)

    # Configure LoRA
    lora_config = LoraConfig(
        task_type=TaskType.CAUSAL_LM,
        r=lora_r,
        lora_alpha=lora_alpha,
        lora_dropout=lora_dropout,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                         "gate_proj", "up_proj", "down_proj"],
        bias="none",
    )

    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()

    # Load tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.padding_side = "right"

    return model, tokenizer

Step 4: Training with SFTTrainer

Use the TRL SFTTrainer for efficient supervised fine-tuning with PEFT integration.

# src/train.py
from transformers import TrainingArguments, TrainerCallback
from trl import SFTTrainer, SFTConfig
from peft import PeftModel
import mlflow
import os


class MLflowCallback(TrainerCallback):
    def on_log(self, args, state, control, logs=None, **kwargs):
        if logs:
            for k, v in logs.items():
                if isinstance(v, (int, float)):
                    mlflow.log_metric(k, v, step=state.global_step)

    def on_save(self, args, state, control, **kwargs):
        mlflow.log_metric("checkpoint_step", state.global_step)


def train(model, tokenizer, dataset, output_dir="./output"):
    training_args = SFTConfig(
        output_dir=output_dir,
        num_train_epochs=3,
        per_device_train_batch_size=4,
        per_device_eval_batch_size=4,
        gradient_accumulation_steps=4,
        learning_rate=2e-4,
        weight_decay=0.01,
        warmup_ratio=0.03,
        lr_scheduler_type="cosine",
        logging_steps=10,
        eval_strategy="steps",
        eval_steps=100,
        save_steps=200,
        save_total_limit=3,
        bf16=True,
        max_seq_length=2048,
        dataset_text_field="text",
        report_to="none",
        optim="paged_adamw_32bit",
        max_grad_norm=0.3,
        group_by_length=True,
    )

    mlflow.set_experiment("llm-finetuning")
    with mlflow.start_run(run_name="qlora-llama3-8b"):
        mlflow.log_params({
            "model": "llama-3-8b",
            "lora_r": 16,
            "lora_alpha": 32,
            "learning_rate": 2e-4,
            "epochs": 3,
            "quantization": "qlora-nf4",
        })

        trainer = SFTTrainer(
            model=model,
            args=training_args,
            train_dataset=dataset["train"],
            eval_dataset=dataset["eval"],
            processing_class=tokenizer,
            callbacks=[MLflowCallback()],
        )

        trainer.train()

        # Save LoRA adapter
        trainer.save_model(os.path.join(output_dir, "lora-adapter"))
        mlflow.log_artifact(os.path.join(output_dir, "lora-adapter"))

    return trainer

Step 5: Model Evaluation

Evaluate the fine-tuned model on held-out test data using multiple metrics.

# src/evaluate.py
import torch
import numpy as np
from sklearn.metrics import accuracy_score, f1_score
from transformers import pipeline


def evaluate_model(model, tokenizer, test_data, metrics=None):
    pipe = pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
        max_new_tokens=256,
        temperature=0.1,
    )

    predictions = []
    references = []

    for item in test_data:
        prompt = f"### Instruction:\n{item['instruction']}\n\n### Response:\n"
        output = pipe(prompt)[0]["generated_text"]
        pred = output.split("### Response:\n")[-1].strip()
        predictions.append(pred)
        references.append(item["output"])

    results = {}
    results["exact_match"] = accuracy_score(references, predictions)

    print(f"Evaluation Results:")
    print(f"  Exact Match: {results['exact_match']:.4f}")
    return results


def compute_perplexity(model, tokenizer, texts):
    encodings = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        outputs = model(**encodings, labels=encodings["input_ids"])
    return torch.exp(outputs.loss).item()

Step 6: Merge and Export

Merge LoRA adapters back into the base model for optimized inference.

# src/merge_export.py
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch


def merge_and_export(
    base_model_name: str,
    adapter_path: str,
    output_path: str,
):
    base_model = AutoModelForCausalLM.from_pretrained(
        base_model_name,
        torch_dtype=torch.bfloat16,
        device_map="auto",
    )

    model = PeftModel.from_pretrained(base_model, adapter_path)
    model = model.merge_and_unload()

    model.save_pretrained(output_path)
    tokenizer = AutoTokenizer.from_pretrained(base_model_name)
    tokenizer.save_pretrained(output_path)

    print(f"Merged model saved to {output_path}")

Step 7: Deployment with vLLM

Deploy the merged model using vLLM for high-throughput inference.

# Dockerfile
FROM vllm/vllm-openai:latest
COPY merged_model/ /models/llama3-8b-qlora
CMD ["python", "-m", "vllm.entrypoints.openai.api_server",
     "--model", "/models/llama3-8b-qlora",
     "--tensor-parallel-size", "1",
     "--max-model-len", "4096",
     "--gpu-memory-utilization", "0.9"]

# Deploy with Docker
docker build -t llama3-qlora-served .
docker run -p 8000:8000 --gpus all llama3-qlora-served

# Test inference
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3-8b-qlora", "prompt": "Hello!", "max_tokens": 100}'

ℹ️

Always save checkpoints during training. Fine-tuning runs can be expensive and time-consuming. Use MLflow or W&B to track all experiments and reproduce results.

💡

Start with a smaller model and dataset for rapid prototyping. Only scale up to larger models once your pipeline is validated.

Performance Metrics

Metric	Target	Description
Training Loss	Converging	Should decrease steadily
Validation Loss	Minimal gap	No overfitting
Domain Accuracy	> 90%	On benchmark tasks
Inference Latency	< 100ms	Per token generation
Memory Usage	< 16GB	Fits single GPU

Interview Talking Points

Parameter Efficiency: LoRA reduces trainable parameters from billions to millions by learning low-rank decomposition matrices.
QLoRA Innovation: Quantizing the base model to 4-bit NF4 while computing gradients in 16-bit BFloat16 enables fine-tuning 65B models on a single 48GB GPU.
Data Quality: High-quality instruction-following data is more important than quantity. 10K curated examples often outperform 100K noisy ones.
Hyperparameter Sensitivity: Learning rate (1e-4 to 3e-4), rank (8-64), and target modules significantly impact results.
Evaluation Strategy: Combine automatic metrics with human evaluation and LLM-as-judge approaches.
Deployment: Merge LoRA adapters into the base model for faster inference. Use vLLM or TGI for serving.

⚠️

Always save checkpoints during training. Fine-tuning runs can be expensive and time-consuming. Use MLflow or W&B to track all experiments.

ℹ️

This project demonstrates production-grade LLM fine-tuning. For the complete implementation, refer to the accompanying repository.

Fine-Tuning LLMs with LoRA/QLoRA on Custom Data