
QLoRA and Quantization — Running LLMs on Consumer Hardware

Quantization reduces model memory by representing weights with fewer bits, enabling deployment on consumer hardware. This guide covers INT8, INT4, GPTQ, AWQ, and practical BitsAndBytes integration for accessible LLM fine-tuning.

NF4 Quantization — Optimal for normally distributed neural network weights
GPTQ & AWQ — Post-training quantization with minimal quality loss
Consumer GPU Training — Fine-tune 7B models on a single GPU with QLoRA

Democratizing AI means making it run on the hardware people already have.


Quantization Formats

FP16 (Half Precision)

BF16 (Brain Float 16)
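To make the difference between the two 16-bit formats concrete, the sketch below rounds ordinary Python floats to FP16 and BF16 using only the standard library. The helper names `to_fp16` and `to_bf16` are illustrative, not part of any library, and BF16 is emulated here by rounding away the low 16 bits of the FP32 encoding.

```python
import struct

def to_fp16(x: float) -> float:
    """Round x to the nearest IEEE half-precision value (inf on overflow)."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # FP16 tops out at 65504, so larger magnitudes overflow.
        return float("inf") if x > 0 else float("-inf")

def to_bf16(x: float) -> float:
    """Emulate bfloat16 by keeping only the top 16 bits of the FP32 encoding."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Round to nearest even before dropping the low 16 bits.
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]

for value in (1.001, 70000.0):
    print(value, "-> fp16:", to_fp16(value), "| bf16:", to_bf16(value))
```

FP16 keeps more mantissa bits, so 1.001 survives almost intact, but it overflows at 70000. BF16 keeps the FP32 exponent range, so 70000 stays finite, at the cost of coarser rounding near 1.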

INT8 Quantization
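A minimal sketch of symmetric absmax quantization to 8-bit integers, assuming one scale per tensor. The function names are hypothetical, and the outlier handling that production 8-bit kernels add is not shown.

```python
def quantize_absmax_int8(weights):
    """Map floats to integers in [-127, 127] using a single absmax scale."""
    absmax = max(abs(w) for w in weights)
    scale = absmax / 127 if absmax > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate floats from the integer codes."""
    return [q * scale for q in quantized]

weights = [0.5, -1.27, 0.01, 1.0]
q, scale = quantize_absmax_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q, scale, max_error)
```

The rounding error of each weight is bounded by half the scale, which is why a single large outlier (which inflates the scale) hurts every other weight in the tensor.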

INT4 Quantization
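With only 16 representable values, 4-bit schemes usually quantize small blocks of weights separately and pack two codes into each byte. The sketch below shows that idea under simple assumptions (symmetric range [-7, 7], one scale per block); the helper names are illustrative and do not match any particular library's storage layout.

```python
def quantize_int4_block(block):
    """Symmetric 4-bit quantization of one block: codes in [-7, 7] plus one scale."""
    absmax = max(abs(w) for w in block)
    scale = absmax / 7 if absmax > 0 else 1.0
    return [max(-7, min(7, round(w / scale))) for w in block], scale

def pack_nibbles(values):
    """Pack pairs of signed 4-bit integers into single bytes (two weights per byte)."""
    if len(values) % 2:
        values = values + [0]  # pad to an even count
    packed = bytearray()
    for low, high in zip(values[0::2], values[1::2]):
        packed.append((low & 0xF) | ((high & 0xF) << 4))
    return bytes(packed)

def unpack_nibbles(packed, count):
    """Inverse of pack_nibbles: recover `count` signed 4-bit integers."""
    values = []
    for byte in packed:
        for nibble in (byte & 0xF, byte >> 4):
            values.append(nibble - 16 if nibble >= 8 else nibble)
    return values[:count]

block = [0.7, -0.4, 0.1, 0.0, -0.7, 0.2]
codes, scale = quantize_int4_block(block)
packed = pack_nibbles(codes)
print(codes, "stored in", len(packed), "bytes")
```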

NormalFloat 4-bit (NF4)
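NF4 replaces the evenly spaced integer grid with 16 levels placed at quantiles of a normal distribution, so more levels sit near zero where most weights lie. The sketch below normalizes a block by its absmax and snaps each weight to the nearest level. The level values are the ones published with QLoRA, rounded to four decimals here and included for illustration only; the function names are hypothetical.

```python
# NF4 code values, rounded to four decimals (illustrative copy of the published table).
NF4_LEVELS = [
    -1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
    0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0,
]

def quantize_nf4_block(block):
    """Return a 4-bit level index per weight plus the block's absmax constant."""
    absmax = max(abs(w) for w in block) or 1.0
    indices = [
        min(range(16), key=lambda i: abs(w / absmax - NF4_LEVELS[i]))
        for w in block
    ]
    return indices, absmax

def dequantize_nf4_block(indices, absmax):
    """Look up each level and rescale by the stored absmax."""
    return [NF4_LEVELS[i] * absmax for i in indices]

block = [0.0, 2.0, -2.0, 0.6]
indices, absmax = quantize_nf4_block(block)
restored = dequantize_nf4_block(indices, absmax)
print(indices, restored)
```

Each block stores sixteen-level indices plus one absmax constant; those constants are what double quantization later compresses.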

Quantization Methods

GPTQ

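GPTQ (Frantar et al., 2023) is a post-training quantization method that builds on Optimal Brain Quantization. It quantizes each layer's weights one column at a time and uses approximate second-order (Hessian) information from a small calibration set to adjust the not-yet-quantized weights, so that the layer's output error stays small.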
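The error-compensation step is easier to see on a toy problem. The sketch below applies the Optimal Brain Quantization update to a "layer" with just two weights and an integer grid: after quantizing the first weight, the second is shifted using the inverse Hessian before it is rounded. The calibration inputs and weights are made up for illustration, and this is a sketch of the core update only, not the full GPTQ algorithm.

```python
def invert_2x2(m):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def squared_output_error(inputs, w_true, w_quant):
    """Sum of squared differences between original and quantized layer outputs."""
    total = 0.0
    for x in inputs:
        y_true = sum(wi * xi for wi, xi in zip(w_true, x))
        y_quant = sum(wi * xi for wi, xi in zip(w_quant, x))
        total += (y_true - y_quant) ** 2
    return total

# Hypothetical calibration inputs (rows) for a layer with two input features.
inputs = [(1.0, 1.0), (2.0, 2.0), (1.0, 0.0)]
weights = [0.4, 0.3]

# Hessian of the layer-wise squared error, up to a constant factor: sum of x x^T.
hessian = [[sum(x[i] * x[j] for x in inputs) for j in range(2)] for i in range(2)]
h_inv = invert_2x2(hessian)

# Baseline: round every weight to the integer grid independently.
naive = [float(round(w)) for w in weights]

# Compensated: quantize weight 0, then shift weight 1 to absorb the error.
q0 = float(round(weights[0]))
error = (weights[0] - q0) / h_inv[0][0]
w1_adjusted = weights[1] - error * h_inv[1][0]
compensated = [q0, float(round(w1_adjusted))]

print("naive error:      ", squared_output_error(inputs, weights, naive))
print("compensated error:", squared_output_error(inputs, weights, compensated))
```

Because the two input features are strongly correlated in this example, the second weight can absorb most of the error introduced by rounding the first one.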

AWQ

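AWQ (Lin et al., 2024) performs activation-aware weight quantization. It uses activation statistics from a small calibration set to identify the small fraction of weight channels that matter most, then rescales those channels before quantization so they lose less precision. It needs no backpropagation or weight reconstruction.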
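A toy illustration of the scaling idea, assuming a single output neuron and a hand-picked scale. Real AWQ searches for per-channel scales from activation statistics and folds the inverse scale into the preceding operation; here the inverse is simply applied to the dequantized weights, which is mathematically equivalent for this example. All names are hypothetical.

```python
def quantize_dequantize(weights, bits=4):
    """Symmetric absmax quantization followed by dequantization."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / qmax
    return [round(w / scale) * scale for w in weights]

def layer_output(weights, activations):
    """Dot product of one weight row with one activation vector."""
    return sum(w * x for w, x in zip(weights, activations))

def awq_style_quantize(weights, channel_scales, bits=4):
    """Scale channels up before quantization, then undo the scaling afterwards."""
    scaled = [w * s for w, s in zip(weights, channel_scales)]
    restored = quantize_dequantize(scaled, bits)
    return [r / s for r, s in zip(restored, channel_scales)]

weights = [0.02, 1.0, -0.4, 0.3]
activations = [50.0, 1.0, 1.0, 1.0]     # channel 0 sees much larger activations
channel_scales = [8.0, 1.0, 1.0, 1.0]   # hand-picked for illustration

reference = layer_output(weights, activations)
plain_error = abs(reference - layer_output(quantize_dequantize(weights), activations))
awq_error = abs(
    reference - layer_output(awq_style_quantize(weights, channel_scales), activations)
)
print("plain 4-bit error:", plain_error)
print("scaled 4-bit error:", awq_error)
```

Without scaling, the small weight on the high-activation channel rounds to zero and its large contribution to the output is lost; scaling it up first lets it land on a non-zero grid point.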

GGML/GGUF

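GGML is a tensor library for running models efficiently on CPUs, and GGUF is the model file format used by llama.cpp that succeeded the original GGML format. GGUF files store weights in block-quantized types (for example 4-bit and 8-bit variants) and are designed for CPU inference, with optional GPU offload.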

BitsAndBytes Integration

BitsAndBytes provides easy-to-use quantization for PyTorch models:

```python
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    Trainer,
)
from peft import LoraConfig, get_peft_model

# 4-bit quantization config (QLoRA)
bnb_config_4bit = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # NormalFloat 4-bit data type
    bnb_4bit_compute_dtype=torch.float16,  # dtype used for matmuls after dequantization
    bnb_4bit_use_double_quant=True,        # also quantize the quantization constants
)

# Load quantized model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config_4bit,
    device_map="auto",
)

# 8-bit quantization config; there is no 8-bit counterpart to bnb_4bit_compute_dtype
bnb_config_8bit = BitsAndBytesConfig(
    load_in_8bit=True,
)

# Alternative to the 4-bit model above; loading both keeps two copies in memory
model_8bit = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config_8bit,
    device_map="auto",
)
```

QLoRA Training Example

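```python
from peft import prepare_model_for_kbit_training

# Prepare the quantized model for training (casts layer norms to FP32 and
# enables gradient checkpointing by default)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# For Llama-2-7B (32 layers, hidden size 4096) this config trains
# 32 layers * 2 modules * 16 * (4096 + 4096) = 8,388,608 LoRA parameters.
# The reported total depends on how your PEFT version counts packed 4-bit weights.

training_args = TrainingArguments(
    output_dir="./qlora-output",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    warmup_steps=100,
    max_steps=1000,
    fp16=True,
    optim="paged_adamw_32bit",
    logging_steps=50,
)

# tokenized_dataset is assumed to be prepared beforehand (input_ids and labels)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
)

trainer.train()
```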
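The trainable-parameter count for a LoRA configuration can be estimated by hand: each adapted linear layer gains two low-rank matrices with `r * (in_features + out_features)` parameters in total. The sketch below works this out under the assumption of Llama-2-7B dimensions (32 layers, 4096-wide `q_proj` and `v_proj`); the function name is hypothetical, and the exact figure printed by PEFT depends on the model you load.

```python
def lora_trainable_params(rank, in_features, out_features, modules_per_layer, num_layers):
    """Parameters added by LoRA: an (in x r) and an (r x out) matrix per adapted module."""
    per_module = rank * (in_features + out_features)
    return per_module * modules_per_layer * num_layers

count = lora_trainable_params(
    rank=16,
    in_features=4096,     # assumed hidden size of Llama-2-7B
    out_features=4096,
    modules_per_layer=2,  # q_proj and v_proj
    num_layers=32,        # assumed number of transformer layers
)

# Rough training state per adapter parameter, assuming FP32 weights, FP32
# gradients, and two FP32 AdamW moments: 4 + 4 + 8 = 16 bytes.
adapter_state_mb = count * 16 / 1e6

print(f"trainable LoRA parameters: {count:,}")
print(f"adapter weights + grads + optimizer state: ~{adapter_state_mb:.0f} MB")
```

Under these assumptions the adapters and their optimizer state take on the order of a hundred megabytes, which is why the frozen 4-bit base weights and the activations dominate QLoRA memory use.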

## Memory Savings Calculation

<MathFormula
  title="Memory Savings"
  tex={`\\text{Savings} = \\left(1 - \\frac{\\text{bits}}{16}\\right) \\times 100\\%`}
/>

| Quantization | Bits | Memory per Param | 7B Model Size | Savings vs FP16 |
|-------------|------|-------------------|---------------|-----------------|
| FP32 | 32 | 4 bytes | 28 GB | - |
| FP16 | 16 | 2 bytes | 14 GB | Baseline |
| INT8 | 8 | 1 byte | 7 GB | 50% |
| INT4/NF4 (weights only) | 4 | 0.5 bytes | 3.5 GB | 75% |
| NF4 + double quantization (incl. constants) | ~4.13 | ~0.52 bytes | ~3.6 GB | ~74% |
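The idealized rows of the table follow directly from the savings formula. A short sketch that recomputes them, ignoring quantization constants and any runtime overhead (the function names are illustrative):

```python
def model_size_gb(num_params, bits_per_param):
    """Weight storage in GB (10^9 bytes), ignoring quantization constants."""
    return num_params * bits_per_param / 8 / 1e9

def savings_vs_fp16(bits_per_param):
    """Percentage saved relative to 16-bit weights; negative means larger."""
    return (1 - bits_per_param / 16) * 100

for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4/NF4", 4)]:
    size = model_size_gb(7e9, bits)
    print(f"{name:9s} {size:5.1f} GB  {savings_vs_fp16(bits):7.1f}%")
```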

<MathNote type="tip">
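Double quantization (QLoRA's bnb_4bit_use_double_quant=True) quantizes the quantization constants themselves, cutting their overhead from about 0.5 to about 0.127 bits per parameter. That saves roughly 0.37 bits per parameter (about 0.05 GB per billion parameters, or around 3 GB for a 65B model) with negligible quality loss.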
</MathNote>
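The overhead of the quantization constants can be worked out per parameter. The sketch below assumes the block sizes described for QLoRA: one FP32 absmax constant per 64 weights, and with double quantization, 8-bit constants plus one FP32 second-level constant per 256 of them. Function names are hypothetical.

```python
def constant_overhead_bits(block_size=64, constant_bits=32):
    """Extra bits per weight when each block stores one full-precision constant."""
    return constant_bits / block_size

def double_quant_overhead_bits(
    block_size=64,
    quantized_constant_bits=8,
    second_block_size=256,
    second_constant_bits=32,
):
    """Extra bits per weight when the constants are themselves quantized."""
    first_level = quantized_constant_bits / block_size
    second_level = second_constant_bits / (block_size * second_block_size)
    return first_level + second_level

plain = constant_overhead_bits()
double = double_quant_overhead_bits()
saved_bits = plain - double
saved_gb_7b = saved_bits * 7e9 / 8 / 1e9

print(f"without double quantization: {plain:.3f} bits/param")
print(f"with double quantization:    {double:.3f} bits/param")
print(f"saved: {saved_bits:.3f} bits/param, ~{saved_gb_7b:.2f} GB on a 7B model")
```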

### Practical Example: Fine-tuning 7B on Consumer GPU

```python
# An RTX 3060 (12 GB VRAM) can typically fine-tune Llama-2-7B with QLoRA
# at small batch sizes and moderate sequence lengths.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Quantize and load
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)

# Prepare the quantized model for training, then add LoRA adapters
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# Memory usage: roughly 5 GB for the 4-bit model, LoRA weights, and optimizer states.
# Activations come on top and grow with batch size and sequence length,
# so keep both small to stay within 12 GB.
```

Practice Exercises

Mathematical: Calculate the VRAM required to load a 13B parameter model in NF4 with double quantization. Include the KV cache for sequence length 2048.
Implementation: Use GPTQ to quantize a 7B model to 4-bit and measure the perplexity degradation on WikiText-2.
Analysis: Compare the quality of NF4 vs INT4 quantization for models of different sizes (1B, 7B, 13B). At what model size does the degradation from 4-bit quantization become negligible on your benchmark?
Research: Investigate mixed-precision quantization strategies. Can you achieve better quality by using 8-bit for sensitive layers and 4-bit for others?

What to Learn Next

-> LoRA and PEFT: Efficient fine-tuning without full retraining using low-rank adaptation.

-> Fine-Tuning LLMs: Customizing language models for your specific tasks and domains.

-> LLM Inference Optimization: Speeding up model inference for production deployment.

-> LLM Safety and Red Teaming: Testing and hardening LLMs against adversarial attacks.

-> Building Production LLM Apps: From prototype to production, deploying LLMs at scale.

-> Open-Source LLM Ecosystem: Navigating the landscape of open-weight models and communities.
