# QLoRA and Quantization
Quantization reduces model memory by representing weights with fewer bits. This tutorial covers the theory and practice of quantization for LLMs, enabling deployment on consumer hardware.
**Definition: Quantization**
Quantization is the process of mapping continuous or high-precision values to a discrete set of lower-precision values. For neural networks, this typically means converting FP32/FP16 weights to INT8, INT4, or other low-bit formats.
## Quantization Formats
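### FP16 (Half Precision)
<MathFormula
title="FP16 Range"
tex="\\text{FP16} \\in [-65504,\\ 65504]"
variables={[
{ symbol: "\\text{FP16}", description: "16-bit floating point (1 sign + 5 exponent + 10 mantissa bits)" }
]}
/>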
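### BF16 (Brain Float 16)
<MathFormula
title="BF16 Range"
tex="\\text{BF16} \\in [-3.39 \\times 10^{38},\\ 3.39 \\times 10^{38}]"
variables={[
{ symbol: "\\text{BF16}", description: "16-bit brain float (1 sign + 8 exponent + 7 mantissa bits)" }
]}
/>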
BF16 has the same dynamic range as FP32 but lower precision. It is preferred for training because it reduces overflow/underflow. FP16 has higher precision but narrower range, requiring loss scaling.
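The range difference follows directly from the bit layouts. As a rough check, the sketch below derives the largest finite value of each format from its exponent and mantissa widths. It assumes an IEEE-style encoding in which the top exponent code is reserved for infinity and NaN; the helper name `max_finite` is illustrative and not part of any library.

```python
def max_finite(exponent_bits: int, mantissa_bits: int) -> float:
    """Largest finite value of an IEEE-style binary floating-point format."""
    bias = 2 ** (exponent_bits - 1) - 1
    max_exponent = bias  # the top exponent code is reserved for inf/NaN
    max_significand = 2 - 2 ** -mantissa_bits  # 1.111...1 in binary
    return max_significand * 2.0 ** max_exponent


fp16_max = max_finite(exponent_bits=5, mantissa_bits=10)
bf16_max = max_finite(exponent_bits=8, mantissa_bits=7)
fp32_max = max_finite(exponent_bits=8, mantissa_bits=23)

print(f"FP16 max: {fp16_max}")
print(f"BF16 max: {bf16_max:.4e}")
print(f"FP32 max: {fp32_max:.4e}")
```

BF16 shares FP32's 8 exponent bits, so the two maxima differ only slightly, while FP16 tops out several orders of magnitude earlier.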
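### INT8 Quantization
<MathFormula
title="INT8 Quantization"
tex="x_q = \\text{round}\\left(\\frac{127}{\\text{absmax}(x)} \\cdot x\\right)"
variables={[
{ symbol: "x", description: "Original FP16/BF16 value" },
{ symbol: "x_q", description: "Quantized INT8 value" },
{ symbol: "\\text{absmax}(x)", description: "Absolute maximum value for scaling" }
]}
/>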
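A minimal sketch of absmax INT8 quantization in plain Python may make the scaling step concrete. The function names and the sample weights are illustrative assumptions; production libraries apply the same idea per row or per block on tensors.

```python
def quantize_absmax_int8(values):
    """Symmetric absmax quantization onto the integer range [-127, 127]."""
    absmax = max(abs(v) for v in values)
    scale = 127.0 / absmax if absmax > 0 else 1.0
    quantized = [round(v * scale) for v in values]
    return quantized, scale


def dequantize_int8(quantized, scale):
    """Map integer codes back to approximate real values."""
    return [q / scale for q in quantized]


weights = [0.42, -1.27, 0.03, 0.9]  # hypothetical weights
q, scale = quantize_absmax_int8(weights)
restored = dequantize_int8(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("codes:   ", q)
print("restored:", restored)
print("max abs error:", max_error)
```

The rounding error per value is bounded by half a quantization step, that is `0.5 / scale`.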
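### INT4 Quantization
<MathFormula
title="INT4 Quantization"
tex="x_q = \\text{round}\\left(\\frac{x - x_{\\min}}{x_{\\max} - x_{\\min}} \\cdot 15\\right)"
variables={[
{ symbol: "x", description: "Original value" },
{ symbol: "x_q", description: "Quantized INT4 value (0-15)" },
{ symbol: "x_{\\min}, x_{\\max}", description: "Range bounds for quantization" }
]}
/>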
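The same idea with range bounds instead of an absolute maximum can be sketched as follows. This is an illustrative min-max (asymmetric) scheme that maps values onto the 16 unsigned codes 0-15; the function names and sample values are assumptions for the example.

```python
def quantize_minmax_int4(values):
    """Asymmetric min-max quantization to unsigned 4-bit codes (0..15)."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / 15 if hi > lo else 1.0
    codes = [round((v - lo) / step) for v in values]
    return codes, lo, step


def dequantize_int4(codes, lo, step):
    """Map 4-bit codes back to approximate real values."""
    return [lo + c * step for c in codes]


weights = [-0.8, -0.1, 0.0, 0.3, 0.7]  # hypothetical weights
codes, lo, step = quantize_minmax_int4(weights)
restored = dequantize_int4(codes, lo, step)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("codes:   ", codes)
print("restored:", restored)
print("max abs error:", max_error)
```

With only 16 levels the step size is much coarser than in INT8, which is why 4-bit schemes usually quantize small blocks of weights separately.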
### NormalFloat 4-bit (NF4)
**Definition: NF4**
NF4 is an information-theoretically optimal 4-bit data type for normally distributed data. It uses quantile-based quantization where each quantization bin has equal probability mass under a standard normal distribution.
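<MathFormula
title="NF4 Quantization Levels"
tex="q_i = \\frac{1}{2}\\left(\\Phi^{-1}\\left(\\frac{i}{2^k + 1}\\right) + \\Phi^{-1}\\left(\\frac{i + 1}{2^k + 1}\\right)\\right)"
variables={[
{ symbol: "q_i", description: "Quantization level i" },
{ symbol: "\\Phi^{-1}", description: "Inverse normal CDF" },
{ symbol: "k", description: "Number of bits (4 for NF4)" }
]}
/>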
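To illustrate quantile-based levels, the sketch below uses a simplified variant: it places one level at the median of each of the 2^k equal-probability bins of a standard normal and rescales the result to [-1, 1]. This is an assumption made for clarity and does not reproduce the exact NF4 lookup table shipped with bitsandbytes, which is constructed asymmetrically so that zero is represented exactly. The function names are illustrative.

```python
from statistics import NormalDist


def quantile_levels(bits: int = 4):
    """Levels at the medians of 2**bits equal-probability bins of N(0, 1),
    rescaled so that the outermost levels sit at -1 and +1."""
    n = 2 ** bits
    inv_cdf = NormalDist().inv_cdf
    raw = [inv_cdf((i + 0.5) / n) for i in range(n)]
    largest = max(abs(v) for v in raw)
    return [v / largest for v in raw]


def quantize_to_levels(values, levels):
    """Normalize by absmax, then snap each value to its nearest level index."""
    absmax = max(abs(v) for v in values) or 1.0
    codes = [
        min(range(len(levels)), key=lambda i: abs(v / absmax - levels[i]))
        for v in values
    ]
    return codes, absmax


levels = quantile_levels(4)
codes, absmax = quantize_to_levels([0.02, -0.5, 0.13, 0.31], levels)
restored = [levels[c] * absmax for c in codes]

print([round(level, 3) for level in levels])
print(codes, restored)
```

The levels are packed densely around zero and spread out toward the tails, matching where normally distributed weights concentrate.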
## Quantization Methods
### GPTQ
GPTQ (Frantar et al., 2023) performs post-training quantization, building on Optimal Brain Quantization (OBQ):
**Definition: GPTQ**
GPTQ quantizes model weights column by column, minimizing the squared error between each layer's original and quantized outputs on calibration data. After quantizing a column, it uses the inverse Hessian of this layer-wise objective to update the remaining unquantized weights and compensate for the error just introduced.
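<MathFormula
title="GPTQ Objective"
tex="\\arg\\min_{\\hat{W}} \\left\\| W X - \\hat{W} X \\right\\|_2^2"
variables={[
{ symbol: "W", description: "Original weight matrix" },
{ symbol: "\\hat{W}", description: "Quantized weight matrix" },
{ symbol: "X", description: "Input activations (calibration data)" }
]}
/>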
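The objective itself is easy to evaluate on a toy example. The sketch below measures the layer-output reconstruction error for a plain round-to-nearest baseline; it does not implement GPTQ's Hessian-based weight updates, and the matrices and the grid step are made-up values for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a_ik * b_kj for a_ik, b_kj in zip(row, col)) for col in zip(*b)]
        for row in a
    ]


def round_to_grid(w, step):
    """Round-to-nearest baseline: snap every weight to a multiple of `step`."""
    return [[round(v / step) * step for v in row] for row in w]


def reconstruction_error(w, w_hat, x):
    """Squared Frobenius norm of W X - W_hat X."""
    y, y_hat = matmul(w, x), matmul(w_hat, x)
    return sum(
        (a - b) ** 2 for row_a, row_b in zip(y, y_hat) for a, b in zip(row_a, row_b)
    )


W = [[0.31, -0.12, 0.48], [0.07, 0.26, -0.39]]  # 2 outputs x 3 inputs
X = [[1.0, 0.5], [-2.0, 0.1], [0.3, 4.0]]       # 3 inputs x 2 calibration samples
W_hat = round_to_grid(W, step=0.1)
error = reconstruction_error(W, W_hat, X)

print("quantized weights:", W_hat)
print("output reconstruction error:", error)
```

Because the error is measured on `W X` rather than on `W` alone, weights that multiply large activations contribute more, which is what the calibration data captures.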
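### AWQ
AWQ (Lin et al., 2024) performs activation-aware weight quantization:
**Definition: AWQ**
AWQ identifies important (salient) weight channels based on activation magnitudes and protects them without resorting to mixed precision. It scales these channels up before quantization, and applies the inverse scale to the corresponding activations, so that the important weights lose less information to rounding.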
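<MathFormula
title="AWQ Scaling"
tex="\\hat{W} = Q(W \\cdot s)"
variables={[
{ symbol: "W", description: "Original weight" },
{ symbol: "s", description: "Scale factor (larger for important channels)" },
{ symbol: "\\hat{W}", description: "Quantized weight, where Q is the quantization function" }
]}
/>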
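A toy example shows why scaling helps. In the sketch below, one small weight is assumed to be salient; scaling it up before round-to-nearest quantization and dividing the scale back out afterwards reduces its rounding error. The scale values are hand-picked for illustration, whereas AWQ searches for them using activation statistics, and in a real model the inverse scale is folded into the activations rather than into the weights.

```python
def quantize_symmetric(values, levels=7):
    """Round-to-nearest onto a symmetric integer grid shared by all values."""
    absmax = max(abs(v) for v in values)
    step = absmax / levels
    return [round(v / step) * step for v in values]


def awq_style_quantize(weights, scales):
    """Scale each channel up, quantize, then fold the scale back out."""
    scaled = [w * s for w, s in zip(weights, scales)]
    quantized = quantize_symmetric(scaled)
    return [q / s for q, s in zip(quantized, scales)]


weights = [0.9, -0.7, 0.05, 0.3]  # channel 2 is small but assumed salient
plain = quantize_symmetric(weights)
scaled = awq_style_quantize(weights, scales=[1.0, 1.0, 2.0, 1.0])

salient = 2
error_plain = abs(weights[salient] - plain[salient])
error_scaled = abs(weights[salient] - scaled[salient])

print("error without scaling:", error_plain)
print("error with scaling:   ", error_scaled)
```

Without scaling, the salient weight rounds to zero; with a scale of 2 it lands on the first nonzero grid point and keeps most of its value.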
### GGML/GGUF
GGML and GGUF are model file formats with built-in quantization types, designed primarily for CPU inference.
GGUF (GGML Unified Format) is the successor to GGML and is used by llama.cpp. It supports multiple quantization types (Q4_0, Q4_K_M, Q5_K_M, etc.) with mixed precision across layers.
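Block-wise formats store a scale alongside each group of weights, so their effective cost is slightly above the nominal bit width. The sketch below works this out for an assumed Q4_0-style layout of 32 weights per block with 4-bit codes and one FP16 scale; the helper name is illustrative, and other quantization types use different block layouts.

```python
def bits_per_weight(block_size: int, weight_bits: int, scale_bits: int) -> float:
    """Effective storage cost when each block of weights shares one scale."""
    total_bits = block_size * weight_bits + scale_bits
    return total_bits / block_size


# Assumed Q4_0-style layout: 32 weights per block, 4-bit codes, one FP16 scale.
q4_0_bpw = bits_per_weight(block_size=32, weight_bits=4, scale_bits=16)
size_gb = 7e9 * q4_0_bpw / 8 / 1e9  # rough size for a 7B-parameter model

print(f"{q4_0_bpw} bits per weight, about {size_gb:.2f} GB for 7B parameters")
```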
## BitsAndBytes Integration
BitsAndBytes provides easy-to-use quantization for PyTorch models:
```python
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    Trainer,
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit quantization config (QLoRA)
bnb_config_4bit = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

# Load quantized model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config_4bit,
    device_map="auto",
)

# 8-bit quantization config (LLM.int8()).
# BitsAndBytesConfig has no 8-bit compute-dtype option, so none is set here.
bnb_config_8bit = BitsAndBytesConfig(load_in_8bit=True)

model_8bit = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config_8bit,
    device_map="auto",
)
```
### QLoRA Training Example
```python
# Continues from the 4-bit `model` loaded above.
# Prepare the quantized model for training before attaching adapters.
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# For Llama-2-7B (32 layers, hidden size 4096), r=16 on q_proj and v_proj
# gives 8,388,608 trainable parameters. The reported total depends on how
# the installed PEFT version counts packed 4-bit weights.

training_args = TrainingArguments(
    output_dir="./qlora-output",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    warmup_steps=100,
    max_steps=1000,
    fp16=True,
    optim="paged_adamw_32bit",
    logging_steps=50,
)

# `tokenized_dataset` is assumed to be a tokenized training dataset
# prepared beforehand (for example with AutoTokenizer).
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
)
trainer.train()
```
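The number of trainable parameters for the LoRA configuration above (r=16 on `q_proj` and `v_proj`) can be estimated by hand. The sketch below assumes the Llama-2-7B shapes of hidden size 4096 and 32 decoder layers, with both projections mapping 4096 to 4096 features; the helper name is illustrative.

```python
def lora_trainable_params(r, in_features, out_features, num_modules):
    """Each adapted linear layer adds A (r x in) and B (out x r)."""
    per_module = r * in_features + out_features * r
    return per_module * num_modules


# Assumed Llama-2-7B shapes: hidden size 4096, 32 decoder layers,
# with q_proj and v_proj adapted in every layer (2 modules per layer).
trainable = lora_trainable_params(
    r=16, in_features=4096, out_features=4096, num_modules=2 * 32
)
print(f"trainable LoRA parameters: {trainable:,}")
```

Doubling `r` or adding more target modules scales this count linearly, which is a quick way to budget adapter size before training.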
## Memory Savings Calculation
<MathFormula
title="Memory Savings"
tex="\\text{Savings} = \\left(1 - \\frac{\\text{bits}}{16}\\right) \\times 100\\%"
variables={[
{ symbol: "bits", description: "Target quantization bits" }
]}
/>
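| Quantization | Bits | Memory per Param | 7B Model Size | Savings vs FP16 |
|-------------|------|-------------------|---------------|-----------------|
| FP32 | 32 | 4 bytes | 28 GB | - |
| FP16 | 16 | 2 bytes | 14 GB | Baseline |
| INT8 | 8 | 1 byte | 7 GB | 50% |
| INT4/NF4 (weights only) | 4 | 0.5 bytes | 3.5 GB | 75% |
| NF4 + FP32 block constants | ~4.5 | ~0.563 bytes | ~3.94 GB | ~72% |
| NF4 + double quantization | ~4.13 | ~0.516 bytes | ~3.61 GB | ~74% |
The last two rows include the per-block quantization constants and assume the QLoRA defaults of one constant per block of 64 weights.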
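The savings formula and the model sizes in the table can be reproduced with a few lines of arithmetic. This sketch counts weights only, using a round 7 billion parameters and decimal gigabytes (1 GB = 10^9 bytes), both of which are simplifying assumptions.

```python
def savings_vs_fp16(bits: float) -> float:
    """Percentage of memory saved relative to a 16-bit baseline."""
    return (1 - bits / 16) * 100


def model_size_gb(num_params: float, bits: float) -> float:
    """Weight storage in decimal gigabytes."""
    return num_params * bits / 8 / 1e9


for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4/NF4", 4)]:
    size = model_size_gb(7e9, bits)
    print(f"{name:9s} {bits:2d} bits  {size:5.1f} GB  {savings_vs_fp16(bits):6.1f}% saved")
```

A negative percentage for FP32 simply means it uses more memory than the FP16 baseline.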
<MathNote type="tip">
Double quantization (QLoRA's bnb_4bit_use_double_quant=True) quantizes the quantization constants themselves, saving about 0.37 bits per parameter (roughly 0.05 GB per billion parameters) with negligible quality loss.
</MathNote>
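The size of this saving can be worked out from the block sizes. The sketch below assumes the defaults reported in the QLoRA paper: one FP32 constant per block of 64 weights without double quantization, and with it, 8-bit constants that are themselves grouped in blocks of 256 with one FP32 constant each. The function names are illustrative.

```python
def constant_overhead_bits(block_size=64, const_bits=32):
    """Bits per parameter spent on one scaling constant per block."""
    return const_bits / block_size


def double_quant_overhead_bits(block_size=64, const_bits=8,
                               second_block_size=256, second_const_bits=32):
    """8-bit first-level constants plus FP32 second-level constants."""
    first = const_bits / block_size
    second = second_const_bits / (block_size * second_block_size)
    return first + second


plain = constant_overhead_bits()
double = double_quant_overhead_bits()
saved_bits = plain - double
saved_gb_per_billion = saved_bits / 8  # 1e9 params * bits / 8 bits-per-byte / 1e9

print(f"overhead without double quantization: {plain:.3f} bits/param")
print(f"overhead with double quantization:    {double:.3f} bits/param")
print(f"saved: {saved_bits:.3f} bits/param, {saved_gb_per_billion:.3f} GB per billion params")
```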
### Practical Example: Fine-tuning 7B on Consumer GPU
```python
# An RTX 3060 (12GB VRAM) can typically fine-tune Llama-2-7B with QLoRA
# at small batch sizes and moderate sequence lengths.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Quantize and load
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)

# Add LoRA
lora_config = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"])
model = get_peft_model(model, lora_config)

# Memory usage: roughly 5GB for the 4-bit model + LoRA + optimizer states.
# Activations add to this depending on batch size and sequence length,
# but the total typically fits in 12GB VRAM.
```
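A back-of-the-envelope estimate shows where the memory goes. The sketch below covers only the static part: quantized base weights plus adapter weights, gradients, and Adam moments. The 4.127 bits per parameter, the 16 bytes per trainable parameter, and the adapter size are all assumptions; activations, unquantized layers such as embeddings, and CUDA overhead are not included.

```python
def qlora_static_memory_gb(num_params, bits_per_param, trainable_params,
                           bytes_per_trainable=16):
    """Base weights plus adapter training state, excluding activations.

    bytes_per_trainable assumes FP32 adapter weights (4 bytes),
    gradients (4 bytes), and two Adam moments (8 bytes).
    """
    base_bytes = num_params * bits_per_param / 8
    adapter_bytes = trainable_params * bytes_per_trainable
    return (base_bytes + adapter_bytes) / 1e9


estimate = qlora_static_memory_gb(
    num_params=7e9,
    bits_per_param=4.127,        # NF4 with double quantization (assumed)
    trainable_params=8_388_608,  # r=16 on q_proj and v_proj (assumed shapes)
)
print(f"static memory estimate: {estimate:.2f} GB")
```

The gap between this figure and the available VRAM is the budget left for activations, which grows with batch size and sequence length.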
## Practice Exercises
1. **Mathematical**: Calculate the VRAM required to load a 13B parameter model in NF4 with double quantization. Include the KV cache for sequence length 2048.
2. **Implementation**: Use GPTQ to quantize a 7B model to 4-bit and measure the perplexity degradation on WikiText-2.
3. **Analysis**: Compare the quality of NF4 vs INT4 quantization for models of different sizes (1B, 7B, 13B). At what model size does 4-bit quantization become nearly lossless, meaning the degradation is within noise of the FP16 baseline?
4. **Research**: Investigate mixed-precision quantization strategies. Can you achieve better quality by using 8-bit for sensitive layers and 4-bit for others?
<MathSummary>
**Key Takeaways:**
- Quantization maps high-precision weights to lower-bit formats
- NF4 is optimal for normally distributed neural network weights
- GPTQ and AWQ are post-training quantization methods
- BitsAndBytes provides easy 8-bit and 4-bit (NF4/FP4) quantization for PyTorch
- QLoRA enables fine-tuning 7B models on a single consumer GPU
- 4-bit quantization saves roughly 75% memory versus FP16 (slightly less once quantization constants are counted) with minimal quality loss for models 7B+
</MathSummary>