Quantization Techniques Deep Dive — Reducing Precision Without Sacrificing Quality
Quantization enables running billion-parameter models on consumer hardware. This deep dive covers GPTQ, AWQ, GGUF, and INT4/INT8 methods—theory, implementation, and when to use each approach.
- Post-Training Quantization — GPTQ and AWQ for minimal quality loss
- Quantization Formats — INT8, INT4, NF4, and GGUF variants
- Quantization-Aware Training — QLoRA for fine-tuning quantized models
- Practical Trade-offs — Quality, speed, and memory comparisons
The art of quantization is knowing what you can throw away without anyone noticing.
Quantization reduces model memory by representing weights with fewer bits. While the concept is simple—map high-precision values to low-precision bins—the implementation details significantly affect quality. This guide covers the major quantization methods and their practical trade-offs.
Definition: Quantization
Quantization is the process of mapping a large set of values to a smaller set. For neural networks, this typically means converting FP32/FP16 weights to INT8, INT4, or other low-bit formats, reducing memory by 2-8× with varying quality impact.
Quantization Theory
Uniform Quantization
$$q = \mathrm{round}\left(\frac{x}{s}\right) + z$$
Here,
- $x$ = Original floating-point value
- $q$ = Quantized integer value
- $s$ = Scale factor (step size)
- $z$ = Zero-point offset
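Quantization Parameters
$$s = \frac{x_{\max} - x_{\min}}{2^b - 1}, \qquad z = \mathrm{round}\left(-\frac{x_{\min}}{s}\right)$$
Here,
- $b$ = Number of bits (e.g., 8 for INT8)
- $s$ = Scale factor
- $z$ = Zero-point
- $x_{\min}, x_{\max}$ = Smallest and largest values in the tensor being quantized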
Quantization Error
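$$\epsilon = x - \hat{x}, \qquad \hat{x} = s\,(q - z)$$
Here,
- $\epsilon$ = Quantization error
- $x$ = Original value
- $\hat{x}$ = Reconstructed value
For values inside the representable range, the rounding error is bounded by ±s/2; values outside the range are clipped and can incur a larger error. If weights are assumed to be spread evenly within each bin, the error is approximately uniform in [-s/2, s/2], which gives a mean squared error of s²/12. Symmetric quantization is the special case z=0. The total error depends on the weight distribution and the chosen quantization range.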
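As a rough illustration of the scale, zero-point, and error bound described above, the following sketch quantizes a handful of made-up weights to 4 bits with min-max calibration. The function names and the example values are assumptions chosen for illustration, not part of any library.

```python
def quant_params(values, n_bits=8):
    """Min-max scale and zero-point for asymmetric uniform quantization."""
    x_min, x_max = min(values), max(values)
    scale = (x_max - x_min) / (2**n_bits - 1)
    zero_point = round(-x_min / scale)
    return scale, zero_point


def quantize(values, scale, zero_point, n_bits=8):
    """Map floats to integer codes in [0, 2^b - 1], clipping out-of-range values."""
    q_max = 2**n_bits - 1
    return [min(max(round(x / scale) + zero_point, 0), q_max) for x in values]


def dequantize(codes, scale, zero_point):
    """Reconstruct approximate floats from the integer codes."""
    return [scale * (q - zero_point) for q in codes]


weights = [-0.62, -0.10, 0.0, 0.07, 0.33, 0.91]  # made-up example values
scale, zero_point = quant_params(weights, n_bits=4)
codes = quantize(weights, scale, zero_point, n_bits=4)
restored = dequantize(codes, scale, zero_point)
max_error = max(abs(x - r) for x, r in zip(weights, restored))

# For this input no value is clipped, so the largest error stays within s/2
print(codes, max_error, scale / 2)
```

With only 16 levels, nearby weights such as 0.0 and 0.07 land on adjacent codes, which is where most of the quality loss at low bit widths comes from.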
Group Quantization
Definition: Group Quantization
Group quantization divides weight tensors into groups, each with its own scale factor. This allows different groups to have different dynamic ranges, improving accuracy for weights with varying distributions.
$$s_g = \frac{\max(W_g) - \min(W_g)}{2^b - 1}$$
Here,
- $s_g$ = Scale factor for group g
- $W_g$ = Weights in group g
- $g$ = Group index
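To see why per-group scales help, the sketch below compares one scale for a whole row against one scale per group of four values. The row is made up so that a small-magnitude group sits next to a large-magnitude group; all helper names are hypothetical.

```python
def fake_quantize(values, n_bits=4):
    """Quantize then dequantize `values` with a single min-max scale."""
    x_min, x_max = min(values), max(values)
    q_max = 2**n_bits - 1
    scale = (x_max - x_min) / q_max or 1.0  # fall back to 1.0 for constant input
    zero_point = round(-x_min / scale)
    codes = [min(max(round(x / scale) + zero_point, 0), q_max) for x in values]
    return [scale * (q - zero_point) for q in codes]


def group_fake_quantize(values, group_size, n_bits=4):
    """Apply fake_quantize independently to consecutive groups of values."""
    out = []
    for start in range(0, len(values), group_size):
        out.extend(fake_quantize(values[start:start + group_size], n_bits))
    return out


def mse(a, b):
    """Mean squared error between two equally long lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Made-up row: a small-magnitude group followed by a large-magnitude group
row = [0.01, -0.02, 0.03, -0.01, 1.5, -2.0, 0.8, -1.1]
per_tensor_mse = mse(row, fake_quantize(row))
per_group_mse = mse(row, group_fake_quantize(row, group_size=4))
print(per_tensor_mse, per_group_mse)
```

With a single scale, the four small weights all collapse to the same code; with per-group scales they keep distinct values. The price is one extra scale (and zero-point) stored per group.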
GPTQ (Post-Training Quantization)
Theory
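Definition: GPTQ
GPTQ (Frantar et al., 2023) builds on Optimal Brain Quantization: it uses the inverse Hessian of the layer-wise reconstruction error, estimated from calibration inputs, to decide how to correct for rounding. It quantizes weights column by column and, after each column, adjusts the not-yet-quantized columns to compensate for the error just introduced.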
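GPTQ Objective
$$\hat{W} = \arg\min_{\hat{W}} \left\lVert W X - \hat{W} X \right\rVert_2^2, \qquad H = 2 X X^\top + \lambda I$$
Here,
- $W$ = Original weight matrix
- $\hat{W}$ = Quantized weight matrix
- $X$ = Input activations (calibration data)
- $\lambda$ = Regularization parameter
- $H$ = Damped Hessian of the objective with respect to one row of weights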
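The error-compensation idea can be checked on a tiny example. The sketch below uses NumPy with made-up data: it quantizes a single weight of one row, then compares leaving the other weights untouched against the Hessian-based update of the remaining weights. The grid step of 0.5 and the damping value are arbitrary assumptions, and the constant factor 2 in the Hessian is dropped because it cancels in the update.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 64))        # (in_features, n_samples), made-up calibration inputs
w = rng.normal(size=4)              # one row of the weight matrix
H = X @ X.T + 1e-2 * np.eye(4)      # damped Hessian (factor 2 dropped)
H_inv = np.linalg.inv(H)


def output_error(w_new):
    """Squared error of this row's output: ||w X - w_new X||^2."""
    return float(np.sum((w @ X - w_new @ X) ** 2))


# Quantize only the first weight to a coarse grid with step 0.5
q0 = np.round(w[0] / 0.5) * 0.5

# Option 1: plain rounding, other weights untouched
w_plain = w.copy()
w_plain[0] = q0

# Option 2: spread the rounding error over the remaining weights
err = (w[0] - q0) / H_inv[0, 0]
w_comp = w - err * H_inv[0, :]      # first entry becomes exactly q0

print(output_error(w_plain), output_error(w_comp))
```

The compensated row has the same quantized first weight but a smaller output error, because the other weights shift to absorb part of the rounding error. GPTQ repeats this step for every column and quantizes the updated weights in turn.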
Implementation
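import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig


class GPTQQuantizer:
    """Simplified GPTQ quantization for illustration (not the optimized reference code)."""

    def __init__(self, model, tokenizer, calibration_data):
        self.model = model
        self.tokenizer = tokenizer
        # Assumed to be an iterable of tokenized batches (dicts accepted by model(**batch))
        self.calibration_data = calibration_data

    def quantize_layer(self, layer, n_bits=4, group_size=128, damp_percent=0.01):
        """Quantize a single linear layer using GPTQ."""
        # (out_features, in_features); clone so the layer's weights are not modified in place
        W = layer.weight.data.clone().float()
        n_rows, n_cols = W.shape
        q_max = 2**n_bits - 1

        # Compute Hessian approximation using calibration data, damped for stability
        H = self._compute_hessian(layer, self.calibration_data)
        damp = damp_percent * torch.diag(H).mean()
        H_inv = torch.linalg.inv(H + damp * torch.eye(n_cols, device=H.device))
        # The upper Cholesky factor of H^-1 provides the update row for each column
        H_inv = torch.linalg.cholesky(H_inv, upper=True)

        # Quantize column by column
        W_quantized = torch.zeros_like(W)
        step = group_size if group_size > 0 else n_rows
        for col in range(n_cols):
            w_col = W[:, col]
            # Asymmetric min-max quantization, one (scale, zero_point) per group of rows
            for g in range(0, n_rows, step):
                group = w_col[g:g + step]
                scale = ((group.max() - group.min()) / q_max).clamp(min=1e-8)
                zero_point = (-group.min() / scale).round()
                q = (group / scale + zero_point).round().clamp(0, q_max)
                W_quantized[g:g + step, col] = (q - zero_point) * scale

            # Compensate for quantization error in remaining columns
            if col < n_cols - 1:
                error = (W[:, col] - W_quantized[:, col]) / H_inv[col, col]
                # Distribute error using the inverse Hessian
                W[:, col + 1:] -= error.unsqueeze(1) * H_inv[col, col + 1:].unsqueeze(0)

        return W_quantized

    def _compute_hessian(self, layer, data):
        """Compute Hessian approximation (X^T X averaged over batches) from calibration data."""
        H = torch.zeros(layer.in_features, layer.in_features, device=layer.weight.device)
        n_batches = 0
        for batch in data:
            x = self._get_input_activations(layer, batch)  # (tokens, in_features)
            H += x.T @ x
            n_batches += 1
        return H / max(n_batches, 1)

    def _get_input_activations(self, layer, batch):
        """Capture the inputs that reach `layer` during a forward pass of the full model."""
        captured = []
        hook = layer.register_forward_hook(
            lambda module, inputs, output: captured.append(inputs[0].detach())
        )
        try:
            with torch.no_grad():
                self.model(**batch)
        finally:
            hook.remove()
        x = captured[0]
        return x.reshape(-1, x.shape[-1]).float()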
GPTQ Configuration
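# Using HuggingFace Transformers
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "meta-llama/Llama-2-7b-hf"
# The tokenizer is needed to tokenize the calibration dataset
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Configure GPTQ quantization
gptq_config = GPTQConfig(
    bits=4,
    dataset="c4",
    tokenizer=tokenizer,
    group_size=128,
    desc_act=True,  # Quantize columns in order of decreasing activation size
    damp_percent=0.01,
    true_sequential=True
)

# Load and quantize
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=gptq_config,
    device_map="auto"
)

# Save quantized model and tokenizer
model.save_pretrained("llama-2-7b-gptq")
tokenizer.save_pretrained("llama-2-7b-gptq")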
AWQ (Activation-Aware Weight Quantization)
Theory
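Definition: AWQ
AWQ (Lin et al., 2024) identifies salient weight channels based on activation magnitudes rather than weight magnitudes. Instead of keeping those channels in higher precision, it scales them up before quantization (and scales the corresponding activations down), which lowers their relative quantization error while keeping a uniform bit width.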
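AWQ Scaling
$$\hat{W} = Q\left(W\,\mathrm{diag}(s)\right)\mathrm{diag}(s)^{-1}$$
Here,
- $W$ = Original weight matrix
- $s$ = Per-channel scale factors
- $\hat{W}$ = Quantized weight matrix
- $Q(\cdot)$ = Quantization function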
Scale Factor Selection
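AWQ Scale Optimization
$$s^* = \arg\min_{s} \left\lVert Q\left(W\,\mathrm{diag}(s)\right)\left(\mathrm{diag}(s)^{-1} X\right) - W X \right\rVert$$
Here,
- $s^*$ = Optimal scale factors
- $W$ = Weight matrix
- $X$ = Input activations from calibration data
- $Q(\cdot)$ = Quantization function
AWQ finds scale factors by minimizing the layer's output error after quantization, which is dominated by the most important channels. Important channels are identified by the magnitude of their input activations: channels with large activations contribute more to the output and should be quantized more carefully. In practice the search is restricted to $s = s_X^{\alpha}$, where $s_X$ is the mean activation magnitude per channel and $\alpha$ is chosen by grid search.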
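A small NumPy sketch of this search, on made-up weights and activations, is shown below. The per-row symmetric rounding function is only a stand-in for the real quantizer, and the layer sizes, the seed, and the factor 20 applied to two channels are arbitrary assumptions.

```python
import numpy as np


def fake_quantize_rows(W, n_bits=4):
    """Symmetric round-to-nearest with one scale per output row (stand-in quantizer)."""
    q_max = 2 ** (n_bits - 1) - 1
    scale = np.abs(W).max(axis=1, keepdims=True) / q_max
    return np.clip(np.round(W / scale), -q_max, q_max) * scale


rng = np.random.default_rng(0)
W = rng.normal(size=(16, 32))          # (out_features, in_features), made-up weights
X = rng.normal(size=(32, 256))         # (in_features, n_tokens), made-up activations
X[:2] *= 20.0                          # two input channels carry much larger activations

importance = np.abs(X).mean(axis=1)    # mean |activation| per input channel


def output_error(s):
    """Output error of the quantized layer for per-channel scales s."""
    W_hat = fake_quantize_rows(W * s) / s   # scale columns, quantize, undo the scaling
    return float(np.sum((W_hat @ X - W @ X) ** 2))


errors = {}
for alpha in [0.0, 0.25, 0.5, 0.75, 1.0]:   # alpha = 0 means no scaling
    s = importance ** alpha
    errors[alpha] = output_error(s / s.mean())

best_alpha = min(errors, key=errors.get)
print(best_alpha, errors[0.0], errors[best_alpha])
```

Because alpha = 0 is part of the grid, the selected scales can never do worse than plain rounding on the calibration data. Without quantization the scaling is exact, since `(W * s) @ (X / s[:, None])` equals `W @ X`.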
Implementation
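import torch


class AWQQuantizer:
    """AWQ-style quantization with activation-aware scaling (simplified for illustration)."""

    def __init__(self, model, calibration_data):
        self.model = model
        # Assumed to be an iterable of tokenized batches (dicts accepted by model(**batch))
        self.calibration_data = calibration_data

    def find_optimal_scales(self, layer, n_bits=4, group_size=128):
        """Find per-input-channel scales for AWQ via grid search."""
        W = layer.weight.data.float()
        n_out, n_in = W.shape

        # Compute activation importance (mean |x| per input channel)
        importance = self._compute_importance(layer).clamp(min=1e-5)

        # Grid search over the exponent alpha; alpha = 0 means no scaling
        best_error = float('inf')
        best_scales = torch.ones(n_in, device=W.device)
        for alpha in [0.0, 0.25, 0.5, 0.75, 1.0]:
            candidate_scales = importance.pow(alpha)
            candidate_scales = candidate_scales / candidate_scales.mean()

            # Quantize the scaled weights, then undo the scaling
            W_scaled = W * candidate_scales.unsqueeze(0)
            W_quant = self._quantize(W_scaled, n_bits, group_size)
            W_dequant = W_quant / candidate_scales.unsqueeze(0)

            # Weight the error by activation magnitude as a proxy for output error
            error = ((W - W_dequant) * importance.unsqueeze(0)).pow(2).mean()
            if error < best_error:
                best_error = error
                best_scales = candidate_scales
        return best_scales

    def _quantize(self, W, n_bits, group_size):
        """Asymmetric min-max fake quantization, one scale per group of input channels."""
        n_out, n_in = W.shape
        step = group_size if 0 < group_size <= n_in and n_in % group_size == 0 else n_in
        G = W.reshape(n_out, n_in // step, step)
        w_min, w_max = G.amin(dim=-1, keepdim=True), G.amax(dim=-1, keepdim=True)
        q_max = 2**n_bits - 1
        scale = ((w_max - w_min) / q_max).clamp(min=1e-8)
        zero_point = (-w_min / scale).round()
        q = (G / scale + zero_point).round().clamp(0, q_max)
        return ((q - zero_point) * scale).reshape(n_out, n_in)

    def _compute_importance(self, layer):
        """Compute channel importance from activations."""
        importance = torch.zeros(layer.in_features, device=layer.weight.device)
        n_batches = 0
        for batch in self.calibration_data:
            x = self._get_input_activations(layer, batch)
            importance += x.abs().mean(dim=0)
            n_batches += 1
        return importance / max(n_batches, 1)

    def _get_input_activations(self, layer, batch):
        """Capture the inputs that reach `layer` during a forward pass of the full model."""
        captured = []
        hook = layer.register_forward_hook(
            lambda module, inputs, output: captured.append(inputs[0].detach())
        )
        try:
            with torch.no_grad():
                self.model(**batch)
        finally:
            hook.remove()
        x = captured[0]
        return x.reshape(-1, x.shape[-1]).float()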
GGUF (GGML Unified Format)
Overview
Definition: GGUF
GGUF is a binary model file format designed for CPU and hybrid CPU+GPU inference, used by llama.cpp and other C++ inference engines. It supports multiple quantization types, and a single file can mix them across tensors.
Quantization Types
| Type | Bits | Quality | Speed | Use Case |
|---|---|---|---|---|
| Q2_K | 2 | Low | Fastest | Extreme compression |
| Q3_K_M | 3 | Moderate | Fast | Low memory |
| Q4_0 | 4 | Good | Fast | Balanced |
| Q4_K_M | 4 | Better | Fast | Recommended |
| Q5_K_M | 5 | Very Good | Moderate | High quality |
| Q6_K | 6 | Excellent | Slower | Near-lossless |
| Q8_0 | 8 | Near-lossless | Slowest | Maximum quality |
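The Bits column above lists nominal code widths; the stored size is somewhat larger because each block of weights also carries its own scale. The sketch below imitates a Q8_0-style block under the assumption of 32 weights per block with one 16-bit scale, which is the layout llama.cpp is understood to use for the `_0` types. It keeps the scale as a Python float rather than rounding it to FP16, and all function names are hypothetical.

```python
BLOCK_SIZE = 32  # weights per block (assumed, following the Q8_0 layout)


def quantize_block_q8(block):
    """Symmetric 8-bit quantization of one block: returns (scale, integer codes)."""
    amax = max(abs(x) for x in block)
    scale = amax / 127 if amax > 0 else 1.0
    return scale, [round(x / scale) for x in block]


def dequantize_block_q8(scale, codes):
    """Reconstruct approximate floats from one block."""
    return [scale * q for q in codes]


def bits_per_weight(code_bits, scale_bits=16, block_size=BLOCK_SIZE):
    """Effective storage cost once the per-block scale is included."""
    return code_bits + scale_bits / block_size


# Made-up weights in [-0.32, 0.31]
block = [((i * 37) % 64 - 32) / 100 for i in range(BLOCK_SIZE)]
scale, codes = quantize_block_q8(block)
restored = dequantize_block_q8(scale, codes)
max_error = max(abs(x - r) for x, r in zip(block, restored))

print(bits_per_weight(8), bits_per_weight(4), max_error)
```

Under these assumptions an 8-bit block costs 8.5 bits per weight and a 4-bit block 4.5 bits per weight, which is why GGUF files are slightly larger than the nominal bit width suggests.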
GGUF Conversion
# Converting to GGUF format using llama.cpp
# (newer llama.cpp releases rename these tools to convert_hf_to_gguf.py, llama-quantize, and llama-cli)
# Step 1: Convert HuggingFace model to an FP16 GGUF file
python convert.py /path/to/model --outtype f16 --outfile /path/to/model-f16.gguf
# Step 2: Quantize to desired format
./quantize /path/to/model-f16.gguf /path/to/model-q4_k_m.gguf Q4_K_M
# Step 3: Run inference
./main -m /path/to/model-q4_k_m.gguf -p "Hello, world" -n 100
GGUF's mixed-precision approach allows different tensors to use different quantization types. Sensitive tensors (such as some attention projections) can be stored in Q6_K or Q8_0 while the remaining weights use a 4-bit type, achieving better overall quality than quantizing every tensor to the same type.
INT4 vs INT8 Quantization
Quality Comparison
| Model Size | INT8 PPL | INT4 PPL | FP16 PPL | INT8 Delta | INT4 Delta |
|---|---|---|---|---|---|
| 1B | 12.1 | 12.8 | 11.9 | +0.2 | +0.9 |
| 7B | 5.72 | 5.91 | 5.68 | +0.04 | +0.23 |
| 13B | 5.48 | 5.62 | 5.45 | +0.03 | +0.17 |
| 70B | 3.12 | 3.21 | 3.10 | +0.02 | +0.11 |
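Larger models are more robust to quantization: the INT4 perplexity penalty shrinks from +0.9 at 1B to +0.11 at 70B. A 70B model quantized to INT4 also maintains better quality than a 7B model in FP16 (perplexity 3.21 versus 5.68 in the table above). This is because larger models have more redundant representations that can absorb quantization noise.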
Memory and Speed
| Format | Memory (7B) | Inference Speed | GPU Support |
|---|---|---|---|
| FP16 | 14 GB | Baseline | All GPUs |
| INT8 | 7 GB | 1.0-1.2× | A100, RTX 30xx+ |
| INT4 | 3.5 GB | 0.8-1.0× | Limited |
| NF4 | 3.5 GB | 0.7-0.9× | CUDA GPUs (via bitsandbytes) |
INT8 quantization often provides a speedup because it halves the number of bytes read per weight. For memory-bound operations (like autoregressive generation), INT8 can be up to about 20% faster than FP16 despite the dequantization overhead.
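The memory column follows directly from parameter count times bits per weight. A minimal sketch of that arithmetic follows; it counts weights only and ignores activations and the KV cache. The group size of 128 with a 16-bit scale is an assumed example configuration.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Memory for the weights alone, in GB (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {weight_memory_gb(7e9, bits):.1f} GB")

# INT4 with one 16-bit scale per group of 128 weights (assumed configuration)
int4_grouped = weight_memory_gb(7e9, 4 + 16 / 128)
print(f"INT4, group size 128: {int4_grouped:.2f} GB")
```

The per-group scales add about 0.1 GB for a 7B model, which partly explains why the 4-bit checkpoints in practice come out closer to 4 GB than to 3.5 GB.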
Quantization-Aware Training (QAT)
QLoRA
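Definition: QLoRA
QLoRA (Dettmers et al., 2023) combines NF4 quantization with LoRA adapters, enabling fine-tuning of quantized models on consumer GPUs. The base model is frozen in 4-bit NF4 (4-bit NormalFloat, a different data type from INT4), and only the LoRA parameters are trained in 16-bit precision. Strictly speaking, this is parameter-efficient fine-tuning on top of a quantized base rather than classical quantization-aware training, because the quantized weights themselves are never updated.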
import torch
from transformers import (
    AutoModelForCausalLM,
    BitsAndBytesConfig,
    TrainingArguments,
    Trainer
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# NF4 quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

# Load quantized model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto"
)

# Prepare for training
model = prepare_model_for_kbit_training(model)

# Add LoRA adapters
lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)
model = get_peft_model(model, lora_config)

# Prints trainable vs. total parameter counts. With r=64 on four 4096x4096
# projections in each of 32 layers, the adapters add
# 4 * 32 * 64 * (4096 + 4096) = 67,108,864 trainable parameters; the reported
# total depends on how the packed 4-bit weights are counted.
model.print_trainable_parameters()
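The `bnb_4bit_use_double_quant=True` flag in the config above also quantizes the per-block quantization constants. A rough sketch of the storage overhead follows, using the block sizes reported in the QLoRA paper (64 weights per block, constants re-quantized to 8 bits in blocks of 256). These values are treated here as assumptions and are not read from the bitsandbytes source.

```python
def constant_overhead_bits(block_size=64, constant_bits=32):
    """Bits per parameter spent on one quantization constant per block."""
    return constant_bits / block_size


def double_quant_overhead_bits(block_size=64, second_block_size=256,
                               quantized_constant_bits=8, second_constant_bits=32):
    """Overhead when the per-block constants are themselves quantized in blocks."""
    first_level = quantized_constant_bits / block_size
    second_level = second_constant_bits / (block_size * second_block_size)
    return first_level + second_level


plain = constant_overhead_bits()           # 32 / 64
double = double_quant_overhead_bits()      # 8 / 64 + 32 / (64 * 256)
saved_gb_7b = (plain - double) * 7e9 / 8 / 1e9

print(plain, double, saved_gb_7b)
```

Under these assumptions the overhead falls from 0.5 to roughly 0.127 bits per parameter, which amounts to a few hundred megabytes for a 7B model.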
Practical Deployment
Choosing the Right Quantization
| Scenario | Recommended | Why |
|---|---|---|
| Production serving | GPTQ INT4 or AWQ | Fast inference, good quality |
| CPU deployment | GGUF Q4_K_M | Optimized for CPU |
| Fine-tuning | QLoRA NF4 | Enables training on consumer GPUs |
| Edge devices | GGUF Q2_K or Q3_K | Minimal memory |
| Maximum quality | INT8 or FP16 | Negligible (INT8) or no (FP16) quality loss |
Benchmarking Results
Model: Llama-2-7B
Hardware: RTX 4090 (24 GB)
| Format | Memory | Tokens/s | PPL |
|---------|--------|----------|------|
| FP16 | 14 GB | 85 | 5.68 |
| GPTQ-4 | 4 GB | 95 | 5.91 |
| AWQ-4 | 4 GB | 98 | 5.89 |
| GGUF-Q4 | 4.2 GB | 72* | 5.93 |
*GGUF runs on CPU, slower than GPU
AWQ typically provides slightly better quality than GPTQ at the same bit width because it preserves important channels. However, GPTQ is more widely supported and faster to quantize.
Practice Exercises
- Mathematical: Calculate the theoretical memory savings of quantizing a 13B parameter model from FP16 to INT4. Include the overhead of quantization constants and group scales.
- Implementation: Implement a simple INT8 quantizer with per-channel scaling and compare its perplexity degradation against per-tensor quantization.
- Analysis: Compare GPTQ and AWQ on a 7B model using the same calibration dataset. Which method produces better quality at 3-bit quantization?
- Research: Investigate mixed-precision quantization where different layers use different bit widths. What is the optimal allocation strategy?
Key Takeaways:
- GPTQ uses Hessian-based compensation for optimal INT4 quantization
- AWQ preserves important channels using activation-aware scaling
- GGUF supports mixed-precision quantization for CPU inference
- Larger models are more robust to quantization (70B INT4 > 7B FP16)
- QLoRA enables fine-tuning quantized models on consumer GPUs
- INT8 often provides speedup by reducing memory bandwidth requirements
- AWQ generally provides slightly better quality than GPTQ at same bit width
What to Learn Next
- LoRA and PEFT: Efficient fine-tuning using low-rank adaptation.
- Pruning for LLMs: Reducing model size through weight pruning.
- Low-Rank Factorization: SVD decomposition and weight sharing techniques.
- Hardware-Aware LLM Design: Optimizing models for GPU memory hierarchy.
- Model Merging and Fusion: Combining multiple fine-tuned models.
- LLM Inference Optimization: Speeding up model inference for production.