Quantization Techniques Deep Dive — Reducing Precision Without Sacrificing Quality

Quantization enables running billion-parameter models on consumer hardware. This deep dive covers GPTQ, AWQ, GGUF, and INT4/INT8 methods—theory, implementation, and when to use each approach.

Post-Training Quantization — GPTQ and AWQ for minimal quality loss
Quantization Formats — INT8, INT4, NF4, and GGUF variants
Quantization-Aware Training — QLoRA for fine-tuning quantized models
Practical Trade-offs — Quality, speed, and memory comparisons

The art of quantization is knowing what you can throw away without anyone noticing.

Quantization Techniques Deep Dive

Quantization reduces model memory by representing weights with fewer bits. While the concept is simple—map high-precision values to low-precision bins—the implementation details significantly affect quality. This guide covers the major quantization methods and their practical trade-offs.

Definition: Quantization

Quantization is the process of mapping a large set of values to a smaller set. For neural networks, this typically means converting FP32/FP16 weights to INT8, INT4, or other low-bit formats, reducing memory by 2-8× with varying quality impact.

Quantization Theory

Uniform Quantization

$$x_q = \text{round}\left(\frac{x}{s}\right) + z$$

Here,

$x$ = Original floating-point value
$x_q$ = Quantized integer value
$s$ = Scale factor (step size)
$z$ = Zero-point offset

Quantization Parameters

$$s = \frac{\max(x) - \min(x)}{2^b - 1}, \quad z = \text{round}\left(-\frac{\min(x)}{s}\right)$$

Here,

$b$ = Number of bits (e.g., 8 for INT8)
$s$ = Scale factor
$z$ = Zero-point

Quantization Error

$$\epsilon = x - \hat{x} = x - s \cdot (x_q - z)$$

Here,

$\epsilon$ = Quantization error
$x$ = Original value
$\hat{x}$ = Reconstructed value

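For values inside the quantization range, the rounding error of uniform quantization is bounded by ±s/2; values outside the range are clipped and can incur a larger error. When the inputs are spread smoothly relative to the step size, the error is approximately uniformly distributed on [-s/2, s/2], which corresponds to a mean squared error of about s²/12. The total error therefore depends on the weight distribution and on the chosen quantization range.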
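The three formulas above can be traced with a small round trip. This is a minimal sketch in plain Python, assuming 8-bit asymmetric quantization of a handful of made-up values; the helper names (`quant_params`, `quantize`, `dequantize`) are illustrative and not taken from any library.

```python
def quant_params(x_min, x_max, bits):
    """Scale and zero-point for asymmetric uniform quantization."""
    q_max = 2**bits - 1
    scale = (x_max - x_min) / q_max
    zero_point = round(-x_min / scale)
    return scale, zero_point

def quantize(x, scale, zero_point, bits):
    """x_q = round(x / s) + z, clamped to the representable integer range."""
    q = round(x / scale) + zero_point
    return max(0, min(2**bits - 1, q))

def dequantize(q, scale, zero_point):
    """x_hat = s * (x_q - z)."""
    return scale * (q - zero_point)

# Made-up example values spanning the range [-1.0, 3.0]
values = [-1.0, -0.25, 0.0, 0.5, 1.7, 3.0]
scale, zero_point = quant_params(min(values), max(values), bits=8)

for x in values:
    q = quantize(x, scale, zero_point, bits=8)
    x_hat = dequantize(q, scale, zero_point)
    print(f"x={x:+.3f}  q={q:3d}  x_hat={x_hat:+.5f}  err={x - x_hat:+.5f}")

print(f"scale={scale:.6f}  zero_point={zero_point}  bound s/2={scale / 2:.6f}")
```

For these inputs every value lies inside the range used to derive the scale, so no clipping occurs and each error stays within half a step.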

Group Quantization

Definition: Group Quantization

Group quantization divides weight tensors into groups, each with its own scale factor. This allows different groups to have different dynamic ranges, improving accuracy for weights with varying distributions.

Group Quantization

$$s_g = \frac{\max(W_{g}) - \min(W_{g})}{2^b - 1}$$

Here,

$s_g$ = Scale factor for group $g$
$W_{g}$ = Weights in group $g$
$g$ = Group index
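To illustrate why a per-group scale helps, the sketch below quantizes the same made-up weight vector twice at 4 bits: once with a single shared scale and once with one scale per group of four. The weights and helper names are hypothetical; the point is only that a single outlier inflates the step size for every weight that shares its scale.

```python
def fake_quantize(weights, bits):
    """Asymmetric round-to-nearest quantize + dequantize with one shared scale."""
    q_max = 2**bits - 1
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / q_max
    if scale == 0.0:
        return list(weights)  # constant group: nothing to quantize
    zero = round(-w_min / scale)
    out = []
    for w in weights:
        q = max(0, min(q_max, round(w / scale) + zero))
        out.append(scale * (q - zero))
    return out

def group_fake_quantize(weights, bits, group_size):
    """Apply fake_quantize independently to consecutive groups."""
    out = []
    for start in range(0, len(weights), group_size):
        out.extend(fake_quantize(weights[start:start + group_size], bits))
    return out

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

# Mostly small weights, with a single large outlier in the last group
weights = [0.01 * ((i * 7) % 11 - 5) for i in range(16)]
weights[-1] = 2.0

per_tensor = mse(weights, fake_quantize(weights, bits=4))
per_group = mse(weights, group_fake_quantize(weights, bits=4, group_size=4))
print(f"per-tensor MSE: {per_tensor:.6f}")
print(f"group-wise MSE: {per_group:.6f}")
```

With one shared scale, the outlier forces a step so coarse that all small weights collapse to zero; with groups, only the three weights that share a group with the outlier are affected.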

GPTQ (Post-Training Quantization)

Theory

Definition: GPTQ

GPTQ (Frantar et al., 2023) builds on Optimal Brain Quantization and uses the inverse Hessian of the layer-wise reconstruction error, estimated from calibration inputs. It quantizes weights column by column and updates the not-yet-quantized columns to compensate for the error introduced so far.

GPTQ Objective

$$\min_{\hat{W}} \|WX - \hat{W}X\|_2^2 + \lambda \|W - \hat{W}\|_F^2$$

Here,

$W$ = Original weight matrix
$\hat{W}$ = Quantized weight matrix
$X$ = Input activations (calibration data)
$\lambda$ = Regularization parameter
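One way to read the regularization term is sketched here as a short derivation, not as a statement about any particular implementation. Writing $\Delta = W - \hat{W}$ and treating both norms as Frobenius norms, the two terms collapse into a single quadratic form. The symbols $\Delta$ and $H$ are introduced here for illustration only.

```latex
\begin{aligned}
\|\Delta X\|_F^2 + \lambda \|\Delta\|_F^2
  &= \operatorname{tr}\left(\Delta X X^\top \Delta^\top\right)
   + \lambda \operatorname{tr}\left(\Delta \Delta^\top\right) \\
  &= \operatorname{tr}\left(\Delta \left(X X^\top + \lambda I\right) \Delta^\top\right)
   = \sum_i \Delta_{i,:} \, H \, \Delta_{i,:}^\top,
  \qquad H = X X^\top + \lambda I
\end{aligned}
```

Every output row therefore shares the same matrix $H$ (the Hessian of the objective up to a factor of 2), and $\lambda$ acts as a dampening term on its diagonal that keeps $H$ invertible.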

Implementation

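import torch

class GPTQQuantizer:
    """Simplified GPTQ-style quantizer for illustration (not an optimized implementation)."""

    def __init__(self, model, tokenizer, calibration_data):
        self.model = model
        self.tokenizer = tokenizer
        # Assumption: an iterable of tokenized batches (dicts accepted by model(**batch)),
        # already placed on the model's device.
        self.calibration_data = calibration_data

    def quantize_layer(self, layer, n_bits=4, group_size=128, damp=0.01):
        """Quantize a single linear layer; returns the dequantized weight matrix."""
        # Work on a copy so the layer's own weights are not modified in place
        W = layer.weight.data.clone().float()  # (out_features, in_features)
        n_rows, n_cols = W.shape
        q_max = 2**n_bits - 1

        # Hessian approximation from calibration inputs, dampened on the diagonal
        H = self._compute_hessian(layer, self.calibration_data).to(W.device)
        H += damp * H.diagonal().mean() * torch.eye(n_cols, device=W.device)
        # Upper Cholesky factor of H^-1, used for the error propagation
        H_inv = torch.linalg.cholesky(torch.linalg.inv(H), upper=True)

        W_quantized = torch.zeros_like(W)
        gs = group_size if group_size > 0 else n_cols  # group_size <= 0: one group per row

        for col in range(n_cols):
            # At the start of each group of input columns, refresh the
            # per-row scale and zero-point from the current (compensated) weights
            if col % gs == 0:
                block = W[:, col:col + gs]
                w_min = block.min(dim=1).values
                w_max = block.max(dim=1).values
                scale = ((w_max - w_min) / q_max).clamp(min=1e-8)
                zero_point = (-w_min / scale).round()

            # Quantize and dequantize the current column
            w_col = W[:, col]
            q = ((w_col / scale).round() + zero_point).clamp(0, q_max)
            W_quantized[:, col] = scale * (q - zero_point)

            # Push the quantization error onto the remaining columns
            if col < n_cols - 1:
                error = (w_col - W_quantized[:, col]) / H_inv[col, col]
                W[:, col + 1:] -= error.unsqueeze(1) * H_inv[col, col + 1:].unsqueeze(0)

        return W_quantized

    def _compute_hessian(self, layer, data):
        """Accumulate the average of x^T x over all inputs that reach `layer`."""
        H = torch.zeros(layer.in_features, layer.in_features, device=layer.weight.device)
        captured, n_samples = [], 0
        # nn.Linear does not expose its inputs, so capture them with a forward pre-hook
        handle = layer.register_forward_pre_hook(
            lambda module, args: captured.append(args[0].detach())
        )
        try:
            with torch.no_grad():
                for batch in data:
                    self.model(**batch)
                    x = captured.pop().reshape(-1, layer.in_features).float()
                    H += x.T @ x
                    n_samples += x.shape[0]
        finally:
            handle.remove()
        return H / max(n_samples, 1)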

GPTQ Configuration

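# Using HuggingFace Transformers (GPTQ support also requires the optimum
# package and a GPTQ backend such as auto-gptq to be installed)
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Configure GPTQ quantization
gptq_config = GPTQConfig(
    bits=4,
    dataset="c4",
    tokenizer=tokenizer,  # Needed to tokenize the calibration dataset
    group_size=128,
    desc_act=True,      # Quantize columns in order of decreasing activation size
    damp_percent=0.01,
    true_sequential=True
)

# Load and quantize
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=gptq_config,
    device_map="auto"
)

# Save quantized model and tokenizer
model.save_pretrained("llama-2-7b-gptq")
tokenizer.save_pretrained("llama-2-7b-gptq")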

AWQ (Activation-Aware Weight Quantization)

Theory

Definition: AWQ

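AWQ (Lin et al., 2024) identifies important weight channels based on activation magnitudes and protects them during quantization. Rather than keeping those channels in higher precision, it scales them up before quantization, which reduces their relative quantization error while all weights stay at the same bit width.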

AWQ Scaling

$$\hat{W} = \text{quantize}(W \cdot \text{diag}(s)) \cdot \text{diag}(s)^{-1}$$

Here,

$W$ = Original weight matrix
$s$ = Per-channel scale factors
$\hat{W}$ = Quantized weight matrix

Scale Factor Selection

AWQ Scale Optimization

$$s^* = \arg\min_s \left\| \text{quantize}(W \cdot \text{diag}(s)) \cdot \text{diag}(s)^{-1} X - W X \right\|_F^2$$

Here,

$s^*$ = Optimal scale factors
$W$ = Weight matrix
$X$ = Input activations (calibration data)

AWQ searches for scale factors that minimize the layer's output error after quantization, which in practice protects the most important channels. Important channels are identified by the magnitude of their input activations: channels with large activations contribute more to the output and should be quantized more carefully.
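The effect of scaling a single important weight can be checked by hand. In the sketch below (plain Python, with a made-up weight group and a hypothetical scale `s = 2`), the step size is set by the largest weight in the group. As long as scaling the salient weight does not change that maximum, its worst-case error drops from half a step to half a step divided by `s`.

```python
def round_to_grid(w, delta):
    """Round-to-nearest onto a uniform grid with step size delta."""
    return delta * round(w / delta)

bits = 4
# Hypothetical weight group; index 2 is assumed to multiply a large activation
group = [0.9, -0.6, 0.33, 0.05]
salient = 2
s = 2.0  # hypothetical scale for the salient channel

# The step size is set by the largest magnitude in the group
delta = max(abs(w) for w in group) / (2 ** (bits - 1) - 1)

# Plain round-to-nearest on the salient weight
err_plain = abs(group[salient] - round_to_grid(group[salient], delta))

# Scale up, quantize, then undo the scale
# (in practice the factor 1/s is applied on the activation side)
scaled = group[salient] * s
assert abs(scaled) <= max(abs(w) for w in group), "scaling must not change the group maximum"
err_scaled = abs(group[salient] - round_to_grid(scaled, delta) / s)

print(f"step size:           {delta:.4f}")
print(f"error without scale: {err_plain:.4f}")
print(f"error with s={s}:    {err_scaled:.4f}")
```

If the scale is made too large, the scaled weight becomes the new group maximum, the step size grows, and every other weight in the group loses precision. That tension is why the scale is searched for rather than fixed.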

Implementation

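import torch

class AWQQuantizer:
    """Simplified AWQ-style scale search for illustration."""

    def __init__(self, model, calibration_data):
        self.model = model
        # Assumption: an iterable of tokenized batches (dicts accepted by model(**batch)),
        # already placed on the model's device.
        self.calibration_data = calibration_data

    def find_optimal_scales(self, layer, n_bits=4, group_size=128):
        """Grid-search per-input-channel scales for one linear layer."""
        W = layer.weight.data.float()  # (out_features, in_features)

        # Per-channel activation importance (clamped to avoid zero scales)
        importance = self._compute_importance(layer).to(W.device).clamp(min=1e-5)

        best_error = float('inf')
        best_scales = torch.ones(W.shape[1], device=W.device)

        # alpha = 0 means no scaling; alpha = 1 scales fully by activation magnitude
        for alpha in [0.0, 0.25, 0.5, 0.75, 1.0]:
            candidate_scales = importance.pow(alpha)
            candidate_scales = candidate_scales / candidate_scales.mean()

            # Quantize with candidate scales, then undo the scaling
            W_scaled = W * candidate_scales.unsqueeze(0)
            W_quant = self._quantize(W_scaled, n_bits, group_size)
            W_dequant = W_quant / candidate_scales.unsqueeze(0)

            # Weight the error by activation magnitude as a cheap proxy
            # for the layer's output error
            error = ((W - W_dequant) * importance.unsqueeze(0)).pow(2).mean().item()

            if error < best_error:
                best_error = error
                best_scales = candidate_scales

        return best_scales

    def _quantize(self, W, n_bits, group_size):
        """Asymmetric round-to-nearest fake quantization over groups of input channels."""
        n_out, n_in = W.shape
        # Fall back to one group per row if the width is not divisible by group_size
        gs = group_size if group_size > 0 and n_in % group_size == 0 else n_in
        Wg = W.reshape(n_out, n_in // gs, gs)
        w_min = Wg.amin(dim=-1, keepdim=True)
        w_max = Wg.amax(dim=-1, keepdim=True)
        q_max = 2**n_bits - 1
        scale = ((w_max - w_min) / q_max).clamp(min=1e-8)
        zero_point = (-w_min / scale).round()
        q = ((Wg / scale).round() + zero_point).clamp(0, q_max)
        return (scale * (q - zero_point)).reshape(n_out, n_in)

    def _compute_importance(self, layer):
        """Mean absolute input activation per input channel."""
        importance = torch.zeros(layer.in_features, device=layer.weight.device)
        captured, n_batches = [], 0
        # nn.Linear does not expose its inputs, so capture them with a forward pre-hook
        handle = layer.register_forward_pre_hook(
            lambda module, args: captured.append(args[0].detach())
        )
        try:
            with torch.no_grad():
                for batch in self.calibration_data:
                    self.model(**batch)
                    x = captured.pop().reshape(-1, layer.in_features).float()
                    importance += x.abs().mean(dim=0)
                    n_batches += 1
        finally:
            handle.remove()
        return importance / max(n_batches, 1)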

GGUF (GGML Unified Format)

Overview

Definition: GGUF

GGUF is a model file format designed for CPU and hybrid CPU+GPU inference, used by llama.cpp and other C++ inference engines. It supports multiple quantization types, and different tensors within one file can use different types (mixed precision).

Quantization Types

Type	Bits	Quality	Speed	Use Case
Q2_K	2	Low	Fastest	Extreme compression
Q3_K_M	3	Moderate	Fast	Low memory
Q4_0	4	Good	Fast	Balanced
Q4_K_M	4	Better	Fast	Recommended
Q5_K_M	5	Very Good	Moderate	High quality
Q6_K	6	Excellent	Slower	Near-lossless
Q8_0	8	Near-lossless	Slowest	Maximum quality
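The Bits column lists nominal widths. On disk, each block of weights also stores its scale, so the effective bits per weight are somewhat higher. The sketch below assumes the simple block layout used by Q4_0 and Q8_0 in llama.cpp (32 weights per block sharing one FP16 scale); the K-quant types use a more elaborate super-block layout that is not modeled here. The function name is illustrative.

```python
def bits_per_weight(block_size, quant_bits, scale_bytes):
    """Effective storage cost per weight for a block format with one scale per block."""
    payload_bytes = block_size * quant_bits / 8
    return 8 * (payload_bytes + scale_bytes) / block_size

# Assumed layouts: 32 weights per block plus a 2-byte (FP16) scale
q4_0 = bits_per_weight(block_size=32, quant_bits=4, scale_bytes=2)
q8_0 = bits_per_weight(block_size=32, quant_bits=8, scale_bytes=2)

n_params = 7e9  # a 7B-parameter model
for name, bpw in [("Q4_0", q4_0), ("Q8_0", q8_0)]:
    size_gb = n_params * bpw / 8 / 1e9
    print(f"{name}: {bpw:.2f} bits/weight, about {size_gb:.2f} GB for 7B parameters")
```

Under these assumptions the per-block scale adds half a bit per weight, which is why a 4-bit file is larger than parameters × 0.5 bytes.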

GGUF Conversion

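# Converting to GGUF format using llama.cpp
# (script and binary names follow recent llama.cpp releases; older releases
# shipped them as convert.py, ./quantize, and ./main)

# Step 1: Convert the HuggingFace model to an FP16 GGUF file
python convert_hf_to_gguf.py /path/to/model --outtype f16 --outfile /path/to/model-f16.gguf

# Step 2: Quantize to the desired format
./llama-quantize /path/to/model-f16.gguf /path/to/model-q4_k_m.gguf Q4_K_M

# Step 3: Run inference
./llama-cli -m /path/to/model-q4_k_m.gguf -p "Hello, world" -n 100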

GGUF's mixed-precision approach allows different layers to use different quantization levels. Sensitive layers (like attention projections) can use Q6 or Q8 while FFN layers use Q4, achieving better overall quality than uniform quantization.
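The cost of such a mix is easy to estimate as a parameter-weighted average. The sketch below assumes Llama-2-7B-style layer shapes (hidden size 4096, FFN size 11008) and nominal bit widths that ignore per-block scale overhead; the allocation itself is a hypothetical example, not the recipe of any specific GGUF type.

```python
# Assumed Llama-2-7B-style shapes for one transformer layer
hidden, ffn = 4096, 11008
attn_params = 4 * hidden * hidden   # q, k, v, o projections
ffn_params = 3 * hidden * ffn       # gate, up, down projections

def average_bits(allocation):
    """Parameter-weighted average bit width; allocation is a list of (n_params, bits)."""
    total = sum(n for n, _ in allocation)
    return sum(n * b for n, b in allocation) / total

uniform_q4 = average_bits([(attn_params, 4), (ffn_params, 4)])
mixed = average_bits([(attn_params, 6), (ffn_params, 4)])

print(f"uniform 4-bit:           {uniform_q4:.2f} bits/weight")
print(f"attention 6-bit + FFN 4: {mixed:.2f} bits/weight")
```

Because the FFN matrices hold roughly two thirds of each layer's parameters, raising only the attention projections to 6 bits costs well under one extra bit per weight on average.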

INT4 vs INT8 Quantization

Quality Comparison

Model Size	INT8 PPL	INT4 PPL	FP16 PPL	INT8 Delta	INT4 Delta
1B	12.1	12.8	11.9	+0.2	+0.9
7B	5.72	5.91	5.68	+0.04	+0.23
13B	5.48	5.62	5.45	+0.03	+0.17
70B	3.12	3.21	3.10	+0.02	+0.11

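Larger models are more robust to quantization: the INT4 perplexity gap shrinks from +0.9 at 1B to +0.11 at 70B. In the table above, a 70B model quantized to INT4 (PPL 3.21) still scores better than a 7B model in FP16 (PPL 5.68). A common explanation is that larger models have more redundant representations that can absorb quantization noise.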
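The deltas in the table can be recomputed, and expressed relative to the FP16 baseline, with a few lines of Python. The numbers below are copied from the table; nothing else is assumed.

```python
# (model size, INT8 PPL, INT4 PPL, FP16 PPL) copied from the table above
rows = [
    ("1B", 12.1, 12.8, 11.9),
    ("7B", 5.72, 5.91, 5.68),
    ("13B", 5.48, 5.62, 5.45),
    ("70B", 3.12, 3.21, 3.10),
]

int4_deltas = []
for size, int8, int4, fp16 in rows:
    delta = round(int4 - fp16, 2)                 # absolute perplexity increase
    relative = 100 * (int4 - fp16) / fp16         # increase as a percentage of FP16
    int4_deltas.append(delta)
    print(f"{size:>3}: INT4 delta = +{delta:.2f} PPL ({relative:.1f}% over FP16)")
```

The absolute gap shrinks steadily with model size. The relative gap is worth printing as well, since a fixed perplexity increase matters more when the baseline perplexity is already low.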

Memory and Speed

Format	Memory (7B)	Inference Speed (vs. FP16)	GPU Support
FP16	14 GB	Baseline	All GPUs
INT8	7 GB	1.0-1.2×	A100, RTX 30xx+
INT4	3.5 GB	0.8-1.0×	Limited
NF4	3.5 GB	0.7-0.9×	CUDA GPUs (bitsandbytes)

INT8 quantization often provides speedup because it reduces memory bandwidth requirements. For memory-bound operations (like autoregressive generation), INT8 can be 20% faster than FP16 despite the quantization overhead.
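The memory column and the bandwidth argument can both be reproduced with simple arithmetic. The sketch below counts weight storage only and assumes that every weight is read once per generated token; the bandwidth figure is a hypothetical round number, not a measurement of any particular GPU.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

def decode_tokens_per_s_bound(memory_gb, bandwidth_gb_s):
    """Upper bound on decode speed if all weights are read once per token."""
    return bandwidth_gb_s / memory_gb

n_params = 7e9
bandwidth = 1000.0  # GB/s, hypothetical

for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    mem = weight_memory_gb(n_params, bits)
    bound = decode_tokens_per_s_bound(mem, bandwidth)
    print(f"{name}: {mem:.1f} GB of weights, at most ~{bound:.0f} tokens/s")
```

The bound doubles each time the bit width halves, but measured speedups are smaller because dequantization adds compute and low-bit kernels are not equally optimized on all hardware.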

Quantization-Aware Training (QAT)

QLoRA

Definition: QLoRA

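QLoRA (Dettmers et al., 2023) combines NF4 quantization with LoRA adapters, enabling fine-tuning of quantized models on consumer GPUs. The base model is frozen in 4-bit NF4, and only the LoRA parameters are trained in 16-bit precision. Strictly speaking, this is parameter-efficient fine-tuning on top of a quantized base rather than classic quantization-aware training, where the quantizer is simulated while the weights themselves are trained.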

import torch
from transformers import (
    AutoModelForCausalLM,
    BitsAndBytesConfig,
    TrainingArguments,
    Trainer
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# NF4 quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

# Load quantized model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto"
)

# Prepare for training
model = prepare_model_for_kbit_training(model)

# Add LoRA adapters
lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Prints trainable vs. total parameter counts. With r=64 on four 4096x4096
# projections in each of 32 layers, the adapters add 67,108,864 trainable parameters.
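The number of trainable LoRA parameters follows directly from the adapter shapes. The sketch below assumes Llama-2-7B dimensions (hidden size 4096, 32 layers, square attention projections) and the configuration above (r=64 on q_proj, k_proj, v_proj, o_proj); the function name is illustrative.

```python
def lora_trainable_params(d_in, d_out, rank, n_target_modules, n_layers):
    """Each adapted linear layer adds A (rank x d_in) and B (d_out x rank)."""
    per_module = rank * d_in + d_out * rank
    return per_module * n_target_modules * n_layers

trainable = lora_trainable_params(
    d_in=4096, d_out=4096, rank=64, n_target_modules=4, n_layers=32
)
print(f"trainable LoRA parameters: {trainable:,}")
```

Halving the rank or targeting only two of the four projections halves this count, which is a quick way to sanity-check the output of `print_trainable_parameters()`.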

Practical Deployment

Choosing the Right Quantization

Scenario	Recommended	Why
Production serving	GPTQ INT4 or AWQ	Fast inference, good quality
CPU deployment	GGUF Q4_K_M	Optimized for CPU
Fine-tuning	QLoRA NF4	Enables training on consumer GPUs
Edge devices	GGUF Q2_K or Q3_K	Minimal memory
Maximum quality	INT8 or FP16	Minimal (INT8) or no (FP16) quality loss

Benchmarking Results

Model: Llama-2-7B
Hardware: RTX 4090 (24 GB)

| Format  | Memory | Tokens/s | PPL  |
|---------|--------|----------|------|
| FP16    | 14 GB  | 85       | 5.68 |
| GPTQ-4  | 4 GB   | 95       | 5.91 |
| AWQ-4   | 4 GB   | 98       | 5.89 |
| GGUF-Q4 | 4.2 GB | 72*      | 5.93 |

*GGUF runs on CPU, slower than GPU

AWQ typically provides slightly better quality than GPTQ at the same bit width because it preserves important channels. However, GPTQ is more widely supported and faster to quantize.
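To compare the formats on a common footing, the benchmark rows can be normalized against FP16. The values below are copied from the table above; the script only computes ratios and differences.

```python
# (format, memory in GB, tokens/s, perplexity) copied from the table above
results = [
    ("FP16", 14.0, 85, 5.68),
    ("GPTQ-4", 4.0, 95, 5.91),
    ("AWQ-4", 4.0, 98, 5.89),
    ("GGUF-Q4", 4.2, 72, 5.93),
]

_, base_mem, base_tps, base_ppl = results[0]
summary = {}
for name, mem, tps, ppl in results[1:]:
    # (memory reduction factor, speed relative to FP16, perplexity increase)
    summary[name] = (base_mem / mem, tps / base_tps, ppl - base_ppl)
    reduction, speed, ppl_delta = summary[name]
    print(f"{name:8s} {reduction:.2f}x smaller, {speed:.2f}x speed, +{ppl_delta:.2f} PPL")
```

Read this way, the two GPU formats trade a perplexity increase of roughly 0.2 for a 3.5x reduction in memory, and the gap between them is small compared with the gap to FP16.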

Practice Exercises

Mathematical: Calculate the theoretical memory savings of quantizing a 13B parameter model from FP16 to INT4. Include the overhead of quantization constants and group scales.
Implementation: Implement a simple INT8 quantizer with per-channel scaling and compare its perplexity degradation against per-tensor quantization.
Analysis: Compare GPTQ and AWQ on a 7B model using the same calibration dataset. Which method produces better quality at 3-bit quantization?
Research: Investigate mixed-precision quantization where different layers use different bit widths. What is the optimal allocation strategy?

Key Takeaways:

GPTQ uses Hessian-based error compensation for accurate INT4 quantization
AWQ preserves important channels using activation-aware scaling
GGUF supports mixed-precision quantization for CPU inference
Larger models are more robust to quantization (70B INT4 > 7B FP16)
QLoRA enables fine-tuning quantized models on consumer GPUs
INT8 often provides speedup by reducing memory bandwidth requirements
AWQ generally provides slightly better quality than GPTQ at same bit width
