Quantization Techniques Deep Dive

Quantization Techniques Deep Dive — Reducing Precision Without Sacrificing Quality

Quantization enables running billion-parameter models on consumer hardware. This deep dive covers GPTQ, AWQ, GGUF, and INT4/INT8 methods—theory, implementation, and when to use each approach.

  • Post-Training Quantization — GPTQ and AWQ for minimal quality loss
  • Quantization Formats — INT8, INT4, NF4, and GGUF variants
  • Quantization-Aware Training — QLoRA for fine-tuning quantized models
  • Practical Trade-offs — Quality, speed, and memory comparisons

The art of quantization is knowing what you can throw away without anyone noticing.

Quantization reduces model memory by representing weights with fewer bits. While the concept is simple—map high-precision values to low-precision bins—the implementation details significantly affect quality. This guide covers the major quantization methods and their practical trade-offs.

Definition: Quantization

Quantization is the process of mapping a large set of values to a smaller set. For neural networks, this typically means converting FP32/FP16 weights to INT8, INT4, or other low-bit formats, reducing memory by 2-8× with varying quality impact.

Quantization Theory

Uniform Quantization

$$x_q = \text{round}\left(\frac{x}{s}\right) + z$$

Here,

  • $x$ = Original floating-point value
  • $x_q$ = Quantized integer value
  • $s$ = Scale factor (step size)
  • $z$ = Zero-point offset

Quantization Parameters

$$s = \frac{\max(x) - \min(x)}{2^b - 1}, \quad z = \text{round}\left(-\frac{\min(x)}{s}\right)$$

Here,

  • $b$ = Number of bits (e.g., 8 for INT8)
  • $s$ = Scale factor
  • $z$ = Zero-point
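
As a minimal sketch of the two formulas above, the following pure-Python snippet computes the scale and zero-point for a small list of weights and round-trips them through 8-bit codes. The helper names (`quant_params`, `quantize`, `dequantize`) and the sample weights are illustrative assumptions, not part of any library.

```python
def quant_params(values, n_bits=8):
    """Return (scale, zero_point) for asymmetric uniform quantization."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / (2**n_bits - 1)
    zero_point = round(-lo / scale)
    return scale, zero_point


def quantize(x, scale, zero_point, n_bits=8):
    """Map a float to an integer code in [0, 2^b - 1]."""
    q = round(x / scale) + zero_point
    return max(0, min(2**n_bits - 1, q))  # clamp guards against rounding at the edges


def dequantize(q, scale, zero_point):
    """Reconstruct the float value from its integer code."""
    return scale * (q - zero_point)


weights = [-1.0, -0.25, 0.0, 0.4, 1.55]  # made-up sample values
scale, zero_point = quant_params(weights, n_bits=8)
codes = [quantize(w, scale, zero_point) for w in weights]
restored = [dequantize(q, scale, zero_point) for q in codes]

print("scale:", scale, "zero_point:", zero_point)
print("codes:", codes)
print("restored:", restored)
```

These sample weights happen to sit on the 8-bit grid, so the reconstruction is exact up to floating-point noise. Arbitrary weights would show errors of up to half a step.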

Quantization Error

$$\epsilon = x - \hat{x} = x - s \cdot (x_q - z)$$

Here,

  • $\epsilon$ = Quantization error
  • $x$ = Original value
  • $\hat{x}$ = Reconstructed value

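For values inside the quantization range, the rounding error is bounded by ±s/2, for both asymmetric and symmetric (z=0) schemes. Under the usual assumption that the error is uniformly distributed over [-s/2, s/2], its variance is s²/12, so halving the step size cuts the mean squared error by 4×. Values outside the range are clipped and can incur larger errors, so the total error depends on both the weight distribution and the chosen quantization range.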
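
The bound and the s²/12 variance of a uniformly distributed rounding error can be checked empirically. The sketch below is an illustration under assumptions: the weights are drawn uniformly from [-1, 1] (real weight distributions are closer to Gaussian), and no clamping is applied, so only rounding error is measured.

```python
import random


def roundtrip_error(x, scale, zero_point):
    """Quantize x with round-to-nearest, reconstruct it, and return x - x_hat."""
    q = round(x / scale) + zero_point
    return x - scale * (q - zero_point)


random.seed(0)
n_bits = 4
values = [random.uniform(-1.0, 1.0) for _ in range(100_000)]

lo, hi = min(values), max(values)
scale = (hi - lo) / (2**n_bits - 1)
zero_point = round(-lo / scale)

errors = [roundtrip_error(x, scale, zero_point) for x in values]
max_abs_error = max(abs(e) for e in errors)
mse = sum(e * e for e in errors) / len(errors)

print("max |error|:", max_abs_error, " bound s/2:", scale / 2)
print("measured MSE:", mse, " predicted s^2/12:", scale**2 / 12)
```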

Group Quantization

Definition: Group Quantization

Group quantization divides weight tensors into groups, each with its own scale factor. This allows different groups to have different dynamic ranges, improving accuracy for weights with varying distributions.

Group Quantization

$$s_g = \frac{\max(W_g) - \min(W_g)}{2^b - 1}$$

Here,

  • $s_g$ = Scale factor for group $g$
  • $W_g$ = Weights in group $g$
  • $g$ = Group index
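
To see why per-group scales help, consider a toy weight vector whose first half is tiny and whose second half is large. The sketch below (helper names and values are made up for illustration) compares one shared scale against one scale per group of four weights at 4 bits.

```python
def fake_quantize(values, n_bits):
    """Quantize and dequantize a list of floats with one shared scale and zero-point."""
    q_max = 2**n_bits - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / q_max or 1.0  # fall back to 1.0 for constant input
    zero_point = round(-lo / scale)
    out = []
    for v in values:
        q = max(0, min(q_max, round(v / scale) + zero_point))
        out.append(scale * (q - zero_point))
    return out


def group_fake_quantize(values, n_bits, group_size):
    """Apply fake_quantize independently to consecutive groups."""
    out = []
    for start in range(0, len(values), group_size):
        out.extend(fake_quantize(values[start:start + group_size], n_bits))
    return out


def mean_squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def max_abs_error(a, b):
    return max(abs(x - y) for x, y in zip(a, b))


# First four weights are small, last four are large
weights = [0.01, -0.02, 0.03, -0.015, 2.0, -1.5, 0.8, -2.2]

per_tensor = fake_quantize(weights, n_bits=4)
per_group = group_fake_quantize(weights, n_bits=4, group_size=4)

mse_tensor = mean_squared_error(weights, per_tensor)
mse_group = mean_squared_error(weights, per_group)
small_err_tensor = max_abs_error(weights[:4], per_tensor[:4])
small_err_group = max_abs_error(weights[:4], per_group[:4])

print("MSE per-tensor:", mse_tensor, " per-group:", mse_group)
print("max error on small weights:", small_err_tensor, "vs", small_err_group)
```

With a single scale the small weights all collapse to zero, while the per-group scale resolves them. The price is one extra scale (and zero-point) per group.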

GPTQ (Post-Training Quantization)

Theory

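Definition: GPTQ

GPTQ (Frantar et al., 2023) builds on Optimal Brain Quantization (OBQ): it uses the inverse Hessian of the layer-wise reconstruction error, estimated from calibration inputs, to correct for each rounding step. It quantizes weights column by column and, after each column, updates the not-yet-quantized columns to compensate for the error just introduced.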

GPTQ Objective

$$\min_{\hat{W}} \|WX - \hat{W}X\|_2^2 + \lambda \|W - \hat{W}\|_F^2$$

Here,

  • $W$ = Original weight matrix
  • $\hat{W}$ = Quantized weight matrix
  • $X$ = Input activations (calibration data)
  • $\lambda$ = Regularization parameter
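
The objective above can be rewritten in terms of a damped Hessian $H = XX^\top + \lambda I$, which is why GPTQ only needs $H$ rather than the raw activations: $\|(W - \hat{W})X\|^2 + \lambda\|W - \hat{W}\|_F^2 = \text{tr}\big((W - \hat{W})\,H\,(W - \hat{W})^\top\big)$. The NumPy sketch below checks this identity numerically on random data. The shapes, the value of `lam`, and the use of plain rounding as a stand-in quantizer are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))    # (out_features, in_features)
X = rng.normal(size=(8, 64))   # (in_features, n_samples), so W @ X is the layer output
lam = 0.01                     # regularization / dampening strength (illustrative)

# Round-to-nearest onto a coarse grid stands in for a real quantizer
step = 0.25
W_hat = np.round(W / step) * step
delta = W - W_hat

# Objective evaluated directly from the activations (Frobenius norms)
direct = np.linalg.norm(delta @ X) ** 2 + lam * np.linalg.norm(delta) ** 2

# Same objective evaluated from the damped Hessian only
H = X @ X.T + lam * np.eye(X.shape[0])
via_hessian = np.trace(delta @ H @ delta.T)

print(direct, via_hessian)
```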

Implementation

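import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

class GPTQQuantizer:
    """Simplified GPTQ-style quantizer (illustrative, not an optimized implementation)."""

    def __init__(self, model, tokenizer, calibration_data):
        self.model = model
        self.tokenizer = tokenizer
        # Assumption: an iterable of tokenized batches (dicts of model inputs)
        self.calibration_data = calibration_data

    def quantize_layer(self, layer, n_bits=4, group_size=128, damp=0.01):
        """Quantize a single linear layer using GPTQ-style error compensation."""
        # Work on a float32 copy so the layer's own weights are not modified
        W = layer.weight.data.clone().float()  # (out_features, in_features)
        n_rows, n_cols = W.shape
        q_max = 2**n_bits - 1

        # Compute Hessian approximation using calibration data, damped for stability
        H = self._compute_hessian(layer, self.calibration_data)
        H += damp * H.diagonal().mean() * torch.eye(n_cols, device=H.device)
        H_inv = torch.cholesky_inverse(torch.linalg.cholesky(H))
        # Row `col` of the upper Cholesky factor of H^-1 holds the update weights
        # for the columns that are still unquantized at step `col`
        U = torch.linalg.cholesky(H_inv, upper=True)

        # Quantize column by column
        W_quantized = torch.zeros_like(W)
        # This sketch groups rows within a column; production GPTQ kernels
        # usually share scales across groups of input columns instead
        step = group_size if group_size > 0 else n_rows

        for col in range(n_cols):
            w_col = W[:, col]

            # Asymmetric quantization of the current column, one scale per group
            for g in range(0, n_rows, step):
                group = w_col[g:g+step]
                scale = ((group.max() - group.min()) / q_max).clamp(min=1e-8)
                zero_point = (-group.min() / scale).round()
                q = (group / scale + zero_point).round().clamp(0, q_max)
                W_quantized[g:g+step, col] = (q - zero_point) * scale

            # Compensate for quantization error in remaining columns
            if col < n_cols - 1:
                error = (w_col - W_quantized[:, col]) / U[col, col]
                W[:, col+1:] -= error.unsqueeze(1) * U[col, col+1:].unsqueeze(0)

        return W_quantized.to(layer.weight.dtype)

    def _compute_hessian(self, layer, data):
        """Accumulate H = X^T X / n from the layer's inputs over the calibration data."""
        device = layer.weight.device
        H = torch.zeros(layer.in_features, layer.in_features, device=device)
        n_samples = 0
        captured = []

        # A forward pre-hook records the tensor fed into this layer
        def hook(module, args):
            captured.append(args[0].detach())

        handle = layer.register_forward_pre_hook(hook)
        try:
            for batch in data:
                captured.clear()
                with torch.no_grad():
                    self.model(**batch)  # forward pass to get input activations
                x = captured[0].reshape(-1, layer.in_features).float()
                H += x.T @ x
                n_samples += x.shape[0]
        finally:
            handle.remove()

        return H / max(n_samples, 1)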
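
A quick way to sanity-check the error-compensation step is to compare it against plain round-to-nearest on synthetic data. The NumPy sketch below is a toy under stated assumptions: the inputs are deliberately correlated (a few latent factors plus noise), the grid is unbounded with a fixed `step`, and `gptq_like` is a hypothetical helper that follows the Cholesky formulation of the update rather than any library's API.

```python
import numpy as np


def rtn(w, step):
    """Round-to-nearest onto a uniform grid with the given step."""
    return np.round(w / step) * step


def gptq_like(W, H, step, damp=0.01):
    """Quantize columns left to right, pushing each column's error onto later columns."""
    W = W.copy()
    n_cols = W.shape[1]
    H = H + damp * np.mean(np.diag(H)) * np.eye(n_cols)
    H_inv = np.linalg.inv(H)
    U = np.linalg.cholesky(H_inv).T  # upper factor, H_inv = U.T @ U
    Q = np.zeros_like(W)
    for j in range(n_cols):
        Q[:, j] = rtn(W[:, j], step)
        err = (W[:, j] - Q[:, j]) / U[j, j]
        W[:, j + 1:] -= np.outer(err, U[j, j + 1:])
    return Q


rng = np.random.default_rng(0)
n_samples, d_in, d_out = 512, 32, 16

# Correlated inputs: 8 latent factors mixed into 32 features, plus small noise
latent = rng.normal(size=(n_samples, 8))
X = latent @ rng.normal(size=(8, d_in)) + 0.1 * rng.normal(size=(n_samples, d_in))
W = rng.normal(size=(d_out, d_in))
H = X.T @ X / n_samples

step = 0.5
W_rtn = rtn(W, step)
W_gptq = gptq_like(W, H, step)

err_rtn = np.linalg.norm(X @ W.T - X @ W_rtn.T) ** 2
err_gptq = np.linalg.norm(X @ W.T - X @ W_gptq.T) ** 2
print("output error, round-to-nearest:", err_rtn)
print("output error, with compensation:", err_gptq)
```

Both results lie on the same grid. The difference is that the compensated version chooses roundings that cancel in the directions the inputs actually occupy.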

GPTQ Configuration

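# Using HuggingFace Transformers (GPTQ quantization also requires the optimum package and a GPTQ backend)
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Configure GPTQ quantization
gptq_config = GPTQConfig(
    bits=4,
    dataset="c4",
    tokenizer=tokenizer,  # Needed to tokenize the calibration dataset
    group_size=128,
    desc_act=True,      # Quantize columns in order of decreasing activation size
    damp_percent=0.01,
    true_sequential=True
)

# Load and quantize
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=gptq_config,
    device_map="auto"
)

# Save quantized model and tokenizer together
model.save_pretrained("llama-2-7b-gptq")
tokenizer.save_pretrained("llama-2-7b-gptq")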

AWQ (Activation-Aware Weight Quantization)

Theory

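Definition: AWQ

AWQ (Lin et al., 2024) identifies salient weight channels from activation magnitudes rather than weight magnitudes. Instead of keeping those channels in higher precision, it scales them up before quantization and folds the inverse scale into the preceding activations, so all weights stay in the same low-bit format while the salient channels lose less information.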

AWQ Scaling

$$\hat{W} = \text{quantize}(W \cdot \text{diag}(s)) \cdot \text{diag}(s)^{-1}$$

Here,

  • $W$ = Original weight matrix
  • $s$ = Per-channel scale factors
  • $\hat{W}$ = Quantized weight matrix

Scale Factor Selection

AWQ Scale Optimization

$$s^* = \arg\min_s \|\text{quantize}(W \cdot \text{diag}(s)) \cdot \text{diag}(s)^{-1} X - W X\|_F^2$$

Here,

  • $s^*$ = Optimal scale factors
  • $W$ = Weight matrix
  • $X$ = Input activations (calibration data)
AWQ finds scale factors by minimizing the quantization error of the most important channels. Important channels are identified by the magnitude of their input activations—channels with large activations contribute more to the output and should be quantized more carefully.
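
The effect of scaling a salient channel can be seen on a single weight row. In the toy sketch below, the row, the activations, the salient index, and the scale `s = 2.0` are all made-up values, and `fake_quantize_row` is a hypothetical symmetric 4-bit quantizer. Because scaling the salient weight does not change the row maximum here, the step size stays the same and the worst-case error on that weight shrinks by a factor of `s`.

```python
def fake_quantize_row(row, n_bits=4):
    """Symmetric round-to-nearest with one absmax step size for the whole row."""
    q_max = 2 ** (n_bits - 1) - 1
    delta = max(abs(w) for w in row) / q_max
    return [delta * round(w / delta) for w in row], delta


row = [0.9, -0.31, 0.2, -0.05]        # one output row of W (made-up values)
activations = [0.1, -0.2, 8.0, 0.3]   # channel 2 sees much larger inputs
salient, s = 2, 2.0                   # hypothetical scale for the salient channel

# Baseline: quantize the row as-is
plain, delta_plain = fake_quantize_row(row)

# AWQ-style: scale the salient channel up, quantize, then divide the scale back out
scales = [s if i == salient else 1.0 for i in range(len(row))]
quantized, delta_awq = fake_quantize_row([w * sc for w, sc in zip(row, scales)])
awq = [q / sc for q, sc in zip(quantized, scales)]


def output_error(w_hat):
    """Absolute error of the dot product with the activations."""
    exact = sum(w * x for w, x in zip(row, activations))
    approx = sum(w * x for w, x in zip(w_hat, activations))
    return abs(exact - approx)


# Equivalent deployment form: keep the scaled weights and divide the activations instead
folded = sum(q * (x / sc) for q, x, sc in zip(quantized, activations, scales))
unfolded = sum(w * x for w, x in zip(awq, activations))

print("plain output error:", output_error(plain))
print("AWQ output error:  ", output_error(awq))
```

For an individual weight the error can occasionally go up after scaling. The guarantee is only on the worst case, which is why AWQ searches over scales using calibration data rather than fixing them by rule.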

Implementation

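import torch

class AWQQuantizer:
    """AWQ-style quantization with activation-aware scaling (simplified sketch)."""

    def __init__(self, model, calibration_data):
        self.model = model
        # Assumption: an iterable of tokenized batches (dicts of model inputs)
        self.calibration_data = calibration_data

    def find_optimal_scales(self, layer, n_bits=4, group_size=128):
        """Find per-input-channel scales for AWQ by grid search."""
        W = layer.weight.data.float()
        n_out, n_in = W.shape

        # Compute activation importance (clamped so pow() never sees zero)
        importance = self._compute_importance(layer).to(W.device).clamp(min=1e-5)

        # Grid search for optimal scales
        best_error = float('inf')
        best_scales = torch.ones(n_in, device=W.device)

        for alpha in [0.5, 0.75, 1.0, 1.25, 1.5]:
            candidate_scales = importance.pow(alpha)
            candidate_scales = candidate_scales / candidate_scales.mean()

            # Quantize with candidate scales
            W_scaled = W * candidate_scales.unsqueeze(0)
            W_quant = self._quantize(W_scaled, n_bits, group_size)
            W_dequant = W_quant / candidate_scales.unsqueeze(0)

            # Activation-weighted weight error: a cheap proxy for the output
            # error ||(W - W_hat) X|| that weights each input channel by its
            # mean activation magnitude
            error = ((W - W_dequant) * importance.unsqueeze(0)).pow(2).mean().item()

            if error < best_error:
                best_error = error
                best_scales = candidate_scales

        return best_scales

    def _quantize(self, W, n_bits, group_size):
        """Asymmetric group-wise fake quantization along the input dimension."""
        n_out, n_in = W.shape
        if group_size <= 0 or n_in % group_size != 0:
            group_size = n_in  # fall back to one group per row
        Wg = W.reshape(n_out, n_in // group_size, group_size)
        w_min = Wg.amin(dim=-1, keepdim=True)
        w_max = Wg.amax(dim=-1, keepdim=True)
        q_max = 2**n_bits - 1
        scale = ((w_max - w_min) / q_max).clamp(min=1e-8)
        zero_point = (-w_min / scale).round()
        q = (Wg / scale + zero_point).round().clamp(0, q_max)
        return ((q - zero_point) * scale).reshape(n_out, n_in)

    def _compute_importance(self, layer):
        """Compute channel importance as the mean absolute input activation."""
        importance = torch.zeros(layer.in_features, device=layer.weight.device)
        captured = []
        n_batches = 0

        # A forward pre-hook records the tensor fed into this layer
        handle = layer.register_forward_pre_hook(
            lambda module, args: captured.append(args[0].detach())
        )
        try:
            for batch in self.calibration_data:
                captured.clear()
                with torch.no_grad():
                    self.model(**batch)
                x = captured[0].reshape(-1, layer.in_features).float()
                importance += x.abs().mean(dim=0)
                n_batches += 1
        finally:
            handle.remove()

        return importance / max(n_batches, 1)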

GGUF (GGML Unified Format)

Overview

Definition: GGUF

GGUF is a model file format designed for CPU and CPU+GPU inference, used by llama.cpp and other C++ inference engines. It supports multiple quantization types and lets different tensors in the same file use different precisions.

Quantization Types

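| Type   | Bits | Quality              | Speed    | Use Case            |
|--------|------|----------------------|----------|---------------------|
| Q2_K   | 2    | Low                  | Fastest  | Extreme compression |
| Q3_K_M | 3    | Moderate             | Fast     | Low memory          |
| Q4_0   | 4    | Good                 | Fast     | Balanced            |
| Q4_K_M | 4    | Better               | Fast     | Recommended         |
| Q5_K_M | 5    | Very Good            | Moderate | High quality        |
| Q6_K   | 6    | Excellent            | Slower   | Near-lossless       |
| Q8_0   | 8    | Effectively lossless | Slowest  | Maximum quality     |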
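
The bit counts in the table are nominal: each block of weights also stores its own scale, so the effective size is somewhat larger. As a rough sketch, Q4_0 and Q8_0 pack 32 weights per block together with one FP16 scale. The helper names and the round 7e9 parameter count below are assumptions for illustration, and the K-quant types use more elaborate layouts that this sketch does not cover.

```python
def bits_per_weight(block_size, weight_bits, metadata_bits):
    """Effective bits per weight once per-block metadata is included."""
    return (block_size * weight_bits + metadata_bits) / block_size


def file_size_gb(n_params, bpw):
    """Approximate weight storage in GB (10^9 bytes) at a given bits-per-weight."""
    return n_params * bpw / 8 / 1e9


q4_0_bpw = bits_per_weight(block_size=32, weight_bits=4, metadata_bits=16)
q8_0_bpw = bits_per_weight(block_size=32, weight_bits=8, metadata_bits=16)

n_params = 7e9  # round-number stand-in for a "7B" model
print("Q4_0:", q4_0_bpw, "bits/weight,", file_size_gb(n_params, q4_0_bpw), "GB")
print("Q8_0:", q8_0_bpw, "bits/weight,", file_size_gb(n_params, q8_0_bpw), "GB")
```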

GGUF Conversion

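# Converting to GGUF format using llama.cpp
# (tool names from recent llama.cpp releases; older checkouts call them convert.py, quantize, and main)
# Step 1: Convert the HuggingFace model to an FP16 GGUF file
python convert_hf_to_gguf.py /path/to/model --outtype f16 --outfile /path/to/model-f16.gguf

# Step 2: Quantize to desired format
./llama-quantize /path/to/model-f16.gguf /path/to/model-q4_k_m.gguf Q4_K_M

# Step 3: Run inference
./llama-cli -m /path/to/model-q4_k_m.gguf -p "Hello, world" -n 100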

GGUF's mixed-precision approach allows different tensors to use different quantization levels. Sensitive tensors (like attention projections) can use Q6 or Q8 while FFN layers use Q4, achieving better overall quality than applying a single quantization type to every tensor.

INT4 vs INT8 Quantization

Quality Comparison

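| Model Size | INT8 PPL | INT4 PPL | FP16 PPL | INT8 Delta | INT4 Delta |
|------------|----------|----------|----------|------------|------------|
| 1B         | 12.1     | 12.8     | 11.9     | +0.2       | +0.9       |
| 7B         | 5.72     | 5.91     | 5.68     | +0.04      | +0.23      |
| 13B        | 5.48     | 5.62     | 5.45     | +0.03      | +0.17      |
| 70B        | 3.12     | 3.21     | 3.10     | +0.02      | +0.11      |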

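Larger models are more robust to quantization: in the table above, the INT4 perplexity penalty shrinks from +0.9 at 1B to +0.11 at 70B. A 70B model quantized to INT4 (perplexity 3.21) also retains far better quality than a 7B model in FP16 (5.68). A common explanation is that larger models have more redundant representations that can absorb quantization noise.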

Memory and Speed

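| Format | Memory (7B) | Inference Speed (vs FP16) | GPU Support              |
|--------|-------------|---------------------------|--------------------------|
| FP16   | 14 GB       | Baseline                  | All GPUs                 |
| INT8   | 7 GB        | 1.0-1.2×                  | A100, RTX 30xx+          |
| INT4   | 3.5 GB      | 0.8-1.0×                  | Limited                  |
| NF4    | 3.5 GB      | 0.7-0.9×                  | CUDA GPUs (bitsandbytes) |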

INT8 quantization often provides a speedup because each weight takes half as many bytes to read from memory. For memory-bound operations (like autoregressive generation), INT8 can be up to about 20% faster than FP16 despite the dequantization overhead.
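
The memory figures for a 7B model follow directly from the parameter count and bits per weight. The sketch below reproduces them and also shows how much a per-group FP16 scale adds at group size 128. The function names, the round 7e9 parameter count, and the choice to ignore zero-points, activations, and the KV cache are simplifying assumptions.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Weight storage in GB (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


def grouped_bits(weight_bits, group_size, scale_bits=16, zero_bits=0):
    """Effective bits per weight when each group stores its own scale (and zero-point)."""
    return weight_bits + (scale_bits + zero_bits) / group_size


N_PARAMS = 7e9  # round-number stand-in for a "7B" model

fp16_gb = weight_memory_gb(N_PARAMS, 16)
int8_gb = weight_memory_gb(N_PARAMS, 8)
int4_gb = weight_memory_gb(N_PARAMS, 4)
int4_g128_gb = weight_memory_gb(N_PARAMS, grouped_bits(4, 128))

for name, gb in [("FP16", fp16_gb), ("INT8", int8_gb),
                 ("INT4", int4_gb), ("INT4, group 128", int4_g128_gb)]:
    print(f"{name:>16}: {gb:.2f} GB ({fp16_gb / gb:.2f}x smaller than FP16)")
```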

Quantization-Aware Training (QAT)

QLoRA

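Definition: QLoRA

QLoRA (Dettmers et al., 2023) combines NF4 quantization with LoRA adapters, enabling fine-tuning of quantized models on consumer GPUs. The base model is frozen in 4-bit NF4 while only the LoRA parameters are trained in 16-bit precision. Strictly speaking this is not quantization-aware training in the classical sense, since the quantized weights themselves are never updated.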

import torch
from transformers import (
    AutoModelForCausalLM,
    BitsAndBytesConfig,
    TrainingArguments,
    Trainer
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# NF4 quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

# Load quantized model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto"
)

# Prepare for training
model = prepare_model_for_kbit_training(model)

# Add LoRA adapters
lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
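# Expect 67,108,864 trainable params for this config:
# 4 modules x 32 layers x 64 x (4096 + 4096).
# The reported total and percentage depend on how PEFT counts the packed 4-bit weights.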
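
The trainable-parameter count for a LoRA configuration can be worked out by hand, which is a useful check on what `print_trainable_parameters()` reports. The sketch below assumes the Llama-2-7B shapes (hidden size 4096, 32 layers, square attention projections) and the `r=64`, four-module configuration shown above. A different printed figure would suggest a different rank or module list.

```python
def lora_param_count(d_in, d_out, rank):
    """LoRA adds A (rank x d_in) and B (d_out x rank) per adapted layer, with no bias."""
    return rank * (d_in + d_out)


hidden_size = 4096     # Llama-2-7B hidden size
n_layers = 32          # Llama-2-7B decoder layers
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj"]
rank = 64

per_module = lora_param_count(hidden_size, hidden_size, rank)
trainable = per_module * len(target_modules) * n_layers

print(f"per module: {per_module:,}")
print(f"trainable:  {trainable:,}")
```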

Practical Deployment

Choosing the Right Quantization

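| Scenario           | Recommended       | Why                               |
|--------------------|-------------------|-----------------------------------|
| Production serving | GPTQ INT4 or AWQ  | Fast inference, good quality      |
| CPU deployment     | GGUF Q4_K_M       | Optimized for CPU                 |
| Fine-tuning        | QLoRA NF4         | Enables training on consumer GPUs |
| Edge devices       | GGUF Q2_K or Q3_K | Minimal memory                    |
| Maximum quality    | INT8 or FP16      | Little to no quality loss         |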

Benchmarking Results

Model: Llama-2-7B
Hardware: RTX 4090 (24 GB)

| Format  | Memory | Tokens/s | PPL  |
|---------|--------|----------|------|
| FP16    | 14 GB  | 85       | 5.68 |
| GPTQ-4  | 4 GB   | 95       | 5.91 |
| AWQ-4   | 4 GB   | 98       | 5.89 |
| GGUF-Q4 | 4.2 GB | 72*      | 5.93 |

*GGUF runs on CPU, slower than GPU

In the benchmark above, AWQ gives slightly better perplexity than GPTQ at the same bit width (5.89 vs 5.91), which is consistent with its protection of salient channels. However, GPTQ is more widely supported and faster to quantize.

Practice Exercises

  1. Mathematical: Calculate the theoretical memory savings of quantizing a 13B parameter model from FP16 to INT4. Include the overhead of quantization constants and group scales.

  2. Implementation: Implement a simple INT8 quantizer with per-channel scaling and compare its perplexity degradation against per-tensor quantization.

  3. Analysis: Compare GPTQ and AWQ on a 7B model using the same calibration dataset. Which method produces better quality at 3-bit quantization?

  4. Research: Investigate mixed-precision quantization where different layers use different bit widths. What is the optimal allocation strategy?

Key Takeaways:

  • GPTQ uses Hessian-based error compensation for accurate INT4 quantization
  • AWQ preserves important channels using activation-aware scaling
  • GGUF supports mixed-precision quantization for CPU inference
  • Larger models are more robust to quantization (70B INT4 > 7B FP16)
  • QLoRA enables fine-tuning quantized models on consumer GPUs
  • INT8 often provides speedup by reducing memory bandwidth requirements
  • AWQ generally provides slightly better quality than GPTQ at same bit width
