LLM Optimization

Model Compression Pipeline

End-to-end compression combining quantization, pruning, and distillation, targeting 10-100x reductions in model size while retaining most task performance.

Quantization — INT8, INT4, mixed-precision, and GPTQ
Pruning — Structured, unstructured, and movement pruning
Distillation — Knowledge transfer from large to small models

Perfection is achieved not when there is nothing more to add, but when there is nothing left to take away.

Definition: Model Compression

Model compression is the process of reducing the computational cost, memory footprint, and energy consumption of a neural network while preserving task performance. The three primary techniques are quantization (reducing numerical precision), pruning (removing redundant parameters), and knowledge distillation (transferring knowledge from a larger to a smaller model).

The Compression Tradeoff

Compression-Performance Tradeoff

$$\text{Quality}(\delta) = f(\text{Compression}(\delta)) - \lambda \cdot \text{Cost}(\delta)$$

Here,

$\text{Quality}(\delta)$ = Task performance after compression
$\text{Compression}(\delta)$ = Compression ratio achieved
$\text{Cost}(\delta)$ = Performance degradation
$\lambda$ = Tradeoff parameter

The optimal compression strategy depends on the target deployment scenario:

Scenario	Priority	Recommended Pipeline
Cloud inference	Throughput	Quantization (INT4) + Pruning (20%)
Edge device	Memory	Quantization (INT4) + Distillation
Real-time	Latency	Pruning (structured) + Quantization
Mobile	All constraints	Distillation + Quantization + Pruning

Stage 1: Quantization

Post-Training Quantization (PTQ)

Apply quantization after training without retraining:

Linear Quantization

$$q = \text{round}\left(\frac{x}{s}\right) + z$$

Here,

$x$ = Original floating-point value
$z$ = Zero point (the integer that represents $x = 0$)
$s$ = Scale factor
$q$ = Quantized integer value

Scale and zero point:

Quantization Parameters

$$s = \frac{x_{\max} - x_{\min}}{2^b - 1}, \quad z = \text{round}\left(-\frac{x_{\min}}{s}\right)$$

Here,

$x_{\max}, x_{\min}$ = Range of values to quantize
$b$ = Bit width (e.g., 4, 8)
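
As a rough illustration of how the scale and zero point are used, the sketch below quantizes and dequantizes a small list of values. It assumes the common affine convention $q = \text{round}(x/s) + z$ with an integer zero point and an unsigned range $[0, 2^b - 1]$. The function names and sample values are hypothetical, and it uses plain Python lists to stay dependency-free; real pipelines would normally use a tensor library.

```python
def quant_params(values, bits):
    """Scale and integer zero point for an unsigned `bits`-bit range."""
    x_min, x_max = min(values), max(values)
    scale = (x_max - x_min) / (2 ** bits - 1)
    zero_point = round(-x_min / scale)
    return scale, zero_point


def quantize(values, scale, zero_point, bits):
    """Map floats to integers in [0, 2^bits - 1], clamping out-of-range values."""
    q_max = 2 ** bits - 1
    return [min(max(round(x / scale) + zero_point, 0), q_max) for x in values]


def dequantize(q_values, scale, zero_point):
    """Approximate reconstruction of the original floats."""
    return [scale * (q - zero_point) for q in q_values]


weights = [-1.0, -0.25, 0.0, 0.65, 2.0]  # hypothetical sample values
scale, zero_point = quant_params(weights, bits=4)
q = quantize(weights, scale, zero_point, bits=4)
recovered = dequantize(q, scale, zero_point)
max_error = max(abs(w - r) for w, r in zip(weights, recovered))

print("scale:", scale, "zero point:", zero_point)
print("quantized:", q)
print("max reconstruction error:", max_error)
```

For values inside the calibrated range, the reconstruction error of each element is bounded by half the scale, which is why a wider value range (larger $s$) costs precision at a fixed bit width.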

Quantization-Aware Training (QAT)

Simulate quantization during training to learn quantization-friendly weights:

QAT Loss

$$\mathcal{L}_{\text{QAT}} = \mathcal{L}(y, f_Q(\hat{\theta})) + \lambda \cdot \text{Reg}(\theta)$$

Here,

$f_Q(\hat{\theta})$ = Forward pass with quantized weights
$\text{Reg}(\theta)$ = Regularization to keep weights in quantizable range
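
The following toy sketch shows one common way the "simulated quantization" can be implemented: a fake-quantize step in the forward pass, with a straight-through estimator (STE) in the backward pass. The STE is an assumption here, since the text above does not name a gradient method, and the regularization term is omitted for brevity. The model is a single scalar weight fitted with a hand-derived mean-squared-error gradient; all names and values are hypothetical.

```python
def fake_quantize(w, scale, bits=4):
    """Quantize then dequantize: the result is a float on the integer grid."""
    q_min, q_max = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    q = min(max(round(w / scale), q_min), q_max)
    return q * scale


def train_scalar_qat(xs, ys, scale, lr=0.05, steps=200):
    """Fit y = w * x while the forward pass only ever sees the quantized weight."""
    w = 0.0  # latent full-precision weight that the optimizer updates
    for _ in range(steps):
        w_q = fake_quantize(w, scale)
        # Gradient of the mean squared error with respect to w_q.
        grad = sum(2 * (w_q * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        # Straight-through estimator: treat d(w_q)/dw as 1 and update w directly.
        w -= lr * grad
    return w, fake_quantize(w, scale)


xs = [1.0, 2.0, 3.0]
ys = [0.8 * x for x in xs]  # hypothetical target slope of 0.8
latent_w, quantized_w = train_scalar_qat(xs, ys, scale=0.25)
print("latent weight:", latent_w, "quantized weight:", quantized_w)
```

Because 0.8 is not on the 0.25 grid, the quantized weight ends up on one of the two neighbouring grid points (0.75 or 1.0), and the latent weight hovers near the rounding boundary between them. This kind of oscillation is one reason practical QAT setups add regularization or learning-rate decay.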

GPTQ (Quantization for LLMs)

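GPTQ builds on Optimal Brain Quantization (OBQ): it quantizes weights layer by layer, using approximate second-order information to minimize the error in each layer's output: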

GPTQ Objective

$$\min_{\hat{W}} \| WX - \hat{W}X \|_2^2 \quad \text{s.t.} \quad \hat{W} \in \{0, \pm 1, \ldots, \pm 2^{b-1}\}^d$$

Here,

$W$ = Original weight matrix
$\hat{W}$ = Quantized weight matrix
$X$ = Calibration data activations
$b$ = Target bit width
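
To make the objective concrete, the sketch below evaluates the layer-wise reconstruction error $\| WX - \hat{W}X \|^2$ for a simple round-to-nearest quantizer. This is not the GPTQ algorithm itself, which additionally compensates the remaining weights after each quantization step; it only shows the quantity GPTQ tries to minimize. The matrices, the symmetric per-matrix scale, and the helper names are hypothetical.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def round_to_nearest(w, bits):
    """Symmetric round-to-nearest quantization, returned in dequantized form."""
    q_max = 2 ** (bits - 1) - 1
    scale = max(abs(v) for row in w for v in row) / q_max
    return [[scale * min(max(round(v / scale), -q_max), q_max) for v in row]
            for row in w]


def reconstruction_error(w, w_hat, x):
    """Sum of squared differences between W X and W_hat X."""
    ref, approx = matmul(w, x), matmul(w_hat, x)
    return sum((r - a) ** 2
               for ref_row, approx_row in zip(ref, approx)
               for r, a in zip(ref_row, approx_row))


W = [[0.9, -0.35, 0.12],
     [-0.6, 0.47, 0.05]]   # 2 outputs x 3 inputs
X = [[1, 0, 2, 1],
     [0, 1, 1, 2],
     [1, 1, 0, 1]]         # 3 inputs x 4 calibration samples

err4 = reconstruction_error(W, round_to_nearest(W, bits=4), X)
err8 = reconstruction_error(W, round_to_nearest(W, bits=8), X)
print("4-bit error:", err4)
print("8-bit error:", err8)
```

The error is measured on the layer output rather than on the weights, so weights that interact with large activations matter more than their magnitude alone would suggest.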

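GPTQ with 4-bit quantization typically achieves <1% perplexity degradation on most LLMs, though the exact figure depends on the model and the calibration data. For a 70B model, this reduces weight memory from 140GB (FP16, 2 bytes per parameter) to ~35GB (INT4, 0.5 bytes per parameter), small enough to fit on a single 48GB or 80GB GPU before accounting for the KV cache and activations.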
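
The memory figures follow directly from parameter count times bits per weight. A minimal sketch of that arithmetic, assuming 1 GB = 10^9 bytes and ignoring quantization metadata such as per-group scales and zero points (the function name is hypothetical):

```python
def model_size_gb(num_params, bits_per_weight):
    """Weight storage in GB (1 GB = 1e9 bytes), ignoring quantization metadata."""
    return num_params * bits_per_weight / 8 / 1e9


fp16_gb = model_size_gb(70e9, 16)  # 140.0
int4_gb = model_size_gb(70e9, 4)   # 35.0
print(f"FP16: {fp16_gb:.0f} GB, INT4: {int4_gb:.0f} GB, ratio: {fp16_gb / int4_gb:.0f}x")
```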

Stage 2: Pruning

Unstructured Pruning

Remove individual weights based on magnitude:

Magnitude Pruning

$$M_{ij} = \begin{cases} 1 & \text{if } |W_{ij}| > \tau \\ 0 & \text{otherwise} \end{cases}$$

Here,

$W_{ij}$ = Weight at position (i,j)
$\tau$ = Pruning threshold
$M_{ij}$ = Mask (0 = pruned, 1 = kept)
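
A minimal sketch of magnitude pruning on a small matrix. Choosing $\tau$ from a target sparsity level is one common option and is an assumption here, since the text does not say how the threshold is set; the helper names and sample weights are hypothetical.

```python
def threshold_for_sparsity(weights, sparsity):
    """Pick tau so that roughly `sparsity` of the weights fall at or below it."""
    magnitudes = sorted(abs(w) for row in weights for w in row)
    k = int(sparsity * len(magnitudes))
    return magnitudes[k - 1] if k > 0 else float("-inf")


def magnitude_mask(weights, tau):
    """Mask entry is 1 where |W_ij| > tau (kept) and 0 otherwise (pruned)."""
    return [[1 if abs(w) > tau else 0 for w in row] for row in weights]


def apply_mask(weights, mask):
    """Zero out the pruned positions."""
    return [[w if m else 0.0 for w, m in zip(w_row, m_row)]
            for w_row, m_row in zip(weights, mask)]


W = [[0.8, -0.05, 0.3],
     [-0.01, 0.6, -0.2]]
tau = threshold_for_sparsity(W, sparsity=0.5)
mask = magnitude_mask(W, tau)
pruned = apply_mask(W, mask)
print("tau:", tau)
print("mask:", mask)
print("pruned weights:", pruned)
```

Note that the resulting zeros are scattered through the matrix, so the model only gets smaller or faster if the storage format or kernel can exploit unstructured sparsity.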

Structured Pruning

Remove entire neurons, heads, or layers:

Structured Pruning Score

$$S(g) = \frac{\| W_g \|_1}{|g|} \cdot \text{Importance}(g)$$

Here,

$W_g$ = Weights in group g (neuron, head, etc.)
$|g|$ = Number of parameters in group g
$\text{Importance}(g)$ = Task-specific importance metric
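
The score can be sketched as follows for a handful of attention heads. The importance values are supplied by hand here as placeholders; in practice they would come from a task-specific metric, which the text leaves unspecified. Head names, weights, and the pruning fraction are hypothetical.

```python
def group_score(group_weights, importance):
    """Mean absolute weight of the group, scaled by an importance value."""
    l1_norm = sum(abs(w) for w in group_weights)
    return l1_norm / len(group_weights) * importance


def lowest_scoring_groups(groups, fraction):
    """Names of the lowest-scoring `fraction` of groups (pruning candidates)."""
    scores = {name: group_score(w, imp) for name, (w, imp) in groups.items()}
    num_to_prune = int(fraction * len(groups))
    ranked = sorted(scores, key=scores.get)  # ascending: weakest groups first
    return ranked[:num_to_prune], scores


# name -> (flattened weights of the group, placeholder importance value)
heads = {
    "head_0": ([0.5, -0.5, 0.5, -0.5], 1.0),
    "head_1": ([0.1, -0.1, 0.1, -0.1], 2.0),
    "head_2": ([1.0, -1.0, 0.0, 0.0], 0.2),
    "head_3": ([0.3, 0.3, -0.3, 0.3], 1.0),
}
to_prune, scores = lowest_scoring_groups(heads, fraction=0.25)
print("scores:", scores)
print("prune:", to_prune)
```

In this example `head_2` has the largest individual weights but the lowest importance, so it is the one selected, which illustrates why the score combines both terms.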

Movement Pruning

Prune based on weight movement during fine-tuning:

Movement Pruning Score

$$S(w) = w_t \cdot (w_t - w_0)$$

Here,

$w_t$ = Weight at fine-tuning step t
$w_0$ = Pre-trained weight value

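A negative score means the weight has moved toward zero during fine-tuning (its magnitude shrank), which suggests it is becoming less important for the task and makes it a candidate for pruning. A positive score means the weight has moved away from zero.

Movement pruning is particularly effective for task-specific compression: fine-tune a pre-trained model and prune the weights with the lowest scores. This can achieve 50-70% sparsity with <2% accuracy loss, depending on the task and model.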
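
A small sketch of the score above, applied to four weights before and after a hypothetical fine-tuning run. It uses the simplified score $S(w) = w_t \cdot (w_t - w_0)$ exactly as written in this section; the weight values are made up for illustration.

```python
def movement_scores(pretrained, finetuned):
    """S(w) = w_t * (w_t - w_0) for each weight."""
    return [w_t * (w_t - w_0) for w_0, w_t in zip(pretrained, finetuned)]


def prune_lowest(finetuned, scores, num_to_prune):
    """Zero out the `num_to_prune` weights with the lowest movement scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    pruned_idx = set(order[:num_to_prune])
    return [0.0 if i in pruned_idx else w for i, w in enumerate(finetuned)]


w0 = [0.5, -0.4, 0.3, -0.2]    # pre-trained values
wt = [0.1, -0.6, 0.35, -0.05]  # values after fine-tuning

scores = movement_scores(w0, wt)
moving_toward_zero = [s < 0 for s in scores]
pruned = prune_lowest(wt, scores, num_to_prune=2)
print("scores:", scores)
print("moving toward zero:", moving_toward_zero)
print("after pruning:", pruned)
```

The second weight has the largest magnitude and a positive score, so it is kept; the first and fourth shrank toward zero and are pruned even though magnitude pruning applied to the pre-trained values would have kept the first.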

Stage 3: Knowledge Distillation

Standard Distillation

Distillation Loss

$$\mathcal{L}_{\text{KD}} = \alpha \cdot \mathcal{L}_{\text{CE}}(y, p_S) + (1 - \alpha) \cdot T^2 \cdot D_{\text{KL}}(p_T^{(T)} \| p_S^{(T)})$$

Here,

$p_S$ = Student model predictions
$p_T^{(T)}$ = Teacher model predictions at temperature T
$T$ = Softmax temperature (typically 2-4)
$\alpha$ = Balance between hard and soft labels
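
The loss can be sketched for a single example as below. The hard-label term uses the student's temperature-1 softmax and the soft-label term uses both models at temperature $T$, matching the formula above; the logits, label, and default hyperparameters are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits divided by the temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, label, alpha=0.5, temperature=2.0):
    """alpha * CE(y, p_S) + (1 - alpha) * T^2 * KL(p_T^(T) || p_S^(T))."""
    cross_entropy = -math.log(softmax(student_logits)[label])
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_teacher, p_student))
    return alpha * cross_entropy + (1 - alpha) * temperature ** 2 * kl


teacher = [2.0, 1.0, 0.1]
untrained_student = [0.0, 0.0, 0.0]

loss_mismatch = kd_loss(untrained_student, teacher, label=0)
loss_match = kd_loss(teacher, teacher, label=0)
print("student far from teacher:", loss_mismatch)
print("student equals teacher:  ", loss_match)
```

When the student's logits equal the teacher's, the KL term vanishes and only the hard-label cross-entropy remains. The $T^2$ factor keeps the soft-label gradients on a comparable scale as the temperature changes.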

Feature-Level Distillation

Transfer intermediate representations, not just final outputs:

Feature Distillation

$$\mathcal{L}_{\text{feat}} = \sum_{l \in \mathcal{L}} \| f_l^S(x) - g(f_l^T(x)) \|_2^2$$

Here,

$f_l^S(x)$ = Student feature map at layer l
$f_l^T(x)$ = Teacher feature map at layer l
$g(\cdot)$ = Projection function (if dimensions differ)
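
A minimal sketch of this loss for two matched layers, assuming $g$ is a linear projection from the teacher's feature dimension to the student's. In practice the projection is usually learned jointly with the student; here it is a fixed matrix, and all feature values and names are hypothetical.

```python
def project(matrix, vector):
    """Linear projection g: multiply a (student_dim x teacher_dim) matrix by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def feature_distillation_loss(student_feats, teacher_feats, projections):
    """Sum over layers of the squared L2 distance between f_S and g(f_T)."""
    loss = 0.0
    for f_s, f_t, proj in zip(student_feats, teacher_feats, projections):
        target = project(proj, f_t)
        loss += sum((s - t) ** 2 for s, t in zip(f_s, target))
    return loss


# Teacher features have 3 dimensions, student features have 2.
keep_first_two = [[1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0]]

student_feats = [[1.0, 0.0], [0.5, 0.5]]
teacher_feats = [[1.0, 2.0, 3.0], [0.5, 0.5, 9.0]]
loss = feature_distillation_loss(student_feats, teacher_feats,
                                 [keep_first_two, keep_first_two])
print("feature distillation loss:", loss)
```

The first layer contributes a squared error of 4.0 (the student's second feature is 0.0 against a projected target of 2.0), and the second layer matches its projected target exactly.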

End-to-End Pipeline

The recommended compression pipeline:

Architecture Diagram

1. Start with pre-trained model (e.g., LLaMA 70B, 140GB FP16)
2. Knowledge Distillation → Teacher (70B) → Student (7B)
3. Quantization-Aware Training → INT4 student (~3.5GB)
4. Structured Pruning → Remove 20% of attention heads (the size estimate assumes this removes about 20% of parameters)
5. Final export → INT4 model (~2.8GB)

Stage	Technique	Size Reduction	Quality Retention
1. Distillation	70B → 7B	10x	90-95%
2. QAT	FP16 → INT4	4x	98-99%
3. Pruning	20% sparsity	1.25x	97-99%
Total	Combined	~50x	85-93%
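
The totals in the table can be reproduced with a few lines of arithmetic. The sketch below assumes that size reductions and quality retention both compound multiplicatively across stages, which is a simplification: real quality losses need not multiply cleanly. The stage labels are hypothetical.

```python
# (stage, size-reduction factor, (low, high) quality retention) from the table
stages = [
    ("distillation 70B -> 7B", 10.0, (0.90, 0.95)),
    ("QAT FP16 -> INT4", 4.0, (0.98, 0.99)),
    ("pruning 20%", 1.25, (0.97, 0.99)),
]

size_gb = 140.0  # 70B parameters in FP16
total_ratio = 1.0
quality_low = quality_high = 1.0

for name, factor, (low, high) in stages:
    size_gb /= factor
    total_ratio *= factor
    quality_low *= low
    quality_high *= high
    print(f"{name:24s} -> {size_gb:6.2f} GB")

print(f"total compression: {total_ratio:.0f}x")
print(f"quality retention: {quality_low:.1%} to {quality_high:.1%}")
```

Under these assumptions the pipeline ends at about 2.8GB, a 50x reduction, with compounded quality retention of roughly 85.6% to 93.1%.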

Applying compression sequentially can compound errors. It is often better to combine techniques (e.g., QAT + pruning simultaneously) and tune the balance on a validation set.

Practice Exercises

Conceptual: Explain why quantization-aware training typically outperforms post-training quantization. What is the fundamental difference in approach?
Mathematical: If a 70B parameter model is compressed to 7B via distillation (FP16) and then quantized to INT4, compute the final model size in GB. What is the total compression ratio?
Practical: Implement the full pipeline: distill LLaMA 7B from LLaMA 13B, apply GPTQ 4-bit quantization, and measure perplexity at each stage.
Research: Compare the performance of sequential compression (distill then quantize) with joint compression (distill and quantize simultaneously). Which approach is more effective and why?

Key Takeaways:

The three pillars of compression are quantization, pruning, and distillation
GPTQ can reach 4-bit quantization with typically <1% perplexity degradation
Movement pruning is effective for task-specific compression
Distillation transfers knowledge from large teacher to small student models
Combined pipelines can reach 10-100x compression; the example pipeline above reaches ~50x with roughly 85-93% quality retention

What to Learn Next

-> Quantization Techniques Deep Dive: Detailed quantization methods and theory.

-> Pruning for LLMs: Structured and unstructured pruning methods.

-> Knowledge Distillation: Distillation theory and practice for language models.

-> Low-Rank Factorization: SVD and LoRA for parameter-efficient compression.

-> Hardware-Aware LLM Design: Co-designing models and hardware for efficiency.

-> LLM Optimization for Mobile: Edge deployment and on-device inference.
