LLM Optimization
Model Compression Pipeline
End-to-end compression combining quantization, pruning, and distillation—achieving 10-100x efficiency gains while preserving capability.
- Quantization — INT8, INT4, mixed-precision, and GPTQ
- Pruning — Structured, unstructured, and movement pruning
- Distillation — Knowledge transfer from large to small models
Perfection is achieved not when there is nothing more to add, but when there is nothing left to take away.
Definition: Model Compression
Model compression is the process of reducing the computational cost, memory footprint, and energy consumption of a neural network while preserving task performance. The three primary techniques are quantization (reducing numerical precision), pruning (removing redundant parameters), and knowledge distillation (transferring knowledge from a larger to a smaller model).
The Compression Tradeoff
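Compression-Performance Tradeoff

$$
\max \; R - \lambda \, \Delta P, \qquad \Delta P = P_{\text{orig}} - P_{\text{comp}}
$$

Here,
- $P_{\text{comp}}$ = Task performance after compression
- $R$ = Compression ratio achieved
- $\Delta P$ = Performance degradation (relative to the uncompressed performance $P_{\text{orig}}$)
- $\lambda$ = Tradeoff parameter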
The optimal compression strategy depends on the target deployment scenario:
| Scenario | Priority | Recommended Pipeline |
|---|---|---|
| Cloud inference | Throughput | Quantization (INT4) + Pruning (20%) |
| Edge device | Memory | Quantization (INT4) + Distillation |
| Real-time | Latency | Pruning (structured) + Quantization |
| Mobile | All constraints | Distillation + Quantization + Pruning |
Stage 1: Quantization
Post-Training Quantization (PTQ)
Apply quantization after training without retraining:
Linear Quantization

$$
q = \operatorname{round}\!\left(\frac{x}{s}\right) + z
$$

Here,
- $x$ = Original floating-point value
- $z$ = Zero point
- $s$ = Scale factor
- $q$ = Quantized integer value
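Scale and zero point:
Quantization Parameters

$$
s = \frac{x_{\max} - x_{\min}}{2^{b} - 1}, \qquad z = \operatorname{round}\!\left(-\frac{x_{\min}}{s}\right)
$$

Here,
- $[x_{\min}, x_{\max}]$ = Range of values to quantize
- $b$ = Bit width (e.g., 4, 8)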
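As a minimal sketch of asymmetric linear quantization, assuming the usual affine form q = round(x/s) + z, the snippet below derives the scale and zero point from a small list of values and round-trips them through 8-bit codes. The function names and sample weights are illustrative assumptions, and the clamp to [0, 2^b - 1] is a common safeguard rather than something the text above specifies.

```python
def quant_params(x_min, x_max, bits):
    """Scale and zero point for asymmetric (affine) quantization."""
    q_max = 2 ** bits - 1
    scale = (x_max - x_min) / q_max
    zero_point = round(-x_min / scale)
    return scale, zero_point


def quantize(x, scale, zero_point, bits):
    """Map a float to an integer code, clamped to the b-bit range."""
    q = round(x / scale) + zero_point
    return max(0, min(2 ** bits - 1, q))


def dequantize(q, scale, zero_point):
    """Map an integer code back to an approximate float."""
    return scale * (q - zero_point)


# Hypothetical weight values, chosen only for illustration.
weights = [-0.5, -0.13, 0.0, 0.37, 1.0]
scale, zp = quant_params(min(weights), max(weights), bits=8)
codes = [quantize(w, scale, zp, 8) for w in weights]
recovered = [dequantize(q, scale, zp) for q in codes]
max_err = max(abs(w - r) for w, r in zip(weights, recovered))
print(codes, max_err)
```

The round-trip error stays within about half a quantization step, which is why the choice of range (and therefore the scale) matters so much when outliers are present.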
Quantization-Aware Training (QAT)
Simulate quantization during training to learn quantization-friendly weights:
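QAT Loss

$$
\mathcal{L}_{\text{QAT}} = \mathcal{L}_{\text{task}}\big(f(x;\, Q(W))\big) + \mathcal{R}(W)
$$

Here,
- $f(x;\, Q(W))$ = Forward pass with quantized weights
- $\mathcal{R}(W)$ = Regularization to keep weights in quantizable range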
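The sketch below illustrates the two ingredients of a QAT loss under simplifying assumptions: a "fake quantization" step (quantize, then dequantize) applied to the weights in the forward pass, and a regularizer that penalizes weights outside the quantizable range. The symmetric 4-bit grid, the squared-error task loss, the weighting `lam`, and all function names are assumptions made for illustration. In real training the rounding step has zero gradient almost everywhere, so frameworks usually pass gradients through it unchanged (the straight-through estimator); that part is not shown here.

```python
def fake_quant(w, bits=4, w_max=1.0):
    """Quantize then dequantize on a symmetric grid over [-w_max, w_max]."""
    levels = 2 ** (bits - 1) - 1            # 7 positive levels for 4 bits
    step = w_max / levels
    clipped = max(-w_max, min(w_max, w))
    return round(clipped / step) * step


def range_penalty(weights, w_max=1.0):
    """Penalize only the part of each weight that lies outside the range."""
    return sum(max(0.0, abs(w) - w_max) ** 2 for w in weights)


def qat_loss(weights, inputs, target, lam=0.1):
    """Squared error of a linear model run with fake-quantized weights,
    plus the range regularizer on the full-precision (latent) weights."""
    pred = sum(fake_quant(w) * x for w, x in zip(weights, inputs))
    return (pred - target) ** 2 + lam * range_penalty(weights)


loss = qat_loss([0.3, -0.6], inputs=[1.0, 2.0], target=-0.9)
print(loss)
```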
GPTQ (Quantization for LLMs)
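GPTQ builds on Optimal Brain Quantization (OBQ), using approximate second-order information to minimize the layer-wise quantization error on calibration data:
GPTQ Objective

$$
\hat{W} = \arg\min_{\hat{W}} \; \lVert W X - \hat{W} X \rVert_2^2 \quad \text{s.t. } \hat{W} \text{ is representable in } b \text{ bits}
$$

Here,
- $W$ = Original weight matrix
- $\hat{W}$ = Quantized weight matrix
- $X$ = Calibration data activations
- $b$ = Target bit width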
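GPTQ with 4-bit quantization typically costs only a small amount of perplexity (often under 1% on large LLMs; smaller models tend to degrade more). For a 70B model, weight memory drops from 140GB (FP16) to ~35GB (INT4), which fits on a single 48GB or 80GB GPU before accounting for the KV cache and activations.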
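The memory figures follow directly from parameter count times bits per weight. A small helper makes the arithmetic explicit; the function name is an assumption, sizes use decimal gigabytes (1 GB = 10^9 bytes), and the estimate covers weights only, ignoring per-group scales and zero points, the KV cache, and activations.

```python
def model_size_gb(num_params, bits_per_weight):
    """Weight storage only, in decimal GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


fp16_gb = model_size_gb(70e9, 16)   # 140.0
int4_gb = model_size_gb(70e9, 4)    # 35.0
print(fp16_gb, int4_gb, fp16_gb / int4_gb)
```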
Stage 2: Pruning
Unstructured Pruning
Remove individual weights based on magnitude:
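Magnitude Pruning

$$
m_{ij} = \begin{cases} 1 & \text{if } |w_{ij}| > \tau \\ 0 & \text{otherwise} \end{cases}
$$

Here,
- $w_{ij}$ = Weight at position (i,j)
- $\tau$ = Pruning threshold
- $m_{ij}$ = Mask (0 = pruned, 1 = kept)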
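A minimal sketch of magnitude pruning: pick the threshold so that a target fraction of weights falls at or below it, then build the 0/1 mask. The function name, the way the threshold is derived from a target sparsity, and the sample matrix are assumptions for illustration.

```python
def magnitude_mask(weights, sparsity):
    """Return a 0/1 mask that prunes the smallest-magnitude weights.

    `weights` is a list of rows; `sparsity` is the fraction to prune.
    Ties at the threshold are pruned too, so the result can be slightly
    sparser than requested.
    """
    flat = sorted(abs(w) for row in weights for w in row)
    k = int(len(flat) * sparsity)            # number of weights to prune
    if k == 0:
        return [[1 for _ in row] for row in weights]
    tau = flat[k - 1]                        # pruning threshold
    return [[1 if abs(w) > tau else 0 for w in row] for row in weights]


W = [[0.8, -0.05, 0.3],
     [-0.6, 0.02, -0.1]]
mask = magnitude_mask(W, sparsity=0.5)
print(mask)   # [[1, 0, 1], [1, 0, 0]]
```

Because the zeros land at arbitrary positions, this kind of sparsity usually needs sparse kernels or hardware support before it translates into a speedup.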
Structured Pruning
Remove entire neurons, heads, or layers:
Structured Pruning Score

$$
S_g = \frac{1}{|g|} \sum_{w \in W_g} \mathcal{I}(w)
$$

Here,
- $W_g$ = Weights in group g (neuron, head, etc.)
- $|g|$ = Number of parameters in group g
- $\mathcal{I}(\cdot)$ = Task-specific importance metric
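To make group-level scoring concrete, the sketch below averages a per-weight importance over each group and selects the lowest-scoring groups for removal. Squared magnitude stands in for the task-specific importance metric, and the head names and weight values are made up for illustration.

```python
def group_score(group_weights, importance=lambda w: w * w):
    """Mean per-weight importance of one group (neuron, head, ...)."""
    return sum(importance(w) for w in group_weights) / len(group_weights)


def groups_to_prune(groups, fraction):
    """Names of the lowest-scoring groups, covering `fraction` of all groups."""
    scores = {name: group_score(ws) for name, ws in groups.items()}
    n_prune = int(len(groups) * fraction)
    ranked = sorted(scores, key=scores.get)   # ascending importance
    return ranked[:n_prune]


# Hypothetical attention heads, each flattened to a few weights.
heads = {
    "head_0": [0.9, -0.7, 0.8],
    "head_1": [0.05, -0.02, 0.01],
    "head_2": [0.4, 0.3, -0.5],
    "head_3": [-0.6, 0.6, 0.2],
    "head_4": [0.1, -0.2, 0.15],
}
pruned = groups_to_prune(heads, fraction=0.2)
print(pruned)   # ['head_1']
```

Dividing by the group size keeps large groups from looking important simply because they contain more parameters.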
Movement Pruning
Prune based on weight movement during fine-tuning:
Movement Pruning Score

$$
S_{ij} = \left| W^{(t)}_{ij} \right| - \left| W^{(0)}_{ij} \right|
$$

Here,
- $W^{(t)}$ = Weight at fine-tuning step t
- $W^{(0)}$ = Pre-trained weight value
Weights that move toward zero during fine-tuning are candidates for pruning—they are becoming less important for the task.
Movement pruning is particularly effective for task-specific compression, where a pre-trained model is fine-tuned and pruned in the same run. Reported results are in the range of 50-70% sparsity with <2% accuracy loss, depending on the model and task.
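The idea can be sketched with a simplified proxy: score each weight by how much its magnitude changed between the pre-trained checkpoint and the fine-tuned one, and prune the weights that shrank the most. This is an assumption made for illustration; the original movement pruning method learns its scores from gradients during fine-tuning rather than comparing two checkpoints. All names and values below are hypothetical.

```python
def movement_scores(w_pre, w_ft):
    """Change in magnitude; negative means the weight moved toward zero."""
    return [abs(t) - abs(p) for p, t in zip(w_pre, w_ft)]


def movement_mask(w_pre, w_ft, sparsity):
    """0/1 mask that prunes the weights with the lowest movement scores."""
    scores = movement_scores(w_pre, w_ft)
    k = int(len(scores) * sparsity)
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    pruned = set(order[:k])
    return [0 if i in pruned else 1 for i in range(len(scores))]


w_pre = [0.50, -0.40, 0.02, -0.80]   # pre-trained weights
w_ft = [0.30, -0.45, 0.10, -0.35]    # after fine-tuning
mask = movement_mask(w_pre, w_ft, sparsity=0.5)
print(mask)   # [0, 1, 1, 0]
```

Note the contrast with magnitude pruning: the third weight is the smallest in absolute value after fine-tuning, but it is growing, so it is kept, while the fourth weight is larger but shrinking fast and is pruned.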
Stage 3: Knowledge Distillation
Standard Distillation
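Distillation Loss

$$
\mathcal{L}_{\text{KD}} = \alpha \, \mathcal{L}_{\text{CE}}(y,\, p_s) + (1 - \alpha) \, T^2 \, \mathrm{KL}\!\left(p_t^{T} \,\Vert\, p_s^{T}\right)
$$

Here,
- $p_s$ = Student model predictions ($p_s^{T}$ when softened at temperature T)
- $p_t^{T}$ = Teacher model predictions at temperature T
- $T$ = Softmax temperature (typically 2-4)
- $\alpha$ = Balance between hard and soft labels
- $y$ = Ground-truth (hard) labels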
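A minimal pure-Python sketch of the standard distillation loss for a single example, assuming the common formulation in which `alpha` weights the hard-label cross-entropy and `1 - alpha` weights the temperature-softened KL term scaled by T squared. Some implementations assign `alpha` the other way around, so treat the convention here as an assumption; the logits are made up.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with temperature scaling."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label, T=2.0, alpha=0.5):
    # Hard-label term: cross-entropy against the true class at temperature 1.
    hard = -math.log(softmax(student_logits)[label])
    # Soft-label term: KL(teacher || student), both softened at temperature T.
    p_s = softmax(student_logits, T)
    p_t = softmax(teacher_logits, T)
    soft = sum(pt * math.log(pt / ps) for pt, ps in zip(p_t, p_s))
    # T^2 keeps the soft term's gradient scale comparable across temperatures.
    return alpha * hard + (1 - alpha) * T * T * soft


teacher = [4.0, 1.0, 0.0]
good_student = [3.5, 1.2, 0.1]
bad_student = [0.0, 1.0, 4.0]
print(distillation_loss(good_student, teacher, label=0))
print(distillation_loss(bad_student, teacher, label=0))
```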
Feature-Level Distillation
Transfer intermediate representations, not just final outputs:
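Feature Distillation

$$
\mathcal{L}_{\text{feat}} = \sum_{l} \left\lVert \phi\!\left(F_s^{l}\right) - F_t^{l} \right\rVert_2^2
$$

Here,
- $F_s^{l}$ = Student feature map at layer l
- $F_t^{l}$ = Teacher feature map at layer l
- $\phi$ = Projection function (if dimensions differ)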
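As a sketch of feature-level distillation, the snippet below projects each student feature vector to the teacher's width with a linear map and sums the squared differences across layers. A linear projection is only one possible choice for the projection function, and the feature values and matrix are illustrative assumptions.

```python
def project(features, matrix):
    """Linear projection from the student's width to the teacher's width."""
    return [sum(m * f for m, f in zip(row, features)) for row in matrix]


def feature_loss(student_feats, teacher_feats, projections):
    """Sum over layers of the squared L2 distance between projected
    student features and teacher features."""
    total = 0.0
    for f_s, f_t, proj in zip(student_feats, teacher_feats, projections):
        mapped = project(f_s, proj)
        total += sum((a - b) ** 2 for a, b in zip(mapped, f_t))
    return total


# One layer: student width 2, teacher width 3 (hypothetical values).
student = [[1.0, 2.0]]
teacher = [[1.0, 2.0, 4.0]]
proj = [[[1.0, 0.0],
         [0.0, 1.0],
         [1.0, 1.0]]]
loss = feature_loss(student, teacher, proj)
print(loss)   # 1.0
```

In practice the projection is trained jointly with the student and discarded after distillation.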
End-to-End Pipeline
The recommended compression pipeline:
1. Start with pre-trained model (e.g., LLaMA 70B, 140GB FP16)
2. Knowledge Distillation → Teacher (70B) → Student (7B)
3. Quantization-Aware Training → INT4 student (~3.5GB)
4. Structured Pruning → Remove about 20% of parameters (e.g., attention heads and MLP neurons)
5. Final quantization → INT4 model (~2.8GB)
| Stage | Change | Size Reduction | Quality Retention (vs. previous stage) |
|---|---|---|---|
| 1. Distillation | 70B → 7B | 10x | 90-95% |
| 2. QAT | FP16 → INT4 | 4x | 98-99% |
| 3. Pruning | 20% sparsity | 1.25x | 97-99% |
| Total | Combined | ~50x | 85-93% |
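The totals in the table can be checked with a few lines of arithmetic. The sketch assumes decimal gigabytes, counts weights only, treats pruning as removing 20% of all parameters, and multiplies the per-stage quality ranges as if the losses were independent; all variable names are illustrative.

```python
def size_gb(num_params, bits):
    """Weight storage only, in decimal GB (1 GB = 1e9 bytes)."""
    return num_params * bits / 8 / 1e9


teacher_gb = size_gb(70e9, 16)                    # 70B in FP16
student_fp16_gb = size_gb(7e9, 16)                # after distillation
student_int4_gb = size_gb(7e9, 4)                 # after INT4 quantization
pruned_int4_gb = student_int4_gb * (1 - 0.20)     # after removing 20% of parameters
ratio = teacher_gb / pruned_int4_gb               # overall compression ratio

# Per-stage quality retention multiplied together (low and high ends).
low = 0.90 * 0.98 * 0.97
high = 0.95 * 0.99 * 0.99

print(f"{teacher_gb:.1f} -> {student_fp16_gb:.1f} -> "
      f"{student_int4_gb:.1f} -> {pruned_int4_gb:.1f} GB")
print(f"compression ~{ratio:.0f}x, quality {low:.1%} to {high:.1%}")
```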
Applying compression sequentially can compound errors. It is often better to combine techniques (e.g., QAT + pruning simultaneously) and tune the balance on a validation set.
Practice Exercises
- Conceptual: Explain why quantization-aware training typically outperforms post-training quantization. What is the fundamental difference in approach?
- Mathematical: If a 70B parameter model is compressed to 7B via distillation (FP16) and then quantized to INT4, compute the final model size in GB. What is the total compression ratio?
- Practical: Implement the full pipeline: distill LLaMA 7B from LLaMA 13B, apply GPTQ 4-bit quantization, and measure perplexity at each stage.
- Research: Compare the performance of sequential compression (distill then quantize) with joint compression (distill and quantize simultaneously). Which approach is more effective and why?
Key Takeaways:
- The three pillars of compression are quantization, pruning, and distillation
- GPTQ enables 4-bit quantization with small (often <1%) perplexity degradation on large models
- Movement pruning is effective for task-specific compression
- Distillation transfers knowledge from large teacher to small student models
- Combined pipelines can reach 10-100x compression; the example pipeline here reaches ~50x at roughly 85-93% quality retention
What to Learn Next
- Quantization Techniques Deep Dive: Detailed quantization methods and theory.
- Pruning for LLMs: Structured and unstructured pruning methods.
- Knowledge Distillation: Distillation theory and practice for language models.
- Low-Rank Factorization: SVD and LoRA for parameter-efficient compression.
- Hardware-Aware LLM Design: Co-designing models and hardware for efficiency.
- LLM Optimization for Mobile: Edge deployment and on-device inference.