Model Compression Pipeline

An end-to-end compression pipeline combines quantization, pruning, and distillation to achieve 10-100x efficiency gains while retaining most of the original model's capability.

  • Quantization — INT8, INT4, mixed-precision, and GPTQ
  • Pruning — Structured, unstructured, and movement pruning
  • Distillation — Knowledge transfer from large to small models

Perfection is achieved not when there is nothing more to add, but when there is nothing left to take away.

Definition: Model Compression

Model compression is the process of reducing the computational cost, memory footprint, and energy consumption of a neural network while preserving task performance. The three primary techniques are quantization (reducing numerical precision), pruning (removing redundant parameters), and knowledge distillation (transferring knowledge from a larger to a smaller model).

The Compression Tradeoff

Compression-Performance Tradeoff

$$\text{Quality}(\delta) = f(\text{Compression}(\delta)) - \lambda \cdot \text{Cost}(\delta)$$

Here,

  • $\text{Quality}(\delta)$ = Task performance after compression
  • $\text{Compression}(\delta)$ = Compression ratio achieved
  • $\text{Cost}(\delta)$ = Performance degradation
  • $\lambda$ = Tradeoff parameter

The optimal compression strategy depends on the target deployment scenario:

| Scenario | Priority | Recommended Pipeline |
| --- | --- | --- |
| Cloud inference | Throughput | Quantization (INT4) + Pruning (20%) |
| Edge device | Memory | Quantization (INT4) + Distillation |
| Real-time | Latency | Pruning (structured) + Quantization |
| Mobile | All constraints | Distillation + Quantization + Pruning |

Stage 1: Quantization

Post-Training Quantization (PTQ)

Apply quantization after training without retraining:

Linear Quantization

$$q = \text{round}\left(\frac{x}{s}\right) + z$$

Here,

  • $x$ = Original floating-point value
  • $z$ = Zero point
  • $s$ = Scale factor
  • $q$ = Quantized integer value

Scale and zero point:

Quantization Parameters

$$s = \frac{x_{\max} - x_{\min}}{2^b - 1}, \quad z = \text{round}\left(-\frac{x_{\min}}{s}\right)$$

Here,

  • $x_{\max}, x_{\min}$ = Range of values to quantize
  • $b$ = Bit width (e.g., 4, 8)
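
The scale and zero point can be checked on a handful of numbers. The sketch below is a minimal plain-Python illustration, assuming the common asymmetric convention $q = \text{round}(x/s) + z$ with an integer zero point and codes clamped to $[0, 2^b - 1]$. The function names and sample values are hypothetical, chosen only for the example.

```python
def quant_params(x_min, x_max, bits):
    """Scale s and integer zero point z for asymmetric b-bit quantization."""
    levels = 2 ** bits - 1
    s = (x_max - x_min) / levels
    z = round(-x_min / s)
    return s, z


def quantize(x, s, z, bits):
    """Map a float to an integer code: q = round(x / s) + z, clamped to the b-bit range."""
    q = round(x / s) + z
    return max(0, min(2 ** bits - 1, q))


def dequantize(q, s, z):
    """Recover an approximate float from an integer code."""
    return s * (q - z)


# Hypothetical sample values; the range is taken from the data itself.
values = [-0.5, -0.1, 0.0, 0.7, 1.5]
s, z = quant_params(min(values), max(values), bits=8)
codes = [quantize(v, s, z, 8) for v in values]
restored = [dequantize(q, s, z) for q in codes]

for v, q, r in zip(values, codes, restored):
    print(f"{v:+.4f} -> code {q:3d} -> {r:+.4f}")
```

With this convention the smallest value maps to code 0 and the largest to code $2^b - 1$, and the round-trip error of any in-range value is at most $s/2$.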

Quantization-Aware Training (QAT)

Simulate quantization during training to learn quantization-friendly weights:

QAT Loss

$$\mathcal{L}_{\text{QAT}} = \mathcal{L}(y, f_Q(\hat{\theta})) + \lambda \cdot \text{Reg}(\theta)$$

Here,

  • $f_Q(\hat{\theta})$ = Forward pass with quantized weights
  • $\text{Reg}(\theta)$ = Regularization to keep weights in quantizable range
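
A toy version of the QAT loss can be trained in a few lines. The sketch below is an illustration under stated assumptions, not a production recipe: it fits a single weight, uses a symmetric 4-bit grid with a fixed scale, applies a straight-through estimator (a common choice that treats rounding as the identity in the backward pass), and uses a small L2 penalty as a stand-in for $\text{Reg}(\theta)$. All names and hyperparameters are hypothetical.

```python
def fake_quant(w, s, bits=4):
    """Quantize then dequantize w on a symmetric b-bit grid with scale s."""
    q_max = 2 ** (bits - 1) - 1
    q = max(-q_max - 1, min(q_max, round(w / s)))
    return s * q


def qat_fit(xs, ys, s=0.25, bits=4, lr=0.05, lam=1e-3, steps=200):
    """Fit y ~ w * x while the forward pass only ever sees the quantized weight."""
    w = 0.0  # full-precision "shadow" weight updated by the optimizer
    for _ in range(steps):
        w_q = fake_quant(w, s, bits)  # forward pass uses the quantized value
        # Gradient of the mean squared error with respect to w_q.
        grad = sum(2 * (w_q * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        # Straight-through estimator: reuse that gradient for w,
        # then add the gradient of the L2 penalty lam * w**2.
        w -= lr * (grad + 2 * lam * w)
    return w, fake_quant(w, s, bits)


# Hypothetical data generated by y = 0.75 * x, which lies on the 0.25-step grid.
xs = [1.0, 2.0, 3.0]
ys = [0.75, 1.5, 2.25]
w_float, w_quant = qat_fit(xs, ys)
print(w_float, w_quant)
```

The shadow weight does not need to land exactly on a grid point. It only needs to sit in the rounding interval of the grid value that minimizes the loss, which is the sense in which QAT learns quantization-friendly weights.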

GPTQ (Quantization for LLMs)

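GPTQ builds on Optimal Brain Quantization (OBQ). It quantizes the weights of each layer in turn, using approximate second-order information to adjust the remaining weights so that the layer's output error on calibration data stays small: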

GPTQ Objective

$$\min_{\hat{W}} \| WX - \hat{W}X \|_2^2 \quad \text{s.t.} \quad \hat{W} \in \{0, \pm 1, \ldots, \pm 2^{b-1}\}^d$$

Here,

  • $W$ = Original weight matrix
  • $\hat{W}$ = Quantized weight matrix
  • $X$ = Calibration data activations
  • $b$ = Target bit width
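
To make the objective concrete, the sketch below evaluates $\| WX - \hat{W}X \|_2^2$ for a simple round-to-nearest baseline. It is not the GPTQ solver, which goes further by compensating each rounding step with second-order information. It only shows the quantity GPTQ tries to minimize. The matrices, scale, and helper names are hypothetical.

```python
def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def round_to_nearest(W, s, bits):
    """Baseline quantizer: snap every weight to the nearest point of a symmetric grid."""
    q_max = 2 ** (bits - 1) - 1
    return [[s * max(-q_max - 1, min(q_max, round(w / s))) for w in row] for row in W]


def reconstruction_error(W, W_hat, X):
    """Squared error between the layer outputs W X and W_hat X."""
    Y, Y_hat = matmul(W, X), matmul(W_hat, X)
    return sum((y - yh) ** 2 for r, rh in zip(Y, Y_hat) for y, yh in zip(r, rh))


# Hypothetical 2x2 layer and three calibration samples (one per column of X).
W = [[0.9, -0.4], [0.2, 0.7]]
X = [[1.0, 0.5, -1.0], [0.0, 2.0, 1.0]]
W_hat = round_to_nearest(W, s=0.5, bits=4)
print("quantized weights:", W_hat)
print("output error:", reconstruction_error(W, W_hat, X))
```

Because the error is measured on the layer output rather than on the weights, two quantized matrices with the same weight error can score very differently depending on the calibration activations.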

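GPTQ with 4-bit quantization achieves <1% perplexity degradation on most LLMs. For a 70B model, this reduces weight memory from 140 GB (FP16, 2 bytes per parameter) to ~35 GB (INT4, 0.5 bytes per parameter), which fits on a single 48 GB or 80 GB GPU with room left for activations and the KV cache.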
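
The memory figures follow directly from the parameter count and the bits per weight. A minimal sketch of that arithmetic, assuming 1 GB = 10^9 bytes and counting weights only (the helper name is hypothetical):

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Weight storage in GB (10**9 bytes), ignoring activations and KV cache."""
    return n_params * bits_per_weight / 8 / 1e9


fp16 = weight_memory_gb(70e9, 16)
int4 = weight_memory_gb(70e9, 4)
print(f"FP16: {fp16:.0f} GB, INT4: {int4:.0f} GB, ratio: {fp16 / int4:.0f}x")
```

In practice, group-wise quantization also stores scales and zero points for each group of weights, so the effective cost is slightly above 4 bits per weight.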

Stage 2: Pruning

Unstructured Pruning

Remove individual weights based on magnitude:

Magnitude Pruning

$$M_{ij} = \begin{cases} 1 & \text{if } |W_{ij}| > \tau \\ 0 & \text{otherwise} \end{cases}$$

Here,

  • $W_{ij}$ = Weight at position (i,j)
  • $\tau$ = Pruning threshold
  • $M_{ij}$ = Mask (0 = pruned, 1 = kept)
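
The mask can be built directly from the definition. The sketch below is a small plain-Python illustration. Choosing $\tau$ from a target sparsity is one common way to set the threshold and is an assumption here, as are the helper names and the sample matrix.

```python
def magnitude_mask(W, tau):
    """M_ij = 1 if |W_ij| > tau, else 0."""
    return [[1 if abs(w) > tau else 0 for w in row] for row in W]


def threshold_for_sparsity(W, sparsity):
    """Pick tau as the k-th smallest magnitude so about `sparsity` of the weights are pruned."""
    mags = sorted(abs(w) for row in W for w in row)
    k = int(sparsity * len(mags))
    return mags[k - 1] if k > 0 else -1.0  # tau < 0 keeps every weight


def apply_mask(W, M):
    """Zero out pruned weights, keep the rest unchanged."""
    return [[w if m else 0.0 for w, m in zip(w_row, m_row)] for w_row, m_row in zip(W, M)]


# Hypothetical 2x3 weight matrix pruned to 50% sparsity.
W = [[0.8, -0.05, 0.3], [-0.6, 0.02, -0.1]]
tau = threshold_for_sparsity(W, sparsity=0.5)
M = magnitude_mask(W, tau)
print("tau:", tau)
print("mask:", M)
print("pruned weights:", apply_mask(W, M))
```

Because the zeros land at arbitrary positions, this kind of sparsity saves memory only with a sparse storage format and speeds up inference only on hardware or kernels that exploit it.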

Structured Pruning

Remove entire neurons, heads, or layers:

Structured Pruning Score

$$S(g) = \frac{\| W_g \|_1}{|g|} \cdot \text{Importance}(g)$$

Here,

  • $W_g$ = Weights in group $g$ (neuron, head, etc.)
  • $|g|$ = Number of parameters in group $g$
  • $\text{Importance}(g)$ = Task-specific importance metric
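
The score is easy to compute per group once an importance value is available. In the sketch below the groups are attention heads with flattened weights, and the importance numbers are made-up placeholders. In a real setting they would come from a task-specific measurement such as the loss change when a head is masked.

```python
def group_score(weights, importance):
    """S(g) = (||W_g||_1 / |g|) * Importance(g)."""
    return sum(abs(w) for w in weights) / len(weights) * importance


def select_groups_to_prune(groups, importance, fraction):
    """Return the lowest-scoring `fraction` of groups, plus all scores."""
    scores = {name: group_score(w, importance[name]) for name, w in groups.items()}
    n_prune = int(fraction * len(groups))
    ranked = sorted(scores, key=scores.get)  # ascending: least important first
    return ranked[:n_prune], scores


# Hypothetical heads (flattened weights) and placeholder importance values.
heads = {
    "head_0": [0.5, -0.5, 0.5, -0.5],
    "head_1": [0.1, 0.1, -0.1, 0.1],
    "head_2": [1.0, -1.0, 0.0, 0.0],
    "head_3": [0.2, -0.2, 0.2, -0.2],
}
importance = {"head_0": 1.0, "head_1": 1.0, "head_2": 0.1, "head_3": 2.0}

to_prune, scores = select_groups_to_prune(heads, importance, fraction=0.25)
print("scores:", scores)
print("prune:", to_prune)
```

Note how `head_2` has large weights but is still selected, because its low importance dominates the score. Removing whole groups keeps the remaining tensors dense, which is why structured pruning tends to translate into latency gains on standard hardware.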

Movement Pruning

Prune based on weight movement during fine-tuning:

Movement Pruning Score

$$S(w) = w_t \cdot (w_t - w_0)$$

Here,

  • $w_t$ = Weight at fine-tuning step t
  • $w_0$ = Pre-trained weight value

Weights that shrink toward zero during fine-tuning (without changing sign) receive a negative score and are candidates for pruning, since they are becoming less important for the task. This makes movement pruning particularly effective for task-specific compression: fine-tune a pre-trained model and prune the lowest-scoring weights. It can reach 50-70% sparsity with <2% accuracy loss.
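
The score above can be applied to a pair of weight snapshots. The sketch below uses the lesson's simplified score $S(w) = w_t \cdot (w_t - w_0)$ and prunes the lowest-scoring weights. The weight values and function names are hypothetical, and the published movement pruning method instead accumulates gradient-based scores during training.

```python
def movement_score(w_t, w_0):
    """S(w) = w_t * (w_t - w_0): negative when a weight shrinks toward zero."""
    return w_t * (w_t - w_0)


def movement_prune(w_t, w_0, sparsity):
    """Mask out the `sparsity` fraction of weights with the lowest movement score."""
    scores = [movement_score(a, b) for a, b in zip(w_t, w_0)]
    n_prune = int(sparsity * len(scores))
    order = sorted(range(len(scores)), key=scores.__getitem__)  # lowest score first
    pruned = set(order[:n_prune])
    mask = [0 if i in pruned else 1 for i in range(len(scores))]
    return mask, scores


# Hypothetical pre-trained and fine-tuned values for four weights.
w_0 = [1.0, -0.8, 0.3, -0.2]
w_t = [0.4, -0.9, 0.5, -0.05]

mask, scores = movement_prune(w_t, w_0, sparsity=0.5)
print("scores:", scores)
print("mask:", mask)
```

The first weight is still fairly large (0.4) but is pruned because it has moved strongly toward zero, while magnitude pruning on the same snapshot would have removed the third weight (0.5) before the second (-0.9) and kept the first only by a narrow margin. That difference is the point of scoring movement rather than size.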

Stage 3: Knowledge Distillation

Standard Distillation

Distillation Loss

$$\mathcal{L}_{\text{KD}} = \alpha \cdot \mathcal{L}_{\text{CE}}(y, p_S) + (1 - \alpha) \cdot T^2 \cdot D_{\text{KL}}(p_T^{(T)} \| p_S^{(T)})$$

Here,

  • $p_S$ = Student model predictions
  • $p_T^{(T)}$ = Teacher model predictions at temperature $T$
  • $T$ = Softmax temperature (typically 2-4)
  • $\alpha$ = Balance between hard and soft labels
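
The loss can be written out for a single example using only the standard library. The sketch below is a minimal illustration: logits are plain lists, the hard-label term is the usual cross-entropy at temperature 1, and the soft term is the KL divergence between temperature-softened distributions. The function names, default values, and sample logits are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits / temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, label, alpha=0.5, temperature=2.0):
    """alpha * CE(y, p_S) + (1 - alpha) * T^2 * KL(p_T^(T) || p_S^(T))."""
    # Hard-label term: cross-entropy of the student at temperature 1.
    ce = -math.log(softmax(student_logits)[label])
    # Soft-label term: KL divergence between softened teacher and student.
    p_t = softmax(teacher_logits, temperature)
    p_s = softmax(student_logits, temperature)
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_t, p_s) if pt > 0)
    return alpha * ce + (1 - alpha) * temperature ** 2 * kl


# Hypothetical 3-class example where the student roughly agrees with the teacher.
teacher = [4.0, 1.0, 0.5]
student = [2.5, 1.5, 0.2]
print(kd_loss(student, teacher, label=0, alpha=0.5, temperature=2.0))
```

The $T^2$ factor compensates for the gradients of the softened term shrinking as the temperature grows, so the balance set by $\alpha$ stays roughly stable when $T$ changes.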

Feature-Level Distillation

Transfer intermediate representations, not just final outputs:

Feature Distillation

$$\mathcal{L}_{\text{feat}} = \sum_{l \in \mathcal{L}} \| f_l^S(x) - g(f_l^T(x)) \|_2^2$$

Here,

  • $f_l^S(x)$ = Student feature map at layer $l$
  • $f_l^T(x)$ = Teacher feature map at layer $l$
  • $g(\cdot)$ = Projection function (if dimensions differ)
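
A small sketch of the feature loss follows, assuming $g$ is a linear projection from the teacher's width to the student's width and that student and teacher layers have already been paired under a shared index. In practice the student usually has fewer layers, so a layer mapping must be chosen first. The feature vectors, projection matrices, and names are hypothetical.

```python
def project(W_g, feat):
    """Linear projection g(f) = W_g f from teacher width to student width."""
    return [sum(w * f for w, f in zip(row, feat)) for row in W_g]


def feature_loss(student_feats, teacher_feats, projections):
    """Sum over matched layers of the squared L2 distance ||f_S - g(f_T)||^2."""
    total = 0.0
    for layer, f_s in student_feats.items():
        g_f_t = project(projections[layer], teacher_feats[layer])
        total += sum((a - b) ** 2 for a, b in zip(f_s, g_f_t))
    return total


# Hypothetical features: the student is 2-wide, the teacher is 3-wide.
student_feats = {0: [1.0, 2.0], 1: [0.5, -0.5]}
teacher_feats = {0: [1.0, 2.0, 5.0], 1: [1.0, 0.0, 2.0]}
projections = {
    0: [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],  # keeps the first two teacher dims
    1: [[0.5, 0.0, 0.0], [0.0, 0.5, 0.0]],  # keeps and halves the first two dims
}
print(feature_loss(student_feats, teacher_feats, projections))
```

In a training loop the projection matrices are typically learned jointly with the student, and this term is added to the output-level distillation loss with its own weight.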

End-to-End Pipeline

The recommended compression pipeline:

Architecture Diagram

1. Start with the pre-trained model (e.g., LLaMA 70B, 140 GB in FP16)
2. Knowledge distillation: teacher (70B) → student (7B, ~14 GB in FP16)
3. Quantization-aware training: INT4 student (~3.5 GB)
4. Structured pruning: remove 20% of attention heads
5. Final quantization of the pruned model: INT4 (~2.8 GB)

| Stage | Technique | Size Reduction | Quality Retention |
| --- | --- | --- | --- |
| 1. Distillation | 70B → 7B | 10x | 90-95% |
| 2. QAT | FP16 → INT4 | 4x | 98-99% |
| 3. Pruning | 20% sparsity | 1.25x | 97-99% |
| Total | Combined | ~50x | 85-93% |
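
The totals can be reproduced by multiplying the per-stage figures. The sketch below does that arithmetic, assuming size reductions compose multiplicatively and that quality retention also multiplies across stages. The second assumption is a simplification, since real degradations can interact. The stage labels and function name are hypothetical.

```python
# (stage, size reduction factor, (worst-case, best-case) quality retention)
stages = [
    ("distillation 70B -> 7B", 10.0, (0.90, 0.95)),
    ("QAT FP16 -> INT4", 4.0, (0.98, 0.99)),
    ("pruning 20%", 1.25, (0.97, 0.99)),
]


def combine(stages, start_gb):
    """Return final size in GB, total compression ratio, and the retention range."""
    size, lo, hi = start_gb, 1.0, 1.0
    for name, ratio, (q_lo, q_hi) in stages:
        size /= ratio
        lo *= q_lo
        hi *= q_hi
        print(f"{name:<24} -> {size:6.2f} GB")
    return size, start_gb / size, (lo, hi)


size, ratio, (lo, hi) = combine(stages, start_gb=140.0)
print(f"total: {ratio:.0f}x smaller, retention {lo:.1%} to {hi:.1%}")
```

Running it walks the size from 140 GB down through 14 GB and 3.5 GB to 2.8 GB, and the retention range comes out at roughly 85.6% to 93.1%, matching the table's rounded totals.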

Applying compression sequentially can compound errors. It is often better to combine techniques (e.g., QAT + pruning simultaneously) and tune the balance on a validation set.

Practice Exercises

  1. Conceptual: Explain why quantization-aware training typically outperforms post-training quantization. What is the fundamental difference in approach?

  2. Mathematical: If a 70B parameter model is compressed to 7B via distillation (FP16) and then quantized to INT4, compute the final model size in GB. What is the total compression ratio?

  3. Practical: Implement the full pipeline: distill LLaMA 7B from LLaMA 13B, apply GPTQ 4-bit quantization, and measure perplexity at each stage.

  4. Research: Compare the performance of sequential compression (distill then quantize) with joint compression (distill and quantize simultaneously). Which approach is more effective and why?

Key Takeaways:

  • The three pillars of compression are quantization, pruning, and distillation
  • GPTQ achieves 4-bit quantization with <1% perplexity degradation
  • Movement pruning is effective for task-specific compression
  • Distillation transfers knowledge from large teacher to small student models
  • Combined pipelines achieve 10-100x compression with 85-95% quality retention

What to Learn Next

-> Quantization Techniques Deep Dive: Detailed quantization methods and theory.

-> Pruning for LLMs: Structured and unstructured pruning methods.

-> Knowledge Distillation: Distillation theory and practice for language models.

-> Low-Rank Factorization: SVD and LoRA for parameter-efficient compression.

-> Hardware-Aware LLM Design: Co-designing models and hardware for efficiency.

-> LLM Optimization for Mobile: Edge deployment and on-device inference.
