Model Compression — Pruning, Quantization, Distillation



Model Compression

Deep learning models are often over-parameterized for deployment. Model compression reduces model size and computation while maintaining accuracy, enabling deployment on edge devices and reducing inference costs.


Unstructured Pruning

Definition: Unstructured Pruning

Set individual weights to zero based on magnitude:

$$w_{ij} = \begin{cases} 0 & \text{if } |w_{ij}| < \theta \\ w_{ij} & \text{otherwise} \end{cases}$$

This produces sparse weight matrices of unchanged shape, so a speedup only materializes on hardware or libraries with sparse-tensor support.

Magnitude Pruning Threshold

$$w_{ij} = w_{ij} \cdot \mathbb{1}(|w_{ij}| \geq \theta)$$

Here,

  • $w_{ij}$ = Weight to prune
  • $\theta$ = Threshold (a percentile of the absolute weights)
  • $\mathbb{1}$ = Indicator function
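
The thresholding rule can be sketched in a few lines of NumPy. This is a minimal illustration, not a production pruning routine: the function name `magnitude_prune` and the `sparsity` parameter (the fraction of weights to remove) are assumptions made for the example, and $\theta$ is taken as the matching percentile of $|w_{ij}|$.

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the smallest-magnitude entries of `weights`.

    `sparsity` is the target fraction of weights to remove (0.0 to 1.0).
    The threshold theta is the corresponding percentile of |w|.
    """
    magnitudes = np.abs(weights)
    theta = np.percentile(magnitudes, sparsity * 100.0)
    mask = magnitudes >= theta          # indicator 1(|w_ij| >= theta)
    return np.where(mask, weights, 0.0)

W = np.array([[0.05, -0.80, 0.30],
              [-0.02, 0.60, -0.10]])
W_pruned = magnitude_prune(W, sparsity=0.5)
print(W_pruned)
```

The pruned matrix keeps its original shape; only the number of non-zero entries changes, which is why sparse storage or kernels are needed to benefit from it.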

Structured Pruning

Definition: Structured Pruning

Remove entire structures (filters, channels, heads, layers) rather than individual weights:

  • Filter pruning: Remove entire convolutional filters
  • Channel pruning: Remove output channels (equivalent to filter pruning)
  • Head pruning: Remove attention heads in Transformers
  • Layer pruning: Remove entire layers

Structured pruning produces smaller dense matrices that achieve actual speedup without specialized hardware.

Filter Importance (L1-norm)

$$I_f = \sum_{c,h,w} |W_{f,c,h,w}|$$

Here,

  • $I_f$ = Importance score for filter $f$
  • $W_{f,c,h,w}$ = Weight of filter $f$ at position $(c,h,w)$
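
As a rough sketch of L1-norm filter pruning, the snippet below scores each filter of a convolution weight tensor and keeps the highest-scoring ones. The layout `(filters, in_channels, height, width)`, the helper names, and the `keep` parameter are assumptions for illustration; the random weights are synthetic.

```python
import numpy as np

def filter_l1_importance(conv_weight: np.ndarray) -> np.ndarray:
    """Return I_f = sum over (c, h, w) of |W[f, c, h, w]| for every filter f."""
    return np.abs(conv_weight).sum(axis=(1, 2, 3))

def prune_filters(conv_weight: np.ndarray, keep: int):
    """Keep the `keep` filters with the largest L1 importance."""
    scores = filter_l1_importance(conv_weight)
    kept = np.sort(np.argsort(scores)[-keep:])  # sort to preserve filter order
    return conv_weight[kept], kept

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 3, 3, 3))   # synthetic weights: 8 filters, 3 input channels
W[2] *= 0.01                        # make filter 2 clearly unimportant
W_small, kept = prune_filters(W, keep=6)
print(W_small.shape, kept)
```

The result is a smaller dense tensor. In a real network, the next layer's input channels matching the removed filters would have to be dropped as well, which this sketch does not do.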

Quantization

Definition: Quantization

Reduce precision of weights and activations from FP32 to lower-bit representations:

  • FP16: 16-bit floating point (half precision), 2x memory reduction
  • INT8: 8-bit integer, 4x memory reduction and typically a 2-3x speedup on hardware with INT8 support
  • INT4: 4-bit integer, 8x memory reduction
  • Binary: 1-bit, extreme compression (32x memory reduction)
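
The memory ratios follow directly from the bit widths. A small worked example, assuming a hypothetical 100M-parameter model and counting weight storage only (no activations, scales, or other metadata):

```python
def model_size_mb(num_params: int, bits_per_weight: int) -> float:
    """Storage for the weights alone, in megabytes (1 MB = 10**6 bytes)."""
    return num_params * bits_per_weight / 8 / 1e6

num_params = 100_000_000  # hypothetical model size, chosen for illustration
formats = [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4), ("Binary", 1)]
for name, bits in formats:
    size = model_size_mb(num_params, bits)
    ratio = 32 / bits
    print(f"{name:>6}: {size:7.1f} MB ({ratio:.0f}x smaller than FP32)")
```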

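Quantization-aware training (QAT) simulates quantization during training so the model learns to tolerate rounding error, which usually preserves accuracy better than quantizing only after training.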

Linear Quantization

$$x_q = \text{round}\left(\frac{x}{\text{scale}}\right) + \text{zero\_point}$$

Here,

  • $x_q$ = Quantized value
  • $x$ = Original floating-point value
  • $\text{scale}$ = Scaling factor
  • $\text{zero\_point}$ = Zero-point offset

Scale and Zero Point

$$\text{scale} = \frac{x_{\max} - x_{\min}}{2^n - 1}, \quad \text{zero\_point} = \text{round}\left(-\frac{x_{\min}}{\text{scale}}\right)$$

Here,

  • $x_{\min}, x_{\max}$ = Range of floating-point values
  • $n$ = Number of bits (e.g., 8 for INT8)
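
Putting the two formulas together, a quantize/dequantize round trip might look like the sketch below. The function names are illustrative, and two details are assumptions not stated in the formulas: values are clipped to the unsigned range $[0, 2^n - 1]$, and dequantization uses $(x_q - \text{zero\_point}) \cdot \text{scale}$ as the approximate inverse.

```python
import numpy as np

def quant_params(x_min: float, x_max: float, n_bits: int = 8):
    """Scale and zero point for unsigned n-bit linear quantization."""
    scale = (x_max - x_min) / (2 ** n_bits - 1)
    zero_point = int(round(-x_min / scale))
    return scale, zero_point

def quantize(x, scale, zero_point, n_bits: int = 8):
    """x_q = round(x / scale) + zero_point, clipped to [0, 2^n - 1] (assumption)."""
    q = np.round(np.asarray(x) / scale) + zero_point
    return np.clip(q, 0, 2 ** n_bits - 1).astype(np.int32)

def dequantize(q, scale, zero_point):
    """Approximate inverse: recovers x up to rounding error."""
    return (np.asarray(q, dtype=np.float64) - zero_point) * scale

x = np.array([-1.0, -0.25, 0.0, 0.4, 1.0])
scale, zero_point = quant_params(float(x.min()), float(x.max()), n_bits=8)
x_q = quantize(x, scale, zero_point)
x_hat = dequantize(x_q, scale, zero_point)
max_error = np.abs(x - x_hat).max()
print(x_q, x_hat, max_error)
```

The reconstruction error stays within about one quantization step (`scale`), and the floating-point value 0.0 maps exactly onto `zero_point`.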

Knowledge Distillation

$$\mathcal{L} = \alpha \cdot \mathcal{L}_{\text{CE}}(y, p_s) + (1 - \alpha) \cdot T^2 \cdot D_{\text{KL}}(p_t^{\text{soft}} \| p_s^{\text{soft}})$$

Definition: Knowledge Distillation

Distill knowledge from a large teacher model (subscript $t$ in the loss) to a smaller student model (subscript $s$):

  1. Soft targets: Teacher's softened outputs with temperature $T$
  2. Hard targets: Ground truth labels
  3. Combined loss: Weighted sum of soft and hard losses

The soft targets contain "dark knowledge" — relationships between classes that hard labels miss.
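
The combined loss can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions: inputs are raw logits of shape `(batch, classes)`, labels are integer class indices, both terms are averaged over the batch, and the default values of `alpha` and `temperature` are arbitrary choices for the example.

```python
import numpy as np

def softmax(logits, temperature: float = 1.0):
    """Row-wise softmax of logits / temperature (numerically stabilized)."""
    z = np.asarray(logits, dtype=np.float64) / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels,
                      alpha: float = 0.5, temperature: float = 4.0) -> float:
    """alpha * CE(y, p_s) + (1 - alpha) * T^2 * KL(p_t_soft || p_s_soft)."""
    labels = np.asarray(labels)
    p_s = softmax(student_logits)                       # hard-target term uses T = 1
    ce = -np.log(p_s[np.arange(len(labels)), labels]).mean()
    p_t_soft = softmax(teacher_logits, temperature)
    p_s_soft = softmax(student_logits, temperature)
    kl = (p_t_soft * (np.log(p_t_soft) - np.log(p_s_soft))).sum(axis=-1).mean()
    return float(alpha * ce + (1.0 - alpha) * temperature ** 2 * kl)

teacher = np.array([[5.0, 2.0, 0.1]])   # made-up logits for one example
student = np.array([[3.0, 1.0, 0.5]])
labels = np.array([0])
loss = distillation_loss(student, teacher, labels)
print(loss)
```

When the student's logits match the teacher's, the KL term vanishes and only the cross-entropy term remains.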

Softened Probabilities

$$p_i^{\text{soft}} = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$$

Here,

  • $z_i$ = Logit for class $i$
  • $T$ = Temperature (higher = softer distribution)
  • $p_i^{\text{soft}}$ = Softened probability
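
To see how the temperature reshapes the distribution, the short sketch below evaluates the formula for one made-up logit vector at several temperatures. The function name `soft_probs` and the chosen values are illustrative assumptions.

```python
import numpy as np

def soft_probs(logits, T: float) -> np.ndarray:
    """p_i = exp(z_i / T) / sum_j exp(z_j / T), computed stably."""
    z = np.asarray(logits, dtype=np.float64) / T
    e = np.exp(z - z.max())   # subtracting the max does not change the result
    return e / e.sum()

logits = [4.0, 2.0, 0.0]      # made-up logits for three classes
for T in (1.0, 2.0, 10.0):
    print(f"T={T:>4}: {np.round(soft_probs(logits, T), 3)}")
```

As $T$ grows, probability mass moves from the top class toward the others while the ranking of classes stays the same, which is what exposes the teacher's view of inter-class similarity to the student.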
