LLM Optimization for Mobile
Deploying language models on phones, tablets, and edge devices—TFLite, CoreML, quantization for mobile hardware, and the future of on-device AI.
- Frameworks — TFLite, CoreML, ONNX Runtime, MLC-LLM
- Hardware — NPUs, GPUs, and neural accelerators in mobile chips
- Techniques — INT4/INT8, pruning, speculative decoding for mobile
The future of AI is not in the cloud—it's in your pocket.
Definition: On-Device LLM
An on-device LLM is a language model that runs entirely on mobile or edge hardware without requiring cloud connectivity. This requires extreme compression (4-8x), efficient inference engines, and hardware-aware optimization to fit within the constraints of mobile devices (1-4GB RAM, 1-5W power).
Mobile Hardware Constraints
Resource Budgets
| Device Class | RAM | Power | Compute (TOPS) | Example |
|---|---|---|---|---|
| Budget phone | 2-4 GB | 1-2W | 1-2 | Snapdragon 6-series |
| Flagship phone | 8-12 GB | 3-5W | 10-15 | Snapdragon 8 Gen 3 |
| Tablet | 8-16 GB | 5-10W | 15-25 | Apple M2 iPad |
| Edge device | 4-16 GB | 5-20W | 20-50 | Jetson Orin |
Mobile Neural Processing Units
Modern mobile SoCs include dedicated NPUs:
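NPU Throughput
$$\text{Throughput} \approx \frac{\text{TOPS}_{\text{NPU}} \times 10^{12}}{2 \times \text{MACs}_{\text{total}}}$$
Here,
- $\text{TOPS}_{\text{NPU}}$ = Tera-operations per second on the NPU
- $\text{MACs}_{\text{total}}$ = Total multiply-accumulate operations per inference (each MAC counts as two operations)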
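As a rough illustration of how a TOPS rating bounds throughput, the sketch below divides peak operations per second by the operations one inference needs. It assumes one multiply-accumulate counts as two operations and that the NPU runs at full utilization, which real workloads rarely reach. The function name and the example workload of about 7e9 MACs per token are hypothetical.

```python
def peak_inferences_per_second(tops: float, total_macs: float,
                               ops_per_mac: int = 2) -> float:
    """Compute-bound upper limit on inferences per second.

    tops        -- advertised NPU rating in tera-operations per second
    total_macs  -- multiply-accumulates needed for one inference
    ops_per_mac -- assumption: one MAC = one multiply + one add
    """
    if tops <= 0 or total_macs <= 0:
        raise ValueError("tops and total_macs must be positive")
    ops_per_second = tops * 1e12
    return ops_per_second / (ops_per_mac * total_macs)


# Hypothetical workload: roughly one MAC per weight of a 7B model per token.
rate = peak_inferences_per_second(tops=45, total_macs=7e9)
print(f"Upper bound: {rate:.0f} tokens/s at 100% utilization")
```

The measured tokens/sec figures in the benchmark table later in this document are far below a bound like this, which suggests that peak TOPS alone rarely determines on-device generation speed.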
| Chip | NPU | TOPS | INT8 Support |
|---|---|---|---|
| Snapdragon 8 Gen 3 | Hexagon | 45 | Yes |
| Apple A17 Pro | Neural Engine | 35 | Yes |
| MediaTek Dimensity 9300 | APU 790 | 46 | Yes |
| Google Tensor G4 | TPU | 15 | Yes |
Mobile NPUs are optimized for INT8 and INT4 inference. This is why quantization is not just optional but essential for mobile deployment—the hardware is designed for low-precision computation.
Framework Comparison
TensorFlow Lite (TFLite)
Definition: TFLite
TensorFlow Lite is Google's framework for on-device ML inference. It supports quantization (INT8, UINT8), hardware acceleration (NNAPI, GPU delegate), and models converted from TensorFlow or PyTorch via ONNX.
Key features:
- Delegates: GPU, NNAPI, Core ML, Hexagon DSP
- Quantization: Post-training INT8, dynamic range quantization
- Optimization: Operator fusion, constant folding, pruning
- Deployment: Android, iOS, embedded Linux, microcontrollers
CoreML (Apple)
Definition: CoreML
CoreML is Apple's framework for on-device ML inference on iOS, macOS, and watchOS. It automatically optimizes models for Apple's Neural Engine and GPU, with support for INT8 quantization and model compression.
Key features:
- Automatic optimization: Neural Engine, GPU, CPU selection
- Quantization: INT8, FP16, and mixed-precision
- Tools: coremltools for conversion from PyTorch/TensorFlow
- Privacy: All computation on-device, no data leaves the device
MLC-LLM
Definition: MLC-LLM
MLC-LLM (Machine Learning Compilation for LLMs) is a framework for deploying LLMs on any hardware platform. It uses TVM for compiler-level optimization and supports INT4/INT8 quantization with hardware-specific code generation.
Key features:
- Universal deployment: Android, iOS, web, GPUs, NPUs
- INT4 quantization: GPTQ and AWQ support
- Speculative decoding: For faster on-device generation
- Vulkan/Metal acceleration: Cross-platform GPU support
Quantization for Mobile
INT8 Quantization
The standard for mobile deployment:
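Mobile INT8 Quantization
$$w_q = \text{clamp}\left(\text{round}\left(\frac{w}{s}\right) + z,\ -128,\ 127\right)$$
Here,
- $w$ = Original weight
- $s$ = Scale factor
- $z$ = Zero point (typically 0 for symmetric)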
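A minimal pure-Python sketch of this kind of affine INT8 quantization follows. It assumes a single per-tensor scale chosen as max|w| / 127, which is one common symmetric choice rather than the only one; the function names and sample weights are illustrative.

```python
def quantize_int8(weights, zero_point=0):
    """Quantize floats to INT8 codes; returns (codes, scale).

    Assumption: symmetric per-tensor scale = max|w| / 127.
    """
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    codes = [
        max(-128, min(127, round(w / scale) + zero_point))  # clamp to INT8 range
        for w in weights
    ]
    return codes, scale


def dequantize_int8(codes, scale, zero_point=0):
    """Map INT8 codes back to approximate float weights."""
    return [(q - zero_point) * scale for q in codes]


weights = [0.5, -1.27, 0.0, 0.9]
codes, scale = quantize_int8(weights)
restored = dequantize_int8(codes, scale)
print(codes)     # integer codes in [-128, 127]
print(restored)  # each value within scale / 2 of the original
```

With symmetric quantization the zero point stays at 0, so dequantization reduces to a single multiply per weight.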
INT4 Quantization
For maximum compression on memory-constrained devices:
| Quantization | Model Size (7B) | Memory | Perplexity Increase vs. FP16 |
|---|---|---|---|
| FP16 | 14 GB | 16 GB | Baseline |
| INT8 | 7 GB | 8 GB | +0.1-0.3 |
| INT4 (GPTQ) | 3.5 GB | 4 GB | +0.5-1.5 |
| INT4 (AWQ) | 3.5 GB | 4 GB | +0.3-1.0 |
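The model-size column follows directly from parameter count times bits per weight. The sketch below reproduces those figures, assuming decimal gigabytes and ignoring the small overhead of per-group scales and zero points that real INT4 formats carry; the function name is illustrative.

```python
def weight_size_gb(num_params: float, bits_per_weight: int) -> float:
    """Raw weight storage in decimal GB, excluding quantization metadata."""
    return num_params * bits_per_weight / 8 / 1e9


for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {weight_size_gb(7e9, bits):.1f} GB")
```

The Memory column in the table is higher than these values because it also has to cover activations, the KV cache, and runtime overhead.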
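For mobile deployment, AWQ (Activation-aware Weight Quantization) is often preferred over GPTQ. AWQ uses activation statistics to identify the small fraction of weights that matter most for output quality and protects them during quantization, which helps in autoregressive decoding, where errors compound from token to token.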
On-Device Inference Optimization
Speculative Decoding for Mobile
Use a small draft model on-device to accelerate generation:
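Mobile Speculative Decoding
$$\mathbb{E}[\text{tokens per verification}] = \frac{1 - (1 - r)^{k+1}}{r}$$
Here,
- $k$ = Tokens generated by draft model per target verification
- $r$ = Fraction of draft tokens rejected by target model (assumed independent across tokens)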
On mobile, the draft model is typically a 100-500M parameter model running on the NPU, while the target model is a 1-7B model.
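To get a feel for how draft length and rejection rate interact, the sketch below evaluates a common simplified model of speculative decoding: each draft token is rejected independently with the same probability, and the target model always contributes one token per verification pass. Both are assumptions, and the function name is illustrative.

```python
def expected_tokens_per_verification(draft_len: int, reject_rate: float) -> float:
    """Expected tokens emitted per target-model pass.

    Assumption: every draft token is rejected independently with
    probability `reject_rate`.
    """
    if draft_len < 1:
        raise ValueError("draft_len must be at least 1")
    if not 0.0 <= reject_rate <= 1.0:
        raise ValueError("reject_rate must be in [0, 1]")
    if reject_rate == 0.0:
        # All draft tokens accepted, plus one token from the target model.
        return float(draft_len + 1)
    accept_rate = 1.0 - reject_rate
    return (1.0 - accept_rate ** (draft_len + 1)) / reject_rate


for r in (0.1, 0.2, 0.5):
    tokens = expected_tokens_per_verification(draft_len=4, reject_rate=r)
    print(f"reject_rate={r}: {tokens:.2f} tokens per target pass")
```

Under this model, a longer draft only pays off when the rejection rate is low, because the first rejected token discards everything drafted after it.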
KV Cache Optimization
Mobile devices have limited memory for KV cache:
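KV Cache Memory
$$M_{\text{KV}} = 2 \times L \times H \times d \times T \times b$$
Here,
- $L$ = Number of layers
- $H$ = Number of attention heads
- $d$ = Head dimension
- $T$ = Sequence length
- $b$ = Bytes per element (2 for FP16, 1 for INT8)
The leading factor of 2 accounts for storing both keys and values.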
For a 7B model with 32 layers, 32 heads, d=128, and T=2048:
- FP16 KV cache: 2 × 32 × 32 × 128 × 2048 × 2 = 1.07 GB
- INT8 KV cache: 2 × 32 × 32 × 128 × 2048 × 1 = 537 MB
KV cache is often the memory bottleneck for mobile LLM deployment. Solutions include: (1) INT8 KV cache quantization, (2) grouped-query attention (reduces KV cache by 4-8x), (3) sliding window attention, and (4) paged attention for dynamic memory management.
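The worked numbers above, and the effect of grouped-query attention, can be checked with a short calculation. This sketch assumes the cache stores one key and one value vector per KV head, per layer, per token; the 8-KV-head configuration is a hypothetical example of grouped-query attention, not a specific model.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int) -> int:
    """KV cache size in bytes; the factor 2 covers keys and values."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem


fp16 = kv_cache_bytes(32, 32, 128, 2048, 2)      # full multi-head, FP16
int8 = kv_cache_bytes(32, 32, 128, 2048, 1)      # full multi-head, INT8
gqa_fp16 = kv_cache_bytes(32, 8, 128, 2048, 2)   # hypothetical GQA, 8 KV heads

for label, size in [("FP16", fp16), ("INT8", int8), ("GQA FP16", gqa_fp16)]:
    print(f"{label}: {size / 1e9:.2f} GB")
```

Sharing each KV head across four query heads cuts the cache by 4x in this example, and the saving stacks with INT8 cache quantization.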
Practical Deployment Guide
Model Selection for Mobile
| Use Case | Model Size | Quantization | Framework |
|---|---|---|---|
| Text completion | 1-3B | INT4 | TFLite/CoreML |
| Chat assistant | 3-7B | INT4 | MLC-LLM |
| On-device RAG | 1-3B + embeddings | INT8 | TFLite |
| Voice assistant | 1-3B + ASR/TTS | INT4 | CoreML |
Performance Benchmarks
| Model | Device | Framework | Tokens/sec | Memory |
|---|---|---|---|---|
| LLaMA 2 7B (INT4) | Snapdragon 8 Gen 3 | MLC-LLM | 12 | 4.2 GB |
| Phi-2 2.7B (INT4) | iPhone 15 Pro | CoreML | 18 | 1.8 GB |
| Gemma 2B (INT4) | Pixel 8 | TFLite | 15 | 1.5 GB |
| LLaMA 3 8B (INT4) | Snapdragon 8 Gen 3 | MLC-LLM | 10 | 5.1 GB |
For best mobile performance: (1) use INT4 quantization, (2) prefer a model that uses grouped-query attention, (3) limit the KV cache to 512-1024 tokens, (4) use speculative decoding with a small draft model, and (5) profile on the target device before deployment.
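One way to choose a KV cache limit is to work backwards from the memory left over after the weights are loaded. The sketch below does that calculation, assuming the same 32-layer, 32-head, d=128 configuration used earlier and a hypothetical 256 MiB cache budget; the function name is illustrative.

```python
def max_context_tokens(budget_bytes: int, layers: int, kv_heads: int,
                       head_dim: int, bytes_per_elem: int) -> int:
    """Largest sequence length whose KV cache fits in budget_bytes."""
    bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
    return budget_bytes // bytes_per_token


budget = 256 * 1024**2  # hypothetical 256 MiB reserved for the KV cache

tokens_int8 = max_context_tokens(budget, 32, 32, 128, bytes_per_elem=1)
tokens_fp16 = max_context_tokens(budget, 32, 32, 128, bytes_per_elem=2)
print(f"INT8 cache: {tokens_int8} tokens, FP16 cache: {tokens_fp16} tokens")
```

Under these assumptions the budget lands in the 512-1024 token range recommended above. Profiling on the target device is still needed, since the memory actually available varies with the OS and other running apps.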
Practice Exercises
- Conceptual: Explain why INT4 quantization is preferred for mobile deployment over INT8. What are the tradeoffs in terms of quality, memory, and speed?
- Mathematical: Compute the KV cache memory for a 3B model with 24 layers, 16 heads, d=64, and T=1024 using INT8 quantization. Can this fit in 512 MB of available memory?
- Practical: Convert a small LLM (e.g., Phi-2 2.7B) to TFLite format with INT4 quantization and benchmark inference speed on an Android device.
- Research: Compare the energy efficiency (tokens per joule) of on-device inference vs. cloud inference for a 3B model. Under what conditions is on-device more efficient?
Key Takeaways:
- Mobile NPUs are optimized for INT8/INT4 inference, making quantization essential
- TFLite, CoreML, and MLC-LLM are the primary frameworks for mobile deployment
- INT4 quantization (AWQ preferred) achieves 4x compression with acceptable quality loss
- KV cache is the primary memory bottleneck; INT8 KV cache and GQA help
- Speculative decoding with small draft models accelerates on-device generation
What to Learn Next
-> Model Compression Pipeline: end-to-end compression (quantization + pruning + distillation).
-> Quantization Techniques Deep Dive: detailed quantization methods and theory.
-> Hardware-Aware LLM Design: co-designing models and hardware for efficiency.
-> LLM Inference Optimization: general inference optimization techniques.
-> Pruning for LLMs: structured and unstructured pruning methods.
-> Environmental Impact: energy efficiency and sustainable AI practices.