LLM Optimization for Mobile

Deploying language models on phones, tablets, and edge devices—TFLite, CoreML, quantization for mobile hardware, and the future of on-device AI.

Frameworks — TFLite, CoreML, ONNX Runtime, MLC-LLM
Hardware — NPUs, GPUs, and neural accelerators in mobile chips
Techniques — INT4/INT8, pruning, speculative decoding for mobile

The future of AI is not in the cloud—it's in your pocket.

Definition: On-Device LLM

An on-device LLM is a language model that runs entirely on mobile or edge hardware, with no cloud connectivity required. Fitting within mobile constraints (roughly 1-4 GB of memory available to the model and a 1-5 W power budget) takes aggressive compression (4-8x), an efficient inference engine, and hardware-aware optimization.

Mobile Hardware Constraints

Resource Budgets

Device Class	RAM	Power	Compute (TOPS)	Example
Budget phone	2-4 GB	1-2W	1-2	Snapdragon 6-series
Flagship phone	8-12 GB	3-5W	10-15	Snapdragon 8 Gen 3
Tablet	8-16 GB	5-10W	15-25	Apple M2 iPad
Edge device	4-16 GB	5-20W	20-50	Jetson Orin

Mobile Neural Processing Units

Modern mobile SoCs include dedicated NPUs:

NPU Throughput

$$\text{TOPS}_{\text{NPU}} = \frac{\text{Operations per inference}}{\text{Inference time (s)} \times 10^{12}}$$

Here,

$\text{TOPS}_{\text{NPU}}$ = Tera-operations per second achieved on the NPU
$\text{Operations per inference}$ = Total multiply-accumulate operations in one inference
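
As a quick illustration of the throughput formula, the sketch below converts an operation count and a measured latency into effective TOPS. The function name and the example numbers are hypothetical, and each multiply-accumulate is counted as one operation, following the definition above.

```python
def effective_tops(ops_per_inference: float, inference_time_s: float) -> float:
    """Effective tera-operations per second for a single inference."""
    if inference_time_s <= 0:
        raise ValueError("inference_time_s must be positive")
    return ops_per_inference / (inference_time_s * 1e12)


# Hypothetical workload: 7e9 operations per generated token, 50 ms per token.
print(f"{effective_tops(7e9, 0.05):.2f} TOPS")  # prints "0.14 TOPS"
```

An effective figure measured this way is usually far below the peak TOPS a vendor quotes, because token-by-token generation tends to be limited by memory bandwidth rather than arithmetic.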

Chip	NPU	TOPS	INT8 Support
Snapdragon 8 Gen 3	Hexagon	45	Yes
Apple A17 Pro	Neural Engine	35	Yes
MediaTek Dimensity 9300	APU 790	46	Yes
Google Tensor G4	TPU	15	Yes

Mobile NPUs are built around low-precision integer arithmetic and typically reach their peak throughput on INT8 and INT4 workloads. Quantization is therefore essential for mobile deployment, not optional: the hardware is designed for low-precision computation.

Framework Comparison

TensorFlow Lite (TFLite)

Definition: TFLite

TensorFlow Lite is Google's framework for on-device ML inference. It supports quantization (INT8, UINT8), hardware acceleration through delegates (NNAPI, GPU), and models converted from TensorFlow directly or from PyTorch via ONNX.

Key features:

Delegates: GPU, NNAPI, Core ML, Hexagon DSP
Quantization: Post-training INT8, dynamic range quantization
Optimization: Operator fusion, constant folding, pruning
Deployment: Android, iOS, embedded Linux, microcontrollers

CoreML (Apple)

Definition: CoreML

CoreML is Apple's framework for on-device ML inference on iOS, macOS, and watchOS. It automatically optimizes models for Apple's Neural Engine and GPU, with support for INT8 quantization and model compression.

Key features:

Automatic optimization: Neural Engine, GPU, CPU selection
Quantization: INT8, FP16, and mixed-precision
Tools: coremltools for conversion from PyTorch/TensorFlow
Privacy: All computation on-device, no data leaves the device

MLC-LLM

Definition: MLC-LLM

MLC-LLM (Machine Learning Compilation for LLMs) is a framework for deploying LLMs on any hardware platform. It uses TVM for compiler-level optimization and supports INT4/INT8 quantization with hardware-specific code generation.

Key features:

Universal deployment: Android, iOS, web, GPUs, NPUs
INT4 quantization: GPTQ and AWQ support
Speculative decoding: For faster on-device generation
Vulkan/Metal acceleration: Cross-platform GPU support

Quantization for Mobile

INT8 Quantization

The standard for mobile deployment:

Mobile INT8 Quantization

$$Q(w) = \text{round}\left(\frac{w}{s}\right) + z, \quad s = \frac{\max(|W|)}{127}$$

Here,

$w$ = Original weight
$W$ = Weight tensor that $w$ belongs to
$s$ = Scale factor
$z$ = Zero point (0 for symmetric quantization)
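
The formula above can be sketched in a few lines of plain Python. This is a minimal per-tensor, symmetric version for illustration only; the function names are hypothetical, and production toolchains usually quantize per channel or per group on real tensors.

```python
def quantize_int8(weights, zero_point=0):
    """Symmetric per-tensor INT8 quantization: q = round(w / s) + z."""
    max_abs = max(abs(w) for w in weights)
    # s = max(|W|) / 127; fall back to 1.0 for an all-zero tensor.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Clamp to the signed 8-bit range after rounding.
    q = [max(-128, min(127, round(w / scale) + zero_point)) for w in weights]
    return q, scale


def dequantize_int8(q, scale, zero_point=0):
    """Approximate reconstruction: w is roughly s * (q - z)."""
    return [scale * (v - zero_point) for v in q]


weights = [0.50, -1.27, 0.03, 1.00]
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)
print(q)         # [50, -127, 3, 100]
print(restored)  # each value is within scale / 2 of the original weight
```

The reconstruction error per weight is bounded by half the scale, so a single outlier weight that inflates max(|W|) degrades the precision of every other weight in the tensor. That is one reason finer-grained scales are common in practice.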

INT4 Quantization

For maximum compression on memory-constrained devices:

Quantization	Model Size (7B)	Runtime Memory	Perplexity Increase vs. FP16
FP16	14 GB	16 GB	Baseline
INT8	7 GB	8 GB	+0.1-0.3
INT4 (GPTQ)	3.5 GB	4 GB	+0.5-1.5
INT4 (AWQ)	3.5 GB	4 GB	+0.3-1.0
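
The model sizes in the table follow from parameters × bits per weight ÷ 8. The sketch below reproduces them; the optional group-scale overhead (one 16-bit scale per group of 128 weights) is an illustrative assumption, not a figure from the table, and the function name is hypothetical.

```python
def weight_size_gb(num_params, bits_per_weight, group_size=None, scale_bits=16):
    """Weight storage in decimal gigabytes (1 GB = 1e9 bytes).

    If group_size is given, add the cost of one scale per group of weights.
    """
    effective_bits = bits_per_weight
    if group_size is not None:
        effective_bits += scale_bits / group_size
    return num_params * effective_bits / 8 / 1e9


for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {weight_size_gb(7e9, bits):.1f} GB")
# FP16: 14.0 GB
# INT8: 7.0 GB
# INT4: 3.5 GB

# Assumed grouping: 128 weights share one FP16 scale (4.125 bits per weight).
print(f"INT4 with group scales: {weight_size_gb(7e9, 4, group_size=128):.2f} GB")  # 3.61 GB
```

Runtime memory is higher than the weight size alone because activations, the KV cache, and framework buffers also need space.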

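For mobile deployment, AWQ (Activation-aware Weight Quantization) is often preferred over GPTQ. AWQ uses activation statistics to identify the small fraction of weight channels that matter most and rescales them so they lose less precision when quantized, which helps preserve generation quality during autoregressive decoding.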

On-Device Inference Optimization

Speculative Decoding for Mobile

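Use a small draft model on-device to propose several tokens at a time, which the larger target model then verifies in a single forward pass. A rough estimate of the gain, ignoring the cost of running the draft model, is: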

Mobile Speculative Decoding

$$\text{Speedup} = \frac{\text{Draft tokens per verification}}{1 + \text{Rejection rate}}$$

Here,

$\text{Draft tokens per verification}$ = Tokens proposed by the draft model per target-model verification pass
$\text{Rejection rate}$ = Fraction of draft tokens rejected by the target model

On mobile, the draft model is typically a 100-500M parameter model running on the NPU, while the target model is a 1-7B model.
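
The speedup formula above does not account for the time spent running the draft model. A commonly used alternative estimate is sketched below, under the simplifying assumption that each draft token is accepted independently with probability `alpha`. The names `alpha`, `gamma` (draft tokens per verification), and `draft_cost_ratio` (draft step time divided by target step time) are introduced here for illustration, and the example values are hypothetical.

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens produced per target-model pass.

    Assumes i.i.d. acceptance with probability alpha over gamma draft tokens:
    (1 - alpha^(gamma + 1)) / (1 - alpha).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if alpha == 1.0:
        return float(gamma + 1)
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)


def estimated_speedup(alpha: float, gamma: int, draft_cost_ratio: float) -> float:
    """Estimated wall-clock speedup over plain autoregressive decoding."""
    # One step costs gamma draft passes plus one target pass.
    return expected_tokens_per_step(alpha, gamma) / (gamma * draft_cost_ratio + 1)


# Hypothetical setting: 80% acceptance, 4 draft tokens, draft 10x cheaper.
print(f"{expected_tokens_per_step(0.8, 4):.2f} tokens per target pass")  # 3.36
print(f"{estimated_speedup(0.8, 4, 0.1):.2f}x estimated speedup")        # 2.40x
```

With an acceptance rate of zero the estimate falls below 1x, which reflects the real risk on mobile: a poorly matched draft model slows generation and costs extra power.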

KV Cache Optimization

Mobile devices have limited memory for KV cache:

KV Cache Memory

$$\text{Memory}_{\text{KV}} = 2 \times L \times n_{\text{heads}} \times d_{\text{head}} \times T \times \text{bytes}$$

Here,

$L$ = Number of layers
$n_{\text{heads}}$ = Number of attention heads
$d_{\text{head}}$ = Head dimension
$T$ = Sequence length
$\text{bytes}$ = Bytes per element (2 for FP16, 1 for INT8)

For a 7B model with 32 layers, 32 heads, head dimension 128, and T=2048:

FP16 KV cache: 2 × 32 × 32 × 128 × 2048 × 2 bytes ≈ 1.07 GB (1 GiB)
INT8 KV cache: 2 × 32 × 32 × 128 × 2048 × 1 byte ≈ 537 MB (512 MiB)
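
The arithmetic above can be reproduced with a small helper. This sketch assumes a single sequence (batch size 1); the parameter name `num_kv_heads` is introduced here so the same function can also model grouped-query attention, where fewer heads store keys and values. The 8-KV-head configuration is a hypothetical example.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem):
    """KV cache size in bytes: 2 (K and V) x L x heads x d_head x T x bytes."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem


fp16 = kv_cache_bytes(32, 32, 128, 2048, 2)
int8 = kv_cache_bytes(32, 32, 128, 2048, 1)
gqa_fp16 = kv_cache_bytes(32, 8, 128, 2048, 2)  # hypothetical: 8 KV heads

print(f"FP16: {fp16 / 1e9:.2f} GB")                  # FP16: 1.07 GB
print(f"INT8: {int8 / 1e6:.0f} MB")                  # INT8: 537 MB
print(f"FP16 with 8 KV heads: {gqa_fp16 / 1e6:.0f} MB")  # 268 MB
```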

KV cache is often the memory bottleneck for mobile LLM deployment. Solutions include: (1) INT8 KV cache quantization, (2) grouped-query attention (reduces KV cache by 4-8x), (3) sliding window attention, and (4) paged attention for dynamic memory management.
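
Of the options listed, sliding window attention is the simplest to picture: the cache keeps only the most recent entries, so memory stays constant no matter how long generation runs. The toy sketch below uses strings as stand-ins for per-token key and value tensors. The class name is hypothetical, and a real engine would typically use one preallocated ring buffer per layer instead of Python deques.

```python
from collections import deque


class SlidingWindowKVCache:
    """Toy KV cache that retains only the last `window` tokens."""

    def __init__(self, window: int):
        # A deque with maxlen evicts the oldest entry automatically.
        self.keys = deque(maxlen=window)
        self.values = deque(maxlen=window)

    def append(self, key, value):
        self.keys.append(key)
        self.values.append(value)

    def __len__(self):
        return len(self.keys)


cache = SlidingWindowKVCache(window=4)
for t in range(10):
    cache.append(f"k{t}", f"v{t}")

print(len(cache))        # 4
print(list(cache.keys))  # ['k6', 'k7', 'k8', 'k9']
```

The tradeoff is that tokens outside the window can no longer be attended to directly, so this works best with models trained with a sliding window.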

Practical Deployment Guide

Model Selection for Mobile

Use Case	Model Size	Quantization	Framework
Text completion	1-3B	INT4	TFLite/CoreML
Chat assistant	3-7B	INT4	MLC-LLM
On-device RAG	1-3B + embeddings	INT8	TFLite
Voice assistant	1-3B + ASR/TTS	INT4	CoreML

Performance Benchmarks

Model	Device	Framework	Tokens/sec	Memory
LLaMA 2 7B (INT4)	Snapdragon 8 Gen 3	MLC-LLM	12	4.2 GB
Phi-2 2.7B (INT4)	iPhone 15 Pro	CoreML	18	1.8 GB
Gemma 2B (INT4)	Pixel 8	TFLite	15	1.5 GB
LLaMA 3 8B (INT4)	Snapdragon 8 Gen 3	MLC-LLM	10	5.1 GB

For best mobile performance: (1) use INT4 quantization, (2) choose a model that uses grouped-query attention where possible, (3) limit the context length, and with it the KV cache, to 512-1024 tokens, (4) use speculative decoding with a small draft model, and (5) profile on the target device before deployment.

Practice Exercises

Conceptual: Explain why INT4 quantization is preferred for mobile deployment over INT8. What are the tradeoffs in terms of quality, memory, and speed?
Mathematical: Compute the KV cache memory for a 3B model with 24 layers, 16 heads, head dimension 64, and T=1024 using INT8 quantization. Can this fit in 512 MB of available memory?
Practical: Convert a small LLM (e.g., Phi-2 2.7B) to TFLite format with INT4 quantization and benchmark inference speed on an Android device.
Research: Compare the energy efficiency (tokens per joule) of on-device inference vs. cloud inference for a 3B model. Under what conditions is on-device more efficient?

Key Takeaways:

Mobile NPUs are optimized for INT8/INT4 inference, making quantization essential
TFLite, CoreML, and MLC-LLM are the primary frameworks for mobile deployment
INT4 quantization (AWQ often preferred) achieves 4x compression relative to FP16 with acceptable quality loss
KV cache is often the memory bottleneck at longer contexts; INT8 KV cache and GQA help
Speculative decoding with small draft models accelerates on-device generation

What to Learn Next

-> Model Compression Pipeline: End-to-end compression with quantization, pruning, and distillation.

-> Quantization Techniques Deep Dive: Detailed quantization methods and theory.

-> Hardware-Aware LLM Design: Co-designing models and hardware for efficiency.

-> LLM Inference Optimization: General inference optimization techniques.

-> Pruning for LLMs: Structured and unstructured pruning methods.

-> Environmental Impact: Energy efficiency and sustainable AI practices.
