LLM Optimization for Mobile

Deploying language models on phones, tablets, and edge devices—TFLite, CoreML, quantization for mobile hardware, and the future of on-device AI.

  • Frameworks — TFLite, CoreML, ONNX Runtime, MLC-LLM
  • Hardware — NPUs, GPUs, and neural accelerators in mobile chips
  • Techniques — INT4/INT8, pruning, speculative decoding for mobile

The future of AI is not in the cloud—it's in your pocket.

Definition: On-Device LLM

An on-device LLM is a language model that runs entirely on mobile or edge hardware without requiring cloud connectivity. This requires extreme compression (4-8x), efficient inference engines, and hardware-aware optimization to fit within the constraints of mobile devices (1-4GB RAM, 1-5W power).

Mobile Hardware Constraints

Resource Budgets

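| Device class | RAM | Power | Compute (TOPS) | Example |
|---|---|---|---|---|
| Budget phone | 2-4 GB | 1-2 W | 1-2 | Snapdragon 6-series |
| Flagship phone | 8-12 GB | 3-5 W | 10-15 | Snapdragon 8 Gen 3 |
| Tablet | 8-16 GB | 5-10 W | 15-25 | Apple M2 iPad |
| Edge device | 4-16 GB | 5-20 W | 20-50 | Jetson Orin |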

Mobile Neural Processing Units

Modern mobile SoCs include dedicated NPUs:

NPU Throughput

$$\text{TOPS}_{\text{NPU}} = \frac{\text{Operations per inference}}{\text{Inference time (s)} \times 10^{12}}$$

Here,

  • $\text{TOPS}_{\text{NPU}}$ = tera-operations per second on the NPU
  • $\text{Operations per inference}$ = total multiply-accumulate operations in one inference

| Chip | NPU | TOPS | INT8 support |
|---|---|---|---|
| Snapdragon 8 Gen 3 | Hexagon | 45 | Yes |
| Apple A17 Pro | Neural Engine | 35 | Yes |
| MediaTek Dimensity 9300 | APU 790 | 46 | Yes |
| Google Tensor G4 | TPU | 15 | Yes |
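As a rough illustration of the throughput formula above, the sketch below computes achieved TOPS from an operation count and a measured latency. The function name `achieved_tops` and the example workload are hypothetical, chosen only to show the arithmetic.

```python
def achieved_tops(ops_per_inference: float, inference_time_s: float) -> float:
    """Achieved tera-operations per second for a single inference."""
    if inference_time_s <= 0:
        raise ValueError("inference_time_s must be positive")
    return ops_per_inference / (inference_time_s * 1e12)


# Hypothetical workload: 2e12 operations that finish in 0.1 s.
tops = achieved_tops(2e12, 0.1)
print(f"Achieved throughput: {tops:.1f} TOPS")
```

Comparing the achieved figure with a chip's advertised peak gives a quick sense of how well a model uses the NPU.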

Mobile NPUs are optimized for INT8 and INT4 inference, so quantization is essential rather than optional for mobile deployment: the hardware is designed for low-precision computation.

Framework Comparison

TensorFlow Lite (TFLite)

Definition: TFLite

TensorFlow Lite is Google's framework for on-device ML inference. It supports quantization (INT8, UINT8), hardware acceleration (NNAPI, GPU delegate), and models converted from TensorFlow or PyTorch via ONNX.

Key features:

  • Delegates: GPU, NNAPI, Core ML, Hexagon DSP
  • Quantization: Post-training INT8, dynamic range quantization
  • Optimization: Operator fusion, constant folding, pruning
  • Deployment: Android, iOS, embedded Linux, microcontrollers
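A minimal sketch of the post-training dynamic range quantization listed above, using the `tf.lite.TFLiteConverter` API. It assumes the `tensorflow` package is installed and that a SavedModel already exists on disk. The function name and both paths are placeholders, and large LLMs may need extra conversion steps not shown here.

```python
def convert_dynamic_range(saved_model_dir: str, out_path: str) -> int:
    """Convert a TensorFlow SavedModel to a .tflite file using
    post-training dynamic range quantization. Returns the size in bytes."""
    import tensorflow as tf  # imported lazily; requires the tensorflow package

    converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
    # Optimize.DEFAULT without a representative dataset selects dynamic
    # range quantization: weights are stored as INT8, activations stay float.
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    tflite_bytes = converter.convert()

    with open(out_path, "wb") as f:
        f.write(tflite_bytes)
    return len(tflite_bytes)


# Example call (paths are placeholders):
# convert_dynamic_range("path/to/saved_model", "model_dynamic_range.tflite")
```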

CoreML (Apple)

Definition: CoreML

CoreML is Apple's framework for on-device ML inference on iOS, macOS, and watchOS. It automatically optimizes models for Apple's Neural Engine and GPU, with support for INT8 quantization and model compression.

Key features:

  • Automatic optimization: Neural Engine, GPU, CPU selection
  • Quantization: INT8, FP16, and mixed-precision
  • Tools: coremltools for conversion from PyTorch/TensorFlow
  • Privacy: All computation on-device, no data leaves the device
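The conversion path through coremltools might look like the sketch below. It assumes `torch` and `coremltools` are installed and that the model can be traced with `torch.jit.trace`, which is not guaranteed for every LLM. The function name, the input name `"input"`, and the output path are illustrative.

```python
def convert_to_coreml(torch_model, example_input, out_path: str):
    """Trace a PyTorch model and convert it to a Core ML ML Program.
    out_path should end in .mlpackage."""
    import torch              # imported lazily; requires torch
    import coremltools as ct  # imported lazily; requires coremltools

    torch_model.eval()
    traced = torch.jit.trace(torch_model, example_input)

    mlmodel = ct.convert(
        traced,
        inputs=[ct.TensorType(name="input", shape=example_input.shape)],
        convert_to="mlprogram",
        # ALL lets Core ML choose among Neural Engine, GPU, and CPU at runtime.
        compute_units=ct.ComputeUnit.ALL,
    )
    mlmodel.save(out_path)
    return mlmodel


# Example call (model and input are placeholders):
# convert_to_coreml(my_model, torch.zeros(1, 128), "MyModel.mlpackage")
```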

MLC-LLM

Definition: MLC-LLM

MLC-LLM (Machine Learning Compilation for LLMs) is a framework for deploying LLMs on any hardware platform. It uses TVM for compiler-level optimization and supports INT4/INT8 quantization with hardware-specific code generation.

Key features:

  • Universal deployment: Android, iOS, web, GPUs, NPUs
  • INT4 quantization: GPTQ and AWQ support
  • Speculative decoding: For faster on-device generation
  • Vulkan/Metal acceleration: Cross-platform GPU support

Quantization for Mobile

INT8 Quantization

The standard for mobile deployment:

Mobile INT8 Quantization

$$Q(w) = \text{round}\left(\frac{w}{s}\right) + z, \quad s = \frac{\max(|W|)}{127}$$

Here,

  • $w$ = original weight
  • $W$ = the full set of weights being quantized together
  • $s$ = scale factor
  • $z$ = zero point (typically 0 for symmetric quantization)
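The formula above can be sketched in plain Python as symmetric per-tensor quantization with z = 0. The function names and the sample weights are illustrative. Real toolchains usually quantize per channel or per group and operate on tensors rather than lists.

```python
def quantize_int8_symmetric(weights):
    """Symmetric per-tensor INT8 quantization: s = max|W| / 127, z = 0."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Clamp to [-127, 127] to guard against rounding past the INT8 range.
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map INT8 codes back to approximate floating-point weights."""
    return [v * scale for v in q]


weights = [0.50, -1.27, 0.03, 1.00]
q, scale = quantize_int8_symmetric(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

The round-trip error per weight is bounded by half the scale, which is why a single large outlier in W (which inflates s) hurts precision for every other weight.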

INT4 Quantization

For maximum compression on memory-constrained devices:

| Quantization | Model size (7B) | Memory | Perplexity increase (PPL) |
|---|---|---|---|
| FP16 | 14 GB | 16 GB | Baseline |
| INT8 | 7 GB | 8 GB | +0.1-0.3 |
| INT4 (GPTQ) | 3.5 GB | 4 GB | +0.5-1.5 |
| INT4 (AWQ) | 3.5 GB | 4 GB | +0.3-1.0 |
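The model-size column follows directly from parameter count times bits per weight. The sketch below reproduces it for a 7B model. The helper name is hypothetical, and it ignores quantization scales, zero points, and file metadata, which add a small overhead in practice.

```python
def weight_size_gb(n_params: float, bits_per_weight: int) -> float:
    """Weight storage in decimal GB, ignoring scales and metadata."""
    return n_params * bits_per_weight / 8 / 1e9


for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {weight_size_gb(7e9, bits):.1f} GB")
```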

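For mobile deployment, AWQ (Activation-aware Weight Quantization) is often preferred over GPTQ. AWQ uses activation statistics to identify the small fraction of weights that matter most for output quality and protects them during quantization, which matters for autoregressive decoding because each generated token feeds into the next.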

On-Device Inference Optimization

Speculative Decoding for Mobile

Use a small draft model on-device to accelerate generation:

Mobile Speculative Decoding

$$\text{Speedup} = \frac{\text{Draft tokens per verification}}{1 + \text{Rejection rate}}$$

Here,

  • $\text{Draft tokens per verification}$ = tokens generated by the draft model per target verification
  • $\text{Rejection rate}$ = fraction of draft tokens rejected by the target model

On mobile, the draft model is typically a 100-500M parameter model running on the NPU, while the target model is a 1-7B model.
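To make the draft-and-verify loop concrete, here is a toy sketch of the greedy variant of speculative decoding. The two "models" are hypothetical arithmetic stand-ins, not real networks, and the target is queried once per position here, whereas a real implementation checks all draft tokens in one batched forward pass.

```python
def greedy_decode(next_token, prompt, max_new_tokens):
    """Plain autoregressive decoding: one model call per generated token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(next_token(tokens))
    return tokens


def speculative_decode(target_next, draft_next, prompt, max_new_tokens, k=4):
    """Greedy speculative decoding. Returns (tokens, verification_count)."""
    tokens = list(prompt)
    limit = len(prompt) + max_new_tokens
    verifications = 0
    while len(tokens) < limit:
        # 1. The draft model proposes k tokens autoregressively.
        proposal = []
        for _ in range(k):
            proposal.append(draft_next(tokens + proposal))
        # 2. The target model verifies the proposal (counted as one step).
        verifications += 1
        accepted = []
        for tok in proposal:
            expected = target_next(tokens + accepted)
            accepted.append(expected)  # always keep the target's choice
            if tok != expected:
                break                  # first mismatch ends this round
        else:
            # Every draft token matched: the target adds one bonus token.
            accepted.append(target_next(tokens + accepted))
        tokens.extend(accepted)
    return tokens[:limit], verifications


def target_next(tokens):
    """Toy target model: a deterministic function of the last token."""
    return (tokens[-1] * 3 + 1) % 11


def draft_next(tokens):
    """Toy draft model: agrees with the target except after token 7."""
    return 5 if tokens[-1] == 7 else target_next(tokens)


prompt = [1]
out, verifications = speculative_decode(target_next, draft_next, prompt, 10)
print(out, verifications)
```

Because the target's token is always the one kept, the output matches plain greedy decoding with the target model. The gain is that several tokens are committed per target verification.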

KV Cache Optimization

Mobile devices have limited memory for KV cache:

KV Cache Memory

$$\text{Memory}_{\text{KV}} = 2 \times L \times n_{\text{heads}} \times d_{\text{head}} \times T \times \text{bytes}$$

Here,

  • $L$ = number of layers
  • $n_{\text{heads}}$ = number of attention heads
  • $d_{\text{head}}$ = head dimension
  • $T$ = sequence length
  • $\text{bytes}$ = bytes per element (2 for FP16, 1 for INT8)

For a 7B model with 32 layers, 32 heads, d_head = 128, and T = 2048:

  • FP16 KV cache: 2 × 32 × 32 × 128 × 2048 × 2 bytes ≈ 1.07 GB (1 GiB)
  • INT8 KV cache: half of that, ≈ 537 MB (512 MiB)
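The worked example above can be checked with a few lines of Python. The function name is illustrative, and the figures are for a single sequence (batch size 1).

```python
def kv_cache_bytes(n_layers, n_heads, head_dim, seq_len, bytes_per_elem):
    """KV cache size in bytes for one sequence."""
    # Factor 2: each layer stores one key tensor and one value tensor.
    return 2 * n_layers * n_heads * head_dim * seq_len * bytes_per_elem


fp16 = kv_cache_bytes(32, 32, 128, 2048, 2)
int8 = kv_cache_bytes(32, 32, 128, 2048, 1)
print(f"FP16: {fp16 / 1e9:.2f} GB ({fp16 / 2**30:.0f} GiB)")
print(f"INT8: {int8 / 1e6:.0f} MB ({int8 / 2**20:.0f} MiB)")
```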

KV cache is often the memory bottleneck for mobile LLM deployment. Solutions include: (1) INT8 KV cache quantization, (2) grouped-query attention (reduces KV cache by 4-8x), (3) sliding window attention, and (4) paged attention for dynamic memory management.
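To see how these mitigations combine, the sketch below extends the KV cache formula with a separate KV-head count (for grouped-query attention) and an optional sliding window. The 8 KV heads and the 1024-token window are assumed example settings, not values taken from a specific model.

```python
def kv_cache_bytes_gqa(n_layers, n_kv_heads, head_dim, seq_len,
                       bytes_per_elem, window=None):
    """KV cache bytes with grouped-query attention and an optional window."""
    # With a sliding window, only the most recent `window` tokens are cached.
    cached_tokens = seq_len if window is None else min(seq_len, window)
    return 2 * n_layers * n_kv_heads * head_dim * cached_tokens * bytes_per_elem


baseline = kv_cache_bytes_gqa(32, 32, 128, 2048, 2)              # MHA, FP16
gqa = kv_cache_bytes_gqa(32, 8, 128, 2048, 2)                    # 8 KV heads
combined = kv_cache_bytes_gqa(32, 8, 128, 2048, 1, window=1024)  # + INT8 + window

for label, size in [("baseline", baseline), ("GQA", gqa), ("combined", combined)]:
    print(f"{label}: {size / 2**20:.0f} MiB ({baseline // size}x smaller)")
```

The reductions multiply: 4x from sharing KV heads, 2x from INT8, and 2x from halving the cached window in this example.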

Practical Deployment Guide

Model Selection for Mobile

| Use case | Model size | Quantization | Framework |
|---|---|---|---|
| Text completion | 1-3B | INT4 | TFLite/CoreML |
| Chat assistant | 3-7B | INT4 | MLC-LLM |
| On-device RAG | 1-3B + embeddings | INT8 | TFLite |
| Voice assistant | 1-3B + ASR/TTS | INT4 | CoreML |

Performance Benchmarks

| Model | Device | Framework | Tokens/sec | Memory |
|---|---|---|---|---|
| LLaMA 2 7B (INT4) | Snapdragon 8 Gen 3 | MLC-LLM | 12 | 4.2 GB |
| Phi-2 2.7B (INT4) | iPhone 15 Pro | CoreML | 18 | 1.8 GB |
| Gemma 2B (INT4) | Pixel 8 | TFLite | 15 | 1.5 GB |
| LLaMA 3 8B (INT4) | Snapdragon 8 Gen 3 | MLC-LLM | 10 | 5.1 GB |

For best mobile performance: (1) use INT4 weight quantization, (2) prefer models that use grouped-query attention, (3) limit the KV cache to 512-1024 tokens, (4) use speculative decoding with a small draft model, and (5) profile on the target device before deployment.
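For the profiling step, a tokens-per-second measurement can be as simple as the sketch below. `fake_decode_step` is a stand-in for one real decode step on the device. An actual measurement would call the framework's own generation API and should also track memory and thermal throttling.

```python
import time


def measure_tokens_per_second(generate_token, n_tokens, warmup=2):
    """Time n_tokens calls to generate_token after a short warm-up."""
    for _ in range(warmup):
        generate_token()  # warm caches and lazy initialization
    start = time.perf_counter()
    for _ in range(n_tokens):
        generate_token()
    elapsed = time.perf_counter() - start
    return n_tokens / elapsed


def fake_decode_step():
    """Stand-in for one decode step of a real on-device model."""
    time.sleep(0.005)


tps = measure_tokens_per_second(fake_decode_step, n_tokens=20)
print(f"{tps:.1f} tokens/sec")
```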

Practice Exercises

  1. Conceptual: Explain why INT4 quantization is preferred for mobile deployment over INT8. What are the tradeoffs in terms of quality, memory, and speed?

  2. Mathematical: Compute the KV cache memory for a 3B model with 24 layers, 16 heads, d=64, and T=1024 using INT8 quantization. Can this fit in 512MB of available memory?

  3. Practical: Convert a small LLM (e.g., Phi-2 2.7B) to TFLite format with INT4 quantization and benchmark inference speed on an Android device.

  4. Research: Compare the energy efficiency (tokens per joule) of on-device inference vs. cloud inference for a 3B model. Under what conditions is on-device more efficient?

Key Takeaways:

  • Mobile NPUs are optimized for INT8/INT4 inference, making quantization essential
  • TFLite, CoreML, and MLC-LLM are the primary frameworks for mobile deployment
  • INT4 quantization (AWQ preferred) achieves 4x compression relative to FP16 with acceptable quality loss
  • KV cache is the primary memory bottleneck; INT8 KV cache and GQA help
  • Speculative decoding with small draft models accelerates on-device generation

What to Learn Next

-> Model Compression Pipeline: end-to-end compression with quantization, pruning, and distillation.

-> Quantization Techniques Deep Dive: detailed quantization methods and theory.

-> Hardware-Aware LLM Design: co-designing models and hardware for efficiency.

-> LLM Inference Optimization: general inference optimization techniques.

-> Pruning for LLMs: structured and unstructured pruning methods.

-> Environmental Impact: energy efficiency and sustainable AI practices.
