
Tokenization — How LLMs Break Text Into Manageable Pieces

Tokenization is the first step in processing text for an LLM: it converts raw text into a sequence of integer token IDs that the model can consume. This guide covers the BPE, WordPiece, and Unigram algorithms, the SentencePiece library, and practical implementation details.

Why Tokenization Matters

The choice of tokenization algorithm affects:

- **Vocabulary size**: Larger vocabularies reduce sequence length but increase embedding parameters
- **Out-of-vocabulary (OOV) handling**: How the model handles unseen words
- **Multilingual support**: How well the tokenizer works across languages
- **Model performance**: Better tokenization can improve downstream task performance
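
The vocabulary-size trade-off can be made concrete with a little arithmetic. The sketch below assumes a hypothetical hidden size of 4096 and counts only a single embedding matrix; a model with an untied output projection would pay this cost twice.

```python
def embedding_params(vocab_size: int, d_model: int) -> int:
    """Parameters in one embedding matrix of shape (vocab_size, d_model)."""
    return vocab_size * d_model


D_MODEL = 4096  # hypothetical hidden size, chosen only for illustration

for vocab_size in (32_000, 100_000):
    params = embedding_params(vocab_size, D_MODEL)
    print(f"vocab={vocab_size:>7,}  embedding params={params:,}")
```

Roughly tripling the vocabulary roughly triples the embedding table, which is the cost paid for shorter token sequences.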

Byte Pair Encoding (BPE)

BPE is the most widely used tokenization algorithm in modern LLMs (GPT-2, GPT-3, LLaMA, Mistral).

BPE Training Algorithm

1. Initialize the vocabulary with all unique characters (or all 256 byte values, for byte-level BPE)
2. Count the frequency of all adjacent token pairs
3. Merge the most frequent pair into a new token
4. Add the new token to the vocabulary
5. Repeat steps 2 to 4 until the desired vocabulary size is reached

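Standard BPE ranks candidate merges by raw pair frequency alone. A merge score that measures how often two tokens appear together relative to their individual frequencies (similar in spirit to pointwise mutual information) is the criterion WordPiece uses, described below.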

BPE Example

Starting with the word-frequency corpus: {"low": 5, "lower": 2, "newest": 6, "widest": 3}

Initial vocabulary: {l, o, w, e, r, s, t, n, i, d}

- Iteration 1: The most frequent pair is (e, s), which appears 9 times (tied with (s, t)). Merge it to create the token es.
- Iteration 2: The most frequent pair is (es, t), which appears 9 times. Merge it to create the token est.
- Iteration 3: The most frequent pair is (l, o), which appears 7 times (tied with (o, w)). Merge it to create the token lo.
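
The training loop and the worked example above can be reproduced with a short from-scratch sketch. This is a teaching implementation rather than how production tokenizers are written: the function names are illustrative, words are pre-split and weighted by their counts, and ties between equally frequent pairs are broken by first occurrence (an arbitrary choice).

```python
from collections import Counter


def count_pairs(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with the concatenated symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(corpus, num_merges):
    """Return the learned merges (with their counts) and the segmented words."""
    words = {tuple(word): freq for word, freq in corpus.items()}
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(words)
        if not pairs:
            break
        # max() keeps the first pair seen among ties.
        best = max(pairs, key=pairs.get)
        merges.append((best, pairs[best]))
        words = merge_pair(words, best)
    return merges, words


corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, words = train_bpe(corpus, num_merges=3)
for pair, count in merges:
    print(pair, count)
print(words)
```

With this tie-breaking rule the three learned merges are (e, s), (es, t), and (l, o), matching the iterations listed in the example.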

WordPiece

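WordPiece was originally developed at Google for Japanese and Korean voice search, was later adopted in Google's neural machine translation system, and is used in BERT and other encoder models.

The key difference from BPE: WordPiece picks the merge that most increases the likelihood of the training data rather than the pair with the highest raw frequency. In practice this favors pairs whose parts rarely occur apart, which can lead to different tokenizations, especially for rare words.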
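
One commonly described formulation of the WordPiece criterion scores a candidate pair as count(ab) / (count(a) × count(b)). The sketch below applies that score to the corpus from the BPE example and compares the first merge each method would choose. Treat it as an approximation for illustration: it omits the `##` continuation prefix that BERT-style vocabularies use, and Google has not published its original trainer.

```python
from collections import Counter

corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}

symbol_counts = Counter()
pair_counts = Counter()
for word, freq in corpus.items():
    for ch in word:
        symbol_counts[ch] += freq
    for a, b in zip(word, word[1:]):
        pair_counts[(a, b)] += freq


def wordpiece_score(pair):
    """Pair count normalized by the counts of its two parts."""
    a, b = pair
    return pair_counts[pair] / (symbol_counts[a] * symbol_counts[b])


# BPE-style choice: highest raw pair count.
bpe_choice = max(pair_counts, key=pair_counts.get)
# WordPiece-style choice: highest normalized score.
wordpiece_choice = max(pair_counts, key=wordpiece_score)

print("BPE picks:      ", bpe_choice, pair_counts[bpe_choice])
print("WordPiece picks:", wordpiece_choice, round(wordpiece_score(wordpiece_choice), 4))
```

On this corpus the raw-frequency rule picks (e, s), while the normalized score picks (i, d): the pair is rare, but "i" and "d" never occur apart, so merging them loses nothing.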

SentencePiece

SentencePiece is a language-agnostic tokenization library that treats text as a raw stream of Unicode characters, without requiring pre-tokenization. It implements both BPE and Unigram training.

Key advantages:

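- **Language-agnostic**: Works without language-specific pre-tokenizers
- **Reversible**: Whitespace is encoded as an ordinary symbol (`▁`), so the normalized text can be reconstructed exactly from the tokens (no information loss)
- **Pre-tokenization free**: Handles whitespace natively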
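
The reversibility property comes from treating whitespace as a regular symbol. The sketch below is a toy illustration of that idea in plain Python, not the SentencePiece API: the helper names are hypothetical, the fixed-size chunking stands in for a learned segmentation, and it skips details such as Unicode normalization and the dummy prefix that SentencePiece adds by default.

```python
WS = "\u2581"  # "▁" (LOWER ONE EIGHTH BLOCK), the whitespace marker SentencePiece uses


def escape_whitespace(text: str) -> str:
    """Turn spaces into a visible symbol so they survive segmentation."""
    return text.replace(" ", WS)


def toy_segment(escaped: str, size: int = 3) -> list[str]:
    """Stand-in for a learned segmentation: fixed-size chunks (hypothetical)."""
    return [escaped[i:i + size] for i in range(0, len(escaped), size)]


def detokenize(pieces: list[str]) -> str:
    """Concatenate the pieces and map the marker back to spaces."""
    return "".join(pieces).replace(WS, " ")


text = "Hello  world, again"  # note the double space
pieces = toy_segment(escape_whitespace(text))
print(pieces)
assert detokenize(pieces) == text
```

Because spaces are part of the token stream, detokenization is plain concatenation, regardless of where the segment boundaries fall.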

Unigram Language Model

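Unigram is an alternative to BPE that works in the opposite direction: it starts with a large candidate vocabulary, assigns each token a probability, and repeatedly prunes the tokens whose removal hurts the corpus likelihood least until the target size is reached.

For an input string X with candidate segmentations S(X), the optimal tokenization is the one that maximizes the product of its token probabilities, and it is found with the Viterbi algorithm:

<MathFormula
  title="Unigram Segmentation"
  tex={`\\mathbf{x}^{*} = \\arg\\max_{\\mathbf{x} \\in S(X)} \\prod_{i=1}^{|\\mathbf{x}|} p(x_i)`}
/>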
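
A minimal sketch of Viterbi segmentation under a unigram model is shown here. The vocabulary and its probabilities are made up for illustration, and the function name is hypothetical; a real Unigram tokenizer learns the probabilities with EM and handles unknown characters.

```python
import math


def viterbi_segment(text, log_probs):
    """Return the highest-probability segmentation of `text` and its log-probability."""
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i]: best log-prob of any segmentation of text[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last token in that segmentation
    best[0] = 0.0
    max_len = max(len(token) for token in log_probs)

    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in log_probs:
                score = best[start] + log_probs[piece]
                if score > best[end]:
                    best[end] = score
                    back[end] = start

    if best[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")

    # Walk the back-pointers from the end of the string to recover the tokens.
    tokens, end = [], n
    while end > 0:
        tokens.append(text[back[end]:end])
        end = back[end]
    return tokens[::-1], best[n]


# Hypothetical token probabilities (they sum to 1.0).
probs = {
    "l": 0.05, "o": 0.05, "w": 0.05, "e": 0.05, "s": 0.05, "t": 0.05,
    "lo": 0.1, "es": 0.1, "we": 0.1, "low": 0.2, "est": 0.2,
}
log_probs = {token: math.log(p) for token, p in probs.items()}

tokens, score = viterbi_segment("lowest", log_probs)
print(tokens, math.exp(score))
```

The dynamic program considers every way to split the string but only keeps the best score per prefix, so the cost is proportional to the text length times the longest token length.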

Tiktoken Implementation

Tiktoken is OpenAI's fast BPE tokenizer used for GPT-3.5 and GPT-4. It is implemented in Rust for speed.

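```python
import tiktoken

# Look up the encoding that the named model uses.
enc = tiktoken.encoding_for_model("gpt-4")

text = "Hello, world! This is a tokenization example."
tokens = enc.encode(text)  # list of integer token IDs
print(f"Tokens: {tokens}")
# Decode each ID on its own to see the individual token strings.
print(f"Token strings: {[enc.decode([t]) for t in tokens]}")

# Encoding followed by decoding reproduces the original text.
decoded = enc.decode(tokens)
assert decoded == text
```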

HuggingFace Tokenizers

The HuggingFace tokenizers library provides fast, customizable tokenization:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import ByteLevel

# BPE model; unknown symbols map to the [UNK] special token.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
# Byte-level pre-tokenization, without adding a leading space.
tokenizer.pre_tokenizer = ByteLevel(add_prefix_space=False)

trainer = BpeTrainer(
    vocab_size=32000,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
    min_frequency=2,  # skip pairs seen fewer than two times
)

# corpus.txt must exist: a plain-text training file.
files = ["corpus.txt"]
tokenizer.train(files, trainer)

output = tokenizer.encode("Hello, how are you?")
print(f"Tokens: {output.tokens}")
print(f"IDs: {output.ids}")
```

## Impact on Model Performance

<MathFormula
  title="Tokenization Efficiency"
  tex={`\\text{efficiency} = \\frac{\\text{meaningful\\_tokens}}{\\text{total\\_tokens}}`}
/>

Key considerations:
- **Multilingual tokenization**: Languages like Chinese/Japanese may require 2-3x more tokens than English for the same content
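- **Code tokenization**: Indentation is significant in languages such as Python, so whitespace must survive the round trip, and runs of spaces are ideally merged into single tokens
- **Numerical tokenization**: Numbers should be split consistently (for example, always digit by digit) so that the same value is not tokenized differently in different contexts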

<MathNote type="tip">
When evaluating tokenizers, consider: (1) compression ratio (bytes per token), (2) reconstruction accuracy, (3) multilingual coverage, and (4) inference speed. The GPT-4 tokenizer achieves approximately 3.5 bytes per token on English text.
</MathNote>
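
The compression ratio from the tip above is simple to compute once a tokenizer has produced its IDs. The sketch below uses made-up token IDs in place of a real tokenizer's output, and the helper names are illustrative; it also shows one reason CJK text tends to cost more tokens under byte-level tokenizers, since each of these characters occupies three UTF-8 bytes before any merges apply.

```python
def bytes_per_token(text: str, token_ids: list) -> float:
    """Compression ratio: UTF-8 bytes of input per emitted token."""
    return len(text.encode("utf-8")) / len(token_ids)


def utf8_bytes_per_char(text: str) -> float:
    """Average number of UTF-8 bytes per character."""
    return len(text.encode("utf-8")) / len(text)


english = "Hello, world!"
chinese = "你好，世界！"

print(utf8_bytes_per_char(english))  # ASCII text: one byte per character
print(utf8_bytes_per_char(chinese))  # these characters take three bytes each

# Hypothetical token IDs standing in for a real tokenizer's output.
fake_ids = [101, 102, 103, 104]
print(bytes_per_token(english, fake_ids))
```

To measure a real tokenizer, pass the IDs it returns for a representative text sample instead of the placeholder list.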

## Practice Exercises

1. **Implementation**: Implement BPE from scratch. Train a tokenizer on a small corpus and compare its vocabulary with SentencePiece.

2. **Analysis**: For a given text in English vs Chinese, compare the number of tokens produced by GPT-4's tokenizer. What is the tokenization overhead for Chinese?

3. **Mathematical**: Given a vocabulary of 50K tokens and a text of 10,000 characters, estimate the expected number of tokens if the average token covers 3.5 characters.

4. **Research**: Investigate how tokenization affects multilingual LLM performance. Why do some languages require more tokens per word?

<MathSummary>
**Key Takeaways:**
- BPE iteratively merges the most frequent token pairs
- WordPiece optimizes for likelihood rather than raw frequency
- SentencePiece is language-agnostic and handles whitespace natively
- Unigram uses probabilistic models with Viterbi decoding
- Tokenization choice affects vocabulary size, sequence length, and model performance
- Many modern LLMs use vocabularies of roughly 32K to 100K tokens, and some use more, built with BPE or Unigram
</MathSummary>
