Tokenization for LLMs
Tokenization is the critical first step in processing text for LLMs---converting raw text into a sequence of integer tokens that the model can process. This tutorial covers the main tokenization algorithms, their mathematical foundations, and practical considerations.
DfTokenization
Tokenization is the process of mapping a sequence of characters to a sequence of tokens (integers) from a fixed vocabulary. The tokenization algorithm determines how the vocabulary is constructed and how text is segmented into tokens.
Why Tokenization Matters
The choice of tokenization algorithm affects:
- Vocabulary size: Larger vocabularies reduce sequence length but increase embedding parameters
- Out-of-vocabulary (OOV) handling: How the model handles unseen words
- Multilingual support: How well the tokenizer works across languages
- Model performance: Better tokenization can improve downstream task performance
Modern LLMs use vocabularies of 32K-100K tokens. The choice of vocabulary size involves a tradeoff: larger vocabularies reduce sequence length (faster inference) but increase the embedding matrix size (more parameters).
Byte Pair Encoding (BPE)
BPE is the most widely used tokenization algorithm in modern LLMs (GPT-2, GPT-3, LLaMA, Mistral).
DfByte Pair Encoding (BPE)
BPE is a subword tokenization algorithm that iteratively merges the most frequent pair of adjacent tokens in a corpus. Starting from individual characters, it builds up a vocabulary of subword units by repeatedly merging the most frequent byte pairs.
BPE Training Algorithm
- Initialize vocabulary with all unique characters (bytes)
- Count frequency of all adjacent token pairs
- Merge the most frequent pair into a new token
- Add the new token to the vocabulary
- Repeat until desired vocabulary size is reached
BPE Merge Frequency
Here,
- =Adjacent tokens
- =Frequency of the pair (a, b)
- =Frequency of token a
- =Frequency of token b
The merge score (also called pointwise mutual information) measures how often two tokens appear together relative to their individual frequencies.
BPE Example
Starting with text: {"low": 5, "lower": 2, "newest": 6, "widest": 3}
Initial vocabulary: {l, o, w, e, r, s, t, n, i, d}
Iteration 1: Most frequent pair is e, s (appears 9 times). Merge to create es token. Iteration 2: Most frequent pair is es, t (appears 9 times). Merge to create est token. Iteration 3: Most frequent pair is l, o (appears 7 times). Merge to create lo token.
BPE is greedy: it always merges the most frequent pair. This can lead to suboptimal tokenizations in some cases, but works well in practice due to its simplicity and efficiency.
WordPiece
WordPiece was originally developed for Google Translate and is used in BERT and other encoder models.
DfWordPiece
WordPiece is similar to BPE but uses a different merge criterion: instead of merging the most frequent pair, it merges the pair that maximizes the likelihood of the training data when added to the vocabulary.
WordPiece Merge Criterion
Here,
- =Adjacent tokens
- =Frequency of the pair (a, b)
The key difference from BPE: WordPiece optimizes for likelihood rather than raw frequency. This can lead to different tokenizations, especially for rare words.
SentencePiece
SentencePiece is a language-agnostic tokenization library that treats text as a raw stream of Unicode characters, without requiring pre-tokenization.
DfSentencePiece
SentencePiece is a tokenization framework that: (1) treats input as raw Unicode (no language-specific preprocessing), (2) treats spaces as regular characters (using \u2581 for space), and (3) can use BPE or Unigram as the underlying algorithm.
Key advantages:
- Language-agnostic: Works without language-specific tokenizers
- Reversible: Perfectly reversible tokenization (no information loss)
- Pre-tokenization free: Handles whitespace natively
LLaMA and many modern LLMs use SentencePiece with BPE. The \u2581 symbol represents a space, allowing the tokenizer to distinguish between hello (beginning of word) and ello (continuation).
Unigram Language Model
Unigram is an alternative to BPE that starts with a large vocabulary and prunes it down.
DfUnigram Tokenization
Unigram tokenization uses a probabilistic model where each token has an associated probability. The most likely tokenization (under the Unigram model) is found using the Viterbi algorithm. Vocabulary is pruned by removing tokens that contribute least to the likelihood.
Unigram Probability
Here,
- =Tokenized sequence (x_1, ..., x_n)
- =Probability of token x_i
The optimal tokenization is found by:
Tiktoken Implementation
Tiktoken is OpenAI's fast BPE tokenizer used for GPT-3.5 and GPT-4. It is implemented in Rust for speed.
`python import tiktoken
enc = tiktoken.encoding_for_model("gpt-4")
text = "Hello, world! This is a tokenization example." tokens = enc.encode(text) print(f"Tokens: {tokens}") print(f"Token strings: {[enc.decode([t]) for t in tokens]}")
decoded = enc.decode(tokens) assert decoded == text `
HuggingFace Tokenizers
The HuggingFace okenizers library provides fast, customizable tokenization:
`python from tokenizers import Tokenizer from tokenizers.models import BPE from tokenizers.trainers import BpeTrainer from tokenizers.pre_tokenizers import ByteLevel from tokenizers.processors import TemplateProcessing
tokenizer = Tokenizer(BPE(unk_token="[UNK]")) tokenizer.pre_tokenizer = ByteLevel(add_prefix_space=False)
trainer = BpeTrainer( vocab_size=32000, special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"], min_frequency=2 )
files = ["corpus.txt"] tokenizer.train(files, trainer)
output = tokenizer.encode("Hello, how are you?") print(f"Tokens: {output.tokens}") print(f"IDs: {output.ids}") `
Impact on Model Performance
Tokenization Efficiency
Here,
- =Number of tokens carrying semantic meaning
- =Total number of tokens in the sequence
Key considerations:
- Multilingual tokenization: Languages like Chinese/Japanese may require 2-3x more tokens than English for the same content
- Code tokenization: Indentation and syntax must be preserved
- Numerical tokenization: Numbers should be tokenized consistently
When evaluating tokenizers, consider: (1) compression ratio (bytes per token), (2) reconstruction accuracy, (3) multilingual coverage, and (4) inference speed. The GPT-4 tokenizer achieves approximately 3.5 bytes per token on English text.
Practice Exercises
-
Implementation: Implement BPE from scratch. Train a tokenizer on a small corpus and compare its vocabulary with SentencePiece.
-
Analysis: For a given text in English vs Chinese, compare the number of tokens produced by GPT-4's tokenizer. What is the tokenization overhead for Chinese?
-
Mathematical: Given a vocabulary of 50K tokens and a text of 10,000 characters, estimate the expected number of tokens if the average token covers 3.5 characters.
-
Research: Investigate how tokenization affects multilingual LLM performance. Why do some languages require more tokens per word?
Key Takeaways:
- BPE iteratively merges the most frequent token pairs
- WordPiece optimizes for likelihood rather than raw frequency
- SentencePiece is language-agnostic and handles whitespace natively
- Unigram uses probabilistic models with Viterbi decoding
- Tokenization choice affects vocabulary size, sequence length, and model performance
- Modern LLMs use 32K-100K token vocabularies with BPE or Unigram