Tokenization for LLMs

FoundationsTokenizationFree Lesson

Advertisement

Tokenization for LLMs

Tokenization is the critical first step in processing text for LLMs---converting raw text into a sequence of integer tokens that the model can process. This tutorial covers the main tokenization algorithms, their mathematical foundations, and practical considerations.

DfTokenization

Tokenization is the process of mapping a sequence of characters to a sequence of tokens (integers) from a fixed vocabulary. The tokenization algorithm determines how the vocabulary is constructed and how text is segmented into tokens.

Why Tokenization Matters

The choice of tokenization algorithm affects:

  • Vocabulary size: Larger vocabularies reduce sequence length but increase embedding parameters
  • Out-of-vocabulary (OOV) handling: How the model handles unseen words
  • Multilingual support: How well the tokenizer works across languages
  • Model performance: Better tokenization can improve downstream task performance

Modern LLMs use vocabularies of 32K-100K tokens. The choice of vocabulary size involves a tradeoff: larger vocabularies reduce sequence length (faster inference) but increase the embedding matrix size (more parameters).

Byte Pair Encoding (BPE)

BPE is the most widely used tokenization algorithm in modern LLMs (GPT-2, GPT-3, LLaMA, Mistral).

DfByte Pair Encoding (BPE)

BPE is a subword tokenization algorithm that iteratively merges the most frequent pair of adjacent tokens in a corpus. Starting from individual characters, it builds up a vocabulary of subword units by repeatedly merging the most frequent byte pairs.

BPE Training Algorithm

  1. Initialize vocabulary with all unique characters (bytes)
  2. Count frequency of all adjacent token pairs
  3. Merge the most frequent pair into a new token
  4. Add the new token to the vocabulary
  5. Repeat until desired vocabulary size is reached

BPE Merge Frequency

textmergescore(a,b)=fractextfreq(ab)textfreq(a)cdottextfreq(b)\\text{merge\\_score}(a, b) = \\frac{\\text{freq}(ab)}{\\text{freq}(a) \\cdot \\text{freq}(b)}

Here,

  • a,ba, b=Adjacent tokens
  • freq(ab)\text{freq}(ab)=Frequency of the pair (a, b)
  • freq(a)\text{freq}(a)=Frequency of token a
  • freq(b)\text{freq}(b)=Frequency of token b

The merge score (also called pointwise mutual information) measures how often two tokens appear together relative to their individual frequencies.

BPE Example

Starting with text: {"low": 5, "lower": 2, "newest": 6, "widest": 3}

Initial vocabulary: {l, o, w, e, r, s, t, n, i, d}

Iteration 1: Most frequent pair is e, s (appears 9 times). Merge to create es token. Iteration 2: Most frequent pair is es, t (appears 9 times). Merge to create est token. Iteration 3: Most frequent pair is l, o (appears 7 times). Merge to create lo token.

BPE is greedy: it always merges the most frequent pair. This can lead to suboptimal tokenizations in some cases, but works well in practice due to its simplicity and efficiency.

WordPiece

WordPiece was originally developed for Google Translate and is used in BERT and other encoder models.

DfWordPiece

WordPiece is similar to BPE but uses a different merge criterion: instead of merging the most frequent pair, it merges the pair that maximizes the likelihood of the training data when added to the vocabulary.

WordPiece Merge Criterion

textscore(a,b)=fractextfreq(ab)textfreq(a)timestextfreq(b)\\text{score}(a, b) = \\frac{\\text{freq}(ab)}{\\text{freq}(a) \\times \\text{freq}(b)}

Here,

  • a,ba, b=Adjacent tokens
  • freq(ab)\text{freq}(ab)=Frequency of the pair (a, b)

The key difference from BPE: WordPiece optimizes for likelihood rather than raw frequency. This can lead to different tokenizations, especially for rare words.

SentencePiece

SentencePiece is a language-agnostic tokenization library that treats text as a raw stream of Unicode characters, without requiring pre-tokenization.

DfSentencePiece

SentencePiece is a tokenization framework that: (1) treats input as raw Unicode (no language-specific preprocessing), (2) treats spaces as regular characters (using \u2581 for space), and (3) can use BPE or Unigram as the underlying algorithm.

Key advantages:

  • Language-agnostic: Works without language-specific tokenizers
  • Reversible: Perfectly reversible tokenization (no information loss)
  • Pre-tokenization free: Handles whitespace natively

LLaMA and many modern LLMs use SentencePiece with BPE. The \u2581 symbol represents a space, allowing the tokenizer to distinguish between hello (beginning of word) and ello (continuation).

Unigram Language Model

Unigram is an alternative to BPE that starts with a large vocabulary and prunes it down.

DfUnigram Tokenization

Unigram tokenization uses a probabilistic model where each token has an associated probability. The most likely tokenization (under the Unigram model) is found using the Viterbi algorithm. Vocabulary is pruned by removing tokens that contribute least to the likelihood.

Unigram Probability

P(mathbfx)=prodi=1nP(xi)P(\\mathbf{x}) = \\prod_{i=1}^{n} P(x_i)

Here,

  • x\mathbf{x}=Tokenized sequence (x_1, ..., x_n)
  • P(xi)P(x_i)=Probability of token x_i

The optimal tokenization is found by:

mathbfx=argmaxmathbfxinS(X)sumi=1nlogP(xi)\\mathbf{x}^* = \\arg\\max_{\\mathbf{x} \\in S(X)} \\sum_{i=1}^{n} \\log P(x_i)

Tiktoken Implementation

Tiktoken is OpenAI's fast BPE tokenizer used for GPT-3.5 and GPT-4. It is implemented in Rust for speed.

`python import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")

text = "Hello, world! This is a tokenization example." tokens = enc.encode(text) print(f"Tokens: {tokens}") print(f"Token strings: {[enc.decode([t]) for t in tokens]}")

decoded = enc.decode(tokens) assert decoded == text `

HuggingFace Tokenizers

The HuggingFace okenizers library provides fast, customizable tokenization:

`python from tokenizers import Tokenizer from tokenizers.models import BPE from tokenizers.trainers import BpeTrainer from tokenizers.pre_tokenizers import ByteLevel from tokenizers.processors import TemplateProcessing

tokenizer = Tokenizer(BPE(unk_token="[UNK]")) tokenizer.pre_tokenizer = ByteLevel(add_prefix_space=False)

trainer = BpeTrainer( vocab_size=32000, special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"], min_frequency=2 )

files = ["corpus.txt"] tokenizer.train(files, trainer)

output = tokenizer.encode("Hello, how are you?") print(f"Tokens: {output.tokens}") print(f"IDs: {output.ids}") `

Impact on Model Performance

Tokenization Efficiency

textefficiency=fractextmeaningfultokenstexttotaltokens\\text{efficiency} = \\frac{\\text{meaningful\\_tokens}}{\\text{total\\_tokens}}

Here,

  • meaningful_tokensmeaningful\_tokens=Number of tokens carrying semantic meaning
  • total_tokenstotal\_tokens=Total number of tokens in the sequence

Key considerations:

  • Multilingual tokenization: Languages like Chinese/Japanese may require 2-3x more tokens than English for the same content
  • Code tokenization: Indentation and syntax must be preserved
  • Numerical tokenization: Numbers should be tokenized consistently

When evaluating tokenizers, consider: (1) compression ratio (bytes per token), (2) reconstruction accuracy, (3) multilingual coverage, and (4) inference speed. The GPT-4 tokenizer achieves approximately 3.5 bytes per token on English text.

Practice Exercises

  1. Implementation: Implement BPE from scratch. Train a tokenizer on a small corpus and compare its vocabulary with SentencePiece.

  2. Analysis: For a given text in English vs Chinese, compare the number of tokens produced by GPT-4's tokenizer. What is the tokenization overhead for Chinese?

  3. Mathematical: Given a vocabulary of 50K tokens and a text of 10,000 characters, estimate the expected number of tokens if the average token covers 3.5 characters.

  4. Research: Investigate how tokenization affects multilingual LLM performance. Why do some languages require more tokens per word?

Key Takeaways:

  • BPE iteratively merges the most frequent token pairs
  • WordPiece optimizes for likelihood rather than raw frequency
  • SentencePiece is language-agnostic and handles whitespace natively
  • Unigram uses probabilistic models with Viterbi decoding
  • Tokenization choice affects vocabulary size, sequence length, and model performance
  • Modern LLMs use 32K-100K token vocabularies with BPE or Unigram

Advertisement

Need Expert LLM Help?

Get personalized tutoring, RAG system design, or production LLM consulting.

Advertisement