Introduction
Hugging Face tokenizers convert text to token IDs and back. The fast variants, which AutoTokenizer returns by default when one is available, are backed by the Rust tokenizers library.
AutoTokenizer
from transformers import AutoTokenizer
# Load pretrained tokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
# Tokenize single string
tokens = tokenizer.encode("Hello world")
print(tokens) # [101, 7592, 2088, 102]
# Decode
text = tokenizer.decode(tokens)
print(text) # "[CLS] hello world [SEP]"
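To make the round trip concrete, here is a minimal pure-Python sketch of what encoding and decoding amount to: a vocabulary lookup wrapped in special tokens. It is not how the library is implemented. TOY_VOCAB, toy_encode, and toy_decode are hypothetical names, the vocabulary holds only six entries chosen to mirror the IDs printed above, and words are split on whitespace rather than into subwords.

```python
# Toy vocabulary (hypothetical); the IDs mirror the bert-base-uncased output above.
TOY_VOCAB = {"[PAD]": 0, "[UNK]": 100, "[CLS]": 101, "[SEP]": 102,
             "hello": 7592, "world": 2088}
ID_TO_TOKEN = {token_id: token for token, token_id in TOY_VOCAB.items()}


def toy_encode(text):
    """Lowercase, split on whitespace, map words to IDs, wrap in [CLS] ... [SEP]."""
    words = text.lower().split()
    ids = [TOY_VOCAB.get(word, TOY_VOCAB["[UNK]"]) for word in words]
    return [TOY_VOCAB["[CLS]"]] + ids + [TOY_VOCAB["[SEP]"]]


def toy_decode(ids):
    """Map IDs back to tokens and join them with spaces."""
    return " ".join(ID_TO_TOKEN[token_id] for token_id in ids)


ids = toy_encode("Hello world")
print(ids)                        # [101, 7592, 2088, 102]
print(toy_decode(ids))            # [CLS] hello world [SEP]
print(toy_encode("Hello there"))  # [101, 7592, 100, 102] ("there" is not in the toy vocab)
```

Words missing from the vocabulary fall back to the [UNK] ID, which is why real tokenizers split rare words into subword pieces instead of looking up whole words.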
Batch Tokenization
texts = ["First sentence", "Second sentence", "Third"]
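# Encode multiple texts; padding=True pads to the longest text in the batch,
# and return_tensors='pt' requires PyTorch to be installed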
encoded = tokenizer(texts, padding=True, truncation=True, max_length=128, return_tensors='pt')
print(encoded['input_ids'])
print(encoded['attention_mask'])
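The relationship between input_ids and attention_mask can be sketched in plain Python. This is a simplified illustration, not the library's code: pad_batch is a hypothetical helper, it assumes right-side padding with pad ID 0, and its truncation simply slices each sequence, whereas the real tokenizer keeps the closing [SEP] when it truncates.

```python
def pad_batch(sequences, pad_id=0, max_length=None):
    """Truncate to max_length, then right-pad every sequence to the longest one."""
    if max_length is not None:
        sequences = [seq[:max_length] for seq in sequences]
    longest = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = longest - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([[101, 7592, 2088, 102], [101, 7592, 102]])
print(batch["input_ids"])       # [[101, 7592, 2088, 102], [101, 7592, 102, 0]]
print(batch["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```

Padding to the longest sequence in the batch, rather than to a fixed length, keeps the tensors rectangular without wasting compute on short batches.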
Special Tokens
# Check special tokens
print(tokenizer.pad_token) # [PAD]
print(tokenizer.unk_token) # [UNK]
print(tokenizer.cls_token) # [CLS]
print(tokenizer.sep_token) # [SEP]
print(tokenizer.mask_token) # [MASK]
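For BERT-style models, [CLS] and [SEP] also frame sentence pairs as [CLS] A [SEP] B [SEP], with segment IDs (token_type_ids) telling the two sentences apart. The sketch below reproduces that layout by hand. build_pair is a hypothetical helper that works on already-tokenized input, not a library function.

```python
def build_pair(tokens_a, tokens_b):
    """Assemble [CLS] A [SEP] B [SEP] and the matching segment IDs."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence A, and the first [SEP];
    # segment 1 covers sentence B and the final [SEP].
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, token_type_ids


tokens, token_type_ids = build_pair(["how", "are", "you"], ["fine", "thanks"])
print(tokens)
# ['[CLS]', 'how', 'are', 'you', '[SEP]', 'fine', 'thanks', '[SEP]']
print(token_type_ids)
# [0, 0, 0, 0, 0, 1, 1, 1]
```

With the real tokenizer, passing two strings, as in tokenizer("how are you", "fine thanks"), produces this layout automatically.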
WordPiece Tokenization
# For BERT-like models
from transformers import BertTokenizer
bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
tokens = bert_tokenizer.tokenize("unhappiness")
print(tokens) # a list of subword pieces; pieces that continue a word start with '##', and the exact split depends on the vocabulary
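The ## prefix marks a piece that continues a word rather than starting one. WordPiece picks pieces with a greedy longest-match-first scan, which can be sketched as follows. This is a toy version under stated assumptions: wordpiece is a hypothetical function, toy_vocab is a five-entry made-up vocabulary (so the split differs from what bert-base-uncased produces), and the input is a single lowercase word.

```python
def wordpiece(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into vocabulary pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is found in the vocab.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position: the whole word becomes unknown.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


toy_vocab = {"un", "##happi", "##ness", "##happy", "happy"}
print(wordpiece("unhappiness", toy_vocab))  # ['un', '##happi', '##ness']
print(wordpiece("xyz", toy_vocab))          # ['[UNK]']
```

Because the scan is greedy, the result depends entirely on which pieces the vocabulary contains, so the same word can split differently under different pretrained models.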
Custom Vocabulary
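# Train a custom BPE tokenizer from scratch
from tokenizers import Tokenizer, normalizers
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.normalizers import NFD, StripAccents
from tokenizers.pre_tokenizers import Whitespace
# A Tokenizer wraps a model; BPE needs to know which token stands in for unknowns
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
# Decompose Unicode characters (NFD), then drop the accent marks
tokenizer.normalizer = normalizers.Sequence([NFD(), StripAccents()])
# Split on whitespace and punctuation before the BPE model sees the text
tokenizer.pre_tokenizer = Whitespace()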
trainer = BpeTrainer(vocab_size=30000, special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"])
tokenizer.train(files=["data.txt"], trainer=trainer)
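What the trainer learns is a ranked list of merge rules. The core loop can be sketched in pure Python as below. This is a toy illustration of the BPE idea, not the library's implementation: learn_bpe_merges is a hypothetical function, it works on a small in-memory word list instead of data.txt, and it ignores normalization, special tokens, and vocabulary-size limits.

```python
from collections import Counter


def learn_bpe_merges(words, num_merges):
    """Learn BPE merge rules from a list of words (toy version)."""
    # Each word starts as a tuple of single characters, weighted by frequency.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair across the corpus.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite the corpus with the most frequent pair fused into one symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges


merges = learn_bpe_merges(["hug", "hug", "hugs", "pug"], num_merges=2)
print(merges)  # [('u', 'g'), ('h', 'ug')]
```

Each merge adds one new symbol to the vocabulary, which is why vocab_size in BpeTrainer effectively caps the number of merges learned.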
Practice Problems
- Load a pretrained tokenizer with AutoTokenizer
- Tokenize a single string and a batch of strings
- Inspect and handle special tokens
- Decode token IDs back to text
- Train a custom tokenizer on your own text file