Hugging Face Tokenizers

Introduction

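Hugging Face tokenizers convert text into token IDs and decode token IDs back into text. AutoTokenizer returns a "fast" tokenizer, backed by the Rust-based tokenizers library, whenever one is available for the model.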

AutoTokenizer

from transformers import AutoTokenizer

# Load pretrained tokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

# Encode a single string into token IDs (special tokens are added by default)
token_ids = tokenizer.encode("Hello world")
print(token_ids)  # [101, 7592, 2088, 102]

# Decode the IDs back into text
text = tokenizer.decode(token_ids)
print(text)  # [CLS] hello world [SEP]

Batch Tokenization

texts = ["First sentence", "Second sentence", "Third"]

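# Encode a batch: pad to the longest text, truncate anything over 128 tokens,
# and return PyTorch tensors (return_tensors='pt' requires torch to be installed)
encoded = tokenizer(texts, padding=True, truncation=True, max_length=128, return_tensors='pt')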

print(encoded['input_ids'])
print(encoded['attention_mask'])
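
To make the two printed fields less abstract, here is a minimal pure-Python sketch of what padding does to a batch. The helper name pad_batch and the example IDs are hypothetical, and truncation and tensor conversion are left out; the real tokenizer handles all of this internally.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad every sequence to the length of the longest one.

    Hypothetical helper for illustration only; pad_id=0 matches the
    [PAD] ID used by bert-base-uncased.
    """
    longest = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = longest - len(seq)
        # Real tokens keep their IDs; padding positions get pad_id.
        input_ids.append(list(seq) + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([[101, 7, 8, 102], [101, 9, 102]])
print(batch["input_ids"])       # [[101, 7, 8, 102], [101, 9, 102, 0]]
print(batch["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```

The attention mask is what lets the model tell real tokens apart from padding, which is why it is returned alongside input_ids.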

Special Tokens

# Check special tokens
print(tokenizer.pad_token)       # [PAD]
print(tokenizer.unk_token)       # [UNK]
print(tokenizer.cls_token)       # [CLS]
print(tokenizer.sep_token)       # [SEP]
print(tokenizer.mask_token)      # [MASK]
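
The role these tokens play during encoding and decoding can be sketched with a toy vocabulary. Everything below (TOY_VOCAB, toy_encode, toy_decode) is a hypothetical stand-in written for this lesson, not the library's implementation; only the ID values are copied from bert-base-uncased so the output lines up with the earlier example.

```python
# Hand-written toy vocabulary (an assumption for illustration, not the real vocab file).
TOY_VOCAB = {
    "[PAD]": 0, "[UNK]": 100, "[CLS]": 101, "[SEP]": 102, "[MASK]": 103,
    "hello": 7592, "world": 2088,
}
ID_TO_TOKEN = {token_id: token for token, token_id in TOY_VOCAB.items()}
SPECIAL_TOKENS = {"[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"}


def toy_encode(text):
    """Lowercase, split on whitespace, map words to IDs, wrap in [CLS] ... [SEP]."""
    words = text.lower().split()
    # Words missing from the vocabulary fall back to the [UNK] ID.
    ids = [TOY_VOCAB.get(word, TOY_VOCAB["[UNK]"]) for word in words]
    return [TOY_VOCAB["[CLS]"]] + ids + [TOY_VOCAB["[SEP]"]]


def toy_decode(ids, skip_special_tokens=False):
    """Map IDs back to tokens, optionally dropping the special ones."""
    tokens = [ID_TO_TOKEN[token_id] for token_id in ids]
    if skip_special_tokens:
        tokens = [t for t in tokens if t not in SPECIAL_TOKENS]
    return " ".join(tokens)


ids = toy_encode("Hello world")
print(ids)                                        # [101, 7592, 2088, 102]
print(toy_decode(ids))                            # [CLS] hello world [SEP]
print(toy_decode(ids, skip_special_tokens=True))  # hello world
```

The real tokenizer.decode accepts the same skip_special_tokens=True argument, which is usually what you want when showing decoded text to a user.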

WordPiece Tokenization

# For BERT-like models
from transformers import BertTokenizer

bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
tokens = bert_tokenizer.tokenize("unhappiness")
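print(tokens)  # a list of subword pieces; pieces that continue a word start with '##'
               # (the exact split depends on the model's vocabulary)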
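
WordPiece splits a word by repeatedly taking the longest vocabulary entry that matches at the current position. The sketch below illustrates that greedy longest-match-first idea in plain Python. The function name and the six-entry toy_vocab are assumptions made up for this example, so the split it prints is not what bert-base-uncased would produce.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single word (illustrative sketch)."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it appears in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word carry the '##' prefix.
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position, so the whole word is unknown.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary, invented for this example.
toy_vocab = {"un", "##happi", "##ness", "play", "##ing", "##ed"}
print(wordpiece_tokenize("unhappiness", toy_vocab))  # ['un', '##happi', '##ness']
print(wordpiece_tokenize("playing", toy_vocab))      # ['play', '##ing']
print(wordpiece_tokenize("xyz", toy_vocab))          # ['[UNK]']
```

Because the match is greedy, the pieces you get depend entirely on which entries the vocabulary happens to contain.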

Custom Vocabulary

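# Train a custom BPE tokenizer from a plain-text file
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.normalizers import NFD, StripAccents, Sequence
from tokenizers.pre_tokenizers import Whitespace

# A Tokenizer wraps a model; BPE needs to know which token stands in for unknowns
custom_tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
# Decompose Unicode characters (NFD), then strip the accent marks
custom_tokenizer.normalizer = Sequence([NFD(), StripAccents()])
# Split into words and punctuation before merges are learned
custom_tokenizer.pre_tokenizer = Whitespace()

trainer = BpeTrainer(vocab_size=30000, special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"])
# data.txt is the plain-text training corpus
custom_tokenizer.train(files=["data.txt"], trainer=trainer)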
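
To give a feel for what a BPE trainer learns, here is a small pure-Python sketch of the merge-learning loop: count adjacent symbol pairs, fuse the most frequent pair, repeat. The function learn_bpe_merges and its tie-breaking rule are assumptions made for this illustration; the library's Rust implementation is far more optimized and may order ties differently.

```python
from collections import Counter


def learn_bpe_merges(words, num_merges):
    """Learn up to num_merges merge rules from a list of words (toy BPE training)."""
    # Each distinct word becomes a tuple of single-character symbols with a frequency.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs across the corpus.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # Every word is already a single symbol.
        # Most frequent pair wins; ties are broken alphabetically (an assumption).
        best = min(pair_counts, key=lambda p: (-pair_counts[p], p))
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the chosen pair.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges


merges = learn_bpe_merges(["low", "low", "lower", "lowest"], num_merges=2)
print(merges)  # [('l', 'o'), ('lo', 'w')]
```

Each learned merge adds one new symbol to the vocabulary, so the vocab_size passed to a trainer effectively caps how many merges are learned.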

Practice Problems

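  1. Load a pretrained tokenizer with AutoTokenizer.from_pretrained
  2. Tokenize a single string and a batch of strings
  3. Inspect the tokenizer's special tokens
  4. Decode token IDs back into text
  5. Train a custom tokenizer on your own text file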
