NLP Basics: Tokenization, Embeddings
Natural Language Processing bridges human language and machine understanding. This lesson covers text preprocessing, tokenization, word embeddings, and sequence models: the building blocks of modern NLP.
1. The NLP Pipeline
Raw Text → Tokenization → Cleaning → Encoding → Embedding → Model → Output
"I love NLP!" → ["I","love","NLP!"] → ["i","love","nlp"] → [45, 892, 3] → [0.2, -0.5, ...] → classifier → POSITIVE
Each step transforms text into a numerical representation that models can process.
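The first stages of the pipeline can be sketched in a few lines of plain Python. This is a minimal illustration, not a production tokenizer: the helper names (`tokenize`, `clean`, `build_vocab`, `encode`) and the toy vocabulary are assumptions made for this example, so the resulting IDs differ from the illustrative `[45, 892, 3]` above.

```python
import re

def tokenize(text):
    # Split on whitespace; punctuation stays attached at this stage.
    return text.split()

def clean(tokens):
    # Lowercase and strip non-alphanumeric characters; drop empty results.
    cleaned = [re.sub(r"[^a-z0-9]", "", tok.lower()) for tok in tokens]
    return [tok for tok in cleaned if tok]

def build_vocab(token_lists):
    # Hypothetical toy vocabulary: ID 0 is reserved for unknown words.
    vocab = {"<unk>": 0}
    for tokens in token_lists:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab

def encode(tokens, vocab):
    # Map each token to its ID, falling back to the <unk> ID.
    return [vocab.get(tok, vocab["<unk>"]) for tok in tokens]

tokens = tokenize("I love NLP!")   # ['I', 'love', 'NLP!']
cleaned = clean(tokens)            # ['i', 'love', 'nlp']
vocab = build_vocab([cleaned])     # {'<unk>': 0, 'i': 1, 'love': 2, 'nlp': 3}
ids = encode(cleaned, vocab)       # [1, 2, 3]
print(tokens, cleaned, ids)
```

The embedding and model stages then operate on `ids` rather than on strings.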
2. Text Preprocessing
Cleaning Operations
import re
import string

def clean_text(text):
    # Lowercase
    text = text.lower()
    # Remove URLs
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    # Remove HTML tags
    text = re.sub(r'<.*?>', '', text)
    # Remove email addresses
    text = re.sub(r'\S+@\S+', '', text)
    # Remove numbers (optional: depends on the task)
    text = re.sub(r'\d+', '', text)
    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Remove extra whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    return text

# Example
raw = "Check out https://example.com! Email me at test@email.com. Price: $29.99"
print(clean_text(raw))
# Output: "check out email me at price"
Stopword Removal
# Requires a one-time download of the stopword list: nltk.download('stopwords')
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

def remove_stopwords(tokens):
    # Keep only tokens that are not in the English stopword set
    return [t for t in tokens if t not in stop_words]

tokens = ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]
print(remove_stopwords(tokens))
# Output: ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']
Stemming vs Lemmatization
| Word | Stemming (Porter) | Lemmatization |
|---|---|---|
| running | run | run |
| ran | ran | run |
| better | better | good |
| studies | studi | study |
| geese | gees | goose |
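# Requires a one-time download of WordNet: nltk.download('wordnet')
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# The lemmatizer needs the part of speech: 'v' = verb, 'a' = adjective, 'n' = noun.
# Passing pos='v' for every word would leave "better" and "geese" unchanged.
words = [("running", "v"), ("ran", "v"), ("better", "a"), ("studies", "n"), ("geese", "n")]
for word, pos in words:
    stem = stemmer.stem(word)
    lemma = lemmatizer.lemmatize(word, pos=pos)
    print(f"{word:12s} → Stem: {stem:10s} Lemma: {lemma}")
# Output (column padding omitted):
# running → Stem: run Lemma: run
# ran → Stem: ran Lemma: run
# better → Stem: better Lemma: good
# studies → Stem: studi Lemma: study
# geese → Stem: gees Lemma: goose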
3. Tokenization Methods
Word-Level Tokenization
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    "The cat sat on the mat",
    "The dog chased the cat",
    "The cat and the dog are friends"
]

# Bag of Words
bow = CountVectorizer()
X_bow = bow.fit_transform(corpus)
print("Vocabulary:", bow.get_feature_names_out())
print("BoW matrix:\n", X_bow.toarray())

# TF-IDF
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(corpus)
print("TF-IDF matrix:\n", X_tfidf.toarray().round(3))
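TF-IDF Formula
$$\text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t)$$
Here,
- $\text{TF}(t, d)$ = Term frequency of term $t$ in document $d$
- $\text{IDF}(t)$ = Inverse document frequency of term $t$
TF and IDF Components
$$\text{TF}(t, d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}}, \qquad \text{IDF}(t) = \log\frac{N}{n_t}$$
Here,
- $f_{t,d}$ = Frequency of term $t$ in document $d$
- $N$ = Total number of documents in the corpus
- $n_t$ = Number of documents containing term $t$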
TF-IDF Intuition
TF-IDF balances two signals: term frequency (how often a word appears in a document) and inverse document frequency (how rare the word is across all documents). Common words like "the" have high TF but low IDF (they appear everywhere), so they get low TF-IDF scores. Rare but informative words like "neural" have moderate TF but high IDF, giving them high TF-IDF scores.
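To make the intuition concrete, here is a small hand-rolled sketch that computes TF-IDF with the unsmoothed definitions TF = relative frequency and IDF = log(N / n_t). The function names are chosen for this example. scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2-normalizes each row by default, so its numbers will differ from these.

```python
import math
from collections import Counter

corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the cat and the dog are friends",
]
docs = [doc.split() for doc in corpus]
N = len(docs)

def tf(term, doc):
    # Relative frequency of the term within one document.
    return Counter(doc)[term] / len(doc)

def idf(term):
    # log(N / n_t); assumes the term occurs in at least one document.
    n_t = sum(1 for doc in docs if term in doc)
    return math.log(N / n_t)

def tfidf(term, doc):
    return tf(term, doc) * idf(term)

for term in ["the", "cat", "mat"]:
    print(term, round(tfidf(term, docs[0]), 3))
# the 0.0
# cat 0.0
# mat 0.183
```

"the" and "cat" appear in every document, so their IDF is log(1) = 0 and they contribute nothing, while "mat" appears in only one document and receives the highest score.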
Subword Tokenization (BPE)
Byte Pair Encoding iteratively merges the most frequent pair of adjacent symbols:
Vocabulary: {'l', 'o', 'w', 'e', 'r', 's', 't', 'n', 'i', 'd', 'unk'}
Corpus frequencies:
'low': 5, 'lower': 2, 'newest': 6, 'widest': 3
Iteration 1: Most frequent pair → 'es' (9 occurrences: 6 from 'newest', 3 from 'widest')
Vocabulary: {..., 'es'}
Iteration 2: Most frequent pair → 'est' (9 occurrences)
Vocabulary: {..., 'est'}
Iteration 3: Most frequent pair → 'lo' (7 occurrences: 5 from 'low', 2 from 'lower')
Vocabulary: {..., 'lo'}
Later iterations: 'low' (7), then 'ne' (6) and 'new' (6), with ties broken arbitrarily
Vocabulary: {..., 'low', 'ne', 'new'}
Result: 'newest' → ['new', 'est'] (not character-by-character)
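The merge loop can be sketched in plain Python as follows. This is a simplified illustration under a few assumptions: it omits the end-of-word marker used in the original BPE formulation, breaks ties by taking the first pair encountered, and uses helper names (`get_pair_counts`, `merge_pair`, `learn_bpe`) invented for this example.

```python
from collections import Counter

def get_pair_counts(word_freqs):
    # Count adjacent symbol pairs, weighted by word frequency.
    pairs = Counter()
    for symbols, freq in word_freqs.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, word_freqs):
    # Replace every occurrence of `pair` with a single merged symbol.
    merged = {}
    for symbols, freq in word_freqs.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

def learn_bpe(corpus_freqs, num_merges):
    # Start from character-level symbols and apply `num_merges` merges.
    word_freqs = {tuple(word): freq for word, freq in corpus_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(word_freqs)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # ties: first pair encountered wins
        merges.append((best, pairs[best]))
        word_freqs = merge_pair(best, word_freqs)
    return merges, word_freqs

corpus_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, segmented = learn_bpe(corpus_freqs, num_merges=6)
for pair, count in merges:
    print(pair, count)
print(segmented)
```

With six merges on this corpus the learned merges are `es`, `est`, `lo`, `low`, `ne`, `new`, and 'newest' ends up segmented as `('new', 'est')`.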
from transformers import AutoTokenizer

# Note: bert-base-uncased uses WordPiece, a subword scheme closely related to BPE
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
text = "Transfer learning is incredibly powerful!"

tokens = tokenizer.tokenize(text)
print("Tokens:", tokens)
# ['transfer', 'learning', 'is', 'incredibly', 'powerful', '!']

ids = tokenizer.encode(text)
print("Token IDs:", ids)
# [101, ..., 102]: 101 = [CLS] and 102 = [SEP]; the inner IDs depend on the vocabulary

decoded = tokenizer.decode(ids)
print("Decoded:", decoded)
# [CLS] transfer learning is incredibly powerful! [SEP]
4. Word Embeddings
The Problem with One-Hot Encoding
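One-Hot Encoding
$$\mathbf{e}_{\text{cat}} = [0, \dots, 0, 1, 0, \dots, 0]^\top \in \{0, 1\}^{V}$$
Here,
- $V$ = Vocabulary size (number of unique words)
- $\mathbf{e}_{\text{cat}}$ = Vector with 1 at the index of 'cat', 0 elsewhere
Problems:
- No notion of similarity: $\mathbf{e}_{\text{cat}}^\top \mathbf{e}_{\text{dog}} = 0$ (all distinct one-hot vectors are orthogonal)
- Sparse and high-dimensional ($V$ = vocabulary size, often 50K+)
- No semantic relationships captured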
Why Embeddings Are Better
Word embeddings solve these problems by mapping words to dense, low-dimensional vectors (typically 100-300 dimensions) where similar words are close together. The embedding matrix $E \in \mathbb{R}^{V \times d}$ is learned during training, and row $i$ holds the vector for word $i$. Multiplying a one-hot vector by $E$ is equivalent to looking up the $i$-th row of a weight matrix.
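The equivalence between a one-hot matrix product and a row lookup can be checked directly. The sketch below uses a hypothetical four-word vocabulary and hand-picked 3-dimensional vectors; real embedding matrices are learned, not written by hand.

```python
def one_hot(index, size):
    # Vector with 1.0 at `index` and 0.0 elsewhere.
    return [1.0 if i == index else 0.0 for i in range(size)]

def vec_mat(vec, matrix):
    # Row vector times matrix: result[j] = sum_i vec[i] * matrix[i][j]
    return [sum(v * row[j] for v, row in zip(vec, matrix))
            for j in range(len(matrix[0]))]

# Hypothetical 4-word vocabulary with 3-dimensional embeddings.
vocab = {"cat": 0, "dog": 1, "car": 2, "tree": 3}
E = [
    [0.9, 0.1, 0.0],   # cat
    [0.8, 0.2, 0.1],   # dog
    [0.0, 0.9, 0.7],   # car
    [0.1, 0.0, 0.9],   # tree
]

idx = vocab["dog"]
via_matmul = vec_mat(one_hot(idx, len(vocab)), E)  # full matrix product
via_lookup = E[idx]                                # direct row lookup
print(via_matmul, via_lookup, via_matmul == via_lookup)
```

Both paths return the same vector, which is why embedding layers are implemented as table lookups rather than as multiplications with sparse one-hot inputs.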
Word2Vec (Mikolov et al., 2013)
Learns dense vectors by predicting context.
Definition: Word2Vec Skip-gram
Given a center word, predict surrounding context words. The model learns dense vector representations where words with similar contexts have similar vectors. The key insight is that "you shall know a word by the company it keeps" (Firth, 1957).
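Word2Vec Skip-gram Objective
$$J(\theta) = \frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-c \le j \le c \\ j \ne 0}} \log P(w_{t+j} \mid w_t)$$
Training maximizes this average log-probability. Here,
- $T$ = Length of the corpus
- $c$ = Context window size
- $w_t$ = Center word at position $t$
- $w_{t+j}$ = Context word at position $t+j$
Skip-gram Probability (Softmax)
$$P(w_O \mid w_I) = \frac{\exp\left({\mathbf{v}'_{w_O}}^\top \mathbf{v}_{w_I}\right)}{\sum_{w=1}^{V} \exp\left({\mathbf{v}'_{w}}^\top \mathbf{v}_{w_I}\right)}$$
Here,
- $w_I$ = Input (center) word
- $w_O$ = Output (context) word
- $\mathbf{v}_{w_I}$ = Input vector for word $w_I$
- $\mathbf{v}'_{w_O}$ = Output vector for word $w_O$
- $V$ = Vocabulary size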
Negative Sampling
Computing the softmax denominator over the entire vocabulary is expensive ($O(V)$ per word). Negative sampling approximates this by training on the true context word plus a few randomly sampled "negative" words. This reduces computation to $O(k)$, where $k$ is the number of negatives (typically 5-20).
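A rough sketch of the per-pair loss under negative sampling is shown below. It assumes the standard form, -log σ(v'_O · v_I) - Σ_k log σ(-v'_k · v_I), and for simplicity draws negatives uniformly (Word2Vec itself samples from a smoothed unigram distribution). The embedding tables are random placeholders and all names are chosen for this example.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def negative_sampling_loss(center_vec, context_vec, negative_vecs):
    # -log sigma(v'_o . v_c) - sum_k log sigma(-v'_k . v_c)
    loss = -math.log(sigmoid(dot(context_vec, center_vec)))
    for neg in negative_vecs:
        loss -= math.log(sigmoid(-dot(neg, center_vec)))
    return loss

random.seed(0)
dim, vocab_size, k = 8, 1000, 5
# Hypothetical randomly initialised input/output embedding tables.
inp = [[random.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(vocab_size)]
out = [[random.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(vocab_size)]

center, context = 3, 17
negatives = random.sample(range(vocab_size), k)  # uniform sampling for simplicity
loss = negative_sampling_loss(inp[center], out[context], [out[n] for n in negatives])
print(f"loss from {k + 1} dot products instead of {vocab_size}: {loss:.3f}")
```

Each training pair touches only `k + 1` output vectors, which is where the saving over the full softmax comes from.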
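Key Property: Word Arithmetic
Word embeddings capture semantic relationships that can be expressed through vector arithmetic:
$$\text{vec}(\text{king}) - \text{vec}(\text{man}) + \text{vec}(\text{woman}) \approx \text{vec}(\text{queen})$$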
from gensim.models import Word2Vec

# Training data
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "played", "in", "the", "park"],
    ["cats", "and", "dogs", "are", "friends"],
    ["the", "cat", "chased", "the", "mouse"],
]

# Train Word2Vec
model = Word2Vec(
    sentences,
    vector_size=100,  # embedding dimension
    window=5,         # context window size
    min_count=1,      # minimum word frequency
    sg=1,             # 1=skip-gram, 0=CBOW
    epochs=100,
)

# Get word vector
cat_vector = model.wv['cat']
print(f"Vector shape: {cat_vector.shape}")  # (100,)

# Find similar words
similar = model.wv.most_similar('cat', topn=5)
print("Similar to 'cat':", similar)

# Word analogy: king - man + woman = ?
# These words are absent from the toy corpus above, so check the vocabulary first;
# meaningful analogies need embeddings trained on a large corpus.
if all(w in model.wv for w in ['king', 'woman', 'man']):
    result = model.wv.most_similar(
        positive=['king', 'woman'],
        negative=['man'],
        topn=1
    )
    print("king - man + woman =", result)
else:
    print("Analogy words are not in the vocabulary; train on a larger corpus.")
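The arithmetic behind an analogy query can be illustrated without a trained model. The sketch below uses hand-made 2-dimensional vectors, invented purely for illustration (one axis loosely standing for "royalty", the other for "gender"), and picks the nearest word by cosine similarity while excluding the query words.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hand-made 2-d vectors (hypothetical): axis 0 ~ "royalty", axis 1 ~ "gender".
vectors = {
    "king":  [0.9, 0.9],
    "queen": [0.9, -0.9],
    "man":   [0.1, 0.9],
    "woman": [0.1, -0.9],
    "apple": [-0.8, 0.0],
}

def analogy(a, b, c):
    # Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs.
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

print(analogy("king", "man", "woman"))  # queen
```

Real embeddings are only approximately linear, so analogy accuracy on trained vectors is far from perfect.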
GloVe (Global Vectors, Pennington et al., 2014)
Combines global co-occurrence statistics with local context windows.
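GloVe Objective
$$J = \sum_{i,j=1}^{V} f(X_{ij}) \left( \mathbf{w}_i^\top \tilde{\mathbf{w}}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2$$
Here,
- $X_{ij}$ = Co-occurrence count of words $i$ and $j$
- $\mathbf{w}_i$ = Word vector for word $i$
- $\tilde{\mathbf{w}}_j$ = Context word vector for word $j$
- $b_i, \tilde{b}_j$ = Bias terms
- $f(X_{ij})$ = Weighting function that down-weights rare co-occurrences and caps frequent ones
where $X_{ij}$ is the co-occurrence count and $f$ is a weighting function:
$$f(x) = \begin{cases} (x / x_{\max})^{\alpha} & \text{if } x < x_{\max} \\ 1 & \text{otherwise} \end{cases}$$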
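As a small illustration, the weighting function can be written as f(x) = (x / x_max)^α for x below x_max and 1 otherwise. The defaults below (x_max = 100, α = 0.75) are the values reported in the GloVe paper; the function name is chosen for this example.

```python
def glove_weight(x, x_max=100.0, alpha=0.75):
    # f(x) = (x / x_max)^alpha below the cap, 1 at or above it.
    return (x / x_max) ** alpha if x < x_max else 1.0

for count in [1, 10, 100, 1000]:
    print(count, round(glove_weight(count), 3))
# 1 0.032
# 10 0.178
# 100 1.0
# 1000 1.0
```

Rare pairs contribute little to the loss, and very frequent pairs cannot dominate it because the weight saturates at 1.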
import numpy as np
from gensim.models import KeyedVectors

# Load pre-trained GloVe embeddings
# Download from: https://nlp.stanford.edu/projects/glove/
# glove.6B.100d.txt format: word float float float ...
def load_glove(filepath):
    embeddings = {}
    with open(filepath, 'r', encoding='utf-8') as f:
        for line in f:
            values = line.split()
            word = values[0]
            vector = np.asarray(values[1:], dtype='float32')
            embeddings[word] = vector
    return embeddings

# Usage
# glove_embeddings = load_glove('glove.6B.100d.txt')
# print(glove_embeddings['king'].shape)  # (100,)

# Create embedding matrix for a vocabulary
def create_embedding_matrix(vocab, embeddings_dict, dim=100):
    matrix = np.zeros((len(vocab), dim))
    found = 0
    for word, idx in vocab.items():
        if word in embeddings_dict:
            matrix[idx] = embeddings_dict[word]
            found += 1
    print(f"Found {found}/{len(vocab)} words in embeddings")
    return matrix
Embedding Comparison
| Method | Training | Semantics | Speed | Memory |
|---|---|---|---|---|
| Word2Vec | Local context | Strong (syntactic) | Fast | Low |
| GloVe | Global co-occurrence | Strong (semantic) | Medium | Low |
| FastText | Subword info | Handles OOV | Slow | High |
| BERT | Contextual | Best (context-dependent) | Slow | High |
5. Sequence Models
Recurrent Neural Network (RNN)
RNN Hidden State
$$h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t)$$
Here,
- $h_t$ = Hidden state at time $t$
- $x_t$ = Input at time $t$
- $W_{hh}$ = Recurrent weight matrix
- $W_{xh}$ = Input weight matrix
RNN Output
$$y_t = W_{hy} h_t + b_y$$
Here,
- $y_t$ = Output at time $t$
- $h_t$ = Hidden state at time $t$
- $W_{hy}$ = Hidden-to-output weight matrix
- $b_y$ = Output bias
h_0 -> [RNN] -> h_1 -> [RNN] -> h_2 -> [RNN] -> h_3 -> y
         ^               ^               ^
        x_1             x_2             x_3
Problem: Vanishing gradients. Vanilla RNNs struggle with long-range dependencies.
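The unrolled diagram corresponds to a simple loop that reuses the same weights at every step. The sketch below assumes the bias-free update h_t = tanh(W_hh h_{t-1} + W_xh x_t) and uses small hand-picked weights that are purely illustrative.

```python
import math

def matvec(matrix, vec):
    # Matrix-vector product using plain lists.
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

def rnn_step(h_prev, x, W_hh, W_xh):
    # h_t = tanh(W_hh h_{t-1} + W_xh x_t)
    pre = [a + b for a, b in zip(matvec(W_hh, h_prev), matvec(W_xh, x))]
    return [math.tanh(p) for p in pre]

# Hypothetical weights: hidden size 2, input size 3.
W_hh = [[0.5, -0.3], [0.2, 0.4]]
W_xh = [[0.1, 0.0, -0.2], [0.3, 0.1, 0.0]]

h = [0.0, 0.0]  # h_0
inputs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # x_1..x_3
for t, x in enumerate(inputs, start=1):
    h = rnn_step(h, x, W_hh, W_xh)  # the same weights are shared across steps
    print(f"h_{t} =", [round(v, 3) for v in h])
```

The final `h` summarizes the whole sequence and would be passed to the output layer to produce `y`.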
Long Short-Term Memory (LSTM)
Definition: LSTM Gates
LSTM uses three gates (forget, input, output) to control the flow of information, allowing it to learn long-range dependencies by selectively remembering or forgetting information.
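LSTM Forget Gate
$$f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)$$
Here,
- $f_t$ = Forget gate output (0 = forget, 1 = keep)
- $\sigma$ = Sigmoid activation (outputs between 0 and 1)
- $[h_{t-1}, x_t]$ = Concatenation of previous hidden state and current input
LSTM Input Gate and Candidate
$$i_t = \sigma(W_i [h_{t-1}, x_t] + b_i), \qquad \tilde{C}_t = \tanh(W_C [h_{t-1}, x_t] + b_C)$$
Here,
- $i_t$ = Input gate (what to update)
- $\tilde{C}_t$ = Candidate cell state (new information)
LSTM Cell State Update
$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$$
Here,
- $C_t$ = New cell state
- $f_t \odot C_{t-1}$ = What to keep from the old cell state (the rest is forgotten)
- $i_t \odot \tilde{C}_t$ = What to add as new information
LSTM Output Gate
$$o_t = \sigma(W_o [h_{t-1}, x_t] + b_o), \qquad h_t = o_t \odot \tanh(C_t)$$
Here,
- $o_t$ = Output gate (what to output)
- $h_t$ = Final hidden state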
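To see how the gates interact, here is a scalar sketch of a single LSTM step (hidden size 1, input size 1). It assumes the standard formulation with a forget gate, input gate, candidate state, and output gate; the parameter values are hypothetical and the layout of `params` is invented for this example.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, p):
    # Scalar LSTM cell: each gate has weights (w_h, w_x) and a bias b.
    def gate(name, activation):
        w_h, w_x, b = p[name]
        return activation(w_h * h_prev + w_x * x + b)

    f = gate("f", sigmoid)          # forget gate
    i = gate("i", sigmoid)          # input gate
    c_tilde = gate("c", math.tanh)  # candidate cell state
    o = gate("o", sigmoid)          # output gate
    c = f * c_prev + i * c_tilde    # additive cell-state update
    h = o * math.tanh(c)            # gated hidden state
    return h, c

# Hypothetical parameters: (w_h, w_x, b) per gate.
params = {"f": (0.5, 0.5, 1.0), "i": (0.5, 0.5, 0.0),
          "c": (0.5, 1.0, 0.0), "o": (0.5, 0.5, 0.0)}

h, c = 0.0, 0.0
for t, x in enumerate([1.0, -0.5, 0.25], start=1):
    h, c = lstm_step(x, h, c, params)
    print(f"t={t}  h={h:.3f}  c={c:.3f}")
```

When the forget gate saturates near 1 and the input gate near 0, the cell state is carried forward almost unchanged, which is the mechanism that preserves long-range information.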
import torch
import torch.nn as nn

class LSTMClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_classes):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=2,
                            batch_first=True, dropout=0.3, bidirectional=True)
        self.classifier = nn.Linear(hidden_dim * 2, num_classes)  # *2 for bidirectional

    def forward(self, x):
        embeds = self.embedding(x)  # (batch, seq_len, embed_dim)
        lstm_out, (h_n, c_n) = self.lstm(embeds)  # lstm_out: (batch, seq_len, hidden*2)
        # Concatenate final hidden states from both directions of the last layer
        hidden = torch.cat([h_n[-2], h_n[-1]], dim=1)  # (batch, hidden*2)
        output = self.classifier(hidden)
        return output

# Example
model = LSTMClassifier(vocab_size=10000, embed_dim=128, hidden_dim=256, num_classes=2)
x = torch.randint(0, 10000, (32, 50))  # batch of 32, sequence length 50
print(model(x).shape)  # torch.Size([32, 2])
GRU (Gated Recurrent Unit)
Simplified LSTM with 2 gates instead of 3:
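GRU Update and Reset Gates
$$z_t = \sigma(W_z [h_{t-1}, x_t] + b_z), \qquad r_t = \sigma(W_r [h_{t-1}, x_t] + b_r)$$
Here,
- $z_t$ = Update gate (interpolation between old and new)
- $r_t$ = Reset gate (how much past to forget)
GRU Hidden State Update
$$\tilde{h}_t = \tanh(W_h [r_t \odot h_{t-1}, x_t] + b_h), \qquad h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$$
Here,
- $h_t$ = New hidden state
- $(1 - z_t) \odot h_{t-1}$ = What to keep from old hidden state
- $z_t \odot \tilde{h}_t$ = What to add as new information ($\tilde{h}_t$ is the candidate hidden state)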
LSTM vs. GRU
GRU combines the forget and input gates into a single update gate $z_t$, and merges the cell state and hidden state. This results in fewer parameters (2 gates vs. 3) and faster training. GRU performs comparably to LSTM on many tasks, especially when data is limited. Use LSTM when you need fine-grained control over information flow; use GRU for simpler, faster models.
Theorem: Vanishing Gradient Problem in RNNs
In vanilla RNNs, backpropagated gradients are multiplied by the recurrent weight matrix $W_{hh}$ (and by the activation derivative) at each time step. After $T$ steps, the gradient scales roughly as $\|W_{hh}\|^T$. If $\|W_{hh}\| < 1$, gradients vanish exponentially (long-range dependencies are lost). If $\|W_{hh}\| > 1$, gradients can explode. LSTMs mitigate this by maintaining a cell state with additive updates (not purely multiplicative), so gradients flow through the cell state nearly unchanged whenever the forget gate stays close to 1.
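The exponential effect is easy to see numerically. The sketch below is a scalar simplification: it treats the recurrent matrix as a single number and ignores the activation derivative, so it shows only the trend, not exact gradient magnitudes.

```python
def gradient_scale(weight_norm, steps):
    # Scale factor picked up after `steps` multiplications by a matrix
    # whose norm is `weight_norm` (scalar simplification).
    return weight_norm ** steps

for w in (0.9, 1.0, 1.1):
    scales = [f"{gradient_scale(w, t):.2e}" for t in (10, 50, 100)]
    print(f"norm={w}: after 10/50/100 steps ->", scales)
```

A norm only slightly below 1 shrinks the gradient by several orders of magnitude within 100 steps, while a norm slightly above 1 inflates it just as quickly, which is why gradient clipping is commonly used alongside gated architectures.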
Sentiment Analysis Pipeline
Task: Classify movie reviews as positive/negative.
Pipeline:
- Preprocessing: Lowercase, remove punctuation, tokenize
- Vocabulary: Build word-to-index mapping (top 25K words)
- Embedding: Load pre-trained GloVe vectors (300d) or train from scratch
- Model: Bidirectional LSTM with 2 layers, hidden dim=256
- Classification: Use final hidden state as features for a linear classifier
Expected Performance:
- TF-IDF + Logistic Regression: ~85% accuracy
- LSTM (random init): ~87% accuracy
- LSTM + GloVe: ~89% accuracy
- LSTM + fine-tuned GloVe: ~90% accuracy
- BERT fine-tuned: ~93% accuracy
The improvement from GloVe shows the value of pre-trained embeddings. The jump to BERT shows the power of contextual representations.
6. Complete Example: Sentiment Classification
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from collections import Counter

class Vocabulary:
    def __init__(self, max_size=25000):
        self.word2idx = {'<PAD>': 0, '<UNK>': 1}
        self.idx2word = {0: '<PAD>', 1: '<UNK>'}
        self.max_size = max_size

    def build(self, texts):
        counter = Counter()
        for text in texts:
            counter.update(text.lower().split())  # lowercase to match encode()
        for word, _ in counter.most_common(self.max_size - 2):
            idx = len(self.word2idx)
            self.word2idx[word] = idx
            self.idx2word[idx] = word

    def encode(self, text, max_len=100):
        tokens = text.lower().split()
        ids = [self.word2idx.get(w, 1) for w in tokens]  # 1 = <UNK>
        # Pad or truncate
        if len(ids) < max_len:
            ids += [0] * (max_len - len(ids))
        else:
            ids = ids[:max_len]
        return ids

class SentimentDataset(Dataset):
    def __init__(self, texts, labels, vocab, max_len=100):
        self.texts = texts
        self.labels = labels
        self.vocab = vocab
        self.max_len = max_len

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        ids = self.vocab.encode(self.texts[idx], self.max_len)
        return torch.tensor(ids), torch.tensor(self.labels[idx])

# Training
# Assumes train_texts (list of strings) and train_labels (list of 0/1 ints) are
# already loaded, and that LSTMClassifier is the class defined in Section 5.
vocab = Vocabulary(max_size=25000)
vocab.build(train_texts)
train_loader = DataLoader(SentimentDataset(train_texts, train_labels, vocab),
                          batch_size=32, shuffle=True)

model = LSTMClassifier(vocab_size=len(vocab.word2idx), embed_dim=128, hidden_dim=256, num_classes=2)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(10):
    model.train()
    for batch_ids, batch_labels in train_loader:
        optimizer.zero_grad()
        outputs = model(batch_ids)
        loss = criterion(outputs, batch_labels)
        loss.backward()
        optimizer.step()
7. Key Takeaways
Summary: NLP Basics
- Preprocessing (cleaning, normalization) is task-dependent: modern models like BERT handle raw text well, but TF-IDF models still benefit from preprocessing
- TF-IDF captures word importance by balancing term frequency and inverse document frequency; useful as a strong baseline
- Word2Vec learns embeddings by predicting context (skip-gram) or predicting words from context (CBOW); captures syntactic and semantic relationships
- GloVe combines global co-occurrence statistics with local context, often capturing better semantic relationships
- Subword tokenization (BPE) handles rare words and morphological variation; enables open-vocabulary modeling
- Word analogies ($\text{king} - \text{man} + \text{woman} \approx \text{queen}$) demonstrate that embeddings capture linear semantic structure
- LSTM/GRU mitigate the vanishing gradient problem through gating mechanisms; LSTM has 3 gates (forget, input, output), GRU has 2 (update, reset)
- Pre-trained embeddings transfer knowledge and reduce training data requirements; fine-tuning them during training often improves performance
8. Practice Exercises
Exercise 1: Build a Vocabulary
# TODO: Build a vocabulary from a corpus of 10K documents
# Support: min frequency threshold, max vocab size, special tokens
# Test: encode and decode a sample sentence
Exercise 2: Train Word2Vec
# TODO: Train Word2Vec on a custom corpus (e.g., Wikipedia dump)
# Evaluate with word similarity benchmarks
# Visualize word vectors using t-SNE
Exercise 3: Sentiment Classifier
# TODO: Build a bidirectional LSTM for sentiment analysis
# Use the IMDB dataset (50K reviews)
# Target: >85% accuracy
# Compare with TF-IDF + Logistic Regression baseline
Exercise 4: Compare Embeddings
# TODO: Compare Word2Vec, GloVe, and FastText on:
# 1. Word similarity (WS-353 dataset)
# 2. Word analogy task
# 3. Text classification accuracy