NLP Basics: Tokenization, Embeddings and Word Vectors
Natural Language Processing (NLP) bridges human language and computational understanding. This lesson covers the foundational building blocks — from raw text to dense vector representations that capture semantic meaning.
1. The NLP Pipeline
Most classical NLP systems follow a sequential pipeline that transforms raw text into representations a model can consume.
Key stages:
- Raw Text — Unprocessed input (documents, sentences, tweets)
- Tokenization — Splitting text into atomic units (tokens)
- Normalization — Lowercasing, removing noise, stemming/lemmatization
- Feature Extraction — Converting tokens to numerical vectors
- Modeling — Applying statistical or neural models
- Output — Classification, generation, translation, etc.
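As a rough illustration of how the first stages chain together, here is a minimal sketch using only the standard library. The function names and the toy vocabulary are assumptions made for this example, not part of any library.

```python
import re
from collections import Counter

def tokenize(text):
    # Tokenization: split raw text on whitespace
    return text.split()

def normalize(tokens):
    # Normalization: lowercase, strip non-letters, drop empty results
    cleaned = (re.sub(r"[^a-z]", "", t.lower()) for t in tokens)
    return [t for t in cleaned if t]

def extract_features(tokens, vocab):
    # Feature extraction: count vector over a fixed vocabulary
    counts = Counter(tokens)
    return [counts[w] for w in vocab]

vocab = ["cat", "dog", "mat", "sat"]  # hypothetical toy vocabulary
raw = "The Cat sat on the MAT!"
features = extract_features(normalize(tokenize(raw)), vocab)
print(features)  # [1, 0, 1, 1]
```

The resulting vector is what a downstream model (the Modeling stage) would take as input.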
2. Text Preprocessing
Preprocessing cleans and normalizes text to reduce vocabulary size and noise.
2.1 Lowercasing and Noise Removal
```python
import re

def preprocess(text):
    text = text.lower()
    text = re.sub(r'<.*?>', '', text)          # Remove HTML tags
    text = re.sub(r'http\S+', '', text)        # Remove URLs
    text = re.sub(r'[^a-zA-Z\s]', '', text)    # Keep only letters
    text = re.sub(r'\s+', ' ', text).strip()   # Normalize whitespace
    return text

sample = " <p>The Cat sat on a MAT!!! Visit http://example.com </p> "
print(preprocess(sample))
# Output: "the cat sat on a mat visit"
```
2.2 Stopword Removal
Stopwords are high-frequency, low-information words (the, is, at, which).
```python
# Requires a one-time download: nltk.download('stopwords')
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
tokens = ["the", "cat", "sat", "on", "a", "mat"]
filtered = [w for w in tokens if w not in stop_words]
# ['cat', 'sat', 'mat']
```
2.3 Stemming vs Lemmatization
| Method | Approach | Example | Pros | Cons |
|---|---|---|---|---|
| Stemming | Rule-based suffix stripping | "running" → "run", "studies" → "studi" | Fast, no lookup | Can over-stem or under-stem |
| Lemmatization | Dictionary + morphological analysis | "better" → "good", "ran" → "run" | Linguistically accurate | Slower, requires POS tags |
```python
# Requires a one-time download: nltk.download('wordnet')
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
words = ["running", "studies", "better", "geese"]

# Stemming
print([stemmer.stem(w) for w in words])
# ['run', 'studi', 'better', 'gees']

# Lemmatization (the result depends on the POS tag passed in)
print([lemmatizer.lemmatize(w, pos='v') for w in words])
# ['run', 'study', 'better', 'geese']
# "better" → "good" needs pos='a'; "geese" → "goose" needs pos='n'
```
When to use which:
- Stemming: Information retrieval, search engines, when speed matters
- Lemmatization: Text analysis, chatbots, when semantic accuracy matters
3. Tokenization Strategies
Tokenization determines how text is segmented into processable units.
3.1 Word-Level Tokenization
```python
# Simple regex tokenizer: keeps word characters and internal apostrophes
import re

def word_tokenize(text):
    return re.findall(r"\w+(?:'\w+)*", text.lower())

text = "Don't tokenize, it's easier!"
print(word_tokenize(text))
# ["don't", 'tokenize', "it's", 'easier']
```
3.2 Byte-Pair Encoding (BPE)
BPE starts from individual characters and iteratively merges the most frequent pair of adjacent symbols, building a subword vocabulary.
Algorithm:
- Start with character-level vocabulary
- Count all adjacent symbol pairs
- Merge the most frequent pair into a new symbol
- Repeat until desired vocabulary size is reached
```python
# Simplified BPE training (a trace of the first merges, not a full implementation)
corpus = ["low", "low", "low", "lowest", "newer", "wider"]

# Initial vocab: all unique characters
vocab = set(''.join(corpus))  # {'l','o','w','e','s','t','n','r','i','d'}

# Iteration 1: ('l','o') and ('o','w') both occur 4 times; taking ('l','o') → 'lo'
# Iteration 2: ('lo','w') occurs 4 times → 'low'
# ... continues until vocab size limit
```
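The merge loop described above can be sketched as follows. This is a minimal illustration, not a production tokenizer: the helper names are made up for this example, there is no end-of-word marker, and ties between equally frequent pairs are broken by first occurrence, which is an arbitrary choice.

```python
from collections import Counter

def get_pair_counts(words):
    # `words` maps a tuple of symbols to its corpus frequency
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(words, pair):
    # Replace every adjacent occurrence of `pair` with the merged symbol
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def train_bpe(corpus, num_merges):
    words = Counter(tuple(w) for w in corpus)
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(words)
        if not pairs:
            break
        # max() returns the first pair with the highest count
        best = max(pairs, key=pairs.get)
        merges.append(best)
        words = merge_pair(words, best)
    return merges, words

corpus = ["low", "low", "low", "lowest", "newer", "wider"]
merges, words = train_bpe(corpus, num_merges=2)
print(merges)  # [('l', 'o'), ('lo', 'w')]
```

In practice the loop would run until the vocabulary reaches the target size rather than for a fixed number of merges.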
3.3 WordPiece Tokenization
Used by BERT. Similar to BPE but merges pairs that maximize likelihood of the training data rather than pure frequency.
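At tokenization time, BERT-style WordPiece is usually described as greedy longest-match-first: take the longest prefix found in the vocabulary, then continue on the remainder with `##`-prefixed pieces. The sketch below assumes a tiny hand-made vocabulary (`toy_vocab`) and a hypothetical function name; it does not reproduce any real model's vocabulary.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    # Greedy longest-match-first over a single word
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry "##"
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word becomes unknown
        tokens.append(piece)
        start = end
    return tokens

toy_vocab = {"un", "##happi", "##ness", "token", "##ization"}  # hypothetical
print(wordpiece_tokenize("unhappiness", toy_vocab))   # ['un', '##happi', '##ness']
print(wordpiece_tokenize("tokenization", toy_vocab))  # ['token', '##ization']
print(wordpiece_tokenize("cat", toy_vocab))           # ['[UNK]']
```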
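Illustrative BERT-style output (only continuation pieces carry the ## prefix; exact splits depend on the model's vocabulary):
"unhappiness" → ["un", "##happi", "##ness"]
"tokenization" → ["token", "##ization"]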
3.4 SentencePiece
Language-agnostic tokenization that treats input as raw Unicode, handling languages without whitespace.
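```python
import sentencepiece as spm

# 'model.model' must be a trained SentencePiece model file
sp = spm.SentencePieceProcessor(model_file='model.model')
tokens = sp.encode("unhappiness", out_type=str)
# e.g. ['▁un', 'happi', 'ness'] ('▁' marks a word boundary; pieces depend on the model)
```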
Tokenizer Comparison
| Method | Vocab Size | OOV Handling | Speed | Used By |
|---|---|---|---|---|
| Word | 100K-1M | Poor | Fast | spaCy, NLTK |
| BPE | 30K-50K | Good | Medium | GPT-2/3/4 |
| WordPiece | 30K | Good | Medium | BERT, DistilBERT |
| SentencePiece | 32K-64K | Good | Medium | T5, LLaMA, mBART |
4. Bag of Words and TF-IDF
4.1 Bag of Words (BoW)
BoW represents documents as fixed-length vectors of word counts, ignoring order.
```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog"
]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())
# ['cat', 'chased', 'dog', 'log', 'mat', 'on', 'sat', 'the']
print(X.toarray())
# [[1, 0, 0, 0, 1, 1, 1, 2],
#  [0, 0, 1, 1, 0, 1, 1, 2],
#  [1, 1, 1, 0, 0, 0, 0, 2]]
```
Limitations:
- Loses word order: "dog bites man" = "man bites dog"
- High dimensionality: vocabulary-sized sparse vectors
- No semantic information: "good" and "excellent" are unrelated
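The word-order limitation is easy to check with plain counters. The sketch below also counts adjacent word pairs (bigrams), one common way to recover a little order information; the helper names are made up for this example.

```python
from collections import Counter

def bag_of_words(text):
    # Count tokens; position information is discarded
    return Counter(text.lower().split())

def bag_of_bigrams(text):
    # Count adjacent token pairs, which keeps local order
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))

a = bag_of_words("dog bites man")
b = bag_of_words("man bites dog")
print(a == b)  # True

print(bag_of_bigrams("dog bites man") == bag_of_bigrams("man bites dog"))  # False
```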
4.2 TF-IDF (Term Frequency–Inverse Document Frequency)
TF-IDF weights words by their importance within a document relative to the corpus.
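```python
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)  # same corpus as in the BoW example
# With scikit-learn's defaults (smoothed IDF, L2 normalization):
# "mat" has the highest IDF in doc 0 (it appears in only one document)
# "the" has the lowest IDF (exactly 1.0) but, occurring twice per document,
# still carries the largest weight in each row; use stop_words or max_df to drop it
```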
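To make the weighting concrete, here is a hand computation that follows the smoothed formula scikit-learn documents for its defaults, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization. It is a sketch for checking intuition on the three-document corpus, not a replacement for `TfidfVectorizer`; the function names are assumptions.

```python
import math
from collections import Counter

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents containing each term
df = Counter(term for doc in tokenized for term in set(doc))

def idf(term):
    # Smoothed IDF (the smooth_idf=True variant)
    return math.log((1 + n_docs) / (1 + df[term])) + 1

def tfidf(doc):
    counts = Counter(doc)
    weights = {t: c * idf(t) for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}  # L2-normalized

doc0 = tfidf(tokenized[0])
print(round(idf("the"), 3), round(idf("mat"), 3))  # 1.0 1.693
print(doc0["the"] > doc0["mat"] > doc0["cat"])     # True
```

Under this smoothing, a word that appears in every document still gets an IDF of 1 rather than 0, so frequent function words are down-weighted relative to rare terms but not removed.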
5. Word Embeddings
Word embeddings map words to dense, low-dimensional vectors where geometric relationships encode semantic similarity.
5.1 Why Not One-Hot Encoding?
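One-hot vectors are mutually orthogonal: the dot product between any two distinct words is zero, so "good" is no closer to "excellent" than it is to "banana".
Dense embeddings solve this by learning a continuous vector space.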
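A small check of the orthogonality claim, in plain Python. The dense vectors are hand-picked two-dimensional values used purely for illustration; they do not come from a trained model.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

vocab = ["good", "excellent", "banana"]
# One-hot: a 1 at the word's index, 0 elsewhere
one_hot = {
    w: [1.0 if i == j else 0.0 for j in range(len(vocab))]
    for i, w in enumerate(vocab)
}

# Hypothetical dense vectors, chosen by hand for this example
dense = {"good": [0.9, 0.1], "excellent": [0.8, 0.2], "banana": [-0.1, 0.9]}

print(dot(one_hot["good"], one_hot["excellent"]))  # 0.0
print(dot(one_hot["good"], one_hot["banana"]))     # 0.0
print(dot(dense["good"], dense["excellent"]) > dot(dense["good"], dense["banana"]))  # True
```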
5.2 Word2Vec (Mikolov et al., 2013)
Word2Vec learns embeddings by predicting context from words (or words from context).
CBOW (Continuous Bag of Words)
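Predicts the center word given surrounding context words:
$$P(w_t \mid w_{t-c}, \dots, w_{t+c}) = \mathrm{softmax}(W \bar{v} + b)_{w_t}$$
where $\bar{v} = \frac{1}{2c} \sum_{-c \le j \le c,\ j \ne 0} v_{w_{t+j}}$ is the averaged context embedding, and $W$, $b$ are the weights and bias of the output layer.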
```python
import torch
import torch.nn as nn

class CBOW(nn.Module):
    def __init__(self, vocab_size, embed_dim):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embed_dim)
        self.output = nn.Linear(embed_dim, vocab_size)

    def forward(self, context_ids):
        # context_ids: (batch, 2*context_size)
        embeds = self.embeddings(context_ids)  # (batch, 2c, dim)
        hidden = embeds.mean(dim=1)            # (batch, dim)
        logits = self.output(hidden)           # (batch, vocab_size)
        return logits

# Training loop
# Assumes `dataloader` yields (context, target) LongTensor batches
# and `num_epochs` has been set by the caller.
model = CBOW(vocab_size=10000, embed_dim=300)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()

for epoch in range(num_epochs):
    for context, target in dataloader:
        optimizer.zero_grad()
        logits = model(context)
        loss = criterion(logits, target)
        loss.backward()
        optimizer.step()
```
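The training loop above expects batches of (context, target) pairs. One possible way to build those pairs from a list of token ids is sketched below; the function name and the choice to skip positions without a full window are assumptions made for this example.

```python
def cbow_pairs(token_ids, context_size=2):
    # Build (context, target) pairs; positions without a full window are skipped
    pairs = []
    for i in range(context_size, len(token_ids) - context_size):
        context = token_ids[i - context_size:i] + token_ids[i + 1:i + context_size + 1]
        pairs.append((context, token_ids[i]))
    return pairs

token_ids = [0, 1, 2, 3, 4, 5]  # hypothetical ids for a six-token sentence
for context, target in cbow_pairs(token_ids):
    print(context, "->", target)
# [0, 1, 3, 4] -> 2
# [1, 2, 4, 5] -> 3
```

Because every context has the same length (2 * context_size), the pairs can be converted to tensors and batched without padding.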
Skip-gram
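Predicts context words given the center word, the reverse of CBOW:
$$P(w_{t+j} \mid w_t) = \frac{\exp(u_{w_{t+j}}^\top v_{w_t})}{\sum_{w \in V} \exp(u_w^\top v_{w_t})}$$
where $v$ denotes center embeddings and $u$ context embeddings.
Negative Sampling (approximation): instead of normalizing over the whole vocabulary, score the true context word $o$ against $k$ sampled negatives $n_1, \dots, n_k$:
$$\mathcal{L} = -\log \sigma(u_o^\top v_c) - \sum_{i=1}^{k} \log \sigma(-u_{n_i}^\top v_c)$$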
```python
import torch
import torch.nn as nn

class SkipGram(nn.Module):
    def __init__(self, vocab_size, embed_dim):
        super().__init__()
        self.center_embed = nn.Embedding(vocab_size, embed_dim)
        self.context_embed = nn.Embedding(vocab_size, embed_dim)

    def forward(self, center, context, negatives):
        c = self.center_embed(center)      # (batch, dim)
        p = self.context_embed(context)    # (batch, dim)
        n = self.context_embed(negatives)  # (batch, k, dim)
        pos_score = torch.sum(c * p, dim=1)                  # (batch,)
        # squeeze(2) drops only the trailing dim, so batch=1 or k=1 stay 2-D
        neg_score = torch.bmm(n, c.unsqueeze(2)).squeeze(2)  # (batch, k)
        pos_loss = -torch.log(torch.sigmoid(pos_score) + 1e-8)
        neg_loss = -torch.sum(torch.log(torch.sigmoid(-neg_score) + 1e-8), dim=1)
        return (pos_loss + neg_loss).mean()
```
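The model above takes the negative ids as an input and leaves open how they are drawn. The original Word2Vec paper samples them from the unigram distribution raised to the 3/4 power, which gives rare words a somewhat larger share than their raw frequency. A minimal sketch, with a made-up function name and a deliberately skewed toy corpus:

```python
import random
from collections import Counter

def negative_sampling_probs(tokens, power=0.75):
    # Unigram counts raised to `power`, then normalized to sum to 1
    counts = Counter(tokens)
    weights = {w: c ** power for w, c in counts.items()}
    total = sum(weights.values())
    return {w: x / total for w, x in weights.items()}

tokens = ["the"] * 16 + ["cat"]  # hypothetical corpus: raw share of "cat" is 1/17
probs = negative_sampling_probs(tokens)
print(round(probs["the"], 3), round(probs["cat"], 3))  # 0.889 0.111

# Draw 5 negatives according to the smoothed distribution
rng = random.Random(0)
words = list(probs)
negatives = rng.choices(words, weights=[probs[w] for w in words], k=5)
```

A fuller implementation would also avoid sampling the true context word as a negative.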
5.3 GloVe (Global Vectors, Pennington et al., 2014)
GloVe combines global co-occurrence statistics with local context learning.
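It minimizes a weighted least-squares objective over the co-occurrence matrix:
$$J = \sum_{i,j=1}^{|V|} f(X_{ij}) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2$$
where $X_{ij}$ is the co-occurrence count and $f$ is a weighting function:
$$f(x) = \begin{cases} (x / x_{\max})^{\alpha} & \text{if } x < x_{\max} \\ 1 & \text{otherwise} \end{cases}$$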
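A minimal sketch of the weighting function and the loss contribution of one word pair, using the values reported in the GloVe paper (x_max = 100, alpha = 3/4) as defaults. The function names are assumptions made for this example.

```python
import math

def glove_weight(x, x_max=100.0, alpha=0.75):
    # Grows as (x / x_max)^alpha and is capped at 1 for frequent pairs
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_pair_loss(w_i, w_j, b_i, b_j, x_ij):
    # Weighted squared error between the model score and log co-occurrence
    score = sum(a * b for a, b in zip(w_i, w_j)) + b_i + b_j
    return glove_weight(x_ij) * (score - math.log(x_ij)) ** 2

print(round(glove_weight(1), 4))  # 0.0316
print(glove_weight(100))          # 1.0
print(glove_weight(5000))         # 1.0
```

The cap keeps very frequent pairs from dominating the objective, while the small weight on rare pairs limits the influence of noisy counts.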
```python
# Using pretrained GloVe embeddings
import numpy as np

def load_glove(path):
    embeddings = {}
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            values = line.split()
            word = values[0]
            vector = np.asarray(values[1:], dtype='float32')
            embeddings[word] = vector
    return embeddings

glove = load_glove('glove.6B.300d.txt')
print(glove['king'].shape)  # (300,)
```
Word2Vec vs GloVe
| Aspect | Word2Vec | GloVe |
|---|---|---|
| Training | Local context windows | Global co-occurrence matrix |
| Objective | Predict context (prediction-based) | Reconstruct log co-occurrence (count-based) |
| Speed | Faster per epoch | Faster convergence |
| Performance | Comparable | Comparable |
| Intuition | "You shall know a word by the company it keeps" | "Word co-occurrence ratios encode meaning" |
6. Embedding Properties
6.1 Word Analogies
Word embeddings capture linear relationships: king - man + woman ≈ queen.
```python
def analogy(word_a, word_b, word_c, embeddings, top_k=5):
    """a - b + c = ?"""
    vec = embeddings[word_a] - embeddings[word_b] + embeddings[word_c]
    # Normalize and compute cosine similarity
    vec = vec / np.linalg.norm(vec)
    similarities = {
        word: np.dot(vec, emb / np.linalg.norm(emb))
        for word, emb in embeddings.items()
        if word not in {word_a, word_b, word_c}
    }
    return sorted(similarities.items(), key=lambda x: -x[1])[:top_k]

# king - man + woman → queen
analogy('king', 'man', 'woman', glove)
# [('queen', 0.85), ('throne', 0.72), ...]
```
6.2 Clustering and Semantic Groups
Embeddings form clusters where semantically related words are proximal:
Cluster 1 (royalty): king, queen, prince, throne, crown
Cluster 2 (food): pizza, pasta, burger, restaurant, menu
Cluster 3 (emotions): happy, sad, angry, joyful, depressed
```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans

words = list(glove.keys())[:5000]
vectors = np.array([glove[w] for w in words])

# Dimensionality reduction
tsne = TSNE(n_components=2, random_state=42)
coords = tsne.fit_transform(vectors)

# Clustering
kmeans = KMeans(n_clusters=10, random_state=42)
labels = kmeans.fit_predict(vectors)

# Visualize
plt.scatter(coords[:, 0], coords[:, 1], c=labels, cmap='tab10', s=5)
plt.show()
```
6.3 Cosine Similarity
| Similarity | Score |
|---|---|
| king ↔ queen | 0.85 |
| king ↔ throne | 0.72 |
| king ↔ banana | 0.05 |
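Scores like these come from cosine similarity, the dot product of two vectors divided by the product of their lengths, which ranges from -1 (opposite) to 1 (same direction). A minimal plain-Python sketch follows; returning 0.0 for a zero vector is a convention chosen for this example.

```python
import math

def cosine_similarity(u, v):
    # cos(u, v) = (u . v) / (||u|| * ||v||)
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention for this sketch; the value is undefined
    return dot / (norm_u * norm_v)

print(cosine_similarity([3, 4], [6, 8]))   # 1.0  (same direction)
print(cosine_similarity([1, 0], [0, 1]))   # 0.0  (orthogonal)
print(cosine_similarity([1, 0], [-1, 0]))  # -1.0 (opposite)
```

Because the measure ignores vector length, it compares direction only, which is why it is the usual choice for embedding spaces.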
7. Sequence Representations
7.1 One-Hot Encoding
Each word is represented as a binary vector of size $|V|$ (the vocabulary size), with a 1 at the word's index and 0 everywhere else.
Problem: For $|V| = 50{,}000$, each word is a 50K-dimensional sparse vector.
7.2 Word2Vec Averaging (Sentence Embeddings)
A simple sentence representation by averaging word vectors:
```python
def sentence_embedding(sentence, embeddings, dim=300):
    words = sentence.lower().split()
    vectors = [embeddings[w] for w in words if w in embeddings]
    if not vectors:
        return np.zeros(dim)
    return np.mean(vectors, axis=0)

s1 = sentence_embedding("the king sat on the throne", glove)
s2 = sentence_embedding("the queen sat on the throne", glove)
s3 = sentence_embedding("the cat sat on the mat", glove)
# s1 and s2 are more similar than s1 and s3
```
Limitations: Averaging loses word order — "dog bites man" and "man bites dog" produce identical embeddings.
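This can be verified without pretrained vectors. The sketch below uses a three-word toy embedding table with hand-picked values (an assumption for illustration) and a plain-Python averaging helper.

```python
def average_vectors(sentence, embeddings):
    # Mean of the vectors of all in-vocabulary words
    vectors = [embeddings[w] for w in sentence.split() if w in embeddings]
    dim = len(next(iter(embeddings.values())))
    if not vectors:
        return [0.0] * dim
    return [sum(col) / len(vectors) for col in zip(*vectors)]

# Hypothetical 2-d embeddings, chosen by hand
toy = {"dog": [1.0, 0.0], "bites": [0.0, 1.0], "man": [0.5, 0.5]}

v1 = average_vectors("dog bites man", toy)
v2 = average_vectors("man bites dog", toy)
print(v1 == v2)  # True
```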
7.3 Comparison of Representations
| Representation | Dimensionality | Semantic Info | Order Info | Sparsity |
|---|---|---|---|---|
| One-hot | Vocabulary size (50K+) | None | None | ~100% |
| TF-IDF | Vocabulary size | Document-level | None | ~99% |
| Word2Vec | 100-300 | Yes | No | 0% |
| GloVe | 100-300 | Yes | No | 0% |
| Averaged W2V | 100-300 | Partial | No | 0% |
| RNN/LSTM | Variable | Yes | Yes | 0% |
| Transformer | Variable | Yes | Yes | 0% |
8. Complete Implementation
8.1 End-to-End NLP Pipeline
```python
import numpy as np
import re
from collections import Counter
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

class NLPBaseline:
    def __init__(self, use_stemming=False, use_lemmatization=False):
        self.stemmer = PorterStemmer() if use_stemming else None
        self.lemmatizer = WordNetLemmatizer() if use_lemmatization else None
        self.stop_words = set(stopwords.words('english'))
        self.vectorizer = None

    def preprocess(self, text):
        text = text.lower()
        text = re.sub(r'<.*?>', '', text)
        text = re.sub(r'[^a-zA-Z\s]', '', text)
        tokens = text.split()
        tokens = [t for t in tokens if t not in self.stop_words]
        if self.stemmer:
            tokens = [self.stemmer.stem(t) for t in tokens]
        if self.lemmatizer:
            tokens = [self.lemmatizer.lemmatize(t) for t in tokens]
        return ' '.join(tokens)

    def fit(self, corpus):
        processed = [self.preprocess(doc) for doc in corpus]
        self.vectorizer = TfidfVectorizer(max_features=10000)
        self.X = self.vectorizer.fit_transform(processed)
        return self

    def transform(self, texts):
        processed = [self.preprocess(doc) for doc in texts]
        return self.vectorizer.transform(processed)

    def similar_docs(self, query, top_k=5):
        q_vec = self.transform([query])
        sims = cosine_similarity(q_vec, self.X).flatten()
        return np.argsort(sims)[::-1][:top_k], sims

# Usage
corpus = [
    "Natural language processing enables computers to understand text",
    "Machine learning algorithms learn patterns from data",
    "Deep learning uses neural networks for complex tasks",
    "NLP combines linguistics and computer science",
]
pipeline = NLPBaseline(use_lemmatization=True)
pipeline.fit(corpus)
indices, scores = pipeline.similar_docs("computational linguistics", top_k=3)
for i in indices:
    print(f"[{scores[i]:.3f}] {corpus[i]}")
```
8.2 Training Word2Vec from Scratch
```python
import torch
from collections import Counter
from torch.utils.data import Dataset, DataLoader

class Word2VecDataset(Dataset):
    def __init__(self, token_ids, window_size=5):
        self.data = []
        for i, center in enumerate(token_ids):
            context = token_ids[max(0, i-window_size):i] + \
                      token_ids[i+1:i+window_size+1]
            for ctx in context:
                self.data.append((center, ctx))

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]

def train_word2vec(corpus, vocab_size, embed_dim=100, epochs=5):
    # Build vocab from the `vocab_size` most frequent words
    word_counts = Counter(w for doc in corpus for w in doc.split())
    vocab = {w: i for i, (w, _) in enumerate(word_counts.most_common(vocab_size))}
    n_words = len(vocab)  # may be smaller than vocab_size on a small corpus

    # Prepare data
    token_ids = [vocab[w] for doc in corpus for w in doc.split() if w in vocab]
    dataset = Word2VecDataset(token_ids, window_size=3)
    loader = DataLoader(dataset, batch_size=512, shuffle=True)

    # Model (SkipGram is the class defined in section 5.2)
    model = SkipGram(n_words, embed_dim)
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

    for epoch in range(epochs):
        total_loss = 0
        for center, context in loader:
            # Negative sampling (uniform over the vocabulary, for simplicity)
            negatives = torch.randint(0, n_words, (center.size(0), 5))
            optimizer.zero_grad()
            loss = model(center, context, negatives)
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
        # loss is a per-batch mean, so average over the number of batches
        print(f"Epoch {epoch+1}: loss={total_loss/len(loader):.4f}")

    # Extract embeddings
    return model.center_embed.weight.detach().numpy(), vocab
```
Key Takeaways
- Preprocessing matters — lowercasing, stopword removal, and lemmatization significantly affect downstream performance
- Subword tokenization (BPE, WordPiece) balances vocabulary size with OOV handling
- TF-IDF weights words by importance: frequent in a document but rare across the corpus
- Word2Vec learns embeddings via local context prediction; GloVe leverages global co-occurrence statistics
- Vector arithmetic captures semantic relationships: king - man + woman ≈ queen
- Cosine similarity measures semantic closeness in embedding space
- Limitations of bag-of-words approaches: lose word order, syntax, and context — leading to contextual embeddings (BERT, GPT) in modern NLP
Next: Contextual Embeddings and Transformers — How BERT and GPT solve the polysemy problem with context-dependent representations.