NLP Basics: Tokenization, Embeddings and Word Vectors

Natural Language Processing (NLP) bridges human language and computational understanding. This lesson covers the foundational building blocks — from raw text to dense vector representations that capture semantic meaning.

1. The NLP Pipeline

A classical NLP system follows a sequential pipeline that transforms raw text into machine-readable representations.

[Figure: NLP pipeline. Raw Text → Tokenization → Normalization → Feature Extraction → Modeling → Output, illustrated as "The cat sat" → ["the","cat","sat"] → ["cat","sat"] → [0.2, 0.8, ...] → P(cat|context) → "noun phrase". Each stage reduces ambiguity and enriches the representation.]

Key stages:

  1. Raw Text — Unprocessed input (documents, sentences, tweets)
  2. Tokenization — Splitting text into atomic units (tokens)
  3. Normalization — Lowercasing, removing noise, stemming/lemmatization
  4. Feature Extraction — Converting tokens to numerical vectors
  5. Modeling — Applying statistical or neural models
  6. Output — Classification, generation, translation, etc.
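
The first stages listed above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production pipeline: the stopword set, the function names, and the count-vector feature step are assumptions chosen for brevity, and lowercasing is folded into the tokenizer.

```python
import re
from collections import Counter

# Hypothetical toy stopword list; real systems use a fuller list (e.g. NLTK's).
STOPWORDS = {"the", "a", "an", "on", "is"}

def tokenize(text):
    """Stage 2: split raw text into lowercase word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def normalize(tokens):
    """Stage 3: drop stopwords (stemming/lemmatization omitted for brevity)."""
    return [t for t in tokens if t not in STOPWORDS]

def extract_features(tokens, vocabulary):
    """Stage 4: map tokens to a count vector over a fixed vocabulary."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

raw = "The cat sat on the mat"
tokens = tokenize(raw)                           # ['the', 'cat', 'sat', 'on', 'the', 'mat']
clean = normalize(tokens)                        # ['cat', 'sat', 'mat']
vocabulary = sorted(set(clean))                  # ['cat', 'mat', 'sat']
features = extract_features(clean, vocabulary)   # [1, 1, 1]
print(tokens, clean, features)
```

The resulting feature vector is what a downstream model (stage 5) would consume.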

2. Text Preprocessing

Preprocessing cleans and normalizes text to reduce vocabulary size and noise.

2.1 Lowercasing and Noise Removal

import re

def preprocess(text):
    text = text.lower()
    text = re.sub(r'<.*?>', '', text)        # Remove HTML tags
    text = re.sub(r'http\S+', '', text)      # Remove URLs
    text = re.sub(r'[^a-zA-Z\s]', '', text)  # Keep only letters
    text = re.sub(r'\s+', ' ', text).strip()  # Normalize whitespace
    return text

sample = "  <p>The Cat sat on a MAT!!! Visit http://example.com  </p> "
print(preprocess(sample))
# Output: "the cat sat on a mat visit"  (only the URL is removed; the word "Visit" stays)

2.2 Stopword Removal

Stopwords are high-frequency, low-information words (the, is, at, which).

import nltk
from nltk.corpus import stopwords

nltk.download('stopwords', quiet=True)  # one-time download of the stopword lists
stop_words = set(stopwords.words('english'))

tokens = ["the", "cat", "sat", "on", "a", "mat"]
filtered = [w for w in tokens if w not in stop_words]
# ['cat', 'sat', 'mat']

2.3 Stemming vs Lemmatization

Method | Approach | Example | Pros | Cons
Stemming | Rule-based suffix stripping | "running" → "run", "studies" → "studi" | Fast, no lookup | Can over-stem or under-stem
Lemmatization | Dictionary + morphological analysis | "better" → "good", "ran" → "run" | Linguistically accurate | Slower, requires POS tags

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

words = ["running", "studies", "better", "geese"]

# Stemming
print([stemmer.stem(w) for w in words])
# ['run', 'studi', 'better', 'gees']

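# Lemmatization needs the right POS tag per word ('v' verb, 'a' adjective, 'n' noun);
# requires nltk.download('wordnet')
pos_tags = ['v', 'v', 'a', 'n']
print([lemmatizer.lemmatize(w, pos=p) for w, p in zip(words, pos_tags)])
# ['run', 'study', 'good', 'goose']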

When to use which:

  • Stemming: Information retrieval, search engines, when speed matters
  • Lemmatization: Text analysis, chatbots, when semantic accuracy matters

3. Tokenization Strategies

Tokenization determines how text is segmented into processable units.

[Figure: Three tokenization granularities for the input "unhappiness"]

  • Word-level: tokens ["unhappiness"]; vocab size ~1M; handles known words; fails on OOV words; examples: NLTK, spaCy
  • Subword-level: tokens ["un", "happi", "ness"]; vocab size ~30K-50K; handles rare + common words; best trade-off; examples: BPE, WordPiece
  • Character-level: tokens ["u","n","h","a","p","i"]; vocab size ~26-256; handles any text; produces long sequences; examples: CharCNN, ByT5

3.1 Word-Level Tokenization

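# Simple whitespace + punctuation tokenizer that keeps contractions together
import re

def word_tokenize(text):
    # \w+ matches a word; (?:'\w+)? keeps an apostrophe suffix such as "n't" or "'s" attached
    # (a plain r"\b\w+\b" would split "don't" into "don" and "t")
    return re.findall(r"\w+(?:'\w+)?", text.lower())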

text = "Don't tokenize — it's easier!"
print(word_tokenize(text))
# ["don't", "tokenize", "it's", "easier"]
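
The main weakness of word-level tokenization is out-of-vocabulary (OOV) words. The sketch below shows the usual workaround of mapping unseen words to a reserved unknown token; the `<unk>` name, the helper functions, and the toy training sentences are assumptions made for illustration.

```python
import re

def build_vocab(texts):
    """Assign an integer id to every word seen in the training texts."""
    vocab = {"<unk>": 0}  # id 0 is reserved for unknown words
    for text in texts:
        for token in re.findall(r"\w+", text.lower()):
            vocab.setdefault(token, len(vocab))
    return vocab

def encode(text, vocab):
    """Map each word to its id, falling back to <unk> for unseen words."""
    return [vocab.get(tok, vocab["<unk>"]) for tok in re.findall(r"\w+", text.lower())]

vocab = build_vocab(["the cat is happy", "the dog is sad"])
ids = encode("the cat is unhappy", vocab)
print(ids)  # [1, 2, 3, 0]: "unhappy" was never seen, so it collapses to <unk>
```

All information about "unhappy" is lost here, even though "happy" is in the vocabulary. Subword tokenizers are designed to avoid exactly this.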

3.2 Byte-Pair Encoding (BPE)

BPE iteratively merges the most frequent pair of adjacent symbols (initially single characters), building a subword vocabulary.

Algorithm:

  1. Start with character-level vocabulary
  2. Count all adjacent symbol pairs
  3. Merge the most frequent pair into a new symbol
  4. Repeat until desired vocabulary size is reached

# Simplified BPE training (walkthrough of the first merges)
corpus = ["low", "low", "low", "lowest", "newer", "wider"]

# Initial vocab: all unique characters
vocab = set(''.join(corpus))  # {'l','o','w','e','s','t','n','r','i','d'}

# Iteration 1: ('l','o') and ('o','w') tie at 4 occurrences; pick ('l','o') → merge to 'lo'
# Iteration 2: ('lo','w') is now the most frequent pair (4 occurrences) → merge to 'low'
# ... continues until the vocab size limit is reached
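
The four algorithm steps can be turned into a small runnable trainer. This is a simplified sketch: the function names are hypothetical, there is no end-of-word marker, and ties are broken by taking the first pair encountered, which real implementations may handle differently.

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with the concatenated symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def train_bpe(corpus, num_merges):
    words = Counter(tuple(w) for w in corpus)  # each word as a tuple of characters
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)       # ties: the first pair encountered wins
        words = merge_pair(words, best)
        merges.append(best)
    return merges, words

corpus = ["low", "low", "low", "lowest", "newer", "wider"]
merges, words = train_bpe(corpus, num_merges=2)
print(merges)  # [('l', 'o'), ('lo', 'w')]
```

After two merges the frequent word "low" is a single symbol, while "lowest" is represented as `('low', 'e', 's', 't')`.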

3.3 WordPiece Tokenization

Used by BERT. Similar to BPE but merges pairs that maximize likelihood of the training data rather than pure frequency.

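$$\text{score}(x, y) = \frac{\text{freq}(xy)}{\text{freq}(x) \times \text{freq}(y)}$$

BERT tokenizer output (illustrative; the exact pieces depend on the trained vocabulary). The "##" prefix marks a piece that continues the current word, so the first piece of a word never carries it:
"unhappiness" → ["un", "##happi", "##ness"]
"tokenization" → ["token", "##ization"]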

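To see how the WordPiece score differs from BPE's raw pair frequency, the sketch below evaluates the score formula on a toy character-split corpus. The function name and the word counts are assumptions, and the "##" continuation markers are left out to keep the example short.

```python
from collections import Counter

def wordpiece_scores(words):
    """score(x, y) = freq(xy) / (freq(x) * freq(y)) for every adjacent pair."""
    unit_freq, pair_freq = Counter(), Counter()
    for symbols, freq in words.items():
        for s in symbols:
            unit_freq[s] += freq
        for a, b in zip(symbols, symbols[1:]):
            pair_freq[(a, b)] += freq
    return {
        pair: count / (unit_freq[pair[0]] * unit_freq[pair[1]])
        for pair, count in pair_freq.items()
    }

# Toy word counts, each word split into characters
words = {
    ("l", "o", "w"): 3,
    ("l", "o", "w", "e", "s", "t"): 1,
    ("n", "e", "w", "e", "r"): 1,
    ("w", "i", "d", "e", "r"): 1,
}
scores = wordpiece_scores(words)
print(scores[("l", "o")], scores[("s", "t")])  # 0.25 1.0
```

The pair ('l', 'o') is the most frequent (4 occurrences), yet ('s', 't') scores higher because "s" and "t" never appear apart. The score rewards pairs whose parts are rarely seen on their own.
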
3.4 SentencePiece

Language-agnostic tokenization that treats input as raw Unicode, handling languages without whitespace.

import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file='model.model')
tokens = sp.encode("unhappiness", out_type=str)
# e.g. ['▁un', 'happi', 'ness']  (the exact pieces depend on the trained model)

Tokenizer Comparison

Method | Vocab Size | OOV Handling | Speed | Used By
Word | 100K-1M | Poor | Fast | spaCy, NLTK
BPE | 30K-50K | Good | Medium | GPT-2/3/4
WordPiece | 30K | Good | Medium | BERT, DistilBERT
SentencePiece | 32K-64K | Good | Medium | T5, LLaMA, mBART

4. Bag of Words and TF-IDF

4.1 Bag of Words (BoW)

BoW represents documents as fixed-length vectors of word counts, ignoring order.

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog"
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())
# ['cat', 'chased', 'dog', 'log', 'mat', 'on', 'sat', 'the']

print(X.toarray())
# [[1, 0, 0, 0, 1, 1, 1, 2],
#  [0, 0, 1, 1, 0, 1, 1, 2],
#  [1, 1, 1, 0, 0, 0, 0, 2]]

Limitations:

  • Loses word order: "dog bites man" = "man bites dog"
  • High dimensionality: vocabulary-sized sparse vectors
  • No semantic information: "good" and "excellent" are unrelated
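
The word-order limitation is easy to verify directly. The following sketch uses a hand-written vocabulary and a hypothetical `bag_of_words` helper rather than scikit-learn, so it runs without any dependencies.

```python
from collections import Counter

def bag_of_words(text, vocabulary):
    """Count how often each vocabulary word occurs, ignoring position."""
    counts = Counter(text.lower().split())
    return [counts[w] for w in vocabulary]

vocabulary = ["bites", "dog", "man"]
v1 = bag_of_words("dog bites man", vocabulary)
v2 = bag_of_words("man bites dog", vocabulary)
print(v1, v2, v1 == v2)  # [1, 1, 1] [1, 1, 1] True
```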

4.2 TF-IDF (Term Frequency–Inverse Document Frequency)

TF-IDF weights words by their importance within a document relative to the corpus.

[Figure: TF-IDF calculation flow]

  • TF(t, d): frequency of term t in document d, TF(t, d) = f(t, d) / |d|. Example: "cat" appears 2 times in a 10-word document → TF = 0.2
  • IDF(t): rareness across the corpus, IDF(t) = log(N / df(t)). Example: "the" appears in 100 of 100 documents → IDF = 0 (useless)
  • TF-IDF = TF × IDF: a high TF combined with a high IDF marks an important term. Example: "cat" (TF = 0.2, IDF = 1.5) → TF-IDF = 0.30; "the" (TF = 0.3, IDF = 0.0) → TF-IDF = 0.00

$$\text{TF-IDF}(t, d, D) = \underbrace{\frac{f_{t,d}}{\sum_{t' \in d} f_{t',d}}}_{\text{TF}} \times \underbrace{\log \frac{|D|}{|\{d' \in D : t \in d'\}|}}_{\text{IDF}}$$

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

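# Note: TfidfVectorizer's defaults use a smoothed IDF, ln((1 + N) / (1 + df)) + 1,
# followed by L2 normalization. "the" (present in every document) therefore gets the
# lowest IDF but is not driven to zero as it is in the textbook formula.
# In doc 0, "mat" (in 1 of 3 docs) receives a higher IDF than "cat" (in 2 of 3 docs).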
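
To connect the textbook formula to concrete numbers, here is a minimal hand-rolled version applied to the same three sentences. The helper names are assumptions. Because it uses the unsmoothed IDF, log(N / df), its values differ from scikit-learn's output.

```python
import math

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]
tokenized = [d.split() for d in docs]

def tf(term, doc_tokens):
    """Term frequency: share of the document's tokens that equal `term`."""
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, all_docs):
    """Inverse document frequency; assumes the term occurs in at least one document."""
    df = sum(1 for d in all_docs if term in d)
    return math.log(len(all_docs) / df)

def tf_idf(term, doc_tokens, all_docs):
    return tf(term, doc_tokens) * idf(term, all_docs)

for term in ["the", "cat", "mat"]:
    print(term, round(tf_idf(term, tokenized[0], tokenized), 4))
# the 0.0
# cat 0.0676
# mat 0.1831
```

"the" is the most frequent word in the first document, but it scores zero because it occurs in every document. "mat" outranks "cat" because it appears in only one of the three.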

5. Word Embeddings

Word embeddings map words to dense, low-dimensional vectors where geometric relationships encode semantic similarity.

5.1 Why Not One-Hot Encoding?

One-hot vectors are orthogonal — no notion of similarity:

$$\text{king} = [0, 0, \ldots, 1, \ldots, 0] \quad \text{queen} = [0, 0, \ldots, 0, \ldots, 1]$$
$$\text{sim}(\text{king}, \text{queen}) = \text{sim}(\text{king}, \text{toaster}) = 0$$

Dense embeddings solve this by learning a continuous vector space.
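
The orthogonality claim can be checked numerically. The sketch below builds one-hot vectors for a hypothetical three-word vocabulary and compares them with cosine similarity.

```python
import numpy as np

vocab = ["king", "queen", "toaster"]
one_hot = np.eye(len(vocab))  # row i is the one-hot vector for vocab[i]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

king, queen, toaster = one_hot
print(cosine(king, queen), cosine(king, toaster))  # 0.0 0.0
```

Every pair of distinct words is equally dissimilar, so the representation carries no information about meaning.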

5.2 Word2Vec (Mikolov et al., 2013)

Word2Vec learns embeddings by predicting context from words (or words from context).

[Figure: Word2Vec architectures. CBOW predicts the center word w(t) from the context words w(t-2), w(t-1), w(t+1), w(t+2) through a projection layer, with loss -log P(w(t) | context). Skip-gram predicts the context words w(t-1), w(t+1) from the center word w(t), with loss -∑ log P(w(context) | w(t)).]

CBOW (Continuous Bag of Words)

Predicts the center word given surrounding context words.

$$P(w_t \mid w_{t-c}, \ldots, w_{t-1}, w_{t+1}, \ldots, w_{t+c}) = \text{softmax}(W' \cdot \bar{v})$$

where $\bar{v} = \frac{1}{2c} \sum_{-c \le j \le c,\, j \neq 0} W \cdot x_{t+j}$ is the averaged context embedding.

import torch
import torch.nn as nn

class CBOW(nn.Module):
    def __init__(self, vocab_size, embed_dim):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embed_dim)
        self.output = nn.Linear(embed_dim, vocab_size)

    def forward(self, context_ids):
        # context_ids: (batch, 2*context_size)
        embeds = self.embeddings(context_ids)       # (batch, 2c, dim)
        hidden = embeds.mean(dim=1)                  # (batch, dim)
        logits = self.output(hidden)                 # (batch, vocab_size)
        return logits

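# Training loop. Random ids stand in for real (context, target) pairs here;
# the sizes below are placeholder assumptions, not tuned values.
from torch.utils.data import DataLoader, TensorDataset

vocab_size, context_size, num_epochs = 10000, 2, 3
contexts = torch.randint(0, vocab_size, (1024, 2 * context_size))
targets = torch.randint(0, vocab_size, (1024,))
dataloader = DataLoader(TensorDataset(contexts, targets), batch_size=64, shuffle=True)

model = CBOW(vocab_size=vocab_size, embed_dim=300)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()

for epoch in range(num_epochs):
    for context, target in dataloader:
        optimizer.zero_grad()              # clear gradients from the previous step
        logits = model(context)
        loss = criterion(logits, target)
        loss.backward()
        optimizer.step()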

Skip-gram

Predicts context words given the center word — the reverse of CBOW.

$$P(w_{t+j} \mid w_t) = \frac{\exp(v'_{w_{t+j}} \cdot v_{w_t})}{\sum_{w=1}^{V} \exp(v'_w \cdot v_{w_t})}$$

Negative Sampling (approximation):

$$\log \sigma(v'_{w_O} \cdot v_{w_I}) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)}\left[\log \sigma(-v'_{w_i} \cdot v_{w_I})\right]$$

import torch
import torch.nn as nn
import torch.nn.functional as F

class SkipGram(nn.Module):
    def __init__(self, vocab_size, embed_dim):
        super().__init__()
        self.center_embed = nn.Embedding(vocab_size, embed_dim)
        self.context_embed = nn.Embedding(vocab_size, embed_dim)

    def forward(self, center, context, negatives):
        c = self.center_embed(center)           # (batch, dim)
        p = self.context_embed(context)          # (batch, dim)
        n = self.context_embed(negatives)        # (batch, k, dim)

        pos_score = torch.sum(c * p, dim=1)     # (batch,)
        # squeeze(2) drops only the trailing dim, so a batch of size 1 stays 2-D
        neg_score = torch.bmm(n, c.unsqueeze(2)).squeeze(2)  # (batch, k)

        # logsigmoid is numerically more stable than log(sigmoid(x) + eps)
        pos_loss = -F.logsigmoid(pos_score)
        neg_loss = -F.logsigmoid(-neg_score).sum(dim=1)
        return (pos_loss + neg_loss).mean()

5.3 GloVe (Global Vectors, Pennington et al., 2014)

GloVe combines global co-occurrence statistics with local context learning.

$$J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2$$

where $X_{ij}$ is the co-occurrence count and $f(x)$ is a weighting function:

$$f(x) = \begin{cases} (x / x_{\max})^\alpha & \text{if } x < x_{\max} \\ 1 & \text{otherwise} \end{cases}$$
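
As a small numerical illustration of the weighting function and of a single term of the objective, consider the sketch below. The function names are hypothetical, and the defaults `x_max=100` and `alpha=0.75` are the values reported in the GloVe paper, not something fixed by the formula itself.

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """f(x): grows as (x / x_max)^alpha below x_max and is capped at 1 above it."""
    x = np.asarray(x, dtype=float)
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_pair_loss(w_i, w_j, b_i, b_j, x_ij):
    """Weighted squared error for one co-occurring pair (requires x_ij > 0)."""
    err = w_i @ w_j + b_i + b_j - np.log(x_ij)
    return float(glove_weight(x_ij) * err ** 2)

weights = glove_weight([1, 10, 100, 1000])
print(weights)  # rare pairs get small weights; frequent pairs are capped at 1
```

The cap keeps very frequent pairs (such as those involving "the") from dominating the loss, while the power below `x_max` damps the influence of rare, noisy co-occurrences.
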
# Using pretrained GloVe embeddings
import numpy as np

def load_glove(path):
    embeddings = {}
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            values = line.split()
            word = values[0]
            vector = np.asarray(values[1:], dtype='float32')
            embeddings[word] = vector
    return embeddings

glove = load_glove('glove.6B.300d.txt')
print(glove['king'].shape)  # (300,)

Word2Vec vs GloVe

Aspect | Word2Vec | GloVe
Training | Local context windows | Global co-occurrence matrix
Objective | Predict context (prediction-based) | Reconstruct log co-occurrence (count-based)
Speed | Faster per epoch | Faster convergence
Performance | Comparable | Comparable
Intuition | "You shall know a word by the company it keeps" | "Word co-occurrence ratios encode meaning"

6. Embedding Properties

6.1 Word Analogies

Word embeddings capture linear relationships: king - man + woman ≈ queen.

[Figure: Word analogy as vector arithmetic. The offset king − man (a gender axis in embedding space) is added to woman to land near queen: v(king) − v(man) + v(woman) ≈ v(queen).]

More analogies:

  • Paris − France + Japan ≈ Tokyo
  • bigger − big + small ≈ smaller
  • walked − walk + swim ≈ swam
  • computer − software + hardware ≈ ???

def analogy(word_a, word_b, word_c, embeddings, top_k=5):
    """a - b + c = ?"""
    vec = embeddings[word_a] - embeddings[word_b] + embeddings[word_c]

    # Normalize and compute cosine similarity
    vec = vec / np.linalg.norm(vec)
    similarities = {
        word: np.dot(vec, emb / np.linalg.norm(emb))
        for word, emb in embeddings.items()
        if word not in {word_a, word_b, word_c}
    }
    return sorted(similarities.items(), key=lambda x: -x[1])[:top_k]

# king - man + woman → queen
print(analogy('king', 'man', 'woman', glove))
# Illustrative output (exact scores depend on the embeddings): [('queen', 0.85), ('throne', 0.72), ...]

6.2 Clustering and Semantic Groups

Embeddings form clusters where semantically related words are proximal:

Cluster 1 (royalty):    king, queen, prince, throne, crown
Cluster 2 (food):       pizza, pasta, burger, restaurant, menu
Cluster 3 (emotions):   happy, sad, angry, joyful, depressed

from sklearn.manifold import TSNE
from sklearn.cluster import KMeans

words = list(glove.keys())[:5000]
vectors = np.array([glove[w] for w in words])

# Dimensionality reduction
tsne = TSNE(n_components=2, random_state=42)
coords = tsne.fit_transform(vectors)

# Clustering
kmeans = KMeans(n_clusters=10, random_state=42)
labels = kmeans.fit_predict(vectors)

# Visualize
import matplotlib.pyplot as plt
plt.scatter(coords[:, 0], coords[:, 1], c=labels, cmap='tab10', s=5)
plt.show()

6.3 Cosine Similarity

$$\text{sim}(\mathbf{u}, \mathbf{v}) = \frac{\mathbf{u} \cdot \mathbf{v}}{||\mathbf{u}|| \cdot ||\mathbf{v}||} = \cos(\theta)$$

Word pair | Cosine similarity
king ↔ queen | 0.85
king ↔ throne | 0.72
king ↔ banana | 0.05
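
The formula translates directly into NumPy. The sketch below uses made-up low-dimensional vectors instead of real embeddings, to show the three characteristic cases: same direction, orthogonal, and opposite.

```python
import numpy as np

def cosine_similarity(u, v):
    """Dot product divided by the product of the vector norms."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

a = [1.0, 2.0, 3.0]
same = cosine_similarity(a, [2.0, 4.0, 6.0])            # same direction, different length
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])  # perpendicular vectors
opposite = cosine_similarity(a, [-1.0, -2.0, -3.0])     # opposite direction
print(same, orthogonal, opposite)  # approximately 1, 0 and -1
```

Because the score depends only on direction, a vector and a scaled copy of it count as identical. This is why cosine similarity is preferred over Euclidean distance when vector lengths vary with word frequency.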

7. Sequence Representations

7.1 One-Hot Encoding

Each word is represented as a binary vector of size $|V|$:

$$x_{\text{cat}} = [0, 0, \ldots, 1, \ldots, 0] \in \{0,1\}^{|V|}$$

Problem: For $|V| = 50{,}000$, each word is a 50K-dimensional sparse vector.

7.2 Word2Vec Averaging (Sentence Embeddings)

A simple sentence representation by averaging word vectors:

$$\mathbf{s} = \frac{1}{n} \sum_{i=1}^{n} \mathbf{v}_{w_i}$$

def sentence_embedding(sentence, embeddings, dim=300):
    words = sentence.lower().split()
    vectors = [embeddings[w] for w in words if w in embeddings]
    if not vectors:
        return np.zeros(dim)
    return np.mean(vectors, axis=0)

s1 = sentence_embedding("the king sat on the throne", glove)
s2 = sentence_embedding("the queen sat on the throne", glove)
s3 = sentence_embedding("the cat sat on the mat", glove)

# s1 and s2 are more similar than s1 and s3

Limitations: Averaging loses word order — "dog bites man" and "man bites dog" produce identical embeddings.
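
This limitation can be demonstrated with a few toy vectors. The three-dimensional embeddings and the `average_embedding` helper below are hypothetical stand-ins for real 100-300 dimensional vectors.

```python
import numpy as np

# Hypothetical 3-dimensional toy vectors; real embeddings have 100-300 dimensions.
toy = {
    "dog":   np.array([1.0, 0.0, 0.0]),
    "bites": np.array([0.0, 1.0, 0.0]),
    "man":   np.array([0.0, 0.0, 1.0]),
}

def average_embedding(sentence, embeddings):
    """Mean of the word vectors; the mean does not depend on word order."""
    return np.mean([embeddings[w] for w in sentence.split()], axis=0)

e1 = average_embedding("dog bites man", toy)
e2 = average_embedding("man bites dog", toy)
print(np.allclose(e1, e2))  # True
```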

7.3 Comparison of Representations

Representation | Dimensionality | Semantic Info | Order Info | Sparsity
One-hot | $|V|$ (50K+) | None | None | ~100%
TF-IDF | $|V|$ | Document-level | None | ~99%
Word2Vec | 100-300 | Yes | No | 0%
GloVe | 100-300 | Yes | No | 0%
Averaged W2V | 100-300 | Partial | No | 0%
RNN/LSTM | Variable | Yes | Yes | 0%
Transformer | Variable | Yes | Yes | 0%

8. Complete Implementation

8.1 End-to-End NLP Pipeline

import numpy as np
import re
from collections import Counter
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

class NLPBaseline:
    def __init__(self, use_stemming=False, use_lemmatization=False):
        self.stemmer = PorterStemmer() if use_stemming else None
        self.lemmatizer = WordNetLemmatizer() if use_lemmatization else None
        self.stop_words = set(stopwords.words('english'))
        self.vectorizer = None

    def preprocess(self, text):
        text = text.lower()
        text = re.sub(r'<.*?>', '', text)
        text = re.sub(r'[^a-zA-Z\s]', '', text)
        tokens = text.split()
        tokens = [t for t in tokens if t not in self.stop_words]
        if self.stemmer:
            tokens = [self.stemmer.stem(t) for t in tokens]
        if self.lemmatizer:
            tokens = [self.lemmatizer.lemmatize(t) for t in tokens]
        return ' '.join(tokens)

    def fit(self, corpus):
        processed = [self.preprocess(doc) for doc in corpus]
        self.vectorizer = TfidfVectorizer(max_features=10000)
        self.X = self.vectorizer.fit_transform(processed)
        return self

    def transform(self, texts):
        processed = [self.preprocess(doc) for doc in texts]
        return self.vectorizer.transform(processed)

    def similar_docs(self, query, top_k=5):
        q_vec = self.transform([query])
        sims = cosine_similarity(q_vec, self.X).flatten()
        return np.argsort(sims)[::-1][:top_k], sims

# Usage
corpus = [
    "Natural language processing enables computers to understand text",
    "Machine learning algorithms learn patterns from data",
    "Deep learning uses neural networks for complex tasks",
    "NLP combines linguistics and computer science",
]

pipeline = NLPBaseline(use_lemmatization=True)
pipeline.fit(corpus)
indices, scores = pipeline.similar_docs("computational linguistics", top_k=3)
for i in indices:
    print(f"[{scores[i]:.3f}] {corpus[i]}")

8.2 Training Word2Vec from Scratch

from collections import Counter

import torch
from torch.utils.data import Dataset, DataLoader

class Word2VecDataset(Dataset):
    """(center, context) id pairs for every position inside the window."""
    def __init__(self, token_ids, window_size=5):
        self.data = []
        for i, center in enumerate(token_ids):
            context = token_ids[max(0, i-window_size):i] + \
                      token_ids[i+1:i+window_size+1]
            for ctx in context:
                self.data.append((center, ctx))

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]

def train_word2vec(corpus, vocab_size, embed_dim=100, epochs=5):
    # Build vocab from the most frequent words; the corpus may hold fewer than vocab_size
    word_counts = Counter(w for doc in corpus for w in doc.split())
    vocab = {w: i for i, (w, _) in enumerate(word_counts.most_common(vocab_size))}
    vocab_size = len(vocab)

    # Prepare data (documents are concatenated, so windows can span document boundaries)
    token_ids = [vocab[w] for doc in corpus for w in doc.split() if w in vocab]
    dataset = Word2VecDataset(token_ids, window_size=3)
    loader = DataLoader(dataset, batch_size=512, shuffle=True)

    # Model: SkipGram is the class defined in Section 5.2
    model = SkipGram(vocab_size, embed_dim)
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

    for epoch in range(epochs):
        total_loss = 0.0
        for center, context in loader:
            # Negative sampling: uniform random ids (may occasionally hit the true context word)
            negatives = torch.randint(0, vocab_size, (center.size(0), 5))
            optimizer.zero_grad()
            loss = model(center, context, negatives)
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
        # loss is already a per-batch mean, so average over the number of batches
        print(f"Epoch {epoch+1}: loss={total_loss/len(loader):.4f}")

    # Extract embeddings
    return model.center_embed.weight.detach().numpy(), vocab
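
The training function above draws negative samples uniformly at random. The original Word2Vec paper instead samples negatives from the unigram distribution raised to the 3/4 power. The sketch below shows that distribution in NumPy; the function name and the word counts are hypothetical and only illustrate the idea.

```python
import numpy as np

def negative_sampling_probs(counts, power=0.75):
    """Unigram counts raised to `power`, renormalized to a probability distribution."""
    weights = np.asarray(counts, dtype=float) ** power
    return weights / weights.sum()

counts = [1000, 100, 10]  # hypothetical word frequencies
probs = negative_sampling_probs(counts)

rng = np.random.default_rng(0)
# 4 training examples, 5 negative word ids each
negatives = rng.choice(len(counts), size=(4, 5), p=probs)
print(probs.round(3), negatives.shape)
```

Compared with the raw unigram distribution, the 0.75 exponent shifts probability mass from very frequent words toward rarer ones, so rare words are seen as negatives more often than their counts alone would allow.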

Key Takeaways

  1. Preprocessing matters — lowercasing, stopword removal, and lemmatization significantly affect downstream performance
  2. Subword tokenization (BPE, WordPiece) balances vocabulary size with OOV handling
  3. TF-IDF weights words by importance: frequent in a document but rare across the corpus
  4. Word2Vec learns embeddings via local context prediction; GloVe leverages global co-occurrence statistics
  5. Vector arithmetic captures semantic relationships: king - man + woman ≈ queen
  6. Cosine similarity measures semantic closeness in embedding space
  7. Bag-of-words approaches lose word order, syntax, and context, which motivates the contextual embeddings (BERT, GPT) used in modern NLP

Next: Contextual Embeddings and Transformers — How BERT and GPT solve the polysemy problem with context-dependent representations.
