NLP Basics: Tokenization, Embeddings

💡 Natural Language Processing bridges human language and machine understanding. This lesson covers text preprocessing, tokenization, word embeddings, and sequence models — the building blocks of modern NLP.

1. The NLP Pipeline

Architecture Diagram

Raw Text → Tokenization → Cleaning → Encoding → Embedding → Model → Output
"I love NLP!" → ["I","love","NLP!"] → ["i","love","nlp"] → [45, 892, 3] → [0.2, -0.5, ...] → classifier → POSITIVE

Each stage moves the text one step closer to a numerical representation that models can process: tokens become integer IDs, and IDs become dense vectors.
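The first four stages of the pipeline can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the `vocab` dictionary is hypothetical, with IDs chosen to match the example above, and whitespace splitting stands in for a real tokenizer.

```python
import string

# Hypothetical vocabulary: the IDs are chosen to match the illustration above
vocab = {"<unk>": 0, "nlp": 3, "i": 45, "love": 892}

def tokenize(text):
    """Split on whitespace (the simplest possible tokenizer)."""
    return text.split()

def clean(tokens):
    """Lowercase each token and strip punctuation; drop tokens left empty."""
    table = str.maketrans("", "", string.punctuation)
    cleaned = [tok.lower().translate(table) for tok in tokens]
    return [tok for tok in cleaned if tok]

def encode(tokens):
    """Map tokens to integer IDs, falling back to <unk> for unknown words."""
    return [vocab.get(tok, vocab["<unk>"]) for tok in tokens]

tokens = tokenize("I love NLP!")
cleaned = clean(tokens)
ids = encode(cleaned)
print(tokens, cleaned, ids)
```

The embedding and model stages would then consume `ids`; they are covered in the later sections.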

2. Text Preprocessing

Cleaning Operations

import re
import string

def clean_text(text):
    # Lowercase
    text = text.lower()

    # Remove URLs
    text = re.sub(r'https?://\S+|www\.\S+', '', text)

    # Remove HTML tags
    text = re.sub(r'<.*?>', '', text)

    # Remove email addresses
    text = re.sub(r'\S+@\S+', '', text)

    # Remove numbers (optional — depends on task)
    text = re.sub(r'\d+', '', text)

    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))

    # Remove extra whitespace
    text = re.sub(r'\s+', ' ', text).strip()

    return text

# Example
raw = "Check out https://example.com! Email me at test@email.com. Price: $29.99"
print(clean_text(raw))
# Output: "check out email me at price"

Stopword Removal

import nltk
from nltk.corpus import stopwords

nltk.download('stopwords', quiet=True)  # one-time download of the stopword lists
stop_words = set(stopwords.words('english'))

def remove_stopwords(tokens):
    return [t for t in tokens if t not in stop_words]

tokens = ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]
print(remove_stopwords(tokens))
# Output: ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']

Stemming vs Lemmatization

Word          Stemming (Porter)    Lemmatization
────────────────────────────────────────────────
running       run                  run
ran           ran                  run
better        better               good
studies       studi                study
geese         gees                 goose

from nltk.stem import PorterStemmer, WordNetLemmatizer
# Requires the WordNet data: nltk.download('wordnet')

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# The lemmatizer needs the part of speech: 'v' = verb, 'a' = adjective, 'n' = noun
words = [("running", "v"), ("ran", "v"), ("better", "a"), ("studies", "n"), ("geese", "n")]

for word, pos in words:
    stem = stemmer.stem(word)
    lemma = lemmatizer.lemmatize(word, pos=pos)
    print(f"{word:12s} → Stem: {stem:10s} Lemma: {lemma}")

# running      → Stem: run        Lemma: run
# ran          → Stem: ran        Lemma: run
# better       → Stem: better     Lemma: good
# studies      → Stem: studi      Lemma: study
# geese        → Stem: gees       Lemma: goose

3. Tokenization Methods

Word-Level Tokenization

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    "The cat sat on the mat",
    "The dog chased the cat",
    "The cat and the dog are friends"
]

# Bag of Words
bow = CountVectorizer()
X_bow = bow.fit_transform(corpus)
print("Vocabulary:", bow.get_feature_names_out())
print("BoW matrix:\n", X_bow.toarray())

# TF-IDF
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(corpus)
print("TF-IDF matrix:\n", X_tfidf.toarray().round(3))

TF-IDF Formula

TF-IDF

\text{TF-IDF}(t, d) = \text{TF}(t,d) \times \text{IDF}(t)

Here,

$TF(t,d)$ = Term frequency of term t in document d
$IDF(t)$ = Inverse document frequency of term t

TF and IDF Components

\text{TF}(t,d) = \frac{f_{t,d}}{\sum_{t' \in d} f_{t',d}} \qquad \text{IDF}(t) = \log\frac{N}{|\{d : t \in d\}|}

Here,

$f_{t,d}$ = Frequency of term t in document d
$N$ = Total number of documents in the corpus
$|\{d : t \in d\}|$ = Number of documents containing term t

ℹ️ TF-IDF Intuition

TF-IDF balances two signals: term frequency (how often a word appears in a document) and inverse document frequency (how rare the word is across all documents). Common words like "the" have high TF but low IDF (they appear everywhere), so they get low TF-IDF scores. Rare but informative words like "neural" have moderate TF but high IDF, giving them high TF-IDF scores.
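To make the formulas concrete, here is a minimal sketch that computes the textbook TF-IDF by hand on the three-sentence corpus used earlier. The helper name `tf_idf` is illustrative, and the natural logarithm is an assumption (the formula above does not fix the base).

```python
import math
from collections import Counter

corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the cat and the dog are friends",
]
docs = [doc.split() for doc in corpus]
N = len(docs)

# Document frequency: number of documents that contain each term
df = Counter(term for doc in docs for term in set(doc))

def tf_idf(term, doc):
    tf = doc.count(term) / len(doc)      # f_{t,d} divided by the document length
    idf = math.log(N / df[term])         # log(N / number of documents containing t)
    return tf * idf

for term in ("the", "cat", "mat"):
    print(term, round(tf_idf(term, docs[0]), 3))
# the 0.0
# cat 0.0
# mat 0.183
```

Both "the" and "cat" appear in every document, so their IDF is log(1) = 0, while "mat" appears in only one. Note that scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2-normalizes each row by default, so its numbers differ from this textbook version.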

Subword Tokenization (BPE)

Byte Pair Encoding (BPE) starts from individual characters and repeatedly merges the most frequent pair of adjacent symbols into a new symbol:

Vocabulary: {'l', 'o', 'w', 'e', 'r', 's', 't', 'n', 'i', 'd', 'unk'}

Corpus frequencies:
  'low': 5, 'lower': 2, 'newest': 6, 'widest': 3

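Iteration 1: Most frequent pair → ('e', 's') (9 occurrences; ties broken by first occurrence)
  Vocabulary: {..., 'es'}

Iteration 2: Most frequent pair → ('es', 't') (9 occurrences)
  Vocabulary: {..., 'est'}

Iteration 3: Most frequent pair → ('l', 'o') (7 occurrences)
  Vocabulary: {..., 'lo'}

Iteration 4: Most frequent pair → ('lo', 'w') (7 occurrences)
  Vocabulary: {..., 'low'}

Iterations 5-6: ('n', 'e'), then ('ne', 'w') (6 occurrences each)
  Vocabulary: {..., 'ne', 'new'}

Result after 6 merges: 'newest' → ['new', 'est'] (not character-by-character)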
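The merge loop itself is short. The following is a minimal sketch of BPE training on the toy corpus above, not a production tokenizer: it omits end-of-word markers, and ties between equally frequent pairs are broken by first occurrence, which is an assumption of this sketch. The function names are illustrative.

```python
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of `pair` with the merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

word_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
vocab = {tuple(word): freq for word, freq in word_freqs.items()}  # start from characters

merges = []
for _ in range(6):
    pairs = get_pair_counts(vocab)
    best = max(pairs, key=pairs.get)  # on ties, the first pair encountered wins
    merges.append(best)
    vocab = merge_pair(best, vocab)

print(merges)
print(list(vocab))
```

After six merges the word "newest" is stored as the two symbols `('new', 'est')`. To tokenize new text, the learned merges are replayed in the same order.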

from transformers import AutoTokenizer

# Note: BERT uses WordPiece, a subword scheme closely related to BPE
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

text = "Transfer learning is incredibly powerful!"
tokens = tokenizer.tokenize(text)
print("Tokens:", tokens)
# ['transfer', 'learning', 'is', 'incredibly', 'powerful', '!']
# Rarer words are split into pieces marked with '##'

ids = tokenizer.encode(text)
print("Token IDs:", ids)
# encode() adds the special tokens: the list starts with 101 ([CLS]) and ends with 102 ([SEP])

decoded = tokenizer.decode(ids)
print("Decoded:", decoded)
# [CLS] transfer learning is incredibly powerful! [SEP]

4. Word Embeddings

The Problem with One-Hot Encoding

One-Hot Encoding

\text{one-hot}(\text{"cat"}) = [0, 0, 1, 0, \ldots, 0] \in \mathbb{R}^{V}

Here,

$V$ = Vocabulary size (number of unique words)
$[0, 0, 1, 0, ...]$ = Vector with 1 at the index of 'cat', 0 elsewhere

Problems:

No notion of similarity: $\text{sim}(\text{cat}, \text{dog}) = \text{sim}(\text{cat}, \text{democracy}) = 0$ (all orthogonal)
Sparse and high-dimensional ( $V$ = vocabulary size, often 50K+)
No semantic relationships captured

ℹ️ Why Embeddings Are Better

Word embeddings solve these problems by mapping words to dense, low-dimensional vectors (typically 100-300 dimensions) where similar words are close together. The embedding matrix $E \in \mathbb{R}^{V \times d}$ is learned during training, and each row $E_i$ represents the vector for word $i$ . Multiplying the one-hot vector for word $i$ by $E$ selects exactly that row, so an embedding layer is implemented as a table lookup rather than a matrix product.
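The link between one-hot vectors and embedding lookup can be checked directly. This small sketch uses a random toy matrix (the sizes `V = 5` and `d = 3` are arbitrary assumptions) and plain Python lists instead of a tensor library.

```python
import random

random.seed(0)
V, d = 5, 3  # toy vocabulary size and embedding dimension
E = [[random.uniform(-1, 1) for _ in range(d)] for _ in range(V)]

def one_hot(i, size):
    return [1.0 if k == i else 0.0 for k in range(size)]

def vec_mat(x, M):
    """Row vector times matrix: x (length V) times M (V x d)."""
    return [sum(x[r] * M[r][c] for r in range(len(M))) for c in range(len(M[0]))]

i = 2
via_matmul = vec_mat(one_hot(i, V), E)  # O(V * d) work
via_lookup = E[i]                       # O(d) work, same result
print(via_matmul == via_lookup)
```

The lookup gives the same vector at a fraction of the cost, which is why embedding layers index into the matrix instead of multiplying by it.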

Word2Vec (Mikolov et al., 2013)

Learns dense vectors by predicting context.

Definition: Word2Vec Skip-gram

Given a center word, predict surrounding context words. The model learns dense vector representations where words with similar contexts have similar vectors. The key insight is that "you shall know a word by the company it keeps" (Firth, 1957).

Word2Vec Skip-gram Objective

\mathcal{L}(\theta) = -\sum_{t=1}^{T} \sum_{-c \leq j \leq c, j \neq 0} \log P(w_{t+j} | w_t)

Here,

$T$ = Length of the corpus (number of tokens)
$c$ = Context window size
$w_t$ = Center word at position t
$w_{t+j}$ = Context word at position t+j
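The double sum in the objective runs over (center, context) pairs. A minimal sketch of how those pairs are generated, assuming positions outside the sentence are simply skipped (the function name is illustrative):

```python
def skipgram_pairs(tokens, c):
    """Return all (center, context) pairs within a window of size c."""
    pairs = []
    for t, center in enumerate(tokens):
        for j in range(-c, c + 1):
            if j != 0 and 0 <= t + j < len(tokens):
                pairs.append((center, tokens[t + j]))
    return pairs

print(skipgram_pairs(["the", "cat", "sat"], c=1))
# [('the', 'cat'), ('cat', 'the'), ('cat', 'sat'), ('sat', 'cat')]
```

Each pair contributes one $\log P(w_{t+j} | w_t)$ term to the loss.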

Skip-gram Probability (Softmax)

P(w_O | w_I) = \frac{\exp(\mathbf{v}_{w_O}' \cdot \mathbf{v}_{w_I})}{\sum_{w=1}^{V} \exp(\mathbf{v}_w' \cdot \mathbf{v}_{w_I})}

Here,

$w_I$ = Input (center) word
$w_O$ = Output (context) word
$\mathbf{v}_{w_I}$ = Input vector for word w_I
$\mathbf{v}_{w_O}'$ = Output vector for word w_O
$V$ = Vocabulary size

💡 Negative Sampling

Computing the softmax denominator over the entire vocabulary is expensive ( $O(V)$ per word). Negative sampling approximates this by training on the true context word plus a few randomly sampled "negative" words. This reduces computation to $O(K)$ where $K$ is the number of negatives (typically 5-20).
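For one (center, context) pair, the negative-sampling loss is $-\log \sigma(\mathbf{v}_{w_O}' \cdot \mathbf{v}_{w_I}) - \sum_{k=1}^{K} \log \sigma(-\mathbf{v}_{w_k}' \cdot \mathbf{v}_{w_I})$. The sketch below evaluates it with plain Python lists. The vectors are random placeholders, so this illustrates the cost structure rather than a trained model.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def negative_sampling_loss(v_center, u_context, u_negatives):
    """Loss for one true (center, context) pair and K sampled negatives."""
    loss = -math.log(sigmoid(dot(u_context, v_center)))      # pull the true pair together
    for u_neg in u_negatives:
        loss -= math.log(sigmoid(-dot(u_neg, v_center)))     # push each negative away
    return loss

random.seed(0)
dim, K = 4, 5  # toy embedding size and number of negatives (assumed values)
rand_vec = lambda: [random.uniform(-0.5, 0.5) for _ in range(dim)]
v_center, u_context = rand_vec(), rand_vec()
u_negatives = [rand_vec() for _ in range(K)]

print(negative_sampling_loss(v_center, u_context, u_negatives))
```

Only K + 1 dot products are needed per training pair, regardless of the vocabulary size. In the original Word2Vec, negatives are drawn from a smoothed unigram distribution rather than uniformly.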

💡 Key Property: Word Arithmetic

Word embeddings capture semantic relationships that can be expressed through vector arithmetic:

\vec{king} - \vec{man} + \vec{woman} \approx \vec{queen}
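The arithmetic is easiest to see with hand-made vectors. In this toy sketch the two axes are assumed to mean "royalty" and "gender"; real embeddings are learned and their dimensions are not interpretable in this way.

```python
import math

# Hand-made 2-d vectors (axis 0: royalty, axis 1: gender); not learned embeddings
vectors = {
    "king":     [1.0,  1.0],
    "queen":    [1.0, -1.0],
    "man":      [0.0,  1.0],
    "woman":    [0.0, -1.0],
    "prince":   [0.8,  1.0],
    "princess": [0.8, -1.0],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

print(analogy("king", "man", "woman"))  # queen
```

Subtracting "man" and adding "woman" flips the gender axis while leaving the royalty axis untouched, which lands on "queen".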

from gensim.models import Word2Vec
import numpy as np

# Training data
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "played", "in", "the", "park"],
    ["cats", "and", "dogs", "are", "friends"],
    ["the", "cat", "chased", "the", "mouse"],
]

# Train Word2Vec
model = Word2Vec(
    sentences,
    vector_size=100,  # embedding dimension
    window=5,         # context window size
    min_count=1,      # minimum word frequency
    sg=1,             # 1=skip-gram, 0=CBOW
    epochs=100,
)

# Get word vector
cat_vector = model.wv['cat']
print(f"Vector shape: {cat_vector.shape}")  # (100,)

# Find similar words
similar = model.wv.most_similar('cat', topn=5)
print("Similar to 'cat':", similar)

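# Word analogy: king - man + woman = ?
# The toy corpus above does not contain these words, so guard against a
# KeyError; with vectors trained on a large corpus the top result is typically 'queen'.
analogy_words = ['king', 'woman', 'man']
if all(w in model.wv for w in analogy_words):
    result = model.wv.most_similar(
        positive=['king', 'woman'],
        negative=['man'],
        topn=1
    )
    print("king - man + woman =", result)
else:
    print("Analogy words are not in the vocabulary; train on a larger corpus")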

GloVe (Global Vectors, Pennington et al., 2014)

Fits word vectors to global co-occurrence statistics, where the counts themselves are gathered with local context windows.

GloVe Objective

J = \sum_{i,j=1}^{V} f(X_{ij}) \left( \mathbf{w}_i^T \tilde{\mathbf{w}}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2

Here,

$X_{ij}$ = Co-occurrence count of words i and j
$\mathbf{w}_i$ = Word vector for word i
$\tilde{\mathbf{w}}_j$ = Context word vector for word j
$b_i, \tilde{b}_j$ = Bias terms
$f(x)$ = Weighting function that down-weights rare co-occurrences and caps the weight of frequent ones

The weighting function $f(x)$ is defined as:

f(x) = \begin{cases} (x/x_{\max})^\alpha & \text{if } x < x_{\max} \\ 1 & \text{otherwise} \end{cases}

The original paper uses $x_{\max} = 100$ and $\alpha = 0.75$.
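A quick numerical look at the weighting function helps show its effect. This sketch assumes $x_{\max} = 100$ and $\alpha = 0.75$, the values reported in the GloVe paper; the function names are illustrative.

```python
import math

def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weight for a co-occurrence count x: grows sublinearly, then caps at 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_term(x_ij, w_dot, b_i, b_j):
    """One term of the objective: f(X_ij) * (w_i . w~_j + b_i + b~_j - log X_ij)^2."""
    return glove_weight(x_ij) * (w_dot + b_i + b_j - math.log(x_ij)) ** 2

for count in (1, 10, 100, 1000):
    print(count, round(glove_weight(count), 4))
```

A pair seen once gets a weight of about 0.03, while every pair seen 100 times or more gets weight 1, so very frequent pairs cannot dominate the loss.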

import numpy as np
from gensim.models import KeyedVectors

# Load pre-trained GloVe embeddings
# Download from: https://nlp.stanford.edu/projects/glove/
# glove.6B.100d.txt format: word float float float ...

def load_glove(filepath):
    embeddings = {}
    with open(filepath, 'r', encoding='utf-8') as f:
        for line in f:
            values = line.split()
            word = values[0]
            vector = np.asarray(values[1:], dtype='float32')
            embeddings[word] = vector
    return embeddings

# Usage
# glove_embeddings = load_glove('glove.6B.100d.txt')
# print(glove_embeddings['king'].shape)  # (100,)

# Create embedding matrix for a vocabulary
def create_embedding_matrix(vocab, embeddings_dict, dim=100):
    matrix = np.zeros((len(vocab), dim))
    found = 0
    for word, idx in vocab.items():
        if word in embeddings_dict:
            matrix[idx] = embeddings_dict[word]
            found += 1
    print(f"Found {found}/{len(vocab)} words in embeddings")
    return matrix

Embedding Comparison

Method      Training               Semantics                  Speed    Memory
─────────────────────────────────────────────────────────────────────────────
Word2Vec    Local context          Strong (syntactic)         Fast     Low
GloVe       Global co-occurrence   Strong (semantic)          Medium   Low
FastText    Subword info           Handles OOV                Slow     High
BERT        Contextual             Best (context-dependent)   Slow     High

5. Sequence Models

Recurrent Neural Network (RNN)

RNN Hidden State

h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t + b_h)

Here,

$h_t$ = Hidden state at time t
$x_t$ = Input at time t
$W_{hh}$ = Recurrent weight matrix
$W_{xh}$ = Input weight matrix
$b_h$ = Hidden bias

RNN Output

y_t = W_{hy} h_t + b_y

Here,

$y_t$ = Output at time t
$h_t$ = Hidden state at time t
$W_{hy}$ = Hidden-to-output weight matrix
$b_y$ = Output bias

Architecture Diagram

        h_0 -> [RNN] -> h_1 -> [RNN] -> h_2 -> [RNN] -> h_3 -> y
                 ^               ^               ^
                x_1             x_2             x_3
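The unrolled diagram corresponds to a simple loop. The sketch below runs the two RNN equations over three inputs using plain Python lists. All weights are arbitrary illustrative values, and the sizes (hidden size 2, input size 3) are assumptions.

```python
import math

def matvec(W, v):
    """Matrix-vector product with nested lists."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def rnn_step(h_prev, x, W_hh, W_xh, b_h):
    """h_t = tanh(W_hh h_{t-1} + W_xh x_t + b_h)"""
    rec = matvec(W_hh, h_prev)
    inp = matvec(W_xh, x)
    return [math.tanh(r + i + b) for r, i, b in zip(rec, inp, b_h)]

# Arbitrary toy weights: hidden size 2, input size 3, output size 1
W_hh = [[0.5, -0.1], [0.2, 0.3]]
W_xh = [[0.1, 0.0, -0.2], [0.0, 0.4, 0.1]]
b_h = [0.0, 0.1]
W_hy = [[1.0, -1.0]]
b_y = [0.05]

inputs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # x_1, x_2, x_3
h = [0.0, 0.0]                                                 # h_0
for x in inputs:
    h = rnn_step(h, x, W_hh, W_xh, b_h)  # the same weights are reused at every step

y = [o + b for o, b in zip(matvec(W_hy, h), b_y)]  # y = W_hy h_3 + b_y
print(h, y)
```

The same three weight arrays are applied at every time step; only the hidden state changes.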

Problem: Vanishing gradient — struggles with long-range dependencies.

Long Short-Term Memory (LSTM)

Definition: LSTM Gates

LSTM uses three gates (forget, input, output) to control the flow of information, allowing it to learn long-range dependencies by selectively remembering or forgetting information.

LSTM Forget Gate

f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)

Here,

$f_t$ = Forget gate output (0 = forget, 1 = keep)
$\sigma$ = Sigmoid activation (outputs between 0 and 1)
$[h_{t-1}, x_t]$ = Concatenation of previous hidden state and current input

LSTM Input Gate and Candidate

i_t = \sigma(W_i [h_{t-1}, x_t] + b_i) \\ \tilde{C}_t = \tanh(W_C [h_{t-1}, x_t] + b_C)

Here,

$i_t$ = Input gate (what to update)
$\tilde{C}_t$ = Candidate cell state (new information)

LSTM Cell State Update

C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t

Here,

$C_t$ = New cell state
$f_t \odot C_{t-1}$ = What to keep from the old cell state
$i_t \odot \tilde{C}_t$ = What to add as new information

LSTM Output Gate

o_t = \sigma(W_o [h_{t-1}, x_t] + b_o) \\ h_t = o_t \odot \tanh(C_t)

Here,

$o_t$ = Output gate (what to output)
$h_t$ = Final hidden state
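Putting the four equations together, one LSTM step can be sketched as a scalar cell (hidden size 1, input size 1). This is an illustration of the gate arithmetic only; the parameter layout `(w_h, w_x, b)` per gate is an assumption made for this sketch.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(h_prev, c_prev, x, params):
    """One step of a scalar LSTM; params maps gate name -> (w_h, w_x, b)."""
    def gate(name, activation):
        w_h, w_x, b = params[name]
        return activation(w_h * h_prev + w_x * x + b)

    f = gate("f", sigmoid)          # forget gate
    i = gate("i", sigmoid)          # input gate
    c_tilde = gate("c", math.tanh)  # candidate cell state
    o = gate("o", sigmoid)          # output gate

    c = f * c_prev + i * c_tilde    # additive cell state update
    h = o * math.tanh(c)            # exposed hidden state
    return h, c

# Gates biased to "remember everything, write nothing"
params = {"f": (0.0, 0.0, 20.0), "i": (0.0, 0.0, -20.0),
          "c": (0.0, 0.0, 0.0), "o": (0.0, 0.0, 20.0)}

h, c = 0.0, 0.7
for x in [1.0, -2.0, 3.0]:
    h, c = lstm_step(h, c, x, params)
print(h, c)
```

With the forget gate near 1 and the input gate near 0, the cell state stays at roughly 0.7 no matter what the inputs are, which is the mechanism that lets an LSTM carry information across many steps.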

import torch
import torch.nn as nn

class LSTMClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_classes):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=2,
                           batch_first=True, dropout=0.3, bidirectional=True)
        self.classifier = nn.Linear(hidden_dim * 2, num_classes)  # *2 for bidirectional

    def forward(self, x):
        embeds = self.embedding(x)              # (batch, seq_len, embed_dim)
        lstm_out, (h_n, c_n) = self.lstm(embeds) # lstm_out: (batch, seq_len, hidden*2)

        # Concatenate final hidden states from both directions
        hidden = torch.cat([h_n[-2], h_n[-1]], dim=1)  # (batch, hidden*2)
        output = self.classifier(hidden)
        return output

# Example
model = LSTMClassifier(vocab_size=10000, embed_dim=128, hidden_dim=256, num_classes=2)
x = torch.randint(0, 10000, (32, 50))  # batch of 32, sequence length 50
print(model(x).shape)  # torch.Size([32, 2])

GRU (Gated Recurrent Unit)

Simplified LSTM with 2 gates instead of 3:

GRU Update and Reset Gates

z_t = \sigma(W_z [h_{t-1}, x_t]) \quad \text{(update gate)} \\ r_t = \sigma(W_r [h_{t-1}, x_t]) \quad \text{(reset gate)}

Here,

$z_t$ = Update gate (interpolation between old and new)
$r_t$ = Reset gate (how much past to forget)

GRU Hidden State Update

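\tilde{h}_t = \tanh(W [r_t \odot h_{t-1}, x_t]) \\ h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t

Here,

$\tilde{h}_t$ = Candidate hidden state, computed from the reset-gated previous state
$h_t$ = New hidden state
$(1 - z_t) \odot h_{t-1}$ = What to keep from old hidden state
$z_t \odot \tilde{h}_t$ = What to add as new information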

ℹ️ LSTM vs. GRU

GRU combines the forget and input gates into a single update gate $z_t$ and merges the cell state and hidden state. It therefore needs three weight blocks (update, reset, candidate) where LSTM needs four (forget, input, output, candidate), which is about 25% fewer parameters for the same layer sizes and usually means faster training. GRU performs comparably to LSTM on many tasks, especially when data is limited. Use LSTM when you need a separate cell state and finer control over information flow; use GRU for simpler, faster models.
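Counting weight blocks makes the size difference concrete: an LSTM layer has four (forget, input, output, candidate) and a GRU layer has three (update, reset, candidate). The sketch below assumes one weight matrix over $[h_{t-1}, x_t]$ and one bias vector per block; some libraries keep two bias vectors per block, so their exact counts are slightly higher.

```python
def gated_rnn_params(input_dim, hidden_dim, num_blocks):
    """Parameter count of one recurrent layer with `num_blocks` weight blocks.

    Assumes each block has a (hidden_dim x (hidden_dim + input_dim)) matrix
    plus a single bias vector of length hidden_dim.
    """
    per_block = hidden_dim * (hidden_dim + input_dim) + hidden_dim
    return num_blocks * per_block

lstm = gated_rnn_params(128, 256, num_blocks=4)  # forget, input, output, candidate
gru = gated_rnn_params(128, 256, num_blocks=3)   # update, reset, candidate
print(lstm, gru, gru / lstm)
# 394240 295680 0.75
```

The ratio is exactly 3/4 whatever the layer sizes, because both architectures use blocks of the same shape.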

Theorem: Vanishing Gradient Problem in RNNs

In vanilla RNNs, backpropagating through one time step multiplies the gradient by the Jacobian $\text{diag}(\tanh'(\cdot)) \, W_{hh}$ . After $T$ steps the gradient norm is bounded by roughly $\|W_{hh}\|^T$ , since $|\tanh'| \leq 1$ . If $\|W_{hh}\| < 1$ , gradients vanish exponentially and long-range dependencies are lost. If $\|W_{hh}\| > 1$ , gradients can explode. LSTMs mitigate the vanishing case by maintaining a cell state with additive (not multiplicative) updates: along the cell state the gradient is scaled only by the forget gate $f_t$ , so it passes through nearly unchanged when $f_t \approx 1$ .
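The exponential behaviour is easy to reproduce with a scalar recurrence. This sketch tracks $|\partial h_T / \partial h_0|$ for $h_t = \tanh(w \, h_{t-1})$ and, for the exploding case, for the linear recurrence $h_t = w \, h_{t-1}$. The scalar setting and the chosen values of `w` are simplifying assumptions.

```python
import math

def gradient_through_time(w, steps, h0=0.1, use_tanh=True):
    """Return |d h_T / d h_0| for the scalar recurrence h_t = act(w * h_{t-1})."""
    h, grad = h0, 1.0
    for _ in range(steps):
        if use_tanh:
            h = math.tanh(w * h)
            local = w * (1.0 - h * h)  # derivative of tanh(w * h) with respect to h
        else:
            h = w * h
            local = w
        grad *= local                  # chain rule: multiply the per-step derivatives
    return abs(grad)

vanishing = gradient_through_time(w=0.5, steps=50)
exploding = gradient_through_time(w=1.5, steps=50, use_tanh=False)
print(f"w=0.5, tanh:   {vanishing:.3e}")
print(f"w=1.5, linear: {exploding:.3e}")
```

After 50 steps the first gradient is below 1e-15 and the second is above 1e8, which is why plain RNNs are hard to train on long sequences and why exploding gradients are usually handled with gradient clipping.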

📝 Sentiment Analysis Pipeline

Task: Classify movie reviews as positive/negative.

Pipeline:

Preprocessing: Lowercase, remove punctuation, tokenize
Vocabulary: Build word-to-index mapping (top 25K words)
Embedding: Load pre-trained GloVe vectors (300d) or train from scratch
Model: Bidirectional LSTM with 2 layers, hidden dim=256
Classification: Use final hidden state $[h_{\text{forward}}; h_{\text{backward}}]$ as features for a linear classifier

Expected Performance:

TF-IDF + Logistic Regression: ~85% accuracy
LSTM (random init): ~87% accuracy
LSTM + GloVe: ~89% accuracy
LSTM + fine-tuned GloVe: ~90% accuracy
BERT fine-tuned: ~93% accuracy

The improvement from GloVe shows the value of pre-trained embeddings. The jump to BERT shows the power of contextual representations.

6. Complete Example: Sentiment Classification

import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from collections import Counter

class Vocabulary:
    def __init__(self, max_size=25000):
        self.word2idx = {'<PAD>': 0, '<UNK>': 1}
        self.idx2word = {0: '<PAD>', 1: '<UNK>'}
        self.max_size = max_size

    def build(self, texts):
        counter = Counter()
        for text in texts:
            counter.update(text.lower().split())  # lowercase to match encode()
        for word, _ in counter.most_common(self.max_size - 2):
            idx = len(self.word2idx)
            self.word2idx[word] = idx
            self.idx2word[idx] = word

    def encode(self, text, max_len=100):
        tokens = text.lower().split()
        ids = [self.word2idx.get(w, 1) for w in tokens]
        # Pad or truncate
        if len(ids) < max_len:
            ids += [0] * (max_len - len(ids))
        else:
            ids = ids[:max_len]
        return ids

class SentimentDataset(Dataset):
    def __init__(self, texts, labels, vocab, max_len=100):
        self.texts = texts
        self.labels = labels
        self.vocab = vocab
        self.max_len = max_len

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        ids = self.vocab.encode(self.texts[idx], self.max_len)
        return torch.tensor(ids), torch.tensor(self.labels[idx])

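# Training
# Placeholder data so the script runs end to end; replace with a real dataset
train_texts = ["a great and moving film", "a dull and boring film"]
train_labels = [1, 0]  # 1 = positive, 0 = negative

vocab = Vocabulary(max_size=25000)
vocab.build(train_texts)

train_dataset = SentimentDataset(train_texts, train_labels, vocab)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)

# LSTMClassifier is the model defined in the LSTM section
model = LSTMClassifier(vocab_size=len(vocab.word2idx), embed_dim=128, hidden_dim=256, num_classes=2)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(10):
    model.train()
    total_loss = 0.0
    for batch_ids, batch_labels in train_loader:
        optimizer.zero_grad()
        outputs = model(batch_ids)
        loss = criterion(outputs, batch_labels)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    print(f"Epoch {epoch + 1}: loss = {total_loss / len(train_loader):.4f}")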

7. Key Takeaways

📋 Summary: NLP Basics

Preprocessing (cleaning, normalization) is task-dependent — modern models like BERT handle raw text well, but TF-IDF models still benefit from preprocessing
TF-IDF captures word importance by balancing term frequency and inverse document frequency; useful as a strong baseline
Word2Vec learns embeddings by predicting context (skip-gram) or predicting words from context (CBOW); captures syntactic and semantic relationships
GloVe combines global co-occurrence statistics with local context, often capturing better semantic relationships
Subword tokenization (BPE) handles rare words and morphological variation; enables open-vocabulary modeling
Word analogies ( $\vec{king} - \vec{man} + \vec{woman} \approx \vec{queen}$ ) demonstrate that embeddings capture linear semantic structure
LSTM/GRU mitigate the vanishing gradient problem through gating mechanisms; LSTM has 3 gates (forget, input, output), GRU has 2 (update, reset)
Pre-trained embeddings transfer knowledge and reduce training data requirements; fine-tuning them during training often improves performance

8. Practice Exercises

Exercise 1: Build a Vocabulary

# TODO: Build a vocabulary from a corpus of 10K documents
# Support: min frequency threshold, max vocab size, special tokens
# Test: encode and decode a sample sentence

Exercise 2: Train Word2Vec

# TODO: Train Word2Vec on a custom corpus (e.g., Wikipedia dump)
# Evaluate with word similarity benchmarks
# Visualize word vectors using t-SNE

Exercise 3: Sentiment Classifier

# TODO: Build a bidirectional LSTM for sentiment analysis
# Use the IMDB dataset (50K reviews)
# Target: >85% accuracy
# Compare with TF-IDF + Logistic Regression baseline

Exercise 4: Compare Embeddings

# TODO: Compare Word2Vec, GloVe, and FastText on:
# 1. Word similarity (WS-353 dataset)
# 2. Word analogy task
# 3. Text classification accuracy
