NLP Basics: Tokenization, Embeddings


Natural Language Processing bridges human language and machine understanding. This lesson covers text preprocessing, tokenization, word embeddings, and sequence models: the building blocks of modern NLP.


1. The NLP Pipeline

Raw Text → Tokenization → Cleaning → Encoding → Embedding → Model → Output
"I love NLP!" → ["I","love","NLP!"] → ["i","love","nlp"] → [45, 892, 3] → [0.2, -0.5, ...] → classifier → POSITIVE

Each step transforms text into a numerical representation that models can process.
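
The first stages of the diagram above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production tokenizer: the function names and the `toy_vocab` mapping are hypothetical, and the IDs were chosen only to match the numbers shown in the diagram.

```python
import re

def tokenize(text):
    # Whitespace tokenization: the simplest possible scheme
    return text.split()

def clean(tokens):
    # Lowercase each token and strip anything that is not a letter or digit
    cleaned = [re.sub(r"[^a-z0-9]", "", t.lower()) for t in tokens]
    return [t for t in cleaned if t]

def encode(tokens, vocab, unk_id=0):
    # Map each token to its integer ID; unknown tokens share one ID
    return [vocab.get(t, unk_id) for t in tokens]

# Hypothetical vocabulary, chosen to match the IDs in the diagram
toy_vocab = {"<unk>": 0, "nlp": 3, "i": 45, "love": 892}

tokens = tokenize("I love NLP!")
ids = encode(clean(tokens), toy_vocab)
print(tokens)  # ['I', 'love', 'NLP!']
print(ids)     # [45, 892, 3]
```

The embedding and model stages then operate on these integer IDs rather than on strings.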


2. Text Preprocessing

Cleaning Operations

import re
import string

def clean_text(text):
    # Lowercase
    text = text.lower()

    # Remove URLs
    text = re.sub(r'https?://\S+|www\.\S+', '', text)

    # Remove HTML tags
    text = re.sub(r'<.*?>', '', text)

    # Remove email addresses
    text = re.sub(r'\S+@\S+', '', text)

    # Remove numbers (optional — depends on task)
    text = re.sub(r'\d+', '', text)

    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))

    # Remove extra whitespace
    text = re.sub(r'\s+', ' ', text).strip()

    return text

# Example
raw = "Check out https://example.com! Email me at test@email.com. Price: $29.99"
print(clean_text(raw))
# Output: "check out email me at price"

Stopword Removal

import nltk
from nltk.corpus import stopwords

nltk.download('stopwords', quiet=True)  # one-time download of the stopword lists
stop_words = set(stopwords.words('english'))

def remove_stopwords(tokens):
    return [t for t in tokens if t not in stop_words]

tokens = ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]
print(remove_stopwords(tokens))
# Output: ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']

Stemming vs Lemmatization

Word          Stemming (Porter)    Lemmatization
────────────────────────────────────────────────
running       run                  run
ran           ran                  run
better        better               good
studies       studi                study
geese         gees                 goose

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download('wordnet', quiet=True)  # one-time download of the WordNet data

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# The lemmatizer needs the correct part of speech for each word:
# 'v' = verb, 'a' = adjective, 'n' = noun
words = [("running", "v"), ("ran", "v"), ("better", "a"), ("studies", "n"), ("geese", "n")]

for word, pos in words:
    stem = stemmer.stem(word)
    lemma = lemmatizer.lemmatize(word, pos=pos)
    print(f"{word:12s} → Stem: {stem:10s} Lemma: {lemma}")

# running      → Stem: run        Lemma: run
# ran          → Stem: ran        Lemma: run
# better       → Stem: better     Lemma: good  (only when tagged as an adjective)
# studies      → Stem: studi      Lemma: study
# geese        → Stem: gees       Lemma: goose

3. Tokenization Methods

Word-Level Tokenization

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    "The cat sat on the mat",
    "The dog chased the cat",
    "The cat and the dog are friends"
]

# Bag of Words
bow = CountVectorizer()
X_bow = bow.fit_transform(corpus)
print("Vocabulary:", bow.get_feature_names_out())
print("BoW matrix:\n", X_bow.toarray())

# TF-IDF
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(corpus)
print("TF-IDF matrix:\n", X_tfidf.toarray().round(3))

TF-IDF Formula

$$\text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t)$$

Here,

  • $\text{TF}(t, d)$ = Term frequency of term t in document d
  • $\text{IDF}(t)$ = Inverse document frequency of term t

TF and IDF Components

$$\text{TF}(t,d) = \frac{f_{t,d}}{\sum_{t' \in d} f_{t',d}} \qquad \text{IDF}(t) = \log\frac{N}{|\{d : t \in d\}|}$$

Here,

  • $f_{t,d}$ = Frequency of term t in document d
  • $N$ = Total number of documents in the corpus
  • $|\{d : t \in d\}|$ = Number of documents containing term t

TF-IDF Intuition

TF-IDF balances two signals: term frequency (how often a word appears in a document) and inverse document frequency (how rare the word is across all documents). Common words like "the" have high TF but low IDF (they appear everywhere), so they get low TF-IDF scores. Rare but informative words like "neural" have moderate TF but high IDF, giving them high TF-IDF scores.
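
To make the formula concrete, here is a minimal sketch that computes TF-IDF by hand on the three-sentence corpus from the Bag of Words example (lowercased). The helper names `tf`, `idf`, and `tf_idf` are illustrative. The numbers will not match `TfidfVectorizer`, which by default uses raw counts, a smoothed IDF, and L2 normalization of each row.

```python
import math
from collections import Counter

corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the cat and the dog are friends",
]
docs = [doc.split() for doc in corpus]
N = len(docs)

def tf(term, doc):
    # Relative frequency of the term within one document
    return Counter(doc)[term] / len(doc)

def idf(term):
    # log(N / number of documents containing the term);
    # assumes the term occurs in at least one document
    df = sum(1 for doc in docs if term in doc)
    return math.log(N / df)

def tf_idf(term, doc):
    return tf(term, doc) * idf(term)

print(tf_idf("the", docs[0]))  # 0.0, because "the" occurs in every document
print(tf_idf("mat", docs[0]))  # positive, because "mat" occurs in one document only
```

"the" has the highest term frequency in the first document, yet its score is zero because its IDF is log(3/3) = 0.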

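Subword Tokenization (BPE)

Byte Pair Encoding (BPE) starts from single characters and iteratively merges the most frequent pair of adjacent symbols:

Vocabulary: {'l', 'o', 'w', 'e', 'r', 's', 't', 'n', 'i', 'd', 'unk'}

Corpus frequencies:
  'low': 5, 'lower': 2, 'newest': 6, 'widest': 3

Iteration 1: Most frequent pair → 'e' + 's' = 'es' (9 occurrences)
  Vocabulary: {..., 'es'}

Iteration 2: Most frequent pair → 'es' + 't' = 'est' (9 occurrences)
  Vocabulary: {..., 'est'}

Iterations 3-4: 'l' + 'o' = 'lo', then 'lo' + 'w' = 'low' (7 occurrences each)
  Vocabulary: {..., 'lo', 'low'}

Iterations 5-6: 'n' + 'e' = 'ne', then 'ne' + 'w' = 'new' (6 occurrences each, tied with other pairs; tie-breaking depends on the implementation)
  Vocabulary: {..., 'ne', 'new'}

Result after 6 merges: 'newest' → ['new', 'est'] (not character-by-character)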
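
The merge loop itself is short. The following is a minimal sketch of BPE training on the toy corpus above, not the implementation used by any particular library: the function name `learn_bpe` is hypothetical, there is no end-of-word marker, and ties are broken in favor of the pair encountered first.

```python
from collections import Counter

def learn_bpe(word_freqs, num_merges):
    # Each word starts out as a tuple of single characters
    corpus = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pair_counts[(a, b)] += freq
        if not pair_counts:
            break
        # max() returns the first pair with the highest count (first-seen wins ties)
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word with the chosen pair fused into one symbol
        new_corpus = {}
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] = freq
        corpus = new_corpus
    return merges, corpus

word_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, segmented = learn_bpe(word_freqs, num_merges=6)
print(merges[:2])        # [('e', 's'), ('es', 't')]
print(list(segmented))   # segmentation of each word after 6 merges
```

With this tie-breaking rule, six merges are enough to segment 'newest' as ('new', 'est').
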

from transformers import AutoTokenizer

# Note: bert-base-uncased uses WordPiece, a subword scheme closely related to BPE
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

text = "Transfer learning is incredibly powerful!"
tokens = tokenizer.tokenize(text)
print("Tokens:", tokens)
# ['transfer', 'learning', 'is', 'incredibly', 'powerful', '!']

ids = tokenizer.encode(text)  # adds the special tokens [CLS] and [SEP]
print("Token IDs:", ids)
# A list that starts with 101 ([CLS]) and ends with 102 ([SEP]);
# the IDs in between depend on the model's vocabulary file

decoded = tokenizer.decode(ids)
print("Decoded:", decoded)
# [CLS] transfer learning is incredibly powerful! [SEP]

4. Word Embeddings

The Problem with One-Hot Encoding

One-Hot Encoding

$$\text{one-hot}(\text{"cat"}) = [0, 0, 1, 0, \ldots, 0] \in \mathbb{R}^{V}$$

Here,

  • $V$ = Vocabulary size (number of unique words)
  • $[0, 0, 1, 0, \ldots]$ = Vector with 1 at the index of 'cat', 0 elsewhere

Problems:

  • No notion of similarity: $\text{sim}(\text{cat}, \text{dog}) = \text{sim}(\text{cat}, \text{democracy}) = 0$ (all one-hot vectors are mutually orthogonal)
  • Sparse and high-dimensional ($V$ = vocabulary size, often 50K+)
  • No semantic relationships captured

Why Embeddings Are Better

Word embeddings solve these problems by mapping words to dense, low-dimensional vectors (typically 100-300 dimensions) in which similar words lie close together. The embedding matrix $E \in \mathbb{R}^{V \times d}$ is learned during training, and each row $E_i$ is the vector for word $i$. Multiplying the one-hot vector of word $i$ by $E$ is equivalent to looking up the $i$-th row of $E$, which is how embedding layers are implemented in practice.
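
Both claims are easy to check numerically. The sketch below uses NumPy with a small random matrix standing in for a learned embedding table; the sizes and word indices are arbitrary assumptions made for illustration.

```python
import numpy as np

V, d = 5, 3                      # toy vocabulary size and embedding dimension
rng = np.random.default_rng(0)
E = rng.normal(size=(V, d))      # stand-in for a learned embedding matrix

def one_hot(i, size):
    v = np.zeros(size)
    v[i] = 1.0
    return v

cat, dog = one_hot(2, V), one_hot(4, V)

# Distinct one-hot vectors are orthogonal, so their dot product is always 0
print(cat @ dog)                     # 0.0

# Multiplying a one-hot vector by E selects a single row of E
print(np.allclose(cat @ E, E[2]))    # True
```

Because the product only ever selects one row, embedding layers skip the multiplication and index the matrix directly.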

Word2Vec (Mikolov et al., 2013)

Learns dense vectors by predicting context.

Definition: Word2Vec Skip-gram

Given a center word, predict the surrounding context words. The model learns dense vector representations in which words with similar contexts have similar vectors. The key insight is that "you shall know a word by the company it keeps" (Firth, 1957).

Word2Vec Skip-gram Objective

$$\mathcal{L}(\theta) = -\sum_{t=1}^{T} \sum_{-c \leq j \leq c,\, j \neq 0} \log P(w_{t+j} \mid w_t)$$

Here,

  • $T$ = Length of the corpus
  • $c$ = Context window size
  • $w_t$ = Center word at position t
  • $w_{t+j}$ = Context word at position t+j

Skip-gram Probability (Softmax)

$$P(w_O \mid w_I) = \frac{\exp(\mathbf{v}_{w_O}' \cdot \mathbf{v}_{w_I})}{\sum_{w=1}^{V} \exp(\mathbf{v}_w' \cdot \mathbf{v}_{w_I})}$$

Here,

  • $w_I$ = Input (center) word
  • $w_O$ = Output (context) word
  • $\mathbf{v}_{w_I}$ = Input vector for word w_I
  • $\mathbf{v}_{w_O}'$ = Output vector for word w_O
  • $V$ = Vocabulary size

Negative Sampling

Computing the softmax denominator over the entire vocabulary is expensive ($O(V)$ per training pair). Negative sampling replaces it with a binary classification task: distinguish the true context word from a few randomly sampled "negative" words. This reduces the cost to $O(K)$, where $K$ is the number of negatives (typically 5-20).
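
The difference in cost shows up directly in code. The sketch below compares the two losses for a single (center, context) pair using NumPy. It makes several assumptions: random matrices stand in for trained vectors, the function names are illustrative, and negatives are drawn uniformly for simplicity (the original Word2Vec samples them from a smoothed unigram distribution).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def full_softmax_loss(W_in, W_out, center, context):
    # -log P(context | center): the denominator touches all V output vectors
    scores = W_out @ W_in[center]                       # shape (V,)
    log_probs = scores - np.log(np.sum(np.exp(scores)))
    return -log_probs[context]

def negative_sampling_loss(W_in, W_out, center, context, negatives):
    # Touches only 1 + K output vectors
    v = W_in[center]
    pos = np.log(sigmoid(W_out[context] @ v))                # pull the true pair together
    neg = np.sum(np.log(sigmoid(-(W_out[negatives] @ v))))   # push the negatives apart
    return -(pos + neg)

rng = np.random.default_rng(0)
V, d, K = 1000, 50, 5
W_in = rng.normal(scale=0.1, size=(V, d))    # input (center) vectors
W_out = rng.normal(scale=0.1, size=(V, d))   # output (context) vectors

negatives = rng.integers(0, V, size=K)       # uniform sampling: a simplification
print(full_softmax_loss(W_in, W_out, center=3, context=7))
print(negative_sampling_loss(W_in, W_out, center=3, context=7, negatives=negatives))
```

The two numbers are not directly comparable, since they are different objectives; the point is that the second one never reads more than K + 1 rows of `W_out`.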

Key Property: Word Arithmetic

Word embeddings capture semantic relationships that can be expressed through vector arithmetic:

$$\vec{king} - \vec{man} + \vec{woman} \approx \vec{queen}$$

from gensim.models import Word2Vec
import numpy as np

# Training data
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "played", "in", "the", "park"],
    ["cats", "and", "dogs", "are", "friends"],
    ["the", "cat", "chased", "the", "mouse"],
]

# Train Word2Vec
model = Word2Vec(
    sentences,
    vector_size=100,  # embedding dimension
    window=5,         # context window size
    min_count=1,      # minimum word frequency
    sg=1,             # 1=skip-gram, 0=CBOW
    epochs=100,
)

# Get word vector
cat_vector = model.wv['cat']
print(f"Vector shape: {cat_vector.shape}")  # (100,)

# Find similar words
similar = model.wv.most_similar('cat', topn=5)
print("Similar to 'cat':", similar)

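# Word analogy: king - man + woman = ?
# These words are not in the toy corpus above, so the query only runs on a
# model trained on (or loaded from) a large corpus that contains them.
analogy_words = ['king', 'woman', 'man']
if all(w in model.wv.key_to_index for w in analogy_words):
    result = model.wv.most_similar(
        positive=['king', 'woman'],
        negative=['man'],
        topn=1
    )
    print("king - man + woman =", result)
else:
    print("Analogy words missing from vocabulary; train on a larger corpus")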

GloVe (Global Vectors, Pennington et al., 2014)

Combines global co-occurrence statistics with local context windows.

GloVe Objective

$$J = \sum_{i,j=1}^{V} f(X_{ij}) \left( \mathbf{w}_i^T \tilde{\mathbf{w}}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2$$

Here,

  • $X_{ij}$ = Co-occurrence count of words i and j
  • $\mathbf{w}_i$ = Word vector for word i
  • $\tilde{\mathbf{w}}_j$ = Context word vector for word j
  • $b_i, \tilde{b}_j$ = Bias terms
  • $f(x)$ = Weighting function that down-weights rare co-occurrences and caps the weight of frequent ones

The weighting function is:

$$f(x) = \begin{cases} (x/x_{\max})^\alpha & \text{if } x < x_{\max} \\ 1 & \text{otherwise} \end{cases}$$

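
As a quick illustration, the weighting function can be written as a one-line Python helper. The name `glove_weight` is hypothetical; the defaults x_max = 100 and alpha = 0.75 are the values reported in the GloVe paper.

```python
def glove_weight(x, x_max=100.0, alpha=0.75):
    # f(x) = (x / x_max)^alpha below the cutoff, 1 at or above it
    return (x / x_max) ** alpha if x < x_max else 1.0

for count in (1, 10, 100, 1000):
    print(count, glove_weight(count))
```

Rare pairs receive a small weight, and every pair seen at least x_max times receives the same weight of 1, so very frequent pairs cannot dominate the loss.
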
import numpy as np
from gensim.models import KeyedVectors

# Load pre-trained GloVe embeddings
# Download from: https://nlp.stanford.edu/projects/glove/
# glove.6B.100d.txt format: word float float float ...

def load_glove(filepath):
    embeddings = {}
    with open(filepath, 'r', encoding='utf-8') as f:
        for line in f:
            values = line.split()
            word = values[0]
            vector = np.asarray(values[1:], dtype='float32')
            embeddings[word] = vector
    return embeddings

# Usage
# glove_embeddings = load_glove('glove.6B.100d.txt')
# print(glove_embeddings['king'].shape)  # (100,)

# Create embedding matrix for a vocabulary
def create_embedding_matrix(vocab, embeddings_dict, dim=100):
    matrix = np.zeros((len(vocab), dim))
    found = 0
    for word, idx in vocab.items():
        if word in embeddings_dict:
            matrix[idx] = embeddings_dict[word]
            found += 1
    print(f"Found {found}/{len(vocab)} words in embeddings")
    return matrix

Embedding Comparison

Method      Training               Semantics                  Speed     Memory
──────────────────────────────────────────────────────────────────────────────
Word2Vec    Local context          Strong (syntactic)         Fast      Low
GloVe       Global co-occurrence   Strong (semantic)          Medium    Low
FastText    Subword info           Handles OOV                Slow      High
BERT        Contextual             Best (context-dependent)   Slow      High

5. Sequence Models

Recurrent Neural Network (RNN)

RNN Hidden State

$$h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t + b_h)$$

Here,

  • $h_t$ = Hidden state at time t
  • $x_t$ = Input at time t
  • $W_{hh}$ = Recurrent weight matrix
  • $W_{xh}$ = Input weight matrix
  • $b_h$ = Hidden bias

RNN Output

$$y_t = W_{hy} h_t + b_y$$

Here,

  • $y_t$ = Output at time t
  • $h_t$ = Hidden state at time t
  • $W_{hy}$ = Hidden-to-output weight matrix
  • $b_y$ = Output bias

h_0 -> [RNN] -> h_1 -> [RNN] -> h_2 -> [RNN] -> h_3 -> y
         ^               ^               ^
        x_1             x_2             x_3

Problem: Vanishing gradient — struggles with long-range dependencies.
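
The recurrence is compact enough to write out directly. Below is a minimal NumPy sketch of the forward pass for a three-step sequence, matching the diagram; the dimensions and random weights are arbitrary assumptions, and `rnn_forward` is an illustrative name rather than a library function.

```python
import numpy as np

def rnn_forward(xs, h0, W_hh, W_xh, b_h, W_hy, b_y):
    # xs has shape (seq_len, input_dim); the same weights are reused at every step
    h = h0
    for x_t in xs:
        h = np.tanh(W_hh @ h + W_xh @ x_t + b_h)   # h_t from h_{t-1} and x_t
    # Output computed from the final hidden state
    return W_hy @ h + b_y, h

rng = np.random.default_rng(0)
input_dim, hidden_dim, output_dim, seq_len = 4, 8, 2, 3
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
W_xh = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
W_hy = rng.normal(scale=0.1, size=(output_dim, hidden_dim))
b_h, b_y = np.zeros(hidden_dim), np.zeros(output_dim)

xs = rng.normal(size=(seq_len, input_dim))        # x_1, x_2, x_3
y, h_3 = rnn_forward(xs, np.zeros(hidden_dim), W_hh, W_xh, b_h, W_hy, b_y)
print(y.shape, h_3.shape)   # (2,) (8,)
```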

Long Short-Term Memory (LSTM)

Definition: LSTM Gates

LSTM uses three gates (forget, input, output) to control the flow of information, allowing it to learn long-range dependencies by selectively remembering or forgetting information.

LSTM Forget Gate

$$f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)$$

Here,

  • $f_t$ = Forget gate output (0 = forget, 1 = keep)
  • $\sigma$ = Sigmoid activation (outputs between 0 and 1)
  • $[h_{t-1}, x_t]$ = Concatenation of previous hidden state and current input

LSTM Input Gate and Candidate

$$i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)$$
$$\tilde{C}_t = \tanh(W_C [h_{t-1}, x_t] + b_C)$$

Here,

  • $i_t$ = Input gate (what to update)
  • $\tilde{C}_t$ = Candidate cell state (new information)

LSTM Cell State Update

$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$$

Here,

  • $C_t$ = New cell state
  • $f_t \odot C_{t-1}$ = What to keep from the old cell state
  • $i_t \odot \tilde{C}_t$ = What to add as new information

LSTM Output Gate

$$o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)$$
$$h_t = o_t \odot \tanh(C_t)$$

Here,

  • $o_t$ = Output gate (what to output)
  • $h_t$ = Final hidden state

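
Before turning to the PyTorch module, it may help to see one LSTM time step written out equation by equation. This is a minimal NumPy sketch under assumed toy dimensions and random weights; `lstm_step` is an illustrative name, and real implementations fuse the four matrix products for speed.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, C_prev, W_f, W_i, W_C, W_o, b_f, b_i, b_C, b_o):
    z = np.concatenate([h_prev, x_t])        # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)             # forget gate
    i_t = sigmoid(W_i @ z + b_i)             # input gate
    C_tilde = np.tanh(W_C @ z + b_C)         # candidate cell state
    C_t = f_t * C_prev + i_t * C_tilde       # additive cell state update
    o_t = sigmoid(W_o @ z + b_o)             # output gate
    h_t = o_t * np.tanh(C_t)                 # new hidden state
    return h_t, C_t

rng = np.random.default_rng(0)
input_dim, hidden_dim = 4, 3
shape = (hidden_dim, hidden_dim + input_dim)
W_f, W_i, W_C, W_o = (rng.normal(scale=0.1, size=shape) for _ in range(4))
b_f, b_i, b_C, b_o = (np.zeros(hidden_dim) for _ in range(4))

h, C = np.zeros(hidden_dim), np.zeros(hidden_dim)
for x_t in rng.normal(size=(5, input_dim)):  # a sequence of 5 inputs
    h, C = lstm_step(x_t, h, C, W_f, W_i, W_C, W_o, b_f, b_i, b_C, b_o)
print(h.shape, C.shape)   # (3,) (3,)
```
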
import torch
import torch.nn as nn

class LSTMClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_classes):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=2,
                           batch_first=True, dropout=0.3, bidirectional=True)
        self.classifier = nn.Linear(hidden_dim * 2, num_classes)  # *2 for bidirectional

    def forward(self, x):
        embeds = self.embedding(x)              # (batch, seq_len, embed_dim)
        lstm_out, (h_n, c_n) = self.lstm(embeds) # lstm_out: (batch, seq_len, hidden*2)

        # Concatenate final hidden states from both directions
        hidden = torch.cat([h_n[-2], h_n[-1]], dim=1)  # (batch, hidden*2)
        output = self.classifier(hidden)
        return output

# Example
model = LSTMClassifier(vocab_size=10000, embed_dim=128, hidden_dim=256, num_classes=2)
x = torch.randint(0, 10000, (32, 50))  # batch of 32, sequence length 50
print(model(x).shape)  # torch.Size([32, 2])

GRU (Gated Recurrent Unit)

A simplified LSTM with 2 gates instead of 3:

GRU Update and Reset Gates

$$z_t = \sigma(W_z [h_{t-1}, x_t]) \quad \text{(update gate)}$$
$$r_t = \sigma(W_r [h_{t-1}, x_t]) \quad \text{(reset gate)}$$

Here,

  • $z_t$ = Update gate (interpolation between old and new)
  • $r_t$ = Reset gate (how much past to forget)

GRU Hidden State Update

$$\tilde{h}_t = \tanh(W [r_t \odot h_{t-1}, x_t])$$
$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$$

Here,

  • $h_t$ = New hidden state
  • $\tilde{h}_t$ = Candidate hidden state
  • $(1 - z_t) \odot h_{t-1}$ = What to keep from old hidden state
  • $z_t \odot \tilde{h}_t$ = What to add as new information

LSTM vs. GRU

GRU combines the forget and input gates into a single update gate $z_t$, and merges the cell state and hidden state. This results in fewer parameters (2 gates vs. 3) and faster training. GRU performs comparably to LSTM on many tasks, especially when data is limited. Use LSTM when you need fine-grained control over information flow; use GRU for simpler, faster models.
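
For comparison with the LSTM, here is a minimal NumPy sketch of one GRU step that follows the equations above (biases omitted, as in the formulas). The dimensions and weights are arbitrary assumptions, and `gru_step` is an illustrative name. The last lines count gate-matrix parameters to show where the savings come from.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, W_z, W_r, W):
    concat = np.concatenate([h_prev, x_t])                       # [h_{t-1}, x_t]
    z_t = sigmoid(W_z @ concat)                                  # update gate
    r_t = sigmoid(W_r @ concat)                                  # reset gate
    h_tilde = np.tanh(W @ np.concatenate([r_t * h_prev, x_t]))   # candidate state
    return (1 - z_t) * h_prev + z_t * h_tilde                    # interpolate old and new

rng = np.random.default_rng(0)
input_dim, hidden_dim = 4, 3
shape = (hidden_dim, hidden_dim + input_dim)
W_z, W_r, W = (rng.normal(scale=0.1, size=shape) for _ in range(3))

h = np.zeros(hidden_dim)
for x_t in rng.normal(size=(5, input_dim)):
    h = gru_step(x_t, h, W_z, W_r, W)
print(h.shape)   # (3,)

# Weight matrices only (biases omitted): GRU has 3, LSTM has 4
gru_params = 3 * hidden_dim * (hidden_dim + input_dim)
lstm_params = 4 * hidden_dim * (hidden_dim + input_dim)
print(gru_params, lstm_params)   # 63 84
```

Counting the candidate-state matrix, a GRU layer has three weight matrices where an LSTM has four, so it needs roughly three quarters of the parameters at the same hidden size.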


Theorem: Vanishing Gradient Problem in RNNs

In vanilla RNNs, backpropagation through time multiplies the gradient by the recurrent weight matrix $W_{hh}$ (and by the derivative of tanh, which is at most 1) at each time step. After $T$ steps, the gradient norm is therefore bounded by roughly $\|W_{hh}\|^T$. If $\|W_{hh}\| < 1$, gradients vanish exponentially and long-range dependencies are lost. If $\|W_{hh}\| > 1$, gradients can explode. LSTMs mitigate this by maintaining a cell state with additive updates (not purely multiplicative ones), so gradients can flow through the cell state nearly unchanged whenever the forget gate stays close to 1.
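
The exponential scaling can be observed numerically. The sketch below is a simplified linear model of the effect, not a full backpropagation: it ignores the tanh derivative and builds $W_{hh}$ as a scaled orthogonal matrix so that its spectral norm equals a chosen value exactly. The function name and sizes are assumptions for illustration.

```python
import numpy as np

def jacobian_product_norm(scale, steps, hidden_dim=16):
    # W_hh = scale * Q with Q orthogonal, so the spectral norm of W_hh is exactly `scale`
    rng = np.random.default_rng(0)
    Q, _ = np.linalg.qr(rng.normal(size=(hidden_dim, hidden_dim)))
    W_hh = scale * Q
    # Ignoring the tanh derivative, the gradient is multiplied by W_hh once per step
    J = np.linalg.matrix_power(W_hh, steps)
    return np.linalg.norm(J, 2)

for scale in (0.9, 1.0, 1.1):
    print(scale, jacobian_product_norm(scale, steps=50))
```

Over 50 steps, a norm of 0.9 shrinks the gradient by a factor of about 0.9^50 (roughly 0.005), while a norm of 1.1 grows it by about 1.1^50 (roughly 117).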

Sentiment Analysis Pipeline

Task: Classify movie reviews as positive/negative.

Pipeline:

  1. Preprocessing: Lowercase, remove punctuation, tokenize
  2. Vocabulary: Build word-to-index mapping (top 25K words)
  3. Embedding: Load pre-trained GloVe vectors (300d) or train from scratch
  4. Model: Bidirectional LSTM with 2 layers, hidden dim=256
  5. Classification: Use the final hidden states $[h_{\text{forward}}; h_{\text{backward}}]$ as features for a linear classifier

Expected Performance (rough figures; actual results depend on the dataset and tuning):

  • TF-IDF + Logistic Regression: ~85% accuracy
  • LSTM (random init): ~87% accuracy
  • LSTM + GloVe: ~89% accuracy
  • LSTM + fine-tuned GloVe: ~90% accuracy
  • BERT fine-tuned: ~93% accuracy

The improvement from GloVe shows the value of pre-trained embeddings. The jump to BERT shows the power of contextual representations.

6. Complete Example: Sentiment Classification

import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from collections import Counter

class Vocabulary:
    def __init__(self, max_size=25000):
        self.word2idx = {'<PAD>': 0, '<UNK>': 1}
        self.idx2word = {0: '<PAD>', 1: '<UNK>'}
        self.max_size = max_size

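    def build(self, texts):
        counter = Counter()
        for text in texts:
            # Lowercase here as well, so the vocabulary matches encode()
            counter.update(text.lower().split())
        # Two slots are reserved for <PAD> and <UNK>
        for word, _ in counter.most_common(self.max_size - 2):
            idx = len(self.word2idx)
            self.word2idx[word] = idx
            self.idx2word[idx] = word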

    def encode(self, text, max_len=100):
        tokens = text.lower().split()
        ids = [self.word2idx.get(w, 1) for w in tokens]
        # Pad or truncate
        if len(ids) < max_len:
            ids += [0] * (max_len - len(ids))
        else:
            ids = ids[:max_len]
        return ids

class SentimentDataset(Dataset):
    def __init__(self, texts, labels, vocab, max_len=100):
        self.texts = texts
        self.labels = labels
        self.vocab = vocab
        self.max_len = max_len

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        ids = self.vocab.encode(self.texts[idx], self.max_len)
        return torch.tensor(ids), torch.tensor(self.labels[idx])

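# Training
# Assumes train_texts (list of strings) and train_labels (list of 0/1 ints)
# are already loaded, and LSTMClassifier is defined as in Section 5.
vocab = Vocabulary(max_size=25000)
vocab.build(train_texts)

train_dataset = SentimentDataset(train_texts, train_labels, vocab, max_len=100)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)  # batch size is an arbitrary choice

# Size the embedding table from the vocabulary actually built
model = LSTMClassifier(vocab_size=len(vocab.word2idx), embed_dim=128, hidden_dim=256, num_classes=2)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(10):
    model.train()
    total_loss = 0.0
    for batch_ids, batch_labels in train_loader:
        optimizer.zero_grad()
        outputs = model(batch_ids)               # (batch, num_classes) logits
        loss = criterion(outputs, batch_labels)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    print(f"Epoch {epoch + 1}: mean loss {total_loss / len(train_loader):.4f}")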

7. Key Takeaways

Summary: NLP Basics

  • Preprocessing (cleaning, normalization) is task-dependent — modern models like BERT handle raw text well, but TF-IDF models still benefit from preprocessing
  • TF-IDF captures word importance by balancing term frequency and inverse document frequency; useful as a strong baseline
  • Word2Vec learns embeddings by predicting context (skip-gram) or predicting words from context (CBOW); captures syntactic and semantic relationships
  • GloVe combines global co-occurrence statistics with local context, often capturing better semantic relationships
  • Subword tokenization (BPE) handles rare words and morphological variation; enables open-vocabulary modeling
  • Word analogies ($\vec{king} - \vec{man} + \vec{woman} \approx \vec{queen}$) demonstrate that embeddings capture linear semantic structure
  • LSTM/GRU mitigate the vanishing gradient problem through gating mechanisms; LSTM has 3 gates (forget, input, output), GRU has 2 (update, reset)
  • Pre-trained embeddings transfer knowledge and reduce training data requirements; fine-tuning them during training often improves performance

8. Practice Exercises

Exercise 1: Build a Vocabulary

# TODO: Build a vocabulary from a corpus of 10K documents
# Support: min frequency threshold, max vocab size, special tokens
# Test: encode and decode a sample sentence

Exercise 2: Train Word2Vec

# TODO: Train Word2Vec on a custom corpus (e.g., Wikipedia dump)
# Evaluate with word similarity benchmarks
# Visualize word vectors using t-SNE

Exercise 3: Sentiment Classifier

# TODO: Build a bidirectional LSTM for sentiment analysis
# Use the IMDB dataset (50K reviews)
# Target: >85% accuracy
# Compare with TF-IDF + Logistic Regression baseline

Exercise 4: Compare Embeddings

# TODO: Compare Word2Vec, GloVe, and FastText on:
# 1. Word similarity (WS-353 dataset)
# 2. Word analogy task
# 3. Text classification accuracy
