# NLP Basics: Tokenization, Embeddings and Word Vectors

Natural Language Processing (NLP) bridges human language and computational understanding. This lesson covers the foundational building blocks — from raw text to dense vector representations that capture semantic meaning.

## 1. The NLP Pipeline

Most NLP systems follow a sequential pipeline that transforms raw text into machine-readable representations.

*Figure: NLP Pipeline. Each stage reduces ambiguity and enriches the representation.*

Tokenization → Stemming/Lemmatization → Vectorization → Modeling → Inference

Example: `"The cat sat"` → `["the","cat","sat"]` → `["cat","sat"]` → `[0.2, 0.8, ...]` → `P(cat|context)` → `"noun phrase"`

**Key stages:**
1. **Raw Text** — Unprocessed input (documents, sentences, tweets)
2. **Tokenization** — Splitting text into atomic units (tokens)
3. **Normalization** — Lowercasing, removing noise, stemming/lemmatization
4. **Feature Extraction** — Converting tokens to numerical vectors
5. **Modeling** — Applying statistical or neural models
6. **Output** — Classification, generation, translation, etc.
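
The stages above can be sketched end to end in a few lines of plain Python. This is a minimal illustration rather than a production pipeline: the three-word stopword set and the helper names (`tokenize`, `normalize`, `vectorize`) are assumptions made for this example, and stemming/lemmatization is left out.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list (illustration only)
STOPWORDS = {"the", "a", "on"}

def tokenize(text):
    """Stage 2: split raw text into lowercase word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def normalize(tokens):
    """Stage 3: drop stopwords (stemming/lemmatization omitted here)."""
    return [t for t in tokens if t not in STOPWORDS]

def vectorize(tokens, vocab):
    """Stage 4: map tokens to a count vector over a fixed vocabulary."""
    counts = Counter(tokens)
    return [counts[w] for w in vocab]

raw = "The cat sat on the mat"
tokens = normalize(tokenize(raw))
vocab = sorted(set(tokens))
vector = vectorize(tokens, vocab)

print(tokens)  # ['cat', 'sat', 'mat']
print(vocab)   # ['cat', 'mat', 'sat']
print(vector)  # [1, 1, 1]
```

The resulting vector is what a model in stage 5 would consume; the later sections replace each toy stage with a more realistic one.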

---

## 2. Text Preprocessing

Preprocessing cleans and normalizes text to reduce vocabulary size and noise.

### 2.1 Lowercasing and Noise Removal

```python
import re

def preprocess(text):
    text = text.lower()
    text = re.sub(r'<.*?>', '', text)        # Remove HTML tags
    text = re.sub(r'http\S+', '', text)      # Remove URLs
    text = re.sub(r'[^a-zA-Z\s]', '', text)  # Keep only letters
    text = re.sub(r'\s+', ' ', text).strip()  # Normalize whitespace
    return text

sample = "  <p>The Cat sat on a MAT!!! Visit http://example.com  </p> "
print(preprocess(sample))
# Output: "the cat sat on a mat visit"  (the URL is removed, but the word "visit" stays)
```

### 2.2 Stopword Removal

Stopwords are high-frequency, low-information words (`the`, `is`, `at`, `which`).

```python
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

tokens = ["the", "cat", "sat", "on", "a", "mat"]
filtered = [w for w in tokens if w not in stop_words]
# ['cat', 'sat', 'mat']
```

### 2.3 Stemming vs Lemmatization

| Method | Approach | Example | Pros | Cons |
|--------|----------|---------|------|------|
| **Stemming** | Rule-based suffix stripping | "running" → "run", "studies" → "studi" | Fast, no lookup | Can over-stem or under-stem |
| **Lemmatization** | Dictionary + morphological analysis | "better" → "good", "ran" → "run" | Linguistically accurate | Slower, requires POS tags |

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

words = ["running", "studies", "better", "geese"]

# Stemming
print([stemmer.stem(w) for w in words])
# ['run', 'studi', 'better', 'gees']

# Lemmatization needs the right POS tag for each word
tagged = [("running", "v"), ("studies", "n"), ("better", "a"), ("geese", "n")]
print([lemmatizer.lemmatize(w, pos=p) for w, p in tagged])
# ['run', 'study', 'good', 'goose']
```

**When to use which:**
- **Stemming**: Information retrieval, search engines, when speed matters
- **Lemmatization**: Text analysis, chatbots, when semantic accuracy matters

---

## 3. Tokenization Strategies

Tokenization determines how text is segmented into processable units.

<svg viewBox="0 0 900 350" xmlns="http://www.w3.org/2000/svg" style={{width: '100%', maxWidth: '900px', margin: '2rem auto', display: 'block'}}>
  <defs>
    <linearGradient id="wordGrad" x1="0%" y1="0%" x2="0%" y2="100%">
      <stop offset="0%" style={{stopColor: '#43e97b'}} />
      <stop offset="100%" style={{stopColor: '#38f9d7'}} />
    </linearGradient>
    <linearGradient id="subGrad" x1="0%" y1="0%" x2="0%" y2="100%">
      <stop offset="0%" style={{stopColor: '#667eea'}} />
      <stop offset="100%" style={{stopColor: '#764ba2'}} />
    </linearGradient>
    <linearGradient id="charGrad" x1="0%" y1="0%" x2="0%" y2="100%">
      <stop offset="0%" style={{stopColor: '#f093fb'}} />
      <stop offset="100%" style={{stopColor: '#f5576c'}} />
    </linearGradient>
  </defs>

  {/* Input text */}
  <rect x={300} y={10} width={300} height={40} rx={8} fill="#1a1a2e" />
  <text x={450} y={36} textAnchor="middle" fontSize="14" fontWeight="600" fill="white">Input: "unhappiness"</text>

  {/* Branches */}
  <line x1={450} y1={50} x2={150} y2={100} stroke="#ccc" strokeWidth="2" />
  <line x1={450} y1={50} x2={450} y2={100} stroke="#ccc" strokeWidth="2" />
  <line x1={450} y1={50} x2={750} y2={100} stroke="#ccc" strokeWidth="2" />

  {/* Word-level */}
  <rect x={30} y={100} width={240} height={220} rx={12} fill="url(#wordGrad)" opacity="0.15" stroke="#43e97b" />
  <rect x={50} y={110} width={200} height={35} rx={8} fill="url(#wordGrad)" />
  <text x={150} y={133} textAnchor="middle" fontSize="14" fontWeight="700" fill="#1a1a2e">Word-Level</text>
  <text x={50} y={165} fontSize="12" fill="#333">Tokens:</text>
  <rect x={50} y={175} width={200} height={30} rx={6} fill="white" stroke="#43e97b" />
  <text x={150} y={195} textAnchor="middle" fontSize="12" fill="#333">["unhappiness"]</text>
  <text x={50} y={225} fontSize="11" fill="#555">Vocab size: ~1M</text>
  <text x={50} y={245} fontSize="11" fill="#555">Handles: known words</text>
  <text x={50} y={265} fontSize="11" fill="#555">Fails: OOV words</text>
  <text x={50} y={290} fontSize="11" fill="#555">Example: NLTK, spaCy</text>

  {/* Subword-level */}
  <rect x={330} y={100} width={240} height={220} rx={12} fill="url(#subGrad)" opacity="0.15" stroke="#667eea" />
  <rect x={350} y={110} width={200} height={35} rx={8} fill="url(#subGrad)" />
  <text x={450} y={133} textAnchor="middle" fontSize="14" fontWeight="700" fill="white">Subword-Level</text>
  <text x={350} y={165} fontSize="12" fill="#333">Tokens:</text>
  <rect x={350} y={175} width={200} height={30} rx={6} fill="white" stroke="#667eea" />
  <text x={450} y={195} textAnchor="middle" fontSize="12" fill="#333">["un", "happi", "ness"]</text>
  <text x={350} y={225} fontSize="11" fill="#555">Vocab size: ~30K-50K</text>
  <text x={350} y={245} fontSize="11" fill="#555">Handles: rare + common</text>
  <text x={350} y={265} fontSize="11" fill="#555">Best trade-off</text>
  <text x={350} y={290} fontSize="11" fill="#555">Example: BPE, WordPiece</text>

  {/* Character-level */}
  <rect x={630} y={100} width={240} height={220} rx={12} fill="url(#charGrad)" opacity="0.15" stroke="#f5576c" />
  <rect x={650} y={110} width={200} height={35} rx={8} fill="url(#charGrad)" />
  <text x={750} y={133} textAnchor="middle" fontSize="14" fontWeight="700" fill="white">Character-Level</text>
  <text x={650} y={165} fontSize="12" fill="#333">Tokens:</text>
  <rect x={650} y={175} width={200} height={30} rx={6} fill="white" stroke="#f5576c" />
  <text x={750} y={195} textAnchor="middle" fontSize="12" fill="#333">["u","n","h","a","p","p", …]</text>
  <text x={650} y={225} fontSize="11" fill="#555">Vocab size: ~26-256</text>
  <text x={650} y={245} fontSize="11" fill="#555">Handles: any text</text>
  <text x={650} y={265} fontSize="11" fill="#555">Long sequences</text>
  <text x={650} y={290} fontSize="11" fill="#555">Example: CharCNN, ByT5</text>
</svg>

### 3.1 Word-Level Tokenization

```python
# Simple whitespace + punctuation tokenizer
import re

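def word_tokenize(text):
    # \w+ alone would split "don't" into "don" and "t";
    # the optional group keeps word-internal apostrophes
    return re.findall(r"\w+(?:'\w+)*", text.lower())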

text = "Don't tokenize — it's easier!"
print(word_tokenize(text))
# ["don't", "tokenize", "it's", "easier"]
```

### 3.2 Byte-Pair Encoding (BPE)

BPE iteratively merges the most frequent character pairs, building a subword vocabulary.

**Algorithm:**
1. Start with character-level vocabulary
2. Count all adjacent symbol pairs
3. Merge the most frequent pair into a new symbol
4. Repeat until desired vocabulary size is reached

```python
# Simplified BPE training
corpus = ["low", "low", "low", "lowest", "newer", "wider"]

# Initial vocab: all unique characters
vocab = set(''.join(corpus))  # {'l','o','w','e','s','t','n','r','i','d'}

# Iteration 1: ('l','o') and ('o','w') tie at 4 occurrences; merge ('l','o') → 'lo'
# Iteration 2: ('lo','w') → 'low' (high frequency)
# ... continues until vocab size limit
```
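
The four algorithm steps can be made runnable as follows. This is a minimal sketch under simplifying assumptions: words are split into characters with no end-of-word marker, ties are broken by first occurrence, and the function names (`get_pair_counts`, `merge_pair`, `train_bpe`) are hypothetical helpers for this example, not a library API.

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with the merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def train_bpe(corpus, num_merges):
    # Each word starts as a tuple of single characters
    words = Counter(tuple(w) for w in corpus)
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(words)
        if not pairs:
            break
        # max() returns the first pair with the highest count
        best = max(pairs, key=pairs.get)
        words = merge_pair(words, best)
        merges.append(best)
    return merges, words

corpus = ["low", "low", "low", "lowest", "newer", "wider"]
merges, words = train_bpe(corpus, num_merges=2)
print(merges)  # [('l', 'o'), ('lo', 'w')]
```

The learned merge list, applied in order, is what a BPE tokenizer replays at inference time to segment unseen words.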

### 3.3 WordPiece Tokenization

Used by BERT. Similar to BPE but merges pairs that maximize likelihood of the training data rather than pure frequency.



<MathBlock tex={`\\text{score}(x, y) = \\frac{\\text{freq}(xy)}{\\text{freq}(x) \\times \\text{freq}(y)}`} display={true} />



```
Illustrative WordPiece output ("##" marks a piece that continues a word):
"unhappiness" → ["un", "##happi", "##ness"]
"tokenization" → ["token", "##ization"]
```
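
A worked instance of the score formula may help show how it differs from plain frequency. The counts below are made up for illustration, and `wordpiece_score` is a hypothetical helper name.

```python
def wordpiece_score(pair_freq, left_freq, right_freq):
    """score(x, y) = freq(xy) / (freq(x) * freq(y))"""
    return pair_freq / (left_freq * right_freq)

# Hypothetical counts, chosen only to illustrate the trade-off.
# ("t", "h"): a very frequent pair made of very frequent symbols
common = wordpiece_score(pair_freq=500, left_freq=2000, right_freq=1500)
# ("q", "u"): a rarer pair whose parts almost always occur together
bound = wordpiece_score(pair_freq=90, left_freq=100, right_freq=120)

print(f"{common:.6f}")  # 0.000167
print(f"{bound:.6f}")   # 0.007500
print(bound > common)   # True
```

With these counts, pure frequency (BPE) would merge the first pair, while the score above prefers the second because its parts rarely appear apart.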

### 3.4 SentencePiece

Language-agnostic tokenization that treats input as raw Unicode, handling languages without whitespace.

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file='model.model')
tokens = sp.encode("unhappiness", out_type=str)
# e.g. ['▁un', 'happi', 'ness']; '▁' marks a word start, and the exact pieces depend on the trained model
```

### Tokenizer Comparison

| Method | Vocab Size | OOV Handling | Speed | Used By |
|--------|-----------|--------------|-------|---------|
| Word | 100K-1M | Poor | Fast | spaCy, NLTK |
| BPE | 30K-100K | Good | Medium | GPT-2/3/4 |
| WordPiece | 30K | Good | Medium | BERT, DistilBERT |
| SentencePiece | 32K-64K | Good | Medium | T5, LLaMA, mBART |

---

## 4. Bag of Words and TF-IDF

### 4.1 Bag of Words (BoW)

BoW represents documents as fixed-length vectors of word counts, ignoring order.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog"
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())
# ['cat', 'chased', 'dog', 'log', 'mat', 'on', 'sat', 'the']

print(X.toarray())
# [[1, 0, 0, 0, 1, 1, 1, 2],
#  [0, 0, 1, 1, 0, 1, 1, 2],
#  [1, 1, 1, 0, 0, 0, 0, 2]]
```

**Limitations:**
- Loses word order: "dog bites man" = "man bites dog"
- High dimensionality: vocabulary-sized sparse vectors
- No semantic information: "good" and "excellent" are unrelated
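
The first limitation is easy to check directly. A minimal sketch using only the standard library (`bag_of_words` is a hypothetical helper name):

```python
from collections import Counter

def bag_of_words(text):
    """Count tokens; the position of each token is discarded."""
    return Counter(text.lower().split())

a = bag_of_words("dog bites man")
b = bag_of_words("man bites dog")
print(a == b)  # True
```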

### 4.2 TF-IDF (Term Frequency–Inverse Document Frequency)

TF-IDF weights words by their importance within a document relative to the corpus.

**TF-IDF calculation flow**

- **TF(t, d)**: frequency of term t in document d, `TF(t, d) = f(t, d) / |d|`. "cat" appears 2 times in a 10-word document → TF = 0.2.
- **IDF(t)**: rareness across the corpus, `IDF(t) = log(N / df(t))`. "the" appears in 100 of 100 documents → IDF = 0 (no discriminating power).
- **TF-IDF**: combined importance, `TF × IDF`. High TF together with high IDF marks an important term.

Example: "cat" (TF = 0.2, IDF = 1.5) → TF-IDF = 0.30; "the" (TF = 0.3, IDF = 0.0) → TF-IDF = 0.00.



<MathBlock tex={`\\text{TF-IDF}(t, d, D) = \\underbrace{\\frac{f_{t,d}}{\\sum_{t' \\in d} f_{t',d}}}_{\\text{TF}} \\times \\underbrace{\\log \\frac{\\lvert D \\rvert}{\\lvert \\{d \\in D : t \\in d\\} \\rvert}}_{\\text{IDF}}`} display={true} />
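
The formula can be evaluated by hand on the three-sentence corpus from Section 4.1. This sketch assumes the natural logarithm and whitespace tokenization; the helper names `tf`, `idf`, and `tf_idf` are introduced here for illustration.

```python
import math

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]
tokenized = [d.split() for d in docs]

def tf(term, doc_tokens):
    """Relative frequency of `term` in one document."""
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, all_docs):
    """log(N / df); assumes the term occurs in at least one document."""
    df = sum(1 for d in all_docs if term in d)
    return math.log(len(all_docs) / df)

def tf_idf(term, doc_tokens, all_docs):
    return tf(term, doc_tokens) * idf(term, all_docs)

print(round(tf_idf("the", tokenized[0], tokenized), 4))  # 0.0
print(round(tf_idf("mat", tokenized[0], tokenized), 4))  # 0.1831
print(round(tf_idf("cat", tokenized[0], tokenized), 4))  # 0.0676
```

"the" occurs in every document, so its IDF and therefore its TF-IDF is exactly zero, while "mat" (one document) outweighs "cat" (two documents). Library implementations often use smoothed variants of this formula, so their numbers will differ.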



```python
from sklearn.feature_extraction.text import TfidfVectorizer

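vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

# "mat" outweighs "cat" in doc 0: it occurs in one document, "cat" in two
# "the" gets the lowest IDF but not zero: scikit-learn's default IDF is smoothed,
# ln((1 + N) / (1 + df)) + 1, and each row is then L2-normalized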
```

---

## 5. Word Embeddings

Word embeddings map words to dense, low-dimensional vectors where geometric relationships encode semantic similarity.

### 5.1 Why Not One-Hot Encoding?

One-hot vectors are orthogonal — no notion of similarity:



<MathBlock tex={`\\text{king} = [0, 0, \\ldots, 1, \\ldots, 0] \\quad \\text{queen} = [0, 0, \\ldots, 0, \\ldots, 1]`} display={true} />





<MathBlock tex={`\\text{sim}(\\text{king}, \\text{queen}) = \\text{sim}(\\text{king}, \\text{toaster}) = 0`} display={true} />



Dense embeddings solve this by learning a continuous vector space.
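
The contrast can be checked numerically. In this sketch the one-hot vectors come from an identity matrix, and the dense vectors are hand-picked 3-dimensional values invented for illustration (they are not trained embeddings).

```python
import numpy as np

vocab = ["king", "queen", "toaster"]
one_hot = np.eye(len(vocab))  # row i is the one-hot vector for vocab[i]

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

king, queen, toaster = one_hot
print(cosine(king, queen))    # 0.0
print(cosine(king, toaster))  # 0.0

# Hypothetical dense vectors, hand-picked for illustration (not trained)
dense = {
    "king": np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.8, 0.9, 0.1]),
    "toaster": np.array([0.1, 0.0, 0.9]),
}
closer = cosine(dense["king"], dense["queen"]) > cosine(dense["king"], dense["toaster"])
print(closer)  # True
```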

### 5.2 Word2Vec (Mikolov et al., 2013)

Word2Vec learns embeddings by predicting context from words (or words from context).

<svg viewBox="0 0 900 400" xmlns="http://www.w3.org/2000/svg" style={{width: '100%', maxWidth: '900px', margin: '2rem auto', display: 'block'}}>
  <defs>
    <linearGradient id="cbowGrad" x1="0%" y1="0%" x2="0%" y2="100%">
      <stop offset="0%" style={{stopColor: '#667eea'}} />
      <stop offset="100%" style={{stopColor: '#764ba2'}} />
    </linearGradient>
    <linearGradient id="sgGrad" x1="0%" y1="0%" x2="0%" y2="100%">
      <stop offset="0%" style={{stopColor: '#f093fb'}} />
      <stop offset="100%" style={{stopColor: '#f5576c'}} />
    </linearGradient>
  </defs>

  {/* Title */}
  <text x={450} y={30} textAnchor="middle" fontSize="18" fontWeight="700" fill="#1a1a2e">Word2Vec Architectures</text>

  {/* CBOW Side */}
  <text x={225} y={65} textAnchor="middle" fontSize="15" fontWeight="700" fill="#667eea">CBOW</text>
  <text x={225} y={85} textAnchor="middle" fontSize="12" fill="#666">Predict center word from context</text>

  {/* CBOW inputs */}
  {['w(t-2)', 'w(t-1)', 'w(t+1)', 'w(t+2)'].map((w, i) => (
    <g key={`cbow-in-${i}`}>
      <rect x={i < 2 ? 50 + i * 80 : 195 + (i-2) * 80} y={100 + Math.abs(i - 1.5) * 15} width={70} height={30} rx={6} fill="#e8eaf6" stroke="#667eea" strokeWidth="1.5" />
      <text x={i < 2 ? 85 + i * 80 : 230 + (i-2) * 80} y={118 + Math.abs(i - 1.5) * 15} textAnchor="middle" fill="#667eea" fontSize="10">{w}</text>
    </g>
  ))}

  {/* Arrows to hidden */}
  {[50, 130, 195, 275].map((x, i) => (
    <line key={`cbow-arr-${i}`} x1={x + 35} y1={130 + Math.abs(i - 1.5) * 15} x2={225} y2={185} stroke="#667eea" strokeWidth="1.5" markerEnd="url(#arrowCBOW)" />
  ))}

  {/* Hidden layer */}
  <rect x={165} y={185} width={120} height={35} rx={10} fill="url(#cbowGrad)" />
  <text x={225} y={207} textAnchor="middle" fontSize="12" fontWeight="600" fill="white">Projection</text>

  {/* Arrow to output */}
  <line x1={225} y1={220} x2={225} y2={260} stroke="#667eea" strokeWidth="2" />

  {/* Output */}
  <rect x={185} y={260} width={80} height={35} rx={10} fill="#667eea" />
  <text x={225} y={282} textAnchor="middle" fontSize="13" fontWeight="600" fill="white">w(t)</text>

  <rect x={145} y={310} width={160} height={30} rx={8} fill="#f0f4ff" />
  <text x={225} y={330} textAnchor="middle" fontSize="11" fill="#555">Loss: -log P(w(t)|context)</text>

  {/* Skip-gram Side */}
  <text x={675} y={65} textAnchor="middle" fontSize="15" fontWeight="700" fill="#f5576c">Skip-gram</text>
  <text x={675} y={85} textAnchor="middle" fontSize="12" fill="#666">Predict context words from center</text>

  {/* Input */}
  <rect x={635} y={100} width={80} height={35} rx={10} fill="#fce4ec" stroke="#f5576c" strokeWidth="1.5" />
  <text x={675} y={122} textAnchor="middle" fontSize="13" fontWeight="600" fill="#c62828">w(t)</text>

  {/* Arrow to hidden */}
  <line x1={675} y1={135} x2={675} y2={185} stroke="#f5576c" strokeWidth="2" />

  {/* Hidden layer */}
  <rect x={615} y={185} width={120} height={35} rx={10} fill="url(#sgGrad)" />
  <text x={675} y={207} textAnchor="middle" fontSize="12" fontWeight="600" fill="white">Projection</text>

  {/* Arrows from hidden */}
  {[615, 695].map((x, i) => (
    <g key={`sg-arr-${i}`}>
      <line x1={675} y1={220} x2={x + 35} y2={260} stroke="#f5576c" strokeWidth="1.5" />
      <rect x={x} y={260} width={70} height={30} rx={6} fill="#fce4ec" stroke="#f5576c" strokeWidth="1.5" />
      <text x={x + 35} y={278} textAnchor="middle" fill="#f5576c" fontSize="10"></text>
    </g>
  ))}

  <rect x={595} y={310} width={160} height={30} rx={8} fill="#fff0f5" />
  <text x={675} y={330} textAnchor="middle" fontSize="11" fill="#555">Loss: -Σ log P(w(context)|w(t))</text>

  {/* Separator */}
  <line x1={450} y1={55} x2={450} y2={360} stroke="#e0e0e0" strokeWidth="1" strokeDasharray="5 5" />
</svg>

#### CBOW (Continuous Bag of Words)

Predicts the center word given surrounding context words.



<MathBlock tex={`P(w_t \\mid w_{t-c}, \\ldots, w_{t-1}, w_{t+1}, \\ldots, w_{t+c}) = \\text{softmax}(W' \\cdot \\bar{v})`} display={true} />



where <MathBlock tex={`\\bar{v} = \\frac{1}{2c} \\sum_{j \\in [-c, c], j \\neq 0} W \\cdot x_{t+j}`} /> is the averaged context embedding.

```python
import torch
import torch.nn as nn

class CBOW(nn.Module):
    def __init__(self, vocab_size, embed_dim):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embed_dim)
        self.output = nn.Linear(embed_dim, vocab_size)

    def forward(self, context_ids):
        # context_ids: (batch, 2*context_size)
        embeds = self.embeddings(context_ids)       # (batch, 2c, dim)
        hidden = embeds.mean(dim=1)                  # (batch, dim)
        logits = self.output(hidden)                 # (batch, vocab_size)
        return logits

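# Training loop
from torch.utils.data import DataLoader, TensorDataset

model = CBOW(vocab_size=10000, embed_dim=300)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()

# Random placeholder data so the loop runs end to end;
# replace with real (context, target) id pairs from a corpus
contexts = torch.randint(0, 10000, (1024, 4))   # (N, 2*context_size)
targets = torch.randint(0, 10000, (1024,))      # (N,)
dataloader = DataLoader(TensorDataset(contexts, targets), batch_size=64, shuffle=True)
num_epochs = 2

for epoch in range(num_epochs):
    for context, target in dataloader:
        optimizer.zero_grad()
        logits = model(context)
        loss = criterion(logits, target)
        loss.backward()
        optimizer.step()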
```

#### Skip-gram

Predicts context words given the center word — the reverse of CBOW.



<MathBlock tex={`P(w_{t+j} \\mid w_t) = \\frac{\\exp(v'_{w_{t+j}} \\cdot v_{w_t})}{\\sum_{w=1}^{V} \\exp(v'_w \\cdot v_{w_t})}`} display={true} />



**Negative Sampling** (approximation):



<MathBlock tex={`\\log \\sigma(v'_{w_O} \\cdot v_{w_I}) + \\sum_{i=1}^{k} \\mathbb{E}_{w_i \\sim P_n(w)}[\\log \\sigma(-v'_{w_i} \\cdot v_{w_I})]`} display={true} />



```python
class SkipGram(nn.Module):
    def __init__(self, vocab_size, embed_dim):
        super().__init__()
        self.center_embed = nn.Embedding(vocab_size, embed_dim)
        self.context_embed = nn.Embedding(vocab_size, embed_dim)

    def forward(self, center, context, negatives):
        c = self.center_embed(center)           # (batch, dim)
        p = self.context_embed(context)          # (batch, dim)
        n = self.context_embed(negatives)        # (batch, k, dim)

        pos_score = torch.sum(c * p, dim=1)     # (batch,)
        # squeeze(2) keeps the batch dimension even when the batch size is 1
        neg_score = torch.bmm(n, c.unsqueeze(2)).squeeze(2)  # (batch, k)

        # logsigmoid is more stable numerically than log(sigmoid(x) + eps)
        pos_loss = -nn.functional.logsigmoid(pos_score)
        neg_loss = -nn.functional.logsigmoid(-neg_score).sum(dim=1)
        return (pos_loss + neg_loss).mean()
```
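
The negative-sampling objective draws its negatives from a noise distribution P_n(w), which this lesson does not define. The original Word2Vec paper reports using the unigram distribution raised to the 3/4 power; the sketch below assumes that choice. The helper names and the word counts are hypothetical.

```python
import numpy as np

def noise_distribution(word_counts, power=0.75):
    """P_n(w) proportional to count(w) ** power."""
    counts = np.asarray(word_counts, dtype=np.float64)
    weights = counts ** power
    return weights / weights.sum()

def sample_negatives(probs, batch_size, k, rng):
    """Draw k negative word ids for each training pair."""
    return rng.choice(len(probs), size=(batch_size, k), p=probs)

# Hypothetical counts for a 4-word vocabulary
counts = [1000, 100, 10, 1]
probs = noise_distribution(counts)
rng = np.random.default_rng(0)
negatives = sample_negatives(probs, batch_size=8, k=5, rng=rng)

print(negatives.shape)                      # (8, 5)
print(probs[0] < counts[0] / sum(counts))   # True
```

Raising counts to a power below 1 shifts probability mass from very frequent words toward rarer ones, so rare words are seen as negatives more often than under plain unigram sampling. The sampler does not exclude the true context word; a fuller implementation might resample in that case.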

### 5.3 GloVe (Global Vectors, Pennington et al., 2014)

GloVe combines global co-occurrence statistics with local context learning.



<MathBlock tex={`J = \\sum_{i,j=1}^{V} f(X_{ij})(w_i^T \\tilde{w}_j + b_i + \\tilde{b}_j - \\log X_{ij})^2`} display={true} />



where <MathBlock tex={`X_{ij}`} /> is the co-occurrence count and <MathBlock tex={`f(x)`} /> is a weighting function:



<MathBlock tex={`f(x) = \\begin{cases} (x / x_{\\max})^\\alpha & \\text{if } x < x_{\\max} \\\\ 1 & \\text{otherwise} \\end{cases}`} display={true} />
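
The weighting function and a single term of the objective can be written out directly. This is a sketch, not the reference implementation: the defaults `x_max=100` and `alpha=0.75` are the values reported by Pennington et al. and should be treated as assumptions here, and `glove_pair_loss` is a hypothetical helper that assumes a positive co-occurrence count.

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """f(x) = (x / x_max) ** alpha if x < x_max, else 1."""
    x = np.asarray(x, dtype=np.float64)
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_pair_loss(w_i, w_j, b_i, b_j, x_ij):
    """One term of J: f(X_ij) * (w_i . w_j + b_i + b_j - log X_ij) ** 2.

    Assumes x_ij > 0, since log(0) is undefined.
    """
    residual = w_i @ w_j + b_i + b_j - np.log(x_ij)
    return float(glove_weight(x_ij) * residual ** 2)

# Weights grow with the count and are capped at 1 from x_max upward
weights = glove_weight([1, 10, 100, 1000])
print(weights)  # approximately [0.032, 0.178, 1.0, 1.0]

# A pair whose dot product plus biases already equals log X_ij has zero loss
loss = glove_pair_loss(np.zeros(3), np.zeros(3), 0.0, 0.0, 1.0)
print(loss)  # 0.0
```

The cap keeps extremely frequent pairs (often involving stopwords) from dominating the sum, while rare pairs still contribute with a reduced weight.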



```python
# Using pretrained GloVe embeddings
import numpy as np

def load_glove(path):
    embeddings = {}
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            values = line.split()
            word = values[0]
            vector = np.asarray(values[1:], dtype='float32')
            embeddings[word] = vector
    return embeddings

glove = load_glove('glove.6B.300d.txt')
print(glove['king'].shape)  # (300,)
```

### Word2Vec vs GloVe

| Aspect | Word2Vec | GloVe |
|--------|----------|-------|
| **Training** | Local context windows | Global co-occurrence matrix |
| **Objective** | Predict context (prediction-based) | Reconstruct log co-occurrence (count-based) |
| **Speed** | Faster per epoch | Faster convergence |
| **Performance** | Comparable | Comparable |
| **Intuition** | "You shall know a word by the company it keeps" | "Word co-occurrence ratios encode meaning" |

---

## 6. Embedding Properties

### 6.1 Word Analogies

Word embeddings capture linear relationships: `king - man + woman ≈ queen`.

**Word analogy as vector arithmetic**

*Diagram: the points `king`, `man`, `woman`, and `queen` laid out along a gender axis in embedding space. The offset `king - man`, added to `woman`, lands near `queen`.*

`v(king) - v(man) + v(woman) ≈ v(queen)`

More analogies:
- `Paris - France + Japan ≈ Tokyo`
- `bigger - big + small ≈ smaller`
- `walked - walk + swim ≈ swam`
- `computer - software + hardware ≈ ???`

```python
def analogy(word_a, word_b, word_c, embeddings, top_k=5):
    """a - b + c = ?"""
    vec = embeddings[word_a] - embeddings[word_b] + embeddings[word_c]

    # Normalize and compute cosine similarity
    vec = vec / np.linalg.norm(vec)
    similarities = {
        word: np.dot(vec, emb / np.linalg.norm(emb))
        for word, emb in embeddings.items()
        if word not in {word_a, word_b, word_c}
    }
    return sorted(similarities.items(), key=lambda x: -x[1])[:top_k]

# king - man + woman → queen
analogy('king', 'man', 'woman', glove)
# e.g. [('queen', 0.85), ('throne', 0.72), ...]; exact scores depend on the embedding set
```

### 6.2 Clustering and Semantic Groups

Embeddings form clusters where semantically related words are proximal:

```
Cluster 1 (royalty):    king, queen, prince, throne, crown
Cluster 2 (food):       pizza, pasta, burger, restaurant, menu
Cluster 3 (emotions):   happy, sad, angry, joyful, depressed
```

```python
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans

words = list(glove.keys())[:5000]
vectors = np.array([glove[w] for w in words])

# Dimensionality reduction
tsne = TSNE(n_components=2, random_state=42)
coords = tsne.fit_transform(vectors)

# Clustering
kmeans = KMeans(n_clusters=10, random_state=42)
labels = kmeans.fit_predict(vectors)

# Visualize
import matplotlib.pyplot as plt
plt.scatter(coords[:, 0], coords[:, 1], c=labels, cmap='tab10', s=5)
plt.show()
```

### 6.3 Cosine Similarity



<MathBlock tex={`\\text{sim}(\\mathbf{u}, \\mathbf{v}) = \\frac{\\mathbf{u} \\cdot \\mathbf{v}}{\\lVert\\mathbf{u}\\rVert \\cdot \\lVert\\mathbf{v}\\rVert} = \\cos(\\theta)`} display={true} />



| Word pair | Cosine similarity |
|-----------|-------------------|
| king → queen | 0.85 |
| king → throne | 0.72 |
| king → banana | 0.05 |
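
A direct implementation of the formula, checked on three small vectors where the answer is known in advance. `cosine_sim` is a hypothetical helper name, and it assumes neither input is the zero vector.

```python
import numpy as np

def cosine_sim(u, v):
    """cos(theta) between two vectors; assumes neither is all zeros."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine_sim([1, 0], [2, 0]))   # 1.0  (same direction, different length)
print(cosine_sim([1, 0], [0, 3]))   # 0.0  (orthogonal)
print(cosine_sim([1, 0], [-1, 0]))  # -1.0 (opposite direction)
```

Because the score depends only on direction, two embeddings with very different norms can still be maximally similar.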

---

## 7. Sequence Representations

### 7.1 One-Hot Encoding

Each word is represented as a binary vector of size <MathBlock tex={`\\lvert V \\rvert`} />:



<MathBlock tex={`x_{\\text{cat}} = [0, 0, \\ldots, 1, \\ldots, 0] \\in \\{0,1\\}^{\\lvert V \\rvert}`} display={true} />



**Problem:** For <MathBlock tex={`\\lvert V \\rvert = 50{,}000`} />, each word is a 50K-dimensional sparse vector.

### 7.2 Word2Vec Averaging (Sentence Embeddings)

A simple sentence representation by averaging word vectors:



<MathBlock tex={`\\mathbf{s} = \\frac{1}{n} \\sum_{i=1}^{n} \\mathbf{v}_{w_i}`} display={true} />



```python
def sentence_embedding(sentence, embeddings, dim=300):
    words = sentence.lower().split()
    vectors = [embeddings[w] for w in words if w in embeddings]
    if not vectors:
        return np.zeros(dim)
    return np.mean(vectors, axis=0)

s1 = sentence_embedding("the king sat on the throne", glove)
s2 = sentence_embedding("the queen sat on the throne", glove)
s3 = sentence_embedding("the cat sat on the mat", glove)

# s1 and s2 are more similar than s1 and s3
```

**Limitations:** Averaging loses word order — "dog bites man" and "man bites dog" produce identical embeddings.
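
This limitation can be verified with a toy embedding table. The 2-dimensional vectors below are invented for illustration, and `average_embedding` is a hypothetical helper that assumes every word is in the table.

```python
import numpy as np

# Hypothetical 2-d embeddings, hand-picked for illustration
toy = {
    "dog": np.array([1.0, 0.0]),
    "bites": np.array([0.0, 1.0]),
    "man": np.array([0.5, 0.5]),
}

def average_embedding(sentence, embeddings):
    """Mean of the word vectors; word positions play no role."""
    return np.mean([embeddings[w] for w in sentence.split()], axis=0)

a = average_embedding("dog bites man", toy)
b = average_embedding("man bites dog", toy)
print(np.allclose(a, b))  # True
```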

### 7.3 Comparison of Representations

| Representation | Dimensionality | Semantic Info | Order Info | Sparsity |
|---------------|----------------|---------------|------------|----------|
| One-hot | <MathBlock tex={`\\lvert V \\rvert`} /> (50K+) | None | None | ~100% |
| TF-IDF | <MathBlock tex={`\\lvert V \\rvert`} /> | Document-level | None | ~99% |
| Word2Vec | 100-300 | Yes | No | 0% |
| GloVe | 100-300 | Yes | No | 0% |
| Averaged W2V | 100-300 | Partial | No | 0% |
| RNN/LSTM | Variable | Yes | Yes | 0% |
| Transformer | Variable | Yes | Yes | 0% |

---

## 8. Complete Implementation

### 8.1 End-to-End NLP Pipeline

```python
import numpy as np
import re
from collections import Counter
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

class NLPBaseline:
    def __init__(self, use_stemming=False, use_lemmatization=False):
        self.stemmer = PorterStemmer() if use_stemming else None
        self.lemmatizer = WordNetLemmatizer() if use_lemmatization else None
        self.stop_words = set(stopwords.words('english'))
        self.vectorizer = None

    def preprocess(self, text):
        text = text.lower()
        text = re.sub(r'<.*?>', '', text)
        text = re.sub(r'[^a-zA-Z\s]', '', text)
        tokens = text.split()
        tokens = [t for t in tokens if t not in self.stop_words]
        if self.stemmer:
            tokens = [self.stemmer.stem(t) for t in tokens]
        if self.lemmatizer:
            tokens = [self.lemmatizer.lemmatize(t) for t in tokens]
        return ' '.join(tokens)

    def fit(self, corpus):
        processed = [self.preprocess(doc) for doc in corpus]
        self.vectorizer = TfidfVectorizer(max_features=10000)
        self.X = self.vectorizer.fit_transform(processed)
        return self

    def transform(self, texts):
        processed = [self.preprocess(doc) for doc in texts]
        return self.vectorizer.transform(processed)

    def similar_docs(self, query, top_k=5):
        q_vec = self.transform([query])
        sims = cosine_similarity(q_vec, self.X).flatten()
        return np.argsort(sims)[::-1][:top_k], sims

# Usage
corpus = [
    "Natural language processing enables computers to understand text",
    "Machine learning algorithms learn patterns from data",
    "Deep learning uses neural networks for complex tasks",
    "NLP combines linguistics and computer science",
]

pipeline = NLPBaseline(use_lemmatization=True)
pipeline.fit(corpus)
indices, scores = pipeline.similar_docs("computational linguistics", top_k=3)
for i in indices:
    print(f"[{scores[i]:.3f}] {corpus[i]}")
```

### 8.2 Training Word2Vec from Scratch

```python
from collections import Counter

import torch
from torch.utils.data import Dataset, DataLoader

# SkipGram is the model class defined in Section 5.2

class Word2VecDataset(Dataset):
    def __init__(self, token_ids, window_size=5):
        self.data = []
        for i, center in enumerate(token_ids):
            context = token_ids[max(0, i-window_size):i] + \
                      token_ids[i+1:i+window_size+1]
            for ctx in context:
                self.data.append((center, ctx))

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]

def train_word2vec(corpus, vocab_size, embed_dim=100, epochs=5):
    # Build vocab
    word_counts = Counter(w for doc in corpus for w in doc.split())
    vocab = {w: i for i, (w, _) in enumerate(word_counts.most_common(vocab_size))}

    # Prepare data
    token_ids = [vocab[w] for doc in corpus for w in doc.split() if w in vocab]
    dataset = Word2VecDataset(token_ids, window_size=3)
    loader = DataLoader(dataset, batch_size=512, shuffle=True)

    # Model
    model = SkipGram(vocab_size, embed_dim)
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

    for epoch in range(epochs):
        total_loss = 0
        for center, context in loader:
            # Negative sampling
            negatives = torch.randint(0, vocab_size, (center.size(0), 5))
            loss = model(center, context, negatives)
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
            total_loss += loss.item()
        # Each batch loss is already a mean, so average over batches, not samples
        print(f"Epoch {epoch+1}: loss={total_loss/len(loader):.4f}")

    # Extract embeddings
    return model.center_embed.weight.detach().numpy(), vocab
```

---

## Key Takeaways

1. **Preprocessing matters** — lowercasing, stopword removal, and lemmatization significantly affect downstream performance
2. **Subword tokenization** (BPE, WordPiece) balances vocabulary size with OOV handling
3. **TF-IDF** weights words by importance: frequent in a document but rare across the corpus
4. **Word2Vec** learns embeddings via local context prediction; **GloVe** leverages global co-occurrence statistics
5. **Vector arithmetic** captures semantic relationships: `king - man + woman ≈ queen`
6. **Cosine similarity** measures semantic closeness in embedding space
7. **Limitations of bag-of-words approaches**: lose word order, syntax, and context — leading to contextual embeddings (BERT, GPT) in modern NLP

---

*Next: [Contextual Embeddings and Transformers](/learn/docs/contextual-embeddings/) — How BERT and GPT solve the polysemy problem with context-dependent representations.*