NLP Fundamentals — Complete Guide
Natural Language Processing (NLP) covers the techniques that let computers analyze, interpret, and generate human language.
Text Preprocessing
Pipeline:
1. Tokenization: Split text into tokens
"I love ML" → ["I", "love", "ML"]
2. Lowercasing: "I LOVE ML" → "i love ml"
3. Stop word removal: Remove common words
["i", "love", "ml"] → ["love", "ml"]
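4. Stemming: Strip suffixes with heuristic rules; the result may not be a real word
["running", "runs", "studies"] → ["run", "run", "studi"]
5. Lemmatization: Map each word to its dictionary form using a vocabulary and part of speech (more accurate, slower)
["ran", "better"] → ["run", "good"]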
6. Vectorization: Convert text to numbers
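Steps 1 through 4 can be sketched in plain Python. The stop word list and the `crude_stem` helper below are hypothetical stand-ins chosen for illustration; a real project would more likely rely on a library such as NLTK or spaCy for stemming and lemmatization.

```python
import re

# Hypothetical, deliberately tiny stop word list; real lists are much longer.
STOP_WORDS = {"i", "a", "an", "the", "is", "are", "and", "of", "to"}

def tokenize(text):
    """Split text into word tokens made of letters, digits, and apostrophes."""
    return re.findall(r"[A-Za-z0-9']+", text)

def crude_stem(token):
    """Toy suffix stripper for illustration; this is NOT the Porter algorithm."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    tokens = tokenize(text)                               # 1. tokenization
    tokens = [t.lower() for t in tokens]                  # 2. lowercasing
    tokens = [t for t in tokens if t not in STOP_WORDS]   # 3. stop word removal
    return [crude_stem(t) for t in tokens]                # 4. (toy) stemming

print(preprocess("I love ML"))             # ['love', 'ml']
print(preprocess("The dogs are barking"))  # ['dog', 'bark']
```

Lowercasing runs before stop word removal so that "I" and "The" match the lowercase entries in the stop word list.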
Bag of Words & TF-IDF
Bag of Words (BoW):
Represent each document by how often each vocabulary word occurs in it, ignoring word order
Vocabulary: [I, love, ML, dogs]
Doc 1: "I love ML" → [1, 1, 1, 0]
Doc 2: "I love dogs" → [1, 1, 0, 1]
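The two count vectors can be reproduced with a few lines of plain Python. This is a minimal sketch that assumes whitespace tokenization and keeps the vocabulary in order of first appearance; library vectorizers typically lowercase the text and sort the vocabulary instead.

```python
docs = ["I love ML", "I love dogs"]
tokenized = [doc.split() for doc in docs]

# Build the vocabulary in order of first appearance.
vocab = []
for tokens in tokenized:
    for token in tokens:
        if token not in vocab:
            vocab.append(token)

# One count per vocabulary word, per document.
vectors = [[tokens.count(word) for word in vocab] for tokens in tokenized]

print(vocab)    # ['I', 'love', 'ML', 'dogs']
print(vectors)  # [[1, 1, 1, 0], [1, 1, 0, 1]]
```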
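TF-IDF: Weights each word by how informative it is for a document
TF = how often the term occurs in the document
IDF = inverse document frequency, commonly log(N / df), where N is the number of documents and df is the number of documents containing the term (rarer = higher)
TF-IDF = TF × IDF
Words that appear in most documents (the, is) get LOW TF-IDF
Words concentrated in a few documents (e.g. "machine", "learning" in a general corpus) get HIGH TF-IDF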
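from sklearn.feature_extraction.text import TfidfVectorizer
corpus = ["I love machine learning", "I love dogs", "Machine learning is great"]
# Default settings lowercase the text and drop one-character tokens such as "I"
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)  # sparse matrix, one row per document
print(vectorizer.get_feature_names_out())  # column order of the matrix
print(X.toarray())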
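To see where such scores come from, the textbook formula TF × IDF can be computed by hand. The sketch below assumes TF is the term count divided by document length and IDF is log(N / df); scikit-learn's defaults use a smoothed IDF and L2-normalize each row, so its numbers will differ from these.

```python
import math

docs = [
    ["i", "love", "machine", "learning"],
    ["i", "love", "dogs"],
    ["machine", "learning", "is", "great"],
]

def tf(term, doc):
    """Term frequency: share of the document's tokens equal to `term`."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """Inverse document frequency; `term` must occur in at least one document."""
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / df)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# "love" appears in 2 of 3 documents, "dogs" in only 1.
score_love = tf_idf("love", docs[1], docs)
score_dogs = tf_idf("dogs", docs[1], docs)
print(round(score_love, 3))  # 0.135
print(round(score_dogs, 3))  # 0.366
```

Both words occur once in the second document, so the difference comes entirely from IDF: the rarer word scores higher.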
Word Embeddings
One-hot: [0, 0, 1, 0, 0] — sparse, no meaning
Embedding: [0.2, -0.5, 0.8, 0.1, 0.3] — dense, captures meaning
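The practical difference shows up when comparing two words. In this sketch the one-hot vectors and the dense vectors are made-up values for two hypothetical words ("cat" and "dog"), not output from any trained model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# One-hot: each word gets its own axis, so distinct words never overlap.
one_hot_cat = [0, 0, 1, 0, 0]
one_hot_dog = [0, 0, 0, 1, 0]

# Dense embeddings (hypothetical values): related words can point the same way.
emb_cat = [0.2, -0.5, 0.8, 0.1, 0.3]
emb_dog = [0.3, -0.4, 0.7, 0.0, 0.2]

print(cosine(one_hot_cat, one_hot_dog))      # 0.0
print(round(cosine(emb_cat, emb_dog), 2))    # 0.98
```

Any two distinct one-hot vectors have similarity 0, which is why one-hot encoding carries no notion of relatedness.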
Word2Vec:
├─ Skip-gram: Predict context from word
├─ CBOW: Predict word from context
└─ Learns: king - man + woman ≈ queen
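The two Word2Vec objectives differ mainly in how training examples are built from a sliding window. The helper names and the window size below are assumptions for illustration; the sketch only generates training pairs and does not train a model.

```python
def skipgram_pairs(tokens, window=1):
    """Skip-gram: one (center word, context word) pair per neighbor."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_pairs(tokens, window=1):
    """CBOW: one (list of context words, target word) pair per position."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        pairs.append((context, target))
    return pairs

tokens = ["i", "love", "ml"]
print(skipgram_pairs(tokens))
# [('i', 'love'), ('love', 'i'), ('love', 'ml'), ('ml', 'love')]
print(cbow_pairs(tokens))
# [(['love'], 'i'), (['i', 'ml'], 'love'), (['love'], 'ml')]
```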
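GloVe:
├─ Global Vectors for Word Representation
├─ Trained on corpus-wide word co-occurrence statistics
└─ Pre-trained vectors available (e.g. trained on Wikipedia + Gigaword, Common Crawl, Twitter)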
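The co-occurrence statistics GloVe starts from can be illustrated with a simple counter. This is a simplified sketch with a hypothetical helper name: it uses raw symmetric counts, whereas the published GloVe method also down-weights pairs that are farther apart, and it omits the model fitting step entirely.

```python
from collections import Counter

def cooccurrence_counts(sentences, window=2):
    """Count how often two words appear within `window` tokens of each other."""
    counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            for j in range(i + 1, min(len(tokens), i + window + 1)):
                counts[(word, tokens[j])] += 1
                counts[(tokens[j], word)] += 1  # keep the table symmetric
    return counts

sentences = [["i", "love", "ml"], ["i", "love", "dogs"]]
counts = cooccurrence_counts(sentences, window=1)
print(counts[("i", "love")])   # 2
print(counts[("love", "ml")])  # 1
print(counts[("i", "ml")])     # 0
```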
import gensim.downloader as api
# Load pre-trained Word2Vec vectors (large download, over 1 GB, on first use)
model = api.load('word2vec-google-news-300')
# Cosine similarity between two words
model.similarity('cat', 'dog')  # ≈ 0.76
# Analogy: king - man + woman
model.most_similar(positive=['king', 'woman'], negative=['man'], topn=1)
# [('queen', 0.71...)]
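The arithmetic behind the analogy query can be sketched without downloading a model. The 2-D vectors below are hypothetical values constructed so that one axis roughly encodes "royalty" and the other "gender"; real embeddings have hundreds of dimensions with no such clean interpretation.

```python
import math

# Hypothetical toy vectors: axis 0 ~ "royalty", axis 1 ~ "gender".
vectors = {
    "king":  [0.9, 0.8],
    "queen": [0.9, -0.8],
    "man":   [0.1, 0.8],
    "woman": [0.1, -0.8],
    "apple": [-0.7, 0.1],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

print(analogy("king", "man", "woman"))  # queen
```

The query words are excluded from the candidates because the result vector usually stays close to one of its own inputs.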
Text Classification
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
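# Toy training data for illustration only; substitute a real labeled dataset
X_train_text = ["I loved this film", "Great acting and story", "Terrible plot", "I hated every minute"]
y_train = ["pos", "pos", "neg", "neg"]
# Pipeline: TF-IDF features (at most 5,000 terms) feeding a Naive Bayes classifier
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=5000)),
    ('clf', MultinomialNB()),
])
# Train
pipeline.fit(X_train_text, y_train)
# Predict the label of a new document
predictions = pipeline.predict(["This movie is great!"])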
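For intuition about the classifier in the pipeline, here is a simplified pure-Python sketch of multinomial Naive Bayes with add-one (Laplace) smoothing. It works on raw token counts rather than TF-IDF weights, the function names and training examples are hypothetical, and it is not scikit-learn's implementation.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods per class."""
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)
    model = {}
    for label in class_docs:
        total = sum(word_counts[label].values())
        denom = total + alpha * len(vocab)
        model[label] = {
            "log_prior": math.log(class_docs[label] / len(labels)),
            "log_like": {w: math.log((word_counts[label][w] + alpha) / denom)
                         for w in vocab},
        }
    return model

def predict_nb(model, tokens):
    """Pick the class with the highest log prior plus summed log likelihoods."""
    scores = {}
    for label, params in model.items():
        score = params["log_prior"]
        for token in tokens:
            if token in params["log_like"]:  # skip out-of-vocabulary words
                score += params["log_like"][token]
        scores[label] = score
    return max(scores, key=scores.get)

# Toy data for illustration only.
train_docs = [["great", "movie"], ["loved", "it"], ["terrible", "movie"], ["hated", "it"]]
train_labels = ["pos", "pos", "neg", "neg"]
nb_model = train_nb(train_docs, train_labels)
print(predict_nb(nb_model, ["great", "movie"]))  # pos
```

Working in log space avoids numerical underflow when many small probabilities are multiplied together.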
Key Takeaways
- Tokenization is typically the first step in an NLP pipeline
- TF-IDF is a simple and effective baseline for text classification
- Word embeddings capture semantic meaning in dense vectors
- Word2Vec and GloVe are embedding methods with widely used pre-trained vectors
- Pre-trained models (BERT, GPT) achieve state-of-the-art results on many NLP tasks
- Text preprocessing choices can significantly affect model performance
- N-grams capture local word order
- Sentiment analysis is a common NLP task