NLP Fundamentals — Text Processing, Embeddings & Classification

NLP Fundamentals — Complete Guide

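Natural Language Processing (NLP) is the field concerned with getting computers to process, analyze, and generate human language. This guide covers text preprocessing, Bag of Words and TF-IDF, word embeddings, and a simple text classifier.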


Text Preprocessing

Pipeline:
1. Tokenization: Split text into tokens
   "I love ML" → ["I", "love", "ML"]

2. Lowercasing: "I LOVE ML" → "i love ml"

3. Stop word removal: Remove very common words that carry little meaning
   ["i", "love", "ml"] → ["love", "ml"]

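4. Stemming: Strip suffixes with heuristic rules (fast, but the result may not be a real word and irregular forms are missed)
   ["running", "runs", "ran"] → ["run", "run", "ran"]

5. Lemmatization: Reduce to dictionary form using vocabulary and part of speech (usually more accurate, but slower)
   ["ran"] → ["run"], ["better"] → ["good"]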

6. Vectorization: Convert tokens to numeric vectors (Bag of Words, TF-IDF, embeddings)
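
The first three steps of the pipeline can be sketched with the standard library alone. This is a minimal illustration, not a production tokenizer: the `preprocess` function and the tiny `STOP_WORDS` set are assumptions made for this example, and real stop-word lists (for instance those shipped with NLTK or spaCy) are much longer.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"i", "a", "an", "the", "is", "are", "and", "of", "to"}

def preprocess(text):
    """Tokenize, lowercase, and drop stop words, in that order."""
    # 1. Tokenization: treat each run of letters, digits, or apostrophes as a token.
    tokens = re.findall(r"[A-Za-z0-9']+", text)
    # 2. Lowercasing: so that "ML" and "ml" map to the same token.
    tokens = [tok.lower() for tok in tokens]
    # 3. Stop word removal: keep only tokens that are not in the stop-word list.
    return [tok for tok in tokens if tok not in STOP_WORDS]

print(preprocess("I LOVE ML"))  # ['love', 'ml']
```

Stemming and lemmatization are left out here because they normally rely on a library such as NLTK or spaCy.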

Bag of Words & TF-IDF

Bag of Words (BoW):
Represent each document by how many times each vocabulary word occurs in it, ignoring word order

Vocabulary: [I, love, ML, dogs]
Doc 1: "I love ML"      → [1, 1, 1, 0]
Doc 2: "I love dogs"    → [1, 1, 0, 1]
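
The two vectors above can be reproduced with a few lines of plain Python. This is a sketch for illustration: `bag_of_words` is a hypothetical helper that splits on whitespace and keeps the vocabulary in order of first appearance, which is not how library vectorizers order their columns.

```python
def bag_of_words(docs):
    """Return (vocabulary, count vectors) for a list of documents."""
    # Build the vocabulary in order of first appearance.
    vocab = []
    for doc in docs:
        for word in doc.split():
            if word not in vocab:
                vocab.append(word)
    # Count how often each vocabulary word occurs in each document.
    vectors = [[doc.split().count(word) for word in vocab] for doc in docs]
    return vocab, vectors

vocab, vectors = bag_of_words(["I love ML", "I love dogs"])
print(vocab)    # ['I', 'love', 'ML', 'dogs']
print(vectors)  # [[1, 1, 1, 0], [1, 1, 0, 1]]
```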

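TF-IDF: Weights each word by how informative it is for a document
TF = term frequency (how often the term occurs in the document)
IDF = inverse document frequency (terms found in fewer documents get a higher weight)
TF-IDF = TF × IDF

Words that appear in most documents (the, is) get LOW TF-IDF
Words concentrated in a few documents (e.g. topic terms such as "machine learning" in a general corpus) get HIGH TF-IDF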
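
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["I love machine learning", "I love dogs", "Machine learning is great"]

# Defaults: lowercases the text and drops single-character tokens such as "I"
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)  # sparse matrix, one row per document

print(vectorizer.get_feature_names_out())  # learned vocabulary (scikit-learn 1.0+)
print(X.toarray())                         # TF-IDF weights, one column per term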
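
To make the TF × IDF formula concrete, here is a hand-rolled sketch using the textbook definitions, with TF as count divided by document length and IDF as log(N / document frequency). The `tf_idf` function is a hypothetical helper written for this example. Its numbers will not match `TfidfVectorizer`, which by default uses a smoothed IDF and L2-normalizes each row.

```python
import math

def tf_idf(docs):
    """Return one {term: score} dict per document, using textbook TF-IDF."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents that contain each term.
    df = {}
    for tokens in tokenized:
        for term in set(tokens):
            df[term] = df.get(term, 0) + 1
    scores = []
    for tokens in tokenized:
        doc_scores = {}
        for term in set(tokens):
            tf = tokens.count(term) / len(tokens)   # share of the document
            idf = math.log(n_docs / df[term])       # rarer across docs = larger
            doc_scores[term] = tf * idf
        scores.append(doc_scores)
    return scores

corpus = ["I love machine learning", "I love dogs", "Machine learning is great"]
scores = tf_idf(corpus)
print(round(scores[1]["dogs"], 3))  # 0.366 ("dogs" appears in 1 of 3 documents)
print(round(scores[1]["love"], 3))  # 0.135 ("love" appears in 2 of 3 documents)
```

In the second document both words occur once, so the difference in score comes entirely from IDF.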

Word Embeddings

One-hot: [0, 0, 1, 0, 0] — sparse, no meaning
Embedding: [0.2, -0.5, 0.8, 0.1, 0.3] — dense, captures meaning
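
The difference shows up as soon as vectors are compared. In the sketch below, one-hot vectors for two different words always have cosine similarity 0, while dense vectors can express degrees of similarity. The dense vectors for "cat", "dog", and "car" are invented for illustration and do not come from any trained model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# One-hot vectors for two different words never share a non-zero position.
cat_onehot = [0, 0, 1, 0, 0]
dog_onehot = [0, 0, 0, 1, 0]
print(cosine(cat_onehot, dog_onehot))  # 0.0

# Hypothetical dense vectors, made up for this example.
cat_dense = [0.2, -0.5, 0.8, 0.1, 0.3]
dog_dense = [0.3, -0.4, 0.7, 0.0, 0.4]
car_dense = [-0.6, 0.7, 0.1, 0.5, -0.2]
print(cosine(cat_dense, dog_dense) > cosine(cat_dense, car_dense))  # True
```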

Word2Vec:
├─ Skip-gram: Predict the surrounding context words from the center word
├─ CBOW: Predict the center word from its surrounding context words
└─ Learned vectors support analogies: king - man + woman ≈ queen
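
The two Word2Vec objectives differ mainly in how training examples are built from a sliding window. The sketch below only generates those examples and does not train anything. `skipgram_pairs`, `cbow_examples`, and the `window` parameter are names chosen for this illustration, not part of any library.

```python
def skipgram_pairs(tokens, window=1):
    """Skip-gram: one (center word, context word) pair per neighbor."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_examples(tokens, window=1):
    """CBOW: one (context words, target word) example per position."""
    examples = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        examples.append((context, target))
    return examples

tokens = ["i", "love", "machine", "learning"]
print(skipgram_pairs(tokens)[:3])
# [('i', 'love'), ('love', 'i'), ('love', 'machine')]
print(cbow_examples(tokens)[1])
# (['i', 'machine'], 'love')
```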

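GloVe:
├─ Global Vectors for Word Representation
├─ Trained on corpus-wide word co-occurrence statistics
└─ Pre-trained vectors available (e.g. Wikipedia + Gigaword, Common Crawl, Twitter)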
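
The co-occurrence statistics GloVe starts from can be pictured as a table of how often two words appear near each other. The sketch below counts raw co-occurrences within a fixed window. It is a simplification for illustration: `cooccurrence_counts` is a hypothetical helper, and GloVe's own preprocessing and training objective are not reproduced here.

```python
from collections import Counter

def cooccurrence_counts(sentences, window=2):
    """Count how often each ordered word pair occurs within `window` tokens."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            # Look only to the right, then record the pair in both directions
            # so the resulting table is symmetric.
            for j in range(i + 1, min(len(tokens), i + window + 1)):
                counts[(word, tokens[j])] += 1
                counts[(tokens[j], word)] += 1
    return counts

counts = cooccurrence_counts(["I love machine learning", "I love dogs"])
print(counts[("i", "love")])        # 2
print(counts[("love", "dogs")])     # 1
print(counts[("machine", "dogs")])  # 0
```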
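
import gensim.downloader as api

# Load pre-trained Word2Vec vectors (Google News, 300 dimensions).
# The first call downloads the model (over 1.5 GB) and caches it locally.
model = api.load('word2vec-google-news-300')

# Cosine similarity between two word vectors
print(model.similarity('cat', 'dog'))  # about 0.76

# Analogy: king - man + woman ≈ ?
print(model.most_similar(positive=['king', 'woman'], negative=['man'], topn=1))
# e.g. [('queen', 0.71)] (score rounded)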

Text Classification

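from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Toy training data for illustration only; replace with a real labeled dataset
X_train_text = ["I loved this movie", "What a great film",
                "Terrible acting and a boring plot", "I hated this movie"]
y_train = ["positive", "positive", "negative", "negative"]

# Pipeline: TF-IDF features (at most 5,000 terms) feeding a Naive Bayes classifier
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=5000)),
    ('clf', MultinomialNB())
])

# Train: fits the vectorizer and the classifier on the raw training text
pipeline.fit(X_train_text, y_train)

# Predict: new text passes through the same fitted vectorizer before classification
predictions = pipeline.predict(["This movie is great!"])
print(predictions)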
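
To show roughly what a multinomial Naive Bayes classifier computes, here is a from-scratch sketch on raw word counts with additive (Laplace) smoothing. It is a simplified illustration, not scikit-learn's implementation: `train_nb`, `predict_nb`, and the four toy reviews are all made up for this example, and the pipeline above feeds TF-IDF weights rather than raw counts.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Collect per-class document counts, per-class word counts, and the vocabulary."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for doc, label in zip(docs, labels):
        tokens = doc.lower().split()
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def predict_nb(model, doc, alpha=1.0):
    """Return the class with the highest log prior + summed log likelihoods."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log prior P(class)
        total_words = sum(word_counts[label].values())
        for token in doc.lower().split():
            if token not in vocab:
                continue  # ignore words never seen during training
            # Smoothed estimate of P(token | class); alpha avoids zero probabilities.
            likelihood = (word_counts[label][token] + alpha) / (
                total_words + alpha * len(vocab))
            score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy data, for illustration only.
docs = ["great movie", "loved this film", "terrible movie", "boring and bad film"]
labels = ["pos", "pos", "neg", "neg"]
model = train_nb(docs, labels)
print(predict_nb(model, "great film"))     # pos
print(predict_nb(model, "terrible film"))  # neg
```

Working in log space avoids numerical underflow when many small probabilities are multiplied together.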

Key Takeaways

  1. Tokenization is typically the first step in an NLP pipeline
  2. TF-IDF is a simple, strong baseline for text classification
  3. Word embeddings capture semantic similarity between words
  4. Word2Vec and GloVe learn word embeddings, and pre-trained vectors are available for both
  5. Pre-trained models (BERT, GPT) achieve state-of-the-art results on many NLP tasks
  6. Text preprocessing choices can significantly affect model performance
  7. N-grams capture local word order
  8. Sentiment analysis is a common NLP task
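
As a small illustration of takeaway 7, the sketch below extracts n-grams from a token list. `ngrams` is a hypothetical helper written for this example. With bigrams, a phrase such as "not good" becomes a single feature instead of two unrelated words.

```python
def ngrams(tokens, n):
    """Return all contiguous n-token sequences as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams(["not", "good", "at", "all"], 2))
# [('not', 'good'), ('good', 'at'), ('at', 'all')]
```

In scikit-learn, passing `ngram_range=(1, 2)` to `TfidfVectorizer` includes both unigrams and bigrams as features.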
