Text Classification

Introduction

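Text classification assigns predefined labels to text documents. Common applications include sentiment analysis and spam detection. Topic modeling, also covered here, is a related unsupervised technique that discovers themes in a corpus rather than assigning known labels.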

Sentiment Analysis

from transformers import pipeline

# Load the default pretrained sentiment model (downloaded on first use)
sentiment = pipeline("sentiment-analysis")
result = sentiment("I love this product!")
print(result)  # [{'label': 'POSITIVE', 'score': 0.99...}]
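
The pipeline returns a list with one dict per input, each holding a label and a score, as the printed output shows. A minimal sketch of collapsing such a result into a single signed number follows. It assumes the model's two labels are 'POSITIVE' and 'NEGATIVE' (other models may name them differently), and the helper name signed_score is hypothetical, not part of transformers.

```python
def signed_score(prediction):
    """Map one pipeline result dict to a value in [-1, 1].

    Assumes the labels are 'POSITIVE' and 'NEGATIVE' (an assumption about
    the model in use; other models may use different label names).
    """
    label = prediction["label"].upper()
    if label == "POSITIVE":
        return prediction["score"]
    if label == "NEGATIVE":
        return -prediction["score"]
    raise ValueError(f"Unexpected label: {prediction['label']}")


# Hand-written results in the same shape as the pipeline output,
# so this sketch runs without downloading a model.
results = [
    {"label": "POSITIVE", "score": 0.99},
    {"label": "NEGATIVE", "score": 0.80},
]
scores = [signed_score(r) for r in results]
print(scores)  # [0.99, -0.8]
```

A signed score like this can be convenient for sorting or averaging reviews, though it discards nothing the original dicts did not already contain.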

Topic Modeling

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Create a document-term matrix of raw word counts
# (documents is assumed to be a list of text strings defined elsewhere)
vectorizer = CountVectorizer(max_features=1000, stop_words='english')
dtm = vectorizer.fit_transform(documents)

# LDA topic model with 10 topics
lda = LatentDirichletAllocation(n_components=10, random_state=42)
lda.fit(dtm)

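# Print the five highest-weight words for each topic, strongest first
feature_names = vectorizer.get_feature_names_out()
for topic_idx, topic in enumerate(lda.components_):
    top_words = [feature_names[i] for i in topic.argsort()[::-1][:5]]
    print(f"Topic {topic_idx}: {top_words}")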
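
Beyond the top words per topic, a fitted model can also report how strongly each document belongs to each topic. A minimal sketch on a made-up four-document corpus follows; the corpus, the choice of two topics, and the variable names are illustrative assumptions.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Tiny illustrative corpus (made up for this sketch)
docs = [
    "the cat sat on the mat with another cat",
    "dogs and cats are popular pets",
    "stock markets fell as investors sold shares",
    "investors bought shares when the market rose",
]

vectorizer = CountVectorizer(stop_words="english")
dtm = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=42)
# fit_transform returns one row per document; each row is a
# probability distribution over the topics
doc_topics = lda.fit_transform(dtm)

# Index of the highest-weight topic for each document
dominant = doc_topics.argmax(axis=1)
for text, topic in zip(docs, dominant):
    print(f"topic {topic}: {text}")
```

On a corpus this small the topic assignments are not very meaningful; the point is only the shape of the output, one row of topic weights per document.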

Custom Classifier

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

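# Named text_clf so it does not shadow the transformers pipeline() function
text_clf = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=5000, ngram_range=(1, 2))),
    ('clf', LogisticRegression(max_iter=1000))
])

# train_texts and test_texts are lists of strings;
# train_labels holds one label per training text
text_clf.fit(train_texts, train_labels)
predictions = text_clf.predict(test_texts)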
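
A fitted classifier is usually checked against held-out labels. A minimal sketch using scikit-learn's classification_report on a made-up spam/ham dataset follows; the texts, labels, and the variable name model are illustrative assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.pipeline import Pipeline

# Made-up training and test data for illustration only
train_texts = [
    "win a free prize now", "cheap loans click here",
    "free money guaranteed", "meeting moved to friday",
    "see you at lunch tomorrow", "please review the attached report",
]
train_labels = ["spam", "spam", "spam", "ham", "ham", "ham"]
test_texts = ["free prize click now", "lunch meeting on friday"]
test_labels = ["spam", "ham"]

model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(train_texts, train_labels)
predictions = model.predict(test_texts)

# Per-class precision, recall, and F1, plus overall accuracy
report = classification_report(test_labels, predictions, zero_division=0)
print(report)
```

With real data the test set would come from a proper split (for example train_test_split) rather than two hand-written sentences.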

Multi-label Classification

from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

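# X_train and X_test are feature matrices (e.g. TF-IDF vectors);
# y_train_multilabel is a binary indicator matrix with one column per label
classifier = OneVsRestClassifier(LinearSVC())
classifier.fit(X_train, y_train_multilabel)
predictions = classifier.predict(X_test)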
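
OneVsRestClassifier takes multi-label targets as a binary indicator matrix with one column per label. A minimal sketch of building that matrix with MultiLabelBinarizer follows, using made-up texts and tag sets; all data and variable names here are illustrative assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# Made-up documents, each tagged with one or more labels
texts = [
    "python code for training neural networks",
    "football match ends in a draw",
    "machine learning predicts football results",
    "new python release improves performance",
    "team wins the football championship final",
    "neural networks learn from data",
]
tags = [
    {"tech"}, {"sports"}, {"tech", "sports"},
    {"tech"}, {"sports"}, {"tech"},
]

# One column per label, 1 where the document carries that label
mlb = MultiLabelBinarizer()
y = mlb.fit_transform(tags)
print(mlb.classes_)  # ['sports' 'tech']

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

classifier = OneVsRestClassifier(LinearSVC())
classifier.fit(X, y)

# Predictions come back as an indicator matrix; map them to label tuples
pred = classifier.predict(vectorizer.transform(["python football analytics"]))
print(mlb.inverse_transform(pred))
```

Keeping the fitted MultiLabelBinarizer around matters: inverse_transform is what turns the 0/1 columns back into readable label names.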

Practice Problems

Perform sentiment analysis
Extract topics with LDA
Build a text classification pipeline
Handle multi-label classification
Evaluate with a classification report
