NLP Preprocessing

Introduction

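Text preprocessing turns raw text into a cleaned, normalized sequence of tokens that a machine learning model can consume. This lesson covers four common steps: tokenization, stemming, lemmatization, and stop word removal.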

Tokenization

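import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

# One-time download of the Punkt tokenizer data
# (recent NLTK releases look for 'punkt_tab' instead of 'punkt')
nltk.download('punkt')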

text = "Hello world. This is a test."

# Sentence tokenization
sentences = sent_tokenize(text)
print(sentences)  # ['Hello world.', 'This is a test.']

# Word tokenization
words = word_tokenize(text)
print(words)  # ['Hello', 'world', '.', 'This', 'is', 'a', 'test', '.']

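# Whitespace tokenization (punctuation stays attached to the words)
words = text.split()
print(words)  # ['Hello', 'world.', 'This', 'is', 'a', 'test.']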
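
To make the difference between whitespace splitting and punctuation-aware tokenization concrete, here is a minimal sketch that uses only the standard library. The regular expression is an assumption chosen for illustration; it is not how `word_tokenize` works internally, although it gives the same result on this simple sentence.

```python
import re

text = "Hello world. This is a test."

# Whitespace splitting leaves punctuation glued to the neighbouring word.
whitespace_tokens = text.split()

# Illustrative pattern (an assumption, not NLTK's algorithm):
# a run of word characters, or any single character that is
# neither a word character nor whitespace.
regex_tokens = re.findall(r"\w+|[^\w\s]", text)

print(whitespace_tokens)  # ['Hello', 'world.', 'This', 'is', 'a', 'test.']
print(regex_tokens)       # ['Hello', 'world', '.', 'This', 'is', 'a', 'test', '.']
```

A pattern this simple will mishandle contractions, URLs, and abbreviations, which is the main reason to prefer a trained tokenizer for real text.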

Stemming

from nltk.stem import PorterStemmer, SnowballStemmer

stemmer = PorterStemmer()
words = ['running', 'runner', 'ran', 'runs']
stems = [stemmer.stem(w) for w in words]
print(stems)  # ['run', 'runner', 'ran', 'run']

# Snowball (Porter2)
snowball = SnowballStemmer('english')
print([snowball.stem(w) for w in words])
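
Stemmers work by applying suffix-stripping rules rather than by looking words up in a dictionary, which is why `ran` is left untouched above. The sketch below shows the idea with a deliberately tiny rule set. `toy_stem` is a hypothetical helper written for this lesson, not the Porter or Snowball algorithm, which use many more rules and conditions.

```python
def toy_stem(word: str) -> str:
    """Strip one common English suffix (toy rules, not the Porter algorithm)."""
    for suffix in ("ing", "s"):
        stem = word[: -len(suffix)]
        # Only strip when at least three characters remain.
        if word.endswith(suffix) and len(stem) >= 3:
            # Undo consonant doubling, e.g. "runn" -> "run".
            if suffix == "ing" and stem[-1] == stem[-2]:
                stem = stem[:-1]
            return stem
    return word


words = ['running', 'runner', 'ran', 'runs']
toy_stems = [toy_stem(w) for w in words]
print(toy_stems)  # ['run', 'runner', 'ran', 'run']
```

The crude doubling rule also turns `falling` into `fal`, which illustrates why production stemmers need more carefully conditioned rules.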

Lemmatization

import spacy

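# Requires the model: python -m spacy download en_core_web_sm
nlp = spacy.load('en_core_web_sm')
doc = nlp("running runner ran")
lemmas = [token.lemma_ for token in doc]
# spaCy lemmatizes using predicted part-of-speech tags,
# so the exact output can vary with the model version
print(lemmas)  # e.g. ['run', 'runner', 'run']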

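# WordNet lemmatizer (requires a one-time nltk.download('wordnet'))
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
# pos defaults to 'n' (noun), so pass pos='v' to lemmatize verbs
print(lemmatizer.lemmatize('running', pos='v'))  # run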
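
Unlike stemming, lemmatization relies on vocabulary lookup plus the word's part of speech, which is how an irregular form such as `ran` can map to `run`. The sketch below imitates that with a hand-written lexicon. `LEXICON` and `toy_lemmatize` are hypothetical names invented for illustration; real lemmatizers draw on resources such as WordNet or a trained model.

```python
# Hypothetical mini-lexicon keyed by (word, part of speech).
LEXICON = {
    ("running", "v"): "run",
    ("ran", "v"): "run",
    ("runs", "v"): "run",
    ("better", "a"): "good",
}


def toy_lemmatize(word: str, pos: str = "n") -> str:
    """Look up the lemma for (word, pos); fall back to the word itself."""
    return LEXICON.get((word.lower(), pos), word)


print(toy_lemmatize("ran", pos="v"))      # run
print(toy_lemmatize("running", pos="v"))  # run
print(toy_lemmatize("running"))           # running (treated as a noun by default)
```

The last call shows why the part-of-speech argument matters: with the default noun tag there is no entry to look up, so the word comes back unchanged.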

Stop Words

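from nltk.corpus import stopwords

# Requires a one-time nltk.download('stopwords')
# The list is lowercase, so lowercase each word before comparing
stop_words = set(stopwords.words('english'))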
words = ['The', 'cat', 'is', 'on', 'the', 'mat']
filtered = [w for w in words if w.lower() not in stop_words]
print(filtered)  # ['cat', 'mat']

Practice Problems

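  1. Tokenize a paragraph into sentences, then into words
  2. Apply the Porter and Snowball stemmers to a word list and compare the results
  3. Lemmatize the same words with spaCy and with WordNet
  4. Remove stop words from a tokenized sentence
  5. Build a preprocessing pipeline that combines these steps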
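
As one possible starting point for problem 5, the sketch below chains lowercasing, tokenization, stop word removal, and an optional normalization step using only the standard library. The function names and the tiny `STOP_WORDS` set are assumptions made for illustration; they stand in for the NLTK tokenizer and stop word list used earlier in the lesson.

```python
import re
from typing import Callable, List

# Hypothetical, deliberately tiny stop list; NLTK's English list is much longer.
STOP_WORDS = {"the", "is", "on", "a", "this"}


def tokenize(text: str) -> List[str]:
    """Lowercase the text and keep runs of word characters."""
    return re.findall(r"\w+", text.lower())


def remove_stop_words(tokens: List[str]) -> List[str]:
    """Drop tokens that appear in the stop list."""
    return [t for t in tokens if t not in STOP_WORDS]


def preprocess(text: str, normalize: Callable[[str], str] = lambda t: t) -> List[str]:
    """Tokenize, filter stop words, then apply an optional per-token normalizer."""
    tokens = remove_stop_words(tokenize(text))
    return [normalize(t) for t in tokens]


result = preprocess("The cat is on the mat.")
print(result)  # ['cat', 'mat']
```

The `normalize` parameter is where a stemmer or lemmatizer could be plugged in, for example `preprocess(text, normalize=PorterStemmer().stem)` once NLTK is imported.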
