String & Text Processing for Data Science
Why This Matters for Data Science
By commonly cited industry estimates, roughly 80% of enterprise data is unstructured, and much of it is text. From social media posts to customer reviews, medical records to legal documents, the ability to clean, transform, and analyze text is essential. This tutorial explores string operations, regular expressions, and NLP preprocessing techniques, with a focus on practical data science applications.
1. String Fundamentals
1.1 String Properties
Strings in Python are immutable sequences of Unicode codepoints:
Definition: UTF-8 (Unicode Transformation Format, 8-bit)
A variable-width character encoding that uses 1-4 bytes per character. ASCII characters use 1 byte, while characters outside the ASCII range use 2-4 bytes. UTF-8 is backward-compatible with ASCII and is the dominant encoding for web and data interchange.
import sys
s = "Hello, Data Science!"
print(f"String: {s}")
print(f"Length: {len(s)} characters")
print(f"Memory: {sys.getsizeof(s)} bytes")
print(f"Type: {type(s)}")
print()
# Unicode codepoints
print("Character details:")
for i, char in enumerate(s[:10]):
    print(f"  {i}: '{char}' → U+{ord(char):04X} ({ord(char)})")
Output:
String: Hello, Data Science!
Length: 20 characters
Memory: 69 bytes
Type: <class 'str'>
Character details:
  0: 'H' → U+0048 (72)
  1: 'e' → U+0065 (101)
  2: 'l' → U+006C (108)
  3: 'l' → U+006C (108)
  4: 'o' → U+006F (111)
  5: ',' → U+002C (44)
  6: ' ' → U+0020 (32)
  7: 'D' → U+0044 (68)
  8: 'a' → U+0061 (97)
  9: 't' → U+0074 (116)
Python 3 strings store Unicode codepoints as integers internally. The memory representation varies based on the maximum codepoint: Latin-1 uses 1 byte/char, UCS-2 uses 2 bytes/char, and UCS-4 uses 4 bytes/char. CPython automatically selects the most compact representation.
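The per-character storage described above can be observed indirectly. The sketch below is CPython-specific and uses a hypothetical helper, `bytes_per_char`, that grows a string by one character and measures the change reported by `sys.getsizeof`; other Python implementations may behave differently.

```python
import sys

def bytes_per_char(ch: str, n: int = 100) -> int:
    """Estimate per-character storage by growing a string by one character."""
    return sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch * n)

# One sample character from each storage class
samples = {
    "ASCII": "a",
    "Latin-1": "\u00e9",       # é
    "BMP": "\u20ac",           # € (needs 2 bytes per character)
    "astral": "\U0001F600",    # emoji outside the BMP (needs 4 bytes)
}

widths = {label: bytes_per_char(ch) for label, ch in samples.items()}
for label, width in widths.items():
    print(f"{label:8}: {width} byte(s) per character")
```

A single wide character forces the whole string into the wider representation, which is why one emoji in a long ASCII document can noticeably increase its in-memory size.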
1.2 Indexing and Slicing
String Indexing:
Index:      0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19
          ┌───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┐
String:   │ H │ e │ l │ l │ o │ , │   │ D │ a │ t │ a │   │ S │ c │ i │ e │ n │ c │ e │ ! │
          └───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┘
Negative:  -20 -19 -18 -17 -16 -15 -14 -13 -12 -11 -10  -9  -8  -7  -6  -5  -4  -3  -2  -1
Slice notation: s[start:stop:step]
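String Slice Complexity
A slice copies every character it selects, so its cost grows with the number of characters selected:
$$T(s[i:j:k]) = O\left(\left\lceil \frac{j - i}{k} \right\rceil\right)$$
Here,
- $i$ = Starting index (inclusive)
- $j$ = Ending index (exclusive)
- $k$ = Step size (default 1)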
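The number of characters a slice selects can be checked without building the slice. This is a minimal sketch: `slice_length` is a hypothetical helper that relies on the built-in `slice.indices()` method to clamp the bounds and fill in defaults, the same way slicing itself does.

```python
s = "Hello, Data Science!"

def slice_length(text: str, sl: slice) -> int:
    """Count the characters text[sl] would contain, without copying them."""
    # indices() resolves None and negative values against len(text)
    return len(range(*sl.indices(len(text))))

examples = [
    slice(0, 5),            # s[0:5]
    slice(7, 11),           # s[7:11]
    slice(None, None, 2),   # s[::2]
    slice(None, None, 3),   # s[::3]
    slice(-10, -5),         # s[-10:-5]
    slice(None, None, -1),  # s[::-1]
]

for sl in examples:
    n = slice_length(s, sl)
    assert n == len(s[sl])  # the count agrees with the real slice
    print(f"{str(sl):28} selects {n:2} characters: '{s[sl]}'")
```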
s = "Hello, Data Science!"
print("Indexing:")
print(f" s[0] = '{s[0]}'") # First char
print(f" s[-1] = '{s[-1]}'") # Last char
print(f" s[7] = '{s[7]}'") # 'D'
print()
print("Slicing:")
print(f" s[0:5] = '{s[0:5]}'") # First 5 chars
print(f" s[7:11] = '{s[7:11]}'") # 'Data'
print(f" s[:7] = '{s[:7]}'") # Everything before index 7
print(f" s[13:] = '{s[13:]}'") # Everything from index 13
print(f" s[::2] = '{s[::2]}'") # Every 2nd char
print(f" s[::-1] = '{s[::-1]}'") # Reverse string
print(f" s[::3] = '{s[::3]}'") # Every 3rd char
print()
# Negative slicing
print("Negative slicing:")
print(f" s[-5:] = '{s[-5:]}'") # Last 5 chars
print(f" s[-10:-5] = '{s[-10:-5]}'") # 10th through 6th characters from the end
1.3 String Formatting
name = "Alice Johnson"
age = 30
score = 95.6789
items = ["Python", "SQL", "ML"]
# f-strings (Python 3.6+) - RECOMMENDED
print("f-string formatting:")
print(f" Basic: {name} is {age} years old")
print(f" Decimal: {score:.2f}")
print(f" Integer: {int(score)}")
print(f" Padding: {name:>20}")
print(f" Left align: {name:<20}")
print(f" Center: {name:^20}")
print(f" Filled: {name:*^20}")
print(f" Comma separator: {1234567:,}")
print(f" Percentage: {0.856:.1%}")
print(f" Scientific: {123456:.2e}")
print(f" Binary: {42:b}")
print(f" Hex: {255:x}")
print(f" Octal: {42:o}")
print()
# .format() method
print(".format() method:")
print(" {} is {} years old".format(name, age))
print(" {1} is {0} years old".format(age, name)) # Indexed
print(" {name} scored {score:.1f}".format(name=name, score=score))
print()
# Template strings
from string import Template
t = Template("$name is $age years old")
print(f"Template: {t.substitute(name=name, age=age)}")
print()
# Lambda formatting (for dynamic formats)
format_funcs = {
"int": lambda x: f"{x:.0f}",
"float2": lambda x: f"{x:.2f}",
"percent": lambda x: f"{x:.1%}",
"scientific": lambda x: f"{x:.2e}"
}
value = 1234.5678
print("Dynamic formatting:")
for fmt_name, fmt_func in format_funcs.items():
    print(f"  {fmt_name}: {fmt_func(value)}")
1.4 Common String Methods
text = " Hello, World! This is a TEST string. "
print("Case methods:")
print(f" Original: '{text}'")
print(f" upper(): '{text.upper()}'")
print(f" lower(): '{text.lower()}'")
print(f" title(): '{text.title()}'")
print(f" capitalize(): '{text.capitalize()}'")
print(f" swapcase(): '{text.swapcase()}'")
print()
print("Search methods:")
print(f" find('World'): {text.find('World')}")
print(f" find('Python'): {text.find('Python')}") # -1 if not found
print(f" count('is'): {text.count('is')}")
print(f" startswith(' Hello'): {text.startswith(' Hello')}")
print(f" endswith('ring. '): {text.endswith('ring. ')}")
print(f" 'World' in text: {'World' in text}")
print()
print("Transform methods:")
print(f" strip(): '{text.strip()}'")
print(f" lstrip(): '{text.lstrip()}'")
print(f" replace('TEST', 'example'): '{text.replace('TEST', 'example')}'")
print(f" replace with count: '{text.replace('is', 'IS', 1)}'")
print()
print("Split/Join methods:")
words = text.strip().split()
print(f" split(): {words}")
print(f" split(','): {text.split(',')}")
print(f" join: {' | '.join(words)}")
print()
print("Check methods:")
print(f" isalpha(): {'Hello'.isalpha()}")
print(f" isdigit(): {'123'.isdigit()}")
print(f" isalnum(): {'Hello123'.isalnum()}")
print(f" isspace(): {' '.isspace()}")
print(f" isnumeric(): {'½'.isnumeric()}")
String concatenation using + in a loop can take O(n²) time, because each concatenation may copy everything built so far into a new string. Prefer ''.join(parts), which is O(n): it computes the total length first and allocates the result buffer once.
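The two approaches can be compared side by side. In this sketch the function names and the sample data are illustrative only; both functions produce the same string, and only the way the result is assembled differs.

```python
def build_with_plus(parts):
    """Accumulate with +=; each step may copy the string built so far."""
    out = ""
    for part in parts:
        out += part
    return out

def build_with_join(parts):
    """Let join() size the result once and copy each part a single time."""
    return "".join(parts)

parts = [f"row{i}," for i in range(10_000)]
assert build_with_plus(parts) == build_with_join(parts)
print(f"Built {len(build_with_join(parts)):,} characters from {len(parts):,} parts")
```

When the pieces are produced inside a loop, a common pattern is to append them to a list and call `join` once at the end.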
2. Regular Expressions
2.1 Why Regular Expressions?
Regular expressions provide a declarative way to match patterns in text. For data scientists, they are essential for:
- Extracting structured data from unstructured text
- Cleaning messy data
- Validating input formats
- Pattern-based feature engineering
Definition: Regular Expression
A formal language for describing sets of strings. Regular expressions are defined over an alphabet Σ and can be built from: (1) individual characters, (2) concatenation, (3) union (|), and (4) Kleene star (*). In this pure form they correspond to finite automata and can be matched in O(n·m) time for input length n and pattern length m.
2.2 Regex Pattern Syntax
Regex Pattern Syntax Diagram:
Character Classes:
[abc] Match a, b, or c
[^abc] Match anything except a, b, or c
[a-z] Match any lowercase letter
[A-Z] Match any uppercase letter
[0-9] Match any digit
[a-zA-Z] Match any letter
. Match any character (except newline)
Quantifiers:
* Zero or more times
+ One or more times
? Zero or one time
{n} Exactly n times
{n,} At least n times
{n,m} Between n and m times
Anchors:
^ Start of string
$ End of string
\b Word boundary
Groups:
(...) Capture group
(?:...) Non-capturing group
(?P<name>...) Named group
Special:
| OR
\d Digit [0-9]
\D Non-digit
\w Word character [a-zA-Z0-9_]
\W Non-word character
\s Whitespace
\S Non-whitespace
2.3 Practical Regex Examples
import re
from typing import List, Dict, Tuple
# Example 1: Email extraction
text = """
Contact us at support@example.com or sales@company.org.
For urgent issues, email urgent@domain.co.uk.
Invalid emails: @missing.com, user@, no-at-sign.com
"""
email_pattern = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'
emails = re.findall(email_pattern, text)
print("Email extraction:")
print(f" Pattern: {email_pattern}")
print(f" Found: {emails}")
print()
# Example 2: Phone number extraction
text2 = """
Call us at (555) 123-4567 or 555-987-6543.
International: +1-555-111-2222 or +44 20 7946 0958.
"""
# Non-capturing group for the optional country code, so findall() returns full matches
phone_pattern = r'(?:\+?\d{1,3}[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}'
phones = re.findall(phone_pattern, text2)
print("Phone extraction:")
print(f" Found: {phones}")
print()
# Example 3: URL extraction
text3 = """
Visit https://www.example.com or http://blog.site.org/post?param=value
Also check https://docs.python.org/3/library/re.html
"""
url_pattern = r'https?://[a-zA-Z0-9.-]+(?:/[a-zA-Z0-9._%+-]*)*(?:\?[a-zA-Z0-9._&=-]*)?'
urls = re.findall(url_pattern, text3)
print("URL extraction:")
for url in urls:
    print(f"  {url}")
print()
# Example 4: Date extraction (multiple formats)
text4 = """
Dates: 01/15/2024, 2024-01-15, Jan 15, 2024, 15-Jan-2024
"""
date_patterns = [
r'\d{2}/\d{2}/\d{4}', # MM/DD/YYYY
r'\d{4}-\d{2}-\d{2}', # YYYY-MM-DD
r'[A-Z][a-z]{2}\s\d{1,2},?\s\d{4}', # Mon DD, YYYY
r'\d{1,2}-[A-Z][a-z]{2}-\d{4}' # DD-Mon-YYYY
]
print("Date extraction:")
for pattern in date_patterns:
    dates = re.findall(pattern, text4)
    print(f"  Pattern '{pattern}': {dates}")
2.4 Named Groups and Complex Patterns
import re
from typing import Dict, Any
# Named groups for structured extraction
log_line = '192.168.1.100 - - [15/Jan/2024:10:30:45 +0000] "GET /api/users HTTP/1.1" 200 1234'
log_pattern = r'(?P<ip>\d+\.\d+\.\d+\.\d+) - - \[(?P<timestamp>[^\]]+)\] "(?P<method>\w+) (?P<path>\S+) (?P<protocol>\S+)" (?P<status>\d+) (?P<size>\d+)'
match = re.match(log_pattern, log_line)
if match:
    data = match.groupdict()
    print("Parsed log entry:")
    for key, value in data.items():
        print(f"  {key}: {value}")
print()
# Advanced: Lookahead and Lookbehind
text = """
Price: $19.99 (was $29.99)
Price: $5.50 (was $8.00)
Free shipping on orders over $50.00
"""
# Extract prices with context using lookahead/lookbehind
price_pattern = r'(?<=\$)\d+\.\d{2}'
prices = re.findall(price_pattern, text)
print("Price extraction with lookbehind:")
print(f" Prices found: {prices}")
print()
# Extract prices that are on sale (followed by "was")
sale_pattern = r'(?<=Price: \$)\d+\.\d{2}(?= \(was)'
sale_prices = re.findall(sale_pattern, text)
print("Sale prices (with lookahead):")
print(f" Sale prices: {sale_prices}")
print()
# Substitution with function
def discount_price(match):
    price = float(match.group(0))
    discounted = price * 0.9  # 10% off
    # The lookbehind leaves the "$" in place, so return only the digits
    return f"{discounted:.2f}"
original = "Price: $100.00"
discounted = re.sub(r'(?<=\$)\d+\.\d{2}', discount_price, original)
print(f"Discount substitution:")
print(f" Original: {original}")
print(f" Discounted: {discounted}")
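Regex Time Complexity
$$T_{\text{automaton}}(n, m) = O(n \cdot m), \qquad T_{\text{backtracking}}(n) = O(2^n)$$
Here,
- $n$ = Length of input string
- $m$ = Length of regex pattern
- $O(2^n)$ = Worst-case for standard (backtracking) regex engines
Python's re module uses a backtracking engine, which can take exponential time on pathological patterns (e.g., (a+)+b on aaaa...). For performance-critical applications, consider atomic groups or possessive quantifiers (available in the third-party regex module and in re since Python 3.11), or an automaton-based engine such as RE2.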
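The effect is easy to reproduce on small inputs. The sketch below times the nested pattern against an equivalent flat pattern, `a+b`, on strings that cannot match; the helper name is hypothetical, and the absolute timings depend on the machine and Python version, so only the growth trend is meaningful.

```python
import re
import time

def time_failed_match(pattern: str, text: str) -> float:
    """Return the seconds spent discovering that text does not match pattern."""
    compiled = re.compile(pattern)
    start = time.perf_counter()
    result = compiled.fullmatch(text)
    elapsed = time.perf_counter() - start
    assert result is None  # every input below lacks the trailing 'b'
    return elapsed

for n in (14, 16, 18):
    text = "a" * n
    nested = time_failed_match(r"(a+)+b", text)  # ambiguous: many ways to split the a's
    flat = time_failed_match(r"a+b", text)       # same language, no nested quantifier
    print(f"n={n:2}: nested {nested:.5f}s | flat {flat:.5f}s")
```

Both patterns accept exactly the same strings, so rewriting the nested quantifier removes the blow-up without changing what is matched.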
2.5 Regex Performance
import re
import time
# Benchmark: regex vs string methods
text = "Hello, my email is user@example.com and my phone is 555-123-4567" * 10000
# Method 1: String methods
start = time.perf_counter()
emails_str = []
parts = text.split()
for part in parts:
    if '@' in part and '.' in part:
        clean = part.strip('.,;:')
        if clean.count('@') == 1:
            emails_str.append(clean)
string_time = time.perf_counter() - start
# Method 2: Regex
email_pattern = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'
start = time.perf_counter()
emails_regex = re.findall(email_pattern, text)
regex_time = time.perf_counter() - start
print(f"Benchmark (text length: {len(text):,} chars):")
print(f" String methods: {string_time:.4f}s")
print(f" Regex: {regex_time:.4f}s")
print(f" Results match: {set(emails_str) == set(emails_regex)}")
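When the same pattern is applied to many inputs, it can be compiled once and reused. This is a small sketch with made-up sample lines; note that the re module also caches recently used patterns internally, so the gain from precompiling is often modest and the main benefit is a named, reusable pattern object.

```python
import re

# Compile once at module level, then reuse for every input
EMAIL_RE = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")

lines = [
    "Contact: user@example.com",
    "no address here",
    "cc: a.b@site.org, c@d.io",
]

found = [EMAIL_RE.findall(line) for line in lines]
for line, emails in zip(lines, found):
    print(f"{line!r:32} -> {emails}")
```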
3. Text Cleaning Pipeline
3.1 The Why and How of Text Cleaning
Raw text data is messy. Before any NLP task, you must:
- Normalize text to consistent format
- Remove noise that doesn't contribute to meaning
- Tokenize into processable units
- Standardize vocabulary for analysis
Definition: Text Normalization
The process of transforming text into a standard (canonical) form. This includes case normalization (lowercasing), Unicode normalization (NFKD/NFKC), whitespace normalization, and removing or replacing special characters. Normalization ensures that equivalent textual inputs produce identical representations.
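A short sketch of what Unicode normalization changes, using the standard library's `unicodedata` module. The `strip_accents` helper is illustrative and not part of any library; it decomposes characters and then drops the combining marks.

```python
import unicodedata

composed = "caf\u00e9"      # 'é' stored as a single codepoint
decomposed = "cafe\u0301"   # 'e' followed by a combining acute accent

# The two strings render identically but compare as different
assert composed != decomposed
assert unicodedata.normalize("NFC", decomposed) == composed
assert unicodedata.normalize("NFD", composed) == decomposed

# Compatibility forms (NFKC/NFKD) also fold look-alikes such as ligatures
assert unicodedata.normalize("NFKC", "\ufb01le") == "file"

def strip_accents(text: str) -> str:
    """Decompose characters, then drop the combining marks."""
    parts = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in parts if not unicodedata.combining(ch))

print(strip_accents("caf\u00e9 na\u00efve r\u00e9sum\u00e9"))
```

Stripping accents is lossy and language-dependent, so it is best treated as an option rather than a default step.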
3.2 Complete Text Cleaning Function
import re
import unicodedata
from typing import Any, Dict, List, Optional
from collections import Counter

def advanced_text_cleaner(
    text: str,
    lowercase: bool = True,
    remove_urls: bool = True,
    remove_emails: bool = True,
    remove_numbers: bool = False,
    remove_punctuation: bool = True,
    remove_special_chars: bool = True,
    remove_extra_whitespace: bool = True,
    min_word_length: int = 1,
    max_word_length: Optional[int] = None
) -> Dict[str, Any]:
    """
    Comprehensive text cleaning function.

    Args:
        text: Input text to clean
        lowercase: Convert to lowercase
        remove_urls: Remove URLs
        remove_emails: Remove email addresses
        remove_numbers: Remove numbers
        remove_punctuation: Remove punctuation
        remove_special_chars: Remove special characters
        remove_extra_whitespace: Normalize whitespace
        min_word_length: Minimum word length to keep
        max_word_length: Maximum word length to keep

    Returns:
        Dictionary with cleaned text and statistics
    """
    original = text
    stats = {
        "original_length": len(text),
        "operations_performed": []
    }

    # Step 1: Normalize Unicode (decomposes accented characters)
    text = unicodedata.normalize('NFKD', text)
    stats["operations_performed"].append("unicode_normalize")

    # Step 2: Remove URLs
    if remove_urls:
        url_pattern = r'https?://[^\s<>"]+|www\.[^\s<>"]+'
        text = re.sub(url_pattern, '', text)
        stats["operations_performed"].append("remove_urls")

    # Step 3: Remove emails
    if remove_emails:
        email_pattern = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'
        text = re.sub(email_pattern, '', text)
        stats["operations_performed"].append("remove_emails")

    # Step 4: Remove numbers
    if remove_numbers:
        text = re.sub(r'\d+', '', text)
        stats["operations_performed"].append("remove_numbers")

    # Step 5: Remove punctuation
    if remove_punctuation:
        text = re.sub(r'[^\w\s]', '', text)
        stats["operations_performed"].append("remove_punctuation")

    # Step 6: Remove special characters (anything outside ASCII letters/digits)
    if remove_special_chars:
        text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
        stats["operations_performed"].append("remove_special_chars")

    # Step 7: Lowercase
    if lowercase:
        text = text.lower()
        stats["operations_performed"].append("lowercase")

    # Step 8: Remove extra whitespace
    if remove_extra_whitespace:
        text = re.sub(r'\s+', ' ', text).strip()
        stats["operations_performed"].append("remove_whitespace")

    # Step 9: Filter by word length
    words = text.split()
    if min_word_length > 1 or max_word_length:
        words = [
            w for w in words
            if len(w) >= min_word_length
            and (max_word_length is None or len(w) <= max_word_length)
        ]
        text = ' '.join(words)
        stats["operations_performed"].append("filter_word_length")

    stats["cleaned_length"] = len(text)
    # Guard against empty input to avoid division by zero
    stats["reduction_percent"] = round(
        (1 - len(text) / len(original)) * 100, 2
    ) if original else 0.0

    return {
        "cleaned_text": text,
        "statistics": stats
    }
# Example usage
raw_text = """
Check out https://www.example.com for more info!
Contact us at support@company.com or sales@corp.org.
We have 150+ products priced at $29.99, $45.50, etc.
Our office is in New York City (NYC)!!!
Call 555-123-4567 for more info...
Here's some HTML: <p>Hello</p> & <div>World</div>
And unicode: café, naïve, résumé
"""
result = advanced_text_cleaner(raw_text)
print("Text Cleaning Pipeline:")
print("=" * 60)
print(f"\nOriginal ({len(raw_text)} chars):")
print(f" {raw_text[:100]}...")
print(f"\nCleaned ({len(result['cleaned_text'])} chars):")
print(f" {result['cleaned_text']}")
print(f"\nStatistics:")
print(f" Operations: {result['statistics']['operations_performed']}")
print(f" Reduction: {result['statistics']['reduction_percent']}%")
Output:
Text Cleaning Pipeline:
============================================================
Original (387 chars):
Check out https://www.example.com for more info!
Contact us at support@company.com or sales@corp.org.
We have 150+ products priced at $29.99, $45.50, etc.
Our office is in New York City (NYC)!!!
Call 555-123-4567 for more info...
Here's some HTML: <p>Hello</p> & <div>World</div>
And unicode: café, naïve, résumé
Cleaned (216 chars):
 check out for more info contact us at or we have 150 products priced at 2999 4550 etc our office is in new york city nyc call 5551234567 for more info heres some html phellop divworlddiv and unicode cafe naive resume
Statistics:
 Operations: ['unicode_normalize', 'remove_urls', 'remove_emails', 'remove_punctuation', 'remove_special_chars', 'lowercase', 'remove_whitespace']
 Reduction: 44.19%
3.3 Before/After Examples
test_cases = [
    # (description, input, cleaner options, expected_output)
    ("URL removal",
     "Visit https://www.example.com/path?q=1 today",
     {},
     "visit today"),
    ("Email removal",
     "Email john.doe@company.co.uk for info",
     {},
     "email for info"),
    ("Number removal",  # numbers are kept by default, so enable the option
     "I have 3 cats and 2 dogs",
     {"remove_numbers": True},
     "i have cats and dogs"),
    ("Punctuation removal",
     "Hello, World! How's it going?",
     {},
     "hello world hows it going"),
    ("Extra whitespace",
     " too many spaces ",
     {},
     "too many spaces"),
    ("Mixed cleanup",
     "Price: $19.99!!! Visit https://sale.com #discount",
     {"remove_numbers": True},
     "price visit discount")
]
print("Before/After Examples:")
print("=" * 80)
for desc, input_text, options, expected in test_cases:
    result = advanced_text_cleaner(input_text, **options)
    print(f"\n{desc}:")
    print(f"  Input: '{input_text}'")
    print(f"  Output: '{result['cleaned_text']}'")
    print(f"  Expected: '{expected}'")
    print(f"  Match: {result['cleaned_text'] == expected}")
4. Unicode and Encoding
4.1 Why Encoding Matters
Different encodings represent characters differently. Mixing encodings causes:
- Garbled text (mojibake)
- Program crashes
- Data corruption in databases
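Mojibake is easy to reproduce: encode text with one encoding and decode the bytes with another. The following sketch garbles a word on purpose, repairs it by reversing the mistake, and shows the kind of exception that causes crashes when decoding is strict.

```python
original = "caf\u00e9"  # café

# Written as UTF-8 but read as Latin-1: each byte becomes its own character
garbled = original.encode("utf-8").decode("latin-1")
print(f"Garbled:  {garbled}")

# Reversing the wrong step recovers the text, provided no bytes were lost
repaired = garbled.encode("latin-1").decode("utf-8")
print(f"Repaired: {repaired}")
assert repaired == original

# The opposite mix-up fails loudly instead of garbling
latin1_bytes = original.encode("latin-1")
try:
    latin1_bytes.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError as exc:
    decode_failed = True
    print(f"Strict decode failed: {exc.reason}")
```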
4.2 Common Encodings
| Encoding | Characters | Use Case | Size |
|---|---|---|---|
| ASCII | 128 | English text, legacy systems | 1 byte/char |
| Latin-1 | 256 | Western European | 1 byte/char |
| UTF-8 | 1,112,064 | Universal (variable) | 1-4 bytes |
| UTF-16 | 1,112,064 | Windows, Java (variable) | 2-4 bytes |
| UTF-32 | 1,112,064 | Fixed-width (rare) | 4 bytes/char |
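UTF-8 Byte Encoding
$$n(c) = \begin{cases} 1 & \text{if } c \le \text{0x7F} \\ 2 & \text{if } \text{0x80} \le c \le \text{0x7FF} \\ 3 & \text{if } \text{0x800} \le c \le \text{0xFFFF} \\ 4 & \text{if } \text{0x10000} \le c \le \text{0x10FFFF} \end{cases}$$
Here,
- $c$ = Unicode codepoint value
- $n$ = Number of bytes in UTF-8 representation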
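The byte-length rule for UTF-8 can be written as a small function and checked against Python's own encoder. `utf8_length` is a hypothetical helper introduced only for this illustration.

```python
def utf8_length(codepoint: int) -> int:
    """Return how many bytes UTF-8 needs for a single codepoint."""
    if codepoint <= 0x7F:
        return 1
    if codepoint <= 0x7FF:
        return 2
    if codepoint <= 0xFFFF:
        return 3
    if codepoint <= 0x10FFFF:
        return 4
    raise ValueError(f"not a Unicode codepoint: {codepoint:#x}")

# Compare the rule with the real encoder for one character per range
for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
    expected = len(ch.encode("utf-8"))
    assert utf8_length(ord(ch)) == expected
    print(f"U+{ord(ch):04X}: {expected} byte(s)")
```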
4.3 Encoding Operations
# Demonstrate encoding differences
text = "Hello, World! café"
# Different encodings
encodings = ['ascii', 'latin-1', 'utf-8', 'utf-16']
print("Encoding comparison:")
print(f"Original: {text}")
print()
for enc in encodings:
    try:
        encoded = text.encode(enc)
        decoded = encoded.decode(enc)  # round trip back to str
        print(f"{enc:10}: {len(encoded):3} bytes | {encoded[:30]}...")
    except UnicodeEncodeError as e:
        print(f"{enc:10}: ENCODING ERROR - {e}")
print()
# Handle encoding errors gracefully
def safe_decode(byte_data: bytes, encodings: list = None) -> str:
    """Try multiple encodings to decode bytes."""
    if encodings is None:
        # latin-1 accepts every byte sequence, so it must come last
        encodings = ['utf-8', 'cp1252', 'latin-1']
    for enc in encodings:
        try:
            return byte_data.decode(enc)
        except UnicodeDecodeError:
            continue
    # Only reached when a custom list contains no catch-all encoding
    return byte_data.decode('utf-8', errors='replace')
# Example with problematic bytes
problematic = b'caf\xe9'  # Latin-1 encoding of café
print(f"Safe decode: {safe_decode(problematic)}")
When working with international text data, always specify encoding='utf-8' explicitly. Don't rely on system defaults, which vary by OS and locale. UTF-8 is the universal standard and should be your default choice.
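In practice this means passing `encoding="utf-8"` on both the write and the read. The sketch below uses a temporary directory and an arbitrary file name, `sample.txt`, purely for illustration.

```python
import tempfile
from pathlib import Path

text = "caf\u00e9, na\u00efve, r\u00e9sum\u00e9"

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sample.txt"  # hypothetical file name

    # Write and read with the same explicit encoding
    path.write_text(text, encoding="utf-8")
    with open(path, encoding="utf-8") as f:
        round_trip = f.read()

    raw = path.read_bytes()

assert round_trip == text
print(f"{len(text)} characters stored as {len(raw)} bytes")
```

The byte count exceeds the character count because each accented letter takes two bytes in UTF-8.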
5. String Operations on Pandas Series
5.1 The str Accessor
import pandas as pd
import numpy as np
# Create sample DataFrame
data = {
'raw_text': [
' John.Doe@company.com ',
'JANE_SMITH@ORG.ORG',
'bob@example.net',
'ALICE@TEST.COM',
'charlie@domain.co.uk'
],
'price': ['$19.99', '$25.50', 'N/A', '$30.00', '$15.75'],
'date': ['2024-01-15', '2024-02-20', 'invalid', '2024-03-10', '2024-04-05']
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
print()
# String operations using str accessor
print("String operations on 'raw_text':")
df['clean_email'] = df['raw_text'].str.strip().str.lower()
df['username'] = df['clean_email'].str.split('@').str[0]
df['domain'] = df['clean_email'].str.split('@').str[1]
df['has_numbers'] = df['clean_email'].str.contains(r'\d', regex=True)
df['email_length'] = df['clean_email'].str.len()
print(df[['raw_text', 'clean_email', 'username', 'domain', 'has_numbers']])
print()
# String operations on 'price'
print("String operations on 'price':")
df['price_clean'] = df['price'].str.replace('$', '', regex=False)
df['price_numeric'] = pd.to_numeric(df['price_clean'], errors='coerce')
print(df[['price', 'price_clean', 'price_numeric']])
print(f"\nPrice statistics:")
print(f" Mean: ${df['price_numeric'].mean():.2f}")
print(f" Median: ${df['price_numeric'].median():.2f}")
print(f" Missing: {df['price_numeric'].isna().sum()}")
print()
# Advanced: Extract patterns from text
product_names = pd.Series([
'iPhone 14 Pro Max',
'Samsung Galaxy S23',
'Google Pixel 7 Pro',
'OnePlus 11 5G',
'Xiaomi 13 Ultra'
])
print("Pattern extraction from product names:")
df_products = pd.DataFrame({'product': product_names})
df_products['brand'] = df_products['product'].str.split().str[0]
df_products['model'] = df_products['product'].str.extract(r'(\d+)', expand=False)  # expand=False returns a Series
df_products['has_pro'] = df_products['product'].str.contains('Pro', case=False)
df_products['has_5g'] = df_products['product'].str.contains('5G')
print(df_products)
5.2 Vectorized String Operations
import pandas as pd
import time
# Performance comparison
n = 1_000_000
text_data = pd.Series(['Hello World'] * n)
# Method 1: Python loop
start = time.perf_counter()
result_loop = pd.Series([x.lower() for x in text_data])
loop_time = time.perf_counter() - start
# Method 2: Pandas str accessor
start = time.perf_counter()
result_pandas = text_data.str.lower()
pandas_time = time.perf_counter() - start
print(f"Performance comparison ({n:,} strings):")
print(f" Python loop: {loop_time:.4f}s")
print(f" Pandas str: {pandas_time:.4f}s")
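print(f" Ratio (loop time / pandas time): {loop_time/pandas_time:.1f}x")
For the default object dtype, pandas .str methods still loop over Python string objects internally, so the speedup over a list comprehension is usually modest and the ratio can even fall below 1x. The accessor's main advantages are concise, chainable syntax and built-in handling of missing values. Arrow-backed string columns (dtype "string[pyarrow]") can be substantially faster on large data, so benchmark on your own data before assuming a speedup.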
6. NLP Preprocessing
6.1 Tokenization
Definition: Tokenization
The process of breaking text into discrete units called tokens. Tokens can be words, subwords, characters, or sentences. Word tokenization splits on whitespace/punctuation, while subword tokenization (BPE, WordPiece) handles rare words by decomposing them into common subunits.
import re
from typing import List
def simple_tokenizer(text: str) -> List[str]:
    """Basic whitespace tokenizer."""
    return text.split()

def word_tokenizer(text: str) -> List[str]:
    """Split on word boundaries."""
    return re.findall(r'\b\w+\b', text.lower())

def sentence_tokenizer(text: str) -> List[str]:
    """Split on sentence-ending punctuation, dropping empty pieces."""
    return [s.strip() for s in re.split(r'[.!?]+', text) if s.strip()]
# Example
text = "Hello, World! This is a test. It has multiple sentences."
print("Tokenization examples:")
print(f" Input: {text}")
print(f" Whitespace: {simple_tokenizer(text)}")
print(f" Word: {word_tokenizer(text)}")
print(f" Sentence: {sentence_tokenizer(text)}")
6.2 Stop Words Removal
# Common English stop words (partial list)
STOP_WORDS = {
'i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves',
'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him',
'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its',
'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what',
'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am',
'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has',
'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the',
'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of',
'at', 'by', 'for', 'with', 'about', 'against', 'between', 'through',
'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up',
'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further',
'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all',
'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no',
'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's',
't', 'can', 'will', 'just', 'don', 'should', 'now'
}
def remove_stopwords(tokens: List[str], custom_stopwords: set = None) -> List[str]:
    """Remove stop words from token list."""
    stop_words = STOP_WORDS.copy()
    if custom_stopwords:
        stop_words.update(custom_stopwords)
    return [token for token in tokens if token.lower() not in stop_words]
# Example
tokens = ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
print(f"Original tokens: {tokens}")
print(f"After stopword removal: {remove_stopwords(tokens)}")
6.3 Stemming vs Lemmatization
Definition: Stemming
A heuristic process that chops off word suffixes to reduce words to their root form (stem). Stems may not be valid words (e.g., "running" → "run", "studies" → "studi"). Fast but crude; used when dictionary lookup is too expensive.
Definition: Lemmatization
A morphological analysis that reduces words to their dictionary form (lemma). Unlike stemming, lemmatization produces valid words using vocabulary and morphological analysis (e.g., "better" → "good", "running" → "run"). More accurate but computationally expensive.
from typing import List
# Simple stemmer (Porter Stemmer approximation)
def simple_stemmer(word: str) -> str:
    """Very basic stemmer for demonstration."""
    suffixes = ['ing', 'tion', 'ness', 'ment', 'able', 'ible', 'ly', 'ed', 'er', 'est', 's']
    for suffix in suffixes:
        # Keep at least 3 characters of stem
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

# Simple lemmatizer
IRREGULAR_FORMS = {
    'running': 'run', 'better': 'good',
    'worse': 'bad', 'best': 'good', 'worst': 'bad',
    'went': 'go', 'going': 'go', 'gone': 'go',
    'was': 'be', 'were': 'be', 'been': 'be',
    'had': 'have', 'having': 'have'
}

def simple_lemmatizer(word: str) -> str:
    """Basic lemmatizer using irregular forms."""
    if word.lower() in IRREGULAR_FORMS:
        return IRREGULAR_FORMS[word.lower()]
    # Basic rules
    if word.endswith('ies'):
        return word[:-3] + 'y'
    elif word.endswith('ing'):
        return word[:-3]
    elif word.endswith('ed'):
        return word[:-2]
    elif word.endswith('s') and not word.endswith('ss'):
        return word[:-1]
    return word
# Comparison
test_words = ['running', 'better', 'studies', 'jumping', 'worses', 'happiness']
print("Stemming vs Lemmatization:")
print(f"{'Word':<12} {'Stemmed':<12} {'Lemmatized':<12}")
print("-" * 36)
for word in test_words:
    print(f"{word:<12} {simple_stemmer(word):<12} {simple_lemmatizer(word):<12}")
6.4 Complete NLP Pipeline
import re
from typing import Any, Dict, List
from collections import Counter

class NLPPipeline:
    """Complete NLP preprocessing pipeline."""

    def __init__(self):
        self.stop_words = STOP_WORDS

    def preprocess(
        self,
        text: str,
        lowercase: bool = True,
        remove_stopwords: bool = True,
        min_length: int = 2
    ) -> Dict[str, Any]:
        """
        Complete NLP preprocessing pipeline.

        Returns:
            Dictionary with tokens, frequencies, and metadata
        """
        # Step 1: Basic cleaning
        if lowercase:
            text = text.lower()
        # Step 2: Remove special characters
        text = re.sub(r'[^a-zA-Z\s]', '', text)
        # Step 3: Remove extra whitespace
        text = re.sub(r'\s+', ' ', text).strip()
        # Step 4: Tokenize
        tokens = text.split()
        # Step 5: Remove stopwords
        if remove_stopwords:
            tokens = [t for t in tokens if t not in self.stop_words]
        # Step 6: Filter by length
        tokens = [t for t in tokens if len(t) >= min_length]
        # Step 7: Compute statistics
        freq = Counter(tokens)
        vocab = sorted(set(tokens))
        return {
            'tokens': tokens,
            'vocabulary': vocab,
            'frequencies': dict(freq.most_common(10)),
            'unique_words': len(vocab),
            'total_tokens': len(tokens),
            'type_token_ratio': len(vocab) / len(tokens) if tokens else 0
        }
# Example usage
pipeline = NLPPipeline()
sample_texts = [
"The quick brown fox jumps over the lazy dog. This is a simple sentence.",
"Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence.",
"Data science involves extracting insights from data using scientific methods, algorithms, and systems."
]
print("NLP Pipeline Results:")
print("=" * 80)
for i, text in enumerate(sample_texts, 1):
    result = pipeline.preprocess(text)
    print(f"\nText {i}:")
    print(f"  Original: {text[:60]}...")
    print(f"  Tokens: {result['tokens'][:10]}...")
    print(f"  Unique words: {result['unique_words']}")
    print(f"  Type-Token Ratio: {result['type_token_ratio']:.2f}")
    print(f"  Top 5 words: {list(result['frequencies'].items())[:5]}")
Type-Token Ratio
$$\mathrm{TTR} = \frac{|V|}{N}$$
Here,
- $\mathrm{TTR}$ = Type-Token Ratio (vocabulary richness)
- $V$ = Set of unique words
- $N$ = Total number of word occurrences
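As a small worked example of the ratio, the sketch below computes it for a six-token sentence; the function name is illustrative. The sentence has 5 distinct words among 6 tokens, so the ratio is 5/6.

```python
from typing import List

def type_token_ratio(tokens: List[str]) -> float:
    """Unique tokens divided by total tokens (0.0 for empty input)."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0

tokens = "the cat sat on the mat".split()
ttr = type_token_ratio(tokens)
print(f"{len(set(tokens))} types / {len(tokens)} tokens = {ttr:.3f}")
```

Because the ratio tends to fall as texts get longer, it is most meaningful when comparing texts of similar length.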
7. Regex Cheat Sheet Table
| Pattern | Description | Example | Matches |
|---|---|---|---|
| . | Any character | a.c | abc, a1c, a c |
| ^ | Start of string | ^Hello | Hello World |
| $ | End of string | World$ | Hello World |
| * | Zero or more | ab*c | ac, abc, abbc |
| + | One or more | ab+c | abc, abbc (not ac) |
| ? | Zero or one | colou?r | color, colour |
| \d | Digit | \d+ | 123 |
| \D | Non-digit | \D+ | abc |
| \w | Word char | \w+ | hello_123 |
| \W | Non-word | \W+ | !@# |
| \s | Whitespace | \s+ | spaces, tabs |
| \S | Non-space | \S+ | hello |
| [abc] | Character class | [aeiou] | vowels |
| [^abc] | Negated class | [^0-9] | non-digits |
| (abc) | Capture group | (ab)+ | ab, abab |
| (?:abc) | Non-capturing | (?:ab)+ | ab, abab |
| (?P<name>...) | Named group | (?P<year>\d{4}) | captures as 'year' |
| a\|b | OR | cat\|dog | cat or dog |
| {n} | Exactly n | \d{3} | 123 |
| {n,} | At least n | \d{2,} | 12, 123 |
| {n,m} | Between n,m | \d{2,4} | 12, 123, 1234 |
| \b | Word boundary | \bword\b | word in "a word b" |
| \A | Start of string | \AHello | Hello |
| \Z | End of string | World\Z | World |
| (?!...) | Negative lookahead | foo(?!bar) | foo not followed by bar |
| (?<=...) | Positive lookbehind | (?<=\$)\d+ | digits after $ |
| (?<!...) | Negative lookbehind | (?<!\$)\d+ | digits not after $ |
Key Takeaways
Summary: String & Text Processing
- String Immutability: All string operations create new strings. For repeated concatenation, use ''.join() instead of +=.
- Regular Expressions: Compile patterns with re.compile() for repeated use; use raw strings r'pattern' to avoid escape issues; named groups improve readability.
- Text Cleaning Pipeline: Always normalize Unicode before processing; remove noise first; lowercase for case-insensitive analysis; tokenize before further processing; remove stopwords after tokenization.
- Encoding Matters: Always specify encoding when reading/writing files; use UTF-8 as default for international text; handle encoding errors gracefully with errors='replace'.
- Pandas String Operations: Use the .str accessor for concise, missing-value-aware column operations; chain methods for complex transformations; benchmark before assuming a speedup over Python loops.
- NLP Preprocessing: Tokenization → Stopword removal → Stemming/Lemmatization; Type-Token Ratio measures vocabulary richness.
Practice Exercises
Exercise 1: Regex Pattern Writing
Write regex patterns to:
- Extract all hashtags from social media text
- Validate phone numbers in multiple formats
- Extract price values with currency symbols
- Find sentences containing specific keywords
Exercise 2: Text Cleaning Pipeline
Build a text cleaning function that:
- Handles HTML tags
- Removes emojis (with option to keep)
- Normalizes contractions (don't → do not)
- Expands abbreviations (NYC → New York City)
Test with 10 different messy text samples.
Exercise 3: Pandas String Processing
Given a DataFrame with a "product_description" column:
- Extract brand names
- Identify product categories
- Extract numerical specifications (size, weight, etc.)
- Create a feature matrix for machine learning
Exercise 4: NLP Comparison
Compare the output of stemming vs lemmatization on:
- 50 product reviews
- 20 news articles
- Discuss which produces better results for topic modeling
Exercise 5: Performance Optimization
Benchmark different text processing approaches:
- Python loops vs list comprehensions
- Pandas .str methods vs apply()
- Regex vs string methods for pattern matching
Create a report with recommendations for different data sizes.