String & Text Processing for Data Science

Why This Matters for Data Science

A commonly cited estimate holds that roughly 80% of enterprise data is unstructured, and much of that is text. From social media posts and customer reviews to medical records and legal documents, the ability to clean, transform, and analyze text is essential. This tutorial covers string operations, regular expressions, and NLP preprocessing techniques, with a focus on practical data science applications.


1. String Fundamentals

1.1 String Properties

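Strings in Python are immutable sequences of Unicode code points. When text is written to disk or sent over a network, those code points are encoded as bytes, most commonly with UTF-8.

Definition: UTF-8 (Unicode Transformation Format, 8-bit)

A variable-width character encoding that uses 1-4 bytes per character. ASCII characters use 1 byte, while characters outside the ASCII range use 2-4 bytes. UTF-8 is backward-compatible with ASCII and is the dominant encoding for web and data interchange.

The following example inspects a string's length, memory footprint, and code points: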

import sys

s = "Hello, Data Science!"

print(f"String: {s}")
print(f"Length: {len(s)} characters")
print(f"Memory: {sys.getsizeof(s)} bytes")
print(f"Type: {type(s)}")
print()

# Unicode codepoints
print("Character details:")
for i, char in enumerate(s[:10]):
    print(f"  {i}: '{char}' → U+{ord(char):04X} ({ord(char)})")

Output:

String: Hello, Data Science!
Length: 20 characters
Memory: 69 bytes
Type: <class 'str'>

Character details:
  0: 'H' → U+0048 (72)
  1: 'e' → U+0065 (101)
  2: 'l' → U+006C (108)
  3: 'l' → U+006C (108)
  4: 'o' → U+006F (111)
  5: ',' → U+002C (44)
  6: ' ' → U+0020 (32)
  7: 'D' → U+0044 (68)
  8: 'a' → U+0061 (97)
  9: 't' → U+0074 (116)

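Conceptually, a Python 3 string is a sequence of Unicode code points. Since Python 3.3 (PEP 393), CPython picks the most compact fixed-width representation that fits the largest code point in the string: 1 byte per character for Latin-1 (code points up to U+00FF), 2 bytes per character for UCS-2 (up to U+FFFF), and 4 bytes per character for UCS-4. The exact value reported by sys.getsizeof therefore depends on the string's contents and on the CPython version.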
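The storage width can be observed indirectly. The sketch below is specific to CPython and relies on sys.getsizeof; the helper name bytes_per_char is hypothetical. It grows a string by one character and reports how many extra bytes that costs.

```python
import sys

def bytes_per_char(ch: str, n: int = 100) -> int:
    """Estimate storage per character by growing a string by one character."""
    return sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch * n)

samples = {
    "ASCII (U+0061)": "a",
    "Latin-1 (U+00E9)": "\u00e9",
    "BMP (U+20AC)": "\u20ac",
    "Astral (U+1F600)": "\U0001F600",
}

for label, ch in samples.items():
    print(f"{label:<18} {bytes_per_char(ch)} byte(s) per character")
```

One consequence worth keeping in mind: a single emoji in an otherwise ASCII string moves the whole string to the 4-byte representation.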

1.2 Indexing and Slicing

String Indexing:

Index:    0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19
        ┌───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┐
String: │ H │ e │ l │ l │ o │ , │   │ D │ a │ t │ a │   │ S │ c │ i │ e │ n │ c │ e │ ! │
        └───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┘
Negative: -20 -19 -18 -17 -16 -15 -14 -13 -12 -11 -10  -9  -8  -7  -6  -5  -4  -3  -2  -1

Slice notation: s[start:stop:step]

String Slice Complexity

$$O(\text{stop} - \text{start}) \text{ for slice } s[\text{start}:\text{stop}]$$

Here,

  • start = Starting index (inclusive)
  • stop = Ending index (exclusive)
  • step = Step size (default 1)

s = "Hello, Data Science!"

print("Indexing:")
print(f"  s[0] = '{s[0]}'")       # First char
print(f"  s[-1] = '{s[-1]}'")     # Last char
print(f"  s[7] = '{s[7]}'")      # 'D'
print()

print("Slicing:")
print(f"  s[0:5] = '{s[0:5]}'")     # First 5 chars
print(f"  s[7:11] = '{s[7:11]}'")   # 'Data'
print(f"  s[:7] = '{s[:7]}'")       # Everything before index 7
print(f"  s[13:] = '{s[13:]}'")     # Everything from index 13
print(f"  s[::2] = '{s[::2]}'")     # Every 2nd char
print(f"  s[::-1] = '{s[::-1]}'")   # Reverse string
print(f"  s[::3] = '{s[::3]}'")     # Every 3rd char
print()

# Negative slicing
print("Negative slicing:")
print(f"  s[-5:] = '{s[-5:]}'")     # Last 5 chars
print(f"  s[-10:-5] = '{s[-10:-5]}'")  # 10th to 5th from end
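
Negative and omitted bounds are resolved against the string length before the slice is taken. As a small sketch, the built-in slice.indices method shows the concrete values Python ends up using, and the last lines contrast slicing (which clamps out-of-range bounds) with plain indexing (which raises).

```python
s = "Hello, Data Science!"

# slice.indices(length) resolves negative and omitted bounds to concrete values
examples = (slice(-5, None), slice(None, None, -1), slice(7, 100), slice(None, None, 3))
for sl in examples:
    start, stop, step = sl.indices(len(s))
    print(f"{str(sl):<26} start={start:>2} stop={stop:>2} step={step:>2} -> {s[sl]!r}")

# Out-of-range slices are clamped and return an empty string
print(repr(s[100:]))

# Out-of-range indexing raises instead
try:
    s[100]
except IndexError as exc:
    print(f"IndexError: {exc}")
```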

1.3 String Formatting

name = "Alice Johnson"
age = 30
score = 95.6789
items = ["Python", "SQL", "ML"]

# f-strings (Python 3.6+) - RECOMMENDED
print("f-string formatting:")
print(f"  Basic: {name} is {age} years old")
print(f"  Decimal: {score:.2f}")
print(f"  Integer: {int(score)}")
print(f"  Padding: {name:>20}")
print(f"  Left align: {name:<20}")
print(f"  Center: {name:^20}")
print(f"  Filled: {name:*^20}")
print(f"  Comma separator: {1234567:,}")
print(f"  Percentage: {0.856:.1%}")
print(f"  Scientific: {123456:.2e}")
print(f"  Binary: {42:b}")
print(f"  Hex: {255:x}")
print(f"  Octal: {42:o}")
print()

# .format() method
print(".format() method:")
print("  {} is {} years old".format(name, age))
print("  {1} is {0} years old".format(age, name))  # Indexed
print("  {name} scored {score:.1f}".format(name=name, score=score))
print()

# Template strings
from string import Template
t = Template("$name is $age years old")
print(f"Template: {t.substitute(name=name, age=age)}")
print()

# Lambda formatting (for dynamic formats)
format_funcs = {
    "int": lambda x: f"{x:.0f}",
    "float2": lambda x: f"{x:.2f}",
    "percent": lambda x: f"{x:.1%}",
    "scientific": lambda x: f"{x:.2e}"
}

value = 1234.5678
print("Dynamic formatting:")
for fmt_name, fmt_func in format_funcs.items():
    print(f"  {fmt_name}: {fmt_func(value)}")

1.4 Common String Methods

text = "  Hello, World! This is a TEST string.  "

print("Case methods:")
print(f"  Original: '{text}'")
print(f"  upper(): '{text.upper()}'")
print(f"  lower(): '{text.lower()}'")
print(f"  title(): '{text.title()}'")
print(f"  capitalize(): '{text.capitalize()}'")
print(f"  swapcase(): '{text.swapcase()}'")
print()

print("Search methods:")
print(f"  find('World'): {text.find('World')}")
print(f"  find('Python'): {text.find('Python')}")  # -1 if not found
print(f"  count('is'): {text.count('is')}")
print(f"  startswith('  Hello'): {text.startswith('  Hello')}")
print(f"  endswith('ring.  '): {text.endswith('ring.  ')}")
print(f"  'World' in text: {'World' in text}")
print()

print("Transform methods:")
print(f"  strip(): '{text.strip()}'")
print(f"  lstrip(): '{text.lstrip()}'")
print(f"  replace('TEST', 'example'): '{text.replace('TEST', 'example')}'")
print(f"  replace with count: '{text.replace('is', 'IS', 1)}'")
print()

print("Split/Join methods:")
words = text.strip().split()
print(f"  split(): {words}")
print(f"  split(','): {text.split(',')}")
print(f"  join: {' | '.join(words)}")
print()

print("Check methods:")
print(f"  isalpha(): {'Hello'.isalpha()}")
print(f"  isdigit(): {'123'.isdigit()}")
print(f"  isalnum(): {'Hello123'.isalnum()}")
print(f"  isspace(): {'   '.isspace()}")
print(f"  isnumeric(): {'½'.isnumeric()}")

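Building a string with + in a loop can take O(n²) time in the worst case, because strings are immutable and each concatenation may copy everything accumulated so far. CPython can sometimes extend the string in place, but that is an implementation detail and should not be relied on. Prefer ''.join(parts), which runs in O(n) because it computes the final length up front and copies each piece once.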
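
The three common ways of assembling a string from pieces are sketched below; the sample data is made up for illustration. All three produce the same result, so the choice is about cost and convenience rather than output.

```python
import io

words = [f"token{i}" for i in range(1000)]

# Pattern 1: repeated += (may copy the accumulated string on each step)
joined_plus = ""
for w in words:
    joined_plus += w + " "

# Pattern 2: collect the pieces, then join once
joined_join = " ".join(words) + " "

# Pattern 3: io.StringIO, convenient when pieces arrive incrementally
buf = io.StringIO()
for w in words:
    buf.write(w)
    buf.write(" ")
joined_buf = buf.getvalue()

print(joined_plus == joined_join == joined_buf)
```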


2. Regular Expressions

2.1 Why Regular Expressions?

Regular expressions provide a declarative way to match patterns in text. For data scientists, they are essential for:

  • Extracting structured data from unstructured text
  • Cleaning messy data
  • Validating input formats
  • Pattern-based feature engineering

Definition: Regular Expression

A formal language for describing sets of strings. Regular expressions are defined over an alphabet Σ and can be built from: (1) individual characters, (2) concatenation, (3) union (|), and (4) Kleene star (*). They correspond to finite automata, and an automaton-based matcher runs in O(n·m) time for input length n and pattern length m.
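
The four building blocks can be seen in a single small pattern. This is an illustrative sketch using Python's re module; the pattern and the candidate strings are made up for the example.

```python
import re

# (ab|c)*d : union of "ab" and "c", repeated zero or more times (Kleene star),
# followed by the single character "d" (concatenation)
pattern = re.compile(r"(?:ab|c)*d")

for candidate in ["d", "abd", "cabcd", "ad", "abab"]:
    accepted = pattern.fullmatch(candidate) is not None
    print(f"{candidate!r:<9} {'accepted' if accepted else 'rejected'}")
```

fullmatch is used so that the whole string must belong to the language, which mirrors the formal definition more closely than search or match.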

2.2 Regex Pattern Syntax

Regex Pattern Syntax Diagram:

Character Classes:
  [abc]     Match a, b, or c
  [^abc]    Match anything except a, b, or c
  [a-z]     Match any lowercase letter
  [A-Z]     Match any uppercase letter
  [0-9]     Match any digit
  [a-zA-Z]  Match any letter
  .         Match any character (except newline)

Quantifiers:
  *         Zero or more times
  +         One or more times
  ?         Zero or one time
  {n}       Exactly n times
  {n,}      At least n times
  {n,m}     Between n and m times

Anchors:
  ^         Start of string
  $         End of string
  \b        Word boundary

Groups:
  (...)     Capture group
  (?:...)   Non-capturing group
  (?P<name>...)  Named group

Special:
  |         OR
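  \d        Digit ([0-9]; also other Unicode digits unless re.ASCII is set)
  \D        Non-digit
  \w        Word character ([a-zA-Z0-9_]; also Unicode letters/digits unless re.ASCII is set)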
  \W        Non-word character
  \s        Whitespace
  \S        Non-whitespace
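
In Python 3, the shorthand classes are Unicode-aware by default for str patterns, which matters when cleaning international text. The following sketch (with made-up sample strings) compares the default behavior with the re.ASCII flag.

```python
import re

# The second number is written with Arabic-Indic digits (U+0663, U+0664)
text = "Order 12 and order \u0663\u0664"

unicode_digits = re.findall(r"\d+", text)           # default: any Unicode decimal digit
ascii_digits = re.findall(r"\d+", text, re.ASCII)   # restricted to [0-9]

accented = "caf\u00e9 na\u00efve"
unicode_words = re.findall(r"\w+", accented)         # accented letters count as word chars
ascii_words = re.findall(r"\w+", accented, re.ASCII) # accented letters split the words

print(unicode_digits, ascii_digits)
print(unicode_words, ascii_words)
```

If a pipeline is meant to keep only ASCII digits or letters, an explicit class such as [0-9] or the re.ASCII flag states that intent more clearly than \d or \w.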

2.3 Practical Regex Examples

import re
from typing import List, Dict, Tuple

# Example 1: Email extraction
text = """
Contact us at support@example.com or sales@company.org.
For urgent issues, email urgent@domain.co.uk.
Invalid emails: @missing.com, user@, no-at-sign.com
"""

email_pattern = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'
emails = re.findall(email_pattern, text)

print("Email extraction:")
print(f"  Pattern: {email_pattern}")
print(f"  Found: {emails}")
print()

# Example 2: Phone number extraction
text2 = """
Call us at (555) 123-4567 or 555-987-6543.
International: +1-555-111-2222 or +44 20 7946 0958.
"""

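# The optional country-code prefix is a non-capturing group (?:...) so that
# re.findall returns whole matches rather than only the captured prefix.
# The UK-style number (+44 20 7946 0958) does not fit this layout and is not matched.
phone_pattern = r'(?:\+?\d{1,3}[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}'
phones = re.findall(phone_pattern, text2)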

print("Phone extraction:")
print(f"  Found: {phones}")
print()

# Example 3: URL extraction
text3 = """
Visit https://www.example.com or http://blog.site.org/post?param=value
Also check https://docs.python.org/3/library/re.html
"""

url_pattern = r'https?://[a-zA-Z0-9.-]+(?:/[a-zA-Z0-9._%+-]*)*(?:\?[a-zA-Z0-9._&=-]*)?'
urls = re.findall(url_pattern, text3)

print("URL extraction:")
for url in urls:
    print(f"  {url}")
print()

# Example 4: Date extraction (multiple formats)
text4 = """
Dates: 01/15/2024, 2024-01-15, Jan 15, 2024, 15-Jan-2024
"""

date_patterns = [
    r'\d{2}/\d{2}/\d{4}',           # MM/DD/YYYY
    r'\d{4}-\d{2}-\d{2}',           # YYYY-MM-DD
    r'[A-Z][a-z]{2}\s\d{1,2},?\s\d{4}',  # Mon DD, YYYY
    r'\d{1,2}-[A-Z][a-z]{2}-\d{4}'  # DD-Mon-YYYY
]

print("Date extraction:")
for pattern in date_patterns:
    dates = re.findall(pattern, text4)
    print(f"  Pattern '{pattern}': {dates}")

2.4 Named Groups and Complex Patterns

import re
from typing import Dict, Any

# Named groups for structured extraction
log_line = '192.168.1.100 - - [15/Jan/2024:10:30:45 +0000] "GET /api/users HTTP/1.1" 200 1234'

log_pattern = r'(?P<ip>\d+\.\d+\.\d+\.\d+) - - \[(?P<timestamp>[^\]]+)\] "(?P<method>\w+) (?P<path>\S+) (?P<protocol>\S+)" (?P<status>\d+) (?P<size>\d+)'

match = re.match(log_pattern, log_line)
if match:
    data = match.groupdict()
    print("Parsed log entry:")
    for key, value in data.items():
        print(f"  {key}: {value}")
print()

# Advanced: Lookahead and Lookbehind
text = """
Price: $19.99 (was $29.99)
Price: $5.50 (was $8.00)
Free shipping on orders over $50.00
"""

# Extract prices with context using lookahead/lookbehind
price_pattern = r'(?<=\$)\d+\.\d{2}'
prices = re.findall(price_pattern, text)

print("Price extraction with lookbehind:")
print(f"  Prices found: {prices}")
print()

# Extract prices that are on sale (followed by "was")
sale_pattern = r'(?<=Price: \$)\d+\.\d{2}(?= \(was)'
sale_prices = re.findall(sale_pattern, text)

print("Sale prices (with lookahead):")
print(f"  Sale prices: {sale_prices}")
print()

# Substitution with function
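def discount_price(match):
    # The lookbehind keeps "$" out of the match, so return only the number;
    # adding "$" here would produce "$$90.00".
    price = float(match.group(0))
    discounted = price * 0.9  # 10% off
    return f"{discounted:.2f}"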

original = "Price: $100.00"
discounted = re.sub(r'(?<=\$)\d+\.\d{2}', discount_price, original)
print(f"Discount substitution:")
print(f"  Original: {original}")
print(f"  Discounted: {discounted}")

Regex Time Complexity

$$O(n \cdot m) \text{ for input length } n \text{ and pattern length } m$$

Here,

  • n = Length of input string
  • m = Length of regex pattern
  • O(n · m) = Worst case for automaton-based (non-backtracking) regex engines

Python's re module uses a backtracking engine, so this bound does not apply to it: pathological patterns such as (a+)+b on an input like aaaa... (with no b) can take exponential time. For performance-critical applications, consider rewriting the pattern to remove nested quantifiers, using atomic groups or possessive quantifiers (supported by re since Python 3.11 and by the third-party regex module), or switching to an RE2-based library.
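
The effect is easy to reproduce on small inputs. The sketch below times the nested pattern against an equivalent flat pattern (a+b accepts the same strings as (a+)+b); the input sizes are kept small on purpose, and the absolute timings will vary by machine.

```python
import re
import time

NESTED = re.compile(r"(a+)+b")  # nested quantifier: prone to catastrophic backtracking
FLAT = re.compile(r"a+b")       # accepts the same strings without nesting

def time_match(pattern: "re.Pattern[str]", text: str) -> float:
    """Return the seconds spent on one failed or successful match attempt."""
    start = time.perf_counter()
    pattern.match(text)
    return time.perf_counter() - start

for n in (10, 14, 18):
    text = "a" * n  # no trailing "b", so every attempt must fail
    print(f"n={n:2d}  nested: {time_match(NESTED, text):.5f}s  flat: {time_match(FLAT, text):.5f}s")
```

The nested timing should grow roughly by a constant factor for each added character, while the flat pattern stays close to constant at these sizes.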

2.5 Regex Performance

import re
import time

# Benchmark: regex vs string methods
text = "Hello, my email is user@example.com and my phone is 555-123-4567" * 10000

# Method 1: String methods
start = time.perf_counter()
emails_str = []
parts = text.split()
for part in parts:
    if '@' in part and '.' in part:
        clean = part.strip('.,;:')
        if clean.count('@') == 1:
            emails_str.append(clean)
string_time = time.perf_counter() - start

# Method 2: Regex
email_pattern = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'
start = time.perf_counter()
emails_regex = re.findall(email_pattern, text)
regex_time = time.perf_counter() - start

print(f"Benchmark (text length: {len(text):,} chars):")
print(f"  String methods: {string_time:.4f}s")
print(f"  Regex: {regex_time:.4f}s")
print(f"  Results match: {set(emails_str) == set(emails_regex)}")
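
When the same pattern is applied to many records, compiling it once with re.compile avoids repeated cache lookups and gives the pattern a name. A minimal sketch, with made-up records and a hypothetical constant name EMAIL_RE:

```python
import re

EMAIL_RE = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")

records = [
    "Contact: user@example.com",
    "No address here",
    "Two: a@x.org, b@y.net",
]

# finditer yields match objects lazily, which also exposes match positions
for record in records:
    found = [(m.group(0), m.start()) for m in EMAIL_RE.finditer(record)]
    print(f"{record!r}: {found}")
```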

3. Text Cleaning Pipeline

3.1 The Why and How of Text Cleaning

Raw text data is messy. Before any NLP task, you must:

  1. Normalize text to consistent format
  2. Remove noise that doesn't contribute to meaning
  3. Tokenize into processable units
  4. Standardize vocabulary for analysis

Definition: Text Normalization

The process of transforming text into a standard (canonical) form. This includes case normalization (lowercasing), Unicode normalization (NFKD/NFKC), whitespace normalization, and removing or replacing special characters. Normalization ensures that equivalent textual inputs produce identical representations.
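
To make the Unicode part concrete, here is a short sketch using the standard unicodedata module. It shows two visually identical strings that compare unequal until normalized, plus compatibility folding and case folding; the sample strings are illustrative.

```python
import unicodedata

composed = "caf\u00e9"      # 'e with acute' as a single code point
decomposed = "cafe\u0301"   # 'e' followed by a combining acute accent

same_before = composed == decomposed
same_after = unicodedata.normalize("NFC", decomposed) == composed
nfd_length = len(unicodedata.normalize("NFD", composed))

# Compatibility forms (NFKC/NFKD) also fold ligatures and full-width characters
folded = unicodedata.normalize("NFKC", "\ufb01le \uff11\uff12")

# casefold() is a more aggressive lower() intended for caseless comparison
caseless_equal = "STRASSE".casefold() == "stra\u00dfe".casefold()

print(same_before, same_after, nfd_length, repr(folded), caseless_equal)
```

NFKC and NFKD are lossy by design, so they suit search and matching better than storage of the original text.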

3.2 Complete Text Cleaning Function

import re
import unicodedata
from typing import Any, Dict, List, Optional
from collections import Counter

def advanced_text_cleaner(
    text: str,
    lowercase: bool = True,
    remove_urls: bool = True,
    remove_emails: bool = True,
    remove_numbers: bool = False,
    remove_punctuation: bool = True,
    remove_special_chars: bool = True,
    remove_extra_whitespace: bool = True,
    min_word_length: int = 1,
    max_word_length: Optional[int] = None
) -> Dict[str, Any]:
    """
    Comprehensive text cleaning function.
    
    Args:
        text: Input text to clean
        lowercase: Convert to lowercase
        remove_urls: Remove URLs
        remove_emails: Remove email addresses
        remove_numbers: Remove numbers
        remove_punctuation: Remove punctuation
        remove_special_chars: Remove special characters
        remove_extra_whitespace: Normalize whitespace
        min_word_length: Minimum word length to keep
        max_word_length: Maximum word length to keep
        
    Returns:
        Dictionary with cleaned text and statistics
    """
    original = text
    stats = {
        "original_length": len(text),
        "operations_performed": []
    }
    
    # Step 1: Normalize Unicode
    text = unicodedata.normalize('NFKD', text)
    stats["operations_performed"].append("unicode_normalize")
    
    # Step 2: Remove URLs
    if remove_urls:
        url_pattern = r'https?://[^\s<>"]+|www\.[^\s<>"]+'
        text = re.sub(url_pattern, '', text)
        stats["operations_performed"].append("remove_urls")
    
    # Step 3: Remove emails
    if remove_emails:
        email_pattern = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'
        text = re.sub(email_pattern, '', text)
        stats["operations_performed"].append("remove_emails")
    
    # Step 4: Remove numbers
    if remove_numbers:
        text = re.sub(r'\d+', '', text)
        stats["operations_performed"].append("remove_numbers")
    
    # Step 5: Remove punctuation
    if remove_punctuation:
        text = re.sub(r'[^\w\s]', '', text)
        stats["operations_performed"].append("remove_punctuation")
    
    # Step 6: Remove special characters
    if remove_special_chars:
        text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
        stats["operations_performed"].append("remove_special_chars")
    
    # Step 7: Lowercase
    if lowercase:
        text = text.lower()
        stats["operations_performed"].append("lowercase")
    
    # Step 8: Remove extra whitespace
    if remove_extra_whitespace:
        text = re.sub(r'\s+', ' ', text).strip()
        stats["operations_performed"].append("remove_whitespace")
    
    # Step 9: Filter by word length
    words = text.split()
    if min_word_length > 1 or max_word_length:
        words = [
            w for w in words 
            if len(w) >= min_word_length 
            and (max_word_length is None or len(w) <= max_word_length)
        ]
        text = ' '.join(words)
        stats["operations_performed"].append("filter_word_length")
    
    stats["cleaned_length"] = len(text)
    # Guard against division by zero when the input is an empty string
    stats["reduction_percent"] = (
        round((1 - len(text) / len(original)) * 100, 2) if original else 0.0
    )
    
    return {
        "cleaned_text": text,
        "statistics": stats
    }

# Example usage
raw_text = """
    Check out https://www.example.com for more info!
    Contact us at support@company.com or sales@corp.org.
    We have 150+ products priced at $29.99, $45.50, etc.
    Our office is in New York City (NYC)!!! 
    Call 555-123-4567 for more info...
    
    Here's some HTML: <p>Hello</p> & <div>World</div>
    And unicode: café, naïve, résumé
"""

result = advanced_text_cleaner(raw_text)

print("Text Cleaning Pipeline:")
print("=" * 60)
print(f"\nOriginal ({len(raw_text)} chars):")
print(f"  {raw_text[:100]}...")
print(f"\nCleaned ({len(result['cleaned_text'])} chars):")
print(f"  {result['cleaned_text']}")
print(f"\nStatistics:")
print(f"  Operations: {result['statistics']['operations_performed']}")
print(f"  Reduction: {result['statistics']['reduction_percent']}%")

Output (abridged; the exact character counts and reduction percentage depend on the whitespace in raw_text, so they are omitted here):

Text Cleaning Pipeline:
============================================================

Cleaned:
  check out for more info contact us at or we have 150 products priced at 2999 4550 etc our office is in new york city nyc call 5551234567 for more info heres some html phellop divworlddiv and unicode cafe naive resume

Statistics:
  Operations: ['unicode_normalize', 'remove_urls', 'remove_emails', 'remove_punctuation', 'remove_special_chars', 'lowercase', 'remove_whitespace']

Two details are worth noting. With the default remove_numbers=False, digits survive with their punctuation stripped, so "$29.99" becomes "2999". The cleaner also does not parse HTML: it only deletes the angle brackets and slashes, which leaves tag names fused to the surrounding words.

3.3 Before/After Examples

test_cases = [
    # (description, input, cleaner options, expected_output)
    ("URL removal",
     "Visit https://www.example.com/path?q=1 today",
     {},
     "visit today"),

    # The whole address is removed, not just the "@"
    ("Email removal",
     "Email john.doe@company.co.uk for info",
     {},
     "email for info"),

    # Numbers are kept by default, so this case enables remove_numbers
    ("Number removal",
     "I have 3 cats and 2 dogs",
     {"remove_numbers": True},
     "i have cats and dogs"),

    ("Punctuation removal",
     "Hello, World! How's it going?",
     {},
     "hello world hows it going"),

    ("Extra whitespace",
     "  too   many    spaces  ",
     {},
     "too many spaces"),

    ("Mixed cleanup",
     "Price: $19.99!!! Visit https://sale.com #discount",
     {"remove_numbers": True},
     "price visit discount")
]

print("Before/After Examples:")
print("=" * 80)
for desc, input_text, options, expected in test_cases:
    result = advanced_text_cleaner(input_text, **options)
    print(f"\n{desc}:")
    print(f"  Input:    '{input_text}'")
    print(f"  Output:   '{result['cleaned_text']}'")
    print(f"  Expected: '{expected}'")
    print(f"  Match: {result['cleaned_text'] == expected}")

4. Unicode and Encoding

4.1 Why Encoding Matters

Different encodings represent characters differently. Mixing encodings causes:

  • Garbled text (mojibake)
  • Program crashes
  • Data corruption in databases

4.2 Common Encodings

Encoding   Characters   Use Case                         Size
ASCII      128          English text, legacy systems     1 byte/char
Latin-1    256          Western European                 1 byte/char
UTF-8      1,112,064    Universal (variable width)       1-4 bytes/char
UTF-16     1,112,064    Windows, Java (variable width)   2 or 4 bytes/char
UTF-32     1,112,064    Fixed-width (rare)               4 bytes/char

UTF-8 Byte Encoding

$$\text{bytes} = \begin{cases} 1 & \text{if } U \leq 0x7F \\ 2 & \text{if } 0x80 \leq U \leq 0x7FF \\ 3 & \text{if } 0x800 \leq U \leq 0xFFFF \\ 4 & \text{if } 0x10000 \leq U \leq 0x10FFFF \end{cases}$$

Here,

  • U = Unicode codepoint value
  • bytes = Number of bytes in UTF-8 representation
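
The piecewise rule can be checked directly against Python's own encoder. The function name utf8_length below is hypothetical; the sketch simply applies the ranges above and compares the result with len(ch.encode("utf-8")).

```python
def utf8_length(codepoint: int) -> int:
    """Number of bytes UTF-8 needs for one code point, per the ranges above."""
    if codepoint <= 0x7F:
        return 1
    if codepoint <= 0x7FF:
        return 2
    if codepoint <= 0xFFFF:
        return 3
    if codepoint <= 0x10FFFF:
        return 4
    raise ValueError(f"not a Unicode code point: {codepoint:#x}")

# One sample per range: 'A', e-acute, euro sign, grinning-face emoji
for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
    predicted = utf8_length(ord(ch))
    actual = len(ch.encode("utf-8"))
    print(f"U+{ord(ch):04X}: predicted {predicted}, actual {actual}")
```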

4.3 Encoding Operations

# Demonstrate encoding differences
text = "Hello, World! café"

# Different encodings
encodings = ['ascii', 'latin-1', 'utf-8', 'utf-16']

print("Encoding comparison:")
print(f"Original: {text}")
print()

for enc in encodings:
    try:
        encoded = text.encode(enc)
        decoded = encoded.decode(enc)
        print(f"{enc:10}: {len(encoded):3} bytes | {encoded[:30]}...")
    except UnicodeEncodeError as e:
        print(f"{enc:10}: ENCODING ERROR - {e}")
print()

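# Handle encoding errors gracefully
def safe_decode(byte_data: bytes, encodings: list = None) -> str:
    """Try multiple encodings to decode bytes.

    Order matters: latin-1 maps every byte value to a character, so it never
    fails and must come last; any encoding listed after it is never tried.
    """
    if encodings is None:
        encodings = ['utf-8', 'cp1252', 'latin-1']
    
    for enc in encodings:
        try:
            return byte_data.decode(enc)
        except UnicodeDecodeError:
            continue
    
    # Reached only when a custom list without latin-1 fails entirely
    return byte_data.decode('utf-8', errors='replace')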

# Example with problematic bytes
problematic = b'caf\xe9'  # Latin-1 encoding of café
print(f"Safe decode: {safe_decode(problematic)}")

When working with international text data, always specify encoding='utf-8' explicitly. Don't rely on system defaults, which vary by OS and locale. UTF-8 is the universal standard and should be your default choice.
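
As a minimal sketch of that advice, the example below writes and reads a temporary file with an explicit encoding, then shows what happens when the same bytes are decoded with the wrong codec. The file name is arbitrary and the directory is created only for the demonstration.

```python
import tempfile
from pathlib import Path

text = "caf\u00e9, na\u00efve, r\u00e9sum\u00e9"

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sample.txt"
    path.write_text(text, encoding="utf-8")

    # Matching encodings round-trip cleanly
    round_trip = path.read_text(encoding="utf-8")

    # Decoding UTF-8 bytes as Latin-1 succeeds but produces mojibake
    garbled = path.read_bytes().decode("latin-1")

    # A strict ASCII decode would raise; errors="replace" substitutes U+FFFD
    lossy = path.read_bytes().decode("ascii", errors="replace")

print(round_trip == text)
print(garbled)
print(lossy)
```

The Latin-1 case is the more dangerous one in practice because it fails silently: no exception is raised, and the corrupted text flows into later processing.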


5. String Operations on Pandas Series

5.1 The str Accessor

import pandas as pd
import numpy as np

# Create sample DataFrame
data = {
    'raw_text': [
        '  John.Doe@company.com  ',
        'JANE_SMITH@ORG.ORG',
        'bob@example.net',
        'ALICE@TEST.COM',
        'charlie@domain.co.uk'
    ],
    'price': ['$19.99', '$25.50', 'N/A', '$30.00', '$15.75'],
    'date': ['2024-01-15', '2024-02-20', 'invalid', '2024-03-10', '2024-04-05']
}

df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
print()

# String operations using str accessor
print("String operations on 'raw_text':")
df['clean_email'] = df['raw_text'].str.strip().str.lower()
df['username'] = df['clean_email'].str.split('@').str[0]
df['domain'] = df['clean_email'].str.split('@').str[1]
df['has_numbers'] = df['clean_email'].str.contains(r'\d', regex=True)
df['email_length'] = df['clean_email'].str.len()

print(df[['raw_text', 'clean_email', 'username', 'domain', 'has_numbers']])
print()

# String operations on 'price'
print("String operations on 'price':")
df['price_clean'] = df['price'].str.replace('$', '', regex=False)
df['price_numeric'] = pd.to_numeric(df['price_clean'], errors='coerce')

print(df[['price', 'price_clean', 'price_numeric']])
print(f"\nPrice statistics:")
print(f"  Mean: ${df['price_numeric'].mean():.2f}")
print(f"  Median: ${df['price_numeric'].median():.2f}")
print(f"  Missing: {df['price_numeric'].isna().sum()}")
print()

# Advanced: Extract patterns from text
product_names = pd.Series([
    'iPhone 14 Pro Max',
    'Samsung Galaxy S23',
    'Google Pixel 7 Pro',
    'OnePlus 11 5G',
    'Xiaomi 13 Ultra'
])

print("Pattern extraction from product names:")
df_products = pd.DataFrame({'product': product_names})
df_products['brand'] = df_products['product'].str.split().str[0]
df_products['model'] = df_products['product'].str.extract(r'(\d+)', expand=False)  # first run of digits, returned as a Series
df_products['has_pro'] = df_products['product'].str.contains('Pro', case=False)
df_products['has_5g'] = df_products['product'].str.contains('5G')

print(df_products)

5.2 Vectorized String Operations

import pandas as pd
import time

# Performance comparison
n = 1_000_000
text_data = pd.Series(['Hello World'] * n)

# Method 1: Python loop
start = time.perf_counter()
result_loop = pd.Series([x.lower() for x in text_data])
loop_time = time.perf_counter() - start

# Method 2: Pandas str accessor
start = time.perf_counter()
result_pandas = text_data.str.lower()
pandas_time = time.perf_counter() - start

print(f"Performance comparison ({n:,} strings):")
print(f"  Python loop: {loop_time:.4f}s")
print(f"  Pandas str: {pandas_time:.4f}s")
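print(f"  Ratio (loop / pandas): {loop_time/pandas_time:.1f}x")

On the default object dtype, the .str accessor still calls Python's string methods one element at a time, so its runtime is usually in the same range as a list comprehension rather than orders of magnitude faster. Its main advantages are concise, chainable syntax and built-in handling of missing values. Substantially larger speedups generally require an Arrow-backed string dtype (for example "string[pyarrow]"), so benchmark on your own data before assuming a gain.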


6. NLP Preprocessing

6.1 Tokenization

Definition: Tokenization

The process of breaking text into discrete units called tokens. Tokens can be words, subwords, characters, or sentences. Word tokenization splits on whitespace/punctuation, while subword tokenization (BPE, WordPiece) handles rare words by decomposing them into common subunits.

import re
from typing import List

def simple_tokenizer(text: str) -> List[str]:
    """Basic whitespace tokenizer."""
    return text.split()

def word_tokenizer(text: str) -> List[str]:
    """Split on word boundaries."""
    return re.findall(r'\b\w+\b', text.lower())

def sentence_tokenizer(text: str) -> List[str]:
    """Split on sentence-ending punctuation, dropping empty pieces."""
    parts = re.split(r'[.!?]+', text)
    return [p.strip() for p in parts if p.strip()]

# Example
text = "Hello, World! This is a test. It has multiple sentences."

print("Tokenization examples:")
print(f"  Input: {text}")
print(f"  Whitespace: {simple_tokenizer(text)}")
print(f"  Word: {word_tokenizer(text)}")
print(f"  Sentence: {sentence_tokenizer(text)}")
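
The subword idea mentioned in the definition can be illustrated with a toy version of the BPE merge loop. This is a simplified sketch for intuition only, not the implementation used by any particular tokenizer library; the corpus and the helper names most_frequent_pair and merge_pair are made up.

```python
from collections import Counter
from typing import Dict, Optional, Tuple

Word = Tuple[str, ...]

def most_frequent_pair(words: Dict[Word, int]) -> Optional[Tuple[str, str]]:
    """Return the adjacent symbol pair with the highest corpus frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(words: Dict[Word, int], pair: Tuple[str, str]) -> Dict[Word, int]:
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged: Dict[Word, int] = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if symbols[i:i + 2] == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

corpus = "low low low lower lowest newer newer"
# Start from single characters: each word is a tuple of symbols with a count
words = {tuple(w): f for w, f in Counter(corpus.split()).items()}

for step in range(1, 4):
    pair = most_frequent_pair(words)
    if pair is None:
        break
    words = merge_pair(words, pair)
    print(f"merge {step}: {pair} -> {sorted(words)}")
```

After a few merges the frequent word "low" becomes a single symbol, while rarer words such as "lowest" remain split into smaller pieces, which is how subword vocabularies cope with rare words.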

6.2 Stop Words Removal

# Common English stop words (partial list)
STOP_WORDS = {
    'i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves',
    'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him',
    'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its',
    'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what',
    'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am',
    'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has',
    'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the',
    'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of',
    'at', 'by', 'for', 'with', 'about', 'against', 'between', 'through',
    'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up',
    'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further',
    'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all',
    'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no',
    'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's',
    't', 'can', 'will', 'just', 'don', 'should', 'now'
}

def remove_stopwords(tokens: List[str], custom_stopwords: set = None) -> List[str]:
    """Remove stop words from token list."""
    stop_words = STOP_WORDS.copy()
    if custom_stopwords:
        stop_words.update(custom_stopwords)
    return [token for token in tokens if token.lower() not in stop_words]

# Example
tokens = ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
print(f"Original tokens: {tokens}")
print(f"After stopword removal: {remove_stopwords(tokens)}")

6.3 Stemming vs Lemmatization

Definition: Stemming

A heuristic process that chops off word suffixes to reduce words to their root form (stem). Stems may not be valid words (e.g., "running" → "run", "studies" → "studi"). Fast but crude; used when dictionary lookup is too expensive.

Definition: Lemmatization

A morphological analysis that reduces words to their dictionary form (lemma). Unlike stemming, lemmatization produces valid words using vocabulary and morphological analysis (e.g., "better" → "good", "running" → "run"). More accurate but computationally expensive.

from typing import List

# Simple stemmer (naive suffix stripping; not the actual Porter algorithm,
# so results differ, e.g. 'running' becomes 'runn' here)
def simple_stemmer(word: str) -> str:
    """Very basic stemmer for demonstration."""
    suffixes = ['ing', 'tion', 'ness', 'ment', 'able', 'ible', 'ly', 'ed', 'er', 'est', 's']
    
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

# Simple lemmatizer
IRREGULAR_FORMS = {
    'running': 'run', 'better': 'good',
    'worse': 'bad', 'best': 'good', 'worst': 'bad',
    'went': 'go', 'going': 'go', 'gone': 'go',
    'was': 'be', 'were': 'be', 'been': 'be',  # the lemma of these forms is 'be'
    'had': 'have', 'having': 'have'
}

def simple_lemmatizer(word: str) -> str:
    """Basic lemmatizer using irregular forms."""
    if word.lower() in IRREGULAR_FORMS:
        return IRREGULAR_FORMS[word.lower()]
    
    # Basic rules
    if word.endswith('ies'):
        return word[:-3] + 'y'
    elif word.endswith('ing'):
        return word[:-3]
    elif word.endswith('ed'):
        return word[:-2]
    elif word.endswith('s') and not word.endswith('ss'):
        return word[:-1]
    return word

# Comparison
test_words = ['running', 'better', 'studies', 'jumping', 'worse', 'happiness']

print("Stemming vs Lemmatization:")
print(f"{'Word':<12} {'Stemmed':<12} {'Lemmatized':<12}")
print("-" * 36)
for word in test_words:
    print(f"{word:<12} {simple_stemmer(word):<12} {simple_lemmatizer(word):<12}")

6.4 Complete NLP Pipeline

import re
from typing import Any, Dict, List
from collections import Counter

class NLPPipeline:
    """Complete NLP preprocessing pipeline."""
    
    def __init__(self):
        # STOP_WORDS is the set defined in section 6.2
        self.stop_words = STOP_WORDS
    
    def preprocess(
        self,
        text: str,
        lowercase: bool = True,
        remove_stopwords: bool = True,
        min_length: int = 2
    ) -> Dict[str, Any]:
        """
        Complete NLP preprocessing pipeline.
        
        Returns:
            Dictionary with tokens, frequencies, and metadata
        """
        # Step 1: Basic cleaning
        if lowercase:
            text = text.lower()
        
        # Step 2: Remove special characters
        text = re.sub(r'[^a-zA-Z\s]', '', text)
        
        # Step 3: Remove extra whitespace
        text = re.sub(r'\s+', ' ', text).strip()
        
        # Step 4: Tokenize
        tokens = text.split()
        
        # Step 5: Remove stopwords
        if remove_stopwords:
            tokens = [t for t in tokens if t not in self.stop_words]
        
        # Step 6: Filter by length
        tokens = [t for t in tokens if len(t) >= min_length]
        
        # Step 7: Compute statistics
        freq = Counter(tokens)
        vocab = sorted(set(tokens))
        
        return {
            'tokens': tokens,
            'vocabulary': vocab,
            'frequencies': dict(freq.most_common(10)),
            'unique_words': len(vocab),
            'total_tokens': len(tokens),
            'type_token_ratio': len(vocab) / len(tokens) if tokens else 0
        }

# Example usage
pipeline = NLPPipeline()

sample_texts = [
    "The quick brown fox jumps over the lazy dog. This is a simple sentence.",
    "Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence.",
    "Data science involves extracting insights from data using scientific methods, algorithms, and systems."
]

print("NLP Pipeline Results:")
print("=" * 80)

for i, text in enumerate(sample_texts, 1):
    result = pipeline.preprocess(text)
    print(f"\nText {i}:")
    print(f"  Original: {text[:60]}...")
    print(f"  Tokens: {result['tokens'][:10]}...")
    print(f"  Unique words: {result['unique_words']}")
    print(f"  Type-Token Ratio: {result['type_token_ratio']:.2f}")
    print(f"  Top 5 words: {list(result['frequencies'].items())[:5]}")

Type-Token Ratio

$$TTR = \frac{|\text{vocabulary}|}{|\text{tokens}|}$$

Here,

  • TTR = Type-Token Ratio (vocabulary richness)
  • |vocabulary| = Number of unique words (types)
  • |tokens| = Total number of word occurrences
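
A small worked example makes the ratio concrete. The two sentences are invented for illustration, and the helper name type_token_ratio is hypothetical; it mirrors the calculation in the pipeline above.

```python
from typing import List

def type_token_ratio(tokens: List[str]) -> float:
    """Unique tokens divided by total tokens; 0.0 for an empty list."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0

repetitive = "the cat saw the cat and the cat ran".split()
varied = "a quick brown fox jumps over one lazy dog".split()

# repetitive: 5 unique words out of 9 tokens; varied: 9 unique out of 9
print(f"repetitive: {type_token_ratio(repetitive):.2f}")
print(f"varied:     {type_token_ratio(varied):.2f}")
```

TTR tends to fall as a text gets longer, because common words keep repeating, so it is best compared across texts of similar length.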

7. Regex Cheat Sheet Table

Pattern         Description           Example           Matches
.               Any character         a.c               abc, a1c, a c
^               Start of string       ^Hello            Hello World
$               End of string         World$            Hello World
*               Zero or more          ab*c              ac, abc, abbc
+               One or more           ab+c              abc, abbc (not ac)
?               Zero or one           colou?r           color, colour
\d              Digit                 \d+               123
\D              Non-digit             \D+               abc
\w              Word char             \w+               hello_123
\W              Non-word              \W+               !@#
\s              Whitespace            \s+               spaces, tabs
\S              Non-space             \S+               hello
[abc]           Character class       [aeiou]           vowels
[^abc]          Negated class         [^0-9]            non-digits
(abc)           Capture group         (ab)+             ab, abab
(?:abc)         Non-capturing         (?:ab)+           ab, abab
(?P<name>...)   Named group           (?P<year>\d{4})   captures as 'year'
a|b             OR                    cat|dog           cat or dog
{n}             Exactly n             \d{3}             123
{n,}            At least n            \d{2,}            12, 123
{n,m}           Between n,m           \d{2,4}           12, 123, 1234
\b              Word boundary         \bword\b          word (as a whole word)
\A              Start of string       \AHello           Hello
\Z              End of string         World\Z           World
(?!...)         Negative lookahead    foo(?!bar)        foo not followed by bar
(?<=...)        Positive lookbehind   (?<=\$)\d+        digits after $
(?<!...)        Negative lookbehind   (?<!\$)\d+        digits not after $
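
The lookaround rows are the easiest ones to misread, so here is a short check of how they behave on a made-up string. Note the third result: a negative lookbehind only inspects the character immediately before the match, so it can still match the tail of a number that follows "$".

```python
import re

text = "foobar foobaz $30 and 40"

# Negative lookahead: only the "foo" in "foobaz" qualifies
not_before_bar = re.findall(r"foo(?!bar)", text)

# Positive lookbehind: digits directly preceded by "$"
after_dollar = re.findall(r"(?<=\$)\d+", text)

# Negative lookbehind: the "0" of "$30" is preceded by "3", not "$", so it matches
naive_not_after_dollar = re.findall(r"(?<!\$)\d+", text)

# Also excluding a preceding digit keeps the match from starting mid-number
strict_not_after_dollar = re.findall(r"(?<![\$\d])\d+", text)

print(not_before_bar, after_dollar, naive_not_after_dollar, strict_not_after_dollar)
```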

Key Takeaways

📋 Summary: String & Text Processing

  1. String Immutability: All string operations create new strings. For repeated concatenation, use ''.join() instead of +=.
  2. Regular Expressions: Compile patterns with re.compile() for repeated use; use raw strings r'pattern' to avoid escape issues; named groups improve readability.
  3. Text Cleaning Pipeline: Always normalize Unicode before processing; remove noise first; lowercase for case-insensitive analysis; tokenize before further processing; remove stopwords after tokenization.
  4. Encoding Matters: Always specify encoding when reading/writing files; use UTF-8 as default for international text; handle encoding errors gracefully with errors='replace'.
  5. Pandas String Operations: Use the .str accessor for concise, chainable operations that handle missing values; on object-dtype columns its speed is often comparable to a list comprehension, so benchmark before relying on a speedup.
  6. NLP Preprocessing: Tokenization → Stopword removal → Stemming/Lemmatization; Type-Token Ratio measures vocabulary richness.

Practice Exercises

Exercise 1: Regex Pattern Writing

Write regex patterns to:

  1. Extract all hashtags from social media text
  2. Validate phone numbers in multiple formats
  3. Extract price values with currency symbols
  4. Find sentences containing specific keywords

Exercise 2: Text Cleaning Pipeline

Build a text cleaning function that:

  1. Handles HTML tags
  2. Removes emojis (with option to keep)
  3. Normalizes contractions (don't → do not)
  4. Expands abbreviations (NYC → New York City)

Test with 10 different messy text samples.

Exercise 3: Pandas String Processing

Given a DataFrame with a "product_description" column:

  1. Extract brand names
  2. Identify product categories
  3. Extract numerical specifications (size, weight, etc.)
  4. Create a feature matrix for machine learning

Exercise 4: NLP Comparison

Compare the output of stemming vs lemmatization on:

  • 50 product reviews
  • 20 news articles
  • Discuss which produces better results for topic modeling

Exercise 5: Performance Optimization

Benchmark different text processing approaches:

  • Python loops vs list comprehensions
  • Pandas .str methods vs apply()
  • Regex vs string methods for pattern matching

Create a report with recommendations for different data sizes.
