Python Regular Expressions — Pattern Matching Mastery
Regular expressions describe search patterns in text. They are essential for validation, extraction, and text transformation.
Learning Objectives
- Use the re module for pattern matching
- Master quantifiers, groups, and anchors
- Apply lookahead and lookbehind assertions
- Solve real-world text processing problems
re Module Basics
import re
text = "The price is $42.50 and $19.99"
# Find all numbers
numbers = re.findall(r'\d+\.?\d*', text)
print(numbers) # ['42.50', '19.99']
# Match and extract
match = re.search(r'\$(\d+\.?\d*)', text)
if match:
    print(match.group(0)) # $42.50 (full match)
    print(match.group(1)) # 42.50 (first group)
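The snippet above extracts only the first price. As a minimal sketch, assuming the same sample text, the following shows how re.finditer gives a match object for every occurrence, how re.findall behaves once the pattern contains a capturing group, and how re.match differs from re.search. The variable names are illustrative only.

```python
import re

text = "The price is $42.50 and $19.99"

# finditer yields one match object per occurrence, so group(1)
# is available for every price, not just the first one.
prices = [float(m.group(1)) for m in re.finditer(r'\$(\d+\.?\d*)', text)]
print(prices)  # [42.5, 19.99]

# With exactly one capturing group, findall returns the group text
# instead of the full match (note the missing '$').
captured = re.findall(r'\$(\d+\.?\d*)', text)
print(captured)  # ['42.50', '19.99']

# match only tries the start of the string; search scans forward.
at_start = re.match(r'\$', text)
anywhere = re.search(r'\$', text)
print(at_start)          # None
print(anywhere.start())  # 13
```

Always check the result of re.search or re.match for None before calling group(), as the original example does with `if match:`.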
Pattern Syntax
# Quantifiers
r'a+' # One or more 'a'
r'a*' # Zero or more 'a'
r'a?' # Zero or one 'a'
r'a{3}' # Exactly 3 'a's
r'a{2,4}' # 2 to 4 'a's
# Character classes
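r'\d' # Digit ([0-9] with re.ASCII; any Unicode decimal digit by default)
r'\w' # Word character ([a-zA-Z0-9_] with re.ASCII; Unicode letters, digits, and _ by default)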
r'\s' # Whitespace
r'[aeiou]' # Any vowel
r'[^aeiou]' # Not a vowel
# Anchors
r'^Hello' # Starts with
r'world$' # Ends with
r'\bword\b' # Word boundary
# Groups
r'(abc)' # Capturing group
r'(?:abc)' # Non-capturing group
r'(?P<name>abc)' # Named group
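To see the syntax elements working together, here is a small sketch that combines quantifiers, a word boundary, and the three group types. The log-line format and the names `level` and `code` are made up for illustration.

```python
import re

# Named groups with a fixed-count quantifier and a word boundary.
# The "LEVEL CODE message" layout is a hypothetical log format.
log_pattern = re.compile(r'^(?P<level>[A-Z]+)\s+(?P<code>\d{3})\b')
m = log_pattern.search("ERROR 404 page not found")
fields = m.groupdict()
print(fields)  # {'level': 'ERROR', 'code': '404'}

# A non-capturing group repeats 'ab' without creating a capture,
# so findall returns the full match.
non_capturing = re.findall(r'(?:ab){2}', "ababab")
print(non_capturing)  # ['abab']

# A capturing group changes what findall returns: only the text of
# the group (its last repetition) comes back.
capturing = re.findall(r'(ab){2}', "ababab")
print(capturing)  # ['ab']

# Range quantifier: 2 to 4 'a's, checked against the whole string.
in_range = bool(re.fullmatch(r'a{2,4}', "aaa"))
too_long = bool(re.fullmatch(r'a{2,4}', "aaaaa"))
print(in_range, too_long)  # True False
```

Non-capturing groups are usually the safer default when a group is needed only for repetition or alternation, since they leave the numbering of the real captures unchanged.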
Common Patterns
# Email validation
email_pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
bool(re.match(email_pattern, 'user@example.com')) # True
# Phone number (US)
phone_pattern = r'(\+1)?[-.\s]?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}'
# URL extraction
url_pattern = r'https?://(?:www\.)?[\w.-]+\.[a-zA-Z]{2,}(?:/[\w./?%&=-]*)?'
urls = re.findall(url_pattern, "Visit https://example.com or http://test.org/path")
# ['https://example.com', 'http://test.org/path']
# Date formats
date_pattern = r'\d{4}[-/]\d{2}[-/]\d{2}'
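The phone and date patterns above are shown without being run. The sketch below applies them with re.fullmatch, which requires the entire string to fit. It also illustrates one limit of the date pattern: it checks shape only, so it accepts strings such as 2024-13-45. The helper `is_real_date` is a hypothetical name, and pairing the regex with datetime.strptime is one possible approach, not the only one.

```python
import re
from datetime import datetime

phone_pattern = r'(\+1)?[-.\s]?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}'
date_pattern = r'\d{4}[-/]\d{2}[-/]\d{2}'

# fullmatch rejects strings with leftover characters or missing digits.
samples = ["(555) 123-4567", "+1 555.123.4567", "555-1234"]
phone_ok = [bool(re.fullmatch(phone_pattern, s)) for s in samples]
print(phone_ok)  # [True, True, False]


def is_real_date(s):
    """Hypothetical helper: regex checks the shape, datetime checks the calendar."""
    if not re.fullmatch(date_pattern, s):
        return False
    try:
        # Normalize '/' to '-' so a single format string is enough.
        datetime.strptime(s.replace('/', '-'), '%Y-%m-%d')
    except ValueError:
        return False
    return True


print(is_real_date("2024-02-29"))  # True
print(is_real_date("2024/02/29"))  # True
print(is_real_date("2024-13-45"))  # False: right shape, impossible date
print(is_real_date("24-02-29"))    # False: wrong shape
```

The same idea applies to the email pattern: a regex can confirm that input looks plausible, but it cannot confirm that the address exists.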
Lookahead and Lookbehind
# Positive lookahead: followed by X
r'\d+(?= dollars)' # 42 in "42 dollars"
# Negative lookahead: NOT followed by X
r'\d+(?! dollars)' # 42 in "42 euros"; in "42 dollars" it still matches the 4
# Positive lookbehind: preceded by X
r'(?<=\$)\d+' # 42 in "$42"
# Negative lookbehind: NOT preceded by X
r'(?<!\$)\d+' # 42 in "42 dollars"; in "$42" it still matches the 2
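The four assertions can be checked directly with re.findall. This sketch uses made-up sample strings and also shows a common surprise with the negative forms: the engine can backtrack inside \d+ and match part of a number. The two guarded patterns at the end are one way to avoid that, offered as an illustration rather than a universal fix.

```python
import re

# Positive lookahead and lookbehind behave as described.
ahead = re.findall(r'\d+(?= dollars)', "42 dollars, 7 euros")
behind = re.findall(r'(?<=\$)\d+', "$42 and 17")
print(ahead)   # ['42']
print(behind)  # ['42']

# Negative assertions alone can match a fragment of the number:
# '4' is followed by '2', not ' dollars', so the lookahead passes.
partial_ahead = re.findall(r'\d+(?! dollars)', "42 dollars")
# '2' is preceded by '4', not '$', so the lookbehind passes.
partial_behind = re.findall(r'(?<!\$)\d+', "$42")
print(partial_ahead)   # ['4']
print(partial_behind)  # ['2']

# Guarded versions: word boundaries force the whole number to match,
# and adding \d to the lookbehind blocks starting mid-number.
whole_ahead = re.findall(r'\b\d+\b(?! dollars)', "42 dollars, 7 euros")
whole_behind = re.findall(r'(?<![$\d])\d+', "$42 and 17")
print(whole_ahead)   # ['7']
print(whole_behind)  # ['17']
```

Lookarounds are zero-width: they check the surrounding text without consuming it, which is why " dollars" and "$" never appear in the results.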
Substitution
text = "Hello World, hello Python"
# Replace all occurrences
result = re.sub(r'hello', 'Hi', text, flags=re.IGNORECASE)
# "Hi World, Hi Python"
# Replace with function
def double_number(match):
    return str(int(match.group()) * 2)
result = re.sub(r'\d+', double_number, "I have 3 cats and 5 dogs")
# "I have 6 cats and 10 dogs"
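Beyond plain strings and functions, the replacement argument can refer back to captured groups. The following sketch, with made-up input strings, reorders a date using named-group references, limits the number of replacements with count, and uses re.subn to learn how many substitutions were made.

```python
import re

# \g<name> in the replacement inserts the text of a named group.
# Here a MM/DD/YYYY date is rewritten as YYYY-MM-DD.
us_date = r'(?P<m>\d{2})/(?P<d>\d{2})/(?P<y>\d{4})'
iso = re.sub(us_date, r'\g<y>-\g<m>-\g<d>', "Due 03/15/2024")
print(iso)  # Due 2024-03-15

# count limits how many matches are replaced (left to right).
first_only = re.sub(r'hello', 'Hi', "Hello World, hello Python",
                    count=1, flags=re.IGNORECASE)
print(first_only)  # Hi World, hello Python

# subn returns the new string together with the replacement count.
masked, n = re.subn(r'\d+', '#', "I have 3 cats and 5 dogs")
print(masked, n)  # I have # cats and # dogs 2
```

Passing count and flags as keyword arguments avoids mixing them up, since both are integers and re.sub accepts them in adjacent positions.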
Key Takeaways
- Use raw strings r'...' for regex patterns
- \d digits, \w word chars, \s whitespace
- Groups () capture matched text
- Lookahead/lookbehind for context-sensitive matching
- re.sub() for pattern-based replacement