Why Performance Matters
Pandas is convenient but can be deceptively slow. Understanding its internals β vectorization, chunking, and efficient patterns β is critical when working with datasets larger than a few hundred thousand rows.
DfVectorization
The practice of applying operations to entire arrays or columns at the C level rather than iterating through elements one by one in Python. Vectorized operations exploit CPU SIMD (Single Instruction, Multiple Data) instructions to process multiple data points simultaneously, bypassing the Python interpreter overhead entirely.
The Performance Trap
Approach 1M rows 10M rows 100M rows
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
for loop 2.5s 25s 250s+ π
.apply() 1.8s 18s 180s π°
Vectorized 0.05s 0.5s 5s β
eval()/query() 0.03s 0.3s 3s ββ
Numba/cuDF 0.01s 0.1s 1s βββ
Why the gap widens with scale: Python's per-operation overhead is roughly 100β200 ns. For 1 million rows, that adds up to 100β200 seconds of pure interpreter overhead. Vectorized operations amortize this overhead across the entire array, keeping the constant factor near zero. The speedup ratio increases with dataset size.
Why Pandas Can Be Slow
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Why Vectorization Wins β
β β
β Python loop: for i in range(n): result[i] = ... β
β β β
β Python interpreter overhead per item β
β Type checking per iteration β
β Function call overhead β
β O(n) Python operations β
β β
β Vectorized: result = array * 2 β
β β β
β Single C-level operation β
β No Python overhead β
β SIMD instructions (parallel) β
β O(1) Python operation β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
πMeasuring the Vectorization Speedup
import pandas as pd
import numpy as np
import time
# Performance demonstration
np.random.seed(42)
n = 1_000_000
df = pd.DataFrame({
'a': np.random.randn(n),
'b': np.random.randn(n),
'c': np.random.choice(['X', 'Y', 'Z'], n)
})
# Method 1: apply() β SLOW
start = time.perf_counter()
df['result_apply'] = df.apply(lambda row: row['a'] * 2 + row['b'], axis=1)
time_apply = time.perf_counter() - start
# Method 2: Vectorized β FAST
start = time.perf_counter()
df['result_vector'] = df['a'] * 2 + df['b']
time_vector = time.perf_counter() - start
# Method 3: eval() β FASTEST
start = time.perf_counter()
df['result_eval'] = df.eval('a * 2 + b')
time_eval = time.perf_counter() - start
print(f"apply(): {time_apply:.4f}s")
print(f"vectorized: {time_vector:.4f}s")
print(f"eval(): {time_eval:.4f}s")
print(f"Speedup (vector vs apply): {time_apply/time_vector:.1f}x")
print(f"Speedup (eval vs apply): {time_apply/time_eval:.1f}x")
Typical Output:
apply(): 1.8234s
vectorized: 0.0156s
eval(): 0.0112s
Speedup (vector vs apply): 116.9x
Speedup (eval vs apply): 162.8x
eval() and query(): Fast Expression Evaluation
eval() and query() use expression parsing to evaluate operations in a single pass, avoiding intermediate array creation.
DfExpression Parsing in eval()
A technique where the entire arithmetic expression is parsed as an abstract syntax tree and evaluated in a single C-level pass. Rather than creating intermediate arrays for each binary operation, eval() rewrites the expression to minimize temporary allocations. For an expression like a * b + c, naive pandas creates two intermediates; eval() creates only one.
Memory Allocation with eval() vs Chained Operations
Here,
- =
- =
- =
For an expression like a * b + c, naive pandas creates:
- temp1 = a Γ b (intermediate array)
- result = temp1 + c (another intermediate)
eval() parses the entire expression and evaluates it in one C-level operation, using only one intermediate allocation.
When to use eval() vs query(): Use eval() when you are creating new columns from complex arithmetic on existing columns. Use query() when you are filtering rows based on multiple conditions. Both avoid intermediate allocations, but eval() is most beneficial when the expression involves 3+ columns.
import pandas as pd
import numpy as np
np.random.seed(42)
n = 1_000_000
df = pd.DataFrame({
'x': np.random.randn(n),
'y': np.random.randn(n),
'z': np.random.randn(n),
'w': np.random.randn(n)
})
# Standard approach: multiple intermediate arrays
df['result_std'] = df['x'] * df['y'] + df['z'] - df['w'] * 2
# eval(): single-pass evaluation, no intermediates
df['result_eval'] = df.eval('x * y + z - w * 2')
# query(): filter with expression strings
# Equivalent to: df[(df['x'] > 0) & (df['y'] > 1) & (df['z'] < df['w'])]
result = df.query('x > 0 and y > 1 and z < w')
print(f"Rows matching query: {len(result)}")
# Local variables in query
threshold = 0.5
result2 = df.query('x > @threshold and y > @threshold')
print(f"Rows with both > {threshold}: {len(result2)}")
# Assignment with eval (modify in place)
df.eval('x_squared = x ** 2', inplace=True)
print(df[['x', 'x_squared']].head())
# x x_squared
# 0 0.248357 0.061681
# 1 -0.072008 0.005185
# 2 0.253489 0.064257
# 3 -0.035586 0.001266
# 4 0.203452 0.041393
Performance Comparison
import time
np.random.seed(42)
n = 5_000_000
df = pd.DataFrame({
'a': np.random.randn(n),
'b': np.random.randn(n),
'c': np.random.randn(n),
'd': np.random.randn(n)
})
def benchmark(label, func, *args):
start = time.perf_counter()
result = func(*args)
elapsed = time.perf_counter() - start
return label, elapsed, result
results = []
# Standard chaining
results.append(benchmark(
'chained', lambda: df['a'] * df['b'] + df['c'] / df['d'] - df['a']**2
))
# eval
results.append(benchmark(
'eval', lambda: df.eval('a * b + c / d - a ** 2')
))
# query for filtering
results.append(benchmark(
'query', lambda: df.query('a > 0 and b > 0 and c > 0')
))
# Boolean indexing (slow)
results.append(benchmark(
'boolean_idx', lambda: df[(df['a'] > 0) & (df['b'] > 0) & (df['c'] > 0)]
))
print(f"{'Method':<15} {'Time (ms)':<12}")
print("-" * 27)
for label, elapsed, _ in results:
print(f"{label:<15} {elapsed*1000:<12.2f}")
Typical Results:
Method Time (ms)
---------------------------
chained 45.23
eval 28.67
query 31.45
boolean_idx 52.18
Method Chaining: pipe() for Clean Pipelines
Method chaining creates readable, linear data transformations without temporary variables. The pipe() method enables custom functions in chains.
DfMethod Chaining
A programming pattern where multiple method calls are written sequentially, with the output of one method passed as the input to the next. In pandas, most methods return a DataFrame, enabling this pattern. The pipe() method extends this to custom functions, enabling composable, testable, and readable data pipelines.
The pipe() advantage over nested calls: Nested function calls like g(f(h(df))) read inside-out, increasing cognitive load. Method chains like df.pipe(h).pipe(f).pipe(g) read left-to-right, matching natural reading order. This is especially important when pipelines have 5+ steps.
import pandas as pd
import numpy as np
# Sample data
np.random.seed(42)
df = pd.DataFrame({
'date': pd.date_range('2024-01-01', periods=1000, freq='D'),
'store': np.random.choice(['NYC', 'LA', 'Chicago'], 1000),
'product': np.random.choice(['A', 'B', 'C'], 1000),
'revenue': np.random.exponential(100, 1000),
'units': np.random.randint(1, 50, 1000),
'cost': np.random.exponential(50, 1000)
})
# Without chaining: nested, hard to read
result1 = (
df.assign(
profit=lambda x: x['revenue'] - x['cost'],
margin=lambda x: (x['revenue'] - x['cost']) / x['revenue']
)
.query('margin > 0.1')
.groupby(['store', 'product'])
.agg({
'revenue': 'sum',
'profit': 'sum',
'margin': 'mean',
'units': 'sum'
})
.reset_index()
.sort_values('profit', ascending=False)
)
# With pipe(): composable and testable
def add_profit_margin(data):
"""Calculate profit and margin."""
return data.assign(
profit=lambda x: x['revenue'] - x['cost'],
margin=lambda x: (x['revenue'] - x['cost']) / x['revenue']
)
def filter_profitable(data, min_margin=0.1):
"""Filter by minimum margin."""
return data[data['margin'] > min_margin]
def aggregate_by_store_product(data):
"""Aggregate metrics by store and product."""
return (data
.groupby(['store', 'product'])
.agg({
'revenue': 'sum',
'profit': 'sum',
'margin': 'mean',
'units': 'sum'
})
.reset_index())
def rank_by_profit(data):
"""Sort by profit descending."""
return data.sort_values('profit', ascending=False)
# Clean pipeline
result2 = (df
.pipe(add_profit_margin)
.pipe(filter_profitable, min_margin=0.15)
.pipe(aggregate_by_store_product)
.pipe(rank_by_profit))
print(result2.head())
Advanced Indexing: .loc vs .iloc vs []
Understanding the difference between indexing methods prevents silent bugs and performance issues.
DfLabel-Based Indexing (.loc)
Selects data by row and column labels rather than integer positions. Unlike Python slicing, .loc is inclusive on both ends β df.loc['a':'c'] returns rows labeled 'a', 'b', and 'c'. This can cause subtle bugs when switching between label-based and position-based indexing.
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Pandas Indexing Methods β
β β
β .loc[label] β Label-based df.loc[0, 'name'] β
β .iloc[position] β Integer position df.iloc[0, 1] β
β [] β Mixed (context) df[condition] β
β β
β β WARNING: SettingWithCopyWarning β
β df[df['col'] > 0]['new_col'] = value β WRONG (copy) β
β df.loc[df['col'] > 0, 'new_col'] = val β RIGHT (view) β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
import pandas as pd
import numpy as np
np.random.seed(42)
df = pd.DataFrame({
'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
'age': [25, 30, 35, 28, 32],
'score': [85, 92, 78, 95, 88],
'grade': ['B', 'A', 'B+', 'A+', 'A']
}, index=['r0', 'r1', 'r2', 'r3', 'r4'])
# .loc: label-based (inclusive on both ends!)
print("=== .loc (label-based) ===")
print(df.loc['r0':'r2', 'name':'score']) # Includes r2!
# name age score
# r0 Alice 25 85
# r1 Bob 30 92
# r2 Charlie 35 78
# .iloc: position-based (exclusive on right, like Python slicing)
print("\n=== .iloc (position-based) ===")
print(df.iloc[0:3, 0:2]) # Excludes position 3!
# name age
# r0 Alice 25
# r1 Bob 30
# r2 Charlie 35
# Boolean indexing
print("\n=== Boolean indexing ===")
print(df[df['age'] > 30])
# name age score grade
# r2 Charlie 35 78 B+
# r4 Eve 32 88 A
# Chained assignment (WRONG - triggers SettingWithCopyWarning)
# df[df['age'] > 30]['grade'] = 'Senior' # β
# Proper assignment (RIGHT)
df.loc[df['age'] > 30, 'grade'] = 'Senior'
print("\nAfter proper assignment:")
print(df.loc[df['age'] > 30])
# name age score grade
# r2 Charlie 35 78 Senior
# r4 Eve 32 88 Senior
The SettingWithCopyWarning explained: When you write df[mask]['col'] = val, pandas first creates a copy of the data with df[mask], then attempts to assign into that copy. The assignment silently fails on a copy, not the original DataFrame. Always use .loc[mask, 'col'] = val to operate on a view of the original data.
Indexing Performance Comparison
import pandas as pd
import numpy as np
import time
n = 1_000_000
df = pd.DataFrame({
'value': np.random.randn(n),
'category': np.random.choice(['A', 'B', 'C', 'D'], n)
})
df.index = range(n) # Integer index
# Benchmark different indexing methods
def timeit(label, func, n_runs=100):
times = []
for _ in range(n_runs):
start = time.perf_counter()
func()
times.append(time.perf_counter() - start)
return label, np.median(times)
results = []
results.append(timeit('df[df["value"] > 0]', lambda: df[df['value'] > 0]))
results.append(timeit('df.loc[df["value"] > 0]', lambda: df.loc[df['value'] > 0]))
results.append(timeit('df.query("value > 0")', lambda: df.query('value > 0')))
results.append(timeit('df.iloc[:500000]', lambda: df.iloc[:500000]))
results.append(timeit('df.loc[:500000]', lambda: df.loc[:500000]))
print(f"{'Method':<35} {'Median (ms)':<12}")
print("-" * 47)
for label, t in results:
print(f"{label:<35} {t*1000:<12.3f}")
MultiIndex: Hierarchical Indexing
MultiIndex enables multi-dimensional grouping within a 2D DataFrame β essential for panel data, time series with multiple hierarchies, and pivot results.
DfMultiIndex (Hierarchical Indexing)
A pandas index structure that enables multiple levels of row or column labels on a single DataFrame axis. Each level represents a categorical variable, and the combination of levels uniquely identifies each row. MultiIndex is analogous to a composite database key and enables group-by operations across multiple dimensions simultaneously.
MultiIndex Structure:
βββββββββββββββββββββββββββββββββββ
β MultiIndex β
β Level 0 Level 1 Level 2 β
β βββββββββββ¬ββββββββββ¬βββββββββ β
β β NYC β Q1 β Jan β β
β β NYC β Q1 β Feb β β
β β NYC β Q2 β Mar β β
β β LA β Q1 β Jan β β
β βββββββββββ΄ββββββββββ΄βββββββββ β
β Store Quarter Month β
βββββββββββββββββββββββββββββββββββ
When to use MultiIndex vs separate DataFrames: MultiIndex is ideal when your data naturally has hierarchical structure (e.g., city-quarter-month) and you need to aggregate or slice across multiple levels. If you only ever access data by a single key, a flat index with a regular column is simpler and often faster.
import pandas as pd
import numpy as np
# Create MultiIndex DataFrame
np.random.seed(42)
arrays = [
['NYC', 'NYC', 'NYC', 'LA', 'LA', 'LA', 'Chicago', 'Chicago', 'Chicago'],
['Q1', 'Q1', 'Q2', 'Q1', 'Q1', 'Q2', 'Q1', 'Q1', 'Q2'],
['Jan', 'Feb', 'Mar', 'Jan', 'Feb', 'Mar', 'Jan', 'Feb', 'Mar']
]
index = pd.MultiIndex.from_arrays(arrays, names=['City', 'Quarter', 'Month'])
df = pd.DataFrame({
'revenue': np.random.uniform(10000, 50000, 9),
'units': np.random.randint(100, 1000, 9),
'cost': np.random.uniform(5000, 25000, 9)
}, index=index)
print("MultiIndex DataFrame:")
print(df)
# Accessing MultiIndex levels
print("\nAll cities:", df.index.get_level_values('City').unique().tolist())
print("All quarters:", df.index.get_level_values('Quarter').unique().tolist())
# Cross-section: select all data for NYC
print("\nNYC data:")
print(df.loc['NYC'])
# Cross-section: select all Q1 data across all cities
print("\nAll Q1 data:")
print(df.xs('Q1', level='Quarter'))
# Stack/Unstack: reshape between wide and long format
print("\nUnstacked (wide format):")
print(df.unstack(level='City'))
# Swap levels for different grouping
print("\nSwapped levels (Quarter first):")
print(df.swaplevel('Quarter', 'City').sort_index())
# Aggregation across levels
print("\nRevenue by City:")
print(df.groupby(level='City')['revenue'].sum())
print("\nRevenue by Quarter across all cities:")
print(df.groupby(level='Quarter')['revenue'].sum())
Creating MultiIndex from Columns
# Create flat DataFrame
df_flat = pd.DataFrame({
'city': ['NYC', 'NYC', 'LA', 'LA', 'Chicago', 'Chicago'],
'year': [2023, 2024, 2023, 2024, 2023, 2024],
'revenue': [100000, 120000, 95000, 110000, 80000, 95000]
})
# Convert to MultiIndex
df_multi = df_flat.set_index(['city', 'year'])
print("From set_index:")
print(df_multi)
# Reset back to flat
df_flat_again = df_multi.reset_index()
print("\nBack to flat:")
print(df_flat_again)
# Pivot to create MultiIndex columns
df_pivot = df_flat.pivot(index='city', columns='year', values='revenue')
print("\nPivoted:")
print(df_pivot)
Window Functions: Rolling, Expanding, EWM
Window functions compute sliding window aggregations β essential for time series analysis.
Mathematical Definitions
Rolling window of size for function :
Rolling Window Aggregation
Here,
- =
- =
- =
Expanding window (cumulative):
Expanding Window (Cumulative) Aggregation
Here,
- =
- =
Exponentially Weighted Moving (EWM):
Why EWM over simple rolling? EWM assigns exponentially decreasing weights to older observations, giving recent data more influence. This makes EWM more responsive to recent changes than a simple rolling mean of the same effective window. The half-life parameter provides an intuitive way to control memory: after half_life periods, a past observation's weight drops to 50% of its peak.
import pandas as pd
import numpy as np
# Generate time series data
np.random.seed(42)
dates = pd.date_range('2024-01-01', periods=365, freq='D')
df = pd.DataFrame({
'date': dates,
'price': 100 + np.cumsum(np.random.randn(365) * 2),
'volume': np.random.randint(1000, 10000, 365)
}).set_index('date')
# Rolling statistics
df['rolling_mean_7d'] = df['price'].rolling(window=7).mean()
df['rolling_std_7d'] = df['price'].rolling(window=7).std()
df['rolling_min_30d'] = df['price'].rolling(window=30).min()
df['rolling_max_30d'] = df['price'].rolling(window=30).max()
# Expanding (cumulative) statistics
df['expanding_mean'] = df['price'].expanding().mean()
df['expanding_std'] = df['price'].expanding().std()
# EWM (Exponentially Weighted Moving)
df['ewm_7d'] = df['price'].ewm(span=7).mean()
df['ewm_30d'] = df['price'].ewm(span=30).mean()
# Custom rolling aggregation
df['rolling_range'] = df['price'].rolling(7).apply(
lambda x: x.max() - x.min(), raw=True
)
# Bollinger Bands (trading indicator)
window = 20
df['bb_mid'] = df['price'].rolling(window).mean()
df['bb_std'] = df['price'].rolling(window).std()
df['bb_upper'] = df['bb_mid'] + 2 * df['bb_std']
df['bb_lower'] = df['bb_mid'] - 2 * df['bb_std']
print(df[['price', 'rolling_mean_7d', 'bb_upper', 'bb_lower']].tail())
Rolling with Multiple Windows
# Multiple window sizes in one operation
windows = [7, 14, 30, 90]
for w in windows:
df[f'mean_{w}d'] = df['price'].rolling(w).mean()
df[f'std_{w}d'] = df['price'].rolling(w).std()
df[f'change_{w}d'] = df['price'].pct_change(w)
# Rolling correlation between price and volume
df['price_vol_corr_30d'] = df['price'].rolling(30).corr(df['volume'])
# Rolling apply with multiple values
def rolling_trend(series):
"""Compute linear trend over rolling window."""
x = np.arange(len(series))
slope = np.polyfit(x, series, 1)[0]
return slope
df['trend_30d'] = df['price'].rolling(30).apply(rolling_trend, raw=True)
print(df[['price', 'mean_7d', 'mean_30d', 'trend_30d']].head(35))
String Operations: .str Accessor
The .str accessor enables vectorized string operations β apply string methods to entire columns without loops.
import pandas as pd
# Sample text data
df = pd.DataFrame({
'text': [
' Hello World ',
'python is GREAT',
'data-science@example.com',
'Price: $1,234.56',
'NEW YORK, NY 10001',
' Machine Learning '
]
})
# Basic string operations
df['cleaned'] = df['text'].str.strip()
df['lower'] = df['text'].str.lower()
df['upper'] = df['text'].str.upper()
df['title'] = df['text'].str.title()
# Length and containment
df['length'] = df['text'].str.len()
df['contains_science'] = df['text'].str.contains('science', case=False)
df['starts_with_hello'] = df['text'].str.startswith('Hello')
# Split and extract
df['first_word'] = df['text'].str.strip().str.split().str[0]
df['word_count'] = df['text'].str.strip().str.split().str.len()
# Extract patterns with regex
df['email_domain'] = df['text'].str.extract(r'@(\w+)')
df['numbers'] = df['text'].str.extractall(r'(\d+)').groupby(level=0).apply(list)
# Replace
df['no_spaces'] = df['text'].str.replace(' ', '_', regex=False)
# Pad and slice
df['padded'] = df['text'].str.strip().str.pad(20, fillchar='=')
df['first_10'] = df['text'].str[:10]
# Categorical encoding from strings
df['category'] = pd.Categorical(df['text'].str.strip().str.lower())
print(df[['text', 'cleaned', 'lower', 'first_word', 'length']])
Advanced Regex with .str
# Complex regex patterns
df = pd.DataFrame({
'raw': [
'Order #12345 - Total: $1,234.56',
'Invoice ABC-789 - Amount: $567.89',
'Order #67890 - Total: $901.23',
'INV-001 - $12.34'
]
})
# Extract multiple patterns
df['order_id'] = df['raw'].str.extract(r'#(\d+|-[\d]+)')
df['amount'] = df['raw'].str.extract(r'\$([\d,]+\.?\d*)')
df['amount_numeric'] = df['amount'].str.replace(',', '', regex=False).astype(float)
# Named groups
pattern = r'(?P<prefix>#|INV-)(?P<id>\d+)'
df['prefix'] = df['raw'].str.extract(pattern)['prefix']
df['extracted_id'] = df['raw'].str.extract(pattern)['id']
# Replace with function
df['normalized'] = df['raw'].str.replace(
r'\$[\d,]+\.?\d*',
lambda m: f"${float(m.group().replace('$','').replace(',','')) * 1.1:.2f}",
regex=True
)
print(df[['raw', 'order_id', 'amount', 'normalized']])
Categorical Data: Memory-Efficient Enums
Categorical data stores integer codes instead of repeated strings, dramatically reducing memory.
DfCategorical Data Type
A pandas data type that stores each unique string value as an integer code, with a separate mapping array. For a column with unique values repeated times, string storage requires bytes (where is average string length), while categorical storage requires bytes for codes plus bytes for categories β yielding reduction.
String Storage: Categorical Storage:
ββββββββββββββββββββββββ ββββββββββββββββββββββββ
β "Engineering" (12B) β β Code: 0 β
β "Engineering" (12B) β β Code: 0 β
β "Sales" (5B) β β Code: 1 β
β "Engineering" (12B) β β Code: 0 β
β "Marketing" (9B) β β Code: 2 β
β "Sales" (5B) β β Code: 1 β
ββββββββββββββββββββββββ ββββββββββββββββββββββββ
Total: 55 bytes Codes: 6 bytes + categories: 26 bytes
Total: 32 bytes (42% savings)
Categorical memory savings formula: For rows with unique string values of average length bytes, the savings are approximately bytes (since each code is 1 byte for < 256 categories). The breakeven point is roughly β categorical becomes worthwhile when you have at least ~10Γ more rows than unique categories.
import pandas as pd
import numpy as np
# Sample data with repeated strings
n = 1_000_000
df = pd.DataFrame({
'department': np.random.choice(
['Engineering', 'Sales', 'Marketing', 'HR', 'Finance', 'Legal'],
n
),
'level': np.random.choice(['Junior', 'Mid', 'Senior', 'Staff', 'Principal'], n),
'location': np.random.choice(['NYC', 'LA', 'Chicago', 'Seattle', 'Austin'], n)
})
# Memory before categorization
mem_before = df.memory_usage(deep=True).sum() / 1024 / 1024
print(f"Memory before: {mem_before:.2f} MB")
# Convert to categorical
df['department'] = pd.Categorical(df['department'])
df['level'] = pd.Categorical(df['level'])
df['location'] = pd.Categorical(df['location'])
# Memory after categorization
mem_after = df.memory_usage(deep=True).sum() / 1024 / 1024
print(f"Memory after: {mem_after:.2f} MB")
print(f"Memory saved: {(1 - mem_after/mem_before)*100:.1f}%")
# Ordered categorities (for sorting)
df['level'] = pd.Categorical(
df['level'],
categories=['Junior', 'Mid', 'Senior', 'Staff', 'Principal'],
ordered=True
)
# Now comparisons work!
print(df[df['level'] > 'Senior'].head())
print(f"\nSorted by level:")
print(df.sort_values('level')['level'].value_counts())
# Category info
print(f"\nDepartment categories: {df['department'].cat.categories.tolist()}")
print(f"Category codes (first 5): {df['department'].cat.codes.head().tolist()}")
Memory Optimization
import pandas as pd
import numpy as np
# Create a memory-heavy DataFrame
n = 1_000_000
df = pd.DataFrame({
'id': range(n),
'age': np.random.randint(18, 80, n).astype('int64'),
'income': np.random.normal(50000, 15000, n).astype('float64'),
'score': np.random.uniform(0, 100, n).astype('float64'),
'is_active': np.random.choice([True, False], n).astype('bool'),
'department': np.random.choice(['A', 'B', 'C', 'D', 'E'], n),
'timestamp': pd.date_range('2024-01-01', periods=n, freq='s')
})
# Check memory usage
print("Memory usage by column:")
print(df.memory_usage(deep=True))
print(f"\nTotal: {df.memory_usage(deep=True).sum() / 1024 / 1024:.2f} MB")
# Downcast numeric types
def downcast(df):
"""Downcast numeric columns to smallest possible type."""
result = df.copy()
for col in result.select_dtypes(include=['int64']).columns:
result[col] = pd.to_numeric(result[col], downcast='integer')
for col in result.select_dtypes(include=['float64']).columns:
result[col] = pd.to_numeric(result[col], downcast='float')
return result
df_optimized = downcast(df)
# Memory after downcasting
mem_after = df_optimized.memory_usage(deep=True).sum() / 1024 / 1024
print(f"\nMemory after downcasting: {mem_after:.2f} MB")
# Show dtype changes
print("\nDtype comparison:")
for col in df.columns:
before = df[col].dtype
after = df_optimized[col].dtype
if before != after:
print(f" {col}: {before} β {after}")
# Chunked reading for very large files
def process_large_file(filepath, chunksize=10000):
"""Process large CSV in chunks."""
results = []
for chunk in pd.read_csv(filepath, chunksize=chunksize):
# Process each chunk
result = chunk.groupby('category').agg({'value': 'sum'})
results.append(result)
return pd.concat(results).groupby(level=0).sum()
Performance Benchmarking
import pandas as pd
import numpy as np
import time
np.random.seed(42)
n = 2_000_000
df = pd.DataFrame({
'a': np.random.randn(n),
'b': np.random.randn(n),
'c': np.random.choice(['X', 'Y', 'Z', 'W'], n)
})
def benchmark(func, n_runs=10, label=""):
"""Run benchmark and return statistics."""
times = []
for _ in range(n_runs):
start = time.perf_counter()
result = func()
times.append(time.perf_counter() - start)
return {
'label': label,
'median': np.median(times),
'mean': np.mean(times),
'std': np.std(times),
'min': min(times),
'max': max(times)
}
# Benchmark different approaches for group operation
benchmarks = []
# Method 1: groupby + agg
benchmarks.append(benchmark(
lambda: df.groupby('c').agg({'a': 'mean', 'b': 'sum'}),
label='groupby + agg'
))
# Method 2: groupby + apply
benchmarks.append(benchmark(
lambda: df.groupby('c').apply(lambda g: pd.Series({
'a_mean': g['a'].mean(),
'b_sum': g['b'].sum()
})),
label='groupby + apply'
))
# Method 3: pivot_table
benchmarks.append(benchmark(
lambda: df.pivot_table(values=['a', 'b'], index='c', aggfunc={'a': 'mean', 'b': 'sum'}),
label='pivot_table'
))
# Method 4: crosstab
benchmarks.append(benchmark(
lambda: pd.crosstab(df['c'], values=df['a'], aggfunc='mean'),
label='crosstab'
))
# Display results
print(f"{'Method':<25} {'Median (ms)':<12} {'Std (ms)':<12} {'Speedup':<10}")
print("-" * 59)
baseline = benchmarks[0]['median']
for b in sorted(benchmarks, key=lambda x: x['median']):
speedup = baseline / b['median']
print(f"{b['label']:<25} {b['median']*1000:<12.2f} {b['std']*1000:<12.2f} {speedup:<10.1f}x")
Typical Results:
Method Median (ms) Std (ms) Speedup
-----------------------------------------------------------
groupby + agg 45.23 2.34 1.0x
pivot_table 52.18 3.12 0.9x
crosstab 38.67 1.89 1.2x
groupby + apply 180.45 8.56 0.3x
Complete Data Processing Pipeline
import pandas as pd
import numpy as np
from typing import Dict, List, Tuple
class DataProcessor:
"""Production-grade data processor with performance monitoring."""
def __init__(self, df: pd.DataFrame):
self.original = df.copy()
self.processed = df.copy()
self.steps_log: List[Dict] = []
self._log_step("init", len(df))
def _log_step(self, step_name: str, n_rows: int):
self.steps_log.append({
'step': step_name,
'n_rows': n_rows,
'memory_mb': self.processed.memory_usage(deep=True).sum() / 1024 / 1024
})
def optimize_dtypes(self) -> 'DataProcessor':
"""Downcast numeric types to reduce memory."""
for col in self.processed.select_dtypes(include=['int64']).columns:
self.processed[col] = pd.to_numeric(
self.processed[col], downcast='integer'
)
for col in self.processed.select_dtypes(include=['float64']).columns:
self.processed[col] = pd.to_numeric(
self.processed[col], downcast='float'
)
self._log_step("optimize_dtypes", len(self.processed))
return self
def convert_categoricals(self, columns: List[str]) -> 'DataProcessor':
"""Convert string columns to categorical type."""
for col in columns:
if col in self.processed.columns:
self.processed[col] = pd.Categorical(self.processed[col])
self._log_step("convert_categoricals", len(self.processed))
return self
def remove_duplicates(self, subset: List[str] = None,
keep: str = 'first') -> 'DataProcessor':
"""Remove duplicate rows."""
before = len(self.processed)
self.processed = self.processed.drop_duplicates(
subset=subset, keep=keep
).reset_index(drop=True)
self._log_step(
f"remove_duplicates (removed {before - len(self.processed)})",
len(self.processed)
)
return self
def handle_missing(self, strategy: Dict[str, str] = None) -> 'DataProcessor':
"""Handle missing values with specified strategies.
Strategies: 'mean', 'median', 'mode', 'drop', 'ffill', 'bfill'
"""
if strategy is None:
# Default: drop rows with >50% missing
threshold = len(self.processed.columns) * 0.5
self.processed = self.processed.dropna(thresh=threshold)
else:
for col, method in strategy.items():
if col not in self.processed.columns:
continue
if method == 'mean':
self.processed[col].fillna(
self.processed[col].mean(), inplace=True
)
elif method == 'median':
self.processed[col].fillna(
self.processed[col].median(), inplace=True
)
elif method == 'mode':
self.processed[col].fillna(
self.processed[col].mode()[0], inplace=True
)
elif method == 'drop':
self.processed.dropna(subset=[col], inplace=True)
elif method == 'ffill':
self.processed[col].fillna(method='ffill', inplace=True)
self._log_step("handle_missing", len(self.processed))
return self
def add_features(self, feature_dict: Dict[str, callable]) -> 'DataProcessor':
"""Add engineered features using eval or lambda."""
for name, func in feature_dict.items():
if callable(func):
self.processed[name] = func(self.processed)
elif isinstance(func, str):
self.processed[name] = self.processed.eval(func)
self._log_step("add_features", len(self.processed))
return self
def filter_rows(self, condition: str) -> 'DataProcessor':
"""Filter rows using query syntax."""
self.processed = self.processed.query(condition).reset_index(drop=True)
self._log_step(f"filter_rows: {condition}", len(self.processed))
return self
def aggregate(self, group_by: List[str], agg_dict: Dict) -> 'DataProcessor':
"""Group and aggregate data."""
self.processed = self.processed.groupby(group_by).agg(agg_dict)
if isinstance(self.processed.index, pd.MultiIndex):
self.processed = self.processed.reset_index()
self._log_step("aggregate", len(self.processed))
return self
def summary(self) -> pd.DataFrame:
"""Return processing summary."""
return pd.DataFrame(self.steps_log)
# Usage: Full pipeline
np.random.seed(42)
n = 10000
raw_df = pd.DataFrame({
'id': range(n),
'name': np.random.choice(['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'], n),
'age': np.random.randint(18, 70, n).astype('int64'),
'income': np.random.normal(50000, 15000, n).astype('float64'),
'department': np.random.choice(['Eng', 'Sales', 'Mktg', 'HR'], n),
'score': np.random.uniform(0, 100, n).astype('float64'),
'date': pd.date_range('2024-01-01', periods=n, freq='h').astype('object')
})
# Introduce some issues
raw_df.loc[np.random.choice(n, 500), 'age'] = np.nan
raw_df.loc[np.random.choice(n, 300), 'income'] = np.nan
raw_df = pd.concat([raw_df, raw_df.sample(200)]).reset_index(drop=True) # Duplicates
# Run pipeline
result = (DataProcessor(raw_df)
.optimize_dtypes()
.convert_categoricals(['department', 'name'])
.remove_duplicates(subset=['id'])
.handle_missing({'age': 'median', 'income': 'mean'})
.add_features({
'income_log': lambda df: np.log1p(df['income']),
'age_group': lambda df: pd.cut(
df['age'],
bins=[0, 25, 35, 50, 100],
labels=['Young', 'Mid', 'Senior', 'Expert']
)
})
.filter_rows('age > 20 and income > 30000')
.aggregate(
group_by=['department'],
agg_dict={
'income': ['mean', 'std'],
'score': ['mean', 'max'],
'age': 'mean',
'id': 'count'
}
)
.summary()
print("\nPipeline Summary:")
print(result)
Performance Tips Cheat Sheet
| Tip | Before | After | Speedup |
|---|---|---|---|
| Use vectorization | df.apply(f) | df['a'] * df['b'] | 10-100x |
| Use eval() | chained ops | df.eval('a*b+c') | 2-5x |
| Use query() | boolean indexing | df.query('a>0') | 1.5-3x |
| Categorical dtypes | string columns | pd.Categorical() | 5-10x mem |
| Downcast numerics | int64/float64 | int8/float32 | 2-8x mem |
| Use pipe() | nested calls | .pipe(f).pipe(g) | readability |
| Chunk large files | pd.read_csv() | chunksize=N | prevents OOM |
| Avoid .iterrows() | row iteration | vectorize | 100-1000x |
| Use .itertuples() | must iterate | named tuples | 10-50x |
Key Takeaways
πSummary: Advanced Pandas Patterns
- Vectorization is the single most important performance technique in pandas. Always prefer vectorized operations over
apply(),iterrows(), or Python loops. Vectorized operations execute at C speed, bypassing Python interpreter overhead entirely. eval()andquery()provide further optimization for complex expressions by parsing the entire expression and evaluating in a single C-level pass, reducing memory allocation from O(kΒ·n) to O(n).- Method chaining with
pipe()creates readable, composable pipelines that are easier to test and maintain, reading left-to-right like natural language. - Indexing matters:
.locfor labels (inclusive on both ends),.ilocfor positions (exclusive on right), boolean masks for filtering. Avoid chained indexing to prevent SettingWithCopyWarning. - MultiIndex enables hierarchical data analysis with multi-level aggregation and cross-section access. Use
xs()for cross-section access andstack()/unstack()for reshaping between wide and long formats. - Window functions (
rolling,expanding,ewm) are essential for time series analysis. EWM assigns exponentially decaying weights, making it more responsive to recent changes than simple rolling windows. - Categorical dtypes can reduce memory by 5-10x for columns with repeated string values by storing integer codes instead of repeated string objects.
- Monitor memory with
df.info(memory_usage='deep')and downcast dtypes to prevent OOM errors on large datasets. The downcast function automatically selects the smallest numeric type that can represent the data.
Practice Exercises
Exercise 1: Performance Optimization
# Optimize this slow code:
def process_data(df):
"""This function takes 5 seconds on 1M rows. Make it under 0.5s."""
results = []
for idx, row in df.iterrows():
if row['age'] > 30 and row['income'] > 50000:
results.append({
'name': row['name'],
'score': row['income'] / row['age'],
'category': 'high' if row['income'] > 75000 else 'medium'
})
return pd.DataFrame(results)
# Rewrite using vectorized operations
def process_data_optimized(df):
# Your code here
pass
Exercise 2: MultiIndex Challenge
# Create a MultiIndex DataFrame from sales data
# and answer these questions:
# 1. What's the total revenue per region?
# 2. What's the average order value per quarter?
# 3. Which product has the highest growth Q1βQ2?
sales_data = pd.DataFrame({
'region': np.random.choice(['North', 'South', 'East', 'West'], 1000),
'quarter': np.random.choice(['Q1', 'Q2', 'Q3', 'Q4'], 1000),
'product': np.random.choice(['A', 'B', 'C'], 1000),
'revenue': np.random.uniform(100, 10000, 1000),
'units': np.random.randint(1, 100, 1000)
})
# Your MultiIndex operations here
Exercise 3: Build a Rolling Dashboard
# Create a function that computes and visualizes:
# 1. Rolling mean with different windows (7, 14, 30 days)
# 2. Bollinger Bands
# 3. Rolling correlation between two metrics
# 4. Exponential weighted moving average
def rolling_analysis(df, value_col, date_col='date'):
"""Compute rolling statistics and return summary DataFrame."""
# Your code here
pass
Exercise 4: Memory-Efficient Pipeline
# Process this 100M-row dataset in chunks:
# 1. Read in chunks of 100K rows
# 2. Optimize dtypes in each chunk
# 3. Filter and aggregate
# 4. Combine results efficiently
def process_large_dataset(filepath):
"""Process large CSV without running out of memory."""
# Your code here
pass