Advanced Pandas: Performance & Patterns

Module 1: FoundationsFree Lesson

Advertisement

Why Performance Matters

Pandas is convenient but can be deceptively slow. Understanding its internals β€” vectorization, chunking, and efficient patterns β€” is critical when working with datasets larger than a few hundred thousand rows.

DfVectorization

The practice of applying operations to entire arrays or columns at the C level rather than iterating through elements one by one in Python. Vectorized operations exploit CPU SIMD (Single Instruction, Multiple Data) instructions to process multiple data points simultaneously, bypassing the Python interpreter overhead entirely.

The Performance Trap

Architecture Diagram
Approach              1M rows    10M rows    100M rows
─────────────────────────────────────────────────────────
for loop              2.5s       25s         250s+ πŸ’€
.apply()              1.8s       18s         180s  😰
Vectorized            0.05s      0.5s        5s    βœ“
eval()/query()        0.03s      0.3s        3s    βœ“βœ“
Numba/cuDF            0.01s      0.1s        1s    βœ“βœ“βœ“

Why the gap widens with scale: Python's per-operation overhead is roughly 100–200 ns. For 1 million rows, that adds up to 100–200 seconds of pure interpreter overhead. Vectorized operations amortize this overhead across the entire array, keeping the constant factor near zero. The speedup ratio increases with dataset size.

Why Pandas Can Be Slow

Architecture Diagram
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    Why Vectorization Wins                β”‚
β”‚                                                         β”‚
β”‚  Python loop:     for i in range(n):  result[i] = ...   β”‚
β”‚                   ↓                                     β”‚
β”‚                   Python interpreter overhead per item   β”‚
β”‚                   Type checking per iteration            β”‚
β”‚                   Function call overhead                 β”‚
β”‚                   O(n) Python operations                 β”‚
β”‚                                                         β”‚
β”‚  Vectorized:      result = array * 2                     β”‚
β”‚                   ↓                                     β”‚
β”‚                   Single C-level operation               β”‚
β”‚                   No Python overhead                     β”‚
β”‚                   SIMD instructions (parallel)           β”‚
β”‚                   O(1) Python operation                  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

πŸ“Measuring the Vectorization Speedup

import pandas as pd
import numpy as np
import time

# Performance demonstration
np.random.seed(42)
n = 1_000_000
df = pd.DataFrame({
    'a': np.random.randn(n),
    'b': np.random.randn(n),
    'c': np.random.choice(['X', 'Y', 'Z'], n)
})

# Method 1: apply() β€” SLOW
start = time.perf_counter()
df['result_apply'] = df.apply(lambda row: row['a'] * 2 + row['b'], axis=1)
time_apply = time.perf_counter() - start

# Method 2: Vectorized β€” FAST
start = time.perf_counter()
df['result_vector'] = df['a'] * 2 + df['b']
time_vector = time.perf_counter() - start

# Method 3: eval() β€” FASTEST
start = time.perf_counter()
df['result_eval'] = df.eval('a * 2 + b')
time_eval = time.perf_counter() - start

print(f"apply():     {time_apply:.4f}s")
print(f"vectorized:  {time_vector:.4f}s")
print(f"eval():      {time_eval:.4f}s")
print(f"Speedup (vector vs apply): {time_apply/time_vector:.1f}x")
print(f"Speedup (eval vs apply):   {time_apply/time_eval:.1f}x")

Typical Output:

Architecture Diagram
apply():     1.8234s
vectorized:  0.0156s
eval():      0.0112s
Speedup (vector vs apply): 116.9x
Speedup (eval vs apply):   162.8x

eval() and query(): Fast Expression Evaluation

eval() and query() use expression parsing to evaluate operations in a single pass, avoiding intermediate array creation.

DfExpression Parsing in eval()

A technique where the entire arithmetic expression is parsed as an abstract syntax tree and evaluated in a single C-level pass. Rather than creating intermediate arrays for each binary operation, eval() rewrites the expression to minimize temporary allocations. For an expression like a * b + c, naive pandas creates two intermediates; eval() creates only one.

Memory Allocation with eval() vs Chained Operations

Chained:Β Mchained=O(kβ‹…n)Β forΒ kΒ operations\text{Chained: } M_{\text{chained}} = O(k \cdot n) \text{ for } k \text{ operations}

Here,

  • MM=
  • nn=
  • kk=

For an expression like a * b + c, naive pandas creates:

  1. temp1 = a Γ— b (intermediate array)
  2. result = temp1 + c (another intermediate)

eval() parses the entire expression and evaluates it in one C-level operation, using only one intermediate allocation.

When to use eval() vs query(): Use eval() when you are creating new columns from complex arithmetic on existing columns. Use query() when you are filtering rows based on multiple conditions. Both avoid intermediate allocations, but eval() is most beneficial when the expression involves 3+ columns.

import pandas as pd
import numpy as np

np.random.seed(42)
n = 1_000_000
df = pd.DataFrame({
    'x': np.random.randn(n),
    'y': np.random.randn(n),
    'z': np.random.randn(n),
    'w': np.random.randn(n)
})

# Standard approach: multiple intermediate arrays
df['result_std'] = df['x'] * df['y'] + df['z'] - df['w'] * 2

# eval(): single-pass evaluation, no intermediates
df['result_eval'] = df.eval('x * y + z - w * 2')

# query(): filter with expression strings
# Equivalent to: df[(df['x'] > 0) & (df['y'] > 1) & (df['z'] < df['w'])]
result = df.query('x > 0 and y > 1 and z < w')
print(f"Rows matching query: {len(result)}")

# Local variables in query
threshold = 0.5
result2 = df.query('x > @threshold and y > @threshold')
print(f"Rows with both > {threshold}: {len(result2)}")

# Assignment with eval (modify in place)
df.eval('x_squared = x ** 2', inplace=True)
print(df[['x', 'x_squared']].head())
#           x  x_squared
# 0  0.248357   0.061681
# 1 -0.072008   0.005185
# 2  0.253489   0.064257
# 3 -0.035586   0.001266
# 4  0.203452   0.041393

Performance Comparison

import time

np.random.seed(42)
n = 5_000_000
df = pd.DataFrame({
    'a': np.random.randn(n),
    'b': np.random.randn(n),
    'c': np.random.randn(n),
    'd': np.random.randn(n)
})

def benchmark(label, func, *args):
    start = time.perf_counter()
    result = func(*args)
    elapsed = time.perf_counter() - start
    return label, elapsed, result

results = []

# Standard chaining
results.append(benchmark(
    'chained', lambda: df['a'] * df['b'] + df['c'] / df['d'] - df['a']**2
))

# eval
results.append(benchmark(
    'eval', lambda: df.eval('a * b + c / d - a ** 2')
))

# query for filtering
results.append(benchmark(
    'query', lambda: df.query('a > 0 and b > 0 and c > 0')
))

# Boolean indexing (slow)
results.append(benchmark(
    'boolean_idx', lambda: df[(df['a'] > 0) & (df['b'] > 0) & (df['c'] > 0)]
))

print(f"{'Method':<15} {'Time (ms)':<12}")
print("-" * 27)
for label, elapsed, _ in results:
    print(f"{label:<15} {elapsed*1000:<12.2f}")

Typical Results:

Architecture Diagram
Method          Time (ms)
---------------------------
chained         45.23
eval            28.67
query           31.45
boolean_idx     52.18

Method Chaining: pipe() for Clean Pipelines

Method chaining creates readable, linear data transformations without temporary variables. The pipe() method enables custom functions in chains.

DfMethod Chaining

A programming pattern where multiple method calls are written sequentially, with the output of one method passed as the input to the next. In pandas, most methods return a DataFrame, enabling this pattern. The pipe() method extends this to custom functions, enabling composable, testable, and readable data pipelines.

The pipe() advantage over nested calls: Nested function calls like g(f(h(df))) read inside-out, increasing cognitive load. Method chains like df.pipe(h).pipe(f).pipe(g) read left-to-right, matching natural reading order. This is especially important when pipelines have 5+ steps.

import pandas as pd
import numpy as np

# Sample data
np.random.seed(42)
df = pd.DataFrame({
    'date': pd.date_range('2024-01-01', periods=1000, freq='D'),
    'store': np.random.choice(['NYC', 'LA', 'Chicago'], 1000),
    'product': np.random.choice(['A', 'B', 'C'], 1000),
    'revenue': np.random.exponential(100, 1000),
    'units': np.random.randint(1, 50, 1000),
    'cost': np.random.exponential(50, 1000)
})

# Without chaining: nested, hard to read
result1 = (
    df.assign(
        profit=lambda x: x['revenue'] - x['cost'],
        margin=lambda x: (x['revenue'] - x['cost']) / x['revenue']
    )
    .query('margin > 0.1')
    .groupby(['store', 'product'])
    .agg({
        'revenue': 'sum',
        'profit': 'sum',
        'margin': 'mean',
        'units': 'sum'
    })
    .reset_index()
    .sort_values('profit', ascending=False)
)

# With pipe(): composable and testable
def add_profit_margin(data):
    """Calculate profit and margin."""
    return data.assign(
        profit=lambda x: x['revenue'] - x['cost'],
        margin=lambda x: (x['revenue'] - x['cost']) / x['revenue']
    )

def filter_profitable(data, min_margin=0.1):
    """Filter by minimum margin."""
    return data[data['margin'] > min_margin]

def aggregate_by_store_product(data):
    """Aggregate metrics by store and product."""
    return (data
            .groupby(['store', 'product'])
            .agg({
                'revenue': 'sum',
                'profit': 'sum',
                'margin': 'mean',
                'units': 'sum'
            })
            .reset_index())

def rank_by_profit(data):
    """Sort by profit descending."""
    return data.sort_values('profit', ascending=False)

# Clean pipeline
result2 = (df
           .pipe(add_profit_margin)
           .pipe(filter_profitable, min_margin=0.15)
           .pipe(aggregate_by_store_product)
           .pipe(rank_by_profit))

print(result2.head())

Advanced Indexing: .loc vs .iloc vs []

Understanding the difference between indexing methods prevents silent bugs and performance issues.

DfLabel-Based Indexing (.loc)

Selects data by row and column labels rather than integer positions. Unlike Python slicing, .loc is inclusive on both ends β€” df.loc['a':'c'] returns rows labeled 'a', 'b', and 'c'. This can cause subtle bugs when switching between label-based and position-based indexing.

Architecture Diagram
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    Pandas Indexing Methods                     β”‚
β”‚                                                               β”‚
β”‚  .loc[label]     β†’ Label-based        df.loc[0, 'name']      β”‚
β”‚  .iloc[position] β†’ Integer position   df.iloc[0, 1]          β”‚
β”‚  []              β†’ Mixed (context)    df[condition]           β”‚
β”‚                                                               β”‚
β”‚  ⚠ WARNING: SettingWithCopyWarning                          β”‚
β”‚  df[df['col'] > 0]['new_col'] = value  ← WRONG (copy)       β”‚
β”‚  df.loc[df['col'] > 0, 'new_col'] = val ← RIGHT (view)      β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
import pandas as pd
import numpy as np

np.random.seed(42)
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'age': [25, 30, 35, 28, 32],
    'score': [85, 92, 78, 95, 88],
    'grade': ['B', 'A', 'B+', 'A+', 'A']
}, index=['r0', 'r1', 'r2', 'r3', 'r4'])

# .loc: label-based (inclusive on both ends!)
print("=== .loc (label-based) ===")
print(df.loc['r0':'r2', 'name':'score'])  # Includes r2!
#          name  age  score
# r0    Alice   25     85
# r1      Bob   30     92
# r2  Charlie   35     78

# .iloc: position-based (exclusive on right, like Python slicing)
print("\n=== .iloc (position-based) ===")
print(df.iloc[0:3, 0:2])  # Excludes position 3!
#          name  age
# r0    Alice   25
# r1      Bob   30
# r2  Charlie   35

# Boolean indexing
print("\n=== Boolean indexing ===")
print(df[df['age'] > 30])
#          name  age  score grade
# r2  Charlie   35     78    B+
# r4      Eve   32     88     A

# Chained assignment (WRONG - triggers SettingWithCopyWarning)
# df[df['age'] > 30]['grade'] = 'Senior'  # ❌

# Proper assignment (RIGHT)
df.loc[df['age'] > 30, 'grade'] = 'Senior'
print("\nAfter proper assignment:")
print(df.loc[df['age'] > 30])
#          name  age  score   grade
# r2  Charlie   35     78  Senior
# r4      Eve   32     88  Senior

The SettingWithCopyWarning explained: When you write df[mask]['col'] = val, pandas first creates a copy of the data with df[mask], then attempts to assign into that copy. The assignment silently fails on a copy, not the original DataFrame. Always use .loc[mask, 'col'] = val to operate on a view of the original data.

Indexing Performance Comparison

import pandas as pd
import numpy as np
import time

n = 1_000_000
df = pd.DataFrame({
    'value': np.random.randn(n),
    'category': np.random.choice(['A', 'B', 'C', 'D'], n)
})
df.index = range(n)  # Integer index

# Benchmark different indexing methods
def timeit(label, func, n_runs=100):
    times = []
    for _ in range(n_runs):
        start = time.perf_counter()
        func()
        times.append(time.perf_counter() - start)
    return label, np.median(times)

results = []
results.append(timeit('df[df["value"] > 0]', lambda: df[df['value'] > 0]))
results.append(timeit('df.loc[df["value"] > 0]', lambda: df.loc[df['value'] > 0]))
results.append(timeit('df.query("value > 0")', lambda: df.query('value > 0')))
results.append(timeit('df.iloc[:500000]', lambda: df.iloc[:500000]))
results.append(timeit('df.loc[:500000]', lambda: df.loc[:500000]))

print(f"{'Method':<35} {'Median (ms)':<12}")
print("-" * 47)
for label, t in results:
    print(f"{label:<35} {t*1000:<12.3f}")

MultiIndex: Hierarchical Indexing

MultiIndex enables multi-dimensional grouping within a 2D DataFrame β€” essential for panel data, time series with multiple hierarchies, and pivot results.

DfMultiIndex (Hierarchical Indexing)

A pandas index structure that enables multiple levels of row or column labels on a single DataFrame axis. Each level represents a categorical variable, and the combination of levels uniquely identifies each row. MultiIndex is analogous to a composite database key and enables group-by operations across multiple dimensions simultaneously.

Architecture Diagram
MultiIndex Structure:
                     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                     β”‚        MultiIndex               β”‚
                     β”‚  Level 0    Level 1    Level 2  β”‚
                     β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
                     β”‚  β”‚ NYC     β”‚ Q1      β”‚ Jan    β”‚ β”‚
                     β”‚  β”‚ NYC     β”‚ Q1      β”‚ Feb    β”‚ β”‚
                     β”‚  β”‚ NYC     β”‚ Q2      β”‚ Mar    β”‚ β”‚
                     β”‚  β”‚ LA      β”‚ Q1      β”‚ Jan    β”‚ β”‚
                     β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
                     β”‚  Store      Quarter    Month    β”‚
                     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

When to use MultiIndex vs separate DataFrames: MultiIndex is ideal when your data naturally has hierarchical structure (e.g., city-quarter-month) and you need to aggregate or slice across multiple levels. If you only ever access data by a single key, a flat index with a regular column is simpler and often faster.

import pandas as pd
import numpy as np

# Create MultiIndex DataFrame
np.random.seed(42)
arrays = [
    ['NYC', 'NYC', 'NYC', 'LA', 'LA', 'LA', 'Chicago', 'Chicago', 'Chicago'],
    ['Q1', 'Q1', 'Q2', 'Q1', 'Q1', 'Q2', 'Q1', 'Q1', 'Q2'],
    ['Jan', 'Feb', 'Mar', 'Jan', 'Feb', 'Mar', 'Jan', 'Feb', 'Mar']
]
index = pd.MultiIndex.from_arrays(arrays, names=['City', 'Quarter', 'Month'])

df = pd.DataFrame({
    'revenue': np.random.uniform(10000, 50000, 9),
    'units': np.random.randint(100, 1000, 9),
    'cost': np.random.uniform(5000, 25000, 9)
}, index=index)

print("MultiIndex DataFrame:")
print(df)

# Accessing MultiIndex levels
print("\nAll cities:", df.index.get_level_values('City').unique().tolist())
print("All quarters:", df.index.get_level_values('Quarter').unique().tolist())

# Cross-section: select all data for NYC
print("\nNYC data:")
print(df.loc['NYC'])

# Cross-section: select all Q1 data across all cities
print("\nAll Q1 data:")
print(df.xs('Q1', level='Quarter'))

# Stack/Unstack: reshape between wide and long format
print("\nUnstacked (wide format):")
print(df.unstack(level='City'))

# Swap levels for different grouping
print("\nSwapped levels (Quarter first):")
print(df.swaplevel('Quarter', 'City').sort_index())

# Aggregation across levels
print("\nRevenue by City:")
print(df.groupby(level='City')['revenue'].sum())

print("\nRevenue by Quarter across all cities:")
print(df.groupby(level='Quarter')['revenue'].sum())

Creating MultiIndex from Columns

# Create flat DataFrame
df_flat = pd.DataFrame({
    'city': ['NYC', 'NYC', 'LA', 'LA', 'Chicago', 'Chicago'],
    'year': [2023, 2024, 2023, 2024, 2023, 2024],
    'revenue': [100000, 120000, 95000, 110000, 80000, 95000]
})

# Convert to MultiIndex
df_multi = df_flat.set_index(['city', 'year'])
print("From set_index:")
print(df_multi)

# Reset back to flat
df_flat_again = df_multi.reset_index()
print("\nBack to flat:")
print(df_flat_again)

# Pivot to create MultiIndex columns
df_pivot = df_flat.pivot(index='city', columns='year', values='revenue')
print("\nPivoted:")
print(df_pivot)

Window Functions: Rolling, Expanding, EWM

Window functions compute sliding window aggregations β€” essential for time series analysis.

Mathematical Definitions

Rolling window of size ww for function ff:

Rolling Window Aggregation

rolling_ft=f(xtβˆ’w+1,xtβˆ’w+2,…,xt)\text{rolling\_f}_t = f(x_{t-w+1}, x_{t-w+2}, \ldots, x_t)

Here,

  • ww=
  • ff=
  • tt=

Expanding window (cumulative):

Expanding Window (Cumulative) Aggregation

expanding_ft=f(x1,x2,…,xt)\text{expanding\_f}_t = f(x_1, x_2, \ldots, x_t)

Here,

  • ff=
  • tt=

Exponentially Weighted Moving (EWM):

ewmt=βˆ‘i=0tΞ±ixtβˆ’iβˆ‘i=0tΞ±i\text{ewm}_t = \frac{\sum_{i=0}^{t} \alpha^i x_{t-i}}{\sum_{i=0}^{t} \alpha^i}

Why EWM over simple rolling? EWM assigns exponentially decreasing weights to older observations, giving recent data more influence. This makes EWM more responsive to recent changes than a simple rolling mean of the same effective window. The half-life parameter provides an intuitive way to control memory: after half_life periods, a past observation's weight drops to 50% of its peak.

import pandas as pd
import numpy as np

# Generate time series data
np.random.seed(42)
dates = pd.date_range('2024-01-01', periods=365, freq='D')
df = pd.DataFrame({
    'date': dates,
    'price': 100 + np.cumsum(np.random.randn(365) * 2),
    'volume': np.random.randint(1000, 10000, 365)
}).set_index('date')

# Rolling statistics
df['rolling_mean_7d'] = df['price'].rolling(window=7).mean()
df['rolling_std_7d'] = df['price'].rolling(window=7).std()
df['rolling_min_30d'] = df['price'].rolling(window=30).min()
df['rolling_max_30d'] = df['price'].rolling(window=30).max()

# Expanding (cumulative) statistics
df['expanding_mean'] = df['price'].expanding().mean()
df['expanding_std'] = df['price'].expanding().std()

# EWM (Exponentially Weighted Moving)
df['ewm_7d'] = df['price'].ewm(span=7).mean()
df['ewm_30d'] = df['price'].ewm(span=30).mean()

# Custom rolling aggregation
df['rolling_range'] = df['price'].rolling(7).apply(
    lambda x: x.max() - x.min(), raw=True
)

# Bollinger Bands (trading indicator)
window = 20
df['bb_mid'] = df['price'].rolling(window).mean()
df['bb_std'] = df['price'].rolling(window).std()
df['bb_upper'] = df['bb_mid'] + 2 * df['bb_std']
df['bb_lower'] = df['bb_mid'] - 2 * df['bb_std']

print(df[['price', 'rolling_mean_7d', 'bb_upper', 'bb_lower']].tail())

Rolling with Multiple Windows

# Multiple window sizes in one operation
windows = [7, 14, 30, 90]
for w in windows:
    df[f'mean_{w}d'] = df['price'].rolling(w).mean()
    df[f'std_{w}d'] = df['price'].rolling(w).std()
    df[f'change_{w}d'] = df['price'].pct_change(w)

# Rolling correlation between price and volume
df['price_vol_corr_30d'] = df['price'].rolling(30).corr(df['volume'])

# Rolling apply with multiple values
def rolling_trend(series):
    """Compute linear trend over rolling window."""
    x = np.arange(len(series))
    slope = np.polyfit(x, series, 1)[0]
    return slope

df['trend_30d'] = df['price'].rolling(30).apply(rolling_trend, raw=True)

print(df[['price', 'mean_7d', 'mean_30d', 'trend_30d']].head(35))

String Operations: .str Accessor

The .str accessor enables vectorized string operations β€” apply string methods to entire columns without loops.

import pandas as pd

# Sample text data
df = pd.DataFrame({
    'text': [
        '  Hello World  ',
        'python is GREAT',
        'data-science@example.com',
        'Price: $1,234.56',
        'NEW YORK, NY 10001',
        '  Machine Learning  '
    ]
})

# Basic string operations
df['cleaned'] = df['text'].str.strip()
df['lower'] = df['text'].str.lower()
df['upper'] = df['text'].str.upper()
df['title'] = df['text'].str.title()

# Length and containment
df['length'] = df['text'].str.len()
df['contains_science'] = df['text'].str.contains('science', case=False)
df['starts_with_hello'] = df['text'].str.startswith('Hello')

# Split and extract
df['first_word'] = df['text'].str.strip().str.split().str[0]
df['word_count'] = df['text'].str.strip().str.split().str.len()

# Extract patterns with regex
df['email_domain'] = df['text'].str.extract(r'@(\w+)')
df['numbers'] = df['text'].str.extractall(r'(\d+)').groupby(level=0).apply(list)

# Replace
df['no_spaces'] = df['text'].str.replace(' ', '_', regex=False)

# Pad and slice
df['padded'] = df['text'].str.strip().str.pad(20, fillchar='=')
df['first_10'] = df['text'].str[:10]

# Categorical encoding from strings
df['category'] = pd.Categorical(df['text'].str.strip().str.lower())

print(df[['text', 'cleaned', 'lower', 'first_word', 'length']])

Advanced Regex with .str

# Complex regex patterns
df = pd.DataFrame({
    'raw': [
        'Order #12345 - Total: $1,234.56',
        'Invoice ABC-789 - Amount: $567.89',
        'Order #67890 - Total: $901.23',
        'INV-001 - $12.34'
    ]
})

# Extract multiple patterns
df['order_id'] = df['raw'].str.extract(r'#(\d+|-[\d]+)')
df['amount'] = df['raw'].str.extract(r'\$([\d,]+\.?\d*)')
df['amount_numeric'] = df['amount'].str.replace(',', '', regex=False).astype(float)

# Named groups
pattern = r'(?P<prefix>#|INV-)(?P<id>\d+)'
df['prefix'] = df['raw'].str.extract(pattern)['prefix']
df['extracted_id'] = df['raw'].str.extract(pattern)['id']

# Replace with function
df['normalized'] = df['raw'].str.replace(
    r'\$[\d,]+\.?\d*',
    lambda m: f"${float(m.group().replace('$','').replace(',','')) * 1.1:.2f}",
    regex=True
)

print(df[['raw', 'order_id', 'amount', 'normalized']])

Categorical Data: Memory-Efficient Enums

Categorical data stores integer codes instead of repeated strings, dramatically reducing memory.

DfCategorical Data Type

A pandas data type that stores each unique string value as an integer code, with a separate mapping array. For a column with kk unique values repeated nn times, string storage requires O(nβ‹…L)O(n \cdot L) bytes (where LL is average string length), while categorical storage requires O(n)O(n) bytes for codes plus O(kβ‹…L)O(k \cdot L) bytes for categories β€” yielding O(nβ‹…L)β†’O(n+kβ‹…L)O(n \cdot L) \to O(n + k \cdot L) reduction.

Architecture Diagram
String Storage:              Categorical Storage:
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ "Engineering" (12B)  β”‚     β”‚ Code: 0              β”‚
β”‚ "Engineering" (12B)  β”‚     β”‚ Code: 0              β”‚
β”‚ "Sales"       (5B)   β”‚     β”‚ Code: 1              β”‚
β”‚ "Engineering" (12B)  β”‚     β”‚ Code: 0              β”‚
β”‚ "Marketing"   (9B)   β”‚     β”‚ Code: 2              β”‚
β”‚ "Sales"       (5B)   β”‚     β”‚ Code: 1              β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
Total: 55 bytes              Codes: 6 bytes + categories: 26 bytes
                             Total: 32 bytes (42% savings)

Categorical memory savings formula: For nn rows with kk unique string values of average length LL bytes, the savings are approximately nβ‹…(Lβˆ’1)βˆ’kβ‹…Ln \cdot (L - 1) - k \cdot L bytes (since each code is 1 byte for < 256 categories). The breakeven point is roughly n>kβ‹…L/(Lβˆ’1)β‰ˆkn > k \cdot L / (L - 1) \approx k β€” categorical becomes worthwhile when you have at least ~10Γ— more rows than unique categories.

import pandas as pd
import numpy as np

# Sample data with repeated strings
n = 1_000_000
df = pd.DataFrame({
    'department': np.random.choice(
        ['Engineering', 'Sales', 'Marketing', 'HR', 'Finance', 'Legal'],
        n
    ),
    'level': np.random.choice(['Junior', 'Mid', 'Senior', 'Staff', 'Principal'], n),
    'location': np.random.choice(['NYC', 'LA', 'Chicago', 'Seattle', 'Austin'], n)
})

# Memory before categorization
mem_before = df.memory_usage(deep=True).sum() / 1024 / 1024
print(f"Memory before: {mem_before:.2f} MB")

# Convert to categorical
df['department'] = pd.Categorical(df['department'])
df['level'] = pd.Categorical(df['level'])
df['location'] = pd.Categorical(df['location'])

# Memory after categorization
mem_after = df.memory_usage(deep=True).sum() / 1024 / 1024
print(f"Memory after: {mem_after:.2f} MB")
print(f"Memory saved: {(1 - mem_after/mem_before)*100:.1f}%")

# Ordered categorities (for sorting)
df['level'] = pd.Categorical(
    df['level'],
    categories=['Junior', 'Mid', 'Senior', 'Staff', 'Principal'],
    ordered=True
)

# Now comparisons work!
print(df[df['level'] > 'Senior'].head())
print(f"\nSorted by level:")
print(df.sort_values('level')['level'].value_counts())

# Category info
print(f"\nDepartment categories: {df['department'].cat.categories.tolist()}")
print(f"Category codes (first 5): {df['department'].cat.codes.head().tolist()}")

Memory Optimization

import pandas as pd
import numpy as np

# Create a memory-heavy DataFrame
n = 1_000_000
df = pd.DataFrame({
    'id': range(n),
    'age': np.random.randint(18, 80, n).astype('int64'),
    'income': np.random.normal(50000, 15000, n).astype('float64'),
    'score': np.random.uniform(0, 100, n).astype('float64'),
    'is_active': np.random.choice([True, False], n).astype('bool'),
    'department': np.random.choice(['A', 'B', 'C', 'D', 'E'], n),
    'timestamp': pd.date_range('2024-01-01', periods=n, freq='s')
})

# Check memory usage
print("Memory usage by column:")
print(df.memory_usage(deep=True))
print(f"\nTotal: {df.memory_usage(deep=True).sum() / 1024 / 1024:.2f} MB")

# Downcast numeric types
def downcast(df):
    """Downcast numeric columns to smallest possible type."""
    result = df.copy()
    for col in result.select_dtypes(include=['int64']).columns:
        result[col] = pd.to_numeric(result[col], downcast='integer')
    for col in result.select_dtypes(include=['float64']).columns:
        result[col] = pd.to_numeric(result[col], downcast='float')
    return result

df_optimized = downcast(df)

# Memory after downcasting
mem_after = df_optimized.memory_usage(deep=True).sum() / 1024 / 1024
print(f"\nMemory after downcasting: {mem_after:.2f} MB")

# Show dtype changes
print("\nDtype comparison:")
for col in df.columns:
    before = df[col].dtype
    after = df_optimized[col].dtype
    if before != after:
        print(f"  {col}: {before} β†’ {after}")

# Chunked reading for very large files
def process_large_file(filepath, chunksize=10000):
    """Process large CSV in chunks."""
    results = []
    for chunk in pd.read_csv(filepath, chunksize=chunksize):
        # Process each chunk
        result = chunk.groupby('category').agg({'value': 'sum'})
        results.append(result)
    return pd.concat(results).groupby(level=0).sum()

Performance Benchmarking

import pandas as pd
import numpy as np
import time

np.random.seed(42)
n = 2_000_000
df = pd.DataFrame({
    'a': np.random.randn(n),
    'b': np.random.randn(n),
    'c': np.random.choice(['X', 'Y', 'Z', 'W'], n)
})

def benchmark(func, n_runs=10, label=""):
    """Run benchmark and return statistics."""
    times = []
    for _ in range(n_runs):
        start = time.perf_counter()
        result = func()
        times.append(time.perf_counter() - start)
    return {
        'label': label,
        'median': np.median(times),
        'mean': np.mean(times),
        'std': np.std(times),
        'min': min(times),
        'max': max(times)
    }

# Benchmark different approaches for group operation
benchmarks = []

# Method 1: groupby + agg
benchmarks.append(benchmark(
    lambda: df.groupby('c').agg({'a': 'mean', 'b': 'sum'}),
    label='groupby + agg'
))

# Method 2: groupby + apply
benchmarks.append(benchmark(
    lambda: df.groupby('c').apply(lambda g: pd.Series({
        'a_mean': g['a'].mean(),
        'b_sum': g['b'].sum()
    })),
    label='groupby + apply'
))

# Method 3: pivot_table
benchmarks.append(benchmark(
    lambda: df.pivot_table(values=['a', 'b'], index='c', aggfunc={'a': 'mean', 'b': 'sum'}),
    label='pivot_table'
))

# Method 4: crosstab
benchmarks.append(benchmark(
    lambda: pd.crosstab(df['c'], values=df['a'], aggfunc='mean'),
    label='crosstab'
))

# Display results
print(f"{'Method':<25} {'Median (ms)':<12} {'Std (ms)':<12} {'Speedup':<10}")
print("-" * 59)
baseline = benchmarks[0]['median']
for b in sorted(benchmarks, key=lambda x: x['median']):
    speedup = baseline / b['median']
    print(f"{b['label']:<25} {b['median']*1000:<12.2f} {b['std']*1000:<12.2f} {speedup:<10.1f}x")

Typical Results:

Architecture Diagram
Method                  Median (ms)  Std (ms)    Speedup
-----------------------------------------------------------
groupby + agg           45.23        2.34        1.0x
pivot_table             52.18        3.12        0.9x
crosstab                38.67        1.89        1.2x
groupby + apply         180.45       8.56        0.3x

Complete Data Processing Pipeline

import pandas as pd
import numpy as np
from typing import Dict, List, Tuple

class DataProcessor:
    """Production-grade data processor with performance monitoring."""

    def __init__(self, df: pd.DataFrame):
        self.original = df.copy()
        self.processed = df.copy()
        self.steps_log: List[Dict] = []
        self._log_step("init", len(df))

    def _log_step(self, step_name: str, n_rows: int):
        self.steps_log.append({
            'step': step_name,
            'n_rows': n_rows,
            'memory_mb': self.processed.memory_usage(deep=True).sum() / 1024 / 1024
        })

    def optimize_dtypes(self) -> 'DataProcessor':
        """Downcast numeric types to reduce memory."""
        for col in self.processed.select_dtypes(include=['int64']).columns:
            self.processed[col] = pd.to_numeric(
                self.processed[col], downcast='integer'
            )
        for col in self.processed.select_dtypes(include=['float64']).columns:
            self.processed[col] = pd.to_numeric(
                self.processed[col], downcast='float'
            )
        self._log_step("optimize_dtypes", len(self.processed))
        return self

    def convert_categoricals(self, columns: List[str]) -> 'DataProcessor':
        """Convert string columns to categorical type."""
        for col in columns:
            if col in self.processed.columns:
                self.processed[col] = pd.Categorical(self.processed[col])
        self._log_step("convert_categoricals", len(self.processed))
        return self

    def remove_duplicates(self, subset: List[str] = None,
                          keep: str = 'first') -> 'DataProcessor':
        """Remove duplicate rows."""
        before = len(self.processed)
        self.processed = self.processed.drop_duplicates(
            subset=subset, keep=keep
        ).reset_index(drop=True)
        self._log_step(
            f"remove_duplicates (removed {before - len(self.processed)})",
            len(self.processed)
        )
        return self

    def handle_missing(self, strategy: Dict[str, str] = None) -> 'DataProcessor':
        """Handle missing values with specified strategies.

        Strategies: 'mean', 'median', 'mode', 'drop', 'ffill', 'bfill'
        """
        if strategy is None:
            # Default: drop rows with >50% missing
            threshold = len(self.processed.columns) * 0.5
            self.processed = self.processed.dropna(thresh=threshold)
        else:
            for col, method in strategy.items():
                if col not in self.processed.columns:
                    continue
                if method == 'mean':
                    self.processed[col].fillna(
                        self.processed[col].mean(), inplace=True
                    )
                elif method == 'median':
                    self.processed[col].fillna(
                        self.processed[col].median(), inplace=True
                    )
                elif method == 'mode':
                    self.processed[col].fillna(
                        self.processed[col].mode()[0], inplace=True
                    )
                elif method == 'drop':
                    self.processed.dropna(subset=[col], inplace=True)
                elif method == 'ffill':
                    self.processed[col].fillna(method='ffill', inplace=True)

        self._log_step("handle_missing", len(self.processed))
        return self

    def add_features(self, feature_dict: Dict[str, callable]) -> 'DataProcessor':
        """Add engineered features using eval or lambda."""
        for name, func in feature_dict.items():
            if callable(func):
                self.processed[name] = func(self.processed)
            elif isinstance(func, str):
                self.processed[name] = self.processed.eval(func)
        self._log_step("add_features", len(self.processed))
        return self

    def filter_rows(self, condition: str) -> 'DataProcessor':
        """Filter rows using query syntax."""
        self.processed = self.processed.query(condition).reset_index(drop=True)
        self._log_step(f"filter_rows: {condition}", len(self.processed))
        return self

    def aggregate(self, group_by: List[str], agg_dict: Dict) -> 'DataProcessor':
        """Group and aggregate data."""
        self.processed = self.processed.groupby(group_by).agg(agg_dict)
        if isinstance(self.processed.index, pd.MultiIndex):
            self.processed = self.processed.reset_index()
        self._log_step("aggregate", len(self.processed))
        return self

    def summary(self) -> pd.DataFrame:
        """Return processing summary."""
        return pd.DataFrame(self.steps_log)


# Usage: Full pipeline
np.random.seed(42)
n = 10000
raw_df = pd.DataFrame({
    'id': range(n),
    'name': np.random.choice(['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'], n),
    'age': np.random.randint(18, 70, n).astype('int64'),
    'income': np.random.normal(50000, 15000, n).astype('float64'),
    'department': np.random.choice(['Eng', 'Sales', 'Mktg', 'HR'], n),
    'score': np.random.uniform(0, 100, n).astype('float64'),
    'date': pd.date_range('2024-01-01', periods=n, freq='h').astype('object')
})

# Introduce some issues
raw_df.loc[np.random.choice(n, 500), 'age'] = np.nan
raw_df.loc[np.random.choice(n, 300), 'income'] = np.nan
raw_df = pd.concat([raw_df, raw_df.sample(200)]).reset_index(drop=True)  # Duplicates

# Run pipeline
result = (DataProcessor(raw_df)
    .optimize_dtypes()
    .convert_categoricals(['department', 'name'])
    .remove_duplicates(subset=['id'])
    .handle_missing({'age': 'median', 'income': 'mean'})
    .add_features({
        'income_log': lambda df: np.log1p(df['income']),
        'age_group': lambda df: pd.cut(
            df['age'],
            bins=[0, 25, 35, 50, 100],
            labels=['Young', 'Mid', 'Senior', 'Expert']
        )
    })
    .filter_rows('age > 20 and income > 30000')
    .aggregate(
        group_by=['department'],
        agg_dict={
            'income': ['mean', 'std'],
            'score': ['mean', 'max'],
            'age': 'mean',
            'id': 'count'
        }
    )
    .summary()

print("\nPipeline Summary:")
print(result)

Performance Tips Cheat Sheet

TipBeforeAfterSpeedup
Use vectorizationdf.apply(f)df['a'] * df['b']10-100x
Use eval()chained opsdf.eval('a*b+c')2-5x
Use query()boolean indexingdf.query('a>0')1.5-3x
Categorical dtypesstring columnspd.Categorical()5-10x mem
Downcast numericsint64/float64int8/float322-8x mem
Use pipe()nested calls.pipe(f).pipe(g)readability
Chunk large filespd.read_csv()chunksize=Nprevents OOM
Avoid .iterrows()row iterationvectorize100-1000x
Use .itertuples()must iteratenamed tuples10-50x

Key Takeaways

πŸ“‹Summary: Advanced Pandas Patterns

  1. Vectorization is the single most important performance technique in pandas. Always prefer vectorized operations over apply(), iterrows(), or Python loops. Vectorized operations execute at C speed, bypassing Python interpreter overhead entirely.
  2. eval() and query() provide further optimization for complex expressions by parsing the entire expression and evaluating in a single C-level pass, reducing memory allocation from O(kΒ·n) to O(n).
  3. Method chaining with pipe() creates readable, composable pipelines that are easier to test and maintain, reading left-to-right like natural language.
  4. Indexing matters: .loc for labels (inclusive on both ends), .iloc for positions (exclusive on right), boolean masks for filtering. Avoid chained indexing to prevent SettingWithCopyWarning.
  5. MultiIndex enables hierarchical data analysis with multi-level aggregation and cross-section access. Use xs() for cross-section access and stack()/unstack() for reshaping between wide and long formats.
  6. Window functions (rolling, expanding, ewm) are essential for time series analysis. EWM assigns exponentially decaying weights, making it more responsive to recent changes than simple rolling windows.
  7. Categorical dtypes can reduce memory by 5-10x for columns with repeated string values by storing integer codes instead of repeated string objects.
  8. Monitor memory with df.info(memory_usage='deep') and downcast dtypes to prevent OOM errors on large datasets. The downcast function automatically selects the smallest numeric type that can represent the data.

Practice Exercises

Exercise 1: Performance Optimization

# Optimize this slow code:
def process_data(df):
    """This function takes 5 seconds on 1M rows. Make it under 0.5s."""
    results = []
    for idx, row in df.iterrows():
        if row['age'] > 30 and row['income'] > 50000:
            results.append({
                'name': row['name'],
                'score': row['income'] / row['age'],
                'category': 'high' if row['income'] > 75000 else 'medium'
            })
    return pd.DataFrame(results)

# Rewrite using vectorized operations
def process_data_optimized(df):
    # Your code here
    pass

Exercise 2: MultiIndex Challenge

# Create a MultiIndex DataFrame from sales data
# and answer these questions:
# 1. What's the total revenue per region?
# 2. What's the average order value per quarter?
# 3. Which product has the highest growth Q1β†’Q2?

sales_data = pd.DataFrame({
    'region': np.random.choice(['North', 'South', 'East', 'West'], 1000),
    'quarter': np.random.choice(['Q1', 'Q2', 'Q3', 'Q4'], 1000),
    'product': np.random.choice(['A', 'B', 'C'], 1000),
    'revenue': np.random.uniform(100, 10000, 1000),
    'units': np.random.randint(1, 100, 1000)
})

# Your MultiIndex operations here

Exercise 3: Build a Rolling Dashboard

# Create a function that computes and visualizes:
# 1. Rolling mean with different windows (7, 14, 30 days)
# 2. Bollinger Bands
# 3. Rolling correlation between two metrics
# 4. Exponential weighted moving average

def rolling_analysis(df, value_col, date_col='date'):
    """Compute rolling statistics and return summary DataFrame."""
    # Your code here
    pass

Exercise 4: Memory-Efficient Pipeline

# Process this 100M-row dataset in chunks:
# 1. Read in chunks of 100K rows
# 2. Optimize dtypes in each chunk
# 3. Filter and aggregate
# 4. Combine results efficiently

def process_large_dataset(filepath):
    """Process large CSV without running out of memory."""
    # Your code here
    pass

Advertisement

Need Expert Data Science Help?

Get personalized tutoring, project support, or professional consulting.

Advertisement