Pandas 102: GroupBy, Merge, Pivot

GroupBy Operations

DfSplit-Apply-Combine

The split-apply-combine paradigm (Wickham, 2011) is the foundational pattern for grouped data operations. It consists of three steps:

Split: Divide the data into groups based on one or more keys
Apply: Apply a function independently to each group (aggregation, transformation, or filtering)
Combine: Merge the results back into a single data structure

This pattern is equivalent to SQL's GROUP BY but far more flexible in Python.

Architecture Diagram

Split → Apply → Combine

┌─────────────┐     ┌─────────┐     ┌─────────────┐
│  DataFrame  │ ──→ │  Split  │ ──→ │  Aggregate  │
│             │     │  by key │     │   results   │
└─────────────┘     └─────────┘     └─────────────┘

Basic GroupBy

import pandas as pd
import numpy as np

# Sample sales data
df = pd.DataFrame({
    'date': pd.date_range('2024-01-01', periods=12, freq='M'),
    'region': ['North', 'South', 'East', 'West'] * 3,
    'product': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B'],
    'sales': [100, 150, 200, 120, 180, 160, 220, 140, 190, 170, 210, 155],
    'quantity': [10, 15, 20, 12, 18, 16, 22, 14, 19, 17, 21, 15]
})

print("Sales Data:")
print(df)

# GroupBy single column
region_sales = df.groupby('region')['sales'].sum()
print(f"\nTotal Sales by Region:\n{region_sales}")

# GroupBy multiple columns
region_product_sales = df.groupby(['region', 'product'])['sales'].sum()
print(f"\nSales by Region & Product:\n{region_product_sales}")

# Multiple aggregations
stats = df.groupby('region')['sales'].agg(['sum', 'mean', 'std', 'min', 'max'])
print(f"\nSales Statistics by Region:\n{stats}")

# Custom aggregation functions
def range_func(x):
    return x.max() - x.min()

custom_stats = df.groupby('region')['sales'].agg(['sum', range_func])
print(f"\nCustom Aggregation:\n{custom_stats}")

Aggregate Methods

# Common aggregation functions
grouped = df.groupby('region')

# Single aggregation
print(f"Sum:\n{grouped['sales'].sum()}")
print(f"\nMean:\n{grouped['sales'].mean()}")

# Multiple aggregations on single column
print(f"\nMultiple agg:\n{grouped['sales'].agg(['sum', 'mean', 'count'])}")

# Different aggregations for different columns
result = df.groupby('region').agg({
    'sales': ['sum', 'mean'],
    'quantity': ['sum', 'mean'],
    'date': 'count'
})
print(f"\nColumn-specific aggregations:\n{result}")

# Named aggregations (clearer output)
result = df.groupby('region').agg(
    total_sales=('sales', 'sum'),
    avg_sales=('sales', 'mean'),
    total_quantity=('quantity', 'sum'),
    transaction_count=('sales', 'count')
)
print(f"\nNamed aggregations:\n{result}")

Transform and Filter

# Transform: Return same-sized DataFrame
# Useful for adding group-level calculations as columns

# Add group mean as new column
df['region_avg_sales'] = df.groupby('region')['sales'].transform('mean')
print(df.head())

# Z-score within each group
df['sales_zscore'] = df.groupby('region')['sales'].transform(
    lambda x: (x - x.mean()) / x.std()
)
print(f"\nWith Z-scores:\n{df.head()}")

# Filter: Keep only groups meeting condition
# Only keep regions with total sales > 400
filtered = df.groupby('region').filter(lambda x: x['sales'].sum() > 400)
print(f"\nFiltered (regions with sales > 400):\n{filtered}")

# Keep only top N groups
top_regions = df.groupby('region')['sales'].sum().nlargest(2).index
top_df = df[df['region'].isin(top_regions)]
print(f"\nTop 2 regions:\n{top_df}")

Apply Method

# Apply custom function to each group
def analyze_group(group):
    return pd.Series({
        'total_sales': group['sales'].sum(),
        'avg_sales': group['sales'].mean(),
        'sales_growth': group['sales'].pct_change().mean(),
        'consistency': group['sales'].std() / group['sales'].mean()
    })

result = df.groupby('region').apply(analyze_group)
print(f"Custom group analysis:\n{result}")

# Apply with multiple groups
def product_region_analysis(group):
    return pd.Series({
        'mean_sales': group['sales'].mean(),
        'median_sales': group['sales'].median(),
        'cv': group['sales'].std() / group['sales'].mean()  # Coefficient of variation
    })

result = df.groupby(['region', 'product']).apply(product_region_analysis)
print(f"\nProduct-Region Analysis:\n{result}")

Merging DataFrames

Types of Joins

Before writing merge code, understand what rows end up in the result for each join type.

Sample Data — df_customers:

customer_id	name	city
1	Alice	New York
2	Bob	LA
3	Charlie	Chicago
4	Diana	Boston

Sample Data — df_orders:

order_id	customer_id	product
101	1	Laptop
102	2	Phone
103	2	Tablet
104	3	Laptop
105	6	Phone

Inner Join — Only rows that match in both tables

Architecture Diagram

         customers          orders              result
        ┌──────┐          ┌──────┐          ┌──────┐
        │  1 ✓ │ ◄──────► │  1 ✓ │          │  1   │  Alice — Laptop
        │  2 ✓ │ ◄──┬───► │  2 ✓ │          │  2   │  Bob — Phone
        │  3 ✓ │ ◄──┼───► │  2 ✓ │          │  2   │  Bob — Tablet
        │  4   │    │     │  3 ✓ │          │  3   │  Charlie — Laptop
        │      │    │     │  6   │          └──────┘
        └──────┘    │     └──────┘
                    │        ×  ← customer 6 doesn't exist → excluded
                    └─────────── ← customer 4 has no orders → excluded

Result: 4 rows — only customers 1, 2, 3 have matching orders.

Left Join — All rows from left table (customers), matching from right

Architecture Diagram

         customers          orders              result
        ┌──────┐          ┌──────┐          ┌──────┐
        │  1 ✓ │ ◄──────► │  1 ✓ │          │  1   │  Alice — Laptop
        │  2 ✓ │ ◄──┬───► │  2 ✓ │          │  2   │  Bob — Phone
        │  3 ✓ │ ◄──┼───► │  2 ✓ │          │  2   │  Bob — Tablet
        │  4 ✓ │ ◄──┼───► │  3 ✓ │          │  3   │  Charlie — Laptop
        │      │    │     │      │          │  4   │  Diana — NaN
        └──────┘    │     └──────┘          └──────┘
                    │        ×  ← order from customer 6 → excluded
                    └─────────

Result: 5 rows — ALL customers kept; Diana gets NaN for order columns.

Right Join — All rows from right table (orders), matching from left

Architecture Diagram

         customers          orders              result
        ┌──────┐          ┌──────┐          ┌──────┐
        │  1 ✓ │ ◄──────► │  1 ✓ │          │  1   │  Alice — Laptop
        │  2 ✓ │ ◄──┬───► │  2 ✓ │          │  2   │  Bob — Phone
        │  3 ✓ │ ◄──┼───► │  2 ✓ │          │  2   │  Bob — Tablet
        │  4   │    │     │  3 ✓ │          │  3   │  Charlie — Laptop
        └──────┘    │     │  6 ✓ │          │  6   │  NaN — Phone
                    │     └──────┘          └──────┘
                    └─────────

Result: 5 rows — ALL orders kept; customer 6 gets NaN for customer columns.

Full Outer Join — All rows from both tables

Architecture Diagram

         customers          orders              result
        ┌──────┐          ┌──────┐          ┌──────┐
        │  1 ✓ │ ◄──────► │  1 ✓ │          │  1   │  Alice — Laptop
        │  2 ✓ │ ◄──┬───► │  2 ✓ │          │  2   │  Bob — Phone
        │  3 ✓ │ ◄──┼───► │  2 ✓ │          │  2   │  Bob — Tablet
        │  4 ✓ │ ◄──┼───► │  3 ✓ │          │  3   │  Charlie — Laptop
        │      │    │     │  6 ✓ │          │  4   │  Diana — NaN
        └──────┘    │     └──────┘          │  6   │  NaN — Phone
                    │                       └──────┘
                    └─────────

Result: 6 rows — EVERYTHING kept; unmatched rows get NaN where needed.

Quick Reference:

Join Type	Kept from Left	Kept from Right	NaN for
`inner`	Only matches	Only matches	Nothing — non-matches excluded
`left`	All	Only matches	Right side when no match
`right`	Only matches	All	Left side when no match
`outer`	All	All	Both sides when no match

Merge Operations

# Create sample DataFrames
df_customers = pd.DataFrame({
    'customer_id': [1, 2, 3, 4, 5],
    'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'city': ['New York', 'Los Angeles', 'Chicago', 'Boston', 'Denver']
})

df_orders = pd.DataFrame({
    'order_id': [101, 102, 103, 104, 105],
    'customer_id': [1, 2, 2, 3, 6],
    'product': ['Laptop', 'Phone', 'Tablet', 'Laptop', 'Phone'],
    'amount': [999, 699, 449, 999, 699]
})

# Inner Join (only matching rows)
inner = pd.merge(df_customers, df_orders, on='customer_id', how='inner')
print("Inner Join:")
print(inner)

# Left Join (all customers, even without orders)
left = pd.merge(df_customers, df_orders, on='customer_id', how='left')
print("\nLeft Join:")
print(left)

# Right Join (all orders, even without matching customer)
right = pd.merge(df_customers, df_orders, on='customer_id', how='right')
print("\nRight Join:")
print(right)

# Full Outer Join (all rows from both)
full = pd.merge(df_customers, df_orders, on='customer_id', how='outer')
print("\nFull Outer Join:")
print(full)

# Merge on different column names
df_customers_renamed = df_customers.rename(columns={'customer_id': 'id'})
merged = pd.merge(df_customers_renamed, df_orders, 
                  left_on='id', right_on='customer_id', 
                  suffixes=('_customer', '_order'))
print("\nMerge with different column names:")
print(merged)

Concatenation

# Vertical concatenation (stacking)
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})
df3 = pd.DataFrame({'A': [9, 10], 'B': [11, 12]})

# Stack rows
vertical = pd.concat([df1, df2, df3], ignore_index=True)
print("Vertical Concatenation:")
print(vertical)

# Horizontal concatenation (side by side)
df_left = pd.DataFrame({'A': [1, 2, 3]})
df_right = pd.DataFrame({'B': [4, 5, 6]})

horizontal = pd.concat([df_left, df_right], axis=1)
print("\nHorizontal Concatenation:")
print(horizontal)

# Concat with keys (create MultiIndex)
combined = pd.concat([df1, df2, df3], keys=['first', 'second', 'third'])
print("\nConcat with Keys:")
print(combined)

Pivot Tables

Pivot tables transform long-format data into wide-format summaries.

# Sample data (long format)
df_long = pd.DataFrame({
    'month': ['Jan', 'Jan', 'Feb', 'Feb', 'Mar', 'Mar'] * 2,
    'region': ['North', 'South'] * 3 * 2,
    'product': ['A'] * 6 + ['B'] * 6,
    'sales': np.random.randint(100, 500, 12)
})

print("Long Format Data:")
print(df_long)

# Pivot table
pivot = pd.pivot_table(
    df_long,
    values='sales',
    index='month',
    columns='region',
    aggfunc='sum',
    fill_value=0
)
print("\nPivot Table:")
print(pivot)

# Multiple aggregations
pivot_multi = pd.pivot_table(
    df_long,
    values='sales',
    index='month',
    columns='region',
    aggfunc=['sum', 'mean', 'count'],
    fill_value=0
)
print("\nPivot with Multiple Aggregations:")
print(pivot_multi)

# Pivot with margins (totals)
pivot_margins = pd.pivot_table(
    df_long,
    values='sales',
    index='month',
    columns='region',
    aggfunc='sum',
    margins=True,
    margins_name='Total'
)
print("\nPivot with Margins:")
print(pivot_margins)

# GroupBy alternative
grouped_pivot = df_long.groupby(['month', 'region'])['sales'].sum().unstack()
print("\nGroupBy Alternative:")
print(grouped_pivot)

Pivot vs Melt

# Melt (wide to long)
df_wide = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'math': [95, 88, 92],
    'english': [85, 90, 88],
    'science': [90, 85, 95]
})
print("Wide Format:")
print(df_wide)

# Melt to long format
df_melted = pd.melt(
    df_wide,
    id_vars=['name'],
    var_name='subject',
    value_name='score'
)
print("\nLong Format (Melted):")
print(df_melted)

# Reverse: Pivot back to wide
df_wide_again = df_melted.pivot(
    index='name',
    columns='subject',
    values='score'
).reset_index()
df_wide_again.columns.name = None
print("\nBack to Wide:")
print(df_wide_again)

Window Functions

# Rolling window operations
df_sales = pd.DataFrame({
    'date': pd.date_range('2024-01-01', periods=10),
    'sales': [100, 120, 110, 130, 140, 125, 135, 145, 155, 160]
})

# 3-day rolling average
df_sales['rolling_3d'] = df_sales['sales'].rolling(window=3).mean()

# 3-day rolling sum
df_sales['rolling_sum'] = df_sales['sales'].rolling(window=3).sum()

# Expanding window (cumulative)
df_sales['cumulative_mean'] = df_sales['sales'].expanding().mean()

print("Window Functions:")
print(df_sales)

# Group-wise window functions
df['region_sales_rank'] = df.groupby('region')['sales'].rank(ascending=False)
df['region_sales_pct'] = df.groupby('region')['sales'].transform(
    lambda x: x / x.sum() * 100
)
print("\nWith Rankings:")
print(df.head(10))

Practical Example: Sales Analysis

# Complete sales analysis workflow
import pandas as pd
import numpy as np

# Create comprehensive dataset
np.random.seed(42)
dates = pd.date_range('2023-01-01', '2023-12-31', freq='D')
regions = ['North', 'South', 'East', 'West']
products = ['Laptop', 'Phone', 'Tablet', 'Accessory']

n = len(dates) * len(regions) * len(products)
data = {
    'date': np.random.choice(dates, n),
    'region': np.random.choice(regions, n),
    'product': np.random.choice(products, n),
    'units_sold': np.random.randint(1, 50, n),
    'unit_price': np.random.choice([999, 699, 449, 149], n)
}

df = pd.DataFrame(data)
df['revenue'] = df['units_sold'] * df['unit_price']

# 1. Revenue by region
print("Revenue by Region:")
print(df.groupby('region')['revenue'].sum().sort_values(ascending=False))

# 2. Monthly trend
df['month'] = df['date'].dt.month
monthly = df.groupby('month')['revenue'].sum()
print("\nMonthly Revenue Trend:")
print(monthly)

# 3. Top products by region
top_products = df.groupby(['region', 'product'])['revenue'].sum()
print("\nTop Products by Region:")
print(top_products.unstack())

# 4. Pivot table summary
pivot_summary = pd.pivot_table(
    df,
    values='revenue',
    index='region',
    columns='product',
    aggfunc='sum',
    margins=True,
    margins_name='Total'
)
print("\nPivot Table Summary:")
print(pivot_summary)

# 5. Rolling 7-day average
daily_revenue = df.groupby('date')['revenue'].sum()
rolling_avg = daily_revenue.rolling(window=7).mean()
print("\n7-Day Rolling Average (First 10 days):")
print(rolling_avg.head(10))

Key Takeaways

📋Summary: Pandas GroupBy, Merge, Pivot

GroupBy follows split-apply-combine paradigm
Merge is like SQL joins — choose inner, left, right, or outer
Concat stacks DataFrames vertically or horizontally
Pivot tables transform long to wide format with aggregations
Window functions enable rolling and cumulative calculations

Practice Exercise

Create a sales dataset with dates, regions, products, and amounts
Calculate total sales by region and product using GroupBy
Create a pivot table showing monthly sales by region
Merge with a customer table to add customer information
Add 7-day rolling averages and rankings

Pandas 102: GroupBy, Merge, Pivot

GroupBy Operations

DfSplit-Apply-Combine

Basic GroupBy

Aggregate Methods

Transform and Filter

Apply Method

Merging DataFrames

Types of Joins

Merge Operations

Concatenation

Pivot Tables

Pivot vs Melt

Window Functions

Practical Example: Sales Analysis

Key Takeaways

📋Summary: Pandas GroupBy, Merge, Pivot

Practice Exercise

Need Expert Data Science Help?