Pandas 102: GroupBy, Merge, Pivot

Module 1: FoundationsFree Lesson

Advertisement

GroupBy Operations

DfSplit-Apply-Combine

The split-apply-combine paradigm (Wickham, 2011) is the foundational pattern for grouped data operations. It consists of three steps:

  1. Split: Divide the data into groups based on one or more keys
  2. Apply: Apply a function independently to each group (aggregation, transformation, or filtering)
  3. Combine: Merge the results back into a single data structure

This pattern is equivalent to SQL's GROUP BY but far more flexible in Python.

Architecture Diagram
Split → Apply → Combine

ā”Œā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”     ā”Œā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”     ā”Œā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”
│  DataFrame  │ ──→ │  Split  │ ──→ │  Aggregate  │
│             │     │  by key │     │   results   │
ā””ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”˜     ā””ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”˜     ā””ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”˜

Basic GroupBy

import pandas as pd
import numpy as np

# Sample sales data
df = pd.DataFrame({
    'date': pd.date_range('2024-01-01', periods=12, freq='M'),
    'region': ['North', 'South', 'East', 'West'] * 3,
    'product': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B'],
    'sales': [100, 150, 200, 120, 180, 160, 220, 140, 190, 170, 210, 155],
    'quantity': [10, 15, 20, 12, 18, 16, 22, 14, 19, 17, 21, 15]
})

print("Sales Data:")
print(df)

# GroupBy single column
region_sales = df.groupby('region')['sales'].sum()
print(f"\nTotal Sales by Region:\n{region_sales}")

# GroupBy multiple columns
region_product_sales = df.groupby(['region', 'product'])['sales'].sum()
print(f"\nSales by Region & Product:\n{region_product_sales}")

# Multiple aggregations
stats = df.groupby('region')['sales'].agg(['sum', 'mean', 'std', 'min', 'max'])
print(f"\nSales Statistics by Region:\n{stats}")

# Custom aggregation functions
def range_func(x):
    return x.max() - x.min()

custom_stats = df.groupby('region')['sales'].agg(['sum', range_func])
print(f"\nCustom Aggregation:\n{custom_stats}")

Aggregate Methods

# Common aggregation functions
grouped = df.groupby('region')

# Single aggregation
print(f"Sum:\n{grouped['sales'].sum()}")
print(f"\nMean:\n{grouped['sales'].mean()}")

# Multiple aggregations on single column
print(f"\nMultiple agg:\n{grouped['sales'].agg(['sum', 'mean', 'count'])}")

# Different aggregations for different columns
result = df.groupby('region').agg({
    'sales': ['sum', 'mean'],
    'quantity': ['sum', 'mean'],
    'date': 'count'
})
print(f"\nColumn-specific aggregations:\n{result}")

# Named aggregations (clearer output)
result = df.groupby('region').agg(
    total_sales=('sales', 'sum'),
    avg_sales=('sales', 'mean'),
    total_quantity=('quantity', 'sum'),
    transaction_count=('sales', 'count')
)
print(f"\nNamed aggregations:\n{result}")

Transform and Filter

# Transform: Return same-sized DataFrame
# Useful for adding group-level calculations as columns

# Add group mean as new column
df['region_avg_sales'] = df.groupby('region')['sales'].transform('mean')
print(df.head())

# Z-score within each group
df['sales_zscore'] = df.groupby('region')['sales'].transform(
    lambda x: (x - x.mean()) / x.std()
)
print(f"\nWith Z-scores:\n{df.head()}")

# Filter: Keep only groups meeting condition
# Only keep regions with total sales > 400
filtered = df.groupby('region').filter(lambda x: x['sales'].sum() > 400)
print(f"\nFiltered (regions with sales > 400):\n{filtered}")

# Keep only top N groups
top_regions = df.groupby('region')['sales'].sum().nlargest(2).index
top_df = df[df['region'].isin(top_regions)]
print(f"\nTop 2 regions:\n{top_df}")

Apply Method

# Apply custom function to each group
def analyze_group(group):
    return pd.Series({
        'total_sales': group['sales'].sum(),
        'avg_sales': group['sales'].mean(),
        'sales_growth': group['sales'].pct_change().mean(),
        'consistency': group['sales'].std() / group['sales'].mean()
    })

result = df.groupby('region').apply(analyze_group)
print(f"Custom group analysis:\n{result}")

# Apply with multiple groups
def product_region_analysis(group):
    return pd.Series({
        'mean_sales': group['sales'].mean(),
        'median_sales': group['sales'].median(),
        'cv': group['sales'].std() / group['sales'].mean()  # Coefficient of variation
    })

result = df.groupby(['region', 'product']).apply(product_region_analysis)
print(f"\nProduct-Region Analysis:\n{result}")

Merging DataFrames

Types of Joins

Before writing merge code, understand what rows end up in the result for each join type.

Sample Data — df_customers:

customer_idnamecity
1AliceNew York
2BobLA
3CharlieChicago
4DianaBoston

Sample Data — df_orders:

order_idcustomer_idproduct
1011Laptop
1022Phone
1032Tablet
1043Laptop
1056Phone

Inner Join — Only rows that match in both tables

Architecture Diagram
         customers          orders              result
        ā”Œā”€ā”€ā”€ā”€ā”€ā”€ā”          ā”Œā”€ā”€ā”€ā”€ā”€ā”€ā”          ā”Œā”€ā”€ā”€ā”€ā”€ā”€ā”
        │  1 āœ“ │ ◄──────► │  1 āœ“ │          │  1   │  Alice — Laptop
        │  2 āœ“ │ ◄──┬───► │  2 āœ“ │          │  2   │  Bob — Phone
        │  3 āœ“ │ ◄──┼───► │  2 āœ“ │          │  2   │  Bob — Tablet
        │  4   │    │     │  3 āœ“ │          │  3   │  Charlie — Laptop
        │      │    │     │  6   │          ā””ā”€ā”€ā”€ā”€ā”€ā”€ā”˜
        ā””ā”€ā”€ā”€ā”€ā”€ā”€ā”˜    │     ā””ā”€ā”€ā”€ā”€ā”€ā”€ā”˜
                    │        Ɨ  ← customer 6 doesn't exist → excluded
                    └─────────── ← customer 4 has no orders → excluded

Result: 4 rows — only customers 1, 2, 3 have matching orders.


Left Join — All rows from left table (customers), matching from right

Architecture Diagram
         customers          orders              result
        ā”Œā”€ā”€ā”€ā”€ā”€ā”€ā”          ā”Œā”€ā”€ā”€ā”€ā”€ā”€ā”          ā”Œā”€ā”€ā”€ā”€ā”€ā”€ā”
        │  1 āœ“ │ ◄──────► │  1 āœ“ │          │  1   │  Alice — Laptop
        │  2 āœ“ │ ◄──┬───► │  2 āœ“ │          │  2   │  Bob — Phone
        │  3 āœ“ │ ◄──┼───► │  2 āœ“ │          │  2   │  Bob — Tablet
        │  4 āœ“ │ ◄──┼───► │  3 āœ“ │          │  3   │  Charlie — Laptop
        │      │    │     │      │          │  4   │  Diana — NaN
        ā””ā”€ā”€ā”€ā”€ā”€ā”€ā”˜    │     ā””ā”€ā”€ā”€ā”€ā”€ā”€ā”˜          ā””ā”€ā”€ā”€ā”€ā”€ā”€ā”˜
                    │        Ɨ  ← order from customer 6 → excluded
                    └─────────

Result: 5 rows — ALL customers kept; Diana gets NaN for order columns.


Right Join — All rows from right table (orders), matching from left

Architecture Diagram
         customers          orders              result
        ā”Œā”€ā”€ā”€ā”€ā”€ā”€ā”          ā”Œā”€ā”€ā”€ā”€ā”€ā”€ā”          ā”Œā”€ā”€ā”€ā”€ā”€ā”€ā”
        │  1 āœ“ │ ◄──────► │  1 āœ“ │          │  1   │  Alice — Laptop
        │  2 āœ“ │ ◄──┬───► │  2 āœ“ │          │  2   │  Bob — Phone
        │  3 āœ“ │ ◄──┼───► │  2 āœ“ │          │  2   │  Bob — Tablet
        │  4   │    │     │  3 āœ“ │          │  3   │  Charlie — Laptop
        ā””ā”€ā”€ā”€ā”€ā”€ā”€ā”˜    │     │  6 āœ“ │          │  6   │  NaN — Phone
                    │     ā””ā”€ā”€ā”€ā”€ā”€ā”€ā”˜          ā””ā”€ā”€ā”€ā”€ā”€ā”€ā”˜
                    └─────────

Result: 5 rows — ALL orders kept; customer 6 gets NaN for customer columns.


Full Outer Join — All rows from both tables

Architecture Diagram
         customers          orders              result
        ā”Œā”€ā”€ā”€ā”€ā”€ā”€ā”          ā”Œā”€ā”€ā”€ā”€ā”€ā”€ā”          ā”Œā”€ā”€ā”€ā”€ā”€ā”€ā”
        │  1 āœ“ │ ◄──────► │  1 āœ“ │          │  1   │  Alice — Laptop
        │  2 āœ“ │ ◄──┬───► │  2 āœ“ │          │  2   │  Bob — Phone
        │  3 āœ“ │ ◄──┼───► │  2 āœ“ │          │  2   │  Bob — Tablet
        │  4 āœ“ │ ◄──┼───► │  3 āœ“ │          │  3   │  Charlie — Laptop
        │      │    │     │  6 āœ“ │          │  4   │  Diana — NaN
        ā””ā”€ā”€ā”€ā”€ā”€ā”€ā”˜    │     ā””ā”€ā”€ā”€ā”€ā”€ā”€ā”˜          │  6   │  NaN — Phone
                    │                       ā””ā”€ā”€ā”€ā”€ā”€ā”€ā”˜
                    └─────────

Result: 6 rows — EVERYTHING kept; unmatched rows get NaN where needed.


Quick Reference:

Join TypeKept from LeftKept from RightNaN for
innerOnly matchesOnly matchesNothing — non-matches excluded
leftAllOnly matchesRight side when no match
rightOnly matchesAllLeft side when no match
outerAllAllBoth sides when no match

Merge Operations

# Create sample DataFrames
df_customers = pd.DataFrame({
    'customer_id': [1, 2, 3, 4, 5],
    'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'city': ['New York', 'Los Angeles', 'Chicago', 'Boston', 'Denver']
})

df_orders = pd.DataFrame({
    'order_id': [101, 102, 103, 104, 105],
    'customer_id': [1, 2, 2, 3, 6],
    'product': ['Laptop', 'Phone', 'Tablet', 'Laptop', 'Phone'],
    'amount': [999, 699, 449, 999, 699]
})

# Inner Join (only matching rows)
inner = pd.merge(df_customers, df_orders, on='customer_id', how='inner')
print("Inner Join:")
print(inner)

# Left Join (all customers, even without orders)
left = pd.merge(df_customers, df_orders, on='customer_id', how='left')
print("\nLeft Join:")
print(left)

# Right Join (all orders, even without matching customer)
right = pd.merge(df_customers, df_orders, on='customer_id', how='right')
print("\nRight Join:")
print(right)

# Full Outer Join (all rows from both)
full = pd.merge(df_customers, df_orders, on='customer_id', how='outer')
print("\nFull Outer Join:")
print(full)

# Merge on different column names
df_customers_renamed = df_customers.rename(columns={'customer_id': 'id'})
merged = pd.merge(df_customers_renamed, df_orders, 
                  left_on='id', right_on='customer_id', 
                  suffixes=('_customer', '_order'))
print("\nMerge with different column names:")
print(merged)

Concatenation

# Vertical concatenation (stacking)
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})
df3 = pd.DataFrame({'A': [9, 10], 'B': [11, 12]})

# Stack rows
vertical = pd.concat([df1, df2, df3], ignore_index=True)
print("Vertical Concatenation:")
print(vertical)

# Horizontal concatenation (side by side)
df_left = pd.DataFrame({'A': [1, 2, 3]})
df_right = pd.DataFrame({'B': [4, 5, 6]})

horizontal = pd.concat([df_left, df_right], axis=1)
print("\nHorizontal Concatenation:")
print(horizontal)

# Concat with keys (create MultiIndex)
combined = pd.concat([df1, df2, df3], keys=['first', 'second', 'third'])
print("\nConcat with Keys:")
print(combined)

Pivot Tables

Pivot tables transform long-format data into wide-format summaries.

# Sample data (long format)
df_long = pd.DataFrame({
    'month': ['Jan', 'Jan', 'Feb', 'Feb', 'Mar', 'Mar'] * 2,
    'region': ['North', 'South'] * 3 * 2,
    'product': ['A'] * 6 + ['B'] * 6,
    'sales': np.random.randint(100, 500, 12)
})

print("Long Format Data:")
print(df_long)

# Pivot table
pivot = pd.pivot_table(
    df_long,
    values='sales',
    index='month',
    columns='region',
    aggfunc='sum',
    fill_value=0
)
print("\nPivot Table:")
print(pivot)

# Multiple aggregations
pivot_multi = pd.pivot_table(
    df_long,
    values='sales',
    index='month',
    columns='region',
    aggfunc=['sum', 'mean', 'count'],
    fill_value=0
)
print("\nPivot with Multiple Aggregations:")
print(pivot_multi)

# Pivot with margins (totals)
pivot_margins = pd.pivot_table(
    df_long,
    values='sales',
    index='month',
    columns='region',
    aggfunc='sum',
    margins=True,
    margins_name='Total'
)
print("\nPivot with Margins:")
print(pivot_margins)

# GroupBy alternative
grouped_pivot = df_long.groupby(['month', 'region'])['sales'].sum().unstack()
print("\nGroupBy Alternative:")
print(grouped_pivot)

Pivot vs Melt

# Melt (wide to long)
df_wide = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'math': [95, 88, 92],
    'english': [85, 90, 88],
    'science': [90, 85, 95]
})
print("Wide Format:")
print(df_wide)

# Melt to long format
df_melted = pd.melt(
    df_wide,
    id_vars=['name'],
    var_name='subject',
    value_name='score'
)
print("\nLong Format (Melted):")
print(df_melted)

# Reverse: Pivot back to wide
df_wide_again = df_melted.pivot(
    index='name',
    columns='subject',
    values='score'
).reset_index()
df_wide_again.columns.name = None
print("\nBack to Wide:")
print(df_wide_again)

Window Functions

# Rolling window operations
df_sales = pd.DataFrame({
    'date': pd.date_range('2024-01-01', periods=10),
    'sales': [100, 120, 110, 130, 140, 125, 135, 145, 155, 160]
})

# 3-day rolling average
df_sales['rolling_3d'] = df_sales['sales'].rolling(window=3).mean()

# 3-day rolling sum
df_sales['rolling_sum'] = df_sales['sales'].rolling(window=3).sum()

# Expanding window (cumulative)
df_sales['cumulative_mean'] = df_sales['sales'].expanding().mean()

print("Window Functions:")
print(df_sales)

# Group-wise window functions
df['region_sales_rank'] = df.groupby('region')['sales'].rank(ascending=False)
df['region_sales_pct'] = df.groupby('region')['sales'].transform(
    lambda x: x / x.sum() * 100
)
print("\nWith Rankings:")
print(df.head(10))

Practical Example: Sales Analysis

# Complete sales analysis workflow
import pandas as pd
import numpy as np

# Create comprehensive dataset
np.random.seed(42)
dates = pd.date_range('2023-01-01', '2023-12-31', freq='D')
regions = ['North', 'South', 'East', 'West']
products = ['Laptop', 'Phone', 'Tablet', 'Accessory']

n = len(dates) * len(regions) * len(products)
data = {
    'date': np.random.choice(dates, n),
    'region': np.random.choice(regions, n),
    'product': np.random.choice(products, n),
    'units_sold': np.random.randint(1, 50, n),
    'unit_price': np.random.choice([999, 699, 449, 149], n)
}

df = pd.DataFrame(data)
df['revenue'] = df['units_sold'] * df['unit_price']

# 1. Revenue by region
print("Revenue by Region:")
print(df.groupby('region')['revenue'].sum().sort_values(ascending=False))

# 2. Monthly trend
df['month'] = df['date'].dt.month
monthly = df.groupby('month')['revenue'].sum()
print("\nMonthly Revenue Trend:")
print(monthly)

# 3. Top products by region
top_products = df.groupby(['region', 'product'])['revenue'].sum()
print("\nTop Products by Region:")
print(top_products.unstack())

# 4. Pivot table summary
pivot_summary = pd.pivot_table(
    df,
    values='revenue',
    index='region',
    columns='product',
    aggfunc='sum',
    margins=True,
    margins_name='Total'
)
print("\nPivot Table Summary:")
print(pivot_summary)

# 5. Rolling 7-day average
daily_revenue = df.groupby('date')['revenue'].sum()
rolling_avg = daily_revenue.rolling(window=7).mean()
print("\n7-Day Rolling Average (First 10 days):")
print(rolling_avg.head(10))

Key Takeaways

šŸ“‹Summary: Pandas GroupBy, Merge, Pivot

  1. GroupBy follows split-apply-combine paradigm
  2. Merge is like SQL joins — choose inner, left, right, or outer
  3. Concat stacks DataFrames vertically or horizontally
  4. Pivot tables transform long to wide format with aggregations
  5. Window functions enable rolling and cumulative calculations

Practice Exercise

  1. Create a sales dataset with dates, regions, products, and amounts
  2. Calculate total sales by region and product using GroupBy
  3. Create a pivot table showing monthly sales by region
  4. Merge with a customer table to add customer information
  5. Add 7-day rolling averages and rankings

Advertisement

Need Expert Data Science Help?

Get personalized tutoring, project support, or professional consulting.

Advertisement