Pandas GroupBy, Merge and Pivot
Master data aggregation, merging, and reshaping in Pandas.
GroupBy: Split-Apply-Combine
DfGroupBy (Split-Apply-Combine)
The GroupBy operation follows the Split-Apply-Combine paradigm:
- Split: Partition the DataFrame into groups based on one or more keys
- Apply: Apply a function to each group independently (aggregation, transformation, or filtering)
- Combine: Concatenate the results back into a single structure
Formally, for a function , GroupBy computes for each group , where is a subset of rows sharing the same key value.
Basic GroupBy
import pandas as pd
import numpy as np
df = pd.DataFrame({
'department': ['Engineering', 'Engineering', 'Sales', 'Sales', 'Marketing', 'Marketing'],
'employee': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve', 'Frank'],
'salary': [120000, 110000, 90000, 95000, 85000, 88000],
'bonus': [15000, 12000, 8000, 9000, 7000, 7500]
})
# GroupBy creates a DataFrameGroupBy object
grouped = df.groupby('department')
print(type(grouped)) # <class 'pandas.core.groupby.generic.DataFrameGroupBy'>
# Aggregation — single function
print(grouped['salary'].mean())
# department
# Engineering 115000.0
# Marketing 86500.0
# Sales 92500.0
# Aggregation — multiple functions
print(grouped['salary'].agg(['mean', 'median', 'std', 'count']))
Multiple Aggregation Functions
# Single column, multiple functions
result = df.groupby('department')['salary'].agg(['mean', 'min', 'max'])
print(result)
# mean min max
# department
# Engineering 115000 110000 120000
# Marketing 86500 85000 88000
# Sales 92500 90000 95000
# Multiple columns, multiple functions
result = df.groupby('department').agg({
'salary': ['mean', 'sum'],
'bonus': ['mean', 'max']
})
print(result)
# salary bonus
# mean sum mean max
# department
# Engineering 115000 230000 13500 15000
# Marketing 86500 173000 7250 7500
# Sales 92500 185000 8500 9000
Named Aggregation (Pandas 0.25+)
Named Aggregation
Named aggregation provides a clean, readable syntax for specifying both the aggregation function and the output column name. It avoids the multi-level column index problem.
result = df.groupby('department').agg(
avg_salary=('salary', 'mean'),
total_bonus=('bonus', 'sum'),
employee_count=('employee', 'count'),
salary_range=('salary', lambda x: x.max() - x.min())
)
print(result)
# avg_salary total_bonus employee_count salary_range
# department
# Engineering 115000 27000 2 10000
# Marketing 86500 14500 2 3000
# Sales 92500 17000 2 5000
GroupBy with Multiple Keys
# Multi-level grouping
df_multi = pd.DataFrame({
'region': ['North', 'North', 'South', 'South', 'North', 'South'],
'department': ['Eng', 'Sales', 'Eng', 'Sales', 'Eng', 'Sales'],
'revenue': [100, 80, 90, 70, 110, 85]
})
result = df_multi.groupby(['region', 'department'])['revenue'].sum()
print(result)
# region department
# North Eng 210
# Sales 80
# South Eng 90
# Sales 155
# Reset index for flat DataFrame
result_flat = result.reset_index()
print(result_flat)
Transform vs Apply
Transform vs Apply
transform()returns a result with the same index as the input — useful for adding group-level statistics as new columnsapply()can return any shape — more flexible but slowerfilter()returns a subset of rows where the group-level condition is True
# transform — same size as input
df['salary_zscore'] = df.groupby('department')['salary'].transform(
lambda x: (x - x.mean()) / x.std()
)
print(df[['department', 'employee', 'salary', 'salary_zscore']])
# filter — subset of groups
high_salary = df.groupby('department').filter(
lambda x: x['salary'].mean() > 90000
)
print(high_salary)
# apply — arbitrary function
def top_n(group, n=1):
return group.nlargest(n, 'salary')
top_earners = df.groupby('department').apply(top_n, n=1)
print(top_earners)
Merge / Join
DfMerge
Merge combines two DataFrames based on common keys (column values or indices). It implements relational algebra join operations. The on parameter specifies the join key; how specifies the join type (inner, left, right, outer).
Join Types
employees = pd.DataFrame({
'emp_id': [1, 2, 3, 4],
'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
'dept_id': [101, 102, 101, 103]
})
departments = pd.DataFrame({
'dept_id': [101, 102, 104],
'dept_name': ['Engineering', 'Sales', 'Finance']
})
# Inner join (default) — only matching rows
inner = pd.merge(employees, departments, on='dept_id', how='inner')
print(inner)
# emp_id name dept_id dept_name
# 0 1 Alice 101 Engineering
# 1 3 Charlie 101 Engineering
# 2 2 Bob 102 Sales
# Left join — all from left, matching from right
left = pd.merge(employees, departments, on='dept_id', how='left')
print(left)
# emp_id name dept_id dept_name
# 0 1 Alice 101 Engineering
# 1 2 Bob 102 Sales
# 2 3 Charlie 101 Engineering
# 3 4 Diana 103 NaN
# Outer join — all rows from both
outer = pd.merge(employees, departments, on='dept_id', how='outer')
print(outer)
# emp_id name dept_id dept_name
# 0 1.0 Alice 101.0 Engineering
# 1 2.0 Bob 102.0 Sales
# 2 3.0 Charlie 101.0 Engineering
# 3 4.0 Diana 103.0 NaN
# 4 NaN NaN 104.0 Finance
Merge on Different Column Names
# When join keys have different names
pd.merge(employees, departments,
left_on='dept_id',
right_on='dept_id',
how='left')
# On index
pd.merge(employees, departments,
left_index=True,
right_index=True,
how='inner')
# Suffixes for overlapping column names
pd.merge(df1, df2, on='key', suffixes=('_left', '_right'))
Cross Join (Cartesian Product)
# Every combination of rows
sizes = pd.DataFrame({'size': ['S', 'M', 'L']})
colors = pd.DataFrame({'color': ['Red', 'Blue']})
products = pd.merge(sizes, colors, how='cross')
print(products)
# size color
# 0 S Red
# 1 S Blue
# 2 M Red
# 3 M Blue
# 4 L Red
# 5 L Blue
Concat
DfConcat
pd.concat() stacks DataFrames along an axis (vertically by default). Unlike merge, it doesn't require common keys — it simply appends or aligns along the specified axis.
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})
df3 = pd.DataFrame({'C': [9, 10]}) # Different columns
# Vertical concat (axis=0)
vertical = pd.concat([df1, df2], ignore_index=True)
print(vertical)
# A B
# 0 1 3
# 1 2 4
# 2 5 7
# 3 6 8
# Horizontal concat (axis=1)
horizontal = pd.concat([df1, df3], axis=1)
print(horizontal)
# A B C
# 0 1 3 9
# 1 2 4 10
# With mismatched columns (outer join by default)
mismatched = pd.concat([df1, df3], axis=0, ignore_index=True)
print(mismatched)
# A B C
# 0 1.0 3.0 NaN
# 1 2.0 4.0 NaN
# 2 NaN NaN 9.0
# 3 NaN NaN 10.0
Pivot Tables
DfPivot Table
A pivot table reshapes long-format data into wide-format by aggregating values at the intersection of specified row and column indices. It is equivalent to a GroupBy with multiple keys followed by unstacking.
sales = pd.DataFrame({
'date': ['2024-01', '2024-01', '2024-02', '2024-02', '2024-01', '2024-02'],
'region': ['North', 'South', 'North', 'South', 'North', 'South'],
'product': ['A', 'A', 'A', 'A', 'B', 'B'],
'revenue': [100, 150, 120, 160, 200, 180]
})
# Basic pivot table
pivot = sales.pivot_table(
values='revenue',
index='region',
columns='product',
aggfunc='sum'
)
print(pivot)
# product A B
# region
# North 220 200
# South 310 180
# Multiple aggregations
pivot_multi = sales.pivot_table(
values='revenue',
index=['region', 'date'],
columns='product',
aggfunc=['sum', 'mean', 'count']
)
# With margins (row/column totals)
pivot_margins = sales.pivot_table(
values='revenue',
index='region',
columns='product',
aggfunc='sum',
margins=True,
margins_name='Total'
)
print(pivot_margins)
# product A B Total
# region
# North 220 200 420
# South 310 180 490
# Total 530 380 910
Pivot vs Pivot Table
pd.DataFrame.pivot()reshapes data without aggregation — requires unique index/column combinationspd.DataFrame.pivot_table()supports aggregation and handles duplicate index/column pairs by applying the aggregation function
Melt (Unpivot)
DfMelt
Melt is the inverse of pivot. It unpivots a DataFrame from wide format to long format by converting columns into rows. Each original column becomes a value in a new "variable" column, and the cell values go into a "value" column.
wide = pd.DataFrame({
'name': ['Alice', 'Bob'],
'math': [95, 87],
'science': [92, 85],
'english': [88, 90]
})
print(wide)
# name math science english
# 0 Alice 95 92 88
# 1 Bob 87 85 90
# Melt to long format
long = pd.melt(
wide,
id_vars=['name'],
var_name='subject',
value_name='score'
)
print(long)
# name subject score
# 0 Alice math 95
# 1 Bob math 87
# 2 Alice science 92
# 3 Bob science 85
# 4 Alice english 88
# 5 Bob english 90
# Filter after melting
long[long['score'] > 90]
Stack and Unstack
# Stack — columns → index (wide to long)
df_wide = pd.DataFrame({
'A': [1, 2],
'B': [3, 4]
}, index=['x', 'y'])
stacked = df_wide.stack()
print(stacked)
# x A 1
# B 3
# y A 2
# B 4
# Unstack — index → columns (long to wide)
unstacked = stacked.unstack()
print(unstacked)
# A B
# x 1 3
# y 2 4
Cross Tabulation
# Crosstab — frequency table
survey = pd.DataFrame({
'gender': ['M', 'F', 'M', 'F', 'M', 'F', 'M', 'F'],
'preference': ['A', 'B', 'A', 'A', 'B', 'B', 'A', 'B']
})
ct = pd.crosstab(
survey['gender'],
survey['preference'],
margins=True,
normalize='index' # Row percentages
)
print(ct)
# preference A B All
# gender
# F 0.5 0.5 1.0
# M 0.5 0.5 1.0
# All 0.5 0.5 1.0
Key Takeaways
Summary: GroupBy, Merge and Pivot
- GroupBy follows Split-Apply-Combine: partition data, apply function, combine results
- Use
agg()for multiple aggregations,transform()to broadcast group results back,filter()to subset groups merge()is like SQL JOIN — supports inner, left, right, and outer joins on keysconcat()stacks DataFrames along an axis without requiring common keyspivot_table()reshapes long data to wide with aggregation;melt()is the inversestack()/unstack()move between multi-index levels and columns- Prefer named aggregation for readable, maintainable groupby operations
Practice Exercise
- Given a dataset of 1000 orders with columns (order_id, customer_id, product, quantity, price, date), compute total revenue per product per month using
pivot_table(). - Merge two employee DataFrames (one with salary info, one with department info) where the department DataFrame has duplicate entries. Handle the duplicate keys appropriately.
- Using
groupby()andtransform(), add a column showing each employee's salary as a percentage of their department's total salary. - Reshape a wide-format survey results DataFrame (columns: respondent_id, q1, q2, q3, ..., q20) into long format, then compute average score per question.
- Implement a "running total" within each group using
groupby().cumsum()and verify against manual calculation.