CW

Pandas GroupBy, Merge and Pivot

Module 2: NumPy & PandasFree Lesson

Advertisement

Pandas GroupBy, Merge and Pivot

Master data aggregation, merging, and reshaping in Pandas.

Split-Apply-Combine (GroupBy)1. SPLITGroup DataFrame by keydf.groupby('col')Dept ADept B2. APPLYApply aggregation function.mean() / .sum() / .agg()mean()mean()3. COMBINEMerge results backNew indexed DataFrameResult DataFrameMerge / Join OperationsINNERA ∩ B onlyLEFTAll from A, match BRIGHTAll from B, match AFULL OUTERA ∪ B

GroupBy: Split-Apply-Combine

DfGroupBy (Split-Apply-Combine)

The GroupBy operation follows the Split-Apply-Combine paradigm:

  1. Split: Partition the DataFrame into groups based on one or more keys
  2. Apply: Apply a function to each group independently (aggregation, transformation, or filtering)
  3. Combine: Concatenate the results back into a single structure

Formally, for a function f:RntomathbbRkf: \mathbb{R}^n \\to \\mathbb{R}^k, GroupBy computes f(gi)f(g_i) for each group gig_i, where gig_i is a subset of rows sharing the same key value.

Basic GroupBy

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'department': ['Engineering', 'Engineering', 'Sales', 'Sales', 'Marketing', 'Marketing'],
    'employee': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve', 'Frank'],
    'salary': [120000, 110000, 90000, 95000, 85000, 88000],
    'bonus': [15000, 12000, 8000, 9000, 7000, 7500]
})

# GroupBy creates a DataFrameGroupBy object
grouped = df.groupby('department')
print(type(grouped))  # <class 'pandas.core.groupby.generic.DataFrameGroupBy'>

# Aggregation — single function
print(grouped['salary'].mean())
# department
# Engineering    115000.0
# Marketing       86500.0
# Sales           92500.0

# Aggregation — multiple functions
print(grouped['salary'].agg(['mean', 'median', 'std', 'count']))

Multiple Aggregation Functions

# Single column, multiple functions
result = df.groupby('department')['salary'].agg(['mean', 'min', 'max'])
print(result)
#                 mean     min     max
# department
# Engineering  115000  110000  120000
# Marketing     86500   85000   88000
# Sales         92500   90000   95000

# Multiple columns, multiple functions
result = df.groupby('department').agg({
    'salary': ['mean', 'sum'],
    'bonus': ['mean', 'max']
})
print(result)
#                   salary           bonus
#                     mean     sum   mean    max
# department
# Engineering     115000  230000  13500  15000
# Marketing        86500  173000   7250   7500
# Sales            92500  185000   8500   9000

Named Aggregation (Pandas 0.25+)

Named Aggregation

Named aggregation provides a clean, readable syntax for specifying both the aggregation function and the output column name. It avoids the multi-level column index problem.

result = df.groupby('department').agg(
    avg_salary=('salary', 'mean'),
    total_bonus=('bonus', 'sum'),
    employee_count=('employee', 'count'),
    salary_range=('salary', lambda x: x.max() - x.min())
)
print(result)
#                 avg_salary  total_bonus  employee_count  salary_range
# department
# Engineering        115000        27000               2         10000
# Marketing           86500        14500               2          3000
# Sales               92500        17000               2          5000

GroupBy with Multiple Keys

# Multi-level grouping
df_multi = pd.DataFrame({
    'region': ['North', 'North', 'South', 'South', 'North', 'South'],
    'department': ['Eng', 'Sales', 'Eng', 'Sales', 'Eng', 'Sales'],
    'revenue': [100, 80, 90, 70, 110, 85]
})

result = df_multi.groupby(['region', 'department'])['revenue'].sum()
print(result)
# region  department
# North   Eng            210
#         Sales           80
# South   Eng             90
#         Sales          155

# Reset index for flat DataFrame
result_flat = result.reset_index()
print(result_flat)

Transform vs Apply

Transform vs Apply

  • transform() returns a result with the same index as the input — useful for adding group-level statistics as new columns
  • apply() can return any shape — more flexible but slower
  • filter() returns a subset of rows where the group-level condition is True
# transform — same size as input
df['salary_zscore'] = df.groupby('department')['salary'].transform(
    lambda x: (x - x.mean()) / x.std()
)
print(df[['department', 'employee', 'salary', 'salary_zscore']])

# filter — subset of groups
high_salary = df.groupby('department').filter(
    lambda x: x['salary'].mean() > 90000
)
print(high_salary)

# apply — arbitrary function
def top_n(group, n=1):
    return group.nlargest(n, 'salary')

top_earners = df.groupby('department').apply(top_n, n=1)
print(top_earners)

Merge / Join

DfMerge

Merge combines two DataFrames based on common keys (column values or indices). It implements relational algebra join operations. The on parameter specifies the join key; how specifies the join type (inner, left, right, outer).

Join Types

SQL Join Types in PandasINNERA ∩ BLEFTAll A+ match BFULL OUTERA ∪ BRIGHTAll B+ match A
employees = pd.DataFrame({
    'emp_id': [1, 2, 3, 4],
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'dept_id': [101, 102, 101, 103]
})

departments = pd.DataFrame({
    'dept_id': [101, 102, 104],
    'dept_name': ['Engineering', 'Sales', 'Finance']
})

# Inner join (default) — only matching rows
inner = pd.merge(employees, departments, on='dept_id', how='inner')
print(inner)
#    emp_id     name  dept_id    dept_name
# 0       1    Alice      101  Engineering
# 1       3  Charlie      101  Engineering
# 2       2      Bob      102        Sales

# Left join — all from left, matching from right
left = pd.merge(employees, departments, on='dept_id', how='left')
print(left)
#    emp_id     name  dept_id    dept_name
# 0       1    Alice      101  Engineering
# 1       2      Bob      102        Sales
# 2       3  Charlie      101  Engineering
# 3       4    Diana      103         NaN

# Outer join — all rows from both
outer = pd.merge(employees, departments, on='dept_id', how='outer')
print(outer)
#    emp_id     name  dept_id    dept_name
# 0     1.0    Alice    101.0  Engineering
# 1     2.0      Bob    102.0        Sales
# 2     3.0  Charlie    101.0  Engineering
# 3     4.0    Diana    103.0         NaN
# 4     NaN      NaN    104.0      Finance

Merge on Different Column Names

# When join keys have different names
pd.merge(employees, departments,
         left_on='dept_id',
         right_on='dept_id',
         how='left')

# On index
pd.merge(employees, departments,
         left_index=True,
         right_index=True,
         how='inner')

# Suffixes for overlapping column names
pd.merge(df1, df2, on='key', suffixes=('_left', '_right'))

Cross Join (Cartesian Product)

# Every combination of rows
sizes = pd.DataFrame({'size': ['S', 'M', 'L']})
colors = pd.DataFrame({'color': ['Red', 'Blue']})
products = pd.merge(sizes, colors, how='cross')
print(products)
#   size color
# 0    S   Red
# 1    S  Blue
# 2    M   Red
# 3    M  Blue
# 4    L   Red
# 5    L  Blue

Concat

DfConcat

pd.concat() stacks DataFrames along an axis (vertically by default). Unlike merge, it doesn't require common keys — it simply appends or aligns along the specified axis.

df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})
df3 = pd.DataFrame({'C': [9, 10]})  # Different columns

# Vertical concat (axis=0)
vertical = pd.concat([df1, df2], ignore_index=True)
print(vertical)
#    A  B
# 0  1  3
# 1  2  4
# 2  5  7
# 3  6  8

# Horizontal concat (axis=1)
horizontal = pd.concat([df1, df3], axis=1)
print(horizontal)
#    A  B   C
# 0  1  3   9
# 1  2  4  10

# With mismatched columns (outer join by default)
mismatched = pd.concat([df1, df3], axis=0, ignore_index=True)
print(mismatched)
#     A    B     C
# 0  1.0  3.0   NaN
# 1  2.0  4.0   NaN
# 2  NaN  NaN   9.0
# 3  NaN  NaN  10.0

Pivot Tables

DfPivot Table

A pivot table reshapes long-format data into wide-format by aggregating values at the intersection of specified row and column indices. It is equivalent to a GroupBy with multiple keys followed by unstacking.

sales = pd.DataFrame({
    'date': ['2024-01', '2024-01', '2024-02', '2024-02', '2024-01', '2024-02'],
    'region': ['North', 'South', 'North', 'South', 'North', 'South'],
    'product': ['A', 'A', 'A', 'A', 'B', 'B'],
    'revenue': [100, 150, 120, 160, 200, 180]
})

# Basic pivot table
pivot = sales.pivot_table(
    values='revenue',
    index='region',
    columns='product',
    aggfunc='sum'
)
print(pivot)
# product      A    B
# region
# North      220  200
# South      310  180

# Multiple aggregations
pivot_multi = sales.pivot_table(
    values='revenue',
    index=['region', 'date'],
    columns='product',
    aggfunc=['sum', 'mean', 'count']
)

# With margins (row/column totals)
pivot_margins = sales.pivot_table(
    values='revenue',
    index='region',
    columns='product',
    aggfunc='sum',
    margins=True,
    margins_name='Total'
)
print(pivot_margins)
# product      A    B  Total
# region
# North      220  200    420
# South      310  180    490
# Total      530  380    910

Pivot vs Pivot Table

  • pd.DataFrame.pivot() reshapes data without aggregation — requires unique index/column combinations
  • pd.DataFrame.pivot_table() supports aggregation and handles duplicate index/column pairs by applying the aggregation function

Melt (Unpivot)

DfMelt

Melt is the inverse of pivot. It unpivots a DataFrame from wide format to long format by converting columns into rows. Each original column becomes a value in a new "variable" column, and the cell values go into a "value" column.

wide = pd.DataFrame({
    'name': ['Alice', 'Bob'],
    'math': [95, 87],
    'science': [92, 85],
    'english': [88, 90]
})
print(wide)
#    name  math  science  english
# 0  Alice    95       92       88
# 1    Bob    87       85       90

# Melt to long format
long = pd.melt(
    wide,
    id_vars=['name'],
    var_name='subject',
    value_name='score'
)
print(long)
#     name  subject  score
# 0  Alice     math     95
# 1    Bob     math     87
# 2  Alice  science     92
# 3    Bob  science     85
# 4  Alice  english     88
# 5    Bob  english     90

# Filter after melting
long[long['score'] > 90]

Stack and Unstack

# Stack — columns → index (wide to long)
df_wide = pd.DataFrame({
    'A': [1, 2],
    'B': [3, 4]
}, index=['x', 'y'])

stacked = df_wide.stack()
print(stacked)
# x  A    1
#    B    3
# y  A    2
#    B    4

# Unstack — index → columns (long to wide)
unstacked = stacked.unstack()
print(unstacked)
#    A  B
# x  1  3
# y  2  4

Cross Tabulation

# Crosstab — frequency table
survey = pd.DataFrame({
    'gender': ['M', 'F', 'M', 'F', 'M', 'F', 'M', 'F'],
    'preference': ['A', 'B', 'A', 'A', 'B', 'B', 'A', 'B']
})

ct = pd.crosstab(
    survey['gender'],
    survey['preference'],
    margins=True,
    normalize='index'  # Row percentages
)
print(ct)
# preference    A    B  All
# gender
# F           0.5  0.5  1.0
# M           0.5  0.5  1.0
# All         0.5  0.5  1.0

Key Takeaways

Summary: GroupBy, Merge and Pivot

  1. GroupBy follows Split-Apply-Combine: partition data, apply function, combine results
  2. Use agg() for multiple aggregations, transform() to broadcast group results back, filter() to subset groups
  3. merge() is like SQL JOIN — supports inner, left, right, and outer joins on keys
  4. concat() stacks DataFrames along an axis without requiring common keys
  5. pivot_table() reshapes long data to wide with aggregation; melt() is the inverse
  6. stack()/unstack() move between multi-index levels and columns
  7. Prefer named aggregation for readable, maintainable groupby operations

Practice Exercise

  1. Given a dataset of 1000 orders with columns (order_id, customer_id, product, quantity, price, date), compute total revenue per product per month using pivot_table().
  2. Merge two employee DataFrames (one with salary info, one with department info) where the department DataFrame has duplicate entries. Handle the duplicate keys appropriately.
  3. Using groupby() and transform(), add a column showing each employee's salary as a percentage of their department's total salary.
  4. Reshape a wide-format survey results DataFrame (columns: respondent_id, q1, q2, q3, ..., q20) into long format, then compute average score per question.
  5. Implement a "running total" within each group using groupby().cumsum() and verify against manual calculation.

Advertisement

Need Expert Data Science Help?

Get personalized tutoring, project support, or professional consulting.

Advertisement