GroupBy Operations
DfSplit-Apply-Combine
The split-apply-combine paradigm (Wickham, 2011) is the foundational pattern for grouped data operations. It consists of three steps:
- Split: Divide the data into groups based on one or more keys
- Apply: Apply a function independently to each group (aggregation, transformation, or filtering)
- Combine: Merge the results back into a single data structure
This pattern is equivalent to SQL's GROUP BY but far more flexible in Python.
Split ā Apply ā Combine
āāāāāāāāāāāāāāā āāāāāāāāāāā āāāāāāāāāāāāāāā
ā DataFrame ā āāā ā Split ā āāā ā Aggregate ā
ā ā ā by key ā ā results ā
āāāāāāāāāāāāāāā āāāāāāāāāāā āāāāāāāāāāāāāāā
Basic GroupBy
import pandas as pd
import numpy as np
# Sample sales data
df = pd.DataFrame({
'date': pd.date_range('2024-01-01', periods=12, freq='M'),
'region': ['North', 'South', 'East', 'West'] * 3,
'product': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B'],
'sales': [100, 150, 200, 120, 180, 160, 220, 140, 190, 170, 210, 155],
'quantity': [10, 15, 20, 12, 18, 16, 22, 14, 19, 17, 21, 15]
})
print("Sales Data:")
print(df)
# GroupBy single column
region_sales = df.groupby('region')['sales'].sum()
print(f"\nTotal Sales by Region:\n{region_sales}")
# GroupBy multiple columns
region_product_sales = df.groupby(['region', 'product'])['sales'].sum()
print(f"\nSales by Region & Product:\n{region_product_sales}")
# Multiple aggregations
stats = df.groupby('region')['sales'].agg(['sum', 'mean', 'std', 'min', 'max'])
print(f"\nSales Statistics by Region:\n{stats}")
# Custom aggregation functions
def range_func(x):
return x.max() - x.min()
custom_stats = df.groupby('region')['sales'].agg(['sum', range_func])
print(f"\nCustom Aggregation:\n{custom_stats}")
Aggregate Methods
# Common aggregation functions
grouped = df.groupby('region')
# Single aggregation
print(f"Sum:\n{grouped['sales'].sum()}")
print(f"\nMean:\n{grouped['sales'].mean()}")
# Multiple aggregations on single column
print(f"\nMultiple agg:\n{grouped['sales'].agg(['sum', 'mean', 'count'])}")
# Different aggregations for different columns
result = df.groupby('region').agg({
'sales': ['sum', 'mean'],
'quantity': ['sum', 'mean'],
'date': 'count'
})
print(f"\nColumn-specific aggregations:\n{result}")
# Named aggregations (clearer output)
result = df.groupby('region').agg(
total_sales=('sales', 'sum'),
avg_sales=('sales', 'mean'),
total_quantity=('quantity', 'sum'),
transaction_count=('sales', 'count')
)
print(f"\nNamed aggregations:\n{result}")
Transform and Filter
# Transform: Return same-sized DataFrame
# Useful for adding group-level calculations as columns
# Add group mean as new column
df['region_avg_sales'] = df.groupby('region')['sales'].transform('mean')
print(df.head())
# Z-score within each group
df['sales_zscore'] = df.groupby('region')['sales'].transform(
lambda x: (x - x.mean()) / x.std()
)
print(f"\nWith Z-scores:\n{df.head()}")
# Filter: Keep only groups meeting condition
# Only keep regions with total sales > 400
filtered = df.groupby('region').filter(lambda x: x['sales'].sum() > 400)
print(f"\nFiltered (regions with sales > 400):\n{filtered}")
# Keep only top N groups
top_regions = df.groupby('region')['sales'].sum().nlargest(2).index
top_df = df[df['region'].isin(top_regions)]
print(f"\nTop 2 regions:\n{top_df}")
Apply Method
# Apply custom function to each group
def analyze_group(group):
return pd.Series({
'total_sales': group['sales'].sum(),
'avg_sales': group['sales'].mean(),
'sales_growth': group['sales'].pct_change().mean(),
'consistency': group['sales'].std() / group['sales'].mean()
})
result = df.groupby('region').apply(analyze_group)
print(f"Custom group analysis:\n{result}")
# Apply with multiple groups
def product_region_analysis(group):
return pd.Series({
'mean_sales': group['sales'].mean(),
'median_sales': group['sales'].median(),
'cv': group['sales'].std() / group['sales'].mean() # Coefficient of variation
})
result = df.groupby(['region', 'product']).apply(product_region_analysis)
print(f"\nProduct-Region Analysis:\n{result}")
Merging DataFrames
Types of Joins
Before writing merge code, understand what rows end up in the result for each join type.
Sample Data ā df_customers:
| customer_id | name | city |
|---|---|---|
| 1 | Alice | New York |
| 2 | Bob | LA |
| 3 | Charlie | Chicago |
| 4 | Diana | Boston |
Sample Data ā df_orders:
| order_id | customer_id | product |
|---|---|---|
| 101 | 1 | Laptop |
| 102 | 2 | Phone |
| 103 | 2 | Tablet |
| 104 | 3 | Laptop |
| 105 | 6 | Phone |
Inner Join ā Only rows that match in both tables
customers orders result
āāāāāāāā āāāāāāāā āāāāāāāā
ā 1 ā ā āāāāāāāāŗ ā 1 ā ā ā 1 ā Alice ā Laptop
ā 2 ā ā āāāā¬āāāāŗ ā 2 ā ā ā 2 ā Bob ā Phone
ā 3 ā ā āāāā¼āāāāŗ ā 2 ā ā ā 2 ā Bob ā Tablet
ā 4 ā ā ā 3 ā ā ā 3 ā Charlie ā Laptop
ā ā ā ā 6 ā āāāāāāāā
āāāāāāāā ā āāāāāāāā
ā Ć ā customer 6 doesn't exist ā excluded
āāāāāāāāāāāā ā customer 4 has no orders ā excluded
Result: 4 rows ā only customers 1, 2, 3 have matching orders.
Left Join ā All rows from left table (customers), matching from right
customers orders result
āāāāāāāā āāāāāāāā āāāāāāāā
ā 1 ā ā āāāāāāāāŗ ā 1 ā ā ā 1 ā Alice ā Laptop
ā 2 ā ā āāāā¬āāāāŗ ā 2 ā ā ā 2 ā Bob ā Phone
ā 3 ā ā āāāā¼āāāāŗ ā 2 ā ā ā 2 ā Bob ā Tablet
ā 4 ā ā āāāā¼āāāāŗ ā 3 ā ā ā 3 ā Charlie ā Laptop
ā ā ā ā ā ā 4 ā Diana ā NaN
āāāāāāāā ā āāāāāāāā āāāāāāāā
ā Ć ā order from customer 6 ā excluded
āāāāāāāāāā
Result: 5 rows ā ALL customers kept; Diana gets NaN for order columns.
Right Join ā All rows from right table (orders), matching from left
customers orders result
āāāāāāāā āāāāāāāā āāāāāāāā
ā 1 ā ā āāāāāāāāŗ ā 1 ā ā ā 1 ā Alice ā Laptop
ā 2 ā ā āāāā¬āāāāŗ ā 2 ā ā ā 2 ā Bob ā Phone
ā 3 ā ā āāāā¼āāāāŗ ā 2 ā ā ā 2 ā Bob ā Tablet
ā 4 ā ā ā 3 ā ā ā 3 ā Charlie ā Laptop
āāāāāāāā ā ā 6 ā ā ā 6 ā NaN ā Phone
ā āāāāāāāā āāāāāāāā
āāāāāāāāāā
Result: 5 rows ā ALL orders kept; customer 6 gets NaN for customer columns.
Full Outer Join ā All rows from both tables
customers orders result
āāāāāāāā āāāāāāāā āāāāāāāā
ā 1 ā ā āāāāāāāāŗ ā 1 ā ā ā 1 ā Alice ā Laptop
ā 2 ā ā āāāā¬āāāāŗ ā 2 ā ā ā 2 ā Bob ā Phone
ā 3 ā ā āāāā¼āāāāŗ ā 2 ā ā ā 2 ā Bob ā Tablet
ā 4 ā ā āāāā¼āāāāŗ ā 3 ā ā ā 3 ā Charlie ā Laptop
ā ā ā ā 6 ā ā ā 4 ā Diana ā NaN
āāāāāāāā ā āāāāāāāā ā 6 ā NaN ā Phone
ā āāāāāāāā
āāāāāāāāāā
Result: 6 rows ā EVERYTHING kept; unmatched rows get NaN where needed.
Quick Reference:
| Join Type | Kept from Left | Kept from Right | NaN for |
|---|---|---|---|
inner | Only matches | Only matches | Nothing ā non-matches excluded |
left | All | Only matches | Right side when no match |
right | Only matches | All | Left side when no match |
outer | All | All | Both sides when no match |
Merge Operations
# Create sample DataFrames
df_customers = pd.DataFrame({
'customer_id': [1, 2, 3, 4, 5],
'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
'city': ['New York', 'Los Angeles', 'Chicago', 'Boston', 'Denver']
})
df_orders = pd.DataFrame({
'order_id': [101, 102, 103, 104, 105],
'customer_id': [1, 2, 2, 3, 6],
'product': ['Laptop', 'Phone', 'Tablet', 'Laptop', 'Phone'],
'amount': [999, 699, 449, 999, 699]
})
# Inner Join (only matching rows)
inner = pd.merge(df_customers, df_orders, on='customer_id', how='inner')
print("Inner Join:")
print(inner)
# Left Join (all customers, even without orders)
left = pd.merge(df_customers, df_orders, on='customer_id', how='left')
print("\nLeft Join:")
print(left)
# Right Join (all orders, even without matching customer)
right = pd.merge(df_customers, df_orders, on='customer_id', how='right')
print("\nRight Join:")
print(right)
# Full Outer Join (all rows from both)
full = pd.merge(df_customers, df_orders, on='customer_id', how='outer')
print("\nFull Outer Join:")
print(full)
# Merge on different column names
df_customers_renamed = df_customers.rename(columns={'customer_id': 'id'})
merged = pd.merge(df_customers_renamed, df_orders,
left_on='id', right_on='customer_id',
suffixes=('_customer', '_order'))
print("\nMerge with different column names:")
print(merged)
Concatenation
# Vertical concatenation (stacking)
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})
df3 = pd.DataFrame({'A': [9, 10], 'B': [11, 12]})
# Stack rows
vertical = pd.concat([df1, df2, df3], ignore_index=True)
print("Vertical Concatenation:")
print(vertical)
# Horizontal concatenation (side by side)
df_left = pd.DataFrame({'A': [1, 2, 3]})
df_right = pd.DataFrame({'B': [4, 5, 6]})
horizontal = pd.concat([df_left, df_right], axis=1)
print("\nHorizontal Concatenation:")
print(horizontal)
# Concat with keys (create MultiIndex)
combined = pd.concat([df1, df2, df3], keys=['first', 'second', 'third'])
print("\nConcat with Keys:")
print(combined)
Pivot Tables
Pivot tables transform long-format data into wide-format summaries.
# Sample data (long format)
df_long = pd.DataFrame({
'month': ['Jan', 'Jan', 'Feb', 'Feb', 'Mar', 'Mar'] * 2,
'region': ['North', 'South'] * 3 * 2,
'product': ['A'] * 6 + ['B'] * 6,
'sales': np.random.randint(100, 500, 12)
})
print("Long Format Data:")
print(df_long)
# Pivot table
pivot = pd.pivot_table(
df_long,
values='sales',
index='month',
columns='region',
aggfunc='sum',
fill_value=0
)
print("\nPivot Table:")
print(pivot)
# Multiple aggregations
pivot_multi = pd.pivot_table(
df_long,
values='sales',
index='month',
columns='region',
aggfunc=['sum', 'mean', 'count'],
fill_value=0
)
print("\nPivot with Multiple Aggregations:")
print(pivot_multi)
# Pivot with margins (totals)
pivot_margins = pd.pivot_table(
df_long,
values='sales',
index='month',
columns='region',
aggfunc='sum',
margins=True,
margins_name='Total'
)
print("\nPivot with Margins:")
print(pivot_margins)
# GroupBy alternative
grouped_pivot = df_long.groupby(['month', 'region'])['sales'].sum().unstack()
print("\nGroupBy Alternative:")
print(grouped_pivot)
Pivot vs Melt
# Melt (wide to long)
df_wide = pd.DataFrame({
'name': ['Alice', 'Bob', 'Charlie'],
'math': [95, 88, 92],
'english': [85, 90, 88],
'science': [90, 85, 95]
})
print("Wide Format:")
print(df_wide)
# Melt to long format
df_melted = pd.melt(
df_wide,
id_vars=['name'],
var_name='subject',
value_name='score'
)
print("\nLong Format (Melted):")
print(df_melted)
# Reverse: Pivot back to wide
df_wide_again = df_melted.pivot(
index='name',
columns='subject',
values='score'
).reset_index()
df_wide_again.columns.name = None
print("\nBack to Wide:")
print(df_wide_again)
Window Functions
# Rolling window operations
df_sales = pd.DataFrame({
'date': pd.date_range('2024-01-01', periods=10),
'sales': [100, 120, 110, 130, 140, 125, 135, 145, 155, 160]
})
# 3-day rolling average
df_sales['rolling_3d'] = df_sales['sales'].rolling(window=3).mean()
# 3-day rolling sum
df_sales['rolling_sum'] = df_sales['sales'].rolling(window=3).sum()
# Expanding window (cumulative)
df_sales['cumulative_mean'] = df_sales['sales'].expanding().mean()
print("Window Functions:")
print(df_sales)
# Group-wise window functions
df['region_sales_rank'] = df.groupby('region')['sales'].rank(ascending=False)
df['region_sales_pct'] = df.groupby('region')['sales'].transform(
lambda x: x / x.sum() * 100
)
print("\nWith Rankings:")
print(df.head(10))
Practical Example: Sales Analysis
# Complete sales analysis workflow
import pandas as pd
import numpy as np
# Create comprehensive dataset
np.random.seed(42)
dates = pd.date_range('2023-01-01', '2023-12-31', freq='D')
regions = ['North', 'South', 'East', 'West']
products = ['Laptop', 'Phone', 'Tablet', 'Accessory']
n = len(dates) * len(regions) * len(products)
data = {
'date': np.random.choice(dates, n),
'region': np.random.choice(regions, n),
'product': np.random.choice(products, n),
'units_sold': np.random.randint(1, 50, n),
'unit_price': np.random.choice([999, 699, 449, 149], n)
}
df = pd.DataFrame(data)
df['revenue'] = df['units_sold'] * df['unit_price']
# 1. Revenue by region
print("Revenue by Region:")
print(df.groupby('region')['revenue'].sum().sort_values(ascending=False))
# 2. Monthly trend
df['month'] = df['date'].dt.month
monthly = df.groupby('month')['revenue'].sum()
print("\nMonthly Revenue Trend:")
print(monthly)
# 3. Top products by region
top_products = df.groupby(['region', 'product'])['revenue'].sum()
print("\nTop Products by Region:")
print(top_products.unstack())
# 4. Pivot table summary
pivot_summary = pd.pivot_table(
df,
values='revenue',
index='region',
columns='product',
aggfunc='sum',
margins=True,
margins_name='Total'
)
print("\nPivot Table Summary:")
print(pivot_summary)
# 5. Rolling 7-day average
daily_revenue = df.groupby('date')['revenue'].sum()
rolling_avg = daily_revenue.rolling(window=7).mean()
print("\n7-Day Rolling Average (First 10 days):")
print(rolling_avg.head(10))
Key Takeaways
šSummary: Pandas GroupBy, Merge, Pivot
- GroupBy follows split-apply-combine paradigm
- Merge is like SQL joins ā choose inner, left, right, or outer
- Concat stacks DataFrames vertically or horizontally
- Pivot tables transform long to wide format with aggregations
- Window functions enable rolling and cumulative calculations
Practice Exercise
- Create a sales dataset with dates, regions, products, and amounts
- Calculate total sales by region and product using GroupBy
- Create a pivot table showing monthly sales by region
- Merge with a customer table to add customer information
- Add 7-day rolling averages and rankings