The Median
The median is the middle value of a dataset sorted in ascending order. It splits the data in half: at least 50% of values are less than or equal to it, and at least 50% are greater than or equal to it.
Calculation
For odd n: Median = the middle value (position (n+1)/2)
For even n: Median = average of the two middle values
import numpy as np
import pandas as pd
from scipy import stats
# Odd n
data_odd = [3, 7, 12, 8, 5, 15, 9]
sorted_odd = sorted(data_odd)
print(f"Sorted: {sorted_odd}")
print(f"n = {len(sorted_odd)} (odd)")
middle_pos = (len(sorted_odd) + 1) // 2
print(f"Middle position: {middle_pos}")
print(f"Median = {sorted_odd[middle_pos - 1]}")
print(f"NumPy confirms: {np.median(data_odd)}")
# Even n
data_even = [3, 7, 12, 8, 5, 15, 9, 11]
sorted_even = sorted(data_even)
print(f"\nSorted: {sorted_even}")
print(f"n = {len(sorted_even)} (even)")
n = len(sorted_even)
lower_mid = sorted_even[n//2 - 1]
upper_mid = sorted_even[n//2]
print(f"Two middle values: {lower_mid} and {upper_mid}")
print(f"Median = ({lower_mid} + {upper_mid})/2 = {(lower_mid + upper_mid)/2}")
print(f"NumPy confirms: {np.median(data_even)}")
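The odd and even cases above can be folded into a single helper. This is a minimal sketch of that rule; the name `simple_median` is hypothetical and is not a NumPy function.

```python
import numpy as np

def simple_median(values):
    """Median via the odd/even rule (hypothetical helper for illustration)."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty dataset is undefined")
    mid = n // 2
    if n % 2 == 1:
        # Odd n: 1-based position (n+1)/2 corresponds to 0-based index n//2
        return ordered[mid]
    # Even n: average the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2

data_odd = [3, 7, 12, 8, 5, 15, 9]
data_even = [3, 7, 12, 8, 5, 15, 9, 11]
print(simple_median(data_odd), np.median(data_odd))
print(simple_median(data_even), np.median(data_even))
```

In practice `np.median` is the better choice; the helper only makes the position arithmetic explicit.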
Robustness to Outliers
The median's greatest strength: a single extreme value cannot move it far.
import matplotlib.pyplot as plt
np.random.seed(42)
# Compare sensitivity to outliers
base_data = np.random.normal(50, 5, 100)
# Add increasingly extreme outliers
multipliers = [1, 2, 5, 10, 50, 100, 1000]
means = []
medians = []
for mult in multipliers:
    # Append one outlier of size 50 * mult and recompute both statistics
    data_with_outlier = np.append(base_data, 50 * mult)
    means.append(np.mean(data_with_outlier))
    medians.append(np.median(data_with_outlier))
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
outlier_values = [50 * m for m in multipliers]
ax1.plot(outlier_values, means, 'r-o', label='Mean')
ax1.plot(outlier_values, medians, 'b-o', label='Median')
ax1.set_xlabel('Outlier Value')
ax1.set_ylabel('Statistic Value')
ax1.set_title('Effect of Outlier on Mean vs Median')
ax1.legend()
ax1.set_xscale('log')
# Show breakdown point
print("Breakdown point of median: 50%")
print("Breakdown point of mean: 0% (a single sufficiently extreme value can move it arbitrarily far)")
print("\nWith outlier = 50000:")
data_extreme = np.append(base_data, 50000)
print(f"Mean = {np.mean(data_extreme):.2f} (was ~50)")
print(f"Median = {np.median(data_extreme):.2f} (barely changed!)")
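The 50% breakdown point can be checked numerically by replacing a growing share of a sample with an extreme value. The sketch below is illustrative only: the sample size of 101, the contaminating value `1e6`, and the variable names are assumptions chosen so that "fewer than half" and "more than half" are easy to see.

```python
import numpy as np

rng = np.random.default_rng(0)
clean = rng.normal(50, 5, 101)  # 101 values centred near 50

results = {}
for n_bad in [0, 10, 25, 50, 51]:
    contaminated = clean.copy()
    contaminated[:n_bad] = 1e6  # overwrite the first n_bad values with an extreme value
    results[n_bad] = np.median(contaminated)
    print(f"{n_bad:3d} contaminated of 101 -> median = {results[n_bad]:.2f}")
```

With 50 of 101 values contaminated, the median shifts to the largest remaining clean value but stays within the range of the clean data. With 51 contaminated, the median becomes the contaminating value itself.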
Median for Grouped Data
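When data is summarized in a frequency table, the median is estimated by linear interpolation within the median class:
$$\text{Median} = L + \left(\frac{n/2 - F}{f}\right) \times h$$
Where:
- n = total number of observations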
- L = lower boundary of median class
- F = cumulative frequency before median class
- f = frequency of median class
- h = class width
# Frequency table
freq_table = pd.DataFrame({
'Class': ['20-29', '30-39', '40-49', '50-59', '60-69'],
'f': [5, 12, 20, 18, 5]
})
freq_table['cum_f'] = freq_table['f'].cumsum()
n = freq_table['f'].sum()
print(f"Total n = {n}, n/2 = {n/2}")
print(freq_table)
# Find median class (first class whose cumulative frequency reaches or exceeds n/2)
median_class_idx = (freq_table['cum_f'] >= n/2).idxmax()
print(f"\nMedian class: {freq_table.loc[median_class_idx, 'Class']}")
L = 40 # lower limit of median class (40-49); if the values are integers, the class boundary convention would use 39.5
F = freq_table.loc[median_class_idx - 1, 'cum_f'] # cumulative frequency BEFORE
f = freq_table.loc[median_class_idx, 'f']
h = 10 # class width
median_grouped = L + ((n/2 - F) / f) * h
print(f"Estimated median = {L} + ({n/2} - {F}) / {f} × {h} = {median_grouped:.2f}")
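The same calculation can be wrapped in a reusable function so that `L` and `F` are looked up rather than hard-coded. This is a sketch under stated assumptions: the name `grouped_median` and its parameters are hypothetical, all classes are assumed to share one width, and the lower bounds are passed in as given (20, 30, ...), matching the hand calculation above.

```python
import numpy as np

def grouped_median(lower_bounds, freqs, width):
    """Interpolated median for equal-width classes (hypothetical helper)."""
    freqs = np.asarray(freqs, dtype=float)
    cum = np.cumsum(freqs)
    n = cum[-1]
    # First class whose cumulative frequency reaches or exceeds n/2
    k = int(np.argmax(cum >= n / 2))
    # Cumulative frequency before the median class (0 if it is the first class)
    F = cum[k - 1] if k > 0 else 0.0
    return lower_bounds[k] + (n / 2 - F) / freqs[k] * width

est = grouped_median([20, 30, 40, 50, 60], [5, 12, 20, 18, 5], 10)
print(f"Estimated median = {est:.2f}")  # 40 + (30 - 17) / 20 * 10 = 46.50
```

Handling `k == 0` separately avoids indexing before the first row, which the hard-coded version would do if the median fell in the first class.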
Quartiles: Generalizing the Median
The median is the 50th percentile. Quartiles extend this:
- Q1 = 25th percentile
- Q2 = median = 50th percentile
- Q3 = 75th percentile
data = np.random.normal(70, 15, 200)
q1, q2, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1
print(f"Q1 (25th pct): {q1:.2f}")
print(f"Q2 / Median: {q2:.2f}")
print(f"Q3 (75th pct): {q3:.2f}")
print(f"IQR = Q3 - Q1: {iqr:.2f}")
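Adding the minimum and maximum to the three quartiles gives the five-number summary. The small dataset below is made up for illustration so the values can be checked by hand; `np.percentile` uses linear interpolation by default, and other quartile conventions can give slightly different Q1 and Q3.

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])

# Percentiles 0 and 100 are the minimum and maximum
minimum, q1, med, q3, maximum = np.percentile(data, [0, 25, 50, 75, 100])

print(f"Min = {minimum}, Q1 = {q1}, Median = {med}, Q3 = {q3}, Max = {maximum}")
print(f"IQR = {q3 - q1}")
```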
Key Takeaways
- The median splits the distribution in half by frequency, not by value
- Breakdown point of 50%: the median stays bounded as long as fewer than half of the values are contaminated
- For skewed data such as incomes or prices, report the median alongside the mean
- The median minimizes the sum of absolute deviations, Σ|xᵢ − c|, over all candidate centers c
- Quartiles extend the median concept; Q1, the median, and Q3, together with the minimum and maximum, form the five-number summary
- For grouped data, use the interpolation formula — the result is an estimate, not exact
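The claim that the median minimizes the sum of absolute deviations can be checked with a brute-force grid search. This is a rough numerical sketch, not a proof: the grid range and step are arbitrary choices, and the squared-deviation cost is included only for contrast, since it is minimized by the mean.

```python
import numpy as np

data = np.array([3, 7, 12, 8, 5, 15, 9])  # odd n, so the minimizer is unique
candidates = np.linspace(0, 20, 2001)     # candidate centers c, step 0.01

# Deviations of every data point from every candidate center
diffs = data[:, None] - candidates[None, :]
abs_cost = np.abs(diffs).sum(axis=0)  # sum of absolute deviations per candidate
sq_cost = (diffs ** 2).sum(axis=0)    # sum of squared deviations per candidate

best_abs = candidates[np.argmin(abs_cost)]
best_sq = candidates[np.argmin(sq_cost)]

print(f"Minimizer of absolute deviations: {best_abs:.2f} (median = {np.median(data)})")
print(f"Minimizer of squared deviations:  {best_sq:.2f} (mean = {np.mean(data):.2f})")
```

For even n the absolute-deviation cost is flat between the two middle values, so any point in that interval is a minimizer and the usual median is its midpoint.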