Median — Calculation, Robustness, and When to Use It

The Median

The median is the middle value of a dataset when sorted in ascending order. It divides the distribution in half: at least 50% of values are less than or equal to it, and at least 50% are greater than or equal to it.


Calculation

For odd n: Median = the middle value (position (n+1)/2)

For even n: Median = average of the two middle values

M = \frac{x_{(n/2)} + x_{(n/2+1)}}{2}

import numpy as np
import pandas as pd
from scipy import stats

# Odd n
data_odd = [3, 7, 12, 8, 5, 15, 9]
sorted_odd = sorted(data_odd)
print(f"Sorted: {sorted_odd}")
print(f"n = {len(sorted_odd)} (odd)")
middle_pos = (len(sorted_odd) + 1) // 2
print(f"Middle position: {middle_pos}")
print(f"Median = {sorted_odd[middle_pos - 1]}")
print(f"NumPy confirms: {np.median(data_odd)}")

# Even n
data_even = [3, 7, 12, 8, 5, 15, 9, 11]
sorted_even = sorted(data_even)
print(f"\nSorted: {sorted_even}")
print(f"n = {len(sorted_even)} (even)")
n = len(sorted_even)
lower_mid = sorted_even[n//2 - 1]
upper_mid = sorted_even[n//2]
print(f"Two middle values: {lower_mid} and {upper_mid}")
print(f"Median = ({lower_mid} + {upper_mid})/2 = {(lower_mid + upper_mid)/2}")
print(f"NumPy confirms: {np.median(data_even)}")
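The odd and even cases can also be folded into a single helper. The following is a minimal sketch, assuming a hypothetical function name `manual_median` that is not part of the lesson; it is checked against the standard library's `statistics.median` rather than presented as a replacement for it.

```python
import statistics

def manual_median(values):
    """Return the median of a non-empty sequence of numbers."""
    if len(values) == 0:
        raise ValueError("median of an empty dataset is undefined")
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd n: 1-based position (n+1)/2 is 0-based index n // 2
        return ordered[mid]
    # Even n: average the values at 1-based positions n/2 and n/2 + 1
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_result = manual_median([3, 7, 12, 8, 5, 15, 9])
even_result = manual_median([3, 7, 12, 8, 5, 15, 9, 11])
print(odd_result, statistics.median([3, 7, 12, 8, 5, 15, 9]))
print(even_result, statistics.median([3, 7, 12, 8, 5, 15, 9, 11]))
```

Sorting a copy leaves the caller's data untouched, and the explicit empty-input check avoids an opaque index error.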

Robustness to Outliers

The median's greatest strength: a single extreme value cannot move it far.

import matplotlib.pyplot as plt

np.random.seed(42)

# Compare sensitivity to outliers
base_data = np.random.normal(50, 5, 100)

# Add increasingly extreme outliers
multipliers = [1, 2, 5, 10, 50, 100, 1000]
means = []
medians = []

for mult in multipliers:
    data_with_outlier = np.append(base_data, 50 * mult)
    means.append(np.mean(data_with_outlier))
    medians.append(np.median(data_with_outlier))

# One panel is enough: both statistics share the same axes
fig, ax1 = plt.subplots(figsize=(8, 4))

outlier_values = [50 * m for m in multipliers]
ax1.plot(outlier_values, means, 'r-o', label='Mean')
ax1.plot(outlier_values, medians, 'b-o', label='Median')
ax1.set_xlabel('Outlier Value (log scale)')
ax1.set_ylabel('Statistic Value')
ax1.set_title('Effect of Outlier on Mean vs Median')
ax1.legend()
ax1.set_xscale('log')
plt.tight_layout()
plt.show()

# Show breakdown point
print("Breakdown point of median: 50%")
print("Breakdown point of mean: 0% (a single extreme value can move it arbitrarily far)")
print("\nWith outlier = 50000:")
data_extreme = np.append(base_data, 50000)
print(f"Mean = {np.mean(data_extreme):.2f} (was ~50)")
print(f"Median = {np.median(data_extreme):.2f} (barely changed!)")
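The 50% breakdown point can be illustrated by contaminating a growing share of the sample rather than a single value. This is a rough sketch under stated assumptions: the helper `contaminate`, the contaminating value `1e9`, and the sample size of 101 are all illustrative choices, not part of the lesson.

```python
import numpy as np

rng = np.random.default_rng(0)
clean = rng.normal(50, 5, 101)  # 101 well-behaved observations
n = clean.size

def contaminate(data, k, bad_value=1e9):
    """Return a copy of data with the first k entries replaced by bad_value."""
    out = data.copy()
    out[:k] = bad_value
    return out

# The median is the 51st ordered value, so it stays among the clean
# observations until more than half of the sample is replaced.
for k in [1, 25, 50, 51]:
    dirty = contaminate(clean, k)
    print(f"{k:3d} of {n} replaced: "
          f"mean = {np.mean(dirty):.3g}, median = {np.median(dirty):.3g}")
```

The mean is pulled away by the very first replaced value, while the median only jumps to the contaminating value once 51 of the 101 observations have been replaced.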

Median for Grouped Data

When data is summarized in a frequency table:

M = L + \left(\frac{n/2 - F}{f}\right) \times h

Where:

  • L = lower boundary of median class
  • n = total number of observations (sum of all frequencies)
  • F = cumulative frequency before median class
  • f = frequency of median class
  • h = class width

# Frequency table
freq_table = pd.DataFrame({
    'Class': ['20-29', '30-39', '40-49', '50-59', '60-69'],
    'f': [5, 12, 20, 18, 5]
})
freq_table['cum_f'] = freq_table['f'].cumsum()
n = freq_table['f'].sum()
print(f"Total n = {n}, n/2 = {n/2}")
print(freq_table)

# Find median class (the first class whose cumulative frequency reaches or exceeds n/2)
median_class_idx = (freq_table['cum_f'] >= n/2).idxmax()
print(f"\nMedian class: {freq_table.loc[median_class_idx, 'Class']}")

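# Lower class limit of the median class, parsed from its label (40 for '40-49').
# If the values were rounded to whole numbers, the class boundary 39.5 may be more appropriate.
L = int(freq_table.loc[median_class_idx, 'Class'].split('-')[0])
# Cumulative frequency BEFORE the median class (0 if the median class is the first one)
F = freq_table.loc[median_class_idx - 1, 'cum_f'] if median_class_idx > 0 else 0
f = freq_table.loc[median_class_idx, 'f']
h = 10  # class width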

median_grouped = L + ((n/2 - F) / f) * h
print(f"Estimated median = {L} + ({n/2} - {F}) / {f} × {h} = {median_grouped:.2f}")
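The same interpolation can be wrapped in a function that does not rely on pandas or on a hardcoded median class. This is a minimal sketch, assuming a hypothetical helper `grouped_median` that takes the lower boundaries, widths, and frequencies as plain lists; it is one way to organize the calculation, not the only one.

```python
def grouped_median(lower_bounds, widths, freqs):
    """Estimate the median from class lower boundaries, widths, and frequencies."""
    n = sum(freqs)
    if n == 0:
        raise ValueError("total frequency must be positive")
    half = n / 2
    cum_before = 0  # F: cumulative frequency before the current class
    for L, h, f in zip(lower_bounds, widths, freqs):
        if f > 0 and cum_before + f >= half:
            # Median class found: interpolate linearly inside it
            return L + (half - cum_before) / f * h
        cum_before += f
    raise RuntimeError("no median class found; check the input lengths")

# Same frequency table as above, with lower boundaries 20, 30, ..., 60
estimate = grouped_median([20, 30, 40, 50, 60], [10] * 5, [5, 12, 20, 18, 5])
print(f"Estimated median = {estimate:.2f}")
```

Starting the running total at zero means the first class needs no special case, and passing the widths explicitly allows classes of unequal width.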

Quartiles: Generalizing the Median

The median is the 50th percentile. Quartiles extend this:

  • Q1 = 25th percentile
  • Q2 = median = 50th percentile
  • Q3 = 75th percentile

# Simulated scores: 200 draws from a normal distribution (mean 70, SD 15)
data = np.random.normal(70, 15, 200)

q1, q2, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1

print(f"Q1 (25th pct): {q1:.2f}")
print(f"Q2 / Median:   {q2:.2f}")
print(f"Q3 (75th pct): {q3:.2f}")
print(f"IQR = Q3 - Q1: {iqr:.2f}")
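Adding the minimum and maximum to the three quartiles gives the five-number summary. The sketch below assumes illustrative simulated data (the variable `scores` and its seed are not from the lesson) and uses `np.percentile` with its default linear interpolation, under which the 0th and 100th percentiles are the minimum and maximum.

```python
import numpy as np

rng = np.random.default_rng(1)
scores = rng.normal(70, 15, 200)  # illustrative data, not from the lesson

# Percentiles 0 and 100 are the minimum and maximum
minimum, q1, med, q3, maximum = np.percentile(scores, [0, 25, 50, 75, 100])

print(f"min = {minimum:.2f}, Q1 = {q1:.2f}, median = {med:.2f}, "
      f"Q3 = {q3:.2f}, max = {maximum:.2f}")
print(f"50th percentile equals np.median: {np.isclose(med, np.median(scores))}")
```

Other percentile conventions exist, so quartiles computed by different software may differ slightly on small samples.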

Key Takeaways

  1. The median splits the distribution in half by frequency, not by value
  2. Breakdown point of 50%: the median stays bounded as long as fewer than half of the observations are contaminated
  3. For skewed data such as income or prices, report the median alongside the mean
  4. The median minimizes the sum of absolute deviations from a central value
  5. Quartiles extend the median concept; with the minimum and maximum they form the five-number summary
  6. For grouped data, use the interpolation formula; the result is an estimate, not an exact value
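The claim that the median minimizes the sum of absolute deviations can be checked numerically. This is a rough sketch that searches a grid of candidate centers; the helper `sum_abs_dev` and the grid resolution are assumptions made for illustration, and a grid search only approximates the minimizer.

```python
import numpy as np

values = np.array([3, 7, 12, 8, 5, 15, 9])  # odd-n dataset from the Calculation section

def sum_abs_dev(center, data):
    """Sum of absolute deviations of data from a candidate center."""
    return np.abs(data - center).sum()

# Evaluate the cost on a fine grid between the smallest and largest value
candidates = np.linspace(values.min(), values.max(), 1201)  # step 0.01
costs = np.array([sum_abs_dev(c, values) for c in candidates])
best = candidates[np.argmin(costs)]

print(f"Grid minimizer: {best:.2f}")
print(f"Median:         {np.median(values):.2f}")
print(f"Cost at median: {sum_abs_dev(np.median(values), values):.2f}")
print(f"Cost at mean:   {sum_abs_dev(values.mean(), values):.2f}")
```

For an even number of observations, every value between the two middle observations gives the same minimal cost, so the minimizer is not unique in that case.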
