The Median

Descriptive Statistics

The Value That Splits Your Data Exactly in Half

The median is the middle value of a sorted dataset. It splits the data in half: at least 50% of values are at or below it, and at least 50% are at or above it.

Robust to outliers: a single extreme value can move the median no further than a neighboring sorted value
Works for ordinal data: the median needs only an ordering, while the mean needs arithmetic
Minimizes absolute deviations: the optimal center under absolute (L1) loss
Income, housing, and skewed data: the median stays near a typical value when a long tail inflates the mean

When data is skewed or contaminated with outliers, the median is your most honest summary.
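The claim that the median minimizes absolute deviations can be checked numerically. The sketch below uses a small made-up sample and a brute-force grid of candidate centers (both are illustrative assumptions, not part of the original material) and compares the center that minimizes total absolute deviation with the one that minimizes total squared deviation.

```python
import numpy as np

# Illustrative data (made up for this sketch), with one large outlier.
values = np.array([2.0, 3.0, 5.0, 8.0, 100.0])

# Candidate centers on a fine grid spanning the data (step 0.01).
candidates = np.linspace(values.min(), values.max(), 9801)

# Total absolute and total squared deviation for every candidate center.
diffs = values[:, None] - candidates[None, :]
abs_loss = np.abs(diffs).sum(axis=0)
sq_loss = (diffs ** 2).sum(axis=0)

best_abs = candidates[abs_loss.argmin()]
best_sq = candidates[sq_loss.argmin()]

print(f"Minimizer of absolute loss: {best_abs:.2f} (median = {np.median(values):.2f})")
print(f"Minimizer of squared loss:  {best_sq:.2f} (mean   = {np.mean(values):.2f})")
```

On this sample the absolute-loss minimizer lands on the median and the squared-loss minimizer lands on the mean, which the outlier drags far from the bulk of the data.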

What is the Median?

Definition

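The median is the middle value of a dataset sorted in ascending order. It splits the data in half: at least 50% of values are less than or equal to it, and at least 50% are greater than or equal to it. With tied values the two halves need not contain exactly the same number of points.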
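Because this definition needs only a sort order, it also applies to ordered categories where averaging is undefined. The following is a minimal sketch with a hypothetical rating scale and made-up responses. It uses `statistics.median_low` from the standard library, which returns an observed value rather than averaging the two middle ones.

```python
from statistics import median_low

# Hypothetical ordered rating scale, lowest to highest.
levels = ["poor", "fair", "good", "very good", "excellent"]
rank = {label: i for i, label in enumerate(levels)}

# Made-up survey responses.
responses = ["good", "poor", "excellent", "fair", "good", "very good", "good"]

# Sort by rank and take the middle response. median_low never averages,
# so the result is always one of the categories on the scale.
median_rank = median_low(rank[r] for r in responses)
median_label = levels[median_rank]

print(f"Median response: {median_label}")
```

For an even number of responses `median_low` picks the lower of the two middle categories. That is one convention among several, chosen here because "halfway between good and very good" has no meaning on an ordinal scale.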

Calculation

For odd n: Median = the middle value (position (n+1)/2)

For even n: Median = average of the two middle values (positions n/2 and n/2 + 1)

import numpy as np
import pandas as pd
from scipy import stats

# Odd n
data_odd = [3, 7, 12, 8, 5, 15, 9]
sorted_odd = sorted(data_odd)
print(f"Sorted: {sorted_odd}")
print(f"n = {len(sorted_odd)} (odd)")
middle_pos = (len(sorted_odd) + 1) // 2
print(f"Middle position: {middle_pos}")
print(f"Median = {sorted_odd[middle_pos - 1]}")
print(f"NumPy confirms: {np.median(data_odd)}")

# Even n
data_even = [3, 7, 12, 8, 5, 15, 9, 11]
sorted_even = sorted(data_even)
print(f"\nSorted: {sorted_even}")
print(f"n = {len(sorted_even)} (even)")
n = len(sorted_even)
lower_mid = sorted_even[n//2 - 1]
upper_mid = sorted_even[n//2]
print(f"Two middle values: {lower_mid} and {upper_mid}")
print(f"Median = ({lower_mid} + {upper_mid})/2 = {(lower_mid + upper_mid)/2}")
print(f"NumPy confirms: {np.median(data_even)}")

Robustness to Outliers

import matplotlib.pyplot as plt

np.random.seed(42)

# Compare sensitivity to outliers
base_data = np.random.normal(50, 5, 100)

# Add increasingly extreme outliers
multipliers = [1, 2, 5, 10, 50, 100, 1000]
means = []
medians = []

for mult in multipliers:
    data_with_outlier = np.append(base_data, 50 * mult)
    means.append(np.mean(data_with_outlier))
    medians.append(np.median(data_with_outlier))

# Single panel: the original two-panel layout left the second axis empty
fig, ax = plt.subplots(figsize=(6, 4))

outlier_values = [50 * m for m in multipliers]
ax.plot(outlier_values, means, 'r-o', label='Mean')
ax.plot(outlier_values, medians, 'b-o', label='Median')
ax.set_xlabel('Outlier Value (log scale)')
ax.set_ylabel('Statistic Value')
ax.set_title('Effect of Outlier on Mean vs Median')
ax.legend()
ax.set_xscale('log')
plt.tight_layout()
plt.show()

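# Breakdown point: the fraction of the data that must be corrupted
# before the statistic can be pushed arbitrarily far
print("Breakdown point of median: 50%")
print("Breakdown point of mean: 0% (one extreme value can move it arbitrarily far)")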
print("\nWith outlier = 50000:")
data_extreme = np.append(base_data, 50000)
print(f"Mean = {np.mean(data_extreme):.2f} (was ~50)")
print(f"Median = {np.median(data_extreme):.2f} (barely changed!)")
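The 50% breakdown point quoted above can be illustrated by corrupting a growing share of a sample. This is a sketch under assumed settings: the sample size of 101, the corrupt value of 1e6, and the helper `contaminate` are all hypothetical choices made for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
clean = rng.normal(50, 5, 101)  # 101 points centered near 50


def contaminate(data, k, bad_value=1e6):
    """Return a copy of data with its first k entries replaced by bad_value."""
    out = data.copy()
    out[:k] = bad_value
    return out


for k in (1, 25, 50, 51):
    d = contaminate(clean, k)
    print(f"{k:>2} of {len(d)} corrupted: "
          f"mean = {np.mean(d):>10.1f}, median = {np.median(d):>10.1f}")
```

The mean is thrown off by a single corrupted point. The median stays within the range of the clean values until more than half of the sample (51 of 101 points) is corrupted.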

Median for Grouped Data

# Frequency table
freq_table = pd.DataFrame({
    'Class': ['20-29', '30-39', '40-49', '50-59', '60-69'],
    'f': [5, 12, 20, 18, 5]
})
freq_table['cum_f'] = freq_table['f'].cumsum()
n = freq_table['f'].sum()
print(f"Total n = {n}, n/2 = {n/2}")
print(freq_table)

# Find median class (first class whose cumulative frequency reaches n/2)
median_class_idx = (freq_table['cum_f'] >= n/2).idxmax()
print(f"\nMedian class: {freq_table.loc[median_class_idx, 'Class']}")

L = 40  # lower boundary of median class (40-49), treating the class as [40, 50)
# Cumulative frequency BEFORE the median class (0 if the median class is the first one)
F = freq_table.loc[median_class_idx - 1, 'cum_f'] if median_class_idx > 0 else 0
f = freq_table.loc[median_class_idx, 'f']  # frequency of the median class
h = 10  # class width

median_grouped = L + ((n/2 - F) / f) * h
print(f"Estimated median = {L} + ({n/2} - {F}) / {f} × {h} = {median_grouped:.2f}")
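The calculation above applies the interpolation formula median ≈ L + ((n/2 − F) / f) × h, with L and h entered by hand. A reusable version might look like the sketch below. The function name `grouped_median` and its arguments are hypothetical, and it assumes equal-width classes whose lower boundaries are supplied explicitly.

```python
import numpy as np


def grouped_median(lower_bounds, width, freqs):
    """Estimate the median of grouped data by linear interpolation.

    lower_bounds: lower boundary of each class, in ascending order
    width:        common class width (h)
    freqs:        frequency of each class (f)
    """
    freqs = np.asarray(freqs, dtype=float)
    cum = np.cumsum(freqs)
    half = cum[-1] / 2                      # n/2
    i = int(np.searchsorted(cum, half))     # first class with cum_f >= n/2
    F = cum[i - 1] if i > 0 else 0.0        # cumulative frequency before it
    return lower_bounds[i] + (half - F) / freqs[i] * width


estimate = grouped_median([20, 30, 40, 50, 60], 10, [5, 12, 20, 18, 5])
print(f"Estimated median = {estimate:.2f}")
```

The interpolation assumes values are spread evenly inside the median class, so the result is an estimate rather than the exact median of the underlying data.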

Quartiles: Generalizing the Median

data = np.random.normal(70, 15, 200)

q1, q2, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1

print(f"Q1 (25th pct): {q1:.2f}")
print(f"Q2 / Median:   {q2:.2f}")
print(f"Q3 (75th pct): {q3:.2f}")
print(f"IQR = Q3 - Q1: {iqr:.2f}")
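A common use of the quartiles and IQR is flagging outliers with fences placed 1.5 × IQR beyond Q1 and Q3. The sketch below uses a made-up sample with two planted extremes. The 1.5 multiplier is a widely used convention rather than a rule stated in this article.

```python
import numpy as np

rng = np.random.default_rng(1)
# Illustrative sample: 200 roughly normal scores plus two planted extremes.
scores = np.concatenate([rng.normal(70, 15, 200), [250.0, -120.0]])

q1, q3 = np.percentile(scores, [25, 75])
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the quartiles (conventional multiplier).
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = scores[(scores < lower_fence) | (scores > upper_fence)]

print(f"Fences: [{lower_fence:.2f}, {upper_fence:.2f}]")
print(f"Flagged {len(outliers)} of {len(scores)} values")
```

Because the fences are built from quartiles, the planted extremes barely move them, so they are still flagged.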

The Median in Machine Learning

In ML, the median is the robust choice:

ML Application	Why Median?	What Happens with Mean
MAE loss (Huber for large residuals)	Minimized by the median	MSE is minimized by the mean, which outliers pull
RobustScaler	Centers at median, scales by IQR	StandardScaler affected by outliers
Missing value imputation	Robust to extreme values	Mean imputation distorted by outliers
Outlier detection	IQR fence uses quartiles	Mean ± std fails on skewed data
Feature ranking	Median test for non-normal data	t-test assumes normality

import numpy as np
from sklearn.linear_model import HuberRegressor, LinearRegression
from sklearn.preprocessing import RobustScaler, StandardScaler

np.random.seed(42)

# Compare squared loss vs Huber loss on data with outliers
n = 200
X = np.random.randn(n, 1) * 10
y = 3 * X[:,0] + np.random.randn(n) * 5

# Add outliers
y[::20] += np.random.randn(10) * 50  # every 20th point is extreme

# MSE model (ordinary least squares, fits the conditional mean): sensitive to outliers
mse_model = LinearRegression().fit(X, y)
print(f"MSE model (least squares): coef = {mse_model.coef_[0]:.3f}")

# Huber model: squared loss for small residuals, absolute (MAE-like) loss
# for large ones, so extreme points carry less weight
huber_model = HuberRegressor(epsilon=1.35).fit(X, y)
print(f"Huber model (robust):      coef = {huber_model.coef_[0]:.3f}")
print(f"True coefficient: 3.0\n")

# RobustScaler vs StandardScaler
data = np.concatenate([np.random.normal(50, 10, 100), [500, -200]])  # with outliers

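robust = RobustScaler().fit(data.reshape(-1, 1))      # centers at the median, scales by the IQR
standard = StandardScaler().fit(data.reshape(-1, 1))  # centers at the mean, scales by the std

# Report the fitted centers and scales, not the mean of the transformed data
print(f"Data with outliers: mean={np.mean(data):.1f}, median={np.median(data):.1f}")
print(f"StandardScaler center (mean_): {standard.mean_[0]:.1f}, scale (std): {standard.scale_[0]:.1f}")
print(f"RobustScaler center (center_): {robust.center_[0]:.1f}, scale (IQR): {robust.scale_[0]:.1f}")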
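The imputation row of the table can be illustrated the same way. This sketch uses a hypothetical income column (values made up, in thousands) with one extreme entry and two gaps, and fills the gaps once with the mean and once with the median.

```python
import numpy as np
import pandas as pd

# Hypothetical income column (in thousands): one extreme value, two gaps.
income = pd.Series([32.0, 35.0, 38.0, 41.0, 44.0, np.nan, 47.0, np.nan, 900.0])

# Both statistics skip NaN by default in pandas.
mean_filled = income.fillna(income.mean())
median_filled = income.fillna(income.median())

print(f"Mean used for imputation:   {income.mean():.1f}")
print(f"Median used for imputation: {income.median():.1f}")
```

The mean-imputed rows receive a value far above every ordinary income in the column, while the median-imputed rows receive a value from the middle of the observed data.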
