Box Plots — Five-Number Summary, IQR, and Outlier Detection

Box Plots (Box-and-Whisker Plots)

A box plot compactly summarizes a distribution using the five-number summary and plots outliers as individual points. Because each group reduces to a single box, box plots are well suited to comparing distributions across groups side by side.


The Five-Number Summary

Statistic        Symbol   Description
Minimum          Min      Smallest value (the lower whisker ends at the smallest non-outlier)
First Quartile   Q1       25th percentile
Median           Q2       50th percentile
Third Quartile   Q3       75th percentile
Maximum          Max      Largest value (the upper whisker ends at the largest non-outlier)

Interquartile Range (IQR): Q3 − Q1
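As a small worked check of these definitions, the sketch below computes the five-number summary and the IQR for a made-up nine-value sample. The values are illustrative and not part of the lesson's dataset, and the quartiles rely on NumPy's default linear interpolation.

```python
import numpy as np

# Hypothetical toy sample, already sorted for readability
sample = np.array([1, 3, 5, 7, 9, 11, 13, 15, 100])

# Quartiles via NumPy's default (linear interpolation) percentile rule
q1, median, q3 = np.percentile(sample, [25, 50, 75])
iqr = q3 - q1

print(f"Min={sample.min()}, Q1={q1}, Median={median}, Q3={q3}, Max={sample.max()}")
print(f"IQR = Q3 - Q1 = {q3} - {q1} = {iqr}")
```

For this sample Q1 = 5, the median is 9, Q3 = 13, and the IQR is 8. The value 100 is the true maximum, yet it has no effect on the quartiles.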

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

np.random.seed(42)

# Dataset: salary data for three departments
data = {
    'Engineering': np.random.normal(95000, 15000, 50),
    'Marketing': np.random.normal(75000, 12000, 50),
    'Operations': np.random.normal(65000, 10000, 50)
}

# Compute five-number summaries
print("Five-Number Summaries:")
print(f"{'Dept':<15} {'Min':>8} {'Q1':>8} {'Median':>8} {'Q3':>8} {'Max':>8} {'IQR':>8}")
print("-" * 65)
for dept, salaries in data.items():
    q1, med, q3 = np.percentile(salaries, [25, 50, 75])
    iqr = q3 - q1
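    # Fences from the 1.5 * IQR rule
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    # "Min" and "Max" below are the whisker ends: the most extreme non-outlier values
    min_val = salaries[salaries >= lower].min()
    max_val = salaries[salaries <= upper].max()
    # Anything outside the fences counts as an outlier
    outliers = salaries[(salaries < lower) | (salaries > upper)]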
    print(f"{dept:<15} {min_val:>8,.0f} {q1:>8,.0f} {med:>8,.0f} {q3:>8,.0f} {max_val:>8,.0f} {iqr:>8,.0f}")
    if len(outliers) > 0:
        print(f"  → {len(outliers)} outlier(s): {[f'${o:,.0f}' for o in outliers]}")
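Quartiles are not uniquely defined for small samples: different interpolation conventions can give slightly different Q1 and Q3, and therefore different fences. The sketch below compares a few of the conventions NumPy offers on a made-up eight-value sample. It assumes NumPy 1.22 or newer, where `np.percentile` accepts the `method` keyword.

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5, 6, 7, 8])  # hypothetical toy sample

# Q1 and Q3 under several interpolation rules ("linear" is NumPy's default)
quartiles_by_method = {
    method: np.percentile(values, [25, 75], method=method)
    for method in ("linear", "lower", "higher", "midpoint")
}

for method, (q1, q3) in quartiles_by_method.items():
    q1, q3 = float(q1), float(q3)
    print(f"{method:<9} Q1={q1:<5} Q3={q3:<5} IQR={q3 - q1}")
```

With samples of 50 values, as in the salary data, the differences are usually small, but they can explain why two tools report slightly different quartiles for the same data.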

Outlier Detection: The 1.5×IQR Rule

A point is an outlier if it falls:

  • Below Q1 − 1.5 × IQR (lower fence)
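  • Above Q3 + 1.5 × IQR (upper fence)

def detect_outliers(values):
    """Return (outliers, lower_fence, upper_fence) under the 1.5 × IQR rule."""
    values = np.asarray(values)  # accept lists as well as NumPy arrays
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Keep only the points that fall outside either fence
    outliers = values[(values < lower_fence) | (values > upper_fence)]
    return outliers, lower_fence, upper_fence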

# Add some outliers to data
dept_data = np.concatenate([data['Engineering'], [200000, 25000]])  # two outliers
outliers, lower, upper = detect_outliers(dept_data)
print(f"Lower fence: ${lower:,.0f}")
print(f"Upper fence: ${upper:,.0f}")
print(f"Outliers found: {outliers}")
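When the data sit in a long-format table with several groups, the same rule can be applied per group, so that each value is compared against its own group's fences. The following is a minimal sketch of that idea with pandas; the helper name `flag_outliers_by_group`, its parameters, and the toy data are all hypothetical.

```python
import numpy as np
import pandas as pd

def flag_outliers_by_group(df, value_col, group_col, k=1.5):
    """Boolean Series: True where a value lies outside its own group's fences."""
    grouped = df.groupby(group_col)[value_col]
    # transform broadcasts each group's quartile back onto that group's rows
    q1 = grouped.transform(lambda s: s.quantile(0.25))
    q3 = grouped.transform(lambda s: s.quantile(0.75))
    iqr = q3 - q1
    return (df[value_col] < q1 - k * iqr) | (df[value_col] > q3 + k * iqr)

# Hypothetical toy data: group A contains one extreme value, group B none
toy = pd.DataFrame({
    "group": ["A"] * 9 + ["B"] * 5,
    "value": np.array([1, 3, 5, 7, 9, 11, 13, 15, 100,
                       10, 20, 30, 40, 50], dtype=float),
})
toy["is_outlier"] = flag_outliers_by_group(toy, "value", "group")
print(toy[toy["is_outlier"]])
```

Only the value 100 in group A is flagged here. Computing the fences per group matters because a value that is ordinary in one group can be extreme in another.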

Creating Box Plots in Python

fig, axes = plt.subplots(1, 3, figsize=(15, 6))

# Long-format DataFrame used by the seaborn plots below
df_long = pd.DataFrame({
    'Salary': np.concatenate(list(data.values())),
    'Department': np.repeat(list(data.keys()), 50)
})

# 1. Basic matplotlib box plot
axes[0].boxplot([data[d] for d in data],
                patch_artist=True,
                boxprops=dict(facecolor='lightblue', color='navy'),
                medianprops=dict(color='red', linewidth=2),
                flierprops=dict(marker='o', markerfacecolor='red', markersize=6))
# Label the groups on the axis; boxplot's `labels` keyword is deprecated in recent matplotlib
axes[0].set_xticklabels(list(data.keys()))
axes[0].set_title('Department Salaries\n(Box Plot)')
axes[0].set_ylabel('Salary ($)')
axes[0].yaxis.set_major_formatter(plt.FuncFormatter(lambda x, p: f'${x/1000:.0f}K'))

# 2. Seaborn box plot (easier, prettier)
sns.boxplot(data=df_long, x='Department', y='Salary', ax=axes[1],
            palette='Set2', width=0.5)
axes[1].set_title('Seaborn Box Plot')
axes[1].yaxis.set_major_formatter(plt.FuncFormatter(lambda x, p: f'${x/1000:.0f}K'))

# 3. Violin + Box (best of both worlds)
sns.violinplot(data=df_long, x='Department', y='Salary', ax=axes[2],
               palette='Set3', inner='box')
axes[2].set_title('Violin + Box Plot\n(shows full distribution)')
axes[2].yaxis.set_major_formatter(plt.FuncFormatter(lambda x, p: f'${x/1000:.0f}K'))

plt.tight_layout()
plt.savefig('box_plots.png', dpi=150)
plt.show()

Reading a Box Plot

     |←── whisker ──→|      |←── whisker ──→|
     |                |      |                |
     Min              Q1---Median---Q3        Max
                      |←── IQR ──→|

     ○                                        ○
 (outlier)                                (outlier)

  • Box: spans the IQR (middle 50% of the data)
  • Center line: median
  • Whiskers: extend to the most extreme data point that lies within 1.5×IQR of the box edge
  • Points beyond the whiskers: outliers (plotted individually)
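To see the numbers behind each element of the plot, one option is matplotlib's `cbook.boxplot_stats` helper, which computes the statistics a box plot draws without rendering anything. The sketch below applies it to a made-up sample; the sample values are an assumption for illustration.

```python
import numpy as np
from matplotlib import cbook

sample = np.array([1, 3, 5, 7, 9, 11, 13, 15, 100])  # hypothetical toy sample

# boxplot_stats returns one dict per dataset; whis=1.5 is the fence multiplier
stats = cbook.boxplot_stats(sample, whis=1.5)[0]

print(f"Box:      Q1={stats['q1']}, median={stats['med']}, Q3={stats['q3']}")
print(f"Whiskers: {stats['whislo']} to {stats['whishi']}")
print(f"Fliers:   {stats['fliers']}")
```

The upper whisker stops at 15, the largest value inside the upper fence (13 + 1.5 × 8 = 25), and 100 is reported separately as a flier.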

Key Takeaways

  1. Box plots are ideal for comparing multiple groups at a glance
  2. The IQR is robust: it depends only on the middle 50% of the data, so a few extreme values barely change it
  3. The 1.5×IQR rule flags potential outliers; it is a convention, not proof that a point is wrong
  4. Violin plots add distribution shape to box plots; use them when n is large enough to estimate that shape
  5. Symmetric distributions have the median centered in the box; skewed data has it off-center
  6. Always investigate flagged points: they might be data errors or genuinely important observations
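The robustness of the IQR can be checked with a small experiment: replace one value in a sample with an extreme one and compare how the IQR and the standard deviation respond. This is a sketch on made-up data; the helper `iqr` and the sample are assumptions for illustration.

```python
import numpy as np

def iqr(values):
    """Interquartile range using NumPy's default percentile rule."""
    q1, q3 = np.percentile(values, [25, 75])
    return q3 - q1

clean = np.arange(1, 22, dtype=float)   # hypothetical sample: 1, 2, ..., 21
contaminated = clean.copy()
contaminated[-1] = 1000.0               # replace the largest value with an extreme one

print(f"IQR: {iqr(clean):.1f} -> {iqr(contaminated):.1f}")
print(f"Std: {clean.std(ddof=1):.1f} -> {contaminated.std(ddof=1):.1f}")
```

The IQR is identical for both samples because the quartiles never touch the tails, while the standard deviation grows many times over.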
