Histograms — Construction, Interpretation, and Common Shapes



Histograms

A histogram is a chart that shows the distribution of numerical data by grouping values into bins (intervals) and drawing one bar per bin. Unlike a bar chart, which compares categories and leaves gaps between bars, a histogram draws adjacent bars with no gaps, because the bins partition a continuous numerical scale.
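
To make the binning step concrete, here is a minimal sketch that counts values per bin with np.histogram and prints a text histogram. The scores array and the bin edges are invented for illustration and are not data from this lesson.

```python
import numpy as np

# Hypothetical exam scores, invented for illustration
scores = np.array([52, 58, 61, 67, 68, 70, 73, 75, 78, 81, 84, 88, 93])

# Bin edges define the intervals [50, 60), [60, 70), ..., [90, 100]
# (NumPy treats the last bin as closed on the right)
edges = np.arange(50, 101, 10)
counts, edges = np.histogram(scores, bins=edges)

# One row per bin: interval, a bar of '#' characters, and the count
for left, right, count in zip(edges[:-1].tolist(), edges[1:].tolist(), counts.tolist()):
    print(f"{left}-{right}: {'#' * count} ({count})")
```

Every observation falls into exactly one bin, so the counts add up to the sample size.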


Anatomy of a Histogram

Frequency
    │
 16 │         ████
 14 │       ████████
 12 │     ██████████████
 10 │   ████████████████████
  8 │ ████████████████████████
  6 │ ██████████████████████████████
    └───────────────────────────────── Score
        50   60   70   80   90   100
  • X-axis: The range of the variable, divided into equal-width bins
  • Y-axis: Frequency (count), relative frequency, or density
  • Bar height: The count of observations in that bin (or its relative frequency or density, matching the y-axis)
  • Bar width: The bin width (class interval)
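
The three y-axis options are rescalings of the same counts. The sketch below, using synthetic data generated only for this illustration, computes all three and checks the manual density formula against NumPy's density=True option.

```python
import numpy as np

rng = np.random.default_rng(0)
values = rng.normal(loc=70, scale=10, size=500)  # synthetic data for illustration

counts, edges = np.histogram(values, bins=10)
widths = np.diff(edges)

# Relative frequency: bar heights sum to 1
relative_freq = counts / counts.sum()

# Density: bar areas (height x width) sum to 1
density = counts / (counts.sum() * widths)

# NumPy's density=True should agree with the manual formula
density_np, _ = np.histogram(values, bins=10, density=True)

print("Heights sum to:", relative_freq.sum())
print("Areas sum to:  ", (density * widths).sum())
print("Matches NumPy: ", np.allclose(density, density_np))
```

Density is the scale to use when overlaying a KDE or a theoretical curve, since those also integrate to 1.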

Building a Histogram in Python

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

np.random.seed(42)

# Generate example data: time to complete a task (minutes)
task_times = np.random.lognormal(mean=3.5, sigma=0.4, size=200)

fig, axes = plt.subplots(2, 3, figsize=(14, 8))

# 1. Basic histogram
axes[0,0].hist(task_times, bins=20, edgecolor='black', color='steelblue', alpha=0.7)
axes[0,0].set_title('Basic Histogram (bins=20)')
axes[0,0].set_xlabel('Time (minutes)')
axes[0,0].set_ylabel('Frequency')

# 2. Too few bins (underfitting)
axes[0,1].hist(task_times, bins=5, edgecolor='black', color='coral', alpha=0.7)
axes[0,1].set_title('Too Few Bins (bins=5)\n→ Hides structure')

# 3. Too many bins (overfitting)
axes[0,2].hist(task_times, bins=80, edgecolor='black', color='orchid', alpha=0.7)
axes[0,2].set_title('Too Many Bins (bins=80)\n→ Too noisy')

# 4. Density histogram with KDE
axes[1,0].hist(task_times, bins=20, density=True, edgecolor='black',
               color='steelblue', alpha=0.5, label='Histogram')
kde = stats.gaussian_kde(task_times)
x = np.linspace(task_times.min(), task_times.max(), 200)
axes[1,0].plot(x, kde(x), 'r-', linewidth=2, label='KDE')
axes[1,0].set_title('Density Histogram + KDE')
axes[1,0].legend()

# 5. Seaborn histplot
sns.histplot(task_times, bins=20, kde=True, ax=axes[1,1], color='teal')
axes[1,1].set_title('Seaborn histplot (built-in KDE)')

# 6. Compare two distributions (shared bin edges keep the bars comparable)
data_a = np.random.normal(35, 8, 200)
data_b = np.random.normal(42, 6, 200)
shared_bins = np.histogram_bin_edges(np.concatenate([data_a, data_b]), bins=20)
axes[1,2].hist(data_a, bins=shared_bins, alpha=0.6, color='blue', label='Method A', density=True)
axes[1,2].hist(data_b, bins=shared_bins, alpha=0.6, color='orange', label='Method B', density=True)
axes[1,2].set_title('Comparing Two Groups')
axes[1,2].legend()

plt.tight_layout()
plt.savefig('histograms.png', dpi=150)
plt.show()

Common Distribution Shapes

Symmetric / Bell-Shaped (Normal)

    ████
  ████████
 ██████████
████████████

Both tails are mirror images. Mean ≈ Median ≈ Mode.

Right-Skewed (Positive Skew)

████
██████
████████████████

Long right tail. Typically Mean > Median > Mode. Common in: income, wait times, home prices.

Left-Skewed (Negative Skew)

             ████
       ████████
████████████████

Long left tail. Typically Mean < Median < Mode. Common in: age at death, exam scores on an easy test.
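
The direction of skew can also be checked numerically. This sketch draws one right-skewed and one left-skewed sample (the distribution parameters are arbitrary choices for illustration), then compares mean, median, and the sample skewness from scipy.stats.skew.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Illustrative samples; the parameters are arbitrary
right_skewed = rng.lognormal(mean=3, sigma=0.8, size=5000)
left_skewed = 100 - rng.exponential(scale=10, size=5000)

for name, data in [("right-skewed", right_skewed), ("left-skewed", left_skewed)]:
    print(f"{name}: mean={np.mean(data):.1f}, "
          f"median={np.median(data):.1f}, "
          f"skewness={stats.skew(data):.2f}")
```

Positive skewness goes with a mean pulled above the median, and negative skewness with a mean pulled below it.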

Bimodal

████       ████
████████ ████████

Two peaks. Often indicates two distinct subpopulations mixed together.
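
One rough way to check for two peaks is to smooth the data with a KDE and count the prominent local maxima. The sketch below is an informal heuristic rather than a formal test for bimodality; the mixture parameters and the 10% prominence threshold are assumptions chosen for illustration.

```python
import numpy as np
from scipy import stats
from scipy.signal import find_peaks

rng = np.random.default_rng(2)

# Two subpopulations mixed together (illustrative parameters)
mixed = np.concatenate([rng.normal(30, 5, 500), rng.normal(70, 5, 500)])

# Smooth density estimate evaluated on a regular grid
kde = stats.gaussian_kde(mixed)
grid = np.linspace(mixed.min(), mixed.max(), 500)
density = kde(grid)

# Keep only peaks that stand out by at least 10% of the tallest peak
# (an arbitrary threshold, chosen to ignore small bumps)
peaks, _ = find_peaks(density, prominence=0.1 * density.max())

print("Number of prominent peaks:", len(peaks))
print("Peak locations:", np.round(grid[peaks], 1))
```

The result depends on the KDE bandwidth and the prominence threshold, so treat it as a prompt to look closer rather than as proof of two subpopulations.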

Uniform

████████████████
████████████████

Roughly equal frequency across the whole range. Examples: rolls of a fair die, or draws from a uniform random number generator such as np.random.uniform.

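# Visualize all shapes (imports repeated so this block runs on its own)
import numpy as np
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 5, figsize=(18, 4))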

np.random.seed(0)
shapes = {
    'Normal\n(Symmetric)': np.random.normal(50, 10, 1000),
    'Right-Skewed\n(Income-like)': np.random.lognormal(3, 0.8, 1000),
    'Left-Skewed\n(Exam scores)': 100 - np.random.exponential(10, 1000),
    'Bimodal\n(Two populations)': np.concatenate([np.random.normal(30,5,500),
                                                    np.random.normal(70,5,500)]),
    'Uniform': np.random.uniform(0, 100, 1000)
}

for ax, (title, data) in zip(axes, shapes.items()):
    ax.hist(data, bins=30, color='steelblue', edgecolor='black', alpha=0.7, density=True)
    mean_val = np.mean(data)
    median_val = np.median(data)
    ax.axvline(mean_val, color='red', linewidth=2, linestyle='--', label=f'Mean={mean_val:.0f}')
    ax.axvline(median_val, color='green', linewidth=2, linestyle='-', label=f'Median={median_val:.0f}')
    ax.set_title(title)
    ax.legend(fontsize=7)

plt.tight_layout()
plt.savefig('distribution_shapes.png', dpi=150)
plt.show()

Choosing the Right Number of Bins

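Rule                Formula                 Best For
Sturges             k = 1 + log₂(n)         Normal-ish, small n
Scott               h = 3.49σ / n^(1/3)     Normal data
Freedman-Diaconis   h = 2·IQR / n^(1/3)     Skewed or outlier-prone

Here k is the number of bins, h is the bin width, n is the sample size, σ is the standard deviation, and IQR is the interquartile range. To turn a width h into a bin count, divide the data range by h and round up.
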
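def optimal_bins(data):
    """Print and return the bin counts suggested by three common rules."""
    data = np.asarray(data, dtype=float)  # also accepts plain lists
    n = len(data)
    q75, q25 = np.percentile(data, [75, 25])
    iqr = q75 - q25
    data_range = data.max() - data.min()

    # Sturges gives a bin count directly
    sturges = int(np.ceil(1 + np.log2(n)))

    # Scott and Freedman-Diaconis give a bin width; convert it to a count.
    # Fall back to Sturges if a width is zero (e.g. constant data).
    scott_width = 3.49 * np.std(data) / n**(1/3)
    scott_bins = int(np.ceil(data_range / scott_width)) if scott_width > 0 else sturges
    fd_width = 2 * iqr / n**(1/3)
    fd_bins = int(np.ceil(data_range / fd_width)) if fd_width > 0 else sturges

    print(f"Sturges: {sturges} bins")
    print(f"Scott: {scott_bins} bins (width = {scott_width:.2f})")
    print(f"Freedman-Diaconis: {fd_bins} bins (width = {fd_width:.2f})")
    return sturges, scott_bins, fd_bins

print("Task times data:")
optimal_bins(task_times)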
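
NumPy also ships these rules as named estimators, so a hand-written helper is not strictly needed. The sketch below passes rule names to np.histogram_bin_edges; the sample is a stand-in generated to resemble the task-time data, not the exact array used earlier.

```python
import numpy as np

rng = np.random.default_rng(42)
# Stand-in for task_times: same distribution family, different random draws
sample = rng.lognormal(mean=3.5, sigma=0.4, size=200)

# 'auto' lets NumPy choose between the Sturges and Freedman-Diaconis estimates
for rule in ("sturges", "scott", "fd", "auto"):
    edges = np.histogram_bin_edges(sample, bins=rule)
    n_bins = len(edges) - 1
    width = edges[1] - edges[0]
    print(f"{rule:>8}: {n_bins} bins, width = {width:.2f}")
```

The same rule names can be passed as the bins argument to plt.hist. Bin counts may differ slightly from a hand-rolled calculation because of rounding and implementation details.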

Key Takeaways

  1. Histograms reveal the shape, center, spread, and gaps in data — always plot one first
  2. Bin width is a critical choice — too wide hides structure; too narrow creates noise
  3. Shape tells you which statistics to use: symmetric → mean; skewed → median
  4. Bimodal distributions often signal mixed populations that should be analyzed separately
  5. Use density (not count) on the y-axis when comparing groups of different sizes
  6. Add KDE (kernel density estimate) to smooth the histogram for a cleaner shape estimate
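
Takeaway 5 can be illustrated with two samples drawn from the same distribution but with very different sizes. In this sketch (sample sizes and parameters are assumptions for illustration), the raw counts differ by a large factor while the densities are nearly the same.

```python
import numpy as np

rng = np.random.default_rng(3)
small = rng.normal(50, 10, 100)    # 100 observations
large = rng.normal(50, 10, 5000)   # same distribution, 50x more data

# Shared bin edges so the two histograms are directly comparable
edges = np.linspace(0, 100, 11)

counts_small, _ = np.histogram(small, bins=edges)
counts_large, _ = np.histogram(large, bins=edges)
dens_small, _ = np.histogram(small, bins=edges, density=True)
dens_large, _ = np.histogram(large, bins=edges, density=True)

print(f"Tallest bar (count):   {counts_small.max()} vs {counts_large.max()}")
print(f"Tallest bar (density): {dens_small.max():.3f} vs {dens_large.max():.3f}")
```

On a count axis the smaller group would be almost invisible next to the larger one; on a density axis the two shapes overlap.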
