Histograms
A histogram is a bar-style chart that shows the distribution of numerical data by grouping values into bins (intervals) and counting how many fall in each. Unlike bar charts, which compare categories, histogram bars touch: adjacent bins share an edge because the underlying scale is continuous.
Anatomy of a Histogram
Frequency
   │
16 │              ████
14 │            ████████
12 │         ██████████████
10 │      ████████████████████
 8 │    ████████████████████████
 6 │ ██████████████████████████████
   └───────────────────────────────── Score
    50    60    70    80    90    100
- X-axis: The range of the variable, divided into bins (usually of equal width)
- Y-axis: Frequency (count), relative frequency, or density
- Bar height: Count of observations in that bin (on a density axis, the bar's area gives the bin's share of the data)
- Bar width: The bin width (class interval)
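The three y-axis options differ only by scaling, which a small calculation makes concrete. The sketch below uses `np.histogram` on a made-up `scores` array (the values and bin edges are illustrative assumptions, not data from this article).

```python
import numpy as np

# Hypothetical exam scores, used only for illustration
scores = np.array([52, 58, 61, 64, 67, 70, 71, 73, 75, 76,
                   78, 79, 81, 83, 85, 88, 90, 93, 96, 99])

# Five equal-width bins covering 50 to 100
edges = np.arange(50, 101, 10)           # [50, 60, 70, 80, 90, 100]
counts, _ = np.histogram(scores, bins=edges)

widths = np.diff(edges)                  # bin width (10 for every bin here)
rel_freq = counts / counts.sum()         # proportions; these sum to 1
density = rel_freq / widths              # bar areas (height * width) sum to 1

# NumPy bins are half-open [lo, hi), except the last, which includes its right edge
for lo, hi, c, r, d in zip(edges[:-1], edges[1:], counts, rel_freq, density):
    print(f"{lo:>3}-{hi:<3}  count={c:2d}  rel. freq={r:.2f}  density={d:.3f}")
```

With equal-width bins all three versions have the same shape; only the y-axis scale changes.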
Building a Histogram in Python
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
np.random.seed(42)
# Generate example data: time to complete a task (minutes)
task_times = np.random.lognormal(mean=3.5, sigma=0.4, size=200)
fig, axes = plt.subplots(2, 3, figsize=(14, 8))
# 1. Basic histogram
axes[0,0].hist(task_times, bins=20, edgecolor='black', color='steelblue', alpha=0.7)
axes[0,0].set_title('Basic Histogram (bins=20)')
axes[0,0].set_xlabel('Time (minutes)')
axes[0,0].set_ylabel('Frequency')
# 2. Too few bins (oversmoothing)
axes[0,1].hist(task_times, bins=5, edgecolor='black', color='coral', alpha=0.7)
axes[0,1].set_title('Too Few Bins (bins=5)\n→ Hides structure')
# 3. Too many bins (undersmoothing)
axes[0,2].hist(task_times, bins=80, edgecolor='black', color='orchid', alpha=0.7)
axes[0,2].set_title('Too Many Bins (bins=80)\n→ Too noisy')
# 4. Density histogram with KDE
axes[1,0].hist(task_times, bins=20, density=True, edgecolor='black',
               color='steelblue', alpha=0.5, label='Histogram')
kde = stats.gaussian_kde(task_times)
x = np.linspace(task_times.min(), task_times.max(), 200)
axes[1,0].plot(x, kde(x), 'r-', linewidth=2, label='KDE')
axes[1,0].set_title('Density Histogram + KDE')
axes[1,0].legend()
# 5. Seaborn histplot
sns.histplot(task_times, bins=20, kde=True, ax=axes[1,1], color='teal')
axes[1,1].set_title('Seaborn histplot (built-in KDE)')
# 6. Compare two distributions
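data_a = np.random.normal(35, 8, 200)
data_b = np.random.normal(42, 6, 200)
# Shared bin edges so the two histograms are directly comparable
shared_bins = np.histogram_bin_edges(np.concatenate([data_a, data_b]), bins=20)
axes[1,2].hist(data_a, bins=shared_bins, alpha=0.6, color='blue', label='Method A', density=True)
axes[1,2].hist(data_b, bins=shared_bins, alpha=0.6, color='orange', label='Method B', density=True)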
axes[1,2].set_title('Comparing Two Groups')
axes[1,2].legend()
plt.tight_layout()
plt.savefig('histograms.png', dpi=150)
plt.show()
Common Distribution Shapes
Symmetric / Bell-Shaped (Normal)
    ████
  ████████
 ██████████
████████████
Both tails are mirror images. Mean ≈ Median ≈ Mode.
Right-Skewed (Positive Skew)
████
██████
████████████████
Long right tail. Typically Mean > Median > Mode. Common in: income, wait times, house prices.
Left-Skewed (Negative Skew)
            ████
        ████████
████████████████
Long left tail. Typically Mean < Median < Mode. Common in: age at death, exam scores on an easy test.
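The direction of skew can also be checked numerically. This minimal sketch compares mean and median and computes sample skewness with `scipy.stats.skew`; the two samples, the seed, and the `describe` helper are illustrative assumptions rather than part of the article's data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # seed chosen arbitrarily for reproducibility

# Synthetic samples: log-normal has a long right tail, the flipped exponential a long left tail
right_skewed = rng.lognormal(mean=3.0, sigma=0.8, size=5000)
left_skewed = 100 - rng.exponential(scale=10, size=5000)

def describe(name, data):
    """Print and return the mean, median, and sample skewness of one sample."""
    mean, median, skew = np.mean(data), np.median(data), stats.skew(data)
    print(f"{name:13s} mean={mean:6.1f}  median={median:6.1f}  skew={skew:+.2f}")
    return mean, median, skew

right_stats = describe("right-skewed", right_skewed)
left_stats = describe("left-skewed", left_skewed)
```

A positive skewness with the mean above the median points to a right tail; a negative skewness with the mean below the median points to a left tail.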
Bimodal
  ████        ████
████████    ████████
Two peaks. Often indicates two distinct subpopulations mixed together.
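One rough way to check for two peaks without relying on the eye is to smooth the data with a KDE and count prominent local maxima. The sketch below is a heuristic under stated assumptions (synthetic data, default KDE bandwidth, an arbitrary 10% prominence threshold), not a formal test of bimodality.

```python
import numpy as np
from scipy import stats
from scipy.signal import find_peaks

rng = np.random.default_rng(1)  # arbitrary seed

# Synthetic mixture of two subpopulations centered at 30 and 70
mixed = np.concatenate([rng.normal(30, 5, 500), rng.normal(70, 5, 500)])

# Smooth density estimate evaluated on a regular grid
kde = stats.gaussian_kde(mixed)
grid = np.linspace(mixed.min(), mixed.max(), 500)
density = kde(grid)

# Keep only peaks that stand out by at least 10% of the tallest density value
# (the threshold is an assumption; tune it for your data)
peaks, _ = find_peaks(density, prominence=0.1 * density.max())
modes = grid[peaks]
print(f"Detected {len(modes)} mode(s) near: {np.round(modes, 1)}")
```

The result depends on the KDE bandwidth: a wide bandwidth can merge two real peaks, and a narrow one can invent peaks from noise.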
Uniform
████████████████
████████████████
Roughly equal frequency across all values. A uniform random number generator such as np.random.uniform produces this.
# Visualize all shapes
fig, axes = plt.subplots(1, 5, figsize=(18, 4))
np.random.seed(0)
shapes = {
    'Normal\n(Symmetric)': np.random.normal(50, 10, 1000),
    'Right-Skewed\n(Income-like)': np.random.lognormal(3, 0.8, 1000),
    'Left-Skewed\n(Exam scores)': 100 - np.random.exponential(10, 1000),
    'Bimodal\n(Two populations)': np.concatenate([np.random.normal(30, 5, 500),
                                                  np.random.normal(70, 5, 500)]),
    'Uniform': np.random.uniform(0, 100, 1000)
}
# One panel per shape, with mean and median marked for comparison
for ax, (title, data) in zip(axes, shapes.items()):
    ax.hist(data, bins=30, color='steelblue', edgecolor='black', alpha=0.7, density=True)
    mean_val = np.mean(data)
    median_val = np.median(data)
    ax.axvline(mean_val, color='red', linewidth=2, linestyle='--', label=f'Mean={mean_val:.0f}')
    ax.axvline(median_val, color='green', linewidth=2, linestyle='-', label=f'Median={median_val:.0f}')
    ax.set_title(title)
    ax.legend(fontsize=7)
plt.tight_layout()
plt.savefig('distribution_shapes.png', dpi=150)
plt.show()
Choosing the Right Number of Bins
| Rule | Formula (k = number of bins, h = bin width) | Best For |
|---|---|---|
| Sturges | k = ⌈1 + log₂(n)⌉ | Normal-ish, small n |
| Scott | h = 3.49·σ/n^(1/3) | Normal data |
| Freedman-Diaconis | h = 2·IQR/n^(1/3) | Skewed or outlier-prone |
def optimal_bins(data):
    """Print and return bin counts from the Sturges, Scott, and Freedman-Diaconis rules."""
    data = np.asarray(data)
    n = len(data)
    iqr = np.percentile(data, 75) - np.percentile(data, 25)
    data_range = data.max() - data.min()
    # Sturges gives a bin count directly
    sturges = int(np.ceil(1 + np.log2(n)))
    # Scott and Freedman-Diaconis give a bin width; divide the range by it
    scott_width = 3.49 * np.std(data) / n**(1/3)
    scott_bins = int(np.ceil(data_range / scott_width))
    fd_width = 2 * iqr / n**(1/3)
    # Fall back to Sturges if the IQR is zero
    fd_bins = int(np.ceil(data_range / fd_width)) if fd_width > 0 else sturges
    print(f"Sturges: {sturges} bins")
    print(f"Scott: {scott_bins} bins (width = {scott_width:.2f})")
    print(f"Freedman-Diaconis: {fd_bins} bins (width = {fd_width:.2f})")
    return sturges, scott_bins, fd_bins

print("Task times data:")
optimal_bins(task_times)
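NumPy ships its own versions of these rules, reachable by passing a rule name as `bins` to `np.histogram_bin_edges`. The sketch below applies them to a stand-in log-normal sample (the `sample` array and seed are assumptions for illustration); NumPy's constants and rounding may differ slightly from the hand-written formulas above, so the bin counts need not match exactly.

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed
# Stand-in for task-time data: 200 log-normal values
sample = rng.lognormal(mean=3.5, sigma=0.4, size=200)

bin_counts = {}
for rule in ("sturges", "scott", "fd", "auto"):
    # Passing a rule name makes NumPy choose the bin edges itself
    edges = np.histogram_bin_edges(sample, bins=rule)
    bin_counts[rule] = len(edges) - 1
    print(f"{rule:8s} {bin_counts[rule]:3d} bins (width = {edges[1] - edges[0]:.2f})")
```

The same rule names can be passed as `bins` to `plt.hist`, which hands them to NumPy. The `'auto'` option combines the Sturges and Freedman-Diaconis estimates; the exact rule is described in the NumPy documentation for your version.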
Key Takeaways
- Histograms reveal the shape, center, spread, and gaps in data — always plot one first
- Bin width is a critical choice — too wide hides structure; too narrow creates noise
- Shape tells you which statistics to use: symmetric → mean; skewed → median
- Bimodal distributions often signal mixed populations that should be analyzed separately
- Use density (not count) on y-axis when comparing groups of different sizes
- Add KDE (kernel density estimate) to smooth the histogram for a cleaner shape estimate
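The point about density versus count is easiest to see with two groups drawn from the same distribution but with different sample sizes. In this minimal sketch, the group sizes, seed, and bin edges are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)  # arbitrary seed

# Same underlying distribution, very different sample sizes
large_group = rng.normal(50, 10, 2000)
small_group = rng.normal(50, 10, 200)

edges = np.linspace(10, 90, 17)  # 16 shared bins of width 5

counts_large, _ = np.histogram(large_group, bins=edges)
counts_small, _ = np.histogram(small_group, bins=edges)
dens_large, _ = np.histogram(large_group, bins=edges, density=True)
dens_small, _ = np.histogram(small_group, bins=edges, density=True)

# Counts scale with sample size; densities do not
print("Tallest bar (count):  ", counts_large.max(), "vs", counts_small.max())
print("Tallest bar (density):", round(dens_large.max(), 4), "vs", round(dens_small.max(), 4))
```

On a count axis the smaller group looks far lower even though its shape is the same. On a density axis each histogram has total area 1, so the two can be overlaid fairly.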