Histograms

The Shape of Your Data Reveals Everything

A histogram groups numerical data into bins and shows how frequently values fall into each range. Unlike bar charts, histograms have no gaps between bars because the underlying data is continuous.

Distribution shape: symmetric, skewed, bimodal, or uniform
Center and spread: where the data clusters and how far it stretches
Outlier detection: unusual values that sit far from the main body
Bin width sensitivity: too few bins hide structure; too many create noise

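Plot a histogram before you rely on summary statistics. The shape tells you which ones are meaningful: the mean and standard deviation describe a symmetric, single-peaked distribution well, while skewed or bimodal data are usually better summarized with the median, quantiles, or per-group statistics.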

What is a Histogram?

Definition

A histogram is a chart that shows the distribution of numerical data by grouping values into bins (intervals) and drawing one bar per bin. Bar charts compare categories and leave gaps between bars; histogram bars touch because adjacent bins cover a continuous range of values.

Anatomy of a Histogram

X-axis: The range of the variable, divided into bins (usually of equal width)
Y-axis: Frequency (count), relative frequency, or density
Bar height: The count, relative frequency, or density of observations in that bin, matching the y-axis scale
Bar width: The bin width (class interval)
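
The three y-axis scales can be computed directly with np.histogram. The sketch below uses a made-up sample (the values and variable names are illustrative assumptions, not data from this lesson) and shows that relative frequencies sum to 1, while for a density histogram it is the bar areas (height times width) that sum to 1.

```python
import numpy as np

# Hypothetical sample: 1,000 draws from a normal distribution
rng = np.random.default_rng(0)
values = rng.normal(loc=50, scale=10, size=1000)

# Counts per bin, plus the 11 edges that bound 10 equal-width bins
counts, edges = np.histogram(values, bins=10)
widths = np.diff(edges)

# Relative frequency: share of observations in each bin
rel_freq = counts / counts.sum()

# Density: relative frequency divided by bin width, so bar AREAS sum to 1
density, _ = np.histogram(values, bins=10, density=True)

print("Total count:", counts.sum())
print("Relative frequencies sum to:", rel_freq.sum())
print("Total bar area (density * width):", (density * widths).sum())
```

Density is the scale to use when overlaying a KDE or comparing groups of different sizes, because it does not depend on the sample size.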

Building a Histogram in Python

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

np.random.seed(42)

# Generate example data: time to complete a task (minutes)
task_times = np.random.lognormal(mean=3.5, sigma=0.4, size=200)

fig, axes = plt.subplots(2, 3, figsize=(14, 8))

# 1. Basic histogram
axes[0,0].hist(task_times, bins=20, edgecolor='black', color='steelblue', alpha=0.7)
axes[0,0].set_title('Basic Histogram (bins=20)')
axes[0,0].set_xlabel('Time (minutes)')
axes[0,0].set_ylabel('Frequency')

# 2. Too few bins (oversmoothing)
axes[0,1].hist(task_times, bins=5, edgecolor='black', color='coral', alpha=0.7)
axes[0,1].set_title('Too Few Bins (bins=5)\n-> Hides structure')

# 3. Too many bins (undersmoothing)
axes[0,2].hist(task_times, bins=80, edgecolor='black', color='orchid', alpha=0.7)
axes[0,2].set_title('Too Many Bins (bins=80)\n-> Too noisy')

# 4. Density histogram with KDE
axes[1,0].hist(task_times, bins=20, density=True, edgecolor='black',
               color='steelblue', alpha=0.5, label='Histogram')
kde = stats.gaussian_kde(task_times)
x = np.linspace(task_times.min(), task_times.max(), 200)
axes[1,0].plot(x, kde(x), 'r-', linewidth=2, label='KDE')
axes[1,0].set_title('Density Histogram + KDE')
axes[1,0].legend()

# 5. Seaborn histplot
sns.histplot(task_times, bins=20, kde=True, ax=axes[1,1], color='teal')
axes[1,1].set_title('Seaborn histplot (built-in KDE)')

# 6. Compare two distributions on shared bin edges so the bars line up
data_a = np.random.normal(35, 8, 200)
data_b = np.random.normal(42, 6, 200)
shared_edges = np.histogram_bin_edges(np.concatenate([data_a, data_b]), bins=20)
axes[1,2].hist(data_a, bins=shared_edges, alpha=0.6, color='blue', label='Method A', density=True)
axes[1,2].hist(data_b, bins=shared_edges, alpha=0.6, color='orange', label='Method B', density=True)
axes[1,2].set_title('Comparing Two Groups')
axes[1,2].legend()

plt.tight_layout()
plt.savefig('histograms.png', dpi=150)
plt.show()

Common Distribution Shapes

Symmetric / Bell-Shaped (Normal)

Both tails are mirror images. Mean ≈ Median ≈ Mode.

Right-Skewed (Positive Skew)

Long right tail. Typically Mean > Median > Mode. Common in: income, wait times, insurance claim sizes.

Left-Skewed (Negative Skew)

Long left tail. Typically Mean < Median < Mode. Common in: age at death, exam scores on an easy test.

Bimodal

Two peaks. Often indicates two distinct subpopulations mixed together.

Uniform

Roughly equal frequency across all values. A uniform random number generator such as np.random.uniform produces this.

# Visualize all shapes
fig, axes = plt.subplots(1, 5, figsize=(18, 4))

np.random.seed(0)
shapes = {
    'Normal\n(Symmetric)': np.random.normal(50, 10, 1000),
    'Right-Skewed\n(Income-like)': np.random.lognormal(3, 0.8, 1000),
    'Left-Skewed\n(Exam scores)': 100 - np.random.exponential(10, 1000),
    'Bimodal\n(Two populations)': np.concatenate([np.random.normal(30,5,500),
                                                    np.random.normal(70,5,500)]),
    'Uniform': np.random.uniform(0, 100, 1000)
}

for ax, (title, data) in zip(axes, shapes.items()):
    ax.hist(data, bins=30, color='steelblue', edgecolor='black', alpha=0.7, density=True)
    mean_val = np.mean(data)
    median_val = np.median(data)
    ax.axvline(mean_val, color='red', linewidth=2, linestyle='--', label=f'Mean={mean_val:.0f}')
    ax.axvline(median_val, color='green', linewidth=2, linestyle='-', label=f'Median={median_val:.0f}')
    ax.set_title(title)
    ax.legend(fontsize=7)

plt.tight_layout()
plt.savefig('distribution_shapes.png', dpi=150)
plt.show()
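
The mean/median orderings described above can also be checked numerically. The following is a minimal sketch on simulated samples (the sample names and sizes are assumptions chosen for illustration); it uses scipy.stats.skew, where a positive value indicates a longer right tail and a negative value a longer left tail.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated samples standing in for the shapes discussed above
samples = {
    "symmetric": rng.normal(50, 10, 5000),
    "right-skewed": rng.lognormal(3, 0.8, 5000),
    "left-skewed": 100 - rng.exponential(10, 5000),
}

summary = {}
for name, data in samples.items():
    summary[name] = {
        "mean": float(np.mean(data)),
        "median": float(np.median(data)),
        "skew": float(stats.skew(data)),  # sample skewness
    }
    s = summary[name]
    print(f"{name:>12}: mean={s['mean']:.1f}  median={s['median']:.1f}  skew={s['skew']:.2f}")
```

The ordering of mean and median is a rule of thumb rather than a guarantee, so it is worth confirming against the histogram itself.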

Choosing the Right Number of Bins

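Rule	Formula	Best For
Sturges	k = ⌈1 + log₂(n)⌉	Normal-ish, small n
Scott	h = 3.49σ/n^(1/3)	Normal data
Freedman-Diaconis	h = 2·IQR/n^(1/3)	Skewed or outlier-prone

Here k is the number of bins, h is the bin width, n is the sample size, σ is the standard deviation, and IQR is the interquartile range. To turn a width h into a bin count, divide the data range by h and round up.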

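def optimal_bins(data):
    """Print and return bin counts from the Sturges, Scott, and Freedman-Diaconis rules."""
    data = np.asarray(data, dtype=float)
    n = len(data)
    iqr = np.percentile(data, 75) - np.percentile(data, 25)
    data_range = data.max() - data.min()

    # Sturges gives a bin count directly
    sturges = int(np.ceil(1 + np.log2(n)))

    # Scott and Freedman-Diaconis give a bin width; divide the range by it.
    # Fall back to Sturges if a width is zero (e.g., constant data).
    scott_width = 3.49 * np.std(data, ddof=1) / n**(1/3)
    scott_bins = int(np.ceil(data_range / scott_width)) if scott_width > 0 else sturges
    fd_width = 2 * iqr / n**(1/3)
    fd_bins = int(np.ceil(data_range / fd_width)) if fd_width > 0 else sturges

    print(f"Sturges: {sturges} bins")
    print(f"Scott: {scott_bins} bins (width = {scott_width:.2f})")
    print(f"Freedman-Diaconis: {fd_bins} bins (width = {fd_width:.2f})")
    return sturges, scott_bins, fd_bins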

print("Task times data:")
optimal_bins(task_times)
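
NumPy also ships these rules, so a hand-written helper is optional. The sketch below calls np.histogram_bin_edges with the rule names 'sturges', 'scott', and 'fd' on a simulated sample (the sample is an assumption that mirrors the task-time data above, drawn with a different generator, so the counts will not match exactly).

```python
import numpy as np

# Simulated lognormal sample, similar in spirit to the task-time data
rng = np.random.default_rng(42)
sample = rng.lognormal(mean=3.5, sigma=0.4, size=200)

bin_counts = {}
for rule in ("sturges", "scott", "fd"):
    # Returns the bin edges; the number of bins is one less than the number of edges
    edges = np.histogram_bin_edges(sample, bins=rule)
    bin_counts[rule] = len(edges) - 1
    print(f"{rule}: {bin_counts[rule]} bins, width = {edges[1] - edges[0]:.2f}")
```

The same rule names can be passed straight to plotting calls, for example plt.hist(sample, bins='fd'), since matplotlib forwards string values of bins to NumPy.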

Histograms in Machine Learning

In ML, histograms are everywhere:

ML Application	What to Histogram	What to Look For
Feature engineering	Each input feature	Skewness → log transform
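Model evaluation	Residuals (y - ŷ)	Roughly normal and centered at 0 → standard confidence intervals are reasonable
Data drift detection	Feature distributions over time	Shifts between training and production data
Loss analysis	Per-example loss values	Long right tail → hard or mislabeled examples
Probability calibration	Predicted probabilities	How confident predictions are; calibration itself needs a reliability diagram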

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

np.random.seed(42)

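# Generate a right-skewed feature; the target depends on its logarithm,
# so the log transform also makes the relationship linear
n = 500
X = np.random.lognormal(3, 1, (n, 1))
y = 50 + 5 * np.log(X[:,0]) + np.random.normal(0, 5, n)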

# Before training: check feature distribution
fig, axes = plt.subplots(1, 3, figsize=(14, 4))

# Raw feature: right-skewed
axes[0].hist(X[:,0], bins=30, color='steelblue', edgecolor='black', alpha=0.7)
axes[0].set_title('Raw Feature (Right-Skewed)\n→ Long right tail')
axes[0].set_xlabel('Feature Value')

# Log transform: symmetric, because the log of a lognormal variable is normal
X_log = np.log(X)
axes[1].hist(X_log[:,0], bins=30, color='green', edgecolor='black', alpha=0.7)
axes[1].set_title('Log Transformed (Symmetric)\n→ Tail compressed')
axes[1].set_xlabel('Log(Feature)')

# Residuals after training (fixed random_state for a reproducible split)
X_train, X_test, y_train, y_test = train_test_split(X_log, y, test_size=0.2, random_state=42)
model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)
residuals = y_test - y_pred

axes[2].hist(residuals, bins=25, color='coral', edgecolor='black', alpha=0.7)
axes[2].axvline(0, color='red', linewidth=2, linestyle='--')
axes[2].set_title('Residuals (Normal-ish)\n→ Valid confidence intervals')
axes[2].set_xlabel('Residual')

plt.tight_layout()
plt.savefig('ml_histograms.png', dpi=150)
plt.show()

print(f"MSE: {mean_squared_error(y_test, y_pred):.2f}")
print(f"Residual mean: {residuals.mean():.4f} (should be ~0)")
print(f"Residual skew: {float(np.mean(((residuals - residuals.mean())/residuals.std())**3)):.3f}")
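
For the data drift use case, one simple approach is to bin the training and production samples on the same edges and measure how far apart the two histograms are. The sketch below is an illustration under stated assumptions: the data is simulated, histogram_distance is a hypothetical helper name, and total variation distance is just one reasonable choice of metric, not a standard prescribed by this lesson.

```python
import numpy as np

rng = np.random.default_rng(0)
train_feature = rng.normal(loc=0.0, scale=1.0, size=5000)
live_feature = rng.normal(loc=0.5, scale=1.0, size=5000)  # shifted mean simulates drift
same_feature = rng.normal(loc=0.0, scale=1.0, size=5000)  # no drift, for comparison

def histogram_distance(reference, current, bins=20):
    """Total variation distance between two samples binned on shared edges."""
    # Shared edges make the two histograms directly comparable bin by bin
    edges = np.histogram_bin_edges(np.concatenate([reference, current]), bins=bins)
    ref_counts, _ = np.histogram(reference, bins=edges)
    cur_counts, _ = np.histogram(current, bins=edges)
    ref_share = ref_counts / ref_counts.sum()
    cur_share = cur_counts / cur_counts.sum()
    # 0 means identical histograms, 1 means no overlap at all
    return 0.5 * np.abs(ref_share - cur_share).sum()

drift_score = histogram_distance(train_feature, live_feature)
baseline_score = histogram_distance(train_feature, same_feature)
print(f"Shifted sample distance:   {drift_score:.3f}")
print(f"Unshifted sample distance: {baseline_score:.3f}")
```

The unshifted comparison gives a sense of how much distance sampling noise alone produces, which helps when choosing an alert threshold.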
