Standard Deviation
The standard deviation is the square root of the variance. Because it is expressed in the original units of the data, it is directly interpretable.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
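# Ten scores, treated as a sample from a larger population
data = np.array([85, 90, 78, 92, 88, 76, 95, 83, 87, 91])
mean = np.mean(data)
std = np.std(data, ddof=1)  # ddof=1 divides by n - 1 (sample standard deviation)
print(f"Mean: {mean:.2f}")
print(f"Std Dev: {std:.4f}")
print(f"Variance: {std**2:.4f}")
# The SD is a root-mean-square deviation, so it describes a typical distance, not the mean absolute one
print(f"Scores typically fall about {std:.1f} points from the mean")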
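To make the `ddof` argument concrete, the following sketch recomputes the standard deviation from its definition using only the standard library. The helper name `std_from_definition` is hypothetical and introduced here for illustration; `statistics.stdev` and `statistics.pstdev` serve as an independent check.

```python
import math
import statistics

def std_from_definition(values, ddof=1):
    """Square root of the summed squared deviations divided by (n - ddof).

    Hypothetical helper for illustration; ddof mirrors the NumPy argument.
    """
    n = len(values)
    mean = sum(values) / n
    squared_devs = [(v - mean) ** 2 for v in values]
    return math.sqrt(sum(squared_devs) / (n - ddof))

scores = [85, 90, 78, 92, 88, 76, 95, 83, 87, 91]

sample_sd = std_from_definition(scores, ddof=1)      # divides by n - 1
population_sd = std_from_definition(scores, ddof=0)  # divides by n

print(f"Sample SD (ddof=1):     {sample_sd:.4f}")
print(f"Population SD (ddof=0): {population_sd:.4f}")

# Independent check against the standard library
print(f"statistics.stdev:       {statistics.stdev(scores):.4f}")
print(f"statistics.pstdev:      {statistics.pstdev(scores):.4f}")
```

The sample value is always the larger of the two, and the gap shrinks as n grows because n and n - 1 become nearly equal.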
The Empirical Rule (68-95-99.7)
For approximately normal distributions, the following share of the data falls within each range around the mean:
| Range | % of Data |
|---|---|
| μ ± 1σ | 68.27% |
| μ ± 2σ | 95.45% |
| μ ± 3σ | 99.73% |
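These percentages follow directly from the normal distribution: the probability of landing within k standard deviations of the mean equals erf(k/√2). A minimal sketch using only the standard library (the function name `normal_coverage` is a hypothetical helper, not part of any library):

```python
import math

def normal_coverage(k):
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"μ ± {k}σ: {normal_coverage(k) * 100:.2f}%")
# μ ± 1σ: 68.27%
# μ ± 2σ: 95.45%
# μ ± 3σ: 99.73%
```

The result does not depend on μ or σ, which is why one table serves every normal distribution.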
np.random.seed(42)
scores = np.random.normal(75, 10, 10000)
mu, sigma = scores.mean(), scores.std()
within_1 = np.abs(scores - mu) <= sigma
within_2 = np.abs(scores - mu) <= 2*sigma
within_3 = np.abs(scores - mu) <= 3*sigma
print(f"Within 1σ: {within_1.mean()*100:.2f}% (theory: 68.27%)")
print(f"Within 2σ: {within_2.mean()*100:.2f}% (theory: 95.45%)")
print(f"Within 3σ: {within_3.mean()*100:.2f}% (theory: 99.73%)")
# Visualization
x = np.linspace(mu-4*sigma, mu+4*sigma, 300)
y = stats.norm.pdf(x, mu, sigma)
fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(x, y, 'k-', lw=2, label='Normal curve')
# Shade the widest band first so the narrower bands are drawn on top of it
for k, color, label in zip([3, 2, 1],
                           ['#b8daff', '#c3e6cb', '#d4edda'],
                           ['99.73% (±3σ)', '95.45% (±2σ)', '68.27% (±1σ)']):
    ax.fill_between(x, y, where=(np.abs(x-mu) <= k*sigma), color=color, label=label)
# Dashed guide lines at ±1σ, ±2σ, and ±3σ
for k in [1, 2, 3]:
    ax.axvline(mu+k*sigma, color='gray', ls='--', alpha=0.5)
    ax.axvline(mu-k*sigma, color='gray', ls='--', alpha=0.5)
ax.axvline(mu, color='red', lw=2, label=f'μ={mu:.0f}')
ax.set_title('Empirical Rule (68-95-99.7)')
ax.legend()
plt.tight_layout()
plt.savefig('empirical_rule.png', dpi=150)
plt.show()
Coefficient of Variation (CV)
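The coefficient of variation, CV = SD / mean, expresses the standard deviation as a fraction of the mean. Because it is unitless, it enables comparison of variability across different scales and units. It is only meaningful when the mean is positive and not close to zero.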
datasets = {
    'Heights (cm)': np.random.normal(170, 10, 200),
    'Weights (kg)': np.random.normal(70, 15, 200),
    'IQ Scores': np.random.normal(100, 15, 200),
    'Income ($K)': np.random.lognormal(4, 0.5, 200),
}
print(f"{'Variable':<20} {'Mean':>8} {'SD':>8} {'CV%':>8}")
print("-" * 48)
for name, d in datasets.items():
    sd = np.std(d, ddof=1)      # sample standard deviation
    cv = sd / np.mean(d) * 100  # CV as a percentage of the mean
    print(f"{name:<20} {np.mean(d):>8.1f} {sd:>8.1f} {cv:>7.1f}%")
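One way to see why CV suits cross-unit comparison: rescaling the data multiplies the mean and the SD by the same factor, so their ratio is unchanged. The sketch below uses made-up height values purely for illustration, and the helper `coefficient_of_variation` is hypothetical.

```python
import statistics

def coefficient_of_variation(values):
    """Sample SD divided by the mean, as a percentage (hypothetical helper)."""
    return statistics.stdev(values) / statistics.mean(values) * 100

# Illustrative values only, not real measurements
heights_cm = [158.0, 164.5, 170.0, 175.5, 182.0]
heights_m = [h / 100 for h in heights_cm]  # same data, different unit

sd_cm = statistics.stdev(heights_cm)
sd_m = statistics.stdev(heights_m)
cv_cm = coefficient_of_variation(heights_cm)
cv_m = coefficient_of_variation(heights_m)

# The SD changes with the unit; the CV does not
print(f"SD: {sd_cm:.2f} cm vs {sd_m:.4f} m")
print(f"CV: {cv_cm:.2f}% vs {cv_m:.2f}%")
```

Adding a constant is different: it shifts the mean but leaves the SD unchanged, so CV is not invariant to a change of origin (for example, Celsius versus Kelvin).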
Outlier Detection (3σ Rule)
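np.random.seed(0)
# 97 typical values plus three planted extremes (120, 5, 3)
data_out = np.concatenate([np.random.normal(50, 5, 97), [120, 5, 3]])
# z-score: distance from the mean in units of standard deviation
z = np.abs((data_out - data_out.mean()) / data_out.std())
outliers = data_out[z > 3]
print(f"Outliers (|z|>3): {outliers}")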
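The same check can be wrapped in a small reusable function. This is a sketch that assumes the data are roughly normal apart from the extremes; the name `flag_outliers` and its `threshold` parameter are hypothetical, and the values are made up for illustration. It uses the population standard deviation, matching the default of NumPy's `.std()`.

```python
import statistics

def flag_outliers(values, threshold=3.0):
    """Return the values whose |z-score| exceeds threshold (hypothetical helper)."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # population SD, i.e. ddof=0
    if sd == 0:
        return []  # all values identical, so nothing can be an outlier
    return [v for v in values if abs(v - mean) / sd > threshold]

# Illustrative data: twenty values near 50 and one planted extreme
readings = [48, 52, 50, 47, 53, 49, 51, 50, 46, 54,
            50, 49, 51, 52, 48, 50, 47, 53, 50, 51, 120]

flagged = flag_outliers(readings)
print(f"Flagged: {flagged}")  # Flagged: [120]
```

Extreme values inflate the very standard deviation they are measured against, so several large outliers can partly mask one another. Treat the |z| > 3 check as a screening step rather than a verdict.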
Key Takeaways
- Standard deviation is in the same units as the data — directly interpretable
- The 68-95-99.7 rule applies to approximately normal distributions: memorize these thresholds
- CV = SD/mean allows variability comparison across different scales
- For samples use ddof=1 in NumPy; for populations use ddof=0 (the NumPy default)
- |z| > 3 (3σ from mean) flags potential outliers in approximately normal data
- Skewed distributions violate the empirical rule — never apply it blindly
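
To illustrate the last point, consider an exponential distribution with rate 1, chosen here as an assumed example of a strongly skewed distribution. Its mean and standard deviation both equal 1, so the coverage of μ ± kσ can be computed exactly from its CDF, 1 - exp(-x), and compared with the normal values. The function names are hypothetical helpers.

```python
import math

def normal_coverage(k):
    """P(|X - μ| <= kσ) for a normal distribution."""
    return math.erf(k / math.sqrt(2))

def exponential_coverage(k):
    """P(|X - 1| <= k) for X ~ Exponential(rate=1), where mean = SD = 1."""
    lower = max(0.0, 1 - k)  # the distribution has no mass below 0
    upper = 1 + k
    # CDF of Exponential(1) is 1 - exp(-x); take the difference of the two CDF values
    return (1 - math.exp(-upper)) - (1 - math.exp(-lower))

print(f"{'Range':<8} {'Normal':>8} {'Exponential':>12}")
for k in (1, 2, 3):
    normal_pct = normal_coverage(k) * 100
    expo_pct = exponential_coverage(k) * 100
    print(f"μ ± {k}σ   {normal_pct:>7.2f}% {expo_pct:>11.2f}%")
```

Within one standard deviation the exponential holds about 86.5% of its mass rather than 68.3%, and beyond three standard deviations it leaves about 1.8% rather than 0.27%.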