Z-Scores (Standard Scores)
A z-score measures how many standard deviations a value lies above or below the mean: z = (x − μ)/σ, where μ is the mean and σ is the standard deviation.
import numpy as np
from scipy.stats import norm
from sklearn.preprocessing import StandardScaler
scores = np.array([72, 85, 91, 68, 77, 94, 83, 79, 88, 62])
mean, std = scores.mean(), scores.std(ddof=1)  # ddof=1 gives the sample standard deviation
z = (scores - mean) / std
print(f"Mean={mean:.2f}, Std={std:.2f}\n")
for s, zz in zip(scores, z):
    print(f" Score={s:3d} z={zz:+.3f} ({abs(zz):.1f}σ {'above' if zz>0 else 'below'} mean)")
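As a cross-check on the manual calculation above, SciPy's `scipy.stats.zscore` computes the same quantity. The sketch below reuses the same ten scores. Note that `zscore` defaults to `ddof=0` (population standard deviation), so `ddof=1` is passed explicitly to match the sample standard deviation used above.

```python
import numpy as np
from scipy.stats import zscore

scores = np.array([72, 85, 91, 68, 77, 94, 83, 79, 88, 62])

# Manual z-scores using the sample standard deviation (ddof=1)
z_manual = (scores - scores.mean()) / scores.std(ddof=1)

# SciPy equivalent; ddof must be set to 1 to match, since the default is 0
z_scipy = zscore(scores, ddof=1)

print(np.allclose(z_manual, z_scipy))  # True

# Standardized values have mean 0; with ddof=1 their sample std is 1
print(abs(z_scipy.mean()) < 1e-12, abs(z_scipy.std(ddof=1) - 1) < 1e-12)
```

Whichever `ddof` is chosen, it should be applied consistently; mixing the two conventions gives z-scores that differ by a constant factor.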
Comparing Across Different Tests
alice_math, mu_m, sd_m = 85, 75, 10      # Math test
alice_essay, mu_e, sd_e = 4.5, 4.0, 0.5  # Essay rubric (0-6)
z_math = (alice_math - mu_m) / sd_m
z_essay = (alice_essay - mu_e) / sd_e
print(f"Math z-score: {z_math:.2f}")
print(f"Essay z-score: {z_essay:.2f}")
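# Both z-scores equal 1.00 here, so a tie must be reported explicitly
if np.isclose(z_math, z_essay):
    verdict = "Tie (equal z-scores)"
else:
    verdict = "Math" if z_math > z_essay else "Essay"
print(f"Better performance: {verdict}")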
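If both score distributions are approximately normal (an assumption the example above does not state), each z-score can also be read as a percentile. The following is a minimal sketch; `z_to_percentile` is a hypothetical helper introduced only for illustration.

```python
from scipy.stats import norm

def z_to_percentile(x, mu, sd):
    """Return (z, percentile) for x, assuming scores ~ Normal(mu, sd)."""
    z = (x - mu) / sd
    return z, 100 * norm.cdf(z)

z_math, pct_math = z_to_percentile(85, 75, 10)        # Math test
z_essay, pct_essay = z_to_percentile(4.5, 4.0, 0.5)   # Essay rubric (0-6)

print(f"Math:  z={z_math:.2f}, percentile ≈ {pct_math:.1f}")
print(f"Essay: z={z_essay:.2f}, percentile ≈ {pct_essay:.1f}")
```

Both results land one standard deviation above the mean, at roughly the 84th percentile, so the two performances are equivalent relative to their respective groups.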
Z-Scores and Normal Probabilities
print("Key z-score probabilities:")
print(f"P(Z < 1.645) = {norm.cdf(1.645):.4f} (95th pct)")
print(f"P(Z < 1.960) = {norm.cdf(1.960):.4f} (97.5th pct)")
print(f"P(Z < 2.326) = {norm.cdf(2.326):.4f} (99th pct)")
print(f"P(-1.96<Z<1.96) = {norm.cdf(1.96)-norm.cdf(-1.96):.4f} (95%)")
# Given probability → find z
print(f"\nZ for 90th percentile: {norm.ppf(0.90):.4f}")
print(f"Z for 97.5th: {norm.ppf(0.975):.4f}")
print(f"Z for 99th: {norm.ppf(0.99):.4f}")
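The same functions give tail probabilities, which is how a z-score is usually turned into a p-value. The sketch below uses `norm.sf` (the survival function, equal to `1 - norm.cdf`); the observed value `z_obs` and the level `alpha` are hypothetical numbers chosen for illustration.

```python
from scipy.stats import norm

z_obs = 1.96  # hypothetical observed z-score

upper_tail = norm.sf(z_obs)          # P(Z > z), same as 1 - norm.cdf(z)
two_sided = 2 * norm.sf(abs(z_obs))  # P(|Z| > |z|)

print(f"P(Z > {z_obs}) = {upper_tail:.4f}")
print(f"P(|Z| > {z_obs}) = {two_sided:.4f}")

# Critical value for a two-sided interval at confidence level 1 - alpha
alpha = 0.05
z_crit = norm.ppf(1 - alpha / 2)
print(f"z critical (alpha={alpha}) = {z_crit:.4f}")
```

With `alpha = 0.05` the critical value comes out at about 1.96, matching the 95% band computed above.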
Standardization for Machine Learning
X = np.array([
    [170, 70, 25],
    [180, 85, 30],
    [160, 55, 22],
    [175, 80, 35],
    [165, 60, 28],
], dtype=float)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
print("Standardized features (z-scores):")
print(X_scaled.round(3))
print(f"Column means (should be ~0): {X_scaled.mean(axis=0).round(6)}")
print(f"Column stds (should be ~1): {X_scaled.std(axis=0).round(4)}")
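To make explicit what `StandardScaler` does, the sketch below compares it with a manual column-wise z-score and then applies the fitted scaler to a new row. The row `X_new` is a hypothetical unseen sample, not part of the original data. One detail worth noting: `StandardScaler` divides by the population standard deviation (`ddof=0`), not the sample standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([
    [170, 70, 25],
    [180, 85, 30],
    [160, 55, 22],
    [175, 80, 35],
    [165, 60, 28],
], dtype=float)
X_new = np.array([[172, 68, 27]], dtype=float)  # hypothetical unseen sample

scaler = StandardScaler().fit(X_train)

# Manual equivalent: NumPy's std defaults to ddof=0, matching StandardScaler
manual = (X_train - X_train.mean(axis=0)) / X_train.std(axis=0)
print(np.allclose(scaler.transform(X_train), manual))  # True

# Reuse the training mean and scale for new data; do not refit on it
X_new_scaled = scaler.transform(X_new)
print(X_new_scaled.round(3))

# inverse_transform maps standardized values back to the original units
print(np.allclose(scaler.inverse_transform(X_new_scaled), X_new))  # True
```

Fitting on the training data only and reusing `scaler.transform` for later data keeps information from unseen samples out of the learned mean and scale.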
Key Takeaways
- z = (x−μ)/σ — units are standard deviations from the mean
- Enables comparison across different scales — math score vs essay rubric
- P(−1.96 < Z < 1.96) ≈ 95% — foundation of 95% confidence intervals
- |z| > 3 flags potential outliers in approximately normal data
- StandardScaler in sklearn applies z-score standardization for ML pipelines
- Clean probability interpretation requires normality — for skewed data, use percentile ranks
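
The outlier rule and the percentile-rank alternative from the list above can be sketched as follows. The data are hypothetical: twenty values between 10 and 13 plus one extreme value. `scipy.stats.percentileofscore` with `kind="weak"` returns the percentage of values less than or equal to the given score.

```python
import numpy as np
from scipy.stats import percentileofscore

# Hypothetical data: 20 values between 10 and 13 plus one extreme value
data = np.array([10, 12, 11, 13, 12, 11, 10, 12, 13, 11,
                 12, 10, 11, 13, 12, 11, 12, 10, 13, 11, 40])

# Flag values more than 3 sample standard deviations from the mean
z = (data - data.mean()) / data.std(ddof=1)
outliers = data[np.abs(z) > 3]
print("Flagged as potential outliers:", outliers)

# Percentile rank makes no normality assumption
pct = percentileofscore(data, 13, kind="weak")
print(f"Percentile rank of 13: {pct:.1f}")
```

One caveat on the |z| > 3 rule: with the sample standard deviation, no value in a sample of size n can exceed |z| = (n − 1)/√n, which is about 2.85 for n = 10. In very small samples the rule therefore never fires, and a lower threshold or a robust method may be more appropriate.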