Statistics 101: Mean, Median, Variance and Distributions
Why Statistics Matters for Data Science
Statistics is the mathematical foundation of machine learning. Most ML algorithms can be read as statistical models that optimize an objective function, so a working knowledge of statistics is what lets you explain why a model works or fails.
[Figure: The Statistics → ML Pipeline]
1. Measures of Central Tendency
Central tendency describes the "center" or "typical value" of a dataset.
Mean (Arithmetic Average)
Formula:
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
Median (Middle Value)
The median is the value separating the higher half from the lower half of a data sample.
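Formula (for the sorted values $x_{(1)} \le x_{(2)} \le \dots \le x_{(n)}$):
$$\text{median} = \begin{cases} x_{((n+1)/2)} & \text{if } n \text{ is odd} \\ \tfrac{1}{2}\left(x_{(n/2)} + x_{(n/2+1)}\right) & \text{if } n \text{ is even} \end{cases}$$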
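To see how the mean and the median react to an extreme value, here is a minimal sketch using Python's standard `statistics` module. The salary figures are hypothetical and chosen only for illustration.

```python
import statistics

# Hypothetical salaries (in thousands); the value 400 is a deliberate outlier.
salaries = [42, 45, 47, 50, 52]
salaries_with_outlier = salaries + [400]

mean_before = statistics.mean(salaries)
median_before = statistics.median(salaries)            # middle of 5 sorted values

mean_after = statistics.mean(salaries_with_outlier)
median_after = statistics.median(salaries_with_outlier)  # average of the two middle values

print("without outlier:", mean_before, median_before)
print("with outlier:   ", mean_after, median_after)
```

The single outlier more than doubles the mean, while the median barely moves. This is the usual reason to prefer the median for skewed data.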
Mode (Most Frequent)
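The mode is the value that occurs most often in a dataset, and unlike the mean and median it also works for categorical data. A minimal sketch with the standard `statistics` module follows; the sample values are hypothetical.

```python
import statistics

# Hypothetical categorical sample: the mode is defined even for non-numeric data.
colors = ["red", "blue", "blue", "green", "red", "blue"]
most_common = statistics.mode(colors)

# A sample where two values tie for most frequent (bimodal).
ratings = [1, 2, 2, 3, 3, 4]
all_modes = statistics.multimode(ratings)  # returns every tied value

print(most_common)
print(all_modes)
```

`statistics.multimode` (Python 3.8+) is the safer choice when ties are possible, since it returns all of the most frequent values rather than just one.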
2. Measures of Spread (Dispersion)
Variance and Standard Deviation
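Population Variance:
$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$$
Sample Variance (Bessel's Correction):
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$
Standard Deviation:
$$\sigma = \sqrt{\sigma^2} \quad \text{(population)}, \qquad s = \sqrt{s^2} \quad \text{(sample)}$$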
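The difference between dividing by the number of observations and dividing by one less (Bessel's correction) is easiest to see numerically. The sketch below implements both by hand and compares them with the standard library; the data and the function names `population_variance` and `sample_variance` are illustrative assumptions.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical observations

def population_variance(xs):
    """Average squared deviation from the mean, dividing by N."""
    mu = sum(xs) / len(xs)
    return sum((x - mu) ** 2 for x in xs) / len(xs)

def sample_variance(xs):
    """Same sum of squares, but divided by n - 1 (Bessel's correction)."""
    x_bar = sum(xs) / len(xs)
    return sum((x - x_bar) ** 2 for x in xs) / (len(xs) - 1)

pop_var = population_variance(data)
samp_var = sample_variance(data)
pop_std = math.sqrt(pop_var)

print(pop_var, samp_var, pop_std)
# The standard library provides the same quantities:
print(statistics.pvariance(data), statistics.variance(data), statistics.pstdev(data))
```

The sample variance is always slightly larger than the population formula applied to the same data, and the gap shrinks as the sample grows.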
3. The Normal Distribution (Gaussian)
The normal distribution is arguably the most important probability distribution in statistics and ML.
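Probability Density Function (PDF):
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$
Where:
- $\mu$ = mean (location parameter)
- $\sigma$ = standard deviation (scale parameter)
- $\sigma^2$ = variance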
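The density can be evaluated directly from its definition. The sketch below writes it out by hand and cross-checks it against `statistics.NormalDist` from the standard library (Python 3.8+); the function name `normal_pdf` is an illustrative choice.

```python
import math
from statistics import NormalDist

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Normal density with mean mu and standard deviation sigma, evaluated at x."""
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

# Height of the standard normal curve at its peak (x = mu).
peak = normal_pdf(0.0)

# Probability mass within one standard deviation of the mean, via the CDF.
standard = NormalDist(mu=0, sigma=1)
within_one_sigma = standard.cdf(1) - standard.cdf(-1)

print(peak)
print(within_one_sigma)
print(normal_pdf(12, mu=10, sigma=2), NormalDist(10, 2).pdf(12))
```

Note that the PDF returns a density, not a probability. Probabilities come from areas under the curve, which is why the second calculation uses the CDF.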
Standard Normal Distribution (Z-Score)
Z-Score Transformation:
$$z = \frac{x - \mu}{\sigma}$$
This transforms any normal distribution to the standard normal with $\mu = 0$ and $\sigma = 1$.
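As a quick check of the transformation, the sketch below standardizes a small dataset and confirms that the result has mean 0 and standard deviation 1. The exam scores are hypothetical, and the population standard deviation is used as an assumption about how the data should be scaled.

```python
import statistics

scores = [55, 60, 65, 70, 75, 80, 85]  # hypothetical exam scores

mu = statistics.mean(scores)
sigma = statistics.pstdev(scores)  # population standard deviation

# Subtract the mean, then divide by the standard deviation.
z_scores = [(x - mu) / sigma for x in scores]

print(z_scores)
print(statistics.mean(z_scores), statistics.pstdev(z_scores))
```

Each z-score reads as "how many standard deviations above or below the mean", which makes values from different scales directly comparable.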
4. Central Limit Theorem (CLT)
Often called the most important theorem in statistics, the CLT explains why the normal distribution appears so often in practice.
Central Limit Theorem:
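Given a population with mean $\mu$ and finite standard deviation $\sigma$, the sampling distribution of the sample mean $\bar{X}_n$ approaches a normal distribution as the sample size $n$ increases:
$$\bar{X}_n \approx \mathcal{N}\left(\mu, \frac{\sigma^2}{n}\right)$$
Standard Error:
$$SE = \frac{\sigma}{\sqrt{n}}$$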
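A small simulation makes the theorem concrete. The sketch below draws repeated samples from an exponential population (strongly right-skewed, with mean 1 and standard deviation 1) and compares the spread of the sample means with the predicted standard error. The population choice, the seed, the sample size, and the number of trials are all assumptions made for illustration.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

def sample_means(n, trials=2000):
    """Means of `trials` independent samples of size n from Exponential(rate=1)."""
    return [
        statistics.fmean([random.expovariate(1.0) for _ in range(n)])
        for _ in range(trials)
    ]

n = 50
means = sample_means(n)

observed_mean = statistics.fmean(means)   # should be close to the population mean, 1
observed_se = statistics.stdev(means)     # spread of the sample means
predicted_se = 1.0 / math.sqrt(n)         # sigma / sqrt(n) with sigma = 1

print(observed_mean, observed_se, predicted_se)
```

Even though individual draws are far from normal, the sample means cluster symmetrically around 1 with a spread close to the predicted standard error. Plotting a histogram of `means` would show the familiar bell shape.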
5. Skewness and Kurtosis
Skewness (Asymmetry)
Kurtosis (Tail Weight)
Fisher's Kurtosis (excess kurtosis):
$$\gamma_2 = \frac{E\left[(X - \mu)^4\right]}{\sigma^4} - 3$$
- Mesokurtic (γ₂ = 0): Normal distribution
- Leptokurtic (γ₂ > 0): Heavy tails, sharp peak
- Platykurtic (γ₂ < 0): Light tails, flat peak
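Both shape measures can be computed from central moments. The sketch below uses the simple moment-based (population) estimators, which is an assumption; statistical packages often apply additional small-sample bias corrections. The function names and the datasets are illustrative.

```python
def central_moment(xs, k):
    """k-th central moment: the average of (x - mean) ** k."""
    mu = sum(xs) / len(xs)
    return sum((x - mu) ** k for x in xs) / len(xs)

def skewness(xs):
    """Third standardized moment: m3 / m2 ** 1.5."""
    return central_moment(xs, 3) / central_moment(xs, 2) ** 1.5

def excess_kurtosis(xs):
    """Fisher's kurtosis: m4 / m2 ** 2 - 3, so a normal distribution scores 0."""
    return central_moment(xs, 4) / central_moment(xs, 2) ** 2 - 3.0

symmetric = [1, 2, 3, 4, 5]        # evenly spread, no asymmetry
right_skewed = [1, 1, 2, 2, 3, 10]  # one large value stretches the right tail

print(skewness(symmetric), excess_kurtosis(symmetric))
print(skewness(right_skewed), excess_kurtosis(right_skewed))
```

The evenly spaced sample has zero skewness and negative excess kurtosis (platykurtic), while the sample with a long right tail has positive skewness.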
6. Covariance and Correlation
Covariance
Covariance:
$$\mathrm{Cov}(X, Y) = E\left[(X - \mu_X)(Y - \mu_Y)\right]$$
- Cov > 0: Variables move together (positive)
- Cov < 0: Variables move opposite (negative)
- Cov = 0: No linear relationship
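The sign rules above can be checked on a small example. The sketch below computes the sample covariance (dividing by n - 1, an assumption that matches the sample variance convention) for hypothetical study-time data, and shows that the value changes when the units change.

```python
def sample_covariance(xs, ys):
    """Sample covariance of two equal-length sequences, dividing by n - 1."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    products = ((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return sum(products) / (len(xs) - 1)

hours = [1, 2, 3, 4, 5]        # hypothetical hours studied
score = [52, 55, 61, 64, 68]   # hypothetical exam scores

cov_hours = sample_covariance(hours, score)
# Same relationship, but study time expressed in minutes instead of hours.
cov_minutes = sample_covariance([h * 60 for h in hours], score)

print(cov_hours, cov_minutes)
```

Both values are positive, but the second is 60 times larger purely because of the unit change. Covariance tells you the direction of a relationship, not its strength.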
Pearson Correlation Coefficient
Pearson's r:
$$r = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \, \sigma_Y}, \qquad -1 \le r \le 1$$
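Dividing the covariance by both standard deviations removes the units, so the result always lies between -1 and 1. The sketch below is one way to compute it from sums of squares; the data and the function name `pearson_r` are illustrative assumptions.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    # The 1/(n-1) factors in the covariance and variances cancel out.
    return sxy / math.sqrt(sxx * syy)

hours = [1, 2, 3, 4, 5]        # hypothetical hours studied
score = [52, 55, 61, 64, 68]   # hypothetical exam scores

r_hours = pearson_r(hours, score)
r_minutes = pearson_r([h * 60 for h in hours], score)  # rescaled input

print(r_hours, r_minutes)
```

Rescaling hours to minutes leaves r unchanged, which is what makes correlation comparable across datasets. On Python 3.10+, `statistics.correlation` computes the same quantity.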
7. Confidence Intervals
Confidence Interval for Mean (known $\sigma$, or large $n$):
$$\bar{x} \pm z_{\alpha/2} \, \frac{\sigma}{\sqrt{n}}$$
Common Confidence Levels:
| Level | z-score | Area in Tails |
|---|---|---|
| 90% | 1.645 | 5% each side |
| 95% | 1.960 | 2.5% each side |
| 99% | 2.576 | 0.5% each side |
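The z-scores in the table can be reproduced from the standard normal inverse CDF, and the interval then follows from the standard error. The sketch below assumes a known population standard deviation; the function name `z_confidence_interval` and the numbers passed to it are hypothetical.

```python
import math
from statistics import NormalDist

def z_critical(level):
    """Two-sided critical value: leaves (1 - level) / 2 in each tail."""
    return NormalDist().inv_cdf(1 - (1 - level) / 2)

def z_confidence_interval(sample_mean, sigma, n, level=0.95):
    """z-based interval for the mean, assuming sigma is known."""
    margin = z_critical(level) * sigma / math.sqrt(n)
    return sample_mean - margin, sample_mean + margin

# Reproduce the critical values for the common confidence levels.
z_values = {level: z_critical(level) for level in (0.90, 0.95, 0.99)}

# Hypothetical study: mean 100 from n = 36 observations, sigma = 15.
low, high = z_confidence_interval(sample_mean=100.0, sigma=15.0, n=36)

print({level: round(z, 3) for level, z in z_values.items()})
print(round(low, 2), round(high, 2))
```

When the population standard deviation is unknown and the sample is small, a t-distribution critical value is the usual replacement for z.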
Key Takeaways
- Mean, Median, Mode describe central tendency — choose based on data shape
- Variance/Standard Deviation quantify spread and underlie many ML loss functions, such as mean squared error
- Normal Distribution — the bell curve underlies hypothesis testing and CLT
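- CLT says sample means are approximately normal for large samples, whatever the population shape, provided the variance is finite (n ≥ 30 is a common rule of thumb, not a guarantee)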
- Correlation ≠ Causation — always consider confounding variables
- Confidence Intervals — quantify uncertainty in estimates
Next: Probability, Bayes' Theorem and PDF/CDF
Build on these foundations with probability theory and Bayesian inference.