Descriptive Statistics

ℹ️ Why It Matters

Descriptive statistics are the foundation of all data analysis. Before running complex models, you must understand the distribution, central tendency, and variability of your data. These summary measures reveal patterns, detect outliers, guide feature engineering, and inform model selection. In machine learning, nearly every preprocessing step — normalization, scaling, outlier removal — relies on descriptive statistics. Without them, you are flying blind.

Overview

Descriptive statistics summarize a dataset with a few meaningful numbers. They answer three fundamental questions: Where is the center? How much do values vary? What is the shape? Measures of center (mean, median, mode) identify the typical value. Measures of spread (variance, standard deviation, IQR, range) quantify how dispersed observations are. Measures of shape (skewness, kurtosis) describe asymmetry and tail heaviness. Together, these three pillars give a compact, though never complete, picture of a distribution. The choice of summary measure depends on the shape of the distribution and the presence of outliers; using the wrong one can be deeply misleading.

Key Concepts

Measures of Center

Arithmetic Mean

\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i

Here,

$x_i$ = Each individual observation
$n$ = Total number of observations
$\bar{x}$ = Sample mean (balance point of the data)

The mean uses every data point, which also makes it sensitive to extreme values. It minimizes the sum of squared deviations: $\bar{x} = \arg\min_c \sum_{i=1}^{n}(x_i - c)^2$. It is the natural measure of center for symmetric distributions without outliers.
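The minimization property can be checked numerically. The snippet below is a minimal sketch on made-up values (the data and the helper name `sum_sq_dev` are illustrative assumptions, not part of the lesson): it compares the sum of squared deviations at the mean with the same sum at two nearby candidates.

```python
from statistics import mean

def sum_sq_dev(data, c):
    """Sum of squared deviations of the data from a candidate center c."""
    return sum((x - c) ** 2 for x in data)

data = [2, 4, 4, 5, 10]      # illustrative values only
x_bar = mean(data)           # 5

# Evaluate the loss at the mean and half a unit to either side.
loss_at_mean = sum_sq_dev(data, x_bar)        # 36
loss_below = sum_sq_dev(data, x_bar - 0.5)    # 37.25
loss_above = sum_sq_dev(data, x_bar + 0.5)    # 37.25

print(x_bar, loss_at_mean, loss_below, loss_above)
```

Moving the candidate center by $d$ away from $\bar{x}$ raises the loss by exactly $n d^2$, which is why both neighbours come out 1.25 higher here.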

Median

\tilde{x} = \begin{cases} x_{((n+1)/2)} & \text{if } n \text{ is odd} \\ \frac{x_{(n/2)} + x_{(n/2+1)}}{2} & \text{if } n \text{ is even} \end{cases}

Here,

$x_{(k)}$ = The k-th value in sorted order

The median is the 50th percentile: the value separating the lower half of the data from the upper half. It is robust to outliers and skewed distributions. A large gap between the mean and the median usually signals a skewed distribution or the presence of outliers.
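The odd and even cases of the definition translate directly into code. This is a small sketch, with the function written by hand for illustration (Python's `statistics.median` does the same job); the index comments map Python's 0-based positions to the 1-based order statistics used above.

```python
def median(values):
    """Median via the odd/even cases of the sorted-order definition."""
    xs = sorted(values)
    n = len(xs)
    if n == 0:
        raise ValueError("median of empty data is undefined")
    mid = n // 2
    if n % 2 == 1:
        return xs[mid]                      # x_((n+1)/2) in 1-based terms
    return (xs[mid - 1] + xs[mid]) / 2      # mean of x_(n/2) and x_(n/2+1)

odd = [7, 1, 5]        # illustrative values
even = [7, 1, 5, 3]

print(median(odd))     # 5
print(median(even))    # 4.0
```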

Mode

x^* = \arg\max_x f(x)

Here,

$x^*$ = The most frequent value
$f(x)$ = Probability mass/density function

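The mode is the most frequently occurring value. A dataset can have one mode, several modes (when values tie for the highest count), or no meaningful mode (when every value occurs equally often). It is especially useful for categorical data, where the mean is undefined.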
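For categorical data, the modes can be read off with the standard library. A brief sketch, assuming Python 3.8+ for `statistics.multimode` and using invented category labels:

```python
from statistics import multimode

colors = ["red", "blue", "red", "green", "blue"]   # hypothetical categorical data

# multimode returns every value tied for the highest count,
# in the order each first appears in the data.
modes = multimode(colors)
print(modes)   # ['red', 'blue']
```

Note that `multimode` returns all values when every value occurs equally often, so deciding that such data has "no mode" is a convention applied on top of the result.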

Weighted and Specialized Means

Weighted Mean

\bar{x}_w = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}

Here,

$w_i$ = Weight assigned to observation i (importance or reliability)
$x_i$ = Each individual observation

Use when observations have different importance: combining group means with different sample sizes, weighting by reliability, or applying time decay in time series.
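As an example of the first use case, the sketch below combines two group means using the group sizes as weights. The function name `weighted_mean` and the numbers are illustrative assumptions.

```python
def weighted_mean(values, weights):
    """Weighted mean: sum(w_i * x_i) / sum(w_i)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for w, x in zip(weights, values)) / total_weight

group_means = [70.0, 80.0]   # hypothetical mean score of each group
group_sizes = [10, 30]       # number of observations behind each mean

overall = weighted_mean(group_means, group_sizes)
print(overall)   # 77.5
```

The unweighted average of the two group means would be 75.0, which understates the contribution of the larger group.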

Geometric Mean

G = \left(\prod_{i=1}^{n} x_i\right)^{1/n} = \exp\left(\frac{1}{n}\sum_{i=1}^{n} \ln x_i\right)

Here,

$x_i$ = Each positive observation

The geometric mean is appropriate for multiplicative processes: growth rates, compound returns, and ratios. It satisfies $G \leq \bar{x}$ (AM-GM inequality) and requires positive values.
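A short sketch of why the geometric mean suits compounding, using made-up growth factors (a 50% gain followed by a 50% loss). It computes $G$ both with `statistics.geometric_mean` (Python 3.8+) and with the log form of the formula above.

```python
import math
from statistics import geometric_mean

factors = [1.5, 0.5]   # hypothetical growth factors: +50%, then -50%

g = geometric_mean(factors)                                          # about 0.866
g_log = math.exp(sum(math.log(x) for x in factors) / len(factors))   # log form, same value
arithmetic = sum(factors) / len(factors)                             # 1.0

# Compounding the geometric mean over both periods recovers the true end value.
end_value = math.prod(factors)   # 0.75
print(g, g_log, arithmetic, g ** len(factors), end_value)
```

The arithmetic mean of 1.0 suggests no change, while the geometric mean of about 0.866 per period correctly reproduces the 25% overall loss.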

Harmonic Mean

H = \frac{n}{\sum_{i=1}^{n} \frac{1}{x_i}}

Here,

$x_i$ = Each positive observation

The harmonic mean is the reciprocal of the arithmetic mean of reciprocals. It is useful for rates (speed = distance/time) and is the basis of the F1-score: $F_1 = 2 \cdot \text{Precision} \cdot \text{Recall} / (\text{Precision} + \text{Recall})$.
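The link between the F1-score and the harmonic mean can be verified directly. In this sketch the precision and recall values are invented, and returning 0.0 when both are zero is an assumed convention to avoid division by zero.

```python
from statistics import harmonic_mean

def f1_score(precision, recall):
    """F1 as written above: 2PR / (P + R)."""
    if precision + recall == 0:
        return 0.0   # assumed convention for the degenerate case
    return 2 * precision * recall / (precision + recall)

precision, recall = 0.5, 1.0   # hypothetical classifier metrics

f1 = f1_score(precision, recall)
h = harmonic_mean([precision, recall])
print(f1, h)   # both are 2/3
```

The arithmetic mean of the same two numbers is 0.75; the harmonic mean is pulled toward the weaker of the two, which is the behaviour F1 is meant to have.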

The Means Inequality

For any dataset with positive values: $H \leq G \leq \bar{x}$ (harmonic ≤ geometric ≤ arithmetic), with equality only when all values are identical.
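A quick numerical check of the inequality, as a sketch on arbitrary positive values (comparisons use a small tolerance because the geometric mean is computed in floating point):

```python
from statistics import geometric_mean, harmonic_mean, mean

data = [1, 2, 4]        # arbitrary positive values
h = harmonic_mean(data)     # 3 / 1.75, about 1.714
g = geometric_mean(data)    # cube root of 8, about 2.0
a = mean(data)              # 7 / 3, about 2.333
print(h, g, a)

constant = [3, 3, 3]    # all values identical: the three means coincide
print(harmonic_mean(constant), geometric_mean(constant), mean(constant))
```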

Measures of Spread

Sample Variance

s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2

Here,

$n-1$ = Bessel's correction (the degrees of freedom left after estimating the mean)

Dividing by $n-1$ instead of $n$ corrects the bias that arises because $\bar{x}$ is computed from the same data, which uses up one degree of freedom. This ensures $E[s^2] = \sigma^2$.
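Unbiasedness can be illustrated exhaustively on a tiny case. The sketch below assumes a hypothetical population of three values and enumerates every ordered sample of size 2 drawn with replacement, then averages the two variance estimators over all samples.

```python
from itertools import product
from statistics import mean, pvariance, variance

population = [1, 2, 3]            # hypothetical population
sigma2 = pvariance(population)    # true variance, 2/3

# All 9 equally likely ordered samples of size 2, drawn with replacement.
samples = list(product(population, repeat=2))

avg_unbiased = mean(variance(s) for s in samples)   # divides by n - 1
avg_biased = mean(pvariance(s) for s in samples)    # divides by n

print(sigma2, avg_unbiased, avg_biased)
```

Averaged over all samples, the $n-1$ estimator matches $\sigma^2 = 2/3$, while the divide-by-$n$ version averages to $1/3$, too small by the factor $(n-1)/n$.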

Standard Deviation

s = \sqrt{s^2}

Here,

$s$ = Sample standard deviation, expressed in the original units of measurement

For approximately normal distributions, about 68% of data falls within $\bar{x} \pm s$, about 95% within $\bar{x} \pm 2s$, and about 99.7% within $\bar{x} \pm 3s$ (the 68-95-99.7 rule).
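The three percentages come from the normal CDF and can be recovered with the standard library. This sketch assumes Python 3.8+ for `statistics.NormalDist` and works with the standard normal, where the mean is 0 and the standard deviation is 1.

```python
from statistics import NormalDist

z = NormalDist()   # standard normal: mean 0, standard deviation 1

# Probability mass within k standard deviations of the mean.
coverage = {k: z.cdf(k) - z.cdf(-k) for k in (1, 2, 3)}

for k, p in coverage.items():
    print(f"within {k} sd: {p:.4f}")
```

The printed values round to 0.6827, 0.9545, and 0.9973, which the rule abbreviates as 68%, 95%, and 99.7%.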

Interquartile Range (IQR)

IQR = Q_3 - Q_1

Here,

$Q_1$ = 25th percentile
$Q_3$ = 75th percentile

The IQR measures the spread of the middle 50% of data. It is robust to outliers. The standard outlier rule flags values outside $[Q_1 - 1.5 \cdot IQR, Q_3 + 1.5 \cdot IQR]$ as potential outliers.
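A minimal sketch of the fence rule follows. Quartile conventions differ between tools; the choice of `method="inclusive"` in `statistics.quantiles` (Python 3.8+) is an assumption made here, and the helper name `iqr_outliers` is illustrative. The data are the purchase amounts used in the Quick Example of this lesson.

```python
from statistics import quantiles

def iqr_outliers(data, k=1.5):
    """Return (flagged values, (lower fence, upper fence)) for the k * IQR rule."""
    q1, _, q3 = quantiles(data, n=4, method="inclusive")   # assumed quartile convention
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    flagged = [x for x in data if x < low or x > high]
    return flagged, (low, high)

purchases = [12, 15, 18, 20, 22, 25, 1500]
outliers, fences = iqr_outliers(purchases)

print(fences)     # (6.0, 34.0)
print(outliers)   # [1500]
```

With this convention $Q_1 = 16.5$ and $Q_3 = 23.5$, so the IQR is 7 and only the 1500 purchase falls outside the fences.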

Median Absolute Deviation (MAD)

\text{MAD} = \text{median}(|x_i - \text{median}(x)|)

Here,

$\text{MAD}$ = Robust spread measure based on the median

For normal data, $\hat{\sigma} = 1.4826 \cdot \text{MAD}$ is a consistent estimator of the standard deviation, more robust than $s$ when outliers are present.
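The sketch below computes the MAD and its scaled version for the purchase amounts from this lesson's Quick Example, and compares the result with the ordinary standard deviation. The helper name `mad` is an illustrative choice.

```python
from statistics import median, stdev

def mad(data):
    """Median absolute deviation from the median."""
    center = median(data)
    return median(abs(x - center) for x in data)

purchases = [12, 15, 18, 20, 22, 25, 1500]

raw_mad = mad(purchases)            # 5
robust_sigma = 1.4826 * raw_mad     # about 7.413
classic_s = stdev(purchases)        # dominated by the 1500 purchase

print(raw_mad, robust_sigma, classic_s)
```

The scaled MAD stays close to the spread of the typical purchases, while $s$ is inflated to several hundred by the single extreme value.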

Shape Measures

Skewness

\gamma_1 = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{x_i - \bar{x}}{s}\right)^3

Here,

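$\gamma_1$ = Sample skewness: 0 for symmetric data, > 0 for right-skewed, < 0 for left-skewed

Right-skewed (positive): typically Mean > Median > Mode. Common in income data. Left-skewed (negative): typically Mean < Median < Mode. Common in test scores with ceiling effects. This ordering is a rule of thumb; it can fail, especially for discrete or multimodal distributions.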
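The formula can be sketched in a few lines. This version follows the expression above literally, standardizing with the sample standard deviation $s$; other conventions standardize with the population standard deviation, so library results may differ slightly. The function name and data are illustrative.

```python
from statistics import mean, stdev

def skewness(data):
    """(1/n) * sum(((x - mean) / s) ** 3), with s the sample standard deviation."""
    n = len(data)
    x_bar = mean(data)
    s = stdev(data)
    return sum(((x - x_bar) / s) ** 3 for x in data) / n

symmetric = [1, 2, 3, 4, 5]
right_skewed = [12, 15, 18, 20, 22, 25, 1500]   # long right tail
left_skewed = [-x for x in right_skewed]        # mirror image: long left tail

print(skewness(symmetric), skewness(right_skewed), skewness(left_skewed))
```

The symmetric data give a value of zero (up to rounding), and mirroring the data flips the sign of the skewness.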

Excess Kurtosis

\gamma_2 = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{x_i - \bar{x}}{s}\right)^4 - 3

Here,

$\gamma_2$ = Excess kurtosis: 0 for normal-like tails, > 0 for heavy tails (leptokurtic), < 0 for light tails (platykurtic)

The $-3$ subtraction makes the normal distribution have zero excess kurtosis. High kurtosis means more probability mass in the tails than a normal distribution with the same variance, so extreme values occur more often.
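A matching sketch for excess kurtosis, again following the formula above literally with the sample standard deviation $s$ (an assumption that other tools may not share). The two small datasets are illustrative: evenly spaced values with no tails, and the same kind of data with one extreme value.

```python
from statistics import mean, stdev

def excess_kurtosis(data):
    """(1/n) * sum(((x - mean) / s) ** 4) - 3, with s the sample standard deviation."""
    n = len(data)
    x_bar = mean(data)
    s = stdev(data)
    return sum(((x - x_bar) / s) ** 4 for x in data) / n - 3

flat = [1, 2, 3, 4, 5]                        # evenly spread, no tails
heavy_tail = [12, 15, 18, 20, 22, 25, 1500]   # one extreme value

print(excess_kurtosis(flat))         # negative: lighter tails than normal
print(excess_kurtosis(heavy_tail))   # positive: tail value dominates
```

With samples this small the estimates are very noisy, so the signs are more informative than the magnitudes.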

Quick Example

📝 Choosing the Right Measure

Dataset of customer purchases: [12, 15, 18, 20, 22, 25, 1500].

Mean = 230.3 (inflated by outlier)
Median = 20 (robust to outlier)
The median is the appropriate center measure

For growth rates (+10%, -5%, +15%, +8%): the geometric mean of the growth factors (1.10, 0.95, 1.15, 1.08) is about 1.0674, i.e. 6.74% (the correct average annual growth), not the arithmetic mean of 7.0%.

For a car traveling 100 km at 50 km/h and returning at 100 km/h: harmonic mean = 66.67 km/h (correct average speed), not the arithmetic mean of 75 km/h.
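The average-speed figure follows from total distance divided by total time. The sketch below works through that arithmetic and confirms it agrees with the harmonic mean of the two speeds; variable names are illustrative.

```python
from statistics import harmonic_mean

distance_km = 100          # length of each leg
speeds_kmh = [50, 100]     # outbound and return speeds

# Time per leg is distance / speed: 2 h out, 1 h back.
total_time_h = sum(distance_km / v for v in speeds_kmh)    # 3.0
total_distance_km = distance_km * len(speeds_kmh)          # 200

avg_speed = total_distance_km / total_time_h               # about 66.67 km/h
print(avg_speed, harmonic_mean(speeds_kmh))
```

The harmonic mean is the right average here because both legs cover the same distance, so the slower leg takes up more of the total time.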

Measure	Formula	Sensitive to Outliers?	Best For
Mean	$\frac{1}{n}\sum x_i$	Yes	Symmetric, no outliers
Median	Middle value when sorted	No	Skewed data, outliers
Mode	Most frequent value	No	Categorical data
Geometric Mean	$(\prod x_i)^{1/n}$	Less to large values; pulled down by values near zero	Growth rates, ratios
Harmonic Mean	$n / \sum(1/x_i)$	Less to large values; dominated by small values	Rates, F1-score

Key Takeaways

📋 Summary: Descriptive Statistics

Center: Mean for symmetric data without outliers. Median for skewed or outlier-heavy data. Mode for categorical data.
Spread: Variance and standard deviation for symmetric data. IQR and MAD for robust spread. Range is highly sensitive to outliers.
Shape: Skewness measures asymmetry. Kurtosis measures tail heaviness. Both inform model choice and preprocessing decisions.
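Means Hierarchy: $H \leq G \leq \bar{x}$ for positive values. Use the geometric mean for growth rates and ratios, and the harmonic mean for averaging rates and for the F1-score.
Outlier Detection: IQR rule (more than $1.5 \times IQR$ beyond the quartiles) or MAD rule (more than about $3 \times MAD$ from the median, with the MAD often scaled by 1.4826 first). Z-scores are reliable mainly for approximately normal data, because the mean and standard deviation they rely on are themselves distorted by outliers.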
ML Applications: StandardScaler uses mean/std. RobustScaler uses median/IQR. Always check skewness before scaling.
Bessel's Correction: Sample variance uses $n-1$ to correct bias from estimating $\mu$ with $\bar{x}$.
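To make the ML point concrete, here is a plain-Python sketch of the two kinds of scaling mentioned above. It does not call scikit-learn; the function names are hypothetical, and the use of the population standard deviation and of inclusive quartiles are assumptions chosen for illustration.

```python
from statistics import mean, median, pstdev, quantiles

def standard_scale(data):
    """Center on the mean and divide by the standard deviation."""
    mu, sigma = mean(data), pstdev(data)   # population std: an assumption
    return [(x - mu) / sigma for x in data]

def robust_scale(data):
    """Center on the median and divide by the IQR."""
    q1, _, q3 = quantiles(data, n=4, method="inclusive")   # assumed quartile convention
    med = median(data)
    return [(x - med) / (q3 - q1) for x in data]

purchases = [12, 15, 18, 20, 22, 25, 1500]
z = standard_scale(purchases)
r = robust_scale(purchases)

print([round(v, 3) for v in z])
print([round(v, 3) for v in r])
```

With mean/std scaling the six typical purchases are squeezed into a very narrow band because the outlier inflates both statistics; with median/IQR scaling they keep a usable spread and the outlier simply lands far from zero.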

Deep Dive

For detailed explanations, worked examples, and advanced theory, explore the dedicated statistics lessons:

Measures of Center

Measures of Central Tendency — Mean, median, mode compared with when each is appropriate and Python computation
Arithmetic Mean Deep Dive — Balance point, minimization of squared deviations, and properties
Median — Robust center, computation methods, and relationship to quartiles
Mode — Most frequent value, multimodality, and categorical data analysis
Weighted Mean — Assigning different importance to observations with examples
Geometric Mean — Multiplicative processes, compound growth, and AM-GM inequality
Harmonic Mean — Rates, ratios, F1-score connection, and when to use

Measures of Spread

Range and IQR — Robust spread measures, quartile computation, and outlier detection
Variance — Bessel's correction, computational formulas, population vs. sample, and properties
Standard Deviation — Interpreting spread in original units, the empirical rule

Measures of Shape

Skewness — Fisher-Pearson coefficient, right vs. left skew, and implications for the mean
Kurtosis — Excess kurtosis, tail behavior, leptokurtic vs. platykurtic distributions
