Descriptive Statistics

Master measures of center, spread, shape, visualization techniques, and their applications in data science and machine learning.

â„šī¸ Why It Matters

Descriptive statistics are the foundation of all data analysis. Before running complex models, you must understand the distribution, central tendency, and variability of your data. These summary measures reveal patterns, detect outliers, guide feature engineering, and inform model selection. In machine learning, nearly every preprocessing step — normalization, scaling, outlier removal — relies on descriptive statistics. Without them, you are flying blind.


Overview

Descriptive statistics summarize a dataset with a few meaningful numbers. They answer three fundamental questions: Where is the center? How much do values vary? What is the shape? Measures of center (mean, median, mode) identify the typical value. Measures of spread (variance, standard deviation, IQR, range) quantify how dispersed observations are. Measures of shape (skewness, kurtosis) describe asymmetry and tail heaviness. Together, these three pillars give you a complete picture of a distribution. The choice of summary measure depends on the data's distribution shape and the presence of outliers — using the wrong one can be deeply misleading.


Key Concepts

Measures of Center

Arithmetic Mean

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

Here,

  • $x_i$ = Each individual observation
  • $n$ = Total number of observations
  • $\bar{x}$ = Sample mean (balance point of the data)

The mean uses every data point but is sensitive to extreme values. It minimizes the sum of squared deviations: $\bar{x} = \arg\min_c \sum_{i=1}^{n} (x_i - c)^2$. It is the natural measure for symmetric distributions without outliers.
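The minimization property can be checked numerically. The sketch below uses hypothetical values and a hypothetical helper `sum_sq_dev`; it compares the sum of squared deviations at the mean with a few nearby candidates and is an illustration, not a proof.

```python
import statistics

# Hypothetical observations, chosen only for illustration.
data = [4.0, 8.0, 15.0, 16.0, 23.0, 42.0]

def sum_sq_dev(values, c):
    """Sum of squared deviations of the values from a candidate center c."""
    return sum((x - c) ** 2 for x in values)

mean = statistics.fmean(data)

# Candidate centers on both sides of the mean.
candidates = [mean - 1.0, mean - 0.1, mean, mean + 0.1, mean + 1.0]
best = min(candidates, key=lambda c: sum_sq_dev(data, c))

print(mean, best)
```

Among the candidates tried, the mean itself gives the smallest sum, and moving away from it in either direction increases the sum.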

Median

$$\tilde{x} = \begin{cases} x_{((n+1)/2)} & \text{if } n \text{ is odd} \\ \frac{x_{(n/2)} + x_{(n/2+1)}}{2} & \text{if } n \text{ is even} \end{cases}$$

Here,

  • $x_{(k)}$ = The k-th value in sorted order

The median is the 50th percentile: the value separating the lower half from the upper half. It is robust to outliers and skewed distributions. A large gap between the mean and the median usually signals skew or outliers.

Mode

$$x^* = \arg\max_x f(x)$$

Here,

  • $x^*$ = The most frequent value
  • $f(x)$ = Probability mass/density function

The mode is the most frequently occurring value. A dataset can have zero, one, or multiple modes. It is especially useful for categorical data where means are undefined.
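As a rough illustration of the three measures of center, the sketch below uses Python's standard `statistics` module on a small hypothetical sample that contains one repeated value and one extreme value.

```python
import statistics

# Hypothetical sample: 3 is repeated, 100 is an extreme value.
data = [2, 3, 3, 5, 7, 100]

mean = statistics.mean(data)        # uses every value, so 100 pulls it upward
median = statistics.median(data)    # n is even: average of the two middle values
modes = statistics.multimode(data)  # list of all most-frequent values

print(mean, median, modes)
```

`multimode` returns a list, which fits the point that a dataset can have more than one mode.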

Weighted and Specialized Means

Weighted Mean

$$\bar{x}_w = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$

Here,

  • $w_i$ = Weight assigned to observation i (importance or reliability)
  • $x_i$ = Each individual observation

Use when observations have different importance: combining group means with different sample sizes, weighting by reliability, or applying time decay in time series.
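One of the cases above, combining group means with different sample sizes, might look like the following sketch. The group means and sizes are hypothetical.

```python
# Hypothetical group means and the number of observations behind each.
group_means = [70.0, 80.0, 90.0]
group_sizes = [10, 30, 60]

# Weighted mean: each group mean counts in proportion to its sample size.
weighted = sum(w * x for w, x in zip(group_sizes, group_means)) / sum(group_sizes)

# Unweighted mean of the group means, for comparison.
unweighted = sum(group_means) / len(group_means)

print(weighted, unweighted)
```

The weighted result sits closer to the mean of the largest group, which is what pooling the raw observations would also give.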

Geometric Mean

$$G = \left(\prod_{i=1}^{n} x_i\right)^{1/n} = \exp\left(\frac{1}{n}\sum_{i=1}^{n} \ln x_i\right)$$

Here,

  • $x_i$ = Each positive observation

The geometric mean is appropriate for multiplicative processes: growth rates, compound returns, and ratios. It satisfies $G \leq \bar{x}$ (AM-GM inequality) and requires positive values.
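Both forms of the formula can be computed directly. This sketch uses hypothetical positive values and compares the product form, the log form, and the standard library's `statistics.geometric_mean`.

```python
import math
import statistics

# Hypothetical positive values (the geometric mean requires x_i > 0).
values = [1.0, 2.0, 4.0]
n = len(values)

# Product form: n-th root of the product.
product_form = math.prod(values) ** (1 / n)

# Log form: exponential of the mean log. It is the safer choice for long
# sequences, where the raw product can overflow or underflow.
log_form = math.exp(sum(math.log(x) for x in values) / n)

library_value = statistics.geometric_mean(values)

print(product_form, log_form, library_value)
```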

Harmonic Mean

$$H = \frac{n}{\sum_{i=1}^{n} \frac{1}{x_i}}$$

Here,

  • $x_i$ = Each positive observation

The harmonic mean is the reciprocal of the arithmetic mean of reciprocals. It is useful for rates (speed = distance/time) and is the basis of the F1-score: $F_1 = 2 \cdot \text{Precision} \cdot \text{Recall} / (\text{Precision} + \text{Recall})$.
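The link between the F1-score and the harmonic mean can be checked with a short sketch. The precision and recall values are hypothetical, and `f1_score` here is a local helper, not a library function.

```python
import statistics

def f1_score(precision, recall):
    """F1 as written above; returns 0.0 when both inputs are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical classifier metrics.
precision, recall = 0.9, 0.5

f1 = f1_score(precision, recall)
hm = statistics.harmonic_mean([precision, recall])

print(f1, hm)
```

Both values agree, and both sit below the arithmetic mean of 0.7: the harmonic mean penalizes the weaker of the two metrics.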

The Means Inequality

For any dataset with positive values: $H \leq G \leq \bar{x}$ (harmonic ≤ geometric ≤ arithmetic), with equality only when all values are identical.
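A quick numerical check of the inequality, on hypothetical data, could look like this. The helper `three_means` is introduced only for this sketch.

```python
import statistics

def three_means(values):
    """Return (harmonic, geometric, arithmetic) means of positive values."""
    return (
        statistics.harmonic_mean(values),
        statistics.geometric_mean(values),
        statistics.fmean(values),
    )

# Hypothetical spread-out values: the three means differ.
h, g, a = three_means([1, 2, 4, 8])

# Identical values: the three means coincide (up to floating-point error).
h_eq, g_eq, a_eq = three_means([5, 5, 5])

print(h, g, a)
print(h_eq, g_eq, a_eq)
```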

Measures of Spread

Sample Variance

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$

Here,

  • $n-1$ = Bessel's correction (degrees of freedom)

The denominator $n-1$ corrects bias because $\bar{x}$ is computed from the same data, using one degree of freedom. This ensures $E[s^2] = \sigma^2$.
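The unbiasedness claim can be illustrated exactly on a tiny case. The sketch assumes a hypothetical population of three values and enumerates every ordered sample of size 2 drawn with replacement, then averages both variance estimators over all samples.

```python
import itertools
import statistics

# Hypothetical population; its variance (dividing by N) is 2/3.
population = [1, 2, 3]
sigma2 = statistics.pvariance(population)

# All 9 equally likely ordered samples of size 2, drawn with replacement.
samples = list(itertools.product(population, repeat=2))

# statistics.variance divides by n - 1; statistics.pvariance divides by n.
mean_unbiased = statistics.fmean([statistics.variance(s) for s in samples])
mean_biased = statistics.fmean([statistics.pvariance(s) for s in samples])

print(sigma2, mean_unbiased, mean_biased)
```

Averaged over all samples, the $n-1$ version recovers the population variance, while the version dividing by $n$ comes out too small by a factor of $(n-1)/n$, here one half.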

Standard Deviation

$$s = \sqrt{s^2}$$

Here,

  • $s$ = Sample standard deviation, expressed in the original units of measurement

For approximately normal distributions, about 68% of data falls within $\bar{x} \pm s$, 95% within $\bar{x} \pm 2s$, and 99.7% within $\bar{x} \pm 3s$ (the 68-95-99.7 rule).
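The percentages in the rule follow from the normal cumulative distribution function. This sketch uses `statistics.NormalDist` from the standard library; `share_within` is a helper introduced only for illustration.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)

def share_within(k):
    """Probability that a normal value lies within k standard deviations of its mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

coverage = {k: share_within(k) for k in (1, 2, 3)}

for k, p in coverage.items():
    print(f"within {k} sd: {p:.4f}")
```

These are properties of the exact normal distribution; real data that is only roughly normal will match them only roughly.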

Interquartile Range (IQR)

$$IQR = Q_3 - Q_1$$

Here,

  • $Q_1$ = 25th percentile
  • $Q_3$ = 75th percentile

The IQR measures the spread of the middle 50% of data. It is robust to outliers. The standard outlier rule: values outside $[Q_1 - 1.5 \cdot IQR,\ Q_3 + 1.5 \cdot IQR]$ are flagged as potential outliers.
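A minimal sketch of the IQR rule follows, using the purchase values from this lesson's Quick Example. Quartile conventions differ between tools; the choice of `method="inclusive"` (linear interpolation between data points) is an assumption of this sketch, and other conventions give slightly different fences.

```python
import statistics

data = [12, 15, 18, 20, 22, 25, 1500]

# Quartiles via linear interpolation between data points (assumed convention).
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Values outside the fences are candidates for closer inspection.
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(q1, q3, iqr, lower_fence, upper_fence, outliers)
```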

Median Absolute Deviation (MAD)

$$\text{MAD} = \text{median}(|x_i - \text{median}(x)|)$$

Here,

  • $\text{MAD}$ = Robust spread measure based on the median

For normal data, $\hat{\sigma} = 1.4826 \cdot \text{MAD}$ is a consistent estimator of the standard deviation, more robust than $s$ when outliers are present.
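To see the robustness in practice, the sketch below computes the MAD and the scaled estimate on the purchase values from this lesson's Quick Example and compares them with the ordinary sample standard deviation.

```python
import statistics

data = [12, 15, 18, 20, 22, 25, 1500]

center = statistics.median(data)

# Median of the absolute deviations from the median.
mad = statistics.median([abs(x - center) for x in data])

# Scaled MAD: comparable to a standard deviation under normality.
sigma_hat = 1.4826 * mad

# Ordinary sample standard deviation, dominated by the single extreme value.
s = statistics.stdev(data)

print(mad, sigma_hat, s)
```

The scaled MAD stays close to the spread of the six typical purchases, while the sample standard deviation is inflated by the one extreme value.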

Shape Measures

Skewness

$$\gamma_1 = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{x_i - \bar{x}}{s}\right)^3$$

Here,

  • $\gamma_1$ = Sample skewness: 0 means symmetric, > 0 right-skewed, < 0 left-skewed

Right-skewed (positive): typically Mean > Median > Mode. Common in income data. Left-skewed (negative): typically Mean < Median < Mode. Common in test scores with ceiling effects. These orderings are rules of thumb for unimodal data and can fail for unusual shapes.

Excess Kurtosis

$$\gamma_2 = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{x_i - \bar{x}}{s}\right)^4 - 3$$

Here,

  • $\gamma_2$ = Excess kurtosis: 0 means normal-like, > 0 heavy tails (leptokurtic), < 0 light tails (platykurtic)

The $-3$ subtraction makes the normal distribution have zero excess kurtosis. High excess kurtosis means more probability mass in the tails (often with a sharper peak), so extreme values occur more often than under a normal distribution.
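The skewness and excess kurtosis formulas above translate directly into code. The sketch follows them as written, with the sample standard deviation $s$ in the denominator; some libraries standardize by the population standard deviation instead, so their results may differ slightly. The data and function names are hypothetical.

```python
import statistics

def skewness(values):
    """Mean cubed standardized deviation, using the sample standard deviation s."""
    n = len(values)
    mean = statistics.fmean(values)
    s = statistics.stdev(values)
    return sum(((x - mean) / s) ** 3 for x in values) / n

def excess_kurtosis(values):
    """Mean fourth-power standardized deviation minus 3."""
    n = len(values)
    mean = statistics.fmean(values)
    s = statistics.stdev(values)
    return sum(((x - mean) / s) ** 4 for x in values) / n - 3

symmetric = [1, 2, 3, 4, 5]         # hypothetical, symmetric around 3
right_skewed = [1, 1, 2, 2, 3, 10]  # hypothetical, long right tail

print(skewness(symmetric), skewness(right_skewed))
print(excess_kurtosis(symmetric))
```

The symmetric sample has skewness zero and negative excess kurtosis (flat, light-tailed), while the sample with a long right tail has positive skewness.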


Quick Example

Choosing the Right Measure

Dataset of customer purchases: [12, 15, 18, 20, 22, 25, 1500].

  • Mean = 230.3 (inflated by outlier)
  • Median = 20 (robust to outlier)
  • The median is the appropriate center measure

For growth rates (+10%, -5%, +15%, +8%): the geometric mean of the growth factors (1.10, 0.95, 1.15, 1.08) gives 6.74% (correct average annual growth), not the arithmetic mean of 7.0%.

For a car traveling 100 km at 50 km/h and returning at 100 km/h: harmonic mean = 66.67 km/h (correct average speed), not the arithmetic mean of 75 km/h.
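The three calculations in this example can be recomputed with the standard library. This is a sketch for checking the arithmetic; the variable names are illustrative.

```python
import statistics

# Customer purchases with one extreme value.
purchases = [12, 15, 18, 20, 22, 25, 1500]
mean_purchase = statistics.fmean(purchases)     # about 230.3
median_purchase = statistics.median(purchases)  # 20

# Growth rates of +10%, -5%, +15%, +8% expressed as multiplicative factors.
growth_factors = [1.10, 0.95, 1.15, 1.08]
avg_growth = statistics.geometric_mean(growth_factors) - 1  # about 0.0674
naive_growth = statistics.fmean(growth_factors) - 1         # about 0.07

# Equal distances driven at 50 km/h and 100 km/h.
avg_speed = statistics.harmonic_mean([50, 100])             # about 66.67

print(round(mean_purchase, 1), median_purchase)
print(round(avg_growth * 100, 2), round(naive_growth * 100, 2))
print(round(avg_speed, 2))
```

The geometric mean is applied to growth factors (1 + rate), not to the rates themselves, because negative rates would otherwise break the positivity requirement.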

| Measure | Formula | Sensitive to Outliers? | Best For |
|---|---|---|---|
| Mean | $\frac{1}{n}\sum x_i$ | Yes | Symmetric, no outliers |
| Median | Middle value when sorted | No | Skewed data, outliers |
| Mode | Most frequent value | No | Categorical data |
| Geometric Mean | $(\prod x_i)^{1/n}$ | Less | Growth rates, ratios |
| Harmonic Mean | $n / \sum(1/x_i)$ | Less | Rates, F1-score |

Key Takeaways

Summary: Descriptive Statistics

  • Center: Mean for symmetric data without outliers. Median for skewed or outlier-heavy data. Mode for categorical data.
  • Spread: Variance and standard deviation for symmetric data. IQR and MAD for robust spread. Range is highly sensitive to outliers.
  • Shape: Skewness measures asymmetry. Kurtosis measures tail heaviness. Both inform model choice and preprocessing decisions.
  • Means Hierarchy: $H \leq G \leq \bar{x}$ for positive values. Use geometric for growth rates, harmonic for rates/ratios and F1-score.
  • Outlier Detection: IQR rule ($1.5 \times IQR$ beyond the quartiles) or MAD rule ($3 \times MAD$ from the median). Z-score thresholds assume approximately normal data.
  • ML Applications: StandardScaler uses mean/std. RobustScaler uses median/IQR. Always check skewness before scaling.
  • Bessel's Correction: Sample variance uses $n-1$ to correct bias from estimating $\mu$ with $\bar{x}$.
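The scaling point in the list above can be sketched without any ML library. The code below imitates mean/std scaling and median/IQR scaling by hand on the Quick Example purchases. It assumes the population standard deviation and interpolated 25th/75th percentiles, which is how scikit-learn's StandardScaler and RobustScaler are understood to behave by default; treat that correspondence as an assumption, not a specification.

```python
import statistics

values = [12, 15, 18, 20, 22, 25, 1500]

# Mean/std scaling (assumed: population standard deviation).
mean = statistics.fmean(values)
std = statistics.pstdev(values)
standard_scaled = [(x - mean) / std for x in values]

# Median/IQR scaling (assumed: linearly interpolated quartiles).
median = statistics.median(values)
q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
robust_scaled = [(x - median) / (q3 - q1) for x in values]

print([round(v, 3) for v in standard_scaled])
print([round(v, 3) for v in robust_scaled])
```

With mean/std scaling the single extreme value compresses the six typical purchases into a very narrow band, while median/IQR scaling keeps them spread out.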

Deep Dive

For detailed explanations, worked examples, and advanced theory, explore the dedicated statistics lessons:

Measures of Center

  • Measures of Central Tendency — Mean, median, mode compared with when each is appropriate and Python computation
  • Arithmetic Mean Deep Dive — Balance point, minimization of squared deviations, and properties
  • Median — Robust center, computation methods, and relationship to quartiles
  • Mode — Most frequent value, multimodality, and categorical data analysis
  • Weighted Mean — Assigning different importance to observations with examples
  • Geometric Mean — Multiplicative processes, compound growth, and AM-GM inequality
  • Harmonic Mean — Rates, ratios, F1-score connection, and when to use

Measures of Spread

  • Range and IQR — Robust spread measures, quartile computation, and outlier detection
  • Variance — Bessel's correction, computational formulas, population vs. sample, and properties
  • Standard Deviation — Interpreting spread in original units, the empirical rule

Measures of Shape

  • Skewness — Fisher-Pearson coefficient, right vs. left skew, and implications for the mean
  • Kurtosis — Excess kurtosis, tail behavior, leptokurtic vs. platykurtic distributions
