Descriptive Statistics
Why It Matters
Descriptive statistics are the foundation of all data analysis. Before running complex models, you must understand the distribution, central tendency, and variability of your data. These summary measures reveal patterns, detect outliers, guide feature engineering, and inform model selection. In machine learning, nearly every preprocessing step (normalization, scaling, outlier removal) relies on descriptive statistics. Without them, you are flying blind.
Overview
Descriptive statistics summarize a dataset with a few meaningful numbers. They answer three fundamental questions: Where is the center? How much do values vary? What is the shape? Measures of center (mean, median, mode) identify the typical value. Measures of spread (variance, standard deviation, IQR, range) quantify how dispersed observations are. Measures of shape (skewness, kurtosis) describe asymmetry and tail heaviness. Together, these three pillars give you a complete picture of a distribution. The choice of summary measure depends on the data's distribution shape and the presence of outliers; using the wrong one can be deeply misleading.
Key Concepts
Measures of Center
Arithmetic Mean
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
Here,
- $x_i$ = Each individual observation
- $n$ = Total number of observations
- $\bar{x}$ = Sample mean (balance point of the data)
The mean uses every data point but is sensitive to extreme values. It minimizes the sum of squared deviations: $\bar{x} = \arg\min_{c} \sum_{i=1}^{n} (x_i - c)^2$. It is the natural measure for symmetric distributions without outliers.
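The minimization property can be checked numerically. The sketch below compares the sum of squared deviations at the mean with the same sum at two nearby candidate centers; the data values and the helper name `sum_squared_deviations` are illustrative assumptions, not part of the lesson.

```python
from statistics import fmean


def sum_squared_deviations(data, c):
    """Sum of squared deviations of the data from a candidate center c."""
    return sum((x - c) ** 2 for x in data)


data = [2.0, 4.0, 4.0, 5.0, 10.0]  # illustrative values only
mean = fmean(data)

# Moving the center away from the mean in either direction increases the total.
loss_at_mean = sum_squared_deviations(data, mean)
loss_below = sum_squared_deviations(data, mean - 0.5)
loss_above = sum_squared_deviations(data, mean + 0.5)

print(mean, loss_at_mean, loss_below, loss_above)
```

Any other shift away from the mean gives the same qualitative result, because the total is a parabola in `c` with its vertex at the mean.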
Median
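$$\tilde{x} = \begin{cases} x_{((n+1)/2)} & \text{if } n \text{ is odd} \\ \dfrac{x_{(n/2)} + x_{(n/2+1)}}{2} & \text{if } n \text{ is even} \end{cases}$$
Here,
- $x_{(k)}$ = The k-th value in sorted order
The median is the 50th percentile: the value separating the lower half from the upper half. It is robust to outliers and skewed distributions. When the mean is significantly different from the median, the distribution is skewed.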
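A minimal sketch of the sorted-order definition, assuming the usual convention of averaging the two middle values when the count is even. The function name `median_sorted` and the sample values are hypothetical; `statistics.median` from the standard library is used only as a cross-check.

```python
from statistics import fmean, median


def median_sorted(data):
    """Median from sorted order: middle value, or mean of the two middle values."""
    s = sorted(data)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]
    return (s[mid - 1] + s[mid]) / 2


odd_sample = [7, 1, 3]
even_sample = [7, 1, 3, 5]
with_outlier = [1, 3, 5, 7, 1000]  # one extreme value

print(median_sorted(odd_sample), median_sorted(even_sample))
# The outlier drags the mean far from the bulk of the data; the median stays put.
print(median_sorted(with_outlier), fmean(with_outlier))
```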
Mode
$$\text{Mode} = \arg\max_{x} f(x)$$
Here,
- $\text{Mode}$ = The most frequent value
- $f(x)$ = Probability mass/density function
The mode is the most frequently occurring value. A dataset can have zero, one, or multiple modes. It is especially useful for categorical data where means are undefined.
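For categorical data, the standard library's `statistics.multimode` returns every value tied for the highest count, which makes multimodality visible. The color labels below are hypothetical.

```python
from statistics import multimode

colors = ["red", "blue", "red", "green", "blue"]  # hypothetical categorical data

modes = multimode(colors)         # all values tied for the highest frequency
empty_modes = multimode([])       # an empty dataset has no mode
all_tied = multimode([1, 2, 3])   # every value appears once, so all are returned

print(modes, empty_modes, all_tied)
```

Whether a dataset in which every value appears once has "no mode" or "every value as a mode" is a matter of convention; `multimode` takes the second view.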
Weighted and Specialized Means
Weighted Mean
$$\bar{x}_w = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$
Here,
- $w_i$ = Weight assigned to observation i (importance or reliability)
- $x_i$ = Each individual observation
Use when observations have different importance: combining group means with different sample sizes, weighting by reliability, or applying time decay in time series.
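As an illustration of the first use case, the sketch below combines two group means using the group sizes as weights. The function name `weighted_mean` and the numbers are assumptions chosen for the example.

```python
def weighted_mean(values, weights):
    """Weighted mean: sum of w_i * x_i divided by the sum of the weights."""
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for w, x in zip(weights, values)) / total_weight


group_means = [70.0, 80.0]  # hypothetical mean score of each group
group_sizes = [10, 30]      # hypothetical number of observations per group

overall_mean = weighted_mean(group_means, group_sizes)
naive_mean = sum(group_means) / len(group_means)

print(overall_mean, naive_mean)
```

The unweighted average of the two group means (75.0) understates the overall mean (77.5) because it ignores that the second group is three times larger.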
Geometric Mean
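$$\text{GM} = \left(\prod_{i=1}^{n} x_i\right)^{1/n} = \exp\left(\frac{1}{n}\sum_{i=1}^{n} \ln x_i\right)$$
Here,
- $x_i$ = Each positive observation
The geometric mean is appropriate for multiplicative processes: growth rates, compound returns, and ratios. It satisfies $\text{GM} \le \text{AM}$, where $\text{AM}$ is the arithmetic mean (AM-GM inequality), and requires positive values.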
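To see why multiplicative processes call for the geometric mean, consider a hypothetical value that doubles in one period and halves in the next. The sketch uses `statistics.geometric_mean` from the standard library; the growth factors are made up for illustration.

```python
from statistics import fmean, geometric_mean

growth_factors = [2.0, 0.5]  # hypothetical: value doubles, then halves

gm = geometric_mean(growth_factors)  # average factor per period
am = fmean(growth_factors)           # arithmetic mean of the same factors

# Compounding the actual factors shows there was no net growth.
final_value = 100.0
for factor in growth_factors:
    final_value *= factor

print(gm, am, final_value)
```

The geometric mean of about 1.0 matches the fact that the value ends where it started, while the arithmetic mean of 1.25 would wrongly suggest 25% growth per period.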
Harmonic Mean
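$$\text{HM} = \frac{n}{\sum_{i=1}^{n} \frac{1}{x_i}}$$
Here,
- $x_i$ = Each positive observation
The harmonic mean is the reciprocal of the arithmetic mean of reciprocals. It is useful for rates (speed = distance/time) and is the basis of the F1-score: $F_1 = \dfrac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$.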
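The F1 connection can be verified directly: the usual F1 formula gives the same number as the harmonic mean of precision and recall. The precision and recall values below are hypothetical, and `f1_score` here is a local helper, not a library function.

```python
from statistics import harmonic_mean


def f1_score(precision, recall):
    """F1 as 2PR / (P + R); returns 0.0 when both inputs are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


precision, recall = 0.5, 1.0  # hypothetical classifier metrics

f1 = f1_score(precision, recall)
hm = harmonic_mean([precision, recall])

print(f1, hm)  # both are about 0.667
```

Because the harmonic mean is pulled toward the smaller value, a model cannot reach a high F1 by excelling at only one of precision or recall.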
The Means Inequality
For any dataset with positive values: $\text{HM} \le \text{GM} \le \text{AM}$ (harmonic ≤ geometric ≤ arithmetic), with equality only when all values are identical.
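A quick numerical check of the ordering, using the standard library's `harmonic_mean`, `geometric_mean`, and `fmean`. The helper name `three_means` and both datasets are illustrative assumptions.

```python
from statistics import fmean, geometric_mean, harmonic_mean


def three_means(data):
    """Return (harmonic, geometric, arithmetic) means of positive data."""
    return harmonic_mean(data), geometric_mean(data), fmean(data)


hm, gm, am = three_means([1.0, 2.0, 4.0])           # unequal values
hm_eq, gm_eq, am_eq = three_means([3.0, 3.0, 3.0])  # identical values

print(hm, gm, am)           # strictly increasing
print(hm_eq, gm_eq, am_eq)  # all equal to 3, up to rounding
```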
Measures of Spread
Sample Variance
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$$
Here,
- $n-1$ = Bessel's correction (degrees of freedom)
The denominator $n-1$ corrects bias because $\bar{x}$ is computed from the same data, using one degree of freedom. This ensures $E[s^2] = \sigma^2$.
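The unbiasedness claim can be checked exactly on a toy case. The sketch below enumerates every possible sample of size 2 drawn with replacement from a small hypothetical population and averages the sample variances under both denominators. The `variance` helper and its `ddof` parameter are local to this example.

```python
from itertools import product
from statistics import fmean


def variance(sample, ddof):
    """Variance with denominator n - ddof (ddof=1 applies Bessel's correction)."""
    n = len(sample)
    mean = sum(sample) / n
    return sum((x - mean) ** 2 for x in sample) / (n - ddof)


population = [1, 2, 3, 4]                # hypothetical population
sigma2 = variance(population, ddof=0)    # true population variance

# All 16 ordered samples of size 2, drawn with replacement.
samples = list(product(population, repeat=2))

mean_corrected = fmean([variance(s, ddof=1) for s in samples])
mean_uncorrected = fmean([variance(s, ddof=0) for s in samples])

print(sigma2, mean_corrected, mean_uncorrected)
```

Averaged over all samples, the corrected estimator recovers the population variance of 1.25, while dividing by $n$ gives 0.625, an underestimate by the factor $(n-1)/n$.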
Standard Deviation
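$$s = \sqrt{s^2} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2}$$
Here,
- $s$ = Sample standard deviation (returns to original units of measurement)
For approximately normal distributions, about 68% of data falls within $\mu \pm 1\sigma$, 95% within $\mu \pm 2\sigma$, and 99.7% within $\mu \pm 3\sigma$ (the 68-95-99.7 rule).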
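The percentages in the rule follow from the normal cumulative distribution function. A short sketch using `statistics.NormalDist` from the standard library; the helper name `coverage_within` is an assumption.

```python
from statistics import NormalDist


def coverage_within(k, dist=NormalDist()):
    """Probability that a normal variable lies within k standard deviations of its mean."""
    upper = dist.cdf(dist.mean + k * dist.stdev)
    lower = dist.cdf(dist.mean - k * dist.stdev)
    return upper - lower


coverages = {k: coverage_within(k) for k in (1, 2, 3)}

for k, p in coverages.items():
    print(f"within {k} standard deviation(s): {p:.4f}")
```

The exact values are roughly 0.6827, 0.9545, and 0.9973, which the rule rounds to 68%, 95%, and 99.7%.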
Interquartile Range (IQR)
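$$\text{IQR} = Q_3 - Q_1$$
Here,
- $Q_1$ = 25th percentile
- $Q_3$ = 75th percentile
The IQR measures the spread of the middle 50% of data. It is robust to outliers. The standard outlier rule: values outside $[Q_1 - 1.5 \cdot \text{IQR},\ Q_3 + 1.5 \cdot \text{IQR}]$ are outliers.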
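A minimal sketch of the 1.5 × IQR rule, applied to the customer purchase values used later in this lesson. It relies on `statistics.quantiles` with its default quartile convention; other tools interpolate quartiles differently, so the fences can shift slightly. The function name `iqr_outliers` is hypothetical.

```python
from statistics import quantiles


def iqr_outliers(data, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] together with the fences."""
    q1, _, q3 = quantiles(data, n=4)  # default 'exclusive' quartile method
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    flagged = [x for x in data if x < lower or x > upper]
    return flagged, (lower, upper)


purchases = [12, 15, 18, 20, 22, 25, 1500]
outliers, fences = iqr_outliers(purchases)

print(outliers, fences)
```

With this convention the quartiles are 15 and 25, so the fences are 0 and 40 and only the 1500 purchase is flagged.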
Median Absolute Deviation (MAD)
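$$\text{MAD} = \text{median}\left(\left|x_i - \text{median}(x)\right|\right)$$
Here,
- $\text{MAD}$ = Robust spread measure based on the median
For normal data, $1.4826 \cdot \text{MAD}$ is a consistent estimator of the standard deviation, more robust than $s$ when outliers are present.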
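The robustness is easy to see on data with one extreme value. The sketch below computes the MAD and its scaled version for the purchase values used later in this lesson and compares them with the sample standard deviation. The function name `mad` is an assumption, and 1.4826 is the usual normal-consistency constant.

```python
from statistics import median, stdev


def mad(data):
    """Median absolute deviation from the median."""
    center = median(data)
    return median([abs(x - center) for x in data])


purchases = [12, 15, 18, 20, 22, 25, 1500]

raw_mad = mad(purchases)
scaled_mad = 1.4826 * raw_mad  # comparable to a standard deviation for normal data
sample_sd = stdev(purchases)   # dominated by the single extreme value

print(raw_mad, scaled_mad, sample_sd)
```

The scaled MAD of about 7.4 describes the spread of the six typical purchases, while the standard deviation is in the hundreds because of the one extreme value.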
Shape Measures
Skewness
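$$g_1 = \frac{m_3}{m_2^{3/2}}, \qquad m_k = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^k$$
Here,
- $g_1$ = 0 for symmetric data, > 0 for right-skewed, < 0 for left-skewed
Right-skewed (positive): typically Mean > Median > Mode. Common in income data. Left-skewed (negative): typically Mean < Median < Mode. Common in test scores with ceiling effects.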
Excess Kurtosis
$$g_2 = \frac{m_4}{m_2^{2}} - 3, \qquad m_k = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^k$$
Here,
- $g_2$ = 0 for normal-like tails, > 0 for heavy tails (leptokurtic), < 0 for light tails (platykurtic)
The subtraction of 3 makes the normal distribution have zero excess kurtosis. High kurtosis means more probability in the tails and center, leading to more outliers.
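Both shape measures can be computed from central moments. The sketch below assumes the simple moment-based (uncorrected) definitions, skewness as $m_3 / m_2^{3/2}$ and excess kurtosis as $m_4 / m_2^2 - 3$; statistical packages may apply small-sample corrections, so their output can differ. The function names are hypothetical.

```python
def central_moment(data, k):
    """k-th central moment: average of (x - mean)**k."""
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** k for x in data) / n


def skewness(data):
    """Moment-based skewness m3 / m2**1.5 (no small-sample correction)."""
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5


def excess_kurtosis(data):
    """Moment-based excess kurtosis m4 / m2**2 - 3 (no small-sample correction)."""
    return central_moment(data, 4) / central_moment(data, 2) ** 2 - 3.0


symmetric = [1, 2, 3, 4, 5]                   # evenly spaced, no tails
right_skewed = [12, 15, 18, 20, 22, 25, 1500]  # one large value on the right

print(skewness(symmetric), excess_kurtosis(symmetric))
print(skewness(right_skewed), excess_kurtosis(right_skewed))
```

The evenly spaced data has zero skewness and negative excess kurtosis (light tails), while the dataset with one very large value has positive skewness and positive excess kurtosis.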
Quick Example
Choosing the Right Measure
Dataset of customer purchases: [12, 15, 18, 20, 22, 25, 1500].
- Mean = 230.3 (inflated by outlier)
- Median = 20 (robust to outlier)
- The median is the appropriate center measure
For growth rates (+10%, -5%, +15%, +8%): geometric mean = 6.74% (correct average annual growth, from $(1.10 \times 0.95 \times 1.15 \times 1.08)^{1/4} - 1$), not the arithmetic mean of 7.0%.
For a car traveling 100 km at 50 km/h and returning at 100 km/h: harmonic mean = 66.67 km/h (correct average speed), not the arithmetic mean of 75 km/h.
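The three comparisons in this example can be recomputed with Python's standard `statistics` module. This is a sketch for checking the arithmetic; the variable names are illustrative.

```python
from statistics import fmean, geometric_mean, harmonic_mean, median

# Customer purchases: one extreme value inflates the mean but not the median.
purchases = [12, 15, 18, 20, 22, 25, 1500]
mean_purchase = fmean(purchases)     # about 230.3
median_purchase = median(purchases)  # 20

# Growth rates: average the growth factors (1 + r) geometrically, then subtract 1.
growth_rates = [0.10, -0.05, 0.15, 0.08]
growth_factors = [1 + r for r in growth_rates]
avg_growth = geometric_mean(growth_factors) - 1  # about 0.0674
naive_growth = fmean(growth_rates)               # about 0.07

# Equal distances at 50 km/h and 100 km/h: the harmonic mean gives the true average speed.
speeds = [50, 100]
avg_speed = harmonic_mean(speeds)  # about 66.67
total_distance_over_time = 200 / (100 / 50 + 100 / 100)

print(mean_purchase, median_purchase)
print(avg_growth, naive_growth)
print(avg_speed, total_distance_over_time)
```

The last line confirms the harmonic mean against the direct calculation: 200 km covered in 3 hours.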
| Measure | Formula | Sensitive to Outliers? | Best For |
|---|---|---|---|
| Mean | $\frac{1}{n}\sum x_i$ | Yes | Symmetric, no outliers |
| Median | Middle value when sorted | No | Skewed data, outliers |
| Mode | Most frequent value | No | Categorical data |
| Geometric Mean | $\left(\prod x_i\right)^{1/n}$ | Less | Growth rates, ratios |
| Harmonic Mean | $n / \sum (1/x_i)$ | Less | Rates, F1-score |
Key Takeaways
Summary: Descriptive Statistics
- Center: Mean for symmetric data without outliers. Median for skewed or outlier-heavy data. Mode for categorical data.
- Spread: Variance and standard deviation for symmetric data. IQR and MAD for robust spread. Range is highly sensitive to outliers.
- Shape: Skewness measures asymmetry. Kurtosis measures tail heaviness. Both inform model choice and preprocessing decisions.
- Means Hierarchy: $\text{HM} \le \text{GM} \le \text{AM}$ for positive values. Use geometric for growth rates, harmonic for rates/ratios and F1-score.
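- Outlier Detection: IQR rule (values outside $[Q_1 - 1.5 \cdot \text{IQR},\ Q_3 + 1.5 \cdot \text{IQR}]$) or MAD rule (distance from the median measured in units of $1.4826 \cdot \text{MAD}$). Z-scores are reliable only for approximately normal data, because the mean and standard deviation they use are themselves distorted by outliers.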
- ML Applications: StandardScaler uses mean/std. RobustScaler uses median/IQR. Always check skewness before scaling.
- Bessel's Correction: Sample variance uses $n-1$ to correct bias from estimating $\mu$ with $\bar{x}$.
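To make the scaling point concrete, here is a plain-Python sketch of the two ideas: centering by the mean and dividing by the standard deviation, versus centering by the median and dividing by the IQR. It illustrates the statistics involved and is not scikit-learn's implementation; the function names, the use of the population standard deviation, and the quartile convention are assumptions.

```python
from statistics import fmean, median, pstdev, quantiles


def standard_scale(data):
    """Center by the mean and divide by the (population) standard deviation."""
    mu, sigma = fmean(data), pstdev(data)
    return [(x - mu) / sigma for x in data]


def robust_scale(data):
    """Center by the median and divide by the interquartile range."""
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    center, iqr = median(data), q3 - q1
    return [(x - center) / iqr for x in data]


purchases = [12, 15, 18, 20, 22, 25, 1500]
standardized = standard_scale(purchases)
robust = robust_scale(purchases)

# Spread of the six typical purchases after each transformation.
typical_spread_standard = max(standardized[:6]) - min(standardized[:6])
typical_spread_robust = max(robust[:6]) - min(robust[:6])

print(typical_spread_standard, typical_spread_robust)
```

With one extreme value in the data, mean/std scaling squeezes the six typical purchases into a very narrow band, while median/IQR scaling keeps them well separated.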
Deep Dive
For detailed explanations, worked examples, and advanced theory, explore the dedicated statistics lessons:
Measures of Center
- Measures of Central Tendency: Mean, median, mode compared with when each is appropriate and Python computation
- Arithmetic Mean Deep Dive: Balance point, minimization of squared deviations, and properties
- Median: Robust center, computation methods, and relationship to quartiles
- Mode: Most frequent value, multimodality, and categorical data analysis
- Weighted Mean: Assigning different importance to observations with examples
- Geometric Mean: Multiplicative processes, compound growth, and AM-GM inequality
- Harmonic Mean: Rates, ratios, F1-score connection, and when to use
Measures of Spread
- Range and IQR: Robust spread measures, quartile computation, and outlier detection
- Variance: Bessel's correction, computational formulas, population vs. sample, and properties
- Standard Deviation: Interpreting spread in original units, the empirical rule
Measures of Shape
- Skewness: Fisher-Pearson coefficient, right vs. left skew, and implications for the mean
- Kurtosis: Excess kurtosis, tail behavior, leptokurtic vs. platykurtic distributions
Related Topics
- Normal Distribution: The bell curve and why skewness/kurtosis matter
- Standard Error: How sample means vary across samples
- Percentiles and Quartiles: Computing and interpreting quantiles
- Box Plots: Visualizing the five-number summary and detecting outliers