Statistics 101: Mean, Median, Variance, Distributions

Why Statistics Matters in Data Science

Statistics is the mathematical foundation of data science. It provides the tools to summarize data, quantify uncertainty, and make inferences about populations from samples.

Architecture Diagram

Raw Data → Descriptive Statistics → Inference → Decisions
    │              │                     │           │
    │     Summarize & Describe    Draw Conclusions  Act
    └──────────────────────────────────────────────┘

Measures of Central Tendency

Mean (Arithmetic Average)

The mean is the sum of all values divided by the number of values.

\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i

Weighted Mean

\bar{x}_w = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}

Here,

$x̄w$ =Weighted sample mean
$wᵢ$ =Weight assigned to observation i
$xᵢ$ =Individual data values
$n$ =Number of values

import numpy as np
import pandas as pd
from scipy import stats

# Calculate mean
data = np.array([23, 45, 12, 67, 34, 89, 21, 56, 78, 43])

# Python manual calculation
mean_manual = sum(data) / len(data)
print(f"Mean (manual): {mean_manual}")

# NumPy
mean_numpy = np.mean(data)
print(f"Mean (numpy): {mean_numpy}")

# Pandas
mean_pandas = pd.Series(data).mean()
print(f"Mean (pandas): {mean_pandas}")

# Weighted mean
weights = np.array([1, 2, 1, 3, 2, 1, 1, 2, 3, 2])
weighted_mean = np.average(data, weights=weights)
print(f"Weighted Mean: {weighted_mean}")

# Trimmed mean (remove extreme values)
trimmed_mean = stats.trim_mean(data, 0.1)  # Remove 10% from each end
print(f"Trimmed Mean: {trimmed_mean}")

💡 When to Use Weighted vs Trimmed Mean

Weighted mean: Use when observations have different importance (e.g., averaging course grades with different credit hours)
Trimmed mean: Use when data has outliers that would distort the arithmetic mean (e.g., income data with billionaires)

Median (Middle Value)

The median is the middle value when data is sorted. It's robust to outliers.

# Calculate median
data = np.array([23, 45, 12, 67, 34, 89, 21, 56, 78, 43])

# Odd number of values
data_odd = np.array([23, 45, 12, 67, 34, 89, 21, 56, 78])
median_odd = np.median(data_odd)
print(f"Median (odd): {median_odd}")  # 43

# Even number of values
median_even = np.median(data)
print(f"Median (even): {median_even}")  # 44.0

# Pandas
median_pandas = pd.Series(data).median()
print(f"Median (pandas): {median_pandas}")

# Quantiles (percentiles)
percentiles = [25, 50, 75]
for p in percentiles:
    q = np.percentile(data, p)
    print(f"{p}th percentile: {q}")

# Quartiles
Q1 = np.percentile(data, 25)
Q2 = np.percentile(data, 50)  # Median
Q3 = np.percentile(data, 75)
IQR = Q3 - Q1
print(f"\nQ1: {Q1}, Q2: {Q2}, Q3: {Q3}")
print(f"IQR: {IQR}")

Mode (Most Frequent Value)

from scipy import stats

data = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5]

# Scipy mode
mode_result = stats.mode(data)
print(f"Mode: {mode_result.mode}")
print(f"Count: {mode_result.count}")

# Pandas mode
mode_pandas = pd.Series(data).mode()
print(f"Mode (pandas): {mode_pandas}")

# Multiple modes
data_multi = [1, 1, 2, 2, 3, 3]  # All are modes
modes = pd.Series(data_multi).mode()
print(f"Multiple modes: {modes.values}")

ℹ️ Mode for Categorical Data

The mode is the only measure of central tendency applicable to nominal (categorical) data. You cannot compute a mean or median of eye color, but you can report the most common color.

Comparing Mean, Median, Mode

import matplotlib.pyplot as plt
import numpy as np

# Create skewed data
np.random.seed(42)
right_skewed = np.random.exponential(2, 1000)
left_skewed = -np.random.exponential(2, 1000)
normal = np.random.normal(0, 1, 1000)

fig, axes = plt.subplots(1, 3, figsize=(15, 5))

datasets = [right_skewed, normal, left_skewed]
titles = ['Right Skewed', 'Normal', 'Left Skewed']

for ax, data, title in zip(axes, datasets, titles):
    ax.hist(data, bins=30, edgecolor='black', alpha=0.7, density=True)
    
    mean = np.mean(data)
    median = np.median(data)
    mode = stats.mode(np.round(data, 1)).mode
    
    ax.axvline(mean, color='r', linestyle='--', label=f'Mean: {mean:.2f}')
    ax.axvline(median, color='g', linestyle='-', label=f'Median: {median:.2f}')
    ax.set_title(title)
    ax.legend()

plt.tight_layout()
plt.show()

ℹ️ When to Use Each

Right Skewed: Mean > Median > Mode — Use MEDIAN (e.g., income data)
Left Skewed: Mode > Median > Mean — Use MEDIAN (e.g., age at retirement)
Symmetric: Mean ≈ Median ≈ Mode — Use MEAN (e.g., height, test scores)

Measures of Spread

Variance and Standard Deviation

Variance measures how spread out data is from the mean. It is the average squared deviation.

Population Variance

\sigma^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \mu)^2

Here,

$σ²$ =Population variance
$μ$ =Population mean
$n$ =Population size
$xᵢ$ =Individual data values

Sample Variance (Bessel's Correction)

s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2

Here,

$s²$ =Sample variance
$x̄$ =Sample mean
$n$ =Sample size
$xᵢ$ =Individual data values

s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}

💡 Why n-1 Instead of n? (Bessel's Correction)

When estimating variance from a sample, dividing by n systematically underestimates the true population variance because the sample mean x̄ is computed FROM the sample, making it closer to sample data points than the true population mean μ. Dividing by n-1 compensates for this bias by accounting for the one degree of freedom lost when estimating μ with x̄. This makes s² an unbiased estimator of σ².

Standard Deviation = sqrt(Variance), converts back to original units.

Coefficient of Variation (CV)

Measures relative variability — useful for comparing spread across datasets with different units or means.

Coefficient of Variation

CV = \frac{\sigma}{\mu} \times 100\%

Here,

$CV$ =Coefficient of Variation (percentage)
$σ$ =Standard deviation
$μ$ =Mean

data = np.array([23, 45, 12, 67, 34, 89, 21, 56, 78, 43])

# Population variance (ddof=0)
var_pop = np.var(data, ddof=0)
std_pop = np.std(data, ddof=0)
print(f"Population Variance: {var_pop:.2f}")
print(f"Population Std Dev: {std_pop:.2f}")

# Sample variance (ddof=1) - use this for samples!
var_sample = np.var(data, ddof=1)
std_sample = np.std(data, ddof=1)
print(f"Sample Variance: {var_sample:.2f}")
print(f"Sample Std Dev: {std_sample:.2f}")

# Pandas (uses sample variance by default)
series = pd.Series(data)
print(f"Pandas Variance: {series.var():.2f}")
print(f"Pandas Std Dev: {series.std():.2f}")

# Manual calculation
mean = np.mean(data)
var_manual = np.sum((data - mean) ** 2) / (len(data) - 1)
print(f"\nManual Variance: {var_manual:.2f}")

# Coefficient of Variation (relative variability)
cv = std_sample / mean * 100
print(f"Coefficient of Variation: {cv:.2f}%")

Range and IQR

data = np.array([23, 45, 12, 67, 34, 89, 21, 56, 78, 43])

# Range
range_val = np.max(data) - np.min(data)
print(f"Range: {range_val}")

# IQR (Interquartile Range)
Q1 = np.percentile(data, 25)
Q3 = np.percentile(data, 75)
IQR = Q3 - Q1
print(f"Q1: {Q1}, Q3: {Q3}")
print(f"IQR: {IQR}")

# Five-number summary
five_num = {
    'Min': np.min(data),
    'Q1': Q1,
    'Median': np.median(data),
    'Q3': Q3,
    'Max': np.max(data)
}
print(f"\nFive-Number Summary:")
for k, v in five_num.items():
    print(f"  {k}: {v:.2f}")

# Box plot interpretation
"""
Box Plot Anatomy:
       │
   ────┤──── Q3 + 1.5*IQR (Upper Fence)
       │
   ┌───┴───┐
   │   ┬   │ ← Q3 (75th percentile)
   │   │   │
   │   ┴   │ ← Median (50th percentile)
   │       │ ← Q1 (25th percentile)
   └───┬───┘
       │
   ────┤──── Q1 - 1.5*IQR (Lower Fence)
       │
       • ← Outliers
"""

Probability Distributions

Normal Distribution (Gaussian)

The most important distribution in statistics.

f(x) = \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}

import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

# Parameters
mu = 0      # Mean
sigma = 1   # Standard deviation

# Create range
x = np.linspace(-4, 4, 1000)

# PDF (Probability Density Function)
pdf = stats.norm.pdf(x, mu, sigma)

# CDF (Cumulative Distribution Function)
cdf = stats.norm.cdf(x, mu, sigma)

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# PDF
axes[0].plot(x, pdf, 'b-', linewidth=2)
axes[0].fill_between(x, pdf, alpha=0.3)
axes[0].set_title('Normal Distribution PDF')
axes[0].set_xlabel('x')
axes[0].set_ylabel('Probability Density')

# CDF
axes[1].plot(x, cdf, 'r-', linewidth=2)
axes[1].set_title('Normal Distribution CDF')
axes[1].set_xlabel('x')
axes[1].set_ylabel('Cumulative Probability')

plt.tight_layout()
plt.show()

# Generate normal samples
np.random.seed(42)
samples = np.random.normal(mu, sigma, 10000)

# Verify properties
print(f"Sample Mean: {np.mean(samples):.4f} (should be ~{mu})")
print(f"Sample Std: {np.std(samples):.4f} (should be ~{sigma})")

# 68-95-99.7 Rule
within_1sigma = np.sum(np.abs(samples - mu) < sigma) / len(samples)
within_2sigma = np.sum(np.abs(samples - mu) < 2*sigma) / len(samples)
within_3sigma = np.sum(np.abs(samples - mu) < 3*sigma) / len(samples)

print(f"\n68-95-99.7 Rule:")
print(f"Within 1σ: {within_1sigma*100:.1f}% (expected: 68.3%)")
print(f"Within 2σ: {within_2sigma*100:.1f}% (expected: 95.4%)")
print(f"Within 3σ: {within_3sigma*100:.1f}% (expected: 99.7%)")

# Z-scores
data = np.array([23, 45, 12, 67, 34, 89, 21, 56, 78, 43])
z_scores = (data - np.mean(data)) / np.std(data)
print(f"\nZ-scores: {z_scores}")
print(f"Values within ±2 Z-scores: {np.sum(np.abs(z_scores) < 2)}/{len(data)}")

Z-Score (Standard Score)

z = \frac{x - \mu}{\sigma}

Here,

$z$ =Number of standard deviations from the mean
$x$ =Raw data value
$μ$ =Population mean
$σ$ =Population standard deviation

ℹ️ The 68-95-99.7 Empirical Rule

For a normal distribution:

68.27% of data falls within ±1σ of the mean
95.45% within ±2σ
99.73% within ±3σ This rule is used for outlier detection (values beyond ±3σ are rare), quality control (Six Sigma), and hypothesis testing.

Other Important Distributions

# 1. Uniform Distribution
x = np.linspace(0, 1, 1000)
uniform_pdf = stats.uniform.pdf(x)
plt.figure(figsize=(10, 5))
plt.plot(x, uniform_pdf, 'b-', linewidth=2)
plt.fill_between(x, uniform_pdf, alpha=0.3)
plt.title('Uniform Distribution')
plt.show()

# 2. Binomial Distribution
n_trials = 10
p_success = 0.5
x = np.arange(0, n_trials + 1)
binom_pmf = stats.binom.pmf(x, n_trials, p_success)
plt.figure(figsize=(10, 5))
plt.bar(x, binom_pmf, color='skyblue', edgecolor='black')
plt.title(f'Binomial Distribution (n={n_trials}, p={p_success})')
plt.xlabel('Number of Successes')
plt.ylabel('Probability')
plt.show()

# 3. Poisson Distribution
lambda_param = 5
x = np.arange(0, 15)
poisson_pmf = stats.poisson.pmf(x, lambda_param)
plt.figure(figsize=(10, 5))
plt.bar(x, poisson_pmf, color='lightgreen', edgecolor='black')
plt.title(f'Poisson Distribution (λ={lambda_param})')
plt.xlabel('Number of Events')
plt.ylabel('Probability')
plt.show()

# 4. Exponential Distribution
lambda_exp = 1
x = np.linspace(0, 5, 1000)
exp_pdf = stats.expon.pdf(x, scale=1/lambda_exp)
plt.figure(figsize=(10, 5))
plt.plot(x, exp_pdf, 'r-', linewidth=2)
plt.fill_between(x, exp_pdf, alpha=0.3)
plt.title(f'Exponential Distribution (λ={lambda_exp})')
plt.show()

# 5. Chi-Square Distribution
df = 5
x = np.linspace(0, 15, 1000)
chi2_pdf = stats.chi2.pdf(x, df)
plt.figure(figsize=(10, 5))
plt.plot(x, chi2_pdf, 'g-', linewidth=2)
plt.fill_between(x, chi2_pdf, alpha=0.3)
plt.title(f'Chi-Square Distribution (df={df})')
plt.show()

# 6. T-Distribution
df_t = 5
x = np.linspace(-4, 4, 1000)
t_pdf = stats.t.pdf(x, df_t)
normal_pdf = stats.norm.pdf(x)

plt.figure(figsize=(10, 5))
plt.plot(x, normal_pdf, 'b-', label='Normal', linewidth=2)
plt.plot(x, t_pdf, 'r--', label=f't (df={df_t})', linewidth=2)
plt.legend()
plt.title('Normal vs t-Distribution')
plt.show()

💡 When to Use Which Distribution

Normal: Heights, test scores, measurement errors (continuous, symmetric)
Binomial: Number of successes in n independent trials (discrete)
Poisson: Count of events in a fixed interval (discrete, rare events)
Exponential: Time between events in a Poisson process (continuous)
t-Distribution: Small-sample inference about means (continuous, heavy tails)
Chi-Square: Tests of independence, goodness of fit (continuous, right-skewed)

Skewness and Kurtosis

from scipy.stats import skew, kurtosis

data = np.random.exponential(2, 1000)  # Right-skewed data

skewness = skew(data)
kurt = kurtosis(data)

print(f"Skewness: {skewness:.3f}")
print(f"Kurtosis: {kurt:.3f}")

# Interpretation:
"""
Skewness:
- Skewness = 0: Symmetric
- Skewness > 0: Right-skewed (positive skew)
- Skewness < 0: Left-skewed (negative skew)

Kurtosis:
- Kurtosis = 0 (or 3): Normal (mesokurtic)
- Kurtosis > 0: Heavy tails (leptokurtic)
- Kurtosis < 0: Light tails (platykurtic)
"""

# Visual comparison
np.random.seed(42)
distributions = {
    'Normal': np.random.normal(0, 1, 1000),
    'Uniform': np.random.uniform(-2, 2, 1000),
    'Exponential': np.random.exponential(1, 1000),
    'Laplace': np.random.laplace(0, 1, 1000)
}

fig, axes = plt.subplots(2, 2, figsize=(12, 10))
for ax, (name, data) in zip(axes.flatten(), distributions.items()):
    ax.hist(data, bins=30, density=True, alpha=0.7, edgecolor='black')
    ax.set_title(f'{name}\nSkew: {skew(data):.2f}, Kurt: {kurtosis(data):.2f}')

plt.tight_layout()
plt.show()

Sample Skewness

g_1 = \frac{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^3}{\left(\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2\right)^{3/2}}

Here,

$g₁$ =Sample skewness
$x̄$ =Sample mean
$n$ =Sample size

Correlation Analysis

# Pearson Correlation (linear relationships)
np.random.seed(42)
x = np.random.randn(100)
y = x * 2 + np.random.randn(100) * 0.5

pearson_corr = np.corrcoef(x, y)[0, 1]
print(f"Pearson Correlation: {pearson_corr:.3f}")

# Spearman Correlation (monotonic relationships)
from scipy.stats import spearmanr

# Non-linear but monotonic
x_nonlinear = np.random.uniform(0, 10, 100)
y_nonlinear = np.exp(x_nonlinear) + np.random.randn(100) * 10

spearman_corr, p_value = spearmanr(x_nonlinear, y_nonlinear)
print(f"\nSpearman Correlation: {spearman_corr:.3f}")
print(f"P-value: {p_value:.6f}")

# Correlation matrix
import seaborn as sns
import pandas as pd

df = pd.DataFrame({
    'height': np.random.normal(170, 10, 100),
    'weight': np.random.normal(70, 15, 100),
    'age': np.random.randint(18, 65, 100),
    'income': np.random.normal(50000, 15000, 100)
})

# Add some correlations
df['weight'] = df['height'] * 0.5 + np.random.randn(100) * 5
df['income'] = df['age'] * 1000 + np.random.randn(100) * 5000

corr_matrix = df.corr()
print("\nCorrelation Matrix:")
print(corr_matrix)

# Heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', center=0,
            vmin=-1, vmax=1, square=True, linewidths=0.5)
plt.title('Correlation Heatmap')
plt.show()

ℹ️ Pearson vs Spearman Correlation

Pearson (r): Measures linear association. Sensitive to outliers. Assumes both variables are approximately normally distributed.
Spearman (ρ): Measures monotonic association (rank-based). Robust to outliers and non-normal distributions. Use when relationships are non-linear but consistently increasing/decreasing.

Practical Example: Customer Analytics

import pandas as pd
import numpy as np
from scipy import stats

# Generate customer data
np.random.seed(42)
n = 1000

df_customers = pd.DataFrame({
    'age': np.random.normal(35, 10, n).astype(int),
    'income': np.random.lognormal(10.8, 0.5, n).astype(int),
    'spending_score': np.random.randint(1, 100, n),
    'satisfaction': np.random.normal(7, 2, n).clip(1, 10)
})

# Descriptive Statistics Report
print("=" * 60)
print("CUSTOMER ANALYTICS REPORT")
print("=" * 60)

for col in df_customers.columns:
    data = df_customers[col]
    print(f"\n{col.upper()}:")
    print(f"  Mean: {data.mean():.2f}")
    print(f"  Median: {data.median():.2f}")
    print(f"  Std Dev: {data.std():.2f}")
    print(f"  Skewness: {stats.skew(data):.3f}")
    print(f"  Kurtosis: {stats.kurtosis(data):.3f}")
    print(f"  95% CI: [{stats.t.interval(0.95, len(data)-1, data.mean(), stats.sem(data))[0]:.2f}, "
          f"{stats.t.interval(0.95, len(data)-1, data.mean(), stats.sem(data))[1]:.2f}]")

# Hypothesis testing example
high_income = df_customers[df_customers['income'] > df_customers['income'].median()]['spending_score']
low_income = df_customers[df_customers['income'] <= df_customers['income'].median()]['spending_score']

t_stat, p_value = stats.ttest_ind(high_income, low_income)
print(f"\nHypothesis Test: Do high-income customers spend more?")
print(f"T-statistic: {t_stat:.3f}")
print(f"P-value: {p_value:.6f}")
print(f"Result: {'Reject' if p_value < 0.05 else 'Fail to reject'} null hypothesis")

📝Worked Example: Analyzing Customer Income Distribution

Given: Customer income data with mean = $52,000, median =$ 45,000, std = $18,000

Analysis:

Mean > Median → right-skewed distribution (expected for income)
CV = (18000/52000) × 100% = 34.6% → moderate relative variability
Z-score for income = $90,000: z = (90000 - 52000) / 18000 = 2.11
P(Z > 2.11) ≈ 0.017 → about 1.7% of customers earn more than $90K
Recommendation: Report median income (robust to high earners) rather than mean

Key Takeaways

📋Summary: Statistics 101

Mean is sensitive to outliers; Median is robust — choose based on distribution shape
Standard Deviation measures spread in original units; CV measures relative spread for comparing across datasets
Bessel's correction (n-1) ensures sample variance is an unbiased estimator of population variance
Normal distribution is the foundation of many statistical tests; the Empirical Rule (68-95-99.7) enables quick inference
Skewness measures asymmetry; Kurtosis measures tail heaviness — both diagnose distribution shape
Pearson correlation measures linear association; Spearman measures monotonic association — use Spearman when relationships are non-linear
Always visualize distributions before computing summary statistics — numerical summaries can hide important patterns

Practice Exercise

Generate 1000 samples from a normal distribution and verify the 68-95-99.7 rule
Calculate all descriptive statistics (mean, median, mode, variance, std, skewness, kurtosis) manually and compare with Python's built-in functions
Create a comparison of mean, median, and mode for right-skewed, left-skewed, and symmetric distributions
Compute the z-score for each data point and identify potential outliers (|z| > 2)
Perform a correlation analysis on a real dataset — compare Pearson and Spearman results
Generate a dataset where Pearson correlation is near zero but Spearman is high (non-linear monotonic relationship)
Test whether a dataset is normally distributed using the Shapiro-Wilk test

Statistics 101: Mean, Median, Variance, Distributions

Why Statistics Matters in Data Science

Measures of Central Tendency

Mean (Arithmetic Average)

Weighted Mean

Median (Middle Value)

Mode (Most Frequent Value)

Comparing Mean, Median, Mode

Measures of Spread

Variance and Standard Deviation

Population Variance

Sample Variance (Bessel's Correction)

Coefficient of Variation (CV)

Coefficient of Variation

Range and IQR

Probability Distributions

Normal Distribution (Gaussian)

Z-Score (Standard Score)

Other Important Distributions

Skewness and Kurtosis

Sample Skewness

Correlation Analysis

Practical Example: Customer Analytics

📝Worked Example: Analyzing Customer Income Distribution

Key Takeaways

📋Summary: Statistics 101

Practice Exercise

Need Expert Data Science Help?