Statistics 101: Mean, Median, Variance, Distributions

Module 1: FoundationsFree Lesson

Advertisement

Why Statistics Matters in Data Science

Statistics is the mathematical foundation of data science. It provides the tools to summarize data, quantify uncertainty, and make inferences about populations from samples.

Architecture Diagram
Raw Data โ†’ Descriptive Statistics โ†’ Inference โ†’ Decisions
    โ”‚              โ”‚                     โ”‚           โ”‚
    โ”‚     Summarize & Describe    Draw Conclusions  Act
    โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

Measures of Central Tendency

Mean (Arithmetic Average)

The mean is the sum of all values divided by the number of values.

xห‰=1nโˆ‘i=1nxi\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i

Weighted Mean

xห‰w=โˆ‘i=1nwixiโˆ‘i=1nwi\bar{x}_w = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}

Here,

  • xห‰wxฬ„w=Weighted sample mean
  • wiwแตข=Weight assigned to observation i
  • xixแตข=Individual data values
  • nn=Number of values
import numpy as np
import pandas as pd
from scipy import stats

# Calculate mean
data = np.array([23, 45, 12, 67, 34, 89, 21, 56, 78, 43])

# Python manual calculation
mean_manual = sum(data) / len(data)
print(f"Mean (manual): {mean_manual}")

# NumPy
mean_numpy = np.mean(data)
print(f"Mean (numpy): {mean_numpy}")

# Pandas
mean_pandas = pd.Series(data).mean()
print(f"Mean (pandas): {mean_pandas}")

# Weighted mean
weights = np.array([1, 2, 1, 3, 2, 1, 1, 2, 3, 2])
weighted_mean = np.average(data, weights=weights)
print(f"Weighted Mean: {weighted_mean}")

# Trimmed mean (remove extreme values)
trimmed_mean = stats.trim_mean(data, 0.1)  # Remove 10% from each end
print(f"Trimmed Mean: {trimmed_mean}")

๐Ÿ’ก When to Use Weighted vs Trimmed Mean

  • Weighted mean: Use when observations have different importance (e.g., averaging course grades with different credit hours)
  • Trimmed mean: Use when data has outliers that would distort the arithmetic mean (e.g., income data with billionaires)

Median (Middle Value)

The median is the middle value when data is sorted. It's robust to outliers.

# Calculate median
data = np.array([23, 45, 12, 67, 34, 89, 21, 56, 78, 43])

# Odd number of values
data_odd = np.array([23, 45, 12, 67, 34, 89, 21, 56, 78])
median_odd = np.median(data_odd)
print(f"Median (odd): {median_odd}")  # 43

# Even number of values
median_even = np.median(data)
print(f"Median (even): {median_even}")  # 44.0

# Pandas
median_pandas = pd.Series(data).median()
print(f"Median (pandas): {median_pandas}")

# Quantiles (percentiles)
percentiles = [25, 50, 75]
for p in percentiles:
    q = np.percentile(data, p)
    print(f"{p}th percentile: {q}")

# Quartiles
Q1 = np.percentile(data, 25)
Q2 = np.percentile(data, 50)  # Median
Q3 = np.percentile(data, 75)
IQR = Q3 - Q1
print(f"\nQ1: {Q1}, Q2: {Q2}, Q3: {Q3}")
print(f"IQR: {IQR}")

Mode (Most Frequent Value)

from scipy import stats

data = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5]

# Scipy mode
mode_result = stats.mode(data)
print(f"Mode: {mode_result.mode}")
print(f"Count: {mode_result.count}")

# Pandas mode
mode_pandas = pd.Series(data).mode()
print(f"Mode (pandas): {mode_pandas}")

# Multiple modes
data_multi = [1, 1, 2, 2, 3, 3]  # All are modes
modes = pd.Series(data_multi).mode()
print(f"Multiple modes: {modes.values}")

โ„น๏ธ Mode for Categorical Data

The mode is the only measure of central tendency applicable to nominal (categorical) data. You cannot compute a mean or median of eye color, but you can report the most common color.

Comparing Mean, Median, Mode

import matplotlib.pyplot as plt
import numpy as np

# Create skewed data
np.random.seed(42)
right_skewed = np.random.exponential(2, 1000)
left_skewed = -np.random.exponential(2, 1000)
normal = np.random.normal(0, 1, 1000)

fig, axes = plt.subplots(1, 3, figsize=(15, 5))

datasets = [right_skewed, normal, left_skewed]
titles = ['Right Skewed', 'Normal', 'Left Skewed']

for ax, data, title in zip(axes, datasets, titles):
    ax.hist(data, bins=30, edgecolor='black', alpha=0.7, density=True)
    
    mean = np.mean(data)
    median = np.median(data)
    mode = stats.mode(np.round(data, 1)).mode
    
    ax.axvline(mean, color='r', linestyle='--', label=f'Mean: {mean:.2f}')
    ax.axvline(median, color='g', linestyle='-', label=f'Median: {median:.2f}')
    ax.set_title(title)
    ax.legend()

plt.tight_layout()
plt.show()

โ„น๏ธ When to Use Each

  • Right Skewed: Mean > Median > Mode โ€” Use MEDIAN (e.g., income data)
  • Left Skewed: Mode > Median > Mean โ€” Use MEDIAN (e.g., age at retirement)
  • Symmetric: Mean โ‰ˆ Median โ‰ˆ Mode โ€” Use MEAN (e.g., height, test scores)

Measures of Spread

Variance and Standard Deviation

Variance measures how spread out data is from the mean. It is the average squared deviation.

Population Variance

ฯƒ2=1nโˆ‘i=1n(xiโˆ’ฮผ)2\sigma^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \mu)^2

Here,

  • ฯƒ2ฯƒยฒ=Population variance
  • ฮผฮผ=Population mean
  • nn=Population size
  • xixแตข=Individual data values

Sample Variance (Bessel's Correction)

s2=1nโˆ’1โˆ‘i=1n(xiโˆ’xห‰)2s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2

Here,

  • s2sยฒ=Sample variance
  • xห‰xฬ„=Sample mean
  • nn=Sample size
  • xixแตข=Individual data values
s=1nโˆ’1โˆ‘i=1n(xiโˆ’xห‰)2s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}

๐Ÿ’ก Why n-1 Instead of n? (Bessel's Correction)

When estimating variance from a sample, dividing by n systematically underestimates the true population variance because the sample mean xฬ„ is computed FROM the sample, making it closer to sample data points than the true population mean ฮผ. Dividing by n-1 compensates for this bias by accounting for the one degree of freedom lost when estimating ฮผ with xฬ„. This makes sยฒ an unbiased estimator of ฯƒยฒ.

Standard Deviation = sqrt(Variance), converts back to original units.

Coefficient of Variation (CV)

Measures relative variability โ€” useful for comparing spread across datasets with different units or means.

Coefficient of Variation

CV=ฯƒฮผร—100%CV = \frac{\sigma}{\mu} \times 100\%

Here,

  • CVCV=Coefficient of Variation (percentage)
  • ฯƒฯƒ=Standard deviation
  • ฮผฮผ=Mean
data = np.array([23, 45, 12, 67, 34, 89, 21, 56, 78, 43])

# Population variance (ddof=0)
var_pop = np.var(data, ddof=0)
std_pop = np.std(data, ddof=0)
print(f"Population Variance: {var_pop:.2f}")
print(f"Population Std Dev: {std_pop:.2f}")

# Sample variance (ddof=1) - use this for samples!
var_sample = np.var(data, ddof=1)
std_sample = np.std(data, ddof=1)
print(f"Sample Variance: {var_sample:.2f}")
print(f"Sample Std Dev: {std_sample:.2f}")

# Pandas (uses sample variance by default)
series = pd.Series(data)
print(f"Pandas Variance: {series.var():.2f}")
print(f"Pandas Std Dev: {series.std():.2f}")

# Manual calculation
mean = np.mean(data)
var_manual = np.sum((data - mean) ** 2) / (len(data) - 1)
print(f"\nManual Variance: {var_manual:.2f}")

# Coefficient of Variation (relative variability)
cv = std_sample / mean * 100
print(f"Coefficient of Variation: {cv:.2f}%")

Range and IQR

data = np.array([23, 45, 12, 67, 34, 89, 21, 56, 78, 43])

# Range
range_val = np.max(data) - np.min(data)
print(f"Range: {range_val}")

# IQR (Interquartile Range)
Q1 = np.percentile(data, 25)
Q3 = np.percentile(data, 75)
IQR = Q3 - Q1
print(f"Q1: {Q1}, Q3: {Q3}")
print(f"IQR: {IQR}")

# Five-number summary
five_num = {
    'Min': np.min(data),
    'Q1': Q1,
    'Median': np.median(data),
    'Q3': Q3,
    'Max': np.max(data)
}
print(f"\nFive-Number Summary:")
for k, v in five_num.items():
    print(f"  {k}: {v:.2f}")

# Box plot interpretation
"""
Box Plot Anatomy:
       โ”‚
   โ”€โ”€โ”€โ”€โ”คโ”€โ”€โ”€โ”€ Q3 + 1.5*IQR (Upper Fence)
       โ”‚
   โ”Œโ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”
   โ”‚   โ”ฌ   โ”‚ โ† Q3 (75th percentile)
   โ”‚   โ”‚   โ”‚
   โ”‚   โ”ด   โ”‚ โ† Median (50th percentile)
   โ”‚       โ”‚ โ† Q1 (25th percentile)
   โ””โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”˜
       โ”‚
   โ”€โ”€โ”€โ”€โ”คโ”€โ”€โ”€โ”€ Q1 - 1.5*IQR (Lower Fence)
       โ”‚
       โ€ข โ† Outliers
"""

Probability Distributions

Normal Distribution (Gaussian)

The most important distribution in statistics.

f(x)=1ฯƒ2ฯ€eโˆ’(xโˆ’ฮผ)22ฯƒ2f(x) = \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

# Parameters
mu = 0      # Mean
sigma = 1   # Standard deviation

# Create range
x = np.linspace(-4, 4, 1000)

# PDF (Probability Density Function)
pdf = stats.norm.pdf(x, mu, sigma)

# CDF (Cumulative Distribution Function)
cdf = stats.norm.cdf(x, mu, sigma)

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# PDF
axes[0].plot(x, pdf, 'b-', linewidth=2)
axes[0].fill_between(x, pdf, alpha=0.3)
axes[0].set_title('Normal Distribution PDF')
axes[0].set_xlabel('x')
axes[0].set_ylabel('Probability Density')

# CDF
axes[1].plot(x, cdf, 'r-', linewidth=2)
axes[1].set_title('Normal Distribution CDF')
axes[1].set_xlabel('x')
axes[1].set_ylabel('Cumulative Probability')

plt.tight_layout()
plt.show()

# Generate normal samples
np.random.seed(42)
samples = np.random.normal(mu, sigma, 10000)

# Verify properties
print(f"Sample Mean: {np.mean(samples):.4f} (should be ~{mu})")
print(f"Sample Std: {np.std(samples):.4f} (should be ~{sigma})")

# 68-95-99.7 Rule
within_1sigma = np.sum(np.abs(samples - mu) < sigma) / len(samples)
within_2sigma = np.sum(np.abs(samples - mu) < 2*sigma) / len(samples)
within_3sigma = np.sum(np.abs(samples - mu) < 3*sigma) / len(samples)

print(f"\n68-95-99.7 Rule:")
print(f"Within 1ฯƒ: {within_1sigma*100:.1f}% (expected: 68.3%)")
print(f"Within 2ฯƒ: {within_2sigma*100:.1f}% (expected: 95.4%)")
print(f"Within 3ฯƒ: {within_3sigma*100:.1f}% (expected: 99.7%)")

# Z-scores
data = np.array([23, 45, 12, 67, 34, 89, 21, 56, 78, 43])
z_scores = (data - np.mean(data)) / np.std(data)
print(f"\nZ-scores: {z_scores}")
print(f"Values within ยฑ2 Z-scores: {np.sum(np.abs(z_scores) < 2)}/{len(data)}")

Z-Score (Standard Score)

z=xโˆ’ฮผฯƒz = \frac{x - \mu}{\sigma}

Here,

  • zz=Number of standard deviations from the mean
  • xx=Raw data value
  • ฮผฮผ=Population mean
  • ฯƒฯƒ=Population standard deviation

โ„น๏ธ The 68-95-99.7 Empirical Rule

For a normal distribution:

  • 68.27% of data falls within ยฑ1ฯƒ of the mean
  • 95.45% within ยฑ2ฯƒ
  • 99.73% within ยฑ3ฯƒ This rule is used for outlier detection (values beyond ยฑ3ฯƒ are rare), quality control (Six Sigma), and hypothesis testing.

Other Important Distributions

# 1. Uniform Distribution
x = np.linspace(0, 1, 1000)
uniform_pdf = stats.uniform.pdf(x)
plt.figure(figsize=(10, 5))
plt.plot(x, uniform_pdf, 'b-', linewidth=2)
plt.fill_between(x, uniform_pdf, alpha=0.3)
plt.title('Uniform Distribution')
plt.show()

# 2. Binomial Distribution
n_trials = 10
p_success = 0.5
x = np.arange(0, n_trials + 1)
binom_pmf = stats.binom.pmf(x, n_trials, p_success)
plt.figure(figsize=(10, 5))
plt.bar(x, binom_pmf, color='skyblue', edgecolor='black')
plt.title(f'Binomial Distribution (n={n_trials}, p={p_success})')
plt.xlabel('Number of Successes')
plt.ylabel('Probability')
plt.show()

# 3. Poisson Distribution
lambda_param = 5
x = np.arange(0, 15)
poisson_pmf = stats.poisson.pmf(x, lambda_param)
plt.figure(figsize=(10, 5))
plt.bar(x, poisson_pmf, color='lightgreen', edgecolor='black')
plt.title(f'Poisson Distribution (ฮป={lambda_param})')
plt.xlabel('Number of Events')
plt.ylabel('Probability')
plt.show()

# 4. Exponential Distribution
lambda_exp = 1
x = np.linspace(0, 5, 1000)
exp_pdf = stats.expon.pdf(x, scale=1/lambda_exp)
plt.figure(figsize=(10, 5))
plt.plot(x, exp_pdf, 'r-', linewidth=2)
plt.fill_between(x, exp_pdf, alpha=0.3)
plt.title(f'Exponential Distribution (ฮป={lambda_exp})')
plt.show()

# 5. Chi-Square Distribution
df = 5
x = np.linspace(0, 15, 1000)
chi2_pdf = stats.chi2.pdf(x, df)
plt.figure(figsize=(10, 5))
plt.plot(x, chi2_pdf, 'g-', linewidth=2)
plt.fill_between(x, chi2_pdf, alpha=0.3)
plt.title(f'Chi-Square Distribution (df={df})')
plt.show()

# 6. T-Distribution
df_t = 5
x = np.linspace(-4, 4, 1000)
t_pdf = stats.t.pdf(x, df_t)
normal_pdf = stats.norm.pdf(x)

plt.figure(figsize=(10, 5))
plt.plot(x, normal_pdf, 'b-', label='Normal', linewidth=2)
plt.plot(x, t_pdf, 'r--', label=f't (df={df_t})', linewidth=2)
plt.legend()
plt.title('Normal vs t-Distribution')
plt.show()

๐Ÿ’ก When to Use Which Distribution

  • Normal: Heights, test scores, measurement errors (continuous, symmetric)
  • Binomial: Number of successes in n independent trials (discrete)
  • Poisson: Count of events in a fixed interval (discrete, rare events)
  • Exponential: Time between events in a Poisson process (continuous)
  • t-Distribution: Small-sample inference about means (continuous, heavy tails)
  • Chi-Square: Tests of independence, goodness of fit (continuous, right-skewed)

Skewness and Kurtosis

from scipy.stats import skew, kurtosis

data = np.random.exponential(2, 1000)  # Right-skewed data

skewness = skew(data)
kurt = kurtosis(data)

print(f"Skewness: {skewness:.3f}")
print(f"Kurtosis: {kurt:.3f}")

# Interpretation:
"""
Skewness:
- Skewness = 0: Symmetric
- Skewness > 0: Right-skewed (positive skew)
- Skewness < 0: Left-skewed (negative skew)

Kurtosis:
- Kurtosis = 0 (or 3): Normal (mesokurtic)
- Kurtosis > 0: Heavy tails (leptokurtic)
- Kurtosis < 0: Light tails (platykurtic)
"""

# Visual comparison
np.random.seed(42)
distributions = {
    'Normal': np.random.normal(0, 1, 1000),
    'Uniform': np.random.uniform(-2, 2, 1000),
    'Exponential': np.random.exponential(1, 1000),
    'Laplace': np.random.laplace(0, 1, 1000)
}

fig, axes = plt.subplots(2, 2, figsize=(12, 10))
for ax, (name, data) in zip(axes.flatten(), distributions.items()):
    ax.hist(data, bins=30, density=True, alpha=0.7, edgecolor='black')
    ax.set_title(f'{name}\nSkew: {skew(data):.2f}, Kurt: {kurtosis(data):.2f}')

plt.tight_layout()
plt.show()

Sample Skewness

g1=1nโˆ‘i=1n(xiโˆ’xห‰)3(1nโˆ‘i=1n(xiโˆ’xห‰)2)3/2g_1 = \frac{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^3}{\left(\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2\right)^{3/2}}

Here,

  • g1gโ‚=Sample skewness
  • xห‰xฬ„=Sample mean
  • nn=Sample size

Correlation Analysis

# Pearson Correlation (linear relationships)
np.random.seed(42)
x = np.random.randn(100)
y = x * 2 + np.random.randn(100) * 0.5

pearson_corr = np.corrcoef(x, y)[0, 1]
print(f"Pearson Correlation: {pearson_corr:.3f}")

# Spearman Correlation (monotonic relationships)
from scipy.stats import spearmanr

# Non-linear but monotonic
x_nonlinear = np.random.uniform(0, 10, 100)
y_nonlinear = np.exp(x_nonlinear) + np.random.randn(100) * 10

spearman_corr, p_value = spearmanr(x_nonlinear, y_nonlinear)
print(f"\nSpearman Correlation: {spearman_corr:.3f}")
print(f"P-value: {p_value:.6f}")

# Correlation matrix
import seaborn as sns
import pandas as pd

df = pd.DataFrame({
    'height': np.random.normal(170, 10, 100),
    'weight': np.random.normal(70, 15, 100),
    'age': np.random.randint(18, 65, 100),
    'income': np.random.normal(50000, 15000, 100)
})

# Add some correlations
df['weight'] = df['height'] * 0.5 + np.random.randn(100) * 5
df['income'] = df['age'] * 1000 + np.random.randn(100) * 5000

corr_matrix = df.corr()
print("\nCorrelation Matrix:")
print(corr_matrix)

# Heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', center=0,
            vmin=-1, vmax=1, square=True, linewidths=0.5)
plt.title('Correlation Heatmap')
plt.show()

โ„น๏ธ Pearson vs Spearman Correlation

  • Pearson (r): Measures linear association. Sensitive to outliers. Assumes both variables are approximately normally distributed.
  • Spearman (ฯ): Measures monotonic association (rank-based). Robust to outliers and non-normal distributions. Use when relationships are non-linear but consistently increasing/decreasing.

Practical Example: Customer Analytics

import pandas as pd
import numpy as np
from scipy import stats

# Generate customer data
np.random.seed(42)
n = 1000

df_customers = pd.DataFrame({
    'age': np.random.normal(35, 10, n).astype(int),
    'income': np.random.lognormal(10.8, 0.5, n).astype(int),
    'spending_score': np.random.randint(1, 100, n),
    'satisfaction': np.random.normal(7, 2, n).clip(1, 10)
})

# Descriptive Statistics Report
print("=" * 60)
print("CUSTOMER ANALYTICS REPORT")
print("=" * 60)

for col in df_customers.columns:
    data = df_customers[col]
    print(f"\n{col.upper()}:")
    print(f"  Mean: {data.mean():.2f}")
    print(f"  Median: {data.median():.2f}")
    print(f"  Std Dev: {data.std():.2f}")
    print(f"  Skewness: {stats.skew(data):.3f}")
    print(f"  Kurtosis: {stats.kurtosis(data):.3f}")
    print(f"  95% CI: [{stats.t.interval(0.95, len(data)-1, data.mean(), stats.sem(data))[0]:.2f}, "
          f"{stats.t.interval(0.95, len(data)-1, data.mean(), stats.sem(data))[1]:.2f}]")

# Hypothesis testing example
high_income = df_customers[df_customers['income'] > df_customers['income'].median()]['spending_score']
low_income = df_customers[df_customers['income'] <= df_customers['income'].median()]['spending_score']

t_stat, p_value = stats.ttest_ind(high_income, low_income)
print(f"\nHypothesis Test: Do high-income customers spend more?")
print(f"T-statistic: {t_stat:.3f}")
print(f"P-value: {p_value:.6f}")
print(f"Result: {'Reject' if p_value < 0.05 else 'Fail to reject'} null hypothesis")

๐Ÿ“Worked Example: Analyzing Customer Income Distribution

Given: Customer income data with mean = 52,000,median=52,000, median =45,000, std = $18,000

Analysis:

  1. Mean > Median โ†’ right-skewed distribution (expected for income)
  2. CV = (18000/52000) ร— 100% = 34.6% โ†’ moderate relative variability
  3. Z-score for income = $90,000: z = (90000 - 52000) / 18000 = 2.11
  4. P(Z > 2.11) โ‰ˆ 0.017 โ†’ about 1.7% of customers earn more than $90K
  5. Recommendation: Report median income (robust to high earners) rather than mean

Key Takeaways

๐Ÿ“‹Summary: Statistics 101

  1. Mean is sensitive to outliers; Median is robust โ€” choose based on distribution shape
  2. Standard Deviation measures spread in original units; CV measures relative spread for comparing across datasets
  3. Bessel's correction (n-1) ensures sample variance is an unbiased estimator of population variance
  4. Normal distribution is the foundation of many statistical tests; the Empirical Rule (68-95-99.7) enables quick inference
  5. Skewness measures asymmetry; Kurtosis measures tail heaviness โ€” both diagnose distribution shape
  6. Pearson correlation measures linear association; Spearman measures monotonic association โ€” use Spearman when relationships are non-linear
  7. Always visualize distributions before computing summary statistics โ€” numerical summaries can hide important patterns

Practice Exercise

  1. Generate 1000 samples from a normal distribution and verify the 68-95-99.7 rule
  2. Calculate all descriptive statistics (mean, median, mode, variance, std, skewness, kurtosis) manually and compare with Python's built-in functions
  3. Create a comparison of mean, median, and mode for right-skewed, left-skewed, and symmetric distributions
  4. Compute the z-score for each data point and identify potential outliers (|z| > 2)
  5. Perform a correlation analysis on a real dataset โ€” compare Pearson and Spearman results
  6. Generate a dataset where Pearson correlation is near zero but Spearman is high (non-linear monotonic relationship)
  7. Test whether a dataset is normally distributed using the Shapiro-Wilk test

Advertisement

Need Expert Data Science Help?

Get personalized tutoring, project support, or professional consulting.

Advertisement