The Arithmetic Mean: A Deep Dive
The arithmetic mean (commonly just "the mean" or "average") is the most widely used statistical measure — and the most misused. Understanding it deeply saves you from common analytical errors.
Definition and Formula
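For a sample of n observations, the sample mean is $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$.
For a population of N observations, the population mean (Greek mu) is $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$.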
import numpy as np
import pandas as pd
from scipy import stats
# Three equivalent ways to compute the mean in Python
data = [12, 15, 14, 10, 18, 20, 16, 11, 13, 17]
mean_manual = sum(data) / len(data)
mean_numpy = np.mean(data)
mean_pandas = pd.Series(data).mean()
print(f"Manual: {mean_manual:.4f}")
print(f"NumPy: {mean_numpy:.4f}")
print(f"Pandas: {mean_pandas:.4f}")
Algebraic Properties of the Mean
Property 1: Sum of deviations = 0
The deviations from the mean always cancel: $\sum_{i=1}^{n}(x_i - \bar{x}) = 0$. In this sense the mean is the "balance point" of the distribution.
data = np.array([12, 15, 14, 10, 18, 20, 16, 11, 13, 17])
mean = np.mean(data)
deviations = data - mean
print(f"Sum of deviations: {deviations.sum():.10f}") # Essentially 0
print(f"Deviations: {deviations}")
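Why the deviations cancel follows in one line from the definition of the sample mean. A short derivation, writing $\bar{x}$ for the mean of the n observations $x_i$:

```latex
\sum_{i=1}^{n} (x_i - \bar{x})
  = \sum_{i=1}^{n} x_i - n\bar{x}
  = n\bar{x} - n\bar{x}
  = 0
```

In floating-point arithmetic the computed sum can differ from zero by a tiny rounding error, which is why the code prints it to a fixed number of decimals instead of testing for exact equality.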
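Property 2: Linear transformation
If every observation is transformed as $y_i = a x_i + b$, the mean transforms the same way: $\bar{y} = a\bar{x} + b$.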
# Temperature: convert from Celsius sample to Fahrenheit
celsius = np.array([20, 22, 25, 18, 30])
fahrenheit = celsius * (9/5) + 32
print(f"Mean Celsius: {celsius.mean():.2f}°C")
print(f"Mean Fahrenheit (direct): {fahrenheit.mean():.2f}°F")
print(f"Mean Fahrenheit (formula): {celsius.mean()*(9/5)+32:.2f}°F") # Same!
Property 3: Minimizes sum of squared deviations
Among all constants c, the mean is the one that minimizes $\sum_{i=1}^{n}(x_i - c)^2$.
# Show the mean minimizes sum of squared deviations
c_values = np.linspace(data.min(), data.max(), 200)
sse = [((data - c)**2).sum() for c in c_values]  # parenthesize so .sum() applies to the squared array
import matplotlib.pyplot as plt
plt.plot(c_values, sse)
plt.axvline(mean, color='red', linestyle='--', label=f'Mean = {mean:.1f}')
plt.xlabel('Value of c')
plt.ylabel('Sum of Squared Deviations')
plt.title('Mean Minimizes Sum of Squared Deviations')
plt.legend()
plt.show()
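The plot suggests the minimum sits at the mean, and the same result can be checked analytically. A sketch, writing $S(c)$ (a name introduced here for illustration) for the sum of squared deviations about c:

```latex
S(c) = \sum_{i=1}^{n} (x_i - c)^2
\quad\Rightarrow\quad
S'(c) = -2 \sum_{i=1}^{n} (x_i - c) = -2\left(n\bar{x} - nc\right)

S'(c) = 0 \iff c = \bar{x},
\qquad
S''(c) = 2n > 0
```

Because the second derivative is positive everywhere, the stationary point at $c = \bar{x}$ is the unique minimum.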
Weighted Mean
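When observations have different importance or frequency, each value $x_i$ is given a weight $w_i$ and the weighted mean is $\bar{x}_w = \frac{\sum_i w_i x_i}{\sum_i w_i}$: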
# GPA calculation (weighted mean)
grades = {'A': 4.0, 'A-': 3.7, 'B+': 3.3, 'B': 3.0}  # letter grade -> grade points
courses = [
{'grade': 4.0, 'credits': 4, 'course': 'Calculus'},
{'grade': 3.3, 'credits': 3, 'course': 'History'},
{'grade': 3.7, 'credits': 4, 'course': 'Chemistry'},
{'grade': 3.0, 'credits': 2, 'course': 'PE'},
{'grade': 4.0, 'credits': 3, 'course': 'Statistics'},
]
df = pd.DataFrame(courses)
weighted_gpa = np.average(df['grade'], weights=df['credits'])
simple_gpa = df['grade'].mean()
print(f"Weighted GPA: {weighted_gpa:.4f}")
print(f"Simple mean GPA (ignores credit hours): {simple_gpa:.4f}")
print(f"Difference: {abs(weighted_gpa - simple_gpa):.4f}")
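To make the weighting explicit, the same GPA can be computed without NumPy. This is a minimal sketch: `weighted_mean` is a hypothetical helper written for this example, not a library function, and the two lists simply repeat the grade points and credit hours from the course list above.

```python
def weighted_mean(values, weights):
    """Return sum(w_i * x_i) / sum(w_i) for paired values and weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight


# Same grade points and credit hours as the course list above
grade_points = [4.0, 3.3, 3.7, 3.0, 4.0]
credit_hours = [4, 3, 4, 2, 3]

gpa = weighted_mean(grade_points, credit_hours)
print(f"Weighted GPA (by hand): {gpa:.4f}")
```

With equal weights the helper reduces to the ordinary mean, which is a quick sanity check for any weighted-mean implementation.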
Mean for Grouped Data
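When raw data is unavailable and only a frequency table remains, the mean can be estimated as $\bar{x} \approx \frac{\sum_i f_i m_i}{\sum_i f_i}$,
where $m_i$ is the midpoint and $f_i$ the frequency of each class interval.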
# Grouped frequency table
grouped_data = pd.DataFrame({
'interval': ['20-29', '30-39', '40-49', '50-59', '60-69'],
'midpoint': [24.5, 34.5, 44.5, 54.5, 64.5],
'frequency': [8, 22, 35, 25, 10]
})
grouped_data['f_times_m'] = grouped_data['midpoint'] * grouped_data['frequency']
n_total = grouped_data['frequency'].sum()
grouped_mean = grouped_data['f_times_m'].sum() / n_total
print(grouped_data.to_string(index=False))
print(f"\nEstimated mean (from grouped data): {grouped_mean:.2f}")
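The grouped mean is only an estimate, because every observation is treated as if it sat at its class midpoint. The sketch below illustrates the gap on a small set of raw ages that are invented purely for this example; `class_midpoint` is a hypothetical helper that assumes integer classes of width 10 such as 20-29.

```python
from collections import Counter

# Hypothetical raw ages, invented for illustration only
raw_ages = [21, 24, 28, 31, 33, 35, 36, 38, 42, 44, 45, 47, 52, 55, 58, 68]


def class_midpoint(age, width=10):
    """Midpoint of the integer class containing age, e.g. 20-29 -> 24.5."""
    lower = (age // width) * width
    return lower + (width - 1) / 2


# Build the frequency table: midpoint -> number of observations in that class
freq = Counter(class_midpoint(a) for a in raw_ages)

grouped_estimate = sum(m * f for m, f in freq.items()) / sum(freq.values())
exact_mean = sum(raw_ages) / len(raw_ages)

print(f"Grouped estimate: {grouped_estimate:.4f}")
print(f"Exact mean:       {exact_mean:.4f}")
```

The estimate can never be off by more than the largest distance between an observation and its class midpoint (4.5 here), so narrower classes give a tighter estimate.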
Trimmed Mean: A Robust Alternative
The trimmed mean removes a fraction α of the observations from each tail before averaging the rest:
from scipy.stats import trim_mean
data_with_outliers = np.concatenate([
np.random.normal(50, 5, 95),
[200, 210, 220, 0, -5] # outliers
])
print(f"Arithmetic mean: {np.mean(data_with_outliers):.2f}")
print(f"Trimmed mean 5%: {trim_mean(data_with_outliers, 0.05):.2f}")
print(f"Trimmed mean 10%: {trim_mean(data_with_outliers, 0.10):.2f}")
print(f"Median: {np.median(data_with_outliers):.2f}")
# The 95 non-outlier points come from a normal distribution with mean 50,
# so a robust estimate should land close to 50
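What `trim_mean` does can be sketched in a few lines of plain Python. The `trimmed_mean` function below is a hypothetical helper for illustration, and the sample is a small fixed list invented for this example so the result is reproducible. It rounds the number of trimmed items down, which matches the SciPy documentation's note that `trim_mean` slices off less when the proportion does not give a whole number of observations.

```python
def trimmed_mean(values, proportion):
    """Mean after dropping int(proportion * n) items from each sorted end."""
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(proportion * len(ordered))  # items cut from each tail, rounded down
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


# Small fixed sample (invented): eight typical values plus two extremes
sample = [-5, 47, 48, 49, 50, 51, 52, 53, 54, 220]

plain = sum(sample) / len(sample)
trimmed = trimmed_mean(sample, 0.10)

print(f"Arithmetic mean:  {plain:.2f}")
print(f"Trimmed mean 10%: {trimmed:.2f}")
```

Trimming 10% from each tail of ten values discards exactly one observation per side, so both extremes disappear and the result reflects only the central eight values.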
Limitations of the Mean
| Problem | Example | Solution |
|---|---|---|
| Sensitive to outliers | CEO salary distorts avg company salary | Use median |
| Meaningless for nominal data | Mean blood type is nonsense | Use mode |
| Inappropriate for skewed data | Mean income misleads | Use median |
| May not be a possible value | Mean family size = 2.3 children | Report the median or mode when a realizable value is needed |
| Hides multimodality | Mean of bimodal = between the modes | Visualize first |
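The first two rows of the table are easy to reproduce with the standard library. The figures below are hypothetical and chosen only to make the effect visible: nine ordinary salaries plus one very large one, and a short list of blood types.

```python
from statistics import mean, median, mode

# Hypothetical salaries in thousands; the last entry plays the CEO
salaries = [42, 45, 48, 50, 52, 55, 58, 60, 65, 2500]

print(f"Mean salary:   {mean(salaries):.1f}k")
print(f"Median salary: {median(salaries):.1f}k")

# Nominal data supports no arithmetic, so the mode is the usable summary
blood_types = ["O", "A", "O", "B", "A", "O", "AB", "O"]
print(f"Most common blood type: {mode(blood_types)}")
```

A single extreme salary pulls the mean far above what any of the other nine employees earn, while the median stays among the typical values.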
Key Takeaways
- The mean is the balance point — sum of deviations always = 0
- Linear transformations flow directly through the mean
- The mean minimizes the sum of squared deviations, which makes it the natural estimate of center for roughly normal data
- Weighted mean accounts for unequal importance of observations
- Trimmed mean provides robustness without abandoning the mean entirely
- The mean is not always the right measure — always check your data's shape first