Arithmetic Mean — Formula, Properties, Computation, Limitations



The Arithmetic Mean: A Deep Dive

The arithmetic mean (commonly just "the mean" or "average") is the most widely used statistical measure — and the most misused. Understanding it deeply saves you from common analytical errors.


Definition and Formula

For a sample of n observations:

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^n x_i$$

For a population of N observations, the mean is denoted by the Greek letter μ (mu):

$$\mu = \frac{1}{N}\sum_{i=1}^N x_i$$

import numpy as np
import pandas as pd
from scipy import stats

# Three equivalent ways to compute the mean in Python
data = [12, 15, 14, 10, 18, 20, 16, 11, 13, 17]

mean_manual = sum(data) / len(data)
mean_numpy = np.mean(data)
mean_pandas = pd.Series(data).mean()

print(f"Manual:  {mean_manual:.4f}")
print(f"NumPy:   {mean_numpy:.4f}")
print(f"Pandas:  {mean_pandas:.4f}")

Algebraic Properties of the Mean

Property 1: Sum of deviations = 0

$$\sum_{i=1}^n (x_i - \bar{x}) = 0$$

The mean is the "balance point" of the distribution.

data = np.array([12, 15, 14, 10, 18, 20, 16, 11, 13, 17])
mean = np.mean(data)
deviations = data - mean
print(f"Sum of deviations: {deviations.sum():.10f}")  # Essentially 0
print(f"Deviations: {deviations}")
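
Why the deviations cancel follows directly from the definition of the sample mean. A short derivation, using the same symbols as the formula above:

$$
\sum_{i=1}^n (x_i - \bar{x}) = \sum_{i=1}^n x_i - n\bar{x} = n\bar{x} - n\bar{x} = 0
$$

The middle step uses $\sum_{i=1}^n x_i = n\bar{x}$, which is the definition of $\bar{x}$ rearranged. In floating-point arithmetic the printed sum may be a tiny nonzero number rather than exactly 0.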

Property 2: Linear transformation

$$\overline{aX + b} = a\bar{X} + b$$

# Temperature: convert from Celsius sample to Fahrenheit
celsius = np.array([20, 22, 25, 18, 30])
fahrenheit = celsius * (9/5) + 32

print(f"Mean Celsius: {celsius.mean():.2f}°C")
print(f"Mean Fahrenheit (direct): {fahrenheit.mean():.2f}°F")
print(f"Mean Fahrenheit (formula): {celsius.mean()*(9/5)+32:.2f}°F")  # Same!
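
The property can be checked by hand as well. In this sketch the observations of X are written as $x_i$ (a notation assumed here for illustration) and the constants a and b are pulled out of the sum:

$$
\overline{aX + b} = \frac{1}{n}\sum_{i=1}^n (a x_i + b) = a \cdot \frac{1}{n}\sum_{i=1}^n x_i + \frac{nb}{n} = a\bar{X} + b
$$

For the temperature example, a = 9/5 and b = 32, which is why converting the mean gives the same result as averaging the converted values.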

Property 3: Minimizes sum of squared deviations

Among all constants c, the sum of squared deviations $\sum_{i=1}^n (x_i - c)^2$ is smallest when $c = \bar{x}$.

# Show the mean minimizes sum of squared deviations
c_values = np.linspace(data.min(), data.max(), 200)
# Parenthesize the square before calling .sum(); `2 .sum()` would raise AttributeError
sse = [((data - c) ** 2).sum() for c in c_values]

import matplotlib.pyplot as plt
plt.plot(c_values, sse)
plt.axvline(mean, color='red', linestyle='--', label=f'Mean = {mean:.1f}')
plt.xlabel('Value of c')
plt.ylabel('Sum of Squared Deviations')
plt.title('Mean Minimizes Sum of Squared Deviations')
plt.legend()
plt.show()
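
The plot suggests where the minimum lies; a brief calculus argument confirms it. Differentiating the sum of squared deviations with respect to c gives:

$$
\frac{d}{dc}\sum_{i=1}^n (x_i - c)^2 = -2\sum_{i=1}^n (x_i - c) = -2\left(n\bar{x} - nc\right)
$$

This derivative is zero only at $c = \bar{x}$, and the second derivative is $2n > 0$, so that point is the unique minimum.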

Weighted Mean

When observations have different importance or frequency:

$$\bar{x}_w = \frac{\sum w_i x_i}{\sum w_i}$$

# GPA calculation (weighted mean)
# Letter grade -> grade points (reference scale; the duplicate 'A' key is removed)
grades = {'A': 4.0, 'A-': 3.7, 'B+': 3.3, 'B': 3.0}
courses = [
    {'grade': 4.0, 'credits': 4, 'course': 'Calculus'},
    {'grade': 3.3, 'credits': 3, 'course': 'History'},
    {'grade': 3.7, 'credits': 4, 'course': 'Chemistry'},
    {'grade': 3.0, 'credits': 2, 'course': 'PE'},
    {'grade': 4.0, 'credits': 3, 'course': 'Statistics'},
]

df = pd.DataFrame(courses)
weighted_gpa = np.average(df['grade'], weights=df['credits'])
simple_gpa = df['grade'].mean()

print(f"Weighted GPA: {weighted_gpa:.4f}")
print(f"Simple mean GPA (ignores credit hours): {simple_gpa:.4f}")
print(f"Difference: {abs(weighted_gpa - simple_gpa):.4f}")
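
To connect the formula to the library call, here is a minimal sketch that writes the weighted mean out by hand and compares it with `np.average`. The arrays `grade_points` and `credits` are illustrative names that mirror the GPA example; they are assumptions of this sketch, not part of the original code.

```python
import numpy as np

# Illustrative values mirroring the GPA example (names are assumptions)
grade_points = np.array([4.0, 3.3, 3.7, 3.0, 4.0])
credits = np.array([4, 3, 4, 2, 3])

# Weighted mean written out from the formula: sum(w_i * x_i) / sum(w_i)
weighted_manual = (credits * grade_points).sum() / credits.sum()

# Same quantity via NumPy's built-in weighted average
weighted_numpy = np.average(grade_points, weights=credits)

# With equal weights, the weighted mean reduces to the simple mean
weighted_equal = np.average(grade_points, weights=np.ones_like(grade_points))

print(f"Manual formula:   {weighted_manual:.4f}")
print(f"np.average:       {weighted_numpy:.4f}")
print(f"Equal weights:    {weighted_equal:.4f}")
print(f"Simple mean:      {grade_points.mean():.4f}")
```

The equal-weights case is a useful sanity check: the ordinary mean is the special case of the weighted mean in which every observation counts the same.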

Mean for Grouped Data

When the raw data are unavailable and only a frequency table is given, the mean can be estimated as:

$$\bar{x} = \frac{\sum f_i m_i}{\sum f_i}$$

where mᵢ is the midpoint of each class interval and fᵢ is the number of observations in that class.

# Grouped frequency table
grouped_data = pd.DataFrame({
    'interval': ['20-29', '30-39', '40-49', '50-59', '60-69'],
    'midpoint': [24.5, 34.5, 44.5, 54.5, 64.5],
    'frequency': [8, 22, 35, 25, 10]
})

grouped_data['f_times_m'] = grouped_data['midpoint'] * grouped_data['frequency']
n_total = grouped_data['frequency'].sum()
grouped_mean = grouped_data['f_times_m'].sum() / n_total

print(grouped_data.to_string(index=False))
print(f"\nEstimated mean (from grouped data): {grouped_mean:.2f}")
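
The grouped mean is an estimate because every observation is replaced by its class midpoint. The sketch below illustrates this with a small invented sample: it bins the raw values into the same kind of intervals, computes the grouped estimate, and compares it with the exact mean. The values in `raw` and the helper names `lower`, `upper`, and `edges` are hypothetical.

```python
import numpy as np

# Hypothetical raw sample of integer values (invented for illustration)
raw = np.array([21, 23, 27, 31, 34, 36, 38, 42, 45, 47, 49, 52, 55, 58, 63])

# Class intervals 20-29, 30-39, ..., 60-69
lower = np.array([20, 30, 40, 50, 60])
upper = lower + 9
midpoints = (lower + upper) / 2  # 24.5, 34.5, ..., 64.5

# Count observations per class using bin edges 20, 30, ..., 70
edges = np.append(lower, upper[-1] + 1)
freq, _ = np.histogram(raw, bins=edges)

# Grouped estimate: frequency-weighted mean of the midpoints
grouped_estimate = (freq * midpoints).sum() / freq.sum()
exact_mean = raw.mean()

print(f"Frequencies:       {freq}")
print(f"Grouped estimate:  {grouped_estimate:.4f}")
print(f"Exact mean:        {exact_mean:.4f}")
print(f"Absolute error:    {abs(grouped_estimate - exact_mean):.4f}")
```

The two values generally differ, and the gap tends to shrink as the class intervals get narrower.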

Trimmed Mean: A Robust Alternative

The trimmed mean discards a fraction α of the observations from each tail, then averages the values that remain:

from scipy.stats import trim_mean

rng = np.random.default_rng(42)  # fixed seed so the output is reproducible
data_with_outliers = np.concatenate([
    rng.normal(50, 5, 95),
    [200, 210, 220, 0, -5]  # outliers
])

print(f"Arithmetic mean:  {np.mean(data_with_outliers):.2f}")
print(f"Trimmed mean 5%:  {trim_mean(data_with_outliers, 0.05):.2f}")
print(f"Trimmed mean 10%: {trim_mean(data_with_outliers, 0.10):.2f}")
print(f"Median:           {np.median(data_with_outliers):.2f}")
# (The 95 uncontaminated values are drawn from a distribution with mean 50)
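
To make the trimming step concrete, the following sketch implements it by hand and compares the result with `scipy.stats.trim_mean`. The helper `trimmed_mean_manual` and the data in `sample` are hypothetical. The sketch assumes the number of values cut from each tail is rounded down, which matches SciPy's documented behavior of slicing off less when the proportion does not give a whole number.

```python
import numpy as np
from scipy.stats import trim_mean


def trimmed_mean_manual(values, proportion):
    """Drop int(proportion * n) observations from each tail, then average."""
    ordered = np.sort(np.asarray(values, dtype=float))
    k = int(proportion * ordered.size)  # values cut per tail, rounded down
    return ordered[k:ordered.size - k].mean()


# Hypothetical data: nine ordinary values and one large outlier
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]

manual = trimmed_mean_manual(sample, 0.10)
library = trim_mean(sample, 0.10)

print(f"Plain mean:            {np.mean(sample):.2f}")
print(f"Manual trimmed (10%):  {manual:.2f}")
print(f"SciPy trimmed (10%):   {library:.2f}")
```

With ten values and a 10% cut, one value is removed from each end, so both the outlier and the smallest value are excluded before averaging.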

Limitations of the Mean

| Problem | Example | Solution |
| --- | --- | --- |
| Sensitive to outliers | CEO salary distorts avg company salary | Use median |
| Meaningless for nominal data | Mean blood type is nonsense | Use mode |
| Inappropriate for skewed data | Mean income misleads | Use median |
| May not be a possible value | Mean family size = 2.3 children | Use appropriate measure |
| Hides multimodality | Mean of bimodal = between the modes | Visualize first |
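
Two rows of the table can be illustrated with a short sketch. The salary figures and the bimodal sample are invented for illustration only; they show how one extreme value pulls the mean away from the typical case, and how the mean of a two-cluster sample can land where no data lie.

```python
import numpy as np

# Hypothetical annual salaries: five employees plus one CEO
salaries = np.array([40_000, 45_000, 50_000, 55_000, 60_000, 2_000_000])

mean_salary = salaries.mean()        # pulled far upward by the CEO
median_salary = np.median(salaries)  # stays near the typical employee

# Hypothetical bimodal sample: clusters near 10 and near 50
bimodal = np.array([9, 10, 10, 11, 49, 50, 50, 51])
bimodal_mean = bimodal.mean()        # falls between the two clusters

print(f"Mean salary:    {mean_salary:,.0f}")
print(f"Median salary:  {median_salary:,.0f}")
print(f"Bimodal mean:   {bimodal_mean:.1f}")
print(f"Bimodal range of cluster 1: {bimodal[:4].min()}-{bimodal[:4].max()}")
print(f"Bimodal range of cluster 2: {bimodal[4:].min()}-{bimodal[4:].max()}")
```

In the salary sample, every employee except the CEO earns less than the mean, which is why the median is the safer summary there.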

Key Takeaways

  1. The mean is the balance point — sum of deviations always = 0
  2. Linear transformations flow directly through the mean
  3. The mean minimizes the sum of squared deviations, which makes it a natural measure of center for roughly normal data
  4. Weighted mean accounts for unequal importance of observations
  5. Trimmed mean provides robustness without abandoning the mean entirely
  6. The mean is not always the right measure — always check your data's shape first
