What is Statistics? — A Complete Introduction

What Is Statistics?

Statistics is the science of collecting, organizing, analyzing, interpreting, and presenting data. It gives us tools to make sense of a world full of uncertainty — turning raw numbers into actionable knowledge.

"Statistics is the grammar of science." — Karl Pearson


Why Statistics Matters

Every field that uses data uses statistics:

| Field | Statistical Application |
|---|---|
| Medicine | Clinical trial analysis, disease prevalence |
| Finance | Risk modeling, portfolio optimization |
| Engineering | Quality control, reliability testing |
| Social Science | Survey analysis, causal inference |
| Machine Learning | Model evaluation, feature selection |
| Business | A/B testing, demand forecasting |

Without statistics, we are swimming in data but drowning in uncertainty.


Two Pillars: Descriptive vs Inferential

Descriptive Statistics

Descriptive statistics summarize and describe the data you actually have. They make no generalizations beyond that dataset.

Examples:

  • The average salary of 500 employees at a company
  • The distribution of exam scores in a class
  • A pie chart of market share by product

Inferential Statistics

Inferential statistics use a sample to draw conclusions about a larger population, and attach a measure of uncertainty to those conclusions.

Examples:

  • Estimating the average salary of all workers in a country (from a survey of 5,000)
  • Testing whether a new drug works better than a placebo
  • Predicting election outcomes from polling data

Population (all units of interest)
        ↓
   Sampling
        ↓
Sample (subset we measure)
        ↓
Statistical inference
        ↓
Conclusions about population (with uncertainty quantified)
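
The flow above can be sketched in a few lines of Python. This is only an illustration on a simulated population: the salary distribution, the seed, and the sample size are assumptions made up for the example, not real data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population: 1,000,000 simulated, right-skewed salaries
population = rng.lognormal(mean=10.8, sigma=0.5, size=1_000_000)
mu = population.mean()  # parameter: unknown in a real study

# Sampling: draw 5,000 units at random, without replacement
sample = rng.choice(population, size=5_000, replace=False)

# Statistical inference: point estimate plus its standard error
x_bar = sample.mean()
se = sample.std(ddof=1) / np.sqrt(len(sample))

print(f"Population mean (parameter): {mu:,.0f}")
print(f"Sample mean (statistic):     {x_bar:,.0f}")
print(f"Standard error:              {se:,.0f}")
print(f"Approx. 95% interval:        ({x_bar - 1.96 * se:,.0f}, {x_bar + 1.96 * se:,.0f})")
```

In practice only the sample is available, so the population mean printed on the first line is exactly the quantity the interval tries to capture.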

The Statistical Thinking Process

Good statistical reasoning follows this cycle:

1. Ask a clear question

  • Example: "Does the new teaching method improve test scores?"

2. Design the study

  • Who to collect data from (sample vs. population)
  • How to collect it (experiment, survey, observation)
  • What to measure

3. Collect data

  • Ensure data quality and consistency

4. Explore the data (EDA)

  • Visualize distributions
  • Check for outliers, missingness

5. Analyze

  • Apply appropriate statistical methods

6. Interpret & communicate

  • Translate results into actionable insights
  • Quantify uncertainty honestly
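
As a rough sketch of steps 2 to 6 applied to the teaching-method question, the example below uses simulated scores. The group sizes, means, and spreads are hypothetical, and Welch's t-test is one reasonable choice of method rather than the only one.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Steps 2-3: simulated test scores (hypothetical; a real study would collect these)
control = rng.normal(loc=70, scale=10, size=60)    # current teaching method
treatment = rng.normal(loc=74, scale=10, size=60)  # new teaching method

# Step 4: explore the data before testing anything
for name, scores in [("control", control), ("treatment", treatment)]:
    print(f"{name:9s} n={len(scores)} mean={scores.mean():.1f} sd={scores.std(ddof=1):.1f}")

# Step 5: Welch's t-test (does not assume equal variances)
result = stats.ttest_ind(treatment, control, equal_var=False)

# Step 6: report the estimated difference together with its uncertainty
diff = treatment.mean() - control.mean()
se = np.sqrt(treatment.var(ddof=1) / len(treatment) + control.var(ddof=1) / len(control))
print(f"difference = {diff:.1f} points, "
      f"approx. 95% CI = ({diff - 1.96 * se:.1f}, {diff + 1.96 * se:.1f})")
print(f"t = {result.statistic:.2f}, p = {result.pvalue:.4f}")
```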

Key Vocabulary

| Term | Definition |
|---|---|
| Population | The entire group of interest |
| Sample | A subset of the population that is measured |
| Parameter | A numerical property of the population (e.g., μ, σ) |
| Statistic | A numerical property of the sample (e.g., x̄, s) |
| Variable | A characteristic being measured |
| Observation | A single data point |
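
For reference, the four symbols in the vocabulary are usually defined as follows. The notation is an assumption introduced here: a finite population of N units, a sample of n observations, and an index i over individual values.

```latex
\begin{aligned}
\mu &= \frac{1}{N}\sum_{i=1}^{N} x_i,
&\qquad
\sigma &= \sqrt{\frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2}, \\
\bar{x} &= \frac{1}{n}\sum_{i=1}^{n} x_i,
&\qquad
s &= \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2}.
\end{aligned}
```

The n − 1 divisor in s is what `ddof=1` selects in NumPy's `np.std`.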

Branches of Statistics

Classical Frequentist Statistics

Probability is the long-run frequency of events. Parameters are fixed unknowns; data provides evidence.

Bayesian Statistics

Probability represents degrees of belief. We update beliefs as new evidence arrives using Bayes' Theorem.
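
A minimal sketch of such an update, assuming a coin-flip setting with a Beta prior (the prior parameters and the observed counts are made up for illustration):

```python
from scipy import stats

# Prior belief about a coin's probability of heads: Beta(2, 2), mildly centered on 0.5
prior_a, prior_b = 2, 2

# Hypothetical new evidence: 7 heads in 10 flips
heads, flips = 7, 10

# Conjugate update: the posterior is Beta(a + heads, b + tails)
post_a = prior_a + heads
post_b = prior_b + (flips - heads)
posterior = stats.beta(post_a, post_b)

lo, hi = posterior.interval(0.95)  # central 95% credible interval
print(f"Prior mean:     {prior_a / (prior_a + prior_b):.3f}")
print(f"Posterior mean: {posterior.mean():.3f}")
print(f"95% credible interval: ({lo:.3f}, {hi:.3f})")
```

The posterior mean (9/14) sits between the prior mean (0.5) and the observed frequency (0.7), which is the sense in which the data update the belief rather than replace it.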

Nonparametric Statistics

Makes fewer assumptions about the distribution of the data. Useful when normality cannot be assumed.
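
As an illustration, the sketch below compares a parametric test with a rank-based alternative on skewed data. The data are simulated from exponential distributions chosen for the example, and Mann-Whitney U is one common nonparametric test among several.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Hypothetical skewed data, e.g. response times in seconds (simulated)
group_a = rng.exponential(scale=1.0, size=40)
group_b = rng.exponential(scale=1.6, size=40)

# Parametric: Welch's t-test compares means and leans on approximate normality
t_res = stats.ttest_ind(group_a, group_b, equal_var=False)

# Nonparametric: Mann-Whitney U works on ranks, so it needs no normality assumption
u_res = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")

print(f"Welch t-test:   p = {t_res.pvalue:.4f}")
print(f"Mann-Whitney U: p = {u_res.pvalue:.4f}")
```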


Python: First Steps

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

# Create a sample dataset
np.random.seed(42)
data = np.random.normal(loc=170, scale=10, size=100)  # Heights in cm

# --- Descriptive statistics ---
print("=== Descriptive Statistics ===")
print(f"n         = {len(data)}")
print(f"Mean      = {np.mean(data):.2f} cm")
print(f"Median    = {np.median(data):.2f} cm")
print(f"Std Dev   = {np.std(data, ddof=1):.2f} cm")
print(f"Min       = {np.min(data):.2f} cm")
print(f"Max       = {np.max(data):.2f} cm")

# --- Visualization ---
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

axes[0].hist(data, bins=15, edgecolor='black', color='steelblue', alpha=0.7)
axes[0].axvline(np.mean(data), color='red', linestyle='--', label=f'Mean={np.mean(data):.1f}')
axes[0].axvline(np.median(data), color='green', linestyle='--', label=f'Median={np.median(data):.1f}')
axes[0].set_title('Distribution of Heights')
axes[0].set_xlabel('Height (cm)')
axes[0].legend()

axes[1].boxplot(data)  # vertical orientation is the default
axes[1].set_title('Box Plot of Heights')
axes[1].set_ylabel('Height (cm)')

plt.tight_layout()
plt.savefig('heights.png', dpi=150)
plt.show()

# --- Inferential: 95% confidence interval for the mean ---
ci = stats.t.interval(0.95, df=len(data)-1,
                       loc=np.mean(data),
                       scale=stats.sem(data))
print(f"\n95% CI for mean height: ({ci[0]:.2f}, {ci[1]:.2f}) cm")

Output:

=== Descriptive Statistics ===
n         = 100
Mean      = 168.96 cm
Median    = 168.73 cm
Std Dev   = 9.08 cm
Min       = 143.80 cm
Max       = 188.52 cm

95% CI for mean height: (167.16, 170.76) cm
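
The interval printed above comes from the standard t-based formula for a mean. `stats.t.interval` evaluates it with `loc` set to the sample mean and `scale` set to the standard error from `stats.sem`, which uses `ddof=1` by default. In the usual notation (the symbols are introduced here and do not appear in the code):

```latex
\bar{x} \;\pm\; t_{1-\alpha/2,\; n-1} \cdot \frac{s}{\sqrt{n}},
\qquad \alpha = 0.05,\quad n = 100
```

Here t is the quantile of the Student t distribution with n − 1 degrees of freedom, and s/√n is the standard error of the mean.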

Common Pitfalls in Statistical Thinking

1. Confusing Correlation with Causation

Ice cream sales correlate with drowning rates. Both are caused by summer heat — not each other.
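
A small simulation makes the point concrete. All numbers below are hypothetical: temperature is generated first and then drives both series, with no direct link between ice cream sales and drownings.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 365

# Simulated daily data: temperature is the common cause of both series
temperature = rng.normal(loc=20, scale=8, size=n)
ice_cream = 50 + 5.0 * temperature + rng.normal(scale=20, size=n)
drownings = 2 + 0.3 * temperature + rng.normal(scale=1.5, size=n)

def residuals(y, x):
    """Remove the linear effect of x from y."""
    slope, intercept = np.polyfit(x, y, deg=1)
    return y - (slope * x + intercept)

raw_r = np.corrcoef(ice_cream, drownings)[0, 1]

# Partial correlation: correlate what is left after accounting for temperature
partial_r = np.corrcoef(residuals(ice_cream, temperature),
                        residuals(drownings, temperature))[0, 1]

print(f"Raw correlation:                         {raw_r:.2f}")
print(f"Correlation controlling for temperature: {partial_r:.2f}")
```

The raw correlation is strong, while the correlation that remains after removing the effect of temperature is close to zero.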

2. Survivorship Bias

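During WWII, analysts studied the bullet holes on returning bombers and proposed reinforcing the areas that were hit most often. Abraham Wald pointed out that the armor belonged where the returning planes showed no damage, because planes hit in those places were the ones that did not return.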

3. Simpson's Paradox

A trend that appears within every subgroup can reverse when the subgroups are combined. Example: Hospital A has a higher overall survival rate, but Hospital B has better rates at every individual disease severity level, because Hospital A treats mostly milder cases.
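
The reversal is easy to reproduce with a small table. The counts below are hypothetical and were chosen only to illustrate the hospital example.

```python
import pandas as pd

# Hypothetical counts, chosen only to illustrate the reversal
df = pd.DataFrame({
    "hospital": ["A", "A", "B", "B"],
    "severity": ["mild", "severe", "mild", "severe"],
    "patients": [900, 100, 100, 900],
    "survived": [855, 50, 98, 540],
})

# Survival rate within each severity level: B is ahead in both rows
df["rate"] = df["survived"] / df["patients"]
by_severity = df.pivot(index="severity", columns="hospital", values="rate")
print(by_severity)

# Survival rate after pooling severity levels: A is ahead
totals = df.groupby("hospital")[["patients", "survived"]].sum()
overall = totals["survived"] / totals["patients"]
print(overall)
```

Hospital A's overall figure is higher only because 90% of its patients are mild cases, which have high survival rates at either hospital.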

4. P-Hacking

Running many tests until you find p < 0.05 inflates false positive rates. Pre-register your hypotheses, and correct for multiple comparisons when several tests are unavoidable.
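
The size of the inflation can be estimated by simulation. In the sketch below every test compares two groups drawn from the same distribution, so every "significant" result is a false positive. The number of studies, the number of tests per study, and the group size are arbitrary choices for the example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n_studies, n_tests, n = 2_000, 20, 30

# Both groups come from the SAME distribution: there is no real effect to find
a = rng.normal(size=(n_studies, n_tests, n))
b = rng.normal(size=(n_studies, n_tests, n))
p_values = stats.ttest_ind(a, b, axis=-1).pvalue  # shape: (n_studies, n_tests)

single_test_rate = (p_values[:, 0] < 0.05).mean()
any_hit_rate = (p_values < 0.05).any(axis=1).mean()

print(f"False positive rate, one pre-specified test: {single_test_rate:.3f}")
print(f"Chance of at least one p < 0.05 in 20 tests: {any_hit_rate:.3f}")
print(f"Theoretical value 1 - 0.95**20:              {1 - 0.95**20:.3f}")
```

A single pre-specified test stays near the nominal 5%, while picking the best of 20 tests produces a "finding" in roughly two out of three studies.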

5. Ignoring Effect Size

A result can be statistically significant but practically meaningless. Always report effect sizes alongside p-values.
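
A sketch of how this happens with very large samples. The data are simulated, and the group size, means, and standard deviation are assumptions for the example. Cohen's d is used as the effect size here, though other measures exist.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

# Hypothetical: two very large groups whose true means differ by 0.5 points
# on a scale with standard deviation 15
group_a = rng.normal(loc=100.0, scale=15, size=200_000)
group_b = rng.normal(loc=100.5, scale=15, size=200_000)

p_value = stats.ttest_ind(group_a, group_b).pvalue

# Cohen's d: mean difference in units of the pooled standard deviation
# (simple average of variances is valid because the groups are the same size)
pooled_sd = np.sqrt((group_a.var(ddof=1) + group_b.var(ddof=1)) / 2)
cohens_d = (group_b.mean() - group_a.mean()) / pooled_sd

print(f"p-value   = {p_value:.2e}")
print(f"Cohen's d = {cohens_d:.3f}")
```

The p-value is far below 0.05, yet d is well under 0.2, the conventional threshold for even a "small" effect.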


Practice Exercises

Exercise 1: In your own words, explain the difference between a parameter and a statistic. Give one example of each.

Exercise 2: Classify each scenario as descriptive or inferential statistics:

  • a) Finding the average age of students in your classroom
  • b) Using a survey of 1,000 adults to estimate the proportion of all adults who prefer remote work
  • c) Creating a bar chart of monthly sales for the past year

Exercise 3 (Code): Load the tips dataset from seaborn and compute:

  • Mean, median, and standard deviation of the total_bill column
  • A 95% confidence interval for the mean tip percentage

import seaborn as sns
tips = sns.load_dataset('tips')
# Your code here

Solution:

import seaborn as sns
import numpy as np
from scipy import stats

tips = sns.load_dataset('tips')
tips['tip_pct'] = tips['tip'] / tips['total_bill'] * 100

bill = tips['total_bill']
tip_pct = tips['tip_pct']

print(f"Total Bill — Mean: {bill.mean():.2f}, Median: {bill.median():.2f}, SD: {bill.std():.2f}")

ci = stats.t.interval(0.95, df=len(tip_pct)-1,
                       loc=tip_pct.mean(),
                       scale=stats.sem(tip_pct))
print(f"95% CI for mean tip %: ({ci[0]:.2f}%, {ci[1]:.2f}%)")

Key Takeaways

  1. Statistics converts data into knowledge — it is the foundation of evidence-based decision making.
  2. Descriptive statistics summarize; inferential statistics generalize — both are essential.
  3. Statistical thinking is a skill — it means quantifying uncertainty, not eliminating it.
  4. Data quality matters more than data quantity — garbage in, garbage out.
  5. Always visualize before you analyze — your eyes catch what formulas miss.
  6. Effect size matters as much as p-value — statistical significance ≠ practical significance.
