What Is Statistics?
Statistics is the science of collecting, organizing, analyzing, interpreting, and presenting data. It gives us tools to make sense of a world full of uncertainty — turning raw numbers into actionable knowledge.
"Statistics is the grammar of science." — Karl Pearson
Why Statistics Matters
Every field that uses data uses statistics:
| Field | Statistical Application |
|---|---|
| Medicine | Clinical trial analysis, disease prevalence |
| Finance | Risk modeling, portfolio optimization |
| Engineering | Quality control, reliability testing |
| Social Science | Survey analysis, causal inference |
| Machine Learning | Model evaluation, feature selection |
| Business | A/B testing, demand forecasting |
Without statistics, we are swimming in data but drowning in uncertainty.
Two Pillars: Descriptive vs Inferential
Descriptive Statistics
Summarizes and describes the data you have, without generalizing beyond your dataset.
Examples:
- The average salary of 500 employees at a company
- The distribution of exam scores in a class
- A pie chart of market share by product
Inferential Statistics
Uses a sample to draw conclusions about a larger population.
Examples:
- Estimating the average salary of all workers in a country (from a survey of 5,000)
- Testing whether a new drug works better than a placebo
- Predicting election outcomes from polling data
Population (all units of interest)
↓
Sampling
↓
Sample (subset we measure)
↓
Statistical inference
↓
Conclusions about population (with uncertainty quantified)
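The flow above can be sketched in a few lines. This is a minimal illustration, not a prescribed method: the log-normal "salary" population, its parameters, the sample size of 500, and the seed are all assumptions made up for the example. In real work the population is not available, which is why the sample has to stand in for it.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility

# Hypothetical population: 100,000 right-skewed salaries (assumed for illustration)
population = rng.lognormal(mean=10.5, sigma=0.5, size=100_000)
mu = population.mean()  # population mean: normally unknown in practice

# Sampling: a simple random sample without replacement
sample = rng.choice(population, size=500, replace=False)

# Statistical inference: estimate the mean and quantify the uncertainty
x_bar = sample.mean()
se = sample.std(ddof=1) / np.sqrt(len(sample))  # estimated standard error

print(f"Population mean:          {mu:,.0f}")
print(f"Sample mean:              {x_bar:,.0f}")
print(f"Estimated standard error: {se:,.0f}")
```

The sample mean will rarely equal the population mean exactly. The standard error indicates how far off it is likely to be.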
The Statistical Thinking Process
Good statistical reasoning follows this cycle:
1. Ask a clear question, for example: "Does the new teaching method improve test scores?"
2. Design the study
- Who to collect data from (sample vs. population)
- How to collect it (experiment, survey, observation)
- What to measure
3. Collect data
- Ensure data quality and consistency
4. Explore the data (EDA)
- Visualize distributions
- Check for outliers, missingness
5. Analyze
- Apply appropriate statistical methods
6. Interpret & communicate
- Translate results into actionable insights
- Quantify uncertainty honestly
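As a rough sketch of step 4, the snippet below checks a small series for missing values and outliers. The scores are invented for illustration, and the 1.5 × IQR fence is one common convention for flagging outliers, not the only reasonable choice.

```python
import numpy as np
import pandas as pd

# Hypothetical test scores with one missing value and one likely data-entry error
scores = pd.Series([72, 68, 75, np.nan, 80, 71, 69, 740, 77, 73], name="score")

# Missingness: count absent values
n_missing = scores.isna().sum()

# Outliers: flag values outside the 1.5 * IQR fences (quantile() skips NaN)
q1, q3 = scores.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = scores[(scores < lower) | (scores > upper)]

print(f"Missing values: {n_missing}")
print(f"Outlier fences: [{lower:.1f}, {upper:.1f}]")
print(f"Flagged values: {outliers.tolist()}")
```

A flagged value is a prompt to investigate, not an automatic deletion: 740 here looks like a typo for 74, but that has to be confirmed against the source.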
Key Vocabulary
| Term | Definition |
|---|---|
| Population | The entire group of interest |
| Sample | A subset of the population that is measured |
| Parameter | A numerical property of the population (e.g., μ, σ) |
| Statistic | A numerical property of the sample (e.g., x̄, s) |
| Variable | A characteristic being measured |
| Observation | A single data point |
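One way to see how these terms map onto a dataset is a small table of data. The four students and their values below are hypothetical, and the "school" they are drawn from is assumed for the sake of the example.

```python
import pandas as pd

# Hypothetical sample: 4 students drawn from a larger school (the population)
sample = pd.DataFrame({
    "hours_studied": [2.0, 5.5, 3.0, 8.0],  # a variable
    "exam_score": [61, 78, 70, 92],         # another variable
})

# Each row is one observation; each column is one variable
n_observations, n_variables = sample.shape

# Statistics are computed from the sample and estimate population parameters
x_bar = sample["exam_score"].mean()  # estimates the population mean
s = sample["exam_score"].std()       # estimates the population SD (pandas uses ddof=1)

print(f"{n_observations} observations, {n_variables} variables")
print(f"x_bar = {x_bar:.2f}, s = {s:.2f}")
```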
Branches of Statistics
Classical Frequentist Statistics
Probability is the long-run frequency of events. Parameters are fixed unknowns; data provides evidence.
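The long-run frequency idea can be illustrated with a simulated coin. This is a sketch that assumes a fair coin (probability 0.5) and an arbitrary seed. The running proportion of heads wanders at first and settles near 0.5 as the number of flips grows.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed

# Simulate 100,000 flips of a coin assumed to be fair (True = heads)
flips = rng.random(100_000) < 0.5

# Proportion of heads after each flip
running_freq = np.cumsum(flips) / np.arange(1, flips.size + 1)

for n in (10, 100, 1_000, 100_000):
    print(f"after {n:>7,} flips: proportion of heads = {running_freq[n - 1]:.4f}")
```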
Bayesian Statistics
Probability represents degrees of belief. We update beliefs as new evidence arrives using Bayes' Theorem.
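A minimal sketch of Bayesian updating, using the Beta-Binomial model because its update has a closed form. The uniform prior and the data (7 successes, 3 failures) are assumptions chosen for illustration.

```python
from scipy import stats

# Prior belief about a success probability: Beta(1, 1), i.e. uniform on [0, 1]
a_prior, b_prior = 1, 1

# Hypothetical new evidence: 7 successes and 3 failures
successes, failures = 7, 3

# For a Beta prior with binomial data, Bayes' Theorem reduces to adding counts
a_post = a_prior + successes
b_post = b_prior + failures
posterior = stats.beta(a_post, b_post)

post_mean = posterior.mean()
ci_low, ci_high = posterior.interval(0.95)  # equal-tailed 95% credible interval

print(f"Posterior: Beta({a_post}, {b_post})")
print(f"Posterior mean: {post_mean:.3f}")
print(f"95% credible interval: ({ci_low:.3f}, {ci_high:.3f})")
```

The posterior mean (8/12) sits between the prior mean (0.5) and the observed proportion (0.7). More data would pull it further toward the observed proportion.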
Nonparametric Statistics
Makes fewer assumptions about the distribution of the data. Useful when normality cannot be assumed.
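As an illustration, the sketch below compares two skewed groups with a rank-based test (Mann-Whitney U) alongside Welch's t-test. The exponential data, group sizes, and seed are assumptions. The Mann-Whitney test is one example of a nonparametric method, picked here only because it is widely available.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)  # arbitrary seed

# Hypothetical right-skewed measurements (e.g., waiting times) for two groups
group_a = rng.exponential(scale=1.0, size=40)
group_b = rng.exponential(scale=1.5, size=40)

# Parametric: Welch's t-test compares means and leans on approximate normality
t_res = stats.ttest_ind(group_a, group_b, equal_var=False)

# Nonparametric: Mann-Whitney U works on ranks, so it does not assume normality
u_res = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")

print(f"Welch t-test:   p = {t_res.pvalue:.4f}")
print(f"Mann-Whitney U: p = {u_res.pvalue:.4f}")
```

The two tests answer slightly different questions (difference in means versus a shift in ranks), so their p-values need not agree.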
Python: First Steps
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
# Create a sample dataset
np.random.seed(42)
data = np.random.normal(loc=170, scale=10, size=100) # Heights in cm
# --- Descriptive statistics ---
print("=== Descriptive Statistics ===")
print(f"n = {len(data)}")
print(f"Mean = {np.mean(data):.2f} cm")
print(f"Median = {np.median(data):.2f} cm")
print(f"Std Dev = {np.std(data, ddof=1):.2f} cm")
print(f"Min = {np.min(data):.2f} cm")
print(f"Max = {np.max(data):.2f} cm")
# --- Visualization ---
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
axes[0].hist(data, bins=15, edgecolor='black', color='steelblue', alpha=0.7)
axes[0].axvline(np.mean(data), color='red', linestyle='--', label=f'Mean={np.mean(data):.1f}')
axes[0].axvline(np.median(data), color='green', linestyle='--', label=f'Median={np.median(data):.1f}')
axes[0].set_title('Distribution of Heights')
axes[0].set_xlabel('Height (cm)')
axes[0].legend()
axes[1].boxplot(data, vert=True)
axes[1].set_title('Box Plot of Heights')
axes[1].set_ylabel('Height (cm)')
plt.tight_layout()
plt.savefig('heights.png', dpi=150)
plt.show()
# --- Inferential: 95% confidence interval for the mean ---
ci = stats.t.interval(0.95, df=len(data)-1,
                      loc=np.mean(data),
                      scale=stats.sem(data))
print(f"\n95% CI for mean height: ({ci[0]:.2f}, {ci[1]:.2f}) cm")
Output:
=== Descriptive Statistics ===
n = 100
Mean = 168.96 cm
Median = 168.73 cm
Std Dev = 9.08 cm
Min = 143.80 cm
Max = 188.52 cm
95% CI for mean height: (167.16, 170.76) cm
Common Pitfalls in Statistical Thinking
1. Confusing Correlation with Causation
Ice cream sales correlate with drowning rates. Both are caused by summer heat — not each other.
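The ice cream example can be simulated to show how a shared cause produces correlation. In this sketch every coefficient and noise level is invented. Temperature drives both variables, and neither affects the other. Removing the linear effect of temperature from each (a partial correlation) makes the association largely vanish.

```python
import numpy as np

rng = np.random.default_rng(3)  # arbitrary seed
n = 365

# Hypothetical daily data: temperature is the common cause (confounder)
temperature = rng.normal(20, 8, n)
ice_cream = 50 + 3.0 * temperature + rng.normal(0, 10, n)
drownings = 2 + 0.2 * temperature + rng.normal(0, 1.5, n)

raw_r = np.corrcoef(ice_cream, drownings)[0, 1]

def residuals(y, x):
    """Return what is left of y after removing a straight-line fit on x."""
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)

# Correlate the parts of each variable that temperature does not explain
partial_r = np.corrcoef(residuals(ice_cream, temperature),
                        residuals(drownings, temperature))[0, 1]

print(f"Raw correlation:                {raw_r:.2f}")
print(f"Controlling for temperature:    {partial_r:.2f}")
```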
2. Survivorship Bias
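During WWII, analysts studying bullet holes on returning bombers proposed reinforcing the areas that were hit most often. Abraham Wald pointed out the flaw: the sample contained only planes that survived, so the armor belonged where the returning planes showed no damage, because planes hit there did not come back.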
3. Simpson's Paradox
A trend that appears in every subgroup can reverse when the subgroups are combined. Example: Hospital A has a higher overall survival rate, yet Hospital B has better rates at every disease severity level, because Hospital A treats mostly milder cases.
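A small worked example makes the reversal concrete. The patient counts below are invented to match the hospital scenario and do not come from real data.

```python
import pandas as pd

# Hypothetical counts: Hospital A sees mostly mild cases, Hospital B mostly severe
df = pd.DataFrame({
    "hospital": ["A", "A", "B", "B"],
    "severity": ["mild", "severe", "mild", "severe"],
    "patients": [900, 100, 100, 900],
    "survived": [855, 50, 98, 540],
})

# Survival rate within each severity level
df["rate"] = df["survived"] / df["patients"]
by_severity = df.pivot(index="severity", columns="hospital", values="rate")

# Overall survival rate per hospital (pool the counts, then divide)
totals = df.groupby("hospital")[["patients", "survived"]].sum()
overall = totals["survived"] / totals["patients"]

print(by_severity)
print(overall)
```

Hospital B wins within each severity level (98% vs 95% for mild, 60% vs 50% for severe), yet Hospital A wins overall (90.5% vs 63.8%) because its case mix is dominated by mild cases.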
4. P-Hacking
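Running many tests until one yields p < 0.05 inflates the false positive rate. With 20 independent tests of true null hypotheses, the chance of at least one p < 0.05 is about 64% (1 - 0.95^20). Always pre-register your hypotheses.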
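The inflation can be checked by simulation. In this sketch both groups always come from the same distribution, so every "significant" result is a false positive. The number of experiments, tests per experiment, and sample size are arbitrary assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)  # arbitrary seed
n_experiments, n_tests, n = 1_000, 20, 30

# No real effect anywhere: both groups are drawn from the same normal distribution
a = rng.normal(size=(n_experiments, n_tests, n))
b = rng.normal(size=(n_experiments, n_tests, n))

# One t-test per (experiment, test) pair; result has shape (n_experiments, n_tests)
p_values = stats.ttest_ind(a, b, axis=-1).pvalue

# Honest analyst: runs one pre-specified test per experiment
single_test_rate = (p_values[:, 0] < 0.05).mean()

# P-hacker: runs all 20 tests and reports a "finding" if any is below 0.05
any_hit_rate = (p_values < 0.05).any(axis=1).mean()

print(f"False positive rate, one planned test: {single_test_rate:.3f}")
print(f"False positive rate, best of 20 tests: {any_hit_rate:.3f}")
```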
5. Ignoring Effect Size
A result can be statistically significant but practically meaningless. Always report effect sizes alongside p-values.
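To illustrate, the sketch below compares two very large groups whose true means differ by 0.02 standard deviations. The group means, SD, and sample size are assumptions. Cohen's d is used as the effect size because it is a common choice for a difference in means. Other measures exist.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)  # arbitrary seed
n = 500_000

# Hypothetical scores: the true difference is 0.3 points on an SD of 15
control = rng.normal(100.0, 15.0, n)
treated = rng.normal(100.3, 15.0, n)

res = stats.ttest_ind(treated, control)

# Cohen's d: mean difference in units of the pooled standard deviation
pooled_sd = np.sqrt((control.var(ddof=1) + treated.var(ddof=1)) / 2)
cohens_d = (treated.mean() - control.mean()) / pooled_sd

print(f"p-value:   {res.pvalue:.2e}")
print(f"Cohen's d: {cohens_d:.3f}")
```

With this much data the p-value is tiny, yet a d near 0.02 is far below the 0.2 conventionally labeled "small". Whether such a difference matters is a practical question the p-value cannot answer.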
Practice Exercises
Exercise 1: In your own words, explain the difference between a parameter and a statistic. Give one example of each.
Exercise 2: Classify each scenario as descriptive or inferential statistics:
- a) Finding the average age of students in your classroom
- b) Using a survey of 1,000 adults to estimate the proportion of all adults who prefer remote work
- c) Creating a bar chart of monthly sales for the past year
Exercise 3 (Code): Load the tips dataset from seaborn and compute:
- Mean, median, and standard deviation of the total_bill column
- A 95% confidence interval for the mean tip percentage
import seaborn as sns
tips = sns.load_dataset('tips')
# Your code here
Solution:
import seaborn as sns
import numpy as np
from scipy import stats
tips = sns.load_dataset('tips')
tips['tip_pct'] = tips['tip'] / tips['total_bill'] * 100
bill = tips['total_bill']
tip_pct = tips['tip_pct']
print(f"Total Bill — Mean: {bill.mean():.2f}, Median: {bill.median():.2f}, SD: {bill.std():.2f}")
ci = stats.t.interval(0.95, df=len(tip_pct)-1,
                      loc=tip_pct.mean(),
                      scale=stats.sem(tip_pct))
print(f"95% CI for mean tip %: ({ci[0]:.2f}%, {ci[1]:.2f}%)")
Key Takeaways
- Statistics converts data into knowledge — it is the foundation of evidence-based decision making.
- Descriptive statistics summarize; inferential statistics generalize — both are essential.
- Statistical thinking is a skill — it means quantifying uncertainty, not eliminating it.
- Data quality matters more than data quantity — garbage in, garbage out.
- Always visualize before you analyze — your eyes catch what formulas miss.
- Effect size matters as much as p-value — statistical significance ≠ practical significance.