Types of Data in Statistics
Understanding data types is the first and most critical step in any statistical analysis. The type of data you have determines which statistical methods are valid, which visualizations are appropriate, and what conclusions you can draw.
The Data Type Hierarchy
ALL DATA
├── Qualitative (Categorical)
│   ├── Nominal (categories with no order)
│   └── Ordinal (categories with meaningful order)
│
└── Quantitative (Numerical)
    ├── Discrete (countable, whole numbers)
    └── Continuous (measurable, any value in a range)
Qualitative (Categorical) Data
Qualitative data represents categories or groups — things that are described rather than measured numerically.
Nominal Data
Categories with no natural order. You can only determine equality or inequality.
Examples:
- Eye color: brown, blue, green, hazel
- Blood type: A, B, AB, O
- Country of birth
- Product category: electronics, clothing, food
- Survey responses: Yes / No
Valid operations: Count, mode, chi-square test
Invalid operations: Mean, median, subtraction
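As a minimal sketch of the valid operations listed above, the snippet below counts categories and finds the mode using only the Python standard library. The `blood_types` list is made-up illustrative data, not a real sample.

```python
from collections import Counter

# Hypothetical sample of blood types (illustrative data only)
blood_types = ["O", "A", "O", "B", "AB", "A", "O", "A", "O"]

# Counting how often each category occurs is valid for nominal data
counts = Counter(blood_types)

# The mode is simply the most frequent category
mode_label, mode_count = counts.most_common(1)[0]

# Relative frequencies are also meaningful
proportions = {label: n / len(blood_types) for label, n in counts.items()}

print(counts)
print(f"Mode: {mode_label} ({mode_count} of {len(blood_types)})")
```

Nothing here sorts or averages the labels, because nominal categories support neither operation.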
Ordinal Data
Categories with a meaningful order, but the gaps between categories are not necessarily equal.
Examples:
- Education level: high school < bachelor's < master's < PhD
- Customer satisfaction: Poor < Fair < Good < Excellent
- Military rank: Private < Corporal < Sergeant < Captain
- Star ratings: ★ < ★★ < ★★★ < ★★★★ < ★★★★★
Valid operations: Ordering, median, percentiles, Spearman correlation
Invalid operations: Arithmetic mean (controversial), subtraction (intervals unknown)
Key distinction: "Excellent" is better than "Good", but is it exactly twice as good? Ordinal scales can't tell us.
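One way to work with ordinal data in pandas is an ordered categorical. The sketch below assumes a four-level satisfaction scale and uses made-up responses. It shows that comparisons and a rank-based median are available without pretending the gaps between levels are equal.

```python
import pandas as pd

levels = ["Poor", "Fair", "Good", "Excellent"]  # lowest to highest

# Hypothetical survey responses (illustrative data only)
responses = pd.Series(
    pd.Categorical(
        ["Good", "Poor", "Excellent", "Good", "Fair", "Good", "Excellent"],
        categories=levels,
        ordered=True,
    )
)

# Order comparisons are valid on an ordered categorical
n_above_fair = int((responses > "Fair").sum())

# The median is taken on the integer rank codes (0 = Poor ... 3 = Excellent).
# With an odd number of responses the median rank is always an actual category.
median_rank = int(responses.cat.codes.median())
median_label = levels[median_rank]

print("Lowest:", responses.min(), "| Highest:", responses.max())
print("Responses above Fair:", n_above_fair)
print("Median response:", median_label)
```

With an even number of responses the median rank can fall between two categories, in which case reporting both neighboring levels is a reasonable choice.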
Quantitative (Numerical) Data
Quantitative data represents measured or counted quantities — numbers that have mathematical meaning.
Discrete Data
Can only take specific, countable values — usually whole numbers. There are gaps between possible values.
Examples:
- Number of children in a family (0, 1, 2, 3, ... — not 1.7)
- Number of cars in a parking lot
- Number of defects in a product
- Shoe sizes (though not whole numbers, they're discrete: 8, 8.5, 9...)
- Number of goals scored in a soccer match
Valid operations: All arithmetic, count, Poisson distribution, binomial distribution
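To illustrate why count distributions suit discrete data, here is a small sketch of the Poisson probability mass function written with the standard library. The rate `lam = 2.0` (average defects per product) is an assumed value chosen only for the example.

```python
import math

def poisson_pmf(k: int, lam: float) -> float:
    """Probability of exactly k events when the average rate is lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)

lam = 2.0  # hypothetical average number of defects per product

# Probability is assigned only to whole-number counts
for k in range(5):
    print(f"P(defects = {k}) = {poisson_pmf(k, lam):.4f}")

# Summing over the possible counts gives (numerically) 1
total = sum(poisson_pmf(k, lam) for k in range(50))
print(f"Total probability: {total:.6f}")
```

The function is only defined for integer `k`, which mirrors the gaps between possible values of a discrete variable.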
Continuous Data
Can take any value within a range, including fractions and decimals. Limited only by measurement precision.
Examples:
- Height (1.753847... meters)
- Temperature (23.7°C)
- Time to complete a task
- Weight, blood pressure, distance
- Stock prices
Valid operations: All arithmetic, normal distribution, integration, derivatives
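For continuous data, probabilities attach to intervals rather than to single values. The sketch below uses `statistics.NormalDist` with an assumed mean of 1.70 m and standard deviation of 0.10 m. These parameters are hypothetical, not estimates from real data.

```python
from statistics import NormalDist

# Hypothetical height distribution in meters (assumed parameters)
heights = NormalDist(mu=1.70, sigma=0.10)

# Probability that a height falls within one standard deviation of the mean
p_interval = heights.cdf(1.80) - heights.cdf(1.60)

# The probability shrinks toward zero as the interval narrows around 1.75 m
p_narrow = heights.cdf(1.7501) - heights.cdf(1.7499)

print(f"P(1.60 < height < 1.80) = {p_interval:.4f}")
print(f"P(1.7499 < height < 1.7501) = {p_narrow:.6f}")
```

This is why continuous variables are summarized with densities and histograms, not with per-value counts.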
Interval vs Ratio (A Deeper Cut)
Within quantitative data, we can further distinguish:
| Feature | Interval | Ratio |
|---|---|---|
| Equal intervals | ✅ Yes | ✅ Yes |
| True zero (zero = absence) | ❌ No | ✅ Yes |
| Meaningful ratios | ❌ No | ✅ Yes |
| Example | Temperature (°C), IQ | Height, weight, income |
Interval example: 0°C is not "no temperature," and 40°C is not twice as hot as 20°C in the thermodynamic sense. Temperature in Kelvin, by contrast, is a ratio scale because 0 K is a true zero.
Ratio example: A person who weighs 80 kg is genuinely twice as heavy as someone who weighs 40 kg.
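The temperature example can be checked with a few lines of arithmetic. This sketch converts the two Celsius readings to Kelvin and compares ratios and differences. The helper name `celsius_to_kelvin` is introduced here purely for illustration.

```python
def celsius_to_kelvin(celsius: float) -> float:
    """Convert a Celsius reading to Kelvin (ratio scale, true zero)."""
    return celsius + 273.15

cold_c, hot_c = 20.0, 40.0

# On the interval scale the numbers suggest "twice as hot"
ratio_celsius = hot_c / cold_c

# On the ratio scale the actual thermodynamic ratio is much smaller
ratio_kelvin = celsius_to_kelvin(hot_c) / celsius_to_kelvin(cold_c)

# Differences, however, agree on both scales
diff_celsius = hot_c - cold_c
diff_kelvin = celsius_to_kelvin(hot_c) - celsius_to_kelvin(cold_c)

print(f"Ratio in Celsius: {ratio_celsius:.3f}")
print(f"Ratio in Kelvin:  {ratio_kelvin:.3f}")
print(f"Difference: {diff_celsius:.1f} °C = {diff_kelvin:.1f} K")
```

Differences survive the change of scale while ratios do not, which is the practical meaning of "equal intervals but no true zero."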
Why Data Types Matter for Statistics
| Analysis Goal | Nominal | Ordinal | Discrete/Continuous |
|---|---|---|---|
| Central tendency | Mode | Mode, Median | Mean, Median, Mode |
| Spread | Frequency | IQR | Std Dev, Variance |
| Correlation | Cramér's V | Spearman ρ | Pearson r |
| Group comparison | Chi-square | Kruskal-Wallis | ANOVA, t-test |
| Regression | Dummy variables | Ordinal logistic | Linear regression |
| Visualization | Bar chart | Bar/box | Histogram, scatter |
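The central tendency row of the table can be sketched as a small dispatch function. The function name, the `scale` labels, and the sample values are all assumptions made for this illustration. For ordinal data it uses `statistics.median_low` on ranks so that the result is always one of the observed categories.

```python
import statistics

def central_tendency(values, scale, order=None):
    """Return a summary that is valid for the given measurement scale."""
    if scale == "nominal":
        # Only the most frequent category is meaningful
        return statistics.mode(values)
    if scale == "ordinal":
        # Rank the labels, then take the lower median rank
        rank = {label: i for i, label in enumerate(order)}
        ranks = sorted(rank[v] for v in values)
        return order[statistics.median_low(ranks)]
    if scale == "quantitative":
        return statistics.mean(values)
    raise ValueError(f"Unknown scale: {scale!r}")

satisfaction = ["Poor", "Fair", "Good", "Excellent"]

print(central_tendency(["A", "O", "O", "B"], "nominal"))
print(central_tendency(["Poor", "Good", "Fair", "Good"], "ordinal", order=satisfaction))
print(central_tendency([1, 2, 3, 6], "quantitative"))
```

The caller has to state the scale explicitly, since the same raw values can belong to different scales depending on what they represent.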
Python: Identifying and Working with Data Types
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Load a rich dataset
df = sns.load_dataset('tips')
print("Dataset shape:", df.shape)
print("\nData types (pandas dtypes):")
print(df.dtypes)
Output:
Dataset shape: (244, 7)
Data types (pandas dtypes):
total_bill float64 ← Continuous quantitative
tip float64 ← Continuous quantitative
sex category ← Nominal qualitative
smoker category ← Nominal qualitative
day category ← Ordinal qualitative (Sun > Sat > Fri > Thur semantically)
time category ← Nominal qualitative
size int64 ← Discrete quantitative
# --- Statistical summaries differ by type ---
print("\n=== Quantitative Variables ===")
print(df[['total_bill', 'tip', 'size']].describe())
print("\n=== Qualitative Variables ===")
for col in ['sex', 'smoker', 'day', 'time']:
    print(f"\n{col} — value counts:")
    print(df[col].value_counts())
    print(f"Mode: {df[col].mode()[0]}")
# --- Visualizations appropriate to each type ---
fig, axes = plt.subplots(2, 3, figsize=(14, 8))
# Continuous: histogram
axes[0, 0].hist(df['total_bill'], bins=20, color='steelblue', edgecolor='black', alpha=0.7)
axes[0, 0].set_title('Total Bill (Continuous)\n→ Histogram')
axes[0, 0].set_xlabel('Amount ($)')
# Discrete: bar chart
size_counts = df['size'].value_counts().sort_index()
axes[0, 1].bar(size_counts.index, size_counts.values, color='coral', edgecolor='black')
axes[0, 1].set_title('Party Size (Discrete)\n→ Bar Chart')
axes[0, 1].set_xlabel('Size')
# Nominal: pie chart
sex_counts = df['sex'].value_counts()
axes[0, 2].pie(sex_counts.values, labels=sex_counts.index, autopct='%1.1f%%', startangle=90)
axes[0, 2].set_title('Sex (Nominal)\n→ Pie Chart')
# Ordinal: ordered bar
day_order = ['Thur', 'Fri', 'Sat', 'Sun']
day_counts = df['day'].value_counts().reindex(day_order)
axes[1, 0].bar(day_counts.index, day_counts.values, color='mediumseagreen', edgecolor='black')
axes[1, 0].set_title('Day (Ordinal)\n→ Ordered Bar Chart')
# Continuous: box plot by category
df.boxplot(column='tip', by='day', ax=axes[1, 1])
axes[1, 1].set_title('Tip by Day\n→ Box Plot')
fig.suptitle('')  # clear the "Boxplot grouped by day" title pandas adds to the whole figure
# Scatter: two continuous
axes[1, 2].scatter(df['total_bill'], df['tip'], alpha=0.5, color='purple')
axes[1, 2].set_title('Bill vs Tip (Continuous × Continuous)\n→ Scatter Plot')
axes[1, 2].set_xlabel('Total Bill ($)')
axes[1, 2].set_ylabel('Tip ($)')
plt.tight_layout()
plt.savefig('data_types_visualization.png', dpi=150)
plt.show()
Data Type Classification in Practice
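def classify_variable(series: pd.Series, nunique_threshold: int = 15) -> str:
    """Classify a pandas Series into a statistical data type (heuristic only)."""
    dtype = series.dtype
    nunique = series.nunique()
    if pd.api.types.is_bool_dtype(dtype):
        return 'Nominal (Binary)'
    elif isinstance(dtype, pd.CategoricalDtype):
        # An ordered categorical declares a ranking, so treat it as ordinal
        return 'Ordinal Categorical' if dtype.ordered else 'Nominal Categorical'
    elif pd.api.types.is_object_dtype(dtype) or pd.api.types.is_string_dtype(dtype):
        return 'Nominal Categorical'
    elif pd.api.types.is_integer_dtype(dtype):
        # Integers are assumed to be counts; ID-like columns need manual review
        if nunique <= nunique_threshold:
            return f'Discrete Quantitative ({nunique} unique values)'
        else:
            return 'Discrete Quantitative (high cardinality)'
    elif pd.api.types.is_float_dtype(dtype):
        return 'Continuous Quantitative'
    else:
        return f'Unknown ({dtype})'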
# Apply to the tips dataset
print("Variable Classification:")
print("-" * 50)
for col in df.columns:
    classification = classify_variable(df[col])
    print(f"{col:<15} → {classification}")
Common Mistakes
❌ Treating Ordinal as Interval
Averaging Likert-scale responses (1–5) treats them as interval data. This is common practice but contested, because the difference between "Strongly Agree" and "Agree" may not equal the difference between "Neutral" and "Disagree."
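The sketch below shows why the spacing assumption matters. It scores the same made-up responses with two numeric codings that both respect the order of the labels. The group data and the `stretched` coding are hypothetical and chosen to make the effect visible.

```python
import statistics

levels = ["Strongly Disagree", "Disagree", "Neutral", "Agree", "Strongly Agree"]

# Two codings that both preserve the order of the labels
equal_spacing = {label: i + 1 for i, label in enumerate(levels)}  # 1, 2, 3, 4, 5
stretched = {**equal_spacing, "Strongly Agree": 10}               # 1, 2, 3, 4, 10

# Hypothetical responses from two groups (illustrative data only)
group_a = ["Neutral"] * 5
group_b = ["Strongly Disagree"] * 3 + ["Strongly Agree"] * 2

def mean_score(responses, coding):
    """Mean of the numeric codes assigned to each response."""
    return statistics.mean([coding[r] for r in responses])

def median_label(responses):
    """Median category, computed on ranks so no spacing is assumed."""
    ranks = sorted(levels.index(r) for r in responses)
    return levels[statistics.median_low(ranks)]

for name, coding in [("equal spacing", equal_spacing), ("stretched", stretched)]:
    a, b = mean_score(group_a, coding), mean_score(group_b, coding)
    print(f"{name}: mean A = {a}, mean B = {b}")

print("Median A:", median_label(group_a), "| Median B:", median_label(group_b))
```

Under equal spacing group A has the higher mean, while under the stretched coding group B does. The medians are the same under both codings, which is why the median is the safer summary for ordinal responses.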
❌ ZIP Codes as Quantitative
ZIP code 90210 is not 40,210 "more than" ZIP code 50000. A ZIP code is a nominal identifier, so arithmetic on it is meaningless.
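A practical consequence is that identifiers like ZIP codes are best stored as text. The short sketch below shows what happens when one with a leading zero is parsed as a number. The variable names are illustrative.

```python
from collections import Counter

zip_text = "02139"          # stored as a label
zip_number = int(zip_text)  # parsed as a quantity

# The leading zero is lost, so the identifier is no longer valid as written
print(zip_number)

# Padding can repair it, but only if the intended width is known
restored = str(zip_number).zfill(5)
print(restored)

# Counting by label is the kind of operation that does make sense
orders = ["02139", "90210", "02139"]
orders_per_zip = Counter(orders)
print(orders_per_zip)
```

Keeping such columns as strings or categoricals also prevents a summary routine from reporting a meaningless mean or standard deviation for them.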
❌ Treating Discrete Data as Continuous (or vice versa)
Modeling the number of children with a continuous distribution assigns probability to impossible values such as 1.7 children. A count distribution such as Poisson or negative binomial is usually a better fit.
Practice Exercises
Exercise 1: Classify each variable:
- a) Temperature in Fahrenheit
- b) Movie genre (Action, Comedy, Drama)
- c) Customer age
- d) Job satisfaction rating (1 = Very Unsatisfied, 5 = Very Satisfied)
- e) Number of siblings
Exercise 2: For each variable in the iris dataset, identify the type and choose the most appropriate visualization.
import seaborn as sns
iris = sns.load_dataset('iris')
print(iris.dtypes)
# Your classifications and visualizations here
Solution:
# sepal_length: float64 → Continuous → histogram or box plot
# sepal_width: float64 → Continuous → histogram or box plot
# petal_length: float64 → Continuous → histogram or box plot
# petal_width: float64 → Continuous → histogram or box plot
# species: object/category → Nominal → bar chart
import seaborn as sns
import matplotlib.pyplot as plt
iris = sns.load_dataset('iris')
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
# Continuous: petal_length distribution by species
for species in iris['species'].unique():
    subset = iris[iris['species'] == species]['petal_length']
    axes[0].hist(subset, bins=15, alpha=0.6, label=species)
axes[0].set_title('Petal Length by Species\n(Continuous, grouped)')
axes[0].legend()
# Nominal: species counts
iris['species'].value_counts().plot(kind='bar', ax=axes[1], color='coral', edgecolor='black')
axes[1].set_title('Species Count\n(Nominal)')
axes[1].tick_params(rotation=0)
plt.tight_layout()
plt.show()
Key Takeaways
- Data type determines your entire analysis pipeline — identify types before doing anything else.
- Qualitative data describes categories — nominal has no order, ordinal has meaningful order.
- Quantitative data measures or counts: discrete takes separate, countable values (usually whole numbers), continuous takes any value in a range.
- The interval/ratio distinction matters because only ratio scales have a true zero, so only they support statements like "twice as much."
- Pandas dtypes approximate statistical types but require human judgment to classify correctly.
- Wrong type → wrong method → wrong conclusion — this is not just academic pedantry.