Why This Matters
This capstone project brings together every skill from the Foundations module. You'll work through a complete data analysis pipeline, from raw data to actionable insights, as you would in a real data science role.
Dataset: World Happiness Report (2015-2019), real survey-based data published by the UN Sustainable Development Solutions Network.
Definition: Reproducible Analysis
An analysis where another researcher can obtain the same results from the same data by following the same computational steps. Reproducibility requires documented code, version-controlled data, and deterministic random seeds. It is the minimum standard of scientific rigor; even exploratory analyses should be reproducible.
Why a capstone? Individual skills (pandas, visualization, statistics) are necessary but not sufficient. Real data science requires integrating these skills into a coherent workflow where each step informs the next. This project teaches you to think in pipelines, not isolated operations. The World Happiness Report is ideal because it contains real statistical relationships (GDP-happiness correlation) that produce interpretable results.
Project Overview
Phase 1: Data Collection & Loading
↓
Phase 2: Data Cleaning & Validation
↓
Phase 3: Exploratory Data Analysis (EDA)
↓
Phase 4: Statistical Testing
↓
Phase 5: Advanced Analysis
↓
Phase 6: Visualization Dashboard
↓
Phase 7: Insights & Reporting
Definition: Exploratory Data Analysis (EDA)
The process of systematically examining a dataset to understand its structure, detect anomalies, test assumptions, and identify relationships before formal statistical modeling. EDA combines summary statistics, visualizations, and cross-tabulations to generate hypotheses rather than test them. John Tukey (1977) argued that EDA should precede confirmatory analysis to prevent misspecification of models.
Phase 1: Data Collection & Loading
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import warnings
warnings.filterwarnings('ignore')
# ============================================
# 1. LOAD DATA
# ============================================
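# Load World Happiness Report data from a local CSV
# (requires world-happiness-report.csv in the working directory)
df = pd.read_csv('world-happiness-report.csv')
# Alternative: generate comparable synthetic data for reproducibility.
# This overwrites the df loaded above. The columns are drawn independently,
# so correlations between them will be close to zero, unlike the real data.
np.random.seed(42)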
n_countries = 150
df = pd.DataFrame({
    'Country': [f'Country_{i}' for i in range(n_countries)],
    'Region': np.random.choice(['Western Europe', 'North America', 'Asia Pacific',
                                'Latin America', 'Sub-Saharan Africa',
                                'Central and Eastern Europe', 'Middle East'], n_countries),
    'Happiness Score': np.random.normal(5.5, 1.8, n_countries).clip(2, 8),
    'GDP per capita': np.random.lognormal(9.5, 1.0, n_countries) / 1000,
    'Social support': np.random.normal(0.8, 0.15, n_countries).clip(0, 1),
    'Healthy life expectancy': np.random.normal(65, 10, n_countries).clip(40, 85),
    'Freedom to make life choices': np.random.normal(0.45, 0.12, n_countries).clip(0, 1),
    'Generosity': np.random.normal(0.2, 0.15, n_countries).clip(-0.2, 0.6),
    'Perceptions of corruption': np.random.normal(0.15, 0.12, n_countries).clip(0, 0.5),
    'Year': np.random.choice([2015, 2016, 2017, 2018, 2019], n_countries)
})
print("=" * 60)
print("PHASE 1: DATA COLLECTION & LOADING")
print("=" * 60)
print(f"\nDataset shape: {df.shape}")
print(f"Rows: {df.shape[0]}, Columns: {df.shape[1]}")
print(f"\nColumn types:")
print(df.dtypes)
print(f"\nFirst 5 rows:")
print(df.head())
Why set a random seed? np.random.seed(42) ensures that every time you run the synthetic data generation code, you get the same "random" numbers. This makes your analysis reproducible: someone else running your code will get identical results. The seed value (42, the Answer to the Ultimate Question of Life, the Universe, and Everything) is a common convention, but any fixed integer works.
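The effect of a fixed seed can be checked directly. The sketch below is an illustration, not part of the project code: `draw_scores` is a hypothetical helper, and it uses NumPy's `default_rng` generator (a local alternative to the global `np.random.seed`) so that the seed is passed explicitly.

```python
import numpy as np

def draw_scores(seed, n=5):
    """Draw n synthetic happiness scores from a generator seeded with `seed`."""
    rng = np.random.default_rng(seed)  # local generator, no global state touched
    return rng.normal(5.5, 1.8, n).clip(2, 8)

a = draw_scores(42)
b = draw_scores(42)
c = draw_scores(7)

print(np.array_equal(a, b))  # same seed: the two draws match element by element
print(np.array_equal(a, c))  # different seed: the draws differ
```

Passing the generator or seed as an argument keeps each function's randomness self-contained, which is easier to reason about than a single global seed once a project grows beyond one script.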
Phase 2: Data Cleaning & Validation
Definition: Data Validation
The systematic process of checking data for correctness, completeness, and consistency. Validation includes: (1) checking for missing values and deciding on imputation strategy, (2) detecting and handling outliers, (3) verifying data types and ranges, (4) identifying duplicate records. Poor validation leads to misleading results and invalid conclusions.
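The four checks in this definition can be collected into one report before any cleaning decision is made. The following is a minimal sketch under stated assumptions: `validate` is a hypothetical helper, the toy rows are made up for illustration, and the allowed ranges are assumptions that would normally come from the dataset's documentation.

```python
import numpy as np
import pandas as pd

def validate(df, ranges):
    """Summarize missing values, duplicate rows, and out-of-range values.

    `ranges` maps a column name to (min_allowed, max_allowed); the bounds
    used below are illustrative assumptions, not official limits.
    """
    report = {
        'missing': df.isnull().sum().to_dict(),
        'duplicates': int(df.duplicated().sum()),
        'out_of_range': {},
    }
    for col, (lo, hi) in ranges.items():
        values = df[col].dropna()  # missing values are counted separately
        report['out_of_range'][col] = int(((values < lo) | (values > hi)).sum())
    return report

toy = pd.DataFrame({
    'Country': ['A', 'B', 'B', 'C'],
    'Happiness Score': [5.1, 7.2, 7.2, 11.0],    # 11.0 is outside a 0-10 scale
    'Social support': [0.8, 0.9, 0.9, np.nan],   # one missing value
})
report = validate(toy, {'Happiness Score': (0, 10), 'Social support': (0, 1)})
print(report)
```

Reporting problems first and fixing them second keeps the cleaning decisions visible, which matters when you later document why rows were dropped or imputed.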
print("\n" + "=" * 60)
print("PHASE 2: DATA CLEANING & VALIDATION")
print("=" * 60)
# 2.1 Missing Values Analysis
print("\n--- Missing Values ---")
missing = df.isnull().sum()
missing_pct = (missing / len(df) * 100).round(2)
missing_df = pd.DataFrame({'Count': missing, 'Percent': missing_pct})
print(missing_df[missing_df['Count'] > 0])
# Strategy: If missing < 5%, drop; if 5-25%, impute median; if > 25%, investigate
for col in df.select_dtypes(include=[np.number]).columns:
    if df[col].isnull().sum() > 0:
        pct = df[col].isnull().sum() / len(df) * 100
        if pct < 5:
            df = df.dropna(subset=[col])
            print(f" Dropped rows for {col} ({pct:.1f}% missing)")
        elif pct < 25:
            median_val = df[col].median()
            df[col] = df[col].fillna(median_val)
            print(f" Imputed {col} with median ({pct:.1f}% missing)")
        else:
            print(f" WARNING: {col} has {pct:.1f}% missing; needs investigation")
# 2.2 Duplicate Check
duplicates = df.duplicated().sum()
print(f"\nDuplicate rows: {duplicates}")
if duplicates > 0:
    df = df.drop_duplicates()
    print(f"Removed {duplicates} duplicates")
# 2.3 Outlier Detection (IQR method)
print("\n--- Outlier Detection ---")
numeric_cols = df.select_dtypes(include=[np.number]).columns
for col in numeric_cols:
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR
    outliers = ((df[col] < lower) | (df[col] > upper)).sum()
    if outliers > 0:
        print(f" {col}: {outliers} outliers ({outliers/len(df)*100:.1f}%)")
# 2.4 Data Type Optimization
print("\n--- Data Type Optimization ---")
for col in df.select_dtypes(include=['int64']).columns:
    if df[col].min() >= 0:
        if df[col].max() <= 255:  # uint8 holds 0..255
            df[col] = df[col].astype('uint8')
            print(f" {col}: int64 -> uint8")
        elif df[col].max() <= 65535:  # uint16 holds 0..65535
            df[col] = df[col].astype('uint16')
            print(f" {col}: int64 -> uint16")
# 2.5 Feature Engineering
print("\n--- Feature Engineering ---")
df['Happiness Category'] = pd.cut(df['Happiness Score'],
bins=[0, 4, 5.5, 7, 10],
labels=['Low', 'Medium', 'High', 'Very High'])
print(f" Created Happiness Category: {df['Happiness Category'].value_counts().to_dict()}")
df['GDP Category'] = pd.cut(df['GDP per capita'],
bins=[0, 10, 30, 60, 200],
labels=['Low', 'Medium', 'High', 'Very High'])
print(f" Created GDP Category")
print(f"\nFinal dataset shape: {df.shape}")
Definition: Interquartile Range (IQR) Outlier Detection
A robust outlier detection method based on quartiles. The IQR = Q3 − Q1 (75th − 25th percentile). Values below Q1 − 1.5·IQR or above Q3 + 1.5·IQR are flagged as outliers. The 1.5 multiplier is Tukey's convention; for normally distributed data the fences sit at approximately ±2.7σ, so about 99.3% of data points fall inside them.
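Both the fence rule and the ±2.7σ figure can be reproduced in a few lines. This is a small sketch: `iqr_fences` is a hypothetical helper and the ten-value array is made-up data; the normal-distribution check uses the standard library's `statistics.NormalDist`.

```python
import numpy as np
from statistics import NormalDist

def iqr_fences(values, k=1.5):
    """Return the (lower, upper) Tukey fences for a 1-D array."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
lower, upper = iqr_fences(data)
outliers = data[(data < lower) | (data > upper)]
print(lower, upper, outliers)  # fences at -3.5 and 14.5; only 100 is flagged

# Where do the fences fall for a standard normal distribution?
z75 = NormalDist().inv_cdf(0.75)       # Q3 of N(0, 1); Q1 is -z75
fence_z = z75 + 1.5 * (2 * z75)        # Q3 + 1.5 * IQR, in units of sigma
coverage = NormalDist().cdf(fence_z) - NormalDist().cdf(-fence_z)
print(round(fence_z, 3), round(coverage, 4))
```

Note that skewed variables such as GDP per capita will produce many flagged points under this rule even when nothing is wrong with the data, so a flag is a prompt to look, not a reason to delete.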
Phase 3: Exploratory Data Analysis
print("\n" + "=" * 60)
print("PHASE 3: EXPLORATORY DATA ANALYSIS")
print("=" * 60)
# 3.1 Summary Statistics
print("\n--- Summary Statistics ---")
print(df.describe().round(3))
# 3.2 Distribution Analysis
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
plot_cols = ['Happiness Score', 'GDP per capita', 'Social support',
'Healthy life expectancy', 'Freedom to make life choices', 'Generosity']
for i, col in enumerate(plot_cols):
    ax = axes[i // 3, i % 3]
    df[col].hist(bins=30, ax=ax, color='steelblue', edgecolor='black', alpha=0.7)
    ax.axvline(df[col].mean(), color='red', linestyle='--', label=f'Mean: {df[col].mean():.2f}')
    ax.set_title(col, fontsize=11, fontweight='bold')
    ax.legend(fontsize=8)
plt.suptitle('Distribution of Happiness Factors', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.savefig('03_distributions.png', dpi=150, bbox_inches='tight')
plt.show()
# 3.3 Correlation Analysis
print("\n--- Correlation Matrix ---")
corr_cols = ['Happiness Score', 'GDP per capita', 'Social support',
'Healthy life expectancy', 'Freedom to make life choices', 'Generosity']
corr_matrix = df[corr_cols].corr()
print(corr_matrix.round(3))
# Heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='RdBu_r', center=0,
fmt='.2f', square=True, linewidths=0.5)
plt.title('Correlation Between Happiness Factors', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.savefig('03_correlation_heatmap.png', dpi=150, bbox_inches='tight')
plt.show()
# 3.4 Regional Analysis
print("\n--- Happiness by Region ---")
regional = df.groupby('Region')['Happiness Score'].agg(['mean', 'std', 'count'])
regional = regional.sort_values('mean', ascending=False)
print(regional.round(3))
fig, ax = plt.subplots(figsize=(12, 6))
df.boxplot(column='Happiness Score', by='Region', ax=ax)  # draw on this axes instead of opening a second figure
plt.title('Happiness Score by Region', fontsize=14, fontweight='bold')
plt.suptitle('')
plt.xticks(rotation=45, ha='right')
plt.ylabel('Happiness Score')
plt.tight_layout()
plt.savefig('03_regional_boxplot.png', dpi=150, bbox_inches='tight')
plt.show()
# 3.5 Relationship Analysis
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
scatter_vars = [('GDP per capita', 'Happiness Score'),
('Healthy life expectancy', 'Happiness Score'),
('Social support', 'Happiness Score')]
for i, (x, y) in enumerate(scatter_vars):
    pair = df[[x, y]].dropna()  # drop rows pairwise so x and y stay aligned
    axes[i].scatter(pair[x], pair[y], alpha=0.5, color='steelblue')
    z = np.polyfit(pair[x], pair[y], 1)  # least-squares trend line
    p = np.poly1d(z)
    x_sorted = pair[x].sort_values()
    axes[i].plot(x_sorted, p(x_sorted), 'r--', linewidth=2)
    r, _ = stats.pearsonr(pair[x], pair[y])
    axes[i].set_xlabel(x, fontsize=10)
    axes[i].set_ylabel(y, fontsize=10)
    axes[i].set_title(f'{x} vs {y}\nr = {r:.3f}', fontsize=10, fontweight='bold')
plt.tight_layout()
plt.savefig('03_scatter_relationships.png', dpi=150, bbox_inches='tight')
plt.show()
The correlation heatmap reveals multicollinearity risk: If two predictor variables (e.g., GDP per capita and Healthy life expectancy) have |r| > 0.7, including both in a regression model can inflate coefficient standard errors and make individual coefficients unstable. This does not invalidate the model but makes interpretation of individual coefficients unreliable. Consider using only one of the correlated predictors or using dimensionality reduction.
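Pairwise correlations only catch collinearity between two variables at a time. A common complementary diagnostic, not used in the project code, is the variance inflation factor (VIF), which equals the diagonal of the inverse correlation matrix. The sketch below assumes made-up data in which one predictor is built to correlate about 0.9 with another; the variable names are illustrative only.

```python
import numpy as np

def vif(X):
    """Variance inflation factor of each column of X (rows are observations).

    VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing column j on
    the other columns; this equals the j-th diagonal entry of the inverse
    correlation matrix.
    """
    corr = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(corr))

rng = np.random.default_rng(0)
n = 500
gdp = rng.normal(size=n)
# Built to correlate about 0.9 with gdp (illustrative, not real data)
life = 0.9 * gdp + np.sqrt(1 - 0.9**2) * rng.normal(size=n)
generosity = rng.normal(size=n)  # independent of the other two

X = np.column_stack([gdp, life, generosity])
vifs = vif(X)
print(dict(zip(['gdp', 'life', 'generosity'], vifs.round(2))))
```

With a correlation near 0.9 the first two VIFs land around 5, while the independent predictor stays near 1. Thresholds of 5 or 10 are often quoted as rules of thumb, but they are conventions rather than hard limits.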
Phase 4: Statistical Testing
print("\n" + "=" * 60)
print("PHASE 4: STATISTICAL TESTING")
print("=" * 60)
# 4.1 Test: Is there a significant difference in happiness between regions?
print("\n--- Test 1: Regional Differences (ANOVA) ---")
# Select top regions
top_regions = ['Western Europe', 'North America', 'Asia Pacific']
df_top = df[df['Region'].isin(top_regions)]
groups = [group['Happiness Score'].values for name, group in df_top.groupby('Region')]
f_stat, p_value = stats.f_oneway(*groups)
print(f"F-statistic: {f_stat:.4f}")
print(f"p-value: {p_value:.6f}")
print(f"Conclusion: {'Significant' if p_value < 0.05 else 'No significant'} difference between regions")
# 4.2 Test: Is GDP correlated with Happiness?
print("\n--- Test 2: GDP-Happiness Correlation (Pearson) ---")
pair = df[['GDP per capita', 'Happiness Score']].dropna()  # drop rows pairwise so x and y stay aligned
r, p_value = stats.pearsonr(pair['GDP per capita'], pair['Happiness Score'])
print(f"Pearson r: {r:.4f}")
print(f"p-value: {p_value:.6f}")
print(f"Conclusion: {'Strong' if abs(r) > 0.5 else 'Moderate' if abs(r) > 0.3 else 'Weak'} correlation")
# 4.3 Test: Are high-GDP and low-GDP countries different?
print("\n--- Test 3: High vs Low GDP Countries (t-test) ---")
high_gdp = df[df['GDP per capita'] > df['GDP per capita'].median()]['Happiness Score']
low_gdp = df[df['GDP per capita'] <= df['GDP per capita'].median()]['Happiness Score']
t_stat, p_value = stats.ttest_ind(high_gdp, low_gdp)
cohens_d = (high_gdp.mean() - low_gdp.mean()) / np.sqrt((high_gdp.std()**2 + low_gdp.std()**2) / 2)
print(f"High GDP mean: {high_gdp.mean():.3f}")
print(f"Low GDP mean: {low_gdp.mean():.3f}")
print(f"t-statistic: {t_stat:.4f}")
print(f"p-value: {p_value:.6f}")
print(f"Cohen's d: {cohens_d:.4f}")
print(f"Conclusion: {'Significant' if p_value < 0.05 else 'No significant'} difference")
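Why use median split for the t-test? Splitting GDP at the median creates two equal-sized groups, and for a fixed total sample size a two-group t-test has the most power when the groups are balanced. However, median splits discard information: countries just above and just below the median are treated as entirely different, while countries far apart within the same group are treated as identical. A continuous regression (Phase 5) is more informative. Median splits are useful for generating interpretable group comparisons, but do not substitute for regression analysis.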
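The information lost by a median split shows up as a weaker correlation. The following sketch uses simulated data (the slope of 0.5 and the sample size are arbitrary assumptions) to compare the correlation computed from the continuous predictor with the one computed from its above/below-median indicator.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
x = rng.normal(size=n)               # continuous predictor (think: GDP, standardized)
y = 0.5 * x + rng.normal(size=n)     # outcome with a known linear dependence on x

# Correlation using the full continuous predictor
r_continuous = np.corrcoef(x, y)[0, 1]

# Correlation after replacing x by a high/low indicator split at the median
high = (x > np.median(x)).astype(float)
r_split = np.corrcoef(high, y)[0, 1]

print(f"continuous r = {r_continuous:.3f}, median-split r = {r_split:.3f}")
```

For roughly bivariate-normal data, dichotomizing at the median is known to shrink the correlation by a factor of about 0.8, which is equivalent to throwing away a substantial share of the sample.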
Phase 5: Advanced Analysis
Definition: Multiple Linear Regression
A statistical model that estimates the relationship between a continuous outcome variable and multiple predictor variables simultaneously. The model assumes: $y = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p + \varepsilon$, where $\varepsilon \sim \mathcal{N}(0, \sigma^2)$. Standardized coefficients (β*) allow direct comparison of predictor importance across variables with different scales.
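To make the model concrete, the sketch below simulates data from a linear model with known coefficients and recovers them by ordinary least squares with `np.linalg.lstsq`. The coefficient values, noise level, and sample size are arbitrary assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 2))                  # two predictors
true_beta = np.array([1.0, 2.0, -0.5])       # intercept, beta_1, beta_2 (assumed values)
noise = rng.normal(scale=0.1, size=n)        # the error term, N(0, 0.1^2)
y = true_beta[0] + X @ true_beta[1:] + noise

# Add a column of ones so the intercept is estimated with the slopes
design = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(design, y, rcond=None)

print(beta_hat.round(3))  # close to the assumed coefficients
```

scikit-learn's `LinearRegression` solves the same least-squares problem; writing it out once with a design matrix makes clear what `coef_` and `intercept_` correspond to.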
print("\n" + "=" * 60)
print("PHASE 5: ADVANCED ANALYSIS")
print("=" * 60)
# 5.1 Multiple Regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
feature_cols = ['GDP per capita', 'Social support', 'Healthy life expectancy',
'Freedom to make life choices', 'Generosity', 'Perceptions of corruption']
X = df[feature_cols].dropna()
y = df.loc[X.index, 'Happiness Score']
# Scale features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Fit model
model = LinearRegression()
model.fit(X_scaled, y)
print("\n--- Multiple Regression Results ---")
print(f"R-squared: {model.score(X_scaled, y):.4f}")
# Cross-validation
cv_scores = cross_val_score(model, X_scaled, y, cv=5, scoring='r2')
print(f"CV R-squared: {cv_scores.mean():.4f} (+/- {cv_scores.std():.4f})")
# Feature importance
print("\nFeature Coefficients:")
importance = pd.DataFrame({
'Feature': feature_cols,
'Coefficient': model.coef_
}).sort_values('Coefficient', key=abs, ascending=False)
print(importance.to_string(index=False))
# 5.2 Clustering
from sklearn.cluster import KMeans
X_cluster = df[['Happiness Score', 'GDP per capita']].dropna()
scaler_c = StandardScaler()
X_cluster_scaled = scaler_c.fit_transform(X_cluster)
# Find optimal k
inertias = []
for k in range(2, 7):
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    km.fit(X_cluster_scaled)
    inertias.append(km.inertia_)
# Use k=3
km = KMeans(n_clusters=3, random_state=42, n_init=10)
df.loc[X_cluster.index, 'Happiness Cluster'] = km.fit_predict(X_cluster_scaled)
print("\n--- Clustering Results ---")
print(df.groupby('Happiness Cluster')[['Happiness Score', 'GDP per capita']].mean().round(3))
Definition: Coefficient of Determination (R²)
The proportion of variance in the response variable explained by the regression model. R² = 1 − (SS_res / SS_tot), where SS_res is the residual sum of squares and SS_tot is the total sum of squares. On the data used to fit an ordinary least-squares model with an intercept, R² ranges from 0 (no explanation) to 1 (perfect explanation); on held-out data, such as cross-validation folds, it can be negative when the model predicts worse than the mean. Adjusted R² penalizes for the number of predictors, so adding irrelevant variables no longer inflates the score.
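The formula is short enough to compute by hand. The sketch below uses five made-up observations; `r_squared` and `adjusted_r_squared` are hypothetical helper names, and the adjusted version uses the standard formula 1 − (1 − R²)(n − 1)/(n − p − 1).

```python
import numpy as np

def r_squared(y, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    ss_res = np.sum((y - y_pred) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1 - ss_res / ss_tot

def adjusted_r_squared(r2, n, p):
    """Adjusted R^2 for n observations and p predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

y = np.array([3.0, 4.0, 5.0, 6.0, 7.0])
good_pred = np.array([3.2, 3.8, 5.1, 6.1, 6.8])   # close to y
bad_pred = np.array([7.0, 6.0, 5.0, 4.0, 3.0])    # reversed: worse than predicting the mean

r2_good = r_squared(y, good_pred)                  # SS_res = 0.14, SS_tot = 10
r2_bad = r_squared(y, bad_pred)                    # SS_res = 40, so R^2 is negative
adj_good = adjusted_r_squared(r2_good, n=len(y), p=1)

print(round(r2_good, 3), round(adj_good, 3), round(r2_bad, 3))
```

The negative value for the reversed predictions illustrates why R² on held-out data is not bounded below by zero.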
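Why standardize before regression? Standardizing predictors (z-score) puts coefficients on a common scale: each β represents the change in Y per one standard deviation change in X, holding the other predictors fixed. In the code above only the predictors are scaled, so that change is measured in happiness-score points; standardize Y as well to express it in standard deviation units. This allows direct comparison: if GDP has β = 0.8 and social support has β = 0.5, a one-SD change in GDP is associated with a 60% larger change in predicted happiness than a one-SD change in social support, regardless of their original units.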
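The difference between raw and standardized coefficients is easiest to see when predictors live on very different scales. The sketch below uses simulated data (the scales and effect sizes are assumptions chosen so that the first predictor has twice the per-SD effect of the second); `fit_ols` and `zscore` are hypothetical helpers.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares slopes (intercept fitted but not returned)."""
    design = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta[1:]

def zscore(a):
    """Center to mean 0 and scale to standard deviation 1, column-wise."""
    return (a - a.mean(axis=0)) / a.std(axis=0)

rng = np.random.default_rng(0)
n = 1000
gdp = rng.normal(20000, 5000, n)      # predictor measured in dollars
support = rng.normal(0.8, 0.1, n)     # predictor measured on a 0-1 scale
# Per-SD effects: 0.0002 * 5000 = 1.0 for gdp, 5.0 * 0.1 = 0.5 for support
y = 0.0002 * gdp + 5.0 * support + rng.normal(0, 0.5, n)

X = np.column_stack([gdp, support])
raw = fit_ols(X, y)                          # depends on the units of each column
standardized = fit_ols(zscore(X), zscore(y)) # unit-free, directly comparable

print("raw:", raw)
print("standardized:", standardized.round(3))
```

The raw coefficients suggest that support matters thousands of times more than GDP, purely because of units; the standardized ones recover the intended 2:1 ordering. The comparison is only as trustworthy as the model itself, and it becomes unstable when predictors are strongly correlated.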
Phase 6: Visualization Dashboard
print("\n" + "=" * 60)
print("PHASE 6: VISUALIZATION DASHBOARD")
print("=" * 60)
fig = plt.figure(figsize=(20, 16))
# Plot 1: Distribution of Happiness
ax1 = fig.add_subplot(2, 3, 1)
df['Happiness Score'].hist(bins=30, ax=ax1, color='steelblue', edgecolor='black', alpha=0.7)
ax1.axvline(df['Happiness Score'].mean(), color='red', linestyle='--', linewidth=2)
ax1.set_title('Distribution of Happiness Scores', fontweight='bold')
ax1.set_xlabel('Happiness Score')
# Plot 2: GDP vs Happiness
ax2 = fig.add_subplot(2, 3, 2)
scatter = ax2.scatter(df['GDP per capita'], df['Happiness Score'],
c=df['Healthy life expectancy'], cmap='viridis',
alpha=0.6, edgecolors='w', linewidth=0.5)
plt.colorbar(scatter, ax=ax2, label='Life Expectancy')
ax2.set_xlabel('GDP per Capita')
ax2.set_ylabel('Happiness Score')
ax2.set_title('GDP vs Happiness (colored by Life Expectancy)', fontweight='bold')
# Plot 3: Regional Comparison
ax3 = fig.add_subplot(2, 3, 3)
region_means = df.groupby('Region')['Happiness Score'].mean().sort_values()
region_means.plot(kind='barh', ax=ax3, color='steelblue', edgecolor='black')
ax3.set_xlabel('Mean Happiness Score')
ax3.set_title('Average Happiness by Region', fontweight='bold')
# Plot 4: Correlation Heatmap
ax4 = fig.add_subplot(2, 3, 4)
sns.heatmap(corr_matrix, annot=True, cmap='RdBu_r', center=0,
fmt='.2f', ax=ax4, cbar_kws={'shrink': 0.8})
ax4.set_title('Feature Correlations', fontweight='bold')
# Plot 5: Feature Importance
ax5 = fig.add_subplot(2, 3, 5)
importance_sorted = importance.sort_values('Coefficient')
colors = ['red' if c < 0 else 'steelblue' for c in importance_sorted['Coefficient']]
ax5.barh(importance_sorted['Feature'], importance_sorted['Coefficient'],
         color=colors, edgecolor='black')  # per-bar colors: red for negative coefficients
ax5.set_title('Regression Feature Importance', fontweight='bold')
ax5.set_xlabel('Standardized Coefficient')
# Plot 6: Happiness Categories
ax6 = fig.add_subplot(2, 3, 6)
cat_counts = df['Happiness Category'].value_counts()
cat_colors = ['#e74c3c', '#f39c12', '#2ecc71', '#3498db']
cat_counts.plot(kind='pie', ax=ax6, autopct='%1.1f%%', colors=cat_colors,
startangle=90, textprops={'fontsize': 9})
ax6.set_title('Happiness Categories', fontweight='bold')
ax6.set_ylabel('')
plt.suptitle('World Happiness Report: Analysis Dashboard',
             fontsize=16, fontweight='bold', y=1.02)
plt.tight_layout()
plt.savefig('06_dashboard.png', dpi=150, bbox_inches='tight')
plt.show()
Phase 7: Insights & Reporting
print("\n" + "=" * 60)
print("PHASE 7: KEY INSIGHTS & FINDINGS")
print("=" * 60)
# Note: the findings below describe the real World Happiness Report data.
# With the synthetic data from Phase 1 the columns are independent, so the
# printed correlations will be near zero and findings 1-5 will not hold.
print("""
KEY FINDINGS:
────────────────────────────────────────────────────────────
1. GDP STRONGLY CORRELATES WITH HAPPINESS
   • Pearson correlation: r = {:.3f}
   • Countries with higher GDP per capita consistently report
     higher happiness scores
   • BUT: Money isn't everything; social support and freedom
     matter too
2. REGIONAL DISPARITIES ARE SIGNIFICANT
   • ANOVA test confirms significant differences between regions
   • Western Europe and North America lead in happiness
   • Sub-Saharan Africa scores lowest on average
3. SOCIAL SUPPORT IS A KEY DRIVER
   • Second strongest predictor of happiness after GDP
   • Having someone to count on matters across all cultures
4. HEALTH MATTERS
   • Life expectancy correlates strongly with happiness (r = {:.3f})
   • Healthier populations are happier populations
5. CORRUPTION UNDERMINES HAPPINESS
   • Negative correlation between corruption perception and happiness
   • Trust in institutions matters for well-being
LIMITATIONS:
────────────────────────────────────────────────────────────
   • Self-reported happiness scores may be culturally biased
   • Correlation does not imply causation
   • Missing data for some countries
   • Cross-sectional analysis (not longitudinal)
   • Other factors not measured: climate, culture, inequality
RECOMMENDATIONS:
────────────────────────────────────────────────────────────
   • Policy should focus on economic growth AND social support
   • Anti-corruption measures likely improve happiness
   • Healthcare investment has measurable happiness returns
   • Freedom of choice matters independently of GDP
""".format(
    # Signed correlations, so a negative relationship is not reported as positive
    corr_matrix.loc['Happiness Score', 'GDP per capita'],
    corr_matrix.loc['Happiness Score', 'Healthy life expectancy']
))
Correlation ≠ Causation, but it constrains it. The strong GDP-happiness correlation (r ≈ 0.7+) does not prove that increasing GDP causes happiness. Possible explanations include: (1) GDP causes happiness (prosperity enables well-being), (2) happiness causes GDP (happy workers are more productive), (3) a third variable (e.g., institutional quality) causes both. Controlled experiments or natural experiments are needed to establish causation; cross-sectional correlation alone cannot.
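Explanation (3) can be simulated. In the sketch below, a made-up confounder drives two variables that have no direct effect on each other; the variable names are illustrative labels, not measurements. The raw correlation is clearly positive, and it vanishes once the confounder is regressed out of both variables.

```python
import numpy as np

def residualize(v, z):
    """Remove the linear effect of z from v (residuals of a simple regression)."""
    slope = np.cov(v, z)[0, 1] / np.var(z, ddof=1)
    return (v - v.mean()) - slope * (z - z.mean())

rng = np.random.default_rng(0)
n = 5000
institutions = rng.normal(size=n)               # hypothetical common cause
gdp = institutions + rng.normal(size=n)         # depends only on the common cause
happiness = institutions + rng.normal(size=n)   # depends only on the common cause

r_raw = np.corrcoef(gdp, happiness)[0, 1]
r_partial = np.corrcoef(residualize(gdp, institutions),
                        residualize(happiness, institutions))[0, 1]

print(f"raw r = {r_raw:.3f}, after controlling for the confounder r = {r_partial:.3f}")
```

Controlling for a confounder only helps when that confounder is measured; with observational data there is no guarantee that every common cause has been included.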
Project Structure Template
project/
├── data/
│   ├── raw/              # Original data (never modify)
│   ├── processed/        # Cleaned data
│   └── external/         # External data sources
├── notebooks/
│   ├── 01_EDA.ipynb
│   ├── 02_Cleaning.ipynb
│   ├── 03_Analysis.ipynb
│   └── 04_Visualization.ipynb
├── src/
│   ├── data/
│   │   └── make_dataset.py
│   ├── features/
│   │   └── build_features.py
│   ├── models/
│   │   └── train_model.py
│   └── visualization/
│       └── visualize.py
├── reports/
│   ├── figures/
│   └── final_report.pdf
├── requirements.txt
└── README.md
Key Takeaways
Summary: Capstone Foundation Project
- Follow a systematic pipeline: Load → Clean → Explore → Test → Analyze → Visualize → Report. Each phase produces artifacts that feed into the next. Skipping or reordering phases (e.g., running statistical tests before cleaning) leads to errors downstream.
- Always document your decisions and assumptions. Why did you impute with median instead of mean? Why did you drop those rows? Future you (and collaborators) needs to understand these choices.
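- Check missing values, outliers, and data types before analysis. The IQR method (fences at Q1 − 1.5·IQR and Q3 + 1.5·IQR) is a robust outlier detection standard. Data type optimization can cut memory substantially for the affected columns; for example, downcasting int64 to uint16 uses a quarter of the space.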
- Use statistical tests to validate visual observations. ANOVA confirms regional differences; Pearson correlation quantifies linear association; t-tests compare group means. Always report effect sizes alongside p-values.
- Multiple regression reveals relative importance of factors. Standardized coefficients (β*) enable direct comparison across predictors with different scales. Cross-validation guards against overfitting by evaluating the model on held-out data.
- Good visualizations tell a story; they are not just pretty pictures. Each panel should answer a specific question: What is the distribution? What correlates with what? Which regions differ?
- Always state limitations and caveats in your findings. Self-reported data has cultural biases; correlation does not imply causation; cross-sectional data cannot establish temporal ordering. Intellectual honesty strengthens credibility.
How to Extend This Project
- Real Data: Download the actual World Happiness Report CSV from Kaggle
- Time Series: Analyze how happiness trends over years
- Predictive Model: Build a model to predict happiness from economic indicators
- Dashboard: Create an interactive Plotly dashboard
- Report: Write a 5-page PDF report with findings and policy recommendations
- Portfolio: Deploy as a blog post or GitHub Pages site