Capstone Project: End-to-End Data Analysis

Why This Matters

This capstone project brings together every skill from the Foundations module. You'll work through a complete data analysis pipeline — from raw data to actionable insights — exactly as you would in a real data science role.

Dataset: World Happiness Report (2015-2019), published by the UN Sustainable Development Solutions Network and based largely on Gallup World Poll survey data.

Definition: Reproducible Analysis

An analysis where another researcher can obtain the same results from the same data by following the same computational steps. Reproducibility requires documented code, version-controlled data, and deterministic random seeds. It is the minimum standard of scientific rigor — even exploratory analyses should be reproducible.

Why a capstone? Individual skills (pandas, visualization, statistics) are necessary but not sufficient. Real data science requires integrating these skills into a coherent workflow where each step informs the next. This project teaches you to think in pipelines, not isolated operations. The World Happiness Report is ideal because it contains real statistical relationships (GDP–happiness correlation) that produce interpretable results.

Project Overview

Architecture Diagram

Phase 1: Data Collection & Loading
    ↓
Phase 2: Data Cleaning & Validation
    ↓
Phase 3: Exploratory Data Analysis (EDA)
    ↓
Phase 4: Statistical Testing
    ↓
Phase 5: Advanced Analysis
    ↓
Phase 6: Visualization Dashboard
    ↓
Phase 7: Insights & Reporting
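
One way to keep the phases above connected is to write each phase as a function that takes a DataFrame and returns a DataFrame, then chain the functions with pandas' `pipe`. The sketch below only illustrates the pattern: the function names (`clean`, `add_features`) and the tiny inline table are assumptions made for this example, not part of the project code.

```python
import pandas as pd

def clean(frame: pd.DataFrame) -> pd.DataFrame:
    """Phase 2 stand-in: drop rows with missing scores and exact duplicates."""
    return frame.dropna(subset=["Happiness Score"]).drop_duplicates()

def add_features(frame: pd.DataFrame) -> pd.DataFrame:
    """Feature-engineering stand-in: flag scores above the sample median."""
    out = frame.copy()
    out["Above median"] = out["Happiness Score"] > out["Happiness Score"].median()
    return out

# Hypothetical toy data, used only to show the chaining pattern
raw = pd.DataFrame({
    "Country": ["A", "B", "C", "C", "D"],
    "Happiness Score": [7.1, 4.2, 5.5, 5.5, None],
})

# Each step receives the output of the previous one
result = raw.pipe(clean).pipe(add_features)
print(result)
```

Because every step has the same input and output type, steps can be reordered, tested in isolation, or rerun from raw data, which supports the reproducibility goal described earlier.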

Definition: Exploratory Data Analysis (EDA)

The process of systematically examining a dataset to understand its structure, detect anomalies, test assumptions, and identify relationships before formal statistical modeling. EDA combines summary statistics, visualizations, and cross-tabulations to generate hypotheses rather than test them. John Tukey (1977) argued that EDA should precede confirmatory analysis to prevent misspecification of models.

Phase 1: Data Collection & Loading

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import warnings
warnings.filterwarnings('ignore')

# ============================================
# 1. LOAD DATA
# ============================================

# Load the World Happiness Report data from a local CSV file if it exists.
# Otherwise fall back to comparable synthetic data so the project still runs.
# NOTE: the rest of the project assumes the column names used below, and the
# synthetic columns are drawn independently, so they will not reproduce the
# real relationships (for example, between GDP and happiness).
try:
    df = pd.read_csv('world-happiness-report.csv')
except FileNotFoundError:
    np.random.seed(42)
    n_countries = 150

    df = pd.DataFrame({
        'Country': [f'Country_{i}' for i in range(n_countries)],
        'Region': np.random.choice(['Western Europe', 'North America', 'Asia Pacific',
                                     'Latin America', 'Sub-Saharan Africa',
                                     'Central and Eastern Europe', 'Middle East'], n_countries),
        'Happiness Score': np.random.normal(5.5, 1.8, n_countries).clip(2, 8),
        'GDP per capita': np.random.lognormal(9.5, 1.0, n_countries) / 1000,  # thousands of USD
        'Social support': np.random.normal(0.8, 0.15, n_countries).clip(0, 1),
        'Healthy life expectancy': np.random.normal(65, 10, n_countries).clip(40, 85),
        'Freedom to make life choices': np.random.normal(0.45, 0.12, n_countries).clip(0, 1),
        'Generosity': np.random.normal(0.2, 0.15, n_countries).clip(-0.2, 0.6),
        'Perceptions of corruption': np.random.normal(0.15, 0.12, n_countries).clip(0, 0.5),
        'Year': np.random.choice([2015, 2016, 2017, 2018, 2019], n_countries)
    })

print("=" * 60)
print("PHASE 1: DATA COLLECTION & LOADING")
print("=" * 60)
print(f"\nDataset shape: {df.shape}")
print(f"Rows: {df.shape[0]}, Columns: {df.shape[1]}")

print(f"\nColumn types:")
print(df.dtypes)

print(f"\nFirst 5 rows:")
print(df.head())

Why set a random seed? np.random.seed(42) ensures that every time you run the synthetic data generation code, you get the same "random" numbers. This makes your analysis reproducible — someone else running your code will get identical results. The seed value (42, the Answer to the Ultimate Question of Life, the Universe, and Everything) is a common convention but any fixed integer works.
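
The effect of a fixed seed can be checked directly. The sketch below uses NumPy's `default_rng` generator API, which is an alternative to the global `np.random.seed` call used in this project; the variable names are illustrative.

```python
import numpy as np

# Two generators created with the same seed produce the same sequence
rng_a = np.random.default_rng(42)
rng_b = np.random.default_rng(42)
sample_a = rng_a.normal(5.5, 1.8, size=5)
sample_b = rng_b.normal(5.5, 1.8, size=5)
print(np.array_equal(sample_a, sample_b))  # True

# A generator with a different seed produces a different sequence
rng_c = np.random.default_rng(7)
sample_c = rng_c.normal(5.5, 1.8, size=5)
print(np.array_equal(sample_a, sample_c))
```

Passing a generator object into functions, rather than relying on global state, also makes it explicit which parts of a pipeline depend on randomness.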

Phase 2: Data Cleaning & Validation

Definition: Data Validation

The systematic process of checking data for correctness, completeness, and consistency. Validation includes: (1) checking for missing values and deciding on imputation strategy, (2) detecting and handling outliers, (3) verifying data types and ranges, (4) identifying duplicate records. Poor validation leads to misleading results and invalid conclusions.
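
Item (3), verifying ranges, can be sketched as a small helper that counts values outside a plausible interval. The ranges in `EXPECTED_RANGES`, the helper name `check_ranges`, and the toy table are assumptions for illustration; real limits should come from the documentation of the dataset you load.

```python
import pandas as pd

# Hypothetical plausible ranges; adjust to the codebook of the data you load
EXPECTED_RANGES = {
    "Happiness Score": (0, 10),
    "Social support": (0, 1),
    "Year": (2015, 2019),
}

def check_ranges(frame: pd.DataFrame, ranges: dict) -> dict:
    """Return the number of out-of-range values per column (missing values are not counted)."""
    violations = {}
    for col, (low, high) in ranges.items():
        values = frame[col].dropna()
        violations[col] = int(((values < low) | (values > high)).sum())
    return violations

# Toy table with one deliberate violation in each column
toy = pd.DataFrame({
    "Happiness Score": [7.2, 11.0, 4.8],
    "Social support": [0.9, 0.5, -0.1],
    "Year": [2015, 2019, 2021],
})
report = check_ranges(toy, EXPECTED_RANGES)
print(report)
```

Reporting counts instead of silently clipping values keeps the decision about how to handle violations visible in the analysis.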

print("\n" + "=" * 60)
print("PHASE 2: DATA CLEANING & VALIDATION")
print("=" * 60)

# 2.1 Missing Values Analysis
print("\n--- Missing Values ---")
missing = df.isnull().sum()
missing_pct = (missing / len(df) * 100).round(2)
missing_df = pd.DataFrame({'Count': missing, 'Percent': missing_pct})
print(missing_df[missing_df['Count'] > 0])

# Strategy: If missing < 5%, drop; if 5-25%, impute median; if > 25%, investigate
for col in df.select_dtypes(include=[np.number]).columns:
    if df[col].isnull().sum() > 0:
        pct = df[col].isnull().sum() / len(df) * 100
        if pct < 5:
            df = df.dropna(subset=[col])
            print(f"  Dropped rows for {col} ({pct:.1f}% missing)")
        elif pct < 25:
            median_val = df[col].median()
            df[col] = df[col].fillna(median_val)
            print(f"  Imputed {col} with median ({pct:.1f}% missing)")
        else:
            print(f"  WARNING: {col} has {pct:.1f}% missing — needs investigation")

# 2.2 Duplicate Check
duplicates = df.duplicated().sum()
print(f"\nDuplicate rows: {duplicates}")
if duplicates > 0:
    df = df.drop_duplicates()
    print(f"Removed {duplicates} duplicates")

# 2.3 Outlier Detection (IQR method)
print("\n--- Outlier Detection ---")
numeric_cols = df.select_dtypes(include=[np.number]).columns

for col in numeric_cols:
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR
    outliers = ((df[col] < lower) | (df[col] > upper)).sum()
    if outliers > 0:
        print(f"  {col}: {outliers} outliers ({outliers/len(df)*100:.1f}%)")

# 2.4 Data Type Optimization
print("\n--- Data Type Optimization ---")
for col in df.select_dtypes(include=['int64']).columns:
    if df[col].min() >= 0:
        if df[col].max() <= 255:      # uint8 holds 0..255
            df[col] = df[col].astype('uint8')
            print(f"  {col}: int64 -> uint8")
        elif df[col].max() <= 65535:  # uint16 holds 0..65535
            df[col] = df[col].astype('uint16')
            print(f"  {col}: int64 -> uint16")

# 2.5 Feature Engineering
print("\n--- Feature Engineering ---")
df['Happiness Category'] = pd.cut(df['Happiness Score'],
                                   bins=[0, 4, 5.5, 7, 10],
                                   labels=['Low', 'Medium', 'High', 'Very High'])
print(f"  Created Happiness Category: {df['Happiness Category'].value_counts().to_dict()}")

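# Open-ended top bin so very high GDP values are not silently turned into NaN
df['GDP Category'] = pd.cut(df['GDP per capita'],
                             bins=[0, 10, 30, 60, np.inf],
                             labels=['Low', 'Medium', 'High', 'Very High'])
print(f"  Created GDP Category: {df['GDP Category'].value_counts().to_dict()}")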

print(f"\nFinal dataset shape: {df.shape}")

Definition: Interquartile Range (IQR) Outlier Detection

A robust outlier detection method based on quartiles. The IQR = Q3 − Q1 (75th − 25th percentile). Values below Q1 − 1.5·IQR or above Q3 + 1.5·IQR are flagged as outliers. The 1.5 multiplier is Tukey's convention — it corresponds to approximately ±2.7σ for normally distributed data, capturing ~99.3% of data points within the fence.
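
The ±2.7σ and ~99.3% figures can be checked against the standard normal distribution. This is a small verification sketch using `scipy.stats.norm`; the variable names are illustrative.

```python
from scipy import stats

# Quartiles of the standard normal distribution (sigma = 1)
q1 = stats.norm.ppf(0.25)
q3 = stats.norm.ppf(0.75)
iqr = q3 - q1

# Tukey's upper fence, expressed in units of sigma (the lower fence is its mirror image)
upper_fence = q3 + 1.5 * iqr

# Probability mass between the two fences
inside = stats.norm.cdf(upper_fence) - stats.norm.cdf(-upper_fence)

print(f"IQR of N(0, 1):     {iqr:.3f} sigma")
print(f"Upper fence:        {upper_fence:.3f} sigma")
print(f"Share inside fence: {inside:.4f}")
```

For skewed variables such as GDP per capita the normal approximation does not hold, so a larger share of points than 0.7% may be flagged.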

Phase 3: Exploratory Data Analysis

print("\n" + "=" * 60)
print("PHASE 3: EXPLORATORY DATA ANALYSIS")
print("=" * 60)

# 3.1 Summary Statistics
print("\n--- Summary Statistics ---")
print(df.describe().round(3))

# 3.2 Distribution Analysis
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
plot_cols = ['Happiness Score', 'GDP per capita', 'Social support',
             'Healthy life expectancy', 'Freedom to make life choices', 'Generosity']

for i, col in enumerate(plot_cols):
    ax = axes[i // 3, i % 3]
    df[col].hist(bins=30, ax=ax, color='steelblue', edgecolor='black', alpha=0.7)
    ax.axvline(df[col].mean(), color='red', linestyle='--', label=f'Mean: {df[col].mean():.2f}')
    ax.set_title(col, fontsize=11, fontweight='bold')
    ax.legend(fontsize=8)

plt.suptitle('Distribution of Happiness Factors', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.savefig('03_distributions.png', dpi=150, bbox_inches='tight')
plt.show()

# 3.3 Correlation Analysis
print("\n--- Correlation Matrix ---")
corr_cols = ['Happiness Score', 'GDP per capita', 'Social support',
             'Healthy life expectancy', 'Freedom to make life choices', 'Generosity']
corr_matrix = df[corr_cols].corr()
print(corr_matrix.round(3))

# Heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='RdBu_r', center=0,
            fmt='.2f', square=True, linewidths=0.5)
plt.title('Correlation Between Happiness Factors', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.savefig('03_correlation_heatmap.png', dpi=150, bbox_inches='tight')
plt.show()

# 3.4 Regional Analysis
print("\n--- Happiness by Region ---")
regional = df.groupby('Region')['Happiness Score'].agg(['mean', 'std', 'count'])
regional = regional.sort_values('mean', ascending=False)
print(regional.round(3))

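# Create the figure once and draw into it; calling plt.figure() and then
# df.boxplot(figsize=...) would leave an extra empty figure behind
fig, ax = plt.subplots(figsize=(12, 6))
df.boxplot(column='Happiness Score', by='Region', ax=ax)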
plt.title('Happiness Score by Region', fontsize=14, fontweight='bold')
plt.suptitle('')
plt.xticks(rotation=45, ha='right')
plt.ylabel('Happiness Score')
plt.tight_layout()
plt.savefig('03_regional_boxplot.png', dpi=150, bbox_inches='tight')
plt.show()

# 3.5 Relationship Analysis
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

scatter_vars = [('GDP per capita', 'Happiness Score'),
                ('Healthy life expectancy', 'Happiness Score'),
                ('Social support', 'Happiness Score')]

for i, (x, y) in enumerate(scatter_vars):
    # Drop rows where either variable is missing so x and y stay aligned
    pair = df[[x, y]].dropna().sort_values(x)
    axes[i].scatter(pair[x], pair[y], alpha=0.5, color='steelblue')
    # Least-squares trend line
    slope, intercept = np.polyfit(pair[x], pair[y], 1)
    axes[i].plot(pair[x], slope * pair[x] + intercept, 'r--', linewidth=2)
    r, _ = stats.pearsonr(pair[x], pair[y])
    axes[i].set_xlabel(x, fontsize=10)
    axes[i].set_ylabel(y, fontsize=10)
    axes[i].set_title(f'{x} vs {y}\nr = {r:.3f}', fontsize=10, fontweight='bold')

plt.tight_layout()
plt.savefig('03_scatter_relationships.png', dpi=150, bbox_inches='tight')
plt.show()

The correlation heatmap reveals multicollinearity risk: If two predictor variables (e.g., GDP per capita and Healthy life expectancy) have |r| > 0.7, including both in a regression model can inflate coefficient standard errors and make individual coefficients unstable. This does not invalidate the model but makes interpretation of individual coefficients unreliable. Consider using only one of the correlated predictors or using dimensionality reduction.
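
A common way to quantify this risk is the variance inflation factor (VIF), which is not computed in the project code. The sketch below is a minimal illustration on made-up predictors (`x1`, `x2`, `x3` are hypothetical); it relies on the identity that the VIF of predictor j equals the j-th diagonal element of the inverse correlation matrix.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.5, size=n)   # built to correlate strongly with x1
x3 = rng.normal(size=n)                   # independent of the others
predictors = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})

# VIF_j = 1 / (1 - R_j^2) = j-th diagonal element of the inverse correlation matrix
corr = predictors.corr().to_numpy()
vif = pd.Series(np.diag(np.linalg.inv(corr)), index=predictors.columns)
print(vif.round(2))
```

A VIF near 1 indicates little overlap with the other predictors. Values above roughly 5 to 10 are often treated as a warning sign, though these cutoffs are rules of thumb rather than formal tests.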

Phase 4: Statistical Testing

print("\n" + "=" * 60)
print("PHASE 4: STATISTICAL TESTING")
print("=" * 60)

# 4.1 Test: Is there a significant difference in happiness between regions?
print("\n--- Test 1: Regional Differences (ANOVA) ---")

# Select top regions
top_regions = ['Western Europe', 'North America', 'Asia Pacific']
df_top = df[df['Region'].isin(top_regions)]

groups = [group['Happiness Score'].values for name, group in df_top.groupby('Region')]
f_stat, p_value = stats.f_oneway(*groups)

print(f"F-statistic: {f_stat:.4f}")
print(f"p-value:     {p_value:.6f}")
print(f"Conclusion: {'Significant' if p_value < 0.05 else 'No significant'} difference between regions")

# 4.2 Test: Is GDP correlated with Happiness?
print("\n--- Test 2: GDP-Happiness Correlation (Pearson) ---")
gdp_pair = df[['GDP per capita', 'Happiness Score']].dropna()  # keep rows aligned
r, p_value = stats.pearsonr(gdp_pair['GDP per capita'], gdp_pair['Happiness Score'])
print(f"Pearson r:   {r:.4f}")
print(f"p-value:     {p_value:.6f}")
print(f"Conclusion: {'Strong' if abs(r) > 0.5 else 'Moderate' if abs(r) > 0.3 else 'Weak'} correlation")

# 4.3 Test: Are high-GDP and low-GDP countries different?
print("\n--- Test 3: High vs Low GDP Countries (t-test) ---")
high_gdp = df[df['GDP per capita'] > df['GDP per capita'].median()]['Happiness Score']
low_gdp = df[df['GDP per capita'] <= df['GDP per capita'].median()]['Happiness Score']

t_stat, p_value = stats.ttest_ind(high_gdp, low_gdp)
cohens_d = (high_gdp.mean() - low_gdp.mean()) / np.sqrt((high_gdp.std()**2 + low_gdp.std()**2) / 2)

print(f"High GDP mean: {high_gdp.mean():.3f}")
print(f"Low GDP mean:  {low_gdp.mean():.3f}")
print(f"t-statistic:   {t_stat:.4f}")
print(f"p-value:       {p_value:.6f}")
print(f"Cohen's d:     {cohens_d:.4f}")
print(f"Conclusion: {'Significant' if p_value < 0.05 else 'No significant'} difference")

Why use a median split for the t-test? Splitting GDP at the median creates two groups of (nearly) equal size, and for a fixed total sample size a two-sample t-test has the most power when the groups are balanced. However, dichotomizing a continuous variable discards information and typically weakens the observed association, so a continuous regression (Phase 5) is more informative. Median splits are useful for interpretable group comparisons but do not substitute for regression analysis.
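
The information loss from a median split can be seen in a small simulation. The data below are synthetic and the coefficients are arbitrary choices for illustration; the point is only to compare the correlation obtained with the continuous predictor against the one obtained after dichotomizing it.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 2000
x = rng.normal(size=n)
y = 0.6 * x + rng.normal(scale=0.8, size=n)   # population correlation is 0.6

r_continuous, _ = stats.pearsonr(x, y)

# Replace x with a 0/1 indicator for "above the median"
x_split = (x > np.median(x)).astype(float)
r_split, _ = stats.pearsonr(x_split, y)

print(f"r with continuous x:   {r_continuous:.3f}")
print(f"r with median-split x: {r_split:.3f}")
```

For normally distributed predictors the split version is expected to recover only about 80% of the original correlation, which is why the regression in Phase 5 uses GDP as a continuous variable.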

Phase 5: Advanced Analysis

Definition: Multiple Linear Regression

A statistical model that estimates the relationship between a continuous outcome variable and multiple predictor variables simultaneously. The model assumes $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k + \epsilon$, where $\epsilon \sim N(0, \sigma^2)$. Standardized coefficients (β*) allow direct comparison of predictor importance across variables with different scales.

print("\n" + "=" * 60)
print("PHASE 5: ADVANCED ANALYSIS")
print("=" * 60)

# 5.1 Multiple Regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler

feature_cols = ['GDP per capita', 'Social support', 'Healthy life expectancy',
                'Freedom to make life choices', 'Generosity', 'Perceptions of corruption']

X = df[feature_cols].dropna()
y = df.loc[X.index, 'Happiness Score']

# Scale features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Fit model
model = LinearRegression()
model.fit(X_scaled, y)

print("\n--- Multiple Regression Results ---")
print(f"R-squared: {model.score(X_scaled, y):.4f}")

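# Cross-validation: refit the scaler inside each fold so that statistics from
# the held-out fold do not leak into training
from sklearn.pipeline import make_pipeline
cv_model = make_pipeline(StandardScaler(), LinearRegression())
cv_scores = cross_val_score(cv_model, X, y, cv=5, scoring='r2')
print(f"CV R-squared: {cv_scores.mean():.4f} (+/- {cv_scores.std():.4f})")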

# Feature importance
print("\nFeature Coefficients:")
importance = pd.DataFrame({
    'Feature': feature_cols,
    'Coefficient': model.coef_
}).sort_values('Coefficient', key=abs, ascending=False)
print(importance.to_string(index=False))

# 5.2 Clustering
from sklearn.cluster import KMeans

X_cluster = df[['Happiness Score', 'GDP per capita']].dropna()
scaler_c = StandardScaler()
X_cluster_scaled = scaler_c.fit_transform(X_cluster)

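# Find optimal k (elbow method): look for the k after which inertia stops dropping sharply
inertias = []
k_values = range(2, 7)
for k in k_values:
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    km.fit(X_cluster_scaled)
    inertias.append(km.inertia_)
print("Inertia by k:", {k: round(float(v), 1) for k, v in zip(k_values, inertias)})

# Use k=3 (chosen for illustration; confirm against the inertia values printed above)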
km = KMeans(n_clusters=3, random_state=42, n_init=10)
df.loc[X_cluster.index, 'Happiness Cluster'] = km.fit_predict(X_cluster_scaled)

print("\n--- Clustering Results ---")
print(df.groupby('Happiness Cluster')[['Happiness Score', 'GDP per capita']].mean().round(3))

Definition: Coefficient of Determination (R²)

The proportion of variance in the response variable explained by the regression model. R² = 1 − (SS_res / SS_tot), where SS_res is the residual sum of squares and SS_tot is the total sum of squares. For a least-squares model with an intercept, evaluated on its own training data, R² ranges from 0 (no explanation) to 1 (perfect explanation); on held-out data, as in cross-validation, it can be negative when the model predicts worse than the mean. Adjusted R² penalizes the number of predictors, which discourages adding irrelevant variables.
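
The two formulas can be written out directly. This is a minimal sketch with made-up numbers; the helper names `r_squared` and `adjusted_r_squared` are illustrative, and the adjusted form shown is the usual 1 − (1 − R²)(n − 1)/(n − k − 1).

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot

def adjusted_r_squared(r2, n_samples, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)

y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 7.5, 8.5]

r2 = r_squared(y_true, y_pred)
adj = adjusted_r_squared(r2, n_samples=4, n_predictors=1)
print(f"R^2:          {r2:.3f}")    # SS_res = 1, SS_tot = 20, so 0.950
print(f"Adjusted R^2: {adj:.3f}")   # 1 - 0.05 * 3 / 2 = 0.925

# Predictions that are worse than simply predicting the mean give a negative value
reversed_r2 = r_squared(y_true, [9.0, 7.0, 5.0, 3.0])
print(f"Reversed predictions: {reversed_r2:.3f}")
```

The last case is worth keeping in mind when reading cross-validated scores: a negative value means the model did worse on that fold than a constant prediction would have.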

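Why standardize before regression? Standardizing predictors (z-scores) puts all coefficients on a common scale: each β is the expected change in Y, in Y's original units, for a one-standard-deviation increase in that predictor with the others held fixed. (If Y is standardized as well, β is expressed in standard deviations of Y.) This makes magnitudes comparable across predictors measured in different units: if GDP has β = 0.8 and social support has β = 0.5, a one-SD change in GDP is associated with a 60% larger change in predicted happiness than a one-SD change in social support. Treat this as a rough guide to importance, since correlated predictors make individual coefficients unstable.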
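
How standardization changes the units of a coefficient can be checked numerically. The sketch below uses synthetic data and a small least-squares helper (`ols_slopes`, a hypothetical name) instead of scikit-learn, so that the relationship between raw and standardized slopes is visible.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 300
X = np.column_stack([
    rng.normal(50, 20, n),     # predictor on a large scale
    rng.normal(0.5, 0.1, n),   # predictor on a small scale
])
y = 0.02 * X[:, 0] + 3.0 * X[:, 1] + rng.normal(0, 0.2, n)

def ols_slopes(features, target):
    """Least-squares slopes (an intercept is fitted, then dropped)."""
    design = np.column_stack([np.ones(len(features)), features])
    coefs, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coefs[1:]

raw = ols_slopes(X, y)                        # units of y per unit of x
X_z = (X - X.mean(axis=0)) / X.std(axis=0)
y_z = (y - y.mean()) / y.std()

x_only = ols_slopes(X_z, y)                   # units of y per 1 SD of x
both = ols_slopes(X_z, y_z)                   # SDs of y per 1 SD of x

print("raw slopes:          ", raw.round(3))
print("X standardized:      ", x_only.round(3))
print("X and y standardized:", both.round(3))
```

The raw slopes differ by two orders of magnitude only because the predictors are measured on different scales. After standardizing X, each slope equals the raw slope multiplied by that predictor's standard deviation, and dividing by the standard deviation of y gives the fully standardized version.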

Phase 6: Visualization Dashboard

print("\n" + "=" * 60)
print("PHASE 6: VISUALIZATION DASHBOARD")
print("=" * 60)

fig = plt.figure(figsize=(20, 16))

# Plot 1: Distribution of Happiness
ax1 = fig.add_subplot(2, 3, 1)
df['Happiness Score'].hist(bins=30, ax=ax1, color='steelblue', edgecolor='black', alpha=0.7)
ax1.axvline(df['Happiness Score'].mean(), color='red', linestyle='--', linewidth=2)
ax1.set_title('Distribution of Happiness Scores', fontweight='bold')
ax1.set_xlabel('Happiness Score')

# Plot 2: GDP vs Happiness
ax2 = fig.add_subplot(2, 3, 2)
scatter = ax2.scatter(df['GDP per capita'], df['Happiness Score'],
                      c=df['Healthy life expectancy'], cmap='viridis',
                      alpha=0.6, edgecolors='w', linewidth=0.5)
plt.colorbar(scatter, ax=ax2, label='Life Expectancy')
ax2.set_xlabel('GDP per Capita')
ax2.set_ylabel('Happiness Score')
ax2.set_title('GDP vs Happiness (colored by Life Expectancy)', fontweight='bold')

# Plot 3: Regional Comparison
ax3 = fig.add_subplot(2, 3, 3)
region_means = df.groupby('Region')['Happiness Score'].mean().sort_values()
region_means.plot(kind='barh', ax=ax3, color='steelblue', edgecolor='black')
ax3.set_xlabel('Mean Happiness Score')
ax3.set_title('Average Happiness by Region', fontweight='bold')

# Plot 4: Correlation Heatmap
ax4 = fig.add_subplot(2, 3, 4)
sns.heatmap(corr_matrix, annot=True, cmap='RdBu_r', center=0,
            fmt='.2f', ax=ax4, cbar_kws={'shrink': 0.8})
ax4.set_title('Feature Correlations', fontweight='bold')

# Plot 5: Feature Importance
ax5 = fig.add_subplot(2, 3, 5)
importance_sorted = importance.sort_values('Coefficient')
colors = ['red' if c < 0 else 'steelblue' for c in importance_sorted['Coefficient']]
# Draw with Matplotlib directly so the per-bar colors are applied explicitly
ax5.barh(importance_sorted['Feature'], importance_sorted['Coefficient'],
         color=colors, edgecolor='black')
ax5.set_title('Regression Feature Importance', fontweight='bold')
ax5.set_xlabel('Standardized Coefficient')

# Plot 6: Happiness Categories
ax6 = fig.add_subplot(2, 3, 6)
# Sort by category order (Low -> Very High) so the slices line up with cat_colors
cat_counts = df['Happiness Category'].value_counts().sort_index()
cat_colors = ['#e74c3c', '#f39c12', '#2ecc71', '#3498db']
cat_counts.plot(kind='pie', ax=ax6, autopct='%1.1f%%', colors=cat_colors,
                startangle=90, textprops={'fontsize': 9})
ax6.set_title('Happiness Categories', fontweight='bold')
ax6.set_ylabel('')

plt.suptitle('World Happiness Report — Analysis Dashboard',
             fontsize=16, fontweight='bold', y=1.02)
plt.tight_layout()
plt.savefig('06_dashboard.png', dpi=150, bbox_inches='tight')
plt.show()

Phase 7: Insights & Reporting

print("\n" + "=" * 60)
print("PHASE 7: KEY INSIGHTS & FINDINGS")
print("=" * 60)

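# NOTE: The statements below describe what the real World Happiness Report data
# typically shows. Check each one against the results computed in Phases 3-5
# before reporting it; the synthetic fallback data has no built-in relationships.
print("""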
KEY FINDINGS:
═══════════════════════════════════════════════════════════════

1. GDP STRONGLY CORRELATES WITH HAPPINESS
   • Pearson correlation: r = {:.3f}
   • Countries with higher GDP per capita consistently report
     higher happiness scores
   • BUT: Money isn't everything — social support and freedom
     matter too

2. REGIONAL DISPARITIES ARE SIGNIFICANT
   • ANOVA test confirms significant differences between regions
   • Western Europe and North America lead in happiness
   • Sub-Saharan Africa scores lowest on average

3. SOCIAL SUPPORT IS A KEY DRIVER
   • Second strongest predictor of happiness after GDP
   • Having someone to count on matters across all cultures

4. HEALTH MATTERS
   • Life expectancy correlates strongly with happiness (r = {:.3f})
   • Healthier populations are happier populations

5. CORRUPTION UNDERMINES HAPPINESS
   • Negative correlation between corruption perception and happiness
   • Trust in institutions matters for well-being

LIMITATIONS:
═══════════════════════════════════════════════════════════════
• Self-reported happiness scores may be culturally biased
• Correlation does not imply causation
• Missing data for some countries
• Cross-sectional analysis (not longitudinal)
• Other factors not measured: climate, culture, inequality

RECOMMENDATIONS:
═══════════════════════════════════════════════════════════════
• Policy should focus on economic growth AND social support
• Anti-corruption measures likely improve happiness
• Healthcare investment has measurable happiness returns
• Freedom of choice matters independently of GDP
""".format(
    corr_matrix.loc['Happiness Score', 'GDP per capita'],            # keep the sign of r
    corr_matrix.loc['Happiness Score', 'Healthy life expectancy']
))

Correlation ≠ Causation — but it constrains it. The strong GDP–happiness correlation (r ≈ 0.7+) does not prove that increasing GDP causes happiness. Possible explanations include: (1) GDP causes happiness (prosperity enables well-being), (2) happiness causes GDP (happy workers are more productive), (3) a third variable (e.g., institutional quality) causes both. Controlled experiments or natural experiments are needed to establish causation — cross-sectional correlation alone cannot.
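
Explanation (3) can be illustrated with a simulation in which a hidden variable drives two others that never influence each other. Everything here is synthetic: `z` stands in for something like institutional quality, and `x` and `y` are hypothetical stand-ins for GDP and happiness.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 5000
z = rng.normal(size=n)                 # hidden common cause
x = z + rng.normal(scale=0.7, size=n)  # driven by z only
y = z + rng.normal(scale=0.7, size=n)  # driven by z only, never by x

def residualize(values, control):
    """Remove the part of `values` that a straight line in `control` explains."""
    slope, intercept = np.polyfit(control, values, 1)
    return values - (slope * control + intercept)

r_raw = np.corrcoef(x, y)[0, 1]
r_partial = np.corrcoef(residualize(x, z), residualize(y, z))[0, 1]

print(f"corr(x, y):                  {r_raw:.3f}")
print(f"corr(x, y) after removing z: {r_partial:.3f}")
```

The raw correlation is strong even though `x` has no effect on `y`, and it largely disappears once `z` is accounted for. In real data the confounders are usually unmeasured, which is why this adjustment cannot replace an experimental or quasi-experimental design.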

Project Structure Template