Project 1: Full EDA on Real Dataset

Module 1: Foundations

Project Overview

In this project, you'll perform a complete Exploratory Data Analysis (EDA) on a real-world dataset. You'll apply all the skills learned in Module 1 to extract insights and tell a data story.

Objectives

  1. Load and inspect real-world data
  2. Clean and preprocess data
  3. Perform univariate, bivariate, and multivariate analysis
  4. Create compelling visualizations
  5. Document findings and recommendations

ℹ️ EDA as a Systematic Process

A disciplined EDA follows a repeatable workflow:

  1. Structure: Dimensions, types, missing values
  2. Description: Summary statistics, distributions
  3. Quality: Outliers, inconsistencies, duplicates
  4. Relationships: Correlations, associations, dependencies
  5. Narrative: Key insights, hypotheses, recommendations

Never skip to modeling without completing EDA first.
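
The Quality step in the workflow above (duplicates and outliers) could be sketched as follows. This is a minimal illustration on a small made-up DataFrame; the helper name `iqr_outlier_mask` and the conventional 1.5 × IQR fence are assumptions for the example, not part of the lesson's own code.

```python
import pandas as pd

def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (hypothetical helper)."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Toy data: one exact duplicate row and one extreme fare
toy = pd.DataFrame({
    "Fare": [7.25, 8.05, 8.05, 13.0, 26.0, 512.33],
    "Pclass": [3, 3, 3, 2, 1, 1],
})

# Count rows that repeat an earlier row exactly
n_duplicates = int(toy.duplicated().sum())

# Keep only the fares flagged by the IQR rule
outliers = toy.loc[iqr_outlier_mask(toy["Fare"]), "Fare"]

print("Duplicate rows:", n_duplicates)
print("IQR outliers in Fare:", outliers.tolist())
```

On the Titanic data the same helper could be applied to `df['Fare']` or `df['Age']`. Flagged values are candidates for inspection, not automatic deletions.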

Dataset: Titanic Survival Prediction

We'll use the famous Titanic dataset, a classic for learning EDA.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

# Set style
sns.set_theme(style="whitegrid")
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['font.size'] = 12

# Load data
df = pd.read_csv('https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv')

print("Dataset Shape:", df.shape)
print("\nColumn Names:", df.columns.tolist())

Step 1: Initial Data Inspection

# Basic information
print("=" * 60)
print("INITIAL DATA INSPECTION")
print("=" * 60)

print("\n1. First 5 rows:")
print(df.head())

print("\n2. Data Types:")
print(df.dtypes)

print("\n3. Data Info:")
df.info()

print("\n4. Statistical Summary (Numerical):")
print(df.describe())

print("\n5. Statistical Summary (Categorical):")
print(df.describe(include=['object']))

print("\n6. Missing Values:")
missing = df.isnull().sum()
missing_pct = (missing / len(df) * 100).round(2)
missing_df = pd.DataFrame({'Count': missing, 'Percentage': missing_pct})
print(missing_df[missing_df['Count'] > 0])

print("\n7. Unique Values:")
for col in df.columns:
    print(f"{col}: {df[col].nunique()} unique values")

Step 2: Data Cleaning

# Create a copy for cleaning
df_clean = df.copy()

# 1. Handle missing values
print("Missing Values Before:")
print(df_clean.isnull().sum()[df_clean.isnull().sum() > 0])

# Age: Fill with median (robust to outliers)
df_clean['Age'] = df_clean['Age'].fillna(df_clean['Age'].median())

# Embarked: Fill with mode
df_clean['Embarked'] = df_clean['Embarked'].fillna(df_clean['Embarked'].mode()[0])

# Cabin: Too many missing, create indicator
df_clean['HasCabin'] = df_clean['Cabin'].notna().astype(int)
df_clean = df_clean.drop('Cabin', axis=1)

print("\nMissing Values After:")
print(df_clean.isnull().sum())

# 2. Feature Engineering
# Extract title from Name
df_clean['Title'] = df_clean['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)
df_clean['Title'] = df_clean['Title'].replace(['Lady', 'Countess', 'Capt', 'Col', 
                                                'Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'],
                                               'Rare')
df_clean['Title'] = df_clean['Title'].replace('Mlle', 'Miss')
df_clean['Title'] = df_clean['Title'].replace('Ms', 'Miss')
df_clean['Title'] = df_clean['Title'].replace('Mme', 'Mrs')

# Family size
df_clean['FamilySize'] = df_clean['SibSp'] + df_clean['Parch'] + 1

# Is alone
df_clean['IsAlone'] = (df_clean['FamilySize'] == 1).astype(int)

# Age bins
df_clean['AgeBin'] = pd.cut(df_clean['Age'], bins=[0, 12, 18, 35, 60, 100],
                            labels=['Child', 'Teen', 'Adult', 'Middle', 'Senior'])

# Fare bins
df_clean['FareBin'] = pd.qcut(df_clean['Fare'], 4, 
                              labels=['Low', 'Medium', 'High', 'Very High'])

print("\nNew Features Created:")
print(df_clean[['Title', 'FamilySize', 'IsAlone', 'AgeBin', 'FareBin']].head())

💡 Imputation Strategy Selection

  • Median: Best for skewed numerical data (robust to outliers). Use for Age, Income.
  • Mode: Best for categorical data with a clear majority class. Use for Embarked, Gender.
  • Indicator variable: When missingness itself is informative (e.g., Cabin missing → likely lower class). Create HasCabin before dropping.
  • Never impute the target variable: doing so introduces circular reasoning into model evaluation.
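
The three strategies above can be compared on a tiny example. This is a sketch on made-up values (the column names mirror the Titanic columns, but the data is invented) showing how an outlier shifts the mean far more than the median, and how the indicator is recorded before the sparse column is dropped.

```python
import numpy as np
import pandas as pd

# Invented toy data; 80 acts as an outlier in Age
toy = pd.DataFrame({
    "Age": [22.0, 25.0, np.nan, 30.0, 80.0],
    "Embarked": ["S", "S", None, "C", "S"],
    "Cabin": ["C85", None, None, "E46", None],
})

# Median vs mean: the outlier pulls the mean up but barely moves the median
age_mean = toy["Age"].mean()      # (22 + 25 + 30 + 80) / 4
age_median = toy["Age"].median()  # midpoint of 25 and 30

imputed = toy.copy()
imputed["Age"] = imputed["Age"].fillna(age_median)
imputed["Embarked"] = imputed["Embarked"].fillna(imputed["Embarked"].mode()[0])

# Record missingness first, then drop the mostly-empty column
imputed["HasCabin"] = imputed["Cabin"].notna().astype(int)
imputed = imputed.drop(columns="Cabin")

print(f"mean={age_mean}, median={age_median}")
print(imputed)
```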

Step 3: Univariate Analysis

# Target variable distribution
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Survival count
survival_counts = df_clean['Survived'].value_counts()
axes[0].bar(survival_counts.index, survival_counts.values, 
            color=['#FF6B6B', '#4ECDC4'])
axes[0].set_xticks([0, 1])
axes[0].set_xticklabels(['Did Not Survive', 'Survived'])
axes[0].set_ylabel('Count')
axes[0].set_title('Survival Distribution')

# Survival percentage
axes[1].pie(survival_counts.values, labels=['Did Not Survive', 'Survived'],
            autopct='%1.1f%%', colors=['#FF6B6B', '#4ECDC4'])
axes[1].set_title('Survival Percentage')

plt.tight_layout()
plt.show()

print(f"Survival Rate: {df_clean['Survived'].mean()*100:.1f}%")

# Numerical features distribution
numerical_cols = ['Age', 'Fare', 'FamilySize', 'SibSp', 'Parch']

fig, axes = plt.subplots(2, 3, figsize=(15, 10))
axes = axes.flatten()

for i, col in enumerate(numerical_cols):
    axes[i].hist(df_clean[col].dropna(), bins=30, edgecolor='black', alpha=0.7)
    axes[i].set_title(f'Distribution of {col}')
    axes[i].set_xlabel(col)
    axes[i].set_ylabel('Frequency')
    
    # Add mean and median lines
    mean_val = df_clean[col].mean()
    median_val = df_clean[col].median()
    axes[i].axvline(mean_val, color='r', linestyle='--', label=f'Mean: {mean_val:.2f}')
    axes[i].axvline(median_val, color='g', linestyle='-', label=f'Median: {median_val:.2f}')
    axes[i].legend()

# Remove empty subplot
axes[5].axis('off')

plt.suptitle('Numerical Features Distribution', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

# Categorical features distribution
categorical_cols = ['Sex', 'Embarked', 'Pclass', 'Title']

fig, axes = plt.subplots(2, 2, figsize=(12, 10))
axes = axes.flatten()

for i, col in enumerate(categorical_cols):
    value_counts = df_clean[col].value_counts()
    axes[i].bar(value_counts.index, value_counts.values, 
                color=plt.cm.Set2(np.linspace(0, 1, len(value_counts))))
    axes[i].set_title(f'Distribution of {col}')
    axes[i].set_xlabel(col)
    axes[i].set_ylabel('Count')
    axes[i].tick_params(axis='x', rotation=45)

plt.suptitle('Categorical Features Distribution', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

Step 4: Bivariate Analysis

# Survival by categorical features
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
axes = axes.flatten()

for i, col in enumerate(categorical_cols):
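    survival_by = df_clean.groupby(col)['Survived'].mean() * 100
    # Cast labels to strings so numeric categories (Pclass) sit at positions 0..n-1,
    # matching the x-coordinates used for the value labels below
    axes[i].bar(survival_by.index.astype(str), survival_by.values, 
                color=plt.cm.RdYlGn(survival_by.values / 100))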
    axes[i].set_title(f'Survival Rate by {col}')
    axes[i].set_xlabel(col)
    axes[i].set_ylabel('Survival Rate (%)')
    axes[i].tick_params(axis='x', rotation=45)
    
    # Add value labels
    for j, v in enumerate(survival_by.values):
        axes[i].text(j, v + 1, f'{v:.1f}%', ha='center')

plt.suptitle('Survival Rate by Categorical Features', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

# Survival by numerical features
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# Age vs Survival
sns.boxplot(data=df_clean, x='Survived', y='Age', ax=axes[0, 0])
axes[0, 0].set_title('Age vs Survival')
axes[0, 0].set_xticklabels(['Did Not Survive', 'Survived'])

# Fare vs Survival
sns.boxplot(data=df_clean, x='Survived', y='Fare', ax=axes[0, 1])
axes[0, 1].set_title('Fare vs Survival')
axes[0, 1].set_xticklabels(['Did Not Survive', 'Survived'])

# Family Size vs Survival
survival_by_family = df_clean.groupby('FamilySize')['Survived'].mean() * 100
axes[1, 0].bar(survival_by_family.index, survival_by_family.values, 
               color='skyblue', edgecolor='black')
axes[1, 0].set_title('Survival Rate by Family Size')
axes[1, 0].set_xlabel('Family Size')
axes[1, 0].set_ylabel('Survival Rate (%)')

# Pclass vs Survival
survival_by_class = df_clean.groupby('Pclass')['Survived'].mean() * 100
axes[1, 1].bar(survival_by_class.index, survival_by_class.values, 
               color=['#FF6B6B', '#FFEAA7', '#4ECDC4'], edgecolor='black')
axes[1, 1].set_title('Survival Rate by Passenger Class')
axes[1, 1].set_xlabel('Passenger Class')
axes[1, 1].set_ylabel('Survival Rate (%)')

plt.suptitle('Survival Analysis', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

# Statistical tests
print("=" * 60)
print("STATISTICAL TESTS")
print("=" * 60)

# Chi-square test for categorical variables
from scipy.stats import chi2_contingency

for col in categorical_cols:
    contingency = pd.crosstab(df_clean[col], df_clean['Survived'])
    chi2, p_value, dof, expected = chi2_contingency(contingency)
    print(f"\n{col} vs Survived:")
    print(f"  Chi-square: {chi2:.4f}")
    print(f"  P-value: {p_value:.6f}")
    print(f"  Significant: {'Yes' if p_value < 0.05 else 'No'}")

# T-test for numerical variables
print("\n" + "=" * 60)
print("T-TESTS FOR NUMERICAL VARIABLES")
print("=" * 60)

for col in ['Age', 'Fare']:
    survived = df_clean[df_clean['Survived'] == 1][col].dropna()
    not_survived = df_clean[df_clean['Survived'] == 0][col].dropna()
    
    t_stat, p_value = stats.ttest_ind(survived, not_survived)
    print(f"\n{col}:")
    print(f"  Survived mean: {survived.mean():.2f}")
    print(f"  Not survived mean: {not_survived.mean():.2f}")
    print(f"  T-statistic: {t_stat:.4f}")
    print(f"  P-value: {p_value:.6f}")
    print(f"  Significant: {'Yes' if p_value < 0.05 else 'No'}")

Chi-Square Test Statistic

$$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$

Here,

  • $O_{ij}$ = Observed frequency in row i, column j
  • $E_{ij}$ = Expected frequency under independence
  • $r, c$ = Number of rows and columns
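
To make the formula concrete, the statistic can be computed by hand and compared with SciPy. This is a minimal sketch on a hypothetical 2×2 table (the counts are invented, not Titanic values). `correction=False` is passed because `chi2_contingency` applies Yates' continuity correction to 2×2 tables by default, which would otherwise give a slightly smaller value than the formula.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical observed counts (rows: group A/B, columns: outcome 0/1)
observed = np.array([[30, 10],
                     [20, 40]])

row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
n = observed.sum()

# E_ij = (row i total * column j total) / N under independence
expected = row_totals * col_totals / n

# Sum of (O_ij - E_ij)^2 / E_ij over all cells
chi2_manual = ((observed - expected) ** 2 / expected).sum()

# Same statistic from SciPy, with the continuity correction disabled
chi2_scipy, p_value, dof, expected_scipy = chi2_contingency(observed, correction=False)

print(f"manual={chi2_manual:.4f}, scipy={chi2_scipy:.4f}, dof={dof}")
```

The degrees of freedom are (r − 1)(c − 1), which is 1 for a 2×2 table.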

ℹ️ Chi-Square vs T-Test: When to Use Which

  • Chi-square test of independence: Tests whether two categorical variables are associated. Use for Sex vs Survived, Pclass vs Survived.
  • Two-sample t-test: Tests whether the means of a numerical variable differ between two groups. Use for Age (survived vs not), Fare (survived vs not).
  • Both require independent observations. Neither proves causation; they only show association.

Step 5: Multivariate Analysis

# Correlation heatmap
plt.figure(figsize=(10, 8))
corr_cols = ['Survived', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare', 
             'HasCabin', 'FamilySize', 'IsAlone']
corr_matrix = df_clean[corr_cols].corr()

mask = np.triu(np.ones_like(corr_matrix, dtype=bool))
sns.heatmap(corr_matrix, mask=mask, annot=True, cmap='coolwarm', 
            center=0, square=True, linewidths=0.5,
            fmt='.2f', vmin=-1, vmax=1)
plt.title('Correlation Heatmap', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

# Pairplot for key features
key_features = ['Survived', 'Age', 'Fare', 'Pclass', 'FamilySize']
sns.pairplot(df_clean[key_features], hue='Survived', 
             palette={0: '#FF6B6B', 1: '#4ECDC4'},
             plot_kws={'alpha': 0.6})
plt.suptitle('Pairplot of Key Features', y=1.02)
plt.show()

# Survival by multiple features
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# By Class and Sex
survival_rates = df_clean.groupby(['Pclass', 'Sex'])['Survived'].mean().unstack()
survival_rates.plot(kind='bar', ax=axes[0], color=['#FF6B6B', '#4ECDC4'])
axes[0].set_title('Survival by Class and Sex')
axes[0].set_ylabel('Survival Rate')
axes[0].set_xticklabels(['1st', '2nd', '3rd'], rotation=0)
axes[0].legend(title='Sex')

# By Age Group and Class
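# observed=False keeps empty age bins and avoids the pandas FutureWarning for categorical groupers
survival_by_age_class = df_clean.groupby(['AgeBin', 'Pclass'], observed=False)['Survived'].mean().unstack()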
survival_by_age_class.plot(kind='bar', ax=axes[1])
axes[1].set_title('Survival by Age Group and Class')
axes[1].set_ylabel('Survival Rate')
axes[1].legend(title='Class')

# By Title
survival_by_title = df_clean.groupby('Title')['Survived'].agg(['mean', 'count'])
survival_by_title = survival_by_title[survival_by_title['count'] >= 10]
axes[2].bar(survival_by_title.index, survival_by_title['mean'] * 100, 
            color='skyblue', edgecolor='black')
axes[2].set_title('Survival Rate by Title')
axes[2].set_ylabel('Survival Rate (%)')
axes[2].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

💡 Correlation Does Not Imply Causation

A strong correlation between two variables does not mean one causes the other. A confounding (lurking) variable may drive both. For example, Fare and Survived are correlated, but Fare is largely determined by Pclass, a third variable that is itself associated with survival. Randomized controlled experiments are the most reliable way to establish causation; observational data like this can only suggest it.
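
Confounding can be illustrated with simulated data. In this sketch, a hypothetical variable `z` (think of it as a class-like stratum) drives both `x` and `y`, while `x` has no direct effect on `y`. All names and coefficients are invented for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Hypothetical confounder z drives both x and y; x never enters y directly
z = rng.integers(0, 3, size=n)             # three strata: 0, 1, 2
x = 10 * z + rng.normal(0, 1, size=n)
y = 5 * z + rng.normal(0, 1, size=n)

# Correlation over the whole sample
overall_r = np.corrcoef(x, y)[0, 1]

# Correlation within each stratum of z (z held fixed)
within_r = [np.corrcoef(x[z == k], y[z == k])[0, 1] for k in range(3)]

print(f"overall r = {overall_r:.2f}")
print("within-stratum r =", [round(float(r), 2) for r in within_r])
```

The overall correlation is strong, yet it largely disappears once `z` is held fixed. Comparing Fare and Survived within each Pclass is the analogous check on the Titanic data.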

Step 6: Advanced Visualizations

# Violin plots
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

sns.violinplot(data=df_clean, x='Pclass', y='Age', hue='Survived', 
               split=True, ax=axes[0], palette={0: '#FF6B6B', 1: '#4ECDC4'})
axes[0].set_title('Age Distribution by Class and Survival')

sns.violinplot(data=df_clean, x='Sex', y='Fare', hue='Survived', 
               split=True, ax=axes[1], palette={0: '#FF6B6B', 1: '#4ECDC4'})
axes[1].set_title('Fare Distribution by Sex and Survival')

plt.tight_layout()
plt.show()

# FacetGrid
g = sns.FacetGrid(df_clean, col='Pclass', row='Sex', height=4, aspect=1.2)
g.map_dataframe(sns.histplot, x='Age', hue='Survived', 
                palette={0: '#FF6B6B', 1: '#4ECDC4'}, multiple='stack')
g.set_axis_labels('Age', 'Count')
g.add_legend(title='Survived')
g.set_titles('{row_name} - {col_name} Class')
plt.suptitle('Age Distribution by Class and Sex', y=1.02)
plt.show()

# Clustermap
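# sns.clustermap creates its own figure, so a separate plt.figure() call
# would only leave an empty figure behind
cluster_cols = ['Survived', 'Pclass', 'Age', 'Fare', 'FamilySize', 'HasCabin']
cluster_data = df_clean[cluster_cols].dropna()
sns.clustermap(cluster_data.corr(), annot=True, cmap='coolwarm', 
               center=0, figsize=(10, 8))
# Title the whole figure; plt.title would attach to whichever axes was drawn last
plt.suptitle('Clustermap of Features', y=1.02)
plt.show()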

Step 7: Key Insights and Findings

print("=" * 70)
print("KEY INSIGHTS AND FINDINGS")
print("=" * 70)

insights = """
1. SURVIVAL OVERVIEW
   - Overall survival rate: {:.1f}%
   - Higher survival for females ({:.1f}%) vs males ({:.1f}%)
   
2. CLASS EFFECT
   - 1st class: {:.1f}% survival
   - 2nd class: {:.1f}% survival
   - 3rd class: {:.1f}% survival
   - Clear class hierarchy in survival rates
   
3. AGE FACTOR
   - Children (0-12) had highest survival rate: {:.1f}%
   - Young adults (18-35) had lower survival: {:.1f}%
   - Age alone is a weak predictor; check the t-test p-value before calling it significant
   
4. FAMILY SIZE
   - Optimal family size: 2-4 members
   - Solo travelers had lower survival: {:.1f}%
   - Very large families (5+) also had reduced survival
   
5. FARE AND CABIN
   - Higher fare correlated with survival
   - Passengers with cabins (higher class) survived more
   
6. SEX DISPARITY
   - Strongest predictor of survival
   - "Women and children first" policy clearly evident
"""

# Calculate statistics
overall = df_clean['Survived'].mean() * 100
female_survival = df_clean[df_clean['Sex'] == 'female']['Survived'].mean() * 100
male_survival = df_clean[df_clean['Sex'] == 'male']['Survived'].mean() * 100
class_survival = df_clean.groupby('Pclass')['Survived'].mean() * 100
child_survival = df_clean[df_clean['AgeBin'] == 'Child']['Survived'].mean() * 100
adult_survival = df_clean[df_clean['AgeBin'] == 'Adult']['Survived'].mean() * 100
solo_survival = df_clean[df_clean['IsAlone'] == 1]['Survived'].mean() * 100

print(insights.format(
    overall, female_survival, male_survival,
    class_survival[1], class_survival[2], class_survival[3],
    child_survival, adult_survival,
    solo_survival
))

# Recommendations
print("\n" + "=" * 70)
print("RECOMMENDATIONS")
print("=" * 70)

recommendations = """
Based on the analysis:

1. DATA COLLECTION
   - Collect more complete data (Age has 20% missing)
   - Consider additional features (ticket type, deck location)

2. MODELING APPROACH
   - Use ensemble methods (Random Forest, Gradient Boosting)
   - Include interaction terms (Sex × Pclass, Age × FamilySize)
   - Consider non-linear relationships

3. BUSINESS INSIGHTS
   - Survival was heavily influenced by social status
   - Family connections mattered but optimal size existed
   - Gender was the strongest predictor - reflect on ethical implications

4. NEXT STEPS
   - Feature engineering for predictive modeling
   - Cross-validation for model evaluation
   - Consider fairness and bias in predictions
"""

print(recommendations)

๐Ÿ“Worked Example: Interpreting Statistical Tests

Chi-Square Test for Sex vs Survived:

  • H₀: Sex and Survived are independent
  • χ² statistic: ~261 with SciPy's default continuity correction (~263 without it); very large either way
  • p-value: < 0.00001
  • Decision: Reject H₀. Sex and Survived are strongly associated

Two-Sample T-Test for Fare:

  • H₀: Mean fare is equal for survivors and non-survivors
  • t-statistic: ~7.94
  • p-value: < 0.00001
  • Survivors paid higher fares on average (mean fare about 48 vs 22)
  • Decision: Reject H₀. Mean fare differs significantly between survivors and non-survivors

Key: A low p-value (< 0.05) indicates the observed difference is unlikely under H₀. But effect size matters too: always report both statistical significance and practical significance.
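
One common way to report effect size alongside a t-test is Cohen's d, the difference in means divided by a pooled standard deviation. The lesson does not prescribe a specific measure, so treat this as one possible sketch; the helper `cohens_d` and the toy samples are illustrative assumptions.

```python
import numpy as np

def cohens_d(a, b) -> float:
    """Cohen's d using a pooled standard deviation (hypothetical helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n_a, n_b = len(a), len(b)
    # Pooled variance weights each sample variance by its degrees of freedom
    pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / (n_a + n_b - 2)
    return float((a.mean() - b.mean()) / np.sqrt(pooled_var))

# Toy samples, not Titanic values
group_a = [2.0, 4.0, 6.0]
group_b = [1.0, 3.0, 5.0]

d = cohens_d(group_a, group_b)
print(f"Cohen's d = {d:.2f}")
```

With the lesson's variables, `cohens_d(survived, not_survived)` could be called inside the t-test loop. Cohen's rough conventions are 0.2 for a small effect, 0.5 for medium, and 0.8 for large.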

Step 8: Create Summary Report

# Create comprehensive summary visualization
fig = plt.figure(figsize=(16, 12))
gs = fig.add_gridspec(3, 3, hspace=0.4, wspace=0.3)

# 1. Survival Overview (top left)
ax1 = fig.add_subplot(gs[0, 0])
survival_counts = df_clean['Survived'].value_counts()
ax1.pie(survival_counts.values, labels=['Died', 'Survived'], 
        autopct='%1.1f%%', colors=['#FF6B6B', '#4ECDC4'])
ax1.set_title('Overall Survival Rate')

# 2. Survival by Gender (top middle)
ax2 = fig.add_subplot(gs[0, 1])
gender_survival = df_clean.groupby('Sex')['Survived'].mean() * 100
ax2.bar(gender_survival.index, gender_survival.values, color=['#FF6B6B', '#4ECDC4'])
ax2.set_title('Survival by Gender')
ax2.set_ylabel('Survival Rate (%)')

# 3. Survival by Class (top right)
ax3 = fig.add_subplot(gs[0, 2])
class_survival = df_clean.groupby('Pclass')['Survived'].mean() * 100
ax3.bar(class_survival.index, class_survival.values, 
        color=['#4ECDC4', '#FFEAA7', '#FF6B6B'])
ax3.set_title('Survival by Class')
ax3.set_xlabel('Passenger Class')
ax3.set_ylabel('Survival Rate (%)')

# 4. Age Distribution (middle left)
ax4 = fig.add_subplot(gs[1, 0])
sns.histplot(data=df_clean, x='Age', hue='Survived', kde=True, 
             ax=ax4, palette={0: '#FF6B6B', 1: '#4ECDC4'})
ax4.set_title('Age Distribution by Survival')

# 5. Fare Distribution (middle middle)
ax5 = fig.add_subplot(gs[1, 1])
sns.boxplot(data=df_clean, x='Survived', y='Fare', ax=ax5,
            palette=['#FF6B6B', '#4ECDC4'])
ax5.set_title('Fare Distribution by Survival')
ax5.set_xticklabels(['Died', 'Survived'])

# 6. Correlation Heatmap (middle right)
ax6 = fig.add_subplot(gs[1, 2])
corr_cols = ['Survived', 'Pclass', 'Age', 'Fare', 'FamilySize']
corr = df_clean[corr_cols].corr()
sns.heatmap(corr, annot=True, cmap='coolwarm', center=0, 
            ax=ax6, fmt='.2f', square=True)
ax6.set_title('Feature Correlations')

# 7. Survival by Family Size (bottom left)
ax7 = fig.add_subplot(gs[2, 0])
family_survival = df_clean.groupby('FamilySize')['Survived'].mean() * 100
ax7.bar(family_survival.index, family_survival.values, color='skyblue', edgecolor='black')
ax7.set_title('Survival by Family Size')
ax7.set_xlabel('Family Size')
ax7.set_ylabel('Survival Rate (%)')

# 8. Survival by Title (bottom middle)
ax8 = fig.add_subplot(gs[2, 1])
title_survival = df_clean.groupby('Title')['Survived'].agg(['mean', 'count'])
title_survival = title_survival[title_survival['count'] >= 10].sort_values('mean', ascending=False)
ax8.barh(title_survival.index, title_survival['mean'] * 100, color='lightgreen')
ax8.set_title('Survival by Title (n ≥ 10)')
ax8.set_xlabel('Survival Rate (%)')

# 9. Key Statistics (bottom right)
ax9 = fig.add_subplot(gs[2, 2])
ax9.axis('off')
stats_text = f"""
KEY STATISTICS
─────────────────────
Total Passengers: {len(df_clean)}
Survival Rate: {overall:.1f}%
Female Survival: {female_survival:.1f}%
Male Survival: {male_survival:.1f}%
─────────────────────
1st Class: {class_survival[1]:.1f}%
2nd Class: {class_survival[2]:.1f}%
3rd Class: {class_survival[3]:.1f}%
─────────────────────
Children (0-12): {child_survival:.1f}%
Solo Travelers: {solo_survival:.1f}%
"""
ax9.text(0.1, 0.5, stats_text, transform=ax9.transAxes,
         fontsize=11, verticalalignment='center',
         bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))

plt.suptitle('Titanic EDA Summary Dashboard', fontsize=16, fontweight='bold', y=1.02)
plt.savefig('titanic_eda_summary.png', dpi=300, bbox_inches='tight')
plt.show()

print("\nSummary dashboard saved as 'titanic_eda_summary.png'")

💡 Communicating EDA Results

A good EDA summary tells a story:

  1. What: Key findings with supporting statistics
  2. So what: Why these findings matter (business/practical implications)
  3. Now what: Recommended next steps (feature engineering, modeling, further analysis)

Always lead with the most impactful finding and use visualizations to support each claim.

Key Takeaways

📋 Summary: EDA on Real Dataset

  1. EDA is systematic: Follow a structured approach (inspect → clean → univariate → bivariate → multivariate → report)
  2. Missing data is informative: Understand MCAR/MAR/MNAR mechanisms before choosing imputation strategy
  3. Start simple: Understand distributions before complex relationships, and visualize before computing statistics
  4. Use statistics to quantify: Don't just visualize; use chi-square tests, t-tests, and correlation to test and measure associations
  5. Feature engineering is creative: Derived features (FamilySize, Title, AgeBin) often reveal patterns hidden in raw data
  6. Tell a story: EDA should lead to actionable insights; the "so what" matters as much as the "what"
  7. Class imbalance: Check target distribution early, since it affects both EDA interpretation and downstream modeling
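
For the second takeaway, one simple check is whether missingness in a column is associated with another variable; if it is, the data is unlikely to be missing completely at random (MCAR). The sketch below uses synthetic data in which `Age` is deliberately missing more often in one class. The column names echo the Titanic data, but the values are invented.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Synthetic data: Age is missing in 5 of 50 class-1 rows and 30 of 50 class-3 rows
toy = pd.DataFrame({
    "Pclass": [1] * 50 + [3] * 50,
    "Age": [30.0] * 45 + [np.nan] * 5 + [25.0] * 20 + [np.nan] * 30,
})

# Missingness indicator, then its rate per class
toy["AgeMissing"] = toy["Age"].isna()
missing_rate = toy.groupby("Pclass")["AgeMissing"].mean()

# Chi-square test of independence between class and missingness
table = pd.crosstab(toy["Pclass"], toy["AgeMissing"])
chi2, p_value, dof, _ = chi2_contingency(table)

print(missing_rate)
print(f"chi2={chi2:.2f}, p={p_value:.2g}")
```

A small p-value here suggests that missingness depends on class, which argues for a class-aware imputation or an explicit missing indicator rather than a single global fill value.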

Next Steps

  1. Apply these techniques to your own datasets
  2. Practice with different types of data (time series, text, images)
  3. Build automated EDA reports using libraries like ydata-profiling or sweetviz
  4. Move to predictive modeling in Module 2
  5. Consider: how would your EDA findings change if the dataset had 10x more missing values? What about 100x more rows?
