Project Overview
In this project, you'll perform a complete Exploratory Data Analysis (EDA) on a real-world dataset. You'll apply all the skills learned in Module 1 to extract insights and tell a data story.
Objectives
- Load and inspect real-world data
- Clean and preprocess data
- Perform univariate, bivariate, and multivariate analysis
- Create compelling visualizations
- Document findings and recommendations
EDA as a Systematic Process
A disciplined EDA follows a repeatable workflow:
- Structure: Dimensions, types, missing values
- Description: Summary statistics, distributions
- Quality: Outliers, inconsistencies, duplicates
- Relationships: Correlations, associations, dependencies
- Narrative: Key insights, hypotheses, recommendations
Never skip to modeling without completing EDA first.
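The first three stages of this workflow (structure, description, quality) can be gathered in a single pass. The sketch below is only illustrative: `eda_overview` is a hypothetical helper name, and the small `toy` frame is made up so the example runs without downloading anything.

```python
import numpy as np
import pandas as pd


def eda_overview(df: pd.DataFrame) -> dict:
    """Collect structure, description, and quality facts in one pass.

    Hypothetical helper for illustration; it is not part of pandas.
    """
    numeric = df.select_dtypes(include="number")
    return {
        # Structure: dimensions, types, missing values
        "shape": df.shape,
        "dtypes": df.dtypes.astype(str).to_dict(),
        "missing": df.isna().sum().to_dict(),
        # Description: summary statistics for numeric columns
        "summary": numeric.describe().T,
        # Quality: number of fully duplicated rows
        "n_duplicates": int(df.duplicated().sum()),
    }


# Toy data (made up for this example)
toy = pd.DataFrame({
    "age": [22.0, 38.0, np.nan, 35.0, 35.0],
    "sex": ["male", "female", "female", "male", "male"],
    "fare": [7.25, 71.28, 7.92, 8.05, 8.05],
})

report = eda_overview(toy)
print(report["shape"])
print(report["missing"])
print(report["n_duplicates"])
```

Relationships and narrative are left out on purpose: they depend on the question being asked and are harder to automate sensibly.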
Dataset: Titanic Survival Prediction
We'll use the famous Titanic dataset, a classic for learning EDA.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
# Set style
sns.set_theme(style="whitegrid")
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['font.size'] = 12
# Load data
df = pd.read_csv('https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv')
print("Dataset Shape:", df.shape)
print("\nColumn Names:", df.columns.tolist())
Step 1: Initial Data Inspection
# Basic information
print("=" * 60)
print("INITIAL DATA INSPECTION")
print("=" * 60)
print("\n1. First 5 rows:")
print(df.head())
print("\n2. Data Types:")
print(df.dtypes)
print("\n3. Data Info:")
df.info()
print("\n4. Statistical Summary (Numerical):")
print(df.describe())
print("\n5. Statistical Summary (Categorical):")
print(df.describe(include=['object']))
print("\n6. Missing Values:")
missing = df.isnull().sum()
missing_pct = (missing / len(df) * 100).round(2)
missing_df = pd.DataFrame({'Count': missing, 'Percentage': missing_pct})
print(missing_df[missing_df['Count'] > 0])
print("\n7. Unique Values:")
for col in df.columns:
    print(f"{col}: {df[col].nunique()} unique values")
Step 2: Data Cleaning
# Create a copy for cleaning
df_clean = df.copy()
# 1. Handle missing values
print("Missing Values Before:")
print(df_clean.isnull().sum()[df_clean.isnull().sum() > 0])
# Age: Fill with median (robust to outliers)
df_clean['Age'] = df_clean['Age'].fillna(df_clean['Age'].median())
# Embarked: Fill with mode
df_clean['Embarked'] = df_clean['Embarked'].fillna(df_clean['Embarked'].mode()[0])
# Cabin: Too many missing, create indicator
df_clean['HasCabin'] = df_clean['Cabin'].notna().astype(int)
df_clean = df_clean.drop('Cabin', axis=1)
print("\nMissing Values After:")
print(df_clean.isnull().sum())
# 2. Feature Engineering
# Extract title from Name
df_clean['Title'] = df_clean['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)
df_clean['Title'] = df_clean['Title'].replace(['Lady', 'Countess', 'Capt', 'Col',
'Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'],
'Rare')
df_clean['Title'] = df_clean['Title'].replace('Mlle', 'Miss')
df_clean['Title'] = df_clean['Title'].replace('Ms', 'Miss')
df_clean['Title'] = df_clean['Title'].replace('Mme', 'Mrs')
# Family size
df_clean['FamilySize'] = df_clean['SibSp'] + df_clean['Parch'] + 1
# Is alone
df_clean['IsAlone'] = (df_clean['FamilySize'] == 1).astype(int)
# Age bins
df_clean['AgeBin'] = pd.cut(df_clean['Age'], bins=[0, 12, 18, 35, 60, 100],
labels=['Child', 'Teen', 'Adult', 'Middle', 'Senior'])
# Fare bins
df_clean['FareBin'] = pd.qcut(df_clean['Fare'], 4,
labels=['Low', 'Medium', 'High', 'Very High'])
print("\nNew Features Created:")
print(df_clean[['Title', 'FamilySize', 'IsAlone', 'AgeBin', 'FareBin']].head())
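Two of the feature-engineering steps above are easy to sanity-check in isolation. The sketch below applies the same title regex to three names in the dataset's "Surname, Title. Given names" format, and shows how `pd.cut` treats values that sit exactly on a bin edge. The `ages` values are made up for the example.

```python
import pandas as pd

# Three names in the dataset's "Surname, Title. Given names" format
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Heikkinen, Miss. Laina",
])
# Same pattern as above: a space, a run of letters, then a literal period
titles = names.str.extract(r" ([A-Za-z]+)\.", expand=False)
print(titles.tolist())

# pd.cut uses right-closed intervals by default: (0, 12], (12, 18], ...
ages = pd.Series([0.0, 12.0, 12.5, 35.0, 80.0])  # made-up edge cases
age_bins = pd.cut(
    ages,
    bins=[0, 12, 18, 35, 60, 100],
    labels=["Child", "Teen", "Adult", "Middle", "Senior"],
)
print(age_bins.tolist())
```

An age of exactly 0 falls outside the first interval and becomes NaN. If such values could occur in your data, passing `include_lowest=True` to `pd.cut` keeps them in the first bin.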
Imputation Strategy Selection
- Median: Best for skewed numerical data (robust to outliers). Use for Age, Income.
- Mode: Best for categorical data with a clear majority class. Use for Embarked, Gender.
- Indicator variable: When missingness itself is informative (e.g., a missing Cabin value often indicates a lower-class ticket). Create HasCabin before dropping.
- Never impute the target variable: doing so introduces circular reasoning into model evaluation.
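The three strategies in this list can be seen side by side on a tiny frame. This is a minimal sketch on made-up data; `AgeWasMissing` is a hypothetical column name introduced here to show that an indicator can be recorded for any imputed column, not only Cabin.

```python
import numpy as np
import pandas as pd

# Made-up rows; 80 plays the role of an outlier in Age
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 30.0, 80.0, np.nan],
    "Embarked": ["S", "S", None, "C", "S"],
    "Cabin": ["C85", None, None, "E46", None],
})

# Record missingness BEFORE filling, so the information is not lost
toy["AgeWasMissing"] = toy["Age"].isna().astype(int)
toy["HasCabin"] = toy["Cabin"].notna().astype(int)

# Median is less sensitive to the outlier (80) than the mean would be
age_median = toy["Age"].median()
toy["Age"] = toy["Age"].fillna(age_median)

# Mode for a categorical column; mode() returns a Series, so take the first
toy["Embarked"] = toy["Embarked"].fillna(toy["Embarked"].mode()[0])

toy = toy.drop(columns="Cabin")
print(toy)
```

Here the median of the observed ages is 30, while the mean would be 44, which illustrates why the median is the safer default for skewed columns.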
Step 3: Univariate Analysis
# Target variable distribution
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
# Survival count
survival_counts = df_clean['Survived'].value_counts()
axes[0].bar(survival_counts.index, survival_counts.values,
color=['#FF6B6B', '#4ECDC4'])
axes[0].set_xticks([0, 1])
axes[0].set_xticklabels(['Did Not Survive', 'Survived'])
axes[0].set_ylabel('Count')
axes[0].set_title('Survival Distribution')
# Survival percentage
axes[1].pie(survival_counts.values, labels=['Did Not Survive', 'Survived'],
autopct='%1.1f%%', colors=['#FF6B6B', '#4ECDC4'])
axes[1].set_title('Survival Percentage')
plt.tight_layout()
plt.show()
print(f"Survival Rate: {df_clean['Survived'].mean()*100:.1f}%")
# Numerical features distribution
numerical_cols = ['Age', 'Fare', 'FamilySize', 'SibSp', 'Parch']
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
axes = axes.flatten()
for i, col in enumerate(numerical_cols):
    axes[i].hist(df_clean[col].dropna(), bins=30, edgecolor='black', alpha=0.7)
    axes[i].set_title(f'Distribution of {col}')
    axes[i].set_xlabel(col)
    axes[i].set_ylabel('Frequency')
    # Add mean and median lines
    mean_val = df_clean[col].mean()
    median_val = df_clean[col].median()
    axes[i].axvline(mean_val, color='r', linestyle='--', label=f'Mean: {mean_val:.2f}')
    axes[i].axvline(median_val, color='g', linestyle='-', label=f'Median: {median_val:.2f}')
    axes[i].legend()
# Remove empty subplot
axes[5].axis('off')
plt.suptitle('Numerical Features Distribution', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()
# Categorical features distribution
categorical_cols = ['Sex', 'Embarked', 'Pclass', 'Title']
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
axes = axes.flatten()
for i, col in enumerate(categorical_cols):
    value_counts = df_clean[col].value_counts()
    axes[i].bar(value_counts.index, value_counts.values,
                color=plt.cm.Set2(np.linspace(0, 1, len(value_counts))))
    axes[i].set_title(f'Distribution of {col}')
    axes[i].set_xlabel(col)
    axes[i].set_ylabel('Count')
    axes[i].tick_params(axis='x', rotation=45)
plt.suptitle('Categorical Features Distribution', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()
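The gap between the mean and median lines in the histograms hints at skew, and that hint can be backed by a number. The sketch below uses made-up right-skewed values standing in for a column such as Fare; it is an illustration, not output from the Titanic data.

```python
import numpy as np
import pandas as pd

# Made-up right-skewed values standing in for a column such as Fare
values = pd.Series([7, 8, 8, 9, 10, 13, 15, 26, 80, 250], dtype=float)

mean_val = values.mean()
median_val = values.median()
# Sample skewness: greater than 0 means a long right tail
skewness = values.skew()
# log1p compresses large values, which usually reduces right skew
log_skewness = np.log1p(values).skew()

print(f"mean={mean_val:.2f}, median={median_val:.2f}")
print(f"skew={skewness:.2f}, skew after log1p={log_skewness:.2f}")
```

A mean well above the median together with a clearly positive skewness is a reasonable cue to plot the column on a log scale or to prefer rank-based summaries.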
Step 4: Bivariate Analysis
# Survival by categorical features
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
axes = axes.flatten()
for i, col in enumerate(categorical_cols):
    survival_by = df_clean.groupby(col)['Survived'].mean() * 100
    # Use string labels so bars sit at positions 0, 1, 2, ... for every
    # column (including numeric Pclass) and line up with the value labels
    x_labels = survival_by.index.astype(str)
    axes[i].bar(x_labels, survival_by.values,
                color=plt.cm.RdYlGn(survival_by.values / 100))
    axes[i].set_title(f'Survival Rate by {col}')
    axes[i].set_xlabel(col)
    axes[i].set_ylabel('Survival Rate (%)')
    axes[i].tick_params(axis='x', rotation=45)
    # Add value labels
    for j, v in enumerate(survival_by.values):
        axes[i].text(j, v + 1, f'{v:.1f}%', ha='center')
plt.suptitle('Survival Rate by Categorical Features', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()
# Survival by numerical features
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
# Age vs Survival
sns.boxplot(data=df_clean, x='Survived', y='Age', ax=axes[0, 0])
axes[0, 0].set_title('Age vs Survival')
axes[0, 0].set_xticklabels(['Did Not Survive', 'Survived'])
# Fare vs Survival
sns.boxplot(data=df_clean, x='Survived', y='Fare', ax=axes[0, 1])
axes[0, 1].set_title('Fare vs Survival')
axes[0, 1].set_xticklabels(['Did Not Survive', 'Survived'])
# Family Size vs Survival
survival_by_family = df_clean.groupby('FamilySize')['Survived'].mean() * 100
axes[1, 0].bar(survival_by_family.index, survival_by_family.values,
color='skyblue', edgecolor='black')
axes[1, 0].set_title('Survival Rate by Family Size')
axes[1, 0].set_xlabel('Family Size')
axes[1, 0].set_ylabel('Survival Rate (%)')
# Pclass vs Survival
survival_by_class = df_clean.groupby('Pclass')['Survived'].mean() * 100
axes[1, 1].bar(survival_by_class.index, survival_by_class.values,
color=['#FF6B6B', '#FFEAA7', '#4ECDC4'], edgecolor='black')
axes[1, 1].set_title('Survival Rate by Passenger Class')
axes[1, 1].set_xlabel('Passenger Class')
axes[1, 1].set_ylabel('Survival Rate (%)')
plt.suptitle('Survival Analysis', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()
# Statistical tests
print("=" * 60)
print("STATISTICAL TESTS")
print("=" * 60)
# Chi-square test for categorical variables
from scipy.stats import chi2_contingency
for col in categorical_cols:
    contingency = pd.crosstab(df_clean[col], df_clean['Survived'])
    chi2, p_value, dof, expected = chi2_contingency(contingency)
    print(f"\n{col} vs Survived:")
    print(f" Chi-square: {chi2:.4f}")
    print(f" P-value: {p_value:.6f}")
    print(f" Significant: {'Yes' if p_value < 0.05 else 'No'}")
# T-test for numerical variables
print("\n" + "=" * 60)
print("T-TESTS FOR NUMERICAL VARIABLES")
print("=" * 60)
for col in ['Age', 'Fare']:
    survived = df_clean[df_clean['Survived'] == 1][col].dropna()
    not_survived = df_clean[df_clean['Survived'] == 0][col].dropna()
    t_stat, p_value = stats.ttest_ind(survived, not_survived)
    print(f"\n{col}:")
    print(f" Survived mean: {survived.mean():.2f}")
    print(f" Not survived mean: {not_survived.mean():.2f}")
    print(f" T-statistic: {t_stat:.4f}")
    print(f" P-value: {p_value:.6f}")
    print(f" Significant: {'Yes' if p_value < 0.05 else 'No'}")
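Chi-Square Test Statistic
$$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$
Here,
- $O_{ij}$ = Observed frequency in row i, column j
- $E_{ij}$ = Expected frequency under independence
- $r, c$ = Number of rows and columns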
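To make the chi-square statistic concrete, it can be computed by hand with NumPy. The counts below are the Sex by Survived figures commonly reported for the 891-row Titanic training file; treat them as an assumption and confirm them with `pd.crosstab` on your own copy.

```python
import numpy as np

# Sex x Survived counts commonly reported for the 891-row Titanic training
# file (an assumption here; confirm with pd.crosstab on your copy).
#                    died  survived
observed = np.array([[81, 233],    # female
                     [468, 109]])  # male

row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
n = observed.sum()

# Expected counts under independence: E_ij = (row i total * column j total) / n
expected = row_totals * col_totals / n

# Pearson statistic: sum of (O - E)^2 / E over all cells
chi2 = ((observed - expected) ** 2 / expected).sum()
# Degrees of freedom: (rows - 1) * (columns - 1)
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

print(f"chi2 = {chi2:.2f}, dof = {dof}")
```

This is the uncorrected Pearson statistic. `scipy.stats.chi2_contingency` applies Yates' continuity correction by default for 2x2 tables, so its result for the same table is slightly smaller; pass `correction=False` to match the hand calculation.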
Chi-Square vs T-Test: When to Use Which
- Chi-square test of independence: Tests whether two categorical variables are associated. Use for Sex vs Survived, Pclass vs Survived.
- Two-sample t-test: Tests whether the means of a numerical variable differ between two groups. Use for Age (survived vs not), Fare (survived vs not).
- Both require independent observations. Neither proves causation; they only indicate association.
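The pooled two-sample t-test also assumes roughly equal variances in both groups, which a heavily skewed column may not satisfy. As a hedged alternative (not something the original analysis uses), the sketch below compares the pooled test with Welch's t-test and the rank-based Mann-Whitney U test on synthetic data; the samples are generated, not taken from the Titanic file.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic, right-skewed samples with unequal spread (not Titanic data)
group_a = rng.lognormal(mean=3.5, sigma=1.0, size=300)
group_b = rng.lognormal(mean=3.0, sigma=0.6, size=500)

# Student's t-test pools the variances of both groups
t_pooled, p_pooled = stats.ttest_ind(group_a, group_b)
# Welch's t-test drops the equal-variance assumption
t_welch, p_welch = stats.ttest_ind(group_a, group_b, equal_var=False)
# Mann-Whitney U is rank-based, so extreme values have less influence
u_stat, p_mw = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")

print(f"pooled t-test:  t={t_pooled:.2f}, p={p_pooled:.2g}")
print(f"Welch t-test:   t={t_welch:.2f}, p={p_welch:.2g}")
print(f"Mann-Whitney U: U={u_stat:.0f}, p={p_mw:.2g}")
```

When all three tests agree, the conclusion is robust to the choice; when they disagree, that is a prompt to look at the distributions again before reporting a result.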
Step 5: Multivariate Analysis
# Correlation heatmap
plt.figure(figsize=(10, 8))
corr_cols = ['Survived', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare',
'HasCabin', 'FamilySize', 'IsAlone']
corr_matrix = df_clean[corr_cols].corr()
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))
sns.heatmap(corr_matrix, mask=mask, annot=True, cmap='coolwarm',
center=0, square=True, linewidths=0.5,
fmt='.2f', vmin=-1, vmax=1)
plt.title('Correlation Heatmap', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()
# Pairplot for key features
key_features = ['Survived', 'Age', 'Fare', 'Pclass', 'FamilySize']
sns.pairplot(df_clean[key_features], hue='Survived',
palette={0: '#FF6B6B', 1: '#4ECDC4'},
plot_kws={'alpha': 0.6})
plt.suptitle('Pairplot of Key Features', y=1.02)
plt.show()
# Survival by multiple features
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
# By Class and Sex
survival_rates = df_clean.groupby(['Pclass', 'Sex'])['Survived'].mean().unstack()
survival_rates.plot(kind='bar', ax=axes[0], color=['#FF6B6B', '#4ECDC4'])
axes[0].set_title('Survival by Class and Sex')
axes[0].set_ylabel('Survival Rate')
axes[0].set_xticklabels(['1st', '2nd', '3rd'], rotation=0)
axes[0].legend(title='Sex')
# By Age Group and Class
survival_by_age_class = df_clean.groupby(['AgeBin', 'Pclass'])['Survived'].mean().unstack()
survival_by_age_class.plot(kind='bar', ax=axes[1])
axes[1].set_title('Survival by Age Group and Class')
axes[1].set_ylabel('Survival Rate')
axes[1].legend(title='Class')
# By Title
survival_by_title = df_clean.groupby('Title')['Survived'].agg(['mean', 'count'])
survival_by_title = survival_by_title[survival_by_title['count'] >= 10]
axes[2].bar(survival_by_title.index, survival_by_title['mean'] * 100,
color='skyblue', edgecolor='black')
axes[2].set_title('Survival Rate by Title')
axes[2].set_ylabel('Survival Rate (%)')
axes[2].tick_params(axis='x', rotation=45)
plt.tight_layout()
plt.show()
Correlation Does Not Imply Causation
A strong correlation between two variables does not mean one causes the other. There may be a confounding (lurking) variable driving both. For example, Fare and Survival are correlated, but Fare is largely determined by Pclass, a third variable that also affects survival. Randomized controlled experiments are the most reliable way to establish causation; observational data such as this can only suggest it.
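One practical way to probe for a confounder is to recompute the correlation within each level of the suspected third variable. The sketch below uses synthetic data in which class drives both fare and survival, while fare has no effect once class is known. All numbers (`fare_mean`, `survival_prob`) are invented for the illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n_per_class = 1000

# Synthetic data (not the Titanic file): class drives BOTH fare and survival,
# while fare has no effect on survival once class is known.
fare_mean = {1: 80.0, 2: 20.0, 3: 10.0}
survival_prob = {1: 0.65, 2: 0.45, 3: 0.25}

frames = []
for pclass in (1, 2, 3):
    frames.append(pd.DataFrame({
        "Pclass": pclass,
        "Fare": rng.normal(fare_mean[pclass], 3.0, n_per_class),
        "Survived": rng.binomial(1, survival_prob[pclass], n_per_class),
    }))
sim = pd.concat(frames, ignore_index=True)

# Correlation over all rows, ignoring class
overall_r = sim["Fare"].corr(sim["Survived"])
# Correlation recomputed separately inside each class
within_r = {
    pclass: grp["Fare"].corr(grp["Survived"])
    for pclass, grp in sim.groupby("Pclass")
}

print(f"overall r = {overall_r:.2f}")
for pclass, r in within_r.items():
    print(f"class {pclass}: r = {r:.2f}")
```

The overall correlation is clearly positive, yet it largely vanishes inside each class. On real data the within-group correlations rarely drop all the way to zero, so read this as a diagnostic rather than a proof.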
Step 6: Advanced Visualizations
# Violin plots
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
sns.violinplot(data=df_clean, x='Pclass', y='Age', hue='Survived',
split=True, ax=axes[0], palette={0: '#FF6B6B', 1: '#4ECDC4'})
axes[0].set_title('Age Distribution by Class and Survival')
sns.violinplot(data=df_clean, x='Sex', y='Fare', hue='Survived',
split=True, ax=axes[1], palette={0: '#FF6B6B', 1: '#4ECDC4'})
axes[1].set_title('Fare Distribution by Sex and Survival')
plt.tight_layout()
plt.show()
# FacetGrid
g = sns.FacetGrid(df_clean, col='Pclass', row='Sex', height=4, aspect=1.2)
g.map_dataframe(sns.histplot, x='Age', hue='Survived',
palette={0: '#FF6B6B', 1: '#4ECDC4'}, multiple='stack')
g.set_axis_labels('Age', 'Count')
g.add_legend(title='Survived')
g.set_titles('{row_name} - {col_name} Class')
plt.suptitle('Age Distribution by Class and Sex', y=1.02)
plt.show()
# Clustermap
# sns.clustermap creates its own figure, so no plt.figure() call is needed
cluster_cols = ['Survived', 'Pclass', 'Age', 'Fare', 'FamilySize', 'HasCabin']
cluster_data = df_clean[cluster_cols].dropna()
sns.clustermap(cluster_data.corr(), annot=True, cmap='coolwarm',
               center=0, figsize=(10, 8))
# Title the whole figure; plt.title would label only the last axes drawn
plt.suptitle('Clustermap of Features', y=1.02)
plt.show()
Step 7: Key Insights and Findings
print("=" * 70)
print("KEY INSIGHTS AND FINDINGS")
print("=" * 70)
insights = """
1. SURVIVAL OVERVIEW
- Overall survival rate: {:.1f}%
- Higher survival for females ({:.1f}%) vs males ({:.1f}%)
2. CLASS EFFECT
- 1st class: {:.1f}% survival
- 2nd class: {:.1f}% survival
- 3rd class: {:.1f}% survival
- Clear class hierarchy in survival rates
3. AGE FACTOR
- Children (0-12) had highest survival rate: {:.1f}%
- Young adults (18-35) had lower survival: {:.1f}%
- Mean age differed only modestly between groups (check the t-test p-value)
4. FAMILY SIZE
- Optimal family size: 2-4 members
- Solo travelers had lower survival: {:.1f}%
- Very large families (5+) also had reduced survival
5. FARE AND CABIN
- Higher fare correlated with survival
- Passengers with cabins (higher class) survived more
6. SEX DISPARITY
- Strongest predictor of survival
- "Women and children first" policy clearly evident
"""
# Calculate statistics
overall = df_clean['Survived'].mean() * 100
female_survival = df_clean[df_clean['Sex'] == 'female']['Survived'].mean() * 100
male_survival = df_clean[df_clean['Sex'] == 'male']['Survived'].mean() * 100
class_survival = df_clean.groupby('Pclass')['Survived'].mean() * 100
child_survival = df_clean[df_clean['AgeBin'] == 'Child']['Survived'].mean() * 100
adult_survival = df_clean[df_clean['AgeBin'] == 'Adult']['Survived'].mean() * 100
solo_survival = df_clean[df_clean['IsAlone'] == 1]['Survived'].mean() * 100
print(insights.format(
overall, female_survival, male_survival,
class_survival[1], class_survival[2], class_survival[3],
child_survival, adult_survival,
solo_survival
))
# Recommendations
print("\n" + "=" * 70)
print("RECOMMENDATIONS")
print("=" * 70)
recommendations = """
Based on the analysis:
1. DATA COLLECTION
- Collect more complete data (Age has 20% missing)
- Consider additional features (ticket type, deck location)
2. MODELING APPROACH
- Use ensemble methods (Random Forest, Gradient Boosting)
- Include interaction terms (Sex × Pclass, Age × FamilySize)
- Consider non-linear relationships
3. BUSINESS INSIGHTS
- Survival was heavily influenced by social status
- Family connections mattered but optimal size existed
- Gender was the strongest predictor - reflect on ethical implications
4. NEXT STEPS
- Feature engineering for predictive modeling
- Cross-validation for model evaluation
- Consider fairness and bias in predictions
"""
print(recommendations)
Worked Example: Interpreting Statistical Tests
Chi-Square Test for Sex vs Survived:
- H₀: Sex and Survived are independent
- χ² statistic: ~261 with the default continuity correction in chi2_contingency (~263 without it); very large either way
- p-value: < 0.00001
- Decision: Reject H₀. Sex and Survived are strongly associated
Two-Sample T-Test for Fare:
- H₀: Mean fare is equal for survivors and non-survivors
- t-statistic: ~7.94
- p-value: < 0.00001
- Survivors paid higher fares on average (about 48 vs 22)
- Decision: Reject H₀. Mean fare differs significantly between the two groups
Key: A low p-value (< 0.05) indicates the observed difference is unlikely under H₀. But effect size matters too: always report both statistical significance and practical significance.
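Effect sizes put a scale on those test results. The sketch below computes Cohen's d for a difference in means and Cramér's V for a contingency table. `cohens_d` and `cramers_v` are hypothetical helper names written for this example, the two samples are made up, and `n=891` assumes the standard Titanic training file.

```python
import numpy as np


def cohens_d(a, b):
    """Standardized mean difference using the pooled standard deviation."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n_a, n_b = len(a), len(b)
    pooled_var = (
        (n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)
    ) / (n_a + n_b - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)


def cramers_v(chi2, n, n_rows, n_cols):
    """Association strength for a contingency table, between 0 and 1."""
    return np.sqrt(chi2 / (n * min(n_rows - 1, n_cols - 1)))


# Made-up samples for illustration
group_a = [10.0, 12.0, 14.0, 16.0, 18.0]
group_b = [8.0, 10.0, 12.0, 14.0, 16.0]
d = cohens_d(group_a, group_b)

# chi2 of about 263 is the uncorrected Pearson value for Sex vs Survived;
# n = 891 assumes the standard training file. Check both against your output.
v = cramers_v(chi2=263.0, n=891, n_rows=2, n_cols=2)

print(f"Cohen's d = {d:.2f}")
print(f"Cramer's V = {v:.2f}")
```

By common rules of thumb, a V above roughly 0.5 for a 2x2 table is a strong association, which fits the Sex vs Survived result. Such thresholds are conventions, not hard cutoffs.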
Step 8: Create Summary Report
# Create comprehensive summary visualization
fig = plt.figure(figsize=(16, 12))
gs = fig.add_gridspec(3, 3, hspace=0.4, wspace=0.3)
# 1. Survival Overview (top left)
ax1 = fig.add_subplot(gs[0, 0])
survival_counts = df_clean['Survived'].value_counts()
ax1.pie(survival_counts.values, labels=['Died', 'Survived'],
autopct='%1.1f%%', colors=['#FF6B6B', '#4ECDC4'])
ax1.set_title('Overall Survival Rate')
# 2. Survival by Gender (top middle)
ax2 = fig.add_subplot(gs[0, 1])
gender_survival = df_clean.groupby('Sex')['Survived'].mean() * 100
ax2.bar(gender_survival.index, gender_survival.values, color=['#FF6B6B', '#4ECDC4'])
ax2.set_title('Survival by Gender')
ax2.set_ylabel('Survival Rate (%)')
# 3. Survival by Class (top right)
ax3 = fig.add_subplot(gs[0, 2])
class_survival = df_clean.groupby('Pclass')['Survived'].mean() * 100
ax3.bar(class_survival.index, class_survival.values,
color=['#4ECDC4', '#FFEAA7', '#FF6B6B'])
ax3.set_title('Survival by Class')
ax3.set_xlabel('Passenger Class')
ax3.set_ylabel('Survival Rate (%)')
# 4. Age Distribution (middle left)
ax4 = fig.add_subplot(gs[1, 0])
sns.histplot(data=df_clean, x='Age', hue='Survived', kde=True,
ax=ax4, palette={0: '#FF6B6B', 1: '#4ECDC4'})
ax4.set_title('Age Distribution by Survival')
# 5. Fare Distribution (middle middle)
ax5 = fig.add_subplot(gs[1, 1])
sns.boxplot(data=df_clean, x='Survived', y='Fare', ax=ax5,
palette=['#FF6B6B', '#4ECDC4'])
ax5.set_title('Fare Distribution by Survival')
ax5.set_xticklabels(['Died', 'Survived'])
# 6. Correlation Heatmap (middle right)
ax6 = fig.add_subplot(gs[1, 2])
corr_cols = ['Survived', 'Pclass', 'Age', 'Fare', 'FamilySize']
corr = df_clean[corr_cols].corr()
sns.heatmap(corr, annot=True, cmap='coolwarm', center=0,
ax=ax6, fmt='.2f', square=True)
ax6.set_title('Feature Correlations')
# 7. Survival by Family Size (bottom left)
ax7 = fig.add_subplot(gs[2, 0])
family_survival = df_clean.groupby('FamilySize')['Survived'].mean() * 100
ax7.bar(family_survival.index, family_survival.values, color='skyblue', edgecolor='black')
ax7.set_title('Survival by Family Size')
ax7.set_xlabel('Family Size')
ax7.set_ylabel('Survival Rate (%)')
# 8. Survival by Title (bottom middle)
ax8 = fig.add_subplot(gs[2, 1])
title_survival = df_clean.groupby('Title')['Survived'].agg(['mean', 'count'])
title_survival = title_survival[title_survival['count'] >= 10].sort_values('mean', ascending=False)
ax8.barh(title_survival.index, title_survival['mean'] * 100, color='lightgreen')
ax8.set_title('Survival by Title (n ≥ 10)')
ax8.set_xlabel('Survival Rate (%)')
# 9. Key Statistics (bottom right)
ax9 = fig.add_subplot(gs[2, 2])
ax9.axis('off')
stats_text = f"""
KEY STATISTICS
---------------------
Total Passengers: {len(df_clean)}
Survival Rate: {overall:.1f}%
Female Survival: {female_survival:.1f}%
Male Survival: {male_survival:.1f}%
---------------------
1st Class: {class_survival[1]:.1f}%
2nd Class: {class_survival[2]:.1f}%
3rd Class: {class_survival[3]:.1f}%
---------------------
Children (0-12): {child_survival:.1f}%
Solo Travelers: {solo_survival:.1f}%
"""
ax9.text(0.1, 0.5, stats_text, transform=ax9.transAxes,
fontsize=11, verticalalignment='center',
bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))
plt.suptitle('Titanic EDA Summary Dashboard', fontsize=16, fontweight='bold', y=1.02)
plt.savefig('titanic_eda_summary.png', dpi=300, bbox_inches='tight')
plt.show()
print("\nSummary dashboard saved as 'titanic_eda_summary.png'")
Communicating EDA Results
A good EDA summary tells a story:
- What: Key findings with supporting statistics
- So what: Why these findings matter (business/practical implications)
- Now what: Recommended next steps (feature engineering, modeling, further analysis)
Always lead with the most impactful finding and use visualizations to support each claim.
Key Takeaways
Summary: EDA on Real Dataset
- EDA is systematic: Follow a structured approach: inspect → clean → univariate → bivariate → multivariate → report
- Missing data is informative: Understand whether values are missing completely at random (MCAR), at random (MAR), or not at random (MNAR) before choosing an imputation strategy
- Start simple: Understand distributions before complex relationships, and always visualize before computing statistics
- Use statistics to quantify: Don't just visualize; use chi-square tests, t-tests, and correlation to test for associations and measure their strength
- Feature engineering is creative: Derived features (FamilySize, Title, AgeBin) often reveal patterns hidden in raw data
- Tell a story: EDA should lead to actionable insights, and the "so what" matters as much as the "what"
- Class imbalance: Check target distribution early, because it affects both EDA interpretation and downstream modeling
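Two of these takeaways translate into one-line checks: comparing missingness across groups, and looking at the target balance. The sketch below runs both on a made-up frame; in the project itself they would be applied to the raw `df` before imputation.

```python
import numpy as np
import pandas as pd

# Made-up frame; in the project this would be the raw df before imputation
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 1, 3, 3, 3, 3, 3, 3],
    "Age":      [38, 35, 54, np.nan, np.nan, 22, np.nan, np.nan, 4, np.nan],
    "Survived": [1, 1, 0, 1, 0, 0, 0, 1, 1, 0],
})

# Share of missing Age values within each class. A large gap between groups
# suggests the data are not missing completely at random (MCAR).
missing_rate = toy["Age"].isna().groupby(toy["Pclass"]).mean()
print(missing_rate)

# Target balance: proportion of rows in each class of the target variable
target_share = toy["Survived"].value_counts(normalize=True)
print(target_share)
```

If missingness differs sharply between groups, a single global median may blur real differences, and a group-wise fill or an explicit indicator column is worth considering.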
Next Steps
- Apply these techniques to your own datasets
- Practice with different types of data (time series, text, images)
- Build automated EDA reports using libraries like ydata-profiling or sweetviz
- Move to predictive modeling in Module 2
- Consider: how would your EDA findings change if the dataset had 10x more missing values? What about 100x more rows?