What is Data Science + EDA Mindset

Module 1: Foundations

What is Data Science?

Definition: Data Science

Data Science is an interdisciplinary field that systematically extracts knowledge, patterns, and actionable insights from structured and unstructured data using methods from statistics, computer science, and domain expertise. It encompasses the entire lifecycle from data acquisition and cleaning to model deployment and decision support.

Data Science sits at the intersection of three foundational disciplines:

Architecture Diagram
                    ┌──────────────────┐
                    │   Data Science   │
                    └────────┬─────────┘
                             │
        ┌────────────────────┼────────────────────┐
        │                    │                    │
   ┌────▼─────┐          ┌───▼───┐           ┌────▼─────┐
   │  Math &  │          │Coding │           │  Domain  │
   │Statistics│          │Skills │           │Knowledge │
   └──────────┘          └───────┘           └──────────┘

Core Components

| Component | Description | Tools | Output |
|---|---|---|---|
| Data Collection | Gathering raw data from various sources | APIs, SQL, Web Scraping | Raw datasets |
| Data Cleaning | Transforming raw data into usable format | Pandas, OpenRefine | Clean datasets |
| Exploratory Data Analysis | Understanding data through visualization | Matplotlib, Seaborn | Insights, hypotheses |
| Feature Engineering | Creating new variables from existing data | Scikit-learn, Domain Knowledge | Feature matrices |
| Modeling | Building predictive or descriptive models | Scikit-learn, TensorFlow, PyTorch | Trained models |
| Communication | Sharing insights with stakeholders | Jupyter, Tableau, Power BI | Reports, dashboards |

ℹ️ The Data Science Hierarchy

According to DJ Patil (former US Chief Data Scientist), Data Science is "the ability to extract knowledge from data, in all of its forms." The field has evolved from simple reporting (what happened?) to predictive analytics (what will happen?) to prescriptive analytics (what should we do?).

The Data Science Process

The Data Science lifecycle follows a systematic, iterative process:

  1. Define the Problem: Formulate a clear, answerable question
  2. Collect Data: Gather relevant data from appropriate sources
  3. Clean Data: Handle missing values, outliers, and inconsistencies
  4. Explore Data: Understand distributions, relationships, and patterns
  5. Model Data: Build statistical or machine learning models
  6. Evaluate Results: Validate model performance and assumptions
  7. Communicate Findings: Present insights to stakeholders
  8. Deploy Solution: Integrate models into production systems

Architecture Diagram
graph LR
    A[Define Problem] --> B[Collect Data]
    B --> C[Clean Data]
    C --> D[Explore Data]
    D --> E[Model Data]
    E --> F[Evaluate Results]
    F --> G[Communicate Findings]
    G --> H[Deploy Solution]
    H --> A

⚠️ Iterative Process

In practice, the data science process is rarely linear. You often need to revisit earlier steps as new insights emerge. For example, exploring data may reveal you need additional data sources, or modeling results may indicate that your feature engineering needs improvement.

The EDA Mindset

Definition: Exploratory Data Analysis (EDA)

Exploratory Data Analysis (EDA) is a critical early step in any data analysis, an approach popularized by John Tukey's 1977 book Exploratory Data Analysis. It means approaching data with curiosity and skepticism, using summary statistics and visualization to understand the underlying structure, identify anomalies, and formulate hypotheses before formal modeling.

The 5 Questions of EDA

  1. What does the data look like? Structure, types, size, completeness
  2. What is the distribution of each variable? Univariate analysis (histograms, box plots)
  3. How do variables relate to each other? Bivariate/multivariate analysis (scatter plots, correlations)
  4. Are there anomalies or outliers? Data quality assessment
  5. What patterns or trends emerge? Storytelling and hypothesis generation

EDA Workflow

πŸ“Complete EDA Workflow in Python

# 1. Load and inspect data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv('data.csv')
print(f"Shape: {df.shape}")           # Dimensions
print(df.head())                      # First 5 rows
df.info()                             # Data types and non-null counts (prints directly; returns None)
print(df.describe())                  # Statistical summary

# 2. Check for missing values
print(df.isnull().sum())
print(f"Missing percentage:\n{df.isnull().mean() * 100}")

# 3. Examine distributions (up to the first six numeric columns)
numerical_cols = df.select_dtypes(include=[np.number]).columns
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
for ax, col in zip(axes.flat, numerical_cols[:6]):
    ax.hist(df[col].dropna(), bins=30, edgecolor='black')
    ax.set_title(f'Distribution of {col}')
    ax.axvline(df[col].mean(), color='red', linestyle='--', label='Mean')
    ax.axvline(df[col].median(), color='green', linestyle='--', label='Median')
    ax.legend()
# Hide unused panels when there are fewer than six numeric columns
for ax in axes.flat[len(numerical_cols[:6]):]:
    ax.set_visible(False)
plt.tight_layout()
plt.show()

# 4. Correlation analysis
correlation_matrix = df.select_dtypes(include=[np.number]).corr()
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0, 
            fmt='.2f', square=True, linewidths=0.5)
plt.title('Correlation Matrix')
plt.show()

# 5. Pairwise relationships
sns.pairplot(df[numerical_cols[:4]], diag_kind='kde')
plt.suptitle('Pairwise Relationships', y=1.02)
plt.show()

Common EDA Patterns to Spot

| Pattern | What It Means | Action |
|---|---|---|
| Skewed distribution | Data not normally distributed | Consider log/Box-Cox transformation |
| Multiple peaks (multimodal) | Mixed populations present | Segment analysis needed |
| Outliers | Potential errors or special cases | Investigate, consider robust methods |
| Strong correlations | Variables are related | Feature selection opportunities |
| Missing data patterns | Systematic gaps in data | Understand mechanism (MCAR/MAR/MNAR) |
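
As a rough illustration of the first row in the table above, the sketch below measures skewness before and after a log transform. The `income` column is synthetic (log-normal) and only stands in for a real right-skewed variable; `np.log1p` is one common choice among several and assumes non-negative values.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed data (log-normal); a stand-in for a real column
rng = np.random.default_rng(0)
income = pd.Series(rng.lognormal(mean=10, sigma=1, size=5_000), name="income")

# Sample skewness: close to 0 for symmetric data, positive for a long right tail
skew_before = income.skew()

# log1p computes log(1 + x), which is defined for all x >= 0
income_log = np.log1p(income)
skew_after = income_log.skew()

print(f"Skewness before transform: {skew_before:.2f}")
print(f"Skewness after log1p:      {skew_after:.2f}")
```

If the skewness stays far from zero after the transform, a Box-Cox transformation or a robust method may be a better fit than a plain log.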

ℹ️ Missing Data Mechanisms

Understanding why data is missing is crucial:

  • MCAR (Missing Completely at Random): Missingness is unrelated to any variable, observed or unobserved. Dropping incomplete rows does not bias estimates, although it reduces the sample size.
  • MAR (Missing at Random): Missingness depends only on observed variables, so it can be modeled or imputed using those variables.
  • MNAR (Missing Not at Random): Missingness depends on the missing value itself. This is the most challenging case and requires domain knowledge.
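
To make the difference concrete, here is a small simulation sketch. The income values, the 30% missing rate, and the 55,000 threshold are all made-up assumptions chosen for illustration. It compares the observed mean under MCAR with the observed mean under MNAR, where high values are more likely to be withheld.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 20_000

# Hypothetical fully observed variable
income = pd.Series(rng.normal(loc=50_000, scale=10_000, size=n))
true_mean = income.mean()

# MCAR: every value has the same 30% chance of being missing
mcar = income.mask(rng.random(n) < 0.30)

# MNAR: the chance of being missing depends on the value itself
# (assumed here: high incomes are withheld 60% of the time, others 10%)
p_missing = np.where(income > 55_000, 0.60, 0.10)
mnar = income.mask(rng.random(n) < p_missing)

# Series.mean() skips NaN, so these are complete-case means
print(f"True mean:       {true_mean:,.0f}")
print(f"Mean under MCAR: {mcar.mean():,.0f}")
print(f"Mean under MNAR: {mnar.mean():,.0f}")
```

Under MCAR the complete-case mean stays close to the true mean, while under this MNAR setup it is pulled downward because large values are underrepresented.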

Types of Data Science Problems

Definition: Supervised Learning

In supervised learning, we have labeled training data where both inputs $X$ and outputs $Y$ are known. The goal is to learn a mapping function $f: X \rightarrow Y$ that can predict $Y$ for new, unseen $X$. Examples include classification (discrete $Y$) and regression (continuous $Y$).

Definition: Unsupervised Learning

In unsupervised learning, we have only input data $X$ with no corresponding labels. The goal is to discover hidden patterns, structures, or groupings in the data. Examples include clustering (grouping similar observations) and dimensionality reduction (finding lower-dimensional representations).

Definition: Semi-Supervised Learning

Semi-supervised learning uses a small amount of labeled data combined with a large amount of unlabeled data. This is particularly valuable when labeling is expensive (e.g., medical imaging) but unlabeled data is abundant.

| Problem Type | Target Variable | Example | Algorithm Family |
|---|---|---|---|
| Classification | Discrete (categories) | Spam detection | Logistic Regression, SVM, Random Forest |
| Regression | Continuous (numbers) | Price prediction | Linear Regression, Ridge, Neural Networks |
| Clustering | None (discover groups) | Customer segmentation | K-Means, DBSCAN, Hierarchical |
| Dimensionality Reduction | None (compress features) | Feature extraction | PCA, t-SNE, UMAP |
| Anomaly Detection | None (find outliers) | Fraud detection | Isolation Forest, Autoencoders |
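
The supervised setting described earlier, learning a mapping from labeled inputs to outputs, can be sketched with the regression row of the table. This is a minimal example under stated assumptions: the data is synthetic (generated from a hypothetical relationship y = 3x + 2 plus noise), and ordinary least squares via `np.linalg.lstsq` stands in for any regression algorithm.

```python
import numpy as np

rng = np.random.default_rng(1)

# Labeled training data: inputs with known outputs
# (hypothetical ground truth: y = 3x + 2 plus Gaussian noise)
X_train = rng.uniform(0, 10, size=200)
y_train = 3.0 * X_train + 2.0 + rng.normal(0, 0.5, size=200)

# Learn f(x) = w * x + b by ordinary least squares
A = np.column_stack([X_train, np.ones_like(X_train)])
(w, b), *_ = np.linalg.lstsq(A, y_train, rcond=None)

def f(x):
    """Predict y for new, unseen x using the learned parameters."""
    return w * x + b

X_new = np.array([2.5, 7.0])
print(f"Learned w = {w:.2f}, b = {b:.2f}")
print("Predictions for new inputs:", f(X_new))
```

In an unsupervised problem there would be no `y_train` at all, and the algorithm would have to find structure in `X_train` alone.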

Real-World Example: Customer Churn Analysis

πŸ“EDA for Customer Churn Dataset

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load data
df = pd.read_csv('customer_churn.csv')

# Basic info
print(f"Dataset shape: {df.shape}")
print(f"\nColumn types:\n{df.dtypes}")
print(f"\nMissing values:\n{df.isnull().sum()}")

# Target variable distribution
churn_counts = df['Churn'].value_counts()
print(f"\nChurn distribution:\n{churn_counts}")
print(f"\nChurn rate: {churn_counts['Yes']/len(df)*100:.2f}%")

# Numerical features: distributions (up to the first six numeric columns)
numerical_cols = df.select_dtypes(include='number').columns
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
for ax, col in zip(axes.flat, numerical_cols[:6]):
    ax.hist(df[col].dropna(), bins=30, edgecolor='black', alpha=0.7)
    ax.set_title(f'Distribution of {col}')
plt.tight_layout()
plt.show()

# Categorical features vs Churn (up to six columns, excluding the target itself)
categorical_cols = [c for c in df.select_dtypes(include=['object']).columns
                    if c != 'Churn']
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
for ax, col in zip(axes.flat, categorical_cols[:6]):
    # Row-normalized crosstab: share of churned vs retained within each category
    pd.crosstab(df[col], df['Churn'], normalize='index').plot(
        kind='bar', stacked=True, ax=ax)
    ax.set_title(f'{col} vs Churn Rate')
    ax.tick_params(axis='x', rotation=45)
plt.tight_layout()
plt.show()

# Correlation with target (encode target)
df['Churn_encoded'] = (df['Churn'] == 'Yes').astype(int)
corr_with_target = df[numerical_cols].corrwith(df['Churn_encoded']).sort_values()
print(f"\nCorrelation with churn:\n{corr_with_target}")

💡 EDA Checklist

Before any modeling, always check:

  1. Data types: Are they correct? (e.g., dates as datetime, categories as category)
  2. Missing values: How much? What pattern? How to handle?
  3. Duplicates: Are there duplicate rows?
  4. Outliers: Using IQR or Z-score methods
  5. Feature distributions: Normal, skewed, bimodal?
  6. Class balance: For classification problems, is the target balanced?
  7. Feature relationships: Correlations, scatter plots
  8. Data leakage: Are there features that wouldn't be available at prediction time?
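
A few of these checks (duplicates, IQR outliers, class balance) can be sketched in pandas as follows. The tiny DataFrame and its column names (`age`, `plan`, `target`) are hypothetical and exist only to make the example self-contained. The 1.5 × IQR rule is a common convention, not the only option.

```python
import pandas as pd

# Small hypothetical dataset with one duplicate row, one extreme age,
# and an imbalanced binary target
df = pd.DataFrame({
    "age":    [25, 32, 47, 51, 38, 29, 32, 44, 36, 120],
    "plan":   ["a", "b", "a", "a", "b", "a", "b", "b", "a", "a"],
    "target": [0, 0, 0, 1, 0, 0, 0, 0, 0, 1],
})

# Data types: confirm each column was parsed as expected
print(df.dtypes)

# Duplicates: count rows that exactly repeat an earlier row
n_duplicates = int(df.duplicated().sum())

# Outliers: flag values more than 1.5 * IQR beyond the quartiles
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["age"] < q1 - 1.5 * iqr) | (df["age"] > q3 + 1.5 * iqr)]

# Class balance: share of each target class
class_share = df["target"].value_counts(normalize=True)

print(f"Duplicate rows: {n_duplicates}")
print(f"Outlier ages (IQR rule): {outliers['age'].tolist()}")
print(f"Class shares:\n{class_share}")
```

Flagged outliers are candidates for investigation rather than automatic removal; an age of 120 may be a data entry error, but that is a domain judgment.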

Key Takeaways

📋 Summary: What is Data Science + EDA Mindset

  1. Data Science combines statistics, programming, and domain knowledge to extract insights from data
  2. EDA comes before modeling: always explore before you model
  3. Understand missing data mechanisms (MCAR/MAR/MNAR) to handle them correctly
  4. Data Science problems fall into supervised, unsupervised, and semi-supervised categories
  5. Always follow a systematic workflow: define, collect, clean, explore, model, evaluate, communicate, deploy
  6. Document your EDA findings, because they guide all subsequent analysis

Practice Exercise

Download the Titanic dataset and perform a complete EDA:

  1. What variables are available? What are their types?
  2. What is the survival rate overall and by subgroup?
  3. Which features correlate most strongly with survival?
  4. Are there data quality issues (missing values, outliers, inconsistencies)?
  5. Formulate 3 hypotheses about survival based on your EDA
