What is Data Science?
Definition: Data Science
Data Science is an interdisciplinary field that systematically extracts knowledge, patterns, and actionable insights from structured and unstructured data using methods from statistics, computer science, and domain expertise. It encompasses the entire lifecycle from data acquisition and cleaning to model deployment and decision support.
Data Science sits at the intersection of three foundational disciplines:
                ┌──────────────┐
                │ Data Science │
                └──────┬───────┘
                       │
       ┌───────────────┼───────────────┐
       │               │               │
┌──────▼─────┐  ┌──────▼─────┐  ┌──────▼─────┐
│   Math &   │  │   Coding   │  │   Domain   │
│ Statistics │  │   Skills   │  │ Knowledge  │
└────────────┘  └────────────┘  └────────────┘
Core Components
| Component | Description | Tools | Output |
|---|---|---|---|
| Data Collection | Gathering raw data from various sources | APIs, SQL, Web Scraping | Raw datasets |
| Data Cleaning | Transforming raw data into usable format | Pandas, OpenRefine | Clean datasets |
| Exploratory Data Analysis | Understanding data through visualization | Matplotlib, Seaborn | Insights, hypotheses |
| Feature Engineering | Creating new variables from existing data | Scikit-learn, Domain Knowledge | Feature matrices |
| Modeling | Building predictive or descriptive models | Scikit-learn, TensorFlow, PyTorch | Trained models |
| Communication | Sharing insights with stakeholders | Jupyter, Tableau, PowerBI | Reports, dashboards |
ℹ️ The Data Science Hierarchy
According to DJ Patil (former US Chief Data Scientist), Data Science is "the ability to extract knowledge from data, in all of its forms." The field has evolved from simple reporting (what happened?) to predictive analytics (what will happen?) to prescriptive analytics (what should we do?).
The Data Science Process
The Data Science lifecycle follows a systematic, iterative process:
- Define the Problem: Formulate a clear, answerable question
- Collect Data: Gather relevant data from appropriate sources
- Clean Data: Handle missing values, outliers, and inconsistencies
- Explore Data: Understand distributions, relationships, and patterns
- Model Data: Build statistical or machine learning models
- Evaluate Results: Validate model performance and assumptions
- Communicate Findings: Present insights to stakeholders
- Deploy Solution: Integrate models into production systems
graph LR
    A[Define Problem] --> B[Collect Data]
    B --> C[Clean Data]
    C --> D[Explore Data]
    D --> E[Model Data]
    E --> F[Evaluate Results]
    F --> G[Communicate Findings]
    G --> H[Deploy Solution]
    H --> A
⚠️ Iterative Process
In practice, the data science process is rarely linear. You often need to revisit earlier steps as new insights emerge. For example, exploring data may reveal you need additional data sources, or modeling results may indicate that your feature engineering needs improvement.
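As a rough illustration of the "Clean Data" step in the process above, the sketch below handles missing values, an implausible outlier, inconsistent labels, and a duplicate row on a tiny made-up table. The column names (`age`, `city`), the plausible-age bounds, and the choice of median imputation are assumptions made for illustration, not a recommended recipe.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data: one missing age, one implausible age,
# inconsistently written city names, and one repeated row.
raw = pd.DataFrame({
    "age": [34, np.nan, 29, 410, 41, 41],
    "city": ["Berlin", "berlin ", "Paris", "Paris", "PARIS", "PARIS"],
})

def clean(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Inconsistencies: normalise whitespace and capitalisation.
    out["city"] = out["city"].str.strip().str.title()
    # Outliers: treat ages outside an assumed plausible range as missing.
    out["age"] = out["age"].where(out["age"].between(0, 120))
    # Missing values: impute with the median of the remaining ages.
    out["age"] = out["age"].fillna(out["age"].median())
    # Duplicates: drop rows that are identical after cleaning.
    return out.drop_duplicates().reset_index(drop=True)

cleaned = clean(raw)
print(cleaned)
```

In a real project each of these choices (bounds, imputation strategy, whether duplicates are errors) would be revisited as exploration and modeling reveal more about the data.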
The EDA Mindset
Definition: Exploratory Data Analysis (EDA)
Exploratory Data Analysis (EDA) is the critical first step in any data analysis. The approach was championed by John Tukey, whose book Exploratory Data Analysis was published in 1977. It involves approaching data with curiosity and skepticism, using summary statistics and visualization to understand the underlying structure, identify anomalies, and formulate hypotheses before formal modeling.
The 5 Questions of EDA
- What does the data look like? Structure, types, size, completeness
- What is the distribution of each variable? Univariate analysis (histograms, box plots)
- How do variables relate to each other? Bivariate/multivariate analysis (scatter plots, correlations)
- Are there anomalies or outliers? Data quality assessment
- What patterns or trends emerge? Storytelling and hypothesis generation
EDA Workflow
Complete EDA Workflow in Python
# 1. Load and inspect data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv('data.csv')
print(f"Shape: {df.shape}")  # Dimensions
print(df.head())             # First 5 rows
df.info()                    # Data types and non-null counts (prints directly, returns None)
print(df.describe())         # Statistical summary

# 2. Check for missing values
print(df.isnull().sum())
print(f"Missing percentage:\n{df.isnull().mean() * 100}")

# 3. Examine distributions (first six numerical columns on a 2x3 grid)
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
numerical_cols = df.select_dtypes(include=[np.number]).columns
for i, col in enumerate(numerical_cols[:6]):
    row, col_idx = i // 3, i % 3
    axes[row, col_idx].hist(df[col].dropna(), bins=30, edgecolor='black')
    axes[row, col_idx].set_title(f'Distribution of {col}')
    axes[row, col_idx].axvline(df[col].mean(), color='red', linestyle='--', label='Mean')
    axes[row, col_idx].axvline(df[col].median(), color='green', linestyle='--', label='Median')
    axes[row, col_idx].legend()
plt.tight_layout()
plt.show()

# 4. Correlation analysis
correlation_matrix = df.select_dtypes(include=[np.number]).corr()
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0,
            fmt='.2f', square=True, linewidths=0.5)
plt.title('Correlation Matrix')
plt.show()

# 5. Pairwise relationships (first four numerical columns)
sns.pairplot(df[numerical_cols[:4]], diag_kind='kde')
plt.suptitle('Pairwise Relationships', y=1.02)
plt.show()
Common EDA Patterns to Spot
| Pattern | What It Means | Action |
|---|---|---|
| Skewed distribution | Asymmetric, with a long tail on one side | Consider a log or Box-Cox transformation (Box-Cox requires positive values) |
| Multiple peaks (multimodal) | Mixed populations present | Segment analysis needed |
| Outliers | Potential errors or special cases | Investigate, consider robust methods |
| Strong correlations | Variables are linearly related and may be redundant | Feature selection opportunities; watch for multicollinearity |
| Missing data patterns | Systematic gaps in data | Understand mechanism (MCAR/MAR/MNAR) |
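Two of the patterns in the table, skew and outliers, can be checked numerically as well as visually. The following is a minimal sketch on synthetic data: the log-normal sample stands in for a right-skewed variable such as income, and the 1.5 × IQR fences are the conventional rule of thumb rather than a universal threshold.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic right-skewed variable (log-normal); purely illustrative.
values = pd.Series(rng.lognormal(mean=3.0, sigma=1.0, size=2000), name="value")

# Skewed distribution: compare sample skewness before and after a log transform.
skew_raw = values.skew()
skew_log = np.log1p(values).skew()

# Outliers: flag points outside the 1.5 * IQR fences.
q1, q3 = values.quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)

print(f"skewness raw: {skew_raw:.2f}, after log1p: {skew_log:.2f}")
print(f"flagged by IQR rule: {is_outlier.sum()} of {len(values)}")
```

A flagged point is a prompt for investigation, not an automatic deletion: in a genuinely long-tailed variable many flagged values are legitimate.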
ℹ️ Missing Data Mechanisms
Understanding why data is missing is crucial:
- MCAR (Missing Completely at Random): Missingness is unrelated to any variable, observed or unobserved. Dropping incomplete rows does not bias estimates, although it reduces the sample size.
- MAR (Missing at Random): Missingness depends only on observed variables, so it can be modeled, for example by imputing conditional on those variables.
- MNAR (Missing Not at Random): Missingness depends on the missing value itself. This is the most challenging case and requires domain knowledge.
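To see why the mechanism matters, the simulation below removes values from a synthetic `income` variable in two ways and compares the mean of what remains. The variable, the 30% missingness rate, and the logistic form of the MNAR rule are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
# Synthetic "income" values; the true mean is known because nothing is missing yet.
income = rng.normal(loc=50_000, scale=10_000, size=100_000)
true_mean = income.mean()

# MCAR: every value has the same 30% chance of being missing.
mcar_missing = rng.random(income.size) < 0.30
mcar_mean = income[~mcar_missing].mean()

# MNAR: the chance of being missing rises with the value itself
# (assumed logistic rule: high earners are less likely to report).
p_missing = 1.0 / (1.0 + np.exp(-(income - 50_000) / 5_000))
mnar_missing = rng.random(income.size) < p_missing
mnar_mean = income[~mnar_missing].mean()

print(f"true mean:          {true_mean:,.0f}")
print(f"observed mean MCAR: {mcar_mean:,.0f}")
print(f"observed mean MNAR: {mnar_mean:,.0f}")
```

Under MCAR the observed mean stays close to the true mean, while under MNAR it is pulled downward, and nothing in the observed data alone reveals by how much.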
Types of Data Science Problems
Definition: Supervised Learning
In supervised learning, we have labeled training data where both inputs X and outputs y are known. The goal is to learn a mapping function f from X to y that can predict y for new, unseen X. Examples include classification (discrete y) and regression (continuous y).
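A minimal regression sketch of this idea, using only NumPy: the data are synthetic (inputs `X`, labels `y` generated from an assumed line plus noise), the last 50 points are held out to play the role of unseen inputs, and ordinary least squares via `np.polyfit` stands in for whatever model a real project would use.

```python
import numpy as np

rng = np.random.default_rng(0)

# Labeled data: inputs X with known outputs y (synthetic: y = 3x + 2 plus noise).
X = rng.uniform(0, 10, size=200)
y = 3.0 * X + 2.0 + rng.normal(0, 1.0, size=200)

# Hold out the last 50 points as "new, unseen" inputs.
X_train, y_train = X[:150], y[:150]
X_test, y_test = X[150:], y[150:]

# Learn the mapping f(X) = slope * X + intercept by ordinary least squares.
slope, intercept = np.polyfit(X_train, y_train, deg=1)

# Predict y for the unseen X and measure the error on the held-out labels.
y_pred = slope * X_test + intercept
rmse = np.sqrt(np.mean((y_test - y_pred) ** 2))

print(f"learned f(X) = {slope:.2f} * X + {intercept:.2f}, test RMSE = {rmse:.2f}")
```

The held-out error is what indicates how well the learned function generalizes; error on the training points alone would be optimistic.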
Definition: Unsupervised Learning
In unsupervised learning, we have only input data with no corresponding labels. The goal is to discover hidden patterns, structures, or groupings in the data. Examples include clustering (grouping similar observations) and dimensionality reduction (finding lower-dimensional representations).
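Clustering can be sketched in a few lines of NumPy. The code below is a bare-bones version of Lloyd's k-means algorithm run on two synthetic blobs; no labels are given to the algorithm. The helper names (`farthest_first_init`, `kmeans`) are hypothetical, and the deterministic farthest-first initialisation is an assumption made to keep the example reproducible; library implementations typically use k-means++ with random restarts.

```python
import numpy as np

def farthest_first_init(points: np.ndarray, k: int) -> np.ndarray:
    """Pick the first point, then repeatedly the point farthest from all chosen ones."""
    centroids = [points[0]]
    for _ in range(1, k):
        dist_to_nearest = np.min(
            [np.linalg.norm(points - c, axis=1) for c in centroids], axis=0
        )
        centroids.append(points[dist_to_nearest.argmax()])
    return np.array(centroids, dtype=float)

def kmeans(points: np.ndarray, k: int, n_iter: int = 20):
    """Minimal Lloyd's algorithm; returns (centroids, labels from the last assignment)."""
    centroids = farthest_first_init(points, k)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return centroids, labels

rng = np.random.default_rng(0)
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(100, 2))
blob_b = rng.normal(loc=[5.0, 5.0], scale=0.5, size=(100, 2))
points = np.vstack([blob_a, blob_b])

centroids, labels = kmeans(points, k=2)
print("centroids:\n", centroids.round(2))
print("cluster sizes:", np.bincount(labels))
```

The algorithm recovers the two groups from the geometry of the inputs alone, which is the defining feature of the unsupervised setting.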
Definition: Semi-Supervised Learning
Semi-supervised learning uses a small amount of labeled data combined with a large amount of unlabeled data. This is particularly valuable when labeling is expensive (e.g., medical imaging) but unlabeled data is abundant.
| Problem Type | Target Variable | Example | Algorithm Family |
|---|---|---|---|
| Classification | Discrete (categories) | Spam detection | Logistic Regression, SVM, Random Forest |
| Regression | Continuous (numbers) | Price prediction | Linear Regression, Ridge, Neural Networks |
| Clustering | None (discover groups) | Customer segmentation | K-Means, DBSCAN, Hierarchical |
| Dimensionality Reduction | None (compress features) | Feature extraction | PCA, t-SNE, UMAP |
| Anomaly Detection | None (find outliers) | Fraud detection | Isolation Forest, Autoencoders |
Real-World Example: Customer Churn Analysis
EDA for Customer Churn Dataset
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load data
df = pd.read_csv('customer_churn.csv')

# Basic info
print(f"Dataset shape: {df.shape}")
print(f"\nColumn types:\n{df.dtypes}")
print(f"\nMissing values:\n{df.isnull().sum()}")

# Target variable distribution
churn_counts = df['Churn'].value_counts()
print(f"\nChurn distribution:\n{churn_counts}")
print(f"\nChurn rate: {churn_counts['Yes']/len(df)*100:.2f}%")

# Numerical features: distributions (first six columns on a 2x3 grid)
numerical_cols = df.select_dtypes(include=['int64', 'float64']).columns
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
for i, col in enumerate(numerical_cols[:6]):
    row, col_idx = i // 3, i % 3
    # Drop NaNs first; a histogram cannot be binned over missing values
    axes[row, col_idx].hist(df[col].dropna(), bins=30, edgecolor='black', alpha=0.7)
    axes[row, col_idx].set_title(f'Distribution of {col}')
plt.tight_layout()
plt.show()

# Categorical features vs Churn (exclude the target by name; the grid holds at most six)
categorical_cols = df.select_dtypes(include=['object']).columns.drop('Churn')
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
for i, col in enumerate(categorical_cols[:6]):
    row, col_idx = i // 3, i % 3
    pd.crosstab(df[col], df['Churn'], normalize='index').plot(
        kind='bar', stacked=True, ax=axes[row, col_idx])
    axes[row, col_idx].set_title(f'{col} vs Churn Rate')
    axes[row, col_idx].tick_params(axis='x', rotation=45)
plt.tight_layout()
plt.show()

# Correlation with target (encode target as 0/1)
df['Churn_encoded'] = (df['Churn'] == 'Yes').astype(int)
corr_with_target = df[numerical_cols].corrwith(df['Churn_encoded']).sort_values()
print(f"\nCorrelation with churn:\n{corr_with_target}")
💡 EDA Checklist
Before any modeling, always check:
- Data types: Are they correct? (e.g., dates as datetime, categories as category)
- Missing values: How much? What pattern? How to handle?
- Duplicates: Are there duplicate rows?
- Outliers: Flag candidates with IQR or Z-score methods
- Feature distributions: Normal, skewed, bimodal?
- Class balance: For classification problems, is the target balanced?
- Feature relationships: Correlations, scatter plots
- Data leakage: Are there features that wouldn't be available at prediction time?
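Several items on this checklist can be gathered by one small helper. The sketch below is one possible arrangement, not a standard API: the function name `eda_checklist`, the toy columns (`tenure`, `plan`, `churn`), and the 1.5 × IQR rule are assumptions. Data leakage is deliberately absent because it requires knowing how and when each feature is produced, which code cannot infer.

```python
import pandas as pd

def eda_checklist(df: pd.DataFrame, target: str) -> dict:
    """Collect a few pre-modeling checks into one dictionary."""
    numeric = df.select_dtypes(include="number")
    # IQR rule: count values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] per numeric column.
    q1, q3 = numeric.quantile(0.25), numeric.quantile(0.75)
    iqr = q3 - q1
    outliers = ((numeric < q1 - 1.5 * iqr) | (numeric > q3 + 1.5 * iqr)).sum()
    return {
        "dtypes": df.dtypes.astype(str).to_dict(),
        "missing_pct": (df.isna().mean() * 100).round(1).to_dict(),
        "duplicate_rows": int(df.duplicated().sum()),
        "iqr_outliers": outliers.to_dict(),
        "skewness": numeric.skew().round(2).to_dict(),
        "class_balance": df[target].value_counts(normalize=True).round(2).to_dict(),
    }

# Toy data with one extreme tenure value and an imbalanced target (hypothetical).
toy = pd.DataFrame({
    "tenure": [1, 2, 3, 4, 5, 6, 7, 8, 9, 100],
    "plan": ["a", "b", "a", "b", "a", "b", "a", "b", "a", "b"],
    "churn": ["No"] * 8 + ["Yes"] * 2,
})
toy = pd.concat([toy, toy.iloc[[0]]], ignore_index=True)  # inject one duplicate row

report = eda_checklist(toy, target="churn")
for check, result in report.items():
    print(f"{check}: {result}")
```

The report is a starting point for the questions in the checklist; deciding what to do about each finding still depends on the problem and the domain.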
Key Takeaways
Summary: What is Data Science + EDA Mindset
- Data Science combines statistics, programming, and domain knowledge to extract insights from data
- EDA is the critical first step: always explore before you model
- Understand missing data mechanisms (MCAR/MAR/MNAR) to handle them correctly
- Data Science problems fall into supervised, unsupervised, and semi-supervised categories
- Always follow a systematic workflow: define, collect, clean, explore, model, evaluate, communicate, deploy
- Document your EDA findings β they guide all subsequent analysis
Practice Exercise
Download the Titanic dataset and perform a complete EDA:
- What variables are available? What are their types?
- What is the survival rate overall and by subgroup?
- Which features correlate most strongly with survival?
- Are there data quality issues (missing values, outliers, inconsistencies)?
- Formulate 3 hypotheses about survival based on your EDA