Why Your Portfolio Matters
In data science hiring, your portfolio is your proof of work. Recruiters spend an average of 6 seconds scanning a resume, but they'll spend minutes on a compelling GitHub profile.
What Hiring Managers Look For

Resume (6 sec) -> Portfolio (minutes) -> Interview
- Resume: keywords, schools, companies. Answers "Can they do the job?"
- Portfolio: proof of skills. Answers "Show me the code."
- Interview: culture fit and depth. Answers "Will they thrive here?"
Portfolio Components
The 3-Project Minimum
Portfolio Project Triad
Every data science portfolio needs three types of projects: Foundation, Applied, and Advanced.
Project 1: Foundation (end-to-end ML pipeline)
- Data cleaning
- EDA
- Modeling
- Evaluation
- Shows: technical fundamentals

Project 2: Applied (domain-specific problem solving)
- Business framing
- Feature engineering
- Deployment
- Impact metrics
- Shows: business acumen

Project 3: Advanced (innovation + depth)
- Research paper replication, OR
- Novel approach to a known problem, OR
- Full-stack deployment
- Shows: ability to push boundaries
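One way to check a portfolio against the triad is a small coverage helper. The sketch below is illustrative only: the `Project` record, its `kind` field, and `missing_project_types` are hypothetical names, not part of any tool mentioned here.

```python
from dataclasses import dataclass

# The three project types named in the triad above.
REQUIRED_TYPES = ("foundation", "applied", "advanced")


@dataclass
class Project:
    """Hypothetical record for one portfolio project."""
    name: str
    kind: str  # expected to be one of REQUIRED_TYPES


def missing_project_types(projects: list[Project]) -> list[str]:
    """Return the triad types not yet covered, in triad order."""
    covered = {p.kind.lower() for p in projects}
    return [t for t in REQUIRED_TYPES if t not in covered]


portfolio = [
    Project("churn-pipeline", "foundation"),
    Project("retail-forecasting", "applied"),
]
print(missing_project_types(portfolio))  # ['advanced']
```

An empty result means all three capabilities are represented at least once.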
Project Selection Strategy
project_selection_criteria = {
    "High Value Projects": {
        "description": "Projects that demonstrate real-world impact",
        "examples": [
            "End-to-end churn prediction with deployment",
            "Recommendation system with A/B test design",
            "NLP sentiment analysis on real customer reviews",
            "Time series forecasting for business planning"
        ],
        "why": "Shows you can solve business problems"
    },
    "Avoid": {
        "description": "Overused or low-signal projects",
        "examples": [
            "Titanic survival prediction (everyone has this)",
            "Iris classification (too simple)",
            "MNIST digit recognition (commoditized)",
            "House prices regression (generic)"
        ],
        "why": "Doesn't differentiate you from other candidates"
    },
    "Differentiation": {
        "description": "Projects that stand out",
        "strategies": [
            "Use unique, non-standard datasets",
            "Deploy as a real product (not just a notebook)",
            "Include business impact analysis",
            "Write for a non-technical audience",
            "Contribute to open source"
        ]
    }
}
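As a rough illustration of applying the "Avoid" list, the sketch below filters a list of project ideas by keyword. The function name `is_low_signal` and the keyword tuple are assumptions made for this example, and the substring match is deliberately naive.

```python
# Keywords taken from the "Avoid" examples above; extend as needed.
OVERUSED_KEYWORDS = ("titanic", "iris", "mnist", "house prices")


def is_low_signal(project_title: str) -> bool:
    """Return True if the title mentions one of the overused datasets."""
    title = project_title.lower()
    return any(keyword in title for keyword in OVERUSED_KEYWORDS)


ideas = [
    "Titanic survival prediction",
    "End-to-end churn prediction with deployment",
    "MNIST digit recognition",
]
shortlist = [idea for idea in ideas if not is_low_signal(idea)]
print(shortlist)  # ['End-to-end churn prediction with deployment']
```

A keyword filter only catches the obvious cases; whether a project differentiates you still depends on the dataset, deployment, and write-up.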
GitHub Best Practices
Profile Structure
GitHub Profile Layout

Profile header:
- YOUR NAME
- Data Scientist | [Your Focus Area]
- [LinkedIn] [Portfolio] [Email]
- Featured: project-1 | project-2 | project-3

Pinned Repositories (6 max):
- project 1 (★ 45)
- project 2 (★ 32)
- project 3 (★ 28)
- utility tool
- blog posts
- contributions
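The layout above puts the three featured projects first and fills the remaining pinned slots with supporting repositories. A minimal sketch of that ordering, assuming repositories are plain dicts with `name` and `stars` keys (a hypothetical shape, not a GitHub API response); the star counts for the supporting repos are placeholders.

```python
MAX_PINNED = 6  # pinned-repository limit noted in the layout above


def choose_pinned(repos: list[dict], featured: list[str]) -> list[str]:
    """Pick repo names to pin: featured projects first, then by stars."""
    by_name = {r["name"]: r for r in repos}
    pinned = [name for name in featured if name in by_name]
    rest = sorted(
        (r for r in repos if r["name"] not in pinned),
        key=lambda r: r["stars"],
        reverse=True,
    )
    pinned.extend(r["name"] for r in rest)
    return pinned[:MAX_PINNED]


repos = [
    {"name": "scratch", "stars": 0},
    {"name": "project-1", "stars": 45},
    {"name": "project-2", "stars": 32},
    {"name": "project-3", "stars": 28},
    {"name": "utility-tool", "stars": 3},
    {"name": "blog-posts", "stars": 1},
    {"name": "contributions", "stars": 2},
]
print(choose_pinned(repos, ["project-1", "project-2", "project-3"]))
```

Sorting the remainder by stars is just one reasonable tiebreaker; relevance to the roles you are targeting may be a better one.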
README Template
# Project Name
> One-line description of what this project does and why it matters.
## The Problem
Business context in 2-3 sentences. What pain point does this solve?
## My Approach
Brief overview of methodology (2-3 bullets):
- Data source and size
- Key techniques used
- Why I chose this approach
## Results
| Metric | Baseline | My Model | Improvement |
|--------|----------|----------|-------------|
| Accuracy | 72% | 89% | +17 pts |
| F1 Score | 0.65 | 0.84 | +0.19 |
**Business Impact**: [Concrete dollar/time value if possible]
## Quick Start
```bash
git clone https://github.com/yourname/project.git
cd project
pip install -r requirements.txt
python main.py
```
## Project Structure
project/
├── data/        # Data files (.gitignored)
├── notebooks/   # Jupyter notebooks with EDA
├── src/         # Source code
├── tests/       # Unit tests
├── models/      # Saved models
├── reports/     # Generated analysis
└── README.md
## Key Findings
- Finding 1 with supporting visualization
- Finding 2 with business implication
- Finding 3 with recommendation
## Future Work
- Add real-time prediction API
- Deploy to cloud
- Add A/B test framework
## Lessons Learned
What I learned building this (shows growth mindset).
## Author
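When auditing an existing README against this template, a quick heading check can flag missing sections. The sketch below is an assumption-laden helper, not a standard tool: `missing_sections` and `REQUIRED_SECTIONS` are names invented for this example, and it only looks at level-2 (`## `) headings.

```python
import re

# Section headings used in the README template above.
REQUIRED_SECTIONS = ("The Problem", "My Approach", "Results", "Quick Start")


def missing_sections(readme_text: str) -> list[str]:
    """Return required sections that have no matching '## ' heading."""
    headings = {
        match.group(1).strip().lower()
        for match in re.finditer(r"^##\s+(.+)$", readme_text, flags=re.MULTILINE)
    }
    return [s for s in REQUIRED_SECTIONS if s.lower() not in headings]


draft = """# Churn Prediction
## The Problem
Customers cancel without warning.
## Results
See table.
"""
print(missing_sections(draft))  # ['My Approach', 'Quick Start']
```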
Code Quality Standards
```python
# GOOD: Clean, documented, tested code
"""
churn_prediction.py

Predicts customer churn using gradient boosting.
Returns probability scores and top risk factors.
"""
import logging
from dataclasses import dataclass
from typing import Optional

import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from xgboost import XGBClassifier

logger = logging.getLogger(__name__)


@dataclass
class ChurnPrediction:
    """Structured output for churn predictions."""
    customer_id: str
    churn_probability: float
    risk_level: str  # low, medium, high
    top_factors: list[str]
    recommended_action: str


class ChurnPredictor:
    """Predict customer churn with interpretable results.

    This model identifies at-risk customers and provides
    actionable insights for retention teams.

    Attributes:
        model: Trained XGBoost classifier
        feature_names: List of feature names used in training
        threshold: Probability threshold for medium-risk classification
            (probabilities of 0.7 or more are always high risk)
    """

    def __init__(self, threshold: float = 0.5):
        self.model = Pipeline([
            ('scaler', StandardScaler()),
            ('classifier', XGBClassifier(
                n_estimators=100,
                max_depth=5,
                learning_rate=0.1,
                random_state=42
            ))
        ])
        self.feature_names: Optional[list[str]] = None
        self.threshold = threshold
        self.is_fitted = False

    def fit(self, X: pd.DataFrame, y: pd.Series) -> "ChurnPredictor":
        """Train the churn prediction model.

        Args:
            X: Feature matrix with shape (n_samples, n_features)
            y: Binary target (1 = churned, 0 = retained)

        Returns:
            self: Fitted predictor instance
        """
        self.feature_names = list(X.columns)
        self.model.fit(X, y)
        self.is_fitted = True

        # Cross-validation score (cross_val_score clones the pipeline,
        # so the fitted model above is left untouched)
        cv_scores = cross_val_score(self.model, X, y, cv=5, scoring='roc_auc')
        logger.info(f"Model trained. CV AUC: {cv_scores.mean():.3f} "
                    f"(+/- {cv_scores.std():.3f})")
        return self

    def predict_proba(self, X: pd.DataFrame) -> np.ndarray:
        """Get churn probability scores.

        Args:
            X: Feature matrix

        Returns:
            Array of churn probabilities
        """
        if not self.is_fitted:
            raise RuntimeError("Model must be fitted before prediction")
        return self.model.predict_proba(X)[:, 1]

    def predict(self, X: pd.DataFrame) -> list[ChurnPrediction]:
        """Generate predictions with explanations.

        Args:
            X: Feature matrix

        Returns:
            List of structured predictions with risk factors
        """
        probabilities = self.predict_proba(X)
        predictions = []
        for idx, prob in enumerate(probabilities):
            risk_level = self._classify_risk(prob)
            factors = self._get_top_factors(X.iloc[idx])
            action = self._recommend_action(risk_level)
            predictions.append(ChurnPrediction(
                customer_id=str(X.index[idx]),
                # Cast the NumPy scalar to a plain Python float
                churn_probability=round(float(prob), 4),
                risk_level=risk_level,
                top_factors=factors,
                recommended_action=action
            ))
        return predictions

    def _classify_risk(self, probability: float) -> str:
        if probability >= 0.7:
            return "high"
        elif probability >= self.threshold:
            return "medium"
        return "low"

    def _get_top_factors(self, row: pd.Series, top_k: int = 3) -> list[str]:
        """Identify top risk factors for a customer."""
        # Simplified: in production use SHAP values
        factors = []
        if row.get('days_since_last_purchase', 0) > 30:
            factors.append("No purchase in 30+ days")
        if row.get('support_tickets', 0) > 2:
            factors.append("Multiple support tickets")
        if row.get('session_decline_pct', 0) > 0.3:
            factors.append("Significant session decline")
        return factors[:top_k]

    def _recommend_action(self, risk_level: str) -> str:
        actions = {
            "high": "Immediate outreach with personalized retention offer",
            "medium": "Engagement email with usage tips",
            "low": "Standard nurture sequence"
        }
        return actions[risk_level]
```
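The header comment above promises tested code, but the tiering rule is shown without a test. A minimal pytest-style sketch follows. `classify_risk` is a standalone copy of the `_classify_risk` rule, written here only so the example runs without XGBoost installed, and the test names are illustrative.

```python
def classify_risk(probability: float, threshold: float = 0.5) -> str:
    """Standalone copy of the tiering rule in ChurnPredictor._classify_risk."""
    if probability >= 0.7:
        return "high"
    if probability >= threshold:
        return "medium"
    return "low"


def test_boundaries_are_inclusive():
    assert classify_risk(0.7) == "high"
    assert classify_risk(0.5) == "medium"


def test_below_threshold_is_low():
    assert classify_risk(0.49) == "low"


def test_threshold_above_high_cutoff_leaves_no_medium_tier():
    # With threshold > 0.7 the "medium" branch can never be reached.
    assert classify_risk(0.75, threshold=0.8) == "high"
    assert classify_risk(0.69, threshold=0.8) == "low"
```

The last test documents an edge case worth knowing about: the high-risk cutoff is hard-coded, so a threshold above 0.7 silently removes the medium tier.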
Project Documentation
Jupyter Notebook Best Practices
# GOOD notebook structure
"""
Notebook: Customer Churn Analysis
Author: Your Name
Date: 2024-01-15
Description: End-to-end analysis of customer churn patterns
             with actionable recommendations for retention team.

Table of Contents:
1. Executive Summary
2. Data Loading & Overview
3. Exploratory Data Analysis
4. Feature Engineering
5. Model Building
6. Results & Business Impact
7. Recommendations
"""

# Cell 1: Setup
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Settings
pd.set_option('display.max_columns', 20)
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

# Cell 2: Data Loading
"""
## 2. Data Loading & Overview
We're working with 50,000 customer records from the last 12 months.
The dataset includes behavioral, transactional, and demographic features.
"""
df = pd.read_csv('data/customers.csv')
print(f"Dataset: {df.shape[0]:,} rows × {df.shape[1]} columns")
# A Series does not accept a format spec inside an f-string,
# so format each share individually before printing.
print("\nTarget distribution:")
print(df['churned'].value_counts(normalize=True).map("{:.2%}".format))

# Cell 3: Executive Summary (at the top, after imports)
"""
## 1. Executive Summary
**Key Finding**: Customers who don't make a purchase within 14 days of
signing up have a 3.2x higher churn rate.
**Business Impact**: Implementing a 14-day activation campaign could
reduce churn by ~18%, saving an estimated $1.2M annually.
**Recommendation**: Launch targeted onboarding email sequence for
users inactive beyond day 7.
"""
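A headline figure such as "3.2x higher churn rate" is a ratio of two group churn rates. The sketch below shows that arithmetic on made-up flags; the function names and the data are illustrative and are not drawn from the notebook's dataset.

```python
def churn_rate(churn_flags: list[int]) -> float:
    """Share of customers with churn flag 1."""
    return sum(churn_flags) / len(churn_flags)


def churn_rate_ratio(group_a: list[int], group_b: list[int]) -> float:
    """How many times higher group A's churn rate is than group B's."""
    return churn_rate(group_a) / churn_rate(group_b)


# Made-up flags for illustration only (1 = churned, 0 = retained).
no_early_purchase = [1, 1, 0, 1, 0, 1, 0, 0, 1, 1]  # 6 of 10 churned
early_purchase = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0]     # 2 of 10 churned

ratio = churn_rate_ratio(no_early_purchase, early_purchase)
print(f"{ratio:.1f}x higher churn rate")  # 3.0x higher churn rate
```

Stating both underlying rates alongside the ratio makes the summary easier to verify. The helper divides by zero if the comparison group has no churn, so guard for that on small samples.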
Data Validation Tests
# tests/test_data.py
import pytest
import pandas as pd
import numpy as np


class TestDataQuality:
    """Data quality tests for the churn dataset."""

    @pytest.fixture
    def sample_data(self):
        return pd.read_csv('data/customers.csv')

    def test_row_count(self, sample_data):
        """Dataset should have at least 10K rows."""
        assert len(sample_data) >= 10_000, \
            f"Expected 10K+ rows, got {len(sample_data):,}"

    def test_no_all_null_columns(self, sample_data):
        """No column should be 100% null."""
        null_pct = sample_data.isnull().mean()
        assert null_pct.max() < 1.0, \
            f"Column {null_pct.idxmax()} is 100% null"

    def test_churn_rate_reasonable(self, sample_data):
        """Churn rate should be between 5% and 50%."""
        churn_rate = sample_data['churned'].mean()
        assert 0.05 <= churn_rate <= 0.50, \
            f"Churn rate {churn_rate:.1%} outside expected range"

    def test_no_future_dates(self, sample_data):
        """No signup dates should be in the future."""
        if 'signup_date' in sample_data.columns:
            dates = pd.to_datetime(sample_data['signup_date'])
            assert dates.max() <= pd.Timestamp.now(), \
                "Future dates found in signup_date"

    def test_feature_correlations(self, sample_data):
        """No two features should have >0.95 correlation (multicollinearity)."""
        numeric_cols = sample_data.select_dtypes(include=[np.number]).columns
        # Work on a writable copy: DataFrame.values can be read-only
        corr = sample_data[numeric_cols].corr().abs().to_numpy(copy=True)
        np.fill_diagonal(corr, 0)
        max_corr = np.nanmax(corr)  # ignore NaNs from constant columns
        assert max_corr < 0.95, \
            f"High multicollinearity detected: {max_corr:.3f}"
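Because the project layout marks `data/` as gitignored, anyone who clones the repository will not have `data/customers.csv`, and tests that read it will fail. One possible workaround, sketched with the standard library only, is to fall back to a tiny synthetic sample. `SAMPLE_ROWS` and `load_rows` are hypothetical names and the sample values are made up.

```python
import csv
from pathlib import Path

# Tiny synthetic stand-in; values are made up for illustration.
SAMPLE_ROWS = [
    {"customer_id": "c1", "churned": "0"},
    {"customer_id": "c2", "churned": "1"},
    {"customer_id": "c3", "churned": "0"},
    {"customer_id": "c4", "churned": "0"},
]


def load_rows(path: str = "data/customers.csv") -> list[dict]:
    """Read the CSV if it exists, else fall back to SAMPLE_ROWS."""
    csv_path = Path(path)
    if not csv_path.exists():
        return SAMPLE_ROWS
    with csv_path.open(newline="") as handle:
        return list(csv.DictReader(handle))


def churn_rate(rows: list[dict]) -> float:
    """Share of rows whose 'churned' field is 1."""
    return sum(int(row["churned"]) for row in rows) / len(rows)


rows = load_rows("data/does_not_exist.csv")
print(f"{churn_rate(rows):.0%}")  # 25%
```

Size-dependent checks such as the 10K row count would need to be skipped when the fallback sample is in use.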
Personal Branding
Building Your Brand
brand_strategy = {
    "LinkedIn": {
        "headline": "Data Scientist | [Specialty] | [Impact Metric]",
        "examples": [
            "Data Scientist | NLP & Recommendation Systems | Built models serving 10M+ users",
            "ML Engineer | Computer Vision | Reducing defect detection time by 60%"
        ],
        "content_strategy": [
            "Post 1x/week about projects or learnings",
            "Comment on industry discussions",
            "Share insights from your work (anonymized)",
            "Write articles about technical topics"
        ]
    },
    "GitHub": {
        "profile_readme": "Tell your story: what you do, what you're learning, how to contact you",
        "contribution_consistency": "Green squares matter: commit regularly",
        "repository_quality": "README > code quality > quantity of repos"
    },
    "Blog": {
        "platforms": ["Medium", "Dev.to", "Personal site", "Substack"],
        "topics": [
            "Project walkthroughs with business context",
            "Technical deep dives on techniques",
            "Lessons learned from competitions",
            "Industry trend analysis"
        ],
        "cadence": "1-2 posts per month minimum"
    }
}
Resume Integration
resume_portfolio_alignment = {
    "Technical Skills": {
        "from_portfolio": "Show, don't just list",
        "example": "Instead of 'Python, scikit-learn', write 'Built XGBoost pipeline (AUC: 0.89) predicting customer churn'"
    },
    "Projects": {
        "from_portfolio": "3 bullet points per project",
        "structure": [
            "What: One sentence describing the project",
            "How: Key technical approach",
            "Impact: Quantified business result"
        ]
    },
    "Impact": {
        "from_portfolio": "Use numbers from your analysis",
        "examples": [
            "Saved $200K annually through demand forecasting",
            "Improved recommendation CTR by 15%",
            "Reduced data processing time from 4 hours to 12 minutes"
        ]
    }
}
Job Search Strategies
Application Tracking
import pandas as pd


class JobApplicationTracker:
    """Keep a simple in-memory log of job applications."""

    def __init__(self):
        self.applications = []

    def add_application(self, company: str, role: str, status: str,
                        notes: str = "", referral: bool = False):
        self.applications.append({
            "company": company,
            "role": role,
            "status": status,
            "notes": notes,
            "referral": referral,
            "applied_date": pd.Timestamp.now()
        })

    def get_pipeline_summary(self) -> dict:
        """Get counts by status."""
        df = pd.DataFrame(self.applications)
        if df.empty:
            return {}
        return df['status'].value_counts().to_dict()

    def get_stats(self) -> dict:
        """Get total applications, referral share, and counts by status."""
        df = pd.DataFrame(self.applications)
        if df.empty:
            return {"total": 0}
        return {
            "total": len(df),
            "referral_rate": df['referral'].mean(),
            "by_status": df['status'].value_counts().to_dict()
        }


# Usage
tracker = JobApplicationTracker()
tracker.add_application("Google", "Data Scientist", "applied",
                        notes="Found via referral from LinkedIn", referral=True)
tracker.add_application("Meta", "ML Engineer", "phone_screen",
                        notes="LeetCode medium problems")
print(tracker.get_stats())
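The tracker reports what share of applications came through referrals, but not whether referrals led to more responses. A minimal sketch of that comparison, using plain dicts shaped like the tracker's `applications` entries; the `onsite` and `offer` status labels and the function name are assumptions for illustration.

```python
from collections import defaultdict

# Statuses counted as "got a response"; adjust to your own pipeline labels.
RESPONSE_STATUSES = {"phone_screen", "onsite", "offer"}


def response_rate_by_referral(applications: list[dict]) -> dict[bool, float]:
    """Share of applications that reached a response stage, split by referral."""
    totals: dict[bool, int] = defaultdict(int)
    responses: dict[bool, int] = defaultdict(int)
    for app in applications:
        key = bool(app["referral"])
        totals[key] += 1
        if app["status"] in RESPONSE_STATUSES:
            responses[key] += 1
    return {key: responses[key] / totals[key] for key in totals}


# Made-up applications for illustration only.
apps = [
    {"company": "A", "status": "phone_screen", "referral": True},
    {"company": "B", "status": "applied", "referral": True},
    {"company": "C", "status": "applied", "referral": False},
    {"company": "D", "status": "applied", "referral": False},
    {"company": "E", "status": "onsite", "referral": False},
    {"company": "F", "status": "applied", "referral": False},
]
print(response_rate_by_referral(apps))  # {True: 0.5, False: 0.25}
```

With only a handful of applications the split is noisy, so treat it as a prompt for where to spend effort rather than as evidence.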
Networking Strategy
networking_plan = {
    "Weekly Goals": {
        "linkedin_connections": 5,
        "informational_interviews": 2,
        "community_participation": "1 forum, Discord, or Slack community",
        "content_creation": "1 post or article"
    },
    "Event Types": [
        "Meetups (data science, ML, Python)",
        "Conferences (PyData, NeurIPS, local events)",
        "Hackathons (build portfolio + network)",
        "Online communities (Kaggle, GitHub Discussions)"
    ],
    "Follow_Up_Template": """
Hi [Name],
Great meeting you at [event]. I enjoyed our conversation about [topic].
I'd love to stay connected and learn more about your work at [company].
Best,
[Your name]
"""
}
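The bracketed placeholders in the follow-up template can be filled with plain string replacement. This is a minimal sketch; `fill_template` is a hypothetical helper, and the event and topic values are examples only.

```python
def fill_template(template: str, **fields: str) -> str:
    """Replace each [key] placeholder with its value; leave unknown ones as-is."""
    for key, value in fields.items():
        template = template.replace(f"[{key}]", value)
    return template


template = "Great meeting you at [event]. I enjoyed our conversation about [topic]."
message = fill_template(template, event="PyData", topic="feature stores")
print(message)
# Great meeting you at PyData. I enjoyed our conversation about feature stores.
```

Placeholders containing spaces, such as `[Your name]`, can be passed by unpacking a dict: `fill_template(text, **{"Your name": "Sam"})`.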
Key Takeaways
Summary: Portfolio & GitHub
- Three projects minimum: foundation, applied, and advanced -- each demonstrates a different capability
- Your README is as important as your code -- make it compelling with a clear problem statement, results, and business impact
- Code quality signals professionalism: tests, docs, clean structure -- hiring managers notice these details
- GitHub profile tells a story -- pin your best work and maintain consistent contribution history
- Build in public: consistent commits and content creation matter -- they demonstrate passion and growth
- Tailor your portfolio to the roles you're targeting -- a recommendation system project matters more for an ML role
- Networking + portfolio > cold applications alone -- referrals dramatically increase interview callback rates
Practice Exercises
- Create a GitHub profile README introducing yourself and your work
- Audit an existing project: Add a proper README, tests, and documentation
- Build a new project from the "High Value" list with full documentation
- Set up a contribution streak: Commit something every day for 30 days
- Write a project blog post explaining one of your projects to a non-technical audience
- Create a portfolio website (GitHub Pages or similar) with project summaries
- Reach out to 5 people in your network for informational interviews