Building a Portfolio + GitHub

Why Your Portfolio Matters

In data science hiring, your portfolio is your proof of work. Recruiters are often said to spend only about 6 seconds scanning a resume, but they will spend minutes on a compelling GitHub profile.

┌──────────────────────────────────────────────────────────────────┐
│                 What Hiring Managers Look For                     │
│                                                                   │
│  Resume (6 sec)          Portfolio (minutes)      Interview      │
│  ┌──────────┐            ┌──────────────┐        ┌──────────┐   │
│  │ Keywords  │───────────>│ Proof of     │───────>│ Culture  │   │
│  │ Schools   │            │ Skills       │        │ Fit +    │   │
│  │ Companies │            │              │        │ Depth    │   │
│  └──────────┘            └──────────────┘        └──────────┘   │
│                                                                   │
│  "Can they do the job?"  "Show me the code"    "Will they      │
│                                              thrive here?"      │
└──────────────────────────────────────────────────────────────────┘

Portfolio Components

The 3-Project Minimum

Portfolio Project Triad

Every data science portfolio needs three types of projects: Foundation, Applied, and Advanced.

┌──────────────────────────────────────────────────────────────┐
│                  Portfolio Project Triad                       │
│                                                               │
│  Project 1: Foundation          Project 2: Applied            │
│  ┌──────────────────┐          ┌──────────────────┐          │
│  │ End-to-end ML     │          │ Domain-specific   │          │
│  │ pipeline          │          │ problem solving   │          │
│  │                   │          │                   │          │
│  │ - Data cleaning   │          │ - Business framing│          │
│  │ - EDA             │          │ - Feature eng.    │          │
│  │ - Modeling        │          │ - Deployment      │          │
│  │ - Evaluation      │          │ - Impact metrics  │          │
│  │                   │          │                   │          │
│  │ Shows: Technical  │          │ Shows: Business   │          │
│  │ fundamentals      │          │ acumen            │          │
│  └──────────────────┘          └──────────────────┘          │
│                                                               │
│  Project 3: Advanced                                            │
│  ┌──────────────────────────────────────────────────┐        │
│  │ Innovation + Depth                                 │        │
│  │                                                    │        │
│  │ - Research paper replication OR                    │        │
│  │ - Novel approach to known problem OR               │        │
│  │ - Full-stack deployment                            │        │
│  │                                                    │        │
│  │ Shows: Ability to push boundaries                  │        │
│  └──────────────────────────────────────────────────┘        │
└──────────────────────────────────────────────────────────────┘

Project Selection Strategy

project_selection_criteria = {
    "High Value Projects": {
        "description": "Projects that demonstrate real-world impact",
        "examples": [
            "End-to-end churn prediction with deployment",
            "Recommendation system with A/B test design",
            "NLP sentiment analysis on real customer reviews",
            "Time series forecasting for business planning"
        ],
        "why": "Shows you can solve business problems"
    },
    "Avoid": {
        "description": "Overused or low-signal projects",
        "examples": [
            "Titanic survival prediction (everyone has this)",
            "Iris classification (too simple)",
            "MNIST digit recognition (commoditized)",
            "House prices regression (generic)"
        ],
        "why": "Doesn't differentiate you from other candidates"
    },
    "Differentiation": {
        "description": "Projects that stand out",
        "strategies": [
            "Use unique, non-standard datasets",
            "Deploy as a real product (not just a notebook)",
            "Include business impact analysis",
            "Write for a non-technical audience",
            "Contribute to open source"
        ]
    }
}

GitHub Best Practices

Profile Structure

┌──────────────────────────────────────────────────────────────┐
│                   GitHub Profile Layout                        │
│                                                               │
│  ┌──────────────────────────────────────────────────────┐   │
│  │  YOUR NAME                                           │   │
│  │  Data Scientist | [Your Focus Area]                  │   │
│  │  [LinkedIn] [Portfolio] [Email]                      │   │
│  │                                                      │   │
│  │  🏆 Featured: project-1 | project-2 | project-3     │   │
│  └──────────────────────────────────────────────────────┘   │
│                                                               │
│  Pinned Repositories (6 max):                                │
│  ┌─────────┐ ┌─────────┐ ┌─────────┐                      │
│  │ project │ │ project │ │ project │                      │
│  │   1     │ │   2     │ │   3     │                      │
│  │ ⭐ 45   │ │ ⭐ 32   │ │ ⭐ 28   │                      │
│  └─────────┘ └─────────┘ └─────────┘                      │
│  ┌─────────┐ ┌─────────┐ ┌─────────┐                      │
│  │ utility │ │ blog    │ │ contrib │                      │
│  │  tool   │ │  posts  │ │ utions  │                      │
│  └─────────┘ └─────────┘ └─────────┘                      │
└──────────────────────────────────────────────────────────────┘

README Template

# Project Name

> One-line description of what this project does and why it matters.

[![Python](https://img.shields.io/badge/Python-3.9-blue)]()
[![License](https://img.shields.io/badge/License-MIT-green)]()

## The Problem

Business context in 2-3 sentences. What pain point does this solve?

## My Approach

Brief overview of methodology (2-3 bullets):
- Data source and size
- Key techniques used
- Why I chose this approach

## Results

| Metric | Baseline | My Model | Improvement |
|--------|----------|----------|-------------|
| Accuracy | 72% | 89% | +17 pp |
| F1 Score | 0.65 | 0.84 | +0.19 |

**Business Impact**: [Concrete dollar/time value if possible]

## Quick Start

```bash
git clone https://github.com/yourname/project.git
cd project
pip install -r requirements.txt
python main.py
```

## Project Structure

```
project/
├── data/           # Data files (.gitignored)
├── notebooks/      # Jupyter notebooks with EDA
├── src/            # Source code
├── tests/          # Unit tests
├── models/         # Saved models
├── reports/        # Generated analysis
└── README.md
```

## Key Findings

1. Finding 1 with supporting visualization
2. Finding 2 with business implication
3. Finding 3 with recommendation

## Future Work

- Add real-time prediction API
- Deploy to cloud
- Add A/B test framework

## Lessons Learned

What I learned building this (shows growth mindset).

## Author

Your Name - LinkedIn - Email
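
The project layout shown in the template can be created in one step rather than by hand. The following is a minimal sketch, not a standard tool: the helper name `scaffold_project`, the `.gitkeep` placeholder files, and the exact `.gitignore` rules are assumptions made for illustration.

```python
from pathlib import Path

# Directory names mirror the project tree in the README template.
PROJECT_DIRS = ["data", "notebooks", "src", "tests", "models", "reports"]


def scaffold_project(root: str) -> Path:
    """Create the template's folder layout under `root` (hypothetical helper)."""
    base = Path(root)
    for name in PROJECT_DIRS:
        folder = base / name
        folder.mkdir(parents=True, exist_ok=True)
        # Git does not track empty directories, so add a placeholder file.
        (folder / ".gitkeep").touch()

    # Keep data files out of version control, but keep the placeholder.
    (base / ".gitignore").write_text("data/*\n!data/.gitkeep\n")

    readme = base / "README.md"
    if not readme.exists():
        readme.write_text("# Project Name\n")
    return base
```

Calling `scaffold_project("project")` creates the folders in the current directory. Existing folders and an existing README are left untouched, though the `.gitignore` file is overwritten.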

Code Quality Standards

```python
# GOOD: Clean, documented, tested code
"""
churn_prediction.py

Predicts customer churn using gradient boosting.
Returns probability scores and top risk factors.
"""

import logging
from dataclasses import dataclass
from typing import Optional

import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from xgboost import XGBClassifier

logger = logging.getLogger(__name__)


@dataclass
class ChurnPrediction:
    """Structured output for churn predictions."""
    customer_id: str
    churn_probability: float
    risk_level: str  # low, medium, high
    top_factors: list[str]
    recommended_action: str


class ChurnPredictor:
    """Predict customer churn with interpretable results.

    This model identifies at-risk customers and provides
    actionable insights for retention teams.

    Attributes:
        model: Trained XGBoost classifier
        feature_names: List of feature names used in training
        threshold: Probability cutoff for medium risk; high risk is fixed at 0.7
    """

    def __init__(self, threshold: float = 0.5):
        self.model = Pipeline([
            ('scaler', StandardScaler()),
            ('classifier', XGBClassifier(
                n_estimators=100,
                max_depth=5,
                learning_rate=0.1,
                random_state=42
            ))
        ])
        self.feature_names: Optional[list[str]] = None
        self.threshold = threshold
        self.is_fitted = False

    def fit(self, X: pd.DataFrame, y: pd.Series) -> "ChurnPredictor":
        """Train the churn prediction model.

        Args:
            X: Feature matrix with shape (n_samples, n_features)
            y: Binary target (1 = churned, 0 = retained)

        Returns:
            self: Fitted predictor instance
        """
        self.feature_names = list(X.columns)
        self.model.fit(X, y)
        self.is_fitted = True

        # Cross-validation score
        cv_scores = cross_val_score(self.model, X, y, cv=5, scoring='roc_auc')
        logger.info(f"Model trained. CV AUC: {cv_scores.mean():.3f} "
                    f"(+/- {cv_scores.std():.3f})")

        return self

    def predict_proba(self, X: pd.DataFrame) -> np.ndarray:
        """Get churn probability scores.

        Args:
            X: Feature matrix

        Returns:
            Array of churn probabilities
        """
        if not self.is_fitted:
            raise RuntimeError("Model must be fitted before prediction")
        return self.model.predict_proba(X)[:, 1]

    def predict(self, X: pd.DataFrame) -> list[ChurnPrediction]:
        """Generate predictions with explanations.

        Args:
            X: Feature matrix

        Returns:
            List of structured predictions with risk factors
        """
        probabilities = self.predict_proba(X)

        predictions = []
        for idx, prob in enumerate(probabilities):
            risk_level = self._classify_risk(prob)
            factors = self._get_top_factors(X.iloc[idx])
            action = self._recommend_action(risk_level)

            predictions.append(ChurnPrediction(
                customer_id=str(X.index[idx]),
                churn_probability=round(float(prob), 4),  # plain float, not a NumPy scalar
                risk_level=risk_level,
                top_factors=factors,
                recommended_action=action
            ))

        return predictions

    def _classify_risk(self, probability: float) -> str:
        if probability >= 0.7:
            return "high"
        elif probability >= self.threshold:
            return "medium"
        return "low"

    def _get_top_factors(self, row: pd.Series, top_k: int = 3) -> list[str]:
        """Identify top risk factors for a customer."""
        # Simplified: in production use SHAP values
        factors = []
        if row.get('days_since_last_purchase', 0) > 30:
            factors.append("No purchase in 30+ days")
        if row.get('support_tickets', 0) > 2:
            factors.append("Multiple support tickets")
        if row.get('session_decline_pct', 0) > 0.3:
            factors.append("Significant session decline")
        return factors[:top_k]

    def _recommend_action(self, risk_level: str) -> str:
        actions = {
            "high": "Immediate outreach with personalized retention offer",
            "medium": "Engagement email with usage tips",
            "low": "Standard nurture sequence"
        }
        return actions[risk_level]
```
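
The example above is labeled as tested code, but no test is shown. As a minimal sketch, the risk-bucketing rule from `_classify_risk` can be copied into a standalone function and checked at its boundaries. The names `classify_risk` and `HIGH_RISK_CUTOFF` are assumptions made for this illustration; the 0.7 and 0.5 values come from the class above.

```python
HIGH_RISK_CUTOFF = 0.7  # same fixed cutoff that _classify_risk uses


def classify_risk(probability: float, threshold: float = 0.5) -> str:
    """Standalone copy of the bucketing rule, for illustration only."""
    if probability >= HIGH_RISK_CUTOFF:
        return "high"
    if probability >= threshold:
        return "medium"
    return "low"


def test_boundaries():
    # Boundary values are where comparison mistakes tend to hide.
    assert classify_risk(0.70) == "high"
    assert classify_risk(0.69) == "medium"
    assert classify_risk(0.50) == "medium"
    assert classify_risk(0.49) == "low"


def test_threshold_above_cutoff_removes_medium_bucket():
    # With a threshold above 0.7, nothing can land in "medium".
    assert classify_risk(0.75, threshold=0.8) == "high"
    assert classify_risk(0.65, threshold=0.8) == "low"
```

The second test documents a quirk of the rule rather than a requirement: a threshold above the fixed cutoff silently removes the medium bucket, which may be worth guarding against in a real project.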

Project Documentation

Jupyter Notebook Best Practices

# GOOD notebook structure
"""
Notebook: Customer Churn Analysis
Author: Your Name
Date: 2024-01-15
Description: End-to-end analysis of customer churn patterns
             with actionable recommendations for retention team.

Table of Contents:
1. Executive Summary
2. Data Loading & Overview
3. Exploratory Data Analysis
4. Feature Engineering
5. Model Building
6. Results & Business Impact
7. Recommendations
"""

# Cell 1: Setup
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Settings
pd.set_option('display.max_columns', 20)
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

# Cell 2: Executive Summary (placed at the top, right after setup)
"""
## 1. Executive Summary

**Key Finding**: Customers who don't make a purchase within 14 days of
signing up have a 3.2x higher churn rate.

**Business Impact**: Implementing a 14-day activation campaign could
reduce churn by ~18%, saving an estimated $1.2M annually.

**Recommendation**: Launch targeted onboarding email sequence for
users inactive beyond day 7.
"""

# Cell 3: Data Loading
"""
## 2. Data Loading & Overview

We're working with 50,000 customer records from the last 12 months.
The dataset includes behavioral, transactional, and demographic features.
"""
df = pd.read_csv('data/customers.csv')
print(f"Dataset: {df.shape[0]:,} rows × {df.shape[1]} columns")
# A Series cannot be formatted with ':.2%' directly, so format each value.
target_share = df['churned'].value_counts(normalize=True)
print("\nTarget distribution:")
print(target_share.map("{:.2%}".format))

Data Validation Tests

# tests/test_data.py
import pytest
import pandas as pd
import numpy as np

class TestDataQuality:
    """Data quality tests for the churn dataset."""

    @pytest.fixture
    def sample_data(self):
        return pd.read_csv('data/customers.csv')

    def test_row_count(self, sample_data):
        """Dataset should have at least 10K rows."""
        assert len(sample_data) >= 10_000, \
            f"Expected 10K+ rows, got {len(sample_data):,}"

    def test_no_all_null_columns(self, sample_data):
        """No column should be 100% null."""
        null_pct = sample_data.isnull().mean()
        assert null_pct.max() < 1.0, \
            f"Column {null_pct.idxmax()} is 100% null"

    def test_churn_rate_reasonable(self, sample_data):
        """Churn rate should be between 5% and 50%."""
        churn_rate = sample_data['churned'].mean()
        assert 0.05 <= churn_rate <= 0.50, \
            f"Churn rate {churn_rate:.1%} outside expected range"

    def test_no_future_dates(self, sample_data):
        """No signup dates should be in the future."""
        if 'signup_date' in sample_data.columns:
            dates = pd.to_datetime(sample_data['signup_date'])
            assert dates.max() <= pd.Timestamp.now(), \
                "Future dates found in signup_date"

    def test_feature_correlations(self, sample_data):
        """No two features should have >0.95 correlation (multicollinearity)."""
        numeric_cols = sample_data.select_dtypes(include=[np.number]).columns
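        # Copy into a plain array: DataFrame.values may be a read-only view.
        corr = sample_data[numeric_cols].corr().abs().to_numpy(copy=True)
        np.fill_diagonal(corr, 0)  # ignore each column's correlation with itself
        max_corr = np.nanmax(corr)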
        assert max_corr < 0.95, \
            f"High multicollinearity detected: {max_corr:.3f}"
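
Because the project layout marks `data/` as git-ignored, tests that read `data/customers.csv` cannot run on a fresh clone. One option is to run the same kind of checks against a small synthetic frame. The sketch below assumes pandas and NumPy are installed; the helper names `make_synthetic_customers` and `find_quality_problems`, and the column set, are hypothetical.

```python
import numpy as np
import pandas as pd


def make_synthetic_customers(n_rows: int = 200, seed: int = 0) -> pd.DataFrame:
    """Build a small fake customer table (columns are illustrative)."""
    rng = np.random.default_rng(seed)
    day_offsets = pd.to_timedelta(rng.integers(0, 365, size=n_rows), unit="D")
    return pd.DataFrame({
        "days_since_last_purchase": rng.integers(0, 90, size=n_rows),
        "support_tickets": rng.integers(0, 6, size=n_rows),
        "signup_date": pd.Timestamp("2023-01-01") + day_offsets,
        # Every fourth row churns, so the churn rate is fixed at 25%.
        "churned": (np.arange(n_rows) % 4 == 0).astype(int),
    })


def find_quality_problems(df: pd.DataFrame) -> list[str]:
    """Return readable problems instead of stopping at the first failure."""
    problems = []
    if df.isnull().all().any():
        problems.append("at least one column is entirely null")
    churn_rate = df["churned"].mean()
    if not 0.05 <= churn_rate <= 0.50:
        problems.append(f"churn rate {churn_rate:.1%} outside 5%-50%")
    if pd.to_datetime(df["signup_date"]).max() > pd.Timestamp.now():
        problems.append("signup_date contains future dates")
    return problems


def test_synthetic_data_is_clean():
    assert find_quality_problems(make_synthetic_customers()) == []


def test_detects_all_null_column():
    df = make_synthetic_customers()
    df["support_tickets"] = np.nan
    assert "at least one column is entirely null" in find_quality_problems(df)
```

Keeping the checks in a plain function means the same code can validate the real CSV locally and the synthetic frame in continuous integration.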

Personal Branding

Building Your Brand

brand_strategy = {
    "LinkedIn": {
        "headline": "Data Scientist | [Specialty] | [Impact Metric]",
        "examples": [
            "Data Scientist | NLP & Recommendation Systems | Built models serving 10M+ users",
            "ML Engineer | Computer Vision | Reducing defect detection time by 60%"
        ],
        "content_strategy": [
            "Post 1x/week about projects or learnings",
            "Comment on industry discussions",
            "Share insights from your work (anonymized)",
            "Write articles about technical topics"
        ]
    },
    "GitHub": {
        "profile_readme": "Tell your story: what you do, what you're learning, how to contact you",
        "contribution_consistency": "Green squares matter — commit regularly",
        "repository_quality": "README > code quality > quantity of repos"
    },
    "Blog": {
        "platforms": ["Medium", "Dev.to", "Personal site", "Substack"],
        "topics": [
            "Project walkthroughs with business context",
            "Technical deep dives on techniques",
            "Lessons learned from competitions",
            "Industry trend analysis"
        ],
        "cadence": "1-2 posts per month minimum"
    }
}

Resume Integration

resume_portfolio_alignment = {
    "Technical Skills": {
        "from_portfolio": "Show, don't just list",
        "example": "Instead of 'Python, scikit-learn' → 'Built XGBoost pipeline (AUC: 0.89) predicting customer churn'"
    },
    "Projects": {
        "from_portfolio": "3 bullet points per project",
        "structure": [
            "What: One sentence describing the project",
            "How: Key technical approach",
            "Impact: Quantified business result"
        ]
    },
    "Impact": {
        "from_portfolio": "Use numbers from your analysis",
        "examples": [
            "Saved $200K annually through demand forecasting",
            "Improved recommendation CTR by 15%",
            "Reduced data processing time from 4 hours to 12 minutes"
        ]
    }
}
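
Impact numbers are easy to misstate, for instance by mixing percentage points with relative change. As a small worked example, the snippet below derives the figures behind the "4 hours to 12 minutes" bullet and contrasts the two ways of reporting an accuracy gain. The helper names are hypothetical, and the 72% and 89% accuracies are illustrative values.

```python
def percent_reduction(before: float, after: float) -> float:
    """Relative reduction, as a percentage of the starting value."""
    return (before - after) / before * 100


def speedup(before: float, after: float) -> float:
    """How many times faster the new process is."""
    return before / after


def point_gain(baseline_pct: float, model_pct: float) -> float:
    """Absolute gain in percentage points (pp)."""
    return model_pct - baseline_pct


def relative_gain(baseline: float, model: float) -> float:
    """Relative gain, as a percentage of the baseline."""
    return (model - baseline) / baseline * 100


# "Reduced data processing time from 4 hours to 12 minutes"
before_min, after_min = 4 * 60, 12
print(f"{percent_reduction(before_min, after_min):.0f}% less time")  # 95% less time
print(f"{speedup(before_min, after_min):.0f}x faster")                # 20x faster

# Accuracy going from 72% to 89%
print(f"+{point_gain(72, 89):.0f} pp")            # +17 pp
print(f"+{relative_gain(72, 89):.1f}% relative")  # +23.6% relative
```

On a resume, state which of the two you mean: "+17 pp" and "+23.6%" describe the same result, and an unlabeled "+17%" can be read either way.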

Job Search Strategies

Application Tracking

import pandas as pd


class JobApplicationTracker:
    def __init__(self):
        self.applications = []

    def add_application(self, company: str, role: str, status: str,
                       notes: str = "", referral: bool = False):
        self.applications.append({
            "company": company,
            "role": role,
            "status": status,
            "notes": notes,
            "referral": referral,
            "applied_date": pd.Timestamp.now()
        })

    def get_pipeline_summary(self) -> dict:
        """Get counts by status."""
        df = pd.DataFrame(self.applications)
        if df.empty:
            return {}
        return df['status'].value_counts().to_dict()

    def get_stats(self) -> dict:
        df = pd.DataFrame(self.applications)
        if df.empty:
            return {"total": 0}

        return {
            "total": len(df),
            "referral_rate": df['referral'].mean(),
            "by_status": df['status'].value_counts().to_dict()
        }

# Usage
tracker = JobApplicationTracker()
tracker.add_application("Google", "Data Scientist", "applied",
                       notes="Found via referral from LinkedIn", referral=True)
tracker.add_application("Meta", "ML Engineer", "phone_screen",
                       notes="LeetCode medium problems")
print(tracker.get_stats())
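
The tracker above counts applications by status, but it does not show whether referrals convert better than cold applications. A rough sketch using only the standard library follows. The function name, the `RESPONDED` status set, and the sample records are assumptions for illustration, not real data.

```python
from collections import defaultdict

# Statuses that count as "the company responded" (assumed labels).
RESPONDED = {"phone_screen", "onsite", "offer"}


def response_rate_by_referral(applications: list[dict]) -> dict[bool, float]:
    """Share of applications that got a response, split by referral flag."""
    totals = defaultdict(int)
    responses = defaultdict(int)
    for app in applications:
        key = bool(app.get("referral", False))
        totals[key] += 1
        if app["status"] in RESPONDED:
            responses[key] += 1
    return {key: responses[key] / totals[key] for key in totals}


# Made-up records in the same shape the tracker stores.
sample = [
    {"company": "A", "status": "phone_screen", "referral": True},
    {"company": "B", "status": "applied", "referral": True},
    {"company": "C", "status": "applied", "referral": False},
    {"company": "D", "status": "rejected", "referral": False},
    {"company": "E", "status": "onsite", "referral": False},
    {"company": "F", "status": "applied", "referral": False},
]

rates = response_rate_by_referral(sample)
print(f"Referral: {rates[True]:.0%}, cold: {rates[False]:.0%}")
# prints "Referral: 50%, cold: 25%"
```

With the class above, the same function could be called as `response_rate_by_referral(tracker.applications)`. With only a handful of applications the rates are noisy, so treat them as a rough signal.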

Networking Strategy

networking_plan = {
    "Weekly Goals": {
        "linkedin_connections": 5,
        "informational_interviews": 2,
        "community_participation": "1 forum/Discord/Slack",
        "content_creation": "1 post or article"
    },
    "Event Types": [
        "Meetups (data science, ML, Python)",
        "Conferences (PyData, NeurIPS, local events)",
        "Hackathons (build portfolio + network)",
        "Online communities (Kaggle, GitHub Discussions)"
    ],
    "Follow_Up_Template": """
    Hi [Name],

    Great meeting you at [event]. I enjoyed our conversation about [topic].

    I'd love to stay connected and learn more about your work at [company].

    Best,
    [Your name]
    """
}

Key Takeaways

Summary: Portfolio & GitHub

- Three projects minimum: foundation, applied, and advanced. Each demonstrates a different capability.
- Your README is as important as your code. Make it compelling with a clear problem statement, results, and business impact.
- Code quality signals professionalism: tests, docs, and clean structure are details hiring managers notice.
- Your GitHub profile tells a story. Pin your best work and maintain a consistent contribution history.
- Build in public: consistent commits and content creation demonstrate passion and growth.
- Tailor your portfolio to the roles you're targeting. For example, a recommendation system project carries more weight for an ML role.
- Networking plus a portfolio beats cold applications alone, because referrals increase interview callback rates.

Practice Exercises

1. Create a GitHub profile README introducing yourself and your work.
2. Audit an existing project: add a proper README, tests, and documentation.
3. Build a new project from the "High Value" list with full documentation.
4. Set up a contribution streak: commit something every day for 30 days.
5. Write a blog post explaining one of your projects to a non-technical audience.
6. Create a portfolio website (GitHub Pages or similar) with project summaries.
7. Reach out to 5 people in your network for informational interviews.
