# Data Science Technical Writing

Great work that nobody understands is wasted work. Learn to communicate data science clearly through blog posts, notebooks, and documentation.

## Writing Framework

1. Know your audience  → Beginner? Practitioner? Researcher?
2. Define the goal     → Teach? Persuade? Document?
3. Structure clearly   → Problem → Approach → Results → Takeaways
4. Write simply        → Avoid jargon unless necessary
5. Show, don't tell    → Code, visualizations, examples

## Blog Post Structure

```python
# Template for data science blog posts
blog_template = """
# [Action-oriented title]
## e.g., "How I Reduced Model Latency by 10x Without Sacrificing Accuracy"

## Introduction (100-150 words)
- Hook: Start with a surprising result or relatable problem
- Context: What's the real-world scenario?
- Promise: What will the reader learn?

## The Problem (200-300 words)
- Business context
- Why existing solutions fall short
- Constraints and requirements

## The Approach (500-800 words)
- Solution overview with architecture diagram
- Key design decisions and tradeoffs
- Code snippets (keep them short and focused)

## Results (300-500 words)
- Quantitative metrics with before/after
- Visualizations
- Ablation studies if applicable

## Lessons Learned (200-300 words)
- What worked
- What didn't
- What you'd do differently

## Next Steps (100 words)
- Future improvements
- Open questions
- Links to resources
"""
```

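The word budgets in the template can be checked mechanically against a draft. The sketch below is one minimal way to do that, assuming the draft is Markdown whose `## ` section headings match the template. The names `BUDGETS`, `section_word_counts`, and `check_budgets` are hypothetical helpers written for this illustration, not part of any library, and treating the single "100 words" figure for Next Steps as an upper bound is an assumption.

```python
import re

# Word budgets copied from the template: (min_words, max_words).
# Assumption: "Next Steps (100 words)" is treated as an upper bound.
BUDGETS = {
    "Introduction": (100, 150),
    "The Problem": (200, 300),
    "The Approach": (500, 800),
    "Results": (300, 500),
    "Lessons Learned": (200, 300),
    "Next Steps": (0, 100),
}


def section_word_counts(markdown_text):
    """Return {section title: word count} for each '## ' section."""
    counts = {}
    current = None
    for line in markdown_text.splitlines():
        if line.startswith("## "):
            # Drop a trailing "(100-150 words)" style hint if present.
            current = re.sub(r"\s*\(.*\)\s*$", "", line[3:]).strip()
            counts[current] = 0
        elif current is not None:
            counts[current] += len(line.split())
    return counts


def check_budgets(markdown_text, budgets=BUDGETS):
    """Return human-readable warnings for missing or out-of-budget sections."""
    counts = section_word_counts(markdown_text)
    warnings = []
    for title, (low, high) in budgets.items():
        if title not in counts:
            warnings.append(f"{title}: section missing")
        elif not low <= counts[title] <= high:
            warnings.append(f"{title}: {counts[title]} words (budget {low}-{high})")
    return warnings
```

The counts are rough (list markers and code count as words), so the warnings are best read as prompts to re-check a section rather than hard limits.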
## Jupyter Notebook Storytelling

```python
# Cell structure for effective notebooks
# (triple-quoted strings stand in for markdown cells)

# Cell 1: Title and context
"""
# Predicting Customer Churn with Gradient Boosting

**Business Goal:** Identify at-risk customers for targeted retention campaigns
**Dataset:** 50K customer records with 6 months of behavioral data
**Result:** 0.89 AUC, $2.3M estimated annual impact
"""

# Cell 2: Setup and imports
"""
## Setup
"""
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Cell 3: Data exploration with narrative
"""
## What does our data look like?
"""
# Always explain what you're looking for
df = pd.read_csv('customers.csv')

# Add markdown before visualization
"""
The distribution of customer tenure shows a bimodal pattern,
suggesting two distinct customer segments. This insight will
inform our feature engineering.
"""
plt.figure(figsize=(10, 6))
plt.hist(df['tenure_months'], bins=30, edgecolor='black')
plt.xlabel('Tenure (months)')
plt.ylabel('Count')
plt.title('Customer Tenure Distribution')
plt.show()

# Cell 4: Feature engineering with reasoning
"""
## Feature Engineering

Key insight: Recent behavior is more predictive than historical averages.
We create rolling window features at 7, 30, and 90-day horizons.
"""

# Cell 5: Model training with clear output
"""
## Model Training

We use LightGBM for its speed and native handling of categorical features.
"""
import lightgbm as lgb

# X_train and y_train come from a train/test split of the engineered
# features (the split itself is omitted from this outline).
model = lgb.LGBMClassifier(n_estimators=100, max_depth=6, random_state=42)
model.fit(X_train, y_train)

# Cell 6: Evaluation with business context
"""
## Results

Our model achieves 0.89 AUC, meaning it ranks a randomly chosen churner
above a randomly chosen non-churner 89% of the time. In practical terms:

- **Top 10% flagged customers** have a 45% actual churn rate (vs 8% baseline)
- **Expected impact:** Retaining 30% of flagged customers = $2.3M/year
"""
```

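Cell 4 in the outline above describes rolling-window features at 7, 30, and 90-day horizons without showing code. The following is a minimal sketch of one way to build them with pandas. It assumes a hypothetical daily activity table with columns `customer_id`, `date`, and `logins`; none of these names come from the dataset described above, and `add_rolling_features` is an illustrative helper rather than the notebook's actual implementation.

```python
import pandas as pd


def add_rolling_features(events, value_col="logins", windows=(7, 30, 90)):
    """Add trailing-window sums of `value_col` per customer.

    Assumes (hypothetically) one row per customer per day with columns
    `customer_id`, `date` (datetime64), and `value_col`.
    """
    # Sort so each customer's rows are contiguous and in date order.
    events = events.sort_values(["customer_id", "date"]).reset_index(drop=True)
    for days in windows:
        rolled = (
            events.set_index("date")
            .groupby("customer_id")[value_col]
            .rolling(f"{days}D")  # time-based window ending at each row's date
            .sum()
        )
        # The result is ordered by customer, then date, matching `events`.
        events[f"{value_col}_sum_{days}d"] = rolled.to_numpy()
    return events
```

Each window ends at the row's own date, so a feature only uses information available on that day. When building a training set, the label date should fall after the feature date to avoid leakage.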
```python
# Notebook best practices
notebook_checklist = {
    "do": [
        "Start with a clear title and goal",
        "Use markdown cells to explain reasoning",
        "Show outputs inline (not just code)",
        "Include a requirements.txt",
        "Use relative paths for data",
        "Add a table of contents for long notebooks"
    ],
    "avoid": [
        "Long code cells without explanation",
        "Output cells that scroll for pages",
        "Hardcoded paths",
        "Unused imports",
        "Printing raw DataFrames",
        "Undefined variables"
    ]
}
```
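A few items on this checklist can be checked automatically. The sketch below reads the parsed JSON of an `.ipynb` file (a top-level `cells` list whose entries carry `cell_type` and `source`) and flags three of the issues listed above. `lint_notebook`, `lint_notebook_file`, and the 30-line threshold are assumptions made for this illustration, and the path pattern is deliberately simple, so expect occasional false positives.

```python
import json
import re

# Quoted strings that start like an absolute path: '/home/...' or 'C:\...'.
ABSOLUTE_PATH = re.compile(r"""["'](/|[A-Za-z]:\\)[^"']*["']""")


def lint_notebook(notebook, max_code_lines=30):
    """Return warnings for a notebook given as a parsed .ipynb dict."""
    warnings = []
    cells = notebook.get("cells", [])
    if not cells or cells[0].get("cell_type") != "markdown":
        warnings.append("notebook does not start with a markdown title cell")
    for index, cell in enumerate(cells):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        # The .ipynb format stores source as a string or a list of lines.
        text = source if isinstance(source, str) else "".join(source)
        if len(text.splitlines()) > max_code_lines:
            warnings.append(f"cell {index}: more than {max_code_lines} lines of code")
        if ABSOLUTE_PATH.search(text):
            warnings.append(f"cell {index}: hardcoded absolute path")
    return warnings


def lint_notebook_file(path):
    """Load an .ipynb file from disk and lint it."""
    with open(path, encoding="utf-8") as handle:
        return lint_notebook(json.load(handle))
```

Checks such as unused imports or undefined variables need real static analysis and are better left to a dedicated linter.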

## Documentation Best Practices

```python
# Docstring format (Google style)
def train_model(X_train, y_train, config):
    """Train a gradient boosting model with cross-validation.

    This function handles feature preprocessing, hyperparameter tuning,
    and model training in a single pipeline. It logs metrics to MLflow
    and saves the best model to the specified path.

    Args:
        X_train: Training features as a pandas DataFrame.
        y_train: Training labels as a numpy array.
        config: Configuration dictionary with keys:
            - n_estimators: Number of trees (default: 100)
            - max_depth: Maximum tree depth (default: 6)
            - learning_rate: Learning rate (default: 0.1)

    Returns:
        dict: Training results containing:
            - model: Trained model object
            - metrics: Cross-validation metrics
            - feature_importance: Feature importance scores

    Raises:
        ValueError: If X_train and y_train have different lengths.
        RuntimeError: If training fails to converge.

    Example:
        >>> config = {'n_estimators': 200, 'max_depth': 8}
        >>> results = train_model(X_train, y_train, config)
        >>> print(f"CV AUC: {results['metrics']['auc']:.4f}")
    """
    pass
```
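An `Example:` section written in `>>>` form can double as a test, because the standard-library `doctest` module can execute it. The sketch below shows this on a small, hypothetical helper, `validate_lengths`, which implements only the `ValueError` contract from the docstring above. It is not the body of `train_model`, and `run_examples` is an illustrative wrapper.

```python
import doctest


def validate_lengths(X_train, y_train):
    """Check that features and labels have the same number of rows.

    Args:
        X_train: Sequence of feature rows.
        y_train: Sequence of labels.

    Returns:
        int: The number of training examples.

    Raises:
        ValueError: If X_train and y_train have different lengths.

    Example:
        >>> validate_lengths([[1, 2], [3, 4]], [0, 1])
        2
        >>> validate_lengths([[1, 2]], [0, 1])
        Traceback (most recent call last):
            ...
        ValueError: X_train has 1 rows but y_train has 2 labels
    """
    if len(X_train) != len(y_train):
        raise ValueError(
            f"X_train has {len(X_train)} rows but y_train has {len(y_train)} labels"
        )
    return len(X_train)


def run_examples(func):
    """Run the doctest examples in `func` and return (failed, attempted)."""
    finder = doctest.DocTestFinder()
    runner = doctest.DocTestRunner(verbose=False)
    # Pass the function explicitly so the examples can call it by name.
    for test in finder.find(func, name=func.__name__, globs={func.__name__: func}):
        runner.run(test)
    return runner.summarize(verbose=False)
```

Calling `run_examples(validate_lengths)` reports how many examples ran and how many failed, so an outdated docstring example shows up as a failure instead of silently misleading readers.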

````python
# README template
readme_template = """
# Project Name

Brief description of what this project does and why.

## Quick Start

```bash
# Clone and setup
git clone <repo>
cd project
pip install -r requirements.txt

# Run
python main.py --config config.yaml
```

## Architecture

[Architecture diagram or description]

## Data

| Dataset | Description | Size |
|---------|-------------|------|
| train.csv | Training data | 50K rows |
| test.csv | Test set | 10K rows |

## Model Performance

| Model | AUC | Latency | Notes |
|-------|-----|---------|-------|
| Baseline (Logistic Regression) | 0.82 | 5ms | Simple baseline |
| LightGBM | 0.89 | 15ms | Production model |
| Neural Network | 0.91 | 45ms | Too slow for production |

## Configuration

See `config.yaml` for all options.

## Deployment

```bash
docker build -t ml-service .
docker run -p 8000:8000 ml-service
```

## License

MIT
"""
````
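A performance table like the one in this template is easier to keep accurate when it is generated from experiment results rather than typed by hand. The sketch below assumes a hypothetical `results` dictionary keyed by model name with `auc`, `latency_ms`, and `notes` fields. That layout is an assumption chosen to match the table's columns, and the numbers are the ones from the template.

```python
def results_to_markdown(results):
    """Render {model: {"auc", "latency_ms", "notes"}} as a Markdown table."""
    lines = [
        "| Model | AUC | Latency | Notes |",
        "|-------|-----|---------|-------|",
    ]
    for model, r in results.items():
        lines.append(
            f"| {model} | {r['auc']:.2f} | {r['latency_ms']}ms | {r['notes']} |"
        )
    return "\n".join(lines)


# Values taken from the README template; the dict layout is illustrative.
results = {
    "Baseline (Logistic Regression)": {"auc": 0.82, "latency_ms": 5, "notes": "Simple baseline"},
    "LightGBM": {"auc": 0.89, "latency_ms": 15, "notes": "Production model"},
    "Neural Network": {"auc": 0.91, "latency_ms": 45, "notes": "Too slow for production"},
}

print(results_to_markdown(results))
```

Writing the returned string into the README as part of the evaluation script keeps the documented numbers in step with the latest run.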

## Technical Communication Patterns

```python
# Pattern: Lead with the answer
def bad_example():
    """We tried several approaches. First we tried X, then Y, then Z,
    and after much deliberation we found that Z was best."""
    pass

def good_example():
    """We recommend approach Z because it achieves 15% lower latency
    than the alternatives. Here's why: [explanation]"""
    pass

# Pattern: Use concrete numbers
def vague():
    """The model performs significantly better than the baseline."""
    pass

def specific():
    """The model achieves 0.89 AUC, a 7-point improvement over the
    0.82 AUC baseline (p < 0.001, n=10,000)."""
    pass

# Pattern: Explain tradeoffs
def incomplete():
    """We chose LightGBM because it's fast."""
    pass

def thorough():
    """We chose LightGBM over XGBoost and neural networks:
    
    - **vs XGBoost:** 3x faster training with comparable accuracy
    - **vs Neural Networks:** No GPU required, better on small data
    - **Tradeoff:** Slightly lower accuracy than neural networks on large datasets
    
    For our use case (50K samples, CPU-only serving), LightGBM is optimal."""
    pass
```
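Concrete numbers are more convincing when they come with a measure of uncertainty. The sketch below computes AUC directly from its ranking definition (the share of positive/negative pairs ordered correctly, ties counted as half) and adds a percentile bootstrap confidence interval. It uses only the standard library and is meant for small illustrative inputs. `auc` and `bootstrap_auc_ci` are hypothetical helper names, and the bootstrap is one reasonable choice here, not the method behind the figures quoted above.

```python
import random


def auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    # A correctly ordered pair scores 1, a tie scores 0.5.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc_ci(labels, scores, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for AUC."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_resamples:
        idx = [rng.randrange(n) for _ in range(n)]
        sample_labels = [labels[i] for i in idx]
        if len(set(sample_labels)) < 2:
            continue  # resample lacks one class, so AUC is undefined
        stats.append(auc(sample_labels, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper
```

The pairwise loop is quadratic in the sample size, so for a test set of realistic size a vectorized AUC implementation would be the practical substitute inside the bootstrap.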

## Visualization Best Practices

```python
import matplotlib.pyplot as plt

# Effective visualization checklist
viz_checklist = {
    "title": "Clear, descriptive title",
    "labels": "Axis labels with units",
    "legend": "Clear legend if needed",
    "colors": "Colorblind-friendly palette",
    "annotations": "Highlight key points",
    "source": "Data source attribution"
}

# Good visualization example
def plot_model_comparison(results):
    """Compare model performance with clear labels and context.

    `results` maps each model name to a dict with 'auc' and 'latency_ms'.
    """

    fig, axes = plt.subplots(1, 2, figsize=(12, 5))

    # Left: Metric comparison
    models = list(results.keys())
    aucs = [r['auc'] for r in results.values()]

    # Highlight models that beat the 0.85 AUC target
    colors = ['#2ecc71' if a > 0.85 else '#3498db' for a in aucs]
    axes[0].barh(models, aucs, color=colors)
    axes[0].set_xlabel('AUC Score')
    axes[0].set_title('Model Performance Comparison')
    axes[0].axvline(x=0.85, color='red', linestyle='--', label='Target: 0.85')
    axes[0].legend()

    # Right: Latency vs Accuracy tradeoff
    latencies = [r['latency_ms'] for r in results.values()]
    axes[1].scatter(latencies, aucs, s=100)

    for i, model in enumerate(models):
        axes[1].annotate(model, (latencies[i], aucs[i]),
                        textcoords="offset points", xytext=(5, 5))

    axes[1].set_xlabel('Inference Latency (ms)')
    axes[1].set_ylabel('AUC Score')
    axes[1].set_title('Accuracy vs Latency Tradeoff')

    plt.tight_layout()
    plt.savefig('model_comparison.png', dpi=150, bbox_inches='tight')
    plt.show()
```
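Three checklist items (a colorblind-friendly palette, axis labels with units, and data source attribution) are not exercised by the example above. The sketch below applies them to a simple latency chart using the Okabe-Ito colors, a commonly recommended colorblind-safe palette. `plot_latency_bars` and its `source` argument are hypothetical names for this illustration, and the `Agg` backend line is an assumption for running outside a notebook.

```python
import matplotlib

matplotlib.use("Agg")  # assumption: render off-screen; omit inside a notebook
import matplotlib.pyplot as plt

# Okabe-Ito palette (colorblind-safe).
OKABE_ITO = ["#0072B2", "#E69F00", "#009E73", "#D55E00", "#CC79A7", "#56B4E9"]


def plot_latency_bars(latency_ms, source):
    """Bar chart of inference latency with units, value labels, and a source note."""
    fig, ax = plt.subplots(figsize=(8, 4))
    models = list(latency_ms)
    values = [latency_ms[m] for m in models]
    colors = [OKABE_ITO[i % len(OKABE_ITO)] for i in range(len(models))]
    ax.bar(models, values, color=colors)

    # Annotate each bar with its value so readers need not estimate heights.
    for position, value in enumerate(values):
        ax.text(position, value, f"{value} ms", ha="center", va="bottom")

    ax.set_xlabel("Model")
    ax.set_ylabel("Inference latency (ms)")
    ax.set_title("Inference latency by model")
    # Data source attribution in the bottom-right corner of the figure.
    fig.text(0.99, 0.01, f"Source: {source}", ha="right", va="bottom", fontsize=8)
    fig.tight_layout()
    return fig
```

Returning the figure instead of calling `plt.show()` lets the caller decide whether to display it, save it with `fig.savefig(...)`, or embed it in a report.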

## Common Writing Mistakes

````python
# Mistake 1: Walls of code
bad_code_block = """
```python
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
# ... 50 more lines
```
"""

# Better: Break into focused chunks with explanation
good_code_block = """
We'll use a Random Forest classifier for its interpretability and robustness to outliers:

```python
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100, max_depth=6, random_state=42)
model.fit(X_train, y_train)
```
"""

# Mistake 2: No context for results
bad = "We got 0.89 AUC."
good = "Our model achieves 0.89 AUC on the held-out test set, meaning it correctly ranks a churner higher than a non-churner 89% of the time."

# Mistake 3: Burying the lead
bad = "After extensive experimentation with various architectures..."
good = "A simple LightGBM model outperforms complex neural networks while being 10x faster to train and serve."
````

## Key Takeaways

1. **Lead with the answer** – readers want to know the conclusion first
2. **Use concrete numbers** – "15% improvement" beats "significantly better"
3. **Show your thinking** – explain why, not just what
4. **Keep code focused** – one concept per code cell
5. **Make it reproducible** – include requirements, data sources, and clear instructions