Project 2: End-to-End ML Pipeline

Overview

Definition: ML Pipeline

A sequence of automated steps that transform raw data into predictions, including data ingestion, preprocessing, feature engineering, model training, evaluation, and deployment. A well-designed pipeline ensures reproducibility, prevents data leakage, and enables consistent predictions in production.

Architecture Diagram

ML Pipeline Architecture:
┌─────────────────────────────────────────────────────────────────┐
│                                                                 │
│  ┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐    │
│  │   Data    │──→│ Feature  │──→│  Model   │──→│Evaluation│    │
│  │ Ingestion│   │Engineering│   │ Training │   │ & Tuning │    │
│  └──────────┘   └──────────┘   └──────────┘   └──────────┘    │
│       │              │              │              │            │
│       ▼              ▼              ▼              ▼            │
│  ┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐    │
│  │  Schema  │   │Pipeline  │   │  Model   │   │  Metric  │    │
│  │Validation│   │ Object   │   │ Registry │   │ Dashboard│    │
│  └──────────┘   └──────────┘   └──────────┘   └──────────┘    │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Project: Predicting Customer Churn

We'll build a pipeline to predict whether a customer will churn (cancel subscription) based on usage patterns and demographics.

Step 1: Data Loading and Exploration

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from warnings import filterwarnings
filterwarnings('ignore')

# Generate realistic customer churn dataset
np.random.seed(42)
n_customers = 5000

data = pd.DataFrame({
    'customer_id': range(1, n_customers + 1),
    'tenure_months': np.random.exponential(24, n_customers).astype(int).clip(1, 72),
    'monthly_charges': np.random.normal(65, 25, n_customers).clip(20, 120).round(2),
    'total_charges': None,
    'num_support_tickets': np.random.poisson(2, n_customers),
    'num_referrals': np.random.poisson(1, n_customers),
    'avg_monthly_usage_gb': np.random.gamma(3, 5, n_customers).round(2),
    'contract_type': np.random.choice(
        ['Month-to-Month', 'One Year', 'Two Year'],
        n_customers, p=[0.5, 0.3, 0.2]
    ),
    'payment_method': np.random.choice(
        ['Credit Card', 'Bank Transfer', 'Electronic Check', 'Mailed Check'],
        n_customers
    ),
    # Probabilities must sum to 1, otherwise np.random.choice raises ValueError
    'has_partner': np.random.choice([0, 1], n_customers, p=[0.6, 0.4]),
    'has_dependents': np.random.choice([0, 1], n_customers, p=[0.7, 0.3]),
})

# Calculate total charges
data['total_charges'] = (data['tenure_months'] * data['monthly_charges'] *
                         np.random.uniform(0.8, 1.2, n_customers)).round(2)

# Generate target variable (churn) with realistic relationships
churn_prob = (
    0.1 +
    0.3 * (data['contract_type'] == 'Month-to-Month').astype(int) +
    0.1 * (data['num_support_tickets'] > 3).astype(int) +
    0.05 * (data['monthly_charges'] > 80).astype(int) -
    0.02 * (data['tenure_months'] > 12).astype(int) +
    np.random.randn(n_customers) * 0.05
).clip(0, 1)

data['churned'] = (np.random.random(n_customers) < churn_prob).astype(int)

# Save dataset
data.to_csv('customer_churn.csv', index=False)
print(f"Dataset shape: {data.shape}")
print(f"\nChurn rate: {data['churned'].mean():.2%}")
print(f"\nFirst few rows:")
print(data.head())

# Exploratory Data Analysis
def eda_report(df):
    """Comprehensive EDA report."""
    print("=" * 60)
    print("DATA OVERVIEW")
    print("=" * 60)
    print(f"\nShape: {df.shape}")
    print(f"\nData Types:\n{df.dtypes}")
    print(f"\nMissing Values:\n{df.isnull().sum()}")
    print(f"\nDuplicates: {df.duplicated().sum()}")

    print("\n" + "=" * 60)
    print("NUMERIC FEATURES")
    print("=" * 60)
    numeric_cols = df.select_dtypes(include=[np.number]).columns
    print(df[numeric_cols].describe().round(2))

    print("\n" + "=" * 60)
    print("CATEGORICAL FEATURES")
    print("=" * 60)
    cat_cols = df.select_dtypes(include=['object']).columns
    for col in cat_cols:
        print(f"\n{col}:")
        print(df[col].value_counts())

    return numeric_cols, cat_cols

numeric_cols, cat_cols = eda_report(data)

Always perform EDA before building models. Understanding the data distribution, class balance, feature types, and missing values guides your preprocessing and model choices. Pay special attention to the target distribution — an imbalanced target requires special handling (see Lesson 21).
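
A quick way to judge class balance is to compare the class shares with the accuracy of always predicting the majority class. The sketch below uses a hypothetical toy target (`y_toy`, 80 stayers and 20 churners), not the tutorial's dataset. Any model has to beat this baseline to be useful.

```python
import pandas as pd

# Hypothetical toy target: 1 = churned, 0 = stayed (not the tutorial's data)
y_toy = pd.Series([0] * 80 + [1] * 20, name="churned")

# Share of each class, indexed by label
class_share = y_toy.value_counts(normalize=True).sort_index()
print(class_share)

# Accuracy obtained by always predicting the most frequent class
majority_baseline = class_share.max()
print(f"Majority-class baseline accuracy: {majority_baseline:.2%}")
```

If the baseline is already high, accuracy alone says little, and metrics such as recall, F1, or AUC become more informative.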

Step 2: Data Visualization

def plot_eda(df, target='churned'):
    """Create EDA visualizations."""
    fig, axes = plt.subplots(2, 3, figsize=(18, 10))

    df[target].value_counts().plot(kind='bar', ax=axes[0, 0], color=['steelblue', 'coral'])
    axes[0, 0].set_title('Churn Distribution')
    axes[0, 0].set_xticklabels(['No Churn', 'Churned'], rotation=0)

    for label in [0, 1]:
        subset = df[df[target] == label]
        axes[0, 1].hist(subset['tenure_months'], alpha=0.6,
                        label=f'Churn={label}', bins=20)
    axes[0, 1].set_title('Tenure Distribution by Churn')
    axes[0, 1].legend()

    for label in [0, 1]:
        subset = df[df[target] == label]
        axes[0, 2].hist(subset['monthly_charges'], alpha=0.6,
                        label=f'Churn={label}', bins=20)
    axes[0, 2].set_title('Monthly Charges by Churn')
    axes[0, 2].legend()

    ct = pd.crosstab(df['contract_type'], df[target], normalize='index')
    ct.plot(kind='bar', ax=axes[1, 0], color=['steelblue', 'coral'])
    axes[1, 0].set_title('Churn Rate by Contract Type')
    axes[1, 0].set_xticklabels(axes[1, 0].get_xticklabels(), rotation=45)

    ticket_churn = df.groupby('num_support_tickets')[target].mean()
    axes[1, 1].bar(ticket_churn.index, ticket_churn.values, color='steelblue')
    axes[1, 1].set_title('Churn Rate by Support Tickets')
    axes[1, 1].set_xlabel('Number of Support Tickets')

    numeric_df = df.select_dtypes(include=[np.number])
    corr = numeric_df.corr()
    sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm',
                ax=axes[1, 2], center=0, vmin=-1, vmax=1)
    axes[1, 2].set_title('Feature Correlations')

    plt.tight_layout()
    plt.savefig('eda_plots.png', dpi=150)
    plt.show()

plot_eda(data)

Step 3: Feature Engineering Pipeline

Definition: Feature Engineering

The process of creating new input features from raw data to improve model performance. Domain-specific features (ratios, aggregations, interactions) often provide more predictive power than raw features alone.

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import (
    StandardScaler, OneHotEncoder, OrdinalEncoder,
    PolynomialFeatures
)
from sklearn.impute import SimpleImputer
from sklearn.base import BaseEstimator, TransformerMixin

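# Define feature groups. The columns created by FeatureEngineer are listed
# here as well; with remainder='drop' they would otherwise be discarded.
numeric_features = [
    'tenure_months', 'monthly_charges', 'total_charges',
    'num_support_tickets', 'num_referrals', 'avg_monthly_usage_gb',
    'revenue_per_month', 'charges_ratio', 'usage_per_charge', 'tickets_per_month'
]

categorical_features = ['contract_type', 'payment_method', 'tenure_bucket']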

binary_features = ['has_partner', 'has_dependents']

# Custom transformer for feature engineering
class FeatureEngineer(BaseEstimator, TransformerMixin):
    """Create domain-specific features."""

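    def fit(self, X, y=None):
        # Learn the reference mean from the training data only, so that
        # transform() gives the same result for one row as for a full batch
        self.mean_monthly_charges_ = X['monthly_charges'].mean()
        return self

    def transform(self, X):
        X = X.copy()

        # Revenue features
        X['revenue_per_month'] = X['total_charges'] / X['tenure_months'].clip(lower=1)
        X['charges_ratio'] = X['monthly_charges'] / self.mean_monthly_charges_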

        # Usage features
        X['usage_per_charge'] = X['avg_monthly_usage_gb'] / X['monthly_charges'].clip(lower=1)
        X['tickets_per_month'] = X['num_support_tickets'] / X['tenure_months'].clip(lower=1)

        # Tenure buckets
        X['tenure_bucket'] = pd.cut(
            X['tenure_months'],
            bins=[0, 6, 12, 24, 48, 100],
            labels=['0-6m', '6-12m', '1-2y', '2-4y', '4y+']
        ).astype(str)

        return X

# Build preprocessing pipeline
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(drop='first', sparse_output=False, handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features),
        ('binary', 'passthrough', binary_features)
    ],
    remainder='drop'
)

# Full pipeline
full_pipeline = Pipeline(steps=[
    ('feature_engineer', FeatureEngineer()),
    ('preprocessor', preprocessor)
])

print("Pipeline structure:")
for i, (name, step) in enumerate(full_pipeline.steps):
    print(f"  {i}: {name} ({type(step).__name__})")

Encapsulating preprocessing in a ColumnTransformer or Pipeline ensures that transformations are applied consistently to training and test data, preventing data leakage. The fit_transform method should only be called on training data; transform is used on test data to avoid information leakage from the test set.
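
One way to keep the scaler's statistics out of the validation folds is to pass the whole pipeline to cross_val_score, which clones and refits every step on each training fold. The following is a minimal sketch on synthetic data: make_classification stands in for the churn data, and the names `X_demo`, `leak_free`, and `cv_demo` are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the churn dataset)
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)

# Preprocessing and model live in one estimator
leak_free = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# For each fold the scaler is fit on the training part only,
# then applied to the held-out part before scoring
scores = cross_val_score(leak_free, X_demo, y_demo, cv=cv_demo, scoring="roc_auc")
print(f"AUC per fold: {np.round(scores, 3)}")
print(f"Mean AUC: {scores.mean():.3f}")
```

Transforming the full training set once and cross-validating on the result is simpler, but the fold scores then reflect preprocessing statistics that include the validation rows.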

Step 4: Model Selection and Training

from sklearn.model_selection import (
    train_test_split, cross_val_score,
    StratifiedKFold, GridSearchCV
)
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import (
    RandomForestClassifier, GradientBoostingClassifier,
    VotingClassifier, StackingClassifier
)
from sklearn.svm import SVC
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score,
    f1_score, roc_auc_score, classification_report,
    confusion_matrix, roc_curve
)

# Split data
X = data.drop(['customer_id', 'churned'], axis=1)
y = data['churned']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Train size: {X_train.shape[0]}")
print(f"Test size:  {X_test.shape[0]}")
print(f"Train churn rate: {y_train.mean():.2%}")
print(f"Test churn rate:  {y_test.mean():.2%}")

# Apply preprocessing
X_train_processed = full_pipeline.fit_transform(X_train)
X_test_processed = full_pipeline.transform(X_test)

print(f"\nProcessed features: {X_train_processed.shape[1]}")

📝Model Comparison with Cross-Validation

# Compare multiple models
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, random_state=42),
    # use_label_encoder is deprecated in recent XGBoost releases and omitted here
    'XGBoost': XGBClassifier(
        n_estimators=100, eval_metric='logloss', random_state=42
    ),
    'LightGBM': LGBMClassifier(n_estimators=100, random_state=42, verbose=-1),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

print("Model Comparison (5-fold CV):")
print(f"{'Model':<25} {'Accuracy':>10} {'F1':>10} {'AUC':>10}")
print("-" * 60)

results = {}
for name, model in models.items():
    acc_scores = cross_val_score(model, X_train_processed, y_train, cv=cv, scoring='accuracy')
    f1_scores = cross_val_score(model, X_train_processed, y_train, cv=cv, scoring='f1')
    auc_scores = cross_val_score(model, X_train_processed, y_train, cv=cv, scoring='roc_auc')

    results[name] = {
        'accuracy': acc_scores.mean(),
        'f1': f1_scores.mean(),
        'auc': auc_scores.mean()
    }

    print(f"{name:<25} {acc_scores.mean():.4f}±{acc_scores.std():.4f}"
          f"  {f1_scores.mean():.4f}±{f1_scores.std():.4f}"
          f"  {auc_scores.mean():.4f}±{auc_scores.std():.4f}")

Step 5: Hyperparameter Tuning

Definition: Hyperparameter Tuning

The process of finding the optimal configuration of a model's hyperparameters (parameters not learned from data) that maximizes performance on a validation set. Methods include grid search (exhaustive), random search (sampled), and Bayesian optimization (sequential).

# Tune XGBoost (assumed here to be the strongest model in the comparison above)
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.05, 0.1],
    'subsample': [0.8, 0.9],
    'colsample_bytree': [0.8, 0.9],
    'min_child_weight': [1, 3, 5]
}

from sklearn.model_selection import RandomizedSearchCV

# use_label_encoder is deprecated in recent XGBoost releases and omitted here
xgb_model = XGBClassifier(
    eval_metric='logloss',
    random_state=42
)

random_search = RandomizedSearchCV(
    xgb_model,
    param_distributions=param_grid,
    n_iter=20,
    cv=cv,
    scoring='roc_auc',
    random_state=42,
    n_jobs=-1,
    verbose=1
)

random_search.fit(X_train_processed, y_train)

print(f"\nBest parameters: {random_search.best_params_}")
print(f"Best AUC: {random_search.best_score_:.4f}")

# Train best model
best_model = random_search.best_estimator_

RandomizedSearchCV is preferred over GridSearchCV when the hyperparameter space is large. It samples a fixed number of parameter combinations (n_iter) and is often more efficient because it doesn't exhaustively evaluate all combinations. For even more efficient tuning, consider Bayesian optimization (e.g., Optuna, Hyperopt) which uses past results to guide the search.
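
To make the cost difference concrete, the sketch below counts the combinations in the grid from Step 5 and compares the number of model fits under exhaustive and randomized search. The grid values are copied so the block runs on its own. `param_grid_demo` and `n_folds` are illustrative names, and ParameterGrid and ParameterSampler are the scikit-learn helpers that enumerate and sample such grids.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Same values as the tuning grid in Step 5, copied to keep this block self-contained
param_grid_demo = {
    'n_estimators': [100, 200, 300],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.05, 0.1],
    'subsample': [0.8, 0.9],
    'colsample_bytree': [0.8, 0.9],
    'min_child_weight': [1, 3, 5],
}

# Exhaustive grid: 3 * 3 * 3 * 2 * 2 * 3 combinations
n_total = len(ParameterGrid(param_grid_demo))

# Randomized search draws only n_iter of them
sampled = list(ParameterSampler(param_grid_demo, n_iter=20, random_state=42))

n_folds = 5
print(f"Combinations in grid: {n_total}")
print(f"Grid search fits:       {n_total * n_folds}")
print(f"Randomized search fits: {len(sampled) * n_folds}")
```

With 5-fold cross-validation, the exhaustive search needs 1620 fits against 100 for 20 sampled candidates, which is why sampling is attractive when each fit is expensive.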

Step 6: Model Evaluation

📝Comprehensive Model Evaluation

def comprehensive_evaluation(model, X_test, y_test, model_name='Model'):
    """Full model evaluation with visualizations."""
    y_pred = model.predict(X_test)
    y_prob = model.predict_proba(X_test)[:, 1]

    metrics = {
        'accuracy': accuracy_score(y_test, y_pred),
        'precision': precision_score(y_test, y_pred),
        'recall': recall_score(y_test, y_pred),
        'f1': f1_score(y_test, y_pred),
        'auc': roc_auc_score(y_test, y_prob)
    }

    print(f"\n{model_name} - Test Set Performance:")
    print("=" * 45)
    for metric, value in metrics.items():
        print(f"  {metric.capitalize():<12}: {value:.4f}")

    print(f"\nClassification Report:")
    print(classification_report(y_test, y_pred, target_names=['No Churn', 'Churned']))

    # Visualizations
    fig, axes = plt.subplots(1, 3, figsize=(15, 4))

    cm = confusion_matrix(y_test, y_pred)
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=axes[0],
                xticklabels=['No Churn', 'Churned'],
                yticklabels=['No Churn', 'Churned'])
    axes[0].set_title('Confusion Matrix')
    axes[0].set_ylabel('Actual')
    axes[0].set_xlabel('Predicted')

    fpr, tpr, _ = roc_curve(y_test, y_prob)
    axes[1].plot(fpr, tpr, 'b-', linewidth=2, label=f'AUC = {metrics["auc"]:.3f}')
    axes[1].plot([0, 1], [0, 1], 'k--', alpha=0.5)
    axes[1].set_xlabel('False Positive Rate')
    axes[1].set_ylabel('True Positive Rate')
    axes[1].set_title('ROC Curve')
    axes[1].legend()

    if hasattr(model, 'feature_importances_'):
        importance = model.feature_importances_
        feature_names = [f'Feature {i}' for i in range(len(importance))]
        indices = np.argsort(importance)[-15:]

        axes[2].barh(range(len(indices)), importance[indices], color='steelblue')
        axes[2].set_yticks(range(len(indices)))
        axes[2].set_yticklabels([feature_names[i] for i in indices])
        axes[2].set_title('Top 15 Feature Importances')

    plt.tight_layout()
    # Replace spaces so the output filename has no whitespace
    plt.savefig(f"{model_name.lower().replace(' ', '_')}_evaluation.png", dpi=150)
    plt.show()

    return metrics

test_metrics = comprehensive_evaluation(best_model, X_test_processed, y_test, 'XGBoost Tuned')

Step 7: Model Interpretation

import shap

# SHAP values for model interpretation
explainer = shap.TreeExplainer(best_model)
shap_values = explainer.shap_values(X_test_processed[:100])

# Summary plot
plt.figure(figsize=(10, 8))
shap.summary_plot(
    shap_values,
    X_test_processed[:100],
    feature_names=[f'f{i}' for i in range(X_test_processed.shape[1])],
    show=False
)
plt.tight_layout()
plt.savefig('shap_summary.png', dpi=150)
plt.show()

# Feature importance from model
feature_importance = pd.DataFrame({
    'feature': [f'f{i}' for i in range(X_train_processed.shape[1])],
    'importance': best_model.feature_importances_
}).sort_values('importance', ascending=False)

print("\nTop 10 Most Important Features:")
print(feature_importance.head(10).to_string(index=False))

SHAP (SHapley Additive exPlanations) provides a unified measure of feature importance based on game theory. Each feature's SHAP value represents its marginal contribution to the prediction, considering all possible feature subsets. Unlike permutation importance or tree-based importance, SHAP values are locally faithful and provide both magnitude and direction of each feature's impact.
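
The "all possible feature subsets" idea can be illustrated by computing Shapley values exactly for a tiny model. The sketch below is not how the shap library computes its values (TreeExplainer uses a tree-specific algorithm and background data). It uses a hypothetical three-feature scoring function and a single all-zeros baseline, and it checks the additivity property: the contributions sum to the difference between the prediction and the baseline prediction.

```python
from itertools import combinations
from math import factorial

def model(x):
    """Hypothetical toy scoring function with one interaction term."""
    return 2.0 * x[0] + 1.0 * x[1] + 0.5 * x[0] * x[2]

def value(subset, x, baseline):
    """Model output when features in `subset` keep their real values
    and all other features are set to the baseline."""
    mixed = [x[i] if i in subset else baseline[i] for i in range(len(x))]
    return model(mixed)

def shapley_values(x, baseline):
    """Exact Shapley values by enumerating every subset of the other features."""
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            for subset in combinations(others, size):
                # Weight of this subset in the Shapley average
                weight = factorial(size) * factorial(n - size - 1) / factorial(n)
                with_i = value(set(subset) | {i}, x, baseline)
                without_i = value(set(subset), x, baseline)
                phi[i] += weight * (with_i - without_i)
    return phi

x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(x, baseline)

print("Shapley values:", phi)  # [2.75, 2.0, 0.75]
print("Sum of values: ", sum(phi))
print("f(x) - f(base):", model(x) - model(baseline))
```

The interaction term contributes 1.5 in total and is split evenly between features 0 and 2, which is the kind of attribution a purely global importance score cannot show.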

Step 8: Build Production-Ready Pipeline

Definition: Model Serialization

The process of saving a trained model to disk so it can be loaded and used for predictions without retraining. Common formats include joblib (Python native), pickle, and ONNX (cross-platform).

import joblib
from datetime import datetime

# Create complete pipeline with model
production_pipeline = Pipeline(steps=[
    ('preprocessing', full_pipeline),
    ('classifier', best_model)
])

# Fit on full training data
production_pipeline.fit(X_train, y_train)

# Verify on test set
y_pred_prod = production_pipeline.predict(X_test)
print(f"Production pipeline test accuracy: {accuracy_score(y_test, y_pred_prod):.4f}")

# Save pipeline
timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
model_path = f'churn_model_{timestamp}.joblib'
joblib.dump(production_pipeline, model_path)
print(f"\nModel saved to: {model_path}")

# Load and verify
loaded_pipeline = joblib.load(model_path)
y_pred_loaded = loaded_pipeline.predict(X_test)
assert np.array_equal(y_pred_prod, y_pred_loaded), "Loaded model produces different predictions!"
print("Loaded model verified - predictions match")

Step 9: Prediction Function

def predict_churn(customer_data, pipeline):
    """
    Predict churn probability for new customers.

    Parameters:
    -----------
    customer_data : dict or DataFrame
        Customer features
    pipeline : fitted Pipeline
        Complete ML pipeline

    Returns:
    --------
    dict with prediction and probability
    """
    if isinstance(customer_data, dict):
        customer_df = pd.DataFrame([customer_data])
    else:
        customer_df = customer_data.copy()

    prediction = pipeline.predict(customer_df)[0]
    probability = pipeline.predict_proba(customer_df)[0][1]

    if probability < 0.3:
        risk_level = 'LOW'
    elif probability < 0.6:
        risk_level = 'MEDIUM'
    else:
        risk_level = 'HIGH'

    return {
        'churn_prediction': bool(prediction),
        # Cast to a built-in float so the result is JSON-serializable
        'churn_probability': round(float(probability), 4),
        'risk_level': risk_level
    }

# Example prediction
sample_customer = {
    'tenure_months': 3,
    'monthly_charges': 85.0,
    'total_charges': 255.0,
    'num_support_tickets': 5,
    'num_referrals': 0,
    'avg_monthly_usage_gb': 15.0,
    'contract_type': 'Month-to-Month',
    'payment_method': 'Electronic Check',
    'has_partner': 0,
    'has_dependents': 0
}

result = predict_churn(sample_customer, production_pipeline)
print("\nSample Prediction:")
print(f"  Customer: New subscriber, 3 months, $85/month")
print(f"  Prediction: {'Will Churn' if result['churn_prediction'] else 'Will Stay'}")
print(f"  Probability: {result['churn_probability']:.2%}")
print(f"  Risk Level: {result['risk_level']}")

Step 10: Model Monitoring

Definition: Model Drift

A degradation in model performance over time caused by changes in the underlying data distribution (data drift) or changes in the relationship between features and target (concept drift). Regular monitoring is essential to detect when a model needs retraining.

class ModelMonitor:
    """Monitor model performance and data drift."""

    def __init__(self, pipeline, reference_data):
        self.pipeline = pipeline
        # Keep the reference sample itself: the KS test compares two samples,
        # not a sample against summary statistics
        self.reference_data = reference_data.copy()
        self.reference_stats = self._compute_stats(reference_data)
        self.predictions_log = []

    def _compute_stats(self, data):
        """Compute reference statistics."""
        numeric_cols = data.select_dtypes(include=[np.number]).columns
        return {
            'means': data[numeric_cols].mean().to_dict(),
            'stds': data[numeric_cols].std().to_dict(),
            'missing_rates': data.isnull().mean().to_dict()
        }

    def check_data_drift(self, new_data, threshold=0.05):
        """Check for data drift using a two-sample KS test.

        threshold is the p-value cutoff below which a column is flagged.
        """
        from scipy.stats import ks_2samp

        drift_detected = {}
        numeric_cols = new_data.select_dtypes(include=[np.number]).columns

        for col in numeric_cols:
            if col in self.reference_data.columns:
                stat, p_value = ks_2samp(
                    self.reference_data[col].dropna(),
                    new_data[col].dropna()
                )
                drift_detected[col] = {
                    'ks_statistic': stat,
                    'p_value': p_value,
                    'drifted': p_value < threshold
                }

        return drift_detected

    def log_prediction(self, features, prediction, probability):
        """Log prediction for monitoring."""
        self.predictions_log.append({
            'timestamp': datetime.now(),
            'features': features,
            'prediction': prediction,
            'probability': probability
        })

    def performance_report(self):
        """Generate performance report."""
        if not self.predictions_log:
            return "No predictions logged yet."

        df_log = pd.DataFrame(self.predictions_log)
        report = {
            'total_predictions': len(df_log),
            'avg_churn_probability': df_log['probability'].mean(),
            # Same cutoff as the HIGH risk level in predict_churn
            'high_risk_count': (df_log['probability'] >= 0.6).sum(),
            'prediction_distribution': df_log['prediction'].value_counts().to_dict()
        }
        return report

# Initialize monitor
monitor = ModelMonitor(production_pipeline, X_train)

# Simulate incoming predictions
for i in range(10):
    sample = X_test.iloc[i:i+1]
    pred = production_pipeline.predict(sample)[0]
    prob = production_pipeline.predict_proba(sample)[0][1]
    monitor.log_prediction(sample.to_dict(), pred, prob)

print("Monitoring Report:")
print(monitor.performance_report())

The Kolmogorov-Smirnov (KS) test compares the distribution of a feature in the reference (training) data against new incoming data. A significant p-value (p < 0.05) indicates the distribution has shifted, suggesting data drift. Monitoring prediction distributions (are probabilities drifting toward 0 or 1?) and performance metrics over time helps detect concept drift.
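
A small, self-contained sketch of the test on synthetic data may help. The three samples below are hypothetical stand-ins for a single feature such as monthly charges: one reference sample, one new sample from the same distribution, and one whose mean has shifted by one standard deviation.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Hypothetical samples of one numeric feature
reference = rng.normal(loc=65, scale=25, size=1000)  # training-time values
same_dist = rng.normal(loc=65, scale=25, size=1000)  # new data, no shift
shifted = rng.normal(loc=90, scale=25, size=1000)    # new data, mean moved by one std

stat_same, p_same = ks_2samp(reference, same_dist)
stat_shift, p_shift = ks_2samp(reference, shifted)

print(f"No shift:   KS={stat_same:.3f}, p={p_same:.4f}")
print(f"With shift: KS={stat_shift:.3f}, p={p_shift:.4g}")
print(f"Drift flagged for shifted sample: {p_shift < 0.05}")
```

With large samples the test can flag shifts that are statistically significant but too small to matter, so it is worth reading the KS statistic alongside the p-value.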

Complete Pipeline Summary

Architecture Diagram

Project Deliverables:
┌─────────────────────────────────────────────────────────────────┐
│  ✓ EDA Report           - Data understanding & visualization   │
│  ✓ Feature Pipeline     - Automated preprocessing              │
│  ✓ Model Comparison     - 5 algorithms benchmarked             │
│  ✓ Hyperparameter Tuning - RandomizedSearchCV optimized        │
│  ✓ Evaluation Report    - Metrics, confusion matrix, ROC       │
│  ✓ Model Interpretation - SHAP values, feature importance      │
│  ✓ Production Pipeline  - Saved with joblib                    │
│  ✓ Prediction Function  - Ready for API integration            │
│  ✓ Monitoring System    - Data drift & performance tracking    │
└─────────────────────────────────────────────────────────────────┘

Key Takeaways

Pipelines prevent data leakage: Always preprocess within cross-validation folds
Feature engineering matters: Domain-specific features often outperform raw data
Compare multiple models: No single algorithm wins everywhere
Tune systematically: RandomizedSearchCV balances speed and quality
Evaluate comprehensively: Multiple metrics reveal different strengths
Interpret with SHAP: Explain why the model makes predictions
Monitor in production: Data drift degrades model performance over time
Version everything: Timestamp models and track experiments

📋Summary: End-to-End ML Pipeline

An ML pipeline automates the workflow from raw data to predictions, ensuring reproducibility and preventing data leakage.
EDA (exploratory data analysis) should always precede modeling — understand distributions, class balance, missing values, and feature types.
Feature engineering (domain-specific ratios, aggregations, interactions) often improves model performance more than algorithm choice.
ColumnTransformer and Pipeline encapsulate all preprocessing steps, ensuring consistent transformation of training and test data.
Model comparison via cross-validation (StratifiedKFold) with multiple metrics (accuracy, F1, AUC) reveals the best algorithm for your problem.
RandomizedSearchCV efficiently tunes hyperparameters by sampling the parameter space — preferred over GridSearchCV for large search spaces.
SHAP values provide locally faithful feature importance based on game theory — superior to permutation importance for interpreting individual predictions.
Production pipelines saved with joblib ensure that the exact same preprocessing is applied at inference time.
Model monitoring (KS test for data drift, prediction distribution tracking) is essential for maintaining performance in production.
Version control (timestamped models, experiment tracking) enables rollback and reproducibility.

Practice Extensions

Add more features: Create interaction features, time-based aggregations
Handle imbalanced data: Apply SMOTE from Lesson 21 — does it improve recall?
Build API: Wrap the prediction function in a Flask/FastAPI endpoint
A/B Testing: Compare model v1 vs v2 performance in production
Automate Retraining: Schedule weekly retraining with new data
