Project 2: End-to-End ML Pipeline
Overview
Definition: ML Pipeline
A sequence of automated steps that transform raw data into predictions, including data ingestion, preprocessing, feature engineering, model training, evaluation, and deployment. A well-designed pipeline ensures reproducibility, prevents data leakage, and enables consistent predictions in production.
ML Pipeline Architecture:
┌────────────────────────────────────────────────────────────────────┐
│                                                                    │
│  ┌───────────┐    ┌───────────┐    ┌───────────┐    ┌───────────┐  │
│  │   Data    │───▶│  Feature  │───▶│   Model   │───▶│Evaluation │  │
│  │ Ingestion │    │Engineering│    │ Training  │    │ & Tuning  │  │
│  └─────┬─────┘    └─────┬─────┘    └─────┬─────┘    └─────┬─────┘  │
│        │                │                │                │        │
│        ▼                ▼                ▼                ▼        │
│  ┌───────────┐    ┌───────────┐    ┌───────────┐    ┌───────────┐  │
│  │  Schema   │    │ Pipeline  │    │   Model   │    │  Metric   │  │
│  │Validation │    │  Object   │    │ Registry  │    │ Dashboard │  │
│  └───────────┘    └───────────┘    └───────────┘    └───────────┘  │
│                                                                    │
└────────────────────────────────────────────────────────────────────┘
Project: Predicting Customer Churn
We'll build a pipeline to predict whether a customer will churn (cancel subscription) based on usage patterns and demographics.
Step 1: Data Loading and Exploration
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from warnings import filterwarnings
filterwarnings('ignore')
# Generate realistic customer churn dataset
np.random.seed(42)
n_customers = 5000
data = pd.DataFrame({
'customer_id': range(1, n_customers + 1),
'tenure_months': np.random.exponential(24, n_customers).astype(int).clip(1, 72),
'monthly_charges': np.random.normal(65, 25, n_customers).clip(20, 120).round(2),
'total_charges': None,
'num_support_tickets': np.random.poisson(2, n_customers),
'num_referrals': np.random.poisson(1, n_customers),
'avg_monthly_usage_gb': np.random.gamma(3, 5, n_customers).round(2),
'contract_type': np.random.choice(
['Month-to-Month', 'One Year', 'Two Year'],
n_customers, p=[0.5, 0.3, 0.2]
),
'payment_method': np.random.choice(
['Credit Card', 'Bank Transfer', 'Electronic Check', 'Mailed Check'],
n_customers
),
'has_partner': np.random.choice([0, 1], n_customers, p=[0.6, 0.4]),  # probabilities must sum to 1
'has_dependents': np.random.choice([0, 1], n_customers, p=[0.7, 0.3]),
})
# Calculate total charges
data['total_charges'] = (data['tenure_months'] * data['monthly_charges'] *
np.random.uniform(0.8, 1.2, n_customers)).round(2)
# Generate target variable (churn) with realistic relationships
churn_prob = (
0.1 +
0.3 * (data['contract_type'] == 'Month-to-Month').astype(int) +
0.1 * (data['num_support_tickets'] > 3).astype(int) +
0.05 * (data['monthly_charges'] > 80).astype(int) -
0.02 * (data['tenure_months'] > 12).astype(int) +
np.random.randn(n_customers) * 0.05
).clip(0, 1)
data['churned'] = (np.random.random(n_customers) < churn_prob).astype(int)
# Save dataset
data.to_csv('customer_churn.csv', index=False)
print(f"Dataset shape: {data.shape}")
print(f"\nChurn rate: {data['churned'].mean():.2%}")
print(f"\nFirst few rows:")
print(data.head())
# Exploratory Data Analysis
def eda_report(df):
    """Comprehensive EDA report."""
    print("=" * 60)
    print("DATA OVERVIEW")
    print("=" * 60)
    print(f"\nShape: {df.shape}")
    print(f"\nData Types:\n{df.dtypes}")
    print(f"\nMissing Values:\n{df.isnull().sum()}")
    print(f"\nDuplicates: {df.duplicated().sum()}")
    print("\n" + "=" * 60)
    print("NUMERIC FEATURES")
    print("=" * 60)
    numeric_cols = df.select_dtypes(include=[np.number]).columns
    print(df[numeric_cols].describe().round(2))
    print("\n" + "=" * 60)
    print("CATEGORICAL FEATURES")
    print("=" * 60)
    cat_cols = df.select_dtypes(include=['object']).columns
    for col in cat_cols:
        print(f"\n{col}:")
        print(df[col].value_counts())
    return numeric_cols, cat_cols
numeric_cols, cat_cols = eda_report(data)
Always perform EDA before building models. Understanding the data distribution, class balance, feature types, and missing values guides your preprocessing and model choices. Pay special attention to the target distribution: an imbalanced target requires special handling (see Lesson 21).
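As a small illustration of the class-balance check, the sketch below compares the class shares with the accuracy of a model that always predicts the majority class. The toy labels are invented for the example; with the project data you would pass `data['churned']` instead.

```python
import pandas as pd

# Hypothetical toy labels: 1 = churned, 0 = stayed (not the tutorial's dataset)
y_toy = pd.Series([0, 0, 0, 0, 0, 0, 0, 1, 1, 0])

# Share of each class, indexed by label
class_share = y_toy.value_counts(normalize=True).sort_index()

# Accuracy you would get by always predicting the majority class
majority_baseline = class_share.max()

# How many majority examples there are per minority example
imbalance_ratio = class_share.max() / class_share.min()

print(class_share)
print(f"Majority-class baseline accuracy: {majority_baseline:.2f}")
print(f"Imbalance ratio (majority:minority): {imbalance_ratio:.1f}")
```

Any model whose accuracy does not clearly beat the majority-class baseline has learned little, which is one reason to report F1 and AUC alongside accuracy.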
Step 2: Data Visualization
def plot_eda(df, target='churned'):
    """Create EDA visualizations."""
    fig, axes = plt.subplots(2, 3, figsize=(18, 10))

    # Panel 1: target distribution
    df[target].value_counts().plot(kind='bar', ax=axes[0, 0], color=['steelblue', 'coral'])
    axes[0, 0].set_title('Churn Distribution')
    axes[0, 0].set_xticklabels(['No Churn', 'Churned'], rotation=0)

    # Panel 2: tenure by churn
    for label in [0, 1]:
        subset = df[df[target] == label]
        axes[0, 1].hist(subset['tenure_months'], alpha=0.6,
                        label=f'Churn={label}', bins=20)
    axes[0, 1].set_title('Tenure Distribution by Churn')
    axes[0, 1].legend()

    # Panel 3: monthly charges by churn
    for label in [0, 1]:
        subset = df[df[target] == label]
        axes[0, 2].hist(subset['monthly_charges'], alpha=0.6,
                        label=f'Churn={label}', bins=20)
    axes[0, 2].set_title('Monthly Charges by Churn')
    axes[0, 2].legend()

    # Panel 4: churn rate by contract type
    ct = pd.crosstab(df['contract_type'], df[target], normalize='index')
    ct.plot(kind='bar', ax=axes[1, 0], color=['steelblue', 'coral'])
    axes[1, 0].set_title('Churn Rate by Contract Type')
    axes[1, 0].set_xticklabels(axes[1, 0].get_xticklabels(), rotation=45)

    # Panel 5: churn rate by number of support tickets
    ticket_churn = df.groupby('num_support_tickets')[target].mean()
    axes[1, 1].bar(ticket_churn.index, ticket_churn.values, color='steelblue')
    axes[1, 1].set_title('Churn Rate by Support Tickets')
    axes[1, 1].set_xlabel('Number of Support Tickets')

    # Panel 6: correlations between numeric columns
    numeric_df = df.select_dtypes(include=[np.number])
    corr = numeric_df.corr()
    sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm',
                ax=axes[1, 2], center=0, vmin=-1, vmax=1)
    axes[1, 2].set_title('Feature Correlations')

    plt.tight_layout()
    plt.savefig('eda_plots.png', dpi=150)
    plt.show()
plot_eda(data)
Step 3: Feature Engineering Pipeline
Definition: Feature Engineering
The process of creating new input features from raw data to improve model performance. Domain-specific features (ratios, aggregations, interactions) often provide more predictive power than raw features alone.
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import (
StandardScaler, OneHotEncoder, OrdinalEncoder,
PolynomialFeatures
)
from sklearn.impute import SimpleImputer
from sklearn.base import BaseEstimator, TransformerMixin
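# Define feature groups. The columns created by FeatureEngineer are listed too:
# the ColumnTransformer below uses remainder='drop', so unlisted columns are discarded.
numeric_features = [
    'tenure_months', 'monthly_charges', 'total_charges',
    'num_support_tickets', 'num_referrals', 'avg_monthly_usage_gb',
    'revenue_per_month', 'charges_ratio', 'usage_per_charge', 'tickets_per_month'
]
categorical_features = ['contract_type', 'payment_method', 'tenure_bucket']
binary_features = ['has_partner', 'has_dependents']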
# Custom transformer for feature engineering
class FeatureEngineer(BaseEstimator, TransformerMixin):
    """Create domain-specific features."""

    def fit(self, X, y=None):
        # Learn the reference mean from the training data only, so transform()
        # gives the same result for a single row as for a full batch
        self.mean_monthly_charges_ = X['monthly_charges'].mean()
        return self

    def transform(self, X):
        X = X.copy()
        # Revenue features
        X['revenue_per_month'] = X['total_charges'] / X['tenure_months'].clip(lower=1)
        X['charges_ratio'] = X['monthly_charges'] / self.mean_monthly_charges_
        # Usage features
        X['usage_per_charge'] = X['avg_monthly_usage_gb'] / X['monthly_charges'].clip(lower=1)
        X['tickets_per_month'] = X['num_support_tickets'] / X['tenure_months'].clip(lower=1)
        # Tenure buckets
        X['tenure_bucket'] = pd.cut(
            X['tenure_months'],
            bins=[0, 6, 12, 24, 48, 100],
            labels=['0-6m', '6-12m', '1-2y', '2-4y', '4y+']
        ).astype(str)
        return X
# Build preprocessing pipeline
numeric_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler())
])
categorical_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='most_frequent')),
('onehot', OneHotEncoder(drop='first', sparse_output=False, handle_unknown='ignore'))
])
preprocessor = ColumnTransformer(
transformers=[
('num', numeric_transformer, numeric_features),
('cat', categorical_transformer, categorical_features),
('binary', 'passthrough', binary_features)
],
remainder='drop'
)
# Full pipeline
full_pipeline = Pipeline(steps=[
('feature_engineer', FeatureEngineer()),
('preprocessor', preprocessor)
])
print("Pipeline structure:")
for i, (name, step) in enumerate(full_pipeline.steps):
    print(f" {i}: {name} ({type(step).__name__})")
Encapsulating preprocessing in a ColumnTransformer or Pipeline ensures that transformations are applied consistently to training and test data, preventing data leakage. The fit_transform method should only be called on training data; transform is used on test data to avoid information leakage from the test set.
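The fit-on-train, transform-on-test rule can be seen in isolation with a single scaler. This is a minimal sketch on synthetic data (the feature and its distribution are assumptions made for the example), not part of the churn pipeline itself.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical single numeric feature
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 1))

X_tr, X_te = train_test_split(X_demo, test_size=0.25, random_state=0)

scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)  # mean and std are learned from training rows only
X_te_scaled = scaler.transform(X_te)      # test rows reuse the training mean and std

print(f"Mean learned from train:  {scaler.mean_[0]:.3f}")
print(f"Train mean after scaling: {X_tr_scaled.mean():.3f}")
print(f"Test mean after scaling:  {X_te_scaled.mean():.3f}")
```

The scaled training data is centered at zero, while the scaled test data usually is not exactly centered. That small mismatch is expected: it reflects that no test-set statistics were used.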
Step 4: Model Selection and Training
from sklearn.model_selection import (
train_test_split, cross_val_score,
StratifiedKFold, GridSearchCV
)
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import (
RandomForestClassifier, GradientBoostingClassifier,
VotingClassifier, StackingClassifier
)
from sklearn.svm import SVC
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from sklearn.metrics import (
accuracy_score, precision_score, recall_score,
f1_score, roc_auc_score, classification_report,
confusion_matrix, roc_curve
)
# Split data
X = data.drop(['customer_id', 'churned'], axis=1)
y = data['churned']
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
print(f"Train size: {X_train.shape[0]}")
print(f"Test size: {X_test.shape[0]}")
print(f"Train churn rate: {y_train.mean():.2%}")
print(f"Test churn rate: {y_test.mean():.2%}")
# Apply preprocessing
X_train_processed = full_pipeline.fit_transform(X_train)
X_test_processed = full_pipeline.transform(X_test)
print(f"\nProcessed features: {X_train_processed.shape[1]}")
Model Comparison with Cross-Validation
# Compare multiple models
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, random_state=42),
    # use_label_encoder is deprecated in recent XGBoost releases, so it is omitted
    'XGBoost': XGBClassifier(
        n_estimators=100, eval_metric='logloss', random_state=42
    ),
    'LightGBM': LGBMClassifier(n_estimators=100, random_state=42, verbose=-1),
}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
print("Model Comparison (5-fold CV):")
print(f"{'Model':<25} {'Accuracy':>10} {'F1':>10} {'AUC':>10}")
print("-" * 60)
results = {}
for name, model in models.items():
    acc_scores = cross_val_score(model, X_train_processed, y_train, cv=cv, scoring='accuracy')
    f1_scores = cross_val_score(model, X_train_processed, y_train, cv=cv, scoring='f1')
    auc_scores = cross_val_score(model, X_train_processed, y_train, cv=cv, scoring='roc_auc')
    results[name] = {
        'accuracy': acc_scores.mean(),
        'f1': f1_scores.mean(),
        'auc': auc_scores.mean()
    }
    print(f"{name:<25} {acc_scores.mean():.4f}±{acc_scores.std():.4f}"
          f" {f1_scores.mean():.4f}±{f1_scores.std():.4f}"
          f" {auc_scores.mean():.4f}±{auc_scores.std():.4f}")
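The comparison above scores each model on features that were preprocessed once, before the folds were split, and it runs cross-validation three times per model. One possible variation, sketched here on synthetic stand-in data, puts the preprocessing and the model in a single Pipeline so the scaler is refit inside every fold, and uses `cross_validate` to collect several metrics in one pass. The dataset and variable names are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in data; not the churn dataset
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, weights=[0.7, 0.3], random_state=42
)

# Scaler and model live in one Pipeline, so the scaler is refit on each training fold
clf = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('model', LogisticRegression(max_iter=1000)),
])

cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_validate(
    clf, X_demo, y_demo, cv=cv_demo,
    scoring=['accuracy', 'f1', 'roc_auc'],
)

# cross_validate returns one array of per-fold scores per metric, keyed 'test_<metric>'
for metric in ['accuracy', 'f1', 'roc_auc']:
    fold_scores = scores[f'test_{metric}']
    print(f"{metric:<10} {fold_scores.mean():.4f} ± {fold_scores.std():.4f}")
```

In the project, the same idea would mean passing a Pipeline built from `full_pipeline` and each candidate model to the cross-validation call, with the raw `X_train` as input.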
Step 5: Hyperparameter Tuning
Definition: Hyperparameter Tuning
The process of finding the optimal configuration of a model's hyperparameters (parameters not learned from data) that maximizes performance on a validation set. Methods include grid search (exhaustive), random search (sampled), and Bayesian optimization (sequential).
# Tune the best model (XGBoost)
param_grid = {
'n_estimators': [100, 200, 300],
'max_depth': [3, 5, 7],
'learning_rate': [0.01, 0.05, 0.1],
'subsample': [0.8, 0.9],
'colsample_bytree': [0.8, 0.9],
'min_child_weight': [1, 3, 5]
}
from sklearn.model_selection import RandomizedSearchCV
# use_label_encoder is deprecated in recent XGBoost releases, so it is omitted
xgb_model = XGBClassifier(
    eval_metric='logloss',
    random_state=42
)
random_search = RandomizedSearchCV(
xgb_model,
param_distributions=param_grid,
n_iter=20,
cv=cv,
scoring='roc_auc',
random_state=42,
n_jobs=-1,
verbose=1
)
random_search.fit(X_train_processed, y_train)
print(f"\nBest parameters: {random_search.best_params_}")
print(f"Best AUC: {random_search.best_score_:.4f}")
# Train best model
best_model = random_search.best_estimator_
RandomizedSearchCV is preferred over GridSearchCV when the hyperparameter space is large. It samples a fixed number of parameter combinations (n_iter) and is often more efficient because it doesn't exhaustively evaluate all combinations. For even more efficient tuning, consider Bayesian optimization (e.g., Optuna, Hyperopt) which uses past results to guide the search.
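To make the cost difference concrete, the sketch below counts the combinations in the grid used above and then runs a small random search over a continuous distribution. The second half uses a logistic regression on synthetic data purely as a fast stand-in; the model, data, and range for `C` are assumptions for the example.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ParameterGrid, RandomizedSearchCV

# The same grid as in the tuning step: every combination GridSearchCV would fit
grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.05, 0.1],
    'subsample': [0.8, 0.9],
    'colsample_bytree': [0.8, 0.9],
    'min_child_weight': [1, 3, 5],
}
n_combinations = len(ParameterGrid(grid))  # 3 * 3 * 3 * 2 * 2 * 3
print(f"Grid combinations: {n_combinations}")
print(f"Fits with 5-fold CV: grid={n_combinations * 5}, random search (n_iter=20)={20 * 5}")

# Random search can also sample from continuous distributions instead of fixed lists
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)
search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions={'C': loguniform(1e-3, 1e2)},  # hypothetical search range
    n_iter=10, cv=3, scoring='roc_auc', random_state=0,
)
search.fit(X_demo, y_demo)
print(f"Sampled candidates: {len(search.cv_results_['params'])}")
print(f"Best C: {search.best_params_['C']:.4f}")
```

Sampling from a distribution such as `loguniform` lets the search try values between the grid points, which a fixed list cannot do.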
Step 6: Model Evaluation
Comprehensive Model Evaluation
def comprehensive_evaluation(model, X_test, y_test, model_name='Model'):
    """Full model evaluation with visualizations."""
    y_pred = model.predict(X_test)
    y_prob = model.predict_proba(X_test)[:, 1]
    metrics = {
        'accuracy': accuracy_score(y_test, y_pred),
        'precision': precision_score(y_test, y_pred),
        'recall': recall_score(y_test, y_pred),
        'f1': f1_score(y_test, y_pred),
        'auc': roc_auc_score(y_test, y_prob)
    }
    print(f"\n{model_name} - Test Set Performance:")
    print("=" * 45)
    for metric, value in metrics.items():
        print(f" {metric.capitalize():<12}: {value:.4f}")
    print(f"\nClassification Report:")
    print(classification_report(y_test, y_pred, target_names=['No Churn', 'Churned']))

    # Visualizations
    fig, axes = plt.subplots(1, 3, figsize=(15, 4))

    # Confusion matrix
    cm = confusion_matrix(y_test, y_pred)
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=axes[0],
                xticklabels=['No Churn', 'Churned'],
                yticklabels=['No Churn', 'Churned'])
    axes[0].set_title('Confusion Matrix')
    axes[0].set_ylabel('Actual')
    axes[0].set_xlabel('Predicted')

    # ROC curve
    fpr, tpr, _ = roc_curve(y_test, y_prob)
    axes[1].plot(fpr, tpr, 'b-', linewidth=2, label=f'AUC = {metrics["auc"]:.3f}')
    axes[1].plot([0, 1], [0, 1], 'k--', alpha=0.5)
    axes[1].set_xlabel('False Positive Rate')
    axes[1].set_ylabel('True Positive Rate')
    axes[1].set_title('ROC Curve')
    axes[1].legend()

    # Feature importances (tree-based models only)
    if hasattr(model, 'feature_importances_'):
        importance = model.feature_importances_
        feature_names = [f'Feature {i}' for i in range(len(importance))]
        indices = np.argsort(importance)[-15:]
        axes[2].barh(range(len(indices)), importance[indices], color='steelblue')
        axes[2].set_yticks(range(len(indices)))
        axes[2].set_yticklabels([feature_names[i] for i in indices])
        axes[2].set_title('Top 15 Feature Importances')

    plt.tight_layout()
    plt.savefig(f'{model_name.lower()}_evaluation.png', dpi=150)
    plt.show()
    return metrics
test_metrics = comprehensive_evaluation(best_model, X_test_processed, y_test, 'XGBoost Tuned')
Step 7: Model Interpretation
import shap
# SHAP values for model interpretation
explainer = shap.TreeExplainer(best_model)
shap_values = explainer.shap_values(X_test_processed[:100])
# Summary plot
plt.figure(figsize=(10, 8))
shap.summary_plot(
shap_values,
X_test_processed[:100],
feature_names=[f'f{i}' for i in range(X_test_processed.shape[1])],
show=False
)
plt.tight_layout()
plt.savefig('shap_summary.png', dpi=150)
plt.show()
# Feature importance from model
feature_importance = pd.DataFrame({
'feature': [f'f{i}' for i in range(X_train_processed.shape[1])],
'importance': best_model.feature_importances_
}).sort_values('importance', ascending=False)
print("\nTop 10 Most Important Features:")
print(feature_importance.head(10).to_string(index=False))
SHAP (SHapley Additive exPlanations) provides a unified measure of feature importance based on game theory. Each feature's SHAP value is its average marginal contribution to a prediction, taken over all possible feature subsets. Unlike permutation importance or tree-based importance, which summarize the model as a whole, SHAP values are locally faithful: they explain individual predictions and give both the magnitude and the direction of each feature's impact.
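The "average marginal contribution over all feature subsets" idea can be computed by brute force for a tiny example. The sketch below does not use the `shap` library; the three-feature scoring function and its numbers are invented to keep the arithmetic easy to follow.

```python
from itertools import combinations
from math import factorial

def shapley_values(value_fn, n_features):
    """Exact Shapley values by enumerating every subset of the other features."""
    phi = [0.0] * n_features
    for i in range(n_features):
        others = [j for j in range(n_features) if j != i]
        for size in range(len(others) + 1):
            # Weight given to coalitions of this size in the Shapley average
            weight = (factorial(size) * factorial(n_features - size - 1)
                      / factorial(n_features))
            for subset in combinations(others, size):
                with_i = value_fn(frozenset(subset) | {i})
                without_i = value_fn(frozenset(subset))
                phi[i] += weight * (with_i - without_i)
    return phi

# Hypothetical toy "model": feature 0 adds 0.30, feature 1 adds 0.10,
# and features 0 and 2 together add an extra 0.20 interaction
BASE_RATE = 0.10

def toy_model(present):
    score = BASE_RATE
    if 0 in present:
        score += 0.30
    if 1 in present:
        score += 0.10
    if 0 in present and 2 in present:
        score += 0.20
    return score

phi = shapley_values(toy_model, n_features=3)
full_prediction = toy_model(frozenset({0, 1, 2}))

for i, value in enumerate(phi):
    print(f"feature {i}: {value:+.2f}")
print(f"base + sum of contributions = {BASE_RATE + sum(phi):.2f}")
print(f"full prediction             = {full_prediction:.2f}")
```

The 0.20 interaction is split evenly between features 0 and 2, and the contributions add up to the difference between the full prediction and the base rate. Enumerating subsets grows exponentially with the number of features, which is why libraries rely on model-specific algorithms such as `TreeExplainer` rather than this direct approach.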
Step 8: Build Production-Ready Pipeline
Definition: Model Serialization
The process of saving a trained model to disk so it can be loaded and used for predictions without retraining. Common formats include pickle (Python's built-in serializer), joblib (pickle-based and efficient for objects holding large NumPy arrays), and ONNX (cross-platform).
import joblib
from datetime import datetime
# Create complete pipeline with model
production_pipeline = Pipeline(steps=[
('preprocessing', full_pipeline),
('classifier', best_model)
])
# Fit on full training data
production_pipeline.fit(X_train, y_train)
# Verify on test set
y_pred_prod = production_pipeline.predict(X_test)
print(f"Production pipeline test accuracy: {accuracy_score(y_test, y_pred_prod):.4f}")
# Save pipeline
timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
model_path = f'churn_model_{timestamp}.joblib'
joblib.dump(production_pipeline, model_path)
print(f"\nModel saved to: {model_path}")
# Load and verify
loaded_pipeline = joblib.load(model_path)
y_pred_loaded = loaded_pipeline.predict(X_test)
assert np.array_equal(y_pred_prod, y_pred_loaded), "Loaded model produces different predictions!"
print("Loaded model verified - predictions match")
Step 9: Prediction Function
def predict_churn(customer_data, pipeline):
    """
    Predict churn probability for a new customer.

    Parameters:
    -----------
    customer_data : dict or DataFrame
        Customer features (if a DataFrame has several rows, only the first is scored)
    pipeline : fitted Pipeline
        Complete ML pipeline

    Returns:
    --------
    dict with prediction, probability, and risk level
    """
    if isinstance(customer_data, dict):
        customer_df = pd.DataFrame([customer_data])
    else:
        customer_df = customer_data.copy()

    prediction = pipeline.predict(customer_df)[0]
    # Cast to a plain Python float so the result is JSON-serializable
    probability = float(pipeline.predict_proba(customer_df)[0][1])

    if probability < 0.3:
        risk_level = 'LOW'
    elif probability < 0.6:
        risk_level = 'MEDIUM'
    else:
        risk_level = 'HIGH'

    return {
        'churn_prediction': bool(prediction),
        'churn_probability': round(probability, 4),
        'risk_level': risk_level
    }
# Example prediction
sample_customer = {
'tenure_months': 3,
'monthly_charges': 85.0,
'total_charges': 255.0,
'num_support_tickets': 5,
'num_referrals': 0,
'avg_monthly_usage_gb': 15.0,
'contract_type': 'Month-to-Month',
'payment_method': 'Electronic Check',
'has_partner': 0,
'has_dependents': 0
}
result = predict_churn(sample_customer, production_pipeline)
print("\nSample Prediction:")
print(f" Customer: New subscriber, 3 months, $85/month")
print(f" Prediction: {'Will Churn' if result['churn_prediction'] else 'Will Stay'}")
print(f" Probability: {result['churn_probability']:.2%}")
print(f" Risk Level: {result['risk_level']}")
Step 10: Model Monitoring
Definition: Model Drift
A degradation in model performance over time caused by changes in the underlying data distribution (data drift) or changes in the relationship between features and target (concept drift). Regular monitoring is essential to detect when a model needs retraining.
class ModelMonitor:
    """Monitor model performance and data drift."""

    def __init__(self, pipeline, reference_data):
        self.pipeline = pipeline
        # Keep the reference sample itself: the KS test compares two samples,
        # not summary statistics
        self.reference_data = reference_data.copy()
        self.reference_stats = self._compute_stats(reference_data)
        self.predictions_log = []

    def _compute_stats(self, data):
        """Compute reference statistics."""
        numeric_cols = data.select_dtypes(include=[np.number]).columns
        return {
            'means': data[numeric_cols].mean().to_dict(),
            'stds': data[numeric_cols].std().to_dict(),
            'missing_rates': data.isnull().mean().to_dict()
        }

    def check_data_drift(self, new_data, alpha=0.05):
        """Check numeric columns for data drift using the two-sample KS test."""
        from scipy.stats import ks_2samp
        drift_detected = {}
        numeric_cols = new_data.select_dtypes(include=[np.number]).columns
        for col in numeric_cols:
            if col in self.reference_stats['means']:
                stat, p_value = ks_2samp(
                    self.reference_data[col].dropna(),
                    new_data[col].dropna()
                )
                drift_detected[col] = {
                    'ks_statistic': stat,
                    'p_value': p_value,
                    'drifted': p_value < alpha
                }
        return drift_detected

    def log_prediction(self, features, prediction, probability):
        """Log prediction for monitoring."""
        self.predictions_log.append({
            'timestamp': datetime.now(),
            'features': features,
            'prediction': prediction,
            'probability': probability
        })

    def performance_report(self):
        """Generate performance report."""
        if not self.predictions_log:
            return "No predictions logged yet."
        df_log = pd.DataFrame(self.predictions_log)
        report = {
            'total_predictions': len(df_log),
            'avg_churn_probability': df_log['probability'].mean(),
            # Same cutoff as the HIGH risk level in predict_churn
            'high_risk_count': (df_log['probability'] >= 0.6).sum(),
            'prediction_distribution': df_log['prediction'].value_counts().to_dict()
        }
        return report
# Initialize monitor
monitor = ModelMonitor(production_pipeline, X_train)
# Simulate incoming predictions
for i in range(10):
    sample = X_test.iloc[i:i+1]
    pred = production_pipeline.predict(sample)[0]
    prob = production_pipeline.predict_proba(sample)[0][1]
    monitor.log_prediction(sample.to_dict(), pred, prob)
print("Monitoring Report:")
print(monitor.performance_report())
The Kolmogorov-Smirnov (KS) test compares the distribution of a feature in the reference (training) data against new incoming data. A small p-value (p < 0.05) is evidence that the distribution has shifted, suggesting data drift. Monitoring prediction distributions (are probabilities drifting toward 0 or 1?) and performance metrics over time helps detect concept drift.
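A minimal sketch of the KS check on a single feature follows. The reference sample and the two incoming batches are synthetic, and the size of the shift is an assumption chosen so the effect is easy to see.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)

# Hypothetical monthly charges: a reference (training) sample and two production batches
reference = rng.normal(loc=65, scale=25, size=2000)
batches = {
    'same_dist': rng.normal(loc=65, scale=25, size=500),  # drawn from the same distribution
    'shifted': rng.normal(loc=80, scale=25, size=500),    # mean moved up by 15
}

drift_results = {}
for name, batch in batches.items():
    stat, p_value = ks_2samp(reference, batch)
    drift_results[name] = {'ks_statistic': stat, 'p_value': p_value}
    print(f"{name:<10} KS={stat:.3f} p={p_value:.4f} drifted={p_value < 0.05}")
```

Two practical caveats: testing many features at a 0.05 threshold will flag some by chance, and with very large samples even tiny shifts become statistically significant, so it helps to look at the KS statistic itself as well as the p-value.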
Complete Pipeline Summary
Project Deliverables:
┌──────────────────────────────────────────────────────────────┐
│ ✓ EDA Report            - Data understanding & visualization │
│ ✓ Feature Pipeline      - Automated preprocessing            │
│ ✓ Model Comparison      - 5 algorithms benchmarked           │
│ ✓ Hyperparameter Tuning - RandomizedSearchCV optimized       │
│ ✓ Evaluation Report     - Metrics, confusion matrix, ROC     │
│ ✓ Model Interpretation  - SHAP values, feature importance    │
│ ✓ Production Pipeline   - Saved with joblib                  │
│ ✓ Prediction Function   - Ready for API integration          │
│ ✓ Monitoring System     - Data drift & performance tracking  │
└──────────────────────────────────────────────────────────────┘
Key Takeaways
- Pipelines prevent data leakage: Always preprocess within cross-validation folds
- Feature engineering matters: Domain-specific features often outperform raw data
- Compare multiple models: No single algorithm wins everywhere
- Tune systematically: RandomizedSearchCV balances speed and quality
- Evaluate comprehensively: Multiple metrics reveal different strengths
- Interpret with SHAP: Explain why the model makes predictions
- Monitor in production: Data drift degrades model performance over time
- Version everything: Timestamp models and track experiments
Summary: End-to-End ML Pipeline
- An ML pipeline automates the workflow from raw data to predictions, ensuring reproducibility and preventing data leakage.
- EDA (exploratory data analysis) should always precede modeling: understand distributions, class balance, missing values, and feature types.
- Feature engineering (domain-specific ratios, aggregations, interactions) often improves model performance more than algorithm choice.
- ColumnTransformer and Pipeline encapsulate all preprocessing steps, ensuring consistent transformation of training and test data.
- Model comparison via cross-validation (StratifiedKFold) with multiple metrics (accuracy, F1, AUC) reveals the best algorithm for your problem.
- RandomizedSearchCV efficiently tunes hyperparameters by sampling the parameter space, and is preferred over GridSearchCV for large search spaces.
- SHAP values provide locally faithful feature importance based on game theory, and are better suited than permutation importance to interpreting individual predictions.
- Production pipelines saved with joblib ensure that the exact same preprocessing is applied at inference time.
- Model monitoring (KS test for data drift, prediction distribution tracking) is essential for maintaining performance in production.
- Version control (timestamped models, experiment tracking) enables rollback and reproducibility.
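As one way to act on the versioning point, the sketch below stores a small metadata file next to the serialized model. The metadata fields, the stand-in model, and the temporary directory are all assumptions for illustration; in the project the saved object would be `production_pipeline`.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path

import joblib
import sklearn
from sklearn.linear_model import LogisticRegression

# Hypothetical stand-in model; in the project this would be production_pipeline
model = LogisticRegression().fit([[0.0], [1.0], [2.0], [3.0]], [0, 0, 1, 1])

out_dir = Path(tempfile.mkdtemp())
stamp = datetime.now(timezone.utc).strftime('%Y%m%d_%H%M%S')
model_file = out_dir / f'churn_model_{stamp}.joblib'
meta_file = model_file.with_suffix('.json')

joblib.dump(model, model_file)

# Hypothetical metadata fields; record whatever you need to reproduce or roll back
metadata = {
    'model_file': model_file.name,
    'created_utc': stamp,
    'sklearn_version': sklearn.__version__,
    'notes': 'add evaluation metrics and training data reference here',
}
meta_file.write_text(json.dumps(metadata, indent=2))

# Reload both files and confirm the model still predicts
restored = joblib.load(model_file)
restored_meta = json.loads(meta_file.read_text())
print(restored_meta['sklearn_version'], restored.predict([[2.5]]))
```

Recording the library version matters because a model saved with one scikit-learn version is not guaranteed to load correctly in another.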
Practice Extensions
- Add more features: Create interaction features, time-based aggregations
- Handle imbalanced data: Apply SMOTE from Lesson 21. Does it improve recall?
- Build API: Wrap the prediction function in a Flask/FastAPI endpoint
- A/B Testing: Compare model v1 vs v2 performance in production
- Automate Retraining: Schedule weekly retraining with new data