What Is Data Storytelling?
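Data storytelling is a key concept in Data Science Fundamentals: the practice of presenting the results of an analysis so that an audience can understand and act on them. Without it, even an accurate data science solution is hard to put to use.
Core Idea: Data storytelling provides a systematic approach to communication problems: learn the patterns in the data, then present those patterns, and the evidence for them, in a form that supports predictions or decisions.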
Key Concepts
Core Concepts in Data Storytelling
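Data storytelling addresses communication problems by giving a structured way to extract insights from data and present them. The example in this section builds its story on a supervised learning model, so the concepts below are those of model fitting and evaluation.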
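Mathematical Foundation:
The objective is to minimise a loss function $L(\theta)$ over parameters $\theta$:
$$\theta^{*} = \arg\min_{\theta} L(\theta)$$
Using gradient descent with learning rate $\eta$:
$$\theta_{t+1} = \theta_{t} - \eta \, \nabla_{\theta} L(\theta_{t})$$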
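As a minimal sketch of the gradient descent idea above, the snippet below fits a one-parameter model to toy data. The data, the names `theta`, `lr` and `steps`, and the choice of mean squared error as the loss are all assumptions made for illustration; they do not come from the original text.

```python
# Toy data generated from y = 3x, so the loss is minimised at theta = 3.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 6.0, 9.0, 12.0]


def loss(theta):
    """Mean squared error of the one-parameter model y_hat = theta * x."""
    return sum((theta * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def grad(theta):
    """Derivative of the loss with respect to theta."""
    return sum(2 * (theta * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def gradient_descent(theta0=0.0, lr=0.05, steps=200):
    """Repeatedly step against the gradient, starting from theta0."""
    theta = theta0
    for _ in range(steps):
        theta -= lr * grad(theta)
    return theta


theta_hat = gradient_descent()
print(f"theta = {theta_hat:.4f}, loss = {loss(theta_hat):.6f}")
```

The learning rate matters: for this particular loss, a value much larger than 0.1 would make the updates overshoot and diverge instead of settling at the minimum.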
Key Properties:
| Property | Description | Impact |
|---|---|---|
| Flexibility | Adapts to complex patterns | High expressive power |
| Regularisation | Controls model complexity | Prevents overfitting |
| Interpretability | Explains predictions | Trust and debugging |
| Scalability | Handles large datasets | Production readiness |
| Robustness | Stable under noise | Reliable predictions |
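To make the Regularisation row concrete, here is a small sketch that assumes scikit-learn's `LogisticRegression` with its default L2 penalty, where `C` is the inverse regularisation strength. The dataset and the three values of `C` are illustrative choices, not part of the original material.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_s = StandardScaler().fit_transform(X)

# In scikit-learn, C is the inverse regularisation strength:
# a smaller C means a stronger penalty and smaller coefficients.
coef_norms = {}
for C in [0.01, 1.0, 100.0]:
    clf = LogisticRegression(C=C, max_iter=5000).fit(X_s, y)
    coef_norms[C] = float(np.linalg.norm(clf.coef_))
    print(f"C={C:<6} coefficient norm={coef_norms[C]:.3f}")
```

The coefficient norm grows as `C` grows, which is one simple way to show an audience what "controls model complexity" means in practice.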
Workflow:
1. Data collection and cleaning
2. Feature engineering and selection
3. Model training with cross-validation
4. Hyperparameter optimisation
5. Final evaluation on held-out test set
6. Deployment and monitoring
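The workflow steps above can be sketched with a scikit-learn `Pipeline`. This is one possible arrangement under stated assumptions: a bundled dataset stands in for data collection, logistic regression stands in for the model, and the grid of `C` values is arbitrary. Step 6 (deployment and monitoring) is outside the scope of the sketch.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Step 1: a bundled dataset stands in for real collection and cleaning
X, y = load_breast_cancer(return_X_y=True)

# Set the test set aside before any fitting so step 5 stays untouched
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Step 2: feature preparation lives inside the pipeline, so each CV fold
# fits the scaler on its own training portion only
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Steps 3 and 4: cross-validated training with a small hyperparameter search
search = GridSearchCV(pipe, {"clf__C": [0.1, 1.0, 10.0]}, cv=5, scoring="f1")
search.fit(X_train, y_train)

# Step 5: final evaluation on the held-out test set
test_f1 = search.score(X_test, y_test)
print(f"Best C: {search.best_params_['clf__C']}")
print(f"CV F1: {search.best_score_:.4f}  Test F1: {test_f1:.4f}")
```

Putting the scaler inside the pipeline is a design choice: it keeps validation folds from influencing the scaling statistics, which makes the cross-validated score a more honest estimate.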
Python Implementation
import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report

# Silence library warnings to keep the tutorial output short
warnings.filterwarnings("ignore")
# Load example dataset
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target
# Prepare data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)
print(f"Dataset: {X.shape[0]} samples, {X.shape[1]} features")
print(f"Class distribution: {dict(pd.Series(y).value_counts())}")
print(f"Train / Test split: {len(X_train)} / {len(X_test)}")
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
# Try multiple models and compare
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000, random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "SVM (RBF)": SVC(kernel="rbf", probability=True, random_state=42),
}
results = {}
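for name, clf in models.items():
    # 5-fold cross-validated F1 on the training set
    cv = cross_val_score(clf, X_train_s, y_train, cv=5, scoring="f1")
    clf.fit(X_train_s, y_train)
    test_acc = accuracy_score(y_test, clf.predict(X_test_s))
    results[name] = {"CV F1": cv.mean(), "Test Acc": test_acc}
    print(f"{name:<25} CV F1={cv.mean():.4f} Test={test_acc:.4f}")
# Keep the model with the best cross-validated F1 rather than assuming a winner
best_name = max(results, key=lambda n: results[n]["CV F1"])
model = models[best_name]
print(f"Selected model: {best_name}")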
Evaluation & Results
# Evaluate model performance
y_pred = model.predict(X_test_s)
print(f"Accuracy : {accuracy_score(y_test, y_pred):.4f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred,
                            target_names=data.target_names))
# Cross-validation for robust estimate
cv_scores = cross_val_score(model, X_train_s, y_train, cv=5, scoring="f1")
print(f"\n5-Fold CV F1: {cv_scores.mean():.4f} ± {cv_scores.std():.4f}")
# Visualise results
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
# Confusion matrix
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
cm = confusion_matrix(y_test, y_pred)
ConfusionMatrixDisplay(cm, display_labels=data.target_names).plot(ax=axes[0])
axes[0].set_title("Confusion Matrix")
# CV scores
axes[1].bar(range(1, len(cv_scores) + 1), cv_scores, color="#3b82f6", edgecolor="white")
axes[1].axhline(cv_scores.mean(), color="red", linestyle="--",
                label=f"Mean={cv_scores.mean():.4f}")
axes[1].set_xlabel("Fold")
axes[1].set_ylabel("F1 Score")
axes[1].set_title("Cross-Validation Scores")
axes[1].legend()
axes[1].grid(True, alpha=0.3, axis="y")
plt.tight_layout()
plt.show()
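A confusion matrix is easier for a non-specialist audience to absorb when it is accompanied by plain sentences. The helper below is a hypothetical sketch, not part of scikit-learn; `describe_confusion` is a name invented here, and the counts in `example_cm` are illustrative numbers rather than results from the model above. It assumes the scikit-learn layout, where rows are true classes and columns are predicted classes.

```python
def describe_confusion(cm, labels):
    """Return one plain-language sentence per class.

    cm is a square matrix (nested lists or a NumPy array) whose rows are
    true classes and whose columns are predicted classes.
    """
    sentences = []
    for i, label in enumerate(labels):
        total = int(sum(cm[i]))
        correct = int(cm[i][i])
        rate = correct / total if total else 0.0
        sentences.append(
            f"{label}: {correct} of {total} cases classified correctly "
            f"({rate:.0%}), {total - correct} misclassified."
        )
    return sentences


# Illustrative counts only, not results from the tutorial's model
example_cm = [[40, 2], [1, 71]]
for line in describe_confusion(example_cm, ["malignant", "benign"]):
    print(line)
```

With the tutorial's variables, the call would be `describe_confusion(cm, data.target_names)`, since `confusion_matrix` returns rows in the same class order as `data.target_names`.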
Comparison with Related Methods
| Method | Strengths | Weaknesses | Best For |
|---|---|---|---|
| Data Storytelling | Makes model results understandable to non-specialists | Depends on a sound underlying analysis | Communicating findings |
| Random Forest | Robust, handles missing data | Slow inference | Tabular data |
| XGBoost | High accuracy, fast | Many hyperparameters | Competitions, production |
| Logistic Reg. | Interpretable, fast | Linear boundary only | Binary baseline |
| SVM | Good in high-dim | Slow on large data | Text, images |
Hyperparameter Tuning
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
# C, gamma and kernel are SVC hyperparameters, so tune a fresh SVC
# rather than whichever model was selected earlier
param_grid = {
    "C": [0.01, 0.1, 1.0, 10.0],
    "gamma": ["scale", "auto"],
    "kernel": ["rbf", "linear"],
}
grid = GridSearchCV(SVC(random_state=42), param_grid, cv=5,
                    scoring="f1", n_jobs=-1, verbose=0)
grid.fit(X_train_s, y_train)
print(f"Best params : {grid.best_params_}")
print(f"Best CV F1  : {grid.best_score_:.4f}")
print(f"Test F1     : {grid.score(X_test_s, y_test):.4f}")
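Reporting only the best parameters hides how close the alternatives were. As a sketch, the search results can be turned into a ranked table through the `cv_results_` attribute of `GridSearchCV`. The pipeline, the reduced grid and the column selection below are assumptions made to keep the example short and self-contained.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# make_pipeline names the SVC step "svc", hence the "svc__" prefix below
pipe = make_pipeline(StandardScaler(), SVC())
param_grid = {"svc__C": [0.1, 1.0, 10.0], "svc__kernel": ["rbf", "linear"]}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="f1")
search.fit(X_train, y_train)

# One row per parameter combination, best-ranked first
columns = [
    "param_svc__C",
    "param_svc__kernel",
    "mean_test_score",
    "std_test_score",
    "rank_test_score",
]
table = (
    pd.DataFrame(search.cv_results_)[columns]
    .sort_values("rank_test_score")
    .reset_index(drop=True)
)
print(table.to_string(index=False))
```

Showing the standard deviation next to the mean lets readers judge whether the top-ranked setting is clearly better or only marginally ahead.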
Key Takeaways
- Data storytelling turns model results into findings an audience can understand and act on
- Always scale features before applying distance-based or regularised methods
- Use cross-validation — never evaluate on the same data used for training
- Start simple — a strong baseline prevents over-engineering
- Visualise everything — confusion matrices, learning curves, feature importances
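As one example of the last takeaway, the sketch below ranks a random forest's impurity-based feature importances and puts the finding in the chart title rather than a generic label. The model settings, the choice of five features and the title wording are illustrative assumptions. The figure is created but not displayed; call `plt.show()` to view it.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(data.data, data.target)

# Rank features by impurity-based importance and keep the top five
ranked = sorted(
    zip(data.feature_names, forest.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)[:5]
names = [str(name) for name, _ in ranked]
scores = [float(score) for _, score in ranked]

fig, ax = plt.subplots(figsize=(7, 3))
# Reverse the order so the most important feature appears at the top
ax.barh(names[::-1], scores[::-1], color="#3b82f6")
ax.set_xlabel("Impurity-based importance")
# State the takeaway in the title instead of a generic label
ax.set_title(f"'{names[0]}' contributes most to the forest's splits")
fig.tight_layout()
```

Impurity-based importances describe how the forest used each feature during training; they are a starting point for discussion, not proof that a feature causes the outcome.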