Why OOP for Data Science?
Object-Oriented Programming (OOP) provides abstraction, encapsulation, and composability, all essential properties for building complex data systems. While data science code often uses functional patterns, OOP is the backbone of major frameworks like scikit-learn, PyTorch, and TensorFlow.
OOP vs Functional for Data Science
| Functional approach | OOP approach |
|---|---|
| def clean(df): return ... | class Cleaner: with fit(X) and transform() |
| def model(df): return ... | class Pipeline: holding [Cleaner, Model] |
| model(clean(data)) | pipe.fit_transform() |
| Simple for scripts | Composable systems |
| Hard to reuse across projects | scikit-learn compatible |
| No shared state | State management, type checking |
Classes and Objects
A class is a blueprint; an object is an instance. Formally, a class C defines a type with methods that operate on instances of C.
Definition: Class
A blueprint for creating objects that defines a set of attributes (data) and methods (behavior). Formally, a class C is a tuple (A, M) where A is a set of attributes and M is a set of methods that operate on instances of C.
class Dataset:
    """A simple dataset class for data science.

    Mathematical representation:
    A dataset D is a tuple (X, y) where:
    - X ∈ ℝ^(n×d) (feature matrix, n samples, d features)
    - y ∈ ℝ^n or {0, 1, ..., k-1}^n (target vector)
    """

    # Class variable (shared across all instances)
    dataset_count = 0

    def __init__(self, features, targets, name="unnamed"):
        """Initialize dataset.

        Parameters
        ----------
        features : list of lists or np.array
            Feature matrix X of shape (n_samples, n_features)
        targets : list or np.array
            Target vector y of shape (n_samples,)
        name : str
            Dataset identifier
        """
        self.features = features  # Instance variable
        self.targets = targets
        self.name = name
        self.n_samples = len(features)
        # len() works for both lists and arrays; `if features` would raise on a NumPy array
        self.n_features = len(features[0]) if len(features) > 0 else 0
        Dataset.dataset_count += 1  # Modify class variable

    def __repr__(self):
        """Developer-friendly string representation."""
        return (f"Dataset(name='{self.name}', "
                f"n_samples={self.n_samples}, "
                f"n_features={self.n_features})")

    def __str__(self):
        """User-friendly string representation."""
        return f"Dataset '{self.name}': {self.n_samples} samples × {self.n_features} features"

    def summary(self):
        """Compute per-feature dataset statistics."""
        import numpy as np
        X = np.array(self.features)
        return {
            'mean': X.mean(axis=0).tolist(),
            'std': X.std(axis=0).tolist(),
            'min': X.min(axis=0).tolist(),
            'max': X.max(axis=0).tolist()
        }


# Creating objects (instances)
ds1 = Dataset([[1, 2], [3, 4], [5, 6]], [0, 1, 0], name="iris")
ds2 = Dataset([[7, 8], [9, 10]], [1, 0], name="wine")
print(ds1)                    # Dataset 'iris': 3 samples × 2 features
print(repr(ds1))              # Dataset(name='iris', n_samples=3, n_features=2)
print(Dataset.dataset_count)  # 2 (class variable shared)
Class variables are shared across all instances and are defined outside __init__. Instance variables are unique to each object and are defined with self. in __init__. Understanding this distinction is critical for avoiding subtle bugs in data science codebases.
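One of the subtle bugs this distinction can cause is a mutable class variable that is accidentally shared. The sketch below uses two hypothetical classes, SharedTagsRun and OwnTagsRun (neither appears elsewhere in this article), to show the difference.

```python
class SharedTagsRun:
    # `tags` is a class variable: one list object shared by every instance.
    tags = []

    def __init__(self, name):
        self.name = name

    def add_tag(self, tag):
        # No instance attribute named `tags` exists, so this mutates the
        # class-level list that all instances see.
        self.tags.append(tag)


class OwnTagsRun:
    def __init__(self, name):
        self.name = name
        self.tags = []  # instance variable: a fresh list per object

    def add_tag(self, tag):
        self.tags.append(tag)


a, b = SharedTagsRun("a"), SharedTagsRun("b")
a.add_tag("baseline")
print(b.tags)  # ['baseline']  (b sees the tag added through a)

c, d = OwnTagsRun("c"), OwnTagsRun("d")
c.add_tag("baseline")
print(d.tags)  # []
```

A reasonable rule of thumb is to keep class variables for immutable constants and counters, and to create lists, dicts, and arrays inside __init__.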
The self Parameter
self is a reference to the current instance of the class. It is how Python binds a method call to an object: obj.method() passes obj as self, which tells the method which object's data to operate on.

Class: Dataset
- Class variables (shared): dataset_count = 2
- Instance ds1 (self = ds1): self.features, self.targets, self.name
- Instance ds2 (self = ds2): self.features, self.targets, self.name

When you call ds1.summary(), Python calls Dataset.summary(ds1), so self = ds1 inside the method.
import math


class Vector:
    """N-dimensional vector with mathematical operations."""

    def __init__(self, values):
        self.values = list(values)
        self.n = len(self.values)

    def dot(self, other):
        """Dot product: v · w = Σ(v_i * w_i)"""
        # self is the first vector, other is the second
        return sum(a * b for a, b in zip(self.values, other.values))

    def norm(self):
        """Euclidean norm: ||v|| = √(Σ v_i²)"""
        return math.sqrt(sum(x**2 for x in self.values))

    def add(self, other):
        """Element-wise addition: (v + w)_i = v_i + w_i"""
        return Vector([a + b for a, b in zip(self.values, other.values)])


v1 = Vector([1, 2, 3])
v2 = Vector([4, 5, 6])
print(v1.dot(v2))         # 32 (1*4 + 2*5 + 3*6)
print(v1.norm())          # 3.7416... (sqrt(14))
print(v1.add(v2).values)  # [5, 7, 9]

# Why self matters:
# When v1.dot(v2) is called, Python translates to:
#     Vector.dot(v1, v2)
# Inside dot(), self = v1, other = v2
Dot Product
$$\mathbf{v} \cdot \mathbf{w} = \sum_{i=1}^{n} v_i w_i$$
Here,
- $\mathbf{v}$ = First vector (self.values)
- $\mathbf{w}$ = Second vector (other.values)
- $n$ = Number of dimensions
Encapsulation
Encapsulation controls access to internal state. Python uses naming conventions (not enforcement) to indicate access levels.
| Convention | Prefix | Meaning | Example |
|---|---|---|---|
| Public | none | Accessible everywhere | self.name |
| Protected | _ | Internal use (convention) | self._data |
| Private | __ | Name-mangled (harder to access) | self.__secret |
Definition: Encapsulation
The bundling of data and methods that operate on that data within a single unit (class), while restricting direct access to some components. This prevents external code from depending on implementation details, enabling independent modification of internal representation.
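To make the three naming conventions concrete, here is a minimal sketch using a hypothetical class named Demo. It shows that a single underscore is only a convention, while a double underscore triggers name mangling: inside the class body, __private is stored as _Demo__private.

```python
class Demo:
    def __init__(self):
        self.public = 1        # public
        self._protected = 2    # "internal" by convention only
        self.__private = 3     # stored on the instance as _Demo__private

    def reveal(self):
        # Inside the class body the short name works, because the
        # compiler rewrites it to self._Demo__private.
        return self.__private


obj = Demo()
print(obj.public)       # 1
print(obj._protected)   # 2  (nothing stops outside access)

try:
    obj.__private       # no mangling happens outside the class body
except AttributeError as exc:
    print("AttributeError:", exc)

print(obj._Demo__private)  # 3  (mangled name is still reachable)
print(obj.reveal())        # 3
```

Name mangling is therefore a guard against accidental clashes (for example in subclasses), not a security boundary.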
import numpy as np


class MLModel:
    """Machine learning model with encapsulated state.

    Encapsulation ensures:
    1. Internal state is harder to corrupt by accident
    2. Public interface is well-defined
    3. Implementation can change without breaking code
    """

    def __init__(self, learning_rate=0.01):
        self.public_param = learning_rate  # Public: accessible everywhere
        self._internal_state = {}          # Protected: convention-only barrier
        self.__weights = None              # Private: name-mangled
        self.__bias = 0.0

    @property
    def weights(self):
        """Read access to weights via property."""
        return self.__weights

    @weights.setter
    def weights(self, value):
        """Controlled write access with validation."""
        if not isinstance(value, np.ndarray):
            raise TypeError("Weights must be numpy array")
        self.__weights = value

    def fit(self, X, y, epochs=100):
        """Train model (modifies internal state)."""
        n_samples, n_features = X.shape
        self.__weights = np.random.randn(n_features) * 0.01
        self._internal_state['epochs'] = epochs
        self._internal_state['losses'] = []
        for epoch in range(epochs):
            predictions = X.dot(self.__weights) + self.__bias
            error = predictions - y
            loss = (error ** 2).mean()
            self._internal_state['losses'].append(loss)
            # Gradient descent on the mean squared error
            self.__weights -= self.public_param * (2 / n_samples) * X.T.dot(error)
            self.__bias -= self.public_param * (2 / n_samples) * error.sum()
        return self

    def predict(self, X):
        """Make predictions."""
        if self.__weights is None:
            raise ValueError("Model not trained. Call fit() first.")
        return X.dot(self.__weights) + self.__bias


np.random.seed(42)
X = np.random.randn(100, 3)
y = X.dot([1, 2, 3]) + np.random.randn(100) * 0.1

model = MLModel(learning_rate=0.1)
model.fit(X, y, epochs=50)
print(f"Learned weights: {model.weights}")  # approximately [1, 2, 3]
print(f"Training loss: {model._internal_state['losses'][-1]:.4f}")

# Accessing the private attribute from outside the class raises AttributeError:
# model.__weights  # AttributeError: 'MLModel' object has no attribute '__weights'
$$w \leftarrow w - \alpha \nabla L$$
The gradient descent update rule used in the MLModel's fit method, where $\alpha$ is the learning rate and $\nabla L$ is the loss gradient.
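For reference, the factor (2/X.shape[0]) in the fit method follows from differentiating the mean squared error. The derivation below is a sketch in which the symbols $n$ (number of samples), $e$ (the error vector), and $\mathbf{1}$ (a vector of ones) are notation introduced here, chosen to match the variables error and X.shape[0] in the code.

```latex
\begin{aligned}
e &= Xw + b\,\mathbf{1} - y
  && \text{(error, as in the code)} \\
L(w, b) &= \frac{1}{n} \sum_{i=1}^{n} e_i^2
  && \text{(mean squared error)} \\
\nabla_w L &= \frac{2}{n}\, X^{\top} e,
  \qquad
  \frac{\partial L}{\partial b} = \frac{2}{n} \sum_{i=1}^{n} e_i \\
w &\leftarrow w - \alpha\, \nabla_w L,
  \qquad
  b \leftarrow b - \alpha\, \frac{\partial L}{\partial b}
\end{aligned}
```

The two update lines correspond term by term to the statements that modify the weights and the bias inside the training loop.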
Inheritance
Inheritance allows creating new classes that extend or modify existing classes. The child class inherits all attributes and methods, and can override or add new behavior.
Definition: Inheritance
A mechanism where a new class (subclass) derives attributes and methods from an existing class (superclass). The subclass inherits the interface of the superclass and can extend or override behavior. Formally, if B is a subclass of A, then every instance of B is also an instance of A (is-a relationship).
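The is-a relationship can be checked directly with isinstance and issubclass. The sketch below uses two hypothetical classes, BaseStep and ScaleStep, purely for illustration.

```python
class BaseStep:
    def fit(self, X):
        return self  # shared behavior inherited by every subclass


class ScaleStep(BaseStep):
    def describe(self):
        return "scales features"  # behavior added by the subclass


step = ScaleStep()

print(isinstance(step, ScaleStep))        # True
print(isinstance(step, BaseStep))         # True: every ScaleStep is a BaseStep
print(issubclass(ScaleStep, BaseStep))    # True
print(isinstance(BaseStep(), ScaleStep))  # False: the relation is one-way

# The inherited method is found by walking the method resolution order.
print([cls.__name__ for cls in ScaleStep.__mro__])
# ['ScaleStep', 'BaseStep', 'object']
print(step.fit([[1.0]]) is step)          # True
```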
Base class: Transformer, which defines fit(X), transform(X), and fit_transform(X).
Derived classes:
- StandardScaler: fit() computes μ and σ; transform() returns (X - μ) / σ
- MinMaxScaler: fit() computes min and max; transform() normalizes
import numpy as np


class Transformer:
    """Base transformer class (sklearn-like interface)."""

    def __init__(self):
        self.is_fitted = False
        self._params = {}

    def fit(self, X):
        """Learn parameters from data. Override in subclass."""
        raise NotImplementedError("Subclasses must implement fit()")

    def transform(self, X):
        """Apply transformation. Override in subclass."""
        raise NotImplementedError("Subclasses must implement transform()")

    def fit_transform(self, X):
        """Fit and transform in one step."""
        return self.fit(X).transform(X)

    def __repr__(self):
        params = ', '.join(f'{k}={v}' for k, v in self._params.items())
        return f"{self.__class__.__name__}({params})"


class StandardScaler(Transformer):
    """Standardize features by removing mean and scaling to unit variance.

    Mathematical: z = (x - μ) / σ
    where:
    - μ = mean of feature
    - σ = standard deviation of feature
    """

    def fit(self, X):
        self._mean = np.mean(X, axis=0)
        self._std = np.std(X, axis=0)
        self._std[self._std == 0] = 1  # Prevent division by zero
        self.is_fitted = True
        self._params = {'method': 'standard'}
        return self

    def transform(self, X):
        if not self.is_fitted:
            raise ValueError("Transformer not fitted. Call fit() first.")
        return (X - self._mean) / self._std


class MinMaxScaler(Transformer):
    """Scale features to [0, 1] range.

    Mathematical: x_scaled = (x - min) / (max - min)
    """

    def fit(self, X):
        self._min = np.min(X, axis=0)
        self._max = np.max(X, axis=0)
        self._range = self._max - self._min
        self._range[self._range == 0] = 1  # Prevent division by zero
        self.is_fitted = True
        self._params = {'method': 'minmax'}
        return self

    def transform(self, X):
        if not self.is_fitted:
            raise ValueError("Transformer not fitted. Call fit() first.")
        return (X - self._min) / self._range


class RobustScaler(Transformer):
    """Scale features using statistics robust to outliers.

    Mathematical: x_scaled = (x - median) / IQR
    where IQR = Q3 - Q1 (interquartile range)
    """

    def fit(self, X):
        self._median = np.median(X, axis=0)
        q75 = np.percentile(X, 75, axis=0)
        q25 = np.percentile(X, 25, axis=0)
        self._iqr = q75 - q25
        self._iqr[self._iqr == 0] = 1  # Prevent division by zero
        self.is_fitted = True
        self._params = {'method': 'robust'}
        return self

    def transform(self, X):
        if not self.is_fitted:
            raise ValueError("Transformer not fitted. Call fit() first.")
        return (X - self._median) / self._iqr


# Usage
np.random.seed(42)
X = np.random.randn(100, 3) * [10, 1, 0.1]  # Different scales

scaler = StandardScaler()
X_standard = scaler.fit_transform(X)
print(f"Standard: mean={X_standard.mean(axis=0).round(4)}, std={X_standard.std(axis=0).round(4)}")
# Each feature now has mean ≈ 0 and std ≈ 1

minmax = MinMaxScaler()
X_minmax = minmax.fit_transform(X)
print(f"MinMax: range=[{X_minmax.min():.4f}, {X_minmax.max():.4f}]")
# MinMax: range=[0.0000, 1.0000]

robust = RobustScaler()
X_robust = robust.fit_transform(X)
print(f"Robust: median={np.median(X_robust, axis=0).round(4)}")
# Each feature now has median ≈ 0
Standard Scaling
$$z = \frac{x - \mu}{\sigma}$$
Here,
- $z$ = Standardized value
- $x$ = Original value
- $\mu$ = Mean of the feature
- $\sigma$ = Standard deviation of the feature
Polymorphism
Polymorphism means the same interface works with different types. In Python, this is structural (duck typing): objects are treated based on their methods, not their class hierarchy.
Definition: Polymorphism
The principle that a single interface can be used with different data types. In Python, duck typing implements structural polymorphism: objects are treated as instances of a type if they implement the required methods, regardless of their actual class hierarchy.
import numpy as np


# Polymorphic function: works with ANY object that has .fit() and .predict()
def evaluate(model, X_train, y_train, X_test, y_test):
    """Evaluate any model that implements fit() and predict().

    This function doesn't care if model is:
    - A LinearRegression
    - A DecisionTree
    - A Neural Network
    - Any custom class with fit/predict
    """
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    mse = np.mean((predictions - y_test) ** 2)
    return {'model': type(model).__name__, 'mse': mse}


class LinearRegression:
    """Simple linear regression: y = Xw + b"""

    def fit(self, X, y):
        # Least-squares solution; equals the normal-equation result
        # w = (X^T X)^{-1} X^T y whenever X^T X is invertible
        X_with_bias = np.column_stack([X, np.ones(len(X))])
        self.w = np.linalg.lstsq(X_with_bias, y, rcond=None)[0]
        return self

    def predict(self, X):
        X_with_bias = np.column_stack([X, np.ones(len(X))])
        return X_with_bias.dot(self.w)


class PolynomialRegression:
    """Polynomial regression: y = X²w + Xw + b"""

    def __init__(self, degree=2):
        self.degree = degree

    def fit(self, X, y):
        X_poly = self._transform(X)
        self.w = np.linalg.lstsq(X_poly, y, rcond=None)[0]
        return self

    def predict(self, X):
        return self._transform(X).dot(self.w)

    def _transform(self, X):
        # Columns: X, X**2, ..., X**degree, then a column of ones for the bias
        return np.column_stack([X**i for i in range(1, self.degree + 1)] +
                               [np.ones(len(X))])


# Generate data
np.random.seed(42)
X = np.random.randn(100, 2)
y = X.dot([3, -2]) + np.random.randn(100) * 0.1
X_train, X_test = X[:80], X[80:]
y_train, y_test = y[:80], y[80:]

# Polymorphism in action: same function, different models
for model in [LinearRegression(), PolynomialRegression(degree=2)]:
    result = evaluate(model, X_train, y_train, X_test, y_test)
    print(f"{result['model']}: MSE = {result['mse']:.4f}")
# Both MSE values should be near the noise variance (0.1**2 = 0.01)
Duck typing in Python means you don't need abstract base classes for polymorphism. As long as objects implement the same method signatures, they can be used interchangeably. This is more flexible than Java's class-based polymorphism but requires careful documentation.
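One way to supply that documentation without giving up duck typing is typing.Protocol, which describes the expected methods structurally. The sketch below is an illustration only: Regressor, MeanModel, NotAModel, and training_mse are hypothetical names that do not appear elsewhere in this article.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class Regressor(Protocol):
    """Documents the interface: anything with fit() and predict()."""

    def fit(self, X, y): ...
    def predict(self, X): ...


class MeanModel:
    """Does not inherit from Regressor, yet satisfies it structurally."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


class NotAModel:
    pass


def training_mse(model: Regressor, X, y) -> float:
    """Fit the model, then return its mean squared error on the same data."""
    preds = model.fit(X, y).predict(X)
    return sum((p - t) ** 2 for p, t in zip(preds, y)) / len(y)


print(isinstance(MeanModel(), Regressor))  # True
print(isinstance(NotAModel(), Regressor))  # False
print(training_mse(MeanModel(), [[0], [1]], [1.0, 3.0]))  # 1.0
```

Note that the runtime isinstance check only verifies that the methods exist, not their signatures; static type checkers perform the stricter comparison.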
Magic Methods (Dunder Methods)
Magic methods let you define how your objects behave with built-in Python operations:
Definition: Magic Method
A special method (double underscore prefix and suffix) that Python calls implicitly for built-in operations. For example, __add__ is called when using the + operator, __len__ for len(), and __str__ for str(). These enable operator overloading and integration with Python's built-in functions.
import numpy as np


class Vector:
    """N-dimensional vector with full operator overloading."""

    def __init__(self, values):
        self.values = np.array(values, dtype=float)

    # String representations
    def __repr__(self):
        """Developer: Vector([1.0, 2.0, 3.0])"""
        return f"Vector({self.values.tolist()})"

    def __str__(self):
        """User: [1.00, 2.00, 3.00]"""
        return f"[{', '.join(f'{v:.2f}' for v in self.values)}]"

    # Length and boolean
    def __len__(self):
        """len(vector) → number of dimensions"""
        return len(self.values)

    def __bool__(self):
        """bool(vector) → True if non-zero"""
        return bool(np.any(self.values))

    # Indexing
    def __getitem__(self, index):
        """vector[i] → scalar value"""
        return self.values[index]

    def __setitem__(self, index, value):
        """vector[i] = value"""
        self.values[index] = value

    # Arithmetic operators
    def __add__(self, other):
        """vector + other"""
        return Vector(self.values + other.values)

    def __sub__(self, other):
        """vector - other"""
        return Vector(self.values - other.values)

    def __mul__(self, scalar):
        """vector * scalar (element-wise)"""
        return Vector(self.values * scalar)

    def __rmul__(self, scalar):
        """scalar * vector"""
        return Vector(self.values * scalar)

    def __truediv__(self, scalar):
        """vector / scalar"""
        return Vector(self.values / scalar)

    def __matmul__(self, other):
        """vector @ other (dot product)"""
        return np.dot(self.values, other.values)

    # Comparison
    def __eq__(self, other):
        """vector == other (within floating-point tolerance)"""
        return np.allclose(self.values, other.values)

    def __lt__(self, other):
        """vector < other (compare norms)"""
        return np.linalg.norm(self.values) < np.linalg.norm(other.values)

    # Hashing and containment
    def __hash__(self):
        return hash(tuple(self.values))

    def __contains__(self, value):
        """value in vector"""
        return value in self.values

    # Context manager (for temporary operations)
    def __enter__(self):
        """with vector as v: ..."""
        self._backup = self.values.copy()
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        """Restore original values on exception"""
        if exc_type is not None:
            self.values = self._backup
        return False  # Do not suppress the exception


v1 = Vector([1, 2, 3])
v2 = Vector([4, 5, 6])
print(repr(v1))                  # Vector([1.0, 2.0, 3.0])
print(str(v1))                   # [1.00, 2.00, 3.00]
print(len(v1))                   # 3
print(v1 + v2)                   # [5.00, 7.00, 9.00]
print(v1 @ v2)                   # 32.0 (dot product)
print(2 * v1)                    # [2.00, 4.00, 6.00]
print(v1[0])                     # 1.0
print(v1 == Vector([1, 2, 3]))   # True
print(2 in v1)                   # True
Operator Overloading
$$a + b \;\Rightarrow\; a.\texttt{\_\_add\_\_}(b)$$
Here,
- $a$ = Left operand (self in __add__)
- $b$ = Right operand (other in __add__)
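The dispatch described above has a fallback: if the left operand's __add__ returns NotImplemented, Python tries the right operand's __radd__. The sketch below illustrates this with a hypothetical Money class that is not part of this article's code.

```python
class Money:
    def __init__(self, amount):
        self.amount = amount

    def __add__(self, other):
        if isinstance(other, Money):
            return Money(self.amount + other.amount)
        if isinstance(other, (int, float)):
            return Money(self.amount + other)
        # Signal "I don't know how"; Python then tries the other operand.
        return NotImplemented

    # Addition is commutative here, so the reflected version is the same.
    __radd__ = __add__

    def __repr__(self):
        return f"Money({self.amount})"


a, b = Money(5), Money(7)
print(a + b)                # Money(12)
print(Money.__add__(a, b))  # Money(12): the call that a + b resolves to
print(3 + a)                # Money(8): int cannot add Money, so a.__radd__(3) runs
print(sum([a, b]))          # Money(12): sum() starts from 0, which also uses __radd__
```

Returning NotImplemented, rather than raising directly, lets Python produce a standard TypeError when neither operand supports the operation.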
Dataclasses: Modern Python
Dataclasses automatically generate __init__, __repr__, __eq__, and more, which significantly reduces boilerplate.
from dataclasses import dataclass, field
from typing import List, Optional
import numpy as np


@dataclass
class DataSplit:
    """A train/validation/test split with metadata.

    @dataclass automatically generates:
    - __init__(self, X, y, name, ...)
    - __repr__(self)
    - __eq__(self, other)  (caution: == on array fields is ambiguous;
      compare the arrays with np.array_equal instead)
    """
    X: np.ndarray
    y: np.ndarray
    name: str = "split"
    metadata: dict = field(default_factory=dict)

    @property
    def n_samples(self) -> int:
        return len(self.X)

    @property
    def n_features(self) -> int:
        return self.X.shape[1] if self.X.ndim > 1 else 1

    def describe(self) -> str:
        return (f"{self.name}: {self.n_samples} samples, "
                f"{self.n_features} features")


@dataclass(frozen=True)  # frozen=True makes it immutable
class ModelConfig:
    """Immutable model configuration.

    frozen=True prevents:
    - config.learning_rate = 0.01  # raises FrozenInstanceError
      (a subclass of AttributeError)
    """
    learning_rate: float = 0.01
    epochs: int = 100
    batch_size: int = 32
    hidden_layers: tuple = (64, 32)
    dropout: float = 0.2
    random_seed: int = 42

    def __post_init__(self):
        """Validate after initialization."""
        if self.learning_rate <= 0:
            raise ValueError("learning_rate must be positive")
        if self.epochs <= 0:
            raise ValueError("epochs must be positive")


@dataclass
class Experiment:
    """Track ML experiment results."""
    name: str
    config: ModelConfig
    metrics: dict = field(default_factory=dict)
    tags: List[str] = field(default_factory=list)

    def log_metric(self, key: str, value: float):
        self.metrics[key] = value

    def summary(self) -> str:
        metrics_str = ', '.join(f'{k}={v:.4f}' for k, v in self.metrics.items())
        return f"{self.name}: {metrics_str}"


# Usage
config = ModelConfig(learning_rate=0.001, epochs=200)
print(config)  # ModelConfig(learning_rate=0.001, epochs=200, batch_size=32, ...)

exp = Experiment("baseline", config, tags=["v1", "baseline"])
exp.log_metric("accuracy", 0.95)
exp.log_metric("loss", 0.08)
print(exp.summary())  # baseline: accuracy=0.9500, loss=0.0800
Dataclasses with frozen=True create immutable objects that can be used as dictionary keys or set members, provided every field is itself hashable (for example, a tuple rather than a list). This is ideal for configuration objects in ML pipelines where you want to ensure configurations don't change during execution.
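As a small illustration of using a frozen configuration as a dictionary key, the sketch below defines a hypothetical RunConfig dataclass and uses dataclasses.replace to derive a modified copy. The scores stored in the dictionary are placeholder numbers, not real results.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class RunConfig:
    learning_rate: float = 0.01
    epochs: int = 100


base = RunConfig()
# replace() builds a new object instead of mutating the frozen one.
fast = replace(base, learning_rate=0.1)

# Frozen dataclasses with eq=True (the default) are hashable.
scores = {base: 0.91, fast: 0.88}  # placeholder values for illustration

# An equal config built from scratch finds the same entry.
print(scores[RunConfig(learning_rate=0.1)])  # 0.88

try:
    base.epochs = 5
except FrozenInstanceError:
    print("cannot modify a frozen config")

print(base.epochs)  # 100
```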
Building a Custom sklearn-Compatible Transformer
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils.validation import check_is_fitted


# Mixins are listed before BaseEstimator, the order scikit-learn recommends
class TargetEncoder(TransformerMixin, BaseEstimator):
    """Encode categorical features using target statistics.

    Mathematical: enc(x) = (count(x) * mean(y|x) + m * global_mean) /
                           (count(x) + m)
    where:
    - count(x) = number of times category x appears
    - mean(y|x) = mean of target for category x
    - m = smoothing parameter (prevents overfitting)
    - global_mean = mean of all target values

    This is Bayesian smoothing: categories with few observations
    are pulled toward the global mean.
    """

    def __init__(self, smoothing=10.0):
        self.smoothing = smoothing

    def fit(self, X, y):
        """Learn category → target mappings."""
        X = np.asarray(X).ravel()
        y = np.asarray(y).ravel()
        self.global_mean_ = y.mean()
        self.encoding_ = {}
        for category in np.unique(X):
            mask = X == category
            count = mask.sum()
            mean_target = y[mask].mean()
            # Bayesian smoothing
            self.encoding_[category] = (
                (count * mean_target + self.smoothing * self.global_mean_) /
                (count + self.smoothing)
            )
        self.is_fitted_ = True
        return self

    def transform(self, X):
        """Apply learned encoding; unseen categories get the global mean."""
        check_is_fitted(self, ['is_fitted_', 'encoding_'])
        X = np.asarray(X).ravel()
        result = np.array([self.encoding_.get(c, self.global_mean_) for c in X])
        return result.reshape(-1, 1)


# Usage with sklearn Pipeline
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
import pandas as pd

# Sample data
np.random.seed(42)
df = pd.DataFrame({
    'city': np.random.choice(['NYC', 'LA', 'Chicago', 'Houston'], 1000),
    'income': np.random.normal(50000, 15000, 1000),
    'target': np.random.randint(0, 2, 1000)
})

# Create pipeline
pipe = Pipeline([
    ('encoder', TargetEncoder(smoothing=5.0)),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])

# Encode and evaluate
X_encoded = df[['city']].values
y = df['target'].values

encoder = TargetEncoder(smoothing=5.0)
# fit() needs the target, so y must be passed to fit_transform()
X_transformed = encoder.fit_transform(X_encoded, y)
print("Original categories:", df['city'].unique())
print("Encoded values (sample):", X_transformed[:5].flatten().round(4))
# The target is random here, so every encoded value lies near the global mean (about 0.5)

scores = cross_val_score(pipe, X_encoded, y, cv=5)
print(f"Cross-validated accuracy: {scores.mean():.3f}")
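Target Encoding
$$\text{enc}(x) = \frac{n_x \, \bar{y}_x + m \, \bar{y}}{n_x + m}$$
Here,
- $x$ = Category value
- $n_x$ = Number of times category x appears
- $\bar{y}_x$ = Mean target for category x
- $m$ = Smoothing parameter
- $\bar{y}$ = Global mean of all targets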
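A quick numeric check shows how the smoothing behaves. The sketch below uses a hypothetical helper, smoothed_mean, with made-up inputs: a rare category (2 samples) and a common one (200 samples), both with a category mean of 1.0 and a global mean of 0.5.

```python
def smoothed_mean(count, category_mean, global_mean, m):
    """Blend a category mean with the global mean, weighted by sample count."""
    return (count * category_mean + m * global_mean) / (count + m)


global_mean = 0.5
m = 10.0

# Rare category: (2 * 1.0 + 10 * 0.5) / (2 + 10) = 7 / 12
rare = smoothed_mean(2, 1.0, global_mean, m)

# Common category: (200 * 1.0 + 10 * 0.5) / (200 + 10) = 205 / 210
common = smoothed_mean(200, 1.0, global_mean, m)

print(round(rare, 4))    # 0.5833 (pulled strongly toward the global mean)
print(round(common, 4))  # 0.9762 (stays close to its own mean)
```

With m = 0 the formula reduces to the raw category mean, and as m grows every category converges to the global mean.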
Building a Dataset Class (PyTorch-Style)
import numpy as np


class Dataset:
    """A generic dataset class supporting indexing and slicing.

    Similar to PyTorch's Dataset but without framework dependencies.
    Supports:
    - dataset[i] → (features, target)
    - dataset[i:j] → subset
    - len(dataset) → number of samples
    - iteration over samples
    """

    def __init__(self, features, targets=None, transform=None):
        self.features = np.asarray(features, dtype=np.float32)
        self.targets = np.asarray(targets, dtype=np.float32) if targets is not None else None
        self.transform = transform
        self._validate()

    def _validate(self):
        if self.targets is not None and len(self.features) != len(self.targets):
            raise ValueError(f"Features ({len(self.features)}) and "
                             f"targets ({len(self.targets)}) length mismatch")

    def _subset(self, idx):
        """Return a new Dataset restricted to the given slice or index array."""
        targets = self.targets[idx] if self.targets is not None else None
        return Dataset(self.features[idx], targets, self.transform)

    def __len__(self):
        return len(self.features)

    def __getitem__(self, idx):
        if isinstance(idx, slice):
            return self._subset(idx)
        x = self.features[idx]
        if self.transform:
            x = self.transform(x)
        if self.targets is not None:
            return x, self.targets[idx]
        return x

    def __repr__(self):
        return (f"Dataset(n_samples={len(self)}, "
                f"n_features={self.features.shape[1]}, "
                f"has_targets={self.targets is not None})")

    @property
    def shape(self):
        return self.features.shape

    def split(self, train_ratio=0.8, shuffle=True, seed=42):
        """Split into train/test datasets."""
        n = len(self)
        indices = np.random.RandomState(seed).permutation(n) if shuffle else np.arange(n)
        split_idx = int(n * train_ratio)
        train_idx, test_idx = indices[:split_idx], indices[split_idx:]
        # Indexing with an array returns raw arrays, so build Datasets explicitly
        return self._subset(train_idx), self._subset(test_idx)

    def batch(self, batch_size, shuffle=True):
        """Yield mini-batches for training."""
        n = len(self)
        indices = np.random.permutation(n) if shuffle else np.arange(n)
        for start in range(0, n, batch_size):
            batch_idx = indices[start:start + batch_size]
            yield self[batch_idx]


# Usage
np.random.seed(42)
X = np.random.randn(500, 10)
y = (X[:, 0] + X[:, 1] > 0).astype(float)
dataset = Dataset(X, y)
print(dataset)  # Dataset(n_samples=500, n_features=10, has_targets=True)

# Single sample
x, label = dataset[0]
print(f"Sample shape: {x.shape}, Label: {label}")

# Slice
subset = dataset[:100]
print(f"Subset: {subset}")

# Split
train, test = dataset.split(train_ratio=0.8)
print(f"Train: {train}, Test: {test}")

# Mini-batches
for batch_X, batch_y in dataset.batch(batch_size=32):
    print(f"Batch X shape: {batch_X.shape}, Batch y shape: {batch_y.shape}")
    break  # Just show first batch
When to Use OOP vs Functional
| Criterion | Use OOP | Use Functional |
|---|---|---|
| State management needed | ✓ | ✗ |
| Multiple related operations | ✓ | ✗ |
| sklearn/PyTorch compatibility | ✓ | ✗ |
| One-off data transformation | ✗ | ✓ |
| Simple scripts | ✗ | ✓ |
| Pipeline composition | ✓ | ✗ |
| Testing (mocking) | ✓ | ✗ |
| Mathematical functions | ✗ | ✓ |
| Reusable components across projects | ✓ | ✗ |
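The state management row is the one that most often decides the question in practice. The sketch below contrasts a stateless function with a stateful class for the same task; standardize_stateless and Standardizer are hypothetical names, and the tiny arrays are made up for illustration.

```python
import numpy as np


def standardize_stateless(X):
    """Stateless: statistics always come from the array passed in."""
    return (X - X.mean(axis=0)) / X.std(axis=0)


class Standardizer:
    """Stateful: remembers the training statistics for later reuse."""

    def fit(self, X):
        self.mean_ = X.mean(axis=0)
        self.std_ = X.std(axis=0)
        return self

    def transform(self, X):
        return (X - self.mean_) / self.std_


X_train = np.array([[0.0], [2.0], [4.0]])  # mean 2, std sqrt(8/3)
X_test = np.array([[2.0], [6.0]])          # mean 4, std 2

# The function rescales the test set with the test set's own statistics,
# so the result is -1 and 1 regardless of what the training data looked like.
print(standardize_stateless(X_test).ravel())

# The object applies the training statistics, so a test value equal to the
# training mean maps to exactly 0.
scaler = Standardizer().fit(X_train)
print(scaler.transform(X_test).ravel())
```

For a one-off transformation of a single array the function is simpler; once the same statistics must be reapplied to new data, the object is the natural place to keep them.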
Complete Example: DataPipeline Class
import time
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Tuple

import numpy as np
import pandas as pd


@dataclass
class StepResult:
    """Result of a pipeline step."""
    name: str
    input_shape: Tuple
    output_shape: Tuple
    time_ms: float
    metadata: Dict[str, Any] = field(default_factory=dict)


class DataPipeline:
    """A composable data processing pipeline.

    Usage:
        pipeline = DataPipeline()
        pipeline.add_step('clean', clean_function)
        pipeline.add_step('scale', scale_function)
        pipeline.add_step('encode', encode_function)
        result = pipeline.run(data)
    """

    def __init__(self, name: str = "pipeline"):
        self.name = name
        self.steps: List[Tuple[str, Callable]] = []
        self.results: List[StepResult] = []

    def add_step(self, name: str, func: Callable):
        """Add a processing step."""
        self.steps.append((name, func))
        return self  # Enable chaining

    def run(self, data: Any, verbose: bool = False) -> Any:
        """Execute all pipeline steps sequentially."""
        self.results = []
        current = data
        for step_name, step_func in self.steps:
            start = time.perf_counter()
            input_shape = self._get_shape(current)
            try:
                current = step_func(current)
            except Exception as e:
                # Chain the original exception so its traceback is preserved
                raise RuntimeError(f"Pipeline failed at step '{step_name}': {e}") from e
            elapsed = (time.perf_counter() - start) * 1000
            output_shape = self._get_shape(current)
            result = StepResult(
                name=step_name,
                input_shape=input_shape,
                output_shape=output_shape,
                time_ms=elapsed
            )
            self.results.append(result)
            if verbose:
                print(f"[{step_name}] {input_shape} → {output_shape} ({elapsed:.1f}ms)")
        return current

    def _get_shape(self, data):
        if hasattr(data, 'shape'):
            return data.shape
        if isinstance(data, (list, tuple)):
            return (len(data),)
        return None

    def summary(self):
        """Print pipeline execution summary."""
        print(f"\n{'='*60}")
        print(f"Pipeline: {self.name}")
        print(f"{'='*60}")
        total_time = 0
        for i, result in enumerate(self.results, 1):
            print(f"  {i}. {result.name}")
            print(f"     Input:  {result.input_shape}")
            print(f"     Output: {result.output_shape}")
            print(f"     Time:   {result.time_ms:.1f}ms")
            total_time += result.time_ms
        print(f"{'='*60}")
        print(f"Total time: {total_time:.1f}ms")
        print(f"Total steps: {len(self.results)}")


# Define processing functions
def remove_nulls(data):
    """Remove rows with null values."""
    if isinstance(data, pd.DataFrame):
        return data.dropna()
    return data


def standardize(data):
    """Standardize numeric features."""
    if isinstance(data, pd.DataFrame):
        data = data.copy()  # Avoid mutating the caller's DataFrame
        numeric = data.select_dtypes(include=[np.number])
        data[numeric.columns] = (numeric - numeric.mean()) / numeric.std()
    return data


def add_features(data):
    """Add engineered features."""
    if isinstance(data, pd.DataFrame) and 'income' in data.columns:
        data = data.copy()  # Avoid mutating the caller's DataFrame
        data['income_log'] = np.log1p(data['income'])
        data['income_bin'] = pd.cut(data['income'], bins=5, labels=False)
    return data


# Build and run pipeline
# add_features runs before standardize so that log1p sees raw incomes
# rather than standardized values, which are often below -1
pipeline = DataPipeline(name="customer_data_v1")
pipeline.add_step("remove_nulls", remove_nulls)
pipeline.add_step("add_features", add_features)
pipeline.add_step("standardize", standardize)

# Sample data
np.random.seed(42)
df = pd.DataFrame({
    # Stored as float so that missing values (NaN) can be inserted below
    'age': np.random.randint(18, 70, 1000).astype(float),
    'income': np.random.normal(50000, 15000, 1000),
    'score': np.random.uniform(0, 100, 1000)
})
df.loc[np.random.choice(df.index, 50), 'age'] = np.nan

result = pipeline.run(df, verbose=True)
pipeline.summary()
Key Takeaways
Summary: OOP for Data Science
- Classes define blueprints; objects are instances. Use __init__ to set up state and methods to define behavior.
- self is how Python knows which object's data to operate on. It is passed automatically when you call a method on an instance.
- Encapsulation (_ and __) protects internal state. Use @property for controlled access.
- Inheritance enables code reuse and polymorphism. Always prefer composition over inheritance when possible.
- Magic methods (__repr__, __getitem__, __add__) make your objects behave like built-in Python types.
- Dataclasses eliminate boilerplate for data-holding classes. Use frozen=True for immutable configs.
- sklearn compatibility: Inherit from BaseEstimator and TransformerMixin to plug into sklearn Pipelines.
- OOP is best for stateful components (models, pipelines); functional is best for stateless transformations.
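The advice to prefer composition over inheritance can be sketched as follows. Instead of a model class inheriting from a preprocessing class, a wrapper holds both as attributes. All three classes here (MeanCenterer, ConstantModel, PreprocessedModel) are hypothetical and written in pure Python for illustration.

```python
class MeanCenterer:
    """Subtracts the per-column training mean."""

    def fit(self, X):
        n = len(X)
        self.means_ = [sum(col) / n for col in zip(*X)]
        return self

    def transform(self, X):
        return [[v - m for v, m in zip(row, self.means_)] for row in X]


class ConstantModel:
    """Predicts the mean of the training targets for every row."""

    def fit(self, X, y):
        self.value_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.value_ for _ in X]


class PreprocessedModel:
    """Composition: has-a preprocessor and has-a model, inherits from neither."""

    def __init__(self, preprocessor, model):
        self.preprocessor = preprocessor
        self.model = model

    def fit(self, X, y):
        Xt = self.preprocessor.fit(X).transform(X)
        self.model.fit(Xt, y)
        return self

    def predict(self, X):
        return self.model.predict(self.preprocessor.transform(X))


combo = PreprocessedModel(MeanCenterer(), ConstantModel())
combo.fit([[1.0, 2.0], [3.0, 4.0]], [10.0, 20.0])
print(combo.preprocessor.means_)    # [2.0, 3.0]
print(combo.predict([[5.0, 6.0]]))  # [15.0]
```

Because the parts are passed in, either one can be swapped for any object with the same methods without touching the wrapper.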
Practice Exercises
Exercise 1: Build a Linear Regression Class
class LinearRegression:
    """Implement from scratch with fit/predict interface."""

    def __init__(self, learning_rate=0.01, epochs=1000):
        self.learning_rate = learning_rate
        self.epochs = epochs
        self.weights = None
        self.bias = None
        self.losses = []

    def fit(self, X, y):
        # Implement gradient descent
        # Store losses in self.losses
        pass

    def predict(self, X):
        # y = Xw + b
        pass

    def score(self, X, y):
        # R² score: 1 - SS_res / SS_tot
        pass
Exercise 2: Create a Transformer Pipeline
# Create these transformers:
# 1. OneHotEncoder: encode categories as binary columns
# 2. PolynomialFeatures: create polynomial interaction features
# 3. SelectKBest: keep top k features by correlation

class OneHotEncoder:
    def fit(self, X):
        pass

    def transform(self, X):
        pass


class PolynomialFeatures:
    def __init__(self, degree=2):
        pass

    def fit(self, X):
        pass

    def transform(self, X):
        pass


class SelectKBest:
    def __init__(self, k=5):
        pass

    def fit(self, X, y):
        pass

    def transform(self, X):
        pass
Exercise 3: Magic Methods Challenge
# Create a Matrix class with all operators
import numpy as np


class Matrix:
    def __init__(self, data):
        self.data = np.array(data)

    # Implement:
    # __add__ (matrix addition)
    # __mul__ (element-wise multiplication)
    # __matmul__ (matrix multiplication)
    # __getitem__ (indexing)
    # __setitem__ (assignment)
    # __eq__ (equality check)
    # __repr__ (string representation)
    # @property determinant (compute determinant)
    # @property inverse (compute inverse)
    # @property transpose (transpose matrix)
Exercise 4: Build a Config System
# Create a hierarchical config system using dataclasses
from dataclasses import dataclass, field


@dataclass
class DatabaseConfig:
    host: str = "localhost"
    port: int = 5432
    name: str = "mydb"


@dataclass
class ModelConfig:
    learning_rate: float = 0.01
    epochs: int = 100
    batch_size: int = 32


@dataclass
class AppConfig:
    db: DatabaseConfig = field(default_factory=DatabaseConfig)
    model: ModelConfig = field(default_factory=ModelConfig)
    debug: bool = False

    # Implement:
    # - to_dict() method (convert to dictionary)
    # - from_dict() classmethod (create from dictionary)
    # - save() method (save to JSON file)
    # - load() classmethod (load from JSON file)