Object-Oriented Programming for Data Science

Why OOP for Data Science?

Object-Oriented Programming provides abstraction, encapsulation, and composability — essential properties for building complex data systems. While data science often uses functional patterns, OOP is the backbone of major frameworks like scikit-learn, PyTorch, and TensorFlow.

OOP vs Functional for Data Science

Architecture Diagram

Functional Approach:               OOP Approach:
┌─────────────────────┐            ┌─────────────────────┐
│ def clean(df):      │            │ class Cleaner:      │
│     return ...      │            │     def fit(X):     │
│                     │            │     def transform():│
│ def model(df):      │            │                     │
│     return ...      │            │ class Pipeline:     │
│                     │            │     [Cleaner, Model]│
│ model(clean(data))  │            │ pipe.fit_transform()│
└─────────────────────┘            └─────────────────────┘
✓ Simple for scripts               ✓ Composable systems
✗ Hard to reuse across             ✓ scikit-learn compatible
  projects                         ✓ State management
✗ No shared state                  ✓ Type checking

Classes and Objects

A class is a blueprint; an object is an instance. Formally, a class defines a type $T$ with methods $m_1, m_2, \ldots, m_n$ that operate on instances of $T$.

Definition: Class

A blueprint for creating objects that defines a set of attributes (data) and methods (behavior). Formally, a class C is a tuple (A, M) where A is a set of attributes and M is a set of methods that operate on instances of C.

class Dataset:
    """A simple dataset class for data science.

    Mathematical representation:
    A dataset D is a tuple (X, y) where:
    - X ∈ ℝⁿˣᵈ (feature matrix, n samples, d features)
    - y ∈ ℝⁿ or {0,1,...,k-1}ⁿ (target vector)
    """

    # Class variable (shared across all instances)
    dataset_count = 0

    def __init__(self, features, targets, name="unnamed"):
        """Initialize dataset.

        Parameters
        ----------
        features : list of lists or np.array
            Feature matrix X of shape (n_samples, n_features)
        targets : list or np.array
            Target vector y of shape (n_samples,)
        name : str
            Dataset identifier
        """
        self.features = features    # Instance variable
        self.targets = targets
        self.name = name
        self.n_samples = len(features)
        # len() check works for both lists and NumPy arrays (truth-testing an array raises)
        self.n_features = len(features[0]) if len(features) > 0 else 0
        Dataset.dataset_count += 1  # Modify class variable

    def __repr__(self):
        """Developer-friendly string representation."""
        return (f"Dataset(name='{self.name}', "
                f"n_samples={self.n_samples}, "
                f"n_features={self.n_features})")

    def __str__(self):
        """User-friendly string representation."""
        return f"Dataset '{self.name}': {self.n_samples} samples × {self.n_features} features"

    def summary(self):
        """Compute dataset statistics."""
        import numpy as np
        X = np.array(self.features)
        return {
            'mean': X.mean(axis=0).tolist(),
            'std': X.std(axis=0).tolist(),
            'min': X.min(axis=0).tolist(),
            'max': X.max(axis=0).tolist()
        }

# Creating objects (instances)
ds1 = Dataset([[1,2],[3,4],[5,6]], [0,1,0], name="iris")
ds2 = Dataset([[7,8],[9,10]], [1,0], name="wine")

print(ds1)          # Dataset 'iris': 3 samples × 2 features
print(repr(ds1))    # Dataset(name='iris', n_samples=3, n_features=2)
print(Dataset.dataset_count)  # 2 (class variable shared)

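Class variables are defined in the class body, outside any method, and are shared across all instances. Instance variables are assigned through self (usually in __init__) and belong to a single object. The distinction matters most for mutable values: a list or dict stored as a class variable is one object shared by every instance, so a change made through one instance is visible from all the others.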
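The shared-state pitfall is easiest to see with a mutable class variable. The sketch below uses a hypothetical TrainingRun class (not part of the examples in this article) to contrast a list stored on the class with a list created per instance.

```python
class TrainingRun:
    # Hypothetical class, for illustration only.
    tags = []                 # class variable: ONE list shared by every instance

    def __init__(self, name):
        self.name = name
        self.metrics = []     # instance variable: a fresh list per object

    def add_tag(self, tag):
        self.tags.append(tag)         # mutates the shared class-level list

    def add_metric(self, value):
        self.metrics.append(value)    # mutates only this object's list


a = TrainingRun("a")
b = TrainingRun("b")
a.add_tag("baseline")
a.add_metric(0.91)

print(b.tags)                  # ['baseline'] (b sees a's tag: the list is shared)
print(b.metrics)               # []           (metrics are per instance)
print(a.tags is TrainingRun.tags)  # True
```

If each run should have its own tags, create the list inside __init__ instead of in the class body.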

The `self` Parameter

self is a reference to the instance a method was called on. When you write ds1.summary(), Python passes ds1 as the first argument, so the method knows which object's data to operate on.

Architecture Diagram

┌────────────────────────────────────────────────────┐
│  Class: Dataset                                   │
│  ┌──────────────────────────────────────────────┐  │
│  │  Class Variables (shared)                    │  │
│  │  dataset_count = 2                           │  │
│  └──────────────────────────────────────────────┘  │
│                                                    │
│  Instance: ds1           Instance: ds2             │
│  ┌──────────────────┐    ┌──────────────────┐     │
│  │ self = ds1       │    │ self = ds2       │     │
│  │ self.features    │    │ self.features    │     │
│  │ self.targets     │    │ self.targets     │     │
│  │ self.name        │    │ self.name        │     │
│  └──────────────────┘    └──────────────────┘     │
│                                                    │
│  When you call ds1.summary():                      │
│    Python calls Dataset.summary(ds1)               │
│    self = ds1 inside the method                    │
└────────────────────────────────────────────────────┘

class Vector:
    """N-dimensional vector with mathematical operations."""

    def __init__(self, values):
        self.values = list(values)
        self.n = len(values)

    def dot(self, other):
        """Dot product: v · w = Σ(v_i * w_i)"""
        # self is the first vector, other is the second
        return sum(a * b for a, b in zip(self.values, other.values))

    def norm(self):
        """Euclidean norm: ||v|| = √(Σv_i²)"""
        import math
        return math.sqrt(sum(x**2 for x in self.values))

    def add(self, other):
        """Element-wise addition: (v + w)_i = v_i + w_i"""
        return Vector([a + b for a, b in zip(self.values, other.values)])

v1 = Vector([1, 2, 3])
v2 = Vector([4, 5, 6])

print(v1.dot(v2))      # 32 (1*4 + 2*5 + 3*6)
print(v1.norm())        # 3.7416573867739413 (√14)
print(v1.add(v2).values)  # [5, 7, 9]

# Why self matters:
# When v1.dot(v2) is called, Python translates to:
# Vector.dot(v1, v2)
# Inside dot(), self = v1, other = v2

Dot Product

$$\mathbf{v} \cdot \mathbf{w} = \sum_{i=1}^{n} v_i \cdot w_i$$

Here,

$\mathbf{v}$ = First vector (self.values)
$\mathbf{w}$ = Second vector (other.values)
$n$ = Number of dimensions

Encapsulation

Encapsulation controls access to internal state. Python uses naming conventions (not enforcement) to indicate access levels.

Access level	Prefix	Meaning	Example
Public	none	Accessible everywhere	`self.name`
Protected	`_`	Internal use (convention)	`self._data`
Private	`__`	Name-mangled (harder to access)	`self.__secret`
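The three prefixes in the table above can be checked directly. This is a minimal sketch with a hypothetical Estimator class; it shows that a single underscore blocks nothing and that a double underscore only renames the attribute.

```python
class Estimator:
    # Hypothetical class used only to illustrate the three prefixes.
    def __init__(self):
        self.name = "public"        # no prefix
        self._cache = "protected"   # single underscore: convention only
        self.__token = "private"    # double underscore: name-mangled


est = Estimator()
print(est.name)               # public
print(est._cache)             # protected (nothing stops this access)

try:
    est.__token
except AttributeError as exc:
    print(type(exc).__name__)  # AttributeError

# The attribute still exists under its mangled name.
print(est._Estimator__token)              # private
print("_Estimator__token" in vars(est))   # True
```

Name mangling exists mainly to avoid accidental name clashes in subclasses; it is not a security boundary.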

Definition: Encapsulation

The bundling of data and methods that operate on that data within a single unit (class), while restricting direct access to some components. This prevents external code from depending on implementation details, enabling independent modification of internal representation.

class MLModel:
    """Machine learning model with encapsulated state.

    Encapsulation ensures:
    1. Internal state is harder to corrupt by accident
    2. Public interface is well-defined
    3. Implementation can change without breaking code
    """

    def __init__(self, learning_rate=0.01):
        self.public_param = learning_rate  # Public: accessible everywhere
        self._internal_state = {}          # Protected: convention-only barrier
        self.__weights = None              # Private: name-mangled
        self.__bias = 0.0

    @property
    def weights(self):
        """Read access to weights; writes go through the validating setter below."""
        return self.__weights

    @weights.setter
    def weights(self, value):
        """Controlled write access with validation."""
        import numpy as np
        if not isinstance(value, np.ndarray):
            raise TypeError("Weights must be numpy array")
        self.__weights = value

    def fit(self, X, y, epochs=100):
        """Train model (modifies internal state)."""
        import numpy as np
        n_features = X.shape[1]
        self.__weights = np.random.randn(n_features) * 0.01

        self._internal_state['epochs'] = epochs
        self._internal_state['losses'] = []

        for epoch in range(epochs):
            predictions = X.dot(self.__weights) + self.__bias
            error = predictions - y
            loss = (error ** 2).mean()
            self._internal_state['losses'].append(loss)

            # Gradient descent
            self.__weights -= self.public_param * (2/X.shape[0]) * X.T.dot(error)
            self.__bias -= self.public_param * (2/X.shape[0]) * error.sum()

        return self

    def predict(self, X):
        """Make predictions."""
        if self.__weights is None:
            raise ValueError("Model not trained. Call fit() first.")
        return X.dot(self.__weights) + self.__bias

import numpy as np
np.random.seed(42)
X = np.random.randn(100, 3)
y = X.dot([1, 2, 3]) + np.random.randn(100) * 0.1

model = MLModel(learning_rate=0.1)
model.fit(X, y, epochs=50)

print(f"Learned weights: {model.weights}")       # [~1, ~2, ~3]
print(f"Training loss: {model._internal_state['losses'][-1]:.4f}")

# Accessing the private attribute by its plain name raises AttributeError:
# model.__weights  # AttributeError: 'MLModel' object has no attribute '__weights'

$$\theta_{t+1} = \theta_t - \alpha \cdot \nabla_{\theta} L(\theta_t)$$

The gradient descent update rule used in the MLModel's fit method, where α is the learning rate and ∇L is the loss gradient.
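For the mean squared error loss computed in fit, the gradient in the update rule can be written out explicitly. The following is a sketch of that derivation, using $w$ for the weights, $b$ for the bias, and $n$ for the number of samples (symbols chosen here for illustration). The two gradients correspond to the `(2/X.shape[0]) * X.T.dot(error)` and `(2/X.shape[0]) * error.sum()` terms in the code.

```latex
\begin{aligned}
L(w, b) &= \frac{1}{n} \sum_{i=1}^{n} \left( x_i^\top w + b - y_i \right)^2 \\
\nabla_w L &= \frac{2}{n} X^\top \left( Xw + b - y \right) \\
\frac{\partial L}{\partial b} &= \frac{2}{n} \sum_{i=1}^{n} \left( x_i^\top w + b - y_i \right) \\
w_{t+1} &= w_t - \alpha \, \nabla_w L(w_t, b_t), \qquad
b_{t+1} = b_t - \alpha \, \frac{\partial L}{\partial b}(w_t, b_t)
\end{aligned}
```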

Inheritance

Inheritance allows creating new classes that extend or modify existing classes. The child class inherits all attributes and methods, and can override or add new behavior.

Definition: Inheritance

A mechanism where a new class (subclass) derives attributes and methods from an existing class (superclass). The subclass inherits the interface of the superclass and can extend or override behavior. Formally, if B is a subclass of A, then every instance of B is also an instance of A (is-a relationship).
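The is-a relationship can be verified with isinstance and issubclass. The sketch below uses two hypothetical classes and, as an assumption of this example, builds the base class on the standard library's abc module, an alternative to raising NotImplementedError that rejects incomplete subclasses at instantiation time.

```python
from abc import ABC, abstractmethod


class BaseTransformer(ABC):
    """Hypothetical abstract base class, for illustration only."""

    @abstractmethod
    def transform(self, values):
        ...


class Doubler(BaseTransformer):
    def transform(self, values):
        return [2 * v for v in values]


d = Doubler()
print(isinstance(d, Doubler))                 # True
print(isinstance(d, BaseTransformer))         # True (is-a relationship)
print(issubclass(Doubler, BaseTransformer))   # True
print(d.transform([1, 2, 3]))                 # [2, 4, 6]

try:
    BaseTransformer()        # abstract classes cannot be instantiated
except TypeError:
    print("TypeError")
```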

Architecture Diagram

Base Class: Transformer            Derived Classes:
┌─────────────────────────┐       ┌─────────────────────────┐
│  fit(X)                 │       │  StandardScaler         │
│  transform(X)           │──────▶│   fit(): compute μ, σ   │
│  fit_transform(X)       │       │   transform(): (X-μ)/σ  │
└─────────────────────────┘       └─────────────────────────┘
                                  ┌─────────────────────────┐
                                  │  MinMaxScaler           │
                                  │   fit(): compute min,max│
                                  │   transform(): normalize│
                                  └─────────────────────────┘

import numpy as np

class Transformer:
    """Base transformer class (sklearn-like interface)."""

    def __init__(self):
        self.is_fitted = False
        self._params = {}

    def fit(self, X):
        """Learn parameters from data. Override in subclass."""
        raise NotImplementedError("Subclasses must implement fit()")

    def transform(self, X):
        """Apply transformation. Override in subclass."""
        raise NotImplementedError("Subclasses must implement transform()")

    def fit_transform(self, X):
        """Fit and transform in one step."""
        return self.fit(X).transform(X)

    def __repr__(self):
        params = ', '.join(f'{k}={v}' for k, v in self._params.items())
        return f"{self.__class__.__name__}({params})"


class StandardScaler(Transformer):
    """Standardize features by removing mean and scaling to unit variance.

    Mathematical: z = (x - μ) / σ

    where:
    - μ = mean of feature
    - σ = standard deviation of feature
    """

    def fit(self, X):
        self._mean = np.mean(X, axis=0)
        self._std = np.std(X, axis=0)
        self._std[self._std == 0] = 1  # Prevent division by zero
        self.is_fitted = True
        self._params = {'method': 'standard'}
        return self

    def transform(self, X):
        if not self.is_fitted:
            raise ValueError("Transformer not fitted. Call fit() first.")
        return (X - self._mean) / self._std


class MinMaxScaler(Transformer):
    """Scale features to [0, 1] range.

    Mathematical: x_scaled = (x - min) / (max - min)
    """

    def fit(self, X):
        self._min = np.min(X, axis=0)
        self._max = np.max(X, axis=0)
        self._range = self._max - self._min
        self._range[self._range == 0] = 1  # Prevent division by zero
        self.is_fitted = True
        self._params = {'method': 'minmax'}
        return self

    def transform(self, X):
        if not self.is_fitted:
            raise ValueError("Transformer not fitted. Call fit() first.")
        return (X - self._min) / self._range


class RobustScaler(Transformer):
    """Scale features using statistics robust to outliers.

    Mathematical: x_scaled = (x - median) / IQR

    where IQR = Q3 - Q1 (interquartile range)
    """

    def fit(self, X):
        self._median = np.median(X, axis=0)
        q75 = np.percentile(X, 75, axis=0)
        q25 = np.percentile(X, 25, axis=0)
        self._iqr = q75 - q25
        self._iqr[self._iqr == 0] = 1
        self.is_fitted = True
        self._params = {'method': 'robust'}
        return self

    def transform(self, X):
        if not self.is_fitted:
            raise ValueError("Transformer not fitted. Call fit() first.")
        return (X - self._median) / self._iqr


# Usage
np.random.seed(42)
X = np.random.randn(100, 3) * [10, 1, 0.1]  # Different scales

scaler = StandardScaler()
X_standard = scaler.fit_transform(X)
print(f"Standard: mean={X_standard.mean(axis=0).round(4)}, std={X_standard.std(axis=0).round(4)}")
# Prints means that round to 0 (some may display as -0.) and std=[1. 1. 1.]

minmax = MinMaxScaler()
X_minmax = minmax.fit_transform(X)
print(f"MinMax: range=[{X_minmax.min():.4f}, {X_minmax.max():.4f}]")
# MinMax: range=[0.0000, 1.0000]

robust = RobustScaler()
X_robust = robust.fit_transform(X)
print(f"Robust: median={np.median(X_robust, axis=0).round(4)}")
# Robust: median=[0. 0. 0.]

Standard Scaling

$$z = \frac{x - \mu}{\sigma}$$

Here,

$z$ = Standardized value
$x$ = Original value
$\mu$ = Mean of the feature
$\sigma$ = Standard deviation of the feature
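A small worked example makes the formula concrete. The three values are made up for illustration; note that np.std defaults to the population standard deviation (ddof=0), which is also what the StandardScaler class above computes.

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0])   # toy feature column
mu = x.mean()                   # (2 + 4 + 6) / 3 = 4.0
sigma = x.std()                 # sqrt((4 + 0 + 4) / 3), population std
z = (x - mu) / sigma

print(mu)                       # 4.0
print(round(float(sigma), 4))   # 1.633
print(z.round(4).tolist())      # [-1.2247, 0.0, 1.2247]
```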

Polymorphism

Polymorphism means the same interface works with different types. In Python, this is structural (duck typing) — objects are treated based on their methods, not their class hierarchy.

Definition: Polymorphism

The principle that a single interface can be used with different data types. In Python, duck typing implements structural polymorphism: objects are treated as instances of a type if they implement the required methods, regardless of their actual class hierarchy.

import numpy as np

# Polymorphic function: works with ANY object that has .fit() and .predict()
def evaluate(model, X_train, y_train, X_test, y_test):
    """Evaluate any model that implements fit() and predict().

    This function doesn't care if model is:
    - A LinearRegression
    - A DecisionTree
    - A Neural Network
    - Any custom class with fit/predict
    """
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    mse = np.mean((predictions - y_test) ** 2)
    return {'model': type(model).__name__, 'mse': mse}


class LinearRegression:
    """Simple linear regression: y = Xw + b"""

    def fit(self, X, y):
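        # Least-squares fit via lstsq. When X^T X is invertible this equals the
        # normal-equation solution w = (X^T X)^{-1} X^T y, but is more stable.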
        X_with_bias = np.column_stack([X, np.ones(len(X))])
        self.w = np.linalg.lstsq(X_with_bias, y, rcond=None)[0]
        return self

    def predict(self, X):
        X_with_bias = np.column_stack([X, np.ones(len(X))])
        return X_with_bias.dot(self.w)


class PolynomialRegression:
    """Polynomial regression on element-wise powers: y = X w_1 + X² w_2 + ... + b (no cross terms)"""

    def __init__(self, degree=2):
        self.degree = degree

    def fit(self, X, y):
        X_poly = self._transform(X)
        self.w = np.linalg.lstsq(X_poly, y, rcond=None)[0]
        return self

    def predict(self, X):
        return self._transform(X).dot(self.w)

    def _transform(self, X):
        return np.column_stack([X**i for i in range(1, self.degree+1)] +
                               [np.ones(len(X))])


# Generate data
np.random.seed(42)
X = np.random.randn(100, 2)
y = X.dot([3, -2]) + np.random.randn(100) * 0.1

X_train, X_test = X[:80], X[80:]
y_train, y_test = y[:80], y[80:]

# Polymorphism in action: same function, different models
for model in [LinearRegression(), PolynomialRegression(degree=2)]:
    result = evaluate(model, X_train, y_train, X_test, y_test)
    print(f"{result['model']}: MSE = {result['mse']:.4f}")

# Both MSEs should be close to 0.01, the variance of the added noise (0.1²);
# the exact values depend on the random draw.

Duck typing in Python means you don't need abstract base classes for polymorphism. As long as objects implement the same method signatures, they can be used interchangeably. This is more flexible than Java's class-based polymorphism, but the expected interface is implicit, so it has to be documented carefully (in docstrings or with type hints).
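One way to document a duck-typed interface is typing.Protocol from the standard library. The sketch below is an assumption of this example rather than something the evaluate function requires: Regressor, MeanModel, and NotAModel are hypothetical names, and the runtime check only confirms that the methods exist, not that their signatures match.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class Regressor(Protocol):
    """Hypothetical protocol describing the interface evaluate() relies on."""

    def fit(self, X, y): ...
    def predict(self, X): ...


class MeanModel:
    """Predicts the training mean; it does not inherit from Regressor."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


class NotAModel:
    def fit(self, X, y):
        return self


print(isinstance(MeanModel(), Regressor))   # True  (has fit and predict)
print(isinstance(NotAModel(), Regressor))   # False (predict is missing)
print(MeanModel().fit([[1], [2]], [10, 20]).predict([[3]]))  # [15.0]
```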

Magic Methods (Dunder Methods)

Magic methods let you define how your objects behave with built-in Python operations:

Definition: Magic Method

A special method (double underscore prefix and suffix) that Python calls implicitly for built-in operations. For example, __add__ is called when using the + operator, __len__ for len(), and __str__ for str(). These enable operator overloading and integration with Python's built-in functions.

import numpy as np

class Vector:
    """N-dimensional vector with full operator overloading."""

    def __init__(self, values):
        self.values = np.array(values, dtype=float)

    # String representations
    def __repr__(self):
        """Developer: Vector([1.0, 2.0, 3.0])"""
        return f"Vector({self.values.tolist()})"

    def __str__(self):
        """User: [1.00, 2.00, 3.00]"""
        return f"[{', '.join(f'{v:.2f}' for v in self.values)}]"

    # Length and boolean
    def __len__(self):
        """len(vector) → number of dimensions"""
        return len(self.values)

    def __bool__(self):
        """bool(vector) → True if non-zero"""
        return bool(np.any(self.values))

    # Indexing
    def __getitem__(self, index):
        """vector[i] → scalar value"""
        return self.values[index]

    def __setitem__(self, index, value):
        """vector[i] = value"""
        self.values[index] = value

    # Arithmetic operators
    def __add__(self, other):
        """vector + other"""
        return Vector(self.values + other.values)

    def __sub__(self, other):
        """vector - other"""
        return Vector(self.values - other.values)

    def __mul__(self, scalar):
        """vector * scalar (element-wise)"""
        return Vector(self.values * scalar)

    def __rmul__(self, scalar):
        """scalar * vector"""
        return Vector(self.values * scalar)

    def __truediv__(self, scalar):
        """vector / scalar"""
        return Vector(self.values / scalar)

    def __matmul__(self, other):
        """vector @ other (dot product)"""
        return np.dot(self.values, other.values)

    # Comparison
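    def __eq__(self, other):
        """vector == other (element-wise, within floating-point tolerance)"""
        if not isinstance(other, Vector):
            return NotImplemented  # let Python handle comparisons with other types
        return np.allclose(self.values, other.values)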

    def __lt__(self, other):
        """vector < other (compare norms)"""
        return np.linalg.norm(self.values) < np.linalg.norm(other.values)

    # Hashing and containment
    def __hash__(self):
        # Caveat: __eq__ is tolerance-based, so nearly equal vectors can hash differently
        return hash(tuple(self.values))

    def __contains__(self, value):
        """value in vector"""
        return value in self.values

    # Context manager (for temporary operations)
    def __enter__(self):
        """with vector as v: ..."""
        self._backup = self.values.copy()
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        """Restore original values on exception"""
        if exc_type is not None:
            self.values = self._backup
        return False


v1 = Vector([1, 2, 3])
v2 = Vector([4, 5, 6])

print(repr(v1))           # Vector([1.0, 2.0, 3.0])
print(str(v1))            # [1.00, 2.00, 3.00]
print(len(v1))            # 3
print(v1 + v2)            # [5.00, 7.00, 9.00]
print(v1 @ v2)            # 32.0 (dot product)
print(2 * v1)             # [2.00, 4.00, 6.00]
print(v1[0])              # 1.0
print(v1 == Vector([1,2,3]))  # True
print(2 in v1)            # True

Operator Overloading

$$\text{If } v \text{ implements } \texttt{\_\_add\_\_}, \text{ then } v + w \text{ calls } v.\texttt{\_\_add\_\_}(w)$$

Here,

$v$ = Left operand (self in __add__)
$w$ = Right operand (other in __add__)
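The rule above has a second half: if the left operand cannot handle the operation, Python tries the reflected method on the right operand. The following sketch uses a hypothetical Scaled class to show how returning NotImplemented triggers that fallback, which is why the Vector class defines __rmul__ alongside __mul__.

```python
class Scaled:
    """Hypothetical wrapper around a number, for illustration only."""

    def __init__(self, value):
        self.value = value

    def __mul__(self, other):
        if isinstance(other, (int, float)):
            return Scaled(self.value * other)
        return NotImplemented        # signal: let Python try the other operand

    def __rmul__(self, other):
        # Called for `other * self` when type(other) cannot handle Scaled.
        return self.__mul__(other)

    def __repr__(self):
        return f"Scaled({self.value})"


s = Scaled(3)
print(s * 2)     # Scaled(6), via Scaled.__mul__(s, 2)
print(2 * s)     # Scaled(6), int.__mul__ gives NotImplemented, then Scaled.__rmul__(s, 2)

try:
    s * object()
except TypeError:
    print("TypeError")   # neither operand could handle the operation
```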

Dataclasses: Modern Python

Dataclasses automatically generate __init__, __repr__, __eq__, and more — reducing boilerplate significantly.

from dataclasses import dataclass, field
from typing import List, Optional
import numpy as np

@dataclass
class DataSplit:
    """A train/validation/test split with metadata.

    @dataclass automatically generates:
    - __init__(self, X, y, name, ...)
    - __repr__(self)
    - __eq__(self, other)
    """
    X: np.ndarray
    y: np.ndarray
    name: str = "split"
    metadata: dict = field(default_factory=dict)

    @property
    def n_samples(self) -> int:
        return len(self.X)

    @property
    def n_features(self) -> int:
        return self.X.shape[1] if self.X.ndim > 1 else 1

    def describe(self) -> str:
        return (f"{self.name}: {self.n_samples} samples, "
                f"{self.n_features} features")


@dataclass(frozen=True)  # frozen=True makes it immutable
class ModelConfig:
    """Immutable model configuration.

    frozen=True prevents:
    - config.learning_rate = 0.01  # ❌ FrozenInstanceError (a subclass of AttributeError)
    """
    learning_rate: float = 0.01
    epochs: int = 100
    batch_size: int = 32
    hidden_layers: tuple = (64, 32)
    dropout: float = 0.2
    random_seed: int = 42

    def __post_init__(self):
        """Validate after initialization."""
        if self.learning_rate <= 0:
            raise ValueError("learning_rate must be positive")
        if self.epochs <= 0:
            raise ValueError("epochs must be positive")


@dataclass
class Experiment:
    """Track ML experiment results."""
    name: str
    config: ModelConfig
    metrics: dict = field(default_factory=dict)
    tags: List[str] = field(default_factory=list)

    def log_metric(self, key: str, value: float):
        self.metrics[key] = value

    def summary(self) -> str:
        metrics_str = ', '.join(f'{k}={v:.4f}' for k, v in self.metrics.items())
        return f"{self.name}: {metrics_str}"


# Usage
config = ModelConfig(learning_rate=0.001, epochs=200)
print(config)  # ModelConfig(learning_rate=0.001, epochs=200, batch_size=32, ...)

exp = Experiment("baseline", config, tags=["v1", "baseline"])
exp.log_metric("accuracy", 0.95)
exp.log_metric("loss", 0.08)
print(exp.summary())  # baseline: accuracy=0.9500, loss=0.0800

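Dataclasses with frozen=True create immutable objects. With the default eq=True they are also hashable, so they can be used as dictionary keys or set members, provided every field is itself hashable (a tuple works, a list or dict does not). This is ideal for configuration objects in ML pipelines where you want to ensure configurations don't change during execution.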
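The sketch below illustrates both properties with a hypothetical RunConfig class: a frozen config serves as a dictionary key, and dataclasses.replace creates a modified copy instead of mutating the original. The scores in the dictionary are placeholder numbers, not real results.

```python
from dataclasses import dataclass, replace, FrozenInstanceError


@dataclass(frozen=True)
class RunConfig:
    """Hypothetical config; every field is hashable, so instances are too."""
    learning_rate: float = 0.01
    epochs: int = 100


base = RunConfig()
fast = replace(base, learning_rate=0.1)   # new object; base is unchanged

results = {base: 0.91, fast: 0.88}        # configs used as dictionary keys
print(results[RunConfig(0.01, 100)])      # 0.91 (equal fields give an equal hash)
print(base.learning_rate)                 # 0.01

try:
    base.epochs = 5
except FrozenInstanceError:
    print("cannot modify a frozen dataclass")
```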

Building a Custom sklearn-Compatible Transformer

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils.validation import check_is_fitted

class TargetEncoder(BaseEstimator, TransformerMixin):
    """Encode categorical features using target statistics.

    Mathematical: enc(x) = (count(x) * mean(y|x) + m * global_mean) /
                           (count(x) + m)

    where:
    - count(x) = number of times category x appears
    - mean(y|x) = mean of target for category x
    - m = smoothing parameter (prevents overfitting)
    - global_mean = mean of all target values

    This is Bayesian smoothing: categories with few observations
    are pulled toward the global mean.
    """

    def __init__(self, smoothing=10.0):
        self.smoothing = smoothing

    def fit(self, X, y):
        """Learn category → target mappings."""
        X = np.asarray(X).ravel()
        y = np.asarray(y).ravel()

        self.global_mean_ = y.mean()
        self.encoding_ = {}

        for category in np.unique(X):
            mask = X == category
            count = mask.sum()
            mean_target = y[mask].mean()
            # Bayesian smoothing
            self.encoding_[category] = (
                (count * mean_target + self.smoothing * self.global_mean_) /
                (count + self.smoothing)
            )

        self.is_fitted_ = True
        return self

    def transform(self, X):
        """Apply learned encoding."""
        check_is_fitted(self, ['is_fitted_', 'encoding_'])
        X = np.asarray(X).ravel()
        result = np.array([self.encoding_.get(c, self.global_mean_) for c in X])
        return result.reshape(-1, 1)


# Usage with sklearn Pipeline
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
import pandas as pd

# Sample data
np.random.seed(42)
df = pd.DataFrame({
    'city': np.random.choice(['NYC', 'LA', 'Chicago', 'Houston'], 1000),
    'income': np.random.normal(50000, 15000, 1000),
    'target': np.random.randint(0, 2, 1000)
})

# Create pipeline
pipe = Pipeline([
    ('encoder', TargetEncoder(smoothing=5.0)),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])

# Fit the encoder on its own to inspect the encoded values
X_encoded = df[['city']].values
y = df['target'].values

encoder = TargetEncoder(smoothing=5.0)
X_transformed = encoder.fit_transform(X_encoded, y)  # fit() needs the target

print("Original categories:", df['city'].unique())
print("Encoded values (sample):", X_transformed[:5].flatten().round(4))
# Each value is close to 0.5 because the target in this sample is random

Target Encoding

$$\text{enc}(x) = \frac{\text{count}(x) \cdot \bar{y}_x + m \cdot \bar{y}_{\text{global}}}{\text{count}(x) + m}$$

Here,

$x$ = Category value
$\text{count}(x)$ = Number of times category x appears
$\bar{y}_x$ = Mean target for category x
$m$ = Smoothing parameter
$\bar{y}_{\text{global}}$ = Global mean of all targets
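To see the smoothing effect of the formula, it helps to work through it by hand. The sketch below applies it to made-up toy data with a hypothetical helper target_encode; category A has only 2 rows, so its encoding is pulled strongly toward the global mean, while category B with 8 rows moves only slightly.

```python
def target_encode(count, category_mean, global_mean, m):
    """Smoothed target encoding for one category (formula from the text)."""
    return (count * category_mean + m * global_mean) / (count + m)


# Toy data (made up for illustration): category A has 2 rows, B has 8.
targets = {"A": [1, 1], "B": [1, 1, 0, 0, 0, 0, 0, 0]}
all_targets = [t for ts in targets.values() for t in ts]
global_mean = sum(all_targets) / len(all_targets)      # 4 / 10 = 0.4

encoded = {}
for cat, ts in targets.items():
    encoded[cat] = target_encode(len(ts), sum(ts) / len(ts), global_mean, m=2)
    print(cat, round(encoded[cat], 4))
# A 0.7    raw mean 1.00 -> (2*1.00 + 2*0.4) / (2 + 2)
# B 0.28   raw mean 0.25 -> (8*0.25 + 2*0.4) / (8 + 2)
```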

Building a Dataset Class (PyTorch-Style)

import numpy as np

class Dataset:
    """A generic dataset class supporting indexing and slicing.

    Similar to PyTorch's Dataset but without framework dependencies.
    Supports:
    - Dataset[i] → (features, target)
    - Dataset[i:j] → subset
    - len(dataset) → number of samples
    - iteration over samples
    """

    def __init__(self, features, targets=None, transform=None):
        self.features = np.asarray(features, dtype=np.float32)
        self.targets = np.asarray(targets, dtype=np.float32) if targets is not None else None
        self.transform = transform
        self._validate()

    def _validate(self):
        if self.targets is not None and len(self.features) != len(self.targets):
            raise ValueError(f"Features ({len(self.features)}) and "
                           f"targets ({len(self.targets)}) length mismatch")

    def __len__(self):
        return len(self.features)

    def __getitem__(self, idx):
        if isinstance(idx, slice):
            features = self.features[idx]
            targets = self.targets[idx] if self.targets is not None else None
            return Dataset(features, targets, self.transform)

        x = self.features[idx]
        if self.targets is not None:
            y = self.targets[idx]
            if self.transform:
                x = self.transform(x)
            return x, y
        return self.transform(x) if self.transform else x

    def __repr__(self):
        return (f"Dataset(n_samples={len(self)}, "
                f"n_features={self.features.shape[1]}, "
                f"has_targets={self.targets is not None})")

    @property
    def shape(self):
        return self.features.shape

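    def split(self, train_ratio=0.8, shuffle=True, seed=42):
        """Split into train/test datasets."""
        n = len(self)
        indices = np.random.RandomState(seed).permutation(n) if shuffle else np.arange(n)
        split_idx = int(n * train_ratio)

        def subset(idx):
            # __getitem__ only wraps slices, so build each Dataset explicitly
            targets = self.targets[idx] if self.targets is not None else None
            return Dataset(self.features[idx], targets, self.transform)

        return subset(indices[:split_idx]), subset(indices[split_idx:])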

    def batch(self, batch_size, shuffle=True):
        """Yield mini-batches for training."""
        n = len(self)
        indices = np.random.permutation(n) if shuffle else np.arange(n)

        for start in range(0, n, batch_size):
            batch_idx = indices[start:start + batch_size]
            yield self[batch_idx]


# Usage
np.random.seed(42)
X = np.random.randn(500, 10)
y = (X[:, 0] + X[:, 1] > 0).astype(float)

dataset = Dataset(X, y)
print(dataset)  # Dataset(n_samples=500, n_features=10, has_targets=True)

# Single sample
x, label = dataset[0]
print(f"Sample shape: {x.shape}, Label: {label}")

# Slice
subset = dataset[:100]
print(f"Subset: {subset}")

# Split
train, test = dataset.split(train_ratio=0.8)
print(f"Train: {train}, Test: {test}")

# Mini-batches
for batch_X, batch_y in dataset.batch(batch_size=32):
    print(f"Batch X shape: {batch_X.shape}, Batch y shape: {batch_y.shape}")
    break  # Just show first batch

When to Use OOP vs Functional

Criterion	Use OOP	Use Functional
State management needed	✓	✗
Multiple related operations	✓	✗
sklearn/PyTorch compatibility	✓	✗
One-off data transformation	✗	✓
Simple scripts	✗	✓
Pipeline composition	✓	✓
Testing (mocking)	✓	✗
Mathematical functions	✗	✓
Reusable components across projects	✓	✗
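The table marks pipeline composition as workable in both styles. The sketch below shows one way each could look, under the assumption that steps are plain functions on lists; compose, Chain, drop_negatives, and double are hypothetical names. The class version differs only in that it can keep state between runs.

```python
from functools import reduce


def compose(*funcs):
    """Functional style: chain functions left to right."""
    return lambda data: reduce(lambda acc, f: f(acc), funcs, data)


class Chain:
    """OOP style: the same chain, but it remembers how often it has run."""

    def __init__(self, *funcs):
        self.funcs = funcs
        self.calls = 0          # state kept between runs

    def __call__(self, data):
        self.calls += 1
        for f in self.funcs:
            data = f(data)
        return data


def drop_negatives(xs):
    return [x for x in xs if x >= 0]


def double(xs):
    return [2 * x for x in xs]


functional = compose(drop_negatives, double)
stateful = Chain(drop_negatives, double)

print(functional([1, -2, 3]))   # [2, 6]
print(stateful([1, -2, 3]))     # [2, 6]
print(stateful.calls)           # 1
```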

Complete Example: DataPipeline Class

import numpy as np
from typing import List, Tuple, Optional, Dict, Any
from dataclasses import dataclass, field

@dataclass
class StepResult:
    """Result of a pipeline step."""
    name: str
    input_shape: Tuple
    output_shape: Tuple
    time_ms: float
    metadata: Dict[str, Any] = field(default_factory=dict)


class DataPipeline:
    """A composable data processing pipeline.

    Usage:
        pipeline = DataPipeline()
        pipeline.add_step('clean', clean_function)
        pipeline.add_step('scale', scale_function)
        pipeline.add_step('encode', encode_function)
        result = pipeline.run(data)
    """

    def __init__(self, name: str = "pipeline"):
        self.name = name
        self.steps: List[Tuple[str, callable]] = []
        self.results: List[StepResult] = []

    def add_step(self, name: str, func: callable):
        """Add a processing step."""
        self.steps.append((name, func))
        return self  # Enable chaining

    def run(self, data: Any, verbose: bool = False) -> Any:
        """Execute all pipeline steps sequentially."""
        import time

        self.results = []
        current = data

        for step_name, step_func in self.steps:
            start = time.perf_counter()
            input_shape = self._get_shape(current)

            try:
                current = step_func(current)
            except Exception as e:
                # Chain the original exception so its traceback is preserved
                raise RuntimeError(f"Pipeline failed at step '{step_name}': {e}") from e

            elapsed = (time.perf_counter() - start) * 1000
            output_shape = self._get_shape(current)

            result = StepResult(
                name=step_name,
                input_shape=input_shape,
                output_shape=output_shape,
                time_ms=elapsed
            )
            self.results.append(result)

            if verbose:
                print(f"[{step_name}] {input_shape} → {output_shape} ({elapsed:.1f}ms)")

        return current

    def _get_shape(self, data):
        if hasattr(data, 'shape'):
            return data.shape
        if isinstance(data, (list, tuple)):
            return (len(data),)
        return None

    def summary(self):
        """Print pipeline execution summary."""
        print(f"\n{'='*60}")
        print(f"Pipeline: {self.name}")
        print(f"{'='*60}")
        total_time = 0
        for i, result in enumerate(self.results, 1):
            print(f"  {i}. {result.name}")
            print(f"     Input:  {result.input_shape}")
            print(f"     Output: {result.output_shape}")
            print(f"     Time:   {result.time_ms:.1f}ms")
            total_time += result.time_ms
        print(f"{'='*60}")
        print(f"Total time: {total_time:.1f}ms")
        print(f"Total steps: {len(self.results)}")


# Define processing functions
def remove_nulls(data):
    """Remove rows with null values."""
    import pandas as pd
    if isinstance(data, pd.DataFrame):
        return data.dropna()
    return data

def standardize(data):
    """Standardize numeric features to zero mean and unit variance."""
    import pandas as pd
    if isinstance(data, pd.DataFrame):
        data = data.copy()  # avoid mutating the caller's DataFrame
        numeric = data.select_dtypes(include=[np.number])
        data[numeric.columns] = (numeric - numeric.mean()) / numeric.std()
    return data

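def add_features(data):
    """Add engineered features (expects raw, unstandardized income)."""
    import pandas as pd
    if isinstance(data, pd.DataFrame) and 'income' in data.columns:
        data = data.copy()  # avoid mutating the caller's DataFrame
        # log1p is undefined below -1, so clip negative incomes to 0 first
        data['income_log'] = np.log1p(data['income'].clip(lower=0))
        data['income_bin'] = pd.cut(data['income'], bins=5, labels=False)
    return data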

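# Build and run pipeline
# Feature engineering runs before standardization: log1p needs raw income,
# because standardized values can fall below -1 and would yield NaN.
pipeline = DataPipeline(name="customer_data_v1")
pipeline.add_step("remove_nulls", remove_nulls)
pipeline.add_step("add_features", add_features)
pipeline.add_step("standardize", standardize)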

# Sample data
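import numpy as np
import pandas as pd
np.random.seed(42)
df = pd.DataFrame({
    # float dtype so the column can hold NaN without an implicit upcast
    'age': np.random.randint(18, 70, 1000).astype(float),
    'income': np.random.normal(50000, 15000, 1000),
    'score': np.random.uniform(0, 100, 1000)
})
# replace=False guarantees that 50 distinct rows receive a missing value
df.loc[np.random.choice(df.index, 50, replace=False), 'age'] = np.nan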

result = pipeline.run(df, verbose=True)
pipeline.summary()
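
Because add_step returns self, steps can also be registered in a chained style. The sketch below uses a stripped-down stand-in class so that it runs on its own; MiniPipeline is a hypothetical name and is not the DataPipeline class above (it omits timing, shapes, and the summary).

```python
from typing import Any, Callable, List, Tuple


class MiniPipeline:
    """Hypothetical stand-in for DataPipeline: only add_step and run."""

    def __init__(self) -> None:
        self.steps: List[Tuple[str, Callable[[Any], Any]]] = []

    def add_step(self, name: str, func: Callable[[Any], Any]) -> "MiniPipeline":
        self.steps.append((name, func))
        return self  # returning self is what makes chaining possible

    def run(self, data: Any) -> Any:
        current = data
        for name, func in self.steps:
            try:
                current = func(current)
            except Exception as e:
                # "from e" keeps the original exception attached as __cause__
                raise RuntimeError(f"Pipeline failed at step '{name}': {e}") from e
        return current


# Each add_step call returns the pipeline, so calls can be chained
chained = (
    MiniPipeline()
    .add_step("drop_none", lambda xs: [x for x in xs if x is not None])
    .add_step("double", lambda xs: [2 * x for x in xs])
)
chained_result = chained.run([1, None, 3])
```

Chaining is purely a readability choice here; calling add_step on separate lines builds the same list of steps.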

Key Takeaways

📋Summary: OOP for Data Science

Classes define blueprints; objects are instances. Use __init__ to set up state and methods to define behavior.
self refers to the instance a method was called on. Python passes it automatically for calls like obj.method(), so you do not supply it yourself.
Encapsulation (_ and __ prefixes) marks internal state: a single underscore is a naming convention, and a double underscore triggers name mangling rather than true privacy. Use @property for controlled access.
Inheritance enables code reuse and polymorphism. Prefer composition over inheritance where either would work.
Magic methods (__repr__, __getitem__, __add__) make your objects behave like built-in Python types.
Dataclasses eliminate boilerplate for data-holding classes. Use frozen=True for immutable configs.
sklearn compatibility: Inherit from BaseEstimator and TransformerMixin to plug into sklearn Pipelines.
OOP suits stateful components (models, pipelines); a functional style suits stateless transformations.
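
Several of these points fit in one short sketch: how self is bound, how @property guards an internal attribute, and how frozen=True rejects assignment. Temperature and TrainConfig are hypothetical names chosen for illustration, not classes from the sections above.

```python
from dataclasses import FrozenInstanceError, dataclass


class Temperature:
    """Hypothetical example class, not taken from the sections above."""

    def __init__(self, celsius: float) -> None:
        self._celsius = celsius  # leading underscore: internal by convention

    @property
    def celsius(self) -> float:
        return self._celsius

    @celsius.setter
    def celsius(self, value: float) -> None:
        # The setter is the single place where the value is validated
        if value < -273.15:
            raise ValueError("below absolute zero")
        self._celsius = value

    def to_fahrenheit(self) -> float:
        return self._celsius * 9 / 5 + 32


t = Temperature(100.0)
# t.to_fahrenheit() is shorthand for Temperature.to_fahrenheit(t)
same = t.to_fahrenheit() == Temperature.to_fahrenheit(t)


@dataclass(frozen=True)
class TrainConfig:
    learning_rate: float = 0.01
    epochs: int = 100


cfg = TrainConfig()
try:
    cfg.epochs = 5  # frozen dataclasses reject attribute assignment
    frozen_rejected = False
except FrozenInstanceError:
    frozen_rejected = True
```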

Practice Exercises

Exercise 1: Build a Linear Regression Class

class LinearRegression:
    """Implement from scratch with fit/predict interface."""

    def __init__(self, learning_rate=0.01, epochs=1000):
        self.learning_rate = learning_rate
        self.epochs = epochs
        self.weights = None
        self.bias = None
        self.losses = []

    def fit(self, X, y):
        # Implement gradient descent
        # Store losses in self.losses
        pass

    def predict(self, X):
        # y = Xw + b
        pass

    def score(self, X, y):
        # R² score: 1 - SS_res / SS_tot
        pass
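
One way to check a score() implementation is to compare it against a direct transcription of the R² formula in the comment above. The helper below is a minimal sketch under that assumption; r2_score is a hypothetical standalone function here, not part of the exercise skeleton.

```python
import numpy as np


def r2_score(y_true, y_pred) -> float:
    """R^2 = 1 - SS_res / SS_tot (hypothetical helper for checking score())."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return float(1.0 - ss_res / ss_tot)


y = np.array([1.0, 2.0, 3.0, 4.0])
perfect = r2_score(y, y)                            # predictions match exactly
baseline = r2_score(y, np.full_like(y, y.mean()))   # always predict the mean
```

Two reference points are useful when testing: perfect predictions give 1, and always predicting the mean of y gives 0. The sketch does not guard against a constant target, where SS_tot is zero and the ratio is undefined.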

Exercise 2: Create a Transformer Pipeline

# Create these transformers:
# 1. OneHotEncoder: encode categories as binary columns
# 2. PolynomialFeatures: create polynomial interaction features
# 3. SelectKBest: keep top k features by correlation

class OneHotEncoder:
    def fit(self, X):
        pass
    def transform(self, X):
        pass

class PolynomialFeatures:
    def __init__(self, degree=2):
        pass
    def fit(self, X):
        pass
    def transform(self, X):
        pass

class SelectKBest:
    def __init__(self, k=5):
        pass
    def fit(self, X, y):
        pass
    def transform(self, X):
        pass
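
All three skeletons share the same contract: fit learns state from the training data, and transform applies that stored state to any data. As a hedged illustration of the contract only (not a solution to the exercise), here is a deliberately simple transformer; MeanCenterer is a hypothetical name.

```python
import numpy as np


class MeanCenterer:
    """Hypothetical transformer illustrating the fit/transform contract."""

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        # Learned state; the trailing underscore follows the sklearn naming convention
        self.mean_ = X.mean(axis=0)
        return self  # lets callers write fit(X).transform(X)

    def transform(self, X):
        if not hasattr(self, "mean_"):
            raise RuntimeError("call fit before transform")
        return np.asarray(X, dtype=float) - self.mean_


X_train = np.array([[1.0, 10.0], [3.0, 30.0]])
X_test = np.array([[2.0, 20.0]])

centerer = MeanCenterer().fit(X_train)
# The test rows are shifted by the training means, not by their own means
centered_test = centerer.transform(X_test)
```

Keeping the learned statistics on the instance is what prevents information from the test set leaking into the transformation.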

Exercise 3: Magic Methods Challenge

# Create a Matrix class with all operators
class Matrix:
    def __init__(self, data):
        self.data = np.array(data)

    # Implement:
    # __add__ (matrix addition)
    # __mul__ (element-wise multiplication)
    # __matmul__ (matrix multiplication)
    # __getitem__ (indexing)
    # __setitem__ (assignment)
    # __eq__ (equality check)
    # __repr__ (string representation)
    # @property determinant (compute determinant)
    # @property inverse (compute inverse)
    # @property transpose (transpose matrix)
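
A common stumbling block in this exercise is __eq__: comparing two NumPy arrays with == yields an element-wise array rather than a single bool. The sketch below shows one way to handle that on a smaller, hypothetical Vec class, leaving the Matrix class itself as the exercise.

```python
import numpy as np


class Vec:
    """Hypothetical 1-D wrapper showing three of the dunder methods."""

    def __init__(self, data):
        self.data = np.array(data, dtype=float)

    def __add__(self, other):
        if not isinstance(other, Vec):
            return NotImplemented  # lets Python try the other operand or raise TypeError
        return Vec(self.data + other.data)

    def __eq__(self, other):
        if not isinstance(other, Vec):
            return NotImplemented
        # self.data == other.data would give an element-wise array, not a bool
        return bool(np.array_equal(self.data, other.data))

    def __repr__(self):
        return f"Vec({self.data.tolist()})"


a = Vec([1, 2])
b = Vec([3, 4])
total = a + b
```

Note that a class defining __eq__ without __hash__ becomes unhashable, so its instances cannot be used as dict keys or set members unless __hash__ is defined as well.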

Exercise 4: Build a Config System

# Create a hierarchical config system using dataclasses
from dataclasses import dataclass, field

@dataclass
class DatabaseConfig:
    host: str = "localhost"
    port: int = 5432
    name: str = "mydb"

@dataclass
class ModelConfig:
    learning_rate: float = 0.01
    epochs: int = 100
    batch_size: int = 32

@dataclass
class AppConfig:
    db: DatabaseConfig = field(default_factory=DatabaseConfig)
    model: ModelConfig = field(default_factory=ModelConfig)
    debug: bool = False

    # Implement:
    # - to_dict() method (convert to dictionary)
    # - from_dict() classmethod (create from dictionary)
    # - save() method (save to JSON file)
    # - load() classmethod (load from JSON file)
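
For the conversion methods, the standard library's dataclasses.asdict already recurses into nested dataclasses, while the reverse direction has to rebuild the nested objects explicitly. A minimal sketch of that round trip, using hypothetical Inner and Outer classes instead of the config classes above:

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Inner:
    value: int = 1


@dataclass
class Outer:
    inner: Inner = field(default_factory=Inner)
    flag: bool = False


original = Outer(inner=Inner(value=7), flag=True)

as_plain = asdict(original)   # nested dataclasses become nested dicts
text = json.dumps(as_plain)   # plain dicts serialize directly to JSON

loaded = json.loads(text)
# Outer(**loaded) would leave .inner as a dict, so rebuild the nested part by hand
restored = Outer(inner=Inner(**loaded["inner"]), flag=loaded["flag"])
```

The generated __eq__ compares dataclasses field by field, so restored == original is a convenient check for a save and load round trip.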

The `self` Parameter