Data Versioning with DVC

💡 Data Version Control (DVC) versions datasets, models, and pipelines alongside your Git code. It enables reproducible ML workflows without storing large files in Git. This lesson covers DVC fundamentals, pipelines, remote storage, and best practices.

1. Why DVC?

The Problem

Git Repository with ML Project
├── data/
│   ├── training_data.parquet  (2.3 GB)  ← too large for Git
│   └── test_data.parquet      (450 MB)  ← too large for Git
├── models/
│   └── model.pkl              (89 MB)   ← too large for Git
├── notebooks/
│   └── train.ipynb            (2 MB)    ← Git handles
├── src/
│   ├── train.py               (5 KB)    ← Git handles
│   └── evaluate.py            (3 KB)    ← Git handles
└── requirements.txt           (1 KB)    ← Git handles

DVC Solution

Git Repository + DVC
├── data/
│   ├── training_data.parquet  ← actual data, ignored by Git and stored in the DVC cache
│   ├── training_data.parquet.dvc  ← Git tracks this small file
│   └── test_data.parquet.dvc  ← Git tracks this small file
├── models/
│   └── model.pkl.dvc          ← Git tracks this small file
├── .dvc/                      ← DVC config
│   └── config
├── src/
│   └── train.py
└── dvc.yaml                   ← Pipeline definition

Definition: Data Versioning

Data versioning is the practice of tracking and managing changes to datasets over time, similar to how Git tracks code changes. It enables reproducibility by allowing you to retrieve the exact dataset used for any experiment or model version.

ℹ️ Why Not Just Git LFS?

Git LFS (Large File Storage) replaces large files in Git with small pointers and stores the content on an LFS server, but it doesn't support pipelines, dependency tracking, or experiment management. DVC adds these ML-specific features, lets you choose your own storage backend (S3, GCS, Azure, a shared disk), and leaves code versioning to Git.

2. DVC Setup

Installation & Initialization

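# Install DVC (add an extra such as "dvc[s3]" or "dvc[gs]" for your remote type)
pip install dvc

# Initialize inside an existing Git repository
dvc init
git commit -m "Initialize DVC"

# Show DVC version and environment details
dvc doctor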

Tracking Data Files

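# Track a file
dvc add data/training_data.parquet

# This creates or updates:
# 1. data/training_data.parquet.dvc (small tracking file)
# 2. data/.gitignore (so Git ignores the data file itself)
# 3. .dvc/cache/ (actual data stored here)

# Commit the tracking files to Git
git add data/training_data.parquet.dvc data/.gitignore
git commit -m "Track training data with DVC"

# Track multiple files
dvc add data/test_data.parquet models/model.pkl

# List DVC-tracked files in the repository
dvc list . --dvc-only -R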

DVC Tracking File

# data/training_data.parquet.dvc
outs:
- md5: a1b2c3d4e5f678901234567890123456
  size: 2415513600
  hash: md5
  path: training_data.parquet

💡 How DVC Tracking Works

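DVC stores a small .dvc file in Git that records the MD5 hash, size, and path of the tracked file. The actual data lives in .dvc/cache locally and can be pushed to remote storage (S3, GCS, etc.). After you check out a Git commit, run dvc checkout (or dvc pull if the data is not yet in the local cache) and DVC uses the .dvc file to restore the matching data version. Running dvc install once adds Git hooks that perform the checkout step automatically.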
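
The content-addressing idea described above can be sketched in a few lines of Python. This is an illustration only, not DVC's implementation: the helper names file_md5 and cache_path are hypothetical, and the two-character directory split is an assumed layout whose exact form differs between DVC versions.

```python
import hashlib
from pathlib import Path


def file_md5(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large datasets never need to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cache_path(cache_dir, md5_hex):
    """Assumed layout: the first two hex characters become a subdirectory."""
    return Path(cache_dir) / md5_hex[:2] / md5_hex[2:]
```

Because the storage location is derived from the content hash, two identical files map to the same cache entry, and any change to the file produces a new entry instead of overwriting the old one.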

3. Remote Storage

Configure Remote

# Add a default remote (choose ONE of the options below)

# Option: S3
dvc remote add -d myremote s3://my-bucket/dvc-storage
# --local writes credentials to .dvc/config.local, which Git ignores
dvc remote modify --local myremote access_key_id <AWS_KEY>
dvc remote modify --local myremote secret_access_key <AWS_SECRET>

# Option: GCS
dvc remote add -d myremote gs://my-bucket/dvc-storage

# Option: Azure
dvc remote add -d myremote azure://my-container/dvc-storage

# Option: local or network directory
dvc remote add -d myremote /mnt/shared/dvc-storage

# List remotes
dvc remote list

Push & Pull Data

# Push data to remote
dvc push

# Pull data from remote
dvc pull

# Push specific files
dvc push data/training_data.parquet.dvc

# Compare the local cache with the remote
dvc status -c

ℹ️ Remote Storage Strategy

Use shared remote storage (S3, GCS) for team collaboration. Each developer pulls only the data they need, keeping local storage minimal. For CI/CD, store the remote credentials as secrets and pull data before training.
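
Pulling on demand boils down to copying only the objects that are needed but not yet cached. The sketch below illustrates that idea under simplifying assumptions: the "remote" is a plain directory, objects are stored flat under their hash, and pull_missing is a hypothetical helper, not part of DVC.

```python
import shutil
from pathlib import Path


def pull_missing(needed_hashes, remote_dir, cache_dir):
    """Copy only the objects that are needed but absent from the local cache."""
    remote_dir, cache_dir = Path(remote_dir), Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    fetched = []
    for md5_hex in sorted(needed_hashes):
        target = cache_dir / md5_hex
        if target.exists():
            continue  # already cached locally, nothing to transfer
        source = remote_dir / md5_hex
        if not source.exists():
            raise FileNotFoundError(f"object {md5_hex} is missing from the remote")
        shutil.copy2(source, target)
        fetched.append(md5_hex)
    return fetched
```

A second call with the same hashes transfers nothing, which is why repeated pulls in CI or on a developer machine stay cheap.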

4. DVC Pipelines

Basic Pipeline

# dvc.yaml
stages:
  prepare:
    cmd: python src/prepare.py
    deps:
      - src/prepare.py
      - data/raw_data.csv
    params:
      # Only parameter names are listed here; the values live in params.json
      - params.json:
          - prepare.test_size
          - prepare.random_state
    outs:
      - data/training_data.parquet
      - data/test_data.parquet

  train:
    cmd: python src/train.py
    deps:
      - src/train.py
      - data/training_data.parquet
    params:
      - params.json:
          - train.n_estimators
          - train.max_depth
          - train.learning_rate
    outs:
      - models/model.pkl
    metrics:
      - metrics.json:
          cache: false

  evaluate:
    cmd: python src/evaluate.py
    deps:
      - src/evaluate.py
      - models/model.pkl
      - data/test_data.parquet
    metrics:
      # An output may belong to only one stage, so evaluate writes its own file
      - eval_metrics.json:
          cache: false
    plots:
      - plots/confusion_matrix.json:
          cache: false

Parameters File

# params.json (label only; JSON files cannot contain comments)
{
  "prepare": {
    "test_size": 0.2,
    "random_state": 42
  },
  "train": {
    "n_estimators": 100,
    "max_depth": 5,
    "learning_rate": 0.01
  }
}
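
DVC only tracks these values; the stage scripts still have to read them. A minimal sketch of how src/train.py might do that, assuming the params.json layout shown above (load_params is a hypothetical helper name):

```python
import json


def load_params(section, path="params.json"):
    """Return one top-level section of the params file, e.g. "train"."""
    with open(path) as f:
        params = json.load(f)
    if section not in params:
        raise KeyError(f"section {section!r} not found in {path}")
    return params[section]
```

Inside the training script, load_params("train")["n_estimators"] would then return 100 for the file above. Reading every tunable value from the params file, rather than hard-coding it, is what lets DVC notice when a parameter changes.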

Run Pipeline

# Run entire pipeline
dvc repro

# Run specific stage
dvc repro train

# Show pipeline DAG
dvc dag

# Show pipeline status
dvc status

Definition: DVC Pipeline

A DVC pipeline is a directed acyclic graph (DAG) of stages that defines the ML workflow. Each stage has dependencies (inputs), a command (processing), and outputs (results). DVC records the hashes of each stage's dependencies, parameters, and outputs in dvc.lock and re-runs a stage only when one of them has changed.

💡 Pipeline Reproducibility

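DVC pipelines support reproducibility by tracking: (1) code dependencies, (2) data dependencies, (3) parameters. When any of these changes, dvc repro re-runs the affected stage and any downstream stage whose inputs change as a result; unchanged stages are skipped. Results are only exactly repeatable if the stage commands are themselves deterministic, for example by fixing random seeds as prepare.random_state does.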
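
The change-detection logic can be sketched as follows. This is a simplified model, not DVC's actual algorithm: fingerprint and stale_stages are hypothetical helpers, and the sketch marks every stage downstream of a re-run as stale, whereas DVC skips a downstream stage if the regenerated input turns out to be identical.

```python
import hashlib


def fingerprint(values):
    """Combine dependency hashes and parameter values into one digest."""
    digest = hashlib.md5()
    for key in sorted(values):
        digest.update(f"{key}={values[key]}\n".encode())
    return digest.hexdigest()


def stale_stages(stages, current, recorded):
    """Return the stages to re-run, in execution order.

    stages:   list of (name, deps, outs) tuples in execution order
    current:  mapping of dependency or parameter name -> current hash or value
    recorded: mapping of stage name -> fingerprint saved after the last run
    """
    stale, dirty_outputs = [], set()
    for name, deps, outs in stages:
        inputs = {dep: current[dep] for dep in deps}
        changed = fingerprint(inputs) != recorded.get(name)
        # A stage is also stale when an upstream re-run will rewrite one of its inputs.
        if changed or dirty_outputs.intersection(deps):
            stale.append(name)
            dirty_outputs.update(outs)
    return stale
```

With the three-stage pipeline above, changing only train.max_depth would mark train and evaluate as stale while prepare is skipped.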

5. Experiments

Experiment Tracking

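# Run experiment with modified params
# (the params.json: prefix is needed because the default params file is params.yaml)
dvc exp run -S params.json:train.n_estimators=200 -S params.json:train.max_depth=10

# Run multiple named experiments
dvc exp run -S params.json:train.n_estimators=50 -n exp_50_trees
dvc exp run -S params.json:train.n_estimators=100 -n exp_100_trees
dvc exp run -S params.json:train.n_estimators=200 -n exp_200_trees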

# List experiments
dvc exp list

# Show experiment diff
dvc exp diff exp_200_trees

# Compare experiments
dvc exp show

# Apply best experiment
dvc exp apply exp_200_trees

Hyperparameter Sweeps

# dvc.yaml (the stage declares which parameters it depends on, not the values to sweep)
stages:
  train:
    cmd: python src/train.py
    params:
      - params.json:
          - train.n_estimators
          - train.max_depth

# Queue one experiment per combination (3 x 3 = 9 in total)
dvc exp run --queue -S params.json:train.n_estimators=50 -S params.json:train.max_depth=3
dvc exp run --queue -S params.json:train.n_estimators=50 -S params.json:train.max_depth=5
dvc exp run --queue -S params.json:train.n_estimators=100 -S params.json:train.max_depth=3
# ... etc

# Run queued experiments
dvc exp run --run-all
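
Typing every combination by hand gets tedious, so the queue commands can be generated instead. The sketch below assumes the parameters live in params.json, which is why each -S override carries the file name as a prefix; sweep_commands is a hypothetical helper that only builds the command strings and does not run them.

```python
import itertools
import shlex


def sweep_commands(grid, params_file="params.json"):
    """Build one queued `dvc exp run` command per combination in the grid."""
    names = sorted(grid)
    commands = []
    for values in itertools.product(*(grid[name] for name in names)):
        args = ["dvc", "exp", "run", "--queue"]
        for name, value in zip(names, values):
            args += ["-S", f"{params_file}:{name}={value}"]
        commands.append(shlex.join(args))
    return commands
```

Calling sweep_commands({"train.n_estimators": [50, 100, 200], "train.max_depth": [3, 5, 10]}) yields nine commands, one per combination, which can be written to a shell script or passed to subprocess.run one at a time.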

6. Metrics & Plots

Tracking Metrics

# src/train.py
import json
from sklearn.metrics import accuracy_score, f1_score

# ... training code ...

# Save metrics
metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "f1_score": f1_score(y_test, y_pred, average="weighted"),
    "train_time": training_time,
}

with open("metrics.json", "w") as f:
    json.dump(metrics, f, indent=2)

Visualizing Metrics

# Show metrics comparison
dvc metrics show

# Compare metrics between commits
dvc metrics diff HEAD~1

# Show plots
dvc plots show plots/confusion_matrix.json

# Compare plots
dvc plots diff HEAD~1

Plotting Training Curves

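# src/train.py
import csv
import os

# Make sure the output directory exists before writing
os.makedirs("plots", exist_ok=True)

# Save training curves (model, the loaders, n_epochs, lr, train and evaluate
# come from the surrounding training code)
with open("plots/training_curves.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["step", "train_loss", "val_loss", "learning_rate"])

    for epoch in range(n_epochs):
        train_loss = train(model, train_loader)
        val_loss = evaluate(model, val_loader)
        writer.writerow([epoch, train_loss, val_loss, lr])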

# dvc.yaml (top-level plots section, supported in recent DVC releases)
stages:
  train:
    cmd: python src/train.py
    outs:
      - plots/training_curves.csv:
          cache: false

plots:
  - plots/training_curves.csv:
      x: step
      y: [train_loss, val_loss]

ℹ️ Metrics vs Plots

Use metrics.json for scalar values (accuracy, loss) that you want to compare across experiments. Use plots for time-series data (training curves) that benefit from visualization. DVC supports both and can diff them across commits.
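
Comparing scalar metrics between two runs is conceptually a key-by-key difference of two JSON files. The sketch below shows that idea in plain Python; metrics_diff is a hypothetical helper for illustration and is not how dvc metrics diff is implemented.

```python
def metrics_diff(old, new):
    """Return {name: (old, new, change)} for metrics present in either dict."""
    diff = {}
    for name in sorted(set(old) | set(new)):
        before, after = old.get(name), new.get(name)
        if isinstance(before, (int, float)) and isinstance(after, (int, float)):
            change = after - before
        else:
            change = None  # metric was added or removed, so there is no numeric change
        diff[name] = (before, after, change)
    return diff
```

Loading two versions of metrics.json with json.load and passing them to metrics_diff gives a table of old value, new value, and change for each metric.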

7. Best Practices

Project Structure

my-ml-project/
├── .dvc/
│   └── config
├── data/
│   ├── raw/
│   │   └── data.csv.dvc
│   ├── processed/
│   │   ├── train.csv.dvc
│   │   └── test.csv.dvc
│   └── external/
│       └── large_dataset.dvc
├── models/
│   └── model.pkl.dvc
├── notebooks/
├── src/
│   ├── prepare.py
│   ├── train.py
│   └── evaluate.py
├── plots/
├── dvc.yaml
├── params.json
└── .gitignore

.gitignore for DVC

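# DVC-tracked data and models (dvc add also writes these entries for you)
/data/raw/*
/data/processed/*
/models/*

# Keep the .dvc tracking files. This only works because the patterns above
# ignore the directory contents (dir/*) rather than the directories themselves.
!*.dvc
!.gitignore

# plots/ and metrics files declared with cache: false are meant to be committed
# to Git, so they are not ignored here.
# .dvc/cache/ is already ignored by .dvc/.gitignore, which dvc init creates.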

Handling Large Datasets

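# Pull a single target instead of the whole project
dvc pull data/large_dataset.parquet

# Import a dataset that lives outside the project; DVC records where it came from
dvc import-url /shared/datasets/my_dataset data/external/my_dataset

# A shared directory can also serve as a remote (-d makes it the default)
dvc remote add -d shared /shared/dvc-storage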

8. Integration with Other Tools

MLflow + DVC

import json
import subprocess

import mlflow
import mlflow.sklearn

# Use DVC for data versioning
# Use MLflow for experiment tracking

with mlflow.start_run():
    # The Git commit pins the .dvc and dvc.lock files, and with them the data version
    git_commit = subprocess.check_output(
        ["git", "rev-parse", "HEAD"], text=True
    ).strip()
    mlflow.log_param("data_version", git_commit)

    # Log metrics
    with open("metrics.json") as f:
        metrics = json.load(f)
    mlflow.log_metrics(metrics)

    # Log model (model is the estimator trained earlier in the script)
    mlflow.sklearn.log_model(model, "model")

GitHub Actions + DVC

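# .github/workflows/train.yml
name: Train Model

on:
  push:
    branches: [main]

jobs:
  train:
    runs-on: ubuntu-latest
    # Remote credentials come from repository secrets
    env:
      AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
      AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
    steps:
      - uses: actions/checkout@v3
        with:
          fetch-depth: 0

      # Installs the latest DVC by default; set "version" to match your local DVC
      - name: Set up DVC
        uses: iterative/setup-dvc@v1

      - name: Install project dependencies
        run: pip install -r requirements.txt

      - name: Pull data
        run: dvc pull

      - name: Run pipeline
        run: dvc repro

      - name: Push updated data
        run: dvc push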

ℹ️ CI/CD with DVC

In CI/CD pipelines, store DVC remote credentials in secrets (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY). The workflow pulls data before training and pushes updated data/metrics after training. This keeps the full pipeline reproducible.

9. Key Takeaways

📋 Summary: Data Versioning with DVC

DVC versions large files (data, models) alongside Git code
Remote storage (S3, GCS, Azure) stores actual data; Git tracks small .dvc files
Pipelines define multi-stage workflows with dependencies and parameters
Experiments track hyperparameter changes and compare results
Metrics & plots visualize training progress and model performance
Integration: Combine with MLflow, GitHub Actions for full MLOps pipeline
Start simple: Track data → Add pipeline → Add experiments → Add remote storage
DVC uses content-addressable storage (MD5 hashes) for efficient caching
Pipelines ensure reproducibility by tracking all dependencies

10. Practice Exercises

Exercise 1: Basic DVC Setup

# TODO: Initialize DVC in your project
# Track your dataset with dvc add
# Set up a remote storage (S3, GCS, or local)
# Push and pull data

Exercise 2: DVC Pipeline

# TODO: Create a 3-stage DVC pipeline
# Stages: prepare, train, evaluate
# Add parameters and metrics
# Test reproducibility with dvc repro

Exercise 3: Experiment Tracking

# TODO: Run 5+ experiments with different hyperparameters
# Use: dvc exp run with different param values
# Compare: dvc exp show
# Apply: best experiment to main branch

Exercise 4: Full MLOps Pipeline

# TODO: Combine DVC with GitHub Actions
# Automated: data pull → train → evaluate → push
# Add: metrics visualization
# Deploy: model to staging environment
