Data Versioning with DVC

DVC (Data Version Control) works alongside Git to handle large files, datasets, and ML models. Git versions small pointer and pipeline files while DVC stores the data itself, so data and pipelines are versioned together with code.

Figure: DVC data versioning flow. Git (code + .dvc files) → DVC cache (MD5 hashes) → remote storage (S3 / GCS / Azure). Pipeline stages (deps → cmd → outs): prepare → featurize → train → evaluate. dvc repro detects changes and re-runs the affected stages.

DVC Architecture

Git Repository (.git/): code and .dvc files, params and metrics, pipeline definitions.
DVC Metadata: .dvc files, dvc.lock, and dvc.yaml, which pin data versions in the cache.
Remote Storage: S3 / GCS / Azure Blob, SSH / HDFS / HTTP; content-addressable, with deduplication via hashes.
Local Cache (.dvc/cache/): content-addressed storage, with files stored by MD5 hash.
Pipeline Stage: dvc.yaml defines stages as deps → cmd → outs; stages are cacheable and reproducible.

Git tracks .dvc files; DVC tracks the actual data.
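To make "content-addressed" concrete, here is a minimal sketch in plain Python (it does not call DVC). It hashes a file with MD5 and derives a storage path from the digest. The function names and the two-character prefix directory are assumptions for illustration; the exact layout under .dvc/cache/ depends on the DVC version.

```python
import hashlib
import tempfile
from pathlib import Path


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in chunks to bound memory."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cache_path(cache_root: Path, md5_hex: str) -> Path:
    """Map a digest to a storage location (assumed layout, for illustration)."""
    return cache_root / md5_hex[:2] / md5_hex[2:]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        # Two files with different names but identical bytes
        (root / "a.csv").write_bytes(b"id,label\n1,0\n")
        (root / "b.csv").write_bytes(b"id,label\n1,0\n")
        a = cache_path(root / "cache", md5_of_file(root / "a.csv"))
        b = cache_path(root / "cache", md5_of_file(root / "b.csv"))
        # Identical content resolves to a single storage location
        print(a == b)
```

Because the path depends only on the bytes, a renamed or copied dataset costs no extra storage, which is the deduplication the architecture overview refers to.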

1. Basic Versioning

# Initialize DVC in a Git repo (stages .dvc/config and .dvcignore for commit)
dvc init
git commit -m "Initialize DVC"

# Track a large file
dvc add data/train.csv

# This creates:
# - data/train.csv.dvc (pointer file tracked by Git)
# - .dvc/cache/ (actual data stored here)

# Commit the pointer to Git (dvc add writes the ignore rule to data/.gitignore)
git add data/train.csv.dvc data/.gitignore
git commit -m "Add training data"

# Connect to remote storage
dvc remote add -d storage s3://my-bucket/dvc-storage
dvc push

2. DVC Pipelines

Figure: DVC pipeline DAG. prepare → featurize → train → evaluate.
Each stage: deps (inputs) → cmd (script) → outs (outputs).
dvc repro detects changes and re-runs the affected stages.

Pipeline Definition

# dvc.yaml
stages:
  prepare:
    cmd: python src/prepare.py
    deps:
      - src/prepare.py
      - data/raw.csv
    outs:
      - data/prepared.csv

  featurize:
    cmd: python src/features.py
    deps:
      - src/features.py
      - data/prepared.csv
    params:
      - features.n_components
    outs:
      - data/features.npy

  train:
    cmd: python src/train.py
    deps:
      - src/train.py
      - data/features.npy
    params:
      - model.n_estimators
      - model.max_depth
      - model.learning_rate
    outs:
      - model.pkl

  evaluate:
    cmd: python src/evaluate.py
    deps:
      - src/evaluate.py
      - model.pkl
      - data/features.npy
    metrics:
      - metrics.json:
          cache: false
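The following sketch shows how a dependency graph can be derived from the deps and outs above, and how a change propagates downstream. It is plain Python built on the standard library's graphlib, not DVC's implementation; STAGES, build_graph, execution_order, and affected_stages are hypothetical names.

```python
from graphlib import TopologicalSorter

# Stage name -> (deps, outs), copied from the dvc.yaml above.
STAGES = {
    "prepare": (["src/prepare.py", "data/raw.csv"], ["data/prepared.csv"]),
    "featurize": (["src/features.py", "data/prepared.csv"], ["data/features.npy"]),
    "train": (["src/train.py", "data/features.npy"], ["model.pkl"]),
    "evaluate": (
        ["src/evaluate.py", "model.pkl", "data/features.npy"],
        ["metrics.json"],
    ),
}


def build_graph(stages):
    """Map each stage to the set of stages that produce one of its deps."""
    producer = {out: name for name, (_, outs) in stages.items() for out in outs}
    return {
        name: {producer[dep] for dep in deps if dep in producer}
        for name, (deps, _) in stages.items()
    }


def execution_order(stages):
    """Return stage names so that every stage comes after its upstream stages."""
    return list(TopologicalSorter(build_graph(stages)).static_order())


def affected_stages(stages, changed_path):
    """Return the stages to re-run, in order, when one input path changes."""
    graph = build_graph(stages)
    dirty = {name for name, (deps, _) in stages.items() if changed_path in deps}
    order = execution_order(stages)
    for name in order:
        # A stage is affected if any stage it depends on is affected
        if graph[name] & dirty:
            dirty.add(name)
    return [name for name in order if name in dirty]


if __name__ == "__main__":
    print(execution_order(STAGES))
    print(affected_stages(STAGES, "src/train.py"))
```

This sketch marks every downstream stage as affected. It is a simplification: it does not model the case where a re-run stage produces byte-identical outputs, in which later stages would have nothing new to process.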

3. Parameters and Metrics

# params.yaml
features:
  n_components: 50
  scaling: standard

model:
  n_estimators: 200
  max_depth: 10
  learning_rate: 0.1

evaluation:
  threshold: 0.5

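# src/train.py
import json

import joblib
import numpy as np
import yaml
from sklearn.ensemble import GradientBoostingClassifier

# Load the "model" section of params.yaml; its keys are passed to the classifier
with open("params.yaml") as f:
    params = yaml.safe_load(f)["model"]

# Load the features written by the featurize stage
X_train = np.load("data/features.npy")
# Assumption: labels are stored in a separate file; the pipeline above does not say where
y_train = np.load("data/labels.npy")

model = GradientBoostingClassifier(**params)
model.fit(X_train, y_train)
joblib.dump(model, "model.pkl")

# DVC metrics: a plain JSON file listed under "metrics" in dvc.yaml.
# Placeholder values; in the pipeline above, src/evaluate.py would compute and write them.
metrics = {"accuracy": 0.92, "f1": 0.89}
with open("metrics.json", "w") as f:
    json.dump(metrics, f)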
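Entries under params in dvc.yaml are dotted keys into params.yaml, and a stage only reacts to the keys it lists. The sketch below illustrates that idea in plain Python; PARAMS mirrors the params.yaml above as a dict, and get_param and changed_tracked_params are hypothetical helpers, not DVC functions.

```python
# Mirrors the params.yaml shown above.
PARAMS = {
    "features": {"n_components": 50, "scaling": "standard"},
    "model": {"n_estimators": 200, "max_depth": 10, "learning_rate": 0.1},
    "evaluation": {"threshold": 0.5},
}


def get_param(params, dotted_key):
    """Resolve a dotted key such as 'model.max_depth' in a nested dict."""
    node = params
    for part in dotted_key.split("."):
        node = node[part]
    return node


def changed_tracked_params(old, new, tracked):
    """Return the tracked dotted keys whose values differ between two versions."""
    return [key for key in tracked if get_param(old, key) != get_param(new, key)]


if __name__ == "__main__":
    tracked = ["model.n_estimators", "model.max_depth"]
    edited = {
        "features": {"n_components": 50, "scaling": "minmax"},  # untracked change
        "model": {"n_estimators": 200, "max_depth": 15, "learning_rate": 0.1},
        "evaluation": {"threshold": 0.5},
    }
    print(changed_tracked_params(PARAMS, edited, tracked))
```

A practical consequence: if a script reads a parameter that its stage does not list, editing that parameter will not mark the stage as changed, so it is worth listing every key the script actually uses.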

4. Experiment Comparison

# Run experiment
dvc exp run -S model.n_estimators=300 -S model.max_depth=15

# Compare experiments
dvc exp show
dvc exp diff HEAD~1

# Promote an experiment to a Git branch
dvc exp branch <experiment> branch-name
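As a rough mental model of what an experiment run and comparison involve, the sketch below applies a key=value override to a nested params dict and computes per-metric deltas between two runs. It is plain Python with hypothetical helper names (parse_value, apply_override, metric_diff) and does not reproduce DVC's own parsing rules or output format.

```python
import copy


def parse_value(raw):
    """Interpret a command-line value as int, then float, else keep the string."""
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            pass
    return raw


def apply_override(params, assignment):
    """Apply one 'dotted.key=value' assignment and return a modified copy."""
    dotted_key, raw = assignment.split("=", 1)
    *parents, leaf = dotted_key.split(".")
    updated = copy.deepcopy(params)  # leave the baseline params untouched
    node = updated
    for part in parents:
        node = node.setdefault(part, {})
    node[leaf] = parse_value(raw)
    return updated


def metric_diff(baseline, candidate):
    """Return candidate minus baseline for every metric present in both runs."""
    return {
        name: round(candidate[name] - baseline[name], 6)
        for name in baseline
        if name in candidate
    }


if __name__ == "__main__":
    base = {"model": {"n_estimators": 200, "max_depth": 10}}
    exp = apply_override(base, "model.n_estimators=300")
    print(base["model"]["n_estimators"], exp["model"]["n_estimators"])
```

Keeping the baseline untouched and producing a modified copy mirrors the idea that an experiment is a variation on a committed state, not an edit to it.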

5. Data Lineage

# Show dependency graph
dvc dag

# Show what changed
dvc status

# Reproduce entire pipeline
dvc repro

# Force full re-run
dvc repro --force
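Change detection can be pictured as comparing the hashes recorded at the last run (the role dvc.lock plays in the architecture above) with hashes of the current workspace. The sketch below is plain Python under that assumption; digest and stage_status are hypothetical names, and files are modelled as in-memory bytes to keep it self-contained.

```python
import hashlib


def digest(data: bytes) -> str:
    """Hex MD5 digest of some file content."""
    return hashlib.md5(data).hexdigest()


def stage_status(recorded: dict, current: dict) -> dict:
    """Compare recorded hashes with current content and report what differs.

    recorded maps path -> hash stored at the last run.
    current maps path -> bytes now in the workspace.
    An empty result means everything is up to date.
    """
    report = {}
    for path, old_hash in recorded.items():
        if path not in current:
            report[path] = "deleted"
        elif digest(current[path]) != old_hash:
            report[path] = "modified"
    for path in sorted(current.keys() - recorded.keys()):
        report[path] = "new"
    return report


if __name__ == "__main__":
    recorded = {"data/raw.csv": digest(b"a,b\n1,2\n")}
    workspace = {"data/raw.csv": b"a,b\n1,3\n"}  # one value edited
    print(stage_status(recorded, workspace))
```

Comparing hashes instead of timestamps means that touching a file without changing its bytes does not trigger a re-run.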

6. Remote Storage Configuration

# S3 (pick one backend from this list; -d marks it as the default remote)
dvc remote add -d storage s3://bucket/path
dvc remote modify storage region us-east-1

# GCS
dvc remote add -d storage gs://bucket/path

# Azure
dvc remote add -d storage azure://container/path

# SSH
dvc remote add -d storage ssh://user@host/path
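These commands persist their settings in .dvc/config, which is committed to Git so teammates share the same remote. The sketch below reads such a file with Python's configparser to find the default remote. SAMPLE_CONFIG is an assumed example of what the S3 commands above would produce, and default_remote is a hypothetical helper; DVC itself may parse the file differently.

```python
import configparser

# Assumed contents of .dvc/config after the two S3 commands above.
SAMPLE_CONFIG = """\
[core]
    remote = storage
['remote "storage"']
    url = s3://bucket/path
    region = us-east-1
"""


def default_remote(config_text: str) -> dict:
    """Return the name and settings of the default remote in a config text."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    name = parser["core"]["remote"]
    # Remote sections are named like 'remote "storage"', quotes included
    section = f"'remote \"{name}\"'"
    return {"name": name, **dict(parser[section])}


if __name__ == "__main__":
    print(default_remote(SAMPLE_CONFIG))
```

Credentials are normally kept out of this file; only the location and non-secret options belong in version control.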

7. Git Integration Workflow

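# Typical workflow: track data with DVC first so Git never stages the large file
dvc add data/processed.csv
git add data/processed.csv.dvc data/.gitignore

# Commit code changes together with the updated pointer file
git add .
git commit -m "Update feature engineering"

# Push code to Git and data to the DVC remote
git push
dvc push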

# Teammate pulls
git pull
dvc pull
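Conceptually, dvc push and dvc pull are the same operation in opposite directions: copy the objects that one side has and the other lacks, keyed by content hash. A minimal sketch of that idea, with stores modelled as dicts from hash to bytes (sync is a hypothetical helper, not a DVC function):

```python
def sync(source: dict, destination: dict) -> list:
    """Copy objects missing from destination and return the hashes transferred."""
    missing = sorted(source.keys() - destination.keys())
    for key in missing:
        destination[key] = source[key]
    return missing


if __name__ == "__main__":
    local_cache = {"aa11": b"train data", "bb22": b"processed data"}
    remote = {"aa11": b"train data"}

    # "push": local cache -> remote, only the new object moves
    print(sync(local_cache, remote))

    # "pull" on a teammate's empty cache: remote -> local
    teammate_cache = {}
    print(sync(remote, teammate_cache))
```

This is why git push and dvc push go together: Git carries the pointer files that name the hashes, and the DVC remote carries the objects those hashes refer to. The sketch copies everything that is missing, which is a simplification; it does not model selecting only the objects that the checked-out pointer files reference.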

Key Takeaways

Git tracks code, parameters, and .dvc pointer files
DVC stores large data files in a local cache and in remote storage, addressed by content hash
Pipelines (dvc.yaml) define reproducible DAGs with caching
Experiments are tracked via params and metrics, compared with dvc exp
