Data Versioning with DVC
DVC (Data Version Control) extends Git to handle large files, datasets, and ML models. It versions data and pipelines alongside code.
DVC Architecture
1. Basic Versioning
# Initialize DVC in a Git repo
dvc init
# Track a large file
dvc add data/train.csv
# This creates:
# - data/train.csv.dvc (pointer file tracked by Git)
# - .dvc/cache/ (actual data stored here)
# Commit the pointer to Git
git add data/train.csv.dvc data/.gitignore
git commit -m "Add training data"
# Connect to remote storage
dvc remote add -d storage s3://my-bucket/dvc-storage
dvc push
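The pointer-file mechanism rests on content addressing: a file's hash determines where its bytes live in the cache. The Python sketch below illustrates the idea only. The `file_md5` and `cache_path` helpers and the two-character directory split are assumptions made for illustration, not DVC's actual implementation, and the real cache layout differs between DVC versions.

```python
import hashlib
from pathlib import Path


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cache_path(cache_dir, md5_hex):
    """Hypothetical layout: the first two hex characters form a subdirectory."""
    return Path(cache_dir) / md5_hex[:2] / md5_hex[2:]
```

Because the location depends only on the content, two identical files map to the same cache entry, and any change to the data yields a new hash and therefore a new pointer to commit.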
2. DVC Pipelines
Pipeline Definition
# dvc.yaml
stages:
  prepare:
    cmd: python src/prepare.py
    deps:
      - src/prepare.py
      - data/raw.csv
    outs:
      - data/prepared.csv
  featurize:
    cmd: python src/features.py
    deps:
      - src/features.py
      - data/prepared.csv
    params:
      - features.n_components
    outs:
      - data/features.npy
  train:
    cmd: python src/train.py
    deps:
      - src/train.py
      - data/features.npy
    params:
      - model.n_estimators
      - model.max_depth
    outs:
      - model.pkl
  evaluate:
    cmd: python src/evaluate.py
    deps:
      - src/evaluate.py
      - model.pkl
      - data/features.npy
    metrics:
      - metrics.json:
          cache: false
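The stages form a DAG because each stage's `deps` can match another stage's `outs`. The Python sketch below shows how an execution order could be derived from those two lists. The `STAGES` dict is a hand-copied mirror of the pipeline definition and `execution_order` is a hypothetical helper, so this illustrates the principle rather than DVC's own scheduler.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Dependencies and outputs per stage, copied from the pipeline definition
STAGES = {
    "prepare": {"deps": ["src/prepare.py", "data/raw.csv"],
                "outs": ["data/prepared.csv"]},
    "featurize": {"deps": ["src/features.py", "data/prepared.csv"],
                  "outs": ["data/features.npy"]},
    "train": {"deps": ["src/train.py", "data/features.npy"],
              "outs": ["model.pkl"]},
    "evaluate": {"deps": ["src/evaluate.py", "model.pkl", "data/features.npy"],
                 "outs": ["metrics.json"]},
}


def execution_order(stages):
    """Order stages so each runs after the stages that produce its inputs."""
    # Map every output path to the stage that writes it
    producer = {out: name for name, s in stages.items() for out in s["outs"]}
    # A stage depends on the producers of its deps; plain source files have none
    graph = {
        name: {producer[d] for d in s["deps"] if d in producer}
        for name, s in stages.items()
    }
    return list(TopologicalSorter(graph).static_order())


print(execution_order(STAGES))
# → ['prepare', 'featurize', 'train', 'evaluate']
```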
3. Parameters and Metrics
# params.yaml
features:
  n_components: 50
  scaling: standard
model:
  n_estimators: 200
  max_depth: 10
  learning_rate: 0.1
evaluation:
  threshold: 0.5
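# src/train.py
import joblib
import numpy as np
import yaml
from sklearn.ensemble import GradientBoostingClassifier

# Read only the "model" section of params.yaml
with open("params.yaml") as f:
    params = yaml.safe_load(f)["model"]

# Assumption: the last column of data/features.npy holds the label
data = np.load("data/features.npy")
X_train, y_train = data[:, :-1], data[:, -1]

model = GradientBoostingClassifier(**params)
model.fit(X_train, y_train)
joblib.dump(model, "model.pkl")

# DVC metrics (written by src/evaluate.py, the stage that declares metrics.json)
import json

metrics = {"accuracy": 0.92, "f1": 0.89}  # placeholder values
with open("metrics.json", "w") as f:
    json.dump(metrics, f)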
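The `params` entries in `dvc.yaml` use dotted keys such as `model.max_depth` to address values nested inside `params.yaml`. The Python sketch below shows one way such a key could be resolved. `PARAMS` is a hand-copied mirror of the file and `get_param` is a hypothetical helper, not part of DVC.

```python
from functools import reduce

# Mirror of params.yaml as a nested dict
PARAMS = {
    "features": {"n_components": 50, "scaling": "standard"},
    "model": {"n_estimators": 200, "max_depth": 10, "learning_rate": 0.1},
    "evaluation": {"threshold": 0.5},
}


def get_param(params, dotted_key):
    """Resolve a key such as 'model.max_depth' by walking the nested dict."""
    return reduce(lambda node, part: node[part], dotted_key.split("."), params)


print(get_param(PARAMS, "model.max_depth"))  # → 10
```

A missing key raises `KeyError`, which is usually preferable to silently training with a default value.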
4. Experiment Comparison
# Run experiment
dvc exp run -S model.n_estimators=300 -S model.max_depth=15
# Compare experiments
dvc exp show
dvc exp diff HEAD~1
# Promote an experiment to a Git branch
dvc exp branch <experiment> branch-name
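Comparing experiments comes down to diffing the metrics files of two runs. The Python sketch below shows that comparison on two in-memory dicts. The `metrics_diff` helper and the numbers are illustrative assumptions, not measured results and not the logic behind `dvc exp diff`.

```python
def metrics_diff(baseline, candidate):
    """Return {metric: (old, new, change)} for metrics whose value changed."""
    return {
        name: (baseline[name], candidate[name],
               round(candidate[name] - baseline[name], 6))
        for name in sorted(baseline.keys() & candidate.keys())
        if baseline[name] != candidate[name]
    }


# Illustrative values only
baseline = {"accuracy": 0.92, "f1": 0.89}
candidate = {"accuracy": 0.94, "f1": 0.89}

print(metrics_diff(baseline, candidate))
# → {'accuracy': (0.92, 0.94, 0.02)}
```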
5. Data Lineage
# Show dependency graph
dvc dag
# Show what changed
dvc status
# Reproduce entire pipeline
dvc repro
# Force full re-run
dvc repro --force
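Without `--force`, reproduction skips stages whose inputs have not changed. DVC records dependency hashes in `dvc.lock` and compares them with the current state. The Python sketch below shows the comparison in its simplest form; `content_hash`, `stale_deps`, and the byte strings standing in for file contents are hypothetical.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """Hash raw bytes; a stand-in for hashing a file on disk."""
    return hashlib.md5(data).hexdigest()


def stale_deps(recorded, current):
    """Return dependency paths whose current hash differs from the recorded one."""
    return sorted(p for p, h in current.items() if recorded.get(p) != h)


# Hashes recorded at the last successful run (hypothetical contents)
recorded = {
    "src/train.py": content_hash(b"v1"),
    "data/features.npy": content_hash(b"features"),
}
# The training script was edited; the features are unchanged
current = {
    "src/train.py": content_hash(b"v2"),
    "data/features.npy": content_hash(b"features"),
}

changed = stale_deps(recorded, current)
print("rerun needed:" if changed else "up to date", changed)
# → rerun needed: ['src/train.py']
```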
6. Remote Storage Configuration
# S3
dvc remote add -d storage s3://bucket/path
dvc remote modify storage region us-east-1
# GCS
dvc remote add -d storage gs://bucket/path
# Azure
dvc remote add -d storage azure://container/path
# SSH
dvc remote add -d storage ssh://user@host/path
7. Git Integration Workflow
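# Typical workflow
# Track the data first so its pointer file exists
dvc add data/processed.csv
# Stage code changes together with data/processed.csv.dvc and data/.gitignore
git add .
git commit -m "Update feature engineering"
# Push code and pointers to Git, data to DVC remote storage
git push
dvc push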
# Teammate pulls
git pull
dvc pull
Key Takeaways
- Git tracks code, parameters, and .dvc pointer files
- DVC tracks large data files in remote storage via content hashing
- Pipelines (dvc.yaml) define reproducible DAGs with caching
- Experiments are tracked via params and metrics, compared with dvc exp