Data Versioning with DVC
Data Version Control (DVC) versions datasets, models, and pipelines alongside your Git code. It enables reproducible ML workflows without storing large files in Git. This lesson covers DVC fundamentals, pipelines, remote storage, and best practices.
1. Why DVC?
The Problem
Git Repository with ML Project
├── data/
│   ├── training_data.parquet (2.3 GB)   ✗ Git can't handle
│   └── test_data.parquet (450 MB)       ✗ Git can't handle
├── models/
│   └── model.pkl (89 MB)                ✗ Git can't handle
├── notebooks/
│   └── train.ipynb (2 MB)               ✓ Git handles
├── src/
│   ├── train.py (5 KB)                  ✓ Git handles
│   └── evaluate.py (3 KB)               ✓ Git handles
└── requirements.txt (1 KB)              ✓ Git handles
DVC Solution
Git Repository + DVC
├── data/
│   ├── training_data.parquet          ← DVC tracking (.dvc file)
│   ├── training_data.parquet.dvc      ← Git tracks this small file
│   └── test_data.parquet.dvc          ← Git tracks this small file
├── models/
│   └── model.pkl.dvc                  ← Git tracks this small file
├── .dvc/                              ← DVC config
│   └── config
├── src/
│   └── train.py
└── dvc.yaml                           ← Pipeline definition
Definition: Data Versioning
Data versioning is the practice of tracking and managing changes to datasets over time, similar to how Git tracks code changes. It enables reproducibility by allowing you to retrieve the exact dataset used for any experiment or model version.
Why Not Just Git LFS?
Git LFS (Large File Storage) keeps small pointer files in Git and stores the large file contents on a separate LFS server. It versions files, but it doesn't support pipelines, dependency tracking, or experiment management. DVC adds these ML-specific features while Git continues to version the code.
2. DVC Setup
Installation & Initialization
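# Install DVC (add an extra such as "dvc[s3]" or "dvc[gs]" for cloud remotes)
pip install dvc
# Initialize inside an existing Git repository
dvc init
# Show DVC version and environment details
dvc doctor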
Tracking Data Files
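# Track a file
dvc add data/training_data.parquet
# This creates:
# 1. data/training_data.parquet.dvc (small tracking file)
# 2. An entry in .dvc/cache/ (actual data stored here)
# It also adds the data file to data/.gitignore
# Commit the tracking file to Git
git add data/training_data.parquet.dvc data/.gitignore
# Track multiple files
dvc add data/test_data.parquet models/model.pkl
# List DVC-tracked files (plain "dvc list ." also shows Git-tracked files)
dvc list . --dvc-only -R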
DVC Tracking File
# data/training_data.parquet.dvc
outs:
- md5: a1b2c3d4e5f678901234567890123456
  size: 2415513600
  hash: md5
  path: training_data.parquet
How DVC Tracking Works
DVC stores a small .dvc file in Git that records the MD5 hash, size, and path of the tracked file. The actual data is stored in .dvc/cache locally and can be pushed to remote storage (S3, GCS, etc.). After you check out a Git commit, run dvc checkout (or dvc pull if the data is not in the local cache yet) so that DVC restores the data version the .dvc file points to.
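The hashing idea behind the .dvc file can be sketched in a few lines of Python. This is a minimal illustration, not DVC's own code: file_md5 and cache_relpath are hypothetical helper names, and the two-character directory split is an assumption about how a content-addressed cache might be laid out (the real layout varies between DVC versions).

```python
import hashlib
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        # Read fixed-size chunks so multi-gigabyte files never sit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cache_relpath(md5_hex: str) -> Path:
    """Hypothetical cache layout: the first two hex characters name a directory."""
    return Path(md5_hex[:2]) / md5_hex[2:]
```

Because the location is derived from the content, two identical files map to the same cache entry, and any change to the data yields a new hash and therefore a new entry.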
3. Remote Storage
Configure Remote
# Add a default remote (pick ONE of the options below)
# Option: S3
dvc remote add -d myremote s3://my-bucket/dvc-storage
# --local writes credentials to .dvc/config.local, which Git ignores
dvc remote modify --local myremote access_key_id <AWS_KEY>
dvc remote modify --local myremote secret_access_key <AWS_SECRET>
# Option: GCS
dvc remote add -d myremote gs://my-bucket/dvc-storage
# Option: Azure
dvc remote add -d myremote azure://my-container/dvc-storage
# Option: local or network directory
dvc remote add -d myremote /mnt/shared/dvc-storage
# List remotes
dvc remote list
Push & Pull Data
# Push data to remote
dvc push
# Pull data from remote
dvc pull
# Push specific files
dvc push data/training_data.parquet.dvc
# Compare the local cache with the remote
dvc status --cloud
Remote Storage Strategy
Use shared remote storage (S3, GCS) for team collaboration. Each developer pulls only the data they need, which keeps local storage small. For CI/CD, keep the remote credentials in the CI system's secrets and pull data before training.
4. DVC Pipelines
Basic Pipeline
# dvc.yaml
stages:
  prepare:
    cmd: python src/prepare.py
    deps:
      - src/prepare.py
      - data/raw_data.csv
    params:
      # List parameter names only; the values live in params.json
      - params.json:
          - prepare.test_size
          - prepare.random_state
    outs:
      - data/training_data.parquet
      - data/test_data.parquet
  train:
    cmd: python src/train.py
    deps:
      - src/train.py
      - data/training_data.parquet
    params:
      - params.json:
          - train.n_estimators
          - train.max_depth
          - train.learning_rate
    outs:
      - models/model.pkl
    metrics:
      - metrics.json:
          cache: false
  evaluate:
    cmd: python src/evaluate.py
    deps:
      - src/evaluate.py
      - models/model.pkl
      - data/test_data.parquet
    metrics:
      # An output may belong to only one stage, so evaluate writes its own
      # file (eval_metrics.json is an illustrative name)
      - eval_metrics.json:
          cache: false
    plots:
      - plots/confusion_matrix.json:
          cache: false
Parameters File
params.json (DVC looks for params.yaml by default, so dvc.yaml must name this file explicitly)
{
  "prepare": {
    "test_size": 0.2,
    "random_state": 42
  },
  "train": {
    "n_estimators": 100,
    "max_depth": 5,
    "learning_rate": 0.01
  }
}
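DVC only tracks the parameter values; the stage scripts still have to read them. One way src/train.py might do that is sketched below. load_stage_params is a hypothetical helper name, and the sketch assumes the params.json layout shown above, with one top-level section per stage.

```python
import json
from pathlib import Path
from typing import Any


def load_stage_params(stage: str, path: str = "params.json") -> dict[str, Any]:
    """Return the parameter block for one stage, e.g. "train"."""
    params = json.loads(Path(path).read_text(encoding="utf-8"))
    if stage not in params:
        # Fail early with a clear message instead of a bare lookup error later.
        raise KeyError(f"no '{stage}' section in {path}")
    return params[stage]
```

A training script could then call load_stage_params("train") and pass n_estimators, max_depth, and learning_rate to the model constructor, so the values DVC tracks are the values the code actually uses.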
Run Pipeline
# Run entire pipeline
dvc repro
# Run specific stage
dvc repro train
# Show pipeline DAG
dvc dag
# Show pipeline status
dvc status
Definition: DVC Pipeline
A DVC pipeline is a directed acyclic graph (DAG) of stages that defines the ML workflow. Each stage has dependencies (inputs), a command (processing), and outputs (results). DVC records a hash of every dependency and output in dvc.lock and re-runs a stage only when one of its dependencies has changed.
Pipeline Reproducibility
DVC pipelines support reproducibility by tracking (1) code dependencies, (2) data dependencies, and (3) parameters. When any of these changes, dvc repro re-runs the affected stages and the stages downstream of them. Results are only as reproducible as the commands themselves, so fix random seeds and pin package versions.
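The re-run rule can be illustrated with a toy model of the prepare/train/evaluate pipeline. This is a simplified sketch, not DVC's implementation: STAGES and stages_to_rerun are hypothetical names, and the sketch assumes that a re-run stage always changes its outputs, whereas DVC compares content hashes and can skip downstream stages when an output comes out identical.

```python
from graphlib import TopologicalSorter

# Stage -> (dependencies, outputs), mirroring the prepare/train/evaluate pipeline.
STAGES = {
    "prepare": (
        {"src/prepare.py", "data/raw_data.csv"},
        {"data/training_data.parquet", "data/test_data.parquet"},
    ),
    "train": ({"src/train.py", "data/training_data.parquet"}, {"models/model.pkl"}),
    "evaluate": (
        {"src/evaluate.py", "models/model.pkl", "data/test_data.parquet"},
        set(),
    ),
}


def stages_to_rerun(changed: set[str]) -> list[str]:
    """Return the stages affected by the changed files, in execution order."""
    # A stage depends on whichever stage produces one of its inputs.
    producers = {out: name for name, (_, outs) in STAGES.items() for out in outs}
    graph = {
        name: {producers[dep] for dep in deps if dep in producers}
        for name, (deps, _) in STAGES.items()
    }
    stale: list[str] = []
    dirty = set(changed)
    for name in TopologicalSorter(graph).static_order():
        deps, outs = STAGES[name]
        if deps & dirty:
            stale.append(name)
            dirty |= outs  # assumed: a re-run stage changes its outputs
    return stale
```

Calling stages_to_rerun({"src/train.py"}) returns ["train", "evaluate"]: the prepare stage is skipped because none of its inputs changed.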
5. Experiments
Experiment Tracking
# Run experiment with modified params
# (-S edits params.yaml by default; the params.json: prefix targets this project's file)
dvc exp run -S params.json:train.n_estimators=200 -S params.json:train.max_depth=10
# Run multiple named experiments
dvc exp run -S params.json:train.n_estimators=50 -n exp_50_trees
dvc exp run -S params.json:train.n_estimators=100 -n exp_100_trees
dvc exp run -S params.json:train.n_estimators=200 -n exp_200_trees
# List experiments
dvc exp list
# Show experiment diff
dvc exp diff exp_200_trees
# Compare experiments
dvc exp show
# Apply best experiment to the workspace
dvc exp apply exp_200_trees
Hyperparameter Sweeps
# dvc.yaml (the stage only names the parameters it depends on;
# the values to sweep are passed at run time with -S)
stages:
  train:
    cmd: python src/train.py
    params:
      - params.json:
          - train.n_estimators   # sweep: 50, 100, 200
          - train.max_depth      # sweep: 3, 5, 10
# Queue all combinations
dvc exp run --queue -S params.json:train.n_estimators=50 -S params.json:train.max_depth=3
dvc exp run --queue -S params.json:train.n_estimators=50 -S params.json:train.max_depth=5
dvc exp run --queue -S params.json:train.n_estimators=100 -S params.json:train.max_depth=3
# ... etc
# Run queued experiments
dvc exp run --run-all
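Typing out every combination by hand gets tedious, so the queue commands can be generated instead. The sketch below builds one command string per point of the 3 x 3 grid; queue_commands and GRID are hypothetical names. The optional file prefix reflects the fact that -S edits params.yaml unless another params file is named.

```python
import itertools
import shlex
from typing import Optional


def queue_commands(grid: dict[str, list], params_file: Optional[str] = None) -> list[str]:
    """Build one `dvc exp run --queue` command per combination in the grid."""
    prefix = f"{params_file}:" if params_file else ""
    names = list(grid)
    commands = []
    for values in itertools.product(*(grid[name] for name in names)):
        args = ["dvc", "exp", "run", "--queue"]
        for name, value in zip(names, values):
            args += ["-S", f"{prefix}{name}={value}"]
        commands.append(shlex.join(args))
    return commands


GRID = {"train.n_estimators": [50, 100, 200], "train.max_depth": [3, 5, 10]}
for command in queue_commands(GRID, params_file="params.json"):
    print(command)
```

The script only prints the commands. They can be reviewed and pasted into a shell, which keeps the sweep explicit and avoids launching nine training runs by accident.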
6. Metrics & Plots
Tracking Metrics
# src/train.py
import json
from sklearn.metrics import accuracy_score, f1_score
# ... training code (defines y_test, y_pred, training_time) ...
# Save metrics
metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "f1_score": f1_score(y_test, y_pred, average="weighted"),
    "train_time": training_time,
}
with open("metrics.json", "w") as f:
    json.dump(metrics, f, indent=2)
Visualizing Metrics
# Show metrics comparison
dvc metrics show
# Compare metrics between commits
dvc metrics diff HEAD~1
# Show plots
dvc plots show plots/confusion_matrix.json
# Compare plots
dvc plots diff HEAD~1
Plotting Training Curves
# src/train.py
import csv
import os
# Save training curves (model, the loaders, n_epochs, lr, train() and
# evaluate() come from the elided training code)
os.makedirs("plots", exist_ok=True)
with open("plots/training_curves.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["step", "train_loss", "val_loss", "learning_rate"])
    for epoch in range(n_epochs):
        train_loss = train(model, train_loader)
        val_loss = evaluate(model, val_loader)
        writer.writerow([epoch, train_loss, val_loss, lr])
# dvc.yaml (plots section)
stages:
  train:
    outs:
      - plots/training_curves.csv
# A top-level plots entry can draw several columns from one file
plots:
  - training_curves:
      x: step
      y:
        plots/training_curves.csv: [train_loss, val_loss]
Metrics vs Plots
Use metrics.json for scalar values (accuracy, loss) that you want to compare across experiments. Use plots for time-series data (training curves) that benefit from visualization. DVC supports both and can diff them across commits.
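What a metrics diff reports can be sketched in plain Python. The function below compares two flat metrics dictionaries, such as the contents of metrics.json at two commits. diff_metrics is a hypothetical helper and only illustrates the idea; it is not how dvc metrics diff is implemented.

```python
from typing import Optional


def diff_metrics(
    old: dict[str, float], new: dict[str, float]
) -> dict[str, dict[str, Optional[float]]]:
    """Compare two flat metrics dicts and report old value, new value, and change."""
    report = {}
    for name in sorted(old.keys() | new.keys()):
        before, after = old.get(name), new.get(name)
        # A metric present on only one side has no meaningful change value.
        change = after - before if before is not None and after is not None else None
        report[name] = {"old": before, "new": after, "change": change}
    return report
```

Loading each side with json.load and passing the two dictionaries in gives a per-metric summary that is easy to print or to assert on in CI.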
7. Best Practices
Project Structure
my-ml-project/
├── .dvc/
│   └── config
├── data/
│   ├── raw/
│   │   └── data.csv.dvc
│   ├── processed/
│   │   ├── train.csv.dvc
│   │   └── test.csv.dvc
│   └── external/
│       └── large_dataset.dvc
├── models/
│   └── model.pkl.dvc
├── notebooks/
├── src/
│   ├── prepare.py
│   ├── train.py
│   └── evaluate.py
├── plots/
├── dvc.yaml
├── params.json
└── .gitignore
.gitignore for DVC
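# DVC tracked files (dvc add also writes these entries for you).
# Ignore directory contents with /* rather than the directory itself:
# Git cannot re-include files inside an ignored directory.
/data/raw/*
/data/processed/*
/models/*
# Keep .dvc files
!*.dvc
# Outputs declared with cache: false (metrics.json, plots/confusion_matrix.json)
# stay in Git, so do not ignore them.
# .dvc/cache/ is already ignored by .dvc/.gitignore, which dvc init creates.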
Handling Large Datasets
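# Pull only the targets you need instead of the whole project
dvc pull data/large_dataset.parquet
# For a directory containing .dvc files, add --recursive
dvc pull data/ --recursive
# Bring in a dataset that lives outside the project
dvc import-url /shared/datasets/my_dataset data/external/my_dataset
dvc remote add -d shared /shared/dvc-storage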
8. Integration with Other Tools
MLflow + DVC
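import json
import subprocess

import mlflow
import mlflow.sklearn

# Use DVC for data versioning
# Use MLflow for experiment tracking
with mlflow.start_run():
    # The Git commit pins the .dvc files, so it identifies the data version
    git_rev = subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()
    # List the DVC-tracked files under data/ ("dvc list" takes the repo first)
    dvc_files = subprocess.check_output(
        ["dvc", "list", ".", "data", "--dvc-only"], text=True
    ).strip()
    mlflow.log_param("data_version", git_rev)
    mlflow.log_param("data_files", dvc_files)
    # Log metrics
    with open("metrics.json") as f:
        metrics = json.load(f)
    mlflow.log_metrics(metrics)
    # Log model (model is the estimator fitted in the elided training code)
    mlflow.sklearn.log_model(model, "model")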
GitHub Actions + DVC
# .github/workflows/train.yml
name: Train Model
on:
  push:
    branches: [main]
jobs:
  train:
    runs-on: ubuntu-latest
    env:
      # Remote credentials come from repository secrets
      AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
      AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
    steps:
      - uses: actions/checkout@v3
        with:
          fetch-depth: 0
      - name: Set up DVC
        uses: iterative/setup-dvc@v2
        with:
          version: "2.30.0"
      - name: Pull data
        run: dvc pull
      - name: Run pipeline
        run: dvc repro
      - name: Push updated data
        run: dvc push
CI/CD with DVC
In CI/CD pipelines, store DVC remote credentials in secrets (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY). The workflow pulls data before training and pushes updated data/metrics after training. This keeps the full pipeline reproducible.
9. Key Takeaways
Summary: Data Versioning with DVC
- DVC versions large files (data, models) alongside Git code
- Remote storage (S3, GCS, Azure) stores actual data; Git tracks small .dvc files
- Pipelines define multi-stage workflows with dependencies and parameters
- Experiments track hyperparameter changes and compare results
- Metrics & plots visualize training progress and model performance
- Integration: Combine with MLflow, GitHub Actions for full MLOps pipeline
- Start simple: Track data → Add pipeline → Add experiments → Add remote storage
- DVC uses content-addressable storage (MD5 hashes) for efficient caching
- Pipelines ensure reproducibility by tracking all dependencies
10. Practice Exercises
Exercise 1: Basic DVC Setup
# TODO: Initialize DVC in your project
# Track your dataset with dvc add
# Set up a remote storage (S3, GCS, or local)
# Push and pull data
Exercise 2: DVC Pipeline
# TODO: Create a 3-stage DVC pipeline
# Stages: prepare, train, evaluate
# Add parameters and metrics
# Test reproducibility with dvc repro
Exercise 3: Experiment Tracking
# TODO: Run 5+ experiments with different hyperparameters
# Use: dvc exp run with different param values
# Compare: dvc exp show
# Apply: best experiment to main branch
Exercise 4: Full MLOps Pipeline
# TODO: Combine DVC with GitHub Actions
# Automated: data pull → train → evaluate → push
# Add: metrics visualization
# Deploy: model to staging environment