Data Versioning with DVC

Module 15: Data Engineering & MLOps

DVC (Data Version Control) extends Git to handle large files, datasets, and ML models. It versions data and pipelines alongside code.

DVC Architecture

  • Git Repository (.git/): code + .dvc files, params + metrics, pipeline definitions
  • DVC Metadata: .dvc files, dvc.lock, dvc.yaml; pins to cache
  • Remote Storage: S3 / GCS / Azure Blob, SSH / HDFS / HTTP; content-addressable, deduplication via hashes
  • Local Cache (.dvc/cache/): content-addressed storage, files by MD5 hash
  • Pipeline Stage: dvc.yaml defines stages; deps → cmd → outs; cacheable + reproducible

Git tracks .dvc files; DVC tracks actual data.
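
To make "content-addressed storage" and "deduplication via hashes" concrete, here is a minimal Python sketch. It is not DVC's implementation: the `TinyCache` class and its two-character directory fan-out are illustrative assumptions, and the real directory layout under .dvc/cache/ differs between DVC versions.

```python
import hashlib
import tempfile
from pathlib import Path


class TinyCache:
    """Toy content-addressed store (hypothetical, for illustration only)."""

    def __init__(self, root: Path) -> None:
        self.root = root

    def _path_for(self, digest: str) -> Path:
        # Assumed layout: first two hex characters become a subdirectory.
        return self.root / digest[:2] / digest[2:]

    def add(self, data: bytes) -> str:
        """Store data under its MD5 digest and return the digest."""
        digest = hashlib.md5(data).hexdigest()
        target = self._path_for(digest)
        if not target.exists():  # identical content is written only once
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_bytes(data)
        return digest

    def get(self, digest: str) -> bytes:
        """Read content back using the digest a pointer file would record."""
        return self._path_for(digest).read_bytes()

    def object_count(self) -> int:
        """Count the distinct objects currently stored."""
        return sum(1 for p in self.root.rglob("*") if p.is_file())


def demo() -> tuple[str, str, int]:
    """Add the same bytes twice plus one variant; report digests and object count."""
    with tempfile.TemporaryDirectory() as tmp:
        cache = TinyCache(Path(tmp))
        first = cache.add(b"id,label\n1,0\n")
        second = cache.add(b"id,label\n1,0\n")  # same bytes, same digest
        cache.add(b"id,label\n1,1\n")  # different bytes, new object
        return first, second, cache.object_count()
```

Because the storage key is derived from the bytes themselves, two copies of the same dataset cost one object, and a pointer file only needs to record the digest to find the data again.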

1. Basic Versioning

# Initialize DVC in a Git repo
dvc init

# Track a large file
dvc add data/train.csv

# This creates:
# - data/train.csv.dvc (pointer file tracked by Git)
# - .dvc/cache/ (actual data stored here)

# Commit the pointer to Git
git add data/train.csv.dvc data/.gitignore
git commit -m "Add training data"

# Connect to remote storage
dvc remote add -d storage s3://my-bucket/dvc-storage
dvc push

2. DVC Pipelines

DVC Pipeline DAG: prepare → featurize → train → evaluate

Each stage: deps (inputs) → cmd (script) → outs (outputs). dvc repro detects changes and re-runs affected stages.

Pipeline Definition

# dvc.yaml
stages:
  prepare:
    cmd: python src/prepare.py
    deps:
      - src/prepare.py
      - data/raw.csv
    outs:
      - data/prepared.csv

  featurize:
    cmd: python src/features.py
    deps:
      - src/features.py
      - data/prepared.csv
    params:
      - features.n_components
    outs:
      - data/features.npy

  train:
    cmd: python src/train.py
    deps:
      - src/train.py
      - data/features.npy
    params:
      - model.n_estimators
      - model.max_depth
    outs:
      - model.pkl

  evaluate:
    cmd: python src/evaluate.py
    deps:
      - src/evaluate.py
      - model.pkl
      - data/features.npy
    metrics:
      - metrics.json:
          cache: false
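
The way a change propagates through these stages can be sketched in a few lines of Python. This is a simplified model, not how DVC is implemented: the `STAGES` table is a hand-copied mirror of the dvc.yaml above, `affected_stages` is a hypothetical helper, and params are left out. Real `dvc repro` compares content hashes against dvc.lock, so it can skip a downstream stage whose inputs turn out to be unchanged; this sketch conservatively marks everything downstream.

```python
from collections import deque

# Stage graph mirroring the dvc.yaml above: stage -> (deps, outs).
STAGES = {
    "prepare": (["src/prepare.py", "data/raw.csv"], ["data/prepared.csv"]),
    "featurize": (["src/features.py", "data/prepared.csv"], ["data/features.npy"]),
    "train": (["src/train.py", "data/features.npy"], ["model.pkl"]),
    "evaluate": (
        ["src/evaluate.py", "model.pkl", "data/features.npy"],
        ["metrics.json"],
    ),
}


def affected_stages(changed_path: str) -> list[str]:
    """Return the stages to re-run when changed_path changes, in definition order."""
    dirty_paths = deque([changed_path])
    dirty_stages: set[str] = set()
    while dirty_paths:
        path = dirty_paths.popleft()
        for name, (deps, outs) in STAGES.items():
            if path in deps and name not in dirty_stages:
                dirty_stages.add(name)
                dirty_paths.extend(outs)  # outputs invalidate downstream stages
    return [name for name in STAGES if name in dirty_stages]
```

Editing src/train.py leaves prepare and featurize untouched, while a change to data/raw.csv invalidates all four stages.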

3. Parameters and Metrics

# params.yaml
features:
  n_components: 50
  scaling: standard

model:
  n_estimators: 200
  max_depth: 10
  learning_rate: 0.1

evaluation:
  threshold: 0.5
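
# src/train.py
import json

import joblib
import numpy as np
import yaml
from sklearn.ensemble import GradientBoostingClassifier

# Hyperparameters come from the "model" section of params.yaml
with open("params.yaml") as f:
    params = yaml.safe_load(f)["model"]

# Assumption: the last column of features.npy holds the labels;
# the lesson does not specify how labels are stored.
data = np.load("data/features.npy")
X_train, y_train = data[:, :-1], data[:, -1]

model = GradientBoostingClassifier(**params)
model.fit(X_train, y_train)
joblib.dump(model, "model.pkl")

# DVC metrics: a flat JSON file. The values here are placeholders;
# in the dvc.yaml above, the evaluate stage is the one that declares metrics.json.
metrics = {"accuracy": 0.92, "f1": 0.89}
with open("metrics.json", "w") as f:
    json.dump(metrics, f)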
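
The hard-coded metric values above are placeholders. As a rough sketch of what an evaluation script could compute before writing metrics.json, the following uses only the standard library. The helper names `binary_metrics` and `write_metrics` are hypothetical, and binary labels with 1 as the positive class are an assumption; the lesson does not show the contents of src/evaluate.py.

```python
import json
from pathlib import Path


def binary_metrics(y_true: list[int], y_pred: list[int]) -> dict[str, float]:
    """Compute accuracy and F1 for binary labels (1 = positive class)."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    correct = sum(1 for t, p in pairs if t == p)
    accuracy = correct / len(pairs)
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return {"accuracy": accuracy, "f1": f1}


def write_metrics(metrics: dict[str, float], path: str = "metrics.json") -> None:
    """Write a flat JSON file of the kind a metrics entry in dvc.yaml points to."""
    Path(path).write_text(json.dumps(metrics, indent=2))
```

Keeping the file flat and numeric makes it easy for `dvc metrics show` and `dvc exp show` to display the values side by side.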

4. Experiment Comparison

# Run experiment
dvc exp run -S model.n_estimators=300 -S model.max_depth=15

# Compare experiments
dvc exp show
dvc exp diff HEAD~1

# Promote an experiment to a Git branch (so it can be compared or merged like any other branch)
dvc exp branch <experiment> branch-name
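
Conceptually, each `-S key=value` flag overrides one dotted key in params.yaml for that run, and comparing experiments amounts to diffing the resulting values. The Python sketch below models that idea with plain dictionaries. It is a simplification under stated assumptions: `apply_override`, `_parse_scalar`, and `diff_params` are hypothetical helpers, only ints, floats, and strings are parsed, and DVC's own override syntax supports more than this.

```python
import copy
from typing import Any


def _parse_scalar(text: str) -> Any:
    """Interpret ints and floats; leave anything else as a string (a simplification)."""
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            continue
    return text


def apply_override(params: dict[str, Any], assignment: str) -> dict[str, Any]:
    """Return a copy of params with one 'dotted.key=value' assignment applied."""
    dotted_key, raw_value = assignment.split("=", 1)
    updated = copy.deepcopy(params)  # the baseline stays untouched
    node = updated
    *parents, leaf = dotted_key.split(".")
    for key in parents:
        node = node.setdefault(key, {})
    node[leaf] = _parse_scalar(raw_value)
    return updated


def diff_params(
    old: dict[str, Any], new: dict[str, Any], prefix: str = ""
) -> dict[str, tuple[Any, Any]]:
    """List leaf values that differ between two param trees, keyed by dotted path."""
    changes: dict[str, tuple[Any, Any]] = {}
    for key in sorted(set(old) | set(new)):
        path = f"{prefix}{key}"
        before, after = old.get(key), new.get(key)
        if isinstance(before, dict) and isinstance(after, dict):
            changes.update(diff_params(before, after, f"{path}."))
        elif before != after:
            changes[path] = (before, after)
    return changes


baseline = {"model": {"n_estimators": 200, "max_depth": 10, "learning_rate": 0.1}}
experiment = apply_override(baseline, "model.n_estimators=300")
experiment = apply_override(experiment, "model.max_depth=15")
```

Here `diff_params(baseline, experiment)` reports only the two overridden keys, which is the kind of summary `dvc exp diff` gives for params.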

5. Data Lineage

# Show dependency graph
dvc dag

# Show what changed
dvc status

# Reproduce entire pipeline
dvc repro

# Force full re-run
dvc repro --force

6. Remote Storage Configuration

# Pick one backend: every example below registers a default remote named "storage"

# S3
dvc remote add -d storage s3://bucket/path
dvc remote modify storage region us-east-1

# GCS
dvc remote add -d storage gs://bucket/path

# Azure
dvc remote add -d storage azure://container/path

# SSH
dvc remote add -d storage ssh://user@host/path

7. Git Integration Workflow

# Typical workflow: track data with DVC first so Git never stages the raw file
dvc add data/processed.csv

# Stages code changes plus data/processed.csv.dvc and data/.gitignore
git add .
git commit -m "Update feature engineering"

git push
dvc push

# Teammate pulls
git pull
dvc pull

Key Takeaways

  • Git tracks code, parameters, and .dvc pointer files
  • DVC tracks large data files by content hash, storing them in the local cache and syncing them to remote storage
  • Pipelines (dvc.yaml) define reproducible DAGs with caching
  • Experiments are tracked via params and metrics, compared with dvc exp
