Version Control with Git for Data Engineers

Why Git Matters for Data Engineers

Version control is fundamental for managing data pipeline code, configuration files, SQL transformations, and infrastructure-as-code. Git enables collaboration, rollback, and audit trails.

┌─────────────────────────────────────────────────────────────┐
│                  GIT USE CASES IN DATA ENG                  │
├─────────────────────────────────────────────────────────────┤
│  Pipeline Code        │  Python scripts, DAGs               │
│  SQL Transformations  │  dbt models, views, functions       │
│  Infrastructure       │  Terraform, CloudFormation          │
│  Configuration        │  YAML, JSON, env files              │
│  Documentation        │  README, runbooks, data catalogs    │
│  Tests                │  Unit tests, integration tests      │
│  CI/CD                │  GitHub Actions, GitLab CI          │
└─────────────────────────────────────────────────────────────┘

Git Basics

Essential Commands

# Initialize repository
git init

# Clone repository
git clone https://github.com/org/data-pipeline.git

# Check status
git status

# Add files to staging
git add file.py                    # Add specific file
git add .                          # Add all changes
git add -p                         # Interactive staging (patch mode)

# Commit
git commit -m "feat: add order processing pipeline"
git commit --amend                 # Amend last commit

# View history
git log --oneline -10              # Last 10 commits
git log --graph --oneline          # Visual branch history
git log --stat                     # Show changed files

# Diff
git diff                           # Unstaged changes
git diff --staged                  # Staged changes
git diff main..feature-branch      # Compare branches

# Remote operations
git remote add origin https://github.com/org/repo.git
git push origin main
git pull origin main
git fetch origin

Undoing Changes

# Discard unstaged changes
git checkout -- file.py
git restore file.py                # Git 2.23+

# Unstage a file
git reset HEAD file.py
git restore --staged file.py       # Git 2.23+

# Amend last commit (before push)
git commit --amend -m "new message"

# Revert a commit (creates new commit)
git revert abc123

# Reset to a specific commit
git reset --hard abc123            # DANGEROUS: drops later commits and all uncommitted changes
git reset --soft HEAD~1            # Undo the last commit, keep its changes staged

Branching Strategies

Git Flow

main ─────────●─────────────────●─────────────────●────────
              │                 │                 │
develop ──────●──●──●──●──●──●──●──●──●──●──●──●──●────
              │     │     │     │        │     │
feature/1 ────●──●──●     │     │        │     │
                          │     │        │     │
feature/2 ────────────────●──●──●        │     │
                                         │     │
hotfix/1 ────────────────────────────────●──●──●

# Create feature branch
git checkout -b feature/order-pipeline

# Work on feature
git add .
git commit -m "feat: implement order extraction"

# Push to remote
git push -u origin feature/order-pipeline

# Create pull request (via the GitHub CLI)
gh pr create --title "Add order pipeline" --body "Implements..."

# Merge to develop
git checkout develop
git merge feature/order-pipeline
git push origin develop

# Delete branch
git branch -d feature/order-pipeline
git push origin --delete feature/order-pipeline
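
Branch names such as feature/order-pipeline carry meaning in Git Flow, so some teams check them automatically, for example in CI. The following is a minimal Python sketch of such a check. It is not part of Git or of any Git Flow tooling: the allowed prefixes, the character rules, and the function name is_valid_branch are assumptions made for illustration.

```python
import re

# Hypothetical naming policy: long-lived branches plus prefixed short-lived ones.
LONG_LIVED = {"main", "develop"}
BRANCH_PATTERN = re.compile(r"(feature|hotfix|release)/[a-z0-9][a-z0-9._-]*")


def is_valid_branch(name: str) -> bool:
    """Return True if the branch name fits the assumed Git Flow naming policy."""
    return name in LONG_LIVED or BRANCH_PATTERN.fullmatch(name) is not None


if __name__ == "__main__":
    for branch in ["feature/order-pipeline", "hotfix/1", "Feature/Orders", "orders"]:
        print(f"{branch}: {is_valid_branch(branch)}")
```

A policy like this is a team convention, so the prefix list should be adjusted to whatever your repository actually uses.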

Trunk-Based Development

main ───●──────●──────●──────●──────●──────●──────●──────
        │      │      │      │      │      │      │
feat/a ─●──●───┤      │      │      │      │      │
               │      │      │      │      │      │
feat/b ────────●──●───┤      │      │      │      │
                      │      │      │      │      │
feat/c ───────────────●──●───┤      │      │      │
                             │      │      │      │
feat/d ──────────────────────●──●───┤      │      │
                                    │      │      │
feat/e ─────────────────────────────●──●───┤      │
                                           │      │
feat/f ────────────────────────────────────●──●───┤
                                                  │
feat/g ───────────────────────────────────────────●

Commit Message Conventions

Conventional Commits

<type>[optional scope]: <description>

[optional body]

[optional footer(s)]

Types

Type      Description         Example
feat      New feature         feat(pipeline): add order extraction
fix       Bug fix             fix: correct date parsing in transform
docs      Documentation       docs: update README with setup instructions
style     Formatting          style: apply black formatting
refactor  Code restructuring  refactor: extract common utilities
test      Adding tests        test: add unit tests for transform
chore     Maintenance         chore: update dependencies
perf      Performance         perf: optimize bulk insert
ci        CI/CD               ci: add GitHub Actions workflow
build     Build system        build: add Dockerfile

Examples

# Simple commit
git commit -m "feat: add daily sales aggregation pipeline"

# With scope
git commit -m "feat(orders): implement CDC ingestion from PostgreSQL"

# With body
git commit -m "fix(transform): handle null values in amount column

Previously, null amounts would cause the aggregation to fail.
This fix adds a coalesce to default null amounts to 0.

Closes #123"

# Breaking change
git commit -m "feat!: change order schema to include customer segment

BREAKING CHANGE: order table now requires customer_segment column"
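
Because the header format is regular, it can be checked mechanically, for instance from a commit-msg hook or a CI step. The sketch below parses the first line of a message against the types listed above. It is an illustrative approximation rather than a full implementation of the Conventional Commits specification, and the names HEADER and parse_header are assumptions.

```python
import re

# Types taken from the table above.
TYPES = ("feat", "fix", "docs", "style", "refactor", "test", "chore", "perf", "ci", "build")

# <type>[optional scope][!]: <description>
HEADER = re.compile(
    r"^(?P<type>" + "|".join(TYPES) + r")"
    r"(?:\((?P<scope>[^()\s]+)\))?"
    r"(?P<breaking>!)?"
    r": (?P<description>\S.*)$"
)


def parse_header(message: str):
    """Parse the first line of a commit message; return a dict, or None if invalid."""
    lines = message.splitlines()
    first_line = lines[0] if lines else ""
    match = HEADER.match(first_line)
    if match is None:
        return None
    result = match.groupdict()
    # A commit is breaking if the header has "!" or a BREAKING CHANGE footer exists.
    result["breaking"] = bool(result["breaking"]) or "BREAKING CHANGE:" in message
    return result


if __name__ == "__main__":
    print(parse_header("feat(orders): implement CDC ingestion from PostgreSQL"))
    print(parse_header("added stuff"))
```

Returning None for an invalid header keeps the function easy to wrap: a hook script only needs to exit with a non-zero status when the result is None.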

Pull Request Best Practices

PR Template

## Description
Brief description of changes

## Type of Change
- [ ] Bug fix
- [ ] New feature
- [ ] Breaking change
- [ ] Documentation update

## Testing
- [ ] Unit tests added/updated
- [ ] Integration tests pass
- [ ] Manual testing completed

## Checklist
- [ ] Code follows style guidelines
- [ ] Self-review completed
- [ ] Documentation updated
- [ ] No new warnings

Review Checklist for Data Engineers

┌─────────────────────────────────────────────────────────────┐
│                    CODE REVIEW CHECKLIST                    │
├─────────────────────────────────────────────────────────────┤
│  [ ] Code follows project style guidelines                  │
│  [ ] SQL queries are optimized and use appropriate indexes  │
│  [ ] Error handling is comprehensive                        │
│  [ ] Logging is appropriate and informative                 │
│  [ ] Data quality checks are included                       │
│  [ ] Tests cover edge cases                                 │
│  [ ] Documentation is updated                               │
│  [ ] No hardcoded credentials or secrets                    │
│  [ ] Backward compatibility maintained                      │
│  [ ] Performance implications considered                    │
│  [ ] Rollback strategy documented                           │
└─────────────────────────────────────────────────────────────┘

CI/CD Integration

GitHub Actions Workflow

# .github/workflows/data-pipeline.yml
name: Data Pipeline CI/CD

on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]

jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      
      - name: Install dependencies
        run: |
          pip install -r requirements.txt
          pip install black flake8 mypy
      
      - name: Run black
        run: black --check .
      
      - name: Run flake8
        run: flake8 .
      
      - name: Run mypy
        run: mypy .

  test:
    runs-on: ubuntu-latest
    needs: lint
    steps:
      - uses: actions/checkout@v4
      
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      
      - name: Install dependencies
        run: |
          pip install -r requirements.txt
          pip install pytest pytest-cov

      - name: Run tests
        run: pytest tests/ -v --cov=src --cov-report=xml
      
      - name: Upload coverage
        uses: codecov/codecov-action@v3

  deploy:
    runs-on: ubuntu-latest
    needs: test
    if: github.ref == 'refs/heads/main'
    steps:
      - uses: actions/checkout@v4
      
      - name: Configure AWS
        uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-east-1
      
      - name: Deploy to S3
        run: |
          aws s3 sync src/ s3://my-bucket/pipeline/
      
      - name: Notify deployment
        run: |
          curl -X POST -H 'Content-Type: application/json' \
            -d '{"text": "Pipeline deployed to production"}' \
            "${{ secrets.SLACK_WEBHOOK }}"
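
The test job runs whatever pytest collects under tests/. As an illustration, here is a minimal sketch of a transform and two tests for it, loosely modeled on the null-amount fix from the commit message examples. The function total_amount, the shape of the rows, and the idea of keeping both in one file are assumptions for the example, not code from an actual pipeline.

```python
from typing import Iterable, Mapping, Optional


def total_amount(rows: Iterable[Mapping[str, Optional[float]]]) -> float:
    """Sum the 'amount' field, treating missing or None amounts as 0."""
    return sum(((row.get("amount") or 0.0) for row in rows), 0.0)


def test_total_amount_handles_nulls() -> None:
    rows = [{"amount": 10.0}, {"amount": None}, {}, {"amount": 2.5}]
    assert total_amount(rows) == 12.5


def test_total_amount_empty_input() -> None:
    assert total_amount([]) == 0.0
```

In a real project the function would live under src/ and the tests under tests/unit/, so that the --cov=src option measures the code being tested.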

Git Hooks

Pre-commit Hooks

# .pre-commit-config.yaml
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.5.0
    hooks:
      - id: trailing-whitespace
      - id: end-of-file-fixer
      - id: check-yaml
      - id: check-json
      - id: check-merge-conflict
      - id: detect-private-key
  
  - repo: https://github.com/psf/black
    rev: 24.1.0
    hooks:
      - id: black
  
  - repo: https://github.com/pycqa/flake8
    rev: 7.0.0
    hooks:
      - id: flake8
  
  - repo: https://github.com/pre-commit/mirrors-mypy
    rev: v1.8.0
    hooks:
      - id: mypy

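#!/bin/bash
# .git/hooks/pre-commit (custom hook: check for secrets; the file must be executable)
# Scans staged files that were added, copied, or modified; -z and -0 keep unusual filenames intact.
if git diff --cached --name-only -z --diff-filter=ACM | xargs -0 grep -l -i -E "password|secret|api_key" 2>/dev/null; then
    echo "ERROR: Potential secrets detected in staged files!"
    exit 1
fi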
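
The grep pattern above flags every occurrence of the words, including harmless ones such as reading a password from an environment variable. A somewhat tighter check looks for a secret-like name being assigned a quoted literal. The following Python sketch shows the idea; the two patterns and the function find_secrets are illustrative assumptions and cover far fewer cases than a dedicated secret scanner.

```python
import re
from typing import List, Tuple

# Illustrative patterns only; dedicated scanners use much larger rule sets.
SECRET_PATTERNS = [
    # A name containing password/secret/api_key assigned a quoted literal.
    re.compile(r"(?i)(password|secret|api_key)\w*\s*[:=]\s*['\"][^'\"]+['\"]"),
    # The header line of a PEM-encoded private key.
    re.compile(r"-----BEGIN (?:RSA |EC |OPENSSH )?PRIVATE KEY-----"),
]


def find_secrets(text: str) -> List[Tuple[int, str]]:
    """Return (line_number, line) pairs for lines that look like hardcoded secrets."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        if any(pattern.search(line) for pattern in SECRET_PATTERNS):
            hits.append((number, line.strip()))
    return hits


if __name__ == "__main__":
    sample = 'user = "etl"\npassword = "hunter2"\ntoken = os.environ["API_KEY"]\n'
    for number, line in find_secrets(sample):
        print(f"line {number}: {line}")
```

Pattern-based checks like this still produce false positives and negatives, so they complement code review rather than replace it.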

Managing Pipeline Code

Project Structure

data-pipeline/
├── .github/
│   └── workflows/
│       ├── ci.yml
│       └── deploy.yml
├── src/
│   ├── extractors/
│   │   ├── __init__.py
│   │   ├── postgres.py
│   │   └── api.py
│   ├── transformers/
│   │   ├── __init__.py
│   │   ├── orders.py
│   │   └── customers.py
│   ├── loaders/
│   │   ├── __init__.py
│   │   └── warehouse.py
│   └── utils/
│       ├── __init__.py
│       └── config.py
├── tests/
│   ├── unit/
│   └── integration/
├── sql/
│   ├── migrations/
│   └── models/
├── configs/
│   ├── dev.yaml
│   ├── staging.yaml
│   └── prod.yaml
├── docker/
│   └── Dockerfile
├── docs/
│   └── architecture.md
├── README.md
├── requirements.txt
└── pyproject.toml
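
Keeping one file per environment under configs/ means the pipeline needs a small, predictable way to pick the right one. The sketch below shows one way a module such as src/utils/config.py might resolve the path. The function config_path and the fixed environment list are assumptions for illustration, and parsing the YAML itself is left out.

```python
from pathlib import Path

# Environments mirror the files under configs/ in the tree above.
ENVIRONMENTS = ("dev", "staging", "prod")


def config_path(env: str, root: Path = Path("configs")) -> Path:
    """Return the config file path for an environment, rejecting unknown names."""
    if env not in ENVIRONMENTS:
        raise ValueError(f"unknown environment {env!r}; expected one of {ENVIRONMENTS}")
    return root / f"{env}.yaml"


if __name__ == "__main__":
    print(config_path("staging"))
```

Rejecting unknown names early avoids a typo in an environment name silently falling back to the wrong configuration.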

dbt Project Structure

dbt_project/
├── models/
│   ├── staging/
│   │   ├── stg_orders.sql
│   │   ├── stg_customers.sql
│   │   └── stg_products.sql
│   ├── intermediate/
│   │   ├── int_orders_enriched.sql
│   │   └── int_customer_metrics.sql
│   ├── marts/
│   │   ├── fct_orders.sql
│   │   ├── dim_customers.sql
│   │   └── dim_products.sql
│   └── sources/
│       └── source.yml
├── tests/
│   ├── assert_positive_amount.sql
│   └── test_unique_orders.sql
├── macros/
│   └── generate_schema_name.sql
├── snapshots/
│   └── scd_customers.sql
├── seeds/
│   └── country_codes.csv
├── dbt_project.yml
└── profiles.yml

Monorepo vs Polyrepo

Monorepo

organization/
├── data-pipeline/
├── data-warehouse/
├── ml-models/
├── dashboards/
├── shared-libs/
└── documentation/

Pros:

  • Atomic commits across dependencies
  • Easier refactoring
  • Single CI/CD pipeline
  • Shared code visibility

Cons:

  • Larger repository size
  • Complex CI/CD
  • Access control harder

Polyrepo

├── order-pipeline/        (repo)
├── customer-pipeline/     (repo)
├── warehouse-transforms/  (repo)
├── ml-features/           (repo)
└── shared-utils/          (repo)

Pros:

  • Clear ownership boundaries
  • Independent deployment
  • Smaller, focused repos
  • Fine-grained access control

Cons:

  • Dependency management complexity
  • Cross-repo changes harder
  • Multiple CI/CD pipelines

Comparison

Factor                 Monorepo                Polyrepo
Dependency management  Simple                  Complex
Atomic changes         Yes                     No
Access control         Coarse                  Fine-grained
CI/CD complexity       Higher                  Lower
Code discoverability   Easy                    Harder
Team scaling           Better for large teams  Better for distributed teams

Key Takeaways

  1. Git is essential — master the core commands and workflows
  2. Use conventional commits — clear, consistent commit messages
  3. Branch strategically — Git Flow for releases, trunk-based for continuous deployment
  4. PRs enable review — use templates and checklists for quality
  5. CI/CD automates quality — lint, test, and deploy automatically
  6. Git hooks prevent issues — catch problems before they're committed
  7. Structure matters — organize pipeline code for maintainability
  8. Choose repo strategy wisely — monorepo for tight coupling, polyrepo for independence

Practice Exercises

  1. Git workflow: Set up a Git repository with Git Flow branching strategy. Create a feature branch, make changes, create a PR, and merge.

  2. CI/CD pipeline: Create a GitHub Actions workflow that runs linting, tests, and deploys on merge to main.

  3. Pre-commit hooks: Set up pre-commit hooks that run black, flake8, and check for secrets.

  4. Code review: Review a teammate's PR using the data engineering checklist. Document your findings.

  5. Repository structure: Design a project structure for a data pipeline that handles 3 different data sources and loads to a data warehouse.
