Version Control with Git for Data Engineers

Why Git Matters for Data Engineers

Version control is fundamental for managing data pipeline code, configuration files, SQL transformations, and infrastructure-as-code. Git enables collaboration, rollback, and audit trails.

+-------------------------------------------------------------+
|                  GIT USE CASES IN DATA ENG                  |
+-------------------------------------------------------------+
|  Pipeline Code        |  Python scripts, DAGs               |
|  SQL Transformations  |  dbt models, views, functions       |
|  Infrastructure       |  Terraform, CloudFormation          |
|  Configuration        |  YAML, JSON, env files              |
|  Documentation        |  README, runbooks, data catalogs    |
|  Tests                |  Unit tests, integration tests      |
|  CI/CD                |  GitHub Actions, GitLab CI          |
+-------------------------------------------------------------+

Theory: How Git Works Internally

Git models history as a directed acyclic graph (DAG) of commits. Each commit points to its parent commit(s) and to a tree that captures a full snapshot of all tracked files, not a diff against the previous commit.
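
To make the graph structure concrete, here is a minimal Python sketch of a commit graph and a history walk, similar in spirit to what `git log` does. The `Commit` class, the `ancestors` helper, and the commit IDs are illustrative assumptions, not Git's actual data structures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    """Toy stand-in for a Git commit: an ID, a message, and parent IDs."""
    sha: str
    message: str
    parents: tuple[str, ...] = ()


def ancestors(commits: dict[str, Commit], start: str) -> list[str]:
    """Return every commit reachable from `start`, visiting each one once."""
    seen: set[str] = set()
    order: list[str] = []
    stack = [start]
    while stack:
        sha = stack.pop()
        if sha in seen:
            continue
        seen.add(sha)
        order.append(sha)
        # A merge commit has two or more parents; follow all of them.
        stack.extend(commits[sha].parents)
    return order


# Hypothetical history: c3 merges a feature commit (c2) into the main line (c1b).
history = {
    "c0": Commit("c0", "initial commit"),
    "c1b": Commit("c1b", "fix: main-line change", ("c0",)),
    "c2": Commit("c2", "feat: feature work", ("c0",)),
    "c3": Commit("c3", "merge feature", ("c1b", "c2")),
}

print(sorted(ancestors(history, "c3")))
```

Because parents are only ever older commits, the graph cannot contain a cycle, which is what lets a walk like this terminate.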

Blob: Stores file content (compressed).
Tree: Directory structure mapping file names to blobs.
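Commit: Points to a tree and to its parent commit(s), and records the author, timestamp, and message.
HEAD: A pointer to the current branch (which in turn points to a commit), or directly to a commit when in a detached HEAD state.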

Git Object Model
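
Git objects are content-addressed: an object's ID is a hash of a short header plus its content. The sketch below reproduces how a blob ID is derived, assuming a repository that uses the default SHA-1 object format (repositories created with SHA-256 will produce different IDs). The function names are illustrative.

```python
import hashlib
import zlib


def git_blob_sha1(content: bytes) -> str:
    """Hash file content the way Git names a blob: 'blob <size>\\0' + content."""
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()


def git_blob_stored_bytes(content: bytes) -> bytes:
    """Return the zlib-compressed bytes Git would write for a loose blob object."""
    header = b"blob %d\x00" % len(content)
    return zlib.compress(header + content)


# Should match `echo 'hello world' | git hash-object --stdin`.
print(git_blob_sha1(b"hello world\n"))
```

Two files with identical content hash to the same blob, so Git stores that content only once no matter how many paths or commits reference it.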

Git Basics

Essential Commands

# Initialize repository
git init

# Clone repository
git clone https://github.com/org/data-pipeline.git

# Check status
git status

# Add files to staging
git add file.py                    # Add specific file
git add .                          # Add all changes under the current directory
git add -p                         # Interactive staging (patch mode)

# Commit
git commit -m "feat: add order processing pipeline"
git commit --amend                 # Amend last commit

# View history
git log --oneline -10              # Last 10 commits
git log --graph --oneline          # Visual branch history
git log --stat                     # Show changed files

# Diff
git diff                           # Unstaged changes
git diff --staged                  # Staged changes
git diff main..feature-branch      # Compare branches

# Remote operations
git remote add origin https://github.com/org/repo.git
git push origin main
git pull origin main
git fetch origin

Undoing Changes

# Discard unstaged changes
git checkout -- file.py
git restore file.py                # Git 2.23+

# Unstage a file
git reset HEAD file.py
git restore --staged file.py       # Git 2.23+

# Amend last commit (before push)
git commit --amend -m "new message"

# Revert a commit (creates new commit)
git revert abc123

# Reset to a specific commit (rewrites local history)
git reset --hard abc123            # DANGEROUS: moves the branch to abc123 and discards all uncommitted changes
git reset --soft HEAD~1            # Undo the last commit, keep its changes staged

Undo Operations Reference

Command	Effect	Safety
`git restore <file>`	Discard working dir changes	Safe
`git restore --staged <file>`	Unstage a file	Safe
`git commit --amend`	Modify last commit	Safe (before push)
`git revert <sha>`	Create undo commit	Safe (preserves history)
`git reset --soft HEAD~1`	Undo commit, keep changes staged	Safe (before push)
`git reset --mixed HEAD~1`	Undo commit, keep changes unstaged	Safe (before push)
`git reset --hard HEAD~1`	Undo commit and discard all uncommitted changes	DANGEROUS

Branching Strategies

Git Flow

Git Flow Branching Strategy:

Branch	Purpose	Lifetime
`main`	Production-ready code	Permanent
`develop`	Integration branch	Permanent
`feature/*`	New features	Temporary
`release/*`	Release preparation	Temporary
`hotfix/*`	Critical fixes	Temporary

Git Flow Workflow:

Create develop from main
Create feature/* branches from develop
Merge features back to develop
Create release/* from develop when ready
Merge release to main and tag, then merge it back into develop
Create hotfix/* from main for critical fixes, then merge into both main and develop

# Create feature branch
git checkout -b feature/order-pipeline

# Work on feature
git add .
git commit -m "feat: implement order extraction"

# Push to remote
git push -u origin feature/order-pipeline

# Create pull request targeting develop (via GitHub CLI)
gh pr create --base develop --title "Add order pipeline" --body "Implements..."

# Merge to develop locally (if not merging the PR through the web UI)
git checkout develop
git merge feature/order-pipeline
git push origin develop

# Delete branch
git branch -d feature/order-pipeline
git push origin --delete feature/order-pipeline

Trunk-Based Development

Trunk-Based Development Workflow:

Step	Action	Description
1	Create short-lived branch	Branch from `main` for feature
2	Commit frequently	Small, atomic commits
3	Open PR quickly	Keep PRs small and focused
4	Review and merge	Fast review cycles
5	Delete branch	Clean up after merge

Trunk-Based Best Practices:

Keep branches short-lived (hours to days)
Use feature flags for incomplete features
Automate testing and deployment
Merge to main at least daily
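
As an illustration of the feature-flag practice, the sketch below gates an unfinished transformation step behind an environment variable, so the code can be merged to main while staying inactive. The flag name `ENABLE_CUSTOMER_SEGMENT` and both functions are hypothetical; real projects often use a config file or a flag service instead.

```python
import os


def flag_enabled(name: str, default: bool = False) -> bool:
    """Read a boolean feature flag from an environment variable."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def transform_orders(rows: list[dict]) -> list[dict]:
    """Apply the stable transform, plus the unfinished step only when flagged on."""
    # Stable behavior: default missing or null amounts to 0.
    result = [{**row, "amount": row.get("amount") or 0} for row in rows]
    if flag_enabled("ENABLE_CUSTOMER_SEGMENT"):
        # Work in progress: merged to main, but inactive until the flag is set.
        result = [{**row, "customer_segment": "unknown"} for row in result]
    return result
```

With the flag unset, production runs only the stable path; setting `ENABLE_CUSTOMER_SEGMENT=true` in a dev environment exercises the new step without a long-lived branch.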

Branching Strategy Comparison

Factor	Git Flow	Trunk-Based
Complexity	High (5 branch types)	Low (main + short-lived branches)
Release cadence	Scheduled releases	Continuous deployment
Best for	Projects with versioned releases	CI/CD-heavy environments
Merge conflicts	More (long-lived branches)	Fewer (short-lived branches)
Data pipeline fit	Good for batch pipelines	Good for streaming/real-time

Commit Message Conventions

Conventional Commits

<type>[optional scope]: <description>

[optional body]

[optional footer(s)]

Types

Type	Description	Example
feat	New feature	`feat(pipeline): add order extraction`
fix	Bug fix	`fix: correct date parsing in transform`
docs	Documentation	`docs: update README with setup instructions`
style	Formatting	`style: apply black formatting`
refactor	Code restructuring	`refactor: extract common utilities`
test	Adding tests	`test: add unit tests for transform`
chore	Maintenance	`chore: update dependencies`
perf	Performance	`perf: optimize bulk insert`
ci	CI/CD	`ci: add GitHub Actions workflow`
build	Build system	`build: add Dockerfile`

Examples

# Simple commit
git commit -m "feat: add daily sales aggregation pipeline"

# With scope
git commit -m "feat(orders): implement CDC ingestion from PostgreSQL"

# With body
git commit -m "fix(transform): handle null values in amount column

Previously, null amounts would cause the aggregation to fail.
This fix adds a coalesce to default null amounts to 0.

Closes #123"

# Breaking change
git commit -m "feat!: change order schema to include customer segment

BREAKING CHANGE: order table now requires customer_segment column"
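
The header format above can be checked mechanically. The following is a minimal sketch of a validator for the first line of a commit message, restricted to the types listed in this section; the regular expression and the `parse_header` helper are assumptions for illustration and do not cover every rule of the Conventional Commits specification (footers, for example, are not parsed).

```python
import re

# Types taken from the table in this section.
TYPES = ("feat", "fix", "docs", "style", "refactor", "test", "chore", "perf", "ci", "build")

HEADER_RE = re.compile(
    r"^(?P<type>" + "|".join(TYPES) + r")"   # commit type
    r"(?:\((?P<scope>[^()\s]+)\))?"          # optional (scope)
    r"(?P<breaking>!)?"                      # optional breaking-change marker
    r": (?P<description>\S.*)$"              # colon, space, description
)


def parse_header(message: str):
    """Parse the first line of a commit message; return None if it does not match."""
    lines = message.splitlines()
    first_line = lines[0] if lines else ""
    match = HEADER_RE.match(first_line)
    return match.groupdict() if match else None


print(parse_header("feat(orders): implement CDC ingestion from PostgreSQL"))
print(parse_header("added some stuff"))
```

A check like this could be wired into a `commit-msg` hook or a CI step so that non-conforming messages are rejected before review.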

Pull Request Best Practices

PR Template

## Description
Brief description of changes

## Type of Change
- [ ] Bug fix
- [ ] New feature
- [ ] Breaking change
- [ ] Documentation update

## Testing
- [ ] Unit tests added/updated
- [ ] Integration tests pass
- [ ] Manual testing completed

## Checklist
- [ ] Code follows style guidelines
- [ ] Self-review completed
- [ ] Documentation updated
- [ ] No new warnings

Review Checklist for Data Engineers

+-------------------------------------------------------------+
|              CODE REVIEW CHECKLIST                          |
+-------------------------------------------------------------+
|  Code follows project style guidelines                      |
|  SQL queries are optimized and use appropriate indexes      |
|  Error handling is comprehensive                            |
|  Logging is appropriate and informative                     |
|  Data quality checks are included                           |
|  Tests cover edge cases                                     |
|  Documentation is updated                                   |
|  No hardcoded credentials or secrets                        |
|  Backward compatibility maintained                          |
|  Performance implications considered                        |
|  Rollback strategy documented                               |
+-------------------------------------------------------------+

CI/CD Integration

GitHub Actions Workflow

# .github/workflows/data-pipeline.yml
name: Data Pipeline CI/CD

on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]

jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      
      - name: Install dependencies
        run: |
          pip install -r requirements.txt
          pip install black flake8 mypy
      
      - name: Run black
        run: black --check .
      
      - name: Run flake8
        run: flake8 .
      
      - name: Run mypy
        run: mypy .

  test:
    runs-on: ubuntu-latest
    needs: lint
    steps:
      - uses: actions/checkout@v4
      
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      
      - name: Install dependencies
        run: |
          pip install -r requirements.txt
          pip install pytest pytest-cov
      
      - name: Run tests
        run: pytest tests/ -v --cov=src --cov-report=xml
      
      - name: Upload coverage
        uses: codecov/codecov-action@v3

  deploy:
    runs-on: ubuntu-latest
    needs: test
    if: github.ref == 'refs/heads/main'
    steps:
      - uses: actions/checkout@v4
      
      - name: Configure AWS
        uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-east-1
      
      - name: Deploy to S3
        run: |
          aws s3 sync src/ s3://my-bucket/pipeline/
      
      - name: Notify deployment
        run: |
          curl -X POST -H 'Content-Type: application/json' \
            -d '{"text": "Pipeline deployed to production"}' \
            "${{ secrets.SLACK_WEBHOOK }}"

Git Hooks

Pre-commit Hooks

# .pre-commit-config.yaml
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.5.0
    hooks:
      - id: trailing-whitespace
      - id: end-of-file-fixer
      - id: check-yaml
      - id: check-json
      - id: check-merge-conflict
      - id: detect-private-key
  
  - repo: https://github.com/psf/black
    rev: 24.1.0
    hooks:
      - id: black
  
  - repo: https://github.com/pycqa/flake8
    rev: 7.0.0
    hooks:
      - id: flake8
  
  - repo: https://github.com/pre-commit/mirrors-mypy
    rev: v1.8.0
    hooks:
      - id: mypy
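
Beyond the published hooks above, a project can add its own checks. The sketch below is a hypothetical local hook script that rejects staged data files by extension; it relies on pre-commit passing the staged file names as command-line arguments. The script, its extension list, and its messages are assumptions for illustration, and a project that keeps small CSV seed files in Git would need to exempt those paths.

```python
import sys
from pathlib import PurePosixPath

# Assumed list of extensions treated as data; adjust for your project.
DATA_SUFFIXES = {".csv", ".parquet", ".avro", ".orc"}


def find_data_files(paths: list[str]) -> list[str]:
    """Return the paths whose extension marks them as data files."""
    return [p for p in paths if PurePosixPath(p).suffix.lower() in DATA_SUFFIXES]


def main(argv: list[str]) -> int:
    """Return a non-zero exit code when any given path looks like a data file."""
    offenders = find_data_files(argv)
    for path in offenders:
        print(f"refusing to commit data file: {path}", file=sys.stderr)
    return 1 if offenders else 0


# Only act when file names were passed on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    raise SystemExit(main(sys.argv[1:]))
```

A non-zero exit code is what makes the hook block the commit, so the check fails fast on the developer's machine instead of in CI.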

Managing Pipeline Code

Project Structure

data-pipeline/
+-- .github/
|   +-- workflows/
|       +-- ci.yml
|       +-- deploy.yml
+-- src/
|   +-- extractors/
|   |   +-- __init__.py
|   |   +-- postgres.py
|   |   +-- api.py
|   +-- transformers/
|   |   +-- __init__.py
|   |   +-- orders.py
|   |   +-- customers.py
|   +-- loaders/
|   |   +-- __init__.py
|   |   +-- warehouse.py
|   +-- utils/
|       +-- __init__.py
|       +-- config.py
+-- tests/
|   +-- unit/
|   +-- integration/
+-- sql/
|   +-- migrations/
|   +-- models/
+-- configs/
|   +-- dev.yaml
|   +-- staging.yaml
|   +-- prod.yaml
+-- docker/
|   +-- Dockerfile
+-- docs/
|   +-- architecture.md
+-- README.md
+-- requirements.txt
+-- pyproject.toml

dbt Project Structure

dbt_project/
+-- models/
|   +-- staging/
|   |   +-- stg_orders.sql
|   |   +-- stg_customers.sql
|   |   +-- stg_products.sql
|   +-- intermediate/
|   |   +-- int_orders_enriched.sql
|   |   +-- int_customer_metrics.sql
|   +-- marts/
|   |   +-- fct_orders.sql
|   |   +-- dim_customers.sql
|   |   +-- dim_products.sql
|   +-- sources/
|       +-- source.yml
+-- tests/
|   +-- assert_positive_amount.sql
|   +-- test_unique_orders.sql
+-- macros/
|   +-- generate_schema_name.sql
+-- snapshots/
|   +-- scd_customers.sql
+-- seeds/
|   +-- country_codes.csv
+-- dbt_project.yml
+-- profiles.yml

Monorepo vs Polyrepo

Comparison

Factor	Monorepo	Polyrepo
Dependency management	Simple	Complex
Atomic changes	Yes	No
Access control	Coarse	Fine-grained
CI/CD complexity	Higher	Lower
Code discoverability	Easy	Harder
Team scaling	Better for large teams	Better for distributed teams

Best Practices for Data Engineering Repos

Practice	Rationale
Separate config from code	Different env configs (dev/staging/prod) without code changes
Version SQL migrations	Database schema changes tracked in Git
Use `.gitignore` for data	Never commit large CSV/Parquet files
Tag releases	`git tag v1.2.0` for rollback and auditing
Protect main branch	Require PR reviews and CI checks before merge
Automate with CI/CD	Lint, test, and deploy on every merge
Document architecture	README with diagrams and data flow
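
To illustrate separating config from code, here is a minimal sketch that selects a per-environment config file at runtime. It assumes a `configs/` directory containing `dev.yaml`, `staging.yaml`, and `prod.yaml`, and a hypothetical `PIPELINE_ENV` environment variable; parsing the YAML itself is left out.

```python
import os
from pathlib import Path
from typing import Optional

VALID_ENVS = ("dev", "staging", "prod")


def config_path(env: Optional[str] = None, root: Path = Path("configs")) -> Path:
    """Resolve the config file for an environment without changing pipeline code."""
    # Fall back to the (assumed) PIPELINE_ENV variable, then to "dev".
    env = env or os.environ.get("PIPELINE_ENV", "dev")
    if env not in VALID_ENVS:
        raise ValueError(f"unknown environment {env!r}; expected one of {VALID_ENVS}")
    return root / f"{env}.yaml"


print(config_path("prod"))
```

The same commit can then be promoted from dev to staging to prod, with only the environment variable changing between deployments.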

Summary Takeaways

Git is essential — master the core commands (add, commit, push, pull, branch, merge, rebase).
Use conventional commits — clear, consistent commit messages (feat:, fix:, docs:) improve changelogs and traceability.
Branch strategically — Git Flow for scheduled releases, trunk-based for continuous deployment.
PRs enable review — use templates and checklists for quality; require at least one reviewer.
CI/CD automates quality — lint, test, and deploy automatically on merge to main.
Git hooks prevent issues — catch formatting errors, secrets, and type issues before they reach the repository.
Structure matters — organize pipeline code with clear separation of extractors, transformers, and loaders.
Choose repo strategy wisely — monorepo for tight coupling and shared code, polyrepo for independent team ownership.

Practice Exercises

Git workflow: Set up a Git repository with Git Flow branching strategy. Create a feature branch, make changes, create a PR, and merge.
CI/CD pipeline: Create a GitHub Actions workflow that runs linting, tests, and deploys on merge to main.
Pre-commit hooks: Set up pre-commit hooks that run black, flake8, and check for secrets.
Code review: Review a teammate's PR using the data engineering checklist. Document your findings.
Repository structure: Design a project structure for a data pipeline that handles 3 different data sources and loads to a data warehouse.
