Why Git Matters for Data Engineers
Version control is fundamental for managing data pipeline code, configuration files, SQL transformations, and infrastructure-as-code. Git enables collaboration, rollback, and audit trails.
Git use cases in data engineering:

| Area | Examples |
|---|---|
| Pipeline Code | Python scripts, DAGs |
| SQL Transformations | dbt models, views, functions |
| Infrastructure | Terraform, CloudFormation |
| Configuration | YAML, JSON, env files |
| Documentation | README, runbooks, data catalogs |
| Tests | Unit tests, integration tests |
| CI/CD | GitHub Actions, GitLab CI |
Git Basics
Essential Commands
# Initialize repository
git init
# Clone repository
git clone https://github.com/org/data-pipeline.git
# Check status
git status
# Add files to staging
git add file.py # Add specific file
git add . # Add all changes
git add -p # Interactive staging (patch mode)
# Commit
git commit -m "feat: add order processing pipeline"
git commit --amend # Amend last commit
# View history
git log --oneline -10 # Last 10 commits
git log --graph --oneline # Visual branch history
git log --stat # Show changed files
# Diff
git diff # Unstaged changes
git diff --staged # Staged changes
git diff main..feature-branch # Compare branches
# Remote operations
git remote add origin https://github.com/org/repo.git
git push origin main
git pull origin main
git fetch origin
Undoing Changes
# Discard unstaged changes
git checkout -- file.py
git restore file.py # Git 2.23+
# Unstage a file
git reset HEAD file.py
git restore --staged file.py # Git 2.23+
# Amend last commit (before push)
git commit --amend -m "new message"
# Revert a commit (creates new commit)
git revert abc123
# Reset to a specific commit
git reset --hard abc123 # DANGEROUS: moves the branch to abc123 and discards all uncommitted changes
git reset --soft HEAD~1 # Undo the last commit but keep its changes staged
Branching Strategies
Git Flow
[Diagram: Git Flow. main and develop are long-lived branches; feature/1 and feature/2 branch off develop and merge back into it; hotfix/1 branches off main.]
# Create feature branch
git checkout -b feature/order-pipeline
# Work on feature
git add .
git commit -m "feat: implement order extraction"
# Push to remote
git push -u origin feature/order-pipeline
# Create pull request (via GitHub/GitLab CLI)
gh pr create --title "Add order pipeline" --body "Implements..."
# Merge to develop
git checkout develop
git merge feature/order-pipeline
git push origin develop
# Delete branch
git branch -d feature/order-pipeline
git push origin --delete feature/order-pipeline
Trunk-Based Development
[Diagram: trunk-based development. Short-lived branches feat/a through feat/g each branch off main and merge back into it.]
Commit Message Conventions
Conventional Commits
<type>[optional scope]: <description>

[optional body]

[optional footer(s)]
Types
| Type | Description | Example |
|---|---|---|
| feat | New feature | feat(pipeline): add order extraction |
| fix | Bug fix | fix: correct date parsing in transform |
| docs | Documentation | docs: update README with setup instructions |
| style | Formatting | style: apply black formatting |
| refactor | Code restructuring | refactor: extract common utilities |
| test | Adding tests | test: add unit tests for transform |
| chore | Maintenance | chore: update dependencies |
| perf | Performance | perf: optimize bulk insert |
| ci | CI/CD | ci: add GitHub Actions workflow |
| build | Build system | build: add Dockerfile |
Examples
# Simple commit
git commit -m "feat: add daily sales aggregation pipeline"
# With scope
git commit -m "feat(orders): implement CDC ingestion from PostgreSQL"
# With body
git commit -m "fix(transform): handle null values in amount column

Previously, null amounts would cause the aggregation to fail.
This fix adds a coalesce to default null amounts to 0.

Closes #123"
# Breaking change
git commit -m "feat!: change order schema to include customer segment

BREAKING CHANGE: order table now requires customer_segment column"
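The convention above is regular enough to check mechanically, for example in a commit-msg hook or a CI step. The following is a minimal sketch, not an official validator: the function name `parse_commit_message`, the `CommitHeader` type, and the exact regular expression are assumptions made for illustration, and the allowed types are simply those listed in the table above.

```python
import re
from typing import NamedTuple, Optional

# Commit types taken from the table above.
ALLOWED_TYPES = ("feat", "fix", "docs", "style", "refactor",
                 "test", "chore", "perf", "ci", "build")

# Header shape: <type>[(scope)][!]: <description>
HEADER_RE = re.compile(
    r"^(?P<type>[a-z]+)"           # commit type, e.g. feat
    r"(?:\((?P<scope>[^()]+)\))?"  # optional scope in parentheses
    r"(?P<breaking>!)?"            # optional breaking-change marker
    r": (?P<description>\S.*)$"    # colon, space, non-empty description
)


class CommitHeader(NamedTuple):
    commit_type: str
    scope: Optional[str]
    breaking: bool
    description: str


def parse_commit_message(message: str) -> Optional[CommitHeader]:
    """Return the parsed header, or None if the message does not conform."""
    lines = message.splitlines()
    if not lines:
        return None
    match = HEADER_RE.match(lines[0])
    if match is None or match.group("type") not in ALLOWED_TYPES:
        return None
    # A "BREAKING CHANGE:" footer marks a breaking change just like "!".
    breaking = match.group("breaking") is not None or any(
        line.startswith("BREAKING CHANGE:") for line in lines[1:]
    )
    return CommitHeader(
        match.group("type"),
        match.group("scope"),
        breaking,
        match.group("description"),
    )
```

A hook could reject the commit when `parse_commit_message` returns `None`, and a release script could use the `breaking` flag to decide whether a major version bump is needed.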
Pull Request Best Practices
PR Template
## Description
Brief description of changes
## Type of Change
- [ ] Bug fix
- [ ] New feature
- [ ] Breaking change
- [ ] Documentation update
## Testing
- [ ] Unit tests added/updated
- [ ] Integration tests pass
- [ ] Manual testing completed
## Checklist
- [ ] Code follows style guidelines
- [ ] Self-review completed
- [ ] Documentation updated
- [ ] No new warnings
Review Checklist for Data Engineers
Code review checklist:
✔ Code follows project style guidelines
✔ SQL queries are optimized and use appropriate indexes
✔ Error handling is comprehensive
✔ Logging is appropriate and informative
✔ Data quality checks are included
✔ Tests cover edge cases
✔ Documentation is updated
✔ No hardcoded credentials or secrets
✔ Backward compatibility maintained
✔ Performance implications considered
✔ Rollback strategy documented
CI/CD Integration
GitHub Actions Workflow
# .github/workflows/data-pipeline.yml
name: Data Pipeline CI/CD

on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]

jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - name: Install dependencies
        run: |
          pip install -r requirements.txt
          pip install black flake8 mypy
      - name: Run black
        run: black --check .
      - name: Run flake8
        run: flake8 .
      - name: Run mypy
        run: mypy .

  test:
    runs-on: ubuntu-latest
    needs: lint
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      # Each job starts on a fresh runner, so dependencies must be installed again.
      # pytest and pytest-cov are installed explicitly in case requirements.txt omits them.
      - name: Install dependencies
        run: |
          pip install -r requirements.txt
          pip install pytest pytest-cov
      - name: Run tests
        run: pytest tests/ -v --cov=src --cov-report=xml
      - name: Upload coverage
        uses: codecov/codecov-action@v3

  deploy:
    runs-on: ubuntu-latest
    needs: test
    # Deploy only for pushes to main, never for pull requests.
    if: github.ref == 'refs/heads/main'
    steps:
      - uses: actions/checkout@v4
      - name: Configure AWS
        uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-east-1
      - name: Deploy to S3
        run: |
          aws s3 sync src/ s3://my-bucket/pipeline/
      - name: Notify deployment
        run: |
          curl -X POST -H 'Content-Type: application/json' \
            -d '{"text": "Pipeline deployed to production"}' \
            "${{ secrets.SLACK_WEBHOOK }}"
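The test job in the workflow above runs pytest against tests/. As a rough illustration of what it might pick up, here is a minimal sketch of a unit test for a transform that treats null amounts as 0. The function `total_amount` and the test names are hypothetical and do not come from any real pipeline; in a real project the function would live under src/ and be imported by the test module.

```python
from typing import Iterable, Optional


def total_amount(amounts: Iterable[Optional[float]]) -> float:
    """Sum order amounts, treating missing (None) values as 0."""
    return sum((amount if amount is not None else 0.0 for amount in amounts), 0.0)


def test_total_amount_handles_nulls() -> None:
    # A None in the input must not raise and must not change the total.
    assert total_amount([10.0, None, 5.5]) == 15.5


def test_total_amount_empty_input() -> None:
    # An empty batch aggregates to zero.
    assert total_amount([]) == 0.0
```

pytest collects functions whose names start with `test_`, so no extra registration is needed once the file is saved under tests/.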
Git Hooks
Pre-commit Hooks
# .pre-commit-config.yaml
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.5.0
    hooks:
      - id: trailing-whitespace
      - id: end-of-file-fixer
      - id: check-yaml
      - id: check-json
      - id: check-merge-conflict
      - id: detect-private-key
  - repo: https://github.com/psf/black
    rev: 24.1.0
    hooks:
      - id: black
  - repo: https://github.com/pycqa/flake8
    rev: 7.0.0
    hooks:
      - id: flake8
  - repo: https://github.com/pre-commit/mirrors-mypy
    rev: v1.8.0
    hooks:
      - id: mypy
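#!/bin/bash
# Custom hook: check for secrets
# Save as .git/hooks/pre-commit and make it executable (chmod +x .git/hooks/pre-commit)
# -z and -0 keep file names with spaces intact; --diff-filter=ACM skips deleted files
if git diff --cached --name-only -z --diff-filter=ACM | xargs -0 grep -l -E "password|secret|api_key" 2>/dev/null; then
    echo "ERROR: Potential secrets detected in staged files!"
    exit 1
fi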
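The grep-based hook above flags any file that merely contains one of the keywords, including comments and documentation. A slightly narrower heuristic is to flag only lines where such a key is assigned a non-empty value. The sketch below illustrates that idea in Python; `SECRET_PATTERN` and `find_secret_lines` are hypothetical names, and the pattern is an assumption that will still produce false positives (for example, a value read from an environment variable) and miss secrets stored under other key names.

```python
import re
from typing import List, Tuple

# Assumed heuristic: a sensitive key name followed by "=" or ":" and a non-empty value.
SECRET_PATTERN = re.compile(
    r"(password|secret|api_key)\s*[=:]\s*['\"]?[^\s'\"]+",
    re.IGNORECASE,
)


def find_secret_lines(text: str) -> List[Tuple[int, str]]:
    """Return (line_number, line) pairs that look like hardcoded secrets."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        if SECRET_PATTERN.search(line):
            hits.append((number, line.strip()))
    return hits
```

A hook script could call `find_secret_lines` on the contents of each staged file and exit with a non-zero status when any hit is returned.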
Managing Pipeline Code
Project Structure
data-pipeline/
├── .github/
│   └── workflows/
│       ├── ci.yml
│       └── deploy.yml
├── src/
│   ├── extractors/
│   │   ├── __init__.py
│   │   ├── postgres.py
│   │   └── api.py
│   ├── transformers/
│   │   ├── __init__.py
│   │   ├── orders.py
│   │   └── customers.py
│   ├── loaders/
│   │   ├── __init__.py
│   │   └── warehouse.py
│   └── utils/
│       ├── __init__.py
│       └── config.py
├── tests/
│   ├── unit/
│   └── integration/
├── sql/
│   ├── migrations/
│   └── models/
├── configs/
│   ├── dev.yaml
│   ├── staging.yaml
│   └── prod.yaml
├── docker/
│   └── Dockerfile
├── docs/
│   └── architecture.md
├── README.md
├── requirements.txt
└── pyproject.toml
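Keeping one config file per environment under configs/ means the pipeline code only has to choose a file by name. As a rough sketch of what src/utils/config.py might contain, assuming the three environments shown in the tree, the hypothetical helper below resolves the path and rejects unknown environments early. Loading and parsing the YAML is left out.

```python
from pathlib import Path

# Environments matching the files under configs/ in the tree above.
VALID_ENVIRONMENTS = ("dev", "staging", "prod")


def config_path(environment: str, root: Path = Path("configs")) -> Path:
    """Return the YAML config path for an environment, e.g. configs/dev.yaml."""
    if environment not in VALID_ENVIRONMENTS:
        raise ValueError(
            f"Unknown environment {environment!r}; expected one of {VALID_ENVIRONMENTS}"
        )
    return root / f"{environment}.yaml"
```

Failing on an unknown name avoids silently running a production job with a mistyped or missing configuration.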
dbt Project Structure
dbt_project/
├── models/
│   ├── staging/
│   │   ├── stg_orders.sql
│   │   ├── stg_customers.sql
│   │   └── stg_products.sql
│   ├── intermediate/
│   │   ├── int_orders_enriched.sql
│   │   └── int_customer_metrics.sql
│   ├── marts/
│   │   ├── fct_orders.sql
│   │   ├── dim_customers.sql
│   │   └── dim_products.sql
│   └── sources/
│       └── source.yml
├── tests/
│   ├── assert_positive_amount.sql
│   └── test_unique_orders.sql
├── macros/
│   └── generate_schema_name.sql
├── snapshots/
│   └── scd_customers.sql
├── seeds/
│   └── country_codes.csv
├── dbt_project.yml
└── profiles.yml
Monorepo vs Polyrepo
Monorepo
organization/
├── data-pipeline/
├── data-warehouse/
├── ml-models/
├── dashboards/
├── shared-libs/
└── documentation/
Pros:
- Atomic commits across dependencies
- Easier refactoring
- Single CI/CD pipeline
- Shared code visibility
Cons:
- Larger repository size
- Complex CI/CD
- Access control harder
Polyrepo
├── order-pipeline/ (repo)
├── customer-pipeline/ (repo)
├── warehouse-transforms/ (repo)
├── ml-features/ (repo)
└── shared-utils/ (repo)
Pros:
- Clear ownership boundaries
- Independent deployment
- Smaller, focused repos
- Fine-grained access control
Cons:
- Dependency management complexity
- Cross-repo changes harder
- Multiple CI/CD pipelines
Comparison
| Factor | Monorepo | Polyrepo |
|---|---|---|
| Dependency management | Simple | Complex |
| Atomic changes | Yes | No |
| Access control | Coarse | Fine-grained |
| CI/CD complexity | Higher | Lower |
| Code discoverability | Easy | Harder |
| Team scaling | Better for large teams | Better for distributed teams |
Key Takeaways
- Git is essential: master the core commands and workflows
- Use conventional commits: clear, consistent commit messages
- Branch strategically: Git Flow for releases, trunk-based for continuous deployment
- PRs enable review: use templates and checklists for quality
- CI/CD automates quality: lint, test, and deploy automatically
- Git hooks prevent issues: catch problems before they're committed
- Structure matters: organize pipeline code for maintainability
- Choose repo strategy wisely: monorepo for tight coupling, polyrepo for independence
Practice Exercises
- Git workflow: Set up a Git repository with the Git Flow branching strategy. Create a feature branch, make changes, create a PR, and merge.
- CI/CD pipeline: Create a GitHub Actions workflow that runs linting and tests, and deploys on merge to main.
- Pre-commit hooks: Set up pre-commit hooks that run black and flake8 and check for secrets.
- Code review: Review a teammate's PR using the data engineering checklist. Document your findings.
- Repository structure: Design a project structure for a data pipeline that handles 3 different data sources and loads to a data warehouse.