dbt & Snowflake Interview

ELT vs ETL: Modern Data Stack Architecture

Understanding data transformation paradigms

Interview Question

"Your company is migrating from a traditional ETL system to a modern ELT architecture. Compare ETL and ELT, explain the tradeoffs, and design the new architecture. Include: (1) tool selection, (2) data modeling approach, (3) transformation strategy, (4) testing and monitoring."

Difficulty: Medium-Hard | Frequently asked at dbt, Snowflake, Fivetran, Airbyte

Theoretical Foundation

ETL (Extract, Transform, Load)

Data is transformed before loading into the target system.

Architecture Diagram

┌─────────────────────────────────────────────────────────────┐
│                    ETL Architecture                         │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Source Systems ──▶ Extract ──▶ Transform ──▶ Load ──▶ Target│
│                                                             │
│  Example:                                                   │
│  PostgreSQL ──▶ Extract rows ──▶ Clean/Validate ──▶ Redshift│
│                                                             │
│  Transformation happens in:                                 │
│  - ETL tool (Informatica, Talend)                          │
│  - Custom code (Python, Java)                              │
│  - Middleware (Apache Camel)                                │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Characteristics:

Transform before loading
Requires staging area
Fixed schema (schema-on-write)
Batch processing
Complex error handling

Pros:

Data is clean when loaded
Reduced load on target system
Well-established patterns

Cons:

Expensive transform infrastructure
Long development cycles
Inflexible schema changes
Limited ad-hoc analysis

ELT (Extract, Load, Transform)

Data is loaded first, then transformed within the target system.

Architecture Diagram

┌─────────────────────────────────────────────────────────────┐
│                    ELT Architecture                         │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Source Systems ──▶ Extract ──▶ Load ──▶ Transform ──▶ Target│
│                                                             │
│  Example:                                                   │
│  PostgreSQL ──▶ Extract rows ──▶ Snowflake ──▶ dbt models   │
│                                                             │
│  Transformation happens in:                                 │
│  - Target warehouse (Snowflake, BigQuery)                  │
│  - dbt (data build tool)                                   │
│  - Spark on warehouse                                      │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Characteristics:

Load raw data first
Transform within warehouse
Flexible schema (schema-on-read)
SQL-based transformations
Version controlled

Pros:

Faster time to insights
Cheaper storage
Flexible schema evolution
SQL-based (easier to learn)
Version controlled transformations

Cons:

Raw data in warehouse
Higher warehouse compute costs
Requires warehouse optimization

Modern Data Stack

Architecture Diagram

┌─────────────────────────────────────────────────────────────┐
│              Modern Data Stack (ELT)                        │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Ingestion: Fivetran, Airbyte, Stitch                      │
│  ┌─────────────────────────────────────────────────────┐   │
│  │  Extract from 200+ sources                          │   │
│  │  Load raw data to warehouse                         │   │
│  │  Managed connectors (no maintenance)                │   │
│  └─────────────────────────────────────────────────────┘   │
│                                                             │
│  Storage: Snowflake, BigQuery, Redshift, Databricks        │
│  ┌─────────────────────────────────────────────────────┐   │
│  │  Scalable compute and storage                       │   │
│  │  SQL interface                                      │   │
│  │  Time travel, cloning                               │   │
│  └─────────────────────────────────────────────────────┘   │
│                                                             │
│  Transformation: dbt, Dataform, SQLMesh                    │
│  ┌─────────────────────────────────────────────────────┐   │
│  │  SQL-based transformations                          │   │
│  │  Version controlled (Git)                           │   │
│  │  Testing and documentation                          │   │
│  └─────────────────────────────────────────────────────┘   │
│                                                             │
│  Orchestration: Airflow, Dagster, Prefect                  │
│  ┌─────────────────────────────────────────────────────┐   │
│  │  Schedule and monitor pipelines                     │   │
│  │  Dependency management                              │   │
│  │  Alerting                                           │   │
│  └─────────────────────────────────────────────────────┘   │
│                                                             │
│  BI: Tableau, Looker, Power BI, Metabase                   │
│  ┌─────────────────────────────────────────────────────┐   │
│  │  Visualization and dashboards                       │   │
│  │  Self-service analytics                             │   │
│  └─────────────────────────────────────────────────────┘   │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Comparison Matrix

Aspect	ETL	ELT
Transform Location	External tool	Target warehouse
Data Movement	Source → Staging → Target	Source → Target
Schema	On-write (fixed)	On-read (flexible)
Development	Custom code	SQL + dbt
Version Control	Difficult	Easy (Git)
Testing	Manual	Automated (dbt tests)
Time to Insights	Weeks/Months	Days/Hours
Cost	High (ETL infrastructure)	Lower (warehouse compute)
Flexibility	Low	High
Skill Requirement	ETL developers	SQL + analytics

Data Modeling Approaches

Inmon (Top-Down)

Kimball (Bottom-Up)

Modern (dbt Approach)

Code Implementation

ETL Example (Traditional)

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder \
    .appName("ETL_Example") \
    .getOrCreate()

# ============================================================
# ETL: Extract, Transform, Load
# ============================================================

# EXTRACT: Read from source
source_data = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:postgresql://source:5432/production") \
    .option("dbtable", "orders") \
    .option("user", "reader") \
    .option("password", "password") \
    .load()

# TRANSFORM: Clean and enrich (in Spark)
transformed_data = source_data \
    .filter(F.col("status") != "cancelled") \
    .withColumn("order_date", F.to_date("created_at")) \
    .withColumn("amount_clean", F.when(F.col("amount") < 0, 0).otherwise(F.col("amount"))) \
    .join(customer_data, "customer_id") \
    .groupBy("order_date", "customer_segment") \
    .agg(
        F.count("*").alias("order_count"),
        F.sum("amount_clean").alias("total_amount")
    )

# LOAD: Write to target warehouse
transformed_data.write \
    .format("parquet") \
    .mode("overwrite") \
    .partitionBy("order_date") \
    .save("s3://warehouse/order_summary/")

ELT Example (Modern with dbt)

-- ============================================================
-- ELT: Extract, Load, Transform (using dbt)
-- ============================================================

-- models/staging/stg_orders.sql
-- Source: Raw data loaded by Fivetran/Airbyte
-- Purpose: Clean and standardize raw data

SELECT
    id AS order_id,
    customer_id,
    amount,
    status,
    created_at,
    updated_at,
    _fivetran_synced AS _loaded_at
FROM {{ source('production', 'orders') }}
WHERE status != 'deleted'

-- models/marts/fct_orders.sql
-- Purpose: Business-level order facts

WITH orders AS (
    SELECT * FROM {{ ref('stg_orders') }}
),

customers AS (
    SELECT * FROM {{ ref('stg_customers') }}
),

order_summary AS (
    SELECT
        o.order_id,
        o.customer_id,
        c.customer_name,
        c.segment AS customer_segment,
        o.amount,
        o.status,
        DATE_TRUNC('day', o.created_at) AS order_date,
        o.created_at,
        o.updated_at
    FROM orders o
    LEFT JOIN customers c ON o.customer_id = c.customer_id
)

SELECT * FROM order_summary

-- models/marts/rpt_daily_revenue.sql
-- Purpose: Daily revenue report

WITH orders AS (
    SELECT * FROM {{ ref('fct_orders') }}
),

daily_summary AS (
    SELECT
        order_date,
        customer_segment,
        COUNT(DISTINCT order_id) AS order_count,
        COUNT(DISTINCT customer_id) AS customer_count,
        SUM(amount) AS total_revenue,
        AVG(amount) AS avg_order_value
    FROM orders
    WHERE status = 'completed'
    GROUP BY order_date, customer_segment
)

SELECT * FROM daily_summary

dbt Project Structure

Architecture Diagram

my_dbt_project/
├── dbt_project.yml
├── packages.yml
├── models/
│   ├── staging/
│   │   ├── _staging__models.yml
│   │   ├── stg_orders.sql
│   │   ├── stg_customers.sql
│   │   └── stg_products.sql
│   ├── marts/
│   │   ├── _marts__models.yml
│   │   ├── fct_orders.sql
│   │   ├── dim_customers.sql
│   │   └── rpt_daily_revenue.sql
│   └── intermediate/
│       ├── int_orders_enriched.sql
│       └── int_customer_lifetime.sql
├── tests/
│   ├── test_order_amount_positive.sql
│   └── test_customer_unique.sql
├── macros/
│   ├── generate_schema_name.sql
│   └── materializations.sql
└── seeds/
    ├── country_codes.csv
    └── product_categories.csv

dbt Configuration

# dbt_project.yml
name: 'my_project'
version: '1.0.0'
config-version: 2

profile: 'snowflake'

model-paths: ["models"]
analysis-paths: ["analyses"]
test-paths: ["tests"]
seed-paths: ["seeds"]
macro-paths: ["macros"]
snapshot-paths: ["snapshots"]

clean-targets:
  - "target"
  - "dbt_packages"

models:
  my_project:
    staging:
      +materialized: view
      +schema: staging
    marts:
      +materialized: table
      +schema: analytics
    intermediate:
      +materialized: ephemeral

vars:
  start_date: '2024-01-01'
  end_date: '2024-12-31'

# packages.yml
packages:
  - package: dbt-labs/dbt_utils
    version: ">=1.0.0"
  - package: dbt-labs/codegen
    version: ">=0.11.0"
  - package: calogica/dbt_expectations
    version: ">=0.10.0"

dbt Tests

# models/staging/_staging__models.yml
version: 2

models:
  - name: stg_orders
    description: "Cleaned orders from production database"
    columns:
      - name: order_id
        description: "Primary key"
        tests:
          - unique
          - not_null
      - name: customer_id
        description: "Foreign key to customers"
        tests:
          - not_null
          - relationships:
              to: ref('stg_customers')
              field: customer_id
      - name: amount
        description: "Order amount"
        tests:
          - not_null
          - dbt_utils.accepted_range:
              min_value: 0
              max_value: 1000000
      - name: status
        description: "Order status"
        tests:
          - accepted_values:
              values: ['pending', 'completed', 'cancelled', 'refunded']

# models/marts/_marts__models.yml
version: 2

models:
  - name: fct_orders
    description: "Order facts with customer information"
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: order_date
        tests:
          - not_null
          - dbt_utils.accepted_range:
              min_value: "2020-01-01"
              max_value: "2030-12-31"

dbt Macros

-- macros/generate_schema_name.sql
{% macro generate_schema_name(custom_schema_name, node) -%}
    {%- set default_schema = target.schema -%}
    {%- if custom_schema_name is none -%}
        {{ default_schema }}
    {%- else -%}
        {{ default_schema }}_{{ custom_schema_name }}
    {%- endif -%}
{%- endmacro %}

-- macros/insert_audit_columns.sql
{% macro insert_audit_columns() %}
    , CURRENT_TIMESTAMP AS _loaded_at
    , '{{ invocation_id }}' AS _dbt_invocation_id
    , '{{ run_started_at }}' AS _dbt_run_started_at
{% endmacro %}

dbt Snapshots (SCD Type 2)

-- snapshots/customers_snapshot.sql
{% snapshot customers_snapshot %}

{{
    config(
        target_schema='snapshots',
        unique_key='customer_id',
        strategy='timestamp',
        updated_at='updated_at',
    )
}}

SELECT * FROM {{ source('production', 'customers') }}

{% endsnapshot %}

Airflow Integration

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.utils.dates import days_ago

with DAG(
    'dbt_pipeline',
    schedule_interval='0 6 * * *',
    start_date=days_ago(1),
    catchup=False,
) as dag:

    # Run dbt staging models
    dbt_staging = BashOperator(
        task_id='dbt_staging',
        bash_command='cd /opt/dbt && dbt run --select staging',
    )

    # Run dbt tests
    dbt_test = BashOperator(
        task_id='dbt_test',
        bash_command='cd /opt/dbt && dbt test --select staging',
    )

    # Run dbt marts
    dbt_marts = BashOperator(
        task_id='dbt_marts',
        bash_command='cd /opt/dbt && dbt run --select marts',
    )

    # Run dbt tests on marts
    dbt_test_marts = BashOperator(
        task_id='dbt_test_marts',
        bash_command='cd /opt/dbt && dbt test --select marts',
    )

    # Dependencies
    dbt_staging >> dbt_test >> dbt_marts >> dbt_test_marts

💡

Production Tip: dbt is the de facto standard for ELT transformations. It provides: (1) version control for SQL, (2) automated testing, (3) documentation generation, (4) incremental models, and (5) a rich package ecosystem. Always use dbt for SQL-based transformations.

Common Follow-Up Questions

Q1: When would you choose ETL over ELT?

Choose ETL when:

Target system can't handle raw data (e.g., legacy systems)
Data is extremely sensitive and must be masked before loading
Transformations are complex and require custom code
Source systems have limited data

Choose ELT when:

Target is a modern warehouse (Snowflake, BigQuery)
You need flexibility for ad-hoc analysis
You want faster time to insights
Your team knows SQL

Q2: How do you handle late-arriving data in ELT?

-- dbt incremental model with late-arriving data
{{
    config(
        materialized='incremental',
        unique_key='order_id',
        incremental_strategy='merge'
    )
}}

SELECT
    order_id,
    customer_id,
    amount,
    status,
    created_at,
    updated_at
FROM {{ source('production', 'orders') }}

{% if is_incremental() %}
WHERE updated_at > (SELECT MAX(updated_at) FROM {{ this }})
{% endif %}

Q3: How do you test data quality in ELT?

-- dbt tests
-- Schema tests
tests:
  - unique
  - not_null
  - accepted_values
  - relationships

-- Data tests
SELECT
    order_id
FROM {{ ref('fct_orders') }}
WHERE amount < 0
-- Should return 0 rows

Q4: How do you handle schema evolution in ELT?

-- dbt handles schema evolution automatically
-- New columns are added to the target table
-- Use {{ config(materialized='incremental') }} for automatic schema evolution

⚠️

Critical Consideration: ELT requires a powerful warehouse to handle transformations. If your warehouse is slow or expensive, ETL might be more cost-effective. Always benchmark both approaches for your specific workload.

Company-Specific Tips

dbt Interview Tips

Discuss dbt and SQLMesh for transformations
Explain incremental models and materializations
Mention testing and documentation
Talk about packages and macros

Snowflake Interview Tips

Focus on ELT with Snowflake
Discuss Snowflake features (time travel, cloning)
Mention Snowflake optimization (clustering, search optimization)
Talk about Snowflake cost management

Fivetran/Airbyte Interview Tips

Discuss managed connectors for ingestion
Explain CDC and incremental sync
Mention schema evolution handling
Talk about data freshness monitoring

ℹ️

Final Takeaway: ELT is the modern standard for data transformation. It provides flexibility, speed, and cost-effectiveness. Use dbt for SQL transformations, Snowflake/BigQuery for storage, and Fivetran/Airbyte for ingestion. Always consider your team's skills and budget when choosing between ETL and ELT.