ELT vs ETL: Modern Data Stack Architecture
Understanding data transformation paradigms
Interview Question
"Your company is migrating from a traditional ETL system to a modern ELT architecture. Compare ETL and ELT, explain the tradeoffs, and design the new architecture. Include: (1) tool selection, (2) data modeling approach, (3) transformation strategy, (4) testing and monitoring."
Difficulty: Medium-Hard | Frequently asked at dbt, Snowflake, Fivetran, Airbyte
Theoretical Foundation
ETL (Extract, Transform, Load)
Data is transformed before loading into the target system.
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β ETL Architecture β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β Source Systems βββΆ Extract βββΆ Transform βββΆ Load βββΆ Targetβ
β β
β Example: β
β PostgreSQL βββΆ Extract rows βββΆ Clean/Validate βββΆ Redshiftβ
β β
β Transformation happens in: β
β - ETL tool (Informatica, Talend) β
β - Custom code (Python, Java) β
β - Middleware (Apache Camel) β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Characteristics:
- Transform before loading
- Requires staging area
- Fixed schema (schema-on-write)
- Batch processing
- Complex error handling
Pros:
- Data is clean when loaded
- Reduced load on target system
- Well-established patterns
Cons:
- Expensive transform infrastructure
- Long development cycles
- Inflexible schema changes
- Limited ad-hoc analysis
ELT (Extract, Load, Transform)
Data is loaded first, then transformed within the target system.
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β ELT Architecture β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β Source Systems βββΆ Extract βββΆ Load βββΆ Transform βββΆ Targetβ
β β
β Example: β
β PostgreSQL βββΆ Extract rows βββΆ Snowflake βββΆ dbt models β
β β
β Transformation happens in: β
β - Target warehouse (Snowflake, BigQuery) β
β - dbt (data build tool) β
β - Spark on warehouse β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Characteristics:
- Load raw data first
- Transform within warehouse
- Flexible schema (schema-on-read)
- SQL-based transformations
- Version controlled
Pros:
- Faster time to insights
- Cheaper storage
- Flexible schema evolution
- SQL-based (easier to learn)
- Version controlled transformations
Cons:
- Raw data in warehouse
- Higher warehouse compute costs
- Requires warehouse optimization
Modern Data Stack
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Modern Data Stack (ELT) β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β Ingestion: Fivetran, Airbyte, Stitch β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β Extract from 200+ sources β β
β β Load raw data to warehouse β β
β β Managed connectors (no maintenance) β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
β Storage: Snowflake, BigQuery, Redshift, Databricks β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β Scalable compute and storage β β
β β SQL interface β β
β β Time travel, cloning β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
β Transformation: dbt, Dataform, SQLMesh β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β SQL-based transformations β β
β β Version controlled (Git) β β
β β Testing and documentation β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
β Orchestration: Airflow, Dagster, Prefect β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β Schedule and monitor pipelines β β
β β Dependency management β β
β β Alerting β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
β BI: Tableau, Looker, Power BI, Metabase β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β Visualization and dashboards β β
β β Self-service analytics β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Comparison Matrix
| Aspect | ETL | ELT |
|---|---|---|
| Transform Location | External tool | Target warehouse |
| Data Movement | Source β Staging β Target | Source β Target |
| Schema | On-write (fixed) | On-read (flexible) |
| Development | Custom code | SQL + dbt |
| Version Control | Difficult | Easy (Git) |
| Testing | Manual | Automated (dbt tests) |
| Time to Insights | Weeks/Months | Days/Hours |
| Cost | High (ETL infrastructure) | Lower (warehouse compute) |
| Flexibility | Low | High |
| Skill Requirement | ETL developers | SQL + analytics |
Data Modeling Approaches
Inmon (Top-Down)
Kimball (Bottom-Up)
Modern (dbt Approach)
Code Implementation
ETL Example (Traditional)
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
spark = SparkSession.builder \
.appName("ETL_Example") \
.getOrCreate()
# ============================================================
# ETL: Extract, Transform, Load
# ============================================================
# EXTRACT: Read from source
source_data = spark.read \
.format("jdbc") \
.option("url", "jdbc:postgresql://source:5432/production") \
.option("dbtable", "orders") \
.option("user", "reader") \
.option("password", "password") \
.load()
# TRANSFORM: Clean and enrich (in Spark)
transformed_data = source_data \
.filter(F.col("status") != "cancelled") \
.withColumn("order_date", F.to_date("created_at")) \
.withColumn("amount_clean", F.when(F.col("amount") < 0, 0).otherwise(F.col("amount"))) \
.join(customer_data, "customer_id") \
.groupBy("order_date", "customer_segment") \
.agg(
F.count("*").alias("order_count"),
F.sum("amount_clean").alias("total_amount")
)
# LOAD: Write to target warehouse
transformed_data.write \
.format("parquet") \
.mode("overwrite") \
.partitionBy("order_date") \
.save("s3://warehouse/order_summary/")
ELT Example (Modern with dbt)
-- ============================================================
-- ELT: Extract, Load, Transform (using dbt)
-- ============================================================
-- models/staging/stg_orders.sql
-- Source: Raw data loaded by Fivetran/Airbyte
-- Purpose: Clean and standardize raw data
SELECT
id AS order_id,
customer_id,
amount,
status,
created_at,
updated_at,
_fivetran_synced AS _loaded_at
FROM {{ source('production', 'orders') }}
WHERE status != 'deleted'
-- models/marts/fct_orders.sql
-- Purpose: Business-level order facts
WITH orders AS (
SELECT * FROM {{ ref('stg_orders') }}
),
customers AS (
SELECT * FROM {{ ref('stg_customers') }}
),
order_summary AS (
SELECT
o.order_id,
o.customer_id,
c.customer_name,
c.segment AS customer_segment,
o.amount,
o.status,
DATE_TRUNC('day', o.created_at) AS order_date,
o.created_at,
o.updated_at
FROM orders o
LEFT JOIN customers c ON o.customer_id = c.customer_id
)
SELECT * FROM order_summary
-- models/marts/rpt_daily_revenue.sql
-- Purpose: Daily revenue report
WITH orders AS (
SELECT * FROM {{ ref('fct_orders') }}
),
daily_summary AS (
SELECT
order_date,
customer_segment,
COUNT(DISTINCT order_id) AS order_count,
COUNT(DISTINCT customer_id) AS customer_count,
SUM(amount) AS total_revenue,
AVG(amount) AS avg_order_value
FROM orders
WHERE status = 'completed'
GROUP BY order_date, customer_segment
)
SELECT * FROM daily_summary
dbt Project Structure
my_dbt_project/
βββ dbt_project.yml
βββ packages.yml
βββ models/
β βββ staging/
β β βββ _staging__models.yml
β β βββ stg_orders.sql
β β βββ stg_customers.sql
β β βββ stg_products.sql
β βββ marts/
β β βββ _marts__models.yml
β β βββ fct_orders.sql
β β βββ dim_customers.sql
β β βββ rpt_daily_revenue.sql
β βββ intermediate/
β βββ int_orders_enriched.sql
β βββ int_customer_lifetime.sql
βββ tests/
β βββ test_order_amount_positive.sql
β βββ test_customer_unique.sql
βββ macros/
β βββ generate_schema_name.sql
β βββ materializations.sql
βββ seeds/
βββ country_codes.csv
βββ product_categories.csv
dbt Configuration
# dbt_project.yml
name: 'my_project'
version: '1.0.0'
config-version: 2
profile: 'snowflake'
model-paths: ["models"]
analysis-paths: ["analyses"]
test-paths: ["tests"]
seed-paths: ["seeds"]
macro-paths: ["macros"]
snapshot-paths: ["snapshots"]
clean-targets:
- "target"
- "dbt_packages"
models:
my_project:
staging:
+materialized: view
+schema: staging
marts:
+materialized: table
+schema: analytics
intermediate:
+materialized: ephemeral
vars:
start_date: '2024-01-01'
end_date: '2024-12-31'
# packages.yml
packages:
- package: dbt-labs/dbt_utils
version: ">=1.0.0"
- package: dbt-labs/codegen
version: ">=0.11.0"
- package: calogica/dbt_expectations
version: ">=0.10.0"
dbt Tests
# models/staging/_staging__models.yml
version: 2
models:
- name: stg_orders
description: "Cleaned orders from production database"
columns:
- name: order_id
description: "Primary key"
tests:
- unique
- not_null
- name: customer_id
description: "Foreign key to customers"
tests:
- not_null
- relationships:
to: ref('stg_customers')
field: customer_id
- name: amount
description: "Order amount"
tests:
- not_null
- dbt_utils.accepted_range:
min_value: 0
max_value: 1000000
- name: status
description: "Order status"
tests:
- accepted_values:
values: ['pending', 'completed', 'cancelled', 'refunded']
# models/marts/_marts__models.yml
version: 2
models:
- name: fct_orders
description: "Order facts with customer information"
columns:
- name: order_id
tests:
- unique
- not_null
- name: order_date
tests:
- not_null
- dbt_utils.accepted_range:
min_value: "2020-01-01"
max_value: "2030-12-31"
dbt Macros
-- macros/generate_schema_name.sql
{% macro generate_schema_name(custom_schema_name, node) -%}
{%- set default_schema = target.schema -%}
{%- if custom_schema_name is none -%}
{{ default_schema }}
{%- else -%}
{{ default_schema }}_{{ custom_schema_name }}
{%- endif -%}
{%- endmacro %}
-- macros/insert_audit_columns.sql
{% macro insert_audit_columns() %}
, CURRENT_TIMESTAMP AS _loaded_at
, '{{ invocation_id }}' AS _dbt_invocation_id
, '{{ run_started_at }}' AS _dbt_run_started_at
{% endmacro %}
dbt Snapshots (SCD Type 2)
-- snapshots/customers_snapshot.sql
{% snapshot customers_snapshot %}
{{
config(
target_schema='snapshots',
unique_key='customer_id',
strategy='timestamp',
updated_at='updated_at',
)
}}
SELECT * FROM {{ source('production', 'customers') }}
{% endsnapshot %}
Airflow Integration
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.utils.dates import days_ago
with DAG(
'dbt_pipeline',
schedule_interval='0 6 * * *',
start_date=days_ago(1),
catchup=False,
) as dag:
# Run dbt staging models
dbt_staging = BashOperator(
task_id='dbt_staging',
bash_command='cd /opt/dbt && dbt run --select staging',
)
# Run dbt tests
dbt_test = BashOperator(
task_id='dbt_test',
bash_command='cd /opt/dbt && dbt test --select staging',
)
# Run dbt marts
dbt_marts = BashOperator(
task_id='dbt_marts',
bash_command='cd /opt/dbt && dbt run --select marts',
)
# Run dbt tests on marts
dbt_test_marts = BashOperator(
task_id='dbt_test_marts',
bash_command='cd /opt/dbt && dbt test --select marts',
)
# Dependencies
dbt_staging >> dbt_test >> dbt_marts >> dbt_test_marts
π‘
Production Tip: dbt is the de facto standard for ELT transformations. It provides: (1) version control for SQL, (2) automated testing, (3) documentation generation, (4) incremental models, and (5) a rich package ecosystem. Always use dbt for SQL-based transformations.
Common Follow-Up Questions
Q1: When would you choose ETL over ELT?
Choose ETL when:
- Target system can't handle raw data (e.g., legacy systems)
- Data is extremely sensitive and must be masked before loading
- Transformations are complex and require custom code
- Source systems have limited data
Choose ELT when:
- Target is a modern warehouse (Snowflake, BigQuery)
- You need flexibility for ad-hoc analysis
- You want faster time to insights
- Your team knows SQL
Q2: How do you handle late-arriving data in ELT?
-- dbt incremental model with late-arriving data
{{
config(
materialized='incremental',
unique_key='order_id',
incremental_strategy='merge'
)
}}
SELECT
order_id,
customer_id,
amount,
status,
created_at,
updated_at
FROM {{ source('production', 'orders') }}
{% if is_incremental() %}
WHERE updated_at > (SELECT MAX(updated_at) FROM {{ this }})
{% endif %}
Q3: How do you test data quality in ELT?
-- dbt tests
-- Schema tests
tests:
- unique
- not_null
- accepted_values
- relationships
-- Data tests
SELECT
order_id
FROM {{ ref('fct_orders') }}
WHERE amount < 0
-- Should return 0 rows
Q4: How do you handle schema evolution in ELT?
-- dbt handles schema evolution automatically
-- New columns are added to the target table
-- Use {{ config(materialized='incremental') }} for automatic schema evolution
β οΈ
Critical Consideration: ELT requires a powerful warehouse to handle transformations. If your warehouse is slow or expensive, ETL might be more cost-effective. Always benchmark both approaches for your specific workload.
Company-Specific Tips
dbt Interview Tips
- Discuss dbt and SQLMesh for transformations
- Explain incremental models and materializations
- Mention testing and documentation
- Talk about packages and macros
Snowflake Interview Tips
- Focus on ELT with Snowflake
- Discuss Snowflake features (time travel, cloning)
- Mention Snowflake optimization (clustering, search optimization)
- Talk about Snowflake cost management
Fivetran/Airbyte Interview Tips
- Discuss managed connectors for ingestion
- Explain CDC and incremental sync
- Mention schema evolution handling
- Talk about data freshness monitoring
βΉοΈ
Final Takeaway: ELT is the modern standard for data transformation. It provides flexibility, speed, and cost-effectiveness. Use dbt for SQL transformations, Snowflake/BigQuery for storage, and Fivetran/Airbyte for ingestion. Always consider your team's skills and budget when choosing between ETL and ELT.