πŸŽ‰ 75% of content is free forever β€” Unlock Premium from $10/mo β†’
CW
Search courses…
πŸ’Ό Servicesℹ️ Aboutβœ‰οΈ ContactView Pricing Plansfrom $10

Dataform: SQLX, Compilation, Assertions & Scheduling

GCP Data EngineeringDataform⭐ Premium

Advertisement

Dataform Deep Dive

Master Dataform including SQLX transformations, compilation, assertions, scheduling, and BigQuery integration.

16 min readIntermediate

Dataform Architecture

πŸ—οΈ GCP Data Engineering Reference Architecture
DATA SOURCESπŸ—ƒοΈOn-Prem DB☁️SaaS APIsπŸ“‘IoT SensorsπŸ“±Mobile AppsπŸ”ŒREST APIsINGESTION LAYERDataflow (CDC)Pub/SubCloud TasksStorage TransferTransfer ApplianceRAW DATA ZONE (Cloud Storage)landing/Ingested databronze/Unvalidatedarchive/Historicalraw/Original formatstaging/Temp processingPROCESSING LAYERDataflowStream + BatchDataprocSpark/HadoopCloud FunctionsEvent-drivenData PrepVisual ETLCloud ComposerOrchestrateCURATED DATA ZONEsilver/Cleaned, validatedgold/Business-readyaggregates/Pre-computedfeatures/ML featuresBigQuery (Warehouse)Looker (BI)Vertex AI (ML)Data StudioDataplex
Interview Tip: GCP's data engineering stack is serverless-first. Dataflow (Apache Beam) handles both streaming and batch. BigQuery is the flagship analytics service.

SQLX Transformations

-- models/staging/stg_orders.sqlx
config {
  type: "view",
  schema: "staging",
  description: "Staged orders data"
}

WITH source AS (
  SELECT * FROM {{ ref("raw_orders") }}
),

transformed AS (
  SELECT
    order_id,
    customer_id,
    SAFE_CAST(amount AS FLOAT64) as amount,
    order_date,
    status,
    CURRENT_TIMESTAMP() as loaded_at
  FROM source
  WHERE order_id IS NOT NULL
)

SELECT * FROM transformed
-- models/marts/daily_sales.sqlx
config {
  type: "table",
  schema: "marts",
  description: "Daily sales summary",
  tags: ["daily", "sales"]
}

WITH orders AS (
  SELECT * FROM {{ ref("stg_orders") }}
),

daily_summary AS (
  SELECT
    DATE(order_date) as order_date,
    COUNT(*) as order_count,
    SUM(amount) as total_revenue,
    AVG(amount) as avg_order_value,
    COUNT(DISTINCT customer_id) as unique_customers
  FROM orders
  WHERE status = 'completed'
  GROUP BY 1
)

SELECT * FROM daily_summary

Assertions

-- assertions/order_quality.sqlx
config {
  type: "assertion",
  dataset: "marts",
  description: "Validate daily sales data quality"
}

SELECT *
FROM {{ ref("daily_sales") }}
WHERE total_revenue < 0
   OR order_count = 0
   OR unique_customers < 0

Scheduling

# Schedule Dataform workflow via API
from google.cloud import dataform_v1beta1

client = dataform_v1beta1.DataformClient()

# Create workflow invocation
invocation = client.create_workflow_invocation(
    request={
        "parent": "projects/my-project/locations/us-central1/repositories/my-repo/workspaces/main",
        "workflow_invocation": {
            "workflow_config": {
                "name": "daily_etl"
            }
        }
    }
)

✨

Best Practice: Use SQLX for modular, reusable transformations. Implement assertions for data quality checks. Use ref() for dependency management. Schedule workflows via Cloud Composer or Cloud Scheduler. Version control all SQLX files.

πŸ’¬

Common Interview Questions

Q1: What is SQLX in Dataform?

Answer: SQLX is Dataform's templating language for SQL transformations. It supports Jinja templating, ref() for dependencies, config blocks for metadata, and assertions for data quality. SQLX compiles to standard SQL for BigQuery execution.

Q2: How does Dataform handle dependencies?

Answer: Dataform automatically resolves dependencies using the ref() function. When you reference {{ ref("model_name") }, Dataform compiles this to the correct table/view reference. Dependencies are visualized in the DAG.

Q3: What are assertions in Dataform?

Answer: Assertions are data quality checks that validate your transformations. They return rows that violate the assertion. If any rows are returned, the assertion fails. Use assertions to check completeness, validity, and business rules.

Q4: How do you schedule Dataform workflows?

Answer: Use Cloud Composer (Airflow) with the Dataform operator, Cloud Scheduler with the Dataform API, or Dataform's built-in scheduling. Cloud Composer is recommended for complex orchestration with dependencies.

Q5: What is the benefit of Dataform over raw SQL?

Answer: 1) Version control via Git, 2) Dependency management with ref(), 3) Built-in data quality assertions, 4) Compilation and validation, 5) Documentation generation, 6) Integration with BigQuery.

Advertisement