πŸŽ‰ 75% of content is free forever β€” Unlock Premium from $10/mo β†’
CW
Search courses…
πŸ’Ό Servicesℹ️ Aboutβœ‰οΈ ContactView Pricing Plansfrom $10

Data Quality: Dataform, Dataplex & Validation Patterns

GCP Data EngineeringData Quality⭐ Premium

Advertisement

Data Quality on GCP

Master data quality on GCP including Dataform, Dataplex quality rules, validation patterns, and monitoring.

16 min readIntermediate

Data Quality Framework

πŸ—οΈ GCP Data Engineering Reference Architecture
DATA SOURCESπŸ—ƒοΈOn-Prem DB☁️SaaS APIsπŸ“‘IoT SensorsπŸ“±Mobile AppsπŸ”ŒREST APIsINGESTION LAYERDataflow (CDC)Pub/SubCloud TasksStorage TransferTransfer ApplianceRAW DATA ZONE (Cloud Storage)landing/Ingested databronze/Unvalidatedarchive/Historicalraw/Original formatstaging/Temp processingPROCESSING LAYERDataflowStream + BatchDataprocSpark/HadoopCloud FunctionsEvent-drivenData PrepVisual ETLCloud ComposerOrchestrateCURATED DATA ZONEsilver/Cleaned, validatedgold/Business-readyaggregates/Pre-computedfeatures/ML featuresBigQuery (Warehouse)Looker (BI)Vertex AI (ML)Data StudioDataplex
Interview Tip: GCP's data engineering stack is serverless-first. Dataflow (Apache Beam) handles both streaming and batch. BigQuery is the flagship analytics service.

Dataplex Data Quality

from google.cloud import dataplex_v1

client = dataplex_v1.DataplexServiceClient()

# Create data quality scan
scan = client.create_data_scan(
    request={
        "parent": "projects/my-project/locations/us-central1",
        "data_scan_id": "sales-quality-scan",
        "data_scan": {
            "display_name": "Sales Data Quality Scan",
            "data": {
                "resource": "projects/my-project/datasets/analytics/tables/sales"
            },
            "execution_spec": {
                "trigger": {
                    "schedule": {"cron": "0 6 * * *"}
                }
            },
            "data_quality_spec": {
                "rules": [
                    {
                        "dimension": "COMPLETENESS",
                        "column": "order_id",
                        "threshold": 1.0,
                        "non_null_expectation": {}
                    },
                    {
                        "dimension": "UNIQUENESS",
                        "column": "order_id",
                        "threshold": 1.0,
                        "uniqueness_expectation": {}
                    },
                    {
                        "dimension": "VALIDITY",
                        "column": "amount",
                        "threshold": 0.99,
                        "range_expectation": {
                            "min_value": "0",
                            "max_value": "1000000",
                            "strict_min_enabled": False,
                            "strict_max_enabled": False
                        }
                    },
                    {
                        "dimension": "TIMELINESS",
                        "column": "order_date",
                        "threshold": 0.99,
                        "freshness_expectation": {
                            "column": "order_date",
                            "max_expectation_interval": "1d"
                        }
                    }
                ]
            }
        }
    }
)

Dataform Quality

-- Dataform assertion for data quality
config {
  type: "assertion",
  database: "project",
  schema: "analytics",
  description: "Sales data quality assertions"
}

-- Check for null order IDs
SELECT *
FROM ${ref("sales")}
WHERE order_id IS NULL

-- Check for negative amounts
SELECT *
FROM ${ref("sales")}
WHERE amount < 0

-- Check for duplicate order IDs
SELECT
  order_id,
  COUNT(*) as duplicate_count
FROM ${ref("sales")}
GROUP BY 1
HAVING COUNT(*) > 1

✨

Best Practice: Implement data quality checks at each layer: Bronze (schema validation), Silver (completeness, accuracy), Gold (business rules). Use Dataplex for automated scanning and Dataform for transformation-time validation. Set up alerts for quality violations.

πŸ’¬

Common Interview Questions

Q1: What are the six dimensions of data quality?

Answer: 1) Completeness (all required data present), 2) Accuracy (correctly represents reality), 3) Consistency (matches across systems), 4) Timeliness (available when needed), 5) Validity (conforms to formats/rules), 6) Uniqueness (no duplicates).

Q2: How do you implement data quality checks in BigQuery?

Answer: Use SQL assertions in Dataform, CREATE ASSERTION statements, or validation queries. Check for nulls, duplicates, ranges, referential integrity, and business rules. Automate with Cloud Scheduler and alert on failures.

Q3: What is the difference between Dataplex and Dataform for quality?

Answer: Dataplex provides automated, scheduled data quality scans with pre-built rules. Dataform allows custom SQL-based assertions integrated into transformation pipelines. Use Dataplex for broad monitoring, Dataform for transformation-time validation.

Q4: How do you handle data quality failures?

Answer: 1) Log failures with details, 2) Alert data owners, 3) Quarantine bad records, 4) Implement retry logic for transient issues, 5) Track quality metrics over time, 6) Root cause analysis, 7) Update quality rules as needed.

Q5: What is the cost of poor data quality?

Answer: Poor data quality leads to: 1) Incorrect analytics and decisions, 2) Lost revenue from bad customer data, 3) Compliance violations, 4) Wasted compute resources processing bad data, 5) Loss of trust in data systems.

Advertisement