Cloud Data Prep for Data Engineering

Master Google Cloud Data Prep for visual data preparation including wrangling, profiling, transformation recipes, and integration patterns.

12 min read · Beginner

Cloud Data Prep Overview

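Cloud Data Prep (officially Cloud Dataprep by Trifacta) is a visual data preparation service that Trifacta, now part of Alteryx, built and operates on Google Cloud. It is not an Apache project. It allows data engineers and analysts to explore, clean, and transform data without writing code.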

Key Features

📊 BigQuery Architecture for Data Engineering

Interview Tip: BigQuery separates storage and compute. On-demand queries are billed by bytes scanned, while capacity-based pricing bills for slots (compute). Partition and cluster large tables so that queries scan less data.

Data Preparation Workflow

Transformation Recipes

Common Transformations

// Data Prep recipe steps, written in a shorthand modeled on Wrangle,
// the Trifacta recipe language. Exact syntax varies by product version.

// 1. Rename columns
rename col: `old_name` to: `new_name`

// 2. Change data types
settype col: `amount` as: float

// 3. Filter rows
filter rowtype: `status` == "active"

// 4. Extract date parts
derive col: `year` as: year(`order_date`)
derive col: `month` as: month(`order_date`)

// 5. Conditional transformation
derive col: `category` as:
  if(`amount` > 1000, "premium",
    if(`amount` > 100, "standard", "basic"))

// 6. String operations
derive col: `domain` as: extract(`email`, "@(.+)$")

// 7. Aggregation
aggregate value: sum(`amount`) group: `product_category`

// 8. Join datasets
join col: `user_id` with: `users` on: `user_id`
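
For readers who think in code, the date-part, conditional, and aggregation steps above (4, 5, and 7) correspond roughly to the plain-Python sketch below. The sample rows and the function names `categorize`, `enrich`, and `total_by_category` are illustrative assumptions; none of them are part of a Data Prep API.

```python
from collections import defaultdict
from datetime import date

# Hypothetical sample rows standing in for a Data Prep dataset.
orders = [
    {"order_date": date(2024, 1, 15), "amount": 1500.0, "product_category": "laptops"},
    {"order_date": date(2024, 2, 3), "amount": 250.0, "product_category": "phones"},
    {"order_date": date(2024, 2, 20), "amount": 40.0, "product_category": "phones"},
]

def categorize(amount):
    """Mirror the nested if() from step 5."""
    if amount > 1000:
        return "premium"
    if amount > 100:
        return "standard"
    return "basic"

def enrich(row):
    """Steps 4 and 5: return a copy of the row with year, month, and category added."""
    return {
        **row,
        "year": row["order_date"].year,
        "month": row["order_date"].month,
        "category": categorize(row["amount"]),
    }

def total_by_category(rows):
    """Step 7: sum amount grouped by product_category."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["product_category"]] += row["amount"]
    return dict(totals)

enriched = [enrich(row) for row in orders]
totals = total_by_category(enriched)
```

Note that the thresholds are strict: an amount of exactly 1000 falls into "standard", and exactly 100 into "basic", matching the `>` comparisons in the recipe.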

Data Quality Rules

// Validation rules
validate col: `email` regex: `^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$`

validate col: `phone` regex: `^\+?[1-9]\d{1,14}$`

validate col: `amount` range: 0 to 1000000

validate col: `date` format: "yyyy-MM-dd"

// Custom validation
validate col: `status` in: ["active", "inactive", "pending"]

// Row-level validation
validate row: `amount` > 0 and `quantity` > 0

Integration with Data Engineering

Export to BigQuery

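# Data Prep API for programmatic access
import requests

# Job-run endpoint of the Dataprep v4 API. Confirm the URL and the payload
# fields below against the current API reference before relying on them.
DATAPREP_JOBS_URL = "https://api.clouddataprep.com/v4/jobGroups"

def create_dataprep_job(api_key, recipe_id, output_table,
                        project_id="my-project", region="us-central1"):
    """Start a Data Prep job for a recipe and return the parsed JSON response."""
    headers = {
        # The access token is sent as a bearer token
        'Authorization': f'Bearer {api_key}',
        'Content-Type': 'application/json'
    }

    job_config = {
        "wrangledDataset": {
            "id": recipe_id
        },
        "execution": {
            "runsOn": {
                # Recipes execute as Dataflow jobs
                "type": "dataflow",
                "projectId": project_id,
                "region": region
            }
        },
        "outputs": [{
            "type": "bigquery",
            "table": output_table
        }]
    }

    response = requests.post(
        DATAPREP_JOBS_URL,
        headers=headers,
        json=job_config,
        timeout=30  # avoid hanging indefinitely on a stalled connection
    )
    # Raise on 4xx/5xx so an error body is never mistaken for a job
    response.raise_for_status()

    return response.json()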

⚠️ Cost Alert

Always monitor your BigQuery costs using INFORMATION_SCHEMA. Set up budget alerts at 50%, 80%, and 100% thresholds.
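
In practice these alerts are configured as a budget in Cloud Billing rather than computed by hand, but the threshold logic itself is simple arithmetic. A minimal sketch, with the hypothetical helper name `crossed_thresholds`:

```python
def crossed_thresholds(spend, budget, thresholds=(0.5, 0.8, 1.0)):
    """Return the thresholds (fractions of the budget) that spend has reached."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    ratio = spend / budget
    return [t for t in thresholds if ratio >= t]
```

For example, with a monthly budget of 1000 and current spend of 850, the 50% and 80% alerts have fired and the 100% alert has not.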

Cost Optimization

# Data Prep pricing components
pricing = {
    "dataprep_units": {
        "description": "DPUs consumed per transformation",
        "cost_per_unit": "$0.05 per DPU-hour"
    },
    "dataflow_workers": {
        "description": "Compute for running recipes",
        "cost": "Standard Dataflow pricing"
    },
    "storage": {
        "description": "GCS for intermediate data",
        "cost": "Standard GCS pricing"
    }
}

# Cost optimization strategies
strategies = {
    "profile_before_transform": "Understand data quality before processing",
    "use_sampling": "Profile on samples, run full on production",
    "incremental_processing": "Process only new/changed data",
    "right_size_workers": "Don't over-provision Dataflow workers",
    "schedule_off_peak": "Run jobs during off-peak hours for lower costs"
}
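
The incremental_processing strategy is usually implemented with a watermark: remember the latest change timestamp already processed, and on the next run pick up only rows newer than it. The sketch below assumes each row carries an `updated_at` timestamp; that field name and the function `select_new_rows` are illustrative, not Data Prep features.

```python
from datetime import datetime

def select_new_rows(rows, last_watermark, timestamp_field="updated_at"):
    """Return (rows changed after last_watermark, watermark to store for the next run)."""
    new_rows = [row for row in rows if row[timestamp_field] > last_watermark]
    # If nothing changed, keep the old watermark so no rows are skipped later.
    new_watermark = max(
        (row[timestamp_field] for row in new_rows),
        default=last_watermark,
    )
    return new_rows, new_watermark
```

Persisting the returned watermark only after the downstream job succeeds keeps a failed run from silently dropping rows.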

ℹ️ Cost Tip: Data Prep charges per DPU-hour (Data Preparation Unit). Profile your data first to understand complexity, then optimize transformations. Use sampling for initial exploration and schedule production jobs during off-peak hours.
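
To illustrate why sampling is cheap, the sketch below estimates a column's missing-value rate from a random sample instead of scanning every row. It is a plain-Python approximation of the idea, not how Data Prep samples internally. The function name `missing_rate_on_sample` and the rule that `None` and empty strings count as missing are assumptions.

```python
import random

def missing_rate_on_sample(rows, column, sample_size, seed=0):
    """Estimate the fraction of missing values in `column` from a random sample."""
    if not rows:
        raise ValueError("rows must not be empty")
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    rng = random.Random(seed)  # fixed seed so repeated profiling runs agree
    sample = rng.sample(rows, min(sample_size, len(rows)))
    missing = sum(1 for row in sample if row.get(column) in (None, ""))
    return missing / len(sample)
```

A sample-based estimate is good enough to decide whether a cleaning step is needed; the full dataset only has to be processed once the recipe is settled.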

💬 Common Interview Questions

Q1: When would you use Cloud Data Prep vs. writing code?

Answer: Cloud Data Prep is ideal for ad-hoc data exploration, business analyst self-service, and quick prototyping. Use code (Python/SQL) for production pipelines requiring version control, testing, and complex logic. Data Prep excels at visual data profiling and interactive transformations.

Q2: How does Data Prep integrate with BigQuery?

Answer: Data Prep can read directly from BigQuery tables and write transformed data back. Use the BigQuery connection for seamless integration. For large datasets, export to GCS first, then load into BigQuery for optimal performance.

Q3: What is the DPU model in Data Prep?

Answer: DPU (Data Preparation Unit) measures computational effort per transformation. Complex operations like joins consume more DPUs than simple filters. Understanding DPU consumption helps optimize costs and plan resource allocation.

Q4: How do you version control Data Prep recipes?

Answer: Data Prep maintains built-in version history for all recipe changes. For external version control, export recipes as JSON and store in Git. Use the Data Prep API to automate recipe deployment across environments.
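
One practical detail when storing exported recipes in Git is to serialize them deterministically, so that a diff shows real changes rather than key reordering. A minimal sketch, assuming the exported recipe has already been loaded as a Python dict; the function `write_recipe_snapshot` and the file layout are hypothetical.

```python
import json
from pathlib import Path

def write_recipe_snapshot(recipe, directory, name):
    """Write the recipe as stable, diff-friendly JSON and return the file path."""
    path = Path(directory) / f"{name}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    # sort_keys and a fixed indent make the output independent of dict ordering
    text = json.dumps(recipe, indent=2, sort_keys=True, ensure_ascii=False) + "\n"
    path.write_text(text, encoding="utf-8")
    return path
```

Committing the resulting file after each recipe change gives reviewable history alongside the tool's built-in versions.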

Q5: What data formats does Data Prep support?

Answer: Data Prep supports CSV, JSON, Parquet, Avro, Excel, and fixed-width formats. For best performance with large datasets, prefer schema-carrying binary formats: Parquet (columnar) or Avro (row-based). For web applications, JSON is commonly used.
