πŸŽ‰ 75% of content is free forever β€” Unlock Premium from $10/mo β†’
CW
Search courses…
πŸ’Ό Servicesℹ️ Aboutβœ‰οΈ ContactView Pricing Plansfrom $10

ADF Mapping Data Flows: Transformations & Debugging

Azure Data EngineeringADF Data Flows⭐ Premium

Advertisement

ADF Mapping Data Flows: Transformations & Debugging

Master ADF Mapping Data Flows with transformations, debugging, and performance optimization

Data Flow Transformations

Architecture Diagram
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    DATA FLOW TRANSFORMATION TYPES                    β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                                                     β”‚
β”‚  SOURCE TRANSFORMATIONS      ROW TRANSFORMATIONS                    β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”               β”‚
β”‚  β”‚ Source              β”‚    β”‚ Filter              β”‚               β”‚
β”‚  β”‚ Delta Source        β”‚    β”‚ Derived Column      β”‚               β”‚
β”‚  β”‚ Blob/ADLS Source    β”‚    β”‚ Conditional Split   β”‚               β”‚
β”‚  β”‚ SQL Source          β”‚    β”‚ Surrogate Key       β”‚               β”‚
β”‚  β”‚ Cosmos DB Source    β”‚    β”‚ Exists              β”‚               β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚ Does Not Exist      β”‚               β”‚
β”‚                              β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜               β”‚
β”‚  ROW TRANSFORMATIONS (cont)  AGGREGATE TRANSFORMATIONS             β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”               β”‚
β”‚  β”‚ Pivot               β”‚    β”‚ Aggregate           β”‚               β”‚
β”‚  β”‚ Unpivot             β”‚    β”‚ Window              β”‚               β”‚
β”‚  β”‚ Sort                β”‚    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜               β”‚
β”‚  β”‚ Rank                β”‚                                          β”‚
β”‚  β”‚ Window              β”‚    SINK TRANSFORMATIONS                   β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”               β”‚
β”‚                              β”‚ Sink                β”‚               β”‚
β”‚  STRUCTURAL TRANSFORMATIONS  β”‚ Delta Sink          β”‚               β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”‚ Blob/ADLS Sink      β”‚               β”‚
β”‚  β”‚ Union               β”‚    β”‚ SQL Sink            β”‚               β”‚
β”‚  β”‚ Join                β”‚    β”‚ Cosmos DB Sink      β”‚               β”‚
β”‚  β”‚ Lookup              β”‚    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜               β”‚
β”‚  β”‚ Split               β”‚                                          β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                                          β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Complex Data Flow Example

{
  "name": "df_sales_transformation",
  "properties": {
    "type": "MappingDataFlow",
    "typeProperties": {
      "sources": [
        {
          "name": "source_sales",
          "dataset": { "referenceName": "ds_raw_sales" }
        },
        {
          "name": "source_customers",
          "dataset": { "referenceName": "ds_customers" }
        }
      ],
      "sinks": [
        {
          "name": "sink_fact_sales",
          "dataset": { "referenceName": "ds_curated_fact_sales" }
        },
        {
          "name": "sink_invalid",
          "dataset": { "referenceName": "ds_quarantine" }
        }
      ],
      "transformations": [
        {
          "name": "filter_valid",
          "type": "Filter",
          "typeProperties": {
            "filterExpression": "sale_id IS NOT NULL AND amount > 0"
          }
        },
        {
          "name": "join_customer",
          "type": "Join",
          "typeProperties": {
            "joinType": "LeftOuter",
            "joinCondition": {
              "left": "customer_id",
              "right": "customer_id"
            }
          }
        },
        {
          "name": "derive_columns",
          "type": "DerivedColumn",
          "typeProperties": {
            "columns": [
              { "name": "total_amount", "expression": "quantity * unit_price" },
              { "name": "sale_year", "expression": "year(sale_date)" },
              { "name": "sale_month", "expression": "month(sale_date)" }
            ]
          }
        }
      ]
    }
  }
}

Debugging Data Flows

# Monitor Data Flow execution
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

credential = DefaultAzureCredential()
adf_client = DataFactoryManagementClient(credential, subscription_id)

# Get Data Flow run details
run = adf_client.data_flow_runs.get(
    resource_group_name="rg-dataengineering-prod",
    factory_name="adf-prod",
    data_flow_run_id="run-id"
)

print(f"Status: {run.status}")
print(f"Duration: {run.run_duration}")
print(f"Cluster size: {run.cluster_size}")

# Get row counts per transformation
for transformation in run.transformations:
    print(f"Transformation: {transformation.name}")
    print(f"  Input rows: {transformation.input_rows}")
    print(f"  Output rows: {transformation.output_rows}")

Performance Tuning

{
  "compute": {
    "coreCount": 8,
    "computeType": "GeneralPurpose",
    "orchestration": "Pipeline",
    "parallel": 4
  },
  "sparkConfig": {
    "spark.dynamicAllocation.enabled": "true",
    "spark.dynamicAllocation.minExecutors": "2",
    "spark.dynamicAllocation.maxExecutors": "8"
  }
}

ℹ️

Pro Tip: Use Debug mode in ADF Studio to test Data Flows with sample data. Inspect row-level output at each transformation to identify data quality issues before production deployment.

Interview Questions

Q1: When would you use Data Flow Lookup vs Join transformation? A: Lookup for one-time reference data retrieval (cached in memory). Join for merging two streaming datasets. Lookup is more efficient for small reference datasets; Join for large dataset merges.

Q2: How do you optimize Data Flow performance? A: 1) Right-size cluster (coreCount, computeType), 2) Enable auto-scaling, 3) Optimize partitioning, 4) Use appropriate join strategies, 5) Enable schema drift only when needed, 6) Monitor cluster utilization.

Q3: What is the difference between Data Flow and Synapse Mapping Data Flow? A: They are the same technology. Synapse Mapping Data Flows are ADF Data Flows running on Synapse Spark pools, providing the same capabilities with Synapse workspace integration.

Advertisement