ADF Mapping Data Flows: Transformations & Debugging

Master ADF Mapping Data Flows with transformations, debugging, and performance optimization

Data Flow Transformations

Architecture Diagram

┌─────────────────────────────────────────────────────────────────────┐
│                    DATA FLOW TRANSFORMATION TYPES                    │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  SOURCE TRANSFORMATIONS      ROW TRANSFORMATIONS                    │
│  ┌─────────────────────┐    ┌─────────────────────┐               │
│  │ Source              │    │ Filter              │               │
│  │ Delta Source        │    │ Derived Column      │               │
│  │ Blob/ADLS Source    │    │ Conditional Split   │               │
│  │ SQL Source          │    │ Surrogate Key       │               │
│  │ Cosmos DB Source    │    │ Exists              │               │
│  └─────────────────────┘    │ Does Not Exist      │               │
│                              └─────────────────────┘               │
│  ROW TRANSFORMATIONS (cont)  AGGREGATE TRANSFORMATIONS             │
│  ┌─────────────────────┐    ┌─────────────────────┐               │
│  │ Pivot               │    │ Aggregate           │               │
│  │ Unpivot             │    │ Window              │               │
│  │ Sort                │    └─────────────────────┘               │
│  │ Rank                │                                          │
│  │ Window              │    SINK TRANSFORMATIONS                   │
│  └─────────────────────┘    ┌─────────────────────┐               │
│                              │ Sink                │               │
│  STRUCTURAL TRANSFORMATIONS  │ Delta Sink          │               │
│  ┌─────────────────────┐    │ Blob/ADLS Sink      │               │
│  │ Union               │    │ SQL Sink            │               │
│  │ Join                │    │ Cosmos DB Sink      │               │
│  │ Lookup              │    └─────────────────────┘               │
│  │ Split               │                                          │
│  └─────────────────────┘                                          │
└─────────────────────────────────────────────────────────────────────┘

Complex Data Flow Example

{
  "name": "df_sales_transformation",
  "properties": {
    "type": "MappingDataFlow",
    "typeProperties": {
      "sources": [
        {
          "name": "source_sales",
          "dataset": { "referenceName": "ds_raw_sales" }
        },
        {
          "name": "source_customers",
          "dataset": { "referenceName": "ds_customers" }
        }
      ],
      "sinks": [
        {
          "name": "sink_fact_sales",
          "dataset": { "referenceName": "ds_curated_fact_sales" }
        },
        {
          "name": "sink_invalid",
          "dataset": { "referenceName": "ds_quarantine" }
        }
      ],
      "transformations": [
        {
          "name": "filter_valid",
          "type": "Filter",
          "typeProperties": {
            "filterExpression": "sale_id IS NOT NULL AND amount > 0"
          }
        },
        {
          "name": "join_customer",
          "type": "Join",
          "typeProperties": {
            "joinType": "LeftOuter",
            "joinCondition": {
              "left": "customer_id",
              "right": "customer_id"
            }
          }
        },
        {
          "name": "derive_columns",
          "type": "DerivedColumn",
          "typeProperties": {
            "columns": [
              { "name": "total_amount", "expression": "quantity * unit_price" },
              { "name": "sale_year", "expression": "year(sale_date)" },
              { "name": "sale_month", "expression": "month(sale_date)" }
            ]
          }
        }
      ]
    }
  }
}

Debugging Data Flows

# Monitor Data Flow execution
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

credential = DefaultAzureCredential()
adf_client = DataFactoryManagementClient(credential, subscription_id)

# Get Data Flow run details
run = adf_client.data_flow_runs.get(
    resource_group_name="rg-dataengineering-prod",
    factory_name="adf-prod",
    data_flow_run_id="run-id"
)

print(f"Status: {run.status}")
print(f"Duration: {run.run_duration}")
print(f"Cluster size: {run.cluster_size}")

# Get row counts per transformation
for transformation in run.transformations:
    print(f"Transformation: {transformation.name}")
    print(f"  Input rows: {transformation.input_rows}")
    print(f"  Output rows: {transformation.output_rows}")

Performance Tuning

{
  "compute": {
    "coreCount": 8,
    "computeType": "GeneralPurpose",
    "orchestration": "Pipeline",
    "parallel": 4
  },
  "sparkConfig": {
    "spark.dynamicAllocation.enabled": "true",
    "spark.dynamicAllocation.minExecutors": "2",
    "spark.dynamicAllocation.maxExecutors": "8"
  }
}

ℹ️

Pro Tip: Use Debug mode in ADF Studio to test Data Flows with sample data. Inspect row-level output at each transformation to identify data quality issues before production deployment.

Interview Questions

Q1: When would you use Data Flow Lookup vs Join transformation? A: Lookup for one-time reference data retrieval (cached in memory). Join for merging two streaming datasets. Lookup is more efficient for small reference datasets; Join for large dataset merges.

Q2: How do you optimize Data Flow performance? A: 1) Right-size cluster (coreCount, computeType), 2) Enable auto-scaling, 3) Optimize partitioning, 4) Use appropriate join strategies, 5) Enable schema drift only when needed, 6) Monitor cluster utilization.

Q3: What is the difference between Data Flow and Synapse Mapping Data Flow? A: They are the same technology. Synapse Mapping Data Flows are ADF Data Flows running on Synapse Spark pools, providing the same capabilities with Synapse workspace integration.