ADF Mapping Data Flows: Transformations & Debugging
Master ADF Mapping Data Flows with transformations, debugging, and performance optimization
Data Flow Transformations
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β DATA FLOW TRANSFORMATION TYPES β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β SOURCE TRANSFORMATIONS ROW TRANSFORMATIONS β
β βββββββββββββββββββββββ βββββββββββββββββββββββ β
β β Source β β Filter β β
β β Delta Source β β Derived Column β β
β β Blob/ADLS Source β β Conditional Split β β
β β SQL Source β β Surrogate Key β β
β β Cosmos DB Source β β Exists β β
β βββββββββββββββββββββββ β Does Not Exist β β
β βββββββββββββββββββββββ β
β ROW TRANSFORMATIONS (cont) AGGREGATE TRANSFORMATIONS β
β βββββββββββββββββββββββ βββββββββββββββββββββββ β
β β Pivot β β Aggregate β β
β β Unpivot β β Window β β
β β Sort β βββββββββββββββββββββββ β
β β Rank β β
β β Window β SINK TRANSFORMATIONS β
β βββββββββββββββββββββββ βββββββββββββββββββββββ β
β β Sink β β
β STRUCTURAL TRANSFORMATIONS β Delta Sink β β
β βββββββββββββββββββββββ β Blob/ADLS Sink β β
β β Union β β SQL Sink β β
β β Join β β Cosmos DB Sink β β
β β Lookup β βββββββββββββββββββββββ β
β β Split β β
β βββββββββββββββββββββββ β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Complex Data Flow Example
{
"name": "df_sales_transformation",
"properties": {
"type": "MappingDataFlow",
"typeProperties": {
"sources": [
{
"name": "source_sales",
"dataset": { "referenceName": "ds_raw_sales" }
},
{
"name": "source_customers",
"dataset": { "referenceName": "ds_customers" }
}
],
"sinks": [
{
"name": "sink_fact_sales",
"dataset": { "referenceName": "ds_curated_fact_sales" }
},
{
"name": "sink_invalid",
"dataset": { "referenceName": "ds_quarantine" }
}
],
"transformations": [
{
"name": "filter_valid",
"type": "Filter",
"typeProperties": {
"filterExpression": "sale_id IS NOT NULL AND amount > 0"
}
},
{
"name": "join_customer",
"type": "Join",
"typeProperties": {
"joinType": "LeftOuter",
"joinCondition": {
"left": "customer_id",
"right": "customer_id"
}
}
},
{
"name": "derive_columns",
"type": "DerivedColumn",
"typeProperties": {
"columns": [
{ "name": "total_amount", "expression": "quantity * unit_price" },
{ "name": "sale_year", "expression": "year(sale_date)" },
{ "name": "sale_month", "expression": "month(sale_date)" }
]
}
}
]
}
}
}
Debugging Data Flows
# Monitor Data Flow execution
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
credential = DefaultAzureCredential()
adf_client = DataFactoryManagementClient(credential, subscription_id)
# Get Data Flow run details
run = adf_client.data_flow_runs.get(
resource_group_name="rg-dataengineering-prod",
factory_name="adf-prod",
data_flow_run_id="run-id"
)
print(f"Status: {run.status}")
print(f"Duration: {run.run_duration}")
print(f"Cluster size: {run.cluster_size}")
# Get row counts per transformation
for transformation in run.transformations:
print(f"Transformation: {transformation.name}")
print(f" Input rows: {transformation.input_rows}")
print(f" Output rows: {transformation.output_rows}")
Performance Tuning
{
"compute": {
"coreCount": 8,
"computeType": "GeneralPurpose",
"orchestration": "Pipeline",
"parallel": 4
},
"sparkConfig": {
"spark.dynamicAllocation.enabled": "true",
"spark.dynamicAllocation.minExecutors": "2",
"spark.dynamicAllocation.maxExecutors": "8"
}
}
βΉοΈ
Pro Tip: Use Debug mode in ADF Studio to test Data Flows with sample data. Inspect row-level output at each transformation to identify data quality issues before production deployment.
Interview Questions
Q1: When would you use Data Flow Lookup vs Join transformation? A: Lookup for one-time reference data retrieval (cached in memory). Join for merging two streaming datasets. Lookup is more efficient for small reference datasets; Join for large dataset merges.
Q2: How do you optimize Data Flow performance? A: 1) Right-size cluster (coreCount, computeType), 2) Enable auto-scaling, 3) Optimize partitioning, 4) Use appropriate join strategies, 5) Enable schema drift only when needed, 6) Monitor cluster utilization.
Q3: What is the difference between Data Flow and Synapse Mapping Data Flow? A: They are the same technology. Synapse Mapping Data Flows are ADF Data Flows running on Synapse Spark pools, providing the same capabilities with Synapse workspace integration.