Azure AD, IAM & Managed Identities
Mastering identity management and access control for secure data engineering pipelines
Identity Architecture for Data Engineering
+----------------------------------------------------------------------+
|                     IDENTITY & ACCESS MANAGEMENT                     |
+----------------------------------------------------------------------+
|                                                                      |
|  IDENTITY PROVIDERS      AUTHENTICATION          AUTHORIZATION       |
|  +----------------+      +----------------+      +----------------+  |
|  | Azure AD       |------| OAuth 2.0      |----->| RBAC           |  |
|  | (Entra ID)     |      | OpenID Connect |      | Roles          |  |
|  +----------------+      +----------------+      +----------------+  |
|          |                       |                       |           |
|          v                       v                       v           |
|  +----------------+      +----------------+      +----------------+  |
|  | Managed        |      | Token          |      | Data Factory   |  |
|  | Identities     |      | Exchange       |      | RBAC           |  |
|  | - System-      |      | Service        |      +----------------+  |
|  |   Assigned     |      +----------------+                          |
|  | - User-        |      +----------------+      +----------------+  |
|  |   Assigned     |      | Conditional    |      | Synapse        |  |
|  +----------------+      | Access         |      | RBAC           |  |
|                          | Policies       |      +----------------+  |
|                          +----------------+                          |
|  +----------------+      +----------------+      +----------------+  |
|  | Service        |------| Key Vault      |      | Storage        |  |
|  | Principals     |      | Access         |      | RBAC           |  |
|  +----------------+      | Policies       |      +----------------+  |
|                          +----------------+                          |
+----------------------------------------------------------------------+
Managed Identities Deep Dive
System-Assigned vs User-Assigned
| Feature | System-Assigned | User-Assigned |
|---|---|---|
| Lifecycle | Tied to resource | Independent |
| Sharing | Single resource | Multiple resources |
| Cleanup | Auto-deleted | Manual cleanup |
| Use Case | Single-service auth | Multi-service scenarios |
| Maximum | 1 per resource | Multiple per resource (subject to Azure limits) |
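The table's distinction shows up in code as a single argument. The following is a minimal sketch, assuming the `azure-identity` package is installed; `managed_identity_kwargs` and `build_credential` are hypothetical helper names, not part of the SDK. Calling `ManagedIdentityCredential` with no arguments uses the resource's system-assigned identity, while passing `client_id` selects one particular user-assigned identity.

```python
from typing import Optional


def managed_identity_kwargs(user_assigned_client_id: Optional[str] = None) -> dict:
    """Build keyword arguments for ManagedIdentityCredential.

    No client ID -> the resource's system-assigned identity.
    A client ID  -> that specific user-assigned identity.
    """
    if user_assigned_client_id:
        return {"client_id": user_assigned_client_id}
    return {}


def build_credential(user_assigned_client_id: Optional[str] = None):
    # Imported lazily so the helper above stays usable without the SDK installed.
    from azure.identity import ManagedIdentityCredential

    return ManagedIdentityCredential(**managed_identity_kwargs(user_assigned_client_id))
```

A resource with several user-assigned identities attached generally needs the explicit `client_id`, since the SDK cannot otherwise tell which one is intended.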
Managed Identity Configuration for Data Engineering
# Python: Using Managed Identity with Azure SDKs
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient
from azure.synapse.artifacts import ArtifactsClient

# DefaultAzureCredential tries several auth methods in order (environment
# variables, managed identity, Azure CLI login, ...) and uses the first that works
credential = DefaultAzureCredential()

# ADLS Gen2 access with Managed Identity
datalake_client = DataLakeServiceClient(
    account_url="https://stdatalake001.dfs.core.windows.net",
    credential=credential,
)

# List files in the "raw" file system under 2024/01/
file_system_client = datalake_client.get_file_system_client("raw")
for path in file_system_client.list_paths(path="2024/01/"):
    # The size in bytes is exposed as content_length on each listed path
    print(f"Path: {path.name}, Size: {path.content_length}")

# Synapse Artifacts access
artifacts_client = ArtifactsClient(
    credential=credential,
    endpoint="https://syn-workspace.dev.azuresynapse.net",
)

# Fetch a single pipeline definition by name
pipeline = artifacts_client.pipeline.get_pipeline("etl_pipeline")
Service Principal Configuration
{
  "appId": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
  "displayName": "sp-dataengineering-prod",
  "password": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
  "tenant": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
}
# Service Principal Authentication
from azure.identity import ClientSecretCredential
from azure.storage.filedatalake import DataLakeServiceClient

credential = ClientSecretCredential(
    tenant_id="your-tenant-id",
    # client_id is the application (client) ID GUID ("appId" above),
    # not the display name "sp-dataengineering-prod"
    client_id="your-app-id",
    client_secret="your-client-secret",
)

# Use with Azure Storage
client = DataLakeServiceClient(
    account_url="https://stdatalake001.dfs.core.windows.net",
    credential=credential,
)
RBAC Roles for Data Engineering
Built-in Roles
+---------------------------------------------------------------+
|                      RBAC ROLE HIERARCHY                      |
+---------------------------------------------------------------+
|                                                               |
|  MANAGEMENT GROUP                                             |
|  +-- Subscription                                             |
|      +-- Resource Group                                       |
|          +-- Resource                                         |
|                                                               |
|  SCOPE LEVELS (top to bottom):                                |
|  +---------------------------------------------------------+  |
|  | MG > Sub > RG > Resource                                |  |
|  |                                                         |  |
|  | Roles assigned at a higher scope are inherited downward |  |
|  | Roles are additive; a narrower scope cannot revoke them |  |
|  +---------------------------------------------------------+  |
|                                                               |
|  DATA ENGINEERING ROLES:                                      |
|  +---------------------------------------------------------+  |
|  | Storage Blob Data Contributor - ADLS read/write         |  |
|  | Storage Blob Data Reader      - ADLS read-only          |  |
|  | Synapse Administrator         - Full Synapse access     |  |
|  | Synapse SQL Administrator     - SQL pool management     |  |
|  | Key Vault Secrets User        - Read secrets            |  |
|  | Data Factory Contributor      - ADF management          |  |
|  | Contributor                   - General management      |  |
|  | Reader                        - View-only access        |  |
|  +---------------------------------------------------------+  |
+---------------------------------------------------------------+
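Scope inheritance can be illustrated with plain string handling. The sketch below is a simplified model, not the Azure authorization engine: `effective_roles` is a hypothetical helper, the IDs are placeholders, and it covers only subscription, resource group, and resource scopes (management group scopes use a different path format and are left out). It treats Azure RBAC as additive: the roles in effect at a scope are the union of the roles assigned at that scope and at every ancestor scope.

```python
from typing import Iterable, Set, Tuple


def effective_roles(assignments: Iterable[Tuple[str, str]], target_scope: str) -> Set[str]:
    """Return the union of roles assigned at target_scope or any ancestor scope.

    assignments is an iterable of (scope, role_name) pairs.
    """
    target = target_scope.rstrip("/").lower()
    roles = set()
    for scope, role in assignments:
        candidate = scope.rstrip("/").lower()
        # An ancestor scope is a path prefix that ends on a "/" boundary.
        if target == candidate or target.startswith(candidate + "/"):
            roles.add(role)
    return roles


# Placeholder scopes for illustration only.
sub = "/subscriptions/00000000-0000-0000-0000-000000000000"
rg = sub + "/resourceGroups/rg-datalake-prod"
storage = rg + "/providers/Microsoft.Storage/storageAccounts/stdatalake001"

example_assignments = [
    (sub, "Reader"),
    (rg, "Data Factory Contributor"),
    (storage, "Storage Blob Data Contributor"),
]

print(sorted(effective_roles(example_assignments, storage)))
```

At the storage account scope all three roles apply; at the resource group scope only the first two do, because the storage-level assignment does not flow upward.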
Custom Role for Data Engineers
{
  "Name": "Data Engineer Custom Role",
  "Description": "Custom role for data engineering operations",
  "AssignableScopes": [
    "/subscriptions/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
  ],
  "Actions": [
    "Microsoft.Storage/storageAccounts/read",
    "Microsoft.Storage/storageAccounts/write",
    "Microsoft.Storage/storageAccounts/blobServices/containers/read",
    "Microsoft.Storage/storageAccounts/blobServices/containers/write",
    "Microsoft.Synapse/workspaces/read",
    "Microsoft.Synapse/workspaces/sqlPools/read",
    "Microsoft.Synapse/workspaces/sqlPools/write",
    "Microsoft.Synapse/workspaces/notebooks/read",
    "Microsoft.Synapse/workspaces/notebooks/write",
    "Microsoft.DataFactory/pipelines/read",
    "Microsoft.DataFactory/pipelines/write",
    "Microsoft.DataFactory/factories/read",
    "Microsoft.DataFactory/factories/write",
    "Microsoft.KeyVault/vaults/secrets/read"
  ],
  "NotActions": [
    "Microsoft.Authorization/*/Delete",
    "Microsoft.Authorization/*/Write",
    "Microsoft.Authorization/elevateAccess/Action"
  ]
}
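How `Actions` and `NotActions` combine is easy to misread, so here is a small sketch of the rule as commonly documented: a role grants an operation if it matches an `Actions` pattern and matches no `NotActions` pattern, with `*` as a wildcard and case-insensitive comparison. `is_allowed` is a hypothetical helper for illustration. Note that `NotActions` only subtracts from this role's own `Actions`; it is not a deny rule, so another role can still grant the same operation.

```python
from fnmatch import fnmatchcase
from typing import Iterable


def is_allowed(operation: str, actions: Iterable[str], not_actions: Iterable[str]) -> bool:
    """Return True if this single role definition grants the operation."""
    op = operation.lower()

    def matches(patterns: Iterable[str]) -> bool:
        # Lower-case both sides so matching is case-insensitive on every platform.
        return any(fnmatchcase(op, pattern.lower()) for pattern in patterns)

    return matches(actions) and not matches(not_actions)


role_actions = [
    "Microsoft.Storage/storageAccounts/read",
    "Microsoft.DataFactory/pipelines/write",
]
role_not_actions = [
    "Microsoft.Authorization/*/Delete",
    "Microsoft.Authorization/*/Write",
    "Microsoft.Authorization/elevateAccess/Action",
]

print(is_allowed("Microsoft.DataFactory/pipelines/write", role_actions, role_not_actions))
print(is_allowed("Microsoft.Authorization/roleAssignments/write", ["*"], role_not_actions))
```

With the explicit `Actions` list from the role above, the `NotActions` entries never come into play because no `Microsoft.Authorization` operation is granted in the first place; they matter only if a wildcard such as `*` is later added to `Actions`.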
⚠️ Security Critical: Never store Service Principal secrets in code, environment variables, or configuration files. Always use Azure Key Vault with Managed Identities for secret retrieval.
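A minimal sketch of that pattern follows, assuming the `azure-identity` and `azure-keyvault-secrets` packages and an identity that is allowed to read secrets from the vault. The helper names, the vault URL, and the secret name are hypothetical placeholders. The credential comes from the managed identity at runtime, so no secret appears in code or configuration.

```python
def build_secret_client(vault_url: str):
    """Create a Key Vault SecretClient that authenticates without stored secrets."""
    # Imported lazily so get_secret_value can be exercised without the SDK installed.
    from azure.identity import DefaultAzureCredential
    from azure.keyvault.secrets import SecretClient

    return SecretClient(vault_url=vault_url, credential=DefaultAzureCredential())


def get_secret_value(client, name: str) -> str:
    """Fetch the current version of a secret and return its value."""
    return client.get_secret(name).value


# Example usage (placeholder vault and secret names):
# client = build_secret_client("https://kv-dataeng-prod.vault.azure.net")
# sp_secret = get_secret_value(client, "sp-dataengineering-prod-secret")
```

Keeping the lookup in one small function makes it straightforward to avoid logging or caching the returned value anywhere else in the pipeline.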
Azure AD Authentication Flow for Data Services
+----------------------------------------------------------------------+
|                         AUTHENTICATION FLOW                          |
+----------------------------------------------------------------------+
|                                                                      |
|  1. REQUEST TOKEN                                                    |
|  +------------+      +------------+      +------------+              |
|  | ADF/       |----->| Azure AD   |----->| Token      |              |
|  | Databricks |      | Endpoint   |      | Service    |              |
|  +------------+      +------------+      +-----+------+              |
|                                                |                     |
|  2. VALIDATE & ISSUE                           |                     |
|                                                v                     |
|  +----------------------------------------------------------------+  |
|  | Azure AD validates:                                            |  |
|  | - Service Principal exists                                     |  |
|  | - Credentials are valid                                        |  |
|  | - SP has required permissions                                  |  |
|  | - Conditional Access policies pass                             |  |
|  +----------------------------------------------------------------+  |
|                                                |                     |
|  3. RECEIVE TOKEN                              v                     |
|  +------------+      +------------+      +------------+              |
|  | ADF/       |<-----| JWT        |<-----| Response   |              |
|  | Databricks |      | Token      |      |            |              |
|  +-----+------+      +------------+      +------------+              |
|        |                                                             |
|  4. ACCESS RESOURCE                                                  |
|        v                                                             |
|  +------------+      +------------+                                  |
|  | ADLS/      |<-----| Use JWT    |                                  |
|  | Synapse    |      | as Bearer  |                                  |
|  +------------+      | Token      |                                  |
|                      +------------+                                  |
+----------------------------------------------------------------------+
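Steps 3 and 4 of the flow can be made concrete with the standard library alone. The sketch below decodes the claims of a JWT for inspection and builds the bearer header used when calling a resource. Both function names are hypothetical, and the decoder deliberately does not verify the signature, so it is suitable for debugging (for example checking `aud` or `exp`) and never for authorization decisions.

```python
import base64
import json


def decode_jwt_claims(token: str) -> dict:
    """Decode the payload segment of a JWT WITHOUT verifying its signature."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("expected a token of the form header.payload.signature")
    payload = parts[1]
    # JWT segments are base64url-encoded with the '=' padding stripped; restore it.
    payload += "=" * (-len(payload) % 4)
    return json.loads(base64.urlsafe_b64decode(payload))


def bearer_headers(token: str) -> dict:
    """Build the HTTP Authorization header that presents the token to a resource."""
    return {"Authorization": f"Bearer {token}"}
```

With the `azure-identity` package, a token for storage can typically be obtained with `credential.get_token("https://storage.azure.com/.default").token` and passed to either function; the SDK clients shown elsewhere in this guide perform the header step internally.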
Bicep Template for IAM Setup
// Assumed parameters: the snippet references these symbols, so they are declared here
param location string = resourceGroup().location
param storageAccountName string
param keyVaultName string

// Existing resources that the role assignments are scoped to
resource storageAccount 'Microsoft.Storage/storageAccounts@2023-01-01' existing = {
  name: storageAccountName
}

resource keyVault 'Microsoft.KeyVault/vaults@2023-07-01' existing = {
  name: keyVaultName
}

// Managed Identity for ADF
resource managedIdentity 'Microsoft.ManagedIdentity/userAssignedIdentities@2023-01-31' = {
  name: 'mi-datafactory-prod'
  location: location
  tags: {
    Environment: 'Production'
    Project: 'DataEngineering'
  }
}

// Role Assignment for ADF on ADLS (Storage Blob Data Contributor)
resource roleAssignment 'Microsoft.Authorization/roleAssignments@2022-04-01' = {
  name: guid(resourceGroup().id, 'Storage Blob Data Contributor', managedIdentity.id)
  scope: storageAccount
  properties: {
    roleDefinitionId: subscriptionResourceId('Microsoft.Authorization/roleDefinitions', 'ba92f5b4-2d11-453d-a403-e96b0029c9fe')
    principalId: managedIdentity.properties.principalId
    principalType: 'ServicePrincipal'
  }
}

// Role Assignment for Key Vault access (Key Vault Secrets User);
// takes effect only when the vault uses the Azure RBAC permission model
resource keyVaultRoleAssignment 'Microsoft.Authorization/roleAssignments@2022-04-01' = {
  name: guid(resourceGroup().id, 'Key Vault Secrets User', managedIdentity.id)
  scope: keyVault
  properties: {
    roleDefinitionId: subscriptionResourceId('Microsoft.Authorization/roleDefinitions', '4633458b-17de-408a-b875-0445335d0748')
    principalId: managedIdentity.properties.principalId
    principalType: 'ServicePrincipal'
  }
}

// Output the Managed Identity client ID and principal ID
output managedIdentityClientId string = managedIdentity.properties.clientId
output managedIdentityPrincipalId string = managedIdentity.properties.principalId
ℹ️ Best Practice: Use User-Assigned Managed Identities when multiple services (ADF, Databricks, Synapse) need to access the same resources. This simplifies RBAC management and avoids role assignment proliferation.
Conditional Access Policies for Data Engineering
+---------------------------------------------------------------+
|                CONDITIONAL ACCESS POLICY FLOW                 |
+---------------------------------------------------------------+
|                                                               |
|  +----------+     +--------------+     +--------------+       |
|  | User/SP  |---->| Sign-in      |---->| Policy       |       |
|  | Request  |     | Request      |     | Evaluation   |       |
|  +----------+     +--------------+     +------+-------+       |
|                                               |               |
|                   +---------------------------+               |
|                   |                           |               |
|                   v                           v               |
|         +------------------+        +------------------+      |
|         | Block:           |        | Grant:           |      |
|         | - Outside IP     |        | - MFA            |      |
|         | - No MFA         |        | - Compliant      |      |
|         | - Non-compliant  |        |   device         |      |
|         +------------------+        | - App Cond.      |      |
|                                     +------------------+      |
|                                                               |
|  POLICY EXAMPLES FOR DATA ENGINEERING:                        |
|  +---------------------------------------------------------+  |
|  | 1. Block sign-ins from outside corporate network        |  |
|  | 2. Require MFA for all admin operations                 |  |
|  | 3. Require compliant device for Synapse access          |  |
|  | 4. Block legacy authentication protocols                |  |
|  | 5. Session timeout for production portal access         |  |
|  +---------------------------------------------------------+  |
+---------------------------------------------------------------+
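The block/grant split in the diagram can be expressed as a small decision function. This is a toy model of the diagram only, not how the identity platform evaluates real policies: the `SignIn` fields and the `evaluate` function are illustrative names, and real policies are configured in the tenant rather than written in application code.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SignIn:
    """Signals a policy might inspect; field names are illustrative only."""
    from_corporate_network: bool
    mfa_completed: bool
    device_compliant: bool
    legacy_auth: bool = False


def evaluate(sign_in: SignIn) -> Tuple[str, List[str]]:
    """Return ("block", reasons) if any block condition applies, else ("grant", [])."""
    reasons = []
    if sign_in.legacy_auth:
        reasons.append("legacy authentication protocol")
    if not sign_in.from_corporate_network:
        reasons.append("outside corporate network")
    if not sign_in.mfa_completed:
        reasons.append("MFA not completed")
    if not sign_in.device_compliant:
        reasons.append("device not compliant")
    return ("block", reasons) if reasons else ("grant", [])


print(evaluate(SignIn(from_corporate_network=True, mfa_completed=True, device_compliant=True)))
print(evaluate(SignIn(from_corporate_network=False, mfa_completed=True, device_compliant=True)))
```

Collecting every failed condition, rather than stopping at the first, mirrors how sign-in logs are usually read when diagnosing why a request was blocked.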
Python SDK for IAM Management
import uuid

from azure.identity import DefaultAzureCredential
from azure.mgmt.authorization import AuthorizationManagementClient
from azure.mgmt.authorization.models import RoleAssignmentCreateParameters

# Placeholders: replace with your subscription ID and the object (principal) ID
# of the ADF Managed Identity
subscription_id = "your-subscription-id"
adf_managed_identity_principal_id = "your-adf-managed-identity-object-id"

credential = DefaultAzureCredential()
auth_client = AuthorizationManagementClient(credential, subscription_id)

# Assign 'Storage Blob Data Contributor' to ADF Managed Identity
role_assignment_params = RoleAssignmentCreateParameters(
    role_definition_id=f"/subscriptions/{subscription_id}/providers/Microsoft.Authorization/roleDefinitions/ba92f5b4-2d11-453d-a403-e96b0029c9fe",
    principal_id=adf_managed_identity_principal_id,
    principal_type="ServicePrincipal",
)
auth_client.role_assignments.create(
    scope=f"/subscriptions/{subscription_id}/resourceGroups/rg-datalake-prod/providers/Microsoft.Storage/storageAccounts/stdatalake001",
    role_assignment_name=str(uuid.uuid4()),  # role assignment names must be GUIDs
    parameters=role_assignment_params,
)

# List role assignments for the resource group scope
assignments = auth_client.role_assignments.list_for_scope(
    scope=f"/subscriptions/{subscription_id}/resourceGroups/rg-datalake-prod"
)
for assignment in assignments:
    print(f"Role: {assignment.role_definition_id}")
    print(f"Principal: {assignment.principal_id}")
    print(f"Type: {assignment.principal_type}")
Interview Questions
Q1: Why should you never use Storage Account Keys for data engineering pipelines?
A: Storage Account Keys grant full access to the entire storage account and are long-lived credentials, so a single leaked key exposes everything. Managed Identities remove credential management from the pipeline, rotate their underlying credentials automatically, and enable granular RBAC. If keys must be used, store them in Key Vault and rotate them regularly.
Q2: Explain the difference between RBAC at the Storage Account level vs Container level.
A: A role assigned at Storage Account scope applies to every container and blob in the account. A role assigned at container scope applies only to that container, which gives more granular control. For example, grant a service principal access to the "raw" container but not to "curated".
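In practice the difference in Q2 is just the scope string passed to the role assignment. The helpers below are a sketch with hypothetical names and placeholder IDs; the path layout follows the standard Azure resource ID format, in which a blob container sits under `blobServices/default/containers` of its storage account.

```python
def storage_account_scope(subscription_id: str, resource_group: str, account: str) -> str:
    """Scope covering every container in the storage account."""
    return (
        f"/subscriptions/{subscription_id}/resourceGroups/{resource_group}"
        f"/providers/Microsoft.Storage/storageAccounts/{account}"
    )


def container_scope(subscription_id: str, resource_group: str, account: str, container: str) -> str:
    """Scope covering a single container only."""
    return (
        storage_account_scope(subscription_id, resource_group, account)
        + f"/blobServices/default/containers/{container}"
    )


print(container_scope("your-subscription-id", "rg-datalake-prod", "stdatalake001", "raw"))
```

Assigning Storage Blob Data Reader at the scope returned for `raw` leaves `curated` inaccessible to that principal unless another assignment grants it.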
Q3: How do you troubleshoot a 403 Forbidden error when ADF tries to access ADLS?
A: Check that: 1) Managed Identity is enabled on ADF, 2) the correct RBAC role is assigned at the right scope, 3) no deny assignment overrides the role, 4) Private Endpoint and firewall rules allow the traffic, 5) ADF and the storage account belong to the same Azure AD tenant.