Azure Databricks: Workspaces, Unity Catalog & Delta Lake
Enterprise Apache Spark platform with Unity Catalog governance and Delta Lake ACID transactions
Databricks Workspace Architecture
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β AZURE DATABRICKS ARCHITECTURE β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β DATABRICKS WORKSPACE β β
β β β β
β β CONTROL PLANE (Azure-managed) β β
β β βββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β
β β β Workspace Management Cluster Management β β β
β β β Notebook Management Job Scheduling β β β
β β β Secret Management Access Control β β β
β β βββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β
β β β β
β β DATA PLANE (Customer-managed) β β
β β βββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β
β β β CLUSTER 1 CLUSTER 2 CLUSTER 3 β β β
β β β βββββββββββββββ βββββββββββββββ βββββββββββ β β β
β β β β Spark Driver β β Spark Driver β β Driver β β β β
β β β ββββββββ¬βββββββ ββββββββ¬βββββββ ββββββ¬βββββ β β β
β β β β β β β β β
β β β ββββββββΌβββββββ ββββββββΌβββββββ ββββββΌβββββ β β β
β β β β Executor 1 β β Executor 1 β βExec 1 β β β β
β β β β Executor 2 β β Executor 2 β βExec 2 β β β β
β β β β Executor 3 β β Executor 3 β βExec 3 β β β β
β β β βββββββββββββββ βββββββββββββββ βββββββββββ β β β
β β βββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β
β β β β
β β UNITY CATALOG β β
β β βββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β
β β β Metastore β Catalogs β Schemas β Tables/Views β β β
β β β Access Control Lineage Data Quality Marketplace β β β
β β βββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
β CONNECTED SERVICES: β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β ADLS Gen2 (Data Lake) β Azure Key Vault β Azure SQL DB β β
β β Azure Data Factory β Azure Monitor β Power BI β β
β β Azure Event Hubs β Azure ML β Cosmos DB β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Unity Catalog Architecture
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β UNITY CATALOG HIERARCHY β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β METASTORE (One per region) β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β Metastore ID: abc-123-def-456 β β
β β Region: East US 2 β β
β β β β
β β CATALOGS β β
β β ββββββββββββββββββββββββββββββββββββββββββββββββββββ β β
β β β Catalog: dataengineering β β β
β β β ββββββββββββββββββββββββββββββββββββββββββββββ β β β
β β β β Schema: raw β β β
β β β β ββββββββββββββββββββββββββββββββββββββββββ β β β β
β β β β β Tables: β β β β β
β β β β β β’ sales_raw (Delta) β β β β β
β β β β β β’ inventory_raw (Delta) β β β β β
β β β β β Views: β β β β β
β β β β β β’ vw_latest_sales β β β β β
β β β β ββββββββββββββββββββββββββββββββββββββββββ β β β β
β β β ββββββββββββββββββββββββββββββββββββββββββββββ β β β
β β β β β β
β β β ββββββββββββββββββββββββββββββββββββββββββββββ β β β
β β β β Schema: curated β β β
β β β β ββββββββββββββββββββββββββββββββββββββββββ β β β β
β β β β β Tables: β β β β β
β β β β β β’ fact_sales (Delta) β β β β β
β β β β β β’ dim_customers (Delta) β β β β β
β β β β β β’ dim_products (Delta) β β β β β
β β β β ββββββββββββββββββββββββββββββββββββββββββ β β β β
β β β ββββββββββββββββββββββββββββββββββββββββββββββ β β β
β β ββββββββββββββββββββββββββββββββββββββββββββββββββββ β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
β ACCESS CONTROL: β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β Principal β Catalog β Schema β Permissions β β
β β βββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β
β β Data Engineers β dataeng β raw β SELECT, WRITE β β
β β Data Analysts β dataeng β curated β SELECT β β
β β ML Engineers β dataeng β features β SELECT β β
β β Admins β dataeng β ALL β ALL β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Delta Lake Operations
# Delta Lake operations in Databricks
from pyspark.sql import SparkSession
from delta.tables import DeltaTable
spark = SparkSession.builder.getOrCreate()
# Read Delta table
df = spark.read.format("delta").load("/mnt/datalake/curated/fact_sales")
# Write Delta table with partitioning
df.write.format("delta") \
.partitionBy("sale_year", "sale_month") \
.mode("overwrite") \
.option("overwriteSchema", "true") \
.save("/mnt/datalake/curated/fact_sales")
# Merge (Upsert) operation
delta_table = DeltaTable.forPath(spark, "/mnt/datalake/curated/fact_sales")
delta_table.alias("target").merge(
updates_df.alias("source"),
"target.sale_id = source.sale_id"
).whenMatchedUpdate(
set={
"target.quantity": "source.quantity",
"target.unit_price": "source.unit_price",
"target.last_updated": "current_timestamp()"
}
).whenNotMatchedInsert(
values={
"sale_id": "source.sale_id",
"customer_key": "source.customer_key",
"product_key": "source.product_key",
"sale_date": "source.sale_date",
"quantity": "source.quantity",
"unit_price": "source.unit_price",
"total_amount": "source.total_amount",
"created_date": "current_timestamp()",
"last_updated": "current_timestamp()"
}
).execute()
# Time travel query
df_yesterday = spark.read.format("delta") \
.option("versionAsOf", 123) \
.load("/mnt/datalake/curated/fact_sales")
# Get Delta table history
history = DeltaTable.forPath(spark, "/mnt/datalake/curated/fact_sales").history()
history.show()
# Optimize table (compaction)
spark.sql("OPTIMIZE delta.`/mnt/datalake/curated/fact_sales`")
# Vacuum old files
spark.sql("VACUUM delta.`/mnt/datalake/curated/fact_sales` RETAIN 168 HOURS")
βΉοΈ
Pro Tip: Always run OPTIMIZE after large merge operations to compact small files. Use Z-ORDER on frequently queried columns: OPTIMIZE delta.table ZORDER BY (customer_key, sale_date)
Databricks Job Configuration
{
"run_name": "daily_sales_transformation",
"tasks": [
{
"task_key": "extract_raw",
"new_cluster": {
"spark_version": "13.3.x-scala2.12",
"node_type_id": "Standard_D4s_v3",
"num_workers": 4,
"autoscale": {
"min_workers": 2,
"max_workers": 8
},
"spark_conf": {
"spark.databricks.delta.optimizeWrite.enabled": "true",
"spark.databricks.delta.autoCompact.enabled": "true"
}
},
"notebook_task": {
"notebook_path": "/Repos/data_engineering/extract_raw_data",
"base_parameters": {
"date": "{{job.parameters.process_date}}",
"source": "sales_api"
}
},
"timeout_seconds": 3600,
"retry_on_failure": true
},
{
"task_key": "transform_curated",
"depends_on": [
{
"task_key": "extract_raw"
}
],
"existing_cluster_id": "1234-567890-abcde",
"notebook_task": {
"notebook_path": "/Repos/data_engineering/transform_to_curated"
},
"timeout_seconds": 7200
},
{
"task_key": "load_synapse",
"depends_on": [
{
"task_key": "transform_curated"
}
],
"new_cluster": {
"spark_version": "13.3.x-scala2.12",
"num_workers": 2
},
"notebook_task": {
"notebook_path": "/Repos/data_engineering/load_to_synapse"
}
}
],
"schedule": {
"quartz_cron_expression": "0 0 2 * * ?",
"timezone_id": "UTC"
},
"max_concurrent_runs": 1
}
Spark Optimization for Databricks
# Photon Engine (default on DBR 11.3+)
spark.conf.set("spark.databricks.photon.enabled", "true")
# Adaptive Query Execution
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
# Delta Cache
spark.conf.set("spark.databricks.io.cache.enabled", "true")
# Broadcast Join threshold
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "100MB")
# Optimize write
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")
spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")
# Performance monitoring
spark.conf.set("spark.sql.shuffle.partitions", "200")
Interview Questions
Q1: Explain the difference between Auto Optimize and manual OPTIMIZE in Delta Lake. A: Auto Optimize (autoCompact and optimizeWrite) automatically compacts small files during writes. Manual OPTIMIZE runs on-demand for explicit compaction. Use both: Auto Optimize for continuous operation, manual OPTIMIZE after bulk loads.
Q2: How do you handle data skew in Databricks Spark jobs? A: 1) Enable Adaptive Query Execution (AQE) with skewJoin, 2) Salting skewed keys, 3) Repartitioning data, 4) Using broadcast joins for small tables, 5) Adjusting shuffle partitions based on data volume.
Q3: What is the benefit of using Delta Cache in Databricks? A: Delta Cache (formerly Local Disk Cache) caches frequently accessed Delta table data on local SSDs, reducing storage I/O and improving read performance by 2-10x for repeated queries.