Azure Databricks: Workspaces, Unity Catalog & Delta Lake

Enterprise Apache Spark platform with Unity Catalog governance and Delta Lake ACID transactions

Databricks Workspace Architecture

Architecture Diagram

┌─────────────────────────────────────────────────────────────────────┐
│                    AZURE DATABRICKS ARCHITECTURE                     │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │                    DATABRICKS WORKSPACE                      │   │
│  │                                                               │   │
│  │  CONTROL PLANE (Azure-managed)                               │   │
│  │  ┌─────────────────────────────────────────────────────┐    │   │
│  │  │ Workspace Management    Cluster Management          │    │   │
│  │  │ Notebook Management     Job Scheduling               │    │   │
│  │  │ Secret Management       Access Control               │    │   │
│  │  └─────────────────────────────────────────────────────┘    │   │
│  │                                                               │   │
│  │  DATA PLANE (Customer-managed)                               │   │
│  │  ┌─────────────────────────────────────────────────────┐    │   │
│  │  │  CLUSTER 1           CLUSTER 2           CLUSTER 3  │    │   │
│  │  │  ┌─────────────┐    ┌─────────────┐    ┌─────────┐ │    │   │
│  │  │  │ Spark Driver │    │ Spark Driver │    │ Driver  │ │    │   │
│  │  │  └──────┬──────┘    └──────┬──────┘    └────┬────┘ │    │   │
│  │  │         │                  │                 │       │    │   │
│  │  │  ┌──────▼──────┐    ┌──────▼──────┐    ┌────▼────┐ │    │   │
│  │  │  │ Executor 1  │    │ Executor 1  │    │Exec 1   │ │    │   │
│  │  │  │ Executor 2  │    │ Executor 2  │    │Exec 2   │ │    │   │
│  │  │  │ Executor 3  │    │ Executor 3  │    │Exec 3   │ │    │   │
│  │  │  └─────────────┘    └─────────────┘    └─────────┘ │    │   │
│  │  └─────────────────────────────────────────────────────┘    │   │
│  │                                                               │   │
│  │  UNITY CATALOG                                                │   │
│  │  ┌─────────────────────────────────────────────────────┐    │   │
│  │  │ Metastore → Catalogs → Schemas → Tables/Views        │    │   │
│  │  │ Access Control  Lineage  Data Quality  Marketplace   │    │   │
│  │  └─────────────────────────────────────────────────────┘    │   │
│  └─────────────────────────────────────────────────────────────┘   │
│                                                                     │
│  CONNECTED SERVICES:                                                │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │ ADLS Gen2 (Data Lake)  │ Azure Key Vault  │ Azure SQL DB   │   │
│  │ Azure Data Factory     │ Azure Monitor    │ Power BI       │   │
│  │ Azure Event Hubs       │ Azure ML         │ Cosmos DB      │   │
│  └─────────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────────┘

Unity Catalog Architecture

Architecture Diagram

┌─────────────────────────────────────────────────────────────────┐
│                    UNITY CATALOG HIERARCHY                       │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  METASTORE (One per region)                                     │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ Metastore ID: abc-123-def-456                            │   │
│  │ Region: East US 2                                        │   │
│  │                                                          │   │
│  │  CATALOGS                                                │   │
│  │  ┌──────────────────────────────────────────────────┐   │   │
│  │  │ Catalog: dataengineering                          │   │   │
│  │  │ ┌────────────────────────────────────────────┐   │   │   │
│  │  │ │ Schema: raw                                │   │   │   │
│  │  │ │ ┌────────────────────────────────────────┐ │   │   │   │
│  │  │ │ │ Tables:                                 │ │   │   │   │
│  │  │ │ │ • sales_raw (Delta)                     │ │   │   │   │
│  │  │ │ │ • inventory_raw (Delta)                 │ │   │   │   │
│  │  │ │ │ Views:                                  │ │   │   │   │
│  │  │ │ │ • vw_latest_sales                       │ │   │   │   │
│  │  │ │ └────────────────────────────────────────┘ │   │   │   │
│  │  │ └────────────────────────────────────────────┘   │   │   │
│  │  │                                                  │   │   │
│  │  │ ┌────────────────────────────────────────────┐   │   │   │
│  │  │ │ Schema: curated                            │   │   │   │
│  │  │ │ ┌────────────────────────────────────────┐ │   │   │   │
│  │  │ │ │ Tables:                                 │ │   │   │   │
│  │  │ │ │ • fact_sales (Delta)                    │ │   │   │   │
│  │  │ │ │ • dim_customers (Delta)                 │ │   │   │   │
│  │  │ │ │ • dim_products (Delta)                  │ │   │   │   │
│  │  │ │ └────────────────────────────────────────┘ │   │   │   │
│  │  │ └────────────────────────────────────────────┘   │   │   │
│  │  └──────────────────────────────────────────────────┘   │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
│  ACCESS CONTROL:                                                 │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ Principal        │ Catalog    │ Schema   │ Permissions   │   │
│  │ ─────────────────────────────────────────────────────── │   │
│  │ Data Engineers   │ dataeng    │ raw      │ SELECT, MODIFY│   │
│  │ Data Analysts    │ dataeng    │ curated  │ SELECT        │   │
│  │ ML Engineers     │ dataeng    │ features │ SELECT        │   │
│  │ Admins           │ dataeng    │ ALL      │ ALL           │   │
│  └─────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────┘
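
The access-control table above corresponds to Unity Catalog GRANT statements on the three-level namespace (catalog.schema.table). The following is a minimal sketch, not a definitive setup: the group names (`data_engineers`, `data_analysts`, `ml_engineers`) and the `features` schema are assumptions for illustration, and the code only builds SQL strings.

```python
# Hypothetical principals and grants modelled on the access-control table.
# Unity Catalog addresses objects as catalog.schema.table.
CATALOG = "dataengineering"

GRANTS = [
    # (principal, schema, privileges)
    ("data_engineers", "raw", ["SELECT", "MODIFY"]),
    ("data_analysts", "curated", ["SELECT"]),
    ("ml_engineers", "features", ["SELECT"]),
]


def build_grant_statements(catalog, grants):
    """Return GRANT statements giving each principal access to one schema."""
    statements = []
    for principal, schema, privileges in grants:
        # USE CATALOG and USE SCHEMA are needed before privileges on
        # objects inside the schema take effect.
        statements.append(
            f"GRANT USE CATALOG ON CATALOG {catalog} TO `{principal}`"
        )
        statements.append(
            f"GRANT USE SCHEMA, {', '.join(privileges)} "
            f"ON SCHEMA {catalog}.{schema} TO `{principal}`"
        )
    return statements


statements = build_grant_statements(CATALOG, GRANTS)
for stmt in statements:
    print(stmt)
```

On a Unity Catalog-enabled cluster, each string could then be run with `spark.sql(stmt)`. Granting at schema level means tables created later in that schema inherit the privilege.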

Delta Lake Operations

# Delta Lake operations in Databricks
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()

# Read Delta table
df = spark.read.format("delta").load("/mnt/datalake/curated/fact_sales")

# Write Delta table with partitioning
df.write.format("delta") \
    .partitionBy("sale_year", "sale_month") \
    .mode("overwrite") \
    .option("overwriteSchema", "true") \
    .save("/mnt/datalake/curated/fact_sales")

# Merge (upsert) operation
# updates_df is assumed to be a DataFrame of new and changed rows prepared upstream
delta_table = DeltaTable.forPath(spark, "/mnt/datalake/curated/fact_sales")

delta_table.alias("target").merge(
    updates_df.alias("source"),
    "target.sale_id = source.sale_id"
).whenMatchedUpdate(
    set={
        "target.quantity": "source.quantity",
        "target.unit_price": "source.unit_price",
        "target.last_updated": "current_timestamp()"
    }
).whenNotMatchedInsert(
    values={
        "sale_id": "source.sale_id",
        "customer_key": "source.customer_key",
        "product_key": "source.product_key",
        "sale_date": "source.sale_date",
        "quantity": "source.quantity",
        "unit_price": "source.unit_price",
        "total_amount": "source.total_amount",
        "created_date": "current_timestamp()",
        "last_updated": "current_timestamp()"
    }
).execute()

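# Time travel query: read a specific table version
# (use .option("timestampAsOf", "<timestamp>") instead to read the table as of a point in time)
df_v123 = spark.read.format("delta") \
    .option("versionAsOf", 123) \
    .load("/mnt/datalake/curated/fact_sales")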

# Get Delta table history
history = DeltaTable.forPath(spark, "/mnt/datalake/curated/fact_sales").history()
history.show()

# Optimize table (compaction)
spark.sql("OPTIMIZE delta.`/mnt/datalake/curated/fact_sales`")

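# Vacuum: delete data files no longer referenced by the table and older than the
# retention window (168 hours = 7 days, the default); older versions then stop being readable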
spark.sql("VACUUM delta.`/mnt/datalake/curated/fact_sales` RETAIN 168 HOURS")
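
The merge above lists every column by hand. As a hedged sketch, the condition and column maps can be derived from column lists instead. `build_merge_spec` is a hypothetical helper, not part of the Delta Lake API, and it assumes the audit columns `created_date` and `last_updated` used above.

```python
def build_merge_spec(key_cols, update_cols, insert_cols,
                     target="target", source="source"):
    """Build the condition and column maps for a Delta MERGE.

    Returns (condition, update_set, insert_values) as SQL expression
    strings, the form accepted by DeltaTable.merge, whenMatchedUpdate
    and whenNotMatchedInsert.
    """
    if not key_cols:
        raise ValueError("at least one key column is required")
    condition = " AND ".join(
        f"{target}.{c} = {source}.{c}" for c in key_cols
    )
    update_set = {c: f"{source}.{c}" for c in update_cols}
    insert_values = {c: f"{source}.{c}" for c in insert_cols}
    # Audit columns are stamped at merge time rather than copied from source.
    update_set["last_updated"] = "current_timestamp()"
    insert_values["created_date"] = "current_timestamp()"
    insert_values["last_updated"] = "current_timestamp()"
    return condition, update_set, insert_values


condition, update_set, insert_values = build_merge_spec(
    key_cols=["sale_id"],
    update_cols=["quantity", "unit_price"],
    insert_cols=["sale_id", "customer_key", "product_key", "sale_date",
                 "quantity", "unit_price", "total_amount"],
)
print(condition)
```

The three values would plug into the same call chain: `merge(updates_df.alias("source"), condition)`, then `whenMatchedUpdate(set=update_set)` and `whenNotMatchedInsert(values=insert_values)`.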

Pro Tip: Run OPTIMIZE after large merge operations to compact the small files they leave behind. Add ZORDER BY on high-cardinality columns that queries filter on often, for example: OPTIMIZE delta.`/mnt/datalake/curated/fact_sales` ZORDER BY (customer_key, sale_date)
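
One detail worth guarding against: Z-ordering applies to data columns, not partition columns, so `sale_year` and `sale_month` from the write example cannot appear in ZORDER BY. A small sketch of a statement builder with that check follows; `build_optimize_sql` is a hypothetical helper name.

```python
def build_optimize_sql(table, zorder_cols=(), partition_cols=()):
    """Return an OPTIMIZE statement, optionally with a ZORDER BY clause."""
    overlap = sorted(set(zorder_cols) & set(partition_cols))
    if overlap:
        # Partition columns already determine file layout; Delta does not
        # accept them as Z-order columns.
        raise ValueError(f"cannot Z-order by partition columns: {overlap}")
    sql = f"OPTIMIZE {table}"
    if zorder_cols:
        sql += f" ZORDER BY ({', '.join(zorder_cols)})"
    return sql


optimize_sql = build_optimize_sql(
    "delta.`/mnt/datalake/curated/fact_sales`",
    zorder_cols=["customer_key", "sale_date"],
    partition_cols=["sale_year", "sale_month"],
)
print(optimize_sql)
```

The resulting string can be passed to `spark.sql(optimize_sql)`.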

Databricks Job Configuration

{
  "name": "daily_sales_transformation",
  "tasks": [
    {
      "task_key": "extract_raw",
      "new_cluster": {
        "spark_version": "13.3.x-scala2.12",
        "node_type_id": "Standard_D4s_v3",
        "autoscale": {
          "min_workers": 2,
          "max_workers": 8
        },
        "spark_conf": {
          "spark.databricks.delta.optimizeWrite.enabled": "true",
          "spark.databricks.delta.autoCompact.enabled": "true"
        }
      },
      "notebook_task": {
        "notebook_path": "/Repos/data_engineering/extract_raw_data",
        "base_parameters": {
          "date": "{{job.parameters.process_date}}",
          "source": "sales_api"
        }
      },
      "timeout_seconds": 3600,
      "max_retries": 1
    },
    {
      "task_key": "transform_curated",
      "depends_on": [
        {
          "task_key": "extract_raw"
        }
      ],
      "existing_cluster_id": "1234-567890-abcde",
      "notebook_task": {
        "notebook_path": "/Repos/data_engineering/transform_to_curated"
      },
      "timeout_seconds": 7200
    },
    {
      "task_key": "load_synapse",
      "depends_on": [
        {
          "task_key": "transform_curated"
        }
      ],
      "new_cluster": {
        "spark_version": "13.3.x-scala2.12",
        "node_type_id": "Standard_D4s_v3",
        "num_workers": 2
      },
      "notebook_task": {
        "notebook_path": "/Repos/data_engineering/load_to_synapse"
      }
    }
  ],
  "schedule": {
    "quartz_cron_expression": "0 0 2 * * ?",
    "timezone_id": "UTC"
  },
  "max_concurrent_runs": 1
}
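
Before submitting a multi-task job definition like this, it can help to check the task wiring locally. The sketch below is an illustration only: `validate_job` is a hypothetical helper, it knows just the two compute options used in this example (`new_cluster` and `existing_cluster_id`), and it does not call the Jobs API.

```python
def validate_job(job):
    """Validate task wiring in a job spec and return task keys in run order."""
    tasks = {t["task_key"]: t for t in job["tasks"]}
    for key, task in tasks.items():
        # Each task must name exactly one compute source in this sketch.
        if ("new_cluster" in task) == ("existing_cluster_id" in task):
            raise ValueError(
                f"{key}: set exactly one of new_cluster or existing_cluster_id"
            )
        cluster = task.get("new_cluster", {})
        # A fixed worker count and an autoscale range are alternatives.
        if "num_workers" in cluster and "autoscale" in cluster:
            raise ValueError(f"{key}: use num_workers or autoscale, not both")
        for dep in task.get("depends_on", []):
            if dep["task_key"] not in tasks:
                raise ValueError(f"{key}: unknown dependency {dep['task_key']}")

    # Repeatedly pick tasks whose dependencies have all been scheduled.
    order = []
    while len(order) < len(tasks):
        ready = [
            key for key, task in tasks.items()
            if key not in order
            and all(d["task_key"] in order for d in task.get("depends_on", []))
        ]
        if not ready:
            raise ValueError("dependency cycle detected")
        order.extend(ready)
    return order


# Trimmed copy of the job above, with tasks deliberately listed out of order.
job = {
    "tasks": [
        {"task_key": "load_synapse",
         "depends_on": [{"task_key": "transform_curated"}],
         "new_cluster": {"num_workers": 2}},
        {"task_key": "extract_raw",
         "new_cluster": {"autoscale": {"min_workers": 2, "max_workers": 8}}},
        {"task_key": "transform_curated",
         "depends_on": [{"task_key": "extract_raw"}],
         "existing_cluster_id": "1234-567890-abcde"},
    ]
}
run_order = validate_job(job)
print(run_order)
```

The returned list is the order in which the scheduler would have to run the tasks; a cycle or a misspelled `task_key` raises `ValueError` instead.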

Spark Optimization for Databricks

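# Photon engine: chosen when the cluster is created (runtime_engine = "PHOTON" in the Clusters API);
# this session flag only has an effect on Photon-capable compute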
spark.conf.set("spark.databricks.photon.enabled", "true")

# Adaptive Query Execution (on by default since Spark 3.2; set explicitly here for clarity)
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

# Disk cache (formerly called Delta cache)
spark.conf.set("spark.databricks.io.cache.enabled", "true")

# Broadcast Join threshold
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "100MB")

# Optimize write
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")
spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")

# Shuffle partitions (200 is the Spark default; with AQE coalescing it acts as the initial count)
spark.conf.set("spark.sql.shuffle.partitions", "200")
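
Rather than leaving `spark.sql.shuffle.partitions` at 200 for every job, a rough value can be derived from the expected shuffle volume. This is a rule-of-thumb sketch: the 128 MiB target partition size is an assumption, not a Databricks recommendation taken from this document, and `suggest_shuffle_partitions` is a hypothetical helper.

```python
import math


def suggest_shuffle_partitions(shuffle_bytes,
                               target_partition_bytes=128 * 1024**2,
                               min_partitions=1):
    """Estimate a shuffle partition count from the expected shuffle size."""
    if shuffle_bytes < 0:
        raise ValueError("shuffle_bytes must be non-negative")
    # One partition per target-sized chunk, rounded up.
    estimate = math.ceil(shuffle_bytes / target_partition_bytes)
    return max(min_partitions, estimate)


# 50 GiB of shuffle data at 128 MiB per partition.
partitions = suggest_shuffle_partitions(50 * 1024**3)
print(partitions)  # 400
```

The result would be applied with `spark.conf.set("spark.sql.shuffle.partitions", str(partitions))`. With AQE partition coalescing enabled, Spark can still merge small partitions at runtime, so erring on the high side is usually the safer direction.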

Interview Questions

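Q1: Explain the difference between Auto Optimize and manual OPTIMIZE in Delta Lake.
A: Auto Optimize is two write-time features. Optimized writes (optimizeWrite) rebalance data before it is written, so each write produces fewer, larger files; auto compaction (autoCompact) merges small files right after a write commits. Manual OPTIMIZE is an on-demand command that rewrites a table's small files into larger ones and can also apply ZORDER BY, which Auto Optimize does not do. Use both: Auto Optimize for continuous ingestion, manual OPTIMIZE after bulk loads or when Z-ordering is needed.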
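
Besides the session-level Spark settings, Auto Optimize can be switched on per table through Delta table properties. A minimal sketch that builds the statement; the table name is the `fact_sales` table from the catalog diagram, and `auto_optimize_sql` is a hypothetical helper.

```python
def auto_optimize_sql(table, optimize_write=True, auto_compact=True):
    """Build an ALTER TABLE statement that sets the Auto Optimize properties."""
    props = {
        "delta.autoOptimize.optimizeWrite": optimize_write,
        "delta.autoOptimize.autoCompact": auto_compact,
    }
    # Table property values are written as quoted 'true' / 'false' strings.
    assignments = ", ".join(
        f"'{name}' = '{str(value).lower()}'" for name, value in props.items()
    )
    return f"ALTER TABLE {table} SET TBLPROPERTIES ({assignments})"


alter_sql = auto_optimize_sql("dataengineering.curated.fact_sales")
print(alter_sql)
```

Setting the properties on the table makes the behavior apply to every writer, whereas the `spark.conf.set` form only affects the current session.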

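Q2: How do you handle data skew in Databricks Spark jobs?
A: 1) Enable Adaptive Query Execution (AQE) with skewJoin so oversized join partitions are split at runtime, 2) salt skewed keys so one hot key is spread over several partitions, 3) repartition the data on a better-distributed column, 4) use broadcast joins when one side is small enough to avoid the shuffle entirely, 5) adjust the number of shuffle partitions to the data volume.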
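
To make the salting idea concrete, here is a pure-Python simulation rather than Spark code. It assumes a CRC32-based stand-in for Spark's hash partitioner and a hypothetical hot key `cust_1`; the point is only to show how the largest partition shrinks once the hot key is salted.

```python
import zlib
from collections import Counter


def partition_of(key, num_partitions):
    """Deterministic stand-in for a hash partitioner."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def partition_loads(keys, num_partitions):
    """Count how many rows land in each partition."""
    return Counter(partition_of(k, num_partitions) for k in keys)


def salt_keys(keys, hot_key, num_salts):
    """Append a rotating salt suffix to every row that carries the hot key."""
    return [
        f"{k}_{i % num_salts}" if k == hot_key else k
        for i, k in enumerate(keys)
    ]


# 900 rows for one hot customer plus 100 rows spread over other customers.
keys = ["cust_1"] * 900 + [f"cust_{i}" for i in range(2, 102)]
before = partition_loads(keys, 8)
after = partition_loads(salt_keys(keys, "cust_1", 8), 8)
print("largest partition before:", max(before.values()))
print("largest partition after: ", max(after.values()))
```

In a real Spark join the other side must be expanded to match: each row with the hot key is replicated once per salt value so that every salted key still finds its partner.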

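Q3: What is the benefit of using the disk cache (formerly Delta cache) in Databricks?
A: The disk cache, previously called Delta cache and DBIO cache, keeps copies of remote Parquet data files, including those backing Delta tables, on the workers' local SSDs. Repeated reads are then served from local disk instead of cloud storage, which reduces storage I/O. Speedups of 2-10x for repeated queries are sometimes quoted, but the actual gain depends on the workload and on how much of the working set fits in the cache.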