Databricks on Azure Interview Q&A

25 interview questions on Azure Databricks, Spark, Delta Lake, and Unity Catalog

Question 1: What is Azure Databricks?

Answer: Managed Apache Spark platform on Azure. Provides collaborative workspace, managed clusters, Delta Lake integration, and Unity Catalog governance.

Question 2: What is the difference between Databricks and Synapse?

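Answer: Databricks: Spark-focused, notebook-based, Delta Lake native. Synapse: SQL-focused, with dedicated and serverless SQL pools and T-SQL, although it also offers Spark pools. As a rule of thumb, choose Databricks for Spark-heavy engineering and ML workloads, and Synapse for T-SQL-centric warehousing.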

Question 3: What is Delta Lake?

Answer: Open-source storage layer that adds a transaction log on top of Parquet files, providing ACID transactions, schema enforcement and evolution, time travel, and data skipping.

Question 4: What is Unity Catalog?

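Answer: Centralized governance layer for data and AI assets across Databricks workspaces. Provides fine-grained access control, auditing, lineage, and data discovery through a three-level namespace (catalog.schema.table), and supports data sharing through Delta Sharing and Databricks Marketplace.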
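
Example (sketch): Access control in Unity Catalog is typically expressed as SQL GRANT statements against three-level names. The helper below only builds such a statement as a string; the function name and the catalog, schema, table, and group names are hypothetical.

```python
def grant_statement(privilege: str, catalog: str, schema: str, table: str, principal: str) -> str:
    """Build a GRANT statement for a three-level Unity Catalog table name."""
    full_name = f"{catalog}.{schema}.{table}"
    # Backticks allow principals whose names contain hyphens or spaces.
    return f"GRANT {privilege} ON TABLE {full_name} TO `{principal}`"


# Hypothetical names, for illustration only.
stmt = grant_statement("SELECT", "main", "sales", "orders", "data-analysts")
print(stmt)
# In a notebook the string would be executed with spark.sql(stmt).
```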

Question 5: What is the Photon engine?

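Answer: Vectorized execution engine written in C++ that accelerates Spark SQL and DataFrame operations without code changes. The speedup depends on the workload. Photon is always on for SQL warehouses; on clusters it is selected in the compute configuration.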

Question 6: What is Adaptive Query Execution (AQE)?

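Answer: Re-optimizes query plans at runtime using statistics collected at shuffle boundaries. It coalesces small shuffle partitions, switches join strategies (for example, sort-merge to broadcast), and splits skewed partitions. Enabled by default since Spark 3.2.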
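
Example (sketch): One AQE adjustment is merging small shuffle partitions after their real sizes are known. The toy model below is not Spark's implementation; it only illustrates the idea of greedily grouping adjacent partitions up to a target size. In Spark the related settings are spark.sql.adaptive.enabled, spark.sql.adaptive.coalescePartitions.enabled, and spark.sql.adaptive.advisoryPartitionSizeInBytes; the function name and sizes here are made up.

```python
def coalesce_partitions(sizes_mb, target_mb):
    """Greedily merge adjacent partitions so each group stays near target_mb."""
    groups, current, current_size = [], [], 0
    for size in sizes_mb:
        # Close the current group if adding this partition would exceed the target.
        if current and current_size + size > target_mb:
            groups.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return groups


# Seven post-shuffle partitions (sizes in MB) collapse into four tasks.
groups = coalesce_partitions([5, 10, 20, 70, 3, 2, 64], target_mb=64)
print(groups)
```

An oversized partition (70 MB here) is left alone by this step; splitting it is the job of skew handling.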

Question 7: What is the benefit of Auto Loader?

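Answer: Incremental ingestion of new files from cloud storage with automatic file discovery, schema inference and evolution, and checkpoint-based progress tracking. By default it discovers files by directory listing; in file notification mode on Azure it uses Event Grid and Queue Storage for faster detection at scale.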
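
Example (sketch): Auto Loader is configured through options on the cloudFiles streaming source. The helper below just assembles an options dictionary; the helper name and the paths are hypothetical placeholders, while the option keys are the documented cloudFiles ones.

```python
def auto_loader_options(file_format, schema_location, use_notifications=False):
    """Assemble options for a cloudFiles (Auto Loader) stream."""
    opts = {
        "cloudFiles.format": file_format,
        # Where the inferred schema and its changes are tracked.
        "cloudFiles.schemaLocation": schema_location,
    }
    if use_notifications:
        # Switch from directory listing to file notification mode.
        opts["cloudFiles.useNotifications"] = "true"
    return opts


opts = auto_loader_options("json", "/tmp/hypothetical/schema", use_notifications=True)
print(opts)

# On a Databricks cluster the options would be used roughly like this:
# df = (spark.readStream.format("cloudFiles")
#       .options(**opts)
#       .load("abfss://<container>@<account>.dfs.core.windows.net/<path>"))
```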

Question 8: How do you optimize Spark performance?

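Answer: 1) Enable Photon and AQE, 2) Tune partition count and size, 3) Broadcast small tables in joins, 4) Cache DataFrames that are reused, 5) Prefer built-in functions over Python UDFs, 6) Compact Delta tables with OPTIMIZE and Z-ORDER on frequently filtered columns.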
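
Example (sketch): To see why broadcasting a small table helps, here is a plain-Python model of a broadcast hash join. The small side is turned into an in-memory lookup that every task can hold, so the large side is streamed once and never shuffled. This is an illustration of the idea only, not Spark code; in PySpark the hint is pyspark.sql.functions.broadcast.

```python
def broadcast_hash_join(large_rows, small_rows, key):
    """Inner join that builds a hash table from the small side only."""
    lookup = {}
    for row in small_rows:
        lookup.setdefault(row[key], []).append(row)
    joined = []
    # The large side is scanned once; no sorting or shuffling is needed.
    for row in large_rows:
        for match in lookup.get(row[key], []):
            joined.append({**match, **row})
    return joined


orders = [
    {"country": "DE", "amount": 10},
    {"country": "FR", "amount": 7},
    {"country": "DE", "amount": 3},
]
countries = [{"country": "DE", "name": "Germany"}, {"country": "FR", "name": "France"}]
result = broadcast_hash_join(orders, countries, "country")
print(result)
```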

Question 9: What is the difference between job clusters and interactive clusters?

Answer: Job clusters: Ephemeral, created per job, destroyed after. Interactive: Persist for development. Job clusters are cost-efficient for scheduled workloads.

Question 10: What is Delta Lake time travel?

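Answer: Query previous versions of Delta tables using version numbers or timestamps. Enables auditing, reprocessing, and debugging. Older versions remain queryable only while their data files and log entries are retained, so VACUUM limits how far back you can go.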
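
Example (sketch): Time travel is exposed in SQL through VERSION AS OF and TIMESTAMP AS OF. The helper below only builds the query text; the helper name and table name are hypothetical.

```python
def time_travel_query(table, version=None, timestamp=None):
    """Build a Delta time travel query for exactly one of version or timestamp."""
    if (version is None) == (timestamp is None):
        raise ValueError("Pass exactly one of version or timestamp")
    if version is not None:
        return f"SELECT * FROM {table} VERSION AS OF {int(version)}"
    return f"SELECT * FROM {table} TIMESTAMP AS OF '{timestamp}'"


by_version = time_travel_query("main.sales.orders", version=12)
by_time = time_travel_query("main.sales.orders", timestamp="2024-01-01")
print(by_version)
print(by_time)
```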

Question 11: How do you handle data skew in Spark?

Answer: Enable AQE skew join handling (spark.sql.adaptive.skewJoin.enabled), salt skewed keys, repartition on a better-distributed column, broadcast the smaller table, and adjust the number of shuffle partitions.
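
Example (sketch): Salting appends a small suffix to a hot key so that its rows hash to several partitions instead of one. The plain-Python model below compares partition loads with and without salting; the key names, row counts, and partition count are invented for illustration, and a deterministic salt is used so the result is reproducible. In a real join, the other side must be replicated once per salt value so that matches are preserved.

```python
import zlib
from collections import Counter


def partition_of(key, num_partitions):
    """Deterministic stand-in for a hash partitioner."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def partition_loads(keys, num_partitions, num_salts=1):
    """Count rows per partition, optionally salting each key."""
    loads = Counter()
    for i, key in enumerate(keys):
        # Round-robin salt; real code often uses a random integer instead.
        salted = f"{key}_{i % num_salts}" if num_salts > 1 else key
        loads[partition_of(salted, num_partitions)] += 1
    return loads


# 900 rows share one hot key; 100 rows have distinct keys.
keys = ["hot"] * 900 + [f"k{i}" for i in range(100)]
unsalted = partition_loads(keys, num_partitions=8)
salted = partition_loads(keys, num_partitions=8, num_salts=16)
print("largest partition without salting:", max(unsalted.values()))
print("largest partition with salting:", max(salted.values()))
```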

Question 12: What is the benefit of Delta Cache?

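Answer: Delta Cache, now called the disk cache, keeps copies of remote Parquet data files (including Delta tables) on the workers' local SSDs. Repeated reads are served locally, which reduces cloud storage I/O; the speedup depends on the workload and VM type.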

Question 13: What is the difference between OPTIMIZE and VACUUM?

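Answer: OPTIMIZE: Compacts small files into larger ones. VACUUM: Deletes data files that are no longer referenced by the table and are older than the retention threshold (7 days by default), which also limits time travel. Use OPTIMIZE for performance; VACUUM for storage cleanup.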
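
Example (sketch): The two commands do different jobs, which a toy model can make concrete. The functions below are simplified stand-ins, not Delta Lake's algorithms: compact packs small files into bins up to a target size, and vacuum drops files that are both unreferenced and older than a retention window. File names, sizes, and hours are invented. The real commands are OPTIMIZE table_name and VACUUM table_name RETAIN 168 HOURS.

```python
def compact(file_sizes_mb, target_mb):
    """Pack files into bins of at most target_mb (first-fit decreasing)."""
    bins = []
    for size in sorted(file_sizes_mb, reverse=True):
        for b in bins:
            if sum(b) + size <= target_mb:
                b.append(size)
                break
        else:
            bins.append([size])
    return [sum(b) for b in bins]


def vacuum(files, referenced, now_hours, retain_hours=168):
    """Keep files that are still referenced or younger than the retention window."""
    return {
        name: written_at
        for name, written_at in files.items()
        if name in referenced or now_hours - written_at < retain_hours
    }


compacted = compact([10, 20, 30, 40, 100], target_mb=128)
print(compacted)  # five files become two

files = {"a.parquet": 0, "b.parquet": 100, "c.parquet": 190}
remaining = vacuum(files, referenced={"c.parquet"}, now_hours=200)
print(remaining)  # a.parquet is old and unreferenced, so it is removed
```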

Question 14: How do you implement data quality in Databricks?

Answer: Delta Lake constraints (NOT NULL and CHECK), Great Expectations or similar libraries, custom validation logic, and Microsoft Purview integration for cataloging and lineage.
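
Example (sketch): Custom validation logic can be as simple as a set of named rules applied to each record. The plain-Python version below is a minimal illustration with invented rule names and rows. The first rule corresponds to what a Delta constraint such as ALTER TABLE t ADD CONSTRAINT amount_non_negative CHECK (amount >= 0) would enforce at write time.

```python
def validate(rows, rules):
    """Return (row_index, rule_name) for every failed check."""
    violations = []
    for index, row in enumerate(rows):
        for name, check in rules.items():
            if not check(row):
                violations.append((index, name))
    return violations


# Hypothetical rules and data for illustration.
rules = {
    "amount_non_negative": lambda r: r["amount"] is not None and r["amount"] >= 0,
    "id_present": lambda r: r.get("id") is not None,
}
rows = [
    {"id": 1, "amount": 10.0},
    {"id": None, "amount": 5.0},
    {"id": 3, "amount": -2.0},
]
violations = validate(rows, rules)
print(violations)
```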

Question 15: What is the benefit of Structured Streaming?

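Answer: Stream processing with end-to-end exactly-once guarantees (given replayable sources, checkpointing, and idempotent or transactional sinks such as Delta Lake), plus a single DataFrame API for batch and streaming code.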
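
Example (sketch): Exactly-once output relies on the sink ignoring a micro-batch it has already committed when that batch is replayed after a failure. The toy class below models that idea in plain Python; it is not the Structured Streaming API, and the class and attribute names are hypothetical.

```python
class IdempotentSink:
    """Toy sink that commits each micro-batch at most once."""

    def __init__(self):
        self.rows = []
        self.last_committed_batch = -1  # plays the role of the checkpoint

    def write_batch(self, batch_id, rows):
        if batch_id <= self.last_committed_batch:
            return False  # already committed, so the replay is ignored
        self.rows.extend(rows)
        self.last_committed_batch = batch_id
        return True


sink = IdempotentSink()
sink.write_batch(0, ["a", "b"])
sink.write_batch(1, ["c"])
replayed = sink.write_batch(1, ["c"])  # simulated retry after a crash
print(sink.rows, replayed)
```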

Question 16: How do you manage Databricks secrets?

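Answer: Store secrets in Databricks secret scopes (Databricks-backed or Azure Key Vault-backed) and read them with dbutils.secrets.get, or reference them from cluster Spark configuration and environment variables. Do not pass secrets through notebook widgets, and never hardcode them.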
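
Example (sketch): A cluster environment variable or Spark configuration value can point at a secret with the {{secrets/<scope>/<key>}} reference syntax instead of containing the secret itself. The helper below only formats such a reference; the helper name, scope, and key are hypothetical.

```python
def secret_reference(scope, key):
    """Format a secret reference for cluster Spark config or environment variables."""
    return "{{secrets/" + scope + "/" + key + "}}"


# Hypothetical scope and key names.
ref = secret_reference("kv-scope", "storage-key")
print(ref)

# Inside a notebook the value itself would be read with:
# value = dbutils.secrets.get(scope="kv-scope", key="storage-key")
```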

Question 17: What is the difference between Databricks SQL and notebooks?

Answer: Databricks SQL: SQL warehouses for dashboards and queries. Notebooks: Spark-based development with multiple languages.

Question 18: How do you implement CI/CD for Databricks?

Answer: Databricks Repos (Git integration), Terraform for infrastructure, Databricks CLI for deployment, and multi-environment setup.

Question 19: What is the benefit of Delta Sharing?

Answer: Securely share Delta tables across organizations without copying data. Uses open protocols for interoperability.

Question 20: How do you monitor Databricks costs?

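Answer: Track DBU consumption (usage reports and billing system tables), watch cluster utilization metrics, tag clusters and jobs for cost attribution, and analyze the combined DBU and VM charges in Azure Cost Management.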
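
Example (sketch): On Azure a cluster's cost has two parts, the DBU charge and the underlying VM charge. The function below shows the arithmetic; every rate in the example call is a made-up placeholder, not a published price.

```python
def estimate_cost(nodes, hours, dbu_per_node_hour, price_per_dbu, vm_price_per_hour):
    """Estimate cluster cost as DBU charge plus VM charge."""
    dbu_cost = nodes * hours * dbu_per_node_hour * price_per_dbu
    vm_cost = nodes * hours * vm_price_per_hour
    return {
        "dbu_cost": round(dbu_cost, 2),
        "vm_cost": round(vm_cost, 2),
        "total": round(dbu_cost + vm_cost, 2),
    }


# Hypothetical rates for illustration only.
estimate = estimate_cost(
    nodes=4, hours=10, dbu_per_node_hour=1.5, price_per_dbu=0.30, vm_price_per_hour=0.50
)
print(estimate)
```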

Question 21: What is the difference between MLflow and Databricks?

Answer: MLflow: Open-source ML lifecycle management. Databricks: Managed platform with MLflow integration. Use MLflow for tracking; Databricks for compute.

Question 22: How do you handle cluster sizing?

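Answer: Start small, enable autoscaling with sensible minimum and maximum worker counts, use spot instances for workers that can tolerate eviction, set auto-termination on interactive clusters, and right-size based on utilization metrics.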
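
Example (sketch): These sizing choices end up as fields in a cluster specification. The helper below builds a dictionary in the shape used by the Databricks Clusters API, with autoscaling bounds and Azure spot workers that fall back to on-demand. The helper name is hypothetical, and the runtime version and VM size are left as placeholders to fill in.

```python
def job_cluster_spec(min_workers, max_workers):
    """Build an autoscaling cluster spec that uses Azure spot VMs for workers."""
    if not 1 <= min_workers <= max_workers:
        raise ValueError("Expected 1 <= min_workers <= max_workers")
    return {
        "spark_version": "<runtime-version>",  # placeholder
        "node_type_id": "<vm-size>",           # placeholder
        "autoscale": {"min_workers": min_workers, "max_workers": max_workers},
        "azure_attributes": {
            "first_on_demand": 1,  # keep the first node (the driver) on-demand
            "availability": "SPOT_WITH_FALLBACK_AZURE",
            "spot_bid_max_price": -1,  # -1 means pay up to the on-demand price
        },
    }


spec = job_cluster_spec(min_workers=2, max_workers=8)
print(spec)
```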

Question 23: What is the benefit of notebooks in Databricks?

Answer: Collaborative development, multi-language support (Python, SQL, Scala, R), built-in visualization, and integration with Delta Lake.

Question 24: How do you implement data governance?

Answer: Unity Catalog for access control and lineage, Purview integration, cluster policies, and secret management.

Question 25: What is the lakehouse pattern?

Answer: Combines data lake storage (ADLS Gen2) with data warehouse capabilities (Delta Lake ACID transactions, schema enforcement).