Databricks on Azure Interview Q&A
25 interview questions on Azure Databricks, Spark, Delta Lake, and Unity Catalog
Question 1: What is Azure Databricks?
Answer: Managed Apache Spark platform on Azure. Provides collaborative workspace, managed clusters, Delta Lake integration, and Unity Catalog governance.
Question 2: What is the difference between Databricks and Synapse?
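Answer: Databricks: Spark-focused, notebook-based, Delta Lake native. Synapse: SQL-focused, with dedicated and serverless SQL pools and T-SQL, though it also offers Apache Spark pools. Favor Databricks for Spark-centric engineering and ML; favor Synapse for T-SQL-centric warehousing.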
Question 3: What is Delta Lake?
Answer: Open-source storage layer providing ACID transactions, schema evolution, time travel, and data skipping on Parquet files.
Question 4: What is Unity Catalog?
Answer: Centralized governance layer for Databricks. Provides access control, auditing, lineage, and data discovery across workspaces, and underpins Delta Sharing and marketplace capabilities.
Question 5: What is the Photon engine?
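Answer: Vectorized C++ execution engine that accelerates Spark SQL and DataFrame operations, with speedups of up to roughly 2-8x depending on the workload. Enabled by default on SQL warehouses; on clusters it is selected per cluster through the Photon acceleration setting.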
Question 6: What is Adaptive Query Execution (AQE)?
Answer: Dynamically optimizes query plans based on runtime statistics. Adjusts join strategies, partition counts, and skew handling.
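As a minimal sketch of how these behaviors are switched on, the three settings below are the standard Spark SQL configuration keys for AQE. The helper name apply_aqe is hypothetical, and the sketch assumes a setter such as spark.conf.set is passed in, so it can also be exercised without a cluster.

```python
# Standard Spark 3.x configuration keys that control AQE.
AQE_SETTINGS = {
    "spark.sql.adaptive.enabled": "true",                     # master switch
    "spark.sql.adaptive.coalescePartitions.enabled": "true",  # merge small shuffle partitions
    "spark.sql.adaptive.skewJoin.enabled": "true",            # split skewed join partitions
}


def apply_aqe(conf_set, settings=AQE_SETTINGS):
    """Apply each setting through a two-argument setter such as spark.conf.set."""
    for key, value in settings.items():
        conf_set(key, value)
    return dict(settings)


# On a cluster (assumes an active SparkSession named `spark`):
#   apply_aqe(spark.conf.set)
```

On recent runtimes AQE is already on by default, so setting these keys explicitly mainly documents intent.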
Question 7: What is the benefit of Auto Loader?
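Answer: Incremental ingestion with automatic file discovery, schema inference and evolution, and checkpoint-based progress tracking. Discovers files by directory listing by default; in file notification mode it uses Event Grid and a storage queue for near-real-time detection.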
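A minimal sketch of an Auto Loader configuration follows. The option keys (cloudFiles.format, cloudFiles.schemaLocation, cloudFiles.useNotifications) are real Auto Loader options; the helper name auto_loader_options and all paths and table names in the comments are hypothetical placeholders.

```python
def auto_loader_options(file_format, schema_location, use_notifications=False):
    """Build the option map for a cloudFiles (Auto Loader) stream."""
    opts = {
        "cloudFiles.format": file_format,             # e.g. "json", "csv", "parquet"
        "cloudFiles.schemaLocation": schema_location,  # where the inferred schema is tracked
    }
    if use_notifications:
        # Switch from directory listing to file notification mode.
        opts["cloudFiles.useNotifications"] = "true"
    return opts


opts = auto_loader_options("json", "/tmp/example/_schemas", use_notifications=True)

# On a cluster (assumes `spark`; paths and table name are placeholders):
#   df = spark.readStream.format("cloudFiles").options(**opts).load(source_path)
#   (df.writeStream
#      .option("checkpointLocation", checkpoint_path)
#      .trigger(availableNow=True)
#      .toTable("bronze.events"))
```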
Question 8: How do you optimize Spark performance?
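Answer: 1) Enable Photon and AQE, 2) Tune partition count and size, 3) Broadcast small tables in joins, 4) Cache DataFrames that are reused, 5) Prefer built-in functions over UDFs, 6) Use Delta Lake OPTIMIZE with ZORDER BY on frequently filtered columns.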
Question 9: What is the difference between job clusters and interactive clusters?
Answer: Job clusters: Ephemeral, created per job run and terminated when it finishes. Interactive (all-purpose) clusters: Persist for development and can be shared. Job clusters are cost-efficient for scheduled workloads.
Question 10: What is Delta Lake time travel?
Answer: Query previous versions of Delta tables using version numbers or timestamps. Enables auditing, reprocessing, and debugging.
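The two query forms can be sketched as below. VERSION AS OF and TIMESTAMP AS OF are the Delta Lake SQL clauses; the helper name time_travel_query and the table name are hypothetical.

```python
def time_travel_query(table, version=None, timestamp=None):
    """Return a SELECT that reads a Delta table as of a version or a timestamp."""
    if (version is None) == (timestamp is None):
        raise ValueError("pass exactly one of version or timestamp")
    if version is not None:
        return f"SELECT * FROM {table} VERSION AS OF {int(version)}"
    return f"SELECT * FROM {table} TIMESTAMP AS OF '{timestamp}'"


by_version = time_travel_query("sales.orders", version=3)
by_time = time_travel_query("sales.orders", timestamp="2024-01-01")

# On a cluster (assumes `spark`):
#   spark.sql("DESCRIBE HISTORY sales.orders").show()  # list available versions
#   old_df = spark.sql(by_version)
```

How far back a query can go depends on the table's log and file retention, so old versions may no longer be readable after VACUUM.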
Question 11: How do you handle data skew in Spark?
Answer: Enable AQE skew join handling (spark.sql.adaptive.skewJoin.enabled), salt skewed keys, repartition on a better-distributed key, broadcast the smaller table, and tune shuffle partitions.
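Salting is the least obvious of these techniques, so here is a small plain-Python illustration of the idea, not Spark code. The large side gets a random suffix on each key, and the small side is replicated once per suffix so every salted key still finds its match. All names and data are made up for the example.

```python
import random
from collections import Counter


def salt_key(key, num_salts, rng):
    """Append a random salt so one hot key spreads over num_salts sub-keys."""
    return f"{key}_{rng.randrange(num_salts)}"


def explode_small_side(rows, num_salts):
    """Replicate each small-side row once per salt value."""
    return [(f"{key}_{s}", value) for key, value in rows for s in range(num_salts)]


NUM_SALTS = 8
rng = random.Random(0)

# Large side: one hot key dominates.
large_keys = ["hot"] * 1000 + ["cold"] * 10
salted_keys = [salt_key(k, NUM_SALTS, rng) for k in large_keys]
rows_per_key = Counter(salted_keys)  # "hot" is now split across several sub-keys

# Small side: replicated so each salted key has a match.
small = explode_small_side([("hot", "H"), ("cold", "C")], NUM_SALTS)
lookup = dict(small)
joined = [lookup[k] for k in salted_keys]  # same join result as without salting
```

In Spark the same pattern is usually written by adding a salt column to both DataFrames and joining on the original key plus the salt.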
Question 12: What is the benefit of Delta Cache?
Answer: Caches copies of remote Delta and Parquet data on the workers' local SSDs for faster repeated reads (now called the disk cache). Reduces storage I/O for repeated queries, by 2-10x in favorable cases.
Question 13: What is the difference between OPTIMIZE and VACUUM?
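Answer: OPTIMIZE: Compacts small files into larger ones. VACUUM: Removes data files that are no longer referenced by the table and are older than the retention threshold (7 days by default), which also limits how far back time travel can go. Use OPTIMIZE for performance; VACUUM for storage cleanup.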
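A minimal sketch of the two maintenance statements follows. The SQL syntax (OPTIMIZE ... ZORDER BY, VACUUM ... RETAIN n HOURS, DRY RUN) is Delta Lake's; the helper names and the table and column names are hypothetical.

```python
def optimize_sql(table, zorder_cols=()):
    """Compaction statement, optionally clustering data by the given columns."""
    sql = f"OPTIMIZE {table}"
    if zorder_cols:
        sql += " ZORDER BY (" + ", ".join(zorder_cols) + ")"
    return sql


def vacuum_sql(table, retain_hours=168, dry_run=False):
    """Cleanup statement; 168 hours matches the 7-day default retention."""
    sql = f"VACUUM {table} RETAIN {int(retain_hours)} HOURS"
    if dry_run:
        sql += " DRY RUN"  # only list the files that would be deleted
    return sql


statements = [
    optimize_sql("sales.orders", ["customer_id"]),
    vacuum_sql("sales.orders", dry_run=True),
]

# On a cluster (assumes `spark`):
#   for stmt in statements:
#       spark.sql(stmt)
```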
Question 14: How do you implement data quality in Databricks?
Answer: Great Expectations, Delta Lake constraints, custom validation logic, and Purview integration.
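Two of these options can be sketched briefly: Delta Lake constraints, expressed as ALTER TABLE statements, and custom validation logic, shown here as a plain-Python rule check. The helper names, table, columns, and sample rows are all hypothetical.

```python
def check_constraint_sql(table, name, condition):
    """Delta CHECK constraint: writes that violate the condition are rejected."""
    return f"ALTER TABLE {table} ADD CONSTRAINT {name} CHECK ({condition})"


def not_null_sql(table, column):
    """Delta NOT NULL constraint on a single column."""
    return f"ALTER TABLE {table} ALTER COLUMN {column} SET NOT NULL"


def find_violations(rows, rule):
    """Custom validation: return the rows for which the rule does not hold."""
    return [row for row in rows if not rule(row)]


ddl = [
    check_constraint_sql("sales.orders", "positive_amount", "amount > 0"),
    not_null_sql("sales.orders", "order_id"),
]

sample = [{"order_id": 1, "amount": 10.0}, {"order_id": 2, "amount": -5.0}]
bad_rows = find_violations(sample, lambda r: r["amount"] > 0)

# On a cluster (assumes `spark`):
#   for stmt in ddl:
#       spark.sql(stmt)
```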
Question 15: What is the benefit of Structured Streaming?
Answer: Stream processing with exactly-once guarantees (given checkpointing and an idempotent or transactional sink such as Delta Lake), plus a unified API for batch and streaming code.
Question 16: How do you manage Databricks secrets?
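Answer: Databricks secret scopes (Databricks-backed or Azure Key Vault-backed), read with dbutils.secrets.get. Cluster environment variables can reference a secret as {{secrets/<scope>/<key>}}. Notebook widgets are for parameters, not secrets. Never hardcode secrets.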
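A minimal sketch of reading credentials from a secret scope follows. dbutils.secrets.get(scope=..., key=...) is the Databricks utility call; the scope name, key names, and the connection_options helper are hypothetical, and the stand-in object exists only so the sketch runs outside a notebook.

```python
from types import SimpleNamespace


def connection_options(dbutils, scope):
    """Fetch credentials at run time instead of hardcoding them."""
    return {
        "user": dbutils.secrets.get(scope=scope, key="sql-user"),
        "password": dbutils.secrets.get(scope=scope, key="sql-password"),
    }


class _FakeSecrets:
    """Local stand-in for dbutils.secrets, for illustration only."""

    def __init__(self, store):
        self._store = store

    def get(self, scope, key):
        return self._store[(scope, key)]


fake_dbutils = SimpleNamespace(
    secrets=_FakeSecrets({
        ("kv-scope", "sql-user"): "svc_reader",
        ("kv-scope", "sql-password"): "example-only",
    })
)
options = connection_options(fake_dbutils, "kv-scope")

# In a notebook, pass the real `dbutils` object instead of the stand-in:
#   options = connection_options(dbutils, "kv-scope")
```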
Question 17: What is the difference between Databricks SQL and notebooks?
Answer: Databricks SQL: SQL warehouses for dashboards and queries. Notebooks: Spark-based development with multiple languages.
Question 18: How do you implement CI/CD for Databricks?
Answer: Databricks Repos (Git integration), Terraform for infrastructure, Databricks CLI for deployment, and multi-environment setup.
Question 19: What is the benefit of Delta Sharing?
Answer: Securely share Delta tables across organizations without copying data. Uses open protocols for interoperability.
Question 20: How do you monitor Databricks costs?
Answer: Tag clusters and jobs, track DBU consumption through Databricks usage reporting, review cluster utilization metrics, and use Azure Cost Management for the Azure-side bill.
Question 21: What is the difference between MLflow and Databricks?
Answer: MLflow: Open-source ML lifecycle management. Databricks: Managed platform with MLflow integration. Use MLflow for tracking; Databricks for compute.
Question 22: How do you handle cluster sizing?
Answer: Start small, auto-scale based on workload, use spot instances for cost savings, and right-size based on utilization metrics.
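These sizing choices map onto a cluster specification roughly as sketched below. The field names follow the Databricks Clusters API (autoscale, autotermination_minutes, azure_attributes); the helper name cluster_spec and the concrete values for the cluster name, VM size, and runtime version are illustrative assumptions.

```python
import json


def cluster_spec(name, min_workers, max_workers, node_type_id, spark_version,
                 use_spot=True):
    """Build an autoscaling cluster definition, optionally on spot VMs."""
    if not 1 <= min_workers <= max_workers:
        raise ValueError("need 1 <= min_workers <= max_workers")
    spec = {
        "cluster_name": name,
        "spark_version": spark_version,
        "node_type_id": node_type_id,
        "autoscale": {"min_workers": min_workers, "max_workers": max_workers},
        "autotermination_minutes": 30,  # stop paying for idle clusters
    }
    if use_spot:
        spec["azure_attributes"] = {
            "first_on_demand": 1,                        # keep the driver on-demand
            "availability": "SPOT_WITH_FALLBACK_AZURE",  # fall back if spot is unavailable
            "spot_bid_max_price": -1,                    # -1 caps the bid at the on-demand price
        }
    return spec


spec = cluster_spec("etl-nightly", 2, 8, "Standard_DS3_v2", "13.3.x-scala2.12")
payload = json.dumps(spec, indent=2)
```

Starting with a low min_workers and revising max_workers after reviewing utilization keeps the "start small" advice measurable.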
Question 23: What is the benefit of notebooks in Databricks?
Answer: Collaborative development, multi-language support (Python, SQL, Scala, R), built-in visualization, and integration with Delta Lake.
Question 24: How do you implement data governance?
Answer: Unity Catalog for access control and lineage, Purview integration, cluster policies, and secret management.
Question 25: What is the lakehouse pattern?
Answer: Combines data lake storage (ADLS Gen2) with data warehouse capabilities (Delta Lake ACID transactions, schema enforcement).