Databricks/Spark Architecture Overview
Databricks on AWS:
- Managed Spark clusters
- Delta Lake (ACID transactions)
- Unity Catalog (governance)
- MLflow (ML lifecycle)

Spark Components:
- Driver → Cluster Manager → Executors
- RDD → DataFrame → Dataset
- Spark SQL, Spark Streaming, MLlib, GraphX

Storage Integration:
- S3: Primary storage
- Delta Lake: ACID layer on S3
- DBFS: Databricks File System (mounts to S3)
Q1: How does Databricks differ from EMR?
Answer:
Comparison:
| Feature | Databricks | EMR |
|---|---|---|
| Management | Fully managed platform | AWS-managed service; you configure and operate the clusters |
| Spark | Optimized Databricks Runtime | Apache Spark with the EMR runtime |
| UI | Collaborative notebooks | Primarily CLI/API |
| Governance | Unity Catalog | IAM-based |
| Cost | DBUs plus cloud instance cost | EC2 instance cost plus EMR fee |
Choose Databricks when:
- Need collaborative notebooks
- Require built-in ML tools
- Want optimized Spark runtime
- Need Unity Catalog governance
Choose EMR when:
- Need full cluster control
- Custom Spark applications
- Long-running clusters
- Cost is primary concern
Decision Tree:
Need collaborative notebooks?
- Yes → Databricks
- No → Need custom JARs?
  - Yes → EMR
  - No → Databricks
Q2: What is Delta Lake and why use it?
Answer:
Delta Lake Features:
ACID Transactions:
- Atomic: All or nothing
- Consistent: Data integrity
- Isolated: Concurrent reads/writes
- Durable: Persistent storage

Time Travel:
- Version history
- Query by version/timestamp
- Rollback support

Schema Evolution:
- Add columns
- Rename columns (requires column mapping to be enabled)
- Change data types (limited; larger changes mean rewriting the table)

Performance:
- Z-Order indexing
- Data skipping
- Compaction
Delta Lake Usage:
# Write a DataFrame to a Delta table
df.write.format("delta").save("/delta/events")
# Read an earlier version with time travel
df = spark.read.format("delta").option("versionAsOf", 5).load("/delta/events")
# SQL syntax: register the table, then query it by path
spark.sql("CREATE TABLE events USING delta LOCATION '/delta/events'")
spark.sql("SELECT * FROM delta.`/delta/events`")
Q3: How do you optimize Spark job performance?
Answer:
Spark Optimization Techniques:
Memory Management:
- spark.executor.memory: 8-16 GB
- spark.memory.fraction: 0.6-0.8
- spark.memory.storageFraction: 0.5

Partitioning:
- spark.sql.shuffle.partitions: 200 (default)
- Target: 128MB-1GB per partition
- Use repartition() or coalesce()

Serialization:
- Kryo serializer (faster than Java serialization)
- spark.serializer: org.apache.spark.serializer.KryoSerializer

Caching:
- cache() / persist() for DataFrames
- unpersist() when no longer needed
- Use appropriate storage level
Spark Configuration:
# Example starting values; tune them per workload
# SQL options that can be changed on a running session
spark.conf.set("spark.sql.shuffle.partitions", "200")
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
# Executor and serializer options are fixed once the application starts.
# Set them in the cluster's Spark config (or when building the session),
# not with spark.conf.set():
#   spark.executor.memory 8g
#   spark.executor.cores 4
#   spark.executor.instances 10
#   spark.serializer org.apache.spark.serializer.KryoSerializer
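The partition-size target above can be turned into a concrete number with simple arithmetic. The sketch below is plain Python, not a Spark API: the helper name `suggest_shuffle_partitions` and the rounding-to-core-count rule are assumptions made for illustration, and the input sizes are invented.

```python
import math

MIB = 1024 ** 2
GIB = 1024 ** 3


def suggest_shuffle_partitions(shuffle_bytes, target_partition_bytes=128 * MIB, total_cores=1):
    """Return a partition count that keeps each partition near the target size.

    The count is rounded up to a multiple of total_cores so that every core
    has work in each wave of tasks. (Hypothetical helper, not part of Spark.)
    """
    if shuffle_bytes <= 0:
        return total_cores
    needed = math.ceil(shuffle_bytes / target_partition_bytes)
    waves = math.ceil(needed / total_cores)
    return waves * total_cores


# 100 GiB of shuffle data, 128 MiB target, 10 executors x 4 cores.
partitions = suggest_shuffle_partitions(100 * GIB, 128 * MIB, total_cores=40)
print(partitions)
```

The result could then be passed to `spark.sql.shuffle.partitions`. With adaptive coalescing enabled, Spark may still merge small partitions at runtime, so a value like this is a starting point rather than a fixed answer.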
Q4: How do you handle data skew in Spark?
Answer:
Data Skew Solutions:
Detection:
- Check partition sizes
- Monitor task execution times
- Use Spark UI to identify hot partitions

Salting Technique:
    from pyspark.sql.functions import col, concat, lit, rand
    # Spread each key over 10 salted variants on the skewed side.
    # The other side of the join must be replicated once per salt value (0-9).
    df_salted = df.withColumn(
        "salt",
        (rand() * 10).cast("int")
    ).withColumn(
        "salted_key",
        concat(col("key"), lit("_"), col("salt"))
    )

Broadcast Join:
    from pyspark.sql.functions import broadcast
    result = large_df.join(broadcast(small_df), "key")

AQE (Adaptive Query Execution):
    spark.conf.set("spark.sql.adaptive.enabled", "true")
    spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
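Salting only works if both sides of the join agree on the salted key. The following is a small pure-Python simulation (no Spark involved) that tries to show the idea: the skewed side gets a random salt, the small side is replicated once per salt value, and the join result keeps the same number of rows while the largest key group shrinks. All data, names, and the in-memory `hash_join` helper are made up for illustration.

```python
import random
from collections import Counter

NUM_SALTS = 10  # plays the same role as the 10 in (rand() * 10)


def salt_large_side(rows, num_salts, seed=0):
    """Append a random salt to each key on the skewed (large) side."""
    rng = random.Random(seed)
    return [(f"{key}_{rng.randrange(num_salts)}", value) for key, value in rows]


def explode_small_side(rows, num_salts):
    """Replicate each row on the small side once per salt value."""
    return [(f"{key}_{salt}", value) for key, value in rows for salt in range(num_salts)]


def hash_join(left, right):
    """Inner join two lists of (key, value) pairs on key."""
    index = {}
    for key, value in right:
        index.setdefault(key, []).append(value)
    return [(key, lv, rv) for key, lv in left for rv in index.get(key, [])]


# 9,000 rows share one hot key; 1,000 rows are spread over 100 other keys.
large = [("hot", i) for i in range(9000)] + [(f"k{i % 100}", i) for i in range(1000)]
small = [("hot", "H")] + [(f"k{i}", f"v{i}") for i in range(100)]

salted_large = salt_large_side(large, NUM_SALTS)
plain = hash_join(large, small)
salted = hash_join(salted_large, explode_small_side(small, NUM_SALTS))

largest_plain = max(Counter(key for key, _ in large).values())
largest_salted = max(Counter(key for key, _ in salted_large).values())
print(len(plain), len(salted), largest_plain, largest_salted)
```

The trade-off is that the small side grows by a factor of `NUM_SALTS`, so the salt count is usually kept modest.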
Q5: How do you implement Delta Lake CDC?
Answer:
Delta Lake CDC:
Merge Operation:
    deltaTable.alias("target").merge(
        source.alias("source"),
        "target.id = source.id"
    ).whenMatchedUpdateAll().whenNotMatchedInsertAll().execute()

Change Data Feed:
- Enable change data feed
- Query changes by version
- Track insert, update, delete operations

Auto Loader:
- Incremental file ingestion
- Schema inference and evolution
- Exactly-once processing
Merge Example:
from delta.tables import DeltaTable
# Load target table
deltaTable = DeltaTable.forPath(spark, "/delta/customers")
# Merge operation; source is a DataFrame holding the incoming changes
deltaTable.alias("target").merge(
    source.alias("source"),
    "target.customer_id = source.customer_id"
).whenMatchedUpdate(
    set={
        "name": "source.name",
        "email": "source.email",
        "updated_at": "current_timestamp()"
    }
).whenNotMatchedInsert(
    values={
        "customer_id": "source.customer_id",
        "name": "source.name",
        "email": "source.email",
        "created_at": "current_timestamp()"
    }
).execute()
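To make the matched/not-matched logic concrete, here is a minimal in-memory model of upsert semantics in plain Python. It is not the Delta Lake API: `merge_upsert` is a hypothetical helper, the sample rows are invented, and it assumes each key appears at most once in the source, which a real merge may treat differently.

```python
def merge_upsert(target, source, key, update_cols):
    """Apply upsert semantics: update matched rows, insert unmatched ones.

    target and source are lists of dicts; a new list is returned and the
    inputs are left untouched.
    """
    merged = {row[key]: dict(row) for row in target}
    for row in source:
        if row[key] in merged:
            # Matched: overwrite only the listed columns.
            for col in update_cols:
                merged[row[key]][col] = row[col]
        else:
            # Not matched: insert the new row as-is.
            merged[row[key]] = dict(row)
    return list(merged.values())


target = [
    {"customer_id": 1, "name": "Ada", "email": "ada@old.example"},
    {"customer_id": 2, "name": "Grace", "email": "grace@example.com"},
]
source = [
    {"customer_id": 1, "name": "Ada", "email": "ada@new.example"},
    {"customer_id": 3, "name": "Linus", "email": "linus@example.com"},
]

result = merge_upsert(target, source, "customer_id", ["name", "email"])
for row in result:
    print(row)
```

Customer 1 is updated, customer 2 is untouched, and customer 3 is inserted, which mirrors the two branches of the merge.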
Q6: How do you implement Delta Lake time travel?
Answer:
Time Travel Implementation:
Query by Version:
    df = spark.read.format("delta").option(
        "versionAsOf", 5
    ).load("/delta/events")

Query by Timestamp:
    df = spark.read.format("delta").option(
        "timestampAsOf", "2024-01-15"
    ).load("/delta/events")

History:
    deltaTable.history()
    # Returns version, timestamp, operation, etc.

Rollback:
    deltaTable.restoreToVersion(5)
    # Restores table to version 5
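A timestamp query has to be mapped to a table version first. The sketch below models that lookup in plain Python as "latest commit at or before the requested time". The commit list is invented and `version_as_of` is a hypothetical helper; the real engine's behaviour at the edges (for example, a timestamp later than the newest commit) may differ from this simplified model.

```python
from bisect import bisect_right
from datetime import datetime

# (version, commit time) pairs, oldest first; values are made up.
COMMITS = [
    (0, datetime(2024, 1, 10, 9, 0)),
    (1, datetime(2024, 1, 12, 14, 30)),
    (2, datetime(2024, 1, 14, 8, 15)),
    (3, datetime(2024, 1, 16, 11, 45)),
]


def version_as_of(commits, timestamp):
    """Return the latest version committed at or before timestamp."""
    times = [committed_at for _, committed_at in commits]
    position = bisect_right(times, timestamp)
    if position == 0:
        raise ValueError("timestamp is before the first commit")
    return commits[position - 1][0]


resolved = version_as_of(COMMITS, datetime(2024, 1, 15))
print(resolved)
```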
Q7: How do you optimize Delta Lake performance?
Answer:
Delta Lake Optimization:
Z-Order Indexing:
    OPTIMIZE events ZORDER BY (event_date, user_id)
- Co-locate related data
- Improve query performance
- Data skipping enabled

Compaction:
    OPTIMIZE events
- Combine small files
- Reduce file count
- Improve read performance

Data Skipping:
- Statistics-based pruning
- Min/max values per file
- Statistics are collected automatically; Z-Ordering makes pruning more effective

Partitioning:
    (df.write.partitionBy("year", "month")
       .format("delta").save("/delta/events"))
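Why clustering helps data skipping can be seen with a toy model of per-file min/max statistics. The plain-Python sketch below is an illustration only: `FileStats`, `files_to_scan`, and the file ranges are invented, and real statistics cover more than a single integer column.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileStats:
    path: str
    min_value: int
    max_value: int


def files_to_scan(files, lower, upper):
    """Keep only files whose [min, max] range overlaps [lower, upper]."""
    return [f.path for f in files if f.max_value >= lower and f.min_value <= upper]


# After clustering, each file covers a narrow range of user_id values.
clustered = [
    FileStats("part-0", 1, 250),
    FileStats("part-1", 251, 500),
    FileStats("part-2", 501, 750),
    FileStats("part-3", 751, 1000),
]
# Without clustering, every file spans almost the whole range.
unclustered = [
    FileStats("part-0", 3, 998),
    FileStats("part-1", 1, 1000),
    FileStats("part-2", 7, 991),
    FileStats("part-3", 2, 999),
]

# Query: WHERE user_id BETWEEN 300 AND 320
clustered_scan = files_to_scan(clustered, 300, 320)
unclustered_scan = files_to_scan(unclustered, 300, 320)
print(clustered_scan, unclustered_scan)
```

With narrow, non-overlapping ranges only one file needs to be read; with wide ranges the statistics cannot rule out any file.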
Q8: How do you implement Delta Lake schema evolution?
Answer:
Schema Evolution:
Add Columns:
    (df.write.format("delta")
       .option("mergeSchema", "true")
       .mode("append").save("/delta/events"))

Overwrite Schema:
    (df.write.format("delta")
       .option("overwriteSchema", "true")
       .mode("overwrite").save("/delta/events"))

SQL Schema Evolution:
    ALTER TABLE events ADD COLUMNS (new_col STRING)
    -- Renaming requires column mapping to be enabled on the table
    ALTER TABLE events RENAME COLUMN old_col TO new_name
Q9: How do you implement streaming with Delta Lake?
Answer:
Delta Lake Streaming:
Structured Streaming:
    streaming_df = (spark.readStream.format("delta")
                    .load("/delta/events"))

    # A checkpoint location is needed for fault-tolerant writes; the path is an example
    query = (streaming_df.writeStream.format("delta")
             .outputMode("append")
             .option("checkpointLocation", "/checkpoints/output")
             .start("/delta/output"))

Auto Loader:
    # Schema inference needs a schema location; the path is an example
    df = (spark.readStream.format("cloudFiles")
          .option("cloudFiles.format", "json")
          .option("cloudFiles.schemaLocation", "/schemas/input")
          .load("s3://bucket/input/"))

Output Modes:
- Append: Add new rows
- Update: Update existing rows
- Complete: Full result table
Q10: How do you implement Unity Catalog?
Answer:
Unity Catalog Architecture:
Three-Level Namespace:
    catalog.schema.table
- Catalog: Top-level grouping
- Schema: Logical grouping (database)
- Table: Data storage

Access Control:
- GRANT SELECT ON TABLE catalog.schema.table TO user
- Column-level permissions
- Row-level filtering

Lineage:
- Automatic lineage tracking
- Column-level lineage
- Integration with governance tools
Q11: How do you implement MLflow on Databricks?
Answer:
MLflow Implementation:
MLflow Components:
- Tracking: Log parameters, metrics, artifacts
- Projects: Reproducible ML code
- Models: Model format and serving
- Registry: Model versioning and lifecycle

Tracking:
    import mlflow

    # The context manager ends the run even if logging fails
    with mlflow.start_run():
        mlflow.log_param("learning_rate", 0.01)
        mlflow.log_metric("accuracy", 0.95)
        mlflow.log_artifact("model.pkl")  # the file must exist locally

Model Registry:
    # Replace <run_id> with the ID of the run that logged the model
    mlflow.register_model(
        "runs:/<run_id>/model",
        "production_model"
    )
Q12: How do you implement Databricks notebooks?
Answer:
Notebook Best Practices:
Organization:
- Use sections and headings
- Add markdown documentation
- Separate ETL, analysis, and visualization

Reproducibility:
- Use widgets for parameters
- Version control with Git
- Document dependencies

Performance:
- Cache DataFrames
- Use broadcast joins for small tables
- Monitor Spark UI

Collaboration:
- Use comments and documentation
- Share notebooks with teams
- Use version history
Q13: How do you implement Databricks jobs?
Answer:
Databricks Jobs:
Job Types:
- Notebook tasks
- JAR tasks
- Python script tasks
- Spark submit tasks

Scheduling:
- Cron expressions
- Dependency chains
- Retry policies

Cluster Configuration:
- New cluster (ephemeral)
- Existing cluster (shared)
- Auto-scaling configuration

Monitoring:
- Run history
- Alerting
- Logging
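Dependency chains and retry policies are configured in the job definition, but the underlying behaviour is easy to model. The following plain-Python sketch uses the standard library's `graphlib` to order tasks and a simple loop for retries. It is not the Databricks Jobs API: the task names, `run_job`, and `run_with_retries` are hypothetical, and Python 3.9 or later is assumed.

```python
from graphlib import TopologicalSorter


def run_with_retries(task, max_retries):
    """Call task() until it succeeds or the retries are exhausted."""
    for attempt in range(max_retries + 1):
        try:
            return task()
        except Exception:
            if attempt == max_retries:
                raise


def run_job(tasks, depends_on, max_retries=2):
    """Run tasks in dependency order and return the order in which they ran."""
    order = list(TopologicalSorter(depends_on).static_order())
    for name in order:
        run_with_retries(tasks[name], max_retries)
    return order


attempts = {"count": 0}


def flaky_transform():
    # Fails on the first call and succeeds on the second.
    attempts["count"] += 1
    if attempts["count"] < 2:
        raise RuntimeError("transient failure")


tasks = {"ingest": lambda: None, "transform": flaky_transform, "publish": lambda: None}
# Each task maps to the set of tasks that must finish before it starts.
depends_on = {"transform": {"ingest"}, "publish": {"transform"}}

order = run_job(tasks, depends_on)
print(order, attempts["count"])
```

A production retry policy would normally also wait between attempts; that is left out here to keep the sketch short.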
Q14: How do you optimize Databricks costs?
Answer:
Databricks Cost Optimization:
Cluster Optimization:
- Right-size instance types
- Use spot instances
- Auto-terminate idle clusters
- Use serverless for notebooks

DBU Optimization:
- Use appropriate compute tier
- Optimize job runtimes
- Use cached data

Storage Optimization:
- Delta Lake compaction
- Z-Order indexing
- File size optimization

Cost Monitoring:
- Usage dashboards
- Budget alerts
- Cost allocation by team/project
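On classic compute, a run is billed in two parts: DBUs charged by Databricks and instance time charged by the cloud provider. The sketch below works through that arithmetic in plain Python. Every rate in it is a placeholder chosen to keep the numbers round, not a real price, and `estimate_job_cost` is a hypothetical helper.

```python
def estimate_job_cost(nodes, hours, dbu_per_node_hour, dbu_rate, instance_rate):
    """Return (dbu_cost, instance_cost, total) for one job run.

    All rates are supplied by the caller; none of the values used below
    are real prices.
    """
    node_hours = nodes * hours
    dbu_cost = node_hours * dbu_per_node_hour * dbu_rate
    instance_cost = node_hours * instance_rate
    return dbu_cost, instance_cost, dbu_cost + instance_cost


# Same job on on-demand and on discounted spot instances (placeholder rates).
on_demand = estimate_job_cost(nodes=10, hours=2, dbu_per_node_hour=2,
                              dbu_rate=0.25, instance_rate=0.5)
spot = estimate_job_cost(nodes=10, hours=2, dbu_per_node_hour=2,
                         dbu_rate=0.25, instance_rate=0.125)
print(on_demand, spot)
```

In this model spot pricing lowers only the instance part of the bill, while a shorter runtime lowers both parts, which is why optimizing job runtimes appears under DBU optimization.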
Q15: How do you implement data quality in Databricks?
Answer:
Data Quality Implementation:
Delta Live Tables (DLT):
    import dlt

    @dlt.table
    def raw_data():
        # path is the source location, defined elsewhere
        return spark.read.format("json").load(path)

    # An expectation is attached to a table definition
    @dlt.table
    @dlt.expect("valid_amount", "amount > 0")
    def validated_data():
        return dlt.read("raw_data")

Great Expectations:
- Integrate with Databricks
- Define expectations
- Validate data quality
- Generate quality reports

Custom Validation:
- Define quality rules
- Implement validation functions
- Log quality metrics
- Alert on failures
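The custom-validation steps above (define rules, run them, log metrics, alert) can be sketched in a few lines of plain Python. This is an illustration under assumptions, not a Databricks feature: the rule names, sample rows, `validate` helper, and the 50% alert threshold are all made up.

```python
def validate(rows, rules):
    """Return per-rule failure counts and the rows that pass every rule."""
    failures = {name: 0 for name in rules}
    passed = []
    for row in rows:
        ok = True
        for name, check in rules.items():
            if not check(row):
                failures[name] += 1
                ok = False
        if ok:
            passed.append(row)
    return failures, passed


# Quality rules: each maps a name to a predicate over one row.
rules = {
    "valid_amount": lambda r: r["amount"] is not None and r["amount"] > 0,
    "has_customer": lambda r: bool(r.get("customer_id")),
}
rows = [
    {"customer_id": "c1", "amount": 10.0},
    {"customer_id": "c2", "amount": -5.0},
    {"customer_id": None, "amount": 3.0},
    {"customer_id": "", "amount": None},
]

failures, passed = validate(rows, rules)
failure_rate = 1 - len(passed) / len(rows)
print(failures)  # quality metrics that could be logged
if failure_rate > 0.5:  # hypothetical alert threshold
    print(f"ALERT: {failure_rate:.0%} of rows failed validation")
```

The same rule dictionary could be applied per batch, with the failure counts written to a metrics table so trends are visible over time.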
Q16: How do you implement Databricks security?
Answer:
Databricks Security:
Network Security:
- VPC deployment
- PrivateLink
- Security groups
- IP access lists

Data Security:
- Encryption at rest (SSE)
- Encryption in transit (TLS)
- Unity Catalog access control
- Column-level security

Identity & Access:
- SSO integration
- IAM role mapping
- SCIM provisioning
- Multi-factor authentication

Audit:
- Audit logs
- Unity Catalog audit trail
- AWS CloudTrail integration
Q17: How do you implement Databricks CI/CD?
Answer:
Databricks CI/CD:
Source Control:
- Git integration
- Branch support
- Commit history

Databricks Repos:
- Clone Git repos
- Branch management
- Commit and push

Deployment:
- Databricks CLI
- REST API
- Terraform provider

Testing:
- Unit tests
- Integration tests
- Data quality tests
Q18: How do you implement Databricks multi-hop architecture?
Answer:
Multi-Hop Architecture:
Bronze (Raw):
- Raw data as-is
- Schema on read
- Append-only

Silver (Validated):
- Cleaned data
- Schema enforced
- Deduplicated
- Validated

Gold (Business):
- Business-level aggregates
- Dimension/fact tables
- Optimized for queries

Delta Live Tables Implementation:
    import dlt

    @dlt.table
    def bronze():
        # Auto Loader stream; the format and path are examples
        return (spark.readStream.format("cloudFiles")
                .option("cloudFiles.format", "json")
                .load("/events/raw"))

    @dlt.table
    def silver():
        # Assumes a boolean is_valid column
        return dlt.read_stream("bronze").filter("is_valid")

    @dlt.table
    def gold():
        # count() stands in for the business aggregates
        return dlt.read("silver").groupBy("category").count()
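The responsibilities of each layer can be shown without any platform code. The plain-Python sketch below walks a handful of invented records through bronze, silver, and gold. The field names, `to_silver`, and `to_gold` are hypothetical, and a real pipeline would run these steps on DataFrames rather than lists.

```python
from collections import defaultdict


def to_silver(bronze_rows):
    """Drop invalid rows and de-duplicate on event_id (first occurrence wins)."""
    seen = set()
    silver = []
    for row in bronze_rows:
        if not row.get("is_valid") or row["event_id"] in seen:
            continue
        seen.add(row["event_id"])
        silver.append(row)
    return silver


def to_gold(silver_rows):
    """Aggregate validated rows into per-category totals."""
    totals = defaultdict(float)
    for row in silver_rows:
        totals[row["category"]] += row["amount"]
    return dict(totals)


# Bronze keeps everything as received, including duplicates and bad rows.
bronze = [
    {"event_id": 1, "category": "books", "amount": 12.5, "is_valid": True},
    {"event_id": 1, "category": "books", "amount": 12.5, "is_valid": True},   # duplicate
    {"event_id": 2, "category": "games", "amount": 30.0, "is_valid": False},  # rejected
    {"event_id": 3, "category": "books", "amount": 7.5, "is_valid": True},
    {"event_id": 4, "category": "games", "amount": 20.0, "is_valid": True},
]

silver = to_silver(bronze)
gold = to_gold(silver)
print(len(bronze), len(silver), gold)
```

Because bronze is never modified, the silver and gold steps can be re-run from the raw data whenever a cleaning rule changes.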
Q19: How do you implement Databricks for real-time?
Answer:
Real-Time Processing:
Structured Streaming:
- Read from Kafka/Kinesis
- Process with Spark
- Write to Delta Lake

Delta Live Tables (Streaming):
    import dlt

    @dlt.table
    def streaming_events():
        return (spark.readStream.format("cloudFiles")
                .option("cloudFiles.format", "json")
                .load("/events/raw"))

Latency Options:
- Micro-batch: Seconds latency
- Continuous: Millisecond latency (experimental in Spark)
Q20: How do you implement Databricks for ML?
Answer:
Databricks ML Architecture:
ML Pipeline:
    Data Prep → Feature Eng → Training → Evaluation → Deploy
- AutoML for quick prototyping
- MLflow for tracking
- Model Registry for versioning

Feature Store:
- Centralized feature repository
- Online/offline stores
- Feature sharing

Model Serving:
- Real-time endpoints
- Batch inference
- A/B testing
Q21: How do you implement Databricks governance?
Answer:
Databricks Governance:
Unity Catalog:
- Centralized governance
- Fine-grained access control
- Data lineage
- Data discovery

Access Control:
- GRANT/REVOKE statements
- Role-based access
- Column/row-level security

Audit:
- System table audit logs
- Access history
- Lineage tracking
Q22: How do you migrate to Databricks?
Answer:
Migration Strategy:
Assessment:
- Inventory existing workloads
- Identify dependencies
- Estimate costs
- Plan timeline

Migration Phases:
- Phase 1: Lift and shift (quick win)
- Phase 2: Optimize (performance)
- Phase 3: Modernize (Delta Lake, DLT)

Tools:
- Databricks Accelerate
- Migration assessment tools
- Automated code conversion
Q23: How do you implement Databricks best practices?
Answer:
Best Practices Checklist:
Code Organization:
- Use Databricks Repos for version control
- Modular notebook design
- Use widgets for parameters
- Document code and assumptions

Performance:
- Cache frequently used DataFrames
- Use broadcast joins for small tables
- Optimize partition counts
- Use Delta Lake for ACID transactions

Cost Optimization:
- Right-size clusters
- Use spot instances
- Auto-terminate idle clusters
- Monitor DBU usage

Governance:
- Implement Unity Catalog
- Use access controls
- Enable audit logging
- Track data lineage
Q24: How do you troubleshoot Databricks issues?
Answer:
Troubleshooting Guide:
Performance Issues:
- Check Spark UI for bottlenecks
- Monitor executor memory/disk
- Review shuffle read/write
- Check for data skew

Cluster Issues:
- Check cluster status
- Review driver/executor logs
- Monitor network connectivity
- Check instance availability

Data Issues:
- Verify data format
- Check schema compatibility
- Review file sizes
- Validate data quality
Q25: How do you implement Databricks for data engineering?
Answer:
Data Engineering Best Practices:
Architecture:
- Multi-hop architecture (Bronze/Silver/Gold)
- Delta Lake as foundation
- Unity Catalog for governance

ETL Patterns:
- Delta Live Tables for declarative ETL
- Auto Loader for incremental ingestion
- Merge for upserts

Scheduling:
- Databricks Jobs for orchestration
- Dependency chains
- Retry policies

Monitoring:
- Job run history
- Cluster metrics
- Cost tracking
Summary
Mastering Databricks and Spark requires understanding:
- Architecture: Lakehouse, multi-hop, Delta Lake
- Performance: Partitioning, caching, AQE
- Governance: Unity Catalog, access control, lineage
- ML: MLflow, Feature Store, Model Serving
- Cost: Cluster optimization, DBU management
These concepts form the foundation for building scalable data platforms with Databricks on AWS.