Data Lake Interview Q&A

25 interview questions on data lake architecture, ADLS Gen2, and governance

Question 1: What is the zone-based data lake architecture?

Answer: Raw zone (original format, immutable), Curated zone (analytics-ready, Delta Lake), Sandbox zone (exploration), Archive zone (compliance). Each zone has specific retention, access, and format policies.
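
As a rough illustration, the per-zone policies could be recorded in a small lookup table that pipelines consult before writing. Everything below is a hypothetical sketch: the path prefixes, retention periods, formats, and group names are placeholder assumptions, not values from any Azure guidance.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ZonePolicy:
    path_prefix: str               # hypothetical folder prefix for the zone
    file_format: str               # expected storage format in the zone
    retention_days: Optional[int]  # None means "keep indefinitely"
    writers: tuple                 # hypothetical groups allowed to write


# All values are illustrative assumptions, not recommendations.
ZONES = {
    "raw": ZonePolicy("raw/", "original", None, ("ingestion-pipelines",)),
    "curated": ZonePolicy("curated/", "delta", 730, ("data-engineers",)),
    "sandbox": ZonePolicy("sandbox/", "any", 90, ("analysts", "data-scientists")),
    "archive": ZonePolicy("archive/", "parquet", 2555, ("compliance",)),
}


def zone_for_path(path: str) -> str:
    """Return the name of the zone whose prefix matches the given lake path."""
    for name, policy in ZONES.items():
        if path.startswith(policy.path_prefix):
            return name
    raise ValueError(f"path {path!r} is outside every known zone")
```

A pipeline could call zone_for_path("curated/sales/2024/") and look up the matching policy before choosing a format or setting retention.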

Question 2: How do you implement POSIX ACLs in ADLS Gen2?

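Answer: Use az storage fs access set to assign user and group permissions on a file or directory. Set default ACLs on directories so that newly created children inherit them; existing children are not updated, so apply changes to them with az storage fs access set-recursive or update-recursive. The mask caps the effective permissions of named users, named groups, and the owning group. Verify the result with az storage fs access show.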
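
To make the role of the mask concrete, here is a minimal sketch that computes effective permissions from an ACL string in the short form accepted by the --acl argument (for example user::rwx,group::r-x,mask::r--,other::---). The function names are hypothetical, and the sketch models only the masking rule; it does not reproduce the full access-check algorithm of the service.

```python
def parse_acl(acl: str) -> list:
    """Split an ACL string into (scope, kind, entity, perms) tuples."""
    entries = []
    for raw in acl.split(","):
        parts = raw.strip().split(":")
        scope = "access"
        if parts[0] == "default":
            # Default entries are templates for new children of a directory.
            scope, parts = "default", parts[1:]
        kind, entity, perms = parts
        entries.append((scope, kind, entity, perms))
    return entries


def effective_permissions(acl: str) -> dict:
    """Apply the mask to named users, named groups, and the owning group.

    The owning user (user::) and other (other::) are not limited by the mask.
    Only access entries are evaluated; default entries are ignored here.
    """
    entries = [e for e in parse_acl(acl) if e[0] == "access"]
    mask = next((perms for _, kind, _, perms in entries if kind == "mask"), None)
    result = {}
    for _, kind, entity, perms in entries:
        if kind == "mask":
            continue
        masked = kind == "group" or (kind == "user" and entity != "")
        if mask is not None and masked:
            # Keep a permission bit only if both the entry and the mask grant it.
            perms = "".join(
                p if p != "-" and m != "-" else "-" for p, m in zip(perms, mask)
            )
        result[f"{kind}:{entity}"] = perms
    return result
```

With the ACL user::rwx,user:alice:rwx,group::r-x,mask::r--,other::--- the named user alice and the owning group are both reduced to r--, while the owning user keeps rwx.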

Question 3: What is the benefit of Hierarchical Namespace?

Answer: Directory operations, POSIX ACLs, atomic renames, improved Hadoop compatibility. Essential for data lake workloads requiring directory-level operations.

Question 4: How do you optimize ADLS Gen2 performance?

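Answer: 1) Enable hierarchical namespace, 2) Avoid small files (aim for roughly 128MB-1GB per file, consistent with Question 8), 3) Partition data so queries can prune what they read, 4) Access the account through the DFS endpoint with the ABFS driver (abfss://) rather than the Blob endpoint, 5) Parallelize reads and writes across files and partitions.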

Question 5: What is the difference between flat and hierarchical namespace?

Answer: Flat: Blob-level operations only. Hierarchical: Directory operations, POSIX ACLs, atomic renames. Use hierarchical for data lake workloads.

Question 6: How do you implement lifecycle management?

Answer: Configure rules to move data between tiers (Hot→Cool→Cold→Archive) based on modification time. Use prefix filters for zone-specific policies.
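
The sketch below builds one such rule as a Python dictionary and serializes it to JSON. It follows the shape of Azure Storage lifecycle management policies as I understand it (rules, filters.prefixMatch, actions.baseBlob); the rule name, prefix, and day thresholds are hypothetical, and the result should be checked against the current policy schema before use.

```python
import json


def lifecycle_rule(name: str, prefix: str, cool: int, cold: int, archive: int) -> dict:
    """Build one lifecycle rule that tiers block blobs under a prefix by age."""
    if not cool < cold < archive:
        raise ValueError("thresholds must increase: cool < cold < archive")
    return {
        "enabled": True,
        "name": name,
        "type": "Lifecycle",
        "definition": {
            # prefixMatch starts with the container name, then the zone folder.
            "filters": {"blobTypes": ["blockBlob"], "prefixMatch": [prefix]},
            "actions": {
                "baseBlob": {
                    "tierToCool": {"daysAfterModificationGreaterThan": cool},
                    "tierToCold": {"daysAfterModificationGreaterThan": cold},
                    "tierToArchive": {"daysAfterModificationGreaterThan": archive},
                }
            },
        },
    }


# Hypothetical container/prefix and thresholds, chosen only for illustration.
policy = {"rules": [lifecycle_rule("raw-zone-tiering", "datalake/raw/", 30, 90, 365)]}
policy_json = json.dumps(policy, indent=2)
```

Writing one rule per zone prefix keeps the thresholds for raw, curated, and archive data independent of each other.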

Question 7: What is the recommended file format for data lakes?

Answer: Parquet (columnar, compressed, analytics-optimized). Use Delta Lake for ACID transactions. Avoid CSV/JSON for large datasets.

Question 8: How do you handle small file problems?

Answer: Use OPTIMIZE (Delta Lake), compaction jobs, partition pruning, and file size optimization at ingestion. Aim for 128MB-1GB per file.
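
A compaction job first has to decide which small files to merge together. The following is a minimal planning sketch, assuming file sizes are already known in megabytes; the function name and the 512 MB default target are assumptions, and the greedy grouping is a simple heuristic rather than what OPTIMIZE does internally.

```python
def plan_compaction(file_sizes_mb: dict, target_mb: int = 512) -> list:
    """Group small files into batches whose total size stays at or below target_mb.

    Files already at or above the target are left alone. Uses a greedy
    first-fit pass over files sorted largest first; not an optimal packing.
    """
    small = sorted(
        ((name, size) for name, size in file_sizes_mb.items() if size < target_mb),
        key=lambda item: item[1],
        reverse=True,
    )
    batches = []  # each batch: {"files": [...], "total_mb": int}
    for name, size in small:
        for batch in batches:
            if batch["total_mb"] + size <= target_mb:
                batch["files"].append(name)
                batch["total_mb"] += size
                break
        else:
            # No existing batch has room, so start a new one.
            batches.append({"files": [name], "total_mb": size})
    return batches
```

Batches that end up with a single file gain nothing from rewriting and can be skipped by the job that executes the plan.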

Question 9: What is the difference between ADLS Gen2 and Azure Blob Storage?

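Answer: ADLS Gen2: Hierarchical namespace, POSIX ACLs, analytics-optimized. Blob Storage: Flat namespace, general-purpose object storage, well suited to media and other unstructured objects. ADLS Gen2 is built on Blob Storage with hierarchical namespace enabled. Use ADLS Gen2 for data lake workloads.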

Question 10: How do you implement data lake security?

Answer: Private Endpoints, Managed Identities, RBAC/ACLs, encryption at rest (CMK), encryption in transit (TLS 1.2), and monitoring with diagnostic settings.

Question 11: What is the benefit of Delta Lake in data lakes?

Answer: ACID transactions, schema evolution, time travel, data skipping, auto compaction. Enables reliable data lake operations with analytics capabilities.

Question 12: How do you handle data lake governance?

Answer: Purview for discovery/classification, sensitivity labels, business glossary, access policies, and audit logging. Implement data stewardship per domain.

Question 13: What is the difference between data lake and data warehouse?

Answer: Data Lake: Raw data, schema-on-read, diverse formats. Data Warehouse: Structured data, schema-on-write, optimized for analytics. A lakehouse architecture combines the two by adding warehouse capabilities on top of data lake storage.

Question 14: How do you implement data quality in data lakes?

Answer: Schema validation at ingestion, Great Expectations for validation, Delta Lake constraints, Purview classification, and quarantine zone for failed records.
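
The validate-then-quarantine flow can be sketched in plain Python. This is a stand-in for illustration only, not Great Expectations or Delta Lake constraints; the schema, field names, and function names are hypothetical.

```python
# Hypothetical schema: field name -> required Python type.
SCHEMA = {"device_id": str, "temperature": float, "ts": str}


def validate(record: dict, schema: dict = SCHEMA) -> list:
    """Return a list of human-readable problems; empty means the record passes."""
    problems = []
    for field, expected in schema.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            problems.append(f"{field}: expected {expected.__name__}")
    return problems


def split_records(records: list) -> tuple:
    """Route each record to the accepted list or the quarantine list."""
    accepted, quarantined = [], []
    for record in records:
        problems = validate(record)
        if problems:
            # Keep the original record with its failure reasons for later triage.
            quarantined.append({"record": record, "problems": problems})
        else:
            accepted.append(record)
    return accepted, quarantined
```

Storing the failure reasons next to each quarantined record makes it easier to fix and replay the data later instead of silently dropping it.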

Question 15: What is the benefit of data lake for ML?

Answer: Stores raw data for feature engineering, supports diverse formats, enables data exploration, and integrates with ML frameworks (Spark, TensorFlow).

Question 16: How do you handle data lake cost optimization?

Answer: Lifecycle management, compression, partition pruning, avoiding small files, and using appropriate storage tiers. Monitor costs with Azure Cost Management.

Question 17: What is the difference between data lake and data lakehouse?

Answer: Data Lake: Storage-focused. Data Lakehouse: Combines data lake storage with data warehouse capabilities (Delta Lake, ACID transactions, schema enforcement).

Question 18: How do you implement data lake for streaming?

Answer: Use Event Hubs Capture for auto-archiving, Delta Lake for streaming tables, Auto Loader in Databricks, and file naming conventions for partitioning.

Question 19: What is the benefit of ADLS Gen2 for analytics?

Answer: Optimized for Hadoop/Spark workloads, hierarchical namespace, POSIX ACLs, high throughput, and integration with Synapse Serverless.

Question 20: How do you handle data lake backup?

Answer: Use RA-GRS for geo-redundancy, soft delete to recover accidentally deleted data, Delta Lake time travel (RESTORE) to roll a table back to an earlier version, and cross-region copies of critical data.

Question 21: What is the difference between data lake zones and tiers?

Answer: Zones: Logical separation (raw, curated, sandbox). Tiers: Storage classes (Hot, Cool, Cold, Archive). Zones organize data; tiers optimize costs.

Question 22: How do you implement data lake for IoT?

Answer: Use Event Hubs for ingestion, partition by device/time, Delta Lake for storage, and Stream Analytics for real-time processing alongside batch analytics.
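
Partitioning by device and time usually comes down to a folder naming convention. Below is a minimal sketch that builds a Hive-style partition path; the base folder, the key names (date, hour, device_id), and the ordering of the keys are assumptions made for illustration.

```python
from datetime import datetime, timezone


def partition_path(base: str, device_id: str, event_time: datetime) -> str:
    """Build a Hive-style partition path keyed by date, hour, and device."""
    # Normalize to UTC so the same event always lands in the same partition.
    t = event_time.astimezone(timezone.utc)
    return (
        f"{base.rstrip('/')}/date={t:%Y-%m-%d}/hour={t:%H}/device_id={device_id}"
    )
```

For example, partition_path("raw/telemetry/", "sensor-42", datetime(2024, 3, 5, 14, 30, tzinfo=timezone.utc)) yields raw/telemetry/date=2024-03-05/hour=14/device_id=sensor-42. With very many devices, a device-level folder can produce a lot of small files, so some designs may prefer to keep device_id as a column instead.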

Question 23: What is the benefit of ADLS Gen2 for data engineering?

Answer: High throughput, hierarchical namespace, POSIX ACLs, Hadoop compatibility, and integration with Azure data services (ADF, Synapse, Databricks).

Question 24: How do you handle data lake migration?

Answer: Use AzCopy for file migration, Data Box for large datasets, Azure Database Migration Service (DMS) for database migration, and validation scripts for post-migration verification.
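
A post-migration validation script often compares a checksum manifest of the source with one of the target. The sketch below assumes both sides are reachable as local or mounted directories; the function names are hypothetical, and it reads each file fully into memory, so very large files would need chunked hashing instead.

```python
import hashlib
from pathlib import Path


def build_manifest(root: Path) -> dict:
    """Map each file's path (relative to root) to its SHA-256 hex digest."""
    manifest = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest[path.relative_to(root).as_posix()] = digest
    return manifest


def compare_manifests(source: dict, target: dict) -> dict:
    """Report files missing from the target, unexpected extras, and mismatches."""
    return {
        "missing": sorted(set(source) - set(target)),
        "extra": sorted(set(target) - set(source)),
        "mismatched": sorted(
            name for name in set(source) & set(target) if source[name] != target[name]
        ),
    }
```

An empty result in all three lists indicates that every source file arrived with identical content; row counts and schema checks would still be needed for table-level validation.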

Question 25: What is the future of data lakes?

Answer: Lakehouse architecture (combining data lake and warehouse), unified analytics (Fabric), real-time data lakes, and AI-integrated data platforms.