Data Lake Interview Q&A
25 interview questions on data lake architecture, ADLS Gen2, and governance
Question 1: What is the zone-based data lake architecture?
Answer: Raw zone (original format, immutable), Curated zone (analytics-ready, Delta Lake), Sandbox zone (exploration), Archive zone (compliance). Each zone has specific retention, access, and format policies.
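The zone policies above can be sketched as a small lookup table. This is only an illustration: the ZonePolicy class, the zone_for and can_overwrite helpers, the format labels, the 90-day sandbox retention, and the choice to treat the archive zone as immutable are all assumptions, not values from any Azure service.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ZonePolicy:
    file_format: str               # expected storage format in the zone
    immutable: bool                # True if existing files must never be rewritten
    retention_days: Optional[int]  # None means "no automatic expiry" (placeholder)


# Hypothetical policy table; the first path segment is assumed to name the zone.
ZONES = {
    "raw": ZonePolicy("source format", True, None),
    "curated": ZonePolicy("delta", False, None),
    "sandbox": ZonePolicy("any", False, 90),             # 90 is a placeholder value
    "archive": ZonePolicy("source format", True, None),  # immutability assumed here
}


def zone_for(path: str) -> ZonePolicy:
    """Look up the policy for a lake path such as 'raw/sales/2024/file.parquet'."""
    zone = path.strip("/").split("/", 1)[0]
    if zone not in ZONES:
        raise ValueError(f"unknown zone: {zone!r}")
    return ZONES[zone]


def can_overwrite(path: str) -> bool:
    """Return True when the zone's policy allows rewriting existing files."""
    return not zone_for(path).immutable
```

An ingestion job could call can_overwrite before writing, so that a rewrite of a raw-zone file is rejected early.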
Question 2: How do you implement POSIX ACLs in ADLS Gen2?
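Answer: Use az storage fs access set to assign user/group permissions on a file or directory. Set default ACLs on directories so that newly created children inherit them. The mask caps the effective permissions of named users, named groups, and the owning group. Verify the result with az storage fs access show.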
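A minimal sketch of how the mask limits effective permissions, and of how an ACL string in the short form (for example user::rwx,group::r-x,other::---) is assembled. The helper names effective_permissions and build_acl_spec are hypothetical and are not part of the Azure CLI or SDK.

```python
def effective_permissions(perms: str, mask: str) -> str:
    """Combine an rwx string with the mask: a bit survives only if both grant it."""
    return "".join(p if p != "-" and m != "-" else "-" for p, m in zip(perms, mask))


def build_acl_spec(entries, default=False):
    """Join (scope, qualifier, perms) tuples into a comma-separated ACL string.

    scope is 'user', 'group', 'mask', or 'other'; qualifier is an object ID,
    or '' for the owning user, owning group, mask, and other entries.
    With default=True every entry gets the 'default:' prefix used for inheritance.
    """
    prefix = "default:" if default else ""
    return ",".join(f"{prefix}{scope}:{qualifier}:{perms}"
                    for scope, qualifier, perms in entries)


acl = build_acl_spec([("user", "", "rwx"), ("group", "", "r-x"), ("other", "", "---")])
print(acl)
print(effective_permissions("rw-", "r-x"))
```

A string built this way is the kind of value that would be passed to the --acl argument of az storage fs access set.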
Question 3: What is the benefit of Hierarchical Namespace?
Answer: Directory operations, POSIX ACLs, atomic renames, improved Hadoop compatibility. Essential for data lake workloads requiring directory-level operations.
Question 4: How do you optimize ADLS Gen2 performance?
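Answer: 1) Enable hierarchical namespace, 2) Avoid small files (aim for 128MB-1GB per file), 3) Partition data so queries can skip irrelevant directories, 4) Access data through the ADLS Gen2 (DFS) endpoint, for example via the ABFS driver, 5) Parallelize file system operations across files and partitions.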
Question 5: What is the difference between flat and hierarchical namespace?
Answer: Flat: Blob-level operations only. Hierarchical: Directory operations, POSIX ACLs, atomic renames. Use hierarchical for data lake workloads.
Question 6: How do you implement lifecycle management?
Answer: Configure rules to move data between tiers (Hot → Cool → Cold → Archive) based on modification time. Use prefix filters for zone-specific policies.
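A sketch of what a zone-specific rule might look like when written out as a lifecycle policy document. The helper build_lifecycle_rule, the rule name, the raw/ prefix (which assumes a container named raw), and the 30/180-day thresholds are illustrative assumptions; check the field names against the current lifecycle policy schema before use.

```python
import json


def build_lifecycle_rule(name, prefix, cool_after_days, archive_after_days,
                         delete_after_days=None):
    """Build one lifecycle rule that tiers block blobs under `prefix` by age."""
    if archive_after_days <= cool_after_days:
        raise ValueError("archive threshold must be later than cool threshold")
    base_blob = {
        "tierToCool": {"daysAfterModificationGreaterThan": cool_after_days},
        "tierToArchive": {"daysAfterModificationGreaterThan": archive_after_days},
    }
    if delete_after_days is not None:
        base_blob["delete"] = {"daysAfterModificationGreaterThan": delete_after_days}
    return {
        "enabled": True,
        "name": name,
        "type": "Lifecycle",
        "definition": {
            "actions": {"baseBlob": base_blob},
            "filters": {"blobTypes": ["blockBlob"], "prefixMatch": [prefix]},
        },
    }


# Placeholder values: one rule scoped to the raw zone.
policy = {"rules": [build_lifecycle_rule("raw-zone-tiering", "raw/", 30, 180)]}
print(json.dumps(policy, indent=2))
```

Keeping one rule per zone prefix makes it easy to give the raw, curated, and sandbox zones different thresholds.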
Question 7: What is the recommended file format for data lakes?
Answer: Parquet (columnar, compressed, analytics-optimized). Use Delta Lake for ACID transactions. Avoid CSV/JSON for large datasets.
Question 8: How do you handle small file problems?
Answer: Use OPTIMIZE (Delta Lake), compaction jobs, partition pruning, and file size optimization at ingestion. Aim for 128MB-1GB per file.
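To illustrate what a compaction job does before it rewrites anything, here is a simplified planning sketch in plain Python. It is not how Delta Lake's OPTIMIZE is implemented; plan_compaction and its default thresholds (128 MB small-file cutoff, 512 MB target) are assumptions chosen to sit inside the 128MB-1GB range mentioned above.

```python
def plan_compaction(file_sizes_mb, target_mb=512, small_threshold_mb=128):
    """Group small files into batches whose combined size stays within target_mb.

    Files at or above small_threshold_mb are left alone. Batches containing a
    single file are dropped, since rewriting one file gains nothing.
    """
    small = sorted(size for size in file_sizes_mb if size < small_threshold_mb)
    groups, current, current_size = [], [], 0
    for size in small:
        # Close the current batch when adding this file would exceed the target.
        if current and current_size + size > target_mb:
            groups.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return [group for group in groups if len(group) > 1]


print(plan_compaction([10, 20, 600, 30]))
```

Each returned batch would become one output file; the 600 MB file in the example is already large enough and is not touched.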
Question 9: What is the difference between ADLS Gen2 and Azure Blob Storage?
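Answer: ADLS Gen2: Built on Blob Storage with hierarchical namespace enabled; adds POSIX ACLs and is analytics-optimized. Blob Storage: Flat namespace, general-purpose object storage suited to media, backups, and other unstructured content. Use ADLS Gen2 for data lake workloads.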
Question 10: How do you implement data lake security?
Answer: Private Endpoints, Managed Identities, RBAC/ACLs, encryption at rest (CMK), encryption in transit (TLS 1.2), and monitoring with diagnostic settings.
Question 11: What is the benefit of Delta Lake in data lakes?
Answer: ACID transactions, schema evolution, time travel, data skipping, auto compaction. Enables reliable data lake operations with analytics capabilities.
Question 12: How do you handle data lake governance?
Answer: Purview for discovery/classification, sensitivity labels, business glossary, access policies, and audit logging. Implement data stewardship per domain.
Question 13: What is the difference between data lake and data warehouse?
Answer: Data Lake: Raw data, schema-on-read, diverse formats. Data Warehouse: Structured data, schema-on-write, optimized for analytics. A lakehouse architecture combines the two.
Question 14: How do you implement data quality in data lakes?
Answer: Schema validation at ingestion, Great Expectations for validation, Delta Lake constraints, Purview classification, and quarantine zone for failed records.
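The validate-then-quarantine pattern can be sketched in plain Python as follows. This is a stand-in for the idea only; it does not use Great Expectations or Delta Lake constraints, and EXPECTED_SCHEMA, validate_record, and split_records are hypothetical names with a made-up schema.

```python
# Hypothetical schema: field name -> required Python type.
EXPECTED_SCHEMA = {"device_id": str, "temperature": float, "ts": str}


def validate_record(record, schema):
    """Return a list of problems found in one record (empty list means valid)."""
    errors = []
    for field, expected_type in schema.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            errors.append(
                f"{field}: expected {expected_type.__name__}, "
                f"got {type(record[field]).__name__}"
            )
    return errors


def split_records(records, schema):
    """Separate valid records from failures, keeping the reasons with each failure."""
    valid, quarantined = [], []
    for record in records:
        errors = validate_record(record, schema)
        if errors:
            quarantined.append({"record": record, "errors": errors})
        else:
            valid.append(record)
    return valid, quarantined


good = {"device_id": "d1", "temperature": 21.5, "ts": "2024-03-05T10:00:00Z"}
bad = {"device_id": "d2", "temperature": "hot"}
valid, quarantined = split_records([good, bad], EXPECTED_SCHEMA)
```

Valid records would continue to the curated zone, while the quarantined list, with its error messages, would be written to the quarantine zone for review.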
Question 15: What is the benefit of data lake for ML?
Answer: Stores raw data for feature engineering, supports diverse formats, enables data exploration, and integrates with ML frameworks (Spark, TensorFlow).
Question 16: How do you handle data lake cost optimization?
Answer: Lifecycle management, compression, partition pruning, avoiding small files, and using appropriate storage tiers. Monitor costs with Azure Cost Management.
Question 17: What is the difference between data lake and data lakehouse?
Answer: Data Lake: Storage-focused. Data Lakehouse: Combines data lake storage with data warehouse capabilities (Delta Lake, ACID transactions, schema enforcement).
Question 18: How do you implement data lake for streaming?
Answer: Use Event Hubs Capture for auto-archiving, Delta Lake for streaming tables, Auto Loader in Databricks, and file naming conventions for partitioning.
Question 19: What is the benefit of ADLS Gen2 for analytics?
Answer: Optimized for Hadoop/Spark workloads, hierarchical namespace, POSIX ACLs, high throughput, and integration with Synapse Serverless.
Question 20: How do you handle data lake backup?
Answer: Use RA-GRS for geo-redundancy, soft delete to recover accidental deletions, Delta Lake time travel (RESTORE) to roll tables back to an earlier version, and copies in a second region for critical data.
Question 21: What is the difference between data lake zones and tiers?
Answer: Zones: Logical separation (raw, curated, sandbox). Tiers: Storage classes (Hot, Cool, Cold, Archive). Zones organize data; tiers optimize costs.
Question 22: How do you implement data lake for IoT?
Answer: Use Event Hubs for ingestion, partition by device/time, Delta Lake for storage, and Stream Analytics for real-time processing alongside batch analytics.
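Partitioning by device and time usually comes down to a directory naming scheme. The sketch below shows one possible layout; the partition_path helper, the telemetry folder, and the key=value naming are assumptions for illustration, and with very many devices a per-device folder can itself produce small files.

```python
from datetime import datetime, timezone


def partition_path(zone: str, device_id: str, event_time: datetime) -> str:
    """Build a device/date partition directory for one event, normalized to UTC."""
    if event_time.tzinfo is None:
        raise ValueError("event_time must be timezone-aware")
    ts = event_time.astimezone(timezone.utc)
    return (
        f"{zone}/telemetry/device_id={device_id}"
        f"/year={ts:%Y}/month={ts:%m}/day={ts:%d}"
    )


event_time = datetime(2024, 3, 5, 23, 30, tzinfo=timezone.utc)
print(partition_path("raw", "sensor-42", event_time))
# prints raw/telemetry/device_id=sensor-42/year=2024/month=03/day=05
```

Normalizing to UTC before formatting keeps events from devices in different time zones in consistent daily folders.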
Question 23: What is the benefit of ADLS Gen2 for data engineering?
Answer: High throughput, hierarchical namespace, POSIX ACLs, Hadoop compatibility, and integration with Azure data services (ADF, Synapse, Databricks).
Question 24: How do you handle data lake migration?
Answer: Use AzCopy for file migration, Data Box for large datasets, Azure Database Migration Service (DMS) for database migration, and validation scripts for post-migration verification.
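A post-migration validation script can be as simple as comparing checksums between source and target. The sketch below works on local or mounted directories and assumes both sides are reachable as file paths; file_digest, build_manifest, and compare_manifests are hypothetical helper names.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks to bound memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path relative to root (POSIX style) to its digest."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): file_digest(path)
        for path in root.rglob("*")
        if path.is_file()
    }


def compare_manifests(source, target):
    """Report files missing from target, unexpected in target, or with different content."""
    return {
        "missing": sorted(set(source) - set(target)),
        "unexpected": sorted(set(target) - set(source)),
        "mismatched": sorted(
            key for key in source.keys() & target.keys() if source[key] != target[key]
        ),
    }
```

Typical use would be compare_manifests(build_manifest(source_dir), build_manifest(target_dir)), treating the migration as verified only when all three lists are empty.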
Question 25: What is the future of data lakes?
Answer: Lakehouse architecture (combining data lake and warehouse), unified analytics (Fabric), real-time data lakes, and AI-integrated data platforms.