Batch Analytics Interview Q&A

25 comprehensive interview questions on batch analytics, pipelines, and Azure data services

Question 1: What is the difference between ETL and ELT?

Answer: ETL (Extract, Transform, Load) transforms data before loading to the target system. ELT (Extract, Load, Transform) loads raw data first, then transforms using the target system's capabilities.

When to use ETL: Complex transformations requiring specialized compute (Databricks), data quality validation before load, data masking/anonymization requirements.

When to use ELT: Leveraging Synapse SQL for transformation, large datasets where loading raw data is faster, scenarios requiring reprocessing of historical data.

ℹ️

Pro Tip: Modern data engineering favors ELT patterns with cloud data warehouses (Synapse, Snowflake) that can handle transformations at scale.

Question 2: How do you handle incremental loading in ADF?

Answer: Use watermark patterns (track last processed timestamp/ID), CDC (Change Data Capture) for database sources, file-based triggers for file arrivals, or tumbling window triggers for time-based processing.

Question 3: Explain the difference between ADF Copy Activity and Data Flow.

Answer: Copy Activity uses optimized engines for simple data movement (faster, cheaper). Data Flow uses Spark for complex transformations (joins, aggregations, derived columns) with visual debugging.

Question 4: How do you optimize Synapse SQL Pool performance?

Answer: 1) Choose correct distribution (Hash for facts, Replicated for dims), 2) Use Clustered Columnstore Indexes, 3) Update statistics, 4) Implement partitioning, 5) Use CTAS for data loading, 6) Enable result set caching.

Question 5: What is the benefit of Delta Lake in data engineering?

Answer: ACID transactions, schema enforcement/evolution, time travel (versioning), data skipping (Z-ORDER), auto compaction, and streaming support. Enables reliable data lake operations.

Question 6: How do you implement data quality checks?

Answer: Use Great Expectations for Python-based validation, ADF Data Flows for visual quality rules, Purview for automated classification, and custom SQL checks in Synapse. Implement at ingestion, transformation, and loading stages.

Question 7: What is the difference between ADLS Gen2 and Blob Storage?

Answer: ADLS Gen2 has Hierarchical Namespace (POSIX ACLs, directory operations), optimized for analytics. Blob Storage is flat namespace, optimized for object storage. ADLS Gen2 is recommended for data lake workloads.

Question 8: How do you handle schema changes in data pipelines?

Answer: Use schema drift in ADF Data Flows, Delta Lake mergeSchema option, Purview for schema discovery, and schema registry for versioned schemas. Implement schema validation at ingestion.

Question 9: What is the recommended file format for data lakes?

Answer: Parquet (columnar, compressed, optimized for analytics). Use Delta Lake for ACID transactions. Avoid CSV/JSON for large datasets (no compression, no schema).

Question 10: How do you monitor data pipeline health?

Answer: Azure Monitor metrics/logs, ADF pipeline run history, custom KQL queries, Power BI dashboards, and alerting for failures. Implement end-to-end monitoring with diagnostic settings.

Question 11: What is the difference between dedicated and serverless SQL pools?

Answer: Dedicated: Reserved compute, predictable performance, DWU-based pricing. Serverless: Pay-per-TB-scanned, no infrastructure management, for ad-hoc exploration. Use dedicated for production; serverless for exploration.

Question 12: How do you implement slowly changing dimensions (SCD)?

Answer: SCD Type 1: Overwrite (MERGE UPDATE). SCD Type 2: Add new row with effective dates (MERGE INSERT). SCD Type 3: Add new column for previous value. Use Synapse SQL for implementation.

Question 13: What is the benefit of using Managed Identities?

Answer: Eliminates credential management, automatic token rotation, integrates with Azure AD, provides granular RBAC. Always prefer over storage keys or connection strings.

Question 14: How do you handle late-arriving data?

Answer: Stream Analytics late arrival policy (up to 5 days), ADF watermarks with flexible windows, Delta Lake time travel for reprocessing, and idempotent processing for duplicates.

Question 15: What is the difference between partitioning and bucketing?

Answer: Partitioning: Directory-based (date, region), query pattern optimization. Bucketing: Hash-based within partitions, fixed number of files. Use partitioning for query filters; bucketing for join optimization.

Question 16: How do you implement CI/CD for data pipelines?

Answer: ADF Git integration, ARM/Bicep templates for infrastructure, Azure DevOps pipelines for build/test/deploy, environment promotion with approval gates, and automated testing.

Question 17: What is the difference between hot, warm, and cold data?

Answer: Hot: Frequently accessed (last 30 days), Hot tier. Warm: Infrequent access (30-90 days), Cool tier. Cold: Rare access (90+ days), Archive tier. Use lifecycle management for automatic tiering.

Question 18: How do you handle data skew in Synapse?

Answer: Check distribution skew with DBCC PDW_SHOWSPACEUSED. Change distribution key to more evenly distributed column. Use Round Robin for staging. Consider replicated tables for small dimensions.

Question 19: What is the benefit of materialized views?

Answer: Pre-computed aggregations for faster query performance. Automatic refresh when base tables change. Reduce query complexity. Use for common aggregations in reporting.

Question 20: How do you implement data governance?

Answer: Purview for discovery/classification, sensitivity labels for protection, RBAC for access control, business glossary for standardization, and audit logging for compliance.

Question 21: What is the difference between batch and micro-batch processing?

Answer: Batch: Process data on schedule (hourly/daily). Micro-batch: Process small batches continuously (every few seconds). Use batch for large datasets; micro-batch for near-real-time requirements.

Question 22: How do you handle failed pipeline runs?

Answer: Implement retry policies, dead-letter queues for failed records, alerting for failures, logging to monitoring systems, and automated remediation where possible.

Question 23: What is the benefit of using Azure Data Factory?

Answer: Visual ETL orchestration, 90+ connectors, managed integration runtimes, monitoring, Git integration, and serverless execution. Simplifies complex data pipeline management.

Question 24: How do you optimize data lake storage costs?

Answer: Lifecycle management (Hot→Cool→Archive), file size optimization (aim for 1GB+), compression (Parquet+Snappy), partition pruning, and avoiding small files.

Question 25: What is the difference between OLTP and OLAP?

Answer: OLTP: Transactional (INSERT/UPDATE/DELETE), normalized schema, row-store. OLAP: Analytical (SELECT/AGGREGATE), denormalized schema, column-store. Use Synapse for OLAP; SQL DB for OLTP.