Mixed Topics & Comprehensive Review
Module 65: Complete Data Engineering Review & Mixed Interview Questions
AWS Data Engineering Services Map
AWS DATA ENGINEERING SERVICES ECOSYSTEM

INGESTION                         PROCESSING                        STORAGE
--------------------------------  --------------------------------  ---------------------
Kinesis Data Streams              AWS Glue (Spark/PySpark)          Amazon S3 (Data Lake)
Kinesis Firehose                  AWS EMR (Managed Hadoop)          Amazon DynamoDB
AWS MSK (Managed Kafka)           AWS Lambda (Serverless)           Amazon Redshift
AWS DMS (Database Migration)      Amazon SageMaker (ML)             Amazon ElastiCache

ANALYTICS                         ORCHESTRATION                     GOVERNANCE
--------------------------------  --------------------------------  ---------------------
Amazon Athena (Serverless Query)  AWS Step Functions                AWS Lake Formation
Amazon QuickSight (BI)            AWS MWAA (Managed Airflow)        AWS Glue Data Catalog
Amazon EMR Studio                 Amazon ECS/Fargate (Containers)   AWS Config

SECURITY                          MONITORING                        COST
--------------------------------  --------------------------------  ---------------------
AWS KMS (Keys)                    Amazon CloudWatch                 AWS Cost Explorer
AWS IAM (Access Control)          AWS CloudTrail                    AWS Budgets
Amazon Macie (Data Privacy)       AWS X-Ray                         S3 Intelligent-Tiering
Comprehensive Scenario-Based Questions
Scenario 1: E-Commerce Real-Time Analytics
E-COMMERCE REAL-TIME ANALYTICS ARCHITECTURE

Sources:     Web App | Mobile App | POS System | IoT Devices
                 |
                 v
Ingestion:   KINESIS DATA STREAMS
             (user clicks, search queries, transactions, payment events,
              inventory updates, sensor data)
                 |
                 +--> KINESIS ANALYTICS (Flink): session tracking, fraud detection
                 |        --> DYNAMODB (real-time features)
                 +--> LAMBDA PROCESSOR: enrichment, validation, transform
                 |        --> REDSHIFT (analytics)
                 +--> KINESIS FIREHOSE: S3 archive, Parquet
                          --> S3 DATA LAKE (historical)
                 |
                 v
Dashboards:  QUICKSIGHT (reads from DynamoDB, Redshift, and the S3 data lake)
Question: Design a real-time analytics solution for an e-commerce platform handling 100K events/second with <1 second latency requirements.
Answer:
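- Ingestion: Kinesis Data Streams sized by throughput. A provisioned shard accepts up to 1 MB/s or 1,000 records/s of writes, so 100K events/second needs at least 100 shards; split shards (or use on-demand mode) as traffic grows
- Processing: Kinesis Analytics with Flink (since renamed Amazon Managed Service for Apache Flink) for real-time aggregation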
- Storage: DynamoDB for real-time features, Redshift for analytics
- Visualization: QuickSight dashboards with SPICE for performance
- Cost optimization: Use Firehose for bulk loading to S3, Lambda for event processing
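The ingestion sizing can be sanity-checked with a little arithmetic. The sketch below is illustrative only: the function name, the average record sizes, and the headroom factor are assumptions, and 1 MB is treated as 1,000,000 bytes.

```python
import math

# Write limits for one provisioned Kinesis Data Streams shard.
SHARD_WRITE_BYTES_PER_SEC = 1_000_000   # 1 MB/s, taken here as 10**6 bytes (assumption)
SHARD_WRITE_RECORDS_PER_SEC = 1_000


def min_shards(events_per_sec: int, avg_record_bytes: int, headroom: float = 0.0) -> int:
    """Smallest shard count that satisfies both the record and the byte limit.

    headroom is a hypothetical safety margin, e.g. 0.25 for 25% spare capacity.
    """
    by_records = events_per_sec / SHARD_WRITE_RECORDS_PER_SEC
    by_bytes = events_per_sec * avg_record_bytes / SHARD_WRITE_BYTES_PER_SEC
    return math.ceil(max(by_records, by_bytes) * (1 + headroom))


if __name__ == "__main__":
    # 100K events/second with small (500-byte) records: the record limit dominates.
    print(min_shards(100_000, 500))
    # The same rate with 2 KB records: the byte limit dominates.
    print(min_shards(100_000, 2_000))
```

Whichever limit is hit first sets the shard count, which is why average record size belongs in the design discussion alongside the event rate.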
Scenario 2: Financial Data Warehouse Migration
FINANCIAL DATA WAREHOUSE MIGRATION ARCHITECTURE

CURRENT STATE            MIGRATION PATH      TARGET STATE
-----------------------  ------------------  ----------------------
Oracle Exadata (50TB)    DMS + CDC       ->  Aurora PostgreSQL
ETL Jobs (Informatica)   AWS Glue        ->  Glue Jobs (Spark)
Teradata (100TB)         SCT + DMS       ->  Redshift Serverless
Reporting (Cognos)                           QuickSight + Athena

Within each state, data flows top to bottom: source database, ETL, data warehouse, reporting.

MIGRATION PHASES:
Phase 1: Schema conversion + DMS full load
Phase 2: CDC cutover + ETL migration
Phase 3: Data warehouse migration
Phase 4: Reporting migration + validation
Question: A financial institution needs to migrate from Oracle/Teradata to AWS with zero downtime. How would you approach this?
Answer:
- Assessment: Use SCT to analyze schemas and estimate effort
- Database migration: DMS with CDC for Oracle to Aurora PostgreSQL
- Data warehouse: SCT data extraction agents to move Teradata data into Redshift, with Snowball for the bulk transfer if network bandwidth is the bottleneck
- ETL migration: Recreate Informatica jobs in AWS Glue
- Validation: Automated data comparison scripts
- Cutover: Blue-green deployment with gradual traffic shifting
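One way to picture the "automated data comparison scripts" step is a row-count plus checksum comparison between source and target. This is a minimal sketch, not a production validator: the function names are hypothetical, rows are assumed to arrive as plain tuples, and values are assumed to be normalized to the same string form on both sides (numeric and timestamp formatting often differs between engines).

```python
import hashlib
from typing import Iterable, Sequence


def table_fingerprint(rows: Iterable[Sequence[object]]) -> tuple[int, str]:
    """Return (row_count, digest) where the digest ignores row order."""
    count = 0
    combined = 0
    for row in rows:
        # Join columns with a separator unlikely to appear in data.
        # Note: None and "" encode identically in this simple sketch.
        encoded = "\x1f".join("" if v is None else str(v) for v in row)
        digest = hashlib.sha256(encoded.encode("utf-8")).digest()
        # Add (rather than XOR) so duplicated rows do not cancel out.
        combined = (combined + int.from_bytes(digest, "big")) % (1 << 256)
        count += 1
    return count, f"{combined:064x}"


def tables_match(source_rows, target_rows) -> bool:
    """True when both sides have the same rows, regardless of order."""
    return table_fingerprint(source_rows) == table_fingerprint(target_rows)


if __name__ == "__main__":
    source = [(1, "alice", 100), (2, "bob", 250)]
    target = [(2, "bob", 250), (1, "alice", 100)]
    print(tables_match(source, target))
```

In practice the comparison would run per table or per partition so that a mismatch points to a narrow range of rows.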
Scenario 3: Multi-Region Data Lake
MULTI-REGION DATA LAKE ARCHITECTURE

REGION A (US-EAST-1)                    REGION B (EU-WEST-1)
----------------------------            ----------------------------
S3 Data Lake                            S3 Data Lake
  Raw Zone (Bronze)      -- S3 CRR -->    Raw Zone (Bronze)
  Processed Zone (Silver)                 Processed Zone (Silver)
  Curated Zone (Gold)                     Curated Zone (Gold)
        |                                       |
        v                                       v
Redshift Cluster (Analytics)            Redshift Cluster (Analytics)
        |                                       |
        +-------------------+-------------------+
                            v
              Route 53 (Latency-based Routing)

DATA CONSISTENCY:
- S3 Cross-Region Replication for object replication
- DynamoDB Global Tables for real-time sync
- EventBridge for cross-region event distribution
Question: Design a multi-region data lake for a global company with data sovereignty requirements.
Answer:
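- Storage: S3 buckets per region, with CRR only for data that is allowed to leave its home region (sovereignty rules may forbid replicating regulated datasets)
- Catalog: a Glue Data Catalog in each region; the catalog is a regional service, so keep table definitions in sync by deploying them to every region
- Analytics: Redshift clusters per region. Global Tables is a DynamoDB feature rather than a Redshift one; use it for the real-time store, and cross-region snapshot copy for Redshift disaster recovery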
- Access: Route 53 latency-based routing, cross-account roles
- Compliance: Lake Formation for region-specific access controls
Mixed Technical Questions
Data Modeling & Design
Q1: How do you handle slowly changing dimensions (SCD) in AWS?
Answer: Use Glue ETL with custom logic for SCD Type 1 (overwrite), Type 2 (history), Type 3 (limited history). Store in S3 with partitioning by effective date, query with Athena or load to Redshift.
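The Type 2 "history" logic is the part interviewers usually probe. Below is a small in-memory sketch of it; the `DimRow` class and `apply_scd2` function are hypothetical names, and in a real pipeline the same steps would run as a Glue/Spark job or a warehouse MERGE rather than over Python lists.

```python
from dataclasses import dataclass, replace
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class DimRow:
    key: str                          # business key, e.g. customer id
    attrs: tuple                      # tracked attribute values
    valid_from: date
    valid_to: Optional[date] = None   # None marks the current version


def apply_scd2(dim: list[DimRow], key: str, attrs: tuple, as_of: date) -> list[DimRow]:
    """Return a new dimension list with one SCD Type 2 change applied."""
    current = next((r for r in dim if r.key == key and r.valid_to is None), None)
    if current is not None and current.attrs == attrs:
        return list(dim)  # nothing changed: keep history untouched
    # Close the current version (if any), then append the new current version.
    result = [replace(r, valid_to=as_of) if r is current else r for r in dim]
    result.append(DimRow(key, attrs, valid_from=as_of))
    return result


if __name__ == "__main__":
    dim = apply_scd2([], "c1", ("Austin",), date(2024, 1, 1))
    dim = apply_scd2(dim, "c1", ("Denver",), date(2024, 6, 1))
    for row in dim:
        print(row)
```

Type 1 would simply overwrite `attrs` on the current row, and Type 3 would keep a single "previous value" column instead of extra rows.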
Q2: Explain schema-on-read vs schema-on-write in data lakes.
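Answer: Schema-on-write: define and enforce the schema before writing (RDBMS, data warehouses). Schema-on-read: store data as-is and apply the schema when reading (data lakes). On AWS, the Glue Data Catalog holds the table definitions that query engines such as Athena apply at read time. Trade-off: schema-on-read gives flexibility for new and changing sources; schema-on-write gives upfront validation and more predictable query performance.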
Q3: How do you optimize Parquet files for Athena queries?
Answer: Use columnar compression (Snappy, GZIP), partition data by query patterns, optimize file sizes (128MB-1GB), use Glue compaction jobs, and implement bucketing for high-cardinality columns.
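Two of these points, partitioning by query pattern and targeting a file size, are easy to show concretely. The sketch below assumes a date-partitioned table with Hive-style `key=value` prefixes and a 256 MB target file size picked from within the range above; the function names and table name are illustrative.

```python
from datetime import date


def partition_prefix(table: str, event_date: date) -> str:
    """Hive-style partition prefix so date filters can prune partitions."""
    return (
        f"{table}/year={event_date.year:04d}"
        f"/month={event_date.month:02d}/day={event_date.day:02d}/"
    )


def files_needed(total_bytes: int, target_file_bytes: int = 256 * 1024**2) -> int:
    """How many output files a compaction job should write for one partition."""
    # Ceiling division, with a floor of one file for tiny partitions.
    return max(1, -(-total_bytes // target_file_bytes))


if __name__ == "__main__":
    print(partition_prefix("orders", date(2024, 3, 7)))
    print(files_needed(10 * 1024**3))  # a 10 GiB partition
```

A compaction job would repartition each date partition to `files_needed(...)` files before writing Parquet, which avoids the many-small-files pattern that slows Athena scans.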
Performance & Optimization
Q4: How do you optimize Redshift query performance?
Answer: Choose distribution keys (DISTKEY) and sort keys (SORTKEY) that match join and filter patterns, run VACUUM and ANALYZE regularly, use result caching, implement workload management (WLM), and consider Redshift Serverless for variable workloads.
Q5: Explain Kinesis shard splitting and merging.
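Answer: Split: divide one shard into two for increased capacity. Merge: combine two adjacent shards for cost savings. Use the SplitShard and MergeShards APIs for individual shards, UpdateShardCount to resize the whole stream, or on-demand capacity mode for automatic scaling. Monitor consumer lag with the CloudWatch metric GetRecords.IteratorAgeMilliseconds.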
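A split works on a shard's hash key range: Kinesis maps each partition key to a 128-bit integer with MD5, and a split chooses where the second child's range starts. The sketch below computes an even split point and shows where a key lands; the helper names are hypothetical, and no AWS call is made.

```python
import hashlib

MAX_HASH_KEY = 2**128 - 1  # upper bound of the Kinesis hash key space


def even_split_point(start: int, end: int) -> int:
    """Starting hash key of the second child for an even split of [start, end]."""
    if end <= start:
        raise ValueError("range must contain at least two hash keys")
    return start + (end - start + 1) // 2


def hash_key(partition_key: str) -> int:
    """128-bit integer that a partition key maps to (MD5 of the key)."""
    return int.from_bytes(hashlib.md5(partition_key.encode("utf-8")).digest(), "big")


if __name__ == "__main__":
    split_at = even_split_point(0, MAX_HASH_KEY)
    key = hash_key("user-42")
    child = "second" if key >= split_at else "first"
    print(f"split at {split_at}; 'user-42' goes to the {child} child shard")
```

An even split only balances load if partition keys are well distributed; a single hot key still lands entirely in one child shard.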
Q6: How do you handle data skew in Spark on EMR?
Answer: Use salting technique for skewed keys, repartition by different key, use broadcast joins for small tables, and enable AQE (Adaptive Query Execution) for automatic optimization.
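Salting is easier to explain with the key transformation written out. The sketch below is plain Python rather than Spark, and the function names are illustrative: the large, skewed side appends a random salt to each key, and the small side is replicated once per salt so every salted key still finds its match.

```python
import random


def salted_key(key: str, num_salts: int, rng: random.Random) -> str:
    """Key for the large (skewed) side: spread one hot key over num_salts buckets."""
    return f"{key}#{rng.randrange(num_salts)}"


def replicated_keys(key: str, num_salts: int) -> list[str]:
    """Keys for the small side: one copy of the row per possible salt."""
    return [f"{key}#{i}" for i in range(num_salts)]


def original_key(salted: str) -> str:
    """Strip the salt again after the join or aggregation."""
    return salted.rsplit("#", 1)[0]


if __name__ == "__main__":
    rng = random.Random(7)
    big_side = [salted_key("hot_customer", 8, rng) for _ in range(1000)]
    small_side = set(replicated_keys("hot_customer", 8))
    print(len(set(big_side)), "distinct join keys instead of 1")
    print(all(k in small_side for k in big_side))
```

The cost is that the small side grows by a factor of `num_salts`, so salting is usually applied only to the keys known to be hot.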
Security & Compliance
Q7: How do you implement column-level security in a data lake?
Answer: Use Lake Formation column-level permissions, implement data masking with Glue, use IAM policies for fine-grained access, and Amazon Macie for sensitive data discovery.
Q8: Explain encryption at rest vs in transit for AWS services.
Answer: At rest: S3 SSE-S3, SSE-KMS, SSE-C; EBS encryption; Redshift encryption. In transit: SSL/TLS for all endpoints, VPN, Direct Connect with MACsec. Use KMS for key management.
Q9: How do you audit data access in a data lake?
Answer: Enable CloudTrail for API logging, S3 access logging, Lake Formation audit logs, VPC flow logs for network access, and CloudWatch Logs for application logs.
Cost Optimization
Q10: How do you reduce costs in an AWS data pipeline?
Answer: Use S3 Intelligent-Tiering, Lambda for variable workloads, Spot Instances for batch processing, Redshift reserved nodes for steady state, and auto-scaling wherever the service supports it.
Q11: Compare costs of EMR vs Glue for different workloads.
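Answer: EMR: better for long-running, complex workloads with custom configurations. Glue: better for serverless, short-running ETL jobs. Running EMR on Spot Instances lowers cost further: AWS advertises Spot discounts of up to 90% off On-Demand prices, with the trade-off that instances can be interrupted.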
Q12: How do you implement cost allocation tags for data services?
Answer: Tag all resources with project, environment, team, and cost-center. Use AWS Cost Explorer with tags, implement budget alerts, and review unused resources monthly.
DevOps & MLOps
Q13: How do you implement CI/CD for data pipelines?
Answer: Use CodeCommit for version control, CodeBuild for testing, CodePipeline for deployment, CloudFormation for infrastructure, and CloudWatch Synthetics canaries for monitoring.
Q14: Explain MLOps architecture on AWS.
Answer: Use SageMaker for training, Step Functions for orchestration, S3 for model artifacts, A/B testing with SageMaker Endpoints, and Model Monitor for drift detection.
Q15: How do you handle data pipeline failures and retries?
Answer: Implement dead-letter queues, Step Functions retry logic, CloudWatch alarms for failures, SNS notifications, and idempotent processing.
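The retry and dead-letter pattern can be sketched in a few lines. This is an illustration rather than any AWS API: the dead-letter queue is a plain list, the handler is any callable, and the attempt count and base delay are arbitrary assumptions. Step Functions and SQS provide the same behavior declaratively.

```python
import random
import time
from typing import Any, Callable


def process_with_retries(
    record: Any,
    handler: Callable[[Any], None],
    dead_letters: list,
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Run handler(record); retry with backoff, then park the record in the DLQ."""
    for attempt in range(1, max_attempts + 1):
        try:
            handler(record)
            return True
        except Exception as exc:
            if attempt == max_attempts:
                # Out of retries: keep the record and the error for reprocessing.
                dead_letters.append(
                    {"record": record, "error": repr(exc), "attempts": attempt}
                )
                return False
            # Exponential backoff with full jitter before the next attempt.
            sleep(random.uniform(0, base_delay * 2 ** (attempt - 1)))
    return False


if __name__ == "__main__":
    dlq: list = []

    def always_fails(rec):
        raise ValueError("bad record")

    process_with_retries({"id": 1}, always_fails, dlq, sleep=lambda s: None)
    print(dlq)
```

Retries are only safe when the handler is idempotent, which is why the answer pairs them with idempotent processing.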
Real-Time Processing
Q16: Compare Kinesis Data Streams vs MSK (Managed Kafka).
Answer: Kinesis: simpler, serverless, AWS-native. MSK: Apache Kafka compatible, more features, better ecosystem. Choose Kinesis for simplicity, MSK for existing Kafka workloads.
Q17: How do you handle late-arriving data in streaming?
Answer: Use watermarks in Flink, window extensions, late data handling policies, and store in DLQ for reprocessing. Implement idempotent processing.
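The watermark idea can be shown without Flink. The sketch below is a simplified model, not Flink's API: the watermark trails the largest event time seen by a fixed bound, events whose window has already closed go to a side list for reprocessing, and the class and parameter names are assumptions.

```python
from collections import defaultdict
from typing import Any, Optional


class TumblingWindowCounter:
    """Counts events per tumbling window and diverts events that arrive too late."""

    def __init__(self, window_size: int, max_out_of_orderness: int) -> None:
        self.window_size = window_size
        self.max_out_of_orderness = max_out_of_orderness
        self.max_event_time: Optional[int] = None
        self.counts: dict = defaultdict(int)   # window start -> event count
        self.late_events: list = []            # side output for reprocessing

    @property
    def watermark(self) -> Optional[int]:
        """Event time before which no more events are expected."""
        if self.max_event_time is None:
            return None
        return self.max_event_time - self.max_out_of_orderness

    def add(self, event_time: int, payload: Any = None) -> bool:
        window_start = event_time - event_time % self.window_size
        window_end = window_start + self.window_size
        watermark = self.watermark
        if watermark is not None and window_end <= watermark:
            # The window already closed: keep the event for later reprocessing.
            self.late_events.append((event_time, payload))
            return False
        self.counts[window_start] += 1
        if self.max_event_time is None or event_time > self.max_event_time:
            self.max_event_time = event_time
        return True


if __name__ == "__main__":
    w = TumblingWindowCounter(window_size=60, max_out_of_orderness=30)
    for t in (10, 125, 50, 70):
        print(t, "accepted" if w.add(t) else "late")
```

Raising `max_out_of_orderness` accepts more stragglers but delays when a window's result can be considered final, which is the core trade-off to mention in an interview.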
Q18: Explain exactly-once semantics in AWS streaming.
Answer: Kinesis: at-least-once with deduplication. MSK: exactly-once with transactions. Flink: exactly-once with checkpointing. Use DynamoDB for idempotent writes.
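Idempotent writes are what turn at-least-once delivery into effectively-once results. The sketch below uses an in-memory stand-in for the store; with DynamoDB the same effect comes from a conditional put that succeeds only when the item's key does not yet exist. The class, function, and field names are illustrative.

```python
class IdempotentStore:
    """In-memory stand-in for a table keyed by event id."""

    def __init__(self) -> None:
        self._items: dict = {}

    def put_if_absent(self, event_id: str, item: dict) -> bool:
        """Store the item only the first time this event id is seen."""
        if event_id in self._items:
            return False
        self._items[event_id] = item
        return True


def apply_events(store: IdempotentStore, events: list, totals: dict) -> dict:
    """Update running totals, skipping events that were already applied."""
    for event in events:
        if store.put_if_absent(event["id"], event):
            totals[event["sku"]] = totals.get(event["sku"], 0) + event["qty"]
    return totals


if __name__ == "__main__":
    store = IdempotentStore()
    batch = [
        {"id": "e1", "sku": "A", "qty": 2},
        {"id": "e2", "sku": "B", "qty": 1},
        {"id": "e1", "sku": "A", "qty": 2},  # redelivered duplicate
    ]
    print(apply_events(store, batch, {}))
```

The deduplication key has to come from the producer (an event id), not from the consumer, or a retried send will look like a new event.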
Data Quality & Governance
Q19: How do you implement data quality checks in AWS?
Answer: Use Glue DataBrew for profiling, Lambda for custom checks, Great Expectations integration, and CloudWatch metrics for monitoring. Implement data contracts.
Q20: Explain data lineage tracking in AWS.
Answer: Use Glue Data Catalog lineage, Lake Formation for permissions lineage, CloudTrail for access lineage, and custom metadata in DynamoDB.
Migration & Modernization
Q21: How do you migrate from on-premises Hadoop to AWS?
Answer: Use Amazon MSK for Kafka migration, EMR for Hadoop workloads, S3 in place of HDFS, and the Glue Data Catalog in place of the Hive metastore, with Lake Formation for permissions on top. Consider lift-and-shift vs re-architecture.
Q22: Explain database migration with AWS DMS best practices.
Answer: Use full load + CDC for zero downtime, enable validation, tune batch size, monitor replication lag, and test failover procedures.
Q23: How do you modernize legacy ETL to serverless?
Answer: Replace Informatica/DataStage with AWS Glue, use Lambda for event processing, Step Functions for orchestration, and S3 for data staging.
Advanced Architecture Patterns
Q24: Design a real-time ML feature store on AWS.
Answer: Use Kinesis for streaming features, DynamoDB for online store, S3 for offline store, SageMaker Feature Store for management, and Lambda for feature computation.
Q25: Explain the Lambda architecture implementation on AWS.
Answer: Batch layer: EMR/Glue for historical processing. Speed layer: Kinesis Analytics for real-time. Serving layer: DynamoDB + Redshift for queries. Use Step Functions for orchestration.
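The serving layer's job is to merge the two views at query time. The sketch below assumes the batch view holds totals up to a known cutoff timestamp and the speed layer supplies raw recent events; the function name and the tuple layout are hypothetical.

```python
def query_totals(
    batch_view: dict[str, int],
    batch_cutoff: int,
    recent_events: list[tuple[int, str, int]],
) -> dict[str, int]:
    """Combine precomputed batch totals with events newer than the batch cutoff.

    recent_events holds (timestamp, key, amount) tuples from the speed layer.
    """
    merged = dict(batch_view)
    for timestamp, key, amount in recent_events:
        if timestamp > batch_cutoff:  # older events are already in the batch view
            merged[key] = merged.get(key, 0) + amount
    return merged


if __name__ == "__main__":
    batch = {"sku-a": 10}
    events = [(90, "sku-a", 5), (110, "sku-a", 2), (120, "sku-b", 1)]
    print(query_totals(batch, batch_cutoff=100, recent_events=events))
```

The cutoff check is what prevents double counting when the batch job catches up with data the speed layer has already served.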
Final Interview Preparation Checklist
Review all topics covered in modules 1-64 and practice explaining concepts clearly.
Technical Skills Assessment
DATA ENGINEERING SKILLS MATRIX

CORE SERVICES             PROFICIENCY LEVEL
------------------------  -----------------
S3, IAM, VPC              Expert
Lambda, Glue              Expert
Redshift, Athena          Expert
Kinesis, MSK              Expert
EMR, Step Functions       Expert
DynamoDB, ElastiCache     Expert
QuickSight, SageMaker     Advanced
Lake Formation            Advanced
MSK Connect               Advanced
Snow Family               Advanced

CONCEPTS                  PROFICIENCY LEVEL
------------------------  -----------------
Data Modeling             Expert
ETL/ELT Patterns          Expert
Streaming Architecture    Expert
Data Lake Design          Expert
Cost Optimization         Expert
Security & Compliance     Advanced
DevOps/MLOps              Advanced
Hybrid Architectures      Advanced

SOFT SKILLS               PROFICIENCY LEVEL
------------------------  -----------------
System Design             Expert
Problem Solving           Expert
Communication             Advanced
Documentation             Advanced
Interview Day Tips
- Prepare stories: Have 3-5 real-world examples ready
- Know your resume: Be ready to deep-dive into any project
- Ask questions: Show interest in their data challenges
- Think out loud: Explain your reasoning process
- Be honest: If you don't know something, say so and explain how you'd learn
- Practice whiteboarding: Design diagrams clearly
- Review recent projects: Know what you've worked on recently
- Prepare questions: About team, tech stack, challenges
Common Interview Formats
INTERVIEW FORMAT BREAKDOWN

PHONE SCREEN (30-45 min)
- Resume walkthrough
- Basic technical questions
- Project discussion
- Salary expectations

TECHNICAL PHONE SCREEN (45-60 min)
- Coding challenge (SQL, Python)
- System design question
- AWS service knowledge

ONSITE INTERVIEW (4-6 hours)
- System design (1 hour)
- Coding (1-2 hours)
- Technical deep-dive (1 hour)
- Behavioral (1 hour)
- Hiring manager (30 min)

PANEL INTERVIEW (2-3 hours)
- Multiple interviewers
- Mix of technical and behavioral
- Team fit assessment
Key Questions to Ask Interviewers
- What are the biggest data challenges your team is facing?
- How is the data platform architecture evolving?
- What does success look like in this role in 6 months?
- How do you handle data governance and quality?
- What's the team structure and collaboration model?
- What tools and technologies are you planning to adopt?
- How do you approach technical debt in data pipelines?
- What's the on-call rotation and incident response process?
Final Review Summary
Review all 65 modules before your interview. Focus on areas where you feel less confident and practice explaining concepts out loud.
Complete Module Index
- AWS Overview & Fundamentals (Modules 1-15)
- Data Pipeline Architecture (Modules 16-35)
- Service Deep Dives (Modules 36-50)
- Interview Q&A (Modules 51-65)
Key Takeaways from All Modules
- Master core services: S3, Lambda, Glue, Redshift, Kinesis
- Understand patterns: ETL, ELT, streaming, batch, data lake
- Practice system design: Real-world scenarios with trade-offs
- Know cost optimization: Right-sizing, auto-scaling, reserved capacity
- Security first: Encryption, access control, compliance
- Monitoring & observability: CloudWatch, X-Ray, custom metrics
- DevOps practices: CI/CD, infrastructure as code, testing
- Soft skills: Communication, problem-solving, teamwork
Resources for Further Learning
- AWS Documentation & Well-Architected Framework
- AWS re:Invent videos and whitepapers
- AWS Hands-on Labs and Workshops
- Data engineering blogs and communities
- Practice with AWS Free Tier and sandbox accounts
Good luck with your interviews!