Mixed Topics & Comprehensive Review

Module 65 — Complete Data Engineering Review & Mixed Interview Questions

AWS Data Engineering Services Map

Architecture Diagram

┌─────────────────────────────────────────────────────────────────────┐
│                AWS DATA ENGINEERING SERVICES ECOSYSTEM             │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  INGESTION                 PROCESSING                STORAGE       │
│  ─────────                ──────────                ───────       │
│  ┌──────────┐            ┌──────────┐            ┌──────────┐    │
│  │Kinesis   │            │AWS Glue  │            │Amazon S3 │    │
│  │Data      │            │(Spark/   │            │(Data     │    │
│  │Streams   │            │PySpark)  │            │Lake)     │    │
│  └──────────┘            └──────────┘            └──────────┘    │
│  ┌──────────┐            ┌──────────┐            ┌──────────┐    │
│  │Kinesis   │            │Amazon EMR│            │Amazon    │    │
│  │Firehose  │            │(Managed  │            │DynamoDB  │    │
│  │          │            │Hadoop)   │            │          │    │
│  └──────────┘            └──────────┘            └──────────┘    │
│  ┌──────────┐            ┌──────────┐            ┌──────────┐    │
│  │Amazon MSK│            │AWS       │            │Amazon    │    │
│  │(Managed  │            │Lambda    │            │Redshift  │    │
│  │Kafka)    │            │(Serverless│           │          │    │
│  └──────────┘            └──────────┘            └──────────┘    │
│  ┌──────────┐            ┌──────────┐            ┌──────────┐    │
│  │AWS DMS   │            │Amazon    │            │Amazon    │    │
│  │(Database │            │SageMaker │            │ElastiCache│   │
│  │Migration)│            │(ML)      │            │          │    │
│  └──────────┘            └──────────┘            └──────────┘    │
│                                                                     │
│  ANALYTICS                ORCHESTRATION             GOVERNANCE     │
│  ──────────               ────────────              ──────────     │
│  ┌──────────┐            ┌──────────┐            ┌──────────┐    │
│  │Amazon    │            │AWS Step  │            │AWS Lake  │    │
│  │Athena    │            │Functions │            │Formation │    │
│  │(Serverless│           │          │            │          │    │
│  │Query)    │            └──────────┘            └──────────┘    │
│  └──────────┘            ┌──────────┐            ┌──────────┐    │
│  ┌──────────┐            │Amazon    │            │AWS Glue  │    │
│  │Amazon    │            │MWAA      │            │Data      │    │
│  │QuickSight│            │(Managed  │            │Catalog   │    │
│  │(BI)      │            │Airflow)  │            │          │    │
│  └──────────┘            └──────────┘            └──────────┘    │
│  ┌──────────┐            ┌──────────┐            ┌──────────┐    │
│  │Amazon    │            │Amazon    │            │AWS       │    │
│  │EMR       │            │ECS/Fargate│           │Config    │    │
│  │Studio    │            │(Containers│           │          │    │
│  └──────────┘            └──────────┘            └──────────┘    │
│                                                                     │
│  SECURITY                 MONITORING                COST           │
│  ────────                 ──────────                ─────          │
│  ┌──────────┐            ┌──────────┐            ┌──────────┐    │
│  │AWS KMS   │            │Amazon    │            │AWS       │    │
│  │(Keys)    │            │CloudWatch│            │Cost      │    │
│  └──────────┘            │          │            │Explorer  │    │
│  ┌──────────┐            └──────────┘            └──────────┘    │
│  │AWS IAM   │            ┌──────────┐            ┌──────────┐    │
│  │(Access   │            │AWS       │            │AWS       │    │
│  │Control)  │            │CloudTrail│            │Budgets   │    │
│  └──────────┘            └──────────┘            └──────────┘    │
│  ┌──────────┐            ┌──────────┐            ┌──────────┐    │
│  │Amazon    │            │AWS       │            │S3        │    │
│  │Macie     │            │X-Ray     │            │Intelligent│   │
│  │(Data     │            │          │            │Tiering   │    │
│  │Privacy)  │            └──────────┘            └──────────┘    │
│  └──────────┘                                                     │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

Comprehensive Scenario-Based Questions

Scenario 1: E-Commerce Real-Time Analytics

Architecture Diagram

┌─────────────────────────────────────────────────────────────────────┐
│            E-COMMERCE REAL-TIME ANALYTICS ARCHITECTURE             │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  ┌──────────┐     ┌──────────┐     ┌──────────┐     ┌──────────┐│
│  │  Web     │     │  Mobile  │     │  POS     │     │  IoT     ││
│  │  App     │     │  App     │     │  System  │     │  Devices ││
│  └────┬─────┘     └────┬─────┘     └────┬─────┘     └────┬─────┘│
│       │                │                │                │       │
│       ▼                ▼                ▼                ▼       │
│  ┌─────────────────────────────────────────────────────────────┐│
│  │                    KINESIS DATA STREAMS                     ││
│  │  • User clicks    • Transactions   • Inventory updates     ││
│  │  • Search queries • Payment events  • Sensor data          ││
│  └─────────────────────────────┬───────────────────────────────┘│
│                                │                                 │
│         ┌──────────────────────┼──────────────────────┐         │
│         │                      │                      │         │
│         ▼                      ▼                      ▼         │
│  ┌──────────────┐      ┌──────────────┐      ┌──────────────┐ │
│  │ KINESIS      │      │ LAMBDA       │      │ KINESIS      │ │
│  │ ANALYTICS    │      │ PROCESSOR    │      │ FIREHOSE     │ │
│  │ (Flink)      │      │              │      │              │ │
│  │              │      │ • Enrichment │      │ • S3 Archive │ │
│  │ • Session    │      │ • Validation │      │ • Parquet    │ │
│  │   tracking   │      │ • Transform  │      │              │ │
│  │ • Fraud      │      │              │      │              │ │
│  │   detection  │      │              │      │              │ │
│  └──────┬───────┘      └──────┬───────┘      └──────┬───────┘ │
│         │                     │                     │          │
│         ▼                     ▼                     ▼          │
│  ┌──────────────┐      ┌──────────────┐      ┌──────────────┐ │
│  │ DYNAMODB     │      │ REDSHIFT     │      │ S3 DATA LAKE │ │
│  │ (Real-time   │      │ (Analytics)  │      │ (Historical) │ │
│  │  Features)   │      │              │      │              │ │
│  └──────────────┘      └──────────────┘      └──────────────┘ │
│         │                     │                     │          │
│         └─────────────────────┼─────────────────────┘          │
│                               ▼                                │
│                        ┌──────────────┐                        │
│                        │ QUICKSIGHT   │                        │
│                        │ (Dashboards) │                        │
│                        └──────────────┘                        │
│                                                                 │
└─────────────────────────────────────────────────────────────────────┘

Question: Design a real-time analytics solution for an e-commerce platform handling 100K events/second with <1 second latency requirements.

Answer:

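Ingestion: Kinesis Data Streams, sized from the per-shard write limits (1,000 records/second or 1 MB/second per shard), so 100K events/second needs at least 100 shards; split shards or use on-demand mode as throughput grows
Processing: Amazon Managed Service for Apache Flink (formerly Kinesis Data Analytics) for real-time aggregation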
Storage: DynamoDB for real-time features, Redshift for analytics
Visualization: QuickSight dashboards with SPICE for performance
Cost optimization: Use Firehose for bulk loading to S3, Lambda for event processing
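
The shard count for the ingestion tier can be estimated from the per-shard write limits of Kinesis Data Streams in provisioned mode. The sketch below is a rough sizing aid, not an AWS API: the function name, the headroom parameter, and the choice to treat 1 MB as 1,048,576 bytes are assumptions made for illustration.

```python
import math

# Per-shard write limits for Kinesis Data Streams (provisioned mode).
RECORDS_PER_SHARD_PER_SEC = 1_000
BYTES_PER_SHARD_PER_SEC = 1_048_576  # assumption: 1 MB treated as 1 MiB


def estimate_shards(records_per_sec, avg_record_bytes, headroom=1.0):
    """Return the minimum shard count that satisfies both write limits.

    headroom is a hypothetical safety multiplier (1.25 adds 25% spare capacity).
    """
    by_records = records_per_sec / RECORDS_PER_SHARD_PER_SEC
    by_bytes = records_per_sec * avg_record_bytes / BYTES_PER_SHARD_PER_SEC
    # Whichever limit is hit first determines the shard count.
    return math.ceil(max(by_records, by_bytes) * headroom)
```

With the scenario's 100K events/second and an assumed 1 KiB average record, the record-count limit dominates and the estimate is 100 shards; larger records shift the bottleneck to the byte limit.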

Scenario 2: Financial Data Warehouse Migration

Architecture Diagram

┌─────────────────────────────────────────────────────────────────────┐
│          FINANCIAL DATA WAREHOUSE MIGRATION ARCHITECTURE           │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  CURRENT STATE                     TARGET STATE                   │
│  ────────────                      ────────────                    │
│  ┌──────────────┐                 ┌──────────────┐                │
│  │ Oracle       │    DMS + CDC    │ Aurora       │                │
│  │ Exadata      │ ───────────────▶│ PostgreSQL   │                │
│  │ 50TB         │                 │              │                │
│  └──────────────┘                 └──────────────┘                │
│         │                                │                        │
│         │                                │                        │
│         ▼                                ▼                        │
│  ┌──────────────┐                 ┌──────────────┐                │
│  │ ETL Jobs     │    AWS Glue     │ Glue Jobs    │                │
│  │ (Informatica)│ ───────────────▶│ (Spark)      │                │
│  └──────────────┘                 └──────────────┘                │
│         │                                │                        │
│         │                                │                        │
│         ▼                                ▼                        │
│  ┌──────────────┐                 ┌──────────────┐                │
│  │ Teradata     │    SCT + DMS    │ Redshift     │                │
│  │ 100TB        │ ───────────────▶│ Serverless   │                │
│  └──────────────┘                 └──────────────┘                │
│         │                                │                        │
│         │                                │                        │
│         ▼                                ▼                        │
│  ┌──────────────┐                 ┌──────────────┐                │
│  │ Reporting    │                 │ QuickSight   │                │
│  │ (Cognos)     │                 │ + Athena     │                │
│  └──────────────┘                 └──────────────┘                │
│                                                                     │
│  MIGRATION PHASES:                                                 │
│  Phase 1: Schema conversion + DMS full load                       │
│  Phase 2: CDC cutover + ETL migration                             │
│  Phase 3: Data warehouse migration                                │
│  Phase 4: Reporting migration + validation                        │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

Question: A financial institution needs to migrate from Oracle/Teradata to AWS with zero downtime. How would you approach this?

Answer:

Assessment: Use the AWS Schema Conversion Tool (SCT) to analyze schemas and estimate conversion effort
Database migration: DMS with CDC for Oracle → Aurora PostgreSQL
Data warehouse: AWS SCT data extraction agents for Teradata → Redshift, with Snowball Edge for the bulk transfer (DMS does not support Teradata as a source)
ETL migration: Recreate Informatica jobs in AWS Glue
Validation: Automated data comparison scripts
Cutover: Blue-green deployment with gradual traffic shifting
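
The validation step can be sketched as an order-independent fingerprint computed on both sides of the migration. This is a minimal illustration, assuming rows have already been fetched as tuples and normalized to comparable types; the function names are hypothetical, and a real comparison would run per table or per key range inside each database.

```python
import hashlib

MODULUS = 2 ** 256
NULL_MARKER = "\x00NULL"  # keeps NULL distinct from an empty string


def table_fingerprint(rows):
    """Return (row_count, checksum) for an iterable of row tuples.

    Per-row SHA-256 digests are summed modulo 2**256, so the result does not
    depend on row order but still changes when a duplicate row is added.
    """
    count = 0
    checksum = 0
    for row in rows:
        encoded = "\x1f".join(NULL_MARKER if v is None else str(v) for v in row)
        digest = hashlib.sha256(encoded.encode("utf-8")).digest()
        checksum = (checksum + int.from_bytes(digest, "big")) % MODULUS
        count += 1
    return count, checksum


def tables_match(source_rows, target_rows):
    """True when both sides have the same row count and checksum."""
    return table_fingerprint(source_rows) == table_fingerprint(target_rows)
```

Summing digests rather than XOR-ing them is a deliberate choice here: with XOR, two identical extra rows would cancel each other out and go undetected.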

Scenario 3: Multi-Region Data Lake

Architecture Diagram

┌─────────────────────────────────────────────────────────────────────┐
│              MULTI-REGION DATA LAKE ARCHITECTURE                   │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  REGION A (US-EAST-1)              REGION B (EU-WEST-1)           │
│  ┌──────────────────────┐         ┌──────────────────────┐       │
│  │ S3 Data Lake         │         │ S3 Data Lake         │       │
│  │ ┌──────────────────┐ │         │ ┌──────────────────┐ │       │
│  │ │ Raw Zone         │ │ ◀─────▶ │ │ Raw Zone         │ │       │
│  │ │ (Bronze)         │ │  S3     │ │ (Bronze)         │ │       │
│  │ └──────────────────┘ │  CRR    │ └──────────────────┘ │       │
│  │ ┌──────────────────┐ │         │ ┌──────────────────┐ │       │
│  │ │ Processed Zone   │ │         │ │ Processed Zone   │ │       │
│  │ │ (Silver)         │ │         │ │ (Silver)         │ │       │
│  │ └──────────────────┘ │         │ └──────────────────┘ │       │
│  │ ┌──────────────────┐ │         │ ┌──────────────────┐ │       │
│  │ │ Curated Zone     │ │         │ │ Curated Zone     │ │       │
│  │ │ (Gold)           │ │         │ │ (Gold)           │ │       │
│  │ └──────────────────┘ │         │ └──────────────────┘ │       │
│  └──────────────────────┘         └──────────────────────┘       │
│           │                                   │                   │
│           ▼                                   ▼                   │
│  ┌──────────────────────┐         ┌──────────────────────┐       │
│  │ Redshift Cluster     │         │ Redshift Cluster     │       │
│  │ (Analytics)          │         │ (Analytics)          │       │
│  └──────────────────────┘         └──────────────────────┘       │
│           │                                   │                   │
│           └─────────────────┬─────────────────┘                   │
│                             ▼                                     │
│                    ┌──────────────────┐                           │
│                    │ Route 53         │                           │
│                    │ (Latency-based   │                           │
│                    │  Routing)        │                           │
│                    └──────────────────┘                           │
│                                                                     │
│  DATA CONSISTENCY:                                                 │
│  • S3 Cross-Region Replication for object replication             │
│  • DynamoDB Global Tables for real-time sync                       │
│  • EventBridge for cross-region event distribution                │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

Question: Design a multi-region data lake for a global company with data sovereignty requirements.

Answer:

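Storage: S3 buckets per region; use CRR only for data that is allowed to leave its home region
Catalog: Glue Data Catalog per region (the catalog is regional, so metadata must be synchronized by a separate replication job)
Analytics: Redshift clusters per region, with cross-region snapshot copy for disaster recovery; DynamoDB Global Tables for low-latency replicated lookups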
Access: Route 53 latency-based routing, cross-account roles
Compliance: Lake Formation for region-specific access controls

Mixed Technical Questions

Data Modeling & Design

Q1: How do you handle slowly changing dimensions (SCD) in AWS?

Answer: Use Glue ETL with custom logic for SCD Type 1 (overwrite), Type 2 (history), Type 3 (limited history). Store in S3 with partitioning by effective date, query with Athena or load to Redshift.
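
The Type 2 logic mentioned above can be sketched in plain Python before porting it to a Glue (Spark) job. This is an illustration under assumptions: the row layout (key, attrs, valid_from, valid_to, is_current), the function name, and the 9999-12-31 sentinel are hypothetical choices, not a Glue API.

```python
from datetime import date

OPEN_END = date(9999, 12, 31)  # assumed sentinel meaning "still current"


def apply_scd2(dimension, incoming, effective):
    """Apply one incoming record to a list of dimension rows (SCD Type 2).

    Each row is a dict with keys: key, attrs, valid_from, valid_to, is_current.
    Returns a new list; the input list is not modified.
    """
    result = []
    current = None
    for row in dimension:
        if row["key"] == incoming["key"] and row["is_current"]:
            current = row
        else:
            result.append(dict(row))

    if current is not None and current["attrs"] == incoming["attrs"]:
        result.append(dict(current))  # nothing changed: keep the current version
        return result

    if current is not None:
        # Close the old version instead of overwriting it (Type 1 would overwrite).
        result.append({**current, "valid_to": effective, "is_current": False})

    result.append({
        "key": incoming["key"],
        "attrs": dict(incoming["attrs"]),
        "valid_from": effective,
        "valid_to": OPEN_END,
        "is_current": True,
    })
    return result
```

Partitioning the stored rows by valid_from, as the answer suggests, keeps point-in-time queries in Athena from scanning the full history.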

Q2: Explain schema-on-read vs schema-on-write in data lakes.

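Answer: Schema-on-write: define and enforce the schema before data is written (RDBMS, data warehouses). Schema-on-read: store raw data and apply a schema when it is read (data lakes). On AWS, the Glue Data Catalog holds the table definitions that Athena, Redshift Spectrum, and EMR apply at read time. Trade-off: schema-on-read gives flexibility; schema-on-write gives validated data and more predictable query performance.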
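
The difference can be illustrated with a small sketch: one function rejects bad records at write time, the other stores nothing itself and applies the schema while reading raw JSON lines. The schema, function names, and the decision to set malformed rows aside are assumptions for illustration only.

```python
import json

SCHEMA = {"user_id": int, "amount": float}  # hypothetical table schema


def write_validated(table, record, schema=SCHEMA):
    """Schema-on-write: refuse records that do not match the schema."""
    for column, column_type in schema.items():
        if not isinstance(record.get(column), column_type):
            raise TypeError(f"{column} must be {column_type.__name__}")
    table.append(record)


def read_with_schema(raw_lines, schema=SCHEMA):
    """Schema-on-read: cast raw JSON lines on the way out.

    Returns (rows, rejected_lines); extra fields in the raw data are ignored.
    """
    rows, rejected = [], []
    for line in raw_lines:
        try:
            raw = json.loads(line)
            rows.append({column: cast(raw[column]) for column, cast in schema.items()})
        except (KeyError, TypeError, ValueError):
            rejected.append(line)
    return rows, rejected
```

The read-side function accepts whatever was landed and pays the validation cost on every query, which is the flexibility-versus-performance trade-off in miniature.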

Q3: How do you optimize Parquet files for Athena queries?

Answer: Use columnar compression (Snappy, GZIP), partition data by query patterns, optimize file sizes (128MB-1GB), use Glue compaction jobs, and implement bucketing for high-cardinality columns.

Performance & Optimization

Q4: How do you optimize Redshift query performance?

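Answer: Choose distribution keys (DISTKEY) and sort keys (SORTKEY) that match join and filter patterns, run VACUUM and ANALYZE regularly (Redshift also runs both automatically in the background), use result caching, implement workload management (WLM), and consider Redshift Serverless for variable workloads.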

Q5: Explain Kinesis shard splitting and merging.

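Answer: Split: divide one shard into two for increased capacity. Merge: combine two adjacent shards for cost savings. Use the SplitShard and MergeShards APIs for individual shards, the UpdateShardCount API for uniform resharding, or auto-scaling. Monitor consumer lag with the CloudWatch metric GetRecords.IteratorAgeMilliseconds.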
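
A split needs a new starting hash key inside the parent shard's hash key range. The sketch below computes an even split point and shows how a partition key maps into the 128-bit key space through MD5; the helper names are hypothetical, and the actual split is performed by the service, not by this code.

```python
import hashlib

MAX_HASH_KEY = 2 ** 128 - 1  # Kinesis hash keys are 128-bit integers


def even_split(start_hash, end_hash):
    """Return (new_starting_hash_key, lower_child_range, upper_child_range)."""
    if end_hash <= start_hash:
        raise ValueError("range must contain at least two hash keys")
    new_start = (start_hash + end_hash) // 2 + 1
    return new_start, (start_hash, new_start - 1), (new_start, end_hash)


def hash_key_for(partition_key):
    """Map a partition key to its 128-bit hash key (MD5 of the UTF-8 bytes)."""
    return int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)


def child_for(partition_key, new_start):
    """Say which child shard a key lands in after the split."""
    return "upper" if hash_key_for(partition_key) >= new_start else "lower"
```

An even split only balances load when keys are spread evenly; for a hot key range, picking the split point from observed traffic is usually the better choice.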

Q6: How do you handle data skew in Spark on EMR?

Answer: Use salting technique for skewed keys, repartition by different key, use broadcast joins for small tables, and enable AQE (Adaptive Query Execution) for automatic optimization.
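
The salting technique can be shown without a cluster. In this plain-Python sketch, each row on the large side gets a salt, and the small side is replicated once per salt so every salted key still finds its match. The function name and the round-robin salt (a stand-in for the random salt typically used in Spark) are assumptions for illustration.

```python
from collections import Counter, defaultdict


def salted_join(large, small, num_salts):
    """Inner-join (key, value) pairs, spreading each key over num_salts buckets.

    Returns (joined_rows, bucket_load), where bucket_load counts the large-side
    rows that landed in each (key, salt) bucket.
    """
    # Replicate every small-side row once per salt so any salted key can match.
    small_index = defaultdict(list)
    for key, value in small:
        for salt in range(num_salts):
            small_index[(key, salt)].append(value)

    joined = []
    bucket_load = Counter()
    for i, (key, value) in enumerate(large):
        salt = i % num_salts  # round-robin stand-in for a random salt
        bucket_load[(key, salt)] += 1
        for small_value in small_index.get((key, salt), []):
            joined.append((key, value, small_value))
    return joined, bucket_load
```

The join result is the same as an unsalted join, but a hot key's rows are divided across num_salts buckets instead of landing on a single task; the cost is the replicated small side.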

Security & Compliance

Q7: How do you implement column-level security in a data lake?

Answer: Use Lake Formation column-level permissions, implement data masking with Glue, use IAM policies for access to services and resources (IAM alone cannot restrict individual columns), and Amazon Macie for sensitive data discovery.

Q8: Explain encryption at rest vs in transit for AWS services.

Answer: At rest: S3 SSE-S3, SSE-KMS, SSE-C; EBS encryption; Redshift encryption. In transit: SSL/TLS for all endpoints, VPN, Direct Connect with MACsec. Use KMS for key management.

Q9: How do you audit data access in a data lake?

Answer: Enable CloudTrail for API logging, S3 access logging, Lake Formation audit logs, VPC flow logs for network access, and CloudWatch Logs for application logs.

Cost Optimization

Q10: How do you reduce costs in an AWS data pipeline?

Answer: Use S3 Intelligent-Tiering, Lambda for variable workloads, Spot Instances for batch processing, Redshift reserved nodes for steady-state workloads, and auto scaling wherever the service supports it.

Q11: Compare costs of EMR vs Glue for different workloads.

Answer: EMR: better for long-running, complex workloads with custom configurations. Glue: better for serverless, short-running ETL jobs. EMR on Spot Instances can cut compute cost by up to 90% compared with On-Demand pricing, though actual savings vary and Spot capacity can be interrupted.

Q12: How do you implement cost allocation tags for data services?

Answer: Tag all resources with project, environment, team, and cost-center. Use AWS Cost Explorer with tags, implement budget alerts, and review unused resources monthly.
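
A tag policy is only useful if it is checked. The sketch below reports resources that lack any of the four tags named above; the input shape (a dict of resource ID to tag dict) and the function name are assumptions, and in practice the tags would come from an inventory export rather than a literal.

```python
REQUIRED_TAGS = ("project", "environment", "team", "cost-center")


def missing_tags(resources, required=REQUIRED_TAGS):
    """Return {resource_id: [missing tag keys]} for non-compliant resources.

    Tag keys are compared exactly, because AWS tag keys are case-sensitive.
    A tag with an empty value is treated as missing.
    """
    report = {}
    for resource_id, tags in resources.items():
        missing = [key for key in required if not tags.get(key)]
        if missing:
            report[resource_id] = missing
    return report
```

Running a check like this on a schedule makes the monthly review of untagged or unused resources a matter of reading a report.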

DevOps & MLOps

Q13: How do you implement CI/CD for data pipelines?

Answer: Use CodeCommit for version control, CodeBuild for testing, CodePipeline for deployment, CloudFormation for infrastructure, and CloudWatch Synthetics canaries to monitor endpoints after deployment.

Q14: Explain MLOps architecture on AWS.

Answer: Use SageMaker for training, Step Functions for orchestration, S3 for model artifacts, A/B testing with SageMaker Endpoints, and Model Monitor for drift detection.

Q15: How do you handle data pipeline failures and retries?

Answer: Implement dead-letter queues, Step Functions retry logic, CloudWatch alarms for failures, SNS notifications, and idempotent processing.
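
The retry-then-dead-letter pattern can be sketched as follows. The parameter names mirror the Retry fields of a Step Functions state (IntervalSeconds, MaxAttempts, BackoffRate), but this is a plain-Python illustration with hypothetical function names, not the Step Functions implementation; the sleep function is injectable so the logic can be tested without waiting.

```python
import time


def backoff_delays(interval_seconds, max_attempts, backoff_rate):
    """Delay before each retry: interval * rate**n for n = 0..max_attempts-1."""
    return [interval_seconds * backoff_rate ** n for n in range(max_attempts)]


def run_with_retries(task, payload, dead_letters, interval_seconds=1,
                     max_attempts=3, backoff_rate=2.0, sleep=time.sleep):
    """Call task(payload); retry on failure, then park the payload in dead_letters.

    max_attempts counts retries, so the task runs at most max_attempts + 1 times.
    """
    delays = backoff_delays(interval_seconds, max_attempts, backoff_rate)
    last_error = None
    for delay in [None] + delays:  # first call, then one call per retry
        if delay is not None:
            sleep(delay)
        try:
            return task(payload)
        except Exception as error:  # broad on purpose for this sketch
            last_error = error
    dead_letters.append({"payload": payload, "error": repr(last_error)})
    return None
```

Retries only stay safe when the task is idempotent, since a call that timed out may still have taken effect before it is repeated.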

Real-Time Processing

Q16: Compare Kinesis Data Streams vs MSK (Managed Kafka).

Answer: Kinesis: simpler, serverless, AWS-native. MSK: Apache Kafka compatible, more features, better ecosystem. Choose Kinesis for simplicity, MSK for existing Kafka workloads.

Q17: How do you handle late-arriving data in streaming?

Answer: Use watermarks in Flink, window extensions, late data handling policies, and store in DLQ for reprocessing. Implement idempotent processing.
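
Watermark handling can be simulated in a few lines. The sketch below is a simplified model, not Flink code: the watermark trails the largest event time seen by a fixed out-of-orderness bound, and an event is treated as late once the watermark has passed its window end plus the allowed lateness. All names are hypothetical.

```python
from collections import defaultdict


def window_with_watermark(events, window_size, max_out_of_orderness,
                          allowed_lateness=0):
    """Sum values per tumbling event-time window; return (windows, late_events).

    events is a list of (event_time, value) pairs in arrival order.
    """
    windows = defaultdict(int)
    late_events = []  # stand-in for a side output or DLQ
    watermark = float("-inf")
    for event_time, value in events:
        window_start = event_time - (event_time % window_size)
        window_end = window_start + window_size
        if window_end + allowed_lateness <= watermark:
            late_events.append((event_time, value))  # window already finalized
        else:
            windows[window_start] += value
        # The watermark never moves backwards.
        watermark = max(watermark, event_time - max_out_of_orderness)
    return dict(windows), late_events
```

Raising allowed_lateness keeps windows open longer and rescues more stragglers, at the cost of holding more state and emitting final results later.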

Q18: Explain exactly-once semantics in AWS streaming.

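Answer: Kinesis Data Streams: at-least-once delivery, so consumers must deduplicate. MSK: exactly-once with Kafka transactions and idempotent producers. Flink: exactly-once state updates with checkpointing; end-to-end exactly-once also needs a transactional or idempotent sink. Use DynamoDB conditional writes for idempotent writes.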
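
Idempotent writes turn at-least-once delivery into effectively-once results. The sketch below uses an in-memory class as a stand-in for a table written with a "put only if the key does not exist" condition (in DynamoDB, a conditional write on the event ID); the class, function, and field names are hypothetical.

```python
class IdempotentStore:
    """In-memory stand-in for a table with a put-if-absent condition."""

    def __init__(self):
        self._items = {}

    def put_if_absent(self, event_id, item):
        """Store the item and return True, or return False for a duplicate."""
        if event_id in self._items:
            return False
        self._items[event_id] = item
        return True


def apply_events(store, balances, events):
    """Apply each event's amount once, even if the event is delivered twice."""
    for event in events:
        if store.put_if_absent(event["id"], event):
            account = event["account"]
            balances[account] = balances.get(account, 0) + event["amount"]
    return balances
```

In a real system the deduplication record and the side effect must be committed together (for example in one transaction); otherwise a crash between the two steps can still lose or repeat an update.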

Data Quality & Governance

Q19: How do you implement data quality checks in AWS?

Answer: Use Glue DataBrew for profiling, Lambda for custom checks, Great Expectations integration, and CloudWatch metrics for monitoring. Implement data contracts.
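
A custom check of the kind a Lambda function might run can be sketched as a small rule runner. The rule types (not-null, unique, range), the function name, and the message format are assumptions for illustration; they are not the DataBrew or Great Expectations APIs.

```python
def run_checks(rows, not_null=(), unique=(), ranges=None):
    """Run simple quality rules over a list of dict rows; return failure messages."""
    ranges = ranges or {}
    failures = []

    for column in not_null:
        nulls = sum(1 for row in rows if row.get(column) is None)
        if nulls:
            failures.append(f"{column}: {nulls} null value(s)")

    for column in unique:
        values = [row.get(column) for row in rows if row.get(column) is not None]
        duplicates = len(values) - len(set(values))
        if duplicates:
            failures.append(f"{column}: {duplicates} duplicate value(s)")

    for column, (low, high) in ranges.items():
        bad = sum(1 for row in rows
                  if row.get(column) is not None and not low <= row[column] <= high)
        if bad:
            failures.append(f"{column}: {bad} value(s) outside [{low}, {high}]")

    return failures
```

The length of the returned list (or a count per rule) is a natural value to publish as a CloudWatch metric, so that an alarm fires when a batch violates its data contract.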

Q20: Explain data lineage tracking in AWS.

Answer: Use Glue Data Catalog lineage, Lake Formation for permissions lineage, CloudTrail for access lineage, and custom metadata in DynamoDB.

Migration & Modernization

Q21: How do you migrate from on-premises Hadoop to AWS?

Answer: Use Amazon MSK for Kafka migration, EMR for Hadoop workloads, S3 in place of HDFS, and the Glue Data Catalog in place of the Hive metastore (with Lake Formation for permissions). Consider lift-and-shift vs re-architecture.

Q22: Explain database migration with AWS DMS best practices.

Answer: Use full load + CDC for zero downtime, enable validation, tune batch size, monitor replication lag, and test failover procedures.

Q23: How do you modernize legacy ETL to serverless?

Answer: Replace Informatica/DataStage with AWS Glue, use Lambda for event processing, Step Functions for orchestration, and S3 for data staging.

Advanced Architecture Patterns

Q24: Design a real-time ML feature store on AWS.

Answer: Use Kinesis for streaming features, DynamoDB for online store, S3 for offline store, SageMaker Feature Store for management, and Lambda for feature computation.

Q25: Explain the Lambda architecture implementation on AWS.

Answer: Batch layer: EMR/Glue for historical processing. Speed layer: Amazon Managed Service for Apache Flink (formerly Kinesis Data Analytics) for real-time. Serving layer: DynamoDB + Redshift for queries. Use Step Functions for orchestration.
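
At query time, the serving layer combines the two upstream layers. A minimal sketch, assuming both layers expose per-key counts as dictionaries and that the speed view holds only events newer than the last batch run (the function name is hypothetical):

```python
def merged_view(batch_view, speed_view):
    """Combine precomputed batch totals with real-time increments per key."""
    merged = dict(batch_view)
    for key, increment in speed_view.items():
        merged[key] = merged.get(key, 0) + increment
    return merged
```

When a new batch run completes, the speed view is reset to cover only events after that run; skipping that step double-counts the overlap.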

Final Interview Preparation Checklist

✅ Review all topics covered in modules 1-64 and practice explaining concepts clearly.

Technical Skills Assessment

┌─────────────────────────────────────────────────────────────────────┐
│              DATA ENGINEERING SKILLS MATRIX                        │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  CORE SERVICES           │  PROFICIENCY LEVEL                     │
│  ─────────────           │  ────────────────                      │
│  S3, IAM, VPC            │  ████████████ Expert                   │
│  Lambda, Glue            │  ████████████ Expert                   │
│  Redshift, Athena        │  ████████████ Expert                   │
│  Kinesis, MSK            │  ████████████ Expert                   │
│  EMR, Step Functions     │  ████████████ Expert                   │
│  DynamoDB, ElastiCache   │  ████████████ Expert                   │
│  QuickSight, SageMaker   │  ████████░░░░ Advanced                 │
│  Lake Formation          │  ████████░░░░ Advanced                 │
│  MSK Connect             │  ████████░░░░ Advanced                 │
│  Snow Family             │  ████████░░░░ Advanced                 │
│                                                                     │
│  CONCEPTS                │  PROFICIENCY LEVEL                     │
│  ────────                │  ────────────────                      │
│  Data Modeling           │  ████████████ Expert                   │
│  ETL/ELT Patterns        │  ████████████ Expert                   │
│  Streaming Architecture  │  ████████████ Expert                   │
│  Data Lake Design        │  ████████████ Expert                   │
│  Cost Optimization       │  ████████████ Expert                   │
│  Security & Compliance   │  ████████░░░░ Advanced                 │
│  DevOps/MLOps            │  ████████░░░░ Advanced                 │
│  Hybrid Architectures    │  ████████░░░░ Advanced                 │
│                                                                     │
│  SOFT SKILLS             │  PROFICIENCY LEVEL                     │
│  ───────────             │  ────────────────                      │
│  System Design           │  ████████████ Expert                   │
│  Problem Solving         │  ████████████ Expert                   │
│  Communication           │  ████████░░░░ Advanced                 │
│  Documentation           │  ████████░░░░ Advanced                 │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

Interview Day Tips

Prepare stories: Have 3-5 real-world examples ready
Know your resume: Be ready to deep-dive into any project
Ask questions: Show interest in their data challenges
Think out loud: Explain your reasoning process
Be honest: If you don't know something, say so and explain how you'd learn
Practice whiteboarding: Design diagrams clearly
Review recent projects: Know what you've worked on recently
Prepare questions: About team, tech stack, challenges

Common Interview Formats

┌─────────────────────────────────────────────────────────────────────┐
│                  INTERVIEW FORMAT BREAKDOWN                        │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  PHONE SCREEN (30-45 min)                                         │
│  • Resume walkthrough                                              │
│  • Basic technical questions                                       │
│  • Project discussion                                              │
│  • Salary expectations                                             │
│                                                                     │
│  TECHNICAL PHONE SCREEN (45-60 min)                               │
│  • Coding challenge (SQL, Python)                                  │
│  • System design question                                          │
│  • AWS service knowledge                                           │
│                                                                     │
│  ONSITE INTERVIEW (4-6 hours)                                     │
│  • System design (1 hour)                                          │
│  • Coding (1-2 hours)                                              │
│  • Technical deep-dive (1 hour)                                    │
│  • Behavioral (1 hour)                                             │
│  • Hiring manager (30 min)                                         │
│                                                                     │
│  PANEL INTERVIEW (2-3 hours)                                      │
│  • Multiple interviewers                                           │
│  • Mix of technical and behavioral                                 │
│  • Team fit assessment                                             │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

Key Questions to Ask Interviewers

What are the biggest data challenges your team is facing?
How is the data platform architecture evolving?
What does success look like in this role in 6 months?
How do you handle data governance and quality?
What's the team structure and collaboration model?
What tools and technologies are you planning to adopt?
How do you approach technical debt in data pipelines?
What's the on-call rotation and incident response process?

Final Review Summary

⚠️ Review all 65 modules before your interview. Focus on areas where you feel less confident and practice explaining concepts out loud.

Complete Module Index

AWS Overview & Fundamentals (Modules 1-15)
Data Pipeline Architecture (Modules 16-35)
Service Deep Dives (Modules 36-50)
Interview Q&A (Modules 51-65)

Key Takeaways from All Modules

Master core services: S3, Lambda, Glue, Redshift, Kinesis
Understand patterns: ETL, ELT, streaming, batch, data lake
Practice system design: Real-world scenarios with trade-offs
Know cost optimization: Right-sizing, auto-scaling, reserved capacity
Security first: Encryption, access control, compliance
Monitoring & observability: CloudWatch, X-Ray, custom metrics
DevOps practices: CI/CD, infrastructure as code, testing
Soft skills: Communication, problem-solving, teamwork

Resources for Further Learning

AWS Documentation & Well-Architected Framework
AWS re:Invent videos and whitepapers
AWS Hands-on Labs and Workshops
Data engineering blogs and communities
Practice with AWS Free Tier and sandbox accounts

Good luck with your interviews! 🎯