AWS Global Infrastructure Architecture
┌───────────────────────────────────────────────────────────────────────┐
│                       AWS GLOBAL INFRASTRUCTURE                       │
│                                                                       │
│  ┌─────────────────────────────────────────────────────────────────┐  │
│  │                 REGION: us-east-1 (N. Virginia)                 │  │
│  │                                                                 │  │
│  │  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐  │  │
│  │  │      AZ-1a      │  │      AZ-1b      │  │      AZ-1c      │  │  │
│  │  │   EC2/S3/RDS    │  │   EC2/S3/RDS    │  │   EC2/S3/RDS    │  │  │
│  │  │     Lambda      │  │     Lambda      │  │     Lambda      │  │  │
│  │  │    Redshift     │  │    Redshift     │  │    Redshift     │  │  │
│  │  └─────────────────┘  └─────────────────┘  └─────────────────┘  │  │
│  │           │                    │                    │           │  │
│  │      Low-Latency          Low-Latency          Low-Latency      │  │
│  │        (< 2ms)              (< 2ms)              (< 2ms)        │  │
│  └─────────────────────────────────────────────────────────────────┘  │
│                                                                       │
│  ┌─────────────────────────────────────────────────────────────────┐  │
│  │           GLOBAL EDGE NETWORK (CloudFront, Route 53)            │  │
│  │                                                                 │  │
│  │ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │  │
│  │ │  Edge: LAX  │ │  Edge: IAD  │ │  Edge: FRA  │ │  Edge: NRT  │ │  │
│  │ │    (LA)     │ │    (VA)     │ │ (Frankfurt) │ │   (Tokyo)   │ │  │
│  │ └─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘ │  │
│  │            200+ CloudFront Edge Locations Worldwide             │  │
│  └─────────────────────────────────────────────────────────────────┘  │
└───────────────────────────────────────────────────────────────────────┘
AWS Regions and Availability Zones
ℹ️
Key Concept: An AWS Region is a physical geographic area with multiple isolated Availability Zones (AZs). Each AZ has independent power, networking, and connectivity, housed in separate facilities.
Available AWS Regions (2024+)
| Region Code | Region Name | AZs | Launch Date | Key Services |
|---|---|---|---|---|
| us-east-1 | N. Virginia | 6 | 2006 | Most services launch here first |
| us-east-2 | Ohio | 3 | 2016 | Cost-effective US workloads |
| us-west-1 | N. California | 3 | 2009 | West coast low-latency |
| us-west-2 | Oregon | 4 | 2011 | Broad service coverage, popular US West region |
| eu-west-1 | Ireland | 3 | 2007 | European operations |
| eu-west-2 | London | 3 | 2016 | UK data residency |
| eu-central-1 | Frankfurt | 3 | 2014 | German compliance |
| ap-southeast-1 | Singapore | 3 | 2010 | APAC hub |
| ap-northeast-1 | Tokyo | 4 | 2011 | Japanese market |
| ap-south-1 | Mumbai | 3 | 2016 | Indian market |
Region Selection Criteria for Data Engineering
┌─────────────────────────────────────────────────────────────────┐
│                REGION SELECTION DECISION MATRIX                 │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  1. DATA RESIDENCY ────────► Which laws apply?                  │
│                              GDPR, HIPAA, SOX                   │
│                                                                 │
│  2. LATENCY REQUIREMENTS ──► End-user proximity                 │
│                              < 50ms = same region               │
│                              < 100ms = same continent           │
│                                                                 │
│  3. SERVICE AVAILABILITY ──► Not all services in all regions    │
│                              Check service-by-region page       │
│                                                                 │
│  4. COST ──────────────────► Prices vary 20-40% between regions │
│                              us-east-1 often cheapest           │
│                                                                 │
│  5. DISASTER RECOVERY ─────► Cross-region replication           │
│                              Active-active or pilot light       │
└─────────────────────────────────────────────────────────────────┘
ℹ️
Pro Tip: For data engineering workloads, us-east-1 and us-west-2 typically have the broadest service availability and lowest costs. However, always verify compliance requirements before selecting a region.
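The decision matrix can be read as a filter-then-rank procedure: residency, latency, and service availability act as hard constraints, and cost ranks whatever survives. The sketch below illustrates that idea only. The `RegionProfile` fields, the jurisdiction labels, and every latency and cost figure are made-up assumptions, not AWS data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegionProfile:
    """Hypothetical summary of one candidate region (all values illustrative)."""
    code: str
    jurisdictions: frozenset   # residency labels the region satisfies, e.g. {"EU"}
    latency_ms: float          # round trip measured from your main user base
    services: frozenset        # services confirmed on the service-by-region page
    relative_cost: float       # price index relative to a baseline region


def select_regions(candidates, jurisdiction, max_latency_ms, required_services):
    """Filter on hard constraints first, then rank the survivors by cost."""
    eligible = [
        r for r in candidates
        if jurisdiction in r.jurisdictions
        and r.latency_ms <= max_latency_ms
        and required_services <= r.services   # subset test
    ]
    return sorted(eligible, key=lambda r: r.relative_cost)


# Made-up inputs for illustration; measure and look up real values yourself.
CANDIDATES = [
    RegionProfile("us-east-1", frozenset({"US"}), 95.0,
                  frozenset({"glue", "athena", "redshift", "msk"}), 1.00),
    RegionProfile("eu-west-1", frozenset({"EU"}), 20.0,
                  frozenset({"glue", "athena", "redshift", "msk"}), 1.08),
    RegionProfile("eu-central-1", frozenset({"EU"}), 15.0,
                  frozenset({"glue", "athena", "redshift"}), 1.15),
]

ranked = select_regions(CANDIDATES, "EU", 50.0, frozenset({"glue", "athena"}))
print([r.code for r in ranked])
```

Disaster recovery is left out of the sketch on purpose: it usually means picking a second region with the same procedure rather than changing how the first one is scored.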
AWS Service Categories for Data Engineering
Compute Services
| Service | Use Case | Serverless | Data Eng. Use |
|---|---|---|---|
| EC2 | Virtual machines | No | EMR clusters, custom runners |
| Lambda | Event-driven functions | Yes | Data transformations, triggers |
| ECS/Fargate | Container orchestration | Yes (Fargate) | Spark on containers |
| EKS | Kubernetes | No | ML workloads |
| Batch | Managed batch computing | Yes | Genomics, financial modeling |
Storage Services
┌─────────────────────────────────────────────────────────────────┐
│                      AWS STORAGE HIERARCHY                      │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │                      PRIMARY STORAGE                      │  │
│  │                                                           │  │
│  │  ┌───────────────┐  ┌───────────────┐  ┌───────────────┐  │  │
│  │  │      S3       │  │   EBS (gp3)   │  │   EFS (NFS)   │  │  │
│  │  │ Object Store  │  │  Block Store  │  │  File Store   │  │  │
│  │  │    11 9's     │  │   Single AZ   │  │   Multi-AZ    │  │  │
│  │  │   Unlimited   │  │   16TB max    │  │   Petabytes   │  │  │
│  │  └───────────────┘  └───────────────┘  └───────────────┘  │  │
│  └───────────────────────────────────────────────────────────┘  │
│                                                                 │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │                     ARCHIVAL STORAGE                      │  │
│  │                                                           │  │
│  │  ┌───────────────┐  ┌───────────────┐  ┌───────────────┐  │  │
│  │  │  S3 Glacier   │  │  S3 Glacier   │  │    Storage    │  │  │
│  │  │    Instant    │  │ Deep Archive  │  │    Gateway    │  │  │
│  │  │   Retrieval   │  │               │  │   (Hybrid)    │  │  │
│  │  │      ms       │  │     Hours     │  │    On-prem    │  │  │
│  │  └───────────────┘  └───────────────┘  └───────────────┘  │  │
│  └───────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────┘
Database Services
| Service | Type | Use Case | Data Eng. Pattern |
|---|---|---|---|
| RDS | Relational | OLTP workloads | Source for ETL |
| Aurora | MySQL/PostgreSQL compatible | High-perf relational | Data warehouse source |
| DynamoDB | Key-value/Document | NoSQL, high throughput | CDC via Streams |
| ElastiCache | In-memory (Redis/Memcached) | Caching, session store | Feature store |
| Neptune | Graph | Relationship data | Knowledge graphs |
| DocumentDB | Document (MongoDB) | Document workloads | Semi-structured data |
Analytics Services (Core for Data Engineering)
┌─────────────────────────────────────────────────────────────────┐
│                       AWS ANALYTICS STACK                       │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│   INGESTION          PROCESSING          STORAGE & QUERY        │
│   ─────────          ──────────          ───────────────        │
│                                                                 │
│  ┌──────────┐    ┌────────────────┐    ┌──────────────────┐     │
│  │ Kinesis  │───►│    AWS Glue    │───►│    Amazon S3     │     │
│  │ Streams  │    │   (ETL Jobs)   │    │   (Data Lake)    │     │
│  └──────────┘    └────────────────┘    └──────────────────┘     │
│       │                  │                      │               │
│       ▼                  ▼                      ▼               │
│  ┌──────────┐    ┌────────────────┐    ┌──────────────────┐     │
│  │   MSK    │───►│      EMR       │───►│     Redshift     │     │
│  │ (Kafka)  │    │ (Spark/Hadoop) │    │   (Warehouse)    │     │
│  └──────────┘    └────────────────┘    └──────────────────┘     │
│       │                  │                      │               │
│       ▼                  ▼                      ▼               │
│  ┌──────────┐    ┌────────────────┐    ┌──────────────────┐     │
│  │   DMS    │───►│ Step Functions │───►│      Athena      │     │
│  │  (CDC)   │    │ (Orchestrate)  │    │     (Ad-hoc)     │     │
│  └──────────┘    └────────────────┘    └──────────────────┘     │
│                                                                 │
│   GOVERNANCE & CATALOG               VISUALIZATION              │
│   ────────────────────               ─────────────              │
│  ┌────────────────────┐          ┌────────────────────┐         │
│  │   Lake Formation   │          │     QuickSight     │         │
│  │ Glue Data Catalog  │          │   (BI/Analytics)   │         │
│  └────────────────────┘          └────────────────────┘         │
└─────────────────────────────────────────────────────────────────┘
Key AWS Concepts for Data Engineers
Shared Responsibility Model
┌─────────────────────────────────────────────────────────────────┐
│                 AWS SHARED RESPONSIBILITY MODEL                 │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │                  CUSTOMER RESPONSIBILITY                  │  │
│  │                  "Security IN the Cloud"                  │  │
│  │                                                           │  │
│  │  • Data Classification       • Platform/OS Patching       │  │
│  │  • Encryption (at rest)      • Network Configuration      │  │
│  │  • IAM User Management       • Application Security       │  │
│  │  • Client-side Encryption    • Server-side Encryption     │  │
│  │  • Operating System Updates  • Firewall & Network Config  │  │
│  └───────────────────────────────────────────────────────────┘  │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │                    AWS RESPONSIBILITY                     │  │
│  │                  "Security OF the Cloud"                  │  │
│  │                                                           │  │
│  │  • Hardware/AWS Operations   • Software Patching          │  │
│  │  • Physical Security         • Network Infrastructure     │  │
│  │  • Power & Cooling           • Availability Zones         │  │
│  │  • Storage Media             • Hardware Decommissioning   │  │
│  └───────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────┘
AWS Pricing Models
⚠️
Cost Warning: Data engineering workloads can generate massive bills. Always use Cost Explorer, set billing alerts, and consider Reserved Instances or Savings Plans for predictable workloads.
| Model | Description | Savings | Use Case |
|---|---|---|---|
| On-Demand | Pay per use, no commitment | 0% | Dev/Test, variable workloads |
| Reserved (1yr) | 1-year commitment | Up to 40% | Steady-state production |
| Reserved (3yr) | 3-year commitment | Up to 60% | Long-term infrastructure |
| Spot Instances | Spare capacity | Up to 90% | Batch processing, fault-tolerant |
| Savings Plans | Flexible commitment | Up to 72% | Variable usage patterns |
| Serverless | Pay per invocation | N/A | Event-driven, unpredictable |
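To see what the "Savings" column means in dollars, the sketch below applies each upper-bound discount from the table to a single workload. The hourly rate and the hours per month are hypothetical, and because the table lists "up to" figures the results are best-case bills, not quotes.

```python
# Upper-bound discounts copied from the table above. Real discounts depend on
# instance family, region, term, and payment option.
MAX_SAVINGS = {
    "on_demand": 0.00,
    "reserved_1yr": 0.40,
    "reserved_3yr": 0.60,
    "spot": 0.90,
    "savings_plan": 0.72,
}


def best_case_cost(on_demand_hourly, hours, model):
    """Lowest possible bill in USD for `hours` of usage under a pricing model."""
    return round(on_demand_hourly * hours * (1.0 - MAX_SAVINGS[model]), 2)


HOURLY_RATE = 1.00   # hypothetical on-demand price in USD per hour
HOURS = 730          # roughly one month of continuous use

for model in MAX_SAVINGS:
    print(f"{model:<13} ${best_case_cost(HOURLY_RATE, HOURS, model):>8.2f}")
```

Spot looks cheapest on paper, but instances can be reclaimed at short notice, which is why the table limits it to fault-tolerant batch work.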
Data Engineering Reference Architecture on AWS
┌──────────────────────────────────────────────────────────────────────────────┐
│                       ENTERPRISE DATA PLATFORM ON AWS                        │
│                                                                              │
│  ┌────────────┐ ┌────────────┐ ┌────────────┐ ┌────────────┐ ┌────────────┐  │
│  │  On-Prem   │ │    SaaS    │ │    IoT     │ │   Mobile   │ │    APIs    │  │
│  │  Database  │ │(Salesforce)│ │  Sensors   │ │    Apps    │ │   (REST)   │  │
│  └─────┬──────┘ └─────┬──────┘ └─────┬──────┘ └─────┬──────┘ └─────┬──────┘  │
│        │              │              │              │              │         │
│        ▼              ▼              ▼              ▼              ▼         │
│  ┌────────────────────────────────────────────────────────────────────────┐  │
│  │                            INGESTION LAYER                             │  │
│  │  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐  │  │
│  │  │   DMS    │  │ Kinesis  │  │   MSK    │  │   API    │  │ Snowball │  │  │
│  │  │  (CDC)   │  │(Streams) │  │ (Kafka)  │  │ Gateway  │  │  (Bulk)  │  │  │
│  │  └──────────┘  └──────────┘  └──────────┘  └──────────┘  └──────────┘  │  │
│  └───────────────────────────────────┬────────────────────────────────────┘  │
│                                      │                                       │
│                                      ▼                                       │
│  ┌────────────────────────────────────────────────────────────────────────┐  │
│  │                           RAW DATA ZONE (S3)                           │  │
│  │  s3://data-lake-raw/                                                   │  │
│  │  ├── landing/    (Ingested data)                                       │  │
│  │  ├── bronze/     (Unvalidated)                                         │  │
│  │  └── archive/    (Historical)                                          │  │
│  └───────────────────────────────────┬────────────────────────────────────┘  │
│                                      │                                       │
│                                      ▼                                       │
│  ┌────────────────────────────────────────────────────────────────────────┐  │
│  │                            PROCESSING LAYER                            │  │
│  │    ┌───────────┐    ┌───────────┐    ┌───────────┐    ┌───────────┐    │  │
│  │    │ AWS Glue  │    │    EMR    │    │  Lambda   │    │   Batch   │    │  │
│  │    │   (ETL)   │    │  (Spark)  │    │(Transform)│    │  (Large)  │    │  │
│  │    └───────────┘    └───────────┘    └───────────┘    └───────────┘    │  │
│  │    ┌────────────────┐                                                  │  │
│  │    │ Step Functions │ (Orchestration)                                  │  │
│  │    └────────────────┘                                                  │  │
│  └───────────────────────────────────┬────────────────────────────────────┘  │
│                                      │                                       │
│                                      ▼                                       │
│  ┌────────────────────────────────────────────────────────────────────────┐  │
│  │                           CURATED DATA ZONE                            │  │
│  │  s3://data-lake-curated/                                               │  │
│  │  ├── silver/     (Cleaned, validated)                                  │  │
│  │  ├── gold/       (Business-ready)                                      │  │
│  │  └── aggregates/ (Pre-computed)                                        │  │
│  └───────────────────────────────────┬────────────────────────────────────┘  │
│                                      │                                       │
│                ┌─────────────────────┼─────────────────────┐                 │
│                ▼                     ▼                     ▼                 │
│        ┌───────────────┐     ┌───────────────┐     ┌───────────────┐         │
│        │   Redshift    │     │    Athena     │     │  QuickSight   │         │
│        │  (Warehouse)  │     │   (Ad-hoc)    │     │   (BI/Dash)   │         │
│        └───────────────┘     └───────────────┘     └───────────────┘         │
│                                                                              │
│  ┌────────────────────────────────────────────────────────────────────────┐  │
│  │ GOVERNANCE: Lake Formation │ CATALOG: Glue Catalog │ AUDIT: CloudTrail │  │
│  └────────────────────────────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────────────────────────────┘
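One way to keep the raw and curated zones consistent is to generate every object key from a single helper. The sketch below uses the bucket and zone names from the diagram. The dataset name, the file name, and the Hive-style `year=/month=/day=` partition folders are assumptions chosen for illustration, not part of the reference architecture.

```python
from datetime import date

# Zone-to-bucket mapping taken from the reference architecture diagram.
ZONE_BUCKETS = {
    "landing": "data-lake-raw",
    "bronze": "data-lake-raw",
    "archive": "data-lake-raw",
    "silver": "data-lake-curated",
    "gold": "data-lake-curated",
    "aggregates": "data-lake-curated",
}


def s3_uri(zone, dataset, day, filename):
    """Build a date-partitioned S3 URI for an object in the given zone."""
    if zone not in ZONE_BUCKETS:
        raise ValueError(f"unknown zone: {zone!r}")
    bucket = ZONE_BUCKETS[zone]
    # Hive-style partition folders are an assumed convention, not a requirement.
    partition = f"year={day:%Y}/month={day:%m}/day={day:%d}"
    return f"s3://{bucket}/{zone}/{dataset}/{partition}/{filename}"


# "orders" and the file name are hypothetical examples.
print(s3_uri("silver", "orders", date(2024, 3, 5), "part-0000.parquet"))
```

Centralizing the layout this way means a job cannot accidentally write silver data into the raw bucket, and date partitions line up with the partition pruning that Athena and Glue rely on.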
Key Interview Concepts
EC2 Instance Types for Data Engineering
| Category | Instances | Use Case | vCPU | Memory |
|---|---|---|---|---|
| Compute Optimized | c5, c6g | Spark, Hadoop | Up to 96 | Up to 192GB |
| Memory Optimized | r5, r6g | In-memory processing | Up to 96 | Up to 768GB |
| Storage Optimized | i3, d2 | HDFS, NoSQL | Up to 72 | Up to 512GB |
| Accelerated | p4, p5 | ML training | Up to 192 | Up to 2TB |
IAM Best Practices for Data Engineering
- Least Privilege Principle: Grant minimum necessary permissions
- Use Roles, Not Keys: Avoid long-term credentials
- Cross-Account Access: Use IAM roles for cross-account data sharing
- Service Control Policies: Organizational guardrails
- Permission Boundaries: Limit maximum permissions
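As an illustration of least privilege, the sketch below builds an IAM policy document that lets a role list and read a single prefix of a single bucket and nothing else. The bucket and prefix are examples taken from the reference architecture, and the helper name `read_only_prefix_policy` is hypothetical. The document would typically be attached to a role rather than to a user with long-term keys.

```python
import json


def read_only_prefix_policy(bucket, prefix):
    """Return an IAM policy document granting read-only access to one prefix."""
    prefix = prefix.strip("/")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing is a bucket-level action, so it is scoped by a
                # condition on the requested prefix instead of the resource ARN.
                "Sid": "ListPrefixOnly",
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}/*"]}},
            },
            {
                # Reading is an object-level action scoped by the object ARN.
                "Sid": "ReadObjectsUnderPrefix",
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
            },
        ],
    }


policy = read_only_prefix_policy("data-lake-curated", "gold")
print(json.dumps(policy, indent=2))
```

No write, delete, or wildcard actions appear in the document, so a consumer with this policy cannot modify the gold layer even by mistake.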
Common Interview Questions & Answers
Q1: What is the difference between a Region and an Availability Zone?
Answer: A Region is a physical geographic area (e.g., us-east-1) containing multiple isolated Availability Zones (AZs). Each AZ is one or more discrete data centers with independent power, networking, and connectivity. AZs are connected via low-latency, high-bandwidth private fiber. For data engineering, multi-AZ deployments provide high availability, while multi-region provides disaster recovery.
Q2: How does AWS pricing work for data transfer?
Answer: Data transfer pricing depends on direction and volume:
- Inbound: Free (with exceptions)
- Outbound to Internet: First 100GB free/month, then $0.09/GB (decreasing with volume)
- Cross-AZ: $0.01/GB in each direction
- Cross-Region: $0.02/GB (varies by regions)
- S3 to CloudFront: Free
- VPC Endpoints: Gateway endpoints for S3 and DynamoDB have no hourly or per-GB charge; interface endpoints are billed per hour and per GB
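A quick calculation shows how these rates combine. The sketch below uses the per-GB figures from the list above and deliberately ignores volume tiers and regional differences. The traffic volumes describe a hypothetical pipeline.

```python
# Rates copied from the list above (USD per GB). They ignore volume tiers and
# regional differences, so treat the result as a rough planning number.
INTERNET_OUT_RATE = 0.09
INTERNET_FREE_GB = 100
CROSS_AZ_RATE_EACH_WAY = 0.01
CROSS_REGION_RATE = 0.02


def monthly_transfer_cost(internet_out_gb=0.0, cross_az_gb=0.0, cross_region_gb=0.0):
    """Estimate one month of data transfer charges in USD."""
    internet = max(internet_out_gb - INTERNET_FREE_GB, 0.0) * INTERNET_OUT_RATE
    # Cross-AZ traffic is charged on both the sending and the receiving side.
    cross_az = cross_az_gb * CROSS_AZ_RATE_EACH_WAY * 2
    cross_region = cross_region_gb * CROSS_REGION_RATE
    return round(internet + cross_az + cross_region, 2)


# Hypothetical pipeline: 1.1 TB served to the internet, 5 TB shuffled between
# AZs, and 2 TB replicated to a second region.
print(monthly_transfer_cost(internet_out_gb=1100, cross_az_gb=5000, cross_region_gb=2000))
```

In this example the cross-AZ shuffle costs more than the internet egress, which is why co-locating chatty components such as Spark executors in one AZ is a common cost lever.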
Q3: What is the AWS Well-Architected Framework?
Answer: It's a set of best practices across six pillars:
- Operational Excellence: Infrastructure as Code, monitoring
- Security: Defense in depth, encryption
- Reliability: Fault tolerance, recovery
- Performance Efficiency: Right-sizing, caching
- Cost Optimization: Right-sizing, reserved capacity
- Sustainability: Efficient resource use
For data engineering, focus on Cost Optimization and Performance Efficiency.
Q4: What are the benefits of using AWS for data engineering over on-premises?
Answer:
- Scalability: Scale resources up/down on demand
- Managed Services: Reduce operational overhead (Glue, Redshift, EMR)
- Pay-per-use: No upfront capital expenditure
- Global Reach: Deploy worldwide in minutes
- Integration: Native service integration
- Security: Enterprise-grade security built-in
- Innovation: Access to latest technologies (AI/ML, analytics)
Q5: How do you estimate costs for a data engineering pipeline?
Answer: Use the AWS Pricing Calculator and estimate each component separately:
- Compute: EC2 instances or Lambda invocations
- Storage: S3 storage class and volume
- Data Transfer: Inbound/outbound volumes
- Data Processing: Glue/EMR job duration
- Query: Redshift Spectrum scans, Athena queries
- Monitoring: CloudWatch metrics and logs
Set up AWS Cost Explorer and create billing alarms for unexpected charges.
Cost Considerations
⚠️
Cost Alert: Data engineering workloads can quickly become expensive. Key cost drivers:
- Data transfer: Cross-region and cross-AZ transfers add up
- Storage classes: Using S3 Standard for infrequently accessed data wastes money
- Compute: Over-provisioned instances or underutilized clusters
- Queries: Full table scans in Athena/Redshift Spectrum
- Logging: Excessive CloudWatch logs without retention policies
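The cost of full table scans is easy to quantify for engines that bill by bytes scanned. The sketch below is a rough model: the price per TB is an assumed figure to be checked against current pricing, the table size is hypothetical, and `column_fraction` stands in for the share of bytes a columnar format such as Parquet lets the engine read.

```python
ASSUMED_PRICE_PER_TB = 5.00  # USD per TB scanned; an assumption, check current pricing


def scan_cost(table_tb, partition_fraction=1.0, column_fraction=1.0,
              price_per_tb=ASSUMED_PRICE_PER_TB):
    """Cost of one query given the fraction of partitions and columns it reads."""
    for name, value in (("partition_fraction", partition_fraction),
                        ("column_fraction", column_fraction)):
        if not 0.0 < value <= 1.0:
            raise ValueError(f"{name} must be in (0, 1], got {value}")
    scanned_tb = table_tb * partition_fraction * column_fraction
    return round(scanned_tb * price_per_tb, 4)


TABLE_TB = 10.0  # hypothetical table size

full_scan = scan_cost(TABLE_TB)
# One day out of a 30-day partitioned table, reading 20% of the bytes
# because a columnar format lets the engine skip unused columns.
pruned = scan_cost(TABLE_TB, partition_fraction=1 / 30, column_fraction=0.2)
print(f"full scan: ${full_scan:.2f}, pruned: ${pruned:.4f}")
```

The model ignores compression, which usually shrinks scanned bytes further, so real savings from columnar formats tend to be larger than this estimate.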
| Cost Factor | Optimization Strategy |
|---|---|
| S3 Storage | Use lifecycle policies to transition to IA/Glacier |
| EC2 Compute | Use Spot Instances for fault-tolerant batch jobs |
| Data Transfer | Use gateway VPC endpoints for S3/DynamoDB to avoid NAT gateway charges |
| Redshift | Use Reserved Instances for steady-state |
| Athena | Use columnar formats (Parquet) and partitioning |
| Glue | Optimize job size and worker count |
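The S3 row of the table can be made concrete with a lifecycle rule. The sketch below builds a configuration dictionary in the shape accepted by the S3 lifecycle API (for example, as the `LifecycleConfiguration` argument of boto3's `put_bucket_lifecycle_configuration`). The prefix, the day thresholds, and the helper name `tiering_rule` are illustrative assumptions.

```python
import json


def tiering_rule(prefix, ia_after_days=30, glacier_after_days=90):
    """Build a lifecycle configuration that tiers objects under `prefix`."""
    if glacier_after_days <= ia_after_days:
        raise ValueError("glacier_after_days must be greater than ia_after_days")
    return {
        "Rules": [
            {
                "ID": f"tier-{prefix.strip('/')}",
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    # Move to Standard-IA first, then on to Glacier.
                    {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
                    {"Days": glacier_after_days, "StorageClass": "GLACIER"},
                ],
            }
        ]
    }


# "archive/" matches the historical-data prefix in the raw zone.
config = tiering_rule("archive/")
print(json.dumps(config, indent=2))
```

Building the rule as plain data keeps it reviewable in version control before anything is applied to a bucket.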
Summary
Understanding AWS global infrastructure is foundational for data engineering. Key takeaways:
- Regions provide geographic isolation and compliance
- Availability Zones provide high availability within regions
- Edge Locations provide low-latency content delivery
- Service Selection depends on workload requirements
- Cost Optimization is critical for large-scale data platforms
- Security is a shared responsibility between AWS and customers