VPC Architecture for Data Engineering
┌─────────────────────────────────────────────────────────────────────────┐
│ VPC: 10.0.0.0/16 (65,536 IPs)                                           │
│                                                                         │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ PUBLIC SUBNETS (Internet-Facing)                                    │ │
│ │                                                                     │ │
│ │   ┌───────────────────────────┐     ┌───────────────────────────┐   │ │
│ │   │ Public Subnet AZ-a        │     │ Public Subnet AZ-b        │   │ │
│ │   │ 10.0.1.0/24               │     │ 10.0.2.0/24               │   │ │
│ │   │                           │     │                           │   │ │
│ │   │ • NAT Gateway             │     │ • Bastion Host            │   │ │
│ │   │ • ALB/NLB                 │     │ • Load Balancers          │   │ │
│ │   │ • Bastion Host            │     │                           │   │ │
│ │   └─────────────┬─────────────┘     └─────────────┬─────────────┘   │ │
│ │                 │                                 │                 │ │
│ └─────────────────┼─────────────────────────────────┼─────────────────┘ │
│                   │                                 │                   │
│                   ▼                                 ▼                   │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ INTERNET GATEWAY (igw-xxxxxxxx)                                     │ │
│ │ • Attached to VPC                                                   │ │
│ │ • Provides internet access for public subnets                       │ │
│ │ • Horizontally scaled, redundant, highly available                  │ │
│ └──────────────────────────────────┬──────────────────────────────────┘ │
│                                    │                                    │
│                                    ▼                                    │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ PRIVATE SUBNETS (Data Layer)                                        │ │
│ │                                                                     │ │
│ │   ┌───────────────────────────┐     ┌───────────────────────────┐   │ │
│ │   │ Private Subnet AZ-a       │     │ Private Subnet AZ-b       │   │ │
│ │   │ 10.0.10.0/24              │     │ 10.0.20.0/24              │   │ │
│ │   │                           │     │                           │   │ │
│ │   │ • EMR Cluster             │     │ • Redshift Cluster        │   │ │
│ │   │ • EC2 Instances           │     │ • RDS Instances           │   │ │
│ │   │ • ECS/Fargate             │     │ • ElastiCache             │   │ │
│ │   └─────────────┬─────────────┘     └─────────────┬─────────────┘   │ │
│ │                 │                                 │                 │ │
│ └─────────────────┼─────────────────────────────────┼─────────────────┘ │
│                   │                                 │                   │
│                   ▼                                 ▼                   │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ NAT GATEWAY (nat-gw-xxxxxxxx)                                       │ │
│ │ • Enables outbound internet access for private subnets              │ │
│ │ • One per AZ for high availability                                  │ │
│ │ • Manages network address translation                               │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│                                                                         │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ VPC ENDPOINTS (Private Connectivity to AWS Services)                │ │
│ │                                                                     │ │
│ │  ┌────────────┐   ┌────────────┐   ┌────────────┐   ┌────────────┐  │ │
│ │  │ S3 Gateway │   │ DynamoDB   │   │ Glue       │   │ Redshift   │  │ │
│ │  │ Endpoint   │   │ Gateway    │   │ Interface  │   │ Interface  │  │ │
│ │  │            │   │ Endpoint   │   │ Endpoint   │   │ Endpoint   │  │ │
│ │  └────────────┘   └────────────┘   └────────────┘   └────────────┘  │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────┘
Subnet Design Patterns
Three-Tier Data Architecture
| Tier | Subnet Type | CIDR | Purpose | Services |
|---|---|---|---|---|
| Presentation | Public | 10.0.1.0/24 | External access | ALB, NAT GW |
| Application | Private | 10.0.10.0/24 | Compute | EMR, ECS, Lambda |
| Data | Isolated | 10.0.20.0/24 | Storage | RDS, Redshift |
Data Engineering Subnet Strategy
# VPC Configuration for Data Engineering
vpc_config = {
    "vpc_cidr": "10.0.0.0/16",
    "subnets": {
        "public": [
            {"cidr": "10.0.1.0/24", "az": "us-east-1a", "purpose": "NAT/ALB"},
            {"cidr": "10.0.2.0/24", "az": "us-east-1b", "purpose": "NAT/ALB"},
            {"cidr": "10.0.3.0/24", "az": "us-east-1c", "purpose": "NAT/ALB"}
        ],
        "private_compute": [
            {"cidr": "10.0.10.0/24", "az": "us-east-1a", "purpose": "EMR/ECS"},
            {"cidr": "10.0.11.0/24", "az": "us-east-1b", "purpose": "EMR/ECS"},
            {"cidr": "10.0.12.0/24", "az": "us-east-1c", "purpose": "EMR/ECS"}
        ],
        "private_data": [
            {"cidr": "10.0.20.0/24", "az": "us-east-1a", "purpose": "RDS/Redshift"},
            {"cidr": "10.0.21.0/24", "az": "us-east-1b", "purpose": "RDS/Redshift"},
            {"cidr": "10.0.22.0/24", "az": "us-east-1c", "purpose": "RDS/Redshift"}
        ],
        "isolated": [
            {"cidr": "10.0.30.0/24", "az": "us-east-1a", "purpose": "No internet"},
            {"cidr": "10.0.31.0/24", "az": "us-east-1b", "purpose": "No internet"}
        ]
    }
}
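A subnet plan like this is easy to get subtly wrong (an overlapping range, or a CIDR that falls outside the VPC). The following is a minimal sketch of a sanity check using Python's standard `ipaddress` module. The `validate_plan` helper is hypothetical, and the plan is a trimmed copy of the configuration above so the block runs on its own.

```python
import ipaddress
from itertools import combinations

# Trimmed copy of the plan above, one subnet per tier, for illustration only.
vpc_config = {
    "vpc_cidr": "10.0.0.0/16",
    "subnets": {
        "public": [{"cidr": "10.0.1.0/24", "az": "us-east-1a"}],
        "private_compute": [{"cidr": "10.0.10.0/24", "az": "us-east-1a"}],
        "private_data": [{"cidr": "10.0.20.0/24", "az": "us-east-1a"}],
        "isolated": [{"cidr": "10.0.30.0/24", "az": "us-east-1a"}],
    },
}


def validate_plan(config):
    """Return a list of problems; an empty list means the plan is consistent."""
    vpc = ipaddress.ip_network(config["vpc_cidr"])
    subnets = [
        ipaddress.ip_network(entry["cidr"])
        for tier in config["subnets"].values()
        for entry in tier
    ]
    # Every subnet must sit inside the VPC range.
    problems = [f"{net} is outside {vpc}" for net in subnets if not net.subnet_of(vpc)]
    # No two subnets may share addresses.
    problems += [
        f"{a} overlaps {b}" for a, b in combinations(subnets, 2) if a.overlaps(b)
    ]
    return problems


print(validate_plan(vpc_config))  # [] for the plan above
```

Running a check like this before provisioning is cheaper than discovering an overlap when a subnet creation call fails.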
Pro Tip: Default to /24 subnets for data layer services. Each /24 provides 251 usable IPs (AWS reserves 5 addresses in every subnet). For large EMR clusters, consider /23 or /22 subnets.
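The subnet sizing arithmetic can be checked with the standard `ipaddress` module. This is a small sketch; the `usable_ips` helper is a hypothetical name, and the example CIDRs are chosen only because they align on /23 and /22 boundaries.

```python
import ipaddress

# AWS reserves the first four addresses and the last address of every subnet.
AWS_RESERVED_PER_SUBNET = 5


def usable_ips(cidr):
    """Number of addresses left for workloads in an AWS subnet."""
    return ipaddress.ip_network(cidr).num_addresses - AWS_RESERVED_PER_SUBNET


for cidr in ("10.0.20.0/24", "10.0.20.0/23", "10.0.20.0/22"):
    print(cidr, usable_ips(cidr))
# 10.0.20.0/24 251
# 10.0.20.0/23 507
# 10.0.20.0/22 1019
```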
Security Groups vs NACLs
Security Groups (Stateful)
┌─────────────────────────────────────────────────────────────────────────┐
│ SECURITY GROUPS (Stateful Firewall)                                     │
│                                                                         │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ EMR Cluster Security Group (emr-sg)                                 │ │
│ │                                                                     │ │
│ │  INBOUND RULES:                    OUTBOUND RULES:                  │ │
│ │  ┌─────────┬───────┬───────────┐   ┌─────────┬───────┬───────────┐  │ │
│ │  │ Type    │ Port  │ Source    │   │ Type    │ Port  │ Dest      │  │ │
│ │  ├─────────┼───────┼───────────┤   ├─────────┼───────┼───────────┤  │ │
│ │  │ SSH     │ 22    │ Bastion   │   │ All     │ All   │ 0.0.0.0/0 │  │ │
│ │  │ Custom  │ 8088  │ Mgmt-sg   │   │ TCP     │ 443   │ S3 VPCE   │  │ │
│ │  │ Custom  │ 18080 │ Mgmt-sg   │   │ TCP     │ 443   │ Glue VPCE │  │ │
│ │  │ Custom  │ 8020  │ Worker-sg │   │ TCP     │ 5439  │ Redshift  │  │ │
│ │  │ Custom  │ 7077  │ Worker-sg │   │ TCP     │ 3306  │ RDS       │  │ │
│ │  └─────────┴───────┴───────────┘   └─────────┴───────┴───────────┘  │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│                                                                         │
│ KEY CHARACTERISTICS:                                                    │
│ • Stateful: Return traffic automatically allowed                        │
│ • Instance-level: Attached to ENIs                                      │
│ • Allow rules only (no deny)                                            │
│ • Checked after NACLs on inbound, before NACLs on outbound              │
│ • Can reference other security groups                                   │
└─────────────────────────────────────────────────────────────────────────┘
Network ACLs (Stateless)
┌─────────────────────────────────────────────────────────────────────────┐
│ NETWORK ACLS (Stateless Firewall)                                       │
│                                                                         │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Private Subnet NACL (nacl-private)                                  │ │
│ │                                                                     │ │
│ │ INBOUND RULES (evaluated in order):                                 │ │
│ │ ┌─────┬───────┬────────────┬──────────────────────────────────────┐ │ │
│ │ │ #   │ Type  │ Port       │ Description                          │ │ │
│ │ ├─────┼───────┼────────────┼──────────────────────────────────────┤ │ │
│ │ │ 100 │ SSH   │ 22         │ Allow SSH from Bastion               │ │ │
│ │ │ 110 │ TCP   │ 1024-65535 │ Allow ephemeral (return traffic)     │ │ │
│ │ │ 120 │ TCP   │ 443        │ Allow HTTPS                          │ │ │
│ │ │ *   │ All   │ All        │ Deny all other traffic               │ │ │
│ │ └─────┴───────┴────────────┴──────────────────────────────────────┘ │ │
│ │                                                                     │ │
│ │ OUTBOUND RULES (evaluated in order):                                │ │
│ │ ┌─────┬───────┬────────────┬──────────────────────────────────────┐ │ │
│ │ │ #   │ Type  │ Port       │ Description                          │ │ │
│ │ ├─────┼───────┼────────────┼──────────────────────────────────────┤ │ │
│ │ │ 100 │ TCP   │ 22         │ Allow SSH to Bastion                 │ │ │
│ │ │ 110 │ TCP   │ 1024-65535 │ Allow ephemeral ports                │ │ │
│ │ │ 120 │ TCP   │ 443        │ Allow HTTPS to internet              │ │ │
│ │ │ *   │ All   │ All        │ Deny all other traffic               │ │ │
│ │ └─────┴───────┴────────────┴──────────────────────────────────────┘ │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│                                                                         │
│ KEY CHARACTERISTICS:                                                    │
│ • Stateless: Must explicitly allow return traffic                       │
│ • Subnet-level: Applied to all instances in subnet                      │
│ • Allow and Deny rules                                                  │
│ • Checked before Security Groups on inbound, after on outbound          │
│ • Rule numbers determine evaluation order                               │
└─────────────────────────────────────────────────────────────────────────┘
Security Warning: NACLs are stateless. Always allow ephemeral ports (1024-65535) for return traffic in both the inbound and outbound rules. Forgetting this is a common cause of connectivity issues.
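To make the stateless behaviour concrete, here is a simplified sketch of NACL evaluation in Python: rules are checked in ascending rule-number order, the first match decides, and anything unmatched hits the implicit deny. The `NaclRule` class and `evaluate` function are hypothetical illustrations, and the model ignores source and destination CIDRs for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NaclRule:
    number: int
    protocol: str  # "tcp" or "all"
    port_from: int
    port_to: int
    allow: bool


def evaluate(rules, protocol, port):
    """Return True if the lowest-numbered matching rule allows the packet."""
    for rule in sorted(rules, key=lambda r: r.number):
        if rule.protocol in ("all", protocol) and rule.port_from <= port <= rule.port_to:
            return rule.allow
    return False  # implicit '*' rule: deny anything unmatched


# Inbound rules from the private subnet NACL shown above.
inbound = [
    NaclRule(100, "tcp", 22, 22, True),
    NaclRule(110, "tcp", 1024, 65535, True),
    NaclRule(120, "tcp", 443, 443, True),
]
without_ephemeral = [r for r in inbound if r.number != 110]

# The reply to an outbound HTTPS request arrives on an ephemeral port, e.g. 50000.
print(evaluate(inbound, "tcp", 50000))            # True
print(evaluate(without_ephemeral, "tcp", 50000))  # False
```

With rule 110 removed, the outbound request still leaves the subnet, but the reply is dropped on the way back in, which is the failure mode the warning describes.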
VPC Endpoints for Data Services
Gateway Endpoints (Free)
| Service | Endpoint Type | Use Case |
|---|---|---|
| S3 | Gateway | Data lake access without internet |
| DynamoDB | Gateway | NoSQL database access |
Interface Endpoints (Cost: about $0.01/hr per ENI, plus per-GB data processing; varies by region)
| Service | Endpoint Type | Use Case |
|---|---|---|
| Glue | Interface | ETL job metadata access |
| Redshift | Interface | Data warehouse connectivity |
| Kinesis | Interface | Streaming data ingestion |
| Lambda | Interface | Serverless compute access |
| STS | Interface | Role assumption |
VPC Endpoint Policy for S3
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowAccessToSpecificBuckets",
      "Effect": "Allow",
      "Principal": "*",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::data-lake-*",
        "arn:aws:s3:::data-lake-*/*"
      ],
      "Condition": {
        "StringEquals": {
          "aws:sourceVpce": "vpce-1234567890abcdef0"
        }
      }
    }
  ]
}
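When the same policy shape is needed for several bucket prefixes, it can help to generate the document rather than hand-edit JSON. The sketch below is one possible approach using only the standard `json` module. The `build_s3_endpoint_policy` helper and its parameters are hypothetical, and the output mirrors the policy above.

```python
import json


def build_s3_endpoint_policy(bucket_prefix, vpce_id=None,
                             actions=("s3:GetObject", "s3:PutObject", "s3:ListBucket")):
    """Build an endpoint policy dict limited to buckets starting with bucket_prefix."""
    statement = {
        "Sid": "AllowAccessToSpecificBuckets",
        "Effect": "Allow",
        "Principal": "*",
        "Action": list(actions),
        "Resource": [
            f"arn:aws:s3:::{bucket_prefix}*",    # bucket-level actions (ListBucket)
            f"arn:aws:s3:::{bucket_prefix}*/*",  # object-level actions (Get/Put)
        ],
    }
    if vpce_id is not None:
        # Optional condition, matching the example policy above.
        statement["Condition"] = {"StringEquals": {"aws:sourceVpce": vpce_id}}
    return {"Version": "2012-10-17", "Statement": [statement]}


policy_document = json.dumps(build_s3_endpoint_policy("data-lake-"), indent=2)
print(policy_document)
```

The resulting string is what would be supplied as the endpoint's policy document; how it is attached (console, CLI, or infrastructure-as-code) is outside this sketch.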
NAT Gateway Design
┌─────────────────────────────────────────────────────────────────────────┐
│ HIGH-AVAILABILITY NAT GATEWAY DESIGN                                    │
│                                                                         │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │                                                                     │ │
│ │                AZ-a                            AZ-b                 │ │
│ │        ┌───────────────────┐           ┌───────────────────┐        │ │
│ │        │ Public Subnet     │           │ Public Subnet     │        │ │
│ │        │ 10.0.1.0/24       │           │ 10.0.2.0/24       │        │ │
│ │        │                   │           │                   │        │ │
│ │        │  ┌─────────────┐  │           │  ┌─────────────┐  │        │ │
│ │        │  │ NAT Gateway │  │           │  │ NAT Gateway │  │        │ │
│ │        │  │ nat-aaa111  │  │           │  │ nat-bbb222  │  │        │ │
│ │        │  │ EIP: 1.2.3  │  │           │  │ EIP: 4.5.6  │  │        │ │
│ │        │  └──────┬──────┘  │           │  └──────┬──────┘  │        │ │
│ │        └─────────┼─────────┘           └─────────┼─────────┘        │ │
│ │                  │                               │                  │ │
│ │                  ▼                               ▼                  │ │
│ │        ┌───────────────────┐           ┌───────────────────┐        │ │
│ │        │ Private Subnet    │           │ Private Subnet    │        │ │
│ │        │ 10.0.10.0/24      │           │ 10.0.20.0/24      │        │ │
│ │        │                   │           │                   │        │ │
│ │        │ Route Table:      │           │ Route Table:      │        │ │
│ │        │ 0.0.0.0/0 →       │           │ 0.0.0.0/0 →       │        │ │
│ │        │ nat-aaa111        │           │ nat-bbb222        │        │ │
│ │        └───────────────────┘           └───────────────────┘        │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│                                                                         │
│ KEY DESIGN PRINCIPLES:                                                  │
│ • One NAT Gateway per AZ for high availability                          │
│ • Route tables point to local NAT Gateway                               │
│ • Each NAT Gateway has its own Elastic IP                               │
│ • For cost optimization, consider NAT instances for dev/test            │
└─────────────────────────────────────────────────────────────────────────┘
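The "route tables point to local NAT Gateway" principle can be sketched as a small planning function. This is an illustration only: `plan_default_routes` is a hypothetical helper, the NAT IDs and CIDRs are the placeholders from the diagram, and the `us-east-1a`/`us-east-1b` zone names are assumed stand-ins for AZ-a and AZ-b.

```python
def plan_default_routes(private_subnets, nat_gateways):
    """Map each private subnet to a default route via the NAT gateway in its own AZ."""
    routes = {}
    for subnet in private_subnets:
        nat_id = nat_gateways.get(subnet["az"])
        if nat_id is None:
            # Refuse to fall back to another AZ: that would hide a
            # cross-AZ dependency the design is meant to avoid.
            raise ValueError(f"no NAT gateway in {subnet['az']} for {subnet['cidr']}")
        routes[subnet["cidr"]] = {"destination": "0.0.0.0/0", "target": nat_id}
    return routes


nat_gateways = {"us-east-1a": "nat-aaa111", "us-east-1b": "nat-bbb222"}
private_subnets = [
    {"cidr": "10.0.10.0/24", "az": "us-east-1a"},
    {"cidr": "10.0.20.0/24", "az": "us-east-1b"},
]

for cidr, route in plan_default_routes(private_subnets, nat_gateways).items():
    print(cidr, route["destination"], "->", route["target"])
```

Failing loudly when an AZ has no NAT gateway is a deliberate choice here: silently routing through another zone would work until that zone has an outage.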
Data Engineering Network Patterns
Pattern 1: Private Data Layer
┌─────────────────────────────────────────────────────────────────────────┐
│ PRIVATE DATA LAYER PATTERN                                              │
│                                                                         │
│ Outbound: Private Subnet ──► NAT GW (public subnet) ──► Internet        │
│ Inbound:  Internet ──► ALB (public subnet) ──► Private Subnet           │
│                                                       │                 │
│                                                       ▼                 │
│                                              ┌─────────────────┐        │
│                                              │ VPC Endpoints   │        │
│                                              │                 │        │
│                                              │ • S3 Gateway    │        │
│                                              │ • Glue IF       │        │
│                                              │ • Redshift IF   │        │
│                                              │ • Lambda IF     │        │
│                                              └────────┬────────┘        │
│                                                       │                 │
│                                                       ▼                 │
│                                              ┌─────────────────┐        │
│                                              │ Data Services   │        │
│                                              │                 │        │
│                                              │ • EMR Cluster   │        │
│                                              │ • Redshift      │        │
│                                              │ • RDS Aurora    │        │
│                                              │ • ElastiCache   │        │
│                                              └─────────────────┘        │
└─────────────────────────────────────────────────────────────────────────┘
Pattern 2: Cross-VPC Data Sharing
┌─────────────────────────────────────────────────────────────────────────┐
│ VPC PEERING FOR DATA SHARING                                            │
│                                                                         │
│ ┌──────────────────────────────┐       ┌──────────────────────────────┐ │
│ │ VPC-A (Data Engineering)     │       │ VPC-B (Analytics)            │ │
│ │ 10.0.0.0/16                  │◄─────►│ 10.1.0.0/16                  │ │
│ │                              │Peering│                              │ │
│ │  ┌────────────────────────┐  │       │  ┌────────────────────────┐  │ │
│ │  │ Subnet: 10.0.10.0/24   │  │       │  │ Subnet: 10.1.10.0/24   │  │ │
│ │  │ Services: EMR, Glue    │  │       │  │ Services: Redshift     │  │ │
│ │  └────────────────────────┘  │       │  └────────────────────────┘  │ │
│ │                              │       │                              │ │
│ │ Route Table:                 │       │ Route Table:                 │ │
│ │ 10.1.0.0/16 → pcx-xxx        │       │ 10.0.0.0/16 → pcx-xxx        │ │
│ └──────────────────────────────┘       └──────────────────────────────┘ │
│                                                                         │
│ ALTERNATIVE: TRANSIT GATEWAY (for many VPCs)                            │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ TRANSIT GATEWAY                                                     │ │
│ │                                                                     │ │
│ │ VPC-A ──► TGW ──► VPC-B                                             │ │
│ │ VPC-C ──► TGW ──► VPC-D                                             │ │
│ │ VPC-E ──► TGW ──► VPC-F                                             │ │
│ │                                                                     │ │
│ │ Benefits:                                                           │ │
│ │ • Hub-and-spoke topology                                            │ │
│ │ • Centralized routing                                               │ │
│ │ • Cross-region support                                              │ │
│ │ • Multicast support                                                 │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────┘
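Because VPC peering is not transitive, connecting every VPC to every other VPC requires a full mesh, while a Transit Gateway needs only one attachment per VPC. A quick sketch of that arithmetic (the function names are hypothetical):

```python
def peering_connections(vpc_count):
    """Peering connections needed for a full mesh: n * (n - 1) / 2."""
    return vpc_count * (vpc_count - 1) // 2


def transit_gateway_attachments(vpc_count):
    """Hub-and-spoke: one attachment per VPC."""
    return vpc_count


for n in (2, 6, 20):
    print(n, peering_connections(n), transit_gateway_attachments(n))
# 2 1 2
# 6 15 6
# 20 190 20
```

For the six VPCs in the diagram, a mesh would need 15 peering connections and as many pairs of route table entries to maintain, which is why the hub-and-spoke option becomes attractive as the VPC count grows.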
VPC Flow Logs
Enabling Flow Logs
# Enable VPC Flow Logs to CloudWatch
aws ec2 create-flow-logs \
  --resource-type VPC \
  --resource-ids vpc-1234567890abcdef0 \
  --traffic-type ALL \
  --log-destination-type cloud-watch-logs \
  --log-group-name /aws/vpc/flowlogs \
  --deliver-logs-permission-arn arn:aws:iam::123456789012:role/VPCFlowLogsRole

# Enable VPC Flow Logs to S3
aws ec2 create-flow-logs \
  --resource-type VPC \
  --resource-ids vpc-1234567890abcdef0 \
  --traffic-type ALL \
  --log-destination-type s3 \
  --log-destination arn:aws:s3:::vpc-flow-logs-bucket
Flow Log Query in Athena
-- Find top talkers in VPC
SELECT
    srcaddr,
    dstaddr,
    SUM(bytes) AS total_bytes,
    SUM(packets) AS packet_count  -- COUNT(*) would count log records, not packets
FROM vpc_flow_logs
WHERE account_id = '123456789012'
  AND region = 'us-east-1'
  AND day = '2024-01-15'
GROUP BY srcaddr, dstaddr
ORDER BY total_bytes DESC
LIMIT 20;

-- Find rejected connections
SELECT
    srcaddr,
    dstport,
    protocol,
    action,
    COUNT(*) AS attempts  -- number of rejected flow records
FROM vpc_flow_logs
WHERE action = 'REJECT'
GROUP BY srcaddr, dstport, protocol, action
ORDER BY attempts DESC;
Pro Tip: Use VPC Flow Logs to troubleshoot connectivity issues, monitor network traffic, and detect security anomalies. Store them in S3 and query with Athena for cost-effective analysis.
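For quick local triage without Athena, the same two questions (top talkers and rejected connections) can be answered by parsing raw records directly. The sketch below assumes the default space-delimited flow log format (version 2 field order). The sample lines are made up for illustration, and `parse_record` is a hypothetical helper.

```python
from collections import Counter

# Field order of the default (version 2) flow log record format.
FIELDS = (
    "version account_id interface_id srcaddr dstaddr srcport dstport "
    "protocol packets bytes start end action log_status"
).split()


def parse_record(line):
    """Split one space-delimited flow log line into a dict keyed by field name."""
    values = line.split()
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    record = dict(zip(FIELDS, values))
    # NODATA/SKIPDATA records carry "-" instead of counters; treat those as zero.
    for key in ("packets", "bytes"):
        record[key] = 0 if record[key] == "-" else int(record[key])
    return record


# Made-up sample records, for illustration only.
sample_lines = [
    "2 123456789012 eni-0a1b2c3d 10.0.10.15 10.0.20.8 49152 5439 6 20 4200 1705276800 1705276860 ACCEPT OK",
    "2 123456789012 eni-0a1b2c3d 10.0.10.15 10.0.20.8 49153 5439 6 10 1800 1705276860 1705276920 ACCEPT OK",
    "2 123456789012 eni-0a1b2c3d 203.0.113.9 10.0.10.15 40000 22 6 1 60 1705276800 1705276860 REJECT OK",
]

records = [parse_record(line) for line in sample_lines]

# Top talkers: total bytes per (source, destination) pair.
bytes_by_pair = Counter()
for rec in records:
    bytes_by_pair[(rec["srcaddr"], rec["dstaddr"])] += rec["bytes"]

rejected = [rec for rec in records if rec["action"] == "REJECT"]

print(bytes_by_pair.most_common(1))  # [(('10.0.10.15', '10.0.20.8'), 6000)]
print(len(rejected))                 # 1
```

Flow logs created with a custom format use a different field order, so `FIELDS` would need to match whatever format was configured.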
Interview Questions & Answers
Q1: What is the difference between Security Groups and NACLs?
Answer:
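- Security Groups: Stateful, instance-level (attached to ENIs), allow rules only; checked after the NACL for inbound traffic and before it for outbound traffic
- NACLs: Stateless, subnet-level, allow and deny rules evaluated in rule-number order; checked before Security Groups for inbound traffic and after them for outbound traffic
For data engineering, use Security Groups for fine-grained instance control and NACLs for broad subnet-level restrictions.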
Q2: How do you design a VPC for a data lake?
Answer:
- Use /16 CIDR for scalability (65,536 IPs)
- Create subnets in multiple AZs
- Use VPC endpoints for S3, Glue, Redshift
- Place data services in private subnets
- Use NAT Gateways for outbound internet access from private subnets
- Implement flow logs for monitoring
Q3: What are VPC endpoints and when should you use them?
Answer: VPC endpoints provide private connectivity to AWS services without sending traffic over the public internet. Use them for:
- S3 Gateway Endpoint: Free, for data lake access
- Interface Endpoints: $0.01/hr, for Glue, Redshift, Kinesis
- Benefits: Reduced data transfer costs, improved security, lower latency
Q4: How do you handle cross-VPC data sharing?
Answer:
- VPC Peering: Direct connectivity between two VPCs (non-transitive)
- Transit Gateway: Hub-and-spoke for many VPCs (transitive)
- Shared Services VPC: Central data services VPC with cross-account access
- S3 Cross-Account Access: Bucket policies for cross-account reads
Q5: What is the maximum number of subnets per VPC?
Answer:
- Default quota: 200 subnets per VPC
- Adjustable: Can be increased via Service Quotas
- Best practice: Use larger CIDR blocks (/16) for flexibility
- Data engineering: Plan for growth, especially for EMR clusters
Cost Considerations
| Component | Cost | Optimization |
|---|---|---|
| NAT Gateway | $0.045/hr + data | Use VPC endpoints instead |
| VPC Flow Logs | $0.50/GB (CW) | Use S3 for long-term storage |
| VPC Peering | Connection free; cross-AZ and cross-region data transfer billed | Avoid cross-region peering where possible |
| Transit Gateway | $0.05/hr per attachment + data | Use for many VPCs |
| VPC Endpoints (Gateway) | Free | Always use for S3/DynamoDB |
| VPC Endpoints (Interface) | $0.01/hr per ENI | Only when needed |
Cost Warning: NAT Gateway data processing charges can add up quickly for data engineering workloads. Route S3 and DynamoDB traffic through gateway endpoints (free), and use interface endpoints for other high-volume AWS services, to keep that traffic off the NAT Gateway.
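A rough estimate shows how much of a NAT Gateway bill can come from data processing. The sketch below uses the hourly rate from the table; the per-GB processing rate and the 730-hour month are assumptions that should be checked against current regional pricing. The traffic volumes and function names are purely illustrative.

```python
HOURS_PER_MONTH = 730    # assumed average month length
NAT_HOURLY_USD = 0.045   # hourly rate from the table above
NAT_PER_GB_USD = 0.045   # ASSUMED data processing rate; verify for your region


def nat_monthly_cost(gb_processed, gateways=1):
    """Estimated monthly NAT cost: hourly charge per gateway plus per-GB processing."""
    hourly = gateways * HOURS_PER_MONTH * NAT_HOURLY_USD
    return hourly + gb_processed * NAT_PER_GB_USD


# Illustrative workload: 10,000 GB/month through two NAT gateways,
# of which 8,000 GB is S3 traffic that a gateway endpoint could carry for free.
total_gb, s3_gb = 10_000, 8_000
before = nat_monthly_cost(total_gb, gateways=2)
after = nat_monthly_cost(total_gb - s3_gb, gateways=2)

print(f"before: ${before:,.2f}  after: ${after:,.2f}")
```

Under these assumptions the hourly charge is a small part of the total, so moving S3 traffic to a gateway endpoint removes most of the cost.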
Summary
VPC networking is critical for secure, high-performance data engineering. Key takeaways:
- Design: Use /16 CIDR with /24 subnets for data services
- Security: Security Groups (instance) + NACLs (subnet) for defense in depth
- VPC Endpoints: Gateway (free) for S3/DynamoDB, Interface for other services
- NAT Gateways: One per AZ for high availability
- Monitoring: VPC Flow Logs for traffic analysis and troubleshooting
- Cross-VPC: Use Transit Gateway for many VPCs, Peering for simple cases