S3 Architecture Overview
AMAZON S3 ARCHITECTURE

S3 BUCKET: data-lake-production

  BUCKET LEVEL CONFIGURATION
    • Versioning: Enabled
    • Encryption: AES-256 / KMS
    • Access Logging: Enabled
    • Lifecycle Rules: Configured
    • Replication: Cross-Region Enabled
    • Object Lock: Governance Mode

  PREFIX HIERARCHY
    data-lake-production/
    ├── raw/              (Incoming data)
    │   ├── landing/      (Temporary staging)
    │   ├── source_a/     (Source system A)
    │   └── source_b/     (Source system B)
    ├── processed/        (Cleaned data)
    │   ├── silver/       (Validated)
    │   └── gold/         (Business-ready)
    ├── curated/          (Aggregated)
    │   ├── daily/        (Daily aggregates)
    │   ├── weekly/       (Weekly reports)
    │   └── monthly/      (Monthly rollups)
    └── archive/          (Historical)

STORAGE CLASSES (storage price per GB per month)
| Class | Price | Access / Minimum |
|---|---|---|
| Standard | $0.023/GB | ms access |
| IA | $0.0125/GB | min 30 days |
| One Zone-IA | $0.01/GB | min 30 days |
| Glacier Instant | $0.004/GB | ms access |
| Glacier Flexible Retrieval | $0.0036/GB | minutes to hours (3-5 hrs standard) |
| Glacier Deep Archive | $0.00099/GB | 12 hrs |
Storage Classes Comparison
| Storage Class | Durability | Availability | Min Storage | Retrieval | Use Case |
|---|---|---|---|---|---|
| S3 Standard | 11 9's | 99.99% | None | ms | Frequently accessed |
| S3 Intelligent-Tiering | 11 9's | 99.9% | None | ms | Unknown access patterns |
| S3 Standard-IA | 11 9's | 99.9% | 30 days | ms | Infrequent access |
| S3 One Zone-IA | 11 9's | 99.5% | 30 days | ms | Recreatable, infrequent |
| S3 Glacier Instant | 11 9's | 99.9% | 90 days | ms | Archive, fast retrieval |
| S3 Glacier Flexible | 11 9's | 99.99% | 90 days | min-hours | Archive, flexible |
| S3 Glacier Deep Archive | 11 9's | 99.99% | 180 days | 12 hrs | Long-term archive |
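To make the availability column concrete, the following sketch converts an availability percentage into the downtime it permits per year. The figures are copied from the table; the function name and the 365-day year are illustrative assumptions, and the percentages are design targets rather than measured uptime.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, ignoring leap years

# Availability figures copied from the comparison table.
AVAILABILITY = {
    "S3 Standard": 0.9999,
    "S3 Standard-IA": 0.999,
    "S3 One Zone-IA": 0.995,
}


def max_downtime_minutes(availability: float) -> float:
    """Minutes per year a service may be unavailable at this availability."""
    return (1.0 - availability) * MINUTES_PER_YEAR


for name, availability in AVAILABILITY.items():
    print(f"{name}: {max_downtime_minutes(availability):,.1f} min/year")
```

The gap is larger than the percentages suggest: 99.99% allows roughly 53 minutes a year, while 99.5% allows close to 44 hours.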
S3 Lifecycle Policies
Data Lake Lifecycle Configuration
{
  "Rules": [
    {
      "ID": "RawDataLifecycle",
      "Status": "Enabled",
      "Filter": {
        "Prefix": "raw/"
      },
      "Transitions": [
        {
          "Days": 90,
          "StorageClass": "STANDARD_IA"
        },
        {
          "Days": 180,
          "StorageClass": "GLACIER"
        },
        {
          "Days": 365,
          "StorageClass": "DEEP_ARCHIVE"
        }
      ],
      "Expiration": {
        "Days": 2555
      }
    },
    {
      "ID": "ProcessedDataLifecycle",
      "Status": "Enabled",
      "Filter": {
        "Prefix": "processed/"
      },
      "Transitions": [
        {
          "Days": 30,
          "StorageClass": "STANDARD_IA"
        },
        {
          "Days": 90,
          "StorageClass": "GLACIER_IR"
        },
        {
          "Days": 365,
          "StorageClass": "GLACIER"
        }
      ]
    },
    {
      "ID": "CuratedDataLifecycle",
      "Status": "Enabled",
      "Filter": {
        "Prefix": "curated/"
      },
      "Transitions": [
        {
          "Days": 365,
          "StorageClass": "STANDARD_IA"
        }
      ]
    },
    {
      "ID": "CleanupMultipartUploads",
      "Status": "Enabled",
      "Filter": {},
      "AbortIncompleteMultipartUpload": {
        "DaysAfterInitiation": 7
      }
    },
    {
      "ID": "CleanupExpiredMarkers",
      "Status": "Enabled",
      "Filter": {},
      "Expiration": {
        "ExpiredObjectDeleteMarker": true
      }
    }
  ]
}
Objects are written to STANDARD by default, so no rule transitions to it; STANDARD is not a valid lifecycle transition target. The lifecycle API name for Glacier Instant Retrieval is GLACIER_IR.
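A quick way to sanity-check a rule before deploying it is to simulate which storage class an object would be in at a given age. The sketch below does that for the raw/ prefix thresholds (90, 180, 365 days, expiry at 2555). The function and constant names are hypothetical, and the model is simplified: S3 applies lifecycle actions asynchronously, so real transitions can lag the threshold.

```python
from typing import List, Optional, Tuple

# (days_after_creation, storage_class) pairs mirroring the raw/ rule.
RAW_TRANSITIONS: List[Tuple[int, str]] = [
    (90, "STANDARD_IA"),
    (180, "GLACIER"),
    (365, "DEEP_ARCHIVE"),
]
RAW_EXPIRATION_DAYS = 2555  # roughly seven years


def storage_class_at(
    age_days: int,
    transitions: List[Tuple[int, str]],
    expiration_days: Optional[int] = None,
    initial: str = "STANDARD",
) -> Optional[str]:
    """Return the expected storage class at age_days, or None once expired."""
    if expiration_days is not None and age_days >= expiration_days:
        return None
    current = initial
    # Walk the thresholds in ascending order; the last one reached wins.
    for days, storage_class in sorted(transitions):
        if age_days >= days:
            current = storage_class
    return current


for age in (0, 100, 200, 400, 3000):
    print(age, storage_class_at(age, RAW_TRANSITIONS, RAW_EXPIRATION_DAYS))
```

Running the same function over the thresholds of another prefix makes it easy to compare retention plans side by side.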
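ℹ️ Pro Tip: Use S3 Intelligent-Tiering for data with unknown or changing access patterns. It automatically moves objects between access tiers based on usage and charges no retrieval fees, in exchange for a small per-object monitoring fee. Savings of up to 70% on storage costs are possible when most of the data goes cold.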
S3 Versioning and Replication
Versioning Architecture
S3 VERSIONING & REPLICATION

SOURCE BUCKET (us-east-1)
Object: data/sales/2024-01-15.parquet
| Version ID | Size | Storage Class | Last Modified |
|---|---|---|---|
| v1 (current) | 2.5 GB | STANDARD | 2024-01-15 |
| v2 | 2.3 GB | STANDARD_IA | 2024-01-14 |
| v3 | 2.1 GB | GLACIER | 2024-01-13 |
| v4 (delete marker) | 0 | - | 2024-01-12 |

The source bucket replicates to one or both of:

CROSS-REGION REPLICATION (CRR)
  Source: us-east-1
  Dest: us-west-2
  • Async replication
  • Most objects replicate within ~15 min
  • Versioning required on both buckets
  • SSE-KMS objects need explicit opt-in

SAME-REGION REPLICATION (SRR)
  Source: us-east-1
  Dest: us-east-1 (same or different account)
  • Async replication
  • Most objects replicate within ~15 min
  • Versioning required on both buckets
Replication Configuration
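{
  "Role": "arn:aws:iam::123456789012:role/S3ReplicationRole",
  "Rules": [
    {
      "Status": "Enabled",
      "Priority": 1,
      "Filter": {
        "Prefix": "curated/"
      },
      "SourceSelectionCriteria": {
        "SseKmsEncryptedObjects": {
          "Status": "Enabled"
        }
      },
      "Destination": {
        "Bucket": "arn:aws:s3:::data-lake-disaster-recovery",
        "StorageClass": "STANDARD_IA",
        "ReplicationTime": {
          "Status": "Enabled",
          "Time": {
            "Minutes": 15
          }
        },
        "Metrics": {
          "Status": "Enabled",
          "EventThreshold": {
            "Minutes": 15
          }
        },
        "EncryptionConfiguration": {
          "ReplicaKmsKeyID": "arn:aws:kms:us-west-2:123456789012:key/12345678-1234-1234-1234-123456789012"
        }
      },
      "DeleteMarkerReplication": {
        "Status": "Enabled"
      }
    }
  ]
}
ReplicationTime must be paired with a Metrics block, and SourceSelectionCriteria opts SSE-KMS encrypted objects into replication so the replica KMS key is actually used.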
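Replication rules have a few cross-field requirements that are easy to miss when editing JSON by hand. The sketch below is a hypothetical pre-flight check (the function name and messages are assumptions, and it covers only two requirements): a rule that enables ReplicationTime also needs Metrics enabled, and a rule that uses Filter also needs a DeleteMarkerReplication element.

```python
from typing import Any, Dict, List


def replication_rule_problems(rule: Dict[str, Any]) -> List[str]:
    """Return human-readable problems found in one replication rule."""
    problems: List[str] = []
    destination = rule.get("Destination", {})

    # S3 Replication Time Control must be specified together with Metrics.
    rtc_on = destination.get("ReplicationTime", {}).get("Status") == "Enabled"
    metrics_on = destination.get("Metrics", {}).get("Status") == "Enabled"
    if rtc_on and not metrics_on:
        problems.append("ReplicationTime is enabled but Metrics is not")

    # Rules written with Filter must state how delete markers are handled.
    if "Filter" in rule and "DeleteMarkerReplication" not in rule:
        problems.append("Filter is present but DeleteMarkerReplication is missing")

    return problems


example_rule = {
    "Status": "Enabled",
    "Priority": 1,
    "Filter": {"Prefix": "curated/"},
    "Destination": {
        "Bucket": "arn:aws:s3:::data-lake-disaster-recovery",
        "ReplicationTime": {"Status": "Enabled", "Time": {"Minutes": 15}},
    },
}
print(replication_rule_problems(example_rule))
```

A check like this is cheap to run in CI against the configuration file before it is sent to `put_bucket_replication`.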
S3 Encryption Options
S3 ENCRYPTION OPTIONS

SERVER-SIDE ENCRYPTION (SSE)

  SSE-S3
    • AWS managed keys
    • AES-256
    • No audit trail
    • Free

  SSE-KMS
    • Customer managed or AWS managed KMS keys
    • Audit trail
    • $1/key/month

  SSE-C
    • Customer provided keys
    • Must manage keys
    • No AWS key mgmt

CLIENT-SIDE ENCRYPTION
  • Encrypt before upload
  • Manage your own keys
  • Full control over encryption
  • More complex implementation
Default Encryption Configuration
{
  "Rules": [
    {
      "ApplyServerSideEncryptionByDefault": {
        "SSEAlgorithm": "aws:kms",
        "KMSMasterKeyID": "arn:aws:kms:us-east-1:123456789012:key/12345678-1234-1234-1234-123456789012"
      },
      "BucketKeyEnabled": true
    }
  ]
}
⚠️ Security Warning: Always set an explicit default encryption configuration for data lakes. Use SSE-KMS when you need audit trails and compliance controls. S3 Bucket Keys can reduce KMS request costs by up to 99% for large buckets.
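The same default-encryption settings can be applied from Python. This is a minimal sketch, assuming a boto3 S3 client is passed in; the two helper names are hypothetical, while `put_bucket_encryption` and its `ServerSideEncryptionConfiguration` parameter are the real client call. Taking the client as an argument keeps the configuration logic testable without AWS credentials.

```python
from typing import Any, Dict


def default_encryption_config(kms_key_arn: str, bucket_key: bool = True) -> Dict[str, Any]:
    """Build an SSE-KMS default encryption configuration with S3 Bucket Keys."""
    return {
        "Rules": [
            {
                "ApplyServerSideEncryptionByDefault": {
                    "SSEAlgorithm": "aws:kms",
                    "KMSMasterKeyID": kms_key_arn,
                },
                "BucketKeyEnabled": bucket_key,
            }
        ]
    }


def apply_default_encryption(s3_client: Any, bucket: str, kms_key_arn: str) -> None:
    """Send the configuration to S3 using the supplied boto3 client."""
    s3_client.put_bucket_encryption(
        Bucket=bucket,
        ServerSideEncryptionConfiguration=default_encryption_config(kms_key_arn),
    )
```

In practice this would be called as `apply_default_encryption(boto3.client("s3"), "data-lake-production", key_arn)`, where `key_arn` is the ARN of your own KMS key.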
S3 Performance Optimization
Multipart Upload
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client('s3')

# Configure multipart upload
config = TransferConfig(
    multipart_threshold=1024 * 1024 * 100,  # switch to multipart above 100 MB
    max_concurrency=10,
    multipart_chunksize=1024 * 1024 * 100,  # 100 MB per part
    use_threads=True
)

# Upload with multipart (upload_file splits and parallelizes parts itself)
s3.upload_file(
    'large_file.parquet',
    'data-lake-bucket',
    'raw/source/file.parquet',
    Config=config
)


# Manual multipart upload, for when you need control over individual parts.
# A single PUT is limited to 5 GB, so larger objects must use multipart.
def multipart_upload(bucket, key, file_path, part_size=100 * 1024 * 1024):
    # Create multipart upload
    response = s3.create_multipart_upload(Bucket=bucket, Key=key)
    upload_id = response['UploadId']
    parts = []
    try:
        with open(file_path, 'rb') as f:
            part_number = 1
            while True:
                data = f.read(part_size)
                if not data:
                    break
                response = s3.upload_part(
                    Bucket=bucket,
                    Key=key,
                    PartNumber=part_number,
                    UploadId=upload_id,
                    Body=data
                )
                parts.append({
                    'PartNumber': part_number,
                    'ETag': response['ETag']
                })
                part_number += 1
        # Complete multipart upload
        s3.complete_multipart_upload(
            Bucket=bucket,
            Key=key,
            UploadId=upload_id,
            MultipartUpload={'Parts': parts}
        )
    except Exception:
        # Abort so already-uploaded parts do not keep accruing storage charges
        s3.abort_multipart_upload(Bucket=bucket, Key=key, UploadId=upload_id)
        raise
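A fixed 100 MB part size stops working for very large files, because a multipart upload allows at most 10,000 parts, each between 5 MiB and 5 GiB (the last part may be smaller). The sketch below picks a part size that respects those limits. `plan_parts` and the constant names are hypothetical, and the limits are the documented ones as I understand them at the time of writing.

```python
MIN_PART_SIZE = 5 * 1024 * 1024         # 5 MiB, minimum for all parts but the last
MAX_PART_SIZE = 5 * 1024 * 1024 * 1024  # 5 GiB, maximum for any part
MAX_PARTS = 10_000


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division without floating-point rounding."""
    return -(-a // b)


def plan_parts(file_size: int, part_size: int = 100 * 1024 * 1024):
    """Return (part_size, part_count) that fits within the multipart limits."""
    if file_size <= 0:
        raise ValueError("file_size must be positive")
    part_size = max(part_size, MIN_PART_SIZE)
    # Grow the part size only if the requested size would need too many parts.
    if ceil_div(file_size, part_size) > MAX_PARTS:
        part_size = ceil_div(file_size, MAX_PARTS)
    if part_size > MAX_PART_SIZE:
        raise ValueError("file is too large for a single multipart upload")
    return part_size, ceil_div(file_size, part_size)


one_gib = 1024 ** 3
print(plan_parts(one_gib))         # default part size is kept
print(plan_parts(2 * 1024 ** 4))   # 2 TiB forces a larger part size
```

The returned part size can be passed as `part_size` to a manual upload loop or as `multipart_chunksize` in `TransferConfig`.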
S3 Transfer Acceleration
# Enable Transfer Acceleration
aws s3api put-bucket-accelerate-configuration \
    --bucket data-lake-production \
    --accelerate-configuration Status=Enabled
# Use the accelerated endpoint (the CLI adds the bucket name to the hostname)
aws s3 cp large_file.parquet s3://data-lake-production/raw/ \
    --endpoint-url https://s3-accelerate.amazonaws.com
S3 for Data Lake Architecture
S3 DATA LAKE ARCHITECTURE

INGESTION LAYER
  • Kinesis Firehose
  • DMS (CDC)
  • Glue Crawler (Batch)
  • API GW (REST)
        │
        ▼
RAW ZONE (Bronze)
  s3://data-lake-raw/
  ├── landing/                  (Temporary staging)
  │   └── {date}/               (Daily partitions)
  ├── source_system_a/          (Source A data)
  │   ├── {yyyy}/{mm}/{dd}/     (Date-partitioned)
  │   └── _metadata.json        (Schema info)
  └── source_system_b/          (Source B data)

  Format: Raw JSON/CSV/Parquet
  Retention: 90 days standard, then Glacier
        │
        ▼
PROCESSING LAYER (AWS Glue)
  Glue Crawler → Glue Catalog → Glue ETL Job

  1. Discover schema (Crawler)
  2. Register in catalog
  3. Transform data (ETL job)
  4. Write to processed zone
        │
        ▼
PROCESSED ZONE (Silver/Gold)
  s3://data-lake-processed/
  ├── silver/                   (Cleaned, validated)
  │   ├── transactions/         (Parquet, partitioned by date)
  │   ├── customers/            (Delta Lake format)
  │   └── products/             (Optimized Parquet)
  └── gold/                     (Business-ready)
      ├── daily_metrics/        (Aggregated daily)
      ├── customer_360/         (Customer view)
      └── financial_reports/    (Financial data)

  Format: Parquet/Delta Lake/ORC
  Retention: 2 years standard, then IA
        │
        ▼
CONSUMPTION
  • Athena (Ad-hoc)
  • Redshift (Warehouse)
  • QuickSight (BI/Dash)
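The raw zone's date-partitioned layout (source/{yyyy}/{mm}/{dd}/) is easiest to keep consistent when every writer builds keys through one helper. A minimal sketch follows; the function name and the example file name are hypothetical.

```python
from datetime import date


def raw_zone_key(source: str, day: date, filename: str) -> str:
    """Build a raw-zone object key partitioned as source/yyyy/mm/dd/filename."""
    return f"{source}/{day:%Y}/{day:%m}/{day:%d}/{filename}"


print(raw_zone_key("source_system_a", date(2024, 1, 15), "orders.json"))
# prints source_system_a/2024/01/15/orders.json
```

Zero-padded month and day components keep keys in chronological order when listed, which is what the date-partitioned layout relies on.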
S3 Access Patterns for Data Engineering
S3 Select (Filter at Storage Layer)
import boto3

s3 = boto3.client('s3')

# Use S3 Select to filter data without downloading entire object.
# Note: AWS has closed S3 Select to new customers; existing accounts can still use it.
response = s3.select_object_content(
    Bucket='data-lake-processed',
    Key='silver/transactions/2024/01/15.parquet',
    Expression="SELECT * FROM s3object s WHERE s.amount > 1000",
    ExpressionType='SQL',
    InputSerialization={
        'Parquet': {}
    },
    OutputSerialization={
        'JSON': {}
    }
)

# Process filtered results (the payload is an event stream; only
# 'Records' events carry data)
for event in response['Payload']:
    if 'Records' in event:
        records = event['Records']['Payload'].decode('utf-8')
        print(records)
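Decoding each Records chunk on its own can fail or split a record, because chunk boundaries do not respect record or character boundaries. A more defensive sketch buffers the bytes first and parses afterwards. `collect_select_records` is a hypothetical helper; it assumes JSON output with the default newline record delimiter, and it takes any iterable of event dicts so it can be exercised without AWS.

```python
import json
from typing import Any, Dict, Iterable, List


def collect_select_records(events: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Join all Records payloads, then parse one JSON object per line."""
    buffer = bytearray()
    for event in events:
        # Stats, Progress and End events carry no record data and are skipped.
        if "Records" in event:
            buffer.extend(event["Records"]["Payload"])
    text = buffer.decode("utf-8")
    return [json.loads(line) for line in text.splitlines() if line.strip()]


# Simulated event stream whose first chunk ends in the middle of a record.
payload = b'{"amount": 1500}\n{"amount": 2000}\n'
fake_events = [
    {"Records": {"Payload": payload[:10]}},
    {"Stats": {}},
    {"Records": {"Payload": payload[10:]}},
    {"End": {}},
]
print(collect_select_records(fake_events))
```

With a real response, the call would be `collect_select_records(response['Payload'])`. For very large results, parse incrementally instead of buffering everything in memory.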
S3 Inventory
import boto3

s3 = boto3.client('s3')

# Create inventory configuration
s3.put_bucket_inventory_configuration(
    Bucket='data-lake-production',
    Id='daily-inventory',
    InventoryConfiguration={
        'Destination': {
            'S3BucketDestination': {
                'Format': 'Parquet',
                'Bucket': 'arn:aws:s3:::data-lake-inventory',
                'Prefix': 'inventory'
            }
        },
        'IsEnabled': True,
        'Id': 'daily-inventory',
        'IncludedObjectVersions': 'Current',
        'Schedule': {
            'Frequency': 'Daily'
        },
        'OptionalFields': [
            'Size',
            'LastModifiedDate',
            'StorageClass',
            'ETag',
            'IsMultipartUploaded',
            'ReplicationStatus',
            'EncryptionStatus',
            'IntelligentTieringAccessTier',
            'BucketKeyStatus',
            'ChecksumAlgorithm'
        ]
    }
)
Interview Questions & Answers
Q1: What is the difference between S3 Standard and S3 Intelligent-Tiering?
Answer:
- S3 Standard: Highest availability (99.99%), highest per-GB storage price, for frequently accessed data
- S3 Intelligent-Tiering: Same durability, 99.9% availability, automatic tiering based on access patterns, no retrieval fees, but a small per-object monitoring fee
Use S3 Intelligent-Tiering when you don't know the access pattern or it changes frequently. It can save up to 70% compared to Standard.
Q2: How do you handle S3 eventual consistency?
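Answer: Since December 2020, S3 provides strong read-after-write consistency for all object operations, so most workloads no longer need a workaround:
- Read-after-write: Immediately consistent, for new objects and for overwrites
- Read-after-delete: Immediately consistent
- List operations: Strongly consistent; a listing reflects every completed write and delete
What remains: concurrent writes to the same key are last-writer-wins, and bucket-level configuration changes (for example, enabling versioning) can take time to propagate.
Best practice: Use versioning to keep overwritten data recoverable, and use DynamoDB for metadata only when you need locking or multi-object transactions, not to work around consistency.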
Q3: What are the best practices for S3 performance?
Answer:
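- Multipart Upload: For files >100MB; parts upload in parallel and failed parts can be retried individually
- Parallel Requests: Use prefix-level parallelism (at least 3,500 PUT/COPY/POST/DELETE and 5,500 GET/HEAD requests per second per prefix)
- Transfer Acceleration: For long-distance transfers over the public internet
- S3 Select: Filter at storage layer
- Requester Pays: Shifts request and transfer costs to the consumer in large-scale shared analytics; a cost control rather than a throughput gain
- S3 Batch Operations: For bulk operations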
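Because the request-rate limit applies per prefix, read throughput scales with the number of prefixes a workload spreads across. The sketch below estimates how many prefixes a target GET rate needs and shows one way to spread keys deterministically. Both function names and the `shard=NN/` naming are hypothetical, and hash sharding trades away the ability to list keys by date, so it suits hot read paths more than partitioned analytics tables.

```python
import hashlib

GET_PER_SECOND_PER_PREFIX = 5_500  # per-prefix GET/HEAD rate cited in the answer


def prefixes_needed(target_get_per_second: int) -> int:
    """Smallest number of prefixes whose combined GET limit meets the target."""
    needed = -(-target_get_per_second // GET_PER_SECOND_PER_PREFIX)  # ceiling
    return max(1, needed)


def sharded_key(key: str, shard_count: int) -> str:
    """Prepend a stable hash-derived shard prefix to spread keys evenly."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    shard = int(digest, 16) % shard_count
    return f"shard={shard:02d}/{key}"


print(prefixes_needed(20_000))
print(sharded_key("transactions/2024/01/15.parquet", prefixes_needed(20_000)))
```

S3 scales per-prefix capacity gradually as load grows, so a sudden burst can still see throttling responses until partitioning catches up.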
Q4: How do you implement S3 lifecycle policies for data lakes?
Answer:
- Raw Zone: Standard → IA (90 days) → Glacier (180 days) → Deep Archive (365 days)
- Processed Zone: Standard → IA (30 days) → Glacier Instant (90 days) → Glacier Flexible (365 days)
- Curated Zone: Standard → IA (365 days)
- Archive Zone: Glacier Deep Archive immediately
Always set expiration rules for temporary data (landing zones).
Q5: What is S3 Replication Time Control (RTC)?
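Answer: S3 RTC is designed to replicate 99.99% of new objects within 15 minutes of upload, and that replication time is backed by an SLA. It is a guarantee about replication latency, not durability, and it comes with replication metrics and event notifications. It's useful for: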
- Disaster recovery requirements
- Compliance requirements
- Low-latency access in multiple regions
Cost: $0.02 per GB replicated.
Cost Considerations
| Component | Cost | Optimization |
|---|---|---|
| S3 Standard | $0.023/GB/month | Use lifecycle policies |
| S3 Standard-IA | $0.0125/GB/month | Min 30 days storage |
| S3 Glacier Instant | $0.004/GB/month | Min 90 days storage |
| S3 Glacier Flexible | $0.0036/GB/month | Min 90 days, retrieval fees |
| S3 Deep Archive | $0.00099/GB/month | Min 180 days, 12hr retrieval |
| Data Transfer | $0.09/GB outbound | Use VPC endpoints |
| PUT/COPY/POST | $0.005/1000 requests | Batch operations |
| GET/SELECT | $0.0004/1000 requests | Use S3 Select |
⚠️ Cost Warning: S3 costs can add up quickly with large data lakes. Always implement lifecycle policies, and use S3 Storage Lens and Storage Class Analysis to identify optimization opportunities.
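To see what tiering is worth, the per-GB prices from the table can be turned into a rough monthly estimate. This sketch covers storage charges only; requests, retrieval fees, minimum-duration charges and data transfer are deliberately left out, the function name is hypothetical, and the prices are the ones listed above, which vary by Region and change over time.

```python
from typing import Dict

# Storage prices per GB per month, copied from the cost table.
PRICE_PER_GB_MONTH: Dict[str, float] = {
    "STANDARD": 0.023,
    "STANDARD_IA": 0.0125,
    "GLACIER_IR": 0.004,
    "GLACIER": 0.0036,
    "DEEP_ARCHIVE": 0.00099,
}


def monthly_storage_cost(gb_by_class: Dict[str, float]) -> float:
    """Sum storage-only cost in USD for the given GB stored per class."""
    return sum(PRICE_PER_GB_MONTH[cls] * gb for cls, gb in gb_by_class.items())


all_standard = monthly_storage_cost({"STANDARD": 10_000})
tiered = monthly_storage_cost(
    {"STANDARD": 2_000, "STANDARD_IA": 3_000, "DEEP_ARCHIVE": 5_000}
)
print(f"all Standard: ${all_standard:,.2f}/month")
print(f"tiered:       ${tiered:,.2f}/month")
```

For 10,000 GB this works out to about $230 a month entirely in Standard versus about $88 with the tiered split, before the excluded charges.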
Summary
S3 is the foundation of data lakes on AWS. Key takeaways:
- Storage Classes: Choose based on access patterns and retention requirements
- Lifecycle Policies: Automate data movement between tiers
- Versioning: Protect against accidental deletes and overwrites
- Replication: Cross-region for DR, same-region for compliance
- Encryption: Always enable, use SSE-KMS for audit trails
- Performance: Multipart upload, parallel requests, S3 Select
- Data Lake: Use prefix hierarchy for organization and access control