Data Lake Architecture Overview
┌─────────────────────────────────────────────────────────────────┐
│                   AWS Data Lake Architecture                    │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Data Sources         Ingestion           Storage               │
│  ┌───────────┐        ┌──────────┐        ┌──────────┐          │
│  │ Databases │───────▶│ Glue     │───────▶│ S3       │          │
│  │ APIs      │        │ DMS      │        │ (Lake)   │          │
│  │ Files     │        │ Firehose │        │          │          │
│  └───────────┘        └──────────┘        └────┬─────┘          │
│                                                │                │
│                       Governance               │                │
│                       ┌───────────┐            │                │
│                       │ Lake      │            │                │
│                       │ Formation │────────────┘                │
│                       └─────┬─────┘                             │
│                             │                                   │
│              ┌──────────────┼──────────────┐                    │
│              ▼              ▼              ▼                    │
│         ┌──────────┐   ┌──────────┐   ┌────────────┐            │
│         │ Athena   │   │ Redshift │   │ QuickSight │            │
│         │ (Query)  │   │ Spectrum │   │ (Viz)      │            │
│         └──────────┘   └──────────┘   └────────────┘            │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
Q1: How do you design a multi-layer data lake architecture on AWS?
Answer:
Medallion Architecture (Bronze/Silver/Gold):
┌─────────────────────────────────────────────────────────────────┐
│               Multi-Layer Data Lake Architecture                │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Bronze Layer (Raw)                                             │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │ • Raw data as-is                                          │  │
│  │ • Immutable, append-only                                  │  │
│  │ • Multiple formats (JSON, CSV, Parquet)                   │  │
│  │ • Partitioned by source/date                              │  │
│  └───────────────────────────────────────────────────────────┘  │
│                                │                                │
│                                ▼                                │
│  Silver Layer (Validated)                                       │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │ • Cleaned, validated data                                 │  │
│  │ • Schema enforced                                         │  │
│  │ • Deduplicated                                            │  │
│  │ • Standardized formats (Parquet/ORC)                      │  │
│  └───────────────────────────────────────────────────────────┘  │
│                                │                                │
│                                ▼                                │
│  Gold Layer (Business)                                          │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │ • Business-level aggregates                               │  │
│  │ • Dimension/fact tables                                   │  │
│  │ • Optimized for analytics                                 │  │
│  │ • Materialized views                                      │  │
│  └───────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────┘
S3 Folder Structure:
s3://data-lake/
├── bronze/
│   └── source_a/
│       └── year=2024/
│           └── month=01/
│               └── day=01/
│                   └── data.json
├── silver/
│   └── source_a/
│       └── year=2024/
│           └── month=01/
│               └── data.parquet
└── gold/
    ├── dimensions/
    ├── facts/
    └── aggregates/
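The Hive-style `key=value` layout above can be generated rather than typed by hand, which keeps month and day values zero-padded consistently. A minimal sketch follows; `partition_prefix` is a hypothetical helper, not part of any AWS SDK, and the layer and source names are taken from the tree above.

```python
from datetime import date


def partition_prefix(layer: str, source: str, day: date, daily: bool = True) -> str:
    """Build a Hive-style S3 prefix such as bronze/source_a/year=2024/month=01/day=01/."""
    parts = [layer, source, f"year={day:%Y}", f"month={day:%m}"]
    if daily:
        # Bronze is partitioned down to the day; silver stops at the month
        parts.append(f"day={day:%d}")
    return "/".join(parts) + "/"


bronze_prefix = partition_prefix("bronze", "source_a", date(2024, 1, 1))
silver_prefix = partition_prefix("silver", "source_a", date(2024, 1, 1), daily=False)
print(bronze_prefix)
print(silver_prefix)
```

Zero-padding matters because engines that discover partitions from paths treat `month=1` and `month=01` as different partition values.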
Implementation with Glue:
# Bronze to Silver transformation
def bronze_to_silver(glueContext, bronze_path, silver_path):
    # Read raw JSON from the bronze layer
    dynamic_frame = glueContext.create_dynamic_frame.from_options(
        connection_type="s3",
        connection_options={"paths": [bronze_path]},
        format="json"
    )
    # Resolve ambiguous column types against the catalog schema of the silver table
    cleaned = dynamic_frame.resolveChoice(
        choice="match_catalog",
        database="datalake",
        table_name="silver_table"
    )
    # Write Snappy-compressed Parquet to the silver layer
    glueContext.write_dynamic_frame.from_options(
        frame=cleaned,
        connection_type="s3",
        connection_options={"path": silver_path},
        format="parquet",
        format_options={"compression": "snappy"}
    )
Interview Tip: The Medallion Architecture (Bronze/Silver/Gold) is a widely accepted pattern. Always mention data quality gates between layers.
Q2: Compare S3 storage classes for data lake workloads.
Answer:
Storage Class Comparison:
| Storage Class | Use Case | Storage Cost (per GB-month) | Retrieval Time |
|---|---|---|---|
| S3 Standard | Frequently accessed | $0.023 | Milliseconds |
| S3 Intelligent-Tiering | Unknown or changing access patterns | Varies by tier | Milliseconds |
| S3 Standard-IA | Infrequent access | $0.0125 | Milliseconds |
| S3 One Zone-IA | Infrequent, re-creatable | $0.01 | Milliseconds |
| S3 Glacier Instant Retrieval | Archive with millisecond access | $0.004 | Milliseconds |
| S3 Glacier Flexible Retrieval | Archive, minutes to hours | $0.0036 | Minutes to hours |
| S3 Glacier Deep Archive | Long-term archive | $0.00099 | Hours |
Lifecycle Policy:
{
"Rules": [
{
"ID": "DataLakeLifecycle",
"Filter": {
"Prefix": "bronze/"
},
"Status": "Enabled",
"Transitions": [
{
"Days": 30,
"StorageClass": "STANDARD_IA"
},
{
"Days": 90,
"StorageClass": "GLACIER"
},
{
"Days": 365,
"StorageClass": "DEEP_ARCHIVE"
}
]
}
]
}
Cost Optimization Example:
┌─────────────────────────────────────────────────────────────────┐
│            Storage Cost Comparison (1 TB for 1 year)            │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  S3 Standard:      $276/year                                    │
│  S3 Standard-IA:   $150/year (45% savings)                      │
│  S3 Intelligent:   $180/year (35% savings)                      │
│  S3 Glacier:       $43/year (84% savings)                       │
│  S3 Deep Archive:  $12/year (96% savings)                       │
│                                                                 │
│  Recommended Strategy:                                          │
│  Days 1-30:    Standard ($23/month)                             │
│  Days 31-90:   Standard-IA ($12.50/month)                       │
│  Days 91-365:  Glacier ($3.60/month)                            │
│  After 365:    Deep Archive ($0.99/month)                       │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
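The recommended strategy can be checked with a few lines of arithmetic. The sketch below reuses the per-GB-month prices from the comparison table and treats 1 TB as 1,000 GB, as the figures above do. It ignores retrieval fees, transition request charges and minimum storage durations, so the result is a rough estimate only; `first_year_cost` is a hypothetical helper.

```python
# Per-GB-month storage prices taken from the storage class comparison table
PRICE_PER_GB_MONTH = {
    "STANDARD": 0.023,
    "STANDARD_IA": 0.0125,
    "GLACIER": 0.0036,
    "DEEP_ARCHIVE": 0.00099,
}


def first_year_cost(size_gb, schedule):
    """Sum storage cost over a schedule of (storage_class, months) pairs."""
    return sum(size_gb * PRICE_PER_GB_MONTH[cls] * months for cls, months in schedule)


# Days 1-30 in Standard, days 31-90 in Standard-IA, days 91-365 in Glacier
tiered = first_year_cost(1000, [("STANDARD", 1), ("STANDARD_IA", 2), ("GLACIER", 9)])
flat = first_year_cost(1000, [("STANDARD", 12)])
savings = 1 - tiered / flat
print(f"tiered=${tiered:.2f} flat=${flat:.2f} savings={savings:.0%}")
```

Under these assumptions the tiered schedule costs roughly $80 for the first year against $276 for keeping everything in Standard.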
Q3: How do you implement data cataloging with AWS Glue Data Catalog?
Answer:
Glue Data Catalog Components:
import boto3
glue = boto3.client('glue')
# Create database
glue.create_database(
DatabaseInput={
'Name': 'analytics_db',
'Description': 'Analytics data lake database'
}
)
# Create table
glue.create_table(
DatabaseName='analytics_db',
TableInput={
'Name': 'sales_data',
'Description': 'Sales transaction data',
'StorageDescriptor': {
'Columns': [
{'Name': 'transaction_id', 'Type': 'string'},
{'Name': 'amount', 'Type': 'decimal(10,2)'},
{'Name': 'transaction_date', 'Type': 'date'}
],
'Location': 's3://data-lake/silver/sales/',
'InputFormat': 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat',
'OutputFormat': 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat',
'SerdeInfo': {
'SerializationLibrary': 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
}
},
'PartitionKeys': [
{'Name': 'year', 'Type': 'int'},
{'Name': 'month', 'Type': 'int'}
]
}
)
Automatic Cataloging with Glue Crawlers:
# Create crawler
glue.create_crawler(
Name='sales_crawler',
Role='arn:aws:iam::123456789012:role/GlueCrawlerRole',  # 123456789012 is a placeholder account ID
DatabaseName='analytics_db',
Targets={
'S3Targets': [
{'Path': 's3://data-lake/silver/sales/'}
]
},
Schedule='cron(0 1 * * ? *)', # Daily at 1 AM
SchemaChangePolicy={
'UpdateBehavior': 'UPDATE_IN_DATABASE',
'DeleteBehavior': 'LOG'
}
)
# Start crawler
glue.start_crawler(Name='sales_crawler')
Catalog Architecture:
┌─────────────────────────────────────────────────────────────────┐
│                 Glue Data Catalog Architecture                  │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │                       Data Catalog                        │  │
│  │ ┌─────────────┬─────────────┬─────────────┬─────────────┐ │  │
│  │ │ Databases   │ Tables      │ Partitions  │ Columns     │ │  │
│  │ │             │             │             │             │ │  │
│  │ │ analytics_db│ sales_data  │ year=2024   │ id, amount  │ │  │
│  │ │             │ customers   │ month=01    │ name, email │ │  │
│  │ │ warehouse_db│ orders      │ region=us   │ order_id    │ │  │
│  │ └─────────────┴─────────────┴─────────────┴─────────────┘ │  │
│  └───────────────────────────────────────────────────────────┘  │
│                               │                                 │
│         ┌─────────────────────┼─────────────────────┐           │
│         ▼                     ▼                     ▼           │
│  ┌─────────────┐       ┌─────────────┐       ┌─────────────┐    │
│  │ Athena      │       │ Redshift    │       │ EMR         │    │
│  │ Query       │       │ Spectrum    │       │ Spark SQL   │    │
│  └─────────────┘       └─────────────┘       └─────────────┘    │
└─────────────────────────────────────────────────────────────────┘
Q4: How do you implement data lake security and access control?
Answer:
Security Layers:
┌─────────────────────────────────────────────────────────────────┐
│                 Data Lake Security Architecture                 │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Network Security                                               │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │ • VPC endpoints for S3                                    │  │
│  │ • S3 bucket policies                                      │  │
│  │ • Security groups                                         │  │
│  └───────────────────────────────────────────────────────────┘  │
│                                                                 │
│  Identity & Access                                              │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │ • IAM roles/policies                                      │  │
│  │ • Lake Formation permissions                              │  │
│  │ • S3 bucket policies                                      │  │
│  └───────────────────────────────────────────────────────────┘  │
│                                                                 │
│  Data Protection                                                │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │ • Encryption at rest (SSE-S3, SSE-KMS, SSE-C)             │  │
│  │ • Encryption in transit (TLS)                             │  │
│  │ • Column-level encryption                                 │  │
│  └───────────────────────────────────────────────────────────┘  │
│                                                                 │
│  Governance                                                     │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │ • Lake Formation grants                                   │  │
│  │ • CloudTrail auditing                                     │  │
│  │ • Data classification                                     │  │
│  └───────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────┘
IAM Policy for Data Lake Access:
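{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ListDepartmentPrefixes",
      "Effect": "Allow",
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::data-lake",
      "Condition": {
        "StringLike": {
          "s3:prefix": [
            "bronze/${aws:PrincipalTag/department}/*",
            "silver/${aws:PrincipalTag/department}/*"
          ]
        }
      }
    },
    {
      "Sid": "ReadDepartmentObjects",
      "Effect": "Allow",
      "Action": "s3:GetObject",
      "Resource": [
        "arn:aws:s3:::data-lake/bronze/${aws:PrincipalTag/department}/*",
        "arn:aws:s3:::data-lake/silver/${aws:PrincipalTag/department}/*"
      ]
    }
  ]
}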
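To see which object keys a department-tagged principal would be allowed to reach, the tag substitution and wildcard match can be approximated locally. This is only an illustration of the idea, not how IAM evaluates policies: `allowed_patterns` and `can_read` are hypothetical helpers, and `fnmatchcase` is used as a stand-in for `StringLike` (it also interprets `[...]`, which IAM does not).

```python
from fnmatch import fnmatchcase

# Prefix patterns copied from the policy above
PREFIX_PATTERNS = [
    "bronze/${aws:PrincipalTag/department}/*",
    "silver/${aws:PrincipalTag/department}/*",
]


def allowed_patterns(department: str) -> list:
    """Substitute the principal's department tag into each pattern."""
    return [p.replace("${aws:PrincipalTag/department}", department) for p in PREFIX_PATTERNS]


def can_read(key: str, department: str) -> bool:
    """Approximate StringLike matching with shell-style wildcards."""
    return any(fnmatchcase(key, pattern) for pattern in allowed_patterns(department))


print(can_read("bronze/finance/2024/data.json", "finance"))
print(can_read("silver/hr/employees.parquet", "finance"))
```

A quick check of this kind is useful when reviewing prefix layouts, since a principal tagged `finance` should never match keys under another department or under `gold/`.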
Lake Formation Permissions:
import boto3
# Grant table-level permissions
# (123456789012 is a placeholder account ID in the ARNs below)
lakeformation = boto3.client('lakeformation')
lakeformation.grant_permissions(
    Principal={'DataLakePrincipalIdentifier': 'arn:aws:iam::123456789012:role/analyst-role'},
    Resource={
        'Table': {
            'DatabaseName': 'analytics_db',
            'Name': 'sales_data'
        }
    },
    # PermissionsWithGrantOption is omitted, so the analyst cannot re-grant access
    Permissions=['SELECT', 'DESCRIBE']
)
# Grant column-level permissions
lakeformation.grant_permissions(
    Principal={'DataLakePrincipalIdentifier': 'arn:aws:iam::123456789012:role/hr-role'},
    Resource={
        'TableWithColumns': {
            'DatabaseName': 'analytics_db',
            'Name': 'employees',
            'ColumnNames': ['employee_id', 'name', 'department']
        }
    },
    Permissions=['SELECT']
)
Q5: How do you optimize query performance in Athena for data lakes?
Answer:
Optimization Techniques:
1. Partitioning:
-- Create partitioned table
CREATE EXTERNAL TABLE sales (
transaction_id STRING,
amount DECIMAL(10,2),
customer_id STRING
)
PARTITIONED BY (year INT, month INT, day INT)
STORED AS PARQUET
LOCATION 's3://data-lake/silver/sales/';
-- Query with partition pruning
SELECT * FROM sales
WHERE year = 2024 AND month = 1 AND day = 15;
2. File Format Optimization:
# Convert to optimized Parquet
df.write \
.partitionBy("year", "month", "day") \
.option("compression", "snappy") \
.parquet("s3://data-lake/optimized/sales/")
3. Bucketing for Joins:
-- Bucketed table for efficient joins
CREATE EXTERNAL TABLE orders (
order_id STRING,
customer_id STRING,
amount DECIMAL(10,2)
)
CLUSTERED BY (customer_id) INTO 32 BUCKETS
STORED AS PARQUET
LOCATION 's3://data-lake/silver/orders/';
Performance Comparison:
┌─────────────────────────────────────────────────────────────────┐
│              Athena Query Performance Optimization              │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Scenario: 1 TB dataset, query 1 day of data                    │
│                                                                 │
│  Without Optimization:                                          │
│    • Full scan: 1,000 GB scanned                                │
│    • Cost: $5.00 per query                                      │
│    • Time: ~5 minutes                                           │
│                                                                 │
│  With Partitioning:                                             │
│    • Partition pruned: 2.7 GB scanned                           │
│    • Cost: $0.014 per query                                     │
│    • Time: ~10 seconds                                          │
│                                                                 │
│  With Columnar Format:                                          │
│    • Column projection: 0.8 GB scanned                          │
│    • Cost: $0.004 per query                                     │
│    • Time: ~3 seconds                                           │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
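The per-query costs above follow directly from the amount of data scanned. The sketch below reproduces them, assuming the $5.00 per TB price used in the scenario, decimal units (1 TB = 1,000 GB, as in the figures above) and a 10 MB minimum charge per query; `athena_query_cost` is a hypothetical helper, and current pricing should be checked before relying on it.

```python
PRICE_PER_TB_USD = 5.00  # price assumed by the scenario above
MIN_BILLED_MB = 10       # assumed minimum amount of data billed per query


def athena_query_cost(scanned_gb: float) -> float:
    """Estimate the cost of one query from the data it scans."""
    scanned_mb = max(scanned_gb * 1000, MIN_BILLED_MB)
    return scanned_mb / 1_000_000 * PRICE_PER_TB_USD


for label, gb in [("full scan", 1000), ("partition pruned", 2.7), ("columnar", 0.8)]:
    print(f"{label}: ${athena_query_cost(gb):.4f}")
```

Because cost scales linearly with bytes scanned, every technique that reduces the scan (partition pruning, column projection, compression) reduces the bill by the same factor.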
Q6: Design a data lake for real-time and batch analytics.
Answer:
Hybrid Architecture:
┌─────────────────────────────────────────────────────────────────┐
│            Real-Time + Batch Data Lake Architecture             │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Real-Time Path                                                 │
│  ┌─────────────┐       ┌─────────────┐       ┌─────────────┐    │
│  │ Kinesis     │──────▶│ Lambda      │──────▶│ S3          │    │
│  │ Streams     │       │ Transform   │       │ (Hot)       │    │
│  └─────────────┘       └─────────────┘       └──────┬──────┘    │
│                                                     │           │
│  Batch Path                                         │           │
│  ┌─────────────┐       ┌─────────────┐              │           │
│  │ Glue        │──────▶│ EMR         │──────────────┤           │
│  │ Crawler     │       │ Spark       │              │           │
│  └─────────────┘       └─────────────┘              │           │
│                                                     ▼           │
│                                              ┌─────────────┐    │
│                                              │ Unified     │    │
│                                              │ Data Lake   │    │
│                                              │ (S3)        │    │
│                                              └──────┬──────┘    │
│                                                     │           │
│         ┌─────────────────────┬─────────────────────┤           │
│         ▼                     ▼                     ▼           │
│  ┌─────────────┐       ┌─────────────┐       ┌─────────────┐    │
│  │ Athena      │       │ Redshift    │       │ QuickSight  │    │
│  │ (Ad-hoc)    │       │ Spectrum    │       │ (Dash)      │    │
│  └─────────────┘       └─────────────┘       └─────────────┘    │
└─────────────────────────────────────────────────────────────────┘
Lambda Real-Time Ingestion:
import base64
import boto3
import json
from datetime import datetime, timezone
s3 = boto3.client('s3')
def lambda_handler(event, context):
    for record in event['Records']:
        # Kinesis delivers the record payload base64-encoded
        data = json.loads(base64.b64decode(record['kinesis']['data']))
        # Take a single UTC timestamp so all metadata fields agree
        now = datetime.now(timezone.utc)
        # Add metadata
        enriched = {
            **data,
            '_ingestion_time': now.isoformat(),
            '_source': 'real-time',
            '_year': now.year,
            '_month': now.month,
            '_day': now.day
        }
        # Write to S3 with zero-padded, Hive-style partitions
        key = (
            f"bronze/real-time/year={now:%Y}/month={now:%m}/day={now:%d}/"
            f"{record['kinesis']['sequenceNumber']}.json"
        )
        s3.put_object(
            Bucket='data-lake-bucket',
            Key=key,
            Body=json.dumps(enriched)
        )
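Kinesis places each record's payload in the Lambda event as a base64-encoded string, so an ingestion handler of this kind can be exercised locally with a synthetic event before it is deployed. A minimal sketch; `make_kinesis_event` and `decode_record` are hypothetical helpers, and the payload fields are invented for illustration.

```python
import base64
import json


def make_kinesis_event(payloads):
    """Build a synthetic Lambda event whose records carry base64-encoded JSON."""
    return {
        "Records": [
            {
                "kinesis": {
                    "sequenceNumber": str(i),
                    "data": base64.b64encode(json.dumps(p).encode("utf-8")).decode("ascii"),
                }
            }
            for i, p in enumerate(payloads)
        ]
    }


def decode_record(record):
    """Recover the JSON payload from one Kinesis record."""
    return json.loads(base64.b64decode(record["kinesis"]["data"]))


event = make_kinesis_event([{"order_id": "o-1", "amount": 12.5}])
decoded = [decode_record(r) for r in event["Records"]]
print(decoded)
```

Keeping the decode step in its own function makes it easy to unit test without S3 or Kinesis access.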
Q7: How do you implement data quality in a data lake?
Answer:
Data Quality Framework:
┌─────────────────────────────────────────────────────────────────┐
│                   Data Lake Quality Framework                   │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Quality Dimensions                                             │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │ • Completeness - All expected data present                │  │
│  │ • Accuracy - Data matches real-world entities             │  │
│  │ • Consistency - No contradictions across datasets         │  │
│  │ • Timeliness - Data available when needed                 │  │
│  │ • Validity - Data conforms to defined formats/rules       │  │
│  └───────────────────────────────────────────────────────────┘  │
│                                                                 │
│  Implementation Layers                                          │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │ Ingestion → Validation → Processing → Publishing          │  │
│  │     │           │            │            │               │  │
│  │ Schema      Rules Engine Aggregate    SLA Monitor         │  │
│  │ Check                    Checks                           │  │
│  └───────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────┘
Data Quality Rules (custom PySpark rule engine):
from pyspark.sql import DataFrame, SparkSession
from pyspark.sql.functions import col
spark = SparkSession.builder.getOrCreate()
class DataQualityEngine:
    def __init__(self):
        self.rules = {}
    def add_rule(self, name, rule_func, severity='ERROR'):
        # rule_func takes a DataFrame and returns True when the rule passes
        self.rules[name] = {
            'func': rule_func,
            'severity': severity
        }
    def validate(self, df: DataFrame) -> dict:
        results = {}
        for name, rule in self.rules.items():
            try:
                passed = rule['func'](df)
                results[name] = {
                    'passed': passed,
                    'severity': rule['severity']
                }
            except Exception as e:
                # A rule that raises is recorded as failed instead of aborting the run
                results[name] = {
                    'passed': False,
                    'error': str(e),
                    'severity': rule['severity']
                }
        return results
# Usage
engine = DataQualityEngine()
# Add rules
engine.add_rule(
    'no_nulls',
    lambda df: df.filter(col('customer_id').isNull()).count() == 0
)
engine.add_rule(
    'valid_amounts',
    lambda df: df.filter(col('amount') < 0).count() == 0
)
engine.add_rule(
    'referential_integrity',
    # Rows whose customer_id has no match in the customers table
    lambda df: df.join(
        spark.read.table('customers'),
        'customer_id',
        'left_anti'
    ).count() == 0
)
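Validation results only act as a quality gate if something decides whether the batch may move to the next layer. One possible policy, sketched below, blocks promotion when any `ERROR`-severity rule fails and reports other failures as warnings. `evaluate_gate` is a hypothetical helper that consumes the dictionary shape returned by `validate`; the sample results are invented for illustration.

```python
def evaluate_gate(results: dict) -> dict:
    """Decide whether a batch may be promoted, given per-rule validation results."""
    failed = {name: r for name, r in results.items() if not r["passed"]}
    blocking = sorted(n for n, r in failed.items() if r["severity"] == "ERROR")
    warnings = sorted(n for n, r in failed.items() if r["severity"] != "ERROR")
    return {"promote": not blocking, "blocking": blocking, "warnings": warnings}


sample_results = {
    "no_nulls": {"passed": True, "severity": "ERROR"},
    "valid_amounts": {"passed": False, "severity": "WARNING"},
    "referential_integrity": {"passed": False, "severity": "ERROR"},
}
decision = evaluate_gate(sample_results)
print(decision)
```

In a pipeline, a job would typically stop or quarantine the batch when `promote` is false and log the warnings otherwise.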
Q8: How do you implement data lineage in a data lake?
Answer:
Lineage Tracking Architecture:
┌─────────────────────────────────────────────────────────────────┐
│                 Data Lake Lineage Architecture                  │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │                     Lineage Metadata                      │  │
│  │ ┌───────────┬────────────────┬───────────┬──────────────┐ │  │
│  │ │ Source    │ Transformation │ Target    │ Lineage      │ │  │
│  │ │ Tables    │ Details        │ Tables    │ Graph        │ │  │
│  │ └───────────┴────────────────┴───────────┴──────────────┘ │  │
│  └───────────────────────────────────────────────────────────┘  │
│                               │                                 │
│         ┌─────────────────────┼─────────────────────┐           │
│         ▼                     ▼                     ▼           │
│  ┌─────────────┐       ┌─────────────┐       ┌─────────────┐    │
│  │ Glue        │       │ CloudTrail  │       │ Custom      │    │
│  │ Catalog     │       │ API Logs    │       │ Metadata    │    │
│  └─────────────┘       └─────────────┘       └─────────────┘    │
└─────────────────────────────────────────────────────────────────┘
Lineage Implementation:
import boto3
from datetime import datetime, timezone
class LineageTracker:
    def __init__(self):
        self.lineage_store = []
    def track_transformation(self, job_name, inputs, outputs, transformation_sql):
        lineage_record = {
            'job_name': job_name,
            'inputs': inputs,
            'outputs': outputs,
            'transformation': transformation_sql,
            'timestamp': datetime.now(timezone.utc).isoformat(),
            # get_job_run_id() is assumed to be defined elsewhere
            'job_run_id': get_job_run_id()
        }
        self.lineage_store.append(lineage_record)
        # Store in DynamoDB
        dynamodb = boto3.resource('dynamodb')
        table = dynamodb.Table('data_lineage')
        table.put_item(Item=lineage_record)
    def get_upstream(self, table_name, _seen=None):
        # Get all upstream dependencies, following inputs recursively
        seen = set() if _seen is None else _seen
        upstream = []
        for record in self.lineage_store:
            if table_name in record['outputs'] and id(record) not in seen:
                seen.add(id(record))
                upstream.append(record)
                for source in record['inputs']:
                    upstream.extend(self.get_upstream(source, seen))
        return upstream
    def get_downstream(self, table_name, _seen=None):
        # Get all downstream dependencies, following outputs recursively
        seen = set() if _seen is None else _seen
        downstream = []
        for record in self.lineage_store:
            if table_name in record['inputs'] and id(record) not in seen:
                seen.add(id(record))
                downstream.append(record)
                for target in record['outputs']:
                    downstream.extend(self.get_downstream(target, seen))
        return downstream
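Lineage records of this shape form a directed graph, so finding everything that feeds a table is a graph traversal. The sketch below shows one way to do it iteratively with a visited set, which also protects against cycles. `upstream_tables` is a hypothetical standalone function and the table names are invented for illustration.

```python
from collections import deque


def upstream_tables(lineage_records, table_name):
    """Return every table that directly or transitively feeds table_name."""
    # Map each output table to the set of tables it is produced from
    producers = {}
    for record in lineage_records:
        for output in record["outputs"]:
            producers.setdefault(output, set()).update(record["inputs"])

    seen, queue = set(), deque([table_name])
    while queue:
        current = queue.popleft()
        for parent in producers.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


records = [
    {"job_name": "bronze_to_silver", "inputs": ["bronze.sales"], "outputs": ["silver.sales"]},
    {"job_name": "silver_to_gold", "inputs": ["silver.sales", "silver.customers"],
     "outputs": ["gold.sales_by_customer"]},
]
print(sorted(upstream_tables(records, "gold.sales_by_customer")))
```

The same function answers downstream questions if the `inputs` and `outputs` roles are swapped when building the map.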
Q9: How do you handle schema evolution in a data lake?
Answer:
Schema Evolution Strategies:
1. Schema Registry with Glue:
# Register schema
import json
import boto3
glue = boto3.client('glue')
# Create schema (the registry accepts AVRO, JSON or PROTOBUF; this definition is Avro)
glue.create_schema(
    RegistryId={'RegistryName': 'my-registry'},
    SchemaName='sales-schema',
    DataFormat='AVRO',
    Compatibility='BACKWARD',
    SchemaDefinition=json.dumps({
        'type': 'record',
        'name': 'Sales',
        'fields': [
            {'name': 'id', 'type': 'string'},
            {'name': 'amount', 'type': 'double'}
        ]
    })
)
2. Spark Schema Evolution:
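# Read with schema evolution: merge the schemas found across all Parquet files
df = spark.read \
    .option("mergeSchema", "true") \
    .parquet("s3://data-lake/sales/")
# Append new data. For plain Parquet, mergeSchema is a read-side option,
# so files with added columns are simply written next to the older files.
df.write \
    .mode("append") \
    .parquet("s3://data-lake/sales/")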
3. Schema Evolution Patterns:
┌─────────────────────────────────────────────────────────────────┐
│                    Schema Evolution Patterns                    │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Adding Columns (compatible when given a default)               │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │ Old Schema: [id, name, amount]                            │  │
│  │ New Schema: [id, name, amount, category]                  │  │
│  │ Strategy: Add with a default so old files stay readable   │  │
│  └───────────────────────────────────────────────────────────┘  │
│                                                                 │
│  Removing Columns (backward compatible)                         │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │ Old Schema: [id, name, amount, category]                  │  │
│  │ New Schema: [id, name, amount]                            │  │
│  │ Strategy: Old files keep the column; new readers skip it  │  │
│  └───────────────────────────────────────────────────────────┘  │
│                                                                 │
│  Renaming Columns (breaking unless aliased)                     │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │ Old Schema: [id, cust_name, amt]                          │  │
│  │ New Schema: [id, customer_name, amount]                   │  │
│  │ Strategy: Use aliases, maintain both names                │  │
│  └───────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────┘
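The three patterns can be checked mechanically before a schema change is deployed. The sketch below tests backward compatibility in the Avro sense: a reader using the new schema must be able to read data written with the old one. `is_backward_compatible` is a hypothetical, simplified helper (it treats any type change as incompatible and knows nothing about aliases), not the check a schema registry performs.

```python
def is_backward_compatible(old_fields: dict, new_fields: dict) -> bool:
    """True when a reader using new_fields can read data written with old_fields.

    Each dict maps a field name to {"type": str, "default": ...}; "default" is optional.
    """
    for name, spec in new_fields.items():
        if name not in old_fields:
            if "default" not in spec:
                return False  # new required field: old records cannot supply it
        elif old_fields[name]["type"] != spec["type"]:
            return False  # type changes are treated as incompatible in this sketch
    return True


old = {"id": {"type": "string"}, "name": {"type": "string"}, "amount": {"type": "double"}}
added_with_default = {**old, "category": {"type": "string", "default": None}}
added_without_default = {**old, "category": {"type": "string"}}
renamed = {"id": {"type": "string"}, "customer_name": {"type": "string"}, "amount": {"type": "double"}}

print(is_backward_compatible(old, added_with_default))
print(is_backward_compatible(old, added_without_default))
print(is_backward_compatible(added_with_default, old))
print(is_backward_compatible(old, renamed))
```

A rename shows up here as one removed field plus one new required field, which is why it fails unless the reader is told about the old name.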
Q10: Design a multi-account data lake architecture.
Answer:
Multi-Account Architecture:
┌─────────────────────────────────────────────────────────────────┐
│              Multi-Account Data Lake Architecture               │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Data Account          Analytics Account     Consumption        │
│  ┌─────────────┐       ┌─────────────┐       ┌─────────────┐    │
│  │ S3          │──────▶│ EMR/Glue    │──────▶│ Athena      │    │
│  │ (Raw)       │Peering│             │       │ Redshift    │    │
│  └──────┬──────┘       └──────┬──────┘       └──────┬──────┘    │
│         │                     │                     │           │
│         ▼                     ▼                     ▼           │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │                     AWS Organizations                     │  │
│  │ ┌─────────────┬─────────────┬─────────────┬─────────────┐ │  │
│  │ │ Security    │ Logging     │ Audit       │ Billing     │ │  │
│  │ │ Account     │ Account     │ Account     │ Account     │ │  │
│  │ └─────────────┴─────────────┴─────────────┴─────────────┘ │  │
│  └───────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────┘
Cross-Account Access:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"AWS": "arn:aws:iam::analytics-account:role/DataAnalyst"
},
"Action": [
"s3:GetObject",
"s3:ListBucket"
],
"Resource": [
"arn:aws:s3:::shared-data-lake",
"arn:aws:s3:::shared-data-lake/*"
],
"Condition": {
"StringEquals": {
"aws:PrincipalOrgID": "o-xxxxxxxxxx"
}
}
}
]
}
Q11: How do you implement data lake governance with Lake Formation?
Answer:
Lake Formation Components:
# Register location
# (123456789012 is a placeholder account ID in the ARNs below)
import boto3
lakeformation = boto3.client('lakeformation')
lakeformation.register_resource(
    ResourceArn='arn:aws:s3:::data-lake-bucket',
    RoleArn='arn:aws:iam::123456789012:role/LakeFormationRole',
    UseServiceLinkedRole=False
)
# Grant database permissions
lakeformation.grant_permissions(
    Principal={'DataLakePrincipalIdentifier': 'arn:aws:iam::123456789012:role/data-engineer'},
    Resource={
        'Database': {
            'Name': 'analytics_db'
        }
    },
    Permissions=['CREATE_TABLE', 'ALTER', 'DROP']
)
# Grant table permissions with grant option
# (Lake Formation has no UPDATE permission on tables)
table_permissions = ['SELECT', 'INSERT', 'DELETE']
lakeformation.grant_permissions(
    Principal={'DataLakePrincipalIdentifier': 'arn:aws:iam::123456789012:role/data-steward'},
    Resource={
        'Table': {
            'DatabaseName': 'analytics_db',
            'Name': 'sales_data'
        }
    },
    Permissions=table_permissions,
    # The grant option is expressed as the list of permissions the principal may pass on
    PermissionsWithGrantOption=table_permissions
)
Governance Architecture:
┌─────────────────────────────────────────────────────────────────┐
│                    Lake Formation Governance                    │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │ Admin Layer                                               │  │
│  │ • Register storage locations                              │  │
│  │ • Manage data lake administrators                         │  │
│  │ • Configure cross-account permissions                     │  │
│  └───────────────────────────────────────────────────────────┘  │
│                                │                                │
│                                ▼                                │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │ Permission Layer                                          │  │
│  │ • Database-level permissions                              │  │
│  │ • Table-level permissions                                 │  │
│  │ • Column-level permissions                                │  │
│  │ • Row-level permissions                                   │  │
│  └───────────────────────────────────────────────────────────┘  │
│                                │                                │
│                                ▼                                │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │ Audit Layer                                               │  │
│  │ • CloudTrail integration                                  │  │
│  │ • Permission grant/revoke history                         │  │
│  │ • Data access logs                                        │  │
│  └───────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────┘
Q12: How do you optimize costs in a data lake?
Answer:
Cost Optimization Strategies:
1. Storage Optimization:
# S3 Lifecycle Policies
s3 = boto3.client('s3')
lifecycle_config = {
'Rules': [
{
'ID': 'OptimizeStorage',
'Status': 'Enabled',
'Filter': {'Prefix': 'bronze/'},
'Transitions': [
{'Days': 30, 'StorageClass': 'STANDARD_IA'},
{'Days': 90, 'StorageClass': 'GLACIER'},
{'Days': 365, 'StorageClass': 'DEEP_ARCHIVE'}
]
}
]
}
s3.put_bucket_lifecycle_configuration(
Bucket='data-lake-bucket',
LifecycleConfiguration=lifecycle_config
)
2. Query Optimization:
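# Optimize file sizes for Athena
# Target: 128MB - 1GB per file
import math
from urllib.parse import urlparse
import boto3
def optimize_file_sizes(input_path, output_path, target_size_mb=256):
    # Sum the stored size of the input objects; a row count is not a reliable size estimate
    parsed = urlparse(input_path)
    pages = boto3.client('s3').get_paginator('list_objects_v2').paginate(
        Bucket=parsed.netloc, Prefix=parsed.path.lstrip('/')
    )
    total_size = sum(obj['Size'] for page in pages for obj in page.get('Contents', []))
    # Calculate the number of output files needed to hit the target size
    num_partitions = max(1, math.ceil(total_size / (target_size_mb * 1024 * 1024)))
    df = spark.read.parquet(input_path)
    df.repartition(num_partitions) \
        .write \
        .mode('overwrite') \
        .parquet(output_path)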
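The sizing arithmetic is easier to reason about, and to test, when it is separated from the Spark job. A minimal sketch, assuming the total input size in bytes is already known (for example from an S3 listing); `target_file_count` is a hypothetical helper.

```python
import math


def target_file_count(total_bytes: int, target_size_mb: int = 256) -> int:
    """Number of output files needed so that each is roughly target_size_mb."""
    target_bytes = target_size_mb * 1024 * 1024
    # Round up so no file exceeds the target, and always write at least one file
    return max(1, math.ceil(total_bytes / target_bytes))


print(target_file_count(10 * 1024 ** 3))      # 10 GiB of input
print(target_file_count(300 * 1024 * 1024))   # slightly more than one target file
```

Rounding up rather than truncating avoids producing a single oversized file when the input is just above a multiple of the target size.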
3. Cost Monitoring:
# CloudWatch billing alarm (billing metrics are published only in us-east-1)
cloudwatch = boto3.client('cloudwatch', region_name='us-east-1')
cloudwatch.put_metric_alarm(
    AlarmName='DataLakeCostAnomaly',
    MetricName='EstimatedCharges',
    Namespace='AWS/Billing',
    Dimensions=[{'Name': 'Currency', 'Value': 'USD'}],
    Statistic='Maximum',
    Period=86400,
    EvaluationPeriods=1,
    Threshold=1000,  # USD of estimated charges
    ComparisonOperator='GreaterThanThreshold',
    AlarmActions=['arn:aws:sns:us-east-1:123456789:cost-alerts']
)
Cost Breakdown:
┌─────────────────────────────────────────────────────────────────┐
│               Data Lake Cost Optimization Impact                │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Storage:                                                       │
│    • Standard → IA: 40-50% savings                              │
│    • Lifecycle policies: 60-80% savings                         │
│    • Compression (Snappy): 60-70% size reduction                │
│                                                                 │
│  Queries:                                                       │
│    • Partitioning: 90%+ cost reduction                          │
│    • File optimization: 50%+ faster queries                     │
│    • Result caching: 80%+ faster repeated queries               │
│                                                                 │
│  Processing:                                                    │
│    • Spot instances: 60-70% savings                             │
│    • Auto-scaling: 30-40% savings                               │
│    • Serverless (Glue): No idle costs                           │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
Q13: How do you implement data lake disaster recovery?
Answer:
DR Architecture:
┌─────────────────────────────────────────────────────────────────┐
│                   Data Lake Disaster Recovery                   │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Primary Region (us-east-1)        DR Region (us-west-2)        │
│  ┌─────────────────────┐           ┌─────────────────────┐      │
│  │ S3 Primary Bucket   │──────────▶│ S3 Replica Bucket   │      │
│  │ (Versioning Enabled)│    CRR    │                     │      │
│  └─────────────────────┘           └─────────────────────┘      │
│                                                                 │
│  ┌─────────────────────┐           ┌─────────────────────┐      │
│  │ Glue Catalog        │──────────▶│ Glue Catalog        │      │
│  │ (Primary)           │   Cross   │ (Replica)           │      │
│  └─────────────────────┘  Region   └─────────────────────┘      │
│                                                                 │
│  ┌─────────────────────┐           ┌─────────────────────┐      │
│  │ Lake Formation      │──────────▶│ Lake Formation      │      │
│  │ Permissions         │  Backup   │ Permissions         │      │
│  └─────────────────────┘           └─────────────────────┘      │
│                                                                 │
│  RPO: 15 minutes (CRR lag, with S3 RTC enabled)                 │
│  RTO: 30 minutes (automated failover)                           │
└─────────────────────────────────────────────────────────────────┘
S3 Cross-Region Replication:
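s3 = boto3.client('s3', region_name='us-east-1')
s3_dr = boto3.client('s3', region_name='us-west-2')
# Enable versioning first (required on both source and destination for CRR)
s3.put_bucket_versioning(
    Bucket='primary-data-lake',
    VersioningConfiguration={'Status': 'Enabled'}
)
s3_dr.put_bucket_versioning(
    Bucket='dr-data-lake',
    VersioningConfiguration={'Status': 'Enabled'}
)
# Enable CRR
s3.put_bucket_replication(
    Bucket='primary-data-lake',
    ReplicationConfiguration={
        # 123456789012 is a placeholder account ID
        'Role': 'arn:aws:iam::123456789012:role/s3-replication-role',
        'Rules': [
            {
                'ID': 'replicate-all',
                # Rules that use Filter also need Priority and DeleteMarkerReplication
                'Priority': 1,
                'Status': 'Enabled',
                'Filter': {'Prefix': ''},
                'DeleteMarkerReplication': {'Status': 'Disabled'},
                'Destination': {
                    'Bucket': 'arn:aws:s3:::dr-data-lake',
                    'StorageClass': 'STANDARD_IA'
                }
            }
        ]
    }
)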
Q14: How do you implement data lake monitoring and observability?
Answer:
Monitoring Architecture:
┌─────────────────────────────────────────────────────────────────┐
│                Data Lake Monitoring Architecture                │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Metrics               Logs                  Traces             │
│  ┌─────────────┐       ┌─────────────┐       ┌─────────────┐    │
│  │ CloudWatch  │       │ CloudWatch  │       │ X-Ray       │    │
│  │ Metrics     │       │ Logs        │       │ Tracing     │    │
│  └──────┬──────┘       └──────┬──────┘       └──────┬──────┘    │
│         │                     │                     │           │
│         └─────────────────────┼─────────────────────┘           │
│                               ▼                                 │
│                        ┌─────────────┐                          │
│                        │ Dashboard   │                          │
│                        │ (QuickSight)│                          │
│                        └─────────────┘                          │
└─────────────────────────────────────────────────────────────────┘
CloudWatch Metrics:
cloudwatch = boto3.client('cloudwatch')
# Data lake metrics
metrics = [
{
'MetricName': 'DataLakeSize',
'Dimensions': [{'Name': 'Lake', 'Value': 'production'}],
'Value': get_lake_size_gb(),
'Unit': 'Gigabytes'
},
{
'MetricName': 'QueryCount',
'Dimensions': [{'Name': 'Service', 'Value': 'Athena'}],
'Value': get_athena_query_count(),
'Unit': 'Count'
},
{
'MetricName': 'DataFreshness',
'Dimensions': [{'Name': 'Dataset', 'Value': 'sales'}],
# CloudWatch has no 'Hours' unit, so the age is published in seconds
'Value': get_data_freshness_hours() * 3600,
'Unit': 'Seconds'
}
]
cloudwatch.put_metric_data(
Namespace='DataLake/Analytics',
MetricData=metrics
)
S3 Metrics:
# Monitor S3 bucket metrics
from datetime import datetime, timedelta
import boto3
cloudwatch = boto3.client('cloudwatch')
# Get S3 bucket size
response = cloudwatch.get_metric_statistics(
Namespace='AWS/S3',
MetricName='BucketSizeBytes',
Dimensions=[
{'Name': 'BucketName', 'Value': 'data-lake-bucket'},
{'Name': 'StorageType', 'Value': 'StandardStorage'}
],
StartTime=datetime.now() - timedelta(days=1),
EndTime=datetime.now(),
Period=86400,
Statistics=['Average']
)
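The `get_metric_statistics` call returns a dictionary with a `Datapoints` list, which may be empty because S3 storage metrics are reported only about once per day. A minimal sketch of turning that response into a single number; `latest_bucket_size_gb` is a hypothetical helper and the sample response below is fabricated purely to show the shape.

```python
from datetime import datetime, timezone


def latest_bucket_size_gb(response: dict):
    """Return the most recent BucketSizeBytes average in GiB, or None if no data."""
    datapoints = response.get("Datapoints", [])
    if not datapoints:
        return None
    latest = max(datapoints, key=lambda dp: dp["Timestamp"])
    return latest["Average"] / 1024 ** 3


# Illustrative response only; real values come from CloudWatch
sample_response = {
    "Datapoints": [
        {"Timestamp": datetime(2024, 1, 1, tzinfo=timezone.utc), "Average": 1 * 1024 ** 3, "Unit": "Bytes"},
        {"Timestamp": datetime(2024, 1, 2, tzinfo=timezone.utc), "Average": 2 * 1024 ** 3, "Unit": "Bytes"},
    ]
}
print(latest_bucket_size_gb(sample_response))
```

Returning `None` for an empty result lets the caller distinguish "no data yet" from an empty bucket.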
Q15: How do you implement data lake automation?
Answer:
Automation Architecture:
┌─────────────────────────────────────────────────────────────────┐
│                Data Lake Automation Architecture                │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  CI/CD Pipeline                                                 │
│  ┌─────────────┐     ┌─────────────┐     ┌─────────────────┐    │
│  │ Code        │────▶│ Build       │────▶│ Deploy          │    │
│  │ Commit      │     │ (CodeBuild) │     │ (CloudFormation)│    │
│  └─────────────┘     └─────────────┘     └─────────────────┘    │
│                                                                 │
│  Orchestration                                                  │
│  ┌─────────────┐       ┌─────────────┐       ┌─────────────┐    │
│  │ EventBridge │──────▶│ Step        │──────▶│ Lambda      │    │
│  │ Rules       │       │ Functions   │       │ Tasks       │    │
│  └─────────────┘       └─────────────┘       └─────────────┘    │
│                                                                 │
│  Infrastructure as Code                                         │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │ • CloudFormation / CDK                                    │  │
│  │ • Glue workflows                                          │  │
│  │ • Step Functions state machines                           │  │
│  └───────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────┘
EventBridge Automation:
import boto3
events = boto3.client('events')
# Schedule daily data lake refresh
events.put_rule(
Name='DailyDataLakeRefresh',
ScheduleExpression='cron(0 2 * * ? *)',
State='ENABLED'
)
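# Add the Step Functions state machine as the rule target
events.put_targets(
    Rule='DailyDataLakeRefresh',
    Targets=[
        {
            'Id': 'DataLakeRefreshStateMachine',
            'Arn': 'arn:aws:states:us-east-1:123456789:stateMachine:DataLakeRefresh',
            # 123456789012 is a placeholder account ID; the role must allow states:StartExecution
            'RoleArn': 'arn:aws:iam::123456789012:role/EventBridgeRole'
        }
    ]
)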
CloudFormation Template:
Resources:
  DataLakeBucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketName: !Sub "data-lake-${AWS::AccountId}"
      VersioningConfiguration:
        Status: Enabled
      LifecycleConfiguration:
        Rules:
          - Id: OptimizeStorage
            Status: Enabled
            Transitions:
              - TransitionInDays: 30
                StorageClass: STANDARD_IA
              - TransitionInDays: 90
                StorageClass: GLACIER
  GlueCrawler:
    Type: AWS::Glue::Crawler
    Properties:
      Name: data-lake-crawler
      # GlueRole is an AWS::IAM::Role assumed to be defined elsewhere in the template
      Role: !GetAtt GlueRole.Arn
      DatabaseName: analytics_db
      Targets:
        S3Targets:
          - Path: !Sub "s3://${DataLakeBucket}/silver/"
      Schedule:
        ScheduleExpression: cron(0 1 * * ? *)
Q16: How do you implement data lake for machine learning workloads?
Answer:
ML-Ready Data Lake Architecture:
┌─────────────────────────────────────────────────────────────────┐
│                 Data Lake for Machine Learning                  │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Data Sources          Feature Engineering   Feature Store      │
│  ┌─────────────┐       ┌─────────────┐       ┌─────────────┐    │
│  │ Raw Data    │──────▶│ Glue/EMR    │──────▶│ SageMaker   │    │
│  │ (S3)        │       │ Features    │       │Feature Store│    │
│  └─────────────┘       └─────────────┘       └─────────────┘    │
│                                                                 │
│  Training              Model Registry        Deployment         │
│  ┌─────────────┐       ┌─────────────┐       ┌─────────────┐    │
│  │ SageMaker   │──────▶│ Model       │──────▶│ Endpoint    │    │
│  │ Training    │       │ Registry    │       │ (Real-time) │    │
│  └─────────────┘       └─────────────┘       └─────────────┘    │
│                                                                 │
│  Batch Transform                                                │
│  ┌─────────────┐       ┌─────────────┐       ┌─────────────┐    │
│  │ SageMaker   │──────▶│ Results     │──────▶│ S3/Redshift │    │
│  │ Batch       │       │             │       │             │    │
│  └─────────────┘       └─────────────┘       └─────────────┘    │
└─────────────────────────────────────────────────────────────────┘
Feature Engineering Pipeline:
# SageMaker Processing for feature engineering
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.sklearn.processing import SKLearnProcessor
sklearn_processor = SKLearnProcessor(
framework_version='0.23-1',
role='arn:aws:iam::role/SageMakerRole',
instance_count=2,
instance_type='ml.m5.xlarge'
)
sklearn_processor.run(
code='feature_engineering.py',
inputs=[
ProcessingInput(
source='s3://data-lake/raw/customers/',
destination='/opt/ml/processing/input/customers'
),
ProcessingInput(
source='s3://data-lake/raw/transactions/',
destination='/opt/ml/processing/input/transactions'
)
],
outputs=[
ProcessingOutput(
output_name='features',
source='/opt/ml/processing/output/features',
destination='s3://data-lake/features/customers/'
)
]
)
Q17: How do you implement data lake for analytics and reporting?
Answer:
Analytics Architecture:
┌─────────────────────────────────────────────────────────────────┐
│                Data Lake Analytics Architecture                 │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Ad-Hoc Analytics      BI Analytics          Real-Time Analytics│
│  ┌─────────────┐       ┌─────────────┐       ┌─────────────┐    │
│  │ Athena      │       │ QuickSight  │       │ Kinesis     │    │
│  │ (SQL)       │       │ (Dashboards)│       │ Analytics   │    │
│  └──────┬──────┘       └──────┬──────┘       └──────┬──────┘    │
│         │                     │                     │           │
│         └─────────────────────┼─────────────────────┘           │
│                               ▼                                 │
│                        ┌─────────────┐                          │
│                        │ Data Lake   │                          │
│                        │ (S3)        │                          │
│                        └─────────────┘                          │
└─────────────────────────────────────────────────────────────────┘
Athena Query Optimization:
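-- Create an optimized table for analytics
-- With partition projection enabled, every partition column needs a projection rule
-- The region values below are placeholders; list the regions actually present
CREATE EXTERNAL TABLE sales_analytics (
    transaction_id STRING,
    customer_id STRING,
    product_id STRING,
    amount DECIMAL(10,2),
    quantity INT,
    transaction_date TIMESTAMP
)
PARTITIONED BY (year INT, month INT, region STRING)
STORED AS PARQUET
LOCATION 's3://data-lake/gold/sales/'
TBLPROPERTIES (
    'parquet.compression'='SNAPPY',
    'projection.enabled'='true',
    'projection.year.type'='integer',
    'projection.year.range'='2020,2030',
    'projection.month.type'='integer',
    'projection.month.range'='1,12',
    'projection.region.type'='enum',
    'projection.region.values'='us-east,us-west,eu-west'
);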
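Partitioning only pays off when queries filter on the partition columns. A minimal sketch of a query builder that always includes `year`, `month`, and `region` predicates is shown below. The helper name `build_sales_query` and the chosen aggregate are illustrative assumptions, not part of the table definition above.

```python
import re


def build_sales_query(year: int, month: int, region: str) -> str:
    """Return an Athena SQL string that prunes partitions on year/month/region."""
    if not 1 <= int(month) <= 12:
        raise ValueError("month must be between 1 and 12")
    # Accept only simple region tokens so the value is safe to inline
    if not re.fullmatch(r"[A-Za-z0-9_-]+", region):
        raise ValueError("region contains unexpected characters")
    return (
        "SELECT product_id, SUM(amount) AS revenue "
        "FROM sales_analytics "
        f"WHERE year = {int(year)} AND month = {int(month)} "
        f"AND region = '{region}' "
        "GROUP BY product_id"
    )
```

Because all three predicates are on partition columns, Athena can skip every S3 prefix outside the requested month and region.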
QuickSight Integration:
# Create a QuickSight data source backed by Athena
import boto3
quicksight = boto3.client('quicksight')
response = quicksight.create_data_source(
    AwsAccountId='123456789012',  # placeholder; account IDs have 12 digits
    DataSourceId='data-lake-source',
    Name='Data Lake Analytics',
    Type='ATHENA',
    DataSourceParameters={
        'AthenaParameters': {
            'WorkGroup': 'primary'
        }
    },
    # Grant the admin user management rights on the data source
    Permissions=[
        {
            'Principal': 'arn:aws:quicksight:us-east-1:123456789012:user/default/admin',
            'Actions': [
                'quicksight:DescribeDataSource',
                'quicksight:DescribeDataSourcePermissions',
                'quicksight:UpdateDataSource',
                'quicksight:DeleteDataSource'
            ]
        }
    ]
)
Q18: How do you implement a data lake for IoT and time-series data?
Answer:
IoT Data Lake Architecture:
┌─────────────────────────────────────────────────────────────────┐
│                   IoT Data Lake Architecture                    │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  IoT Devices           Ingestion             Storage            │
│  ┌───────────────┐     ┌───────────────┐     ┌───────────────┐  │
│  │ Sensors       │────▶│ IoT Core      │────▶│ Kinesis       │  │
│  │ (MQTT)        │     │               │     │ Streams       │  │
│  └───────────────┘     └───────────────┘     └───────┬───────┘  │
│                                                      │          │
│  ┌───────────────┐     ┌───────────────┐             │          │
│  │ Gateways      │────▶│ IoT Analytics │─────────────┤          │
│  └───────────────┘     └───────────────┘             │          │
│                                                      ▼          │
│                                              ┌───────────────┐  │
│                                              │ S3            │  │
│                                              │ (IoT Lake)    │  │
│                                              └───────┬───────┘  │
│                                                      │          │
│          ┌─────────────────────┬─────────────────────┤          │
│          ▼                     ▼                     ▼          │
│  ┌───────────────┐     ┌───────────────┐     ┌───────────────┐  │
│  │ Timestream    │     │ Athena        │     │ QuickSight    │  │
│  │ (Metrics)     │     │ (Query)       │     │ (Viz)         │  │
│  └───────────────┘     └───────────────┘     └───────────────┘  │
└─────────────────────────────────────────────────────────────────┘
IoT Data Partitioning:
# Partitioning for IoT data: device_type/year/month/day/hour
# Hour-level partitions suit high-volume streams; drop `hour` if the
# resulting files end up much smaller than about 128 MB
partition_schema = """
    device_type STRING,
    year INT,
    month INT,
    day INT,
    hour INT
"""
# Write IoT data by partition (df must already contain these columns)
df.write \
    .partitionBy("device_type", "year", "month", "day", "hour") \
    .parquet("s3://data-lake/iot/")
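The partition columns have to be derived from each event's timestamp before the write. A small sketch of that derivation follows, producing the Hive-style prefix that `partitionBy` would create. The function name `iot_partition_path` and the use of UTC epoch seconds are assumptions made for illustration.

```python
from datetime import datetime, timezone


def iot_partition_path(base: str, device_type: str, epoch_seconds: float) -> str:
    """Build the Hive-style partition prefix for one IoT event (UTC based)."""
    ts = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return (
        f"{base.rstrip('/')}/device_type={device_type}"
        f"/year={ts.year}/month={ts.month}/day={ts.day}/hour={ts.hour}/"
    )
```

Using UTC for every device keeps partitions stable across time zones and daylight-saving changes.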
Q19: How do you implement a data lake for unstructured data?
Answer:
Unstructured Data Architecture:
┌─────────────────────────────────────────────────────────────────┐
│               Unstructured Data Lake Architecture               │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Data Types            Processing            Storage            │
│  ┌───────────────┐     ┌───────────────┐     ┌───────────────┐  │
│  │ Images        │────▶│ Rekognition   │────▶│ S3            │  │
│  │ Videos        │     │               │     │ (Binary)      │  │
│  └───────────────┘     └───────────────┘     └───────────────┘  │
│                                                                 │
│  ┌───────────────┐     ┌───────────────┐     ┌───────────────┐  │
│  │ Documents     │────▶│ Textract      │────▶│ Metadata      │  │
│  │ PDFs          │     │ Comprehend    │     │ (DynamoDB)    │  │
│  └───────────────┘     └───────────────┘     └───────────────┘  │
│                                                                 │
│  ┌───────────────┐     ┌───────────────┐     ┌───────────────┐  │
│  │ Audio         │────▶│ Transcribe    │────▶│ Text Index    │  │
│  │               │     │ Comprehend    │     │ (OpenSearch)  │  │
│  └───────────────┘     └───────────────┘     └───────────────┘  │
└─────────────────────────────────────────────────────────────────┘
Metadata Extraction:
import boto3
from datetime import datetime, timezone
from decimal import Decimal
# Extract metadata from images
rekognition = boto3.client('rekognition')
def extract_image_metadata(bucket, key):
    response = rekognition.detect_labels(
        Image={'S3Object': {'Bucket': bucket, 'Name': key}},
        MaxLabels=10
    )
    metadata = {
        'labels': [label['Name'] for label in response['Labels']],
        # The DynamoDB resource rejects Python floats, so store Decimal values
        'confidence': [Decimal(str(label['Confidence'])) for label in response['Labels']]
    }
    return metadata
# Store metadata in DynamoDB
dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('image_metadata')
def store_image_metadata(bucket, key):
    metadata = extract_image_metadata(bucket, key)
    table.put_item(
        Item={
            'image_id': key,
            'bucket': bucket,
            'metadata': metadata,
            'processed_at': datetime.now(timezone.utc).isoformat()
        }
    )
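The boto3 DynamoDB resource raises a `TypeError` for Python `float` values and expects `Decimal` instead, which matters here because Rekognition confidences are floats. A minimal recursive converter is sketched below for nested metadata. The helper name `to_dynamodb_safe` is an assumption.

```python
from decimal import Decimal


def to_dynamodb_safe(value):
    """Recursively replace floats with Decimal so put_item accepts the item."""
    if isinstance(value, float):
        # Go through str() to avoid binary floating-point noise in the Decimal
        return Decimal(str(value))
    if isinstance(value, dict):
        return {k: to_dynamodb_safe(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_dynamodb_safe(v) for v in value]
    return value
```

Calling `to_dynamodb_safe(item)` just before `table.put_item(Item=...)` keeps the extraction code free of DynamoDB-specific types.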
Q20: How do you implement a data lake with data sharing capabilities?
Answer:
Data Sharing Architecture:
┌─────────────────────────────────────────────────────────────────┐
│                     Data Lake Data Sharing                      │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Data Producer Account             Data Consumer Accounts       │
│  ┌─────────────────────────┐       ┌─────────────────────────┐  │
│  │ S3 Bucket               │──────▶│ Cross-Account           │  │
│  │ (Source)                │       │ Access                  │  │
│  └─────────────────────────┘       └─────────────────────────┘  │
│                                                                 │
│  ┌─────────────────────────┐       ┌─────────────────────────┐  │
│  │ Lake Formation          │──────▶│ Shared Permissions      │  │
│  │ Grants                  │       │                         │  │
│  └─────────────────────────┘       └─────────────────────────┘  │
│                                                                 │
│  ┌─────────────────────────┐       ┌─────────────────────────┐  │
│  │ Redshift Spectrum       │──────▶│ Federated Queries       │  │
│  │ (Shared Tables)         │       │                         │  │
│  └─────────────────────────┘       └─────────────────────────┘  │
└─────────────────────────────────────────────────────────────────┘
Lake Formation Cross-Account Sharing:
# Grant cross-account access
import boto3
lakeformation = boto3.client('lakeformation')
lakeformation.grant_permissions(
    # Replace consumer-account with the consumer's 12-digit AWS account ID
    Principal={'DataLakePrincipalIdentifier': 'arn:aws:iam::consumer-account:role/DataAnalyst'},
    Resource={
        'Table': {
            'DatabaseName': 'shared_db',
            'Name': 'shared_table'
        }
    },
    Permissions=['SELECT', 'DESCRIBE']
)
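Sharing is often limited to specific columns rather than a whole table. The sketch below only builds the request dictionary for a column-level grant using the `TableWithColumns` resource type; the helper name `build_column_grant` and the example column names are assumptions. Column-level grants support `SELECT` only.

```python
def build_column_grant(principal_arn, database, table, columns):
    """Build keyword arguments for a column-restricted grant_permissions call."""
    columns = list(columns)
    if not columns:
        raise ValueError("at least one column is required")
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                "ColumnNames": columns,
            }
        },
        "Permissions": ["SELECT"],
    }
```

The result can be passed as `lakeformation.grant_permissions(**request)`, which keeps the grant definition testable without AWS credentials.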
Q21: How do you implement data versioning in a data lake?
Answer:
Data Versioning Strategies:
1. S3 Versioning:
# Enable versioning
import boto3
s3 = boto3.client('s3')
s3.put_bucket_versioning(
    Bucket='data-lake-bucket',
    VersioningConfiguration={'Status': 'Enabled'}
)
# Get a specific version (the VersionId here is a placeholder)
response = s3.get_object(
    Bucket='data-lake-bucket',
    Key='sales/data.parquet',
    VersionId='abc123'
)
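Fetching a version requires knowing its `VersionId`. One way to find the version that was current at a given time is to scan entries shaped like the `Versions` list returned by `list_object_versions`. The sketch below works on plain dictionaries and ignores delete markers, which S3 returns in a separate list. The function name `version_as_of` is an assumption.

```python
from datetime import datetime, timezone


def version_as_of(versions, key, cutoff):
    """Return the VersionId of `key` that was newest at or before `cutoff`.

    `versions` is a list of dicts with 'Key', 'VersionId' and 'LastModified'
    (a timezone-aware datetime). Returns None if no version qualifies.
    """
    candidates = [
        v for v in versions
        if v["Key"] == key and v["LastModified"] <= cutoff
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda v: v["LastModified"])["VersionId"]
```

The returned ID can then be passed as `VersionId` to `get_object`.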
2. Delta Lake for ACID Transactions:
# Delta Lake versioning
from delta.tables import DeltaTable
# Write with versioning
df.write \
    .format("delta") \
    .mode("overwrite") \
    .save("s3://data-lake/sales/")
# Read specific version
spark.read \
    .format("delta") \
    .option("versionAsOf", 5) \
    .load("s3://data-lake/sales/")
# Time travel
spark.read \
    .format("delta") \
    .option("timestampAsOf", "2024-01-15") \
    .load("s3://data-lake/sales/")
3. Glue Table Versions:
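import boto3
glue = boto3.client('glue')
# Get table version
response = glue.get_table_version(
    DatabaseName='analytics_db',
    TableName='sales_data',
    VersionId='4'
)
# Restore table version: Glue has no dedicated restore call, so write the old
# definition back with update_table, which creates a new current version
old_table = response['TableVersion']['Table']
# Keep only fields that TableInput accepts; read-only ones such as CreateTime are rejected
writable = {
    'Name', 'Description', 'Owner', 'Retention', 'StorageDescriptor',
    'PartitionKeys', 'TableType', 'Parameters',
    'ViewOriginalText', 'ViewExpandedText'
}
glue.update_table(
    DatabaseName='analytics_db',
    TableInput={k: v for k, v in old_table.items() if k in writable}
)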
Q22: How do you implement a data lake for regulatory compliance?
Answer:
Compliance Architecture:
┌─────────────────────────────────────────────────────────────────┐
│                Data Lake Compliance Architecture                │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Data Classification  Access Control       Audit Logging        │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐  │
│  │ • PII           │  │ • Column-level  │  │ • CloudTrail    │  │
│  │ • PHI           │  │ • Row-level     │  │ • S3 access logs│  │
│  │ • Financial     │  │ • Time-based    │  │ • Glue audit    │  │
│  └─────────────────┘  └─────────────────┘  └─────────────────┘  │
│                                                                 │
│  Data Retention       Encryption           Masking              │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐  │
│  │ • Lifecycle     │  │ • SSE-KMS       │  │ • Dynamic       │  │
│  │ • Archival      │  │ • Client-side   │  │ • Static        │  │
│  │ • Deletion      │  │ • Field-level   │  │ • Tokenization  │  │
│  └─────────────────┘  └─────────────────┘  └─────────────────┘  │
└─────────────────────────────────────────────────────────────────┘
Data Masking:
from pyspark.sql.functions import col, concat, lit, md5
# Mask PII columns
def mask_pii(df):
    return df \
        .withColumn("masked_email",
            concat(
                md5(col("email")),
                lit("@masked.com")
            )
        ) \
        .withColumn("masked_ssn",
            concat(
                lit("XXX-XX-"),
                col("ssn").substr(-4, 4)
            )
        ) \
        .drop("email", "ssn")  # drop the raw values so only masked columns remain
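The compliance diagram also lists tokenization. An unsalted hash of an email address can be reversed by hashing guessed addresses, so a keyed hash is a common alternative. The sketch below uses HMAC-SHA256 from the standard library; the function names and the idea of loading `secret_key` from a secrets manager are assumptions, and this is not a replacement for a vetted tokenization service.

```python
import hashlib
import hmac


def tokenize(value: str, secret_key: bytes) -> str:
    """Deterministically map a value to a token using a keyed hash."""
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def mask_ssn(ssn: str) -> str:
    """Keep only the last four characters of an SSN."""
    return "XXX-XX-" + ssn[-4:]
```

The same input and key always produce the same token, so tokenized columns can still be joined, while different keys produce unrelated tokens.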
Q23: How do you implement data lake backup and recovery?
Answer:
Backup Strategy:
┌─────────────────────────────────────────────────────────────────┐
│                  Data Lake Backup Architecture                  │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│    Backup Strategy             RPO         RTO                  │
│  ┌───────────────────────────┬───────────┬───────────┐          │
│  │ S3 Versioning             │ Real-time │ Minutes   │          │
│  │ Cross-Region Replication  │ 15 min    │ Minutes   │          │
│  │ Daily Snapshots           │ 24 hours  │ Hours     │          │
│  │ Weekly Archives           │ 7 days    │ Hours     │          │
│  └───────────────────────────┴───────────┴───────────┘          │
│                                                                 │
│  Recovery Scenarios                                             │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │ • Accidental deletion: S3 versioning rollback             │  │
│  │ • Corruption: Point-in-time recovery                      │  │
│  │ • Regional outage: Cross-region failover                  │  │
│  │ • Ransomware: Restore from immutable backup               │  │
│  └───────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────┘
Automated Backup:
# Daily snapshot Lambda
import boto3
from datetime import datetime, timezone
def lambda_handler(event, context):
    s3 = boto3.client('s3')
    # Fix the snapshot date once so a run that crosses midnight uses one prefix
    snapshot_date = datetime.now(timezone.utc).strftime('%Y-%m-%d')
    # List all objects, one page at a time
    paginator = s3.get_paginator('list_objects_v2')
    pages = paginator.paginate(Bucket='data-lake-bucket')
    for page in pages:
        for obj in page.get('Contents', []):
            copy_source = {'Bucket': 'data-lake-bucket', 'Key': obj['Key']}
            # copy_object handles objects up to 5 GB; larger ones need a multipart copy
            s3.copy_object(
                CopySource=copy_source,
                Bucket='data-lake-backup-bucket',
                Key=f"daily/{snapshot_date}/{obj['Key']}",
                MetadataDirective='COPY'
            )
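The RPO column of the backup table can drive the choice of strategy. As a rough sketch, the helper below lists the strategies whose RPO is within a required bound. The values are copied from the table, "Real-time" is modelled as zero, and the name `strategies_meeting_rpo` is an assumption.

```python
from datetime import timedelta

# RPO values taken from the backup table; "Real-time" is treated as zero
BACKUP_STRATEGIES = [
    ("S3 Versioning", timedelta(0)),
    ("Cross-Region Replication", timedelta(minutes=15)),
    ("Daily Snapshots", timedelta(hours=24)),
    ("Weekly Archives", timedelta(days=7)),
]


def strategies_meeting_rpo(required_rpo: timedelta):
    """Return strategy names whose RPO does not exceed the requirement."""
    return [name for name, rpo in BACKUP_STRATEGIES if rpo <= required_rpo]
```

For example, a one-hour RPO requirement rules out daily snapshots and weekly archives as the only line of defence.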
Q24: How do you optimize data lake performance?
Answer:
Performance Optimization:
1. File Format Optimization:
┌─────────────────────────────────────────────────────────────────┐
│                     File Format Comparison                      │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Format    Compression   Query    Write    Columnar             │
│  ─────────────────────────────────────────────────────────────  │
│  Parquet   Snappy/Gzip   Fast     Medium   Yes                  │
│  ORC       Zlib          Fast     Slow     Yes                  │
│  Avro      Snappy        Medium   Fast     No                   │
│  JSON      None          Slow     Fast     No                   │
│  CSV       Gzip          Slow     Fast     No                   │
│                                                                 │
│  Recommendation: Parquet with Snappy for analytics workloads    │
└─────────────────────────────────────────────────────────────────┘
2. Partitioning Strategy:
# Partitioning for query pruning
# Partition by low-cardinality columns that appear in WHERE clauses;
# high-cardinality keys create many tiny partitions and slow queries down
df.write \
    .partitionBy("year", "month", "day", "region") \
    .parquet("s3://data-lake/sales/")
# Target file size: 128 MB - 1 GB per file
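The file-size target translates into a number of output files for a given dataset size. A minimal sketch of that arithmetic is shown below; the function name `target_file_count` and the 256 MB default are assumptions chosen from within the 128 MB to 1 GB range.

```python
import math


def target_file_count(total_bytes: int, target_file_mb: int = 256) -> int:
    """Estimate how many files to write so each is close to the target size."""
    target_bytes = target_file_mb * 1024 * 1024
    return max(1, math.ceil(total_bytes / target_bytes))
```

In Spark the result could be passed to `df.repartition(n)` before the write, keeping in mind that `partitionBy` then splits each of those files further by partition value.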
3. Caching:
# Athena workgroup configuration (output location, scan limit, engine version)
# Result reuse itself is requested per query via ResultReuseConfiguration
# in StartQueryExecution and needs Athena engine version 3
workgroup_config = {
    'Name': 'optimized',
    'Configuration': {
        'ResultConfiguration': {
            'OutputLocation': 's3://athena-results/'
        },
        'EnforceWorkGroupConfiguration': True,
        'PublishCloudWatchMetricsEnabled': True,
        'BytesScannedCutoffPerQuery': 1000000000,  # 1 GB limit
        'RequesterPaysEnabled': False,
        'EngineVersion': {
            # EffectiveEngineVersion is read-only and is reported by Athena
            'SelectedEngineVersion': 'Athena engine version 3'
        }
    }
}
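In the Athena API, reuse of cached query results is requested per query through `ResultReuseConfiguration` on `StartQueryExecution`. The sketch below only builds the request dictionary; the helper name `build_query_request` and the 60-minute default are assumptions.

```python
def build_query_request(sql: str, workgroup: str = "optimized",
                        max_age_minutes: int = 60) -> dict:
    """Build StartQueryExecution arguments that allow reuse of recent results."""
    if max_age_minutes < 0:
        raise ValueError("max_age_minutes must not be negative")
    return {
        "QueryString": sql,
        "WorkGroup": workgroup,
        "ResultReuseConfiguration": {
            "ResultReuseByAgeConfiguration": {
                "Enabled": True,
                "MaxAgeInMinutes": max_age_minutes,
            }
        },
    }
```

The dictionary can be passed as `athena.start_query_execution(**request)`; an identical query issued within the age window is then served from the earlier result instead of rescanning S3.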
Q25: How do you implement data lake best practices?
Answer:
Best Practices Checklist:
┌─────────────────────────────────────────────────────────────────┐
│                    Data Lake Best Practices                     │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Architecture                                                   │
│  ✓ Implement layered architecture (Bronze/Silver/Gold)          │
│  ✓ Use proper partitioning strategy                             │
│  ✓ Implement data quality gates                                 │
│  ✓ Design for scalability and cost optimization                 │
│                                                                 │
│  Security                                                       │
│  ✓ Implement least-privilege access                             │
│  ✓ Enable encryption at rest and in transit                     │
│  ✓ Use Lake Formation for fine-grained permissions              │
│  ✓ Enable audit logging                                         │
│                                                                 │
│  Operations                                                     │
│  ✓ Implement monitoring and alerting                            │
│  ✓ Automate data pipeline orchestration                         │
│  ✓ Implement backup and recovery                                │
│  ✓ Document data lineage and metadata                           │
│                                                                 │
│  Cost Optimization                                              │
│  ✓ Use appropriate storage classes                              │
│  ✓ Implement lifecycle policies                                 │
│  ✓ Optimize file sizes and formats                              │
│  ✓ Monitor and optimize query performance                       │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
Implementation Checklist:
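import boto3
class DataLakeBestPractices:
    def __init__(self, data_lake_bucket, log_bucket):
        self.bucket = data_lake_bucket
        # Assumption: access logs go to a separate, existing bucket
        self.log_bucket = log_bucket
        self.s3 = boto3.client('s3')
    def implement_best_practices(self):
        # 1. Enable versioning
        self.enable_versioning()
        # 2. Implement lifecycle policies
        self.implement_lifecycle()
        # 3. Enable encryption
        self.enable_encryption()
        # 4. Configure access logging
        self.enable_access_logging()
        # 5. Set up monitoring
        self.setup_monitoring()
    def enable_versioning(self):
        self.s3.put_bucket_versioning(
            Bucket=self.bucket,
            VersioningConfiguration={'Status': 'Enabled'}
        )
    def implement_lifecycle(self):
        lifecycle_config = {
            'Rules': [
                {
                    'ID': 'OptimizeStorage',
                    'Status': 'Enabled',
                    'Filter': {'Prefix': ''},
                    'Transitions': [
                        {'Days': 30, 'StorageClass': 'STANDARD_IA'},
                        {'Days': 90, 'StorageClass': 'GLACIER'}
                    ]
                }
            ]
        }
        self.s3.put_bucket_lifecycle_configuration(
            Bucket=self.bucket,
            LifecycleConfiguration=lifecycle_config
        )
    def enable_encryption(self):
        # Default new objects to SSE-KMS
        self.s3.put_bucket_encryption(
            Bucket=self.bucket,
            ServerSideEncryptionConfiguration={
                'Rules': [{
                    'ApplyServerSideEncryptionByDefault': {'SSEAlgorithm': 'aws:kms'}
                }]
            }
        )
    def enable_access_logging(self):
        # Deliver server access logs to the log bucket under a per-bucket prefix
        self.s3.put_bucket_logging(
            Bucket=self.bucket,
            BucketLoggingStatus={
                'LoggingEnabled': {
                    'TargetBucket': self.log_bucket,
                    'TargetPrefix': f'{self.bucket}/'
                }
            }
        )
    def setup_monitoring(self):
        # Publish S3 request metrics for the whole bucket to CloudWatch
        self.s3.put_bucket_metrics_configuration(
            Bucket=self.bucket,
            Id='EntireBucket',
            MetricsConfiguration={'Id': 'EntireBucket'}
        )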
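Once versioning is enabled, noncurrent object versions accumulate and keep incurring storage cost unless a lifecycle rule expires them. The sketch below builds a lifecycle configuration that keeps the transitions shown above and adds such a rule. The helper name `build_lifecycle_rules`, the rule ID `ExpireOldVersions`, and the 90-day and 7-day defaults are assumptions.

```python
def build_lifecycle_rules(noncurrent_days: int = 90, abort_upload_days: int = 7) -> dict:
    """Build an S3 lifecycle configuration for a versioned data lake bucket."""
    if noncurrent_days < 1 or abort_upload_days < 1:
        raise ValueError("day counts must be at least 1")
    return {
        "Rules": [
            {
                # Tier current versions down as they age
                "ID": "OptimizeStorage",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER"},
                ],
            },
            {
                # Remove superseded versions and abandoned multipart uploads
                "ID": "ExpireOldVersions",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},
                "NoncurrentVersionExpiration": {"NoncurrentDays": noncurrent_days},
                "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": abort_upload_days},
            },
        ]
    }
```

The result has the shape expected by the `LifecycleConfiguration` argument of `put_bucket_lifecycle_configuration`.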
Summary
Mastering AWS data lake architecture requires understanding:
- Architecture Design: Layered architecture, partitioning, file formats
- Security & Governance: Lake Formation, encryption, access control
- Cost Optimization: Storage classes, lifecycle policies, query optimization
- Operations: Monitoring, automation, disaster recovery
- Best Practices: Data quality, lineage, versioning
These concepts form the foundation for building scalable, secure, and cost-effective data lakes on AWS.