🏔️ AWS Data Lake Interview

Comprehensive interview preparation for data lake architecture, governance, and analytics on AWS.


Data Lake Architecture Overview

Architecture Diagram

┌─────────────────────────────────────────────────────────────────────┐
│                    AWS Data Lake Architecture                       │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  Data Sources          Ingestion          Storage                  │
│  ┌──────────┐         ┌──────────┐       ┌──────────┐             │
│  │ Databases│────────▶│   Glue   │──────▶│   S3     │             │
│  │ APIs     │         │  DMS     │       │  (Lake)  │             │
│  │ Files    │         │ Firehose │       │          │             │
│  └──────────┘         └──────────┘       └────┬─────┘             │
│                                                │                   │
│                     Governance                 │                   │
│                  ┌─────────────┐               │                   │
│                  │   Lake      │               │                   │
│                  │ Formation   │◀──────────────┘                   │
│                  └──────┬──────┘                                   │
│                         │                                          │
│              ┌──────────┼──────────┐                              │
│              ▼          ▼          ▼                              │
│        ┌─────────┐ ┌─────────┐ ┌─────────┐                      │
│        │ Athena  │ │ Redshift│ │ QuickSight│                     │
│        │ (Query) │ │ Spectrum│ │ (Viz)    │                      │
│        └─────────┘ └─────────┘ └─────────┘                      │
│                                                                   │
└─────────────────────────────────────────────────────────────────────┘

Q1: How do you design a multi-layer data lake architecture on AWS?

Answer:

Medallion Architecture (Bronze/Silver/Gold):


┌─────────────────────────────────────────────────────────────────┐
│              Multi-Layer Data Lake Architecture                 │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Bronze Layer (Raw)                                            │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ • Raw data as-is                                        │   │
│  │ • Immutable, append-only                                │   │
│  │ • Multiple formats (JSON, CSV, Parquet)                 │   │
│  │ • Partitioned by source/date                            │   │
│  └─────────────────────────────────────────────────────────┘   │
│                          │                                      │
│                          ▼                                      │
│  Silver Layer (Validated)                                      │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ • Cleaned, validated data                               │   │
│  │ • Schema enforced                                       │   │
│  │ • Deduplicated                                          │   │
│  │ • Standardized formats (Parquet/ORC)                    │   │
│  └─────────────────────────────────────────────────────────┘   │
│                          │                                      │
│                          ▼                                      │
│  Gold Layer (Business)                                         │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ • Business-level aggregates                             │   │
│  │ • Dimension/fact tables                                 │   │
│  │ • Optimized for analytics                               │   │
│  │ • Materialized views                                    │   │
│  └─────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────┘

S3 Folder Structure:


s3://data-lake/
├── bronze/
│   ├── source_a/
│   │   ├── year=2024/
│   │   │   ├── month=01/
│   │   │   │   ├── day=01/
│   │   │   │   │   └── data.json
├── silver/
│   ├── source_a/
│   │   ├── year=2024/
│   │   │   ├── month=01/
│   │   │   │   └── data.parquet
├── gold/
│   ├── dimensions/
│   ├── facts/
│   └── aggregates/

Implementation with Glue:

# Bronze to Silver transformation
def bronze_to_silver(glueContext, bronze_path, silver_path):
    # Read raw data
    dynamic_frame = glueContext.create_dynamic_frame.from_options(
        connection_type="s3",
        connection_options={"paths": [bronze_path]},
        format="json"
    )
    
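    # Resolve ambiguous column types by casting to the schema of the
    # catalog table (type reconciliation, not a full data quality check)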
    cleaned = dynamic_frame.resolveChoice(
        choice="match_catalog",
        database="datalake",
        table_name="silver_table"
    )
    
    # Write to silver layer
    glueContext.write_dynamic_frame.from_options(
        frame=cleaned,
        connection_type="s3",
        connection_options={"path": silver_path},
        format="parquet",
        format_options={"compression": "snappy"}
    )


Interview Tip: The Medallion Architecture (Bronze/Silver/Gold) is a widely accepted pattern. Always mention data quality gates between layers.

Q2: Compare S3 storage classes for data lake workloads.

Answer:

Storage Class Comparison:

Storage Class	Use Case	Storage Cost (per GB-month, approx.)	Access Time
S3 Standard	Frequently accessed	$0.023	Milliseconds
S3 Intelligent-Tiering	Unknown or changing access patterns	Variable (by tier)	Milliseconds
S3 Standard-IA	Infrequent access	$0.0125	Milliseconds
S3 One Zone-IA	Infrequent, re-creatable	$0.01	Milliseconds
S3 Glacier Instant Retrieval	Archive, millisecond access	$0.004	Milliseconds
S3 Glacier Flexible Retrieval	Archive, minutes-to-hours retrieval	$0.0036	Minutes to hours
S3 Glacier Deep Archive	Long-term archive	$0.00099	Hours

Lifecycle Policy:

{
    "Rules": [
        {
            "ID": "DataLakeLifecycle",
            "Filter": {
                "Prefix": "bronze/"
            },
            "Status": "Enabled",
            "Transitions": [
                {
                    "Days": 30,
                    "StorageClass": "STANDARD_IA"
                },
                {
                    "Days": 90,
                    "StorageClass": "GLACIER"
                },
                {
                    "Days": 365,
                    "StorageClass": "DEEP_ARCHIVE"
                }
            ]
        }
    ]
}
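
To reason about what the lifecycle rule above does to an object of a given age, the transitions can be replayed in plain Python. This is a simplified sketch: `storage_class_for_age` is a hypothetical helper, and it ignores details of the real service such as when transitions are actually executed and minimum object sizes.

```python
TRANSITIONS = [
    {"Days": 30, "StorageClass": "STANDARD_IA"},
    {"Days": 90, "StorageClass": "GLACIER"},
    {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},
]


def storage_class_for_age(age_days: int, transitions=TRANSITIONS, initial: str = "STANDARD") -> str:
    """Return the storage class an object of the given age is expected to be in."""
    current = initial
    # Apply transitions in ascending order; the last threshold reached wins
    for rule in sorted(transitions, key=lambda r: r["Days"]):
        if age_days >= rule["Days"]:
            current = rule["StorageClass"]
    return current


for age in (0, 45, 120, 400):
    print(age, storage_class_for_age(age))
```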

Cost Optimization Example:


┌─────────────────────────────────────────────────────────────────┐
│              Storage Cost Comparison (1TB for 1 year)          │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  S3 Standard:        $276/year                                 │
│  S3 Standard-IA:     $150/year (45% savings)                  │
│  S3 Intelligent:     $180/year (35% savings)                  │
│  S3 Glacier:         $43/year (84% savings)                   │
│  S3 Deep Archive:    $12/year (96% savings)                   │
│                                                                 │
│  Recommended Strategy:                                         │
│  Days 1-30: Standard ($23/month)                               │
│  Days 31-90: Standard-IA ($12.50/month)                        │
│  Days 91-365: Glacier ($3.60/month)                           │
│  After 365: Deep Archive ($0.99/month)                        │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
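
The figures in the box are per-class totals; the blended first-year cost of the recommended strategy can be worked out as below. This is a rough sketch under stated assumptions: it uses the per-GB prices from the table above, treats 1 TB as 1,000 GB, prorates by days, and ignores transition requests, retrieval fees, and minimum storage duration charges. The names `PRICE_PER_GB_MONTH` and `first_year_cost` are illustrative.

```python
PRICE_PER_GB_MONTH = {
    "STANDARD": 0.023,
    "STANDARD_IA": 0.0125,
    "GLACIER": 0.0036,
    "DEEP_ARCHIVE": 0.00099,
}

# (storage class, days spent in that class during the first year)
FIRST_YEAR_SCHEDULE = [("STANDARD", 30), ("STANDARD_IA", 60), ("GLACIER", 275)]


def first_year_cost(size_gb: float, schedule=FIRST_YEAR_SCHEDULE) -> float:
    """Prorate each class's yearly price by the days spent in that class."""
    total = 0.0
    for storage_class, days in schedule:
        yearly = PRICE_PER_GB_MONTH[storage_class] * size_gb * 12
        total += yearly * days / 365
    return total


tiered = first_year_cost(1000)
baseline = PRICE_PER_GB_MONTH["STANDARD"] * 1000 * 12
print(f"tiered: ${tiered:.2f}  all Standard: ${baseline:.2f}  savings: {1 - tiered / baseline:.0%}")
```

Under these assumptions the tiered strategy comes to roughly $80 for the first year against $276 for keeping everything in S3 Standard.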

Q3: How do you implement data cataloging with AWS Glue Data Catalog?

Answer:

Glue Data Catalog Components:

import boto3

glue = boto3.client('glue')

# Create database
glue.create_database(
    DatabaseInput={
        'Name': 'analytics_db',
        'Description': 'Analytics data lake database'
    }
)

# Create table
glue.create_table(
    DatabaseName='analytics_db',
    TableInput={
        'Name': 'sales_data',
        'Description': 'Sales transaction data',
        'StorageDescriptor': {
            'Columns': [
                {'Name': 'transaction_id', 'Type': 'string'},
                {'Name': 'amount', 'Type': 'decimal(10,2)'},
                {'Name': 'transaction_date', 'Type': 'date'}
            ],
            'Location': 's3://data-lake/silver/sales/',
            'InputFormat': 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat',
            'OutputFormat': 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat',
            'SerdeInfo': {
                'SerializationLibrary': 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
            }
        },
        'PartitionKeys': [
            {'Name': 'year', 'Type': 'int'},
            {'Name': 'month', 'Type': 'int'}
        ]
    }
)
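
A table definition alone does not make partitions queryable; each partition also has to be registered in the catalog. One common pattern is to copy the table's storage descriptor and change only the location. The sketch below is an assumption-laden illustration: `build_partition_input` is a hypothetical helper that only builds the request payload, and the sample descriptor is abbreviated.

```python
import copy


def build_partition_input(table_sd: dict, partition_keys: list, values: list) -> dict:
    """Build a PartitionInput payload by cloning the table's StorageDescriptor."""
    suffix = "/".join(f"{key}={value}" for key, value in zip(partition_keys, values))
    sd = copy.deepcopy(table_sd)  # leave the table's own descriptor untouched
    sd["Location"] = f"{table_sd['Location'].rstrip('/')}/{suffix}/"
    return {"Values": [str(v) for v in values], "StorageDescriptor": sd}


# Abbreviated descriptor, as it might come back from glue.get_table(...)
table_sd = {
    "Columns": [{"Name": "transaction_id", "Type": "string"}],
    "Location": "s3://data-lake/silver/sales/",
}
partition_input = build_partition_input(table_sd, ["year", "month"], ["2024", "01"])
print(partition_input["StorageDescriptor"]["Location"])
```

The resulting dictionary is shaped for the `PartitionInput` argument of `glue.create_partition(DatabaseName='analytics_db', TableName='sales_data', PartitionInput=...)`.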

Automatic Cataloging with Glue Crawlers:

# Create crawler
glue.create_crawler(
    Name='sales_crawler',
    Role='arn:aws:iam::123456789012:role/GlueCrawlerRole',  # placeholder account ID
    DatabaseName='analytics_db',
    Targets={
        'S3Targets': [
            {'Path': 's3://data-lake/silver/sales/'}
        ]
    },
    Schedule='cron(0 1 * * ? *)',  # Daily at 01:00 UTC
    SchemaChangePolicy={
        'UpdateBehavior': 'UPDATE_IN_DATABASE',
        'DeleteBehavior': 'LOG'
    }
)

# Start crawler
glue.start_crawler(Name='sales_crawler')

Catalog Architecture:


┌─────────────────────────────────────────────────────────────────┐
│              Glue Data Catalog Architecture                     │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │                   Data Catalog                           │   │
│  │ ┌─────────────┬─────────────┬─────────────┬───────────┐ │   │
│  │ │  Databases  │   Tables    │  Partitions │  Columns  │ │   │
│  │ │             │             │             │           │ │   │
│  │ │ analytics_db│ sales_data  │ year=2024   │ id, amount│ │   │
│  │ │             │ customers   │ month=01    │ name, email│ │   │
│  │ │ warehouse_db│ orders      │ region=us   │ order_id  │ │   │
│  │ └─────────────┴─────────────┴─────────────┴───────────┘ │   │
│  └─────────────────────────────────────────────────────────┘   │
│                          │                                      │
│         ┌────────────────┼────────────────┐                    │
│         ▼                ▼                ▼                    │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐           │
│  │   Athena    │  │ Redshift    │  │    EMR      │           │
│  │   Query     │  │ Spectrum    │  │  Spark SQL  │           │
│  └─────────────┘  └─────────────┘  └─────────────┘           │
└─────────────────────────────────────────────────────────────────┘

Q4: How do you implement data lake security and access control?

Answer:

Security Layers:


┌─────────────────────────────────────────────────────────────────┐
│              Data Lake Security Architecture                    │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Network Security                                              │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ • VPC endpoints for S3                                  │   │
│  │ • S3 bucket policies                                    │   │
│  │ • Security groups                                       │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
│  Identity & Access                                             │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ • IAM roles/policies                                    │   │
│  │ • Lake Formation permissions                            │   │
│  │ • S3 bucket policies                                    │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
│  Data Protection                                               │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ • Encryption at rest (SSE-S3, SSE-KMS, SSE-C)         │   │
│  │ • Encryption in transit (TLS)                           │   │
│  │ • Column-level encryption                               │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
│  Governance                                                    │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ • Lake Formation grants                                 │   │
│  │ • CloudTrail auditing                                   │   │
│  │ • Data classification                                   │   │
│  └─────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────┘

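IAM Policy for Data Lake Access (the s3:prefix condition key applies only to s3:ListBucket, so object reads are scoped through the resource ARN instead):

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ListDepartmentPrefixes",
            "Effect": "Allow",
            "Action": "s3:ListBucket",
            "Resource": "arn:aws:s3:::data-lake",
            "Condition": {
                "StringLike": {
                    "s3:prefix": [
                        "bronze/${aws:PrincipalTag/department}/*",
                        "silver/${aws:PrincipalTag/department}/*"
                    ]
                }
            }
        },
        {
            "Sid": "ReadDepartmentObjects",
            "Effect": "Allow",
            "Action": "s3:GetObject",
            "Resource": [
                "arn:aws:s3:::data-lake/bronze/${aws:PrincipalTag/department}/*",
                "arn:aws:s3:::data-lake/silver/${aws:PrincipalTag/department}/*"
            ]
        }
    ]
}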
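
To see how the `${aws:PrincipalTag/department}` variable scopes access, the substitution and wildcard match can be imitated locally. This is only a simplified emulation for building intuition, not IAM's evaluation engine: `resolve_patterns` and `key_allowed` are hypothetical names, and `fnmatchcase` also understands `[...]` sets, which IAM's `StringLike` does not.

```python
from fnmatch import fnmatchcase

PREFIX_PATTERNS = [
    "bronze/${aws:PrincipalTag/department}/*",
    "silver/${aws:PrincipalTag/department}/*",
]


def resolve_patterns(patterns, principal_tags):
    """Substitute principal tag values into the policy variable placeholders."""
    resolved = []
    for pattern in patterns:
        for tag, value in principal_tags.items():
            pattern = pattern.replace("${aws:PrincipalTag/" + tag + "}", value)
        resolved.append(pattern)
    return resolved


def key_allowed(key, principal_tags, patterns=PREFIX_PATTERNS):
    """Return True if the key matches any resolved pattern (case-sensitive)."""
    return any(fnmatchcase(key, p) for p in resolve_patterns(patterns, principal_tags))


finance = {"department": "finance"}
print(key_allowed("bronze/finance/year=2024/data.json", finance))
print(key_allowed("bronze/hr/year=2024/data.json", finance))
```

A principal without the `department` tag matches nothing in this sketch, because the placeholder stays unresolved.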

Lake Formation Permissions:

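import boto3

lakeformation = boto3.client('lakeformation')

# Grant table-level permissions without the ability to re-grant them.
# grant_permissions has no GrantOption flag; re-grant rights are passed
# via PermissionsWithGrantOption. The account ID below is a placeholder.
lakeformation.grant_permissions(
    Principal={'DataLakePrincipalIdentifier': 'arn:aws:iam::123456789012:role/analyst-role'},
    Resource={
        'Table': {
            'DatabaseName': 'analytics_db',
            'Name': 'sales_data'
        }
    },
    Permissions=['SELECT', 'DESCRIBE']
)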

# Grant column-level permissions
lakeformation.grant_permissions(
    Principal={'DataLakePrincipalIdentifier': 'arn:aws:iam::123456789012:role/hr-role'},  # placeholder account ID
    Resource={
        'TableWithColumns': {
            'DatabaseName': 'analytics_db',
            'Name': 'employees',
            'ColumnNames': ['employee_id', 'name', 'department']
        }
    },
    Permissions=['SELECT']
)

Q5: How do you optimize query performance in Athena for data lakes?

Answer:

Optimization Techniques:

1. Partitioning:

-- Create partitioned table
CREATE EXTERNAL TABLE sales (
    transaction_id STRING,
    amount DECIMAL(10,2),
    customer_id STRING
)
PARTITIONED BY (year INT, month INT, day INT)
STORED AS PARQUET
LOCATION 's3://data-lake/silver/sales/';

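-- Load partitions that already exist under the table location
-- (needed before the first query when paths use the key=value layout)
MSCK REPAIR TABLE sales;

-- Query with partition pruning
SELECT * FROM sales
WHERE year = 2024 AND month = 1 AND day = 15;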

2. File Format Optimization:

# Convert to optimized Parquet
df.write \
    .partitionBy("year", "month", "day") \
    .option("compression", "snappy") \
    .parquet("s3://data-lake/optimized/sales/")

3. Bucketing for Joins:

-- Bucketed table for efficient joins and lookups on customer_id;
-- the files under LOCATION must already be written into 32 buckets
CREATE EXTERNAL TABLE orders (
    order_id STRING,
    customer_id STRING,
    amount DECIMAL(10,2)
)
CLUSTERED BY (customer_id) INTO 32 BUCKETS
STORED AS PARQUET
LOCATION 's3://data-lake/silver/orders/';

Performance Comparison:


┌─────────────────────────────────────────────────────────────────┐
│              Athena Query Performance Optimization              │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Scenario: 1TB dataset, query 1 day of data                   │
│                                                                 │
│  Without Optimization:                                         │
│  • Full scan: 1,000 GB scanned                                 │
│  • Cost: $5.00 per query                                       │
│  • Time: ~5 minutes                                            │
│                                                                 │
│  With Partitioning:                                            │
│  • Partition pruned: 2.7 GB scanned                            │
│  • Cost: $0.014 per query                                      │
│  • Time: ~10 seconds                                           │
│                                                                 │
│  With Columnar Format:                                         │
│  • Column projection: 0.8 GB scanned                           │
│  • Cost: $0.004 per query                                      │
│  • Time: ~3 seconds                                            │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
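
The cost lines in the comparison follow directly from the amount of data scanned. The sketch below reproduces that arithmetic under the same assumptions as the box ($5 per TB scanned, 1 TB treated as 1,000 GB) and adds Athena's 10 MB per-query minimum. `athena_query_cost` is an illustrative helper, not an AWS API.

```python
ATHENA_PRICE_PER_TB = 5.00  # USD per TB scanned, as used in the comparison
MIN_BILLED_MB = 10          # each query is billed for at least 10 MB


def athena_query_cost(scanned_gb: float, price_per_tb: float = ATHENA_PRICE_PER_TB) -> float:
    """Estimate the cost of one query from the data it scans (decimal units)."""
    billed_gb = max(scanned_gb, MIN_BILLED_MB / 1000)
    return billed_gb / 1000 * price_per_tb


scenarios = [
    ("full scan", 1000),
    ("partition pruned", 2.7),
    ("pruned + column projection", 0.8),
]
for label, gb in scenarios:
    print(f"{label}: {gb} GB scanned -> ${athena_query_cost(gb):.4f}")
```

The 2.7 GB figure is simply 1,000 GB spread evenly over 365 daily partitions; real partitions are rarely that uniform, so measured scan sizes should replace these estimates.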

Q6: Design a data lake for real-time and batch analytics.

Answer:

Hybrid Architecture:


┌─────────────────────────────────────────────────────────────────┐
│              Real-Time + Batch Data Lake Architecture           │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Real-Time Path                                                │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐        │
│  │   Kinesis   │───▶│   Lambda    │───▶│   S3        │        │
│  │   Streams   │    │  Transform  │    │  (Hot)      │        │
│  └─────────────┘    └─────────────┘    └──────┬──────┘        │
│                                               │                 │
│  Batch Path                                  │                 │
│  ┌─────────────┐    ┌─────────────┐          │                 │
│  │   Glue      │───▶│   EMR       │──────────┤                 │
│  │   Crawler   │    │  Spark      │          │                 │
│  └─────────────┘    └─────────────┘          │                 │
│                                               ▼                 │
│                                        ┌─────────────┐         │
│                                        │  Unified    │         │
│                                        │  Data Lake  │         │
│                                        │  (S3)       │         │
│                                        └──────┬──────┘         │
│                                               │                 │
│                              ┌────────────────┼────────────┐   │
│                              ▼                ▼            ▼   │
│                       ┌─────────┐      ┌─────────┐  ┌────────┐│
│                       │ Athena  │      │Redshift │  │QuickSight│
│                       │(Ad-hoc) │      │Spectrum │  │(Dash)  ││
│                       └─────────┘      └─────────┘  └────────┘│
└─────────────────────────────────────────────────────────────────┘

Lambda Real-Time Ingestion:

import base64
import json
from datetime import datetime, timezone

import boto3

s3 = boto3.client('s3')

def lambda_handler(event, context):
    for record in event['Records']:
        # Kinesis delivers the record payload base64-encoded
        data = json.loads(base64.b64decode(record['kinesis']['data']))

        # Capture the time once (in UTC) so every partition field agrees
        now = datetime.now(timezone.utc)

        # Add metadata
        enriched = {
            **data,
            '_ingestion_time': now.isoformat(),
            '_source': 'real-time',
            '_year': now.year,
            '_month': now.month,
            '_day': now.day
        }

        # Write to S3 with zero-padded, Hive-style partitioning
        key = (
            f"bronze/real-time/year={now:%Y}/month={now:%m}/day={now:%d}/"
            f"{record['kinesis']['sequenceNumber']}.json"
        )

        s3.put_object(
            Bucket='data-lake-bucket',
            Key=key,
            Body=json.dumps(enriched)
        )
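
Because Kinesis hands Lambda each payload as a base64 string, the decoding and key-building steps are worth testing locally before deployment. The sketch below does that without any AWS calls; the helper names, the sample payload, and the sequence number are all made up for illustration.

```python
import base64
import json
from datetime import datetime, timezone


def decode_kinesis_record(record: dict) -> dict:
    """Decode the base64 payload of one Kinesis record into a dict."""
    return json.loads(base64.b64decode(record["kinesis"]["data"]))


def build_key(record: dict, now: datetime) -> str:
    """Build a zero-padded, Hive-style S3 key for the record."""
    seq = record["kinesis"]["sequenceNumber"]
    return f"bronze/real-time/year={now:%Y}/month={now:%m}/day={now:%d}/{seq}.json"


# Hypothetical record shaped like the entries in event['Records']
payload = {"order_id": "o-1", "amount": 42.5}
sample_record = {
    "kinesis": {
        "sequenceNumber": "4959033827",
        "data": base64.b64encode(json.dumps(payload).encode()).decode(),
    }
}

print(decode_kinesis_record(sample_record))
print(build_key(sample_record, datetime(2024, 1, 15, tzinfo=timezone.utc)))
```

Writing one object per record also produces many small files; in practice records are often buffered (for example with Firehose) before landing in S3.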

Q7: How do you implement data quality in a data lake?

Answer:

Data Quality Framework:


┌─────────────────────────────────────────────────────────────────┐
│              Data Lake Quality Framework                        │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Quality Dimensions                                            │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ • Completeness - All expected data present             │   │
│  │ • Accuracy - Data matches real-world entities          │   │
│  │ • Consistency - No contradictions across datasets      │   │
│  │ • Timeliness - Data available when needed              │   │
│  │ • Validity - Data conforms to defined formats/rules    │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
│  Implementation Layers                                         │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ Ingestion → Validation → Processing → Publishing       │   │
│  │     ↓            ↓            ↓            ↓            │   │
│  │  Schema     Rules Engine   Checks     SLA Monitor      │   │
│  │  Check                    Aggregate                    │   │
│  └─────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────┘

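Data Quality Rules (custom PySpark engine, for example inside a Glue job):

from pyspark.sql import DataFrame, SparkSession
from pyspark.sql.functions import col

# Session used by the referential integrity rule below
spark = SparkSession.builder.getOrCreate()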

class DataQualityEngine:
    def __init__(self):
        self.rules = {}
    
    def add_rule(self, name, rule_func, severity='ERROR'):
        self.rules[name] = {
            'func': rule_func,
            'severity': severity
        }
    
    def validate(self, df: DataFrame) -> dict:
        results = {}
        
        for name, rule in self.rules.items():
            try:
                passed = rule['func'](df)
                results[name] = {
                    'passed': passed,
                    'severity': rule['severity']
                }
            except Exception as e:
                results[name] = {
                    'passed': False,
                    'error': str(e),
                    'severity': rule['severity']
                }
        
        return results

# Usage
engine = DataQualityEngine()

# Add rules
engine.add_rule(
    'no_nulls',
    lambda df: df.filter(col('customer_id').isNull()).count() == 0
)

engine.add_rule(
    'valid_amounts',
    lambda df: df.filter(col('amount') < 0).count() == 0
)

engine.add_rule(
    'referential_integrity',
    lambda df: df.join(
        spark.read.table('customers'),
        'customer_id',
        'left_anti'
    ).count() == 0
)
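
The results returned by `validate` still need to be turned into a decision about whether a batch may move to the next layer. The sketch below shows one possible gate that works on the result dictionary shape used above; `failed_rules`, `enforce_quality_gate`, and the sample results are hypothetical, and it assumes that `ERROR` rules block promotion while `WARNING` rules are only reported.

```python
def failed_rules(results: dict, severity: str) -> list:
    """Names of rules with the given severity that did not pass."""
    return sorted(
        name for name, outcome in results.items()
        if not outcome["passed"] and outcome["severity"] == severity
    )


def enforce_quality_gate(results: dict) -> list:
    """Raise if any ERROR rule failed; otherwise return failed WARNING rules."""
    blocking = failed_rules(results, "ERROR")
    if blocking:
        raise ValueError(f"Quality gate failed: {', '.join(blocking)}")
    return failed_rules(results, "WARNING")


# Sample output in the shape produced by DataQualityEngine.validate
sample_results = {
    "no_nulls": {"passed": True, "severity": "ERROR"},
    "valid_amounts": {"passed": True, "severity": "ERROR"},
    "late_arrivals": {"passed": False, "severity": "WARNING"},
}
print(enforce_quality_gate(sample_results))
```

Raising an exception makes the job run fail, which stops an orchestrator from writing the batch to the silver or gold layer.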

Q8: How do you implement data lineage in a data lake?

Answer:

Lineage Tracking Architecture:


┌─────────────────────────────────────────────────────────────────┐
│              Data Lake Lineage Architecture                     │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │                   Lineage Metadata                      │   │
│  │ ┌─────────────┬─────────────┬─────────────┬───────────┐ │   │
│  │ │   Source    │ Transformation│   Target   │  Lineage  │ │   │
│  │ │   Tables    │   Details    │   Tables   │  Graph    │ │   │
│  │ └─────────────┴─────────────┴─────────────┴───────────┘ │   │
│  └─────────────────────────────────────────────────────────┘   │
│                          │                                      │
│         ┌────────────────┼────────────────┐                    │
│         ▼                ▼                ▼                    │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐           │
│  │   Glue      │  │  CloudTrail │  │  Custom     │           │
│  │  Catalog    │  │  API Logs   │  │  Metadata   │           │
│  └─────────────┘  └─────────────┘  └─────────────┘           │
└─────────────────────────────────────────────────────────────────┘

Lineage Implementation:

import boto3
from datetime import datetime, timezone

class LineageTracker:
    def __init__(self):
        self.lineage_store = []

    def track_transformation(self, job_name, job_run_id, inputs, outputs, transformation_sql):
        # job_run_id is supplied by the caller (for example, the Glue job run ID)
        lineage_record = {
            'job_name': job_name,
            'inputs': inputs,
            'outputs': outputs,
            'transformation': transformation_sql,
            'timestamp': datetime.now(timezone.utc).isoformat(),
            'job_run_id': job_run_id
        }

        self.lineage_store.append(lineage_record)

        # Store in DynamoDB
        dynamodb = boto3.resource('dynamodb')
        table = dynamodb.Table('data_lineage')
        table.put_item(Item=lineage_record)

    def get_upstream(self, table_name, _seen=None):
        # Get all upstream dependencies, following inputs recursively
        seen = set() if _seen is None else _seen
        upstream = []
        for record in self.lineage_store:
            if table_name in record['outputs']:
                upstream.append(record)
                for source in record['inputs']:
                    if source not in seen:  # guard against cycles and repeats
                        seen.add(source)
                        upstream.extend(self.get_upstream(source, seen))
        return upstream

    def get_downstream(self, table_name, _seen=None):
        # Get all downstream dependencies, following outputs recursively
        seen = set() if _seen is None else _seen
        downstream = []
        for record in self.lineage_store:
            if table_name in record['inputs']:
                downstream.append(record)
                for target in record['outputs']:
                    if target not in seen:  # guard against cycles and repeats
                        seen.add(target)
                        downstream.extend(self.get_downstream(target, seen))
        return downstream
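
For impact analysis it is often enough to know which tables feed a given table, rather than the full job records. The sketch below is a minimal, in-memory illustration using the same record shape (`inputs` and `outputs` lists); `upstream_tables` and the sample records are hypothetical, and a breadth-first search with a visited set keeps it safe on cyclic lineage.

```python
from collections import deque


def upstream_tables(records: list, table_name: str) -> set:
    """Return every table that directly or indirectly feeds table_name."""
    # Map each output table to the set of tables it was built from
    producers = {}
    for record in records:
        for output in record["outputs"]:
            producers.setdefault(output, set()).update(record["inputs"])

    seen, queue = set(), deque([table_name])
    while queue:
        current = queue.popleft()
        for source in producers.get(current, ()):
            if source not in seen:  # visit each table at most once
                seen.add(source)
                queue.append(source)
    return seen


records = [
    {"job_name": "clean_sales", "inputs": ["bronze.sales"], "outputs": ["silver.sales"]},
    {"job_name": "daily_agg", "inputs": ["silver.sales", "silver.customers"], "outputs": ["gold.daily_sales"]},
]
print(sorted(upstream_tables(records, "gold.daily_sales")))
```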

Q9: How do you handle schema evolution in a data lake?

Answer:

Schema Evolution Strategies:

1. Schema Registry with Glue:

import json
import boto3

glue = boto3.client('glue')

# Register an Avro schema. The Schema Registry accepts AVRO, JSON, and
# PROTOBUF as DataFormat; PARQUET is not a valid value, and the
# definition below is an Avro record.
glue.create_schema(
    RegistryId={'RegistryName': 'my-registry'},
    SchemaName='sales-schema',
    DataFormat='AVRO',
    Compatibility='BACKWARD',
    SchemaDefinition=json.dumps({
        'type': 'record',
        'name': 'Sales',
        'fields': [
            {'name': 'id', 'type': 'string'},
            {'name': 'amount', 'type': 'double'}
        ]
    })
)

2. Spark Schema Evolution:

# Read with schema evolution
df = spark.read \
    .option("mergeSchema", "true") \
    .parquet("s3://data-lake/sales/")

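# Append new data. The plain Parquet writer has no mergeSchema option
# (that is a Delta Lake write option); files with differing schemas are
# reconciled at read time by the mergeSchema read option above.
df.write \
    .mode("append") \
    .parquet("s3://data-lake/sales/")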

3. Schema Evolution Patterns:


┌─────────────────────────────────────────────────────────────────┐
│              Schema Evolution Patterns                          │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Adding columns (backward compatible)                          │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ Old Schema: [id, name, amount]                         │   │
│  │ New Schema: [id, name, amount, category]               │   │
│  │ Strategy: Add as nullable or with a default value      │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
│  Removing columns (backward compatible)                        │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ Old Schema: [id, name, amount, category]               │   │
│  │ New Schema: [id, name, amount]                         │   │
│  │ Strategy: Column still readable, just not written      │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
│  Renaming columns (breaking change)                            │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ Old Schema: [id, cust_name, amt]                       │   │
│  │ New Schema: [id, customer_name, amount]                │   │
│  │ Strategy: Use aliases, maintain both names             │   │
│  └─────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────┘
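
Whether a schema change is safe can be checked mechanically before it reaches production. The sketch below tests one direction only, backward compatibility (readers on the new schema can still read data written with the old one). It is a deliberately simplified model: schemas are plain dicts, `backward_compatibility_problems` is a hypothetical helper, and real Avro resolution additionally allows type promotions and aliases.

```python
def backward_compatibility_problems(old_fields: dict, new_fields: dict) -> list:
    """List reasons the new schema cannot read data written with the old one."""
    problems = []
    for name, spec in new_fields.items():
        if name not in old_fields:
            # Old data lacks this field, so the reader needs a default
            if "default" not in spec:
                problems.append(f"added field '{name}' has no default")
        elif old_fields[name]["type"] != spec["type"]:
            problems.append(f"field '{name}' changed type")
    # Fields dropped from the new schema are simply ignored when reading
    return problems


old = {"id": {"type": "string"}, "name": {"type": "string"}, "amount": {"type": "double"}}
added = {**old, "category": {"type": "string", "default": None}}
renamed = {"id": {"type": "string"}, "customer_name": {"type": "string"}, "amount": {"type": "double"}}

print(backward_compatibility_problems(old, added))
print(backward_compatibility_problems(old, renamed))
```

In this model a rename looks like a removal plus an addition without a default, which is why it is flagged unless both names are kept during a migration.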

Q10: Design a multi-account data lake architecture.

Answer:

Multi-Account Architecture:


┌─────────────────────────────────────────────────────────────────┐
│              Multi-Account Data Lake Architecture               │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Data Account          Analytics Account     Consumption       │
│  ┌─────────────┐       ┌─────────────┐      ┌─────────────┐   │
│  │   S3        │──────▶│  EMR/Glue   │─────▶│  Athena     │   │
│  │  (Raw)      │  Peering│            │      │  Redshift   │   │
│  └─────────────┘       └─────────────┘      └─────────────┘   │
│         │                     │                     │          │
│         │     ┌───────────────┼───────────────┐    │          │
│         │     │               │               │    │          │
│         ▼     ▼               ▼               ▼    ▼          │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │              AWS Organizations                          │   │
│  │  ┌─────────────┬─────────────┬─────────────┬──────────┐ │   │
│  │  │  Security   │  Logging    │  Audit      │  Billing │ │   │
│  │  │  Account    │  Account    │  Account    │  Account │ │   │
│  │  └─────────────┴─────────────┴─────────────┴──────────┘ │   │
│  └─────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────┘

Cross-Account Access (S3 bucket policy on the shared bucket; replace analytics-account with the 12-digit account ID):

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::analytics-account:role/DataAnalyst"
            },
            "Action": [
                "s3:GetObject",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::shared-data-lake",
                "arn:aws:s3:::shared-data-lake/*"
            ],
            "Condition": {
                "StringEquals": {
                    "aws:PrincipalOrgID": "o-xxxxxxxxxx"
                }
            }
        }
    ]
}

Q11: How do you implement data lake governance with Lake Formation?

Answer:

Lake Formation Components:

import boto3

lakeformation = boto3.client('lakeformation')

# Register location (account IDs in the ARNs below are placeholders)
lakeformation.register_resource(
    ResourceArn='arn:aws:s3:::data-lake-bucket',
    RoleArn='arn:aws:iam::123456789012:role/LakeFormationRole',
    UseServiceLinkedRole=False
)

# Grant database permissions
lakeformation.grant_permissions(
    Principal={'DataLakePrincipalIdentifier': 'arn:aws:iam::123456789012:role/data-engineer'},
    Resource={
        'Database': {
            'Name': 'analytics_db'
        }
    },
    Permissions=['CREATE_TABLE', 'ALTER', 'DROP']
)

# Grant table permissions with grant option.
# Lake Formation has no UPDATE permission, and the right to re-grant is
# expressed through PermissionsWithGrantOption rather than a boolean flag.
lakeformation.grant_permissions(
    Principal={'DataLakePrincipalIdentifier': 'arn:aws:iam::123456789012:role/data-steward'},
    Resource={
        'Table': {
            'DatabaseName': 'analytics_db',
            'Name': 'sales_data'
        }
    },
    Permissions=['SELECT', 'INSERT', 'DELETE'],
    PermissionsWithGrantOption=['SELECT', 'INSERT', 'DELETE']
)

Governance Architecture:


┌─────────────────────────────────────────────────────────────────┐
│              Lake Formation Governance                          │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │                   Admin Layer                           │   │
│  │  • Register storage locations                           │   │
│  │  • Manage data lake administrators                      │   │
│  │  • Configure cross-account permissions                  │   │
│  └─────────────────────────────────────────────────────────┘   │
│                          │                                      │
│                          ▼                                      │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │                   Permission Layer                      │   │
│  │  • Database-level permissions                           │   │
│  │  • Table-level permissions                              │   │
│  │  • Column-level permissions                             │   │
│  │  • Row-level permissions                                │   │
│  └─────────────────────────────────────────────────────────┘   │
│                          │                                      │
│                          ▼                                      │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │                   Audit Layer                           │   │
│  │  • CloudTrail integration                               │   │
│  │  • Permission grants/revoke history                     │   │
│  │  • Data access logs                                     │   │
│  └─────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────┘
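
The permission layer above lists column-level permissions. As a minimal sketch, a column-level grant request could be assembled as follows. The helper name `column_grant_request`, the role ARN, and the column names are hypothetical; the `TableWithColumns` resource shape is the one accepted by the boto3 Lake Formation `grant_permissions` call. The helper only builds the keyword arguments and does not call AWS.

```python
def column_grant_request(principal_arn, database, table, columns=None, excluded=None):
    """Build kwargs for lakeformation.grant_permissions with column-level SELECT.

    Pass either `columns` (allow-list) or `excluded` (all columns except these).
    """
    if bool(columns) == bool(excluded):
        raise ValueError("Provide exactly one of columns or excluded")

    resource = {"DatabaseName": database, "Name": table}
    if columns:
        resource["ColumnNames"] = list(columns)
    else:
        resource["ColumnWildcard"] = {"ExcludedColumnNames": list(excluded)}

    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {"TableWithColumns": resource},
        "Permissions": ["SELECT"],
    }


# Hypothetical analyst role that may read everything except two PII columns
request = column_grant_request(
    "arn:aws:iam::123456789012:role/data-analyst",
    "analytics_db",
    "sales_data",
    excluded=["email", "ssn"],
)
# boto3.client("lakeformation").grant_permissions(**request) would apply it
```

Building the request separately from the API call keeps the grant reviewable and easy to unit test before it touches a real catalog.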

Q12: How do you optimize costs in a data lake?

Answer:

Cost Optimization Strategies:

1. Storage Optimization:

# S3 Lifecycle Policies
s3 = boto3.client('s3')

lifecycle_config = {
    'Rules': [
        {
            'ID': 'OptimizeStorage',
            'Status': 'Enabled',
            'Filter': {'Prefix': 'bronze/'},
            'Transitions': [
                {'Days': 30, 'StorageClass': 'STANDARD_IA'},
                {'Days': 90, 'StorageClass': 'GLACIER'},
                {'Days': 365, 'StorageClass': 'DEEP_ARCHIVE'}
            ]
        }
    ]
}

s3.put_bucket_lifecycle_configuration(
    Bucket='data-lake-bucket',
    LifecycleConfiguration=lifecycle_config
)
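
Before applying a lifecycle configuration like the one above, it can help to sanity-check the transitions locally. The sketch below is a hypothetical helper (`validate_transitions` is not an AWS API). It checks the S3 rule that transitions to STANDARD_IA or ONEZONE_IA need at least 30 days, and it assumes, as a convention of this sketch only, that transitions are listed from warmest to coldest class with increasing day counts.

```python
# Storage classes that S3 only accepts in a transition after at least 30 days
MIN_30_DAY_CLASSES = {"STANDARD_IA", "ONEZONE_IA"}


def validate_transitions(transitions):
    """Return a list of problems found in a lifecycle rule's Transitions list."""
    problems = []
    previous_days = -1
    for transition in transitions:
        days = transition["Days"]
        storage_class = transition["StorageClass"]
        if storage_class in MIN_30_DAY_CLASSES and days < 30:
            problems.append(f"{storage_class} needs Days >= 30, got {days}")
        if days <= previous_days:
            problems.append(
                f"{storage_class} at day {days} is not later than the previous transition"
            )
        previous_days = days
    return problems


rule_transitions = [
    {"Days": 30, "StorageClass": "STANDARD_IA"},
    {"Days": 90, "StorageClass": "GLACIER"},
    {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},
]
problems = validate_transitions(rule_transitions)  # empty list for this rule
```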

2. Query Optimization:

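# Optimize file sizes for Athena
# Target: 128MB - 1GB per file
import math

def optimize_file_sizes(input_path, output_path, target_size_mb=256):
    # Assumes `spark` is an active SparkSession
    df = spark.read.parquet(input_path)
    
    # Estimate the data size from the on-disk size of the input files
    hadoop_conf = spark._jsc.hadoopConfiguration()
    path = spark._jvm.org.apache.hadoop.fs.Path(input_path)
    total_size = path.getFileSystem(hadoop_conf).getContentSummary(path).getLength()
    
    # Calculate optimal number of partitions
    num_partitions = max(1, math.ceil(total_size / (target_size_mb * 1024 * 1024)))
    
    df.repartition(num_partitions) \
      .write \
      .mode('overwrite') \
      .parquet(output_path)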
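
The partition-count arithmetic used for compaction can be checked without a Spark cluster. This is a small sketch of that calculation only; `target_partition_count` is a hypothetical helper name and the 256 MB default mirrors the target size used above.

```python
import math


def target_partition_count(total_bytes, target_size_mb=256):
    """Number of output files needed so each file is roughly target_size_mb."""
    target_bytes = target_size_mb * 1024 * 1024
    return max(1, math.ceil(total_bytes / target_bytes))


ten_gib = 10 * 1024**3
files_needed = target_partition_count(ten_gib)  # 10 GiB / 256 MiB = 40 files
```

Rounding up rather than down keeps each file at or below the target size, and the `max(1, ...)` guard covers very small datasets.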

3. Cost Monitoring:

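# CloudWatch billing alarm (a fixed threshold on estimated charges)
# Billing metrics are published only in us-east-1 and need the Currency dimension
cloudwatch = boto3.client('cloudwatch', region_name='us-east-1')

cloudwatch.put_metric_alarm(
    AlarmName='DataLakeCostAnomaly',
    MetricName='EstimatedCharges',
    Namespace='AWS/Billing',
    Dimensions=[{'Name': 'Currency', 'Value': 'USD'}],
    Statistic='Maximum',
    Period=86400,
    EvaluationPeriods=1,
    Threshold=1000,
    ComparisonOperator='GreaterThanThreshold',
    AlarmActions=['arn:aws:sns:us-east-1:123456789012:cost-alerts']  # placeholder account ID
)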

Cost Breakdown:

┌─────────────────────────────────────────────────────────────────┐
│              Data Lake Cost Optimization Impact                 │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Storage:                                                      │
│  • Standard → IA: 40-50% savings                              │
│  • Lifecycle policies: 60-80% savings                         │
│  • Compression (Snappy): 60-70% size reduction                │
│                                                                 │
│  Queries:                                                      │
│  • Partitioning: 90%+ cost reduction                          │
│  • File optimization: 50%+ faster queries                     │
│  • Result caching: 80%+ faster repeated queries               │
│                                                                 │
│  Processing:                                                   │
│  • Spot instances: 60-70% savings                             │
│  • Auto-scaling: 30-40% savings                               │
│  • Serverless (Glue): No idle costs                           │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
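
The partitioning figure above follows from simple arithmetic: a query that reads only some partitions scans proportionally less data. The sketch below assumes data is spread evenly across partitions. The function names are hypothetical, and `price_per_tb` is a parameter because per-TB query pricing varies by region and over time; the value used in the example is a placeholder, not a quoted price.

```python
def scanned_bytes(total_bytes, partitions_total, partitions_read):
    """Bytes scanned when data is evenly spread and only some partitions are read."""
    if partitions_total <= 0 or not 0 <= partitions_read <= partitions_total:
        raise ValueError("partitions_read must be between 0 and partitions_total")
    return total_bytes * partitions_read / partitions_total


def query_cost(bytes_scanned, price_per_tb):
    """Cost of a scan at a given price per TiB scanned."""
    return bytes_scanned / 1024**4 * price_per_tb


def savings_percent(full_bytes, pruned_bytes):
    """Percentage reduction in bytes scanned thanks to partition pruning."""
    return (1 - pruned_bytes / full_bytes) * 100


one_tib = 1024**4
# One year of daily partitions, query restricted to the last 30 days
pruned = scanned_bytes(one_tib, partitions_total=365, partitions_read=30)
saving = savings_percent(one_tib, pruned)
```

With 365 daily partitions and a 30-day filter, the scan drops by roughly 92 percent, which is where "90%+" style figures come from.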

Q13: How do you implement data lake disaster recovery?

Answer:

DR Architecture:

Architecture Diagram

┌─────────────────────────────────────────────────────────────────┐
│              Data Lake Disaster Recovery                        │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Primary Region (us-east-1)      DR Region (us-west-2)        │
│  ┌─────────────────────┐        ┌─────────────────────┐       │
│  │ S3 Primary Bucket   │───────▶│ S3 Replica Bucket   │       │
│  │ (Versioning Enabled)│  CRR   │                     │       │
│  └─────────────────────┘        └─────────────────────┘       │
│                                                                 │
│  ┌─────────────────────┐        ┌─────────────────────┐       │
│  │ Glue Catalog        │───────▶│ Glue Catalog        │       │
│  │ (Primary)           │ Cross  │ (Replica)           │       │
│  └─────────────────────┘ Region └─────────────────────┘       │
│                                                                 │
│  ┌─────────────────────┐        ┌─────────────────────┐       │
│  │ Lake Formation      │───────▶│ Lake Formation      │       │
│  │ Permissions         │ Backup │ Permissions         │       │
│  └─────────────────────┘        └─────────────────────┘       │
│                                                                 │
│  RPO: 15 minutes (CRR lag)                                    │
│  RTO: 30 minutes (automated failover)                         │
└─────────────────────────────────────────────────────────────────┘

S3 Cross-Region Replication:

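import boto3

s3 = boto3.client('s3', region_name='us-east-1')
s3_dr = boto3.client('s3', region_name='us-west-2')

# Enable versioning first: CRR requires it on both source and destination buckets
s3.put_bucket_versioning(
    Bucket='primary-data-lake',
    VersioningConfiguration={'Status': 'Enabled'}
)
s3_dr.put_bucket_versioning(
    Bucket='dr-data-lake',
    VersioningConfiguration={'Status': 'Enabled'}
)

# Enable CRR
s3.put_bucket_replication(
    Bucket='primary-data-lake',
    ReplicationConfiguration={
        'Role': 'arn:aws:iam::123456789012:role/s3-replication-role',  # placeholder account ID
        'Rules': [
            {
                'ID': 'replicate-all',
                'Priority': 1,
                'Status': 'Enabled',
                'Filter': {'Prefix': ''},
                # Required when a rule uses Filter
                'DeleteMarkerReplication': {'Status': 'Disabled'},
                'Destination': {
                    'Bucket': 'arn:aws:s3:::dr-data-lake',
                    'StorageClass': 'STANDARD_IA'
                }
            }
        ]
    }
)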
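
To judge how far the replica lags behind, one option is to inspect the `ReplicationStatus` field that `head_object` returns for source objects. The sketch below only summarises responses that have already been collected. `replication_backlog` and the `NOT_CONFIGURED` label (for objects with no status, for example those uploaded before replication was enabled) are hypothetical names introduced here.

```python
from collections import Counter

# Both spellings are treated as done
DONE_STATUSES = {"COMPLETE", "COMPLETED"}


def replication_backlog(head_responses):
    """Count replication statuses and objects not yet confirmed in the replica."""
    counts = Counter(
        response.get("ReplicationStatus", "NOT_CONFIGURED")
        for response in head_responses
    )
    not_confirmed = sum(
        number
        for status, number in counts.items()
        if status not in DONE_STATUSES and status != "REPLICA"
    )
    return dict(counts), not_confirmed


sample = [
    {"ReplicationStatus": "COMPLETED"},
    {"ReplicationStatus": "PENDING"},
    {"ReplicationStatus": "FAILED"},
    {},  # object without a replication status
]
counts, backlog = replication_backlog(sample)
```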

Q14: How do you implement data lake monitoring and observability?

Answer:

Monitoring Architecture:

Architecture Diagram

┌─────────────────────────────────────────────────────────────────┐
│              Data Lake Monitoring Architecture                  │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Metrics           Logs              Traces                    │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐           │
│  │ CloudWatch  │  │ CloudWatch  │  │ X-Ray       │           │
│  │ Metrics     │  │ Logs        │  │ Tracing     │           │
│  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘           │
│         │                │                │                    │
│         └────────────────┼────────────────┘                    │
│                          ▼                                      │
│                   ┌─────────────┐                              │
│                   │  Dashboard  │                              │
│                   │  (QuickSight)│                              │
│                   └─────────────┘                              │
└─────────────────────────────────────────────────────────────────┘

CloudWatch Metrics:

cloudwatch = boto3.client('cloudwatch')

# Data lake metrics
metrics = [
    {
        'MetricName': 'DataLakeSize',
        'Dimensions': [{'Name': 'Lake', 'Value': 'production'}],
        'Value': get_lake_size_gb(),
        'Unit': 'Gigabytes'
    },
    {
        'MetricName': 'QueryCount',
        'Dimensions': [{'Name': 'Service', 'Value': 'Athena'}],
        'Value': get_athena_query_count(),
        'Unit': 'Count'
    },
    {
        'MetricName': 'DataFreshness',
        'Dimensions': [{'Name': 'Dataset', 'Value': 'sales'}],
        'Value': get_data_freshness_hours(),
        'Unit': 'None'  # CloudWatch has no 'Hours' unit; the value is in hours
    }
]

cloudwatch.put_metric_data(
    Namespace='DataLake/Analytics',
    MetricData=metrics
)
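
The metric list above relies on helpers such as `get_data_freshness_hours()` that the listing does not define. A minimal sketch of the freshness calculation, assuming the caller already knows the timestamp of the most recent object or partition, might look like this. The function names are hypothetical.

```python
from datetime import datetime, timezone


def data_freshness_hours(last_modified, now=None):
    """Hours elapsed since the dataset was last updated (never negative)."""
    now = now or datetime.now(timezone.utc)
    return max(0.0, (now - last_modified).total_seconds() / 3600)


def freshness_metric(dataset, last_modified, now=None):
    """Build one MetricData entry in the shape put_metric_data expects."""
    return {
        "MetricName": "DataFreshness",
        "Dimensions": [{"Name": "Dataset", "Value": dataset}],
        "Value": data_freshness_hours(last_modified, now),
        "Unit": "None",  # CloudWatch has no unit for hours
    }


last_update = datetime(2024, 1, 15, 0, 0, tzinfo=timezone.utc)
checked_at = datetime(2024, 1, 15, 6, 30, tzinfo=timezone.utc)
metric = freshness_metric("sales", last_update, now=checked_at)
```

Passing `now` explicitly makes the calculation deterministic in tests, while production code can omit it.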

S3 Metrics:

# Monitor S3 bucket metrics
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client('cloudwatch')

# Get S3 bucket size (a daily storage metric, so look back two days
# to be sure at least one datapoint falls inside the window)
now = datetime.now(timezone.utc)
response = cloudwatch.get_metric_statistics(
    Namespace='AWS/S3',
    MetricName='BucketSizeBytes',
    Dimensions=[
        {'Name': 'BucketName', 'Value': 'data-lake-bucket'},
        {'Name': 'StorageType', 'Value': 'StandardStorage'}
    ],
    StartTime=now - timedelta(days=2),
    EndTime=now,
    Period=86400,
    Statistics=['Average']
)
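
The response from `get_metric_statistics` contains a `Datapoints` list that is not guaranteed to be sorted. A small sketch of pulling out the most recent bucket size follows; `latest_bucket_size_gb` is a hypothetical helper and the sample response is made up for illustration.

```python
from datetime import datetime, timezone


def latest_bucket_size_gb(response):
    """Return the newest BucketSizeBytes datapoint in GiB, or None if empty."""
    datapoints = response.get("Datapoints", [])
    if not datapoints:
        return None
    latest = max(datapoints, key=lambda point: point["Timestamp"])
    return latest["Average"] / 1024**3


# Illustrative response shape with two daily datapoints in arbitrary order
sample_response = {
    "Datapoints": [
        {"Timestamp": datetime(2024, 1, 15, tzinfo=timezone.utc), "Average": 3 * 1024**3},
        {"Timestamp": datetime(2024, 1, 14, tzinfo=timezone.utc), "Average": 2 * 1024**3},
    ]
}
size_gb = latest_bucket_size_gb(sample_response)
```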

Q15: How do you implement data lake automation?

Answer:

Automation Architecture:

Architecture Diagram

┌─────────────────────────────────────────────────────────────────┐
│              Data Lake Automation Architecture                  │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  CI/CD Pipeline                                                │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐        │
│  │  Code       │───▶│  Build      │───▶│  Deploy     │        │
│  │  Commit     │    │  (CodeBuild)│    │  (CloudFormation)│   │
│  └─────────────┘    └─────────────┘    └─────────────┘        │
│                                                                 │
│  Orchestration                                                 │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐        │
│  │  EventBridge│───▶│  Step       │───▶│  Lambda     │        │
│  │  Rules      │    │  Functions  │    │  Tasks      │        │
│  └─────────────┘    └─────────────┘    └─────────────┘        │
│                                                                 │
│  Infrastructure as Code                                        │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │  • CloudFormation / CDK                                  │   │
│  │  • Glue workflows                                        │   │
│  │  • Step Functions state machines                        │   │
│  └─────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────┘

EventBridge Automation:

import boto3

events = boto3.client('events')

# Schedule daily data lake refresh (02:00 UTC)
events.put_rule(
    Name='DailyDataLakeRefresh',
    ScheduleExpression='cron(0 2 * * ? *)',
    State='ENABLED'
)

# Add target: the Step Functions state machine that runs the refresh
events.put_targets(
    Rule='DailyDataLakeRefresh',
    Targets=[
        {
            'Id': 'DataLakeRefreshStateMachine',
            'Arn': 'arn:aws:states:us-east-1:123456789012:stateMachine:DataLakeRefresh',
            'RoleArn': 'arn:aws:iam::123456789012:role/EventBridgeRole'  # placeholder account ID
        }
    ]
)
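
EventBridge cron expressions have six fields (minutes, hours, day of month, month, day of week, year), and one of the two day fields must be `?`. The sketch below splits an expression into named fields and checks that rule. `parse_schedule` is a hypothetical helper, not part of boto3, and it does not validate the individual field values.

```python
FIELDS = ("minutes", "hours", "day_of_month", "month", "day_of_week", "year")


def parse_schedule(expression):
    """Split an EventBridge cron(...) expression into its six named fields."""
    if not (expression.startswith("cron(") and expression.endswith(")")):
        raise ValueError("Expected an expression of the form cron(...)")
    parts = expression[len("cron("):-1].split()
    if len(parts) != len(FIELDS):
        raise ValueError(f"Expected {len(FIELDS)} fields, got {len(parts)}")
    fields = dict(zip(FIELDS, parts))
    # Day-of-month and day-of-week cannot both be specified
    if "?" not in (fields["day_of_month"], fields["day_of_week"]):
        raise ValueError("Use ? in either day_of_month or day_of_week")
    return fields


daily_refresh = parse_schedule("cron(0 2 * * ? *)")
# minutes=0 and hours=2 mean the rule fires at 02:00 UTC every day
```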

CloudFormation Template:

Resources:
  DataLakeBucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketName: !Sub "data-lake-${AWS::AccountId}"
      VersioningConfiguration:
        Status: Enabled
      LifecycleConfiguration:
        Rules:
          - Id: OptimizeStorage
            Status: Enabled
            Transitions:
              - TransitionInDays: 30
                StorageClass: STANDARD_IA
              - TransitionInDays: 90
                StorageClass: GLACIER

  GlueCrawler:
    Type: AWS::Glue::Crawler
    Properties:
      Name: data-lake-crawler
      Role: !GetAtt GlueRole.Arn
      DatabaseName: analytics_db
      Targets:
        S3Targets:
          - Path: !Sub "s3://${DataLakeBucket}/silver/"
      Schedule:
        ScheduleExpression: cron(0 1 * * ? *)

Q16: How do you implement data lake for machine learning workloads?

Answer:

ML-Ready Data Lake Architecture:

Architecture Diagram

┌─────────────────────────────────────────────────────────────────┐
│              Data Lake for Machine Learning                     │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Data Sources          Feature Engineering    Feature Store    │
│  ┌─────────────┐      ┌─────────────┐       ┌─────────────┐  │
│  │ Raw Data    │─────▶│  Glue/EMR   │──────▶│ SageMaker   │  │
│  │ (S3)        │      │  Features   │       │ Feature Store│  │
│  └─────────────┘      └─────────────┘       └─────────────┘  │
│                                                                 │
│  Training            Model Registry        Deployment         │
│  ┌─────────────┐    ┌─────────────┐      ┌─────────────┐    │
│  │ SageMaker   │───▶│ Model       │─────▶│ Endpoint    │    │
│  │ Training    │    │ Registry    │      │ (Real-time) │    │
│  └─────────────┘    └─────────────┘      └─────────────┘    │
│                                                                 │
│  Batch Transform                                               │
│  ┌─────────────┐    ┌─────────────┐      ┌─────────────┐    │
│  │ SageMaker   │───▶│ Results     │─────▶│ S3/Redshift │    │
│  │ Batch       │    │             │      │             │    │
│  └─────────────┘    └─────────────┘      └─────────────┘    │
└─────────────────────────────────────────────────────────────────┘

Feature Engineering Pipeline:

# SageMaker Processing for feature engineering
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.sklearn.processing import SKLearnProcessor

sklearn_processor = SKLearnProcessor(
    framework_version='0.23-1',
    role='arn:aws:iam::123456789012:role/SageMakerRole',  # placeholder account ID
    instance_count=2,
    instance_type='ml.m5.xlarge'
)

sklearn_processor.run(
    code='feature_engineering.py',
    inputs=[
        ProcessingInput(
            source='s3://data-lake/raw/customers/',
            destination='/opt/ml/processing/input/customers'
        ),
        ProcessingInput(
            source='s3://data-lake/raw/transactions/',
            destination='/opt/ml/processing/input/transactions'
        )
    ],
    outputs=[
        ProcessingOutput(
            output_name='features',
            source='/opt/ml/processing/output/features',
            destination='s3://data-lake/features/customers/'
        )
    ]
)
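
The processing job above runs a script named `feature_engineering.py` that the listing does not show. As a rough sketch of the kind of aggregation such a script might perform, the function below derives per-customer features from transaction rows. The column names `customer_id` and `amount` and the feature names are assumptions, and the standard library is used here only to keep the example self-contained.

```python
from collections import defaultdict


def build_customer_features(transactions):
    """Aggregate transaction rows into per-customer count, total, and average."""
    totals = defaultdict(lambda: {"txn_count": 0, "total_amount": 0.0})
    for row in transactions:
        aggregate = totals[row["customer_id"]]
        aggregate["txn_count"] += 1
        aggregate["total_amount"] += float(row["amount"])

    return {
        customer_id: {
            **aggregate,
            "avg_amount": aggregate["total_amount"] / aggregate["txn_count"],
        }
        for customer_id, aggregate in totals.items()
    }


rows = [
    {"customer_id": "c1", "amount": "10"},
    {"customer_id": "c1", "amount": "30"},
    {"customer_id": "c2", "amount": "5"},
]
features = build_customer_features(rows)
```

In a real job the script would read from the `/opt/ml/processing/input/...` paths and write to `/opt/ml/processing/output/features`, which SageMaker then uploads to the configured S3 destination.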

Q17: How do you implement data lake for analytics and reporting?

Answer:

Analytics Architecture:

Architecture Diagram

┌─────────────────────────────────────────────────────────────────┐
│              Data Lake Analytics Architecture                   │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Ad-Hoc Analytics    BI Analytics       Real-Time Analytics    │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐        │
│  │  Athena     │    │ QuickSight  │    │ Kinesis     │        │
│  │  (SQL)      │    │ (Dashboards)│    │ Analytics   │        │
│  └─────────────┘    └─────────────┘    └─────────────┘        │
│         │                │                    │                │
│         └────────────────┼────────────────────┘                │
│                          ▼                                      │
│                   ┌─────────────┐                              │
│                   │  Data Lake  │                              │
│                   │  (S3)       │                              │
│                   └─────────────┘                              │
└─────────────────────────────────────────────────────────────────┘

Athena Query Optimization:

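-- Create optimized table for analytics
-- With partition projection enabled, every partition column needs projection
-- properties; the region values below are placeholders
CREATE EXTERNAL TABLE sales_analytics (
    transaction_id STRING,
    customer_id STRING,
    product_id STRING,
    amount DECIMAL(10,2),
    quantity INT,
    transaction_date TIMESTAMP
)
PARTITIONED BY (year INT, month INT, region STRING)
STORED AS PARQUET
LOCATION 's3://data-lake/gold/sales/'
TBLPROPERTIES (
    'parquet.compression'='SNAPPY',
    'projection.enabled'='true',
    'projection.year.type'='integer',
    'projection.year.range'='2020,2030',
    'projection.month.type'='integer',
    'projection.month.range'='1,12',
    'projection.region.type'='enum',
    'projection.region.values'='us-east-1,us-west-2'
);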
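
Writers need to place files where a table partitioned by `year`, `month`, and `region` expects them. The sketch below builds the Hive-style partition location under the table's base path and counts how many partitions a projection range implies. Both helper names and the region value are hypothetical.

```python
def partition_location(base, year, month, region):
    """Hive-style S3 prefix for one (year, month, region) partition."""
    return f"{base.rstrip('/')}/year={year}/month={month}/region={region}/"


def projected_partition_count(first_year, last_year, region_count, months=12):
    """How many partitions a year range, 12 months, and N regions describe."""
    return (last_year - first_year + 1) * months * region_count


location = partition_location("s3://data-lake/gold/sales/", 2024, 3, "us-east-1")
partition_total = projected_partition_count(2020, 2030, region_count=2)
```

Keeping the projected range close to the data that actually exists avoids queries enumerating large numbers of empty prefixes when no partition filter is supplied.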

QuickSight Integration:

# Create QuickSight data source from Athena
quicksight = boto3.client('quicksight')

response = quicksight.create_data_source(
    AwsAccountId='123456789012',
    DataSourceId='data-lake-source',
    Name='Data Lake Analytics',
    Type='ATHENA',
    DataSourceParameters={
        'AthenaParameters': {
            'WorkGroup': 'primary'
        }
    },
    Permissions=[
        {
            'Principal': 'arn:aws:quicksight:us-east-1:123456789012:user/default/admin',
            'Actions': [
                'quicksight:DescribeDataSource',
                'quicksight:DescribeDataSourcePermissions',
                'quicksight:UpdateDataSource',
                'quicksight:DeleteDataSource'
            ]
        }
    ]
)

Q18: How do you implement data lake for IoT and time-series data?

Answer:

IoT Data Lake Architecture:

Architecture Diagram

┌─────────────────────────────────────────────────────────────────┐
│              IoT Data Lake Architecture                         │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  IoT Devices        Ingestion          Storage                 │
│  ┌─────────────┐   ┌─────────────┐   ┌─────────────┐         │
│  │ Sensors     │──▶│ IoT Core     │──▶│ Kinesis     │         │
│  │ (MQTT)      │   │              │   │ Streams     │         │
│  └─────────────┘   └─────────────┘   └──────┬──────┘         │
│                                              │                 │
│  ┌─────────────┐   ┌─────────────┐          │                 │
│  │ Gateways    │──▶│ IoT Analytics│─────────┤                 │
│  └─────────────┘   └─────────────┘          │                 │
│                                              ▼                 │
│                                       ┌─────────────┐         │
│                                       │   S3        │         │
│                                       │  (IoT Lake) │         │
│                                       └──────┬──────┘         │
│                                              │                 │
│                             ┌────────────────┼────────────┐   │
│                             ▼                ▼            ▼   │
│                      ┌──────────┐    ┌──────────┐  ┌────────┐│
│                      │ TimeStream│   │ Athena   │  │ QuickSight│
│                      │ (Metrics) │   │ (Query)  │  │ (Viz)  ││
│                      └──────────┘    └──────────┘  └────────┘│
└─────────────────────────────────────────────────────────────────┘

IoT Data Partitioning:

# Optimal partitioning for IoT data
# Partition by device_type/year/month/day/hour
partition_schema = """
device_type STRING,
year INT,
month INT,
day INT,
hour INT
"""

# Write IoT data with optimal partitioning
df.write \
    .partitionBy("device_type", "year", "month", "day", "hour") \
    .parquet("s3://data-lake/iot/")
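
Hourly partitioning by device type multiplies quickly, so it is worth estimating the partition count and average file size before committing to this layout. The sketch below is plain arithmetic under the assumption that data arrives evenly across device types and hours. The function names and the example volumes are hypothetical.

```python
def hourly_partitions_per_year(device_types, days=365):
    """Partitions created per year with device_type/year/month/day/hour keys."""
    return device_types * days * 24


def average_file_mb(daily_gb, device_types, files_per_partition=1):
    """Average file size in MB if data is spread evenly over hourly partitions."""
    partitions_per_day = device_types * 24
    return daily_gb * 1024 / (partitions_per_day * files_per_partition)


partitions = hourly_partitions_per_year(device_types=10)
file_mb = average_file_mb(daily_gb=24, device_types=10)
```

If the estimated file size falls well below the 128 MB lower bound recommended earlier, partitioning by day instead of hour, or compacting files after ingestion, is usually the safer choice.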

Q19: How do you implement data lake for unstructured data?

Answer:

Unstructured Data Architecture:

Architecture Diagram

┌─────────────────────────────────────────────────────────────────┐
│              Unstructured Data Lake Architecture                │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Data Types           Processing          Storage              │
│  ┌─────────────┐     ┌─────────────┐    ┌─────────────┐       │
│  │ Images      │────▶│ Rekognition │───▶│ S3          │       │
│  │ Videos      │     │             │    │ (Binary)    │       │
│  └─────────────┘     └─────────────┘    └─────────────┘       │
│                                                                 │
│  ┌─────────────┐     ┌─────────────┐    ┌─────────────┐       │
│  │ Documents   │────▶│ Textract    │───▶│ Metadata    │       │
│  │ PDFs        │     │ Comprehend  │    │ (DynamoDB)  │       │
│  └─────────────┘     └─────────────┘    └─────────────┘       │
│                                                                 │
│  ┌─────────────┐     ┌─────────────┐    ┌─────────────┐       │
│  │ Audio       │────▶│ Transcribe  │───▶│ Text Index  │       │
│  │             │     │ Comprehend  │    │ (OpenSearch)│       │
│  └─────────────┘     └─────────────┘    └─────────────┘       │
└─────────────────────────────────────────────────────────────────┘

Metadata Extraction:

import boto3
from datetime import datetime, timezone
from decimal import Decimal

# Extract metadata from images
rekognition = boto3.client('rekognition')

def extract_image_metadata(bucket, key):
    response = rekognition.detect_labels(
        Image={'S3Object': {'Bucket': bucket, 'Name': key}},
        MaxLabels=10
    )
    
    metadata = {
        'labels': [label['Name'] for label in response['Labels']],
        # DynamoDB rejects Python floats, so store confidences as Decimal
        'confidence': [Decimal(str(label['Confidence'])) for label in response['Labels']]
    }
    
    return metadata

# Store metadata in DynamoDB
dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('image_metadata')

def store_image_metadata(bucket, key):
    table.put_item(
        Item={
            'image_id': key,
            'bucket': bucket,
            'metadata': extract_image_metadata(bucket, key),
            'processed_at': datetime.now(timezone.utc).isoformat()
        }
    )

Q20: How do you implement data lake with data sharing capabilities?

Answer:

Data Sharing Architecture:

Architecture Diagram

┌─────────────────────────────────────────────────────────────────┐
│              Data Lake Data Sharing                             │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Data Producer Account      Data Consumer Accounts             │
│  ┌─────────────────────┐   ┌─────────────────────┐            │
│  │ S3 Bucket           │──▶│ Cross-Account       │            │
│  │ (Source)            │   │ Access              │            │
│  └─────────────────────┘   └─────────────────────┘            │
│                                                                 │
│  ┌─────────────────────┐   ┌─────────────────────┐            │
│  │ Lake Formation      │──▶│ Shared Permissions  │            │
│  │ Grants              │   │                     │            │
│  └─────────────────────┘   └─────────────────────┘            │
│                                                                 │
│  ┌─────────────────────┐   ┌─────────────────────┐            │
│  │ Redshift Spectrum   │──▶│ Federated Queries   │            │
│  │ (Shared Tables)     │   │                     │            │
│  └─────────────────────┘   └─────────────────────┘            │
└─────────────────────────────────────────────────────────────────┘

Lake Formation Cross-Account Sharing:

# Grant cross-account access
lakeformation = boto3.client('lakeformation')

lakeformation.grant_permissions(
    Principal={'DataLakePrincipalIdentifier': 'arn:aws:iam::consumer-account:role/DataAnalyst'},
    Resource={
        'Table': {
            'DatabaseName': 'shared_db',
            'Name': 'shared_table'
        }
    },
    Permissions=['SELECT', 'DESCRIBE']
)

Q21: How do you implement data versioning in a data lake?

Answer:

Data Versioning Strategies:

1. S3 Versioning:

# Enable versioning
s3 = boto3.client('s3')

s3.put_bucket_versioning(
    Bucket='data-lake-bucket',
    VersioningConfiguration={'Status': 'Enabled'}
)

# Get specific version
response = s3.get_object(
    Bucket='data-lake-bucket',
    Key='sales/data.parquet',
    VersionId='abc123'
)
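
With versioning enabled, a deleted object is hidden behind a delete marker rather than removed, and removing that marker makes the previous version visible again. The sketch below finds the marker to remove in a `list_object_versions`-style response. `undelete_marker_id` is a hypothetical helper, and the sample response is made up for illustration.

```python
def undelete_marker_id(versions_response, key):
    """Return the VersionId of the current delete marker for key, or None."""
    for marker in versions_response.get("DeleteMarkers", []):
        if marker["Key"] == key and marker["IsLatest"]:
            return marker["VersionId"]
    return None


# Illustrative response: the object was deleted after version v1 was written
sample_versions = {
    "Versions": [
        {"Key": "sales/data.parquet", "VersionId": "v1", "IsLatest": False},
    ],
    "DeleteMarkers": [
        {"Key": "sales/data.parquet", "VersionId": "dm1", "IsLatest": True},
    ],
}
marker_id = undelete_marker_id(sample_versions, "sales/data.parquet")
# s3.delete_object(Bucket=..., Key=..., VersionId=marker_id) would remove the marker
```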

2. Delta Lake for ACID Transactions:

# Delta Lake versioning
from delta.tables import DeltaTable

# Write with versioning
df.write \
    .format("delta") \
    .mode("overwrite") \
    .save("s3://data-lake/sales/")

# Read specific version
spark.read \
    .format("delta") \
    .option("versionAsOf", 5) \
    .load("s3://data-lake/sales/")

# Time travel
spark.read \
    .format("delta") \
    .option("timestampAsOf", "2024-01-15") \
    .load("s3://data-lake/sales/")

3. Glue Table Versions:

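glue = boto3.client('glue')

# Get table version
response = glue.get_table_version(
    DatabaseName='analytics_db',
    TableName='sales_data',
    VersionId='4'
)

# Restore table version: Glue has no dedicated restore call, so re-apply
# the old definition with update_table (read-only fields must be dropped)
old_table = response['TableVersion']['Table']
allowed_keys = {'Name', 'Description', 'Owner', 'Retention', 'StorageDescriptor',
                'PartitionKeys', 'TableType', 'Parameters'}

glue.update_table(
    DatabaseName='analytics_db',
    TableInput={k: v for k, v in old_table.items() if k in allowed_keys}
)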

Q22: How do you implement data lake for regulatory compliance?

Answer:

Compliance Architecture:

Architecture Diagram

┌─────────────────────────────────────────────────────────────────┐
│              Data Lake Compliance Architecture                  │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Data Classification    Access Control      Audit Logging      │
│  ┌─────────────────┐   ┌─────────────────┐ ┌─────────────────┐│
│  │ • PII           │   │ • Column-level  │ │ • CloudTrail    ││
│  │ • PHI           │   │ • Row-level     │ │ • S3 access logs││
│  │ • Financial     │   │ • Time-based    │ │ • Glue audit    ││
│  └─────────────────┘   └─────────────────┘ └─────────────────┘│
│                                                                 │
│  Data Retention       Encryption          Masking             │
│  ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│  │ • Lifecycle     │ │ • SSE-KMS       │ │ • Dynamic       │ │
│  │ • Archival      │ │ • Client-side   │ │ • Static        │ │
│  │ • Deletion      │ │ • Field-level   │ │ • Tokenization  │ │
│  └─────────────────┘ └─────────────────┘ └─────────────────┘ │
└─────────────────────────────────────────────────────────────────┘

Data Masking:

from pyspark.sql.functions import col, concat, lit, md5

# Mask PII columns
def mask_pii(df):
    return df \
        .withColumn("masked_email", 
            concat(
                md5(col("email")),
                lit("@masked.com")
            )
        ) \
        .withColumn("masked_ssn",
            concat(
                lit("XXX-XX-"),
                col("ssn").substr(-4, 4)
            )
        ) \
        .drop("email", "ssn")  # drop the raw columns so only masked values remain
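
The masking rules can be mirrored in plain Python, which is convenient for unit-testing expected values without a Spark session. This is a sketch of the same logic, not the Spark implementation. The keyed variant `mask_email_keyed` is an addition that is not in the original: an unsalted MD5 of an email address can be reversed with a lookup table of known addresses, so a keyed hash is one possible hardening.

```python
import hashlib
import hmac


def mask_email(email):
    """MD5 hex digest of the address followed by a fixed masked domain."""
    return hashlib.md5(email.encode("utf-8")).hexdigest() + "@masked.com"


def mask_ssn(ssn):
    """Keep only the last four characters of the SSN."""
    return "XXX-XX-" + ssn[-4:]


def mask_email_keyed(email, secret_key):
    """Hypothetical hardened variant: HMAC-SHA256 with a secret key."""
    digest = hmac.new(secret_key, email.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest + "@masked.com"


masked_ssn = mask_ssn("123-45-6789")
masked_email = mask_email("jane@example.com")
```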

Q23: How do you implement data lake backup and recovery?

Answer:

Backup Strategy:

Architecture Diagram

┌─────────────────────────────────────────────────────────────────┐
│              Data Lake Backup Architecture                      │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Backup Strategy          RPO           RTO                    │
│  ┌─────────────────────┬───────────┬───────────┐              │
│  │ S3 Versioning       │ Real-time │ Minutes   │              │
│  │ Cross-Region Replication│ 15 min │ Minutes   │              │
│  │ Daily Snapshots     │ 24 hours  │ Hours     │              │
│  │ Weekly Archives     │ 7 days    │ Hours     │              │
│  └─────────────────────┴───────────┴───────────┘              │
│                                                                 │
│  Recovery Scenarios                                            │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ • Accidental deletion: S3 versioning rollback          │   │
│  │ • Corruption: Point-in-time recovery                   │   │
│  │ • Regional outage: Cross-region failover               │   │
│  │ • Ransomware: Restore from immutable backup            │   │
│  └─────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────┘

Automated Backup:

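import boto3
from datetime import datetime, timezone

# Daily snapshot Lambda
def lambda_handler(event, context):
    s3 = boto3.client('s3')
    
    # Use one snapshot date for the whole run so a run that crosses
    # midnight does not split objects across two prefixes
    snapshot_date = datetime.now(timezone.utc).strftime('%Y-%m-%d')
    
    # Get all objects
    paginator = s3.get_paginator('list_objects_v2')
    pages = paginator.paginate(Bucket='data-lake-bucket')
    
    for page in pages:
        for obj in page.get('Contents', []):
            copy_source = {'Bucket': 'data-lake-bucket', 'Key': obj['Key']}
            
            # Managed copy: switches to multipart for objects over 5 GB,
            # which a single copy_object call cannot handle
            s3.copy(
                CopySource=copy_source,
                Bucket='data-lake-backup-bucket',
                Key=f"daily/{snapshot_date}/{obj['Key']}"
            )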
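
Daily snapshots accumulate, so the backup bucket needs a retention rule alongside the copy job. The sketch below builds the `daily/<date>/<key>` layout used by the snapshot function and lists the snapshot prefixes that fall outside a retention window. The helper names and the seven-day default are assumptions.

```python
from datetime import date, timedelta


def backup_key(source_key, snapshot_date):
    """Destination key for one object in a dated daily snapshot."""
    return f"daily/{snapshot_date.isoformat()}/{source_key}"


def expired_prefixes(snapshot_dates, today, keep_days=7):
    """Snapshot prefixes older than the retention window, oldest first."""
    cutoff = today - timedelta(days=keep_days)
    return [
        f"daily/{snapshot.isoformat()}/"
        for snapshot in sorted(snapshot_dates)
        if snapshot < cutoff
    ]


key = backup_key("sales/data.parquet", date(2024, 1, 15))
to_delete = expired_prefixes(
    [date(2024, 1, 10), date(2024, 1, 1)],
    today=date(2024, 1, 15),
)
```

In practice an S3 lifecycle expiration rule on the `daily/` prefix can achieve the same effect without custom code.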

Q24: How do you optimize data lake performance?

Answer:

Performance Optimization:

1. File Format Optimization:

┌─────────────────────────────────────────────────────────────────┐
│              File Format Comparison                             │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Format      Compression    Query    Write    Columnar         │
│  ───────────────────────────────────────────────────────────   │
│  Parquet     Snappy/Gzip    Fast     Medium   Yes              │
│  ORC         Zlib           Fast     Slow     Yes              │
│  Avro        Snappy         Medium   Fast     No               │
│  JSON        None           Slow     Fast     No               │
│  CSV         Gzip           Slow     Fast     No               │
│                                                                 │
│  Recommendation: Parquet with Snappy for analytics workloads  │
└─────────────────────────────────────────────────────────────────┘

2. Partitioning Strategy:

# Optimal partitioning
# Partition by low-cardinality columns that appear in WHERE clauses;
# high-cardinality columns produce many tiny partitions and small files
df.write \
    .partitionBy("year", "month", "day", "region") \
    .parquet("s3://data-lake/sales/")

# Target file size: 128MB - 1GB per file
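
One way to reason about which columns to partition by is to look at their distinct-value counts and cap the total number of partitions. The sketch below greedily picks the lowest-cardinality columns until a partition budget would be exceeded. `choose_partition_columns` and the 10,000 budget are hypothetical, and the counts are made-up inputs.

```python
def choose_partition_columns(distinct_counts, max_partitions=10_000):
    """Pick low-cardinality columns whose combined partition count fits the budget."""
    chosen = []
    partition_count = 1
    for column, count in sorted(distinct_counts.items(), key=lambda item: item[1]):
        if partition_count * count > max_partitions:
            break
        chosen.append(column)
        partition_count *= count
    return chosen, partition_count


candidate_columns = {
    "year": 5,
    "month": 12,
    "region": 4,
    "customer_id": 1_000_000,
}
columns, total_partitions = choose_partition_columns(candidate_columns)
```

A column such as `customer_id` is excluded because it alone would create a million partitions; filters on it are better served by sorting or bucketing within files.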

3. Caching:

# Athena workgroup with a per-query scan limit
# (query result reuse is requested per query through ResultReuseConfiguration
# on start_query_execution, not in the workgroup configuration)
athena = boto3.client('athena')

athena.create_work_group(
    Name='optimized',
    Configuration={
        'ResultConfiguration': {
            'OutputLocation': 's3://athena-results/'
        },
        'EnforceWorkGroupConfiguration': True,
        'PublishCloudWatchMetricsEnabled': True,
        'BytesScannedCutoffPerQuery': 1000000000,  # 1GB limit
        'RequesterPaysEnabled': False,
        'EngineVersion': {
            'SelectedEngineVersion': 'AUTO'
        }
    }
)
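
Athena can reuse the results of an identical earlier query when the request asks for it. The sketch below builds the keyword arguments for `start_query_execution` with result reuse enabled. `reusable_query_request` is a hypothetical helper, and the 60-minute maximum age is an arbitrary example value. The helper only builds the request and does not call AWS.

```python
def reusable_query_request(sql, workgroup, max_age_minutes=60):
    """Build start_query_execution kwargs that allow reuse of recent results."""
    return {
        "QueryString": sql,
        "WorkGroup": workgroup,
        "ResultReuseConfiguration": {
            "ResultReuseByAgeConfiguration": {
                "Enabled": True,
                "MaxAgeInMinutes": max_age_minutes,
            }
        },
    }


query_request = reusable_query_request(
    "SELECT region, SUM(amount) FROM sales_analytics GROUP BY region",
    workgroup="optimized",
)
# boto3.client("athena").start_query_execution(**query_request) would submit it
```

A shorter maximum age suits tables that are refreshed often, since a reused result does not reflect data written after the original query ran.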

Q25: How do you implement data lake best practices?

Answer:

Best Practices Checklist:

┌─────────────────────────────────────────────────────────────────┐
│              Data Lake Best Practices                           │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Architecture                                                  │
│  ☐ Implement layered architecture (Bronze/Silver/Gold)        │
│  ☐ Use proper partitioning strategy                            │
│  ☐ Implement data quality gates                                │
│  ☐ Design for scalability and cost optimization               │
│                                                                 │
│  Security                                                      │
│  ☐ Implement least-privilege access                           │
│  ☐ Enable encryption at rest and in transit                    │
│  ☐ Use Lake Formation for fine-grained permissions            │
│  ☐ Enable audit logging                                        │
│                                                                 │
│  Operations                                                    │
│  ☐ Implement monitoring and alerting                          │
│  ☐ Automate data pipeline orchestration                       │
│  ☐ Implement backup and recovery                              │
│  ☐ Document data lineage and metadata                         │
│                                                                 │
│  Cost Optimization                                              │
│  ☐ Use appropriate storage classes                            │
│  ☐ Implement lifecycle policies                                │
│  ☐ Optimize file sizes and formats                            │
│  ☐ Monitor and optimize query performance                      │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Implementation Checklist:

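import boto3


class DataLakeBestPractices:
    def __init__(self, data_lake_bucket, log_bucket=None):
        self.bucket = data_lake_bucket
        # Assumption: access logs go to a separate bucket; the default name is illustrative
        self.log_bucket = log_bucket or f'{data_lake_bucket}-access-logs'
        self.s3 = boto3.client('s3')

    def implement_best_practices(self):
        # 1. Enable versioning
        self.enable_versioning()

        # 2. Implement lifecycle policies
        self.implement_lifecycle()

        # 3. Enable encryption
        self.enable_encryption()

        # 4. Configure access logging
        self.enable_access_logging()

        # 5. Set up monitoring
        self.setup_monitoring()

    def enable_versioning(self):
        # Keep prior object versions so overwrites and deletes can be rolled back
        self.s3.put_bucket_versioning(
            Bucket=self.bucket,
            VersioningConfiguration={'Status': 'Enabled'}
        )

    def implement_lifecycle(self):
        # Move objects to cheaper storage classes as they age
        lifecycle_config = {
            'Rules': [
                {
                    'ID': 'OptimizeStorage',
                    'Status': 'Enabled',
                    'Filter': {'Prefix': ''},
                    'Transitions': [
                        {'Days': 30, 'StorageClass': 'STANDARD_IA'},
                        {'Days': 90, 'StorageClass': 'GLACIER'}
                    ]
                }
            ]
        }
        self.s3.put_bucket_lifecycle_configuration(
            Bucket=self.bucket,
            LifecycleConfiguration=lifecycle_config
        )

    def enable_encryption(self):
        # Default server-side encryption with KMS (AWS managed key when no key ID is given)
        self.s3.put_bucket_encryption(
            Bucket=self.bucket,
            ServerSideEncryptionConfiguration={
                'Rules': [{'ApplyServerSideEncryptionByDefault': {'SSEAlgorithm': 'aws:kms'}}]
            }
        )

    def enable_access_logging(self):
        # The target bucket must already exist and allow S3 log delivery
        self.s3.put_bucket_logging(
            Bucket=self.bucket,
            BucketLoggingStatus={
                'LoggingEnabled': {'TargetBucket': self.log_bucket, 'TargetPrefix': f'{self.bucket}/'}
            }
        )

    def setup_monitoring(self):
        # Publish S3 request metrics for the whole bucket to CloudWatch
        self.s3.put_bucket_metrics_configuration(
            Bucket=self.bucket,
            Id='EntireBucket',
            MetricsConfiguration={'Id': 'EntireBucket'}
        )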
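
Because the checklist enables versioning, noncurrent object versions accumulate unless a lifecycle rule expires them. The following is a minimal sketch of a rule builder that adds that cleanup. The function name, rule ID, and the 180-day and 7-day values are assumptions for illustration; the dictionary keys are those accepted by `put_bucket_lifecycle_configuration`.

```python
def build_lifecycle_rule(ia_days=30, glacier_days=90, noncurrent_expire_days=180):
    """Build one S3 lifecycle rule that tiers storage and cleans up old versions."""
    if ia_days < 30:
        # S3 rejects STANDARD_IA transitions earlier than 30 days after creation
        raise ValueError("ia_days must be at least 30 for STANDARD_IA")
    if glacier_days <= ia_days:
        raise ValueError("glacier_days must be greater than ia_days")
    return {
        "ID": "OptimizeStorageWithVersionCleanup",  # illustrative rule name
        "Status": "Enabled",
        "Filter": {"Prefix": ""},
        "Transitions": [
            {"Days": ia_days, "StorageClass": "STANDARD_IA"},
            {"Days": glacier_days, "StorageClass": "GLACIER"},
        ],
        # Delete versions that have been noncurrent for this many days
        "NoncurrentVersionExpiration": {"NoncurrentDays": noncurrent_expire_days},
        # Remove parts left behind by multipart uploads that never completed
        "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
    }


rule = build_lifecycle_rule()
print(rule["Transitions"])
```

The returned dictionary could be placed in the `Rules` list of a lifecycle configuration. Validating the day values locally surfaces mistakes before the API call does.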

Summary

Mastering AWS data lake architecture requires understanding:

Architecture Design: Layered architecture, partitioning, file formats
Security & Governance: Lake Formation, encryption, access control
Cost Optimization: Storage classes, lifecycle policies, query optimization
Operations: Monitoring, automation, disaster recovery
Best Practices: Data quality, lineage, versioning

These concepts form the foundation for building scalable, secure, and cost-effective data lakes on AWS.
