Cloud Cost Optimization: Reserved, Spot, Right-Sizing
Difficulty: Senior Level | Companies: Netflix, Amazon, Google, Microsoft, FinOps Foundation
Interview Question
"Design a cost optimization strategy for a cloud environment with $1M+ monthly spend. How do you handle reserved instances, spot instances, and right-sizing?"
Key Concepts
This question tests your understanding of cloud economics, FinOps practices, and cost optimization strategies.
Complete Cost Optimization Architecture
Architecture Overview
```text
COST OPTIMIZATION ARCHITECTURE

[ VISIBILITY LAYER ]
    Cost Explorer | Budgets | Pricing Calculator
                  |
[ OPTIMIZATION LAYER ]
    Right-Sizing          (CPU/memory optimization)
    Reserved Instances    (1yr/3yr commitments)
    Spot Instances        (fault-tolerant workloads)
    Scheduling            (start/stop non-production)
                  |
[ GOVERNANCE LAYER ]
    Tagging | Policies | Alerts | Reports
```
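The governance layer depends on consistent tagging: the scheduling and budget code later in this section selects resources by an `Environment` tag, so untagged resources silently escape both. A minimal compliance check might look like the sketch below. It assumes resources have already been fetched into dicts using the AWS `[{'Key': ..., 'Value': ...}]` tag shape; the required tag set (`Environment`, `Owner`, `CostCenter`) and the allowed `Environment` values are hypothetical conventions, not AWS defaults.

```python
from typing import Dict, List

# Hypothetical tagging policy; adjust to your organization's conventions.
REQUIRED_TAGS = {"Environment", "Owner", "CostCenter"}
ALLOWED_ENVIRONMENTS = {"production", "staging", "dev"}


def tags_to_dict(tag_list: List[Dict[str, str]]) -> Dict[str, str]:
    """Convert the AWS [{'Key': ..., 'Value': ...}] tag shape into a plain dict."""
    return {t["Key"]: t["Value"] for t in tag_list}


def find_tag_violations(resources: List[Dict]) -> Dict[str, List[str]]:
    """Return {resource_id: [problems]} for resources that break the policy."""
    violations: Dict[str, List[str]] = {}
    for res in resources:
        tags = tags_to_dict(res.get("Tags", []))
        missing = sorted(REQUIRED_TAGS - set(tags))
        problems = [f"missing tag {key}" for key in missing]
        env = tags.get("Environment")
        if env is not None and env not in ALLOWED_ENVIRONMENTS:
            problems.append(f"invalid Environment value {env!r}")
        if problems:
            violations[res["ResourceId"]] = problems
    return violations


# Example input: one compliant instance, one with a miscased value and missing tags
resources = [
    {"ResourceId": "i-aaa", "Tags": [
        {"Key": "Environment", "Value": "dev"},
        {"Key": "Owner", "Value": "team-a"},
        {"Key": "CostCenter", "Value": "1234"},
    ]},
    {"ResourceId": "i-bbb", "Tags": [{"Key": "Environment", "Value": "Prod"}]},
]
print(find_tag_violations(resources))
```

Running a check like this on a schedule and feeding the result into the alerting path keeps the tag-based filters used elsewhere trustworthy.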
Mathematical Foundation: Cost Models
Reserved Instance Savings:
- On-demand price: P_od = $0.10/hour
- Reserved price (1yr): P_1yr = $0.06/hour (40% savings)
- Reserved price (3yr): P_3yr = $0.04/hour (60% savings)
- Monthly savings (1yr): S_1yr = (P_od - P_1yr) × 730 hours = $29.20/instance
- Monthly savings (3yr): S_3yr = (P_od - P_3yr) × 730 hours = $43.80/instance
Spot Instance Savings:
- On-demand price: P_od = $0.10/hour
- Spot price: P_spot = $0.03/hour (70% savings)
- Monthly savings: S_spot = (P_od - P_spot) × 730 hours = $51.10/instance
Right-Sizing Impact:
- Current instance: m5.4xlarge (16 vCPU, 64GB) = $0.768/hour
- Right-sized: m5.xlarge (4 vCPU, 16GB) = $0.192/hour
- Monthly savings: S_resize = (0.768 - 0.192) × 730 = $420.48/instance
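Total Cost Optimization (illustrative):
- Current monthly cost: C_current = $1,000,000
- Right-sizing savings (20%): $200,000
- Reserved instances (30%): $300,000
- Spot instances (15%): $150,000
- Scheduling (10%): $100,000
- Additive total: C_optimized = $250,000 (75% reduction)
- Caveat: the levers act on overlapping spend. Right-sizing and scheduling shrink the usage that reservations and Spot then discount, so the percentages cannot simply be summed. Applying each one to the spend left after the previous lever gives C_optimized = $1,000,000 × 0.80 × 0.70 × 0.85 × 0.90 = $428,400 (about a 57% reduction). Read the additive 75% as a best case rather than a forecast.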
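The arithmetic in this section can be reproduced in a few lines, which also makes it easy to compare summing the percentages with compounding them. This is a sketch using only the example prices above; `monthly_savings` and `stack_reductions` are illustrative helper names, and the sequential model assumes each percentage is a cut to whatever spend remains after the previous lever.

```python
HOURS_PER_MONTH = 730  # 8,760 hours per year / 12


def monthly_savings(on_demand_hourly: float, discounted_hourly: float) -> float:
    """Per-instance monthly savings from a lower hourly rate, rounded to cents."""
    return round((on_demand_hourly - discounted_hourly) * HOURS_PER_MONTH, 2)


def stack_reductions(base: float, reductions: list[float]) -> float:
    """Apply each fractional reduction to the spend remaining after the previous one."""
    remaining = base
    for r in reductions:
        remaining *= 1 - r
    return round(remaining, 2)


print(monthly_savings(0.10, 0.06))    # 1-year RI
print(monthly_savings(0.10, 0.04))    # 3-year RI
print(monthly_savings(0.10, 0.03))    # Spot
print(monthly_savings(0.768, 0.192))  # m5.4xlarge -> m5.xlarge

levers = [0.20, 0.30, 0.15, 0.10]  # right-sizing, RIs, Spot, scheduling
additive = 1_000_000 * (1 - sum(levers))           # percentages simply summed
sequential = stack_reductions(1_000_000, levers)   # percentages compounded
print(additive, sequential)
```

The order of the levers does not change the compounded result, but it does change how much credit each lever appears to earn, which matters when reporting savings per team.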
AWS Cost Explorer Integration
```python
# Cost monitoring and optimization
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Any, Dict, List

import boto3


@dataclass
class CostRecommendation:
    resource_id: str
    resource_type: str
    current_monthly_cost: float
    recommended_action: str
    potential_monthly_savings: float


def _time_period(days: int) -> Dict[str, str]:
    """Cost Explorer time period covering the last `days` days ('End' is exclusive)"""
    end = datetime.now(timezone.utc).date()
    return {'Start': (end - timedelta(days=days)).isoformat(), 'End': end.isoformat()}


class CostOptimizer:
    """Cloud cost optimization manager"""

    def __init__(self):
        # Cost Explorer API requests are billed per request, so avoid tight polling loops
        self.ce = boto3.client('ce')
        self.ec2 = boto3.client('ec2')

    def get_cost_and_usage(self, days: int = 30) -> Dict[str, Any]:
        """Unblended cost for the last `days` days, grouped by service"""
        return self.ce.get_cost_and_usage(
            TimePeriod=_time_period(days),
            Granularity='MONTHLY',
            Metrics=['UnblendedCost'],
            GroupBy=[{'Type': 'DIMENSION', 'Key': 'SERVICE'}],
        )

    def get_right_sizing_recommendations(self) -> List[CostRecommendation]:
        """EC2 right-sizing recommendations from Cost Explorer"""
        recommendations = []
        kwargs: Dict[str, Any] = {
            'Service': 'AmazonEC2',
            'Configuration': {
                'RecommendationTarget': 'SAME_INSTANCE_FAMILY',
                'BenefitsConsidered': True,
            },
        }
        while True:
            response = self.ce.get_rightsizing_recommendation(**kwargs)
            for rec in response.get('RightsizingRecommendations', []):
                current = rec['CurrentInstance']
                if rec['RightsizingType'] == 'TERMINATE':
                    action = 'Terminate'
                    savings = rec['TerminateRecommendationDetail']['EstimatedMonthlySavings']
                else:  # MODIFY: prefer the option flagged as the default target
                    targets = rec['ModifyRecommendationDetail']['TargetInstances']
                    target = next((t for t in targets if t.get('DefaultTargetInstance')),
                                  targets[0])
                    new_type = target['ResourceDetails']['EC2ResourceDetails']['InstanceType']
                    action = f'Modify to {new_type}'
                    savings = target['EstimatedMonthlySavings']
                recommendations.append(CostRecommendation(
                    resource_id=current['ResourceId'],
                    resource_type='EC2',
                    current_monthly_cost=float(current['MonthlyCost']),  # API returns strings
                    recommended_action=action,
                    potential_monthly_savings=float(savings),
                ))
            token = response.get('NextPageToken')
            if not token:
                return recommendations
            kwargs['NextPageToken'] = token

    def get_unused_resources(self) -> List[Dict[str, Any]]:
        """Unattached EBS volumes and unassociated Elastic IPs"""
        unused_resources = []
        paginator = self.ec2.get_paginator('describe_volumes')
        for page in paginator.paginate(
                Filters=[{'Name': 'status', 'Values': ['available']}]):
            for volume in page['Volumes']:
                unused_resources.append({
                    'resource_id': volume['VolumeId'],
                    'resource_type': 'EBS Volume',
                    'size_gb': volume['Size'],
                    # Approximation: gp2 list price of $0.10/GB-month (us-east-1);
                    # other volume types are priced differently
                    'monthly_cost': volume['Size'] * 0.10,
                })
        # An Elastic IP is idle when it has no association (instance or network interface)
        for address in self.ec2.describe_addresses()['Addresses']:
            if 'AssociationId' not in address:
                unused_resources.append({
                    'resource_id': address.get('AllocationId', address['PublicIp']),
                    'resource_type': 'Elastic IP',
                    'monthly_cost': 0.005 * 730,  # $0.005/hour while idle
                })
        return unused_resources

    def get_savings_plans_coverage(self, days: int = 30) -> Dict[str, Any]:
        """Savings Plans coverage grouped by instance family"""
        return self.ce.get_savings_plans_coverage(
            TimePeriod=_time_period(days),
            GroupBy=[{'Type': 'DIMENSION', 'Key': 'INSTANCE_FAMILY'}],
        )
```
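`get_cost_and_usage` returns nested JSON with amounts encoded as strings. A minimal sketch for turning the service-grouped response into a ranked list of totals, assuming the `SERVICE` grouping and `UnblendedCost` metric requested above; `top_services` is a hypothetical helper name and `sample` is a hand-written response with the same shape as a real one.

```python
from collections import defaultdict
from typing import Any, Dict, List, Tuple


def top_services(response: Dict[str, Any], limit: int = 5) -> List[Tuple[str, float]]:
    """Sum UnblendedCost per service across all returned periods, largest first."""
    totals: Dict[str, float] = defaultdict(float)
    for period in response["ResultsByTime"]:
        for group in period.get("Groups", []):
            service = group["Keys"][0]
            totals[service] += float(group["Metrics"]["UnblendedCost"]["Amount"])
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    return [(name, round(cost, 2)) for name, cost in ranked[:limit]]


# Hand-written sample: a 30-day window that spans two calendar months
sample = {
    "ResultsByTime": [
        {"Groups": [
            {"Keys": ["Amazon Elastic Compute Cloud - Compute"],
             "Metrics": {"UnblendedCost": {"Amount": "410000.50", "Unit": "USD"}}},
            {"Keys": ["Amazon Simple Storage Service"],
             "Metrics": {"UnblendedCost": {"Amount": "90000.25", "Unit": "USD"}}},
        ]},
        {"Groups": [
            {"Keys": ["Amazon Elastic Compute Cloud - Compute"],
             "Metrics": {"UnblendedCost": {"Amount": "20000.00", "Unit": "USD"}}},
        ]},
    ]
}
print(top_services(sample))
```

With `MONTHLY` granularity, a 30-day window usually straddles two months and yields two periods, which is why the helper sums across `ResultsByTime` rather than reading only the first entry.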
Right-Sizing Implementation
```python
# Right-sizing analysis
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Any, Dict, List, Optional

import boto3


@dataclass
class RightSizeRecommendation:
    instance_id: str
    current_type: str
    recommended_type: str
    current_cost: float          # on-demand USD/hour
    recommended_cost: float      # on-demand USD/hour
    cpu_utilization: float       # average %, lookback window
    memory_utilization: float    # average %, lookback window


class RightSizer:
    """Instance right-sizing analyzer"""

    def __init__(self, lookback_days: int = 14):
        self.cloudwatch = boto3.client('cloudwatch')
        self.ec2 = boto3.client('ec2')
        self.lookback_days = lookback_days

    def analyze_instances(self) -> List[RightSizeRecommendation]:
        """Analyze all running instances for right-sizing"""
        recommendations = []
        paginator = self.ec2.get_paginator('describe_instances')
        pages = paginator.paginate(
            Filters=[{'Name': 'instance-state-name', 'Values': ['running']}]
        )
        for page in pages:
            for reservation in page['Reservations']:
                for instance in reservation['Instances']:
                    recommendation = self._analyze_instance(instance)
                    if recommendation:
                        recommendations.append(recommendation)
        return recommendations

    def _analyze_instance(self, instance: Dict[str, Any]) -> Optional[RightSizeRecommendation]:
        """Analyze a single instance; None means no change is recommended"""
        instance_id = instance['InstanceId']
        instance_type = instance['InstanceType']
        cpu_util = self._get_cpu_utilization(instance_id)
        memory_util = self._get_memory_utilization(instance_id)
        # Missing metrics are not evidence of idleness, so skip rather than downsize
        if cpu_util is None or memory_util is None:
            return None
        if cpu_util < 20 and memory_util < 30:
            recommended_type = self._recommend_smaller_instance(instance_type)
            if recommended_type != instance_type:
                return RightSizeRecommendation(
                    instance_id=instance_id,
                    current_type=instance_type,
                    recommended_type=recommended_type,
                    current_cost=self._get_instance_cost(instance_type),
                    recommended_cost=self._get_instance_cost(recommended_type),
                    cpu_utilization=cpu_util,
                    memory_utilization=memory_util,
                )
        return None

    def _get_average(self, namespace: str, metric_name: str,
                     instance_id: str) -> Optional[float]:
        """Mean of daily averages over the lookback window, or None without data"""
        end = datetime.now(timezone.utc)
        response = self.cloudwatch.get_metric_statistics(
            Namespace=namespace,
            MetricName=metric_name,
            Dimensions=[{'Name': 'InstanceId', 'Value': instance_id}],
            StartTime=end - timedelta(days=self.lookback_days),
            EndTime=end,
            Period=86400,
            Statistics=['Average'],
        )
        points = response['Datapoints']
        if not points:
            return None
        return sum(p['Average'] for p in points) / len(points)

    def _get_cpu_utilization(self, instance_id: str) -> Optional[float]:
        """Average CPU utilization (published by EC2 by default)"""
        return self._get_average('AWS/EC2', 'CPUUtilization', instance_id)

    def _get_memory_utilization(self, instance_id: str) -> Optional[float]:
        """Average memory utilization; requires the CloudWatch agent (Linux metric name).
        The query dimensions must match the agent's append_dimensions configuration."""
        return self._get_average('CWAgent', 'mem_used_percent', instance_id)

    def _recommend_smaller_instance(self, current_type: str) -> str:
        """Recommend a smaller instance type in the same family"""
        smaller_instances = {
            'm5.4xlarge': 'm5.xlarge',
            'm5.2xlarge': 'm5.large',
            'c5.4xlarge': 'c5.xlarge',
            'c5.2xlarge': 'c5.large',
            'r5.4xlarge': 'r5.xlarge',
            'r5.2xlarge': 'r5.large',
        }
        return smaller_instances.get(current_type, current_type)

    def _get_instance_cost(self, instance_type: str) -> float:
        """On-demand Linux hourly cost (simplified, us-east-1)"""
        pricing = {
            'm5.large': 0.096, 'm5.xlarge': 0.192, 'm5.2xlarge': 0.384, 'm5.4xlarge': 0.768,
            'c5.large': 0.085, 'c5.xlarge': 0.17, 'c5.2xlarge': 0.34, 'c5.4xlarge': 0.68,
            'r5.large': 0.126, 'r5.xlarge': 0.252, 'r5.2xlarge': 0.504, 'r5.4xlarge': 1.008,
        }
        return pricing.get(instance_type, 0.0)

    def calculate_savings(self, recommendations: List[RightSizeRecommendation]) -> float:
        """Total potential monthly savings (730 hours/month)"""
        total_savings = 0.0
        for rec in recommendations:
            total_savings += (rec.current_cost - rec.recommended_cost) * 730
        return total_savings
```
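The rule `cpu_util < 20` looks only at an average, and the mapping jumps two sizes at once (for example `m5.4xlarge`, 16 vCPUs, to `m5.xlarge`, 4 vCPUs). A common refinement is to size on a high percentile and project it onto the smaller instance before recommending the change. The sketch below assumes CPU demand scales inversely with vCPU count and that raw utilization samples are available (for example 5-minute CloudWatch datapoints); `fits_after_downsize` and the 70% ceiling are hypothetical choices, not AWS guidance.

```python
import math
from typing import Sequence

# vCPU counts for the m5/c5/r5 sizes used in the mapping above
VCPUS = {"large": 2, "xlarge": 4, "2xlarge": 8, "4xlarge": 16}


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile, pct in 0-100."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def projected_cpu(samples: Sequence[float], current: str, target: str,
                  pct: float = 95) -> float:
    """Project the pct-th percentile CPU onto the target size (load scales with vCPUs)."""
    cur_vcpus = VCPUS[current.split(".")[1]]
    tgt_vcpus = VCPUS[target.split(".")[1]]
    return percentile(samples, pct) * cur_vcpus / tgt_vcpus


def fits_after_downsize(samples: Sequence[float], current: str, target: str,
                        ceiling: float = 70.0) -> bool:
    """True when the projected p95 CPU stays at or under the ceiling."""
    return projected_cpu(samples, current, target) <= ceiling


# Average is 6.5%, which passes a "< 20%" rule, but the p95 is 20%
samples = [5.0] * 90 + [20.0] * 10
print(projected_cpu(samples, "m5.4xlarge", "m5.xlarge"))
print(fits_after_downsize(samples, "m5.4xlarge", "m5.xlarge"))
print(fits_after_downsize(samples, "m5.4xlarge", "m5.2xlarge"))
```

In this example the 4x downsize would push p95 CPU to 80%, while a one-step move to `m5.2xlarge` keeps it at 40%. Memory deserves the same percentile treatment, since an out-of-memory condition fails harder than CPU saturation.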
Reserved Instance Management
```python
# Reserved Instance management
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Any, Dict, List

import boto3


@dataclass
class RIRecommendation:
    instance_type: str
    region: str
    term: str                 # '1yr' or '3yr'
    payment_option: str       # 'ALL_UPFRONT', 'PARTIAL_UPFRONT' or 'NO_UPFRONT'
    recommended_count: float
    savings_percent: float
    monthly_savings: float


def _last_30_days() -> Dict[str, str]:
    """Cost Explorer time period for the last 30 days ('End' is exclusive)"""
    end = datetime.now(timezone.utc).date()
    return {'Start': (end - timedelta(days=30)).isoformat(), 'End': end.isoformat()}


class RIManager:
    """Reserved Instance manager"""

    TERMS = {'1yr': 'ONE_YEAR', '3yr': 'THREE_YEARS'}

    def __init__(self):
        self.ce = boto3.client('ce')

    def get_ri_recommendations(self, term: str = '1yr',
                               payment_option: str = 'ALL_UPFRONT') -> List[RIRecommendation]:
        """EC2 RI purchase recommendations based on the last 30 days of usage"""
        response = self.ce.get_reservation_purchase_recommendation(
            Service='Amazon Elastic Compute Cloud - Compute',
            AccountScope='PAYER',
            LookbackPeriodInDays='THIRTY_DAYS',  # an enum string, not an integer
            TermInYears=self.TERMS[term],
            PaymentOption=payment_option,
        )  # first page only; follow NextPageToken in large organizations
        recommendations = []
        for rec in response.get('Recommendations', []):
            # Per-instance-type details are nested under RecommendationDetails
            for detail in rec.get('RecommendationDetails', []):
                ec2 = detail['InstanceDetails']['EC2InstanceDetails']
                recommendations.append(RIRecommendation(
                    instance_type=ec2['InstanceType'],
                    region=ec2['Region'],
                    term=term,
                    payment_option=payment_option,
                    # Numeric fields are returned as strings
                    recommended_count=float(detail['RecommendedNumberOfInstancesToPurchase']),
                    savings_percent=float(detail['EstimatedMonthlySavingsPercentage']),
                    monthly_savings=float(detail['EstimatedMonthlySavingsAmount']),
                ))
        return recommendations

    def get_current_ri_utilization(self) -> Dict[str, Any]:
        """RI utilization report for the last 30 days"""
        return self.ce.get_reservation_utilization(TimePeriod=_last_30_days())

    def get_ri_coverage(self) -> Dict[str, Any]:
        """RI coverage report for the last 30 days, grouped by instance type"""
        return self.ce.get_reservation_coverage(
            TimePeriod=_last_30_days(),
            GroupBy=[{'Type': 'DIMENSION', 'Key': 'INSTANCE_TYPE'}],
        )

    def calculate_ri_savings(self, instance_type: str, count: int,
                             term: str = '1yr') -> float:
        """Monthly savings of `count` RIs vs. on-demand, assuming full utilization"""
        if term not in self.TERMS:
            raise ValueError(f"term must be one of {sorted(self.TERMS)}")
        # Illustrative hourly rates (USD); check current pricing before purchasing
        pricing = {
            'm5.large': {'od': 0.096, 'ri_1yr': 0.058, 'ri_3yr': 0.038},
            'm5.xlarge': {'od': 0.192, 'ri_1yr': 0.116, 'ri_3yr': 0.076},
            'c5.large': {'od': 0.085, 'ri_1yr': 0.051, 'ri_3yr': 0.033},
            'c5.xlarge': {'od': 0.17, 'ri_1yr': 0.102, 'ri_3yr': 0.066},
        }
        if instance_type not in pricing:
            return 0.0
        rates = pricing[instance_type]
        return (rates['od'] - rates[f'ri_{term}']) * 730 * count
```
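`calculate_ri_savings` assumes every reserved hour is used. An RI is billed for every hour of its term whether or not a matching instance runs, so its real value depends on utilization: with on-demand price P_od, RI price P_ri and utilization u, monthly savings are 730 × (P_od × u − P_ri), and the break-even point is u = P_ri / P_od. A small sketch using the example rates from the cost-model section ($0.10 on-demand, $0.06 one-year RI); the helper names are illustrative.

```python
HOURS_PER_MONTH = 730


def ri_monthly_savings(p_od: float, p_ri: float, utilization: float) -> float:
    """Savings of one RI vs. paying on-demand only for the hours actually used.

    The RI is charged for every hour; on-demand is charged only for the used fraction.
    """
    return round((p_od * utilization - p_ri) * HOURS_PER_MONTH, 2)


def break_even_utilization(p_od: float, p_ri: float) -> float:
    """Utilization below which the RI costs more than on-demand."""
    return p_ri / p_od


print(break_even_utilization(0.10, 0.06))  # 1-year example rates
for u in (1.0, 0.8, 0.4):
    print(u, ri_monthly_savings(0.10, 0.06, u))
```

A 40%-discount reservation only pays off when a matching instance runs at least 60% of the hours, which is why RI utilization (`get_reservation_utilization`) deserves as much attention as coverage.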
Spot Instance Strategy
```python
# Spot instance management
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Any, Dict, List, Optional

import boto3

HOURS_PER_MONTH = 730


@dataclass
class SpotRecommendation:
    instance_type: str
    spot_price: float
    on_demand_price: float
    savings_percent: float
    interruption_frequency: str


class SpotManager:
    """Spot instance manager"""

    # Simplified on-demand Linux prices (USD/hour, us-east-1)
    ON_DEMAND = {'m5.large': 0.096, 'm5.xlarge': 0.192, 'c5.large': 0.085, 'c5.xlarge': 0.17}

    def __init__(self):
        self.ec2 = boto3.client('ec2')

    def get_spot_price_history(self, instance_type: Optional[str] = None,
                               days: int = 7) -> List[Dict[str, Any]]:
        """Spot price records for the last `days` days, across all AZs"""
        kwargs: Dict[str, Any] = {
            'StartTime': datetime.now(timezone.utc) - timedelta(days=days),
            'ProductDescriptions': ['Linux/UNIX'],
        }
        if instance_type:
            kwargs['InstanceTypes'] = [instance_type]
        history: List[Dict[str, Any]] = []
        paginator = self.ec2.get_paginator('describe_spot_price_history')
        for page in paginator.paginate(**kwargs):
            history.extend(page['SpotPriceHistory'])
        return history

    def get_spot_recommendations(self) -> List[SpotRecommendation]:
        """Compare recent average Spot prices with on-demand prices"""
        recommendations = []
        for instance_type, on_demand_price in self.ON_DEMAND.items():
            spot_prices = self.get_spot_price_history(instance_type)
            if not spot_prices:
                continue
            # Simple mean of price records across AZs (not time-weighted)
            avg_spot_price = sum(float(p['SpotPrice']) for p in spot_prices) / len(spot_prices)
            savings_percent = (on_demand_price - avg_spot_price) / on_demand_price * 100
            recommendations.append(SpotRecommendation(
                instance_type=instance_type,
                spot_price=avg_spot_price,
                on_demand_price=on_demand_price,
                savings_percent=savings_percent,
                # Not returned by this API; check the Spot Instance Advisor
                interruption_frequency='unknown',
            ))
        return recommendations

    def calculate_spot_savings(self, recommendations: List[SpotRecommendation]) -> float:
        """Monthly savings assuming one instance of each recommended type"""
        return sum((r.on_demand_price - r.spot_price) * HOURS_PER_MONTH
                   for r in recommendations)
```
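Spot Instance Strategy
Use Spot Instances only for fault-tolerant, flexible workloads such as stateless services, batch jobs, and CI. EC2 can reclaim Spot capacity at short notice, so handle interruption notices and spread capacity across multiple instance types and Availability Zones.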
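EC2 announces a Spot interruption roughly two minutes in advance through the instance metadata path `spot/instance-action` (it returns HTTP 404 until a notice exists). The sketch below polls that path with IMDSv2 using only the standard library and must run on the instance itself; the `drain` callback is a hypothetical hook where a service would deregister from its load balancer and checkpoint work.

```python
import json
import time
import urllib.error
import urllib.request
from datetime import datetime, timezone
from typing import Callable, Optional

IMDS = "http://169.254.169.254/latest"


def imds_token(ttl_seconds: int = 21600) -> str:
    """Fetch an IMDSv2 session token."""
    req = urllib.request.Request(
        f"{IMDS}/api/token", method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": str(ttl_seconds)},
    )
    with urllib.request.urlopen(req, timeout=2) as resp:
        return resp.read().decode()


def parse_instance_action(body: str) -> tuple[str, datetime]:
    """Parse a notice such as {"action": "terminate", "time": "2017-09-18T08:22:00Z"}."""
    data = json.loads(body)
    when = datetime.strptime(data["time"], "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    return data["action"], when


def check_interruption(token: str) -> Optional[tuple[str, datetime]]:
    """Return (action, time) if an interruption is scheduled, else None (HTTP 404)."""
    req = urllib.request.Request(
        f"{IMDS}/meta-data/spot/instance-action",
        headers={"X-aws-ec2-metadata-token": token},
    )
    try:
        with urllib.request.urlopen(req, timeout=2) as resp:
            return parse_instance_action(resp.read().decode())
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return None
        raise


def watch(drain: Callable[[str, datetime], None], interval: float = 5.0) -> None:
    """Poll until a notice appears, then call the hypothetical drain hook once."""
    token, fetched = imds_token(), time.monotonic()
    while True:
        if time.monotonic() - fetched > 3600:  # refresh well before the 6-hour TTL
            token, fetched = imds_token(), time.monotonic()
        notice = check_interruption(token)
        if notice:
            drain(*notice)
            return
        time.sleep(interval)

# On the instance: watch(lambda action, when: print("draining before", action, when))
```

The same notice is also published as an EventBridge event, which suits central handling. Diversifying across several instance types and Availability Zones, for example with an Auto Scaling group's mixed instances policy, reduces the chance that all capacity is reclaimed at once.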
Scheduling Optimization
```python
# Scheduling for cost optimization
from datetime import datetime, timezone
from typing import List, Optional

import boto3

NON_PROD_ENVIRONMENTS = ['staging', 'dev']


class ScheduleManager:
    """Resource scheduling manager (stop side only; a matching job must start
    resources again at the beginning of business hours)"""

    # Business hours: 09:00-18:00 UTC, Monday-Friday (9 hours/day, 5 days/week)
    BUSINESS_START_HOUR = 9
    BUSINESS_END_HOUR = 18

    def __init__(self):
        self.ec2 = boto3.client('ec2')
        self.rds = boto3.client('rds')

    def schedule_non_production(self) -> List[str]:
        """Stop running non-production EC2 instances outside business hours"""
        if not self._is_outside_business_hours():
            return []
        instance_ids: List[str] = []
        paginator = self.ec2.get_paginator('describe_instances')
        for page in paginator.paginate(Filters=[
            {'Name': 'tag:Environment', 'Values': NON_PROD_ENVIRONMENTS},
            {'Name': 'instance-state-name', 'Values': ['running']},
        ]):
            for reservation in page['Reservations']:
                instance_ids.extend(i['InstanceId'] for i in reservation['Instances'])
        # Instances in Auto Scaling groups should be scaled via the group instead,
        # or the group will replace them as unhealthy
        for start in range(0, len(instance_ids), 100):
            self.ec2.stop_instances(InstanceIds=instance_ids[start:start + 100])
        return instance_ids

    def schedule_development_databases(self) -> List[str]:
        """Stop available dev/staging RDS instances outside business hours"""
        if not self._is_outside_business_hours():
            return []
        stopped: List[str] = []
        paginator = self.rds.get_paginator('describe_db_instances')
        for page in paginator.paginate():
            for db in page['DBInstances']:
                # DescribeDBInstances cannot filter on tags, so check TagList here
                tags = {t['Key']: t['Value'] for t in db.get('TagList', [])}
                if (tags.get('Environment') in NON_PROD_ENVIRONMENTS
                        and db['DBInstanceStatus'] == 'available'
                        and not db.get('DBClusterIdentifier')):  # clusters use stop_db_cluster
                    # RDS automatically restarts a stopped instance after 7 days
                    self.rds.stop_db_instance(DBInstanceIdentifier=db['DBInstanceIdentifier'])
                    stopped.append(db['DBInstanceIdentifier'])
        return stopped

    def _is_outside_business_hours(self, now: Optional[datetime] = None) -> bool:
        """True on weekends and outside 09:00-18:00 UTC on weekdays"""
        now = now or datetime.now(timezone.utc)
        if now.weekday() >= 5:  # Saturday=5, Sunday=6
            return True
        return not (self.BUSINESS_START_HOUR <= now.hour < self.BUSINESS_END_HOUR)

    def calculate_savings(self, instance_count: int = 100,
                          hourly_cost: float = 0.20) -> float:
        """Monthly savings of a 9x5 schedule versus running 24/7"""
        hours_per_week_full = 24 * 7  # 168 hours
        hours_per_week_business = 9 * 5  # 45 hours
        reduction = 1 - hours_per_week_business / hours_per_week_full  # about 73%
        monthly_cost_full = instance_count * hourly_cost * 730  # $14,600 with the defaults
        return monthly_cost_full * reduction  # about $10,689 with the defaults
```
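The schedule above is pinned to UTC, which puts "business hours" at the wrong time for teams in other time zones, and it only ever stops resources. One option is to evaluate a per-team window in local time and return the desired action for both the start and stop jobs. A minimal sketch, assuming a hypothetical `Schedule` definition (start hour, end hour, weekdays, time zone) per team; the example uses a fixed UTC offset for determinism, whereas `zoneinfo.ZoneInfo` would handle daylight saving time in practice.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone, tzinfo
from typing import FrozenSet


@dataclass(frozen=True)
class Schedule:
    """Hypothetical per-team schedule: run from start_hour to end_hour, local time."""
    start_hour: int
    end_hour: int
    weekdays: FrozenSet[int] = frozenset(range(5))  # Monday=0 .. Friday=4
    tz: tzinfo = timezone.utc

    def should_run(self, now: datetime) -> bool:
        """Desired state at a timezone-aware `now`: True means running."""
        local = now.astimezone(self.tz)
        return local.weekday() in self.weekdays and self.start_hour <= local.hour < self.end_hour

    def weekly_hours(self) -> int:
        return (self.end_hour - self.start_hour) * len(self.weekdays)


def desired_action(schedule: Schedule, is_running: bool, now: datetime) -> str:
    """Return 'start', 'stop', or 'none' for a resource given its current state."""
    want = schedule.should_run(now)
    if want and not is_running:
        return "start"
    if not want and is_running:
        return "stop"
    return "none"


# A team at UTC-5 working 08:00-18:00 local time on weekdays
team = Schedule(start_hour=8, end_hour=18, tz=timezone(timedelta(hours=-5)))
now = datetime(2024, 3, 4, 14, 0, tzinfo=timezone.utc)  # Monday 09:00 local
print(desired_action(team, is_running=False, now=now))
print(1 - team.weekly_hours() / 168)  # fraction of the week spent stopped
```

A 10-hour weekday window leaves resources stopped for 118 of 168 weekly hours, about 70%, in line with the scheduling figure in the summary table.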
Cost Monitoring Dashboard
```python
# Cost monitoring and alerting
from datetime import datetime, timedelta, timezone

import boto3

ALERT_TOPIC_ARN = 'arn:aws:sns:us-east-1:123456789012:cost-alerts'  # example ARN


class CostMonitor:
    """Cost monitoring and alerting"""

    def __init__(self, daily_threshold: float = 40_000.0):
        # A $1M/month bill averages about $33k/day; this default is ~20% above that
        self.ce = boto3.client('ce')
        self.sns = boto3.client('sns')
        self.budgets = boto3.client('budgets')
        self.account_id = boto3.client('sts').get_caller_identity()['Account']
        self.daily_threshold = daily_threshold

    def create_cost_budget(self, budget_amount: float):
        """Monthly budget for production-tagged spend, alerting at 80% of actual"""
        self.budgets.create_budget(
            AccountId=self.account_id,
            Budget={
                'BudgetName': 'monthly-cost-budget',
                'BudgetLimit': {'Amount': str(budget_amount), 'Unit': 'USD'},
                'TimeUnit': 'MONTHLY',
                'BudgetType': 'COST',
                # Tag filters use 'user:<TagKey>$<TagValue>'; the tag must first be
                # activated as a cost allocation tag in the Billing console
                'CostFilters': {'TagKeyValue': ['user:Environment$production']},
            },
            NotificationsWithSubscribers=[{
                'Notification': {
                    'NotificationType': 'ACTUAL',
                    'ComparisonOperator': 'GREATER_THAN',
                    'Threshold': 80.0,
                    'ThresholdType': 'PERCENTAGE',
                },
                # The SNS topic policy must allow budgets.amazonaws.com to publish
                'Subscribers': [{'SubscriptionType': 'SNS', 'Address': ALERT_TOPIC_ARN}],
            }],
        )

    def monitor_daily_costs(self) -> float:
        """Check yesterday's total cost against the daily threshold"""
        today = datetime.now(timezone.utc).date()
        response = self.ce.get_cost_and_usage(
            TimePeriod={
                'Start': (today - timedelta(days=1)).isoformat(),
                'End': today.isoformat(),  # End is exclusive
            },
            Granularity='DAILY',
            Metrics=['UnblendedCost'],
        )
        total_cost = sum(
            float(result['Total']['UnblendedCost']['Amount'])
            for result in response['ResultsByTime']
        )
        # Cost Explorer refreshes at least daily, so yesterday may still be incomplete
        if total_cost > self.daily_threshold:
            self._send_cost_alert(total_cost)
        return total_cost

    def _send_cost_alert(self, cost: float):
        """Publish a cost alert to SNS"""
        self.sns.publish(
            TopicArn=ALERT_TOPIC_ARN,
            Message=f'Daily cost exceeded threshold: ${cost:,.2f}',
            Subject='Cost Alert',
        )
```
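A fixed daily threshold has to be retuned as spend grows; at $1M per month the average day is already about $33,000. An alternative is to compare each day with its own recent history. The sketch below flags a day whose cost exceeds the trailing mean by more than `k` standard deviations; it is a simple z-score rule, not the method AWS Cost Anomaly Detection uses, and the window length and `k = 3` are assumptions to tune. The daily totals would come from `get_cost_and_usage` with `Granularity='DAILY'`; the history below is hypothetical.

```python
import statistics
from typing import Sequence


def is_cost_anomaly(history: Sequence[float], today: float, k: float = 3.0,
                    min_days: int = 7) -> bool:
    """True when `today` exceeds mean + k * stdev of the trailing daily costs."""
    if len(history) < min_days:
        return False  # not enough history to judge
    mean = statistics.fmean(history)
    stdev = statistics.stdev(history)
    return today > mean + k * stdev


# Two weeks of hypothetical daily totals around $33k (a $1M/month bill)
history = [33_000, 32_500, 34_100, 33_800, 32_900, 33_300, 33_600,
           33_200, 32_800, 34_000, 33_400, 33_100, 32_700, 33_900]
print(is_cost_anomaly(history, 34_500))  # normal variation
print(is_cost_anomaly(history, 45_000))  # spike worth an alert
```

Weekly seasonality (for example lower weekend spend) inflates the standard deviation of a mixed window; comparing against the same weekday in previous weeks is a common adjustment.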
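Cost Optimization Benefits
Combining right-sizing, reserved instances, spot instances, and scheduling can reduce cloud spend by roughly 50-75%. The levers overlap (right-sizing and scheduling shrink the usage that reservations and Spot then discount), so the top of that range is a best case rather than a forecast.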
Summary
| Strategy | Typical savings (on affected spend) | Risk level | Complexity |
|---|---|---|---|
| Right-Sizing | 20-30% | Low | Medium |
| Reserved Instances | 40-60% | Medium | Low |
| Spot Instances | 70-90% | High | High |
| Scheduling | 60-70% | Low | Low |
| Storage Tiering | 50-70% | Low | Medium |