Cloud Cost Optimization
Difficulty: Senior Level | Companies: AWS, Google, Microsoft, Netflix, Uber
Cost Optimization Strategy
Cloud costs can be optimized without sacrificing performance. The key is matching resources to actual usage patterns.
Most organizations overspend on cloud by 30-50%. Regular optimization reviews can recover much of that spend without reducing capacity.
Cost Optimization Pillars
Architecture Diagram
+-----------------------------------------------------------------------------------------+
|                               Cost Optimization Strategy                                |
+---------------------+-------------------+----------------------+------------------------+
| Right-Sizing        | Pricing Models    | Reservations         | Architecture           |
+---------------------+-------------------+----------------------+------------------------+
| • Instance type     | • On-Demand       | • Reserved Instances | • Serverless           |
| • Storage class     | • Spot            | • Savings Plans      | • Auto-scaling         |
| • Network bandwidth | • Savings Plans   | • Commit discounts   | • Spot fleets          |
|                     | • Dedicated hosts |                      | • Right-size databases |
+---------------------+-------------------+----------------------+------------------------+
Pattern 1: Right-Sizing Analysis
Identify and resize over-provisioned resources.
# Right-sizing analysis with AWS Compute Optimizer
import boto3
from datetime import datetime


class RightSizer:
    # Helper methods prefixed with "_" are not shown in this excerpt.
    def __init__(self):
        self.compute_optimizer = boto3.client('compute-optimizer')
        self.cloudwatch = boto3.client('cloudwatch')

    def analyze_ec2_instances(self) -> list:
        """Get right-sizing recommendations, largest savings first."""
        # Only the first page is read here; follow 'nextToken' for large fleets.
        response = self.compute_optimizer.get_ec2_instance_recommendations()
        recommendations = []
        for rec in response['instanceRecommendations']:
            current = rec['currentInstanceType']
            recommended = rec['recommendationOptions'][0]
            # savingsOpportunity may be missing, so fall back to zero savings
            opportunity = recommended.get('savingsOpportunity') or {}
            recommendations.append({
                'instanceId': rec['instanceArn'].split('/')[-1],
                'currentType': current,
                'recommendedType': recommended['instanceType'],
                # estimatedMonthlySavings is a {'currency', 'value'} structure
                'savings': opportunity.get('estimatedMonthlySavings', {}).get('value', 0.0),
                'savingsPercent': opportunity.get('savingsOpportunityPercentage', 0.0),
                'performanceRisk': recommended.get('performanceRisk', 'N/A'),
            })
        return sorted(recommendations, key=lambda x: x['savings'], reverse=True)

    def analyze_idle_resources(self) -> list:
        """Find idle or underutilized resources."""
        idle_resources = []
        # Check for low-utilization EC2 instances
        instances = self._get_running_instances()
        for instance in instances:
            cpu_avg = self._get_cpu_average(instance['InstanceId'], days=14)
            network_avg = self._get_network_average(instance['InstanceId'], days=14)
            if cpu_avg < 5 and network_avg < 10:  # Less than 5% CPU, 10% network
                idle_resources.append({
                    'resource': instance['InstanceId'],
                    'type': 'EC2',
                    'currentType': instance['InstanceType'],
                    'utilization': f"CPU: {cpu_avg}%, Network: {network_avg}%",
                    'recommendation': 'Consider stopping or downsizing',
                    'estimatedSavings': self._estimate_savings(instance),
                })
        # Check for unattached EBS volumes
        volumes = self._get_unattached_volumes()
        for volume in volumes:
            idle_resources.append({
                'resource': volume['VolumeId'],
                'type': 'EBS',
                'size': f"{volume['Size']} GB",
                'recommendation': 'Delete if not needed',
                # gp3 price per GB-month in us-east-1; varies by region
                'estimatedSavings': volume['Size'] * 0.08,
            })
        return idle_resources
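The idle check above relies on helpers that the excerpt does not define. As a rough sketch only, assuming `_get_cpu_average` works from CloudWatch-style datapoints (dictionaries with an `Average` key), the averaging and threshold test might look like the following. The function names, constants, and sample data are illustrative and not part of the original class.

```python
from statistics import mean
from typing import Optional

CPU_IDLE_THRESHOLD = 5.0       # percent, same cutoff as the idle check
NETWORK_IDLE_THRESHOLD = 10.0  # same cutoff as the idle check


def average_of_datapoints(datapoints: list) -> Optional[float]:
    """Mean of the 'Average' field, or None when no datapoints were returned."""
    values = [dp['Average'] for dp in datapoints]
    return mean(values) if values else None


def is_idle(cpu_avg: Optional[float], network_avg: Optional[float]) -> bool:
    """Flag an instance only when both metrics exist and sit below the cutoffs."""
    if cpu_avg is None or network_avg is None:
        return False  # missing data is not evidence of idleness
    return cpu_avg < CPU_IDLE_THRESHOLD and network_avg < NETWORK_IDLE_THRESHOLD


# Hypothetical datapoints for one instance
sample_cpu = [{'Average': 2.0}, {'Average': 4.0}, {'Average': 3.0}]
cpu_avg = average_of_datapoints(sample_cpu)
print(cpu_avg, is_idle(cpu_avg, network_avg=1.5))
```

Treating missing metrics as "not idle" is a deliberate choice here: an instance with no datapoints may simply be new or unmonitored, and flagging it for shutdown would be risky.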
Use AWS Compute Optimizer or GCP Recommender for automated right-sizing recommendations. Review monthly.
Pattern 2: Spot Instance Strategy
Use spot instances for fault-tolerant workloads.
# Auto Scaling group with a Spot and On-Demand mix (CloudFormation)
Resources:
  AppAutoScalingGroup:
    Type: AWS::AutoScaling::AutoScalingGroup
    Properties:
      AutoScalingGroupName: app-spot-fleet
      # Subnets (VPCZoneIdentifier) are also required; omitted here
      MinSize: "2"
      MaxSize: "20"
      DesiredCapacity: "6"
      MixedInstancesPolicy:
        InstancesDistribution:
          OnDemandAllocationStrategy: prioritized
          OnDemandBaseCapacity: 2  # Base capacity always on-demand
          OnDemandPercentageAboveBaseCapacity: 20  # 20% on-demand above base
          # 'diversified' is a Spot Fleet value; in an Auto Scaling group,
          # SpotInstancePools applies only with the lowest-price strategy
          SpotAllocationStrategy: lowest-price
          SpotInstancePools: 3  # Use multiple pools for availability
          SpotMaxPrice: "0.10"  # Optional max price
        LaunchTemplate:
          LaunchTemplateSpecification:
            LaunchTemplateName: app-launch-template
            # Assumes a LaunchTemplate resource defined in the same template
            Version: !GetAtt LaunchTemplate.LatestVersionNumber
          Overrides:
            - InstanceType: m5.large
            - InstanceType: m5a.large
            - InstanceType: m4.large
            - InstanceType: c5.large
            - InstanceType: c5a.large
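To make the effect of `OnDemandBaseCapacity` and `OnDemandPercentageAboveBaseCapacity` concrete, here is a small arithmetic sketch of how a desired capacity splits between On-Demand and Spot. The function `split_capacity` is hypothetical, and the rounding rule (rounding the On-Demand share up) is an assumption of this sketch; check the Auto Scaling documentation for the service's exact behavior.

```python
import math


def split_capacity(desired: int, on_demand_base: int, on_demand_pct_above_base: int) -> dict:
    """Split desired capacity into On-Demand and Spot instance counts."""
    base = min(on_demand_base, desired)
    above_base = desired - base
    # Assumption: the On-Demand share above the base is rounded up
    extra_on_demand = math.ceil(above_base * on_demand_pct_above_base / 100)
    on_demand = base + extra_on_demand
    return {'on_demand': on_demand, 'spot': desired - on_demand}


# Values from the template: desired 6, base 2, 20% On-Demand above base
print(split_capacity(6, 2, 20))
```

Under that rounding assumption, the template's desired capacity of 6 yields 3 On-Demand and 3 Spot instances; at the maximum of 20 it yields 6 On-Demand and 14 Spot.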
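// Spot instance interruption handler
import { AutoScalingClient, DetachInstancesCommand } from '@aws-sdk/client-auto-scaling';
import {
  ElasticLoadBalancingV2Client,
  DeregisterTargetsCommand,
} from '@aws-sdk/client-elastic-load-balancing-v2';
import type { SNSEvent } from 'aws-lambda';

export class SpotInterruptionHandler {
  constructor(
    private asgClient: AutoScalingClient,
    // Load balancer deregistration goes through the ELBv2 client, not SQS
    private elbClient: ElasticLoadBalancingV2Client,
    private targetGroupArn: string, // assumed: target group in front of the ASG
  ) {}

  async handleInterruption(event: SNSEvent): Promise<void> {
    const message = JSON.parse(event.Records[0].Sns.Message);
    if (message['detail-type'] === 'EC2 Spot Instance Interruption Warning') {
      // Bracket access is required because the key contains a hyphen
      const instanceId: string = message.detail['instance-id'];
      // Drain the instance gracefully
      await this.drainInstance(instanceId);
      // The ASG will automatically replace it
      console.log(`Instance ${instanceId} marked for interruption`);
    }
  }

  private async drainInstance(instanceId: string): Promise<void> {
    // Remove from load balancer
    await this.elbClient.send(new DeregisterTargetsCommand({
      TargetGroupArn: this.targetGroupArn,
      Targets: [{ Id: instanceId }],
    }));
    // Wait for in-flight requests to complete
    await new Promise(resolve => setTimeout(resolve, 30000));
    // Detach from Auto Scaling; keeping desired capacity triggers a replacement
    await this.asgClient.send(new DetachInstancesCommand({
      AutoScalingGroupName: 'app-spot-fleet',
      InstanceIds: [instanceId],
      ShouldDecrementDesiredCapacity: false,
    }));
  }
}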
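The parsing step in the handler above is easy to get wrong because the EventBridge event arrives as a JSON string inside the SNS record and the instance ID key contains a hyphen. The following Python sketch isolates just that step. The function name `interrupted_instance_id` and the sample event are illustrative assumptions, not part of the original handler.

```python
import json
from typing import Optional

SPOT_WARNING = 'EC2 Spot Instance Interruption Warning'


def interrupted_instance_id(sns_event: dict) -> Optional[str]:
    """Return the instance ID from an SNS-wrapped Spot interruption warning."""
    # The SNS record carries the EventBridge event as a JSON string
    message = json.loads(sns_event['Records'][0]['Sns']['Message'])
    if message.get('detail-type') != SPOT_WARNING:
        return None  # some other event type; nothing to drain
    return message.get('detail', {}).get('instance-id')


# Hypothetical event, shaped like an SNS delivery of the EventBridge warning
sample_event = {'Records': [{'Sns': {'Message': json.dumps({
    'detail-type': SPOT_WARNING,
    'detail': {'instance-id': 'i-0123456789abcdef0', 'instance-action': 'terminate'},
})}}]}

print(interrupted_instance_id(sample_event))
```

Returning `None` for unrelated events keeps the caller's decision simple: drain only when an instance ID comes back.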
Pattern 3: Savings Plans Commitment
Calculate optimal commitment levels.
# Savings Plans analysis
import boto3


class SavingsPlanAnalyzer:
    # Helper methods not defined below are omitted from this excerpt.
    def __init__(self):
        self.ce_client = boto3.client('ce')

    def analyze_commitment(self, lookback_days: int = 30) -> dict:
        """Analyze optimal Savings Plans commitment."""
        # Get usage history
        usage = self._get_usage_history(lookback_days)
        # Calculate baseline (always-on) usage
        baseline = self._calculate_baseline(usage)
        # Recommend commitment level
        recommendation = {
            'baseline_hours': baseline['hours'],
            'recommended_commitment': self._calculate_optimal_commitment(baseline),
            'estimated_savings': self._calculate_savings(baseline),
            'break_even_months': self._calculate_break_even(baseline),
        }
        return recommendation

    def _calculate_baseline(self, usage: list) -> dict:
        """Calculate consistent baseline usage."""
        if not usage:
            raise ValueError('usage history is empty')
        daily_hours = [u['instance_hours'] for u in usage]
        # Use 25th percentile as baseline (conservative)
        sorted_hours = sorted(daily_hours)
        baseline_percentile = sorted_hours[len(sorted_hours) // 4]
        return {
            'hours': baseline_percentile,  # instance-hours per day
            'instance_type': usage[0]['instance_type'],
            'region': usage[0]['region'],
        }

    def _calculate_optimal_commitment(self, baseline: dict) -> dict:
        """Recommend commitment type and amount."""
        # 1-year No Upfront for flexibility
        # 3-year All Upfront for maximum savings
        monthly_hours = baseline['hours'] * 30  # daily baseline over a 30-day month
        return {
            '1yr_no_upfront': {
                'hours': monthly_hours,
                'discount': 'up to 40%',
            },
            '1yr_partial_upfront': {
                'hours': monthly_hours,
                'discount': 'up to 44%',
            },
            '3yr_all_upfront': {
                'hours': monthly_hours,
                'discount': 'up to 60%',
            },
        }
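A worked example may help show what the 25th-percentile baseline does to spiky usage. The sketch below repeats the percentile selection on made-up daily instance-hours and estimates monthly savings for the committed baseline. The hourly rate of 0.10 and the usage figures are hypothetical, and the estimate assumes the commitment is fully used at a single On-Demand rate.

```python
def baseline_hours(daily_hours: list) -> float:
    """25th-percentile daily usage, matching the index-based pick above."""
    ordered = sorted(daily_hours)
    return ordered[len(ordered) // 4]


def monthly_savings(baseline: float, on_demand_rate: float,
                    discount: float, days: int = 30) -> float:
    """Savings from covering the baseline with a commitment at a given discount."""
    return baseline * days * on_demand_rate * discount


# Hypothetical daily instance-hours, including two spike days
daily = [240, 260, 250, 300, 480, 520, 245, 255]
base = baseline_hours(daily)

# Hypothetical On-Demand rate of 0.10 per hour and a 40% discount
print(base, monthly_savings(base, on_demand_rate=0.10, discount=0.40))
```

The two spike days (480 and 520) do not move the baseline, which stays at 250 hours per day, so the commitment covers only usage that is present almost every day.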
Pattern 4: Storage Tiering
Automatically move data to cheaper storage tiers.
# S3 Intelligent-Tiering and lifecycle policies
import boto3

s3 = boto3.client('s3')

# Intelligent-Tiering archive tiers are set with their own API call, not
# inside a lifecycle rule. They apply to objects stored in the
# INTELLIGENT_TIERING storage class.
s3.put_bucket_intelligent_tiering_configuration(
    Bucket='data-lake',
    Id='intelligent-tiering',
    IntelligentTieringConfiguration={
        'Id': 'intelligent-tiering',
        'Status': 'Enabled',
        'Filter': {'Prefix': 'data/'},
        'Tierings': [
            {'AccessTier': 'ARCHIVE_ACCESS', 'Days': 90},
            {'AccessTier': 'DEEP_ARCHIVE_ACCESS', 'Days': 180},
        ],
    },
)

# Apply lifecycle rules to the existing bucket
s3.put_bucket_lifecycle_configuration(
    Bucket='data-lake',
    LifecycleConfiguration={
        'Rules': [
            {
                'ID': 'transition-to-glacier',
                'Status': 'Enabled',
                'Filter': {'Prefix': 'backups/'},
                'Transitions': [
                    {
                        'Days': 30,
                        'StorageClass': 'STANDARD_IA',
                    },
                    {
                        'Days': 90,
                        'StorageClass': 'GLACIER',
                    },
                    {
                        'Days': 365,
                        'StorageClass': 'DEEP_ARCHIVE',
                    },
                ],
                'Expiration': {'Days': 730},
            },
            {
                'ID': 'cleanup-temp',
                'Status': 'Enabled',
                'Filter': {'Prefix': 'temp/'},
                'Expiration': {'Days': 7},
            },
        ]
    },
)
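To reason about what the `backups/` rule does over an object's lifetime, it can help to model the schedule as a plain function. This is an illustrative sketch, not S3 behavior itself: `storage_class_at` is a hypothetical helper, it assumes objects are uploaded as STANDARD, and real lifecycle transitions run asynchronously, so they can lag the nominal day.

```python
# (minimum age in days, storage class) taken from the backups/ rule
BACKUP_TRANSITIONS = [(30, 'STANDARD_IA'), (90, 'GLACIER'), (365, 'DEEP_ARCHIVE')]
BACKUP_EXPIRATION_DAYS = 730


def storage_class_at(age_days: int) -> str:
    """Nominal storage class of a backups/ object at a given age."""
    if age_days >= BACKUP_EXPIRATION_DAYS:
        return 'EXPIRED'
    current = 'STANDARD'  # assumption: objects are uploaded as STANDARD
    for min_age, storage_class in BACKUP_TRANSITIONS:
        if age_days >= min_age:
            current = storage_class
    return current


for age in (0, 45, 120, 400, 800):
    print(age, storage_class_at(age))
```

A model like this is handy for estimating how much data sits in each tier at a given time before committing to the rule.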
Pattern 5: Cost Monitoring and Alerts
Set up automated cost alerts.
# AWS Budgets CloudFormation
AWSTemplateFormatVersion: '2010-09-09'
Resources:
  MonthlyBudget:
    Type: AWS::Budgets::Budget
    Properties:
      Budget:
        BudgetLimit:
          Amount: 10000
          Unit: USD
        TimeUnit: MONTHLY
        BudgetType: COST
        CostFilters:
          TagKeyValue:
            - 'user:Environment$production'
      NotificationsWithSubscribers:
        # Alert when actual spend passes 80% of the budget
        - Notification:
            NotificationType: ACTUAL
            ComparisonOperator: GREATER_THAN
            Threshold: 80
          Subscribers:
            - SubscriptionType: EMAIL
              Address: finops@example.com
            - SubscriptionType: SNS
              Address: !Ref CostAlertTopic
        # Alert when forecasted spend passes 100% of the budget
        - Notification:
            NotificationType: FORECASTED
            ComparisonOperator: GREATER_THAN
            Threshold: 100
          Subscribers:
            - SubscriptionType: EMAIL
              Address: finops@example.com
  CostAlertTopic:
    Type: AWS::SNS::Topic
    Properties:
      TopicName: cost-alerts
  # Anomaly detection (the Cost Explorer resource type is AnomalyMonitor)
  AnomalyDetector:
    Type: AWS::CE::AnomalyMonitor
    Properties:
      MonitorName: service-anomaly-monitor  # placeholder name
      MonitorType: DIMENSIONAL
      MonitorDimension: SERVICE
  AnomalySubscription:
    Type: AWS::CE::AnomalySubscription
    Properties:
      SubscriptionName: daily-cost-anomalies  # placeholder name
      Frequency: DAILY
      MonitorArnList:
        - !Ref AnomalyDetector
      Subscribers:
        - Address: finops@example.com
          Type: EMAIL
      Threshold: 100
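The two budget notifications can be read as simple percentage checks against the 10,000 USD limit. The sketch below models that logic locally so the thresholds are easy to reason about. `triggered_alerts` is a hypothetical helper for illustration; AWS Budgets evaluates these conditions itself.

```python
BUDGET_LIMIT = 10_000.0  # USD, from the MonthlyBudget resource


def triggered_alerts(actual_spend: float, forecast_spend: float,
                     limit: float = BUDGET_LIMIT) -> list:
    """Return which of the two notifications would fire for the given spend."""
    alerts = []
    # ACTUAL notification: actual spend greater than 80% of the limit
    if actual_spend > limit * 80 / 100:
        alerts.append('ACTUAL > 80%')
    # FORECASTED notification: forecast greater than 100% of the limit
    if forecast_spend > limit * 100 / 100:
        alerts.append('FORECASTED > 100%')
    return alerts


print(triggered_alerts(actual_spend=8_500, forecast_spend=9_900))
print(triggered_alerts(actual_spend=5_000, forecast_spend=12_000))
```

The forecast alert can fire early in the month while actual spend is still low, which is the point of pairing the two notification types.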
Cost Optimization Checklist
- Right-Size Monthly - Review Compute Optimizer recommendations
- Reserved Instances - Commit to 1-year for stable workloads
- Spot Instances - Use for batch and fault-tolerant workloads
- Storage Tiering - Implement lifecycle policies
- Clean Up - Delete idle resources and unattached volumes
- Monitor - Set up budget alerts and anomaly detection
Follow-Up Questions
- How do you balance cost optimization with performance requirements for latency-sensitive applications?
- What strategies would you use to implement cost showback for different teams in a shared environment?
- How do you implement automated cost enforcement without impacting development velocity?