Cloud Cost Optimization

Difficulty: Senior Level | Companies: AWS, Google, Microsoft, Netflix, Uber

Cost Optimization Strategy

Cloud costs can be optimized without sacrificing performance. The key is matching resources to actual usage patterns.

Note: Many organizations are estimated to overspend on cloud by 30-50%. Regular optimization can recover much of that spend without reducing capacity.

Cost Optimization Pillars

Architecture Diagram

┌───────────────────────────────────────────────────────────────┐
│                  Cost Optimization Strategy                   │
├───────────────┬───────────────┬───────────────┬───────────────┤
│ Right-Sizing  │ Pricing       │ Reservations  │ Architecture  │
│               │ Models        │               │               │
│ • Instance    │ • On-Demand   │ • Reserved    │ • Serverless  │
│   type        │ • Spot        │   Instances   │ • Auto-scaling│
│ • Storage     │ • Savings     │ • Savings     │ • Spot fleets │
│   class       │   Plans       │   Plans       │ • Right-size  │
│ • Network     │ • Dedicated   │ • Commit      │   databases   │
│   bandwidth   │   Hosts       │   discounts   │               │
└───────────────┴───────────────┴───────────────┴───────────────┘

Pattern 1: Right-Sizing Analysis

Identify and resize over-provisioned resources.

# Right-sizing analysis with AWS Compute Optimizer
import boto3
from datetime import datetime

class RightSizer:
    def __init__(self):
        self.compute_optimizer = boto3.client('compute-optimizer')
        self.cloudwatch = boto3.client('cloudwatch')
    
    def analyze_ec2_instances(self) -> list:
        """Get right-sizing recommendations."""
        # Only the first page is read here; follow `nextToken` for large fleets.
        response = self.compute_optimizer.get_ec2_instance_recommendations()
        
        recommendations = []
        for rec in response['instanceRecommendations']:
            options = rec.get('recommendationOptions', [])
            if not options:
                continue  # no recommendation available for this instance
            current = rec['currentInstanceType']
            recommended = options[0]  # first listed option
            savings = recommended.get('savingsOpportunity') or {}
            
            recommendations.append({
                'instanceId': rec['instanceArn'].split('/')[-1],
                'currentType': current,
                'recommendedType': recommended['instanceType'],
                # estimatedMonthlySavings is a {currency, value} structure
                'savings': savings.get('estimatedMonthlySavings', {}).get('value', 0.0),
                'savingsPercent': savings.get('savingsOpportunityPercentage', 0.0),
                'performanceRisk': recommended.get('performanceRisk', 'N/A'),
            })
        
        return sorted(recommendations, key=lambda x: x['savings'], reverse=True)
    
    def analyze_idle_resources(self) -> list:
        """Find idle or underutilized resources."""
        idle_resources = []
        
        # Check for low-utilization EC2 instances
        instances = self._get_running_instances()
        for instance in instances:
            cpu_avg = self._get_cpu_average(instance['InstanceId'], days=14)
            network_avg = self._get_network_average(instance['InstanceId'], days=14)
            
            if cpu_avg < 5 and network_avg < 10:  # Less than 5% CPU, 10% network
                idle_resources.append({
                    'resource': instance['InstanceId'],
                    'type': 'EC2',
                    'currentType': instance['InstanceType'],
                    'utilization': f"CPU: {cpu_avg}%, Network: {network_avg}%",
                    'recommendation': 'Consider stopping or downsizing',
                    'estimatedSavings': self._estimate_savings(instance),
                })
        
        # Check for unattached EBS volumes
        volumes = self._get_unattached_volumes()
        for volume in volumes:
            idle_resources.append({
                'resource': volume['VolumeId'],
                'type': 'EBS',
                'size': f"{volume['Size']} GB",
                'recommendation': 'Delete if not needed',
                # Assumes gp3 at about $0.08 per GB-month; the rate varies by region and volume type
                'estimatedSavings': volume['Size'] * 0.08,
            })
        
        return idle_resources

Note: Use AWS Compute Optimizer or GCP Recommender for automated right-sizing recommendations, and review them monthly.
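
The idle check above relies on helpers such as `_get_cpu_average` that this section leaves undefined. As a minimal sketch, assuming the helper receives datapoints shaped like the `Datapoints` list returned by CloudWatch `get_metric_statistics` (dicts with an `Average` key), the averaging and threshold logic might look like the following. The function names are hypothetical and not part of the original class.

```python
from statistics import fmean
from typing import Iterable, Mapping, Optional


def average_of_datapoints(datapoints: Iterable[Mapping[str, float]]) -> Optional[float]:
    """Mean of the 'Average' field across datapoints, or None when there is no data."""
    values = [point["Average"] for point in datapoints]
    return fmean(values) if values else None


def is_idle(cpu_datapoints: Iterable[Mapping[str, float]],
            cpu_threshold_percent: float = 5.0) -> bool:
    """True when mean CPU utilization over the window is below the threshold.

    An instance with no datapoints is not reported as idle, because missing
    metrics are not evidence of low usage.
    """
    average = average_of_datapoints(cpu_datapoints)
    return average is not None and average < cpu_threshold_percent


# Hypothetical daily CPU averages (percent) for one instance
sample = [{"Average": 2.0}, {"Average": 3.0}, {"Average": 4.0}]
print(is_idle(sample))
```

Treating missing data as "not idle" is a deliberate choice in this sketch: it avoids recommending the shutdown of instances whose metrics simply were not collected.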

Pattern 2: Spot Instance Strategy

Use spot instances for fault-tolerant workloads.

# Auto Scaling Group with Spot and On-Demand mix (CloudFormation)
Resources:
  AppAutoScalingGroup:
    Type: AWS::AutoScaling::AutoScalingGroup
    Properties:
      AutoScalingGroupName: app-spot-fleet
      VPCZoneIdentifier: !Ref SubnetIds  # assumed template parameter listing the subnets
      MinSize: "2"
      MaxSize: "20"
      DesiredCapacity: "6"
      MixedInstancesPolicy:
        InstancesDistribution:
          OnDemandAllocationStrategy: prioritized
          OnDemandBaseCapacity: 2  # Base capacity always on-demand
          OnDemandPercentageAboveBaseCapacity: 20  # 20% on-demand above base
          # Valid strategies for Auto Scaling groups: lowest-price, capacity-optimized,
          # capacity-optimized-prioritized, price-capacity-optimized
          SpotAllocationStrategy: lowest-price
          SpotInstancePools: 3  # Only applies to lowest-price; spreads across 3 pools
          SpotMaxPrice: "0.10"  # Optional max price
        LaunchTemplate:
          LaunchTemplateSpecification:
            LaunchTemplateName: app-launch-template
            # Assumes a LaunchTemplate resource is defined in this template
            Version: !GetAtt LaunchTemplate.LatestVersionNumber
          Overrides:
            - InstanceType: m5.large
            - InstanceType: m5a.large
            - InstanceType: m4.large
            - InstanceType: c5.large
            - InstanceType: c5a.large
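
To make the effect of `OnDemandBaseCapacity` and `OnDemandPercentageAboveBaseCapacity` concrete, the sketch below computes the On-Demand/Spot split for a given desired capacity. It assumes that a fractional On-Demand share is rounded up in favor of On-Demand instances; treat that rounding rule and the function name as assumptions of this example rather than a statement of the service's exact behavior.

```python
from typing import Tuple


def on_demand_spot_split(desired: int, base: int, pct_above_base: int) -> Tuple[int, int]:
    """Return (on_demand, spot) instance counts for a mixed-instances group."""
    on_demand_base = min(base, desired)
    above_base = desired - on_demand_base
    # Integer ceiling of above_base * pct_above_base / 100
    extra_on_demand = -(-above_base * pct_above_base // 100)
    on_demand = on_demand_base + extra_on_demand
    return on_demand, desired - on_demand


# Values from the template above: 6 desired, base of 2, 20% above base
print(on_demand_spot_split(6, 2, 20))  # (3, 3)
```

With the template's values, 2 instances form the On-Demand base and 20% of the remaining 4 rounds up to 1, giving 3 On-Demand and 3 Spot. At the maximum size of 20, the same rule gives 6 On-Demand and 14 Spot.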

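// Spot instance interruption handler
import { AutoScalingClient, DetachInstancesCommand } from '@aws-sdk/client-auto-scaling';
import {
  DeregisterTargetsCommand,
  ElasticLoadBalancingV2Client,
} from '@aws-sdk/client-elastic-load-balancing-v2';
import type { SNSEvent } from 'aws-lambda';

export class SpotInterruptionHandler {
  constructor(
    private asgClient: AutoScalingClient,
    private elbClient: ElasticLoadBalancingV2Client,
    private targetGroupArn: string,  // assumed: target group the instances are registered with
  ) {}

  async handleInterruption(event: SNSEvent): Promise<void> {
    const message = JSON.parse(event.Records[0].Sns.Message);
    
    if (message['detail-type'] === 'EC2 Spot Instance Interruption Warning') {
      // 'instance-id' contains a hyphen, so it needs bracket access
      const instanceId: string = message.detail['instance-id'];
      
      // Drain the instance gracefully
      await this.drainInstance(instanceId);
      
      console.log(`Instance ${instanceId} marked for interruption`);
    }
  }

  private async drainInstance(instanceId: string): Promise<void> {
    // Remove from the load balancer target group
    await this.elbClient.send(new DeregisterTargetsCommand({
      TargetGroupArn: this.targetGroupArn,
      Targets: [{ Id: instanceId }],
    }));
    
    // Wait for in-flight requests to complete
    await new Promise(resolve => setTimeout(resolve, 30000));
    
    // Detach from the Auto Scaling group. Leaving desired capacity unchanged
    // makes the ASG launch a replacement.
    await this.asgClient.send(new DetachInstancesCommand({
      AutoScalingGroupName: 'app-spot-fleet',
      InstanceIds: [instanceId],
      ShouldDecrementDesiredCapacity: false,
    }));
  }
}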

Pattern 3: Savings Plans Commitment

Calculate optimal commitment levels.

# Savings Plans analysis
import boto3

class SavingsPlanAnalyzer:
    def __init__(self):
        self.ce_client = boto3.client('ce')
    
    def analyze_commitment(self, lookback_days: int = 30) -> dict:
        """Analyze optimal Savings Plans commitment."""
        
        # Get usage history
        usage = self._get_usage_history(lookback_days)
        
        # Calculate baseline (always-on) usage
        baseline = self._calculate_baseline(usage)
        
        # Recommend commitment level
        recommendation = {
            'baseline_hours': baseline['hours'],
            'recommended_commitment': self._calculate_optimal_commitment(baseline),
            'estimated_savings': self._calculate_savings(baseline),
            'break_even_months': self._calculate_break_even(baseline),
        }
        
        return recommendation
    
    def _calculate_baseline(self, usage: list) -> dict:
        """Calculate consistent baseline usage."""
        if not usage:
            raise ValueError("usage history is empty")
        daily_hours = [u['instance_hours'] for u in usage]
        
        # Use 25th percentile as baseline (conservative)
        sorted_hours = sorted(daily_hours)
        baseline_percentile = sorted_hours[len(sorted_hours) // 4]
        
        return {
            'hours': baseline_percentile,
            'instance_type': usage[0]['instance_type'],
            'region': usage[0]['region'],
        }
    
    def _calculate_optimal_commitment(self, baseline: dict) -> dict:
        """Recommend commitment type and amount."""
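        # 1-year No Upfront for flexibility
        # 3-year All Upfront for maximum savings
        # Savings Plans are bought as a dollars-per-hour commitment, so the hour
        # counts below still need converting with the applicable hourly rate.
        
        monthly_hours = baseline['hours'] * 30  # baseline is in instance-hours per day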
        
        return {
            '1yr_no_upfront': {
                'hours': monthly_hours,
                'discount': 'up to 40%',
            },
            '1yr_partial_upfront': {
                'hours': monthly_hours,
                'discount': 'up to 44%',
            },
            '3yr_all_upfront': {
                'hours': monthly_hours,
                'discount': 'up to 60%',
            },
        }
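
The analyzer above calls `_calculate_savings` without showing how savings are derived. The sketch below is one simplified way to reason about it, assuming hourly on-demand spend figures are already available and that the plan's discount is a single flat fraction. Both function names and the flat-discount model are assumptions of this example; real Savings Plans rates vary by instance family, region, and term.

```python
from typing import Sequence


def percentile_baseline(values: Sequence[float], percentile: int = 25) -> float:
    """Conservative baseline: the value at the given percentile of the sorted series."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    index = min(len(ordered) * percentile // 100, len(ordered) - 1)
    return ordered[index]


def plan_savings(hourly_on_demand_spend: Sequence[float],
                 covered_spend: float,
                 discount: float) -> float:
    """Savings over the period compared with paying on-demand for everything.

    covered_spend: on-demand-equivalent dollars per hour that the plan covers
    discount: fractional discount on covered usage, e.g. 0.3 for 30%
    """
    committed_rate = covered_spend * (1 - discount)  # paid every hour, used or not
    without_plan = sum(hourly_on_demand_spend)
    with_plan = sum(
        committed_rate + max(spend - covered_spend, 0.0)  # overflow billed on-demand
        for spend in hourly_on_demand_spend
    )
    return without_plan - with_plan


# Hypothetical hourly on-demand spend in dollars
spend = [12.0, 10.0, 15.0, 11.0]
baseline = percentile_baseline(spend)
print(baseline, plan_savings(spend, baseline, 0.5))
```

A negative result means the commitment exceeded actual usage often enough to cost more than staying on-demand, which is why the original code picks a low percentile as the baseline.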

Pattern 4: Storage Tiering

Automatically move data to cheaper storage tiers.

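# S3 Intelligent-Tiering and lifecycle policies
import boto3

s3 = boto3.client('s3')

# Opt objects under data/ into the Intelligent-Tiering archive tiers.
# This is configured through its own API call, not through a lifecycle rule.
s3.put_bucket_intelligent_tiering_configuration(
    Bucket='data-lake',
    Id='data-archive-tiers',  # illustrative configuration ID
    IntelligentTieringConfiguration={
        'Id': 'data-archive-tiers',
        'Status': 'Enabled',
        'Filter': {'Prefix': 'data/'},
        'Tierings': [
            {'AccessTier': 'ARCHIVE_ACCESS', 'Days': 90},
            {'AccessTier': 'DEEP_ARCHIVE_ACCESS', 'Days': 180},
        ],
    },
)

# Apply lifecycle rules to the existing bucket
s3.put_bucket_lifecycle_configuration(
    Bucket='data-lake',
    LifecycleConfiguration={
        'Rules': [
            {
                'ID': 'intelligent-tiering',
                'Status': 'Enabled',
                'Filter': {'Prefix': 'data/'},
                # Move objects into the Intelligent-Tiering storage class on creation
                'Transitions': [
                    {'Days': 0, 'StorageClass': 'INTELLIGENT_TIERING'},
                ],
            },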
            {
                'ID': 'transition-to-glacier',
                'Status': 'Enabled',
                'Filter': {'Prefix': 'backups/'},
                'Transitions': [
                    {
                        'Days': 30,
                        'StorageClass': 'STANDARD_IA',
                    },
                    {
                        'Days': 90,
                        'StorageClass': 'GLACIER',
                    },
                    {
                        'Days': 365,
                        'StorageClass': 'DEEP_ARCHIVE',
                    },
                ],
                'Expiration': {'Days': 730},
            },
            {
                'ID': 'cleanup-temp',
                'Status': 'Enabled',
                'Filter': {'Prefix': 'temp/'},
                'Expiration': {'Days': 7},
            },
        ]
    },
)
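
To check what the `backups/` rule above implies for an object of a given age, the following sketch maps age in days to the storage class the rule targets. The constants mirror the rule's thresholds; the function name is hypothetical. Note that S3 applies lifecycle actions asynchronously, so an object's actual class can lag behind what this function returns.

```python
from typing import Optional

# (minimum age in days, storage class), taken from the backups/ rule
BACKUP_TRANSITIONS = [(30, "STANDARD_IA"), (90, "GLACIER"), (365, "DEEP_ARCHIVE")]
BACKUP_EXPIRATION_DAYS = 730


def storage_class_for_age(age_days: int) -> Optional[str]:
    """Storage class targeted by the rule at this age, or None once expired."""
    if age_days < 0:
        raise ValueError("age_days must be non-negative")
    if age_days >= BACKUP_EXPIRATION_DAYS:
        return None
    current = "STANDARD"
    for threshold, storage_class in BACKUP_TRANSITIONS:
        if age_days >= threshold:
            current = storage_class
    return current


for age in (10, 45, 200, 500, 800):
    print(age, storage_class_for_age(age))
```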

Pattern 5: Cost Monitoring and Alerts

Set up automated cost alerts.

# AWS Budgets CloudFormation
AWSTemplateFormatVersion: '2010-09-09'
Resources:
  MonthlyBudget:
    Type: AWS::Budgets::Budget
    Properties:
      Budget:
        BudgetLimit:
          Amount: 10000
          Unit: USD
        TimeUnit: MONTHLY
        BudgetType: COST
        CostFilters:
          TagKeyValue: [
            'user:Environment$production'
          ]
      NotificationsWithSubscribers:
        - Notification:
            NotificationType: ACTUAL
            ComparisonOperator: GREATER_THAN
            Threshold: 80
          Subscribers:
            - SubscriptionType: EMAIL
              Address: finops@example.com
            - SubscriptionType: SNS
              Address: !Ref CostAlertTopic
        - Notification:
            NotificationType: FORECASTED
            ComparisonOperator: GREATER_THAN
            Threshold: 100
          Subscribers:
            - SubscriptionType: EMAIL
              Address: finops@example.com
  
  CostAlertTopic:
    Type: AWS::SNS::Topic
    Properties:
      TopicName: cost-alerts
  
  # Anomaly detection
  AnomalyMonitor:
    Type: AWS::CE::AnomalyMonitor
    Properties:
      MonitorName: service-cost-monitor  # illustrative name
      MonitorType: DIMENSIONAL
      MonitorDimension: SERVICE
  
  AnomalySubscription:
    Type: AWS::CE::AnomalySubscription
    Properties:
      SubscriptionName: daily-cost-anomalies  # illustrative name
      Frequency: DAILY
      MonitorArnList:
        - !Ref AnomalyMonitor
      Subscribers:
        - Address: finops@example.com
          Type: EMAIL
      Threshold: 100  # Alert on anomalies with at least $100 of impact
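
The two budget notifications in the template fire on different inputs: one on actual spend above 80% of the limit, the other on forecasted spend above 100%. The sketch below reproduces that logic locally so the thresholds can be sanity-checked against sample figures. It assumes thresholds are percentages of the budget limit, and the function name is hypothetical.

```python
from typing import List


def triggered_notifications(limit: float, actual: float, forecast: float) -> List[str]:
    """Return the notification types that would fire for the given spend figures."""
    rules = [
        ("ACTUAL", 80.0, actual),        # actual spend above 80% of the limit
        ("FORECASTED", 100.0, forecast),  # forecasted spend above 100% of the limit
    ]
    return [
        kind
        for kind, threshold_percent, amount in rules
        if amount > limit * threshold_percent / 100
    ]


# Hypothetical month: $8,500 spent so far, $9,900 forecast, $10,000 limit
print(triggered_notifications(10000, 8500, 9900))  # ['ACTUAL']
```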

Cost Optimization Checklist

Right-Size Monthly - Review Compute Optimizer recommendations
Reserved Instances - Commit to 1-year for stable workloads
Spot Instances - Use for batch and fault-tolerant workloads
Storage Tiering - Implement lifecycle policies
Clean Up - Delete idle resources and unattached volumes
Monitor - Set up budget alerts and anomaly detection

Follow-Up Questions

How do you balance cost optimization with performance requirements for latency-sensitive applications?
What strategies would you use to showback costs to different teams in a shared environment?
How do you implement automated cost enforcement without impacting development velocity?