IaC at Scale: Terraform Modules, State Management, Drift
Difficulty: Senior Level | Companies: HashiCorp, AWS, Google, Microsoft, Spotify
Interview Question
"Design an Infrastructure as Code strategy for a multi-account AWS environment with 100+ Terraform configurations. How do you handle state management, drift detection, and module reuse?"
ℹ️ Key Concepts
This question tests your understanding of IaC best practices, Terraform patterns, and infrastructure governance.
Complete IaC Architecture
Architecture Overview
             INFRASTRUCTURE AS CODE ARCHITECTURE

+--------------------- SOURCE CONTROL ----------------------+
|  GitHub/GitLab  |  Branch Strategy  |  Code Review        |
+-----------------------------+-----------------------------+
                              |
+---------------------- MODULE LAYER -----------------------+
|  Reusable Modules                                         |
|  +------------+  +------------+  +------------+           |
|  | VPC Module |  | ECS Module |  | RDS Module |           |
|  +------------+  +------------+  +------------+           |
+-----------------------------+-----------------------------+
                              |
+-------------------- STATE MANAGEMENT ---------------------+
|  S3 Backend  |  State Locking  |  Encryption              |
+-----------------------------+-----------------------------+
                              |
+-------------------- CI/CD INTEGRATION --------------------+
|  Plan  |  Apply  |  Destroy  |  Drift Detection           |
+-----------------------------------------------------------+
Mathematical Foundation: IaC Metrics
Configuration Metrics:
- Total configurations: C = 100
- Average resources per config: R = 50
- Total resources: T = C × R = 5,000
- Average lines of code per config: L = 500
- Total lines: L_total = C × L = 50,000
Drift Detection:
- Drift detection frequency: F = daily
- Average drift rate: D = 5% of resources per day
- Drifts per day: Dr = T × D = 250
- Time to detect: T_detect = 24 hours
- Time to remediate: T_remediate = 2 hours
Module Reuse:
- Common modules: M = 20
- Usage per module: U = 10
- Total module usages: U_total = M × U = 200
- Code reduction (assuming about 200 lines per module): R_code = (L_total - (M × 200)) / L_total = (50,000 - 4,000) / 50,000 = 92%
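The arithmetic in these lists is easy to sanity-check in a few lines of Python. This is a minimal sketch using the figures listed; the variable names are illustrative, and the 200-lines-per-module figure is an assumption read off the "M × 200" term in the code-reduction formula.

```python
# Back-of-the-envelope IaC metrics, using the figures from the lists above.
configs = 100                # C
resources_per_config = 50    # R
lines_per_config = 500       # L
daily_drift_rate = 0.05      # D, treated as a fraction of all resources per day
modules = 20                 # M
usages_per_module = 10       # U
lines_per_module = 200       # assumption taken from the "M x 200" term

total_resources = configs * resources_per_config
total_lines = configs * lines_per_config
drifts_per_day = round(total_resources * daily_drift_rate)
total_module_usages = modules * usages_per_module
code_reduction = (total_lines - modules * lines_per_module) / total_lines

print(total_resources, total_lines, drifts_per_day, total_module_usages)
print(f"code reduction: {code_reduction:.0%}")
```

With these inputs the reduction formula evaluates to 92%. The estimate is only as good as the assumed module size, so treat it as an upper bound rather than a measured saving.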
Terraform Module Structure
# modules/vpc/main.tf
terraform {
  required_version = ">= 1.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}
variable "name" {
  description = "Name of the VPC"
  type        = string
}

variable "cidr_block" {
  description = "CIDR block for the VPC"
  type        = string
  default     = "10.0.0.0/16"
}

variable "azs" {
  description = "List of availability zones"
  type        = list(string)
}

variable "private_subnets" {
  description = "List of private subnet CIDRs"
  type        = list(string)
}

variable "public_subnets" {
  description = "List of public subnet CIDRs"
  type        = list(string)
}

variable "enable_nat_gateway" {
  description = "Enable NAT Gateway"
  type        = bool
  default     = true
}

variable "single_nat_gateway" {
  description = "Use a single NAT Gateway"
  type        = bool
  default     = false
}
locals {
  # Never create more NAT gateways than there are public subnets to host them;
  # otherwise aws_subnet.public[count.index] would be out of range.
  nat_gateway_count = var.enable_nat_gateway ? min(var.single_nat_gateway ? 1 : length(var.azs), length(var.public_subnets)) : 0
}

resource "aws_vpc" "this" {
  cidr_block           = var.cidr_block
  enable_dns_hostnames = true
  enable_dns_support   = true

  tags = {
    Name = var.name
  }
}

# Subnets are spread across the AZs round-robin.
resource "aws_subnet" "private" {
  count             = length(var.private_subnets)
  vpc_id            = aws_vpc.this.id
  cidr_block        = var.private_subnets[count.index]
  availability_zone = var.azs[count.index % length(var.azs)]

  tags = {
    Name = "${var.name}-private-${var.azs[count.index % length(var.azs)]}"
  }
}

resource "aws_subnet" "public" {
  count                   = length(var.public_subnets)
  vpc_id                  = aws_vpc.this.id
  cidr_block              = var.public_subnets[count.index]
  availability_zone       = var.azs[count.index % length(var.azs)]
  map_public_ip_on_launch = true

  tags = {
    Name = "${var.name}-public-${var.azs[count.index % length(var.azs)]}"
  }
}

resource "aws_internet_gateway" "this" {
  vpc_id = aws_vpc.this.id

  tags = {
    Name = var.name
  }
}

resource "aws_eip" "nat" {
  count  = local.nat_gateway_count
  domain = "vpc"

  tags = {
    Name = "${var.name}-nat-${count.index}"
  }
}

resource "aws_nat_gateway" "this" {
  count         = local.nat_gateway_count
  allocation_id = aws_eip.nat[count.index].id
  subnet_id     = aws_subnet.public[count.index].id

  tags = {
    Name = "${var.name}-nat-${count.index}"
  }

  depends_on = [aws_internet_gateway.this]
}

# Note: route tables and their subnet associations (an internet gateway route
# for public subnets, a NAT route for private subnets) are not defined here.
output "vpc_id" {
  description = "The ID of the VPC"
  value       = aws_vpc.this.id
}

output "private_subnet_ids" {
  description = "List of private subnet IDs"
  value       = aws_subnet.private[*].id
}

output "public_subnet_ids" {
  description = "List of public subnet IDs"
  value       = aws_subnet.public[*].id
}

output "nat_gateway_ips" {
  description = "List of Elastic IPs for NAT Gateways"
  value       = aws_eip.nat[*].public_ip
}
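The VPC module takes `private_subnets` and `public_subnets` as explicit CIDR lists, so callers need some way to carve them out of the VPC range without overlaps. One possible approach, sketched below with Python's standard `ipaddress` module, is to generate them; the helper name `plan_subnets` and the choice of /24 subnets are illustrative assumptions, not something the module prescribes.

```python
import ipaddress
from itertools import islice
from typing import Dict, List


def plan_subnets(vpc_cidr: str, az_count: int, new_prefix: int = 24) -> Dict[str, List[str]]:
    """Split a VPC CIDR into one public and one private subnet per AZ."""
    network = ipaddress.ip_network(vpc_cidr)
    needed = az_count * 2
    # subnets() yields consecutive, non-overlapping blocks of the requested size
    blocks = [str(b) for b in islice(network.subnets(new_prefix=new_prefix), needed)]
    if len(blocks) < needed:
        raise ValueError(f"{vpc_cidr} cannot hold {needed} /{new_prefix} subnets")
    return {
        "public_subnets": blocks[:az_count],
        "private_subnets": blocks[az_count:],
    }


plan = plan_subnets("10.0.0.0/16", az_count=3)
print(plan["public_subnets"])
print(plan["private_subnets"])
```

The resulting lists can be passed to the module as its `public_subnets` and `private_subnets` inputs, for example through a generated `.tfvars.json` file.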
# modules/ecs/main.tf
variable "name" {
  description = "Name of the ECS cluster"
  type        = string
}

variable "vpc_id" {
  description = "VPC ID"
  type        = string
}

variable "subnet_ids" {
  description = "Subnet IDs for the ECS service"
  type        = list(string)
}

variable "container_image" {
  description = "Docker image for the container"
  type        = string
}

variable "container_port" {
  description = "Port exposed by the container"
  type        = number
  default     = 8080
}

variable "task_cpu" {
  description = "CPU units for the task"
  type        = number
  default     = 512
}

variable "task_memory" {
  description = "Memory for the task (MiB)"
  type        = number
  default     = 1024
}

variable "desired_count" {
  description = "Number of instances to run"
  type        = number
  default     = 2
}
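# Added: Fargate tasks that use the awslogs driver need a task execution role.
variable "execution_role_arn" {
  description = "ARN of the task execution role (pulls the image, writes logs)"
  type        = string
}

# Added: the log configuration below reads the current region from this data source.
data "aws_region" "current" {}

resource "aws_ecs_cluster" "this" {
  name = var.name

  setting {
    name  = "containerInsights"
    value = "enabled"
  }
}

resource "aws_ecs_task_definition" "this" {
  family                   = var.name
  network_mode             = "awsvpc"
  requires_compatibilities = ["FARGATE"]
  cpu                      = var.task_cpu
  memory                   = var.task_memory
  execution_role_arn       = var.execution_role_arn

  container_definitions = jsonencode([
    {
      name  = var.name
      image = var.container_image
      portMappings = [
        {
          containerPort = var.container_port
          hostPort      = var.container_port
          protocol      = "tcp"
        }
      ]
      logConfiguration = {
        logDriver = "awslogs"
        options = {
          "awslogs-group"         = aws_cloudwatch_log_group.this.name
          "awslogs-region"        = data.aws_region.current.name
          "awslogs-stream-prefix" = "ecs"
        }
      }
    }
  ])
}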
resource "aws_ecs_service" "this" {
  name            = var.name
  cluster         = aws_ecs_cluster.this.id
  task_definition = aws_ecs_task_definition.this.arn
  desired_count   = var.desired_count
  launch_type     = "FARGATE"

  network_configuration {
    subnets          = var.subnet_ids
    security_groups  = [aws_security_group.this.id]
    assign_public_ip = false
  }
}

resource "aws_security_group" "this" {
  name_prefix = "${var.name}-"
  vpc_id      = var.vpc_id

  # Open to any source for simplicity; in practice, restrict this to the
  # load balancer's security group or the VPC CIDR.
  ingress {
    from_port   = var.container_port
    to_port     = var.container_port
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

resource "aws_cloudwatch_log_group" "this" {
  name              = "/ecs/${var.name}"
  retention_in_days = 30
}

output "cluster_id" {
  description = "ECS cluster ID"
  value       = aws_ecs_cluster.this.id
}

output "service_name" {
  description = "ECS service name"
  value       = aws_ecs_service.this.name
}
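The `task_cpu` and `task_memory` defaults (512 and 1024) form a valid Fargate size, but not every pair does, and an invalid pair only fails when the task definition is registered. A pre-flight check can catch that earlier. The sketch below is an assumption-laden illustration: the table covers the sizes up to 4 vCPU as AWS documents them at the time of writing (larger sizes exist), and the helper name is hypothetical, so verify against current AWS documentation before relying on it.

```python
from typing import Dict, List

# CPU units -> allowed memory values in MiB (sizes up to 4 vCPU only).
FARGATE_MEMORY_BY_CPU: Dict[int, List[int]] = {
    256: [512, 1024, 2048],
    512: list(range(1024, 4096 + 1, 1024)),
    1024: list(range(2048, 8192 + 1, 1024)),
    2048: list(range(4096, 16384 + 1, 1024)),
    4096: list(range(8192, 30720 + 1, 1024)),
}


def is_valid_fargate_size(cpu: int, memory: int) -> bool:
    """Return True if (cpu, memory) is one of the combinations in the table."""
    return memory in FARGATE_MEMORY_BY_CPU.get(cpu, [])


print(is_valid_fargate_size(512, 1024))   # the module defaults
print(is_valid_fargate_size(512, 8192))   # too much memory for 0.5 vCPU
```

A check like this could run in CI against the variable values of each configuration before `terraform plan`.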
State Management
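# backend.tf
terraform {
  backend "s3" {
    bucket         = "terraform-state-bucket"
    # Each configuration needs its own key so that states never overwrite each other.
    key            = "prod/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"
    encrypt        = true
  }
}

# State locking with DynamoDB.
# Create this table (and the state bucket) from a separate bootstrap
# configuration, since the backend must exist before `terraform init` runs.
resource "aws_dynamodb_table" "terraform_locks" {
  name         = "terraform-locks"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}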
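With 100+ configurations sharing one backend bucket, the state key is what keeps them apart. A common pattern is to leave `key` out of the backend block and supply it at init time through Terraform's partial backend configuration (`-backend-config`). The sketch below shows one way to derive keys consistently; the `account/environment/component` layout and both helper names are assumptions chosen for illustration.

```python
from typing import List


def state_key(account: str, environment: str, component: str) -> str:
    """Build a unique, predictable state key for one configuration."""
    return f"{account}/{environment}/{component}/terraform.tfstate"


def backend_init_args(bucket: str, key: str, region: str, lock_table: str) -> List[str]:
    """Build the `terraform init` command line for a partially configured S3 backend."""
    return [
        "terraform", "init",
        f"-backend-config=bucket={bucket}",
        f"-backend-config=key={key}",
        f"-backend-config=region={region}",
        f"-backend-config=dynamodb_table={lock_table}",
    ]


key = state_key("payments", "prod", "vpc")
args = backend_init_args("terraform-state-bucket", key, "us-east-1", "terraform-locks")
print(" ".join(args))
```

Keeping one state per component per environment also limits the blast radius of a bad apply and keeps plan times short.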
# State management automation
import json
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Any, Dict, List

import boto3
from botocore.exceptions import ClientError


@dataclass
class TerraformState:
    state_file: str
    backend: str
    last_modified: datetime
    locked: bool


class StateManager:
    """Terraform state management"""

    def __init__(self, bucket_name: str, state_key: str):
        self.s3 = boto3.client('s3')
        self.dynamodb = boto3.resource('dynamodb')
        self.bucket_name = bucket_name
        self.state_key = state_key
        self.lock_table = self.dynamodb.Table('terraform-locks')
        # Terraform's S3 backend keys its lock as "<bucket>/<key>"; using the
        # same value lets this lock and Terraform's own lock exclude each other.
        self.lock_key = f"{bucket_name}/{state_key}"

    def get_state(self) -> Dict[str, Any]:
        """Get Terraform state"""
        response = self.s3.get_object(
            Bucket=self.bucket_name,
            Key=self.state_key
        )
        return json.loads(response['Body'].read())

    def update_state(self, state: Dict[str, Any]) -> None:
        """Update Terraform state.

        Writing state directly bypasses Terraform's own consistency checks;
        prefer `terraform state push`, and take a backup first.
        """
        self.s3.put_object(
            Bucket=self.bucket_name,
            Key=self.state_key,
            Body=json.dumps(state, indent=2),
            ServerSideEncryption='aws:kms'
        )

    def lock_state(self, lock_id: str) -> bool:
        """Lock Terraform state; return False if the lock is already held."""
        try:
            self.lock_table.put_item(
                Item={
                    'LockID': self.lock_key,
                    'lock_id': lock_id,
                    'created_at': datetime.now(timezone.utc).isoformat()
                },
                # The write succeeds only if no lock item exists yet
                ConditionExpression='attribute_not_exists(LockID)'
            )
            return True
        except ClientError as e:
            if e.response['Error']['Code'] == 'ConditionalCheckFailedException':
                print(f"Lock failed: {e}")
                return False
            raise  # Anything else (permissions, throttling) is a real error

    def unlock_state(self, lock_id: str) -> bool:
        """Unlock Terraform state; only the holder of lock_id may unlock."""
        try:
            self.lock_table.delete_item(
                Key={'LockID': self.lock_key},
                ConditionExpression='lock_id = :lock_id',
                ExpressionAttributeValues={':lock_id': lock_id}
            )
            return True
        except ClientError as e:
            if e.response['Error']['Code'] == 'ConditionalCheckFailedException':
                print(f"Unlock failed: {e}")
                return False
            raise

    @staticmethod
    def _resource_address(resource: Dict[str, Any]) -> str:
        """Build an address such as module.vpc.aws_vpc.this from a state entry."""
        parts = []
        if resource.get('module'):
            parts.append(resource['module'])
        if resource.get('mode') == 'data':
            parts.append('data')
        parts.extend([resource['type'], resource['name']])
        return '.'.join(parts)

    def detect_drift(self, current_resources: List[str]) -> Dict[str, Any]:
        """Detect drift between state and actual resources"""
        state = self.get_state()
        # Raw state entries carry module/mode/type/name, not a ready-made address
        state_resources = {self._resource_address(r) for r in state.get('resources', [])}
        current = set(current_resources)
        added = sorted(current - state_resources)
        removed = sorted(state_resources - current)
        return {
            'drifted': bool(added or removed),
            'added': added,
            'removed': removed
        }

    def backup_state(self) -> str:
        """Backup Terraform state and return the backup key"""
        timestamp = datetime.now(timezone.utc).strftime('%Y%m%d-%H%M%S')
        backup_key = f"backups/{self.state_key}.{timestamp}"
        self.s3.copy_object(
            Bucket=self.bucket_name,
            CopySource={'Bucket': self.bucket_name, 'Key': self.state_key},
            Key=backup_key,
            ServerSideEncryption='aws:kms'
        )
        return backup_key
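The locking methods rely on a conditional write: the put succeeds only if no item with that `LockID` exists, and the delete succeeds only for the holder. The in-memory stand-in below mimics those two conditions so the behavior can be seen without AWS access. It is a teaching sketch, not a replacement for DynamoDB, and the class and method names are hypothetical.

```python
from typing import Dict


class InMemoryLockTable:
    """Mimics the conditional put/delete semantics used for state locking."""

    def __init__(self) -> None:
        self._items: Dict[str, str] = {}

    def acquire(self, lock_key: str, owner: str) -> bool:
        # Equivalent of ConditionExpression='attribute_not_exists(LockID)'
        if lock_key in self._items:
            return False
        self._items[lock_key] = owner
        return True

    def release(self, lock_key: str, owner: str) -> bool:
        # Equivalent of ConditionExpression='lock_id = :lock_id'
        if self._items.get(lock_key) != owner:
            return False
        del self._items[lock_key]
        return True


table = InMemoryLockTable()
key = "terraform-state-bucket/prod/terraform.tfstate"
first = table.acquire(key, "run-a")        # lock is free, so this succeeds
second = table.acquire(key, "run-b")       # already held, so this is refused
wrong_owner = table.release(key, "run-b")  # only the holder may release
released = table.release(key, "run-a")
print(first, second, wrong_owner, released)
```

In DynamoDB the check and the write happen atomically on the server, which is what makes the pattern safe across concurrent CI runs.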
Drift Detection
# Drift detection automation
from datetime import datetime, timezone
from typing import Any, Dict, List

import boto3


class DriftDetector:
    """Infrastructure drift detection"""

    def __init__(self):
        self.ec2 = boto3.client('ec2')
        self.rds = boto3.client('rds')
        self.s3 = boto3.client('s3')

    def detect_ec2_drift(self, expected_instances: List[Dict[str, Any]]) -> Dict[str, Any]:
        """Detect EC2 instance drift"""
        # Get actual instances; paginate so large accounts are listed in full
        actual_instances = []
        paginator = self.ec2.get_paginator('describe_instances')
        for page in paginator.paginate():
            for reservation in page['Reservations']:
                for instance in reservation['Instances']:
                    # Terminated instances linger in the API for a while; skip them
                    if instance['State']['Name'] == 'terminated':
                        continue
                    actual_instances.append({
                        'instance_id': instance['InstanceId'],
                        'instance_type': instance['InstanceType'],
                        'state': instance['State']['Name']
                    })

        # Compare
        expected_ids = {i['instance_id'] for i in expected_instances}
        actual_ids = {i['instance_id'] for i in actual_instances}
        added = actual_ids - expected_ids
        removed = expected_ids - actual_ids
        return {
            'type': 'EC2',  # read by generate_drift_report
            'drifted': len(added) > 0 or len(removed) > 0,
            'added': sorted(added),
            'removed': sorted(removed),
            'details': {
                'expected_count': len(expected_instances),
                'actual_count': len(actual_instances)
            }
        }

    def detect_s3_drift(self, expected_buckets: List[str]) -> Dict[str, Any]:
        """Detect S3 bucket drift"""
        response = self.s3.list_buckets()
        actual_buckets = [b['Name'] for b in response['Buckets']]
        expected_set = set(expected_buckets)
        actual_set = set(actual_buckets)
        added = actual_set - expected_set
        removed = expected_set - actual_set
        return {
            'type': 'S3',
            'drifted': len(added) > 0 or len(removed) > 0,
            'added': sorted(added),
            'removed': sorted(removed)
        }

    def generate_drift_report(self, drift_results: List[Dict[str, Any]]) -> str:
        """Generate drift report"""
        lines = [
            "# Infrastructure Drift Report",
            "",
            f"Generated: {datetime.now(timezone.utc).isoformat()}",
            "",
        ]
        for result in drift_results:
            lines.append(f"## {result.get('type', 'Unknown')}")
            lines.append(f"Drifted: {result['drifted']}")
            if result['drifted']:
                lines.append(f"Added: {result.get('added', [])}")
                lines.append(f"Removed: {result.get('removed', [])}")
            lines.append("")
        return "\n".join(lines) + "\n"
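⚠️ Drift Detection
Run drift detection on a schedule so that changes made outside Terraform (console edits, ad hoc scripts) are caught quickly. Alert on every detected drift, then either revert the change or codify it in Terraform so the configuration stays the source of truth.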
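Comparing resource IDs catches added and removed resources, but not changed attributes. Terraform's own plan does, and its machine-readable form (`terraform show -json <planfile>`) lists each affected resource under `resource_changes` with the planned `actions`. The sketch below summarizes such a document; the function name and the sample plan are made up for illustration, and only the `address` and `change.actions` fields are assumed.

```python
from typing import Any, Dict, List


def summarize_plan(plan: Dict[str, Any]) -> Dict[str, List[str]]:
    """Group resource addresses from a JSON plan by the kind of change."""
    summary: Dict[str, List[str]] = {"create": [], "update": [], "delete": [], "replace": []}
    for change in plan.get("resource_changes", []):
        actions = change.get("change", {}).get("actions", [])
        address = change.get("address", "<unknown>")
        if "create" in actions and "delete" in actions:
            summary["replace"].append(address)  # delete+create in either order
        elif "create" in actions:
            summary["create"].append(address)
        elif "update" in actions:
            summary["update"].append(address)
        elif "delete" in actions:
            summary["delete"].append(address)
        # "no-op" and "read" entries are not drift and are skipped
    return summary


# Hand-written sample in the shape of `terraform show -json` output.
example_plan = {
    "resource_changes": [
        {"address": "aws_vpc.this", "change": {"actions": ["no-op"]}},
        {"address": "aws_subnet.public[0]", "change": {"actions": ["update"]}},
        {"address": "aws_eip.nat[0]", "change": {"actions": ["delete", "create"]}},
        {"address": "aws_s3_bucket.logs", "change": {"actions": ["create"]}},
    ]
}
summary = summarize_plan(example_plan)
drifted = any(summary.values())
print(summary, drifted)
```

A summary like this is a convenient payload for the alert, since it names exactly which resources differ from the configuration.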
CI/CD Integration
# GitHub Actions for Terraform
name: Terraform CI/CD

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]
  # Schedules are workflow triggers; they cannot be declared on a job
  schedule:
    - cron: '0 0 * * *' # Daily

env:
  TF_VERSION: "1.5.0"
  AWS_REGION: "us-east-1"

# AWS credentials must be configured before any step that talks to the
# backend or to AWS; that setup is omitted here.

jobs:
  plan:
    if: github.event_name != 'schedule'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: ${{ env.TF_VERSION }}
      - name: Terraform Init
        run: terraform init
      - name: Terraform Format
        run: terraform fmt -check
      - name: Terraform Validate
        run: terraform validate
      - name: Terraform Plan
        run: terraform plan -out=tfplan
      - name: Upload Plan
        uses: actions/upload-artifact@v4
        with:
          name: tfplan
          path: tfplan

  apply:
    needs: plan
    # Apply only after a merge to main, never from a pull request
    if: github.event_name == 'push' && github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    environment: production
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: ${{ env.TF_VERSION }}
      - name: Download Plan
        uses: actions/download-artifact@v4
        with:
          name: tfplan
      - name: Terraform Init
        run: terraform init
      - name: Terraform Apply
        run: terraform apply -auto-approve tfplan

  drift-detection:
    if: github.event_name == 'schedule'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: ${{ env.TF_VERSION }}
          terraform_wrapper: false
      - name: Terraform Init
        run: terraform init
      - name: Check for Drift
        run: |
          # Exit codes: 0 = no changes, 1 = error, 2 = changes (drift) present.
          # The default shell exits on any non-zero status, so capture it explicitly.
          set +e
          terraform plan -detailed-exitcode -input=false
          exitcode=$?
          set -e
          if [ "$exitcode" -eq 2 ]; then
            echo "Drift detected!"
            # Send notification
          elif [ "$exitcode" -ne 0 ]; then
            exit "$exitcode"
          fi
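The scheduled job checks a single configuration, while the scenario in the question has 100+. One way to scale the same check is a small driver that runs `terraform plan -detailed-exitcode` in every configuration directory and classifies the exit code (0 means no changes, 1 an error, 2 pending changes). The sketch below assumes each directory is already initialized; the function names are hypothetical, and the runner is injectable so the logic can be exercised without Terraform installed.

```python
import subprocess
from typing import Callable, Dict, Iterable


def classify_exit_code(code: int) -> str:
    """Map a `terraform plan -detailed-exitcode` status to a label."""
    if code == 0:
        return "in_sync"
    if code == 2:
        return "drift"
    return "error"


def run_plan(directory: str) -> int:
    """Run a read-only plan in `directory` and return its exit code."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-lock=false"],
        cwd=directory,
        capture_output=True,
        text=True,
    )
    return result.returncode


def check_all(directories: Iterable[str],
              runner: Callable[[str], int] = run_plan) -> Dict[str, str]:
    """Check every configuration directory and report its drift status."""
    return {d: classify_exit_code(runner(d)) for d in directories}


# Demonstration with canned exit codes standing in for real Terraform runs.
canned = {"network/prod": 0, "ecs/prod": 2, "rds/prod": 1}
report = check_all(canned, runner=lambda d: canned[d])
print(report)
```

In a real pipeline the directories would typically be fanned out across parallel jobs, since a hundred sequential plans can take a long time.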
✅ IaC Benefits
Infrastructure as Code brings version control, reproducibility, and automation to infrastructure changes. Shared modules keep the 100+ configurations consistent, and remote state with locking gives every run a single, consistent record of what is deployed.
Summary
| Component | Purpose | Implementation |
|---|---|---|
| Modules | Reusable components | Terraform modules |
| State | Infrastructure state | S3 backend |
| Locking | Concurrency control | DynamoDB |
| Drift | Change detection | Scheduled scans |
| CI/CD | Automation | GitHub Actions |