Interview Question (Hard) - Asked at: Google, Microsoft, Amazon, Netflix, Uber
"Design a Kubernetes-based ML infrastructure that supports GPU workloads, model serving, and auto-scaling. How do you handle resource management, cost optimization, and multi-tenancy?"
Kubernetes for ML Architecture
Kubernetes provides the foundation for scalable, production-grade ML infrastructure. It handles resource scheduling, scaling, and management of ML workloads.
ML on Kubernetes Architecture
+-----------------------------------------------------------------+
|                  Kubernetes ML Infrastructure                   |
+-----------------------------------------------------------------+
|                                                                 |
|  +-----------------------------------------------------------+  |
|  |                       Control Plane                       |  |
|  |        API Server | Scheduler | Controller Manager        |  |
|  +-----------------------------------------------------------+  |
|                                |                                |
|           +--------------------+--------------------+           |
|           v                    v                    v           |
|     +-----------+      +---------------+      +-----------+     |
|     |    GPU    |      |   CPU-only    |      |  Hybrid   |     |
|     |   Nodes   |      |     Nodes     |      |   Nodes   |     |
|     | (T4/A10G) |      |               |      |           |     |
|     +-----------+      +---------------+      +-----------+     |
|           |                    |                    |           |
|           |    +---------------+---------------+    |           |
|           |    |         ML Workloads          |    |           |
|           |    |  Training | Serving | Batch   |    |           |
|           |    +-------------------------------+    |           |
|           |                                         |           |
|           +--------------------+--------------------+           |
|                                v                                |
|  +-----------------------------------------------------------+  |
|  |                ML Operators & Controllers                 |  |
|  |       Kubeflow | KServe | Volcano | NVIDIA Operator       |  |
|  +-----------------------------------------------------------+  |
+-----------------------------------------------------------------+
GPU Scheduling
NVIDIA GPU Operator Setup
# nvidia-gpu-operator-values.yaml
operator:
  defaultRuntime: containerd
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
driver:
  enabled: true
  version: "525.85.12"
toolkit:
  enabled: true
devicePlugin:
  enabled: true
  config:
    map:
      default: |
        version: v1
        sharing:
          timeSlicing:
            resources:
              - name: nvidia.com/gpu
                replicas: 10
gpuFeatureDiscovery:
  enabled: true
dcgmExporter:
  enabled: true
  metrics:
    serviceMonitor:
      enabled: true
GPU-Aware Pod Scheduling
# kubernetes/gpu-training-job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: gpu-training-job
  namespace: ml-training
spec:
  completions: 1
  parallelism: 1
  backoffLimit: 3
  template:
    metadata:
      labels:
        app: ml-training
        gpu-type: nvidia-tesla-t4
    spec:
      restartPolicy: Never
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      nodeSelector:
        accelerator: nvidia-tesla-t4
      containers:
        - name: training
          image: registry.example.com/ml-training:latest
          command: ["python", "train.py"]
          env:
            - name: NVIDIA_VISIBLE_DEVICES
              value: "all"
            - name: NVIDIA_DRIVER_CAPABILITIES
              value: "compute,utility"
          resources:
            requests:
              memory: "16Gi"
              cpu: "4000m"
              nvidia.com/gpu: "1"
            limits:
              memory: "32Gi"
              cpu: "8000m"
              nvidia.com/gpu: "1"
          volumeMounts:
            - name: training-data
              mountPath: /data
            - name: model-output
              mountPath: /output
      volumes:
        - name: training-data
          persistentVolumeClaim:
            claimName: training-data-pvc
        - name: model-output
          persistentVolumeClaim:
            claimName: model-output-pvc
---
apiVersion: v1
kind: Pod
metadata:
  name: gpu-inference-pod
  namespace: ml-serving
spec:
  containers:
    - name: inference
      image: registry.example.com/ml-inference:latest
      resources:
        requests:
          nvidia.com/gpu: "1"
          memory: "8Gi"
          cpu: "2000m"
        limits:
          nvidia.com/gpu: "1"
          memory: "16Gi"
          cpu: "4000m"
      env:
        - name: CUDA_VISIBLE_DEVICES
          value: "0"
  nodeSelector:
    accelerator: nvidia-tesla-t4
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
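The manifests above depend on three conditions holding at once: the node carries the label named in nodeSelector, the pod tolerates every NoSchedule taint on the node, and the node still has an unallocated nvidia.com/gpu. The Python sketch below models that filtering step in a deliberately simplified form. It is not the scheduler's code; the Node and Pod classes and the feasible_nodes helper are hypothetical names introduced for illustration, and only "operator: Exists" tolerations are handled.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    labels: dict
    taints: list      # list of (key, effect) tuples
    free_gpus: int


@dataclass
class Pod:
    node_selector: dict
    tolerated_keys: set   # taint keys tolerated with operator: Exists
    gpu_request: int = 0


def feasible_nodes(pod, nodes):
    """Return the names of nodes that pass the selector, taint, and GPU checks."""
    result = []
    for node in nodes:
        # nodeSelector: every requested label must match exactly
        if any(node.labels.get(k) != v for k, v in pod.node_selector.items()):
            continue
        # Taints: each NoSchedule taint must be tolerated by the pod
        if any(effect == "NoSchedule" and key not in pod.tolerated_keys
               for key, effect in node.taints):
            continue
        # GPUs are requested in whole units and cannot be overcommitted
        if node.free_gpus < pod.gpu_request:
            continue
        result.append(node.name)
    return result


gpu_taint = [("nvidia.com/gpu", "NoSchedule")]
nodes = [
    Node("gpu-a", {"accelerator": "nvidia-tesla-t4"}, gpu_taint, free_gpus=1),
    Node("gpu-b", {"accelerator": "nvidia-tesla-t4"}, gpu_taint, free_gpus=0),
    Node("cpu-a", {}, [], free_gpus=0),
]
training_pod = Pod({"accelerator": "nvidia-tesla-t4"}, {"nvidia.com/gpu"}, gpu_request=1)
print(feasible_nodes(training_pod, nodes))
```

Only gpu-a passes all three checks here: gpu-b has no free GPU and cpu-a lacks the accelerator label.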
GPU Sharing with Time Slicing
# kubernetes/gpu-sharing-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: gpu-device-plugin-config
  namespace: kube-system
data:
  any: |
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4
Note: GPU time-slicing lets multiple pods share a single GPU by interleaving their work. This raises GPU utilization but can increase latency, and it provides no memory or fault isolation between the pods sharing the device. Use it for development, testing, or latency-tolerant workloads.
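To make the effect of the replicas setting concrete, here is a small Python sketch of the arithmetic. With time-slicing, each physical GPU is advertised as "replicas" schedulable nvidia.com/gpu resources. The function names are hypothetical, and the per-pod share assumes every replica is busy and slices are divided equally, which is a simplification.

```python
def advertised_gpus(physical_gpus: int, replicas: int) -> int:
    """Number of nvidia.com/gpu resources a node advertises under time-slicing."""
    if physical_gpus < 0 or replicas < 1:
        raise ValueError("physical_gpus must be >= 0 and replicas must be >= 1")
    return physical_gpus * replicas


def worst_case_share(replicas: int) -> float:
    """Fraction of one GPU's time a pod gets when all replicas are busy.

    Assumes equal slices; real behavior depends on the workloads.
    """
    if replicas < 1:
        raise ValueError("replicas must be >= 1")
    return 1.0 / replicas


# A node with 2 physical GPUs and replicas: 4, as in the ConfigMap above
print(advertised_gpus(2, 4))
print(worst_case_share(4))
```

The trade-off is visible directly: higher replica counts admit more pods per GPU but shrink each pod's worst-case share of compute time.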
KServe Model Serving
KServe InferenceService
# kserve/inferenceservice.yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: fraud-detection
  namespace: ml-serving
  annotations:
    serving.kserve.io/enable-prometheus-scraping: "true"
spec:
  predictor:
    model:
      modelFormat:
        name: onnx
      storageUri: "s3://ml-models/fraud-detection/v1.0"
      runtime: kserve-onnxruntime
      # Container fields such as resources and args sit directly under `model`
      resources:
        requests:
          memory: "4Gi"
          cpu: "2000m"
          nvidia.com/gpu: "1"
        limits:
          memory: "8Gi"
          cpu: "4000m"
          nvidia.com/gpu: "1"
      args:
        - --concurrency=32
        - --max-batch-size=32
        - --batch-delay=10
  transformer:
    containers:
      - name: preprocessor
        image: registry.example.com/ml-preprocessor:latest
        resources:
          requests:
            memory: "2Gi"
            cpu: "1000m"
          limits:
            memory: "4Gi"
            cpu: "2000m"
  explainer:
    alibi:
      type: AnchorTabular
      resources:
        requests:
          memory: "4Gi"
          cpu: "2000m"
        limits:
          memory: "8Gi"
          cpu: "4000m"
---
# kserve/serving-runtime.yaml
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: kserve-onnxruntime
  namespace: kserve
spec:
  supportedModelFormats:
    - name: onnx
      version: "1"
  containers:
    - name: kserve-container
      image: kserve/onnxserver:latest
      args:
        - --http_port=8080
        - --grpc_port=8001
        - --rest_api_log_level=WARNING
      resources:
        requests:
          cpu: "1"
          memory: "2Gi"
        limits:
          cpu: "2"
          memory: "4Gi"
  # Assumption: your KServe version accepts these autoscaling fields here.
  # Upstream KServe defines minReplicas, maxReplicas, scaleMetric, and
  # scaleTarget on the InferenceService predictor.
  replicas: 1
  minReplicas: 1
  maxReplicas: 10
  scaleMetric: rps
  scaleTarget: 100
---
# kserve/trigger.yaml
apiVersion: serving.kserve.io/v1alpha1
kind: ClusterServingRuntime
metadata:
  name: triton-inference-server
spec:
  supportedModelFormats:
    - name: ensemble
      version: "1"
    - name: python
      version: "1"
  containers:
    - name: triton
      image: nvcr.io/nvidia/tritonserver:23.10-py3
      args:
        - tritonserver
        - --model-repository=/models
        - --log-verbose=1
        - --strict-model-config=false
      ports:
        - containerPort: 8000
          name: http
        - containerPort: 8001
          name: grpc
        - containerPort: 8002
          name: metrics
      resources:
        requests:
          cpu: "1"
          memory: "4Gi"
          nvidia.com/gpu: "1"
        limits:
          cpu: "4"
          memory: "8Gi"
          nvidia.com/gpu: "1"
      readinessProbe:
        httpGet:
          path: /v2/health/ready
          port: 8000
        initialDelaySeconds: 30
        periodSeconds: 10
      livenessProbe:
        httpGet:
          path: /v2/health/live
          port: 8000
        initialDelaySeconds: 60
        periodSeconds: 30
KServe with Auto-Scaling
# kserve/autoscaling.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: fraud-detection-hpa
  namespace: ml-serving
spec:
  # Assumption: the target exposes the scale subresource. In KServe
  # raw-deployment mode that is typically the predictor Deployment rather
  # than the InferenceService itself.
  scaleTargetRef:
    apiVersion: serving.kserve.io/v1beta1
    kind: InferenceService
    name: fraud-detection
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
    # Pods metrics are served by a custom metrics adapter
    # (for example Prometheus Adapter), which must be installed separately.
    - type: Pods
      pods:
        metric:
          name: requests_per_second
        target:
          type: AverageValue
          averageValue: "100"
    - type: Pods
      pods:
        metric:
          name: gpu_utilization
        target:
          type: AverageValue
          averageValue: "80"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Percent
          value: 100
          periodSeconds: 60
        - type: Pods
          value: 4
          periodSeconds: 60
      selectPolicy: Max
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 10
          periodSeconds: 120
      selectPolicy: Min
---
# kserve/kpa.yaml
# Knative normally creates PodAutoscaler objects itself. In serverless mode
# the same settings are usually applied as autoscaling.knative.dev/*
# annotations on the InferenceService.
apiVersion: autoscaling.internal.knative.dev/v1alpha1
kind: PodAutoscaler
metadata:
  name: fraud-detection-pa
  namespace: ml-serving
spec:
  scaleTargetRef:
    apiVersion: serving.kserve.io/v1beta1
    kind: InferenceService
    name: fraud-detection
  autoscalerClass: kpa.autoscaling.knative.dev
  metric: rps
  target: 100
  minScale: 1
  maxScale: 20
Note: KServe provides production-ready model serving with auto-scaling, canary deployments, and explainability. Use it to standardize ML serving across your organization.
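The target values in the HorizontalPodAutoscaler manifest above feed the replica calculation documented for the Kubernetes HPA: desired = ceil(current replicas x current metric / target metric), evaluated per metric, with the largest result winning. The Python sketch below reproduces that arithmetic with the default 10% tolerance. The function name is hypothetical, and the sketch ignores the stabilization windows and scaling policies from the behavior section.

```python
import math


def desired_replicas(current_replicas, metrics, min_replicas, max_replicas, tolerance=0.1):
    """Compute the HPA replica proposal.

    metrics is a list of (current_value, target_value) pairs, one per metric.
    """
    proposals = []
    for current_value, target_value in metrics:
        ratio = current_value / target_value
        if abs(ratio - 1.0) <= tolerance:
            # Close enough to the target: this metric proposes no change
            proposals.append(current_replicas)
        else:
            proposals.append(math.ceil(current_replicas * ratio))
    # With several metrics, the largest proposal wins, then bounds are applied
    desired = max(proposals)
    return max(min_replicas, min(max_replicas, desired))


# 4 replicas, CPU at 90% against a 70% target, 150 rps per pod against a 100 target
print(desired_replicas(4, [(90, 70), (150, 100)], min_replicas=2, max_replicas=20))
```

Because the largest proposal wins, adding more metrics to an HPA can only make it scale out earlier, never later.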
Resource Management
Resource Quotas and Limit Ranges
# kubernetes/resource-quota.yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: ml-training-quota
  namespace: ml-training
spec:
  hard:
    requests.cpu: "100"
    requests.memory: "200Gi"
    # Extended resources such as GPUs can only be capped with the
    # "requests." prefix; a "limits.nvidia.com/gpu" entry is not allowed.
    requests.nvidia.com/gpu: "20"
    limits.cpu: "200"
    limits.memory: "400Gi"
    pods: "50"
    services: "20"
    persistentvolumeclaims: "30"
---
apiVersion: v1
kind: LimitRange
metadata:
  name: ml-training-limits
  namespace: ml-training
spec:
  limits:
    - type: Container
      default:
        cpu: "2000m"
        memory: "8Gi"
      defaultRequest:
        cpu: "1000m"
        memory: "4Gi"
      max:
        cpu: "8000m"
        memory: "32Gi"
        nvidia.com/gpu: "2"
      min:
        cpu: "500m"
        memory: "1Gi"
    - type: Pod
      max:
        cpu: "16000m"
        memory: "64Gi"
        nvidia.com/gpu: "4"
---
# kubernetes/network-policy.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: ml-network-policy
  namespace: ml-serving
spec:
  podSelector:
    matchLabels:
      app: ml-serving
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              name: frontend
      ports:
        - protocol: TCP
          port: 8080
    - from:
        - namespaceSelector:
            matchLabels:
              name: monitoring
      ports:
        - protocol: TCP
          port: 9090
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              name: data-store
      ports:
        - protocol: TCP
          port: 5432
    # DNS lookups to any destination
    - to: []
      ports:
        - protocol: TCP
          port: 53
        - protocol: UDP
          port: 53
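The ResourceQuota at the start of this listing is enforced at admission time: a new pod is rejected if its requests, added to what the namespace already uses, would exceed any hard value. The Python sketch below illustrates that check in simplified form. The fits_quota helper, the memory key expressed in GiB, and the usage figures are hypothetical and chosen only for the example.

```python
def fits_quota(hard, used, pod_requests):
    """Return (admitted, violations) for a pod checked against a namespace quota."""
    violations = []
    for resource, limit in hard.items():
        requested = pod_requests.get(resource, 0)
        if used.get(resource, 0) + requested > limit:
            violations.append(resource)
    return (not violations, violations)


# Hard values mirror ml-training-quota; memory is expressed in GiB for simplicity
hard = {"requests.cpu": 100, "requests.memory_gi": 200,
        "requests.nvidia.com/gpu": 20, "pods": 50}
# Hypothetical current usage in the namespace
used = {"requests.cpu": 96, "requests.memory_gi": 120,
        "requests.nvidia.com/gpu": 19, "pods": 30}
# A new pod asking for 4 CPUs, 16 GiB, and 2 GPUs
pod = {"requests.cpu": 4, "requests.memory_gi": 16,
       "requests.nvidia.com/gpu": 2, "pods": 1}

print(fits_quota(hard, used, pod))
```

In this example the CPU request lands exactly on the quota and passes, while the GPU request would bring the namespace to 21 of 20 GPUs, so the pod is rejected.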
Priority Classes for ML Workloads
# kubernetes/priority-classes.yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: ml-critical
value: 1000000
globalDefault: false
description: "Critical ML serving workloads"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: ml-high
value: 100000
globalDefault: false
description: "High priority ML training workloads"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: ml-normal
value: 10000
globalDefault: true
description: "Normal ML workloads"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: ml-low
value: 1000
globalDefault: false
description: "Low priority batch processing"
preemptionPolicy: Never
---
# Usage in pods
apiVersion: v1
kind: Pod
metadata:
  name: critical-serving
spec:
  priorityClassName: ml-critical
  containers:
    - name: inference
      image: registry.example.com/ml-inference:latest
Note: Use Priority Classes so that critical ML workloads get resources first. Set preemption policies deliberately to avoid disrupting running jobs unnecessarily.
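How priority and preemptionPolicy interact can be sketched in a few lines of Python. The model below is intentionally simplified and is not the scheduler's algorithm: it considers GPUs only, evicts the lowest-priority pods first, and ignores PodDisruptionBudgets and graceful termination. The RunningPod class and pick_victims helper are hypothetical; the priority values are those of the classes defined above.

```python
from dataclasses import dataclass

PRIORITY = {"ml-critical": 1_000_000, "ml-high": 100_000,
            "ml-normal": 10_000, "ml-low": 1_000}


@dataclass
class RunningPod:
    name: str
    priority_class: str
    gpus: int


def pick_victims(pending_class, gpus_needed, free_gpus, running,
                 preemption_policy="PreemptLowerPriority"):
    """Return pod names to evict so the pending pod fits, or None if it cannot fit."""
    if free_gpus >= gpus_needed:
        return []            # fits without evicting anything
    if preemption_policy == "Never":
        return None          # the pod waits in the queue instead of preempting
    # Only strictly lower-priority pods are candidates, lowest priority first
    candidates = sorted(
        (p for p in running if PRIORITY[p.priority_class] < PRIORITY[pending_class]),
        key=lambda p: PRIORITY[p.priority_class],
    )
    victims, freed = [], free_gpus
    for pod in candidates:
        victims.append(pod.name)
        freed += pod.gpus
        if freed >= gpus_needed:
            return victims
    return None


running = [
    RunningPod("batch-1", "ml-low", 1),
    RunningPod("train-1", "ml-high", 2),
    RunningPod("serve-1", "ml-critical", 1),
]
print(pick_victims("ml-critical", 2, 0, running))
print(pick_victims("ml-low", 1, 0, running, preemption_policy="Never"))
```

The second call shows why ml-low sets preemptionPolicy: Never. A low-priority batch pod waits for capacity rather than evicting anything.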
Multi-Tenancy
Namespace-Based Multi-Tenancy
# kubernetes/ml-tenants.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: ml-tenant-data-science
  labels:
    tenant: data-science
    istio-injection: enabled
---
apiVersion: v1
kind: Namespace
metadata:
  name: ml-tenant-recommendations
  labels:
    tenant: recommendations
    istio-injection: enabled
---
# Tenant isolation with ResourceQuota
apiVersion: v1
kind: ResourceQuota
metadata:
  name: data-science-quota
  namespace: ml-tenant-data-science
spec:
  hard:
    requests.cpu: "50"
    requests.memory: "100Gi"
    requests.nvidia.com/gpu: "10"
    pods: "25"
---
# Network isolation
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: tenant-isolation
  namespace: ml-tenant-data-science
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              tenant: data-science
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              tenant: data-science
    # DNS lookups to any destination
    - to: []
      ports:
        - protocol: TCP
          port: 53
        - protocol: UDP
          port: 53
RBAC for ML Teams
# kubernetes/ml-rbac.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: ml-developer
  namespace: ml-development
rules:
  - apiGroups: [""]
    resources: ["pods", "services", "configmaps"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: ml-ops
  namespace: ml-production
rules:
  - apiGroups: [""]
    resources: ["pods", "services", "configmaps", "secrets"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
  - apiGroups: ["batch"]
    resources: ["jobs", "cronjobs"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
  - apiGroups: ["apps"]
    resources: ["deployments", "statefulsets"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
  - apiGroups: ["networking.k8s.io"]
    resources: ["ingresses"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: ml-developer-binding
  namespace: ml-development
subjects:
  - kind: Group
    name: ml-developers
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: ml-developer
  apiGroup: rbac.authorization.k8s.io
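RBAC rules are purely additive: a request is allowed if at least one rule matches its API group, resource, and verb, and denied otherwise. The Python sketch below evaluates the ml-developer rules from the manifest above under that model. The is_allowed helper is hypothetical and ignores details such as resourceNames and subresources; on a live cluster, "kubectl auth can-i" answers the same question authoritatively.

```python
def is_allowed(rules, api_group, resource, verb):
    """True if any rule grants the verb on the resource ("*" acts as a wildcard)."""
    def matches(allowed, value):
        return "*" in allowed or value in allowed

    return any(
        matches(rule["apiGroups"], api_group)
        and matches(rule["resources"], resource)
        and matches(rule["verbs"], verb)
        for rule in rules
    )


FULL_ACCESS = ["get", "list", "watch", "create", "update", "patch", "delete"]
READ_ONLY = ["get", "list", "watch"]

# Rules copied from the ml-developer Role
ML_DEVELOPER = [
    {"apiGroups": [""], "resources": ["pods", "services", "configmaps"], "verbs": FULL_ACCESS},
    {"apiGroups": ["batch"], "resources": ["jobs"], "verbs": FULL_ACCESS},
    {"apiGroups": ["apps"], "resources": ["deployments"], "verbs": READ_ONLY},
]

print(is_allowed(ML_DEVELOPER, "batch", "jobs", "create"))
print(is_allowed(ML_DEVELOPER, "apps", "deployments", "delete"))
print(is_allowed(ML_DEVELOPER, "", "secrets", "get"))
```

Developers can launch training jobs but cannot delete deployments or read secrets, which is the separation the two roles are meant to express.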
Cost Optimization
Spot Instance Strategy
# kubernetes/spot-training-job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: spot-training-job
  namespace: ml-training
spec:
  template:
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  # Assumption: nodes carry this capacity-type label. The
                  # actual key depends on the provisioner, for example
                  # karpenter.sh/capacity-type with Karpenter.
                  - key: node.kubernetes.io/capacity-type
                    operator: In
                    values:
                      - spot
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - key: app
                      operator: In
                      values:
                        - ml-training
                topologyKey: kubernetes.io/hostname
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
        - key: spot
          operator: Exists
          effect: NoSchedule
      containers:
        - name: training
          image: registry.example.com/ml-training:latest
          resources:
            requests:
              nvidia.com/gpu: "1"
              memory: "16Gi"
              cpu: "4000m"
            limits:
              nvidia.com/gpu: "1"
              memory: "32Gi"
              cpu: "8000m"
      restartPolicy: Never
      terminationGracePeriodSeconds: 30
  backoffLimit: 5
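Spot capacity only saves money if a reclaimed pod can resume rather than restart from scratch. When a pod is terminated, Kubernetes sends SIGTERM and then SIGKILL once terminationGracePeriodSeconds (30 seconds above) has elapsed, so the training process has that window to write a checkpoint. The Python sketch below shows one way to do this under simple assumptions: the CheckpointingTrainer class, the JSON checkpoint format, and the file path are hypothetical, and a real job would save model and optimizer state to the mounted output volume.

```python
import json
import os
import signal
import tempfile


class CheckpointingTrainer:
    def __init__(self, checkpoint_path, total_steps):
        self.checkpoint_path = checkpoint_path
        self.total_steps = total_steps
        self.stop_requested = False
        self.step = self._load_step()   # resume where the last run stopped

    def _load_step(self):
        if os.path.exists(self.checkpoint_path):
            with open(self.checkpoint_path) as f:
                return json.load(f)["step"]
        return 0

    def request_stop(self, signum=None, frame=None):
        """Signal handler: ask the loop to stop after the current step."""
        self.stop_requested = True

    def save(self):
        # Write to a temp file and rename, so a kill mid-write
        # never leaves a truncated checkpoint behind
        tmp_path = self.checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump({"step": self.step}, f)
        os.replace(tmp_path, self.checkpoint_path)

    def run(self):
        while self.step < self.total_steps and not self.stop_requested:
            self.step += 1   # placeholder for one real training step
        self.save()
        return self.step


if __name__ == "__main__":
    path = os.path.join(tempfile.gettempdir(), "train-checkpoint.json")
    trainer = CheckpointingTrainer(path, total_steps=1000)
    signal.signal(signal.SIGTERM, trainer.request_stop)
    print("stopped at step", trainer.run())
```

With backoffLimit: 5, the Job controller recreates the pod after an interruption, and the new pod picks up from the saved step as long as the checkpoint lives on storage that outlasts the node.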
Cost Monitoring
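# kubernetes/cost-monitoring.py
import time

from kubernetes import client, config
from prometheus_client import Gauge, start_http_server


class KubernetesCostMonitor:
    def __init__(self):
        config.load_incluster_config()
        self.v1 = client.CoreV1Api()
        self.custom_api = client.CustomObjectsApi()
        # Prometheus metrics
        self.pod_cost = Gauge(
            'kubernetes_pod_cost_dollars',
            'Estimated pod cost per hour',
            ['namespace', 'pod', 'node', 'gpu_type']
        )
        self.namespace_cost = Gauge(
            'kubernetes_namespace_cost_dollars_hour',
            'Total namespace cost per hour',
            ['namespace']
        )

    def calculate_pod_cost(self, pod, node_info) -> float:
        """Estimate pod cost per hour as a share of its node's hourly cost."""
        node_cost = node_info.get('cost_per_hour', 0.0)
        node_gpus = node_info.get('gpus', 0)
        node_cpus = node_info.get('cpus', 0.0)
        node_memory = node_info.get('memory_gb', 0.0)

        requests = [
            container.resources.requests
            for container in pod.spec.containers
            if container.resources and container.resources.requests
        ]
        gpu_requests = sum(int(r.get('nvidia.com/gpu', 0)) for r in requests)
        cpu_requests = sum(self._parse_cpu(r.get('cpu', '0')) for r in requests)
        memory_requests = sum(self._parse_memory(r.get('memory', '0Gi')) for r in requests)

        # GPU pods: charge by the fraction of the node's GPUs they request
        if node_gpus > 0 and gpu_requests > 0:
            return (gpu_requests / node_gpus) * node_cost

        # Other pods: charge by the larger of their CPU and memory fractions
        # (a simple allocation choice; adjust to your own chargeback model)
        cpu_share = cpu_requests / node_cpus if node_cpus > 0 else 0.0
        memory_share = memory_requests / node_memory if node_memory > 0 else 0.0
        return max(cpu_share, memory_share) * node_cost

    @staticmethod
    def _parse_cpu(cpu_str: str) -> float:
        """Parse a CPU quantity ("500m" or "2") to cores."""
        cpu_str = str(cpu_str)
        if cpu_str.endswith('m'):
            return float(cpu_str[:-1]) / 1000
        return float(cpu_str)

    @staticmethod
    def _parse_memory(memory_str: str) -> float:
        """Parse a memory quantity with a binary suffix (Ki, Mi, Gi, Ti) to GiB."""
        units = {'Ki': 1 / 1024 ** 2, 'Mi': 1 / 1024, 'Gi': 1.0, 'Ti': 1024.0}
        memory_str = str(memory_str)
        if memory_str[-2:] in units:
            return float(memory_str[:-2]) * units[memory_str[-2:]]
        return 0.0

    def update_metrics(self):
        """Update cost metrics for all pods."""
        pods = self.v1.list_pod_for_all_namespaces(
            label_selector='app=ml-serving'
        )
        namespace_costs = {}
        nodes = {}  # cache node lookups within one refresh
        for pod in pods.items:
            node_name = pod.spec.node_name
            if not node_name:
                continue  # pending pods are not on a node yet
            if node_name not in nodes:
                nodes[node_name] = self.v1.read_node(node_name)
            node = nodes[node_name]
            labels = node.metadata.labels or {}
            capacity = node.status.capacity or {}
            node_info = {
                # 'cost-per-hour' is a custom label this script expects on each node
                'cost_per_hour': float(labels.get('cost-per-hour', '0')),
                # The device plugin reports GPUs in node capacity, not in labels
                'gpus': int(capacity.get('nvidia.com/gpu', '0')),
                'cpus': self._parse_cpu(capacity.get('cpu', '0')),
                'memory_gb': self._parse_memory(capacity.get('memory', '0Gi')),
            }
            cost = self.calculate_pod_cost(pod, node_info)
            self.pod_cost.labels(
                namespace=pod.metadata.namespace,
                pod=pod.metadata.name,
                node=node_name,
                gpu_type=labels.get('accelerator', 'none')
            ).set(cost)
            # Aggregate by namespace
            ns = pod.metadata.namespace
            namespace_costs[ns] = namespace_costs.get(ns, 0) + cost
        # Update namespace costs
        for ns, cost in namespace_costs.items():
            self.namespace_cost.labels(namespace=ns).set(cost)


if __name__ == '__main__':
    # Port 8000 and the 60-second refresh interval are illustrative choices
    start_http_server(8000)
    monitor = KubernetesCostMonitor()
    while True:
        monitor.update_metrics()
        time.sleep(60)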
Note: Cost optimization is critical for ML workloads. Use spot instances for training, right-size resource requests, and rely on auto-scaling to reduce cost while maintaining performance.
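Whether spot capacity actually pays off depends on how much work is redone after interruptions. The Python sketch below compares the two options for a training job. All prices and the rework fraction are hypothetical inputs for illustration, not measured figures, and the function names are introduced here.

```python
def job_cost(gpu_hours, price_per_gpu_hour, rework_fraction=0.0):
    """Cost of a job; rework_fraction is the extra compute redone after interruptions."""
    return gpu_hours * (1 + rework_fraction) * price_per_gpu_hour


def spot_savings(gpu_hours, on_demand_price, spot_price, rework_fraction):
    """Return (absolute savings, fractional savings) of spot versus on-demand."""
    on_demand = job_cost(gpu_hours, on_demand_price)
    spot = job_cost(gpu_hours, spot_price, rework_fraction)
    return on_demand - spot, 1 - spot / on_demand


def spot_is_cheaper(on_demand_price, spot_price, rework_fraction):
    """Spot wins as long as its price, inflated by rework, stays below on-demand."""
    return spot_price * (1 + rework_fraction) < on_demand_price


# Hypothetical numbers: 100 GPU-hours, $1.00 on-demand, $0.40 spot, 25% rework
saved, fraction = spot_savings(100, 1.00, 0.40, 0.25)
print(f"saved ${saved:.2f} ({fraction:.0%})")
print(spot_is_cheaper(1.00, 0.40, 0.25))
```

Frequent checkpointing lowers the rework fraction, which is why it matters as much to the final bill as the spot discount itself.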
Production ML Infrastructure
Complete ML Infrastructure Stack
# kubernetes/ml-infrastructure.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-pipeline-controller
  namespace: kubeflow
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ml-pipeline-controller
  template:
    metadata:
      labels:
        app: ml-pipeline-controller
    spec:
      serviceAccountName: ml-pipeline
      containers:
        - name: controller
          image: gcr.io/ml-pipeline/pipeline-controller:latest
          ports:
            - containerPort: 8080
          resources:
            requests:
              memory: "2Gi"
              cpu: "1000m"
            limits:
              memory: "4Gi"
              cpu: "2000m"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-serving-operator
  namespace: kserve
spec:
  replicas: 1
  selector:
    matchLabels:
      app: model-serving-operator
  template:
    metadata:
      labels:
        app: model-serving-operator
    spec:
      serviceAccountName: kserve-controller
      containers:
        - name: manager
          image: kserve/kserve:latest
          ports:
            - containerPort: 8080
          resources:
            requests:
              memory: "512Mi"
              cpu: "500m"
            limits:
              memory: "1Gi"
              cpu: "1000m"
Summary
Kubernetes provides the foundation for production ML infrastructure:
- GPU Scheduling: NVIDIA GPU Operator, time-slicing, MIG
- Model Serving: KServe with auto-scaling and canary deployments
- Resource Management: Quotas, limits, and priority classes
- Multi-Tenancy: Namespace isolation with RBAC
- Cost Optimization: Spot instances, right-sizing, monitoring
Implement Kubernetes-based ML infrastructure for scalable, production-grade deployments.