K8s for ML: GPU Scheduling, Model Serving, Auto-Scaling

Interview Question (Hard) — Asked at: Google, Microsoft, Amazon, Netflix, Uber

"Design a Kubernetes-based ML infrastructure that supports GPU workloads, model serving, and auto-scaling. How do you handle resource management, cost optimization, and multi-tenancy?"

Kubernetes for ML Architecture

Kubernetes provides the foundation for scalable, production-grade ML infrastructure. It handles resource scheduling, scaling, and management of ML workloads.

ML on Kubernetes Architecture

Architecture Diagram

┌─────────────────────────────────────────────────────────────────┐
│              Kubernetes ML Infrastructure                         │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │                   Control Plane                          │   │
│  │    API Server | Scheduler | Controller Manager          │   │
│  └─────────────────────────────────────────────────────────┘   │
│                              │                                  │
│         ┌────────────────────┼────────────────────┐            │
│         ▼                    ▼                    ▼            │
│  ┌──────────┐      ┌──────────────┐      ┌──────────┐        │
│  │  GPU     │      │   CPU-only   │      │  Hybrid  │        │
│  │  Nodes   │      │    Nodes     │      │  Nodes   │        │
│  │ (T4/A10G)│      │              │      │          │        │
│  └──────────┘      └──────────────┘      └──────────┘        │
│       │                  │                   │                │
│       │   ┌──────────────┴───────────────┐  │                │
│       │   │       ML Workloads            │  │                │
│       │   │  Training | Serving | Batch   │  │                │
│       │   └──────────────────────────────┘  │                │
│       │                                      │                │
│       └──────────────┬───────────────────────┘                │
│                      ▼                                        │
│  ┌─────────────────────────────────────────────────────────┐  │
│  │              ML Operators & Controllers                  │  │
│  │     Kubeflow | KServe | Volcano | NVIDIA Operator       │  │
│  └─────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────┘

GPU Scheduling

NVIDIA GPU Operator Setup

# nvidia-gpu-operator-values.yaml
operator:
  defaultRuntime: containerd
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule

driver:
  enabled: true
  version: "525.85.12"

toolkit:
  enabled: true

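devicePlugin:
  enabled: true
  config:
    # Let the chart create the ConfigMap; the name below is only an example
    create: true
    name: time-slicing-config
    default: any
    data:
      any: |-
        version: v1
        sharing:
          timeSlicing:
            resources:
            - name: nvidia.com/gpu
              replicas: 10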

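# GPU Feature Discovery is configured under the "gfd" key in the chart values
gfd:
  enabled: true

dcgmExporter:
  enabled: true
  # The ServiceMonitor for GPU metrics belongs to the DCGM exporter
  serviceMonitor:
    enabled: true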

GPU-Aware Pod Scheduling

# kubernetes/gpu-training-job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: gpu-training-job
  namespace: ml-training
spec:
  completions: 1
  parallelism: 1
  backoffLimit: 3
  template:
    metadata:
      labels:
        app: ml-training
        gpu-type: nvidia-tesla-t4
    spec:
      restartPolicy: Never
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      nodeSelector:
        accelerator: nvidia-tesla-t4
      containers:
      - name: training
        image: registry.example.com/ml-training:latest
        command: ["python", "train.py"]
        env:
        # The device plugin sets NVIDIA_VISIBLE_DEVICES for the allocated GPU;
        # overriding it with "all" would expose every GPU on the node.
        - name: NVIDIA_DRIVER_CAPABILITIES
          value: "compute,utility"
        resources:
          requests:
            memory: "16Gi"
            cpu: "4000m"
            nvidia.com/gpu: "1"
          limits:
            memory: "32Gi"
            cpu: "8000m"
            nvidia.com/gpu: "1"
        volumeMounts:
        - name: training-data
          mountPath: /data
        - name: model-output
          mountPath: /output
      volumes:
      - name: training-data
        persistentVolumeClaim:
          claimName: training-data-pvc
      - name: model-output
        persistentVolumeClaim:
          claimName: model-output-pvc
---
apiVersion: v1
kind: Pod
metadata:
  name: gpu-inference-pod
  namespace: ml-serving
spec:
  containers:
  - name: inference
    image: registry.example.com/ml-inference:latest
    resources:
      requests:
        nvidia.com/gpu: "1"
        memory: "8Gi"
        cpu: "2000m"
      limits:
        nvidia.com/gpu: "1"
        memory: "16Gi"
        cpu: "4000m"
    env:
    - name: CUDA_VISIBLE_DEVICES
      value: "0"
  nodeSelector:
    accelerator: nvidia-tesla-t4
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
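
To make the scheduling constraints in these manifests concrete, the following Python sketch models how a nodeSelector and a toleration are checked against a node's labels and taints. It is a simplified model of the matching rules, not the scheduler's actual code, and it ignores resource fit, affinity, and soft (PreferNoSchedule) taints. The helper names `tolerates` and `can_schedule` and the example node are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Taint:
    key: str
    effect: str
    value: str = ""


@dataclass(frozen=True)
class Toleration:
    key: str = ""            # an empty key is only valid with operator "Exists"
    operator: str = "Equal"  # Kubernetes defaults to "Equal"
    value: str = ""
    effect: str = ""         # an empty effect matches every effect


def tolerates(toleration: Toleration, taint: Taint) -> bool:
    """Return True if this toleration matches the given taint."""
    if toleration.effect and toleration.effect != taint.effect:
        return False
    if toleration.key and toleration.key != taint.key:
        return False
    if toleration.operator == "Exists":
        return True
    return toleration.value == taint.value


def can_schedule(node_labels, node_taints, node_selector, tolerations) -> bool:
    """Check nodeSelector labels and hard taints for a single node."""
    # Every nodeSelector entry must be present on the node with the same value.
    if any(node_labels.get(k) != v for k, v in node_selector.items()):
        return False
    # Every NoSchedule/NoExecute taint must be matched by some toleration.
    return all(
        any(tolerates(t, taint) for t in tolerations)
        for taint in node_taints
        if taint.effect in ("NoSchedule", "NoExecute")
    )


# Hypothetical GPU node mirroring the manifests above.
gpu_node_labels = {"accelerator": "nvidia-tesla-t4"}
gpu_node_taints = [Taint(key="nvidia.com/gpu", effect="NoSchedule")]
selector = {"accelerator": "nvidia-tesla-t4"}
gpu_toleration = [
    Toleration(key="nvidia.com/gpu", operator="Exists", effect="NoSchedule")
]

print(can_schedule(gpu_node_labels, gpu_node_taints, selector, gpu_toleration))
print(can_schedule(gpu_node_labels, gpu_node_taints, selector, []))
```

The first call models the training Job above and passes both checks; the second drops the toleration, so the tainted GPU node is rejected.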

GPU Sharing with Time Slicing

# kubernetes/gpu-sharing-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: gpu-device-plugin-config
  namespace: kube-system
data:
  any: |
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 4

ℹ️

Time-slicing lets multiple pods share a single GPU by interleaving their work on it. This raises GPU utilization, but pods get no memory or fault isolation from each other, and per-request latency can increase. Use it for development, testing, or latency-tolerant workloads.
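
As a rough illustration of what the `replicas` setting means for capacity planning, the sketch below computes how many `nvidia.com/gpu` resources a node advertises under time-slicing and the per-pod memory budget you would have to respect yourself, since time-slicing does not partition GPU memory. The function names are hypothetical, and the 16 GiB figure assumes a T4 card.

```python
def advertised_gpus(physical_gpus: int, replicas: int) -> int:
    """nvidia.com/gpu resources a node advertises when each GPU is time-sliced."""
    if physical_gpus < 0 or replicas < 1:
        raise ValueError("physical_gpus must be >= 0 and replicas >= 1")
    return physical_gpus * replicas


def memory_budget_gib(gpu_memory_gib: float, replicas: int) -> float:
    """Even split of GPU memory across replicas.

    This is a planning budget only: time-slicing does not enforce it, so one
    pod can still exhaust the memory that the others rely on.
    """
    if replicas < 1:
        raise ValueError("replicas must be >= 1")
    return gpu_memory_gib / replicas


# One T4 (assumed 16 GiB) with replicas: 4, as in the ConfigMap above.
print(advertised_gpus(physical_gpus=1, replicas=4))
print(memory_budget_gib(gpu_memory_gib=16, replicas=4))
```

Under these assumptions a pod requesting `nvidia.com/gpu: "1"` receives a time slice of the card rather than a whole GPU, which is why the memory budget has to be enforced inside the workload.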

KServe Model Serving

KServe InferenceService

# kserve/inferenceservice.yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: fraud-detection
  namespace: ml-serving
  annotations:
    serving.kserve.io/enable-prometheus-scraping: "true"
spec:
  predictor:
    model:
      modelFormat:
        name: onnx
      storageUri: "s3://ml-models/fraud-detection/v1.0"
      runtime: kserve-onnxruntime
      resources:
        requests:
          memory: "4Gi"
          cpu: "2000m"
          nvidia.com/gpu: "1"
        limits:
          memory: "8Gi"
          cpu: "4000m"
          nvidia.com/gpu: "1"
      args:
      - --concurrency=32
      - --max-batch-size=32
      - --batch-delay=10
  transformer:
    containers:
    - name: preprocessor
      image: registry.example.com/ml-preprocessor:latest
      resources:
        requests:
          memory: "2Gi"
          cpu: "1000m"
        limits:
          memory: "4Gi"
          cpu: "2000m"
  explainer:
    alibi:
      type: AnchorTabular
      resources:
        requests:
          memory: "4Gi"
          cpu: "2000m"
        limits:
          memory: "8Gi"
          cpu: "4000m"
---
# kserve/serving-runtime.yaml
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: kserve-onnxruntime
  namespace: ml-serving  # a namespaced ServingRuntime must live alongside the InferenceService that uses it
spec:
  supportedModelFormats:
  - name: onnx
    version: "1"
  containers:
  - name: kserve-container
    image: kserve/onnxserver:latest
    args:
    - --http_port=8080
    - --grpc_port=8001
    - --rest_api_log_level=WARNING
    resources:
      requests:
        cpu: "1"
        memory: "2Gi"
      limits:
        cpu: "2"
        memory: "4Gi"
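  # Scaling is configured per InferenceService, not on the runtime: set
  # minReplicas: 1, maxReplicas: 10, scaleMetric: rps, and scaleTarget: 100
  # under spec.predictor of the InferenceService that uses this runtime.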
---
# kserve/triton.yaml
apiVersion: serving.kserve.io/v1alpha1
kind: ClusterServingRuntime
metadata:
  name: triton-inference-server
spec:
  supportedModelFormats:
  - name: ensemble
    version: "1"
  - name: python
    version: "1"
  containers:
  - name: triton
    image: nvcr.io/nvidia/tritonserver:23.10-py3
    args:
    - tritonserver
    - --model-repository=/models
    - --log-verbose=1
    - --strict-model-config=false
    ports:
    - containerPort: 8000
      name: http
    - containerPort: 8001
      name: grpc
    - containerPort: 8002
      name: metrics
    resources:
      requests:
        cpu: "1"
        memory: "4Gi"
        nvidia.com/gpu: "1"
      limits:
        cpu: "4"
        memory: "8Gi"
        nvidia.com/gpu: "1"
    readinessProbe:
      httpGet:
        path: /v2/health/ready
        port: 8000
      initialDelaySeconds: 30
      periodSeconds: 10
    livenessProbe:
      httpGet:
        path: /v2/health/live
        port: 8000
      initialDelaySeconds: 60
      periodSeconds: 30

KServe with Auto-Scaling

# kserve/autoscaling.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: fraud-detection-hpa
  namespace: ml-serving
spec:
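  # An InferenceService has no scale subresource, so the HPA targets the predictor
  # Deployment that KServe creates in RawDeployment mode (the name may vary by version).
  # Set serving.kserve.io/autoscalerClass: external on the InferenceService so that
  # KServe does not create its own HPA for the same Deployment.
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: fraud-detection-predictor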
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
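  # Pods metrics are not built in; they require a custom metrics adapter
  # (for example Prometheus Adapter) that exposes these metric names.
  - type: Pods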
    pods:
      metric:
        name: requests_per_second
      target:
        type: AverageValue
        averageValue: "100"
  - type: Pods
    pods:
      metric:
        name: gpu_utilization
      target:
        type: AverageValue
        averageValue: "80"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Percent
        value: 100
        periodSeconds: 60
      - type: Pods
        value: 4
        periodSeconds: 60
      selectPolicy: Max
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 10
        periodSeconds: 120
      selectPolicy: Min
---
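# kserve/kpa.yaml
# In Serverless mode the Knative Pod Autoscaler (KPA) is configured on the
# InferenceService itself; PodAutoscaler objects are internal to Knative and
# are not created by hand. Merge these fields into kserve/inferenceservice.yaml
# and use either the KPA or the HPA above for a given service, not both.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: fraud-detection
  namespace: ml-serving
  annotations:
    autoscaling.knative.dev/class: kpa.autoscaling.knative.dev
spec:
  predictor:
    minReplicas: 1
    maxReplicas: 20
    scaleMetric: rps
    scaleTarget: 100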

⚠️

KServe provides production-ready model serving with auto-scaling, canary deployments, and explainability. Use it for standardized ML serving across your organization.
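
To show how the HPA targets above translate into replica counts, here is a minimal Python sketch of the documented HPA scaling rule: desired replicas = ceil(current replicas × current metric / target metric), with no change when the ratio is within a tolerance (0.1 by default), and the largest proposal winning when several metrics are configured. It deliberately ignores the `behavior` stabilization windows and rate policies, and the function names are hypothetical.

```python
import math


def desired_replicas(current_replicas, current_value, target_value,
                     min_replicas, max_replicas, tolerance=0.1):
    """Replica count proposed by one metric, clamped to the HPA bounds."""
    ratio = current_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        desired = current_replicas  # close enough to target: do nothing
    else:
        desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


def desired_for_metrics(current_replicas, metrics, min_replicas, max_replicas):
    """With several metrics, the HPA uses the largest proposed replica count.

    `metrics` is a list of (current_value, target_value) pairs.
    """
    return max(
        desired_replicas(current_replicas, current, target,
                         min_replicas, max_replicas)
        for current, target in metrics
    )


# 4 replicas averaging 180 requests/s per pod against a target of 100.
print(desired_replicas(4, 180, 100, min_replicas=2, max_replicas=20))  # 8

# CPU at 56% (target 70) proposes 4; request rate proposes 8; the HPA takes 8.
print(desired_for_metrics(4, [(56, 70), (180, 100)], 2, 20))  # 8
```

Because the largest proposal wins, adding more metrics to an HPA can only make it scale out earlier, never later.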

Resource Management

Resource Quotas and Limit Ranges

# kubernetes/resource-quota.yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: ml-training-quota
  namespace: ml-training
spec:
  hard:
    requests.cpu: "100"
    requests.memory: "200Gi"
    # Quota for extended resources such as GPUs supports only the requests. prefix
    requests.nvidia.com/gpu: "20"
    limits.cpu: "200"
    limits.memory: "400Gi"
    pods: "50"
    services: "20"
    persistentvolumeclaims: "30"
---
apiVersion: v1
kind: LimitRange
metadata:
  name: ml-training-limits
  namespace: ml-training
spec:
  limits:
  - type: Container
    default:
      cpu: "2000m"
      memory: "8Gi"
    defaultRequest:
      cpu: "1000m"
      memory: "4Gi"
    max:
      cpu: "8000m"
      memory: "32Gi"
      nvidia.com/gpu: "2"
    min:
      cpu: "500m"
      memory: "1Gi"
  - type: Pod
    max:
      cpu: "16000m"
      memory: "64Gi"
      nvidia.com/gpu: "4"
---
# kubernetes/network-policy.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: ml-network-policy
  namespace: ml-serving
spec:
  podSelector:
    matchLabels:
      app: ml-serving
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          name: frontend
    ports:
    - protocol: TCP
      port: 8080
  - from:
    - namespaceSelector:
        matchLabels:
          name: monitoring
    ports:
    - protocol: TCP
      port: 9090
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          name: data-store
    ports:
    - protocol: TCP
      port: 5432
  - to: []
    ports:
    - protocol: TCP
      port: 53
    - protocol: UDP
      port: 53

Priority Classes for ML Workloads

# kubernetes/priority-classes.yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: ml-critical
value: 1000000
globalDefault: false
description: "Critical ML serving workloads"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: ml-high
value: 100000
globalDefault: false
description: "High priority ML training workloads"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: ml-normal
value: 10000
globalDefault: true  # applies to every pod in the cluster that sets no priorityClassName
description: "Normal ML workloads"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: ml-low
value: 1000
globalDefault: false
description: "Low priority batch processing"
preemptionPolicy: Never
---
# Usage in pods
apiVersion: v1
kind: Pod
metadata:
  name: critical-serving
spec:
  priorityClassName: ml-critical
  containers:
  - name: inference
    image: registry.example.com/ml-inference:latest

ℹ️

Use Priority Classes to ensure critical ML workloads get resources first. Set appropriate preemption policies to avoid disrupting running jobs unnecessarily.
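
The interaction between priority values and `preemptionPolicy` can be sketched in a few lines of Python. This is a simplified model, assuming only the two rules stated in the manifests above: a pod may preempt only pods of strictly lower priority, and a pod whose class has `preemptionPolicy: Never` never preempts. The real scheduler also weighs PodDisruptionBudgets, graceful termination, and node fit. The `PodInfo` type and the pod names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PodInfo:
    name: str
    priority: int
    preemption_policy: str = "PreemptLowerPriority"  # Kubernetes default


def can_preempt(pending: PodInfo, running: PodInfo) -> bool:
    """True if the pending pod is allowed to evict the running pod."""
    if pending.preemption_policy == "Never":
        return False
    return pending.priority > running.priority


def scheduling_order(pending_pods):
    """Pending pods are considered in descending priority order."""
    return sorted(pending_pods, key=lambda p: p.priority, reverse=True)


# Priority values taken from the PriorityClass manifests above.
serving = PodInfo("critical-serving", 1_000_000)
training = PodInfo("training", 100_000)
batch = PodInfo("batch", 1_000, preemption_policy="Never")

print(can_preempt(serving, training))  # True
print(can_preempt(training, serving))  # False
print(can_preempt(batch, PodInfo("scratch", 0)))  # False: policy is Never
print([p.name for p in scheduling_order([batch, serving, training])])
```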

Multi-Tenancy

Namespace-Based Multi-Tenancy

# kubernetes/ml-tenants.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: ml-tenant-data-science
  labels:
    tenant: data-science
    istio-injection: enabled
---
apiVersion: v1
kind: Namespace
metadata:
  name: ml-tenant-recommendations
  labels:
    tenant: recommendations
    istio-injection: enabled
---
# Tenant isolation with ResourceQuota
apiVersion: v1
kind: ResourceQuota
metadata:
  name: data-science-quota
  namespace: ml-tenant-data-science
spec:
  hard:
    requests.cpu: "50"
    requests.memory: "100Gi"
    requests.nvidia.com/gpu: "10"
    pods: "25"
---
# Network isolation
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: tenant-isolation
  namespace: ml-tenant-data-science
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          tenant: data-science
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          tenant: data-science
  - to: []
    ports:
    - protocol: TCP
      port: 53
    - protocol: UDP
      port: 53

RBAC for ML Teams

# kubernetes/ml-rbac.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: ml-developer
  namespace: ml-development
rules:
- apiGroups: [""]
  resources: ["pods", "services", "configmaps"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
- apiGroups: ["batch"]
  resources: ["jobs"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
- apiGroups: ["apps"]
  resources: ["deployments"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: ml-ops
  namespace: ml-production
rules:
- apiGroups: [""]
  resources: ["pods", "services", "configmaps", "secrets"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
- apiGroups: ["batch"]
  resources: ["jobs", "cronjobs"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
- apiGroups: ["apps"]
  resources: ["deployments", "statefulsets"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
- apiGroups: ["networking.k8s.io"]
  resources: ["ingresses"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: ml-developer-binding
  namespace: ml-development
subjects:
- kind: Group
  name: ml-developers
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: ml-developer
  apiGroup: rbac.authorization.k8s.io
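
A quick way to reason about what a Role permits is to evaluate its rules directly. The sketch below encodes the `ml-developer` rules from the manifest above and checks requests against them. It is a simplified model (no `resourceNames`, subresources, or ClusterRoles), and the helper `is_allowed` is hypothetical; against a real cluster, `kubectl auth can-i` answers the same question.

```python
READ_WRITE = ["get", "list", "watch", "create", "update", "patch", "delete"]
READ_ONLY = ["get", "list", "watch"]

# Rules of the ml-developer Role, copied from the manifest above.
ML_DEVELOPER_RULES = [
    {"apiGroups": [""], "resources": ["pods", "services", "configmaps"],
     "verbs": READ_WRITE},
    {"apiGroups": ["batch"], "resources": ["jobs"], "verbs": READ_WRITE},
    {"apiGroups": ["apps"], "resources": ["deployments"], "verbs": READ_ONLY},
]


def _matches(allowed, value):
    """A rule field matches on an exact entry or the "*" wildcard."""
    return "*" in allowed or value in allowed


def is_allowed(rules, api_group, resource, verb):
    """RBAC is purely additive: a request is allowed if any rule matches."""
    return any(
        _matches(rule["apiGroups"], api_group)
        and _matches(rule["resources"], resource)
        and _matches(rule["verbs"], verb)
        for rule in rules
    )


print(is_allowed(ML_DEVELOPER_RULES, "batch", "jobs", "create"))       # True
print(is_allowed(ML_DEVELOPER_RULES, "apps", "deployments", "create"))  # False
print(is_allowed(ML_DEVELOPER_RULES, "", "secrets", "get"))             # False
```

The last two results show the intent of the developer role: Deployments are read-only and Secrets are not visible at all.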

Cost Optimization

Spot Instance Strategy

# kubernetes/spot-training-job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: spot-training-job
  namespace: ml-training
spec:
  template:
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
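              # The capacity-type label depends on how nodes are provisioned, e.g.
              # karpenter.sh/capacity-type (Karpenter) or
              # eks.amazonaws.com/capacityType (EKS managed node groups).
              - key: node.kubernetes.io/capacity-type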
                operator: In
                values:
                - spot
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - ml-training
              topologyKey: kubernetes.io/hostname
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      - key: spot
        operator: Exists
        effect: NoSchedule
      containers:
      - name: training
        image: registry.example.com/ml-training:latest
        resources:
          requests:
            nvidia.com/gpu: "1"
            memory: "16Gi"
            cpu: "4000m"
          limits:
            nvidia.com/gpu: "1"
            memory: "32Gi"
            cpu: "8000m"
      restartPolicy: Never
      terminationGracePeriodSeconds: 30
  backoffLimit: 5

Cost Monitoring

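# kubernetes/cost-monitoring.py
import time

from kubernetes import client, config
from prometheus_client import Gauge, start_http_server


class KubernetesCostMonitor:
    def __init__(self):
        config.load_incluster_config()
        self.v1 = client.CoreV1Api()

        # Prometheus metrics
        self.pod_cost = Gauge(
            'kubernetes_pod_cost_dollars',
            'Estimated pod cost per hour',
            ['namespace', 'pod', 'node', 'gpu_type']
        )

        self.namespace_cost = Gauge(
            'kubernetes_namespace_cost_dollars_hour',
            'Total namespace cost per hour',
            ['namespace']
        )

    def calculate_pod_cost(self, pod, node_info) -> float:
        """Estimate pod cost per hour as a share of its node's hourly cost."""
        node_cost = node_info.get('cost_per_hour', 0.0)

        # Request maps of all containers that declare resource requests
        requests = [
            container.resources.requests
            for container in pod.spec.containers
            if container.resources and container.resources.requests
        ]

        gpu_requests = sum(int(r.get('nvidia.com/gpu', 0)) for r in requests)
        cpu_requests = sum(self._parse_cpu(r.get('cpu', '0')) for r in requests)
        memory_requests = sum(self._parse_memory(r.get('memory', '0')) for r in requests)

        # GPU pods: charge by the fraction of the node's GPUs they reserve
        node_gpus = node_info.get('gpus', 0)
        if gpu_requests > 0 and node_gpus > 0:
            return (gpu_requests / node_gpus) * node_cost

        # CPU-only pods: charge by the larger of the CPU and memory shares.
        # This is one simple allocation model; adapt it to your chargeback policy.
        node_cpus = node_info.get('cpus', 0.0)
        node_memory = node_info.get('memory_gb', 0.0)
        cpu_share = cpu_requests / node_cpus if node_cpus > 0 else 0.0
        memory_share = memory_requests / node_memory if node_memory > 0 else 0.0
        return max(cpu_share, memory_share) * node_cost

    @staticmethod
    def _parse_cpu(cpu_str: str) -> float:
        """Parse a CPU quantity ("500m" or "4") to cores."""
        if cpu_str.endswith('m'):
            return float(cpu_str[:-1]) / 1000
        return float(cpu_str)

    def _parse_memory(self, memory_str: str) -> float:
        """Parse a memory quantity (binary suffix or plain bytes) to GiB."""
        units = {'Ki': 1 / 1024 ** 2, 'Mi': 1 / 1024, 'Gi': 1.0, 'Ti': 1024.0}
        for suffix, factor in units.items():
            if memory_str.endswith(suffix):
                return float(memory_str[:-2]) * factor
        try:
            return float(memory_str) / 1024 ** 3  # plain bytes
        except ValueError:
            return 0.0  # unsupported suffix (for example decimal "G" or "M")

    def update_metrics(self):
        """Update cost metrics for all scheduled ML serving pods."""
        pods = self.v1.list_pod_for_all_namespaces(
            label_selector='app=ml-serving'
        )

        namespace_costs = {}
        nodes = {}  # cache so each node is read only once per update

        for pod in pods.items:
            node_name = pod.spec.node_name
            if not node_name:
                continue  # pending pods are not assigned to a node yet

            if node_name not in nodes:
                nodes[node_name] = self.v1.read_node(node_name)
            node = nodes[node_name]
            labels = node.metadata.labels or {}
            allocatable = node.status.allocatable or {}

            node_info = {
                # 'cost-per-hour' is a custom node label that you must set yourself
                'cost_per_hour': float(labels.get('cost-per-hour', '0')),
                'gpus': int(allocatable.get('nvidia.com/gpu', '0')),
                'cpus': self._parse_cpu(allocatable.get('cpu', '0')),
                'memory_gb': self._parse_memory(allocatable.get('memory', '0')),
            }

            cost = self.calculate_pod_cost(pod, node_info)

            self.pod_cost.labels(
                namespace=pod.metadata.namespace,
                pod=pod.metadata.name,
                node=node_name,
                gpu_type=labels.get('accelerator', 'none')
            ).set(cost)

            # Aggregate by namespace
            ns = pod.metadata.namespace
            namespace_costs[ns] = namespace_costs.get(ns, 0) + cost

        # Update namespace costs
        for ns, cost in namespace_costs.items():
            self.namespace_cost.labels(namespace=ns).set(cost)


if __name__ == '__main__':
    start_http_server(8000)  # expose /metrics; the port is an arbitrary choice
    monitor = KubernetesCostMonitor()
    while True:
        monitor.update_metrics()
        time.sleep(60)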

ℹ️

Cost optimization is critical for ML workloads. Use spot instances for training, right-size resources, and implement auto-scaling to optimize costs while maintaining performance.
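
Whether spot capacity actually saves money depends on how much work is lost to interruptions. The sketch below captures that trade-off as simple arithmetic: a job's cost grows with the fraction of compute that has to be repeated, and spot stops being cheaper once that fraction passes a break-even point. The prices and the rework fraction in the example are made-up illustrative values, not real cloud prices, and the function names are hypothetical.

```python
def effective_cost(compute_hours: float, price_per_hour: float,
                   rework_fraction: float = 0.0) -> float:
    """Cost of a job when a fraction of its compute must be repeated."""
    if rework_fraction < 0:
        raise ValueError("rework_fraction must be >= 0")
    return compute_hours * (1.0 + rework_fraction) * price_per_hour


def break_even_rework(on_demand_price: float, spot_price: float) -> float:
    """Rework fraction at which spot costs the same as on-demand."""
    return on_demand_price / spot_price - 1.0


# Illustrative numbers only: 10 GPU-hours, $1.00/h on-demand, $0.30/h spot,
# and 20% of the work repeated after interruptions.
on_demand = effective_cost(10, 1.00)
spot = effective_cost(10, 0.30, rework_fraction=0.2)
print(f"on-demand: ${on_demand:.2f}, spot: ${spot:.2f}")
print(f"break-even rework fraction: {break_even_rework(1.00, 0.30):.2f}")
```

Frequent checkpointing lowers the rework fraction, which is what keeps interrupted spot training well below the break-even point.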

Production ML Infrastructure

Complete ML Infrastructure Stack

# kubernetes/ml-infrastructure.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-pipeline-controller
  namespace: kubeflow
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ml-pipeline-controller
  template:
    metadata:
      labels:
        app: ml-pipeline-controller
    spec:
      serviceAccountName: ml-pipeline
      containers:
      - name: controller
        image: gcr.io/ml-pipeline/pipeline-controller:latest
        ports:
        - containerPort: 8080
        resources:
          requests:
            memory: "2Gi"
            cpu: "1000m"
          limits:
            memory: "4Gi"
            cpu: "2000m"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-serving-operator
  namespace: kserve
spec:
  replicas: 1
  selector:
    matchLabels:
      app: model-serving-operator
  template:
    metadata:
      labels:
        app: model-serving-operator
    spec:
      serviceAccountName: kserve-controller
      containers:
      - name: manager
        image: kserve/kserve:latest
        ports:
        - containerPort: 8080
        resources:
          requests:
            memory: "512Mi"
            cpu: "500m"
          limits:
            memory: "1Gi"
            cpu: "1000m"

Summary

Kubernetes provides the foundation for production ML infrastructure:

GPU Scheduling: NVIDIA GPU Operator, node selectors and tolerations, time-slicing
Model Serving: KServe with auto-scaling and canary deployments
Resource Management: Quotas, limits, and priority classes
Multi-Tenancy: Namespace isolation with RBAC
Cost Optimization: Spot instances, right-sizing, monitoring

Implement Kubernetes-based ML infrastructure for scalable, production-grade deployments.