Spark on K8s: Dynamic Allocation, Shuffle Service, GPU

Difficulty: Expert | Companies: Google, Netflix, Uber, Airbnb, Lyft

ℹ️Interview Context

Spark on Kubernetes is increasingly common in cloud-native environments. Interviewers expect knowledge of K8s-specific configurations, dynamic allocation, and resource scheduling.

Question

How does Spark integrate with Kubernetes for resource management? Explain dynamic allocation on K8s, external shuffle service, and GPU scheduling. What are the key differences between Spark on YARN vs. Spark on K8s?

Detailed Answer

1. Spark on K8s Architecture

In cluster mode, spark-submit asks the Kubernetes API server to create a driver pod. The driver then requests executor pods through the same API, the Kubernetes scheduler places them on nodes, and the executors connect back to the driver to receive tasks. When the application finishes, the executor pods are removed, while the driver pod stays in a completed state so its logs remain available until it is cleaned up.

2. Basic Spark on K8s Configuration

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("SparkOnK8s") \
    .master("k8s://https://kubernetes.default.svc:443") \
    .config("spark.kubernetes.container.image", "spark:3.5.0") \
    .config("spark.kubernetes.namespace", "spark-jobs") \
    .config("spark.kubernetes.authenticate.driver.serviceAccountName", "spark") \
    .config("spark.executor.instances", "10") \
    .config("spark.executor.memory", "8g") \
    .config("spark.executor.cores", "4") \
    .config("spark.driver.memory", "4g") \
    .config("spark.driver.cores", "2") \
    .getOrCreate()

# K8s-specific configurations:
# spark.kubernetes.container.image: Container image for the driver and executor pods
# spark.kubernetes.namespace: K8s namespace the pods are created in
# spark.kubernetes.authenticate.driver.serviceAccountName: Service account the driver uses to create executor pods
# spark.kubernetes.driver.podTemplateFile: Pod template for the driver
# spark.kubernetes.executor.podTemplateFile: Pod template for executors
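
The executor memory setting is not the whole pod request: Spark adds a memory overhead on top of the heap. The following is a rough sketch of that arithmetic, assuming the documented defaults (an overhead factor of 0.10 for JVM jobs, 0.40 for non-JVM jobs such as PySpark, and a 384 MiB minimum). The function name is hypothetical and the calculation ignores extras such as `spark.executor.pyspark.memory`.

```python
MIN_OVERHEAD_MIB = 384  # documented minimum memory overhead per executor


def executor_pod_memory_mib(executor_memory_mib: int, overhead_factor: float = 0.10) -> int:
    """Estimate the memory request (MiB) of one executor pod: heap + overhead."""
    # The overhead is a fraction of the heap, but never below the minimum.
    overhead = max(MIN_OVERHEAD_MIB, int(executor_memory_mib * overhead_factor))
    return executor_memory_mib + overhead


jvm_pod = executor_pod_memory_mib(8 * 1024)                            # 8g heap, JVM job
pyspark_pod = executor_pod_memory_mib(8 * 1024, overhead_factor=0.40)  # 8g heap, non-JVM job
print(jvm_pod, pyspark_pod)
```

With the 8g executors configured above, the estimate comes to roughly 8.8 GiB per pod for a JVM job and about 11.2 GiB with the 0.40 factor, which is the number that has to fit on a node and inside any namespace quota.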

3. Dynamic Allocation on K8s

# Dynamic allocation on K8s relies on shuffle tracking: Spark on K8s has no
# external shuffle service (as of Spark 3.5).
# These settings are read when the application starts, so pass them through
# SparkSession.builder.config(...) or spark-submit --conf; spark.conf.set()
# on a running session does not change them.
dynamic_allocation_conf = {
    "spark.dynamicAllocation.enabled": "true",
    "spark.dynamicAllocation.shuffleTracking.enabled": "true",
    "spark.dynamicAllocation.minExecutors": "2",
    "spark.dynamicAllocation.maxExecutors": "100",
    "spark.dynamicAllocation.executorIdleTimeout": "60s",
    "spark.dynamicAllocation.schedulerBacklogTimeout": "1s",
    "spark.dynamicAllocation.sustainedSchedulerBacklogTimeout": "1s",
}

builder = SparkSession.builder.appName("SparkOnK8s")
for key, value in dynamic_allocation_conf.items():
    builder = builder.config(key, value)

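# How dynamic allocation works on K8s:
# 1. When tasks back up, the driver requests more executor pods from the K8s API
# 2. The K8s scheduler places the pods on nodes with free capacity
# 3. When an executor has been idle past executorIdleTimeout and holds no
#    shuffle data that a running job still needs, the driver deletes its pod
# 4. K8s terminates the pod and frees its resources
# 5. Spark continues with the remaining executors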
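
The scale-up side of these steps can be sketched as a small simulation. Spark's documentation describes executors being requested in rounds whose size doubles (1, then 2, 4, 8, ...) while the task backlog persists. The function below is a simplified model of that policy, not Spark's allocator: the name `ramp_up_targets` and the `needed` parameter (how many executors the backlog would keep busy) are assumptions made for illustration.

```python
from typing import List


def ramp_up_targets(min_executors: int, max_executors: int, needed: int) -> List[int]:
    """Executor target after each backlog round, until demand or the cap is reached."""
    target = min_executors
    # Never aim below the minimum or above the configured maximum.
    goal = min(max(needed, min_executors), max_executors)
    to_add = 1
    targets = []
    while target < goal:
        target = min(target + to_add, goal)
        targets.append(target)
        to_add *= 2  # each round asks for twice as many executors as the last
    return targets


print(ramp_up_targets(min_executors=2, max_executors=100, needed=40))
```

Starting from 2 executors with demand for 40, the targets grow as 3, 5, 9, 17, 33, 40. With a 1s sustained backlog timeout that is about six rounds, although the pods still have to be scheduled and started before they can run tasks.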

# Executor pod lifecycle:
<div className="my-6 flex justify-center">
<svg viewBox="0 0 800 120" width="100%" style={{ maxWidth: 750 }} xmlns="http://www.w3.org/2000/svg">
  <defs>
    <linearGradient id="k8s-life-grad" x1="0" y1="0" x2="1" y2="1">
      <stop offset="0%" stopColor="#6366f1"/>
      <stop offset="100%" stopColor="#4f46e5"/>
    </linearGradient>
    <filter id="k8s-life-shadow">
      <feDropShadow dx="0" dy="2" stdDeviation="3" floodOpacity="0.12"/>
    </filter>
    <marker id="k8s-life-arrow" viewBox="0 0 10 7" refX="10" refY="3.5" markerWidth="8" markerHeight="6" orient="auto-start-reverse">
      <polygon points="0 0, 10 3.5, 0 7" fill="#6b7280"/>
    </marker>
  </defs>
  <rect x="20" y="30" width="110" height="40" rx="10" fill="url(#k8s-life-grad)" filter="url(#k8s-life-shadow)"/>
  <text x="75" y="55" textAnchor="middle" fill="#fff" fontFamily="Inter,system-ui,sans-serif" fontSize="10" fontWeight="600">Request</text>
  <line x1="130" y1="50" x2="155" y2="50" stroke="#6b7280" strokeWidth="2" markerEnd="url(#k8s-life-arrow)"/>
  <rect x="160" y="30" width="110" height="40" rx="10" fill="#3b82f6" filter="url(#k8s-life-shadow)"/>
  <text x="215" y="55" textAnchor="middle" fill="#fff" fontFamily="Inter,system-ui,sans-serif" fontSize="10" fontWeight="600">Pending</text>
  <line x1="270" y1="50" x2="295" y2="50" stroke="#6b7280" strokeWidth="2" markerEnd="url(#k8s-life-arrow)"/>
  <rect x="300" y="30" width="110" height="40" rx="10" fill="#8b5cf6" filter="url(#k8s-life-shadow)"/>
  <text x="355" y="55" textAnchor="middle" fill="#fff" fontFamily="Inter,system-ui,sans-serif" fontSize="10" fontWeight="600">Running</text>
  <line x1="410" y1="50" x2="435" y2="50" stroke="#6b7280" strokeWidth="2" markerEnd="url(#k8s-life-arrow)"/>
  <rect x="440" y="30" width="110" height="40" rx="10" fill="#10b981" filter="url(#k8s-life-shadow)"/>
  <text x="495" y="55" textAnchor="middle" fill="#fff" fontFamily="Inter,system-ui,sans-serif" fontSize="10" fontWeight="600">Active</text>
  <line x1="550" y1="50" x2="575" y2="50" stroke="#6b7280" strokeWidth="2" markerEnd="url(#k8s-life-arrow)"/>
  <rect x="580" y="30" width="110" height="40" rx="10" fill="#f59e0b" filter="url(#k8s-life-shadow)"/>
  <text x="635" y="55" textAnchor="middle" fill="#fff" fontFamily="Inter,system-ui,sans-serif" fontSize="10" fontWeight="600">Idle</text>
  <line x1="690" y1="50" x2="715" y2="50" stroke="#6b7280" strokeWidth="2" markerEnd="url(#k8s-life-arrow)"/>
  <rect x="720" y="30" width="70" height="40" rx="10" fill="#ef4444" filter="url(#k8s-life-shadow)"/>
  <text x="755" y="55" textAnchor="middle" fill="#fff" fontFamily="Inter,system-ui,sans-serif" fontSize="10" fontWeight="600">Done</text>
  <text x="75" y="90" textAnchor="middle" fill="#6b7280" fontFamily="Inter,system-ui,sans-serif" fontSize="9">Spark requests pod</text>
  <text x="215" y="90" textAnchor="middle" fill="#6b7280" fontFamily="Inter,system-ui,sans-serif" fontSize="9">Waiting for resources</text>
  <text x="355" y="90" textAnchor="middle" fill="#6b7280" fontFamily="Inter,system-ui,sans-serif" fontSize="9">Node allocated</text>
  <text x="495" y="90" textAnchor="middle" fill="#6b7280" fontFamily="Inter,system-ui,sans-serif" fontSize="9">Assigned tasks</text>
  <text x="635" y="90" textAnchor="middle" fill="#6b7280" fontFamily="Inter,system-ui,sans-serif" fontSize="9">No tasks for timeout</text>
  <text x="755" y="90" textAnchor="middle" fill="#6b7280" fontFamily="Inter,system-ui,sans-serif" fontSize="9">K8s terminates</text>
</svg>
</div>
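
The Idle to Done step in the diagram hides a decision: an idle executor is only released once nothing still depends on it. Below is a minimal sketch of that check, assuming shuffle tracking is enabled with its default unlimited timeout. `ExecutorInfo` and `is_removable` are hypothetical names, and the model ignores the `minExecutors` floor.

```python
from dataclasses import dataclass


@dataclass
class ExecutorInfo:
    idle_seconds: float
    holds_needed_shuffle: bool = False  # shuffle output a running job may still read
    holds_cached_blocks: bool = False   # cached RDD/DataFrame blocks


def is_removable(
    executor: ExecutorInfo,
    idle_timeout_s: float = 60.0,
    cached_idle_timeout_s: float = float("inf"),
) -> bool:
    """Simplified idle check: may the driver delete this executor's pod?"""
    if executor.holds_needed_shuffle:
        # With shuffle tracking, the executor stays until its shuffle data is unneeded.
        return False
    timeout = idle_timeout_s
    if executor.holds_cached_blocks:
        # Executors with cached data use the (usually much longer) cached timeout.
        timeout = max(idle_timeout_s, cached_idle_timeout_s)
    return executor.idle_seconds >= timeout


print(is_removable(ExecutorInfo(idle_seconds=61)))
print(is_removable(ExecutorInfo(idle_seconds=600, holds_needed_shuffle=True)))
```

The second call shows why scale-down on K8s can lag behind the idle timeout: an executor that still holds needed shuffle output is kept, however long it has been idle.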

4. External Shuffle Service

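# On YARN and standalone clusters, the External Shuffle Service (ESS) runs as a
# long-lived per-node process that serves shuffle files, so executors can be
# removed without losing their shuffle output. The diagram shows that model.
# Spark on K8s does not ship an ESS (as of Spark 3.5); it uses shuffle tracking,
# optionally combined with executor decommissioning, to get the same effect.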

<div className="my-6 flex justify-center">
<svg viewBox="0 0 800 200" width="100%" style={{ maxWidth: 720 }} xmlns="http://www.w3.org/2000/svg">
  <defs>
    <linearGradient id="k8s-ess-cluster" x1="0" y1="0" x2="0" y2="1">
      <stop offset="0%" stopColor="#326ce5"/>
      <stop offset="100%" stopColor="#2356b5"/>
    </linearGradient>
    <linearGradient id="k8s-ess-exec" x1="0" y1="0" x2="0" y2="1">
      <stop offset="0%" stopColor="#3b82f6"/>
      <stop offset="100%" stopColor="#2563eb"/>
    </linearGradient>
    <linearGradient id="k8s-ess-svc" x1="0" y1="0" x2="0" y2="1">
      <stop offset="0%" stopColor="#f59e0b"/>
      <stop offset="100%" stopColor="#d97706"/>
    </linearGradient>
    <filter id="k8s-ess-shadow">
      <feDropShadow dx="0" dy="3" stdDeviation="4" floodOpacity="0.15"/>
    </filter>
    <marker id="k8s-ess-arrow" viewBox="0 0 10 7" refX="10" refY="3.5" markerWidth="8" markerHeight="6" orient="auto-start-reverse">
      <polygon points="0 0, 10 3.5, 0 7" fill="#6b7280"/>
    </marker>
  </defs>
  <rect x="10" y="10" width="780" height="180" rx="18" fill="#f0f5ff" filter="url(#k8s-ess-shadow)" stroke="#326ce5" strokeWidth="1.5"/>
  <text x="400" y="35" textAnchor="middle" fill="#326ce5" fontFamily="Inter,system-ui,sans-serif" fontSize="13" fontWeight="700">K8s Cluster</text>
  <rect x="40" y="50" width="200" height="55" rx="12" fill="url(#k8s-ess-exec)" filter="url(#k8s-ess-shadow)"/>
  <text x="140" y="72" textAnchor="middle" fill="#fff" fontFamily="Inter,system-ui,sans-serif" fontSize="11" fontWeight="600">Executor 1 (pod)</text>
  <text x="140" y="90" textAnchor="middle" fill="rgba(255,255,255,0.8)" fontFamily="Inter,system-ui,sans-serif" fontSize="10">Shuffle data →</text>
  <rect x="300" y="50" width="200" height="55" rx="12" fill="url(#k8s-ess-exec)" filter="url(#k8s-ess-shadow)"/>
  <text x="400" y="72" textAnchor="middle" fill="#fff" fontFamily="Inter,system-ui,sans-serif" fontSize="11" fontWeight="600">Executor 2 (pod)</text>
  <text x="400" y="90" textAnchor="middle" fill="rgba(255,255,255,0.8)" fontFamily="Inter,system-ui,sans-serif" fontSize="10">Shuffle data →</text>
  <rect x="560" y="50" width="200" height="55" rx="12" fill="url(#k8s-ess-exec)" filter="url(#k8s-ess-shadow)"/>
  <text x="660" y="72" textAnchor="middle" fill="#fff" fontFamily="Inter,system-ui,sans-serif" fontSize="11" fontWeight="600">Executor 3 (pod)</text>
  <text x="660" y="90" textAnchor="middle" fill="rgba(255,255,255,0.8)" fontFamily="Inter,system-ui,sans-serif" fontSize="10">Shuffle data →</text>
  <line x1="140" y1="105" x2="400" y2="135" stroke="#6b7280" strokeWidth="1.5" markerEnd="url(#k8s-ess-arrow)"/>
  <line x1="400" y1="105" x2="400" y2="135" stroke="#6b7280" strokeWidth="1.5" markerEnd="url(#k8s-ess-arrow)"/>
  <line x1="660" y1="105" x2="400" y2="135" stroke="#6b7280" strokeWidth="1.5" markerEnd="url(#k8s-ess-arrow)"/>
  <rect x="100" y="135" width="600" height="45" rx="12" fill="url(#k8s-ess-svc)" filter="url(#k8s-ess-shadow)"/>
  <text x="400" y="155" textAnchor="middle" fill="#fff" fontFamily="Inter,system-ui,sans-serif" fontSize="11" fontWeight="700">External Shuffle Service</text>
  <text x="400" y="172" textAnchor="middle" fill="rgba(255,255,255,0.85)" fontFamily="Inter,system-ui,sans-serif" fontSize="10">Shuffle Block 1 · Shuffle Block 2 · Shuffle Block 3</text>
</svg>
</div>

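# Shuffle configuration on K8s:
# Apache Spark defines no spark.kubernetes.shuffleService.* settings, and
# spark.shuffle.service.enabled should stay at its default (false) on K8s.
# Pass these at launch (builder.config or spark-submit --conf):
shuffle_conf = {
    # Keep executors that hold shuffle data a running job still needs
    "spark.dynamicAllocation.shuffleTracking.enabled": "true",
    # Migrate shuffle blocks off executors that are being removed (Spark 3.1+)
    "spark.decommission.enabled": "true",
    "spark.storage.decommission.enabled": "true",
    "spark.storage.decommission.shuffleBlocks.enabled": "true",
}

# Shuffle storage: each executor can mount its own on-demand PVC as a Spark local
# directory (volumes whose name starts with spark-local-dir- hold shuffle files).
# <storage-class> is a placeholder for a StorageClass in your cluster.
# spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.claimName=OnDemand
# spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.storageClass=<storage-class>
# spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.sizeLimit=100Gi
# spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.path=/data/shuffle
# spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.readOnly=false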
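
Because shuffle settings are easy to get wrong on K8s, a small pre-submit check can help. The sketch below is a hypothetical helper, not part of Spark: it flags a configuration that enables the external shuffle service, and one that turns on dynamic allocation without either shuffle tracking or shuffle-block migration through decommissioning. It treats unset keys as false, which may differ from the defaults of your Spark version.

```python
from typing import Dict, List


def lint_k8s_shuffle_conf(conf: Dict[str, str]) -> List[str]:
    """Return human-readable problems found in a Spark-on-K8s configuration."""

    def enabled(key: str) -> bool:
        # Assumption: a missing key counts as "false".
        return conf.get(key, "false").lower() == "true"

    problems = []
    if enabled("spark.shuffle.service.enabled"):
        problems.append("spark.shuffle.service.enabled=true, but K8s has no external shuffle service")
    if enabled("spark.dynamicAllocation.enabled"):
        tracking = enabled("spark.dynamicAllocation.shuffleTracking.enabled")
        migration = enabled("spark.decommission.enabled") and enabled(
            "spark.storage.decommission.shuffleBlocks.enabled"
        )
        if not (tracking or migration):
            problems.append("dynamic allocation needs shuffle tracking or shuffle-block migration")
    return problems


risky = {"spark.dynamicAllocation.enabled": "true", "spark.shuffle.service.enabled": "true"}
for problem in lint_k8s_shuffle_conf(risky):
    print(problem)
```

Running a check like this in CI before spark-submit catches the common mistake of copying a YARN configuration to a K8s cluster unchanged.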

5. GPU Scheduling

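# GPU scheduling on K8s (launch-time settings, passed via builder.config or --conf):
gpu_conf = {
    "spark.executor.resource.gpu.amount": "1",           # GPUs per executor pod
    "spark.executor.resource.gpu.vendor": "nvidia.com",  # K8s resource becomes nvidia.com/gpu
    "spark.task.resource.gpu.amount": "1",               # GPUs each task needs
    "spark.executor.resource.gpu.discoveryScript": "/opt/spark/scripts/discover-gpus.py",
}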

# GPU discovery script (must be executable inside the executor image;
# Spark expects a single JSON object on stdout):
# #!/usr/bin/env python3
# import json
# import subprocess
#
# def discover_gpus():
#     result = subprocess.run(
#         ["nvidia-smi", "--query-gpu=index", "--format=csv,noheader"],
#         capture_output=True, text=True, check=True,
#     )
#     # One GPU index per line; skip blank lines so empty output yields [].
#     indices = [line.strip() for line in result.stdout.splitlines() if line.strip()]
#     return {"name": "gpu", "addresses": indices}
#
# if __name__ == "__main__":
#     print(json.dumps(discover_gpus()))
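
The parsing half of the discovery script can be exercised without a GPU by separating it from the `nvidia-smi` call. This is a minimal sketch with hypothetical helper names, fed a sample string in place of real command output; the JSON shape (`name` plus a list of string `addresses`) is the one the script above prints.

```python
import json
from typing import List


def parse_gpu_indices(nvidia_smi_stdout: str) -> List[str]:
    """Turn `nvidia-smi --query-gpu=index --format=csv,noheader` output into addresses."""
    # One index per line; ignore blank lines and surrounding whitespace.
    return [line.strip() for line in nvidia_smi_stdout.splitlines() if line.strip()]


def resource_json(addresses: List[str], name: str = "gpu") -> str:
    """Serialize the resource description the discovery script writes to stdout."""
    return json.dumps({"name": name, "addresses": addresses})


sample_output = "0\n1\n"  # stand-in for real nvidia-smi output on a two-GPU node
print(resource_json(parse_gpu_indices(sample_output)))
```

Keeping the parser pure makes it easy to unit test edge cases such as empty output, which the naive `int(x)` conversion would turn into a crash.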

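# Pod template for GPU executor. Spark overrides metadata.name and the cpu/memory values, and with
# spark.executor.resource.gpu.vendor set it adds the nvidia.com/gpu limit itself: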
# apiVersion: v1
# kind: Pod
# metadata:
#   name: spark-gpu-executor
# spec:
#   containers:
#   - name: spark-executor
#     resources:
#       requests:
#         memory: "16Gi"
#         cpu: "4"
#         nvidia.com/gpu: "1"
#       limits:
#         memory: "16Gi"
#         cpu: "4"
#         nvidia.com/gpu: "1"
#     env:
#     - name: NVIDIA_VISIBLE_DEVICES
#       value: "all"

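# Use GPU in Spark:
# Spark schedules GPUs and passes each task its GPU addresses, but stock MLlib
# estimators such as GBTClassifier still run on the CPU. GPU compute needs
# GPU-aware code or a plugin (for example the RAPIDS Accelerator for Apache Spark).
# from pyspark import TaskContext
# from pyspark.ml import Pipeline
# from pyspark.ml.classification import GBTClassifier
# from pyspark.ml.feature import VectorAssembler
#
# # Inside a task: GPU addresses assigned by the scheduler, e.g. ["0"]
# # gpu_ids = TaskContext.get().resources()["gpu"].addresses
#
# assembler = VectorAssembler(inputCols=features, outputCol="features")
# gbt = GBTClassifier(featuresCol="features", labelCol="label")
# pipeline = Pipeline(stages=[assembler, gbt])
# model = pipeline.fit(training_data)  # CPU training, even on GPU executors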
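
GPU amounts also cap how many tasks an executor runs at once, which is easy to overlook when sizing cores. The sketch below is a simplified model of that slot arithmetic, assuming `spark.task.cpus` and `spark.task.resource.gpu.amount` are the only limits; the function name is hypothetical. A fractional GPU amount such as 0.25 lets four tasks share one GPU.

```python
import math


def task_slots(executor_cores: int, task_cpus: int, executor_gpus: int, task_gpu_amount: float) -> int:
    """Concurrent tasks per executor: the tighter of the CPU and GPU limits."""
    cpu_slots = executor_cores // task_cpus
    if task_gpu_amount < 1:
        # Fractional amount: several tasks share each GPU.
        gpu_slots = executor_gpus * math.floor(1 / task_gpu_amount)
    else:
        gpu_slots = executor_gpus // int(task_gpu_amount)
    return min(cpu_slots, gpu_slots)


# 4 cores and 1 GPU per executor, each task asking for a whole GPU
print(task_slots(executor_cores=4, task_cpus=1, executor_gpus=1, task_gpu_amount=1))
# same executor, each task asking for a quarter of a GPU
print(task_slots(executor_cores=4, task_cpus=1, executor_gpus=1, task_gpu_amount=0.25))
```

With 4 cores and 1 GPU per executor, a whole-GPU task amount leaves three cores idle during GPU stages, so either lower the task GPU amount or give GPU executors fewer cores.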

6. K8s vs YARN Comparison

Key differences:

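1. Resource management: the YARN ResourceManager allocates containers on NodeManagers; the K8s scheduler places pods on nodes

2. Dynamic allocation: YARN relies on the external shuffle service running inside each NodeManager; K8s has no ESS and relies on shuffle tracking, optionally with executor decommissioning

3. Networking: YARN containers share the network of their cluster node; K8s pods get their own IPs from a CNI plugin, and executors reach the driver through a headless service

4. Storage: YARN clusters are usually co-located with HDFS and use node-local disks for shuffle; K8s jobs usually read from object storage and use volumes (emptyDir, hostPath, or PVCs via CSI) for local data

5. Multi-tenancy: YARN uses scheduler queues; K8s uses namespaces with ResourceQuotas and RBAC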
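
From the submitter's side, most of these differences show up as a different master URL and a different set of `--conf` keys. As an illustration, here is a sketch of a hypothetical helper that assembles the spark-submit command line for each cluster manager. The API server placeholder, queue name, and application paths are assumptions; the configuration keys are standard Spark settings.

```python
from typing import Dict, List, Optional


def spark_submit_args(
    master: str,
    app: str,
    conf: Optional[Dict[str, str]] = None,
    deploy_mode: str = "cluster",
) -> List[str]:
    """Build a spark-submit argument list (not executed here)."""
    args = ["spark-submit", "--master", master, "--deploy-mode", deploy_mode]
    for key, value in sorted((conf or {}).items()):
        args += ["--conf", f"{key}={value}"]
    args.append(app)
    return args


yarn_cmd = spark_submit_args("yarn", "app.py", {"spark.yarn.queue": "analytics"})

k8s_cmd = spark_submit_args(
    "k8s://https://<api-server-host>:<port>",  # placeholder for your API server
    "local:///opt/spark/app.py",                # local:// means the file is inside the image
    {
        "spark.kubernetes.namespace": "spark-jobs",
        "spark.kubernetes.container.image": "spark:3.5.0",
        "spark.kubernetes.authenticate.driver.serviceAccountName": "spark",
    },
)

print(" ".join(yarn_cmd))
print(" ".join(k8s_cmd))
```

The YARN command names a queue, while the K8s command names a namespace, an image, and a service account, which mirrors the multi-tenancy and packaging differences in the list above.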

7. Performance Tuning

# K8s-specific performance tuning:

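# 1. Pod scheduling optimization
# Pod templates are read when the pods are created, so set them at launch
# (builder.config or spark-submit --conf), not with spark.conf.set() at runtime:
# --conf spark.kubernetes.driver.podTemplateFile=driver-template.yaml
# --conf spark.kubernetes.executor.podTemplateFile=executor-template.yaml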

# 2. Node affinity (for GPU or high-memory nodes)
# Pod template:
# spec:
#   affinity:
#     nodeAffinity:
#       requiredDuringSchedulingIgnoredDuringExecution:
#         nodeSelectorTerms:
#         - matchExpressions:
#           - key: accelerator
#             operator: In
#             values:
#             - nvidia-tesla-v100

# 3. Pod priority and preemption
# apiVersion: scheduling.k8s.io/v1
# kind: PriorityClass
# metadata:
#   name: spark-high-priority
# value: 1000000
# globalDefault: false
# description: "Priority class for Spark jobs"

# 4. Resource quotas
# apiVersion: v1
# kind: ResourceQuota
# metadata:
#   name: spark-quota
# spec:
#   hard:
#     requests.cpu: "100"
#     requests.memory: "200Gi"
#     limits.cpu: "200"
#     limits.memory: "400Gi"
#     pods: "50"

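# 5. Network policies (Spark labels its pods with spark-role=driver or spark-role=executor)
# Executors need traffic from the driver and from each other (shuffle fetches).
# Listing Egress under policyTypes without egress rules would block all outbound traffic.
# apiVersion: networking.k8s.io/v1
# kind: NetworkPolicy
# metadata:
#   name: spark-network-policy
# spec:
#   podSelector:
#     matchLabels:
#       spark-role: executor
#   policyTypes:
#   - Ingress
#   ingress:
#   - from:
#     - podSelector:
#         matchLabels:
#           spark-role: driver
#     - podSelector:
#         matchLabels:
#           spark-role: executor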

8. Monitoring and Debugging

# Monitor Spark on K8s:

# 1. Spark UI (accessible via K8s service or port-forward)
# kubectl port-forward -n spark-jobs <driver-pod> 4040:4040

# 2. Executor logs
# kubectl logs -n spark-jobs <executor-pod> -f

# 3. K8s events
# kubectl get events -n spark-jobs

# 4. Pod status (Spark sets the spark-role label on its pods)
# kubectl get pods -n spark-jobs -l spark-role=executor

# 5. Resource usage
# kubectl top pods -n spark-jobs

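# Debugging common issues:
# Issue: Executor pods OOMKilled
# Solution: The pod exceeded its memory limit (heap plus overhead). Raise the
# overhead for off-heap or Python memory, raise executor memory, or use smaller partitions
# spark.executor.memoryOverhead = "2g"
# spark.executor.memory = "16g"  # Increase from 8g

# Issue: Driver pod OOMKilled
# Solution: Increase driver memory or avoid collecting large results to the driver
# spark.driver.memory = "8g"  # Increase from 4g

# Issue: Slow shuffles or shuffle fetch failures
# Solution: Give executors more local disk or memory, or spread the shuffle over more partitions
# spark.sql.shuffle.partitions = "400"  # Increase from the default of 200

# Issue: Executor pods stuck in Pending
# Solution: Check the events from kubectl describe pod; add nodes or lower per-pod
# requests, check the namespace ResourceQuota, and relax node affinity or selectors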

⚠️Common Pitfall

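Dynamic allocation on K8s depends on shuffle tracking (spark.dynamicAllocation.shuffleTracking.enabled=true), because Spark on K8s has no external shuffle service. Executors that hold shuffle data a job still needs are kept alive, so scale-down is slower than on YARN. If such an executor is lost anyway, for example on a preempted node, Spark has to recompute the missing shuffle output. Enabling executor decommissioning with shuffle block migration reduces that cost.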

💡Interview Tip

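When discussing Spark on K8s, mention that native K8s support arrived as an experimental feature in Spark 2.3 and became generally available in Spark 3.1. Applications are submitted with spark-submit --master k8s://<api-server-url> --deploy-mode cluster, which puts K8s on an equal footing with YARN; Mesos support has been deprecated since Spark 3.2.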

Summary

Component	Purpose	Configuration
Executor Pods	Task execution	`spark.executor.instances`
Dynamic Allocation	Auto-scale executors	`spark.dynamicAllocation.enabled`
Shuffle Tracking	Shuffle data retention during scale-down	`spark.dynamicAllocation.shuffleTracking.enabled`
GPU Scheduling	GPU resource management	`spark.executor.resource.gpu.amount`
Pod Templates	Custom pod configuration	`spark.kubernetes.executor.podTemplateFile`

Spark on K8s provides cloud-native resource management with dynamic scaling, GPU support, and Kubernetes-native monitoring, making it ideal for modern data platforms.