Posted to issues@spark.apache.org by "Denis Krivenko (Jira)" <ji...@apache.org> on 2022/01/10 19:07:00 UTC
[jira] [Created] (SPARK-37856) Executor pods keep existing if driver container was restarted
Denis Krivenko created SPARK-37856:
--------------------------------------
Summary: Executor pods keep existing if driver container was restarted
Key: SPARK-37856
URL: https://issues.apache.org/jira/browse/SPARK-37856
Project: Spark
Issue Type: Bug
Components: Kubernetes
Affects Versions: 3.2.0, 3.1.2
Environment: * Kubernetes 1.20
* Spark 3.1.2
* Hadoop 3.2.0
* Java 11
* Scala 2.12
and
* Kubernetes 1.20
* Spark 3.2.0
* Hadoop 3.3.1
* Java 11
* Scala 2.12
Reporter: Denis Krivenko
I run Spark Thrift Server on a Kubernetes cluster, so the driver pod runs continuously and creates and manages the executor pods. From time to time an OOM occurs in the driver pod or in an executor pod.
When it happens on
* an executor - the executor pod is deleted and the driver creates a new executor pod to replace it. This works as expected.
* the driver - Kubernetes restarts the driver container and the driver creates new executor pods. All previous executors stop, but their pods remain, in *Error* state on Spark 3.1.2 or in *Completed* state on Spark 3.2.0.
The behavior can be reproduced by forcing a restart of the driver container with the command
{code:java}
kubectl exec POD_NAME -c CONTAINER_NAME -- /sbin/killall5{code}
The property _spark.kubernetes.executor.deleteOnTermination_ is set to *true* by default.
If I delete the driver pod, all executor pods (in any state) are deleted as well.
+Pod list+
{code:java}
NAME                                           READY   STATUS      RESTARTS   AGE
spark-thrift-server-85cf5d689b-vvrwd           1/1     Running     1          3d15h
spark-thrift-server-198cc57e3f9a7400-exec-10   1/1     Running     0          86m
spark-thrift-server-198cc57e3f9a7400-exec-6    1/1     Running     0          12h
spark-thrift-server-198cc57e3f9a7400-exec-8    1/1     Running     0          9h
spark-thrift-server-198cc57e3f9a7400-exec-9    1/1     Running     0          3h12m
spark-thrift-server-1a9aee7e31f36eea-exec-17   0/1     Completed   0          38h
spark-thrift-server-1a9aee7e31f36eea-exec-18   0/1     Completed   0          38h
spark-thrift-server-1a9aee7e31f36eea-exec-19   0/1     Completed   0          36h
spark-thrift-server-1a9aee7e31f36eea-exec-21   0/1     Completed   0          24h
{code}
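As a rough illustration of how the leftover pods in the listing above could be picked out automatically, here is a minimal Python sketch. It assumes the default column layout of {{kubectl get pods}} (NAME, READY, STATUS, ...) and that executor pod names contain {{-exec-}}, as they do in this report. The function name {{leftover_executors}} and the {{SAMPLE}} text are hypothetical and only for illustration; this is not part of Spark.

```python
from typing import List

# Statuses the report observed for leftover executors
# (Error on Spark 3.1.2, Completed on Spark 3.2.0).
TERMINAL_STATUSES = {"Completed", "Error"}


def leftover_executors(listing: str) -> List[str]:
    """Return names of executor pods whose STATUS column is terminal.

    `listing` is the text printed by `kubectl get pods`, header included.
    """
    names = []
    for line in listing.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 3:
            continue  # not a pod row
        name, status = fields[0], fields[2]
        # Assumption: executor pods are recognizable by "-exec-" in the name.
        if "-exec-" in name and status in TERMINAL_STATUSES:
            names.append(name)
    return names


# Shortened copy of the pod list from this report.
SAMPLE = """\
NAME READY STATUS RESTARTS AGE
spark-thrift-server-85cf5d689b-vvrwd 1/1 Running 1 3d15h
spark-thrift-server-198cc57e3f9a7400-exec-10 1/1 Running 0 86m
spark-thrift-server-1a9aee7e31f36eea-exec-17 0/1 Completed 0 38h
spark-thrift-server-1a9aee7e31f36eea-exec-18 0/1 Completed 0 38h
"""

if __name__ == "__main__":
    for pod_name in leftover_executors(SAMPLE):
        print(pod_name)
```

The resulting names could then be passed to {{kubectl delete pod}} as a manual cleanup, which works around the symptom but does not address the cause.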
+Driver pod+
{code:java}
apiVersion: v1
kind: Pod
metadata:
  name: spark-thrift-server-85cf5d689b-vvrwd
  uid: b69a7c68-a767-4e3b-939c-061347b1c25e
spec:
  ...
status:
  containerStatuses:
  - containerID: containerd://7206acf424aa30b6f8533c0e32c99ebfdc5ee80648e76289f6bd2f87460ddcd3
    image: xxx/spark:3.2.0
    lastState:
      terminated:
        containerID: containerd://fe3cacb8e6470ac37dcd50d525ae3d54c8b6bfef3558325bc22e7b40daab1703
        exitCode: 143
        finishedAt: "2022-01-09T16:09:50Z"
        reason: OOMKilled
        startedAt: "2022-01-07T00:32:21Z"
    name: spark-thrift-server
    ready: true
    restartCount: 1
    started: true
    state:
      running:
        startedAt: "2022-01-09T16:09:51Z"
{code}
+Executor pod+
{code:java}
apiVersion: v1
kind: Pod
metadata:
  name: spark-thrift-server-1a9aee7e31f36eea-exec-17
  ownerReferences:
  - apiVersion: v1
    controller: true
    kind: Pod
    name: spark-thrift-server-85cf5d689b-vvrwd
    uid: b69a7c68-a767-4e3b-939c-061347b1c25e
spec:
  ...
status:
  containerStatuses:
  - containerID: containerd://75c68190147ba980f4b9014eef3989ddc2ee30de321fd1119957b6684a995c19
    image: xxx/spark:3.2.0
    lastState: {}
    name: spark-kubernetes-executor
    ready: false
    restartCount: 0
    started: false
    state:
      terminated:
        containerID: containerd://75c68190147ba980f4b9014eef3989ddc2ee30de321fd1119957b6684a995c19
        exitCode: 0
        finishedAt: "2022-01-09T16:08:57Z"
        reason: Completed
        startedAt: "2022-01-09T01:39:15Z"
{code}
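One possible reading of the two manifests above: the executor's ownerReferences entry carries the same uid as the driver pod, and a container restart leaves the pod object (and its uid) in place, so Kubernetes owner-based garbage collection, which acts when the owner object is deleted, has nothing to trigger on. That would be consistent with the executors disappearing only when the driver pod itself is deleted. The Python sketch below only checks that uid relationship on the manifest fields quoted in this report; the helper name {{is_owned_by}} is hypothetical and the dictionaries are hand-copied excerpts, not API responses.

```python
from typing import Any, Dict


def is_owned_by(dependent: Dict[str, Any], owner: Dict[str, Any]) -> bool:
    """True if `dependent` lists `owner`'s uid in metadata.ownerReferences."""
    owner_uid = owner["metadata"]["uid"]
    refs = dependent["metadata"].get("ownerReferences", [])
    return any(ref.get("uid") == owner_uid for ref in refs)


# Excerpts of the driver and executor manifests from this report.
driver = {
    "metadata": {
        "name": "spark-thrift-server-85cf5d689b-vvrwd",
        "uid": "b69a7c68-a767-4e3b-939c-061347b1c25e",
    }
}
executor = {
    "metadata": {
        "name": "spark-thrift-server-1a9aee7e31f36eea-exec-17",
        "ownerReferences": [
            {
                "apiVersion": "v1",
                "controller": True,
                "kind": "Pod",
                "name": "spark-thrift-server-85cf5d689b-vvrwd",
                "uid": "b69a7c68-a767-4e3b-939c-061347b1c25e",
            }
        ],
    }
}

if __name__ == "__main__":
    # The terminated executor still points at the live driver pod.
    print(is_owned_by(executor, driver))
```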