Posted to issues@spark.apache.org by "Denis Krivenko (Jira)" <ji...@apache.org> on 2022/01/10 19:09:00 UTC
[jira] [Updated] (SPARK-37856) Executor pods keep existing if driver container was restarted
[ https://issues.apache.org/jira/browse/SPARK-37856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Denis Krivenko updated SPARK-37856:
-----------------------------------
Environment:
Kubernetes 1.20 | Spark 3.1.2 | Hadoop 3.2.0 | Java 11 | Scala 2.12
Kubernetes 1.20 | Spark 3.2.0 | Hadoop 3.3.1 | Java 11 | Scala 2.12
was:
* Kubernetes 1.20
* Spark 3.1.2
* Hadoop 3.2.0
* Java 11
* Scala 2.12
and
* Kubernetes 1.20
* Spark 3.2.0
* Hadoop 3.3.1
* Java 11
* Scala 2.12
> Executor pods keep existing if driver container was restarted
> -------------------------------------------------------------
>
> Key: SPARK-37856
> URL: https://issues.apache.org/jira/browse/SPARK-37856
> Project: Spark
> Issue Type: Bug
> Components: Kubernetes
> Affects Versions: 3.1.2, 3.2.0
> Environment: Kubernetes 1.20 | Spark 3.1.2 | Hadoop 3.2.0 | Java 11 | Scala 2.12
> Kubernetes 1.20 | Spark 3.2.0 | Hadoop 3.3.1 | Java 11 | Scala 2.12
> Reporter: Denis Krivenko
> Priority: Minor
>
> I run Spark Thrift Server on a Kubernetes cluster, so the driver pod runs continuously and creates and manages the executor pods. From time to time, an OOM error occurs on the driver pod or on an executor pod.
> When it happens on
> * an executor - the executor pod is deleted and the driver creates a new executor pod to replace it. This works as expected.
> * the driver - Kubernetes restarts the driver container and the driver creates new executor pods. All previous executors stop, but their pods remain, in the *Error* state on Spark 3.1.2 or in the *Completed* state on Spark 3.2.0.
> The behavior can be reproduced by restarting a container in a pod with the command:
> {code:java}
> kubectl exec POD_NAME -c CONTAINER_NAME -- /sbin/killall5{code}
> Property _spark.kubernetes.executor.deleteOnTermination_ is set to *true* by default.
> If I delete the driver pod, all executor pods (in any state) are deleted as well.
> +Pod list+
> {code:java}
> NAME                                           READY   STATUS      RESTARTS   AGE
> spark-thrift-server-85cf5d689b-vvrwd           1/1     Running     1          3d15h
> spark-thrift-server-198cc57e3f9a7400-exec-10   1/1     Running     0          86m
> spark-thrift-server-198cc57e3f9a7400-exec-6    1/1     Running     0          12h
> spark-thrift-server-198cc57e3f9a7400-exec-8    1/1     Running     0          9h
> spark-thrift-server-198cc57e3f9a7400-exec-9    1/1     Running     0          3h12m
> spark-thrift-server-1a9aee7e31f36eea-exec-17   0/1     Completed   0          38h
> spark-thrift-server-1a9aee7e31f36eea-exec-18   0/1     Completed   0          38h
> spark-thrift-server-1a9aee7e31f36eea-exec-19   0/1     Completed   0          36h
> spark-thrift-server-1a9aee7e31f36eea-exec-21   0/1     Completed   0          24h
> {code}
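To spot the leftover pods programmatically, the listing above can be filtered for executor pods that are no longer running. The following is a minimal Python sketch, not part of the original report: the function name `find_stale_executors`, the `-exec-<n>` name pattern, and the set of terminated statuses are assumptions inferred from the listing, and the sample text is an excerpt of it.

```python
import re
from typing import List

# Statuses reported in this issue for executors left behind after a driver
# container restart (Error on Spark 3.1.2, Completed on Spark 3.2.0).
TERMINATED_STATUSES = {"Completed", "Error"}

# Assumption: executor pod names end in "-exec-<number>", as in the listing.
EXECUTOR_NAME = re.compile(r"-exec-\d+$")

# Excerpt of the `kubectl get pods` output quoted in the report.
SAMPLE_LISTING = """\
NAME READY STATUS RESTARTS AGE
spark-thrift-server-85cf5d689b-vvrwd 1/1 Running 1 3d15h
spark-thrift-server-198cc57e3f9a7400-exec-10 1/1 Running 0 86m
spark-thrift-server-1a9aee7e31f36eea-exec-17 0/1 Completed 0 38h
spark-thrift-server-1a9aee7e31f36eea-exec-21 0/1 Completed 0 24h
"""


def find_stale_executors(pod_listing: str) -> List[str]:
    """Return names of executor pods whose STATUS column is a terminated state."""
    stale = []
    for line in pod_listing.strip().splitlines():
        fields = line.split()
        # Skip the header row and anything that is not a full five-column row.
        if len(fields) < 5 or fields[0] == "NAME":
            continue
        name, status = fields[0], fields[2]
        if EXECUTOR_NAME.search(name) and status in TERMINATED_STATUSES:
            stale.append(name)
    return stale


if __name__ == "__main__":
    for pod_name in find_stale_executors(SAMPLE_LISTING):
        print(pod_name)
```

The function only reads text, so it can be applied to saved output; whether and how to delete the pods it reports is left to the operator.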
> +Driver pod+
> {code:java}
> apiVersion: v1
> kind: Pod
> metadata:
>   name: spark-thrift-server-85cf5d689b-vvrwd
>   uid: b69a7c68-a767-4e3b-939c-061347b1c25e
> spec:
>   ...
> status:
>   containerStatuses:
>   - containerID: containerd://7206acf424aa30b6f8533c0e32c99ebfdc5ee80648e76289f6bd2f87460ddcd3
>     image: xxx/spark:3.2.0
>     lastState:
>       terminated:
>         containerID: containerd://fe3cacb8e6470ac37dcd50d525ae3d54c8b6bfef3558325bc22e7b40daab1703
>         exitCode: 143
>         finishedAt: "2022-01-09T16:09:50Z"
>         reason: OOMKilled
>         startedAt: "2022-01-07T00:32:21Z"
>     name: spark-thrift-server
>     ready: true
>     restartCount: 1
>     started: true
>     state:
>       running:
>         startedAt: "2022-01-09T16:09:51Z"
> {code}
> +Executor pod+
> {code:java}
> apiVersion: v1
> kind: Pod
> metadata:
>   name: spark-thrift-server-1a9aee7e31f36eea-exec-17
>   ownerReferences:
>   - apiVersion: v1
>     controller: true
>     kind: Pod
>     name: spark-thrift-server-85cf5d689b-vvrwd
>     uid: b69a7c68-a767-4e3b-939c-061347b1c25e
> spec:
>   ...
> status:
>   containerStatuses:
>   - containerID: containerd://75c68190147ba980f4b9014eef3989ddc2ee30de321fd1119957b6684a995c19
>     image: xxx/spark:3.2.0
>     lastState: {}
>     name: spark-kubernetes-executor
>     ready: false
>     restartCount: 0
>     started: false
>     state:
>       terminated:
>         containerID: containerd://75c68190147ba980f4b9014eef3989ddc2ee30de321fd1119957b6684a995c19
>         exitCode: 0
>         finishedAt: "2022-01-09T16:08:57Z"
>         reason: Completed
>         startedAt: "2022-01-09T01:39:15Z"
> {code}
--
This message was sent by Atlassian Jira
(v8.20.1#820001)