Posted to issues@spark.apache.org by "Denis Krivenko (Jira)" <ji...@apache.org> on 2022/01/10 19:07:00 UTC

[jira] [Created] (SPARK-37856) Executor pods keep existing if driver container was restarted

Denis Krivenko created SPARK-37856:
--------------------------------------

             Summary: Executor pods keep existing if driver container was restarted
                 Key: SPARK-37856
                 URL: https://issues.apache.org/jira/browse/SPARK-37856
             Project: Spark
          Issue Type: Bug
          Components: Kubernetes
    Affects Versions: 3.2.0, 3.1.2
         Environment: * Kubernetes 1.20
 * Spark 3.1.2
 * Hadoop 3.2.0
 * Java 11
 * Scala 2.12

and
 * Kubernetes 1.20
 * Spark 3.2.0
 * Hadoop 3.3.1
 * Java 11
 * Scala 2.12
            Reporter: Denis Krivenko


I run Spark Thrift Server on a Kubernetes cluster, so the driver pod runs continuously and creates and manages the executor pods. From time to time an OOM issue occurs on the driver pod or on an executor pod.

When it happens on
 * an executor - the executor pod is deleted and the driver creates a new executor pod to replace it. This works as expected.
 * the driver - Kubernetes restarts the driver container and the driver creates new executor pods. All previous executors stop, but their pods are not deleted: they remain in the *Error* state on Spark 3.1.2 or in the *Completed* state on Spark 3.2.0.

The behavior can be reproduced by restarting the driver pod's container with the command
{code:java}
kubectl exec POD_NAME -c CONTAINER_NAME -- /sbin/killall5{code}
The property _spark.kubernetes.executor.deleteOnTermination_ is set to *true* by default.

If I delete the driver pod, all executor pods (in any state) are also deleted.

+Pod list+
{code:java}
NAME                                           READY   STATUS      RESTARTS   AGE
spark-thrift-server-85cf5d689b-vvrwd           1/1     Running     1          3d15h
spark-thrift-server-198cc57e3f9a7400-exec-10   1/1     Running     0          86m
spark-thrift-server-198cc57e3f9a7400-exec-6    1/1     Running     0          12h
spark-thrift-server-198cc57e3f9a7400-exec-8    1/1     Running     0          9h
spark-thrift-server-198cc57e3f9a7400-exec-9    1/1     Running     0          3h12m
spark-thrift-server-1a9aee7e31f36eea-exec-17   0/1     Completed   0          38h
spark-thrift-server-1a9aee7e31f36eea-exec-18   0/1     Completed   0          38h
spark-thrift-server-1a9aee7e31f36eea-exec-19   0/1     Completed   0          36h
spark-thrift-server-1a9aee7e31f36eea-exec-21   0/1     Completed   0          24h
 {code}
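To make the leftover pods in the listing above easier to spot, here is a minimal Python sketch that filters such a listing for executor pods in the *Completed* or *Error* state. It is not part of Spark or kubectl: the function name {{leftover_executors}}, the {{-exec-}} name check and the sample text are assumptions made for illustration, and the sample simply copies a few rows of the table above.

```python
from typing import List

# Statuses reported in this issue for executors left behind after a driver restart.
LEFTOVER_STATUSES = {"Completed", "Error"}

# Sample copied from the pod list above (a subset of rows, for illustration only).
SAMPLE_LISTING = """\
NAME                                           READY   STATUS      RESTARTS   AGE
spark-thrift-server-85cf5d689b-vvrwd           1/1     Running     1          3d15h
spark-thrift-server-198cc57e3f9a7400-exec-10   1/1     Running     0          86m
spark-thrift-server-1a9aee7e31f36eea-exec-17   0/1     Completed   0          38h
spark-thrift-server-1a9aee7e31f36eea-exec-21   0/1     Completed   0          24h
"""


def leftover_executors(pod_listing: str) -> List[str]:
    """Return names of executor pods whose STATUS column is Completed or Error."""
    names = []
    for line in pod_listing.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed rows
        name, status = fields[0], fields[2]
        # Assumption: executor pod names contain "-exec-", as in the listing above.
        if "-exec-" in name and status in LEFTOVER_STATUSES:
            names.append(name)
    return names


if __name__ == "__main__":
    for pod_name in leftover_executors(SAMPLE_LISTING):
        print(pod_name)
```

The header row and the driver pod are skipped naturally because their first column does not contain {{-exec-}}.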
+Driver pod+
{code:java}
apiVersion: v1
kind: Pod
metadata:
  name: spark-thrift-server-85cf5d689b-vvrwd
  uid: b69a7c68-a767-4e3b-939c-061347b1c25e
spec:
  ...
status:
  containerStatuses:
  - containerID: containerd://7206acf424aa30b6f8533c0e32c99ebfdc5ee80648e76289f6bd2f87460ddcd3
    image: xxx/spark:3.2.0
    lastState:
      terminated:
        containerID: containerd://fe3cacb8e6470ac37dcd50d525ae3d54c8b6bfef3558325bc22e7b40daab1703
        exitCode: 143
        finishedAt: "2022-01-09T16:09:50Z"
        reason: OOMKilled
        startedAt: "2022-01-07T00:32:21Z"
    name: spark-thrift-server
    ready: true
    restartCount: 1
    started: true
    state:
      running:
        startedAt: "2022-01-09T16:09:51Z" {code}
+Executor pod+
{code:java}
apiVersion: v1
kind: Pod
metadata:
  name: spark-thrift-server-1a9aee7e31f36eea-exec-17
  ownerReferences:
  - apiVersion: v1
    controller: true
    kind: Pod
    name: spark-thrift-server-85cf5d689b-vvrwd
    uid: b69a7c68-a767-4e3b-939c-061347b1c25e
spec:
  ...
status:
  containerStatuses:
  - containerID: containerd://75c68190147ba980f4b9014eef3989ddc2ee30de321fd1119957b6684a995c19
    image: xxx/spark:3.2.0
    lastState: {}
    name: spark-kubernetes-executor
    ready: false
    restartCount: 0
    started: false
    state:
      terminated:
        containerID: containerd://75c68190147ba980f4b9014eef3989ddc2ee30de321fd1119957b6684a995c19
        exitCode: 0
        finishedAt: "2022-01-09T16:08:57Z"
        reason: Completed
        startedAt: "2022-01-09T01:39:15Z" {code}
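The two manifests above show that the stopped executor still references the driver pod as its owner, and that the driver pod kept the same {{uid}} across the container restart ({{restartCount: 1}}). This would be consistent with the executors disappearing only when the driver pod object itself is deleted, since Kubernetes garbage-collects dependents when their owner object is removed, not when one of its containers restarts. A minimal Python sketch of that check follows; the dictionaries copy only the relevant fields from the manifests, and the helper name {{is_owned_by}} is hypothetical.

```python
# Relevant fields copied from the driver and executor manifests above.
driver = {
    "metadata": {
        "name": "spark-thrift-server-85cf5d689b-vvrwd",
        "uid": "b69a7c68-a767-4e3b-939c-061347b1c25e",
    },
    "status": {"containerStatuses": [{"restartCount": 1}]},
}

executor = {
    "metadata": {
        "name": "spark-thrift-server-1a9aee7e31f36eea-exec-17",
        "ownerReferences": [
            {
                "kind": "Pod",
                "name": "spark-thrift-server-85cf5d689b-vvrwd",
                "uid": "b69a7c68-a767-4e3b-939c-061347b1c25e",
            }
        ],
    },
}


def is_owned_by(pod: dict, owner: dict) -> bool:
    """Return True if any ownerReference of `pod` carries the uid of `owner`."""
    owner_uid = owner["metadata"]["uid"]
    refs = pod["metadata"].get("ownerReferences", [])
    return any(ref.get("uid") == owner_uid for ref in refs)


if __name__ == "__main__":
    # The owner link is still valid after the driver container restart,
    # so the owner object was never deleted from the API server's point of view.
    print(is_owned_by(executor, driver))
```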


