Posted to issues@spark.apache.org by "Vikas Garg (Jira)" <ji...@apache.org> on 2021/08/24 20:11:00 UTC

[jira] [Updated] (SPARK-36577) Spark on K8s: Driver pod keeps running when executor allocator terminates with Fatal exception

     [ https://issues.apache.org/jira/browse/SPARK-36577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vikas Garg updated SPARK-36577:
-------------------------------
    Description: 
In Spark on Kubernetes, the class ExecutorPodsSnapshotsStoreImpl creates a thread which is responsible for creating new executor pods. This thread catches all NonFatal exceptions, logging and ignoring them. Among Fatal exceptions, however, it handles only {color:#20999d}IllegalArgumentException{color}, terminating the driver pod in that case. No other Fatal exception is handled, so when one occurs, the executor creation thread terminates abruptly while the main thread keeps running. The Spark application/job then keeps running indefinitely without making any progress.
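
To make the failure mode concrete, here is a simplified, standalone sketch. It is not Spark's actual source; {{requestNewExecutorPods()}} is a hypothetical stand-in for the pod-allocation step. The Fatal error escapes the NonFatal catch, the scheduled task is silently cancelled, and the main thread keeps running:

{code:scala}
import java.util.concurrent.{Executors, TimeUnit}
import scala.util.control.NonFatal

// Standalone illustration (not Spark's actual code) of the bug described
// above: a fatal Throwable escapes the allocator task, the scheduler
// silently cancels the task, and the main thread never notices.
object AllocatorThreadDeath {
  def main(args: Array[String]): Unit = {
    val allocator = Executors.newSingleThreadScheduledExecutor()
    allocator.scheduleWithFixedDelay(new Runnable {
      override def run(): Unit = {
        try {
          requestNewExecutorPods() // hypothetical pod-allocation step
        } catch {
          case NonFatal(e) => println(s"ignored non-fatal exception: $e")
          // Fatal Throwables (OutOfMemoryError, LinkageError, ...) are not
          // caught: scheduleWithFixedDelay cancels the task on an uncaught
          // Throwable, so the allocation loop never runs again.
        }
      }
    }, 0, 1, TimeUnit.SECONDS)

    // The "driver" keeps running even after the allocator task has died.
    Thread.currentThread().join()
  }

  private def requestNewExecutorPods(): Unit =
    throw new OutOfMemoryError("simulated fatal error in allocator")
}
{code}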

Two possible options to fix this are:

*Option#1*: Fail the driver pod whenever any Fatal exception happens (a minimal sketch follows this list). This approach has the following disadvantages:
 # A few executors may already have been created by the time the Fatal exception happens. These executors can still take the job to completion, although more slowly than expected since not all requested executors were launched. We would thus fail a job instead of letting it succeed slowly.
 # The JVM can sometimes recover from Fatal exceptions on its own. Killing the driver pod on the first occurrence of a Fatal exception gives it no chance to recover.
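
A minimal sketch of what Option#1 could look like. {{requestNewExecutorPods()}} and {{stopDriver()}} are hypothetical placeholders, not Spark's actual API; a real fix would go through the scheduler backend's shutdown path:

{code:scala}
import scala.util.control.NonFatal

// Option#1 sketch: any fatal Throwable in the allocator loop fails the
// driver instead of silently killing the allocator thread.
object FailFastAllocator {
  def allocationLoopBody(): Unit = {
    try {
      requestNewExecutorPods()
    } catch {
      case NonFatal(e) =>
        println(s"ignored non-fatal exception: $e") // loop stays alive
      case fatal: Throwable =>
        println(s"fatal error in allocator, failing driver: $fatal")
        stopDriver(exitCode = 1)
    }
  }

  // Hypothetical stand-in for the pod-allocation step.
  private def requestNewExecutorPods(): Unit = ()

  // Hypothetical shutdown hook; exits the JVM so the driver pod fails.
  private def stopDriver(exitCode: Int): Unit = sys.exit(exitCode)
}
{code}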

*Option#2*: Fail the driver pod only when 0 executors are currently running

In this approach, we fail the driver pod only when the number of currently running executors is 0, so that we don't kill a job which can potentially complete. A single running executor is thus enough to keep the driver pod alive, which may mean the job makes very slow progress when the actual number of requested executors is large. A minimal sketch follows.
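
A sketch of Option#2 under the same assumptions ({{requestNewExecutorPods()}} and {{stopDriver()}} are hypothetical placeholders, and {{runningExecutors}} stands in for however the backend tracks live executors):

{code:scala}
import java.util.concurrent.atomic.AtomicInteger
import scala.util.control.NonFatal

// Option#2 sketch: on a fatal error, fail the driver only if no executor
// is currently running; otherwise let the job limp to completion.
object FailWhenIdleAllocator {
  // Stand-in for however the backend tracks live executors.
  private val runningExecutors = new AtomicInteger(0)

  def allocationLoopBody(): Unit = {
    try {
      requestNewExecutorPods()
    } catch {
      case NonFatal(e) =>
        println(s"ignored non-fatal exception: $e")
      case fatal: Throwable if runningExecutors.get() == 0 =>
        // No executor alive: the job cannot make progress, so fail fast.
        println(s"fatal error with 0 running executors, failing driver: $fatal")
        stopDriver(exitCode = 1)
      case fatal: Throwable =>
        // At least one executor is alive: keep the driver up and let the
        // job finish slowly, as described above.
        println(s"fatal error ignored, ${runningExecutors.get()} executor(s) still running: $fatal")
    }
  }

  private def requestNewExecutorPods(): Unit = ()
  private def stopDriver(exitCode: Int): Unit = sys.exit(exitCode)
}
{code}

The trade-off named above is visible in the second fatal case: one live executor keeps the driver alive even though no new pods will ever be allocated.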


> Spark on K8s: Driver pod keeps running when executor allocator terminates with Fatal exception
> ----------------------------------------------------------------------------------------------
>
>                 Key: SPARK-36577
>                 URL: https://issues.apache.org/jira/browse/SPARK-36577
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.0.3, 3.1.0, 3.1.1, 3.1.2
>            Reporter: Vikas Garg
>            Priority: Critical


