Posted to issues@spark.apache.org by "hujiahua (Jira)" <ji...@apache.org> on 2021/11/12 07:24:00 UTC

[jira] [Updated] (SPARK-37300) TaskSchedulerImpl should ignore task finished event if its task was already finished state

     [ https://issues.apache.org/jira/browse/SPARK-37300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

hujiahua updated SPARK-37300:
-----------------------------
    Description: When an executor finishes a task of some stage, the driver receives a StatusUpdate event and handles it. If, at the same time, the driver finds that the executor's heartbeat has timed out, the driver also needs to handle an ExecutorLost event. There is a race condition between these two events, which can leave TaskSetManager.successful and TaskSetManager.tasksSuccessful with wrong results. A more detailed description and discussion can be found at https://issues.apache.org/jira/browse/SPARK-36575 and https://github.com/apache/spark/pull/33872  (was: When an executor finishes a task of some stage, the driver receives a StatusUpdate event and handles it. If, at the same time, the driver finds that the executor's heartbeat has timed out, the driver also needs to handle an ExecutorLost event. There is a race condition between these two events, which can leave TaskSetManager.successful and TaskSetManager.tasksSuccessful with wrong results.

The problem is that TaskResultGetter.enqueueSuccessfulTask uses an asynchronous thread to handle the successful task, which means the synchronized lock of TaskSchedulerImpl is released partway through the handling (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/TaskResultGetter.scala#L61). So TaskSchedulerImpl may handle executorLost first, and the asynchronous thread then goes on to handle the successful task. This leaves TaskSetManager.successful and TaskSetManager.tasksSuccessful with wrong results.)
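
The ordering described above can be replayed in a small toy model. This is only a sketch and not Spark's code: ToyTaskSet, TaskAttempt, replay and the ignore_finished flag are hypothetical names, and the asynchronous result thread is simulated by a list of deferred callbacks so that the interleaving is deterministic. It assumes that losing the executor marks the attempt as finished and queues the task for resubmission, and that the proposed fix is to drop a success event whose attempt is already finished.

```python
from dataclasses import dataclass


@dataclass
class TaskAttempt:
    """Hypothetical stand-in for one attempt of a task."""
    index: int
    finished: bool = False


class ToyTaskSet:
    """Toy bookkeeping loosely modelled on the two fields named in the issue."""

    def __init__(self, num_tasks, ignore_finished):
        self.successful = [False] * num_tasks
        self.tasks_successful = 0
        self.resubmitted = []                    # task indexes queued to run again
        self.ignore_finished = ignore_finished   # assumed shape of the proposed guard

    def handle_executor_lost(self, attempt):
        # Assumption: the attempt on the lost executor is marked finished
        # (as failed) and its task is queued for another attempt.
        if not attempt.finished:
            attempt.finished = True
            self.resubmitted.append(attempt.index)

    def handle_successful_task(self, attempt):
        # Guard: ignore a success event for an attempt that already finished.
        if self.ignore_finished and attempt.finished:
            return False
        attempt.finished = True
        if not self.successful[attempt.index]:
            self.successful[attempt.index] = True
            self.tasks_successful += 1
        return True


def replay(ignore_finished):
    """Replay the problematic order: success is deferred, executor loss runs first."""
    task_set = ToyTaskSet(num_tasks=1, ignore_finished=ignore_finished)
    attempt = TaskAttempt(index=0)
    # The StatusUpdate arrives, but its handling is handed to an "async" queue.
    deferred = [lambda: task_set.handle_successful_task(attempt)]
    # The executor-lost event is handled before the deferred success runs.
    task_set.handle_executor_lost(attempt)
    for job in deferred:
        job()
    return task_set
```

Without the guard, replay(False) ends with the task counted as successful while it is also queued for resubmission, which is the kind of inconsistent bookkeeping the issue describes. With the guard, replay(True) leaves the task unsuccessful and queued once, so the two views agree.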

> TaskSchedulerImpl should ignore task finished event if its task was already finished state
> ------------------------------------------------------------------------------------------
>
>                 Key: SPARK-37300
>                 URL: https://issues.apache.org/jira/browse/SPARK-37300
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 3.2.0
>            Reporter: hujiahua
>            Priority: Major
>
> When an executor finishes a task of some stage, the driver receives a StatusUpdate event and handles it. If, at the same time, the driver finds that the executor's heartbeat has timed out, the driver also needs to handle an ExecutorLost event. There is a race condition between these two events, which can leave TaskSetManager.successful and TaskSetManager.tasksSuccessful with wrong results. A more detailed description and discussion can be found at https://issues.apache.org/jira/browse/SPARK-36575 and https://github.com/apache/spark/pull/33872


