Posted to issues@spark.apache.org by "Dongjoon Hyun (Jira)" <ji...@apache.org> on 2022/07/14 21:33:00 UTC

[jira] [Updated] (SPARK-37580) Reset numFailures when one of task attempts succeeds

     [ https://issues.apache.org/jira/browse/SPARK-37580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-37580:
----------------------------------
    Summary: Reset numFailures when one of task attempts succeeds  (was: Optimize current TaskSetManager abort logic when task failed count reach the threshold)

> Reset numFailures when one of task attempts succeeds
> ----------------------------------------------------
>
>                 Key: SPARK-37580
>                 URL: https://issues.apache.org/jira/browse/SPARK-37580
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 3.3.0
>            Reporter: wangshengjie
>            Assignee: wangshengjie
>            Priority: Major
>             Fix For: 3.3.0
>
>
> In our production environment, we found a flaw in the TaskSetManager abort logic. For example:
> Suppose a task has already failed 3 times (the default maximum is 4), and both a retry attempt and a speculative attempt are currently running. One of these two attempts succeeds, and the scheduler tries to cancel the other. But if the executor running the attempt to be cancelled is lost (an OOM in our case), that attempt is marked as failed. When TaskSetManager handles this failed attempt, the task has now failed 4 times, so the stage is aborted and the job fails.
> I have created a patch for this bug and will open a pull request shortly.
>  
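The idea behind the fix can be illustrated with a minimal sketch (this is not the actual Spark code; the class and method names below are hypothetical): once any attempt of a task succeeds, its accumulated failure count is reset, so a later failure of a redundant attempt cannot push the task over the maxTaskFailures threshold.

```scala
// Hypothetical sketch of per-task failure tracking with reset-on-success.
// In real Spark this logic lives in TaskSetManager; names here are invented.
class FailureTracker(maxTaskFailures: Int = 4) {
  private val numFailures =
    scala.collection.mutable.Map[Int, Int]().withDefaultValue(0)
  var aborted = false

  // A failed attempt increments the task's failure count; hitting the
  // threshold aborts the whole task set (and thus the stage).
  def taskFailed(taskIndex: Int): Unit = {
    numFailures(taskIndex) += 1
    if (numFailures(taskIndex) >= maxTaskFailures) aborted = true
  }

  // The fix: a successful attempt clears the accumulated failures, since
  // the task as a whole has succeeded regardless of other attempts.
  def taskSucceeded(taskIndex: Int): Unit = numFailures(taskIndex) = 0

  def failures(taskIndex: Int): Int = numFailures(taskIndex)
}
```

With this reset, the scenario from the description (3 prior failures, then a success, then the cancelled attempt's executor is lost and it is marked failed) leaves the task at 1 failure instead of 4, and the stage is not aborted.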



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org