Posted to issues@spark.apache.org by "Dongjoon Hyun (Jira)" <ji...@apache.org> on 2022/07/14 21:31:00 UTC

[jira] [Updated] (SPARK-37580) Optimize current TaskSetManager abort logic when task failed count reach the threshold

     [ https://issues.apache.org/jira/browse/SPARK-37580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-37580:
----------------------------------
    Fix Version/s: 3.3.0

> Optimize current TaskSetManager abort logic when task failed count reach the threshold
> --------------------------------------------------------------------------------------
>
>                 Key: SPARK-37580
>                 URL: https://issues.apache.org/jira/browse/SPARK-37580
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 3.2.0
>            Reporter: wangshengjie
>            Assignee: wangshengjie
>            Priority: Major
>             Fix For: 3.3.0
>
>
> In our production environment, we found a flaw in the TaskSetManager abort logic. For example:
> Suppose a task has already failed 3 times (the default max failure threshold is 4), and both a retry attempt and a speculative attempt of it are running. One of these 2 attempts succeeds, and the scheduler tries to kill the other. But if the executor hosting the attempt to be killed is lost (an OOM in our case), that attempt is reported back as failed instead of killed. TaskSetManager then handles it as the task's 4th failure, aborts the stage, and fails the job, even though the task's partition has already been computed successfully (see the sketch below).
> I created a patch for this bug and will send a pull request soon.
>  
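> For illustration only, here is a minimal, self-contained Scala sketch of the kind of guard that avoids this, assuming per-task bookkeeping like TaskSetManager's successful/numFailures arrays. TaskSetSketch, markSucceeded, and AbortScenario are made-up names for this example and are not the actual patch.
>
>     // Hypothetical model of the guarded abort decision.
>     class TaskSetSketch(numTasks: Int, maxTaskFailures: Int) {
>       // Mirrors TaskSetManager's per-task bookkeeping.
>       private val successful = Array.fill(numTasks)(false)
>       private val numFailures = Array.fill(numTasks)(0)
>
>       def markSucceeded(index: Int): Unit = successful(index) = true
>
>       // Returns true when the stage should be aborted.
>       def handleFailedTask(index: Int): Boolean = {
>         numFailures(index) += 1
>         // A late failure of a redundant attempt (retry or speculative)
>         // whose partition already succeeded must not abort the stage.
>         !successful(index) && numFailures(index) >= maxTaskFailures
>       }
>     }
>
>     object AbortScenario extends App {
>       val ts = new TaskSetSketch(numTasks = 1, maxTaskFailures = 4)
>       (1 to 3).foreach(_ => ts.handleFailedTask(0)) // three genuine failures
>       ts.markSucceeded(0)                           // speculative attempt wins
>       // The attempt being cancelled sits on a lost executor (OOM) and is
>       // reported as failed; without the guard this would be failure #4.
>       assert(!ts.handleFailedTask(0))
>       println("Stage survives the stale fourth failure.")
>     }
>
> With such a guard, a stale failure of an already-satisfied partition is ignored for abort purposes instead of failing the whole job.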



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org