Posted to issues@spark.apache.org by "wangshengjie (Jira)" <ji...@apache.org> on 2021/12/08 09:53:00 UTC

[jira] [Created] (SPARK-37580) Optimize current TaskSetManager abort logic when task failed count reaches the threshold

wangshengjie created SPARK-37580:
------------------------------------

             Summary: Optimize current TaskSetManager abort logic when task failed count reaches the threshold
                 Key: SPARK-37580
                 URL: https://issues.apache.org/jira/browse/SPARK-37580
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core
    Affects Versions: 3.2.0
            Reporter: wangshengjie


In our production environment, we found a flaw in the TaskSetManager abort logic. For example:

If one task has failed 3 times (the default max failure threshold is 4) and both a retry attempt and a speculative attempt of it are running, one of these 2 attempts may succeed, causing the TaskSetManager to try to kill the other. But if the executor running the attempt to be killed is lost (due to OOM in our situation), that attempt is marked as failed instead of killed. When the TaskSetManager handles this failed attempt, the task has now failed 4 times, so it aborts the stage and the job fails, even though the task actually succeeded.
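To illustrate the idea, here is a minimal, self-contained sketch of the kind of guard that could fix this. This is not Spark code: the names numFailures, successful, maxTaskFailures, handleSuccessfulTask and handleFailedTask only mirror TaskSetManager internals, and the whole object is a standalone toy model of the failure accounting.

{code:scala}
import scala.collection.mutable

// Toy model of TaskSetManager failure accounting (hypothetical names,
// not Spark source). Task index 0 has already failed 3 times.
object TaskFailureAccounting {
  val maxTaskFailures = 4
  val numFailures = mutable.Map(0 -> 3)
  val successful = mutable.Map(0 -> false)

  def handleSuccessfulTask(index: Int): Unit = {
    successful(index) = true
    // In the real scheduler this is also where the other running
    // attempt of the same task gets killed.
  }

  // Returns true if the stage would be aborted.
  def handleFailedTask(index: Int, countTowardsTaskFailures: Boolean): Boolean = {
    // Guard: if another attempt of this task has already succeeded, a
    // late failure (e.g. a to-be-killed attempt whose executor was lost)
    // must not push the task over the abort threshold.
    if (successful(index)) {
      println(s"Task $index already succeeded; ignoring late failure")
      return false
    }
    if (countTowardsTaskFailures) {
      numFailures(index) += 1
      if (numFailures(index) >= maxTaskFailures) {
        println(s"Task $index failed ${numFailures(index)} times; aborting stage")
        return true
      }
    }
    false
  }

  def main(args: Array[String]): Unit = {
    handleSuccessfulTask(0) // the retry or speculative attempt finishes first
    // The other attempt is reported failed after its executor is lost (OOM).
    val aborted = handleFailedTask(0, countTowardsTaskFailures = true)
    assert(!aborted, "stage must not be aborted once the task has succeeded")
  }
}
{code}

With the guard, the late failure of the killed attempt is ignored and the job completes; without it, numFailures reaches 4 and the stage is aborted.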

I have created a patch for this bug and will submit a pull request soon.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org