Posted to issues@spark.apache.org by "Apache Spark (Jira)" <ji...@apache.org> on 2023/03/05 14:07:00 UTC

[jira] [Assigned] (SPARK-42577) A large stage could run indefinitely due to executor lost

     [ https://issues.apache.org/jira/browse/SPARK-42577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-42577:
------------------------------------

    Assignee: Apache Spark

> A large stage could run indefinitely due to executor lost
> ---------------------------------------------------------
>
>                 Key: SPARK-42577
>                 URL: https://issues.apache.org/jira/browse/SPARK-42577
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 3.0.3, 3.1.3, 3.2.3, 3.3.2
>            Reporter: wuyi
>            Assignee: Apache Spark
>            Priority: Major
>
> When a stage is extremely large and Spark runs on spot instances or on problematic clusters with frequent worker/executor loss, the stage could run indefinitely because executor loss keeps forcing tasks to rerun. This happens when the external shuffle service is on and the large stage takes hours to complete: when Spark tries to submit a child stage, it finds that the parent stage (the large one) is missing some partitions, so the large stage has to rerun. When it completes again, it finds new missing partitions for the same reason.
> We should add an attempt limit for this kind of scenario.
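
The rerun loop described above, and the effect of an attempt limit, can be illustrated with a small simulation. This is only a sketch under stated assumptions, not Spark's scheduler code: run_stage_with_attempt_limit, lose_outputs, max_attempts and StageAbortedError are hypothetical names made up for this example, and partitions are modelled as plain integers.

```python
from typing import Callable, Set


class StageAbortedError(RuntimeError):
    """Raised when the stage exceeds its attempt limit (hypothetical name)."""


def run_stage_with_attempt_limit(
    num_partitions: int,
    lose_outputs: Callable[[int, Set[int]], Set[int]],
    max_attempts: int,
) -> int:
    """Simulate rerunning a stage until no partition output is missing.

    lose_outputs(attempt, completed) is a hypothetical callback returning the
    partitions whose output was lost (for example with a lost executor) while
    that attempt was running. Returns the number of attempts used, or raises
    StageAbortedError once max_attempts is exhausted.
    """
    completed: Set[int] = set()
    for attempt in range(1, max_attempts + 1):
        # Each attempt recomputes only the partitions that are missing.
        missing = set(range(num_partitions)) - completed
        completed |= missing
        # Executor loss during the attempt discards some finished outputs.
        completed -= lose_outputs(attempt, completed)
        if len(completed) == num_partitions:
            return attempt
    # Without this limit, the loop above would never end if every attempt
    # keeps losing outputs.
    raise StageAbortedError(
        f"stage still missing partitions after {max_attempts} attempts"
    )
```

With a callback that loses a partition on every attempt, the unbounded version of this loop would never return; with the limit, the stage is aborted after max_attempts tries, which is the behavior this issue asks for.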



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org