Posted to issues@spark.apache.org by "Dongjoon Hyun (Jira)" <ji...@apache.org> on 2022/12/16 18:52:00 UTC

[jira] [Updated] (SPARK-40979) Keep removed executor info in decommission state

     [ https://issues.apache.org/jira/browse/SPARK-40979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-40979:
----------------------------------
        Parent: SPARK-41550
    Issue Type: Sub-task  (was: Improvement)

> Keep removed executor info in decommission state
> ------------------------------------------------
>
>                 Key: SPARK-40979
>                 URL: https://issues.apache.org/jira/browse/SPARK-40979
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Spark Core
>    Affects Versions: 3.4.0
>            Reporter: Zhongwei Zhu
>            Assignee: Zhongwei Zhu
>            Priority: Major
>             Fix For: 3.4.0
>
>
> Executors removed due to decommission should be kept in a separate set. To avoid OOM, the set size will be limited to 1K or 10K entries.
> A FetchFailed caused by a decommissioned executor falls into one of 2 categories:
>  # When the FetchFailed reaches DAGScheduler, the executor is still alive, or it is lost but the loss info hasn't reached TaskSchedulerImpl yet. This is already handled in SPARK-40979.
>  # The FetchFailed is caused by the loss of a decommissioned executor, so the decommission info has already been removed from TaskSchedulerImpl. Keeping such info for a short period is good enough. Even if we limit the number of removed executors to 10K, it would use at most 10MB of memory. In practice, it's rare to have a cluster size of over 10K, and the chance that all of these executors are decommissioned and lost at the same time is small.
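
The bounded set described above could be sketched roughly as follows. This is an illustrative Python model only, not Spark's actual Scala implementation: the class name `RemovedDecommissionedExecutors`, the helper `fetch_failure_from_decommission`, and the oldest-first eviction policy are all assumptions made for the example. The description only states that the set is size-limited and that entries need to be kept for a short period.

```python
from collections import OrderedDict


class RemovedDecommissionedExecutors:
    """Bounded record of executors that were removed after being decommissioned.

    Hypothetical sketch: the name and the eviction policy are assumptions.
    """

    def __init__(self, max_size=10_000):
        if max_size <= 0:
            raise ValueError("max_size must be positive")
        self._max_size = max_size
        self._entries = OrderedDict()  # executor id -> decommission info

    def add(self, executor_id, info):
        # Re-adding an id moves it to the newest position.
        self._entries.pop(executor_id, None)
        self._entries[executor_id] = info
        # Drop the oldest entries once the cap is exceeded, so memory stays bounded.
        while len(self._entries) > self._max_size:
            self._entries.popitem(last=False)

    def get(self, executor_id):
        return self._entries.get(executor_id)

    def __contains__(self, executor_id):
        return executor_id in self._entries

    def __len__(self):
        return len(self._entries)


def fetch_failure_from_decommission(failed_executor_id, decommissioning, removed):
    """Return True if the failed executor is decommissioning (category 1)
    or was recently removed after decommission (category 2)."""
    return failed_executor_id in decommissioning or failed_executor_id in removed


if __name__ == "__main__":
    removed = RemovedDecommissionedExecutors(max_size=2)
    removed.add("exec-1", "node drained")
    removed.add("exec-2", "node drained")
    removed.add("exec-3", "spot instance reclaimed")  # evicts exec-1
    print(fetch_failure_from_decommission("exec-3", set(), removed))
    print(fetch_failure_from_decommission("exec-1", set(), removed))
```

With a cap of 10K entries, the 10MB figure in the description works out to about 1KB per retained entry. The sketch does not model per-entry size or time-based expiry.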



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
