Posted to issues@spark.apache.org by "wuyi (Jira)" <ji...@apache.org> on 2022/12/02 06:17:00 UTC

[jira] [Updated] (SPARK-35011) False active executor in UI that caused by BlockManager reregistration

     [ https://issues.apache.org/jira/browse/SPARK-35011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

wuyi updated SPARK-35011:
-------------------------
    Summary: False active executor in UI that caused by BlockManager reregistration  (was: Avoid Block Manager registerations when StopExecutor msg is in-flight.)

> False active executor in UI that caused by BlockManager reregistration
> ----------------------------------------------------------------------
>
>                 Key: SPARK-35011
>                 URL: https://issues.apache.org/jira/browse/SPARK-35011
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.1.1, 3.2.0
>            Reporter: Sumeet
>            Assignee: wuyi
>            Priority: Major
>              Labels: BlockManager, core
>             Fix For: 3.3.0
>
>
> *Note:* This is a follow-up to SPARK-34949: even after the heartbeat fix, the driver still reports dead executors as alive.
> *Problem:*
> I was testing Dynamic Allocation on K8s with about 300 executors. When the executors were torn down due to "spark.dynamicAllocation.executorIdleTimeout", all the executor pods were removed from K8s, but the "Executors" tab in the Spark UI still listed some executors as alive.
> [spark.sparkContext.statusTracker.getExecutorInfos.length|https://github.com/apache/spark/blob/65da9287bc5112564836a555cd2967fc6b05856f/core/src/main/scala/org/apache/spark/SparkStatusTracker.scala#L100] also returned a value greater than 1. 
>  
> *Cause:*
>  * "CoarseGrainedSchedulerBackend" issues an async "StopExecutor" on the executorEndpoint.
>  * "CoarseGrainedSchedulerBackend" removes that executor from the Driver's internal data structures and publishes "SparkListenerExecutorRemoved" on the "listenerBus".
>  * The Executor has not yet processed "StopExecutor" from the Driver.
>  * The Driver receives a heartbeat from the Executor. Since it cannot find the "executorId" in its data structures, it responds with "HeartbeatResponse(reregisterBlockManager = true)".
>  * The "BlockManager" on the Executor reregisters with the "BlockManagerMaster", and "SparkListenerBlockManagerAdded" is published on the "listenerBus".
>  * The Executor starts processing "StopExecutor" and exits.
>  * "AppStatusListener" picks up the "SparkListenerBlockManagerAdded" event and updates "AppStatusStore".
>  * "statusTracker.getExecutorInfos" reads the list of executors from "AppStatusStore", which now reports the dead executor as alive.
>  
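
The ordering in the Cause list can be replayed in a small toy model. This is only a sketch: ToyDriver and its methods are hypothetical stand-ins written for illustration, not Spark classes, and the event handling is reduced to the two events named above.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDriver:
    """Hypothetical stand-in for the driver-side state; not Spark code."""
    known_executors: set = field(default_factory=set)  # scheduler backend's view
    ui_active: set = field(default_factory=set)        # what the status store reports
    events: list = field(default_factory=list)         # stand-in for the listener bus

    def register_executor(self, executor_id):
        self.known_executors.add(executor_id)
        self._post(("BlockManagerAdded", executor_id))

    def remove_executor(self, executor_id):
        # StopExecutor is sent asynchronously; the driver forgets the executor now.
        self.known_executors.discard(executor_id)
        self._post(("ExecutorRemoved", executor_id))

    def heartbeat(self, executor_id):
        # True means "please re-register your BlockManager".
        return executor_id not in self.known_executors

    def register_block_manager(self, executor_id):
        self._post(("BlockManagerAdded", executor_id))

    def _post(self, event):
        # Record the event and apply it to the UI-facing state in arrival order.
        self.events.append(event)
        kind, executor_id = event
        if kind == "BlockManagerAdded":
            self.ui_active.add(executor_id)
        elif kind == "ExecutorRemoved":
            self.ui_active.discard(executor_id)


def simulate_race():
    driver = ToyDriver()
    driver.register_executor("1")
    driver.remove_executor("1")             # StopExecutor issued, executor removed
    # The executor has not processed StopExecutor yet and still heartbeats.
    if driver.heartbeat("1"):               # driver asks for re-registration
        driver.register_block_manager("1")  # BlockManagerAdded arrives after removal
    # The executor now exits; no further event corrects the UI state.
    return driver


if __name__ == "__main__":
    d = simulate_race()
    print("known to scheduler:", d.known_executors)
    print("shown as active:", d.ui_active)
```

In this model the removed executor ends up absent from the scheduler's view but present in the UI-facing set, which mirrors the symptom reported in the Problem section.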
> *Proposed Solution:*
> Maintain a cache of recently removed executors on the Driver. During registration in BlockManagerMasterEndpoint, if the BlockManager belongs to a recently removed executor, return None to indicate that the registration is ignored because the executor will be shutting down soon.
> On BlockManagerHeartbeat, if the BlockManager belongs to a recently removed executor, return true to indicate that the driver knows about it, thereby preventing reregistration.
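
A minimal sketch of the proposed behaviour, assuming a time-based cache. The class names, the ttl_seconds parameter and its default, and the injectable clock are all assumptions made for illustration; the proposal itself only says "recently removed" and does not specify how entries expire.

```python
import time


class RecentlyRemovedExecutors:
    """Hypothetical cache of executor IDs removed within the last ttl_seconds."""

    def __init__(self, ttl_seconds=60.0, clock=time.monotonic):
        self._ttl = ttl_seconds      # assumed expiry window, not from the issue
        self._clock = clock
        self._removed_at = {}

    def mark_removed(self, executor_id):
        self._removed_at[executor_id] = self._clock()

    def contains(self, executor_id):
        removed_at = self._removed_at.get(executor_id)
        if removed_at is None:
            return False
        if self._clock() - removed_at > self._ttl:
            # Entry is too old to count as "recently removed"; drop it.
            del self._removed_at[executor_id]
            return False
        return True


class ToyBlockManagerMaster:
    """Hypothetical stand-in for the driver-side registration logic."""

    def __init__(self, recently_removed):
        self._recently_removed = recently_removed
        self.registered = set()

    def remove_executor(self, executor_id):
        self.registered.discard(executor_id)
        self._recently_removed.mark_removed(executor_id)

    def register(self, executor_id):
        # None means the registration was ignored (executor is shutting down).
        if self._recently_removed.contains(executor_id):
            return None
        self.registered.add(executor_id)
        return executor_id

    def heartbeat(self, executor_id):
        # True means "the driver knows this BlockManager", so no re-registration.
        if self._recently_removed.contains(executor_id):
            return True
        return executor_id in self.registered
```

With this in place, a heartbeat from an executor that was just removed gets True back and never attempts to reregister, and a registration attempt that still arrives is ignored, so no late "SparkListenerBlockManagerAdded" is published for it.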


