Posted to issues@spark.apache.org by "Mridul Muralidharan (Jira)" <ji...@apache.org> on 2021/04/05 22:35:00 UTC

[jira] [Resolved] (SPARK-34949) Executor.reportHeartBeat reregisters blockManager even when Executor is shutting down

     [ https://issues.apache.org/jira/browse/SPARK-34949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mridul Muralidharan resolved SPARK-34949.
-----------------------------------------
    Fix Version/s: 3.1.2
                   3.2.0
       Resolution: Fixed

Issue resolved by pull request 32043
[https://github.com/apache/spark/pull/32043]

> Executor.reportHeartBeat reregisters blockManager even when Executor is shutting down
> -------------------------------------------------------------------------------------
>
>                 Key: SPARK-34949
>                 URL: https://issues.apache.org/jira/browse/SPARK-34949
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.2.0, 3.1.1
>         Environment: Resource Manager: K8s
>            Reporter: Sumeet
>            Priority: Major
>              Labels: Executor, heartbeat
>             Fix For: 3.2.0, 3.1.2
>
>
> *Problem:*
> I was testing Dynamic Allocation on K8s with about 300 executors. When the executors were torn down due to "spark.dynamicAllocation.executorIdleTimeout", all the executor pods were removed from K8s; however, under the "Executors" tab in the Spark UI, some executors were still listed as alive.
> [spark.sparkContext.statusTracker.getExecutorInfos.length|https://github.com/apache/spark/blob/65da9287bc5112564836a555cd2967fc6b05856f/core/src/main/scala/org/apache/spark/SparkStatusTracker.scala#L100] also returned a value greater than 1. 
>  
> *Cause:*
>  * "CoarseGrainedSchedulerBackend" issues RemoveExecutor on an "executorEndpoint" and publishes "SparkListenerExecutorRemoved" on the "listenerBus"
>  * "CoarseGrainedExecutorBackend" starts the executor shutdown
>  * "HeartbeatReceiver" picks up the "SparkListenerExecutorRemoved" event and removes the executor from "executorLastSeen"
>  * In the meantime, the executor reports a heartbeat. "HeartbeatReceiver" cannot find the "executorId" in "executorLastSeen" and hence responds with "HeartbeatResponse(reregisterBlockManager = true)"
>  * The executor then calls "env.blockManager.reregister()" and re-registers itself, which creates the inconsistency
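The driver-side half of this sequence can be sketched as a toy model. This is not Spark code: the class HeartbeatReceiverModel, its method names, and the executor id "exec-1" are hypothetical stand-ins, and only the "executorLastSeen" map and the "reregisterBlockManager" flag are taken from the description above.

```python
from dataclasses import dataclass


@dataclass
class HeartbeatResponse:
    # Mirrors the flag named in the issue; everything else here is illustrative.
    reregister_block_manager: bool


class HeartbeatReceiverModel:
    """Toy stand-in for the driver-side bookkeeping described above."""

    def __init__(self):
        # executor id -> timestamp of the last heartbeat seen
        self.executor_last_seen = {}

    def on_executor_added(self, executor_id, now):
        self.executor_last_seen[executor_id] = now

    def on_executor_removed(self, executor_id):
        # Corresponds to handling the executor-removed listener event.
        self.executor_last_seen.pop(executor_id, None)

    def on_heartbeat(self, executor_id, now):
        if executor_id in self.executor_last_seen:
            self.executor_last_seen[executor_id] = now
            return HeartbeatResponse(reregister_block_manager=False)
        # Unknown executor: ask it to re-register its block manager.
        return HeartbeatResponse(reregister_block_manager=True)


receiver = HeartbeatReceiverModel()
receiver.on_executor_added("exec-1", now=0)
normal = receiver.on_heartbeat("exec-1", now=10)

# The executor is removed, but one more heartbeat arrives before it exits.
receiver.on_executor_removed("exec-1")
late = receiver.on_heartbeat("exec-1", now=20)

print(normal.reregister_block_manager, late.reregister_block_manager)
```

In this model the late heartbeat is indistinguishable from one sent by an executor the driver has never seen, which is why the response asks for re-registration.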
>  
> *Proposed Solution:*
> The "reportHeartBeat" method is not aware that the Executor is shutting down; it should check "executorShutdown" before re-registering.
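
A minimal sketch of the proposed guard, written as a toy model rather than as the change merged in the pull request. ExecutorModel, handle_heartbeat_response, and the reregister_calls counter are hypothetical names; only the idea of consulting an "executorShutdown" flag before re-registering comes from the issue.

```python
import threading


class ExecutorModel:
    """Toy executor that tracks whether it would re-register its block manager."""

    def __init__(self):
        # Stand-in for the "executorShutdown" flag mentioned in the issue.
        self.executor_shutdown = threading.Event()
        # Counts how often re-registration would have been triggered.
        self.reregister_calls = 0

    def stop(self):
        self.executor_shutdown.set()

    def handle_heartbeat_response(self, reregister_block_manager):
        # Re-register only if the driver asked for it AND we are not shutting down.
        if reregister_block_manager and not self.executor_shutdown.is_set():
            self.reregister_calls += 1
            return True
        return False


live = ExecutorModel()
live_result = live.handle_heartbeat_response(reregister_block_manager=True)

stopping = ExecutorModel()
stopping.stop()
stopping_result = stopping.handle_heartbeat_response(reregister_block_manager=True)

print(live_result, stopping_result)
```

A live executor still honours the driver's request, so recovery after a genuinely lost registration keeps working; only an executor that has already begun shutting down ignores it.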


