Posted to issues@spark.apache.org by "Sumeet (Jira)" <ji...@apache.org> on 2021/04/03 05:50:00 UTC

[jira] [Created] (SPARK-34949) Executor.reportHeartBeat reregisters blockManager even when Executor is shutting down

Sumeet created SPARK-34949:
------------------------------

             Summary: Executor.reportHeartBeat reregisters blockManager even when Executor is shutting down
                 Key: SPARK-34949
                 URL: https://issues.apache.org/jira/browse/SPARK-34949
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 3.2.0
         Environment: Resource Manager: K8s
            Reporter: Sumeet


*Problem:*

I was testing Dynamic Allocation on K8s with about 300 executors. When the executors were torn down due to "spark.dynamicAllocation.executorIdleTimeout", all the executor pods were removed from K8s. However, under the "Executors" tab in the Spark UI, some executors were still listed as alive.

[spark.sparkContext.statusTracker.getExecutorInfos.length|https://github.com/apache/spark/blob/65da9287bc5112564836a555cd2967fc6b05856f/core/src/main/scala/org/apache/spark/SparkStatusTracker.scala#L100] also returned a value greater than 1. 

 

*Cause:*
 * "CoarseGrainedSchedulerBackend" issues RemoveExecutor on an "executorEndpoint" and publishes "SparkListenerExecutorRemoved" on the "listenerBus"
 * "CoarseGrainedExecutorBackend" starts the executor shutdown
 * "HeartbeatReceiver" picks up the "SparkListenerExecutorRemoved" event and removes the executor from "executorLastSeen"
 * In the meantime, the executor reports a Heartbeat. Now "HeartbeatReceiver" cannot find the "executorId" in "executorLastSeen" and hence responds with "HeartbeatResponse(reregisterBlockManager = true)"
 * The Executor now calls "env.blockManager.reregister()" and reregisters itself, which leaves the driver's view of live executors inconsistent with the pods that actually exist
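
The sequence above can be replayed with a small toy model. This is only a sketch in Python, not Spark code: "HeartbeatReceiverModel", "simulate_race" and the "registered_block_managers" set are hypothetical stand-ins for the driver-side state, and the only behaviour taken from the steps above is that a heartbeat from an unknown executor is answered with a request to reregister.

```python
class HeartbeatReceiverModel:
    """Toy stand-in for the driver-side HeartbeatReceiver (hypothetical)."""

    def __init__(self):
        # Mirrors the role of "executorLastSeen": executor id -> last heartbeat time.
        self.executor_last_seen = {}

    def on_executor_added(self, executor_id, now):
        self.executor_last_seen[executor_id] = now

    def on_executor_removed(self, executor_id):
        # Reaction to the SparkListenerExecutorRemoved event.
        self.executor_last_seen.pop(executor_id, None)

    def on_heartbeat(self, executor_id, now):
        """Return True when the reply asks the executor to reregister."""
        if executor_id in self.executor_last_seen:
            self.executor_last_seen[executor_id] = now
            return False
        return True


def simulate_race():
    """Replay the steps above; return the block managers the driver ends up tracking."""
    receiver = HeartbeatReceiverModel()
    registered_block_managers = set()

    # Executor "1" starts up and registers.
    receiver.on_executor_added("1", now=0)
    registered_block_managers.add("1")

    # The driver removes the executor and the receiver handles the event.
    registered_block_managers.discard("1")
    receiver.on_executor_removed("1")

    # The executor, still shutting down, sends one more heartbeat.
    reregister = receiver.on_heartbeat("1", now=10)

    # Without a shutdown check, the executor obeys and reregisters.
    if reregister:
        registered_block_managers.add("1")

    return registered_block_managers
```

In this model "simulate_race()" ends with executor "1" registered again even though it has been removed, which corresponds to the stale entries seen in the UI.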

 

*Proposed Solution:*

The "reportHeartBeat" method is not aware that the Executor is shutting down. It should check "executorShutdown" before reregistering.
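
A rough sketch of the guard I have in mind, written as a Python toy model rather than the actual Scala change. "ExecutorModel", "handle_heartbeat_response" and the "reregister_fn" callback are hypothetical names; the shutdown flag is modelled with a "threading.Event" on the assumption that it is set by one thread and read by the heartbeat thread.

```python
import threading


class ExecutorModel:
    """Toy stand-in for the executor side of the heartbeat exchange (hypothetical)."""

    def __init__(self, reregister_fn):
        # Plays the role of the "executorShutdown" flag mentioned above.
        self.executor_shutdown = threading.Event()
        # Stand-in for env.blockManager.reregister().
        self._reregister = reregister_fn

    def stop(self):
        # Mark the executor as shutting down before any teardown work starts.
        self.executor_shutdown.set()

    def handle_heartbeat_response(self, reregister_block_manager):
        """Reregister only if asked to AND the executor is not shutting down.

        Returns True if a reregistration was performed.
        """
        if reregister_block_manager and not self.executor_shutdown.is_set():
            self._reregister()
            return True
        return False
```

With this check, a "reregisterBlockManager = true" response that arrives after "stop()" is ignored, so the executor does not reappear on the driver.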


