Posted to issues@spark.apache.org by "Mridul Muralidharan (Jira)" <ji...@apache.org> on 2021/04/05 22:35:00 UTC
[jira] [Resolved] (SPARK-34949) Executor.reportHeartBeat reregisters blockManager even when Executor is shutting down
[ https://issues.apache.org/jira/browse/SPARK-34949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Mridul Muralidharan resolved SPARK-34949.
-----------------------------------------
Fix Version/s: 3.1.2, 3.2.0
Resolution: Fixed
Issue resolved by pull request 32043
[https://github.com/apache/spark/pull/32043]
> Executor.reportHeartBeat reregisters blockManager even when Executor is shutting down
> -------------------------------------------------------------------------------------
>
> Key: SPARK-34949
> URL: https://issues.apache.org/jira/browse/SPARK-34949
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.2.0, 3.1.1
> Environment: Resource Manager: K8s
> Reporter: Sumeet
> Priority: Major
> Labels: Executor, heartbeat
> Fix For: 3.2.0, 3.1.2
>
>
> *Problem:*
> I was testing Dynamic Allocation on K8s with about 300 executors. When the executors were torn down due to "spark.dynamicAllocation.executorIdleTimeout", all the executor pods were removed from K8s, but the "Executors" tab in the Spark UI still listed some executors as alive.
> [spark.sparkContext.statusTracker.getExecutorInfos.length|https://github.com/apache/spark/blob/65da9287bc5112564836a555cd2967fc6b05856f/core/src/main/scala/org/apache/spark/SparkStatusTracker.scala#L100] also returned a value greater than 1.
>
> *Cause:*
> * "CoarseGrainedSchedulerBackend" issues RemoveExecutor on an "executorEndpoint" and publishes "SparkListenerExecutorRemoved" on the "listenerBus"
> * "CoarseGrainedExecutorBackend" starts the executor shutdown
> * "HeartbeatReceiver" picks up the "SparkListenerExecutorRemoved" event and removes the executor from "executorLastSeen"
> * In the meantime, the executor reports a Heartbeat. Now "HeartbeatReceiver" cannot find the "executorId" in "executorLastSeen" and hence responds with "HeartbeatResponse(reregisterBlockManager = true)"
> * The Executor now calls "env.blockManager.reregister()" and reregisters itself, thus creating an inconsistency
>
> *Proposed Solution:*
> The "reportHeartBeat" method is not aware that the Executor is shutting down; it should check "executorShutdown" before reregistering.
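
The sequence under "Cause" and the guard under "Proposed Solution" can be sketched as a small toy model. This is not Spark code: the Python classes, the "exec-1" id, the "check_shutdown" switch, and the shared "block_managers" set standing in for the driver's block manager registry are all assumptions made for illustration. Only the names "executorLastSeen", "executorShutdown", and "reregisterBlockManager" come from the description above.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatReceiver:
    """Toy stand-in for the driver-side tracker of live executors."""
    executor_last_seen: dict = field(default_factory=dict)

    def on_executor_added(self, executor_id: str) -> None:
        self.executor_last_seen[executor_id] = 0

    def on_executor_removed(self, executor_id: str) -> None:
        # Models the handling of SparkListenerExecutorRemoved.
        self.executor_last_seen.pop(executor_id, None)

    def heartbeat(self, executor_id: str) -> bool:
        """Return True for an unknown executor (reregisterBlockManager = true)."""
        if executor_id in self.executor_last_seen:
            self.executor_last_seen[executor_id] += 1
            return False
        return True


@dataclass
class Executor:
    """Toy executor; check_shutdown toggles the proposed guard (hypothetical flag)."""
    executor_id: str
    receiver: HeartbeatReceiver
    block_managers: set          # hypothetical driver-side registry
    check_shutdown: bool
    executor_shutdown: bool = False

    def stop(self) -> None:
        self.executor_shutdown = True

    def report_heartbeat(self) -> None:
        reregister = self.receiver.heartbeat(self.executor_id)
        if self.check_shutdown and self.executor_shutdown:
            return  # proposed guard: never reregister while shutting down
        if reregister:
            self.block_managers.add(self.executor_id)


def simulate(check_shutdown: bool) -> set:
    """Replay the race: removal on the driver, then one late heartbeat."""
    receiver = HeartbeatReceiver()
    block_managers = {"exec-1"}
    receiver.on_executor_added("exec-1")
    executor = Executor("exec-1", receiver, block_managers, check_shutdown)

    executor.stop()                         # executor shutdown starts
    receiver.on_executor_removed("exec-1")  # driver forgets the executor
    block_managers.discard("exec-1")        # assumed driver-side cleanup
    executor.report_heartbeat()             # heartbeat sent in the meantime
    return block_managers
```

Without the guard, "simulate(check_shutdown=False)" leaves "exec-1" in the registry even though the executor has been removed, which matches the stale entries described under "Problem". With the guard enabled, the late heartbeat is ignored and the registry stays empty.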