You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Yeachan Park (Jira)" <ji...@apache.org> on 2023/03/09 15:35:00 UTC
[jira] [Created] (SPARK-42738) BlockManagerMaster requests removal of non-existent executor after executor has already gracefully decommissioned
Yeachan Park created SPARK-42738:
------------------------------------
Summary: BlockManagerMaster requests removal of non-existent executor after executor has already gracefully decommissioned
Key: SPARK-42738
URL: https://issues.apache.org/jira/browse/SPARK-42738
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 3.3.2
Reporter: Yeachan Park
During testing of graceful decommissioning, we noticed that we are getting a request to remove executors that have already been gracefully decommissioned:
{code:bash}
KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asked to remove non-existent executor 3
{code}
We enabled:
- spark.decommission.enabled
- spark.storage.decommission.rddBlocks.enabled
- spark.storage.decommission.shuffleBlocks.enabled
- spark.storage.decommission.enabled
and set spark.storage.decommission.fallbackStorage.path to a path in our bucket. We are running Spark on Kubernetes.
The full driver logs can be found below:
{code:bash}
23/03/09 15:22:42 INFO KubernetesClusterSchedulerBackend: Decommission executors: 3
23/03/09 15:22:42 INFO BlockManagerMasterEndpoint: Mark BlockManagers (BlockManagerId(3, 100.96.5.11, 44707, None)) as being decommissioning.
23/03/09 15:22:42 WARN KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Received executor 1 decommissioned message
23/03/09 15:22:42 INFO KubernetesClusterSchedulerBackend: Decommission executors: 1
23/03/09 15:22:42 INFO BlockManagerMasterEndpoint: Mark BlockManagers (BlockManagerId(1, 100.96.5.9, 44491, None)) as being decommissioning.
23/03/09 15:22:42 WARN KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Received executor 2 decommissioned message
23/03/09 15:22:42 INFO KubernetesClusterSchedulerBackend: Decommission executors: 2
23/03/09 15:22:42 INFO BlockManagerMasterEndpoint: Mark BlockManagers (BlockManagerId(2, 100.96.5.10, 39011, None)) as being decommissioning.
23/03/09 15:22:44 ERROR TaskSchedulerImpl: Lost executor 3 on 100.96.5.11: Executor decommission.
23/03/09 15:22:44 INFO ExecutorMonitor: Executor 3 is removed. Remove reason statistics: (gracefully decommissioned: 1, decommision unfinished: 0, driver killed: 0, unexpectedly exited: 0).
23/03/09 15:22:44 INFO DAGScheduler: Executor lost: 3 (epoch 0)
23/03/09 15:22:44 ERROR TaskSchedulerImpl: Lost executor 1 on 100.96.5.9: Executor decommission.
23/03/09 15:22:44 INFO ExecutorMonitor: Executor 1 is removed. Remove reason statistics: (gracefully decommissioned: 2, decommision unfinished: 0, driver killed: 0, unexpectedly exited: 0).
23/03/09 15:22:44 ERROR TaskSchedulerImpl: Lost executor 2 on 100.96.5.10: Executor decommission.
23/03/09 15:22:44 INFO BlockManagerMasterEndpoint: Trying to remove executor 3 from BlockManagerMaster.
23/03/09 15:22:44 INFO ExecutorMonitor: Executor 2 is removed. Remove reason statistics: (gracefully decommissioned: 3, decommision unfinished: 0, driver killed: 0, unexpectedly exited: 0).
23/03/09 15:22:44 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(3, 100.96.5.11, 44707, None)
23/03/09 15:22:44 INFO BlockManagerMaster: Removed 3 successfully in removeExecutor
23/03/09 15:22:44 INFO DAGScheduler: Shuffle files lost for executor: 3 (epoch 0)
23/03/09 15:22:44 INFO DAGScheduler: Executor lost: 1 (epoch 1)
23/03/09 15:22:45 INFO BlockManagerMasterEndpoint: Trying to remove executor 1 from BlockManagerMaster.
23/03/09 15:22:45 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(1, 100.96.5.9, 44491, None)
23/03/09 15:22:45 INFO BlockManagerMaster: Removed 1 successfully in removeExecutor
23/03/09 15:22:45 INFO DAGScheduler: Shuffle files lost for executor: 1 (epoch 1)
23/03/09 15:22:45 INFO DAGScheduler: Executor lost: 2 (epoch 2)
23/03/09 15:22:45 INFO BlockManagerMasterEndpoint: Trying to remove executor 2 from BlockManagerMaster.
23/03/09 15:22:45 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(2, 100.96.5.10, 39011, None)
23/03/09 15:22:45 INFO BlockManagerMaster: Removed 2 successfully in removeExecutor
23/03/09 15:22:45 INFO DAGScheduler: Shuffle files lost for executor: 2 (epoch 2)
23/03/09 15:22:52 INFO BlockManagerMaster: Removal of executor 1 requested
23/03/09 15:22:52 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asked to remove non-existent executor 1
{code}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org