You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Lukas Waldmann (JIRA)" <ji...@apache.org> on 2019/03/21 12:19:00 UTC

[jira] [Updated] (SPARK-27228) Spark long delay on close, possible problem with killing executors

     [ https://issues.apache.org/jira/browse/SPARK-27228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lukas Waldmann updated SPARK-27228:
-----------------------------------
    Description: 
When using dynamic allocations after all jobs finishes spark delays for several minutes before finally finishes. Log suggest that executors are not cleared up properly.
{quote}{{19/03/21 09:51:38 INFO SparkSession: PROCESSING FINISHED 19/03/21 09:51:38 INFO ExecutorAllocationManager: Request to remove executorIds: 355 19/03/21 09:51:38 INFO YarnClusterSchedulerBackend: Requesting to kill executor(s) 355 19/03/21 09:51:38 INFO YarnClusterSchedulerBackend: Actual list of executor(s) to be killed is 355 19/03/21 09:51:38 INFO ApplicationMaster$AMEndpoint: Driver requested to kill executor(s) 355. 19/03/21 09:51:38 INFO ExecutorAllocationManager: Removing executor 355 because it has been idle for 60 seconds (new desired total will be 65) 19/03/21 09:51:38 INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 228. 19/03/21 09:51:38 INFO DAGScheduler: Executor lost: 228 (epoch 446) 19/03/21 09:51:38 INFO BlockManagerMasterEndpoint: Trying to remove executor 228 from BlockManagerMaster. 19/03/21 09:51:38 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(228, data-15.bdp.gin.merck.com, 45882, None) 19/03/21 09:51:38 INFO BlockManagerMaster: Removed 228 successfully in removeExecutor 19/03/21 09:51:38 INFO SparkUI: Stopped Spark web UI at [http://data-04.bdp.gin.merck.com:44304|http://data-04.bdp.gin.merck.com:44304/] 19/03/21 09:51:38 INFO YarnClusterScheduler: Executor 228 on data-15.bdp.gin.merck.com killed by driver. 19/03/21 09:51:38 INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 346. 19/03/21 09:51:38 INFO DAGScheduler: Executor lost: 346 (epoch 446) 19/03/21 09:51:38 INFO BlockManagerMasterEndpoint: Trying to remove executor 346 from BlockManagerMaster. 19/03/21 09:51:38 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(346, datanode-02.bdp.gin.merck.com, 41186, None) 19/03/21 09:51:38 INFO BlockManagerMaster: Removed 346 successfully in removeExecutor 19/03/21 09:51:38 INFO YarnClusterScheduler: Executor 346 on datanode-02.bdp.gin.merck.com killed by driver. 19/03/21 09:51:39 INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 332. 19/03/21 09:51:39 INFO DAGScheduler: Executor lost: 332 (epoch 446) 19/03/21 09:51:39 INFO BlockManagerMasterEndpoint: Trying to remove executor 332 from BlockManagerMaster. 19/03/21 09:51:39 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(332, data-10.bdp.gin.merck.com, 38713, None) 19/03/21 09:51:39 INFO BlockManagerMaster: Removed 332 successfully in removeExecutor 19/03/21 09:51:39 INFO YarnClusterScheduler: Executor 332 on data-10.bdp.gin.merck.com killed by driver. 19/03/21 09:51:39 INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 240. 19/03/21 09:51:39 INFO YarnClusterScheduler: Executor 240 on data-22.bdp.gin.merck.com killed by driver. 19/03/21 09:51:39 INFO DAGScheduler: Executor lost: 240 (epoch 446) 19/03/21 09:51:39 INFO BlockManagerMasterEndpoint: Trying to remove executor 240 from BlockManagerMaster. 19/03/21 09:51:39 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(240, data-22.bdp.gin.merck.com, 43344, None) 19/03/21 09:51:39 INFO BlockManagerMaster: Removed 240 successfully in removeExecutor 19/03/21 09:51:39 INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 327. 19/03/21 09:51:39 INFO DAGScheduler: Executor lost: 327 (epoch 446) 19/03/21 09:51:39 INFO BlockManagerMasterEndpoint: Trying to remove executor 327 from BlockManagerMaster. 19/03/21 09:51:39 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(327, data-20.bdp.gin.merck.com, 34235, None) 19/03/21 09:51:39 INFO YarnClusterScheduler: Executor 327 on data-20.bdp.gin.merck.com killed by driver. 19/03/21 09:51:39 INFO BlockManagerMaster: Removed 327 successfully in removeExecutor 19/03/21 09:51:39 INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 355. 19/03/21 09:51:39 INFO YarnClusterScheduler: Executor 355 on data-20.bdp.gin.merck.com killed by driver. 19/03/21 09:51:39 INFO DAGScheduler: Executor lost: 355 (epoch 446) 19/03/21 09:51:39 INFO BlockManagerMasterEndpoint: Trying to remove executor 355 from BlockManagerMaster. 19/03/21 09:51:39 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(355, data-20.bdp.gin.merck.com, 43141, None) 19/03/21 09:51:39 INFO BlockManagerMaster: Removed 355 successfully in removeExecutor 19/03/21 09:51:39 INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 168. 19/03/21 09:51:39 INFO DAGScheduler: Executor lost: 168 (epoch 446) 19/03/21 09:51:39 INFO YarnClusterScheduler: Executor 168 on data-07.bdp.gin.merck.com killed by driver. 19/03/21 09:51:39 INFO BlockManagerMasterEndpoint: Trying to remove executor 168 from BlockManagerMaster. 19/03/21 09:51:39 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(168, data-07.bdp.gin.merck.com, 44833, None) 19/03/21 09:51:39 INFO BlockManagerMaster: Removed 168 successfully in removeExecutor 19/03/21 09:54:26 WARN HeartbeatReceiver: Removing executor 332 with no recent heartbeats: 173942 ms exceeds timeout 120000 ms 19/03/21 09:54:26 ERROR YarnClusterScheduler: Lost an executor 332 (already removed): Executor heartbeat timed out after 173942 ms 19/03/21 09:54:26 WARN HeartbeatReceiver: Removing executor 346 with no recent heartbeats: 172853 ms exceeds timeout 120000 ms 19/03/21 09:54:26 ERROR YarnClusterScheduler: Lost an executor 346 (already removed): Executor heartbeat timed out after 172853 ms 19/03/21 09:54:26 WARN HeartbeatReceiver: Removing executor 355 with no recent heartbeats: 173169 ms exceeds timeout 120000 ms 19/03/21 09:54:26 ERROR YarnClusterScheduler: Lost an executor 355 (already removed): Executor heartbeat timed out after 173169 ms 19/03/21 09:54:26 WARN HeartbeatReceiver: Removing executor 168 with no recent heartbeats: 174129 ms exceeds timeout 120000 ms 19/03/21 09:54:26 ERROR YarnClusterScheduler: Lost an executor 168 (already removed): Executor heartbeat timed out after 174129 ms 19/03/21 09:54:26 WARN HeartbeatReceiver: Removing executor 327 with no recent heartbeats: 169555 ms exceeds timeout 120000 ms 19/03/21 09:54:26 ERROR YarnClusterScheduler: Lost an executor 327 (already removed): Executor heartbeat timed out after 169555 ms 19/03/21 09:54:26 WARN HeartbeatReceiver: Removing executor 240 with no recent heartbeats: 177937 ms exceeds timeout 120000 ms 19/03/21 09:54:26 ERROR YarnClusterScheduler: Lost an executor 240 (already removed): Executor heartbeat timed out after 177937 ms 19/03/21 09:54:26 WARN HeartbeatReceiver: Removing executor 228 with no recent heartbeats: 178171 ms exceeds timeout 120000 ms 19/03/21 09:54:26 ERROR YarnClusterScheduler: Lost an executor 228 (already removed): Executor heartbeat timed out after 178171 ms 19/03/21 09:54:26 INFO YarnClusterSchedulerBackend: Requesting to kill executor(s) 332 19/03/21 09:54:26 WARN YarnClusterSchedulerBackend: Executor to kill 332 does not exist! 19/03/21 09:54:26 INFO YarnClusterSchedulerBackend: Actual list of executor(s) to be killed is 19/03/21 09:54:26 INFO YarnClusterSchedulerBackend: Requesting to kill executor(s) 346 19/03/21 09:54:26 WARN YarnClusterSchedulerBackend: Executor to kill 346 does not exist! 19/03/21 09:54:26 INFO YarnClusterSchedulerBackend: Actual list of executor(s) to be killed is 19/03/21 09:54:26 INFO YarnClusterSchedulerBackend: Requesting to kill executor(s) 355 19/03/21 09:54:26 WARN YarnClusterSchedulerBackend: Executor to kill 355 does not exist! 19/03/21 09:54:26 INFO YarnClusterSchedulerBackend: Actual list of executor(s) to be killed is 19/03/21 09:54:26 INFO YarnClusterSchedulerBackend: Requesting to kill executor(s) 168 19/03/21 09:54:26 WARN YarnClusterSchedulerBackend: Executor to kill 168 does not exist! 19/03/21 09:54:26 INFO YarnClusterSchedulerBackend: Actual list of executor(s) to be killed is 19/03/21 09:54:26 INFO YarnClusterSchedulerBackend: Requesting to kill executor(s) 327 19/03/21 09:54:26 WARN YarnClusterSchedulerBackend: Executor to kill 327 does not exist! 19/03/21 09:54:26 INFO YarnClusterSchedulerBackend: Actual list of executor(s) to be killed is 19/03/21 09:54:26 INFO YarnClusterSchedulerBackend: Requesting to kill executor(s) 240 19/03/21 09:54:26 WARN YarnClusterSchedulerBackend: Executor to kill 240 does not exist! 19/03/21 09:54:26 INFO YarnClusterSchedulerBackend: Actual list of executor(s) to be killed is 19/03/21 09:54:26 INFO YarnClusterSchedulerBackend: Requesting to kill executor(s) 228 19/03/21 09:54:26 WARN YarnClusterSchedulerBackend: Executor to kill 228 does not exist! 19/03/21 09:54:26 INFO YarnClusterSchedulerBackend: Actual list of executor(s) to be killed is 19/03/21 09:56:43 INFO YarnClusterSchedulerBackend: Shutting down all executors 19/03/21 09:56:43 INFO YarnSchedulerBackend$YarnDriverEndpoint: Asking each executor to shut down 19/03/21 09:56:43 INFO SchedulerExtensionServices: Stopping SchedulerExtensionServices (serviceOption=None, services=List(), started=false) 19/03/21 09:56:43 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped! 19/03/21 09:56:43 INFO MemoryStore: MemoryStore cleared 19/03/21 09:56:43 INFO BlockManager: BlockManager stopped 19/03/21 09:56:43 INFO BlockManagerMaster: BlockManagerMaster stopped 19/03/21 09:56:43 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped! 19/03/21 09:56:43 INFO SparkContext: Successfully stopped SparkContext 19/03/21 09:56:43 INFO ApplicationMaster: Final app status: SUCCEEDED, exitCode: 0 19/03/21 }}
{quote}

  was:
When using dynamic allocations after all jobs finishes spark delays for several minutes before finally finishes. Log suggest that executors are not cleared up properly.
{quote}{{19/03/21 09:51:38 INFO SparkSession: PROCESSING FINISHED 19/03/21 09:51:38 INFO ExecutorAllocationManager: Request to remove executorIds: 355 19/03/21 09:51:38 INFO YarnClusterSchedulerBackend: Requesting to kill executor(s) 355 19/03/21 09:51:38 INFO YarnClusterSchedulerBackend: Actual list of executor(s) to be killed is 355 19/03/21 09:51:38 INFO ApplicationMaster$AMEndpoint: Driver requested to kill executor(s) 355. 19/03/21 09:51:38 INFO ExecutorAllocationManager: Removing executor 355 because it has been idle for 60 seconds (new desired total will be 65) 19/03/21 09:51:38 INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 228. 19/03/21 09:51:38 INFO DAGScheduler: Executor lost: 228 (epoch 446) 19/03/21 09:51:38 INFO BlockManagerMasterEndpoint: Trying to remove executor 228 from BlockManagerMaster. 19/03/21 09:51:38 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(228, data-15.bdp.gin.merck.com, 45882, None) 19/03/21 09:51:38 INFO BlockManagerMaster: Removed 228 successfully in removeExecutor 19/03/21 09:51:38 INFO SparkUI: Stopped Spark web UI at [http://data-04.bdp.gin.merck.com:44304|http://data-04.bdp.gin.merck.com:44304/] 19/03/21 09:51:38 INFO YarnClusterScheduler: Executor 228 on data-15.bdp.gin.merck.com killed by driver. 19/03/21 09:51:38 INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 346. 19/03/21 09:51:38 INFO DAGScheduler: Executor lost: 346 (epoch 446) 19/03/21 09:51:38 INFO BlockManagerMasterEndpoint: Trying to remove executor 346 from BlockManagerMaster. 19/03/21 09:51:38 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(346, datanode-02.bdp.gin.merck.com, 41186, None) 19/03/21 09:51:38 INFO BlockManagerMaster: Removed 346 successfully in removeExecutor 19/03/21 09:51:38 INFO YarnClusterScheduler: Executor 346 on datanode-02.bdp.gin.merck.com killed by driver. 19/03/21 09:51:39 INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 332. 19/03/21 09:51:39 INFO DAGScheduler: Executor lost: 332 (epoch 446) 19/03/21 09:51:39 INFO BlockManagerMasterEndpoint: Trying to remove executor 332 from BlockManagerMaster. 19/03/21 09:51:39 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(332, data-10.bdp.gin.merck.com, 38713, None) 19/03/21 09:51:39 INFO BlockManagerMaster: Removed 332 successfully in removeExecutor 19/03/21 09:51:39 INFO YarnClusterScheduler: Executor 332 on data-10.bdp.gin.merck.com killed by driver. 19/03/21 09:51:39 INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 240. 19/03/21 09:51:39 INFO YarnClusterScheduler: Executor 240 on data-22.bdp.gin.merck.com killed by driver. 19/03/21 09:51:39 INFO DAGScheduler: Executor lost: 240 (epoch 446) 19/03/21 09:51:39 INFO BlockManagerMasterEndpoint: Trying to remove executor 240 from BlockManagerMaster. 19/03/21 09:51:39 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(240, data-22.bdp.gin.merck.com, 43344, None) 19/03/21 09:51:39 INFO BlockManagerMaster: Removed 240 successfully in removeExecutor 19/03/21 09:51:39 INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 327. 19/03/21 09:51:39 INFO DAGScheduler: Executor lost: 327 (epoch 446) 19/03/21 09:51:39 INFO BlockManagerMasterEndpoint: Trying to remove executor 327 from BlockManagerMaster. 19/03/21 09:51:39 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(327, data-20.bdp.gin.merck.com, 34235, None) 19/03/21 09:51:39 INFO YarnClusterScheduler: Executor 327 on data-20.bdp.gin.merck.com killed by driver. 19/03/21 09:51:39 INFO BlockManagerMaster: Removed 327 successfully in removeExecutor 19/03/21 09:51:39 INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 355. 19/03/21 09:51:39 INFO YarnClusterScheduler: Executor 355 on data-20.bdp.gin.merck.com killed by driver. 19/03/21 09:51:39 INFO DAGScheduler: Executor lost: 355 (epoch 446) 19/03/21 09:51:39 INFO BlockManagerMasterEndpoint: Trying to remove executor 355 from BlockManagerMaster. 19/03/21 09:51:39 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(355, data-20.bdp.gin.merck.com, 43141, None) 19/03/21 09:51:39 INFO BlockManagerMaster: Removed 355 successfully in removeExecutor 19/03/21 09:51:39 INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 168. 19/03/21 09:51:39 INFO DAGScheduler: Executor lost: 168 (epoch 446) 19/03/21 09:51:39 INFO YarnClusterScheduler: Executor 168 on data-07.bdp.gin.merck.com killed by driver. 19/03/21 09:51:39 INFO BlockManagerMasterEndpoint: Trying to remove executor 168 from BlockManagerMaster. 19/03/21 09:51:39 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(168, data-07.bdp.gin.merck.com, 44833, None) 19/03/21 09:51:39 INFO BlockManagerMaster: Removed 168 successfully in removeExecutor 19/03/21 09:54:26 WARN HeartbeatReceiver: Removing executor 332 with no recent heartbeats: 173942 ms exceeds timeout 120000 ms 19/03/21 09:54:26 ERROR YarnClusterScheduler: Lost an executor 332 (already removed): Executor heartbeat timed out after 173942 ms 19/03/21 09:54:26 WARN HeartbeatReceiver: Removing executor 346 with no recent heartbeats: 172853 ms exceeds timeout 120000 ms 19/03/21 09:54:26 ERROR YarnClusterScheduler: Lost an executor 346 (already removed): Executor heartbeat timed out after 172853 ms 19/03/21 09:54:26 WARN HeartbeatReceiver: Removing executor 355 with no recent heartbeats: 173169 ms exceeds timeout 120000 ms 19/03/21 09:54:26 ERROR YarnClusterScheduler: Lost an executor 355 (already removed): Executor heartbeat timed out after 173169 ms 19/03/21 09:54:26 WARN HeartbeatReceiver: Removing executor 168 with no recent heartbeats: 174129 ms exceeds timeout 120000 ms 19/03/21 09:54:26 ERROR YarnClusterScheduler: Lost an executor 168 (already removed): Executor heartbeat timed out after 174129 ms 19/03/21 09:54:26 WARN HeartbeatReceiver: Removing executor 327 with no recent heartbeats: 169555 ms exceeds timeout 120000 ms 19/03/21 09:54:26 ERROR YarnClusterScheduler: Lost an executor 327 (already removed): Executor heartbeat timed out after 169555 ms 19/03/21 09:54:26 WARN HeartbeatReceiver: Removing executor 240 with no recent heartbeats: 177937 ms exceeds timeout 120000 ms 19/03/21 09:54:26 ERROR YarnClusterScheduler: Lost an executor 240 (already removed): Executor heartbeat timed out after 177937 ms 19/03/21 09:54:26 WARN HeartbeatReceiver: Removing executor 228 with no recent heartbeats: 178171 ms exceeds timeout 120000 ms 19/03/21 09:54:26 ERROR YarnClusterScheduler: Lost an executor 228 (already removed): Executor heartbeat timed out after 178171 ms 19/03/21 09:54:26 INFO YarnClusterSchedulerBackend: Requesting to kill executor(s) 332 19/03/21 09:54:26 WARN YarnClusterSchedulerBackend: Executor to kill 332 does not exist! 19/03/21 09:54:26 INFO YarnClusterSchedulerBackend: Actual list of executor(s) to be killed is 19/03/21 09:54:26 INFO YarnClusterSchedulerBackend: Requesting to kill executor(s) 346 19/03/21 09:54:26 WARN YarnClusterSchedulerBackend: Executor to kill 346 does not exist! 19/03/21 09:54:26 INFO YarnClusterSchedulerBackend: Actual list of executor(s) to be killed is 19/03/21 09:54:26 INFO YarnClusterSchedulerBackend: Requesting to kill executor(s) 355 19/03/21 09:54:26 WARN YarnClusterSchedulerBackend: Executor to kill 355 does not exist! 19/03/21 09:54:26 INFO YarnClusterSchedulerBackend: Actual list of executor(s) to be killed is 19/03/21 09:54:26 INFO YarnClusterSchedulerBackend: Requesting to kill executor(s) 168 19/03/21 09:54:26 WARN YarnClusterSchedulerBackend: Executor to kill 168 does not exist! 19/03/21 09:54:26 INFO YarnClusterSchedulerBackend: Actual list of executor(s) to be killed is 19/03/21 09:54:26 INFO YarnClusterSchedulerBackend: Requesting to kill executor(s) 327 19/03/21 09:54:26 WARN YarnClusterSchedulerBackend: Executor to kill 327 does not exist! 19/03/21 09:54:26 INFO YarnClusterSchedulerBackend: Actual list of executor(s) to be killed is 19/03/21 09:54:26 INFO YarnClusterSchedulerBackend: Requesting to kill executor(s) 240 19/03/21 09:54:26 WARN YarnClusterSchedulerBackend: Executor to kill 240 does not exist! 19/03/21 09:54:26 INFO YarnClusterSchedulerBackend: Actual list of executor(s) to be killed is 19/03/21 09:54:26 INFO YarnClusterSchedulerBackend: Requesting to kill executor(s) 228 19/03/21 09:54:26 WARN YarnClusterSchedulerBackend: Executor to kill 228 does not exist! 19/03/21 09:54:26 INFO YarnClusterSchedulerBackend: Actual list of executor(s) to be killed is 19/03/21 09:56:43 INFO YarnClusterSchedulerBackend: Shutting down all executors 19/03/21 09:56:43 INFO YarnSchedulerBackend$YarnDriverEndpoint: Asking each executor to shut down 19/03/21 09:56:43 INFO SchedulerExtensionServices: Stopping SchedulerExtensionServices (serviceOption=None, services=List(), started=false) 19/03/21 09:56:43 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped! 19/03/21 09:56:43 INFO MemoryStore: MemoryStore cleared 19/03/21 09:56:43 INFO BlockManager: BlockManager stopped 19/03/21 09:56:43 INFO BlockManagerMaster: BlockManagerMaster stopped 19/03/21 09:56:43 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped! 19/03/21 09:56:43 INFO SparkContext: Successfully stopped SparkContext 19/03/21 09:56:43 INFO ApplicationMaster: Final app status: SUCCEEDED, exitCode: 0 19/03/21 09:56:43 INFO ApplicationMaster: Unregistering ApplicationMaster with SUCCEEDED 19/03/21 09:56:43 INFO AMRMClientImpl: Waiting for application to be successfully unregistered. 19/03/21 09:56:43 INFO ApplicationMaster: Deleting staging directory hdfs://bdpprodnn/user/waldmann/.sparkStaging/application_1552944783492_60353 19/03/21 09:56:44 INFO ShutdownHookManager: Shutdown hook called 19/03/21 09:56:44 INFO ShutdownHookManager: Deleting directory /mnt/data01/hadoop/yarn/local/usercache/waldmann/appcache/application_1552944783492_60353/spark-713791f3-87ca-4234-96d5-9969850fcea9}}{quote}


> Spark long delay on close, possible problem with killing executors
> ------------------------------------------------------------------
>
>                 Key: SPARK-27228
>                 URL: https://issues.apache.org/jira/browse/SPARK-27228
>             Project: Spark
>          Issue Type: Bug
>          Components: Block Manager
>    Affects Versions: 2.3.0
>            Reporter: Lukas Waldmann
>            Priority: Major
>
> When using dynamic allocations after all jobs finishes spark delays for several minutes before finally finishes. Log suggest that executors are not cleared up properly.
> {quote}{{19/03/21 09:51:38 INFO SparkSession: PROCESSING FINISHED 19/03/21 09:51:38 INFO ExecutorAllocationManager: Request to remove executorIds: 355 19/03/21 09:51:38 INFO YarnClusterSchedulerBackend: Requesting to kill executor(s) 355 19/03/21 09:51:38 INFO YarnClusterSchedulerBackend: Actual list of executor(s) to be killed is 355 19/03/21 09:51:38 INFO ApplicationMaster$AMEndpoint: Driver requested to kill executor(s) 355. 19/03/21 09:51:38 INFO ExecutorAllocationManager: Removing executor 355 because it has been idle for 60 seconds (new desired total will be 65) 19/03/21 09:51:38 INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 228. 19/03/21 09:51:38 INFO DAGScheduler: Executor lost: 228 (epoch 446) 19/03/21 09:51:38 INFO BlockManagerMasterEndpoint: Trying to remove executor 228 from BlockManagerMaster. 19/03/21 09:51:38 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(228, data-15.bdp.gin.merck.com, 45882, None) 19/03/21 09:51:38 INFO BlockManagerMaster: Removed 228 successfully in removeExecutor 19/03/21 09:51:38 INFO SparkUI: Stopped Spark web UI at [http://data-04.bdp.gin.merck.com:44304|http://data-04.bdp.gin.merck.com:44304/] 19/03/21 09:51:38 INFO YarnClusterScheduler: Executor 228 on data-15.bdp.gin.merck.com killed by driver. 19/03/21 09:51:38 INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 346. 19/03/21 09:51:38 INFO DAGScheduler: Executor lost: 346 (epoch 446) 19/03/21 09:51:38 INFO BlockManagerMasterEndpoint: Trying to remove executor 346 from BlockManagerMaster. 19/03/21 09:51:38 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(346, datanode-02.bdp.gin.merck.com, 41186, None) 19/03/21 09:51:38 INFO BlockManagerMaster: Removed 346 successfully in removeExecutor 19/03/21 09:51:38 INFO YarnClusterScheduler: Executor 346 on datanode-02.bdp.gin.merck.com killed by driver. 19/03/21 09:51:39 INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 332. 19/03/21 09:51:39 INFO DAGScheduler: Executor lost: 332 (epoch 446) 19/03/21 09:51:39 INFO BlockManagerMasterEndpoint: Trying to remove executor 332 from BlockManagerMaster. 19/03/21 09:51:39 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(332, data-10.bdp.gin.merck.com, 38713, None) 19/03/21 09:51:39 INFO BlockManagerMaster: Removed 332 successfully in removeExecutor 19/03/21 09:51:39 INFO YarnClusterScheduler: Executor 332 on data-10.bdp.gin.merck.com killed by driver. 19/03/21 09:51:39 INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 240. 19/03/21 09:51:39 INFO YarnClusterScheduler: Executor 240 on data-22.bdp.gin.merck.com killed by driver. 19/03/21 09:51:39 INFO DAGScheduler: Executor lost: 240 (epoch 446) 19/03/21 09:51:39 INFO BlockManagerMasterEndpoint: Trying to remove executor 240 from BlockManagerMaster. 19/03/21 09:51:39 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(240, data-22.bdp.gin.merck.com, 43344, None) 19/03/21 09:51:39 INFO BlockManagerMaster: Removed 240 successfully in removeExecutor 19/03/21 09:51:39 INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 327. 19/03/21 09:51:39 INFO DAGScheduler: Executor lost: 327 (epoch 446) 19/03/21 09:51:39 INFO BlockManagerMasterEndpoint: Trying to remove executor 327 from BlockManagerMaster. 19/03/21 09:51:39 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(327, data-20.bdp.gin.merck.com, 34235, None) 19/03/21 09:51:39 INFO YarnClusterScheduler: Executor 327 on data-20.bdp.gin.merck.com killed by driver. 19/03/21 09:51:39 INFO BlockManagerMaster: Removed 327 successfully in removeExecutor 19/03/21 09:51:39 INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 355. 19/03/21 09:51:39 INFO YarnClusterScheduler: Executor 355 on data-20.bdp.gin.merck.com killed by driver. 19/03/21 09:51:39 INFO DAGScheduler: Executor lost: 355 (epoch 446) 19/03/21 09:51:39 INFO BlockManagerMasterEndpoint: Trying to remove executor 355 from BlockManagerMaster. 19/03/21 09:51:39 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(355, data-20.bdp.gin.merck.com, 43141, None) 19/03/21 09:51:39 INFO BlockManagerMaster: Removed 355 successfully in removeExecutor 19/03/21 09:51:39 INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 168. 19/03/21 09:51:39 INFO DAGScheduler: Executor lost: 168 (epoch 446) 19/03/21 09:51:39 INFO YarnClusterScheduler: Executor 168 on data-07.bdp.gin.merck.com killed by driver. 19/03/21 09:51:39 INFO BlockManagerMasterEndpoint: Trying to remove executor 168 from BlockManagerMaster. 19/03/21 09:51:39 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(168, data-07.bdp.gin.merck.com, 44833, None) 19/03/21 09:51:39 INFO BlockManagerMaster: Removed 168 successfully in removeExecutor 19/03/21 09:54:26 WARN HeartbeatReceiver: Removing executor 332 with no recent heartbeats: 173942 ms exceeds timeout 120000 ms 19/03/21 09:54:26 ERROR YarnClusterScheduler: Lost an executor 332 (already removed): Executor heartbeat timed out after 173942 ms 19/03/21 09:54:26 WARN HeartbeatReceiver: Removing executor 346 with no recent heartbeats: 172853 ms exceeds timeout 120000 ms 19/03/21 09:54:26 ERROR YarnClusterScheduler: Lost an executor 346 (already removed): Executor heartbeat timed out after 172853 ms 19/03/21 09:54:26 WARN HeartbeatReceiver: Removing executor 355 with no recent heartbeats: 173169 ms exceeds timeout 120000 ms 19/03/21 09:54:26 ERROR YarnClusterScheduler: Lost an executor 355 (already removed): Executor heartbeat timed out after 173169 ms 19/03/21 09:54:26 WARN HeartbeatReceiver: Removing executor 168 with no recent heartbeats: 174129 ms exceeds timeout 120000 ms 19/03/21 09:54:26 ERROR YarnClusterScheduler: Lost an executor 168 (already removed): Executor heartbeat timed out after 174129 ms 19/03/21 09:54:26 WARN HeartbeatReceiver: Removing executor 327 with no recent heartbeats: 169555 ms exceeds timeout 120000 ms 19/03/21 09:54:26 ERROR YarnClusterScheduler: Lost an executor 327 (already removed): Executor heartbeat timed out after 169555 ms 19/03/21 09:54:26 WARN HeartbeatReceiver: Removing executor 240 with no recent heartbeats: 177937 ms exceeds timeout 120000 ms 19/03/21 09:54:26 ERROR YarnClusterScheduler: Lost an executor 240 (already removed): Executor heartbeat timed out after 177937 ms 19/03/21 09:54:26 WARN HeartbeatReceiver: Removing executor 228 with no recent heartbeats: 178171 ms exceeds timeout 120000 ms 19/03/21 09:54:26 ERROR YarnClusterScheduler: Lost an executor 228 (already removed): Executor heartbeat timed out after 178171 ms 19/03/21 09:54:26 INFO YarnClusterSchedulerBackend: Requesting to kill executor(s) 332 19/03/21 09:54:26 WARN YarnClusterSchedulerBackend: Executor to kill 332 does not exist! 19/03/21 09:54:26 INFO YarnClusterSchedulerBackend: Actual list of executor(s) to be killed is 19/03/21 09:54:26 INFO YarnClusterSchedulerBackend: Requesting to kill executor(s) 346 19/03/21 09:54:26 WARN YarnClusterSchedulerBackend: Executor to kill 346 does not exist! 19/03/21 09:54:26 INFO YarnClusterSchedulerBackend: Actual list of executor(s) to be killed is 19/03/21 09:54:26 INFO YarnClusterSchedulerBackend: Requesting to kill executor(s) 355 19/03/21 09:54:26 WARN YarnClusterSchedulerBackend: Executor to kill 355 does not exist! 19/03/21 09:54:26 INFO YarnClusterSchedulerBackend: Actual list of executor(s) to be killed is 19/03/21 09:54:26 INFO YarnClusterSchedulerBackend: Requesting to kill executor(s) 168 19/03/21 09:54:26 WARN YarnClusterSchedulerBackend: Executor to kill 168 does not exist! 19/03/21 09:54:26 INFO YarnClusterSchedulerBackend: Actual list of executor(s) to be killed is 19/03/21 09:54:26 INFO YarnClusterSchedulerBackend: Requesting to kill executor(s) 327 19/03/21 09:54:26 WARN YarnClusterSchedulerBackend: Executor to kill 327 does not exist! 19/03/21 09:54:26 INFO YarnClusterSchedulerBackend: Actual list of executor(s) to be killed is 19/03/21 09:54:26 INFO YarnClusterSchedulerBackend: Requesting to kill executor(s) 240 19/03/21 09:54:26 WARN YarnClusterSchedulerBackend: Executor to kill 240 does not exist! 19/03/21 09:54:26 INFO YarnClusterSchedulerBackend: Actual list of executor(s) to be killed is 19/03/21 09:54:26 INFO YarnClusterSchedulerBackend: Requesting to kill executor(s) 228 19/03/21 09:54:26 WARN YarnClusterSchedulerBackend: Executor to kill 228 does not exist! 19/03/21 09:54:26 INFO YarnClusterSchedulerBackend: Actual list of executor(s) to be killed is 19/03/21 09:56:43 INFO YarnClusterSchedulerBackend: Shutting down all executors 19/03/21 09:56:43 INFO YarnSchedulerBackend$YarnDriverEndpoint: Asking each executor to shut down 19/03/21 09:56:43 INFO SchedulerExtensionServices: Stopping SchedulerExtensionServices (serviceOption=None, services=List(), started=false) 19/03/21 09:56:43 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped! 19/03/21 09:56:43 INFO MemoryStore: MemoryStore cleared 19/03/21 09:56:43 INFO BlockManager: BlockManager stopped 19/03/21 09:56:43 INFO BlockManagerMaster: BlockManagerMaster stopped 19/03/21 09:56:43 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped! 19/03/21 09:56:43 INFO SparkContext: Successfully stopped SparkContext 19/03/21 09:56:43 INFO ApplicationMaster: Final app status: SUCCEEDED, exitCode: 0 19/03/21 }}
> {quote}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org