Posted to issues@spark.apache.org by "Wan Kun (Jira)" <ji...@apache.org> on 2020/06/24 08:48:00 UTC

[jira] [Created] (SPARK-32086) RemoveBroadcast RPC failed after executor is shutdown

Wan Kun created SPARK-32086:
-------------------------------

             Summary: RemoveBroadcast RPC failed after executor is shutdown
                 Key: SPARK-32086
                 URL: https://issues.apache.org/jira/browse/SPARK-32086
             Project: Spark
          Issue Type: Bug
          Components: YARN
    Affects Versions: 3.0.0
            Reporter: Wan Kun


When I run a job with Spark on YARN with Dynamic Resource Allocation enabled, some exceptions are raised when *RemoveBroadcast* RPCs are sent from the *BlockManagerMaster* to the *BlockManagers* on the executors.

It seems that *blockManagerInfo.values* in *BlockManagerMasterEndpoint* has not yet been updated after the executor is shut down:
{code:java}
/**
 * Delegate RemoveBroadcast messages to each BlockManager because the master may not be notified
 * of all broadcast blocks. If removeFromDriver is false, broadcast blocks are only removed
 * from the executors, but not from the driver.
 */
private def removeBroadcast(broadcastId: Long, removeFromDriver: Boolean): Future[Seq[Int]] = {
  val removeMsg = RemoveBroadcast(broadcastId, removeFromDriver)
  // blockManagerInfo.values has not been updated after the executor is shut down
  val requiredBlockManagers = blockManagerInfo.values.filter { info =>
    removeFromDriver || !info.blockManagerId.isDriver
  }
  val futures = requiredBlockManagers.map { bm =>
    bm.slaveEndpoint.ask[Int](removeMsg).recover {
      case e: IOException =>
        logWarning(s"Error trying to remove broadcast $broadcastId from block manager " +
          s"${bm.blockManagerId}", e)
        0 // zero blocks were removed
    }
  }.toSeq

  Future.sequence(futures)
}
{code}
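To make the failure path above concrete, here is a minimal, self-contained Scala sketch of the recover logic in removeBroadcast. The names *BlockManagerId*, *ask*, and *aliveExecutors* are simplified stand-ins for illustration, not Spark's real internals: *ask* simulates *slaveEndpoint.ask[Int]* failing with an IOException once the executor's channel is closed, which is exactly what the driver log below shows for executor 93.

```scala
import java.io.IOException
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Hypothetical, simplified stand-in for Spark's BlockManagerId.
final case class BlockManagerId(executorId: String, isDriver: Boolean)

// Simulates slaveEndpoint.ask[Int]: the send fails with an IOException
// (wrapping a ClosedChannelException in the real stack trace) when the
// executor has already been shut down.
def ask(id: BlockManagerId, aliveExecutors: Set[String]): Future[Int] =
  if (aliveExecutors.contains(id.executorId)) Future.successful(1)
  else Future.failed(new IOException("Failed to send RPC: ClosedChannelException"))

// Mirrors removeBroadcast: each failed send is recovered to 0 blocks removed,
// so one dead executor does not fail the whole Future.sequence.
def removeBroadcast(infos: Seq[BlockManagerId],
                    aliveExecutors: Set[String]): Seq[Int] = {
  val futures = infos
    .filterNot(_.isDriver) // removeFromDriver == false in this sketch
    .map { id =>
      ask(id, aliveExecutors).recover { case _: IOException => 0 }
    }
  Await.result(Future.sequence(futures), 5.seconds)
}
```

Because the stale *blockManagerInfo.values* still contains the dead executor when removeBroadcast runs, the ask fails, the recover block returns 0, and only the WARN in the driver log is produced; the cleanup itself still completes.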
 

Driver LOG
{code:java}
13:53:47.167 dispatcher-BlockManagerMaster INFO org.apache.spark.storage.BlockManagerMasterEndpoint: Registering block manager nta-offline-010.nta.leyantech.com:34187 with 912.3 MiB RAM, BlockManagerId(93, nta-offline-010.nta.leyantech.com, 34187, None)


...


13:55:25.530 block-manager-ask-thread-pool-11 WARN org.apache.spark.storage.BlockManagerMasterEndpoint: Error trying to remove broadcast 123 from block manager BlockManagerId(93, nta-offline-010.nta.leyantech.com, 34187, None)
java.io.IOException: Failed to send RPC RPC 5410545105668177846 to /192.168.127.10:42478: java.nio.channels.ClosedChannelException
at org.apache.spark.network.client.TransportClient$RpcChannelListener.handleFailure(TransportClient.java:363)
at org.apache.spark.network.client.TransportClient$StdChannelListener.operationComplete(TransportClient.java:340)
at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:577)
....
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.nio.channels.ClosedChannelException
at io.netty.channel.AbstractChannel$AbstractUnsafe.newClosedChannelException(AbstractChannel.java:957)
... 22 more


13:55:25.548 dispatcher-BlockManagerMaster INFO org.apache.spark.storage.BlockManagerMasterEndpoint: Trying to remove executor 93 from BlockManagerMaster.
13:55:25.548 dispatcher-BlockManagerMaster INFO org.apache.spark.storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(93, nta-offline-010.nta.leyantech.com, 34187, None)
{code}
NodeManager LOG
 
{code:java}
2020-06-23 13:55:25,363 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_e27_1589526283340_7859_01_000112 transitioned from RUNNING to KILLING
2020-06-23 13:55:25,503 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Cleaning up container container_e27_1589526283340_7859_01_000112
2020-06-23 13:55:25,511 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exit code from container container_e27_1589526283340_7859_01_000112 is : 143
2020-06-23 13:55:25,637 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_e27_1589526283340_7859_01_000112 transitioned from KILLING to CONTAINER_CLEANEDUP_AFTER_KILL
2020-06-23 13:55:25,643 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting absolute path : /data1/yarn/nm/usercache/schedule/appcache/application_1589526283340_7859/container_e27_1589526283340_7859_01_000112
{code}
Container LOG
{code:java}
13:54:24.192 Executor task launch worker for task 1857 INFO org.apache.parquet.hadoop.InternalParquetRecordWriter: Flushing mem columnStore to file. allocated memory: 3779
13:55:25.509 SIGTERM handler ERROR org.apache.spark.executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM
{code}
 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
