You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Apache Spark (Jira)" <ji...@apache.org> on 2022/11/01 11:27:00 UTC

[jira] [Commented] (SPARK-40987) Avoid creating a directory when deleting a block, causing DAGScheduler to not work

    [ https://issues.apache.org/jira/browse/SPARK-40987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17627113#comment-17627113 ] 

Apache Spark commented on SPARK-40987:
--------------------------------------

User 'cxzl25' has created a pull request for this issue:
https://github.com/apache/spark/pull/38467

> Avoid creating a directory when deleting a block, causing DAGScheduler to not work
> ----------------------------------------------------------------------------------
>
>                 Key: SPARK-40987
>                 URL: https://issues.apache.org/jira/browse/SPARK-40987
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.2.2, 3.3.1
>            Reporter: dzcxzl
>            Priority: Minor
>
> When the driver submits a job, DAGScheduler calls sc.broadcast(taskBinaryBytes).
> TorrentBroadcast#writeBlocks may fail due to disk problems during blockManager#putBytes.
> BlockManager#doPut calls BlockManager#removeBlockInternal to clean up the block.
> BlockManager#removeBlockInternal calls DiskStore#remove to clean up blocks on disk.
> DiskStore#remove will try to create the directory because the directory does not exist, and an exception will be thrown at this time.
> BlockInfoManager#blockInfoWrappers block info and lock not removed.
> The catch block in TorrentBroadcast#writeBlocks will call blockManager.removeBroadcast to clean up the broadcast.
> Because the block lock in BlockInfoManager#blockInfoWrappers is not released, the dag-scheduler-event-loop thread of DAGScheduler will wait forever.
>  
>  
> {code:java}
> 22/11/01 18:27:48 WARN BlockManager: Putting block broadcast_0_piece0 failed due to exception java.io.IOException: XXXXX.
> 22/11/01 18:27:48 ERROR TorrentBroadcast: Store broadcast broadcast_0 fail, remove all pieces of the broadcast {code}
>  
>  
>  
> {code:java}
> "dag-scheduler-event-loop" #54 daemon prio=5 os_prio=31 tid=0x00007fc98e3fa800 nid=0x7203 waiting on condition [0x0000700008c1e000]
>    java.lang.Thread.State: WAITING (parking)
>     at sun.misc.Unsafe.park(Native Method)
>     - parking to wait for  <0x00000007add3d8c8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>     at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>     at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
>     at org.apache.spark.storage.BlockInfoManager.$anonfun$acquireLock$1(BlockInfoManager.scala:221)
>     at org.apache.spark.storage.BlockInfoManager.$anonfun$acquireLock$1$adapted(BlockInfoManager.scala:214)
>     at org.apache.spark.storage.BlockInfoManager$$Lambda$3038/1307533457.apply(Unknown Source)
>     at org.apache.spark.storage.BlockInfoWrapper.withLock(BlockInfoManager.scala:105)
>     at org.apache.spark.storage.BlockInfoManager.acquireLock(BlockInfoManager.scala:214)
>     at org.apache.spark.storage.BlockInfoManager.lockForWriting(BlockInfoManager.scala:293)
>     at org.apache.spark.storage.BlockManager.removeBlock(BlockManager.scala:1979)
>     at org.apache.spark.storage.BlockManager.$anonfun$removeBroadcast$3(BlockManager.scala:1970)
>     at org.apache.spark.storage.BlockManager.$anonfun$removeBroadcast$3$adapted(BlockManager.scala:1970)
>     at org.apache.spark.storage.BlockManager$$Lambda$3092/1241801156.apply(Unknown Source)
>     at scala.collection.Iterator.foreach(Iterator.scala:943)
>     at scala.collection.Iterator.foreach$(Iterator.scala:943)
>     at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
>     at org.apache.spark.storage.BlockManager.removeBroadcast(BlockManager.scala:1970)
>     at org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:179)
>     at org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:99)
>     at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:38)
>     at org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:78)
>     at org.apache.spark.SparkContext.broadcastInternal(SparkContext.scala:1538)
>     at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1520)
>     at org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1539)
>     at org.apache.spark.scheduler.DAGScheduler.submitStage(DAGScheduler.scala:1355)
>     at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:1297)
>     at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2929)
>     at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2921)
>     at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2910)
>     at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49) {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org