Posted to issues@spark.apache.org by "Sergei Lebedev (JIRA)" <ji...@apache.org> on 2017/10/09 16:48:14 UTC
[jira] [Commented] (SPARK-22227) DiskBlockManager.getAllBlocks could fail if called during shuffle
[ https://issues.apache.org/jira/browse/SPARK-22227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16197275#comment-16197275 ]
Sergei Lebedev commented on SPARK-22227:
----------------------------------------
Side note: the trace above is caused by the temporary file created by [SortShuffleWriter|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/shuffle/sort/SortShuffleWriter.scala#L69]. Occasionally we also saw failures mentioning {{TempShuffleBlockId}} names.
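For context, the offending file name is produced by appending a random UUID to the shuffle data file name. A minimal sketch of that naming scheme (modeled on Spark's {{Utils.tempFileWith}}; simplified here, not the verbatim implementation):
{code}
import java.io.File
import java.util.UUID

// Sketch: temp files are named "<dataFile>.<random uuid>", e.g.
// shuffle_1_2466_0.data.5684dd9e-9fa2-42f5-9dd2-051474e372be,
// which BlockId.apply cannot parse.
def tempFileWith(path: File): File =
  new File(path.getAbsolutePath + "." + UUID.randomUUID())
{code}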
> DiskBlockManager.getAllBlocks could fail if called during shuffle
> -----------------------------------------------------------------
>
> Key: SPARK-22227
> URL: https://issues.apache.org/jira/browse/SPARK-22227
> Project: Spark
> Issue Type: Bug
> Components: Block Manager
> Affects Versions: 2.2.0
> Reporter: Sergei Lebedev
> Priority: Minor
>
> {{DiskBlockManager.getAllBlocks}} assumes that the directories managed by the block manager only contain files corresponding to "valid" block IDs, i.e. those parsable via {{BlockId.apply}}. This is not always the case, as demonstrated by the following snippet:
> {code}
> import org.apache.spark.{SparkConf, SparkContext, SparkEnv}
> import org.apache.spark.storage.StorageLevel
>
> object GetAllBlocksFailure {
>   def main(args: Array[String]): Unit = {
>     val sc = new SparkContext(new SparkConf()
>       .setMaster("local[*]")
>       .setAppName("demo"))
>
>     // Poll getAllBlocks concurrently with the shuffle triggered below.
>     new Thread {
>       override def run(): Unit = {
>         while (true) {
>           println(SparkEnv.get.blockManager.diskBlockManager.getAllBlocks().length)
>           Thread.sleep(10)
>         }
>       }
>     }.start()
>
>     val rdd = sc.range(1, 65536, numSlices = 10)
>       .map(x => (x % 4096, x))
>       .persist(StorageLevel.DISK_ONLY)
>       .reduceByKey { _ + _ }
>       .collect()
>   }
> }
> {code}
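> The parsing assumption can also be seen directly (a hypothetical REPL session, using the temp-file name from the trace below):
> {code}
> scala> org.apache.spark.storage.BlockId("shuffle_1_2466_0.data.5684dd9e-9fa2-42f5-9dd2-051474e372be")
> java.lang.IllegalStateException: Unrecognized BlockId: shuffle_1_2466_0.data.5684dd9e-9fa2-42f5-9dd2-051474e372be
> {code}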
> We have a thread that computes the number of bytes occupied on disk by the block manager, and it frequently crashes because this assumption is violated. The relevant part of the stack trace:
> {code}
> 2017-10-06 11:20:14,287 ERROR org.apache.spark.util.SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[CoarseGrainedExecutorBackend-stop-executor,5,main]
> java.lang.IllegalStateException: Unrecognized BlockId: shuffle_1_2466_0.data.5684dd9e-9fa2-42f5-9dd2-051474e372be
> at org.apache.spark.storage.BlockId$.apply(BlockId.scala:133)
> at org.apache.spark.storage.DiskBlockManager$$anonfun$getAllBlocks$1.apply(DiskBlockManager.scala:103)
> at org.apache.spark.storage.DiskBlockManager$$anonfun$getAllBlocks$1.apply(DiskBlockManager.scala:103)
> at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at scala.collection.mutable.ArraySeq.foreach(ArraySeq.scala:73)
> at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at scala.collection.AbstractTraversable.map(Traversable.scala:105)
> at org.apache.spark.storage.DiskBlockManager.getAllBlocks(DiskBlockManager.scala:103)
> {code}
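> One possible mitigation (a sketch only, under the assumption that in-flight temp files can simply be ignored; not necessarily the fix that will land): have {{getAllBlocks}} skip file names that {{BlockId.apply}} cannot parse instead of propagating the exception:
> {code}
> import scala.util.Try
>
> // Sketch: tolerate non-block files (e.g. in-flight shuffle temp files)
> // by dropping names that do not parse as a BlockId.
> def getAllBlocks(): Seq[BlockId] =
>   getAllFiles().flatMap(f => Try(BlockId(f.getName)).toOption)
> {code}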
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org