Posted to dev@spark.apache.org by Zee Chen <ze...@gmail.com> on 2016/02/25 09:24:25 UTC

Bug in DiskBlockManager subDirs logic?

Hi,

I am debugging a situation where SortShuffleWriter sometimes fails to
create a file, with the following stack trace:

16/02/23 11:48:46 ERROR Executor: Exception in task 13.0 in stage
47827.0 (TID 1367089)
java.io.FileNotFoundException:
/tmp/spark-9dd8dca9-6803-4c6c-bb6a-0e9c0111837c/executor-129dfdb8-9422-4668-989e-e789703526ad/blockmgr-dda6e340-7859-468f-b493-04e4162d341a/00/temp_shuffle_69fe1673-9ff2-462b-92b8-683d04669aad
(No such file or directory)
        at java.io.FileOutputStream.open0(Native Method)
        at java.io.FileOutputStream.open(FileOutputStream.java:270)
        at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
        at org.apache.spark.storage.DiskBlockObjectWriter.open(DiskBlockObjectWriter.scala:88)
        at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.insertAll(BypassMergeSortShuffleWriter.java:110)
        at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:73)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
        at org.apache.spark.scheduler.Task.run(Task.scala:88)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)


I checked the Linux file system (ext4) and saw that the /00/ subdir was
missing. I went through a heap dump of the
CoarseGrainedExecutorBackend JVM process and found that
DiskBlockManager's subDirs list had more non-null 2-hex subdirs than
were present on the file system! As a test I created all 64 2-hex
subdirs by hand, and the problem went away.
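In case anyone wants to reproduce the workaround, here is a rough sketch of what I did, in Scala rather than by hand (the target path below is a temp-dir stand-in for illustration, not the real blockmgr-* directory on the executor):

```scala
import java.io.File

// Recreate the 64 two-hex subdirectories (00 .. 3f) that
// DiskBlockManager expects under a blockmgr-* directory.
object RecreateSubDirs {
  def recreate(blockMgrDir: File): Seq[File] = {
    (0 until 64).map { i =>
      val sub = new File(blockMgrDir, "%02x".format(i))
      sub.mkdirs() // no-op if the subdir already exists
      sub
    }
  }
}
```

With all 64 subdirs pre-created, the lazy mkdir path in DiskBlockManager is never exercised, which is consistent with the failures disappearing.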

So, has anybody else seen this problem? Looking at the relevant logic
in DiskBlockManager, it hasn't changed much since the fix for
https://issues.apache.org/jira/browse/SPARK-6468
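For context, the placement logic I mean looks roughly like this (a simplified sketch from my reading of the 1.5 source; nonNegativeHash here stands in for Utils.nonNegativeHash, and 64 is the spark.diskStore.subDirectories default):

```scala
// Sketch of how DiskBlockManager maps a block filename to a
// (local dir, two-hex subdir) pair. The real code also creates the
// subdir lazily, under a lock, on first use.
object SubDirSketch {
  val subDirsPerLocalDir = 64 // spark.diskStore.subDirectories default

  def nonNegativeHash(s: String): Int = {
    val h = s.hashCode
    if (h == Int.MinValue) 0 else math.abs(h)
  }

  // Returns (localDir index, two-hex subdir name) for a filename.
  def locate(filename: String, numLocalDirs: Int): (Int, String) = {
    val hash = nonNegativeHash(filename)
    val dirId = hash % numLocalDirs
    val subDirId = (hash / numLocalDirs) % subDirsPerLocalDir
    (dirId, "%02x".format(subDirId))
  }
}
```

So a write only touches the one subdir the hash selects, which would explain why a single missing /00/ directory makes only some tasks fail.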

My configuration:
spark-1.5.1, hadoop-2.6.0, standalone, oracle jdk8u60

Thanks,
Zee

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org