Posted to issues@spark.apache.org by "Wenchen Fan (Jira)" <ji...@apache.org> on 2020/03/31 11:40:00 UTC

[jira] [Commented] (SPARK-29285) Temporary shuffle and local block should be able to handle disk failures

    [ https://issues.apache.org/jira/browse/SPARK-29285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17071699#comment-17071699 ] 

Wenchen Fan commented on SPARK-29285:
-------------------------------------

This was reverted in https://github.com/apache/spark/pull/28072.

> Temporary shuffle and local block should be able to handle disk failures
> ------------------------------------------------------------------------
>
>                 Key: SPARK-29285
>                 URL: https://issues.apache.org/jira/browse/SPARK-29285
>             Project: Spark
>          Issue Type: Improvement
>          Components: Shuffle
>    Affects Versions: 3.0.0
>            Reporter: Kent Yao
>            Assignee: Kent Yao
>            Priority: Major
>             Fix For: 3.0.0
>
>
> {code:java}
> java.io.FileNotFoundException: /mnt/dfs/4/yarn/local/usercache/da_haitao/appcache/application_1568691584183_1953115/blockmgr-cc4689f5-eddd-4b99-8af4-4166a86ec30b/10/temp_shuffle_79be5049-d1d5-4a81-8e67-4ef236d3834f (No such file or directory)
> 	at java.io.FileOutputStream.open0(Native Method)
> 	at java.io.FileOutputStream.open(FileOutputStream.java:270)
> 	at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
> 	at org.apache.spark.storage.DiskBlockObjectWriter.initialize(DiskBlockObjectWriter.scala:103)
> 	at org.apache.spark.storage.DiskBlockObjectWriter.open(DiskBlockObjectWriter.scala:116)
> 	at org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:249)
> 	at org.apache.spark.shuffle.sort.ShuffleExternalSorter.writeSortedFile(ShuffleExternalSorter.java:209)
> 	at org.apache.spark.shuffle.sort.ShuffleExternalSorter.closeAndGetSpills(ShuffleExternalSorter.java:416)
> 	at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.closeAndWriteOutput(UnsafeShuffleWriter.java:230)
> 	at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:190)
> 	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
> 	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
> 	at org.apache.spark.scheduler.Task.run(Task.scala:109)
> 	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> 	at java.lang.Thread.run(Thread.java:745)
> {code}
> Local or temporary shuffle files are created without any health check, because the getFile method in DiskBlockManager may return a subdirectory on a disk that has already failed. When a disk failure occurs, those files can become inaccessible and later throw FileNotFoundException, which fails the entire task. Re-running the whole task is heavyweight for this class of error; at the least, we could retry the file creation on one or more other disks, as sketched below.
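> A minimal sketch of that fallback idea in Scala. The helper object, method name, and the probe write are assumptions for illustration only; this is not the existing DiskBlockManager API:
> {code:scala}
> import java.io.{File, FileNotFoundException, FileOutputStream, IOException}
> 
> // Hypothetical helper (not the actual DiskBlockManager API): given the
> // executor's local root dirs, try each disk in turn instead of failing
> // the task the first time a directory on a dead disk is returned.
> object TempShuffleFiles {
>   def createWithFallback(localDirs: Seq[File], name: String): File = {
>     localDirs.iterator
>       .map { dir =>
>         try {
>           val f = new File(dir, name)
>           // Probe write: opening and closing a stream confirms the disk
>           // actually accepts the file before the real writer uses it.
>           new FileOutputStream(f).close()
>           Some(f)
>         } catch {
>           case _: IOException => None // this disk is bad; fall through
>         }
>       }
>       .collectFirst { case Some(f) => f }
>       .getOrElse(throw new FileNotFoundException(
>         s"could not create $name on any of ${localDirs.size} local dirs"))
>   }
> }
> {code}
> A real fix would presumably also track which root dirs have failed, so that later getFile calls skip them rather than re-probing every disk.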


