Posted to issues@spark.apache.org by "Saisai Shao (JIRA)" <ji...@apache.org> on 2017/03/08 07:41:37 UTC

[jira] [Comment Edited] (SPARK-19812) YARN shuffle service fails to relocate recovery DB directories

    [ https://issues.apache.org/jira/browse/SPARK-19812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15900821#comment-15900821 ] 

Saisai Shao edited comment on SPARK-19812 at 3/8/17 7:40 AM:
-------------------------------------------------------------

[~tgraves], I'm not quite sure what you mean here:

bq. The tests are using files rather than directories so it didn't catch. We need to fix the test also.

From my understanding, this issue happens when the dest dir is not empty and we try to move with REPLACE_EXISTING. It can also happen when the rename call fails and the source dir is a non-empty directory.

But I cannot imagine how this happened, because if the dest dir is not empty, the method should have returned earlier and would not go on to check the old NM local dirs.
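
For reference, the first case can be reproduced with the JDK alone. This is only a minimal sketch, assuming the default file system on a Unix-like machine; the class name, method name, and the "data" file are made up for illustration and are not taken from YarnShuffleService. The second case (rename fails and the source dir is not empty) most likely needs the source and destination on different file stores, so it is not covered here.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryNotEmptyException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical demo class, not part of Spark.
final class MoveOntoNonEmptyDir {

  /**
   * Moves a non-empty source directory onto a non-empty destination directory
   * with REPLACE_EXISTING and reports whether DirectoryNotEmptyException was thrown.
   */
  static boolean moveFailsOnNonEmptyTarget() {
    try {
      Path root = Files.createTempDirectory("recovery-demo");
      Path src = Files.createDirectory(root.resolve("old"));
      Path dest = Files.createDirectory(root.resolve("new"));
      // Placeholder files standing in for DB contents.
      Files.createFile(src.resolve("data"));
      Files.createFile(dest.resolve("data"));
      try {
        Files.move(src, dest, StandardCopyOption.REPLACE_EXISTING);
        return false; // the move went through, no exception
      } catch (DirectoryNotEmptyException e) {
        return true;  // the non-empty target could not be replaced
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println("DirectoryNotEmptyException thrown: " + moveFailsOnNonEmptyTarget());
  }
}
```

On a typical Linux JVM I would expect this to report true, but the Files.move Javadoc lists DirectoryNotEmptyException as an optional specific exception, so other platforms may behave differently.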




was (Author: jerryshao):
[~tgraves], I'm not quite sure what you mean here:

bq. The tests are using files rather than directories so it didn't catch. We need to fix the test also.

From my understanding, this issue happens when the dest dir is not empty and we try to move with REPLACE_EXISTING, but I cannot imagine how this happened, because if the dest dir is not empty, the method should have returned earlier and would not go on to check the old NM local dirs.

> YARN shuffle service fails to relocate recovery DB directories
> --------------------------------------------------------------
>
>                 Key: SPARK-19812
>                 URL: https://issues.apache.org/jira/browse/SPARK-19812
>             Project: Spark
>          Issue Type: Bug
>          Components: YARN
>    Affects Versions: 2.0.1
>            Reporter: Thomas Graves
>            Assignee: Thomas Graves
>
> The YARN shuffle service tries to switch from the YARN local directories to the real recovery directory but can fail to move the existing recovery DBs. It fails because Files.move does not move directories that have contents.
> 2017-03-03 14:57:19,558 [main] ERROR yarn.YarnShuffleService: Failed to move recovery file sparkShuffleRecovery.ldb to the path /mapred/yarn-nodemanager/nm-aux-services/spark_shuffle
> java.nio.file.DirectoryNotEmptyException:/yarn-local/sparkShuffleRecovery.ldb
>         at sun.nio.fs.UnixCopyFile.move(UnixCopyFile.java:498)
>         at sun.nio.fs.UnixFileSystemProvider.move(UnixFileSystemProvider.java:262)
>         at java.nio.file.Files.move(Files.java:1395)
>         at org.apache.spark.network.yarn.YarnShuffleService.initRecoveryDb(YarnShuffleService.java:369)
>         at org.apache.spark.network.yarn.YarnShuffleService.createSecretManager(YarnShuffleService.java:200)
>         at org.apache.spark.network.yarn.YarnShuffleService.serviceInit(YarnShuffleService.java:174)
>         at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.serviceInit(AuxServices.java:143)
>         at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>         at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:262)
>         at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>         at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
>         at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:357)
>         at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>         at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:636)
>         at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:684)
> This used to use f.renameTo; we switched it in the PR due to review comments, and it looks like we didn't do a final real test. The tests use files rather than directories, so they didn't catch this. We need to fix the tests as well.
> history: https://github.com/apache/spark/pull/14999/commits/65de8531ccb91287f5a8a749c7819e99533b9440
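
To illustrate the point in the description about testing with directories rather than files, here is a rough sketch of a move helper together with a check that exercises it on a directory that has contents. It is not the change that went into Spark; RecoveryDirMove, moveTree, copyThenDelete, movesNonEmptyDir, and the "data" file are all hypothetical names, and the copy-then-delete fallback is just one possible way to handle the DirectoryNotEmptyException from the stack trace above.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryNotEmptyException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.StandardCopyOption;
import java.nio.file.attribute.BasicFileAttributes;

// Hypothetical helper, not part of Spark.
final class RecoveryDirMove {

  /** Moves a directory tree, falling back to copy-then-delete if the plain move is refused. */
  static void moveTree(Path src, Path dest) {
    try {
      try {
        Files.move(src, dest); // a plain rename when the file system allows it
      } catch (DirectoryNotEmptyException e) {
        copyThenDelete(src, dest);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Copies the whole tree under src to dest, then deletes src bottom-up. */
  static void copyThenDelete(final Path src, final Path dest) throws IOException {
    Files.walkFileTree(src, new SimpleFileVisitor<Path>() {
      @Override
      public FileVisitResult preVisitDirectory(Path dir, BasicFileAttributes attrs)
          throws IOException {
        Files.createDirectories(dest.resolve(src.relativize(dir)));
        return FileVisitResult.CONTINUE;
      }

      @Override
      public FileVisitResult visitFile(Path file, BasicFileAttributes attrs)
          throws IOException {
        Files.copy(file, dest.resolve(src.relativize(file)), StandardCopyOption.COPY_ATTRIBUTES);
        return FileVisitResult.CONTINUE;
      }
    });
    Files.walkFileTree(src, new SimpleFileVisitor<Path>() {
      @Override
      public FileVisitResult visitFile(Path file, BasicFileAttributes attrs)
          throws IOException {
        Files.delete(file);
        return FileVisitResult.CONTINUE;
      }

      @Override
      public FileVisitResult postVisitDirectory(Path dir, IOException exc) throws IOException {
        if (exc != null) {
          throw exc;
        }
        Files.delete(dir); // the directory is empty by now
        return FileVisitResult.CONTINUE;
      }
    });
  }

  /** Builds a non-empty directory, moves it, and reports whether its contents arrived. */
  static boolean movesNonEmptyDir(boolean useFallback) {
    try {
      Path root = Files.createTempDirectory("recovery-move");
      Path src = Files.createDirectory(root.resolve("sparkShuffleRecovery.ldb"));
      Files.createFile(src.resolve("data")); // placeholder for DB contents
      Path dest = root.resolve("relocated.ldb");
      if (useFallback) {
        copyThenDelete(src, dest);
      } else {
        moveTree(src, dest);
      }
      return Files.notExists(src) && Files.isRegularFile(dest.resolve("data"));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println("moveTree ok: " + movesNonEmptyDir(false));
    System.out.println("copyThenDelete ok: " + movesNonEmptyDir(true));
  }
}
```

The check calls copyThenDelete directly as well as moveTree, because inside a single temp directory the plain move normally succeeds as a rename and the fallback branch would otherwise never run.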


