Posted to issues@spark.apache.org by "Thomas Graves (JIRA)" <ji...@apache.org> on 2017/03/03 15:44:45 UTC

[jira] [Commented] (SPARK-19812) YARN shuffle service fails to relocate recovery DB directories

    [ https://issues.apache.org/jira/browse/SPARK-19812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15894584#comment-15894584 ] 

Thomas Graves commented on SPARK-19812:
---------------------------------------

Note that the service still goes ahead and starts using the recovery DB in the new location. It just doesn't copy over the old one, so the recovery state for anything that was running gets lost.

> YARN shuffle service fails to relocate recovery DB directories
> --------------------------------------------------------------
>
>                 Key: SPARK-19812
>                 URL: https://issues.apache.org/jira/browse/SPARK-19812
>             Project: Spark
>          Issue Type: Bug
>          Components: YARN
>    Affects Versions: 2.0.1
>            Reporter: Thomas Graves
>            Assignee: Thomas Graves
>
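> The YARN shuffle service tries to switch from the YARN local directories to the real recovery directory, but it can fail to move the existing recovery DBs. The failure happens because Files.move cannot move a non-empty directory when the move requires relocating the directory's entries (for example, when source and destination are on different file systems).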
> 2017-03-03 14:57:19,558 [main] ERROR yarn.YarnShuffleService: Failed to move recovery file sparkShuffleRecovery.ldb to the path /mapred/yarn-nodemanager/nm-aux-services/spark_shuffle
> java.nio.file.DirectoryNotEmptyException: /yarn-local/sparkShuffleRecovery.ldb
>         at sun.nio.fs.UnixCopyFile.move(UnixCopyFile.java:498)
>         at sun.nio.fs.UnixFileSystemProvider.move(UnixFileSystemProvider.java:262)
>         at java.nio.file.Files.move(Files.java:1395)
>         at org.apache.spark.network.yarn.YarnShuffleService.initRecoveryDb(YarnShuffleService.java:369)
>         at org.apache.spark.network.yarn.YarnShuffleService.createSecretManager(YarnShuffleService.java:200)
>         at org.apache.spark.network.yarn.YarnShuffleService.serviceInit(YarnShuffleService.java:174)
>         at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.serviceInit(AuxServices.java:143)
>         at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>         at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:262)
>         at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>         at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
>         at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:357)
>         at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>         at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:636)
>         at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:684)
> This code used to call f.renameTo. We switched it to Files.move in the PR in response to review comments, and it looks like we didn't do a final real test. The tests use files rather than directories, so they didn't catch the problem. We need to fix the tests as well.
> history: https://github.com/apache/spark/pull/14999/commits/65de8531ccb91287f5a8a749c7819e99533b9440
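A minimal sketch of one way to move a non-empty recovery directory robustly: try Files.move first, since a rename on the same file system moves the whole tree at once, and fall back to a recursive copy followed by a children-first delete when Files.move throws DirectoryNotEmptyException. The class name RecoveryDirMove, the methods moveDirectory and copyThenDelete, and the file name "placeholder" are hypothetical and for illustration only; this is not the code from Spark's actual fix. The main method also follows the point about tests: its fixture is a directory with contents rather than a plain file.

```java
import java.io.IOException;
import java.nio.file.DirectoryNotEmptyException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.StandardCopyOption;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper for illustration; not the code from Spark's actual fix.
public class RecoveryDirMove {

    /** Moves a directory tree, falling back to copy-then-delete if a plain rename is impossible. */
    static void moveDirectory(Path src, Path dst) throws IOException {
        try {
            // A rename on the same file store moves a non-empty directory in one step.
            Files.move(src, dst);
        } catch (DirectoryNotEmptyException e) {
            // Thrown when the entries themselves would have to be moved (e.g. across file systems).
            copyThenDelete(src, dst);
        }
    }

    /** Recursively copies src to dst, then deletes src children-first. */
    static void copyThenDelete(Path src, Path dst) throws IOException {
        Files.walkFileTree(src, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult preVisitDirectory(Path dir, BasicFileAttributes attrs)
                    throws IOException {
                // Recreate each directory at the same relative position under dst.
                Files.createDirectories(dst.resolve(src.relativize(dir)));
                return FileVisitResult.CONTINUE;
            }

            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs)
                    throws IOException {
                Files.copy(file, dst.resolve(src.relativize(file)),
                        StandardCopyOption.COPY_ATTRIBUTES);
                return FileVisitResult.CONTINUE;
            }
        });
        // Only delete the source once every entry has been copied.
        List<Path> toDelete;
        try (Stream<Path> paths = Files.walk(src)) {
            // Reverse order puts descendants before their parent directories.
            toDelete = paths.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path p : toDelete) {
            Files.delete(p);
        }
    }

    public static void main(String[] args) throws IOException {
        Path root = Files.createTempDirectory("shuffle-recovery-demo");
        // The fixture is a directory WITH contents, unlike a plain-file test.
        Path src = Files.createDirectory(root.resolve("sparkShuffleRecovery.ldb"));
        Files.write(src.resolve("placeholder"), "data".getBytes());
        Path dst = Files.createDirectory(root.resolve("spark_shuffle"))
                .resolve("sparkShuffleRecovery.ldb");

        moveDirectory(src, dst);
        System.out.println(Files.exists(dst.resolve("placeholder")) && !Files.exists(src));

        // Exercise the fallback explicitly, since a same-file-system move takes the rename path.
        Path copied = root.resolve("copied.ldb");
        copyThenDelete(dst, copied);
        System.out.println(Files.exists(copied.resolve("placeholder")) && !Files.exists(dst));
    }
}
```

Running main should print true twice. Because a temporary directory normally sits on a single file system, Files.move succeeds there by renaming. Reproducing the DirectoryNotEmptyException from the stack trace would likely require a source and destination on different mounts, which is why the demo calls the fallback path directly as well.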



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
