Posted to reviews@spark.apache.org by LiShuMing <gi...@git.apache.org> on 2017/08/10 08:51:44 UTC

[GitHub] spark pull request #18905: [SPARK-21660] [YARN] [Shuffle] Yarn ShuffleServic...

GitHub user LiShuMing opened a pull request:

    https://github.com/apache/spark/pull/18905

    [SPARK-21660] [YARN] [Shuffle] Yarn ShuffleService failed to start when the chosen dir…

    
    ## What changes were proposed in this pull request?
    
    See [SPARK-21660](https://issues.apache.org/jira/browse/SPARK-21660). This PR adds a simple strategy that validates the chosen disk is writable, to avoid picking a read-only disk.
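    The idea can be sketched roughly as below (a hypothetical helper with illustrative names, not the actual patch in `YarnShuffleService.java`): scan the candidate local dirs and return the first one that is writable.

```java
import java.io.File;

public class RecoveryDirPicker {
    // Return the first candidate directory that exists and is writable,
    // or null if none qualifies. Hypothetical helper; the real patch
    // threads this logic through initRecoveryDb().
    public static File pickWritableDir(String[] candidates) {
        for (String dir : candidates) {
            File f = new File(dir);
            if (f.isDirectory() && f.canWrite()) {
                return f;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        File chosen = pickWritableDir(new String[] {
            "/definitely/not/mounted",                  // skipped: not a directory
            System.getProperty("java.io.tmpdir")        // writable on most systems
        });
        System.out.println("chosen: " + chosen);
    }
}
```

    Note that `File.canWrite()` only consults permissions, which is exactly the limitation the reviewers discuss later in this thread.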
    
    ## How was this patch tested?
    
    #### How to mock a corrupted disk?
    > Make the recovery path read-only:
    > sudo chmod -R 400 /var/log/hadoop-yarn/nodemanager/recovery-state/nm-aux-services/spark_shuffle
    
    Before this PR, starting the NodeManager threw the exception below:
    
    > 2017-08-10 16:30:08,112 INFO  yarn.YarnShuffleService (YarnShuffleService.java:<init>(136)) - Initializing YARN shuffle service for Spark
    2017-08-10 16:30:08,112 INFO  containermanager.AuxServices (AuxServices.java:addService(72)) - Adding auxiliary service spark_shuffle, "spark_shuffle"
    2017-08-10 16:30:08,218 ERROR util.LevelDBProvider (LevelDBProvider.java:initLevelDB(61)) - error opening leveldb file /var/log/hadoop-yarn/nodemanager/recovery-state/nm-aux-services/spark_shuffle/registeredExecutors.ldb.  Creating new file, will not be able to recover state for existing applications
    org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: /var/log/hadoop-yarn/nodemanager/recovery-state/nm-aux-services/spark_shuffle/registeredExecutors.ldb/LOCK: Permission denied
            at org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200)
            at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218)
            at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168)
            at org.apache.spark.network.util.LevelDBProvider.initLevelDB(LevelDBProvider.java:48)
            at org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.<init>(ExternalShuffleBlockResolver.java:116)
            at org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.<init>(ExternalShuffleBlockResolver.java:94)
            at org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.<init>(ExternalShuffleBlockHandler.java:66)
            at org.apache.spark.network.yarn.YarnShuffleService.serviceInit(YarnShuffleService.java:167)
            at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
            at org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.serviceInit(AuxServices.java:143)
            at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
            at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
            at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:245)
            at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
            at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
            at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:261)
            at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
            at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:495)
            at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:543)
    2017-08-10 16:30:08,220 WARN  util.LevelDBProvider (LevelDBProvider.java:initLevelDB(71)) - error deleting /var/log/hadoop-yarn/nodemanager/recovery-state/nm-aux-services/spark_shuffle/registeredExecutors.ldb
    2017-08-10 16:30:08,220 INFO  service.AbstractService (AbstractService.java:noteFailure(272)) - Service spark_shuffle failed in state INITED; cause: java.io.IOException: Unable to create state store
    java.io.IOException: Unable to create state store
            at org.apache.spark.network.util.LevelDBProvider.initLevelDB(LevelDBProvider.java:77)
            at org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.<init>(ExternalShuffleBlockResolver.java:116)
            at org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.<init>(ExternalShuffleBlockResolver.java:94)
            at org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.<init>(ExternalShuffleBlockHandler.java:66)
            at org.apache.spark.network.yarn.YarnShuffleService.serviceInit(YarnShuffleService.java:167)
            at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
            at org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.serviceInit(AuxServices.java:143)
            at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
            at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
            at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:245)
            at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
            at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
            at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:261)
            at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
            at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:495)
            at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:543)
    Caused by: org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: /var/log/hadoop-yarn/nodemanager/recovery-state/nm-aux-services/spark_shuffle/registeredExecutors.ldb/LOCK: Permission denied
            at org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200)
            at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218)
            at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168)
            at org.apache.spark.network.util.LevelDBProvider.initLevelDB(LevelDBProvider.java:75)
            ... 15 more
    
    
    After this PR:

    > 2017-08-10 16:36:49,101 INFO  yarn.YarnShuffleService (YarnShuffleService.java:<init>(136)) - Initializing YARN shuffle service for Spark
    2017-08-10 16:36:49,101 INFO  containermanager.AuxServices (AuxServices.java:addService(72)) - Adding auxiliary service spark_shuffle, "spark_shuffle"
    2017-08-10 16:36:49,102 INFO  yarn.YarnShuffleService (YarnShuffleService.java:initRecoveryDb(359)) - Recovery path /var/log/hadoop-yarn/nodemanager/recovery-state/nm-aux-services/spark_shuffle ldb available: false.
    2017-08-10 16:36:49,102 WARN  yarn.YarnShuffleService (YarnShuffleService.java:initRecoveryDb(367)) - Recovery path /var/log/hadoop-yarn/nodemanager/recovery-state/nm-aux-services/spark_shuffle unavailable: set it to null
    2017-08-10 16:36:49,180 INFO  util.LevelDBProvider (LevelDBProvider.java:initLevelDB(51)) - Creating state database at /mnt/dfs/0/hadoop/yarn/local/registeredExecutors.ldb
    2017-08-10 16:36:49,317 INFO  util.LevelDBProvider$LevelDBLogger (LevelDBProvider.java:log(93)) - Delete type=3 #1
    2017-08-10 16:36:49,548 INFO  yarn.YarnShuffleService (YarnShuffleService.java:serviceInit(186)) - Started YARN shuffle service for Spark on port 7337. Authentication is not enabled.  Registered executor file is /mnt/dfs/0/hadoop/yarn/local/registeredExecutors.ldb

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/LiShuMing/spark SPARK-21660

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/18905.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #18905
    
----
commit d62405dfbbea6ce1e7604721ab1234e5fde5b651
Author: lishuming <al...@126.com>
Date:   2017-08-09T02:45:28Z

    [SPARK-21660] Yarn ShuffleService failed to start when the chosen directory become read-only

commit 2077537c52b43c6df050a7afe23a453d09e38db6
Author: lishuming <al...@126.com>
Date:   2017-08-10T08:45:41Z

    Recovery path had already existed but unavailable, set it to null

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18905: [SPARK-21660][YARN][Shuffle] Yarn ShuffleService failed ...

Posted by LiShuMing <gi...@git.apache.org>.
Github user LiShuMing commented on the issue:

    https://github.com/apache/spark/pull/18905
  
    ping @jerryshao 
    
     I found a disk-check method in Hadoop: https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/DiskChecker.java#L111
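    In the spirit of that DiskChecker approach, a more robust probe actually writes and deletes a temp file instead of trusting `File.canWrite()`, which can report stale permissions for a failed mount (a sketch with illustrative names, not Hadoop's actual implementation):

```java
import java.io.File;
import java.io.IOException;

public class WriteProbe {
    // Probe a directory by creating and deleting a real temp file.
    // This catches failures (e.g. a lost mount) that a pure
    // permission check like File.canWrite() can miss.
    public static boolean canActuallyWrite(File dir) {
        try {
            File probe = File.createTempFile("probe", ".tmp", dir);
            return probe.delete();
        } catch (IOException | SecurityException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        File tmp = new File(System.getProperty("java.io.tmpdir"));
        System.out.println("writable: " + canActuallyWrite(tmp));
    }
}
```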
    
    I added a unit test. Can you help review my code?




[GitHub] spark pull request #18905: [SPARK-21660][YARN][Shuffle] Yarn ShuffleService ...

Posted by jerryshao <gi...@git.apache.org>.
Github user jerryshao commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18905#discussion_r132896967
  
    --- Diff: common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java ---
    @@ -333,33 +333,63 @@ protected Path getRecoveryPath(String fileName) {
       }
     
       /**
    +   * Check the chosen DB file available or not.
    +   */
    +  protected Boolean checkFileAvailable(File file) {
    --- End diff --
    
    I'm not sure this is a thorough way to check disk health. In an internal case of ours, a disk was not mounted (due to failure), and trying to write to the unmounted path threw a permission-denied exception.
    
    An unwritable disk is just one kind of unhealthy disk; maybe we should look at YARN's disk health-check mechanism.




[GitHub] spark issue #18905: [SPARK-21660] [YARN] [Shuffle] Yarn ShuffleService faile...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18905
  
    Can one of the admins verify this patch?




[GitHub] spark issue #18905: [SPARK-21660][YARN][Shuffle] Yarn ShuffleService failed ...

Posted by LiShuMing <gi...@git.apache.org>.
Github user LiShuMing commented on the issue:

    https://github.com/apache/spark/pull/18905
  
    Sorry, I have been busy recently; I will update it today...




[GitHub] spark issue #18905: [SPARK-21660][YARN][Shuffle] Yarn ShuffleService failed ...

Posted by LiShuMing <gi...@git.apache.org>.
Github user LiShuMing commented on the issue:

    https://github.com/apache/spark/pull/18905
  
    See another approach to solving this problem at https://github.com/apache/spark/pull/19032; I will close this PR.
    
    Thanks @jerryshao  @tgravescs .




[GitHub] spark pull request #18905: [SPARK-21660][YARN][Shuffle] Yarn ShuffleService ...

Posted by jerryshao <gi...@git.apache.org>.
Github user jerryshao commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18905#discussion_r132901109
  
    --- Diff: common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java ---
    @@ -333,33 +333,63 @@ protected Path getRecoveryPath(String fileName) {
       }
     
       /**
    +   * Check the chosen DB file available or not.
    +   */
    +  protected Boolean checkFileAvailable(File file) {
    +      if (file.canWrite()){
    +        return true;
    +      }
    +
    +      return false;
    +  }
    +
    +  /**
        * Figure out the recovery path and handle moving the DB if YARN NM recovery gets enabled
        * when it previously was not. If YARN NM recovery is enabled it uses that path, otherwise
        * it will uses a YARN local dir.
        */
       protected File initRecoveryDb(String dbName) {
    +    Boolean bolRecoveryPathAvailable = true;
    +
         if (_recoveryPath != null) {
             File recoveryFile = new File(_recoveryPath.toUri().getPath(), dbName);
    -        if (recoveryFile.exists()) {
    +
    +        bolRecoveryPathAvailable = checkFileAvailable(recoveryFile);
    +        logger.info("Recovery path {} ldb available: {}.", _recoveryPath, bolRecoveryPathAvailable);
    +        if (recoveryFile.exists() && bolRecoveryPathAvailable) {
               return recoveryFile;
             }
         }
    +
    +    // If recovery path unavailable, no use it any more.
    +    if (!bolRecoveryPathAvailable) {
    --- End diff --
    
    I think the recovery path is either set by the user or uses the YARN default; the user should ensure this directory is available, and YARN relies on it internally. It doesn't make sense to switch to another disk when the recovery path is unavailable.




[GitHub] spark pull request #18905: [SPARK-21660][YARN][Shuffle] Yarn ShuffleService ...

Posted by LiShuMing <gi...@git.apache.org>.
Github user LiShuMing closed the pull request at:

    https://github.com/apache/spark/pull/18905




[GitHub] spark pull request #18905: [SPARK-21660][YARN][Shuffle] Yarn ShuffleService ...

Posted by jerryshao <gi...@git.apache.org>.
Github user jerryshao commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18905#discussion_r132897003
  
    --- Diff: common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java ---
    @@ -333,33 +333,63 @@ protected Path getRecoveryPath(String fileName) {
       }
     
       /**
    +   * Check the chosen DB file available or not.
    +   */
    +  protected Boolean checkFileAvailable(File file) {
    +      if (file.canWrite()){
    --- End diff --
    
    Use two-space indentation for the Java code.




[GitHub] spark issue #18905: [SPARK-21660][YARN][Shuffle] Yarn ShuffleService failed ...

Posted by LiShuMing <gi...@git.apache.org>.
Github user LiShuMing commented on the issue:

    https://github.com/apache/spark/pull/18905
  
    @jerryshao Thanks for your replies! I will do the following:
    1. "Is it good to change to other directories (is YARN internally relying on it)?"
    I think the recovery path (a local variable) is only used in `YarnShuffleService`, so in principle it does not affect the YARN environment. This PR addresses the scenario where, with several disks to choose from, we can pick a usable disk for the recovery path.
    
    2. Check HDFS/YARN's disk health-check mechanism to better define `checkFileAvailable()`.
    
    3. Fix the code format.
    
    4. Throw an exception when `_recoveryPath` is still empty at the end.





[GitHub] spark issue #18905: [SPARK-21660][YARN][Shuffle] Yarn ShuffleService failed ...

Posted by jerryshao <gi...@git.apache.org>.
Github user jerryshao commented on the issue:

    https://github.com/apache/spark/pull/18905
  
    I have two questions about the fix:
    
    1. Is it a good idea to change the recovery path to another directory? The recovery path is configured by the user or figured out by YARN, so YARN may make assumptions about it; if we change it to another one, will that introduce issues? Also, if the recovery path is null, shouldn't the user be responsible for guaranteeing its availability?
    2. What if a previously bad disk comes back to normal with orphan data? For example, dir1 fails holding state v1, so under this logic we choose another dir2 and the state moves to v2. If dir1 later comes back, which dir does your current code choose?
    
    CC @tgravescs to review.



[GitHub] spark issue #18905: [SPARK-21660][YARN][Shuffle] Yarn ShuffleService failed ...

Posted by jerryshao <gi...@git.apache.org>.
Github user jerryshao commented on the issue:

    https://github.com/apache/spark/pull/18905
  
    @LiShuMing any update on this?




[GitHub] spark issue #18905: [SPARK-21660][YARN][Shuffle] Yarn ShuffleService failed ...

Posted by tgravescs <gi...@git.apache.org>.
Github user tgravescs commented on the issue:

    https://github.com/apache/spark/pull/18905
  
    The recovery path returned by YARN is supposed to be reliable, and if it isn't working then the NM itself shouldn't run. So in general you should just use it if you want Spark to be able to recover. If you don't have YARN recovery enabled, there is no need for us to write the DBs at all, and I think we should change the code to not do that.
    
    I think this JIRA is a dup of https://issues.apache.org/jira/browse/SPARK-17321
    
    See my comments there.




[GitHub] spark pull request #18905: [SPARK-21660][YARN][Shuffle] Yarn ShuffleService ...

Posted by jerryshao <gi...@git.apache.org>.
Github user jerryshao commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18905#discussion_r132901938
  
    --- Diff: common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java ---
    @@ -378,6 +408,18 @@ protected File initRecoveryDb(String dbName) {
             }
           }
         }
    +
    +    // Find a local_dir which is writable, to avoid creating ldb in a read-only disk.
    +    if (_recoveryPath == null) {
    +      for (String dir : localDirs) {
    +        File f = new File(dir);
    +        if (checkFileAvailable(f)) {
    +          _recoveryPath = new Path(dir);
    +          break;
    +        }
    +      }
    +    }
    +
         if (_recoveryPath == null) {
    --- End diff --
    
    If `_recoveryPath` is still null, I think we should throw an exception here, since none of the disks is good.
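    That fail-fast suggestion might look roughly like this (an illustrative sketch with hypothetical names, not the actual Spark code):

```java
import java.io.File;
import java.io.IOException;

public class FailFast {
    // If no writable directory remains, fail fast with an exception
    // rather than starting the service in an unrecoverable state.
    // Sketch of the reviewer's suggestion; names are illustrative.
    public static File requireWritableDir(File candidate) throws IOException {
        if (candidate == null || !candidate.canWrite()) {
            throw new IOException("No writable directory available for recovery DB");
        }
        return candidate;
    }

    public static void main(String[] args) throws IOException {
        File tmp = new File(System.getProperty("java.io.tmpdir"));
        System.out.println("using: " + requireWritableDir(tmp));
    }
}
```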

