Posted to reviews@spark.apache.org by suyanNone <gi...@git.apache.org> on 2016/04/27 15:35:11 UTC

[GitHub] spark pull request: [SPARK-14957][Spark] Adopt healthy dir to stor...

GitHub user suyanNone opened a pull request:

    https://github.com/apache/spark/pull/12735

    [SPARK-14957][Spark] Adopt healthy dir to store executor meta

    ## What changes were proposed in this pull request?
    
    Adopt a healthy dir to store executor meta.
    
    ## How was this patch tested?
    
    Unit tests
    
    
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/suyanNone/spark fix-0dir

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/12735.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #12735
    
----
commit 2527391378e3f572663211f9ecc20a3a49c52ebb
Author: hushan <hu...@xiaomi.com>
Date:   2016-04-27T13:24:49Z

    Adopt healthy dir to store executor meta

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14957][Yarn] Adopt healthy dir to store...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12735#issuecomment-215089725
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request: [SPARK-14957][Yarn] Adopt healthy dir to store...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12735#issuecomment-215089728
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/57126/
    Test PASSed.




[GitHub] spark pull request: [SPARK-14957][Yarn] Adopt healthy dir to store...

Posted by jerryshao <gi...@git.apache.org>.
Github user jerryshao commented on the pull request:

    https://github.com/apache/spark/pull/12735#issuecomment-216418766
  
    Yes, agreed with @tgravescs: `setRecoveryPath` and `getRecoveryPath` are a good way for 2.5+. But as mentioned above, executors will fail fast if the external shuffle service does not exist, so leaving it as it is isn't a big problem.




[GitHub] spark pull request: [SPARK-14957][Yarn] Adopt healthy dir to store...

Posted by tgravescs <gi...@git.apache.org>.
Github user tgravescs commented on the pull request:

    https://github.com/apache/spark/pull/12735#issuecomment-215436163
  
    Actually, after looking a bit more, Spark does fail fast if the shuffle service isn't there: very soon after startup the BlockManager registers with the shuffle service, so if it didn't come up the executors should fail quickly. Is this what you were seeing?
    This to me isn't so bad; at least it isn't wasting a bunch of work. Yes, new executors could get scheduled there, but they should fail very quickly without wasting work.




[GitHub] spark pull request: [SPARK-14957][Yarn] Adopt healthy dir to store...

Posted by suyanNone <gi...@git.apache.org>.
Github user suyanNone closed the pull request at:

    https://github.com/apache/spark/pull/12735




[GitHub] spark pull request: [SPARK-14957][Yarn] Adopt healthy dir to store...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/12735#issuecomment-217116899
  
    OK sounds like we should close this PR




[GitHub] spark pull request: [SPARK-14957][Yarn] Adopt healthy dir to store...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/12735#issuecomment-215138869
  
    Hadoop 2.2 is still the base build for 2.x. If you think this is a small additional good reason to up the requirement, let's highlight that.




[GitHub] spark pull request: [SPARK-14957][Yarn] Adopt healthy dir to store...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12735#issuecomment-215089608
  
    **[Test build #57126 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57126/consoleFull)** for PR 12735 at commit [`2527391`](https://github.com/apache/spark/commit/2527391378e3f572663211f9ecc20a3a49c52ebb).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-14957][Yarn] Adopt healthy dir to store...

Posted by tgravescs <gi...@git.apache.org>.
Github user tgravescs commented on the pull request:

    https://github.com/apache/spark/pull/12735#issuecomment-215431514
  
    If the disk is bad or missing there is nothing else you can do than create a new db, since, as you say, deleting wouldn't work.

    Note I think all it does is log a message because we didn't want it to kill the entire nodemanager, but I think we should change that. We should throw an exception if registration doesn't work, because the way the nodemanager currently works, it doesn't fail until someone goes to fetch the data. If it failed quickly when the executor registered with it, that would be preferable, but that is a YARN bug.

    If you are going to look at the getRecoveryPath api, I think we can do it without reflection by defining our own setRecoveryPath function in YarnShuffleService (leave the override annotation off so it works with older versions of hadoop). Have it default to some invalid value; on hadoop 2.5 or greater it will get called and set to a real value. Then in our code we can check whether it's set and use it if so; if not, fall back to the current implementation. Note that setRecoveryPath is the only one we really need to define, since getRecoveryPath is protected, but to be safe we might also implement that. We can store our own path.
    The only other thing here is that we may want to handle upgrading. If you are currently running the shuffle service, the ldb will be in the local dirs, but after you upgrade it will go to a new path and the service wouldn't find the old one. To handle this we could just look for it in the new path first; if it isn't there, look in the old locations, and if found, move it to the new location.

    I think between the throw and using the new api we shouldn't need to check the disks. The recovery path that is supposed to be set by administrators should be something very resilient, such that if it's down nothing on the node would work.
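A minimal, hypothetical sketch of the no-reflection approach described above. The class and method names mirror the discussion, but this is illustrative code, not Spark's actual `YarnShuffleService`: define `setRecoveryPath` without an override annotation so it compiles against Hadoop < 2.5, and fall back to the legacy local-dirs location when it was never called.

```java
import java.io.File;

// Hypothetical sketch: setRecoveryPath defaults to "not set"; on Hadoop 2.5+
// the NodeManager would call it through AuxiliaryService, on older versions
// it is simply never invoked and we fall back to the first local dir.
public class RecoveryPathSketch {
    // null is the sentinel meaning "setRecoveryPath was never invoked".
    private String recoveryPath = null;

    // Intentionally no @Override: on pre-2.5 Hadoop this is just an extra
    // method, so the class still compiles and loads.
    public void setRecoveryPath(String path) {
        this.recoveryPath = path;
    }

    // Resolve where the registered-executor leveldb should live.
    public File registeredExecutorFile(String[] localDirs) {
        if (recoveryPath != null) {
            return new File(recoveryPath, "registeredExecutors.ldb");
        }
        // Fallback: the current behavior, i.e. the first configured local dir.
        return new File(localDirs[0], "registeredExecutors.ldb");
    }
}
```

The upgrade handling mentioned above (check the new path first, then the old locations, and migrate if found) would layer on top of this resolution step.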




[GitHub] spark pull request: [SPARK-14957][Yarn] Adopt healthy dir to store...

Posted by suyanNone <gi...@git.apache.org>.
Github user suyanNone commented on the pull request:

    https://github.com/apache/spark/pull/12735#issuecomment-215294953
  
    ```
        registeredExecutorFile =
          findRegisteredExecutorFile(conf.getTrimmedStrings("yarn.nodemanager.local-dirs"));
    ```
    
    We get the local dirs from the YARN config. There can be a lot of dirs, but currently we always adopt the first one, right?
    
    First, assume no meta file exists yet: if the first disk has been removed, or has disk errors (read-only filesystem, input/output errors, no space left), ExternalShuffleBlockResolver will try to create a new leveldb file there, but that will fail and throw an IOException, and this IOException will be swallowed by YarnShuffleService:
    ```
    try {
      blockHandler = new ExternalShuffleBlockHandler(transportConf, registeredExecutorFile);
    } catch (Exception e) {
      logger.error("Failed to initialize external shuffle service", e);
    }
    ```
    So maybe we can choose a healthier disk dir for storing the meta file, to avoid an unnecessary exception.
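The "choose a healthy dir" idea could be sketched roughly as below. This is a hypothetical illustration, not the patch itself; `chooseHealthyDir` is an invented name, and a real health check would be more thorough than an existence/writability probe.

```java
import java.io.File;

// Hypothetical sketch: instead of always taking localDirs[0], probe each
// configured dir and take the first one that exists (or can be created)
// and is writable.
public class HealthyDirSketch {
    public static File chooseHealthyDir(String[] localDirs) {
        for (String dir : localDirs) {
            File d = new File(dir);
            // Healthy here means: it is (or can become) a directory we can write to.
            if ((d.isDirectory() || d.mkdirs()) && d.canWrite()) {
                return d;
            }
        }
        // No healthy dir found; the caller decides how to fail.
        return null;
    }
}
```

Note this probe would not catch every failure mode raised in the thread (e.g. a filesystem that turns read-only after the check), which is part of why the discussion moved toward the `getRecoveryPath` API instead.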
    
    
    
    
    





[GitHub] spark pull request: [SPARK-14957][Yarn] Adopt healthy dir to store...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/12735#issuecomment-215123460
  
    I don't understand the problem or your change based on your JIRA. Are you trying to test for writeability? I don't think we want to arbitrarily require certain free space. This patch looks like it deletes directories though, which seems wrong.




[GitHub] spark pull request: [SPARK-14957][Yarn] Adopt healthy dir to store...

Posted by suyanNone <gi...@git.apache.org>.
Github user suyanNone commented on the pull request:

    https://github.com/apache/spark/pull/12735#issuecomment-215297811
  
    As for why there could be multiple meta files: assume we find one, but its disk has become a read-only filesystem. Do we still use it, or choose another healthy dir and create a new one?
    
    If we choose the second option, we can't delete the old one, so we will have 2 files existing...





[GitHub] spark pull request: [SPARK-14957][Yarn] Adopt healthy dir to store...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12735#issuecomment-215085222
  
    **[Test build #57126 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57126/consoleFull)** for PR 12735 at commit [`2527391`](https://github.com/apache/spark/commit/2527391378e3f572663211f9ecc20a3a49c52ebb).




[GitHub] spark pull request: [SPARK-14957][Yarn] Adopt healthy dir to store...

Posted by tgravescs <gi...@git.apache.org>.
Github user tgravescs commented on the pull request:

    https://github.com/apache/spark/pull/12735#issuecomment-215150083
  
    thanks, this is pretty minor and wouldn't warrant changing that.




[GitHub] spark pull request: [SPARK-14957][Yarn] Adopt healthy dir to store...

Posted by suyanNone <gi...@git.apache.org>.
Github user suyanNone commented on the pull request:

    https://github.com/apache/spark/pull/12735#issuecomment-215298717
  
    To be honest, I haven't walked through the whole YARN shuffle service process; I was just fixing a problem our users reported, where they couldn't connect to the shuffle service because the leveldb was created in a non-existent folder. I will take some time to reproduce the problem and understand it better.
    
    @tgravescs I will look into the getRecoveryPath api...




[GitHub] spark pull request: [SPARK-14957][Yarn] Adopt healthy dir to store...

Posted by tgravescs <gi...@git.apache.org>.
Github user tgravescs commented on the pull request:

    https://github.com/apache/spark/pull/12735#issuecomment-215134452
  
    I agree with @srowen; can you please describe what you are trying to do in more detail? Why are you looking for multiple of the db files and deleting older ones; when would there be more than 1?  Any sort of minimum disk space would need to be configurable.
    
    Note that in hadoop 2.5 they added a getRecoveryPath() routine to the AuxiliaryService that gets a path to put the ldb.  I don't know if we decided on what hadoop version we will support in 2.x; @srowen do you remember if that was decided?




[GitHub] spark pull request: [SPARK-14957][Yarn] Adopt healthy dir to store...

Posted by suyanNone <gi...@git.apache.org>.
Github user suyanNone commented on the pull request:

    https://github.com/apache/spark/pull/12735#issuecomment-218389524
  
    @tgravescs, yes, the executor will fail fast...
    I vaguely remember an application failure caused by an unhealthy shuffle service dir, but I didn't have time to dig into it at the time, and now I can't reproduce it.
    So I will close this.

