Posted to reviews@spark.apache.org by umehrot2 <gi...@git.apache.org> on 2017/03/28 00:40:41 UTC

[GitHub] spark pull request #17445: [SPARK-20115] [CORE] Fix DAGScheduler to recomput...

GitHub user umehrot2 opened a pull request:

    https://github.com/apache/spark/pull/17445

    [SPARK-20115] [CORE] Fix DAGScheduler to recompute all the lost shuffle blocks when external shuffle service is unavailable

    ## What changes were proposed in this pull request?
    Spark's DAGScheduler currently does not recompute all the lost shuffle blocks on a host when a FetchFailed exception occurs while fetching shuffle blocks from another executor with the external shuffle service enabled. Instead, it only recomputes the lost shuffle blocks computed by the executor for which the FetchFailed exception occurred. This works fine for the internal shuffle scenario, where executors serve their own shuffle blocks and hence only the shuffle blocks for that executor should be considered lost. However, when the external shuffle service is being used, a FetchFailed exception would mean that the external shuffle service running on that host has become unavailable. This in turn is sufficient to assume that all the shuffle blocks managed by the shuffle service on that host are lost. Therefore, recomputing only the shuffle blocks associated with the particular executor for which the FetchFailed exception occurred is not sufficient. We need to recompute all the shuffle blocks managed by that service, because there could be multiple executors running on that host.
    
    Since not all the shuffle blocks (for all the executors on the host) are recomputed, future attempts of the reduce stage fail as well, because the newly scheduled tasks keep trying to reach the old location of the shuffle blocks that were not recomputed and keep throwing further FetchFailed exceptions. This ultimately causes the job to fail after the reduce stage has been retried 4 times.
    
    The following changes are proposed to address the above issue:
    1. On a FetchFailed exception when using the external shuffle service, mark all the shuffle outputs on the host as failed (due to failure of the external shuffle service).
    2. Recompute all the lost shuffle blocks on that host, instead of those of just one executor.
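
The difference between the two invalidation strategies can be illustrated with a small toy model. This is only a sketch in Python, not code from the patch (Spark's scheduler is written in Scala), and every name in it (`MapOutput`, `ToyMapOutputTracker`, `handle_fetch_failed`) is hypothetical. It assumes the driver keeps a registry of which host and executor produced each map output.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MapOutput:
    """Location of one map task's shuffle output (hypothetical model)."""
    map_id: int
    host: str
    executor_id: str


class ToyMapOutputTracker:
    """Toy stand-in for a driver-side registry of shuffle output locations."""

    def __init__(self, outputs):
        self.outputs = {o.map_id: o for o in outputs}

    def _remove(self, predicate):
        # Drop every registered output matching the predicate and
        # return the map ids that now have to be recomputed.
        lost = sorted(m for m, o in self.outputs.items() if predicate(o))
        for m in lost:
            del self.outputs[m]
        return lost

    def remove_outputs_on_executor(self, executor_id):
        """Executor-level invalidation: only the failing executor's outputs."""
        return self._remove(lambda o: o.executor_id == executor_id)

    def remove_outputs_on_host(self, host):
        """Host-level invalidation: every output served from that host."""
        return self._remove(lambda o: o.host == host)


def handle_fetch_failed(tracker, host, executor_id, external_shuffle_service):
    """Return the map ids to recompute after a fetch failure."""
    if external_shuffle_service:
        # The shuffle service serves blocks for all executors on the host,
        # so all of that host's outputs are treated as lost.
        return tracker.remove_outputs_on_host(host)
    # Without the service, each executor serves only its own blocks.
    return tracker.remove_outputs_on_executor(executor_id)


def sample_outputs():
    # Two executors share hostA; a third runs on hostB.
    return [
        MapOutput(0, "hostA", "exec1"),
        MapOutput(1, "hostA", "exec2"),
        MapOutput(2, "hostB", "exec3"),
    ]


if __name__ == "__main__":
    internal = ToyMapOutputTracker(sample_outputs())
    print(handle_fetch_failed(internal, "hostA", "exec1", False))  # [0]

    external = ToyMapOutputTracker(sample_outputs())
    print(handle_fetch_failed(external, "hostA", "exec1", True))   # [0, 1]
```

In the executor-level case, map output 1 stays registered on hostA even though nothing on that host can serve it any more, which is the stale location the description above refers to.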
    
    ## How was this patch tested?
    1. Added unit test for the change in functionality.
    2. Tested on a cluster with Spark running on YARN (with the external shuffle service enabled), by performing the following steps:
    - Start a word count job, and wait for the Map stage to be completed
    - During the reduce stage, stop the external shuffle service on a host
    - Wait for fetch failed exception to occur, while fetching shuffle blocks from the host
    - Check that in the reattempt of the Map stage, Spark computes all the lost shuffle blocks for the host on which shuffle service was stopped
    - Job completes successfully, since reduce stage in next reattempt finds all the shuffle blocks


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/umehrot2/spark spark-dagscheduler-fix

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17445.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #17445
    
----
commit 6642de3bd7215cc311c4d51284ffe1c50387cdfc
Author: Udit Mehrotra <ud...@amazon.com>
Date:   2017-02-14T23:27:15Z

    Allow Spark to recompute all the shuffle blocks on a host, if external shuffle service is unavailable on that host
    
    cr https://cr.amazon.com/r/6822886/

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17445: [SPARK-20115] [CORE] Fix DAGScheduler to recompute all t...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17445
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75513/
    Test PASSed.




[GitHub] spark issue #17445: [SPARK-20115] [CORE] Fix DAGScheduler to recompute all t...

Posted by kayousterhout <gi...@git.apache.org>.
Github user kayousterhout commented on the issue:

    https://github.com/apache/spark/pull/17445
  
    Jenkins this is OK to test




[GitHub] spark issue #17445: [SPARK-20115] [CORE] Fix DAGScheduler to recompute all t...

Posted by jiangxb1987 <gi...@git.apache.org>.
Github user jiangxb1987 commented on the issue:

    https://github.com/apache/spark/pull/17445
  
    Now that #18150 has been merged, maybe we can close this now?




[GitHub] spark issue #17445: [SPARK-20115] [CORE] Fix DAGScheduler to recompute all t...

Posted by kayousterhout <gi...@git.apache.org>.
Github user kayousterhout commented on the issue:

    https://github.com/apache/spark/pull/17445
  
    Jenkins this is ok to test




[GitHub] spark issue #17445: [SPARK-20115] [CORE] Fix DAGScheduler to recompute all t...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17445
  
    **[Test build #75513 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75513/testReport)** for PR 17445 at commit [`6642de3`](https://github.com/apache/spark/commit/6642de3bd7215cc311c4d51284ffe1c50387cdfc).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #17445: [SPARK-20115] [CORE] Fix DAGScheduler to recompute all t...

Posted by tgravescs <gi...@git.apache.org>.
Github user tgravescs commented on the issue:

    https://github.com/apache/spark/pull/17445
  
    There is a large discussion about how to handle fetch failures going on in https://issues.apache.org/jira/browse/SPARK-20178. The fact that you got a fetch failure does not mean that all blocks are invalid or that the external shuffle service is totally down. It could very well be an intermittent thing as well. There was also a PR to make the stage attempts configurable so you could increase that.
    
    If a lot of people are seeing this issue, the question is whether we need to do something shorter term to handle this while we are discussing SPARK-20178. Certainly if we are seeing more actual job failures due to it, it would be better to invalidate all the output: the job possibly runs longer, but at least it doesn't fail.




[GitHub] spark pull request #17445: [SPARK-20115] [CORE] Fix DAGScheduler to recomput...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/17445




[GitHub] spark issue #17445: [SPARK-20115] [CORE] Fix DAGScheduler to recompute all t...

Posted by umehrot2 <gi...@git.apache.org>.
Github user umehrot2 commented on the issue:

    https://github.com/apache/spark/pull/17445
  
    Jenkins test this please.




[GitHub] spark issue #17445: [SPARK-20115] [CORE] Fix DAGScheduler to recompute all t...

Posted by umehrot2 <gi...@git.apache.org>.
Github user umehrot2 commented on the issue:

    https://github.com/apache/spark/pull/17445
  
    @kayousterhout Thanks for your response, and for that link. Well it does seem like #17088 addresses the same issue as this PR.
    
    However, I would like you all to review this PR as well, because I think it more clearly organizes the code between handling of internal and external shuffle failures. It also removes a lot of the code duplication which is part of the other PR. Further, it adds an epoch check for the 'host'.
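
As a rough illustration of what an epoch check keyed by host could look like, here is a minimal Python sketch. It is not code from either PR, and the class and method names are hypothetical. It assumes each task carries the epoch at which it was launched, and that a failure report is ignored when the host's outputs were already invalidated at that epoch or a later one.

```python
class ToyHostFailureTracker:
    """Hypothetical epoch guard for host-level shuffle output invalidation."""

    def __init__(self):
        self.epoch = 0          # bumped whenever outputs are invalidated
        self.failed_epoch = {}  # host -> task epoch of the last handled failure

    def report_fetch_failure(self, host, task_epoch):
        """Return True if this report should trigger invalidation for host."""
        if self.failed_epoch.get(host, -1) >= task_epoch:
            # A failure from the same or a newer epoch was already handled,
            # so this report is stale and must not invalidate fresh outputs.
            return False
        self.failed_epoch[host] = task_epoch
        self.epoch += 1  # tasks launched after this see a newer epoch
        return True


if __name__ == "__main__":
    tracker = ToyHostFailureTracker()
    print(tracker.report_fetch_failure("hostA", 0))  # True: first report
    print(tracker.report_fetch_failure("hostA", 0))  # False: stale duplicate
    print(tracker.report_fetch_failure("hostA", 1))  # True: newer task failed
```

The point of such a guard is that several reduce tasks from the same stage attempt typically report the same failure, and only the first report should cause outputs to be dropped.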




[GitHub] spark issue #17445: [SPARK-20115] [CORE] Fix DAGScheduler to recompute all t...

Posted by umehrot2 <gi...@git.apache.org>.
Github user umehrot2 commented on the issue:

    https://github.com/apache/spark/pull/17445
  
    @kayousterhout @mridulm @rxin @lins05 @markhamstra @tgravescs @squito Can you take a look at this?




[GitHub] spark issue #17445: [SPARK-20115] [CORE] Fix DAGScheduler to recompute all t...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17445
  
    **[Test build #75513 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75513/testReport)** for PR 17445 at commit [`6642de3`](https://github.com/apache/spark/commit/6642de3bd7215cc311c4d51284ffe1c50387cdfc).




[GitHub] spark issue #17445: [SPARK-20115] [CORE] Fix DAGScheduler to recompute all t...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17445
  
    Merged build finished. Test PASSed.




[GitHub] spark issue #17445: [SPARK-20115] [CORE] Fix DAGScheduler to recompute all t...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17445
  
    Can one of the admins verify this patch?




[GitHub] spark issue #17445: [SPARK-20115] [CORE] Fix DAGScheduler to recompute all t...

Posted by kayousterhout <gi...@git.apache.org>.
Github user kayousterhout commented on the issue:

    https://github.com/apache/spark/pull/17445
  
    Have you seen #17088? I just glanced at this quickly but I think this is a duplicate of that (SPARK-19753)




[GitHub] spark issue #17445: [SPARK-20115] [CORE] Fix DAGScheduler to recompute all t...

Posted by umehrot2 <gi...@git.apache.org>.
Github user umehrot2 commented on the issue:

    https://github.com/apache/spark/pull/17445
  
    @kayousterhout @mridulm @rxin @lins05 Can you take a look at this PR?

