Posted to reviews@spark.apache.org by sitalkedia <gi...@git.apache.org> on 2017/03/14 22:06:15 UTC

[GitHub] spark pull request #17297: [SPARK-14649][CORE] DagScheduler should not run d...

GitHub user sitalkedia opened a pull request:

    https://github.com/apache/spark/pull/17297

    [SPARK-14649][CORE] DagScheduler should not run duplicate tasks on fe…

    ## What changes were proposed in this pull request?
    
    When a fetch failure occurs, the DAGScheduler re-launches the previous stage (to re-generate the missing output), and then re-launches all tasks in the stage with the fetch failure that had not completed when the fetch failure occurred (the DAGScheduler re-launches all of the tasks whose output data is not available, which is equivalent to the set of tasks that had not yet completed). This sometimes leads to wasteful duplicate task runs for jobs with long-running tasks.
    
    To address this issue, the following changes have been made:
    
    1. When a fetch failure happens, the TaskSetManager asks the DAGScheduler to abort all the non-running tasks in the task set. The tasks that are already running are not killed.
    2. When a task is aborted, the DAGScheduler adds it to the stage's pending task list.
    3. When the stage is resubmitted, the DAGScheduler resubmits only the pending tasks.
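
The three steps above can be sketched as follows. This is an editorial sketch with illustrative names only, not the actual DAGScheduler/TaskSetManager API:

```scala
// Simplified model of the proposed bookkeeping. All names are hypothetical;
// this is not the real Spark scheduler API.
object PendingTaskSketch {
  case class Task(id: Int, running: Boolean)

  final class StageState(tasks: Seq[Task]) {
    // Tasks that must be launched (again) when the stage is resubmitted.
    private var pending: Set[Int] = Set.empty

    // Steps 1 and 2: on a fetch failure, abort only the tasks that are not
    // currently running, and record each aborted task as pending.
    def onFetchFailure(): Unit =
      tasks.filterNot(_.running).foreach(t => pending += t.id)

    // Step 3: resubmission launches only the pending tasks, so tasks that
    // kept running (and may still succeed) are not duplicated.
    def tasksToResubmit: Set[Int] = pending
  }

  def main(args: Array[String]): Unit = {
    val stage = new StageState(Seq(Task(0, true), Task(1, false), Task(2, false)))
    stage.onFetchFailure()
    println(stage.tasksToResubmit) // the non-running tasks, ids 1 and 2
  }
}
```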
    
    
    ## How was this patch tested?
    
    Added new tests.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sitalkedia/spark avoid_duplicate_tasks_new

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17297.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #17297
    
----
commit e5429d309801bffb8ddc907fb4800efb6fb1a2fa
Author: Sital Kedia <sk...@fb.com>
Date:   2016-04-15T23:44:23Z

    [SPARK-14649][CORE] DagScheduler should not run duplicate tasks on fetch failure

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17297: [SPARK-14649][CORE] DagScheduler should not run duplicat...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17297
  
    **[Test build #74560 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74560/testReport)** for PR 17297 at commit [`279b09a`](https://github.com/apache/spark/commit/279b09a45016bccbdc7fe6512f504ffa863376b0).
     * This patch **fails Scala style tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `case class TasksAborted(stageId: Int, tasks: Seq[Task[_]]) extends DAGSchedulerEvent`




[GitHub] spark issue #17297: [SPARK-14649][CORE] DagScheduler should not run duplicat...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17297
  
    Merged build finished. Test FAILed.




[GitHub] spark issue #17297: [SPARK-14649][CORE] DagScheduler should not run duplicat...

Posted by jiangxb1987 <gi...@git.apache.org>.
Github user jiangxb1987 commented on the issue:

    https://github.com/apache/spark/pull/17297
  
    Should we temporarily close the PR and wait for the design doc to be finalized? @sitalkedia 




[GitHub] spark issue #17297: [SPARK-14649][CORE] DagScheduler should not run duplicat...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17297
  
    Build finished. Test FAILed.




[GitHub] spark issue #17297: [SPARK-14649][CORE] DagScheduler should not run duplicat...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17297
  
    Merged build finished. Test FAILed.




[GitHub] spark issue #17297: [SPARK-14649][CORE] DagScheduler should not run duplicat...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17297
  
    **[Test build #75029 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75029/testReport)** for PR 17297 at commit [`99b4069`](https://github.com/apache/spark/commit/99b4069efe929fafd1b5fc0780821fd50510abe4).
     * This patch **fails Scala style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #17297: [SPARK-14649][CORE] DagScheduler should not run duplicat...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17297
  
    Merged build finished. Test FAILed.




[GitHub] spark issue #17297: [SPARK-14649][CORE] DagScheduler should not run duplicat...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17297
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74631/
    Test FAILed.




[GitHub] spark issue #17297: [SPARK-14649][CORE] DagScheduler should not run duplicat...

Posted by sitalkedia <gi...@git.apache.org>.
Github user sitalkedia commented on the issue:

    https://github.com/apache/spark/pull/17297
  
    >> I don't think it's true that it relaunches all tasks that hadn't completed when the fetch failure occurred. It relaunches all the tasks that haven't completed by the time the stage gets resubmitted. More tasks can complete between the time of the first failure and the time the stage is resubmitted.
    
    Actually, I realized that this is not true. If you look at the code (https://github.com/sitalkedia/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1419), when the stage fails because of a fetch failure, we remove the stage from the output committer. So any task that completes between the time of the first fetch failure and the time the stage is resubmitted will be denied permission to commit its output, and the scheduler therefore re-launches all tasks in the stage with the fetch failure that had not completed when the fetch failure occurred.
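
The denial described above can be modeled roughly like this. This is an editorial sketch with illustrative names, not the real OutputCommitCoordinator API:

```scala
// Simplified model of commit authorization. Names are hypothetical; this is
// not the actual Spark OutputCommitCoordinator.
object CommitDenialSketch {
  final class OutputCommitCoordinator {
    private var authorizedStages: Set[Int] = Set.empty

    def stageStart(stage: Int): Unit = authorizedStages += stage
    // On a fetch failure, the failed stage is removed from the coordinator...
    def stageEnd(stage: Int): Unit = authorizedStages -= stage
    // ...so a straggler task of that stage asking to commit is denied.
    def canCommit(stage: Int): Boolean = authorizedStages.contains(stage)
  }

  def main(args: Array[String]): Unit = {
    val coordinator = new OutputCommitCoordinator
    coordinator.stageStart(4)
    assert(coordinator.canCommit(4))  // normal case: commit allowed
    coordinator.stageEnd(4)           // fetch failure: stage removed
    assert(!coordinator.canCommit(4)) // late completion: commit denied
    println("late task denied commit; its work must be re-run")
  }
}
```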




[GitHub] spark issue #17297: [SPARK-14649][CORE] DagScheduler should not run duplicat...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17297
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75339/
    Test PASSed.




[GitHub] spark issue #17297: [SPARK-14649][CORE] DagScheduler should not run duplicat...

Posted by sitalkedia <gi...@git.apache.org>.
Github user sitalkedia commented on the issue:

    https://github.com/apache/spark/pull/17297
  
    @kayousterhout - Sure, I will file a JIRA in the future. The latest test run failed, and I am not sure if it is the same issue - https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75151/




[GitHub] spark issue #17297: [SPARK-14649][CORE] DagScheduler should not run duplicat...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17297
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74566/
    Test FAILed.




[GitHub] spark issue #17297: [SPARK-14649][CORE] DagScheduler should not run duplicat...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17297
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75124/
    Test FAILed.




[GitHub] spark issue #17297: [SPARK-14649][CORE] DagScheduler should not run duplicat...

Posted by kayousterhout <gi...@git.apache.org>.
Github user kayousterhout commented on the issue:

    https://github.com/apache/spark/pull/17297
  
    @sitalkedia I won't have time to review this in detail for at least a few weeks, just so you know (although others may have time to review / merge it).
    
    At a very high level, I'm concerned about the amount of complexity that this adds to the scheduler code. We've recently had to deal with a number of subtle bugs with jobs hanging or Spark crashing as a result of trying to handle map output from old tasks. As a result, I'm hesitant to add more complexity -- and the associated risk of bugs that cause job failures, plus the expense of maintaining the code -- just to improve performance.
    
    At this point I'd lean towards cancelling outstanding map tasks when a fetch failure occurs (there's currently a TODO in the code to do this) to simplify these issues. This would improve performance in some ways, by freeing up slots that could be used for something else, at the expense of wasted work if the tasks have already made significant progress. But it would significantly simplify the scheduler code, which I think is worthwhile given the debugging and reviewer time that has gone into fixing subtle issues in this code path.
    
    Curious what other folks think here.




[GitHub] spark issue #17297: [SPARK-14649][CORE] DagScheduler should not run duplicat...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17297
  
    **[Test build #75287 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75287/testReport)** for PR 17297 at commit [`1e6e88a`](https://github.com/apache/spark/commit/1e6e88a37001bd2f026eff1bd8db6adb5e9bf796).




[GitHub] spark issue #17297: [SPARK-14649][CORE] DagScheduler should not run duplicat...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17297
  
    **[Test build #75030 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75030/testReport)** for PR 17297 at commit [`40a3742`](https://github.com/apache/spark/commit/40a374236b69d1c6efd9b5a91944268280b0fba8).




[GitHub] spark issue #17297: [SPARK-14649][CORE] DagScheduler should not run duplicat...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17297
  
    **[Test build #75127 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75127/testReport)** for PR 17297 at commit [`1aab715`](https://github.com/apache/spark/commit/1aab715c03b3a64c4548e0434e0ffcc7b439d47b).




[GitHub] spark issue #17297: [SPARK-14649][CORE] DagScheduler should not run duplicat...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17297
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74562/
    Test FAILed.




[GitHub] spark issue #17297: [SPARK-14649][CORE] DagScheduler should not run duplicat...

Posted by sitalkedia <gi...@git.apache.org>.
Github user sitalkedia commented on the issue:

    https://github.com/apache/spark/pull/17297
  
    cc @kayousterhout - I addressed your earlier comment about https://github.com/apache/spark/pull/12436 ignoring fetch failures from stale map output. I did this by recording an epoch for each registered map output; that way, if a task's epoch is smaller than the epoch of the map output, we can ignore the fetch failure. This also takes care of the epoch changes triggered by executor loss, where a shuffle task's map-side executor is gone, as pointed out by @mridulm.
    
    Let me know what you think of the approach.
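
The epoch check described above amounts to something like the following. This is an editorial sketch with illustrative names, not the actual MapOutputTracker API:

```scala
// Simplified model of epoch-based staleness filtering. Names are
// hypothetical; this is not the real Spark map output tracking code.
object EpochSketch {
  // Each registered map output carries the epoch at which it was registered.
  case class MapOutput(mapId: Int, epoch: Long)

  // A fetch failure reported by a task started at taskEpoch is stale -- and
  // can be ignored -- if the map output was (re-)registered at a later epoch.
  def shouldIgnoreFetchFailure(taskEpoch: Long, output: MapOutput): Boolean =
    taskEpoch < output.epoch

  def main(args: Array[String]): Unit = {
    val fresh = MapOutput(mapId = 0, epoch = 7)
    assert(shouldIgnoreFetchFailure(taskEpoch = 5, output = fresh))  // stale: ignore
    assert(!shouldIgnoreFetchFailure(taskEpoch = 7, output = fresh)) // current: act on it
  }
}
```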




[GitHub] spark issue #17297: [SPARK-14649][CORE] DagScheduler should not run duplicat...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17297
  
    Merged build finished. Test FAILed.




[GitHub] spark issue #17297: [SPARK-14649][CORE] DagScheduler should not run duplicat...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17297
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75176/
    Test FAILed.




[GitHub] spark issue #17297: [SPARK-14649][CORE] DagScheduler should not run duplicat...

Posted by sitalkedia <gi...@git.apache.org>.
Github user sitalkedia commented on the issue:

    https://github.com/apache/spark/pull/17297
  
    @kayousterhout - Both scenarios A and B that you described above are likely (it depends entirely on the nature of the job and the available cluster resources), and you are right that in scenario B this PR will not provide any benefit.
    
    I am planning a follow-up PR to improve the fetch failure handling by not failing the task at all. In that case, the reducers can simply inform the scheduler of the lost map output and continue processing the other available map outputs while the scheduler concurrently recomputes the lost output. But that will be a bigger change to the scheduler.




[GitHub] spark issue #17297: [SPARK-14649][CORE] DagScheduler should not run duplicat...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17297
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74560/
    Test FAILed.




[GitHub] spark issue #17297: [SPARK-14649][CORE] DagScheduler should not run duplicat...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17297
  
    **[Test build #75332 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75332/testReport)** for PR 17297 at commit [`bdaff12`](https://github.com/apache/spark/commit/bdaff123dd21feff72218d8163fa1a69e45f1a1e).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #17297: [SPARK-14649][CORE] DagScheduler should not run duplicat...

Posted by markhamstra <gi...@git.apache.org>.
Github user markhamstra commented on the issue:

    https://github.com/apache/spark/pull/17297
  
    Agreed. Let's establish what we want to do before trying to discuss the
    details of how we are going to do it.
    
    On Tue, Mar 28, 2017 at 8:17 AM, Imran Rashid <no...@github.com>
    wrote:
    
    > @sitalkedia <https://github.com/sitalkedia> This change is pretty
    > contentious, there are lot of questions about whether or not this is a good
    > change. I don't think discussing this here in github comments on a PR is
    > the best form. I think of PR comments as being more about code details --
    > clarity, tests, whether the implementation is correct, etc. But here we're
    > discussing whether the behavior is even desirable, as well as trying to
    > discuss this in relation to other changes. I think a better format would be
    > for you to open a jira and submit a design document (maybe a shared google
    > doc at first), where we can focus more on the desired behavior and consider
    > all the changes, even if the PRs are smaller to make them easier to review.
    >
    > I'm explicitly *not* making a judgement on whether or not this is a good
    > change. Also I do appreciate you having the code changes ready, as a POC,
    > as that can help folks consider the complexity of the change. But it seems
    > clear to me that first we need to come to a decision about the end goal.
    >
    > Also, assuming we do decide this is desirable behavior, there is also a
    > question about how we can get changes like this in without risking breaking
    > things -- I have started a thread on dev@ related to that topic in
    > general, but we should figure that for these changes in particular as well.
    >
    > @kayousterhout <https://github.com/kayousterhout> @tgravescs
    > <https://github.com/tgravescs> @markhamstra
    > <https://github.com/markhamstra> makes sense?





[GitHub] spark issue #17297: [SPARK-14649][CORE] DagScheduler should not run duplicat...

Posted by sitalkedia <gi...@git.apache.org>.
Github user sitalkedia commented on the issue:

    https://github.com/apache/spark/pull/17297
  
    @squito - Thanks, that helps a lot. I will fix the issue and submit a patch soon.




[GitHub] spark issue #17297: [SPARK-14649][CORE] DagScheduler should not run duplicat...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17297
  
    Merged build finished. Test FAILed.




[GitHub] spark issue #17297: [SPARK-14649][CORE] DagScheduler should not run duplicat...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17297
  
    Merged build finished. Test FAILed.




[GitHub] spark issue #17297: [SPARK-14649][CORE] DagScheduler should not run duplicat...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17297
  
    **[Test build #74562 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74562/testReport)** for PR 17297 at commit [`f127150`](https://github.com/apache/spark/commit/f1271506d0f5d5d037cee91cc91d42ddb14a8038).




[GitHub] spark pull request #17297: [SPARK-14649][CORE] DagScheduler should not run d...

Posted by markhamstra <gi...@git.apache.org>.
Github user markhamstra commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17297#discussion_r107018874
  
    --- Diff: core/src/main/scala/org/apache/spark/MapOutputTracker.scala ---
    @@ -378,15 +382,17 @@ private[spark] class MapOutputTrackerMaster(conf: SparkConf,
         val array = mapStatuses(shuffleId)
         array.synchronized {
           array(mapId) = status
    +      val epochs = epochForMapStatus.get(shuffleId).get
    --- End diff --
    
    ```scala
    val epochs = epochForMapStatus(shuffleId)
    ```
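    For what it's worth, the two lookups behave the same when the key is present, but the suggested apply-style form is shorter, and on a missing key its `NoSuchElementException` names the key instead of failing on a bare `None.get`. A tiny standalone illustration, using a plain mutable `HashMap` as a stand-in for the tracker's actual state:
    
    ```scala
    import scala.collection.mutable
    
    // Stand-in for the tracker's shuffleId -> per-map epochs state.
    val epochForMapStatus = mutable.HashMap[Int, Array[Long]]()
    epochForMapStatus(0) = Array(1L, 2L)
    
    // Form from the diff: Option lookup followed by .get
    val verbose = epochForMapStatus.get(0).get
    
    // Suggested form: apply() does the same lookup in one step
    val concise = epochForMapStatus(0)
    
    assert(verbose.sameElements(concise))
    ```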




[GitHub] spark issue #17297: [SPARK-14649][CORE] DagScheduler should not run duplicat...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17297
  
    **[Test build #75126 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75126/testReport)** for PR 17297 at commit [`05770b9`](https://github.com/apache/spark/commit/05770b9334002a6fd995e1ec9fa9e22edb2884d9).
     * This patch **fails Scala style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #17297: [SPARK-14649][CORE] DagScheduler should not run duplicat...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17297
  
    **[Test build #75176 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75176/testReport)** for PR 17297 at commit [`b179439`](https://github.com/apache/spark/commit/b179439ac059b8b7f8325bf883ba9ce4c7ac5136).
     * This patch **fails from timeout after a configured wait of `250m`**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #17297: [SPARK-14649][CORE] DagScheduler should not run duplicat...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17297
  
    Merged build finished. Test FAILed.




[GitHub] spark issue #17297: [SPARK-14649][CORE] DagScheduler should not run duplicat...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17297
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75029/
    Test FAILed.




[GitHub] spark issue #17297: [SPARK-14649][CORE] DagScheduler should not run duplicat...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17297
  
    **[Test build #75126 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75126/testReport)** for PR 17297 at commit [`05770b9`](https://github.com/apache/spark/commit/05770b9334002a6fd995e1ec9fa9e22edb2884d9).




[GitHub] spark issue #17297: [SPARK-14649][CORE] DagScheduler should not run duplicat...

Posted by squito <gi...@git.apache.org>.
Github user squito commented on the issue:

    https://github.com/apache/spark/pull/17297
  
    @sitalkedia This change is pretty contentious; there are a lot of questions about whether or not it is a good change.  I don't think discussing this here in GitHub comments on a PR is the best forum.  I think of PR comments as being more about code details -- clarity, tests, whether the implementation is correct, etc.  But here we're discussing whether the behavior is even desirable, as well as trying to discuss it in relation to other changes.  I think a better format would be for you to open a jira and submit a design document (maybe a shared google doc at first), where we can focus more on the desired behavior and consider all the changes together, even if the PRs are kept small to make them easier to review.
    
    I'm explicitly *not* making a judgement on whether or not this is a good change.  Also, I do appreciate you having the code changes ready as a POC, as that can help folks gauge the complexity of the change.  But it seems clear to me that first we need to come to a decision about the end goal.
    
    Also, assuming we do decide this is desirable behavior, there is a further question about how we can get changes like this in without risking breaking things -- I have started a thread on dev@ related to that topic in general, but we should figure that out for these changes in particular as well.
    
    @kayousterhout @tgravescs @markhamstra makes sense?




[GitHub] spark issue #17297: [SPARK-14649][CORE] DagScheduler should not run duplicat...

Posted by sitalkedia <gi...@git.apache.org>.
Github user sitalkedia commented on the issue:

    https://github.com/apache/spark/pull/17297
  
    @squito - I am not able to reproduce this issue locally. 
    
    The tests fail with a different issue - 
    
    ```
    None.get
    java.util.NoSuchElementException: None.get
    	at scala.None$.get(Option.scala:347)
    	at scala.None$.get(Option.scala:345)
    	at org.apache.spark.InternalAccumulatorSuite$$anonfun$1.apply$mcV$sp(InternalAccumulatorSuite.scala:43)
    	at org.apache.spark.InternalAccumulatorSuite$$anonfun$1.apply(InternalAccumulatorSuite.scala:39)
    	at org.apache.spark.InternalAccumulatorSuite$$anonfun$1.apply(InternalAccumulatorSuite.scala:39)
    	at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
    	at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
    	at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
    	at org.scalatest.Transformer.apply(Transformer.scala:22)
    	at org.scalatest.Transformer.apply(Transformer.scala:20)
    	at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
    	at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:68)
    	at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
    	at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
    	at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
    	at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
    	at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
    	at org.apache.spark.InternalAccumulatorSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(InternalAccumulatorSuite.scala:28)
    	at org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:255)
    	at org.apache.spark.InternalAccumulatorSuite.runTest(InternalAccumulatorSuite.scala:28)
    	at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
    	at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
    	at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
    	at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
    	at scala.collection.immutable.List.foreach(List.scala:381)
    	at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
    	at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
    	at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
    	at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
    	at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
    	at org.scalatest.Suite$class.run(Suite.scala:1424)
    	at org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555)
    	at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
    	at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
    	at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
    	at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212)
    	at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:31)
    	at org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:257)
    	at org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:256)
    	at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:31)
    	at org.scalatest.tools.SuiteRunner.run(SuiteRunner.scala:55)
    	at org.scalatest.tools.Runner$$anonfun$doRunRunRunDaDoRunRun$3.apply(Runner.scala:2563)
    	at org.scalatest.tools.Runner$$anonfun$doRunRunRunDaDoRunRun$3.apply(Runner.scala:2557)
    	at scala.collection.immutable.List.foreach(List.scala:381)
    	at org.scalatest.tools.Runner$.doRunRunRunDaDoRunRun(Runner.scala:2557)
    	at org.scalatest.tools.Runner$$anonfun$runOptionallyWithPassFailReporter$2.apply(Runner.scala:1044)
    	at org.scalatest.tools.Runner$$anonfun$runOptionallyWithPassFailReporter$2.apply(Runner.scala:1043)
    	at org.scalatest.tools.Runner$.withClassLoaderAndDispatchReporter(Runner.scala:2722)
    	at org.scalatest.tools.Runner$.runOptionallyWithPassFailReporter(Runner.scala:1043)
    	at org.scalatest.tools.Runner$.run(Runner.scala:883)
    	at org.scalatest.tools.Runner.run(Runner.scala)
    	at org.jetbrains.plugins.scala.testingSupport.scalaTest.ScalaTestRunner.runScalaTest2(ScalaTestRunner.java:138)
    	at org.jetbrains.plugins.scala.testingSupport.scalaTest.ScalaTestRunner.main(ScalaTestRunner.java:28)
    	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    	at java.lang.reflect.Method.invoke(Method.java:497)
    	at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)
    ```
    
    Please note that all `InternalAccumulatorSuite` tests fail on my laptop. 
    In the Jenkins log, do you see any other test cases hitting a java.lang.ArrayIndexOutOfBoundsException from `MapOutputTrackerMaster`?




[GitHub] spark issue #17297: [SPARK-14649][CORE] DagScheduler should not run duplicat...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17297
  
    **[Test build #74562 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74562/testReport)** for PR 17297 at commit [`f127150`](https://github.com/apache/spark/commit/f1271506d0f5d5d037cee91cc91d42ddb14a8038).
     * This patch **fails from timeout after a configured wait of `250m`**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `case class TasksAborted(stageId: Int, tasks: Seq[Task[_]]) extends DAGSchedulerEvent`




[GitHub] spark issue #17297: [SPARK-14649][CORE] DagScheduler should not run duplicat...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17297
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75151/
    Test FAILed.




[GitHub] spark issue #17297: [SPARK-14649][CORE] DagScheduler should not run duplicat...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17297
  
    Merged build finished. Test FAILed.




[GitHub] spark issue #17297: [SPARK-14649][CORE] DagScheduler should not run duplicat...

Posted by sitalkedia <gi...@git.apache.org>.
Github user sitalkedia commented on the issue:

    https://github.com/apache/spark/pull/17297
  
    Thanks a lot @squito for taking a look at it and for your feedback. 
    
    >> this is already true. when there is a fetch failure, the TaskSetManager is marked as zombie, and the DAGScheduler resubmits stages, but nothing actively kills running tasks.
    
    That is true, but currently the DAG scheduler has no idea which tasks are running and which are being aborted. With this change, the TaskSetManager informs the DAG scheduler about currently running/aborted tasks, so that it can avoid resubmitting duplicates.
    
    >> I don't think it's true that it relaunches all tasks that hadn't completed when the fetch failure occurred. It relaunches all the tasks that haven't completed by the time the stage gets resubmitted. More tasks can complete between the time of the first failure and the time the stage is resubmitted.
    
    Yes that's true. I will update the PR description.
    
    
    >> So I think in (b) and (c), you are trying to avoid resubmitting tasks 3-9 on stage 1 attempt 1. The thing is, there is a strong reason to believe that the original version of those tasks will fail. Most likely, those tasks need map output from the same executor that caused the first fetch failure. So Kay is suggesting that we take the opposite approach, and instead actively kill the tasks from stage 1 attempt 0. OTOH, it's possible that (i) the issue may have been transient or (ii) the tasks already finished fetching that data before the error occurred. We really have no idea.
    
    In our case, we are observing that any transient issue on the shuffle service might cause a few tasks to fail, while other reducers never see the fetch failure because they either already fetched the data from that shuffle service or have yet to fetch it.  Killing all the reducers in those cases wastes a lot of work, and, as I mentioned above, we might end up in a state where jobs make no progress at all under frequent fetch failures, because they just flip-flop between two stages.
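    To make the described bookkeeping concrete, here is a rough sketch of the flow, using simplified stand-in types (the patch's actual `TasksAborted(stageId: Int, tasks: Seq[Task[_]])` event from the test report, and the real scheduler classes, are more involved; task ids stand in for `Task[_]` objects here):
    
    ```scala
    import scala.collection.mutable
    
    // Simplified stand-ins for the real DAGScheduler event types.
    sealed trait DAGSchedulerEvent
    case class TasksAborted(stageId: Int, taskIds: Seq[Int]) extends DAGSchedulerEvent
    
    class SketchScheduler {
      // Tasks aborted on fetch failure, pending resubmission, keyed by stage.
      private val pending =
        mutable.Map[Int, Set[Int]]().withDefaultValue(Set.empty)
    
      // The TaskSetManager reports the non-running tasks it aborted; tasks
      // that are still running are left alone and are not resubmitted.
      def handle(event: DAGSchedulerEvent): Unit = event match {
        case TasksAborted(stageId, taskIds) =>
          pending(stageId) = pending(stageId) ++ taskIds
      }
    
      // On stage resubmission, only the pending tasks are launched again.
      def tasksToResubmit(stageId: Int): Set[Int] = pending(stageId)
    }
    ```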





[GitHub] spark issue #17297: [SPARK-14649][CORE] DagScheduler should not run duplicat...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17297
  
    **[Test build #75176 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75176/testReport)** for PR 17297 at commit [`b179439`](https://github.com/apache/spark/commit/b179439ac059b8b7f8325bf883ba9ce4c7ac5136).




[GitHub] spark issue #17297: [SPARK-14649][CORE] DagScheduler should not run duplicat...

Posted by squito <gi...@git.apache.org>.
Github user squito commented on the issue:

    https://github.com/apache/spark/pull/17297
  
    @sitalkedia how are you trying to run the test?  Works fine for me on my laptop on master.  Note that the test is referencing a var which is only defined if "spark.testing" is a system property: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/TaskMetrics.scala#L199
    
    which it is in the sbt and maven builds.  (Maybe it doesn't work inside an IDE?  I'd strongly suggest using `~testOnly` with sbt for faster dev iterations if you're not already.)
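    For reference, running a single suite from a Spark checkout with the bundled sbt wrapper looks something like the following (the `~` prefix re-runs the suite whenever a source file changes):
    
    ```
    build/sbt "core/testOnly org.apache.spark.InternalAccumulatorSuite"
    build/sbt "~core/testOnly org.apache.spark.InternalAccumulatorSuite"
    ```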




[GitHub] spark issue #17297: [SPARK-14649][CORE] DagScheduler should not run duplicat...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17297
  
    **[Test build #75151 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75151/testReport)** for PR 17297 at commit [`1aab715`](https://github.com/apache/spark/commit/1aab715c03b3a64c4548e0434e0ffcc7b439d47b).




[GitHub] spark issue #17297: [SPARK-14649][CORE] DagScheduler should not run duplicat...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17297
  
    **[Test build #74566 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74566/testReport)** for PR 17297 at commit [`0bcc69a`](https://github.com/apache/spark/commit/0bcc69a7a3094ddaa8c915be1e4a198a354f8b6b).




[GitHub] spark issue #17297: [SPARK-14649][CORE] DagScheduler should not run duplicat...

Posted by squito <gi...@git.apache.org>.
Github user squito commented on the issue:

    https://github.com/apache/spark/pull/17297
  
    > when the stage fails because of a fetch failure, we remove the stage from the output committer. So any task that completes between the time of the first fetch failure and the time the stage is resubmitted will be denied permission to commit its output
    
    oh, that is a great point.  I was mostly thinking of another ShuffleMapStage, where that wouldn't matter, but if it's a result stage which needs to commit its output, you are right.
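    A toy illustration of that hazard, with a bare-bones stand-in for the real `OutputCommitCoordinator`: once the failed stage is unregistered, a straggler task from the old attempt that finishes late is refused permission to commit, so its work is lost.
    
    ```scala
    import scala.collection.mutable
    
    // Bare-bones stand-in for the coordinator's per-stage authorization state.
    class SketchCommitCoordinator {
      private val activeStages = mutable.Set[Int]()
    
      def stageStart(stageId: Int): Unit = activeStages += stageId
      // Invoked when the stage is torn down after a fetch failure.
      def stageEnd(stageId: Int): Unit = activeStages -= stageId
    
      // A task must ask before committing; after stageEnd this is denied.
      def canCommit(stageId: Int): Boolean = activeStages.contains(stageId)
    }
    
    val coord = new SketchCommitCoordinator
    coord.stageStart(1)
    coord.stageEnd(1)           // fetch failure tears the stage down
    assert(!coord.canCommit(1)) // a late-finishing task is denied the commit
    ```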




[GitHub] spark issue #17297: [SPARK-14649][CORE] DagScheduler should not run duplicat...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17297
  
    **[Test build #75339 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75339/testReport)** for PR 17297 at commit [`ace8464`](https://github.com/apache/spark/commit/ace8464a1ec34864e56fbfceaac509895dcf31d4).




[GitHub] spark issue #17297: [SPARK-14649][CORE] DagScheduler should not run duplicat...

Posted by kayousterhout <gi...@git.apache.org>.
Github user kayousterhout commented on the issue:

    https://github.com/apache/spark/pull/17297
  
    @sitalkedia they're in core/target/unit-tests.log
    
    Sometimes it's easier to move the logs to the tests (so they show up in-line), which you can do by changing core/src/test/resources/log4j.properties to log to the console instead of to a file.
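    Concretely, the suggested change to `core/src/test/resources/log4j.properties` amounts to something like this (a sketch of the relevant log4j 1.x settings, not the file's exact contents):
    
    ```
    # Route test logging to the console instead of core/target/unit-tests.log
    log4j.rootCategory=INFO, console
    log4j.appender.console=org.apache.log4j.ConsoleAppender
    log4j.appender.console.target=System.err
    log4j.appender.console.layout=org.apache.log4j.PatternLayout
    log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
    ```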




[GitHub] spark issue #17297: [SPARK-14649][CORE] DagScheduler should not run duplicat...

Posted by kayousterhout <gi...@git.apache.org>.
Github user kayousterhout commented on the issue:

    https://github.com/apache/spark/pull/17297
  
    Agree sounds good!




[GitHub] spark issue #17297: [SPARK-14649][CORE] DagScheduler should not run duplicat...

Posted by sitalkedia <gi...@git.apache.org>.
Github user sitalkedia commented on the issue:

    https://github.com/apache/spark/pull/17297
  
    @kayousterhout - I understand your concern, and I agree that canceling the running tasks is definitely a simpler approach, but it is very inefficient for large jobs where tasks can run for hours.  In our environment, where fetch failures are common, this change not only improves the performance of jobs that hit fetch failures, it also helps reliability. If we cancel all running reducers, we might end up in a state where jobs make no progress at all under frequent fetch failures, because they just flip-flop between two stages.  
    
    Comparing this approach to how Hadoop handles fetch failures: Hadoop does not fail any reducer when it detects that map output is missing. The reducers simply continue processing output from other mappers while the missing output is recomputed concurrently. This gives Hadoop a big edge over Spark for long-running jobs with multiple fetch failures. This change is one step towards making Spark robust against fetch failures; we would eventually want the Hadoop model, where no task is failed because of missing map output.
    
    Regarding the approach, please let me know if you can think of a way to reduce the complexity of this change.
    
    cc @markhamstra, @rxin, @sameeragarwal




[GitHub] spark issue #17297: [SPARK-14649][CORE] DagScheduler should not run duplicat...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17297
  
    **[Test build #74560 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74560/testReport)** for PR 17297 at commit [`279b09a`](https://github.com/apache/spark/commit/279b09a45016bccbdc7fe6512f504ffa863376b0).


[GitHub] spark issue #17297: [SPARK-14649][CORE] DagScheduler should not run duplicat...

Posted by sitalkedia <gi...@git.apache.org>.
Github user sitalkedia commented on the issue:

    https://github.com/apache/spark/pull/17297
  
    @squito - Sounds good to me. Let me compile a list of the fetch-failure pain points we are seeing, along with a design doc for better handling of these issues.


[GitHub] spark issue #17297: [SPARK-14649][CORE] DagScheduler should not run duplicat...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17297
  
    Merged build finished. Test PASSed.


[GitHub] spark issue #17297: [SPARK-14649][CORE] DagScheduler should not run duplicat...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17297
  
    **[Test build #74631 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74631/testReport)** for PR 17297 at commit [`901c9bf`](https://github.com/apache/spark/commit/901c9bf55247f0489519d976ca9729e5babbd292).


[GitHub] spark issue #17297: [SPARK-14649][CORE] DagScheduler should not run duplicat...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17297
  
    Merged build finished. Test FAILed.


[GitHub] spark issue #17297: [SPARK-14649][CORE] DagScheduler should not run duplicat...

Posted by sitalkedia <gi...@git.apache.org>.
Github user sitalkedia commented on the issue:

    https://github.com/apache/spark/pull/17297
  
    Thanks @markhamstra for the review comments, addressed. I also found an issue with my previous implementation (we did not allow task commits from old stage attempts) and fixed that as well.


[GitHub] spark pull request #17297: [SPARK-14649][CORE] DagScheduler should not run d...

Posted by markhamstra <gi...@git.apache.org>.
Github user markhamstra commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17297#discussion_r107044272
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---
    @@ -803,6 +810,16 @@ class DAGScheduler(
         stageIdToStage.get(taskSet.stageId).foreach { abortStage(_, reason, exception) }
       }
     
    +  private[scheduler] def handleTasksAborted(
    +      stageId: Int,
    +      tasks: Seq[Task[_]]): Unit = {
    +    for (stage <- stageIdToStage.get(stageId)) {
    +      for (task <- tasks) {
    +        stage.pendingPartitions -= task.partitionId
    +      }
    +    }
    --- End diff --
    
    ```scala
        for {
          stage <- stageIdToStage.get(stageId)
          task <- tasks
        } stage.pendingPartitions -= task.partitionId
    ```
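
A self-contained, runnable version of the suggested for-comprehension, with `Stage` reduced to a plain mutable set of pending partition ids (these simplified types are illustrative, not the real scheduler classes):

```scala
import scala.collection.mutable

// Equivalent of the nested loops above: iterate over the optional stage and
// the tasks in a single for-comprehension, removing each partition id from
// the stage's pending set.
object ForCompDemo {
  def removePending(
      stageOpt: Option[mutable.Set[Int]],
      partitionIds: Seq[Int]): Unit =
    for {
      pending <- stageOpt
      pid <- partitionIds
    } pending -= pid
}
```

Both forms desugar to the same nested `foreach` calls, so this is purely a readability choice.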


[GitHub] spark issue #17297: [SPARK-14649][CORE] DagScheduler should not run duplicat...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17297
  
    Merged build finished. Test FAILed.


[GitHub] spark issue #17297: [SPARK-14649][CORE] DagScheduler should not run duplicat...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17297
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75127/
    Test FAILed.


[GitHub] spark issue #17297: [SPARK-14649][CORE] DagScheduler should not run duplicat...

Posted by kayousterhout <gi...@git.apache.org>.
Github user kayousterhout commented on the issue:

    https://github.com/apache/spark/pull/17297
  
    To recap the issue that Imran and I discussed here, I think it can be summarized as follows:
    
    - A Fetch Failure happens at some time t and indicates that the map output on machine M has been lost
    - Consider some running task that has read x map outputs and still needs to process y map outputs
    - Scenario A (PRO of this PR): If the output from M was among the x outputs already read, we should keep running the task (as this PR does), because the task has already successfully fetched the output from the failed machine. We don't do this currently, meaning we throw away that completed work.
    - Scenario B (CON of this PR): If the output from M is among the y outputs that have not yet been read, then we should cancel the task, because the task won't learn about the new location of the re-generated output of M (IIUC, there's no functionality to do this now) and so is going to fail later on.  The current code will re-run the task, which is what we should do.  This PR's code will try to re-use the old task, which means the job will take longer, because the task will fail later on and need to be re-started.
    
    If my description above is correct, then this PR is assuming that scenario A is more likely than scenario B, but it seems to me that these two scenarios are equally likely (in which case this PR provides no net benefit).  @sitalkedia what are your thoughts here / did I miss something in my description above?
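
The two scenarios boil down to a one-line predicate. The sketch below uses hypothetical names and simplified types purely to illustrate the dichotomy: a running reducer is worth keeping only if it has already fetched the output that was lost.

```scala
// Scenario A vs. Scenario B from the discussion above.
object KeepTaskSketch {
  // lostMapId is the map output lost on machine M; fetchedMapIds are the
  // x outputs the running task has already read.
  def shouldKeepRunning(fetchedMapIds: Set[Int], lostMapId: Int): Boolean =
    fetchedMapIds.contains(lostMapId) // A: already read -> keep; B: cancel
}
```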



[GitHub] spark pull request #17297: [SPARK-14649][CORE] DagScheduler should not run d...

Posted by sitalkedia <gi...@git.apache.org>.
Github user sitalkedia commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17297#discussion_r106315206
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala ---
    @@ -193,13 +193,6 @@ private[spark] class TaskSchedulerImpl private[scheduler](
           val stageTaskSets =
             taskSetsByStageIdAndAttempt.getOrElseUpdate(stage, new HashMap[Int, TaskSetManager])
           stageTaskSets(taskSet.stageAttemptId) = manager
    -      val conflictingTaskSet = stageTaskSets.exists { case (_, ts) =>
    --- End diff --
    
    Please note that this check is no longer needed because the DAGScheduler already keeps track of running tasks and does not submit duplicate tasks.


[GitHub] spark issue #17297: [SPARK-14649][CORE] DagScheduler should not run duplicat...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17297
  
    **[Test build #74566 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74566/testReport)** for PR 17297 at commit [`0bcc69a`](https://github.com/apache/spark/commit/0bcc69a7a3094ddaa8c915be1e4a198a354f8b6b).
     * This patch **fails from timeout after a configured wait of `250m`**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `case class TasksAborted(stageId: Int, tasks: Seq[Task[_]]) extends DAGSchedulerEvent`


[GitHub] spark issue #17297: [SPARK-14649][CORE] DagScheduler should not run duplicat...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17297
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75287/
    Test FAILed.


[GitHub] spark issue #17297: [SPARK-14649][CORE] DagScheduler should not run duplicat...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17297
  
    **[Test build #75029 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75029/testReport)** for PR 17297 at commit [`99b4069`](https://github.com/apache/spark/commit/99b4069efe929fafd1b5fc0780821fd50510abe4).


[GitHub] spark issue #17297: [SPARK-14649][CORE] DagScheduler should not run duplicat...

Posted by sitalkedia <gi...@git.apache.org>.
Github user sitalkedia commented on the issue:

    https://github.com/apache/spark/pull/17297
  
    @squito - I am able to reproduce the issue by running `./build/sbt "test-only org.apache.spark.InternalAccumulatorSuite"`; however, the test case logs are not printed to the console. Do you know where I can find the test logs on my laptop?
    
    Also, one weird thing: after adding the system.testing property in IntelliJ, all test cases succeed without getting stuck :/



[GitHub] spark pull request #17297: [SPARK-14649][CORE] DagScheduler should not run d...

Posted by markhamstra <gi...@git.apache.org>.
Github user markhamstra commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17297#discussion_r107018555
  
    --- Diff: core/src/main/scala/org/apache/spark/MapOutputTracker.scala ---
    @@ -378,15 +382,17 @@ private[spark] class MapOutputTrackerMaster(conf: SparkConf,
         val array = mapStatuses(shuffleId)
         array.synchronized {
           array(mapId) = status
    +      val epochs = epochForMapStatus.get(shuffleId).get
    +      epochs(mapId) = epoch
         }
       }
     
       /** Register multiple map output information for the given shuffle */
       def registerMapOutputs(shuffleId: Int, statuses: Array[MapStatus], changeEpoch: Boolean = false) {
    -    mapStatuses.put(shuffleId, statuses.clone())
         if (changeEpoch) {
           incrementEpoch()
         }
    +    mapStatuses.put(shuffleId, statuses.clone())
    --- End diff --
    
    What was the point of moving this?


[GitHub] spark issue #17297: [SPARK-14649][CORE] DagScheduler should not run duplicat...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17297
  
    **[Test build #74558 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74558/testReport)** for PR 17297 at commit [`e5429d3`](https://github.com/apache/spark/commit/e5429d309801bffb8ddc907fb4800efb6fb1a2fa).
     * This patch **fails Scala style tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `case class TasksAborted(stageId: Int, tasks: Seq[Task[_]]) extends DAGSchedulerEvent`


[GitHub] spark pull request #17297: [SPARK-14649][CORE] DagScheduler should not run d...

Posted by sitalkedia <gi...@git.apache.org>.
Github user sitalkedia commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17297#discussion_r106774683
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala ---
    @@ -193,13 +193,6 @@ private[spark] class TaskSchedulerImpl private[scheduler](
           val stageTaskSets =
             taskSetsByStageIdAndAttempt.getOrElseUpdate(stage, new HashMap[Int, TaskSetManager])
           stageTaskSets(taskSet.stageAttemptId) = manager
    -      val conflictingTaskSet = stageTaskSets.exists { case (_, ts) =>
    --- End diff --
    
    @squito - That's correct, this is checking that we should not have more than one non-zombie attempt of a stage running. But in the scenario (d) you described below, we will end up with more than one non-zombie attempt.
    
    However, my point is that there is no reason we should not allow multiple concurrent attempts of a stage to run; the only thing we need to guarantee is that those attempts run mutually exclusive sets of tasks. With this change, since the DAGScheduler already keeps track of submitted/running tasks, it can guarantee that it will not resubmit duplicate tasks for a stage.
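
The mutual-exclusion guarantee described above can be sketched as a simple check over the partition sets of concurrent attempts (types and names here are hypothetical, not scheduler internals):

```scala
// Concurrent attempts of a stage are acceptable as long as no partition
// appears in more than one attempt's task set.
object ConcurrentAttemptsSketch {
  def mutuallyExclusive(attemptPartitions: Seq[Set[Int]]): Boolean =
    // If the union is as large as the sum of sizes, no partition is shared.
    attemptPartitions.flatten.toSet.size == attemptPartitions.map(_.size).sum
}
```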


[GitHub] spark pull request #17297: [SPARK-14649][CORE] DagScheduler should not run d...

Posted by markhamstra <gi...@git.apache.org>.
Github user markhamstra commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17297#discussion_r107017201
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---
    @@ -1265,64 +1280,11 @@ class DAGScheduler(
             val failedStage = stageIdToStage(task.stageId)
             val mapStage = shuffleIdToMapStage(shuffleId)
     
    -        if (failedStage.latestInfo.attemptId != task.stageAttemptId) {
    -          logInfo(s"Ignoring fetch failure from $task as it's from $failedStage attempt" +
    -            s" ${task.stageAttemptId} and there is a more recent attempt for that stage " +
    -            s"(attempt ID ${failedStage.latestInfo.attemptId}) running")
    -        } else {
    -          // It is likely that we receive multiple FetchFailed for a single stage (because we have
    -          // multiple tasks running concurrently on different executors). In that case, it is
    -          // possible the fetch failure has already been handled by the scheduler.
    -          if (runningStages.contains(failedStage)) {
    -            logInfo(s"Marking $failedStage (${failedStage.name}) as failed " +
    -              s"due to a fetch failure from $mapStage (${mapStage.name})")
    -            markStageAsFinished(failedStage, Some(failureMessage))
    -          } else {
    -            logDebug(s"Received fetch failure from $task, but its from $failedStage which is no " +
    -              s"longer running")
    -          }
    -
    -          val shouldAbortStage =
    -            failedStage.failedOnFetchAndShouldAbort(task.stageAttemptId) ||
    -            disallowStageRetryForTest
    -
    -          if (shouldAbortStage) {
    -            val abortMessage = if (disallowStageRetryForTest) {
    -              "Fetch failure will not retry stage due to testing config"
    -            } else {
    -              s"""$failedStage (${failedStage.name})
    -                 |has failed the maximum allowable number of
    -                 |times: ${Stage.MAX_CONSECUTIVE_FETCH_FAILURES}.
    -                 |Most recent failure reason: $failureMessage""".stripMargin.replaceAll("\n", " ")
    -            }
    -            abortStage(failedStage, abortMessage, None)
    -          } else { // update failedStages and make sure a ResubmitFailedStages event is enqueued
    -            // TODO: Cancel running tasks in the failed stage -- cf. SPARK-17064
    -            val noResubmitEnqueued = !failedStages.contains(failedStage)
    -            failedStages += failedStage
    -            failedStages += mapStage
    -            if (noResubmitEnqueued) {
    -              // We expect one executor failure to trigger many FetchFailures in rapid succession,
    -              // but all of those task failures can typically be handled by a single resubmission of
    -              // the failed stage.  We avoid flooding the scheduler's event queue with resubmit
    -              // messages by checking whether a resubmit is already in the event queue for the
    -              // failed stage.  If there is already a resubmit enqueued for a different failed
    -              // stage, that event would also be sufficient to handle the current failed stage, but
    -              // producing a resubmit for each failed stage makes debugging and logging a little
    -              // simpler while not producing an overwhelming number of scheduler events.
    -              logInfo(
    -                s"Resubmitting $mapStage (${mapStage.name}) and " +
    -                s"$failedStage (${failedStage.name}) due to fetch failure"
    -              )
    -              messageScheduler.schedule(
    -                new Runnable {
    -                  override def run(): Unit = eventProcessLoop.post(ResubmitFailedStages)
    -                },
    -                DAGScheduler.RESUBMIT_TIMEOUT,
    -                TimeUnit.MILLISECONDS
    -              )
    -            }
    -          }
    +        val epochForMapOutput = mapOutputTracker.getEpochForMapOutput(shuffleId, mapId)
    +        // It is possible that the map output was regenerated by rerun of the stage and the
    +        // fetch failure is being reported for stale map output. In that case, we should just
    +        // ignore the fetch failure and relaunch the task with latest map output info.
    +        if (epochForMapOutput.nonEmpty && epochForMapOutput.get <= task.epoch) {
    --- End diff --
    
    I'd be inclined to do this without the extra binding and `get`:
    ```scala
            for(epochForMapOutput <- mapOutputTracker.getEpochForMapOutput(shuffleId, mapId) if
                epochForMapOutput <= task.epoch) {
              // Mark the map whose fetch failed as broken in the map stage
              if (mapId != -1) {
                mapStage.removeOutputLoc(mapId, bmAddress)
                mapOutputTracker.unregisterMapOutput(shuffleId, mapId, bmAddress)
              }
    
              // TODO: mark the executor as failed only if there were lots of fetch failures on it
              if (bmAddress != null) {
                handleExecutorLost(bmAddress.executorId, filesLost = true, Some(task.epoch))
              }
            }
    ```
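
Either form implements the same stale-failure guard. Reduced to a pure predicate (hypothetical names), the condition is:

```scala
// Act on a fetch failure only when the registered map output is no newer
// than the epoch the failing task was launched with. A missing entry or a
// higher epoch means the output was already regenerated, so the failure
// report is stale and should be ignored.
object StaleFetchFailureSketch {
  def shouldHandle(epochForMapOutput: Option[Long], taskEpoch: Long): Boolean =
    epochForMapOutput.exists(_ <= taskEpoch)
}
```

`Option.exists` covers both the `nonEmpty` check and the comparison in one call, which is why the for-comprehension above needs no explicit `get`.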


[GitHub] spark pull request #17297: [SPARK-14649][CORE] DagScheduler should not run d...

Posted by squito <gi...@git.apache.org>.
Github user squito commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17297#discussion_r106774285
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala ---
    @@ -193,13 +193,6 @@ private[spark] class TaskSchedulerImpl private[scheduler](
           val stageTaskSets =
             taskSetsByStageIdAndAttempt.getOrElseUpdate(stage, new HashMap[Int, TaskSetManager])
           stageTaskSets(taskSet.stageAttemptId) = manager
    -      val conflictingTaskSet = stageTaskSets.exists { case (_, ts) =>
    --- End diff --
    
    actually, that is not really the point of this check.  It's just checking whether one stage has two task sets (aka stage attempts) that are both in the "non-zombie" state.  It doesn't do any checks at all on which tasks are actually in those task sets.
    
    This is just checking an invariant which we believe to always be true, but we figure it's better to fail fast if we hit this condition rather than proceed with some inconsistent state.  This check was added because behavior gets *really* confusing when the invariant is violated, and though we think it should always hold, we've still hit cases where it doesn't.
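
The invariant described here can be sketched as a fail-fast check; the types below are simplified stand-ins, not the actual `TaskSchedulerImpl` code being discussed in the diff:

```scala
// Fail fast when a stage has more than one non-zombie (active) task set,
// rather than proceeding with inconsistent state.
final case class TaskSetState(stageAttemptId: Int, isZombie: Boolean)

object ActiveAttemptInvariant {
  def check(stageId: Int, taskSets: Seq[TaskSetState]): Unit = {
    val active = taskSets.filter(!_.isZombie).map(_.stageAttemptId)
    require(active.size <= 1,
      s"more than one active taskset for stage $stageId: ${active.mkString(",")}")
  }
}
```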


[GitHub] spark issue #17297: [SPARK-14649][CORE] DagScheduler should not run duplicat...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17297
  
    **[Test build #75124 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75124/testReport)** for PR 17297 at commit [`c0bdca6`](https://github.com/apache/spark/commit/c0bdca65691d526676a74e47c8629f1fd64add87).
     * This patch **fails from timeout after a configured wait of `250m`**.
     * This patch merges cleanly.
     * This patch adds no public classes.


[GitHub] spark issue #17297: [SPARK-14649][CORE] DagScheduler should not run duplicat...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17297
  
    **[Test build #75332 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75332/testReport)** for PR 17297 at commit [`bdaff12`](https://github.com/apache/spark/commit/bdaff123dd21feff72218d8163fa1a69e45f1a1e).


[GitHub] spark issue #17297: [SPARK-14649][CORE] DagScheduler should not run duplicat...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17297
  
    Merged build finished. Test FAILed.


[GitHub] spark issue #17297: [SPARK-14649][CORE] DagScheduler should not run duplicat...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17297
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74558/
    Test FAILed.




[GitHub] spark issue #17297: [SPARK-14649][CORE] DagScheduler should not run duplicat...

Posted by sitalkedia <gi...@git.apache.org>.
Github user sitalkedia commented on the issue:

    https://github.com/apache/spark/pull/17297
  
    @kayousterhout - It seems like the test timeout might be related to the change. But I am not able to find the culprit test case from the build log. Any idea what is wrong?




[GitHub] spark pull request #17297: [SPARK-14649][CORE] DagScheduler should not run d...

Posted by markhamstra <gi...@git.apache.org>.
Github user markhamstra commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17297#discussion_r107044660
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---
    @@ -929,12 +946,22 @@ class DAGScheduler(
         }
       }
     
    -  /** Called when stage's parents are available and we can now do its task. */
    +  /**
    +   * Called when stage's parents are available and we can now run its task.
    +   * This only submits the partitions which are missing and have not been
    +   * submitted to the lower-level scheduler for execution.
    +   */
       private def submitMissingTasks(stage: Stage, jobId: Int) {
         logDebug("submitMissingTasks(" + stage + ")")
     
    -    // First figure out the indexes of partition ids to compute.
    -    val partitionsToCompute: Seq[Int] = stage.findMissingPartitions()
    +    val missingPartitions = stage.findMissingPartitions()
    +    val partitionsToCompute =
    +      missingPartitions.filter(id => !stage.pendingPartitions.contains(id))
    --- End diff --
    
    ```scala
    missingPartitions.filterNot(stage.pendingPartitions)
    ```
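    (This works because a Scala `Set[Int]` is itself a function `Int => Boolean`, so it can be passed directly as the predicate.  A standalone sketch, plain Scala rather than Spark code:)
    
    ```scala
    // Hypothetical stand-ins for stage.findMissingPartitions() and
    // stage.pendingPartitions, just to show the filterNot idiom.
    val missingPartitions = Seq(0, 1, 2, 3, 4)
    val pendingPartitions = Set(1, 3)
    
    // Equivalent to: missingPartitions.filter(id => !pendingPartitions.contains(id))
    val partitionsToCompute = missingPartitions.filterNot(pendingPartitions)
    
    assert(partitionsToCompute == Seq(0, 2, 4))
    ```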




[GitHub] spark issue #17297: [SPARK-14649][CORE] DagScheduler should not run duplicat...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17297
  
    **[Test build #75124 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75124/testReport)** for PR 17297 at commit [`c0bdca6`](https://github.com/apache/spark/commit/c0bdca65691d526676a74e47c8629f1fd64add87).




[GitHub] spark issue #17297: [SPARK-14649][CORE] DagScheduler should not run duplicat...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17297
  
    **[Test build #75127 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75127/testReport)** for PR 17297 at commit [`1aab715`](https://github.com/apache/spark/commit/1aab715c03b3a64c4548e0434e0ffcc7b439d47b).
     * This patch **fails from timeout after a configured wait of `250m`**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #17297: [SPARK-14649][CORE] DagScheduler should not run duplicat...

Posted by tgravescs <gi...@git.apache.org>.
Github user tgravescs commented on the issue:

    https://github.com/apache/spark/pull/17297
  
    Sounds good to me.




[GitHub] spark issue #17297: [SPARK-14649][CORE] DagScheduler should not run duplicat...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17297
  
    **[Test build #75287 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75287/testReport)** for PR 17297 at commit [`1e6e88a`](https://github.com/apache/spark/commit/1e6e88a37001bd2f026eff1bd8db6adb5e9bf796).
     * This patch **fails from timeout after a configured wait of `250m`**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #17297: [SPARK-14649][CORE] DagScheduler should not run duplicat...

Posted by sitalkedia <gi...@git.apache.org>.
Github user sitalkedia commented on the issue:

    https://github.com/apache/spark/pull/17297
  
    Jenkins retest this please.




[GitHub] spark pull request #17297: [SPARK-14649][CORE] DagScheduler should not run d...

Posted by markhamstra <gi...@git.apache.org>.
Github user markhamstra commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17297#discussion_r107040190
  
    --- Diff: core/src/main/scala/org/apache/spark/MapOutputTracker.scala ---
    @@ -418,6 +424,15 @@ private[spark] class MapOutputTrackerMaster(conf: SparkConf,
         cachedSerializedStatuses.contains(shuffleId) || mapStatuses.contains(shuffleId)
       }
     
    +  /** Get the epoch for map output for a shuffle, if it is available */
    +  def getEpochForMapOutput(shuffleId: Int, mapId: Int): Option[Long] = {
    +    val arrayOpt = mapStatuses.get(shuffleId)
    +    if (arrayOpt.isDefined && arrayOpt.get != null && arrayOpt.get(mapId) != null) {
    +       return Some(epochForMapStatus.get(shuffleId).get(mapId))
    +    }
    +    None
    +  }
    --- End diff --
    
    First, `arrayOpt.get != null` isn't necessary since we don't put `null` values into `mapStatuses`. Second, `epochForMapStatus.get(shuffleId).get` is the same as `epochForMapStatus(shuffleId)`. Third, I don't like all the explicit `get`s, `null` checks, and the unnecessary non-local `return`. To my mind, this is better:
    ``` scala
      def getEpochForMapOutput(shuffleId: Int, mapId: Int): Option[Long] = {
        for {
          mapStatus <- mapStatuses.get(shuffleId).flatMap { mapStatusArray =>
            Option(mapStatusArray(mapId))
          }
        } yield epochForMapStatus(shuffleId)(mapId)
      }
    ``` 
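    For anyone following along, the suggested for-comprehension can be exercised standalone. Below is a sketch of the same lookup pattern outside Spark, where plain `HashMap`s and `String`s stand in for the tracker's `mapStatuses` / `epochForMapStatus` fields and for `MapStatus`:
    
    ```scala
    import scala.collection.mutable
    
    // shuffleId -> per-map statuses (a slot may be null if the output is gone)
    val mapStatuses = mutable.HashMap[Int, Array[String]]()
    // shuffleId -> per-map epochs
    val epochForMapStatus = mutable.HashMap[Int, Array[Long]]()
    
    def getEpochForMapOutput(shuffleId: Int, mapId: Int): Option[Long] =
      for {
        _ <- mapStatuses.get(shuffleId).flatMap { mapStatusArray =>
          Option(mapStatusArray(mapId)) // None if this slot is null
        }
      } yield epochForMapStatus(shuffleId)(mapId)
    
    mapStatuses(0) = Array("status-a", null)
    epochForMapStatus(0) = Array(7L, 8L)
    
    assert(getEpochForMapOutput(0, 0) == Some(7L)) // status present
    assert(getEpochForMapOutput(0, 1) == None)     // null status
    assert(getEpochForMapOutput(9, 0) == None)     // unknown shuffleId
    ```
    
    Note that `Option(...)` only guards against a `null` element; an out-of-range `mapId` would still throw `ArrayIndexOutOfBoundsException`.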




[GitHub] spark issue #17297: [SPARK-14649][CORE] DagScheduler should not run duplicat...

Posted by squito <gi...@git.apache.org>.
Github user squito commented on the issue:

    https://github.com/apache/spark/pull/17297
  
    btw I filed https://issues.apache.org/jira/browse/SPARK-20128 for the test timeout -- fwiw I don't think it's a problem w/ the test but a potential real issue with the metrics system, though I don't really understand how it can happen.




[GitHub] spark issue #17297: [SPARK-14649][CORE] DagScheduler should not run duplicat...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17297
  
    Merged build finished. Test FAILed.




[GitHub] spark issue #17297: [SPARK-14649][CORE] DagScheduler should not run duplicat...

Posted by squito <gi...@git.apache.org>.
Github user squito commented on the issue:

    https://github.com/apache/spark/pull/17297
  
    I'm a bit confused by the description:
    
    > 1. When a fetch failure happens, the task set manager ask the dag scheduler to abort all the non-running tasks. However, the running tasks in the task set are not killed.
    
    This is already true: when there is a fetch failure, the [TaskSetManager is marked as zombie](https://demo.fluentcode.com/source/spark/master/master/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala?squery=TaskSetManager#L755), and the DAGScheduler resubmits stages, but nothing actively kills running tasks.
    
    >  re-launches all tasks in the stage with the fetch failure that hadn't completed when the fetch failure occurred (the DAGScheduler re-lanches all of the tasks whose output data is not available -- which is equivalent to the set of tasks that hadn't yet completed).
    
    I don't think it's true that it relaunches all tasks that hadn't completed _when the fetch failure occurred_.  It relaunches all the tasks that haven't completed by the time the stage gets resubmitted.  More tasks can complete between the time of the first failure and the time the stage is resubmitted.
    
    But there are several other potential issues you may be trying to address.
    
    Say there is stage 0 and stage 1, each one has 10 tasks.  Stage 0 completes fine on the first attempt, then stage 1 starts.  Tasks 0 & 1 in stage 1 complete, but then there is a fetch failure in task 2.  Let's also say we have an abundance of cluster resources, so tasks 3 - 9 from stage 1, attempt 0 are still running.
    
    Stage 0 gets resubmitted as attempt 1, just to regenerate the map output for whatever executor had the data for the fetch failure -- perhaps it's just one task from stage 0 that needs to be resubmitted.  Now, lots of different scenarios are possible:
    
    (a) Tasks 3 - 9 from stage 1 attempt 0 all finish successfully while stage 0 attempt 1 is running.  So when stage 0 attempt 1 finishes, then stage 1 attempt 1 is submitted, just with Task 2.  If it completes successfully, we're done (no wasted work).
    
    (b) stage 0 attempt 1 finishes, before tasks 3 - 9 from stage 1 attempt 0 have finished.  So stage 1 gets submitted again as stage 1 attempt 1, with tasks 2 - 9.  So there are now two copies running for tasks 3 - 9. Maybe all the tasks from attempt 0 actually finish shortly after attempt 1 starts.  In this case, the stage is complete as soon as there is one complete attempt for each task.  But even after the stage completes successfully, all the other tasks keep running anyway.  (plenty of wasted work)
    
    (c) like (b), but shortly after stage 1 attempt 1 is submitted, we get another fetch failure in one of the old "zombie" tasks from stage 1 attempt 0.  But the [DAGScheduler realizes it already has a more recent attempt for this stage](https://demo.fluentcode.com/source/spark/master/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala?squery=DAgScheduler#L1268), so it ignores the fetch failure.  All the other tasks keep running as usual.  If there aren't any other issues, the stage completes when there is one completed attempt for each task.  (same amount of wasted work as (b)).
    
    (d) While stage 0 attempt 1 is running, we get another fetch failure from stage 1 attempt 0, say in Task 3, which has a failure from a *different executor*.  Maybe it's from a completely different host (just by chance, or there may be cluster maintenance where multiple hosts are serviced at once); or maybe it's from another executor on the same host (at least, until we do something about your other PR on unregistering all shuffle files on a host).  To be honest, I don't understand how things work in this scenario.  We [mark stage 0 as failed](https://demo.fluentcode.com/source/spark/master/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala?squery=DAgScheduler#L1303), we [unregister some shuffle output](https://demo.fluentcode.com/source/spark/master/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala?squery=DAgScheduler#L1328), and [we resubmit stage 0](https://demo.fluentcode.com/source/spark/master/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala?squery=DAgScheduler#L1319).  But stage 0 attempt 1 is still running, so I would have expected us to end up with conflicting task sets.  Whatever the real behavior is here, it seems we're at risk of having even more duplicated work for yet another attempt for stage 1.
    
    etc.
    
    So I think in (b) and (c), you are trying to avoid resubmitting tasks 3-9 on stage 1 attempt 1.  The thing is, there is a strong reason to believe that the original version of those tasks will fail.  Most likely, those tasks need map output from the same executor that caused the first fetch failure.  So Kay is suggesting that we take the opposite approach, and instead actively kill the tasks from stage 1 attempt 0.  OTOH, it's possible that (i) the issue may have been transient, or (ii) the tasks already finished fetching that data before the error occurred.  We really have no idea.




[GitHub] spark issue #17297: [SPARK-14649][CORE] DagScheduler should not run duplicat...

Posted by sitalkedia <gi...@git.apache.org>.
Github user sitalkedia commented on the issue:

    https://github.com/apache/spark/pull/17297
  
    okay, closing the PR. 




[GitHub] spark issue #17297: [SPARK-14649][CORE] DagScheduler should not run duplicat...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17297
  
    **[Test build #74631 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74631/testReport)** for PR 17297 at commit [`901c9bf`](https://github.com/apache/spark/commit/901c9bf55247f0489519d976ca9729e5babbd292).
     * This patch **fails from timeout after a configured wait of `250m`**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #17297: [SPARK-14649][CORE] DagScheduler should not run duplicat...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17297
  
    Merged build finished. Test FAILed.




[GitHub] spark issue #17297: [SPARK-14649][CORE] DagScheduler should not run duplicat...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17297
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75030/
    Test FAILed.




[GitHub] spark issue #17297: [SPARK-14649][CORE] DagScheduler should not run duplicat...

Posted by squito <gi...@git.apache.org>.
Github user squito commented on the issue:

    https://github.com/apache/spark/pull/17297
  
    @sitalkedia I took a closer look -- I think this is from "o.a.s.InternalAccumulatorSuite: 'internal accumulators in resubmitted stages'".  From the console output on jenkins, that was the last test run.  I can also log into the box and see the log4j output from the tests in 'core/target/unit-test.log', which shows that is the last test and that it ends with:
    
    ```
    17/03/24 13:44:19.537 dag-scheduler-event-loop ERROR DAGSchedulerEventProcessLoop: DAGSchedulerEventProcessLoop failed; shutting down SparkContext
    java.lang.ArrayIndexOutOfBoundsException: -1
            at org.apache.spark.MapOutputTrackerMaster$$anonfun$getEpochForMapOutput$1.apply(MapOutputTracker.scala:431)
            at org.apache.spark.MapOutputTrackerMaster$$anonfun$getEpochForMapOutput$1.apply(MapOutputTracker.scala:430)
            at scala.Option.flatMap(Option.scala:171)
            at org.apache.spark.MapOutputTrackerMaster.getEpochForMapOutput(MapOutputTracker.scala:430)
            at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1298)
            at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1731)
            at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1689)
            at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1678)
            at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
    17/03/24 13:44:19.540 dispatcher-event-loop-11 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
    17/03/24 13:44:19.546 stop-spark-context INFO MemoryStore: MemoryStore cleared
    17/03/24 13:44:19.546 stop-spark-context INFO BlockManager: BlockManager stopped
    17/03/24 13:44:19.546 stop-spark-context INFO BlockManagerMaster: BlockManagerMaster stopped
    17/03/24 13:44:19.546 dispatcher-event-loop-16 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
    17/03/24 13:44:19.547 stop-spark-context INFO SparkContext: Successfully stopped SparkContext
    17/03/24 14:02:19.934 metrics-console-reporter-1-thread-1 ERROR ScheduledReporter: RuntimeException thrown from ConsoleReporter#report. Exception was suppressed.
    java.lang.NullPointerException
            at org.apache.spark.deploy.master.MasterSource$$anon$2.getValue(MasterSource.scala:35)
            at org.apache.spark.deploy.master.MasterSource$$anon$2.getValue(MasterSource.scala:34)
            at com.codahale.metrics.ConsoleReporter.printGauge(ConsoleReporter.java:239)
    ...
    ```
    
    with those NPE's repeatedly thrown every 15 minutes.
    
    Normally that 'unit-tests.log' file is available as an archived artifact from jenkins, but I guess it doesn't show up if the test times out, so I don't think there is any way for you to get at this info yourself.  Hopefully you can reproduce and debug locally?
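    For what it's worth, the `ArrayIndexOutOfBoundsException: -1` happens because the array access inside the flatMap closure is evaluated before `Option(...)` wraps it, so `Option` only guards against a `null` element, not a bad index.  A hypothetical guard (not part of this PR) is `lift`, which makes the lookup total -- a plain-Scala sketch:
    
    ```scala
    // Minimal reproduction and a hypothetical fix (illustration only).
    val mapStatusArray: Array[String] = Array("status-a")
    
    // Option(...) evaluates the index first, so a bad mapId still throws:
    // Option(mapStatusArray(-1))  // ArrayIndexOutOfBoundsException: -1
    
    // lift returns None for any out-of-range index; flatMap(Option(_))
    // additionally maps a null element to None.
    def statusAt(mapId: Int): Option[String] =
      mapStatusArray.lift(mapId).flatMap(Option(_))
    
    assert(statusAt(0) == Some("status-a"))
    assert(statusAt(-1) == None)
    assert(statusAt(5) == None)
    ```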




[GitHub] spark issue #17297: [SPARK-14649][CORE] DagScheduler should not run duplicat...

Posted by kayousterhout <gi...@git.apache.org>.
Github user kayousterhout commented on the issue:

    https://github.com/apache/spark/pull/17297
  
    @sitalkedia can you file a JIRA in the future when you see flaky test failures?  In this case I updated an existing JIRA (https://issues.apache.org/jira/browse/SPARK-19612) but please do this next time -- otherwise these issues never get fixed.




[GitHub] spark pull request #17297: [SPARK-14649][CORE] DagScheduler should not run d...

Posted by sitalkedia <gi...@git.apache.org>.
Github user sitalkedia closed the pull request at:

    https://github.com/apache/spark/pull/17297




[GitHub] spark issue #17297: [SPARK-14649][CORE] DagScheduler should not run duplicat...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17297
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75126/
    Test FAILed.




[GitHub] spark issue #17297: [SPARK-14649][CORE] DagScheduler should not run duplicat...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17297
  
    **[Test build #74558 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74558/testReport)** for PR 17297 at commit [`e5429d3`](https://github.com/apache/spark/commit/e5429d309801bffb8ddc907fb4800efb6fb1a2fa).




[GitHub] spark issue #17297: [SPARK-14649][CORE] DagScheduler should not run duplicat...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17297
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75332/
    Test FAILed.




[GitHub] spark issue #17297: [SPARK-14649][CORE] DagScheduler should not run duplicat...

Posted by sitalkedia <gi...@git.apache.org>.
Github user sitalkedia commented on the issue:

    https://github.com/apache/spark/pull/17297
  
    @kayousterhout, @squito - Since this change needs more discussion over a design doc, I have put out a temporary change (https://github.com/apache/spark/pull/17485) to kill the running tasks in case of fetch failure.  Although this is not ideal, it would be better than the current situation.





[GitHub] spark issue #17297: [SPARK-14649][CORE] DagScheduler should not run duplicat...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17297
  
    **[Test build #75339 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75339/testReport)** for PR 17297 at commit [`ace8464`](https://github.com/apache/spark/commit/ace8464a1ec34864e56fbfceaac509895dcf31d4).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.

