Posted to issues@spark.apache.org by "Imran Rashid (JIRA)" <ji...@apache.org> on 2015/05/23 03:40:18 UTC

[jira] [Comment Edited] (SPARK-7308) Should there be multiple concurrent attempts for one stage?

    [ https://issues.apache.org/jira/browse/SPARK-7308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14557090#comment-14557090 ] 

Imran Rashid edited comment on SPARK-7308 at 5/23/15 1:40 AM:
--------------------------------------------------------------

A couple of clarifications based on some offline discussions I've had:

a) There are actually two different "types" of concurrent stage attempts:
i. When you get a fetch failure, the current attempt is marked as a "zombie", and eventually a new, non-zombie attempt is started for that stage.  However, tasks from the zombie attempt may keep running for a while, so you can have tasks from the zombie attempt and the non-zombie attempt running at the same time.
ii.  With multiple fetch failures from one attempt, you can end up with multiple *non*-zombie attempts for one stage.  This is the problem I originally opened this JIRA for, and it's issue (1) in the doc.  I can reproduce it with a simpler example, without triggering the other problems.  However, in those simpler examples, jobs somehow still succeed (with the right answer, too).  I could only get true failures from Spark when I tried more complicated workloads, which also triggered all of the other problems.

Issues (2), (3), & (4) are all from the multiple attempts that come from (i) -- that is, a zombie and non-zombie together.

b) Marcelo pointed out that there is another possible solution to this problem -- different stage attempts should write to different shuffle map outputs.  That would eliminate all the problems of multiple attempts corrupting each other's output.  But it does complicate the shuffle logic, because for each map output you now need to know which attempt's output to look for.
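
Roughly what that could look like (hypothetical types, not Spark's actual ShuffleBlockId / MapOutputTracker API): the attempt id becomes part of the map-output key, so a zombie attempt can no longer clobber output a newer attempt has registered, but the read side now has to be told which attempt to fetch:

{code}
import scala.collection.mutable

// Hypothetical sketch: key map outputs by attempt as well as shuffle/map id.
case class MapOutputKey(shuffleId: Int, mapId: Int, stageAttemptId: Int)
case class MapStatusLite(location: String, sizeBytes: Long)

class AttemptAwareOutputs {
  private val outputs = mutable.Map.empty[MapOutputKey, MapStatusLite]

  // Each attempt registers under its own key, so attempts never overwrite
  // each other's map outputs.
  def register(key: MapOutputKey, status: MapStatusLite): Unit =
    outputs(key) = status

  // The extra complication: reducers must know which attempt's output to ask for.
  def lookup(shuffleId: Int, mapId: Int, stageAttemptId: Int): Option[MapStatusLite] =
    outputs.get(MapOutputKey(shuffleId, mapId, stageAttemptId))
}
{code}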

I'll update the doc later to make this clearer and to include the alternative suggestion, but I wanted to mention it briefly here in case anyone is confused by the current doc.



> Should there be multiple concurrent attempts for one stage?
> -----------------------------------------------------------
>
>                 Key: SPARK-7308
>                 URL: https://issues.apache.org/jira/browse/SPARK-7308
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.3.1
>            Reporter: Imran Rashid
>            Assignee: Imran Rashid
>         Attachments: SPARK-7308_discussion.pdf
>
>
> Currently, when there is a fetch failure, you can end up with multiple concurrent attempts for the same stage.  Is this intended?  At best, it leads to some very confusing behavior, and it makes it hard for the user to make sense of what is going on.  At worst, I think this is the cause of some very strange errors we've seen from users, where stages start executing before all the dependent stages have completed.
> This can happen in the following scenario: there is a fetch failure in attempt 0, so the stage is retried and attempt 1 starts.  But tasks from attempt 0 are still running -- some of them can also hit fetch failures after attempt 1 starts.  That will cause additional stage attempts to be fired up.
> There is already an attempt to handle this: https://github.com/apache/spark/blob/16860327286bc08b4e2283d51b4c8fe024ba5006/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1105
> However, that only checks whether the **stage** is running.  It really should check whether that **attempt** is still running, but there isn't enough info to do that.
> I'll also post some info on how to reproduce this.
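
For what it's worth, a simplified illustration of the gap the description above points at (hypothetical names, not the actual DAGScheduler code at the link): the existing guard is keyed only by stage, while the failure really needs to be checked against the attempt that reported it, which requires tracking the attempt id as well:

{code}
import scala.collection.mutable

object FetchFailureGuardSketch {
  // Roughly today's guard: "is this stage still running?"
  val runningStages = mutable.Set.empty[Int]            // stageIds
  def alreadyHandledByStage(stageId: Int): Boolean =
    !runningStages.contains(stageId)

  // What the description asks for: "is the attempt that reported this failure
  // still the live attempt for the stage?" -- which needs the attempt id too.
  val liveAttempt = mutable.Map.empty[Int, Int]         // stageId -> live attemptId
  def alreadyHandledByAttempt(stageId: Int, reportingAttemptId: Int): Boolean =
    !liveAttempt.get(stageId).contains(reportingAttemptId)
}
{code}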


