Posted to issues@spark.apache.org by "Imran Rashid (JIRA)" <ji...@apache.org> on 2015/06/04 18:15:38 UTC

[jira] [Commented] (SPARK-7308) Should there be multiple concurrent attempts for one stage?

    [ https://issues.apache.org/jira/browse/SPARK-7308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14573066#comment-14573066 ] 

Imran Rashid commented on SPARK-7308:
-------------------------------------

I'm turning this into an umbrella jira (but leaving what is here already, for archiving).  I've broken it into

* SPARK-8103 just the issues w/ the DAGScheduler
* SPARK-8029 making ShuffleMapTask safe with multiple concurrent attempts on one executor (sketched below)
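
For SPARK-8029, the core hazard is that two live attempts of the same ShuffleMapTask on one executor can publish to the same shuffle output location, so a slow "zombie" attempt can clobber output that the rest of the job already depends on.  The sketch below illustrates that hazard and one way to avoid it (write to an attempt-scoped temp file, then publish atomically, first writer wins).  The file layout, names, and publish step are made up for illustration -- this is not Spark's shuffle-writer code.

{code:scala}
// Toy illustration only -- paths and names are assumptions for this sketch.
import java.nio.file.{Files, Path, Paths, StandardCopyOption}

object ShuffleWriteSketch {
  // Both attempts of the same ShuffleMapTask produce output for the same
  // logical (shuffleId, mapId) slot, represented here by one final path.
  val finalOutput: Path = Paths.get("/tmp/shuffle_0_map_7.data")

  // Unsafe: every attempt writes directly to the shared final path, so a
  // stale attempt finishing late can silently overwrite the published output.
  def writeDirect(data: Array[Byte]): Unit = {
    Files.write(finalOutput, data)
  }

  // Safer: each attempt writes to its own temp file, then only the first
  // attempt to finish publishes it; later attempts discard their copy.
  def writeThenPublish(attemptId: Int, data: Array[Byte]): Unit = {
    val tmp = Paths.get(s"/tmp/shuffle_0_map_7.attempt-$attemptId.tmp")
    Files.write(tmp, data)
    this.synchronized {
      if (!Files.exists(finalOutput)) {
        Files.move(tmp, finalOutput, StandardCopyOption.ATOMIC_MOVE)
      } else {
        Files.deleteIfExists(tmp)
      }
    }
  }
}
{code}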

> Should there be multiple concurrent attempts for one stage?
> -----------------------------------------------------------
>
>                 Key: SPARK-7308
>                 URL: https://issues.apache.org/jira/browse/SPARK-7308
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.3.1
>            Reporter: Imran Rashid
>            Assignee: Imran Rashid
>         Attachments: SPARK-7308_discussion.pdf
>
>
> Currently, when there is a fetch failure, you can end up with multiple concurrent attempts for the same stage.  Is this intended?  At best, it leads to some very confusing behavior, and it makes it hard for the user to make sense of what is going on.  At worst, I think this is the cause of some very strange errors we've seen from users, where stages start executing before all of the stages they depend on have completed.
> This can happen in the following scenario: there is a fetch failure in attempt 0, so the stage is retried and attempt 1 starts.  But tasks from attempt 0 are still running -- some of them can also hit fetch failures after attempt 1 starts.  That will cause additional stage attempts to get fired up.
> There is already an attempt to handle this: https://github.com/apache/spark/blob/16860327286bc08b4e2283d51b4c8fe024ba5006/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1105
> However, that check only verifies whether the **stage** is running.  It really should check whether that particular **attempt** is still running, but there isn't enough info to do that (see the sketch after this description for the distinction).
> I'll also post some info on how to reproduce this.
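> To make the stage-vs-attempt distinction concrete, here is a toy sketch of a scheduler event handler.  The event fields and scheduler state are simplified stand-ins, not the real DAGScheduler internals; in particular, the sketch assumes the failure event carries a stageAttemptId, which is exactly the information described above as missing today.
> {code:scala}
> // Hypothetical, simplified scheduler state -- not Spark's DAGScheduler.
> case class FetchFailed(stageId: Int, stageAttemptId: Int)
>
> class ToyScheduler {
>   import scala.collection.mutable
>   private val runningStages = mutable.Set.empty[Int]
>   private val latestAttempt = mutable.Map.empty[Int, Int]
>
>   def onFetchFailed(ev: FetchFailed): Unit = {
>     // Roughly what the linked check does today: "is the stage still running?"
>     val stageIsRunning = runningStages.contains(ev.stageId)
>     // What this issue argues for: "did the failure come from the attempt
>     // the scheduler currently considers live?"
>     val fromLiveAttempt =
>       latestAttempt.get(ev.stageId).exists(_ == ev.stageAttemptId)
>
>     if (stageIsRunning && fromLiveAttempt) {
>       resubmit(ev.stageId)
>     }
>     // Otherwise the failure came from a stale ("zombie") attempt;
>     // resubmitting again would fire up yet another concurrent attempt.
>   }
>
>   private def resubmit(stageId: Int): Unit = {
>     latestAttempt(stageId) = latestAttempt.getOrElse(stageId, 0) + 1
>     runningStages += stageId
>   }
> }
> {code}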



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
