Posted to issues@spark.apache.org by "Jason Moore (JIRA)" <ji...@apache.org> on 2016/04/27 12:13:13 UTC

[jira] [Commented] (SPARK-14915) Tasks that fail due to CommitDeniedException (a side-effect of speculation) can cause job to never complete

    [ https://issues.apache.org/jira/browse/SPARK-14915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15259920#comment-15259920 ] 

Jason Moore commented on SPARK-14915:
-------------------------------------

Could I get thoughts on this: at [TaskSetManager.scala#L723|https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L723] a call is made to addPendingTask after a task has failed.  I can think of a scenario in which it would be better not to add the task back into the pending queue: when successful(index) == true, which implies that another copy of the task has already succeeded.
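
Concretely, here is a minimal sketch of the guard I have in mind (against branch-1.6, inside handleFailedTask, where index, successful, and addPendingTask are already in scope; only the added condition is new):

    // Only re-queue the failed attempt if no other copy of this task
    // has already succeeded; otherwise a speculative attempt that was
    // denied a commit keeps getting resubmitted.
    if (!successful(index)) {
      addPendingTask(index)
    }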

I'm going to test this out with that condition soon, as I think it's quite possibly what causes tasks to continually re-queue after a CDE until the stage has completed (further lengthening the duration of the stage, since those retries take up execution resources).

> Tasks that fail due to CommitDeniedException (a side-effect of speculation) can cause job to never complete
> -----------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-14915
>                 URL: https://issues.apache.org/jira/browse/SPARK-14915
>             Project: Spark
>          Issue Type: Bug
>    Affects Versions: 1.6.2
>            Reporter: Jason Moore
>            Priority: Critical
>
> In SPARK-14357, code was corrected towards the originally intended behavior that a CommitDeniedException should not count towards the failure count for a job.  After having run with this fix for a few weeks, it's become apparent that this behavior has an unintended consequence: a speculative task will continuously receive a CDE from the driver, causing it to fail and retry over and over without limit.
> I'm thinking we could put a task that receives a CDE from the driver into TaskState.FINISHED, or some other state that indicates the task shouldn't be resubmitted by the TaskScheduler.  I'd need some opinions on whether there are other consequences of doing something like this.
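
As a rough illustration of the idea in the description above (a sketch only: TaskCommitDenied is the TaskFailedReason the driver receives for a CommitDeniedException, while the early return inside handleFailedTask is the hypothetical part, not existing behavior):

    reason match {
      case _: TaskCommitDenied if successful(index) =>
        // Another attempt has already committed this partition, so treat
        // this attempt as done rather than resubmitting it.  (Hypothetical:
        // this early return is the proposed change.)
        logInfo(s"Ignoring commit-denied attempt of task $index; another " +
          "attempt already succeeded.")
        return
      case _ => // fall through to the normal failure handling below
    }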



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org