You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Hyukjin Kwon (JIRA)" <ji...@apache.org> on 2019/05/21 04:11:09 UTC

[jira] [Resolved] (SPARK-19790) OutputCommitCoordinator should not allow another task to commit after an ExecutorFailure

     [ https://issues.apache.org/jira/browse/SPARK-19790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-19790.
----------------------------------
    Resolution: Incomplete

> OutputCommitCoordinator should not allow another task to commit after an ExecutorFailure
> ----------------------------------------------------------------------------------------
>
>                 Key: SPARK-19790
>                 URL: https://issues.apache.org/jira/browse/SPARK-19790
>             Project: Spark
>          Issue Type: Bug
>          Components: Scheduler
>    Affects Versions: 2.1.0
>            Reporter: Imran Rashid
>            Priority: Major
>              Labels: bulk-closed
>
> The OutputCommitCoordinator resets the allowed committer when the task fails.  https://github.com/apache/spark/blob/8aa560b75e6b083b2a890c52301414285ba35c3d/core/src/main/scala/org/apache/spark/scheduler/OutputCommitCoordinator.scala#L143
> However, if a task fails because of an ExecutorFailure, we actually have no idea what the status is of the task.  The task may actually still be running, and perhaps successfully commit its output.  By allowing another task to commit its output, there is a chance that multiple tasks commit, which can result in corrupt output.  This would be particularly problematic when commit() is an expensive operation, eg. moving files on S3.
> For other task failures, we can allow other tasks to commit.  But with an ExecutorFailure, its not clear what the right thing to do is.  The only safe thing to do may be to fail the job.
> This is related to SPARK-19631, and was discovered during discussion on that PR https://github.com/apache/spark/pull/16959#discussion_r103549134



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org