Posted to issues@spark.apache.org by "Imran Rashid (JIRA)" <ji...@apache.org> on 2017/03/01 17:42:45 UTC

[jira] [Created] (SPARK-19790) OutputCommitCoordinator should not allow another task to commit after an ExecutorFailure

Imran Rashid created SPARK-19790:
------------------------------------

             Summary: OutputCommitCoordinator should not allow another task to commit after an ExecutorFailure
                 Key: SPARK-19790
                 URL: https://issues.apache.org/jira/browse/SPARK-19790
             Project: Spark
          Issue Type: Bug
          Components: Scheduler
    Affects Versions: 2.1.0
            Reporter: Imran Rashid


The OutputCommitCoordinator clears the authorized committer for a partition when the task that was allowed to commit fails.  https://github.com/apache/spark/blob/8aa560b75e6b083b2a890c52301414285ba35c3d/core/src/main/scala/org/apache/spark/scheduler/OutputCommitCoordinator.scala#L143
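
For illustration, here is a minimal, self-contained sketch of that reset-on-failure bookkeeping.  This is a simplified toy model with made-up names (SimpleCommitCoordinator, canCommit, taskFailed), not the actual Spark class:

object SimpleCommitCoordinator {
  // For each partition, the task attempt currently authorized to commit, if any.
  private val authorizedCommitter =
    scala.collection.mutable.Map.empty[Int, Option[Int]]

  // The first attempt to ask for a given partition wins the right to commit it.
  def canCommit(partition: Int, attempt: Int): Boolean = synchronized {
    authorizedCommitter.getOrElse(partition, None) match {
      case None =>
        authorizedCommitter(partition) = Some(attempt)
        true
      case Some(existing) =>
        existing == attempt
    }
  }

  // When the authorized attempt is reported as failed, the lock is cleared, so
  // a later attempt may be authorized to commit the same partition.
  def taskFailed(partition: Int, attempt: Int): Unit = synchronized {
    if (authorizedCommitter.getOrElse(partition, None).contains(attempt)) {
      authorizedCommitter(partition) = None
    }
  }
}
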
However, if a task fails because of an ExecutorFailure, we actually have no idea what the status of the task is.  The task may in fact still be running, and it may even go on to successfully commit its output.  By allowing another task to commit its output, there is a chance that multiple tasks commit, which can result in corrupt output.  This would be particularly problematic when commit() is an expensive operation, e.g. moving files on S3.
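
The interleaving that worries us, expressed against the toy model above (a hypothetical scenario; it assumes the task on the lost executor is in fact still alive and eventually finishes its commit):

object DoubleCommitScenario {
  def main(args: Array[String]): Unit = {
    // Attempt 1 asks first and is authorized to commit partition 0.
    println(SimpleCommitCoordinator.canCommit(partition = 0, attempt = 1))  // true

    // The executor running attempt 1 is reported lost.  The driver cannot tell
    // whether attempt 1 is really dead or is still running its (slow) commit.
    SimpleCommitCoordinator.taskFailed(partition = 0, attempt = 1)

    // The lock was cleared, so attempt 2 is now also authorized.
    println(SimpleCommitCoordinator.canCommit(partition = 0, attempt = 2))  // true

    // If attempt 1 was in fact still alive and its commit succeeds as well,
    // two attempts have committed partition 0 and the output may be corrupted.
  }
}
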

For other task failures, we can allow other tasks to commit.  But with an ExecutorFailure, it's not clear what the right thing to do is.  The only safe thing to do may be to fail the job.
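
One possible direction, again only sketched against the toy model (FailureReason, ExecutorLost, and the failJob callback are hypothetical stand-ins, not a proposed patch), is to branch on the failure reason and refuse to hand the commit right to another attempt after an executor loss:

object SaferOnExecutorLoss {
  sealed trait FailureReason
  case object NormalTaskFailure extends FailureReason
  case object ExecutorLost extends FailureReason

  // Sketch: only clear the commit lock when the failed attempt is known to be
  // gone.  After an executor loss the attempt may still be alive (and may still
  // commit), so the safest option may be to fail the job instead of authorizing
  // another attempt.
  def taskFailed(partition: Int, attempt: Int, reason: FailureReason,
                 failJob: String => Unit): Unit = reason match {
    case NormalTaskFailure =>
      SimpleCommitCoordinator.taskFailed(partition, attempt)
    case ExecutorLost =>
      failJob(s"Executor lost while attempt $attempt held the commit lock " +
        s"for partition $partition")
  }
}
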

This is related to SPARK-19631, and was discovered during the discussion on the PR for that issue: https://github.com/apache/spark/pull/16959#discussion_r103549134



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org