Posted to issues@spark.apache.org by "Jiang Xingbo (JIRA)" <ji...@apache.org> on 2018/06/13 21:48:00 UTC

[jira] [Comment Edited] (SPARK-24552) Task attempt numbers are reused when stages are retried

    [ https://issues.apache.org/jira/browse/SPARK-24552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16511695#comment-16511695 ] 

Jiang Xingbo edited comment on SPARK-24552 at 6/13/18 9:47 PM:
---------------------------------------------------------------

IIUC, stageAttemptId + taskAttemptNumber should uniquely identify a task attempt, and that pair carries enough information to tell how many failed attempts preceded it.
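
A minimal sketch of what such an identifier could look like, assuming a TaskContext that exposes stageAttemptNumber (available in Spark 2.3+); this code is illustrative and not from the issue itself:

{code:scala}
// Hedged sketch: a task attempt key that stays unique across stage retries
// because it includes the stage attempt number, not just the task attempt number.
import org.apache.spark.TaskContext

case class TaskAttemptKey(
    stageId: Int,
    stageAttemptNumber: Int,  // increments when the stage is retried
    partitionId: Int,
    taskAttemptNumber: Int)   // restarts at 0 per stage attempt, so reused across retries

def currentKey(): TaskAttemptKey = {
  val ctx = TaskContext.get()  // non-null only inside a running task
  TaskAttemptKey(ctx.stageId(), ctx.stageAttemptNumber(),
    ctx.partitionId(), ctx.attemptNumber())
}
{code}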


was (Author: jiangxb1987):
IIUC, stageAttemptId + taskAttemptId should uniquely identify a task attempt, and that pair carries enough information to tell how many failed attempts preceded it.

> Task attempt numbers are reused when stages are retried
> -------------------------------------------------------
>
>                 Key: SPARK-24552
>                 URL: https://issues.apache.org/jira/browse/SPARK-24552
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.1.1
>            Reporter: Ryan Blue
>            Priority: Major
>
> When stages are retried due to shuffle failures, task attempt numbers are reused. This causes a correctness bug in the v2 data sources write path.
> Data sources (both the original API and v2) pass the task attempt number to writers so that writers can use it to track and clean up data from failed or speculative attempts. In the v2 API, the javadoc for DataWriterFactory's attempt number states that "Implementations can use this attempt number to distinguish writers of different task attempts."
> When two attempts of a stage use the same (partition, attempt) pair, two tasks can create the same data and attempt to commit. The commit coordinator prevents both from committing and aborts whichever attempt finishes last. When a writer uses the (partition, attempt) pair to track its data, the aborted task may delete the data associated with that pair. If that happens, the data for the task that committed is deleted as well, which is a correctness bug (see the sketch after this quoted description).
> For a concrete example, I have a data source that creates files in place named with {{part-<partition>-<attempt>-<uuid>.<format>}}. Because these files are written in place, both tasks create the same file and the one that is aborted deletes the file, leading to data corruption when the file is added to the table.
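
To make the collision concrete, below is a hedged sketch against the Spark 2.3-era DataSourceV2 write API, where DataWriterFactory.createDataWriter took a partition id and an attempt number. The in-place file layout mirrors the reporter's description; the factory name, the per-job uuid (implied by "both tasks create the same file"), and the writer internals are illustrative assumptions, not the reporter's actual source:

{code:scala}
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.sources.v2.writer.{DataWriter, DataWriterFactory, WriterCommitMessage}

// Illustrative factory: output paths are keyed on (partitionId, attemptNumber)
// plus a job-level uuid, so a retried stage that reuses attempt number 0 for a
// partition collides with the task from the previous stage attempt.
class InPlaceWriterFactory(dir: String, jobUuid: String, format: String)
    extends DataWriterFactory[InternalRow] {

  override def createDataWriter(partitionId: Int, attemptNumber: Int): DataWriter[InternalRow] = {
    val path = s"$dir/part-$partitionId-$attemptNumber-$jobUuid.$format"
    new DataWriter[InternalRow] {
      override def write(record: InternalRow): Unit = {
        // append the record to `path` -- written in place, no temp directory
      }
      override def commit(): WriterCommitMessage = {
        // the commit coordinator allows exactly one (partition, attempt) to commit
        new WriterCommitMessage {}
      }
      override def abort(): Unit = {
        // delete `path`; if a task from another stage attempt already committed
        // the same (partition, attempt) pair, this deletes committed data too
      }
    }
  }
}
{code}

With this layout, (stage attempt 0, partition 3, task attempt 0) and (stage attempt 1, partition 3, task attempt 0) resolve to the same path; the coordinator aborts whichever task finishes last, and its abort() deletes the file the other task committed.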



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org