Posted to issues@spark.apache.org by "gaoyajun02 (Jira)" <ji...@apache.org> on 2021/07/13 12:57:00 UTC

[jira] [Created] (SPARK-36121) Write data loss caused by stage retry when enable v2 FileOutputCommitter

gaoyajun02 created SPARK-36121:
----------------------------------

             Summary: Write data loss caused by stage retry when enable v2 FileOutputCommitter
                 Key: SPARK-36121
                 URL: https://issues.apache.org/jira/browse/SPARK-36121
             Project: Spark
          Issue Type: Bug
          Components: Input/Output
    Affects Versions: 3.0.1, 2.2.1
            Reporter: gaoyajun02


All of our ETL jobs are configured with mapreduce.fileoutputcommitter.algorithm.version=2. When a shuffle FetchFailed occurs, a stage retry is triggered, and the zombie stage and the retry stage may then run tasks that write the same partition at the same time. The task directory and file name used by these tasks are exactly the same, so conflicts between file writes and rename operations can cause part of the data to be lost. For example, we recently encountered the following case of data loss:

Stage 5.0 is a zombie stage caused by a shuffle FetchFailed, and stage 5.1 is its retry stage. Each of them has a task concurrently writing the same part file, part-00298.
 # The task of stage 5.1 creates the part file part-00298 first and writes its data.
 # While the task of stage 5.1 is committing, the task of stage 5.0 tries to create the same part file to write its data. Because the file already exists, it throws an exception and deletes the task's temporary directory.
 # Then stage 5.0 starts commitTask, which traverses the sub-directories and executes the renames. Because the file has already been deleted, the commit moves nothing and completes without any exception, which causes the data loss.
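
The sequence above can be reproduced on a local file system with a small simulation. This is only a sketch under stated assumptions, not Spark or Hadoop code: write_part, commit_task, simulate, and the directory name task_00298 are hypothetical names made up for illustration, and the commit is modelled as "move every file from the task directory directly into the output directory, and do nothing if the task directory is missing".

```python
import os
import shutil
import tempfile


def write_part(task_dir, name, data):
    """Create the part file exclusively; on a conflict, delete the task directory and re-raise."""
    os.makedirs(task_dir, exist_ok=True)
    path = os.path.join(task_dir, name)
    try:
        # Mode "x" fails with FileExistsError if the file is already there.
        with open(path, "x") as f:
            f.write(data)
    except FileExistsError:
        # The failed attempt cleans up a directory that it shares with the other attempt.
        shutil.rmtree(task_dir, ignore_errors=True)
        raise


def commit_task(task_dir, output_dir):
    """Move every file in task_dir into output_dir and return how many files were moved."""
    if not os.path.isdir(task_dir):
        # Nothing to move: the commit returns normally instead of failing.
        return 0
    moved = 0
    for name in os.listdir(task_dir):
        os.replace(os.path.join(task_dir, name), os.path.join(output_dir, name))
        moved += 1
    return moved


def simulate():
    root = tempfile.mkdtemp()
    output_dir = os.path.join(root, "output")
    # Assumption: both attempts resolve to the same temporary task directory.
    task_dir = os.path.join(output_dir, "_temporary", "task_00298")
    os.makedirs(output_dir)

    # Step 1: the first attempt creates part-00298 and writes its data.
    write_part(task_dir, "part-00298", "rows")

    # Step 2: the second attempt hits the existing file, fails, and deletes the directory.
    try:
        write_part(task_dir, "part-00298", "rows")
        conflict = False
    except FileExistsError:
        conflict = True

    # Step 3: the commit runs, finds nothing to rename, and raises no error.
    moved = commit_task(task_dir, output_dir)
    committed = os.path.exists(os.path.join(output_dir, "part-00298"))

    shutil.rmtree(root)
    return conflict, moved, committed


if __name__ == "__main__":
    print(simulate())
```

In this model the conflict is detected by the failing attempt, but the commit still succeeds with zero files moved, so the final output directory contains no part-00298 and nothing reports the loss.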


