Posted to issues@spark.apache.org by "feiwang (Jira)" <ji...@apache.org> on 2019/10/18 02:17:00 UTC

[jira] [Updated] (SPARK-29037) [Core] Spark gives duplicate result when an application was killed and rerun

     [ https://issues.apache.org/jira/browse/SPARK-29037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

feiwang updated SPARK-29037:
----------------------------
    Description: 
This issue affects InsertIntoHadoopFsRelation operations.

Case A:
Application appA runs insert overwrite on table table_a with static partition overwrite.
It is killed while committing tasks, because one task hangs.
Part of its committed task output is left under /path/table_a/_temporary/0/.

Then we rerun appA. It reuses the staging dir /path/table_a/_temporary/0/.
It executes successfully.
But it also commits the data left behind by the killed run to the destination dir, so the result is duplicated.

Case B:

Application appA runs insert overwrite on table table_a.

Application appB runs insert overwrite on table table_a, too.

They execute concurrently, and both may use /path/table_a/_temporary/0/ as their work path.

Their results may be corrupted.
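Both cases come down to the same root cause: the staging path contains nothing unique to the application. A minimal sketch of the path layout, assuming the Hadoop FileOutputCommitter v1 convention (jobAttemptPath is a hypothetical helper for illustration, not Spark's actual code):

{code:scala}
object StagingPathSketch {
  // FileOutputCommitter (algorithm v1) stages job output under
  //   <output>/_temporary/<appAttemptId>
  // and the app attempt id defaults to 0, so every run that writes
  // to the same table stages under the same directory.
  def jobAttemptPath(outputDir: String, appAttemptId: Int = 0): String =
    s"$outputDir/_temporary/$appAttemptId"

  def main(args: Array[String]): Unit = {
    val killedRun = jobAttemptPath("/path/table_a") // first (killed) run
    val rerun     = jobAttemptPath("/path/table_a") // rerun, or a concurrent app
    println(killedRun)          // /path/table_a/_temporary/0
    println(killedRun == rerun) // true -- leftover output will be committed again
  }
}
{code}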

  was:
When we insert overwrite a partition of a table, the tasks of the writing stage commit output in two phases: each task first saves its output to a staging dir, and when the task completes it moves that output to committedTaskPath; when all tasks of the stage succeed, all task output under committedTaskPath is moved to the destination dir.

However, when we kill an application while it is committing tasks' output, part of the task results is kept under committedTaskPath and is not cleaned up.

Then we rerun this application, and the new application reuses this committedTaskPath dir.

And when the task commit stage of the new application succeeds, all task output under this committedTaskPath, which contains part of the old application's task output, is moved to the destination dir, and the result is duplicated.
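A minimal sketch of that two-phase flow, using local directories in place of HDFS; commitTask and commitJob here are hypothetical stand-ins that mirror the steps, not Hadoop's actual committer API:

{code:scala}
import java.nio.file.{Files, Path, StandardCopyOption}

object CommitFlowSketch {
  // Phase 1: on task completion, the task attempt dir is renamed to
  // its committedTaskPath under the shared job attempt dir.
  def commitTask(taskAttemptDir: Path, committedTaskPath: Path): Unit =
    Files.move(taskAttemptDir, committedTaskPath, StandardCopyOption.ATOMIC_MOVE)

  // Phase 2: on job success, everything under the job attempt dir is moved
  // to the destination -- including stale committed output left behind by a
  // killed run, which is how the duplicates appear.
  def commitJob(jobAttemptDir: Path, destDir: Path): Unit = {
    Files.list(jobAttemptDir).forEach { committed =>
      Files.move(committed, destDir.resolve(committed.getFileName),
        StandardCopyOption.REPLACE_EXISTING)
    }
    Files.delete(jobAttemptDir) // now empty
  }
}
{code}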




> [Core] Spark gives duplicate result when an application was killed and rerun
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-29037
>                 URL: https://issues.apache.org/jira/browse/SPARK-29037
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.1.0, 2.3.3
>            Reporter: feiwang
>            Priority: Major
>         Attachments: screenshot-1.png
>
>
> This issue affects InsertIntoHadoopFsRelation operations.
> Case A:
> Application appA runs insert overwrite on table table_a with static partition overwrite.
> It is killed while committing tasks, because one task hangs.
> Part of its committed task output is left under /path/table_a/_temporary/0/.
> Then we rerun appA. It reuses the staging dir /path/table_a/_temporary/0/.
> It executes successfully.
> But it also commits the data left behind by the killed run to the destination dir, so the result is duplicated.
> Case B:
> Application appA runs insert overwrite on table table_a.
> Application appB runs insert overwrite on table table_a, too.
> They execute concurrently, and both may use /path/table_a/_temporary/0/ as their work path.
> Their results may be corrupted.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org