Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2019/07/31 06:58:16 UTC

[GitHub] [spark] advancedxy commented on issue #24142: [SPARK-27194][core] Job failures when task attempts do not clean up spark-staging parquet files

URL: https://github.com/apache/spark/pull/24142#issuecomment-516723747
 
 
   Hi @vanzin, @ajithme and @cloud-fan, I am also interested in this problem.
   I'd like to propose a fix here (using a partitioned table as an example; the non-partitioned case is similar) and would like to know your thoughts. A rough Scala sketch of the idea follows the list.
   1. `HadoopMapReduceCommitProtocol.newTaskTempFile` now returns `/$stageDir/$partitionSpec-$taskAttemptNum/$fileName`.
   2. `commitTask` should move the files under `/$stageDir/$partitionSpec-$taskAttemptNum` to `/$stageDir/$partitionSpec`. Due to a host going down or executor preemption, there may already be files under `/$stageDir/$partitionSpec` that were moved by another task attempt but never committed. Since we assume the output of a task is idempotent, we can simply skip moving files that already exist. Once a task is committed, all the files output by that task will have been moved to the corresponding partition dirs.
   3. `commitJob` stays the same: just move `/$stageDir/$partitionSpec` to the final dir.
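   
   To make the idea concrete, here is a minimal, hedged sketch of steps 1-3 against the Hadoop `FileSystem` API. All names here (`StagingCommitSketch`, its parameter lists, the layout helpers) are hypothetical illustrations and do not match Spark's real `FileCommitProtocol` signatures:
   
   ```scala
   import org.apache.hadoop.fs.{FileSystem, Path}
   
   // Hypothetical sketch only; not Spark's actual commit-protocol API.
   object StagingCommitSketch {
   
     // Step 1: each task attempt writes into an attempt-scoped directory,
     // $stageDir/$partitionSpec-$taskAttemptNum/$fileName, so concurrent
     // attempts never collide on file names.
     def newTaskTempFile(stageDir: Path, partitionSpec: String,
                         taskAttemptNum: Int, fileName: String): Path =
       new Path(stageDir, s"$partitionSpec-$taskAttemptNum/$fileName")
   
     // Step 2: commitTask moves the attempt's files into the shared
     // $stageDir/$partitionSpec directory. Files left behind by a failed,
     // uncommitted attempt are skipped, relying on task output idempotency.
     def commitTask(fs: FileSystem, stageDir: Path, partitionSpec: String,
                    taskAttemptNum: Int): Unit = {
       val attemptDir = new Path(stageDir, s"$partitionSpec-$taskAttemptNum")
       val partitionDir = new Path(stageDir, partitionSpec)
       fs.mkdirs(partitionDir)
       if (fs.exists(attemptDir)) {
         fs.listStatus(attemptDir).foreach { status =>
           val dest = new Path(partitionDir, status.getPath.getName)
           if (!fs.exists(dest)) {     // skip files an earlier attempt moved
             fs.rename(status.getPath, dest)
           }
         }
         fs.delete(attemptDir, true)   // drop the now-redundant attempt dir
       }
     }
   
     // Step 3: commitJob promotes each $stageDir/$partitionSpec to the
     // final output directory, as the current protocol already does.
     def commitJob(fs: FileSystem, stageDir: Path, finalDir: Path,
                   partitionSpecs: Seq[String]): Unit =
       partitionSpecs.foreach { spec =>
         val dest = new Path(finalDir, spec)
         fs.mkdirs(dest.getParent)     // rename needs an existing parent
         fs.rename(new Path(stageDir, spec), dest)
       }
   }
   ```
   
   The key point is the exists-check in `commitTask`: it makes the move idempotent, so a second attempt committing after a crashed first attempt neither duplicates nor clobbers files that were already moved into the partition dir.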
