Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2019/10/21 03:21:49 UTC

[GitHub] [spark] turboFei commented on issue #26159: [SPARK-29506][SQL] Use dynamicPartitionOverwrite in FileCommitProtocol when insert into hive table

turboFei commented on issue #26159: [SPARK-29506][SQL] Use dynamicPartitionOverwrite in FileCommitProtocol when insert into hive table
URL: https://github.com/apache/spark/pull/26159#issuecomment-544333629
 
 
   > For hive table insertion, we insert to a fresh staging dir first. So dynamicPartitionOverwrite and normal write are logically the same IIUC. Do you mean dynamicPartitionOverwrite is better for performance?
   
   Hi @cloud-fan, I think that with dynamicPartitionOverwrite the commit protocol keeps a filesToMove set, and the number of file renames equals the number of partitions.
   https://github.com/apache/spark/blob/f4d5aa42139ff8412c573c96a1631ef3ccf81844/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L181-L183
   
   With a normal write (with the file output committer algorithm version set to 1), each task commits its output to a temporary path.
   For example:
   _temporary/0/task_attempt_1/p1=v1/p2=v2/task1.parquet
   _temporary/0/task_attempt_2/p1=v1/p2=v2/task2.parquet
   _temporary/0/task_attempt_3/p1=v1/p2=v2/task3.parquet
   
   After all tasks have completed, it invokes mergePaths to merge these outputs.
   For a partitioned table, this costs more than dynamicPartitionOverwrite.
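
A rough way to see the difference is the toy Python model below. It is only a sketch under stated assumptions, not Spark's or Hadoop's actual code: the function names are hypothetical, the dynamic-overwrite count assumes one rename per distinct partition directory, and the merge count assumes a recursive merge that moves a path in one operation when the destination does not exist and descends into it otherwise.

```python
from collections import defaultdict

# Hypothetical task outputs: (task attempt, partition directory, file name).
outputs = [
    ("task_attempt_1", "p1=v1/p2=v2", "task1.parquet"),
    ("task_attempt_2", "p1=v1/p2=v2", "task2.parquet"),
    ("task_attempt_3", "p1=v1/p2=v2", "task3.parquet"),
]


def renames_dynamic_overwrite(outputs):
    """Toy assumption: one rename per distinct partition directory."""
    return len({partition for _, partition, _ in outputs})


def _mark(children, path, dest):
    """Record a moved path and everything below it as existing in dest."""
    dest.add(path)
    for name, sub in children.items():
        _mark(sub, f"{path}/{name}", dest)


def _merge(tree, prefix, dest):
    """Count move operations needed to merge one task's tree into dest."""
    ops = 0
    for name, children in tree.items():
        path = f"{prefix}/{name}" if prefix else name
        if path not in dest:
            # Destination missing: a single move takes the whole subtree.
            _mark(children, path, dest)
            ops += 1
        else:
            # Destination exists: descend and merge the children one by one.
            ops += _merge(children, path, dest)
    return ops


def merge_operations(outputs):
    """Toy count of move operations when task directories are merged in order."""
    by_task = defaultdict(list)
    for task, partition, name in outputs:
        by_task[task].append(partition.split("/") + [name])

    dest = set()  # paths that already exist under the final output directory
    ops = 0
    for paths in by_task.values():
        tree = {}
        for parts in paths:
            node = tree
            for part in parts:
                node = node.setdefault(part, {})
        ops += _merge(tree, "", dest)
    return ops


print(renames_dynamic_overwrite(outputs))  # 1
print(merge_operations(outputs))           # 3
```

In this model the merge count grows with the number of tasks writing into each partition, while the dynamic-overwrite count grows only with the number of partitions. Existing files with the same name are not modeled.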
   
   However, there is a known issue with dynamicPartitionOverwrite: a task may conflict with its speculative task.
   I have created a PR, https://github.com/apache/spark/pull/26086. Could you help take a look?
   
   
   

