Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2019/10/03 03:54:57 UTC

[GitHub] [spark] felixcheung commented on issue #25979: [SPARK-29295][SQL] Insert overwrite to Hive external table partition should delete old data

URL: https://github.com/apache/spark/pull/25979#issuecomment-537773541
 
 
   Yeah, this is a hard one: the behavior is obviously buggy and hard to detect, but that's how Hive is designed. I think Spark should at least log a warning so interested folks (like us) can detect this after the job has run, e.g. something along the lines of the sketch below.
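   A minimal sketch of what such a warning might look like, assuming it were hooked into Spark's Hive insertion path; the object name, method, and message here are hypothetical illustrations of the suggestion, not an actual patch:

       import org.apache.spark.internal.Logging
       import org.apache.spark.sql.catalyst.catalog.{CatalogTable, CatalogTableType}

       // Hypothetical guard: warn when INSERT OVERWRITE targets an external
       // table, since old files that are not matched by name can silently
       // survive the overwrite.
       object OverwriteWarning extends Logging {
         def maybeWarn(table: CatalogTable, overwrite: Boolean): Unit = {
           if (overwrite && table.tableType == CatalogTableType.EXTERNAL) {
             logWarning(s"INSERT OVERWRITE on external table ${table.identifier} " +
               "may leave stale files in the partition directory; verify the " +
               "partition contents after the job has run.")
           }
         }
       }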
   
   > On Hive 2.1.0, two "INSERT OVERWRITE" statements produce data files with the same name, e.g. 000000_0. The second "INSERT OVERWRITE" moves its file in and overwrites the old file.
   
   > On Hive 2.3.2, the second "INSERT OVERWRITE" causes the following failure when moving a file with the same name
   
   We can't really rely on the file names being the same for the overwrite to work; it depends on a number of things. For instance, if the original partition holds 10B rows across 1M files and is overwritten with a new partition of 1B rows across 100k files, then most of the old files (roughly 900k) are never matched by name, so they are left behind rather than overwritten. A sketch of that scenario follows.
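
   A small Spark shell sketch of that failure mode; the table name, location, and row counts are made up, and the exact data file names depend on the Hive version and writer, per the quotes above:

       // Create an external partitioned table (name and location hypothetical).
       spark.sql("""
         CREATE EXTERNAL TABLE ext_t (value STRING)
         PARTITIONED BY (p INT)
         LOCATION '/tmp/ext_t'
       """)

       // First overwrite: 10 write tasks typically produce 10 data files.
       spark.range(1000).selectExpr("cast(id AS string) AS value")
         .repartition(10).createOrReplaceTempView("src_big")
       spark.sql("INSERT OVERWRITE TABLE ext_t PARTITION (p = 1) SELECT value FROM src_big")

       // Second overwrite: only 2 write tasks, so only 2 files. If the
       // overwrite relied on matching file names instead of deleting the old
       // partition contents first, 8 of the 10 original files would survive
       // and the partition would keep serving stale rows alongside new ones.
       spark.range(100).selectExpr("cast(id AS string) AS value")
         .repartition(2).createOrReplaceTempView("src_small")
       spark.sql("INSERT OVERWRITE TABLE ext_t PARTITION (p = 1) SELECT value FROM src_small")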
