You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2019/10/02 16:15:39 UTC
[GitHub] [spark] viirya commented on issue #25979: [SPARK-29295][SQL] Insert overwrite to Hive external table partition should delete old data

viirya commented on issue #25979: [SPARK-29295][SQL] Insert overwrite to Hive external table partition should delete old data
URL: https://github.com/apache/spark/pull/25979#issuecomment-537568308
 
 
   As I quickly tried on Hive, Hive does not delete existing directory with "INSERT OVERWRITE", if the external partition was dropped by "DROP PARTITION" before. Hive just moves data from staging into the directory in this case.
   
   Whether it produces duplicated data, depending on the filename generated. I tested two version of Hive locally. I do not know the logic Hive produces data filename. 
   
   On Hive 2.1.0, two "INSERT OVERWRITE" produces data file with same name like 000000_0. The second "INSERT OVERWRITE" moves the file into and overwrite old file.
   
   On Hive 2.3.2, the second "INSERT OVERWRITE" causes following failure when moving file with same name:
   
   ```
   0: jdbc:hive2://localhost:10000> CREATE EXTERNAL TABLE test(id int) PARTITIONED BY (name string) STORED AS TEXTFILE LOCATION '/tmp/test';
   No rows affected (0.074 seconds)
   0: jdbc:hive2://localhost:10000> INSERT OVERWRITE TABLE test PARTITION(name='n1') SELECT 1;
   WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
   No rows affected (1.809 seconds)
   0: jdbc:hive2://localhost:10000> ALTER TABLE test DROP PARTITION(name='n1');
   No rows affected (0.175 seconds)
   0: jdbc:hive2://localhost:10000> INSERT OVERWRITE TABLE test PARTITION(name='n1') SELECT 2;
   WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
   Error: org.apache.hive.service.cli.HiveSQLException: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MoveTask. java.io.IOException: rename for src path: hdfs://namenode:8020/tmp/test/name=n1/.hive-staging_hive_2019-10-02_16-01-28_570_7296144361967433532-1/-ext-10000/000000_0 to dest path:hdfs://namenode:8020/tmp/test/name=n1/000000_0 returned false
   	at org.apache.hive.service.cli.operation.Operation.toSQLException(Operation.java:380)
   	at org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:257)
   	at org.apache.hive.service.cli.operation.SQLOperation.access$800(SQLOperation.java:91)
   	at org.apache.hive.service.cli.operation.SQLOperation$BackgroundWork$1.run(SQLOperation.java:348)
   	at java.security.AccessController.doPrivileged(Native Method)
   	at javax.security.auth.Subject.doAs(Subject.java:422)
   	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1746)
   	at org.apache.hive.service.cli.operation.SQLOperation$BackgroundWork.run(SQLOperation.java:362)
   	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
   	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
   	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
   	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
   	at java.lang.Thread.run(Thread.java:748)
   ```
   
   So, I think it is sure that Hive does not delete existing directory when "INSERT OVERWRITE" into a non-existing partition. Whether duplicated data are seen, it depends on if filenames are unique or not.
   
   Above tests are using MR as Hive execution engine. I am not sure if Hive has different output filename in different execution engine like Spark or Tez.
   
   
   
   
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org