Posted to issues@hive.apache.org by "smallx (Jira)" <ji...@apache.org> on 2019/12/15 23:20:00 UTC

[jira] [Comment Edited] (HIVE-17063) insert overwrite partition onto a external table fail when drop partition first

    [ https://issues.apache.org/jira/browse/HIVE-17063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16996891#comment-16996891 ] 

smallx edited comment on HIVE-17063 at 12/15/19 11:19 PM:
----------------------------------------------------------

[~wanghaihua] [~djaiswal]

When the replace flag is true, we should delete all files in the target path except the source directory and hidden files, not only the files that conflict on rename. Otherwise the result may contain duplicated or otherwise unexpected data.

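We need to consider this case: Hive inserts data again and writes fewer files than before. Only the files whose names collide are replaced, so the extra files from the earlier insert are left behind.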

Or this case: spark-sql inserts data, the partition is dropped, and then Hive inserts data. Because the file names are different, the data written by spark-sql is not replaced, and the partition ends up with doubled data.
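
To make the proposal concrete, here is a minimal sketch of the cleanup, written against the local filesystem with java.nio rather than against Hive's own code. The class and method names (ReplaceCleanupSketch, staleEntries, replaceFiles) are hypothetical, and treating names that start with "." or "_" as hidden is an assumption of this sketch. It removes every entry in the target directory except hidden ones and the one holding the source directory, then moves the new files in.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class ReplaceCleanupSketch {

    // Assumption of this sketch: names starting with "." or "_" are hidden.
    static boolean isHidden(Path p) {
        String name = p.getFileName().toString();
        return name.startsWith(".") || name.startsWith("_");
    }

    // Top-level entries of targetDir that a replace should delete:
    // everything except hidden entries and the entry that holds sourceDir.
    static List<Path> staleEntries(Path targetDir, Path sourceDir) throws IOException {
        Path src = sourceDir.toAbsolutePath().normalize();
        List<Path> stale = new ArrayList<>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(targetDir)) {
            for (Path entry : entries) {
                boolean holdsSource = src.startsWith(entry.toAbsolutePath().normalize());
                if (!isHidden(entry) && !holdsSource) {
                    stale.add(entry);
                }
            }
        }
        return stale;
    }

    // Delete a file or a whole directory tree, children before parents.
    static void deleteRecursively(Path root) throws IOException {
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(root)) {
            paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path p : paths) {
            Files.delete(p);
        }
    }

    // Clear stale data first, then move the new files from sourceDir into targetDir.
    static void replaceFiles(Path sourceDir, Path targetDir) throws IOException {
        for (Path stale : staleEntries(targetDir, sourceDir)) {
            deleteRecursively(stale);
        }
        List<Path> newFiles;
        try (Stream<Path> list = Files.list(sourceDir)) {
            newFiles = list.collect(Collectors.toList());
        }
        for (Path f : newFiles) {
            Files.move(f, targetDir.resolve(f.getFileName()));
        }
    }

    public static void main(String[] args) throws IOException {
        Path partition = Files.createTempDirectory("partition");
        // File left behind by an earlier writer that uses different file names.
        Files.createFile(partition.resolve("part-00000-demo.parquet"));
        // Staging directory inside the partition, holding the newly written file.
        Path staging = Files.createDirectories(partition.resolve(".hive-staging_demo/-ext-10000"));
        Files.createFile(staging.resolve("000000_0"));

        replaceFiles(staging, partition);

        try (Stream<Path> list = Files.list(partition)) {
            list.map(p -> p.getFileName().toString()).sorted().forEach(System.out::println);
        }
    }
}
```

In this sketch the leftover file from the other writer is removed even though its name never collides with a new file, which is the difference from deleting only on rename conflict.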



> insert overwrite partition onto a external table fail when drop partition first
> -------------------------------------------------------------------------------
>
>                 Key: HIVE-17063
>                 URL: https://issues.apache.org/jira/browse/HIVE-17063
>             Project: Hive
>          Issue Type: Bug
>          Components: Query Processor
>    Affects Versions: 1.2.2, 2.1.1, 2.2.0
>            Reporter: Wang Haihua
>            Assignee: Deepak Jaiswal
>            Priority: Major
>         Attachments: HIVE-17063.1.patch, HIVE-17063.2.patch, HIVE-17063.3.patch, HIVE-17063.4.patch
>
>
> The default value of {{hive.exec.stagingdir}} is a relative path, and dropping a partition on an external table does not clear the underlying data. As a result, running insert overwrite on the same partition a second time fails because the target file to be moved already exists.
> This happened when we regenerated partition data on an external table.
> I see that the target data is not cleared only when {{immediately generated data}} is a child of {{the target data directory}}, so my proposal is to clear any target file that already exists when renaming {{immediately generated data}} into {{the target data directory}}.
> Operation reproduced:
> {code}
> create external table insert_after_drop_partition(key string, val string) partitioned by (insertdate string);
> from src insert overwrite table insert_after_drop_partition partition (insertdate='2008-01-01') select *;
> alter table insert_after_drop_partition drop partition (insertdate='2008-01-01');
> from src insert overwrite table insert_after_drop_partition partition (insertdate='2008-01-01') select *;
> {code}
> Stack trace:
> {code}
> 2017-07-09T08:32:05,212 ERROR [f3bc51c8-2441-4689-b1c1-d60aef86c3aa main] exec.Task: Failed with exception java.io.IOException: rename for src path: pfile:/data/haihua/official/hive/itests/qtest/target/warehouse/insert_after_drop_partition/insertdate=2008-01-01/.hive-staging_hive_2017-07-09_08-32-03_840_4046825276907030554-1/-ext-10000/000000_0 to dest path:pfile:/data/haihua/official/hive/itests/qtest/target/warehouse/insert_after_drop_partition/insertdate=2008-01-01/000000_0 returned false
> org.apache.hadoop.hive.ql.metadata.HiveException: java.io.IOException: rename for src path: pfile:/data/haihua/official/hive/itests/qtest/target/warehouse/insert_after_drop_partition/insertdate=2008-01-01/.hive-staging_hive_2017-07-09_08-32-03_840_4046825276907030554-1/-ext-10000/000000_0 to dest path:pfile:/data/haihua/official/hive/itests/qtest/target/warehouse/insert_after_drop_partition/insertdate=2008-01-01/000000_0 returned false
>         at org.apache.hadoop.hive.ql.metadata.Hive.moveFile(Hive.java:2992)
>         at org.apache.hadoop.hive.ql.metadata.Hive.replaceFiles(Hive.java:3248)
>         at org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1532)
>         at org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1461)
>         at org.apache.hadoop.hive.ql.exec.MoveTask.execute(MoveTask.java:498)
>         at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:197)
>         at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:100)
>         at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:2073)
>         at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1744)
>         at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1453)
>         at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1171)
>         at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1161)
>         at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:232)
>         at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:183)
>         at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:399)
>         at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:335)
>         at org.apache.hadoop.hive.ql.QTestUtil.executeClientInternal(QTestUtil.java:1137)
>         at org.apache.hadoop.hive.ql.QTestUtil.executeClient(QTestUtil.java:1111)
>         at org.apache.hadoop.hive.cli.TestCliDriver.runTest(TestCliDriver.java:120)
>         at org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_insert_after_drop_partition(TestCliDriver.java:103)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:497)
>         at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
>         at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>         at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
>         at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>         at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>         at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
>         at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271)
>         at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)
>         at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
>         at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
>         at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
>         at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
>         at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
>         at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
>         at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
>         at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
>         at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:264)
>         at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:153)
>         at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:124)
>         at org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:200)
>         at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:153)
>         at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:103)
> Caused by: java.io.IOException: rename for src path: pfile:/data/haihua/official/hive/itests/qtest/target/warehouse/insert_after_drop_partition/insertdate=2008-01-01/.hive-staging_hive_2017-07-09_08-32-03_840_4046825276907030554-1/-ext-10000/000000_0 to dest path:pfile:/data/haihua/official/hive/itests/qtest/target/warehouse/insert_after_drop_partition/insertdate=2008-01-01/000000_0 returned false
>         at org.apache.hadoop.hive.ql.metadata.Hive$3.call(Hive.java:2972)
>         at org.apache.hadoop.hive.ql.metadata.Hive$3.call(Hive.java:2962)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)