You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@hive.apache.org by "Marta Kuczora (Jira)" <ji...@apache.org> on 2021/08/23 09:47:00 UTC

[jira] [Assigned] (HIVE-22969) Union remove optimisation results incorrect data when inserting to ACID table

     [ https://issues.apache.org/jira/browse/HIVE-22969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Marta Kuczora reassigned HIVE-22969:
------------------------------------

    Assignee:     (was: Marta Kuczora)

> Union remove optimisation results incorrect data when inserting to ACID table
> -----------------------------------------------------------------------------
>
>                 Key: HIVE-22969
>                 URL: https://issues.apache.org/jira/browse/HIVE-22969
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 4.0.0
>            Reporter: Marta Kuczora
>            Priority: Major
>
> Steps to reproduce the issue:
> {noformat}
> create table input_text(key string, val string) stored as textfile location '/Users/martakuczora/work/hive/warehouse/external/input_text';
> create table output_acid(key string, val string) stored as orc tblproperties('transactional'='true');
> insert into input_text values ('1','1'), ('2','2'),('3','3');
> {noformat}
> {noformat}
> set hive.mapred.mode=nonstrict;
> set hive.stats.autogather=false;
> set hive.optimize.union.remove=true;
> set hive.auto.convert.join=true;
> set hive.exec.submitviachild=false;
> set hive.exec.submit.local.task.via.child=false;
> SELECT * FROM (
> select key, val from input_text
> union all
> select a.key as key, b.val as val FROM input_text a join input_text b on a.key=b.key) c;
> The result of the select:
> 1	1
> 2	2
> 3	3
> 1	1
> 2	2
> 3	3
> {noformat}
> {noformat}
> insert into table output_acid
> SELECT * FROM (
> select key, val from input_text
> union all
> select a.key as key, b.val as val FROM input_text a join input_text b on a.key=b.key) c;
> select * from output_acid;
> The result:
> 1	1
> 2	2
> 3	3
> {noformat}
> The folder of the output_acid table contained the following delta directories:
> {noformat}
> drwxr-xr-x  6 martakuczora  staff  192 Mar  2 16:29 delta_0000000_0000000
> drwxr-xr-x  6 martakuczora  staff  192 Mar  2 16:29 delta_0000001_0000001_0001
> {noformat}
> It can be seen that the statement ID from the first directory is missing and when the select statements runs on the table, this directory will be ignored. That's why only half of the data got returned when running the select on the output_acid table.
> If either hive.stats.autogather is set to true or hive.optimize.union.remove is set to false the result of the insert will be correct. In this case there will be only 1 delta directory in the table's folder.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)