Posted to issues@hive.apache.org by "Sourabh Badhya (Jira)" <ji...@apache.org> on 2023/04/17 12:07:00 UTC

[jira] [Updated] (HIVE-27267) Incorrect results when doing bucket map join on decimal bucketed column with subquery

     [ https://issues.apache.org/jira/browse/HIVE-27267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sourabh Badhya updated HIVE-27267:
----------------------------------
    Description: 
The following queries, when run on a Hive cluster, produce no results - 
Repro queries - 
{code:java}
set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
set hive.support.concurrency=true;
set hive.convert.join.bucket.mapjoin.tez=true;

drop table if exists test_external_source;
create external table test_external_source (date_col date, string_col string, decimal_col decimal(38,0)) stored as orc tblproperties ('external.table.purge'='true');
insert into table test_external_source values ('2022-08-30', 'pipeline', '50000000000000000005905545593'), ('2022-08-16', 'pipeline', '50000000000000000005905545593'), ('2022-09-01', 'pipeline', '50000000000000000006008686831'), ('2022-08-30', 'pipeline', '50000000000000000005992620837'), ('2022-09-01', 'pipeline', '50000000000000000005992620837'), ('2022-09-01', 'pipeline', '50000000000000000005992621067'), ('2022-08-30', 'pipeline', '50000000000000000005992621067');

drop table if exists test_external_target;
create external table test_external_target (date_col date, string_col string, decimal_col decimal(38,0)) stored as orc tblproperties ('external.table.purge'='true');
insert into table test_external_target values ('2017-05-17', 'pipeline', '50000000000000000000441610525'), ('2018-12-20', 'pipeline', '50000000000000000001048981030'), ('2020-06-30', 'pipeline', '50000000000000000002332575516'), ('2021-08-16', 'pipeline', '50000000000000000003897973989'), ('2017-06-06', 'pipeline', '50000000000000000000449148729'), ('2017-09-08', 'pipeline', '50000000000000000000525378314'), ('2022-08-30', 'pipeline', '50000000000000000005905545593'), ('2022-08-16', 'pipeline', '50000000000000000005905545593'), ('2018-05-03', 'pipeline', '50000000000000000000750826355'), ('2020-01-10', 'pipeline', '50000000000000000001816579677'), ('2021-11-01', 'pipeline', '50000000000000000004269423714'), ('2017-11-07', 'pipeline', '50000000000000000000585901787'), ('2019-10-15', 'pipeline', '50000000000000000001598843430'), ('2020-04-01', 'pipeline', '50000000000000000002035795461'), ('2020-02-24', 'pipeline', '50000000000000000001932600185'), ('2020-04-27', 'pipeline', '50000000000000000002108160849'), ('2016-07-05', 'pipeline', '50000000000000000000054405114'), ('2020-06-02', 'pipeline', '50000000000000000002234387967'), ('2020-08-21', 'pipeline', '50000000000000000002529168758'), ('2021-02-17', 'pipeline', '50000000000000000003158511687');

drop table if exists target_table;
drop table if exists source_table;
create table target_table(date_col date, string_col string, decimal_col decimal(38,0)) clustered by (decimal_col) into 7 buckets stored as orc tblproperties ('bucketing_version'='2', 'transactional'='true', 'transactional_properties'='default');
create table source_table(date_col date, string_col string, decimal_col decimal(38,0)) clustered by (decimal_col) into 7 buckets stored as orc tblproperties ('bucketing_version'='2', 'transactional'='true', 'transactional_properties'='default');

insert into table target_table select * from test_external_target;
insert into table source_table select * from test_external_source; {code}
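As a sanity check (a diagnostic sketch, not part of the original repro), the bucket file that the matching rows land in can be inspected through Hive's built-in INPUT__FILE__NAME virtual column; for bucketed ACID tables the file name encodes the bucket id, though the exact path layout varies with deltas and compaction.
{code:java}
-- Sanity-check sketch: INPUT__FILE__NAME is a built-in Hive virtual column.
-- For bucketed tables the returned path ends in a bucket file (bucket_0000N),
-- showing which of the 7 buckets holds the matching rows.
select INPUT__FILE__NAME, decimal_col from source_table where coalesce(decimal_col,'') = '50000000000000000005905545593'; {code}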
The query under investigation - 
{code:java}
select * from target_table inner join (select distinct date_col, 'pipeline' string_col, decimal_col from source_table where coalesce(decimal_col,'') = '50000000000000000005905545593') s on s.date_col = target_table.date_col AND s.string_col = target_table.string_col AND s.decimal_col = target_table.decimal_col; {code}
Expected result of the query - 2 records
{code:java}
+------------------------+--------------------------+--------------------------------+-------------+---------------+--------------------------------+
| target_table.date_col  | target_table.string_col  |    target_table.decimal_col    | s.date_col  | s.string_col  |         s.decimal_col          |
+------------------------+--------------------------+--------------------------------+-------------+---------------+--------------------------------+
| 2022-08-16             | pipeline                 | 50000000000000000005905545593  | 2022-08-16  | pipeline      | 50000000000000000005905545593  |
| 2022-08-30             | pipeline                 | 50000000000000000005905545593  | 2022-08-30  | pipeline      | 50000000000000000005905545593  |
+------------------------+--------------------------+--------------------------------+-------------+---------------+--------------------------------+ {code}
Actual result of the query - No records
{code:java}
+------------------------+--------------------------+---------------------------+-------------+---------------+----------------+
| target_table.date_col  | target_table.string_col  | target_table.decimal_col  | s.date_col  | s.string_col  | s.decimal_col  |
+------------------------+--------------------------+---------------------------+-------------+---------------+----------------+
+------------------------+--------------------------+---------------------------+-------------+---------------+----------------+ {code}
The workaround that fetches the correct result here is to set the config below to false - 
{code:java}
set hive.convert.join.bucket.mapjoin.tez=false;{code}
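For diagnosis, the join conversion can be confirmed from the plan (a diagnostic sketch; the exact operator names in the EXPLAIN output vary by Hive version):
{code:java}
-- Diagnostic sketch: inspect whether the optimizer converts this join into a
-- bucket map join. Re-run with the flag set to false to compare the two plans.
set hive.convert.join.bucket.mapjoin.tez=true;
explain select * from target_table inner join (select distinct date_col, 'pipeline' string_col, decimal_col from source_table where coalesce(decimal_col,'') = '50000000000000000005905545593') s on s.date_col = target_table.date_col AND s.string_col = target_table.string_col AND s.decimal_col = target_table.decimal_col; {code}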
Notes from investigation - 
1. The batch containing the 2 matching records is forwarded correctly to the map join operator. However, during the join comparison, the hash table is empty.
2. The problem seems to be that HashTableDummyOperator loads the records into its hash table, but the map join operator does not take into account all the hash tables from the various instances of HashTableDummyOperator (multiple instances exist because bucket map join initiates multiple map tasks); instead, it uses only the hash table from one HashTableDummyOperator instance. In this case, the selected instance had an empty hash table, hence no records were matched in the join operator.
3. If the table is unbucketed / 1-bucketed, the results are correct: only 1 map task is spawned, and it loads all the records into the hash table. The workaround (setting *hive.convert.join.bucket.mapjoin.tez* to {*}false{*}) has the same effect, since a single map task loads the records into the hash table (see the sketch after this list).
4. HashTableDummyOperator is created in the optimizer and is associated with the plan, hence an issue is suspected in the optimizer code. Ideally, all hash tables from all instances of HashTableDummyOperator must be used by the map join operator.
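A minimal verification sketch for note 3 (the *_1b* table names are hypothetical, introduced here for illustration): loading the same data into 1-bucket tables and re-running the join returns the expected 2 records, since a single map task loads the entire hash table.
{code:java}
-- Verification sketch for note 3: 1-bucket variants of the same tables.
create table target_table_1b(date_col date, string_col string, decimal_col decimal(38,0)) clustered by (decimal_col) into 1 buckets stored as orc tblproperties ('bucketing_version'='2', 'transactional'='true', 'transactional_properties'='default');
create table source_table_1b(date_col date, string_col string, decimal_col decimal(38,0)) clustered by (decimal_col) into 1 buckets stored as orc tblproperties ('bucketing_version'='2', 'transactional'='true', 'transactional_properties'='default');
insert into table target_table_1b select * from test_external_target;
insert into table source_table_1b select * from test_external_source;

-- With a single bucket, this returns the expected 2 records.
select * from target_table_1b inner join (select distinct date_col, 'pipeline' string_col, decimal_col from source_table_1b where coalesce(decimal_col,'') = '50000000000000000005905545593') s on s.date_col = target_table_1b.date_col AND s.string_col = target_table_1b.string_col AND s.decimal_col = target_table_1b.decimal_col; {code}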

> Incorrect results when doing bucket map join on decimal bucketed column with subquery
> -------------------------------------------------------------------------------------
>
>                 Key: HIVE-27267
>                 URL: https://issues.apache.org/jira/browse/HIVE-27267
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Sourabh Badhya
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.20.10#820010)