You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@hive.apache.org by "ASF GitHub Bot (Jira)" <ji...@apache.org> on 2020/07/14 00:33:00 UTC
[jira] [Work logged] (HIVE-23438) Missing Rows When Left Outer Join In N-way HybridGraceHashJoin

     [ https://issues.apache.org/jira/browse/HIVE-23438?focusedWorklogId=458392&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-458392 ]

ASF GitHub Bot logged work on HIVE-23438:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 14/Jul/20 00:32
            Start Date: 14/Jul/20 00:32
    Worklog Time Spent: 10m 
      Work Description: github-actions[bot] commented on pull request #1014:
URL: https://github.com/apache/hive/pull/1014#issuecomment-657898549


   This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
   Feel free to reach out on the dev@hive.apache.org list if the patch is in need of reviews.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Issue Time Tracking
-------------------

            Worklog Id:     (was: 458392)
    Remaining Estimate: 0h
            Time Spent: 10m

> Missing Rows When Left Outer Join In N-way HybridGraceHashJoin
> --------------------------------------------------------------
>
>                 Key: HIVE-23438
>                 URL: https://issues.apache.org/jira/browse/HIVE-23438
>             Project: Hive
>          Issue Type: Bug
>          Components: SQL, Tez
>    Affects Versions: 2.3.4
>            Reporter: 范宜臻
>            Assignee: 范宜臻
>            Priority: Major
>         Attachments: HIVE-23438.001.branch-2.3.patch, HIVE-23438.branch-2.3.patch
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> *Run Test in Patch File*
> {code:java}
> mvn test -Dtest=TestMiniTezCliDriver -Dqfile=hybridgrace_hashjoin_2.q{code}
> *Manual Reproduce*
> *STEP 1. Create test data(q_test_init_tez.sql)*
> {code:java}
> //create table src1
> CREATE TABLE src1 (key STRING COMMENT 'default', value STRING COMMENT 'default') STORED AS TEXTFILE;
> LOAD DATA LOCAL INPATH "${hiveconf:test.data.dir}/kv3.txt" INTO TABLE src1;
> //create table src2
> CREATE TABLE src2(key STRING COMMENT 'default', value STRING COMMENT 'default') STORED AS TEXTFILE;
> LOAD DATA LOCAL INPATH "${hiveconf:test.data.dir}/kv11.txt" OVERWRITE INTO TABLE src2;
> //create table srcpart
> CREATE TABLE srcpart (key STRING COMMENT 'default', value STRING COMMENT 'default')
> PARTITIONED BY (ds STRING, hr STRING)
> STORED AS TEXTFILE;
> LOAD DATA LOCAL INPATH "${hiveconf:test.data.dir}/kv1.txt"
> OVERWRITE INTO TABLE srcpart PARTITION (ds="2008-04-08", hr="11");
> LOAD DATA LOCAL INPATH "${hiveconf:test.data.dir}/kv1.txt"
> OVERWRITE INTO TABLE srcpart PARTITION (ds="2008-04-08", hr="12");
> LOAD DATA LOCAL INPATH "${hiveconf:test.data.dir}/kv1.txt"
> OVERWRITE INTO TABLE srcpart PARTITION (ds="2008-04-09", hr="11");
> LOAD DATA LOCAL INPATH "${hiveconf:test.data.dir}/kv1.txt"
> OVERWRITE INTO TABLE srcpart PARTITION (ds="2008-04-09", hr="12");{code}
> *STEP 2. Run query*
> {code:java}
> set hive.auto.convert.join=true; 
> set hive.auto.convert.join.noconditionaltask=true; 
> set hive.auto.convert.join.noconditionaltask.size=10000000; 
> set hive.cbo.enable=false;
> set hive.mapjoin.hybridgrace.hashtable=true;
> select *
> from
> (
> select key from src1 group by key
> ) x
> left join src2 z on x.key = z.key
> join
> (
> select key from srcpart y group by key
> ) y on y.key = x.key;
> {code}
> *EXPECTED RESULT***
>  
> {code:java}
> 128	NULL	NULL	128
> 146	146	1val_1461	146
> 150	150	1val_1501	150
> 238	NULL	NULL	238
> 369	NULL	NULL	369
> 406	406	1val_4061	406
> 273	273	1val_2731	273
> 98	NULL	NULL	98
> 213	213	1val_2131	213
> 255	NULL	NULL	255
> 401	401	1val_4011	401
> 278	NULL	NULL	278
> 66	66	11val_6611	66
> 224	NULL	NULL	224
> 311	NULL	NULL	311
> {code}
>  
> *ACTUAL RESULT*
> {code:java}
> 128	NULL	NULL	128
> 146	146	1val_1461	146
> 150	150	1val_1501	150
> 213	213	1val_2131	213
> 238	NULL	NULL	238
> 273	273	1val_2731	273
> 369	NULL	NULL	369
> 406	406	1val_4061	406
> 98	NULL	NULL	98
> 401	401	1val_4011	401
> 66	66	11val_6611	66
> {code}
>  
> *ROOT CAUSE*
> src1 left join src2, src1 is big table and src2 is small table. Join result between big table row and the corresponding hashtable maybe NO_MATCH state, however, these NO_MATCH rows is needed because LEFT OUTER JOIN.
> In addition, these big table rows will not spilled into matchfile related to this hashtable on disk because only SPILL state can use `spillBigTableRow`.  Then, these big table rows will be spilled into matchfile in hashtables of table `srcpart`(second small table)
> Finally, when reProcessBigTable, big table rows in matchfile are only read from `firstSmallTable`, some datum are missing.
>  
> *WORKAROUND*
>  configure firstSmallTable in completeInitializationOp and only spill big table row into firstSmallTable when spill matchfile.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)