You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@hive.apache.org by "Youjun Yuan (Jira)" <ji...@apache.org> on 2022/04/05 00:29:00 UTC

[jira] [Updated] (HIVE-26111) FULL JOIN returns incorrect result with Tez engine

     [ https://issues.apache.org/jira/browse/HIVE-26111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Youjun Yuan updated HIVE-26111:
-------------------------------
    Description: 
we hit a query which FULL JOINs two tables, hive produces incorrect results, for a single value of join key, it produces two records, each record has a valid value for one table and NULL for the other table.

The query is:
{code:java}
SELECT d.id, u.id
FROM (
       SELECT id
       FROM   airflow.tableA rud
       WHERE  rud.dt = '2022-04-02-1row'
) d
FULL JOIN (
       SELECT id
       FROM   default.tableB
       WHERE  dt = '2022-04-01' and device_token='blabla'
 ) u
ON u.id = d.id
; {code}
And produces two records for id=350570497
{code:java}
350570497    NULL
NULL    350570497
Time taken: 62.692 seconds, Fetched: 2 row(s) {code}
I am sure tableB has only one row where device_token='blabla'

And we tried:

1, SET mapreduce.job.reduces=1; then it produces right result;

2, SET hive.execution.engine=mr; then it produces right result;

3, JOIN (instead of FULL JOIN) worked as expected

4, in sub query u, change filter device_token='blabla' to id=350570497, it worked ok

Below is the explain output of the query:
{code:java}
Plan optimized by CBO.Vertex dependency in root stage
Reducer 3 <- Map 1 (CUSTOM_SIMPLE_EDGE), Map 2 (CUSTOM_SIMPLE_EDGE)Stage-0
  Fetch Operator
    limit:-1
    Stage-1
      Reducer 3
      File Output Operator [FS_10]
        Map Join Operator [MAPJOIN_13] (rows=2 width=8)
          Conds:RS_6.KEY.reducesinkkey0=RS_7.KEY.reducesinkkey0(Outer),DynamicPartitionHashJoin:true,Output:["_col0","_col1"]
        <-Map 1 [CUSTOM_SIMPLE_EDGE]
          PARTITION_ONLY_SHUFFLE [RS_6]
            PartitionCols:_col0
            Select Operator [SEL_2] (rows=1 width=4)
              Output:["_col0"]
              TableScan [TS_0] (rows=1 width=4)
                airflow@rds_users_delta,rud,Tbl:COMPLETE,Col:COMPLETE,Output:["id"]
        <-Map 2 [CUSTOM_SIMPLE_EDGE]
          PARTITION_ONLY_SHUFFLE [RS_7]
            PartitionCols:_col0
            Select Operator [SEL_5] (rows=1 width=4)
              Output:["_col0"]
              Filter Operator [FIL_12] (rows=1 width=110)
                predicate:(device_token = 'blabla')
                TableScan [TS_3] (rows=215192362 width=109)
                  default@users,users,Tbl:COMPLETE,Col:COMPLETE,Output:["id","device_token"]  {code}
I can't generate a small enough result set to reproduce the issue, I have minimized the tableA to only 1 row, tableB has ~200m rows, but if I further reduce the size of tableB, then the issue can't be reproduced.

Any suggestion would be highly appreciated, regarding the root cause of the issue, how to work around it, or how to reproduce it with small enough dataset. 

  was:
we hit a query which FULL JOINs two tables, hive produces incorrect results, for a single value of join key, it produces two records, each record has a valid value for one table and NULL for the other table.

The query is:
{code:java}
SELECT d.id, u.id
FROM (
       SELECT id
       FROM   airflow.tableA rud
       WHERE  rud.dt = '2022-04-02-1row'
) d
FULL JOIN (
       SELECT id
       FROM   default.tableB
       WHERE  dt = '2022-04-01' and device_token='blabla'
 ) u
ON u.id = d.id
; {code}
And produces two records for id=350570497
{code:java}
350570497    NULL
NULL    350570497
Time taken: 62.692 seconds, Fetched: 2 row(s) {code}
I am sure tableB has only one row where device_token='blabla'And we tried: 1, SET mapreduce.job.reduces=1; then it produces right result;2, SET hive.execution.engine=mr; then it produces right result;3, JOIN (instead of FULL JOIN) worked as expected 4, in sub query u, change filter device_token='blabla' to id=350570497, it worked okBelow is the explain output of the query:
{code:java}
Plan optimized by CBO.Vertex dependency in root stage
Reducer 3 <- Map 1 (CUSTOM_SIMPLE_EDGE), Map 2 (CUSTOM_SIMPLE_EDGE)Stage-0
  Fetch Operator
    limit:-1
    Stage-1
      Reducer 3
      File Output Operator [FS_10]
        Map Join Operator [MAPJOIN_13] (rows=2 width=8)
          Conds:RS_6.KEY.reducesinkkey0=RS_7.KEY.reducesinkkey0(Outer),DynamicPartitionHashJoin:true,Output:["_col0","_col1"]
        <-Map 1 [CUSTOM_SIMPLE_EDGE]
          PARTITION_ONLY_SHUFFLE [RS_6]
            PartitionCols:_col0
            Select Operator [SEL_2] (rows=1 width=4)
              Output:["_col0"]
              TableScan [TS_0] (rows=1 width=4)
                airflow@rds_users_delta,rud,Tbl:COMPLETE,Col:COMPLETE,Output:["id"]
        <-Map 2 [CUSTOM_SIMPLE_EDGE]
          PARTITION_ONLY_SHUFFLE [RS_7]
            PartitionCols:_col0
            Select Operator [SEL_5] (rows=1 width=4)
              Output:["_col0"]
              Filter Operator [FIL_12] (rows=1 width=110)
                predicate:(device_token = 'blabla')
                TableScan [TS_3] (rows=215192362 width=109)
                  default@users,users,Tbl:COMPLETE,Col:COMPLETE,Output:["id","device_token"]  {code}
I can't generate a small enough result set to reproduce the issue, I have minimized the tableA to only 1 row, tableB has ~200m rows, but if I further reduce the size of tableB, then the issue can't be reproduced. any suggestion would be highly appreciated, regarding the root cause of the issue, how to work around it, or how to reproduce it with small enough dataset. 


> FULL JOIN returns incorrect result with Tez engine
> --------------------------------------------------
>
>                 Key: HIVE-26111
>                 URL: https://issues.apache.org/jira/browse/HIVE-26111
>             Project: Hive
>          Issue Type: Bug
>         Environment: aws EMR (hive 3.1.2 + Tez 0.10.1)
>            Reporter: Youjun Yuan
>            Priority: Blocker
>
> we hit a query which FULL JOINs two tables, hive produces incorrect results, for a single value of join key, it produces two records, each record has a valid value for one table and NULL for the other table.
> The query is:
> {code:java}
> SELECT d.id, u.id
> FROM (
>        SELECT id
>        FROM   airflow.tableA rud
>        WHERE  rud.dt = '2022-04-02-1row'
> ) d
> FULL JOIN (
>        SELECT id
>        FROM   default.tableB
>        WHERE  dt = '2022-04-01' and device_token='blabla'
>  ) u
> ON u.id = d.id
> ; {code}
> And produces two records for id=350570497
> {code:java}
> 350570497    NULL
> NULL    350570497
> Time taken: 62.692 seconds, Fetched: 2 row(s) {code}
> I am sure tableB has only one row where device_token='blabla'
> And we tried:
> 1, SET mapreduce.job.reduces=1; then it produces right result;
> 2, SET hive.execution.engine=mr; then it produces right result;
> 3, JOIN (instead of FULL JOIN) worked as expected
> 4, in sub query u, change filter device_token='blabla' to id=350570497, it worked ok
> Below is the explain output of the query:
> {code:java}
> Plan optimized by CBO.Vertex dependency in root stage
> Reducer 3 <- Map 1 (CUSTOM_SIMPLE_EDGE), Map 2 (CUSTOM_SIMPLE_EDGE)Stage-0
>   Fetch Operator
>     limit:-1
>     Stage-1
>       Reducer 3
>       File Output Operator [FS_10]
>         Map Join Operator [MAPJOIN_13] (rows=2 width=8)
>           Conds:RS_6.KEY.reducesinkkey0=RS_7.KEY.reducesinkkey0(Outer),DynamicPartitionHashJoin:true,Output:["_col0","_col1"]
>         <-Map 1 [CUSTOM_SIMPLE_EDGE]
>           PARTITION_ONLY_SHUFFLE [RS_6]
>             PartitionCols:_col0
>             Select Operator [SEL_2] (rows=1 width=4)
>               Output:["_col0"]
>               TableScan [TS_0] (rows=1 width=4)
>                 airflow@rds_users_delta,rud,Tbl:COMPLETE,Col:COMPLETE,Output:["id"]
>         <-Map 2 [CUSTOM_SIMPLE_EDGE]
>           PARTITION_ONLY_SHUFFLE [RS_7]
>             PartitionCols:_col0
>             Select Operator [SEL_5] (rows=1 width=4)
>               Output:["_col0"]
>               Filter Operator [FIL_12] (rows=1 width=110)
>                 predicate:(device_token = 'blabla')
>                 TableScan [TS_3] (rows=215192362 width=109)
>                   default@users,users,Tbl:COMPLETE,Col:COMPLETE,Output:["id","device_token"]  {code}
> I can't generate a small enough result set to reproduce the issue, I have minimized the tableA to only 1 row, tableB has ~200m rows, but if I further reduce the size of tableB, then the issue can't be reproduced.
> Any suggestion would be highly appreciated, regarding the root cause of the issue, how to work around it, or how to reproduce it with small enough dataset. 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)