You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@tez.apache.org by "Jonas Shaw (Jira)" <ji...@apache.org> on 2022/09/20 02:20:00 UTC
[jira] [Commented] (TEZ-4402) failed to do FULL JOIN

    [ https://issues.apache.org/jira/browse/TEZ-4402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17606828#comment-17606828 ] 

Jonas Shaw commented on TEZ-4402:
---------------------------------

Same hive and tez version, we got same incorrect result. The bug happen on certain row count in two full join table.

> failed to do FULL JOIN 
> -----------------------
>
>                 Key: TEZ-4402
>                 URL: https://issues.apache.org/jira/browse/TEZ-4402
>             Project: Apache Tez
>          Issue Type: Bug
>         Environment: EMR (hive 3.1.2 + Tez 0.10.1)
> Data stored in S3
>  
>            Reporter: Youjun Yuan
>            Priority: Major
>
> we hit a query which FULL JOINs two tables, hive produces incorrect results, for a single value of join key, it produces two records, each record has a valid value for one table and NULL for the other table.
> The query is:
>  
> {code:java}
> SELECT d.id, u.id
> FROM (
>        SELECT id
>        FROM   airflow.tableA rud
>        WHERE  rud.dt = '2022-04-02-1row'
> ) d
> FULL JOIN (
>        SELECT id
>        FROM   default.tableB
>        WHERE  dt = '2022-04-01' and device_token='blabla'
>  ) u
> ON u.id = d.id
> ;
> {code}
> And produces two records for id=350570497
> {code:java}
> 350570497    NULL
> NULL    350570497
> Time taken: 62.692 seconds, Fetched: 2 row(s) {code}
> I am sure tableB has only one row where device_token='blabla'
> And we tried:
> 1, SET mapreduce.job.reduces=1; then it produces right result;
> 2, SET hive.execution.engine=mr; then it produces right result;
> 3, *JOIN* (instead of FULL JOIN) worked as expected
> 4, in sub query u, change filter _device_token='blabla'_ to {_}id=350570497{_}, it worked ok
> Below is the explain output of the query:
> {code:java}
> Plan optimized by CBO.Vertex dependency in root stage
> Reducer 3 <- Map 1 (CUSTOM_SIMPLE_EDGE), Map 2 (CUSTOM_SIMPLE_EDGE)Stage-0
>   Fetch Operator
>     limit:-1
>     Stage-1
>       Reducer 3
>       File Output Operator [FS_10]
>         Map Join Operator [MAPJOIN_13] (rows=2 width=8)
>           Conds:RS_6.KEY.reducesinkkey0=RS_7.KEY.reducesinkkey0(Outer),DynamicPartitionHashJoin:true,Output:["_col0","_col1"]
>         <-Map 1 [CUSTOM_SIMPLE_EDGE]
>           PARTITION_ONLY_SHUFFLE [RS_6]
>             PartitionCols:_col0
>             Select Operator [SEL_2] (rows=1 width=4)
>               Output:["_col0"]
>               TableScan [TS_0] (rows=1 width=4)
>                 airflow@rds_users_delta,rud,Tbl:COMPLETE,Col:COMPLETE,Output:["id"]
>         <-Map 2 [CUSTOM_SIMPLE_EDGE]
>           PARTITION_ONLY_SHUFFLE [RS_7]
>             PartitionCols:_col0
>             Select Operator [SEL_5] (rows=1 width=4)
>               Output:["_col0"]
>               Filter Operator [FIL_12] (rows=1 width=110)
>                 predicate:(device_token = 'blabla')
>                 TableScan [TS_3] (rows=215192362 width=109)
>                   default@users,users,Tbl:COMPLETE,Col:COMPLETE,Output:["id","device_token"] {code}
> I can't generate a small enough result set to reproduce the issue, I have minimized the tableA to only 1 row, tableB has ~10m rows, but if I further reduce the size of tableB, then the issue can't be reproduced.
>  
> any suggestion would be highly appreciated, regarding the root cause of the issue, how to work around it, or how to reproduce it with small enough dataset. 
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)