You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@tez.apache.org by "Jason Lowe (JIRA)" <ji...@apache.org> on 2018/07/16 15:20:00 UTC

[jira] [Commented] (TEZ-3971) Incorrect query result in hive when hive.convert.join.bucket.mapjoin.tez=true

    [ https://issues.apache.org/jira/browse/TEZ-3971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16545328#comment-16545328 ] 

Jason Lowe commented on TEZ-3971:
---------------------------------

This problem report seems more appropriate for the HIVE project.  Hive is responsible for generating the DAG for Tez, and it sounds like it is generating an incorrect DAG to process the results.  That would be a bug in Hive rather than a bug in Tez.


> Incorrect query result in hive when hive.convert.join.bucket.mapjoin.tez=true
> -----------------------------------------------------------------------------
>
>                 Key: TEZ-3971
>                 URL: https://issues.apache.org/jira/browse/TEZ-3971
>             Project: Apache Tez
>          Issue Type: Bug
>         Environment: We are using Hive 3, Hadoop 3.1 and Tez 0.91
>            Reporter: Karthik
>            Priority: Major
>         Attachments: extended_explain.txt
>
>
> When hive.convert.join.bucket.mapjoin.tez=true and bucketed column is in select clause but not in where clause, hive is performing a bucket map join and returning incorrect results. When the bucketed column is removed from select clause or  hive.convert.join.bucket.mapjoin.tez=false, returned query results are correct.
>  
> create table my_fact(AMT decimal(20,3),bucket_col string ,join_col string )
> PARTITIONED BY (FISCAL_YEAR string ,ACCOUNTING_PERIOD string )
>  CLUSTERED BY (bucket_col) INTO 10 
> BUCKETS 
> stored as ORC
>  ;
> create table my_dim(join_col string,filter_col string) stored as orc;
> After populating and analyzing above tables, explain  plan looks as below when  hive.convert.join.bucket.mapjoin.tez=TRUE:
>  
> explain  select T4.join_col as account1,my_fact.accounting_period
> FROM my_fact JOIN my_dim T4 ON my_fact.join_col = T4.join_col
> WHERE my_fact.fiscal_year = '2015'
>  AND T4.filter_col IN ( 'VAL1', 'VAL2' ) 
> and my_fact.accounting_period in (10);
> Vertex dependency in root stage
> Map 1 <- Map 2 (CUSTOM_EDGE)
> Stage-0
>  Fetch Operator
>  limit:-1
>  Stage-1
>  Map 1 vectorized, llap
>  File Output Operator [FS_24]
>  Select Operator [SEL_23] (rows=15282589 width=291)
>  Output:["_col0","_col1","_col2"]
>  Map Join Operator [MAPJOIN_22] (rows=15282589 width=291)
>  *BucketMapJoin*:true,Conds:SEL_21._col1=RS_19._col0(Inner),Output:["_col0","_col3","_col4"]
>  <-Map 2 [CUSTOM_EDGE] vectorized, llap
>  MULTICAST [RS_19]
>  PartitionCols:_col0
>  Select Operator [SEL_18] (rows=818 width=186)
>  Output:["_col0"]
>  Filter Operator [FIL_17] (rows=818 width=186)
>  predicate:((filter_col) IN ('VAL1', 'VAL2') and join_col is not null)
>  TableScan [TS_3] (rows=1635 width=186)
>  default@my_dim,t4,Tbl:COMPLETE,Col:NONE,Output:["join_col","filter_col"]
>  <-Select Operator [SEL_21] (rows=13893263 width=291)
>  Output:["_col0","_col1","_col3"]
>  Filter Operator [FIL_20] (rows=13893263 width=291)
>  predicate:join_col is not null
>  TableScan [TS_0] (rows=13893263 width=291)
>  default@my_fact,my_fact,Tbl:COMPLETE,Col:NONE,Output:["bucket_col","join_col"]
> [^extended_explain.txt] has more detailed plan.
> When  hive.convert.join.bucket.mapjoin.tez=false,  plan no longer has bucketjoin and query results are correct.
> Vertex dependency in root stage
> Map 1 <- Map 2 (BROADCAST_EDGE)
> Stage-0
>  Fetch Operator
>  limit:-1
>  Stage-1
>  Map 1 vectorized, llap
>  File Output Operator [FS_24]
>  Select Operator [SEL_23] (rows=15282589 width=291)
>  Output:["_col0","_col1","_col2"]
>  Map Join Operator [MAPJOIN_22] (rows=15282589 width=291)
>  Conds:SEL_21._col1=RS_19._col0(Inner),Output:["_col0","_col3","_col4"]
>  <-Map 2 [BROADCAST_EDGE] vectorized, llap
>  BROADCAST [RS_19]
>  PartitionCols:_col0
>  Select Operator [SEL_18] (rows=818 width=186)
>  Output:["_col0"]
>  Filter Operator [FIL_17] (rows=818 width=186)
>  predicate:((filter_col) IN ('VAL1', 'VAL2') and join_col is not null)
>  TableScan [TS_3] (rows=1635 width=186)
>  default@my_dim,t4,Tbl:COMPLETE,Col:NONE,Output:["join_col","filter_col"]
>  <-Select Operator [SEL_21] (rows=13893263 width=291)
>  Output:["_col0","_col1","_col3"]
>  Filter Operator [FIL_20] (rows=13893263 width=291)
>  predicate:join_col is not null
>  TableScan [TS_0] (rows=13893263 width=291)
>  default@my_fact,my_fact,Tbl:COMPLETE,Col:NONE,Output:["bucket_col","join_col"]
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)