You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@hive.apache.org by "Jesus Camacho Rodriguez (Jira)" <ji...@apache.org> on 2021/04/05 20:29:00 UTC
[jira] [Commented] (HIVE-24595) Vectorization causing incorrect results for scalar subquery

    [ https://issues.apache.org/jira/browse/HIVE-24595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17315084#comment-17315084 ] 

Jesus Camacho Rodriguez commented on HIVE-24595:
------------------------------------------------

[~mustafaiman], fyi HIVE-24638 has been fixed.

> Vectorization causing incorrect results for scalar subquery
> -----------------------------------------------------------
>
>                 Key: HIVE-24595
>                 URL: https://issues.apache.org/jira/browse/HIVE-24595
>             Project: Hive
>          Issue Type: Bug
>          Components: Vectorization
>    Affects Versions: 3.0.0
>            Reporter: Vineet Garg
>            Assignee: Mustafa İman
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> *Repro*
> {code:sql}
>  CREATE EXTERNAL TABLE `alltypessmall`(             
>    `id` int,                                        
>    `bool_col` boolean,                              
>    `tinyint_col` tinyint,                           
>    `smallint_col` smallint,                         
>    `int_col` int,                                   
>    `bigint_col` bigint,                             
>    `float_col` float,                               
>    `double_col` double,                             
>    `date_string_col` string,                        
>    `string_col` string,                             
>    `timestamp_col` timestamp)                       
>  PARTITIONED BY (                                   
>    `year` int,                                      
>    `month` int)                                     
>  ROW FORMAT SERDE                                   
>    'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'  
>  WITH SERDEPROPERTIES (                             
>    'escape.delim'='\\',                             
>    'field.delim'=',',                               
>    'serialization.format'=',')                      
>  STORED AS INPUTFORMAT                              
>    'org.apache.hadoop.mapred.TextInputFormat'       
>  OUTPUTFORMAT                                       
>    'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' 
>  TBLPROPERTIES (                                    
>    'DO_NOT_UPDATE_STATS'='true',                    
>    'OBJCAPABILITIES'='EXTREAD,EXTWRITE',            
>    'STATS_GENERATED'='TASK',                        
>    'impala.lastComputeStatsTime'='1608312793',      
>    'transient_lastDdlTime'='1608310442')            ;
> insert into alltypessmall partition(year=2002,month=1) values(1, true, 3,3,4,3434,5.4,44.3,'str1','str2', '01-01-2001');
> insert into alltypessmall partition(year=2002,month=1) values(1, true, 3,3,4,3434,5.4,44.3,'str1','str2', '01-01-2001');
> insert into alltypessmall partition(year=2002,month=1) values(1, true, 3,3,40,3434,5.4,44.3,'str1','str2', '01-01-2001');
> {code}
> Following query should fail but it succeeds
> {code:sql}
> SELECT id FROM alltypessmall
> WHERE int_col =
>   (SELECT int_col
>    FROM alltypessmall)
> ORDER BY id;
> {code}
> *Explain plan*
> {code:java}
> STAGE DEPENDENCIES:
>   Stage-1 is a root stage
>   Stage-0 depends on stages: Stage-1
> STAGE PLANS:
>   Stage: Stage-1
>     Tez
>       DagId: vgarg_20210106115838_3fe73bf6-66c2-4281-92e8-fd75fd8ad400:17
>       Edges:
>         Map 1 <- Map 3 (BROADCAST_EDGE), Reducer 4 (BROADCAST_EDGE)
>         Reducer 2 <- Map 1 (SIMPLE_EDGE)
>         Reducer 4 <- Map 3 (CUSTOM_SIMPLE_EDGE)
>       DagName: vgarg_20210106115838_3fe73bf6-66c2-4281-92e8-fd75fd8ad400:17
>       Vertices:
>         Map 1
>             Map Operator Tree:
>                 TableScan
>                   alias: alltypessmall
>                   filterExpr: int_col is not null (type: boolean)
>                   Statistics: Num rows: 3 Data size: 24 Basic stats: COMPLETE Column stats: COMPLETE
>                   Filter Operator
>                     predicate: int_col is not null (type: boolean)
>                     Statistics: Num rows: 3 Data size: 24 Basic stats: COMPLETE Column stats: COMPLETE
>                     Select Operator
>                       expressions: id (type: int), int_col (type: int)
>                       outputColumnNames: _col0, _col1
>                       Statistics: Num rows: 3 Data size: 24 Basic stats: COMPLETE Column stats: COMPLETE
>                       Map Join Operator
>                         condition map:
>                              Inner Join 0 to 1
>                         keys:
>                           0
>                           1
>                         outputColumnNames: _col0, _col1
>                         input vertices:
>                           1 Reducer 4
>                         Statistics: Num rows: 3 Data size: 24 Basic stats: COMPLETE Column stats: COMPLETE
>                         Map Join Operator
>                           condition map:
>                                Inner Join 0 to 1
>                           keys:
>                             0 _col1 (type: int)
>                             1 _col0 (type: int)
>                           outputColumnNames: _col0
>                           input vertices:
>                             1 Map 3
>                           Statistics: Num rows: 9 Data size: 36 Basic stats: COMPLETE Column stats: COMPLETE
>                           Reduce Output Operator
>                             key expressions: _col0 (type: int)
>                             null sort order: z
>                             sort order: +
>                             Statistics: Num rows: 9 Data size: 36 Basic stats: COMPLETE Column stats: COMPLETE
>             Execution mode: vectorized
>         Map 3
>             Map Operator Tree:
>                 TableScan
>                   alias: alltypessmall
>                   Statistics: Num rows: 3 Data size: 142 Basic stats: COMPLETE Column stats: COMPLETE
>                   Select Operator
>                     Statistics: Num rows: 3 Data size: 142 Basic stats: COMPLETE Column stats: COMPLETE
>                     Group By Operator
>                       aggregations: count()
>                       minReductionHashAggr: 0.6666666
>                       mode: hash
>                       outputColumnNames: _col0
>                       Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
>                       Reduce Output Operator
>                         null sort order:
>                         sort order:
>                         Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
>                         value expressions: _col0 (type: bigint)
>                   Filter Operator
>                     predicate: int_col is not null (type: boolean)
>                     Statistics: Num rows: 3 Data size: 12 Basic stats: COMPLETE Column stats: COMPLETE
>                     Select Operator
>                       expressions: int_col (type: int)
>                       outputColumnNames: _col0
>                       Statistics: Num rows: 3 Data size: 12 Basic stats: COMPLETE Column stats: COMPLETE
>                       Reduce Output Operator
>                         key expressions: _col0 (type: int)
>                         null sort order: z
>                         sort order: +
>                         Map-reduce partition columns: _col0 (type: int)
>                         Statistics: Num rows: 3 Data size: 12 Basic stats: COMPLETE Column stats: COMPLETE
>             Execution mode: vectorized
>         Reducer 2
>             Execution mode: vectorized
>             Reduce Operator Tree:
>               Select Operator
>                 expressions: KEY.reducesinkkey0 (type: int)
>                 outputColumnNames: _col0
>                 Statistics: Num rows: 9 Data size: 36 Basic stats: COMPLETE Column stats: COMPLETE
>                 File Output Operator
>                   compressed: false
>                   Statistics: Num rows: 9 Data size: 36 Basic stats: COMPLETE Column stats: COMPLETE
>                   table:
>                       input format: org.apache.hadoop.mapred.SequenceFileInputFormat
>                       output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
>                       serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>         Reducer 4
>             Execution mode: vectorized
>             Reduce Operator Tree:
>               Group By Operator
>                 aggregations: count(VALUE._col0)
>                 mode: mergepartial
>                 outputColumnNames: _col0
>                 Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
>                 Filter Operator
>                   predicate: (sq_count_check(_col0) <= 1) (type: boolean)
>                   Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
>                   Select Operator
>                     Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
>                     Reduce Output Operator
>                       null sort order:
>                       sort order:
>                       Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
>   Stage: Stage-0
>     Fetch Operator
>       limit: -1
>       Processor Tree:
>         ListSink
> {code}
> Issue is that Map 3/Reducer 4 group by is producing incorrect count. Expected output in this case is *3* but actual seems to be *1*.
>  
> *CBO plan*
> {code:sql}
> HiveSortLimit(sort0=[$0], dir0=[ASC])
>   HiveProject(id=[$0])
>     HiveJoin(condition=[=($1, $3)], joinType=[inner], algorithm=[none], cost=[{6.0 rows, 0.0 cpu, 0.0 io}])
>       HiveJoin(condition=[true], joinType=[inner], algorithm=[none], cost=[{4.0 rows, 0.0 cpu, 0.0 io}])
>         HiveProject(id=[$0], int_col=[$4])
>           HiveFilter(condition=[IS NOT NULL($4)])
>             HiveTableScan(table=[[cdpd_20765, alltypessmall]], table:alias=[alltypessmall])
>         HiveProject(cnt=[$0])
>           HiveFilter(condition=[<=(sq_count_check($0), 1)])
>             HiveProject(cnt=[$0])
>               HiveAggregate(group=[{}], cnt=[COUNT()])
>                 HiveTableScan(table=[[cdpd_20765, alltypessmall]], table:alias=[alltypessmall])
>       HiveProject(int_col=[$4])
>         HiveFilter(condition=[IS NOT NULL($4)])
>           HiveTableScan(table=[[cdpd_20765, alltypessmall]], table:alias=[alltypessmall])
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)