You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@hive.apache.org by "ASF GitHub Bot (Jira)" <ji...@apache.org> on 2021/03/16 00:50:00 UTC
[jira] [Work logged] (HIVE-24595) Vectorization causing incorrect results for scalar subquery

     [ https://issues.apache.org/jira/browse/HIVE-24595?focusedWorklogId=566656&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-566656 ]

ASF GitHub Bot logged work on HIVE-24595:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 16/Mar/21 00:49
            Start Date: 16/Mar/21 00:49
    Worklog Time Spent: 10m 
      Work Description: github-actions[bot] commented on pull request #1867:
URL: https://github.com/apache/hive/pull/1867#issuecomment-799860346


   This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
   Feel free to reach out on the dev@hive.apache.org list if the patch is in need of reviews.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Issue Time Tracking
-------------------

    Worklog Id:     (was: 566656)
    Time Spent: 20m  (was: 10m)

> Vectorization causing incorrect results for scalar subquery
> -----------------------------------------------------------
>
>                 Key: HIVE-24595
>                 URL: https://issues.apache.org/jira/browse/HIVE-24595
>             Project: Hive
>          Issue Type: Bug
>          Components: Vectorization
>    Affects Versions: 3.0.0
>            Reporter: Vineet Garg
>            Assignee: Mustafa İman
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> *Repro*
> {code:sql}
>  CREATE EXTERNAL TABLE `alltypessmall`(             
>    `id` int,                                        
>    `bool_col` boolean,                              
>    `tinyint_col` tinyint,                           
>    `smallint_col` smallint,                         
>    `int_col` int,                                   
>    `bigint_col` bigint,                             
>    `float_col` float,                               
>    `double_col` double,                             
>    `date_string_col` string,                        
>    `string_col` string,                             
>    `timestamp_col` timestamp)                       
>  PARTITIONED BY (                                   
>    `year` int,                                      
>    `month` int)                                     
>  ROW FORMAT SERDE                                   
>    'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'  
>  WITH SERDEPROPERTIES (                             
>    'escape.delim'='\\',                             
>    'field.delim'=',',                               
>    'serialization.format'=',')                      
>  STORED AS INPUTFORMAT                              
>    'org.apache.hadoop.mapred.TextInputFormat'       
>  OUTPUTFORMAT                                       
>    'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' 
>  TBLPROPERTIES (                                    
>    'DO_NOT_UPDATE_STATS'='true',                    
>    'OBJCAPABILITIES'='EXTREAD,EXTWRITE',            
>    'STATS_GENERATED'='TASK',                        
>    'impala.lastComputeStatsTime'='1608312793',      
>    'transient_lastDdlTime'='1608310442')            ;
> insert into alltypessmall partition(year=2002,month=1) values(1, true, 3,3,4,3434,5.4,44.3,'str1','str2', '01-01-2001');
> insert into alltypessmall partition(year=2002,month=1) values(1, true, 3,3,4,3434,5.4,44.3,'str1','str2', '01-01-2001');
> insert into alltypessmall partition(year=2002,month=1) values(1, true, 3,3,40,3434,5.4,44.3,'str1','str2', '01-01-2001');
> {code}
> Following query should fail but it succeeds
> {code:sql}
> SELECT id FROM alltypessmall
> WHERE int_col =
>   (SELECT int_col
>    FROM alltypessmall)
> ORDER BY id;
> {code}
> *Explain plan*
> {code:java}
> STAGE DEPENDENCIES:
>   Stage-1 is a root stage
>   Stage-0 depends on stages: Stage-1
> STAGE PLANS:
>   Stage: Stage-1
>     Tez
>       DagId: vgarg_20210106115838_3fe73bf6-66c2-4281-92e8-fd75fd8ad400:17
>       Edges:
>         Map 1 <- Map 3 (BROADCAST_EDGE), Reducer 4 (BROADCAST_EDGE)
>         Reducer 2 <- Map 1 (SIMPLE_EDGE)
>         Reducer 4 <- Map 3 (CUSTOM_SIMPLE_EDGE)
>       DagName: vgarg_20210106115838_3fe73bf6-66c2-4281-92e8-fd75fd8ad400:17
>       Vertices:
>         Map 1
>             Map Operator Tree:
>                 TableScan
>                   alias: alltypessmall
>                   filterExpr: int_col is not null (type: boolean)
>                   Statistics: Num rows: 3 Data size: 24 Basic stats: COMPLETE Column stats: COMPLETE
>                   Filter Operator
>                     predicate: int_col is not null (type: boolean)
>                     Statistics: Num rows: 3 Data size: 24 Basic stats: COMPLETE Column stats: COMPLETE
>                     Select Operator
>                       expressions: id (type: int), int_col (type: int)
>                       outputColumnNames: _col0, _col1
>                       Statistics: Num rows: 3 Data size: 24 Basic stats: COMPLETE Column stats: COMPLETE
>                       Map Join Operator
>                         condition map:
>                              Inner Join 0 to 1
>                         keys:
>                           0
>                           1
>                         outputColumnNames: _col0, _col1
>                         input vertices:
>                           1 Reducer 4
>                         Statistics: Num rows: 3 Data size: 24 Basic stats: COMPLETE Column stats: COMPLETE
>                         Map Join Operator
>                           condition map:
>                                Inner Join 0 to 1
>                           keys:
>                             0 _col1 (type: int)
>                             1 _col0 (type: int)
>                           outputColumnNames: _col0
>                           input vertices:
>                             1 Map 3
>                           Statistics: Num rows: 9 Data size: 36 Basic stats: COMPLETE Column stats: COMPLETE
>                           Reduce Output Operator
>                             key expressions: _col0 (type: int)
>                             null sort order: z
>                             sort order: +
>                             Statistics: Num rows: 9 Data size: 36 Basic stats: COMPLETE Column stats: COMPLETE
>             Execution mode: vectorized
>         Map 3
>             Map Operator Tree:
>                 TableScan
>                   alias: alltypessmall
>                   Statistics: Num rows: 3 Data size: 142 Basic stats: COMPLETE Column stats: COMPLETE
>                   Select Operator
>                     Statistics: Num rows: 3 Data size: 142 Basic stats: COMPLETE Column stats: COMPLETE
>                     Group By Operator
>                       aggregations: count()
>                       minReductionHashAggr: 0.6666666
>                       mode: hash
>                       outputColumnNames: _col0
>                       Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
>                       Reduce Output Operator
>                         null sort order:
>                         sort order:
>                         Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
>                         value expressions: _col0 (type: bigint)
>                   Filter Operator
>                     predicate: int_col is not null (type: boolean)
>                     Statistics: Num rows: 3 Data size: 12 Basic stats: COMPLETE Column stats: COMPLETE
>                     Select Operator
>                       expressions: int_col (type: int)
>                       outputColumnNames: _col0
>                       Statistics: Num rows: 3 Data size: 12 Basic stats: COMPLETE Column stats: COMPLETE
>                       Reduce Output Operator
>                         key expressions: _col0 (type: int)
>                         null sort order: z
>                         sort order: +
>                         Map-reduce partition columns: _col0 (type: int)
>                         Statistics: Num rows: 3 Data size: 12 Basic stats: COMPLETE Column stats: COMPLETE
>             Execution mode: vectorized
>         Reducer 2
>             Execution mode: vectorized
>             Reduce Operator Tree:
>               Select Operator
>                 expressions: KEY.reducesinkkey0 (type: int)
>                 outputColumnNames: _col0
>                 Statistics: Num rows: 9 Data size: 36 Basic stats: COMPLETE Column stats: COMPLETE
>                 File Output Operator
>                   compressed: false
>                   Statistics: Num rows: 9 Data size: 36 Basic stats: COMPLETE Column stats: COMPLETE
>                   table:
>                       input format: org.apache.hadoop.mapred.SequenceFileInputFormat
>                       output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
>                       serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>         Reducer 4
>             Execution mode: vectorized
>             Reduce Operator Tree:
>               Group By Operator
>                 aggregations: count(VALUE._col0)
>                 mode: mergepartial
>                 outputColumnNames: _col0
>                 Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
>                 Filter Operator
>                   predicate: (sq_count_check(_col0) <= 1) (type: boolean)
>                   Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
>                   Select Operator
>                     Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
>                     Reduce Output Operator
>                       null sort order:
>                       sort order:
>                       Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
>   Stage: Stage-0
>     Fetch Operator
>       limit: -1
>       Processor Tree:
>         ListSink
> {code}
> Issue is that Map 3/Reducer 4 group by is producing incorrect count. Expected output in this case is *3* but actual seems to be *1*.
>  
> *CBO plan*
> {code:sql}
> HiveSortLimit(sort0=[$0], dir0=[ASC])
>   HiveProject(id=[$0])
>     HiveJoin(condition=[=($1, $3)], joinType=[inner], algorithm=[none], cost=[{6.0 rows, 0.0 cpu, 0.0 io}])
>       HiveJoin(condition=[true], joinType=[inner], algorithm=[none], cost=[{4.0 rows, 0.0 cpu, 0.0 io}])
>         HiveProject(id=[$0], int_col=[$4])
>           HiveFilter(condition=[IS NOT NULL($4)])
>             HiveTableScan(table=[[cdpd_20765, alltypessmall]], table:alias=[alltypessmall])
>         HiveProject(cnt=[$0])
>           HiveFilter(condition=[<=(sq_count_check($0), 1)])
>             HiveProject(cnt=[$0])
>               HiveAggregate(group=[{}], cnt=[COUNT()])
>                 HiveTableScan(table=[[cdpd_20765, alltypessmall]], table:alias=[alltypessmall])
>       HiveProject(int_col=[$4])
>         HiveFilter(condition=[IS NOT NULL($4)])
>           HiveTableScan(table=[[cdpd_20765, alltypessmall]], table:alias=[alltypessmall])
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)