You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@hive.apache.org by "Mustafa İman (Jira)" <ji...@apache.org> on 2021/04/05 21:51:00 UTC
[jira] [Resolved] (HIVE-24595) Vectorization causing incorrect
results for scalar subquery
[ https://issues.apache.org/jira/browse/HIVE-24595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Mustafa İman resolved HIVE-24595.
---------------------------------
Fix Version/s: 4.0.0
Resolution: Invalid
> Vectorization causing incorrect results for scalar subquery
> -----------------------------------------------------------
>
> Key: HIVE-24595
> URL: https://issues.apache.org/jira/browse/HIVE-24595
> Project: Hive
> Issue Type: Bug
> Components: Vectorization
> Affects Versions: 3.0.0
> Reporter: Vineet Garg
> Assignee: Mustafa İman
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
>
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> *Repro*
> {code:sql}
> CREATE EXTERNAL TABLE `alltypessmall`(
> `id` int,
> `bool_col` boolean,
> `tinyint_col` tinyint,
> `smallint_col` smallint,
> `int_col` int,
> `bigint_col` bigint,
> `float_col` float,
> `double_col` double,
> `date_string_col` string,
> `string_col` string,
> `timestamp_col` timestamp)
> PARTITIONED BY (
> `year` int,
> `month` int)
> ROW FORMAT SERDE
> 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
> WITH SERDEPROPERTIES (
> 'escape.delim'='\\',
> 'field.delim'=',',
> 'serialization.format'=',')
> STORED AS INPUTFORMAT
> 'org.apache.hadoop.mapred.TextInputFormat'
> OUTPUTFORMAT
> 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
> TBLPROPERTIES (
> 'DO_NOT_UPDATE_STATS'='true',
> 'OBJCAPABILITIES'='EXTREAD,EXTWRITE',
> 'STATS_GENERATED'='TASK',
> 'impala.lastComputeStatsTime'='1608312793',
> 'transient_lastDdlTime'='1608310442') ;
> insert into alltypessmall partition(year=2002,month=1) values(1, true, 3,3,4,3434,5.4,44.3,'str1','str2', '01-01-2001');
> insert into alltypessmall partition(year=2002,month=1) values(1, true, 3,3,4,3434,5.4,44.3,'str1','str2', '01-01-2001');
> insert into alltypessmall partition(year=2002,month=1) values(1, true, 3,3,40,3434,5.4,44.3,'str1','str2', '01-01-2001');
> {code}
> Following query should fail but it succeeds
> {code:sql}
> SELECT id FROM alltypessmall
> WHERE int_col =
> (SELECT int_col
> FROM alltypessmall)
> ORDER BY id;
> {code}
> *Explain plan*
> {code:java}
> STAGE DEPENDENCIES:
> Stage-1 is a root stage
> Stage-0 depends on stages: Stage-1
> STAGE PLANS:
> Stage: Stage-1
> Tez
> DagId: vgarg_20210106115838_3fe73bf6-66c2-4281-92e8-fd75fd8ad400:17
> Edges:
> Map 1 <- Map 3 (BROADCAST_EDGE), Reducer 4 (BROADCAST_EDGE)
> Reducer 2 <- Map 1 (SIMPLE_EDGE)
> Reducer 4 <- Map 3 (CUSTOM_SIMPLE_EDGE)
> DagName: vgarg_20210106115838_3fe73bf6-66c2-4281-92e8-fd75fd8ad400:17
> Vertices:
> Map 1
> Map Operator Tree:
> TableScan
> alias: alltypessmall
> filterExpr: int_col is not null (type: boolean)
> Statistics: Num rows: 3 Data size: 24 Basic stats: COMPLETE Column stats: COMPLETE
> Filter Operator
> predicate: int_col is not null (type: boolean)
> Statistics: Num rows: 3 Data size: 24 Basic stats: COMPLETE Column stats: COMPLETE
> Select Operator
> expressions: id (type: int), int_col (type: int)
> outputColumnNames: _col0, _col1
> Statistics: Num rows: 3 Data size: 24 Basic stats: COMPLETE Column stats: COMPLETE
> Map Join Operator
> condition map:
> Inner Join 0 to 1
> keys:
> 0
> 1
> outputColumnNames: _col0, _col1
> input vertices:
> 1 Reducer 4
> Statistics: Num rows: 3 Data size: 24 Basic stats: COMPLETE Column stats: COMPLETE
> Map Join Operator
> condition map:
> Inner Join 0 to 1
> keys:
> 0 _col1 (type: int)
> 1 _col0 (type: int)
> outputColumnNames: _col0
> input vertices:
> 1 Map 3
> Statistics: Num rows: 9 Data size: 36 Basic stats: COMPLETE Column stats: COMPLETE
> Reduce Output Operator
> key expressions: _col0 (type: int)
> null sort order: z
> sort order: +
> Statistics: Num rows: 9 Data size: 36 Basic stats: COMPLETE Column stats: COMPLETE
> Execution mode: vectorized
> Map 3
> Map Operator Tree:
> TableScan
> alias: alltypessmall
> Statistics: Num rows: 3 Data size: 142 Basic stats: COMPLETE Column stats: COMPLETE
> Select Operator
> Statistics: Num rows: 3 Data size: 142 Basic stats: COMPLETE Column stats: COMPLETE
> Group By Operator
> aggregations: count()
> minReductionHashAggr: 0.6666666
> mode: hash
> outputColumnNames: _col0
> Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
> Reduce Output Operator
> null sort order:
> sort order:
> Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
> value expressions: _col0 (type: bigint)
> Filter Operator
> predicate: int_col is not null (type: boolean)
> Statistics: Num rows: 3 Data size: 12 Basic stats: COMPLETE Column stats: COMPLETE
> Select Operator
> expressions: int_col (type: int)
> outputColumnNames: _col0
> Statistics: Num rows: 3 Data size: 12 Basic stats: COMPLETE Column stats: COMPLETE
> Reduce Output Operator
> key expressions: _col0 (type: int)
> null sort order: z
> sort order: +
> Map-reduce partition columns: _col0 (type: int)
> Statistics: Num rows: 3 Data size: 12 Basic stats: COMPLETE Column stats: COMPLETE
> Execution mode: vectorized
> Reducer 2
> Execution mode: vectorized
> Reduce Operator Tree:
> Select Operator
> expressions: KEY.reducesinkkey0 (type: int)
> outputColumnNames: _col0
> Statistics: Num rows: 9 Data size: 36 Basic stats: COMPLETE Column stats: COMPLETE
> File Output Operator
> compressed: false
> Statistics: Num rows: 9 Data size: 36 Basic stats: COMPLETE Column stats: COMPLETE
> table:
> input format: org.apache.hadoop.mapred.SequenceFileInputFormat
> output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
> serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
> Reducer 4
> Execution mode: vectorized
> Reduce Operator Tree:
> Group By Operator
> aggregations: count(VALUE._col0)
> mode: mergepartial
> outputColumnNames: _col0
> Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
> Filter Operator
> predicate: (sq_count_check(_col0) <= 1) (type: boolean)
> Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
> Select Operator
> Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
> Reduce Output Operator
> null sort order:
> sort order:
> Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
> Stage: Stage-0
> Fetch Operator
> limit: -1
> Processor Tree:
> ListSink
> {code}
> Issue is that Map 3/Reducer 4 group by is producing incorrect count. Expected output in this case is *3* but actual seems to be *1*.
>
> *CBO plan*
> {code:sql}
> HiveSortLimit(sort0=[$0], dir0=[ASC])
> HiveProject(id=[$0])
> HiveJoin(condition=[=($1, $3)], joinType=[inner], algorithm=[none], cost=[{6.0 rows, 0.0 cpu, 0.0 io}])
> HiveJoin(condition=[true], joinType=[inner], algorithm=[none], cost=[{4.0 rows, 0.0 cpu, 0.0 io}])
> HiveProject(id=[$0], int_col=[$4])
> HiveFilter(condition=[IS NOT NULL($4)])
> HiveTableScan(table=[[cdpd_20765, alltypessmall]], table:alias=[alltypessmall])
> HiveProject(cnt=[$0])
> HiveFilter(condition=[<=(sq_count_check($0), 1)])
> HiveProject(cnt=[$0])
> HiveAggregate(group=[{}], cnt=[COUNT()])
> HiveTableScan(table=[[cdpd_20765, alltypessmall]], table:alias=[alltypessmall])
> HiveProject(int_col=[$4])
> HiveFilter(condition=[IS NOT NULL($4)])
> HiveTableScan(table=[[cdpd_20765, alltypessmall]], table:alias=[alltypessmall])
> {code}
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)