You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@hive.apache.org by "Zhihua Deng (Jira)" <ji...@apache.org> on 2020/07/22 12:09:00 UTC
[jira] [Comment Edited] (HIVE-23893) Extract deterministic
conditions for pdd when the predicate contains non-deterministic function
[ https://issues.apache.org/jira/browse/HIVE-23893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17162744#comment-17162744 ]
Zhihua Deng edited comment on HIVE-23893 at 7/22/20, 12:08 PM:
---------------------------------------------------------------
[~pgaref] the problem can be reproduced on master, like query:
set hive.cbo.enable=false;
create table a(k string) partitioned by(hs int);
create table b(k string) partitioned by(hs int);
explain extended select a.**, b.** from a join b on a.k = b.k where rand(100) < 0.1 and a.hs = 11 and b.hs = 10;
Query Plan:
{noformat}
STAGE PLANS:
Stage: Stage-1
Tez
#### A masked pattern was here ####
Edges:
Reducer 2 <- Map 1 (SIMPLE_EDGE), Map 3 (SIMPLE_EDGE)
#### A masked pattern was here ####
Vertices:
Map 1
Map Operator Tree:
TableScan
alias: a
filterExpr: k is not null (type: boolean)
Statistics: Num rows: 1 Data size: 188 Basic stats: COMPLETE Column stats: PARTIAL
GatherStats: false
Filter Operator
isSamplingPred: false
predicate: k is not null (type: boolean)
Statistics: Num rows: 1 Data size: 188 Basic stats: COMPLETE Column stats: PARTIAL
Reduce Output Operator
bucketingVersion: 2
key expressions: k (type: string)
null sort order: z
numBuckets: -1
sort order: +
Map-reduce partition columns: k (type: string)
Statistics: Num rows: 1 Data size: 188 Basic stats: COMPLETE Column stats: PARTIAL
tag: 0
value expressions: hs (type: int)
auto parallelism: true
Execution mode: vectorized, llap
LLAP IO: unknown
Map 3
Map Operator Tree:
TableScan
alias: b
filterExpr: k is not null (type: boolean)
Statistics: Num rows: 1 Data size: 188 Basic stats: COMPLETE Column stats: PARTIAL
GatherStats: false
Filter Operator
isSamplingPred: false
predicate: k is not null (type: boolean)
Statistics: Num rows: 1 Data size: 188 Basic stats: COMPLETE Column stats: PARTIAL
Reduce Output Operator
bucketingVersion: 2
key expressions: k (type: string)
null sort order: z
numBuckets: -1
sort order: +
Map-reduce partition columns: k (type: string)
Statistics: Num rows: 1 Data size: 188 Basic stats: COMPLETE Column stats: PARTIAL
tag: 1
value expressions: hs (type: int)
auto parallelism: true
Execution mode: vectorized, llap
LLAP IO: unknown
Reducer 2
Execution mode: llap
Needs Tagging: false
Reduce Operator Tree:
Merge Join Operator
condition map:
Inner Join 0 to 1
keys:
0 k (type: string)
1 k (type: string)
outputColumnNames: _col0, _col1, _col5, _col6
Position of Big Table: 0
Statistics: Num rows: 1 Data size: 376 Basic stats: COMPLETE Column stats: PARTIAL
Filter Operator
isSamplingPred: false
predicate: ((rand(100) < 0.1) and (_col1 = 11) and (_col6 = 10)) (type: boolean)
Statistics: Num rows: 1 Data size: 376 Basic stats: COMPLETE Column stats: PARTIAL
Select Operator
expressions: _col0 (type: string), 11 (type: int), _col5 (type: string), 10 (type: int)
outputColumnNames: _col0, _col1, _col2, _col3
Statistics: Num rows: 1 Data size: 376 Basic stats: COMPLETE Column stats: PARTIAL
File Output Operator
bucketingVersion: 2
compressed: false
GlobalTableId: 0
#### A masked pattern was here ####
NumFilesPerFileSink: 1
Statistics: Num rows: 1 Data size: 376 Basic stats: COMPLETE Column stats: PARTIAL
#### A masked pattern was here ####
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
properties:
bucketing_version -1
columns _col0,_col1,_col2,_col3
columns.types string:int:string:int
escape.delim \
hive.serialization.extend.additional.nesting.levels true
serialization.escape.crlf true
serialization.format 1
serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
TotalFiles: 1
GatherStats: false
MultiFileSpray: false{noformat}
was (Author: dengzh):
[~pgaref] the problem can be reproduced on master, like query:
set hive.cbo.enable=false;
create table a(k string) partitioned by(hs int);
create table b(k string) partitioned by(hs int);
explain extended select a.*, b.* from a join b on a.k = b.k where rand(100) < 0.1 and a.hs = 11 and b.hs = 10;
Query Plan:
{noformat}
STAGE PLANS:
Stage: Stage-1
Tez
#### A masked pattern was here ####
Edges:
Reducer 2 <- Map 1 (SIMPLE_EDGE), Map 3 (SIMPLE_EDGE)
#### A masked pattern was here ####
Vertices:
Map 1
Map Operator Tree:
TableScan
alias: a
filterExpr: k is not null (type: boolean)
Statistics: Num rows: 1 Data size: 188 Basic stats: COMPLETE Column stats: PARTIAL
GatherStats: false
Filter Operator
isSamplingPred: false
predicate: k is not null (type: boolean)
Statistics: Num rows: 1 Data size: 188 Basic stats: COMPLETE Column stats: PARTIAL
Reduce Output Operator
bucketingVersion: 2
key expressions: k (type: string)
null sort order: z
numBuckets: -1
sort order: +
Map-reduce partition columns: k (type: string)
Statistics: Num rows: 1 Data size: 188 Basic stats: COMPLETE Column stats: PARTIAL
tag: 0
value expressions: hs (type: int)
auto parallelism: true
Execution mode: vectorized, llap
LLAP IO: unknown
Map 3
Map Operator Tree:
TableScan
alias: b
filterExpr: k is not null (type: boolean)
Statistics: Num rows: 1 Data size: 188 Basic stats: COMPLETE Column stats: PARTIAL
GatherStats: false
Filter Operator
isSamplingPred: false
predicate: k is not null (type: boolean)
Statistics: Num rows: 1 Data size: 188 Basic stats: COMPLETE Column stats: PARTIAL
Reduce Output Operator
bucketingVersion: 2
key expressions: k (type: string)
null sort order: z
numBuckets: -1
sort order: +
Map-reduce partition columns: k (type: string)
Statistics: Num rows: 1 Data size: 188 Basic stats: COMPLETE Column stats: PARTIAL
tag: 1
value expressions: hs (type: int)
auto parallelism: true
Execution mode: vectorized, llap
LLAP IO: unknown
Reducer 2
Execution mode: llap
Needs Tagging: false
Reduce Operator Tree:
Merge Join Operator
condition map:
Inner Join 0 to 1
keys:
0 k (type: string)
1 k (type: string)
outputColumnNames: _col0, _col1, _col5, _col6
Position of Big Table: 0
Statistics: Num rows: 1 Data size: 376 Basic stats: COMPLETE Column stats: PARTIAL
Filter Operator
isSamplingPred: false
predicate: ((rand(100) < 0.1) and (_col1 = 11) and (_col6 = 10)) (type: boolean)
Statistics: Num rows: 1 Data size: 376 Basic stats: COMPLETE Column stats: PARTIAL
Select Operator
expressions: _col0 (type: string), 11 (type: int), _col5 (type: string), 10 (type: int)
outputColumnNames: _col0, _col1, _col2, _col3
Statistics: Num rows: 1 Data size: 376 Basic stats: COMPLETE Column stats: PARTIAL
File Output Operator
bucketingVersion: 2
compressed: false
GlobalTableId: 0
#### A masked pattern was here ####
NumFilesPerFileSink: 1
Statistics: Num rows: 1 Data size: 376 Basic stats: COMPLETE Column stats: PARTIAL
#### A masked pattern was here ####
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
properties:
bucketing_version -1
columns _col0,_col1,_col2,_col3
columns.types string:int:string:int
escape.delim \
hive.serialization.extend.additional.nesting.levels true
serialization.escape.crlf true
serialization.format 1
serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
TotalFiles: 1
GatherStats: false
MultiFileSpray: false{noformat}
> Extract deterministic conditions for pdd when the predicate contains non-deterministic function
> -----------------------------------------------------------------------------------------------
>
> Key: HIVE-23893
> URL: https://issues.apache.org/jira/browse/HIVE-23893
> Project: Hive
> Issue Type: Improvement
> Components: Logical Optimizer
> Reporter: Zhihua Deng
> Priority: Major
>
> Taken the following query for example, assume unix_timestamp is non-deterministic before version 1.3.0:
>
> {{SELECT}}
> {{ from_unixtime(unix_timestamp(a.first_dt), 'yyyyMMdd') AS ft,}}
> {{ b.game_id AS game_id,}}
> {{ b.game_name AS game_name,}}
> {{ count(DISTINCT a.sha1_imei) uv}}
> {{FROM}}
> {{ gamesdk_userprofile a}}
> {{ JOIN game_info_all b ON a.appid = b.dev_app_id}}
> {{WHERE}}
> {{ a.date = 20200704}}
> {{ AND from_unixtime(unix_timestamp(a.first_dt), 'yyyyMMdd') = 20200704}}
> {{ AND b.date = 20200704}}
> {{GROUP BY}}
> {{ from_unixtime(unix_timestamp(a.first_dt), 'yyyyMMdd'),}}
> {{ b.game_id,}}
> {{ b.game_name}}
> {{ORDER BY}}
> {{ uv DESC}}
> {{LIMIT 200;}}
>
> The predicates(a.date = 20200704, b.date = 20200704) are unable to push down to join op, make the optimizer unable to prune partitions, which may result to a full scan on tables gamesdk_userprofile and game_info_all.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)