You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@hive.apache.org by "Zhihua Deng (Jira)" <ji...@apache.org> on 2020/07/22 12:09:00 UTC
[jira] [Comment Edited] (HIVE-23893) Extract deterministic conditions for pdd when the predicate contains non-deterministic function

    [ https://issues.apache.org/jira/browse/HIVE-23893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17162744#comment-17162744 ] 

Zhihua Deng edited comment on HIVE-23893 at 7/22/20, 12:08 PM:
---------------------------------------------------------------

[~pgaref] the problem can be reproduced on master,  like query:

set hive.cbo.enable=false;
 create table a(k string) partitioned by(hs int);
 create table b(k string) partitioned by(hs int);
 explain extended select a.**, b.** from a join b on a.k = b.k where rand(100) < 0.1 and a.hs = 11 and b.hs = 10;

Query Plan:
{noformat}
STAGE PLANS:
  Stage: Stage-1
    Tez
#### A masked pattern was here ####
      Edges:
        Reducer 2 <- Map 1 (SIMPLE_EDGE), Map 3 (SIMPLE_EDGE)
#### A masked pattern was here ####
      Vertices:
        Map 1 
            Map Operator Tree:
                TableScan
                  alias: a
                  filterExpr: k is not null (type: boolean)
                  Statistics: Num rows: 1 Data size: 188 Basic stats: COMPLETE Column stats: PARTIAL
                  GatherStats: false
                  Filter Operator
                    isSamplingPred: false
                    predicate: k is not null (type: boolean)
                    Statistics: Num rows: 1 Data size: 188 Basic stats: COMPLETE Column stats: PARTIAL
                    Reduce Output Operator
                      bucketingVersion: 2
                      key expressions: k (type: string)
                      null sort order: z
                      numBuckets: -1
                      sort order: +
                      Map-reduce partition columns: k (type: string)
                      Statistics: Num rows: 1 Data size: 188 Basic stats: COMPLETE Column stats: PARTIAL
                      tag: 0
                      value expressions: hs (type: int)
                      auto parallelism: true
            Execution mode: vectorized, llap
            LLAP IO: unknown
        Map 3 
            Map Operator Tree:
                TableScan
                  alias: b
                  filterExpr: k is not null (type: boolean)
                  Statistics: Num rows: 1 Data size: 188 Basic stats: COMPLETE Column stats: PARTIAL
                  GatherStats: false
                  Filter Operator
                    isSamplingPred: false
                    predicate: k is not null (type: boolean)
                    Statistics: Num rows: 1 Data size: 188 Basic stats: COMPLETE Column stats: PARTIAL
                    Reduce Output Operator
                      bucketingVersion: 2
                      key expressions: k (type: string)
                      null sort order: z
                      numBuckets: -1
                      sort order: +
                      Map-reduce partition columns: k (type: string)
                      Statistics: Num rows: 1 Data size: 188 Basic stats: COMPLETE Column stats: PARTIAL
                      tag: 1
                      value expressions: hs (type: int)
                      auto parallelism: true
            Execution mode: vectorized, llap
            LLAP IO: unknown
        Reducer 2 
            Execution mode: llap
            Needs Tagging: false
            Reduce Operator Tree:
              Merge Join Operator
                condition map:
                     Inner Join 0 to 1
                keys:
                  0 k (type: string)
                  1 k (type: string)
                outputColumnNames: _col0, _col1, _col5, _col6
                Position of Big Table: 0
                Statistics: Num rows: 1 Data size: 376 Basic stats: COMPLETE Column stats: PARTIAL
                Filter Operator
                  isSamplingPred: false
                  predicate: ((rand(100) < 0.1) and (_col1 = 11) and (_col6 = 10)) (type: boolean)
                  Statistics: Num rows: 1 Data size: 376 Basic stats: COMPLETE Column stats: PARTIAL
                  Select Operator
                    expressions: _col0 (type: string), 11 (type: int), _col5 (type: string), 10 (type: int)
                    outputColumnNames: _col0, _col1, _col2, _col3
                    Statistics: Num rows: 1 Data size: 376 Basic stats: COMPLETE Column stats: PARTIAL
                    File Output Operator
                      bucketingVersion: 2
                      compressed: false
                      GlobalTableId: 0
#### A masked pattern was here ####
                      NumFilesPerFileSink: 1
                      Statistics: Num rows: 1 Data size: 376 Basic stats: COMPLETE Column stats: PARTIAL
#### A masked pattern was here ####
                      table:
                          input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                          output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                          properties:
                            bucketing_version -1
                            columns _col0,_col1,_col2,_col3
                            columns.types string:int:string:int
                            escape.delim \
                            hive.serialization.extend.additional.nesting.levels true
                            serialization.escape.crlf true
                            serialization.format 1
                            serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                          serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                      TotalFiles: 1
                      GatherStats: false
                      MultiFileSpray: false{noformat}
 


was (Author: dengzh):
[~pgaref] the problem can be reproduced on master,  like query:

set hive.cbo.enable=false;
create table a(k string) partitioned by(hs int);
create table b(k string) partitioned by(hs int);
explain extended select a.*, b.* from a join b on a.k = b.k where rand(100) < 0.1 and a.hs = 11 and b.hs = 10;

Query Plan:
{noformat}
STAGE PLANS:
  Stage: Stage-1
    Tez
#### A masked pattern was here ####
      Edges:
        Reducer 2 <- Map 1 (SIMPLE_EDGE), Map 3 (SIMPLE_EDGE)
#### A masked pattern was here ####
      Vertices:
        Map 1 
            Map Operator Tree:
                TableScan
                  alias: a
                  filterExpr: k is not null (type: boolean)
                  Statistics: Num rows: 1 Data size: 188 Basic stats: COMPLETE Column stats: PARTIAL
                  GatherStats: false
                  Filter Operator
                    isSamplingPred: false
                    predicate: k is not null (type: boolean)
                    Statistics: Num rows: 1 Data size: 188 Basic stats: COMPLETE Column stats: PARTIAL
                    Reduce Output Operator
                      bucketingVersion: 2
                      key expressions: k (type: string)
                      null sort order: z
                      numBuckets: -1
                      sort order: +
                      Map-reduce partition columns: k (type: string)
                      Statistics: Num rows: 1 Data size: 188 Basic stats: COMPLETE Column stats: PARTIAL
                      tag: 0
                      value expressions: hs (type: int)
                      auto parallelism: true
            Execution mode: vectorized, llap
            LLAP IO: unknown
        Map 3 
            Map Operator Tree:
                TableScan
                  alias: b
                  filterExpr: k is not null (type: boolean)
                  Statistics: Num rows: 1 Data size: 188 Basic stats: COMPLETE Column stats: PARTIAL
                  GatherStats: false
                  Filter Operator
                    isSamplingPred: false
                    predicate: k is not null (type: boolean)
                    Statistics: Num rows: 1 Data size: 188 Basic stats: COMPLETE Column stats: PARTIAL
                    Reduce Output Operator
                      bucketingVersion: 2
                      key expressions: k (type: string)
                      null sort order: z
                      numBuckets: -1
                      sort order: +
                      Map-reduce partition columns: k (type: string)
                      Statistics: Num rows: 1 Data size: 188 Basic stats: COMPLETE Column stats: PARTIAL
                      tag: 1
                      value expressions: hs (type: int)
                      auto parallelism: true
            Execution mode: vectorized, llap
            LLAP IO: unknown
        Reducer 2 
            Execution mode: llap
            Needs Tagging: false
            Reduce Operator Tree:
              Merge Join Operator
                condition map:
                     Inner Join 0 to 1
                keys:
                  0 k (type: string)
                  1 k (type: string)
                outputColumnNames: _col0, _col1, _col5, _col6
                Position of Big Table: 0
                Statistics: Num rows: 1 Data size: 376 Basic stats: COMPLETE Column stats: PARTIAL
                Filter Operator
                  isSamplingPred: false
                  predicate: ((rand(100) < 0.1) and (_col1 = 11) and (_col6 = 10)) (type: boolean)
                  Statistics: Num rows: 1 Data size: 376 Basic stats: COMPLETE Column stats: PARTIAL
                  Select Operator
                    expressions: _col0 (type: string), 11 (type: int), _col5 (type: string), 10 (type: int)
                    outputColumnNames: _col0, _col1, _col2, _col3
                    Statistics: Num rows: 1 Data size: 376 Basic stats: COMPLETE Column stats: PARTIAL
                    File Output Operator
                      bucketingVersion: 2
                      compressed: false
                      GlobalTableId: 0
#### A masked pattern was here ####
                      NumFilesPerFileSink: 1
                      Statistics: Num rows: 1 Data size: 376 Basic stats: COMPLETE Column stats: PARTIAL
#### A masked pattern was here ####
                      table:
                          input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                          output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                          properties:
                            bucketing_version -1
                            columns _col0,_col1,_col2,_col3
                            columns.types string:int:string:int
                            escape.delim \
                            hive.serialization.extend.additional.nesting.levels true
                            serialization.escape.crlf true
                            serialization.format 1
                            serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                          serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                      TotalFiles: 1
                      GatherStats: false
                      MultiFileSpray: false{noformat}
 

> Extract deterministic conditions for pdd when the predicate contains non-deterministic function
> -----------------------------------------------------------------------------------------------
>
>                 Key: HIVE-23893
>                 URL: https://issues.apache.org/jira/browse/HIVE-23893
>             Project: Hive
>          Issue Type: Improvement
>          Components: Logical Optimizer
>            Reporter: Zhihua Deng
>            Priority: Major
>
> Taken the following query for example, assume unix_timestamp is non-deterministic before version 1.3.0:
>   
>  {{SELECT}}
>  {{        from_unixtime(unix_timestamp(a.first_dt), 'yyyyMMdd') AS ft,}}
>  {{        b.game_id AS game_id,}}
>  {{        b.game_name AS game_name,}}
>  {{        count(DISTINCT a.sha1_imei) uv}}
>  {{FROM}}
>  {{        gamesdk_userprofile a}}
>  {{        JOIN game_info_all b ON a.appid = b.dev_app_id}}
>  {{WHERE}}
>  {{        a.date = 20200704}}
>  {{        AND from_unixtime(unix_timestamp(a.first_dt), 'yyyyMMdd') = 20200704}}
>  {{        AND b.date = 20200704}}
>  {{GROUP BY}}
>  {{        from_unixtime(unix_timestamp(a.first_dt), 'yyyyMMdd'),}}
>  {{        b.game_id,}}
>  {{        b.game_name}}
>  {{ORDER BY}}
>  {{        uv DESC}}
>  {{LIMIT 200;}}
>   
>  The predicates(a.date = 20200704, b.date = 20200704) are unable to push down to join op, make the optimizer unable to prune partitions, which may result  to a full scan on tables gamesdk_userprofile and game_info_all.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)