You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@hive.apache.org by "Ashutosh Chauhan (JIRA)" <ji...@apache.org> on 2016/09/06 22:22:21 UTC
[jira] [Commented] (HIVE-14708) Optimizer: NOT IN query scans one input two times

    [ https://issues.apache.org/jira/browse/HIVE-14708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15468804#comment-15468804 ] 

Ashutosh Chauhan commented on HIVE-14708:
-----------------------------------------

I am not sure I follow. We need to know if there are any nulls in subqueries or not and we need to know that before we start generating resultset for outer query. Thus we count null first and then use that count to do a filter. I am not sure how is it possible to do this in a single scan. For reference: this rewrite is described on page 10 on [design doc | https://issues.apache.org/jira/secure/attachment/12614003/SubQuerySpec.pdf] on HIVE-784

> Optimizer: NOT IN query scans one input two times
> -------------------------------------------------
>
>                 Key: HIVE-14708
>                 URL: https://issues.apache.org/jira/browse/HIVE-14708
>             Project: Hive
>          Issue Type: Bug
>          Components: Logical Optimizer
>    Affects Versions: 2.2.0
>            Reporter: Gopal V
>            Priority: Critical
>
> {code}
> hive (tpcds_bin_partitioned_orc_1000)> explain select count(1) from store_sales where ss_sold_date_sk NOT in (select d_date_sk from date_dim);
>     Stage-1
>       Reducer 2 vectorized, llap
>       File Output Operator [FS_52]
>         Group By Operator [GBY_51] (rows=1 width=8)
>           Output:["_col0"],aggregations:["count(VALUE._col0)"]
>         <-Map 1 [SIMPLE_EDGE] vectorized, llap
>           SHUFFLE [RS_50]
>             Group By Operator [GBY_49] (rows=1 width=8)
>               Output:["_col0"],aggregations:["count(1)"]
>               Select Operator [SEL_48] (rows=1 width=4)
>                 Filter Operator [FIL_47] (rows=1 width=4)
>                   predicate:_col2 is null
>                   Map Join Operator [MAPJOIN_46] (rows=2879987999 width=4)
>                     Conds:MAPJOIN_45._col0=RS_43._col0(Left Outer),Output:["_col2"]
>                   <-Map 5 [BROADCAST_EDGE] vectorized, llap
>                     BROADCAST [RS_43]
>                       PartitionCols:_col0
>                       Select Operator [SEL_42] (rows=73049 width=4)
>                         Output:["_col0"]
>                         TableScan [TS_11] (rows=73049 width=4)
>                           tpcds_bin_partitioned_orc_1000@date_dim,date_dim,Tbl:COMPLETE,Col:COMPLETE,Output:["d_date_sk"]
>                   <-Map Join Operator [MAPJOIN_45] (rows=2879987999 width=4)
>                       Conds:(Inner),Output:["_col0"]
>                     <-Reducer 4 [BROADCAST_EDGE] vectorized, llap
>                       BROADCAST [RS_41]
>                         Select Operator [SEL_40] (rows=1 width=8)
>                           Filter Operator [FIL_39] (rows=1 width=8)
>                             predicate:(_col0 = 0)
>                             Group By Operator [GBY_38] (rows=1 width=8)
>                               Output:["_col0"],aggregations:["count(VALUE._col0)"]
>                             <-Map 3 [SIMPLE_EDGE] vectorized, llap
>                               SHUFFLE [RS_37]
>                                 Group By Operator [GBY_36] (rows=1 width=8)
>                                   Output:["_col0"],aggregations:["count()"]
>                                   Select Operator [SEL_35] (rows=1 width=4)
>                                     Filter Operator [FIL_34] (rows=1 width=4)
>                                       predicate:d_date_sk is null
>                                       TableScan [TS_2] (rows=73049 width=4)
>                                         tpcds_bin_partitioned_orc_1000@date_dim,date_dim,Tbl:COMPLETE,Col:COMPLETE,Output:["d_date_sk"]
>                     <-Select Operator [SEL_44] (rows=2879987999 width=4)
>                         Output:["_col0"]
>                         TableScan [TS_0] (rows=2879987999 width=92)
>                           tpcds_bin_partitioned_orc_1000@store_sales,store_sales,Tbl:COMPLETE,Col:COMPLETE
> {code}
> The 2nd scan is merely to count the number of NULLs and has {{predicate:d_date_sk is null}} in the operator.
> The NULL checks can be done inline with the NOT-NULL codepath instead of producing 2 independent full-scans of the date_dim table.
> This is not significant in a scenario like the above where the small table side is an actual HDFS table, but entirely throttles performance when the small side is actually an expensive aggregate.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)