You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@hive.apache.org by "Rajesh Balamohan (Jira)" <ji...@apache.org> on 2022/09/07 03:15:00 UTC
[jira] [Created] (HIVE-26520) Improve dynamic partition pruning operator when subqueries are involved

Rajesh Balamohan created HIVE-26520:
---------------------------------------

             Summary: Improve dynamic partition pruning operator when subqueries are involved
                 Key: HIVE-26520
                 URL: https://issues.apache.org/jira/browse/HIVE-26520
             Project: Hive
          Issue Type: Improvement
          Components: HiveServer2
            Reporter: Rajesh Balamohan
         Attachments: q58_test.pdf

Dynamic partition pruning operator sends entire date_dim table and due to this, entire catalog_sales data is scanned causing huge IO and decoding cost.

If dynamic partition pruning operator was created after the "date_dim" subquery has been evaluated, it would have saved huge IO cost. E.g It would have just taken 6-7 partition scans instead of 1800+ partitions.

 

Consider the following simplified query as example

{noformat}

select count(*) from (select i_item_id item_id
        ,sum(cs_ext_sales_price) cs_item_rev
  from catalog_sales
      ,item
      ,date_dim
 where cs_item_sk = i_item_sk
  and  d_date in (select d_date
                  from date_dim
                  where d_week_seq = (select d_week_seq 
                                      from date_dim
                                      where d_date = '1998-02-21'))
  and  cs_sold_date_sk = d_date_sk
 group by i_item_id) a;
 

CBO PLAN:
HiveAggregate(group=[{}], agg#0=[count()])
  HiveProject(i_item_id=[$0])
    HiveAggregate(group=[{4}])
      HiveSemiJoin(condition=[=($6, $7)], joinType=[semi])
        HiveJoin(condition=[=($2, $5)], joinType=[inner])
          HiveJoin(condition=[=($0, $3)], joinType=[inner])
            HiveProject(cs_item_sk=[$14], cs_ext_sales_price=[$22], cs_sold_date_sk=[$33])
              HiveFilter(condition=[AND(IS NOT NULL($33), IS NOT NULL($14))])
                HiveTableScan(table=[[tpcds_bin_partitioned_orc_10000_external, catalog_sales]], table:alias=[catalog_sales])
            HiveProject(i_item_sk=[$0], i_item_id=[$1])
              HiveFilter(condition=[IS NOT NULL($0)])
                HiveTableScan(table=[[tpcds_bin_partitioned_orc_10000_external, item]], table:alias=[item])
          HiveProject(d_date_sk=[$0], d_date=[$2])
            HiveFilter(condition=[AND(IS NOT NULL($2), IS NOT NULL($0))])
              HiveTableScan(table=[[tpcds_bin_partitioned_orc_10000_external, date_dim]], table:alias=[date_dim])
        HiveProject(d_date=[$0])
          HiveJoin(condition=[=($1, $3)], joinType=[inner])
            HiveJoin(condition=[true], joinType=[inner])
              HiveProject(d_date=[$2], d_week_seq=[$4])
                HiveFilter(condition=[AND(IS NOT NULL($2), IS NOT NULL($4))])
                  HiveTableScan(table=[[tpcds_bin_partitioned_orc_10000_external, date_dim]], table:alias=[date_dim])
              HiveProject(cnt=[$0])
                HiveFilter(condition=[<=(sq_count_check($0), 1)])
                  HiveProject(cnt=[$0])
                    HiveAggregate(group=[{}], cnt=[COUNT()])
                      HiveFilter(condition=[=($2, 1998-02-21)])
                        HiveTableScan(table=[[tpcds_bin_partitioned_orc_10000_external, date_dim]], table:alias=[date_dim])
            HiveProject(d_week_seq=[$4])
              HiveFilter(condition=[AND(=($2, 1998-02-21), IS NOT NULL($4))])
                HiveTableScan(table=[[tpcds_bin_partitioned_orc_10000_external, date_dim]], table:alias=[date_dim])
{noformat}
 
I will attach the formatted plan for reference as well. If planner generated the dynamic partition pruning event after "date_dim" got evaluated in "Map 7", it would be been very efficient.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)