You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@hive.apache.org by "Rajesh Balamohan (Jira)" <ji...@apache.org> on 2022/09/07 03:15:00 UTC
[jira] [Created] (HIVE-26520) Improve dynamic partition pruning operator when subqueries are involved
Rajesh Balamohan created HIVE-26520:
---------------------------------------
Summary: Improve dynamic partition pruning operator when subqueries are involved
Key: HIVE-26520
URL: https://issues.apache.org/jira/browse/HIVE-26520
Project: Hive
Issue Type: Improvement
Components: HiveServer2
Reporter: Rajesh Balamohan
Attachments: q58_test.pdf
Dynamic partition pruning operator sends entire date_dim table and due to this, entire catalog_sales data is scanned causing huge IO and decoding cost.
If dynamic partition pruning operator was created after the "date_dim" subquery has been evaluated, it would have saved huge IO cost. E.g It would have just taken 6-7 partition scans instead of 1800+ partitions.
Consider the following simplified query as example
{noformat}
select count(*) from (select i_item_id item_id
,sum(cs_ext_sales_price) cs_item_rev
from catalog_sales
,item
,date_dim
where cs_item_sk = i_item_sk
and d_date in (select d_date
from date_dim
where d_week_seq = (select d_week_seq
from date_dim
where d_date = '1998-02-21'))
and cs_sold_date_sk = d_date_sk
group by i_item_id) a;
CBO PLAN:
HiveAggregate(group=[{}], agg#0=[count()])
HiveProject(i_item_id=[$0])
HiveAggregate(group=[{4}])
HiveSemiJoin(condition=[=($6, $7)], joinType=[semi])
HiveJoin(condition=[=($2, $5)], joinType=[inner])
HiveJoin(condition=[=($0, $3)], joinType=[inner])
HiveProject(cs_item_sk=[$14], cs_ext_sales_price=[$22], cs_sold_date_sk=[$33])
HiveFilter(condition=[AND(IS NOT NULL($33), IS NOT NULL($14))])
HiveTableScan(table=[[tpcds_bin_partitioned_orc_10000_external, catalog_sales]], table:alias=[catalog_sales])
HiveProject(i_item_sk=[$0], i_item_id=[$1])
HiveFilter(condition=[IS NOT NULL($0)])
HiveTableScan(table=[[tpcds_bin_partitioned_orc_10000_external, item]], table:alias=[item])
HiveProject(d_date_sk=[$0], d_date=[$2])
HiveFilter(condition=[AND(IS NOT NULL($2), IS NOT NULL($0))])
HiveTableScan(table=[[tpcds_bin_partitioned_orc_10000_external, date_dim]], table:alias=[date_dim])
HiveProject(d_date=[$0])
HiveJoin(condition=[=($1, $3)], joinType=[inner])
HiveJoin(condition=[true], joinType=[inner])
HiveProject(d_date=[$2], d_week_seq=[$4])
HiveFilter(condition=[AND(IS NOT NULL($2), IS NOT NULL($4))])
HiveTableScan(table=[[tpcds_bin_partitioned_orc_10000_external, date_dim]], table:alias=[date_dim])
HiveProject(cnt=[$0])
HiveFilter(condition=[<=(sq_count_check($0), 1)])
HiveProject(cnt=[$0])
HiveAggregate(group=[{}], cnt=[COUNT()])
HiveFilter(condition=[=($2, 1998-02-21)])
HiveTableScan(table=[[tpcds_bin_partitioned_orc_10000_external, date_dim]], table:alias=[date_dim])
HiveProject(d_week_seq=[$4])
HiveFilter(condition=[AND(=($2, 1998-02-21), IS NOT NULL($4))])
HiveTableScan(table=[[tpcds_bin_partitioned_orc_10000_external, date_dim]], table:alias=[date_dim])
{noformat}
I will attach the formatted plan for reference as well. If planner generated the dynamic partition pruning event after "date_dim" got evaluated in "Map 7", it would be been very efficient.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)