Posted to issues@hive.apache.org by "Vineet Garg (JIRA)" <ji...@apache.org> on 2019/03/04 22:00:00 UTC
[jira] [Assigned] (HIVE-21381) Improve column pruning
[ https://issues.apache.org/jira/browse/HIVE-21381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Vineet Garg reassigned HIVE-21381:
----------------------------------
> Improve column pruning
> ----------------------
>
> Key: HIVE-21381
> URL: https://issues.apache.org/jira/browse/HIVE-21381
> Project: Hive
> Issue Type: Improvement
> Reporter: Vineet Garg
> Assignee: Vineet Garg
> Priority: Major
>
> The following query generates a plan where the right side of the HiveSemiJoin contains a HiveProject->HiveFilter->HiveProject chain, and the bottom HiveProject contains extra columns that can be pruned.
> {code:sql}
> explain cbo with frequent_ss_items as
> (select substr(i_item_desc,1,30) itemdesc,i_item_sk item_sk,d_date solddate,count(*) cnt
> from store_sales
> ,date_dim
> ,item
> where ss_sold_date_sk = d_date_sk
> and ss_item_sk = i_item_sk
> and d_year in (1999,1999+1,1999+2,1999+3)
> group by substr(i_item_desc,1,30),i_item_sk,d_date
> having count(*) >4)
> select sum(sales)
> from ((select cs_quantity*cs_list_price sales
> from catalog_sales
> ,date_dim
> where d_year = 1999
> and d_moy = 1
> and cs_sold_date_sk = d_date_sk
> and cs_item_sk in (select item_sk from frequent_ss_items))) subq limit 100;
> {code}
> CBO Plan:
> {code}
> HiveSortLimit(fetch=[100])
>   HiveProject($f0=[$0])
>     HiveAggregate(group=[{}], agg#0=[sum($0)])
>       HiveProject(sales=[*(CAST($2):DECIMAL(10, 0), $3)])
>         HiveSemiJoin(condition=[=($1, $5)], joinType=[inner])
>           HiveJoin(condition=[=($0, $4)], joinType=[inner], algorithm=[none], cost=[{2.0 rows, 0.0 cpu, 0.0 io}])
>             HiveProject(cs_sold_date_sk=[$0], cs_item_sk=[$15], cs_quantity=[$18], cs_list_price=[$20])
>               HiveFilter(condition=[IS NOT NULL($0)])
>                 HiveTableScan(table=[[perf_constraints, catalog_sales]], table:alias=[catalog_sales])
>             HiveProject(d_date_sk=[$0])
>               HiveFilter(condition=[AND(=($6, 1999), =($8, 1))])
>                 HiveTableScan(table=[[perf_constraints, date_dim]], table:alias=[date_dim])
>           HiveProject(i_item_sk=[$1])
>             HiveFilter(condition=[>($3, 4)])
>               HiveProject(substr=[$2], i_item_sk=[$1], d_date=[$0], $f3=[$3])
>                 HiveAggregate(group=[{3, 4, 5}], agg#0=[count()])
>                   HiveJoin(condition=[=($1, $4)], joinType=[inner], algorithm=[none], cost=[{2.0 rows, 0.0 cpu, 0.0 io}])
>                     HiveJoin(condition=[=($0, $2)], joinType=[inner], algorithm=[none], cost=[{2.0 rows, 0.0 cpu, 0.0 io}])
>                       HiveProject(ss_sold_date_sk=[$0], ss_item_sk=[$2])
>                         HiveFilter(condition=[IS NOT NULL($0)])
>                           HiveTableScan(table=[[perf_constraints, store_sales]], table:alias=[store_sales])
>                       HiveProject(d_date_sk=[$0], d_date=[$2])
>                         HiveFilter(condition=[IN($6, 1999, 2000, 2001, 2002)])
>                           HiveTableScan(table=[[perf_constraints, date_dim]], table:alias=[date_dim])
>                     HiveProject(i_item_sk=[$0], substr=[substr($4, 1, 30)])
>                       HiveTableScan(table=[[perf_constraints, item]], table:alias=[item])
> {code}
> Only {{i_item_sk}} and {{$f3}} (the count) are used further up in the plan, therefore the columns {{substr}} and {{d_date}} can be pruned from the bottom HiveProject.
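> For illustration, after pruning, the right side of the HiveSemiJoin would keep only the referenced columns, with the filter and top-project ordinals rebased accordingly. This is a hand-written sketch of the expected shape, not actual optimizer output:
> {code}
> HiveProject(i_item_sk=[$0])
>   HiveFilter(condition=[>($1, 4)])
>     HiveProject(i_item_sk=[$1], $f3=[$3])
>       HiveAggregate(group=[{3, 4, 5}], agg#0=[count()])
>         ...
> {code}
> (A later rule could further collapse the remaining Project/Filter pair, but the column pruning itself is the subject of this issue.)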
> Note that the above plan was generated with the HIVE-21340 patch applied.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)