You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@hive.apache.org by "Rajesh Balamohan (Jira)" <ji...@apache.org> on 2019/09/30 07:14:00 UTC
[jira] [Created] (HIVE-22269) Missing stats in the operator with
"hive.optimize.sort.dynamic.partition" (SortedDynPartitionOptimizer)
misestimates reducer count
Rajesh Balamohan created HIVE-22269:
---------------------------------------
Summary: Missing stats in the operator with "hive.optimize.sort.dynamic.partition" (SortedDynPartitionOptimizer) misestimates reducer count
Key: HIVE-22269
URL: https://issues.apache.org/jira/browse/HIVE-22269
Project: Hive
Issue Type: Bug
Components: Statistics
Reporter: Rajesh Balamohan
{{hive.optimize.sort.dynamic.partition=true}} introduces new stage to reduce number of writes in dynamic partitioning usecase. Earlier {{SortedDynPartitionOptimizer}} added this new operator via {{Optimizer.java}} and the stats for the newly added operator was populated via {{StatsRulesProcFactory$ReduceSinkStatsRule}}.
However, with "HIVE-20703" this got changed. This is moved to {{TezCompiler}} for cost based decision. Though the operator gets added correctly, the stats for this does not get added (as it runs after runStatsAnnotation()).
This causes reducer count to be mis-estimated in the query.
{noformat}
e.g For the following query, reducer_2 would be estimated as "2" instead of "1009". This causes huge delay in the runtime.
explain
from tpcds_xtext_1000.store_sales ss
insert overwrite table store_sales partition (ss_sold_date_sk)
select
ss.ss_sold_time_sk,
ss.ss_item_sk,
ss.ss_customer_sk,
ss.ss_cdemo_sk,
ss.ss_hdemo_sk,
ss.ss_addr_sk,
ss.ss_store_sk,
ss.ss_promo_sk,
ss.ss_ticket_number,
ss.ss_quantity,
ss.ss_wholesale_cost,
ss.ss_list_price,
ss.ss_sales_price,
ss.ss_ext_discount_amt,
ss.ss_ext_sales_price,
ss.ss_ext_wholesale_cost,
ss.ss_ext_list_price,
ss.ss_ext_tax,
ss.ss_coupon_amt,
ss.ss_net_paid,
ss.ss_net_paid_inc_tax,
ss.ss_net_profit,
ss.ss_sold_date_sk
where ss.ss_sold_date_sk is not null
insert overwrite table store_sales partition (ss_sold_date_sk)
select
ss.ss_sold_time_sk,
ss.ss_item_sk,
ss.ss_customer_sk,
ss.ss_cdemo_sk,
ss.ss_hdemo_sk,
ss.ss_addr_sk,
ss.ss_store_sk,
ss.ss_promo_sk,
ss.ss_ticket_number,
ss.ss_quantity,
ss.ss_wholesale_cost,
ss.ss_list_price,
ss.ss_sales_price,
ss.ss_ext_discount_amt,
ss.ss_ext_sales_price,
ss.ss_ext_wholesale_cost,
ss.ss_ext_list_price,
ss.ss_ext_tax,
ss.ss_coupon_amt,
ss.ss_net_paid,
ss.ss_net_paid_inc_tax,
ss.ss_net_profit,
ss.ss_sold_date_sk
where ss.ss_sold_date_sk is null
distribute by ss.ss_item_sk
;
{noformat}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)