You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@hive.apache.org by "Rajesh Balamohan (Jira)" <ji...@apache.org> on 2019/09/30 07:14:00 UTC

[jira] [Created] (HIVE-22269) Missing stats in the operator with "hive.optimize.sort.dynamic.partition" (SortedDynPartitionOptimizer) misestimates reducer count

Rajesh Balamohan created HIVE-22269:
---------------------------------------

             Summary: Missing stats in the operator with "hive.optimize.sort.dynamic.partition" (SortedDynPartitionOptimizer) misestimates reducer count
                 Key: HIVE-22269
                 URL: https://issues.apache.org/jira/browse/HIVE-22269
             Project: Hive
          Issue Type: Bug
          Components: Statistics
            Reporter: Rajesh Balamohan


{{hive.optimize.sort.dynamic.partition=true}} introduces new stage to reduce number of writes in dynamic partitioning usecase. Earlier {{SortedDynPartitionOptimizer}} added this new operator via {{Optimizer.java}} and the stats for the newly added operator was populated via {{StatsRulesProcFactory$ReduceSinkStatsRule}}.

However, with "HIVE-20703" this got changed. This is moved to {{TezCompiler}} for cost based decision. Though the operator gets added correctly, the stats for this does not get added (as it runs after runStatsAnnotation()).

This causes reducer count to be mis-estimated in the query.
{noformat}
e.g For the following query, reducer_2 would be estimated as "2" instead of "1009". This causes huge delay in the runtime.

explain 
from tpcds_xtext_1000.store_sales ss
insert overwrite table store_sales partition (ss_sold_date_sk)
select
        ss.ss_sold_time_sk,
        ss.ss_item_sk,
        ss.ss_customer_sk,
        ss.ss_cdemo_sk,
        ss.ss_hdemo_sk,
        ss.ss_addr_sk,
        ss.ss_store_sk,
        ss.ss_promo_sk,
        ss.ss_ticket_number,
        ss.ss_quantity,
        ss.ss_wholesale_cost,
        ss.ss_list_price,
        ss.ss_sales_price,
        ss.ss_ext_discount_amt,
        ss.ss_ext_sales_price,
        ss.ss_ext_wholesale_cost,
        ss.ss_ext_list_price,
        ss.ss_ext_tax,
        ss.ss_coupon_amt,
        ss.ss_net_paid,
        ss.ss_net_paid_inc_tax,
        ss.ss_net_profit,
        ss.ss_sold_date_sk
        where ss.ss_sold_date_sk is not null
insert overwrite table store_sales partition (ss_sold_date_sk)
select
        ss.ss_sold_time_sk,
        ss.ss_item_sk,
        ss.ss_customer_sk,
        ss.ss_cdemo_sk,
        ss.ss_hdemo_sk,
        ss.ss_addr_sk,
        ss.ss_store_sk,
        ss.ss_promo_sk,
        ss.ss_ticket_number,
        ss.ss_quantity,
        ss.ss_wholesale_cost,
        ss.ss_list_price,
        ss.ss_sales_price,
        ss.ss_ext_discount_amt,
        ss.ss_ext_sales_price,
        ss.ss_ext_wholesale_cost,
        ss.ss_ext_list_price,
        ss.ss_ext_tax,
        ss.ss_coupon_amt,
        ss.ss_net_paid,
        ss.ss_net_paid_inc_tax,
        ss.ss_net_profit,
        ss.ss_sold_date_sk
        where ss.ss_sold_date_sk is null
        distribute by ss.ss_item_sk
;
{noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)