You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@hive.apache.org by "liyunzhang_intel (JIRA)" <ji...@apache.org> on 2017/09/12 08:23:00 UTC
[jira] [Commented] (HIVE-17474) Poor Performance about subquery like DS/query70

    [ https://issues.apache.org/jira/browse/HIVE-17474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16162657#comment-16162657 ] 

liyunzhang_intel commented on HIVE-17474:
-----------------------------------------

the execution plan of hive on spark about DS/query70 is [attached|https://issues.apache.org/jira/secure/attachment/12886590/explain.70.vec].
Investigate the problem, i found that several points
1. the statistics for sub-query is not correct, it estimates nearly 36g about the result while actually the result is very small(nearly 30 rows about state info). Because of this, the join between part1 and part2(see jira description) is common join not map join. Maybe the calculation of statistics estimation need be more intelligent in such complex sub-query.
{code}
  Reducer 12 
            Reduce Operator Tree:
              Select Operator
                expressions: KEY.reducesinkkey0 (type: string), KEY.reducesinkkey1 (type: double)
                outputColumnNames: _col0, _col1
                Statistics: Num rows: 4991930471 Data size: 109822470377 Basic stats: COMPLETE Column stats: NONE
                PTF Operator
                  Function definitions:
                      Input definition
                        input alias: ptf_0
                        output shape: _col0: string, _col1: double
                        type: WINDOWING
                      Windowing table definition
                        input alias: ptf_1
                        name: windowingtablefunction
                        order by: _col1 DESC NULLS LAST
                        partition by: _col0
                        raw input shape:
                        window functions:
                            window function definition
                              alias: rank_window_0
                              arguments: _col1
                              name: rank
                              window function: GenericUDAFRankEvaluator
                              window frame: PRECEDING(MAX)~FOLLOWING(MAX)
                              isPivotResult: true
                  Statistics: Num rows: 4991930471 Data size: 109822470377 Basic stats: COMPLETE Column stats: NONE
                  Filter Operator
                    predicate: (rank_window_0 <= 5) (type: boolean)
                    Statistics: Num rows: 1663976823 Data size: 36607490111 Basic stats: COMPLETE Column stats: NONE
                    Select Operator
                      expressions: _col0 (type: string)
                      outputColumnNames: _col0
                      Statistics: Num rows: 1663976823 Data size: 36607490111 Basic stats: COMPLETE Column stats: NONE
                      Reduce Output Operator
                        key expressions: _col0 (type: string)
                        sort order: +
                        Map-reduce partition columns: _col0 (type: string)
                        Statistics: Num rows: 1663976823 Data size: 36607490111 Basic stats: COMPLETE Column stats: NONE
{code}


> Poor Performance about subquery like DS/query70
> -----------------------------------------------
>
>                 Key: HIVE-17474
>                 URL: https://issues.apache.org/jira/browse/HIVE-17474
>             Project: Hive
>          Issue Type: Bug
>            Reporter: liyunzhang_intel
>         Attachments: explain.70.vec
>
>
> in [DS/query70|https://github.com/kellyzly/hive-testbench/blob/hive14/sample-queries-tpcds/query70.sql]. {code}
> select  
>     sum(ss_net_profit) as total_sum
>    ,s_state
>    ,s_county
>    ,grouping__id as lochierarchy
>    , rank() over(partition by grouping__id, case when grouping__id == 2 then s_state end order by sum(ss_net_profit)) as rank_within_parent
> from
>     store_sales ss join date_dim d1 on d1.d_date_sk = ss.ss_sold_date_sk
>     join store s on s.s_store_sk  = ss.ss_store_sk
>  where
>     d1.d_month_seq between 1193 and 1193+11
>  and s.s_state in
>              ( select s_state
>                from  (select s_state as s_state, sum(ss_net_profit),
>                              rank() over ( partition by s_state order by sum(ss_net_profit) desc) as ranking
>                       from   store_sales, store, date_dim
>                       where  d_month_seq between 1193 and 1193+11
>                             and date_dim.d_date_sk = store_sales.ss_sold_date_sk
>                             and store.s_store_sk  = store_sales.ss_store_sk
>                       group by s_state
>                      ) tmp1 
>                where ranking <= 5
>              )
>  group by s_state,s_county with rollup
> order by
>    lochierarchy desc
>   ,case when lochierarchy = 0 then s_state end
>   ,rank_within_parent
>  limit 100;
> {code}
>  let's analyze the query,
> part1: it calculates the sub-query and get the result of the state which ss_net_profit is less than 5.
> part2: big table store_sales join small tables date_dim, store and get the result.
> part3: part1 join part2
> the problem is on the part3, this is common join. The cardinality of part1 and part2 is low as there are not very different values about states( actually there are 30 different values in the table store).  If use common join, big data will go to the 30 reducers.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)