You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@hive.apache.org by "liyunzhang_intel (JIRA)" <ji...@apache.org> on 2017/09/19 06:45:00 UTC
[jira] [Comment Edited] (HIVE-17474) Poor Performance about subquery like DS/query70 on HoS

    [ https://issues.apache.org/jira/browse/HIVE-17474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16171184#comment-16171184 ] 

liyunzhang_intel edited comment on HIVE-17474 at 9/19/17 6:44 AM:
------------------------------------------------------------------

I found that we need execute
"analyze table xxx compute statistics for columns" before executing the query.
Attach the different explain([before_analyze|https://issues.apache.org/jira/secure/attachment/12887836/explain.70.before.analyze],[after_analyze|https://issues.apache.org/jira/secure/attachment/12887837/explain.70.after.analyze])
give an example to show the influence of column statistics 
{code}(select s_state as s_state, sum(ss_net_profit),
                             rank() over ( partition by s_state order by sum(ss_net_profit) desc) as ranking
                      from   store_sales, store, date_dim
                      where  d_month_seq between 1193 and 1193+11
                            and date_dim.d_date_sk = store_sales.ss_sold_date_sk
                            and store.s_store_sk  = store_sales.ss_store_sk
                      group by s_state
                     ) {code}
before compute column statistics
{code}
 Map 9 
            Map Operator Tree:
                TableScan
                  alias: store_sales
                  filterExpr: (ss_store_sk is not null and ss_sold_date_sk is not null) (type: boolean)
                  Statistics: Num rows: 27504814 Data size: 825144420 Basic stats: COMPLETE Column stats: PARTIAL
                  Filter Operator
                    predicate: ss_store_sk is not null (type: boolean)
                    Statistics: Num rows: 27504814 Data size: 220038512 Basic stats: COMPLETE Column stats: PARTIAL
                    Select Operator
                      expressions: ss_store_sk (type: bigint), ss_net_profit (type: double), ss_sold_date_sk (type: bigint)
                      outputColumnNames: _col0, _col1, _col2
                      Statistics: Num rows: 27504814 Data size: 220038512 Basic stats: COMPLETE Column stats: PARTIAL
                      Map Join Operator
                        condition map:
                             Inner Join 0 to 1
                        keys:
                          0 _col0 (type: bigint)
                          1 _col0 (type: bigint)
                        outputColumnNames: _col1, _col2, _col4
                        input vertices:
                          1 Map 12
                        Statistics: Num rows: 30255296 Data size: 242042368 Basic stats: COMPLETE Column stats: NONE
                        Map Join Operator
                          condition map:
                               Inner Join 0 to 1
                          keys:
                            0 _col2 (type: bigint)
                            1 _col0 (type: bigint)
                          outputColumnNames: _col1, _col4
                          input vertices:
                            1 Map 13
                          Statistics: Num rows: 33280826 Data size: 266246610 Basic stats: COMPLETE Column stats: NONE
                          Select Operator
                            expressions: _col4 (type: string), _col1 (type: double)
                            outputColumnNames: _col4, _col1
                            Statistics: Num rows: 33280826 Data size: 266246610 Basic stats: COMPLETE Column stats: NONE
                            Group By Operator
                              aggregations: sum(_col1)
                              keys: _col4 (type: string)
                              mode: hash
                              outputColumnNames: _col0, _col1
                              Statistics: Num rows: 33280826 Data size: 266246610 Basic stats: COMPLETE Column stats: NONE
                              Reduce Output Operator
                                key expressions: _col0 (type: string)
                                sort order: +
                                Map-reduce partition columns: _col0 (type: string)
                                Statistics: Num rows: 33280826 Data size: 266246610 Basic stats: COMPLETE Column stats: NONE
                                value expressions: _col1 (type: double)

{code}
the data size is 266246610

After computing column statistics
{code}
  Map 7 
            Map Operator Tree:
                TableScan
                  alias: store_sales
                  filterExpr: (ss_store_sk is not null and ss_sold_date_sk is not null) (type: boolean)
                  Statistics: Num rows: 27504814 Data size: 649740104 Basic stats: COMPLETE Column stats: PARTIAL
                  Filter Operator
                    predicate: ss_store_sk is not null (type: boolean)
                    Statistics: Num rows: 26856871 Data size: 634433888 Basic stats: COMPLETE Column stats: PARTIAL
                    Select Operator
                      expressions: ss_store_sk (type: bigint), ss_net_profit (type: double), ss_sold_date_sk (type: bigint)
                      outputColumnNames: _col0, _col1, _col2
                      Statistics: Num rows: 26856871 Data size: 634433888 Basic stats: COMPLETE Column stats: PARTIAL
                      Map Join Operator
                        condition map:
                             Inner Join 0 to 1
                        keys:
                          0 _col2 (type: bigint)
                          1 _col0 (type: bigint)
                        outputColumnNames: _col0, _col1
                        input vertices:
                          1 Map 10
                        Statistics: Num rows: 2983893 Data size: 47742288 Basic stats: COMPLETE Column stats: PARTIAL
                        Map Join Operator
                          condition map:
                               Inner Join 0 to 1
                          keys:
                            0 _col0 (type: bigint)
                            1 _col0 (type: bigint)
                          outputColumnNames: _col1, _col6
                          input vertices:
                            1 Map 11
                          Statistics: Num rows: 2983893 Data size: 280485942 Basic stats: COMPLETE Column stats: PARTIAL
                          Group By Operator
                            aggregations: sum(_col1)
                            keys: _col6 (type: string)
                            mode: hash
                            outputColumnNames: _col0, _col1
                            Statistics: Num rows: 9 Data size: 846 Basic stats: COMPLETE Column stats: PARTIAL
                            Reduce Output Operator
                              key expressions: _col0 (type: string)
                              sort order: +
                              Map-reduce partition columns: _col0 (type: string)
                              Statistics: Num rows: 9 Data size: 846 Basic stats: COMPLETE Column stats: PARTIAL
                              value expressions: _col1 (type: double)
{code}
the datasize is 846

 without analyzing column statistics, the cardinaltiy is parentNumRows and the datasize is parentDataSize
 [here|https://github.com/kellyzly/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java#L1204]
 {code}
  // Case 1: NO column stats, NO hash aggregation, NO grouping sets
                cardinality = parentNumRows;
 {code}
 with analyzing column statistics
 [here|https://github.com/kellyzly/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java#L1132] to calculate the cardinality of the Group By Operator, actually it use {{StatsUtils.safeMult(ndvProduct, parallelism)}}(ndvProduct is 3, parallelism is 3, parentNumRows is 2983893, cardinality=Math.min(2983893/2, 3*3)=9)
 the ndvProduct means the distinct value count of table store.Because the distinct value count of store is very small,thus the cardinality is very small. The new datasize is updated to smaller value 846.
 {code}
  // Case 3: column stats, hash aggregation, NO grouping sets
   cardinality = Math.min(parentNumRows / 2, StatsUtils.safeMult(ndvProduct, parallelism));
 {code}



was (Author: kellyzly):
I found that we need to execute
"analyze table xxx compute statistics for columns" before executing the query.
Attach the different explain before and after analyze statistics.
give an example to show the influence of column statistics 
{code}(select s_state as s_state, sum(ss_net_profit),
                             rank() over ( partition by s_state order by sum(ss_net_profit) desc) as ranking
                      from   store_sales, store, date_dim
                      where  d_month_seq between 1193 and 1193+11
                            and date_dim.d_date_sk = store_sales.ss_sold_date_sk
                            and store.s_store_sk  = store_sales.ss_store_sk
                      group by s_state
                     ) {code}
before compute column statistics
{code}
 Map 9 
            Map Operator Tree:
                TableScan
                  alias: store_sales
                  filterExpr: (ss_store_sk is not null and ss_sold_date_sk is not null) (type: boolean)
                  Statistics: Num rows: 27504814 Data size: 825144420 Basic stats: COMPLETE Column stats: PARTIAL
                  Filter Operator
                    predicate: ss_store_sk is not null (type: boolean)
                    Statistics: Num rows: 27504814 Data size: 220038512 Basic stats: COMPLETE Column stats: PARTIAL
                    Select Operator
                      expressions: ss_store_sk (type: bigint), ss_net_profit (type: double), ss_sold_date_sk (type: bigint)
                      outputColumnNames: _col0, _col1, _col2
                      Statistics: Num rows: 27504814 Data size: 220038512 Basic stats: COMPLETE Column stats: PARTIAL
                      Map Join Operator
                        condition map:
                             Inner Join 0 to 1
                        keys:
                          0 _col0 (type: bigint)
                          1 _col0 (type: bigint)
                        outputColumnNames: _col1, _col2, _col4
                        input vertices:
                          1 Map 12
                        Statistics: Num rows: 30255296 Data size: 242042368 Basic stats: COMPLETE Column stats: NONE
                        Map Join Operator
                          condition map:
                               Inner Join 0 to 1
                          keys:
                            0 _col2 (type: bigint)
                            1 _col0 (type: bigint)
                          outputColumnNames: _col1, _col4
                          input vertices:
                            1 Map 13
                          Statistics: Num rows: 33280826 Data size: 266246610 Basic stats: COMPLETE Column stats: NONE
                          Select Operator
                            expressions: _col4 (type: string), _col1 (type: double)
                            outputColumnNames: _col4, _col1
                            Statistics: Num rows: 33280826 Data size: 266246610 Basic stats: COMPLETE Column stats: NONE
                            Group By Operator
                              aggregations: sum(_col1)
                              keys: _col4 (type: string)
                              mode: hash
                              outputColumnNames: _col0, _col1
                              Statistics: Num rows: 33280826 Data size: 266246610 Basic stats: COMPLETE Column stats: NONE
                              Reduce Output Operator
                                key expressions: _col0 (type: string)
                                sort order: +
                                Map-reduce partition columns: _col0 (type: string)
                                Statistics: Num rows: 33280826 Data size: 266246610 Basic stats: COMPLETE Column stats: NONE
                                value expressions: _col1 (type: double)

{code}
the data size is 266246610

After computing column statistics
{code}
  Map 7 
            Map Operator Tree:
                TableScan
                  alias: store_sales
                  filterExpr: (ss_store_sk is not null and ss_sold_date_sk is not null) (type: boolean)
                  Statistics: Num rows: 27504814 Data size: 649740104 Basic stats: COMPLETE Column stats: PARTIAL
                  Filter Operator
                    predicate: ss_store_sk is not null (type: boolean)
                    Statistics: Num rows: 26856871 Data size: 634433888 Basic stats: COMPLETE Column stats: PARTIAL
                    Select Operator
                      expressions: ss_store_sk (type: bigint), ss_net_profit (type: double), ss_sold_date_sk (type: bigint)
                      outputColumnNames: _col0, _col1, _col2
                      Statistics: Num rows: 26856871 Data size: 634433888 Basic stats: COMPLETE Column stats: PARTIAL
                      Map Join Operator
                        condition map:
                             Inner Join 0 to 1
                        keys:
                          0 _col2 (type: bigint)
                          1 _col0 (type: bigint)
                        outputColumnNames: _col0, _col1
                        input vertices:
                          1 Map 10
                        Statistics: Num rows: 2983893 Data size: 47742288 Basic stats: COMPLETE Column stats: PARTIAL
                        Map Join Operator
                          condition map:
                               Inner Join 0 to 1
                          keys:
                            0 _col0 (type: bigint)
                            1 _col0 (type: bigint)
                          outputColumnNames: _col1, _col6
                          input vertices:
                            1 Map 11
                          Statistics: Num rows: 2983893 Data size: 280485942 Basic stats: COMPLETE Column stats: PARTIAL
                          Group By Operator
                            aggregations: sum(_col1)
                            keys: _col6 (type: string)
                            mode: hash
                            outputColumnNames: _col0, _col1
                            Statistics: Num rows: 9 Data size: 846 Basic stats: COMPLETE Column stats: PARTIAL
                            Reduce Output Operator
                              key expressions: _col0 (type: string)
                              sort order: +
                              Map-reduce partition columns: _col0 (type: string)
                              Statistics: Num rows: 9 Data size: 846 Basic stats: COMPLETE Column stats: PARTIAL
                              value expressions: _col1 (type: double)
{code}
the datasize is 846

 without analyzing column statistics, the cardinaltiy is parentNumRows and the datasize is parentDataSize
 [here|https://github.com/kellyzly/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java#L1204]
 {code}
  // Case 1: NO column stats, NO hash aggregation, NO grouping sets
                cardinality = parentNumRows;
 {code}
 with analyzing column statistics
 [here|https://github.com/kellyzly/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java#L1132] to calculate the cardinality of the Group By Operator, actually it use {{StatsUtils.safeMult(ndvProduct, parallelism)}}(ndvProduct is 3, parallelism is 3, parentNumRows is 2983893, cardinality=Math.min(2983893/2, 3*3)=9)
 the ndvProduct means the distinct value count of table store.Because the distinct value count of store is very small,thus the cardinality is very small. The new datasize is updated to smaller value 846.
 {code}
  // Case 3: column stats, hash aggregation, NO grouping sets
   cardinality = Math.min(parentNumRows / 2, StatsUtils.safeMult(ndvProduct, parallelism));
 {code}


> Poor Performance about subquery like DS/query70 on HoS
> ------------------------------------------------------
>
>                 Key: HIVE-17474
>                 URL: https://issues.apache.org/jira/browse/HIVE-17474
>             Project: Hive
>          Issue Type: Bug
>            Reporter: liyunzhang_intel
>         Attachments: explain.70.after.analyze, explain.70.before.analyze, explain.70.vec
>
>
> in [DS/query70|https://github.com/kellyzly/hive-testbench/blob/hive14/sample-queries-tpcds/query70.sql]. {code}
> select  
>     sum(ss_net_profit) as total_sum
>    ,s_state
>    ,s_county
>    ,grouping__id as lochierarchy
>    , rank() over(partition by grouping__id, case when grouping__id == 2 then s_state end order by sum(ss_net_profit)) as rank_within_parent
> from
>     store_sales ss join date_dim d1 on d1.d_date_sk = ss.ss_sold_date_sk
>     join store s on s.s_store_sk  = ss.ss_store_sk
>  where
>     d1.d_month_seq between 1193 and 1193+11
>  and s.s_state in
>              ( select s_state
>                from  (select s_state as s_state, sum(ss_net_profit),
>                              rank() over ( partition by s_state order by sum(ss_net_profit) desc) as ranking
>                       from   store_sales, store, date_dim
>                       where  d_month_seq between 1193 and 1193+11
>                             and date_dim.d_date_sk = store_sales.ss_sold_date_sk
>                             and store.s_store_sk  = store_sales.ss_store_sk
>                       group by s_state
>                      ) tmp1 
>                where ranking <= 5
>              )
>  group by s_state,s_county with rollup
> order by
>    lochierarchy desc
>   ,case when lochierarchy = 0 then s_state end
>   ,rank_within_parent
>  limit 100;
> {code}
>  let's analyze the query,
> part1: it calculates the sub-query and get the result of the state which ss_net_profit is less than 5.
> part2: big table store_sales join small tables date_dim, store and get the result.
> part3: part1 join part2
> the problem is on the part3, this is common join. The cardinality of part1 and part2 is low as there are not very different values about states( actually there are 30 different values in the table store).  If use common join, big data will go to the 30 reducers.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)