You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@hive.apache.org by "Jesus Camacho Rodriguez (JIRA)" <ji...@apache.org> on 2016/05/28 09:18:12 UTC

[jira] [Updated] (HIVE-13254) GBY cardinality estimation is wrong partition columns is involved

     [ https://issues.apache.org/jira/browse/HIVE-13254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jesus Camacho Rodriguez updated HIVE-13254:
-------------------------------------------
    Target Version/s:   (was: 2.1.0)

> GBY cardinality estimation is wrong partition columns is involved
> -----------------------------------------------------------------
>
>                 Key: HIVE-13254
>                 URL: https://issues.apache.org/jira/browse/HIVE-13254
>             Project: Hive
>          Issue Type: Bug
>          Components: Hive
>    Affects Versions: 1.3.0, 2.0.0, 2.1.0
>            Reporter: Prasanth Jayachandran
>            Assignee: Jesus Camacho Rodriguez
>         Attachments: q3.svg, q3_ef_transpose_aggr.svg
>
>
> When running the following query on TPCDS-1000 scale, setting hive.transpose.aggr.join=true is expected to generate optimal plan but it was not generating. 
> {code:title=Query}
> SELECT `date_dim`.`d_day_name` AS `d_day_name`, 
>        `item`.`i_category`     AS `i_category` 
> FROM   `store_sales` `store_sales` 
>        INNER JOIN `item` `item` 
>                ON ( `store_sales`.`ss_item_sk` = `item`.`i_item_sk` ) 
>        INNER JOIN `date_dim` `date_dim` 
>                ON ( `store_sales`.`ss_sold_date_sk` = `date_dim`.`d_date_sk` ) 
> GROUP  BY `d_day_name`, 
>           `i_category`;
> {code}
> The reason for that is stats annotation rule for GROUP BY is not considering partition column into account. For the above query, the generated plan is attached. As we can see from the plan, GBY is pushed to fact table (store_sales) but that output of GBY shuffled to perform join instead of MapJoin conversion. This is because of wrong estimation of cardinality/data size of GBY on store_sales (Map 1). 
> What's happening internally is, GBY computes estimated cardinality which in this case is NDV(ss_item_sk) * NDV(ss_sold_date_sk) = 338901 * 1823 ~= 617M. This estimate is wrong as ss_sold_date_sk is partition column and estimator assumes its non-partition column. In this case, not every tasks reads data from all partitions. We need to take estimated task parallelism into account. For example: If task parallelism is determined to be 100 the estimate from GBY should be ~6M which should convert this vertex into map join vertex. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)