You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@hive.apache.org by "Hari Sankar Sivarama Subramaniyan (JIRA)" <ji...@apache.org> on 2016/06/23 21:37:16 UTC

[jira] [Updated] (HIVE-13995) Hive generates inefficient metastore queries for TPCDS tables with 1800+ partitions leading to higher compile time

     [ https://issues.apache.org/jira/browse/HIVE-13995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hari Sankar Sivarama Subramaniyan updated HIVE-13995:
-----------------------------------------------------
    Attachment: HIVE-13995.1.patch

Draft patch while I am still improving on the changes.

> Hive generates inefficient metastore queries for TPCDS tables with 1800+ partitions leading to higher compile time
> ------------------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-13995
>                 URL: https://issues.apache.org/jira/browse/HIVE-13995
>             Project: Hive
>          Issue Type: Bug
>          Components: Hive
>    Affects Versions: 2.2.0
>            Reporter: Nita Dembla
>            Assignee: Hari Sankar Sivarama Subramaniyan
>         Attachments: HIVE-13995.1.patch
>
>
> TPCDS fact tables (store_sales, catalog_sales) have 1800+ partitions and when the query does not a filter on the partition column, metastore queries generated have a large IN clause listing all the partition names. Most RDBMS systems have issues optimizing large IN clause and even when a good index plan is chosen , comparing to 1800+ string values will not lead to best execution time.
> When all partitions are chosen, not specifying the partition list and having filters only on table and column name will generate the same result set as long as there are no concurrent modifications to partition list of the hive table (adding/dropping partitions).
> For eg: For TPCDS query18, the metastore query gathering partition column statistics runs in 0.5 secs in Mysql. Following is output from mysql log
> {noformat}
> -- Query_time: 0.482063  Lock_time: 0.003037 Rows_sent: 1836  Rows_examined: 18360
> select count("COLUMN_NAME") from "PART_COL_STATS"
>  where "DB_NAME" = 'tpcds_bin_partitioned_orc_30000' and "TABLE_NAME" = 'catalog_sales' 
>  and "COLUMN_NAME" in 
> ('cs_bill_customer_sk','cs_bill_cdemo_sk','cs_item_sk','cs_quantity','cs_list_price','cs_sales_price','cs_coupon_amt','cs_net_profit') and "PARTITION_NAME" in 
> ('cs_sold_date_sk=2450815','cs_sold_date_sk=2450816','cs_sold_date_sk=2450817','cs_sold_date_sk=2450818','cs_sold_date_sk=2450819','cs_sold_date_sk=2450820','cs_sold_date_sk=2450821','cs_sold_date_sk=2450822','cs_sold_date_sk=2450823','cs_sold_date_sk=2450824','cs_sold_date_sk=2450825','cs_sold_date_sk=2450826','cs_sold_date_sk=2450827','cs_sold_date_sk=2450828','cs_sold_date_sk=2450829','cs_sold_date_sk=2450830','cs_sold_date_sk=2450831','cs_sold_date_sk=2450832','cs_sold_date_sk=2450833','cs_sold_date_sk=2450834','cs_sold_date_sk=2450835','cs_sold_date_sk=2450836','cs_sold_date_sk=2450837','cs_sold_date_sk=2450838','cs_sold_date_sk=2450839','cs_sold_date_sk=2450840','cs_sold_date_sk=2450841','cs_sold_date_sk=2450842','cs_sold_date_sk=2450843','cs_sold_date_sk=2450844','cs_sold_date_sk=2450845','cs_sold_date_sk=2450846','cs_sold_date_sk=2450847','cs_sold_date_sk=2450848','cs_sold_date_sk=2450849','cs_sold_date_sk=2450850','cs_sold_date_sk=2450851','cs_sold_date_sk=2450852','cs_sold_date_sk=2450853','cs_sold_date_sk=2450854','cs_sold_date_sk=2450855','cs_sold_date_sk=2450856',...,'cs_sold_date_sk=2452654') group by "PARTITION_NAME";
> {noformat}
> Functionally equivalent query runs in 0.1 seconds
> {noformat}
> --Query_time: 0.121296  Lock_time: 0.000156 Rows_sent: 1836  Rows_examined: 18360
> select count("COLUMN_NAME") from "PART_COL_STATS"
>  where "DB_NAME" = 'tpcds_bin_partitioned_orc_30000' and "TABLE_NAME" = 'catalog_sales'  and "COLUMN_NAME" in 
> ('cs_bill_customer_sk','cs_bill_cdemo_sk','cs_item_sk','cs_quantity','cs_list_price','cs_sales_price','cs_coupon_amt','cs_net_profit')
>  group by "PARTITION_NAME";
> {noformat}
> If removing the partition list seems drastic, its also possible to simply list the range since hive gets a ordered list of partition names. This performs equally well as earlier query
> {noformat}
> # Query_time: 0.143874  Lock_time: 0.000154 Rows_sent: 1836  Rows_examined: 18360
> SET timestamp=1464014881;
> select count("COLUMN_NAME") from "PART_COL_STATS" where "DB_NAME" = 'tpcds_bin_partitioned_orc_30000' and "TABLE_NAME" = 'catalog_sales'  and "COLUMN_NAME" in 
> ('cs_bill_customer_sk','cs_bill_cdemo_sk','cs_item_sk','cs_quantity','cs_list_price','cs_sales_price','cs_coupon_amt','cs_net_profit')
>   and "PARTITION_NAME" >= 'cs_sold_date_sk=2450815' and "PARTITION_NAME" <= 'cs_sold_date_sk=2452654' 
> group by "PARTITION_NAME";
> {noformat}
> Another thing to check is the IN clause of column names. Columns in projection list of hive query are mentioned here. Not sure if statistics of these columns are required for hive query optimization.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)