You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@hive.apache.org by "Rajesh Balamohan (JIRA)" <ji...@apache.org> on 2016/12/16 05:38:58 UTC

[jira] [Comment Edited] (HIVE-15339) Batch metastore calls to get column stats for fields needed in FilterSelectivityEstimator

    [ https://issues.apache.org/jira/browse/HIVE-15339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15753530#comment-15753530 ] 

Rajesh Balamohan edited comment on HIVE-15339 at 12/16/16 5:38 AM:
-------------------------------------------------------------------

Thanks [~jcamachorodriguez]. I agree that it would be better to have this in HiveRelFieldTrimmer. I will revise the patch and share it.


was (Author: rajesh.balamohan):
Thanks [~jcamachorodriguez]. I agree that it would be better to have this in HiveRelFieldTrimmer. Will revise the patch and share it.

> Batch metastore calls to get column stats for fields needed in FilterSelectivityEstimator
> -----------------------------------------------------------------------------------------
>
>                 Key: HIVE-15339
>                 URL: https://issues.apache.org/jira/browse/HIVE-15339
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Rajesh Balamohan
>            Priority: Minor
>         Attachments: HIVE-15339.1.patch, HIVE-15339.3.patch
>
>
> Based on query pattern, {{FilterSelectivityEstimator}} gets column statistics from metastore in multiple calls. For instance, in the following query, it ends up getting individual column statistics for for flights multiple number of times.
> When the table has large number of partitions, getting statistics for columns via multiple calls can be very expensive. This would adversely impact the overall compilation time. The following query took 14 seconds to compile.
> {noformat}
> SELECT COUNT(`flights`.`flightnum`) AS `cnt_flightnum_ok`,
> YEAR(`flights`.`dateofflight`) AS `yr_flightdate_ok`
> FROM `flights` as `flights`
> JOIN `airlines` ON (`flights`.`uniquecarrier` = `airlines`.`code`)
> JOIN `airports` as `source_airport` ON (`flights`.`origin` = `source_airport`.`iata`)
> JOIN `airports` as `dest_airport` ON (`flights`.`dest` = `dest_airport`.`iata`)
> GROUP BY YEAR(`flights`.`dateofflight`);
> {noformat}
> It may be helpful to club all columns that need statistics and fetch these details in single remote call.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)