You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@spark.apache.org by "Sean Owen (JIRA)" <ji...@apache.org> on 2018/01/05 23:16:00 UTC

[jira] [Updated] (SPARK-22913) Hive Partition Pruning, Fractional and Timestamp types

     [ https://issues.apache.org/jira/browse/SPARK-22913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated SPARK-22913:
------------------------------
    Fix Version/s:     (was: 2.3.0)

> Hive Partition Pruning, Fractional and Timestamp types
> ------------------------------------------------------
>
>                 Key: SPARK-22913
>                 URL: https://issues.apache.org/jira/browse/SPARK-22913
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Ameen Tayyebi
>
> Spark currently pushes the predicates it has in the SQL query to Hive Metastore. This only applies to predicates that are placed on top of partitioning columns. As more and more hive metastore implementations come around, this is an important optimization to allow data to be prefiltered to only relevant partitions. Consider the following example:
> Table:
> create external table data (key string, quantity long)
> partitioned by (processing-date timestamp)
> Query:
> select * from data where processing-date = '2017-10-23 00:00:00'
> Currently, no filters will be pushed to the hive metastore for the above query. The reason is that the code that tries to compute predicates to be sent to hive metastore, only deals with integral and string column types. It doesn't know how to handle fractional and timestamp columns.
> I have tables in my metastore (AWS Glue) with millions of partitions of type timestamp and double. In my specific case, it takes Spark's master node about 6.5 minutes to download all partitions for the table, and then filter the partitions client-side. The actual processing time of my query is only 6 seconds. In other words, without partition pruning, I'm looking at 6.5 minutes of processing and with partition pruning, I'm looking at 6 seconds only.
> I have a fix for this developed locally that I'll provide shortly as a pull request.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org