Posted to reviews@spark.apache.org by ameent <gi...@git.apache.org> on 2017/12/28 03:30:00 UTC

[GitHub] spark pull request #20100: [SPARK-22913][SQL] Improved Hive Partition Prunin...

GitHub user ameent opened a pull request:

    https://github.com/apache/spark/pull/20100

    [SPARK-22913][SQL] Improved Hive Partition Pruning

    Adds support for pruning partitions on Timestamp and Fractional column
    types. Pruning on these types is gated behind new config options that
    default to false, because it is not clear which Hive metastore
    implementations support predicates on these column types.
    
    The AWS Glue Catalog (http://docs.aws.amazon.com/glue/latest/dg/populate-data-catalog.html)
    does support filters on timestamp and fractional columns, and pushing these
    filters down to it yields significant performance improvements in our use cases.
    
    As part of this change the Hive pruning suite is renamed (a TODO) and two
    ignored tests are added that will validate partition pruning through
    integration tests. The tests are ignored because the integration test setup
    uses a Hive client that throws errors when it sees partition column filters
    on non-integral, non-string columns.
    
    Active unit tests are added to validate filtering.
    
    ## What changes were proposed in this pull request?
    
    See https://issues.apache.org/jira/browse/SPARK-22913
    
    This change addresses the JIRA. I'm looking for feedback on the change itself and on whether the config values I added make sense. I was not able to find an official Hive specification of which filters a metastore needs to support, so I am hesitant to turn this behavior on by default. Piggybacking on the "advancedPartitionPruning" option felt wrong because that config toggles whether "in (...)" queries are expanded into a series of "or"s, and I don't want people to be forced to turn off that behavior just to stop pushing timestamp predicates.
    
    ## How was this patch tested?
    
    This change is tested via unit tests, modified integration tests (currently ignored), and manual tests on EMR 5.10 running against the AWS Glue Catalog as the Hive metastore.
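
The gating described in the pull request can be illustrated with a minimal sketch. This is not the code from the patch: the `Predicate` class, the type tags, the flag names (`expand_in`, `push_timestamp`, `push_fractional`) and the exact filter-string syntax (including how timestamps are quoted) are all assumptions made for illustration. It shows two independent switches: one controlling whether "in (...)" is expanded into a series of "or"s, and one pair controlling whether timestamp and fractional predicates are pushed to the metastore at all.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Union

# Hypothetical column type tags; these are not Spark's actual type classes.
INTEGRAL, STRING, TIMESTAMP, FRACTIONAL = "integral", "string", "timestamp", "fractional"


@dataclass(frozen=True)
class Predicate:
    """A single partition-column predicate (hypothetical model)."""
    column: str
    col_type: str
    op: str  # "=", "<", ">", "<=", ">=", or "in"
    value: Union[object, Sequence[object]]


def _literal(value, col_type):
    # Assumption: strings and timestamps are double-quoted, numbers are bare.
    if col_type in (STRING, TIMESTAMP):
        return f'"{value}"'
    return str(value)


def to_metastore_filter(pred, expand_in=True, push_timestamp=False,
                        push_fractional=False) -> Optional[str]:
    """Return a filter string, or None if the predicate should not be pushed down.

    Returning None means the caller fetches partitions without this filter
    and evaluates the predicate on the client side instead.
    """
    # Timestamp and fractional pushdown are off unless explicitly enabled.
    if pred.col_type == TIMESTAMP and not push_timestamp:
        return None
    if pred.col_type == FRACTIONAL and not push_fractional:
        return None
    if pred.op == "in":
        values = list(pred.value)
        if not expand_in or not values:
            return None
        # Expand "col in (a, b)" into "(col = a or col = b)".
        parts = [f"{pred.column} = {_literal(v, pred.col_type)}" for v in values]
        return "(" + " or ".join(parts) + ")"
    return f"{pred.column} {pred.op} {_literal(pred.value, pred.col_type)}"
```

Keeping the flags separate means a user can leave the "in (...)" expansion enabled while still declining to push timestamp predicates, which is the concern raised in the description.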


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ameent/spark master

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20100.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #20100
    
----
commit 6b1d5dc8874bba7c707428818123ec63fd7e84f0
Author: Ameen Tayyebi <am...@...>
Date:   2017-12-28T02:56:13Z

    [SPARK-22913][SQL] Improved Hive Partition Pruning

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20100: [SPARK-22913][SQL] Improved Hive Partition Pruning

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20100
  
    Can one of the admins verify this patch?


---



[GitHub] spark issue #20100: [SPARK-22913][SQL] Improved Hive Partition Pruning

Posted by ameent <gi...@git.apache.org>.

Github user ameent commented on the issue:

    https://github.com/apache/spark/pull/20100
  
    @srowen can you help find someone to review this?


---


[GitHub] spark issue #20100: [SPARK-22913][SQL] Improved Hive Partition Pruning

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/20100
  
    Instead of adding configs, we should delegate the partition pruning logic to HiveShim and only support these types for certain Hive versions.
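
The alternative suggested here, choosing behavior by Hive version rather than by config, could look roughly like the sketch below. It is only an illustration: `BaseShim`, `ExtendedShim`, `shim_for` and the `(2, 0)` version cutoff are hypothetical, and the thread does not say which Hive versions actually accept timestamp or fractional partition filters.

```python
class BaseShim:
    """Hypothetical shim for metastores that only filter integral and string columns."""
    pushable_types = frozenset({"integral", "string"})

    def can_push(self, col_type: str) -> bool:
        # A predicate is pushed down only if this shim's version supports its type.
        return col_type in self.pushable_types


class ExtendedShim(BaseShim):
    """Hypothetical shim for versions assumed to also filter timestamp and fractional columns."""
    pushable_types = BaseShim.pushable_types | {"timestamp", "fractional"}


def shim_for(version: str, min_extended=(2, 0)) -> BaseShim:
    """Pick a shim from a dotted version string.

    min_extended is a placeholder cutoff, not a documented Hive fact.
    """
    # Pad so that a bare major version such as "2" is read as "2.0".
    parts = (version.split(".") + ["0"])[:2]
    major_minor = tuple(int(p) for p in parts)
    return ExtendedShim() if major_minor >= min_extended else BaseShim()
```

With this shape the caller asks `shim_for(version).can_push(col_type)` and no user-facing option is needed; the cost is that every supported version must be classified correctly up front.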


---


[GitHub] spark issue #20100: [SPARK-22913][SQL] Improved Hive Partition Pruning

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/20100
  
    Spark doesn't officially support Glue. I think Glue is plugged into Spark by presenting itself as a certain Hive version, and that Hive version should support timestamp and fractional types.


---


[GitHub] spark pull request #20100: [SPARK-22913][SQL] Improved Hive Partition Prunin...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/20100


---


[GitHub] spark issue #20100: [SPARK-22913][SQL] Improved Hive Partition Pruning

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/20100
  
    Sorry for the late response. I am now checking the PRs queued in my list.
    I agree with @cloud-fan for now, and I think we had better leave this closed.


---


[GitHub] spark issue #20100: [SPARK-22913][SQL] Improved Hive Partition Pruning

Posted by ameent <gi...@git.apache.org>.

Github user ameent commented on the issue:

    https://github.com/apache/spark/pull/20100
  
    Any updates on this?


---


[GitHub] spark issue #20100: [SPARK-22913][SQL] Improved Hive Partition Pruning

Posted by ameent <gi...@git.apache.org>.

Github user ameent commented on the issue:

    https://github.com/apache/spark/pull/20100
  
    Thanks @cloud-fan. Do you propose that we model "AWS Glue" as its own Hive version?


---


[GitHub] spark issue #20100: [SPARK-22913][SQL] Improved Hive Partition Pruning

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/20100
  
    @ameent BTW, we can't close this directly. I'd appreciate it if you could close it manually.


---



[GitHub] spark issue #20100: [SPARK-22913][SQL] Improved Hive Partition Pruning

Posted by ameent <gi...@git.apache.org>.

Github user ameent commented on the issue:

    https://github.com/apache/spark/pull/20100
  
    CCing @cloud-fan @tdas @HyukjinKwon @xubo245. I need help finding someone who can provide feedback on this pull request.
    
    This change reduces the run time of one of our use cases from 6 minutes to around 11 seconds. We have tables with a large number of partitions (over 1 million), and retrieving all partitions over the wire to the master node adds a considerable amount of time.


---

