Posted to reviews@spark.apache.org by Parth-Brahmbhatt <gi...@git.apache.org> on 2016/08/15 22:15:53 UTC

[GitHub] spark pull request #14655: [SPARK-16669][SQL]Adding partition prunning to Me...

GitHub user Parth-Brahmbhatt opened a pull request:

    https://github.com/apache/spark/pull/14655

    [SPARK-16669][SQL]Adding partition prunning to Metastore statistics f…

    ## What changes were proposed in this pull request?
    
    Adding partition pruning to Metastore statistics for better join selection.
    
    Currently the metastore statistics return the size of the entire table, so the join selection strategy does not use broadcast joins even when only a single partition of a large table is selected. This PR addresses that by applying partition pruning during size estimation, so that only the size of the selected partitions is estimated. For now it only works with partition columns used in equality checks under the AND, OR, and IN operators. If a partition column is used with any other operator, the estimate falls back to the total table size. We have also introduced a configuration to enable this optimization; it is off by default. Instead of calculating the partition paths ourselves, we could query the metastore for all the valid paths, but for simplicity we just build the paths in code.
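
The estimation rule described above can be illustrated with a small, self-contained sketch. This is not the code from the pull request: it is a plain-Python model under stated assumptions. Predicates are hypothetical nested tuples such as `("eq", "dt", "2016-08-01")`, partitions are in-memory `(spec, size_in_bytes)` pairs, and the names `prune`, `estimated_size`, and the `enabled` flag are invented for illustration (the flag only mimics the off-by-default configuration mentioned in the description).

```python
from typing import Dict, List, Set, Tuple

# Hypothetical stand-in for a partitioned table: each entry pairs the
# partition-column values with that partition's size in bytes.
Partition = Tuple[Dict[str, str], int]


class UnsupportedPredicate(Exception):
    """A partition column appears under an operator we cannot prune on."""


def prune(pred, partitions: List[Partition], part_cols: Set[str]) -> Set[int]:
    """Return the indices of the partitions that may satisfy `pred`."""
    everything = set(range(len(partitions)))
    op = pred[0]
    if op == "and":
        return prune(pred[1], partitions, part_cols) & prune(pred[2], partitions, part_cols)
    if op == "or":
        return prune(pred[1], partitions, part_cols) | prune(pred[2], partitions, part_cols)
    col = pred[1]
    if not isinstance(col, str):
        # Any other nested operator (for example NOT): be conservative.
        raise UnsupportedPredicate(op)
    if col not in part_cols:
        # A filter on a data column cannot rule out any partition.
        return everything
    if op == "eq":
        return {i for i, (spec, _) in enumerate(partitions) if spec[col] == pred[2]}
    if op == "in":
        return {i for i, (spec, _) in enumerate(partitions) if spec[col] in pred[2]}
    raise UnsupportedPredicate(op)


def estimated_size(pred, partitions: List[Partition], part_cols: Set[str],
                   enabled: bool = True) -> int:
    """Estimate the scan size in bytes, falling back to the whole table."""
    total = sum(size for _, size in partitions)
    if not enabled:
        return total
    try:
        kept = prune(pred, partitions, part_cols)
    except UnsupportedPredicate:
        return total
    return sum(partitions[i][1] for i in kept)
```

With a smaller estimate, a planner that compares the estimated size against a broadcast threshold (in Spark, `spark.sql.autoBroadcastJoinThreshold`) has a chance to pick a broadcast join for a single-partition scan. A range predicate on a partition column returns the full table size here, which matches the fallback behaviour the description states.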
    
    
    ## How was this patch tested?
    Unit tests added.
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/Parth-Brahmbhatt/spark Spark-16669

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/14655.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #14655
    
----
commit 7380d4a3450b985386b2f59516baa50c896c2659
Author: Parth Brahmbhatt <pb...@netflix.com>
Date:   2016-07-15T23:21:21Z

    [SPARK-16669][SQL]Adding partition prunning to Metastore statistics for better join selection.
    
    Currently the metastore statistics return the size of the entire table, so the join selection strategy does not use broadcast joins even when only a single partition of a large table is selected. This PR addresses that by applying partition pruning during size estimation, so that only the size of the selected partitions is estimated. For now it only works with partition columns used in equality checks under the AND, OR, and IN operators. If a partition column is used with any other operator, the estimate falls back to the total table size. We have also introduced a configuration to enable this optimization; it is off by default. Instead of calculating the partition paths ourselves, we could query the metastore for all the valid paths, but for simplicity we just build the paths in code.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14655: [SPARK-16669][SQL]Adding partition prunning to Metastore...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/14655
  
    @Parth-Brahmbhatt Thank you!




[GitHub] spark issue #14655: [SPARK-16669][SQL]Adding partition prunning to Metastore...

Posted by Parth-Brahmbhatt <gi...@git.apache.org>.
Github user Parth-Brahmbhatt commented on the issue:

    https://github.com/apache/spark/pull/14655
  
    I will re-evaluate and update or close the PR. 




[GitHub] spark issue #14655: [SPARK-16669][SQL]Adding partition prunning to Metastore...

Posted by wzhfy <gi...@git.apache.org>.
Github user wzhfy commented on the issue:

    https://github.com/apache/spark/pull/14655
  
    It seems this PR solves a similar problem to [SPARK-15616](https://github.com/apache/spark/pull/18193)?




[GitHub] spark issue #14655: [SPARK-16669][SQL]Adding partition prunning to Metastore...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14655
  
    Can one of the admins verify this patch?




[GitHub] spark issue #14655: [SPARK-16669][SQL]Adding partition prunning to Metastore...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/14655
  
    Thank you!




[GitHub] spark issue #14655: [SPARK-16669][SQL]Adding partition prunning to Metastore...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/14655
  
    @Parth-Brahmbhatt Are you still interested in this PR? Our stats refactoring was finished in the 2.2 release. Thank you!




[GitHub] spark issue #14655: [SPARK-16669][SQL]Adding partition prunning to Metastore...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/14655
  
    If we are going to do this, I'd like to have a more general approach, which should also work for data source tables.




[GitHub] spark issue #14655: [SPARK-16669][SQL]Adding partition prunning to Metastore...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/14655
  
    Will this be part of the CBO work? The size estimation or statistics collection is being re-designed for CBO, right?




[GitHub] spark issue #14655: [SPARK-16669][SQL]Adding partition prunning to Metastore...

Posted by Parth-Brahmbhatt <gi...@git.apache.org>.
Github user Parth-Brahmbhatt commented on the issue:

    https://github.com/apache/spark/pull/14655
  
    @gatorsmile I am not sure it will simplify much in this case, since most of the complexity is in figuring out which partitions can be pruned, and I don't think that will go away. We will rely on the Hive metastore instead of HDFS for size calculation whenever partition-level stats are stored and available, but that part of the code is not really complex.
    
    I am fine waiting for the patch to be delivered.
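
The "metastore first, file system otherwise" choice mentioned in this comment can be sketched as follows. This is an illustrative model, not code from the PR. It assumes the partition's metastore parameters are available as a string map and that a size, when present, is stored under the `totalSize` key that Hive uses for its basic statistics. The helper names `dir_size` and `partition_size_bytes` are hypothetical, and a local directory walk stands in for an HDFS listing.

```python
import os
from typing import Callable, Mapping


def dir_size(location: str) -> int:
    """Sum file sizes under a directory (a local stand-in for an HDFS listing)."""
    total = 0
    for root, _dirs, files in os.walk(location):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total


def partition_size_bytes(params: Mapping[str, str], location: str,
                         fs_size: Callable[[str], int] = dir_size) -> int:
    """Prefer a stored partition-level size; otherwise ask the file system."""
    raw = params.get("totalSize")  # assumed stats key, see the note above
    if raw is not None and raw.isdigit() and int(raw) > 0:
        return int(raw)
    # No usable stored statistic: fall back to listing the partition path.
    return fs_size(location)
```

Passing the file-system lookup in as `fs_size` keeps the fallback swappable, so the same selection logic can be exercised without touching a real file system.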




[GitHub] spark issue #14655: [SPARK-16669][SQL]Adding partition prunning to Metastore...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/14655
  
    How about waiting for a few days until that is delivered? Let us see whether that might simplify your PR.




[GitHub] spark issue #14655: [SPARK-16669][SQL]Adding partition prunning to Metastore...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the issue:

    https://github.com/apache/spark/pull/14655
  
    cc @cloud-fan and @gatorsmile - both are working on refactoring some of this code.




[GitHub] spark issue #14655: [SPARK-16669][SQL]Adding partition prunning to Metastore...

Posted by lianhuiwang <gi...@git.apache.org>.
Github user lianhuiwang commented on the issue:

    https://github.com/apache/spark/pull/14655
  
    @wzhfy Yes, I think this is the same as SPARK-15616.




[GitHub] spark issue #14655: [SPARK-16669][SQL]Adding partition prunning to Metastore...

Posted by Parth-Brahmbhatt <gi...@git.apache.org>.
Github user Parth-Brahmbhatt commented on the issue:

    https://github.com/apache/spark/pull/14655
  
    Closing this PR given it's a duplicate at this point.




[GitHub] spark issue #14655: [SPARK-16669][SQL]Adding partition prunning to Metastore...

Posted by Parth-Brahmbhatt <gi...@git.apache.org>.
Github user Parth-Brahmbhatt commented on the issue:

    https://github.com/apache/spark/pull/14655
  
    @cloud-fan How do you suggest changing this? I started with the Metastore because internally it is our most used data source and will benefit from partition pruning at the planning stage. I am open to any suggestions and will modify the code accordingly.




[GitHub] spark pull request #14655: [SPARK-16669][SQL]Adding partition prunning to Me...

Posted by Parth-Brahmbhatt <gi...@git.apache.org>.
Github user Parth-Brahmbhatt closed the pull request at:

    https://github.com/apache/spark/pull/14655




[GitHub] spark issue #14655: [SPARK-16669][SQL]Adding partition prunning to Metastore...

Posted by Parth-Brahmbhatt <gi...@git.apache.org>.
Github user Parth-Brahmbhatt commented on the issue:

    https://github.com/apache/spark/pull/14655
  
    @gatorsmile I am not sure it's the same issue. The issue you are pointing at is about storing the actual partition-level stats, which this PR could use, but until those are available we could rely on file-system-level statistics. Also, given that this is config-driven and disabled by default, it should have no perf impact.




[GitHub] spark issue #14655: [SPARK-16669][SQL]Adding partition prunning to Metastore...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/14655
  
    Found a related JIRA: https://issues.apache.org/jira/browse/SPARK-17129

