Posted to reviews@spark.apache.org by yijieshen <gi...@git.apache.org> on 2015/04/14 16:10:22 UTC

[GitHub] spark pull request: [SQL]Eliminate partition filters from executio...

GitHub user yijieshen opened a pull request:

    https://github.com/apache/spark/pull/5509

    [SQL]Eliminate partition filters from execution.Filter if partition key is included neither in original schema nor in filter's parents

    Suppose I have a table t(id: String, event: String) saved as a Parquet file, with this directory hierarchy: hdfs://path/to/data/root/dt=2015-01-01/hr=00
    
    After partition discovery, the result schema should be (id: String, event: String, dt: String, hr: Int)
    
    If I have a query like:
    
    df.select($"id").filter(event match).filter($"dt" > "2015-01-01").filter($"hr" > 13)
    
    In the current implementation, after (dt > 2015-01-01 && hr > 13) is used to filter partitions,
    these two filters remain in the execution plan, so every row returned from Parquet has the two fields dt and hr added to it.
    
    This PR eliminates the partition filters from execution.Filter, as well as from the requestedColumns passed to parquetRelation2, if the partition key is included neither in the original schema nor in the filter's parents. This avoids the extra row construction otherwise incurred for each row.
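
    The rule described above can be illustrated with a minimal, self-contained sketch. This is not the code in this pull request and it does not use Spark's classes: Pred, Plan, and prune are hypothetical names invented for illustration, and a predicate is modeled only as a label plus the set of columns it references. The sketch also assumes that partition pruning has already applied every dropped predicate in full, which is what makes dropping it safe.

```scala
object PartitionFilterSketch {
  // Hypothetical stand-in for a predicate: a label plus the columns it references.
  final case class Pred(label: String, references: Set[String])

  // Filters that must still run per row, and the columns to read for each row.
  final case class Plan(residualFilters: Seq[Pred], requestedColumns: Seq[String])

  def prune(
      filters: Seq[Pred],
      partitionColumns: Set[String],
      dataColumns: Set[String],
      parentColumns: Seq[String]): Plan = {
    // Partition keys that still have to be present in each row: those stored in
    // the data files themselves, or required by the filter's parent operators.
    val keysStillNeeded = partitionColumns.filter { c =>
      dataColumns.contains(c) || parentColumns.contains(c)
    }

    // A filter can be dropped when it references only partition keys and none
    // of those keys still has to be present in the row.
    def droppable(f: Pred): Boolean =
      f.references.nonEmpty &&
        f.references.subsetOf(partitionColumns) &&
        f.references.intersect(keysStillNeeded).isEmpty

    val residual = filters.filterNot(droppable)

    // Request only what the parents and the remaining filters need.
    val requested = (parentColumns ++ residual.flatMap(_.references)).distinct
    Plan(residual, requested)
  }

  def main(args: Array[String]): Unit = {
    val filters = Seq(
      Pred("event match", Set("event")),
      Pred("dt > 2015-01-01", Set("dt")),
      Pred("hr > 13", Set("hr")))

    // Table t(id, event) partitioned by dt and hr; the query selects only id.
    val plan = prune(filters, Set("dt", "hr"), Set("id", "event"), Seq("id"))
    println(plan.residualFilters.map(_.label))
    println(plan.requestedColumns)
  }
}
```

    For the query in the description, only the event filter remains and neither dt nor hr is requested. If a parent operator selected dt as well, the dt filter and column would be kept under this sketch.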

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/yijieshen/spark master

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/5509.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #5509
    
----
commit 1649b0e35e77ab08f9014f9de50af2b987bde990
Author: Yijie Shen <he...@gmail.com>
Date:   2015-04-14T13:52:40Z

    Eliminate partition filters from execution.Filter if partition key is included neither in original schema nor in filter's parents

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6903][SQL]Eliminate partition filters f...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/5509#issuecomment-93857108
  
    I don't think we should add one-off hacks to the query planning logic for data sources to avoid a filter.  Filters are pretty cheap to evaluate.  Do you have any measurements that suggest this change makes a noticeable performance difference?




[GitHub] spark pull request: [SQL]Eliminate partition filters from executio...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/5509#issuecomment-92887752
  
    Please review https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark first. This is missing a JIRA, among other things.




[GitHub] spark pull request: [SPARK-6903][SQL]Eliminate partition filters f...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5509#issuecomment-96767276
  
    Can one of the admins verify this patch?




[GitHub] spark pull request: [SPARK-6903][SQL]Eliminate partition filters f...

Posted by yijieshen <gi...@git.apache.org>.
Github user yijieshen commented on the pull request:

    https://github.com/apache/spark/pull/5509#issuecomment-92902675
  
    Added a ticket: https://issues.apache.org/jira/browse/SPARK-6903




[GitHub] spark pull request: [SQL]Eliminate partition filters from executio...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5509#issuecomment-92866993
  
    Can one of the admins verify this patch?

