You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by ndimiduk <gi...@git.apache.org> on 2017/02/21 00:39:00 UTC

[GitHub] spark pull request #17006: [SPARK-17636] Parquet filter push down doesn't ha...

GitHub user ndimiduk opened a pull request:

    https://github.com/apache/spark/pull/17006

    [SPARK-17636] Parquet filter push down doesn't handle struct fields

    ## What changes were proposed in this pull request?
    
    Changes `DataSourceStrategy#translateFilter` to operate on a `NamedExpressions` rather than just and `Attribute`, as it switches off of the `name` property, defined in `NamedExpressions`. `GetStructField` now implements `NamedExpression`.
    
    ## How was this patch tested?
    
    Testing thus far is manual, using debugger, explain plan output, and logging from the ElasticSearch connector. This patch includes no test changes; please advise on where/how to add test coverage.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ndimiduk/spark SPARK-17636

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17006.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #17006
    
----
commit 43ae0de846927630fbeaa410c43cdced39d2b173
Author: Nick Dimiduk <nd...@gmail.com>
Date:   2017-02-17T22:42:43Z

    [SPARK-17636] Parquet filter push down doesn't handle struct fields

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17006: [SPARK-17636] Parquet filter push down doesn't handle st...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/17006
  
    Yes, I meant closing this PR if you don't have a better idea for now or some time to try another approach soon. Feel free to reopen or make another one whenever you are willing to. Yes, I would not. Thank you.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17006: [SPARK-17636] Parquet filter push down doesn't handle st...

Posted by ndimiduk <gi...@git.apache.org>.
Github user ndimiduk commented on the issue:

    https://github.com/apache/spark/pull/17006
  
    Thanks for having a look @HyukjinKwon. I agree a test is needed. I'm new to Spark code base, so please direct me -- I assume there's an existing suite or suites that I can extend with coverage for this case. Also, please track the history I refer to on the ticket -- my use-case is not parquet, but the Elastic Search connector. This patch is enough to push necessary information down to ES, at least with my basic testing.
    
    My original ticket was closed as a dupe of the Parquet issue; notice also my comment requesting SPARK-17636 be renamed so as to not be Parquet-specific. Would it be better to re-open my original ticket and address the catalyst changes there? After that, 17636 can make subsequent changes to get fixes into Parquet.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17006: [SPARK-17636] Parquet filter push down doesn't handle st...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/17006
  
    BTW, even if we push the dot-separated names, this would not support the push-down in Parquet because `ParquetFilters` itself does not handle nested columns (see https://github.com/apache/spark/blob/7730426cb95eec2652a9ea979ae2c4faf7e585f2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala#L187-L234). 
    
    As it checks the field names in the schema, it pushs down nothing. Also, I think we definitely need a test.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17006: [SPARK-17636] Parquet filter push down doesn't handle st...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/17006
  
    See https://github.com/apache/spark/blob/7730426cb95eec2652a9ea979ae2c4faf7e585f2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala#L158-L160 too.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17006: [SPARK-17636] Parquet filter push down doesn't handle st...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/17006
  
    No, this is filter push down whereas that one is column pruneing.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17006: [SPARK-17636] Parquet filter push down doesn't ha...

Posted by ndimiduk <gi...@git.apache.org>.
Github user ndimiduk closed the pull request at:

    https://github.com/apache/spark/pull/17006


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17006: [SPARK-17636] Parquet filter push down doesn't handle st...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/17006
  
    @ndimiduk Let's close this for now if it is going to be inactive. I am not sure of the current way due to the test failure and we can continue to discuss this in the JIRA. If you think the JIRA should be re-open, please re-open with leaving a comment to explain why it should be re-open. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17006: [SPARK-17636] Parquet filter push down doesn't handle st...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/17006
  
    Hi @ndimiduk, it seems failed to build. Extending `NamedExpression` seems breaking `ResolveAliases`.
    
    I tested this with the codes below:
    
    ```scala
    Seq(Tuple1(Tuple1(1) :: Nil)).toDF.selectExpr("_1[0]._1")
    ```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17006: [SPARK-17636] Parquet filter push down doesn't handle st...

Posted by ndimiduk <gi...@git.apache.org>.
Github user ndimiduk commented on the issue:

    https://github.com/apache/spark/pull/17006
  
    Close the PR you mean? That's fine, I'm afraid I won't be able to make this a priority in the near term. Please don't close the JIRA though -- this is a big problem.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17006: [SPARK-17636] Parquet filter push down doesn't handle st...

Posted by Gauravshah <gi...@git.apache.org>.
Github user Gauravshah commented on the issue:

    https://github.com/apache/spark/pull/17006
  
    https://github.com/apache/spark/pull/16578 PR should solve this


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17006: [SPARK-17636] Parquet filter push down doesn't handle st...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/17006
  
    For adding a test, I usually go to the blame button and check where the tests were added in the recent commits. It seems `FilteredScanSuite` is the right place assuming from https://github.com/ndimiduk/spark/commit/7bc9a8c6249300ded31ea931c463d0a8f798e193.
    
    Yes, I saw that JIRA was closed. I think anyone definitely can re-open the JIRA if anyone is pretty sure and it could be fixed separately as a coherent unit. I think that's fine.
    



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17006: [SPARK-17636] Parquet filter push down doesn't handle st...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/17006
  
    Does that one deal with nested filter access as well we nested column pruning?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17006: [SPARK-17636] Parquet filter push down doesn't handle st...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17006
  
    Can one of the admins verify this patch?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org