Posted to reviews@spark.apache.org by dbtsai <gi...@git.apache.org> on 2018/09/24 09:14:30 UTC

[GitHub] spark pull request #22535: [SPARK-17636][SQL][WIP] Parquet predicate pushdow...

GitHub user dbtsai opened a pull request:

    https://github.com/apache/spark/pull/22535

    [SPARK-17636][SQL][WIP] Parquet predicate pushdown in nested fields

    ## What changes were proposed in this pull request?
    
    Support Parquet predicate pushdown in nested fields
    
    ## How was this patch tested?
    
    Existing tests and new tests are added.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dbtsai/spark parquetNesting

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22535.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22535
    
----
commit c95706f60e4d576caca78a32000d4a7bbb12c141
Author: DB Tsai <d_...@...>
Date:   2018-09-06T00:22:09Z

    Nested parquet pushdown

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22535: [SPARK-17636][SQL][WIP] Parquet predicate pushdown in ne...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22535
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96528/
    Test FAILed.


---



[GitHub] spark issue #22535: [SPARK-17636][SQL][WIP] Parquet predicate pushdown in ne...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22535
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99673/
    Test FAILed.


---



[GitHub] spark issue #22535: [SPARK-17636][SQL][WIP] Parquet predicate pushdown in ne...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22535
  
    Merged build finished. Test FAILed.


---



[GitHub] spark issue #22535: [SPARK-17636][SQL][WIP] Parquet predicate pushdown in ne...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22535
  
    Build finished. Test FAILed.


---



[GitHub] spark issue #22535: [SPARK-17636][SQL][WIP] Parquet predicate pushdown in ne...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22535
  
    **[Test build #96528 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96528/testReport)** for PR 22535 at commit [`c95706f`](https://github.com/apache/spark/commit/c95706f60e4d576caca78a32000d4a7bbb12c141).


---



[GitHub] spark issue #22535: [SPARK-17636][SQL][WIP] Parquet predicate pushdown in ne...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22535
  
    **[Test build #99673 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99673/testReport)** for PR 22535 at commit [`c95706f`](https://github.com/apache/spark/commit/c95706f60e4d576caca78a32000d4a7bbb12c141).
     * This patch **fails Spark unit tests**.
     * This patch **does not merge cleanly**.
     * This patch adds no public classes.


---



[GitHub] spark issue #22535: [SPARK-17636][SQL][WIP] Parquet predicate pushdown in ne...

Posted by dbtsai <gi...@git.apache.org>.
Github user dbtsai commented on the issue:

    https://github.com/apache/spark/pull/22535
  
    I'm breaking this PR into three smaller PRs. I'll fix the tests in those smaller PRs. Thanks.


---



[GitHub] spark pull request #22535: [SPARK-17636][SQL][WIP] Parquet predicate pushdow...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22535#discussion_r219948634
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala ---
    @@ -437,53 +436,63 @@ object DataSourceStrategy {
        * @return a `Some[Filter]` if the input [[Expression]] is convertible, otherwise a `None`.
        */
       protected[sql] def translateFilter(predicate: Expression): Option[Filter] = {
    +    // Recursively try to find an attribute name in top level or struct that can be pushed down.
    +    def attrName(e: Expression): Option[String] = e match {
    +      // In Spark and many data sources such as parquet, dots are used as a column path delimiter;
    +      // thus, we don't push down such filters.
    +      case a: Attribute if !a.name.contains(".") =>
    +        Some(a.name)
    +      case s: GetStructField if !s.childSchema(s.ordinal).name.contains(".") =>
    +        attrName(s.child).map(_ + s".${s.childSchema(s.ordinal).name}")
    +      case _ =>
    +        None
    +    }
    +
         predicate match {
    -      case expressions.EqualTo(a: Attribute, Literal(v, t)) =>
    -        Some(sources.EqualTo(a.name, convertToScala(v, t)))
    -      case expressions.EqualTo(Literal(v, t), a: Attribute) =>
    -        Some(sources.EqualTo(a.name, convertToScala(v, t)))
    -
    -      case expressions.EqualNullSafe(a: Attribute, Literal(v, t)) =>
    -        Some(sources.EqualNullSafe(a.name, convertToScala(v, t)))
    -      case expressions.EqualNullSafe(Literal(v, t), a: Attribute) =>
    -        Some(sources.EqualNullSafe(a.name, convertToScala(v, t)))
    -
    -      case expressions.GreaterThan(a: Attribute, Literal(v, t)) =>
    -        Some(sources.GreaterThan(a.name, convertToScala(v, t)))
    -      case expressions.GreaterThan(Literal(v, t), a: Attribute) =>
    -        Some(sources.LessThan(a.name, convertToScala(v, t)))
    -
    -      case expressions.LessThan(a: Attribute, Literal(v, t)) =>
    -        Some(sources.LessThan(a.name, convertToScala(v, t)))
    -      case expressions.LessThan(Literal(v, t), a: Attribute) =>
    -        Some(sources.GreaterThan(a.name, convertToScala(v, t)))
    -
    -      case expressions.GreaterThanOrEqual(a: Attribute, Literal(v, t)) =>
    -        Some(sources.GreaterThanOrEqual(a.name, convertToScala(v, t)))
    -      case expressions.GreaterThanOrEqual(Literal(v, t), a: Attribute) =>
    -        Some(sources.LessThanOrEqual(a.name, convertToScala(v, t)))
    -
    -      case expressions.LessThanOrEqual(a: Attribute, Literal(v, t)) =>
    -        Some(sources.LessThanOrEqual(a.name, convertToScala(v, t)))
    -      case expressions.LessThanOrEqual(Literal(v, t), a: Attribute) =>
    -        Some(sources.GreaterThanOrEqual(a.name, convertToScala(v, t)))
    -
    -      case expressions.InSet(a: Attribute, set) =>
    -        val toScala = CatalystTypeConverters.createToScalaConverter(a.dataType)
    -        Some(sources.In(a.name, set.toArray.map(toScala)))
    +      case expressions.EqualTo(e: Expression, Literal(v, t)) =>
    --- End diff --
    
    This PR will be a good performance improvement for Spark 2.5.0.
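
For readers following the diff above, the recursive name resolution in `attrName` can be sketched on its own. This is only an illustration, not the code from the PR: `Expr`, `Attr`, `GetField`, `Lit`, and `NestedName` are hypothetical stand-ins for the Catalyst `Expression`, `Attribute`, `GetStructField`, and `Literal` classes, simplified so the snippet runs without Spark on the classpath.

```scala
// Hypothetical, simplified stand-ins for Catalyst expression classes.
sealed trait Expr
final case class Attr(name: String) extends Expr                   // top-level column
final case class GetField(child: Expr, field: String) extends Expr // struct field access
final case class Lit(value: Any) extends Expr                      // anything that is not a column

object NestedName {
  // Builds a dotted column path such as "a.b.c" by walking from the innermost
  // field access up to the top-level attribute. Returns None as soon as any
  // segment contains a dot itself, because the resulting path would be ambiguous.
  def attrName(e: Expr): Option[String] = e match {
    case Attr(name) if !name.contains(".") =>
      Some(name)
    case GetField(child, field) if !field.contains(".") =>
      attrName(child).map(parent => s"$parent.$field")
    case _ =>
      None
  }
}
```

With these definitions, `NestedName.attrName(GetField(GetField(Attr("a"), "b"), "c"))` yields `Some("a.b.c")`, while `NestedName.attrName(Attr("a.b"))` and `NestedName.attrName(GetField(Attr("a"), "b.c"))` both yield `None`, so a filter on such a column would not be pushed down.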


---



[GitHub] spark issue #22535: [SPARK-17636][SQL][WIP] Parquet predicate pushdown in ne...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22535
  
    **[Test build #99673 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99673/testReport)** for PR 22535 at commit [`c95706f`](https://github.com/apache/spark/commit/c95706f60e4d576caca78a32000d4a7bbb12c141).


---



[GitHub] spark issue #22535: [SPARK-17636][SQL][WIP] Parquet predicate pushdown in ne...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/22535
  
    retest this please.


---



[GitHub] spark issue #22535: [SPARK-17636][SQL][WIP] Parquet predicate pushdown in ne...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22535
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3404/
    Test PASSed.


---



[GitHub] spark issue #22535: [SPARK-17636][SQL][WIP] Parquet predicate pushdown in ne...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22535
  
    **[Test build #96528 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96528/testReport)** for PR 22535 at commit [`c95706f`](https://github.com/apache/spark/commit/c95706f60e4d576caca78a32000d4a7bbb12c141).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #22535: [SPARK-17636][SQL][WIP] Parquet predicate pushdown in ne...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22535
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3415/
    Test PASSed.


---



[GitHub] spark issue #22535: [SPARK-17636][SQL][WIP] Parquet predicate pushdown in ne...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22535
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #22535: [SPARK-17636][SQL][WIP] Parquet predicate pushdown in ne...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22535
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #22535: [SPARK-17636][SQL][WIP] Parquet predicate pushdown in ne...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22535
  
    **[Test build #96506 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96506/testReport)** for PR 22535 at commit [`c95706f`](https://github.com/apache/spark/commit/c95706f60e4d576caca78a32000d4a7bbb12c141).


---



[GitHub] spark issue #22535: [SPARK-17636][SQL][WIP] Parquet predicate pushdown in ne...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/22535
  
    Hi, @dbtsai. Could you fix the unit test (UT) failure?
    ```
    [info] ParquetFilterSuite:
    ...
    [info] - SPARK-12218 Converting conjunctions into Parquet filter predicates *** FAILED *** (19 milliseconds)
    [info]   Expected None, but got Some(lt(a, 10)) (ParquetFilterSuite.scala:802)
    ```


---
