You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by dongjoon-hyun <gi...@git.apache.org> on 2018/10/01 04:55:50 UTC

[GitHub] spark pull request #22597: [SPARK-25579][SQL] Use quoted attribute names if ...

GitHub user dongjoon-hyun opened a pull request:

    https://github.com/apache/spark/pull/22597

    [SPARK-25579][SQL] Use quoted attribute names if needed in pushed ORC predicates

    ## What changes were proposed in this pull request?
    
    This PR aims to fix an ORC performance regression at Spark 2.4.0 RCs from Spark 2.3.2. Currently, for column names with `.`, the pushed predicates are ignored.
    
    **Test Data**
    ```scala
    scala> val df = spark.range(Int.MaxValue).sample(0.2).toDF("col.with.dot")
    scala> df.write.mode("overwrite").orc("/tmp/orc")
    ```
    
    **Spark 2.3.2**
    ```scala
    scala> spark.sql("set spark.sql.orc.impl=native")
    scala> spark.sql("set spark.sql.orc.filterPushdown=true")
    scala> spark.time(spark.read.orc("/tmp/orc").where("`col.with.dot` < 10").show)
    +------------+
    |col.with.dot|
    +------------+
    |           1|
    |           8|
    +------------+
    
    Time taken: 1486 ms
    
    scala> spark.time(spark.read.orc("/tmp/orc").where("`col.with.dot` < 10").show)
    +------------+
    |col.with.dot|
    +------------+
    |           1|
    |           8|
    +------------+
    
    Time taken: 163 ms
    ```
    
    **Spark 2.4.0 RC2**
    ```scala
    scala> spark.time(spark.read.orc("/tmp/orc").where("`col.with.dot` < 10").show)
    +------------+
    |col.with.dot|
    +------------+
    |           1|
    |           8|
    +------------+
    
    Time taken: 4087 ms
    
    scala> spark.time(spark.read.orc("/tmp/orc").where("`col.with.dot` < 10").show)
    +------------+
    |col.with.dot|
    +------------+
    |           1|
    |           8|
    +------------+
    
    Time taken: 1998 ms
    ```
    
    **This PR**
    ```scala
    scala> spark.time(spark.read.orc("/tmp/orc").where("`col.with.dot` < 10").show)
    +------------+
    |col.with.dot|
    +------------+
    |           1|
    |           8|
    +------------+
    
    Time taken: 2477 ms
    
    scala> spark.time(spark.read.orc("/tmp/orc").where("`col.with.dot` < 10").show)
    +------------+
    |col.with.dot|
    +------------+
    |           1|
    |           8|
    +------------+
    
    Time taken: 253 ms
    ```
    
    ## How was this patch tested?
    
    Pass the Jenkins with the existing test and manually performance test.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dongjoon-hyun/spark SPARK-25579

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22597.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22597
    
----
commit f6c3dca65b85888392f8299cc5fc20f698c6afc5
Author: Dongjoon Hyun <do...@...>
Date:   2018-10-01T04:33:04Z

    [SPARK-25579][SQL] Use quoted attribute names if needed in pushed ORC predicates

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22597: [SPARK-25579][SQL] Use quoted attribute names if needed ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22597
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22597: [SPARK-25579][SQL] Use quoted attribute names if needed ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22597
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22597: [SPARK-25579][SQL] Use quoted attribute names if ...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22597#discussion_r221546273
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilters.scala ---
    @@ -67,6 +67,16 @@ private[sql] object OrcFilters {
         }
       }
     
    +  // Since ORC 1.5.0 (ORC-323), we need to quote for column names with `.` characters
    +  // in order to distinguish predicate pushdown for nested columns.
    +  private def quoteAttributeNameIfNeeded(name: String) : String = {
    +    if (!name.contains("`") && name.contains(".")) {
    --- End diff --
    
    Does this condition take the backtick in column name into account? For instance,
    
    ```
    >>> spark.range(1).toDF("abc`.abc").show()
    +--------+
    |abc`.abc|
    +--------+
    |       0|
    +--------+
    ```


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22597: [SPARK-25579][SQL] Use quoted attribute names if needed ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22597
  
    **[Test build #97329 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97329/testReport)** for PR 22597 at commit [`335a39f`](https://github.com/apache/spark/commit/335a39f58c2103a41c6c30340746734d3aeda954).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22597: [SPARK-25579][SQL] Use quoted attribute names if ...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/22597


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22597: [SPARK-25579][SQL] Use quoted attribute names if ...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22597#discussion_r225398146
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilterSuite.scala ---
    @@ -383,4 +385,17 @@ class OrcFilterSuite extends OrcTest with SharedSQLContext {
           )).get.toString
         }
       }
    +
    +  test("SPARK-25579 ORC PPD should support column names with dot") {
    +    import testImplicits._
    +
    +    withSQLConf(SQLConf.ORC_FILTER_PUSHDOWN_ENABLED.key -> "true") {
    +      withTempDir { dir =>
    +        val path = new File(dir, "orc").getCanonicalPath
    +        Seq((1, 2), (3, 4)).toDF("col.dot.1", "col.dot.2").write.orc(path)
    --- End diff --
    
    Sure. Thank you for confirmation, @cloud-fan and @HyukjinKwon .


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22597: [SPARK-25579][SQL] Use quoted attribute names if needed ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22597
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22597: [SPARK-25579][SQL] Use quoted attribute names if needed ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22597
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22597: [SPARK-25579][SQL] Use quoted attribute names if needed ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22597
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22597: [SPARK-25579][SQL] Use quoted attribute names if ...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22597#discussion_r224947556
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilters.scala ---
    @@ -67,6 +67,16 @@ private[sql] object OrcFilters {
         }
       }
     
    +  // Since ORC 1.5.0 (ORC-323), we need to quote for column names with `.` characters
    +  // in order to distinguish predicate pushdown for nested columns.
    +  private def quoteAttributeNameIfNeeded(name: String) : String = {
    +    if (!name.contains("`") && name.contains(".")) {
    --- End diff --
    
    @HyukjinKwon . Actually, Spark 2.3.2 ORC (native/hive) doesn't support a backtick character in column names.  It fails on **writing** operation. And, although Spark 2.4.0 broadens the supported special characters like `.` and `"` in column names, the backtick character is not handled yet.
    
    So, for that one, I'll proceed in another PR since it's an improvement instead of a regression.
    
    Also, cc @gatorsmile and @dbtsai .


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22597: [SPARK-25579][SQL] Use quoted attribute names if needed ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22597
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22597: [SPARK-25579][SQL] Use quoted attribute names if needed ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22597
  
    **[Test build #97366 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97366/testReport)** for PR 22597 at commit [`849c7fa`](https://github.com/apache/spark/commit/849c7fa67007aea784e95b81c65b52f23ea24425).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22597: [SPARK-25579][SQL] Use quoted attribute names if ...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22597#discussion_r225024728
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilterSuite.scala ---
    @@ -383,4 +385,17 @@ class OrcFilterSuite extends OrcTest with SharedSQLContext {
           )).get.toString
         }
       }
    +
    +  test("SPARK-25579 ORC PPD should support column names with dot") {
    +    import testImplicits._
    +
    +    withSQLConf(SQLConf.ORC_FILTER_PUSHDOWN_ENABLED.key -> "true") {
    +      withTempDir { dir =>
    +        val path = new File(dir, "orc").getCanonicalPath
    +        Seq((1, 2), (3, 4)).toDF("col.dot.1", "col.dot.2").write.orc(path)
    +        val df = spark.read.orc(path).where("`col.dot.1` = 1 and `col.dot.2` = 2")
    +        checkAnswer(stripSparkFilter(df), Row(1, 2))
    --- End diff --
    
    to confirm, this only works when `(1, 2)` and `(3, 4)` are in the same row group? (not sure what's the terminology in ORC)


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22597: [SPARK-25579][SQL] Use quoted attribute names if needed ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22597
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3965/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22597: [SPARK-25579][SQL] Use quoted attribute names if needed ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22597
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96810/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22597: [SPARK-25579][SQL] Use quoted attribute names if needed ...

Posted by dbtsai <gi...@git.apache.org>.

Github user dbtsai commented on the issue:

    https://github.com/apache/spark/pull/22597
  
    Is it possible to add tests like parquet to remove the filter in Spark SQL to ensure that the predicate is pushed down to the reader? Thanks.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22597: [SPARK-25579][SQL] Use quoted attribute names if needed ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22597
  
    **[Test build #97441 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97441/testReport)** for PR 22597 at commit [`7686179`](https://github.com/apache/spark/commit/7686179678b86369e3b62fb2ae8ae9a4384e5c14).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22597: [SPARK-25579][SQL] Use quoted attribute names if needed ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22597
  
    **[Test build #97441 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97441/testReport)** for PR 22597 at commit [`7686179`](https://github.com/apache/spark/commit/7686179678b86369e3b62fb2ae8ae9a4384e5c14).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22597: [SPARK-25579][SQL] Use quoted attribute names if ...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22597#discussion_r224984073
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilterSuite.scala ---
    @@ -383,4 +384,15 @@ class OrcFilterSuite extends OrcTest with SharedSQLContext {
           )).get.toString
         }
       }
    +
    +  test("SPARK-25579 ORC PPD should support column names with dot") {
    +    import testImplicits._
    +
    +    withTempDir { dir =>
    +      val path = new File(dir, "orc").getCanonicalPath
    +      Seq((1, 2), (3, 4)).toDF("col.dot.1", "col.dot.2").write.orc(path)
    +      val df = spark.read.orc(path).where("`col.dot.1` = 1 and `col.dot.2` = 2")
    +      checkAnswer(stripSparkFilter(df), Row(1, 2))
    --- End diff --
    
    @dongjoon-hyun, technically shouldn't we test if the stripes are filtered? I added some tests a long ago (`stripSparkFilter` is added by me FWIW as well):
    
    https://github.com/apache/spark/blob/5d572fc7c35f76e27b2ab400674923eb8ba91745/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcQuerySuite.scala#L445-L459
    
    
    
    
     


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22597: [SPARK-25579][SQL] Use quoted attribute names if ...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22597#discussion_r225397862
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilterSuite.scala ---
    @@ -383,4 +385,17 @@ class OrcFilterSuite extends OrcTest with SharedSQLContext {
           )).get.toString
         }
       }
    +
    +  test("SPARK-25579 ORC PPD should support column names with dot") {
    +    import testImplicits._
    +
    +    withSQLConf(SQLConf.ORC_FILTER_PUSHDOWN_ENABLED.key -> "true") {
    +      withTempDir { dir =>
    +        val path = new File(dir, "orc").getCanonicalPath
    +        Seq((1, 2), (3, 4)).toDF("col.dot.1", "col.dot.2").write.orc(path)
    +        val df = spark.read.orc(path).where("`col.dot.1` = 1 and `col.dot.2` = 2")
    +        checkAnswer(stripSparkFilter(df), Row(1, 2))
    --- End diff --
    
    Thank you for review, @dbtsai ! I ignored PPDs with nested columns here because Spark doesn't pushdown in Spark 2.4 and until now. With your PR (#22573), Spark 3.0 will support that and we can update this to handle that cases, too.
    
    @cloud-fan . Actually, ORC 1.5.0 starts to support PPD with nested columns [ORC-323](https://issues.apache.org/jira/browse/ORC-323). So, @dbtsai and I discussed about supporting that before. We are going to support ORC PPDs with nested columns in Spark 3.0 without regression.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22597: [SPARK-25579][SQL] Use quoted attribute names if needed ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22597
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22597: [SPARK-25579][SQL] Use quoted attribute names if needed ...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/22597
  
    Merged to master and branch-2.4.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22597: [SPARK-25579][SQL] Use quoted attribute names if needed ...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/22597
  
    > I haven't looked into, but the parquet record-level filtering is disabled by default, so if we remove predicates from spark side, the result can be wrong even if the predicates are pushed ro parquet.
    
    That's explicitly enabled for the parquet tests (that's disabled by my FWIW). For ORC tests, since it doesn't support record by record filter, it tests if the output is smaller then the original data.
    
    Some parquet tests do this as well for instance, 
    
    https://github.com/apache/spark/blob/5d726b865948f993911fd5b9730b25cfa94e16c7/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala#L1016-L1040


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22597: [SPARK-25579][SQL] Use quoted attribute names if needed ...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/22597
  
    retest this please


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22597: [SPARK-25579][SQL] Use quoted attribute names if needed ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22597
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/4029/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22597: [SPARK-25579][SQL] Use quoted attribute names if ...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22597#discussion_r225024900
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilterSuite.scala ---
    @@ -383,4 +385,17 @@ class OrcFilterSuite extends OrcTest with SharedSQLContext {
           )).get.toString
         }
       }
    +
    +  test("SPARK-25579 ORC PPD should support column names with dot") {
    +    import testImplicits._
    +
    +    withSQLConf(SQLConf.ORC_FILTER_PUSHDOWN_ENABLED.key -> "true") {
    +      withTempDir { dir =>
    +        val path = new File(dir, "orc").getCanonicalPath
    +        Seq((1, 2), (3, 4)).toDF("col.dot.1", "col.dot.2").write.orc(path)
    --- End diff --
    
    How about explicitly repartition to make separate output files?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22597: [SPARK-25579][SQL] Use quoted attribute names if needed ...

Posted by dbtsai <gi...@git.apache.org>.

Github user dbtsai commented on the issue:

    https://github.com/apache/spark/pull/22597
  
    In `ParquetFilter`, the way we test if a predicate pushdown works is by removing that predicate from Spark SQL physical plan, and only relying on the reader to do the filter. Thus, if there is a bug in pushdown filter in reader, Spark will get the incorrect result. This can use in test to ensure no regression later.   


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22597: [SPARK-25579][SQL] Use quoted attribute names if needed ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22597
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/97436/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22597: [SPARK-25579][SQL] Use quoted attribute names if needed ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22597
  
    **[Test build #96810 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96810/testReport)** for PR 22597 at commit [`f6c3dca`](https://github.com/apache/spark/commit/f6c3dca65b85888392f8299cc5fc20f698c6afc5).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22597: [SPARK-25579][SQL] Use quoted attribute names if needed ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22597
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3937/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22597: [SPARK-25579][SQL] Use quoted attribute names if needed ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22597
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/97329/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22597: [SPARK-25579][SQL] Use quoted attribute names if needed ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22597
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22597: [SPARK-25579][SQL] Use quoted attribute names if ...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22597#discussion_r224983962
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilterSuite.scala ---
    @@ -383,4 +384,15 @@ class OrcFilterSuite extends OrcTest with SharedSQLContext {
           )).get.toString
         }
       }
    +
    +  test("SPARK-25579 ORC PPD should support column names with dot") {
    +    import testImplicits._
    --- End diff --
    
    Can we add a test at `OrcFilterSuite` too?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22597: [SPARK-25579][SQL] Use quoted attribute names if needed ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22597
  
    **[Test build #96810 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96810/testReport)** for PR 22597 at commit [`f6c3dca`](https://github.com/apache/spark/commit/f6c3dca65b85888392f8299cc5fc20f698c6afc5).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22597: [SPARK-25579][SQL] Use quoted attribute names if ...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22597#discussion_r225373279
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilterSuite.scala ---
    @@ -383,4 +385,17 @@ class OrcFilterSuite extends OrcTest with SharedSQLContext {
           )).get.toString
         }
       }
    +
    +  test("SPARK-25579 ORC PPD should support column names with dot") {
    +    import testImplicits._
    +
    +    withSQLConf(SQLConf.ORC_FILTER_PUSHDOWN_ENABLED.key -> "true") {
    +      withTempDir { dir =>
    +        val path = new File(dir, "orc").getCanonicalPath
    +        Seq((1, 2), (3, 4)).toDF("col.dot.1", "col.dot.2").write.orc(path)
    +        val df = spark.read.orc(path).where("`col.dot.1` = 1 and `col.dot.2` = 2")
    +        checkAnswer(stripSparkFilter(df), Row(1, 2))
    --- End diff --
    
    ORC data source doesn't support nested column pruning yet.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22597: [SPARK-25579][SQL] Use quoted attribute names if needed ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22597
  
    **[Test build #97445 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97445/testReport)** for PR 22597 at commit [`7686179`](https://github.com/apache/spark/commit/7686179678b86369e3b62fb2ae8ae9a4384e5c14).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22597: [SPARK-25579][SQL] Use quoted attribute names if needed ...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/22597
  
    Thank you all!


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22597: [SPARK-25579][SQL] Use quoted attribute names if needed ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22597
  
    **[Test build #97366 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97366/testReport)** for PR 22597 at commit [`849c7fa`](https://github.com/apache/spark/commit/849c7fa67007aea784e95b81c65b52f23ea24425).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22597: [SPARK-25579][SQL] Use quoted attribute names if needed ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22597
  
    **[Test build #97445 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97445/testReport)** for PR 22597 at commit [`7686179`](https://github.com/apache/spark/commit/7686179678b86369e3b62fb2ae8ae9a4384e5c14).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22597: [SPARK-25579][SQL] Use quoted attribute names if needed ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22597
  
    **[Test build #97329 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97329/testReport)** for PR 22597 at commit [`335a39f`](https://github.com/apache/spark/commit/335a39f58c2103a41c6c30340746734d3aeda954).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22597: [SPARK-25579][SQL] Use quoted attribute names if needed ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22597
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/97441/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22597: [SPARK-25579][SQL] Use quoted attribute names if ...

Posted by dbtsai <gi...@git.apache.org>.

Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22597#discussion_r225309479
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilterSuite.scala ---
    @@ -383,4 +385,17 @@ class OrcFilterSuite extends OrcTest with SharedSQLContext {
           )).get.toString
         }
       }
    +
    +  test("SPARK-25579 ORC PPD should support column names with dot") {
    +    import testImplicits._
    +
    +    withSQLConf(SQLConf.ORC_FILTER_PUSHDOWN_ENABLED.key -> "true") {
    +      withTempDir { dir =>
    +        val path = new File(dir, "orc").getCanonicalPath
    +        Seq((1, 2), (3, 4)).toDF("col.dot.1", "col.dot.2").write.orc(path)
    +        val df = spark.read.orc(path).where("`col.dot.1` = 1 and `col.dot.2` = 2")
    +        checkAnswer(stripSparkFilter(df), Row(1, 2))
    --- End diff --
    
    How do we generalize this into nested cases? The parent struct can contain dot as well.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22597: [SPARK-25579][SQL] Use quoted attribute names if needed ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22597
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/4025/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22597: [SPARK-25579][SQL] Use quoted attribute names if ...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22597#discussion_r225014099
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilterSuite.scala ---
    @@ -383,4 +384,15 @@ class OrcFilterSuite extends OrcTest with SharedSQLContext {
           )).get.toString
         }
       }
    +
    +  test("SPARK-25579 ORC PPD should support column names with dot") {
    +    import testImplicits._
    --- End diff --
    
    Okay. One end to end test should be enough


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22597: [SPARK-25579][SQL] Use quoted attribute names if needed ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22597
  
    **[Test build #97436 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97436/testReport)** for PR 22597 at commit [`7686179`](https://github.com/apache/spark/commit/7686179678b86369e3b62fb2ae8ae9a4384e5c14).
     * This patch **fails due to an unknown error code, -9**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22597: [SPARK-25579][SQL] Use quoted attribute names if ...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22597#discussion_r225373223
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilterSuite.scala ---
    @@ -383,4 +385,17 @@ class OrcFilterSuite extends OrcTest with SharedSQLContext {
           )).get.toString
         }
       }
    +
    +  test("SPARK-25579 ORC PPD should support column names with dot") {
    +    import testImplicits._
    +
    +    withSQLConf(SQLConf.ORC_FILTER_PUSHDOWN_ENABLED.key -> "true") {
    +      withTempDir { dir =>
    +        val path = new File(dir, "orc").getCanonicalPath
    +        Seq((1, 2), (3, 4)).toDF("col.dot.1", "col.dot.2").write.orc(path)
    --- End diff --
    
    Do not rely on implicit environment values, let's make the test as explicit as possible.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22597: [SPARK-25579][SQL] Use quoted attribute names if needed ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22597
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22597: [SPARK-25579][SQL] Use quoted attribute names if needed ...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/22597
  
    Yes. Please add a test case. 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22597: [SPARK-25579][SQL] Use quoted attribute names if ...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22597#discussion_r225004937
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilterSuite.scala ---
    @@ -383,4 +384,15 @@ class OrcFilterSuite extends OrcTest with SharedSQLContext {
           )).get.toString
         }
       }
    +
    +  test("SPARK-25579 ORC PPD should support column names with dot") {
    +    import testImplicits._
    +
    +    withTempDir { dir =>
    +      val path = new File(dir, "orc").getCanonicalPath
    +      Seq((1, 2), (3, 4)).toDF("col.dot.1", "col.dot.2").write.orc(path)
    +      val df = spark.read.orc(path).where("`col.dot.1` = 1 and `col.dot.2` = 2")
    +      checkAnswer(stripSparkFilter(df), Row(1, 2))
    --- End diff --
    
    Like that test, this test also generates two ORC files with one row and test if PPD works.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22597: [SPARK-25579][SQL] Use quoted attribute names if needed ...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/22597
  
    Thanks. I got it. You mean `stripSparkFilter` which is used in both `OrcQuerySuite.scala` and `ParquetFilterSuite.scala`. Sure!


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22597: [SPARK-25579][SQL] Use quoted attribute names if needed ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22597
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/97366/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22597: [SPARK-25579][SQL] Use quoted attribute names if needed ...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/22597
  
    > In ParquetFilter, the way we test if a predicate pushdown works is by removing that predicate from Spark SQL physical plan, and only relying on the reader to do the filter.
    
    I haven't looked into, but the parquet record-level filtering is disabled by default, so if we remove predicates from spark side, the result can be wrong even if the predicates are pushed ro parquet.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22597: [SPARK-25579][SQL] Use quoted attribute names if needed ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22597
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/97445/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22597: [SPARK-25579][SQL] Use quoted attribute names if ...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22597#discussion_r225025622
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilterSuite.scala ---
    @@ -383,4 +385,17 @@ class OrcFilterSuite extends OrcTest with SharedSQLContext {
           )).get.toString
         }
       }
    +
    +  test("SPARK-25579 ORC PPD should support column names with dot") {
    +    import testImplicits._
    +
    +    withSQLConf(SQLConf.ORC_FILTER_PUSHDOWN_ENABLED.key -> "true") {
    +      withTempDir { dir =>
    +        val path = new File(dir, "orc").getCanonicalPath
    +        Seq((1, 2), (3, 4)).toDF("col.dot.1", "col.dot.2").write.orc(path)
    +        val df = spark.read.orc(path).where("`col.dot.1` = 1 and `col.dot.2` = 2")
    +        checkAnswer(stripSparkFilter(df), Row(1, 2))
    --- End diff --
    
    Yup.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22597: [SPARK-25579][SQL] Use quoted attribute names if ...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22597#discussion_r224947824
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilters.scala ---
    @@ -67,6 +67,16 @@ private[sql] object OrcFilters {
         }
       }
     
    +  // Since ORC 1.5.0 (ORC-323), we need to quote for column names with `.` characters
    +  // in order to distinguish predicate pushdown for nested columns.
    +  private def quoteAttributeNameIfNeeded(name: String) : String = {
    +    if (!name.contains("`") && name.contains(".")) {
    --- End diff --
    
    For `ORC` and `AVRO` improvement, [SPARK-25722](https://issues.apache.org/jira/browse/SPARK-25722) is created.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22597: [SPARK-25579][SQL] Use quoted attribute names if needed ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22597
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22597: [SPARK-25579][SQL] Use quoted attribute names if needed ...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/22597
  
    LGTM


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22597: [SPARK-25579][SQL] Use quoted attribute names if ...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22597#discussion_r225001748
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilterSuite.scala ---
    @@ -383,4 +384,15 @@ class OrcFilterSuite extends OrcTest with SharedSQLContext {
           )).get.toString
         }
       }
    +
    +  test("SPARK-25579 ORC PPD should support column names with dot") {
    +    import testImplicits._
    --- End diff --
    
    Ur, this is `OrcFilterSuite`. 
    > Can we add a test at OrcFilterSuite too?
    
    For `HiveOrcFilterSuite`, `hive` ORC implementation doesn't support `dot`.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22597: [SPARK-25579][SQL] Use quoted attribute names if ...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22597#discussion_r225291092
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilterSuite.scala ---
    @@ -383,4 +385,17 @@ class OrcFilterSuite extends OrcTest with SharedSQLContext {
           )).get.toString
         }
       }
    +
    +  test("SPARK-25579 ORC PPD should support column names with dot") {
    +    import testImplicits._
    +
    +    withSQLConf(SQLConf.ORC_FILTER_PUSHDOWN_ENABLED.key -> "true") {
    +      withTempDir { dir =>
    +        val path = new File(dir, "orc").getCanonicalPath
    +        Seq((1, 2), (3, 4)).toDF("col.dot.1", "col.dot.2").write.orc(path)
    +        val df = spark.read.orc(path).where("`col.dot.1` = 1 and `col.dot.2` = 2")
    +        checkAnswer(stripSparkFilter(df), Row(1, 2))
    --- End diff --
    
    Yep. It works when they are in different `stripe`s.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22597: [SPARK-25579][SQL] Use quoted attribute names if needed ...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/22597
  
    retest this please


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22597: [SPARK-25579][SQL] Use quoted attribute names if needed ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22597
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22597: [SPARK-25579][SQL] Use quoted attribute names if ...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22597#discussion_r221711903
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilters.scala ---
    @@ -67,6 +67,16 @@ private[sql] object OrcFilters {
         }
       }
     
    +  // Since ORC 1.5.0 (ORC-323), we need to quote for column names with `.` characters
    +  // in order to distinguish predicate pushdown for nested columns.
    +  private def quoteAttributeNameIfNeeded(name: String) : String = {
    +    if (!name.contains("`") && name.contains(".")) {
    --- End diff --
    
    Thank you for review. I'll consider that, too.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22597: [SPARK-25579][SQL] Use quoted attribute names if needed ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22597
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22597: [SPARK-25579][SQL] Use quoted attribute names if ...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22597#discussion_r225295336
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilterSuite.scala ---
    @@ -383,4 +385,17 @@ class OrcFilterSuite extends OrcTest with SharedSQLContext {
           )).get.toString
         }
       }
    +
    +  test("SPARK-25579 ORC PPD should support column names with dot") {
    +    import testImplicits._
    +
    +    withSQLConf(SQLConf.ORC_FILTER_PUSHDOWN_ENABLED.key -> "true") {
    +      withTempDir { dir =>
    +        val path = new File(dir, "orc").getCanonicalPath
    +        Seq((1, 2), (3, 4)).toDF("col.dot.1", "col.dot.2").write.orc(path)
    --- End diff --
    
    We are using the default parallelism from `TestSparkSession` on two rows and it generates [separate output files](https://github.com/apache/spark/pull/22597#discussion_r225004937) already.
    
    If you are concerning some possibility of flakiness, we are able to increase the number of rows to `10` and call `repartition(10)` and check `assert(actual < 10)` as you did [before](https://github.com/apache/spark/blob/5d726b865948f993911fd5b9730b25cfa94e16c7/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala#L1016-L1040). Do you want that?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22597: [SPARK-25579][SQL] Use quoted attribute names if ...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22597#discussion_r225404390
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcTest.scala ---
    @@ -106,4 +106,14 @@ abstract class OrcTest extends QueryTest with SQLTestUtils with BeforeAndAfterAl
           df: DataFrame, path: File): Unit = {
         df.write.mode(SaveMode.Overwrite).orc(path.getCanonicalPath)
       }
    +
    +  protected def checkPredicatePushDown(df: DataFrame, numRows: Int, predicate: String): Unit = {
    --- End diff --
    
    @HyukjinKwon . I refactor this since it's repeated three times now.
    And, this function should be here because the existing two instances are in `OrcQueryTest` and new instance is in `OrcQuerySuite`.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22597: [SPARK-25579][SQL] Use quoted attribute names if needed ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22597
  
    **[Test build #97436 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97436/testReport)** for PR 22597 at commit [`7686179`](https://github.com/apache/spark/commit/7686179678b86369e3b62fb2ae8ae9a4384e5c14).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22597: [SPARK-25579][SQL] Use quoted attribute names if needed ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22597
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/4023/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22597: [SPARK-25579][SQL] Use quoted attribute names if needed ...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/22597
  
    Thank you for review, @dbtsai and @gatorsmile .
    
    BTW, what do you mean by removing? The pushed filter doesn't introduce correctness issue like Parquet. Since it's a performance slowdown, this PR want to fix it. We don't want to **remove** filters in this PR.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22597: [SPARK-25579][SQL] Use quoted attribute names if needed ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22597
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3600/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org