Posted to reviews@spark.apache.org by sameeragarwal <gi...@git.apache.org> on 2016/07/21 18:36:09 UTC

[GitHub] spark pull request #14304: [SPARK-16668][TEST] Test parquet reader for row g...

GitHub user sameeragarwal opened a pull request:

    https://github.com/apache/spark/pull/14304

    [SPARK-16668][TEST] Test parquet reader for row groups containing both dictionary and plain encoded pages

    ## What changes were proposed in this pull request?
    
    This patch adds an explicit test for [SPARK-14217] by setting the parquet dictionary and page sizes such that the generated parquet file spans 3 pages (within a single row group), where the first page is dictionary encoded and the remaining two are plain encoded.
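    
    A minimal sketch of the setup, adapted from the test added in this patch (`dir` stands for a temporary directory, e.g. from `withTempPath`):
    
    ```scala
    // Small dictionary and page sizes force the data below to spill from one
    // dictionary-encoded page into two plain-encoded pages within a single row group.
    spark.conf.set("parquet.dictionary.page.size", "2048")
    spark.conf.set("parquet.page.size", "4096")
    
    // 512 distinct values, each repeated 3 times, written as strings.
    val data = (0 until 512).flatMap(i => Seq.fill(3)(i.toString))
    data.toDF("f").coalesce(1).write.parquet(dir.getCanonicalPath)
    ```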
    
    ## How was this patch tested?
    
    1. ParquetEncodingSuite
    2. Also manually tested that this test fails without https://github.com/apache/spark/pull/12279

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sameeragarwal/spark hybrid-encoding-test

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/14304.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #14304
    
----
commit adffc4407a783bdf86d5ee5a26d289ee496d1247
Author: Sameer Agarwal <sa...@cs.berkeley.edu>
Date:   2016-07-21T06:08:17Z

    experiments

commit 5e7556cf96d991b2f38fda82d28256687f056474
Author: Sameer Agarwal <sa...@cs.berkeley.edu>
Date:   2016-07-21T07:59:34Z

    works

commit 6b688e97310f903066b4085cb0374e76a9baef0a
Author: Sameer Agarwal <sa...@cs.berkeley.edu>
Date:   2016-07-21T18:29:53Z

    cleanup

commit f3029080c449d40c1dde8e97b97f0354866788c4
Author: Sameer Agarwal <sa...@cs.berkeley.edu>
Date:   2016-07-21T18:30:47Z

    cleanup

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14304: [SPARK-16668][TEST] Test parquet reader for row groups c...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14304
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request #14304: [SPARK-16668][TEST] Test parquet reader for row g...

Posted by jaceklaskowski <gi...@git.apache.org>.
Github user jaceklaskowski commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14304#discussion_r71774436
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetEncodingSuite.scala ---
    @@ -78,4 +78,29 @@ class ParquetEncodingSuite extends ParquetCompatibilityTest with SharedSQLContex
           }}
         }
       }
    +
    +  test("Read row group containing both dictionary and plain encoded pages") {
    +    spark.conf.set("parquet.dictionary.page.size", "2048")
    +    spark.conf.set("parquet.page.size", "4096")
    +
    +    withTempPath { dir =>
    +      // In order to explicitly test for SPARK-14217, we set the parquet dictionary and page size
    +      // such that the following data spans across 3 pages (within a single row group) where the
    +      // first page is dictionary encoded and the remaining two are plain encoded.
    +      val data = (0 until 512).flatMap(i => List(i.toString, i.toString, i.toString))
    +      data.toDF("f").coalesce(1).write.parquet(dir.getCanonicalPath)
    +      val file = SpecificParquetRecordReaderBase.listDirectory(dir).toArray.head
    +
    +      val reader = new VectorizedParquetRecordReader
    +      reader.initialize(file.asInstanceOf[String], null)
    +      val batch = reader.resultBatch()
    +      assert(reader.nextBatch())
    +
    +      (0 until 512).foreach { i =>
    +        assert(batch.column(0).getUTF8String(3 * i).toString == i.toString)
    --- End diff --
    
    Two things here:
    
    1. Extract `batch.column(0)` into a local value (around line 96 of the diff) so it isn't repeated.
    2. Since you already convert with `toString`, what do you think about `toInt` instead (since `i` is an `Int` anyway)? One conversion fewer :)




[GitHub] spark issue #14304: [SPARK-16668][TEST] Test parquet reader for row groups c...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14304
  
    **[Test build #62683 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62683/consoleFull)** for PR 14304 at commit [`f302908`](https://github.com/apache/spark/commit/f3029080c449d40c1dde8e97b97f0354866788c4).




[GitHub] spark issue #14304: [SPARK-16668][TEST] Test parquet reader for row groups c...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14304
  
    Merged build finished. Test FAILed.




[GitHub] spark issue #14304: [SPARK-16668][TEST] Test parquet reader for row groups c...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14304
  
    **[Test build #62740 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62740/consoleFull)** for PR 14304 at commit [`16e6b91`](https://github.com/apache/spark/commit/16e6b91688ad8f73336b7729745189e2bd7f880f).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #14304: [SPARK-16668][TEST] Test parquet reader for row groups c...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14304
  
    **[Test build #62739 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62739/consoleFull)** for PR 14304 at commit [`4f98c7f`](https://github.com/apache/spark/commit/4f98c7fc9c91893d22b78ed693d9a8f33bbb1146).




[GitHub] spark pull request #14304: [SPARK-16668][TEST] Test parquet reader for row g...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14304#discussion_r71903616
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetEncodingSuite.scala ---
    @@ -78,4 +78,30 @@ class ParquetEncodingSuite extends ParquetCompatibilityTest with SharedSQLContex
           }}
         }
       }
    +
    +  test("Read row group containing both dictionary and plain encoded pages") {
    +    spark.conf.set("parquet.dictionary.page.size", "2048")
    +    spark.conf.set("parquet.page.size", "4096")
    --- End diff --
    
    Nit: You can use the following constants instead of hard-coding the key strings, as sketched below:
    
    - `ParquetOutputFormat.DICTIONARY_PAGE_SIZE`
    - `ParquetOutputFormat.PAGE_SIZE`
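    
    A rough sketch of that change (assuming `ParquetOutputFormat` from `org.apache.parquet.hadoop` is on the test classpath):
    
    ```scala
    import org.apache.parquet.hadoop.ParquetOutputFormat
    
    // Same values as before; only the keys are now taken from Parquet's constants
    // instead of hard-coded strings.
    spark.conf.set(ParquetOutputFormat.DICTIONARY_PAGE_SIZE, "2048")
    spark.conf.set(ParquetOutputFormat.PAGE_SIZE, "4096")
    ```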




[GitHub] spark issue #14304: [SPARK-16668][TEST] Test parquet reader for row groups c...

Posted by sameeragarwal <gi...@git.apache.org>.
Github user sameeragarwal commented on the issue:

    https://github.com/apache/spark/pull/14304
  
    Thanks @jaceklaskowski, addressed your comments.




[GitHub] spark issue #14304: [SPARK-16668][TEST] Test parquet reader for row groups c...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14304
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62739/
    Test FAILed.




[GitHub] spark pull request #14304: [SPARK-16668][TEST] Test parquet reader for row g...

Posted by jaceklaskowski <gi...@git.apache.org>.
Github user jaceklaskowski commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14304#discussion_r71844762
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetEncodingSuite.scala ---
    @@ -78,4 +78,30 @@ class ParquetEncodingSuite extends ParquetCompatibilityTest with SharedSQLContex
           }}
         }
       }
    +
    +  test("Read row group containing both dictionary and plain encoded pages") {
    +    spark.conf.set("parquet.dictionary.page.size", "2048")
    +    spark.conf.set("parquet.page.size", "4096")
    +
    +    withTempPath { dir =>
    +      // In order to explicitly test for SPARK-14217, we set the parquet dictionary and page size
    +      // such that the following data spans across 3 pages (within a single row group) where the
    +      // first page is dictionary encoded and the remaining two are plain encoded.
    +      val data = (0 until 512).flatMap(i => Seq.fill(3)(i.toString))
    +      data.toDF("f").coalesce(1).write.parquet(dir.getCanonicalPath)
    +      val file =
    +        SpecificParquetRecordReaderBase.listDirectory(dir).toArray.head.asInstanceOf[String]
    +
    +      val reader = new VectorizedParquetRecordReader
    +      reader.initialize(file, null /* set columns to null to project all columns */)
    +      val column = reader.resultBatch().column(0)
    +      assert(reader.nextBatch())
    +
    +      (0 until 512).foreach { i =>
    +        assert(column.getUTF8String(3 * i).toString == i.toString)
    --- End diff --
    
    What about `toInt` as follows:
    
    ```
    assert(column.getUTF8String(3 * i).toInt == i)
    ```




[GitHub] spark issue #14304: [SPARK-16668][TEST] Test parquet reader for row groups c...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14304
  
    **[Test build #62692 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62692/consoleFull)** for PR 14304 at commit [`255e067`](https://github.com/apache/spark/commit/255e0679523a9e6205ff43d0ce6eb14b9a7a95f8).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #14304: [SPARK-16668][TEST] Test parquet reader for row groups c...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14304
  
    **[Test build #62683 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62683/consoleFull)** for PR 14304 at commit [`f302908`](https://github.com/apache/spark/commit/f3029080c449d40c1dde8e97b97f0354866788c4).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request #14304: [SPARK-16668][TEST] Test parquet reader for row g...

Posted by jaceklaskowski <gi...@git.apache.org>.
Github user jaceklaskowski commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14304#discussion_r71844632
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetEncodingSuite.scala ---
    @@ -78,4 +78,30 @@ class ParquetEncodingSuite extends ParquetCompatibilityTest with SharedSQLContex
           }}
         }
       }
    +
    +  test("Read row group containing both dictionary and plain encoded pages") {
    +    spark.conf.set("parquet.dictionary.page.size", "2048")
    +    spark.conf.set("parquet.page.size", "4096")
    +
    +    withTempPath { dir =>
    +      // In order to explicitly test for SPARK-14217, we set the parquet dictionary and page size
    +      // such that the following data spans across 3 pages (within a single row group) where the
    +      // first page is dictionary encoded and the remaining two are plain encoded.
    +      val data = (0 until 512).flatMap(i => Seq.fill(3)(i.toString))
    +      data.toDF("f").coalesce(1).write.parquet(dir.getCanonicalPath)
    +      val file =
    +        SpecificParquetRecordReaderBase.listDirectory(dir).toArray.head.asInstanceOf[String]
    +
    +      val reader = new VectorizedParquetRecordReader
    +      reader.initialize(file, null /* set columns to null to project all columns */)
    --- End diff --
    
    I meant `initialize(file, columns = null)` or even:
    
    ```
    val projectAllColumns = null
    initialize(file, projectAllColumns)
    ```
    
    That way the code itself states the intention (without needing an extra comment).




[GitHub] spark issue #14304: [SPARK-16668][TEST] Test parquet reader for row groups c...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on the issue:

    https://github.com/apache/spark/pull/14304
  
    Two minor issues, otherwise LGTM. Thanks!




[GitHub] spark issue #14304: [SPARK-16668][TEST] Test parquet reader for row groups c...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14304
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request #14304: [SPARK-16668][TEST] Test parquet reader for row g...

Posted by jaceklaskowski <gi...@git.apache.org>.
Github user jaceklaskowski commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14304#discussion_r71780859
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetEncodingSuite.scala ---
    @@ -78,4 +78,29 @@ class ParquetEncodingSuite extends ParquetCompatibilityTest with SharedSQLContex
           }}
         }
       }
    +
    +  test("Read row group containing both dictionary and plain encoded pages") {
    +    spark.conf.set("parquet.dictionary.page.size", "2048")
    +    spark.conf.set("parquet.page.size", "4096")
    +
    +    withTempPath { dir =>
    +      // In order to explicitly test for SPARK-14217, we set the parquet dictionary and page size
    +      // such that the following data spans across 3 pages (within a single row group) where the
    +      // first page is dictionary encoded and the remaining two are plain encoded.
    +      val data = (0 until 512).flatMap(i => List(i.toString, i.toString, i.toString))
    --- End diff --
    
    What do you think about `Seq.fill(3)(i.toString)`?




[GitHub] spark issue #14304: [SPARK-16668][TEST] Test parquet reader for row groups c...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14304
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request #14304: [SPARK-16668][TEST] Test parquet reader for row g...

Posted by sameeragarwal <gi...@git.apache.org>.
Github user sameeragarwal commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14304#discussion_r71954540
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetEncodingSuite.scala ---
    @@ -78,4 +78,30 @@ class ParquetEncodingSuite extends ParquetCompatibilityTest with SharedSQLContex
           }}
         }
       }
    +
    +  test("Read row group containing both dictionary and plain encoded pages") {
    +    spark.conf.set("parquet.dictionary.page.size", "2048")
    +    spark.conf.set("parquet.page.size", "4096")
    +
    +    withTempPath { dir =>
    +      // In order to explicitly test for SPARK-14217, we set the parquet dictionary and page size
    +      // such that the following data spans across 3 pages (within a single row group) where the
    +      // first page is dictionary encoded and the remaining two are plain encoded.
    +      val data = (0 until 512).flatMap(i => Seq.fill(3)(i.toString))
    +      data.toDF("f").coalesce(1).write.parquet(dir.getCanonicalPath)
    +      val file =
    +        SpecificParquetRecordReaderBase.listDirectory(dir).toArray.head.asInstanceOf[String]
    +
    +      val reader = new VectorizedParquetRecordReader
    +      reader.initialize(file, null /* set columns to null to project all columns */)
    +      val column = reader.resultBatch().column(0)
    +      assert(reader.nextBatch())
    +
    +      (0 until 512).foreach { i =>
    +        assert(column.getUTF8String(3 * i).toString == i.toString)
    --- End diff --
    
    Ah, gotcha!




[GitHub] spark pull request #14304: [SPARK-16668][TEST] Test parquet reader for row g...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14304#discussion_r71901937
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetEncodingSuite.scala ---
    @@ -78,4 +78,30 @@ class ParquetEncodingSuite extends ParquetCompatibilityTest with SharedSQLContex
           }}
         }
       }
    +
    +  test("Read row group containing both dictionary and plain encoded pages") {
    +    spark.conf.set("parquet.dictionary.page.size", "2048")
    +    spark.conf.set("parquet.page.size", "4096")
    --- End diff --
    
    Let's use `withSQLConf` to alter these settings so that they are automatically reverted to their original values at the end of the scope.
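    
    A rough sketch of the suggestion (using the hard-coded keys from the current patch; `withSQLConf` restores the previous values when the block exits):
    
    ```scala
    withSQLConf(
      "parquet.dictionary.page.size" -> "2048",
      "parquet.page.size" -> "4096") {
      withTempPath { dir =>
        // ... write the data and read it back with VectorizedParquetRecordReader as before ...
      }
    }
    ```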




[GitHub] spark pull request #14304: [SPARK-16668][TEST] Test parquet reader for row g...

Posted by sameeragarwal <gi...@git.apache.org>.
Github user sameeragarwal commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14304#discussion_r71787031
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetEncodingSuite.scala ---
    @@ -78,4 +78,29 @@ class ParquetEncodingSuite extends ParquetCompatibilityTest with SharedSQLContex
           }}
         }
       }
    +
    +  test("Read row group containing both dictionary and plain encoded pages") {
    +    spark.conf.set("parquet.dictionary.page.size", "2048")
    +    spark.conf.set("parquet.page.size", "4096")
    +
    +    withTempPath { dir =>
    +      // In order to explicitly test for SPARK-14217, we set the parquet dictionary and page size
    +      // such that the following data spans across 3 pages (within a single row group) where the
    +      // first page is dictionary encoded and the remaining two are plain encoded.
    +      val data = (0 until 512).flatMap(i => List(i.toString, i.toString, i.toString))
    +      data.toDF("f").coalesce(1).write.parquet(dir.getCanonicalPath)
    +      val file = SpecificParquetRecordReaderBase.listDirectory(dir).toArray.head
    +
    +      val reader = new VectorizedParquetRecordReader
    +      reader.initialize(file.asInstanceOf[String], null)
    --- End diff --
    
    This is calling into Java code, so named parameters wouldn't work. I added a comment to make it clear.




[GitHub] spark issue #14304: [SPARK-16668][TEST] Test parquet reader for row groups c...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14304
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62683/
    Test PASSed.




[GitHub] spark pull request #14304: [SPARK-16668][TEST] Test parquet reader for row g...

Posted by sameeragarwal <gi...@git.apache.org>.
Github user sameeragarwal commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14304#discussion_r71954942
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetEncodingSuite.scala ---
    @@ -78,4 +78,30 @@ class ParquetEncodingSuite extends ParquetCompatibilityTest with SharedSQLContex
           }}
         }
       }
    +
    +  test("Read row group containing both dictionary and plain encoded pages") {
    +    spark.conf.set("parquet.dictionary.page.size", "2048")
    +    spark.conf.set("parquet.page.size", "4096")
    +
    +    withTempPath { dir =>
    +      // In order to explicitly test for SPARK-14217, we set the parquet dictionary and page size
    +      // such that the following data spans across 3 pages (within a single row group) where the
    +      // first page is dictionary encoded and the remaining two are plain encoded.
    +      val data = (0 until 512).flatMap(i => Seq.fill(3)(i.toString))
    +      data.toDF("f").coalesce(1).write.parquet(dir.getCanonicalPath)
    +      val file =
    +        SpecificParquetRecordReaderBase.listDirectory(dir).toArray.head.asInstanceOf[String]
    +
    +      val reader = new VectorizedParquetRecordReader
    +      reader.initialize(file, null /* set columns to null to project all columns */)
    +      val column = reader.resultBatch().column(0)
    +      assert(reader.nextBatch())
    +
    +      (0 until 512).foreach { i =>
    +        assert(column.getUTF8String(3 * i).toString == i.toString)
    --- End diff --
    
    Seems like there's no `toInt` function in `org.apache.spark.unsafe.types.UTF8String`




[GitHub] spark issue #14304: [SPARK-16668][TEST] Test parquet reader for row groups c...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14304
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62692/
    Test PASSed.




[GitHub] spark pull request #14304: [SPARK-16668][TEST] Test parquet reader for row g...

Posted by jaceklaskowski <gi...@git.apache.org>.
Github user jaceklaskowski commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14304#discussion_r71773933
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetEncodingSuite.scala ---
    @@ -78,4 +78,29 @@ class ParquetEncodingSuite extends ParquetCompatibilityTest with SharedSQLContex
           }}
         }
       }
    +
    +  test("Read row group containing both dictionary and plain encoded pages") {
    +    spark.conf.set("parquet.dictionary.page.size", "2048")
    +    spark.conf.set("parquet.page.size", "4096")
    +
    +    withTempPath { dir =>
    +      // In order to explicitly test for SPARK-14217, we set the parquet dictionary and page size
    +      // such that the following data spans across 3 pages (within a single row group) where the
    +      // first page is dictionary encoded and the remaining two are plain encoded.
    +      val data = (0 until 512).flatMap(i => List(i.toString, i.toString, i.toString))
    +      data.toDF("f").coalesce(1).write.parquet(dir.getCanonicalPath)
    +      val file = SpecificParquetRecordReaderBase.listDirectory(dir).toArray.head
    +
    +      val reader = new VectorizedParquetRecordReader
    +      reader.initialize(file.asInstanceOf[String], null)
    --- End diff --
    
    What do you think about moving this `asInstanceOf` to line 92 and using a named parameter for `null`?




[GitHub] spark issue #14304: [SPARK-16668][TEST] Test parquet reader for row groups c...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14304
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62740/
    Test PASSed.




[GitHub] spark pull request #14304: [SPARK-16668][TEST] Test parquet reader for row g...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14304#discussion_r71902498
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetEncodingSuite.scala ---
    @@ -78,4 +78,30 @@ class ParquetEncodingSuite extends ParquetCompatibilityTest with SharedSQLContex
           }}
         }
       }
    +
    +  test("Read row group containing both dictionary and plain encoded pages") {
    +    spark.conf.set("parquet.dictionary.page.size", "2048")
    +    spark.conf.set("parquet.page.size", "4096")
    +
    +    withTempPath { dir =>
    +      // In order to explicitly test for SPARK-14217, we set the parquet dictionary and page size
    +      // such that the following data spans across 3 pages (within a single row group) where the
    +      // first page is dictionary encoded and the remaining two are plain encoded.
    +      val data = (0 until 512).flatMap(i => Seq.fill(3)(i.toString))
    +      data.toDF("f").coalesce(1).write.parquet(dir.getCanonicalPath)
    +      val file =
    +        SpecificParquetRecordReaderBase.listDirectory(dir).toArray.head.asInstanceOf[String]
    --- End diff --
    
    We can use `JavaConverters` here:
    
    ```scala
    import scala.collection.JavaConverters._
    
    val file = SpecificParquetRecordReaderBase.listDirectory(dir).asScala.head
    ```




[GitHub] spark pull request #14304: [SPARK-16668][TEST] Test parquet reader for row g...

Posted by sameeragarwal <gi...@git.apache.org>.
Github user sameeragarwal commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14304#discussion_r71787200
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetEncodingSuite.scala ---
    @@ -78,4 +78,29 @@ class ParquetEncodingSuite extends ParquetCompatibilityTest with SharedSQLContex
           }}
         }
       }
    +
    +  test("Read row group containing both dictionary and plain encoded pages") {
    +    spark.conf.set("parquet.dictionary.page.size", "2048")
    +    spark.conf.set("parquet.page.size", "4096")
    +
    +    withTempPath { dir =>
    +      // In order to explicitly test for SPARK-14217, we set the parquet dictionary and page size
    +      // such that the following data spans across 3 pages (within a single row group) where the
    +      // first page is dictionary encoded and the remaining two are plain encoded.
    +      val data = (0 until 512).flatMap(i => List(i.toString, i.toString, i.toString))
    +      data.toDF("f").coalesce(1).write.parquet(dir.getCanonicalPath)
    +      val file = SpecificParquetRecordReaderBase.listDirectory(dir).toArray.head
    +
    +      val reader = new VectorizedParquetRecordReader
    +      reader.initialize(file.asInstanceOf[String], null)
    +      val batch = reader.resultBatch()
    +      assert(reader.nextBatch())
    +
    +      (0 until 512).foreach { i =>
    +        assert(batch.column(0).getUTF8String(3 * i).toString == i.toString)
    --- End diff --
    
    Unfortunately, using ints wouldn't produce the hybrid encoding that we're testing for (it just ends up producing 2 dictionary-encoded pages).
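    
    For contrast, a hypothetical int-typed variant of the data generation (not part of this patch): with only 512 distinct values the dictionary stays small, so every page remains dictionary encoded and the plain-encoded fallback under test is never hit.
    
    ```scala
    // Hypothetical: Ints instead of Strings -- the resulting pages are all
    // dictionary encoded, so the hybrid row group never appears.
    val intData = (0 until 512).flatMap(i => Seq.fill(3)(i))
    intData.toDF("f").coalesce(1).write.parquet(dir.getCanonicalPath)
    ```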




[GitHub] spark issue #14304: [SPARK-16668][TEST] Test parquet reader for row groups c...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14304
  
    **[Test build #62739 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62739/consoleFull)** for PR 14304 at commit [`4f98c7f`](https://github.com/apache/spark/commit/4f98c7fc9c91893d22b78ed693d9a8f33bbb1146).
     * This patch **fails Scala style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #14304: [SPARK-16668][TEST] Test parquet reader for row groups c...

Posted by sameeragarwal <gi...@git.apache.org>.
Github user sameeragarwal commented on the issue:

    https://github.com/apache/spark/pull/14304
  
    cc @liancheng 




[GitHub] spark pull request #14304: [SPARK-16668][TEST] Test parquet reader for row g...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/14304




[GitHub] spark issue #14304: [SPARK-16668][TEST] Test parquet reader for row groups c...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on the issue:

    https://github.com/apache/spark/pull/14304
  
    I'm merging this to master. Thanks for fixing this!




[GitHub] spark pull request #14304: [SPARK-16668][TEST] Test parquet reader for row g...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14304#discussion_r71901576
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetEncodingSuite.scala ---
    @@ -78,4 +78,30 @@ class ParquetEncodingSuite extends ParquetCompatibilityTest with SharedSQLContex
           }}
         }
       }
    +
    +  test("Read row group containing both dictionary and plain encoded pages") {
    +    spark.conf.set("parquet.dictionary.page.size", "2048")
    +    spark.conf.set("parquet.page.size", "4096")
    +
    +    withTempPath { dir =>
    +      // In order to explicitly test for SPARK-14217, we set the parquet dictionary and page size
    +      // such that the following data spans across 3 pages (within a single row group) where the
    +      // first page is dictionary encoded and the remaining two are plain encoded.
    +      val data = (0 until 512).flatMap(i => Seq.fill(3)(i.toString))
    +      data.toDF("f").coalesce(1).write.parquet(dir.getCanonicalPath)
    +      val file =
    +        SpecificParquetRecordReaderBase.listDirectory(dir).toArray.head.asInstanceOf[String]
    +
    +      val reader = new VectorizedParquetRecordReader
    +      reader.initialize(file, null /* set columns to null to project all columns */)
    --- End diff --
    
    `VectorizedParquetRecordReader` is a Java class instead of a Scala class, so named parameter isn't feasible here.




[GitHub] spark issue #14304: [SPARK-16668][TEST] Test parquet reader for row groups c...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14304
  
    **[Test build #62740 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62740/consoleFull)** for PR 14304 at commit [`16e6b91`](https://github.com/apache/spark/commit/16e6b91688ad8f73336b7729745189e2bd7f880f).




[GitHub] spark issue #14304: [SPARK-16668][TEST] Test parquet reader for row groups c...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14304
  
    **[Test build #62692 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62692/consoleFull)** for PR 14304 at commit [`255e067`](https://github.com/apache/spark/commit/255e0679523a9e6205ff43d0ce6eb14b9a7a95f8).

