You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by jamesthomp <gi...@git.apache.org> on 2018/02/11 22:56:06 UTC

[GitHub] spark pull request #20580: [SPARK-23388][SQL] Support for Parquet Binary Dec...

GitHub user jamesthomp opened a pull request:

    https://github.com/apache/spark/pull/20580

    [SPARK-23388][SQL] Support for Parquet Binary DecimalType in VectorizedColumnReader

    ## What changes were proposed in this pull request?
    
    Re-add support for parquet binary DecimalType in VectorizedColumnReader
    
    ## How was this patch tested?
    
    Existing test suite

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jamesthomp/spark jt/add-back-binary-decimal

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20580.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #20580
    
----
commit e04d8a8fe0b9464f46e20326c18ca064a714a1a6
Author: James Thompson <ja...@...>
Date:   2018-02-11T22:48:58Z

    add back binary decimal

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20580: [SPARK-23388][SQL] Support for Parquet Binary DecimalTyp...

Posted by a10y <gi...@git.apache.org>.

Github user a10y commented on the issue:

    https://github.com/apache/spark/pull/20580
  
    If you did add a test it should probably generate the Parquet file programmatically rather than checking it in. Some examples in https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetInteroperabilitySuite.scala.
    
    However, given that there weren't tests before and this just fixes a break that was recently introduced, it seems reasonable that creation of tests for the vectorized reader can be done in a folowup?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20580: [SPARK-23388][SQL] Support for Parquet Binary DecimalTyp...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/20580
  
    ok to test


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20580: [SPARK-23388][SQL] Support for Parquet Binary DecimalTyp...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20580
  
    **[Test build #87335 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87335/testReport)** for PR 20580 at commit [`378ce28`](https://github.com/apache/spark/commit/378ce2818b3dc8519891c4e806683e994b95d175).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20580: [SPARK-23388][SQL] Support for Parquet Binary DecimalTyp...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/20580
  
    It was an accident, thanks for the fix!
    
    Can we add a test? It's always good to have a test for a bug fix, even the bug was introduced recently.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20580: [SPARK-23388][SQL] Support for Parquet Binary Dec...

Posted by kiszk <gi...@git.apache.org>.

Github user kiszk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20580#discussion_r167458369
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedColumnReader.java ---
    @@ -444,7 +444,7 @@ private void readBinaryBatch(int rowId, int num, WritableColumnVector column) {
         // This is where we implement support for the valid type conversions.
         // TODO: implement remaining type conversions
         VectorizedValuesReader data = (VectorizedValuesReader) dataColumn;
    -    if (column.dataType() == DataTypes.StringType || column.dataType() == DataTypes.BinaryType) {
    +    if (column.isArray()) {
    --- End diff --
    
    Do we need new test cases for supporting a new type?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20580: [SPARK-23388][SQL] Support for Parquet Binary DecimalTyp...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/20580
  
    You don't need to generate the parquet file manually, just write a parquet file using Spark and read it back. We can probably add this test in `ParquetFileFormatSuite`.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20580: [SPARK-23388][SQL] Support for Parquet Binary DecimalTyp...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20580
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20580: [SPARK-23388][SQL] Support for Parquet Binary Dec...

Posted by kiszk <gi...@git.apache.org>.

Github user kiszk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20580#discussion_r167458326
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/WritableColumnVector.java ---
    @@ -691,11 +696,6 @@ public WritableColumnVector arrayData() {
        */
       protected abstract WritableColumnVector reserveNewColumn(int capacity, DataType type);
     
    -  protected boolean isArray() {
    -    return type instanceof ArrayType || type instanceof BinaryType || type instanceof StringType ||
    --- End diff --
    
    Would it be better to minimize this change? i.e. just `protected` -> `public` without changing the place.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20580: [SPARK-23388][SQL] Support for Parquet Binary DecimalTyp...

Posted by kiszk <gi...@git.apache.org>.

Github user kiszk commented on the issue:

    https://github.com/apache/spark/pull/20580
  
    @cloud-fan Is there any reason that the above PR removed to support some types such as `Array`?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20580: [SPARK-23388][SQL] Support for Parquet Binary DecimalTyp...

Posted by jamesthomp <gi...@git.apache.org>.

Github user jamesthomp commented on the issue:

    https://github.com/apache/spark/pull/20580
  
    I'll see if I can generate a parquet file with the right schema to add for a test, but probably cannot look at this till tomorrow.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20580: [SPARK-23388][SQL] Support for Parquet Binary DecimalTyp...

Posted by kiszk <gi...@git.apache.org>.

Github user kiszk commented on the issue:

    https://github.com/apache/spark/pull/20580
  
    IIUC, does this PR support array type, too?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20580: [SPARK-23388][SQL] Support for Parquet Binary DecimalTyp...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20580
  
    **[Test build #87335 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87335/testReport)** for PR 20580 at commit [`378ce28`](https://github.com/apache/spark/commit/378ce2818b3dc8519891c4e806683e994b95d175).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20580: [SPARK-23388][SQL] Support for Parquet Binary Dec...

Posted by jamesthomp <gi...@git.apache.org>.

Github user jamesthomp commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20580#discussion_r167459070
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/WritableColumnVector.java ---
    @@ -691,11 +696,6 @@ public WritableColumnVector arrayData() {
        */
       protected abstract WritableColumnVector reserveNewColumn(int capacity, DataType type);
     
    -  protected boolean isArray() {
    -    return type instanceof ArrayType || type instanceof BinaryType || type instanceof StringType ||
    --- End diff --
    
    happy to keep the previous place but I noticed that all the `public` methods in this class are grouped together so I thought that I should move this to keep that consistent


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20580: [SPARK-23388][SQL] Support for Parquet Binary DecimalTyp...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/20580
  
    LGTM. 
    
    Since this is a regression + blocker of Spark 2.3 release, I am merging it now. Please submit a follow-up PR to add the tests. Thanks!


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20580: [SPARK-23388][SQL] Support for Parquet Binary Dec...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/20580


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20580: [SPARK-23388][SQL] Support for Parquet Binary DecimalTyp...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20580
  
    Can one of the admins verify this patch?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20580: [SPARK-23388][SQL] Support for Parquet Binary DecimalTyp...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20580
  
    Can one of the admins verify this patch?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20580: [SPARK-23388][SQL] Support for Parquet Binary Dec...

Posted by kiszk <gi...@git.apache.org>.

Github user kiszk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20580#discussion_r167482592
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/WritableColumnVector.java ---
    @@ -691,11 +696,6 @@ public WritableColumnVector arrayData() {
        */
       protected abstract WritableColumnVector reserveNewColumn(int capacity, DataType type);
     
    -  protected boolean isArray() {
    -    return type instanceof ArrayType || type instanceof BinaryType || type instanceof StringType ||
    --- End diff --
    
    `WritableColumnVector` class will be public API in Spark 2.3. It seem to be necessary to discuss with @cloud-fan for changing the visibility of a method or field.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20580: [SPARK-23388][SQL] Support for Parquet Binary Dec...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20580#discussion_r167656976
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedColumnReader.java ---
    @@ -444,7 +444,8 @@ private void readBinaryBatch(int rowId, int num, WritableColumnVector column) {
         // This is where we implement support for the valid type conversions.
         // TODO: implement remaining type conversions
         VectorizedValuesReader data = (VectorizedValuesReader) dataColumn;
    -    if (column.dataType() == DataTypes.StringType || column.dataType() == DataTypes.BinaryType) {
    +    if (column.dataType() == DataTypes.StringType || column.dataType() == DataTypes.BinaryType
    +            || DecimalType.isByteArrayDecimalType(column.dataType())) {
    --- End diff --
    
    Also please correct the indent too.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20580: [SPARK-23388][SQL] Support for Parquet Binary DecimalTyp...

Posted by jamesthomp <gi...@git.apache.org>.

Github user jamesthomp commented on the issue:

    https://github.com/apache/spark/pull/20580
  
    Yeah I believe it will add support for the array type too. Spark actually previously supported these types but the support was removed in this PR: https://github.com/apache/spark/commit/9c29c557635caf739fde942f53255273aac0d7b1#diff-7bdf5fd0ce0b1ccbf4ecf083611976e6R428
    
    I'm just trying to add it back.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20580: [SPARK-23388][SQL] Support for Parquet Binary DecimalTyp...

Posted by a10y <gi...@git.apache.org>.

Github user a10y commented on the issue:

    https://github.com/apache/spark/pull/20580
  
    As far as we can tell this is an accidental breaking change, as dropping support for this in vectorized Parquet reader was never called out. We have Parquet datasets with binary columns with logical type DECIMAL that were loadable before the change and have since become unloadable, throwing in `readBinaryBatch`


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20580: [SPARK-23388][SQL] Support for Parquet Binary DecimalTyp...

Posted by jamesthomp <gi...@git.apache.org>.

Github user jamesthomp commented on the issue:

    https://github.com/apache/spark/pull/20580
  
    @kiszk - I've changed the implementation to no longer use `column.isArray()` and instead just inline the decimal type check (so no changes needed to the public api). I don't think you could actually hit this codepath with ArrayType, so that part was unnecessary.
    
    As for testing, it might be easiest to check in a parquet file with the binary decimal format and then check that spark can read it?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20580: [SPARK-23388][SQL] Support for Parquet Binary DecimalTyp...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20580
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87335/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org