You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2021/03/25 13:57:18 UTC

[GitHub] [spark] yaooqinn opened a new pull request #31960: [SPARK-34786][SQL] Read Parquet unsigned int64 logical type that stored as signed int64 physical type to decimal(20, 0)

yaooqinn opened a new pull request #31960:
URL: https://github.com/apache/spark/pull/31960


   ### What changes were proposed in this pull request?
   <!--
   Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue. 
   If possible, please consider writing useful notes for better and faster reviews in your PR. See the examples below.
     1. If you refactor some codes with changing classes, showing the class hierarchy will help reviewers.
     2. If you fix some SQL features, you can provide some references of other DBMSes.
     3. If there is design documentation, please add the link.
     4. If there is a discussion in the mailing list, please add the link.
   -->
   A companion PR for SPARK-34817, when we handle the unsigned int(<=32) logical types. In this PR, we map the unsigned int64 to decimal(20, 0) for better compatibility. 
   
   ### Why are the changes needed?
   <!--
   Please clarify why the changes are needed. For instance,
     1. If you propose a new API, clarify the use case for a new API.
     2. If you fix a bug, you can clarify why it is a bug.
   -->
   Spark won't have unsigned types, but spark should be able to read existing parquet files written by other systems that support unsigned types for better compatibility. 
   
   ### Does this PR introduce _any_ user-facing change?
   <!--
   Note that it means *any* user-facing change including all aspects such as the documentation fix.
   If yes, please clarify the previous behavior and the change this PR proposes - provide the console output, description and/or an example to show the behavior difference if possible.
   If possible, please also clarify if this is a user-facing change compared to the released Spark versions or within the unreleased branches such as master.
   If no, write 'No'.
   -->
   yes, we can read parquet uint64 now
   
   ### How was this patch tested?
   <!--
   If tests were added, say they were added here. Please make sure to add some test cases that check the changes thoroughly including negative and positive cases if possible.
   If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future.
   If tests were not added, please describe why they were not added and/or why it was difficult to add.
   -->
   new unit tests


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31960: [SPARK-34786][SQL] Read Parquet unsigned int64 logical type that stored as signed int64 physical type to decimal(20, 0)

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31960:
URL: https://github.com/apache/spark/pull/31960#issuecomment-807319956


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136523/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31960: [SPARK-34786][SQL] Read Parquet unsigned int64 logical type that stored as signed int64 physical type to decimal(20, 0)

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31960:
URL: https://github.com/apache/spark/pull/31960#issuecomment-807128424






-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31960: [SPARK-34786][SQL] Read Parquet unsigned int64 logical type that stored as signed int64 physical type to decimal(20, 0)

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31960:
URL: https://github.com/apache/spark/pull/31960#issuecomment-807615210


   **[Test build #136534 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136534/testReport)** for PR 31960 at commit [`5e9a423`](https://github.com/apache/spark/commit/5e9a423f4ccd420389f618cb8e322f6415e3ded6).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #31960: [SPARK-34786][SQL] Read Parquet unsigned int64 logical type that stored as signed int64 physical type to decimal(20, 0)

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31960:
URL: https://github.com/apache/spark/pull/31960#issuecomment-806938488


   **[Test build #136526 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136526/testReport)** for PR 31960 at commit [`58c8396`](https://github.com/apache/spark/commit/58c8396a8518e3ad592683239edbbb98124e906f).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31960: [SPARK-34786][SQL] Read Parquet unsigned int64 logical type that stored as signed int64 physical type to decimal(20, 0)

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31960:
URL: https://github.com/apache/spark/pull/31960#issuecomment-806962791


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136526/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31960: [SPARK-34786][SQL] Read Parquet unsigned int64 logical type that stored as signed int64 physical type to decimal(20, 0)

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31960:
URL: https://github.com/apache/spark/pull/31960#issuecomment-807319956


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136523/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] yaooqinn commented on a change in pull request #31960: [SPARK-34786][SQL] Read Parquet unsigned int64 logical type that stored as signed int64 physical type to decimal(20, 0)

Posted by GitBox <gi...@apache.org>.
yaooqinn commented on a change in pull request #31960:
URL: https://github.com/apache/spark/pull/31960#discussion_r622694764



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala
##########
@@ -144,7 +141,7 @@ class ParquetToSparkSchemaConverter(
         originalType match {
           case INT_64 | null => LongType
           case DECIMAL => makeDecimalType(Decimal.MAX_LONG_DIGITS)
-          case UINT_64 => typeNotSupported()
+          case UINT_64 => DecimalType.LongDecimal

Review comment:
       OK, I will send a followup




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #31960: [SPARK-34786][SQL] Read Parquet unsigned int64 logical type that stored as signed int64 physical type to decimal(20, 0)

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31960:
URL: https://github.com/apache/spark/pull/31960#issuecomment-807039719


   **[Test build #136531 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136531/testReport)** for PR 31960 at commit [`a7e0b58`](https://github.com/apache/spark/commit/a7e0b5829781fe0efb5af6f091bbc8513b509fef).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] yaooqinn closed pull request #31960: [SPARK-34786][SQL] Read Parquet unsigned int64 logical type that stored as signed int64 physical type to decimal(20, 0)

Posted by GitBox <gi...@apache.org>.
yaooqinn closed pull request #31960:
URL: https://github.com/apache/spark/pull/31960


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31960: [SPARK-34786][SQL] Read Parquet unsigned int64 logical type that stored as signed int64 physical type to decimal(20, 0)

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31960:
URL: https://github.com/apache/spark/pull/31960#issuecomment-807128424


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41110/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] yaooqinn commented on pull request #31960: [SPARK-34786][SQL] Read Parquet unsigned int64 logical type that stored as signed int64 physical type to decimal(20, 0)

Posted by GitBox <gi...@apache.org>.
yaooqinn commented on pull request #31960:
URL: https://github.com/apache/spark/pull/31960#issuecomment-807873422


   Thanks, merged to master


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31960: [SPARK-34786][SQL] Read Parquet unsigned int64 logical type that stored as signed int64 physical type to decimal(20, 0)

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31960:
URL: https://github.com/apache/spark/pull/31960#issuecomment-807063608


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136531/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31960: [SPARK-34786][SQL] Read Parquet unsigned int64 logical type that stored as signed int64 physical type to decimal(20, 0)

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31960:
URL: https://github.com/apache/spark/pull/31960#issuecomment-807116923


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41115/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31960: [SPARK-34786][SQL] Read Parquet unsigned int64 logical type that stored as signed int64 physical type to decimal(20, 0)

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31960:
URL: https://github.com/apache/spark/pull/31960#issuecomment-806938668


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41108/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31960: [SPARK-34786][SQL] Read Parquet unsigned int64 logical type that stored as signed int64 physical type to decimal(20, 0)

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31960:
URL: https://github.com/apache/spark/pull/31960#issuecomment-807618430


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136534/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31960: [SPARK-34786][SQL] Read Parquet unsigned int64 logical type that stored as signed int64 physical type to decimal(20, 0)

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31960:
URL: https://github.com/apache/spark/pull/31960#issuecomment-806962668


   **[Test build #136526 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136526/testReport)** for PR 31960 at commit [`58c8396`](https://github.com/apache/spark/commit/58c8396a8518e3ad592683239edbbb98124e906f).
    * This patch **fails Java style tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] yaooqinn commented on a change in pull request #31960: [SPARK-34786][SQL] Read Parquet unsigned int64 logical type that stored as signed int64 physical type to decimal(20, 0)

Posted by GitBox <gi...@apache.org>.
yaooqinn commented on a change in pull request #31960:
URL: https://github.com/apache/spark/pull/31960#discussion_r601574901



##########
File path: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedColumnReader.java
##########
@@ -613,6 +631,8 @@ private void readLongBatch(int rowId, int num, WritableColumnVector column) thro
       defColumn.readLongs(
         num, column, rowId, maxDefLevel, (VectorizedValuesReader) dataColumn,
         DecimalType.is32BitDecimalType(column.dataType()));
+    } else if (originalType == OriginalType.UINT_64) {
+      defColumn.readUnsignedLongs(num, column, rowId, maxDefLevel, (VectorizedValuesReader) dataColumn);

Review comment:
       OK~




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31960: [SPARK-34786][SQL] Read Parquet unsigned int64 logical type that stored as signed int64 physical type to decimal(20, 0)

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31960:
URL: https://github.com/apache/spark/pull/31960#issuecomment-806962791


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136526/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31960: [SPARK-34786][SQL] Read Parquet unsigned int64 logical type that stored as signed int64 physical type to decimal(20, 0)

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31960:
URL: https://github.com/apache/spark/pull/31960#issuecomment-807251514


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41118/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31960: [SPARK-34786][SQL] Read Parquet unsigned int64 logical type that stored as signed int64 physical type to decimal(20, 0)

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31960:
URL: https://github.com/apache/spark/pull/31960#issuecomment-807135039


   **[Test build #136534 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136534/testReport)** for PR 31960 at commit [`5e9a423`](https://github.com/apache/spark/commit/5e9a423f4ccd420389f618cb8e322f6415e3ded6).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31960: [SPARK-34786][SQL] Read Parquet unsigned int64 logical type that stored as signed int64 physical type to decimal(20, 0)

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31960:
URL: https://github.com/apache/spark/pull/31960#issuecomment-807129310


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41115/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #31960: [SPARK-34786][SQL] Read Parquet unsigned int64 logical type that stored as signed int64 physical type to decimal(20, 0)

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #31960:
URL: https://github.com/apache/spark/pull/31960#discussion_r601535949



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala
##########
@@ -98,9 +98,6 @@ class ParquetToSparkSchemaConverter(
     def typeString =
       if (originalType == null) s"$typeName" else s"$typeName ($originalType)"
 
-    def typeNotSupported() =
-      throw QueryCompilationErrors.parquetTypeUnsupportedError(typeString)

Review comment:
       can we remove QueryCompilationErrors.parquetTypeUnsupportedError?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #31960: [SPARK-34786][SQL] Read Parquet unsigned int64 logical type that stored as signed int64 physical type to decimal(20, 0)

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #31960:
URL: https://github.com/apache/spark/pull/31960#discussion_r601534168



##########
File path: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/ParquetDictionary.java
##########
@@ -61,6 +63,14 @@ public double decodeToDouble(int id) {
 
   @Override
   public byte[] decodeToBinary(int id) {
-    return dictionary.decodeToBinary(id).getBytes();
+    if (needTransform) {
+      // For unsigned int64, it stores as dictionary encoded signed int64 in Parquet
+      // whenever dictionary is available.
+      // Here we lazily decode it to the original signed long value then convert to decimal(20, 0).
+      long signed = dictionary.decodeToLong(id);
+      return new BigInteger(Long.toUnsignedString(signed)).toByteArray();

Review comment:
       Does the binary of `BigInteger` can be read back as `BigDecimal`?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31960: [SPARK-34786][SQL] Read Parquet unsigned int64 logical type that stored as signed int64 physical type to decimal(20, 0)

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31960:
URL: https://github.com/apache/spark/pull/31960#issuecomment-806938668


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41108/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] yaooqinn commented on a change in pull request #31960: [SPARK-34786][SQL] Read Parquet unsigned int64 logical type that stored as signed int64 physical type to decimal(20, 0)

Posted by GitBox <gi...@apache.org>.
yaooqinn commented on a change in pull request #31960:
URL: https://github.com/apache/spark/pull/31960#discussion_r601571505



##########
File path: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/ParquetDictionary.java
##########
@@ -61,6 +63,14 @@ public double decodeToDouble(int id) {
 
   @Override
   public byte[] decodeToBinary(int id) {
-    return dictionary.decodeToBinary(id).getBytes();
+    if (needTransform) {
+      // For unsigned int64, it stores as dictionary encoded signed int64 in Parquet
+      // whenever dictionary is available.
+      // Here we lazily decode it to the original signed long value then convert to decimal(20, 0).
+      long signed = dictionary.decodeToLong(id);
+      return new BigInteger(Long.toUnsignedString(signed)).toByteArray();

Review comment:
       use `BigInteger` for caller side
    https://github.com/apache/spark/blob/a418548dad57775fbb10b4ea690610bad1a8bfb0/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/WritableColumnVector.java#L371




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31960: [SPARK-34786][SQL] Read Parquet unsigned int64 logical type that stored as signed int64 physical type to decimal(20, 0)

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31960:
URL: https://github.com/apache/spark/pull/31960#issuecomment-807251607


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41118/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31960: [SPARK-34786][SQL] Read Parquet unsigned int64 logical type that stored as signed int64 physical type to decimal(20, 0)

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31960:
URL: https://github.com/apache/spark/pull/31960#issuecomment-807063608


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136531/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31960: [SPARK-34786][SQL] Read Parquet unsigned int64 logical type that stored as signed int64 physical type to decimal(20, 0)

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31960:
URL: https://github.com/apache/spark/pull/31960#issuecomment-807618430


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136534/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31960: [SPARK-34786][SQL] Read Parquet unsigned int64 logical type that stored as signed int64 physical type to decimal(20, 0)

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31960:
URL: https://github.com/apache/spark/pull/31960#issuecomment-807063467


   **[Test build #136531 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136531/testReport)** for PR 31960 at commit [`a7e0b58`](https://github.com/apache/spark/commit/a7e0b5829781fe0efb5af6f091bbc8513b509fef).
    * This patch **fails Java style tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31960: [SPARK-34786][SQL] Read Parquet unsigned int64 logical type that stored as signed int64 physical type to decimal(20, 0)

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31960:
URL: https://github.com/apache/spark/pull/31960#issuecomment-807233584


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41118/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31960: [SPARK-34786][SQL] Read Parquet unsigned int64 logical type that stored as signed int64 physical type to decimal(20, 0)

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31960:
URL: https://github.com/apache/spark/pull/31960#issuecomment-807129411


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41115/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #31960: [SPARK-34786][SQL] Read Parquet unsigned int64 logical type that stored as signed int64 physical type to decimal(20, 0)

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #31960:
URL: https://github.com/apache/spark/pull/31960#discussion_r601535113



##########
File path: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedColumnReader.java
##########
@@ -613,6 +631,8 @@ private void readLongBatch(int rowId, int num, WritableColumnVector column) thro
       defColumn.readLongs(
         num, column, rowId, maxDefLevel, (VectorizedValuesReader) dataColumn,
         DecimalType.is32BitDecimalType(column.dataType()));
+    } else if (originalType == OriginalType.UINT_64) {
+      defColumn.readUnsignedLongs(num, column, rowId, maxDefLevel, (VectorizedValuesReader) dataColumn);

Review comment:
       can we add some comments here as well?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31960: [SPARK-34786][SQL] Read Parquet unsigned int64 logical type that stored as signed int64 physical type to decimal(20, 0)

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31960:
URL: https://github.com/apache/spark/pull/31960#issuecomment-806838704


   **[Test build #136523 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136523/testReport)** for PR 31960 at commit [`b796752`](https://github.com/apache/spark/commit/b79675271b85d50953fa69ac07273b72728ea79b).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #31960: [SPARK-34786][SQL] Read Parquet unsigned int64 logical type that stored as signed int64 physical type to decimal(20, 0)

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31960:
URL: https://github.com/apache/spark/pull/31960#issuecomment-806838704


   **[Test build #136523 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136523/testReport)** for PR 31960 at commit [`b796752`](https://github.com/apache/spark/commit/b79675271b85d50953fa69ac07273b72728ea79b).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31960: [SPARK-34786][SQL] Read Parquet unsigned int64 logical type that stored as signed int64 physical type to decimal(20, 0)

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31960:
URL: https://github.com/apache/spark/pull/31960#issuecomment-806938488






-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #31960: [SPARK-34786][SQL] Read Parquet unsigned int64 logical type that stored as signed int64 physical type to decimal(20, 0)

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #31960:
URL: https://github.com/apache/spark/pull/31960#discussion_r627112742



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetIOSuite.scala
##########
@@ -459,6 +461,40 @@ class ParquetIOSuite extends QueryTest with ParquetTest with SharedSparkSession
     }
   }
 
+  test("SPARK-34817: Read UINT_64 as Decimal from parquet") {
+    Seq(true, false).foreach { dictionaryEnabled =>
+      def makeRawParquetFile(path: Path): Unit = {
+        val schemaStr =
+          """message root {
+            |  required INT64 a(UINT_64);
+            |}
+        """.stripMargin
+        val schema = MessageTypeParser.parseMessageType(schemaStr)
+
+        val writer = createParquetWriter(schema, path, dictionaryEnabled)
+
+        val factory = new SimpleGroupFactory(schema)
+        (-500 until 500).foreach { i =>
+          val group = factory.newGroup()
+            .append("a", i % 100L)
+          writer.write(group)
+        }
+        writer.close()
+      }
+
+      withTempDir { dir =>
+        val path = new Path(dir.toURI.toString, "part-r-0.parquet")
+        makeRawParquetFile(path)
+        readParquetFile(path.toString) { df =>

Review comment:
       can we also test with user-specified schema? What if the user-specified schema is not `decimal(20, 0)`? Do we return wrong result or fail?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31960: [SPARK-34786][SQL] Read Parquet unsigned int64 logical type that stored as signed int64 physical type to decimal(20, 0)

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31960:
URL: https://github.com/apache/spark/pull/31960#issuecomment-807251607


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41118/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31960: [SPARK-34786][SQL] Read Parquet unsigned int64 logical type that stored as signed int64 physical type to decimal(20, 0)

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31960:
URL: https://github.com/apache/spark/pull/31960#issuecomment-807039719


   **[Test build #136531 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136531/testReport)** for PR 31960 at commit [`a7e0b58`](https://github.com/apache/spark/commit/a7e0b5829781fe0efb5af6f091bbc8513b509fef).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31960: [SPARK-34786][SQL] Read Parquet unsigned int64 logical type that stored as signed int64 physical type to decimal(20, 0)

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31960:
URL: https://github.com/apache/spark/pull/31960#issuecomment-807020694


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41110/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #31960: [SPARK-34786][SQL] Read Parquet unsigned int64 logical type that stored as signed int64 physical type to decimal(20, 0)

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31960:
URL: https://github.com/apache/spark/pull/31960#issuecomment-807135039


   **[Test build #136534 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136534/testReport)** for PR 31960 at commit [`5e9a423`](https://github.com/apache/spark/commit/5e9a423f4ccd420389f618cb8e322f6415e3ded6).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] yaooqinn commented on a change in pull request #31960: [SPARK-34786][SQL] Read Parquet unsigned int64 logical type that stored as signed int64 physical type to decimal(20, 0)

Posted by GitBox <gi...@apache.org>.
yaooqinn commented on a change in pull request #31960:
URL: https://github.com/apache/spark/pull/31960#discussion_r601575472



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala
##########
@@ -98,9 +98,6 @@ class ParquetToSparkSchemaConverter(
     def typeString =
       if (originalType == null) s"$typeName" else s"$typeName ($originalType)"
 
-    def typeNotSupported() =
-      throw QueryCompilationErrors.parquetTypeUnsupportedError(typeString)

Review comment:
       good catch!




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] yaooqinn commented on a change in pull request #31960: [SPARK-34786][SQL] Read Parquet unsigned int64 logical type that stored as signed int64 physical type to decimal(20, 0)

Posted by GitBox <gi...@apache.org>.
yaooqinn commented on a change in pull request #31960:
URL: https://github.com/apache/spark/pull/31960#discussion_r601571505



##########
File path: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/ParquetDictionary.java
##########
@@ -61,6 +63,14 @@ public double decodeToDouble(int id) {
 
   @Override
   public byte[] decodeToBinary(int id) {
-    return dictionary.decodeToBinary(id).getBytes();
+    if (needTransform) {
+      // For unsigned int64, it stores as dictionary encoded signed int64 in Parquet
+      // whenever dictionary is available.
+      // Here we lazily decode it to the original signed long value then convert to decimal(20, 0).
+      long signed = dictionary.decodeToLong(id);
+      return new BigInteger(Long.toUnsignedString(signed)).toByteArray();

Review comment:
       use `BigInteger` for caller side
    https://github.com/apache/spark/blob/a418548dad57775fbb10b4ea690610bad1a8bfb0/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/WritableColumnVector.java#L368-375

##########
File path: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/ParquetDictionary.java
##########
@@ -61,6 +63,14 @@ public double decodeToDouble(int id) {
 
   @Override
   public byte[] decodeToBinary(int id) {
-    return dictionary.decodeToBinary(id).getBytes();
+    if (needTransform) {
+      // For unsigned int64, it stores as dictionary encoded signed int64 in Parquet
+      // whenever dictionary is available.
+      // Here we lazily decode it to the original signed long value then convert to decimal(20, 0).
+      long signed = dictionary.decodeToLong(id);
+      return new BigInteger(Long.toUnsignedString(signed)).toByteArray();

Review comment:
       use `BigInteger` for caller side
    https://github.com/apache/spark/blob/a418548dad57775fbb10b4ea690610bad1a8bfb0/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/WritableColumnVector.java#L368-L375




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31960: [SPARK-34786][SQL] Read Parquet unsigned int64 logical type that stored as signed int64 physical type to decimal(20, 0)

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31960:
URL: https://github.com/apache/spark/pull/31960#issuecomment-807104733


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41110/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31960: [SPARK-34786][SQL] Read Parquet unsigned int64 logical type that stored as signed int64 physical type to decimal(20, 0)

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31960:
URL: https://github.com/apache/spark/pull/31960#issuecomment-806926310


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41108/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] yaooqinn commented on a change in pull request #31960: [SPARK-34786][SQL] Read Parquet unsigned int64 logical type that stored as signed int64 physical type to decimal(20, 0)

Posted by GitBox <gi...@apache.org>.
yaooqinn commented on a change in pull request #31960:
URL: https://github.com/apache/spark/pull/31960#discussion_r601571505



##########
File path: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/ParquetDictionary.java
##########
@@ -61,6 +63,14 @@ public double decodeToDouble(int id) {
 
   @Override
   public byte[] decodeToBinary(int id) {
-    return dictionary.decodeToBinary(id).getBytes();
+    if (needTransform) {
+      // For unsigned int64, it stores as dictionary encoded signed int64 in Parquet
+      // whenever dictionary is available.
+      // Here we lazily decode it to the original signed long value then convert to decimal(20, 0).
+      long signed = dictionary.decodeToLong(id);
+      return new BigInteger(Long.toUnsignedString(signed)).toByteArray();

Review comment:
       use `BigInteger` for caller side
    https://github.com/apache/spark/blob/a418548dad57775fbb10b4ea690610bad1a8bfb0/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/WritableColumnVector.java#L371~375

##########
File path: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/ParquetDictionary.java
##########
@@ -61,6 +63,14 @@ public double decodeToDouble(int id) {
 
   @Override
   public byte[] decodeToBinary(int id) {
-    return dictionary.decodeToBinary(id).getBytes();
+    if (needTransform) {
+      // For unsigned int64, it stores as dictionary encoded signed int64 in Parquet
+      // whenever dictionary is available.
+      // Here we lazily decode it to the original signed long value then convert to decimal(20, 0).
+      long signed = dictionary.decodeToLong(id);
+      return new BigInteger(Long.toUnsignedString(signed)).toByteArray();

Review comment:
       use `BigInteger` for caller side
    https://github.com/apache/spark/blob/a418548dad57775fbb10b4ea690610bad1a8bfb0/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/WritableColumnVector.java#L371-375




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #31960: [SPARK-34786][SQL] Read Parquet unsigned int64 logical type that stored as signed int64 physical type to decimal(20, 0)

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #31960:
URL: https://github.com/apache/spark/pull/31960#discussion_r622691560



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala
##########
@@ -144,7 +141,7 @@ class ParquetToSparkSchemaConverter(
         originalType match {
           case INT_64 | null => LongType
           case DECIMAL => makeDecimalType(Decimal.MAX_LONG_DIGITS)
-          case UINT_64 => typeNotSupported()
+          case UINT_64 => DecimalType.LongDecimal

Review comment:
       I don't think it's related to `DecimalType.LongDecimal`. We want it to store uint64, so the precision we need is `java.lang.Long.toUnsignedString(-1).length`, which is 20.
   
   Let's make it more explicit
   ```
   // The precision to hold the largest unsigned long is: `java.lang.Long.toUnsignedString(-1).length` = 20
   case UINT_64 => DecimalType(20, 0)
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] yaooqinn commented on pull request #31960: [SPARK-34786][SQL] Read Parquet unsigned int64 logical type that stored as signed int64 physical type to decimal(20, 0)

Posted by GitBox <gi...@apache.org>.
yaooqinn commented on pull request #31960:
URL: https://github.com/apache/spark/pull/31960#issuecomment-806836418


   cc @cloud-fan @maropu @HyukjinKwon please take a look at this PR thanks


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31960: [SPARK-34786][SQL] Read Parquet unsigned int64 logical type that stored as signed int64 physical type to decimal(20, 0)

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31960:
URL: https://github.com/apache/spark/pull/31960#issuecomment-807262920


   **[Test build #136523 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136523/testReport)** for PR 31960 at commit [`b796752`](https://github.com/apache/spark/commit/b79675271b85d50953fa69ac07273b72728ea79b).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org