Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2021/03/22 03:40:57 UTC

[GitHub] [spark] yaooqinn opened a new pull request #31921: [SPARK-34817][SQL] Read parquet unsigned types that stored as int32 physical type

yaooqinn opened a new pull request #31921:
URL: https://github.com/apache/spark/pull/31921


   
   ### What changes were proposed in this pull request?
   
   Unsigned types may be used to produce smaller in-memory representations of the data. These types are used by frameworks (e.g. Hive, Pig) that use Parquet, and Parquet maps them to its base physical types.
   
   ```thrift
     /**
      * An unsigned integer value.
      *
      * The number describes the maximum number of meaningful data bits in
      * the stored value. 8, 16 and 32 bit values are stored using the
      * INT32 physical type.  64 bit values are stored using the INT64
      * physical type.
      *
      */
     UINT_8 = 11;
     UINT_16 = 12;
     UINT_32 = 13;
     UINT_64 = 14;
   ```
   
   ```
   UInt8-[0:255]
   UInt16-[0:65535]
   UInt32-[0:4294967295]
   UInt64-[0:18446744073709551615]
   ```
   
   In this PR, we support reading UINT_8 as ShortType, UINT_16 as IntegerType, and UINT_32 as LongType, so that each target type covers the full unsigned range. Support for UINT_64 will be added in a separate PR.
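
As a rough illustration of why these target types fit, the sketch below checks which signed Java type is the narrowest one able to hold the maximum of each unsigned width. The class and method names (`UnsignedRangeCheck`, `maxUnsigned`, `narrowestSignedType`) are hypothetical and are not part of Spark or Parquet.

```java
import java.math.BigInteger;

// Hypothetical helper, not Spark code: finds the narrowest signed Java type
// that can hold every value of an unsigned integer with the given bit width.
final class UnsignedRangeCheck {

  // Largest value of an unsigned integer with `bits` meaningful bits: 2^bits - 1.
  static BigInteger maxUnsigned(int bits) {
    return BigInteger.ONE.shiftLeft(bits).subtract(BigInteger.ONE);
  }

  // Compares the unsigned maximum against the maximum of each signed type.
  static String narrowestSignedType(int bits) {
    BigInteger max = maxUnsigned(bits);
    if (max.compareTo(BigInteger.valueOf(Byte.MAX_VALUE)) <= 0) return "byte";
    if (max.compareTo(BigInteger.valueOf(Short.MAX_VALUE)) <= 0) return "short";
    if (max.compareTo(BigInteger.valueOf(Integer.MAX_VALUE)) <= 0) return "int";
    if (max.compareTo(BigInteger.valueOf(Long.MAX_VALUE)) <= 0) return "long";
    return "none"; // does not fit in any signed primitive up to 64 bits
  }

  public static void main(String[] args) {
    for (int bits : new int[] {8, 16, 32, 64}) {
      System.out.println("UINT_" + bits + " max=" + maxUnsigned(bits)
          + " narrowest signed type: " + narrowestSignedType(bits));
    }
  }
}
```

Under these assumptions, 8, 16 and 32 bits land on `short`, `int` and `long`, which line up with ShortType, IntegerType and LongType. The 64-bit maximum exceeds `long`, which is consistent with UINT_64 being handled separately.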
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   Yes, we can read uint[8/16/32] from Parquet files.
   
   ### How was this patch tested?
   
   New tests were added.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31921: [SPARK-34817][SQL] Read parquet unsigned types that stored as int32 physical type in parquet

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31921:
URL: https://github.com/apache/spark/pull/31921#issuecomment-803801410


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40913/
   




[GitHub] [spark] yaooqinn commented on a change in pull request #31921: [SPARK-34817][SQL] Read parquet unsigned types that stored as int32 physical type in parquet

Posted by GitBox <gi...@apache.org>.
yaooqinn commented on a change in pull request #31921:
URL: https://github.com/apache/spark/pull/31921#discussion_r598477793



##########
File path: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedColumnReader.java
##########
@@ -565,6 +565,10 @@ private void readIntBatch(int rowId, int num, WritableColumnVector column) throw
         canReadAsIntDecimal(column.dataType())) {
       defColumn.readIntegers(
           num, column, rowId, maxDefLevel, (VectorizedValuesReader) dataColumn);
+    } else if (column.dataType() == DataTypes.LongType) {

Review comment:
       These values fit in a signed int32. Parquet does not convert them back and forth, so it's fine for us to read the raw data.
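
To make this concrete, here is a small sketch (hypothetical class `RawInt32Read`, not Spark code) of why the raw INT32 value can be used directly for UINT_8 and UINT_16, while only UINT_32 needs an unsigned widening. It assumes the writer stored in-range values, that is 0 to 255 for UINT_8 and 0 to 65535 for UINT_16.

```java
// Hypothetical illustration, not Spark code.
final class RawInt32Read {

  // UINT_8 values (0..255) stored as INT32 are non-negative, so narrowing
  // the raw int to short keeps the value unchanged.
  static short readUInt8(int raw) {
    return (short) raw;
  }

  // UINT_16 values (0..65535) are also non-negative as a signed int32,
  // so the raw int is already the right value.
  static int readUInt16(int raw) {
    return raw;
  }

  // UINT_32 values above Integer.MAX_VALUE look negative when read as a
  // signed int32; Integer.toUnsignedLong restores the unsigned value.
  static long readUInt32(int raw) {
    return Integer.toUnsignedLong(raw);
  }

  public static void main(String[] args) {
    System.out.println(readUInt8(255));
    System.out.println(readUInt16(65535));
    System.out.println(readUInt32(-1)); // raw bit pattern 0xFFFFFFFF
  }
}
```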






[GitHub] [spark] cloud-fan commented on a change in pull request #31921: [SPARK-34817][SQL] Read parquet unsigned types that stored as int32 physical type in parquet

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #31921:
URL: https://github.com/apache/spark/pull/31921#discussion_r598473861



##########
File path: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedColumnReader.java
##########
@@ -565,6 +565,10 @@ private void readIntBatch(int rowId, int num, WritableColumnVector column) throw
         canReadAsIntDecimal(column.dataType())) {
       defColumn.readIntegers(
           num, column, rowId, maxDefLevel, (VectorizedValuesReader) dataColumn);
+    } else if (column.dataType() == DataTypes.LongType) {
+      //  We use LongType to handle UINT32
+      defColumn.readIntegersAsUnsigned(

Review comment:
       nit: `readUnsignedIntegers`






[GitHub] [spark] SparkQA commented on pull request #31921: [SPARK-34817][SQL] Read parquet unsigned types that stored as int32 physical type in parquet

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31921:
URL: https://github.com/apache/spark/pull/31921#issuecomment-806420725


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41081/
   




[GitHub] [spark] yaooqinn commented on a change in pull request #31921: [SPARK-34817][SQL] Read parquet unsigned types that stored as int32 physical type in parquet

Posted by GitBox <gi...@apache.org>.
yaooqinn commented on a change in pull request #31921:
URL: https://github.com/apache/spark/pull/31921#discussion_r598463339



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala
##########
@@ -130,13 +130,11 @@ class ParquetToSparkSchemaConverter(
       case INT32 =>
         originalType match {
           case INT_8 => ByteType
-          case INT_16 => ShortType
-          case INT_32 | null => IntegerType
+          case INT_16 | UINT_8 => ShortType
+          case INT_32 | UINT_16 | null => IntegerType
           case DATE => DateType
           case DECIMAL => makeDecimalType(Decimal.MAX_INT_DIGITS)
-          case UINT_8 => typeNotSupported()
-          case UINT_16 => typeNotSupported()
-          case UINT_32 => typeNotSupported()
+          case UINT_32 => LongType

Review comment:
       Thanks, @HyukjinKwon,
   Yea, I have checked that PR too. There's also a suggestion that we support them.
   Recently, Wenchen created https://issues.apache.org/jira/browse/SPARK-34786 for reading uint64. The other unsigned types are not supported either, and they are a bit more straightforward than uint64, which needs a decimal.
   
   IMO, for Spark, it is worthwhile to be able to support more storage layer features without breaking our own rules. So I raised this PR to collect more opinions.






[GitHub] [spark] AmplabJenkins commented on pull request #31921: [SPARK-34817][SQL] Read parquet unsigned types that stored as int32 physical type in parquet

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31921:
URL: https://github.com/apache/spark/pull/31921#issuecomment-805961385


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41039/
   




[GitHub] [spark] AmplabJenkins commented on pull request #31921: [SPARK-34817][SQL] Read parquet unsigned types that stored as int32 physical type in parquet

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31921:
URL: https://github.com/apache/spark/pull/31921#issuecomment-806398414


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136490/
   




[GitHub] [spark] SparkQA removed a comment on pull request #31921: [SPARK-34817][SQL] Read parquet unsigned types that stored as int32 physical type in parquet

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31921:
URL: https://github.com/apache/spark/pull/31921#issuecomment-805963145


   **[Test build #136472 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136472/testReport)** for PR 31921 at commit [`d9afc79`](https://github.com/apache/spark/commit/d9afc7916fa08f3bea9e89ab7a48cfd38c76c190).




[GitHub] [spark] HyukjinKwon commented on a change in pull request #31921: [SPARK-34817][SQL] Read parquet unsigned types that stored as int32 physical type in parquet

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #31921:
URL: https://github.com/apache/spark/pull/31921#discussion_r598455740



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala
##########
@@ -130,13 +130,11 @@ class ParquetToSparkSchemaConverter(
       case INT32 =>
         originalType match {
           case INT_8 => ByteType
-          case INT_16 => ShortType
-          case INT_32 | null => IntegerType
+          case INT_16 | UINT_8 => ShortType
+          case INT_32 | UINT_16 | null => IntegerType
           case DATE => DateType
           case DECIMAL => makeDecimalType(Decimal.MAX_INT_DIGITS)
-          case UINT_8 => typeNotSupported()
-          case UINT_16 => typeNotSupported()
-          case UINT_32 => typeNotSupported()
+          case UINT_32 => LongType

Review comment:
       But it's very old. Almost 6 years ago lol. @liancheng do you have a different thought now?






[GitHub] [spark] SparkQA commented on pull request #31921: [SPARK-34817][SQL] Read parquet unsigned types that stored as int32 physical type in parquet

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31921:
URL: https://github.com/apache/spark/pull/31921#issuecomment-806515811


   **[Test build #136497 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136497/testReport)** for PR 31921 at commit [`71496bd`](https://github.com/apache/spark/commit/71496bdc4d5c8081139e5a26fa9bddfa1ddc38ed).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] cloud-fan commented on a change in pull request #31921: [SPARK-34817][SQL] Read parquet unsigned types that stored as int32 physical type in parquet

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #31921:
URL: https://github.com/apache/spark/pull/31921#discussion_r598475266



##########
File path: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedColumnReader.java
##########
@@ -565,6 +565,10 @@ private void readIntBatch(int rowId, int num, WritableColumnVector column) throw
         canReadAsIntDecimal(column.dataType())) {
       defColumn.readIntegers(
           num, column, rowId, maxDefLevel, (VectorizedValuesReader) dataColumn);
+    } else if (column.dataType() == DataTypes.LongType) {

Review comment:
       Where do we handle uint8 and uint16?






[GitHub] [spark] SparkQA commented on pull request #31921: [SPARK-34817][SQL] Read parquet unsigned types that stored as int32 physical type in parquet

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31921:
URL: https://github.com/apache/spark/pull/31921#issuecomment-805868130


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41035/
   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #31921: [SPARK-34817][SQL] Read parquet unsigned types that stored as int32 physical type in parquet

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31921:
URL: https://github.com/apache/spark/pull/31921#issuecomment-803817012


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/40913/
   




[GitHub] [spark] SparkQA commented on pull request #31921: [SPARK-34817][SQL] Read parquet unsigned types that stored as int32 physical type in parquet

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31921:
URL: https://github.com/apache/spark/pull/31921#issuecomment-806359741


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41076/
   




[GitHub] [spark] SparkQA commented on pull request #31921: [SPARK-34817][SQL] Read parquet unsigned types that stored as int32 physical type in parquet

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31921:
URL: https://github.com/apache/spark/pull/31921#issuecomment-806218864


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41056/
   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #31921: [SPARK-34817][SQL] Read parquet unsigned types that stored as int32 physical type in parquet

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31921:
URL: https://github.com/apache/spark/pull/31921#issuecomment-803787635


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/40909/
   




[GitHub] [spark] AmplabJenkins commented on pull request #31921: [SPARK-34817][SQL] Read parquet unsigned types that stored as int32 physical type in parquet

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31921:
URL: https://github.com/apache/spark/pull/31921#issuecomment-805822627


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41036/
   




[GitHub] [spark] yaooqinn commented on a change in pull request #31921: [SPARK-34817][SQL] Read parquet unsigned types that stored as int32 physical type in parquet

Posted by GitBox <gi...@apache.org>.
yaooqinn commented on a change in pull request #31921:
URL: https://github.com/apache/spark/pull/31921#discussion_r600360823



##########
File path: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedColumnReader.java
##########
@@ -565,6 +565,10 @@ private void readIntBatch(int rowId, int num, WritableColumnVector column) throw
         canReadAsIntDecimal(column.dataType())) {
       defColumn.readIntegers(
           num, column, rowId, maxDefLevel, (VectorizedValuesReader) dataColumn);
+    } else if (column.dataType() == DataTypes.LongType) {
+      //  We use LongType to handle UINT32
+      defColumn.readIntegersAsUnsigned(

Review comment:
       I have added the dictionary decoding code path and changed the Parquet data generator a bit so that it produces the right encoded/plain data.
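
A minimal sketch of what a dictionary decoding path for UINT_32 could look like, assuming the dictionary holds raw signed int32 entries and each row stores an index into it. The names `UnsignedDictionaryDecode` and `decodeToUnsignedLongs` are hypothetical, and plain arrays stand in for Spark's dictionary and column vector classes.

```java
// Hypothetical illustration, not Spark code: plain arrays replace the
// dictionary and the writable column vector used by the real reader.
final class UnsignedDictionaryDecode {

  // Looks up each dictionary id and widens the raw int32 entry to an
  // unsigned value held in a long.
  static long[] decodeToUnsignedLongs(int[] dictionary, int[] dictionaryIds) {
    long[] out = new long[dictionaryIds.length];
    for (int i = 0; i < dictionaryIds.length; i++) {
      out[i] = Integer.toUnsignedLong(dictionary[dictionaryIds[i]]);
    }
    return out;
  }

  public static void main(String[] args) {
    int[] dictionary = {7, -1, Integer.MIN_VALUE}; // raw int32 bit patterns
    int[] ids = {1, 0, 2, 1};
    for (long v : decodeToUnsignedLongs(dictionary, ids)) {
      System.out.println(v);
    }
  }
}
```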






[GitHub] [spark] SparkQA commented on pull request #31921: [SPARK-34817][SQL] Read parquet unsigned types that stored as int32 physical type in parquet

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31921:
URL: https://github.com/apache/spark/pull/31921#issuecomment-803745469


   **[Test build #136327 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136327/testReport)** for PR 31921 at commit [`0c8b6d4`](https://github.com/apache/spark/commit/0c8b6d45455744745d2df87793f4acd94d58656e).




[GitHub] [spark] yaooqinn commented on a change in pull request #31921: [SPARK-34817][SQL] Read parquet unsigned types that stored as int32 physical type in parquet

Posted by GitBox <gi...@apache.org>.
yaooqinn commented on a change in pull request #31921:
URL: https://github.com/apache/spark/pull/31921#discussion_r598689607



##########
File path: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedColumnReader.java
##########
@@ -565,6 +565,10 @@ private void readIntBatch(int rowId, int num, WritableColumnVector column) throw
         canReadAsIntDecimal(column.dataType())) {
       defColumn.readIntegers(
           num, column, rowId, maxDefLevel, (VectorizedValuesReader) dataColumn);
+    } else if (column.dataType() == DataTypes.LongType) {
+      //  We use LongType to handle UINT32
+      defColumn.readIntegersAsUnsigned(

Review comment:
       Looks irrelevant to me






[GitHub] [spark] yaooqinn commented on a change in pull request #31921: [SPARK-34817][SQL] Read parquet unsigned types that stored as int32 physical type in parquet

Posted by GitBox <gi...@apache.org>.
yaooqinn commented on a change in pull request #31921:
URL: https://github.com/apache/spark/pull/31921#discussion_r600614841



##########
File path: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedColumnReader.java
##########
@@ -370,6 +373,13 @@ private void decodeDictionaryIds(
               column.putInt(i, dictionary.decodeToInt(dictionaryIds.getDictId(i)));
             }
           }
+        } else if (column.dataType() == DataTypes.LongType) {

Review comment:
       Signed and unsigned int (<= 32-bit) types share the same `PrimitiveType`, `INT32`. The unsigned ones are just logical types.
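
The relationship between the shared physical type and the logical annotation can be sketched as a lookup from annotation to Spark type name. This is a simplified, hypothetical model (`Int32LogicalTypes`, `Int32Annotation`, `sparkTypeFor`) limited to the integer annotations touched by this PR's schema converter diff; it returns type names as strings instead of using Spark's or Parquet's own classes, and `NONE` stands in for a missing annotation.

```java
// Hypothetical model, not Spark code: every annotation below sits on top of
// the same INT32 physical type; only the interpretation differs.
final class Int32LogicalTypes {

  enum Int32Annotation { NONE, INT_8, INT_16, INT_32, UINT_8, UINT_16, UINT_32 }

  // Mirrors the integer cases of the proposed INT32 mapping.
  static String sparkTypeFor(Int32Annotation annotation) {
    switch (annotation) {
      case INT_8:
        return "ByteType";
      case INT_16:
      case UINT_8:
        return "ShortType";
      case INT_32:
      case UINT_16:
      case NONE:
        return "IntegerType";
      case UINT_32:
        return "LongType";
      default:
        throw new IllegalArgumentException("Unhandled annotation: " + annotation);
    }
  }

  public static void main(String[] args) {
    for (Int32Annotation a : Int32Annotation.values()) {
      System.out.println(a + " -> " + sparkTypeFor(a));
    }
  }
}
```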






[GitHub] [spark] SparkQA removed a comment on pull request #31921: [SPARK-34817][SQL] Read parquet unsigned types that stored as int32 physical type in parquet

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31921:
URL: https://github.com/apache/spark/pull/31921#issuecomment-806352122


   **[Test build #136497 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136497/testReport)** for PR 31921 at commit [`71496bd`](https://github.com/apache/spark/commit/71496bdc4d5c8081139e5a26fa9bddfa1ddc38ed).




[GitHub] [spark] cloud-fan commented on a change in pull request #31921: [SPARK-34817][SQL] Read parquet unsigned types that stored as int32 physical type in parquet

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #31921:
URL: https://github.com/apache/spark/pull/31921#discussion_r598494071



##########
File path: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedRleValuesReader.java
##########
@@ -203,6 +203,41 @@ public void readIntegers(
     }
   }
 
+  // A fork of `readIntegers`, reading the signed integers as unsigned in long type
+  public void readIntegersAsUnsigned(
+    int total,
+    WritableColumnVector c,
+    int rowId,
+    int level,
+    VectorizedValuesReader data) throws IOException {
+    int left = total;
+    while (left > 0) {
+      if (this.currentCount == 0) this.readNextGroup();
+      int n = Math.min(left, this.currentCount);
+      switch (mode) {
+        case RLE:
+          if (currentValue == level) {
+            data.readIntegersAsUnsigned(n, c, rowId);
+          } else {
+            c.putNulls(rowId, n);
+          }
+          break;
+        case PACKED:
+          for (int i = 0; i < n; ++i) {
+            if (currentBuffer[currentBufferIdx++] == level) {
+              c.putLong(rowId + i, Integer.toUnsignedLong(data.readInteger()));

Review comment:
       @yaooqinn do you mean `data.readInteger()` also works for uint8 and uint16?






[GitHub] [spark] AmplabJenkins commented on pull request #31921: [SPARK-34817][SQL] Read parquet unsigned types that stored as int32 physical type in parquet

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31921:
URL: https://github.com/apache/spark/pull/31921#issuecomment-805868186


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41035/
   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #31921: [SPARK-34817][SQL] Read parquet unsigned types that stored as int32 physical type in parquet

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31921:
URL: https://github.com/apache/spark/pull/31921#issuecomment-806393928


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41076/
   




[GitHub] [spark] AmplabJenkins commented on pull request #31921: [SPARK-34817][SQL] Read parquet unsigned types that stored as int32 physical type in parquet

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31921:
URL: https://github.com/apache/spark/pull/31921#issuecomment-806087005


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136472/
   




[GitHub] [spark] yaooqinn commented on a change in pull request #31921: [SPARK-34817][SQL] Read parquet unsigned types that stored as int32 physical type in parquet

Posted by GitBox <gi...@apache.org>.
yaooqinn commented on a change in pull request #31921:
URL: https://github.com/apache/spark/pull/31921#discussion_r598500388



##########
File path: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedRleValuesReader.java
##########
@@ -203,6 +203,41 @@ public void readIntegers(
     }
   }
 
+  // A fork of `readIntegers`, reading the signed integers as unsigned in long type
+  public void readIntegersAsUnsigned(
+    int total,
+    WritableColumnVector c,
+    int rowId,
+    int level,
+    VectorizedValuesReader data) throws IOException {
+    int left = total;
+    while (left > 0) {
+      if (this.currentCount == 0) this.readNextGroup();
+      int n = Math.min(left, this.currentCount);
+      switch (mode) {
+        case RLE:
+          if (currentValue == level) {
+            data.readIntegersAsUnsigned(n, c, rowId);
+          } else {
+            c.putNulls(rowId, n);
+          }
+          break;
+        case PACKED:
+          for (int i = 0; i < n; ++i) {
+            if (currentBuffer[currentBufferIdx++] == level) {
+              c.putLong(rowId + i, Integer.toUnsignedLong(data.readInteger()));

Review comment:
       Yes. In Parquet, `8, 16, and 32 bit values are stored using the INT32 physical type`, so we read them all with `readInteger` and cast them based on the Catalyst type later. The range [0, Int.MaxValue] is enough for uint8 and uint16, while the bit patterns in [Int.MinValue, 0) are also needed for uint32.
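
To illustrate the point about value ranges, here is a minimal, self-contained sketch (not the PR's code). The class name `UnsignedInt32Demo` and the helper `widen` are hypothetical; only `Integer.toUnsignedLong` is taken from the diff above. It shows that small unsigned values pass through unchanged, while the negative half of the int range decodes to the upper half of the uint32 range.

```java
class UnsignedInt32Demo {
    // Widen a raw INT32 storage value to the unsigned value it encodes.
    static long widen(int raw) {
        return Integer.toUnsignedLong(raw);
    }

    public static void main(String[] args) {
        System.out.println(widen(255));               // uint8 max, unchanged
        System.out.println(widen(65535));             // uint16 max, unchanged
        System.out.println(widen(Integer.MIN_VALUE)); // 2147483648, i.e. 2^31
        System.out.println(widen(-1));                // 4294967295, uint32 max
    }
}
```

Because uint8 and uint16 never use the sign bit of the 32-bit storage value, a plain signed read already gives the right number for them. Only uint32 needs the widening step.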






[GitHub] [spark] AmplabJenkins removed a comment on pull request #31921: [SPARK-34817][SQL] Read parquet unsigned types that stored as int32 physical type in parquet

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31921:
URL: https://github.com/apache/spark/pull/31921#issuecomment-805685174


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136451/
   




[GitHub] [spark] SparkQA commented on pull request #31921: [SPARK-34817][SQL] Read parquet unsigned types that stored as int32 physical type in parquet

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31921:
URL: https://github.com/apache/spark/pull/31921#issuecomment-805773313


   **[Test build #136455 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136455/testReport)** for PR 31921 at commit [`3642f91`](https://github.com/apache/spark/commit/3642f91059689d498304ff08fab15608d6e8b9f2).




[GitHub] [spark] SparkQA removed a comment on pull request #31921: [SPARK-34817][SQL] Read parquet unsigned types that stored as int32 physical type in parquet

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31921:
URL: https://github.com/apache/spark/pull/31921#issuecomment-805739301


   **[Test build #136452 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136452/testReport)** for PR 31921 at commit [`efe9c4a`](https://github.com/apache/spark/commit/efe9c4af56fa76aac00ca7f819cce50cb03eaf43).




[GitHub] [spark] yaooqinn commented on a change in pull request #31921: [SPARK-34817][SQL] Read parquet unsigned types that stored as int32 physical type in parquet

Posted by GitBox <gi...@apache.org>.
yaooqinn commented on a change in pull request #31921:
URL: https://github.com/apache/spark/pull/31921#discussion_r600617028



##########
File path: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedColumnReader.java
##########
@@ -565,6 +575,10 @@ private void readIntBatch(int rowId, int num, WritableColumnVector column) throw
         canReadAsIntDecimal(column.dataType())) {
       defColumn.readIntegers(
           num, column, rowId, maxDefLevel, (VectorizedValuesReader) dataColumn);
+    } else if (column.dataType() == DataTypes.LongType) {

Review comment:
       This is deterministic and under our own control, so an extra check seems unnecessary. See https://github.com/apache/spark/pull/31921/files#diff-3730a913c4b95edf09fb78f8739c538bae53f7269555b6226efe7ccee1901b39R137






[GitHub] [spark] cloud-fan commented on a change in pull request #31921: [SPARK-34817][SQL] Read parquet unsigned types that stored as int32 physical type in parquet

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #31921:
URL: https://github.com/apache/spark/pull/31921#discussion_r600588640



##########
File path: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedColumnReader.java
##########
@@ -565,6 +575,10 @@ private void readIntBatch(int rowId, int num, WritableColumnVector column) throw
         canReadAsIntDecimal(column.dataType())) {
       defColumn.readIntegers(
           num, column, rowId, maxDefLevel, (VectorizedValuesReader) dataColumn);
+    } else if (column.dataType() == DataTypes.LongType) {

Review comment:
       Shall we add an extra check to make sure we are reading unsigned values?
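
For readers wondering what such a check might look like, here is a rough sketch under stated assumptions. It is not code from the PR: the `Annotation` enum and the class `UnsignedReadGuardSketch` are hypothetical stand-ins, and real Spark code would consult the Parquet column descriptor instead.

```java
class UnsignedReadGuardSketch {
    // Hypothetical stand-in for the annotation attached to an INT32 column.
    enum Annotation { NONE, INT_32, UINT_32 }

    // Widen INT32 storage values to long, but only for columns annotated as unsigned.
    static long[] readIntegersAsUnsigned(int[] raw, Annotation annotation) {
        if (annotation != Annotation.UINT_32) {
            throw new IllegalStateException(
                "Expected an unsigned INT32 column but found " + annotation);
        }
        long[] out = new long[raw.length];
        for (int i = 0; i < raw.length; i++) {
            out[i] = Integer.toUnsignedLong(raw[i]);
        }
        return out;
    }

    public static void main(String[] args) {
        long[] values = readIntegersAsUnsigned(new int[] {-1, 7}, Annotation.UINT_32);
        System.out.println(values[0] + ", " + values[1]);
    }
}
```

The trade-off is the usual one for defensive checks: the guard costs little per batch, but it only adds value if the target type could be `LongType` for some reason other than an unsigned annotation.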






[GitHub] [spark] cloud-fan closed pull request #31921: [SPARK-34817][SQL] Read parquet unsigned types that stored as int32 physical type in parquet

Posted by GitBox <gi...@apache.org>.
cloud-fan closed pull request #31921:
URL: https://github.com/apache/spark/pull/31921


   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #31921: [SPARK-34817][SQL] Read parquet unsigned types that stored as int32 physical type in parquet

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31921:
URL: https://github.com/apache/spark/pull/31921#issuecomment-805822627


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41036/
   




[GitHub] [spark] SparkQA commented on pull request #31921: [SPARK-34817][SQL] Read parquet unsigned types that stored as int32 physical type in parquet

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31921:
URL: https://github.com/apache/spark/pull/31921#issuecomment-805685081


   **[Test build #136451 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136451/testReport)** for PR 31921 at commit [`0da5d07`](https://github.com/apache/spark/commit/0da5d076e5f7a54a9b8b4426e93394a6e5eb37cd).
    * This patch **fails Scala style tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] AmplabJenkins commented on pull request #31921: [SPARK-34817][SQL] Read parquet unsigned types that stored as int32 physical type in parquet

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31921:
URL: https://github.com/apache/spark/pull/31921#issuecomment-805809998


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136455/
   




[GitHub] [spark] SparkQA commented on pull request #31921: [SPARK-34817][SQL] Read parquet unsigned types that stored as int32 physical type in parquet

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31921:
URL: https://github.com/apache/spark/pull/31921#issuecomment-806082681


   **[Test build #136472 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136472/testReport)** for PR 31921 at commit [`d9afc79`](https://github.com/apache/spark/commit/d9afc7916fa08f3bea9e89ab7a48cfd38c76c190).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] SparkQA removed a comment on pull request #31921: [SPARK-34817][SQL] Read parquet unsigned types that stored as int32 physical type in parquet

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31921:
URL: https://github.com/apache/spark/pull/31921#issuecomment-803763315


   **[Test build #136330 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136330/testReport)** for PR 31921 at commit [`8ff3267`](https://github.com/apache/spark/commit/8ff3267b82dc20961a31b6b6a8d565a221497132).




[GitHub] [spark] cloud-fan commented on a change in pull request #31921: [SPARK-34817][SQL] Read parquet unsigned types that stored as int32 physical type in parquet

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #31921:
URL: https://github.com/apache/spark/pull/31921#discussion_r600587869



##########
File path: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedColumnReader.java
##########
@@ -370,6 +373,13 @@ private void decodeDictionaryIds(
               column.putInt(i, dictionary.decodeToInt(dictionaryIds.getDictId(i)));
             }
           }
+        } else if (column.dataType() == DataTypes.LongType) {

Review comment:
       When will we hit this branch? It's `case INT32`, not unsigned.






[GitHub] [spark] SparkQA commented on pull request #31921: [SPARK-34817][SQL] Read parquet unsigned types that stored as int32 physical type in parquet

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31921:
URL: https://github.com/apache/spark/pull/31921#issuecomment-803795712


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40913/
   




[GitHub] [spark] yaooqinn commented on a change in pull request #31921: [SPARK-34817][SQL] Read parquet unsigned types that stored as int32 physical type in parquet

Posted by GitBox <gi...@apache.org>.
yaooqinn commented on a change in pull request #31921:
URL: https://github.com/apache/spark/pull/31921#discussion_r598463339



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala
##########
@@ -130,13 +130,11 @@ class ParquetToSparkSchemaConverter(
       case INT32 =>
         originalType match {
           case INT_8 => ByteType
-          case INT_16 => ShortType
-          case INT_32 | null => IntegerType
+          case INT_16 | UINT_8 => ShortType
+          case INT_32 | UINT_16 | null => IntegerType
           case DATE => DateType
           case DECIMAL => makeDecimalType(Decimal.MAX_INT_DIGITS)
-          case UINT_8 => typeNotSupported()
-          case UINT_16 => typeNotSupported()
-          case UINT_32 => typeNotSupported()
+          case UINT_32 => LongType

Review comment:
       Thanks, @HyukjinKwon.
   Yes, I have checked that PR too. There is also a suggestion there that we support them.
   Recently, Wenchen created https://issues.apache.org/jira/browse/SPARK-34786 for reading uint64. Since the other unsigned types are not supported either, and they are more straightforward than uint64, which needs a decimal, I raised this PR to collect more opinions.
   
   IMO, it is worthwhile for Spark to support more storage layer features, as long as doing so does not break our own rules.
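
As a side note on why uint64 "needs a decimal": a signed 64-bit long cannot represent values at or above 2^63, so the unsigned reading has to go into a wider type. The sketch below is only an illustration with hypothetical names (`UnsignedInt64Demo`, `widen`); it uses the standard `Long.toUnsignedString`, `BigInteger`, and `BigDecimal` APIs and does not reflect how SPARK-34786 was implemented.

```java
import java.math.BigDecimal;
import java.math.BigInteger;

class UnsignedInt64Demo {
    // Interpret the 64 bits of a signed long as an unsigned value.
    static BigDecimal widen(long raw) {
        return new BigDecimal(new BigInteger(Long.toUnsignedString(raw)));
    }

    public static void main(String[] args) {
        BigDecimal max = widen(-1L);          // all 64 bits set
        System.out.println(max);              // 18446744073709551615
        System.out.println(max.precision());  // 20 decimal digits
    }
}
```

By contrast, uint32 tops out at 4294967295, which fits comfortably in a signed long, so no decimal is needed for the types covered here.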






[GitHub] [spark] SparkQA commented on pull request #31921: [SPARK-34817][SQL] Read parquet unsigned types that stored as int32 physical type in parquet

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31921:
URL: https://github.com/apache/spark/pull/31921#issuecomment-805929604


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41039/
   




[GitHub] [spark] SparkQA commented on pull request #31921: [SPARK-34817][SQL] Read parquet unsigned types that stored as int32 physical type in parquet

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31921:
URL: https://github.com/apache/spark/pull/31921#issuecomment-805807769


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41035/
   




[GitHub] [spark] AmplabJenkins commented on pull request #31921: [SPARK-34817][SQL] Read parquet unsigned types that stored as int32 physical type in parquet

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31921:
URL: https://github.com/apache/spark/pull/31921#issuecomment-806421935


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41081/
   




[GitHub] [spark] yaooqinn commented on a change in pull request #31921: [SPARK-34817][SQL] Read parquet unsigned types that stored as int32 physical type in parquet

Posted by GitBox <gi...@apache.org>.
yaooqinn commented on a change in pull request #31921:
URL: https://github.com/apache/spark/pull/31921#discussion_r600366042



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetIOSuite.scala
##########
@@ -49,26 +52,6 @@ import org.apache.spark.sql.test.SharedSparkSession
 import org.apache.spark.sql.types._
 import org.apache.spark.unsafe.types.UTF8String
 
-// Write support class for nested groups: ParquetWriter initializes GroupWriteSupport

Review comment:
       We don't need this anymore; `ExampleParquetWriter` meets our needs.






[GitHub] [spark] SparkQA commented on pull request #31921: [SPARK-34817][SQL] Read parquet unsigned types that stored as int32 physical type in parquet

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31921:
URL: https://github.com/apache/spark/pull/31921#issuecomment-805813390


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41036/
   




[GitHub] [spark] cloud-fan commented on a change in pull request #31921: [SPARK-34817][SQL] Read parquet unsigned types that stored as int32 physical type in parquet

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #31921:
URL: https://github.com/apache/spark/pull/31921#discussion_r598471813



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala
##########
@@ -130,13 +130,11 @@ class ParquetToSparkSchemaConverter(
       case INT32 =>
         originalType match {
           case INT_8 => ByteType
-          case INT_16 => ShortType
-          case INT_32 | null => IntegerType
+          case INT_16 | UINT_8 => ShortType
+          case INT_32 | UINT_16 | null => IntegerType
           case DATE => DateType
           case DECIMAL => makeDecimalType(Decimal.MAX_INT_DIGITS)
-          case UINT_8 => typeNotSupported()
-          case UINT_16 => typeNotSupported()
-          case UINT_32 => typeNotSupported()
+          case UINT_32 => LongType

Review comment:
       It's mostly about compatibility. Spark won't have unsigned types, but Spark should be able to read existing Parquet files written by other systems that support unsigned types.
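
The mapping in the diff above can be sanity-checked with a small sketch. The enum `Unsigned` and the class `UnsignedMappingSketch` are hypothetical stand-ins and the Spark type names are plain strings here. The only thing taken from the diff is which annotation goes to which type: UINT_8 to ShortType, UINT_16 to IntegerType, UINT_32 to LongType.

```java
class UnsignedMappingSketch {
    // Hypothetical stand-in for the Parquet annotations named in the diff.
    enum Unsigned {
        UINT_8(8), UINT_16(16), UINT_32(32);

        final int bits;

        Unsigned(int bits) { this.bits = bits; }

        // Largest value the annotation can encode: 2^bits - 1.
        long maxValue() { return (1L << bits) - 1; }
    }

    // Name of the Spark type the diff maps each annotation to.
    static String sparkTypeFor(Unsigned u) {
        switch (u) {
            case UINT_8: return "ShortType";
            case UINT_16: return "IntegerType";
            default: return "LongType"; // UINT_32
        }
    }

    // Upper bound of the signed type chosen for each annotation.
    static long signedMaxFor(Unsigned u) {
        switch (u) {
            case UINT_8: return Short.MAX_VALUE;
            case UINT_16: return Integer.MAX_VALUE;
            default: return Long.MAX_VALUE; // UINT_32
        }
    }

    public static void main(String[] args) {
        for (Unsigned u : Unsigned.values()) {
            System.out.println(u + " -> " + sparkTypeFor(u)
                + ", fits: " + (u.maxValue() <= signedMaxFor(u)));
        }
    }
}
```

Each unsigned type is widened to the next larger signed type, so every value an external writer could have stored remains representable without Spark gaining unsigned types of its own.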






[GitHub] [spark] SparkQA commented on pull request #31921: [SPARK-34817][SQL] Read parquet unsigned types that stored as int32 physical type in parquet

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31921:
URL: https://github.com/apache/spark/pull/31921#issuecomment-803763315


   **[Test build #136330 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136330/testReport)** for PR 31921 at commit [`8ff3267`](https://github.com/apache/spark/commit/8ff3267b82dc20961a31b6b6a8d565a221497132).




[GitHub] [spark] SparkQA removed a comment on pull request #31921: [SPARK-34817][SQL] Read parquet unsigned types that stored as int32 physical type in parquet

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31921:
URL: https://github.com/apache/spark/pull/31921#issuecomment-805680315


   **[Test build #136451 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136451/testReport)** for PR 31921 at commit [`0da5d07`](https://github.com/apache/spark/commit/0da5d076e5f7a54a9b8b4426e93394a6e5eb37cd).




[GitHub] [spark] AmplabJenkins commented on pull request #31921: [SPARK-34817][SQL] Read parquet unsigned types that stored as int32 physical type in parquet

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31921:
URL: https://github.com/apache/spark/pull/31921#issuecomment-803817012


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/40913/
   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #31921: [SPARK-34817][SQL] Read parquet unsigned types that stored as int32 physical type in parquet

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31921:
URL: https://github.com/apache/spark/pull/31921#issuecomment-805868186


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41035/
   




[GitHub] [spark] liancheng commented on a change in pull request #31921: [SPARK-34817][SQL] Read parquet unsigned types that stored as int32 physical type in parquet

Posted by GitBox <gi...@apache.org>.
liancheng commented on a change in pull request #31921:
URL: https://github.com/apache/spark/pull/31921#discussion_r598464064



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala
##########
@@ -130,13 +130,11 @@ class ParquetToSparkSchemaConverter(
       case INT32 =>
         originalType match {
           case INT_8 => ByteType
-          case INT_16 => ShortType
-          case INT_32 | null => IntegerType
+          case INT_16 | UINT_8 => ShortType
+          case INT_32 | UINT_16 | null => IntegerType
           case DATE => DateType
           case DECIMAL => makeDecimalType(Decimal.MAX_INT_DIGITS)
-          case UINT_8 => typeNotSupported()
-          case UINT_16 => typeNotSupported()
-          case UINT_32 => typeNotSupported()
+          case UINT_32 => LongType

Review comment:
       My hunch is that Spark SQL didn't support unsigned integral types at all back then. As long as we support that now, it's OK to have.
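
       A minimal sketch of the mapping in the hunk above, assuming stand-in enums rather than the real Parquet and Spark classes (`Int32Annotation`, `SparkType`, and `maxUnsigned` are hypothetical names used only for illustration). It shows why each unsigned annotation is widened to the next larger signed type: the largest unsigned value must still fit.

```java
final class UnsignedInt32MappingSketch {
    // Hypothetical stand-ins for the annotations that can sit on Parquet's INT32 physical type.
    enum Int32Annotation { INT_8, INT_16, INT_32, UINT_8, UINT_16, UINT_32 }

    // Hypothetical stand-ins for the Spark SQL types named in the diff.
    enum SparkType { BYTE, SHORT, INTEGER, LONG }

    // Mirrors the cases in the diff: each unsigned type maps to the next wider signed type.
    static SparkType toSparkType(Int32Annotation annotation) {
        switch (annotation) {
            case INT_8:
                return SparkType.BYTE;
            case INT_16:
            case UINT_8:
                return SparkType.SHORT;
            case INT_32:
            case UINT_16:
                return SparkType.INTEGER;
            case UINT_32:
                return SparkType.LONG;
            default:
                throw new IllegalArgumentException("unexpected annotation: " + annotation);
        }
    }

    // Largest value representable by an unsigned integer of the given bit width (up to 32).
    static long maxUnsigned(int bits) {
        return (1L << bits) - 1;
    }

    public static void main(String[] args) {
        for (Int32Annotation a : Int32Annotation.values()) {
            System.out.println(a + " -> " + toSparkType(a));
        }
    }
}
```

       The un-annotated case (`null` in the diff) is left out of the sketch; in the diff it falls through to `IntegerType` together with `INT_32`.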






[GitHub] [spark] SparkQA commented on pull request #31921: [SPARK-34817][SQL] Read parquet unsigned types that stored as int32 physical type in parquet

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31921:
URL: https://github.com/apache/spark/pull/31921#issuecomment-805861909


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41039/
   




[GitHub] [spark] SparkQA commented on pull request #31921: [SPARK-34817][SQL] Read parquet unsigned types that stored as int32 physical type in parquet

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31921:
URL: https://github.com/apache/spark/pull/31921#issuecomment-805822593


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41036/
   




[GitHub] [spark] AmplabJenkins commented on pull request #31921: [SPARK-34817][SQL] Read parquet unsigned types that stored as int32 physical type in parquet

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31921:
URL: https://github.com/apache/spark/pull/31921#issuecomment-805746736


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136452/
   




[GitHub] [spark] AmplabJenkins commented on pull request #31921: [SPARK-34817][SQL] Read parquet unsigned types that stored as int32 physical type in parquet

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31921:
URL: https://github.com/apache/spark/pull/31921#issuecomment-806393928


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41076/
   




[GitHub] [spark] AmplabJenkins commented on pull request #31921: [SPARK-34817][SQL] Read parquet unsigned types that stored as int32 physical type in parquet

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31921:
URL: https://github.com/apache/spark/pull/31921#issuecomment-806220276


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41056/
   




[GitHub] [spark] AmplabJenkins commented on pull request #31921: [SPARK-34817][SQL] Read parquet unsigned types that stored as int32 physical type in parquet

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31921:
URL: https://github.com/apache/spark/pull/31921#issuecomment-806529209


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136497/
   




[GitHub] [spark] SparkQA commented on pull request #31921: [SPARK-34817][SQL] Read parquet unsigned types that stored as int32 physical type in parquet

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31921:
URL: https://github.com/apache/spark/pull/31921#issuecomment-806396161


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41081/
   




[GitHub] [spark] yaooqinn commented on a change in pull request #31921: [SPARK-34817][SQL] Read parquet unsigned types that stored as int32 physical type in parquet

Posted by GitBox <gi...@apache.org>.
yaooqinn commented on a change in pull request #31921:
URL: https://github.com/apache/spark/pull/31921#discussion_r598463339



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala
##########
@@ -130,13 +130,11 @@ class ParquetToSparkSchemaConverter(
       case INT32 =>
         originalType match {
           case INT_8 => ByteType
-          case INT_16 => ShortType
-          case INT_32 | null => IntegerType
+          case INT_16 | UINT_8 => ShortType
+          case INT_32 | UINT_16 | null => IntegerType
           case DATE => DateType
           case DECIMAL => makeDecimalType(Decimal.MAX_INT_DIGITS)
-          case UINT_8 => typeNotSupported()
-          case UINT_16 => typeNotSupported()
-          case UINT_32 => typeNotSupported()
+          case UINT_32 => LongType

Review comment:
       Thanks, @HyukjinKwon.
   Yeah, I have checked that PR too. There's also a suggestion there that we support them.
   Recently, Wenchen created https://issues.apache.org/jira/browse/SPARK-34786 for reading uint64. The other unsigned types are not supported either, and handling them is a bit more clear-cut than uint64, which needs a decimal.
   
   IMO, for Spark, it is worthwhile to be able to support more storage layer features without breaking our own rules. So I raised this PR to collect more opinions.






[GitHub] [spark] HyukjinKwon commented on a change in pull request #31921: [SPARK-34817][SQL] Read parquet unsigned types that stored as int32 physical type in parquet

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #31921:
URL: https://github.com/apache/spark/pull/31921#discussion_r598455484



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala
##########
@@ -130,13 +130,11 @@ class ParquetToSparkSchemaConverter(
       case INT32 =>
         originalType match {
           case INT_8 => ByteType
-          case INT_16 => ShortType
-          case INT_32 | null => IntegerType
+          case INT_16 | UINT_8 => ShortType
+          case INT_32 | UINT_16 | null => IntegerType
           case DATE => DateType
           case DECIMAL => makeDecimalType(Decimal.MAX_INT_DIGITS)
-          case UINT_8 => typeNotSupported()
-          case UINT_16 => typeNotSupported()
-          case UINT_32 => typeNotSupported()
+          case UINT_32 => LongType

Review comment:
       These were explicitly made unsupported in https://github.com/apache/spark/pull/9646, per the advice of @liancheng (who is also a Parquet committer), so I'm less sure whether this is something we should support.






[GitHub] [spark] cloud-fan commented on pull request #31921: [SPARK-34817][SQL] Read parquet unsigned types that stored as int32 physical type in parquet

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on pull request #31921:
URL: https://github.com/apache/spark/pull/31921#issuecomment-806411811


   thanks, merging to master!




[GitHub] [spark] AmplabJenkins commented on pull request #31921: [SPARK-34817][SQL] Read parquet unsigned types that stored as int32 physical type in parquet

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31921:
URL: https://github.com/apache/spark/pull/31921#issuecomment-803894371


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136330/
   




[GitHub] [spark] yaooqinn commented on a change in pull request #31921: [SPARK-34817][SQL] Read parquet unsigned types that stored as int32 physical type in parquet

Posted by GitBox <gi...@apache.org>.
yaooqinn commented on a change in pull request #31921:
URL: https://github.com/apache/spark/pull/31921#discussion_r598477793



##########
File path: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedColumnReader.java
##########
@@ -565,6 +565,10 @@ private void readIntBatch(int rowId, int num, WritableColumnVector column) throw
         canReadAsIntDecimal(column.dataType())) {
       defColumn.readIntegers(
           num, column, rowId, maxDefLevel, (VectorizedValuesReader) dataColumn);
+    } else if (column.dataType() == DataTypes.LongType) {

Review comment:
       These values fit in a signed int32 well. Parquet will not convert them back and forth, so it's fine for us to read the raw data.
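
       A minimal sketch of that point, assuming an unsigned value is stored as the low 32 bits of its two's-complement bit pattern (the class and method names are hypothetical). Values in the UINT_8 and UINT_16 range come back unchanged from a plain signed read, while a UINT_32 value above `Integer.MAX_VALUE` only survives if the read widens it as unsigned.

```java
final class RawInt32ReadSketch {
    // Keeps only the low 32 bits, which is how an unsigned 32-bit value lands in an int32 slot.
    static int storeAsInt32(long unsignedValue) {
        return (int) unsignedValue;
    }

    // Plain signed read: the int is sign-extended when widened to long.
    static long readRaw(int stored) {
        return stored;
    }

    // Unsigned read: the int is zero-extended when widened to long.
    static long readUnsigned(int stored) {
        return Integer.toUnsignedLong(stored);
    }

    public static void main(String[] args) {
        long[] samples = {255L, 65535L, 4294967295L};
        for (long v : samples) {
            int stored = storeAsInt32(v);
            System.out.println(v + ": raw=" + readRaw(stored) + ", unsigned=" + readUnsigned(stored));
        }
    }
}
```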






[GitHub] [spark] AmplabJenkins removed a comment on pull request #31921: [SPARK-34817][SQL] Read parquet unsigned types that stored as int32 physical type in parquet

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31921:
URL: https://github.com/apache/spark/pull/31921#issuecomment-803762580


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136327/
   




[GitHub] [spark] yaooqinn commented on pull request #31921: [SPARK-34817][SQL] Read parquet unsigned types that stored as int32 physical type in parquet

Posted by GitBox <gi...@apache.org>.
yaooqinn commented on pull request #31921:
URL: https://github.com/apache/spark/pull/31921#issuecomment-803801061


   cc @HyukjinKwon @cloud-fan @dongjoon-hyun thanks for reviewing




[GitHub] [spark] SparkQA commented on pull request #31921: [SPARK-34817][SQL] Read parquet unsigned types that stored as int32 physical type in parquet

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31921:
URL: https://github.com/apache/spark/pull/31921#issuecomment-806203995


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41056/
   




[GitHub] [spark] SparkQA commented on pull request #31921: [SPARK-34817][SQL] Read parquet unsigned types that stored as int32 physical type in parquet

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31921:
URL: https://github.com/apache/spark/pull/31921#issuecomment-803780201


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40909/
   




[GitHub] [spark] SparkQA commented on pull request #31921: [SPARK-34817][SQL] Read parquet unsigned types that stored as int32 physical type in parquet

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31921:
URL: https://github.com/apache/spark/pull/31921#issuecomment-803891595


   **[Test build #136330 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136330/testReport)** for PR 31921 at commit [`8ff3267`](https://github.com/apache/spark/commit/8ff3267b82dc20961a31b6b6a8d565a221497132).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] SparkQA commented on pull request #31921: [SPARK-34817][SQL] Read parquet unsigned types that stored as int32 physical type in parquet

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31921:
URL: https://github.com/apache/spark/pull/31921#issuecomment-805739301


   **[Test build #136452 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136452/testReport)** for PR 31921 at commit [`efe9c4a`](https://github.com/apache/spark/commit/efe9c4af56fa76aac00ca7f819cce50cb03eaf43).




[GitHub] [spark] AmplabJenkins removed a comment on pull request #31921: [SPARK-34817][SQL] Read parquet unsigned types that stored as int32 physical type in parquet

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31921:
URL: https://github.com/apache/spark/pull/31921#issuecomment-806421935


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41081/
   




[GitHub] [spark] cloud-fan commented on a change in pull request #31921: [SPARK-34817][SQL] Read parquet unsigned types that stored as int32 physical type in parquet

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #31921:
URL: https://github.com/apache/spark/pull/31921#discussion_r598473997



##########
File path: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedColumnReader.java
##########
@@ -565,6 +565,10 @@ private void readIntBatch(int rowId, int num, WritableColumnVector column) throw
         canReadAsIntDecimal(column.dataType())) {
       defColumn.readIntegers(
           num, column, rowId, maxDefLevel, (VectorizedValuesReader) dataColumn);
+    } else if (column.dataType() == DataTypes.LongType) {
+      //  We use LongType to handle UINT32
+      defColumn.readIntegersAsUnsigned(

Review comment:
       Can we follow https://github.com/apache/spark/commit/38fbe560fd08168e90c575f7707368ddf758c3a9 and check whether dictionary encoding also needs an update?
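
       A rough sketch of why the dictionary path may need the same treatment, assuming a dictionary page that holds int32 values and a data page that holds ids into it. The class, method, and array layout here are hypothetical and do not reflect Spark's actual dictionary-decoding code; the point is only that the unsigned widening has to be applied wherever a dictionary entry is turned into a column value.

```java
final class DictionaryUnsignedDecodeSketch {
    // Looks up each id in the int32 dictionary and zero-extends the entry to a long.
    static long[] decode(int[] dictionary, int[] ids) {
        long[] out = new long[ids.length];
        for (int i = 0; i < ids.length; i++) {
            out[i] = Integer.toUnsignedLong(dictionary[ids[i]]);
        }
        return out;
    }

    public static void main(String[] args) {
        int[] dictionary = {0, 7, -1};   // -1 is the int32 bit pattern of 4294967295
        int[] ids = {2, 1, 0, 2};
        for (long v : decode(dictionary, ids)) {
            System.out.println(v);
        }
    }
}
```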






[GitHub] [spark] AmplabJenkins commented on pull request #31921: [SPARK-34817][SQL] Read parquet unsigned types that stored as int32 physical type in parquet

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31921:
URL: https://github.com/apache/spark/pull/31921#issuecomment-805685174


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136451/
   




[GitHub] [spark] cloud-fan commented on a change in pull request #31921: [SPARK-34817][SQL] Read parquet unsigned types that stored as int32 physical type in parquet

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #31921:
URL: https://github.com/apache/spark/pull/31921#discussion_r600718602



##########
File path: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedRleValuesReader.java
##########
@@ -203,6 +203,41 @@ public void readIntegers(
     }
   }
 
+  // A fork of `readIntegers`, reading the signed integers as unsigned in long type
+  public void readUnsignedIntegers(
+    int total,

Review comment:
       nit: let's follow other methods in this file and use 4 spaces for parameter indentation.






[GitHub] [spark] yaooqinn commented on a change in pull request #31921: [SPARK-34817][SQL] Read parquet unsigned types that stored as int32 physical type in parquet

Posted by GitBox <gi...@apache.org>.
yaooqinn commented on a change in pull request #31921:
URL: https://github.com/apache/spark/pull/31921#discussion_r600620188



##########
File path: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedPlainValuesReader.java
##########
@@ -83,6 +83,15 @@ public final void readIntegers(int total, WritableColumnVector c, int rowId) {
     }
   }
 
+  @Override
+  public final void readUnsignedIntegers(int total, WritableColumnVector c, int rowId) {
+    int requiredBytes = total * 4;
+    ByteBuffer buffer = getBuffer(requiredBytes);
+    for (int i = 0; i < total; i += 1) {
+      c.putLong(rowId + i, Integer.toUnsignedLong(buffer.getInt()));

Review comment:
       Maybe we can improve this by converting the `buffer.array()` contents to unsigned values in bulk, but I am not sure it's faster, or how to do that right now.
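
       A self-contained sketch of the per-value loop in the hunk above, assuming a little-endian `ByteBuffer` of 4-byte values and writing into a plain `long[]` instead of a `WritableColumnVector` (the class name and the `encode` helper are hypothetical, added only so the example runs on its own).

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class PlainUnsignedReadSketch {
    // Reads `total` 4-byte integers from the buffer and zero-extends each one to a long.
    static long[] readUnsignedIntegers(ByteBuffer buffer, int total) {
        long[] out = new long[total];
        for (int i = 0; i < total; i += 1) {
            out[i] = Integer.toUnsignedLong(buffer.getInt());
        }
        return out;
    }

    // Hypothetical helper: packs ints into a little-endian buffer that is ready for reading.
    static ByteBuffer encode(int... values) {
        ByteBuffer buffer = ByteBuffer.allocate(values.length * 4).order(ByteOrder.LITTLE_ENDIAN);
        for (int v : values) {
            buffer.putInt(v);
        }
        buffer.flip();
        return buffer;
    }

    public static void main(String[] args) {
        long[] values = readUnsignedIntegers(encode(0, 1, -1, Integer.MIN_VALUE), 4);
        for (long v : values) {
            System.out.println(v);
        }
    }
}
```

       The sketch keeps the one-value-at-a-time conversion from the diff; whether a bulk conversion over the backing array would be faster is left open, as in the comment above.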






[GitHub] [spark] SparkQA commented on pull request #31921: [SPARK-34817][SQL] Read parquet unsigned types that stored as int32 physical type in parquet

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31921:
URL: https://github.com/apache/spark/pull/31921#issuecomment-806386585


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41076/
   




[GitHub] [spark] SparkQA removed a comment on pull request #31921: [SPARK-34817][SQL] Read parquet unsigned types that stored as int32 physical type in parquet

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31921:
URL: https://github.com/apache/spark/pull/31921#issuecomment-806316357


   **[Test build #136490 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136490/testReport)** for PR 31921 at commit [`02cee4f`](https://github.com/apache/spark/commit/02cee4f7751f63528bb280d67df7492a33488463).




[GitHub] [spark] SparkQA commented on pull request #31921: [SPARK-34817][SQL] Read parquet unsigned types that stored as int32 physical type in parquet

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31921:
URL: https://github.com/apache/spark/pull/31921#issuecomment-806397705


   **[Test build #136490 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136490/testReport)** for PR 31921 at commit [`02cee4f`](https://github.com/apache/spark/commit/02cee4f7751f63528bb280d67df7492a33488463).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] SparkQA commented on pull request #31921: [SPARK-34817][SQL] Read parquet unsigned types that stored as int32 physical type in parquet

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31921:
URL: https://github.com/apache/spark/pull/31921#issuecomment-806352122


   **[Test build #136497 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136497/testReport)** for PR 31921 at commit [`71496bd`](https://github.com/apache/spark/commit/71496bdc4d5c8081139e5a26fa9bddfa1ddc38ed).




[GitHub] [spark] AmplabJenkins removed a comment on pull request #31921: [SPARK-34817][SQL] Read parquet unsigned types that stored as int32 physical type in parquet

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31921:
URL: https://github.com/apache/spark/pull/31921#issuecomment-806398414


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136490/
   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #31921: [SPARK-34817][SQL] Read parquet unsigned types that stored as int32 physical type in parquet

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31921:
URL: https://github.com/apache/spark/pull/31921#issuecomment-805809998


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136455/
   




[GitHub] [spark] SparkQA removed a comment on pull request #31921: [SPARK-34817][SQL] Read parquet unsigned types that stored as int32 physical type in parquet

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31921:
URL: https://github.com/apache/spark/pull/31921#issuecomment-803745469


   **[Test build #136327 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136327/testReport)** for PR 31921 at commit [`0c8b6d4`](https://github.com/apache/spark/commit/0c8b6d45455744745d2df87793f4acd94d58656e).




[GitHub] [spark] cloud-fan commented on a change in pull request #31921: [SPARK-34817][SQL] Read parquet unsigned types that stored as int32 physical type in parquet

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #31921:
URL: https://github.com/apache/spark/pull/31921#discussion_r598471813



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala
##########
@@ -130,13 +130,11 @@ class ParquetToSparkSchemaConverter(
       case INT32 =>
         originalType match {
           case INT_8 => ByteType
-          case INT_16 => ShortType
-          case INT_32 | null => IntegerType
+          case INT_16 | UINT_8 => ShortType
+          case INT_32 | UINT_16 | null => IntegerType
           case DATE => DateType
           case DECIMAL => makeDecimalType(Decimal.MAX_INT_DIGITS)
-          case UINT_8 => typeNotSupported()
-          case UINT_16 => typeNotSupported()
-          case UINT_32 => typeNotSupported()
+          case UINT_32 => LongType

Review comment:
       It's more about compatibility. Spark won't have unsigned types, but Spark should be able to read existing Parquet files written by other systems that support unsigned types.
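
As a rough illustration of why the diff above maps each unsigned logical type to the next wider signed Spark type, the sketch below compares value ranges. Only the pairing (UINT_8 to ShortType, UINT_16 to IntegerType, UINT_32 to LongType) comes from the diff; the class, enum, and method names are hypothetical and are not part of Spark or Parquet.

```java
public class UnsignedWidening {
    // Hypothetical enum for illustration only; it mirrors the pairing in the diff.
    enum UnsignedType {
        UINT_8(8, Byte.MAX_VALUE, "ShortType", Short.MAX_VALUE),
        UINT_16(16, Short.MAX_VALUE, "IntegerType", Integer.MAX_VALUE),
        UINT_32(32, Integer.MAX_VALUE, "LongType", Long.MAX_VALUE);

        final int bits;               // width of the unsigned type
        final long sameWidthSignedMax; // max of the signed type with the same width
        final String sparkType;       // Spark type chosen in the diff
        final long sparkTypeMax;      // max value that Spark type can hold

        UnsignedType(int bits, long sameWidthSignedMax, String sparkType, long sparkTypeMax) {
            this.bits = bits;
            this.sameWidthSignedMax = sameWidthSignedMax;
            this.sparkType = sparkType;
            this.sparkTypeMax = sparkTypeMax;
        }

        // Largest value representable in an unsigned integer of this width.
        long maxValue() {
            return (1L << bits) - 1;
        }

        // True if a signed type of the same width could hold every value.
        boolean fitsSameWidthSigned() {
            return maxValue() <= sameWidthSignedMax;
        }

        // True if the wider Spark type from the diff can hold every value.
        boolean fitsSparkType() {
            return maxValue() <= sparkTypeMax;
        }
    }

    public static void main(String[] args) {
        for (UnsignedType t : UnsignedType.values()) {
            System.out.println(t + ": max=" + t.maxValue()
                + ", fits same-width signed=" + t.fitsSameWidthSigned()
                + ", fits " + t.sparkType + "=" + t.fitsSparkType());
        }
    }
}
```

In every case the same-width signed type is too narrow for the top half of the unsigned range, while the wider type holds all values without loss.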






[GitHub] [spark] AmplabJenkins removed a comment on pull request #31921: [SPARK-34817][SQL] Read parquet unsigned types that stored as int32 physical type in parquet

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31921:
URL: https://github.com/apache/spark/pull/31921#issuecomment-803894371


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136330/
   




[GitHub] [spark] SparkQA commented on pull request #31921: [SPARK-34817][SQL] Read parquet unsigned types that stored as int32 physical type in parquet

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31921:
URL: https://github.com/apache/spark/pull/31921#issuecomment-805963145


   **[Test build #136472 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136472/testReport)** for PR 31921 at commit [`d9afc79`](https://github.com/apache/spark/commit/d9afc7916fa08f3bea9e89ab7a48cfd38c76c190).




[GitHub] [spark] SparkQA commented on pull request #31921: [SPARK-34817][SQL] Read parquet unsigned types that stored as int32 physical type in parquet

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31921:
URL: https://github.com/apache/spark/pull/31921#issuecomment-806316357


   **[Test build #136490 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136490/testReport)** for PR 31921 at commit [`02cee4f`](https://github.com/apache/spark/commit/02cee4f7751f63528bb280d67df7492a33488463).




[GitHub] [spark] AmplabJenkins removed a comment on pull request #31921: [SPARK-34817][SQL] Read parquet unsigned types that stored as int32 physical type in parquet

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31921:
URL: https://github.com/apache/spark/pull/31921#issuecomment-806529209


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136497/
   




[GitHub] [spark] SparkQA commented on pull request #31921: [SPARK-34817][SQL] Read parquet unsigned types that stored as int32 physical type in parquet

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31921:
URL: https://github.com/apache/spark/pull/31921#issuecomment-805746679


   **[Test build #136452 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136452/testReport)** for PR 31921 at commit [`efe9c4a`](https://github.com/apache/spark/commit/efe9c4af56fa76aac00ca7f819cce50cb03eaf43).
    * This patch **fails to build**.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] AmplabJenkins commented on pull request #31921: [SPARK-34817][SQL] Read parquet unsigned types that stored as int32 physical type in parquet

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31921:
URL: https://github.com/apache/spark/pull/31921#issuecomment-803762580


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136327/
   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #31921: [SPARK-34817][SQL] Read parquet unsigned types that stored as int32 physical type in parquet

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31921:
URL: https://github.com/apache/spark/pull/31921#issuecomment-805961385


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41039/
   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #31921: [SPARK-34817][SQL] Read parquet unsigned types that stored as int32 physical type in parquet

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31921:
URL: https://github.com/apache/spark/pull/31921#issuecomment-805746736


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136452/
   




[GitHub] [spark] AmplabJenkins commented on pull request #31921: [SPARK-34817][SQL] Read parquet unsigned types that stored as int32 physical type in parquet

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31921:
URL: https://github.com/apache/spark/pull/31921#issuecomment-803787635


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/40909/
   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #31921: [SPARK-34817][SQL] Read parquet unsigned types that stored as int32 physical type in parquet

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31921:
URL: https://github.com/apache/spark/pull/31921#issuecomment-806220276


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41056/
   




[GitHub] [spark] SparkQA commented on pull request #31921: [SPARK-34817][SQL] Read parquet unsigned types that stored as int32 physical type in parquet

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31921:
URL: https://github.com/apache/spark/pull/31921#issuecomment-805680315


   **[Test build #136451 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136451/testReport)** for PR 31921 at commit [`0da5d07`](https://github.com/apache/spark/commit/0da5d076e5f7a54a9b8b4426e93394a6e5eb37cd).




[GitHub] [spark] yaooqinn commented on a change in pull request #31921: [SPARK-34817][SQL] Read parquet unsigned types that stored as int32 physical type in parquet

Posted by GitBox <gi...@apache.org>.
yaooqinn commented on a change in pull request #31921:
URL: https://github.com/apache/spark/pull/31921#discussion_r600614841



##########
File path: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedColumnReader.java
##########
@@ -370,6 +373,13 @@ private void decodeDictionaryIds(
               column.putInt(i, dictionary.decodeToInt(dictionaryIds.getDictId(i)));
             }
           }
+        } else if (column.dataType() == DataTypes.LongType) {

Review comment:
       On the Parquet side, signed and unsigned integer types of 32 bits or fewer share the same `PrimitiveType`, `INT32`. The unsigned ones are just logical types.
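
Because the physical type is shared, reading a UINT_32 value amounts to reinterpreting the stored 32 bits as unsigned and widening the result to a `long`. The sketch below shows that reinterpretation using only JDK methods. It is not Spark's reader code, and the class and method names are hypothetical.

```java
public class UnsignedInt32Decode {
    // Reinterpret the 32 bits of a stored INT32 as an unsigned value.
    // Integer.toUnsignedLong zero-extends instead of sign-extending.
    static long decodeUint32(int stored) {
        return Integer.toUnsignedLong(stored);
    }

    // Equivalent formulation with an explicit mask, shown for comparison.
    static long decodeUint32WithMask(int stored) {
        return stored & 0xFFFFFFFFL;
    }

    public static void main(String[] args) {
        int[] storedValues = {7, -1, Integer.MIN_VALUE};
        for (int stored : storedValues) {
            System.out.println("stored INT32 " + stored
                + " -> unsigned " + decodeUint32(stored));
        }
        // The bit pattern of -1 (all ones) is the largest UINT_32 value.
        System.out.println(decodeUint32(-1) == 4294967295L); // prints "true"
    }
}
```

Non-negative stored values pass through unchanged; only bit patterns with the high bit set, which a signed read would report as negative, differ after decoding.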






[GitHub] [spark] SparkQA commented on pull request #31921: [SPARK-34817][SQL] Read parquet unsigned types that stored as int32 physical type in parquet

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31921:
URL: https://github.com/apache/spark/pull/31921#issuecomment-803786190


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40909/
   




[GitHub] [spark] yaooqinn commented on pull request #31921: [SPARK-34817][SQL] Read parquet unsigned types that stored as int32 physical type in parquet

Posted by GitBox <gi...@apache.org>.
yaooqinn commented on pull request #31921:
URL: https://github.com/apache/spark/pull/31921#issuecomment-805909628


   @cloud-fan @liancheng @HyukjinKwon @maropu please take another look




[GitHub] [spark] AmplabJenkins removed a comment on pull request #31921: [SPARK-34817][SQL] Read parquet unsigned types that stored as int32 physical type in parquet

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31921:
URL: https://github.com/apache/spark/pull/31921#issuecomment-806087005


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136472/
   




[GitHub] [spark] SparkQA commented on pull request #31921: [SPARK-34817][SQL] Read parquet unsigned types that stored as int32 physical type in parquet

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31921:
URL: https://github.com/apache/spark/pull/31921#issuecomment-803752499


   **[Test build #136327 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136327/testReport)** for PR 31921 at commit [`0c8b6d4`](https://github.com/apache/spark/commit/0c8b6d45455744745d2df87793f4acd94d58656e).
    * This patch **fails Java style tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] yaooqinn commented on a change in pull request #31921: [SPARK-34817][SQL] Read parquet unsigned types that stored as int32 physical type in parquet

Posted by GitBox <gi...@apache.org>.
yaooqinn commented on a change in pull request #31921:
URL: https://github.com/apache/spark/pull/31921#discussion_r598477885



##########
File path: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedColumnReader.java
##########
@@ -565,6 +565,10 @@ private void readIntBatch(int rowId, int num, WritableColumnVector column) throw
         canReadAsIntDecimal(column.dataType())) {
       defColumn.readIntegers(
           num, column, rowId, maxDefLevel, (VectorizedValuesReader) dataColumn);
+    } else if (column.dataType() == DataTypes.LongType) {
+      //  We use LongType to handle UINT32
+      defColumn.readIntegersAsUnsigned(

Review comment:
       OK, checking~



