Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/11/11 18:49:08 UTC

[GitHub] [spark] kazuyukitanimura opened a new pull request, #38628: [SPARK-41096][SQL] Support reading parquet FIXED_LEN_BYTE_ARRAY type

kazuyukitanimura opened a new pull request, #38628:
URL: https://github.com/apache/spark/pull/38628

   ### What changes were proposed in this pull request?
   Parquet supports the FIXED_LEN_BYTE_ARRAY (FLBA) data type. However, the Spark Parquet reader currently cannot handle FLBA.
   This PR proposes reading FLBA columns as BinaryType data in Spark.
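
As a rough illustration of why FLBA values fit naturally into a binary column, the sketch below decodes PLAIN-encoded values for both Parquet byte-array physical types. It is not Spark's reader code: the function names `decode_plain_byte_array` and `decode_plain_flba` are hypothetical, and the sketch assumes an uncompressed PLAIN-encoded values buffer with no definition or repetition levels. In PLAIN encoding, BYTE_ARRAY values carry a 4-byte little-endian length prefix, while FIXED_LEN_BYTE_ARRAY values are stored back to back with the width taken from the schema.

```python
import struct
from typing import List


def decode_plain_byte_array(buf: bytes, count: int) -> List[bytes]:
    """Decode `count` BYTE_ARRAY values: 4-byte little-endian length, then the bytes."""
    values = []
    pos = 0
    for _ in range(count):
        (length,) = struct.unpack_from("<I", buf, pos)
        pos += 4
        if pos + length > len(buf):
            raise ValueError("buffer too short for declared value length")
        values.append(buf[pos:pos + length])
        pos += length
    return values


def decode_plain_flba(buf: bytes, type_length: int, count: int) -> List[bytes]:
    """Decode `count` FIXED_LEN_BYTE_ARRAY values of width `type_length` (from the schema)."""
    if type_length <= 0:
        raise ValueError("type_length must be positive")
    if len(buf) < type_length * count:
        raise ValueError("buffer too short for the requested number of values")
    # No length prefix: value i occupies bytes [i * type_length, (i + 1) * type_length).
    return [buf[i * type_length:(i + 1) * type_length] for i in range(count)]


# Both decoders produce plain byte strings, which is what a binary column holds.
flba_values = decode_plain_flba(b"abcdwxyz", type_length=4, count=2)
byte_array_values = decode_plain_byte_array(
    struct.pack("<I", 4) + b"abcd" + struct.pack("<I", 4) + b"wxyz", count=2
)
```

Either way the decoded values are ordinary byte strings, so mapping FLBA to a binary type loses only the fixed-width constraint, not any data.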
   
   ### Why are the changes needed?
   The Iceberg Parquet reader, for example, can handle FLBA. This PR reduces the implementation gap between the Spark and Iceberg Parquet readers.
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   ### How was this patch tested?
   Unit test added


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] kazuyukitanimura commented on pull request #38628: [SPARK-41096][SQL] Support reading parquet FIXED_LEN_BYTE_ARRAY type

Posted by GitBox <gi...@apache.org>.
kazuyukitanimura commented on PR #38628:
URL: https://github.com/apache/spark/pull/38628#issuecomment-1314250536

   Thank you @huaxingao @sunchao @LuciferYang 




[GitHub] [spark] kazuyukitanimura commented on pull request #38628: [SPARK-41096][SQL] Support reading parquet FIXED_LEN_BYTE_ARRAY type

Posted by GitBox <gi...@apache.org>.
kazuyukitanimura commented on PR #38628:
URL: https://github.com/apache/spark/pull/38628#issuecomment-1314336519

   Thanks @viirya. I also came across PR https://github.com/apache/spark/pull/35902, along with https://github.com/apache/spark/pull/20826 and https://github.com/apache/spark/pull/1737, after I submitted this PR.
   I believe none of those attempts were merged because they lacked a Parquet test, as pointed out here: https://github.com/apache/spark/pull/20826#issuecomment-521070164
   This PR has a dedicated Parquet test and has been merged.
   
   The Avro compatibility tests would be nice to have. I wonder whether the previous authors are still interested in working on them. @ghost @aws-awinstan @nicolaslrveiga




[GitHub] [spark] huaxingao commented on pull request #38628: [SPARK-41096][SQL] Support reading parquet FIXED_LEN_BYTE_ARRAY type

Posted by GitBox <gi...@apache.org>.
huaxingao commented on PR #38628:
URL: https://github.com/apache/spark/pull/38628#issuecomment-1313171254

   @kazuyukitanimura Thanks for working on this! 
   
   I took a look at how Iceberg handles FLBA. For the Iceberg type `Types.FixedType`, the underlying Parquet type is `fixed_len_byte_array`. In `Scan.readSchema`, Iceberg maps `Types.FixedType` to Spark `BinaryType`. Since FLBA is a valid Parquet data type, it makes sense to me for Spark to support this type and map it to `BinaryType`.




[GitHub] [spark] viirya commented on pull request #38628: [SPARK-41096][SQL] Support reading parquet FIXED_LEN_BYTE_ARRAY type

Posted by GitBox <gi...@apache.org>.
viirya commented on PR #38628:
URL: https://github.com/apache/spark/pull/38628#issuecomment-1314270163

   Just found a previous PR, #35902. The change is the same, but it includes some Avro tests that we could consider adding as a follow-up.




[GitHub] [spark] sunchao closed pull request #38628: [SPARK-41096][SQL] Support reading parquet FIXED_LEN_BYTE_ARRAY type

Posted by GitBox <gi...@apache.org>.
sunchao closed pull request #38628: [SPARK-41096][SQL] Support reading parquet FIXED_LEN_BYTE_ARRAY type
URL: https://github.com/apache/spark/pull/38628




[GitHub] [spark] sunchao commented on pull request #38628: [SPARK-41096][SQL] Support reading parquet FIXED_LEN_BYTE_ARRAY type

Posted by GitBox <gi...@apache.org>.
sunchao commented on PR #38628:
URL: https://github.com/apache/spark/pull/38628#issuecomment-1314225366

   Committed to master, thanks @kazuyukitanimura !




[GitHub] [spark] kazuyukitanimura commented on pull request #38628: [SPARK-41096][SQL] Support reading parquet FIXED_LEN_BYTE_ARRAY type

Posted by GitBox <gi...@apache.org>.
kazuyukitanimura commented on PR #38628:
URL: https://github.com/apache/spark/pull/38628#issuecomment-1312069028

   cc @huaxingao @sunchao @viirya

