Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/05/23 09:38:23 UTC

[GitHub] [iceberg] lcspinter opened a new pull request, #4841: Parquet: Fix Iceberg's parquet reader returning binary data instead of string data

lcspinter opened a new pull request, #4841:
URL: https://github.com/apache/iceberg/pull/4841

   String columns in Parquet files written by writers that omit logical type annotations are read back as binary data instead of strings. This can be reproduced with files written by older versions of Hive or Impala.
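
The behaviour described above can be illustrated with a small, self-contained sketch. It is not Iceberg's reader code: `ParquetColumn`, `read_by_annotation`, and `read_by_expected_type` are hypothetical names invented for this illustration, and the type labels are plain strings. The sketch assumes the approach suggested elsewhere in this thread, namely that the reader respects the type given in the table schema rather than relying only on the annotation stored in the file.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class ParquetColumn:
    """Hypothetical stand-in for a Parquet column descriptor."""
    name: str
    physical_type: str           # e.g. "BINARY"
    logical_type: Optional[str]  # e.g. "STRING", or None if the writer omitted it


def read_by_annotation(col: ParquetColumn, raw: bytes) -> Union[str, bytes]:
    """Decode only when the file itself annotates the column as a string."""
    if col.physical_type == "BINARY" and col.logical_type == "STRING":
        return raw.decode("utf-8")
    return raw


def read_by_expected_type(
    col: ParquetColumn, expected_type: str, raw: bytes
) -> Union[str, bytes]:
    """Decode whenever the table schema expects a string, annotated or not."""
    if col.physical_type == "BINARY" and expected_type == "string":
        return raw.decode("utf-8")
    return raw


# A column as an older writer might produce it: BINARY with no annotation.
legacy = ParquetColumn("city", "BINARY", None)
raw_value = "Budapest".encode("utf-8")

annotation_result = read_by_annotation(legacy, raw_value)           # stays bytes
schema_result = read_by_expected_type(legacy, "string", raw_value)  # becomes str
```

With the unannotated column, the annotation-driven path leaves the value as bytes, while the schema-driven path returns a string. A column that the table schema declares as binary is left untouched by both.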


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ConeyLiu commented on pull request #4841: Parquet: Fix Iceberg's parquet reader returning binary data instead of string data

Posted by GitBox <gi...@apache.org>.
ConeyLiu commented on PR #4841:
URL: https://github.com/apache/iceberg/pull/4841#issuecomment-1134672985

   LGTM too.


[GitHub] [iceberg] lcspinter commented on pull request #4841: Parquet: Fix Iceberg's parquet reader returning binary data instead of string data

Posted by GitBox <gi...@apache.org>.
lcspinter commented on PR #4841:
URL: https://github.com/apache/iceberg/pull/4841#issuecomment-1138182806

   @kbendick I might be able to generate some binary data files with an older version of Impala and write some tests on reading those. Though, I'm not sure if it's a good practice to check in a data set generated by a 3rd party tool, rather than generating it in place. I would like to hear your opinion. Thanks


[GitHub] [iceberg] kbendick commented on pull request #4841: Parquet: Fix Iceberg's parquet reader returning binary data instead of string data

Posted by GitBox <gi...@apache.org>.
kbendick commented on PR #4841:
URL: https://github.com/apache/iceberg/pull/4841#issuecomment-1139926598

   All that said, I do wonder if we need to account for certain Spark configs, etc. Respecting the type as given in the schema is likely the best way to go though =)


[GitHub] [iceberg] kbendick commented on pull request #4841: Parquet: Fix Iceberg's parquet reader returning binary data instead of string data

Posted by GitBox <gi...@apache.org>.
kbendick commented on PR #4841:
URL: https://github.com/apache/iceberg/pull/4841#issuecomment-1139921740

   Hey @lcspinter. Sorry for the delayed response.
   
   I agree that checking in files is generally something to avoid, though if absolutely need be (especially for interfacing with systems that we don't support directly within this repository) it might be OK.
   
   If you can generate a file with Impala, we could potentially compare its output with data generated by Spark using a configuration value such as `spark.sql.parquet.binaryAsString`, or something else. I can take a look this weekend.


[GitHub] [iceberg] rdblue commented on pull request #4841: Parquet: Fix Iceberg's parquet reader returning binary data instead of string data

Posted by GitBox <gi...@apache.org>.
rdblue commented on PR #4841:
URL: https://github.com/apache/iceberg/pull/4841#issuecomment-1140525944

   Thanks, @lcspinter!


[GitHub] [iceberg] lcspinter commented on pull request #4841: Parquet: Fix Iceberg's parquet reader returning binary data instead of string data

Posted by GitBox <gi...@apache.org>.
lcspinter commented on PR #4841:
URL: https://github.com/apache/iceberg/pull/4841#issuecomment-1134433310

   cc: @pvary @szlta 


[GitHub] [iceberg] kbendick commented on pull request #4841: Parquet: Fix Iceberg's parquet reader returning binary data instead of string data

Posted by GitBox <gi...@apache.org>.
kbendick commented on PR #4841:
URL: https://github.com/apache/iceberg/pull/4841#issuecomment-1135084091

   If you wanted to add a test for this, we might be able to generate data using Spark with some combination of `spark.sql.parquet.binaryAsString` and/or `spark.sql.parquet.writeLegacyFormat`.


[GitHub] [iceberg] rdblue merged pull request #4841: Parquet: Fix Iceberg's parquet reader returning binary data instead of string data

Posted by GitBox <gi...@apache.org>.
rdblue merged PR #4841:
URL: https://github.com/apache/iceberg/pull/4841


[GitHub] [iceberg] lcspinter commented on pull request #4841: Parquet: Fix Iceberg's parquet reader returning binary data instead of string data

Posted by GitBox <gi...@apache.org>.
lcspinter commented on PR #4841:
URL: https://github.com/apache/iceberg/pull/4841#issuecomment-1141836024

   Thanks, @kbendick @pvary  @rdblue @szlta for the review!


[GitHub] [iceberg] lcspinter commented on pull request #4841: Parquet: Fix Iceberg's parquet reader returning binary data instead of string data

Posted by GitBox <gi...@apache.org>.
lcspinter commented on PR #4841:
URL: https://github.com/apache/iceberg/pull/4841#issuecomment-1136149541

   > If you wanted to add a test for this, we might be able to generate data using Spark with some combination of `spark.sql.parquet.binaryAsString` and/or `spark.sql.parquet.writeLegacyFormat`.
   
   @kbendick Yes, it would be great if we could add some tests as well. Could you please walk me through the data generation process? I'm not familiar with the Spark APIs. Shall we do it in a separate PR?

