Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/06/18 13:26:08 UTC

[GitHub] [arrow] chairmank commented on issue #3491: parquet lz4 interop with spark appears broken

chairmank commented on issue #3491:
URL: https://github.com/apache/arrow/issues/3491#issuecomment-646015745


   I believe that [PARQUET-1241](https://issues.apache.org/jira/browse/PARQUET-1241) ("[C++] Use LZ4 frame format") does not directly address the issue that was reported here, although there is relevant discussion in the comments (like [this](https://issues.apache.org/jira/browse/PARQUET-1241?focusedCommentId=16574328#comment-16574328) and [this](https://issues.apache.org/jira/browse/PARQUET-1241?focusedCommentId=16585288#comment-16585288)).
   
   The stack trace in the bug report shows an exception thrown by the [Spark](https://github.com/apache/spark) class `org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader`, which uses the [parquet-mr](https://github.com/apache/parquet-mr) class `org.apache.parquet.hadoop.ParquetFileReader`, which uses the [Hadoop](https://github.com/apache/hadoop) `org.apache.hadoop.io.compress.Lz4Codec` class.
   
   As discussed in [HADOOP-12990](https://issues.apache.org/jira/browse/HADOOP-12990), the Hadoop `Lz4Codec` uses the lz4 block format, and it prepends 8 extra bytes (two big-endian 32-bit integers: the uncompressed length, then the compressed length) before the compressed data. I believe that the lz4 implementation used by `pyarrow.parquet` also uses the lz4 block format, but it does not prepend these 8 extra bytes. Reconciling this incompatibility therefore does not require implementing the lz4 framed format; it only requires handling the 8-byte Hadoop prefix.
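   To illustrate, here is a minimal sketch of the 8-byte prefix that Hadoop's `Lz4Codec` writes, assuming the layout described in HADOOP-12990 (big-endian uncompressed length followed by big-endian compressed length). The function names are hypothetical, and the actual fix would of course live inside the codec implementations rather than in helper functions like these:

   ```python
   import struct

   def add_hadoop_prefix(compressed: bytes, uncompressed_len: int) -> bytes:
       """Wrap a raw lz4 block the way Hadoop's Lz4Codec does: prepend
       two big-endian 32-bit integers (uncompressed length, compressed
       length) before the compressed bytes."""
       return struct.pack(">II", uncompressed_len, len(compressed)) + compressed

   def strip_hadoop_prefix(data: bytes) -> tuple[bytes, int]:
       """Undo the Hadoop framing: parse the two length fields and
       return (raw lz4 block, uncompressed length)."""
       uncompressed_len, compressed_len = struct.unpack(">II", data[:8])
       return data[8:8 + compressed_len], uncompressed_len
   ```

   In other words, a reader that expects a bare lz4 block can skip the first 8 bytes of a Hadoop-written block, and a writer can prepend them, without either side needing the lz4 frame format.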
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org