Posted to dev@arrow.apache.org by "Vladyslav Shamaida (JIRA)" <ji...@apache.org> on 2019/07/29 08:53:00 UTC

[jira] [Created] (ARROW-6057) Parquet files v2.0 created by spark can't be read by pyarrow

Vladyslav Shamaida created ARROW-6057:
-----------------------------------------

             Summary: Parquet files v2.0 created by spark can't be read by pyarrow
                 Key: ARROW-6057
                 URL: https://issues.apache.org/jira/browse/ARROW-6057
             Project: Apache Arrow
          Issue Type: Bug
            Reporter: Vladyslav Shamaida


PyArrow uses the footer metadata to determine the format version of a Parquet file, while the parquet-mr library (which Spark uses) determines the version at the page level from the page header type. Moreover, in ParquetFileWriter parquet-mr hardcodes the footer version to '1'. See: [https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileWriter.java#L913]
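
For illustration, a minimal sketch of how to inspect the footer-level version that pyarrow relies on (the file name is hypothetical; it assumes a local file written by Spark with the 2.0 page format):

{code:python}
import pyarrow.parquet as pq

# Hypothetical local file produced by Spark/parquet-mr.
pf = pq.ParquetFile("spark_output.parquet")

# pyarrow reports the version stored in the footer metadata; for files
# written by parquet-mr this prints "1.0" even when the data pages are
# actually DataPageV2.
print(pf.metadata.format_version)
{code}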

Thus, Spark can read the files it writes, and pyarrow can read the files it writes, but when pyarrow tries to read a version 2.0 file written by Spark, it fails with an error about a malformed file (because it assumes the format version is 1.0).
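
A minimal reproduction sketch, assuming a local PySpark session and a hypothetical output path; parquet.writer.version is the parquet-mr Hadoop configuration key that switches on the 2.0 page format, and setting it through the internal _jsc handle is one common (if unofficial) way to do so from PySpark:

{code:python}
from pyspark.sql import SparkSession
import pyarrow.parquet as pq

spark = SparkSession.builder.master("local[1]").getOrCreate()

# Ask parquet-mr to write 2.0 data pages; the footer version stays "1".
spark.sparkContext._jsc.hadoopConfiguration().set(
    "parquet.writer.version", "v2")

spark.range(1000).write.parquet("/tmp/spark_v2_parquet")

# pyarrow sees format_version "1.0" in the footer and decodes the
# DataPageV2 pages as v1 pages, failing with a compression- or
# encoding-related error like the ones listed below.
pq.read_table("/tmp/spark_v2_parquet")
{code}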

Depending on the compression method, the error is one of:

- _Corrupt snappy compressed data_

- _GZipCodec failed: incorrect header check_

- _ArrowIOError: Unknown encoding type_
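
By contrast, a file that pyarrow itself writes with version="2.0" records that version in the footer and reads back cleanly. A sketch with a hypothetical path:

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"id": list(range(1000))})

# pyarrow records "2.0" in the footer, so its reader picks the matching
# page decoding and the round trip succeeds.
pq.write_table(table, "/tmp/pyarrow_v2.parquet", version="2.0")
print(pq.read_table("/tmp/pyarrow_v2.parquet").num_rows)
{code}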



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)