Posted to jira@arrow.apache.org by "Wes McKinney (Jira)" <ji...@apache.org> on 2021/02/20 03:56:00 UTC

[jira] [Closed] (ARROW-6057) [Python] Parquet files v2.0 created by spark can't be read by pyarrow

     [ https://issues.apache.org/jira/browse/ARROW-6057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney closed ARROW-6057.
-------------------------------
    Resolution: Cannot Reproduce

Can't reproduce. If you can provide instructions to reproduce, someone can take a look.

> [Python] Parquet files v2.0 created by spark can't be read by pyarrow
> ---------------------------------------------------------------------
>
>                 Key: ARROW-6057
>                 URL: https://issues.apache.org/jira/browse/ARROW-6057
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++
>    Affects Versions: 0.14.1
>            Reporter: Vladyslav Shamaida
>            Priority: Major
>              Labels: parquet
>
> PyArrow uses the footer metadata to determine the format version of a Parquet file, while the parquet-mr library (which Spark uses) determines the version at the page level, from the page header type. Moreover, parquet-mr's ParquetFileWriter hardcodes the version in the footer to '1'. See: [https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileWriter.java#L913]
> Thus Spark can write and read its own files, and pyarrow can write and read its own files, but when pyarrow tries to read a version 2.0 file written by Spark, it throws an error about a malformed file (because it assumes the format version is 1.0).
> Depending on the compression method, the error is one of:
> - _Corrupt snappy compressed data_
> - _GZipCodec failed: incorrect header check_
> - _ArrowIOError: Unknown encoding type_
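
To check which version a given file's footer actually declares, the footer can be inspected directly. The following is a minimal sketch using only the Python standard library, not the code path pyarrow or parquet-mr uses. It assumes an unencrypted file (PAR1 magic at both ends), a footer serialized with the Thrift compact protocol, and the i32 `version` field written first, as the Parquet format describes. The names footer_version and fake_parquet are hypothetical, and fake_parquet builds only a stub footer for demonstration, not a readable Parquet file.

```python
import struct

MAGIC = b"PAR1"


def read_zigzag_varint(buf, pos):
    """Decode one Thrift compact-protocol zigzag varint starting at buf[pos]."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            break
        shift += 7
    # Undo zigzag encoding: 0 -> 0, 1 -> -1, 2 -> 1, 3 -> -2, 4 -> 2, ...
    return (result >> 1) ^ -(result & 1), pos


def footer_version(data):
    """Return the integer `version` stored in the footer of Parquet bytes.

    Assumption: the footer starts with field id 1 (version) of type i32,
    which in the compact protocol is the single header byte 0x15.
    """
    if len(data) < 12 or data[:4] != MAGIC or data[-4:] != MAGIC:
        raise ValueError("not an unencrypted Parquet file")
    # The last 8 bytes are: 4-byte little-endian footer length, then the magic.
    (footer_len,) = struct.unpack("<I", data[-8:-4])
    if footer_len > len(data) - 12:
        raise ValueError("footer length exceeds file size")
    footer = data[-8 - footer_len:-8]
    if not footer or footer[0] != 0x15:
        raise ValueError("footer does not start with the version field")
    version, _ = read_zigzag_varint(footer, 1)
    return version


def fake_parquet(version):
    """Hypothetical helper: a stub file whose footer holds only `version`.

    Only valid for small non-negative versions (single-byte varint).
    """
    zigzag = version << 1
    footer = bytes([0x15, zigzag, 0x00])  # version field, then struct stop byte
    return MAGIC + footer + struct.pack("<I", len(footer)) + MAGIC


if __name__ == "__main__":
    print(footer_version(fake_parquet(1)))
    print(footer_version(fake_parquet(2)))
```

For a real file, pass the file's bytes (for example `open(path, "rb").read()`); for large files it would be better to seek to the end and read only the footer. A file whose footer reports 1 while its pages use the v2 page header type would match the mismatch described in the report above.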



--
This message was sent by Atlassian Jira
(v8.3.4#803005)