Posted to issues@arrow.apache.org by "Wes McKinney (Jira)" <ji...@apache.org> on 2019/08/19 23:55:00 UTC

[jira] [Commented] (ARROW-6057) [Python] Parquet files v2.0 created by Spark can't be read by pyarrow

    [ https://issues.apache.org/jira/browse/ARROW-6057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16910852#comment-16910852 ] 

Wes McKinney commented on ARROW-6057:
-------------------------------------

According to discussions on the Parquet mailing list, Spark should not be creating V2 files, as the V2 format is not considered production-ready.

> [Python] Parquet files v2.0 created by Spark can't be read by pyarrow
> ---------------------------------------------------------------------
>
>                 Key: ARROW-6057
>                 URL: https://issues.apache.org/jira/browse/ARROW-6057
>             Project: Apache Arrow
>          Issue Type: Bug
>    Affects Versions: 0.14.1
>            Reporter: Vladyslav Shamaida
>            Priority: Major
>
> PyArrow uses the footer metadata to determine the format version of a Parquet file, while the parquet-mr library (which Spark uses) determines the version at the page level from the page header type. Moreover, parquet-mr's ParquetFileWriter hardcodes the version in the footer to '1'. See: [https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileWriter.java#L913]
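> A minimal sketch of the footer check with pyarrow (the file path is illustrative):
> {code:python}
> import pyarrow.parquet as pq
>
> # parquet-mr hardcodes version=1 in the footer, so even a file whose
> # data pages are V2 reports "1.0" here
> md = pq.ParquetFile("/tmp/spark_v2/part-00000.parquet").metadata
> print(md.format_version)  # -> "1.0"
> {code}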
> Thus, Spark can read the files it writes and pyarrow can read the files it writes, but when pyarrow tries to read a version 2.0 file written by Spark, it raises an error about a malformed file (because it assumes the format version is 1.0).
> Depending on the compression method, the error is one of:
> - _Corrupt snappy compressed data_
> - _GZipCodec failed: incorrect header check_
> - _ArrowIOError: Unknown encoding type_
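> A repro sketch, assuming Spark propagates the parquet-mr write option parquet.writer.version=v2 to the underlying Hadoop configuration (paths are illustrative):
> {code:python}
> from pyspark.sql import SparkSession
> import pyarrow.parquet as pq
>
> spark = SparkSession.builder.getOrCreate()
>
> # Ask parquet-mr to write format version 2 data pages
> spark.range(1000).write \
>     .option("parquet.writer.version", "v2") \
>     .parquet("/tmp/spark_v2")
>
> # pyarrow trusts the footer, which claims version 1, and fails while
> # decoding the V2 data pages, e.g. "Corrupt snappy compressed data"
> pq.read_table("/tmp/spark_v2")
> {code}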



--
This message was sent by Atlassian Jira
(v8.3.2#803003)