Posted to issues@arrow.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2018/04/09 15:03:00 UTC

[jira] [Updated] (ARROW-2369) Large (>~20 GB) files written to Parquet via PyArrow are corrupted

     [ https://issues.apache.org/jira/browse/ARROW-2369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated ARROW-2369:
----------------------------------
    Labels: Parquet bug pandas parquetWriter pull-request-available pyarrow  (was: Parquet bug pandas parquetWriter pyarrow)

> Large (>~20 GB) files written to Parquet via PyArrow are corrupted
> ------------------------------------------------------------------
>
>                 Key: ARROW-2369
>                 URL: https://issues.apache.org/jira/browse/ARROW-2369
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.9.0
>         Environment: Reproduced on Ubuntu + Mac OSX
>            Reporter: Justin Tan
>            Assignee: Antoine Pitrou
>            Priority: Major
>              Labels: Parquet, bug, pandas, parquetWriter, pull-request-available, pyarrow
>             Fix For: 0.10.0
>
>         Attachments: Screen Shot 2018-03-30 at 11.54.01 pm.png
>
>
> When writing large Parquet files (above 10 GB or so) from pandas via the command
> {{pq.write_table(my_df, 'table.parquet')}}
> the write succeeds, but when the Parquet file is loaded, the error
> {{ArrowIOError: Invalid parquet file. Corrupt footer.}}
> appears. The same error occurs when the Parquet file is written chunkwise. When the Parquet files are small, say < 5 GB or so (drawn randomly from the same dataset), everything proceeds as normal. I've also tried this with pandas {{df.to_parquet()}}, with the same results. A minimal reproduction sketch follows below the quoted description.
> Update: Looks like any DataFrame above ~5 GB (on disk) produces the same error.
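
For reference, a minimal sketch of the write/read path described in the report, covering both the single-shot and chunkwise cases. The DataFrame name, contents, and chunk size are illustrative, not taken from the issue; pq.write_table() expects a pyarrow Table, so the DataFrame is converted with pa.Table.from_pandas() first, and the chunkwise variant uses pq.ParquetWriter.

    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    # Illustrative frame; the report concerns data whose Parquet output
    # exceeds roughly 5-20 GB on disk.
    my_df = pd.DataFrame({"x": range(1000000)})

    # Single-shot write: pq.write_table() takes a pyarrow Table,
    # so convert the DataFrame first.
    table = pa.Table.from_pandas(my_df, preserve_index=False)
    pq.write_table(table, 'table.parquet')

    # Chunkwise write, as also mentioned in the report (chunk size is arbitrary).
    writer = pq.ParquetWriter('table_chunked.parquet', table.schema)
    for start in range(0, len(my_df), 100000):
        chunk = my_df[start:start + 100000]
        writer.write_table(pa.Table.from_pandas(chunk, preserve_index=False))
    writer.close()

    # Reading back is where the reported failure surfaces on large files:
    # ArrowIOError: Invalid parquet file. Corrupt footer.
    pq.read_table('table.parquet')

At the sizes in the report the reader fails on the footer even though the write appeared to succeed; with small inputs both files read back normally.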



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)