Posted to issues@arrow.apache.org by "Justin Tan (JIRA)" <ji...@apache.org> on 2018/03/30 13:01:00 UTC
[jira] [Updated] (ARROW-2369) Large (>~20 GB) files written to Parquet via PyArrow are corrupted
[ https://issues.apache.org/jira/browse/ARROW-2369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Justin Tan updated ARROW-2369:
------------------------------
Description:
When writing large Parquet files (above 20 GB or so) from pandas via the command
{{pq.write_table(my_df, 'table.parquet')}}
the write succeeds, but when the Parquet file is loaded, the error message
{{ArrowIOError: Invalid parquet file. Corrupt footer.}}
appears. The same error occurs when the Parquet file is written in chunks. When the Parquet files are small, say under 10 GB (drawn randomly from the same dataset), everything proceeds as normal. Could this be related to memory constraints?
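For reference, a minimal sketch of the workflow described above (the toy DataFrame is illustrative and far too small to trigger the bug; {{pq.write_table}} expects an Arrow Table, so the DataFrame is converted first):
{code:python}
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Stand-in for the real data; the reported failure needs >~20 GB of output.
my_df = pd.DataFrame({'x': np.random.randn(1000)})

table = pa.Table.from_pandas(my_df)      # convert pandas DataFrame to Arrow Table
pq.write_table(table, 'table.parquet')   # write completes without error

# Reading the file back is where the failure surfaces for large files:
# ArrowIOError: Invalid parquet file. Corrupt footer.
pq.read_table('table.parquet')

# Writing the same data chunkwise via ParquetWriter reportedly fails the same way:
writer = pq.ParquetWriter('chunked.parquet', table.schema)
for batch in table.to_batches():
    writer.write_table(pa.Table.from_batches([batch]))
writer.close()
pq.read_table('chunked.parquet')
{code}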
Details:
Arrow v0.9.0
Reproduced on Ubuntu and Mac OS X
was:
When writing large Parquet files (above 20 GB or so) from pandas via the command
{{pq.write_table(my_df, 'table.parquet')}}
the write succeeds, but when the Parquet file is loaded, the error message
{{ArrowIOError: Invalid parquet file. Corrupt footer.}}
appears. The same error occurs when the Parquet file is written in chunks. When the Parquet files are small, say under 10 GB (drawn randomly from the same dataset), everything proceeds as normal.
Details:
Arrow v0.9.0
Reproduced on Ubuntu and Mac OS X
> Large (>~20 GB) files written to Parquet via PyArrow are corrupted
> ------------------------------------------------------------------
>
> Key: ARROW-2369
> URL: https://issues.apache.org/jira/browse/ARROW-2369
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.9.0
> Environment: Reproduced on Ubuntu + Mac OS X
> Reporter: Justin Tan
> Priority: Major
> Labels: Parquet, bug, pandas, parquetWriter, pyarrow
> Fix For: 0.9.0
>
> Attachments: Screen Shot 2018-03-30 at 11.54.01 pm.png
>
>
> When writing large Parquet files (above 20 GB or so) from pandas via the command
> {{pq.write_table(my_df, 'table.parquet')}}
> the write succeeds, but when the Parquet file is loaded, the error message
> {{ArrowIOError: Invalid parquet file. Corrupt footer.}}
> appears. The same error occurs when the Parquet file is written in chunks. When the Parquet files are small, say under 10 GB (drawn randomly from the same dataset), everything proceeds as normal. Could this be related to memory constraints?
> Details:
> Arrow v0.9.0
> Reproduced on Ubuntu and Mac OS X
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)