Posted to issues@arrow.apache.org by "Wes McKinney (JIRA)" <ji...@apache.org> on 2018/12/11 19:45:00 UTC
[jira] [Closed] (ARROW-3999) [Python] Can't read large file that pyarrow wrote
[ https://issues.apache.org/jira/browse/ARROW-3999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Wes McKinney closed ARROW-3999.
-------------------------------
Resolution: Duplicate
Duplicate of ARROW-3762
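
For anyone hitting this before picking up the chunked-read fix tracked in ARROW-3762, one workaround is to avoid materializing a column as a single BinaryArray: write with a bounded row group size and read the file back one row group at a time. The sketch below is illustrative only (the file name, column names, and sizes are made up, not from the report) and assumes pyarrow's standard {{ParquetFile}} API:

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

# Tiny stand-in for the 32M-row frame in the report (names are
# illustrative). Bounding row_group_size keeps each row group's
# string data well under the 2**31 - 2 byte BinaryArray limit.
table = pa.Table.from_arrays(
    [pa.array(['http://example.com'] * 10), pa.array([True] * 10)],
    names=['Url Source', 'Link No-Follow'])
pq.write_table(table, 'export_demo.parq', row_group_size=4)

# Read one row group at a time instead of the whole file at once,
# then stitch the pieces back together as a chunked table.
pf = pq.ParquetFile('export_demo.parq')
pieces = [pf.read_row_group(i) for i in range(pf.num_row_groups)]
restored = pa.concat_tables(pieces)
print(restored.num_rows)
{code}

Because {{concat_tables}} produces chunked columns, no single contiguous buffer ever has to hold the full column's data.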
> [Python] Can't read large file that pyarrow wrote
> -------------------------------------------------
>
> Key: ARROW-3999
> URL: https://issues.apache.org/jira/browse/ARROW-3999
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.11.1
> Environment: OS: OSX High Sierra 10.13.6
> Python: 3.7.0
> PyArrow: 0.11.1
> Pandas: 0.23.4
> Reporter: Diego Argueta
> Priority: Major
>
> I loaded a large Pandas DataFrame from a CSV and successfully wrote it to a Parquet file using the DataFrame's {{to_parquet}} method. However, reading that same file back results in an exception. The DataFrame consists of about 32 million rows with seven columns; four are ASCII text and three are booleans.
>
> {code:java}
> >>> source_df.shape
> (32070402, 7)
> >>> source_df.dtypes
> Url Source            object
> Url Destination       object
> Anchor text           object
> Follow / No-Follow    object
> Link No-Follow          bool
> Meta No-Follow          bool
> Robot No-Follow         bool
> dtype: object
> >>> source_df.to_parquet('export.parq', compression='gzip', use_deprecated_int96_timestamps=True)
> >>> loaded_df = pd.read_parquet('export.parq')
> Traceback (most recent call last):
> File "<stdin>", line 1, in <module>
> File "/Users/tux/.pyenv/versions/3.7.0/lib/python3.7/site-packages/pandas/io/parquet.py", line 288, in read_parquet
> return impl.read(path, columns=columns, **kwargs)
> File "/Users/tux/.pyenv/versions/3.7.0/lib/python3.7/site-packages/pandas/io/parquet.py", line 131, in read
> **kwargs).to_pandas()
> File "/Users/tux/.pyenv/versions/3.7.0/lib/python3.7/site-packages/pyarrow/parquet.py", line 1074, in read_table
> use_pandas_metadata=use_pandas_metadata)
> File "/Users/tux/.pyenv/versions/3.7.0/lib/python3.7/site-packages/pyarrow/filesystem.py", line 184, in read_parquet
> use_pandas_metadata=use_pandas_metadata)
> File "/Users/tux/.pyenv/versions/3.7.0/lib/python3.7/site-packages/pyarrow/parquet.py", line 943, in read
> use_pandas_metadata=use_pandas_metadata)
> File "/Users/tux/.pyenv/versions/3.7.0/lib/python3.7/site-packages/pyarrow/parquet.py", line 500, in read
> table = reader.read(**options)
> File "/Users/tux/.pyenv/versions/3.7.0/lib/python3.7/site-packages/pyarrow/parquet.py", line 187, in read
> use_threads=use_threads)
> File "pyarrow/_parquet.pyx", line 721, in pyarrow._parquet.ParquetReader.read_all
> File "pyarrow/error.pxi", line 83, in pyarrow.lib.check_status
> pyarrow.lib.ArrowIOError: Arrow error: Capacity error: BinaryArray cannot contain more than 2147483646 bytes, have 2147483685
> Arrow error: Capacity error: BinaryArray cannot contain more than 2147483646 bytes, have 2147483685
> {code}
>
> One would expect that if PyArrow can write a file successfully, it can read it back as well. Fortunately the {{fastparquet}} library has no problem reading this file, so we didn't lose any data, but the roundtripping problem was a bit of a surprise.
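
The numbers in the error message line up with Arrow's use of 32-bit offsets for BinaryArray: the capacity ceiling is 2**31 - 2 bytes, exactly the 2147483646 in the traceback, and the column's string data overshoots it by only 39 bytes:

{code:python}
# BinaryArray addresses its value buffer with 32-bit offsets, so its
# payload is capped at 2**31 - 2 bytes -- the figure in the error.
limit = 2**31 - 2
have = 2147483685  # size reported in the traceback
print(limit)         # 2147483646
print(have - limit)  # overshoot in bytes: 39
{code}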
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)