Posted to issues@arrow.apache.org by "Wes McKinney (JIRA)" <ji...@apache.org> on 2018/12/11 19:45:00 UTC
[jira] [Closed] (ARROW-3999) [Python] Can't read large file that pyarrow wrote
[ https://issues.apache.org/jira/browse/ARROW-3999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Wes McKinney closed ARROW-3999.
-------------------------------
Resolution: Duplicate
Duplicate of ARROW-3762
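
For anyone hitting this before picking up the chunked-read fix tracked in ARROW-3762, one workaround is to avoid materializing a column as a single BinaryArray: write with a bounded row group size and read the file back one row group at a time. The sketch below is illustrative only (the file name, column names, and sizes are made up, not from the report) and assumes pyarrow's standard {{ParquetFile}} API:

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

# Tiny stand-in for the 32M-row frame in the report (names are
# illustrative). Bounding row_group_size keeps each row group's
# string data well under the 2**31 - 2 byte BinaryArray limit.
table = pa.Table.from_arrays(
    [pa.array(['http://example.com'] * 10), pa.array([True] * 10)],
    names=['Url Source', 'Link No-Follow'])
pq.write_table(table, 'export_demo.parq', row_group_size=4)

# Read one row group at a time instead of the whole file at once,
# then stitch the pieces back together as a chunked table.
pf = pq.ParquetFile('export_demo.parq')
pieces = [pf.read_row_group(i) for i in range(pf.num_row_groups)]
restored = pa.concat_tables(pieces)
print(restored.num_rows)
{code}

Because {{concat_tables}} produces chunked columns, no single contiguous buffer ever has to hold the full column's data.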
> [Python] Can't read large file that pyarrow wrote
> -------------------------------------------------
>
> Key: ARROW-3999
> URL: https://issues.apache.org/jira/browse/ARROW-3999
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.11.1
> Environment: OS: OSX High Sierra 10.13.6
> Python: 3.7.0
> PyArrow: 0.11.1
> Pandas: 0.23.4
> Reporter: Diego Argueta
> Priority: Major
>
> I loaded a large Pandas DataFrame from a CSV and successfully wrote it to a Parquet file using the DataFrame's {{to_parquet}} method. However, reading that same file back results in an exception. The DataFrame consists of about 32 million rows with seven columns; four are ASCII text and three are booleans.
>
> {code:java}
> >>> source_df.shape
> (32070402, 7)
> >>> source_df.dtypes
> Url Source            object
> Url Destination       object
> Anchor text           object
> Follow / No-Follow    object
> Link No-Follow          bool
> Meta No-Follow          bool
> Robot No-Follow         bool
> dtype: object
> >>> source_df.to_parquet('export.parq', compression='gzip', use_deprecated_int96_timestamps=True)
> >>> loaded_df = pd.read_parquet('export.parq')
> Traceback (most recent call last):
> File "<stdin>", line 1, in <module>
> File "/Users/tux/.pyenv/versions/3.7.0/lib/python3.7/site-packages/pandas/io/parquet.py", line 288, in read_parquet
> return impl.read(path, columns=columns, **kwargs)
> File "/Users/tux/.pyenv/versions/3.7.0/lib/python3.7/site-packages/pandas/io/parquet.py", line 131, in read
> **kwargs).to_pandas()
> File "/Users/tux/.pyenv/versions/3.7.0/lib/python3.7/site-packages/pyarrow/parquet.py", line 1074, in read_table
> use_pandas_metadata=use_pandas_metadata)
> File "/Users/tux/.pyenv/versions/3.7.0/lib/python3.7/site-packages/pyarrow/filesystem.py", line 184, in read_parquet
> use_pandas_metadata=use_pandas_metadata)
> File "/Users/tux/.pyenv/versions/3.7.0/lib/python3.7/site-packages/pyarrow/parquet.py", line 943, in read
> use_pandas_metadata=use_pandas_metadata)
> File "/Users/tux/.pyenv/versions/3.7.0/lib/python3.7/site-packages/pyarrow/parquet.py", line 500, in read
> table = reader.read(**options)
> File "/Users/tux/.pyenv/versions/3.7.0/lib/python3.7/site-packages/pyarrow/parquet.py", line 187, in read
> use_threads=use_threads)
> File "pyarrow/_parquet.pyx", line 721, in pyarrow._parquet.ParquetReader.read_all
> File "pyarrow/error.pxi", line 83, in pyarrow.lib.check_status
> pyarrow.lib.ArrowIOError: Arrow error: Capacity error: BinaryArray cannot contain more than 2147483646 bytes, have 2147483685
> Arrow error: Capacity error: BinaryArray cannot contain more than 2147483646 bytes, have 2147483685
> {code}
>
> One would expect that if PyArrow can write a file successfully, it can read it back as well. Fortunately the {{fastparquet}} library has no problem reading this file, so we didn't lose any data, but the roundtripping problem was a bit of a surprise.
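
The numbers in the error message line up with Arrow's use of 32-bit offsets for BinaryArray: the capacity ceiling is 2**31 - 2 bytes, exactly the 2147483646 in the traceback, and the column's string data overshoots it by only 39 bytes:

{code:python}
# BinaryArray addresses its value buffer with 32-bit offsets, so its
# payload is capped at 2**31 - 2 bytes -- the figure in the error.
limit = 2**31 - 2
have = 2147483685  # size reported in the traceback
print(limit)         # 2147483646
print(have - limit)  # overshoot in bytes: 39
{code}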
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)