Posted to dev@arrow.apache.org by "Diego Argueta (JIRA)" <ji...@apache.org> on 2018/12/11 19:30:00 UTC
[jira] [Created] (ARROW-3999) [Python] Can't read large file that pyarrow wrote
Diego Argueta created ARROW-3999:
------------------------------------
Summary: [Python] Can't read large file that pyarrow wrote
Key: ARROW-3999
URL: https://issues.apache.org/jira/browse/ARROW-3999
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 0.11.1
Environment: OS: OSX High Sierra 10.13.6
Python: 3.7.0
PyArrow: 0.11.1
Pandas: 0.23.4
Reporter: Diego Argueta
I loaded a large Pandas DataFrame from a CSV and successfully wrote it to a Parquet file using the DataFrame's {{to_parquet}} method. However, reading that same file back results in an exception:
{code:python}
>>> source_df.shape
(32070402, 7)
>>> source_df.dtypes
Url Source object
Url Destination object
Anchor text object
Follow / No-Follow object
Link No-Follow bool
Meta No-Follow bool
Robot No-Follow bool
dtype: object
>>> source_df.to_parquet('export.parq', compression='gzip',
use_deprecated_int96_timestamps=True)
>>> loaded_df = pd.read_parquet('export.parq')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/tux/.pyenv/versions/3.7.0/lib/python3.7/site-packages/pandas/io/parquet.py", line 288, in read_parquet
return impl.read(path, columns=columns, **kwargs)
File "/Users/tux/.pyenv/versions/3.7.0/lib/python3.7/site-packages/pandas/io/parquet.py", line 131, in read
**kwargs).to_pandas()
File "/Users/tux/.pyenv/versions/3.7.0/lib/python3.7/site-packages/pyarrow/parquet.py", line 1074, in read_table
use_pandas_metadata=use_pandas_metadata)
File "/Users/tux/.pyenv/versions/3.7.0/lib/python3.7/site-packages/pyarrow/filesystem.py", line 184, in read_parquet
use_pandas_metadata=use_pandas_metadata)
File "/Users/tux/.pyenv/versions/3.7.0/lib/python3.7/site-packages/pyarrow/parquet.py", line 943, in read
use_pandas_metadata=use_pandas_metadata)
File "/Users/tux/.pyenv/versions/3.7.0/lib/python3.7/site-packages/pyarrow/parquet.py", line 500, in read
table = reader.read(**options)
File "/Users/tux/.pyenv/versions/3.7.0/lib/python3.7/site-packages/pyarrow/parquet.py", line 187, in read
use_threads=use_threads)
File "pyarrow/_parquet.pyx", line 721, in pyarrow._parquet.ParquetReader.read_all
File "pyarrow/error.pxi", line 83, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: Arrow error: Capacity error: BinaryArray cannot contain more than 2147483646 bytes, have 2147483685
Arrow error: Capacity error: BinaryArray cannot contain more than 2147483646 bytes, have 2147483685
{code}
One would expect that if PyArrow can write a file successfully, it can also read it back. The failing column chunk is only 39 bytes over the limit: a BinaryArray stores its offsets as signed 32-bit integers, so a single array cannot exceed 2147483646 (2^31 - 2) bytes. Fortunately the {{fastparquet}} library reads this file without trouble, so no data was lost, but the failed round trip was a surprise.
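Not part of the original report, but a possible workaround sketch: write the file with an explicit {{row_group_size}} so no single column chunk approaches the 2 GiB BinaryArray limit, then read it back one row group at a time instead of materializing the whole table at once. This assumes pandas forwards {{row_group_size}} through {{to_parquet}} to pyarrow's writer; the tiny stand-in frame below exists only to keep the sketch self-contained (the real frame has 32,070,402 rows).

{code:python}
import os
import tempfile

import pandas as pd
import pyarrow.parquet as pq

# Small stand-in for the 32M-row frame from the report.
df = pd.DataFrame({'Url Source': ['a', 'b', 'c', 'd'],
                   'Link No-Follow': [True, False, True, False]})

path = os.path.join(tempfile.mkdtemp(), 'export.parq')

# An explicit (here deliberately tiny) row_group_size keeps each
# column chunk well under the 2^31 - 2 byte BinaryArray ceiling.
df.to_parquet(path, compression='gzip', row_group_size=2)

# Read back one row group at a time rather than the whole file.
pf = pq.ParquetFile(path)
pieces = [pf.read_row_group(i).to_pandas()
          for i in range(pf.num_row_groups)]
loaded_df = pd.concat(pieces, ignore_index=True)
{code}

For a file that was already written as one giant row group this won't help, since the oversized chunk still has to be decoded in one piece; in that case rewriting the file (e.g. via fastparquet) with smaller row groups is the way out.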
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)