Posted to dev@arrow.apache.org by "Diego Argueta (JIRA)" <ji...@apache.org> on 2018/12/11 19:30:00 UTC

[jira] [Created] (ARROW-3999) [Python] Can't read large file that pyarrow wrote

Diego Argueta created ARROW-3999:
------------------------------------

             Summary: [Python] Can't read large file that pyarrow wrote
                 Key: ARROW-3999
                 URL: https://issues.apache.org/jira/browse/ARROW-3999
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 0.11.1
         Environment: OS: OSX High Sierra 10.13.6
Python: 3.7.0
PyArrow: 0.11.1
Pandas: 0.23.4
            Reporter: Diego Argueta


I loaded a large Pandas DataFrame from a CSV and successfully wrote it to a Parquet file using the DataFrame's {{to_parquet}} method. However, reading that same file back results in an exception:
{code:python}
>>> source_df.shape
(32070402, 7)

>>> source_df.dtypes
Url Source            object
Url Destination       object
Anchor text           object
Follow / No-Follow    object
Link No-Follow          bool
Meta No-Follow          bool
Robot No-Follow         bool
dtype: object

>>> source_df.to_parquet('export.parq', compression='gzip',
...                      use_deprecated_int96_timestamps=True)

>>> loaded_df = pd.read_parquet('export.parq')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/tux/.pyenv/versions/3.7.0/lib/python3.7/site-packages/pandas/io/parquet.py", line 288, in read_parquet
    return impl.read(path, columns=columns, **kwargs)
  File "/Users/tux/.pyenv/versions/3.7.0/lib/python3.7/site-packages/pandas/io/parquet.py", line 131, in read
    **kwargs).to_pandas()
  File "/Users/tux/.pyenv/versions/3.7.0/lib/python3.7/site-packages/pyarrow/parquet.py", line 1074, in read_table
    use_pandas_metadata=use_pandas_metadata)
  File "/Users/tux/.pyenv/versions/3.7.0/lib/python3.7/site-packages/pyarrow/filesystem.py", line 184, in read_parquet
    use_pandas_metadata=use_pandas_metadata)
  File "/Users/tux/.pyenv/versions/3.7.0/lib/python3.7/site-packages/pyarrow/parquet.py", line 943, in read
    use_pandas_metadata=use_pandas_metadata)
  File "/Users/tux/.pyenv/versions/3.7.0/lib/python3.7/site-packages/pyarrow/parquet.py", line 500, in read
    table = reader.read(**options)
  File "/Users/tux/.pyenv/versions/3.7.0/lib/python3.7/site-packages/pyarrow/parquet.py", line 187, in read
    use_threads=use_threads)
  File "pyarrow/_parquet.pyx", line 721, in pyarrow._parquet.ParquetReader.read_all
  File "pyarrow/error.pxi", line 83, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: Arrow error: Capacity error: BinaryArray cannot contain more than 2147483646 bytes, have 2147483685
{code}

One would expect that if PyArrow can write a file successfully, it can also read it back. Fortunately, the {{fastparquet}} library has no trouble reading this file (see the sketch below), so we didn't lose any data, but the failed round trip was a surprise.
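
For reference, the fastparquet fallback is just pandas' standard engine switch; a minimal sketch, assuming fastparquet is installed:
{code:python}
import pandas as pd

# Read the same file with the fastparquet engine instead of pyarrow.
loaded_df = pd.read_parquet('export.parq', engine='fastparquet')
{code}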

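The cap in the error message, 2147483646 bytes, is 2^31 - 2, which suggests it comes from BinaryArray's 32-bit offsets. If that is the cause, a possible write-side workaround (a sketch, not verified against 0.11.1) is to cap the row-group size so that no single column chunk approaches 2 GiB; pandas forwards extra keyword arguments to {{pyarrow.parquet.write_table}}, which accepts {{row_group_size}}:
{code:python}
# Untested sketch: split the table into smaller row groups at write time so
# that each BinaryArray column chunk stays well under 2**31 - 2 bytes.
# The 5_000_000 rows-per-group figure is an arbitrary illustration, not a
# tuned value.
source_df.to_parquet('export.parq', compression='gzip',
                     use_deprecated_int96_timestamps=True,
                     row_group_size=5_000_000)
{code}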

