Posted to jira@arrow.apache.org by "Daniel Evans (Jira)" <ji...@apache.org> on 2021/02/26 10:26:00 UTC

[jira] [Created] (ARROW-11792) PyArrow unable to read file with large string values

Daniel Evans created ARROW-11792:
------------------------------------

             Summary: PyArrow unable to read file with large string values
                 Key: ARROW-11792
                 URL: https://issues.apache.org/jira/browse/ARROW-11792
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 3.0.0
         Environment: Scientific Linux 7.9; PyArrow 3.0.0, Pandas 1.0.5
            Reporter: Daniel Evans


I am having difficulty reading back a Parquet file that was written out using Pandas. The error message suggests that either the file was malformed on write, or that it has become corrupt on disk (the latter is hard for me to confirm or rule out - if there's an easy way to check, let me know).

The original Pandas dataframe consisted of around 20 million rows with four columns. Three columns hold simple {{float}} data, while the fourth is a string-typed column containing long strings averaging 200 characters. Each string value appears in 10 or so rows, giving around 2 million unique strings. If this is an issue with pyarrow, that string column is currently where my suspicion lies.

The file was written out with {{df.to_parquet(compression="brotli")}}.
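
For reference, the general shape of the data and the write call can be approximated with something like the following (a scaled-down synthetic sketch, not my actual pipeline; the column names, sizes and output filename are illustrative):

{code:python}
import numpy as np
import pandas as pd

# Scaled-down approximation of the real data: the production frame has
# ~20 million rows and ~2 million unique strings averaging 200 characters.
n_unique = 20_000   # illustrative; the real value is ~2 million
repeats = 10        # each string value appears in roughly 10 rows

rng = np.random.default_rng(0)
unique_strings = [
    "".join(rng.choice(list("abcdefghij"), size=200)) for _ in range(n_unique)
]

df = pd.DataFrame({
    "a": rng.random(n_unique * repeats),
    "b": rng.random(n_unique * repeats),
    "c": rng.random(n_unique * repeats),
    "label": np.repeat(unique_strings, repeats),  # long-string column
})

# Written the same way as the real file: pandas -> pyarrow, brotli compression.
df.to_parquet("large_strings.parquet", compression="brotli")
{code}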

As well as pyarrow 3.0.0, I have quickly tried 2.0.0 and 1.0.1, both of which also fail to read the file. Annoyingly, re-generating the data and writing it out again takes several hours - a test on a smaller dataset produces a file that reads back fine.

I can provide the problematic file privately - it's around 250MB.

{code}
[...snip...]
    df = pd.read_parquet(data_source, columns=columns)
  File "/home/farm/farmcatenv/lib64/python3.6/site-packages/pandas/io/parquet.py", line 312, in read_parquet
    return impl.read(path, columns=columns, **kwargs)
  File "/home/farm/farmcatenv/lib64/python3.6/site-packages/pandas/io/parquet.py", line 127, in read
    path, columns=columns, **kwargs
  File "/home/farm/farmcatenv/lib64/python3.6/site-packages/pyarrow/parquet.py", line 1704, in read_table
    use_pandas_metadata=use_pandas_metadata)
  File "/home/farm/farmcatenv/lib64/python3.6/site-packages/pyarrow/parquet.py", line 1582, in read
    use_threads=use_threads
  File "pyarrow/_dataset.pyx", line 372, in pyarrow._dataset.Dataset.to_table
  File "pyarrow/_dataset.pyx", line 2266, in pyarrow._dataset.Scanner.to_table
  File "pyarrow/error.pxi", line 122, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
OSError: Couldn't deserialize thrift: TProtocolException: Invalid data
Deserializing page header failed.
{code}
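
If it helps with triage, I'm happy to run something along these lines against the file to narrow down where the failure happens (a rough sketch - the path is just a placeholder for the real file):

{code:python}
import pyarrow.parquet as pq

path = "large_strings.parquet"  # placeholder for the problematic file

# First check whether the footer/metadata parses at all.
pf = pq.ParquetFile(path)
print(pf.metadata)
print(pf.schema_arrow)

# Then read row groups one at a time to see which one triggers the
# "Deserializing page header failed" error.
for i in range(pf.metadata.num_row_groups):
    try:
        pf.read_row_group(i)
        print(f"row group {i}: ok")
    except OSError as exc:
        print(f"row group {i}: FAILED - {exc}")
{code}

That should at least show whether a single row group is affected or the whole file is unreadable.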


