Posted to github@arrow.apache.org by "adamreeve (via GitHub)" <gi...@apache.org> on 2023/05/04 00:37:08 UTC

[GitHub] [arrow] adamreeve commented on issue #35105: [R] R can't parse parquet written by pyarrow with BYTE_STREAM_SPLIT column encoding

adamreeve commented on issue #35105:
URL: https://github.com/apache/arrow/issues/35105#issuecomment-1533916250

   This is a regression in Arrow 12.0.0. I can reproduce the error by writing and then reading data with pyarrow 12.0.0:
   
   ```python
   import numpy as np
   import pyarrow as pa
   import pyarrow.parquet as pq
   
   x = pa.array(np.linspace(0.0, 1.0, 1_000_000), type=pa.float32())
   table = pa.Table.from_arrays([x], names=['x'])
   pq.write_table(table, 'data.parquet', use_dictionary=False, use_byte_stream_split=True)
   
   table = pq.read_table('data.parquet')
   print(table)
   ```
   This crashes with:
   ```
   Traceback (most recent call last):
     File "/home/.../write_read_data.py", line 9, in <module>
   	table = pq.read_table('data.parquet')
   			^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
     File "/home/adam/.local/share/virtualenvs/arrow12-nieuEBn0/lib64/python3.11/site-packages/pyarrow/parquet/core.py", line 2986, in read_table
   	return dataset.read(columns=columns, use_threads=use_threads,
   		   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
     File "/home/adam/.local/share/virtualenvs/arrow12-nieuEBn0/lib64/python3.11/site-packages/pyarrow/parquet/core.py", line 2614, in read
   	table = self._dataset.to_table(
   			^^^^^^^^^^^^^^^^^^^^^^^
     File "pyarrow/_dataset.pyx", line 546, in pyarrow._dataset.Dataset.to_table
     File "pyarrow/_dataset.pyx", line 3449, in pyarrow._dataset.Scanner.to_table
     File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
     File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status
   OSError: Data size too large for number of values (padding in byte stream split data page?)
   ```
   But the above code works fine with pyarrow 11.0.0 and 10.0.1.
   
   It appears that #34140 caused this regression. I built pyarrow from the current main branch (commit 42d42b1194d8a672e13dac10a8102573f787f70d) and could reproduce the error, but the error went away after I reverted the merge of that PR (commit c31fb46544b9c8372e799138bad9223162169473).
   
   @mapleFU could you please take a look at this?
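
For context on the error message above: per the Parquet format specification, BYTE_STREAM_SPLIT scatters the K bytes of each value into K separate streams (byte 0 of every value first, then byte 1, and so on) and concatenates them, so an encoded page of N float32 values should be exactly 4 * N bytes long. The following is a minimal pure-Python sketch of that layout, not Arrow's actual implementation. The function names are hypothetical, and rejecting any length mismatch is an assumption made for illustration; it is not a description of how Arrow validates pages.

```python
import struct

FLOAT32_WIDTH = 4  # bytes per float32 value


def byte_stream_split_encode(values):
    """Illustrative encoder (hypothetical helper, not part of pyarrow)."""
    # Little-endian float32 bytes, one value after another.
    raw = struct.pack(f"<{len(values)}f", *values)
    # Stream k collects byte k of every value; streams are concatenated.
    return b"".join(raw[k::FLOAT32_WIDTH] for k in range(FLOAT32_WIDTH))


def byte_stream_split_decode(data, num_values):
    """Illustrative decoder (hypothetical helper, not part of pyarrow)."""
    expected = num_values * FLOAT32_WIDTH
    if len(data) != expected:
        # Assumption for this sketch: any size mismatch is an error.
        raise ValueError(
            f"expected {expected} bytes for {num_values} values, got {len(data)}"
        )
    # Cut the buffer back into its K streams of num_values bytes each.
    streams = [
        data[k * num_values:(k + 1) * num_values] for k in range(FLOAT32_WIDTH)
    ]
    # Re-interleave: take one byte from each stream per value.
    raw = bytes(b for group in zip(*streams) for b in group)
    return list(struct.unpack(f"<{num_values}f", raw))


if __name__ == "__main__":
    encoded = byte_stream_split_encode([0.0, 0.5, 1.0])
    print(len(encoded), byte_stream_split_decode(encoded, 3))
```

In this sketch, a buffer that carries even one extra byte for the stated number of values is rejected, which is the kind of size/count mismatch the `OSError` message describes.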

