Posted to github@arrow.apache.org by "adamreeve (via GitHub)" <gi...@apache.org> on 2023/05/04 00:37:08 UTC
[GitHub] [arrow] adamreeve commented on issue #35105: [R] R can't parse parquet written by pyarrow with BYTE_STREAM_SPLIT column encoding
adamreeve commented on issue #35105:
URL: https://github.com/apache/arrow/issues/35105#issuecomment-1533916250
This is a regression in Arrow 12.0.0. I can reproduce the error by writing and then reading data with pyarrow 12.0.0:
```python
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq
x = pa.array(np.linspace(0.0, 1.0, 1_000_000), type=pa.float32())
table = pa.Table.from_arrays([x], names=['x'])
pq.write_table(table, 'data.parquet', use_dictionary=False, use_byte_stream_split=True)
table = pq.read_table('data.parquet')
print(table)
```
Reading the file back fails with:
```
Traceback (most recent call last):
File "/home/.../write_read_data.py", line 9, in <module>
table = pq.read_table('data.parquet')
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/adam/.local/share/virtualenvs/arrow12-nieuEBn0/lib64/python3.11/site-packages/pyarrow/parquet/core.py", line 2986, in read_table
return dataset.read(columns=columns, use_threads=use_threads,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/adam/.local/share/virtualenvs/arrow12-nieuEBn0/lib64/python3.11/site-packages/pyarrow/parquet/core.py", line 2614, in read
table = self._dataset.to_table(
^^^^^^^^^^^^^^^^^^^^^^^
File "pyarrow/_dataset.pyx", line 546, in pyarrow._dataset.Dataset.to_table
File "pyarrow/_dataset.pyx", line 3449, in pyarrow._dataset.Scanner.to_table
File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status
OSError: Data size too large for number of values (padding in byte stream split data page?)
```
But the above code works fine with pyarrow 11.0.0 and 10.0.1.
It appears that #34140 caused this regression. I built pyarrow from the current main branch (commit 42d42b1194d8a672e13dac10a8102573f787f70d) and could reproduce the error, but the error went away after I reverted the merge of that PR (commit c31fb46544b9c8372e799138bad9223162169473).
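For context on the error message, here is a minimal pure-Python sketch of BYTE_STREAM_SPLIT for float32 values as I understand it from the Parquet format description: byte k of every value goes into stream k, and the streams are concatenated, so N values should encode to exactly 4 * N bytes. The function names are hypothetical and this is not Arrow's implementation. The size check in the decoder is only my assumption about the kind of validation the error message hints at.

```python
import struct

WIDTH = 4  # bytes per float32 value


def byte_stream_split_encode(values):
    """Illustrative encoder: stream k holds byte k of every value."""
    raw = struct.pack(f"<{len(values)}f", *values)
    # raw[k::WIDTH] picks byte k of each value; the streams are concatenated.
    return b"".join(raw[k::WIDTH] for k in range(WIDTH))


def byte_stream_split_decode(data, num_values):
    """Illustrative decoder; rejects data whose size does not match num_values."""
    # Assumed validation: the encoded size must be exactly num_values * WIDTH.
    if len(data) != num_values * WIDTH:
        raise ValueError("data size does not match number of values")
    raw = bytearray(len(data))
    for k in range(WIDTH):
        # Stream k occupies data[k * num_values:(k + 1) * num_values].
        raw[k::WIDTH] = data[k * num_values:(k + 1) * num_values]
    return list(struct.unpack(f"<{num_values}f", bytes(raw)))


values = [0.0, 0.25, 0.5, 1.0]  # all exactly representable as float32
encoded = byte_stream_split_encode(values)
decoded = byte_stream_split_decode(encoded, len(values))
```

If the reader's idea of the number of values in a page disagrees with the size of the encoded data, a check like the one above would fail even though the file itself is valid, which would be consistent with the file reading fine in 11.0.0.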
@mapleFU could you please take a look into this?