Posted to issues@arrow.apache.org by "comicfans (via GitHub)" <gi...@apache.org> on 2023/04/13 09:39:27 UTC
[GitHub] [arrow] comicfans opened a new issue, #35105: R can't parse parquet written by pyarrow with BYTE_STREAM_SPLIT column encoding
comicfans opened a new issue, #35105:
URL: https://github.com/apache/arrow/issues/35105
### Describe the bug, including details regarding any error messages, version, and platform.
R arrow_info
```
> arrow::arrow_info()
Arrow package version: 11.0.0.3
Capabilities:
dataset TRUE
substrait FALSE
parquet TRUE
json TRUE
s3 TRUE
gcs TRUE
utf8proc TRUE
re2 TRUE
snappy TRUE
gzip TRUE
brotli TRUE
zstd TRUE
lz4 TRUE
lz4_frame TRUE
lzo FALSE
bz2 TRUE
jemalloc TRUE
mimalloc TRUE
Memory:
Allocator jemalloc
Current 3.82 Gb
Max 8.63 Gb
Runtime:
SIMD Level avx2
Detected SIMD Level avx2
Build:
C++ Library Version 11.0.0
C++ Compiler GNU
C++ Compiler Version 7.5.0
```
pyarrow info:
```
pyarrow-11.0.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (installed via pip in a conda environment)
```
When using pyarrow to write a file with BYTE_STREAM_SPLIT column encoding, R can't parse the resulting file:
```
Error: IOError: Data size too small for number of values (corrupted file?)
```
### Component(s)
R
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] comicfans commented on issue #35105: [R] R can't parse parquet written by pyarrow with BYTE_STREAM_SPLIT column encoding
Posted by "comicfans (via GitHub)" <gi...@apache.org>.
comicfans commented on issue #35105:
URL: https://github.com/apache/arrow/issues/35105#issuecomment-1534901323
> @comicfans I tried to fix the "data size too large" error in #35428. Would you mind taking a look, or waiting for that patch to be merged?
@mapleFU Hi, I've tried your branch: I built the R package and used its read_parquet to load the written-out parquet file. I can confirm it works (previously it reported an error) and the result is identical to the normal one. I didn't try the Python API, though.
[GitHub] [arrow] comicfans closed issue #35105: [R] R can't parse parquet written by pyarrow with BYTE_STREAM_SPLIT column encoding
Posted by "comicfans (via GitHub)" <gi...@apache.org>.
comicfans closed issue #35105: [R] R can't parse parquet written by pyarrow with BYTE_STREAM_SPLIT column encoding
URL: https://github.com/apache/arrow/issues/35105
[GitHub] [arrow] mapleFU commented on issue #35105: [R] R can't parse parquet written by pyarrow with BYTE_STREAM_SPLIT column encoding
Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on issue #35105:
URL: https://github.com/apache/arrow/issues/35105#issuecomment-1534672116
@comicfans I tried to fix the "data size too large" error in https://github.com/apache/arrow/pull/35428. Would you mind taking a look, or waiting for that patch to be merged?
[GitHub] [arrow] mapleFU commented on issue #35105: [R] R can't parse parquet written by pyarrow with BYTE_STREAM_SPLIT column encoding
Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on issue #35105:
URL: https://github.com/apache/arrow/issues/35105#issuecomment-1534003231
Could you please try 12.0 to see how it works? It seems that 11.0 has this issue: https://github.com/apache/arrow/issues/15173
[GitHub] [arrow] comicfans commented on issue #35105: [R] R can't parse parquet written by pyarrow with BYTE_STREAM_SPLIT column encoding
Posted by "comicfans (via GitHub)" <gi...@apache.org>.
comicfans commented on issue #35105:
URL: https://github.com/apache/arrow/issues/35105#issuecomment-1507910900
Seems like this is a common problem: pyarrow also gives the same error for the generated file.
I've attached a good input file for testing:
[sample.zip](https://github.com/apache/arrow/files/11228777/sample.zip)
Unzip this file and run the following Python code:
```python
from pyarrow import parquet
a = parquet.read_table('sample.parquet')
parquet.write_table(a,"bug.parquet", use_dictionary=["contract_name"],use_byte_stream_split=["last_price",'bid_price1'])
parquet.read_table('bug.parquet')
```
got
```
ore.py", line 2601, in read
table = self._dataset.to_table(
^^^^^^^^^^^^^^^^^^^^^^^
File "pyarrow/_dataset.pyx", line 369, in pyarrow._dataset.Dataset.to_table
File "pyarrow/_dataset.pyx", line 2818, in pyarrow._dataset.Scanner.to_table
File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status
OSError: Data size too small for number of values (corrupted file?)
```
[GitHub] [arrow] comicfans commented on issue #35105: [R] R can't parse parquet written by pyarrow with BYTE_STREAM_SPLIT column encoding
Posted by "comicfans (via GitHub)" <gi...@apache.org>.
comicfans commented on issue #35105:
URL: https://github.com/apache/arrow/issues/35105#issuecomment-1554069524
That's OK, but this fix is still very new and not included in the released pyarrow/R arrow packages. I hope both packages release new versions soon.
[GitHub] [arrow] mapleFU commented on issue #35105: [R] R can't parse parquet written by pyarrow with BYTE_STREAM_SPLIT column encoding
Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on issue #35105:
URL: https://github.com/apache/arrow/issues/35105#issuecomment-1554072305
Yes, it may be included in the 12.0.1 release.
[GitHub] [arrow] mapleFU commented on issue #35105: [R] R can't parse parquet written by pyarrow with BYTE_STREAM_SPLIT column encoding
Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on issue #35105:
URL: https://github.com/apache/arrow/issues/35105#issuecomment-1553960776
@comicfans My fix is merged into master. Can we close this issue?
[GitHub] [arrow] adamreeve commented on issue #35105: [R] R can't parse parquet written by pyarrow with BYTE_STREAM_SPLIT column encoding
Posted by "adamreeve (via GitHub)" <gi...@apache.org>.
adamreeve commented on issue #35105:
URL: https://github.com/apache/arrow/issues/35105#issuecomment-1533985903
Sorry, I just realised the above issue was reported with pyarrow 11 and the error message is slightly different; I've opened #35423 for the "Data size too large" issue above.
[GitHub] [arrow] adamreeve commented on issue #35105: [R] R can't parse parquet written by pyarrow with BYTE_STREAM_SPLIT column encoding
Posted by "adamreeve (via GitHub)" <gi...@apache.org>.
adamreeve commented on issue #35105:
URL: https://github.com/apache/arrow/issues/35105#issuecomment-1533916250
This is a regression in Arrow 12.0.0, and I can reproduce the error by writing and then reading data with pyarrow 12.0.0:
```python
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq
x = pa.array(np.linspace(0.0, 1.0, 1_000_000), type=pa.float32())
table = pa.Table.from_arrays([x], names=['x'])
pq.write_table(table, 'data.parquet', use_dictionary=False, use_byte_stream_split=True)
table = pq.read_table('data.parquet')
print(table)
```
This crashes with:
```
Traceback (most recent call last):
File "/home/.../write_read_data.py", line 9, in <module>
table = pq.read_table('data.parquet')
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/adam/.local/share/virtualenvs/arrow12-nieuEBn0/lib64/python3.11/site-packages/pyarrow/parquet/core.py", line 2986, in read_table
return dataset.read(columns=columns, use_threads=use_threads,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/adam/.local/share/virtualenvs/arrow12-nieuEBn0/lib64/python3.11/site-packages/pyarrow/parquet/core.py", line 2614, in read
table = self._dataset.to_table(
^^^^^^^^^^^^^^^^^^^^^^^
File "pyarrow/_dataset.pyx", line 546, in pyarrow._dataset.Dataset.to_table
File "pyarrow/_dataset.pyx", line 3449, in pyarrow._dataset.Scanner.to_table
File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status
OSError: Data size too large for number of values (padding in byte stream split data page?)
```
But the above code works fine with pyarrow 11.0.0 and 10.0.1.
It appears that #34140 caused this regression. I tested building pyarrow on the current main branch (commit 42d42b1194d8a672e13dac10a8102573f787f70d) and could reproduce the error, but it was fixed after I reverted the merge of that PR (commit c31fb46544b9c8372e799138bad9223162169473).
@mapleFU could you please take a look into this?
[GitHub] [arrow] comicfans commented on issue #35105: [R] R can't parse parquet written by pyarrow with BYTE_STREAM_SPLIT column encoding
Posted by "comicfans (via GitHub)" <gi...@apache.org>.
comicfans commented on issue #35105:
URL: https://github.com/apache/arrow/issues/35105#issuecomment-1534640636
I've tried pyarrow 12.0; it seems to be another problem:
```
>>> import pyarrow
>>> pyarrow.__version__
'12.0.0'
>>> from pyarrow import parquet
>>> a = parquet.read_table('sample.parquet')
>>> parquet.write_table(a,"bug.parquet", use_dictionary=["contract_name"],use_byte_stream_split=["last_price",'bid_price1'])
>>> parquet.read_table('bug.parquet')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/wangxinyu/miniconda3/lib/python3.9/site-packages/pyarrow/parquet/core.py", line 2986, in read_table
return dataset.read(columns=columns, use_threads=use_threads,
File "/home/wangxinyu/miniconda3/lib/python3.9/site-packages/pyarrow/parquet/core.py", line 2614, in read
table = self._dataset.to_table(
File "pyarrow/_dataset.pyx", line 546, in pyarrow._dataset.Dataset.to_table
File "pyarrow/_dataset.pyx", line 3449, in pyarrow._dataset.Scanner.to_table
File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status
OSError: Data size too large for number of values (padding in byte stream split data page?)
```
[GitHub] [arrow] thisisnic commented on issue #35105: R can't parse parquet written by pyarrow with BYTE_STREAM_SPLIT column encoding
Posted by "thisisnic (via GitHub)" <gi...@apache.org>.
thisisnic commented on issue #35105:
URL: https://github.com/apache/arrow/issues/35105#issuecomment-1506680086
Thanks for reporting this @comicfans; are you able to provide a reproducible example for this, so we can take a closer look at what's going on?