Posted to github@arrow.apache.org by "comicfans (via GitHub)" <gi...@apache.org> on 2023/05/04 11:56:06 UTC

[GitHub] [arrow] comicfans commented on issue #35105: [R] R can't parse parquet written by pyarrow with BYTE_STREAM_SPLIT column encoding

comicfans commented on issue #35105:
URL: https://github.com/apache/arrow/issues/35105#issuecomment-1534640636

   I've tried pyarrow 12.0.0, and it seems to be a different problem:
   ```
   >>> import pyarrow
   >>> pyarrow.__version__
   '12.0.0'
   >>> from pyarrow import parquet
   >>> a = parquet.read_table('sample.parquet')
   >>> parquet.write_table(a,"bug.parquet", use_dictionary=["contract_name"],use_byte_stream_split=["last_price",'bid_price1'])
   >>> parquet.read_table('bug.parquet')
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
     File "/home/wangxinyu/miniconda3/lib/python3.9/site-packages/pyarrow/parquet/core.py", line 2986, in read_table
       return dataset.read(columns=columns, use_threads=use_threads,
     File "/home/wangxinyu/miniconda3/lib/python3.9/site-packages/pyarrow/parquet/core.py", line 2614, in read
       table = self._dataset.to_table(
     File "pyarrow/_dataset.pyx", line 546, in pyarrow._dataset.Dataset.to_table
     File "pyarrow/_dataset.pyx", line 3449, in pyarrow._dataset.Scanner.to_table
     File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
     File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status
   OSError: Data size too large for number of values (padding in byte stream split data page?)
   ```
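
   For reference, here is a minimal self-contained sketch of the same write/read round trip, with made-up data standing in for sample.parquet (the column names and values are assumptions, and a table this small may not be enough to trigger the padding-related error):
   ```
   import pyarrow as pa
   from pyarrow import parquet

   # Hypothetical stand-in for sample.parquet
   table = pa.table({
       "contract_name": ["A", "B", "A"],   # written with dictionary encoding
       "last_price": [1.0, 2.5, 3.25],     # float64, BYTE_STREAM_SPLIT
       "bid_price1": [0.9, 2.4, 3.2],      # float64, BYTE_STREAM_SPLIT
   })
   parquet.write_table(
       table, "bug.parquet",
       use_dictionary=["contract_name"],
       use_byte_stream_split=["last_price", "bid_price1"],
   )
   parquet.read_table("bug.parquet")  # reportedly raises OSError on affected versions
   ```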

