Posted to issues@arrow.apache.org by "comicfans (via GitHub)" <gi...@apache.org> on 2023/04/13 09:39:27 UTC

[GitHub] [arrow] comicfans opened a new issue, #35105: R can't parse parquet written by pyarrow with BYTE_STREAM_SPLIT column encoding

comicfans opened a new issue, #35105:
URL: https://github.com/apache/arrow/issues/35105

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   R arrow_info
   ```
   > arrow::arrow_info()
   Arrow package version: 11.0.0.3
   
   Capabilities:
                  
   dataset    TRUE
   substrait FALSE
   parquet    TRUE
   json       TRUE
   s3         TRUE
   gcs        TRUE
   utf8proc   TRUE
   re2        TRUE
   snappy     TRUE
   gzip       TRUE
   brotli     TRUE
   zstd       TRUE
   lz4        TRUE
   lz4_frame  TRUE
   lzo       FALSE
   bz2        TRUE
   jemalloc   TRUE
   mimalloc   TRUE
   
   Memory:
                     
   Allocator jemalloc
   Current    3.82 Gb
   Max        8.63 Gb
   
   Runtime:
                           
   SIMD Level          avx2
   Detected SIMD Level avx2
   
   Build:
                              
   C++ Library Version  11.0.0
   C++ Compiler            GNU
   C++ Compiler Version  7.5.0
   ```
   
   pyarrow info:
   ```
   pyarrow-11.0.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (installed via pip in a conda environment)
   ```
   
   When pyarrow writes a parquet file with a column encoded as BYTE_STREAM_SPLIT, R can't parse the resulting file:
   ```
   Error: IOError: Data size too small for number of values (corrupted file?)
   ```
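   For context, BYTE_STREAM_SPLIT reorders the bytes of fixed-width values (e.g. float32) so that byte i of every value is stored contiguously in stream i, which typically compresses better. A minimal pure-Python sketch of the transform (an illustration only, not the Arrow C++ implementation; the function names are made up):
   
   ```python
   import struct
   
   def byte_stream_split(values):
       """Encode float32 values: byte i of each value goes into stream i."""
       raw = struct.pack(f'<{len(values)}f', *values)
       width = 4  # float32 is 4 bytes wide
       return b''.join(
           bytes(raw[j * width + i] for j in range(len(values)))
           for i in range(width))
   
   def byte_stream_unsplit(encoded, num_values):
       """Decode: byte i of value j sits at offset i * num_values + j."""
       width = 4
       raw = bytes(encoded[i * num_values + j]
                   for j in range(num_values) for i in range(width))
       return list(struct.unpack(f'<{num_values}f', raw))
   
   vals = [0.0, 1.5, -2.25, 3.75]
   assert byte_stream_unsplit(byte_stream_split(vals), len(vals)) == vals
   ```
   
   The decoder must know exactly how many values the page holds, which is why a size mismatch surfaces as the "data size" errors seen in this issue.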
   
   ### Component(s)
   
   R


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] comicfans commented on issue #35105: [R] R can't parse parquet written by pyarrow with BYTE_STREAM_SPLIT column encoding

Posted by "comicfans (via GitHub)" <gi...@apache.org>.
comicfans commented on issue #35105:
URL: https://github.com/apache/arrow/issues/35105#issuecomment-1534901323

   > @comicfans I tried to fix the "data size too large" error in #35428. Would you mind taking a look, or waiting for that patch to be merged?
   
   @mapleFU Hi, I've tried your branch: I built the R package and used its read_parquet to load the written-out parquet. I can confirm it works (previously it reported an error) and the result is identical to the normal one, but I didn't try the Python API.


[GitHub] [arrow] comicfans closed issue #35105: [R] R can't parse parquet written by pyarrow with BYTE_STREAM_SPLIT column encoding

Posted by "comicfans (via GitHub)" <gi...@apache.org>.
comicfans closed issue #35105: [R] R can't parse parquet written by pyarrow with BYTE_STREAM_SPLIT column encoding
URL: https://github.com/apache/arrow/issues/35105


[GitHub] [arrow] mapleFU commented on issue #35105: [R] R can't parse parquet written by pyarrow with BYTE_STREAM_SPLIT column encoding

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on issue #35105:
URL: https://github.com/apache/arrow/issues/35105#issuecomment-1534672116

   @comicfans I tried to fix the "data size too large" error in https://github.com/apache/arrow/pull/35428. Would you mind taking a look, or waiting for that patch to be merged?


[GitHub] [arrow] mapleFU commented on issue #35105: [R] R can't parse parquet written by pyarrow with BYTE_STREAM_SPLIT column encoding

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on issue #35105:
URL: https://github.com/apache/arrow/issues/35105#issuecomment-1534003231

   Could you please upgrade to 12.0 and see whether it works? It seems 11.0 has this issue: https://github.com/apache/arrow/issues/15173


[GitHub] [arrow] comicfans commented on issue #35105: [R] R can't parse parquet written by pyarrow with BYTE_STREAM_SPLIT column encoding

Posted by "comicfans (via GitHub)" <gi...@apache.org>.
comicfans commented on issue #35105:
URL: https://github.com/apache/arrow/issues/35105#issuecomment-1507910900

   It seems this is a common problem: pyarrow also gives the same error for the generated file.
   I've attached a good input file for testing:
   [sample.zip](https://github.com/apache/arrow/files/11228777/sample.zip)
   Unzip this file, then run the following Python code:
   
   ```python
   from pyarrow import parquet
   
   a = parquet.read_table('sample.parquet')
   parquet.write_table(a, "bug.parquet", use_dictionary=["contract_name"],
                       use_byte_stream_split=["last_price", 'bid_price1'])
   parquet.read_table('bug.parquet')
   ```
   
   This fails with:
   ```
   ore.py", line 2601, in read
       table = self._dataset.to_table(
               ^^^^^^^^^^^^^^^^^^^^^^^
     File "pyarrow/_dataset.pyx", line 369, in pyarrow._dataset.Dataset.to_table
     File "pyarrow/_dataset.pyx", line 2818, in pyarrow._dataset.Scanner.to_table
     File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
     File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status
   OSError: Data size too small for number of values (corrupted file?)
   ```


[GitHub] [arrow] comicfans commented on issue #35105: [R] R can't parse parquet written by pyarrow with BYTE_STREAM_SPLIT column encoding

Posted by "comicfans (via GitHub)" <gi...@apache.org>.
comicfans commented on issue #35105:
URL: https://github.com/apache/arrow/issues/35105#issuecomment-1554069524

   That's OK, but this fix was merged only recently and isn't included in the released pyarrow / R arrow packages yet. I hope both packages get a new release soon.


[GitHub] [arrow] mapleFU commented on issue #35105: [R] R can't parse parquet written by pyarrow with BYTE_STREAM_SPLIT column encoding

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on issue #35105:
URL: https://github.com/apache/arrow/issues/35105#issuecomment-1554072305

   Yes, it may be included in the 12.0.1 release.


[GitHub] [arrow] mapleFU commented on issue #35105: [R] R can't parse parquet written by pyarrow with BYTE_STREAM_SPLIT column encoding

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on issue #35105:
URL: https://github.com/apache/arrow/issues/35105#issuecomment-1553960776

   @comicfans My fix has been merged into master. Can we close this issue now?


[GitHub] [arrow] adamreeve commented on issue #35105: [R] R can't parse parquet written by pyarrow with BYTE_STREAM_SPLIT column encoding

Posted by "adamreeve (via GitHub)" <gi...@apache.org>.
adamreeve commented on issue #35105:
URL: https://github.com/apache/arrow/issues/35105#issuecomment-1533985903

   Sorry, I just realised the issue above was reported with pyarrow 11 and the error message is slightly different; I've opened #35423 for the "Data size too large" issue.


[GitHub] [arrow] adamreeve commented on issue #35105: [R] R can't parse parquet written by pyarrow with BYTE_STREAM_SPLIT column encoding

Posted by "adamreeve (via GitHub)" <gi...@apache.org>.
adamreeve commented on issue #35105:
URL: https://github.com/apache/arrow/issues/35105#issuecomment-1533916250

   This is a regression in Arrow 12.0.0; I can reproduce the error by writing and then reading data with pyarrow 12.0.0:
   
   ```python
   import numpy as np
   import pyarrow as pa
   import pyarrow.parquet as pq
   
   x = pa.array(np.linspace(0.0, 1.0, 1_000_000), type=pa.float32())
   table = pa.Table.from_arrays([x], names=['x'])
   pq.write_table(table, 'data.parquet', use_dictionary=False, use_byte_stream_split=True)
   
   table = pq.read_table('data.parquet')
   print(table)
   ```
   This crashes with:
   ```
   Traceback (most recent call last):
     File "/home/.../write_read_data.py", line 9, in <module>
   	table = pq.read_table('data.parquet')
   			^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
     File "/home/adam/.local/share/virtualenvs/arrow12-nieuEBn0/lib64/python3.11/site-packages/pyarrow/parquet/core.py", line 2986, in read_table
   	return dataset.read(columns=columns, use_threads=use_threads,
   		   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
     File "/home/adam/.local/share/virtualenvs/arrow12-nieuEBn0/lib64/python3.11/site-packages/pyarrow/parquet/core.py", line 2614, in read
   	table = self._dataset.to_table(
   			^^^^^^^^^^^^^^^^^^^^^^^
     File "pyarrow/_dataset.pyx", line 546, in pyarrow._dataset.Dataset.to_table
     File "pyarrow/_dataset.pyx", line 3449, in pyarrow._dataset.Scanner.to_table
     File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
     File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status
   OSError: Data size too large for number of values (padding in byte stream split data page?)
   ```
   But the above code works fine with pyarrow 11.0.0 and 10.0.1.
   
   It appears that #34140 caused this regression. I tested building pyarrow on the current main branch (commit 42d42b1194d8a672e13dac10a8102573f787f70d) and could reproduce the error, but it was fixed after I reverted the merge of that PR (commit c31fb46544b9c8372e799138bad9223162169473).
   
   @mapleFU could you please take a look into this?
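   The error message points at the failure mode: a BYTE_STREAM_SPLIT page is expected to contain exactly num_values × byte_width bytes, so a strict size check rejects any page the writer padded past that. A hypothetical sketch of such a check (not the actual Arrow C++ code; both messages are copied from this thread):
   
   ```python
   def check_byte_stream_split_page(payload_len, num_values, byte_width=4):
       """Reject pages whose payload doesn't match num_values * byte_width exactly."""
       expected = num_values * byte_width
       if payload_len < expected:
           raise OSError("Data size too small for number of values (corrupted file?)")
       if payload_len > expected:
           raise OSError("Data size too large for number of values "
                         "(padding in byte stream split data page?)")
   
   # Exact fit for 1M float32 values passes; anything larger would raise.
   check_byte_stream_split_page(4_000_000, 1_000_000)
   ```
   
   Under this reading, the write path started padding the page and the reader's exact-size check then failed.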


[GitHub] [arrow] comicfans commented on issue #35105: [R] R can't parse parquet written by pyarrow with BYTE_STREAM_SPLIT column encoding

Posted by "comicfans (via GitHub)" <gi...@apache.org>.
comicfans commented on issue #35105:
URL: https://github.com/apache/arrow/issues/35105#issuecomment-1534640636

   I've tried pyarrow 12.0; it seems there's another problem:
   ```
   >>> import pyarrow
   >>> pyarrow.__version__
   '12.0.0'
   >>> from pyarrow import parquet
   >>> a = parquet.read_table('sample.parquet')
   >>> parquet.write_table(a,"bug.parquet", use_dictionary=["contract_name"],use_byte_stream_split=["last_price",'bid_price1'])
   >>> parquet.read_table('bug.parquet')
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
     File "/home/wangxinyu/miniconda3/lib/python3.9/site-packages/pyarrow/parquet/core.py", line 2986, in read_table
       return dataset.read(columns=columns, use_threads=use_threads,
     File "/home/wangxinyu/miniconda3/lib/python3.9/site-packages/pyarrow/parquet/core.py", line 2614, in read
       table = self._dataset.to_table(
     File "pyarrow/_dataset.pyx", line 546, in pyarrow._dataset.Dataset.to_table
     File "pyarrow/_dataset.pyx", line 3449, in pyarrow._dataset.Scanner.to_table
     File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
     File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status
   OSError: Data size too large for number of values (padding in byte stream split data page?)
   ```


[GitHub] [arrow] thisisnic commented on issue #35105: R can't parse parquet written by pyarrow with BYTE_STREAM_SPLIT column encoding

Posted by "thisisnic (via GitHub)" <gi...@apache.org>.
thisisnic commented on issue #35105:
URL: https://github.com/apache/arrow/issues/35105#issuecomment-1506680086

   Thanks for reporting this @comicfans; are you able to provide a reproducible example for this, so we can take a closer look at what's going on?

