Posted to issues@arrow.apache.org by "adamreeve (via GitHub)" <gi...@apache.org> on 2023/05/04 02:11:04 UTC

[GitHub] [arrow] adamreeve opened a new issue, #35423: "Data size too large" error with byte-stream-split encoded data since Arrow 12.0.0

adamreeve opened a new issue, #35423:
URL: https://github.com/apache/arrow/issues/35423

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   Arrow 12.0.0 has a regression where it can crash when reading byte-stream split encoded data written by itself or older versions of Arrow:
   
   ```python
   import numpy as np
   import pyarrow as pa
   import pyarrow.parquet as pq
   
   x = pa.array(np.linspace(0.0, 1.0, 1_000_000), type=pa.float32())
   table = pa.Table.from_arrays([x], names=['x'])
   pq.write_table(table, 'data.parquet', use_dictionary=False, use_byte_stream_split=True)
   
   table = pq.read_table('data.parquet')
   print(table)
   ```
   This crashes with:
   ```
   Traceback (most recent call last):
     File "/home/.../write_read_data.py", line 9, in <module>
   	table = pq.read_table('data.parquet')
   			^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
     File "/home/adam/.local/share/virtualenvs/arrow12-nieuEBn0/lib64/python3.11/site-packages/pyarrow/parquet/core.py", line 2986, in read_table
   	return dataset.read(columns=columns, use_threads=use_threads,
   		   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
     File "/home/adam/.local/share/virtualenvs/arrow12-nieuEBn0/lib64/python3.11/site-packages/pyarrow/parquet/core.py", line 2614, in read
   	table = self._dataset.to_table(
   			^^^^^^^^^^^^^^^^^^^^^^^
     File "pyarrow/_dataset.pyx", line 546, in pyarrow._dataset.Dataset.to_table
     File "pyarrow/_dataset.pyx", line 3449, in pyarrow._dataset.Scanner.to_table
     File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
     File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status
   OSError: Data size too large for number of values (padding in byte stream split data page?)
   ```
   But the above code works fine with pyarrow 11.0.0 and 10.0.1.
   
   It appears that #34140 caused this regression. I tested building pyarrow on the current main branch (commit 42d42b1194d8a672e13dac10a8102573f787f70d) and could reproduce the error, but it was fixed after I reverted the merge of that PR (commit c31fb46544b9c8372e799138bad9223162169473).
   
   ### Component(s)
   
   Parquet




[GitHub] [arrow] adamreeve commented on issue #35423: "Data size too large" error with byte-stream-split encoded data since Arrow 12.0.0

Posted by "adamreeve (via GitHub)" <gi...@apache.org>.
adamreeve commented on issue #35423:
URL: https://github.com/apache/arrow/issues/35423#issuecomment-1533998334

   If I run this in a debugger, I can see `ByteStreamSplitDecoder<DType>::SetData` is called 4 times. For the first 3 calls, `num_values` is 262,144 and `len` is 1,048,576; on the 4th call, `len` is again 1,048,576 but `num_values` is only 213,568 (for a total of 1,000,000 values).
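   Plugging those numbers into the size check that raises the error makes the failure visible (a worked sketch only, not Arrow's actual code; `sizeof_float` stands in for the C++ `sizeof(T)`):
   ```python
   sizeof_float = 4  # a float32 value is split across 4 byte streams

   calls = [(262_144, 1_048_576)] * 3 + [(213_568, 1_048_576)]
   for num_values, length in calls:
       # The decoder rejects a page whose data is larger than
       # num_values * sizeof(T) can account for.
       needed = num_values * sizeof_float
       print(num_values, length, 'ok' if needed >= length else 'too large')
   ```
   The first three calls match exactly (262,144 × 4 = 1,048,576); the fourth needs only 213,568 × 4 = 854,272 bytes, so a `len` of 1,048,576 fails the check.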




[GitHub] [arrow] mapleFU commented on issue #35423: "Data size too large" error with byte-stream-split encoded data since Arrow 12.0.0

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on issue #35423:
URL: https://github.com/apache/arrow/issues/35423#issuecomment-1534181579

   I've found the reason. I think it's not actually related to ByteStreamSplit; it's caused by `page.size()`: the page reader reuses one buffer across `ReadNextPage` calls, so its `size` can be greater than the expected size (`page.uncompressed_size()`). ByteStreamSplit merely exposes the problem because it checks the boundary.
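   A minimal sketch of that buffer-reuse behaviour, assuming the reader grows its scratch buffer but never shrinks it (illustrative Python; the actual logic is in Arrow's C++ page reader):
   ```python
   scratch = bytearray()  # one buffer reused across ReadNextPage calls

   def read_next_page(uncompressed_size: int) -> bytearray:
       # Grow the scratch buffer when a page needs more room, but never
       # shrink it, so len(scratch) can exceed the current page's size.
       if len(scratch) < uncompressed_size:
           scratch.extend(bytes(uncompressed_size - len(scratch)))
       return scratch

   read_next_page(262_144 * 4)         # pages 1-3 each need 1,048,576 bytes
   last = read_next_page(213_568 * 4)  # page 4 only needs 854,272 bytes
   print(len(last))                    # 1048576 -- the stale, larger size
   ```
   Passing the buffer's length rather than `page.uncompressed_size()` to the decoder is what trips the boundary check.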




[GitHub] [arrow] mapleFU commented on issue #35423: "Data size too large" error with byte-stream-split encoded data since Arrow 12.0.0

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on issue #35423:
URL: https://github.com/apache/arrow/issues/35423#issuecomment-1534008360

   I'll try to reproduce this problem. `ByteStreamSplitDecoder::SetData` expects `num_values` to be greater than or equal to the number of values in the byte-stream-split data, so it requires `num_values * static_cast<int64_t>(sizeof(T)) >= len`. The comparison would be `==` if there were no null values in the page.
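   A hedged rendering of that invariant (illustrative Python; the real check is C++ in `ByteStreamSplitDecoder<DType>::SetData`):
   ```python
   def check_set_data(num_values: int, length: int, sizeof_t: int) -> None:
       # num_values counts nulls too, while the encoded payload stores only
       # the non-null values, so the payload can be smaller but never larger.
       if num_values * sizeof_t < length:
           raise OSError("Data size too large for number of values "
                         "(padding in byte stream split data page?)")

   check_set_data(num_values=1_000, length=1_000 * 4, sizeof_t=4)  # no nulls: ==
   check_set_data(num_values=1_000, length=900 * 4, sizeof_t=4)    # 100 nulls: <
   ```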




[GitHub] [arrow] pitrou closed issue #35423: [C++][Parquet] "Data size too large" error with byte-stream-split encoded data since Arrow 12.0.0

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou closed issue #35423: [C++][Parquet] "Data size too large" error with byte-stream-split encoded data since Arrow 12.0.0
URL: https://github.com/apache/arrow/issues/35423




[GitHub] [arrow] mapleFU commented on issue #35423: "Data size too large" error with byte-stream-split encoded data since Arrow 12.0.0

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on issue #35423:
URL: https://github.com/apache/arrow/issues/35423#issuecomment-1534161486

   It seems that using ParquetTableReader to just read the file is OK, and an array size of 100_000 or 200_000 is OK, but 300_000 fails. I guess `ParquetFileFormat` with a huge page size is not handled correctly.
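   One way to check that observation end to end, adapted from the repro at the top of this issue (the array sizes are the ones mentioned above):
   ```python
   import numpy as np
   import pyarrow as pa
   import pyarrow.parquet as pq

   for n in (100_000, 200_000, 300_000, 1_000_000):
       x = pa.array(np.linspace(0.0, 1.0, n), type=pa.float32())
       pq.write_table(pa.Table.from_arrays([x], names=['x']), 'data.parquet',
                      use_dictionary=False, use_byte_stream_split=True)
       try:
           pq.read_table('data.parquet')  # dataset path, reported to fail
           print(n, 'ok')
       except OSError as exc:
           print(n, 'failed:', exc)
   ```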

