Posted to github@arrow.apache.org by "comicfans (via GitHub)" <gi...@apache.org> on 2023/04/14 04:32:58 UTC

[GitHub] [arrow] comicfans commented on issue #35105: [R] R can't parse parquet written by pyarrow with BYTE_STREAM_SPLIT column encoding

comicfans commented on issue #35105:
URL: https://github.com/apache/arrow/issues/35105#issuecomment-1507910900

   It seems this is a common problem: pyarrow also gives the same error for the file it generated.
   I've attached a known-good input file for testing:
   [sample.zip](https://github.com/apache/arrow/files/11228777/sample.zip)
   Unzip this file, then run the following Python code:
   
   ```python
   from pyarrow import parquet

   # Read the known-good file, then write it back with dictionary encoding
   # on one column and BYTE_STREAM_SPLIT on two float columns.
   a = parquet.read_table('sample.parquet')
   parquet.write_table(a, 'bug.parquet',
                       use_dictionary=['contract_name'],
                       use_byte_stream_split=['last_price', 'bid_price1'])

   # Reading the file back fails.
   parquet.read_table('bug.parquet')
   ```
   
   which fails with:
   ```
   ore.py", line 2601, in read
       table = self._dataset.to_table(
               ^^^^^^^^^^^^^^^^^^^^^^^
     File "pyarrow/_dataset.pyx", line 369, in pyarrow._dataset.Dataset.to_table
     File "pyarrow/_dataset.pyx", line 2818, in pyarrow._dataset.Scanner.to_table
     File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
     File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status
   OSError: Data size too small for number of values (corrupted file?)
   ```
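   For reference, a minimal self-contained sketch of the same write path, using a small in-memory table in place of the attached sample.parquet (the table contents here are made up for illustration). It also prints the encodings recorded in the file's metadata, which is one way to confirm that BYTE_STREAM_SPLIT was actually written to the column chunk:

   ```python
   import pyarrow as pa
   from pyarrow import parquet

   # Hypothetical stand-in for sample.parquet from the issue attachment.
   table = pa.table({
       'contract_name': ['a', 'b', 'a'],
       'last_price': [1.0, 2.0, 3.0],
   })

   parquet.write_table(table, 'bug.parquet',
                       use_dictionary=['contract_name'],
                       use_byte_stream_split=['last_price'])

   # Inspect the encodings written for each column chunk.
   meta = parquet.ParquetFile('bug.parquet').metadata
   for i in range(meta.num_columns):
       col = meta.row_group(0).column(i)
       print(col.path_in_schema, col.encodings)
   ```

   On an affected pyarrow version, the metadata shows BYTE_STREAM_SPLIT for `last_price` while reading the file back raises the "Data size too small" error above.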


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org