Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/11/11 17:26:28 UTC

[GitHub] [arrow] izapolsk opened a new issue, #14624: [Python] CompressedOutputStream can't correctly compress/write files bigger than 16 GB

izapolsk opened a new issue, #14624:
URL: https://github.com/apache/arrow/issues/14624

   Sorry, I couldn't figure out how to create a Jira account on apache.org,
   so I'm submitting the issue here instead.
   
   I ran into a problem with CompressedOutputStream in pyarrow 10.0.0:
   it fails to correctly compress files larger than 16 GB. I reproduced this several times with different Arrow files. In the run below, a 28 GB input decompresses to only 16 GB, and its MD5 checksum no longer matches the original.
   Environment: Debian/Ubuntu
   
   ```python
   import pyarrow as pa
   from pathlib import Path
   import datasets as ds
   #%%
   pa.__version__
   >> '10.0.0'
   #%%
   data_dir = Path('~/tmp').expanduser()
   big_dataset = data_dir.joinpath('train.arrow')
   #%%
   !ls -lh ~/tmp/train.arrow
   >> -rw-rw-r-- 1 yzapols yzapols 28G Nov 11 13:35 ~/tmp/train.arrow
   #%%
   !md5sum ~/tmp/train.arrow
   >> 5afe31d206ce07249c127e067bcfa0fb  ~/tmp/train.arrow
   #%%
   schema = pa.schema([...])
   #%%
   compressed_dataset = data_dir.joinpath('train.arrow.bz2')
   # Copy every record batch from the uncompressed stream into a bz2-compressed one.
   with pa.ipc.open_stream(str(big_dataset)) as istream:
       with pa.OSFile(str(compressed_dataset), 'wb') as output_file:
           with pa.CompressedOutputStream(output_file, compression='bz2') as ostream:
               with pa.RecordBatchStreamWriter(ostream, schema) as writer:
                   # Iterating the reader stops cleanly at the end of the stream.
                   for batch in istream:
                       writer.write_batch(batch)
                   print('done')
   
   >> done
   #%%
   !ls -lh ~/tmp/train.arrow.bz2
   >> -rw-rw-r-- 1 yzapols yzapols 2.4G Nov 11 17:54 ~/tmp/train.arrow.bz2
   #%%
   !mv ~/tmp/train.arrow ~/tmp/train.arrow.old
   #%%
   !bunzip2 -k ~/tmp/train.arrow.bz2
   #%%
   !ls -lh ~/tmp/train.arrow
   >> -rw-rw-r-- 1 yzapols yzapols 16G Nov 11 17:54 ~/tmp/train.arrow
   #%%
   !md5sum  ~/tmp/train.arrow
   >> 2460c2c81c5c8672f4b488cfa2ecd8c1 ~/tmp/train.arrow
   ```
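
The size and checksum comparison in the session above relies on `mv`, `bunzip2 -k`, `ls`, and `md5sum`, which needs room for a second full copy of the file. A minimal sketch of the same check using only the Python standard library follows. The helper names `size_and_md5` and `archive_matches` are hypothetical and not part of pyarrow; the sketch assumes the archive is an ordinary `.bz2` file that the `bz2` module can read.

```python
import bz2
import hashlib


def size_and_md5(path, opener=open, chunk_size=1 << 20):
    """Return (byte_count, md5_hex) of the bytes read through `opener`.

    Pass opener=bz2.open to measure the decompressed content of a .bz2
    file without writing the decompressed data to disk.
    """
    digest = hashlib.md5()
    total = 0
    with opener(path, "rb") as f:
        # Read in fixed-size chunks so memory use stays flat for huge files.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
            total += len(chunk)
    return total, digest.hexdigest()


def archive_matches(original_path, bz2_path):
    """True if the decompressed archive equals the original in size and MD5."""
    return size_and_md5(original_path) == size_and_md5(bz2_path, opener=bz2.open)
```

With the paths from the session, `archive_matches(big_dataset, compressed_dataset)` would be expected to return `False` for the run reported here, since the decompressed size (16G) and checksum differ from the original (28G).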


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] izapolsk closed issue #14624: [Python] CompressedOutputStream can't correctly compress/write files bigger than 16 GB

Posted by GitBox <gi...@apache.org>.

izapolsk closed issue #14624: [Python] CompressedOutputStream can't correctly compress/write files bigger than 16 GB
URL: https://github.com/apache/arrow/issues/14624
