Posted to user@commons.apache.org by He Shiming <he...@gmail.com> on 2014/05/25 08:43:22 UTC

[compress] Decompressing bzip2 binary produced by Python bz2?

Dear Community,

I'm porting a Python program to Java. The Python program makes use of bz2 (
https://docs.python.org/2/library/bz2.html) to compress the input into a
binary buffer. Its original decompressing code is:

decompressor = bz2.BZ2Decompressor()
decompressor.decompress(buffer)

where buffer is produced by file('binary', 'rb').read(). When using the
bzip2 decompressor in Apache Commons Compress:

BZip2CompressorInputStream bis = new BZip2CompressorInputStream(in);

throws an exception: "Stream is not in the BZip2 format". I checked
the binary buffer, and it does not have a header. It's not a 'bz2' file,
only a buffer segment. According to
http://commons.apache.org/proper/commons-compress/apidocs/src-html/org/apache/commons/compress/compressors/bzip2/BZip2CompressorInputStream.html,
the exception is thrown while checking for a 'bz2' file header of 'BZh' +
'1'. On top of that, there appear to be other segment headers that it requires.

How do I get around this? Can I decompress a buffer directly, and not a bz2
file?

-- 
Best regards,
He Shiming

Re: [compress] Decompressing bzip2 binary produced by Python bz2?

Posted by Stefan Bodewig <bo...@apache.org>.
On 2014-05-25, He Shiming wrote:

> According to
> http://commons.apache.org/proper/commons-compress/apidocs/src-html/org/apache/commons/compress/compressors/bzip2/BZip2CompressorInputStream.html,
> the exception is thrown while detecting a 'bz2' file header of 'BZh' +
> '1'. On top of that, there appears to be other segment headers it
> require.

The 1 is the block size (in units of 100kB) and can be any number
between 1 and 9.  This information is crucial for BZip2 to work
properly.  Since this format compresses whole blocks, the minimum
amount of data you can decompress is one such block, including all its
metadata such as the Huffman tables used.  It is impossible to start
decompression in the middle of a block.
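
To make the header layout concrete, here is a minimal Python 3 sketch
using only the standard bz2 module.  It checks whether a buffer starts
with the 'BZh' stream header and reads the block size digit.  The helper
name describe_bzip2_header is made up for illustration, and the sketch
assumes the buffer is meant to begin at the very start of a stream.

```python
import bz2

# Magic number that opens every compressed block (the BCD digits of pi).
BLOCK_MAGIC = bytes.fromhex("314159265359")


def describe_bzip2_header(buf):
    """Return the block size digit (1-9) if buf starts with a bzip2 stream
    header, or None if it does not. Hypothetical helper for illustration."""
    if len(buf) < 4 or buf[:3] != b"BZh":
        return None
    level = buf[3] - ord("0")
    return level if 1 <= level <= 9 else None


if __name__ == "__main__":
    sample = bz2.compress(b"hello bzip2" * 100, 1)
    # Block size digit taken from the stream header.
    print(describe_bzip2_header(sample))
    # For non-empty input the first block starts right after the header.
    print(sample[4:10] == BLOCK_MAGIC)
    # With the header stripped, the buffer is no longer recognised.
    print(describe_bzip2_header(sample[4:]))
```

If a buffer fails this check, dumping its first few bytes is a quick way
to see whether the header is missing or whether other data precedes it.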

In addition, Compress' API won't allow you to start decompressing
anywhere but at the very start of the file.  It wouldn't be too hard to
add a different mode to BZip2CompressorInputStream that is told the
block size up front and starts working on a full compressed buffer,
but this is not possible without modifying the class itself.

Basically you'd need to add a new constructor accepting a stream and
the block size as arguments, manually set a few member variables that
otherwise would get set in init, and proceed to initBlock immediately.
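
A related idea that avoids patching the class is to supply the missing
header yourself.  The following Python 3 sketch illustrates it under
strong assumptions: the buffer is a complete bzip2 stream from which only
the 4-byte stream header was removed, and the block size used by the
compressor is known.  The helper name decompress_headerless is
hypothetical.  Whether the same trick helps on the Java side depends on
what the buffer really contains, so treat this as an experiment rather
than a fix.

```python
import bz2


def decompress_headerless(buf, block_size=9):
    """Decompress a bzip2 stream whose 4-byte stream header is missing.

    Assumption: buf starts at the first block and runs to the end of the
    stream, and block_size (1-9) matches what the compressor used.
    """
    if not 1 <= block_size <= 9:
        raise ValueError("block_size must be between 1 and 9")
    # Rebuild the stream header: 'BZh' followed by the block size digit.
    header = b"BZh" + str(block_size).encode("ascii")
    return bz2.BZ2Decompressor().decompress(header + buf)


if __name__ == "__main__":
    original = b"some payload " * 1000
    stream = bz2.compress(original, 9)
    headerless = stream[4:]  # simulate the missing header
    print(decompress_headerless(headerless, 9) == original)
```

If the buffer does not begin at a block boundary, the decompressor will
reject it, which is consistent with the point above that decompression
cannot start in the middle of a block.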

Stefan
