Posted to dev@commons.apache.org by Lasse Collin <la...@tukaani.org> on 2011/08/03 21:22:44 UTC

[compress] XZ support and inconsistencies in the existing compressors

Hi!

I have been working on an XZ data compression implementation in Java
<http://tukaani.org/xz/java.html>. I was told that it could be nice
to get XZ support into Commons Compress.

I looked at the APIs and code in Commons Compress to see how XZ
support could be added. I was especially looking for details where
one would need to be careful to make different compressors behave
consistently compared to each other. I found a few possible problems
in the existing code:

(1) CompressorOutputStream should have finish(). Now
    BZip2CompressorOutputStream has finish() but
    GzipCompressorOutputStream doesn't. This should be easy to
    fix because java.util.zip.GZIPOutputStream supports finish();
    see the sketch after this list.

(2) BZip2CompressorOutputStream.flush() calls out.flush() but it
    doesn't flush data buffered by BZip2CompressorOutputStream.
    Thus not all data written to the Bzip2 stream will be available
    in the underlying output stream after flushing. This kind of
    flush() implementation doesn't seem very useful.

    GzipCompressorOutputStream.flush() is the default version
    from OutputStream and thus does nothing. Adding flush()
    into GzipCompressorOutputStream is hard because
    java.util.zip.GZIPOutputStream and java.util.zip.Deflater don't
    support sync flushing before Java 7. To get Gzip flushing in
    older Java versions one might need a complete reimplementation
    of the Deflate algorithm which isn't necessarily practical.

(3) BZip2CompressorOutputStream has finalize() that finishes a stream
    that hasn't been explicitly finished or closed. This doesn't seem
    useful. GzipCompressorOutputStream doesn't have an equivalent
    finalize().

(4) The decompressor streams don't support concatenated .gz and .bz2
    files. This can be OK when compressed data is used inside another
    file format or protocol, but with regular (standalone) .gz and
    .bz2 files it is bad to stop after the first compressed stream
    and silently ignore the remaining compressed data.

    Fixing this in BZip2CompressorInputStream should be relatively
    easy because it stops right after the last byte of the compressed
    stream. Fixing GzipCompressorInputStream is harder because the
    problem is inherited from java.util.zip.GZIPInputStream
    which reads input past the end of the first stream. One
    might need to reimplement .gz container support on top of
    java.util.zip.InflaterInputStream or java.util.zip.Inflater.
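
To illustrate (1), here is a minimal sketch of what I mean. The class
name is made up; java.util.zip.GZIPOutputStream inherits finish() from
DeflaterOutputStream, so the method only needs to be exposed:

    import java.io.IOException;
    import java.io.OutputStream;
    import java.util.zip.GZIPOutputStream;

    // Sketch only: expose finish() on a gzip compressor stream by
    // delegating to java.util.zip.GZIPOutputStream.finish().
    public class FinishableGzipOutputStream extends OutputStream {
        private final GZIPOutputStream out;

        public FinishableGzipOutputStream(OutputStream raw)
                throws IOException {
            out = new GZIPOutputStream(raw);
        }

        public void write(int b) throws IOException {
            out.write(b);
        }

        public void write(byte[] b, int off, int len) throws IOException {
            out.write(b, off, len);
        }

        // Writes the remaining compressed data and the gzip trailer
        // without closing the underlying stream.
        public void finish() throws IOException {
            out.finish();
        }

        public void close() throws IOException {
            out.close();
        }
    }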

The XZ compressor supports finish() and flush(). The XZ decompressor
supports concatenated .xz files, but there is also a single-stream
version that behaves similarly to the current version of
BZip2CompressorInputStream.
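
To give an idea of the API, a round trip currently looks like the
sketch below. The class names are from the current development
snapshot and may still change:

    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import org.tukaani.xz.LZMA2Options;
    import org.tukaani.xz.XZInputStream;
    import org.tukaani.xz.XZOutputStream;

    public class XZRoundTrip {
        public static void main(String[] args) throws Exception {
            // Compress with the default LZMA2 preset.
            FileOutputStream file = new FileOutputStream("hello.xz");
            XZOutputStream out = new XZOutputStream(file,
                    new LZMA2Options());
            out.write("Hello".getBytes("US-ASCII"));
            out.finish();
            out.close();

            // Decompress; XZInputStream handles concatenated streams.
            XZInputStream in = new XZInputStream(
                    new FileInputStream("hello.xz"));
            int b;
            while ((b = in.read()) != -1) {
                System.out.print((char) b);
            }
            in.close();
        }
    }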

Assuming that there will be some interest in adding XZ support into
Commons Compress, is it OK to make Commons Compress depend on the XZ
package org.tukaani.xz, or should the XZ code be modified so that
it could be included as an internal part in Commons Compress? I
would prefer depending on org.tukaani.xz because then there is
just one code base to keep up to date.

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode



Re: [compress] XZ support and inconsistencies in the existing compressors

Posted by Simone Tripodi <si...@apache.org>.
Hi Lasse!
I'd personally like it if you could file an issue in JIRA and submit
your XZ implementation as a patch that naturally fits into the
org.apache.commons.compress package, and then continue contributing
to its maintenance - depending on an external package would probably
be more difficult, since commons components are generally
self-contained and don't depend on third-party libraries - unless
those are commons components themselves.

Take what I said strictly as a personal suggestion; I'm not involved
in [compress] development, so I'll let the maintainers take the
decisions.

Have a nice day, all the best!
Simo

http://people.apache.org/~simonetripodi/
http://www.99soft.org/





Re: [compress] XZ support and inconsistencies in the existing compressors

Posted by Stefan Bodewig <bo...@apache.org>.
On 2011-08-04, Lasse Collin wrote:

> On 2011-08-04 Stefan Bodewig wrote:
>> On 2011-08-04, Lasse Collin wrote:

>>>> Is this <https://issues.apache.org/jira/browse/COMPRESS-146>?

>>> Yes. I didn't check the suggested fix though.

>> Would be nice if you'd find the time to do so.

> It uses in.available() == 0. It duplicates the test for "BZh" magic
> bytes and a little more from init() into complete(). I think this bug
> can be fixed in a nicer way.

Patches welcome ;-)

> Is there a need to have a bzip2 decompressor that does stop after the
> first stream (like the current code does)? Maybe .zip needs it?

Currently .zip doesn't use bzip2 at all and I don't think it will do so
before 2.x as I'd like to rework the API so that people could add their
own compression/encryption algos.  In JIRA there is at least one entry
where somebody has a company-owned implementation of one of the
compression algos (can't recall the details) and would like to hook that
into ZIP.

I see us defining a more generic Encoder/Decoder API, maybe similar to
java.util.zip.Deflater/Inflater, and using that inside ZIP, basing a
BZIP2 implementation on the current codebase.
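
To make that a bit more concrete, the shape I have in mind is
something like the following - all names purely illustrative, nothing
is decided:

    // Illustrative only, not a decided API.
    public interface Encoder {
        /** Provides more uncompressed input. */
        void setInput(byte[] buf, int off, int len);

        /** Compresses into the given buffer; returns bytes produced. */
        int encode(byte[] buf, int off, int len);

        /** Signals that no more input will be given. */
        void finish();

        /** True once all output of the finished input has been read. */
        boolean isFinished();
    }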

>> We'll need standalone compressors for other formats as well (and we do
>> need LZMA 8-).  Some of the options your code provides might be
>> interesting for the ZIP package as well when we want to implement some
>> of the other supported methods.

> The .lzma format is legacy. While it may have some uses, people should
> usually move to .xz and LZMA2.

But they may already have to deal with .lzma files because they exist,
or because a process exists that requires them to write .lzmas.  I just
read that .tar.lzma could be used inside Debian packages.

> The .zip format has LZMA marked as "Early Feature Specification". Minor
> details are a little bit weird. For example, it requires storing the
> LZMA SDK version that was used for compression (what if you don't use
> the unmodified LZMA SDK?).

A lot of things inside the ZIP spec are "a little bit weird".  The
problem I had with Java7's interpretation of the APPNOTE when it comes
to data descriptors is such a symptom.  It turned out there is a comment
stretching more than 70 lines in InfoZIP's code explaining their
interpretation and why they do so.  Well worth reading (zipfile.c in
zip30's source code, lines 5527ff, sorry no public source code repo I
could point to).  It contains the lines

    /* This is rather klugy as the AppNote handles this poorly.
       This was the old thought:
       After discussions with other groups this is the current thinking:
       Apparent industry interpretation for data descriptors:

> What else needs LZMA? Do you plan .7z support?

Eventually.  There is a feature request for it.  It would probably be
best to "simply" base it on the public domain 7Zip SDK.

Stefan



Re: [compress] XZ support and inconsistencies in the existing compressors

Posted by Lasse Collin <la...@tukaani.org>.
On 2011-08-04 Stefan Bodewig wrote:
> On 2011-08-04, Lasse Collin wrote:
> > Using bits from the end of stream magic doesn't make sense, because
> > then one would be forced to finish the stream. Using the bits from
> > the block header magic means that one must add at least one more
> > block. This is fine if the application will want to encode at least
> > one more byte. If the application calls close() right after
> > flushing, then there's a problem unless .bz2 format allows empty
> > blocks. I get a feeling from the code that .bz2 would support empty
> > blocks, but I'm not sure at all.
> 
> It should be possible to write some unit tests to see what works and
> to create some test archives for interop testing with native tools.

Maybe, if it is possible to even create such files.

Making flush() equivalent to finish() (except that one can continue
after flush()) with bzip2 sounds much lazier and safer, even if it can
create its own problems too.

> >>> (4) The decompressor streams don't support concatenated .gz
> >>> and .bz2 files. This can be OK when compressed data is used inside
> >>>     another file format or protocol, but with regular
> >>>     (standalone) .gz and .bz2 files it is bad to stop after the
> >>>     first compressed stream and silently ignore the remaining
> >>>     compressed data.
> 
> >>>     Fixing this in BZip2CompressorInputStream should be relatively
> >>>     easy because it stops right after the last byte of the
> >>>     compressed stream.
> 
> >> Is this <https://issues.apache.org/jira/browse/COMPRESS-146>?
> 
> > Yes. I didn't check the suggested fix though.
> 
> Would be nice if you'd find the time to do so.

It uses in.available() == 0. It duplicates the test for "BZh" magic
bytes and a little more from init() into complete(). I think this bug
can be fixed in a nicer way.

Is there a need to have a bzip2 decompressor that does stop after the
first stream (like the current code does)? Maybe .zip needs it?

> We'll need standalone compressors for other formats as well (and we do
> need LZMA 8-).  Some of the options your code provides might be
> interesting for the ZIP package as well when we want to implement some
> of the other supported methods.

The .lzma format is legacy. While it may have some uses, people should
usually move to .xz and LZMA2.

The .zip format has LZMA marked as "Early Feature Specification". Minor
details are a little bit weird. For example, it requires storing the
LZMA SDK version that was used for compression (what if you don't use
the unmodified LZMA SDK?).

What else needs LZMA? Do you plan .7z support?

> If you need help with publishing your package to a Maven repository -
> some of your users will ask for it sooner or later - I know where to
> find people who can help.

Thanks.

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode



Re: [compress] XZ support and inconsistencies in the existing compressors

Posted by Stefan Bodewig <bo...@apache.org>.
On 2011-08-04, Lasse Collin wrote:

> On 2011-08-04 Stefan Bodewig wrote:

>> This is in a big part due to the history of Commons Compress which
>> combined several different codebases with separate APIs and provided a
>> first attempt to layer a unifying API on top of it.  We are aware of
>> quite a few problems and want to address them in Commons Compress 2.x
>> and it would be really great if you would participate in the design of
>> the new APIs once that discussion kicks off.

> I'm not sure how much I can help, but I can try (depending on how
> much time I have).

Thanks.

>> On 2011-08-03, Lasse Collin wrote:

>>> (2) BZip2CompressorOutputStream.flush() calls out.flush() but it
>>>     doesn't flush data buffered by BZip2CompressorOutputStream.
>>>     Thus not all data written to the Bzip2 stream will be available
>>>     in the underlying output stream after flushing. This kind of
>>>     flush() implementation doesn't seem very useful.

>> Agreed, do you want to open a JIRA issue for this?

> There is already this:

>     https://issues.apache.org/jira/browse/COMPRESS-42

Ahh, I knew I once fiddled with flush there but a quick grep through the
changes file didn't show anything - because it was before the 1.0
release.

> I tried to understand how flushing could be done properly. I'm not
> really familiar with bzip2 so the following might have errors.

As I already said, neither of us is terribly familiar with the format
right now.  I for one didn't even know you could have multiple streams
in a single file so it took your mail for me to make sense out of
COMPRESS-146.

> I checked libbzip2 and how its BZ_FLUSH works. It finishes the block,
> but it doesn't flush the last bits, and thus the complete block isn't
> available in the output stream. The blocks in the .bz2 format aren't
> aligned to full bytes, and there is no padding between blocks.

> The lack of alignment makes flushing tricky. One may need to write out
> up to seven bits of data from the future. The bright side is that those
> future bits can only come from the block header magic or from the end
> of stream magic. Both are constants, so there are only two
> possibilities for what those seven bits can be.

> Using bits from the end of stream magic doesn't make sense, because then
> one would be forced to finish the stream. Using the bits from the
> block header magic means that one must add at least one more block.
> This is fine if the application will want to encode at least one more
> byte. If the application calls close() right after flushing, then
> there's a problem unless .bz2 format allows empty blocks. I get a
> feeling from the code that .bz2 would support empty blocks, but I'm not
> sure at all.

It should be possible to write some unit tests to see what works and to
create some test archives for interop testing with native tools.
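
A first test could be as simple as the sketch below (JUnit 4 assumed).
Whether flush() followed by close() really yields a stream that the
decoder and the native tools accept is exactly what we'd be probing:

    import static org.junit.Assert.assertEquals;

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream;
    import org.apache.commons.compress.compressors.bzip2.BZip2CompressorOutputStream;
    import org.junit.Test;

    public class BZip2FlushTest {

        @Test
        public void flushThenCloseIsDecodable() throws Exception {
            ByteArrayOutputStream baos = new ByteArrayOutputStream();
            BZip2CompressorOutputStream bz2 =
                new BZip2CompressorOutputStream(baos);
            bz2.write("hello".getBytes("US-ASCII"));
            bz2.flush();  // the behavior under test
            bz2.close();  // must not corrupt the stream after a flush

            BZip2CompressorInputStream in =
                new BZip2CompressorInputStream(
                    new ByteArrayInputStream(baos.toByteArray()));
            byte[] buf = new byte[5];
            int n = 0, r;
            while (n < buf.length
                   && (r = in.read(buf, n, buf.length - n)) != -1) {
                n += r;
            }
            assertEquals("hello", new String(buf, 0, n, "US-ASCII"));
        }
    }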

>>> (4) The decompressor streams don't support concatenated .gz and .bz2
>>>     files. This can be OK when compressed data is used inside
>>>     another file format or protocol, but with regular
>>>     (standalone) .gz and .bz2 files it is bad to stop after the
>>>     first compressed stream and silently ignore the remaining
>>>     compressed data.

>>>     Fixing this in BZip2CompressorInputStream should be relatively
>>>     easy because it stops right after the last byte of the
>>>     compressed stream.

>> Is this <https://issues.apache.org/jira/browse/COMPRESS-146>?

> Yes. I didn't check the suggested fix though.

Would be nice if you'd find the time to do so.

>>>     Fixing GzipCompressorInputStream is harder because the problem
>>>     is inherited from java.util.zip.GZIPInputStream which reads
>>>     input past the end of the first stream. One might need to
>>>     reimplement .gz container support on top of
>>>     java.util.zip.InflaterInputStream or java.util.zip.Inflater.

>> Sounds doable but would need somebody to code it, I guess ;-)

> There is a slightly hackish solution in the comments of the
> following bug report, but it lacks license information:

>     http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4691425

Yes.  I agree it is hacky.

>> In the past we have incorporated external codebases (ar and cpio) that
>> used to be under compatible licenses to make things simpler for our
>> users, but if you prefer to develop your code base outside of Commons
>> Compress then I can fully understand that.

> I will develop it in my own tree, but it's possible to include a copy
> in Commons Compress with modified "package" and "import" lines in the
> source files. Changes in my tree would need to be copied to Commons
> Compress now and then. I don't know if this is better than having an
> external dependency.

Don't know either.  It depends on who'd do the work of syncing, I guess.

> org.tukaani.xz will include features that aren't necessarily interesting
> in Commons Compress, for example, advanced compression options and
> random access reading. Most developers probably won't care about these.

We'll need standalone compressors for other formats as well (and we do
need LZMA 8-).  Some of the options your code provides might be
interesting for the ZIP package as well when we want to implement some
of the other supported methods.

>> From the dependency management POV I know many
>> developers prefer dependencies that are available from a Maven
>> repository, is this the case for the org.tukaani.xz package (I'm too
>> lazy to check)?

> There is only build.xml for Ant.

If you need help with publishing your package to a Maven repository -
some of your users will ask for it sooner or later - I know where to
find people who can help.

Stefan



Re: [compress] XZ support and inconsistencies in the existing compressors

Posted by Lasse Collin <la...@tukaani.org>.
On 2011-08-04 Stefan Bodewig wrote:
> On 2011-08-03, Lasse Collin wrote:
> > I looked at the APIs and code in Commons Compress to see how XZ
> > support could be added. I was especially looking for details where
> > one would need to be careful to make different compressors behave
> > consistently compared to each other.
> 
> This is in a big part due to the history of Commons Compress which
> combined several different codebases with separate APIs and provided a
> first attempt to layer a unifying API on top of it.  We are aware of
> quite a few problems and want to address them in Commons Compress 2.x
> and it would be really great if you would participate in the design of
> the new APIs once that discussion kicks off.

I'm not sure how much I can help, but I can try (depending on how
much time I have).

> > (2) BZip2CompressorOutputStream.flush() calls out.flush() but it
> >     doesn't flush data buffered by BZip2CompressorOutputStream.
> >     Thus not all data written to the Bzip2 stream will be available
> >     in the underlying output stream after flushing. This kind of
> >     flush() implementation doesn't seem very useful.
> 
> Agreed, do you want to open a JIRA issue for this?

There is already this:

    https://issues.apache.org/jira/browse/COMPRESS-42

I tried to understand how flushing could be done properly. I'm not
really familiar with bzip2 so the following might have errors.

I checked libbzip2 and how its BZ_FLUSH works. It finishes the block,
but it doesn't flush the last bits, and thus the complete block isn't
available in the output stream. The blocks in the .bz2 format aren't
aligned to full bytes, and there is no padding between blocks.

The lack of alignment makes flushing tricky. One may need to write out
up to seven bits of data from the future. The bright side is that those
future bits can only come from the block header magic or from the end
of stream magic. Both are constants, so there are only two
possibilities for what those seven bits can be.

Using bits from the end of stream magic doesn't make sense, because then
one would be forced to finish the stream. Using the bits from the
block header magic means that one must add at least one more block.
This is fine if the application will want to encode at least one more
byte. If the application calls close() right after flushing, then
there's a problem unless .bz2 format allows empty blocks. I get a
feeling from the code that .bz2 would support empty blocks, but I'm not
sure at all.

Since bzip2 works on blocks that are compressed independently from each
other, the compression ratio doesn't get a big penalty if the stream is
finished and then a new stream is started. This would make it much
simpler to implement flushing. The downside is that implementations
that don't support decoding concatenated .bz2 files will stop after
the first stream.
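
A rough sketch of that finish-and-restart approach is below. The
wrapper class is made up; only BZip2CompressorOutputStream and its
finish() are real. Restarting lazily means a close() right after
flush() doesn't have to emit an empty stream:

    import java.io.IOException;
    import java.io.OutputStream;
    import org.apache.commons.compress.compressors.bzip2.BZip2CompressorOutputStream;

    public class RestartingBZip2OutputStream extends OutputStream {
        private final OutputStream raw;
        private BZip2CompressorOutputStream bz2;

        public RestartingBZip2OutputStream(OutputStream raw)
                throws IOException {
            this.raw = raw;
            this.bz2 = new BZip2CompressorOutputStream(raw);
        }

        public void write(int b) throws IOException {
            if (bz2 == null) { // restart after a flush()
                bz2 = new BZip2CompressorOutputStream(raw);
            }
            bz2.write(b);
        }

        public void flush() throws IOException {
            if (bz2 != null) {
                bz2.finish(); // ends at a byte boundary, safe to flush
                bz2 = null;
            }
            raw.flush();
        }

        public void close() throws IOException {
            if (bz2 != null) {
                bz2.close(); // finishes and closes the raw stream
            } else {
                raw.close();
            }
        }
    }

A decoder that handles concatenated streams then sees everything
written before the flush.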

> > (4) The decompressor streams don't support concatenated .gz and .bz2
> >     files. This can be OK when compressed data is used inside
> >     another file format or protocol, but with regular
> >     (standalone) .gz and .bz2 files it is bad to stop after the
> >     first compressed stream and silently ignore the remaining
> >     compressed data.
> 
> >     Fixing this in BZip2CompressorInputStream should be relatively
> >     easy because it stops right after the last byte of the
> >     compressed stream.
> 
> Is this <https://issues.apache.org/jira/browse/COMPRESS-146>?

Yes. I didn't check the suggested fix though.

> >     Fixing GzipCompressorInputStream is harder because the problem
> >     is inherited from java.util.zip.GZIPInputStream which reads
> >     input past the end of the first stream. One might need to
> >     reimplement .gz container support on top of
> >     java.util.zip.InflaterInputStream or java.util.zip.Inflater.
> 
> Sounds doable but would need somebody to code it, I guess ;-)

There is a slightly hackish solution in the comments of the following
bug report, but it lacks license information:

    http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4691425

> In the past we have incorporated external codebases (ar and cpio) that
> used to be under compatible licenses to make things simpler for our
> users, but if you prefer to develop your code base outside of Commons
> Compress then I can fully understand that.

I will develop it in my own tree, but it's possible to include a copy
in Commons Compress with modified "package" and "import" lines in the
source files. Changes in my tree would need to be copied to Commons
Compress now and then. I don't know if this is better than having an
external dependency.

org.tukaani.xz will include features that aren't necessarily interesting
in Commons Compress, for example, advanced compression options and
random access reading. Most developers probably won't care about these.

(The above answers Simone Tripodi's message too.)

> From the dependency management POV I know many
> developers prefer dependencies that are available from a Maven
> repository, is this the case for the org.tukaani.xz package (I'm too
> lazy to check)?

There is only build.xml for Ant.

> Also I would have a problem with an external dependency on code that
> says "The APIs aren't completely stable yet".  Any tentative timeframe
> as to when you expect to have a stable API?  It might match our
> schedule for 2.x so we could target that release rather than 1.3.

It needs to be stable in 2-4 weeks or so. I need to get feedback about
the API first. I think I will get some feedback next week. More people
giving feedback would naturally be welcome. ;-)

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode



Re: [compress] XZ support and inconsistencies in the existing compressors

Posted by Stefan Bodewig <bo...@apache.org>.
Hi Lasse and welcome

On 2011-08-03, Lasse Collin wrote:

> I have been working on an XZ data compression implementation in Java
> <http://tukaani.org/xz/java.html>. I was told that it could be nice
> to get XZ support into Commons Compress.

Sounds interesting.

> I looked at the APIs and code in Commons Compress to see how XZ
> support could be added. I was especially looking for details where
> one would need to be careful to make different compressors behave
> consistently compared to each other.

This is in a big part due to the history of Commons Compress which
combined several different codebases with separate APIs and provided a
first attempt to layer a unifying API on top of it.  We are aware of
quite a few problems and want to address them in Commons Compress 2.x
and it would be really great if you would participate in the design of
the new APIs once that discussion kicks off.

Right now I myself am pretty busy implementing ZIP64 support for a 1.3
release of Commons Compress and intend to start the 2.x discussion once
this is done - which is (combined with some scheduled offline time)
about a month away for me.

I should also mention that right now probably no active committer
understands the bzip2 code well enough to make significant changes at
all.  I know that I don't.

> I found a few possible problems in the existing code:

> (1) CompressorOutputStream should have finish(). Now
>     BZip2CompressorOutputStream has finish() but
>     GzipCompressorOutputStream doesn't. This should be easy to
>     fix because java.util.zip.GZIPOutputStream supports finish().

+1

This is a good point we should earmark for 2.0 - doing so for 1.x would
break the API which we try to avoid.

> (2) BZip2CompressorOutputStream.flush() calls out.flush() but it
>     doesn't flush data buffered by BZip2CompressorOutputStream.
>     Thus not all data written to the Bzip2 stream will be available
>     in the underlying output stream after flushing. This kind of
>     flush() implementation doesn't seem very useful.

Agreed, do you want to open a JIRA issue for this?

>     GzipCompressorOutputStream.flush() is the default version
>     from OutputStream and thus does nothing. Adding flush()
>     into GzipCompressorOutputStream is hard because
>     java.util.zip.GZIPOutputStream and java.util.zip.Deflater don't
>     support sync flushing before Java 7. To get Gzip flushing in
>     older Java versions one might need a complete reimplementation
>     of the Deflate algorithm which isn't necessarily practical.

Not really desirable, I agree.  As for Java7, we currently target Java5
but it might be possible to hack in flush support using reflection.  So
we could support sync flushing if the current Java classlib supports it.
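
Roughly like the following, I imagine (untested; the constant 2 is
what I'd expect Java7's Deflater.SYNC_FLUSH to be):

    import java.lang.reflect.Method;
    import java.util.zip.Deflater;

    // Sketch: detect Java7's Deflater.deflate(byte[], int, int, int)
    // overload via reflection, fall back gracefully on Java5/6.
    public final class SyncFlushHelper {
        // Assumed value of Deflater.SYNC_FLUSH in Java7.
        private static final int SYNC_FLUSH = 2;
        private static final Method DEFLATE_WITH_FLUSH;

        static {
            Method m = null;
            try {
                m = Deflater.class.getMethod("deflate", byte[].class,
                        int.class, int.class, int.class);
            } catch (NoSuchMethodException e) {
                // Java5/6: sync flush is not available.
            }
            DEFLATE_WITH_FLUSH = m;
        }

        private SyncFlushHelper() {
        }

        public static boolean isAvailable() {
            return DEFLATE_WITH_FLUSH != null;
        }

        // Deflates with SYNC_FLUSH; only call when isAvailable().
        public static int syncFlush(Deflater def, byte[] buf, int off,
                int len) throws Exception {
            return (Integer) DEFLATE_WITH_FLUSH.invoke(def, buf, off,
                    len, SYNC_FLUSH);
        }
    }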

> (3) BZip2CompressorOutputStream has finalize() that finishes a stream
>     that hasn't been explicitly finished or closed. This doesn't seem
>     useful. GzipCompressorOutputStream doesn't have an equivalent
>     finalize().

Removing it could cause backwards compatibility issues.  I agree it is
unnecessary but would leave fixing it to the point where we are willing
to break compatibility - i.e. 2.0.  This is in the same category as 
<https://issues.apache.org/jira/browse/COMPRESS-128> to me.

> (4) The decompressor streams don't support concatenated .gz and .bz2
>     files. This can be OK when compressed data is used inside another
>     file format or protocol, but with regular (standalone) .gz and
>     .bz2 files it is bad to stop after the first compressed stream
>     and silently ignore the remaining compressed data.

>     Fixing this in BZip2CompressorInputStream should be relatively
>     easy because it stops right after the last byte of the compressed
>     stream.

Is this <https://issues.apache.org/jira/browse/COMPRESS-146>?

>     Fixing GzipCompressorInputStream is harder because the problem is
>     inherited from java.util.zip.GZIPInputStream which reads input
>     past the end of the first stream. One might need to reimplement
>     .gz container support on top of java.util.zip.InflaterInputStream
>     or java.util.zip.Inflater.

Sounds doable but would need somebody to code it, I guess ;-)
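
The skeleton might look roughly like the following sketch.
parseHeader() and verifyTrailer() are left abstract because reading
the RFC 1952 header and trailer - partly out of the bytes the Inflater
has over-read (inf.getRemaining()) - is exactly the fiddly part:

    import java.io.EOFException;
    import java.io.IOException;
    import java.io.InputStream;
    import java.util.zip.DataFormatException;
    import java.util.zip.Inflater;

    // Rough sketch only. The Inflater runs in raw-deflate ("nowrap")
    // mode, so .gz member headers/trailers must be read by hand; the
    // bookkeeping for bytes left over in inf.getRemaining() is
    // deliberately omitted here.
    public abstract class MultiMemberGzipInputStream extends InputStream {
        protected final InputStream in;
        protected final Inflater inf = new Inflater(true);
        protected final byte[] buf = new byte[8192];
        private boolean eof = false;

        public MultiMemberGzipInputStream(InputStream in)
                throws IOException {
            this.in = in;
            if (!parseHeader()) {
                throw new IOException("not a .gz stream");
            }
        }

        public int read() throws IOException {
            byte[] one = new byte[1];
            return read(one, 0, 1) == -1 ? -1 : (one[0] & 0xff);
        }

        public int read(byte[] b, int off, int len) throws IOException {
            if (eof || len == 0) {
                return eof ? -1 : 0;
            }
            while (true) {
                try {
                    int n = inf.inflate(b, off, len);
                    if (n > 0) {
                        return n;
                    }
                } catch (DataFormatException e) {
                    throw new IOException("corrupt .gz stream");
                }
                if (inf.finished()) {
                    verifyTrailer();      // CRC32 + ISIZE of the member
                    if (!parseHeader()) { // no next member: real EOF
                        eof = true;
                        return -1;
                    }
                    inf.reset();          // decode the next member
                } else if (inf.needsInput()) {
                    int n = in.read(buf);
                    if (n == -1) {
                        throw new EOFException("truncated .gz stream");
                    }
                    inf.setInput(buf, 0, n);
                }
            }
        }

        // Hypothetical: reads one RFC 1952 member header; returns
        // false on a clean EOF before a new member starts.
        protected abstract boolean parseHeader() throws IOException;

        // Hypothetical: checks the 8-byte member trailer.
        protected abstract void verifyTrailer() throws IOException;
    }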

> The XZ compressor supports finish() and flush(). The XZ decompressor
> supports concatenated .xz files, but there is also a single-stream
> version that behaves similarly to the current version of
> BZip2CompressorInputStream.

I think in the 1.x timeframe users who know they are using XZ would
simply bypass the Commons Compress interfaces, like they'd do now if
they wanted to flush the bzip2 stream.  The main difference here is
that they likely wouldn't need to use Commons Compress at all but
could use your XZ package directly in that case.  They don't have
that choice with bzip2.

> Assuming that there will be some interest in adding XZ support into
> Commons Compress, is it OK to make Commons Compress depend on the XZ
> package org.tukaani.xz, or should the XZ code be modified so that
> it could be included as an internal part in Commons Compress?

> I would prefer depending on org.tukaani.xz because then there is just
> one code base to keep up to date.

In the past we have incorporated external codebases (ar and cpio) that
used to be under compatible licenses to make things simpler for our
users, but if you prefer to develop your code base outside of Commons
Compress then I can fully understand that.

From a license POV we obviously wouldn't have any problems with your
public domain code.  From the dependency management POV I know many
developers prefer dependencies that are available from a Maven
repository, is this the case for the org.tukaani.xz package (I'm too
lazy to check)?  I'm an Ant person myself, but you know there are those
people who love repositories ...

Also I would have a problem with an external dependency on code that
says "The APIs aren't completely stable yet".  Any tentative timeframe
as to when you expect to have a stable API?  It might match our schedule
for 2.x so we could target that release rather than 1.3.

Cheers

        Stefan
