Posted to dev@commons.apache.org by Stefan Bodewig <bo...@apache.org> on 2011/08/06 06:40:24 UTC

[compress] Need API Feedback/Advice for ZipArchiveOutputStream and ZIP64

Hi,

there are eight possible permutations of compressed/uncompressed entries
that get written to a seekable-/non-seekable stream whose size is either
known or unknown prior to writing them.

One of them is prohibited (uncompressed/non-seekable/unknown size) and
has been prohibited before, no change here.

For six of the remaining seven permutations ZipArchiveOutputStream
should be generating archives that transparently enable ZIP64 features
for entries if and only if they are too big to be stored without ZIP64.
I.e. the resulting archive will either be readable by an implementation
that doesn't support ZIP64 or it contains files that would be too big
for such an implementation anyway.  The price we pay in some cases is
an additional 20 bytes per entry that are never used by anybody.
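
To make the size limit and the 20 bytes concrete, here is a minimal
sketch.  It assumes the extra bytes are the ZIP64 extended information
extra field as laid out in the ZIP application note (2 bytes header ID,
2 bytes length, 8 bytes each for uncompressed and compressed size).
The class and method names are made up for illustration and are not
the Commons Compress API.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Illustrative only: names are hypothetical, not part of Commons Compress.
class Zip64ExtraFieldSketch {

    // 0xFFFFFFFF in a 4-byte size field means "look in the ZIP64 extra field".
    static final long ZIP64_MAGIC = 0xFFFFFFFFL;

    // Header ID of the ZIP64 extended information extra field.
    static final short ZIP64_HEADER_ID = 0x0001;

    // An entry needs ZIP64 once either size no longer fits below the marker value.
    static boolean needsZip64(long size, long compressedSize) {
        return size >= ZIP64_MAGIC || compressedSize >= ZIP64_MAGIC;
    }

    // Builds the extra field as it would appear in a local file header:
    // 2 + 2 + 8 + 8 = 20 bytes, all little-endian.
    static byte[] localHeaderExtra(long size, long compressedSize) {
        ByteBuffer buf = ByteBuffer.allocate(20).order(ByteOrder.LITTLE_ENDIAN);
        buf.putShort(ZIP64_HEADER_ID);
        buf.putShort((short) 16);   // length of the data that follows
        buf.putLong(size);
        buf.putLong(compressedSize);
        return buf.array();
    }

    public static void main(String[] args) {
        long fiveGiB = 5L << 30;
        System.out.println("needs ZIP64: " + needsZip64(fiveGiB, fiveGiB));
        System.out.println("extra field bytes: " + localHeaderExtra(fiveGiB, fiveGiB).length);
    }
}
```

When the size is not known up front, a writer presumably has to reserve
this field before it can tell whether it will be needed, which would
explain why the bytes sometimes go unused.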

The only case that isn't covered so far is compressed / non-seekable
output / input of unknown size.
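
The eight combinations and how they are treated can be written down as
a small table.  This is only a sketch of the classification described
in this message; the class, enum and method names are hypothetical.

```java
// Illustrative only: names are hypothetical, not part of Commons Compress.
class Zip64Scenarios {

    enum Handling { PROHIBITED, TRANSPARENT, CALLER_MUST_DECIDE }

    static Handling classify(boolean compressed, boolean seekable, boolean sizeKnown) {
        if (!compressed && !seekable && !sizeKnown) {
            // The one combination that has always been rejected.
            return Handling.PROHIBITED;
        }
        if (compressed && !seekable && !sizeKnown) {
            // The open case: the writer has to commit to a format up front.
            return Handling.CALLER_MUST_DECIDE;
        }
        // The remaining six: ZIP64 is used if and only if the entry needs it.
        return Handling.TRANSPARENT;
    }

    public static void main(String[] args) {
        boolean[] flags = {true, false};
        for (boolean compressed : flags) {
            for (boolean seekable : flags) {
                for (boolean sizeKnown : flags) {
                    System.out.printf("%-12s %-12s %-13s -> %s%n",
                        compressed ? "compressed" : "uncompressed",
                        seekable ? "seekable" : "non-seekable",
                        sizeKnown ? "size known" : "size unknown",
                        classify(compressed, seekable, sizeKnown));
                }
            }
        }
    }
}
```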

Such entries are stored using a feature that is called the "data
descriptor".  There are two different formats of the data descriptor for
ZIP64 and non-ZIP64 archives and the archive writer has to signal which
type of descriptor it is going to write before it starts writing the
entry's data.
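
For readers who have not looked at the format, here is a sketch of the
two layouts as I understand them from the ZIP application note: both
carry a signature and the CRC-32, but the two size fields are 4 bytes
each in the classic form and 8 bytes each in the ZIP64 form.  The
signature is optional in the spec but commonly written; the names
below are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Illustrative only: names are hypothetical, not part of Commons Compress.
class DataDescriptorSketch {

    static final int SIGNATURE = 0x08074b50;       // "PK\007\010" on disk
    static final long MAX_UINT32 = 0xFFFFFFFFL;

    static byte[] encode(long crc, long compressedSize, long size, boolean zip64) {
        if (!zip64 && (compressedSize > MAX_UINT32 || size > MAX_UINT32)) {
            // Values above 0xFFFFFFFF cannot be represented in a 4-byte field.
            throw new IllegalArgumentException("sizes need the ZIP64 data descriptor");
        }
        // 4 + 4 + 4 + 4 = 16 bytes, or 4 + 4 + 8 + 8 = 24 bytes for ZIP64.
        ByteBuffer buf = ByteBuffer.allocate(zip64 ? 24 : 16).order(ByteOrder.LITTLE_ENDIAN);
        buf.putInt(SIGNATURE);
        buf.putInt((int) crc);
        if (zip64) {
            buf.putLong(compressedSize);
            buf.putLong(size);
        } else {
            buf.putInt((int) compressedSize);
            buf.putInt((int) size);
        }
        return buf.array();
    }

    public static void main(String[] args) {
        System.out.println("classic: " + encode(1L, 2L, 3L, false).length + " bytes");
        System.out.println("ZIP64:   " + encode(1L, 2L, 3L, true).length + " bytes");
    }
}
```

A reader has no way to tell the two layouts apart from the descriptor
itself, which is why the choice has to be announced before the data.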

This means ZipArchiveOutputStream must decide whether it is going to use
the ZIP64 format before it knows whether it would actually need it or
not.  If it signals it is going to use ZIP64 then an implementation that
doesn't support ZIP64 (like Compress 1.2 or java.util.zip) may fail to
read the archive, which is bad if the entry turns out to be smaller than
4GiB.  If it doesn't signal ZIP64 it can't write big entries at all.

This decision can be made at the granularity of a single entry.  I.e. it
is possible to not use ZIP64 for the majority of entries and enable it
for individual entries.

IMHO there is no right or wrong decision here that the library could
make.  The user-code will have to decide whether ZIP64 should be enabled
or not.  The main questions to me are whether we want to attach this
decision to the stream or the entry itself and what the default should
be.

InfoZIP's ZIP has decided to make it an option for the whole archive
(the command line doesn't offer much flexibility here) and make it
default to ZIP64.

My current thinking is that java.util.zip is a likely candidate for the
receiving end of ZIPs we create, so it may be better to turn ZIP64 off
by default, but I'm not sure.

I'm leaning towards adding a setUseZip64(boolean) method at the level of
ZipArchiveOutputStream and making it default to false.  This method could
be called in between putArchiveEntry calls to make it apply selectively
to individual entries.

The name is totally open for debate since as it stands it sounds as if
you could turn off all Zip64 features which I wouldn't want to do for
the cases that can be dealt with transparently.  Then again it could use
a Boolean argument with "null" meaning "do the best you can" and false
"don't even use Zip64 if you think it is safe".

Any ideas?

Stefan

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org


Re: [compress] Need API Feedback/Advice for ZipArchiveOutputStream and ZIP64

Posted by Stefan Bodewig <bo...@apache.org>.
On 2011-08-08, Phil Steitz wrote:

> On 8/6/11 9:30 PM, Stefan Bodewig wrote:
>> On 2011-08-06, Phil Steitz wrote:

>>> On 8/5/11 9:40 PM, Stefan Bodewig wrote:

>>>> InfoZIP's ZIP has decided to make it an option for the whole archive
>>>> (the command line doesn't offer much flexibility here) and make it
>>>> default to ZIP64.
>>>> My current thinking is that java.util.zip is a likely candidate for the
>>>> receiving end of ZIPs we create, so it may be better to turn ZIP64 off
>>>> by default, but I'm not sure.
>>>> I'm leaning towards adding a setUseZip64(boolean) method at the level of
>>>> ZipArchiveOutputStream and make it default to false.  This method could
>>>> be called in between putArchiveEntry calls to make it apply selectively
>>>> to individual entries.
>>> Sounds reasonable.
>>>> The name is totally open for debate since as it stands it sounds as if
>>>> you could turn off all Zip64 features which I wouldn't want to do for
>>>> the cases that can be dealt with transparently.  Then again it could use
>>>> a Boolean argument with "null" meaning "do the best you can" and false
>>>> "don't even use Zip64 if you think it is safe".
>>> I don't get what you mean by "do the best you can."  Does that mean
>>> turn it on when needed if somehow you know it is needed, per entry,
>>> I assume?
>> Actually I was thinking about what the method would mean for the other
>> combinations as well.  A "null" value doesn't make sense for the
>> specific case I'm asking about - here we need to decide what the default
>> should be: Zip64 or not.  For the other cases "null" could mean "use
>> Zip64 if you think you need it" i.e. what is implemented right now,
>> "true" could mean "always use it" and "false" could mean "never use it,
>> throw an exception if you recognize it would be required".

> I guess an alternative would be an enum with values "allowed,"
> "always," and "never," which would work for the "other" cases; but
> may be a little unnatural for the simple case above, where I guess
> you would have to either throw on the "allowed" setting or view it
> as synonymous with "always".

I'm still not used to the fact that Compress is at Java5 now, so I
didn't think of enums.  It feels better than the null Boolean to me.
"asNeeded" could be an alternative to "allowed".
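
A rough sketch of how such an enum could drive the decision for the
"other" cases, where the size is known by the time the decision is
made.  The enum, its constants and the method are hypothetical names
for illustration, and the exception type is an arbitrary choice.

```java
// Illustrative only: names are hypothetical, not the final API.
class Zip64ModeSketch {

    enum Zip64Mode { AS_NEEDED, ALWAYS, NEVER }

    // 0xFFFFFFFF in a 4-byte size field is reserved as the ZIP64 marker.
    static final long ZIP64_MAGIC = 0xFFFFFFFFL;

    static boolean shouldUseZip64(Zip64Mode mode, long size) {
        boolean required = size >= ZIP64_MAGIC;
        switch (mode) {
            case ALWAYS:
                return true;
            case NEVER:
                if (required) {
                    // "never use it, throw an exception if it would be required"
                    throw new IllegalStateException("entry of " + size + " bytes requires ZIP64");
                }
                return false;
            default:
                // AS_NEEDED: "use Zip64 if you think you need it"
                return required;
        }
    }

    public static void main(String[] args) {
        System.out.println(shouldUseZip64(Zip64Mode.AS_NEEDED, 1024L));
        System.out.println(shouldUseZip64(Zip64Mode.AS_NEEDED, 5L << 30));
    }
}
```

The sketch deliberately leaves out the compressed / non-seekable /
unknown size case, since there the size is not available when the
choice has to be made.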

> As a test of whether or not it will work, I would recommend writing
> the javadoc and test cases first (which of course we all do anyway ;)

Absolutely ...

Stefan



Re: [compress] Need API Feedback/Advice for ZipArchiveOutputStream and ZIP64

Posted by Stefan Bodewig <bo...@apache.org>.
On 2011-08-09, Phil Steitz wrote:

> On 8/8/11 11:29 PM, Stefan Bodewig wrote:
>> On 2011-08-08, Phil Steitz wrote:

>>> As a test of whether or not it will work, I would recommend writing
>>> the javadoc and test cases first (which of course we all do anyway ;)
>> Javadocs are there now with svn rev 1155223, in particular
>> <http://svn.apache.org/viewvc/commons/proper/compress/trunk/src/main/java/org/apache/commons/compress/archivers/zip/ZipArchiveOutputStream.java?r1=1154667&r2=1155223>

>> It would be good if anybody who didn't have to dive as deeply inside the
>> ZIP format as myself could read it to ensure it makes some sense.
>> Proof-reading by a native speaker certainly won't hurt either.

> Looks clear enough,

Thanks.

> but it would be good to specify what happens (i.e. what exception or
> failure mode happens) when the setting is not valid.  Looks like only
> asNeeded needs a comment for this.

Yes, I've added some @throws clauses in later commits.

Stefan



Re: [compress] Need API Feedback/Advice for ZipArchiveOutputStream and ZIP64

Posted by Phil Steitz <ph...@gmail.com>.
On 8/8/11 11:29 PM, Stefan Bodewig wrote:
> On 2011-08-08, Phil Steitz wrote:
>
>> As a test of whether or not it will work, I would recommend writing
>> the javadoc and test cases first (which of course we all do anyway ;)
> Javadocs are there now with svn rev 1155223, in particular
> <http://svn.apache.org/viewvc/commons/proper/compress/trunk/src/main/java/org/apache/commons/compress/archivers/zip/ZipArchiveOutputStream.java?r1=1154667&r2=1155223>
>
> It would be good if anybody who didn't have to dive as deeply inside the
> ZIP format as myself could read it to ensure it makes some sense.
> Proof-reading by a native speaker certainly won't hurt either.

Looks clear enough, but it would be good to specify what happens
(i.e. what exception or failure mode happens) when the setting is
not valid.  Looks like only asNeeded needs a comment for this.

Phil
>
> Later I intend to add a similar blurb with recommendations to the ZIP
> page at the website.
>
> Stefan


Re: [compress] Need API Feedback/Advice for ZipArchiveOutputStream and ZIP64

Posted by Stefan Bodewig <bo...@apache.org>.
On 2011-08-08, Phil Steitz wrote:

> As a test of whether or not it will work, I would recommend writing
> the javadoc and test cases first (which of course we all do anyway ;)

Javadocs are there now with svn rev 1155223, in particular
<http://svn.apache.org/viewvc/commons/proper/compress/trunk/src/main/java/org/apache/commons/compress/archivers/zip/ZipArchiveOutputStream.java?r1=1154667&r2=1155223>

It would be good if anybody who didn't have to dive as deeply inside the
ZIP format as myself could read it to ensure it makes some sense.
Proof-reading by a native speaker certainly won't hurt either.

Later I intend to add a similar blurb with recommendations to the ZIP
page at the website.

Stefan



Re: [compress] Need API Feedback/Advice for ZipArchiveOutputStream and ZIP64

Posted by Phil Steitz <ph...@gmail.com>.
On 8/6/11 9:30 PM, Stefan Bodewig wrote:
> On 2011-08-06, Phil Steitz wrote:
>
>> On 8/5/11 9:40 PM, Stefan Bodewig wrote:
>>> This means ZipArchiveOutputStream must decide whether it is going to use
>>> the ZIP64 format before it knows whether it would actually need it or
>>> not.  If it signals it is going to use ZIP64 then an implementation that
>>> doesn't support ZIP64 (like Compress 1.2 or java.util.zip) may fail to
>>> read the archive, which is bad if the entry turns out to be smaller than
>>> 4GiB.  If it doesn't signal ZIP64 it can't write big entries at all.
>>> This decision can be made at the granularity of a single entry.  I.e. it
>>> is possible to not use ZIP64 for the majority of entries and enable it
>>> for individual entries.
>>> IMHO there is no right or wrong decision here that the library could
>>> make.  The user-code will have to decide whether ZIP64 should be enabled
>>> or not.  The main questions to me are whether we want to attach this
>>> decision to the stream or the entry itself and what the default should
>>> be.
>> Can you think of practical use cases where setting at the entry
>> level is needed?
> If there is a single entry that uses Zip64 features inside the archive
> then an implementation that doesn't support it is most likely going to
> choke on it anyway.  Compress 1.2 does.
>
> One - maybe contrived - use case could come up for one of the other
> combinations ZipArchiveOutputStream has to support.  Writing entries of
> unknown size to a seekable stream.  Here each entry gets 20 extra bytes
> compared to Compress 1.2 that you could avoid by turning off Zip64
> support in general and selectively turning it on for entries.  OTOH,
> this implies you'd at least know whether the size was smaller or bigger
> than 4GiB, which is not that likely if you don't know the exact size.
>
> So no, no compelling use case.
>
>>> InfoZIP's ZIP has decided to make it an option for the whole archive
>>> (the command line doesn't offer much flexibility here) and make it
>>> default to ZIP64.
>>> My current thinking is that java.util.zip is a likely candidate for the
>>> receiving end of ZIPs we create, so it may be better to turn ZIP64 off
>>> by default, but I'm not sure.
>>> I'm leaning towards adding a setUseZip64(boolean) method at the level of
>>> ZipArchiveOutputStream and make it default to false.  This method could
>>> be called in between putArchiveEntry calls to make it apply selectively
>>> to individual entries.
>> Sounds reasonable.
>>> The name is totally open for debate since as it stands it sounds as if
>>> you could turn off all Zip64 features which I wouldn't want to do for
>>> the cases that can be dealt with transparently.  Then again it could use
>>> a Boolean argument with "null" meaning "do the best you can" and false
>>> "don't even use Zip64 if you think it is safe".
>> I don't get what you mean by "do the best you can."  Does that mean
>> turn it on when needed if somehow you know it is needed, per entry,
>> I assume?
> Actually I was thinking about what the method would mean for the other
> combinations as well.  A "null" value doesn't make sense for the
> specific case I'm asking about - here we need to decide what the default
> should be: Zip64 or not.  For the other cases "null" could mean "use
> Zip64 if you think you need it" i.e. what is implemented right now,
> "true" could mean "always use it" and "false" could mean "never use it,
> throw an exception if you recognize it would be required".

I guess an alternative would be an enum with values "allowed,"
"always," and "never," which would work for the "other" cases; but
may be a little unnatural for the simple case above, where I guess
you would have to either throw on the "allowed" setting or view it
as synonymous with "always."  So probably best to do as you suggest
- Boolean with null meaningful in some combinations and not allowed
or meaningless in others.  Just make sure to clearly document how it
works.  As a test of whether or not it will work, I would recommend
writing the javadoc and test cases first (which of course we all do
anyway ;)

Phil
>
>> Libraries that try to be too smart tend to be hard on both users and
>> maintainers,
> Completely agreed.
>
> Stefan


Re: [compress] Need API Feedback/Advice for ZipArchiveOutputStream and ZIP64

Posted by Stefan Bodewig <bo...@apache.org>.
On 2011-08-06, Phil Steitz wrote:

> On 8/5/11 9:40 PM, Stefan Bodewig wrote:
>> This means ZipArchiveOutputStream must decide whether it is going to use
>> the ZIP64 format before it knows whether it would actually need it or
>> not.  If it signals it is going to use ZIP64 then an implementation that
>> doesn't support ZIP64 (like Compress 1.2 or java.util.zip) may fail to
>> read the archive, which is bad if the entry turns out to be smaller than
>> 4GiB.  If it doesn't signal ZIP64 it can't write big entries at all.

>> This decision can be made at the granularity of a single entry.  I.e. it
>> is possible to not use ZIP64 for the majority of entries and enable it
>> for individual entries.

>> IMHO there is no right or wrong decision here that the library could
>> make.  The user-code will have to decide whether ZIP64 should be enabled
>> or not.  The main questions to me are whether we want to attach this
>> decision to the stream or the entry itself and what the default should
>> be.

> Can you think of practical use cases where setting at the entry
> level is needed?

If there is a single entry that uses Zip64 features inside the archive
then an implementation that doesn't support it is most likely going to
choke on it anyway.  Compress 1.2 does.

One - maybe contrived - use case could come up for one of the other
combinations ZipArchiveOutputStream has to support.  Writing entries of
unknown size to a seekable stream.  Here each entry gets 20 extra bytes
compared to Compress 1.2 that you could avoid by turning off Zip64
support in general and selectively turning it on for entries.  OTOH,
this implies you'd at least know whether the size was smaller or bigger
than 4GiB, which is not that likely if you don't know the exact size.

So no, no compelling use case.

>> InfoZIP's ZIP has decided to make it an option for the whole archive
>> (the command line doesn't offer much flexibility here) and make it
>> default to ZIP64.

>> My current thinking is that java.util.zip is a likely candidate for the
>> receiving end of ZIPs we create, so it may be better to turn ZIP64 off
>> by default, but I'm not sure.

>> I'm leaning towards adding a setUseZip64(boolean) method at the level of
>> ZipArchiveOutputStream and make it default to false.  This method could
>> be called in between putArchiveEntry calls to make it apply selectively
>> to individual entries.

> Sounds reasonable.

>> The name is totally open for debate since as it stands it sounds as if
>> you could turn off all Zip64 features which I wouldn't want to do for
>> the cases that can be dealt with transparently.  Then again it could use
>> a Boolean argument with "null" meaning "do the best you can" and false
>> "don't even use Zip64 if you think it is safe".

> I don't get what you mean by "do the best you can."  Does that mean
> turn it on when needed if somehow you know it is needed, per entry,
> I assume?

Actually I was thinking about what the method would mean for the other
combinations as well.  A "null" value doesn't make sense for the
specific case I'm asking about - here we need to decide what the default
should be: Zip64 or not.  For the other cases "null" could mean "use
Zip64 if you think you need it" i.e. what is implemented right now,
"true" could mean "always use it" and "false" could mean "never use it,
throw an exception if you recognize it would be required".

> Libraries that try to be too smart tend to be hard on both users and
> maintainers,

Completely agreed.

Stefan



Re: [compress] Need API Feedback/Advice for ZipArchiveOutputStream and ZIP64

Posted by Phil Steitz <ph...@gmail.com>.
On 8/5/11 9:40 PM, Stefan Bodewig wrote:
> Hi,
>
> there are eight possible permutations of compressed/uncompressed entries
> that get written to a seekable-/non-seekable stream whose size is either
> known or unknown prior to writing them.
>
> One of them is prohibited (uncompressed/non-seekable/unknown size) and
> has been prohibited before, no change here.
>
> For six of the remaining seven permutations ZipArchiveOutputStream
> should be generating archives that transparently enable ZIP64 features
> for entries if and only if they are too big to be stored without ZIP64.
> I.e. the resulting archive will either be readable by an implementation
> that doesn't support ZIP64 or it contains files that would be too big
> for such an implementation anyway.  The price we pay in some cases is
> an additional 20 bytes per entry that are never used by anybody.
>
> The only case that isn't covered so far is compressed / non-seekable
> output / input of unknown size.
>
> Such entries are stored using a feature that is called the "data
> descriptor".  There are two different formats of the data descriptor for
> ZIP64 and non-ZIP64 archives and the archive writer has to signal which
> type of descriptor it is going to write before it starts writing the
> entry's data.
>
> This means ZipArchiveOutputStream must decide whether it is going to use
> the ZIP64 format before it knows whether it would actually need it or
> not.  If it signals it is going to use ZIP64 then an implementation that
> doesn't support ZIP64 (like Compress 1.2 or java.util.zip) may fail to
> read the archive, which is bad if the entry turns out to be smaller than
> 4GiB.  If it doesn't signal ZIP64 it can't write big entries at all.
>
> This decision can be made at the granularity of a single entry.  I.e. it
> is possible to not use ZIP64 for the majority of entries and enable it
> for individual entries.
>
> IMHO there is no right or wrong decision here that the library could
> make.  The user-code will have to decide whether ZIP64 should be enabled
> or not.  The main questions to me are whether we want to attach this
> decision to the stream or the entry itself and what the default should
> be.

Can you think of practical use cases where setting at the entry
level is needed?
>
> InfoZIP's ZIP has decided to make it an option for the whole archive
> (the command line doesn't offer much flexibility here) and make it
> default to ZIP64.
>
> My current thinking is that java.util.zip is a likely candidate for the
> receiving end of ZIPs we create, so it may be better to turn ZIP64 off
> by default, but I'm not sure.
>
> I'm leaning towards adding a setUseZip64(boolean) method at the level of
> ZipArchiveOutputStream and make it default to false.  This method could
> be called in between putArchiveEntry calls to make it apply selectively
> to individual entries.

Sounds reasonable.
>
> The name is totally open for debate since as it stands it sounds as if
> you could turn off all Zip64 features which I wouldn't want to do for
> the cases that can be dealt with transparently.  Then again it could use
> a Boolean argument with "null" meaning "do the best you can" and false
> "don't even use Zip64 if you think it is safe".

I don't get what you mean by "do the best you can."  Does that mean
turn it on when needed if somehow you know it is needed, per entry,
I assume?

Libraries that try to be too smart tend to be hard on both users and
maintainers, so IIUC what is going on here, I would recommend KISS -
a simple boolean property.

Phil
>
> Any ideas?
>
> Stefan


Re: [compress] Need API Feedback/Advice for ZipArchiveOutputStream and ZIP64

Posted by Stefan Bodewig <bo...@apache.org>.
On 2011-08-06, Torsten Curdt wrote:


>> The only case that isn't covered so far is compressed / non-seekable
>> output / input of unknown size.

> WTF

> How does the input stream distinguish between entry data and stream
> data if the length is not known up front?

The DEFLATE algorithm signals when the stream is terminated so the input
stream knows when it's done.  The data descriptor following the entry
data merely serves to verify the data that has just been read.
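
This can be tried with plain java.util.zip.  The sketch below
compresses some bytes as a raw DEFLATE stream, appends 16 placeholder
bytes standing in for a data descriptor, and inflates without being
told the compressed length.  Class and method names are made up for
the example.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Illustrative only: names are hypothetical.
class DeflateEndSketch {

    // Compresses data as a raw DEFLATE stream (nowrap = true, no zlib header).
    static byte[] deflate(byte[] data) {
        Deflater deflater = new Deflater(Deflater.DEFAULT_COMPRESSION, true);
        deflater.setInput(data);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[256];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.toByteArray();
    }

    // Inflates until the stream reports its own end, writes the decompressed
    // bytes to sink and returns how many input bytes were left unread.
    static int inflateAndCountRest(byte[] input, ByteArrayOutputStream sink) {
        Inflater inflater = new Inflater(true);
        inflater.setInput(input);
        byte[] buf = new byte[256];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalStateException("input ended before the DEFLATE stream did");
                }
                sink.write(buf, 0, n);
            }
            return inflater.getRemaining();
        } catch (DataFormatException e) {
            throw new IllegalStateException(e);
        } finally {
            inflater.end();
        }
    }

    public static void main(String[] args) {
        byte[] payload = "entry data of unknown size".getBytes(StandardCharsets.US_ASCII);
        byte[] compressed = deflate(payload);
        // Append 16 placeholder bytes where a data descriptor would follow.
        byte[] onTheWire = Arrays.copyOf(compressed, compressed.length + 16);
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        int rest = inflateAndCountRest(onTheWire, sink);
        System.out.println("bytes left after the entry data: " + rest);
    }
}
```

The unread bytes are where a reader would then expect the data
descriptor.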

This already works for "normal ZIP", we just need to support the
different formats of data descriptors properly.

Actually ZipArchiveInputStream even supports non-compressed entries of
unknown size if you ask it to - but it is highly recommended that you
don't.  It does so by assuming that whenever it encounters anything that
looks like a "data descriptor", a "local file header" or a "central
directory entry" then it must have reached the next entry.  This is
neither reliable nor does it perform well, but it is required to support
some archives found in the wild (some Microsoft XPS documents IIRC).
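
To illustrate why this is unreliable, here is a sketch of such a scan.
It is not the code in ZipArchiveInputStream, just a minimal search for
the three record signatures with hypothetical names.

```java
// Illustrative only: not the ZipArchiveInputStream implementation.
class SignatureScanSketch {

    // Returns the index of the first four bytes at or after 'from' that look
    // like a ZIP record signature, or -1 if there are none.
    static int guessEntryEnd(byte[] data, int from) {
        for (int i = Math.max(from, 0); i + 4 <= data.length; i++) {
            if (data[i] != 'P' || data[i + 1] != 'K') {
                continue;
            }
            int a = data[i + 2];
            int b = data[i + 3];
            boolean localFileHeader = a == 3 && b == 4;    // PK\003\004
            boolean centralDirectory = a == 1 && b == 2;   // PK\001\002
            boolean dataDescriptor = a == 7 && b == 8;     // PK\007\010
            if (localFileHeader || centralDirectory || dataDescriptor) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Four bytes of stored data followed by something that looks like a data descriptor.
        byte[] stored = {'d', 'a', 't', 'a', 'P', 'K', 7, 8, 0, 0, 0, 0};
        System.out.println("guessed end of entry data: " + guessEntryEnd(stored, 0));
    }
}
```

If the stored data itself contains one of these byte sequences, for
example because the entry is another ZIP archive, the scan stops too
early, and every byte of the entry has to be inspected along the way.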

Stefan



Re: [compress] Need API Feedback/Advice for ZipArchiveOutputStream and ZIP64

Posted by Torsten Curdt <tc...@vafer.org>.
> The only case that isn't covered so far is compressed / non-seekable
> output / input of unknown size.

WTF

How does the input stream distinguish between entry data and stream
data if the length is not known up front?

cheers,
Torsten
