Posted to dev@commons.apache.org by Stefan Bodewig <bo...@apache.org> on 2011/08/04 16:48:26 UTC

[compress] ZIP64: API imposed limits vs limits of the format

Hi all,

ZIP64 support in trunk is almost complete except for a case that is
pretty easy to implement but where I'll ask for API feedback once I've
managed to read up on how the InfoZIP people handle it.

There are a few places where our implementation doesn't allow for the
full range the ZIP format would support.  Some are easy to fix, some
hard and I'm asking for feedback whether you consider it worth the
effort to fix them at all.

OK, here we go.

Total Size of the archive
=========================

There is no formal limit inside the format, in particular since ZIP
archives can be split into multiple pieces.  For each individual piece
the last "local file header" cannot have an offset of more than 2^64-1
bytes from the start of the file.

We don't support split archives at all, so the archive is limited to
whatever fits into a single file.

ZipArchiveInputStream should work on archives of arbitrary size.

ZipFile relies on RandomAccessFile so any archive can't be bigger than
the maximum size supported by RandomAccessFile.  In particular the seek
method expects a long as argument so the hard limit would be an archive
size of 2^63-1 bytes.  In practice I expect RandomAccessFile to not
support files that big on many platforms.
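
Just to make the constraint concrete: the position passed to
RandomAccessFile is a plain signed long, so nothing beyond 2^63-1 can
even be addressed.  A minimal sketch (the file name is made up):

    import java.io.IOException;
    import java.io.RandomAccessFile;

    public class SeekLimit {
        public static void main(String[] args) throws IOException {
            try (RandomAccessFile archive = new RandomAccessFile("huge.zip", "r")) {
                // seek(long) caps the addressable offset at Long.MAX_VALUE,
                // i.e. 2^63-1 bytes - the hard limit mentioned above.
                archive.seek(Long.MAX_VALUE);
            }
        }
    }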

This is a "hard" case IMHO, I don't see how we could implement ZipFile
without RandomAccessFile in any efficient way.

ZipArchiveOutputStream has two modes.  If it writes to a file it will
use RandomAccessFile internally; otherwise it writes to a stream.  In
file mode the same limits apply as for ZipFile.

For the streaming mode offsets are currently stored as longs but that
could be changed to BigIntegers easily so we could reach 2^64-1 at the
expense of memory consumption and maybe even some performance issues
(the offsets are not really used in calculations so I don't expect any
major impact).
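
A rough sketch of what that could look like (illustrative only, the
class and method names are made up, this is not the current
ZipArchiveOutputStream code):

    import java.math.BigInteger;

    // Offsets kept as BigInteger so the full unsigned 64 bit range of ZIP64
    // is representable.  The counter is only ever added to and later copied
    // into the central directory, no heavy arithmetic is done with it.
    class StreamOffsetTracker {
        private BigInteger written = BigInteger.ZERO;

        void count(int bytesWritten) {
            written = written.add(BigInteger.valueOf(bytesWritten));
        }

        BigInteger currentOffset() {
            return written;
        }
    }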

Size of an individual entry (compressed or not)
===============================================

The format supports an unsigned 64 bit integer as size while
ArchiveEntry's get/setSize methods use a signed long - this means the
API covers only half of the format's range.

We could easily add an additional setter/getter for size that uses
BigInteger, the infrastructure to support it would be there.  OTOH it is
questionable whether we'd support anything > Long.MAX_VALUE in practice
because of the previous point anyway.
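
Such a pair could look roughly like this (hypothetical method names,
not part of the current ZipArchiveEntry API):

    import java.math.BigInteger;

    // Hypothetical sketch: a size accessor pair covering the full unsigned
    // 64 bit range of the format instead of stopping at Long.MAX_VALUE.
    class EntrySizeSketch {
        private BigInteger size = BigInteger.ZERO;

        BigInteger getSizeBig() {
            return size;
        }

        void setSizeBig(BigInteger size) {
            if (size.signum() < 0 || size.bitLength() > 64) {
                throw new IllegalArgumentException(
                    "size must fit into an unsigned 64 bit field");
            }
            this.size = size;
        }
    }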

Number of file entries in the archive
======================================

This used to be an unsigned 16 bit integer and has grown to an
unsigned 64 bit integer with ZIP64.

ZipArchiveInputStream should work with arbitrarily many entries.

ZipArchiveOutputStream uses a LinkedList to store all entries as it has
to keep track of the metadata in order to write the central directory.
It also uses an additional HashMap that could be removed easily by
storing the data together with the entries themselves.  LinkedList won't
allow more than Integer.MAX_VALUE entries which leaves us quite a bit
away from the theoretical limit of the format.

I'm confident that even I would manage to write an efficient singly
linked list that is only ever appended to and that is iterated over
exactly once from head to tail.  I'd even manage to keep track of the
size inside a long or BigInteger (if deemed necessary) in a O(1)
operation ;-)
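
Something along these lines, just to illustrate the idea (a minimal
sketch, not meant as production code):

    import java.util.Iterator;
    import java.util.NoSuchElementException;

    // Append-only singly linked list: entries are only ever added at the
    // tail and read back once, head to tail, when the central directory is
    // written.  Size is tracked in a long; a BigInteger would work the same.
    class EntryList<T> implements Iterable<T> {
        private static final class Node<T> {
            final T value;
            Node<T> next;
            Node(T value) { this.value = value; }
        }

        private Node<T> head;
        private Node<T> tail;
        private long size;

        void append(T value) {
            Node<T> node = new Node<>(value);
            if (tail == null) {
                head = node;
            } else {
                tail.next = node;
            }
            tail = node;
            size++;                     // the promised O(1) bookkeeping
        }

        long size() {
            return size;
        }

        @Override
        public Iterator<T> iterator() { // single forward pass, head to tail
            return new Iterator<T>() {
                private Node<T> current = head;

                @Override
                public boolean hasNext() {
                    return current != null;
                }

                @Override
                public T next() {
                    if (current == null) {
                        throw new NoSuchElementException();
                    }
                    T value = current.value;
                    current = current.next;
                    return value;
                }
            };
        }
    }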

So ZipArchiveOutputStream could easily be fixed if we wanted to.
Whether it is worth the effort is a different question as long as the
archive itself is still limited to a single "disk".

ZipFile is a totally different beast.  It contains several maps
internally and I don't really see how to implement things like

           ZipArchiveEntry getEntry(String name)

efficiently without a map.  I don't see myself writing an efficient map
with a capacity of Long.MAX_VALUE or bigger, either.

And even if we had one, there'd still be the "archive size" limit.

We could stick with documenting the limits of ZipFile properly.  In
practice I doubt many people will have to deal with archives of 2^63
bytes or more.  And even archives with 2^32 entries or more should be
rare - in which case people could fall back to ZipArchiveInputStream.

Stefan



Re: [compress] ZIP64: API imposed limits vs limits of the format

Posted by Stefan Bodewig <bo...@apache.org>.
On 2011-08-04, Lasse Collin wrote:

> On 2011-08-04 Stefan Bodewig wrote:
>> There are a few places where our implementation doesn't allow for the
>> full range the ZIP format would support.  Some are easy to fix, some
>> hard and I'm asking for feedback whether you consider it worth the
>> effort to fix them at all.

> I guess that these are enough for the foreseeable future:

>     Max archive size:             Long.MAX_VALUE
>     Max size of individual entry: Long.MAX_VALUE
>     Max number of file entries:   Integer.MAX_VALUE

This is what we currently have in all three classes;
ZipArchiveInputStream can go beyond that in all three cases in theory
(for the individual entry we'd need to store sizes in ZipArchiveEntry as
BigInteger but no other change would be required).

> Java APIs don't support bigger files and I guess that files that big
> won't be common even if file system sizes allowed them. If you write
> ten terabytes per second, it will still take well over a week to
> create an archive of 2^63-1 bytes.

8-)

> I don't know how much memory one file entry needs, but let's assume
> it takes only 50 bytes, including the overhead of the linked list
> etc. Keeping a list of 2^31-1 files will then need 100 GiB of RAM.

As the code currently stands a single entry is likely way bigger than 50
bytes.  Even if we tried to keep only the data that is needed for the
central directory and store that as raw bytes rather than Java objects a
single central directory entry would still need the file name plus at
least about twenty-five bytes of fixed overhead.
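
Putting a number on that (back-of-the-envelope; the 25 bytes of
overhead and the 20 byte file name are assumptions, not measurements):

    import java.math.BigInteger;

    public class CentralDirectoryEstimate {
        public static void main(String[] args) {
            // 2^31 - 1 entries, each roughly 25 bytes of fixed overhead plus
            // an assumed 20 byte file name.
            BigInteger entries = BigInteger.valueOf(2).pow(31)
                    .subtract(BigInteger.ONE);
            BigInteger bytesPerEntry = BigInteger.valueOf(25 + 20);
            System.out.println(entries.multiply(bytesPerEntry));
            // prints 96636764115, i.e. roughly 90 GiB for the raw central
            // directory data alone
        }
    }

So even a slimmed-down representation ends up in the same ballpark as
your estimate.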

This doesn't look as if it was worth implementing a linked list; I
should have calculated the total memory needed myself, I guess.

> Even if the number of files is limited to Integer.MAX_VALUE, it can be
> good to think about the memory usage of the data structures used for
> the file entries.

True, I'll think about what I currently store inside the linked list in
ZipArchiveOutputStream a bit more.  Inside of ZipFile it really doesn't
make much sense to store the data in any other way than ZipArchiveEntry
objects which are not really optimized for size (and inherit from
java.util.zip.ZipEntry so our hands are pretty much tied).

Stefan



Re: [compress] ZIP64: API imposed limits vs limits of the format

Posted by Lasse Collin <la...@tukaani.org>.
On 2011-08-04 Stefan Bodewig wrote:
> There are a few places where our implementation doesn't allow for the
> full range the ZIP format would support.  Some are easy to fix, some
> hard and I'm asking for feedback whether you consider it worth the
> effort to fix them at all.

I guess that these are enough for the foreseeable future:

    Max archive size:             Long.MAX_VALUE
    Max size of individual entry: Long.MAX_VALUE
    Max number of file entries:   Integer.MAX_VALUE

Java APIs don't support bigger files and I guess that files that big
won't be common even if file system sizes allowed them. If you write
ten terabytes per second, it will still take well over a week to
create an archive of 2^63-1 bytes.

I don't know how much memory one file entry needs, but let's assume
it takes only 50 bytes, including the overhead of the linked list
etc. Keeping a list of 2^31-1 files will then need 100 GiB of RAM.
While it might be OK in some situations, I hope such archives won't
become common. ;-) Even if the number of files is limited to
Integer.MAX_VALUE, it can be good to think about the memory usage
of the data structures used for the file entries.

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode



Re: [compress] ZIP64: API imposed limits vs limits of the format

Posted by Stefan Bodewig <bo...@apache.org>.
On 2011-08-04, Torsten Curdt wrote:

>> ZipFile relies on RandomAccessFile so any archive can't be bigger than
>> the maximum size supported by RandomAccessFile.  In particular the seek
>> method expects a long as argument so the hard limit would be an archive
>> size of 2^63-1 bytes.  In practice I expect RandomAccessFile to not
>> support files that big on many platforms.

> Yeah ... let's cross that bridge when people complain ;)

With that I can certainly live.

>> For the streaming mode offsets are currently stored as longs but that
>> could be changed to BigIntegers easily so we could reach 2^64-1 at the
>> expense of memory consumption and maybe even some performance issues
>> (the offsets are not really used in calculations so I don't expect any
>> major impact).

> No insights on the implementation but that might be worth changing so
> it's in line with the ZipFile impl

ZipFile is already limited to longs via RandomAccessFile.

>> I'm confident that even I would manage to write an efficient singly
>> linked list that is only ever appended to and that is iterated over
>> exactly once from head to tail.

> +1 for that then :)

Lasse's post showing that I'd need 100+ GB of RAM to take advantage of
my bigger LinkedList made me drop that plan 8-)

If anybody is really dealing with archives that big they likely don't
use Commons Compress, and if they do, then support for archives split
into multiple files might be more important.

Stefan



Re: [compress] ZIP64: API imposed limits vs limits of the format

Posted by Torsten Curdt <tc...@vafer.org>.
> ZipFile relies on RandomAccessFile so any archive can't be bigger than
> the maximum size supported by RandomAccessFile.  In particular the seek
> method expects a long as argument so the hard limit would be an archive
> size of 2^63-1 bytes.  In practice I expect RandomAccessFile to not
> support files that big on many platforms.

Yeah ... let's cross that bridge when people complain ;)

> For the streaming mode offsets are currently stored as longs but that
> could be changed to BigIntegers easily so we could reach 2^64-1 at the
> expense of memory consumption and maybe even some performance issues
> (the offsets are not really used in calculations so I don't expect any
> major impact).

No insights on the implementation, but that might be worth changing so
it's in line with the ZipFile impl.

> Size of an individual entry (compressed or not)
> ===============================================
>
> The format supports an unsigned 64 bit integer as size while
> ArchiveEntry's get/setSize methods use a signed long - this means the
> API covers only half of the format's range.
>
> We could easily add an additional setter/getter for size that uses
> BigInteger, the infrastructure to support it would be there.  OTOH it is
> questionable whether we'd support anything > Long.MAX_VALUE in practice
> because of the previous point anyway.

Especially as this is also just for one individual entry. Again - I think
I would not bother at this stage.
Nothing that cannot be added later.

> Number of file entries in the archive
> ======================================
>
> This used to be an unsigned 16 bit integer and has grown to an
> unsigned 64 bit integer with ZIP64.
>
> ZipArchiveInputStream should work with arbitrarily many entries.
>
> ZipArchiveOutputStream uses a LinkedList to store all entries as it has
> to keep track of the metadata in order to write the central directory.
> It also uses an additional HashMap that could be removed easily by
> storing the data together with the entries themselves.  LinkedList won't
> allow more than Integer.MAX_VALUE entries which leaves us quite a bit
> away from the theoretical limit of the format.

Hmmm.

> I'm confident that even I would manage to write an efficient singly
> linked list that is only ever appended to and that is iterated over
> exactly once from head to tail.

+1 for that then :)

> I don't see myself writing an efficient map
> with a capacity of Long.MAX_VALUE or bigger, either.

There must be something like that out there already.
Otherwise it could be another nice addition to Collections ;)

> We could stick with documenting the limits of ZipFile properly.  In
> practice I doubt many people will have to deal with archives of 2^63
> bytes or more.  And even archives with 2^32 entries or more should be
> rare - in which case people could fall back to ZipArchiveInputStream.

Hm. Yeah ... maybe just get it out before we start implementing new
collection classes.

Cool stuff!!

cheers,
Torsten
